Natural Language Processing

Uses
  • Text Classification
    • Used for filtering information in web search
    • Helps to avoid spam mail
  • Sentiment Analysis
    • Identify opinions & sentiments of audience
  • Chatbots
    • Used for customer support
    • Used in HR systems
    • Used in e-commerce systems
  • Customer service
    • Insights into audience preferences
    • Helps improve customer satisfaction
  • Advertisement
    • Helps target right customers
Tokenization
  • Process of breaking up text into smaller pieces (tokens).
  • A token can be a word or a sentence
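A minimal sketch of word tokenization, using a simple regex split (real tokenizers such as NLTK's word_tokenize handle punctuation, contractions, and edge cases much more carefully):

```python
import re

def tokenize(text):
    # Lowercase, then pull out runs of letters/digits/apostrophes as tokens.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

tokens = tokenize("NLP breaks text into smaller pieces, called tokens.")
# -> ['nlp', 'breaks', 'text', 'into', 'smaller', 'pieces', 'called', 'tokens']
```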
Stop Words
  • Words like a, an, when, which don't convey much meaning on their own and are usually removed
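Removing stop words is a simple filter step; the tiny stop-word set below is illustrative (NLTK ships a much larger list per language):

```python
STOP_WORDS = {"a", "an", "the", "when", "is", "to", "of"}  # tiny illustrative set

def remove_stop_words(tokens):
    # Keep only the tokens that carry standalone meaning.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

remove_stop_words(["when", "a", "model", "reads", "the", "text"])
# -> ['model', 'reads', 'text']
```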
Part of speech (POS)
  • Tags words as nouns, verbs, adjectives, etc.
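A toy suffix-rule tagger to show the idea; real POS taggers (e.g. nltk.pos_tag) are statistical models trained on annotated corpora, not hand-written rules like these:

```python
def naive_pos_tag(tokens):
    # Purely illustrative suffix heuristics; default everything else to NOUN.
    tags = []
    for t in tokens:
        if t.endswith("ly"):
            tags.append((t, "ADV"))
        elif t.endswith(("ing", "ed")):
            tags.append((t, "VERB"))
        else:
            tags.append((t, "NOUN"))
    return tags

naive_pos_tag(["dog", "barked", "loudly"])
# -> [('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```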
Stemming
  • Process of reducing a word to its root (stem) by stripping suffixes; the result may not be a dictionary word
Lemmatization
  • Process of reducing a word to its root in dictionary form (its lemma)
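The stem/lemma contrast in a minimal sketch: crude suffix stripping (real stemmers such as PorterStemmer apply staged rules) versus a dictionary lookup (the tiny lemma table here is hand-made for illustration):

```python
def naive_stem(word):
    # Strip a common suffix; the result need not be a real word.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lemmatization needs vocabulary knowledge; a tiny hand-made table here.
LEMMAS = {"ran": "run", "better": "good", "studies": "study"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

naive_stem("studies")       # -> 'studi' (stem, not a dictionary word)
naive_lemmatize("studies")  # -> 'study' (lemma, dictionary form)
```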
Named Entity Recognition
  • Recognize entities like people, organizations, places, etc.
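A gazetteer-style (lookup-table) sketch of NER; real NER systems (spaCy, NLTK) use trained sequence models rather than fixed lists, so the entity table below is purely illustrative:

```python
# Hypothetical lookup table mapping known surface forms to entity types.
ENTITIES = {"london": "PLACE", "google": "ORGANIZATION", "alice": "PERSON"}

def tag_entities(tokens):
    # Return (token, entity_type) for every token found in the table.
    return [(t, ENTITIES[t.lower()]) for t in tokens if t.lower() in ENTITIES]

tag_entities(["Alice", "joined", "Google", "in", "London"])
# -> [('Alice', 'PERSON'), ('Google', 'ORGANIZATION'), ('London', 'PLACE')]
```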
Bag of words
  • Convert to lower case
  • Perform stemming and lemmatization
  • Remove stop words
  • Build a histogram of all words across sentences, mapping each word to its frequency
  • Sort the histogram by frequency, descending
  • Convert the histogram representation into vectors/a matrix, with words as features/columns and sentences as rows
  • Disadvantages:
    • All words carry equal weight, which can be overcome by TF-IDF
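The steps above can be sketched in a few lines of plain Python (sklearn's CountVectorizer does the same job in practice; stemming/lemmatization is skipped here for brevity):

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is"}

def bag_of_words(sentences):
    # Lowercase, remove stop words, build a frequency histogram,
    # sort it descending, then vectorize each sentence against it.
    tokenized = [[w for w in s.lower().split() if w not in STOP_WORDS]
                 for s in sentences]
    histogram = Counter(w for tokens in tokenized for w in tokens)
    vocab = [w for w, _ in histogram.most_common()]  # columns, by frequency desc
    matrix = [[tokens.count(w) for w in vocab] for tokens in tokenized]
    return vocab, matrix

vocab, matrix = bag_of_words(["the dog chased the cat", "the cat slept"])
# vocab  -> ['cat', 'dog', 'chased', 'slept']
# matrix -> [[1, 1, 1, 0], [1, 0, 0, 1]]
```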

TF-IDF
  • Term Frequency - Inverse Document Frequency
  • It measures how important a word is to a document within a collection of documents
  • TF is a vector similar to bag of words, but the 0/1 values are replaced by term frequencies
  • TF = (number of times the term appears in a document/sentence) / (total number of terms in that document/sentence)
  • IDF = log((total number of documents/sentences) / (number of documents/sentences containing the word))
  • IDF is a vector similar to bag of words, but the values are replaced with IDF values
  • TF-IDF = TF vector * IDF vector (element-wise)
The most important thing in NLP is text preprocessing: converting word data into a vector representation so that an algorithm can generalize over these words for prediction, sentence generation, and so on.
  • One-hot representation - represents a word as a vector: if the word is at the nth position of the dictionary, the vector has a 1 at the nth index and 0 everywhere else. These vectors are high-dimensional and sparse; the dimension equals the vocabulary size.
  • Feature representation - represent/rank the word against features like gender, royal, age, food, etc. on the x axis
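One-hot encoding as described above, in a short sketch (the five-word vocabulary is just for illustration):

```python
def one_hot(word, vocab):
    # Sparse vector of vocabulary size with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["king", "queen", "apple", "man", "woman"]
one_hot("apple", vocab)  # -> [0, 0, 1, 0, 0]
```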
 
Word Embeddings
  • Convert the sentence into one hot representation considering some vocabulary size.
  • Pass the value to Embedding Layer and provide the dimension of the features. This is also called feature representation of the word.
  • Cosine similarity - the similarity between two feature representation vectors is the cosine of the angle between them: their dot product divided by the product of their magnitudes.
  • Word2vec
  • Glove
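Cosine similarity over feature representations, sketched below; the feature vectors (gender, royal, age, food) are made-up values for illustration, not real Word2vec/GloVe embeddings:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); near 1 means the vectors
    # point in nearly the same direction, i.e. similar words.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors: (gender, royal, age, food)
king  = [-0.95, 0.93, 0.70, 0.02]
queen = [0.97, 0.95, 0.69, 0.01]
apple = [0.00, -0.01, 0.03, 0.95]

# king is closer to queen than to apple in this feature space
cosine_similarity(king, queen) > cosine_similarity(king, apple)  # -> True
```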


Sentiment Analysis
  • The process of computationally classifying and categorizing opinions expressed in a piece of text
  • It helps to understand the writer's opinion about a topic, event, product, and so on.
    • Packages - Keras, TensorFlow
    • Pre-Process the data set 
      • Keep the review text length consistent in terms of word count
    • Word Embedding
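The "consistent review length" preprocessing step can be sketched without Keras; the function below truncates long reviews and left-pads short ones with zeros, mirroring what keras pad_sequences does by default:

```python
def pad_sequences(token_id_lists, maxlen, pad_value=0):
    # Truncate from the front (keep the last maxlen tokens),
    # then left-pad short sequences with pad_value.
    padded = []
    for seq in token_id_lists:
        seq = seq[-maxlen:]
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded

pad_sequences([[4, 7, 9], [12, 5, 8, 3, 1]], maxlen=4)
# -> [[0, 4, 7, 9], [5, 8, 3, 1]]
```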
NLTK & Related Packages
  • numpy
  • sklearn
  • PorterStemmer (from nltk.stem)
