Natural Language Processing
Uses
- Text Classification
- Used for filtering information in web search
- Helps to avoid spam mail
- Sentiment Analysis
- Identify opinions & sentiments of audience
- Chatbots
- Used for customer support
- Used in HR systems
- Used in e-commerce systems
- Customer service
- Insights into audience preferences
- Helps improve customer satisfaction
- Advertisement
- Helps target the right customers
Tokenization
- Process of breaking up text into smaller pieces (tokens)
- A token can be a word or a sentence
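A minimal sketch of word and sentence tokenization using only regular expressions (in practice a library tokenizer such as NLTK's is more robust; the function names here are illustrative):

```python
import re

def word_tokenize_simple(text):
    """Split text into word tokens (a naive regex-based sketch)."""
    return re.findall(r"[A-Za-z']+", text)

def sent_tokenize_simple(text):
    """Split text into sentence tokens on ., ! or ?."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

words = word_tokenize_simple("NLP breaks text into tokens.")
sentences = sent_tokenize_simple("First sentence. Second one!")
```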
Stop Words
- Words like an, a, when, which don't convey actual meaning
Parts of Speech (POS) Tagging
- Tags nouns, verbs, adjectives etc.
Stemming
- Process of reducing a word to its root, i.e. taking the stem
Lemmatization
- Process of reducing a word to its root in dictionary form (lemma)
Named Entity Recognition (NER)
- Recognizes entities like people, organizations, places etc.
Text Pre-processing
- Convert to lower case
- Perform stemming and lemmatization
- Remove stop words
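The pre-processing steps above can be sketched as a small pipeline. The stop-word list and the suffix-stripping stemmer here are deliberately tiny illustrations; real code would use NLTK's stopwords corpus and PorterStemmer:

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "when", "and", "to"}  # tiny illustrative list

def naive_stem(word):
    """Very naive suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    tokens = sentence.lower().split()                    # convert to lower case, tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]               # stemming
```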
Bag of Words
- Build a histogram of all the words across sentences, mapping each word to its frequency
- Sort the histogram by frequency in descending order
- Convert the histogram representation into a matrix/vectors with words as features/columns and sentences as rows
- Disadvantages:
  - All words have equal weightage, which can be overcome by TF-IDF
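The bag-of-words steps above can be sketched in a few lines of plain Python (sklearn's CountVectorizer is the usual tool; the sentences here are made up):

```python
from collections import Counter

sentences = ["good movie", "not a good movie", "a good good film"]

# Histogram of all words across sentences, sorted by frequency descending
histogram = Counter(word for s in sentences for word in s.split())
vocab = [word for word, _ in histogram.most_common()]

# One row per sentence, one column per word: the bag-of-words matrix
matrix = [[s.split().count(word) for word in vocab] for s in sentences]
```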
- Term Frequency - Inverse Document Frequency
- It measures how important a word is to a document in a collection of documents
- TF is a vector similar to bag of words, but the values are replaced by term frequency instead of 0 or 1
- TF= (Number of times the term appears in a document/statement)/(Total number of terms in a document/statement)
- IDF = log((Total number of documents/sentences)/(Number of documents/sentences containing the word))
- IDF is a vector similar to bag of words, but the values are replaced with IDF values
- TF-IDF = TF vector * IDF vector (element-wise product)
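The TF and IDF formulas above translate directly into code. A minimal sketch with a made-up corpus of pre-tokenized documents (sklearn's TfidfVectorizer is what you'd use in practice, though its smoothing differs slightly from these raw formulas):

```python
import math

documents = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
vocab = sorted({w for doc in documents for w in doc})

def tf(term, doc):
    # TF = occurrences of the term in the document / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log(total number of documents / number of documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

# One TF-IDF vector per document: element-wise product of TF and IDF
tfidf = [[tf(w, doc) * idf(w, documents) for w in vocab] for doc in documents]
```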
- One-hot representation - representing a word as a vector - if the word is in, say, the nth position of the dictionary, then we represent it as a single-column matrix/vector where all values are 0 except at the nth position. These matrices are high-dimensional & sparse in nature; the dimension equals the vocabulary size.
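A minimal sketch of one-hot encoding against an assumed tiny vocabulary (real vocabularies run to tens of thousands of words, which is why these vectors are so sparse):

```python
vocabulary = ["apple", "banana", "cherry", "date"]  # assumed tiny vocabulary

def one_hot(word, vocab):
    """Vector of zeros with a 1 at the word's index; dimension = vocabulary size."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec
```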
- Feature representation - represent/rank the word against features like Gender, Royal, Age, Food... on the x-axis
Word Embeddings
- Convert the sentence into one hot representation considering some vocabulary size.
- Pass the value to Embedding Layer and provide the dimension of the features. This is also called feature representation of the word.
- Cosine similarity - the similarity between two words is measured as the cosine of the angle between their feature-representation vectors (their dot product divided by the product of their magnitudes)
- Word2vec
- Glove
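The feature representation and cosine similarity above can be sketched as follows. The embedding values are invented purely for illustration (real vectors would come from a trained model such as Word2vec or GloVe); the features are meant to evoke the Gender/Royal/Age/Food axes mentioned earlier:

```python
import math

# Hypothetical feature representations (columns: Gender, Royal, Age, Food)
embeddings = {
    "king":  [-0.95, 0.93, 0.70, 0.02],
    "queen": [ 0.97, 0.95, 0.69, 0.01],
    "apple": [ 0.00, 0.01, 0.03, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Related words such as "king" and "queen" share most feature values, so their vectors point in similar directions and score higher than unrelated pairs like "king" and "apple".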
Sentiment Analysis
- The process of computationally classifying and categorizing opinions expressed in a piece of text
- It helps to understand the writer's opinion about a topic, event, product and so on.
- Packages - Keras, TensorFlow
- Pre-Process the data set
- Make the review texts a consistent length with respect to the number of words (pad or truncate)
- Word Embedding
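The length-normalization step above can be sketched in plain Python. This mimics the spirit of Keras' pad_sequences (which pre-pads by default); the function name and pad value here are illustrative:

```python
def pad_sequences_simple(sequences, maxlen, pad_value=0):
    """Left-pad or truncate each token-id sequence to exactly maxlen entries."""
    out = []
    for seq in sequences:
        seq = seq[-maxlen:]                                   # truncate, keeping the last maxlen tokens
        out.append([pad_value] * (maxlen - len(seq)) + seq)   # left-pad with the pad value
    return out

padded = pad_sequences_simple([[1, 2], [1, 2, 3, 4, 5]], maxlen=4)
```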
Packages
- numpy
- sklearn
- nltk (PorterStemmer)