Natural Language Processing

Uses
  • Text Classification
    • Used for filtering information in web search
    • Helps to avoid spam mail
  • Sentiment Analysis
    • Identify opinions & sentiments of audience
  • Chatbots
    • Used for customer support
    • Used in HR systems
    • Used in e-commerce systems
  • Customer service
    • Insights into audience preferences
    • Helps improve customer satisfaction
  • Advertisement
    • Helps target right customers
Tokenization
  • Process of breaking up text into smaller pieces (tokens).
  • A token can be a word or a sentence
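A minimal sketch of word tokenization, using a simple regex split (real tokenizers such as NLTK's word_tokenize handle punctuation, contractions, and edge cases much more carefully):

```python
import re

def tokenize(text):
    # Lowercase, then pull out runs of letters/digits/apostrophes as tokens.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

tokens = tokenize("NLP breaks text into smaller pieces, called tokens.")
# -> ['nlp', 'breaks', 'text', 'into', 'smaller', 'pieces', 'called', 'tokens']
```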
Stop Words
  • Words like a, an, when, which don't convey much meaning on their own and are usually removed
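Removing stop words is a simple filter step; the tiny stop-word set below is illustrative (NLTK ships a much larger list per language):

```python
STOP_WORDS = {"a", "an", "the", "when", "is", "to", "of"}  # tiny illustrative set

def remove_stop_words(tokens):
    # Keep only the tokens that carry standalone meaning.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

remove_stop_words(["when", "a", "model", "reads", "the", "text"])
# -> ['model', 'reads', 'text']
```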
Part of speech (POS)
  • Tags words as nouns, verbs, adjectives, etc.
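A toy suffix-rule tagger to show the idea; real POS taggers (e.g. nltk.pos_tag) are statistical models trained on annotated corpora, not hand-written rules like these:

```python
def naive_pos_tag(tokens):
    # Purely illustrative suffix heuristics; default everything else to NOUN.
    tags = []
    for t in tokens:
        if t.endswith("ly"):
            tags.append((t, "ADV"))
        elif t.endswith(("ing", "ed")):
            tags.append((t, "VERB"))
        else:
            tags.append((t, "NOUN"))
    return tags

naive_pos_tag(["dog", "barked", "loudly"])
# -> [('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```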
Stemming
  • Process of reducing a word to its root (stem) by stripping suffixes; the result may not be a dictionary word
Lemmatization
  • Process of reducing a word to its root in dictionary form (its lemma)
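The stem/lemma contrast in a minimal sketch: crude suffix stripping (real stemmers such as PorterStemmer apply staged rules) versus a dictionary lookup (the tiny lemma table here is hand-made for illustration):

```python
def naive_stem(word):
    # Strip a common suffix; the result need not be a real word.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lemmatization needs vocabulary knowledge; a tiny hand-made table here.
LEMMAS = {"ran": "run", "better": "good", "studies": "study"}

def naive_lemmatize(word):
    return LEMMAS.get(word, word)

naive_stem("studies")       # -> 'studi' (stem, not a dictionary word)
naive_lemmatize("studies")  # -> 'study' (lemma, dictionary form)
```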
Named Entity Recognition
  • Recognize entities like people, organizations, places, etc.
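A gazetteer-style (lookup-table) sketch of NER; real NER systems (spaCy, NLTK) use trained sequence models rather than fixed lists, so the entity table below is purely illustrative:

```python
# Hypothetical lookup table mapping known surface forms to entity types.
ENTITIES = {"london": "PLACE", "google": "ORGANIZATION", "alice": "PERSON"}

def tag_entities(tokens):
    # Return (token, entity_type) for every token found in the table.
    return [(t, ENTITIES[t.lower()]) for t in tokens if t.lower() in ENTITIES]

tag_entities(["Alice", "joined", "Google", "in", "London"])
# -> [('Alice', 'PERSON'), ('Google', 'ORGANIZATION'), ('London', 'PLACE')]
```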
Bag of words
  • Convert to lower case
  • Perform stemming and lemmatization
  • Remove stop words
  • Build a histogram of all words across sentences, mapping each word to its frequency
  • Sort the histogram by frequency, descending
  • Convert the histogram representation into vectors/a matrix, with words as features/columns and sentences as rows
  • Disadvantages:
    • All words carry equal weight, which can be overcome by TF-IDF
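The steps above can be sketched in a few lines of plain Python (sklearn's CountVectorizer does the same job in practice; stemming/lemmatization is skipped here for brevity):

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is"}

def bag_of_words(sentences):
    # Lowercase, remove stop words, build a frequency histogram,
    # sort it descending, then vectorize each sentence against it.
    tokenized = [[w for w in s.lower().split() if w not in STOP_WORDS]
                 for s in sentences]
    histogram = Counter(w for tokens in tokenized for w in tokens)
    vocab = [w for w, _ in histogram.most_common()]  # columns, by frequency desc
    matrix = [[tokens.count(w) for w in vocab] for tokens in tokenized]
    return vocab, matrix

vocab, matrix = bag_of_words(["the dog chased the cat", "the cat slept"])
# vocab  -> ['cat', 'dog', 'chased', 'slept']
# matrix -> [[1, 1, 1, 0], [1, 0, 0, 1]]
```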

TF-IDF
  • Term Frequency - Inverse Document Frequency
  • It measures how important a word is to a document within a collection of documents
  • TF is a vector similar to bag of words, but the 0/1 values are replaced by term frequencies
  • TF = (number of times the term appears in a document/sentence) / (total number of terms in that document/sentence)
  • IDF = log((total number of documents/sentences) / (number of documents/sentences containing the word))
  • IDF is a vector similar to bag of words, but the values are replaced with IDF values
  • TF-IDF = TF vector * IDF vector (element-wise)
The most important thing in NLP is text preprocessing: converting word data into a vector representation so that an algorithm can generalize over these words for prediction, sentence generation, and so on.
  • One-hot representation - represents a word as a vector: if the word is at the nth position of the dictionary, the vector has a 1 at the nth index and 0 everywhere else. These vectors are high-dimensional and sparse; the dimension equals the vocabulary size.
  • Feature representation - represent/rank the word against features like gender, royal, age, food, etc. on the x axis
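One-hot encoding as described above, in a short sketch (the five-word vocabulary is just for illustration):

```python
def one_hot(word, vocab):
    # Sparse vector of vocabulary size with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["king", "queen", "apple", "man", "woman"]
one_hot("apple", vocab)  # -> [0, 0, 1, 0, 0]
```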
 
Word Embeddings
  • Convert the sentence into one hot representation considering some vocabulary size.
  • Pass the value to Embedding Layer and provide the dimension of the features. This is also called feature representation of the word.
  • Cosine similarity - the similarity between two feature representation vectors is the cosine of the angle between them: their dot product divided by the product of their magnitudes.
  • Word2vec
  • Glove
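Cosine similarity over feature representations, sketched below; the feature vectors (gender, royal, age, food) are made-up values for illustration, not real Word2vec/GloVe embeddings:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); near 1 means the vectors
    # point in nearly the same direction, i.e. similar words.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors: (gender, royal, age, food)
king  = [-0.95, 0.93, 0.70, 0.02]
queen = [0.97, 0.95, 0.69, 0.01]
apple = [0.00, -0.01, 0.03, 0.95]

# king is closer to queen than to apple in this feature space
cosine_similarity(king, queen) > cosine_similarity(king, apple)  # -> True
```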


Sentiment Analysis
  • The process of computationally classifying and categorizing opinions expressed in a piece of text
  • It helps to understand the writer's opinion about a topic, event, product, and so on.
    • Packages - Keras, TensorFlow
    • Pre-Process the data set 
      • Keep the review text length consistent in terms of word count
    • Word Embedding
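The "consistent review length" preprocessing step can be sketched without Keras; the function below truncates long reviews and left-pads short ones with zeros, mirroring what keras pad_sequences does by default:

```python
def pad_sequences(token_id_lists, maxlen, pad_value=0):
    # Truncate from the front (keep the last maxlen tokens),
    # then left-pad short sequences with pad_value.
    padded = []
    for seq in token_id_lists:
        seq = seq[-maxlen:]
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded

pad_sequences([[4, 7, 9], [12, 5, 8, 3, 1]], maxlen=4)
# -> [[0, 4, 7, 9], [5, 8, 3, 1]]
```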
NLTK & Related Packages
  • numpy
  • sklearn
  • PorterStemmer (from nltk.stem)
