- 👉 Text Processing
- 👉 SQLite
- 👉 sklearn
- 👉 nltk (Natural Language Toolkit)
- 👉 Handling imbalanced datasets
How to convert text (words/sentences) into numerical vectors of d dimensions?
- The most important features are the review Summary and Text, which are in simple English; the aim is to convert them into vectors and then apply linear algebra to them
- Say we draw a plane (which we need to find) between the d-dimensional data points; this plane divides the data points into two parts, say -ve and +ve
- Assume a normal that is perpendicular to the plane
- Data points in the direction of the normal are positive and those in the opposite direction are negative (this can be fed to any model); a small sketch of this sign rule follows below
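A minimal sketch of this sign rule with NumPy; the plane (w, b) and the points below are made-up values, not anything learned from the data:

```python
import numpy as np

# hypothetical plane: normal vector w and offset b (illustrative values only)
w = np.array([1.0, -2.0, 0.5])
b = 0.0

points = np.array([[2.0, 0.5, 1.0],    # lies on the side the normal points to
                   [-1.0, 2.0, 0.0]])  # lies on the opposite side

# sign of w . x + b decides the side of the plane: +1 -> positive, -1 -> negative
labels = np.sign(points @ w + b)
print(labels)  # [ 1. -1.]
```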
How to find a good plane to separate the data points?
What rules will this conversion (text to d-dimensional vectors) follow?
- Say we have reviews 1, 2, 3 and the semantic similarity (similarity in English meaning) between (1, 2) is greater than between (1, 3), i.e. we find that reviews (1, 2) are more similar than (1, 3). If reviews 1, 2, 3 have vectors v1, v2, v3, then
  - the distance between (v1, v2) should be smaller than between (v1, v3), i.e. the more similar the reviews, the smaller the distance
  - geometrically, the distance between (v1, v2) is the length of the difference vector (v1 - v2)
- So we can say that if reviews (1, 2) are more similar, then the distance between (v1, v2) will be smaller, or: similar points are closer
- If all the +ve points and all the -ve points lie close to their own groups, it becomes easy to draw a plane between them and separate them into two groups
How to find a method that takes text as input and converts it into a d-dimensional vector, such that similar texts are geometrically closer (the distance between similar reviews must be the smallest)?
- Bag-of-Words (the simplest technique to change text into vectors) => works on word counts; complexity increases going down this list
- Binary/Boolean Bag-of-Words (a variation of BOW) => 1 if the word occurs, else 0
- Tf-idf (term frequency - inverse document frequency)
- Word2Vec (a natural language processing technique)
- Average Word2Vec
- Tf-idf weighted Word2Vec
🙋‍♂️ How does Bag-of-Words work, and what are its variations? => variations include n-grams, preprocessing and Tf-idf
The set of all text documents is called a corpus => a corpus is a collection of documents (each review here is called a document in NLP)
- It constructs a dictionary (not a Python dict): the set of all unique words across the reviews (each review is called a document in NLP)
  - Say a review is 'This is good product' => BOW constructs {'This', 'is', 'good', 'product'} <= the set of unique words taken from the reviews
- Assume we have n reviews (documents) and d unique words across all the reviews, giving d-dimensional vectors that represent the reviews. For example, the components {v1, v2, v3, v4} would correspond to {'This', 'is', 'good', 'product'}. Keep in mind each word is a different dimension of the d-dimensional vector
- Now count how often each dictionary word occurs in a review and assign that number to its component. For example, if 'This' occurs 1 or 2 times, assign that count to v1, which was associated with the word 'This'. Since all d dimensions are considered, the components for words that do not appear in the review are filled with 0
  - Because of this, the vector we get for a particular sentence is generally very sparse (a sparse vector means most of the values are 0 and only a few are non-zero)
- The length of the difference vector (v1 - v2) is calculated as ‖v1 - v2‖, i.e. its norm (a small example with scikit-learn follows below)
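A minimal BOW sketch with scikit-learn's CountVectorizer; the review strings are made-up examples:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This is good product", "This is not very good product"]

vectorizer = CountVectorizer()             # plain count-based Bag-of-Words
X = vectorizer.fit_transform(reviews)      # sparse matrix: n_reviews x d_unique_words

print(vectorizer.get_feature_names_out())  # the dictionary of unique words (get_feature_names() on older sklearn)
print(X.toarray())                         # each row is the count vector of one review

# sparsity of the matrix = number of zero elements / total number of elements
print(1.0 - X.nnz / (X.shape[0] * X.shape[1]))
```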
Problem faced with BOW
- The distance between reviews can be small even when their English interpretations are very different, for example 'This is very good' and 'This is not very good'
Solution is Binary BOW
- Binary BOW => 1 if the word occurs, else 0
- The distance between the vectors, i.e. ‖v1 - v2‖, is approximately equal to the square root of the number of differing words, since these vectors contain boolean values (a small check follows below)
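A minimal check of that claim, assuming scikit-learn and NumPy; the two reviews are the same made-up examples as above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This is very good", "This is not very good"]

# binary=True turns counts into 0/1 presence indicators (Binary/Boolean BOW)
X = CountVectorizer(binary=True).fit_transform(reviews).toarray()
v1, v2 = X[0], X[1]

print(np.linalg.norm(v1 - v2))    # 1.0 here: the reviews differ only in the word 'not'
print(np.sqrt(np.sum(v1 != v2)))  # square root of the number of differing words
```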
Fixing/Improving BOW using StopWords, Tokenization, Lemmatization => these come under Text Preprocessing
- Say a review is 'This is a good product' => the words 'This', 'is', 'a' do not add much value to the analysis and are called stopwords, so in the application consider removing these kinds of words
- 🤔 Why remove stop words? Say we have two sentences, s1 'This is good product' and s2 'That is good product'. Their English interpretation is the same, yet just because of the stopwords 'This' and 'That' the distance between them becomes non-zero, which does not make sense
  - nltk provides a list of English stopwords
- Thing to keep in mind: 'not' is also a stopword and we still have to keep it, because 'not good' would become 'good' after stopword removal, which can completely change the interpretation of the review => 'not' is a boundary case => sometimes removing stopwords can be very dangerous
- Removing stopwords makes our vectors smaller, but it is not a silver bullet
- After removing stopwords we are left with a group of words that still makes sense (a small sketch follows below)
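A minimal sketch using nltk's English stopword list, with 'not' explicitly kept (an assumed preprocessing choice, not the only possible one):

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

# keep the boundary case 'not' so that negations are not destroyed
stop_words = set(stopwords.words("english")) - {"not"}

review = "This product is not very good and affordable"
kept = [w for w in review.lower().split() if w not in stop_words]
print(kept)  # ['product', 'not', 'good', 'affordable']
```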
- 🤔 'Good' and 'good' will again give two different vectors; the machine will think of them as different words
  - Solution: convert everything to lower case or upper case
- 🤔 STEMMING: inflected forms such as 'tasty', 'tasteful' and 'tastes' share the base word 'taste'; stemming reduces the words in our text to such base forms
  - Algorithms used for stemming include SnowballStemmer (works better than Porter) and PorterStemmer => these were designed by linguistics experts
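A minimal stemming sketch with nltk; the word list is illustrative:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # the "Porter2" stemmer

words = ["tasty", "tastes", "loved", "loving", "products"]
print([porter.stem(w) for w in words])
print([snowball.stem(w) for w in words])
# both map inflected forms to a common root, e.g. 'loved' and 'loving' -> 'love'
```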
- 🤔 TOKENIZATION & LEMMATIZATION: tokenization breaks a sentence into words (this works a bit differently for languages like Mandarin and Japanese); lemmatization then maps each word to its dictionary base form (its lemma)
  - A problem comes with names such as 'New York': a plain tokenizer will break it into 'New' and 'York' and the original meaning is lost
  - Solution: a tokenizer/lemmatizer dictionary that takes care of such cases (a small sketch follows below)
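A minimal sketch with nltk's tokenizer and WordNet lemmatizer (assumed choices; depending on your nltk version you may also need the 'punkt_tab' resource):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

tokens = word_tokenize("These products tasted better than expected in New York")
print(tokens)  # note how 'New York' is split into 'New' and 'York'

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t.lower()) for t in tokens])  # e.g. 'products' -> 'product'
```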
Bag-of-Words + text preprocessing does not guarantee the semantic meaning of words!!! So what to do?
🙋‍♂️ Word2Vec comes into the picture when we have to consider the semantic meaning of words
- Bag-of-Words does not consider semantic meaning, so Word2Vec comes to the rescue
- Word2Vec + text preprocessing takes the semantic meaning of words into account
Say r1 and r2 are two different reviews:
- r1 => 'This product is very good and affordable'
- r2 => 'This product is not very good and affordable'
After removing stopwords we get:
- r1 => 'product good affordable'
- r2 => 'product good affordable', since 'This', 'is', 'very', 'and', 'not' are stopwords
Now, for the machine, both reviews are the same once converted to vector form, as their distance will be 0
😎 To solve this problem we use bi-grams or n-grams
🤔 How does a bi-gram work then?? => it takes a pair of 2 consecutive words as a dimension
- It makes pairs of words (two consecutive words at a time), considers each pair as a dimension, and then compares the pairs across reviews
- Say for review r1 => 'This product is very good and affordable' it will create dimensions such as 'This product', 'product is', 'is very' and so on; this is how it can tell the difference between 'not very good' and 'very good' (a small example follows below)
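A minimal bi-gram BOW sketch with scikit-learn; the review strings are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This product is very good", "This product is not very good"]

# ngram_range=(2, 2) -> only bi-grams; (1, 2) would mix uni-grams and bi-grams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(reviews)

print(bigram_vectorizer.get_feature_names_out())
# 'very good' and 'not very' are separate dimensions, so the two reviews now differ
print(X.toarray())
```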
🤔 Why bother about uni-/bi-/n-grams??
They help retain some of the sequence information; an n-gram with n=1 (uni-gram) retains no sequential information, whereas with a bi-gram or tri-gram we get different dimensions for 'very good' and 'not very good'
🤔 Problem with n-grams
- The number of distinct tri-grams is typically >= the number of distinct bi-grams >= the number of distinct uni-grams in a corpus
- So if we use n-grams with n > 1, the dimensionality d increases. Still, BOW with bi-grams/tri-grams is very useful, as it helps us retain sequence information, at the cost of dimensionality
🙋‍♂️ What is Tf-idf (term frequency - inverse document frequency)?? 🤔 What do Tf and idf mean???
Say we have n reviews (r) and each review is a combination of some words, say r1 => w1, w2, w3, w2 and r2 => w3, w4, w1
- Make a BOW representation of it => for r1: w1 occurs 1 time, w2 occurs 2 times, w3 occurs 1 time
- Now TF(wi, rj) = (number of times wi occurs in rj) / (total number of words in rj)
  - read as: the term frequency of word wi in review rj
  - Say we ask how many times w2 occurs in r1 => TF(w2, r1) = 2/4
- 🤓 Point to remember: the Tf of any word in any review (document) lies between 0 and 1, i.e. 0 <= TF(wi, rj) <= 1 always. Since it lies in the range 0-1 it can be seen as a probability: TF is the probability of finding a word in a review (document). The more often a word occurs, the higher its Tf (a small example follows below)
- BOW and Tf-idf are techniques that were originally created for Information Retrieval (a sub-area of NLP), perhaps applied to thousands of web pages
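A minimal sketch of computing TF by hand; w1..w4 are the placeholder "words" from the example above:

```python
from collections import Counter

r1 = ["w1", "w2", "w3", "w2"]

def tf(word, review):
    # TF(wi, rj) = (count of wi in rj) / (total number of words in rj)
    return Counter(review)[word] / len(review)

print(tf("w2", r1))  # 2 / 4 = 0.5
```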
🤔 What is IDF (inverse document frequency) then??
- IDF is always measured for a given word (w) over a corpus (the whole data), say corpus = the set of all reviews (r1, r2, ..., rn)
- IDF(wi, corpus) = log[(number of all reviews/documents) / (number of reviews/documents that contain wi)]
  - read as: the IDF of word wi in a given document corpus (all reviews)
- The number of all reviews (documents) is always >= the number of reviews (documents) that contain wi, which follows logically
- Therefore (number of all reviews) / (number of reviews that contain wi) >= 1 always, i.e. IDF >= 0 always
  - Why? Because log(1) = 0 and the log of any number greater than 1, e.g. log(2), gives something greater than 0
- So the IDF of any word in a document corpus is always greater than or equal to 0 (a small example follows below)
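A minimal sketch of computing IDF by hand over the same tiny made-up corpus:

```python
import math

corpus = [["w1", "w2", "w3", "w2"],   # r1
          ["w3", "w4", "w1"]]         # r2

def idf(word, corpus):
    # IDF(wi, corpus) = log(N / n_i), where N = number of documents and
    # n_i = number of documents that contain wi
    n_docs_with_word = sum(1 for review in corpus if word in review)
    return math.log(len(corpus) / n_docs_with_word)

print(idf("w1", corpus))  # w1 occurs in both reviews -> log(2/2) = 0.0
print(idf("w4", corpus))  # w4 occurs only in r2      -> log(2/1) ≈ 0.693
```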
- 🤓 Point to remember: if the number of reviews (documents) that contain wi is high, then the IDF decreases => IDF decreases as the number of reviews containing wi increases, and vice versa
  - For example: 'the' is used in most reviews, and the number of reviews containing 'the' is approximately equal to the corpus size, so its IDF will be low
  - If wi is a rare term, its IDF will be high
Multiply the Tf formula with the IDF formula
- Tf-idf => TF(wi, r1) * IDF(wi, corpus)
  - TF(wi, r1) => measures how often wi occurs in review r1 (document) => if wi occurs often within the review, this value is high
  - IDF(wi, corpus) => measures how widely wi occurs across the corpus (the whole review data) => if wi occurs in most documents, this value is lowest
- 🤓 Point: in TF-IDF we give:
  - more importance to a word that is frequent within a review (document), along with
  - more importance to a word that is rare in the corpus
- BOW only counts the occurrences of wi (any i-th word) in ri (any i-th review); this is how Tf-idf evolves the BOW approach (a small sketch follows below)
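A minimal Tf-idf featurization sketch with scikit-learn; note that TfidfVectorizer uses a smoothed IDF variant and L2-normalizes each row, so its numbers differ slightly from the hand formula above. The review strings are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["This product is good and affordable",
           "This product is not good",
           "Good packaging and fast delivery"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)    # sparse matrix: n_reviews x d_unique_words

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))              # words frequent in a review but rare in the corpus score highest
```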
- 🤔 Problem with Tf-idf => it does not take the semantic meaning of words into account; to it, 'good' and 'excellent', or 'cheap' and 'affordable', are simply two different words
  - Word2Vec solves this problem
🙋‍♂️ More about Word2Vec?? It is a state-of-the-art technique => working: take a large text corpus -> Word2Vec -> word:vector
This link can quench your thirst for Word2Vec
- While tf-idf and BOW take sentences and change them into sparse vectors, Word2Vec takes words and changes them into dense (not sparse) d-dimensional vectors
- It keeps semantic meaning in consideration, e.g. walked->walking behaves like swam->swimming (verb tense)
- It keeps relationships in consideration, e.g. man->woman and king->queen; the vectors for these pairs are parallel: (V_man - V_woman) || (V_king - V_queen)
- It does all of this automatically, without being explicitly programmed to, as a ton of maths is involved behind it
- The more dimensions you take, the more information-rich the vector is => with higher dimensions, far more complex relationships can be captured
- The dimensionality depends on the size of the corpus
- Intuitively, it observes the neighbourhood of each word: if two words occur in similar neighbourhoods, their vectors should be similar (how this is done can be seen in deep learning -> matrix factorization). A small training sketch follows below
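A minimal training sketch with gensim, assuming gensim 4.x (in 3.x the parameter is `size` instead of `vector_size`); the tiny corpus is only illustrative, real Word2Vec needs a large corpus:

```python
from gensim.models import Word2Vec

sentences = [["product", "good", "affordable"],
             ["product", "not", "good"],
             ["good", "packaging", "fast", "delivery"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

print(model.wv["good"].shape)          # dense 50-dimensional vector for 'good'
print(model.wv.most_similar("good"))   # nearest words by cosine similarity
```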
Word2Vec takes a word and gives us a d-dimensional vector for that word, but in our case reviews (documents) are sequences of words (sentences).
So, using Word2Vec, how do we convert a sentence into a vector???
- sent2vec is a deep learning technique that could be used, but deep learning is out of scope for this implementation. For now, the simpler weighting techniques will be:
  - Average Word2Vec
  - Tf-idf weighted Word2Vec
- These two techniques are very simple; techniques like Thought Vectors come up in deep learning, which is out of focus for now
🤔 How does Average Word2Vec work??
- Say for review r1: take the Word2Vec vector of each word, add them all up, then divide by the number of words in r1; consider this outcome the vector representation of r1
- i.e. sum up all the word vectors and divide by the number of words (the average)
- This is the simplest way to use Word2Vec to build a sentence vector, though it does not always work well (a small sketch follows below)
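A minimal Average Word2Vec sketch for one review, reusing the gensim `model` trained above; the review words are illustrative:

```python
import numpy as np

def avg_word2vec(review_words, model):
    # average of the Word2Vec vectors of the words found in the model's vocabulary
    vectors = [model.wv[w] for w in review_words if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

r1 = ["product", "good", "affordable"]
print(avg_word2vec(r1, model).shape)  # (50,) -> one dense vector for the whole review
```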
🤔 How does Tf-idf weighted Word2Vec work??
- For a review (r), first find the tf-idf of each word (w), then find the Word2Vec of each word, then take the tf-idf-weighted sum of the word vectors and divide by the sum of all the tf-idf values
- tfidf-word2vec(r1) = [t1 * Word2Vec(w1) + t2 * Word2Vec(w2)] / (t1 + t2) <= say we have only two words (w1, w2) in review r1 (a small sketch follows below)
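A minimal Tf-idf weighted Word2Vec sketch, again reusing the gensim `model` from above; the tf-idf weights here are placeholder numbers, not values computed from a real corpus:

```python
import numpy as np

def tfidf_weighted_word2vec(review_words, tfidf_weights, model):
    # weighted average: sum(t_i * Word2Vec(w_i)) / sum(t_i)
    num = np.zeros(model.vector_size)
    denom = 0.0
    for w, t in zip(review_words, tfidf_weights):
        if w in model.wv:
            num += t * model.wv[w]
            denom += t
    return num / denom if denom > 0 else num

r1 = ["product", "good", "affordable"]
weights = [0.4, 0.1, 0.7]  # assumed tf-idf values, for illustration only
print(tfidf_weighted_word2vec(r1, weights, model).shape)  # (50,)
```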
Sparsity of the matrix = (number of zero elements) / (total number of elements)
Given a text review, determine the sentiment of the review: whether it is positive or negative.
Say a business has a product; just by understanding the product reviews, they can work out what to add, modify or remove in that product. Positive reviews tell the business what to keep in mind while making new products, and the opposite goes for negative reviews.
The Score/Rating can be used (a small labelling sketch follows below)
- Ratings of 4 or 5 could be considered positive
- Ratings of 1 or 2 could be considered negative
- Ratings of 3 could be considered neutral or ignored
This can be seen as an approximate or proxy way of determining the polarity of a review
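A minimal sketch of this proxy labelling rule with pandas; the file name Reviews.csv and the Score column follow the Amazon Fine Food Reviews schema described below:

```python
import pandas as pd

df = pd.read_csv("Reviews.csv")

df = df[df["Score"] != 3]                    # drop neutral reviews
df["Sentiment"] = (df["Score"] > 3).map(     # 4, 5 -> positive; 1, 2 -> negative
    {True: "positive", False: "negative"})
print(df["Sentiment"].value_counts())
```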
ℹ️ Data Source: Amazon Fine Food Reviews
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
- Number of reviews: 568,454
- Number of users: 256,059
- Number of products: 74,258
- 260 users with > 50 reviews
- Timespan: Oct 1999 - Oct 2012 (about 13 years of data)
- Number of attributes/columns in the data: 10
- Id
- ProductId - unique identifier for product
- UserId - unique identifier for the user
- ProfileName - profile name of user
- HelpfulnessNumerator - number of users who found the review helpful => users who said the review is helpful, say 2,000 people; users who said the review is not useful are not counted here
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not => we add the users who said the review is useful and those who said it is not; say 2,000 people found the review useful and 100 did not, then the HelpfulnessDenominator in this case is 2,000 + 100 = 2,100
- Score - rating between 1 and 5
- Time - timestamp when review was given
- Summary - brief summary of the review (written at the top of the review)
- Text - text of review
- Data Cleaning
- Data Visualization
- Text Featurization(BOW, tfidf, Word2Vec)
- Building several Machine Learning models like
- Logistic Regression
- KNN
- SVM
- Naive Bayes
- Random Forest
- GBDT
- LSTM(RNNs) etc
Given longitudinal data, one should be able to understand how things change over time. McAuley and Leskovec published a paper in 2013 detailing how they used Amazon's gourmet food section to build a recommendation classifier that builds upon the experience of a reviewer. Using this longitudinal dataset, there are many things that could be understood by looking into the data. For instance, we could potentially see trends in food over the years and maybe even capture the cupcake craze of 2011.
- Understanding evolution of reviewers over time
- Understanding variation of helpfulness of reviews for products over time
- Visualize changes in reviews over a 10-year period to understand which trends were important each year
Several results were gathered through the analysis notebook
- Review lengths become longer over time
- Semantic predictions for the summary and the review text are weakly but significantly correlated according to the Pearson correlation
- Summaries of reviews also get slightly longer over time
- The older a product is, the more variation there is in its review scores
- The Helpfulness_Ratio generally increases over time for a product
- Scattertext plots showed the evolution of the Amazon platform and the transition from movies to foods
- Earlier, positive ratings stemmed from specific products, but this slowly shifted towards a more sentiment-based relationship.
This Fine Food Reviews dataset shows several trends
- As writers become more experienced, the length of their reviews gets longer
- In addition, the summaries of reviews also seem to get longer
- Additionally, Older_Products have more variation in Review_Scores
- Finally, using scattertext, one can see some of the text features associated with higher or lower ratings for a product, which eventually shifts towards a more semantic-based relationship
As this work progresses, I will try to update this framework along with proper conclusions.
1 - Amazon Food Reviews EDA, NLP, Text Preprocessing and Visualization using t-SNE (t-distributed stochastic neighbor embedding)
- Define the Problem Statement
- Perform Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset; plot Word Clouds, Distplots, Histograms, etc.
- Perform Data Cleaning & Data Preprocessing by removing unnecessary and duplicate rows; for the text reviews, remove HTML tags, punctuation and stopwords, and stem words using the Porter Stemmer
- Document concepts clearly
- Plot t-SNE plots for the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Apply Logistic Regression on the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Use both Grid Search & Randomized Search cross-validation
- Evaluate test data on various performance metrics like accuracy, f1-score, precision, recall, etc.; also plot the confusion matrix using seaborn
- Show how sparsity increases as we increase lambda (or decrease C) when the L1 regularizer is used, for each featurization (a small sketch of this effect follows below)
- Do a perturbation test to check whether the features are multi-collinear or not
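A minimal sketch (not the notebook's exact code) of how L1-regularized logistic regression gets sparser as C decreases; the toy reviews and labels below are made up, and any BOW or tf-idf feature matrix could be used instead:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["good product", "not good product", "very affordable", "not affordable"]
y = np.array([1, 0, 1, 0])
X = TfidfVectorizer().fit_transform(reviews)

for C in [10, 1, 0.1]:  # smaller C == larger lambda == stronger L1 penalty
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    n_zero = int(np.sum(clf.coef_ == 0))
    print(f"C={C}: {n_zero} zero weights out of {clf.coef_.size}")
```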
- Apply K-Nearest Neighbours on the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Use both the brute-force & kd-tree implementations of KNN
- Evaluate test data on performance metrics like accuracy; also plot the confusion matrix using seaborn
- Apply Naive Bayes using Bernoulli NB and Multinomial NB on the different featurizations of the data, viz. BOW (uni-gram) and tfidf
- Evaluate test data on various performance metrics like accuracy, f1-score, precision, recall, etc.; also plot the confusion matrix using seaborn
- Print the top 25 important features for both negative and positive reviews
- Apply SVM with the rbf (radial basis function) kernel on the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Use both Grid Search & Randomized Search cross-validation
- Evaluate test data on various performance metrics like accuracy, f1-score, precision, recall, etc.; also plot the confusion matrix using seaborn
- Evaluate SGDClassifier on the best-performing featurization
- Apply Decision Trees on the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Use Grid Search with 30 random points to find the best max_depth
- Evaluate test data on various performance metrics like accuracy, f1-score, precision, recall, etc.; also plot the confusion matrix using seaborn
- Plot the feature importances obtained from the Decision Tree classifier
- Apply Random Forest on the different featurizations of the data, viz. BOW (uni-gram), tfidf, Avg-Word2Vec and tf-idf-Word2Vec
- Use Grid Search with 30 random points to find the best max_depth, learning rate and n_estimators
- Evaluate test data on various performance metrics like accuracy, f1-score, precision, recall, etc.; also plot the confusion matrix using seaborn
- Plot a Word Cloud of the feature importances obtained from the RF and GBDT classifiers