AMAZON FINE FOOD REVIEW WITH NAIVE BAYES

Abhay Desai · Jul 27, 2021

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

  1. Number of reviews: 568,454
  2. Number of users: 256,059
  3. Number of products: 74,258
  4. Timespan: Oct 1999 — Oct 2012
  5. Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review
  • We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review, a rating of 1 or 2 a negative review, and a rating of 3 is neutral and ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

Loading the data

The dataset is available in two forms

  1. .csv file
  2. SQLite Database

To load the data, we use the SQLite database, as it is easier to query and visualize the data efficiently.

Here we will ignore all reviews with a Score equal to 3. If the Score is above 3, we will consider the review positive; otherwise it will be negative.

Loading the data
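A minimal sketch of this step, assuming the SQLite file is named database.sqlite and the table is named Reviews (as in the Kaggle release of the dataset):

```python
import sqlite3
import pandas as pd

# Connect to the SQLite database (assumed filename: database.sqlite)
con = sqlite3.connect('database.sqlite')

# Keep only non-neutral reviews (Score != 3)
filtered_data = pd.read_sql_query("""
SELECT * FROM Reviews WHERE Score != 3
""", con)

# Map the numeric Score to a binary class label
def partition(x):
    return 'positive' if x > 3 else 'negative'

filtered_data['Score'] = filtered_data['Score'].map(partition)
print(filtered_data.shape)
print(filtered_data.head())
```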

The output shows the shape of the filtered data and a preview of the first few rows.

DATA CLEANING: Deduplication

In machine learning, data cleaning is very important. We will preprocess the data using pandas.

It is observed (as shown in the table below) that the review data has many duplicate entries. Hence it is necessary to remove duplicates in order to get unbiased results from the analysis. Following is an example:

Now we will sort the data and remove the duplicate entries.
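A sketch of the deduplication, assuming rows sharing UserId, ProfileName, Time, and Text are duplicates:

```python
# Sort by ProductId so duplicate reviews line up
sorted_data = filtered_data.sort_values('ProductId', ascending=True, kind='quicksort')

# Treat rows with the same user, profile, timestamp, and text as duplicates
final = sorted_data.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')

# Percentage of data remaining after deduplication
print(final.shape[0] / filtered_data.shape[0] * 100)
```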

We can see that only 69.25% of the data remains after removing duplicates; in other words, 30.75% of the original data was duplicated.

Also, in the image below, we can observe two rows in which the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the calculations.

Now we will remove this kind of data as well.
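A one-line filter along these lines should do it:

```python
# Drop rows where HelpfulnessNumerator exceeds HelpfulnessDenominator,
# which is not practically possible
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]
print(final.shape)
```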

After removing the invalid rows:

But here is a problem: our dataset has Text and Summary columns. These are text features, but in machine learning we need numerical features to build a model. So the question is: how do we convert text features into numerical vectors?

Natural Language Processing: we use some techniques from Natural Language Processing to convert text into numerical vectors.

[3] Preprocessing

[3.1] Preprocessing Review Text

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase, we do the following in the order below (a sketch of these steps appears after the list):

  1. Begin by removing the HTML tags
  2. Remove any punctuation or limited set of special characters like , or . or #, etc.
  3. Check that the word is made up of English letters and is not alphanumeric
  4. Check that the length of the word is greater than 2 (as it was researched that there are no two-letter adjectives)
  5. Convert the word to lowercase
  6. Remove stopwords
  7. Finally, Snowball-stem the word (it was observed to perform better than Porter stemming). For example: 'taste', 'tasty', and 'tasteful' reduce to the base form 'tast' after stemming.
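A sketch of these steps as a single helper function; clean_review is a hypothetical name, and the BeautifulSoup/NLTK usage assumes those packages (and the NLTK stopwords corpus) are installed:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# nltk.download('stopwords') may be needed on the first run
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def clean_review(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # 1. remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)                # 2-3. keep English letters only
    words = [w.lower() for w in text.split() if len(w) > 2]  # 4-5. length > 2, lowercase
    words = [stemmer.stem(w) for w in words if w not in stop_words]  # 6-7. stopwords, stemming
    return ' '.join(words)

print(clean_review("This <br/> is a tasty, tasteful snack!"))
```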

After this, we collect the words used to describe positive and negative reviews.

Load the data:

Now we will test the cleaning on the data and remove the special characters.

As the cleaning is working, we will combine the sentences from the reviews and the Summary, as sketched below.
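A sketch, reusing the hypothetical clean_review helper from above and assuming a simple concatenation of the two cleaned columns:

```python
# Clean both text columns, then concatenate Summary + Text into one feature
final['CleanedText'] = final['Text'].apply(clean_review)
final['CleanedSummary'] = final['Summary'].apply(clean_review)
final['Combined'] = final['CleanedSummary'] + ' ' + final['CleanedText']
```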

Here are some techniques to find the vector representation of given text :

Bag of Words

tf–idf

Word2vec

Average Word2vec

Average tf–idf Word2vec

We will use BAG OF WORDS:

I’ll take a popular example to explain Bag-of-Words (BoW).

We all love watching movies (to varying degrees). I tend to always look at the reviews of a movie before I commit to watching it. I know a lot of you do the same! So, I’ll use this example here.

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

Bag of Words (BoW) Model

The Bag of Words (BoW) model is the simplest form of text representation in numbers. As the term itself suggests, we represent a sentence as a bag-of-words vector (a vector of word counts).

Let’s recall the three types of movie reviews we saw earlier:

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

Each position counts how many times the corresponding vocabulary word appears in the review; for example, 'is' appears twice in Review 2.

So, now that you know what BoW is, we will implement it on our model.

Now we will look at some feature names, the shape of our BoW matrix, and the number of unique words.
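A minimal sketch with scikit-learn's CountVectorizer (get_feature_names_out requires scikit-learn >= 1.0; older releases use get_feature_names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigram BoW on the cleaned review text
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['CleanedText'].values)

print("shape of BoW matrix:", final_counts.get_shape())
print("number of unique words:", final_counts.get_shape()[1])
print("some feature names:", count_vect.get_feature_names_out()[:10])
```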

bi-gram, tri-gram and n-gram

  1. Removing stop words like "not" should be avoided before building n-grams, since negations carry sentiment.
  2. count_vect = CountVectorizer(ngram_range=(1,2)); please do read the CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
  3. You can choose values such as min_df=10 and max_features=5000 as you see fit.
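For example, a uni+bi-gram vectorizer with the illustrative values above:

```python
# ngram_range=(1,2) keeps unigrams and bigrams; min_df and max_features
# are illustrative values, tune them to your needs
bigram_vect = CountVectorizer(ngram_range=(1, 2), min_df=10, max_features=5000)
final_bigram_counts = bigram_vect.fit_transform(final['CleanedText'].values)
print("shape of bi-gram BoW matrix:", final_bigram_counts.get_shape())
```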

Now it's time to apply Naive Bayes on BoW.

  • Review text, preprocessed, converted into vectors using BoW
  • Find the best hyperparameter, which will give the maximum AUC value
  • Consider a wide range of alpha values for hyperparameter tuning, starting as low as 0.00001
  • Find the best hyperparameter using k-fold cross-validation or a simple cross-validation set
  • Use GridSearchCV or RandomizedSearchCV, or write your own for loops to do this task of hyperparameter tuning (a sketch follows this list)
  • Find the top 10 features of the positive class and the top 10 features of the negative class for both feature sets, using the absolute values of the `coef_` parameter of MultinomialNB, and print their corresponding feature names
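A sketch of the tuning step with GridSearchCV, assuming binary labels derived from the positive/negative Score column and the unigram BoW matrix from above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Binary labels: 1 for positive, 0 for negative
y = np.where(final['Score'].values == 'positive', 1, 0)

X_train, X_test, y_train, y_test = train_test_split(
    final_counts, y, test_size=0.3, random_state=42)

# Tune the smoothing parameter alpha over a wide range, scored by AUC
params = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf = GridSearchCV(MultinomialNB(), params, cv=10, scoring='roc_auc')
clf.fit(X_train, y_train)
print("best alpha:", clf.best_params_['alpha'])
```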

Feature engineering

  • To increase the performance of your model, you can also experiment with feature engineering, like:
  • Taking the length of reviews as another feature.

We will see:

  • Considering some features from the review summary as well.
  • On the X-axis you will have alpha values; since they span a wide range, apply a log function to the alpha values just to represent them on the graph.
  • Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve for both train and test.
  • Alongside the ROC curve, we will plot the confusion matrix as a seaborn heatmap (sketched below).
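A sketch of the plotting step, reusing clf from the tuning sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Retrain with the best alpha, then plot train and test ROC curves
best = MultinomialNB(alpha=clf.best_params_['alpha'])
best.fit(X_train, y_train)

for X, labels, name in [(X_train, y_train, 'train'), (X_test, y_test, 'test')]:
    fpr, tpr, _ = roc_curve(labels, best.predict_proba(X)[:, 1])
    plt.plot(fpr, tpr, label=f"{name} AUC = {auc(fpr, tpr):.3f}")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

# Confusion matrix on the test data as a seaborn heatmap
cm = confusion_matrix(y_test, best.predict(X_test))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```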

Output:

Now we will see the confusion matrix for the train data:

Now we will see the confusion matrix for the test data:

Now we will see the top 10 words for the positive and negative classes:
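One way to sketch this, reusing names from the sketches above (note that coef_ was deprecated and later removed from newer scikit-learn; feature_log_prob_ carries the per-class log probabilities):

```python
# Top 10 words per class, ranked by per-class log probability
feature_names = np.array(count_vect.get_feature_names_out())
neg_top10 = feature_names[np.argsort(best.feature_log_prob_[0])[-10:]]
pos_top10 = feature_names[np.argsort(best.feature_log_prob_[1])[-10:]]
print("top 10 negative-class words:", neg_top10)
print("top 10 positive-class words:", pos_top10)
```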

Now we will apply Multinomial Naive Bayes on BoW:

We will get a graph like:

And the ROC will be:

Confusion matrix for the train data:

Confusion matrix for the test data:

Now we will see the top 10 words of both the positive and negative classes:

This was a small introduction to text featurization with Natural Language Processing, using the real-world Amazon Fine Food Reviews dataset.

Thank you so much.
