Abhay desai
8 min read · Jul 27, 2021


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

  1. Number of reviews: 568,454
  2. Number of users: 256,059
  3. Number of products: 74,258
  4. Timespan: Oct 1999 — Oct 2012
  5. Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review

We could use the Score/Rating to label the reviews. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

Loading the data

The dataset is available in two forms:

  1. .csv file
  2. SQLite Database

To load the data, we use the SQLite database, as it makes it easier to query and visualize the data efficiently.

Here we ignore all reviews with a Score equal to 3. If the Score is above 3, we consider the review positive; otherwise it is negative.
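A minimal sketch of this loading step. It assumes the SQLite file exposes a table named `Reviews` (an assumption about the database layout); the helper name `load_reviews`, the 1/0 encoding of positive/negative, and the tiny in-memory table are made up for illustration only.

```python
import sqlite3

import pandas as pd


def load_reviews(con):
    """Read every review except the neutral ones (Score == 3) and label the rest."""
    df = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)
    # Assumption: encode Score above 3 as positive (1) and below 3 as negative (0)
    df["Score"] = (df["Score"] > 3).astype(int)
    return df


# Stand-in for the real database file: a small in-memory table with made-up rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Great coffee"), (2, 3, "It is okay"), (3, 1, "Stale chips"), (4, 4, "Tasty")],
)

reviews = load_reviews(con)
print(reviews)
```

With the real data you would pass a connection to the downloaded SQLite file instead of the in-memory one.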


DATA CLEANING: Deduplication

In machine learning, data cleaning is very important. We preprocess the data using pandas.

It was observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove the duplicates in order to get unbiased results from the analysis of the data. The following is an example:

Now we will sort the data and remove the duplicate entries:

We can see that only 69.25% of the data remains after removing the duplicates; that is, 30.75% of the original data was duplicated.
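The sort-then-deduplicate step can be sketched with pandas as follows. The choice of columns that define a duplicate (`UserId`, `ProfileName`, `Time`, `Text`) is an assumption made for this illustration, and the three rows are invented.

```python
import pandas as pd

# Made-up rows: the first two are the same review posted under two product IDs
reviews = pd.DataFrame({
    "ProductId": ["B2", "B1", "B3"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Bob"],
    "Time": [100, 100, 200],
    "Text": ["Love these wafers", "Love these wafers", "Too salty"],
})

# Sort first so that the copy that survives deduplication is deterministic
sorted_reviews = reviews.sort_values("ProductId", kind="mergesort")

# Assumption: rows are duplicates when user, profile name, time and text all match
deduped = sorted_reviews.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

kept_pct = 100.0 * len(deduped) / len(reviews)
print(f"{kept_pct:.2f}% of the rows remain")
```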

Also, in the image below, we can see two rows in which the value of HelpfulnessNumerator is greater than HelpfulnessDenominator. This is not practically possible, so these two rows are also removed from the calculations.

Next, we remove this kind of data as well:
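A small sketch of that filter, on invented rows: keep a review only when its helpful votes do not exceed its total votes.

```python
import pandas as pd

# Made-up rows; the review with Id 1 has more helpful votes than total votes
reviews = pd.DataFrame({
    "Id": [1, 2, 3],
    "HelpfulnessNumerator": [3, 1, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# Keep only the rows where the numerator is not larger than the denominator
valid = reviews[reviews["HelpfulnessNumerator"] <= reviews["HelpfulnessDenominator"]]
print(valid)
```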

After removing the data:

But here is the problem: our dataset has Text and Summary columns. These are text features, but in machine learning we can only use numerical features to build a model. So the question is, how do we convert text features into numerical vectors?

Natural Language Processing: we use some of the techniques of Natural Language Processing to convert text into numerical vectors.

[3] Preprocessing

[3.1]. Preprocessing Review Text

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

  1. Begin by removing the html tags
  2. Remove any punctuation or a limited set of special characters such as , or . or #
  3. Check that the word is made up of English letters and is not alphanumeric
  4. Check that the length of the word is greater than 2 (it was found that there are no two-letter adjectives)
  5. Convert the word to lowercase
  6. Remove stop words
  7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming). Ex: taste and tasteful both reduce to the base form tast after stemming (tasty becomes the closely related tasti).
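The seven steps above can be sketched as one function. This is only an illustration under a few assumptions: the function name `clean_review` and the tiny stop-word list are made up (a real run would use a fuller list), and stemming uses NLTK's `SnowballStemmer` when NLTK is installed, falling back to no stemming otherwise.

```python
import re

try:
    from nltk.stem.snowball import SnowballStemmer
    stem = SnowballStemmer("english").stem
except ImportError:
    # Assumption: if NLTK is not installed, skip stemming rather than fail
    def stem(word):
        return word

# Tiny illustrative stop-word list; "not" is deliberately left out
STOP_WORDS = {"this", "that", "the", "and", "for", "was", "are", "with", "has", "have"}


def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)         # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. drop punctuation and special characters
    words = []
    for word in text.split():
        if not word.isalpha():                   # 3. letters only, nothing alphanumeric
            continue
        if len(word) <= 2:                       # 4. keep words longer than 2 letters
            continue
        word = word.lower()                      # 5. lowercase
        if word in STOP_WORDS:                   # 6. remove stop words
            continue
        words.append(stem(word))                 # 7. Snowball stemming
    return " ".join(words)


cleaned = clean_review("<br />This is a GOOD dog food!! My dog ate it in 2 days.")
print(cleaned)
```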

After this, we collect the words used to describe positive and negative reviews.

Load the data:

Now, we will test the data and remove the special characters:

Now that the cleaning works, we will combine all of these steps and apply them to the review text and the Summary.

Here are some techniques to find the vector representation of a given text:

  • Bag of Words
  • Average Word2vec
  • Average tf–idf Word2vec

We will use Bag of Words:

I’ll take a popular example to explain Bag-of-Words (BoW).

We all love watching movies (to varying degrees). I tend to always look at the reviews of a movie before I commit to watching it. I know a lot of you do the same! So, I’ll use this example here.

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

Bag of Words (BoW) Model

The Bag of Words (BoW) model is the simplest way to represent text as numbers. As the name suggests, we can represent a sentence as a bag-of-words vector (a string of numbers).

Let’s recall the three movie reviews we saw earlier:

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.

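We then take each of these words, in the order listed, and count how many times it occurs in each review:

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]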
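As a sanity check, here is a small pure-Python sketch that builds the vocabulary in order of first appearance and counts the words of each review. The helper name `to_vector` is made up for this illustration.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Vocabulary in order of first appearance across the three reviews
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)


def to_vector(review):
    """Count how often each vocabulary word occurs in the review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]


vectors = [to_vector(review) for review in reviews]
print(vocabulary)
for vector in vectors:
    print(vector)
```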

Now that you know what BoW is, we will apply it to our data.

Next, we get some of the feature names, the shape of the BoW matrix, and the number of unique words.

Bi-grams, tri-grams and n-grams

1. Avoid removing stop words like “not” before building n-grams, since a bi-gram such as “not good” carries the negation.
2. count_vect = CountVectorizer(ngram_range=(1,2)) builds both unigrams and bi-grams. Please do read the CountVectorizer documentation.

3. You can choose values such as min_df=10 and max_features=5000 as you see fit.

Now, it is time to apply Naive Bayes on BoW

  • Use the preprocessed review text, converted into vectors using BoW.
  • Find the best hyperparameter, that is, the one that gives the maximum AUC value.
  • Consider a wide range of alpha values for hyperparameter tuning, starting as low as 0.00001.
  • Find the best hyperparameter using k-fold cross-validation or a simple cross-validation split.
  • Use GridSearchCV or RandomizedSearchCV, or write your own for loops to do the hyperparameter tuning.
  • Find the top 10 features of the positive class and the top 10 features of the negative class for both feature sets, using the absolute values of the `coef_` parameter of MultinomialNB (or `feature_log_prob_` in newer scikit-learn versions, which no longer expose `coef_`), and print their corresponding feature names.

Feature engineering

  • To improve the performance of your model, you can also experiment with feature engineering, for example:
  • Taking the length of each review as another feature.

We will see:
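One possible way to add review length as an extra feature is to stack it next to the BoW matrix with `scipy.sparse.hstack`. This is a sketch on two invented reviews; measuring length in words (rather than characters) is an assumption of this example.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["great taste", "arrived stale and the bag was torn open"]
bow = CountVectorizer().fit_transform(corpus)

# Review length in words, as one extra column (non-negative, so it still
# satisfies the input requirements of Multinomial Naive Bayes)
lengths = csr_matrix([[len(text.split())] for text in corpus])

features = hstack([bow, lengths]).tocsr()
print(bow.shape, "->", features.shape)
```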

  • Considering some features from the review summary as well.
  • When plotting AUC against the hyperparameter, put the alpha values on the x-axis. Since they span a wide range, apply a log function to the alpha values before plotting them.
  • Once you have found the best hyperparameter, train your model with it, find the AUC on the test data, and plot the ROC curve for both train and test.
  • Along with the ROC curve, we will plot the confusion matrix as a seaborn heatmap.
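The evaluation steps can be sketched like this. The labels and predicted probabilities are invented stand-ins for real model output, the 0.5 threshold and base-10 log are choices made for this example, and `plot_results` is a hypothetical helper that imports matplotlib and seaborn only when it is called.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical test labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

fpr, tpr, _ = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
cm = confusion_matrix(y_true, (y_prob >= 0.5).astype(int))  # rows: actual, columns: predicted

# Log-scaled alpha values for the x-axis of an AUC-versus-alpha plot
alphas = np.array([1e-5, 1e-3, 1e-1, 1e1])
log_alphas = np.log10(alphas)


def plot_results(fpr, tpr, cm):
    """Draw the ROC curve and the confusion-matrix heatmap side by side."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax_roc, ax_cm) = plt.subplots(1, 2, figsize=(10, 4))
    ax_roc.plot(fpr, tpr)
    ax_roc.set_xlabel("False positive rate")
    ax_roc.set_ylabel("True positive rate")
    sns.heatmap(cm, annot=True, fmt="d", ax=ax_cm)
    ax_cm.set_xlabel("Predicted")
    ax_cm.set_ylabel("Actual")
    plt.show()


print("AUC:", auc)
print(cm)
```

Calling `plot_results(fpr, tpr, cm)` draws both figures; the same function can be called once with train predictions and once with test predictions.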

Output:

Now we will see the confusion matrix for the train data:

Now we will see the confusion matrix for the test data:

Now we will see the top 10 words for the positive and negative classes:

Now we will apply Multinomial Naive Bayes on BoW:

We will get a graph like:

And the ROC curve will be:

The confusion matrix for the train data will be:

The confusion matrix for the test data will be:

Now we will see the top 10 words of both the positive and negative classes:

This was a short introduction to text featurization with Natural Language Processing, using the real-world Amazon Fine Food Reviews dataset.

Thank you so much.