The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

  1. Number of reviews: 568,454

Attribute Information:

  1. Id
  • We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

Loading the data

The dataset is available in two forms

  1. .csv file

To load the data, we use the SQLite version, as it is easier to query and visualize the data efficiently that way.

Here we ignore all reviews with a Score equal to 3. If the Score is above 3, we consider the review positive; otherwise, it is negative.

Loading the data
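As a minimal sketch of this step, the snippet below reads the reviews through pandas and relabels the Score column. The table name `Reviews`, the file name `database.sqlite` and the helper `load_labeled_reviews` are assumptions made for illustration, and a tiny in-memory table stands in for the real file.

```python
import sqlite3
import pandas as pd

def load_labeled_reviews(con):
    """Load reviews whose Score is not 3 and turn Score into a polarity label."""
    # The table name "Reviews" is an assumption about the SQLite file.
    df = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)
    # Scores 4 and 5 become "positive"; scores 1 and 2 become "negative".
    df["Score"] = df["Score"].apply(lambda s: "positive" if s > 3 else "negative")
    return df

# In-memory stand-in for the real data; with the actual dataset you would use
# something like sqlite3.connect("database.sqlite") (file name assumed).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Great taste"), (2, 3, "It is okay"), (3, 1, "Stale"), (4, 4, "Good")],
)
filtered = load_labeled_reviews(con)
print(filtered)
```

The neutral review (Score 3) is dropped by the SQL query itself, so it never reaches the labeling step.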

The Output will be

DATA CLEANING : Deduplication

Data cleaning is very important in machine learning. We preprocess the data using Pandas.

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:

Now we will sort the data and remove the duplicate entries.
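One way to do this with Pandas is sketched below on a toy frame. The column names `ProductId`, `UserId`, `ProfileName` and `Time`, and the choice of which columns define a duplicate, are assumptions for illustration.

```python
import pandas as pd

# Toy data: the first two rows are the same review posted under two products.
reviews = pd.DataFrame({
    "ProductId": ["B2", "B1", "B3"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Bob"],
    "Time": [100, 100, 200],
    "Text": ["Tasty wafers", "Tasty wafers", "Too sweet"],
})

# Sort first so that "keep the first duplicate" is deterministic.
sorted_reviews = reviews.sort_values("ProductId", kind="mergesort")

# Treat rows with the same user, profile name, timestamp and text as duplicates.
deduped = sorted_reviews.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

# Share of the data that survives deduplication, in percent.
remaining_pct = 100.0 * len(deduped) / len(reviews)
print(deduped)
print(f"{remaining_pct:.2f}% of the rows remain")
```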

We can see that only 69.25% of the data remains after removing the duplicates; in other words, 30.75% of our original data was duplicated.

Also, in the two rows shown in the image below, the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible, so these two rows are removed from the calculations as well.

Next, we remove this kind of data too:
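A minimal sketch of this filter, using a toy frame in place of the real reviews:

```python
import pandas as pd

reviews = pd.DataFrame({
    "Id": [1, 2, 3],
    "HelpfulnessNumerator": [3, 1, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# Keep only rows where the numerator does not exceed the denominator.
valid = reviews[reviews["HelpfulnessNumerator"] <= reviews["HelpfulnessDenominator"]]
print(valid)
```

Row 1 is dropped because its numerator (3) is larger than its denominator (1).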

After removing these rows:

But here is the problem: our dataset has Text and Summary columns. These are text features, but in machine learning we want to use only numerical features to build a model. So the question is: how do we convert text features into numerical vectors?

Natural Language Processing: we use some of the techniques of Natural Language Processing to convert text to numerical vectors.

[3] Preprocessing

[3.1]. Preprocessing Review Text

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase, we do the following, in the order below:

  1. Begin by removing the HTML tags

After that, we collect the words used to describe positive and negative reviews.

Load the data:

Now we will test the data and remove the special characters:
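A minimal sketch of these two cleaning steps with regular expressions follows. The helper names `clean_html`, `clean_special` and `preprocess` are hypothetical, and treating everything except letters as a "special character" is an assumption.

```python
import re

def clean_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def clean_special(text):
    """Replace every run of non-letter characters with a single space."""
    return re.sub(r"[^A-Za-z]+", " ", text).strip()

def preprocess(text):
    # Order matters: strip tags first, otherwise "<b>" would leave a stray "b".
    return clean_special(clean_html(text))

sample = "I <b>love</b> these chips!!!<br />Will buy again."
cleaned = preprocess(sample)
print(cleaned)
```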

Now that this is working, we will combine all the sentences from the review text and the Summary.

Here are some techniques for finding the vector representation of a given text:

Bag of Words



Average Word2vec

Average tf–idf Word2vec

We will use Bag of Words:

I’ll take a popular example to explain Bag-of-Words (BoW)

We all love watching movies (to varying degrees). I tend to always look at the reviews of a movie before I commit to watching it. I know a lot of you do the same! So, I’ll use this example here.

  • Review 1: This movie is very scary and long
  1. BoW, which stands for Bag of Words

Bag of Words (BoW) Model

The Bag of Words (BoW) model is the simplest way to represent text as numbers. As the name suggests, we represent a sentence as a bag-of-words vector (a string of numbers).

Let’s recall the three types of movie reviews we saw earlier:

  • Review 1: This movie is very scary and long

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
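To make the counting concrete, here is a small sketch in plain Python that builds a vocabulary in first-seen order and counts each word per review. Review 1 is taken from the example; the wording of Reviews 2 and 3 is an assumption, chosen so that the vocabulary matches the 11 words listed above.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",    # Review 1, from the example
    "This movie is not scary and is slow",  # assumed wording
    "This movie is spooky and good",        # assumed wording
]

# Vocabulary: every unique word, in the order it first appears.
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bow_vector(review):
    """Count how often each vocabulary word occurs in one review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(r) for r in reviews]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word, so every review ends up with the same length regardless of how many words it contains.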

Now that you know what BoW is, we will apply it to our data.

Next, we will look at some feature names, the shape of our BoW matrix and the number of unique words.

bi-gram, tri-gram and n-gram

1. Avoid removing stop words like “not” before building n-grams.
2. count_vect = CountVectorizer(ngram_range=(1,2)) builds both unigrams and bigrams; please do read the CountVectorizer documentation.
3. You can choose values such as min_df=10 and max_features=5000 as you see fit.
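The sketch below shows why “not” matters once bigrams are included. The two sentences are made up, and `min_df=1` is used only because the toy corpus is far too small for `min_df=10`.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this tea is not good", "this tea is good"]

# Unigrams only: both sentences contain "good", and "not" is just another word.
unigram = CountVectorizer(ngram_range=(1, 1)).fit(corpus)

# Unigrams and bigrams: "not good" becomes a feature of its own.
bigram = CountVectorizer(ngram_range=(1, 2), min_df=1, max_features=5000).fit(corpus)

print(sorted(unigram.vocabulary_))
print(sorted(bigram.vocabulary_))
```

If “not” had been removed as a stop word beforehand, the two sentences would produce identical features.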

Now it's time to apply Naive Bayes on BoW

  • Review text, preprocessed and converted into vectors using BoW
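A minimal sketch of fitting Naive Bayes on BoW vectors follows. The tiny labeled corpus (1 = positive, 0 = negative) and the choice of `alpha=1.0` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great tasty snack", "love this tea",
               "awful stale chips", "terrible bitter taste"]
train_labels = [1, 1, 0, 0]
test_texts = ["tasty tea", "stale bitter chips"]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
count_vect = CountVectorizer()
x_train = count_vect.fit_transform(train_texts)
x_test = count_vect.transform(test_texts)

# alpha is the Laplace smoothing strength.
clf = MultinomialNB(alpha=1.0)
clf.fit(x_train, train_labels)

predictions = clf.predict(x_test)
print(predictions)
```

Fitting the vectorizer on the training split alone keeps information from the test reviews out of the vocabulary.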

Feature engineering

  • To improve the performance of your model, you can also experiment with feature engineering, such as:

We will try:

  • Considering some features from the review summary as well.
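One possible way to bring the summary in, sketched here as an assumption rather than the exact approach used, is to vectorize the review text and the summary separately and stack the two matrices side by side.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the Text and Summary columns.
texts = ["the chips were fresh and crunchy", "the tea arrived stale"]
summaries = ["great snack", "disappointing"]

# Separate vocabularies, so a word in the summary is a different feature
# from the same word in the review body.
text_vect = CountVectorizer()
summary_vect = CountVectorizer()
x_text = text_vect.fit_transform(texts)
x_summary = summary_vect.fit_transform(summaries)

# Column-wise concatenation: each row now holds text and summary features.
x_combined = hstack([x_text, x_summary]).tocsr()
print(x_combined.shape)
```

A simpler alternative is to concatenate the two strings before vectorizing, at the cost of no longer distinguishing summary words from body words.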

Output:

Now we will see the confusion matrix for the train data:

Now we will see the confusion matrix for the test data:
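Both matrices can be computed with scikit-learn as in the sketch below; the toy corpus and labels (1 = positive, 0 = negative) are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great tasty snack", "love this tea",
               "awful stale chips", "terrible bitter taste"]
train_labels = [1, 1, 0, 0]
test_texts = ["tasty tea", "stale bitter chips"]
test_labels = [1, 0]

count_vect = CountVectorizer()
x_train = count_vect.fit_transform(train_texts)
x_test = count_vect.transform(test_texts)
clf = MultinomialNB().fit(x_train, train_labels)

# Rows are the true classes and columns the predicted classes, in the order [0, 1].
train_cm = confusion_matrix(train_labels, clf.predict(x_train), labels=[0, 1])
test_cm = confusion_matrix(test_labels, clf.predict(x_test), labels=[0, 1])
print(train_cm)
print(test_cm)
```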

Now we will see the top 10 words for the positive and negative classes:

Now we will apply Multinomial Naive Bayes on BoW:

We will get a graph like:

And the ROC curve will be:
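The points of a ROC curve and the area under it can be computed from the predicted probabilities, as in the sketch below. The toy corpus and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great tasty snack", "love this tea",
               "awful stale chips", "terrible bitter taste"]
train_labels = [1, 1, 0, 0]
test_texts = ["tasty tea", "stale bitter chips", "love this snack", "awful taste"]
test_labels = [1, 0, 1, 0]

count_vect = CountVectorizer()
x_train = count_vect.fit_transform(train_texts)
x_test = count_vect.transform(test_texts)
clf = MultinomialNB().fit(x_train, train_labels)

# Probability of the positive class (column 1) for each test review.
scores = clf.predict_proba(x_test)[:, 1]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(test_labels, scores)
auc = roc_auc_score(test_labels, scores)
print("AUC:", auc)
```

Plotting `fpr` against `tpr`, for example with matplotlib, gives the ROC curve itself.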

The confusion matrix for the train data will be:

The confusion matrix for the test data will be:

Now we will see the top 10 words of both the positive and negative classes:

This is a small introduction to text featurization with Natural Language Processing, using the real-world Amazon Food Reviews dataset.

Thank you so much.


