1. Number of reviews: 568,454
  2. Number of users: 256,059
  3. Number of products: 74,258
  4. Timespan: Oct 1999 — Oct 2012
  5. Number of Attributes/Columns in data: 10
    1. Id
    2. ProductId: unique identifier for the product
    3. UserId: unique identifier for the user
    4. ProfileName
    5. HelpfulnessNumerator: number of users who found the review helpful
    6. HelpfulnessDenominator: number of users who indicated whether they found the review helpful or not
    7. Score: rating between 1 and 5
    8. Time: timestamp for the review
    9. Summary: brief summary of the review
    10. Text: text of the review
  • We could use the Score/Rating to label reviews. A rating of 4 or 5 is considered a positive review and a rating of 1 or 2 a negative one. A rating of 3 is neutral and is ignored. This is only an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
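
The labelling rule above can be sketched in a few lines. This is a minimal illustration, not the exact code used for the dataset: the function name `score_to_polarity` and the sample rows are made up for the example.

```python
def score_to_polarity(score):
    """Map a 1-5 star rating to a sentiment label; 3 is neutral and dropped."""
    if score in (4, 5):
        return 1      # positive
    if score in (1, 2):
        return 0      # negative
    return None       # neutral (score == 3): ignored

# Hypothetical (Score, Summary) pairs, invented for illustration only.
reviews = [(5, "Great taste"), (1, "Stale"), (3, "Okay"), (4, "Good value"), (2, "Too sweet")]

# Keep only reviews that received a label, i.e. drop the neutral ones.
labelled = [(summary, score_to_polarity(score))
            for score, summary in reviews
            if score_to_polarity(score) is not None]
print(labelled)
```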

Loading the data

  1. .csv file
  2. SQLite Database
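
As a rough sketch of the SQLite route, the snippet below builds a tiny in-memory database and reads back only the non-neutral reviews. The table name `Reviews`, the reduced column set and the sample rows are assumptions made so the example runs on its own; with the real data you would connect to the downloaded database file instead.

```python
import sqlite3

# In-memory stand-in for the real database file (assumption for this sketch).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Reviews (Id INTEGER, ProductId TEXT, UserId TEXT, Score INTEGER, Text TEXT)"
)
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?, ?)",
    [(1, "P1", "U1", 5, "Loved it"),
     (2, "P1", "U2", 3, "It was fine"),
     (3, "P2", "U1", 1, "Would not buy again")],
)

# Skip neutral reviews (Score = 3) directly in the query.
rows = con.execute(
    "SELECT Id, Score, Text FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
con.close()
print(rows)
```

Filtering in SQL keeps the neutral reviews out of memory altogether, which helps with a dataset of this size.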

[3] Preprocessing

[3.1]. Preprocessing Review Text

  1. Begin by removing the HTML tags.
  2. Remove punctuation and a limited set of special characters such as , . and #.
  3. Check that the word is made up of English letters only and is not alphanumeric.
  4. Check that the word is longer than 2 characters (reportedly, there are no 2-letter adjectives).
  5. Convert the word to lowercase.
  6. Remove stopwords.
  7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming). For example, the words taste, tasty and tasteful all reduce to the base form tast after stemming.
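
The seven steps can be sketched as one small function. This is an illustrative version under a few assumptions: the stopword set is a tiny hand-picked sample rather than a full list, and the `stem` argument defaults to doing nothing so the example has no external dependencies. A real stemmer, such as NLTK's `SnowballStemmer("english").stem`, could be passed in its place.

```python
import re

# Small illustrative stopword set; a real run would use a fuller list.
STOPWORDS = {"the", "and", "this", "was", "for", "are", "but", "very", "with"}

def preprocess(text, stem=lambda w: w):
    """Apply steps 1-7 to one review and return the cleaned string."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. drop punctuation / special characters
    words = []
    for word in text.split():
        if not word.isalpha():                    # 3. skip tokens that contain digits
            continue
        if len(word) <= 2:                        # 4. keep only words longer than 2 letters
            continue
        word = word.lower()                       # 5. lowercase
        if word in STOPWORDS:                     # 6. remove stopwords
            continue
        words.append(stem(word))                  # 7. stem (identity unless a stemmer is given)
    return " ".join(words)

print(preprocess("This coffee was <br />very TASTY, 10/10 for the price!"))
# coffee tasty price
```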

We will use Bag of Words (BoW) to convert the review text into vectors.

Bag of Words (BoW) Model

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good
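
To make the idea concrete, here is a small sketch that builds the vocabulary and the count vectors for the three reviews above by hand. It skips the preprocessing steps (no stopword removal or stemming) so that every word stays visible.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

tokenized = [r.lower().split() for r in reviews]

# Vocabulary: every distinct word, in order of first appearance.
vocab = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocab:
            vocab.append(word)

# One count vector per review, aligned with the vocabulary.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in vectors:
    print(vec)
# ['this', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good']
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```

Each review becomes a vector of length 11 (the vocabulary size), and the 2 in the second vector reflects the word "is" appearing twice in Review 2.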

Bi-gram, tri-gram and n-gram
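
A plain BoW vector counts single words and loses their order, so "not scary" and "scary" look alike. N-grams keep short runs of consecutive words together. The helper below is a minimal sketch (the name `ngrams` is ours, not from a library):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not scary".split()
print(ngrams(tokens, 2))  # bi-grams
# ['this movie', 'movie is', 'is not', 'not scary']
print(ngrams(tokens, 3))  # tri-grams
# ['this movie is', 'movie is not', 'is not scary']
```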

  • Convert the preprocessed review text into vectors using BoW.
  • Find the best hyperparameter (alpha), i.e. the one that gives the maximum AUC value.
  • Consider a wide range of alpha values for hyperparameter tuning, starting as low as 0.00001.
  • Find the best hyperparameter using k-fold cross-validation or a simple cross-validation (hold-out) set.
  • Use GridSearchCV or RandomizedSearchCV, or write your own for loops, to do the hyperparameter tuning.
  • Find the top 10 features of the positive class and the top 10 features of the negative class for both feature sets, using the absolute values of the `coef_` parameter of MultinomialNB, and print their corresponding feature names. (In recent scikit-learn versions `coef_` is no longer available on MultinomialNB; `feature_log_prob_` holds the per-class log probabilities.)
  • To improve the performance of your model, you can also experiment with feature engineering, such as:
      • taking the length of each review as another feature;
      • adding some features from the review summary as well.
  • When plotting AUC against alpha, put the alpha values on the X-axis. Since they span a wide range, plot log(alpha) rather than the raw values.
  • Once you have found the best hyperparameter, train your model with it, find the AUC on the test data, and plot the ROC curve for both train and test data.
  • Along with the ROC curve, plot the confusion matrix as a seaborn heatmap.
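
The "write your own for loops" option can be sketched without any library. The code below is a toy illustration under stated assumptions: the tokenized reviews and labels are invented, the Naive Bayes and AUC helpers are our own minimal versions, and a single hold-out set stands in for cross-validation. In practice, scikit-learn's MultinomialNB together with GridSearchCV would do the same job on the real BoW vectors.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha):
    """Fit multinomial Naive Bayes with additive smoothing alpha (alpha > 0)."""
    vocab = {w for d in docs for w in d}
    model = {}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d)
        denom = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(len(class_docs) / len(docs))
        log_like = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def positive_score(model, doc):
    """Log-odds of the positive class; words unseen in training are skipped."""
    s = {c: prior + sum(like[w] for w in doc if w in like)
         for c, (prior, like) in model.items()}
    return s[1] - s[0]

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: tokenized reviews with 1 = positive, 0 = negative.
train_docs = [["great", "taste"], ["love", "this", "tea"], ["tasty", "and", "fresh"],
              ["bad", "taste"], ["stale", "and", "bland"], ["awful", "smell"]]
train_y = [1, 1, 1, 0, 0, 0]
cv_docs = [["great", "tea"], ["fresh", "taste"], ["bland", "smell"], ["awful", "taste"]]
cv_y = [1, 1, 0, 0]

alphas = [1e-5, 1e-3, 1e-1, 1, 10, 100]   # wide range, starting at 0.00001
results = {}
for alpha in alphas:
    model = train_nb(train_docs, train_y, alpha)
    scores = [positive_score(model, d) for d in cv_docs]
    results[alpha] = auc(scores, cv_y)

best_alpha = max(results, key=results.get)
for alpha in alphas:
    print(f"log10(alpha)={math.log10(alpha):+.0f}  CV AUC={results[alpha]:.3f}")
print("best alpha:", best_alpha)
```

The printed log10(alpha) values are what would go on the X-axis of the AUC plot. The model is then retrained with `best_alpha` and evaluated once on the test data.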


