Text Analytics on Yelp Reviews
Leveraging structured information like numerical and categorical variables for predictive analytics has a long history in academia. Using textual information, however, is much more recent, mostly within the last two decades, and is still evolving: existing academic studies in the social sciences mostly use non-Natural Language Processing methods, i.e., they do not fully use the power of text. In this post, we will walk through a simple workflow that helps turn textual information into numeric information for analytics.
Text analytics, also known as text mining, is the methodology and process followed to derive actionable information and insights from textual data. This involves using NLP, information retrieval, and machine learning techniques to parse unstructured text data into more structured forms and deriving patterns and insights from this data that would be helpful for the end user.
Some of the main techniques in text analytics include,
- Text classification
- Text clustering
- Text summarization
- Entity extraction and recognition
Motivation
The applications of text mining are manifold and some of the most popular ones include the following,
- Spam detection
- Sentiment analysis
- Chatbots
- Ad placements
- Social media analysis
In this post, we will analyze the Yelp dataset and predict the sentiment associated with a review from its text, i.e., perform sentiment analysis on the review text.
Methodology
A simple text analytics pipeline for supervised classification can be visualized in the workflow given below,
The primary blocks in the above workflow include,
- Text Pre-Processing and Normalization
- Feature Extraction
- Supervised Machine Learning Algorithm
Text Pre-Processing and Normalization
Raw, unstructured text is rarely well formatted. Text pre-processing involves deploying a variety of techniques to convert raw text into well-defined sequences with a standard structure and notation. Some of the pre-processing techniques that can be explored to standardize the text data include,
- Stop words and special character removal
- Tokenization
- Stemming
- Lemmatization
Leveraging some of these techniques will help improve the quality of the inputs being fed into the feature extraction block. This text cleaning step is essential for real-world problems, although we skip it in this post; a minimal sketch of what it might look like is shown below for reference.
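The snippet below is an illustrative sketch of such a cleaning step using NLTK (a common choice, though not the library used later in this post); the example sentence and the exact combination of steps are purely hypothetical and should be adapted to the problem at hand.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# one-time download of the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def normalize(text):
    # lowercase and strip special characters
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # tokenize, drop stop words, and lemmatize the remaining tokens
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)
print(normalize("The waiters were AMAZING... best brunch we have had in years!!"))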
Feature Extraction
The feature extraction block helps convert the standardized text into numeric/categorical features that can be used for learning by the supervised learning models. This process is also called Vectorization as we convert every document into a feature vector to be fed into the supervised classification models. Some standard techniques that are usually deployed for vectorization are,
1) Bag of Words Model
2) Term Frequency – Inverse Document Frequency Model
3) Advanced Word Vectorization models (using Google’s word2vec algorithm)
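As a quick illustration of the first two approaches, the sketch below vectorizes a tiny toy corpus (hypothetical example sentences) with scikit-learn's CountVectorizer and TfidfVectorizer; we will use CountVectorizer on the actual Yelp reviews later in this post.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
toy_corpus = ["the food was great", "the service was awful", "great food and great service"]
# bag of words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(toy_corpus).toarray())
print(bow.get_feature_names())
# TF-IDF: counts reweighted to down-weight terms that are common across documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(toy_corpus).toarray().round(2))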
Supervised Learning
From prior literature, there are some supervised learning algorithms that tend to perform better for text classification problems. These algorithms include,
- Multinomial Naïve Bayes
- Support Vector Machines
- Neural Nets
Other techniques like logistic regression, decision trees, random forests and gradient boosting can be explored. However, the success of these algorithms has usually been restricted to problems involving structured data.
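To make this concrete, here is a small sketch (on a hypothetical toy corpus) showing how different classifiers can be slotted into the same vectorize-then-classify pipeline; we fit Multinomial Naive Bayes on the actual Yelp data later in this post.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
docs = ["great food and friendly staff", "terrible service, never again",
        "loved the brunch here", "worst experience ever"]
labels = [5, 1, 5, 1]
for clf in (MultinomialNB(), LinearSVC()):
    # the same vectorization front end, with a different estimator plugged in
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(docs, labels)
    print(type(clf).__name__, model.predict(["friendly staff and a great brunch"]))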
Data Overview
This project uses a small subset of the data from Kaggle’s Yelp Business Rating Prediction competition.
Description of the data:
- yelp.csv contains the dataset.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The stars column is the number of stars (1 through 5) assigned by the reviewer to the business; a higher star rating is better. In other words, it is the rating of the business by the person who wrote the review.
- The text column is the text of the review.
- The cool/useful/funny fields represent the number of cool, useful, and funny votes the review received from other users.
Let us read yelp.csv into a Pandas DataFrame and examine it.
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# read the Yelp reviews CSV into a DataFrame
path = 'yelp.csv'
yelp = pd.read_csv(path)
The shape of the dataset can be examined using the shape attribute.
yelp.shape
(10000, 10)
We notice that there are 10,000 reviews in the dataset. We add a field for the length of the text field, as we want to see whether there is any relation between review sentiment and review length. We can now examine a sample of the dataset using the head() function.
# add a feature for the character length of each review
yelp['text length'] = yelp['text'].apply(len)
yelp.head(5)
| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny | text length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 | 889 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 | 1345 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 | 76 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 | 419 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 | 469 |
Data Exploration
import seaborn as sns
import matplotlib.pyplot as plt
We will use the Seaborn package for visualization in Python. The FacetGrid function can be used to plot histograms and visualize whether there is any relationship between review length and review sentiment.
# histograms of review length, one panel per star rating
g = sns.FacetGrid(data=yelp, col='stars')
g.map(plt.hist, 'text length', bins=50)
plt.show()
The distribution of text length looks similar across all five ratings. However, the count of reviews is skewed heavily towards the 4-star and 5-star ratings.
# box plots of review length by star rating
box = sns.boxplot(x='stars', y='text length', data=yelp)
plt.show()
The boxplot shows that reviews with a lower star rating (i.e., 1 and 2) tend to have a higher median length compared to the higher-star reviews. We can infer that when people are unhappy or want to express a negative sentiment, they tend to write more content in their reviews. So a shorter review is not necessarily a bad indicator, as it could simply mean a user with positive sentiment used fewer words to describe their experience.
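As a quick numeric check of this observation, we can compute the median review length per star rating (a small sketch reusing the yelp DataFrame from above).
# median review length for each star rating
print(yelp.groupby('stars')['text length'].median())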
Data Filtration
We will filter the dataset to contain only the 5-star and 1-star reviews, as they represent polar opposite sentiments and allow us to frame the problem as a simple binary classification rather than the multinomial classification problem we would otherwise have.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
yelp_best_worst.shape
(4086, 11)
# examine the class distribution
yelp_best_worst.stars.value_counts().sort_index()
1 749
5 3337
Name: stars, dtype: int64
The dataset that we will be using for model building and validation has class imbalance as seen from the class distribution of reviews.
Feature Creation
We define X and y from the new dataframe and then split them into training and testing sets, using the review text as the only feature and the star rating as the response. We perform the train/test split prior to vectorization because we want the training document-term matrix to contain terms only from the training set. If we first built a document-term matrix and then split that matrix, the training document-term matrix would also contain terms that appear only in the test set. This is an important step to keep in mind, as it might otherwise bias our results in real-world scenarios.
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
(3064,)
(1022,)
We now use CountVectorizer to create document-term matrices from X_train and X_test. In this step, the reviews are vectorized using CountVectorizer() to be subsequently fed into the supervised classification model.
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)
# only transform X_test
X_test_dtm = vect.transform(X_test)
# examine the shapes
print(X_train_dtm.shape)
print(X_test_dtm.shape)
(3064, 16825)
(1022, 16825)
Model Building and Validation
We fit a Multinomial Naive Bayes model using the training document-term matrix as features and the review rating (1 or 5) as the target variable. We then predict the star rating for the test document-term matrix, and then calculate the accuracy and print the confusion matrix.
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# print the number of features that were generated
print('Features: ', X_train_dtm.shape[1])
# use Multinomial Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
# print the accuracy of its predictions
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
# print the confusion matrix
print('Confusion Matrix: ')
print(metrics.confusion_matrix(y_test, y_pred_class))
Features: 16825
Accuracy: 0.918786692759
Confusion Matrix:
[[126 58]
[ 25 813]]
To benchmark our model, we build a baseline model and compute the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class (i.e., a review rating of 5). The improvement over this baseline is a truer representation of how good our model is.
# null accuracy: the accuracy achieved by predicting every record to be a 5-star rating
y_test.value_counts().head(1)/y_test.shape
5 0.819961
Name: stars, dtype: float64
The Multinomial Naive Bayes model, with an accuracy of about 91.9%, shows a substantial improvement over the baseline model, which has an accuracy of about 82.0%.
In our classification problem, Naive Bayes treats the 5-star rating as the positive class and the 1-star rating as the negative class. The errors made by the model in a binary classification problem can be of two types: false positives and false negatives.
False positive: the actual label is negative but the predicted label is positive. In our scenario, predicting an actual bad rating (1 star) as a good rating (5 stars) makes the observation a false positive.
False negative: the actual label is positive but the predicted label is negative. In our scenario, predicting an actual good rating (5 stars) as a bad rating (1 star) makes the observation a false negative.
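As a quick check of these definitions against our results, the sketch below unpacks the confusion matrix printed above (reusing y_test and y_pred_class), with the 5-star rating treated as the positive class.
# rows are actual classes (1, 5); columns are predicted classes (1, 5)
cm = metrics.confusion_matrix(y_test, y_pred_class, labels=[1, 5])
tn, fp, fn, tp = cm.ravel()
print('False positives (actual 1-star, predicted 5-star):', fp)
print('False negatives (actual 5-star, predicted 1-star):', fn)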
Findings and Insights
We can review the records which are false positives and false negatives to analyze where the model is making mistakes. This intuition can help improve the accuracy in subsequent model iterations.
# filter out a sample of false positives (actual 1-star, predicted 5-star)
X_test[y_test < y_pred_class].sample(10, random_state=6)
2175 This has to be the worst restaurant in terms o...
1899 Buca Di Beppo is literally, italian restaurant...
6222 My mother always told me, if I didn't have any...
8833 The owner has changed hands & this place isn't...
8000 Still a place that is unacceptable in my book-...
943 Don't waste your time...Arrowhead mall on the ...
7631 this is a business located in the fry's grocer...
3755 Have been going to LGO since 2003 and have alw...
9299 The salad plates were not chilled... As they u...
9984 Went last night to Whore Foods to get basics t...
Name: text, dtype: object
# filter out a sample of false negatives (actual 5-star, predicted 1-star)
X_test[y_test > y_pred_class].sample(10, random_state=6)
2494 What a great surprise stumbling across this ba...
2504 I've passed by prestige nails in walmart 100s ...
3448 I was there last week with my sisters and whil...
6050 I went to sears today to check on a layaway th...
2475 This place is so great! I am a nanny and had t...
6318 Since I have ranted recently on poor customer ...
7148 I now consider myself an Arizonian. If you dri...
763 Here's the deal. I said I was done with OT, bu...
5565 I`ve had work done by this shop a few times th...
402 Once again Wildflower proves why it's my favor...
Name: text, dtype: object
Hypotheses on the mistakes made by the model:
1) The Naive Bayes classifier assumes that features are independent given the target variable, so features that are correlated can result in misclassification. This is one of the primary reasons for the errors here: sarcastic reviews, for instance, cannot be detected under the feature-independence assumption. Another example is double negation, where two negative tokens combine to express something positive; this interaction between features leads to false classifications because the model assumes feature independence.
2) Naive Bayes does not work well when there is class imbalance. The data we trained on is strongly imbalanced, which may be another reason for misclassification.
3) We can also notice a tendency for negative reviews to be longer and more detailed, yet a quick examination shows that some of the misclassified positive reviews are also fairly long. Including review length as a feature alongside the document-term matrix could further improve accuracy; a rough sketch of this idea follows the list below.
4) Naive Bayes also has a tendency to make extreme classifications, with predicted probabilities close to zero or one. Some reviews are too close to call, and Naive Bayes misclassifies them as a result of this property.
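As a rough illustration of the third hypothesis, one way to add review length is to append it as an extra column to the sparse document-term matrix; the sketch below reuses X_train, X_test, the document-term matrices, and the imports defined earlier, and whether this actually helps would need to be verified on this data.
from scipy.sparse import csr_matrix, hstack
# character length of each review as a single-column sparse matrix
train_len = csr_matrix(X_train.str.len().values.reshape(-1, 1))
test_len = csr_matrix(X_test.str.len().values.reshape(-1, 1))
# append the length column to the document-term matrices
X_train_aug = hstack([X_train_dtm, train_len])
X_test_aug = hstack([X_test_dtm, test_len])
nb_len = MultinomialNB().fit(X_train_aug, y_train)
print('Accuracy with text length:', metrics.accuracy_score(y_test, nb_len.predict(X_test_aug)))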
We can further get some intuition on the sentiment analysis by looking at the top tokens present in positive and negative reviews. We calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.
# the tokens in the training vocabulary
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)
16825
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[ 26., 4., 1., ..., 0., 0., 0.],
[ 39., 5., 0., ..., 1., 1., 1.]])
# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2, 16825)
# number of times each token appears across all one star reviews
one_star_token_count = nb.feature_count_[0, :]
one_star_token_count
array([ 26., 4., 1., ..., 0., 0., 0.])
# number of times each token appears across all five star reviews
five_star_token_count = nb.feature_count_[1, :]
five_star_token_count
array([ 39., 5., 0., ..., 1., 1., 1.])
# create a DataFrame of tokens with their separate bad review and good review counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
# examine a random sample of tokens
tokens.sample(5, random_state=3)
| token | five_star | one_star |
|---|---|---|
| amazed | 12.0 | 0.0 |
| polytechnic | 1.0 | 0.0 |
| sheared | 0.0 | 1.0 |
| impersonal | 0.0 | 1.0 |
| sane | 0.0 | 1.0 |
# add 1 to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1
tokens.sample(5, random_state=3)
| token | five_star | one_star |
|---|---|---|
| amazed | 13.0 | 1.0 |
| polytechnic | 2.0 | 1.0 |
| sheared | 1.0 | 2.0 |
| impersonal | 1.0 | 2.0 |
| sane | 1.0 | 2.0 |
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([ 565., 2499.])
We convert the count of each token into a frequency by dividing it by the number of reviews in the respective class (five star or one star), as stored in nb.class_count_. This normalization lets us compare how predictive a token is of each class on a common scale.
# convert the counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]
tokens.sample(5, random_state=3)
| token | five_star | one_star |
|---|---|---|
| amazed | 0.005202 | 0.00177 |
| polytechnic | 0.000800 | 0.00177 |
| sheared | 0.000400 | 0.00354 |
| impersonal | 0.000400 | 0.00354 |
| sane | 0.000400 | 0.00354 |
# calculate the ratio of five-star frequency to one-star frequency for each token
tokens['fivestar_to_onestar_ratio'] = tokens.five_star / tokens.one_star
The top 10 tokens that help predict five-star reviews can be seen below. Words like fantastic, perfect, yum, favorite, and outstanding, which have a positive connotation, tend to be more useful in predicting five-star reviews.
tokens.sort_values('fivestar_to_onestar_ratio', ascending=False).head(10)
| token | five_star | one_star | fivestar_to_onestar_ratio |
|---|---|---|---|
| fantastic | 0.077231 | 0.003540 | 21.817727 |
| perfect | 0.098039 | 0.005310 | 18.464052 |
| yum | 0.024810 | 0.001770 | 14.017607 |
| favorite | 0.138055 | 0.012389 | 11.143029 |
| outstanding | 0.019608 | 0.001770 | 11.078431 |
| brunch | 0.016807 | 0.001770 | 9.495798 |
| gem | 0.016006 | 0.001770 | 9.043617 |
| mozzarella | 0.015606 | 0.001770 | 8.817527 |
| pasty | 0.015606 | 0.001770 | 8.817527 |
| amazing | 0.185274 | 0.021239 | 8.723323 |
The top 10 tokens that help predict one-star reviews can be seen below. Words like refused, disgusting, and filthy, which have a negative connotation, tend to be more useful in predicting one-star reviews. The most useful token for predicting a one-star review is staffperson, which indicates that many people who give a poor rating are driven by poor customer service from the staff.
tokens['onestar_to_fivestar_ratio'] = tokens.one_star / tokens.five_star
tokens.sort_values('onestar_to_fivestar_ratio', ascending=False).head(10)
| token | five_star | one_star | fivestar_to_onestar_ratio | onestar_to_fivestar_ratio |
|---|---|---|---|---|
| staffperson | 0.0004 | 0.030088 | 0.013299 | 75.191150 |
| refused | 0.0004 | 0.024779 | 0.016149 | 61.922124 |
| disgusting | 0.0008 | 0.042478 | 0.018841 | 53.076106 |
| filthy | 0.0004 | 0.019469 | 0.020554 | 48.653097 |
| unacceptable | 0.0004 | 0.015929 | 0.025121 | 39.807080 |
| acknowledge | 0.0004 | 0.015929 | 0.025121 | 39.807080 |
| unprofessional | 0.0004 | 0.015929 | 0.025121 | 39.807080 |
| ugh | 0.0008 | 0.030088 | 0.026599 | 37.595575 |
| yuck | 0.0008 | 0.028319 | 0.028261 | 35.384071 |
| fuse | 0.0004 | 0.014159 | 0.028261 | 35.384071 |
Summary
In this post, we have looked at the different stages involved in a text classification workflow. We performed sentiment analysis on the Yelp dataset, predicting the sentiment of a review from the review text using a Multinomial Naive Bayes model. We also looked at vectorization, which converts raw unstructured text into features suitable for machine learning models.