Analysing Sentiment
In this tutorial we will learn how to use the SENTIMENT function to predict the overall sentiment in women's clothing reviews. 👗
What is Sentiment Analysis?
Sentiment analysis uses natural language processing (NLP) techniques to find the overall 'sentiment' of a piece of text, e.g. is it a positive 😍, negative 😢 or neutral 😶 statement?
This allows us to quickly summarise the good, the bad, and the neither from a mountain of text. For example, if we had 23,000 reviews of clothing, it would take an exceptionally long time to work out the proportions of reviews that are positive, negative or neutral. If we wanted to zoom in on the bad reviews, so we could inform our future products, we would need to label all of these reviews by hand, or perhaps just skim through and cherry-pick some examples. Nobody wants to do that.
So! Automating the process of analysing the sentiment of text using NLP can save vast amounts of time and headache, giving both quick and deep insights.
Want to know more about Sentiment Analysis?
If you'd like to learn more about the Sentiment Analysis algorithm at Infer (a variant of a Transformer), we recommend this snazzy explanation:
Using SQL-inf for Sentiment
Note that as we are currently in a beta launch, our text analysis commands have been limited to using 1000 randomly sampled rows instead of the entire dataset. This limitation will be lifted post-beta.
SQL-inf takes these state-of-the-art NLP techniques for sentiment analysis and turns them into a simple one-liner.
For this analysis, let's use the demo dataset customer_feedback, a dataset with reviews of Women's Clothing from an e-commerce store.
This dataset is pre-loaded into the Infer platform with a few starter queries to help us along.
The CSV file can be found here, or downloaded directly here. Instead of using the whole dataset, we have uploaded a random sample of 5976 rows into the platform.
We can preview the Input Data to get a quick sense of what is in the dataset, as highlighted in the image below.
Next, we run our one-liner:
SELECT * FROM customer_feedback SENTIMENT("Review Text")
and we get a result! The SENTIMENT function takes in only one column, in this case Review Text, as it is only analysing text.
We now see there are several new inf.* columns: Positive, Negative, Neutral, and prediction.
Positive, Negative, Neutral contain the probabilities of belonging to each class, and sum to one.
prediction is the sentiment with the highest probability of the three.
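To make the relationship between the three probability columns and prediction concrete, here is a minimal sketch in plain Python, outside of SQL-inf. The row values and the helper name pick_prediction are made up for illustration, and the logic is only our reading of the description above, not Infer's actual implementation.

```python
import math

def pick_prediction(row):
    """Return the class with the highest probability (hypothetical helper)."""
    classes = ("Positive", "Negative", "Neutral")
    # The three probabilities are expected to sum to one.
    total = sum(row[c] for c in classes)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return max(classes, key=lambda c: row[c])

# Made-up example row, mimicking the new inf.* columns.
example_row = {"Positive": 0.82, "Negative": 0.05, "Neutral": 0.13}
print(pick_prediction(example_row))  # Positive
```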
Getting Insights
Now that we have our sentiment analysis, what can we do to further our insights?
Auto-visualisation with Infer
The Infer platform will automatically begin to visualise the Sentiment Analysis for you. The simplest way to visualise the data is to look at the totals and the ratios of positive, negative, and neutral statements in your text. We show this in a bar chart, as shown below:
You can then click on one of these bar charts to begin exploring the relationship between the predicted sentiment and the other variables. You can do this by clicking the column names directly, or by using the arrows in the chart.
In the video below, we explore the relationship between Rating and the prediction by clicking on the relevant columns.
It's quite clear that higher ratings are associated with the positive predictions, and low ratings with negative predictions, just as we would expect!
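The totals and ratios behind a chart like this are simple to compute by hand. As a rough sketch in plain Python (not SQL-inf, and not how the platform itself draws the chart), assuming the prediction column has been exported as a list of labels; the sample values and the name sentiment_summary are made up:

```python
from collections import Counter

def sentiment_summary(predictions):
    """Count each predicted sentiment and compute its share of the total."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Made-up predictions standing in for the prediction column.
sample = ["Positive", "Positive", "Negative", "Neutral", "Positive"]
for label, (n, share) in sentiment_summary(sample).items():
    print(f"{label}: {n} ({share:.0%})")
```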
Deeper Analysis
Filtering
We can find deeper insights by probing the data via SQL-inf directly.
For example, if we want to look at all negative reviews we can append a WHERE clause to filter by Negative predictions.
SELECT prediction, "Review Text", Title, Rating FROM customer_feedback SENTIMENT("Review Text") WHERE prediction='Negative'
Combining with EXPLAIN
Now that we have a label for each of our reviews, we can actually probe further - what drives negative reviews?
Is Age a factor? Maybe the type (Department Name) of clothing makes an impact? Surely the Rating is a good indicator?
SELECT Age, "Department Name", "Division Name", "Clothing ID", Rating, inf.* FROM (SELECT * FROM customer_feedback SENTIMENT("Review Text")) EXPLAIN(PREDICT(prediction))
Yes, in fact, Rating is the strongest indicator of the sentiment of a review - that makes a lot of sense!
Good ratings generally go with positive reviews, and poor ratings with negative reviews.
We can also see that the model is ~90% accurate at predicting Negative sentiment, ~75% accurate at predicting Positive, and ~44% accurate at predicting Neutral (a more difficult class!).
This means our model is considerably better than random guessing, which would give 1/3, or ~33%, accuracy across three classes.
So our model is definitely learning some real patterns here.
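If you'd like to see what a per-class accuracy and the 1/3 baseline look like in code, here is a small sketch in plain Python. It assumes that "accuracy at predicting a class" means the fraction of rows of that class that the model labelled correctly; the tutorial does not spell out the exact metric Infer reports, so treat this as one plausible reading. The labels and the name per_class_accuracy are made up.

```python
from collections import defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of rows of each true class that the model labelled correctly."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels: `truth` plays the role of the SENTIMENT prediction,
# `guess` the output of a model trained to predict it from other columns.
truth = ["Negative", "Negative", "Positive", "Positive", "Neutral", "Neutral"]
guess = ["Negative", "Negative", "Positive", "Neutral", "Neutral", "Positive"]

scores = per_class_accuracy(truth, guess)
baseline = 1 / 3  # uniform random guessing over three classes
for label, score in scores.items():
    print(f"{label}: {score:.2f} (baseline {baseline:.2f})")
```

Note that the 1/3 baseline assumes uniform guessing; if one class dominates the data, always guessing that class would score higher overall.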
We can drop Rating to see how that impacts our predictive power:
SELECT Age, "Department Name", "Division Name", "Clothing ID", inf.* FROM (SELECT * FROM customer_feedback SENTIMENT("Review Text")) EXPLAIN(PREDICT(prediction))
Our accuracy has fallen to ~50%, a substantial drop. Now clothing_id is the most important factor.
This is interesting because it indicates that particular items of clothing tend to get similar (positive) reviews, almost like a 'bestseller'.
This gives us some intuition for how important each of the features is for sentiment, and how good they are at predicting it.
Finally, we can run our previous command without EXPLAIN to get a clearer picture of what is going on.
SELECT Age, "Department Name", "Division Name", "Clothing ID", Rating, inf.* FROM (SELECT * FROM customer_feedback SENTIMENT("Review Text")) PREDICT(prediction)
If we run the above, we can select the columns for Positive + Rating, and Positive + clothing_id to get the relationships below:
We can see easily that:
- high ratings give positive sentiment
- a banded structure (popular, not popular, unpopular) appears for clothing ID.
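One way to look for this kind of banding outside the platform is to average the Positive probability per item and rank the results. The sketch below is plain Python on made-up rows; the function name mean_positive_by_item and the sample values are ours, and it assumes the query results have been exported as a list of dictionaries keyed by column name.

```python
from collections import defaultdict
from statistics import mean

def mean_positive_by_item(rows):
    """Average the Positive probability for each Clothing ID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Clothing ID"]].append(row["Positive"])
    return {item: mean(values) for item, values in grouped.items()}

# Made-up rows standing in for exported query results.
rows = [
    {"Clothing ID": 101, "Positive": 0.875},
    {"Clothing ID": 101, "Positive": 0.625},
    {"Clothing ID": 202, "Positive": 0.25},
    {"Clothing ID": 202, "Positive": 0.5},
]

# Rank items from most to least positively reviewed.
ranked = sorted(mean_positive_by_item(rows).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Items that cluster near the top or bottom of such a ranking would correspond to the popular and unpopular bands.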
Good job on your first deep text analysis!
Trying other Sentiment Models
By default, Infer uses a sentiment model trained on Amazon Reviews. This works best on review-like text, but may not work as well on, say, Twitter text ('tweets').
If you have some tweets to examine, or you find the default model a bit lacklustre, you can try out the previous experiment with different model versions.
To do this, simply state the version you'd like to use (twitter, amazon_reviews):
SELECT * FROM customer_feedback SENTIMENT("Review Text", version='twitter')
Topic Analysis
We've performed Sentiment Analysis on our text data, and even used other commands in the Infer library to find deep insights on how other variables affect the sentiment of a review.
But what about the content of the text? What did people say in the reviews? For that, we move to Topic Modelling...
Please continue to the next chapter to learn all about Topic Modelling within the Infer platform.