Finding Topics
Hello again! This is the second part of a tutorial on analysing text data, specifically women's clothing reviews. If you haven't completed the first part yet, we highly recommend you do that first. If you have, great! Let's continue.
In this tutorial we will learn how to use the TOPICS function to understand the common themes or 'topics' in the reviews, using our SQL-inf Topic Modelling algorithm.
What is Topic Modelling?
A topic model is a natural language processing (NLP) algorithm that is used to discover common themes or 'topics' in a set of documents.
Here's a bit of intuition for how that works.
If a document is about a specific topic, e.g. Pets, then it is more likely to contain certain kinds of words like cat, dog, walkies, floofs, etc.
Likewise a topic about clothing might talk about how 'a sweater is extra cozy' or 'a skirt is tight around the waist'. A document can therefore be entirely described by its topic or a mixture of topics.
A topic model attempts to break up your documents into these topics using advanced natural language processing techniques, so that you can easily summarise the different themes prevalent in vast amounts of text.
In fancy science terms, we say that topics are latent (hidden) semantic (the meaning of words) structures. No one needs to explicitly say 'this document is about pets'; we humans infer the topic from what the text is talking about.
Topic modelling is the same process of inferring the underlying topic, but instead of using a brain we use machine learning and NLP algorithms.
Want to know more about Topic Modelling?
If you'd like to learn more about Topic Modelling, we recommend this snazzy explanation:
Using SQL-inf for Topic Modelling
Note that as we are currently in a beta launch, our text analysis commands have been limited to using 1000 randomly sampled rows instead of the entire dataset. This limitation will be lifted post-beta.
SQL-inf takes these state-of-the-art NLP techniques for topic modelling and turns them into a simple one-liner.
We use the same dataset as in the Sentiment Analysis tutorial ("Women's e-Commerce Clothing Reviews"), with table name customer_feedback.
Next, we run our one-liner:
SELECT * FROM customer_feedback TOPICS("Review Text")
and we get a result! The TOPICS function takes in only one column, in this case Review Text, as it is only analysing text.
We now see there are several new inf.* columns: segment, topic_id, topic_name, projection_x, projection_y.
There are also many more rows than were inputted. What is happening?
segment is a portion of the text. As we said earlier, documents can be a mixture of topics, and likewise each review will contain multiple topics.
If we segment the reviews into smaller chunks of text, it is easier to identify the different topics.
This is done quite simply by breaking text into segments any time we find a full stop, comma, exclamation mark, or words like 'and', 'but', and 'however'.
So, 'I love this dress. It fits me perfectly.' will become two segments, 'I love this dress' and 'It fits me perfectly'.
Because we split each review into segments, we now have multiple output rows for each input row. Hence the large output.
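To make the segmentation rule concrete, here is a minimal Python sketch of the splitting described above. It is not SQL-inf's actual implementation: the function name split_into_segments and the exact regular expression are assumptions for illustration, built only from the delimiters listed in this tutorial.

```python
import re

# Assumed delimiters, taken from the description above: full stops, commas,
# exclamation marks, and the words "and", "but", "however".
_SPLIT_PATTERN = re.compile(r"[.,!]|\b(?:and|but|however)\b", flags=re.IGNORECASE)


def split_into_segments(review: str) -> list[str]:
    """Split a review into non-empty, whitespace-trimmed segments."""
    pieces = _SPLIT_PATTERN.split(review)
    return [piece.strip() for piece in pieces if piece.strip()]


segments = split_into_segments("I love this dress. It fits me perfectly.")
print(segments)  # ['I love this dress', 'It fits me perfectly']
```

One review goes in and a list of segments comes out, which is why the output table has more rows than the input.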
topic_id is an identifier that is ordered by size, so 1 represents the largest topic, 2 is the second largest, etc.
A topic ID of -1 means that the segment belongs to no topic.
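The size-ordered numbering can be sketched in a few lines of Python. This is only an illustration of the idea, assuming some earlier step has already given every segment a raw cluster label; the helper rank_topics_by_size and the sample labels are hypothetical and not part of SQL-inf.

```python
from collections import Counter


def rank_topics_by_size(raw_labels: list[int]) -> list[int]:
    """Relabel clusters so that 1 is the largest topic; -1 (no topic) is kept."""
    counts = Counter(label for label in raw_labels if label != -1)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranking = {
        label: rank
        for rank, (label, _) in enumerate(counts.most_common(), start=1)
    }
    return [ranking.get(label, -1) for label in raw_labels]


# Hypothetical raw cluster labels for seven segments.
print(rank_topics_by_size([7, 3, 3, -1, 3, 7, 9]))  # [2, 1, 1, -1, 1, 2, 3]
```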
topic_name shows the top four most representative words of that topic.
For example, love_recommend_highly_fun would be a topic most represented by the words love, recommend, highly, fun... a 'recommended' topic!
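As a small illustration of the naming convention, the sketch below joins the top four words with underscores and splits a name back into its words. How SQL-inf decides which words are 'most representative' is not covered here; the ranked word list and both helper names are assumptions for the example.

```python
def make_topic_name(ranked_words: list[str], top_n: int = 4) -> str:
    """Join the top_n highest-ranked words with underscores."""
    return "_".join(ranked_words[:top_n])


def words_in_topic_name(topic_name: str) -> list[str]:
    """Recover the individual words from a topic name."""
    return topic_name.split("_")


# Hypothetical ranking of words for one topic, most representative first.
name = make_topic_name(["love", "recommend", "highly", "fun", "great"])
print(name)                       # love_recommend_highly_fun
print(words_in_topic_name(name))  # ['love', 'recommend', 'highly', 'fun']
```

Splitting the name like this can be handy if you export the results and want to search topics by keyword.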
Getting Insights
Now that we have our Topic Modelling analysis, what can we do to further our insights?
Auto-visualisation with Infer
The Infer platform will automatically begin to visualise the Topic Modelling for you.
This is currently a bit limited for beta and we're working on better ways to interact with topics.
The first auto-visualisation is a 2D representation of the data. This is the under-the-hood representation of the data before creating topics. We will soon allow the user to hover over the data points and get an idea of what is in each topic.
Deeper Analysis
Filtering
We can find deeper insights by probing the data via SQL-inf directly.
For example, if we want to look at all segments in a specific topic we can append a WHERE clause to filter by e.g. topic_id=5.
SELECT * FROM customer_feedback TOPICS("Review Text") WHERE topic_id=5
This topic is clearly about height, which reviewers often give as a reference for others reading the reviews. It's interesting to see that it's the 5th most talked about topic.
If I were this e-commerce site, I'd maybe recommend making height a publicly shown attribute, if it is so useful!
Topic Clustering Level
If you find you are getting too many topics, or not enough, you can change the minimum acceptable size of a topic with min_topic_size. By default, this is set to be 2% of the size of your dataset.
SELECT * FROM customer_feedback TOPICS("Review Text", min_topic_size=5)
min_topic_size is the same optional argument as in the CLUSTER function, because they use similar methods under-the-hood.
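To get a feel for the default, here is a rough Python sketch of the '2% of the size of your dataset' rule. The tutorial does not say whether the size is counted in rows or segments, or how fractions are rounded, so the function default_min_topic_size, the row-based count, and the round-up behaviour are all assumptions.

```python
import math


def default_min_topic_size(n_rows: int, percent: int = 2) -> int:
    """Assumed default: percent of the dataset size, rounded up, at least 1."""
    return max(1, math.ceil(n_rows * percent / 100))


# With the 1000 rows sampled during the beta, 2% works out to 20.
print(default_min_topic_size(1000))  # 20
```

Under these assumptions, passing a smaller value than the default allows smaller topics to form, so you would expect more topics, and a larger value should give fewer.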
Combining with EXPLAIN
As we did in the Sentiment Analysis tutorial, we can probe further: are certain topics correlated with other columns?
Below we create a topic_of_interest variable. The value is 1 if topic_id=5, aka our 'height' topic, or 0 if anything else.
This allows us to zoom into this specific topic to see what drives it.
SELECT Age, "Department Name", "Division Name", "Clothing ID", Rating, CASE WHEN topic_id=5 THEN 1 ELSE 0 END AS topic_of_interest, inf.* FROM (SELECT * FROM customer_feedback TOPICS("Review Text")) EXPLAIN(PREDICT(topic_of_interest))
This result produces warnings! Huh. What went wrong? Well, the accuracy of the model is very low, worse than randomly guessing in fact. This indicates that the inputs do not explain the output, i.e. the height topic is not related to age, department name, division, or clothing item.
This is good! This makes sense because the topic of height shouldn't really be explained by those variables. It's just a topic that people talk about in relation to clothing reviews, and that's ok. You can try some other topics to see if you can find any correlations!
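If the CASE WHEN expression in the query above is unfamiliar, the same flag can be written as a short Python function. This only mirrors the logic of the SQL expression; the sample topic IDs are made up for the example.

```python
def topic_of_interest(topic_id: int, target_topic: int = 5) -> int:
    """Mirror of CASE WHEN topic_id=5 THEN 1 ELSE 0 END."""
    return 1 if topic_id == target_topic else 0


# Hypothetical topic IDs for five segments.
flags = [topic_of_interest(topic_id) for topic_id in [5, 1, -1, 5, 2]]
print(flags)  # [1, 0, 0, 1, 0]
```

To try another topic, change the ID in the CASE WHEN expression in the same way you would change target_topic here.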
Wrapping Up the Reviews
By using Sentiment Analysis and Topic Modelling we found that:
- The sentiment of the review is highly explained by the rating - obviously!
- We also found that clothing items fall into 'popular' and 'unpopular' categories.
- We found lots of interesting topics, including one that is about 'recommending' a product, and one about height.
- We tried to explain our 'height' topic with other variables, but found it was not correlated. BUT, this makes sense!
We could use these insights to help build better products, e.g. adding 'height' as metadata to reviews to give other readers more context!
That's it for our text analysis tutorial. Now, go out and analyse your own text using these new tools and skills!