Welcome back to the final part of intent classification in chatbots! In the previous article, we went over the dataset and performed the preprocessing needed to use it.
Steps to Build Intent Classification in Chatbots (Continued…)
So now we can move on directly from where we left off.
1. TF-IDF Vectorization
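With the vocabulary and the corpus of possible user queries in place, we can look at what a single test query becomes once the fitted vectorizer transforms it:
# Row 0 is an arbitrary example; any row of the test data works
print(Tfd.transform([test_data['Test user queries'].iloc[0]]))
The output is a sparse row matrix, with one row per input text. A sparse matrix is one in which very few elements are non-zero, which fits TF-IDF vectors well because each query contains only a handful of the words in the vocabulary.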
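To make the sparse output concrete, here is a minimal sketch on a toy corpus. The `train_queries` list is a hypothetical stand-in for the training data from the previous article; `Tfd` and `Tfd_train` simply reuse the names from this series.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the training queries (not the article's dataset).
train_queries = [
    "how do i reset my password",
    "where is my order",
    "cancel my order",
]

Tfd = TfidfVectorizer()
Tfd_train = Tfd.fit_transform(train_queries)  # learn vocabulary and IDF weights

# Transform one unseen query; words outside the vocabulary are ignored.
query_vec = Tfd.transform(["cancel order please"])

print(issparse(query_vec))  # the result is a SciPy sparse matrix
print(query_vec.shape)      # (1, size of the vocabulary)
print(query_vec.nnz)        # number of stored non-zero entries
print(query_vec)            # each entry is listed with its (row, column) position
```

Only "cancel" and "order" appear in the toy vocabulary, so just two entries of the row are non-zero; every other column is implicitly zero and takes up no storage.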
2. Determine data similarity with Cosine Similarity
This is the secret sauce that measures how similar two pieces of text are.
In data mining, similarity is usually expressed through a distance between data objects, where each dimension of the space represents one feature of the object.
The smaller the distance, the higher the degree of similarity; the larger the distance, the lower the degree of similarity.
Some popular measures of similarity are:
- Euclidean Distance.
- Manhattan Distance.
- Jaccard Similarity.
- Minkowski Distance.
- Cosine Similarity.
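For a feel of how the first four measures behave, here is a minimal pure-Python sketch. The function names and toy inputs are assumptions chosen for illustration; libraries such as SciPy ship their own implementations.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalization: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def jaccard(a, b):
    # Size of the intersection divided by size of the union of two sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

print(euclidean(x, y))     # 5.0
print(manhattan(x, y))     # 7.0
print(minkowski(x, y, 2))  # 5.0, same as Euclidean
print(jaccard("reset my password".split(), "change my password".split()))  # 0.5
```

Note that the first three are distances (smaller means more similar), while Jaccard is a similarity (larger means more similar).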
Cosine similarity is a metric that measures how similar two data objects are, regardless of their magnitude.
Using cosine similarity, we can measure the similarity between two sentences in Python.
Under cosine similarity, each data object in a dataset is treated as a vector.
Formula: Cos(x, y) = (x . y) / (||x|| * ||y||), where x . y is the dot product of the two vectors and ||x|| is the Euclidean length of x.
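The formula can be worked through by hand in a few lines. This is a minimal sketch with toy vectors; the `cosine` helper is a name assumed for illustration, and returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))     # x . y
    norm_x = math.sqrt(sum(a * a for a in x))  # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))  # ||y||
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_x * norm_y)

x = [1.0, 2.0, 0.0]
y = [2.0, 4.0, 0.0]  # same direction as x, twice the length
z = [0.0, 0.0, 3.0]  # shares no non-zero dimension with x

print(cosine(x, y))  # approximately 1: magnitude does not matter
print(cosine(x, z))  # 0.0: nothing in common
```

Because only the angle between the vectors counts, a short query and a long one that use the same words in the same proportions still score close to 1.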
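from sklearn.metrics.pairwise import cosine_similarity

# Similarity between one test query (row 0, an arbitrary example) and every training query
sims = cosine_similarity(Tfd.transform([test_data['Test user queries'].iloc[0]]), Tfd_train)[0]
# The five highest similarity scores, in ascending order
sorted(sims)[-5:]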
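A similarity score is only useful for intent matching if we also know which training query produced it. Here is a minimal sketch of keeping scores and row indices together; `top_matches` is a hypothetical helper and the scores are made up for illustration.

```python
import heapq

def top_matches(similarities, k=3):
    """Return the k best (similarity, index) pairs, highest similarity first."""
    # Equivalent to sorted(zip(...), reverse=True)[:k], without sorting everything.
    return heapq.nlargest(k, zip(similarities, range(len(similarities))))

sims = [0.10, 0.85, 0.00, 0.62, 0.40]  # made-up similarity scores
print(top_matches(sims))  # [(0.85, 1), (0.62, 3), (0.4, 4)]
```

Pairing each score with its index before ranking means the row number survives the sort, so it can later be mapped back to an intent.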
3. Combining TF-IDF and Cosine Similarity
So now we can combine both the TF-IDF conversion of the test query and finding the Cosine similarity. Go over the logic carefully:
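cosine_val = []
result = []
for i, query in enumerate(test_data['Test user queries']):
    sug = str(i) + ","
    # Similarity of this query to every training query (1-D array)
    sim_arr = cosine_similarity(Tfd.transform([query]), Tfd_train)[0]
    tmp_ix = [x for x in range(len(sim_arr))]
    # Keep the three best (similarity, row index) pairs, best first
    cosine_val.append(sorted(zip(sim_arr, tmp_ix), reverse=True)[:3])
    if cosine_val[i][0][0] == 0.0:
        # Best score is zero: no match at all
        sug += '2'
    elif cosine_val[i][0][0] == 1.0:
        # Exact match: report the row index of the matched query
        sug += str(cosine_val[i][0][1])
    else:
        # Partial match: list the row indices scoring above 0.5
        sug += "1,"
        for tupple in cosine_val[i]:
            if tupple[0] > .5:
                sug += str(tupple[1]) + ','
        sug = sug[:-1]  # drop the trailing comma
    print(sug)
    result.append(sug)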
For each test query, the output is as follows:
- the first number gives the ID of the test query.
- the second number is 2 if nothing matches, i.e. the highest cosine similarity is zero.
- the second number is 1 if the best match is only partial (cosine similarity strictly between 0 and 1); it is followed by the IDs of the top matches, out of at most three, whose cosine similarity is greater than 0.5.
- if the cosine similarity is exactly 1, there is a direct match and the second number is the ID of the matched query.
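The encoding rules above can be collected into one small function. This is a minimal sketch rather than the article's exact code: `encode_result` and its `threshold` parameter are names assumed for illustration, and `math.isclose` stands in for an exact comparison with 1.0, since floating-point rounding can leave an identical query a hair away from 1.

```python
import math

def encode_result(query_id, top, threshold=0.5):
    """Encode one result. `top` holds (similarity, row index) pairs, best first."""
    best_sim, best_ix = top[0]
    if best_sim == 0.0:
        return f"{query_id},2"          # no match at all
    if math.isclose(best_sim, 1.0):
        return f"{query_id},{best_ix}"  # direct match: report its row index
    # Partial match: flag with 1, then list the rows scoring above the threshold.
    suggestions = [str(ix) for sim, ix in top if sim > threshold]
    return ",".join([str(query_id), "1"] + suggestions)

print(encode_result(0, [(0.0, 5), (0.0, 4), (0.0, 3)]))  # 0,2
print(encode_result(1, [(1.0, 7), (0.4, 2), (0.1, 1)]))  # 1,7
print(encode_result(2, [(0.8, 3), (0.6, 9), (0.2, 1)]))  # 2,1,3,9
```

One thing the sketch makes visible: a direct match on row 1 or row 2 produces the same second number as the "suggestions" or "no match" flags, so a sturdier format might keep the flag and the matched row in separate fields.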
If we run the above, we get the following output:
4. Fetching original IDs
However, these IDs are not the ones from the original dataset, since we split the queries in the variations column into multiple rows.
So we need to fetch the actual IDs from the original dataset:
- keep all other results unchanged
- if the second number is “1” (i.e., suggestions follow), fetch the real intent IDs.
res_final = []
for each in result:
    if each.split(",")[1] == '1':
        # Suggestions: map each row index back to its original intent ID
        tmp = each.split(",")
        temp_list = []
        an_list = []
        for suggestion in tmp[2:]:
            if df["id"][int(suggestion)] not in temp_list:
                print(df["intent"][int(suggestion)])
                temp_list.append(df["id"][int(suggestion)])
        for item in list(set(temp_list)):
            an_list.append(item)
        print(tmp[:2] + an_list)
        res_final.append(",".join(str(x) for x in tmp[:2] + an_list))
    else:
        # No match or direct match: keep the result as it is
        res_final.append(each)
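One detail worth knowing about the deduplication step: `list(set(...))` does not guarantee any particular order, so the ranking of the suggestions can be lost. Here is a minimal sketch of an order-preserving alternative, assuming a plain list `row_to_id` as a hypothetical stand-in for the `df["id"]` column.

```python
def original_ids(suggestions, row_to_id):
    """Map expanded-row indices to original IDs, dropping duplicates in order."""
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(row_to_id[int(s)] for s in suggestions))

# Hypothetical mapping: rows 0-2 are variations of intent 10, rows 3-4 of intent 11.
row_to_id = [10, 10, 10, 11, 11]

encoded = "7,1,4,0,2"  # query 7, suggestions flag, then three row indices
fields = encoded.split(",")
if fields[1] == "1":
    ids = original_ids(fields[2:], row_to_id)
    final = ",".join(fields[:2] + [str(i) for i in ids])
else:
    final = encoded

print(final)  # 7,1,11,10
```

Rows 0 and 2 both belong to intent 10, so it appears only once, and intent 11 stays first because its row ranked highest.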
So now if we run this:
And we are done.
In the picture above, you can see that similar queries are occurring together, which means our program works!
In the next article, we’ll take a look at Rasa, an open-source framework for building chatbots with intent classification.