Retrieval-based Intent Classification in Chatbots 3/4


Welcome back to the final part of intent classification in chatbots! In the previous article, we went over the dataset and preprocessed it so that we could use it here.

Steps to Build Intent Classification in Chatbots (Continued…)

So now we can move on directly from where we left off.

1. TF-IDF Vectorization

Now that we have the vocabulary and the corpus of possible user queries, let's see what a single test query looks like after the TF-IDF transform:

print(Tfd.transform([test_data['Test user queries'][5]]))
Figure: TF-IDF of the test query

The result is a sparse row matrix, with one row for each input text. A sparse matrix is one in which most elements are zero, so only the few non-zero entries need to be stored.
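To make the idea of a sparse row concrete, here is a minimal sketch on a toy corpus. It assumes that Tfd is a scikit-learn TfidfVectorizer fitted in the previous article; the corpus, the query, and the names toy_corpus, toy_vectorizer, and toy_row are made up for illustration and are not part of the original dataset.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the training queries (hypothetical data, not the article's dataset).
toy_corpus = [
    "how do I reset my password",
    "where is my order",
    "cancel my subscription",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

# Transform one unseen query; words outside the vocabulary ("please") are ignored.
toy_row = toy_vectorizer.transform(["reset password please"])

print(issparse(toy_row))  # whether the result is a SciPy sparse matrix
print(toy_row.shape)      # one row, one column per vocabulary term
print(toy_row.nnz)        # how many non-zero entries are actually stored
print(toy_row)            # lists only the non-zero (row, column) entries
```

Only the columns for words that occur in the query carry a value, which is why the printed matrix is so short compared with the size of the vocabulary.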

2. Determine data similarity with Cosine Similarity

This is the secret sauce that measures how similar two pieces of text are.

In data mining, a similarity measure is usually expressed as a distance between data objects in a space whose dimensions represent the features of those objects.

The smaller this distance, the more similar the two objects are; the larger the distance, the less similar they are.

Some popular similarity measures are:

  • Euclidean Distance.
  • Manhattan Distance.
  • Jaccard Similarity.
  • Minkowski Distance.
  • Cosine Similarity.
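As a rough illustration of the distance-based measures in this list, here is a small pure-Python sketch. The function names and the sample vectors are our own, chosen only for this example; in practice you would normally rely on a library implementation.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    """Manhattan distance is the Minkowski distance with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)

def jaccard(a, b):
    """Jaccard similarity: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

x = [3.0, 4.0]
y = [0.0, 0.0]
print(manhattan(x, y))
print(euclidean(x, y))

# Jaccard works on sets, for example the words of two queries.
print(jaccard("reset my password".split(), "change my password".split()))
```

Note that the first three are distances (smaller means more similar), while Jaccard is a similarity (larger means more similar).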

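Cosine similarity is a metric that measures how similar two data objects are, regardless of their magnitude.

Using cosine similarity, we can measure the similarity between two sentences in Python.

Under cosine similarity, each data object in a dataset is treated as a vector, and the similarity is the cosine of the angle between two vectors.

Formula: cos(x, y) = (x . y) / (||x|| * ||y||)

Here x . y is the dot product of the two vectors, and ||x|| is the length (Euclidean norm) of x.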
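The formula can be checked by hand with a few lines of plain Python. This is only a sketch to show the arithmetic; the helper name cosine_sim and the sample vectors are assumptions made for this example, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_x * norm_y)

short_doc = [1, 2, 0]
long_doc = [2, 4, 0]   # same direction as short_doc, twice the magnitude
unrelated = [0, 0, 5]  # shares no non-zero dimension with short_doc

print(cosine_sim(short_doc, long_doc))   # close to 1: same direction
print(cosine_sim(short_doc, unrelated))  # 0: nothing in common
```

Doubling every component leaves the result unchanged, which is what "regardless of their magnitude" means. Because of floating-point rounding, the value for two vectors pointing the same way can differ from 1.0 by a tiny amount, so it is safer to compare with a tolerance such as math.isclose.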

from sklearn.metrics.pairwise import cosine_similarity
sorted(cosine_similarity(Tfd.transform([test_data['Test user queries'][5]]),Tfd_train)[0])[-5:]

We get:

Figure: Top-5 cosine similarity values

3. Combining TF-IDF and Cosine Similarity

Now we can combine the two steps: transform each test query with TF-IDF, then compute its cosine similarity against the training queries. Go over the logic carefully:

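import math

cosine_val = []  # top-3 (similarity, row index) pairs for each test query
result = []      # one encoded output line per test query
for i, query in enumerate(test_data['Test user queries']):
  sug = str(i) + ","
  sim_arr = cosine_similarity(Tfd.transform([query]), Tfd_train)[0]  # similarity to every training row
  # pair each score with its row index and keep the three best matches
  cosine_val.append(sorted(zip(sim_arr, range(len(sim_arr))), reverse=True)[:3])
  top_score, top_ix = cosine_val[i][0]
  if top_score == 0.0:
    sug += '2'  # nothing in common with any training query
  elif math.isclose(top_score, 1.0):  # tolerance guards against floating-point rounding
    sug += str(top_ix)  # direct match: emit the matched row index
  else:
    sug += "1,"  # suggestions follow
    for score, ix in cosine_val[i]:
      if score > .5:
        sug += str(ix) + ','
    sug = sug[:-1]  # drop the trailing comma
  print(sug)
  result.append(sug)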

For each test query, the output is a comma-separated line:

  • the first number gives the ID of the test query.
  • the second number is 2 if the highest cosine similarity is zero, i.e., the query matches none of the user queries.
  • if the highest cosine similarity is 1, there is a direct match, and the second number is the ID of the matched query.
  • otherwise the second number is 1, followed by the IDs of those top-3 matches whose cosine similarity is above 0.5 (possibly none).
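To see how such a line can be read back, here is a small sketch of a decoder. The function name decode_suggestion, the status labels, and the sample lines are hypothetical and only serve to illustrate the format described above.

```python
def decode_suggestion(line):
    """Split one output line such as "7,1,12,40" into its parts."""
    fields = line.split(",")
    query_id = int(fields[0])
    code = fields[1]
    if code == "2":
        # highest similarity was zero: nothing to suggest
        return {"query_id": query_id, "status": "no_match", "matches": []}
    if code == "1":
        # suggestions: the remaining fields are row indices (there may be none)
        matches = [int(field) for field in fields[2:]]
        return {"query_id": query_id, "status": "suggestions", "matches": matches}
    # anything else is read as the row index of a direct match
    return {"query_id": query_id, "status": "exact", "matches": [int(code)]}

print(decode_suggestion("0,2"))
print(decode_suggestion("3,1,12,40"))
print(decode_suggestion("4,57"))
```

One caveat of this encoding: a direct match on row 1 or row 2 is indistinguishable from the status codes 1 and 2. This sketch, like the code in the next step, reads those values as status codes.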

If we run the above, we get the following output:

Figure: the IDs of the suggestions

4. Fetching original IDs

However, these IDs are row indices, not IDs from the original dataset, because we split the queries in the variations column into multiple rows.

So we need to fetch the actual IDs from the original dataset:

  • keep every other line unchanged.
  • if the second field is "1" (i.e., suggestions), replace the row indices with the real intent IDs.

res_final = []
for each in result:
  tmp = each.split(",")
  if tmp[1] == '1':  # suggestions: map row indices back to original intent IDs
    temp_list = []
    for suggestion in tmp[2:]:
      intent_id = df["id"][int(suggestion)]
      if intent_id not in temp_list:  # skip intents already suggested, keeping rank order
        print(df["intent"][int(suggestion)])
        temp_list.append(intent_id)
    print(tmp[:2] + temp_list)
    res_final.append(",".join(str(x) for x in tmp[:2] + temp_list))
  else:
    res_final.append(each)  # no match or direct match: keep the line as it is
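The same mapping can be tried without the dataset. In the sketch below, row_to_intent_id is a hypothetical stand-in for df["id"] (position is the expanded row index, value is the original intent ID), and to_original_ids is a helper name of our own.

```python
# Hypothetical stand-in for df["id"]: three variations of intent 10,
# two of intent 11, and one of intent 12.
row_to_intent_id = [10, 10, 10, 11, 11, 12]

def to_original_ids(line, id_lookup):
    """Replace row indices with original intent IDs on suggestion lines."""
    fields = line.split(",")
    if len(fields) < 2 or fields[1] != "1":
        return line  # no match or direct match: leave untouched
    intent_ids = [id_lookup[int(row)] for row in fields[2:]]
    # dict.fromkeys drops duplicates while keeping the ranking order
    unique_ids = list(dict.fromkeys(intent_ids))
    return ",".join(fields[:2] + [str(intent_id) for intent_id in unique_ids])

print(to_original_ids("0,1,0,2,4", row_to_intent_id))  # 0,1,10,11
print(to_original_ids("3,2", row_to_intent_id))        # 3,2
```

Rows 0 and 2 are both variations of intent 10, so that intent is reported only once, and the best-ranked intent stays first.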

So now if we run this:

Figure: queries of the original dataset

And we are done.

In the output above, you can see that similar queries appear together, which suggests that our program works as intended!

In the next article, we’ll take a look at Rasa, an open-source framework for building chatbots that includes intent classification.

Ending Note

If you liked reading this article and want to read more, go ahead and visit the JournalDev homepage. All the latest posts can be seen there.

Happy learning!
