Introduction

Quora is a popular knowledge-sharing platform, and I often share my thoughts there. The platform is built around a question-and-answer format and is known for its simple design and smooth user experience.

When new questions are added to Quora, they are automatically tagged by a bot based on the question's content, and the tags can later be edited by users. These tags reflect the topic categories into which questions are grouped. Here is a basic overview of the problem.

Recently I was looking for the right data set, and I came across this page on Quora: Programming Challenges. I chose the binary classification challenge called Answered. It includes nearly 10,000 questions (training set and test set combined). Each question, along with its topic tags and other information, is stored in JSON format. Below is an example of the question JSON.

Sample question JSON
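The original sample is not reproduced here, so as an illustration, here is a hypothetical record in the same shape. Only the question_text field and the topics list of name entries are confirmed by the code in this article; the question wording and tag values below are made up.

```python
import json

# Hypothetical one-line record in the shape the rest of the code expects;
# the real file stores one such JSON object per line.
line = '''{
  "question_text": "What is the best way to learn Python?",
  "topics": [{"name": "Python (programming language)"}, {"name": "Learning"}]
}'''

question = json.loads(line)
print(question['question_text'])
print([t['name'] for t in question['topics']])
```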

The task

The first task is to read the data from the JSON file. The training set contains about 9,000 questions and the test set about 1,000.

import json

# Read the data file and split it into one JSON record per line.
f = open('answered_data_10k.in').read().split('\n')
train_set = f[1:9001]
test_set = f[9002:-1]
train = [json.loads(i) for i in train_set]
test = [json.loads(i) for i in test_set]
questions = train + test

The next step is to extract the topic tags from the full data set. In the JSON records, topics are stored under the topics key. Different questions have different numbers of topic tags: the maximum on a single question is 26, and some questions have no topic tags at all.

# Create the list of unique topic names.
topic_list = []
for question in questions:
    if len(question['topics']) > 0:
        for topic in question['topics']:
            topic_list = topic_list + [topic['name']]
topic_list = list(set(topic_list))
print(len(topic_list))

There were 8,762 topic tags in the challenge.

After extracting the topic tags, we need to group the questions that share the same tags. Before starting, we analyzed the data, because clustering 8,762 topics directly would be very difficult and the quality of the clusters could not be guaranteed.

So we set a minimum number of questions per topic. There were 3,275 topics with more than one question, and exactly 900 topics with at least five questions, a relatively good number for clustering.
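The counts above come from exactly this kind of frequency analysis. As a minimal sketch of how one might compute them, assuming `questions` parsed as earlier, the helper below (a hypothetical name, not from the article) tallies topic occurrences with a Counter; the mini data set is made up for illustration.

```python
from collections import Counter

def topics_with_at_least(questions, k):
    """Count how many distinct topics appear on at least k questions."""
    counts = Counter(
        topic['name']
        for q in questions
        for topic in q.get('topics', [])
    )
    return sum(1 for c in counts.values() if c >= k)

# Hypothetical mini data set to illustrate the idea.
sample = [
    {'topics': [{'name': 'Python'}, {'name': 'Programming'}]},
    {'topics': [{'name': 'Python'}]},
    {'topics': [{'name': 'Music'}]},
]
print(topics_with_at_least(sample, 2))  # only 'Python' appears twice -> 1
```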

In the end, we decided to require at least five questions per topic, for two main reasons: first, topics with more questions get a better vector representation, and second, topics with fewer questions are mostly attached to unrelated questions.

# Assign questions to topics.
question_list = []
final_topic_list = []
for topic in topic_list:
    temp = []
    for question in questions:
        context = [i['name'] for i in question['topics']]
        if topic in context:
            temp.append(question['question_text'])
    # Keep only topics with at least five questions.
    if len(temp) >= 5:
        question_list.append(temp)
        final_topic_list.append(topic)
topic_list = final_topic_list

Next, we write a function that normalizes each paragraph by converting it to lowercase and removing punctuation and stop words. Each topic now has five or more questions under it, and we treat the collection of questions under each topic as a single document.

In this way, we iterate over the topic tags, join the questions under each tag into a paragraph, and normalize the paragraph. We then feed the paragraph and its topic tag into Gensim's TaggedDocument function for further processing.

from nltk import word_tokenize
from nltk.corpus import stopwords
from gensim.models.doc2vec import TaggedDocument

# Function for normalizing paragraphs.
def normalize(string):
    lst = word_tokenize(string)
    lst = [word.lower() for word in lst if word.isalpha()]
    lst = [w for w in lst if w not in stopwords.words('english')]
    return lst

# Aggregate the questions under each topic tag into a paragraph.
# Normalize the paragraph.
# Feed the normalized paragraph along with its topic tag into
# Gensim's TaggedDocument function.
# Append the return value to docs.
docs = []
for index, item in enumerate(topic_list):
    question = ' '.join(question_list[index])
    question = normalize(question)
    docs.append(TaggedDocument(words=question, tags=[item]))

Preparing the data for Gensim's Doc2Vec

Next we train the Doc2Vec model.

The vector_size and window should be adjusted until the result is optimal.

import gensim

model = gensim.models.Doc2Vec(vector_size=200, window=3, min_count=0, workers=4, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

Doc2Vec Training

After the Doc2Vec model was trained, we used the KMeans algorithm to cluster the document vectors. We evaluated cluster counts from 50 to 100. A number close to 100 cut large clusters into smaller ones, while a number close to 50 merged unrelated clusters into large ones. After careful evaluation of the clustering results, 60 was selected as the number of clusters.
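The article picks 60 by inspecting the clusters by eye. As one possible complement (not the author's method), a silhouette-score sweep can put a number on each candidate cluster count; `best_k` and `doc_vectors` are hypothetical names, and the synthetic blobs below stand in for the trained document vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(vectors, candidates):
    """Return the candidate cluster count with the highest silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    return max(scores, key=scores.get), scores

# Synthetic stand-in for the Doc2Vec vectors: three well-separated blobs.
rng = np.random.default_rng(0)
doc_vectors = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(30, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])
k, scores = best_k(doc_vectors, [2, 3, 4])
print(k)  # three well-separated blobs score best at k=3
```

On the real 900 document vectors the sweep would cover the 50 to 100 range the article describes, though the final call still deserves a manual look at the clusters.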

from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

kmeans_model = KMeans(n_clusters=60, init='k-means++', max_iter=100)
X = kmeans_model.fit(model.docvecs.doctag_syn0)
labels = kmeans_model.labels_.tolist()
# fit_predict refits the model and returns the cluster label of each vector.
l = kmeans_model.fit_predict(model.docvecs.doctag_syn0)

# Map each topic tag to its cluster label.
word_centroid_map = dict(zip(model.docvecs.offset2doctag, l))

# Print the topic tags in each cluster.
for cluster in range(0, 60):
    print('\nCluster %d' % cluster)
    words = []
    for i in range(0, len(word_centroid_map.values())):
        if list(word_centroid_map.values())[i] == cluster:
            words.append(list(word_centroid_map.keys())[i])
    print(words)

Fitting the KMeans model and retrieving the list of clusters

Here are some examples of clusters.

The first cluster has a fair amount of “Design, UI/UX design, Software development, Web development, and Web sites.”

The second cluster has a lot of “online advertising, marketing tools.”

The third cluster has many topics like “life, self-improvement”.

Looking at this cluster, you can see that most of the sex-related topics in the data set are about parenting and sex education. None of the other hot-button keywords appear.

This cluster is a hodgepodge of political and economic topics. At the same time, it reveals the close relationship between economics and politics, and how a region's economic factors become the main criteria for evaluating its development.

This cluster gathers film and television, art, books, music, and other forms of creative arts.

Why is there Chess in this cluster?

This one is very interesting and focuses on religion. There are three things worth noting: hypothetical questions land in this cluster; Ethics is in this cluster; and the LGBTQ topics are grouped most closely with Homosexuality and with Parenting and Children.

What is clear is that the way the clusters form reflects how topics on Quora relate to each other, and how questions are asked. Due to its uniqueness, Survey Questions forms a cluster of its own, while Politics has many related topics.

These are some of the 60 clusters that I think are worth mentioning. Feel free to share interesting clusters you found. The code for this project can be found on GitHub.