Introduction

SimilarVocabulary is an open source project of mine on GitHub. The project itself is not complicated: it uses word vectors, a standard NLP technique, to retrieve words that are closely associated with a given word. It is built on spaCy, an open source NLP library that provides trained models for processing English text. Project address:

https://github.com/wotchin/SimilarVocabulary

Details of the code

Let's walk through this simple demo to see how to use the spaCy library and how the code finds similar words.

Loading the model:

    import spacy

    nlp = spacy.load('en_core_web_lg')

en_core_web_lg is a trained model provided for spaCy. The model is very large and has to be downloaded separately. The project includes an initialization script, init.sh, for this.

Get input text

This code reads the input text (mainly a list of words) and then preprocesses it to generate tokens:

    # `f` is the already-opened input file; `words` and `line` must be
    # initialised before this loop.
    while True:
        if line != "":
            words += line.replace("\n", " ")
            line = f.read()
        else:
            break

    nlp = spacy.load('en_core_web_lg')
    print("model loaded.")
    tokens = nlp(words)
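The reading step above boils down to turning a text file into one space-separated string of words. A minimal sketch of that step as a standalone function follows. The name `load_words` and the file layout (words separated by whitespace or newlines) are assumptions made for illustration, not taken from the project. Unlike the original loop, it also collapses repeated whitespace.

```python
from pathlib import Path


def load_words(path):
    """Read a text file and return its words joined by single spaces."""
    text = Path(path).read_text(encoding="utf-8")
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines, tabs and repeated spaces all collapse.
    return " ".join(text.split())
```

The returned string can then be passed to `nlp(...)` in the same way as `words` in the original code.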

Set a threshold value

The principle behind similar-word retrieval with this library is roughly as follows. Using the trained model, we compare the "distance" between the word vectors of two words. The threshold is the limit on that "distance" (in the code, a minimum similarity score); by setting it, we get the output we want: semantically similar words. Let's take a quick look at how the threshold is set:
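To make the idea of a vector "distance" and a threshold concrete, here is a small pure-Python sketch. It assumes cosine similarity as the measure, which is what spaCy documents as its default for `similarity`. The three-dimensional vectors and the names `cosine_similarity`, `toy_vectors` and `similar_to` are hypothetical and only for illustration; real model vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def similar_to(word, vectors, threshold):
    """Return the other words whose similarity to `word` meets the threshold."""
    target = vectors[word]
    return [
        other
        for other, vec in vectors.items()
        if other != word and cosine_similarity(target, vec) >= threshold
    ]
```

With these toy values, `similar_to("dog", toy_vectors, 0.8)` keeps "cat" and drops "car", because only the first pair of vectors points in nearly the same direction.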

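    threshold = 0.0
    # Keep asking until a positive threshold is entered.
    while threshold <= 0.0:
        try:
            threshold = float(input("input threshold value:"))
        except ValueError:
            threshold = 0.0

    # Number of results to return; falls back to 3 on invalid input.
    length = 3
    try:
        length = int(input("input result length:"))
    except ValueError:
        length = 3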

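The code is simple: it asks for a threshold, repeating the prompt until a positive number is entered, and then for the number of similar words to return, which falls back to 3 if the input is not a valid integer.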
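The same validation can be separated from the prompting so that it is easy to check in isolation. The following is only a sketch: the helper names `parse_positive_float` and `parse_int` are hypothetical and do not appear in the project.

```python
def parse_positive_float(text, default=0.0):
    """Return `text` as a float if it is a positive number, else `default`."""
    try:
        value = float(text)
    except ValueError:
        return default
    # Rejects zero, negatives and NaN (NaN > 0.0 is False).
    return value if value > 0.0 else default


def parse_int(text, default=3):
    """Return `text` as an int, or `default` if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return default
```

A caller would wrap `parse_positive_float(input(...))` in a loop that repeats while the result is not positive, and call `parse_int(input(...))` once for the result length.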

Implementing comparison queries

The logic I implemented is this: given a corpus that contains the vast majority of commonly used English words, look through it for words whose meaning is similar to the word we supply, then return the n best results. The code for this section is:

    while True:
        queue = []  # e.g. [['dog', 0.1], ['cat', 0.2], ...]
        i = input("input your word:")
        if i != "":
            txt = nlp(i)
            for token in tokens:
                score = token.similarity(txt)
                # family_check is a helper defined elsewhere in the project
                if score >= threshold and family_check(txt.text.strip(), token.text.strip()) < 0:
                    if len(queue) >= length:
                        # Queue is full: find the entry with the lowest score.
                        index = 0
                        value = 1.0
                        for j in range(0, len(queue)):
                            if queue[j][1] < value:
                                value = queue[j][1]
                                index = j
                        # Replace it only if the new word scores higher.
                        if value < score:
                            queue[index] = [token.text, score]
                    else:
                        queue.append([token.text, score])
            print(queue)
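The manual queue above keeps the `length` best-scoring words by repeatedly evicting the weakest entry. As an alternative sketch, not the project's own code, the same selection can be written with the standard library's `heapq.nlargest`. The function name `top_matches` and the input format (an iterable of `(word, score)` pairs) are assumptions, and the project's `family_check` filter is left out here.

```python
import heapq


def top_matches(scored_words, threshold, length):
    """Return up to `length` (word, score) pairs with score >= threshold,
    ordered from highest to lowest score."""
    candidates = [(word, score) for word, score in scored_words if score >= threshold]
    return heapq.nlargest(length, candidates, key=lambda pair: pair[1])
```

One difference from the original loop is that the result comes back sorted by score, whereas the queue keeps words in the order they were inserted or replaced.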

Usage

What is the point of finding similar words like this? Consider quiz apps, such as the popular live online quiz apps. When writing a question, what if you have only one option and need to add three more distractor options? A similar-word generator can produce closely related words: for dog, it might return pig, cat, cook. It can also be used for associative search in a search engine. Of course, there are more uses than these; the rest is up to your imagination.