Author: Maria Malitckaya | Translated by: VK | Source: Towards Data Science

An effective way to improve machine learning models is to use word embedding. With word embedding, you can capture the context of words in a document and then find semantic and syntactic similarities.

In this article, we will discuss an unusual application of word embedding technology: we will use OpenAPI specifications as a dataset and try to find the word embedding technique that suits them best. For reference, the OpenAPI Specification itself is documented at swagger.io/specificati…

The biggest challenge is that the OpenAPI specification is neither natural language nor code. But that also means we are free to use any available embedding model. In this experiment, we will investigate three possible candidates: Code2Vec, GloVe, and spaCy.

Code2vec is a neural model that learns analogies related to source code. The model is trained on a Java code database, but you can apply it to any code.

Next is GloVe, a commonly used algorithm for natural language processing. It was trained on Wikipedia and Gigaword.

Finally, we have spaCy. While spaCy is a recent development, the algorithm is already known for being the fastest word embedding in the world.

Let’s see which of these algorithms is better suited to the OpenAPI dataset and which one works faster on the OpenAPI specification. I’ve divided this article into six parts, each containing code examples and some hints for future use, plus a conclusion:

  1. Download the dataset

  2. Download the vocabularies

  3. Extract the field names

  4. Tokenize

  5. Create a dataset with field names

  6. Test the embeddings

  7. Conclusion

Now, we can get started.

1. Download the dataset

First, we need to download the entire APIs.guru database from apis.guru.

You will notice that most APIs.guru specifications are in the Swagger 2.0 format. However, the latest version of the OpenAPI specification is OpenAPI 3.0.

So let’s use an Unmock script to convert the entire dataset into this format! You can do this by following the instructions in the README file of the Unmock OpenAPI script: github.com/meeshkan/un…

It may take a while, but eventually, you’ll have a big data set.

2. Download the vocabularies

Code2Vec

  1. Download the model from the Code2vec GitHub page and follow the instructions in the Quick Start section.

  2. Load the model using the gensim library:

from gensim.models import KeyedVectors

# vectors_text_path points to the token-vectors text file downloaded from the Code2Vec repository.
model = KeyedVectors.load_word2vec_format(vectors_text_path, binary=False)
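
As a quick sanity check (a minimal sketch; model is the gensim KeyedVectors object loaded above, and the example word is only an illustration), you can look up a vector and query the nearest neighbours:

word = 'server'
if word in model:                                # membership test against the Code2Vec vocabulary
    print(model[word][:5])                       # first few components of the embedding
    print(model.most_similar(word, topn=5))      # nearest neighbours by cosine similarity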

GloVe

  1. Download a GloVe vocabulary from the website. We picked the biggest one because it is more likely to contain all of our words. You can choose where to download it, but it is best to store it in the working directory for convenience.

  2. Manually load the GloVe vocabulary.

import numpy as np

embeddings_dict = {}
with open("../glove/glove.6B.300d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
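
A quick way to check the result (a sketch, assuming both example words appear in the GloVe vocabulary) is to compare two embeddings with cosine similarity:

a = embeddings_dict['server']
b = embeddings_dict['host']
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)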

spaCy

Load spaCy’s vocabulary:

import spacy

nlp = spacy.load('en_core_web_lg')
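
A single vocabulary entry can then be inspected directly (a minimal sketch; en_core_web_lg ships 300-dimensional vectors, and the example word is only an illustration):

lexeme = nlp.vocab['server']
print(lexeme.has_vector)      # True if the word has a vector in en_core_web_lg
print(lexeme.vector.shape)    # (300,)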

3. Extract the field names

The full list of OpenAPI specification file names can be obtained from the scripts/fetch-list.sh file or by using the following function (for Windows):

import os
import posixpath

def getListOfFiles(dirName):
    # Recursively collect every file path under dirName.
    listOfFile = os.listdir(dirName)
    allFiles = list()
    for entry in listOfFile:
        fullPath = posixpath.join(dirName, entry)
        if posixpath.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)

    return allFiles
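
For example (the directory below is a hypothetical local path to the converted APIs.guru dump; adjust it to wherever you stored the dataset):

# Hypothetical location of the converted dataset
datasets = getListOfFiles('../openapi-directory/APIs')
print(len(datasets))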

Another big problem is extracting the field names from our OpenAPI specifications. To do this, we will use the openapi-typed library. Let’s define a get_fields function that accepts an OpenAPI specification and returns a list of field names:

from typing import Mapping, Sequence, Union, cast
from openapi_typed import Components, OpenAPIObject, Reference, Schema

def get_fields_from_schema(o: Schema) -> Sequence[str]:
    return [
        *(o['properties'].keys() if ('properties' in o) and (type(o['properties']) == type({})) else []),
        *(sum([get_fields_from_schema(schema) for schema in o['properties'].values() if not ('$ref' in schema) and (type(schema) == type({}))], []) if ('properties' in o) and (type(o['properties']) == type({})) else []),
        *(get_fields_from_schema(o['additionalProperties']) if ('additionalProperties' in o) and (type(o['additionalProperties']) == type({})) else []),
        *(get_fields_from_schema(o['items']) if ('items' in o) and (type(o['items']) == type({})) else []),
    ]

def get_fields_from_schemas(o: Mapping[str, Union[Schema, Reference]]) -> Sequence[str]:
    return sum([get_fields_from_schema(cast(Schema, maybe_schema)) for maybe_schema in o.values() if not ('$ref' in maybe_schema) and (type(maybe_schema) == type({}))], [])

def get_fields_from_components(o: Components) -> Sequence[str]:
    return [
        *(get_fields_from_schemas(o['schemas']) if 'schemas' in o else []),
    ]

def get_fields(o: OpenAPIObject) -> Sequence[str]:
    return [
        *(get_fields_from_components(o['components']) if 'components' in o else []),
    ]
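
For a single specification, get_fields can be used like this (a sketch; the file path is a hypothetical example from the downloaded dataset):

import yaml

# Hypothetical specification file; adjust the path to a real file from the dataset.
with open('../openapi-directory/APIs/example/openapi.yaml', 'r') as spec_file:
    spec = yaml.safe_load(spec_file.read())
print(get_fields(spec))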

Congratulations! Now our dataset is ready.

4. Tokenize

Field names may contain punctuation, such as the _ and - symbols, or camel-case words. We can split these words into pieces, called tokens.

The camel_case function below checks whether a word is in camel case. First, it checks for punctuation; if there is any, the word is not camel case. Then it checks for uppercase letters inside the word (excluding the first and last characters).

import string

def camel_case(example):
    # A word is camel case if it has no punctuation and at least one
    # uppercase letter that is not the first or last character.
    if any(x in example for x in string.punctuation):
        return False
    else:
        if any(list(map(str.isupper, example[1:-1]))):
            return True
        else:
            return False
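
A few quick checks (using the BodyAsJson field name that appears below as an example):

camel_case('BodyAsJson')    # True
camel_case('body_as_json')  # False: contains punctuation
camel_case('body')          # False: no inner uppercase letters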

The next function, camel_case_split, splits a camel-case word into parts. To do this, we identify the uppercase letters and mark the positions where the case changes. The function returns a list of the split words. For example, the field name BodyAsJson is converted to the list [‘Body’, ‘As’, ‘Json’].

def camel_case_split(word):
    # Mark the positions where the letter case changes and slice the word there.
    idx = list(map(str.isupper, word))
    case_change = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:
            case_change.append(i)
        elif not x and y:
            case_change.append(i+1)
    case_change.append(len(word))
    return [word[x:y] for x, y in zip(case_change, case_change[1:]) if x < y]
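
As a quick check (the expected outputs are what the function above returns for these inputs):

camel_case_split('BodyAsJson')  # ['Body', 'As', 'Json']
camel_case_split('ratelimit')   # ['ratelimit'] -- no case change, so the word is returned whole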

The camel_case_split function is then used in the following tokenization algorithm. Here, we first check for punctuation in each word and split on it. The resulting parts may themselves be camel-case words; if so, we break them into smaller pieces. Finally, after splitting, every element is converted to lowercase.

def tokenizer(mylist):
    tokenized_list = []
    for word in mylist:
        if '_' in word:
            splitted_word = word.split('_')
            for elem in splitted_word:
                if camel_case(elem):
                    elem = camel_case_split(elem)
                    for el1 in elem:
                        tokenized_list.append(el1.lower())
                else:
                    tokenized_list.append(elem.lower())
        elif '-' in word:
            hyp_word = word.split('-')
            for i in hyp_word:
                if camel_case(i):
                    i = camel_case_split(i)
                    for el2 in i:
                        tokenized_list.append(el2.lower())
                else:
                    tokenized_list.append(i.lower())
        elif camel_case(word):
            word = camel_case_split(word)
            for el in word:
                tokenized_list.append(el.lower())
        else:
            tokenized_list.append(word.lower())
    return tokenized_list
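
A short illustration of the tokenizer on a couple of field names (the output shown is what the function above produces for these inputs):

tokenizer(['BodyAsJson', 'rate_limit'])
# ['body', 'as', 'json', 'rate', 'limit']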

5. Create a dataset with field names

Now, let’s create a large data set with all the field names from the specification.

The following dict_dataset function takes a list of file names and paths and opens each specification file. For each file, the get_fields function returns a list of field names. Some field names may be duplicated within a specification; to remove the duplicates, we convert the list of field names to a dictionary and back using list(dict.fromkeys(col)), and then tokenize the list. Finally, we create a dictionary with the file name as the key and the list of field names as the value.

import yaml

def dict_dataset(datasets):
    dataset_dict = {}
    for i in datasets:
        with open(i, 'r') as foo:
            col = get_fields(yaml.safe_load(foo.read()))  # get_fields as defined in section 3
            if col:
                mylist = list(dict.fromkeys(col))
                tokenized_list = tokenizer(mylist)
                dataset_dict.update({i: tokenized_list})
            else:
                continue
    return dataset_dict
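
Putting the pieces together (a sketch; the directory path is the hypothetical location used earlier, and the variable name data matches the one used in the next section):

datasets = getListOfFiles('../openapi-directory/APIs')
data = dict_dataset(datasets)
print(len(data))  # number of specifications that contained at least one field name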

6. Test the embeddings

Code2Vec and GloVe

Now we can find the out-of-vocabulary words (unrecognized words) and calculate the percentage of these words for the Code2Vec vocabulary. The following code also applies to GloVe.

not_identified_c2v=[]
count_not_indent=[]
total_number=[]

# data is the dictionary built by dict_dataset; test1 is a list of its keys
# (specification file names) chosen for evaluation.
for ds in test1:
    count=0
    for i in data[ds]:
        if not i in model:
            not_identified_c2v.append(i)
            count+=1
    count_not_indent.append(count)
    total_number.append(len(data[ds]))

total_code2vec=sum(count_not_indent)/sum(total_number)*100
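
For GloVe, the same loop works with the embeddings_dict loaded in section 2; only the membership test changes, since the vocabulary is a plain Python dictionary (the variable names below are illustrative):

not_identified_glove = []
count_not_indent = []
total_number = []

for ds in test1:
    count = 0
    for i in data[ds]:
        if not i in embeddings_dict:   # GloVe vocabulary lookup
            not_identified_glove.append(i)
            count += 1
    count_not_indent.append(count)
    total_number.append(len(data[ds]))

total_glove = sum(count_not_indent) / sum(total_number) * 100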

spaCy

The spaCy vocabulary is different, so we need to modify the code accordingly:

not_identified_sp=[]
count_not_indent=[]
total_number=[]

for ds in test1:
    count=0
    for i in data[ds]:
        if not i in nlp.vocab:
            count+=1
            not_identified_sp.append(i)
    count_not_indent.append(count)
    total_number.append(len(data[ds]))

        
total_spacy=sum(count_not_indent)/sum(total_number)*100
Copy the code

For Code2Vec, GloVe, and spaCy, the percentages of unrecognized words were 3.39%, 2.33%, and 2.09%, respectively. Since the percentages for each algorithm are relatively small and similar, we can run another test.

First, let’s create a test dictionary with words that should be similar across all API specifications:

test_dictionary={'host': 'server', 'pragma': 'cache', 'id': 'uuid', 'user': 'client', 'limit': 'control', 'balance': 'amount', 'published': 'date', 'limit': 'dailylimit', 'ratelimit': 'rate', 'start': 'display', 'data': 'categories'}

For GloVe and Code2Vec, we can use the similar_by_vector method provided by the gensim library. spaCy hasn’t implemented this method yet, but we can find the most similar words ourselves.

To do this, we need to format the input vector for use with the distance function. For each key in the test dictionary, we check whether the corresponding value is among its 100 most similar words.

First, we format the vocabulary for the distance.cdist function, which computes the distance between the input vector and every vector in the vocabulary. We then sort the list from the smallest distance to the largest and take the first 100 words.
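
The snippet below references ids and vectors, which are not defined elsewhere in the article; here is a minimal sketch of how they can be built from the spaCy vocabulary (the exact construction is an assumption, but the names match those used in the loop):

ids = list(nlp.vocab.vectors.keys())                      # hash IDs of all words that have a vector
vectors = np.array([nlp.vocab.vectors[i] for i in ids])   # matrix of the corresponding vectors

With ids and vectors in place, the similarity check for each test pair looks like this: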

from scipy.spatial import distance

for k, v in test_dictionary.items():
    input_word = k
    p = np.array([nlp.vocab[input_word].vector])
    closest_index = distance.cdist(p, vectors)[0].argsort()[::-1][-100:]
    word_id = [ids[closest_ind] for closest_ind in closest_index]
    output_word = [nlp.vocab[i].text for i in word_id]
    #output_word
    list1=[j.lower() for j in output_word]
    mylist = list(dict.fromkeys(list1))[:50]
    count=0
    if test_dictionary[k] in mylist:
        count+=1
        print(k,count, 'yes')
    else:
        print(k, 'no')

To summarize the results: spaCy shows that the word “client” is among the top 100 words most similar to “user.” This holds for almost all OpenAPI specifications and can be used in future OpenAPI specification similarity analysis. The vector for the word “balance” is close to the vector for “amount,” which we found particularly useful for payment APIs.

Conclusion

We have tried three different word embedding algorithms on the OpenAPI specification dataset. Although all three perform reasonably well on this dataset, an additional comparison of the most similar words suggests that spaCy works better for our purposes.

spaCy is also faster than the other algorithms: its vocabulary loads about five times faster than the GloVe or Code2Vec vocabularies. However, the lack of built-in functions such as similar_by_vector and similar_word is a barrier when using this algorithm.

Also, just because spaCy worked well on our dataset doesn’t mean it will be better for every dataset in the world. So feel free to try different word embeddings on your own data, and thanks for reading!

Original article: towardsdatascience.com/word-embedd…
