Author: Marcellus Ruben | Compiled by: VK | Source: Towards Data Science

When you hear the words “tea” and “coffee”, what comes to mind? You might say that they are both drinks with a certain amount of caffeine. The point is, we can easily recognize that the two words are related. However, when we present the words “tea” and “coffee” to a computer, it cannot recognize the connection between the two words as well as we do.

Words are not something that computers naturally understand. In order for the computer to understand the meaning behind the word, the word needs to be encoded in numeric form. This is where word embedding comes in.

Word embedding is a technique commonly used in natural language processing to convert words into vectors of numerical values. These vectors occupy an embedding space with a certain number of dimensions.

If two words have similar contexts, such as “tea” and “coffee”, they will be closer to each other in the embedding space and farther away from words with different contexts.
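To make this idea of “closeness” concrete, here is a minimal sketch. It assumes the gensim model that we will load later in this article, and “laptop” is just an arbitrary unrelated word; the exact scores depend on the embeddings you use.

# A minimal sketch of "closeness" in the embedding space, assuming the gensim
# model loaded later in this article ("laptop" is an arbitrary unrelated word).
related = model.similarity('tea', 'coffee')    # words with similar contexts
unrelated = model.similarity('tea', 'laptop')  # words with different contexts

print(related, unrelated)  # the first score should be noticeably higher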

In this article, I will show you step by step how to visualize word embeddings. Since the focus of this article is not to explain the basic theory behind word embedding in detail, you can read more about the theory in this article and in this article.

To visualize the word embeddings, we will use common dimensionality reduction techniques such as PCA and t-SNE. To map words to their vector representations in the embedding space, we use the pre-trained GloVe word embeddings.

Load the pre-trained word embedding model

Before we can visualize word embeddings, we usually need to train a model. However, training word embeddings is computationally expensive, so a pre-trained word embedding model is often used instead. Such a model contains the words in the embedding space and their associated vector representations.

GloVe, developed by Stanford researchers, is a popular pre-trained word embedding model alongside Google’s Word2vec. In this article, the pre-trained GloVe word embeddings are used, which you can download here.

Nlp.stanford.edu/projects/gl…
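The downloaded file is simply a text file in which each line holds a word followed by the components of its vector. Here is a minimal sketch of how to peek at it, assuming glove.6B.100d.txt sits in your working directory:

# Peek at the raw GloVe file: each line is a word followed by its vector
# components (100 floating-point values for the 100d file).
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    tokens = f.readline().split()

word, vector = tokens[0], [float(x) for x in tokens[1:]]
print(word, len(vector))  # a word and its 100-dimensional vector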

Meanwhile, we can use the Gensim library to load the pre-trained word embedding model. You can install the library with the pip command shown below.

pip install gensim

As a first step, we need to convert the GloVe file format into the word2vec file format. With the word2vec file format, we can use the Gensim library to load the pre-trained word embedding model into memory. Since loading the file takes some time every time this command is called, it is better to use a separate Python file for this step only.

import pickle
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec


# Path to the downloaded GloVe file and a temporary file for the word2vec format
glove_file = datapath('C:/Users/Desktop/glove.6B.100d.txt')
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")

# Convert the GloVe format to the word2vec format
glove2word2vec(glove_file, word2vec_glove_file)

# Load the converted embeddings into memory
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

# Save the loaded model with pickle so that other scripts can reuse it
filename = 'glove2word2vec_model.sav'
pickle.dump(model, open(filename, 'wb'))

Create input words and generate similar words

Now that we have a Python file to load the pre-trained model, we can then call it in another Python file to generate the most similar words based on the input words. The input word can be any word.

Once the input words are entered, the next step is to write the code that reads them. We then need to specify the number of similar words that the model should generate for each input word. Finally, we store the resulting similar words in a list. Here is the code to do this.

import pickle

filename = 'glove2word2vec_model.sav'
model = pickle.load(open(filename, 'rb'))

def append_list(sim_words, words):
    
    list_of_words = []
    
    for i in range(len(sim_words)):
        
        sim_words_list = list(sim_words[i])
        sim_words_list.append(words)
        sim_words_tuple = tuple(sim_words_list)
        list_of_words.append(sim_words_tuple)
        
    return list_of_words

input_word = 'school'
user_input = [x.strip() for x in input_word.split(', ')]
result_word = []
    
for words in user_input:

    sim_words = model.most_similar(words, topn = 5)
    sim_words = append_list(sim_words, words)

    result_word.extend(sim_words)
    
similar_word = [word[0] for word in result_word]
similarity = [word[1] for word in result_word] 
similar_word.extend(user_input)
labels = [word[2] for word in result_word]
label_dict = dict([(y,x+1) for x,y in enumerate(set(labels))])
color_map = [label_dict[x] for x in labels]

For example, suppose we want to find the five most similar words associated with “school”. Therefore, “school” will be our input word. The results are 'college', 'schools', 'elementary', 'students', and 'student'.
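For reference, the raw output of most_similar is a list of (word, cosine similarity) tuples; append_list above simply attaches the input word to each tuple so that the points can later be grouped and colored by input word. A quick sketch:

# most_similar returns (word, cosine similarity) tuples, ordered by similarity.
print(model.most_similar('school', topn=5))
# e.g. [('college', ...), ('schools', ...), ('elementary', ...),
#       ('students', ...), ('student', ...)]   # similarity scores omitted here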

Visualize word embeddings with PCA

We now have the input word and the similar words generated from it. Next, it is time to visualize them in the embedding space.

Through the pre-trained model, each word can be mapped into the embedding space by its vector representation. However, word embeddings are high-dimensional, which means the words cannot be visualized directly.
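For example, with the GloVe embeddings used here, each word is represented by a 100-dimensional vector, which is far too many dimensions to plot:

# Each word vector from the glove.6B.100d model has 100 components.
print(model['school'].shape)  # (100,)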

Principal component analysis (PCA) is commonly used to reduce the dimensionality of word embeddings. In short, PCA is a feature extraction technique that combines the variables and then drops the least important ones while preserving the most valuable parts of them. If you want to dig deeper into PCA, I recommend this article.

Towardsdatascience.com/a-one-stop-…
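As a quick, standalone illustration of what PCA does with our word vectors (separate from the plotting code below), this sketch keeps only the first three principal components and reports how much of the total variance they retain:

import numpy as np
from sklearn.decomposition import PCA

# Reduce the 100-dimensional vectors of the words gathered above to 3 components.
word_vectors = np.array([model[w] for w in similar_word])
pca = PCA(n_components=3, random_state=0)
reduced = pca.fit_transform(word_vectors)

print(reduced.shape)                        # (number of words, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept in 3D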

With PCA, we can visualize the word embeddings in 2D or 3D, so let’s create the code to visualize them using the model we loaded in the code block above. In the code below, only the 3D visualization is shown. To visualize the PCA result in two dimensions, only minor changes are needed; you can find what needs to change in the comments inside the code.

import plotly
import numpy as np
import plotly.graph_objs as go
from sklearn.decomposition import PCA

def display_pca_scatterplot_3D(model, user_input=None, words=None, label=None, color_map=None, topn=5, sample=10):

    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [ word for word in model.vocab ]
    
    word_vectors = np.array([model[w] for w in words])
    
    three_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:3]
    # For 2D, change the three_dim variable to two_dim as follows:
    # two_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:2]

    data = []
    count = 0
    
    for i in range (len(user_input)):

                trace = go.Scatter3d(
                    x = three_dim[count:count+topn,0], 
                    y = three_dim[count:count+topn,1],  
                    z = three_dim[count:count+topn,2],
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    })
                # For 2D, use go.Scatter instead of go.Scatter3d and remove the z variable.
                # Also use the previously declared two_dim variable instead of three_dim.
            
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter3d(
                    x = three_dim[count:,0], 
                    y = three_dim[count:,1],  
                    z = three_dim[count:,2],
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    })
    # For 2D, use go.Scatter instead of go.Scatter3d and remove the z variable.
    # Also use the previously declared two_dim variable instead of three_dim.
            
    data.append(trace_input)
    
# Config layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    
display_pca_scatterplot_3D(model, user_input, similar_word, labels, color_map)

For example, let’s say we want to find the five most similar words for each of “ball”, “school”, and “food”. Here is an example of the two-dimensional visualization.

Here is a 3D visualization of the same set of words.

Visually, we can now see a pattern in the space these words occupy. The words related to “ball” are close to each other because they have similar contexts, while they sit farther away from the words related to “school” and “food”, which have different contexts.

Visualize word embeddings with t-SNE

Besides PCA, another commonly used dimensionality reduction technique is t-distributed stochastic neighbor embedding (t-SNE). The difference between PCA and t-SNE lies in the underlying technique used to reduce the dimensionality.

PCA is a linear dimensionality reduction method: it maps the data from a high-dimensional space to a low-dimensional space with a linear projection that maximizes the variance of the data. t-SNE, on the other hand, is a nonlinear dimensionality reduction method: it first computes the similarity between points in the high-dimensional space and in the low-dimensional space, and then uses an optimization method, such as gradient descent, to minimize the difference between the similarities in the two spaces.
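To see the difference on the same data, here is a minimal side-by-side sketch. It reuses the word vectors gathered earlier; a small perplexity is used because we only have a handful of points, and scikit-learn requires the perplexity to be smaller than the number of samples.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

word_vectors = np.array([model[w] for w in similar_word])

# PCA: a deterministic linear projection onto the top two principal components.
pca_2d = PCA(random_state=0).fit_transform(word_vectors)[:, :2]

# t-SNE: a stochastic, nonlinear optimization; its layout depends on
# random_state, perplexity, learning rate and the number of iterations.
tsne_2d = TSNE(n_components=2, random_state=0, perplexity=5,
               learning_rate=500, n_iter=10000).fit_transform(word_vectors)

print(pca_2d.shape, tsne_2d.shape)  # both (number of words, 2)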

The code to visualize word embeddings with t-SNE is very similar to the PCA code. In the code below, only the 3D visualization is shown. To visualize the t-SNE result in 2D, only minor changes are needed; you can find what needs to change in the comments inside the code.

import plotly
import numpy as np
import plotly.graph_objs as go
from sklearn.manifold import TSNE

def display_tsne_scatterplot_3D(model, user_input=None, words=None, label=None, color_map=None, perplexity = 0, learning_rate = 0, iteration = 0, topn=5, sample=10):

    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [ word for word in model.vocab ]
    
    word_vectors = np.array([model[w] for w in words])
    
    three_dim = TSNE(n_components = 3, random_state=0, perplexity = perplexity, learning_rate = learning_rate, n_iter = iteration).fit_transform(word_vectors)[:,:3]


    # For 2D, change the three_dim variable to two_dim as follows:
    # two_dim = TSNE(n_components = 2, random_state=0, perplexity = perplexity, learning_rate = learning_rate, n_iter = iteration).fit_transform(word_vectors)[:,:2]

    data = []


    count = 0
    for i in range (len(user_input)):

                trace = go.Scatter3d(
                    x = three_dim[count:count+topn,0], 
                    y = three_dim[count:count+topn,1],  
                    z = three_dim[count:count+topn,2],
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    })
                # For 2D, use go.Scatter instead of go.Scatter3d and remove the z variable.
                # Also use the previously declared two_dim variable instead of three_dim.
            
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter3d(
                    x = three_dim[count:,0], 
                    y = three_dim[count:,1],  
                    z = three_dim[count:,2],
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    })
    # For 2D, use go.Scatter instead of go.Scatter3d and remove the z variable.
    # Also use the previously declared two_dim variable instead of three_dim.
            
    data.append(trace_input)
    
# Config layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    
display_tsne_scatterplot_3D(model, user_input, similar_word, labels, color_map, 5, 500, 10000)

Using the same example as in the PCA visualization, the results for the top 5 most similar words related to “ball”, “school”, and “food” are shown below.

Here is a 3D visualization of the same set of words.

As with PCA, note that words with similar contexts are close to each other, while words with different contexts are farther apart.

Create a web application to visualize word embeddings

So far, we have successfully created a Python script to visualize word embeddings in 2D or 3D using either PCA or t-SNE. Next, we can create another Python script to build a web application for a better user experience.

This web application lets us visualize the word embeddings with far more functionality and interactivity. For example, users can type their own input words and choose how many of the most similar words should be returned for each input word.

The web application can be created with Dash or Streamlit. In this article, I will show you how to build a simple interactive web application to visualize word embeddings with Streamlit.

First, we take all of the Python code we created earlier and put it into a single Python script. Next, we create a few user input parameters, as follows:

  • The dimensionality reduction technique, where the user can choose either PCA or t-SNE. Since there are only two options, we can use the selectbox widget in Streamlit.

  • The dimension of the visualization, where the user can choose to display the word embeddings in 2D or 3D. As before, we can use the selectbox widget.

  • The input words. This user input parameter asks the user to type the input words they want to investigate, such as “ball”, “school”, and “food”. Therefore, we can use the text_input widget.

  • The top-N most similar words, where the user needs to specify how many similar words should be returned for each input word. Since this can be any number, we can use the slider widget.

Next, we need to consider the parameters that appear when t-SNE is selected. With t-SNE, we can adjust a few parameters to get the best visualization result: the perplexity, the learning rate, and the number of optimization iterations. Since there is no single value of these parameters that works best in every case, it is better to let the user specify them.

Since we are using scikit-learn, we can refer to its documentation for the default values of these parameters. The default perplexity is 30, but we can adjust it between 5 and 50. The default learning rate is 200, but we can adjust it between 10 and 1000. Finally, the default number of iterations is 1000, and we can set it to any value of at least 250. We can expose these parameter values with the slider widget.

import streamlit as st

dim_red = st.sidebar.selectbox(
 'Select dimension reduction method',
 ('PCA', 'TSNE'))
dimension = st.sidebar.selectbox(
     "Select the dimension of the visualization",
     ('2D', '3D'))
user_input = st.sidebar.text_input("Type the word that you want to investigate. You can type more than one word by separating one word with other with comma (,)", ' ')
top_n = st.sidebar.slider('Select the amount of words associated with the input words you want to visualize ', 5, 100, (5))
annotation = st.sidebar.radio(
     "Enable or disable the annotation on the visualization",
     ('On', 'Off'))

if dim_red == 'TSNE':
    perplexity = st.sidebar.slider('Adjust the perplexity. The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity', 5, 50, (30))

    learning_rate = st.sidebar.slider('Adjust the learning rate', 10, 1000, (200))

    iteration = st.sidebar.slider('Adjust the number of iteration', 250, 100000, (1000))

Now we’ve covered all the pieces we need to build our Web application. Finally, we can package these things into a complete script, as shown below.

import plotly
import plotly.graph_objs as go
import numpy as np
import pickle
import streamlit as st
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

filename = 'glove2word2vec_model.sav'
model = pickle.load(open(filename, 'rb'))

def append_list(sim_words, words):
    
    list_of_words = []
    
    for i in range(len(sim_words)):
        
        sim_words_list = list(sim_words[i])
        sim_words_list.append(words)
        sim_words_tuple = tuple(sim_words_list)
        list_of_words.append(sim_words_tuple)
        
    return list_of_words


def display_scatterplot_3D(model, user_input=None, words=None, label=None, color_map=None, annotation='On',  dim_red = 'PCA', perplexity = 0, learning_rate = 0, iteration = 0, topn=0, sample=10):

    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [ word for word in model.vocab ]
    
    word_vectors = np.array([model[w] for w in words])
    
    if dim_red == 'PCA':
        three_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:3]
    else:
        three_dim = TSNE(n_components = 3, random_state=0, perplexity = perplexity, learning_rate = learning_rate, n_iter = iteration).fit_transform(word_vectors)[:,:3]

    color = 'blue'
    quiver = go.Cone(
        x = [0, 0, 0],
        y = [0, 0, 0],
        z = [0, 0, 0],
        u = [1.5, 0, 0],
        v = [0, 1.5, 0],
        w = [0, 0, 1.5],
        anchor = "tail",
        colorscale = [[0, color] , [1, color]],
        showscale = False
        )
    
    data = [quiver]

    count = 0
    for i in range (len(user_input)):

                trace = go.Scatter3d(
                    x = three_dim[count:count+topn,0], 
                    y = three_dim[count:count+topn,1],  
                    z = three_dim[count:count+topn,2],
                    text = words[count:count+topn] if annotation == 'On' else ' ',
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 30,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
               
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter3d(
                    x = three_dim[count:,0], 
                    y = three_dim[count:,1],  
                    z = three_dim[count:,2],
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 30,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

    data.append(trace_input)
    
# Config layout
    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)

    st.plotly_chart(plot_figure)

def horizontal_bar(word, similarity):
    
    similarity = [ round(elem, 2) for elem in similarity ]
    
    data = go.Bar(
            x= similarity,
            y= word,
            orientation='h',
            text = similarity,
            marker_color= 4,
            textposition='auto')

    layout = go.Layout(
            font = dict(size=20),
            xaxis = dict(showticklabels=False, automargin=True),
            yaxis = dict(showticklabels=True, automargin=True,autorange="reversed"),
            margin = dict(t=20, b= 20, r=10)
            )

    plot_figure = go.Figure(data = data, layout = layout)
    st.plotly_chart(plot_figure)

def display_scatterplot_2D(model, user_input=None, words=None, label=None, color_map=None, annotation='On', dim_red = 'PCA', perplexity = 0, learning_rate = 0, iteration = 0, topn=0, sample=10):

    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [ word for word in model.vocab ]
    
    word_vectors = np.array([model[w] for w in words])
    
    if dim_red == 'PCA':
        two_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:2]
    else:
        two_dim = TSNE(random_state=0, perplexity = perplexity, learning_rate = learning_rate, n_iter = iteration).fit_transform(word_vectors)[:,:2]

    
    data = []
    count = 0
    for i in range (len(user_input)):

                trace = go.Scatter(
                    x = two_dim[count:count+topn,0], 
                    y = two_dim[count:count+topn,1],  
                    text = words[count:count+topn] if annotation == 'On' else ' ',
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 15,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
               
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter(
                    x = two_dim[count:,0], 
                    y = two_dim[count:,1],  
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 25,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

    data.append(trace_input)
    
# Config layout
    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        hoverlabel=dict(
            bgcolor="white", 
            font_size=20, 
            font_family="Courier New"),
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)

    st.plotly_chart(plot_figure)

dim_red = st.sidebar.selectbox(
 'Select dimension reduction method',
 ('PCA', 'TSNE'))
dimension = st.sidebar.selectbox(
     "Select the dimension of the visualization",
     ('2D', '3D'))
user_input = st.sidebar.text_input("Type the word that you want to investigate. You can type more than one word by separating one word with other with comma (,)", ' ')
top_n = st.sidebar.slider('Select the amount of words associated with the input words you want to visualize ', 5, 100, (5))
annotation = st.sidebar.radio(
     "Enable or disable the annotation on the visualization",
     ('On', 'Off'))

if dim_red == 'TSNE':
    perplexity = st.sidebar.slider('Adjust the perplexity. The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity', 5, 50, (30))

    learning_rate = st.sidebar.slider('Adjust the learning rate', 10, 1000, (200))

    iteration = st.sidebar.slider('Adjust the number of iteration', 250, 100000, (1000))
    
else:
    perplexity = 0
    learning_rate = 0
    iteration = 0    

if user_input == ' ':
    
    similar_word = None
    labels = None
    color_map = None
    
else:
    
    user_input = [x.strip() for x in user_input.split(', ')]
    result_word = []
    
    for words in user_input:
    
        sim_words = model.most_similar(words, topn = top_n)
        sim_words = append_list(sim_words, words)
            
        result_word.extend(sim_words)
    
    similar_word = [word[0] for word in result_word]
    similarity = [word[1] for word in result_word] 
    similar_word.extend(user_input)
    labels = [word[2] for word in result_word]
    label_dict = dict([(y,x+1) for x,y in enumerate(set(labels))])
    color_map = [label_dict[x] for x in labels]
    

st.title('Word Embedding Visualization Based on Cosine Similarity')

st.header('This is a web app to visualize the word embedding.')
st.markdown('First, choose which dimension of visualization that you want to see. There are two options: 2D and 3D.')
           
st.markdown('Next, type the word that you want to investigate. You can type more than one word by separating one word with other with comma (,). ')

st.markdown('With the slider in the sidebar, you can pick the amount of words associated with the input word you want to visualize. This is done by computing the cosine similarity between vectors of words in embedding space.')
st.markdown('Lastly, you have an option to enable or disable the text annotation in the visualization.')

if dimension == '2D':
    st.header('2D Visualization')
    st.write('For more detail about each point (just in case it is difficult to read the annotation), you can hover around each points to see the words. You can expand the visualization by clicking expand symbol in the top  right corner of the visualization.')
    display_scatterplot_2D(model, user_input, similar_word, labels, color_map, annotation, dim_red, perplexity, learning_rate, iteration, top_n)
else:
    st.header('3D Visualization')
    st.write('For more detail about each point (just in case it is difficult to read the annotation), you can hover around each points to see the words. You can expand the visualization by clicking expand symbol in the top  right corner of the visualization.')
    display_scatterplot_3D(model, user_input, similar_word, labels, color_map, annotation, dim_red, perplexity, learning_rate, iteration, top_n)

st.header('The Top 5 Most Similar Words for Each Input')
count=0
for i in range (len(user_input)):
    
    st.write('The most similar words from '+str(user_input[i])+' are:')
    horizontal_bar(similar_word[count:count+5], similarity[count:count+5])
    
    count = count+top_n

You can now run the web application from the conda prompt (or any terminal). At the prompt, go to the directory containing the Python script and type the following command:

$ streamlit run your_script_name.py

Next, a browser window automatically pops up where you can access your Web application locally. Below is a snapshot of what you can do with the Web application.

And that’s it! You have created a simple web application with a lot of interactivity that visualizes word embeddings using PCA or t-SNE.

If you want to see the full code for this word embedding visualization, you can access it on my GitHub page.

Github.com/marcellusru…

The original link: towardsdatascience.com/visualizing…
