
This article is written for readers with a background in data science and technology. It explores how to visualize the well-known Amazon review dataset using fast sentence embeddings. The technical tools covered are FSE, FIt-SNE, and Tableau.

t-SNE mapping of 100,000 Amazon products © Oliver Borchers

The figure above shows the result of this pipeline. At the end of the article, there is a link to an interactive Tableau map to browse all 100,000 products.

Data

If you work with NLP, you have probably heard of the Stanford Amazon review dataset (http://jmcauley.ucsd.edu/data/amazon/). Its second-largest category is “Electronics”, with 7,824,482 reviews.

To get an embedding for each product, we need both the review data and the metadata. The two are connected via each product’s unique Amazon ASIN. Let’s start with the metadata:

Metadata for Amazon reviews. © Oliver Borchers

To arrive at data we can visualize (that is, to represent each product as a single vector), we can only use the reviews of each product. We are therefore dealing with a many-to-one embedding. After filtering out ASINs with fewer than 10 reviews, we are left with 97,249 ASINs and 6,875,530 reviews.
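As a rough sketch of that filtering step (assuming the reviews and metadata have been loaded into pandas DataFrames called data and meta, and using illustrative file names, since the loading code is not shown here), it might look like this:

import pandas as pd

# Illustrative file names; adjust to wherever the raw JSON files live.
data = pd.read_json("reviews_Electronics.json", lines=True)
meta = pd.read_json("meta_Electronics.json", lines=True).set_index("asin")

# Keep only products (ASINs) with at least 10 reviews.
counts = data["asin"].value_counts()
valid_asins = counts[counts >= 10].index
data = data[data["asin"].isin(valid_asins)]
meta = meta.loc[meta.index.intersection(valid_asins)]

print(f"{meta.shape[0]} ASINs, {data.shape[0]} reviews")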

We do not do any preprocessing of the text data. Why is that? Because…

From reviews to product embedding

To obtain an embedding for each review, you first need some kind of pre-trained word embeddings. The reviews are likely to contain many unknown words, and this is where FastText comes in: FSE supports FastText models out of the box. We first load the publicly available FastText vectors (https://fasttext.cc/docs/en/english-vectors.html):

from gensim.models.keyedvectors import FastTextKeyedVectors

ft = FastTextKeyedVectors.load("../models/ft_crawl_300d_2m.model")

Next, we instantiate the SIF model from FSE:

from fse.models import SIF

model = SIF(ft, components=10, lang_freq="en")

The number of components is set to 10, as explained in the STS benchmark reproducibility section.

About the lang_freq parameter: some pre-trained embeddings do not contain information on the word frequencies in the training corpus. FSE supports inducing word frequencies for pre-trained models in multiple languages (which may take a while, depending on the vocabulary size); this frequency information is essential for the SIF and uSIF models.

All FSE models require a list of tuples. The first entry of each tuple is the list of tokens (the sentence), and the second entry is the sentence index. The index determines the target row into which the sentence embedding is written; later, we will write multiple sentences (reviews) into one row (many-to-one).

s = (["Hello", "world"], 0)

FSE provides multiple input classes, all with different capabilities. There are the following input classes to choose from:

• IndexedList: for sentences that are already split into tokens.

• CIndexedList: for pre-split sentences with a custom index.

• SplitIndexedList: for unsplit sentences; splits the strings itself.

• SplitCIndexedList: for unsplit sentences with a custom index.

• CSplitIndexedList: for unsplit sentences; you can provide a custom split function.

• CSplitCIndexedList: for unsplit sentences with a custom index and a custom split function.

• IndexedLineDocument: for streaming sentences from disk; it is indexable, so you can search for similar sentences.

The input classes above are sorted by speed: IndexedList is the fastest, while CSplitCIndexedList is the slowest variant (more calls = slower). Why so many classes? Because we want to keep the number of lines of code in each __getitem__ method as small as possible, so that it does not slow down the computation.
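To illustrate the idea (this is a simplified sketch, not FSE’s actual implementation), such a lazy input class essentially keeps __getitem__ as small as possible:

class LazySplitInput:
    """Simplified sketch of a lazy input class with a custom index."""
    def __init__(self, sentences, custom_index):
        self.sentences = sentences        # raw, unsplit strings
        self.custom_index = custom_index  # target row for each sentence

    def __getitem__(self, i):
        # Split only when the item is actually requested; nothing is pre-tokenized.
        return (self.sentences[i].split(), self.custom_index[i])

    def __len__(self):
        return len(self.sentences)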

We use SplitCIndexedList for the review data because we do not want to pre-split the text (pre-splitting all 7 million reviews takes up a lot of memory). Internally, the input class only stores references to the reviews and performs the preprocessing when __getitem__ is called.

from fse import SplitCIndexedList

review = ["I really like this product.", "Its nice and comfy."]
s = SplitCIndexedList(review, custom_index=[0, 0])
print(s[0])
print(s[1])

>>> (['I', 'really', 'like', 'this', 'product.'], 0)

>>> (['Its', 'nice', 'and', 'comfy.'], 0)

Note that both of these sentences point to index 0. They will therefore both be added to the embedding at index 0. To map each ASIN to an index, we just need a few more convenience lines.

from fse import SplitCIndexedList

ASIN_TO_IDX = {asin: index for index, asin in enumerate(meta.index)}

indexed_reviews = SplitCIndexedList(data.reviewText.values, custom_index=[ASIN_TO_IDX[asin] for asin in data.asin])

Now that everything is in place, we can just run:

model.train(indexed_reviews)

The model was trained on a cloud instance with 16 cores and 32 GB of memory. The whole process takes about 15 minutes, at roughly 8,500 reviews per second. The average review contains 86 words, for a total of 59,377,622 words; we compress about 7 million reviews into a roughly 100,000-by-300 matrix. Adding more workers does not help at this point, because the preprocessing (splitting) is the bottleneck.

If the data is already pre-split, it can reach about 500,000 sentences per second on a regular MacBook Pro.
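For completeness, a hedged example of that pre-split path (using IndexedList, which, as listed above, takes sentences that are already tokenized; here the index simply equals the position in the list):

from fse import IndexedList

pre_split = [["I", "really", "like", "this", "product."],
             ["Its", "nice", "and", "comfy."]]
s = IndexedList(pre_split)
print(s[1])

>>> (['Its', 'nice', 'and', 'comfy.'], 1)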

To learn more, see the tutorial notebook: https://github.com/oborchers/Fast_Sentence_Embeddings/blob/master/notebooks/Tutorial.ipynb

Embedding and Visualization

After training the sentence embeddings, each individual embedding can be accessed via its index, or you can access the complete embedding matrix. The syntax is kept as close as possible to Gensim’s for ease of use.

model.sv[0] # Access embedding with index 0

model.sv.vectors # Access embedding matrix

The corresponding sentence vectors (sv) class provides quite a few functions to work with the resulting sentence embeddings. For example, it offers similarity, distance, most_similar, similar_by_word, similar_by_sentence, and similar_by_vector.
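A quick sketch of how a few of these can be used (the exact signatures may differ slightly, so treat this as illustrative):

sim = model.sv.similarity(0, 1)        # cosine similarity between embeddings 0 and 1
dist = model.sv.distance(0, 1)         # the corresponding distance
neighbours = model.sv.most_similar(0)  # sentences most similar to the one at index 0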

Visualizing this amount of data with standard scikit-learn t-SNE would take a very long time, so we try a more optimized approach: FIt-SNE. This optimized t-SNE implementation uses Fourier transforms to speed up the computation. If you are interested, read the paper [4]. Handily, it uses all 16 cores of the machine.

import sys; sys.path.append('../FIt-SNE')

from fast_tsne import fast_tsne

mapping = fast_tsne(model.sv.vectors, perplexity=50, seed=42)

Once the mapping is computed, the hard part is effectively done. After adding some information from the metadata to each point, we can export everything to Tableau.
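A rough sketch of that export step (assuming the rows of mapping line up with meta.index, as they do via ASIN_TO_IDX above; the metadata column names here are illustrative):

import pandas as pd

# Attach the 2D t-SNE coordinates to the product metadata and write a CSV for Tableau.
export = pd.DataFrame(mapping, columns=["x", "y"], index=meta.index)
export = export.join(meta[["title", "price", "brand"]])  # illustrative column names
export.to_csv("product_map.csv")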

Tableau mapping

To access the corresponding graphic, visit my public Tableau page:

https://public.tableau.com/views/ProductMap_15674406439260/TSNEPrice?:embed=y&:display_count=yes&:origin=viz_share_link

Hover over each point to get information about product price, name, brand, and so on.

From an embedding perspective, each cluster contains quite a bit of information.

Visualization of the Amazon reviews © Oliver Borchers

In addition, I went into the data and manually labeled some clusters to check whether the cluster formation is at least somewhat meaningful. The mapping captures a lot of the information contained in the reviews.

Remember: all we did was average the words of each review and sum the reviews of each product into one row. It is astonishing how much information such averaging-based embeddings can capture.
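Conceptually (ignoring the SIF weighting and the removal of principal components), this many-to-one scheme boils down to something like the following toy example:

import numpy as np

# Toy word vectors of dimension 3 and two reviews of the same product.
word_vec = {
    "great": np.array([1.0, 0.0, 0.0]),
    "sound": np.array([0.0, 1.0, 0.0]),
    "nice":  np.array([0.0, 0.0, 1.0]),
    "comfy": np.array([1.0, 1.0, 0.0]),
}
reviews = [["great", "sound"], ["nice", "comfy"]]

product_vec = np.zeros(3)
for tokens in reviews:
    sentence_vec = np.mean([word_vec[w] for w in tokens], axis=0)  # average the words of one review
    product_vec += sentence_vec                                    # accumulate all reviews into one row

print(product_vec)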

Amazon Review Book Category label © Oliver Borchers

Conclusion

Sentence embeddings are an important part of many NLP pipelines. This article showed how to visualize the well-known Amazon review dataset using fast sentence embeddings (FSE), FIt-SNE, and Tableau.

I hope you have fun using FSE. Feel free to suggest other models or features.

The corresponding FSE package is available via pip and on GitHub, giving data scientists a way to quickly compute sentence embeddings.

Note: this article requires standard Python (>3.6) packages, in particular NumPy, SciPy, Cython, and Gensim.


