Graham Holker

Document Vectors with spaCy

2019-09-22 -- NLP with spaCy

This post demonstrates how to cluster documents without a labeled data set, using a word vector model trained on web data (provided by spaCy). I find it fascinating what is possible with a large amount of data and no labels at all.

In [1]:
import spacy
import os
import json
import pandas as pd

Load data

I previously downloaded the Hacker News data from here.

Here we separate it into individual stories.

In [17]:
%%time
stories = []

with open(os.path.join('data', 'HNStoriesAll.json'), 'rb') as fp:
    pages = json.load(fp)
# flatten the pages: each page holds its stories under 'hits'
for page in pages:
    for story in page['hits']:
        stories.append(story)
print(len(stories))
1333789
Wall time: 30.6 s

Extract just the id and the title.

In [3]:
titles = [{"id":s['objectID'], "title": s['title']} for s in stories]
titles[0]
Out[3]:
{'id': '7815290', 'title': 'DuckDuckGo Settings'}

Load spaCy Core Web Large model

In [4]:
%%time
nlp = spacy.load("en_core_web_lg")
Wall time: 9.01 s

Verify that the model is working

In [19]:
doc = nlp(titles[0]['title'])
#dir(doc)
In [20]:
#doc.vector
In [21]:
list(doc.ents)
Out[21]:
[]
In [22]:
list(doc.noun_chunks)
Out[22]:
[DuckDuckGo Settings]
In [23]:
list(doc.sents)
Out[23]:
[DuckDuckGo Settings]
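
As one more sanity check on the vectors themselves, similar sentences should score a higher similarity than unrelated ones. A minimal sketch, using made-up example titles:

sim_a = nlp("Google to acquire Twitch for $1 billion")
sim_b = nlp("YouTube to buy Twitch for $1bn")
sim_c = nlp("Adobe offers Photoshop for $9.99 per month")

# similarity is the cosine between the averaged word vectors
print(sim_a.similarity(sim_b))  # expect a relatively high score
print(sim_a.similarity(sim_c))  # expect a noticeably lower score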

Process 10,000 titles (limited for speed)

We could drop the entities, since they are not used later.

In [10]:
%%time 

# process only the first 10,000 titles to keep the runtime manageable
for title in titles[:10000]:
    doc = nlp(title['title'])
    title['vector'] = doc.vector
    title['entities'] = list(doc.ents)
Wall time: 56.9 s
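
Since only the vectors end up being used, one possible speed-up (a sketch, not timed here) would be to disable the pipeline components we don't need and stream the titles through nlp.pipe in batches; nlp_fast and the batch size are illustrative choices:

# skip the tagger, parser, and NER; the static word vectors still load
nlp_fast = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

texts = [t['title'] for t in titles[:10000]]
for title, doc in zip(titles, nlp_fast.pipe(texts, batch_size=256)):
    title['vector'] = doc.vector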

Create a pandas data frame with title and document vector

Note that keying the dict by title means any duplicate titles collapse to a single row.

In [12]:
titles_dict = {title['title']: title['vector'] for title in titles if 'vector' in title}
documents_df = pd.DataFrame.from_dict(titles_dict, orient='index')
documents_df.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
DuckDuckGo Settings 0.515860 0.134955 -0.146073 -0.251888 -0.570215 -0.100120 0.388780 -0.460955 -0.213385 -0.346350 ... 0.036770 0.069570 0.296665 0.613625 0.375830 -0.268476 -0.125810 0.071613 0.034150 0.628140
Making Twitter Easier to Use -0.048493 0.080209 -0.317336 0.133527 -0.185892 -0.046460 0.084478 -0.122317 -0.014832 1.902720 ... -0.271532 0.065391 0.144273 0.084367 0.101718 -0.010995 -0.140841 -0.121623 -0.081698 0.506876
London refers Uber app row to High Court 0.052784 0.121495 0.060381 0.031344 0.221139 -0.109554 0.143441 -0.173580 -0.047636 1.687105 ... -0.136413 -0.051603 0.046576 0.013684 -0.062076 0.008569 0.043893 -0.105870 0.096160 0.130871
Young Global Leaders, who should be nominated? (World Economic Forum) -0.095607 0.149450 0.041192 -0.014108 0.101103 -0.105989 0.046705 -0.048695 0.002153 2.283726 ... -0.237890 0.083993 0.023581 -0.081463 0.006529 0.110942 -0.054442 -0.160829 -0.049139 -0.013723
Blooki.st goes BETA in a few hours -0.014594 0.025128 -0.072924 0.062255 0.036138 -0.159130 -0.088883 -0.234647 0.230422 1.773779 ... -0.163038 -0.067990 0.084636 -0.096408 0.000227 0.012051 -0.036039 -0.128264 -0.135046 0.117284

5 rows × 300 columns

Fit t-SNE on the vectors and create a data frame with the 2-d coordinates

In [13]:
%%time

from sklearn.manifold import TSNE
tsne = TSNE()
tsne_vectors = pd.DataFrame(tsne.fit_transform(documents_df),
                            index=documents_df.index,
                            columns=['x_coord', 'y_coord'])
tsne_vectors.head()
tsne_vectors.head()
Wall time: 3min 36s
Out[13]:
x_coord y_coord
DuckDuckGo Settings -68.741676 9.232639
Making Twitter Easier to Use 21.885588 -30.466169
London refers Uber app row to High Court 25.289881 49.012508
Young Global Leaders, who should be nominated? (World Economic Forum) 7.143032 12.719261
Blooki.st goes BETA in a few hours 40.268761 8.354419
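
Note that t-SNE is stochastic, so each run produces a slightly different layout. For reproducible coordinates we could fix the seed (a sketch):

tsne = TSNE(n_components=2, random_state=42)
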
In [14]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()
Loading BokehJS ...
In [25]:
# Source -> https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb

plot_data = ColumnDataSource(data=tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Document Vectors',
                   plot_width=800,
                   plot_height=800,
                   tools='pan, wheel_zoom, box_zoom, box_select, reset',
                   active_scroll='wheel_zoom')

# add a hover tool to display titles on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# draw the titles as circles on the plot
tsne_plot.circle(x='x_coord',
                 y='y_coord',
                 source=plot_data,
                 line_alpha=0.2,
                 fill_alpha=1,
                 size=10,
                 hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

In the visualization above, we hope to find that similar titles sit close to each other. There are good examples, but some bad ones too.

I found a few articles about Google potentially buying Twitch clustered together:

  • Google to Acquire Twitch for $1 Billion
  • Google/YouTube reportedly set to Buy Twitch for over $1 Billion
  • YouTube to acquire video game streaming service twitch for $1 billion?
  • Google reportedly buying acquiring live games video site Twitch for $1bn
  • Meet Twitch the video-streaming firm Google may buy for $1 billion
  • Is Google about to buy Twitch for $1bn?
  • In 2006, Google paid $1.65bn for YouTube. Now it's acquiring Twitch for $1bn
  • YouTube to buy Twitch for $1bn

And not far from these:

  • Google to buy Skybox Imaging for at least $1b
  • Google to acquire micro-satellite company Skybox for $1B

But also nearby:

  • Adobe offers Photoshop for $9.99 per month until end of May
  • Sonic.Net Offering 1 Gbps, Unlimited Phone for $40 in California

What is neat about this is that we didn't need a labeled data set to get started: word vector models can be trained on unlabeled data (in this case, web data). Combined with each article's date, we could attempt to group stories about the same topic, or recommend similar articles that a reader might be interested in.
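
As a minimal sketch of the recommendation idea (the helper below is mine, assuming the documents_df built above), we can rank titles by cosine similarity to a query title:

import numpy as np

def most_similar(query_title, df, n=5):
    # cosine similarity between the query's vector and every row
    vectors = df.values
    query = df.loc[query_title].values
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-12)
    return pd.Series(sims, index=df.index).drop(query_title).nlargest(n)

most_similar('DuckDuckGo Settings', documents_df)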
