Graham Holker

Document Vectors with spaCy

2019-09-22 -- NLP with spaCy

This post demonstrates how to cluster documents without a labeled data set, using a word vector model trained on web data (provided by spaCy). I find it fascinating what is possible with a large amount of data and no labels at all.

In [1]:
import spacy
import os
import json
import pandas as pd

Load data

I previously downloaded the Hacker News data from here.

Here we separate it into individual stories.

In [17]:
%%time
stories = []

with open(os.path.join('data', 'HNStoriesAll.json'), 'rb') as fp:
    pages = json.load(fp)
# flatten the pages: each page holds its stories under 'hits'
for page in pages:
    for story in page['hits']:
        stories.append(story)
print(len(stories))
1333789
Wall time: 30.6 s

Extract just the id and the title.

In [3]:
titles = [{"id":s['objectID'], "title": s['title']} for s in stories]
titles[0]
Out[3]:
{'id': '7815290', 'title': 'DuckDuckGo Settings'}

Load spaCy Core Web Large model

In [4]:
%%time
nlp = spacy.load("en_core_web_lg")
Wall time: 9.01 s

Verify that the model is working

In [19]:
doc = nlp(titles[0]['title'])
#dir(doc)
In [20]:
#doc.vector
In [21]:
list(doc.ents)
Out[21]:
[]
In [22]:
list(doc.noun_chunks)
Out[22]:
[DuckDuckGo Settings]
In [23]:
list(doc.sents)
Out[23]:
[DuckDuckGo Settings]
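
As one more sanity check on the vectors themselves, similar sentences should score a higher similarity than unrelated ones. A minimal sketch, using made-up example titles:

sim_a = nlp("Google to acquire Twitch for $1 billion")
sim_b = nlp("YouTube to buy Twitch for $1bn")
sim_c = nlp("Adobe offers Photoshop for $9.99 per month")

# similarity is the cosine between the averaged word vectors
print(sim_a.similarity(sim_b))  # expect a relatively high score
print(sim_a.similarity(sim_c))  # expect a noticeably lower score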

Process 10,000 titles (limited for speed)

We could drop the entities, since they are not used later.

In [10]:
%%time 

# process only the first 10,000 titles to keep the runtime manageable
for title in titles[:10000]:
    doc = nlp(title['title'])
    title['vector'] = doc.vector
    title['entities'] = list(doc.ents)
Wall time: 56.9 s
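
Since only the vectors end up being used, one possible speed-up (a sketch, not timed here) would be to disable the pipeline components we don't need and stream the titles through nlp.pipe in batches; nlp_fast and the batch size are illustrative choices:

# skip the tagger, parser, and NER; the static word vectors still load
nlp_fast = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

texts = [t['title'] for t in titles[:10000]]
for title, doc in zip(titles, nlp_fast.pipe(texts, batch_size=256)):
    title['vector'] = doc.vector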

Create a pandas data frame with title and document vector

Note that keying the dict by title means any duplicate titles collapse to a single row.

In [12]:
titles_dict = {title['title']: title['vector'] for title in titles if 'vector' in title}
documents_df = pd.DataFrame.from_dict(titles_dict, orient='index')
documents_df.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
DuckDuckGo Settings 0.515860 0.134955 -0.146073 -0.251888 -0.570215 -0.100120 0.388780 -0.460955 -0.213385 -0.346350 ... 0.036770 0.069570 0.296665 0.613625 0.375830 -0.268476 -0.125810 0.071613 0.034150 0.628140
Making Twitter Easier to Use -0.048493 0.080209 -0.317336 0.133527 -0.185892 -0.046460 0.084478 -0.122317 -0.014832 1.902720 ... -0.271532 0.065391 0.144273 0.084367 0.101718 -0.010995 -0.140841 -0.121623 -0.081698 0.506876
London refers Uber app row to High Court 0.052784 0.121495 0.060381 0.031344 0.221139 -0.109554 0.143441 -0.173580 -0.047636 1.687105 ... -0.136413 -0.051603 0.046576 0.013684 -0.062076 0.008569 0.043893 -0.105870 0.096160 0.130871
Young Global Leaders, who should be nominated? (World Economic Forum) -0.095607 0.149450 0.041192 -0.014108 0.101103 -0.105989 0.046705 -0.048695 0.002153 2.283726 ... -0.237890 0.083993 0.023581 -0.081463 0.006529 0.110942 -0.054442 -0.160829 -0.049139 -0.013723
Blooki.st goes BETA in a few hours -0.014594 0.025128 -0.072924 0.062255 0.036138 -0.159130 -0.088883 -0.234647 0.230422 1.773779 ... -0.163038 -0.067990 0.084636 -0.096408 0.000227 0.012051 -0.036039 -0.128264 -0.135046 0.117284

5 rows × 300 columns

Fit t-SNE on the vectors and create a data frame with the 2-d coordinates

In [13]:
%%time

from sklearn.manifold import TSNE
tsne = TSNE()
tsne_vectors = pd.DataFrame(tsne.fit_transform(documents_df),
                            index=documents_df.index,
                            columns=['x_coord', 'y_coord'])
tsne_vectors.head()
tsne_vectors.head()
Wall time: 3min 36s
Out[13]:
x_coord y_coord
DuckDuckGo Settings -68.741676 9.232639
Making Twitter Easier to Use 21.885588 -30.466169
London refers Uber app row to High Court 25.289881 49.012508
Young Global Leaders, who should be nominated? (World Economic Forum) 7.143032 12.719261
Blooki.st goes BETA in a few hours 40.268761 8.354419
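
Note that t-SNE is stochastic, so each run produces a slightly different layout. For reproducible coordinates we could fix the seed (a sketch):

tsne = TSNE(n_components=2, random_state=42)
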
In [14]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()
Loading BokehJS ...
In [25]:
# Source -> https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb

plot_data = ColumnDataSource(data=tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Document Vectors',
                   plot_width=800,
                   plot_height=800,
                   tools='pan, wheel_zoom, box_zoom, box_select, reset',
                   active_scroll='wheel_zoom')

# add a hover tool to display titles on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# draw the titles as circles on the plot
tsne_plot.circle(x='x_coord',
                 y='y_coord',
                 source=plot_data,
                 line_alpha=0.2,
                 fill_alpha=1,
                 size=10,
                 hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

In the visualization above, we hope to find that similar titles sit close to each other. There are good examples, but some bad ones too.

I found a few articles about Google potentially buying Twitch clustered together:

  • Google to Acquire Twitch for $1 Billion
  • Google/YouTube reportedly set to Buy Twitch for over $1 Billion
  • YouTube to acquire video game streaming service twitch for $1 billion?
  • Google reportedly buying acquiring live games video site Twitch for $1bn
  • Meet Twitch the video-streaming firm Google may buy for $1 billion
  • Is Google about to buy Twitch for $1bn?
  • In 2006, Google paid $1.65bn for YouTube. Now it's acquiring Twitch for $1bn
  • YouTube to buy Twitch for $1bn

And not far from these:

  • Google to buy Skybox Imaging for at least $1b
  • Google to acquire micro-satellite company Skybox for $1B

But also nearby:

  • Adobe offers Photoshop for $9.99 per month until end of May
  • Sonic.Net Offering 1 Gbps, Unlimited Phone for $40 in California

What is neat about this is that we didn't need a labeled data set to get started: word vector models can be trained on unlabeled data (in this case, web data). Combined with each article's date, we could attempt to group stories about the same topic, or recommend similar articles that a reader might be interested in.
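
As a minimal sketch of the recommendation idea (the helper below is mine, assuming the documents_df built above), we can rank titles by cosine similarity to a query title:

import numpy as np

def most_similar(query_title, df, n=5):
    # cosine similarity between the query's vector and every row
    vectors = df.values
    query = df.loc[query_title].values
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-12)
    return pd.Series(sims, index=df.index).drop(query_title).nlargest(n)

most_similar('DuckDuckGo Settings', documents_df)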
