This post demonstrates how to cluster documents without a labeled data set, using a word vector model trained on Web data (provided by spaCy). I find it fascinating what is possible with a large amount of unlabeled data.
import spacy
import os
import json
import pandas as pd
%%time
stories = []
with open(os.path.join('data', 'HNStoriesAll.json'), 'rb') as fp:
    pages = json.load(fp)
for page in pages:
    for story in page['hits']:
        stories.append(story)
print(len(stories))
Extract just the id and the title.
titles = [{"id":s['objectID'], "title": s['title']} for s in stories]
titles[0]
Load spaCy's large English web model (en_core_web_lg)
%%time
nlp = spacy.load("en_core_web_lg")
Verify that the model is working
doc = nlp(titles[0]['title'])
# doc.vector is the 300-dimensional document vector we will use below
list(doc.ents)
list(doc.noun_chunks)
list(doc.sents)
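For a model like this, doc.vector is the average of the token vectors. We can mimic that with toy vectors (the words, dimensions, and numbers below are made up for illustration; the real model uses 300 dimensions):

```python
import numpy as np

# toy 4-dimensional "word vectors" (made up for illustration)
word_vectors = {
    'google': np.array([1.0, 0.0, 0.5, 0.2]),
    'buys':   np.array([0.0, 1.0, 0.1, 0.3]),
    'twitch': np.array([0.8, 0.2, 0.9, 0.1]),
}

def doc_vector(tokens):
    """Average the word vectors of the tokens, like spaCy's doc.vector."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

vec = doc_vector(['google', 'buys', 'twitch'])
print(vec.shape)  # (4,)
```

This is why two titles with mostly-similar words end up with nearby document vectors even when no word matches exactly.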
Process the first 10,000 titles (limited for speed). We could drop the entities since they are not used currently.
%%time
for i, title in enumerate(titles):
    if i >= 10000:
        break
    doc = nlp(title['title'])
    title['vector'] = doc.vector
    title['entities'] = list(doc.ents)
    del doc
Create a pandas data frame with title and document vector
titles_dict = dict((title['title'], title['vector']) for title in titles if 'vector' in title)
documents_df = pd.DataFrame.from_dict(titles_dict, orient='index')
documents_df.head()
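Note that keying the dict by title means duplicate titles collapse to a single row. A tiny self-contained sketch of the same construction, with made-up titles and toy 3-d vectors:

```python
import numpy as np
import pandas as pd

# hypothetical titles mapped to toy 3-d vectors
titles_dict = {
    'Google to buy Twitch': np.array([0.6, 0.4, 0.5]),
    'Show HN: my project':  np.array([0.1, 0.9, 0.2]),
}

# orient='index' makes each dict key a row label and each
# vector component a column
df = pd.DataFrame.from_dict(titles_dict, orient='index')
print(df.shape)  # (2, 3): one row per title, one column per dimension
```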
Fit t-SNE on the vectors and create a data frame with the 2-d coordinates
%%time
from sklearn.manifold import TSNE
tsne = TSNE()
tsne_vectors = pd.DataFrame(tsne.fit_transform(documents_df),
                            index=documents_df.index,
                            columns=['x_coord', 'y_coord'])
tsne_vectors.head()
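The same step can be illustrated without the spaCy vectors: t-SNE reduces each high-dimensional row to one 2-d point. A sketch on random 300-d vectors (perplexity lowered because the toy sample is tiny):

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(10, 300),
                   index=['title %d' % i for i in range(10)])

# perplexity must be smaller than the number of samples
coords = TSNE(perplexity=3, random_state=0).fit_transform(toy)
toy_2d = pd.DataFrame(coords, index=toy.index,
                      columns=['x_coord', 'y_coord'])
print(toy_2d.shape)  # (10, 2)
```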
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
output_notebook()
# Source -> https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
plot_data = ColumnDataSource(data=tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings',
                   plot_width=800,
                   plot_height=800,
                   tools=('pan, wheel_zoom, box_zoom,'
                          'box_select, reset'),
                   active_scroll='wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# draw the words as circles on the plot
tsne_plot.circle(x='x_coord',
                 y='y_coord',
                 source=plot_data,
                 line_alpha=0.2,
                 fill_alpha=1,
                 size=10,
                 hover_line_color='black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot)
In the above visualization we hope to find that similar titles land close to each other. This mostly holds, though there are some bad examples too.
I found a few articles about Google potentially buying Twitch clustered together:
Google to Acquire Twitch for $1 Billion
Google/YouTube reportedly set to Buy Twitch for over $1 Billion
YouTube to acquire video game streaming service twitch for $1 billion?
Google reportedly buying acquiring live games video site Twitch for $1bn
Meet Twitch the video-streaming firm Google may buy for $1 billion
Is Google about to buy Twitch for $1bn?
In 2006, Google paid $1.65bn for YouTube. Now it's acquiring Twitch for $1bn
YouTube to buy Twitch for $1bn
And not far from these:
Google to buy Skybox Imaging for at least $1b
Google to acquire micro-satellite company Skybox for $1B
But also nearby:
Adobe offers Photoshop for $9.99 per month until end of May
Sonic.Net Offering 1 Gbps, Unlimited Phone for $40 in California
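The post stops at visual inspection, but a clustering algorithm could make such groups explicit. A minimal sketch using KMeans (not used in this post) on toy 2-d points standing in for the title vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious blobs of toy points standing in for title vectors
points = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # one tight group
    [5.0, 5.1], [5.2, 5.0], [5.1, 5.2],   # an unrelated group
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# points in the same blob receive the same cluster label
print(labels[0] == labels[1] == labels[2])  # True
print(labels[0] != labels[3])               # True
```

On the real data we would run this on the 300-d document vectors (or the 2-d t-SNE coordinates) and inspect the titles in each cluster.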
What is neat about this is that we didn't need a labeled data set to get started: word vector models can be trained on unlabeled data (in this case Web data). Combined with the date of an article, we could attempt to group stories about the same topic, or recommend similar articles that a reader may be interested in.
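For the recommendation idea, a nearest-neighbour lookup on the document vectors would already be enough. A sketch with toy 3-d vectors (the titles and numbers are made up):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy document vectors; rows stand in for the 300-d title vectors
vectors = np.array([
    [0.9, 0.1, 0.0],   # "Google to buy Twitch"
    [0.8, 0.2, 0.1],   # "YouTube to acquire Twitch"
    [0.0, 0.1, 0.9],   # "Adobe offers Photoshop subscription"
])

# similarity of the first title to every title (including itself)
sims = cosine_similarity(vectors[0:1], vectors)[0]

# exclude the query itself, then take the best match
best = max((i for i in range(len(sims)) if i != 0), key=lambda i: sims[i])
print(best)  # 1: the other Twitch headline is the closest neighbour
```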