2. Topic Modeling with Gensim

2.1. Data

[1]:

%matplotlib inline
import matplotlib.pyplot as plt
from collections import defaultdict
from gensim import corpora

plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
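
To make the dictionary and bag-of-words step concrete, here is a minimal pure-Python sketch of the same idea. The helper names (build_token2id, to_bow) and the toy texts are hypothetical, and the id assignment (order of first appearance) is an assumption; gensim may number tokens differently.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# toy tokenized documents (hypothetical, not the corpus above)
toy_texts = [["human", "interface", "computer"], ["system", "human", "system"]]
toy_token2id = build_token2id(toy_texts)
toy_bow = [to_bow(text, toy_token2id) for text in toy_texts]
print(toy_bow)  # [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 2)]]
```

Each document becomes a sparse list of (id, count) pairs, which is the shape of input the models in the next section consume.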

2.2. Models

[2]:

from gensim import models

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
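
As a rough illustration of what the TF-IDF transform does to a bag-of-words corpus, the sketch below weights each count by log2(N / df) and then scales every document to unit length. We assume this base-2 idf and L2 normalisation correspond to the TfidfModel defaults; the function name tfidf_transform and the toy corpus are hypothetical.

```python
import math

def tfidf_transform(bow_corpus):
    """Weight each (token_id, count) by log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_corpus)

    # document frequency: number of documents in which each token id appears
    df = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [
            (token_id, count * math.log2(n_docs / df[token_id]))
            for token_id, count in doc
        ]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(token_id, w / norm) for token_id, w in weights]
        weighted_corpus.append(weights)
    return weighted_corpus

# toy bag-of-words corpus (hypothetical): token 0 is common, tokens 1 and 2 are rare
toy_corpus = [[(0, 1), (1, 1)], [(0, 1)], [(2, 2)]]
toy_tfidf = tfidf_transform(toy_corpus)
print(toy_tfidf[0])
```

In the first toy document the rarer token 1 receives a larger weight than the more common token 0, which is the effect TF-IDF is meant to have.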

[3]:

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi_model[corpus_tfidf]

[4]:

lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
corpus_lda = lda_model[corpus]

2.3. Document to Topic Membership

2.3.1. LSI

LSI gives each document a coordinate per topic rather than a topic membership, so you have to cluster these coordinates to see which documents belong together.

[5]:

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.06600783396090049), (1, 0.5200703306361851)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142102), (1, 0.7609563167700062)] A survey of user opinion of computer system response time
[(0, 0.0899263997244606), (1, 0.7241860626752509)] The EPS user interface management system
[(0, 0.07585847652177845), (1, 0.6320551586003429)] System and human system engineering testing of EPS
[(0, 0.10150299184979857), (1, 0.5737308483002964)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378322), (1, -0.16115180214025474)] The generation of random binary unordered trees
[(0, 0.8774787673119844), (1, -0.16758906864659012)] The intersection graph of paths in trees
[(0, 0.9098624686818588), (1, -0.1408655362871859)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569281), (1, 0.0539290756638969)] Graph minors A survey

2.3.1.1. k-means

When we use k-means, we set k (the number of clusters) to the number of topics. The predicted cluster labels then serve as topic assignments. Note that this approach makes LSI a hard topic assignment approach (hard as in exactly one topic per document, not as in difficult).

[6]:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(37)

X = np.array([[tup[1] for tup in arr] for arr in corpus_lsi])
kmeans = KMeans(n_clusters=2, random_state=37).fit(X)

[7]:

kmeans.labels_

[7]:

array([1, 1, 1, 1, 1, 0, 0, 0, 0])

[8]:

kmeans.cluster_centers_

[8]:

array([[ 0.77678367, -0.10391933],
       [ 0.10599433,  0.64219974]])
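
The labels above can be reproduced by hand: each document is assigned to whichever cluster center is closest. The sketch below does this with the LSI coordinates and cluster centers printed earlier, rounded to four decimals; the helper name nearest_center is hypothetical.

```python
import math

# LSI coordinates per document, copied from the output above (rounded)
lsi_coords = [
    (0.0660, 0.5201), (0.1967, 0.7610), (0.0899, 0.7242),
    (0.0759, 0.6321), (0.1015, 0.5737), (0.7032, -0.1612),
    (0.8775, -0.1676), (0.9099, -0.1409), (0.6166, 0.0539),
]
# k-means cluster centers, copied from the output above (rounded)
centers = [(0.7768, -0.1039), (0.1060, 0.6422)]

def nearest_center(point, centers):
    """Return the index of the center with the smallest Euclidean distance."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

labels = [nearest_center(point, lsi_coords_center) for point, lsi_coords_center in
          ((p, centers) for p in lsi_coords)]
print(labels)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The cluster numbers themselves are arbitrary; only the grouping of documents is meaningful.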

2.3.1.2. Plotting

[9]:

X = np.array([[tup[1] for tup in arr] for arr in corpus_lsi])

x = X[:,0]
y = X[:,1]
c = ['r' if i == 0 else 'g' for i in kmeans.labels_]

fig, ax = plt.subplots(figsize=(5, 5))
_ = ax.scatter(x, y, c=c)

2.3.2. LDA

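For LDA, the results give the probability of membership in each topic. Note that these sum to 1.0 per document. (gensim omits topics whose probability falls below the model's minimum_probability, so with more topics the listed values may sum to slightly less.)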

[10]:

for doc, as_text in zip(corpus_lda, documents):
    print(doc, as_text)

[(0, 0.8609383), (1, 0.13906173)] Human machine interface for lab abc computer applications
[(0, 0.897828), (1, 0.10217202)] A survey of user opinion of computer system response time
[(0, 0.8827687), (1, 0.11723135)] The EPS user interface management system
[(0, 0.88589597), (1, 0.114104055)] System and human system engineering testing of EPS
[(0, 0.8546059), (1, 0.14539409)] Relation of user perceived response time to error measurement
[(0, 0.3230946), (1, 0.67690533)] The generation of random binary unordered trees
[(0, 0.19665903), (1, 0.803341)] The intersection graph of paths in trees
[(0, 0.14095493), (1, 0.8590451)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.13678026), (1, 0.86321974)] Graph minors A survey
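
If a single topic per document is needed, one simple option (our assumption, not something LDA requires) is to take the most probable topic. The sketch below applies this to the probabilities printed above and also checks that each row sums to roughly 1.0.

```python
# (topic_id, probability) pairs per document, copied from the output above
doc_topics = [
    [(0, 0.8609383), (1, 0.13906173)],
    [(0, 0.897828), (1, 0.10217202)],
    [(0, 0.8827687), (1, 0.11723135)],
    [(0, 0.88589597), (1, 0.114104055)],
    [(0, 0.8546059), (1, 0.14539409)],
    [(0, 0.3230946), (1, 0.67690533)],
    [(0, 0.19665903), (1, 0.803341)],
    [(0, 0.14095493), (1, 0.8590451)],
    [(0, 0.13678026), (1, 0.86321974)],
]

# hard assignment: the topic id with the largest probability
hard_labels = [max(doc, key=lambda pair: pair[1])[0] for doc in doc_topics]
# soft memberships should add up to about 1.0 per document
row_sums = [sum(prob for _, prob in doc) for doc in doc_topics]

print(hard_labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

This yields the same two groups of documents as the k-means clustering of the LSI coordinates, although the topic numbering is arbitrary in both cases.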

2.3.2.1. Plotting

[11]:

X = np.array([[tup[1] for tup in arr] for arr in corpus_lda])

x = X[:,0]
y = X[:,1]
c = ['r' if np.argmax(X[r,:]) == 0 else 'g' for r in range(X.shape[0])]

fig, ax = plt.subplots(figsize=(5, 5))
_ = ax.scatter(x, y, c=c)

2.4. Id-Word and Word-Id

Knowing how IDs map to words (tokens), and vice versa, helps when we want to diagnose or troubleshoot how the weights associate with words within a topic.

[12]:

print(lsi_model.id2word.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

[13]:

print(lsi_model.id2word.id2token)

{0: 'computer', 1: 'human', 2: 'interface', 3: 'response', 4: 'survey', 5: 'system', 6: 'time', 7: 'user', 8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors'}
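
The two mappings are inverses of each other, so either one can be rebuilt from the other. A small sketch using the mapping printed above; the bag-of-words document being decoded is a made-up example.

```python
# token -> id mapping, copied from the output above
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}

# invert it to get id -> token
id2token = {token_id: token for token, token_id in token2id.items()}

# decode a bag-of-words document (hypothetical example) back into words
bow = [(5, 2), (8, 1)]
decoded = [(id2token[token_id], count) for token_id, count in bow]
print(decoded)  # [('system', 2), ('eps', 1)]
```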

2.5. Word Weights or Probabilities

2.5.1. LSI

Each word has a weight per topic. Unlike probabilities, these weights may be negative and do not sum to 1.0.

[14]:

lsi_model.get_topics()

[14]:

array([[ 0.04940859,  0.02969616,  0.03522417,  0.05951239,  0.1869311 ,
         0.06135723,  0.05951239,  0.05823724,  0.03490897,  0.70321089,
         0.53773148,  0.40171367],
       [ 0.29287972,  0.2804038 ,  0.32750471,  0.3204961 ,  0.17065511,
         0.46024666,  0.3204961 ,  0.3726838 ,  0.3323675 , -0.1611518 ,
        -0.07585493, -0.0294099 ]])
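
To read these rows, it helps to pair each weight with its word and look at the largest ones. The sketch below ranks words by absolute weight (an assumption on our part, since a strongly negative weight is also informative); the helper name top_words is hypothetical and the numbers are copied from the outputs above.

```python
id2token = {
    0: 'computer', 1: 'human', 2: 'interface', 3: 'response',
    4: 'survey', 5: 'system', 6: 'time', 7: 'user',
    8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors',
}

# LSI word weights per topic, copied from the output above
lsi_topics = [
    [0.04940859, 0.02969616, 0.03522417, 0.05951239, 0.1869311, 0.06135723,
     0.05951239, 0.05823724, 0.03490897, 0.70321089, 0.53773148, 0.40171367],
    [0.29287972, 0.2804038, 0.32750471, 0.3204961, 0.17065511, 0.46024666,
     0.3204961, 0.3726838, 0.3323675, -0.1611518, -0.07585493, -0.0294099],
]

def top_words(weights, id2token, n=3):
    """Return the n (word, weight) pairs with the largest absolute weight."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(id2token[i], weights[i]) for i in ranked[:n]]

for topic_id, weights in enumerate(lsi_topics):
    print(topic_id, top_words(weights, id2token))
```

The first topic is dominated by trees, graph and minors, and the second by system, user and eps, which matches the two groups of documents found earlier.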

2.5.1.1. Softmax Scaling

We may use softmax to transform the LSI weights of each topic into positive values that sum to 1.0.

[15]:

from scipy.special import softmax

w = softmax(lsi_model.get_topics(), axis=1)
w

[15]:

array([[0.07084467, 0.06946182, 0.06984687, 0.0715641 , 0.08128912,
        0.07169624, 0.0715641 , 0.0714729 , 0.06982486, 0.13622283,
        0.11544712, 0.10076537],
       [0.08833543, 0.08724021, 0.09144762, 0.09080894, 0.0781724 ,
        0.10442909, 0.09080894, 0.09567389, 0.09189339, 0.05609854,
        0.06109357, 0.06399799]])
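
For reference, softmax exponentiates every weight and divides by the sum of the exponentials. A minimal pure-Python sketch, applied to the first LSI topic from above, should reproduce the first row of this output up to rounding. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(values):
    """Map arbitrary real weights to positive values that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# first LSI topic, copied from the get_topics() output above
lsi_topic_0 = [0.04940859, 0.02969616, 0.03522417, 0.05951239, 0.1869311,
               0.06135723, 0.05951239, 0.05823724, 0.03490897, 0.70321089,
               0.53773148, 0.40171367]

probs = softmax(lsi_topic_0)
print([round(p, 4) for p in probs])
```

Note how softmax compresses the differences: the raw weight of trees is roughly 14 times that of computer, but after softmax the ratio is less than 2.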

[16]:

from wordcloud import WordCloud

def get_word_cloud_text(weights, id2token):
    # map each word to an integer frequency (its weight scaled by 100) for WordCloud
    d = {id2token[i]: int(w * 100.0) for i, w in enumerate(weights)}
    return d

def create_word_cloud(d):
    wc = WordCloud(background_color='black')
    wc.generate_from_frequencies(d)
    return wc

def plot_word_cloud(w, id2token):
    wc_texts = [get_word_cloud_text(w[r], id2token) for r in range(w.shape[0])]
    clouds = [create_word_cloud(text) for text in wc_texts]

    fig, axes = plt.subplots(1, 2, figsize=(20, 5))

    for ax, cloud in zip(axes, clouds):
        _ = ax.imshow(cloud, interpolation='bilinear')
        _ = ax.grid(False)
        _ = ax.axis('off')

    plt.tight_layout()

[17]:

plot_word_cloud(w, lsi_model.id2word.id2token)

2.5.1.2. Min-Max Scaling with Adjustments To Negatives

Another way might be to assign zero to all negative numbers and then perform min-max scaling.

[18]:

from sklearn import preprocessing

w = lsi_model.get_topics().copy()
w[w < 0] = 0.0
w = preprocessing.minmax_scale(w.T).T
w

[18]:

array([[0.02926799, 0.        , 0.0082077 , 0.04426959, 0.23345434,
        0.04700873, 0.04426959, 0.04237633, 0.00773971, 1.        ,
        0.75430468, 0.55235244],
       [0.63635383, 0.6092468 , 0.7115852 , 0.69635726, 0.37079055,
        1.        , 0.69635726, 0.80974798, 0.72215081, 0.        ,
        0.        , 0.        ]])
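
The same transformation can be written out by hand for one topic: negatives are set to zero, then every value is rescaled as (x - min) / (max - min). A minimal sketch on the second LSI topic from above; the function name clip_and_minmax is hypothetical.

```python
def clip_and_minmax(values):
    """Set negative weights to 0, then rescale the result to the range [0, 1]."""
    clipped = [max(v, 0.0) for v in values]
    lo, hi = min(clipped), max(clipped)
    if hi == lo:  # all values identical: avoid dividing by zero
        return [0.0 for _ in clipped]
    return [(v - lo) / (hi - lo) for v in clipped]

# second LSI topic, copied from the get_topics() output above
lsi_topic_1 = [0.29287972, 0.2804038, 0.32750471, 0.3204961, 0.17065511,
               0.46024666, 0.3204961, 0.3726838, 0.3323675, -0.1611518,
               -0.07585493, -0.0294099]

scaled = clip_and_minmax(lsi_topic_1)
print([round(s, 4) for s in scaled])
```

The largest weight (system) maps to 1.0 and the three negative weights (trees, graph, minors) all map to 0.0, so their relative ordering is lost.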

[19]:

plot_word_cloud(w, lsi_model.id2word.id2token)

2.5.2. LDA

The word weights per topic in LDA are easier to interpret: they are probabilities, so they add up to 1.0 per topic.

[20]:

w = lda_model.get_topics()
w

[20]:

array([[0.0889305 , 0.09007747, 0.08601815, 0.08386292, 0.05002798,
        0.15367272, 0.08598344, 0.11484789, 0.07876081, 0.07892839,
        0.05578829, 0.03310144],
       [0.04700249, 0.04509814, 0.05183834, 0.05541703, 0.11159794,
        0.06927942, 0.05189603, 0.06885701, 0.06388869, 0.12849924,
        0.16692205, 0.13970357]], dtype=float32)

[21]:

plot_word_cloud(w, lda_model.id2word.id2token)

2.6. Coherence Scores

Topic coherence is a way to judge the quality of topics via a single scalar value. There are many ways to compute the coherence score. For the u_mass and c_v options, a higher score is better. Note that u_mass is usually quoted as lying between -14 and 14 and c_v between 0 and 1.

-14 <= u_mass <= 14
0 <= c_v <= 1

The coherence score is computed as a pipeline of the following steps, whose results are then aggregated into a single number.

segmentation
probability estimation
confirmation measure
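
To show how these stages fit together, here is a rough sketch of a UMass-style score for a single topic. It is not gensim's implementation: the function name umass_coherence, the eps value, the use of the mean as the aggregation, and the toy documents are all assumptions made for illustration. It also assumes every topic word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents, eps=1e-12):
    """UMass-style coherence of one topic, given its top words in ranked order."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    # probability estimation: fraction of documents containing all given words
    def prob(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words)) / n_docs

    # segmentation: pair each word with every word ranked before it
    pairs = [
        (top_words[m], top_words[l])
        for m in range(1, len(top_words))
        for l in range(m)
    ]

    # confirmation measure: smoothed log conditional probability of each pair
    confirmations = [math.log((prob(w_m, w_l) + eps) / prob(w_l)) for w_m, w_l in pairs]

    # aggregation: average over all pairs
    return sum(confirmations) / len(confirmations)

# toy tokenized documents and topic (hypothetical)
toy_docs = [["graph", "trees"], ["graph", "minors", "trees"],
            ["graph", "minors"], ["system", "user"]]
toy_topic = ["graph", "trees", "minors"]

score = umass_coherence(toy_topic, toy_docs)
print(score)
```

Words that tend to appear in the same documents push the score towards 0, while words that rarely co-occur make it strongly negative.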

2.6.1. u_mass

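Note that with u_mass you pass a corpus through the corpus= argument. The cells below pass the transformed corpus (corpus=corpus_<model>). gensim documents this argument as a corpus in BoW format, so passing the original corpus instead is the documented usage and will produce different scores.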

[22]:

from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lsi_model, corpus=corpus_lsi, coherence='u_mass')
cm.get_coherence()

[22]:

-2.093259175449117

[23]:

cm = CoherenceModel(model=lda_model, corpus=corpus_lda, coherence='u_mass')
cm.get_coherence()

[23]:

-3.3492146807185965

2.6.2. c_v

Note that with c_v you have to pass in the tokenized texts via texts=texts.

[24]:

cm = CoherenceModel(model=lsi_model, texts=texts, coherence='c_v')
cm.get_coherence()

[24]:

0.3838413553737203

[25]:

cm = CoherenceModel(model=lda_model, texts=texts, coherence='c_v')
cm.get_coherence()

[25]:

0.3838413553737203

2.6.3. Optimal Number of Topics

2.6.3.1. LSI

[26]:

import pandas as pd

results = []

for t in range(2, 10):
    lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=t)
    corpus_lsi = lsi_model[corpus_tfidf]

    cm = CoherenceModel(model=lsi_model, corpus=corpus_lsi, coherence='u_mass')
    score = cm.get_coherence()
    tup = t, score
    results.append(tup)

results = pd.DataFrame(results, columns=['topic', 'score'])

[27]:

s = pd.Series(results.score.values, index=results.topic.values)
_ = s.plot()

2.6.3.2. LDA

[28]:

results = []

for t in range(2, 10):
    lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=t)
    corpus_lda = lda_model[corpus]

    cm = CoherenceModel(model=lda_model, corpus=corpus_lda, coherence='u_mass')
    score = cm.get_coherence()
    tup = t, score
    results.append(tup)

results = pd.DataFrame(results, columns=['topic', 'score'])

[29]:

s = pd.Series(results.score.values, index=results.topic.values)
_ = s.plot()
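
Besides reading the plot, the candidate with the highest coherence can be picked programmatically. The sketch below uses made-up (number of topics, score) pairs purely for illustration; they are not the output of the models above.

```python
# HYPOTHETICAL (num_topics, u_mass score) pairs, not results from this notebook
toy_results = [(2, -2.1), (3, -1.4), (4, -1.9), (5, -2.6)]

# higher coherence is better, so take the pair with the maximum score
best_topics, best_score = max(toy_results, key=lambda pair: pair[1])
print(best_topics, best_score)  # 3 -1.4
```

With a corpus as small as this one the scores can change noticeably between runs, so the plot is worth inspecting alongside the single best value.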

2.6.4. k-means

For LSI, we may also use k-means and the silhouette score. This score is between -1 and 1. A score towards -1 indicates that many documents sit closer to another cluster than to their own, a score near 0 indicates overlapping clusters, and a score towards 1 indicates compact, well-separated clusters.

-1 <= silhouette score <= 1

[30]:

from sklearn.metrics import silhouette_score

results = []

for t in range(2, 8, 1):
    lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=t)
    corpus_lsi = lsi_model[corpus_tfidf]

    X = np.array([[tup[1] for tup in arr] for arr in corpus_lsi])
    kmeans = KMeans(n_clusters=t, random_state=37).fit(X)

    score = silhouette_score(X, kmeans.labels_)

    tup = t, score
    results.append(tup)

results = pd.DataFrame(results, columns=['topic', 'score'])

[31]:

s = pd.Series(results.score.values, index=results.topic.values)
_ = s.plot()
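
For intuition, the silhouette score of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The overall score is the average over all points. Below is a minimal pure-Python sketch of that definition; the function name silhouette and the toy points are hypothetical, and the sketch assumes at least two clusters and no duplicate points.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two tight, well-separated groups (toy data)
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(toy_points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette(toy_points, [0, 1, 0, 1])   # groups deliberately mixed up
print(good, bad)
```

The sensible grouping scores close to 1, while the mixed-up labels give a negative score.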