How Computers “Understand”

Language is at the core of human communication, but teaching computers to “understand” it is a monumental challenge. Unlike humans, computers need to convert words and concepts into numbers to process them effectively. Today, we’ll look at a concept called word embeddings, where words are transformed into mathematical representations that capture their meaning, relationships, and context. But first, let’s explore how humans have historically organized and connected words.

Early Attempts at Organizing Language

In 1805, Peter Mark Roget began working on what would become one of the most influential tools in the English language: Roget’s Thesaurus. Unlike a dictionary that defines words, Roget’s Thesaurus organized words by concepts and ideas. For instance, words like “happy,” “joyful,” and “elated” would be grouped together not just as synonyms, but as part of a broader conceptual category of positive emotions. This hierarchical organization of language was revolutionary and, in many ways, presaged modern computational approaches to understanding language.

Fast forward to the digital age, where Princeton University’s WordNet project took Roget’s concept of organizing words by meaning and transformed it into a comprehensive digital database. WordNet is included in the Natural Language Toolkit (NLTK), a popular Python library for natural language processing (NLP).

NLTK

To imbue our Python notebooks with a touch of intelligence, we can use WordNet through NLTK. As mentioned above, NLTK is a widely used Python library for natural language processing: it provides a set of tools and datasets for processing, analyzing, and understanding human language data.

With NLTK, you can perform various NLP tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, sentiment analysis, and more (we’ll try a small example right after installing the library). It also offers access to a vast collection of language resources, including corpora, lexical resources, grammars, and pre-trained models. Long story short, NLTK lets you explore, analyze, and process textual data effectively, making it an invaluable asset whenever you deal with written language.

Installation is a straightforward process - we simply need to execute the following command:

! pip3 install nltk
Requirement already satisfied: nltk in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (3.9.1)
Requirement already satisfied: click in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from nltk) (2024.11.6)
Requirement already satisfied: tqdm in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (from nltk) (4.67.0)
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')
True
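As a quick illustration of the tasks listed above, here is a minimal sketch of tokenization and part-of-speech tagging with NLTK. The resource names are an assumption about your NLTK version: recent releases use punkt_tab and averaged_perceptron_tagger_eng, while older ones use punkt and averaged_perceptron_tagger.

import nltk

# Resource names vary by NLTK version; adjust these if the download fails.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

sentence = "NLTK makes it easy to tokenize and tag a sentence."
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
tags = nltk.pos_tag(tokens)            # attach a part-of-speech tag to each token
print(tokens)
print(tags)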

Synonyms and Lemmas

At the core of WordNet lies the concept of a synset (synonym set), similar to how Roget grouped words by meaning. These sets represent groups of synonymous words that express a particular concept. Let’s explore some examples:

You can access them using the following syntax: wordnet.synsets('hello').

wordnet.synsets('hello')
[Synset('hello.n.01')]

This word belongs to just one synonym set; other words have more complicated meanings:

wordnet.synsets('hi')
[Synset('hello.n.01'), Synset('hawaii.n.01')]

Note how it came up with Hawaii, because ‘HI’ is the abbreviation of ‘Hawaii’. It’s usually better to use longer terms to avoid such conflicts.

wordnet.synsets('bank')
[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

Every synset is composed of words, not in their original form, but in their lemma form. Lemmas are the base or canonical forms of words, representing their dictionary entries or headwords. For instance, the lemma of ‘beautifully’ is ‘beautiful’, and the lemma of ‘walking’ is ‘walk’. They serve as the common form for inflected or derived words, capturing the core meaning of a word and facilitating semantic analysis and language processing tasks. You can retrieve them using the .lemmas() syntax; to get a nice human-friendly version of a lemma, call .name() on it.

# depository_financial_institution.n.01
bank_synset1 = wordnet.synsets('bank')[1]
bank_synset1.lemmas()
[Lemma('depository_financial_institution.n.01.depository_financial_institution'),
 Lemma('depository_financial_institution.n.01.bank'),
 Lemma('depository_financial_institution.n.01.banking_concern'),
 Lemma('depository_financial_institution.n.01.banking_company')]
[lemma.name() for lemma in bank_synset1.lemmas()]
['depository_financial_institution',
 'bank',
 'banking_concern',
 'banking_company']
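NLTK can also go the other way, mapping an inflected form back to its lemma, via its WordNetLemmatizer. A minimal sketch (the pos argument tells it whether to treat the word as a noun, verb, etc.):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('walking', pos='v'))  # -> 'walk'
print(lemmatizer.lemmatize('banks', pos='n'))    # -> 'bank'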

We can also retrieve the definition of a synset using .definition():

bank_synset1.definition()
'a financial institution that accepts deposits and channels the money into lending activities'

You try it

Using a combination of wordnet.synsets() and synset.definition(), figure out all definitions of the word bank:

...
Ellipsis
Solution
for synset in wordnet.synsets("bank"):
    print(synset.definition())

# alternative
[ synset.definition() for synset in wordnet.synsets("bank")]
sloping land (especially the slope beside a body of water)
a financial institution that accepts deposits and channels the money into lending activities
a long ridge or pile
an arrangement of similar objects in a row or in tiers
a supply or stock held in reserve for future use (especially in emergencies)
the funds held by a gambling house or the dealer in some gambling games
a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
a container (usually with a slot in the top) for keeping money at home
a building in which the business of banking transacted
a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
tip laterally
enclose with a bank
do business with a bank or keep an account at a bank
act as the banker in a game or in gambling
be in the banking business
put into a bank account
cover with ashes so to control the rate of burning
have confidence or faith in
['sloping land (especially the slope beside a body of water)',
 'a financial institution that accepts deposits and channels the money into lending activities',
 'a long ridge or pile',
 'an arrangement of similar objects in a row or in tiers',
 'a supply or stock held in reserve for future use (especially in emergencies)',
 'the funds held by a gambling house or the dealer in some gambling games',
 'a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force',
 'a container (usually with a slot in the top) for keeping money at home',
 'a building in which the business of banking transacted',
 'a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)',
 'tip laterally',
 'enclose with a bank',
 'do business with a bank or keep an account at a bank',
 'act as the banker in a game or in gambling',
 'be in the banking business',
 'put into a bank account',
 'cover with ashes so to control the rate of burning',
 'have confidence or faith in']

The ‘net’ in WordNet comes from all the relations that are encoded between words and synsets.

Let’s try a couple:

beautiful = wordnet.synsets('beautiful')[0]
beautiful.lemmas()[0].antonyms()
[Lemma('ugly.a.01.ugly')]
dolphin = wordnet.synsets('dolphin')
dolphin
[Synset('dolphinfish.n.02'), Synset('dolphin.n.02')]

Do you see anything fishy??

dolphin[1].hypernyms()[0].definition()
'any of several whales having simple conical teeth and feeding on fish etc.'
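Antonyms and hypernyms are not the only relations. As a small sketch, here is how to ask for hyponyms, the more specific concepts that sit below a synset:

# hyponyms of the first 'dog' synset: specific kinds of dogs
dog = wordnet.synsets('dog')[0]
[hyponym.name() for hyponym in dog.hyponyms()[:5]]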

The Modern Era: Embeddings

While WordNet and Roget’s Thesaurus organize words through human-crafted hierarchies and relationships, modern NLP has moved toward learning these relationships automatically from data. Word embeddings represent the cutting edge of this approach, transforming words into dense numerical vectors that capture semantic relationships based on how words are actually used in text.

Word embeddings are at the heart of most modern applications in natural language processing, from creating basic word clouds to powering sophisticated models like ChatGPT and advanced translation systems. Unlike the discrete categories of Roget or the explicit relationships in WordNet, embeddings capture subtle semantic relationships in a continuous mathematical space.

The wiki-news-300d-50K model we’ll be using is a pre-trained Word2Vec embedding built from Wikipedia and news data. It provides 300-dimensional vector representations for 50,000 of the most common words, capturing their semantic meanings and relationships based on real-world usage. This compact and efficient model is ideal for exploring word similarities, analogies, and clustering tasks in natural language processing.

You can download it from the Virtual Campus.

%%time 

import gensim.models.keyedvectors as word2vec

# Load pre-trained Word2Vec model using Gensim.
model = word2vec.KeyedVectors.load_word2vec_format('./resources/wiki-news-300d-50K.vec')
CPU times: user 5.05 s, sys: 41 ms, total: 5.09 s
Wall time: 5.12 s

The %%time cell magic displays how long the cell takes to execute. If all went well, you should now have an embedding model of 50,000 words, each associated with a numerical representation as a list of 300 numbers.

model.vectors.shape
(50000, 300)

Remember, everything is pretrained. We can get the vector for a particular word as follows:

model.get_vector('hello')[:5]
array([-0.192 ,  0.1544,  0.0467,  0.0592,  0.1369], dtype=float32)
model.similar_by_word('king')[:5]
[('kings', 0.7969563603401184),
 ('queen', 0.763853907585144),
 ('monarch', 0.739997148513794),
 ('King', 0.7281951904296875),
 ('prince', 0.7132729887962341)]
 model.similarity('beautiful','carrot')
0.2784524
print(model.similarity('man','woman'))
#print(model.similarity('man','king'))
#print(model.similarity('man','queen'))
#print(model.similarity('man','potato'))
0.8164522

HARD: you could calculate this manually using the cosine between the two vectors:

\[ \text{cosine similarity} := \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} \]

Use the get_vector() command to get the embeddings of the words ‘man’ and ‘potato’. Then, use the np.dot() function and the np.linalg.norm() function to calculate the cosine similarity.

Solution
import numpy as np

v1 = model.get_vector('man')
v2 = model.get_vector('potato')

np.dot(v1, v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))
0.40033147

Sometimes the results are not what we would expect. Let’s try to find the fruits most similar to ‘orange’:

model.similar_by_word('orange')
[('yellow', 0.8086148500442505),
 ('purple', 0.7722597718238831),
 ('blue', 0.7472313046455383),
 ('red', 0.7455444931983948),
 ('pink', 0.7302511930465698),
 ('green', 0.7257395386695862),
 ('brown', 0.6846711039543152),
 ('maroon', 0.6534305810928345),
 ('grey', 0.6497707962989807),
 ('colored', 0.6493549346923828)]

Oops! It’s mostly finding colors …

To figure out how the model is “thinking”, we need to inspect similar items. Here’s what the following code does:

  1. Find the 100 most similar words and fetch their embedding vectors
  2. Compute the cosine distance between every pair of those words
  3. Position similar words close together on a 2D plane (t-SNE)
  4. Color similar words the same (k-means clustering)

Steps 2-4 individually are advanced ML concepts which would require a class to explain each, so we’ll take it for granted that they work here.

INPUT_WORD = 'orange'
NUMBER_OF_CLUSTERS = 5

# 0. We need a couple more packages for all the ML and plotting functions
import matplotlib.pyplot as plt
import sklearn
import umap.umap_ as umap
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import KMeans

# 1. find the similar words
top_similar_words = [w for w,s in model.similar_by_vector(INPUT_WORD,topn=100)]
top_embeddings    = [ model.get_vector(w) for w in top_similar_words]

# 2. Calculate distance from 1 word to another
distances       = cosine_distances(top_embeddings)

# 3. Find a positioning on a 2D screen based on the distances
method          = sklearn.manifold.TSNE()
embedding_in_2D = method.fit_transform(distances)

# 4. Cluster / group words together by cosine similarity
kmeans = KMeans(n_clusters=NUMBER_OF_CLUSTERS, random_state=42)
clusters = kmeans.fit_predict(embedding_in_2D)

# Plot 
plt.figure(figsize=(12,6))

ax = plt.gca()
for i in range(embedding_in_2D.shape[0]):
    ax.annotate( top_similar_words[i], xy=embedding_in_2D[i,:], ha='center', alpha=0.8,color=plt.cm.tab10(clusters[i] % 10), fontsize=12)

plt.xlim(embedding_in_2D.min(axis=0)[0],embedding_in_2D.max(axis=0)[0])
plt.ylim(embedding_in_2D.min(axis=0)[1],embedding_in_2D.max(axis=0)[1])
plt.axis('off');

Try it out with other input words …

Word math

One of the most astonishing results of word2vec models is that you can do word math. Let’s try to get the computer to guess what the capital of France is. The question could go something like this:

“Hello DSfBot, Madrid is to Spain as X is to France, what is X?”

Behind the scenes the math will look as follows:

x_madrid = model.get_vector('Madrid')
x_spain  = model.get_vector('Spain')
x_france = model.get_vector('France')

model.similar_by_vector(
    x_madrid - x_spain  + x_france
)
[('Paris', 0.830527663230896),
 ('Madrid', 0.7582901120185852),
 ('France', 0.7122387886047363),
 ('Toulouse', 0.6498643159866333),
 ('Lille', 0.6471526026725769),
 ('Strasbourg', 0.644611120223999),
 ('Lyon', 0.6408068537712097),
 ('Brussels', 0.6333270072937012),
 ('Marseille', 0.6331944465637207),
 ('Parisian', 0.6248301863670349)]
x1   = model.get_vector('king')
x2   = model.get_vector('male')
x3   = model.get_vector('female')

model.similar_by_vector(
    x1 - x2  + x3
)
[('king', 0.9448229074478149),
 ('queen', 0.7812791466712952),
 ('kings', 0.7464563250541687),
 ('monarch', 0.7043989300727844),
 ('King', 0.6892116665840149),
 ('prince', 0.6742870211601257),
 ('kingdom', 0.6693125367164612),
 ('princess', 0.6626783609390259),
 ('ruler', 0.6474557518959045),
 ('royal', 0.6442288160324097)]
x1  = model.get_vector('Paul')
x2   = model.get_vector('man')
x3 = model.get_vector('woman')
model.similar_by_vector(
    x1 - x2  + x3
)
[('Paul', 0.8640668392181396),
 ('Pauline', 0.6703044176101685),
 ('Peter', 0.6604458093643188),
 ('Linda', 0.6589093804359436),
 ('Catherine', 0.63918536901474),
 ('Christine', 0.6386603116989136),
 ('John', 0.6372238397598267),
 ('Stephanie', 0.6357173323631287),
 ('Annette', 0.6346656680107117),
 ('Susan', 0.6319258213043213)]
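gensim’s KeyedVectors can also express the same analogy directly through most_similar, which accepts lists of positive and negative words. A quick sketch, reusing the Madrid/Spain/France example from above:

# "Madrid is to Spain as X is to France": add the positives, subtract the negative
model.most_similar(positive=['Madrid', 'France'], negative=['Spain'], topn=5)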

When this result was first reported in the original word2vec paper, it was completely unexpected and left the world in awe … Do note that it is not infallible, especially with the low-resolution embeddings that we’re dealing with here.

You try it

You can do basic vector math to help the model on its way with our orange question. Try this procedure instead:

  1. find the vector embedding for ‘orange’ using model.get_vector()
  2. find the vector embedding for ‘fruit’ using model.get_vector()
  3. add the fruit vector to the orange vector, resulting in a [300,]-shaped array which represents their sum
  4. now query the model using model.similar_by_vector()

Note: you could instead also have subtracted ‘color’ from ‘orange’ :)

v_orange = ...
v_fruit  = ...
v_mean   = ...
model.similar_by_vector(v_mean)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[23], line 4
      2 v_fruit  = ...
      3 v_mean   = ...
----> 4 model.similar_by_vector(v_mean)

File ~/miniconda3/envs/dsfb/lib/python3.11/site-packages/gensim/models/keyedvectors.py:914, in KeyedVectors.similar_by_vector(self, vector, topn, restrict_vocab)
    890 def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
    891     """Find the top-N most similar keys by vector.
    892 
    893     Parameters
   (...)
    912 
    913     """
--> 914     return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

File ~/miniconda3/envs/dsfb/lib/python3.11/site-packages/gensim/models/keyedvectors.py:837, in KeyedVectors.most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
    835         keys.append(item)
    836     else:
--> 837         keys.append(item[0])
    838         weight[idx] = item[1]
    840 # compute the weighted average of all keys

TypeError: 'ellipsis' object is not subscriptable
Solution 1
v_orange = model.get_vector('orange')
v_fruit  = model.get_vector('fruit')

model.similar_by_vector(
    v_orange + v_fruit
)
[('fruit', 0.8935022950172424),
 ('orange', 0.8768476247787476),
 ('fruits', 0.7378386855125427),
 ('peach', 0.7249290943145752),
 ('juice', 0.709208607673645),
 ('apple', 0.6928902268409729),
 ('citrus', 0.6889742016792297),
 ('yellow', 0.6878279447555542),
 ('strawberry', 0.6786347031593323),
 ('purple', 0.6785171627998352)]
Solution 2
v_orange = model.get_vector('orange')
v_color  = model.get_vector('color')

model.similar_by_vector(
    v_orange - v_color
)
[('orange', 0.4103437066078186),
 ('carrot', 0.27398887276649475),
 ('banana', 0.2591733932495117),
 ('juice', 0.2586422562599182),
 ('grapefruit', 0.25596487522125244),
 ('bemused', 0.25220292806625366),
 ('lemon', 0.24836258590221405),
 ('cheeky', 0.2448141872882843),
 ('orchard', 0.24456676840782166),
 ('irate', 0.2444906383752823)]

We could also visualize what an “orange” plus “fruit” looks like:

INPUT_VECTOR = model.get_vector('orange') + model.get_vector('fruit')
NUMBER_OF_CLUSTERS = 5

# 0. We need a couple more packages for all the ML and plotting functions
import matplotlib.pyplot as plt
import sklearn
import umap.umap_ as umap
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import KMeans

# 1. find the similar words
top_similar_words = [w for w,s in model.similar_by_vector(INPUT_VECTOR,topn=50)]
top_embeddings    = [ model.get_vector(w) for w in top_similar_words]

# 2. Calculate distance from 1 word to another
distances       = cosine_distances(top_embeddings)

# 3. Find a positioning on a 2D screen based on the distances
method          = sklearn.manifold.TSNE()
embedding_in_2D = method.fit_transform(distances)

# 4. Cluster / group words together by cosine similarity
kmeans = KMeans(n_clusters=NUMBER_OF_CLUSTERS, random_state=42)
clusters = kmeans.fit_predict(embedding_in_2D)

# Plot 
plt.figure(figsize=(12,6))

ax = plt.gca()
for i in range(embedding_in_2D.shape[0]):
    ax.annotate( top_similar_words[i], xy=embedding_in_2D[i,:], ha='center', alpha=0.8,color=plt.cm.tab10(clusters[i] % 10), fontsize=12)

plt.xlim(embedding_in_2D.min(axis=0)[0],embedding_in_2D.max(axis=0)[0])
plt.ylim(embedding_in_2D.min(axis=0)[1],embedding_in_2D.max(axis=0)[1])
plt.axis('off');

Embeddings in the World

See slides.