Word2Vec

Disclaimer: To get the most out of this blog post, I encourage you to play around with this Colab Notebook while reading through.

This is my current understanding of the subject. I write this in the hope that it can be of help to some people and that it helps me learn the subject better. If you find anything wrong, please feel free to let me know.

At its core, Word2Vec is a technique used to teach computers the meaning of words, not just their spelling. It relies on the idea that words appearing in similar contexts share similar meanings. It translates words into lists of numbers (vectors) so that we can do math with them. We can think of the goal of Word2Vec as creating a map that contains the coordinates of the words: if we look up a certain word, we can find similar words near its coordinates.

    Intuition

To help us visualize Word2Vec, we can imagine having “points” (the words) inside a “map” (hyperspace) where words that are similar are closer to each other. For example, the words “King”, “Queen”, “Prince”, and “Princess” are in the same cluster, while the words “car”, “oil”, and “gasoline” are in another cluster far from the first one.

Our goal is to take a large corpus, for example the whole Game of Thrones series, and let the machine learn the meaning of the words in the books and create the word “map”.

    How do we do this?

    Preparing the data

    1. First we need to have a list of all the sentences (corpus). For example:
    sentences = [
    	"the king loves the queen",
    	"the queen loves the king",
    	"the king eats meat",
    	"the queen drinks wine",
    	"sheep chews grass",
    ]
    

2. We get all the unique words in the list of sentences, then we assign each one a number. So from our example above, we get the following:

    unique_words = {
    	"the": 0,
    	"king": 1,
    	"loves": 2,
    	"queen": 3,
    	"eats": 4,
    	"meat": 5,
    	"drinks": 6,
    	"wine": 7,
    	"sheep": 8,
    	"chews": 9,
    	"grass": 10,
    }
    

3. Then we set up the sentences so that our computer can work with them. Basically, we convert the sentences to numbers (a small code sketch of steps 2 and 3 follows the example below).

    converted_sentences = [
    	[0, 1, 2, 0, 3], # The (0) king (1) loves (2) the (0) queen (3)
    	[0, 3, 2, 0, 1],
    	[0, 1, 4, 5],
    	[0, 3, 6, 7],
    	[8, 9, 10]
    ]
    
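For reference, here is one way steps 2 and 3 could be done in code. This is just a small sketch and may differ from the notebook:

# Build the vocabulary: assign each unique word a number.
unique_words = {}
for sentence in sentences:
	for word in sentence.split():
		if word not in unique_words:
			unique_words[word] = len(unique_words)

# Convert each sentence into a list of word numbers.
converted_sentences = [
	[unique_words[word] for word in sentence.split()]
	for sentence in sentences
]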

4. We can often tell the meaning of a word by the words around it. Surrounding words (context words) provide hints that help us infer the meaning of the word in the middle (the center word) without using a dictionary. So we want to create a list (dataset) of pairs containing a center word and a context word. We can do this with the “Sliding Window” technique: we iterate through the words in the sentences, pick the center word, then collect its context words. It’s easier to show with the example sentences above.

    “the king loves the queen”

First we pick the center word, “the”, then look at both of its sides. Since this is the first word, we don’t get a context word from the left. But we do get the word “king” on its right. So we get this:

    (the, king)
    

Then we move to the next word, “king”. We look at its left and get the context word “the”. Then we look at its right and get the context word “loves”. So for this iteration we get these pairs:

    (king, the)
    (king, loves)
    

Then we repeat this process until the end of the sentence. After the first sentence we get these pairs:

    (the, king)
    (king, the)
    (king, loves)
    (loves, king)
    (loves, the)
    (the, loves)
    (the, queen)
    (queen, the)
    

Remember that the format is (center word, context word), and in practice these should really be pairs of numbers (using converted_sentences). I only used the actual words above so it’s easier to visualize. We then repeat the process through all of the sentences. By the end, we will have a long list of number pairs (a code sketch of the sliding window follows the list below).

    [
    	(0, 1),
    	(1, 0),
    	(1, 2),
    	(2, 1),
    	(2, 0),
    	(0, 2),
    	(0, 3),
    	(3, 0),
    	# ... rest of the pairs from the rest of the sentences.
    ]
    
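Here is a rough sketch of the sliding window in code, assuming a window of one word on each side (WINDOW_SIZE = 1), as in the walkthrough above:

WINDOW_SIZE = 1
pairs = []

for sentence in converted_sentences:
	for i, center_id in enumerate(sentence):
		# Look WINDOW_SIZE words to the left and right of the center word.
		for j in range(max(0, i - WINDOW_SIZE), min(len(sentence), i + WINDOW_SIZE + 1)):
			if j != i:
				pairs.append((center_id, sentence[j]))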

    Creating the “map”

1. First we need to create the “map” (hyperspace) for our words. Unlike a traditional map, which is only in 2D (x, y coordinates), or even a 3D space, our “word map” resides in a “hyperspace” with many dimensions (EMBED_DIMENSIONS). We can think of EMBED_DIMENSIONS as the “space” or “room” for our machine to learn. For example, if we set it to 1, the machine might learn “royalty”. If we set it to 2, it might learn “royalty”, then “gender”. This is an oversimplification of EMBED_DIMENSIONS to help us visualize what it is. Just think of EMBED_DIMENSIONS as the number of abstract features that the machine will learn from the words.
import numpy as np

# VOCABULARY_SIZE is the number of our unique words.
word_map_hyperspace = np.random.uniform(-0.5, 0.5, (VOCABULARY_SIZE, EMBED_DIMENSIONS))
    

We represent our “map” as a matrix with dimensions VOCABULARY_SIZE x EMBED_DIMENSIONS. So using our example, and EMBED_DIMENSIONS = 2, our map matrix (word_map_hyperspace) will look like this:

[
	[ 0.20993745, -0.02032706], # Think of this as the coordinates for the word "the"
	[-0.01980316, -0.22844879], # "king"
	[ 0.34120067, -0.46427962], # "loves"
	[ 0.01062148, -0.21420324], # "queen"
	[-0.08416039,  0.26910854], # "eats"
	[-0.28808131,  0.05186818], # ...
	[-0.04058687, -0.36510978],
	[-0.31803735,  0.48589141],
	[ 0.34898348, -0.3618558 ],
	[-0.1512221 , -0.33315531],
	[ 0.35733597,  0.47376461]
]
    

Initially, we randomly place the words in the “map” (hyperspace). This makes sense because initially our computer doesn’t know anything about the words.
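As a tiny illustration (not from the notebook), we can read a word’s current “coordinates” by indexing the map with the word’s number:

# The row for "king" is its current position in the hyperspace.
king_coordinates = word_map_hyperspace[unique_words["king"]]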

    2. Then we need to create another matrix which we will call the context matrix. Remember that each of the words plays two roles: as a center word and as a context word.

The first matrix we created above (the map matrix) is the identity of the word itself (it represents the qualities of the word). This second matrix represents what usually hangs around the center word. In our “space” analogy, we can think of each row of the context matrix as a gravity well that pulls in the center words that appear around it.

Imagine a context word “kingdom”. Its row context_matrix["kingdom"] creates a gravity well at a specific coordinate. During training, when the model comes across the word “King” next to it, we ask: is “King” close to the center of the gravity well context_matrix["kingdom"]? The answer should be yes, so the model strengthens that gravity (attraction).

Over time, our gravity well context_matrix["kingdom"] effectively pulls the words “Queen”, “Prince”, and “Princess” into the same neighborhood.

# This is how we initialize the context_matrix.
# It has the same size as the word map matrix, and we also place the gravity wells randomly.
    context_matrix = np.random.uniform(-0.5, 0.5, (VOCABULARY_SIZE, EMBED_DIMENSIONS))
    

    Before we start discussing how to correct the map, I want to quickly go over a statistical method called “Maximum Likelihood Estimation”.

    Maximum Likelihood Estimation (MLE)

    Maximum Likelihood Estimation (MLE) is a core statistical method to find the best parameters for a probability model by selecting values that make the observed data most probable, essentially finding the model configuration that best “fits” or “explains” the data.

    To help us visualize what MLE does, we can think of this “Detective Analogy”. Imagine we are a detective.

• The Evidence (Corpus): We found a photo of “King” and “Queen” together. This event definitely happened, so its probability is 1 (100%).
• The Theory (The Map): When we look at the map, it shows that “King” lives in Antarctica and “Queen” lives in Paris.

MLE Logic: If our map is correct, the likelihood of them being together is very low (say 0.00001%). But we know that the evidence is real, so our map must be wrong.

    So our course of action is to correct the map (move the king or the queen or both of them) until the likelihood of them being together in a photo becomes high.

    How MLE creates Clusters

    It is important to know that MLE doesn’t just move two words. It forces logical consistency across all the words.

Imagine the corpus has these two sentences:

    1. “The King rules”
    2. “The Queen rules”

    Step 1: MLE fixes Sentence 1

    • MLE sees “King” and “Rules” together.
    • It forces the coordinates of “King” to move closer to “Rules”.

    Step 2: MLE fixes Sentence 2

    • MLE sees “Queen” and “Rules” together.
    • It forces the coordinates of “Queen” to move closer to “Rules”.

    Step 3: The mathematical consequence. If “King” is close to “Rules” and “Queen” is close to “Rules”, then by necessity, “King” MUST be close to “Queen”.

    We can see that MLE didn’t “know” that “King” and “Queen” were synonyms. It just tried to maximize the probability of them both being neighbors with “Rules”. The clustering is the inevitable side effect of satisfying MLE.

    Negative Sampling

You will see in the section below that we use the Sigmoid and Binary Cross Entropy functions. We are only able to use them because we are using the Negative Sampling technique. Without negative sampling, we would be taking the dot product with all of the words (imagine if our vocabulary had 100,000 words) and would have to update 100,000 vectors (compute 100,000 gradients) at each step.

Basically, we get our “positive” pairs from what we built above. Then we also create “negative” pairs, where we know for sure that the center word is paired with a randomly picked word that is NOT the context word, e.g. (king, sheep), (queen, grass).

So using negative sampling, we changed the problem from “Which word is the correct one out of the entire vocabulary?” to “Is this the correct word?”. The new problem can be answered with just “Yes” or “No”, making it so much simpler.
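Here is a rough sketch of how the negative pairs could be generated, assuming pairs is the list of (center_id, context_id) tuples we built earlier and NEGATIVE_SAMPLES is how many “wrong” words we draw per positive pair (both names are mine, not from the notebook):

import random

NEGATIVE_SAMPLES = 2
negative_pairs = []

for center_id, context_id in pairs:
	for _ in range(NEGATIVE_SAMPLES):
		# Pick a random word that is NOT the actual context word.
		negative_id = random.randrange(VOCABULARY_SIZE)
		while negative_id == context_id:
			negative_id = random.randrange(VOCABULARY_SIZE)
		negative_pairs.append((center_id, negative_id))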

    Correcting the map (training)

    1. We take two words, say “King” (center word) and “Queen” (context word). We want to know how close or similar these words are. To do this we use the Dot product.
    z = np.dot(king_vector, queen_vector)
    

The score (z) is a raw number (scalar) like 5, 10, 0, -5, etc. The problem with this score is that we can’t do probability math with it. Probabilities must be between 0 and 1 (0% to 100%).

2. To resolve our problem with the score (z), we feed it into the Sigmoid function to get a probability (p).

\sigma = p = \frac{1}{1 + e^{-z}}

    This function “squashes” the score into a number in the range of 0 to 1.
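In code, the sigmoid can be written as a small helper using numpy (the function name is mine):

def sigmoid(z):
	# Squashes any raw score into a value between 0 and 1.
	return 1.0 / (1.0 + np.exp(-z))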

3. Then we want to measure how “wrong” (the error) our score is. To do this, we use what we call “Binary Cross Entropy”. It sounds fancy, but it’s really just the negative log-likelihood from Maximum Likelihood Estimation. I’ll write more about this in another blog post, but the core idea is still what I discussed above in MLE.

    The formula for “Binary Cross Entropy” is

J = -[y \ln(p) + (1 - y) \ln(1 - p)]
    # y is the target
    # p is the probability (output of sigmoid)
    
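And as a tiny code sketch (again, the function name is mine):

def binary_cross_entropy(y, p):
	# y is the target: 1 for a real (positive) pair, 0 for a negative pair.
	# p is the probability (output of sigmoid).
	return -(y * np.log(p) + (1 - y) * np.log(1 - p))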

4. Now that we know how “wrong” we are, we have all the tools we need to adjust the coordinates of the words in the map (word_map_hyperspace) and in the context_matrix.

Remember that in the MLE section above, we want to correct the map so that the likelihood of the evidence (the corpus) becomes high. Increasing this likelihood is the same as decreasing (minimizing) the error (the Binary Cross Entropy function). To minimize a function we use an optimization algorithm called Gradient Descent. To do this we need the gradient, and to get the gradient we do some calculus. Again, I’ll dive deeper into this in another post, but basically the gradient descent update is:

# Gradient descent
# This updates the "coordinates" a little bit to make them a bit more accurate.
    vector_center_word = vector_center_word - (LEARNING_RATE * gradient_vector_center_word)
    vector_context_word = vector_context_word - (LEARNING_RATE * gradient_vector_context_word)
    

Here is a recap of all the formulas we used:

z = np.dot(v_{center}, u_{context})
p = \frac{1}{1 + e^{-z}}
J = -[y \ln(p) + (1 - y) \ln(1 - p)]

    To compute the gradients:

\frac{\partial{J}}{\partial{v_{center}}} = \frac{\partial{J}}{\partial{p}} \cdot \frac{\partial{p}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{v_{center}}}
\frac{\partial{J}}{\partial{p}} = -\frac{y}{p} + \frac{1-y}{1-p} = \frac{p-y}{p \cdot (1-p)}
\frac{\partial{p}}{\partial{z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = p \cdot (1-p)
\frac{\partial{z}}{\partial{v_{center}}} = u_{context}

Putting them all together:

\frac{\partial{J}}{\partial{v_{center}}} = \frac{p-y}{\cancel{p \cdot (1-p)}} \cdot \cancel{p \cdot (1-p)} \cdot u_{context}
\frac{\partial{J}}{\partial{v_{center}}} = (p - y) \cdot u_{context}

Then, to get the gradient for u_{context}:

\frac{\partial{J}}{\partial{u_{context}}} = (p - y) \cdot v_{center}

5. Then we just repeat the training over the dataset until we get a good (acceptable) map. A rough sketch of what a single training pass could look like is shown below.
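To tie the pieces together, here is a minimal sketch of one training pass. I’m assuming training_examples is a list of (center_id, context_id, y) tuples built from the positive pairs (y = 1) and negative pairs (y = 0) above; the names are mine, not from the notebook:

LEARNING_RATE = 0.05

for center_id, context_id, y in training_examples:
	v_center = word_map_hyperspace[center_id]
	u_context = context_matrix[context_id]

	# Score and probability.
	z = np.dot(v_center, u_context)
	p = sigmoid(z)

	# Gradients from the derivation above.
	gradient_vector_center_word = (p - y) * u_context
	gradient_vector_context_word = (p - y) * v_center

	# Gradient descent: nudge both vectors to reduce the error.
	word_map_hyperspace[center_id] = v_center - LEARNING_RATE * gradient_vector_center_word
	context_matrix[context_id] = u_context - LEARNING_RATE * gradient_vector_context_word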

What’s next? Further improvements.

• Use matrix multiplication instead of for loops. This will decrease the training time. I only used for loops to make it easier to follow what’s happening.
• Common words such as “the”, “a”, “is”, etc. appear much more often than rare words like “king” and “queen”. This adds a lot of noise and makes training more difficult. You can research ways to handle this.
• The “center word” + “context word” architecture we used is called Skip-gram. There’s another architecture called CBOW (Continuous Bag of Words) that does the opposite: we feed the machine the context words (“the”, “loves”, “the”, “queen”) and ask it to guess the missing center word (“king”). I recommend trying to build Word2Vec using this architecture.
    • Use a bigger corpus for training.
• And more. Just search “Word2Vec” on Google and you’ll find more concepts and techniques to discover and play around with.