Understanding GloVe Vectors

Many articles explain what word vectors are and how they are used. Here I focus on how GloVe vectors are calculated and on the equations behind them. The authors' motivation for GloVe was to build a model that uses the global word-word co-occurrence counts and so makes efficient use of corpus statistics. The resulting model is a weighted least squares regression, with weights that depend on the word-word co-occurrence counts. The regression equation they use is

tr(wi)*w~k + bi + b~k = log(Xik)

and the cost function is weighted least squares

J = sum over i,j of f(Xij)*( tr(wi)*w~j + bi + b~j - log(Xij))^2

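Here Xij is the word-word co-occurrence count, i.e. the number of times word j occurs in the context of word i, f is a weighting function of that count, and tr() denotes the transpose. In the paper this cost is minimized with the AdaGrad optimizer. Training gives two sets of vectors, wi and w~i. The final word vector is just the sum of the two, wi + w~i.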
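To make the cost function concrete, here is a minimal NumPy sketch of it together with a toy training loop. Some assumptions on my part: the weighting function f uses the form and default values given in the GloVe paper (x_max = 100, alpha = 0.75), the names weight, glove_cost and train_glove are only illustrative, the counts in X_toy are made up, and the loop runs full-batch AdaGrad over a small dense matrix, which is a simplification for readability and not how training on a real corpus would be done.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max)^alpha below x_max, and 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_tilde, b, b_tilde, X):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    mask = X > 0
    log_X = np.log(np.where(mask, X, 1.0))  # placeholder log(1) = 0 where X_ij = 0
    err = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    return np.sum(np.where(mask, weight(X) * err ** 2, 0.0))

def train_glove(X, dim=5, lr=0.05, epochs=200, seed=0):
    """Toy full-batch AdaGrad on the GloVe cost; returns w_i + w~_i and the cost history."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    params = [rng.normal(scale=0.1, size=(V, dim)),  # W
              rng.normal(scale=0.1, size=(V, dim)),  # W_tilde
              np.zeros(V),                           # b
              np.zeros(V)]                           # b_tilde
    # AdaGrad accumulators, started at 1 (an assumption) so early steps stay bounded
    grad_sq = [np.ones_like(p) for p in params]
    mask = X > 0
    log_X = np.log(np.where(mask, X, 1.0))
    fX = np.where(mask, weight(X), 0.0)  # zero counts contribute nothing
    history = []
    for _ in range(epochs):
        W, Wt, b, bt = params
        err = W @ Wt.T + b[:, None] + bt[None, :] - log_X
        g = 2.0 * fX * err  # derivative of J with respect to each prediction
        grads = [g @ Wt, g.T @ W, g.sum(axis=1), g.sum(axis=0)]
        for p, gr, acc in zip(params, grads, grad_sq):
            acc += gr ** 2
            p -= lr * gr / np.sqrt(acc)
        history.append(glove_cost(*params, X))
    W, Wt, _, _ = params
    return W + Wt, history

# Made-up symmetric co-occurrence counts for a 4-word vocabulary
X_toy = np.array([[0, 4, 2, 1],
                  [4, 0, 3, 1],
                  [2, 3, 0, 5],
                  [1, 1, 5, 0]], dtype=float)
vectors, history = train_glove(X_toy)
print(vectors.shape, history[0], history[-1])
```

The last line of train_glove returns the sum wi + w~i, which is the combination described above. Pairs with a zero count are masked out, since log(0) is undefined and f(0) = 0 anyway.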

Having stated the equation used to obtain GloVe vectors, let's see how it is derived. The authors showed that it makes more sense to work with the ratio of co-occurrence probabilities Pik/Pjk than with the probabilities themselves, because this ratio is close to one when the words i and j are either both related to k or both unrelated to it. When i is related to k but j is not, the ratio is large, and when j is related to k but i is not, the ratio is small. Thus this simple ratio tells us a lot about how the words relate to each other. Here Pij is calculated as Xij/Xi, where Xi is the sum of Xij over all j. So, in effect, we need to find a function F which maps the word vectors to this ratio, i.e.
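A small sketch may help to see how the ratio behaves. The counts below are invented purely for illustration (they are not taken from any corpus), and the helper names cooccurrence_prob and prob_ratio are my own. The words follow the ice/steam example used in the GloVe paper.

```python
import numpy as np

words = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: n for n, w in enumerate(words)}

# Invented symmetric co-occurrence counts X_ij (illustrative only)
X = np.array([
    [  0,  10,  80,   4, 100,   1],  # ice
    [ 10,   0,   4,  80, 100,   1],  # steam
    [ 80,   4,   0,   5,  20,   1],  # solid
    [  4,  80,   5,   0,  20,   1],  # gas
    [100, 100,  20,  20,   0,   2],  # water
    [  1,   1,   1,   1,   2,   0],  # fashion
], dtype=float)

def cooccurrence_prob(X):
    """P_ij = X_ij / X_i, where X_i is the sum of row i."""
    return X / X.sum(axis=1, keepdims=True)

def prob_ratio(P, i, j, k):
    """Ratio P_ik / P_jk for words i, j and probe word k."""
    return P[idx[i], idx[k]] / P[idx[j], idx[k]]

P = cooccurrence_prob(X)
for k in ["solid", "gas", "water", "fashion"]:
    print(k, prob_ratio(P, "ice", "steam", k))
```

With these made-up counts the ratio is well above one for "solid", well below one for "gas", and about one for both "water" (related to both words) and "fashion" (related to neither).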

F(wi,wj,w~k) = Pik/Pjk

Then they simplify the equation by assuming that F depends only on the difference wi - wj

F(wi - wj,w~k) = Pik/Pjk

The right-hand side is a scalar while the arguments of F are vectors, so to make the argument of F a scalar as well they take the dot product of the vectors

F(tr(wi-wj)*w~k) = Pik/Pjk

Since the word-word co-occurrence matrix makes no distinction between a word and a context word, the model should not change when we exchange w with w~ and X with tr(X). To make this possible, F is required to satisfy

F(tr(wi-wj)*w~k) = F(tr(wi)*w~k)/ F(tr(wj)*w~k)

which is solved by F(tr(wi)*w~k) = Pik = Xik/Xi.

The solution to F is F = exp, so

tr(wi)*w~k = log(Pik) = log(Xik) - log(Xi)
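As a quick numerical sanity check of the step F = exp, the sketch below tests the property F(tr(wi-wj)*w~k) = F(tr(wi)*w~k)/F(tr(wj)*w~k) on random vectors. The helper name satisfies_ratio_property and the counterexample function are my own choices for illustration; a numerical check on a few random vectors is of course not a proof.

```python
import numpy as np

def satisfies_ratio_property(F, w_i, w_j, w_k):
    """Check F((w_i - w_j) . w_k) == F(w_i . w_k) / F(w_j . w_k) up to rounding."""
    lhs = F((w_i - w_j) @ w_k)
    rhs = F(w_i @ w_k) / F(w_j @ w_k)
    return bool(np.isclose(lhs, rhs))

rng = np.random.default_rng(0)
w_i, w_j, w_k = rng.normal(size=(3, 4))  # three random 4-dimensional vectors

exp_ok = satisfies_ratio_property(np.exp, w_i, w_j, w_k)
# An arbitrary positive function that is not an exponential, for contrast
other_ok = satisfies_ratio_property(lambda x: 1.0 + x ** 2, w_i, w_j, w_k)
print(exp_ok, other_ok)
```

The exponential passes because exp(a - b) = exp(a)/exp(b) for any real a and b, which is exactly the property required of F.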

Finally, log(Xi) is independent of k, so it can be absorbed into a bias term bi, and we add a bias term b~k so that the equation is symmetric

tr(wi)*w~k + bi + b~k = log(Xik).

This is the equation we set out to obtain, and it is our regression equation. This is how the GloVe vectors are obtained. Please comment if you see errors in my post. Suggestions are welcome if you want me to cover a topic or want to suggest changes to this post.
