Word Embedding

KESHAV SARDA
Dec 14, 2022


Word embedding is a technique used in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers. This allows words with similar meanings to have similar representations in the vector space, making it easier for the model to understand the meaning of words and their relationship to each other. Word embeddings are typically learned from large amounts of text data and can be used in a variety of NLP tasks such as language translation and text classification.

Overall, word embedding is a powerful technique for representing natural language data in a way that is easily understood by machine learning models. It has become a standard component of many NLP systems and has greatly improved the performance of these systems on a variety of tasks.

TF-IDF

TF-IDF is a technique used in natural language processing (NLP) to measure the importance of a word in a document relative to a collection of documents. It stands for “Term Frequency — Inverse Document Frequency” and is a way to score the relevance of a word to a document in a corpus.

TF-IDF is calculated by multiplying the term frequency (TF) of a word by the inverse document frequency (IDF) of the word. Term frequency is the number of times a word appears in a document, while inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents containing the word.

The resulting TF-IDF score for a word indicates how important the word is to a particular document in the corpus. Words that have a high TF-IDF score in a document are considered to be relevant to the content of the document, while words with a low TF-IDF score are less relevant.

TF-IDF is commonly used in information retrieval and text mining to determine the most important words in a document and to retrieve relevant documents based on a query. It is also used in text summarization to identify the most important sentences or phrases in a document.

TF: Term Frequency

Term frequency, or TF, is a measure used in natural language processing (NLP) to determine how important a word is within a single document. It is calculated as the number of times the word appears in the document divided by the total number of words in the document, giving a score between 0 and 1 (some variants instead normalize by the maximum term frequency in the document).

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency

Inverse document frequency, or IDF, is a measure used in natural language processing (NLP) to determine the importance of a word in a corpus of documents. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the word. This value is often multiplied by the term frequency (TF) of a word to give the TF-IDF score, which is a measure of the importance of the word in a document relative to the corpus.

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

Python Implementation

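The code snippet embedded in the original post is not reproduced here. As a stand-in, the following is a minimal sketch that implements the TF and IDF formulas above in plain Python; the toy corpus and function names are illustrative, not part of the original.

```python
import math

# Toy corpus: each document is a list of terms (illustrative only)
documents = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]

def tf(term, document):
    # TF(t) = (times term t appears in the document) / (total terms in the document)
    return document.count(term) / len(document)

def idf(term, corpus):
    # IDF(t) = log_e(total number of documents / number of documents containing t)
    docs_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

for term in ["sky", "sun", "is"]:
    print(term, [round(tf_idf(term, doc, documents), 3) for doc in documents])
```

Note that a word such as "is", which appears in every document, gets an IDF of log(3/3) = 0 and therefore a TF-IDF score of 0 everywhere, reflecting that it carries no discriminating information for this corpus.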

Bag of words (BOW)

A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

Let’s understand this with an example. Suppose we wanted to vectorize the following:

the cat sat

the cat sat in the hat

the cat with the hat

We’ll refer to each of these as a text document.

Step 1: Determine the Vocabulary

We first define our vocabulary, which is the set of all words found in our document set. The only words that are found in the 3 documents above are:

{ the, cat, sat, in, hat, with }
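As a quick sketch, the vocabulary can be built with a set comprehension in Python (the variable names here are just illustrative):

```python
documents = [
    "the cat sat",
    "the cat sat in the hat",
    "the cat with the hat",
]

# The vocabulary is the set of unique words across all documents
vocabulary = sorted({word for doc in documents for word in doc.split()})
print(vocabulary)  # ['cat', 'hat', 'in', 'sat', 'the', 'with']
```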

Step 2: Count

To vectorize our documents, all we have to do is count how many times each word appears:

Example of Count Vectorizer (word counts per document):

Document                  the  cat  sat  in  hat  with
the cat sat                1    1    1   0    0    0
the cat sat in the hat     2    1    1   1    1    0
the cat with the hat       2    1    0   0    1    1

Python Implementation

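The code embedded in the original post is not shown here. One possible sketch uses scikit-learn's CountVectorizer, which produces the same counts as the table above (this assumes scikit-learn is installed; note that it orders the vocabulary alphabetically):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat",
    "the cat sat in the hat",
    "the cat with the hat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['cat' 'hat' 'in' 'sat' 'the' 'with']
print(counts.toarray())                    # one row of word counts per document
```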

Challenges with the bag of words and TF-IDF

In BOW, the size of the vector is equal to the number of elements in the vocabulary. If most of the values in the vector are zero, the bag-of-words representation becomes a sparse matrix. Sparse representations are harder to model, both for computational and for informational reasons.

Also, in BOW there are no meaningful relations between words and no consideration of word order. Several other issues add to the challenges of this word embedding technique:

  • Massive number of weights: large input vectors mean a massive number of weights for a neural network.
  • No meaningful relations or consideration of word order: the bag of words does not consider the order in which words appear in a sentence or text.
  • Computationally intensive: with more weights comes the need for more computation to train and predict.

While the TF-IDF model does capture which words are more important and which are less so, it does not solve the challenge of high dimensionality and sparsity, and, like BOW, it makes no use of semantic similarities between words.

Word2vec

The Word2vec method was developed by researchers at Google in 2013 and is still widely used for many natural language processing (NLP) problems. It was designed for training word embeddings and is based on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings.

Word2vec models are shallow neural networks with an input layer, a projection layer, and an output layer. They reconstruct the linguistic context of words by considering both the preceding and the following words. The method iterates over a corpus of text to learn the associations between words, relying on the hypothesis that neighboring words in a text are semantically similar to each other. This helps map semantically similar words to geometrically close embedding vectors.

Word2vec uses the cosine similarity metric to measure semantic similarity. Cosine similarity is equal to cos(θ), where θ is the angle between the vector representations of two words or documents.

Cosine Similarity = cos(θ) = (A · B) / (||A|| ||B||)

So, if the cosine similarity is 1 (an angle of 0°), the two vectors point in the same direction, meaning the words appear in essentially the same contexts.

And if the angle is a right angle (90°, a cosine of 0), the words hold no contextual similarity and are independent of each other.
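A short sketch of the cosine similarity computation with NumPy; the three-dimensional vectors here are made-up toy embeddings, purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" (invented values, for illustration only)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.9, 0.05])
apple = np.array([0.1, 0.05, 0.95])

print(cosine_similarity(king, queen))  # close to 1: very similar contexts
print(cosine_similarity(king, apple))  # much lower: little contextual similarity
```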

Word2Vec has two neural network-based variants:

  • Continuous Bag of Words (CBOW)
  • Skip-gram

1. CBOW

The continuous bag of words variant takes the surrounding context words as inputs to the neural network model and predicts the target word that best fits that context. It is fast and a great way to find better numerical representations for frequently occurring words.

Example sentence: I MOSTLY LISTEN ARIJIT SINGH SONGS

Window size = 5

In CBOW, we define a window size. The middle word is the current word and the surrounding words (past and future words) are the context. CBOW utilizes the context to predict the current words. Each word is encoded using One Hot Encoding in the defined vocabulary and sent to the CBOW neural network.

The hidden layer is a standard fully connected dense layer. The output layer generates probabilities for the target word from the vocabulary.

CBOW Architecture
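To make this architecture concrete, here is a minimal PyTorch sketch of a CBOW-style model. This is a simplified illustration rather than the original Word2vec implementation, and the class and variable names are my own:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Projection layer: equivalent to multiplying one-hot vectors by a weight matrix
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Output layer: produces a score for every word in the vocabulary
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the surrounding words
        embedded = self.embeddings(context_ids)  # (batch, context_size, embedding_dim)
        averaged = embedded.mean(dim=1)          # average the context word vectors
        return self.output(averaged)             # (batch, vocab_size) logits for the target word

model = CBOW(vocab_size=10_000, embedding_dim=100)
context = torch.randint(0, 10_000, (4, 4))  # 4 samples, 4 context words each (window size 5)
print(model(context).shape)                 # torch.Size([4, 10000])
```

A softmax over the output logits gives the probability distribution over the vocabulary for the target word.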

2. Skip-gram

Skip-gram is a slightly different word embedding technique in comparison to CBOW as it does not predict the current word based on the context. Instead, each current word is used as an input to a log-linear classifier along with a continuous projection layer. This way, it predicts words in a certain range before and after the current word.

This variant takes only one word as an input and then predicts the closely related context words. That is the reason it can efficiently represent rare words.

Comparison of CBOW and Skip-gram architecture

Python Implementation
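The notebook embedded in the original post is not reproduced here. As a stand-in, a minimal sketch using the gensim library (this assumes gensim 4.x is installed; the tiny corpus is purely illustrative, and real training requires far more text):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 trains the CBOW variant, sg=1 trains Skip-gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["cat"][:5])               # first few dimensions of the "cat" vector
print(skipgram_model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```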

GloVe: Global Vectors for Word Representation

GloVe (Global Vectors for Word Representation) is a word embedding technique developed by Stanford researchers in 2014. It is similar to other word embedding methods, such as Word2Vec, in that it maps words to vectors of real numbers in a high-dimensional space. However, unlike Word2Vec, which uses a shallow neural network to learn word vectors, GloVe learns them by factorizing a word-word co-occurrence matrix built from the corpus.

The GloVe model learns word vectors that capture both semantic and syntactic relationships between words. For example, the vectors for similar words are located close to each other in the vector space, while the vectors for unrelated words lie farther apart. This allows the model to capture the meaning of words and their relationships to each other.

GloVe has been shown to perform well on a variety of natural language processing tasks such as language translation, named entity recognition, and sentiment analysis. It has been used in many NLP systems and has become a popular choice for learning word vectors.
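For experimentation, pre-trained GloVe vectors can be loaded without training anything yourself, for example through gensim's downloader API. This is just one possible route; it assumes gensim is installed, that the "glove-wiki-gigaword-100" package is available from gensim-data, and that an internet connection is available on first use:

```python
import gensim.downloader as api

# Downloads (on first use) 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"][:5])                 # first few dimensions of the "king" vector
print(glove.most_similar("king"))        # semantically close words by cosine similarity
print(glove.similarity("good", "great")) # similarity between two words
```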

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model developed by researchers at Google. It is a deep learning model that is trained on large amounts of text data to learn the relationships between words and their meaning in context.

One of the key innovations of BERT is its use of bidirectional attention, which allows the model to consider the context on both the left and the right of a word when making predictions. This differs from previous models, which only considered the context to the left of a word. BERT is also built on the Transformer architecture, which allows the model to efficiently process long sequences and capture the relationships between the words in a sentence.

BERT has been shown to perform well on a wide range of NLP tasks, including language translation, question answering, and text classification. It has become a popular choice for many NLP applications and has been used in many state-of-the-art systems.
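As one way to obtain contextual embeddings from a pre-trained BERT model, the sketch below uses the Hugging Face transformers library. It assumes transformers and PyTorch are installed and uses the publicly available "bert-base-uncased" checkpoint; the example sentence is arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised its interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual vector per token: (batch, tokens, hidden_size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # e.g. torch.Size([1, number_of_tokens, 768])
```

Unlike the static embeddings produced by Word2Vec or GloVe, the vector for "bank" here depends on the surrounding sentence.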

Applications of Word Embeddings

Word embeddings are a powerful tool in natural language processing (NLP) and have a variety of applications. Some common applications of word embeddings include:

  • Language translation: Word embeddings can be used to improve the performance of machine translation systems by capturing the semantic relationships between words in different languages.
  • Text classification: Word embeddings can be used as features in machine learning models for tasks such as sentiment analysis and spam detection.
  • Information retrieval: Word embeddings can be used to represent the content of documents in a way that allows a search engine to understand the meaning of words and retrieve relevant documents based on a query.
  • Text summarization: Word embeddings can be used to identify the most important words and phrases in a document, which can then be used to generate a summary of the document.
  • Named entity recognition: Word embeddings can be used to identify named entities, such as people, places, and organizations, in a piece of text.

Overall, word embeddings are an important tool for many NLP tasks and have greatly improved the performance of NLP systems.
