From terms to ideas: extracting topics from texts
The different terms in an idea cluster are often called variants. There are several types of variants, but they all share one property: they are different expressions of the same idea.
- Synonyms, i.e. words with different spellings, often even different etymologies, that point to a very similar meaning, at least in the context of a specific domain
- Inclusions, i.e. when a term contains a shorter one and only adds a minor precision to its meaning
- Spelling errors – these can be a major issue, not only for the end task of term extraction but also for earlier, more basic tasks such as tokenization, POS tagging and lemmatization, which assume correct language
- Of course, all of these can be combined, so that the resulting expression has no obvious connection to the base term we want to link it to
The variant problem(s)
Gathering variants raises a number of technical problems. First and foremost, there is the non-trivial issue of linking words that do not share any part of their spelling, as we have already seen.
It is also necessary to find a representation, or data structure, that fits these term clusters, so that we can, for instance, refer to similar terms using a unique identifier.
Finally, all of this has to be applied to massive numbers of terms at the same time – in order to measure their relatedness – which raises computational difficulties.
Baselines and their limitations
Some naïve methods can easily be tried to get a grasp of the variant problem, but as we’ll see, they quickly run into strong limitations.
One might consider that a term that contains another has the same meaning and only adds some detail to it.
While this is generally true for multiword terms – for instance, “life insurance contract” or “life insurance policy” don’t add much information to the term “life insurance” – it is much blurrier for single-word terms. Indeed, “life insurance” is arguably not a variant of “insurance”, and certainly not one of “life”.
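The inclusion baseline can be sketched in a few lines of Python. This is a toy illustration under our own assumptions: the term list is invented, and we restrict ourselves to multiword terms to sidestep the “life insurance” / “life” problem above.

```python
# Toy inclusion baseline: treat a term as a variant of any shorter
# multiword term it fully contains.

def inclusion_variants(terms):
    """Map each multiword term to the longer terms that include it."""
    variants = {}
    for term in terms:
        if len(term.split()) < 2:
            continue  # skip single words: inclusion is too blurry there
        for other in terms:
            if other != term and len(other.split()) >= 2 and other in term:
                variants.setdefault(other, []).append(term)
    return variants

terms = ["life insurance", "life insurance contract",
         "life insurance policy", "life"]
print(inclusion_variants(terms))
# {'life insurance': ['life insurance contract', 'life insurance policy']}
```

Note that “life” is never used as a cluster seed here, precisely because single-word inclusion produces spurious links.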
Going a bit further down the linguistic road, another option for finding related words is to look at their stem. The idea here is that words belonging to the same morphological family share the same meaning.
This approach is problematic for several reasons. First, the statement is simply not true in a number of cases: “universe” and “university” both yield the stem “univers”. Stemming is also unable to handle irregular plurals like “mouse” / “mice”… and the list could go on. Finally, an additional stemming step is required, which can be non-trivial, in particular for morphologically rich languages.
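Both failure modes are easy to reproduce with a deliberately naive suffix-stripping stemmer. The suffix rules below are invented for the example, not a real Porter implementation:

```python
# A toy suffix-stripping stemmer: remove the longest matching suffix.
SUFFIXES = ["ity", "es", "e", "s"]

def toy_stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(toy_stem("universe"))    # 'univers' -- collides with...
print(toy_stem("university"))  # 'univers' -- ...an unrelated word
print(toy_stem("mouse"))       # 'mous'
print(toy_stem("mice"))        # 'mic' -- the irregular plural is missed
```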
Moreover – and this is also true for the inclusion technique – the words you find are likely to be infrequent. It is generally the case that one form of a word is widely used in a collection of reviews while the other forms are much rarer, so the clusters created by this method are rather inefficient.
Finally, another ancient method we can use – one that ironically sounds like a dinosaur’s name – is to match words against a thesaurus. This implies a tremendous amount of manual work to maintain the resource for each language you want to support, and possibly for every industry domain, as vocabulary can be very domain-specific. Moreover, this kind of resource isn’t always reliable: you often face the problem of semantic drift, i.e. the fact that meaning is not stable through transitive links between words.
A word about clustering
Clustering is the art of grouping similar things together. It can typically be seen as an iterative process in which the most similar elements are grouped at each step or, conversely, the most dissimilar elements are partitioned.
Methods for clustering include, for instance, the classic k-means algorithm and hierarchical cluster analysis. Clustering is how we, here at dictanova, group variants into a single topic. The key point in clustering, however, is how we determine that two terms are similar to each other. So let’s see how we can do that.
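To make the iterative view of clustering concrete, here is a minimal agglomerative (single-linkage) sketch in plain Python. The words and similarity scores are made up for illustration; how to compute real similarities is exactly the subject of the next section.

```python
# Merge the two most similar clusters until no pair exceeds a threshold.
words = ["price", "value", "cost", "staff"]
sim = {("price", "value"): 0.9, ("price", "cost"): 0.8,
       ("value", "cost"): 0.7, ("price", "staff"): 0.1,
       ("value", "staff"): 0.2, ("cost", "staff"): 0.1}

def similarity(a, b):
    return sim.get((a, b)) or sim.get((b, a), 0.0)

def cluster(words, threshold=0.5):
    clusters = [{w} for w in words]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: similarity of the closest pair of members
                s = max(similarity(a, b)
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] |= clusters.pop(j)

print(cluster(words))  # [{'price', 'value', 'cost'}, {'staff'}]
```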
Everything is a vector
The idea behind word vectorization is to characterize words by their frequent contexts, i.e. the other words that appear next to them. Concretely, this is done by setting a certain size for the word window within which we consider the proximity of words to be relevant – usually a span of 5 to 10 words.
Then the heavy part begins: building the co-occurrence matrix. For each possible pair of words, we count the number of times the words appear together, so in the end you get something that looks like this:
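Counting co-occurrences is straightforward to sketch. The two-sentence corpus below is invented, and we use a small window of 2 words on each side rather than the 5-to-10 span mentioned above:

```python
# Count, for every word, how often each other word falls in its window.
from collections import defaultdict

def cooccurrences(sentences, window=2):
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = ["the price is cheap", "the value is cheap"]
counts = cooccurrences(corpus)
print(dict(counts["price"]))  # {'the': 1, 'is': 1, 'cheap': 1}
```

Here “price” and “value” end up with identical context counts, which is precisely what makes them comparable.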
Each row in this matrix represents a word in the form of a numerical vector. What is interesting about these vectors is that, thanks to linear algebra, we can easily compute a similarity between two vectors, which means, in our case, a similarity between two words.
That is to say, because words such as “price” and “value” often appear with the same kinds of words (“cheap” or “expensive”, among others), their vector representations only show minor differences, and their similarity in meaning, according to the distributional hypothesis, is therefore high.
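The standard similarity measure for such vectors is the cosine of the angle between them. The context counts below are invented to mimic the “price” / “value” example:

```python
# Cosine similarity between count vectors over the contexts
# (cheap, expensive, friendly). The numbers are made up.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

price = [10, 8, 0]
value = [9, 7, 1]
staff = [0, 1, 12]

print(round(cosine(price, value), 2))  # close to 1: similar contexts
print(round(cosine(price, staff), 2))  # close to 0: different contexts
```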
High and low
A first way of representing a word with this mechanism is to keep every possible piece of information about it in the vector, thus creating a high-dimensional vector. Such vectors can be suitable for large-scale analysis of a collection of texts, but they fail to produce relevant results on the task of word similarity.
One reason for this is the widely feared curse of dimensionality: because the size of the vectors (their dimension) is much greater than the number of vectors being compared, trying to group them will most likely give blurry results, if any. To put it another way, if each of the characteristics (or features) you want to base your comparison on is only held by a handful of vectors among the millions you are analyzing, computing a similarity won’t make much sense: all vectors will seem similar, because they’re all equally different.
The other problem is that this representation induces a processing bottleneck. High-dimensional spaces are hard to index, and as the computation time needed is linearly proportional to the dimension of the vectors, calculating similarities on huge amounts of text takes months on high-end machines. This is bad for anyone, and in particular in our time-constrained industry setting.
A solution is to reduce the number and size of the vectors, keeping only the relevant ones, with the values that describe them best – that, in a nutshell, is what methods such as word2vec and similar neural network architectures are about.
How exactly can we compute these vectors?
Word embedding vectors
The family of vectors obtained by training neural networks is known as word embedding vectors. Originally proposed by Mikolov et al. (2013), the approach used to obtain them is named word2vec. The vector representations of words learned by word2vec models have been shown to carry semantic information and to be very useful in several Natural Language Processing tasks.
Learning embedding vectors
Let’s go a bit further into word embedding vectors with a concrete example.
We want to use a neural network to learn the embedding vector of each word in our corpus. The input of the network will be a word and the words in its immediate context. However, instead of using a co-occurrence matrix, we will use one-hot vector representations of the context words.
The neural network we want to use consists of two dense (or fully connected) layers. The first layer generalizes the high-dimensional one-hot input vectors into a low-dimensional dense vector. The second layer maps the dense vector back to the original dimension (i.e. the vocabulary size). By projecting the output values into a probability space using the softmax function, we can calculate the difference between the output vector and the one-hot vector of the current word. Then, by means of backpropagation, we can compute the gradients and update each learnable parameter in the network using gradient descent.
In other words, we are training a neural net to predict a word from its context. This algorithm is also known as CBOW (continuous bag-of-words). Once the training process is complete, the dense vector obtained in the first layer is actually the so-called word embedding vector.
The following diagram shows the forward operation for the word “bag” in the sentence “The school bag is very expensive”. One-hot vectors of “school” and “is” are fed to the network, passing from layer to layer to produce the predicted vector.
During the backward operation, weights are adjusted as seen before. After the last update of the weights, the embedding vector of “bag” can be found in the hidden layer.
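The forward and backward operations above can be sketched in NumPy. This is a minimal single-example illustration, not a real training loop over a corpus: the dimensions, learning rate and number of steps are arbitrary choices.

```python
# Minimal CBOW step in NumPy: predict "bag" from "school" and "is".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "school", "bag", "is", "very", "expensive"]
V, D = len(vocab), 4                 # vocabulary size, embedding dimension

W_in = rng.normal(0.0, 0.1, (V, D))   # layer 1: one-hot -> dense embedding
W_out = rng.normal(0.0, 0.1, (D, V))  # layer 2: dense -> vocabulary scores

def forward(context_ids):
    h = W_in[context_ids].mean(axis=0)   # average of context embeddings
    scores = h @ W_out
    exp = np.exp(scores - scores.max())  # softmax over the vocabulary
    return h, exp / exp.sum()

context = [vocab.index("school"), vocab.index("is")]
target = vocab.index("bag")

lr = 0.5
for _ in range(200):
    h, probs = forward(context)
    grad_scores = probs.copy()
    grad_scores[target] -= 1.0           # d(cross-entropy)/d(scores)
    grad_h = W_out @ grad_scores         # backpropagate into the hidden layer
    W_out -= lr * np.outer(h, grad_scores)
    W_in[context] -= lr * grad_h / len(context)

_, probs = forward(context)
print(probs[target])  # probability of "bag" given its context, close to 1
```

Once trained, the rows of `W_in` are the word embedding vectors.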
There is in fact another word2vec architecture, called skip-gram, but don’t worry: if you understand how CBOW works, you can easily understand skip-gram, as it is basically the opposite algorithm – learning to predict the immediate context from the current word.
What kind of results can I expect from this?
Embedding vectors in the wild
The output of such word similarity models is usually a set of related words, given a set of reference words (or seeds) and a set of counterexample words, i.e. words that should not be related to the output.
Indeed, the semantic meanings held by the word vectors are surprisingly robust to this type of mathematical operation, so much so that you can actually apply algebra to carefully chosen vectors to produce functional or relationship vectors, in addition to simple vocabulary vectors.
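The classic illustration of this vector arithmetic is the “king − man + woman ≈ queen” analogy. The 2-dimensional vectors below are hand-crafted for the example (the two dimensions can loosely be read as gender and royalty); real embeddings have hundreds of dimensions learned from data.

```python
# Toy analogy arithmetic on invented 2-d "embeddings".
king  = [0.9, 0.9]
man   = [0.9, 0.1]
woman = [0.1, 0.1]
queen = [0.1, 0.9]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # lands very close to the "queen" vector
```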
The fact that words can be grouped by their meaning offers a way to understand huge datasets at a human readable scale, which is next to impossible with a word-by-word analysis. In our case, we use this type of method to automatically structure your textual content so that you can access your documents through coherent groups of ideas. In Dictanova API, such groups of words are simply called Topics.
However, as powerful as word vectorization is, it would be largely ineffective if we only used off-the-shelf solutions. There remains a strong limitation: only single words, and to a certain extent very idiomatic expressions, are vectorized. This is insufficient for our application, which focuses on meaningful and domain-specific expressions and extracts four times more multiword terms than single words.
How we adapted word vectorization to multi-word terms will be the subject of our next article.
What have we learned?
Using the long-known paradigm of distributional semantics, recent methods have produced extraordinary results in gathering words that share a similar meaning. That is, for single words. It is still a challenge to reproduce such results for multiword expressions, which represent the vast majority of the terms we’re interested in.
In our next article, we will see how we overcome the problem of vectorization in the case of multiword terms, so that we can gather meaningful expressions together, such as “difficulties with a reservation” and “check-in problems”.