How to choose your NLP API?
Typing “NLP API” into Google returns more than 7 million results. Not all of them are NLP technology providers, but the first pages alone will take you to TextRazor, Google NLP, Aylien, Recast.AI, IBM Watson, Dandelion, MonkeyLearn, and many more.
But are they all the same? How do you choose the one that suits your needs?
First of all, you have to think about what you mean by “analyze”:
- Do you want to identify important expressions in a text? This process is called Term Extraction.
- Do you want to know whether the author is positive, negative or neutral? This process is called Sentiment Analysis.
- Do you want to associate a category with your text? This process is called Classification.
Each of these analyses can be tackled with several approaches, and those approaches are what differentiate NLP providers from one another.
There are two main approaches to term extraction: dictionary-based and unsupervised analysis.
Dictionary-based extraction detects important terms from a complete list of what you are searching for, called the dictionary. The analysis will only detect what the dictionary contains, so the dictionary must be maintained to fit your needs.
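To make this concrete, here is a minimal sketch of dictionary-based extraction in Python. The function name `extract_terms`, the sample dictionary and the matching rule (case-insensitive, whole words only) are assumptions chosen for illustration; they do not describe any particular provider's implementation.

```python
import re

def extract_terms(text, dictionary):
    """Return the dictionary terms found in text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for term in dictionary:
        # Word boundaries so that "food" does not match inside "seafood".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        match = re.search(pattern, lowered)
        if match:
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]

# Hypothetical dictionary maintained by the customer.
dictionary = ["trip", "food", "customer service"]

# "fod" is a typo for "food": it is not in the dictionary, so it is missed.
text = "The trip was on time but the fod could have been better."
print(extract_terms(text, dictionary))  # ['trip']
```

The misspelled term goes undetected, and a term that has not been added to the dictionary would be missed in the same way.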
By contrast, unsupervised analysis performs a deep semantic analysis to understand the structure of the provided text and automatically detect its important parts. This approach is much more flexible, as it is tolerant of typos and can detect new terms that have never appeared before.
In both approaches, the technology can support a number of languages and be specialized for a particular kind of text (product reviews, for example). Be aware that some providers support many languages by translating the provided text into a language they know (English, for example) and performing the analysis on the translation, which degrades the results.
At Dictanova, we provide an unsupervised analysis based on machine learning, specialized for voice-of-the-customer analysis and natively supporting French, English, Spanish, German and Chinese.
NLP providers can offer sentiment analysis at two levels:
- global level: the sentiment applies to the whole piece of text analyzed. For example, “I’m very happy with my trip because it was on time and quiet but the food could have been better” will be globally positive, because the overall sentiment is positive even if the food could have been better.
- term level: the sentiment applies to each term detected by the term extraction process. In the same example, the “trip” will be positive but the “food” will be negative.
With the global level, it is not possible to know which terms caused the text to be flagged as positive, neutral or negative. In other words, each term of the text will be associated with the global sentiment even if it is not accurate (like the “food” in the previous example).
With the term level, there is no global sentiment, as it would be meaningless, but you can identify in each text the parts that are positive or negative, allowing a fine-grained analysis.
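The difference between the two levels can be sketched with two result shapes for the trip example. The field names (`sentiment`, `terms`, `term`) and both helper functions are hypothetical, chosen only for this illustration; real providers each define their own response format.

```python
# Hypothetical response shapes; real providers use their own field names.
global_response = {
    "sentiment": "positive",
}
term_response = {
    "terms": [
        {"term": "trip", "sentiment": "positive"},
        {"term": "food", "sentiment": "negative"},
    ],
}

def term_sentiments_from_global(response, terms):
    """With a global result, every term can only inherit the single label."""
    return {term: response["sentiment"] for term in terms}

def term_sentiments_from_terms(response):
    """With a term-level result, each term carries its own label."""
    return {item["term"]: item["sentiment"] for item in response["terms"]}

print(term_sentiments_from_global(global_response, ["trip", "food"]))
# {'trip': 'positive', 'food': 'positive'}
print(term_sentiments_from_terms(term_response))
# {'trip': 'positive', 'food': 'negative'}
```

With the global result, “food” inherits a positive label it does not deserve; with the term-level result, the negative opinion about the food is kept.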
Some NLP providers return a sentiment score (usually between -1 and 1) instead of a sentiment label (positive, negative, neutral). These scores are generally very close to 1 or -1, so they are not really more informative than a sentiment label.
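As an illustration, a score can be turned back into a label with a simple threshold. The function `score_to_label` and the width of the neutral band (0.25) are assumptions made for this sketch, not values used by any specific provider.

```python
def score_to_label(score, neutral_band=0.25):
    """Map a sentiment score in [-1, 1] to a label.

    neutral_band is an arbitrary, hypothetical threshold: scores whose
    absolute value does not exceed it are treated as neutral.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1 and 1")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

for score in (0.97, -0.92, 0.1):
    print(score, score_to_label(score))
```

When most scores sit near the extremes, as described above, the label obtained this way carries essentially the same information as the score.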
At Dictanova, we use term-level sentiment. We strongly believe that global sentiment is useless and that, in the customer experience context, it can easily be replaced by a score associated with the text, such as NPS or CSAT.
Classification can be solved using two approaches: model-based classification or dictionary-based classification.
In the model-based approach, the customer must provide a list of texts, each associated with a category. Using a machine learning algorithm, the NLP provider builds a model that can then classify new texts into the right category. Accuracy strongly depends on the quality (and volume) of the data sample used to train the model. In this approach, it is not possible to find out why a text has been classified in a given category.
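The workflow (labeled texts in, model out, new texts classified) can be sketched with a tiny naive Bayes classifier written in plain Python. Naive Bayes is only a stand-in here, since no specific algorithm is named above, and the training samples, category names and function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(samples):
    """Build a naive Bayes model from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of texts
    vocabulary = set()
    for text, category in samples:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return word_counts, doc_counts, vocabulary

def classify(model, text):
    """Return the most probable category for text (label only)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical training sample provided by the customer.
samples = [
    ("the delivery was late", "delivery"),
    ("my parcel arrived late and damaged", "delivery"),
    ("the price is too high", "price"),
    ("too expensive for the quality", "price"),
]
model = train(samples)
print(classify(model, "parcel arrived late"))  # delivery
```

The caller gets back a category and nothing else, and with so few training texts the result is fragile, which is why the quality and volume of the training sample matter so much.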
In dictionary-based classification, terms are classified into categories. When texts are analyzed by the Term Extraction process, each detected term is checked and, if it belongs to a category, the text is automatically classified in that category. Unfortunately, building the dictionary is usually a laborious task performed by a human. Some NLP providers try to offer off-the-shelf dictionaries to make it easier to get started, but these dictionaries are not customer-specific and are hard to maintain: if a new term is detected and is not classified, the text containing this term will not be classified.
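A minimal sketch of this mechanism follows. The mapping `term_categories`, the category names and the function `classify_text` are hypothetical, and the detected terms are assumed to come from a prior term extraction step.

```python
# Hypothetical dictionary: each known term is assigned to one category.
term_categories = {
    "trip": "journey",
    "food": "catering",
    "seat": "comfort",
}

def classify_text(detected_terms, term_categories):
    """Map each category to the detected terms that triggered it."""
    categories = {}
    for term in detected_terms:
        category = term_categories.get(term)
        if category is not None:
            categories.setdefault(category, []).append(term)
    return categories

# Terms detected in a text by the term extraction step.
print(classify_text(["trip", "food"], term_categories))
# {'journey': ['trip'], 'catering': ['food']}

# "wifi" is a new term that nobody has classified yet.
print(classify_text(["wifi"], term_categories))
# {}
```

Each category comes with the terms that triggered it, so the reason for a classification is visible, but a text whose only term is missing from the dictionary ends up with no category at all.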
At Dictanova, we strongly believe that the customer must be able to know exactly why a piece of text has been classified in a given category. For that reason, we couldn’t use the model-based approach. But we also strongly believe that NLP must be an automatic process and should not be painful.
That’s why we developed a unique technology capable of generating a classification plan based on the terms detected across the customer’s whole set of texts.