Chatbots are becoming the primary conversational interface for customer engagement and customer service. AI-powered chatbots can now converse with customers intelligently by accurately extracting intents and entities and responding to queries using technologies such as NLU (Natural Language Understanding) and NLG (Natural Language Generation). Other capabilities, such as measuring the effectiveness of a conversation, understanding the customer's satisfaction level, or detecting hidden signals in the conversation (like sensing a lead), are going to be the next big addition to marketing and customer engagement through chatbots and other conversational platforms.
In this article, we walk through how sentiment analysis can be implemented and refined, together with other technologies, to tag and understand the sentiment of a conversation and so achieve this next big addition.
Sentiment analysis, or opinion mining, is a field within Natural Language Processing (NLP) that identifies and extracts opinions from phrases or conversations. We have used sentiment analysis along with machine learning and other technologies to build a platform for understanding conversations effectively.
The following technology stacks/libraries have been used in building the core solution.
First we have to train the neural network for sentiment analysis, and then integrate the trained model to predict the sentiment of a conversation.
Training the model consists of two parts:
The data pre-processing pipeline is common to both processes, training and prediction.
Let's first understand what data pre-processing is and the steps involved in it. Data pre-processing essentially means cleaning noise from the data and converting the raw data into a format an algorithm can use. The different steps involved are:
Converting to n-grams
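As a rough illustration of such a pipeline, here is a minimal Python sketch. The individual steps (lowercasing, stripping punctuation, removing stop words) and the tiny `STOP_WORDS` set are assumptions made for the example, not the exact steps of the production pipeline; only the final n-gram step is taken from the text.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "that", "i", "was", "a", "an", "is", "for", "this", "so", "you"}

def preprocess(text, n=1):
    """Clean raw text and return a list of n-gram tokens."""
    # Lowercase and keep only runs of letters, digits and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Drop stop words
    words = [w for w in words if w not in STOP_WORDS]
    # Build n-grams; n=1 yields the individual words
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(preprocess("The watch that I bought was very bad"))
# ['watch', 'bought', 'very', 'bad']
```

With `n=2` the same sentence would produce bi-grams such as "very bad", which is one way to keep short phrases together.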
Now let's look at the approach for training the neural network. The data used for training was a tagged dataset available internally within our organization. The dataset is fed to the pre-processing stage, where each of the steps mentioned above performs its task and outputs a list of n-gram tokens (for our purpose we took n as 1).
Now we have a matrix of m x n, where m is the number of rows (records) in our dataset and n is the number of keywords obtained after the data pre-processing steps. These keywords are then passed to the word2vec module, which outputs a vector of length 200 for each record (the average of the vectors of all its keywords); this is used as our training data (remember, our training matrix is now m x 200). This numerical vector of length 200 is fed to the input layer of our neural network, which has a hidden layer and an output layer, with sigmoid as the activation function. We ran 100 epochs with a bit of tweaking to get a model with 90 percent accuracy (benchmarked on our organization's internal data).
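The two ideas in this step, averaging keyword vectors into one fixed-length vector and passing it through a network with one hidden layer and sigmoid activations, can be sketched in plain Python. This is only an illustration: the 4-dimensional `toy_embeddings`, the hidden layer size and the random, untrained weights are all made up for the example, whereas the real solution uses 200-dimensional word2vec vectors and a network trained with Keras.

```python
import math
import random

def document_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict(vector, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input -> hidden layer -> single sigmoid output in (0, 1)."""
    hidden = dense(vector, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)[0]

# Toy 4-dimensional embeddings standing in for 200-dimensional word2vec vectors.
DIM, HIDDEN = 4, 3
toy_embeddings = {
    "watch": [0.2, 0.1, 0.0, 0.3],
    "bad": [-0.6, 0.4, 0.2, -0.1],
}

# Random, untrained weights: the score below is meaningless until the network is trained.
rng = random.Random(0)
hidden_w = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
hidden_b = [0.0] * HIDDEN
out_w = [[rng.uniform(-1, 1) for _ in range(HIDDEN)]]
out_b = [0.0]

doc = document_vector(["watch", "bought", "bad"], toy_embeddings, DIM)
score = predict(doc, hidden_w, hidden_b, out_w, out_b)
```

Tokens without an embedding ("bought" here) are simply skipped, so every record ends up with a vector of the same length regardless of how many keywords it has.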
This model is deployed to the production environment, where it predicts the output for a query and saves the record in our datastore, Elasticsearch, which is used for searching and for generating stats.
Let's first look at the input of an individual chat message to our module, which is powered by the model developed through the steps above, and at the corresponding output.
Input: "The watch that I bought was very bad",
Output: The output mentioned below is stored in our Elasticsearch datastore.
"raw_conversation": "The watch that I bought was very bad",
The above output is that of an individual chat message within a longer conversation. Such individual outputs are then grouped for the entire conversation, and a final outcome is presented for the conversation as a whole.
USER CONVERSATION ANALYSIS
AGENT: Hello, How may I assist you?
USER: The mobile phone that I had bought from your store Is not working fine
AGENT: As I can see in my system that you ordered a phone two days ago which got delivered yesterday. I assure you for the best service.
USER: Yeah, I need a refund for this defective product
AGENT: Don't worry, we have escalated the matter. A pickup agent will pick the phone and the refund for the same will be processed.
USER: Thank you so much for the quick resolution.
AGENT: We are here to help you. Have a great day ahead.
"raw_conversation": "The mobile phone that bought from your store Is not working fine"
"tags": [ "refund","need","product","defective"],
"raw_conversation": "I need a refund for this defective product"
"tags": [ "quick","much","resolution","Thank"],
"raw_conversation": "Thank you so much for quick resolution."
Each user interaction in the above conversation is fed to our model and saved separately in our Elasticsearch datastore. The results of the individual interactions can be grouped using appropriate business logic to obtain a sentiment for the overall conversation. For example, a conversation can be tagged as positive or negative depending on whether the number (or percentage) of positive or of negative interactions is higher. Likewise, any other business logic can be written using tag words and sentiments together, depending on the platform's requirements.
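The example business logic above can be sketched as a simple majority vote. The label strings and the "mixed" result for ties are assumptions made for this illustration; the text does not say how ties are handled.

```python
from collections import Counter

def conversation_sentiment(interaction_sentiments):
    """Tag a conversation with whichever of positive/negative occurs more often.

    Ties are reported as "mixed"; that tie rule is an assumption of this sketch.
    """
    counts = Counter(interaction_sentiments)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "mixed"

# Hypothetical per-interaction labels for a three-message conversation
print(conversation_sentiment(["negative", "negative", "positive"]))  # negative
```

A rule like this treats every interaction equally. Weighting the last few interactions more heavily, or combining the counts with specific tag words, would be alternative rules in the same spirit.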
Let's explore the technology stack and understand the importance of each component present in the above list.
Elasticsearch: Elasticsearch mainly plays two roles in this solution:
Enhanced searching with fuzziness - The main role of Elasticsearch in the tag cloud is to enhance the search mechanism and to provide the stats on the analytics page in near real time. It also helps to identify entities, which can be generic (date, time, place) or user-defined (for example, capturing shorthand words and converting them to their full form). Fuzziness helps to match misspelled keywords to the correct ones (e.g. delli -> Delhi).
As a datastore - It stores the sentiment as well as the pre-processed data that was input to the module. Saving the pre-processed data makes tag-based searching possible and speeds up aggregation queries.
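Both roles can be illustrated with Elasticsearch query bodies, written here as Python dictionaries. This is a sketch under assumptions: `raw_conversation` and `tags` appear in the stored records shown earlier, but the `sentiment` field name is hypothetical, and the aggregation assumes that field is mapped as a keyword.

```python
def fuzzy_search_query(field, text):
    """Match query that tolerates small misspellings via the fuzziness option."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}

def sentiment_counts_for_tag(tag, tag_field="tags", sentiment_field="sentiment"):
    """Count stored interactions per sentiment for one tag, without returning hits."""
    return {
        "size": 0,  # only the aggregation buckets are needed
        "query": {"term": {tag_field: tag}},
        "aggs": {"by_sentiment": {"terms": {"field": sentiment_field}}},
    }

search_body = fuzzy_search_query("raw_conversation", "delli")
stats_body = sentiment_counts_for_tag("refund")
```

Either dictionary would be sent as the body of a search request against the index that holds the conversation records.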
Neural Networks: A neural network is used as the learning algorithm for classifying a query as positive or negative (a third class, neutral, will be implemented in the next phase). So what is the benefit of using a neural network rather than a rule-based approach?
Firstly, a simple rule-based approach has a lot of drawbacks:
It doesn't take sentence semantics into consideration, so it makes its decision based only on keywords whose part of speech is adjective.
It needs a lot of conditions to handle negation and adverbs, as well as queries containing keywords such as but/else/otherwise.
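These drawbacks are easy to reproduce with a naive keyword counter. The lexicons and sentences below are hypothetical and chosen only to show the failure modes; they are not taken from the production system.

```python
# Hypothetical keyword lexicons for a naive rule-based classifier.
POSITIVE = {"good", "great", "quick"}
NEGATIVE = {"bad", "defective", "poor"}

def keyword_sentiment(text):
    """Classify by counting lexicon hits, ignoring word order and negation."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(keyword_sentiment("The service was not good"))                 # positive (wrong)
print(keyword_sentiment("The phone is good but the battery is bad"))  # neutral
```

Handling "not good" or "good but ..." correctly would require adding one special-case rule after another, which is the maintenance burden described above.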
Now to the question of why a neural network specifically, and not some other learning algorithm. Neural networks are universal approximators, and when given the right configuration and semantically meaningful numeric vectors for the training data (discussed in the next point), they can outperform other learning algorithms such as SVM, naive Bayes, decision trees or random forests. The corpus used was an internally available tagged dataset. Excluding sarcasm, it gave us an accuracy of 89 percent for two-class classification. Keras, with Theano as the backend, was the library used for building and training the neural network.
Word2vec: For the representation of text as numbers, there are many options out there. The simplest methodology when dealing with text is to create a word frequency matrix that simply counts the occurrences of each word. A variant of this method is to use the log-scaled frequency of each word while also considering its occurrence across all documents (tf-idf). Another popular option is to take into account the context around each word (n-grams), so that, for example, New York is evaluated as a bi-gram and not as two separate words. However, these methods do not capture the high-level semantics of text, just frequencies.
A more recent advance in the field of Natural Language Processing is the use of word embeddings. Word embeddings are dense representations of text, learned by a feed-forward neural network. Each word is represented by a point embedded in a high-dimensional space. With careful training, words that can be used interchangeably should have similar embeddings. A popular word embedding network is word2vec. Word2vec is a simple, one-hidden-layer neural network that sums word embeddings and, instead of minimizing a multi-class logistic loss (softmax), minimizes a binary logistic loss on positive and negative samples, which allows it to handle huge vocabularies efficiently.
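To make the frequency-based options concrete, here is a small sketch of bi-grams and of one common tf-idf variant (raw term count multiplied by the log of the inverse document frequency). The two toy documents are invented for the example, and other tf-idf weightings exist.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Join each pair of neighbouring tokens into a bi-gram."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using raw term count times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["new", "york", "phone"], ["phone", "refund"]]
weights = tf_idf(docs)
print(bigrams(docs[0]))  # ['new york', 'york phone']
```

Here "phone" occurs in both documents and gets a weight of zero, while "refund" gets a positive weight. Nothing in these numbers says that "refund" and "reimbursement" are related, which is the gap that word embeddings fill.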
In order to represent the 20Newsgroups documents, I use a pre-trained word2vec model provided by Google. This model was trained on 100 billion words of Google News and contains 300-dimensional vectors for 3 million words and phrases. As a pre-processing step, the 20Newsgroups dataset was tokenized and the English stop words were removed. Empty documents were removed (555 documents deleted), as were documents without at least one word in the word2vec model (9 documents deleted). The final dataset consists of 18282 documents. For each document, the mean of the embeddings of its words was calculated, so that each document is represented by a 300-dimensional vector.