All right, I think it's a good time to start. Can you hear me at the back? Awesome. So hey, everyone. I'm Oindrila Chatterjee. I recently joined Red Hat as a data scientist, and as you can see from my sash, I'm also an organizer on the Fun Committee for DevConf. I hope you've been having a lot of fun so far, and if not, for starters, this talk is going to be a lot of fun, so thanks for coming in. But a lot more fun is going to be at the party tonight, which is Star Wars and hackers themed, so please be there. And while we wait for that, let's dive right in.

As you may already be aware, cognitive systems today present exciting opportunities for building new kinds of applications with really powerful intelligence behind them. When I say cognitive systems, I mean AI, or artificial intelligence, systems that exhibit capabilities such as learning, understanding, and reasoning from data. One such system that we are going to look at today is a sentiment analysis system.

So what is sentiment analysis? Sentiment analysis is nothing but deriving sentiment values, or polarities, from natural language text. And why do we do it? Products: we want to know, without reading through every customer review, how people feel about the new iPhone. Organizations want to know how their consumers or customers feel about the products, services, tools, and technologies the organization has to offer. Public sentiment: we want to know what consumer confidence is and whether it is increasing. Politics: we want to know what people think about a particular candidate or a particular issue. And prediction: we want to predict market trends from sentiment analysis, and maybe also predict election results, and so on. So the applications of sentiment analysis are numerous.

But how do we do this? We do it using techniques from NLP, or natural language processing, which encompass various machine learning and deep learning tools and technologies. And what is natural language processing? NLP sits right at the intersection of artificial intelligence and computational linguistics, and it basically allows computers to understand, interpret, and manipulate human language almost as well as we do.

With that background, let's look at what we will learn today from this talk. Firstly, I'll talk about my experience working on a sentiment analysis service and how we narrowed it down to a single service. Secondly, we'll talk about how we continually improve the service, including by introducing feedback. Thirdly, we'll go over the lessons learned while going through this process. And finally, we'll go over the usage of the framework and how to implement such a service.

There are various ways of performing sentiment analysis, so I'll give a brief overview of the three primary ways in which you can perform it. First is a lexicon- or dictionary-based approach. In this approach, you derive sentiment from a piece of text based on a count of the words appearing in the text and the sentiments they signify, and you apply rules on top of that. So this is a very generic, very basic approach to tackling something as complex as sentiment analysis. Second comes a machine learning approach. In this approach, you train a machine learning model, have it learn from training data that has been labeled with particular sentiment types, and have the model predict on new data.
And thirdly, there is a deep learning-based approach, wherein you create a self-teaching learning system that learns through multiple layers, also called neural networks, and from large data sets. So broadly, these are the three ways in which one can perform sentiment analysis.

So let's look at what we started with. Our first pass at building a sentiment analysis service was actually composed of three NLP tools, three open-source NLP libraries: Stanford CoreNLP, VADER, and TextBlob. These are three really good, robust NLP libraries that are available. CoreNLP's sentiment model is based on a recursive neural network, which is a deep learning-based model, and it's trained on a tree-based corpus called the Sentiment Treebank, which lets it actually learn compositional effects in natural language. Second, VADER. VADER is a lexicon- and rule-based approach; it's mainly targeted towards social media text, but it does a good job otherwise as well. Thirdly, TextBlob is a classifier-based approach, and it is trained on the IMDB movie review data set, which is a large data set consisting of movie reviews, each annotated with its sentiment.

So we had our three models, we were treating them as three classifiers, and our goal was to get one prediction for our Red Hat-based artifacts. We had three models giving out three predictions, so our first pass was to build a model stack, which is a model ensembling method: you take three classifiers, you get a vote of how many classifiers vote for which label, you create a stacked model on top of all three, and from that you get your final prediction. This was a pretty straightforward and reasonable initial approach to tackling sentiment analysis. One thing specific to this setup: since Stanford CoreNLP derives sentiment sentence by sentence, that is, it gives one annotation for each sentence, we were now treating our entire chunk or paragraph sentence by sentence.

This method had some limitations. Firstly, the accuracy of sentence classification: there were a lot of misclassified negative sentences. Secondly, we saw that foreign languages similar to English, like Spanish, German, and French, which use the same characters, were often misclassified because they were treated as English. And thirdly, domain-specific sentences were misclassified, meaning sentences specific to Red Hat's customer service, things which are really context specific, were not being identified in the true sense.

So next, we took our three classifiers and brought in training data. We labeled some data that we had and created a training data set, got predictions from each of the three classifiers independently on that training data, and assigned them weights based on how they performed. We then had a weighted-majority voting scheme to get one single prediction. This brought about some improvements compared to the previous service, now that we had built a model stacking approach using a weighted majority vote. Secondly, we were also filtering out non-English sentences using Google's langdetect library. And thirdly, we fed some context-specific terms, specific to Red Hat, into VADER's lexicon.
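To make the weighted-vote idea concrete, here is a minimal sketch of that kind of ensemble over two of the three classifiers, VADER and TextBlob, including a domain-specific term added to VADER's lexicon. The weights, thresholds, and the example lexicon entry are illustrative assumptions rather than the service's actual values, and the real service also included Stanford CoreNLP as a third voter.

```python
# Minimal sketch of a weighted majority vote over sentiment classifiers.
# Weights, thresholds, and the custom lexicon entry are illustrative assumptions;
# the actual service also included Stanford CoreNLP as a third voter.
from collections import Counter

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
# Feed a domain-specific term into VADER's lexicon (hypothetical valence value).
vader.lexicon.update({"segfault": -2.0})

def vader_label(text):
    compound = vader.polarity_scores(text)["compound"]
    return "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"

def textblob_label(text):
    polarity = TextBlob(text).sentiment.polarity
    return "positive" if polarity > 0.1 else "negative" if polarity < -0.1 else "neutral"

# Hypothetical per-classifier weights, e.g. derived from F1 scores on labeled data.
WEIGHTS = {vader_label: 1.5, textblob_label: 1.0}

def weighted_vote(text):
    votes = Counter()
    for classifier, weight in WEIGHTS.items():
        votes[classifier(text)] += weight
    return votes.most_common(1)[0][0]

print(weighted_vote("The installer crashed with a segfault again."))
```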
This brought about some improvements in performance. The metric we are using to determine improvement is the F1 score, which gives us a good gauge of how the models perform in comparison to each other. As we can see, it showed quite an improvement on the positive and negative labels, and it also improved the neutrals a little, but we were mainly concerned about positives and negatives.

So this was great. Another thing which is really important and closely tied to sentiment analysis is entity detection. So what are entities? In any piece of text, the nouns that appear, the tokens which have some significance, like places, people's names, events, organization names, et cetera, are entities, as simple as that. And why are we doing entity detection? We want to understand our sentiment scores as they pertain to the specific entities we are actually interested in.

The entity detection model we built used spaCy, which is again an open-source NLP tool, and the limitation there was that we weren't able to identify some of the Red Hat context-specific terms we are actually interested in, like Ansible, RHEL, OpenShift, Red Hat CloudForms, and so on. Also, non-entities were often classified as entities: words were being treated as entities just because of their capitalization or their position in a particular sentence, which shouldn't be happening. And thirdly, many entities were being misclassified; for example, RHEL and OpenShift were being labeled as organizations, which shouldn't be the case.

So we took some steps to alleviate these issues. We compiled a large list of Red Hat context-specific terms that we wanted our service to identify, created a balanced training dataset, and used that to retrain spaCy's model. After a lot of passes, after a lot of versions of the training data, we saw some improvements: a higher number of context-specific entities being recognized, which was a great thing; better classification of entities, since the training data corrected the classifications; and lower non-entity detection, that is, fewer words that are not entities being detected as entities.

So having said that, after this process we had our sentiment analysis service and our entity detection service, but we wanted a single service to serve both purposes, and we wanted to explore other options. So we thought of exploring a deep learning-based approach to sentiment analysis. As many of you might know, or just to give a high-level overview of what deep learning is: one of the primary goals of deep learning is to learn to represent your inputs in such a way that it is easier to map those inputs to a target task. Here we have a crude drawing of a convolutional neural network on the left, and as you can see, the input is fed through a series of layers, and each layer produces an activation. What an activation corresponds to is the model's representation of what is contained in the input that you feed in. And as you can see on the right, the features the CNN is learning are actually pretty high-level features compared to simple pixels. You see car wheels, I don't know if it's really visible, but you can see it's pretty good at car wheels there, and also faces, features which represent facial features.
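Going back to the entity detection step for a moment before we continue with deep learning, here is a minimal sketch of inspecting the stock spaCy model's entities and retraining it on a handful of Red Hat-specific examples. It uses the spaCy 2.x-style training API, and the label, character offsets, and tiny training set are illustrative assumptions rather than the actual retraining corpus.

```python
# Minimal sketch of checking spaCy's NER output and retraining it on a few
# domain-specific examples (spaCy 2.x-style API; the label, offsets, and
# training examples below are illustrative, not the talk's actual data).
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Before retraining: see what the stock model thinks the entities are.
doc = nlp("We migrated the workloads from RHEL to OpenShift last quarter.")
print([(ent.text, ent.label_) for ent in doc.ents])

# A couple of hand-labeled examples with character offsets for each entity.
TRAIN_DATA = [
    ("We migrated the workloads to OpenShift.",
     {"entities": [(29, 38, "PRODUCT")]}),
    ("Ansible made the rollout much easier.",
     {"entities": [(0, 7, "PRODUCT")]}),
]

ner = nlp.get_pipe("ner")
ner.add_label("PRODUCT")

optimizer = nlp.resume_training()
for _ in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
    print(losses)
```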
And for terminology's sake, the thing on the left is the source model, which serves as the pre-trained model for the rest of our talk, and what we introduce after that is a classifier on top of it, on top of the pre-trained model basically. We'll get into this a little more, but this is just to set the basics, right?

So the first prototype approach we tried was a recurrent neural network using long short-term memory units. In order to understand how deep learning can be applied here, think about the kind of data we are feeding into our deep learning model. In our case we are feeding in string inputs, and to perform things like backpropagation, matrix multiplications, and dot products, we cannot operate on strings; we need to convert them to vectors. For our case, we again worked with the IMDB movie review data set, which is a large data set, and we wanted to convert it to vectors. These vectors need to be created in such a way that they represent the context, meaning, and semantics of the words as they appear in the sentence. To create such vectors, also called word embeddings, we use vector generation models like GloVe, word2vec, and so on. The embedding matrix consists of one single representation for each and every word that can appear in the text. We then create an IDs matrix for each review, which consists of the vector embeddings for each review we feed in.

We then look at our recurrent neural network model. How a recurrent neural network differs from a traditional feed-forward neural network is the temporal aspect: here we have a hidden state vector associated with each vector that we feed into the system. And long short-term memory units are nothing but modules that we place inside RNNs; at a high level, they make sure we are able to encapsulate long-term dependencies in the text. Using this model, we trained on our input and looked at the results.

The results look like this: the F1 scores of this LSTM-based model are much better than our initial service. And that's a great thing, right? But we weren't really happy at that point, because firstly we wanted to explore more and see how we could do better, and secondly this needed a lot of training time and computational resources, which we wanted to avoid if possible. As we see in this graph, the main limitation of deep learning-based approaches is that deep learning is not applicable in every scenario, because of the lack of labeled training data that we often need for task-specific problems in organizations. And in addition to the high labeled training data requirements, there is also a huge training time and computational expense associated with deep learning. So that's a major limitation we wanted to alleviate if possible.

So we looked at approaches like transfer learning, which some of you might have heard of. It has long been researched and applied in computer vision and image processing. What you do in transfer learning is train your model on certain kinds of data, as Subin also mentioned in the previous talk, and then adapt it to new kinds of data without changing the entire model completely.
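Before we get to transfer learning, here is a minimal sketch of an LSTM prototype along the lines just described, using Keras on the IMDB review data, where each review is converted into a padded sequence of word IDs that an embedding layer turns into vectors. The vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than the exact configuration used for the service.

```python
# Minimal sketch of an LSTM sentiment classifier on the IMDB review data.
# Hyperparameters and layer sizes are illustrative, not the talk's actual setup.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # keep only the most frequent words
MAX_LEN = 250        # pad/truncate each review to a fixed-length ID sequence

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=MAX_LEN)

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),         # word IDs -> dense embedding vectors
    layers.LSTM(64),                           # captures long-term dependencies
    layers.Dense(1, activation="sigmoid"),     # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_split=0.2, epochs=2, batch_size=64)
print(model.evaluate(x_test, y_test))
```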
And you must be wondering, how does transfer learning work for NLP, and why is pre-training useful in NLP? Simply put, it just so happens that pre-training allows a model to capture and learn a wide variety of linguistic phenomena, such as long-term dependencies and negation, before it is fine-tuned on a task-specific or domain-specific problem like sentiment classification. This works in language because, as it happens, negation is a really important property of text: when we want to derive sentiment polarity, negation is a useful feature for understanding sentiment, or even sarcasm, for instance. And a language model that possesses such universal properties can be useful in tasks where there is a lack of annotated data sets or language resources. But this idea is also exciting because we are not only aiming to build a universal model, we are also addressing some of the difficult challenges in ML research, mainly the availability of data and resources and the computational expense.

One of the primary breakthroughs in this domain was BERT, Bidirectional Encoder Representations from Transformers. This was a really revolutionary technique for using transfer learning in NLP, and it was released last year. It tries to alleviate the main issue, the shortage of training data, and it works in two steps. The first step is semi-supervised learning on large amounts of data. You take a data set like Wikipedia, for instance; it is not labeled, but it is huge, there is a lot of text stored in Wikipedia, and you make a model learn from the features appearing in that unannotated text and build a model out of it. Then comes the second, fine-tuning step, which is supervised. Here you introduce a data set that, although smaller than in the previous step, is still sizable, and you make the model do task-specific classification. In our case, we fed it data labeled with sentiment annotations, and hence it can classify positive and negative sentences. In this second step you can introduce whatever task you want: entity detection, coreference resolution, and a lot of other fine-tuning procedures.

So what makes BERT really different from previous approaches? Firstly, contextual representation. You might have heard about GloVe and word2vec; they are context-free representations, which means the word "bank," be it in the context of a river or in the context of money, would have the same representation in word2vec, whereas here we have different representations for words depending on their context. Secondly, it is a deeply bidirectional, unsupervised language representation. Unidirectional models are trained by predicting each word conditioned on the previous words in the sentence; in a bidirectional model, you mask out words and let the words appearing before and after predict the masked word. So it's based on a masking technique, and it's deeply bidirectional. And thirdly, in this approach the model actually encapsulates relationships between sentences, not only words. That means, say you have sentence A and sentence B: you get a probability score of whether sentence B is an actual successor of sentence A or just some random sentence appearing after it.
And last but not least, and really important for us, the transformer architecture is much more parallelizable than RNNs, which are very sequential. Because of this, we are able to utilize GPU infrastructure when available, and that really improves our training, hyperparameter tuning, and prediction procedures all at once.

So we trained the model using BERT and plotted the results, the F1 scores. We saw that the F1 scores of the BERT-based model were even better, a lot better than the LSTM-based model. This was pretty exciting: we had reached better scores and much better classification accuracy. And why is this important? First of all, we are classifying better. Secondly, we are now annotating one label, one sentiment score, for an entire piece of text, as opposed to the previous approach where we were labeling sentence by sentence. And thirdly, this was taking much less time to train on the GPUs. I haven't put the metrics here, but they're in the repository.

So this was really exciting. After deploying this model and performing the analyses, you review the trained model's performance and see whether any adjustments must be made to the annotator to improve its ability to detect sentiment. However, in a cognitive DevOps environment, you do not really stop there; the cycle doesn't stop there. A wider view of the training system over time incorporates feedback from the users of the system, from the system that you piloted. This typically includes a custom user interface with which users can interact. The model is used by the application and the results are shown in the application UI, as you see here, so the end users can view the results in the application UI and provide their feedback on sentences that are not being classified accurately. This feedback is collected and stored from many users across many interactions with the system.

And what do you do with the feedback you have collected? At a given threshold, this feedback is incorporated back into the system. Optionally, the feedback can be reviewed by machine learning engineers or data scientists and corrected or re-annotated if required. You typically incorporate a batch of feedback back into the system along with the initial annotated corpus, retrain the model, and thereby produce the next version of the model. We repeat these steps until we have reached the accuracy we want, or whenever, say, new variability is introduced into the system, and then we perform this retraining. As you can see in the diagram, we incorporate the feedback in our second, supervised fine-tuning step: we add the feedback back into the data set we use for training the model.

Now, some lessons we learned while going through this process. Firstly, you know that AI systems learn more as they encounter new training data, but it is also important to test your model early and often to verify that it is actually extracting the desired insights from the data. You should measure the performance of your system regularly as you add new training data, and these regular measurements help you determine when your model's performance is no longer improving with new training data, which indicates either that you need to refine the model itself, that it has reached a logical stopping point, or that you need to change the initial training data that you had.
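As a sketch of what that regular measurement step can look like, here is a minimal evaluation helper that scores a model version against a fixed held-out set; `predict_labels` is a hypothetical stand-in for whichever model version is being evaluated, and the three-way label set is an assumption for illustration.

```python
# Minimal sketch of a regular evaluation step: after each retraining run, score
# the current model version on a fixed held-out set and log per-label metrics.
# `predict_labels` is a hypothetical helper wrapping the model under evaluation.
from sklearn.metrics import classification_report, f1_score

def evaluate(model_version, test_texts, test_labels):
    predictions = predict_labels(model_version, test_texts)  # hypothetical helper
    # Per-label precision, recall, and F1 for negative/neutral/positive.
    print(classification_report(test_labels, predictions))
    # Track this number across iterations; if it stops improving as new training
    # data or feedback is added, revisit the model or the initial training data.
    return f1_score(test_labels, predictions, average="macro")
```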
And it is also advisable to go beyond the F1 score, precision, and recall metrics as you start going deeper. There are other metrics you could track. For example, firstly, the number of ground-truth records, that is, the training and test examples you use in each iteration; you should be tracking what the split of training and test is. Secondly, you want to track the number of records in the ground truth for each label type in the new data set you're introducing: how many are positive instances and how many are negative instances. And each artifact related to your AI system should be tracked in source control; this includes your code, your model, and your training data itself. Software developers are familiar with the virtues of version control, but this is just as important when developing machine learning workflows.

Secondly, we learned that you need enough variability in your training data so that your model is actually able to learn all the patterns you want to extract. Too little training data gives the model too little to learn from and results in underfitting, and too much training data relative to your test data is also not good; it can result in overfitting. So you need a fine balance in the training and test split, because both overfitting and underfitting can lead to poor model predictions and thus affect performance on the test set. Hence, the selection of a representative ground truth is absolutely critical to training cognitive systems.

And with this, we can look at a short demo of the model serving. So, can you see my screen here? What we have done is train our model using the Jupyter Notebook infrastructure and save it in .pb format into a folder called model.en, model underscore en, my bad. And what we are going to do is create a Docker container using TensorFlow Serving to actually serve this model and let us get predictions from a web UI. For that, I have the commands here, why don't I make it easy. First, we pull the TensorFlow Serving image and run it; I'm calling it serving base demo. Next, we add our model, saved in model underscore en, to the container we pulled from the TensorFlow Serving image, and then commit the changes we made. Just going to check here, if I can find my mouse pointer. Okay, awesome. I'm simply going to check that the Docker images were created, and as you can see, 18 seconds ago, serving model demo, that's the latest image. Now we no longer need the initial container we pulled, serving base demo, since we already added our model to it, so we simply kill it, and then we run the new container. Awesome. So this is now being served, and what we can do next is look at a simple Flask Python file we created for the client. What we do here is handle a POST request on the port we open, and what the POST request does is point to the model we have served with TensorFlow Serving and get predictions; it's as simple as that. So we are going to run this now with Python. Awesome.
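For reference, here is a minimal sketch of what a Flask client like the one in the demo can look like: it accepts a sentence via POST and forwards it to the model served by TensorFlow Serving over its REST API. The port, model name, and the input and output field names are assumptions, since they depend on how the SavedModel's serving signature was exported.

```python
# Minimal sketch of a Flask client in front of a TensorFlow Serving container.
# Port, model name, and request/response field names are assumptions; they
# depend on the SavedModel's exported serving signature.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
TF_SERVING_URL = "http://localhost:8501/v1/models/model_en:predict"

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.form["sentence"]
    payload = {"instances": [{"text": sentence}]}
    response = requests.post(TF_SERVING_URL, json=payload).json()
    probs = response["predictions"][0]       # assumed [p(negative), p(positive)]
    label = int(probs[1] > probs[0])          # 0 = negative, 1 = positive
    return jsonify({"probabilities": probs, "label": label})

if __name__ == "__main__":
    app.run(port=5000)
```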
So we have this URL. I'm going to just paste the link here, because I didn't want it to default to opening in Chrome. And as you see, there's a text box here where you can submit your query, your sentence, and what it returns is the probabilities for the two labels we trained on, negative and positive, where negative corresponds to label zero and positive corresponds to label one. I'm just going to enter a random sentence here, "I like these kinds of talks," as simple as that. And as you can see, these are the two probabilities for labels zero and one, and it finally returns the greater one, which is label one, which is of course positive. And if we enter something negative, that should come back as label zero, right? And it returns label zero. It's as simple as that.

So going back to our slides, we don't need our backup demo. And how do you build such a system? To build such a system, we used our Open Data Hub infrastructure. Open Data Hub is an integration of tools to support machine learning and data engineering workloads for various personas, like data engineers, machine learning engineers, et cetera. You can do ETL, you can run your models, you can serve a model once it's developed and deployed, and you can also do things like monitoring and hyperparameter tuning using the Open Data Hub infrastructure. The thing that is really useful for us is that it provides custom images for Jupyter Notebooks: if you want to run your workload using the TensorFlow API, or you want to use Spark for ETL, you can use those, and this really improves and enhances your model training and deployment procedures. To learn more about this, you can also watch the recording of the talk given by Pete and Joanna from my team; I've linked it there. And that's about it, I think. Thank you for listening.

Audience: So this is the first time I've heard about BERT. I did some Googling on it; it looks like it came out from Google. Are you using the BERT PyPI module I'm looking at, BERT embeddings 1.0.1?

Yeah, we are using the TensorFlow version of BERT, which comes in four versions. We are using the large, uncased version of that.

Audience: Okay, and how old or new is that, the one you're using? How long has it been available?

Oh, it's a very recent development from Google, which came out in December 2018.

Audience: Okay, yeah. You might have mentioned this already and I might have missed it, but you trained your model on the IMDB dataset, right? So the test accuracy and all the results you showed us, were they from the test split of the IMDB dataset, or from Red Hat's own dataset? How was it tested?

Great question. The F1 scores I'm showing are on Red Hat's customer dataset, the customer service reviews that we are using; those are what I ran the F1 scores on. But I trained on the IMDB movie review dataset because that's the only huge dataset we have at this point, though with the feedback loop we are trying to collect more annotated datasets.

Audience: Can you talk a little bit about the BERT network, as in how big the network was, whether it was easy to train, and which hyperparameters you were tuning, what you were optimizing for? And one more thing: did you go back to the entity detection part with the BERT model, or did you just not get to it?
Yeah, great question. For the first part: training the BERT-based model was a little challenging because it is a two-step procedure. We didn't try to retrain the pre-trained BERT base model itself, which would take something like two months on four GPUs. What we did was work only on the fine-tuning of the model, which did not take a lot of hyperparameter tuning, because the examples given in the repository recommended hyperparameters for a classification task like sentiment analysis. But beyond that, it would take a lot of training time using only CPUs, four cores and 16 gigabytes of RAM, so switching to GPU infrastructure really sped that process up. And the second part, you asked whether we also went back to entity detection. We have not gone back to entity detection yet, because that is still behaving pretty robustly using the spaCy API, but in the future we might want to use the BERT infrastructure for entity detection as well. Any more questions? All right, thank you.