Good afternoon everyone, and thanks for taking the time to attend this session. My name is Venkat Raman. I'm a senior software engineer at Metapack. Metapack is not an Indian company; it's a European company that does logistics for e-commerce retailers, and it isn't well known in India. The topic of my presentation today is entity co-occurrence and entity reputation from unstructured data. Co-occurrence means two entities occurring together in unstructured data: if you write a tweet mentioning Nike and Adidas, those two entities co-occur in the corpus. Reputation is a score for how well an entity is doing in a graph, which entity, which node, has the most popularity in the graph.

The agenda for today is a short motivation, a basic introduction to knowledge graphs, the problem statement we tried to solve, the data flow architecture, and the conclusion. Bear in mind this is a 20-minute lightning talk, so if I run out of time we can take questions afterwards.

Motivation. This is probably one of the few sessions where you won't hear about deep learning, CNNs or recurrent neural networks, even though I know that's the hottest topic in data science these days. As I said, most data is unstructured in nature, and neural networks are one way of modelling unstructured data, but not every problem needs a machine learning or deep learning model. People think you just throw a dataset at a model, train it, predict, and then they eventually find out it's biased: it keeps giving you the most frequent label in the dataset and doesn't generalize. So not every problem needs that. Large amounts of data in this world are connected. We used RDBMSs to represent structured data, then we moved to unstructured data with the Hadoop file system, data lakes and so on, but now we are coming back to a connected, structured way of looking at data; social media drove that. Connected data, or linked data, provides insights quickly, especially at companies like Facebook and Twitter which have massive amounts of data. Models take time to run, but the marketing and sales teams use graph models to get valuable insights quickly. That's the motivation.

Now a basic introduction to knowledge graphs. I assume most of you have a computer science background, so graph theory is pretty well known. Knowledge graphs encode structured information about entities and capture the relationships among those entities. The word structured is very important here, because you can only represent connected data in a graph. If the data is not connected, if you have just a plain set of news articles, you can't represent that in a graph. A knowledge graph captures the relationships between individual items, providing a model and access patterns. By access patterns I mean there are a lot of algorithms built around graphs that help you query the data stored in them. People also say the data is represented as triples; I'm not sure how many of you have heard of the RDF format, the Resource Description Framework.
They call it triples: subject, predicate and object. The subject and the object are nouns, your places, things and persons, and the predicate is the action you perform on the object. Knowledge graphs are often compared to ontologies. Ontologies are older, legacy AI: you represent knowledge using ontologies, and they have been used in fuzzy systems and expert systems, where you capture the relations between concepts and data in an ontological format. An ontology operates at a very abstract level, much like object-oriented programming, where you write abstract classes and then subclasses. Ontologies have solved many problems in particular domains; there are places where people have converted structured RDBMS data, pretty much whole enterprise datasets, into ontologies for efficient querying. A knowledge graph is very similar to that. Well-known open source ontologies include DBpedia, which is essentially Wikipedia's structured information captured as an ontology, and then YAGO and WordNet.

Some of the use cases powered by knowledge graphs: one is improving search relevance. Google published work around 2011 on improving search relevance, and the panel you see on the right-hand side of a Google search is their knowledge graph, because what you type into Google can be an entity: a person, a hotel. They really brought knowledge graphs into production, and since then there has been a lot of research. People also use them in question answering applications, one of the big AI applications being built, because the machine has to understand relevance to return the right answer. The moment you ask a question, your voice assistants and home assistants use a knowledge graph. People also build recommendation engines and personalize product offerings with a knowledge graph. So there are a lot of use cases catching up, and I have also seen people augment a deep learning model with a knowledge graph, so it's not that the two have to be separate; you can augment the results from a model with a knowledge graph.

This slide is just a simple knowledge graph representation of intelligence agencies, because cybersecurity is an area where graphs are used heavily. It shows a person who belongs to the NSA, the National Security Agency, whose nationality is United States, and who also belongs to the CIA. Cybersecurity and homeland security pretty much run on connected data; that's why I took this example. That's how agencies share information across countries, so that you can get insights easily.

Okay, the problem statement. These two gentlemen, Ajay and Juzi, reviewed Nokia and Samsung mobiles on Twitter. Pretty much everyone now reviews products on Twitter: you buy a Volkswagen and you go to Twitter to say your Volkswagen is running at 120 kilometers per hour. People write about products on social media.
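As a concrete illustration (my own sketch, not something from the talk), a couple of reviews like that could be captured as subject-predicate-object triples, which are exactly the nodes and edges a knowledge graph stores; all the names and relation types here are made up for the example.

```python
# Hypothetical subject-predicate-object triples derived from product-review tweets.
# Subjects and objects are entities (nouns); predicates are the relations between them.
triples = [
    ("Ajay", "REVIEWED", "Nokia"),
    ("Ajay", "LIKES", "Nokia"),
    ("Juzi", "REVIEWED", "Samsung"),
    ("Juzi", "DISLIKES", "Samsung"),
    ("Ajay", "FRIEND_OF", "Juzi"),
]

# Viewed as a graph: every subject/object becomes a node, every predicate an edge.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o, {"type": p}) for s, p, o in triples]

print(sorted(nodes))   # ['Ajay', 'Juzi', 'Nokia', 'Samsung']
print(edges)
```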
Now this information is nice for us to like and share, but it's really useful for a marketing person who wants to know how their product is doing. Even at companies like Facebook, a lot of decisions come from data-driven insights. So the problem is: how can you get valuable insights out of this unstructured data, which in this case is Twitter?

Okay, the techniques we used for knowledge graph construction. There are some tools available for building knowledge graphs. Google didn't release any open source tool, but there are some open source tools from the University of Toronto. We evaluated them, but they didn't quite give us what we wanted, so we relied on basic text processing techniques, which pretty much everyone knows: cleaning the data, pre-processing, extracting the entities, keeping only the relevant information, removing stop words and so on.

In the knowledge graph, entities are represented as nodes. In this particular domain, an entity is a person or a product, a noun, something that exists in the world. Relationships among entities are the edges. In a social media network a relationship might be friend-of-a-friend, but in this context the relationship is whether you like or dislike a product. How do you tell? If Samsung tweets that the Note 10 is about to be released and you reply that the camera is awesome, that means you like it. People do sentiment analysis with deep learning models on these things; for us, that's the relationship. In our case the entities come from Twitter feeds: persons, products and locations. Twitter is very useful for location too, because you want to know where a person is tweeting from; where a tweet comes from is valuable information for a marketing or sales team. The relationships are the likes and dislikes of an individual person about a product, plus relations to other users. We did capture friend-of-a-friend relationships, because the product team wanted to know whether a friend of a friend has the same likes as another friend, so we captured that as well; that's why we have relations to other users.

I'll come to the architecture shortly, but on the database side we first looked at Grakn.ai. I'm not sure if you've heard of Grakn.ai; it's one of the most advanced open source graph DBs available on the market. But we went with Neo4j, and Neo4j solved the problem for us. Neo4j has its own issues: relationships are stored with a single direction, you can't model a bidirectional relationship directly, and it has other quirks, but for our use case Neo4j was the better fit.
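Before moving on to inference, here's a minimal sketch of the kind of entity-and-relationship extraction just described. It assumes spaCy's small English model is installed, and the keyword-based like/dislike rule is a deliberately naive stand-in for the sentiment analysis step; none of these names come from the actual client code.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

POSITIVE = {"awesome", "great", "love", "amazing"}
NEGATIVE = {"terrible", "awful", "hate", "poor"}

def extract(tweet: str):
    """Return (entities, relation) for one tweet.

    Entities come from spaCy's named entity recognizer (ORG, PERSON, GPE, ...);
    the relation is a crude keyword lookup standing in for real sentiment analysis.
    """
    doc = nlp(tweet)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    words = {tok.lower_ for tok in doc}
    if words & POSITIVE:
        relation = "LIKES"
    elif words & NEGATIVE:
        relation = "DISLIKES"
    else:
        relation = "MENTIONS"
    return entities, relation

print(extract("The camera on the new Samsung Note 10 is awesome!"))
```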
Graph inference. Graph theory has been around for many years and people have done a lot of research and analysis on it, so there are many centrality algorithms that tell you how a node behaves in a graph. For example, there is degree centrality, which measures the number of incoming and outgoing connections of a particular node, basically the sum of incoming and outgoing connections. Degree centrality is very important: it tells you how popular the node is. If you're a good friend to a lot of people, you'll have a lot of incoming connections; if you're not, you won't have many incoming or outgoing connections. Then there's PageRank, which hardly needs an introduction: Google revolutionized it on the web to work out which web page or website is most relevant to a user when delivering content. Google combined PageRank with linear algebra, did the maths and released the search engine; PageRank is patented by Google. Closeness centrality tells you how close nodes are to each other: if node A is very close to node B, you can check whether the followers of node A are the same as the followers of node B. There are many more centrality algorithms, plus shortest path and other graph algorithms you can use to get valuable insights.

In our case we wanted to capture entity reputation, because the problem we were trying to solve was: is Samsung or Nokia the product that users like best? So we used degree centrality. We call an API that measures the number of incoming and outgoing relationships of a single node, and the entities with the highest degree centrality turned out to be the popular ones. You could run this on Twitter data and find out which is the most popular phone in the world. That's the algorithm we used.

The main thing is the data flow architecture. It's quite small, nothing massive. There's a data source, which can be anything; you can plug data in from anywhere. We wanted the product to scale, so we initially used Kafka on the streaming side, with pub/sub topics publishing to a separate topic, but we replaced Kafka with Kinesis because we found the cluster difficult to maintain. There's a microservice that reads the Twitter data source and publishes to the raw data stream. Raw means it's not enriched in any way; it's just raw data, and it can even be junk, because Twitter has a lot of noise. A consumer then reads from that stream, does stream processing and cleans up the data. Cleaning up means removing punctuation, because punctuation carries no meaning for our purposes, removing stop words, and removing unnecessary words. That part is important because, as I'll mention later, people do write some bad words. We used spaCy and other NLP models to check whether a token is a legitimate English word and whether it's acceptable, because people sometimes write really awful things on Twitter; they just write whatever they want. The cleaned data is published to another stream, to decouple it from the downstream processing.
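As a rough sketch of that cleaning step (my own illustration, not the client's code): the stop-word list comes from spaCy, and the regex for URLs and @handles is my assumption about what counted as "unnecessary" here.

```python
import re
import string

from spacy.lang.en.stop_words import STOP_WORDS  # spaCy's built-in English stop-word list

def clean_tweet(text: str) -> str:
    """Strip Twitter noise, punctuation and stop words from one raw tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)                   # URLs, @handles, '#'
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]          # stop words
    return " ".join(tokens)

print(clean_tweet("Loving the new #Note10 camera!!! @SamsungMobile https://t.co/xyz"))
```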
So the data ingestion pipeline writes to a clean data stream, which is consumed by another microservice that extracts entities and relationships. This is the key microservice, because it's the one that does most of the logic. We typically scale based on the data, and the reason we decouple is exactly so we can scale: once the publisher and consumer are decoupled, you can scale this service independently as the data grows, because Twitter data can be quite large and you sometimes get a burst of data. When that burst arrives and you need to scale, the publisher and consumer have to be decoupled. Kinesis or Kafka, whichever streaming layer you use, can handle trillions of events or messages a day. We did see some latency, what the distributed systems world calls back pressure, and we sometimes handled back pressure ourselves in the stream, though ideally the stream takes care of it. This service extracts the relationships and publishes JSON data, binary encoded, to another stream called the entities-and-relationships stream. That's the actual data the graph database needs.

We had another service called the Neo4j writer which reads from that stream, does the stream processing and writes to Neo4j. Neo4j is an ACID-compliant transactional database, so every write is a single transaction, like in an RDBMS; you can't have multiple writers without taking a lock. So we recognised that Neo4j can't handle many concurrent writers; it's not like Oracle where you can throw lots of concurrent transactions at it. We therefore had a single writer, the Neo4j writer, which reads from the entities-and-relationships stream and stores the data in Neo4j. Neo4j calls its model a property graph; I'm not sure how many of you have worked with Neo4j, but nodes are tagged with labels, that's the terminology. And Neo4j relationships are stored in a single direction; you model a relationship one way only. The reason they give is that they don't want to spend another bit storing the reverse direction, and they say it's computationally heavy. We also built an inference service, another microservice, to do the inference. For reading we used Cypher, but for writing to Neo4j we used the low-level Java APIs; we preferred Java because it's more performant than Python. Basically we wrote it in Scala and Java. That's pretty much it.

The learnings: tweets are easy to get, but the quality of the data is poor and very noisy. That's expected, I think, because it's social media. Graph querying is the most important thing, and it's not the same as SQL. I'm not sure how many of you use SQL, but SQL is the most widely used tool among data scientists for getting quick answers. Graph querying is not the same, because graphs work on pattern matching, and pattern matching is a totally different way of looking at the data: SQL scans tables, builds views and runs all sorts of algorithms underneath. So you need to understand how graph querying works, and you need to understand the data model represented in the graph. If you don't understand the data model, querying becomes really tough.
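To show what that pattern matching looks like, here's a small sketch using the official Neo4j Python driver. The labels and relationship types (User, Product, LIKES) and the connection details are assumptions for the example, not the actual schema; the second query is a crude degree-centrality count of the kind we used for entity reputation.

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Pattern matching: users who like a given product.
FIND_FANS = """
MATCH (u:User)-[:LIKES]->(p:Product {name: $name})
RETURN u.name AS user
"""

# Crude degree centrality: count all relationships touching each product.
DEGREE = """
MATCH (p:Product)-[r]-()
RETURN p.name AS product, count(r) AS degree
ORDER BY degree DESC
"""

with driver.session() as session:
    fans = [record["user"] for record in session.run(FIND_FANS, name="Samsung")]
    reputation = [(r["product"], r["degree"]) for r in session.run(DEGREE)]

print(fans)
print(reputation)  # highest degree = most talked-about entity
driver.close()
```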
Start with a small graph and iterate before building a bigger one. That's a mistake we made: we just started dumping huge amounts of information in. It's better to build a small subgraph, that's what it's called in Neo4j and in knowledge graphs generally, then build another subgraph, and then see whether you can relate the two, because querying will be faster. That's what we found. And as I said, Neo4j is ACID compliant like an RDBMS, so watch out for multiple writers writing to the database. When we raised this with Neo4j and asked why, they were very particular that they are ACID compliant. Their view is that graph data is by definition structured, and if the data is structured people expect to be able to bring RDBMS data across, so they want the same consistency guarantees. Also bear in mind that Neo4j has what it calls causal consistency. I'm not sure if you've heard of eventual consistency and strong consistency in the distributed world: an RDBMS gives you strong consistency, NoSQL databases give you an eventual consistency model, and Neo4j gives you causal consistency. What we found was that data wasn't always immediately available for querying; it's a distributed cluster, and eventually the nodes catch up with each other. That's what we saw with Neo4j. Last one: always dockerize. We deployed a containerized application and orchestrated it with Docker Swarm, but you can use Kubernetes. Orchestration is very important so that you can scale up or scale down.

The conclusion: identify the problems that can be solved using graph theory. Scoring via a graph model, as I said, can augment or support the results from a deep learning model; you can always run it in parallel with deep learning models. These two are really good papers if you want to learn about knowledge graphs, and there are many more published by Google. I think Google search is now powered by a knowledge graph at the back end; the question-and-answer results you see are all powered by the knowledge graph. That's all. Thank you. Any questions, if we have time?

Thanks for the session. Two minutes, I have one question. This is Gagani from Pesh Gravity. I just want to check one thing. You wrote an NLP microservice to extract entities and relationships from the processed tweets. Does that microservice consider context as well? Let's say there's a term like GlaxoSmithKline: it could refer to the name of a medicine and it could also refer to the name of an organization. Was it smart enough to identify whether that entity is a medicine or an organization?

Yes, we did named entity recognition there, which works out what the entity is.

Exactly, so that's what I'm asking: was your service able to distinguish?

Yes, we used a number of libraries to do that.

Can you elaborate on it?

I'm not sure if you've heard of spaCy. We used spaCy's deep learning models to work out whether a thing is an organization or something else; there's a specific set of entity types in the spaCy model, which is a trained deep learning model.

So what usually happens in libraries like spaCy or Stanford that we use, right?
There are some generic entity types; if you're targeting those, the solution is already there, and you can further train the model on your own dataset as well.

Yes.

But when you're dealing with very domain-specific entities, say drugs or proteins, these models need to be retrained specifically on your training dataset. So that's what I want to understand: what was your entity set, and was spaCy enough for all of the entities you were trying to extract?

In this case we were reviewing a product, a Samsung mobile, so it was pretty easy. But as you said, if you need a domain-specific model, you have to retrain: use transfer learning, download the spaCy model, retrain it, and evaluate; that's the best way. In our case it was straightforward because we were measuring well-known brands.

Yeah, because in this area the hardest thing is annotating the data. Training and retraining are all fine, but it's garbage in, garbage out, so the key question is how you annotate.

Bear in mind this was done for a client who had a direct connection with Twitter to get the data, and there was some manual cleaning done which I've glossed over here. A lot of cleaning was done before we got the tweets; because it was for a big client, they were very particular that you filter out all the noise before analyzing it. If you use the Twitter firehose API it will give you everything unless you put a filter on it. So there was some manual annotation, I would say manual work, done.

There's actually a difference, honestly. That's the pre-processing you mentioned, the cleaning, because Twitter data is noisy for all of us, right?

Yes.

You need that cleaning of the text so that it's in a format you can actually feed into your model, right?

Sure, yes.

And then, on the cleansed text, you annotate it.

No, we didn't annotate the text. As I said, we didn't build a model; we just built a graph. We didn't build a deep learning model for this, if you see what I'm saying. Maybe we can talk outside.