Anyway, thanks for coming, thanks to the organizers, thanks to Jean, B, Amir, and the people who provided the venue for this conference and this talk. So, before I start, I'd like to ask a bit about your profile. Looking at the audience as a whole, some of you look very young, probably still fresh, still in university. How many of you here are students, by the way? Maybe about one quarter. How many of you work in startups? Another quarter, yeah. How many of you work in, say, the financial industry or a banking institution? Cool, thanks. Okay, I have two main objectives for this talk, which will hopefully last about an hour. The first is to introduce you to a concept that is not so new, but not so well known: the data granule. The second is to help you appreciate the difficulty of using big data in the capital markets — I'm talking about the capital markets in this case. As for the contents you can see here: initially it will be a bit theoretical, just to serve as background for what I'll say later, and then it will get more practical. I'll do some demonstrations, and since this is a Python data conference, I'll run some Python scripts for you, especially in the use cases. So, let me start. Now, you all know the phrase "big data", right? So why is it not called "big information"? What do you all think? Any answers? Why not big information? Sorry? [Audience: It's just data — it's what you use it for; you get the information out of it.] Yeah. Right, absolutely. I think you can look at it this way: data is unmined; information is what you want to know. You want to extract information from your different data sources, whether it's financial news, market data, or all the different data silos.
And out of this, you get your information from your data. Today, our first talk touched on what information retrieval is, and I'll slowly build up from the big perspective down to the more granular part. Now, I take this from Wikipedia: information retrieval is the activity of obtaining information from different data sources. Most of the time, at least in what we're familiar with, it concerns unstructured data. Ten or twenty years ago you had RDBMS and very well-developed SQL queries, which are very good for extracting structured data — numbers and so on. But now, with the growth of the internet and of internet sources, you see a lot of unstructured data going around, and that requires a form of information retrieval. Some practical uses of information retrieval, very quickly — some of you will know these: topic classification, sentiment analysis, search engines and page ranking, automatic classification, automatic summarization — like Flipboard, a very popular app on your iPad and iPhone — and chatbots, which are a recent phenomenon. A chatbot is a chat machine, and these have become very popular nowadays in the fintech industry. So those are some practical uses of information retrieval. And what techniques does information retrieval use? I list some of them here: word vectors, TF-IDF, similarity scoring, cosine similarity. Some of you will probably be familiar with PageRank and LSI. They are all analytical techniques for retrieving information from documents. Very briefly, TF-IDF stands for term frequency–inverse document frequency.
All of these work on the principal idea of a word vector — what computer scientists call features. Then you do some machine learning, do some comparison, compute cosine similarity, and use it for the things on the previous slide, like topic classification or NLP sentiment analysis. So that's the very big picture of what information retrieval is about. Things like latent semantic analysis come from generative probabilistic models. In short, they try to classify a document according to certain topics. But obviously a document won't belong to just one topic — it may be, say, 50% politics and 30% sports, and so on. So it uses a latent distribution — a mixture of distributions — to fit a document to particular topics. Now, can you identify the unifying theme among all these algorithms? Can anyone say anything? [Audience: Text mining?] Text mining, yes. [Audience: Can you do all this in Python?] You're right, you're right. For TF-IDF you have scikit-learn, and cosine similarity is just a matrix operation; if you're writing in Python you have all these kinds of libraries, and latent semantic analysis as well — you can do all these algorithms in Python. Sorry, what's your name? Amy. Yes, Amy, you're right, it's text mining. You can see this is text mining in the sense that it works on a whole document — it looks at a very holistic level at what the information is about. This is what has been going on for the past decade or so. And then there has been a bit of an evolution, from this generalistic sense to a more specific sense which I will build upon, called information granulation. I think you'll see some very familiar names here.
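The TF-IDF and cosine-similarity steps just described can be sketched in plain Python (scikit-learn's TfidfVectorizer does the same thing at scale; the toy documents below are invented for illustration):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into TF-IDF word vectors.

    TF-IDF = term frequency * inverse document frequency:
    words common in one document but rare across the corpus
    get the highest weights.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse word vectors."""
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(u) == 0 or norm(v) == 0:
        return 0.0
    dot = sum(u[t] * v[t] for t in u if t in v)
    return dot / (norm(u) * norm(v))

docs = [
    "markets fell on banking fears".split(),
    "banking stocks fell sharply".split(),
    "the team won the football match".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # two financial stories: positive
print(cosine_similarity(vecs[0], vecs[2]))  # finance vs. sports: 0.0, no shared terms
```

Note how the whole document is reduced to one feature vector — this is the "holistic" level the talk contrasts with the granular, triple-level view that comes next.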
We had Web 1.0: HTTP, SQL, SOAP, XML — I won't go into that much. We had Web 2.0: again, you see familiar names — Flash, JavaScript, Java; you all know what those are. We are now moving into Web 3.0, where it gets a bit less familiar: SPARQL, RDF, semantic databases. And we are moving towards Web 4.0, with more intelligent personal agents and more distributed search. So now I'll introduce you to RDF. Does anyone already know what RDF is? Oh, okay. All right. It stands for Resource Description Framework. It's really a metadata specification — one for the future. I'll describe a little of what RDF is and give you some background to it. Now, RDF 1.0 was first released in 2004, and the second version was released just two years ago. As I mentioned, it's a metadata specification — metadata, as you all know, describes the data that actually sits in the database. RDF exists in different serialization formats: you have N-Triples, Turtle, RDF/XML — I won't describe those in much detail. It's actually similar in spirit to a relational database, but at a more granular level: everything is expressed as triples, which is the next slide. And along with RDF there is a whole set of tools as well. You have SPARQL, and you have ontologies. SPARQL is a query language built for RDF — you'll see it's very similar to SQL. And as some of you know, an ontology is a hierarchical classification of terms — a hierarchy in the sense that you have, for example, animals, then mammals, then the different animals down to humans. This is useful in the RDF context because RDF sits at a very granular level, and it helps to have a hierarchy, an ontology, to describe your data. Any questions so far? Thanks. I will now describe triples.
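To make the "SPARQL is very similar to SQL" point concrete, here is an illustrative query held in a Python string. The prefix, predicate names, and the SQL analogy are all invented for this sketch — they are not from any real dataset:

```python
# A SPARQL query reads much like SQL, but it matches triple patterns
# (subject, predicate, object) instead of table rows. Variables start
# with '?'. This hypothetical query asks a triple store for companies
# and their suppliers located in a given country.
sparql_query = """
PREFIX ex: <http://example.org/>

SELECT ?company ?supplier
WHERE {
    ?company  ex:hasSupplier ?supplier .
    ?supplier ex:locatedIn   ex:Taiwan .
}
"""

# The same idea expressed against a relational table would be roughly:
sql_analogy = """
SELECT company, supplier
FROM supply_chain
WHERE supplier_country = 'Taiwan'
"""

print(sparql_query)
```

The key difference: the SQL version queries one fixed table schema, while the SPARQL version walks a graph of triples, joining on shared subjects and objects.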
Now, RDF basically describes triples. A triple is a subject, a predicate, and an object. So, for example: Alex is eating pizza. Alex is the subject, eating is the predicate, and pizza is the object. You can see this is a very simple format, but it describes information at a very low level. Then imagine a whole lot of these triples — all the information that sits in an enterprise layer — and you link them all together, billions of them, and you get a huge graph of RDF entities and predicates. Which brings me, moving now into the practice, now that you have a good background on RDF and on how data is gradually going from the general to the specific. At Thomson Reuters we do research on BOLD, which stands for Big Open Linked Data. We try to segment data, to break it down to a very granular level into RDF formats, so that we actually get more information from it than by just looking at the whole document and classifying its topic. We have three initiatives: one is PermID, one is Open Calais, and the third is Data Fusion. I'll describe each of them in detail. All right, this is our PermID and Intelligent Tagging. PermID is literally an identifier — an immutable 64-bit identifier. Those of you in the financial industry know things like ISIN and CUSIP for bonds. PermID is the same concept, but extended to a broader class of objects, which can include instruments, people, organizations, market quotes. The reason we have PermID is to have a unified key across different data sources. We also have Intelligent Tagging, which is also called Open Calais. What Open Calais does — and I'll demonstrate it on the web page later — is categorize and tag all the entities. In natural language processing this is called named-entity recognition.
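The subject–predicate–object triples just described can be modelled in a few lines — a toy in-memory triple store (real stores such as Data Fusion hold these at enterprise scale; the facts below are the talk's own examples plus invented ones):

```python
# Each fact is one (subject, predicate, object) tuple.
triples = {
    ("Alex",     "eats",       "pizza"),
    ("CompanyA", "undergoes",  "IPO"),
    ("CompanyA", "supplierOf", "CompanyB"),
    ("CompanyB", "listedOn",   "NYSE"),
}

def match(s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything the store knows about CompanyA:
print(match(s="CompanyA"))

# Triples that share a subject or object chain together —
# CompanyA -> CompanyB -> NYSE — and that chaining is exactly
# what turns isolated facts into the big linked graph.
```

Pattern matching with wildcards is the same operation a SPARQL `WHERE` clause performs, just without indexes or a query planner.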
It tags the different entities — whether people, companies, or events — and puts them into an RDF format. As I mentioned just now, being in RDF brings the data down to a very low, granular level. For example, you can have an event such as a company going IPO, and this will be reflected in the document itself at the RDF level: company A undergoes IPO, where company A is the subject and IPO is the object. So once we have the PermIDs as identifiers across the different data sources, and Intelligent Tagging tagging the different documents — by entity, by event — into RDF formats, you can imagine that being done for a whole lot of documents, and it is all fed into Data Fusion. Data Fusion is a triple store: it stores all the different RDF triples of our entities and relates them — links them — together. Don't forget, it's Big Open Linked Data. And from this you can see relationships. I guess you all know that we live in a very interlinked world; it's a network, so there is not only causality but also correlation among different entities, which we see from the graph structure based on the predicates, the actions. Okay, next I will show a very short video on Data Fusion. [Video plays:] There's so much data. It's challenging to find what you need to make the right business decisions. You can lose sight of the big picture because the data is structured and unstructured, scattered over various sources, sometimes within your company and also around the world. So, how do you make sense of it all? And how can you make use of all of this data to drive more efficient and effective strategy? The answer is Thomson Reuters Cortellis Data Fusion. It's the solution that provides a seamless integration of your data, other data and our data.
It connects the dots so you can understand the information and start innovating even while more data is being generated. Cortellis Data Fusion keeps track of all of the information wherever it is, establishing relationships between all the data available to you — and even the data you never knew existed — pinpointing the information that drives your business forward. Cortellis Data Fusion lets you make sense of everything by securely pulling in external data to sync with your data and combine with our curated data. It can be installed behind your business's firewall or private cloud to keep all your accumulated data completely secure. The ultimate platform for your new insights and decision making, Cortellis Data Fusion allows you to integrate your internal, public and content-provider data — the future of enterprise data integration, now. [The video ends and the speaker closes the player.] Thanks. Alright. Actually, Data Fusion used to be very heavily used and very popular in the pharmaceutical industry, and it has been transplanted into the financial industry. The reason it was such a good success in pharma is that there the relationships are quite standard. You can say: if you take this drug, it's good for this disease; if you have this disease, you have these side effects. It's all very structured, in a sense — the types of relationships, the types of events, the scope are much smaller. Which explains why Data Fusion was very popular and a good success in the pharmaceutical industry. It's not new, in that sense — it has been popular in pharma for something like the past ten years. Now, when we transplant this idea into the financial industry, it takes on a different paradigm, a different perspective. We still do the same things.
We still have the RDF format, with data at a very granular level; we still use tagging; we still use triple stores; but we have to do it in a different manner. Now, the next thing I will demonstrate is some of the tools we have for Open Calais. You can see we have a website, and you can actually use it as well. I will just run it on a certain document — this can be any document on the internet, or any internal document, research report, or analyst report that we put through Open Calais, or Intelligent Tagging. So, you can see here the tagging that has been done by the website — Thomson Reuters Intelligent Tagging. You have the topic classification, which I already mentioned. We also have the entities here: some countries have been recognized, and organizations, and obviously this looks like a labour topic — it's scored 99% as a labour subject, and you have "workers' pay" among the topics. So this is what I already mentioned: the topic gives the big picture, the general view. Then we can go one step further, and this is the RDF file. You can see some of the RDF here — this is the object, and some of these are the subject and the predicate — a whole lot of information at a very granular level, as you can see. You can save this file in RDF/XML format, or serialize it into other formats, whether Turtle or N-Triples, and it is fed into Data Fusion with all the different triples. And as promised, there's a bit of Python here. This is a similar one: for Open Calais we have a Python API. What this does is — I've already saved some documents into my folder here, and when I run the program, it reads each file, sends it to the Open Calais API, and the API converts the documents into RDF format in the output directory.
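The flow just described — read a document, send it to the tagging API, receive RDF — can be sketched like this. The endpoint URL is a placeholder and the header names are assumptions; the exact values and your access token come from the Open Calais website:

```python
import urllib.request

CALAIS_URL = "https://api.example.com/calais"  # placeholder, not the real endpoint
TOKEN = "YOUR-TOKEN-HERE"                      # obtained from the website

def build_tagging_request(text: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for one document."""
    return urllib.request.Request(
        CALAIS_URL,
        data=text.encode("utf-8"),
        headers={
            "X-AG-Access-Token": TOKEN,  # assumed header name
            "Content-Type": "text/raw",
            "outputFormat": "xml/rdf",   # ask for RDF back
        },
        method="POST",
    )

req = build_tagging_request("Company A undergoes IPO on the NYSE.")
print(req.get_method(), req.full_url)

# Sending it would be one line:
#   rdf_bytes = urllib.request.urlopen(req).read()
# and the response is the machine-readable RDF shown in the demo.
```

In the actual demo script this is wrapped in a loop over an input folder, writing each response into the output directory.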
You obtain a token from the website; the program sends the files to the API, and the output ends up here. You can see this is machine-readable RDF. This code is actually available on our website as well, for Open Calais. We genuinely feel it's the best semantic engine in this area — at Thomson Reuters we put a lot of effort into Intelligent Tagging. Now, the next — yes, share, please. It's free up to a certain number of documents a day — I think it's about a few thousand. And we have Open PermID as well. Yes, please. [Audience: For RDF, it's subject–relationship–object, a node–relationship–node. So what is the difference between RDF and a graph?] A graph is composed of a lot of RDF — imagine you have a lot of triples, millions, potentially billions, and the nodes are linked, obviously, by the predicates. Imagine the whole network they form — that's the graph. Does that explain it? [Audience: And this graph consists of RDF?] Millions of RDF triples linked together, yeah. Thanks. Now, this is our website as well — it's an open thing, PermID. You can enter a certain entity, and you can do it in Python as well. In fact, here I searched for the PermID of this particular entity called 1MDB. I've run it already, and it returns a series of related entities as well. Now, 1MDB is obviously an abbreviation, so it gets an exact hit; in fact there are a whole lot of similar entities — 1MDB Energy Investments, all entities related to 1MDB — returned by running the Python script. Again, it uses a token which you obtain from the website. It's a RESTful API: it goes to that website, searches internally, and returns a whole list of related entities that could match your query.
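The PermID lookup script boils down to building one GET URL with the query and token. The base URL and the token parameter name below are placeholders, not the real API values; take those from the PermID website:

```python
import urllib.parse

BASE_URL = "https://api.example.com/permid/search"  # placeholder endpoint

def search_url(query: str, token: str) -> str:
    """Build the RESTful GET URL for an entity search such as '1MDB'."""
    params = urllib.parse.urlencode({
        "q": query,
        "access-token": token,  # assumed parameter name
    })
    return f"{BASE_URL}?{params}"

print(search_url("1MDB", "YOUR-TOKEN"))

# A GET on the real URL returns a list of matching entities —
# e.g. "1MDB Energy Investments" and other related names,
# each with its own PermID.
```

Because it is plain REST, the same lookup works from a browser, `curl`, or any language, not only Python.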
[Audience: What kinds of entities do you have in the database? Organizations? Market instruments?] Quotes as well. And we are building persons into it too — if you search a name it may come up; maybe search Obama or something. But it has to be a VIP. [Audience: Do you have cross-references to other databases — like Wikipedia?] There's an RDF version of Wikipedia called DBpedia, and by being in RDF it can be fed into Data Fusion. Because Data Fusion is basically a data lake: it will take any data that is, say, CSV or JDBC compliant, and to an even easier extent it can take in DBpedia. [Audience: Without explicit cross-references to the other database, there might be an error rate when you try to reconcile — say I would like to cross-reference with OWL, or with Wikipedia.] So you mean that when you use data from other sources, you may get some confusion over certain entities — the same entities? Yeah, you're right, absolutely. That is something we are trying to resolve as well. Thanks for the question. So, I will now move on to applications in finance. You can imagine that this kind of networked environment is actually what the financial industry is moving towards. Take the Lehman crisis — why did the crisis happen? It was contagion: Lehman fell, and that affected other banks, which affected other banks, and it kept spreading out through the network. So it's really a graph: you have relationships between the different entities. And this is what I mean by counterparty analysis and stress testing.
So, you can imagine how useful this can be — notwithstanding that this is still an area of active research, and I'll show you some of the new advances in this area. It's also used in compliance and risk. Compliance and risk — I think we all know what happened with 1MDB, BSI, etc., in the past month or so. There's a lot of emphasis nowadays on anti-money-laundering, and on trade finance as well. Now, it's just not enough to know whether a person is good or bad. Criminals are smart nowadays: something that on the surface appears to be a good entity could have, behind it, a dummy company — a shell company, and a shell company of a shell company — until finally you reach that particular dubious character. So again there's a network of relationships, one linked to another, and you iterate: you need to do second-level checks, third-level checks. This is a key concern now for the regulators. So again, it forms a network that is very useful to analyze in this area. And of course we have credit risk networks and contagion. And then the supply chain. We know that, for example, Foxconn is a supplier of Apple — but who is the supplier of Foxconn? You have an Apple iPhone, and all the different parts come from different companies, and you get a whole supply chain — again, one thing linked to another, a kind of network. So, you can see — okay, finally you can see the graph. This is actually coming from Data Fusion.
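The second- and third-level checks just described amount to a graph traversal: follow ownership links outward from a counterparty until a flagged entity turns up or the chain runs out. A minimal sketch, with company names and links invented for illustration:

```python
from collections import deque

# Hypothetical ownership graph: a clean-looking front company
# hiding a flagged actor three hops behind shell companies.
owns = {
    "FrontCo":  ["ShellCo1"],
    "ShellCo1": ["ShellCo2"],
    "ShellCo2": ["DubiousActor"],
}
watchlist = {"DubiousActor"}

def screen(entity, max_depth=3):
    """Breadth-first search up to max_depth ownership hops.

    Returns (flagged_entity, hops) on a hit, else (None, None).
    """
    queue = deque([(entity, 0)])
    seen = {entity}
    while queue:
        node, depth = queue.popleft()
        if node in watchlist:
            return node, depth
        if depth < max_depth:
            for nxt in owns.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None, None

print(screen("FrontCo"))               # found: three hops behind the front
print(screen("FrontCo", max_depth=2))  # a shallower check misses it
```

This is why a flat name-match against a watchlist is not enough: only the deeper traversal over the relationship graph exposes the chain.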
We have, for example, 1MDB, and you can see the related entities. And of course, within Thomson Reuters we have our own data sets — we have all these relationships, which enable us to build up this network as well. Okay then — yes, please. [Audience: The slide before — have you integrated this with the Thomson Reuters KYC system?] Okay, we are in the process of doing that. [Audience: You put up four use cases, right? I think there should be another use case, like investment analysis, for ratings.] Yes, you can put it that way: you can cluster stocks together. Say you're on the asset-management side, with a whole portfolio — the portfolio could span different industries, different countries — and you can do machine learning, K-means clustering, to group entities together. And then you can relate events to it. For example, say an event happens in Paris — an airport attack, some terrorist attack. What would this impact? It would probably impact the travel industry, and the travel industry would impact things like the leisure industry and the airline industry. Then, potentially, Air France is affected — and what if Air France is a shareholder of another company? It triggers on and on. So it can be used for asset management as well. [Audience: Are you also ready to integrate this with other Thomson Reuters products?] We're in the process of doing it. That's it — thank you.
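The K-means grouping mentioned above, as a minimal pure-Python sketch (in practice you would use scikit-learn's KMeans; the stocks and their feature values below are made up for illustration):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign each point to its nearest centre,
    move each centre to the mean of its points, repeat."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Invented (volatility, momentum) features: two airlines, two banks.
stocks = {
    "AirlineA": (0.9, -0.5), "AirlineB": (0.8, -0.4),
    "BankA":    (0.2,  0.3), "BankB":    (0.1,  0.4),
}
groups = kmeans(list(stocks.values()), k=2)
print(groups)  # the airlines and the banks end up in separate clusters
```

Once stocks are clustered like this, an event such as the Paris example can be propagated cluster by cluster — hit the airline group first, then follow the relationship graph outward.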