I would like to introduce the first speaker of today's session. It is a great pleasure to have Dr. Sven Lauer from the University of Tartu with us. Dr. Lauer holds a PhD in computer science, which he completed in 2008 at the Helsinki University of Technology, and since 2008 he has been affiliated with the University of Tartu in many different roles, including a senior research role. He was also a project manager at STUK (just for everyone, STUK is a MLF PM partner), a role he held between 2015 and 2021. Dr. Lauer is an expert in privacy-preserving data mining and cryptography, and today his lecture will focus on how to extract information from medical texts. We are very excited about that. With that, Dr. Lauer, the floor is yours.

As I was introduced, I have held many positions at the university and at STUK, and one of them is to actually do practical data extraction from medical texts. I have been doing this for ten years, and during that time we have failed at least three times, so there are lots of things I have learned. Today I will try to give you a high-level overview of what you should know if you want to do fact extraction from medical texts. So, what is fact extraction? Say you have a medical text and you want to know the lab measurements, but unfortunately those are not in structured form; they are inside the text. You can see here there is a date, then follows the analyte (it is in Estonian), then a value, and it also has a unit. Below that there are still analytes, but sometimes a unit is present and sometimes it is not. Obviously the text is not tagged, so you do not know in advance what is there. Fact extraction is the task of finding those facts in the text. There are many reasons why you would use fact extraction.
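To make the lab-line example concrete, here is a minimal sketch of matching such lines with a regular expression. The line shape ("date, analyte, value, unit") and the Estonian analyte name are illustrative assumptions, not the lecture's actual data.

```python
import re

# Matches lines shaped like "12.03.2021 Kolesterool 5.2 mmol/L";
# every part except the analyte and the value is optional, mirroring
# how units and dates are sometimes missing in real records.
PATTERN = re.compile(
    r"(?P<date>\d{2}\.\d{2}\.\d{4})?\s*"     # optional date
    r"(?P<analyte>[A-Za-zõäöü\-]+)\s+"        # analyte name
    r"(?P<value>\d+(?:[.,]\d+)?)"             # numeric value
    r"\s*(?P<unit>[a-zA-Zµ%]+/?[a-zA-Z]*)?"   # optional unit
)

def extract_measurement(line):
    """Return the measurement fields as a dict, or None if no match."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None
```

A complete line yields all four fields; a bare "Kolesterool 5.2" still yields the analyte and value, with date and unit left as None.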
One reason is adverse drug reactions: if you want to study those, you have the patient complaints. They should be in structured form in the electronic health records, but usually they are not, because formally reporting an adverse drug reaction takes time, and it is much easier for the doctor to just write that the patient complained of nausea and that is why the drug was changed. The same is true of disease descriptions: usually you have a formal code in the medical record, but some aspects of the disease are written only in textual form. And with biopsies, x-rays and CT scans, the findings are usually in textual form, so if you want to know whether certain types of cells or certain anomalies are present in a biopsy, you have to extract those facts from the text. Application-wise, the first three are mostly epidemiological or clinical studies, but sometimes you can use fact extraction in other machine learning projects. For instance, if you would like to do image recognition, you have a large set of x-rays and the corresponding patient records. The problem is that those images have to be labeled, and each x-ray comes with a textual description, so it would be very handy if you could extract the important facts out of that description. In that way fact extraction becomes the input to another machine learning method. Essentially, information extraction allows you, first of all, to fill in the gaps: sometimes doctors do not write the data into the structured fields, or information systems do not store data in structured fields because integrating them is hard and there was no money, or whatnot. The other case is that sometimes the facts you are interested in are normally not captured in a structured way at all.
For instance, you would not want the doctor to click through forms to record your lifestyle: whether you have a lot of stress, whether you are in a peaceful period, or whether you have some crisis. This is not what you want the doctor to spend time on, so usually it is written in free text, and therefore you need to extract the information out of it. Sometimes, as I said, you want to do disease subtyping, or you want to define some refined treatment outcome. Say for stroke, we would like to know whether the patient needed a speech therapist or not. Again, this is not in the structured data; usually the text records that the doctor arranged speech therapy, or there is a description of what was done. In that sense these textual fields are very important. Throughout this lecture I will use a very simple running task, measurement extraction, because it is something you can all relate to. There are lots of measurements and they have a very simple structure: if you want to extract a measurement, you want to know what was measured, what the corresponding value was, what the units were, and when it was measured. Now say you build this system and it takes texts and converts them into tables. You then have a problem in the structured data, because the extracted measurements are very noisy: you get various misspellings in the fields, or you get the fields but with wrong terms. So the first thing you have to do after measurement extraction is an additional natural language processing task, which is very boring and laborious: you have to clean the values through standardization.
Sometimes you even need to do harmonization and conflict resolution, meaning that you have different sources and you have to check whether they contradict each other. In natural language processing terms this is the first and simplest task, and it is usually the most important one, because a medical record has easy things to extract and hard things to extract, and you should usually start from the easy things unless you are looking for something particular. The next thing to consider is that sometimes the data is in a formatted form, either HTML or XML. Then you can actually see where a field starts and where it ends, but you do not know how the data is organized. The idea here is that you first have to detect which format the data is in, and usually the data comes in several formats: there are several different ways to write down, say, the analysis tables, and you have to unify them and convert them into a common format. Then there is another, more complex task: when you want to extract data from free text, usually the free text is not completely unstructured but semi-structured. Patient records are actually split into small snippets of text, each of which is dated. So, say, in October the patient came in and complained about something, in September there was another entry, and so on. Essentially you have to do format discovery and split the text into small pieces, and when you do that you have to recognize patterns in the text which are sometimes not written correctly. In computer science terms this is usually referred to as robust parsing: you recognize the important markers and split the text into pieces even when the markers are imperfect. Finally, the most complex case is when the data is indeed in free format and there is no structure at all.
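The dated-snippet splitting just described can be sketched very simply, assuming (purely for illustration) that each entry begins with a DD.MM.YYYY date marker; real records would need a more robust set of markers.

```python
import re

# Zero-width lookahead so the date marker stays attached to its snippet.
DATE_MARKER = re.compile(r"(?=\b\d{2}\.\d{2}\.\d{4}\b)")

def split_into_entries(text):
    """Split a semi-structured record into dated snippets."""
    parts = [p.strip() for p in DATE_MARKER.split(text)]
    return [p for p in parts if p]  # drop the empty leading piece
```

Each returned snippet starts with its own date, so the October entry and the September entry come out as separate pieces.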
This example here is in free format, so you have to recognize the facts inside it and there are no markers to help you. Here there are three basic tasks that are very important. The first is segmentation. The second is that, once you have split the text, you have to order the segments by relevance: if you want to extract measurements, you order the segments so that the topmost almost certainly contain measurements and the bottommost are very unlikely to contain them. And the final task, obviously, is the fact extraction itself. Now, depending on what you want to achieve, there are three possible ways to approach this fact extraction problem. The first is that you just take your time, look through the texts and do the extraction manually. Usually the easiest way to set this up is some sort of Excel table, where one column is the text and the other is the fact you want to extract, say a measurement, and that is it; this is very easy to set up, and in many cases it is enough. If you have, say, a thousand medical documents to look through, there is no reason to set up a complex fact extraction pipeline: by the time you have set it up, you could have solved the problem in Excel. The only reasons to set one up are if you know this is not a one-time project, or if it is infeasible to look through everything by hand. That leaves you two other options. One is to use some sort of annotation program, which gives you machine learning inputs but where you make the decisions yourself; this is manual labeling where machine learning is a tool to make the work easier. And finally, obviously, you can use a fully automatic machine learning pipeline. The main distinction between those two is the question of what error rate you can allow.
If you do, say, a genome-wide association study with deep phenotyping, you can allow maybe 10 to 20 percent errors in the extraction. The signal will decrease and maybe you will not find all the SNPs associated with the deep phenotype, but essentially nothing bad happens. However, if you want to build a decision support system, that error rate is not acceptable, and therefore there has to be a human component that actually looks through what the machine has done. The other aspect to consider is what the machine should do. There are three things the machine can do. It can make a prediction, say, that this document contains a cholesterol measurement, a blood sugar measurement and so on; this is completely automatic, and it is the hardest to achieve. Then there are two other tasks. One is focus: instead of giving you the extracted value, the system says that in this large text, the measurement can only be in this small paragraph. This is very important when you do manual, curated reading, because you do not want to read all the text; so automatic focus matters. Related to that is automatic ranking: even once you have extracted the segments where measurements could be, some are more likely to contain measurements than others. Since you have finite resources, you should be able to rank them, look through the most promising ones, and at some point say that you will not read any further. Okay, so let us say we want to do this measurement extraction. There are typical steps to carry out. The first and most obvious is that you should just look at the text to see whether you can solve the task at all.
For instance, we were looking at antibiotic prescriptions in hospitals and we wanted to know whether we could see complications with antibiotics, but we had to drop that project because we looked at the text and discovered that this information is not actually typed into the electronic health records. So this first step, that you actually inspect the text and see whether you yourself can extract the facts, is very important. The next thing is that when you do this, you see the initial structure of the facts. For measurements, it is quite obvious that a measurement must contain a numeric value, and it must contain an analyte, and there are not infinitely many lab measurements; there is a certain list of measurements that can be done. So you can build term lists, and for extracting the numbers you can use regular expressions, which match strings of a certain structure, so you can recognize numbers, telephone numbers, specific codes. This is the first thing you do, and it allows you to annotate the text with the potential components. Based on that, you can do the text segmentation and also find the regions of interest. And this is very dumb: essentially it is nothing more than advanced search in text. Instead of typing Ctrl+F and "cholesterol", you do it automatically. But it is surprisingly efficient, because it lets you find the promising paragraphs. And that is important because the data you are looking for is scarce: measurements are not written in every textual field, they are done at certain intervals, usually rarely, when specific measurements are taken.
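The "advanced search" annotation step might look like the following sketch, where the analyte terms are toy Estonian examples assumed for illustration.

```python
import re

# Toy term list and a number pattern; real lists would come from standards.
TERMS = ["kolesterool", "glükoos", "hemoglobiin"]
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def annotate(text):
    """Tag term-list hits and numbers with their positions in the text."""
    hits = []
    lowered = text.lower()
    for term in TERMS:
        start = lowered.find(term)
        if start != -1:
            hits.append(("ANALYTE", term, start))
    for m in NUMBER.finditer(text):
        hits.append(("NUMBER", m.group(), m.start()))
    return sorted(hits, key=lambda h: h[2])
```

These annotations are exactly what the later segmentation and grammar steps build on: a text with no hits can be thrown away immediately.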
Therefore, if you want to build up a set of training samples where you have a text and the corresponding value, and you do it naively, what happens is that you have to read through a lot of texts and it is slow. Say you read a thousand texts and you get ten positive cases; this is not a good way to work. The term lists and regular expressions are your friends here, because they allow you to throw away a large part of the irrelevant texts. Next, obviously, you use some sort of classical machine learning method on the now-labeled data, train and validate the model, and see how well it behaves. Note what you actually want to do here. First, you need to detect whether a certain fact is in a text or not. Usually the unit of text is the sentence, so the question is whether a measurement appears in the sentence or not. This alone is already very powerful if you do curated text extraction, because it allows you to show the curator only the sentences that possibly contain measurements. The harder task is to actually do the extraction: you know the information is in the sentence, but now you have to extract it. And the third one, which I already mentioned, is prioritization: which texts to look through. Usually machine learning people think this is where it ends. You have a model, it works, it extracts the facts, and that is all. But this is not actually where it ends, because you might get facts that are wrong. For instance, you might get the fact that the patient's weight is 1.6 kilograms, but this is clearly an incorrect number, so you have to add additional filters that detect that this cannot be true. Another case: in Estonian texts there are entries where the patient's weight is, say, 3.4 or 2.7 kilograms, and those are associated with females.
This means the weight is not actually the patient's weight but the newborn child's weight. So there are things like this to consider; this is another layer of validation you have to do in order to use the data correctly. Another, more technical aspect is that there are abstraction levels in text mining, and if you neglect some of them you might get into trouble. Obviously you would like to work on phrases and sentences: take a phrase or a sentence and extract the measurement out of it. However, you have to think about the other levels as well. Let us start from the most boring one: text is stored in computers as bits, in particular characters are written as bits, and there are many ways to do that. You can use a UTF encoding, you can use Latin-1; there are many, many encodings, and sometimes information systems or doctors' computers mix them up. They take a text that is written, say, in Latin-1 and interpret it as Unicode text, and then the original content gets mangled. You get very weird-looking sequences of symbols which actually correspond to reasonable text if you decode them the correct way. This is a problem, because when you interpret the characters the wrong way, all the subsequent steps fail. Usually this is not very important, but sometimes, when you get strange errors, you have to consider it. The next thing you have to consider is tokenization: given a sequence of characters, you have to split it into words, and this can be a non-trivial thing.
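Going back to the encoding mix-up for a moment: the common case where UTF-8 bytes were wrongly decoded as Latin-1 can often be reversed. This is a sketch of that one repair, assuming that particular mix-up; other encoding confusions need other fixes.

```python
def fix_mojibake(text):
    """Repair text whose UTF-8 bytes were mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake, leave as-is

# "kõrge" (Estonian for "high") comes out as "kÃµrge" when mangled this way.
```

Running the repair on already-correct text raises a decode error internally and returns the input unchanged, so it is safe to apply broadly.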
Here, for instance, there are several numbers, and it is really hard to say whether this is a date followed by a measurement and a unit, or a year followed by a measurement where a period is missing. There are lots of problems like this in practice, because doctors write texts in a hurry and make a lot of mistakes. The next level you have to work on is that once you have words, you want to annotate them. The reason is that all machine learning methods use features, and it is very important to give them the right morphological features, or, if you use term lists, to recognize that penicillin is an antibiotic corresponding to a certain code, or to know that the liver is inside the abdomen. These are the kinds of annotations you have to put on the text, and once they are in place, it is much easier to build the text segmentation and fact extraction algorithms. The last general thing to consider is how you solve the task. There are three basic ways: knowledge-based methods, supervised machine learning, and unsupervised machine learning. Knowledge-based methods are structured in the following way: you have external knowledge, say a list of all possible analytes, or all possible drugs, or all possible body parts, and you just recognize them in the text; or you have certain rules for how one writes down the phrase that a patient smokes. The phrase might be that the patient smokes one pack of cigarettes, smokes heavily, and so on, but you can actually sit down and come up with a description that covers all the common ways of saying that the patient smokes.
These methods are actually surprisingly efficient, because you do not need many resources. You have to have some sort of lexicons, ontologies and standards, but you need all of this anyway in order to do the data cleaning step on top of the fact extraction. So usually this is easy to achieve, and in many cases you can actually stop here; it gives you good enough results. But sometimes this is not the case. Say you want to do tasks that are a bit fuzzy. Measurement extraction is easy because the phrases have a very specific structure, but the fact that a patient has a high stress level, or has some problem, is something for which you cannot write down all the possible ways it is described in text. In this case you have to do supervised machine learning, where you have labeled text and you train a method. The problem is that you have to have text with annotations: you need at least a thousand or more positive examples, and, as I said, since positive mentions of a fact are rare, this means you need to look through many, many texts, maybe even 50,000, and this is a significant burden. Yes, you need GPU or CPU time, but that is cheap compared to annotating texts. The way out is to use unsupervised machine learning. The idea is that you use unlabeled data, the raw text, to learn some sort of general model, and then you use a few annotated texts to fine-tune that model to predict the thing you want. Training the general model requires a lot of GPU time, but once you have trained it, the fine-tuning for a specific task requires little. This is currently the holy grail of doing fact extraction. And there are three concepts here to remember.
The first and oldest one is word embeddings. The idea is that you take a word and associate a vector with it, and if you do it cleverly, like with word2vec, this vector captures the semantic meaning of the word, and based on that you can do quite a lot. The next generation is the idea of transformers, where the idea is slightly different. The problem with word embeddings is that each word has only one vector, but words can have different meanings depending on the context. Transformers assign a different embedding based on the context; these are context-sensitive embeddings. The two most famous right now are GPT-3, which is not freely available, and BERT, which is available and has many versions, including medical ones, though I do not know whether those are all available. But this is where you could start. Another thing you can do, when you are not directly predicting whether a fact is in a sentence or not, is similarity scoring: you have certain seed phrases about smoking and you would like to know what other, similar phrases could be related to smoking. For that you use similarity scoring, and the thing to remember here is the word mover's distance, which allows you to find similar phrases. Okay, now we have reached the second block of my talk, which is more practically oriented: I will give an overview of the tools you can use to do fact extraction, starting from the simplest ones. Those are the rule-based, knowledge-based methods. One very important point: when you do a fact extraction task and there are existing standards you can use, you should collect all of them and try to use them.
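To make the similarity-scoring idea from a moment ago concrete, here is a minimal cosine-similarity sketch over toy three-dimensional "embeddings". The vectors are made-up numbers; a real pipeline would use word2vec or BERT vectors with hundreds of dimensions.

```python
import math

# Toy vectors standing in for learned embeddings (values are assumptions).
VECS = {
    "smokes":     [0.9, 0.1, 0.0],
    "cigarettes": [0.8, 0.2, 0.1],
    "aspirin":    [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

With real embeddings, ranking candidate phrases by similarity to a seed phrase like "patient smokes" is exactly this computation, applied phrase by phrase (word mover's distance refines it by matching words between phrases).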
If you work with diseases, there is ICD-10, used in Estonia, and there is SNOMED, which is another one; for lab measurements there is a standard, for drug names there is a standard, for anatomy and body parts there are standards. You should collect all of those standards and extract the terms from them. By doing that, you get an initial term list that can be used on the text. In real life, this is not how doctors actually write: they misspell the terms or use different terms. So what you do is take this initial term list and unlabeled text, do some sort of fuzzy matching, and get an extended term list; then you prune out the false positives and similar junk and get a final, curated term list. That is the list you match against the text, and it allows you to find, say, all mentions of drugs or all mentions of measurement analytes. This is already a powerful thing, because you can do text segmentation based on it: you take a match, take 100 or 200 characters to the left and 200 characters to the right, extract that text, and it is very probable that it will contain a measurement, or, in the case of drugs, the doctor's prescription or something else about the drug. Okay, the other important thing is regular expressions. You cannot write a list of all possible dates or all possible numbers, because there are infinitely many of them. Instead, you use regular expressions to fix the format.
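The term-list extension step described above can be sketched with standard-library fuzzy matching; the similarity cutoff of 0.8 is an assumption you would tune on your own data.

```python
import difflib

def extend_term_list(initial_terms, observed_words, cutoff=0.8):
    """Add observed words that fuzzily match a known term to the list."""
    extended = set(initial_terms)
    for word in observed_words:
        if difflib.get_close_matches(word, list(initial_terms),
                                     n=1, cutoff=cutoff):
            extended.add(word)
    return extended
```

A misspelling like "kolesterol" gets pulled in as a variant of "kolesterool", while an unrelated word is left out; the pruning of false positives would then be done by hand on the extended list.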
Say, for a date: two digits, followed by another two digits, followed by four digits, with dots in between; and you can specify a number of such formats. This description language is quite powerful and you can do all sorts of wonderful things with it, but mainly you want to capture numbers, dates, some special symbols and some headers. What you should not do is use regular expressions to capture variations of the terms. This is for a practical reason: matching against a long term list is much more efficient than matching against regular expressions. There are libraries which match a long term list very fast, whereas a long list of regular expressions will be slow, and sometimes, if you are not careful, you will write regular expressions that do not do what you intend. For example, say you intuitively want to match A, or AB, or AC. You can write the alternation in two ways, and depending on the ordering it does different things: with the shortest alternative first, the engine matches only the A and never reaches AB, while with the longest alternatives first it tries AC, then AB, then A, and so matches all of them. Getting this wrong creates a lot of confusion. Now, once you have set up the term lists and the regular expressions, there are usually surprises: you miss some way of writing a date, or a number, or a header. If you are not careful, you just patch your regular expression and move on, but this can lead to a problem: changing a regular expression can change how it behaves elsewhere, and you might get a regression in the sense that you break matching of certain other formats.
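The alternation-ordering pitfall is easy to demonstrate in Python, whose regex engine commits to the first alternative that matches at the current position:

```python
import re

# Same alternatives, different order: the order decides what gets matched.
short_first = re.compile(r"a|ab|ac")   # "a" always wins
long_first = re.compile(r"ac|ab|a")    # longest alternatives tried first

m1 = short_first.match("ab")  # stops after matching just "a"
m2 = long_first.match("ab")   # matches the full "ab"
```

This is why a patched regex needs regression tests: reordering or adding alternatives silently changes which strings are matched and how much of each string is consumed.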
The right way is that each time you see a problem, you extract the corresponding text as a test case and always check that you are still able to match it. If you do that, you get documentation of why the regular expression looks the way it does, and you get gradual, safe improvement. Another point is that you should not rewrite regular expressions from scratch each time; you should have some sort of common library which you keep updating. The third way to do rule-based extraction is using a grammar. Essentially, a grammar gives you a list of rules that describe phrases, or rather describes how to combine words into phrases. Let us see what that means. For measurement extraction, let us go back and look at the example. We have these things here, and we know, say, that this is a number, this is a unit and this is a date. What the phrase grammar specifies is how to combine those into a measurement. The way you read it is: if you have a number and a unit, you can combine them into a quantified number. That is one rule in the grammar. Another rule is: if you have a date, an analyte and a quantified number, you get a measurement. With those two rules you can match quite a lot of measurements, but sometimes people do not write a date, so you should also be able to match an analyte plus a quantified number as a measurement. Sometimes people forget the units, so a date, an analyte and a bare number is also a measurement. In the extreme case, just an analyte and a number should also correspond to a measurement. In a way, what you see is that you have annotations on the text, you combine them, and you get a new annotation on top.
Obviously we are interested in measurements, so that is the final symbol we are after, the phrase we want to build. But sometimes you do not want only one kind of outcome: it is not only measurements, sometimes you would like to extract the quantified numbers as well, so there can be more than one final symbol. The other thing is that we have many rules here, and we have to fix an ordering in which the rules apply: which rule is tried first, which second, and so on. That is rule priority. Let us see why it is necessary. Say I have this canonical phrase, which starts with a date, which I recognize, and I am checking for an analyte and a quantified number. Here I have a quantified number: first I see that there is a number and a unit, so I treat that block as a quantified number. I see that in front of it there is an analyte and a date, and therefore I know this is a complete match and it must be a measurement. But sometimes I have an incomplete phrase that I also have to match, and I have the rule that an analyte followed by a number can be declared a measurement. Now, if I have no ordering, then inside the complete phrase the bare analyte and number also match, so that rule would also fire there, but I usually want the maximal match. Priority resolves this by saying: first try to combine the full pattern, and only if that fails, try the other rules, which are less precise. So if I cannot match date, analyte and quantified number, then I try analyte and quantified number; that matches this one but not that one. So I tried the first, complete rule and it failed; the second rule failed; so let us try the next.
We try date, analyte and number; that matches this one, and for the last one all three rules fail and I have to use the fourth rule. This is how these grammars work. Just to mention: normally, when computer science people write grammars, they write the rules in the other direction. The meaning is the same; here I have written how you combine things, and if you reverse the direction, the rules describe how you generate text. But since we are working with existing text and trying to recognize it, this is the way to look at it. Another thing to consider is that at the end of the day you do not just want a match, you want to decorate it. So another thing you can add to the rules is that, when you combine elements, you extract attributes. If you combine the date, the analyte and the quantified number, you extract the date from the first, the analyte from the second, and the value and the unit from the third. For incomplete phrases you obviously cannot extract all the attributes, only some of them. So when you specify a phrase grammar there are two things to define: what the correct phrases are, and what information, what annotations, sit on top of each phrase. With this kind of grammar it is quite easy to propagate the information upward, and you avoid the problem where you can detect a phrase but cannot extract what is inside it, because the annotations take care of that. There are two other interesting aspects to consider. Sometimes you try to match in a text and you get partial matches: the full rule does not fire and you have, say, only quantified numbers. It is really important to look at those, because near them there are usually terms or analyte names that were not in your initial list.
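The priority-ordered rule application just described can be sketched as a flat table of patterns tried most-specific first; the label names and flattened rule shapes are illustrative simplifications of the cascaded grammar.

```python
# Rules in priority order: the first pattern that matches the front of the
# token-label sequence wins, so the maximal match is always preferred.
RULES = [
    ("MEASUREMENT", ["DATE", "ANALYTE", "NUMBER", "UNIT"]),
    ("MEASUREMENT", ["ANALYTE", "NUMBER", "UNIT"]),
    ("MEASUREMENT", ["DATE", "ANALYTE", "NUMBER"]),
    ("MEASUREMENT", ["ANALYTE", "NUMBER"]),
]

def match_rules(labels):
    """Return (phrase name, tokens consumed) for the highest-priority match."""
    for name, pattern in RULES:
        if labels[:len(pattern)] == pattern:
            return name, len(pattern)
    return None, 0
```

Because the full four-label rule is tried first, a complete phrase is never carved up by the looser two-label rule, which only fires when the more precise rules have failed.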
So just looking at why we sometimes parse only half of a phrase allows you to improve the term lists. Yet another thing to consider is that computer science people are usually very religious about these things. They say that if you have a number and a unit, it always has to become a quantified number. But if you are doing this in practice, you would like to keep the set of rules very simple, because if you have a hundred rules, you quickly lose track of what they mean. Therefore the rules are not precise, in the sense that they usually match more than is intended, and this way you might get recognized measurements which are not measurements. The way to counter that is to add additional checks which are not in the grammar. Say you have an analyte and a number, and the rule says you should combine them. But suppose you know all the plausible values of cholesterol measurements, and you see that this value of 2000 is too much. Then you can abort the combination and say: although I have an analyte and a number here, this does not correspond to a measurement, because the parts are incompatible. If you use this idea of additional checks, you get much, much simpler grammars, where the compatibility logic lives not inside the grammar but inside these additional functions, and you can actually use machine learning to decide whether two pieces can be combined or not. The last thing, which has been a big headache for us, is tokenization. Usually when people talk about tokenization, they say: you tokenize the text into small chunks, then you try to combine those chunks, and that's it. But that is not the case in practice, in the sense that the tokenization does not have to be unique.
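A minimal sketch of the additional-checks idea: the grammar rule proposes combining an analyte with a number, and a plausibility test outside the grammar may veto the combination. The ranges below are invented for illustration, not clinical reference values.

```python
# A check that lives outside the grammar: veto implausible combinations.

PLAUSIBLE_RANGE = {
    "cholesterol": (0.5, 15.0),   # hypothetical mmol/L bounds
    "glucose": (1.0, 50.0),
}

def combine_if_compatible(analyte, value):
    """Return a measurement, or None when the extra check aborts it."""
    lo, hi = PLAUSIBLE_RANGE.get(analyte, (float("-inf"), float("inf")))
    if not lo <= value <= hi:
        return None   # grammar matched, but analyte and value are incompatible
    return {"analyte": analyte, "value": value}

print(combine_if_compatible("cholesterol", 5.2))    # plausible: combined
print(combine_if_compatible("cholesterol", 2000))   # 2000 is too much: vetoed
```

The grammar itself stays simple; the `combine_if_compatible` hook is the place where a learned compatibility model could later replace the hard-coded range.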
If you look at this example, you have cholesterol, then something like 21.05 2021, and then a number. You could interpret that either as a date with a dot missing, or as a measurement value. So there are two ways to read it, and you don't know a priori which is correct. Looking at the wider context, you can see that one reading is probably incorrect, because it is very unlikely that a cholesterol measurement is followed by a number, another number, another number, and then another analyte and measurement token. So you can infer that the first tokenization is the correct one, but you can't do that just by looking at the chunk itself; you have to take the wider context into account. What this means is that at the end of the tokenization phase you don't have a unique tokenization: you can have many candidate tokenizations, and you have to decide later which of them is correct. This is a problem, because common parsing algorithms can't handle ambiguous tokenization. So when you have this problem, you have to roll out your own parser which can handle ambiguous tokenization and disambiguate between tokenizations based on the grammar itself. Okay, that finishes the short overview of rule-based methods. They are really easy to apply, and you get decent progress very quickly: you can get most of the measurements out, you don't need labeling, and you get a good baseline. But the problem is that going further from this baseline is really hard, and this is where they fail, because writing very complex rules is hard.
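The ambiguous-tokenization problem above can be sketched like this. The regular expressions and the sample chunk are invented for illustration; the point is that the tokenizer emits every plausible reading and leaves the choice to the parser.

```python
import re

# One chunk, two readings: a date with its last dot missing, or a
# plain decimal number.

DATE = re.compile(r"^\d{2}\.\d{2}\.?\d{4}$")    # tolerates one missing dot
NUMBER = re.compile(r"^\d+(\.\d+)?$")

def candidate_tokenizations(chunk):
    """Return every plausible reading; the parser (or wider context)
    has to pick between them later."""
    readings = []
    if DATE.match(chunk):
        readings.append(("DATE", chunk))
    if NUMBER.match(chunk):
        readings.append(("NUMBER", chunk))
    return readings

print(candidate_tokenizations("21.052021"))   # both readings survive
print(candidate_tokenizations("mmol/L"))      # neither pattern matches
```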
It is almost impossible to write a correct phrase grammar for all the ways a doctor can write down that a patient smokes. That is why people usually turn to machine learning methods, and they typically start with supervised ones. The easiest is the linear classifier idea. How does it work? Essentially, a linear classifier derives these rules for you implicitly. Say you have indicator variables: one tells you whether the term "cholesterol" occurs in the text, another whether the term "HDL", which is a type of cholesterol, occurs, another whether "LDL" occurs. These are zero-one variables, and you can come up with a linear rule that says: take one times cholesterol plus one times HDL plus one times LDL and subtract two. If this sum is non-negative, the text contains a cholesterol measurement; if it is below zero, it doesn't. That is one possible linear classification rule. If you think about what it actually tells you: it says that cholesterol and HDL occur in the phrase together, or one of the other two-term combinations holds, based on this threshold. The nice thing about using a linear classifier here is that if you have enough examples, it derives this set of rules implicitly. There are obviously several ways to do linear classification, and the support vector machine is usually the best one to use, because it is statistically stable, but you can use logistic regression as well. The thing that makes the support vector machine stand out is that you can use it to introduce non-linearity.
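The linear rule above, written out as code. This is a hand-set rule rather than a trained classifier, and it takes a sum of exactly zero as a positive answer, since any two of the three terms should fire the rule.

```python
# 1*cholesterol + 1*HDL + 1*LDL - 2 over 0/1 indicator features.

def contains_cholesterol_measurement(text):
    t = text.lower()
    features = {term: int(term in t) for term in ("cholesterol", "hdl", "ldl")}
    score = (1 * features["cholesterol"]
             + 1 * features["hdl"]
             + 1 * features["ldl"]
             - 2)
    return score >= 0   # at least two of the three terms are present

print(contains_cholesterol_measurement("cholesterol 5.2, HDL 1.4"))   # True
print(contains_cholesterol_measurement("patient complained of nausea"))  # False
```

A trained linear classifier (SVM, logistic regression) would learn the weights and the threshold from labeled examples instead of having them written in by hand.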
Because if you look at this rule, it can only observe that cholesterol and HDL occur together somewhere in the phrase; it can't capture that cholesterol and HDL are close together, or more complex relations. The idea of introducing non-linearity through feature maps and kernels is that it lets you combine these basic facts in a non-linear way and get more complex decisions: not just rules like these, but ones where ANDs and ORs are combined in more complex ways. Okay. Now, to set this up in practice, there are two modes in which you can operate: either you decide for a whole text whether it contains a cholesterol measurement or not, or you want to know where it occurs. The second mode is the one usually used. If you do it that way, you have a list of tokens, and for each token you have to decide whether it is part of a measurement or not. To do that, you have to use information from the nearby context, so you usually fix a window: say, two words to the left and two words to the right. Then for each of those words you add features. In the early days people did this manually. You have a term list; you see that this word is "cholesterol"; cholesterol is an analyte; so "there is an analyte two words to the left" is one feature, and "there is an analyte two words to the right" could be another. You could have a phrase list as well, and morphological features, for instance that this word seems to be a noun and not a verb. Based on that, you cook up a set of features. How to define those features is not trivial; there are many ways you can build them.
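A sketch of such hand-crafted window features. The term list and the feature names are hypothetical; a real system would have many term lists and many feature templates.

```python
# For each token, look a fixed number of words left and right and record
# term-list membership as binary features.

ANALYTES = {"cholesterol", "hdl", "ldl"}

def window_features(tokens, i, width=2):
    """Features for tokens[i], drawn from a +/- `width` word window."""
    feats = {}
    for offset in range(-width, width + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            if tokens[j].lower() in ANALYTES:
                feats[f"analyte_at_{offset:+d}"] = 1
    return feats

tokens = ["cholesterol", "is", "5.2", "mmol/L", "today"]
print(window_features(tokens, 2))   # features for the token "5.2"
```

Here the token "5.2" gets the feature "there is an analyte two words to the left", which is exactly the kind of evidence the classifier needs to call it part of a measurement.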
You can say that here you have a cholesterol and a date together on the left, and so on, but your success is determined by how well you define these features. In practice we did this feature design for, say, named entity recognition, to recognize person names and organizations; those features were quite complex, and how we arrived at them was very fuzzy. So usually you want to get rid of that: you really don't want to do manual feature construction, because you don't really know how to do it. The way out is word embeddings. The idea is that you use some sort of word embedder, say word2vec. What word2vec does is take a word and output, depending on your choice, 100 or 1000 informative features for that word. So you have a word and you get a vector out of it, and those features carry some sort of semantic information: they somehow encode that cholesterol is an analyte, that it is a noun, and other things. The nice thing is that this is completely automatic. To do it, you take a large set of texts and run the word2vec algorithm on top of them with the right configuration; then you have this feature map, and with this representation you can run the support vector machine on top of the embeddings. What this buys you is that you no longer need those handcrafted word-based features; they arise automatically from the embedding. But there is one thing you can't ignore: you still have to decide how many symbols to the left and to the right you look. This is non-trivial, and sometimes the important information is really, really far apart. If your window is too small, that information doesn't fit into it, and you make wrong decisions. Another thing is that although you can train such an embedding, it is essentially a dictionary.
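A sketch of the embedding-plus-window setup. The three-dimensional "embedding table" below is invented for the example (real word2vec vectors have 100 to 1000 dimensions); the concatenated window vector is what a downstream SVM would be trained on.

```python
# Replace hand-crafted features with embedding look-ups.

EMB = {
    "cholesterol": [0.9, 0.1, 0.0],
    "5.2":         [0.0, 0.8, 0.1],
    "mmol/L":      [0.1, 0.0, 0.9],
}
UNK = [0.0, 0.0, 0.0]   # out-of-dictionary words map to an "unknown" vector

def window_vector(tokens, i, width=1):
    """Concatenate embeddings of tokens[i-width .. i+width]; this
    fixed-size vector is the classifier input for tokens[i]."""
    vec = []
    for j in range(i - width, i + width + 1):
        word = tokens[j] if 0 <= j < len(tokens) else None
        vec.extend(EMB.get(word, UNK))
    return vec

print(window_vector(["cholesterol", "5.2", "mmol/L"], 1))
```

Note the two limitations discussed in the text are visible here: the window `width` is still a hand-picked parameter, and any word outside the dictionary collapses to the same `UNK` vector.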
There is a finite number of words for which the embedding is known, usually something like 50,000, but in medical text there are more words and misspellings, so you somehow have to decide which words you give up on and declare unknown. Another problem with word embeddings is that they ignore the fact that a word can have multiple meanings. In English there is the noun-verb problem: the same surface form can be a noun or a verb, the context tells you which, and knowing which it is actually matters, but a dictionary gives both the same vector. The way these embedders are trained is a clever thing, and the idea is quite old. You take unlabeled text and you somehow want to learn those 100 to 1000 features. What you do is convert the unlabeled text into labeled text: take a sentence, delete one word, and you now have a prediction task, namely, what was that word? Since you know the answer, you very easily get a huge training set, and that is why this works. Okay, but there is one other thing I forgot to mention about fact extraction. Say we do this prediction with a fixed window; that means I make a prediction separately for each element. I say that this token is part of a measurement; then I decide whether the next token is part of a measurement, and so on, but all of those decisions are made independently. This can lead to a problem. Say I decide: this token is part of a cholesterol measurement, this one is not part of a measurement, this one is, and this one is. And here I have a prediction that one token is part of an HDL measurement, obviously, because it contains "HDL".
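The trick of turning unlabeled text into labeled training data can be sketched as follows; the function and variable names are made up for the example.

```python
import random

# Delete one word from an unlabeled sentence; the deleted word becomes
# the prediction target, so labels come for free.

def make_training_pair(sentence, rng):
    words = sentence.split()
    i = rng.randrange(len(words))              # pick a word to hide
    masked = words[:i] + ["[MASK]"] + words[i + 1:]
    return " ".join(masked), words[i]          # (input with a hole, answer)

rng = random.Random(0)
x, y = make_training_pair("the cholesterol level was normal", rng)
print(x, "->", y)
```

Applied to a large corpus, this yields an essentially unlimited training set without any manual labeling, which is what makes embedding training feasible.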
And the next one is also an HDL measurement, because HDL is nearby and it has the right unit, but for some odd reason I have decided that the token in between is actually part of a cholesterol measurement. That is not reasonable, because it is very unlikely that inside one measurement there is another measurement. What is wrong here is essentially that the output is not consistent, because I made the predictions individually. The way to handle this is to have a separate rule about measurement consistency. Essentially, you say that measurements are single contiguous blocks of text, and formally you model that with a Markov random field. It takes a labeling and gives you a probability, telling you that this inconsistent labeling is very unlikely, while the labeling where the middle token is also part of the HDL measurement is more likely, and if I flipped the other stray label as well, it would be more likely still. Now, if I have the SVM predictions and a Markov random field doing this consistency checking, I can combine the two and improve my prediction: this is my original labeling, but if I flip this one label, the whole thing becomes more probable. This is the idea behind conditional random fields. They train two things simultaneously: first, the consistency model over labels, which is the Markov random field part, and second, the individual token predictions. They combine the two, and can therefore detect errors which go undetected if you make the predictions individually. But in practice the difference is not so big. The reason is that the consistency requirements are quite simple; there is no very complex structure to learn. So this is one thing.
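The consistency idea can be caricatured with a hard smoothing rule: a real conditional random field learns soft transition scores and combines them with the per-token predictions, but the effect on a stray label is the same.

```python
# Independent per-token predictions may leave one stray label inside an
# otherwise uniform block; flip a token whose two neighbours agree.

def smooth(labels):
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]   # lone inconsistent label: flip it
    return out

# per-token SVM output: a "CHOL" prediction stranded inside an HDL block
pred = ["O", "HDL", "CHOL", "HDL", "O"]
print(smooth(pred))
```

This turns the stranded "CHOL" into "HDL", restoring the invariant that a measurement is one contiguous block of one label.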
By looking at the labels alone, it is quite hard to detect that I have consistently mislabeled something, say cholesterol as something else, so the model can't catch those errors. The other reason is that if you have, say, an SVM or some other predictor for the individual tokens, then although the predictions are made independently, they reuse the same data and are therefore correlated, and inconsistencies like the one above are very unlikely in the first place. So when you are deciding whether to use individual predictions or conditional random fields: the conditional random fields are slightly better, but if for technical reasons you can't apply them, because they don't work with your data or data format, you don't lose much by using support vector machine predictions or some other linear predictor. Okay, there is one last thing to consider here: I still have the problem of the window size. This is a problem because if I choose the wrong window size and it is too large, I can't train; the training doesn't converge and it overfits. And if the window is too small, important information is left out. There is a further advancement which takes care of this: context-sensitive word embeddings. Here, instead of a dictionary, I have a neural network, usually a transformer-type network. It takes, say, a sentence, passes over it, and assigns informative features to each word, but now those features depend on where the word occurs, what the surrounding words are, and so on. The way this is trained is still the same: you have the masked language modeling task, where you drop some word from a sentence and try to predict what it was.
By doing this you get a really huge training set, which allows you to train a very complex neural network, and this takes a lot of time; I think the Estonian BERT model, which is one of these networks, trained for about a week. Then again, when you train a neural network you have the problem that you have to fix hyperparameters, and if you set them incorrectly, nothing converges and you get wrong answers, so it can take many months to get it right. But one cool thing is that once you have finally nailed it, you can use that model for many, many tasks. The transformers also solve another issue: you no longer have the window limitation, because the embeddings are assigned based on the entire sentence, so information from any part of the sentence can be used. And they resolve the second issue too: a word gets different representations in different contexts. The way you use this is that you take the context-sensitive embeddings and build another classifier on top of them. If you do it smartly, you also adjust the neural network itself during training, so it no longer outputs exactly the same embeddings but slightly tuned ones. This works surprisingly well, but there are still problems: you can't capture dark background knowledge. The neural network can learn only things which are inside the text itself; there is no way it can learn ATC codes for drugs, or codes for analytes, if they are not in the text. So you still have to bring in this background information. Okay, that gives you an overview of how you can solve these problems. There is one last thing I want to mention, really briefly: how you improve your results. There are three main sources of improvement once you have a system.
First of all, you should always work with your term lists and ontologies, and whatever you do, you have to version them, so that you can make small improvements and somehow communicate and document them. If somebody looks at them a couple of years later, they should be able to understand why some terms were added or deleted. This can give you very big improvements, because the term lists define your features and carry the background information which is usually not inside the text; this is the information you enrich the text with. The second point is also obvious: when you do this extraction, you have a version one extractor, a version two extractor, and so on. If it is a long project, you have to actually measure whether what you are doing makes sense. So you have to create dedicated test sets for each isolated problem: if you have, say, problems with cholesterol measurements, you make a separate test set just to isolate that problem. And the third, obvious thing is that after the extraction you have to have validation routines which check whether whatever you extracted from the data is internally consistent. This allows you to discover certain types of errors and improves the outcome. For example, we extracted measurements, and I think the initial version was quite good, but then we checked certain invariants, stating that if you have extracted a measurement, it has to have an analyte inside it; otherwise it is just a number. There was a significant proportion of extractions with only a number, and by looking at those and tracing them back, we found a slight shift in some patterns which caused mismatches during the extraction, and so on.
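A minimal sketch of such a validation routine, with invented extraction results. The invariant is the one from the talk: an extracted measurement must contain an analyte, otherwise it is just a number.

```python
# Post-extraction validation: flag measurements violating an invariant
# and report the violation rate. A high rate hints at a shifted pattern.

def validate(measurements):
    violations = [m for m in measurements if not m.get("analyte")]
    rate = len(violations) / len(measurements)
    return violations, rate

extracted = [
    {"analyte": "cholesterol", "value": 5.2},
    {"analyte": None, "value": 7.1},          # invariant violated
    {"analyte": "HDL", "value": 1.4},
    {"value": 2.3},                           # analyte missing entirely
]
bad, rate = validate(extracted)
print(f"{len(bad)} violations, rate {rate:.2f}")
```

Tracing the flagged records back to their source texts is what reveals the broken pattern or the missing term, as in the example above.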
So in general, you should check that the outcomes are correct, and any other kind of statistical anomaly test is useful here. Another thing to note is the problem of diminishing returns. Say you have some extraction system, you measure accuracy, you use a certain algorithm, and you have trained it on a training set of size 100. You can measure two things here: the training set accuracy, which is an overestimate of the actual accuracy, and the test set accuracy; for the training set of size 100, say both fluctuate around 75%. Now, if the training set were smaller, the test set accuracy would obviously be lower, because you can't learn everything from few examples, while the training set accuracy would be higher, because with a smaller set you can overfit the data. And so on: you can run a couple of experiments and see how the two curves progress. The test set accuracy grows and then flattens out, the training set accuracy behaves correspondingly, and when you have enough data the two will meet; the place where they meet gives you the maximal performance of this type of algorithm. By looking at these curves you can do two things. First, you can ask: if I label twice as much data, how much do I gain? If the answer is less than a percent, it doesn't make sense. This is the main thing: knowing how much advantage labeling additional samples gives you. If you are on the steep part of the curve, you should obviously label more; but if you are on the flat part, it tells you that you should do something different, because there is no reason to label more.
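The diminishing-returns check can be sketched as follows; the accuracy numbers are invented to mimic the curves described above.

```python
# Decide whether labelling more data is worth it from the learning curve.

curve = {  # training-set size -> (training accuracy, test accuracy)
    25:  (0.92, 0.600),
    50:  (0.88, 0.700),
    100: (0.80, 0.745),
    200: (0.78, 0.752),
}

def gain_from_doubling(curve, size):
    """Test-accuracy gained by doubling the labelled training set."""
    return curve[2 * size][1] - curve[size][1]

print(f"{gain_from_doubling(curve, 50):.3f}")   # steep part: keep labelling
print(f"{gain_from_doubling(curve, 100):.3f}")  # under 1%: change approach
```

At size 50 doubling still buys several percentage points; at size 100 it buys less than one, which is the signal to work on features or methods instead of labels.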
You should then try to work with the features, or try different methods, or something else. This is an important thing to remember. The last thing, which people rarely talk about, is that whenever you see precision and accuracy measurements, most of them are not very reliable. The problem is that you need a really, really large test set to measure accurately. For instance, if you want to estimate precision to within 1%, you need, in the worst case, really many measurements; with 1,000 measurements, the measurement error is around 3%. This is a problem, and you can get past it with a different approach. What the problem means is that since you can't estimate the precision or accuracy to better than a few percent, you can't see whether you are actually making progress, because the improvement might be smaller than the variance of the test error. The way out is relative performance measurement. You have a baseline classifier which is reasonably close, and your new method; you evaluate both on unlabeled data and look only at the cases where they differ. You select, say, 100 or 1,000 of those differences and manually check which of the two is correct; if both are incorrect, you label them as incorrect. Then you can measure two things: the improvement ratio, that is, how often the new algorithm was correct where the old one was not, and how frequently those differences arise. From those two factors you can quite easily measure how much better the new algorithm is in absolute terms, and this estimate is much, much more precise than what you would obtain from a labeled test set of that size. The price is that every time you do this, you have to do manual labeling. But this is the last aspect I wanted to consider here.
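A sketch of the relative-measurement arithmetic: this is one plausible way to combine the two factors mentioned above into an absolute estimate, and all counts are invented.

```python
# Compare a baseline and a new extractor only where they disagree.

def absolute_improvement(disagree_rate, new_wins, old_wins, n_checked):
    """Estimate the change in absolute accuracy from two ingredients:
    how often the methods disagree, and who wins among the hand-checked
    disagreements (cases where both were wrong cancel out)."""
    net_win_rate = (new_wins - old_wins) / n_checked
    return disagree_rate * net_win_rate

# methods disagree on 2% of phrases; of 1,000 hand-checked disagreements
# the new method was correct in 700 and the baseline in 250
# (in the remaining 50, both were wrong)
print(absolute_improvement(0.02, 700, 250, 1000))
```

Because the manual labels are spent only on the informative 2% of cases, the variance of this estimate is far smaller than labeling the same number of random examples would give.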
So just to recap: you have a lot of methods. You should start with rule-based methods, because they can give you an answer quickly, and even if they don't, they at least let you build up a reasonable training set; only after that should you move on to machine learning. Okay, that would be all from my side.

Thank you very much, Sven, for this very informative and important talk. Are there any questions? Someone raised their hand. Diane, you raised your hand. Please, you can ask your question.

Thank you very much. My question is this: in Catherine's introduction, it was mentioned that you also work on cryptography and security. Since medical texts are sensitive data, could you elaborate a bit on how to extract information from medical texts using cryptographic protocols? Is this something that is done? Are there any trends or recommendations?

Okay, this is a completely different question, but I will answer. Fact extraction is computationally very heavy. If you train, say, a BERT-type algorithm, it runs for a month on a computer. Now, if you run it as a privacy-preserving algorithm, you essentially multiply that running time by a factor of a thousand or a million, which tells you that this is infeasible. So fact extraction is something you need to do on the raw data.

So it's mostly a matter of time?

Yes, it's a problem of running time. If plain training takes a month, you are not willing to wait a thousand months to get your results, even if it were much easier that way to get the agreement than to run it on a secure server. That is the issue. There is a way to do it in a privacy-preserving manner if you use hardware: Intel has these secure enclaves, which essentially let you run a computer that is not controlled by you inside your own computer, and you cannot access it.
So in that setup, I think it is possible to do privacy-preserving information extraction, because you can say that inside your computer there is another computer which nobody can access, and do the work there.

But this would require certain legal precedents to be set?

Yes, for sure, but it puts the burden on someone else, on other computers. Okay, thank you very much.

Thank you for this question. Are there any further questions?

Yes, I do have a question. Oh, sorry, go ahead, yeah. Yeah, sorry. First of all, thanks for the talk, it was interesting. You mentioned that after the training and validation, the work doesn't stop. I wanted to know what the steps are, what work is done after the validation, and could you give examples of that?

Okay, let me come back to these things. Say you have a text and you have extracted weights or any other kind of facts. There are several levels of correctness: as I said, whether the fact itself is correct, which is checked as discussed before, and then whether it is correct in its context. So what kind of validations can you do? This is the hard part. Take measurements: in the case of cholesterol, there is a certain range of values with which you can be alive, and the most basic validation is that if the value is outside that range, it is incorrect. That is the simplest one. Another obvious one is to draw a timeline for the patient and look for sudden abrupt changes which indicate something that can't happen: say the cholesterol level increases ten-fold and the next measurement is back at the old level, then that spike is probably an error. These are classic input data validation methods.
If you want to look at the context, meaning you look at the text and want to know whether it gives hints about whether the value is correct, then you have these questions: we have a measurement, so what type is it, how was it measured, and on whom was it measured? This is something that can be determined from the text. For the weight extraction, we have certain detectors which detect whether the text actually mentions that the weight was measured not for the patient themselves but for their child. And there is yet another issue: sometimes you have measurements which carry a different date, telling you that the patient had a particular measurement value ten years ago. So there are many aspects you can layer on top of the basic measurement, and where you stop depends on how much time and resources you have. But I think for practical purposes, basic input data validation, where you just draw histograms and look for outliers, would be okay.

Thank you. Thank you. And then Chavani, please.

Hi, thank you for your talk. I have a general question, actually. I wanted to know whether there are efforts that you are aware of to standardize this type of medical entry, so that it would either simplify these extraction tasks significantly or make it possible to find the information we want directly. I understand that the challenges of something like that are enormous, but are you aware of any effort in that direction?

Okay, let me think. Essentially there is a body of standards here which is quite good. The SNOMED standard is very broad and specifies a lot of things, and if this standard has been translated into the language you are interested in, you can use it.
For the lab measurements you have the LOINC standard, and there is the OMOP data platform, which my co-workers work with, with its common data description language. There are many, many checks and tools that allow you to do things with it. But if you are working very closely with the facts, then you have to roll things out yourself. In practice I have this problem: do we have, for each analyte, a range where it can lie? Right now that is not available. There are some preprints in the OMOP ecosystem, but I am not aware of any very large standardization effort, because it is a very heterogeneous problem; everybody has a different problem.

Yeah, I understand, I suspected it. So thank you.

Great, thanks a lot. And we have time for one last brief question and answer. Lukas, please.

Okay, thank you, Catherine, and thanks a lot, Sven, for the talk. I was wondering, along the same lines as Diane's question on privacy preservation: could you use something like what you showed on slide five, the fact extraction, to detect sensitive data in a text and anonymize it before the fact extraction, as a pre-processing step? So that you could, if not remove, then at least mitigate the problem of handling sensitive data?

Okay. I have a personal belief that you should do the text processing in a safe environment, and after the results have been extracted, you should use them in a privacy-preserving setting, in the sense that it is much easier to work with the raw data and handle those hiccups, and afterwards, when everything is in the right format, do the privacy-preserving part.
But obviously, if you derive those algorithms and can run them in secure computational environments, that is one option: you derive the algorithm, then deploy it securely, make sure it doesn't do anything else, treat it as a black box, get the results out encrypted or in whatever form, and process them later on. That is one way to do it, and I think it is the future.

Okay, thanks a lot.

Great, thanks a lot. And with that we thank Sven again for his talk. We send a round of virtual applause, and we will have a break until 10:30; we will be back at 10:30 sharp. Have a good break. Thank you, everyone.