That's my Twitter ID, and that's my Gmail. If you have any questions after the talk, you can email me. If you have any questions while I'm speaking, you can just put up your hand - I'd much rather have an interactive discussion than a monologue. A few things: I work at GE, but I'll say this - we try to keep this as vendor-neutral and non-product-oriented as possible.

Ever since the dawn of humanity, people have been trying to see what is inside the human body. This is a picture of a gentleman called René Laennec, in France, 1820. He's trying to hear what's inside the chest of a person and detect tuberculosis. On his left arm, you see that small little thing: the first stethoscope. This person invented the stethoscope. And I noticed this morning - I don't know whether it's a coincidence - that one of the elephants over here is carrying something like that. I don't know whether that's a stethoscope or not. But that was 1820, and in 200 years we've come quite a bit. This is what we have now: a digital stethoscope. Today, practically all the data that we gather from inside the human body is digital. That is the takeaway. In this session we'll talk about the kinds of digital data we can gather from the human body, the size of that data, and the challenges in handling and managing data at that scale.

I understand that a lot of people in this session are not from a medical imaging background, so I'll take a few minutes to give you an overview of some of the things we're going to talk about - a very basic, super crash course in medical imaging. In medical imaging we have this term, modalities. Modalities are nothing but ways by which you can gather data from inside the human body and present it in a visually, humanly interpretable form.

This is what we call a digital X-ray. It's very similar to the ordinary analog X-ray many of us have had experience with. You have a generator and an X-ray source which shoots X-rays onto the patient lying on the table, and a detector which sits underneath - kind of like the CCD sensor in your digital camera. It captures an image that looks like that: a side cut, or sagittal cut, of the human spine. You can see a little bit of the skull and the mandible over there. This is one of the simplest forms of imaging, and it works on basic physics principles: you shoot X-rays, some are absorbed by the tissues, some are absorbed by the bones, and some pass through. What we need to remember is that this is 2D - you see a single 2D plane of what's inside the human body. Very good for finding fractures, tuberculosis, that kind of thing.

This is another imaging modality called computed tomography, or CT. It works on a very similar principle to the X-ray, but the difference is that it's 3D. A rotating X-ray source goes around your body while you're lying on the table, taking shots from different angles, which you can compose together to build up a 3D volume. That's what a 3D volume looks like with color maps and textures applied.

And then we have magnetic resonance imaging, which uses the magnetic spin properties of the water molecules inside the human body.
MRI is good for imaging the soft tissues inside the body - not so good for bones. If you look over there, this is a picture of a ligament in somebody's knee, getting compressed between these two bones.

Then we have another modality called PET, which stands for Positron Emission Tomography. This one is very interesting: it does not really show you the structures of the human body per se. Instead, we inject a radioactive biomarker into the body, and the biomarker has the property of latching on to cancer cells. When it latches on, it undergoes positron decay, and the annihilation of those positrons sends photons out of the body. The scanner has a whole ring inside it which catches and counts these photons, and from that it works out where the cancerous tissue is. We then superimpose that on a CT image, which is very good at showing the structures - the bones - of the human body, to find out exactly where the problem is. The red areas, or dark areas, that you see over here are known as hot spots. So you can find out if there is an occurrence of cancer or, very importantly, a relapse of cancer. Many a time a patient comes in for cancer treatment, goes back home completely cured - the cancer was, say, somewhere in the shoulder - and 18 months later finds out that there's a relapse of a different kind of cancer in the abdomen, a completely unrelated body part. So PET scans are good: you scan the whole human body, 1 to 1.2, 1.3 meters of it, and you can find cancerous cells wherever they may be growing.

So the basic fundamental of imaging is this - what we call the basic imaging chain. You have the human body, and you have what's called a digital acquisition system: an electromechanical and software system which picks up signals from the human body. It can be a change of magnetic spin conveyed over radio frequency, it can be simple X-rays passing through the body, it can be those photon counts. Essentially, something changes in the physics inside the human body, and we use some kind of detector to catch that change and extract information from it.

Now, the information captured by this digital acquisition system is nothing that a human eye or brain can directly understand - photon counts coming from the human body, for example, mean nothing to us. And these signals are mostly not even in a visual XY or XYZ plane; they don't live on a three-dimensional visual axis. So we construct an image from these signals, which is a whole different process that we're not going to get into. Essentially, what I want to convey is: signals come in from the human body, and we construct images out of them. Then we add patient information. Obviously, medical imaging makes no sense without patient context, so we gather that context, either from a database or typed in at the scan console: the name of the patient, the age, the weight.
We add the patient context, and then it all goes onto a viewer, where finally humans can understand it: they can look inside the human body and diagnose the problem. The crux of this slide is that the fancier your digital acquisition system, the larger your data set. It's the same principle as the megapixel count on your digital camera: the more pixels, the bigger the raw file.

So let's get into how big this image data really is. The data is grouped into two parts. One is the header information, which primarily holds facts about the patient: who referred the patient, what kind of scan it is, the age of the patient - mostly text data. The other is the pixel information, the actual image you see on the computer screen. The text data hovers around a few KB. The pixel data, assuming a 1024 by 1024 image with 16 bits per pixel, works out to about two megabytes.

Two megabytes is not really huge data, so why do we have a problem? The problem is that these data sets don't come alone. A medical scan acquires a whole series of data sets, each around two megabytes depending on the kind of scanner you have, so you quickly go up orders of magnitude. When you put a human on a table and scan them, you get something like these slices of bread - that is exactly how the images come in, as if you're cutting the person into slices and looking into them that way. This is the nature of most of the data we acquire. This data comes in volumes - we have different medical terms for it: slices, series - but essentially a large amount of medical data comes in per scan.

And that's not all. There are multiple kinds of scans: a CT exam is not just one set of data; it can have four or five different sets based on the parameters the doctor wants to apply. So it multiplies very quickly. These are what we call individual slices - this is a pictorial representation: if you cut the head and look from the top, this is what it looks like. That's your gray matter.

So how big are these data sets? Take the example of a CT cardiac scan: you can have 3,000 images per set, so one data set is about 6 gigabytes, and one exam - one person going to the hospital on one day - can multiply up to 20 or 30 gigabytes. That is the size of the data we're talking about for one person.
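As a quick sanity check on those numbers, here's the arithmetic as a small Python sketch. The 1024-by-1024, 16-bit slice and the 3,000-image cardiac set are the figures quoted above; the four-sets-per-exam multiplier is an illustrative assumption:

```python
# Back-of-the-envelope size of CT data, using the figures from the talk.

BYTES_PER_PIXEL = 2                         # 16 bits per pixel
slice_bytes = 1024 * 1024 * BYTES_PER_PIXEL
print(f"{slice_bytes / 2**20:.1f} MiB per slice")   # -> 2.0 MiB

set_bytes = 3_000 * slice_bytes             # one CT cardiac data set
print(f"{set_bytes / 2**30:.1f} GiB per set")       # -> ~5.9 GiB

exam_bytes = 4 * set_bytes                  # assume ~4 sets per exam
print(f"{exam_bytes / 2**30:.1f} GiB per exam")     # -> ~23 GiB
```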
So then the question comes: how do we keep all of this? How many of you have seen an X-ray film? Some of you have. This is what it looks like. You go to the doctor, and the doctor hands you this X-ray film after taking the X-ray. If it's a CT scan, he's going to compose a film that looks much the same, but with a large number of small images on it - you've got two images over here, but he'll put in something like a 10 by 15 grid of images and give it to you. The problem is, this is not digital. You have to take it, carry it home, and preserve it, and there's no way to keep this film for a long period of time - 10 years, 15 years, you cannot.

With the advent of technology, what we have instead are what we call interchange media: CDs, DVDs. The doctor burns all your digital data onto a CD or DVD and gives it to you, so you can take it to another physician, and that physician can accept the DVD - except that we have some format and protocol problems, which we'll get to.

The bigger question is a medico-legal one. Who is responsible for preserving this data? What happens if it gets lost? Today in India the laws are not that strong, so we do not have good governance over medical data. In other parts of the world, particularly in Europe and the US, the governance of medical data is very strong, and it's getting stronger by the day. We'll get into that in a while.

Fundamentally, if you want to store data and share it, you need a format and a protocol. Think about JPEG files: to send JPEG files around, you first need a format definition that your computer and my computer both understand, and then you need a protocol - it can be HTTP-based, it can be FTP-based, but there has to be one. The fun part about medical imaging is that up until 1993 there was no standard protocol in medical imaging. That's a picture of the Tower of Babel, which goes to show what happens when you don't have a standard way of communicating. It's funny: the X-ray goes back to the early part of the last century, and for around 70 years there was no standard format for capturing, storing, and sharing medical images.

That changed around 1993, when the community started working on a standard, and we got one called DICOM. It stands for Digital Imaging and Communications in Medicine. DICOM deals primarily with the kinds of images I've shown you - the CTs, the MRs, the nuclear medicine cameras, ultrasound, X-ray - and it's now getting into the field of pathology, so if you have a digital microscope and you capture some data that you want to store and share, DICOM covers that too.

It's a pretty complex standard. It's huge - kind of like the Indian Constitution. There is a standard, but there is no standard way of interpreting the standard, because every year 12 or 13 supplements to it come out. So your interpretation of the protocol or the data format may not be the same as mine, and we run into a lot of problems with that when sharing images.

One thing to know is that DICOM is based on TCP/IP, and this is important because we're going to look at some of the problems that brings. As a historical aside, when the standard started out it was not based on TCP/IP, because TCP/IP itself was still forming; there was a physical 50-pin connector that you would run between two machines, with basically a data-link-level protocol to transfer data from one machine to the other.
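To make the format concrete, here's a minimal sketch using the open-source pydicom library (the file name is hypothetical). It shows the two halves of a DICOM object the way I've described them: a small text header and a large pixel payload:

```python
import pydicom

# Read one DICOM file from disk (the path is hypothetical).
ds = pydicom.dcmread("ct_slice_001.dcm")

# Header: a few KB of structured text/numeric metadata.
print(ds.PatientName, ds.PatientAge)   # patient context
print(ds.Modality, ds.StudyDate)       # e.g. "CT", "20120728"

# Pixel data: the ~2 MB image payload, decoded to a numpy array.
pixels = ds.pixel_array
print(pixels.shape, pixels.dtype)      # e.g. (1024, 1024) int16
```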
So let's do a thought exercise on how much data we really need to store these images. What we're going to try to answer is: how much disk space is needed if I want to do a cardiac screening of everyone in Bangalore - a CT cardiac scan, to rule out calcification in the heart, for every person in the city?

This is not purely a thought exercise: there are communities and governments actually trying to do this kind of population health work. Nobody has done it for Bangalore yet, but the question of how much data space you need is a very important one, and it is the crux of the challenges and problems. One caveat: these are back-of-the-envelope calculations, so worry about the order of magnitude, not the exact numbers. Depending on the digital acquisition system you have, the data size could go up or down, but the order of magnitude stays the same.

This is the proposed route map for the Bangalore Metro, and this is the human heart. Assume Bangalore has a population of seven million people. Not everybody needs a CT cardiac scan, so let's take just 50% - roughly the people over the age of 35 or 40 who would want one. Whether it's exactly 50% or 47% or 53% doesn't change the numbers. The disk space required to store one CT cardiac exam is around 20 gigabytes, and that is for one person. So the space we need is around 70,000 terabytes - 70 petabytes - for just the city of Bangalore. That is what a population health exercise for one city would take.

How does that look in scale? Let's put it in perspective. 20 terabytes of data: that's the Library of Congress. AT&T's call records: 312 terabytes, the largest single database in the world as of today, with a few trillion rows. Google Earth - and this is just the land part, we're not even counting the water - Google's merged land imagery database is close to 450 terabytes. And the Google crawl: this figure is a few years old, so the web has probably changed by another 100 terabytes or so, but 850 to 900 terabytes is the size of Google's crawl data. Now put 70,000 terabytes next to those numbers. That's the disk space we'd need for a proper population screening of one city - space we don't have. So we don't do proper population screening.
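Pulling the Bangalore numbers together in one place, a sketch of the arithmetic, with the population, screening fraction, and per-exam size as stated above:

```python
population = 7_000_000          # Bangalore
screened = population * 0.50    # roughly everyone over ~35
gb_per_exam = 20                # one CT cardiac exam, per person

total_gb = screened * gb_per_exam
print(f"{total_gb / 1_000:,.0f} TB")        # -> 70,000 TB
print(f"{total_gb / 1_000_000:.0f} PB")     # -> 70 PB
```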
To ground these numbers in reality: Christian Medical College in Vellore has been keeping track of all the scans they do - they've kept records for the last 10 years - and their data size is around 60 terabytes. There's a healthcare group of 14 hospitals in Israel whose database is growing at 250 terabytes annually. Vellore does half a million exams per year; that entire hospital chain does 4.5 million exams every year. The estimated imaging data size just in the United States in 2014 is around 100 petabytes, and the estimate for 2020 is on the order of an exabyte, which is 10 to the power of 6 terabytes. So it's a lot of space.

Finally, then, there are three challenges in medical imaging. One is archival: how do we store the data? The second is search: now that we've stored the data, how do I retrieve it from that huge database? If you go in for a consultation and the doctor wants to bring back your five years of medical history, including images, it needs to happen in a finite amount of time; you do not want to wait five hours for the studies to come back. And the third is transfer - one of the principal problems: how do we move these huge images from one place to another?

Take this example. You write a blog post and put it up on WordPress. It's a very important post and you don't have a local copy. Three days later you find that the WordPress server has crashed, the database got corrupted, and your blog post is gone for good. What's your reaction going to be? You're going to be mad. But beyond being mad, is there anything you can do as an individual user? You can write another blog post, on Blogspot, about how bad WordPress is - but there's not much else.

It's not the same with medical images. Governments around the world mandate that anybody storing medical images - hospitals, clinics - provide a guarantee that those images will be preserved for a really long time. Here's some data from the National Health Service in the UK. The British government rules that for mothers, you have to store the data for the mother and the child for at least 25 years. So if you know a pregnant mother who has an ultrasound scan, that data needs to be stored for 25 years. If somebody is suffering from Alzheimer's, you need to store their data for at least eight years, even after the patient has died. For children: when a child is born with a condition like a congenital heart defect or neonatal jaundice and scans are taken, you've got to store them until that child is 25.

So how do you provide a guarantee like that? The good news is that the DICOM standard has a protocol called Storage Commitment. It's a protocol by which, when you send an image from one of these modalities - say a CT scanner - to the hospital image server, the server commits that the image has been stored properly, so you can delete it from your local data store; I'll sketch what the sending side looks like in a moment.

Again, huge capacity requirements. This one is particularly interesting: space, and data-center-grade infrastructure. Storing 60 terabytes of data, which CMC Vellore is doing today, is not possible for every clinic and hospital in India. You're not going to get that kind of storage guarantee, or the hardware that lets you say "I will be able to keep this data for a long time". It costs an awful lot of money, and that's why systems which guarantee storage as a service are extremely costly.
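Here's that sending side as a minimal sketch using the open-source pynetdicom library; the archive's address, port, and AE titles are hypothetical. This shows a plain C-STORE push with a status check - the Storage Commitment guarantee itself is a separate, N-ACTION-based DICOM service layered on top of this, in which the archive later confirms it has taken custody of the images:

```python
from pydicom import dcmread
from pynetdicom import AE
from pynetdicom.sop_class import CTImageStorage

# Act as the scanner side (a DICOM "service class user").
ae = AE(ae_title="CT_SCANNER")
ae.add_requested_context(CTImageStorage)

# Associate with the hospital archive (address and port are made up).
assoc = ae.associate("archive.hospital.example", 104, ae_title="ARCHIVE")
if assoc.is_established:
    ds = dcmread("ct_slice_001.dcm")     # hypothetical local file
    status = assoc.send_c_store(ds)      # push the image over TCP/IP
    if status and status.Status == 0x0000:
        print("archive reports success")  # only then free the local copy
    assoc.release()
```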
The second problem is actually finding the data. Once you've stored it, how do you find it? It turns out that search is largely a software problem, and not a very complicated one. The reason is that, at this point, we are not so much interested in the pixel data. The pixel data is so huge that we don't really know what to do with it, even if we had a way of traversing through it. So we index only the metadata - the header information - of patients.

The other part is that the DICOM standard, because of the time at which it was built, is a very structured, schema-based standard, and it loves SQL; you can even see concepts from SQL that have been taken into the standard. So all you need to do is store your header information in SQL tables, while the pixel information is stored in flat files.

Now, this is a pattern. This is something that we at GE Healthcare do across a large number of product lines, and over the years we've found it to be very useful in our situation. That does not mean it is the standard way of doing it; there are organizations which will insert the pixel data as a BLOB into the SQL table, which kills performance. But this is what we do. The good thing is that you get the power of SQL queries over your database: "I want to look at Shourya Sarkar's CT scans from 1st January 2000 to 31st December 2005" is a very simple query. The database is huge, but that's what databases are meant for; they give you the power to find that data very quickly. Insertion is fast, and reading is also fast, because the pixel data sits in flat files and we use memory-mapped I/O, so you can read and write very quickly from disk.

We also use table replication, because it's very common for a database to crash while the system is in operation. Yesterday I happened to visit CMC Vellore, and they were saying that even with power backup they see a lot of power cuts, which means the database just goes boom; even with a UPS, it barely serves the purpose. So we do see a lot of database crashes, but we replicate just the tables, not the flat files. And the best part is that, since we store the complete DICOM file as a flat file, it is always possible to recreate the header tables from the files themselves.
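Here's a miniature of that pattern using SQLite; the table name, columns, and values are illustrative, but the split is the one described above - header fields in SQL, pixel data in flat files referenced by path:

```python
import sqlite3

conn = sqlite3.connect("pacs_index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS study (
        patient_name TEXT,
        modality     TEXT,
        study_date   TEXT,   -- DICOM dates are YYYYMMDD strings
        file_path    TEXT    -- the pixel data stays in a flat file
    )""")

# Index one exam's header (values are illustrative).
conn.execute("INSERT INTO study VALUES (?, ?, ?, ?)",
             ("Shourya Sarkar", "CT", "20030214", "/archive/ct/0001.dcm"))
conn.commit()

# "All of this patient's CT scans from 1 Jan 2000 to 31 Dec 2005."
rows = conn.execute("""
    SELECT file_path FROM study
    WHERE patient_name = ?
      AND modality = 'CT'
      AND study_date BETWEEN '20000101' AND '20051231'
    """, ("Shourya Sarkar",)).fetchall()
print(rows)
```

The flat files the query points at can then be read back with memory-mapped I/O, which is where the fast reads and writes come from.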
So the next question is transferring this data. The fundamental question is: why would I want to transfer the data at all? It turns out there are three reasons, and I'll skip back and forth between the next slide and this one. This is what happens in the hospital. You've got these modalities - a CT scanner, an X-ray system, an MRI scanner - continuously taking images. Patients are coming in and going out; a hospital like Vellore alone does a thousand-plus exams a day. All of that data has to move, and one way to make it move faster is to compress it. Is lossy compression acceptable for medical use? DICOM does not tell you anything about what is acceptable or not acceptable. What DICOM does is give you the standard for using JPEG and JPEG 2000 as compression engines for these images. What quality factor you build into those JPEGs is a choice that you have to make, and you have to clinically validate your product before selling it in certain parts of the world. For example, if you were to make a viewer which uses a quality factor of 30, you'd have to prove that 30 is valid for human imaging in, let's say, the USA. So there's a clinical validation question.

Audience: What about noise in the data? For example, you've taken an image and you're sharing it with others. Is there some kind of signature mechanism that makes sure the data is still intact?

Speaker: Are you talking about checksums, or are you talking about noise?

Audience: Something like a checksum, actually. When transferring data, it's quite likely that the data gets corrupted, and the person on the other side could still see some images. How can we tell whether there is noise in the data?

Speaker: Great question. The transfer protocol provides you a basic checksum feature. It's not a very advanced checksum feature, though. Beyond checksums, you could also worry about the threat of somebody entering the system and actually manipulating your data. But since these data sets are complex and there's not a lot to be gained - this is not financial transaction data - we don't see that kind of attack on a DICOM stream.

Audience: I didn't mean an attack; I just meant data integrity.

Speaker: Yeah, so that gives you a basic checksum correction, but it's not an MD5-style guarantee.