We have Harry Moreno. He's going to talk about how one would identify sick cannabis plants using machine learning. Is your mic good? Keep it close.

Hi, my name's Harry. Thank you, and I hope you're enjoying DEF CON. So we'll begin. This is machine learning for sick cannabis: how would you build a model to identify sick cannabis? I designed this with the intention of being very practical, so folks could take a photo of any sort of cannabis plant at an average distance with their smartphone.

So, who am I? I'm a software engineer. I'm not a data scientist; that's not my job title. But I'm interested in data science and machine learning, and I thought this would be a pretty interesting, very practical problem to solve. I'm from New York, and I'm an organizer at Kaggle New York City. Kaggle, for those who don't know, is a data science platform where people from around the world compete to be the best data scientist on interesting datasets. So if you're in New York and want to do hands-on machine learning, please join us.

What's the background here? Some context on what's possible, on what's being done with AI in 2018: artificial intelligence is becoming more pervasive and accessible. Two examples. For radiologists interpreting x-rays, the state of the art is that we've built models that are as good as or better than professional radiologists in terms of accuracy. And I think last year Stanford University published a model for diagnosing skin cancer that could run on a smartphone. So that's what's possible.

And what are we trying to do? We're trying to see if we can build something similar, but for diagnosing cannabis. The intended users are hobbyists, people who aren't doctors or plant doctors and just want to grow a plant, and industrial players: if you have a cannabis farm, you're very keen on knowing if your crops are sick.
And the other problem is that diagnosing a plant requires a domain expert. For example, I'm not a domain expert in diagnosing plants, so I would want a model like this to help me. So that's our goal: we want software that tells you if cannabis is sick by using your phone.

So how do we make a predictive model? The machine learning process is this: you gather data, you train on examples, and you build a predictive model from those examples. Then we launch it and publicize it so that people can use it, and then we iterate. One thing we can do once it's deployed is that, as people upload photos to get their predictions, that grows our dataset. The goal is to make a very, very accurate, free predictive model.

First step: gathering data. Myself and some other people in the community built a scraper to collect pictures of sick cannabis. There are websites for this, where you upload pictures of your sick plant and other people tell you what might be wrong: if it's purple or if it's yellow, it means different things, and each problem has a different solution. Similarly, we collected healthy cannabis photos. This was a little easier, because people like to show off their healthy plants and how well they're doing.

Finally, we collected a dataset that would try to trick the model. If our goal is to build a model that can tell you whether a cannabis plant is sick or healthy, another very practical concern is: if I upload a photo, can the model tell me whether there's even cannabis in the photo?
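The gathering step above ends with scraped photos sorted by label. As a minimal sketch of how you might organize them for training, here's a stdlib-only helper that copies images into the per-class train/validation folder layout that Keras's `flow_from_directory` reads. The directory names and the 80/20 split are my assumptions, not details from the talk.

```python
import random
import shutil
from pathlib import Path

def split_dataset(raw_dir, out_dir, classes=("sick", "healthy", "other"),
                  val_fraction=0.2, seed=42):
    """Copy scraped images into train/ and validation/ folders per class.

    Expects raw_dir/<class>/*.jpg and produces the layout that Keras's
    flow_from_directory can read:
        out_dir/train/<class>/...  and  out_dir/validation/<class>/...
    Returns a dict of image counts per (split, class).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {}
    for cls in classes:
        files = sorted(Path(raw_dir, cls).glob("*.jpg"))
        rng.shuffle(files)
        n_val = int(len(files) * val_fraction)
        for split, chunk in (("validation", files[:n_val]),
                             ("train", files[n_val:])):
            dest = Path(out_dir, split, cls)
            dest.mkdir(parents=True, exist_ok=True)
            for f in chunk:
                shutil.copy(f, dest / f.name)
            counts[(split, cls)] = len(chunk)
    return counts
```

With the talk's 1,000 images per class, this would put roughly 800 of each into training and 200 into validation.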
So I added a third class, basically composed of meme photos, the Caltech object dataset of random everyday objects, and a lot of pictures of flowers and plants, to see if the model could first learn to identify what a cannabis plant is, and then whether it's sick or healthy. That's the design behind building this "other" images dataset. At the end of the data-gathering stage, we ended up with 3,000 images: 1,000 sick, 1,000 healthy, 1,000 other. This is actually a very, very tiny dataset for this kind of machine learning, and we'll see how it performs.

If you want to follow along, what I pretty much did was follow section 5.3 of Deep Learning with Python. It covers data augmentation, which helps you mutate your dataset so you can get more mileage out of it, as well as transfer learning, which lets you build your model on top of cutting-edge research that other people have published.

A practical concern: I work off a fairly old MacBook, and it has a GPU, but it's just very, very old. So I used Amazon Web Services, specifically their SageMaker offering, which is one-click deploy: you get a GPU instance with all of the libraries installed. The one-GPU instance is $1.50 an hour; the four-GPU one is $10 an hour. If you want to do this and you don't have a GPU, that's how you could do it.

The specific technique I used here was transfer learning, where we basically take the ImageNet winner. ImageNet was another very popular problem, and it's now considered a solved problem. The challenge was object recognition: you're given tons of data spread over 1,000 classes, and your model has to say what's in the photo.
Because this is considered a solved problem now, there are plenty of pre-trained models you can build your own models off of, and that's what this transfer learning technique is. Imagine the full ImageNet solution from top to bottom: you take that, remove the top layer, and put in your own, and that new top is what you're training. The lower layers are reusable feature detectors that you can keep.

So what are the results? I tried two architectures, broadly. First there's ResNet50, ResNet being short for residual network. I won't get into the details, but I tried this one first because it's the one that reached human-level accuracy in the ImageNet competition, and the accuracy we got when building our model was about 60% validation accuracy. Then I tried another architecture called VGG16, which comes out of Oxford, and this one achieved 80% accuracy, so that second architecture is pretty much what I ended up deploying. I want to figure out why this is the case; everything online tells me that ResNet should have been more accurate, but it seems like we just need more data. If we had something like 20,000 or 50,000 images, I'd like to try this again with ResNet and see if it could get up to 95 or 99% accuracy. But for now, with 3,000 images, we have VGG16.

Then we deployed it to the cloud. It's running on EC2, with no GPU in production. It takes about one second for inference to happen, for you to get a prediction, and it's built with Python tools like Keras and Flask, and that's about it. We built a user interface, and you can check it out right now at chronicsickness.com. If there's enough time I'll demo it, but it's live: you upload a file from your phone or your computer's browser, you submit it, and you get a probability for what the model thinks the outcome is.
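The "remove the top layer and put in your own" recipe described above can be sketched in Keras, roughly as Deep Learning with Python section 5.3 does it. The input size, head layers, and optimizer are my assumptions; in practice you would pass `weights="imagenet"` to get the pre-trained feature detectors, and `weights=None` appears here only so the sketch doesn't download them.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base; include_top=False drops ImageNet's
# 1,000-class classifier head. Use weights="imagenet" in practice;
# weights=None here only to avoid the large download in this sketch.
base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))
base.trainable = False  # freeze the reusable lower-layer feature detectors

# New top: the only part that trains, ending in the talk's three classes.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(3, activation="softmax"),  # sick / healthy / other
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training then fits only the new head on the 3,000-image dataset while the frozen base acts as a fixed feature extractor.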
For this example picture it's 88% healthy; the probabilities for the three classes all add up to one, so it would predict that the picture is of healthy cannabis. I encourage you to test it out and help us make it better. We want to iterate, because this is just the first step and 80% isn't that great. If we're putting radiologists and dermatologists out of work, this, in my opinion, should be a simpler problem, so it's really just about getting more data. 3,000 images is way, way too small. The project is already open source. The key problem, though, is getting good labeled data, which is very time-intensive and requires an expert. So we now want to build some sort of crowdsourcing platform where people can contribute when they have free time or as their interest brings them in, and leave when they lose interest, while that work is preserved in a crowdsourced way. So look out for that.

Future work. The problem of diagnosing sick plants, as we've framed it, is a three-class problem. It's hard to say what should be more or less difficult, but for example, ImageNet was a 1,000-class challenge, and statistically, a random guess is much less likely to be right when there are lots of classes. So classifying the exact disease of a plant is much, much more difficult than what we have now, which is sick, not sick, or not cannabis. And then, since there are lots of sites like this already, it might be fun to try to predict the strain of the cannabis. One source told me there are about 800 strains, so that's arguable as well, but it'd be pretty interesting to build something like this with no access to the plant's genetics, purely from its outward appearance, and see what sort of strains computers discover. And that's my talk.
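The reason the three numbers add up to one is that the model's final layer is a softmax, which turns raw class scores into probabilities. A minimal numpy sketch, with made-up scores rather than real model outputs:

```python
import numpy as np

def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    z = np.exp(logits - np.max(logits))  # shift for numerical stability
    return z / z.sum()

# Hypothetical raw scores for the three classes
classes = ["sick", "healthy", "other"]
probs = softmax(np.array([0.5, 2.5, 0.4]))
print(dict(zip(classes, probs.round(2))))
print(round(probs.sum(), 6))  # probabilities always sum to 1
```

Whichever class gets the highest probability ("healthy" here) is reported as the prediction.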
The site's available at chronicsickness.com, and if you want to reach out to me, my site is harrymoreno.com. If you have access to a large dataset of labeled cannabis photos, I really want to speak with you. Other than that, I want to build the crowdsourcing platform so that people can contribute. Let's make a free predictive model for cannabis disease, free for everybody, that we could all build together and that benefits everybody. That's what I want to build. That's it. By my clock I have 10 more minutes, so I guess I can take questions. Any questions? Yeah.

Okay, so the question is: do we want to build something for ailments other than diseases, like lack of water? So actually, maybe I'm not an expert in cannabis, but when I said disease I was including those sorts of ailments. This breakdown of 40 plant diseases is actually from groweasy.com, and it includes things like boron deficiency, or light burn, which isn't a disease but more of an issue you'll have if you're just doing this in your closet or something. We could build both, but I think proper diseases would actually be more difficult, and you would want some sort of botanist. From a more practical point of view, it'd be much easier to collect data on the things you mentioned: practical things like, is it lacking water? Does it need more sunlight? Is it too close to the generator? That's one I learned in this project. So it's sort of like the software meets you in the middle. If people want to build certain things and we have the talent, we could certainly build a proper disease model, but I think it'd be way easier to do the common ailments. In the back, the hat.
So the question is: does the model do this? ChronicSickness.com does not; this is future work. We have a model that's about 80% accurate at just telling you whether it's sick or not, and we want to solve that problem first. In my opinion, we should be able to build a model that's above human-level accuracy, whatever that means exactly, we could debate that, for sick versus not sick. Once that's solved, we can tackle more granular problems like the specific ailment, and the strains just for fun.

We don't have an API yet. There's just chronicsickness.com, where you can upload a photo, and there's a link to the GitHub so you can see the code. The model itself is not published: it's about 100 megabytes, and GitHub doesn't have support for large files, so it's not there. But we can speak offline about that, and if people want an API, we could build something like that.

Oh, one more question. Yeah, so the GPU really helps with a specific technique called data augmentation. You can do that for different types of data: in machine learning problems you might have audio data or text data, but in this problem, it's image data. It turns out we can apply random transformations to that data, like rotating, skewing, and zooming in, and all of that is drastically accelerated with a GPU. The book says that if you don't have a GPU, you shouldn't even try to do data augmentation. And then the actual model training time with a GPU was about an hour and a half for 3,000 images, and I think that grows linearly. So if we get 20,000 images, it might be more like six-hour jobs, I think so.
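The rotating and flipping described above can be illustrated with a toy, numpy-only version of what Keras's ImageDataGenerator does; the probabilities and transform choices here are my assumptions, and the real generator also does continuous rotations, shears, shifts, and zooms via interpolation (which is where the GPU earns its keep).

```python
import numpy as np

def augment(image, rng):
    """Return one random, label-preserving variant of an image.

    A toy version of Keras's ImageDataGenerator: each variant is a
    flip and/or 90-degree rotation, so one photo yields several
    distinct training examples without changing its label.
    """
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                   # horizontal flip
    out = np.rot90(out, k=rng.integers(0, 4))  # random 90-degree rotation
    return out

rng = np.random.default_rng(0)
photo = np.arange(27).reshape(3, 3, 3)           # stand-in for an HxWx3 image
batch = [augment(photo, rng) for _ in range(8)]  # 8 variants of one photo
```

Every variant contains exactly the original pixels rearranged, which is why the model sees "new" examples without anyone collecting more data.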
I think model training time grows linearly with your dataset size, not so much with the number of classes you're trying to predict. So if you have any more questions, the site is chronicsickness.com. You can follow through to the GitHub, find my handle, and send me emails if you want to help or have suggestions. Thank you.