I would like to welcome you all to this talk, where I'll be talking about how we can use open data to predict market movements. It's going to be a short talk, so I'll move quickly. A little bit about myself: my name is Asha Saini. I work as a senior advisor with the competitive intelligence team at Dell Technologies. I've been with the company for a little over two years now. I'm based in Boston and I'm here just for this conference. I hope you enjoy the talk and learn something new from it.

I'll briefly describe what open data is and what Common Crawl is. Has anyone here heard of Common Crawl? I'll also talk a little bit about what Gartner's Magic Quadrant is and what I mean by market movements. We've done a ton of ad hoc analysis on this Common Crawl data, and it's really hard to share all of our interesting findings in this short talk, but I will cover some. Then we'll do a deep dive into the step-by-step data analysis process, from data extraction through pre-processing and analysis, followed by visualization and interpretation of results.

Open data is similar to open source. It's just that when you talk about open source, you're talking about source code and applications. Open data has to do with data sets that are freely available to the public: anyone can access them and use them for whatever analysis they want to do, without running into copyright, patents, or any other mechanism of control.

Common Crawl is a non-profit organization that built, and has constantly been maintaining, a repository of web crawl archive data. They've been doing this for more than ten years, I believe. This web crawl archive data is available as part of Amazon's public data sets, in a public S3 bucket. A crawl is executed once every month, and at the end of the month the archive for that month is posted in the repository. This repository, as you can imagine, is in a sense a copy of the internet. It has billions of web pages archived in it, and you can imagine the size of it: it's definitely hundreds of terabytes of data.

Gartner is a global research, advisory, and consulting firm. They've been helping their clients make the right decisions, choose the right technology, and make the right strategy decisions for more than 40 years now, so it's a well-known, recognized, and reputed research and consulting firm. What is of interest to us here is the Magic Quadrant reports they publish on a regular basis for various market categories. What you see on the top right-hand side is a Magic Quadrant; this is what it looks like. I'm not sure if the text is readable, but the four quadrants are Leaders, Challengers, Niche Players, and Visionaries, and vendors in a given market category are placed in one of these quadrants based on a set of criteria. Gartner has criteria they use to evaluate the vendors in a given technology or market space, and then they place each vendor in one of these quadrants. They publish these reports regularly, and the interval differs depending on which market category you're talking about. Once a vendor is placed in one of these quadrants, you can observe movements over time, either within the same quadrant or from one quadrant to another.
As an example, Dell acquired EMC and we became Dell EMC; let's say, just for the sake of example, that Dell EMC moved from the Challengers quadrant to the Leaders quadrant in July 2018. That is what I mean by market movements in the Magic Quadrants. It also has to do with overall market trends: where exactly the market is headed, where we are moving in terms of technology, which technologies are gaining traction and which are not, which technologies are emerging and which are disappearing. Because if you are a company investing in a technology that's disappearing, you would want to change your strategy. So that was just to lay a background for those who don't know these terms. These are two separate things, and this is what I will be talking about.

When we started this project, we made a hypothesis, and one of our project goals was to find out whether there is a correlation between what's being said about a vendor in the media, on news websites, on their own websites, in articles, and in analyst reports, and that vendor making movements in the Magic Quadrants. That was the hypothesis we started with, and this is one of our findings. What you see here is the graph of NetApp articles versus NetApp's placement in the Magic Quadrant. I would like you to focus on two lines, the topmost and the bottommost. Both appear red on the projector, but one is maroon and one is red. The bottom one is NetApp articles and the top one is NetApp's placement in the Magic Quadrant. You can see that the pattern of these two lines is similar. This one was done for the General-Purpose Disk Arrays Magic Quadrant, and we looked at a lot of other companies and a lot of other Magic Quadrants as well to see if there is a correlation. Hitachi Data Systems is another one we looked at, also for the General-Purpose Disk Arrays Magic Quadrant. You can see the blue line at the top and the maroon line. The maroon line is the number of articles that mentioned Hitachi and general-purpose disk arrays, and the other line is Hitachi's placement in that market space in the Magic Quadrant. So you can see that there is a bit of correlation, right?

Now let's dive deep into the process we used for our analysis, which is pretty much standard. This is just a high-level, step-by-step process; in reality there are a lot more steps involved. When you start analyzing your data, the first thing you want to do is define your objectives: what business problems are you trying to solve, and what questions are you looking to answer? For our project, the objectives were divided into two categories: the hypothesis validation I just talked about, and ad hoc analysis. Like I said, our hypothesis was that there is a correlation between what's being said about these vendors on news websites, in articles, and on blogs, and those vendors making movements in the Magic Quadrant. The ad hoc analysis part was for us to gain competitive insights. Like I mentioned, I work in the competitive intelligence team, so a lot of my time is spent finding out what our competitors are up to in the market.
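As a side note on how a correlation like this can be checked beyond eyeballing the two lines on a chart, here is a minimal pandas sketch. The quarterly numbers, the column names, and the quadrant-to-score encoding below are entirely made up for illustration and are not the data behind the charts I showed.

```python
import pandas as pd

# Hypothetical quarterly data: article counts mentioning the vendor, and the
# vendor's Magic Quadrant placement encoded as a rough score
# (1 = Niche Player, 2 = Visionary, 3 = Challenger, 4 = Leader).
df = pd.DataFrame({
    "quarter":  ["2017Q3", "2017Q4", "2018Q1", "2018Q2", "2018Q3"],
    "articles": [120, 150, 180, 240, 260],   # made-up counts
    "mq_score": [3, 3, 3, 4, 4],             # made-up placements
})

# Pearson correlation between article volume and quadrant placement.
corr = df["articles"].corr(df["mq_score"])
print(f"Correlation between article volume and MQ placement: {corr:.2f}")
```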
What strategies or major announcements are they making? Is there a new product launch? Especially the ones that are going to have an impact on our business. So we wanted to gain competitive insights, for which we did text analysis on company websites and on industry trends, and again on the emerging and disappearing technologies I already covered in the first slide. We also looked at how technology terms like blockchain, public cloud, IoT, artificial intelligence, and machine learning have been evolving over the past five to ten years, and at how our competitors have been transitioning in terms of technology over the past few years. This is an important one, and I can give an example: HPE acquired SimpliVity for their hyperconverged infrastructure, and after the acquisition they came out with a very robust hyperconverged infrastructure portfolio. That was their strategy and their transition. Hyperconverged infrastructure is not the only thing HPE is doing, but it is one of their lines of business. So we want to be aware of such transitions or major decisions that our competitors are making.

The next step is collecting the data. The first thing we had to do was determine what data we were trying to collect. Common Crawl was our data source, and then we had to identify which sources within it we were going to use. Then you gather the data and put it in a format that's workable for you, depending on the type of analysis you want to do. For our project we identified two categories of data sources. On the left-hand side you see IT news websites and on the right-hand side the company websites. Websites like The Register and InfoWorld are news websites that cover the data center infrastructure solution space in great detail, so they were a great source of information for us. You also see PCWorld and Computerworld in the list, because Dell is also in the consumer electronics business, which is the laptop business, right? And then we identified the major on-premises infrastructure solution providers and looked at all of those companies.

When it comes to downloading the data, the image on the right-hand side shows the raw data, the web page data, that is available in the Common Crawl archive. But like I said, there are billions of web pages in it, so you cannot just connect to this data source and download everything; there has to be an indexing mechanism for data like that. So there is an index server that holds the index for all the web pages available in the archive. The CDX index client is an open source project available on GitHub; I'll be sharing the links to all these GitHub projects toward the end of the slides. This project allows you to download the index data, and using that index data you can then download the actual raw web page data. We wrote a lot of Python scripts to download the indexes for the domains we were interested in, for example netapp.com, nextplatform.com, or theregister.com. We identified all those domains in the previous step and then downloaded all the index data for those domains. These scripts were executed on EC2 instances in AWS. The image on the right-hand side is an example of what the index data looks like.
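To give a concrete picture of that index step, here is a minimal Python sketch of a query against the Common Crawl index server, which is roughly what the CDX index client automates for you. The crawl label CC-MAIN-2018-26 is just an example and not necessarily the crawl we used.

```python
import json
import requests

# Query the Common Crawl index server for every captured page under a domain.
# The crawl id (CC-MAIN-2018-26) is only an example; any published crawl works.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2018-26-index"

resp = requests.get(
    INDEX_URL,
    params={"url": "netapp.com/*", "output": "json"},
    timeout=60,
)
resp.raise_for_status()

# Each line is one JSON record describing a capture: the original URL plus the
# WARC file name, byte offset, and length needed to fetch the raw page later.
records = [json.loads(line) for line in resp.text.splitlines() if line]
for rec in records[:5]:
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```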
I'm going to switch to the browser to demonstrate real quick. So this is the browser interface for the index server. Why is it not working? If it's not working, that's okay, I have the image on the slide as well. This image on the right-hand side shows what the index data looks like. I'm not sure if it's readable, but this is for the netapp.com domain. When you go to the browser interface, you can basically query the API, say you want all the index data for the netapp.com domain, and it will return it for you. This is what it looks like. I have the link in one of my slides, so if you are interested, you can check it out afterwards.

After we downloaded the index, the next step was to download the actual raw web page data. Another project, the Common Crawl Document Download project, is also open source and available on GitHub, and you can use it to download the web page data. This tool requires the index data to be fed to it as an input, so the index data we downloaded in the previous step was fed to this project, and it downloads whichever pages that index data points to. Similar to the previous step, we wrote a lot of scripts to download the data for the various domains. All of these websites organize their sections and web pages differently, so we had to handle each domain we were interested in separately. The image you see here is for the emc.com domain; this is what the web page data looks like in our S3 bucket. We executed the scripts again on EC2 instances in AWS and stored the downloaded raw web page data in our S3 bucket.

Once we had downloaded the data, the next step was processing it. This is where you start filtering out the unwanted or meaningless data that does not hold any value for your analysis and is not going to contribute to what you're looking for. This is also where you do the cleaning, removing duplicates and invalid data, and start putting structure on the unstructured format. In our case, the data was extremely random and extremely unstructured. For our project, one of the pre-processing steps was to get rid of all the HTML code, because we're talking about web page data, which obviously has a lot of HTML and JavaScript in it, and our interest is only in the plain text. DKPro C4Corpus is a project with a lot of utilities; boilerplate removal is the one we used in our project to get rid of the HTML code. On the next slide, and I'm sorry about the font size, you can probably see that the colored text is the code portion of the data. This is before HTML removal, and after HTML removal only the text data remains. You can see that the size of the data dropped from over 100 KB to 9 KB, which is a big difference.

The next step was to analyze the data, where we start exploring it to see what hidden patterns are in it. Are there any trends? Are there any correlations? What messages are contained in the data? For the data processing piece, we were running Spark, Scala, and Python code in Zeppelin notebooks on EMR clusters. We downloaded the HTML-cleaned data from our S3 bucket into RDDs and did a lot of transformations on those RDDs.
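To make the raw download step above concrete, here is a rough Python sketch of the underlying idea that the Common Crawl Document Download project wraps up for you: take the WARC file name, byte offset, and length from an index record and issue an HTTP range request against the public archive endpoint. The record values below are placeholders, not real offsets, and the endpoint shown is Common Crawl's current public HTTPS endpoint rather than whatever the project used internally.

```python
import gzip
import requests

# One index record from the previous step (values here are placeholders).
record = {
    "filename": "crawl-data/CC-MAIN-2018-26/segments/example/warc/example.warc.gz",
    "offset": "123456789",
    "length": "23456",
}

start = int(record["offset"])
end = start + int(record["length"]) - 1

# Each capture is an individually gzipped WARC record, so a byte-range request
# pulls exactly one page out of a multi-gigabyte archive file.
resp = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={start}-{end}"},
    timeout=60,
)
resp.raise_for_status()

# The decompressed record contains WARC headers, HTTP headers, and the HTML body.
warc_record = gzip.decompress(resp.content).decode("utf-8", errors="replace")
print(warc_record[:500])
```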
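Similarly for the HTML clean-up: we used the Boilerplate utility from DKPro C4Corpus, which is a Java library, so purely to illustrate the idea in the same language as the rest of these sketches, here is a simplified Python equivalent using BeautifulSoup. This is not what we actually ran, just a sketch of the same step.

```python
from bs4 import BeautifulSoup

def strip_html(raw_html: str) -> str:
    """Reduce a web page to plain text: drop scripts, styles, and markup."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Remove code-bearing elements entirely; their contents are not page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Collapse the remaining document to whitespace-normalized text.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())

# Example: the cleaned text is what moves on to the analysis stage.
print(strip_html("<html><body><script>x=1</script><p>Hello  world</p></body></html>"))
```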
When you're doing text analysis, there are some standard steps involved: getting rid of special characters, removing multiple spaces, and removing stop words. Stop words are words like "is" and "the" that appear a lot in the English language but don't really hold much value, so you have to get rid of them. You also do lowercasing, so that a word like "Asha" with a capital A is not treated as different from "asha" with a small a, and then tokenization, stemming, and lemmatization. These are standard things you do in text analysis.

Then we did the n-gram frequency analysis. An n-gram is n terms occurring consecutively, so there are unigrams, bigrams, and trigrams. An example of a bigram could be "public cloud", two words appearing together. Why did we do this? Because when you do a plain word frequency count, the number of times "data" appears and the number of times "center" appears may not mean much to you on their own, but the number of times "data center" appears together may mean a lot, right? So we did the n-gram frequency analysis and then stored this data back into our S3 bucket. This is an example of what the data looked like before and after stop word removal: you can see the highlighted words are gone from the right-hand column.

Then we did the visualization. For visualization we used D3.js and the Zeppelin notebook itself. This is a bubble chart displaying the word frequency counts; the size of a bubble corresponds to the number of times the word appears, so the bigger the bubble, the higher the frequency of that word in the text data. Let me see if I can zoom in. This bubble chart is for the netapp.com domain, and I'm going to read the words out for you: storage, data, NetApp, cloud, analytics, solution. From all these words you can tell that NetApp is a company in the data storage and cloud business, right? We did these kinds of visualizations for a lot of domains.

The last step is really just the interpretation of results, where you circle back to your first step and see whether you were able to find answers to the questions you started off with and solve the business problems you were looking at.

This is a high-level architecture of the project we did in AWS. The four layers are data source, data processing, analysis, and visualization. The data source is the public data set in the S3 bucket. We downloaded the indexes and stored them in our S3 bucket, then used the Common Crawl download project to download the actual raw web page data, which was stored back into our S3 bucket. Then we used the DKPro C4Corpus project to get rid of the HTML tags, and fed the HTML-cleaned data to the Zeppelin notebooks on EMR clusters to do the analysis. In the end, the data was visualized in D3.js. We are not using Tableau yet, but it is on our roadmap. These are some of the important links I was talking about: if you're really interested in exploring this data, the Common Crawl archive, you can visit these links and find out more about it.
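To tie the pre-processing and n-gram steps back together, here is a minimal PySpark sketch along the lines of what ran in our Zeppelin notebooks. The S3 path and the tiny stop word list are stand-ins for illustration, and stemming and lemmatization are omitted for brevity.

```python
import re
from pyspark import SparkContext

sc = SparkContext(appName="ngram-frequency-sketch")  # Zeppelin provides `sc` for you

# Tiny illustrative stop word list; in practice you would use a full one.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in", "for", "on"}

def tokenize(doc: str):
    """Lowercase, drop special characters, split on whitespace, remove stop words."""
    doc = re.sub(r"[^a-z0-9\s]", " ", doc.lower())
    return [tok for tok in doc.split() if tok and tok not in STOP_WORDS]

# Hypothetical path to the HTML-cleaned text files in S3.
docs = sc.textFile("s3://example-bucket/html-clean/netapp.com/*")
tokens = docs.map(tokenize)

# Unigram frequencies across all documents.
unigrams = (tokens.flatMap(lambda toks: toks)
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))

# Bigram frequencies: pair each token with its successor within a document.
bigrams = (tokens.flatMap(lambda toks: list(zip(toks, toks[1:])))
                 .map(lambda bg: (" ".join(bg), 1))
                 .reduceByKey(lambda a, b: a + b))

for phrase, count in bigrams.takeOrdered(10, key=lambda kv: -kv[1]):
    print(phrase, count)
```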
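And although the actual bubble charts were built in D3.js, the same word-frequency view can be roughed out in Python with a matplotlib scatter plot whose marker size tracks the count. The words and counts below are made up to mirror the netapp.com example, not real output.

```python
import matplotlib.pyplot as plt

# Made-up word frequencies mirroring the kind of output shown for netapp.com.
freqs = {"storage": 950, "data": 870, "netapp": 640,
         "cloud": 520, "analytics": 300, "solution": 260}

words = list(freqs)
counts = [freqs[w] for w in words]
x = list(range(len(words)))

fig, ax = plt.subplots(figsize=(8, 4))
# Marker area scales with frequency, so bigger bubbles mean more mentions.
ax.scatter(x, counts, s=[c * 3 for c in counts], alpha=0.5)
for xi, word, count in zip(x, words, counts):
    ax.annotate(word, (xi, count), ha="center", va="center")

ax.set_xticks(x)
ax.set_xticklabels(words)
ax.set_ylabel("frequency")
ax.set_title("Word frequency bubble chart (illustrative data)")
plt.tight_layout()
plt.show()
```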
This is the team I would like to thank, because this analysis wouldn't have been possible without these brilliant minds. We won an award for this project, for a technical paper we submitted to a contest at Dell Technologies. That is all I had for this talk. If you have any questions, I will be around, and if you have any feedback or ideas, anything you would like to discuss, I'll be here and happy to have that conversation with you. Thank you all.