Hi, thank you so much for the great introduction, and a very good evening, everyone; I hope you are all doing great and enjoying your weekend. I am Anshika Rajiv, a recent graduate of the University of Hong Kong, where I studied computer science and information systems on a full scholarship. I am currently working at an investment bank in Hong Kong as a technology analyst. I am very excited to be a speaker at PyCon India 2021, and I hope you are all just as excited to be here at this amazing conference. Before I start my presentation, I would like to clarify that this presentation is in no way investment advice, and we should always carry out our own research before investing our hard-earned money. Now that the disclaimer is out of the way, without any further ado, let us begin.

Today I will be presenting on the topic of the financial data forecaster, that is, time series forecasting of stock market data. Let me first give you a quick outline of what I will cover in my talk today. Firstly, I will present the background and the objectives of this work. Next, I will discuss the methodology adopted to achieve these objectives. Moving on, I will shed light on the process of data collection, data pre-processing, data analysis, feature creation, feature engineering and the application of regression models. Finally, I will discuss and analyze some of the results, briefly talk about future recommendations and references, and then conclude my presentation. So let us begin.

Financial data forecasting has been a field of great interest amongst researchers across the world. It is a significant domain with a wide variety of applications and consequences. Moreover, every second a massive amount of financial information is generated in the form of stock prices, cryptocurrency prices and foreign exchange indices. If this information is leveraged and converted into insights, it can be very beneficial for both individuals and business organizations, because it can potentially help one to (a) make better operational and strategic decisions, (b) mitigate risks and losses, and (c) generate profits. In order to achieve this and facilitate the prediction of financial data, machine learning and deep learning have been playing a crucial role. In this era of rapid technological advancement, using machine learning and deep learning to predict stock market prices and trends has become more popular than ever before. In fact, according to case studies by Deloitte and KPMG, the growth rate of intelligent systems and robo-advisors has been around 70%, with over $2.2 trillion in assets under management for such automated predictive systems. Thus, financial data forecasting through machine learning and deep learning forms the basis of my talk today. The ultimate goal is to experiment with and evaluate different algorithms to find the one that gives the highest accuracy, or the lowest error. This is achieved by exploring different kinds of input variables for our machine learning model and seeing how much each contributes towards stock price prediction. The scope of the financial data I will be talking about today is time series financial data from the stock market. Some of you may be wondering what exactly time series data is.
It is nothing but data collected over repeated measurements of time, such as weather data or health record data; it is sequential data, where the order of the observations holds great importance. Time series forecasting facilitates the prediction of a value in the future on the basis of values in the past. In order to achieve this, a structured methodology will be followed. This includes data collection, data pre-processing, exploratory data analysis, feature engineering, the implementation of machine learning algorithms and the testing of these algorithms. Let us cover each of these steps of the machine learning pipeline in some detail.

First comes data collection. Data acts as fuel for any machine learning model, so it is imperative to collect a large amount of financial data. In this presentation I will mainly cover two kinds of financial data: numeric data, which is the historical stock prices, and non-numeric data, which is the news headlines. Numeric data, the stock price data, was collected because historic prices and trends can potentially affect future prices and trends. As you can see on my screen, 20 years of stock market data was collected with the help of the Yahoo Finance API. The structure of the data includes the date, high, low, open, close, adjusted close and volume. To give a little background on what these fields mean: the date essentially acts as an index and tells us the day on which the data was recorded; high denotes the highest value of the stock on a certain day; low denotes the lowest value of the stock on a certain day; open is the opening price of the stock; close is naturally the closing price; and volume is the amount of stock that a company traded on a specific date.

Moving on, some non-numeric data was also collected. The main reason for collecting non-numeric data was that, from reading a lot of research papers and witnessing real-life events, I realized that the future prices of a stock depend not only on historic prices but on a number of other environmental factors, such as public sentiment, social media activity, changes in political power, terrorist activities and much more. Great examples of this are the fact that certain tweets by eminent personalities such as Mark Cuban or Elon Musk often lead to an increase or a decrease in stock prices, and the sudden changes in the GameStop stock price earlier this year because of some Reddit conversations. This really motivated me to collect non-numeric data as well, and it was collected with the help of an API from Reuters.com, an online news agency that covers financial news extensively. The structure of this data includes the headline, the year in which the headline was published and the date of the headline.

After we have collected a lot of data, it becomes very important to pre-process it, because the data can contain null values or outliers. For this presentation I will cover handling missing values and scaling the values. Missing values can either be removed completely or replaced with the mean or the median. I do not think it is a very good idea to remove null values completely, because simply removing them carries the risk of losing important information, whereas replacing them gives more comprehensive results.
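As a minimal sketch of these two steps, here is roughly what the collection and missing-value handling could look like in Python, assuming the yfinance package as a wrapper around the Yahoo Finance API; the ticker, the date range and the median-based replacement are only illustrative choices on my part.

```python
# Minimal sketch: download ~20 years of daily prices and replace missing values.
import yfinance as yf

# Ticker and dates are illustrative assumptions (e.g. Bank of America).
prices = yf.download("BAC", start="2001-01-01", end="2021-01-01")

# Columns: Open, High, Low, Close, Adj Close, Volume; the index is the date.
print(prices.head())

# Replace missing values with the column median rather than dropping rows,
# so that no trading days are lost from the series.
prices = prices.fillna(prices.median(numeric_only=True))
```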
Next, the data was also scaled with the help of a min-max scaler in order to standardize it. Now that our data is collected and pre-processed, we come to the third stage, which is conducting exploratory data analysis. This is very important because it facilitates a better understanding of our data set, and here I will speak about plotting correlation tables and graphs and exploring the news articles to understand them better.

First comes plotting a correlation table. So what exactly is correlation? Correlation essentially gives a better understanding of the dependence of one variable on another. It ranges on a scale from minus one to one: minus one is a perfectly negative correlation, zero indicates no correlation, and one indicates a perfectly positive correlation. It essentially helps tell us the strength of the relationship between various features. As we can see on the screen, the cooler colors, the shades of blue, indicate weaker correlations, whereas the hotter colors, the shades of red, depict stronger correlations.

Apart from this, the closing prices of the stock can also be plotted and analyzed. As you can see on my screen, there are a lot of sharp peaks and dips in the stock prices. This is mainly because of changes in the company, but it can also happen because of other factors. For example, in 2008 there was a financial crisis in the United States which really affected stock markets not only in the US but in the rest of the world as well. So we can see a constant rise and dip in the prices over the course of the years because of a lot of factors, which makes this entire topic so very complex, yet so very interesting.

Next, in terms of the news headlines, the presence of competitor firms was also analyzed, and we can see that competitor firms have been mentioned quite a few times. I have taken the example of Bank of America and Bank of China: JP Morgan, Goldman Sachs, Citibank and Wells Fargo are mentioned many times in the news articles pertaining to Bank of America, whereas ICBC, HSBC, Agricultural Bank of China and BEE are mentioned many times in headlines for the Bank of China stock. I also plotted some word clouds to understand which words have been consistently repeated in the news articles, and I have attached some of the snapshots for your reference as well.
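Here is a minimal sketch of these exploratory plots, reusing the prices DataFrame from the earlier snippet and assuming matplotlib and seaborn are available; it is only meant to illustrate the kind of charts described, not reproduce the exact figures shown on screen.

```python
# Minimal sketch of the exploratory plots: correlation heatmap and closing price.
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation table of the numeric columns drawn as a heatmap:
# cooler blues for weaker correlation, hotter reds for stronger correlation.
corr = prices.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between price features")
plt.show()

# Closing price over time, where events such as the 2008 crisis show up as dips.
prices["Close"].plot(figsize=(10, 4), title="Closing price over 20 years")
plt.show()
```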
So now that we understand our data better and have analyzed it by plotting various graphs and charts, it is time to create some additional features. A number of new features can be created, including technical indicators, sentiment scores, subjectivity scores and headline embeddings. These features will be used as input variables in our machine learning model in order to predict the closing price of a certain stock or index in the future.

Let us speak a little about technical indicators. These are essentially mathematical calculations that make use of past price and volume to help in the prediction of stock prices in the future. Various technical indicators can be calculated, including the simple moving average, the exponential moving average, the relative strength index and moving average convergence divergence. The simple moving average is a simple average of the closing price of a security or stock over a given number of days. The exponential moving average assigns less weight to older data and is based on a recursive formula whose calculation includes all past days in our price series. The relative strength index calculates a ratio of recent upward price movements to absolute price movements, and finally, moving average convergence divergence reveals changes in the strength, direction, momentum and duration of a trend in a stock's prices. Technical indicators and technical analysis are generally based on the common belief that price history tends to repeat itself, that past trends tend to continue in the future, and that the market discounts everything, which essentially means that the stock price reflects everything that could possibly affect a company. However, in practice we know this is not entirely true, because of all the environmental factors that have so much impact on stock prices.

Thus, more features were created with the help of the news headlines. These included sentiment scores, polarity scores and headline embeddings. Let us understand each of these in a little more detail. First comes VADER's sentiment intensity analyzer, which was used to calculate positive, negative, neutral and compound scores. As the names suggest, the negative score tells us the negative sentiment in a sentence, neutral indicates the neutral sentiment, and positive tells us the positive sentiment. The compound score shows the aggregated sentiment, and it can also be calculated with VADER's sentiment intensity analyzer. Now, I am sure some of you must be wondering what VADER is. VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a sentiment analyzer built on social media and news data using a lexicon-based approach. Do not worry if you do not know what a lexicon-based approach is: it essentially means that we look at the words, punctuation, phrases and emojis and rate them as positive or negative, with the scores coming from a lexicon labeled by human reviewers. The main advantage of using VADER over other analyzers is that it is computationally economical and very fast. The second advantage is that the lexicon and the rules used by VADER are directly accessible and not hidden, so they can be easily understood, extended and modified. Next comes the use of TextBlob, which is also a lexicon-based sentiment analyzer and is used to quantify the amount of personal opinion and factual information in a text. It gives us values for polarity and subjectivity: polarity gives us an idea of whether a statement is positive or negative, whereas subjectivity quantifies the amount of personal opinion versus factual information present in a certain news headline.

Lastly, Google's Universal Sentence Encoder is implemented. We make use of it to encode the textual data into high-dimensional vectors called embeddings; these are numerical representations of the textual data, and the vectors generated can be used as features in our predictive models. I would also like to share a little more about the basic architecture of the Universal Sentence Encoder. It first converts the sentences into lowercase and then tokenizes them. The encoder then encodes each sentence into a fixed 512-dimensional embedding, or vector. The deep averaging network, or DAN, is the encoder used here.
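Before going deeper into the encoder's internals, here is a minimal sketch of how these three kinds of news-based features could be computed, assuming the vaderSentiment, textblob and tensorflow_hub packages (NLTK also ships a VADER implementation); the example headline is made up.

```python
# Minimal sketch: VADER scores, TextBlob polarity/subjectivity and USE embeddings.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import tensorflow_hub as hub

headline = "Bank of America beats quarterly profit estimates"  # made-up example

# VADER: positive, negative, neutral and aggregated compound scores.
vader = SentimentIntensityAnalyzer()
scores = vader.polarity_scores(headline)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# TextBlob: polarity (positive vs negative) and subjectivity (opinion vs fact).
sentiment = TextBlob(headline).sentiment
polarity, subjectivity = sentiment.polarity, sentiment.subjectivity

# Universal Sentence Encoder: a fixed 512-dimensional embedding per sentence.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embedding = embed([headline]).numpy()[0]  # shape (512,)
```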
The DAN encoder computes the unigram and bigram embeddings, averages these embeddings out and passes them on to a deep neural network, which then returns a final sentence embedding of 512 dimensions. These embeddings are trained on unsupervised tasks and on supervised tasks such as the Stanford Natural Language Inference corpus, and the model can then be used to map any sentence into a 512-dimensional sentence embedding.

Now that we have created so many new features, let us move on to the next step, which is feature engineering. This is a very crucial step in the machine learning pipeline. As part of feature engineering, two things are mainly carried out: dimensionality reduction and feature selection. Dimensionality reduction is performed using principal component analysis. What PCA does is reduce the computational complexity and help avoid the curse of dimensionality. In PCA, the data is projected onto principal components such that the maximum variance is preserved and the least amount of information is lost in the process. As you can see in the graph on my screen, the number of components is plotted against the cumulative variance, and we can see that most of the variance is preserved by using 150 components. Thus we can reduce the dimensionality of the dataset to 150 by projecting it onto the hyperplane defined by the first 150 principal components.

Next, we perform feature selection using recursive feature elimination. In RFE, a backward selection of features is implemented based on two attributes: one is the coefficients and the other is the feature importances. The main aim of feature selection is to choose a subset of features from the original input dataset such that the subset is able to represent the entire input dataset while at the same time reducing the potential impact of noise or irrelevant variables, and thus reducing complexity. It is quite intuitive that not all features affect the future price with equal importance. This was confirmed when RFE was performed: it was noted that some features, such as the simple moving average, the exponential moving average and the positive sentiment scores, contributed significantly, while others did not.
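Here is a minimal sketch of this feature engineering step with scikit-learn, assuming X is the assembled feature matrix (indicators, sentiment scores and embeddings) and y is the closing price to predict; the estimator and the number of features kept by RFE are assumptions of mine, not the exact settings from the experiments.

```python
# Minimal sketch: PCA for dimensionality reduction, RFE for feature selection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Keep 150 principal components, which preserved most of the cumulative variance.
pca = PCA(n_components=150)
X_reduced = pca.fit_transform(X)
print("variance retained:", pca.explained_variance_ratio_.sum())

# Recursive feature elimination: rank features by a simple estimator's
# coefficients and keep a subset of them (20 is an arbitrary example).
rfe = RFE(estimator=LinearRegression(), n_features_to_select=20)
rfe.fit(X_reduced, y)
selected = np.where(rfe.support_)[0]  # indices of the retained components
```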
Now that we have performed so many crucial steps in the machine learning pipeline, we come to the main chunk, which is the actual implementation of the machine learning and deep learning models. A regression-based approach is followed to predict the price in the future and, as a consequence of that, to predict the movement of the stock price as well. Initially I tried a couple of classification algorithms, such as random forests, support vector machines and logistic regression, but I realized that these models often did not give very high accuracy and frequently resulted in overfitting. Thus I stuck with a regression-based approach using long short-term memory. LSTM is a very popular type of algorithm, mainly because it is very successful on time series data; it can handle long-term dependencies very well. LSTM is actually a special type of RNN. In an LSTM, small alterations can be made to the information with the help of operations such as multiplication and addition, and this information flows through various cell states. As a result, an LSTM can selectively remember or forget information based on its importance. The sigmoid neural network layer and the pointwise multiplication operation help decide which information is let through.

So I hope you now have a high-level understanding of what LSTM really is; let us dive a little deeper and discuss the architecture of the LSTM model. Firstly, it includes an activation function, which was chosen as a linear activation function. The loss function is the mean squared error, and the optimizer is the Adam optimizer with a 0.01 learning rate. These parameters were essentially arrived at after carrying out a lot of experimentation. In an LSTM there is also a very important parameter called the sequence length, which determines the number of days in the past to consider in order to predict the value in the future. Various experiments were conducted with different input features to the model, and they were evaluated on the basis of their root mean squared error, or RMSE, scores. These parameters were then optimized with the help of Keras Tuner's random search, and I will discuss the results in a later part of the presentation.

Then, as in any task that we perform, testing holds a lot of value. For time series data, back testing is the equivalent of cross validation for other kinds of data. Back testing is essentially an attempt to bootstrap the data in a way that lets us estimate the expected test error; we cannot simply use cross validation because the data has a sequence and this sequence holds importance. The notion of back testing essentially refers to the process of assessing the accuracy of the forecasting method on historic data.

After performing back testing, let us now discuss some of the results and analyze them to better understand how this was beneficial and what we can do to improve. As you can see on my screen, I have plotted the actual versus predicted values of the closing price after applying the model. If you look closely, the predicted value of the stock actually differs from the actual value; however, the movement of the price is the same, and I think that is also a great achievement in the sense that we were at least able to predict the movement to a great extent with the help of LSTM. Next, different combinations of input variables were experimented with and evaluated. The graph on my screen shows the RMSE scores using different input variables. We can see that if we only use technical indicators as input to the models, the error is quite high; if we combine them with the raw features, the error naturally decreases. Another very important point is that the features selected after performing feature engineering contributed the lowest error, which shows that feature engineering is a very important aspect of machine learning and deep learning models and that performing it yields better results. We also have the results before and after optimization: once we perform optimization, there is a reduction in the error and our model tends to perform better, so optimization should also be performed in order to achieve better results.
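To make the model and the evaluation described above a little more concrete, here is a minimal sketch of an LSTM regressor with tf.keras; the layer size, epochs and data shapes are assumptions on my part rather than the exact configuration used in the experiments, and X_train, y_train, X_test, y_test are assumed to be prepared already with shape (samples, sequence_length, features).

```python
# Minimal sketch: LSTM regressor with MSE loss, Adam (lr=0.01) and RMSE evaluation.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2])),
    tf.keras.layers.Dense(1, activation="linear"),  # linear output for regression
])

# Mean squared error loss and the Adam optimizer with a 0.01 learning rate.
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

# Evaluate with the root mean squared error (RMSE) on held-out data.
preds = model.predict(X_test).ravel()
rmse = np.sqrt(np.mean((preds - y_test) ** 2))
print("RMSE:", rmse)
```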
This brings me to the next stage of my presentation, which is future recommendations. The application of machine learning and deep learning in the financial industry can be expanded to many more areas, such as sanctions screening, fraud detection, anti-money-laundering checks and KYC processes. Other than this, topological data analysis could also be made use of. It is a relatively new approach and has been in the limelight in recent months because of its success in medical imaging, so I think it will be very interesting to see how it performs on stock market data.

And this brings me to the end of my presentation. Over the course of the presentation, the relevance and importance of the financial data forecaster has been noted, and we have gone through the entire pipeline: the data was collected through the Yahoo Finance API and Reuters.com; the data was pre-processed and checked for missing values; we analyzed it by plotting a number of graphs, correlation tables and other charts to understand it better; a number of features were created in addition to the raw features we had collected, such as technical indicators, sentiment scores and embeddings; we performed feature engineering with the help of principal component analysis and recursive feature elimination; we then applied a long short-term memory model to the data and calculated the RMSE scores; and finally we tested our algorithm and optimized it. We discovered that feature engineering is extremely important, we noted that hyperparameter optimization is very important, and we also understood that stock market data depends not only on previous prices but on a lot of other environmental factors that need to be taken into account, such as political agendas, terrorist activities, economic fluctuations, recessions, social media activity and much more. Here is also a list of all the references and research papers I read in order to gain more knowledge of the stock market and of how financial data forecasting can really be performed. At the end, I would like to say thank you so much for your time and for attending this talk. I hope this presentation has been informative for those who want to delve into machine learning. I had a great time presenting at PyCon, and I am so glad to have had the opportunity to do so. If there are any questions I would be happy to take them; otherwise, you can reach out to me on any of these channels.

Thank you, Anshika, for the wonderful talk. It was very interesting and very informative, and I understand that it is really challenging to solve real-world problems. It was a really great talk. Thank you, thank you so much. Let me check if there are any questions. Yes, there are some questions. The first question is: can you share more about the test setup and results, and how do you determine accuracy? Absolutely, sure, that is a very good question. First things first: in terms of testing, I very briefly covered the concept of back testing. Back testing is the equivalent of the cross validation that we usually perform in machine learning, and the reason we use it is that our data is sequential and the order holds importance. Back testing essentially helps us assess the accuracy of the forecasting method using the existing historical data, and this process is typically iterative and repeated over multiple dates present in the historic data. I am not sure I have enough time to go into the details of it, but you can definitely look up what back testing is and how it works.
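For anyone curious, here is a minimal sketch of the idea behind such an expanding-window back test; the fit_and_forecast helper and the initial window size are hypothetical stand-ins for the actual model training and one-step-ahead prediction.

```python
# Minimal sketch: expanding-window back test over an ordered price series.
import numpy as np

def backtest(series, fit_and_forecast, initial_window=500):
    """series: 1-D array of closing prices ordered by date.
    fit_and_forecast: hypothetical helper that trains on the given history
    and returns a one-step-ahead prediction."""
    squared_errors = []
    # Walk forward one day at a time: train only on the past, predict the next day.
    for t in range(initial_window, len(series)):
        history = series[:t]
        prediction = fit_and_forecast(history)
        squared_errors.append((prediction - series[t]) ** 2)
    return np.sqrt(np.mean(squared_errors))  # RMSE over all forecast dates
```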
Speaking about the results, I made use of RMSE scores to assess accuracy, because RMSE scores end up telling us more than a plain accuracy score, so I think that is a very good metric for seeing how the algorithm is performing. Even with the help of the graphs we can see the usual trends and patterns, how our algorithm is really performing, and whether it is simply overfitting on the data we have or actually taking into consideration all the inputs that we are using in the machine learning and deep learning models. Thank you.

So the next question is: any resources for newbies to get their hands dirty with machine learning? Yes, that is a very good question and I am sure it will help a lot of the people attending the talk today. I also started picking up machine learning only recently; it has just been a year. What I did is take a lot of courses on the internet on machine learning. I think Andrew Ng's course is excellent, very easy to follow, and it clarifies your basic concepts. Some books on machine learning are also very helpful. These are the two things I did, and after gaining some basic knowledge I also ended up doing a lot of projects in machine learning, so that my basics were clear and I knew how to implement what I had learned from the course and from reading the books. So I think first get good knowledge, then implement that knowledge with the help of some simple projects, and then the world is your oyster: do some great projects that solve real-life problems. Thank you.

The next question is: can you share some of the packages or modules that would be helpful for the analysis? Yes. Firstly, if you want to perform text analysis, the NLTK library is extremely popular, so that is something you should definitely look at. Then, for the sentiment analysis, you can use the SentimentIntensityAnalyzer from VADER; for TextBlob you can simply import textblob; and of course pandas and NumPy for analyzing your data and plotting charts always come in handy.

And there is a last question: do you do algorithmic trading, and is there any particular tool or technology being harnessed? I do not do algorithmic trading yet, but I hope to get my hands dirty with it in the future. If you would like to discuss it more, I would be happy to learn from you or just have a healthy conversation about it, so feel free to reach out to me.