Predict The Stock Market With Machine Learning And Python
Published: May 22, 2022
Duration: 00:35:55
Category: Education
Introduction

Hi, my name is Vic, and today we're going to be predicting the stock market using machine learning. We'll start out by downloading data on the S&P 500 index. Then we'll clean the data up and use it to train a model. Then we'll do backtesting to actually figure out how good our model is, and we'll add in some more predictors to improve our accuracy. We'll end with some next steps that you can use to continue improving the model on your own. It's going to be a really fun and exciting project. Before I started Dataquest, I actually spent a lot of time predicting the stock market, both winning machine learning competitions and developing and selling algorithms. There are a lot of real-world considerations when you're predicting the stock market that not all tutorials will show you, so I'm going to show you those today so you can build a higher-quality project. Let's dive in.

By the end of this project, we will have created a machine learning model that can predict whether the S&P 500 index price will go up or down tomorrow, given historical data. We'll also have backtested this model on 20-plus years of historical data, so we can really be confident in the predictions that it's making. Okay, let's go ahead and get started. We're going to be using JupyterLab for this project. You can also use Jupyter Notebook if you have it installed.

Downloading S&P 500 price data

The first thing we'll do is import something called the yfinance package. This package calls the Yahoo Finance API to download daily stock and index prices. Then we'll initialize something called a Ticker class, which will enable us to download price history for a single symbol. In this case we will use the ^GSPC symbol, which is the S&P 500 index. So we'll go ahead and run that, and then the next thing we'll do is query the historical prices. We'll use the history method and pass in period equals max, which will query all data from the very beginning, when the index was
created. So let's run that, and we actually end up with a pandas DataFrame, which is very nice. In this DataFrame, each row is the price on a single trading day, so non-trading days are not included in this data. The columns are the opening price, so the price when the market opened; the highest price during the day; the lowest price during the day; the closing price when the exchange closed; and the volume, so the total volume that was traded that day. We're essentially going to use these columns to predict if the stock price will go up or down tomorrow. We also have these additional columns, dividends and stock splits, but we're not going to use these and we'll actually remove them later. All right, so we'll take a look at the index of the S&P 500 DataFrame, and we can see we have a DatetimeIndex. The index is this column on the left, if you're not familiar with it, and that column will enable us later on to index and slice the DataFrame easily.

Cleaning and visualizing our stock market data

All right, so the first thing we'll do is plot the data in the DataFrame. We'll plot the closing price against the index. What this is doing is saying: show the index, which is really the trading days, the dates, on the x-axis, and show the closing price on the y-axis. So we can run that and we get a nice chart of the S&P 500 price history, and we really can regret not buying the index fund at any point in the last few years. All right, we'll do a slight bit of data cleaning here. We'll just remove those extra columns that we don't need, so we'll go ahead and remove the dividends column and we'll also remove the stock splits column. These columns are more appropriate for individual stocks, not an index, so we don't actually need them.

Setting up our target for machine learning

The next thing we're going to do is set up our target. This is what we're actually going to be predicting using machine learning, so this
target is going to be: will the price go up or down tomorrow? Some people like to predict the absolute price, so trying to predict if the stock price will be 17 or 18 tomorrow. The big problem with that is your model can be extremely accurate, you can be very good at predicting the absolute price, but you can still lose a ton of money. Ultimately, if you're buying and selling stocks, you don't care about being accurate on the absolute price. You care more about being accurate on the directionality, whether the price will go up or down. So, you know, can I buy the stock, and then will it go up? You can be really close to predicting the actual price and in fact be very far off on predicting if the stock will go up or down. So what we're going to try to do is say: on days that the stock goes up, can we actually predict that it will go up? That way, if we want to buy the stock, we know we can buy it and the price will go up. So our target is going to be: will the stock go up or down? First, we're going to create a column called tomorrow, and basically this column is going to be tomorrow's price. We'll use the pandas shift method to help us do this. So let me run this and then show you what actually happened. We took the close column and then we shifted all the prices back one day. You can see for January 3rd, 1950, the tomorrow column is now the closing price on January 4th. So we now have a column that shows tomorrow's price. Then, based on tomorrow's price, we can set up a target. The target is what we're going to try to predict with machine learning, and really all we need for the target is: is tomorrow's price greater than today's price? This will return a boolean indicating if tomorrow's price is greater than today's price, but we want to convert this to an integer so we can use it in machine learning. So we're going to use the astype method and pass in int. Now we're going to show the S&P 500 DataFrame, and we can see we now have a target column that is a 1 when the price went up, so when tomorrow's price is greater than today's price, and a 0 when the price went down. This is what we're going to try to predict. The next thing: there's a lot of historical data in this DataFrame, and usually a lot of historical data is great, but with stock market data, if you go back too far, the market could have shifted fundamentally and some of that old data may not be as useful in making future predictions. So what we're going to do is remove all data that came before 1990. We're going to use the pandas loc method and basically say: only take the rows where the index is at least January 1st, 1990. We can take a look and see what happened, and you can see there are only dates after January 1st, 1990. Now, I wrote .copy() here because if you don't, you can sometimes get a pandas SettingWithCopyWarning when you subset a DataFrame and then later assign back to it. So the .copy() helps us avoid that.

Training an initial machine learning model

Okay, so we've now set up our data and we can actually start to train our first machine learning model. For our initial model, I'm going to use something called a random forest classifier. I love to use random forest as my default model for most machine learning for a few reasons. Random forests work by training a bunch of individual decision trees with randomized parameters and then averaging the results from those decision trees. Because of this process, random forests are resistant to overfitting. They can overfit, but it's harder for them to overfit than it is for other models. They also run relatively quickly, and they can pick up non-linear tendencies in the data. So, for example, the open price is not linearly correlated with the target. For example, if the
open price is 4,000 versus 3,000, there's no linear relationship between the open price and the target. If the open price is higher, that doesn't mean the target will also be higher. So random forests can pick up non-linear relationships, and in stock price prediction most of the relationships are non-linear. If you can find a linear relationship, then you can make a lot of money. So we're going to initialize our model, and we're going to pass in a few parameters. n_estimators is the number of individual decision trees we want to train. The higher this is, generally the better your accuracy is, up to a limit. You can't just get free accuracy by making this higher and higher. I'm going to set it pretty low just so this runs quickly for us, but you might want to try a higher value. min_samples_split helps us protect against overfitting. Decision trees have a tendency to overfit if they build the tree too deeply. If you don't know much about decision trees, don't worry about it, but setting min_samples_split helps us protect against that overfitting. The higher we set it, the less accurate the model will be, but the less it will overfit. So you may want to experiment with this and find the optimal number. And then I'm going to set random_state equal to 1.
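The model setup described above can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the narration only says n_estimators is set "pretty low" and does not state the first min_samples_split value, so the two values of 100 here are assumptions, and the data is synthetic rather than S&P 500 prices. Only random_state=1 comes from the narration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration, not real index prices).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (rng.random(500) > 0.5).astype(int)


def make_model():
    # n_estimators=100 and min_samples_split=100 are assumed values;
    # random_state=1 is the value given in the narration.
    return RandomForestClassifier(
        n_estimators=100, min_samples_split=100, random_state=1
    )


# Train two separate models on the first 400 rows, predict the last 100.
first = make_model().fit(X[:400], y[:400]).predict(X[400:])
second = make_model().fit(X[:400], y[:400]).predict(X[400:])

# With a fixed random_state, both runs should produce identical predictions.
same_results = bool((first == second).all())
print(same_results)
```

Fixing the seed this way means any change in the score can be attributed to a change in the data or parameters rather than to randomness.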
So a random forest, as you may have guessed, has some randomization built in. Setting a random state means that if we run the same model twice, the random numbers that are generated will be in a predictable sequence each time, using this random seed of 1. So if we rerun the model, we'll get the same results, which helps if you're updating or improving your model and you want to make sure it's actually the model, or something you did, that improved error, versus just something random. Okay, now we're going to split our data up into a train and test set. Now, this is time series data, and with time series data you can't use cross-validation. Or you can, but if you do, your results will look amazing when you're training and horrible in the real world. The reason why is if you use cross-validation, or another way to split up your training and test set that doesn't take the time series nature of the data into account, you will be using future data to predict the past, which you just can't do in the real world. That results in something called leakage, where you're leaking information into the model. So if I asked you to predict the stock price tomorrow and I gave you what the stock price is going to be in 30 days, you would probably do better at predicting the stock price tomorrow than if I didn't tell you anything about the future. We want the model to actually learn how to predict the stock price, not just happen to have some knowledge about the future that we're not going to have in the real world. All right, so the way we're going to split this data set up is we're going to put all of the rows except the last hundred rows into the training set, and we're going to put the last hundred rows into the test set. I'll show you a more sophisticated way to split this up and measure error later, but for now we're just creating a simple baseline model, and this is the easiest way to do the split. And then predictors: I'm going to create a list with all of the
columns that we're going to use to predict the target. Now, I like to be really explicit about predictors because I've been burned before by just using all of the columns as predictors and then creating a model that looks amazing when I'm training it, like 100 percent accuracy, but then doesn't work in the real world. What's really easy to do is accidentally use the tomorrow column, or even the target column, to predict the target, and then your model actually knows the future, which isn't going to happen in the real world. All right, so we're going to use close, volume, open, high, and low. Those are going to be our predictors. Then we're going to go ahead and fit the model, so model.fit with the train predictors, so this is using these predictor columns, and then the target we're trying to predict. This is going to train the model, using the predictor columns in order to predict the target. So let's run that, and that'll take a little bit to run. Once it's finished, our next step is to measure how accurate the model is. This is a really important piece of machine learning. You need to measure if your model is doing what you think it is or not. So, again from scikit-learn, we're going to import something called precision_score. All the precision score is saying is: when we said that the market would go up, so when we predicted a 1, did it actually go up? What percentage of the time when we said the market would go up did it actually go up? This is a really good error metric, or accuracy metric, for this particular case, because I'm going to assume that we want to buy stock, and when we buy stock we want to hold that stock and then sell it, and we want to make sure that when we buy stock the stock price is actually going to increase. So depending on what you want and what your goals are, you may want to adjust what error metric you're
using to measure performance, but in this case we're going to use precision score. We're going to generate predictions using our model with the predict method, and we'll pass in our test set with the predictors. These predictions are in a NumPy array, which is a little bit hard to work with, so we're going to turn this into a pandas Series, and we're going to use the same index as our test data set. I have to import pandas, so we're going to import pandas and then create this Series. We can see predictions is now a Series and it's a little bit easier to read. Then we'll go ahead and calculate the precision score using the actual target and the predicted target. We can see this is not a very good precision score. When we said the stock price would go up, it only went up 42 percent of the time. That's not great. We'd be better off trading against this model, doing the opposite of what it tells us to do. But that's okay. We're going to make this model better, and we will be able to get more accurate predictions. Okay, so the next thing we'll do is quickly plot our predictions. In order to do that, we'll combine our actual values with our predicted values, and we'll use the pandas concat function to do that. So we're concatenating our test target, which is our actual values, and our predicted values, and we're going to pass in axis equals 1, which means treat each of these inputs as a column in our data set. Now we can plot this. What this shows us is the orange line, labeled 0, is our predictions, and the blue line is what actually happened. We can see we mostly predicted that the market would go up, and mostly it seems to have gone down, so that explains why our predictions were so far off.

Building a backtesting system

All right, the next thing we're going to do is build a more robust way to test our algorithm. So
currently we're only able to test against the last hundred days, but if you're really building a stock price model and you want to use it in the real world, you want to be able to test across multiple years of data, because you want to know how your algorithm is going to handle a lot of different situations. That gives you more confidence that it'll work in the future. So what we're going to do is something called backtesting. In order to enable backtesting, the first thing we'll do is create a prediction function, and this will basically wrap up everything we just did into one function. It's fitting the model using the training predictors and the target. It's generating our predictions, which is just model.predict with the test predictors. Then it's turning our predictions into a Series, which I'll just copy and paste. The only difference here is I gave the Series a name, predictions. And then finally it's combining everything together, same thing we did before, and at the end we'll return our combined DataFrame with the actual values and the predictions. Now we can write a backtest function, which takes in our S&P 500 data, a machine learning model, and our predictors. It also takes in a start value, which we'll set to 2,500, and a step value. So what is the start value? When you backtest, you want to have a certain amount of data to train your first model. Every trading year has about 250 days, so this is saying take 10 years of data and train your first model with it. The step is 250, which means that we will be training a model for about a year, then going to the next year, and then the next year. So what we're going to do is take the first 10 years of data and predict values for the 11th year. Then we'll take the first 11 years of data and predict values for the 12th year. Then we'll take the first 12 years of data and predict values for the 13th year, and so on. And this way we'll
actually get predictions for a lot of different years and be able to have more confidence in our model. All right, so in this backtest function we're going to create a list called all predictions, and that will be a list of DataFrames, where each DataFrame is the predictions for a single year. Then we're going to loop across our data year by year and make predictions for all of the years except the first 10 or so. We'll split up our training and our test data. I'm going to use .copy() to avoid that SettingWithCopyWarning. This code is doing exactly what I mentioned: it's creating the training set and the test set. The training set is all of the years prior to the current year, and the test set is the current year. Then we'll use our predict function to generate our predictions, passing in train, test, predictors, and model. Then we're going to append our predictions for the given year to all predictions, and at the end we're going to concatenate all our predictions together. Concatenate can take a list of DataFrames and combine them all into a single DataFrame. So let's go ahead and run these, and then we can backtest our S&P 500 data with the model we created earlier and the predictors we created earlier. After we finish the backtest, we can start evaluating the error of our predictions. First, let's take a look at predictions and see how many days we predicted the market would go up versus down. value_counts will just count up how many times each type of prediction was made. We can see we predicted that the market would go down on about 3,000 days and up on about 2,000 days. Now we can look at our precision score. We take the target and the predictions, and this will give us our precision score. Okay, so across all of these rows, about 6,000 or so trading days, we were about 53 percent precise, so when
we said the market would go up, it went up 53 percent of the time. Now, is that good or not? As a benchmark, we can look at the percentage of days where the market actually went up. To do that, we can look at the value counts of the target divided by the total number of rows, and this will give us percentages. The S&P 500, in the days we were looking at, actually went up on 53.6 percent of days and went down on 46.3 percent of days. So if all we had done was just wake up every day and say, I'm going to buy, and sell at the end of the day, we would actually have been better off than using this algorithm. This algorithm performed a little bit worse than the natural percentage of days that the stock market went up. But that's okay. Now that we have backtesting, we actually have a lot of confidence in our model and our ability to test it.

Adding additional predictors to our model

So the next thing we'll do is add some more predictors to our model and see if that improves our accuracy. All right, so what we're going to do is create a variety of rolling averages. If you're a human analyst trying to predict if a stock will go up tomorrow, some of the numbers you might look at are: is the stock price today higher than it was last week, higher than it was three months ago, a year ago, five years ago? You might use all of those inputs to help you determine if the stock will go up or down, and we're going to give the algorithm that information. So these horizons are the horizons over which we want to look at rolling means. We'll calculate the mean close price in the last two days, the last trading week, which is five days, the last three months or so, which is 60 trading days, the last year, and the last four years. Then we'll find the ratio between today's closing price and the average closing price in those periods, which will help us know: has the market gone up a ton? Because if so, it may be due for a downturn. Has the market gone down a ton? If so, it may be
due for an upswing. So we're just going to give the algorithm some more information to help it make better predictions. Then we're going to create a list called new predictors, which will hold the new columns that we're going to create. Okay, so we're going to loop through these horizons, and we're going to calculate a rolling average against each horizon, taking the mean. Then we can create a couple of columns. One will be called ratio column, and we'll name that column close ratio plus the horizon, so close ratio 2, close ratio 5, etc., and then we'll add it to our S&P 500 DataFrame. All this is going to be is the close price in the S&P 500 divided by our rolling average. So the first time through the loop, this is going to be the ratio between today's close and the average close in the last two days. The second time through the loop, it'll be the ratio between today's close and the average close in the last five days, and so on. We can also look at a trend, and a trend is just going to be the number of days in the past x days, whatever the horizon is, that the stock price actually went up. What we can do here is define a trend column. We'll use shift again, but we'll shift forward this time, and then we'll find the rolling sum of the target. So what is this doing? Let's scroll up and find the S&P 500 DataFrame. On any given day, it is going to look at the past few days and find the sum of the target. So if we're on January 8th, 1990, it's going to look at the last five days and find the sum of the target. There are only four days available, so in reality we wouldn't be able to compute a rolling sum, but let's assume there's a fifth day here. It would basically take the sum, so the number of days that the stock price actually went up. Okay, and then we're going to add these to new predictors: ratio
column, trend column. So let's go ahead and run that, and we should now have some extra columns in our S&P 500 data set. You can see there are a lot of NaNs, so what's the deal with that? When pandas cannot find enough days, or enough rows, prior to the current row to compute a rolling average, it'll just return NaN. So this is close ratio 2, which is based on the rolling average of two days: the current day and the day before it. On January 2nd, 1990, there are no days before this, so it can't compute a rolling average, and it returns NaN. On January 3rd, 1990, it can: it takes the average of this day and the previous day. Same thing with all of these columns. It's a little bit different for trend, because you can't include the current day. Here it's looking for two previous days. It doesn't include the current day, because if it did, you'd be including today's target in that column, which will give you leakage and make your algorithm look amazing, but it's not going to work in the real world. So we're going to get rid of the rows with missing values using dropna. All right, so we now see our data starts in 1993. That's because of these trend 1000 and close ratio 1000 columns, so we needed about four years of data to compute those. All right, so let's see how these performed.

Improving our model

So let's update our model slightly and change some of our parameters. We'll increase our number of estimators to 200, reduce our min_samples_split to 50, and keep our random state. Let's go ahead and run that, and then we'll rewrite our predict function slightly. So let me go up and copy and paste this. Here, when you just run .predict, basically the model returns 0 or 1.
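The rolling ratio and trend columns described earlier in this section can be sketched as follows. This is an illustration under stated assumptions: the prices are synthetic, the horizon list is shortened to 2, 5, and 60 so the example stays small, and the column names Close_Ratio_N and Trend_N are an assumed spelling of the spoken "close ratio 2" and "trend 1000".

```python
import numpy as np
import pandas as pd

# Synthetic closing prices on business days (illustrative only, not real index data).
rng = np.random.default_rng(1)
dates = pd.bdate_range("1990-01-02", periods=300)
close = 100 + np.cumsum(rng.normal(0, 1, size=300))
sp500 = pd.DataFrame({"Close": close}, index=dates)

# Target: 1 if tomorrow's close is higher than today's, else 0.
sp500["Tomorrow"] = sp500["Close"].shift(-1)
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)

horizons = [2, 5, 60]  # shortened; the narration also uses one-year and four-year horizons
new_predictors = []

for horizon in horizons:
    # Ratio of today's close to the rolling mean close (window includes today).
    rolling_mean = sp500["Close"].rolling(horizon).mean()
    ratio_column = f"Close_Ratio_{horizon}"  # assumed column name
    sp500[ratio_column] = sp500["Close"] / rolling_mean

    # Number of up days in the previous `horizon` days. shift(1) excludes
    # today's target, which would otherwise leak the answer into the predictor.
    trend_column = f"Trend_{horizon}"  # assumed column name
    sp500[trend_column] = sp500["Target"].shift(1).rolling(horizon).sum()

    new_predictors += [ratio_column, trend_column]

# Rows without enough history (and the final row, which has no tomorrow) are dropped.
features = sp500.dropna()
print(features[new_predictors].head())
```

The longest horizon decides how many leading rows are lost, which is why the real data set starts in 1993 once a 1,000-day horizon is included.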
What we actually want is a little bit more control over how we define what becomes a 1 and what becomes a 0. So we're going to use the predict_proba method, and what this will return is a probability that the row will be a 0 or a 1, so the probability that the stock price will go down tomorrow and the probability that the stock price will go up tomorrow. What we can do is just get the second column of this, which will be the probability the stock price goes up, and then set our custom threshold. By default the threshold is 0.5, so if there's greater than a 50 percent chance that the price will go up, the model will return that the price will go up. But we're going to set that threshold to 60 percent. This means that the model has to be more confident the price will go up in order to actually predict that the price will go up. What this will do is reduce our total number of trading days, so it'll reduce the number of days that it predicts the price will go up, but it will increase the chance that the price will actually go up on those days. That fits really well with what we want. We don't want to make a ton of trades. We want to know that when we make a trade, the price will actually go up, but we don't want to trade every single day. That's a way to lose money pretty quickly. The rest of this function should be the same. So let's run that, and then let's go ahead and run our backtest again, and this time we'll pass in our new predictors. You may notice that we're no longer using the close, open, high, low, and volume columns, and the reason for that is those are just absolute numbers. It isn't super informative to the model if the price today is 465 dollars. It doesn't tell me anything about whether the price will go up or down tomorrow. The ratios are actually the most informative part: what is the price today compared to the price yesterday, compared to the price last week? So that's why we actually took those
columns out. All right, so once the backtest is done, we can take a look at the value counts again for the predictions. You'll remember last time there were about 3,000 days where it predicted the price would go down and about 2,000 days where it predicted the price would go up. The distribution is very different now. You can see that there are only a few days where we've predicted the price would go up, and that's because we changed this threshold. We asked the model to be more confident in its predictions before it actually predicted that the price would go up. What this means is that we're going to be buying stock on fewer days, but hopefully, and we're about to find out, we will be more accurate on those days. So we'll check the precision score, looking at our target and our predictions. Let's run that, and we can see that when we buy a stock, so when the model predicts that the price will go up, 57 percent of the time it will actually go up. This may not seem great. 57 percent is a failing grade in most places, but it's actually pretty good, especially given that we're just looking at time series data and historical prices of the index. This would actually make you money if you had traded off it from 1993 to the present. Now, would I recommend using this model to go make trades? No. There are things you can add to it to make it more accurate, though, that I'll talk about. But this is actually a pretty good result given the data that we had to work with, and it's better than our baseline. The stock went up on about 53 percent of the days, but on days our model says to buy, the price actually goes up 57 percent of the time. So the model actually has some predictive value.

Summary and next steps with the model

Okay, so let me summarize and then talk about things you could do to extend this model. We did a lot. We downloaded some stock data for the S&P 500 index. We cleaned and visualized the data. We set up our machine learning target. We trained our initial model. We then evaluated error and created a way to backtest and really accurately measure that error over long periods of time. Then we improved our model with some extra predictor columns. If you want to continue extending this model, here are some things I would recommend thinking about. There are exchanges that are open overnight. The S&P 500 only trades during U.S. market hours, but there are other indices around the world that open before the U.S. markets open, so it might be worth looking at those prices and seeing if you can correlate them. If an index on the other side of the world is increasing, does that help predict the S&P 500? You can add in news, so that includes articles that are coming out, and general macroeconomic conditions like interest rates, inflation, etc. You can also think about adding in some key components of the S&P 500, like key stocks and key sectors. It's possible that, for example, if tech is in a downturn, six months later the S&P 500 will go down. Maybe it doesn't go down immediately. So that's another thing you can try. You can also try increasing the resolution. We're looking at daily data here, but you could try looking at hourly data, minute-by-minute data, even tick data if you can get it. It's not always the easiest or cheapest to get, but if you can get that data, you can make more accurate predictions. So those are just some ideas on where you can take this. As I know from personal experience, you can build quite a bit on this model and get pretty far if you want to. All right, so I hope you enjoyed this overview of how to build a machine learning model to predict the S&P 500.
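For reference, the backtesting loop and the 60 percent probability threshold described in this walkthrough can be sketched together as below. This is a hedged reconstruction from the narration, not the notebook's exact code: the data is synthetic, the predictor names f1 and f2 are hypothetical, and the start and step values are shrunk from 2,500 and 250 so the example runs quickly. The model parameters (200 estimators, min_samples_split of 50, random_state of 1) are the ones stated in the narration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score


def predict(train, test, predictors, model, threshold=0.6):
    """Fit on train, then flag test rows whose up-probability meets the threshold."""
    model.fit(train[predictors], train["Target"])
    # Column 1 of predict_proba is the probability of class 1 (price goes up).
    proba_up = model.predict_proba(test[predictors])[:, 1]
    preds = pd.Series(
        (proba_up >= threshold).astype(int), index=test.index, name="Predictions"
    )
    return pd.concat([test["Target"], preds], axis=1)


def backtest(data, model, predictors, start=2500, step=250):
    """Train on all rows before position i, predict the next `step` rows, repeat."""
    all_predictions = []
    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()
        test = data.iloc[i:i + step].copy()
        all_predictions.append(predict(train, test, predictors, model))
    return pd.concat(all_predictions)


# Synthetic data with hypothetical predictors f1 and f2 (not real market data).
rng = np.random.default_rng(2)
n = 600
data = pd.DataFrame(
    {"f1": rng.normal(size=n), "f2": rng.normal(size=n)},
    index=pd.bdate_range("2000-01-03", periods=n),
)
data["Target"] = (data["f1"] + rng.normal(scale=1.0, size=n) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

# Smaller start and step than in the narration, chosen only to fit the toy data.
predictions = backtest(data, model, ["f1", "f2"], start=300, step=100)

# zero_division=0 avoids a warning if the threshold leaves no positive predictions.
precision = precision_score(
    predictions["Target"], predictions["Predictions"], zero_division=0
)
print(predictions["Predictions"].value_counts())
print(precision)
```

Because every test window comes strictly after its training window, no future rows are used to predict the past, which is the leakage concern raised earlier in the walkthrough.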