Using Stock Data for Classification Problem: Action

Albert Um
5 min readMay 9, 2021

--

Photo by M. B. M. on Unsplash

This blog will demonstrate a simple way to frame financial stock data into a sequence classification problem. The business case is to, given historical stock data, create a model that will predict whether a trade(action) will be ‘Positive’ or ‘Negative.’

I can turn the business case into an ML problem by slicing the historical data points into sequences of input days and output days. Each input matrix(input days, features) has a categorical dependent variable dependent on the output days.

I will assume all transactions to be ‘long trades’, and trades will be considered ‘Positive’ if the price increases by +6% from entry. A trade will be considered ‘Negative’ if the price decreases by 3% from entry. In addition, each trade should only look to the past 40 trading days(~8 weeks) with an exit time of 40 days after entry.

Setting an exit time frame will result in some instances when the stock price never meets the stop loss or target within the set output days. When the price has not met the stop loss or target, I will classify them as ‘Neutral.’ (The stop loss, target, input days, and output days are subjective, and establishing this rule, in the beginning, will ease problem-solving later.)

Gather Data

For demonstration purposes, I will use the AlphaVantage to obtain historical stock price data. The API keys are freely distributed, and the response, by default, returns a JSON format data; ex. IBM_demo (https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=demo)

The full documentation of the API can be found here (https://www.alphavantage.co/documentation/) with an opportunity to quickly obtain a free API key.

API Parameters

  • function: For this blog, I will request daily stock data (TIME_SERIES_DAILY’), however intraday, weekly, and monthly data is available. Please resort back to documentation for the correct parameter.
  • symbol: The symbol parameter refers to the stock ticker and I will set it to ‘SPY’ (S&P Depository Receipt, spider).
  • apikey: API key which can be claimed here (https://www.alphavantage.co/support/#api-key).
  • outputsize: By default, the output size is set to compact which returns the latest 100 data points. I will change this parameter to ‘full’ to return all historical data points.
  • datatype: By default, the response returns a JSON string. However, for this demonstration, I will set the datatype parameter to ‘csv’ for simplicity. It’s important to note, requesting a JSON object is substantially faster than requesting a downloadable CSV file.
import numpy as np
import pandas as pd
API_KEY = '' # input API KEY here
df = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=SPY&apikey={}&outputsize=full&datatype=csv'.format(API_KEY))
df.head()

The full dataframe has 5413 observations, time ranging from ‘1999–11–01’ to ‘2021–05–06’(present). I will slice the dataframe so the data points will range from ‘2014–01–01’ to ‘2021–05–06’.

Visualize Data

After slicing the data frame, it’s essential to plot the data to understand and reinforce the business question.

import plotly.graph_objects as go
from plotly.offline import iplot
from plotly.subplots import make_subplots
# slice dataframe
_df = df.loc[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] <= '2021-05-06')].sort_values(by = 'timestamp').reset_index(drop=True)
# candlestick plot
fig = make_subplots(rows = 2, cols = 1, shared_xaxes = True, subplot_titles = ('SPY', 'Volume'), vertical_spacing = 0.1, row_width = [0.2, 0.7])
fig.add_trace(go.Candlestick(x = _df['timestamp'],
open = _df['open'],
high = _df['high'],
low = _df['low'],
close = _df['close'], showlegend=False),
row = 1, col = 1)
fig.add_trace(go.Bar(x = _df['timestamp'], y = _df['volume'], showlegend=False), row = 2, col = 1)fig.update(layout_xaxis_rangeslider_visible=False)
fig.show();
Zooming from 2020 to present

Sequence Classification

The objective is to slice the initial data frame into two tensors, X and y, where the input of each sequence X should be 40 observations of past data points that incrementally increase by one trading day.

The y tensor should be the same number of sequences of which each y value contains either ‘Positive,’ ‘Neutral,’ or ‘Negative.’ A ‘Positive’ classification is when the price has increased by 6% before expiration(40 trading days out) or before the price hits the stop loss (decrease by 3%). On the contrary, a ‘Negative’ classification is when the price had decreased by 3% of entry(stop loss) before stock achieved the price target (increase by 6%) or before expiration.

There are 1849 data points(trading days) from ‘2014–01–01’ to ‘2021–05–06’ (present). With an input step and output steps of 40 days, the X tensor shape has 1770 sequences with 1770 y values. (1849–40 input days — 40 output days + 1 inclusive day on the last sequence). The first sequence for X goes from day 1–40, with the first y value using values from day 41–80(1849–40 = 1809). The last sequences for X go from day 1770–1809, with the last y value using values from day 1810–1849. Therefore, the X sequence should have 1770 sequences.

To put it simply, I have sliced the original data frame so that I buy every day from day 40–1809.

Negative: 750, Neutral: 673, Positive: 347

Despite SPY having an uptrend for the past 30 years, it’s understandable to have more instances of ‘Negative’ values due to the tighter restrictions on the stop loss vs. target(-3% vs. +6%). Also, SPY prices do not drastically change within 40 days of output, and there are many instances when the price has not met the stop loss or target.

Splitting daily stock data into sequences of input steps with each sequence having a dependent variable, action, can frame stock data into a classification problem.

--

--