Building a Financial Machine Learning Pipeline with Alpaca

(You can read the original version of this article here.)

(All the code snippets in this article can be found in the GitHub repo for this post.)

Overview

In this two-part article, I will walk through the process of preparing a financial dataset for input into a machine learning model using the Alpaca Data API.

Part 1 will cover pulling minutely bar data from the API, converting it to dollar bars, and then using these bars to build a simple feature matrix.

Part 2 will show how to select a subset of samples using an event-based filter, construct targets for each of these samples, and build bootstrapped training datasets for use in an ensemble model.

Much of the content of this article is inspired by the excellent book Advances in Financial Machine Learning. I recommend that interested readers check it out, both for a more comprehensive treatment of the methods covered here, and also for introductions to a range of more sophisticated techniques.

Pulling the Raw Data
Converting to Dollar Bars
Constructing a Feature Matrix
Recap and Next Steps

Pulling the Raw Data

The first step in any financial machine learning pipeline is to download the raw data which we will use to construct features and targets. In this case, we will be starting with minutely bar data (i.e. open/high/low/close/volume), which can be accessed via Alpaca’s /bars endpoint.

To keep things simple in this walkthrough, we’ll use bar data from the S&P 500 ETF SPY.

Accessing the bars endpoint is quite straightforward. All we need to specify is the size of the bars (minutely in our case), the ticker of the stock, and the start and end timestamps for the range of data. The function below retrieves all the minutely bars for a particular date given the opening and closing timestamps for that date.

import requests 

def get_date_bars(ticker_symbol, open_timestamp, close_timestamp):

    # set the url for pulling bar data
    base_url = 'https://data.alpaca.markets/v1/bars/minute'

    # set the request headers using our api key/secret
    request_headers = {'APCA-API-KEY-ID': '<YOUR_KEY_HERE>', 'APCA-API-SECRET-KEY': '<YOUR_SECRET_HERE>'}

    # set the request params for the next request
    request_params = {'symbols': ticker_symbol, 'limit': 1000, 'start': open_timestamp.isoformat(), 'end': close_timestamp.isoformat()}

    # get the response
    date_bars = requests.get(base_url, params=request_params, headers=request_headers).json()[ticker_symbol]

    # if the last bar is stamped exactly at the closing timestamp, throw it away (since it technically happens after the close)
    # (checking that the list is non-empty first avoids an IndexError when no bars are returned)
    if date_bars and date_bars[-1]['t'] == int(close_timestamp.timestamp()):
        date_bars = date_bars[:-1]

    # return the bars for the date
    return date_bars
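
To make the closing-bar check concrete, here is a minimal offline sketch of the same logic applied to a few made-up bars. The helper name drop_closing_bar and the sample prices are my own illustration and not part of the Alpaca API; the only thing taken from the function above is that 't' holds the bar's timestamp in epoch seconds.

```python
from datetime import datetime, timezone

def drop_closing_bar(bars, close_timestamp):
    """Return bars without a trailing bar stamped exactly at the close."""
    close_epoch = int(close_timestamp.timestamp())
    # guard against an empty list before looking at the last bar
    if bars and bars[-1]['t'] == close_epoch:
        return bars[:-1]
    return bars

# made-up sample: 16:00 New York on 2020-08-31 is 20:00 UTC
close = datetime(2020, 8, 31, 20, 0, tzinfo=timezone.utc)
close_epoch = int(close.timestamp())
sample_bars = [
    {'t': close_epoch - 120, 'c': 349.1},
    {'t': close_epoch - 60, 'c': 349.3},
    {'t': close_epoch, 'c': 349.2},  # stamped at the close, so it gets dropped
]

trimmed = drop_closing_bar(sample_bars, close)
print(len(sample_bars), len(trimmed))
```

A bar stamped exactly at the close is dropped, and an empty list passes through untouched. get_date_bars performs the same comparison inline.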

Next, we need to call get_date_bars for each date in the time period of interest. This is a little more complicated than it seems, because we need to know both which dates the market was open, and also what the opening and closing times were for those dates. We can accomplish both of these tasks with the pandas_market_calendars package. The following function shows how to use this package to get a list of all minutely bars for a particular symbol from the start of 2015 through August 2020.

import pandas_market_calendars as mcal
import time

def get_all_bars(ticker_symbol):

    # get a list of market opens and closes for each trading day from 2015 onwards
    trading_days = mcal.get_calendar('NYSE').schedule(start_date='2015-01-01', end_date='2020-08-31')

    # initialize an empty list of all bars
    all_bars = []

    # for each day in our list of trading days...
    for i in range(len(trading_days)):

        # get the time at the start of the request
        request_time = time.time()

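        # get the list of bars for the next day (.iloc selects by position, since the schedule is indexed by date)
        next_bars = get_date_bars(ticker_symbol, trading_days['market_open'].iloc[i], trading_days['market_close'].iloc[i])

        # print a log statement (skipping any day for which no bars came back)
        if next_bars:
            print(f'Got bars for {next_bars[-1]["t"]}')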

        # add the next bars to our growing list of all bars
        all_bars += next_bars

        # sleep to ensure that no more than 200 requests occur per 60 seconds
        time.sleep(max(request_time + 60/200 - time.time(), 0))

    # return the list of all bars
    return all_bars
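
The sleep at the end of the loop is a simple throttle: at 200 requests per 60 seconds, consecutive requests must start at least 0.3 seconds apart. The sketch below pulls that arithmetic into a hypothetical helper. The names throttle and minimum_runtime_seconds are illustrative, and I use time.monotonic() rather than time.time() so that system clock adjustments cannot affect the delay. The figure of 1,400 trading days is a rough assumption for the 2015 to 2020 window, not an exact count.

```python
import time

def throttle(request_started, max_requests=200, window_seconds=60.0):
    """Sleep until at least window/max seconds have passed since request_started.

    request_started must come from time.monotonic(). Returns the seconds slept.
    """
    min_interval = window_seconds / max_requests
    delay = max(request_started + min_interval - time.monotonic(), 0.0)
    time.sleep(delay)
    return delay

def minimum_runtime_seconds(n_requests, max_requests=200, window_seconds=60.0):
    """Lower bound on total runtime when every request is throttled."""
    return n_requests * window_seconds / max_requests

# rough assumption: about 1,400 trading days, one request per day
print(minimum_runtime_seconds(1400) / 60, 'minutes at minimum')
```

get_all_bars applies the same delay inline after each request.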

Note that get_all_bars takes several minutes to run, since it has a built-in time.sleep() to respect Alpaca’s rate limit of 200 requests per minute. I recommend that you save the output of this function to a file, which can then be loaded on demand for subsequent processing. The simplest way to do this is via the pickle module:

import pickle

# download and save the raw data to a pickle file (only need to do this once)
with open('SPY.pkl', 'wb') as f:
    pickle.dump(get_all_bars('SPY'), f)

# load the raw data from the pickle file
with open('SPY.pkl', 'rb') as f:
    time_bars = pickle.load(f)
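
Because the bars are plain lists of dictionaries holding numbers, JSON is a possible alternative to pickle if you want a file that other tools can read. The following is only a sketch: save_bars and load_bars are hypothetical helper names, the sample bars are made up, and the temporary file path is used purely so the example can run anywhere.

```python
import json
import os
import tempfile

def save_bars(bars, path):
    """Write a list of bar dicts to a JSON file."""
    with open(path, 'w') as f:
        json.dump(bars, f)

def load_bars(path):
    """Read a list of bar dicts back from a JSON file."""
    with open(path) as f:
        return json.load(f)

# two made-up minutely bars in the same shape as the API response
sample_bars = [
    {'t': 1598880600, 'o': 350.0, 'h': 350.4, 'l': 349.9, 'c': 350.2, 'v': 1200},
    {'t': 1598880660, 'o': 350.2, 'h': 350.5, 'l': 350.1, 'c': 350.3, 'v': 800},
]

path = os.path.join(tempfile.mkdtemp(), 'SPY_sample.json')
save_bars(sample_bars, path)
restored = load_bars(path)
print(restored == sample_bars)
```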

Now that we’ve got our raw data, it’s time to move on to the next step: processing it into a more useful format.

Converting to Dollar Bars

Currently, our bar data is partitioned by time: each bar denotes the first, highest, lowest, and last price over one minute. This time-based approach is by far the most common way of representing financial data, and is the default choice on most charting applications.

However, although time-based bars are intuitive and readily available, they are not necessarily the best choice for generating features for a machine learning model. For one thing, many market participants tend to place their trades at the beginning or the end of the trading day, and so it is often the case that the majority of trading activity (and thus revealed information about other participants) happens during these windows. More generally, financial markets tend to incorporate new information in sudden bursts when the information becomes available, rather than continuously throughout the day. Time-based bars are thus highly variable in terms of how much information they contain about the behavior of other market participants.

A simple solution to these issues is to convert our bars from representing a fixed amount of time to representing a fixed amount of volume. In this case, “volume” can be denominated either in terms of number of shares (referred to as volume bars), or in terms of dollar value of shares (referred to as dollar bars). I generally prefer dollar bars for two reasons: first, they are more internally consistent for datasets where the price of an asset changes significantly over time; and second, they are less influenced by events such as share issuance or buybacks which change the number of outstanding shares of a company.
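
The first of these reasons is easy to see with a little arithmetic. The sketch below uses hypothetical round numbers (a 250,000-share volume bar, a $50m dollar bar, and prices of $200 and $350), none of which come from the actual dataset.

```python
def dollars_in_volume_bar(shares_per_bar, price):
    """Dollar value represented by a fixed-size volume bar at a given price."""
    return shares_per_bar * price

def shares_in_dollar_bar(dollars_per_bar, price):
    """Number of shares needed to fill a fixed-size dollar bar at a given price."""
    return dollars_per_bar / price

SHARES_PER_BAR = 250_000       # hypothetical volume-bar size
DOLLARS_PER_BAR = 50_000_000   # hypothetical dollar-bar size

for price in (200.0, 350.0):
    print(
        f'price ${price:.0f}: '
        f'volume bar = ${dollars_in_volume_bar(SHARES_PER_BAR, price):,.0f}, '
        f'dollar bar = {shares_in_dollar_bar(DOLLARS_PER_BAR, price):,.0f} shares'
    )
```

At $200 the two bar types coincide. At $350 the same volume bar represents 75% more dollar value, while the dollar bar still represents $50m and simply needs fewer shares to fill.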

To convert our time bars into dollar bars, we iterate over each bar and keep track of the highest high and lowest low seen so far, as well as the cumulative dollar volume. Because we only have minutely data rather than tick data, we cannot calculate the exact dollar volume traded during each minute. The best we can do is approximate it by multiplying the share volume by the average of the opening and closing prices. Every time the accumulated dollar volume goes above our fixed threshold, we generate another dollar bar and reset our running high/low. The following function implements this conversion for a given dollar threshold:

import math

def get_dollar_bars(time_bars, dollar_threshold):

    # initialize an empty list of dollar bars
    dollar_bars = []

    # initialize the running dollar volume at zero
    running_volume = 0

    # initialize the running high and low with placeholder values
    running_high, running_low = 0, math.inf

    # for each time bar...
    for i in range(len(time_bars)):

        # get the timestamp, open, high, low, close, and volume of the next bar
        next_timestamp, next_open, next_high, next_low, next_close, next_volume = [time_bars[i][k] for k in ['t', 'o', 'h', 'l', 'c', 'v']]

        # get the midpoint price of the next bar (the average of the open and the close)
        midpoint_price = (next_open + next_close)/2

        # get the approximate dollar volume of the bar using the volume and the midpoint price
        dollar_volume = next_volume * midpoint_price

        # update the running high and low
        running_high, running_low = max(running_high, next_high), min(running_low, next_low)

        # if the next bar's dollar volume would take us over the threshold...
        if dollar_volume + running_volume >= dollar_threshold:

            # set the timestamp for the dollar bar as the timestamp at which the bar closed (i.e. one minute after the timestamp of the last minutely bar included in the dollar bar)
            bar_timestamp = next_timestamp + 60

            # add a new dollar bar to the list of dollar bars with the timestamp, running high/low, and next close
            dollar_bars += [{'timestamp': bar_timestamp, 'high': running_high, 'low': running_low, 'close': next_close}]

            # reset the running volume to zero
            running_volume = 0

            # reset the running high and low to placeholder values
            running_high, running_low = 0, math.inf

        # otherwise, increment the running volume
        else:
            running_volume += dollar_volume

    # return the list of dollar bars
    return dollar_bars

Note that because we started with minutely bars rather than tick data, our conversion to dollar bars is inexact. Each bar contains at least the threshold amount of dollar volume, and less than the threshold plus the dollar volume of the final minute included in the bar. To ensure that each bar represented exactly the threshold amount, we would need to know the price and size of each individual trade. For our purposes, however, the approximate dollar bars we have just created will be sufficient.
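
To see this bound on a small example, the sketch below runs a variant of the conversion on five made-up minutely bars. The function get_dollar_bars_with_volume is my own illustrative variant, not part of the pipeline: it follows the same logic as get_dollar_bars but also records how much dollar volume ended up in each bar. The prices, volumes, and $100,000 threshold are invented for readability.

```python
import math

def get_dollar_bars_with_volume(time_bars, dollar_threshold):
    """Variant of get_dollar_bars that also records each bar's dollar volume."""
    dollar_bars = []
    running_volume = 0.0
    running_high, running_low = 0.0, math.inf
    for bar in time_bars:
        # approximate the minute's dollar volume with the open/close midpoint
        minute_volume = bar['v'] * (bar['o'] + bar['c']) / 2
        running_volume += minute_volume
        running_high = max(running_high, bar['h'])
        running_low = min(running_low, bar['l'])
        if running_volume >= dollar_threshold:
            dollar_bars.append({
                'timestamp': bar['t'] + 60,
                'high': running_high,
                'low': running_low,
                'close': bar['c'],
                'dollar_volume': running_volume,
                'last_minute_volume': minute_volume,
            })
            running_volume = 0.0
            running_high, running_low = 0.0, math.inf
    return dollar_bars

# five made-up minutely bars at a flat $100 open/close
toy_bars = [
    {'t': 0, 'o': 100, 'h': 100.5, 'l': 99.5, 'c': 100, 'v': 300},
    {'t': 60, 'o': 100, 'h': 101.0, 'l': 99.8, 'c': 100, 'v': 300},
    {'t': 120, 'o': 100, 'h': 100.2, 'l': 99.0, 'c': 100, 'v': 500},
    {'t': 180, 'o': 100, 'h': 100.1, 'l': 99.9, 'c': 100, 'v': 200},
    {'t': 240, 'o': 100, 'h': 102.0, 'l': 99.7, 'c': 100, 'v': 900},
]
threshold = 100_000
toy_dollar_bars = get_dollar_bars_with_volume(toy_bars, threshold)

for bar in toy_dollar_bars:
    overshoot = bar['dollar_volume'] - threshold
    print(bar['timestamp'], bar['dollar_volume'], overshoot, bar['last_minute_volume'])
```

Both toy bars overshoot the threshold, and in each case the overshoot is smaller than the dollar volume of the final minute in the bar.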

The choice of dollar threshold is up to you, and depends on your goals in the same way that minutely, hourly, and daily time bars are all useful in different contexts. For this example, I will set the dollar threshold at $50m. This gives us an average of one dollar bar every 14 minutes for the entire data set.

# convert the time bars to dollar bars
dollar_bars = get_dollar_bars(time_bars, 50000000)
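
If you want to compare candidate thresholds before committing to one, a rough way is to count how many dollar bars each threshold produces and divide the number of minutely bars by that count. The helpers below are hypothetical and self-contained (they repeat the accumulation logic so the sketch runs on its own), and the toy session is made up: 390 minutes trading exactly $10m each. The calculation assumes one minutely bar per trading minute, which may not hold if minutes without trades are missing from the data.

```python
def count_dollar_bars(time_bars, dollar_threshold):
    """Count the dollar bars a threshold would produce, without building them."""
    count, running_volume = 0, 0.0
    for bar in time_bars:
        running_volume += bar['v'] * (bar['o'] + bar['c']) / 2
        if running_volume >= dollar_threshold:
            count += 1
            running_volume = 0.0
    return count

def average_minutes_per_bar(time_bars, dollar_threshold):
    """Average number of minutely bars consumed per dollar bar."""
    n_bars = count_dollar_bars(time_bars, dollar_threshold)
    return len(time_bars) / n_bars if n_bars else float('inf')

# made-up session: 390 minutes, each trading 100,000 shares at $100
toy_session = [
    {'t': 60 * i, 'o': 100, 'h': 100, 'l': 100, 'c': 100, 'v': 100_000}
    for i in range(390)
]

for threshold in (25_000_000, 50_000_000, 100_000_000):
    print(threshold, average_minutes_per_bar(toy_session, threshold))
```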

With our time-based bars converted to more informationally homogeneous dollar bars, we can now use these dollar bars to build a feature matrix.

Constructing a Feature Matrix

Feature engineering is an important part of any machine learning endeavor, and doubly so for financial machine learning. A deep dive into the construction of useful financial features is unfortunately beyond the scope of this article. For illustrative purposes, I will just walk through the process of constructing a feature matrix with a handful of simple features.

The first step is to convert our bars from their current form as a list of dictionaries into a pandas DataFrame. This allows us to make use of many of pandas’ built-in functions for calculating useful statistics, and speeds up computation by vectorizing our calculations.

Once we’ve converted our bars into a DataFrame, we can compute the features themselves. We will start with a few simple features: last price vs exponentially-weighted moving average, last price vs cumulative high/low range, and cumulative return vs rolling high/low range. We will also add features for degree of convexity/concavity and rolling autocorrelation.

Note that each of these features is parameterized by the range of bars over which it is computed (e.g. a 5-bar EWMA vs a 10-bar EWMA). The following function takes as input a DataFrame of bars and the range parameter and adds columns for each of the aforementioned features:

def add_feature_columns(bars_df, period_length):

    # get the price vs ewma feature
    bars_df[f'feature_PvEWMA_{period_length}'] = bars_df['close']/bars_df['close'].ewm(span=period_length).mean() - 1

    # get the price vs cumulative high/low range feature
    bars_df[f'feature_PvCHLR_{period_length}'] = (bars_df['close'] - bars_df['low'].rolling(period_length).min()) / (bars_df['high'].rolling(period_length).max() - bars_df['low'].rolling(period_length).min())

    # get the return vs rolling high/low range feature
    bars_df[f'feature_RvRHLR_{period_length}'] = bars_df['close'].pct_change(period_length)/((bars_df['high']/bars_df['low'] - 1).rolling(period_length).mean())

    # get the convexity/concavity feature
    bars_df[f'feature_CON_{period_length}'] = (bars_df['close'] + bars_df['close'].shift(period_length))/(2 * bars_df['close'].rolling(period_length+1).mean()) - 1

    # get the rolling autocorrelation feature
    bars_df[f'feature_RACORR_{period_length}'] = bars_df['close'].rolling(period_length).apply(lambda x: x.autocorr()).fillna(0)

    # return the bars df with the new feature columns added
    return bars_df
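
To show what the first of these features measures, here is a pure-Python sketch of the price vs EWMA calculation on three made-up closing prices. It assumes pandas' default adjust=True weighting for ewm(span=...).mean(), in which each average is a weighted mean with weights (1 - alpha)^i and alpha = 2 / (span + 1). The function names ewma and price_vs_ewma are illustrative only.

```python
def ewma(values, span):
    """Exponentially-weighted mean using normalized weights (1 - alpha)^i."""
    alpha = 2 / (span + 1)
    result = []
    weighted_sum, weight_total = 0.0, 0.0
    for x in values:
        # decay all earlier weights by (1 - alpha), then add the new value with weight 1
        weighted_sum = x + (1 - alpha) * weighted_sum
        weight_total = 1 + (1 - alpha) * weight_total
        result.append(weighted_sum / weight_total)
    return result

def price_vs_ewma(closes, span):
    """Fractional distance of each close above (+) or below (-) its EWMA."""
    return [close / mean - 1 for close, mean in zip(closes, ewma(closes, span))]

closes = [100.0, 102.0, 101.0]  # made-up closing prices
feature = price_vs_ewma(closes, span=3)
for close, value in zip(closes, feature):
    print(close, round(value, 6))
```

A positive value means the latest close sits above its recent weighted average, and a negative value means it sits below.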

In order for our features to reflect both short and long term characteristics of the market at each point in time, it is often useful to include the same types of features over a variety of bar ranges. One approach would be to use an evenly-spaced set of ranges such as 10, 20, 30, 40, and 50 bars. However, this has the effect of disproportionately over-weighting longer-term features, since the 40- and 50-bar ranges will contain roughly the same information while the 10- and 20-bar ranges will not. For this reason, I prefer to use an exponentially-spaced set of bar ranges. In this example we will use ranges of 4, 8, 16, 32, 64, 128, and 256 bars (ranging from ~1 hour to ~9 days). The following function calls add_feature_columns to generate features for each of those bar ranges:

import pandas as pd

def get_feature_matrix(dollar_bars):

    # convert the list of bar dicts into a pandas dataframe
    bars_df = pd.DataFrame(dollar_bars)

    # number of bars to aggregate over for each period
    period_lengths = [4, 8, 16, 32, 64, 128, 256]

    # for each period length
    for period_length in period_lengths:

        # add the feature columns to the bars df
        bars_df = add_feature_columns(bars_df, period_length)

    # prune the nan rows at the beginning of the dataframe
    bars_df = bars_df[period_lengths[-1]:]

    # filter out the high/low/close columns and return
    return bars_df[[column for column in bars_df.columns if column not in ['high', 'low', 'close']]]

Now, all we need to do to create a full feature matrix of five feature types for each of seven bar ranges is call this function on our list of dollar bars:

# construct the feature matrix from the dollar bars
feature_matrix = get_feature_matrix(dollar_bars)
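
As a quick sanity check on the choice of bar ranges, the sketch below compares how much adjacent windows overlap under even and exponential spacing, and converts each exponential range into approximate clock time. The 14-minute figure is the average pace quoted earlier for the $50m threshold; the 390-minute session length is my assumption of a regular 9:30 to 16:00 trading day, and adjacent_overlaps is a hypothetical helper.

```python
MINUTES_PER_BAR = 14       # average pace quoted earlier for the $50m threshold
MINUTES_PER_SESSION = 390  # assumption: a regular 9:30-16:00 trading session

linear_lengths = [10, 20, 30, 40, 50]
period_lengths = [2 ** k for k in range(2, 9)]  # 4, 8, ..., 256

def adjacent_overlaps(lengths):
    """Fraction of each window that is already covered by the next-shorter window."""
    return [shorter / longer for shorter, longer in zip(lengths, lengths[1:])]

print('even spacing:       ', adjacent_overlaps(linear_lengths))
print('exponential spacing:', adjacent_overlaps(period_lengths))

for n_bars in period_lengths:
    minutes = n_bars * MINUTES_PER_BAR
    print(n_bars, 'bars ~', round(minutes / 60, 1), 'hours ~',
          round(minutes / MINUTES_PER_SESSION, 1), 'sessions')
```

With even spacing the overlap climbs from 50% to 80% as the windows get longer, whereas with exponential spacing every window shares exactly half of its bars with the next-shorter one.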

For the purposes of this tutorial, we will conclude our feature engineering effort here. However, a real-world ML pipeline would probably include at least two more phases: first, the application of some form of feature selection or dimensionality reduction to our set of 35 features; and second, the standardization/normalization of each feature before it is used as input to an ML model.
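
The second of these phases can be sketched in a few lines. The example below uses z-score standardization, which is only one of several options, on a single made-up feature column. Fitting the mean and standard deviation on the earlier (training) rows only is my assumption about how you would want to avoid leaking later information into earlier samples; fit_standardizer and standardize are hypothetical helper names.

```python
import statistics

def fit_standardizer(train_values):
    """Compute the mean and population standard deviation of the training values."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Convert values to z-scores; a constant column maps to all zeros."""
    if std == 0:
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]

# made-up feature column, split chronologically into train and test portions
train_column = [1.0, 2.0, 3.0, 4.0, 5.0]
test_column = [3.0, 5.0]

mean, std = fit_standardizer(train_column)
train_scaled = standardize(train_column, mean, std)
test_scaled = standardize(test_column, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```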

Recap and Next Steps

In Part 1 of this post, we walked through three key stages of building a financial machine learning pipeline with Alpaca. First, we pulled raw minutely bar data from Alpaca’s Data API. Next, we converted those minutely bars into dollar bars. Finally, we used those dollar bars to generate a matrix of a few dozen potentially useful features.

All the code snippets provided above can be found in the GitHub repo for this post.

Stay tuned for Part 2, where we will walk through the last essential step of a financial ML pipeline: labeling our data. We will also explore two optional steps, event-based sampling and bootstrapping, which have the potential to improve our model’s performance.

If you enjoyed this article, be sure to check out my other work on my website.
