Horses for Courses: A Systematic Betting Strategy
This dataset has been available for quite some time now, and there are many analyses of it. I’ve gone a similar route: how can we exploit this data for betting purposes? I take an approach from systematic/quantitative portfolio management, using a factor returns/factor loadings approach to determine an “alpha” for each horse in a race and then betting systematically across these alphas. The aim is to exploit inefficiencies in the betting market and bet against other market participants’ biases. An example might be punters consistently paying a premium to bet on younger horses, above and beyond the actual effect of age on race outcomes.
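To make that concrete: with decimal odds $o$, a one-dollar bet returns $o - 1$ if it pays and $-1$ otherwise, so under a true win probability $p$ its expected return is $p \cdot o - 1$. If, say, crowd enthusiasm for young horses compresses a horse’s odds to $o = 2.0$ when its true win probability is only $p = 0.45$, backing it carries an expected return of $0.45 \times 2.0 - 1 = -0.10$, and a systematic bettor can look to collect that premium by leaning the other way.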
This approach was heavily inspired by the Macquarie Quant research team’s 2017 Melbourne Cup publication, which can be found across the web. Unfortunately, I don’t replicate their exceptional returns here (perhaps deliberately) but simply demonstrate how to apply the techniques using Python.
Disclaimer: this is entirely for fun and demonstrative purposes only; I have no experience with horse betting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
runners = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\runners.csv")
forms = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\forms.csv")
odds = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\odds.csv")
horses = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\horses.csv")
markets = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\markets.csv")
riders = pd.read_csv(r"D:\Downloads\horses-for-courses (1)\riders.csv")
EDA
runners = runners[['id','position', 'place_paid', 'market_id', 'horse_id', 'trainer_id', 'rider_id', 'form_rating_one', 'handicap_weight', 'barrier', 'last_five_starts']]
forms = forms[['market_id', 'horse_id', 'runner_number','days_since_last_run', 'overall_starts', 'field_strength', 'overall_wins', 'overall_places']]

# Merge runner, form, horse and rider information into a single EDA frame
eda_df = runners.merge(forms, on=['market_id', 'horse_id'])
eda_df = eda_df.merge(horses, left_on='horse_id', right_on='id')
eda_df = eda_df.merge(riders, left_on='rider_id', right_on='id')

# Collapse each runner's odds snapshots into a single averaged row
odds_mean = odds.groupby(['runner_id']).mean()
odds_mean.reset_index(inplace=True)

# 'id_x' is the runner id after the merges above; drop duplicate id columns and unused wagered columns
eda_df = eda_df.merge(odds_mean, left_on='id_x', right_on='runner_id')
eda_df.drop(['id_y', 'id_x', 'odds_one_place_wagered', 'odds_two_place_wagered', 'odds_three_place_wagered', 'odds_two_win_wagered'], axis=1, inplace=True)
eda_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80805 entries, 0 to 80804
Data columns (total 40 columns):
position 61605 non-null float64
place_paid 80805 non-null int64
market_id 80805 non-null int64
horse_id 80805 non-null int64
trainer_id 80805 non-null float64
rider_id 80805 non-null object
form_rating_one 80805 non-null float64
handicap_weight 80805 non-null float64
barrier 80805 non-null int64
last_five_starts 76311 non-null object
runner_number 80805 non-null int64
days_since_last_run 80805 non-null int64
overall_starts 80805 non-null int64
field_strength 79843 non-null float64
overall_wins 80805 non-null int64
overall_places 80805 non-null int64
age 80805 non-null float64
sex_id 80805 non-null float64
sire_id 80805 non-null float64
dam_id 80805 non-null float64
prize_money 80805 non-null float64
id 80805 non-null int64
sex 80805 non-null object
runner_id 80805 non-null int64
odds_one_win 80805 non-null float64
odds_one_win_wagered 80805 non-null float64
odds_one_place 80805 non-null float64
odds_one_place_wagered 0 non-null float64
odds_two_win 80805 non-null float64
odds_two_win_wagered 0 non-null float64
odds_two_place 80805 non-null float64
odds_two_place_wagered 0 non-null float64
odds_three_win 80805 non-null float64
odds_three_win_wagered 80805 non-null float64
odds_three_place 80805 non-null float64
odds_three_place_wagered 80805 non-null float64
odds_four_win 80805 non-null float64
odds_four_win_wagered 80805 non-null float64
odds_four_place 80805 non-null float64
odds_four_place_wagered 80805 non-null float64
dtypes: float64(26), int64(11), object(3)
memory usage: 25.3+ MB
Form Rating
Interestingly, unless a horse’s form rating is exactly 100.0, there is no substantial difference in the number of wins recorded across ratings of roughly 70–99. In fact, there seems to be a drop-off at the higher form ratings, though this could be due to a smaller sample size. It may also be that form raters are more likely to assign a 100.0 than a 99.0, which would explain the significant uptick in wins among horses rated 100.0.
form_data = eda_df.groupby(['form_rating_one'])[['overall_wins', 'overall_places']].sum()
plt.figure(figsize=(10,5))
form_data['overall_wins'].plot(kind='bar')
plt.ylabel('Number of Wins')
plt.xlabel('Form')
plt.title('Distribution of Wins Per Form Rating')
plt.show()
plt.figure(figsize=(10,5))
form_data['overall_places'].plot(kind='bar')
plt.ylabel('Number of Places')
plt.xlabel('Form')
plt.title('Distribution of Places Per Form Rating')
plt.show()
Age
age_data = eda_df.groupby(['age'])[['overall_wins', 'overall_places']].sum()
plt.figure(figsize=(7,5))
age_data['overall_wins'].plot(kind='bar')
plt.ylabel('Number of Wins')
plt.xlabel('Age')
plt.title('Distribution of Wins Per Age')
plt.show()
plt.figure(figsize=(7,5))
age_data['overall_places'].plot(kind='bar')
plt.ylabel('Number of Places')
plt.xlabel('Age')
plt.title('Distribution of Places Per Age')
plt.show()
sns.pairplot(eda_df[['age', 'overall_wins', 'overall_places']])
plt.show()
horse_data = eda_df.groupby(['horse_id'])[['overall_wins', 'overall_places']].sum()
plt.figure(figsize=(7,5))
sns.distplot(horse_data['overall_wins'].values)
plt.xlabel('Number of Wins')
plt.title('Distribution of Wins Per Horse')
plt.show()
plt.figure(figsize=(7,5))
sns.distplot(horse_data['overall_places'].values)
plt.xlabel('Number of Places')
plt.title('Distribution of Places Per Horse')
plt.show()
Barriers
It’s a well-known punters’ strategy to avoid horses in the outer barriers (i.e. barriers 8, 9 and 10 in a 10-barrier race). If we look at the distribution of wins/places per barrier, this seems reasonably well confirmed: barriers 1–6 have a roughly equal distribution of wins, whilst barriers 9–10 drop off substantially. Given that this appears to be a pervasive anomaly, it should already be baked into the odds on offer, but perhaps punters systematically overestimate the importance of the barrier, and this could potentially be exploited.
from collections import Counter

barrier_data = eda_df.groupby(['barrier'])[['overall_wins', 'overall_places']].sum()
barriers_per_race = eda_df.groupby(['market_id'])['barrier'].count()
# Distribution of field sizes (number of runners) across races
barrier_counts = pd.DataFrame(list(Counter(barriers_per_race).items()), columns=['Barriers', 'Count'])
barrier_counts.set_index(['Barriers'], inplace=True)
barrier_counts.plot(kind='bar')
plt.show()
plt.figure(figsize=(7,5))
barrier_data['overall_wins'].plot(kind='bar')
plt.ylabel('Number of Wins')
plt.xlabel('Barrier')
plt.title('Distribution of Wins Per Barrier')
plt.show()
plt.figure(figsize=(7,5))
barrier_data['overall_places'].plot(kind='bar')
plt.ylabel('Number of Places')
plt.xlabel('Barrier')
plt.title('Distribution of Places Per Barrier')
plt.show()
We can see that race position is actually correlated with odds (surprise, surprise).
sns.clustermap(eda_df.corr(), figsize=(15,15))
plt.show()
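For a quick numeric check on that claim, a minimal sketch using the columns already in `eda_df`:
# Pearson correlation of finishing position with each win-odds column;
# positive values mean longer odds tend to accompany worse finishes.
print(eda_df[['position', 'odds_one_win', 'odds_two_win',
              'odds_three_win', 'odds_four_win']].corr()['position'])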
Betting
We’ll approach developing a betting model similarly to what’s known as a factor model in asset management/finance. Our target variable will be the observed return (i.e. if we bet $1 on a horse with odds of 1.5 and the horse wins, our observed return is $0.50), and we’ll use factors such as barrier, age and pre-race odds to try to explain it. By doing this, we may be able to develop a model which explains the observed return, then apply that model to new data to try to maximise our expected return when betting on a horse race (likely requiring us to bet across many horse races… but this is all just for fun!).
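As a tiny sanity check of that target definition (mirroring the `return` column computed below):
# place_paid is 1 if the bet paid out and 0 otherwise, so with place odds of 1.5
# a winning $1 bet returns 1 * 1.5 - 1 = 0.5 and a losing one 0 * 1.5 - 1 = -1.0.
for place_paid, odds_ in [(1, 1.5), (0, 1.5)]:
    print(place_paid * odds_ - 1)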
forms = forms[['market_id', 'horse_id', 'days_since_last_run']]
runners = runners[['id','position', 'place_paid', 'market_id', 'horse_id','form_rating_one', 'handicap_weight', 'barrier', 'last_five_starts']]

# Average each runner's odds snapshots, keeping only the place odds
odds_mean = odds.groupby(['runner_id']).mean()
odds_mean.reset_index(inplace=True)
odds_mean = odds_mean[['runner_id', 'odds_one_place', 'odds_two_place', 'odds_three_place', 'odds_four_place']]

combined_df = runners.merge(forms, on=['market_id', 'horse_id'])
combined_df = combined_df.merge(odds_mean, left_on=['id'], right_on=['runner_id'])
combined_df = combined_df.merge(horses[['id','age']], left_on='horse_id', right_on='id')
# Join the race timestamp on market_id (the 'timezone' column holds the market's datetime)
combined_df = combined_df.merge(markets[['id', 'timezone']], left_on='market_id', right_on='id')
combined_df['timezone'] = pd.to_datetime(combined_df['timezone'])

# Keep only runners with a recorded finishing position
combined_df = combined_df[np.isfinite(combined_df['position'])]

# Average the four place-odds columns into a single odds figure per runner
combined_df['odds'] = (combined_df['odds_one_place'] + combined_df['odds_two_place'] + combined_df['odds_three_place'] + combined_df['odds_four_place']) / 4
combined_df.drop(['odds_one_place', 'odds_two_place', 'odds_three_place', 'odds_four_place'], axis=1, inplace=True)

# Observed return of a $1 place bet: odds - 1 if the place paid, otherwise -1
combined_df['return'] = combined_df['place_paid'] * combined_df['odds'] - 1
combined_df['inv_odds'] = 1 / combined_df['odds']
combined_df['inv_odds_sq'] = np.power(combined_df['inv_odds'], 2)

# calc_start_score (defined below) turns the last-five-starts string into a number
combined_df['last_five_starts'] = combined_df['last_five_starts'].apply(calc_start_score)
# Min-max normalise each factor within its race, so each horse is scored relative to its own field
to_norm = combined_df[['market_id', 'form_rating_one', 'horse_id','handicap_weight', 'barrier', 'last_five_starts', 'days_since_last_run', 'age', 'inv_odds', 'inv_odds_sq']]
normed_df = to_norm.groupby(['market_id']).apply(lambda x: (x - x.min())/(x.max() - x.min()))
# Restore the identifier and timestamp columns, which the normalisation mangled
normed_df['market_id'] = combined_df['market_id']
normed_df['timezone'] = combined_df['timezone']
normed_df['horse_id'] = combined_df['horse_id']
normed_df.head()
|    | market_id | horse_id | handicap_weight | barrier | last_five_starts | days_since_last_run | age | inv_odds | inv_odds_sq | timezone |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 338 | 6 | 0.000000 | 0.153846 | 0.916667 | 0.116667 | 0.250000 | 1.000000 | 0.000000 | 2016-10-02 18:15:00 |
| 6 | 314 | 1 | 1.000000 | 0.250000 | 0.250000 | 0.000000 | 0.571429 | 0.147641 | 0.154099 | 2016-09-27 19:48:00 |
| 9 | 563 | 13 | 0.000000 | 0.090909 | 0.617647 | 0.059633 | 0.600000 | 1.000000 | 0.000000 | 2016-11-25 22:12:00 |
| 22 | 483 | 17 | 0.555556 | 0.727273 | 0.055556 | 0.000000 | 0.000000 | 0.179808 | 0.045073 | 2016-11-21 21:40:00 |
| 27 | 127 | 22 | 0.000000 | 0.909091 | 0.000000 | 0.000000 | 0.000000 | 0.675319 | 0.004061 | 2016-07-30 21:19:00 |
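The scaling above is a per-race min-max normalisation: within each `market_id`, every factor is rescaled so the field minimum maps to 0 and the field maximum to 1, meaning horses are compared against the race they actually ran in. A toy illustration of the same pattern (using `transform` on a single column for brevity):
# Each race's ages are rescaled to [0, 1] within that race:
# market 1 -> 0.0, 0.5, 1.0; market 2 -> 0.0, 1.0
toy = pd.DataFrame({'market_id': [1, 1, 1, 2, 2],
                    'age': [3.0, 5.0, 7.0, 4.0, 8.0]})
print(toy.groupby('market_id')['age']
         .transform(lambda x: (x - x.min()) / (x.max() - x.min())))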
def calc_start_score(value):
    """
    Converts a horse's past five starts into a numeric score,
    e.g. if a horse's past races are 'xf245x', the score is 15 + 2 + 4 + 5 = 26.
    'f' = did not finish (scored 15 as a penalty), 'x' = scratched (scored 0).
    """
    val = 0
    for i in str(value):
        try:
            val += int(i)
        except ValueError:
            if i == 'f':
                val += 15
            # 'x' (scratched) contributes nothing
    return val
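A quick check against the docstring’s example:
# digits score themselves, 'f' scores 15 and 'x' scores 0: 0 + 15 + 2 + 4 + 5 + 0 = 26
print(calc_start_score('xf245x'))  # 26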
Factor Loadings
Above we got our data into a nice clean format for running through some regression models. Our approach is fairly simple:
- Split the dataset into a training period (before November 2016) and test period (November 2016 onwards).
- Fit our models to the training dataset. Effectively: $ExpectedReturn = \alpha + \beta_{1}LastFiveStarts + \beta_{2}DaysSinceLastRun + \beta_{3}Handicap + \beta_{4}Barrier + \beta_{5}Form + \beta_{6}Age + \beta_{7}\frac{1}{odds} + \beta_{8}\frac{1}{odds^{2}} + \epsilon$
- In our test period, calculate each horse's expected return $H_{alpha}$, then bucket the $H_{alpha}$ values within each race into four fractiles.
- In each race, systematically bet an equal amount across the horses in each fractile.
- Track profits over time in each fractile.
By doing this, we’ll be able to see if there’s any profit to be made by exploiting our basic model and betting on horses in the top quartile.
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor, plot_importance
final_df = normed_df.merge(pd.DataFrame(combined_df[['return']]), left_index=True, right_index=True)
final_df.dropna(inplace=True)
train_df = final_df.loc[final_df['timezone'] < '2016-11-01']
test_df = final_df.loc[final_df['timezone'] >= '2016-11-01']
X_train = train_df[['age', 'last_five_starts', 'days_since_last_run', 'handicap_weight', 'barrier', 'form_rating_one', 'inv_odds', 'inv_odds_sq']]
X_test = test_df[['age', 'last_five_starts', 'days_since_last_run', 'handicap_weight', 'barrier', 'form_rating_one', 'inv_odds', 'inv_odds_sq']]
y_train = train_df['return']
y_test = test_df['return']
Linear Regression
First we’ll run a linear regression to get an idea of the factor loadings; effectively, this tells us which factors explained the observed return. We see that inverse odds, age and handicap are quite dominant.
model = LinearRegression()
model.fit(X_train.values, y_train.values)
pd.DataFrame(model.coef_, X_train.columns).plot(kind='bar')
plt.legend('')
plt.show()
XGBoost
For our actual model, we’ll fit an XGBoost gradient-boosted tree regressor.
clf3 = XGBRegressor()
clf3.fit(X_train, y_train)
halphas = pd.DataFrame(clf3.predict(X_test), index=X_test.index, columns=['halpha'])
halphas = halphas.merge(test_df, left_index=True, right_index=True)
halphas['fractiles'] = halphas.groupby(['market_id'])['halpha'].apply(pd.cut, bins=4, labels=[1,2,3,4])
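One subtlety worth flagging: `pd.cut` splits each race’s range of $H_{alpha}$ values into four equal-width bins, so the resulting “fractiles” are not true quartiles and can be unevenly populated. If equal-population buckets were wanted instead, `pd.qcut` would be the usual substitute; a sketch, not what’s used above:
# pd.qcut buckets by rank rather than value; duplicates='drop' guards against
# tied halpha values in small fields, at the cost of occasionally fewer bins.
halphas['q_fractiles'] = halphas.groupby(['market_id'])['halpha'].apply(
    pd.qcut, q=4, labels=False, duplicates='drop')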
We notice that our fractiles don’t look particularly good: fractiles 1–3 have a negative mean return, whilst fractile 4 has a slight positive tilt. Ideally we’d have liked to see fractiles 1/2 with negative returns and fractiles 3/4 with positive returns.
halphas.groupby(['fractiles'])['return'].mean().plot(kind='bar')
plt.show()
ax = plot_importance(clf3)
fig = ax.figure
fig.set_size_inches(15, 10)
plt.show()
Betting Simulation
We’ll run our basic betting simulation here, where for each race we distribute a bet equally across the fractiles. We then track our returns over time.
Notice that we do get quite a nice spread across the fractile profiles: if we could go long our top fractile (fractile 4) and short our bottom fractile (fractile 1), we could make a healthy profit. Unfortunately, I’m not aware of an easy way to bet against a horse.
We also notice that the top fractile on its own doesn’t make a huge return in the long run. There are numerous reasons for this:
- I’ve only given the model ~135 races to bet on, which is likely not enough
- My model is fairly naive (deliberately), and I haven’t gone to extra lengths to clean the data, deal with outliers, etc.
- Potential ways to improve include:
- Bringing in extra factors
- Dealing with missing data better
- Dealing with odds data better, e.g. taking the timing of odds snapshots into account (see the sketch below)
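On that last point: the pipeline above collapses every odds snapshot for a runner into a flat mean. If the raw odds table carries a snapshot timestamp, a recency-weighted mean is one simple refinement; the sketch below assumes a hypothetical timestamp column `collected_at`, which this dataset may name differently (only the numeric odds columns were used above).
def recency_weighted_odds(group, col='odds_one_place', half_life_min=30.0):
    """Exponentially down-weight older odds snapshots for a single runner."""
    # 'collected_at' is an assumed snapshot-timestamp column, not confirmed in this dataset
    age_min = (group['collected_at'].max() - group['collected_at']).dt.total_seconds() / 60.0
    weights = np.exp(-np.log(2) * age_min / half_life_min)
    return (group[col] * weights).sum() / weights.sum()

# Hypothetical usage, if such a timestamp column existed:
# odds['collected_at'] = pd.to_datetime(odds['collected_at'])
# weighted = odds.groupby('runner_id').apply(recency_weighted_odds)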
def sim_bets(df, bet_size):
    """Simulate betting bet_size per race, split equally across the horses in each fractile."""
    results = []
    for i in np.unique(df['market_id']):
        tmp_df = df.loc[df.market_id == i]
        # Split the race's stake equally across the horses in each fractile
        fractile_counts = tmp_df.groupby(['fractiles'])['horse_id'].count()
        bets = pd.DataFrame(bet_size / fractile_counts)
        bets.columns = ['bet_amount']
        tmp_df = tmp_df.merge(bets, left_on='fractiles', right_index=True)
        tmp_df['bet_return'] = tmp_df['return'] * tmp_df['bet_amount']
        # Total return per fractile for this race
        tmp_res = pd.DataFrame(tmp_df.groupby(['fractiles'])['bet_return'].sum())
        tmp_res.fillna(0, inplace=True)
        tmp_res.columns = [i]
        results.append(tmp_res.T)
    return pd.concat(results)
sim = sim_bets(halphas, 1)
# Cumulative profit per fractile, plus the long-short (f4 minus f1) spread
sim_cumu = sim.cumsum()
sim_cumu.columns = ['f1', 'f2', 'f3', 'f4']
sim_cumu['f4-f1'] = sim_cumu['f4'] - sim_cumu['f1']
sim_cumu.plot(figsize=(15,10))
plt.title('Fractile Profit Over Time ($1 Bets)', size=20, fontweight='bold')
plt.ylabel('Cumulative Profit ($)')
plt.xlabel('Race Number')
plt.show()