Quantitative Horse Racing with R: Calibration, Backtesting, and Deployment
Racing analytics as an inference-and-decision system
Thoroughbred flat racing is not a binary classification problem. It is a multi-competitor outcome process with hierarchy (horse / trainer / jockey / track), time dependence (form cycles, market moves), and decision layers (how you act on probabilities).
This post is deliberately code-heavy. The goal is to show a canonical, reproducible R workflow you can adapt: from a data contract to modeling to evaluation to deployment.
- Model layer: calibrated probabilities and uncertainty.
- Decision layer: explicit assumptions, risk controls, stress tests.
If you do only one thing, do this: never judge models by ROI alone. Use proper scoring rules and calibration first.
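To make that concrete, here is a minimal base-R sketch of the two proper scoring rules used throughout this post (the function names `log_loss` and `brier` are illustrative, not from any package):

```r
# Proper scoring rules for binary outcomes: lower is better for both.
log_loss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
brier <- function(y, p) mean((p - y)^2)

y <- c(1, 0, 1, 0)
p_sharp <- c(0.9, 0.1, 0.8, 0.2)  # confident and right
p_vague <- c(0.5, 0.5, 0.5, 0.5)  # uninformative
log_loss(y, p_sharp)  # ~0.16
log_loss(y, p_vague)  # ~0.69 (= log 2)
brier(y, p_vague)     # 0.25
```

Both rules reward honest probabilities: a model cannot improve its expected score by shading predictions away from what it actually believes, which is exactly the property ROI lacks.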
1) Canonical data contract: race → runner → odds snapshots
The fastest way to avoid analytics chaos is to standardize your storage model. A minimal schema lets every chapter, notebook, or experiment start from the same tables.
DuckDB schema (SQL)
-- sql/schema.sql
CREATE TABLE IF NOT EXISTS races (
  race_id    VARCHAR PRIMARY KEY,
  race_date  DATE,
  region     VARCHAR,
  track      VARCHAR,
  surface    VARCHAR,
  going      VARCHAR,
  distance_m INTEGER,
  race_class VARCHAR,
  purse      DOUBLE,
  field_size INTEGER
);

CREATE TABLE IF NOT EXISTS runners (
  race_id       VARCHAR,
  runner_id     VARCHAR,
  horse_id      VARCHAR,
  jockey_id     VARCHAR,
  trainer_id    VARCHAR,
  draw          INTEGER,
  weight_kg     DOUBLE,
  age           INTEGER,
  sex           VARCHAR,
  odds_decimal  DOUBLE,
  finish_pos    INTEGER,
  finish_time_s DOUBLE,
  win           INTEGER,
  PRIMARY KEY (race_id, runner_id)
);

CREATE TABLE IF NOT EXISTS odds_snapshots (
  race_id      VARCHAR,
  runner_id    VARCHAR,
  ts           TIMESTAMP,
  odds_decimal DOUBLE,
  traded_vol   DOUBLE,
  PRIMARY KEY (race_id, runner_id, ts)
);
2) Generate a reproducible simulated dataset
High-quality racing data is often license-restricted. A robust approach is to make every lab runnable on a simulator that matches your canonical schema, then provide adapters to ingest real data only when access is lawful.
Race simulator (R)
# R/simulate.R
library(dplyr)
library(tidyr)
simulate_races <- function(
n_races = 200, n_horses = 600, n_jockeys = 200, n_trainers = 150,
min_field = 6, max_field = 14, seed = 1
) {
set.seed(seed)
horses <- tibble(horse_id = sprintf("H%04d", 1:n_horses),
ability = rnorm(n_horses, 0, 1))
jockeys <- tibble(jockey_id = sprintf("J%03d", 1:n_jockeys),
skill = rnorm(n_jockeys, 0, 0.4))
trainers<- tibble(trainer_id= sprintf("T%03d", 1:n_trainers),
skill = rnorm(n_trainers, 0, 0.5))
surfaces <- c("Turf","Dirt","AllWeather")
goings <- c("Firm","Good","Gd-Fm","Gd-Sft","Soft","Heavy","Standard","Fast","Sloppy")
races <- tibble(
race_id = sprintf("R%05d", 1:n_races),
race_date = as.Date("2025-01-01") + sample(0:365, n_races, TRUE),
region = sample(c("GB","IE","US","AU","FR"), n_races, TRUE),
track = sample(c("TRACK_A","TRACK_B","TRACK_C"), n_races, TRUE),
surface = sample(surfaces, n_races, TRUE, prob = c(0.45,0.35,0.20)),
going = sample(goings, n_races, TRUE),
distance_m = sample(c(1000,1200,1400,1600,1800,2000,2400,2800), n_races, TRUE),
race_class = sample(c("G1","G2","G3","HCP","ALW","CLM"), n_races, TRUE),
purse = round(exp(rnorm(n_races, log(25000), 0.7))),
field_size = sample(min_field:max_field, n_races, TRUE)
)
runners <- races |>
rowwise() |>
do({
r <- .
k <- r$field_size
df <- tibble(
race_id = r$race_id,
runner_id = paste0(r$race_id, "_", seq_len(k)),
horse_id = sample(horses$horse_id, k, FALSE),
jockey_id = sample(jockeys$jockey_id, k, TRUE),
trainer_id = sample(trainers$trainer_id, k, TRUE),
draw = sample(1:k, k, FALSE),
weight_kg = pmax(45, rnorm(k, 55, 3)),
age = sample(2:7, k, TRUE),
sex = sample(c("C","F","G","H"), k, TRUE, prob = c(0.2,0.35,0.35,0.1))
) |>
left_join(horses, by="horse_id") |>
left_join(jockeys, by="jockey_id", suffix=c("","_j")) |>
left_join(trainers, by="trainer_id", suffix=c("","_t")) |>
mutate(
draw_effect = ifelse(k >= 12, (k + 1 - draw)/k, 0),
surface_effect = case_when(r$surface=="Turf" ~ 0.15,
r$surface=="AllWeather" ~ 0.05,
TRUE ~ 0),
weight_penalty = -0.03*(weight_kg - 55),
utility = ability + skill + skill_t + 0.12*draw_effect +
surface_effect + weight_penalty + rnorm(k, 0, 0.25)
)
# Sequential proportional-to-exp(utility) finish ordering
remaining <- seq_len(k)
w <- exp(df$utility - max(df$utility))
order_idx <- integer(0)
for (pos in seq_len(k)) {
probs <- w[remaining]/sum(w[remaining])
pick <- sample(remaining, 1, prob = probs)
order_idx <- c(order_idx, pick)
remaining <- setdiff(remaining, pick)
}
df$finish_pos <- match(seq_len(k), order_idx)
df$win <- as.integer(df$finish_pos == 1)
base_time <- r$distance_m/16
df$finish_time_s <- base_time - 0.8*df$ability + rnorm(k, 0, 1.2)
# Market-like odds with an overround
p <- exp(df$utility - max(df$utility)); p <- p/sum(p)
overround <- 1.18
df$odds_decimal <- pmin(200, pmax(1.01, 1/(overround*p) * exp(rnorm(k, 0, 0.06))))
df |>
select(race_id, runner_id, horse_id, jockey_id, trainer_id,
draw, weight_kg, age, sex, odds_decimal,
finish_pos, finish_time_s, win)
}) |>
ungroup()
list(races = races, runners = runners)
}
sim <- simulate_races(seed = 42)
Quick EDA sanity checks
library(ggplot2)

ggplot(sim$races, aes(field_size)) +
  geom_histogram(bins = 12) +
  labs(x = "Runners per race", y = "Count")

ggplot(sim$runners, aes(odds_decimal)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "Decimal odds (log scale)", y = "Count")
3) Persist to Parquet and query with DuckDB
A clean pattern is: write Parquet (Arrow) → create tables in DuckDB from read_parquet().
This yields a fast, local warehouse without heavyweight infrastructure.
library(arrow)
library(DBI)
library(duckdb)
dir.create("data/derived", recursive = TRUE, showWarnings = FALSE)
write_parquet(sim$races, "data/derived/races.parquet")
write_parquet(sim$runners, "data/derived/runners.parquet")
dir.create("data/warehouse", showWarnings = FALSE)
con <- dbConnect(duckdb(), dbdir = "data/warehouse/racing.duckdb")
dbExecute(con, "INSTALL parquet; LOAD parquet;")
dbExecute(con, "
CREATE OR REPLACE TABLE races AS
SELECT * FROM read_parquet('data/derived/races.parquet');
")
dbExecute(con, "
CREATE OR REPLACE TABLE runners AS
SELECT * FROM read_parquet('data/derived/runners.parquet');
")
dbGetQuery(con, "SELECT COUNT(*) AS n_races FROM races;")
dbGetQuery(con, "SELECT COUNT(*) AS n_runners FROM runners;")
dbDisconnect(con, shutdown = TRUE)
Query patterns you will use constantly
-- Average odds and win rate by surface
SELECT
surface,
AVG(odds_decimal) AS avg_odds,
AVG(win) AS win_rate,
COUNT(*) AS n
FROM runners r
JOIN races x USING (race_id)
GROUP BY surface
ORDER BY n DESC;
-- Within-race normalization of implied probability
WITH base AS (
SELECT
race_id,
runner_id,
odds_decimal,
1.0 / odds_decimal AS implied_raw
FROM runners
)
SELECT
race_id,
runner_id,
implied_raw / SUM(implied_raw) OVER (PARTITION BY race_id) AS implied_p_norm
FROM base;
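The window-function query above is equivalent to this base-R computation per race, with `ave()` playing the role of `SUM(...) OVER (PARTITION BY race_id)` (a sketch on toy odds, not the simulated tables):

```r
# Within-race normalization of implied probabilities, base R.
race_id <- c("R1", "R1", "R1", "R2", "R2")
odds    <- c(2.5, 4.0, 6.0, 1.8, 3.2)

implied_raw    <- 1 / odds
implied_p_norm <- implied_raw / ave(implied_raw, race_id, FUN = sum)

# Each race's normalized probabilities sum to exactly 1:
tapply(implied_p_norm, race_id, sum)
```

Pushing this into SQL (as above) scales better once you have millions of runner rows, but the arithmetic is identical.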
4) Baselines: probabilistic win modeling + calibration
In racing analytics, your model output is typically a probability. That means evaluation should prioritize proper scoring rules and calibration, not just AUC.
Feature store (leakage-safe)
library(dplyr)
features <- sim$runners |>
left_join(sim$races, by = "race_id") |>
transmute(
race_id, runner_id,
win,
# pre-race covariates only
draw, weight_kg, age, sex,
surface, going, distance_m, race_class, purse,
odds_decimal,
log_odds = log(odds_decimal),
implied_p_raw = 1/odds_decimal
)
stopifnot(all(is.finite(features$log_odds)))
stopifnot(all(features$odds_decimal >= 1.01))
tidymodels GLM baseline + log loss / Brier / AUC
library(tidymodels)
set.seed(123)
df <- features |>
group_by(race_id) |>
mutate(implied_p = implied_p_raw / sum(implied_p_raw)) |>
ungroup() |>
mutate(win = factor(win, levels = c(0,1), labels = c("no","yes")))
# split by race so runners from one race never straddle train/test
spl <- group_initial_split(df, group = race_id)
train<- training(spl)
test <- testing(spl)
rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
distance_m + odds_decimal, data = train) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
mod <- logistic_reg() |> set_engine("glm")
wf <- workflow() |> add_recipe(rec) |> add_model(mod)
fit <- fit(wf, train)
pred <- predict(fit, test, type = "prob") |>
bind_cols(test |> select(win))
metric_set(mn_log_loss, brier_class, roc_auc)(
  pred, truth = win, .pred_yes,
  event_level = "second"  # "yes" is the second factor level
)
Calibration plot
library(probably)
# Binned calibration plot
cal_plot_breaks(pred, truth = win, estimate = .pred_yes, num_breaks = 10)
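Under the hood, a binned calibration curve is just grouped means: bin the predictions, then compare the average prediction to the observed win rate per bin. A base-R sketch on synthetic, perfectly calibrated predictions (the bin width of 0.1 mirrors `num_breaks = 10`):

```r
# What a binned calibration plot computes, in base R.
set.seed(7)
p <- runif(5000)            # predicted probabilities
y <- rbinom(5000, 1, p)     # outcomes drawn from those same probabilities

bins <- cut(p, breaks = seq(0, 1, 0.1), include.lowest = TRUE)
calib <- data.frame(
  mean_pred = tapply(p, bins, mean),  # average prediction per bin
  obs_rate  = tapply(y, bins, mean)   # observed win rate per bin
)
calib  # for a calibrated model the two columns track each other
```

A model that ranks well but is miscalibrated will show bins drifting off the diagonal, and that drift is exactly what poisons downstream staking decisions.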
5) Multi-runner outcome modeling: rankings, not just winners
Binary win models throw away a lot of structure. Racing naturally produces a ranking: finish positions across multiple runners. Ranking models let you target that structure directly.
Plackett–Luce (direct ranking likelihood)
library(PlackettLuce)
set.seed(99)
items <- paste0("H", 1:12)
n_races <- 40
ability <- rnorm(length(items)); names(ability) <- items
Rmat <- matrix(0, nrow = n_races, ncol = length(items))
colnames(Rmat) <- items
for (r in 1:n_races) {
field <- sample(items, size = sample(6:10, 1), replace = FALSE)
util <- ability[field] + rnorm(length(field), 0, 0.3)
ord <- field[order(util, decreasing = TRUE)]
Rmat[r, ord] <- seq_along(ord) # 1 = best; 0 = absent
}
rankings <- as.rankings(Rmat)
pl <- PlackettLuce(rankings)
summary(pl)
coef(pl) # worth parameters
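By default `coef()` returns the worth parameters on the log scale, so converting a field's log-worths into win probabilities is a softmax. A minimal sketch on made-up log-worths (the values below are illustrative, not taken from the fitted model):

```r
# Softmax of log-worths = Plackett–Luce win probabilities for a field.
log_worth <- c(H01 = 0.8, H02 = 0.3, H03 = 0.0, H04 = -0.5)

# Subtract the max for numerical stability before exponentiating.
win_prob <- exp(log_worth - max(log_worth)) /
  sum(exp(log_worth - max(log_worth)))

round(win_prob, 3)
sum(win_prob)  # probabilities over the field sum to 1
```

This is also why Plackett–Luce composes cleanly with the sequential finish-order sampler in the simulator: both are built on the same exponential-worth structure.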
Bradley–Terry via pairwise decomposition
library(dplyr)
library(BradleyTerry2)
pairwise <- tibble()
for (r in 1:n_races) {
in_race <- names(which(Rmat[r, ] > 0))
ranks <- Rmat[r, in_race]
for (i in 1:(length(in_race) - 1)) {
for (j in (i + 1):length(in_race)) {
h1 <- in_race[i]; h2 <- in_race[j]
win1 <- as.integer(ranks[h1] < ranks[h2])
pairwise <- bind_rows(pairwise, tibble(h1=h1, h2=h2, win1=win1, win2=1-win1))
}
}
}
agg <- pairwise |>
group_by(h1, h2) |>
summarise(win1 = sum(win1), win2 = sum(win2), .groups = "drop")
agg <- agg |>
  mutate(h1 = factor(h1, levels = items),
         h2 = factor(h2, levels = items))  # BTm expects factor players
bt <- BTm(cbind(win1, win2), h1, h2, data = agg, id = "horse")
BTabilities(bt)
6) Hierarchy: GLMMs and Bayesian multilevel models
Racing is fundamentally hierarchical: repeated horses, trainers, jockeys, tracks, regions. Fixed effects can overfit; multilevel models provide shrinkage and better uncertainty accounting.
GLMM with trainer/jockey random effects (lme4)
library(lme4)
library(dplyr)
sim2 <- simulate_races(n_races=500, n_horses=800, n_jockeys=120, n_trainers=100, seed=202)
df2 <- sim2$runners |> left_join(sim2$races, by="race_id")
m_glmm <- glmer(
win ~ scale(weight_kg) + scale(draw) + surface + going +
(1 | trainer_id) + (1 | jockey_id),
data = df2,
family = binomial()
)
summary(m_glmm)
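What the random effects buy you is shrinkage. A base-R toy (independent of the fitted model above; `prior_n` is an assumed prior strength, not an lme4 quantity) shows the idea: trainer win rates estimated from few runs get pulled hard toward the overall mean, while well-sampled trainers barely move:

```r
# Empirical-Bayes-style shrinkage of per-trainer win rates toward a grand mean.
set.seed(3)
n_runs   <- c(4, 10, 50, 400)  # runs per trainer
true_p   <- 0.12               # common underlying win rate
wins     <- rbinom(4, n_runs, true_p)

raw_rate <- wins / n_runs                        # fixed-effect analogue
prior_n  <- 30                                    # pseudo-observations (assumption)
shrunk   <- (wins + prior_n * true_p) / (n_runs + prior_n)

cbind(n_runs, raw_rate, shrunk)
# Small-sample trainers move most; the 400-run trainer barely moves.
```

`glmer` estimates the equivalent of `prior_n` from the data (via the random-effect variance), which is precisely what makes the multilevel model safer than per-trainer dummies.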
Bayesian multilevel model (brms)
library(brms)
priors <- c(
prior(normal(0, 1), class = "b"),
prior(exponential(1), class = "sd")
)
fit_bayes <- brm(
win ~ scale(weight_kg) + scale(draw) + surface + going +
(1 | trainer_id) + (1 | jockey_id),
data = df2,
family = bernoulli(),
prior = priors,
chains = 2, iter = 1000, cores = 2, seed = 202
)
summary(fit_bayes)
pp_check(fit_bayes)
7) Time-to-event: survival and frailty modeling
Some questions in racing are naturally time-to-event: time to first win, time to peak performance, time to retirement, time between races. Survival analysis gives you the right likelihood and handles censoring.
Cox proportional hazards
library(dplyr)
library(survival)
set.seed(1)
n <- 1000
horses <- tibble(
horse_id = sprintf("H%04d", 1:n),
talent = rnorm(n),
trainer_id= sample(sprintf("T%03d", 1:80), n, TRUE),
debut_age = pmax(2, rnorm(n, 3.0, 0.6))
)
base_rate <- 0.002
linpred <- 0.65 * horses$talent - 0.25 * (horses$debut_age - 3)
rate <- base_rate * exp(linpred)
time_days <- rexp(n, rate = rate)
censor_days <- runif(n, 60, 700)
status <- as.integer(time_days <= censor_days)  # 1 = event observed, 0 = censored
time_obs <- pmin(time_days, censor_days)
df_surv <- horses |> mutate(time_days = time_obs, status = status)
cox <- coxph(Surv(time_days, status) ~ scale(talent) + scale(debut_age), data = df_surv)
summary(cox)
cox.zph(cox)
Frailty term (trainer-level heterogeneity)
cox_f <- coxph(
  Surv(time_days, status) ~ scale(talent) + scale(debut_age) + frailty(trainer_id),
  data = df_surv
)
summary(cox_f)
8) Modern ML: XGBoost in tidymodels and mlr3
Gradient boosting can capture non-linearities and interactions without manual feature engineering. But if you don’t evaluate probabilistically (log loss, calibration), you can accidentally ship a model that “ranks well” yet prices poorly.
tidymodels: tuned XGBoost
library(tidymodels)
library(dplyr)
set.seed(314)
sim3 <- simulate_races(n_races = 500, n_horses = 1200, seed = 314)
df3 <- sim3$runners |>
left_join(sim3$races, by="race_id") |>
mutate(win = factor(win, levels=c(0,1), labels=c("no","yes")))
spl <- initial_split(df3, strata = win)
train <- training(spl)
test <- testing(spl)
rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
distance_m + odds_decimal, data=train) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
folds <- vfold_cv(train, v = 5, strata = win)
xgb_spec <- boost_tree(
trees = 800,
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune()
) |>
set_engine("xgboost") |>
set_mode("classification")
wf <- workflow() |> add_recipe(rec) |> add_model(xgb_spec)
grid <- grid_space_filling(
tree_depth(),
learn_rate(),
loss_reduction(),
sample_size = sample_prop(),
mtry(range = c(3, 15)),  # keep within the post-dummy predictor count
size = 20
)
res <- tune_grid(
wf,
resamples = folds,
grid = grid,
metrics = metric_set(mn_log_loss, roc_auc)
)
best <- select_best(res, metric = "mn_log_loss")
final_fit <- finalize_workflow(wf, best) |> fit(train)
pred <- predict(final_fit, test, type="prob") |> bind_cols(test |> select(win))
metric_set(mn_log_loss, brier_class, roc_auc)(
  pred, truth = win, .pred_yes,
  event_level = "second"  # "yes" is the second factor level
)
mlr3: comparable workflow
library(data.table)
library(mlr3)
library(mlr3learners)
set.seed(99)
sim4 <- simulate_races(n_races = 400, n_horses = 900, seed = 99)
df4 <- as.data.table(merge(sim4$runners, sim4$races, by = "race_id"))
df4[, win := factor(ifelse(win == 1, "yes", "no"))]
# Drop the Date column (unsupported feature type) and post-race outcome columns
# that would otherwise leak the target into the features.
df4[, c("race_date", "finish_pos", "finish_time_s") := NULL]
id_cols <- c("race_id","runner_id","horse_id","jockey_id","trainer_id")
chr <- setdiff(names(df4)[sapply(df4, is.character)], id_cols)
df4[, (chr) := lapply(.SD, factor), .SDcols = chr]
task <- as_task_classif(df4, target = "win", id = "win_task")
task$set_col_roles(id_cols, remove_from = "feature")
rsmp <- rsmp("cv", folds = 5)
lrn_glm <- lrn("classif.log_reg", predict_type = "prob")
rr_glm <- resample(task, lrn_glm, rsmp)
rr_glm$aggregate(msr("classif.logloss"))
rr_glm$aggregate(msr("classif.auc"))
lrn_xgb <- lrn("classif.xgboost", predict_type = "prob",
nrounds = 300, eta = 0.05, max_depth = 4)
rr_xgb <- resample(task, lrn_xgb, rsmp)
rr_xgb$aggregate(msr("classif.logloss"))
rr_xgb$aggregate(msr("classif.auc"))
9) Odds time series: from snapshots to features
Exchange markets produce time-stamped odds and volumes. Treat them as time series and engineer features like drift, volatility, late steam, and liquidity. Keep it honest: market microstructure is noisy, and backtests require strong assumptions.
Simulated snapshots as a tsibble + feature extraction
library(dplyr)
library(tsibble)
library(feasts)
library(ggplot2)
set.seed(1)
snap <- tibble(
runner_id = "R00001_01",
ts = as.POSIXct("2025-06-01 14:00:00", tz = "UTC") + seq(0, 60*60, by = 60),
odds_decimal = pmax(1.01, 6 + cumsum(rnorm(61, 0, 0.05))),
traded_vol = cumsum(pmax(0, rnorm(61, 10, 3)))
)
tsbl <- snap |> as_tsibble(index = ts, key = runner_id)
feat <- tsbl |>
features(odds_decimal, list(
mean = ~ mean(.),
sd = ~ sd(.),
min = ~ min(.),
max = ~ max(.),
slope = ~ coef(lm(. ~ seq_along(.)))[2]
))
feat
ggplot(tsbl, aes(ts, odds_decimal)) +
geom_line() +
labs(x = "Time", y = "Decimal odds")
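A feature worth adding on real snapshot data is "late steam": how much the price moved in the final minutes relative to the whole session. A base-R sketch on a deterministic toy series (the 10-minute window is an assumption you would tune; log-odds moves are used so drifts are comparable across price levels):

```r
# Late steam: log-odds move over the last 10 minutes vs. the full session.
ts   <- as.POSIXct("2025-06-01 14:00:00", tz = "UTC") + seq(0, 3600, by = 60)
odds <- c(rep(6.0, 51), seq(6.0, 4.8, length.out = 10))  # price shortens late

late <- ts >= max(ts) - 10 * 60          # flag the final 10 minutes
late_move  <- log(odds[length(odds)]) - log(odds[which(late)[1]])
total_move <- log(odds[length(odds)]) - log(odds[1])

c(late_move = late_move, total_move = total_move)
# Negative values = shortening price (money arriving for this runner).
```

Here all of the session's movement happened in the last ten minutes, which is exactly the pattern a "late steam" feature is designed to isolate.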
Schema normalization: snapshots table
# Example transformation skeleton for real exchange data:
# 1) parse raw feed messages
# 2) map to normalized columns: race_id, runner_id, ts, odds_decimal, traded_vol
# 3) write Parquet and load into DuckDB
# write_parquet(odds_snapshots, "data/derived/odds_snapshots.parquet")
# dbExecute(con, "CREATE OR REPLACE TABLE odds_snapshots AS
# SELECT * FROM read_parquet('data/derived/odds_snapshots.parquet');")
10) Backtesting as a decision layer (fractional Kelly + guardrails)
Once you have calibrated probabilities, you can add a decision policy. One common framework is fractional Kelly sizing, but it must be capped and stress tested. The same probabilities can produce very different outcomes depending on execution constraints, market impact, limits, and selection filters.
Simulated fractional Kelly example with risk caps
library(dplyr)
library(ggplot2)
sim5 <- simulate_races(n_races = 180, n_horses = 500, seed = 9)
runners <- sim5$runners |>
group_by(race_id) |>
mutate(
p_market = (1/odds_decimal) / sum(1/odds_decimal),
# toy "model": market + noise (for demonstration)
p_model = pmin(0.99, pmax(0.001, p_market + rnorm(n(), 0, 0.02))),
edge = p_model - p_market
) |>
ungroup()
fractional_kelly <- function(p, odds_dec, fraction = 0.25, cap = 0.05) {
b <- odds_dec - 1
f_star <- (b*p - (1 - p)) / b
f <- pmax(0, f_star) * fraction
pmin(f, cap)
}
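A quick sanity check of the staking function's behavior (the function is reproduced here so the snippet is self-contained): a fair price yields no stake, a modest edge stakes below the cap, and a large edge hits the cap.

```r
fractional_kelly <- function(p, odds_dec, fraction = 0.25, cap = 0.05) {
  b <- odds_dec - 1                 # net odds
  f_star <- (b * p - (1 - p)) / b   # full-Kelly fraction
  f <- pmax(0, f_star) * fraction   # never stake a negative edge
  pmin(f, cap)                      # hard cap per bet
}

fractional_kelly(p = 0.20, odds_dec = 5.0)  # fair price: f* = 0, stake 0
fractional_kelly(p = 0.30, odds_dec = 5.0)  # f* = 0.125, quarter-Kelly 0.03125
fractional_kelly(p = 0.60, odds_dec = 5.0)  # big edge: capped at 0.05
```

The cap matters most exactly when the model disagrees most with the market, which is also when model error is most likely; that asymmetry is the whole argument for fractional rather than full Kelly.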
bankroll <- 1
max_drawdown <- 0.25
peak <- bankroll
hist <- tibble()
for (rid in unique(runners$race_id)) {
if (bankroll < (1 - max_drawdown) * peak) break # guardrail: stop
race <- runners |> filter(race_id == rid)
pick <- race |> arrange(desc(edge)) |> slice(1)
  if (pick$edge > 0.01 && pick$odds_decimal <= 20) {  # edge and odds filters
    f <- fractional_kelly(pick$p_model, pick$odds_decimal)
    stake <- f * bankroll
    pnl <- if (pick$win == 1) stake * (pick$odds_decimal - 1) else -stake
    bankroll <- bankroll + pnl
    peak <- max(peak, bankroll)
    hist <- bind_rows(hist, tibble(race_id = rid, stake = stake,
                                   bankroll = bankroll, peak = peak))
  }
}
hist <- hist |> mutate(drawdown = 1 - bankroll/peak)
ggplot(hist, aes(seq_along(bankroll), bankroll)) + geom_line() +
labs(x = "Race index", y = "Bankroll")
ggplot(hist, aes(seq_along(drawdown), drawdown)) + geom_line() +
labs(x = "Race index", y = "Drawdown")
11) Deployment: versioned models + HTTP prediction API
If you want your work to be reusable, build a deployable artifact. A clean R-native pattern is:
pins for versioning + vetiver for model packaging +
plumber for the API server.
library(tidymodels)
library(pins)
library(vetiver)
library(plumber)
library(dplyr)
sim6 <- simulate_races(n_races = 250, n_horses = 700, seed = 101)
df6 <- sim6$runners |>
left_join(sim6$races, by="race_id") |>
mutate(win = factor(win, levels=c(0,1), labels=c("no","yes")))
rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
distance_m + odds_decimal, data = df6) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
fit <- workflow() |>
add_recipe(rec) |>
add_model(logistic_reg() |> set_engine("glm")) |>
fit(df6)
board <- board_folder("pins", versioned = TRUE)
v <- vetiver_model(fit, model_name = "win_prob_glm")
vetiver_pin_write(board, v)
api <- pr() |> vetiver_api(v)
# Run locally:
# api |> pr_run(host = "0.0.0.0", port = 8080)
api
Minimal request payload shape (JSON)
{
"draw": 7,
"weight_kg": 55.5,
"age": 4,
"sex": "G",
"surface": "Turf",
"going": "Good",
"distance_m": 1600,
"odds_decimal": 6.2
}
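Assuming the API is running locally on port 8080 (as in the commented `pr_run()` call), vetiver serves predictions at a `/predict` endpoint that expects a JSON *array* of rows. A sketch of building that payload in base R (`jsonlite::toJSON` is the usual tool; hand-building it here just makes the shape explicit), with one way to send it from a shell in the comment:

```r
# One prediction row, matching the recipe's predictor names exactly.
row_json <- sprintf(
  paste0('{"draw": %d, "weight_kg": %.1f, "age": %d, "sex": "%s", ',
         '"surface": "%s", "going": "%s", "distance_m": %d, "odds_decimal": %.1f}'),
  7L, 55.5, 4L, "G", "Turf", "Good", 1600L, 6.2
)
payload <- paste0("[", row_json, "]")  # /predict expects an array of rows
payload

# Then, from a shell (payload saved to body.json):
# curl -X POST http://127.0.0.1:8080/predict \
#      -H "Content-Type: application/json" -d @body.json
```

Keeping the payload keys identical to the recipe's predictors is what lets the pinned workflow preprocess the request the same way it preprocessed the training data.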
12) Reproducibility mechanics: renv + deterministic builds
Reproducibility is not vibes. Lock dependencies, record versions, and standardize builds. This is especially important when your analytics becomes a book, a report, or a product artifact.
renv workflow
# Initialize (once)
# renv::init()

# Snapshot current library into renv.lock
renv::snapshot()

# Restore exact package versions from renv.lock
renv::restore()
Dockerfile for deterministic PDF builds (Quarto)
FROM rocker/verse:4.4.2
WORKDIR /book
COPY renv.lock renv.lock
COPY renv/ renv/
COPY .Rprofile .Rprofile
RUN R -q -e "install.packages('renv'); renv::restore(prompt = FALSE)"
COPY . .
CMD ["quarto", "render", "book", "--to", "pdf"]
CI skeleton (GitHub Actions)
name: Render PDF
on:
push:
branches: [ main ]
workflow_dispatch:
jobs:
pdf:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: r-lib/actions/setup-r@v2
- uses: r-lib/actions/setup-renv@v2
- name: Render Quarto book
run: quarto render book --to pdf
13) An “engineering checklist” you can actually follow
- Canonical schema
- Parquet + DuckDB
- Adapters, not redistribution
- GLM/GLMM baselines
- Ranking likelihoods
- Bayesian uncertainty
- Log loss / Brier
- Calibration curves
- Stress-tested backtests
- Odds snapshots
- Drift/volatility features
- Liquidity constraints
- Explicit assumptions
- Fractional Kelly caps
- Drawdown guardrails
- pins versioning
- vetiver packaging
- plumber API
One more code snippet: within-race probability normalization
Many racing tables store “odds” per runner, but for inference you often want a normalized probability within a field. Here’s a standard pattern that also highlights overround issues.
library(dplyr)
normalize_field_probs <- function(df, odds_col = odds_decimal) {
df |>
group_by(race_id) |>
mutate(
implied_raw = 1 / {{ odds_col }},
overround = sum(implied_raw),
p_market = implied_raw / overround
) |>
ungroup()
}
df_norm <- normalize_field_probs(sim$runners)
df_norm |>
  summarise(
    avg_overround = mean(overround),
    min_overround = min(overround),
    max_overround = max(overround)
  )
The post Quantitative Horse Racing with R: Calibration, Backtesting, and Deployment appeared first on R Programming Books.