Modeling performance and generalization across circuits
This project studies lap time prediction in Formula 1 using machine learning models trained on telemetry, weather, and circuit data from the 2023–2025 seasons. The goal is to understand which modeling approaches generalize across circuits and which features drive performance.
Two complementary modeling settings are explored:
- Per-race models — trained and evaluated within a single race
- Per-circuit models — trained on multiple circuits and tested on unseen ones
The latter represents the more realistic and challenging prediction task.
This work addresses two complementary prediction problems:
RQ1: Can lap times be accurately predicted within a single race when full contextual information is available?
This setting assumes knowledge of the circuit, drivers, and race conditions. It represents an upper bound on achievable performance and measures how well short-term lap dynamics can be modeled when persistent driver- and circuit-specific effects are observed.
RQ2: Can lap time dynamics be predicted on circuits not seen during training?
This setting removes circuit-specific information and evaluates whether models can generalize across tracks. It represents a fundamentally different learning problem, in which persistent circuit- and driver-specific effects are intentionally removed so that the model must rely on transferable performance signals rather than memorization.
The one-lap-ahead horizon is chosen as the shortest interval at which strategic decisions (pace control, tyre management, pit timing) can realistically react to updated information.
- Data Sources
- Feature Engineering
- Modeling Setup
- Results
- Key Findings
- Project Structure
- Requirements
- Usage
- FastF1 — Official telemetry, lap times, weather, and timing data
- OpenF1 API — Pit stops, stints, race control events
- F1DB — Circuit geometry and historical metadata
Coverage: 2023–2025 Formula 1 seasons
Note: The cache/ folder (~several GB) is not included in this repository. It can be regenerated by running utilities_dataset.ipynb.
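For orientation, a minimal FastF1 loading example of the kind of raw data involved (illustrative only; the actual extraction is done in the project notebooks):

```python
import os
import fastf1

# Cache raw FastF1 downloads locally (this project keeps them under cache/).
os.makedirs("cache", exist_ok=True)
fastf1.Cache.enable_cache("cache")

# Load one race session: year, event, session identifier ("R" = race).
session = fastf1.get_session(2023, "Monza", "R")
session.load()

laps = session.laps              # per-lap timing data (LapTime, Compound, Stint, ...)
weather = session.weather_data   # air/track temperature, wind, humidity, pressure
```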
All features are strictly causal: for each lap t, only information available up to and including lap t is used. Any variables derived from future laps (e.g., final stint length, post-race aggregates) are explicitly excluded.
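As an illustration of this constraint, the sketch below builds per-stint features using only laps up to and including lap t. Column names such as `Driver`, `Stint`, `LapNumber`, and `LapTime` are assumptions about the processed dataset, not necessarily the project's exact schema:

```python
import pandas as pd

def add_causal_stint_features(laps: pd.DataFrame) -> pd.DataFrame:
    """Add per-(driver, stint) features that use only laps up to and including lap t."""
    laps = laps.sort_values(["Driver", "Stint", "LapNumber"]).copy()
    grp = laps.groupby(["Driver", "Stint"], sort=False)

    # Pace over the last three laps, including the current one (no future laps are used).
    laps["RollingPace3"] = grp["LapTime"].transform(
        lambda s: s.rolling(window=3, min_periods=1).mean()
    )

    # Tyre age proxy: number of laps completed so far in the current stint (1-based).
    laps["TyreAgeLaps"] = grp.cumcount() + 1

    return laps
```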
Tyre features:
- Compound (SOFT, MEDIUM, HARD)
- Tyre age and stint length
- New tyre indicator

Race state features:
- DRS availability and traffic conditions
- Race position and gaps
- Clean/dirty air indicators

Weather features:
- Air and track temperature
- Wind speed and direction (trigonometric encoding; see the sketch after these lists)
- Humidity and pressure

Circuit geometry features:
- Number of turns and braking zones
- Straight lengths and DRS zones
- Corner angles and speed profiles
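The trigonometric wind encoding maps a direction in degrees onto sine/cosine components so that 0° and 360° are treated as the same direction. A minimal sketch (the column name `WindDirection` is an assumption):

```python
import numpy as np
import pandas as pd

def encode_wind_direction(df: pd.DataFrame, col: str = "WindDirection") -> pd.DataFrame:
    """Encode wind direction (degrees) as sin/cos to remove the 0/360 discontinuity."""
    radians = np.deg2rad(df[col])
    df[f"{col}_sin"] = np.sin(radians)
    df[f"{col}_cos"] = np.cos(radians)
    return df
```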
Two related but conceptually distinct prediction targets are used, corresponding to different generalization objectives.
- Within-race target (RQ1)

  Absolute next-lap time, defined as:

  y_t^(R) = LapTime_{t+1}

  This target captures short-horizon lap time evolution within a fixed race context, where persistent driver-, car-, and circuit-specific effects are observable.
- Cross-circuit target (RQ2)

  Deviation of the next-lap time from the average lap time of the current stint, computed over laps up to and including lap t:

  y_t^(C) = LapTime_{t+1} − mean(LapTime_{stint, ≤ t})

  This transformation removes circuit-specific scale effects and race-level baselines, forcing models to rely exclusively on transferable performance signals such as tyre degradation, weather variation, and race-state dynamics.
While both targets are expressed in seconds, they correspond to different estimands and learning problems and should not be interpreted interchangeably. Performance metrics across the two settings are therefore not directly comparable.
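A sketch of how the two targets can be derived from a lap-level table (column names such as `Race`, `Driver`, `Stint`, `LapNumber`, and `LapTime` are assumptions):

```python
import pandas as pd

def add_targets(laps: pd.DataFrame) -> pd.DataFrame:
    laps = laps.sort_values(["Race", "Driver", "Stint", "LapNumber"]).copy()
    grp = laps.groupby(["Race", "Driver", "Stint"], sort=False)

    # Within-race target (RQ1): absolute next-lap time.
    laps["y_next_lap"] = grp["LapTime"].shift(-1)

    # Cross-circuit target (RQ2): next-lap time minus the stint mean observed up to lap t.
    stint_mean_so_far = grp["LapTime"].transform(lambda s: s.expanding().mean())
    laps["y_next_lap_deviation"] = laps["y_next_lap"] - stint_mean_so_far

    return laps
```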
Each modeling setting corresponds to a different information set, allowing us to distinguish between memorization-driven accuracy and genuinely transferable predictability.
- Objective: Predict lap times within a single race
- Split: Time-ordered 70/15/15 within each (driver, stint) group (see the sketch after this list)
- Models: Linear Regression, Ridge Regression
- Features: Tyre, Weather, Race State, Driver/Team identifiers
- Use Case: Race strategy optimization with full contextual information
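A minimal sketch of such a time-ordered split, reusing the lap-level column names assumed above (not necessarily the project's exact implementation):

```python
import pandas as pd

def time_ordered_split(laps: pd.DataFrame, train: float = 0.70, val: float = 0.15) -> dict:
    """Split each (driver, stint) group chronologically into train/validation/test parts."""
    parts = {"train": [], "val": [], "test": []}
    for _, g in laps.sort_values("LapNumber").groupby(["Driver", "Stint"], sort=False):
        n = len(g)
        n_train = int(n * train)
        n_val = int(n * val)
        parts["train"].append(g.iloc[:n_train])
        parts["val"].append(g.iloc[n_train:n_train + n_val])
        parts["test"].append(g.iloc[n_train + n_val:])
    return {name: pd.concat(frames, ignore_index=True) for name, frames in parts.items()}
```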
- Objective: Generalize to unseen circuits
- Split: Circuit-disjoint (15 train / 4 validation / 5 test circuits)
- Models: Linear (Ridge, ElasticNet), Nonlinear (Decision Tree, Random Forest, Hist Gradient Boosting, CatBoost)
- Features: Tyre, Weather, Race State, Circuit Geometry
- Validation: GroupKFold cross-validation grouped by circuit (see the sketch after this list)
- Use Case: Pre-season predictions for new or modified circuits
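A sketch of circuit-grouped validation with scikit-learn, assuming a feature matrix `X`, target `y`, and an array of circuit labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

def cross_circuit_mae(X: np.ndarray, y: np.ndarray, circuits: np.ndarray) -> float:
    """Cross-validate with circuit-disjoint folds so no circuit appears in both train and validation."""
    cv = GroupKFold(n_splits=5)
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    scores = cross_val_score(model, X, y, groups=circuits, cv=cv,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()  # mean absolute error across circuit folds
```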
Per-race models (RQ1):

| Model | Features | Test MAE (s) | Test RMSE (s) | Test R² |
|---|---|---|---|---|
| Linear | Tyre + Stint | 0.6506 | 0.8413 | 0.9929 |
| Linear | Tyre + Stint + Weather | 0.7012 | 0.9254 | 0.9914 |
| Linear | Full (No Driver/Team) | 0.6102 | 0.8146 | 0.9933 |
| Ridge (α=1.0) | Full + Driver/Team | 0.4199 | 0.5768 | 0.9966 |
Per-race prediction is extremely accurate, driven largely by driver- and team-specific effects.
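The last row corresponds to a Ridge model with driver/team identifiers; a pipeline of that kind could look as follows (feature names are illustrative assumptions, not the project's exact configuration):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups for the per-race setting.
categorical = ["Driver", "Team", "Compound"]
numeric = ["TyreAgeLaps", "TrackTemp", "AirTemp", "Position"]

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])),
    ("ridge", Ridge(alpha=1.0)),
])
# model.fit(train_df[categorical + numeric], train_df["y_next_lap"])
```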
Per-circuit models (RQ2), linear:

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Linear | Tyre + Weather | 0.468 | 0.656 | -0.21 |
| Linear | Tyre + Weather + State | 0.365 | 0.521 | 0.239 |
| Ridge (α=5000) | Tyre + Weather + State | 0.375 | 0.527 | 0.222 |
Per-circuit models (RQ2), nonlinear:

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Random Forest | Tyre + Weather + State | 0.335 | 0.486 | 0.338 |
| Random Forest | + Geometry | 0.333 | 0.484 | 0.342 |
| Hist. Gradient Boosting | Tyre + Weather + State | 0.335 | 0.499 | 0.302 |
Note: Performance metrics across per-race and per-circuit settings are not directly comparable due to differing targets and information sets.
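The reported metrics can be computed from held-out predictions with scikit-learn, for example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Return the three error metrics reported in the tables above."""
    return {
        "MAE (s)": mean_absolute_error(y_true, y_pred),
        "RMSE (s)": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }
```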
- Per-race models achieve near-perfect accuracy (R² ≈ 0.997) but rely heavily on driver and circuit identity
- Per-circuit generalization is significantly harder and exposes model limitations
- Tree-based ensemble models (Random Forest, Hist. Gradient Boosting) outperform linear approaches when generalizing
- Geometry features improve performance for nonlinear models but degrade linear ones (multicollinearity)
- CatBoost showed strong training performance but poor generalization (overfitting)
- Driver/Team effects are powerful within-race predictors but do not transfer across circuits
Best model for cross-circuit generalization: Random Forest with Tyre + Weather + State + Geometry features (MAE = 0.33s, R² = 0.34)
.
├── cache/ # FastF1 raw data (not included, ~several GB)
├── csv_output/ # Processed datasets and model outputs
│ ├── Filtered_Data/ # Final cleaned datasets
│ ├── nonlinear/ # Model results (JSON/CSV)
│ ├── Train_set.xlsx # Circuit-disjoint splits
│ ├── Validation_set.xlsx
│ └── Test_set.xlsx
├── figures/ # Plots and visualizations
├── Formula1_dataset.ipynb # Data extraction from FastF1
├── Data_cleaning_and_feat_eng.ipynb # Feature engineering & cleaning
├── Linear.ipynb # Linear models (Ridge, ElasticNet)
├── Nonlinear.ipynb # Nonlinear models (DT, RF, HGB, CatBoost)
├── utilities_dataset.py # Helper functions
└── README.md
Python 3.8+
pip install -r requirements.txt

Core dependencies:
- pandas, numpy, scikit-learn
- matplotlib, seaborn
- fastf1 (F1 data API)
- catboost (nonlinear models)
- jupyter, openpyxl
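A requirements.txt consistent with the list above might look like this (versions unpinned; the actual file may differ):

```text
pandas
numpy
scikit-learn
matplotlib
seaborn
fastf1
catboost
jupyter
openpyxl
```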
Regenerate the cache from FastF1 (this can take a long time):

jupyter notebook utilities_dataset.ipynb

Process cache files into structured datasets:

jupyter notebook Data_cleaning_and_feat_eng.ipynb

Linear models (per-race and per-circuit):

jupyter notebook Linear.ipynb

Nonlinear models (DT, RF, HGB, CatBoost):

jupyter notebook Nonlinear.ipynb
- Reduced-form prediction: The models predict lap times directly, which reflect a mixture of tyre degradation, driver intent, fuel effects, and race strategy. Results should therefore be interpreted as predictive rather than causal.
- Limited cross-circuit sample: Strict circuit-disjoint evaluation leaves a small number of held-out tracks, implying non-negligible uncertainty in generalization estimates.
- Coarse circuit representation: Geometry features are high-level summaries and may not fully capture track-specific performance demands.
- Richer telemetry-based circuit descriptors
- Sequential models (LSTM / Transformers) for multi-lap dynamics
- Causal decomposition of tyre, driver, and strategy effects
- Cross-era generalization across regulatory regimes
- FastF1 Library: theOehrly/Fast-F1
- F1 Database: f1db/f1db
- OpenF1 API: openf1.org
This project is for educational purposes only. F1 data is property of Formula 1 and is accessed via the FastF1 library under their terms of use.
Author: Jacopo Sinigaglia
