Modeling performance and generalization across circuits
This project studies lap time prediction in Formula 1 using machine learning models trained on telemetry, weather, and circuit data from the 2023–2025 seasons. The goal is to understand which modeling approaches generalize across circuits and which features drive performance.
Two complementary modeling settings are explored:
- Per-race models — trained and evaluated within a single race
- Per-circuit models — trained on multiple circuits and tested on unseen ones
The latter represents the more realistic and challenging prediction task.
This work addresses two complementary prediction problems:
RQ1: Can lap times be accurately predicted within a single race when full contextual information is available?
This setting assumes knowledge of the circuit, drivers, and race conditions. It represents an upper bound on achievable performance and measures how well short-term lap dynamics can be modeled when persistent driver- and circuit-specific effects are observed.
RQ2: Can lap time dynamics be predicted on circuits not seen during training?
This setting removes circuit-specific information and evaluates whether models can generalize across tracks. It represents a fundamentally different learning problem, in which persistent circuit- and driver-specific effects are intentionally removed so that the model must rely on transferable performance signals rather than memorization.
The one-lap-ahead horizon is chosen as the shortest interval at which strategic decisions (pace control, tyre management, pit timing) can realistically react to updated information.
- Data Sources
- Feature Engineering
- Modeling Setup
- Results
- Key Findings
- Project Structure
- Requirements
- Usage
- FastF1 — Official telemetry, lap times, weather, and timing data
- OpenF1 API — Pit stops, stints, race control events
- F1DB — Circuit geometry and historical metadata
Coverage: 2023–2025 Formula 1 seasons
Note: The cache/ folder (~several GB) is not included in this repository. It can be regenerated by running utilities_dataset.ipynb.
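For orientation, a minimal FastF1 loading example of the kind of raw data involved (illustrative only; the actual extraction is done in the project notebooks):

```python
import os
import fastf1

# Cache raw FastF1 downloads locally (this project keeps them under cache/).
os.makedirs("cache", exist_ok=True)
fastf1.Cache.enable_cache("cache")

# Load one race session: year, event, session identifier ("R" = race).
session = fastf1.get_session(2023, "Monza", "R")
session.load()

laps = session.laps              # per-lap timing data (LapTime, Compound, Stint, ...)
weather = session.weather_data   # air/track temperature, wind, humidity, pressure
```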
All features are strictly causal: for each lap t, only information available up to and including lap t is used. Any variables derived from future laps (e.g., final stint length, post-race aggregates) are explicitly excluded.
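As an illustration of this constraint, the sketch below builds per-stint features using only laps up to and including lap t. Column names such as `Driver`, `Stint`, `LapNumber`, and `LapTime` are assumptions about the processed dataset, not necessarily the project's exact schema:

```python
import pandas as pd

def add_causal_stint_features(laps: pd.DataFrame) -> pd.DataFrame:
    """Add per-(driver, stint) features that use only laps up to and including lap t."""
    laps = laps.sort_values(["Driver", "Stint", "LapNumber"]).copy()
    grp = laps.groupby(["Driver", "Stint"], sort=False)

    # Pace over the last three laps, including the current one (no future laps are used).
    laps["RollingPace3"] = grp["LapTime"].transform(
        lambda s: s.rolling(window=3, min_periods=1).mean()
    )

    # Tyre age proxy: number of laps completed so far in the current stint (1-based).
    laps["TyreAgeLaps"] = grp.cumcount() + 1

    return laps
```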
Tyre features:
- Compound (SOFT, MEDIUM, HARD)
- Tyre age and stint length
- New tyre indicator

Race state features:
- DRS availability and traffic conditions
- Race position and gaps
- Clean/dirty air indicators

Weather features:
- Air and track temperature
- Wind speed and direction (trigonometric encoding; see the sketch after these lists)
- Humidity and pressure

Circuit geometry features:
- Number of turns and braking zones
- Straight lengths and DRS zones
- Corner angles and speed profiles
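The trigonometric wind encoding maps a direction in degrees onto sine/cosine components so that 0° and 360° are treated as the same direction. A minimal sketch (the column name `WindDirection` is an assumption):

```python
import numpy as np
import pandas as pd

def encode_wind_direction(df: pd.DataFrame, col: str = "WindDirection") -> pd.DataFrame:
    """Encode wind direction (degrees) as sin/cos to remove the 0/360 discontinuity."""
    radians = np.deg2rad(df[col])
    df[f"{col}_sin"] = np.sin(radians)
    df[f"{col}_cos"] = np.cos(radians)
    return df
```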
Two related but conceptually distinct prediction targets are used, corresponding to different generalization objectives.
- Within-race target (RQ1)

  Absolute next-lap time, defined as:

  y_t^(R) = LapTime_{t+1}

  This target captures short-horizon lap time evolution within a fixed race context, where persistent driver-, car-, and circuit-specific effects are observable.
- Cross-circuit target (RQ2)

  Deviation of the next-lap time from the average lap time of the current stint, computed over laps up to and including lap t:

  y_t^(C) = LapTime_{t+1} − mean(LapTime_{stint, ≤ t})

  This transformation removes circuit-specific scale effects and race-level baselines, forcing models to rely exclusively on transferable performance signals such as tyre degradation, weather variation, and race-state dynamics.
While both targets are expressed in seconds, they correspond to different estimands and learning problems and should not be interpreted interchangeably. Performance metrics across the two settings are therefore not directly comparable.
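A sketch of how the two targets can be derived from a lap-level table (column names such as `Race`, `Driver`, `Stint`, `LapNumber`, and `LapTime` are assumptions):

```python
import pandas as pd

def add_targets(laps: pd.DataFrame) -> pd.DataFrame:
    laps = laps.sort_values(["Race", "Driver", "Stint", "LapNumber"]).copy()
    grp = laps.groupby(["Race", "Driver", "Stint"], sort=False)

    # Within-race target (RQ1): absolute next-lap time.
    laps["y_next_lap"] = grp["LapTime"].shift(-1)

    # Cross-circuit target (RQ2): next-lap time minus the stint mean observed up to lap t.
    stint_mean_so_far = grp["LapTime"].transform(lambda s: s.expanding().mean())
    laps["y_next_lap_deviation"] = laps["y_next_lap"] - stint_mean_so_far

    return laps
```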
Each modeling setting corresponds to a different information set, allowing us to distinguish between memorization-driven accuracy and genuinely transferable predictability.
- Objective: Predict lap times within a single race
- Split: Time-ordered 70/15/15 within each (driver, stint) group (see the sketch after this list)
- Models: Linear Regression, Ridge Regression
- Features: Tyre, Weather, Race State, Driver/Team identifiers
- Use Case: Race strategy optimization with full contextual information
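A minimal sketch of such a time-ordered split, reusing the lap-level column names assumed above (not necessarily the project's exact implementation):

```python
import pandas as pd

def time_ordered_split(laps: pd.DataFrame, train: float = 0.70, val: float = 0.15) -> dict:
    """Split each (driver, stint) group chronologically into train/validation/test parts."""
    parts = {"train": [], "val": [], "test": []}
    for _, g in laps.sort_values("LapNumber").groupby(["Driver", "Stint"], sort=False):
        n = len(g)
        n_train = int(n * train)
        n_val = int(n * val)
        parts["train"].append(g.iloc[:n_train])
        parts["val"].append(g.iloc[n_train:n_train + n_val])
        parts["test"].append(g.iloc[n_train + n_val:])
    return {name: pd.concat(frames, ignore_index=True) for name, frames in parts.items()}
```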
- Objective: Generalize to unseen circuits
- Split: Circuit-disjoint (15 train / 4 validation / 5 test circuits)
- Models: Linear (Ridge, ElasticNet), Nonlinear (Decision Tree, Random Forest, Hist Gradient Boosting, CatBoost)
- Features: Tyre, Weather, Race State, Circuit Geometry
- Validation: GroupKFold cross-validation grouped by circuit (see the sketch after this list)
- Use Case: Pre-season predictions for new or modified circuits
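A sketch of circuit-grouped validation with scikit-learn, assuming a feature matrix `X`, target `y`, and an array of circuit labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

def cross_circuit_mae(X: np.ndarray, y: np.ndarray, circuits: np.ndarray) -> float:
    """Cross-validate with circuit-disjoint folds so no circuit appears in both train and validation."""
    cv = GroupKFold(n_splits=5)
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    scores = cross_val_score(model, X, y, groups=circuits, cv=cv,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()  # mean absolute error across circuit folds
```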
Per-race models (RQ1):

| Model | Features | Test MAE (s) | Test RMSE (s) | Test R² |
|---|---|---|---|---|
| Linear | Tyre + Stint | 0.6506 | 0.8413 | 0.9929 |
| Linear | Tyre + Stint + Weather | 0.7012 | 0.9254 | 0.9914 |
| Linear | Full (No Driver/Team) | 0.6102 | 0.8146 | 0.9933 |
| Ridge (α=1.0) | Full + Driver/Team | 0.4199 | 0.5768 | 0.9966 |
Per-race prediction is extremely accurate, driven largely by driver- and team-specific effects.
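The last row corresponds to a Ridge model with driver/team identifiers; a pipeline of that kind could look as follows (feature names are illustrative assumptions, not the project's exact configuration):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups for the per-race setting.
categorical = ["Driver", "Team", "Compound"]
numeric = ["TyreAgeLaps", "TrackTemp", "AirTemp", "Position"]

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])),
    ("ridge", Ridge(alpha=1.0)),
])
# model.fit(train_df[categorical + numeric], train_df["y_next_lap"])
```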
Per-circuit models (RQ2), linear:

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Linear | Tyre + Weather | 0.468 | 0.656 | -0.21 |
| Linear | Tyre + Weather + State | 0.365 | 0.521 | 0.239 |
| Ridge (α=5000) | Tyre + Weather + State | 0.375 | 0.527 | 0.222 |
Per-circuit models (RQ2), nonlinear:

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Random Forest | Tyre + Weather + State | 0.335 | 0.486 | 0.338 |
| Random Forest | + Geometry | 0.333 | 0.484 | 0.342 |
| Hist. Gradient Boosting | Tyre + Weather + State | 0.335 | 0.499 | 0.302 |
Note: Performance metrics across per-race and per-circuit settings are not directly comparable due to differing targets and information sets.
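The reported metrics can be computed from held-out predictions with scikit-learn, for example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Return the three error metrics reported in the tables above."""
    return {
        "MAE (s)": mean_absolute_error(y_true, y_pred),
        "RMSE (s)": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }
```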
- Per-race models achieve near-perfect accuracy (R² ≈ 0.997) but rely heavily on driver and circuit identity
- Per-circuit generalization is significantly harder and exposes model limitations
- Tree-based ensemble models (Random Forest, Hist. Gradient Boosting) outperform linear approaches when generalizing
- Geometry features improve performance for nonlinear models but degrade linear ones (multicollinearity)
- CatBoost showed strong training performance but poor generalization (overfitting)
- Driver/Team effects are powerful within-race predictors but do not transfer across circuits
Best model for cross-circuit generalization: Random Forest with Tyre + Weather + State + Geometry features (MAE = 0.33s, R² = 0.34)
.
├── cache/ # FastF1 raw data (not included, ~several GB)
├── csv_output/ # Processed datasets and model outputs
│ ├── Filtered_Data/ # Final cleaned datasets
│ ├── nonlinear/ # Model results (JSON/CSV)
│ ├── Train_set.xlsx # Circuit-disjoint splits
│ ├── Validation_set.xlsx
│ └── Test_set.xlsx
├── figures/ # Plots and visualizations
├── Formula1_dataset.ipynb # Data extraction from FastF1
├── Data_cleaning_and_feat_eng.ipynb # Feature engineering & cleaning
├── Linear.ipynb # Linear models (Ridge, ElasticNet)
├── Nonlinear.ipynb # Nonlinear models (DT, RF, HGB, CatBoost)
├── utilities_dataset.py # Helper functions
└── README.md
Python 3.8+
pip install -r requirements.txt

Core dependencies:
- pandas, numpy, scikit-learn
- matplotlib, seaborn
- fastf1 (F1 data API)
- catboost (nonlinear models)
- jupyter, openpyxl
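A requirements.txt consistent with the list above might look like this (versions unpinned; the actual file may differ):

```text
pandas
numpy
scikit-learn
matplotlib
seaborn
fastf1
catboost
jupyter
openpyxl
```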
Regenerate the cache from FastF1 (this can take a long time):

jupyter notebook utilities_dataset.ipynb

Process cache files into structured datasets:

jupyter notebook Data_cleaning_and_feat_eng.ipynb

Linear models (per-race and per-circuit):

jupyter notebook Linear.ipynb

Nonlinear models (DT, RF, HGB, CatBoost):

jupyter notebook Nonlinear.ipynb
- Reduced-form prediction: The models predict lap times directly, which reflect a mixture of tyre degradation, driver intent, fuel effects, and race strategy. Results should therefore be interpreted as predictive rather than causal.
- Limited cross-circuit sample: Strict circuit-disjoint evaluation leaves a small number of held-out tracks, implying non-negligible uncertainty in generalization estimates.
- Coarse circuit representation: Geometry features are high-level summaries and may not fully capture track-specific performance demands.
- Richer telemetry-based circuit descriptors
- Sequential models (LSTM / Transformers) for multi-lap dynamics
- Causal decomposition of tyre, driver, and strategy effects
- Cross-era generalization across regulatory regimes
- FastF1 Library: theOehrly/Fast-F1
- F1 Database: f1db/f1db
- OpenF1 API: openf1.org
This project is for educational purposes only. F1 data is property of Formula 1 and is accessed via the FastF1 library under their terms of use.
Author: Jacopo Sinigaglia
