
Short-Horizon Lap Time Prediction and Cross-Circuit Generalization in Formula 1

Modeling performance and generalization across circuits


Overview

This project studies lap time prediction in Formula 1 using machine learning models trained on telemetry, weather, and circuit data from the 2023–2025 seasons. The goal is to understand which modeling approaches generalize across circuits and which features drive performance.

Two complementary modeling settings are explored:

  • Per-race models — trained and evaluated within a single race
  • Per-circuit models — trained on multiple circuits and tested on unseen ones

The latter represents the more realistic and challenging prediction task.


Research Questions

This work addresses two complementary prediction problems:

1. Within-Race Prediction

Can lap times be accurately predicted within a single race when full contextual information is available?

This setting assumes knowledge of the circuit, drivers, and race conditions. It represents an upper bound on achievable performance and measures how well short-term lap dynamics can be modeled when persistent driver- and circuit-specific effects are observed.

2. Cross-Circuit Generalization

Can lap time dynamics be predicted on circuits not seen during training?

This setting removes circuit-specific information and evaluates whether models can generalize across tracks. It represents a fundamentally different learning problem, in which persistent circuit- and driver-specific effects are intentionally removed so that the model must rely on transferable performance signals rather than memorization.

The one-lap-ahead horizon is chosen as the shortest interval at which strategic decisions (pace control, tyre management, pit timing) can realistically react to updated information.

Table of Contents

  1. Data Sources
  2. Feature Engineering
  3. Modeling Setup
  4. Results
  5. Key Findings
  6. Project Structure
  7. Requirements
  8. Usage

Data Sources

  • FastF1 — Official telemetry, lap times, weather, and timing data
  • OpenF1 API — Pit stops, stints, race control events
  • F1DB — Circuit geometry and historical metadata

Coverage: 2023–2025 Formula 1 seasons

Note: The cache/ folder (several GB) is not included in this repository. It can be regenerated by running utilities_dataset.ipynb.


Feature Engineering

All features are strictly causal: for each lap t, only information available up to and including lap t is used. Any variables derived from future laps (e.g., final stint length, post-race aggregates) are explicitly excluded.
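As an illustration, strictly causal features can be built with grouped, shifted transforms in pandas; the column names and lap times below are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Toy laps for one (driver, stint) group; values are made up.
laps = pd.DataFrame({
    "Driver": ["VER"] * 5,
    "Stint": [1] * 5,
    "LapNumber": [1, 2, 3, 4, 5],
    "LapTime": [92.0, 91.5, 91.8, 92.1, 92.4],
})

g = laps.groupby(["Driver", "Stint"])["LapTime"]

# Causal features: only information from laps <= t contributes.
laps["PrevLapTime"] = g.shift(1)                                    # lap t-1
laps["StintMeanToT"] = g.transform(lambda s: s.expanding().mean())  # mean over laps <= t

# The prediction target looks one lap ahead; it is never used as a feature.
laps["NextLapTime"] = g.shift(-1)
```

Shifting the target forward (rather than features backward) keeps every feature row aligned with information available at lap t.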

Tyre & Stint

  • Compound (SOFT, MEDIUM, HARD)
  • Tyre age and stint length
  • New tyre indicator

Race State

  • DRS availability and traffic conditions
  • Race position and gaps
  • Clean/dirty air indicators

Weather

  • Air and track temperature
  • Wind speed and direction (trigonometric encoding)
  • Humidity and pressure
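The trigonometric encoding of wind direction mentioned above can be sketched as follows: mapping a compass bearing to its sine and cosine removes the artificial discontinuity between 359° and 0° (the function name is illustrative, not the project's code):

```python
import numpy as np

def encode_wind_direction(deg):
    """Map a wind direction in degrees to (sin, cos) components,
    so that bearings near 0 deg and 360 deg end up close together."""
    rad = np.deg2rad(deg)
    return np.sin(rad), np.cos(rad)
```

With a raw degree feature, 1° and 359° look maximally far apart to a linear model; after this encoding they are nearly identical points on the unit circle.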

Circuit Geometry (per-circuit models only)

  • Number of turns and braking zones
  • Straight lengths and DRS zones
  • Corner angles and speed profiles

Target Variables

Two related but conceptually distinct prediction targets are used, corresponding to different generalization objectives.

  • Within-race target (RQ1)
    Absolute next-lap time, defined as:

    y_t^(R) = LapTime_{t+1}

    This target captures short-horizon lap time evolution within a fixed race context, where persistent driver-, car-, and circuit-specific effects are observable.

  • Cross-circuit target (RQ2)
    Deviation of the next-lap time from the average lap time of the current stint observed up to lap t:

    y_t^(C) = LapTime_{t+1} − mean(LapTime_stint, ≤ t)

    This transformation removes circuit-specific scale effects and race-level baselines, forcing models to rely exclusively on transferable performance signals such as tyre degradation, weather variation, and race-state dynamics.

While both targets are expressed in seconds, they correspond to different estimands and learning problems and should not be interpreted interchangeably. Performance metrics across the two settings are therefore not directly comparable.
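In pandas terms, the two targets can be sketched for a single stint as follows (lap times are made up for illustration):

```python
import pandas as pd

# One hypothetical stint, lap times in seconds.
stint = pd.DataFrame({"LapTime": [92.0, 91.5, 91.8, 92.1]})

# RQ1 target: absolute next-lap time, y_t^(R) = LapTime_{t+1}.
stint["y_R"] = stint["LapTime"].shift(-1)

# RQ2 target: next-lap deviation from the stint mean observed up to lap t,
# y_t^(C) = LapTime_{t+1} - mean(LapTime_1..t).
stint["y_C"] = stint["LapTime"].shift(-1) - stint["LapTime"].expanding().mean()
```

The expanding mean uses only laps up to and including t, so the RQ2 target stays causal while stripping out the stint's baseline pace.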


Modeling Setup and Information Availability

Each modeling setting corresponds to a different information set, allowing us to distinguish between memorization-driven accuracy and genuinely transferable predictability.

Per-Race Models

  • Objective: Predict lap times within a single race
  • Split: Time-ordered 70/15/15 within each (driver, stint) group
  • Models: Linear Regression, Ridge Regression
  • Features: Tyre, Weather, Race State, Driver/Team identifiers
  • Use Case: Race strategy optimization with full contextual information
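A time-ordered 70/15/15 split within one group can be sketched as below; the helper is hypothetical, not the project's code, and assumes rows are already lap-ordered:

```python
import pandas as pd

def time_ordered_split(group_df, train_frac=0.70, val_frac=0.15):
    """Split one (driver, stint) group chronologically: earliest laps go to
    train, middle laps to validation, latest laps to test."""
    n = len(group_df)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return group_df.iloc[:i], group_df.iloc[i:j], group_df.iloc[j:]

laps = pd.DataFrame({"Lap": range(1, 21)})  # 20 toy laps
train, val, test = time_ordered_split(laps)
```

Splitting chronologically within each group avoids leaking future laps into training while still evaluating on every stint.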

Per-Circuit Models

  • Objective: Generalize to unseen circuits
  • Split: Circuit-disjoint (15 train / 4 validation / 5 test circuits)
  • Models: Linear (Ridge, ElasticNet), Nonlinear (Decision Tree, Random Forest, Hist Gradient Boosting, CatBoost)
  • Features: Tyre, Weather, Race State, Circuit Geometry
  • Validation: GroupKFold cross-validation grouped by circuit
  • Use Case: Pre-season predictions for new or modified circuits
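Circuit-grouped cross-validation can be sketched with scikit-learn's GroupKFold; the features, targets, and circuit labels below are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))            # synthetic features
y = rng.normal(size=60)                 # synthetic targets
circuits = np.repeat(np.arange(6), 10)  # 6 toy circuits, 10 laps each

n_folds = 0
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=circuits):
    # Each fold holds out whole circuits: no circuit appears on both sides.
    assert set(circuits[train_idx]).isdisjoint(circuits[test_idx])
    Ridge(alpha=5000).fit(X[train_idx], y[train_idx])
    n_folds += 1
```

Grouping folds by circuit mirrors the circuit-disjoint test split, so validation scores estimate performance on genuinely unseen tracks.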

Results

1. Per-Race Models (Known Circuits)

| Model | Features | Test MAE (s) | Test RMSE (s) | Test R² |
|---|---|---|---|---|
| Linear | Tyre + Stint | 0.6506 | 0.8413 | 0.9929 |
| Linear | Tyre + Stint + Weather | 0.7012 | 0.9254 | 0.9914 |
| Linear | Full (No Driver/Team) | 0.6102 | 0.8146 | 0.9933 |
| Ridge (α=1.0) | Full + Driver/Team | 0.4199 | 0.5768 | 0.9966 |

Per-race prediction is extremely accurate, driven largely by driver- and team-specific effects.


2. Per-Circuit Models (Unseen Circuits)

Linear & Regularized Models

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Linear | Tyre + Weather | 0.468 | 0.656 | -0.21 |
| Linear | Tyre + Weather + State | 0.365 | 0.521 | 0.239 |
| Ridge (α=5000) | Tyre + Weather + State | 0.375 | 0.527 | 0.222 |

Nonlinear Models

| Model | Features | MAE (s) | RMSE (s) | R² |
|---|---|---|---|---|
| Random Forest | Tyre + Weather + State | 0.335 | 0.486 | 0.338 |
| Random Forest | Tyre + Weather + State + Geometry | 0.333 | 0.484 | 0.342 |
| Hist. Gradient Boosting | Tyre + Weather + State | 0.335 | 0.499 | 0.302 |

Note: Performance metrics across per-race and per-circuit settings are not directly comparable due to differing targets and information sets.


Key Findings

  • Per-race models achieve near-perfect accuracy (R² ≈ 0.997) but rely heavily on driver and circuit identity
  • Per-circuit generalization is significantly harder and exposes model limitations
  • Tree-based ensemble models (Random Forest, Hist. Gradient Boosting) outperform linear approaches when generalizing
  • Geometry features improve performance for nonlinear models but degrade linear ones (multicollinearity)
  • CatBoost showed strong training performance but poor generalization (overfitting)
  • Driver/Team effects are powerful within-race predictors but do not transfer across circuits

Best model for cross-circuit generalization: Random Forest with Tyre + Weather + State + Geometry features (MAE = 0.33s, R² = 0.34)


Project Structure

.
├── cache/                          # FastF1 raw data (not included, ~several GB)
├── csv_output/                     # Processed datasets and model outputs
│   ├── Filtered_Data/              # Final cleaned datasets
│   ├── nonlinear/                  # Model results (JSON/CSV)
│   ├── Train_set.xlsx              # Circuit-disjoint splits
│   ├── Validation_set.xlsx
│   └── Test_set.xlsx
├── figures/                        # Plots and visualizations
├── Formula1_dataset.ipynb          # Data extraction from FastF1
├── Data_cleaning_and_feat_eng.ipynb  # Feature engineering & cleaning
├── Linear.ipynb                    # Linear models (Ridge, ElasticNet)
├── Nonlinear.ipynb                 # Nonlinear models (DT, RF, HGB, CatBoost)
├── utilities_dataset.py            # Helper functions
└── README.md

Requirements

Python 3.8+

pip install -r requirements.txt

Core dependencies:

  • pandas, numpy, scikit-learn
  • matplotlib, seaborn
  • fastf1 (F1 data API)
  • catboost (nonlinear models)
  • jupyter, openpyxl

Usage

1. Data Extraction (Optional)

Regenerate the cache from FastF1 (this can take considerable time):

jupyter notebook utilities_dataset.ipynb

2. Feature Engineering

Process cache files into structured datasets:

jupyter notebook Data_cleaning_and_feat_eng.ipynb

3. Train Models

Linear models (per-race and per-circuit):

jupyter notebook Linear.ipynb

Nonlinear models (DT, RF, HGB, CatBoost):

jupyter notebook Nonlinear.ipynb

Limitations & Future Work

Current Limitations

  • Reduced-form prediction:
    The models predict lap times directly, which reflect a mixture of tyre degradation, driver intent, fuel effects, and race strategy. Results should therefore be interpreted as predictive rather than causal.

  • Limited cross-circuit sample:
    Strict circuit-disjoint evaluation leaves a small number of held-out tracks, implying non-negligible uncertainty in generalization estimates.

  • Coarse circuit representation:
    Geometry features are high-level summaries and may not fully capture track-specific performance demands.

Future Directions

  • Richer telemetry-based circuit descriptors
  • Sequential models (LSTM / Transformers) for multi-lap dynamics
  • Causal decomposition of tyre, driver, and strategy effects
  • Cross-era generalization across regulatory regimes

License

This project is for educational purposes only. F1 data is property of Formula 1 and is accessed via the FastF1 library under their terms of use.


Author: Jacopo Sinigaglia
