This project applies Multiple Linear Regression (MLR) to predict the highway mpg (miles per gallon) of a car (`Fuel Information.Highway MPG`) based on various vehicle features. The dataset includes technical specifications of cars, such as engine type, fuel type, and transmission details.
- Identify key factors affecting highway fuel efficiency.
- Build a robust regression model to predict the highway mpg of a car (`Fuel Information.Highway MPG`).
- Evaluate the model’s performance and optimize feature selection.
📂 MLR Analysis/
│
├── 📄 LICENSE
│
├── 📄 README.md
│
├── 📂 data/
│ └── cars.csv # Your dataset
│
└── 📄 MLR_analysis.ipynb # Google Colab notebook with the full analysis
The dataset includes the following features:
- Vehicle Specifications: `Identification.Model Year`, `Dimensions.Height`, `Dimensions.Width`, etc.
- Engine Details: `Engine Information.Engine Type`, `Torque`, `Horsepower`, etc.
- Fuel Information: `Fuel Type`, `City MPG`, etc.
- Transmission Details: `Number of Forward Gears`, `Transmission Type`.
- Handling Duplicate Values: Removed duplicate entries.
- Encoding Categorical Variables:
  - Applied Target Encoding for `Identification.Make`, `Identification.Model Year`, `Engine Information.Engine Type`, `Engine Information.Driveline`.
  - Applied One-Hot Encoding for `Fuel Information.Fuel Type`, `Engine Information.Transmission`, `Identification.Classification`.
- Feature Scaling: Standardized numerical features where necessary.
- Outlier Treatment: Used IQR method to remove extreme values.
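
The preprocessing steps above can be sketched as follows. This is a minimal illustration on a made-up frame, not the notebook's actual code: the helper names `target_encode` and `remove_iqr_outliers`, the toy rows, and the 1.5 × IQR fence are all assumptions.

```python
import pandas as pd

TARGET = "Fuel Information.Highway MPG"


def target_encode(df, column, target=TARGET):
    """Hypothetical helper: replace each category with its mean target value."""
    means = df.groupby(column)[target].mean()
    return df[column].map(means)


def remove_iqr_outliers(df, columns, k=1.5):
    """Hypothetical helper: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Made-up rows standing in for cars.csv (values are illustrative only).
toy = pd.DataFrame({
    "Identification.Make": ["A", "A", "B", "B", "B", "A"],
    "Fuel Information.Fuel Type": [
        "Gasoline", "Gasoline", "E85", "Gasoline", "Diesel fuel", "Gasoline",
    ],
    TARGET: [30, 30, 24, 26, 28, 90],
})

toy = toy.drop_duplicates()                                    # duplicate handling
toy["Identification.Make_encoded"] = target_encode(toy, "Identification.Make")
toy = pd.get_dummies(toy, columns=["Fuel Information.Fuel Type"])  # one-hot encoding
clean = remove_iqr_outliers(toy, [TARGET])                     # IQR outlier removal
```

In a real pipeline, the target-encoding means would usually be computed on the training split only, so that information from the test rows does not leak into the features.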
A heatmap was generated to analyze the correlation between independent variables and the target variable (`Fuel Information.Highway MPG`).
- `Fuel Information.City mpg` is highly correlated with `Highway MPG`.
- Some features exhibit multicollinearity, requiring feature selection.
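
As a rough sketch of the correlation step, the snippet below builds a correlation matrix and ranks features by their correlation with the target. The data is synthetic and only mimics a strong city/highway relationship; the column values and the coefficient 1.3 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
city = rng.uniform(10, 40, size=200)

# Synthetic stand-in for the cleaned dataset (not real car data).
toy = pd.DataFrame({
    "Fuel Information.City mpg": city,
    "Fuel Information.Highway MPG": 1.3 * city + rng.normal(0, 1.0, size=200),
    "Dimensions.Height": rng.uniform(100, 200, size=200),
})

corr = toy.corr()  # Pearson correlation matrix of all numeric columns
target_corr = corr["Fuel Information.Highway MPG"].drop("Fuel Information.Highway MPG")

# Rank predictors by absolute correlation with the target.
print(target_corr.sort_values(key=abs, ascending=False))

# In a notebook, seaborn.heatmap(corr, annot=True) would draw this matrix.
```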
To select the best predictors:
- Recursive Feature Elimination (RFE): Selected the most impactful features.
- Multicollinearity Check (VIF): Removed features with VIF > 10 to avoid redundancy.
- Correlation Analysis: Chose features highly correlated with the target variable but uncorrelated with each other.
Final Selected Features:
['Fuel Information.City mpg', 'Identification.Model Year_encoded',
'Fuel Information.Fuel Type_Diesel fuel', 'Fuel Information.Fuel Type_E85',
'Fuel Information.Fuel Type_Gasoline']
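
One way the VIF filter and RFE could be combined is sketched below. It assumes an iterative strategy (drop the highest-VIF column, recompute, repeat) that the notebook may or may not use; `vif_table` and `drop_high_vif` are hypothetical helpers, and the data is synthetic. VIF is computed from its definition, 1 / (1 - R²), where R² comes from regressing one feature on the others.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def vif_table(X):
    """VIF_j = 1 / (1 - R2_j), with R2_j from regressing column j on the other columns."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)


def drop_high_vif(X, threshold=10.0):
    """Drop the highest-VIF column repeatedly until every VIF is <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())
    return X


# Synthetic features: "a_copy" is nearly identical to "a", "noise" is irrelevant.
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=n),
    "b": b,
    "noise": rng.normal(size=n),
})
y = 3 * a + 2 * b + rng.normal(scale=0.1, size=n)

X_reduced = drop_high_vif(X)  # removes one of the two collinear columns
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_reduced, y)
selected = list(X_reduced.columns[rfe.support_])
print(selected)
```

RFE with a linear model ranks features by coefficient magnitude, so it is most meaningful when the features are on comparable scales, as they are in this toy example.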
- Model Used: Multiple Linear Regression
- Train-Test Split: 80% Training, 20% Testing
- Evaluation Metrics:
- R² Score: Indicates how well the model explains variance.
- VIF Analysis: Ensures low multicollinearity.
- RMSE: Measures average error in predictions.
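
The setup above might look like the following in scikit-learn. The data here is synthetic, the `random_state` values are arbitrary, and `adjusted_r2` is a hypothetical helper implementing the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 penalizes R2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Synthetic regression problem with 5 predictors (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([4.0, 1.0, 0.5, -0.5, 0.2]) + 25.0 + rng.normal(scale=0.5, size=500)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
adj_r2 = adjusted_r2(r2, n_samples=len(y_test), n_features=X.shape[1])

print(f"R2={r2:.4f}  adjusted R2={adj_r2:.4f}  RMSE={rmse:.4f}")
```

When R² and adjusted R² are computed on the same rows with at least one predictor, adjusted R² is never larger than R².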
The table below compares the model's metrics before and after outlier handling:
| Metric | Before Handling Outliers | After Handling Outliers |
|---|---|---|
| R² Score | 0.9121 | 0.9495 |
| RMSE | 0.0070 | 0.0432 |
| Adjusted R² | 0.9357 | 0.9529 |
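After handling outliers, R² rose from 0.9121 to 0.9495 and adjusted R² from 0.9357 to 0.9529, indicating a better fit. The reported RMSE, however, rose from 0.0070 to 0.0432, so the table does not show reduced error; the two RMSE values may be on different scales or splits and should be rechecked in the notebook. Adjusted R² also exceeds R² in both columns, which cannot happen on the same data, so these were likely computed on different splits (for example, train versus test).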
To check model assumptions, we plotted residuals to ensure they followed a normal distribution and exhibited homoscedasticity.
Key Takeaways from Residual Analysis:
- The residuals approximately follow a normal distribution.
- No clear heteroscedasticity, indicating stable variance.
- Confirms that our model meets linear regression assumptions.
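
The residual checks described above were done with plots; a numeric counterpart could look like the sketch below. It uses synthetic data, a Shapiro-Wilk test for normality, and a Spearman correlation between absolute residuals and fitted values as a rough stand-in for a formal heteroscedasticity test. These choices are assumptions, not the notebook's method.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with constant-variance Gaussian noise (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 30.0 + rng.normal(scale=1.0, size=400)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Normality: Shapiro-Wilk. A large p-value means no evidence against normality.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Rough heteroscedasticity check: do absolute residuals trend with fitted values?
rho, rho_p = stats.spearmanr(np.abs(residuals), fitted)

print(f"Shapiro-Wilk W={shapiro_stat:.3f}, p={shapiro_p:.3f}")
print(f"Spearman rho(|residual|, fitted)={rho:.3f}, p={rho_p:.3f}")
```

A residuals-versus-fitted scatter plot and a histogram or Q-Q plot of `residuals` give the visual version of the same checks.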
- Feature Engineering Matters: Proper encoding and selection significantly improved model performance.
- Multicollinearity Matters: Removing high-VIF features led to more stable coefficients.
- Outlier Handling is Important: Post-cleaning, the model showed better predictive accuracy.
- Business Impact: This model helps automobile manufacturers understand which factors most influence fuel efficiency.
Click the link below to open the project in Google Colab:
[Open in Colab](https://colab.research.google.com/github/Aparna-analyst/Machine-Learning/blob/main/MLR_analysis.ipynb)
- Make sure to upload your dataset (`cars.csv`) to Colab by:
  - Clicking Files on the left sidebar.
  - Clicking Upload and selecting your CSV file.
- If needed, install dependencies directly in Colab by running `!pip install pandas numpy scikit-learn matplotlib seaborn`.
- Execute all the cells in the notebook step by step by pressing Shift + Enter.
- Download your output files by right-clicking on them in the Files section and choosing Download.
- Try Ridge/Lasso Regression to improve regularization.
- Perform Cross-Validation for better generalization.
- Test on new unseen vehicle data to validate real-world performance.
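
A possible starting point for the first two items is sketched below: plain, Ridge, and Lasso regression compared with 5-fold cross-validation. The data is synthetic and the `alpha` values are arbitrary placeholders that would need tuning on the real dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two of the five predictors have no effect (illustrative only).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = X @ np.array([4.0, 1.0, 0.0, 0.0, 0.5]) + 25.0 + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),    # alpha is an arbitrary placeholder
    "lasso": Lasso(alpha=0.05),   # alpha is an arbitrary placeholder
}

cv_r2 = {}
for name, estimator in models.items():
    # Scaling inside the pipeline keeps each fold's statistics separate.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
    cv_r2[name] = scores.mean()

for name, score in cv_r2.items():
    print(f"{name}: mean CV R2 = {score:.4f}")
```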
- Scikit-Learn Documentation: https://scikit-learn.org/
- Pandas Data Manipulation: https://pandas.pydata.org/
Aparna S - LinkedIn



