Modeling and Prediction

The APS (Air Pressure System) is critical for the operation of Scania trucks. APS failures can lead to significant industrial costs and production downtime. Historical sensor data makes it possible to analyze system behavior and detect anomalies early, before they escalate into critical failures.

In this study, a Random Forest model was trained to detect APS failures. Because the dataset is highly imbalanced, class weighting was applied to the rare positive instances. Model performance was evaluated with precision, recall, F1-score, and the confusion matrix, and the associated industrial cost was calculated from the number of false positives and false negatives. To further improve failure detection and reduce total cost, XGBoost was optionally employed. Finally, feature importance analysis was used to identify the sensors that most influence APS failure predictions, with the top features shown as a horizontal bar chart to support industrial decision-making.

pip install xgboost --break-system-packages

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1. Feature / Target Separation

X_train = train_df.drop(columns=['class'])
y_train = train_df['class'].map({'neg': 0, 'pos': 1})  # 1 = APS failure

X_test = test_df.drop(columns=['class'])
y_test = test_df['class'].map({'neg': 0, 'pos': 1})
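
The separation step assumes that `train_df` and `test_df` already exist. A minimal loading sketch follows. The helper names `load_aps` and `split_features_target` are hypothetical, and the 20-line preamble and the `na` missing-value marker are assumptions about the UCI CSV files rather than something stated in this README, so check them against your download.

```python
import io

import pandas as pd


def load_aps(source, preamble_lines=20):
    """Read an APS CSV, skipping the preamble and treating 'na' as missing.

    preamble_lines=20 is an assumption about the UCI files.
    """
    return pd.read_csv(source, skiprows=preamble_lines, na_values="na")


def split_features_target(df):
    """Split a frame into sensor features and a 0/1 target (1 = APS failure)."""
    X = df.drop(columns=["class"])
    y = df["class"].map({"neg": 0, "pos": 1})
    return X, y


# Tiny in-memory stand-in for a real file, with a 2-line preamble.
demo_csv = "license line\n" * 2 + "class,aa_000,ab_000\nneg,10,na\npos,200,3\n"
demo_df = load_aps(io.StringIO(demo_csv), preamble_lines=2)
X_demo, y_demo = split_features_target(demo_df)
print(X_demo)
print(y_demo.tolist())
```

With the real data, the same helper would be called once per file, for example `train_df = load_aps("aps_failure_training_set.csv")`, where the file name is again an assumption.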

2. Missing Value Imputation

Replace missing values using the median (robust to outliers)

train_median = X_train.median()  # computed once, from the training data only
X_train = X_train.fillna(train_median)
X_test = X_test.fillna(train_median)  # use train statistics only
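
The same train-only median imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the training medians and reuses them on any later data. This is a sketch on made-up toy values, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test (values are invented).
train = pd.DataFrame({"aa_000": [1.0, 3.0, np.nan, 100.0],
                      "ab_000": [2.0, np.nan, 4.0, 6.0]})
test = pd.DataFrame({"aa_000": [np.nan, 5.0],
                     "ab_000": [np.nan, 1.0]})

imputer = SimpleImputer(strategy="median")
# fit_transform learns the medians from the training frame only.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# transform reuses those training medians on the test frame.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```

The first test row is filled with the training medians (3.0 and 4.0), even though the test frame has its own, different distribution.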

3. Feature Scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)
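
One way to keep imputation and scaling tied to the training data is to bundle them with the classifier in a scikit-learn `Pipeline`. The sketch below uses synthetic data and a smaller forest than the README, purely for illustration. Tree ensembles are generally insensitive to feature scale, so the scaler mainly matters if the final step is later swapped for a scale-sensitive model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 features, rare positives, about 10% missing cells.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 1.0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50,
                                     class_weight={0: 1, 1: 50},
                                     random_state=42)),
])

# fit() learns medians, scaling statistics and the forest from the training rows only.
pipeline.fit(X[:150], y[:150])
pred = pipeline.predict(X[150:])
print(pred[:10])
```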

4. Random Forest Model with Class Weighting

Strong weight on positive class to reduce costly false negatives

rf_model = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1, 1: 50},  # positive class (APS failure) weighted 50x
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_scaled, y_train)
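
To put the hand-picked 1:50 weighting in context, it can be compared with scikit-learn's `"balanced"` heuristic, which weights each class by `n_samples / (n_classes * class_count)`. The class counts below (59,000 negatives and 1,000 positives in the training set) come from the public dataset description, not from this README, so treat them as an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed training class counts: 59,000 'neg' and 1,000 'pos'.
y = np.array([0] * 59000 + [1] * 1000)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
ratio = weights[1] / weights[0]

print("balanced weights:", weights)
print("positive/negative weight ratio:", ratio)
```

Under that assumption the balanced ratio is 59, so the manual 1:50 weighting is close to simply rebalancing the classes rather than an unusually aggressive setting.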

5. Prediction and Evaluation

y_pred = rf_model.predict(X_test_scaled)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
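
`predict` applies a majority vote, which corresponds roughly to a 0.5 cut-off on the predicted failure probability. When missed failures are costly, a lower cut-off on `predict_proba` is a common way to trade precision for recall. The sketch below is an illustration on synthetic data, and the 0.2 threshold is an arbitrary example value, not a setting used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the APS set.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Probability of the positive class for each test row.
proba = model.predict_proba(X_te)[:, 1]

pred_default = (proba >= 0.5).astype(int)  # roughly what predict() does
pred_low = (proba >= 0.2).astype(int)      # flags more rows as failures

recall_default = recall_score(y_te, pred_default)
recall_low = recall_score(y_te, pred_low)
print("recall at 0.5:", recall_default)
print("recall at 0.2:", recall_low)
```

Lowering the threshold can only keep or increase recall, at the price of more false positives. Whether that trade is worthwhile depends on the cost ratio between the two error types.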

6. Industrial Cost Evaluation

Cost definition (from APS challenge)

COST_FP = 10   # unnecessary workshop inspection
COST_FN = 500  # missed APS failure

# confusion_matrix rows are actual classes, columns are predicted classes
false_positives = conf_matrix[0, 1]
false_negatives = conf_matrix[1, 0]

total_industrial_cost = (
    false_positives * COST_FP
    + false_negatives * COST_FN
)

print("Total Industrial Cost:", total_industrial_cost)
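
The cost definition can be wrapped in a small function and used to choose a decision threshold that minimizes cost. The sketch below does this on a handful of invented probabilities. The helpers `industrial_cost` and `best_threshold` are hypothetical names, and in practice the threshold should be selected on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 10   # unnecessary workshop inspection
COST_FN = 500  # missed APS failure


def industrial_cost(y_true, y_pred):
    """Total cost = 10 * false positives + 500 * false negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp * COST_FP + fn * COST_FN)


def best_threshold(y_true, proba, thresholds=None):
    """Return the (threshold, cost) pair with the lowest cost on the given data."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    costs = [industrial_cost(y_true, (proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmin(costs))
    return float(thresholds[best]), costs[best]


# Invented example: four healthy trucks, two failures.
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.30, 0.60, 0.40, 0.90])

cost_at_half = industrial_cost(y_true, (proba >= 0.5).astype(int))
threshold, cost = best_threshold(y_true, proba)
print("cost at 0.5:", cost_at_half)
print("best threshold:", threshold, "cost:", cost)
```

In this toy case the 0.5 cut-off misses one failure and raises one false alarm, costing 510. A lower threshold catches both failures and leaves only the single false alarm, costing 10.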

7. Confusion Matrix Visualization

plt.figure(figsize=(6, 5))
sns.heatmap(
    conf_matrix,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['Predicted Negative', 'Predicted Positive'],
    yticklabels=['Actual Negative', 'Actual Positive']
)
plt.title("Confusion Matrix – APS Failure Detection")
plt.xlabel("Prediction")
plt.ylabel("Ground Truth")
plt.show()

8. Feature Importance Analysis

feature_importance = pd.Series(
    rf_model.feature_importances_,
    index=X_train.columns
)

top_features = feature_importance.sort_values(ascending=False).head(15)

print("\nTop 15 Most Important Features:")
print(top_features)
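
`feature_importances_` is an impurity-based measure computed on the training data. As a cross-check, permutation importance on held-out data measures how much a chosen metric drops when one feature is shuffled. The sketch below uses synthetic data with hypothetical `sensor_i` column names and scores by recall. It illustrates the technique and is not part of the reported analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"sensor_{i}" for i in range(X.shape[1])]  # hypothetical names
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out split and record the recall drop.
result = permutation_importance(model, X_te, y_te, scoring="recall",
                                n_repeats=5, random_state=0)
perm_importance = pd.Series(result.importances_mean,
                            index=names).sort_values(ascending=False)
print(perm_importance)
```

If the sensors that rank highest by impurity also rank highest here, that supports treating them as the key signals for monitoring.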

9. Feature Importance Visualization

plt.figure(figsize=(10, 6))
top_features.plot(kind='barh')
plt.gca().invert_yaxis()  # most important feature at the top
plt.title("Top 15 Most Influential Sensors for APS Failure Prediction")
plt.xlabel("Importance Score")
plt.show()

Figure 1 shows the confusion matrix of the Random Forest model for APS failure detection under a highly imbalanced class distribution. The model correctly classifies the large majority of non-failure cases: the value 15,607 corresponds to true negatives, that is, trucks without an APS failure that the model correctly identifies as such, and the number of false positives is very low. This indicates a strong ability to avoid unnecessary maintenance interventions. However, a non-negligible number of APS failures are misclassified as normal operation (false negatives). These are the critical cases from an industrial perspective because of their high associated cost. This result underlines the importance of cost-sensitive learning and motivates the use of alternative models, such as XGBoost, to further reduce false negatives and industrial cost.

Figure 2 presents the 15 most influential sensor variables used by the Random Forest model to predict APS failures. The prediction is driven by a limited subset of sensors, with aa_000 the most dominant feature, followed by ci_000, ck_000, and dn_000. This suggests that APS failures are strongly associated with specific operational measurements rather than spread uniformly across all sensors. The concentration of importance in these variables highlights their potential relevance for targeted monitoring and preventive maintenance, since focusing on key sensors could improve fault detection efficiency while reducing system complexity.

Conclusion

The experimental results confirm that the proposed machine learning approach is effective for detecting failures of the Air Pressure System (APS) in heavy-duty trucks. The Random Forest model achieved a high overall accuracy (≈99%) and strong precision for the negative class, indicating reliable identification of non-failure cases. However, recall for APS failures remained moderate (≈58–61%), which reflects the intrinsic difficulty of detecting rare failure events in highly imbalanced industrial datasets.

The integration of class weighting significantly reduced the number of false negatives, which carry the highest industrial cost. Using the defined cost function, the Random Forest model resulted in a total industrial cost of approximately 74,000–78,000 units, a meaningful improvement over non-cost-aware baselines. The XGBoost model substantially outperformed Random Forest in cost optimization, reducing the total industrial cost to approximately 29,850 units, primarily by further decreasing missed APS failures.

Feature importance analysis revealed that a limited set of sensor variables (e.g., aa_000, ci_000, ck_000, dn_000) consistently contributed most to predictive performance. This suggests that APS degradation can be detected through specific operational patterns captured by onboard sensors. Overall, these results support the relevance of cost-sensitive learning for industrial reliability studies and demonstrate the practical value of data-driven predictive maintenance in intelligent transportation systems.
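
As a rough consistency check, the reported Random Forest figures can be related to each other with the cost formula. The test-set class counts below (15,625 negatives and 375 positives) come from the public dataset description, not from this README, and the sketch further assumes that the 15,607 true negatives and the 58–61% recall refer to the same kind of run. Treat the result as an approximation.

```python
COST_FP, COST_FN = 10, 500

# Assumed test-set class counts (from the dataset description).
N_NEG, N_POS = 15625, 375
TRUE_NEGATIVES = 15607  # value quoted for Figure 1

false_positives = N_NEG - TRUE_NEGATIVES


def cost_at_recall(recall):
    """Approximate total cost if the given fraction of failures is detected."""
    expected_false_negatives = N_POS * (1 - recall)
    return false_positives * COST_FP + expected_false_negatives * COST_FN


low = cost_at_recall(0.61)
high = cost_at_recall(0.58)
print("false positives:", false_positives)
print("approximate cost range:", round(low), "to", round(high))
```

Under these assumptions the false positives contribute only 180 units, and the estimated range of roughly 73,300 to 78,900 units is in line with the reported 74,000–78,000. Almost all of the cost comes from missed failures.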

APS Failure at Scania Trucks [Dataset]. (2016). UCI Machine Learning Repository. https://doi.org/10.24432/C51S51
