This project focuses on analyzing flight journey data through Exploratory Data Analysis (EDA) and building a Naive Bayes classification model to predict flight arrival status (on-time or delayed).
- Part 1: Perform EDA on the dataset to explore and visualize different features, uncover relationships, and interpret trends.
- Part 2: Train and test a Naive Bayes classifier to label flights as on-time or delayed.
The dataset contains 10 features, including:
- Carrier
- Departure Time
- Destination
- Date
- Flight Number
- Origin
- Day of the Week
- Day of the Month
Day Encoding
- Day of the Week: 1 = Monday, 2 = Tuesday, …, 7 = Sunday
- Day of the Month: numerical value of the calendar day
Carrier Codes
- CO = Continental
- DH = Atlantic Coast
- DL = Delta
- MQ = American Eagle
- OH = Comair
- RU = Continental Express
- UA = United
- US = US Airways
Destinations
- JFK = Kennedy
- LGA = LaGuardia
- EWR = Newark
Origins
- DCA = Reagan National
- IAD = Dulles
- BWI = Baltimore–Washington Int’l
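The code lists above can be turned into lookup dictionaries so that tables and plots show readable names instead of codes. A minimal sketch, assuming hypothetical column names `carrier` and `day_week` (the dataset's actual headers may differ):

```python
import pandas as pd

# Lookup tables copied from the code lists above.
CARRIER_NAMES = {
    "CO": "Continental", "DH": "Atlantic Coast", "DL": "Delta",
    "MQ": "American Eagle", "OH": "Comair", "RU": "Continental Express",
    "UA": "United", "US": "US Airways",
}
DAY_NAMES = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}

# Tiny made-up frame; the column names are assumptions for illustration.
sample = pd.DataFrame({"carrier": ["DL", "US", "RU"], "day_week": [1, 7, 3]})

# Series.map replaces each code with its dictionary value.
sample["carrier_name"] = sample["carrier"].map(CARRIER_NAMES)
sample["day_name"] = sample["day_week"].map(DAY_NAMES)
print(sample)
```

Codes missing from a dictionary come back as NaN, which doubles as a quick check for unexpected values.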
- Pandas → data handling & transformation
- NumPy → numerical operations
- Matplotlib → visualizations (plots, charts, histograms)
- Seaborn → statistical graphics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data cleaning and transformation steps included:
- Handling null values
- Frequency analysis
- Checking for inconsistencies
- Preparing features for statistical analysis and modeling
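The cleaning steps listed above might look roughly like this in pandas. This is a sketch on made-up rows; the column names (`carrier`, `dep_time`, `status`) and the choice to drop incomplete rows are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the flight data.
flights = pd.DataFrame({
    "carrier": ["DL", "US", None, "dl", "MQ"],
    "dep_time": [1455.0, np.nan, 1245.0, 1715.0, 1039.0],
    "status": ["ontime", "delayed", "ontime", "ontime", "delayed"],
})

# 1. Handle null values: first count them per column.
null_counts = flights.isnull().sum()

# 2. Check for inconsistencies: carrier codes should be upper case.
flights["carrier"] = flights["carrier"].str.upper()

# 3. Drop rows that still have missing values (one possible strategy).
clean = flights.dropna().reset_index(drop=True)

# 4. Frequency analysis of the target and of a categorical feature.
status_freq = clean["status"].value_counts()
carrier_freq = clean["carrier"].value_counts()
print(null_counts, status_freq, carrier_freq, sep="\n\n")
```

Dropping rows is the simplest option; imputing a value (for example with `fillna`) is an alternative when too many rows would be lost.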
The dataset is visualized to reveal distributions, trends, and correlations using:
- Pairplot
- Relplot
- Histogram
- Pie Chart
- Barplot
- Strip Plot
- Jointplot
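As an illustration of two of the plot types above, the sketch below draws a histogram and a pie chart with Matplotlib. The rows are made up and the column names `dep_time` and `status` are assumptions; the project's own figures may be built differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
flights = pd.DataFrame({
    "dep_time": [600, 645, 830, 1245, 1455, 1500, 1715, 1730, 2100],
    "status": ["ontime", "ontime", "ontime", "ontime", "delayed",
               "ontime", "delayed", "delayed", "ontime"],
})

fig, (ax_hist, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of departure times (HHMM format).
ax_hist.hist(flights["dep_time"], bins=6, edgecolor="black")
ax_hist.set_xlabel("Departure time (HHMM)")
ax_hist.set_ylabel("Number of flights")

# Pie chart: share of on-time vs delayed flights.
status_counts = flights["status"].value_counts()
ax_pie.pie(status_counts, labels=status_counts.index, autopct="%1.0f%%")

fig.tight_layout()
# Call fig.savefig("eda.png") here to write the figure to disk.
```

Seaborn offers higher-level equivalents such as `sns.histplot` and `sns.barplot` that take the DataFrame and column names directly.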
Naive Bayes comes in three common variants:
- Gaussian → continuous features
- Multinomial → discrete count features
- Bernoulli → binary features
This project uses the Gaussian variant for prediction.
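For reference, the standard Gaussian Naive Bayes decision rule is written out below. The symbols are introduced here for illustration: $c$ is a class (on-time or delayed), $x_1, \dots, x_n$ are the feature values of one flight, and $\mu_{c,i}$, $\sigma_{c,i}^2$ are the mean and variance of feature $i$ among training flights of class $c$.

```latex
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```

The "naive" part is the product: features are treated as independent of each other once the class is known.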
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score
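# Hold out 99 rows for testing (an integer test_size is a row count, not a fraction).
x_train, x_test, y_train, y_test = train_test_split(gets, target, test_size=99, random_state=5)
model = GaussianNB()
model.fit(x_train, y_train)
model.score(x_test, y_test)  # mean accuracy on the held-out rows

✅ Accuracy Score: 0.8283 (82 of the 99 test flights classified correctly)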
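The imports above include `LabelEncoder`, but the encoding step itself is not shown. One way the full pipeline could look is sketched below on made-up data. The column names, the values, and the choice of which columns to encode are all assumptions, not the project's actual preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Made-up rows (10 patterns repeated 4 times); names and values are illustrative.
flights = pd.DataFrame({
    "carrier": ["DL", "US", "MQ", "DL", "RU", "US", "MQ", "DL", "RU", "US"] * 4,
    "origin": ["DCA", "IAD", "BWI", "DCA", "IAD", "BWI", "DCA", "IAD", "BWI", "DCA"] * 4,
    "day_week": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3] * 4,
    "status": ["ontime", "ontime", "delayed", "ontime", "delayed",
               "ontime", "delayed", "ontime", "ontime", "ontime"] * 4,
})

# Encode each text column as integers, keeping one encoder per column
# so the codes can be mapped back with inverse_transform later.
encoders = {}
encoded = flights.copy()
for column in ["carrier", "origin", "status"]:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(flights[column])

features = encoded.drop(columns="status")
target = encoded["status"]

# An integer test_size is an absolute row count; a float would be a fraction.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=10, random_state=5
)

model = GaussianNB()
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print(f"Test accuracy on the toy data: {accuracy:.4f}")
```

`LabelEncoder` assigns integer codes in sorted order of the labels, so here `delayed` becomes 0 and `ontime` becomes 1.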
The confusion matrix evaluates classification performance by comparing predicted vs actual labels.
- True Positives (TP): positive cases correctly predicted as positive
- True Negatives (TN): negative cases correctly predicted as negative
- False Positives (FP): negative cases incorrectly predicted as positive
- False Negatives (FN): positive cases incorrectly predicted as negative
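The four counts, and the metrics built from them, can be computed with scikit-learn as in the sketch below. The labels are made up, and treating "delayed" as the positive class (1) is an assumption for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 1 = delayed (positive class here), 0 = on-time.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision and recall depend on which class is treated as positive, so swapping the roles of on-time and delayed changes both values while accuracy stays the same.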