This project analyzes flight journey data through Exploratory Data Analysis and visualization, then applies a Gaussian Naive Bayes model to predict whether flights arrive on time or are delayed, achieving ~83% accuracy.

ro-drick/Flight-Data-Analysis

Flight-Data-Analysis

This project focuses on analyzing flight journey data through Exploratory Data Analysis (EDA) and building a Naive Bayes classification model to predict flight arrival status (on-time or delayed).

  • Part 1: Perform EDA on the dataset to explore and visualize different features, uncover relationships, and interpret trends.
  • Part 2: Train and test a Naive Bayes classifier that labels each flight as on-time or delayed.

Dataset Overview

The dataset contains 10 features, including:

  • Carrier
  • Departure Time
  • Destination
  • Date
  • Flight Number
  • Origin
  • Day of the Week
  • Day of the Month

➡️ View Dataset

Day Encoding

  • Day of the Week: 1 = Monday, 2 = Tuesday, … , 7 = Sunday
  • Day of the Month: integer calendar day of the month

Carrier Codes

  • CO = Continental
  • DH = Atlantic Coast
  • DL = Delta
  • MQ = American Eagle
  • OH = Comair
  • RU = Continental Express
  • UA = United
  • US = USAirways

Destinations

  • JFK = Kennedy
  • LGA = LaGuardia
  • EWR = Newark

Origins

  • DCA = Reagan National
  • IAD = Dulles
  • BWI = Baltimore–Washington Int’l
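The code tables above can be applied to the data as lookup dictionaries. The following is a minimal sketch, not the project's own code: the column names (`carrier`, `origin`, `dest`, `day_week`) and the three sample rows are assumptions made for illustration.

```python
import pandas as pd

# Lookup tables copied from the code lists above
CARRIERS = {
    "CO": "Continental", "DH": "Atlantic Coast", "DL": "Delta",
    "MQ": "American Eagle", "OH": "Comair", "RU": "Continental Express",
    "UA": "United", "US": "USAirways",
}
DAYS = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
        5: "Friday", 6: "Saturday", 7: "Sunday"}

# Hypothetical sample rows; the real column names may differ
flights = pd.DataFrame({
    "carrier": ["DL", "US", "MQ"],
    "origin": ["DCA", "IAD", "BWI"],
    "dest": ["JFK", "LGA", "EWR"],
    "day_week": [1, 5, 7],
})

# Translate codes into readable labels for plots and tables
flights["carrier_name"] = flights["carrier"].map(CARRIERS)
flights["day_name"] = flights["day_week"].map(DAYS)
print(flights[["carrier_name", "day_name"]])
```

Readable labels like these are mainly useful for chart axes and legends; the raw codes are still what the model would consume.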

Libraries Used

  • Pandas → data handling & transformation
  • NumPy → numerical operations
  • Matplotlib → visualizations (plots, charts, histograms)
  • Seaborn → statistical graphics

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data Preprocessing

Data cleaning and transformation steps included:

  • Handling null values
  • Frequency analysis
  • Checking for inconsistencies
  • Preparing features for statistical analysis and modeling
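The steps listed above might look roughly like the sketch below. The data frame, its column names, and the choice to drop (rather than impute) incomplete rows are all assumptions for illustration; the repository does not show which strategy was used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one incomplete row and one inconsistent code
flights = pd.DataFrame({
    "carrier": ["DL", "US", None, "DL", "MQ"],
    "dep_time": [1455, 1640, np.nan, 1245, 1715],
    "origin": ["DCA", "IAD", "BWI", "dca", "DCA"],
    "status": ["ontime", "delayed", "ontime", "ontime", "delayed"],
})

# Handling null values: count them per column, then drop incomplete rows
null_counts = flights.isnull().sum()
clean = flights.dropna().copy()

# Checking for inconsistencies: normalize case, then look for unknown codes
clean["origin"] = clean["origin"].str.upper()
unexpected = set(clean["origin"]) - {"DCA", "IAD", "BWI"}

# Frequency analysis: how often each carrier appears
carrier_freq = clean["carrier"].value_counts()

print(null_counts)
print(carrier_freq)
print("Unexpected origin codes:", unexpected)
```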

Data Visualization

The dataset is visualized to reveal distributions, trends, and correlations using:

  • Pairplot
  • Relplot
  • Histogram
  • Pie Chart
  • Barplot
  • Strip Plot
  • Jointplot

Machine Learning Model

Naive Bayes Classification

The Gaussian Naive Bayes algorithm was used for prediction.

Naive Bayes Types:

  • Gaussian → continuous features, modeled as normally distributed within each class
  • Multinomial → discrete count features
  • Bernoulli → binary features

In this project, the Gaussian variant was applied.
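To make the Gaussian variant concrete, the sketch below computes its decision rule by hand with NumPy: each class gets a prior plus a per-feature normal likelihood, and the class with the highest log-posterior wins. The function name `gaussian_nb_predict`, the toy data, and the `1e-9` variance floor are illustrative assumptions; the project itself relies on scikit-learn's `GaussianNB`.

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_new):
    """Predict labels with a hand-rolled Gaussian Naive Bayes rule."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)      # P(class)
        mean = Xc.mean(axis=0)              # per-feature mean for this class
        var = Xc.var(axis=0) + 1e-9         # small floor avoids division by zero
        # Sum of per-feature log normal densities (the "naive" independence assumption)
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var).sum(axis=1)
        log_posteriors.append(np.log(prior) + log_lik)
    # Pick the class with the largest log-posterior for each new row
    return classes[np.argmax(np.vstack(log_posteriors), axis=0)]

# Toy data: two well-separated clusters
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                    [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[1.1, 2.1], [5.1, 5.9]])

predictions = gaussian_nb_predict(X_train, y_train, X_new)
print(predictions)
```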

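from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score

# gets: feature matrix, target: arrival-status labels
# An integer test_size is an absolute count, so 99 rows are held out for testing
x_train, x_test, y_train, y_test = train_test_split(gets, target, test_size=99, random_state=5)
model = GaussianNB()
model.fit(x_train, y_train)
model.score(x_test, y_test)  # mean accuracy on the held-out rows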

Accuracy Score: 0.8283
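The snippet above does not show how the feature matrix and labels were built. One plausible pipeline, sketched here on randomly generated data, label-encodes the categorical columns before fitting. The column names, the variable name `inputs`, the 80/20 split, and the synthetic rows are all assumptions, so the accuracy this prints is unrelated to the 0.8283 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the flight data (hypothetical columns)
rng = np.random.default_rng(5)
n = 200
flights = pd.DataFrame({
    "carrier": rng.choice(["DL", "US", "MQ", "DH"], size=n),
    "origin": rng.choice(["DCA", "IAD", "BWI"], size=n),
    "dest": rng.choice(["JFK", "LGA", "EWR"], size=n),
    "day_week": rng.integers(1, 8, size=n),
    "dep_time": rng.integers(600, 2200, size=n),
    "status": rng.choice(["ontime", "delayed"], size=n, p=[0.8, 0.2]),
})

# Encode each categorical column as integers so GaussianNB can consume it
inputs = flights.drop(columns="status").copy()
for col in ["carrier", "origin", "dest"]:
    inputs[col] = LabelEncoder().fit_transform(inputs[col])

# LabelEncoder sorts classes alphabetically: delayed -> 0, ontime -> 1
target = LabelEncoder().fit_transform(flights["status"])

x_train, x_test, y_train, y_test = train_test_split(
    inputs, target, test_size=0.2, random_state=5
)
model = GaussianNB().fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print(f"Accuracy on synthetic data: {accuracy:.4f}")
```

Note that label encoding assigns arbitrary integer codes, which Gaussian Naive Bayes then treats as continuous values; that is a simplification worth keeping in mind when interpreting the score.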


Confusion Matrix

The confusion matrix evaluates classification performance by comparing predicted vs actual labels.

  • True Positives (TP): Correctly predicted positives
  • True Negatives (TN): Correctly predicted negatives
  • False Positives (FP): Incorrectly predicted positives
  • False Negatives (FN): Incorrectly predicted negatives
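The four counts above, and the metrics derived from them, can be read directly from scikit-learn's `confusion_matrix`. The labels below are made up for illustration, with 1 taken as the positive class; which class the project treats as positive is not stated.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / total
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)

print(cm)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```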

Project Files
