This repository contains a comprehensive collection of 15+ datasets spanning various domains including healthcare, entertainment, transportation, demographics, and more. Each dataset is carefully organized and ready for analysis, making it perfect for:
- Data Science Projects
- Machine Learning Experiments
- Statistical Analysis
- Educational Purposes
- Business Analytics

Table of Contents:

- Overview
- Dataset Categories
- Featured Datasets
- Dataset Details
- Quick Start
- Usage Examples
- Data Insights
- Tools & Libraries
- Contributing
- License
| Category | Count | Description |
|---|---|---|
| Healthcare | 4 | Medical data, diabetes, health camps |
| Transportation | 3 | Cars, traffic, police data |
| Real Estate | 1 | Housing market data |
| Demographics | 2 | Census, population data |
| Education | 3 | Udemy courses, student performance |
| Entertainment | 2 | Netflix content, trending data |
| Pandemic | 1 | COVID-19 statistics |
| Science | 1 | Iris flower classification |
| Historical | 1 | Titanic passenger data |
| Business | 1 | Employee attrition data |
- Diabetes Dataset - Comprehensive health metrics for diabetes prediction
- Health Camp Data - Multi-camp attendance and patient profiles
- Car Dataset - Vehicle specifications and market analysis
- Police Data - Traffic incidents and law enforcement statistics
- Netflix Dataset - Content analysis and viewing patterns
- Trending Data - Social media and content trends
Health & Medical Datasets

Diabetes Dataset (`diabetes.csv`)

- Size: 100,000+ records
- Features: Gender, Age, Hypertension, Heart Disease, BMI, HbA1c Level, Blood Glucose
- Target: Diabetes prediction (binary classification)
- Use Cases: Predictive modeling, health risk assessment

Health Camp Data (`Health_Care_Dataset/`)

- Components: Patient profiles, camp details, attendance records
- Size: Multiple files with 10,000+ records
- Features: Demographics, health metrics, camp participation
- Use Cases: Healthcare analytics, patient behavior analysis (joining the files is sketched below)
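The camp data spans several files, so most analyses start with a merge. A minimal sketch, assuming `Patient_ID` and `Health_Camp_ID` are the join keys; verify the actual column names in each CSV before using it:

```python
import pandas as pd

# Paths follow the repository layout; the join keys below are assumptions
profiles = pd.read_csv('Health_Care_Dataset/Patient_Profile.csv')
camps = pd.read_csv('Health_Care_Dataset/Health_Camp_Detail.csv')
attended = pd.read_csv('Health_Care_Dataset/First_Health_Camp_Attended.csv')

# Attach patient and camp details to each attendance record
merged = (
    attended
    .merge(profiles, on='Patient_ID', how='left')
    .merge(camps, on='Health_Camp_ID', how='left')
)
print(merged.shape)
print(merged.head())
```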
Transportation & Mobility

Car Dataset (`Project_2_Cars_Dataset.csv`)

- Features: Make, model, year, price, specifications
- Use Cases: Price prediction, market analysis, feature comparison (a quick exploration is sketched after this list)

Police Data (`Project_3_Police Data.csv`)

- Content: Incident reports, traffic violations, enforcement data
- Use Cases: Crime analysis, traffic pattern studies
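A quick exploratory sketch for the car data. The file name is the one used in the project examples later in this README; the `Make` and `Price` columns are assumptions, so the snippet checks for them before grouping:

```python
import pandas as pd

cars_df = pd.read_csv('Project_2_Cars_Dataset.csv')
print(cars_df.head())

# Hypothetical column names -- adjust after inspecting the actual header
if {'Make', 'Price'}.issubset(cars_df.columns):
    print(cars_df.groupby('Make')['Price'].mean().sort_values(ascending=False).head(10))
```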
Real Estate & Demographics

Housing Dataset

- Features: Property details, prices, location metrics
- Use Cases: Price prediction, market trends, investment analysis

Census Data

- Content: Demographic statistics, population distribution
- Use Cases: Demographic analysis, policy planning
Education & Learning

Udemy Courses (`Udmey Data/`)

- Features: Course details, ratings, pricing, enrollment
- Use Cases: Course recommendation, pricing strategy

Student Performance (`student-pass-fail-data.csv`)

- Content: Academic performance metrics
- Use Cases: Educational analytics, performance prediction (see the sketch below)
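A minimal look at the student file from the repository layout. The `Pass_Or_Fail` column name is a guess, so the snippet checks before using it:

```python
import pandas as pd

students_df = pd.read_csv('student-pass-fail-data.csv')
print(students_df.head())
print(students_df.describe())

# Hypothetical target column -- replace with the real one after inspecting the header
if 'Pass_Or_Fail' in students_df.columns:
    print(students_df['Pass_Or_Fail'].value_counts(normalize=True))
```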
Entertainment & Media

Netflix Dataset (`Project_8_Netflix_Dataset.csv`)

- Features: Content type, ratings, release dates, genres
- Use Cases: Content analysis, recommendation systems

Trending Data (`Trending/`)

- Content: Social media trends, viral content metrics
- Use Cases: Trend analysis, social media insights
Classic ML Datasets

Iris Dataset (`IRIS.csv`)

- Size: 150 records
- Features: Sepal/Petal dimensions
- Target: Species classification (3 classes)
- Use Cases: Classification tutorials, algorithm comparison (a quick baseline is sketched after this list)

Titanic Dataset (`Titanic_dataset.csv`)

- Size: 400+ records
- Features: Passenger details, ticket info, survival status
- Use Cases: Survival prediction, feature engineering
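A minimal baseline on `IRIS.csv`, assuming the usual layout of four numeric measurement columns followed by a species label in the last column; adjust the column selection if the file differs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris_df = pd.read_csv('IRIS.csv')

# Assumption: the last column is the species label, the rest are numeric features
X = iris_df.iloc[:, :-1]
y = iris_df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```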
Quick Start

```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load any dataset
df = pd.read_csv('diabetes.csv')
# Quick overview
print(df.info())
print(df.describe())
print(df.head())
```

```python
# Diabetes Dataset Analysis
diabetes_df = pd.read_csv('diabetes.csv')
# Distribution of diabetes cases
plt.figure(figsize=(10, 6))
sns.countplot(data=diabetes_df, x='diabetes')
plt.title('Distribution of Diabetes Cases')
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(diabetes_df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
```

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Prepare data
X = diabetes_df.drop(['diabetes'], axis=1)
y = diabetes_df['diabetes']
# Handle categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X_encoded, y, test_size=0.2, random_state=42
)
# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Evaluate
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))
```

```python
# Netflix content analysis
netflix_df = pd.read_csv('Project_8_Netflix_Dataset.csv')
# Content type distribution
plt.figure(figsize=(10, 6))
netflix_df['type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Netflix Content Distribution')
plt.show()
# Release year trends
plt.figure(figsize=(12, 6))
netflix_df['release_year'].hist(bins=30, edgecolor='black')
plt.title('Netflix Content Release Year Distribution')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()
```

| Dataset | Records | Features | Missing Values | Target Variable |
|---|---|---|---|---|
| Diabetes | 100,000+ | 9 | Minimal | Binary |
| Iris | 150 | 5 | None | Multi-class |
| Titanic | 400+ | 12 | Moderate | Binary |
| Netflix | Varies | 10+ | Low | None |

```python
# Data quality assessment function
def assess_data_quality(df, dataset_name):
    print(f"\n=== {dataset_name} Quality Assessment ===")
    print(f"Shape: {df.shape}")
    print(f"Missing values: {df.isnull().sum().sum()}")
    print(f"Duplicate rows: {df.duplicated().sum()}")
    print(f"Data types: {df.dtypes.nunique()} unique types")
    return df.info()
```
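For example, pointing the helper at a couple of the bundled files (this assumes the function above has been defined in the current session):

```python
import pandas as pd

# Run the quality report on two datasets from the repository
assess_data_quality(pd.read_csv('diabetes.csv'), 'Diabetes')
assess_data_quality(pd.read_csv('IRIS.csv'), 'Iris')
```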
Tools & Libraries

- Data Manipulation: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`, `plotly`
- Machine Learning: `scikit-learn`, `tensorflow`, `pytorch`
- Statistical Analysis: `scipy`, `statsmodels`
- Jupyter Environment: `jupyter notebook`, `jupyter lab`

```bash
# Essential packages
pip install pandas numpy matplotlib seaborn

# Machine Learning
pip install scikit-learn tensorflow

# Advanced visualization
pip install plotly dash

# Statistical analysis
pip install scipy statsmodels

# Jupyter environment
pip install jupyter jupyterlab
```

```
Datasets/
├── README.md                     # This comprehensive guide
├── diabetes.csv                  # Primary diabetes dataset
├── diabetes1.csv                 # Secondary diabetes data
├── IRIS.csv                      # Classic iris classification
├── Titanic_dataset.csv           # Historical passenger data
├── testdata.csv                  # Testing dataset
├── CleaneD_testdata_File.csv     # Cleaned test data
├── student-pass-fail-data.csv    # Academic performance
├── Health_Care_Dataset/          # Comprehensive health data
│   ├── Patient_Profile.csv
│   ├── Health_Camp_Detail.csv
│   ├── *_Health_Camp_Attended.csv
│   └── Cleaned_Data/
├── Trending/                     # Social media trends
├── Udmey Data/                   # Educational platform data
└── Project_*_*.csv               # Thematic project datasets
```
```python
from sklearn.ensemble import RandomForestClassifier

# Diabetes risk assessment model
def diabetes_risk_model():
    df = pd.read_csv('diabetes.csv')
    # Feature engineering and model training
    X = pd.get_dummies(df.drop('diabetes', axis=1), drop_first=True)
    y = df['diabetes']
    trained_model = RandomForestClassifier(n_estimators=100, random_state=42)
    trained_model.fit(X, y)
    return trained_model

# Health camp effectiveness analysis
def analyze_health_camps():
    camp_data = pd.read_csv('Health_Care_Dataset/Health_Camp_Detail.csv')
    attendance = pd.read_csv('Health_Care_Dataset/First_Health_Camp_Attended.csv')
    # Analysis code here
```

```python
# Car price prediction
def predict_car_price():
    cars_df = pd.read_csv('Project_2_Cars_Dataset.csv')
    # Price prediction model

# Traffic pattern analysis
def analyze_police_data():
    police_df = pd.read_csv('Project_3_Police Data.csv')
    # Traffic and crime pattern analysis
```

```python
# Netflix content recommendation
def netflix_recommender():
    netflix_df = pd.read_csv('Project_8_Netflix_Dataset.csv')
    # Recommendation algorithm

# Trending content predictor
def predict_trending():
    trends_df = pd.read_csv('Trending/trending.csv')
    # Trend prediction model
```

```python
from sklearn.model_selection import train_test_split

class DataProcessor:
    def __init__(self, dataset_path):
        self.df = pd.read_csv(dataset_path)

    def clean_data(self):
        # Remove duplicates
        self.df = self.df.drop_duplicates()
        # Handle missing values
        self.df = self.df.fillna(self.df.mean(numeric_only=True))
        return self

    def feature_engineering(self):
        # Create new features
        # Encode categorical variables
        self.df = pd.get_dummies(self.df, drop_first=True)
        return self

    def split_data(self, target_column):
        # Train-test split logic
        X = self.df.drop(target_column, axis=1)
        y = self.df[target_column]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        return X_train, X_test, y_train, y_test
```
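A usage sketch for the pipeline above, using the diabetes file and its `diabetes` target column from the earlier examples:

```python
# Chain the steps: clean, encode, then split on the target column
processor = DataProcessor('diabetes.csv')
X_train, X_test, y_train, y_test = (
    processor.clean_data()
             .feature_engineering()
             .split_data('diabetes')
)
print(X_train.shape, X_test.shape)
```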
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch: `git checkout -b feature/new-dataset`
- Add your dataset with proper documentation
- Commit changes: `git commit -am 'Add new healthcare dataset'`
- Push to the branch: `git push origin feature/new-dataset`
- Submit a Pull Request
- Include dataset description and source
- Provide data dictionary/schema
- Add usage examples
- Ensure data quality and cleanliness
- Follow naming conventions
- Diabetes Dataset: Healthcare research compilation
- Iris Dataset: R.A. Fisher's classic botanical study
- Titanic Dataset: Historical maritime records
- Netflix Dataset: Public streaming platform data
- Health Camp Dataset: Medical outreach program data
This dataset collection is available under an open-source license.

- Free for educational and research purposes
- Free for commercial use with attribution
- Modification and redistribution allowed
- No warranty provided
When using these datasets, please cite:
Dataset Collection by itsluckysharma01
GitHub: https://github.com/itsluckysharma01/Datasets
- Clone the repository
- Install required packages
- Choose a dataset that interests you
- Load and explore the data
- Run example analyses
- Build your own models!
- Email: [Your Contact]
- Issues: Open a GitHub issue
- Wiki: Check our documentation
Last updated: September 2025