This project explores the MovieLens dataset using MongoDB for non-relational data handling and Python for querying, analysis, and visualization. It aims to uncover patterns in movie popularity, user preferences, genre evolution, and rating behaviors over time.
├── notebooks/
│   ├── dataWrangling.ipynb               # Data loading, merging, transformation & JSON export
│   └── extra.ipynb                       # MongoDB queries & visualizations using Python
├── merged_movies_ratings.json            # Final processed dataset
├── dump/                                 # Intermediate datasets
├── Pictures/
├── SMBUD Project - Federica Topazio.pdf  # Full report
├── README.txt                            # Citations of dataset
└── README.md                             # This file

MovieLens 100k dataset containing:
- ~100,000 ratings from 610 users
- 9,742 movies with genre metadata
- Timestamps for all ratings (from 1996 to 2018)
Main fields:
userId, movieId, title, genres, rating, timestamp, release year
- MongoDB: flexible document-based schema & aggregation queries
- Python (pandas, matplotlib, seaborn, pymongo): data wrangling and visualization
- Jupyter Notebooks: analysis and execution of Python code
- CSV loading, cleaning & transformation
- Genre parsing and timestamp formatting
- Normalization of ratings
- JSON export for MongoDB
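
The wrangling steps listed above can be sketched roughly as follows. This is a minimal illustration with pandas, not the notebook's actual code: the two small DataFrames stand in for the MovieLens CSV files, the column names `release_year` and `rating_norm` are hypothetical, and scaling ratings from the 0.5-5.0 range to 0-1 is only one assumed reading of "normalization".

```python
import json

import pandas as pd

# Tiny stand-ins for movies.csv and ratings.csv (illustrative rows only).
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 2.5],
    "timestamp": [964982703, 847434962, 964982224],  # Unix seconds
})


def wrangle(movies: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame:
    """Merge ratings with movie metadata and derive analysis-friendly fields."""
    df = ratings.merge(movies, on="movieId", how="inner")
    # Genres arrive as a pipe-separated string; store them as a list
    # so that each document carries a genuine array.
    df["genres"] = df["genres"].str.split("|")
    # Convert Unix seconds to ISO 8601 strings.
    df["timestamp"] = (
        pd.to_datetime(df["timestamp"], unit="s").dt.strftime("%Y-%m-%dT%H:%M:%S")
    )
    # The release year is the four-digit number in trailing parentheses.
    year = df["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
    df["release_year"] = pd.to_numeric(year, errors="coerce").astype("Int64")
    # Assumed normalization: rescale the 0.5-5.0 rating scale to 0-1.
    df["rating_norm"] = (df["rating"] - 0.5) / 4.5
    return df


merged = wrangle(movies, ratings)
# One JSON object per rating, ready to be imported into MongoDB.
records = json.loads(merged.to_json(orient="records"))
```

Storing `genres` as a list at this stage is what later allows MongoDB to treat each genre as a separate array element.
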
- Average rating per genre
- User behavior based on rating activity
- Most polarizing movies (variance in ratings)
- Rating trends over time (e.g., for Toy Story)
- Genre popularity and evolution over the years
- Temporal patterns (monthly ratings)
- Correlation between number of ratings and average rating
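
Two of the analyses above (average rating per genre and most polarizing movies) could be expressed as aggregation pipelines along these lines. This is a hedged sketch that assumes one document per rating with `genres` stored as an array; the output field names (`avgRating`, `numRatings`, `stdDev`), the `min_ratings` threshold, and the use of standard deviation as the spread measure are illustrative choices, not necessarily those used in the notebooks.

```python
# Average rating per genre: one output document per genre.
AVG_RATING_PER_GENRE = [
    {"$unwind": "$genres"},  # emit one document per (rating, genre) pair
    {"$group": {
        "_id": "$genres",
        "avgRating": {"$avg": "$rating"},
        "numRatings": {"$sum": 1},
    }},
    {"$sort": {"avgRating": -1}},
]


def most_polarizing_pipeline(min_ratings=50, limit=10):
    """Rank movies by the spread of their ratings.

    Movies with fewer than `min_ratings` ratings are dropped so that a
    handful of extreme votes does not dominate the ranking.
    """
    return [
        {"$group": {
            "_id": "$title",
            "stdDev": {"$stdDevPop": "$rating"},
            "numRatings": {"$sum": 1},
        }},
        {"$match": {"numRatings": {"$gte": min_ratings}}},
        {"$sort": {"stdDev": -1}},
        {"$limit": limit},
    ]


def run(collection, pipeline):
    """Execute a pipeline on a pymongo-style collection and return a list."""
    return list(collection.aggregate(pipeline))
```

With pymongo, `collection` would be something like `MongoClient()["movielens"]["ratings"]`, where the database and collection names are placeholders.
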
- Bar chart: average rating per genre
- Scatter plot: user engagement vs. average rating
- Variance bar chart: most polarizing movies
- Line plot: rating trends for specific movies over time
- Heatmap: genre popularity changes by year
- Monthly rating trends
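
As an illustration of the first chart in this list, the sketch below draws a bar chart of average rating per genre with matplotlib. It assumes the input is a list of dictionaries with an `_id` key (the genre) and an `avgRating` key, which is a hypothetical shape; the sample values are placeholders, not results from the dataset.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt


def plot_avg_rating_per_genre(results, path=None):
    """Draw a bar chart of average rating per genre, highest first."""
    ordered = sorted(results, key=lambda r: r["avgRating"], reverse=True)
    genres = [r["_id"] for r in ordered]
    averages = [r["avgRating"] for r in ordered]

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(genres, averages)
    ax.set_xlabel("Genre")
    ax.set_ylabel("Average rating")
    ax.set_title("Average rating per genre")
    ax.set_ylim(0, 5)  # MovieLens ratings top out at 5 stars
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    return ax


# Placeholder values for demonstration only.
sample = [
    {"_id": "Comedy", "avgRating": 3.0},
    {"_id": "Drama", "avgRating": 4.0},
    {"_id": "Horror", "avgRating": 2.5},
]
ax = plot_avg_rating_per_genre(sample)
```
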
- MongoDB is well-suited for semi-structured datasets with nested lists (e.g., genres)
- Aggregation pipelines enable complex, efficient querying
- Visual exploration of rating patterns can inform recommender systems
Federica Topazio
Politecnico di Milano | Systems and Methods for Big and Unstructured Data (2023–2024)
This project is licensed for academic and research purposes.