This repository contains most of the contents of the University of Lleida MEINF Datamining Course (Fall 2019). The rest of the materials will be uploaded to the Virtual Campus.
This repository will be updated and revised as the course progresses.
The contents are divided into two kinds of classes: laboratories and sessions. Laboratories are intended as "tutorials" in which we learn how to install, develop, and run specific pieces of code or software. Sessions are more theoretical, although they maintain the hands-on spirit.
- Laboratory 0: Data Processing Environments
- Session 1: Introduction to Data Cleaning and Pandas Data Structures
  - Notebook 1: Introduction to Data Cleaning
  - Notebook 2: Pandas Data Structures
- Session 2: Technically Correct and Consistent Datasets
  - Notebook 3: From Raw to Technically Correct Data
  - Notebook 4: From Technically Correct to Consistent Data
- Session 3: Exploratory Data Analysis and Data Pipelines
  - Notebook 5: Exploratory Data Analysis
  - Notebook 6: Data Pipelines
  - Notebook 6 extra: Apache Spark DataFrame
  - Notebook 6 extra: Apache Spark MLlib