This project investigates the effect of a major annual event in Barcelona on rental prices listed on Booking. Data is collected for at least two different weeks in Barcelona and in a control city (Milan), so that price variations can be analyzed with a difference-in-differences (DiD) model. Additionally, a text analysis of the accommodation descriptions identifies patterns in the text that are associated with prices.
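To make the DiD idea concrete, here is a minimal sketch of the basic 2x2 estimator: the price change in the treated city minus the price change in the control city. The function name `did_estimate`, the tuple layout, and the sample prices are assumptions made up for illustration; they are not taken from the project's code or data, and the notebook's regression (which adds controls) is not reproduced here.

```python
from statistics import mean

def did_estimate(rows):
    """Return the 2x2 difference-in-differences estimate.

    rows: iterable of (is_treated_city, is_event_week, price) tuples.
    (Hypothetical layout, chosen only for this sketch.)
    """
    cells = {}
    for treated, event_week, price in rows:
        cells.setdefault((bool(treated), bool(event_week)), []).append(price)

    # All four city/week combinations are needed for the comparison
    for key in [(True, True), (True, False), (False, True), (False, False)]:
        if key not in cells:
            raise ValueError(f"no observations for cell {key}")

    avg = {key: mean(prices) for key, prices in cells.items()}
    treated_change = avg[(True, True)] - avg[(True, False)]
    control_change = avg[(False, True)] - avg[(False, False)]
    return treated_change - control_change

# Made-up prices, for illustration only
sample = [
    (True, False, 100.0), (True, False, 120.0),   # Barcelona, ordinary week
    (True, True, 180.0), (True, True, 200.0),     # Barcelona, event week
    (False, False, 90.0), (False, False, 110.0),  # Milan, ordinary week
    (False, True, 95.0), (False, True, 115.0),    # Milan, event week
]
effect = did_estimate(sample)
print(effect)
```

Without additional controls, this difference of mean changes equals the coefficient on the interaction term in a regression of price on a treated-city dummy, an event-week dummy, and their product.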
textmining_booking/
|-- booking/
| |-- packages/
| | |-- __pycache__/
| | |-- __init__.py # Package initialization file
| | |-- dataloading.py # Data loading and cleaning
| | |-- processing.py # Data processing
| | |-- scraper.py # Web scraper from Booking
| | |-- selenium_setup.py # Selenium setup for scraping
| |-- Barcelona_MWC.csv # Data extracted from Barcelona
| |-- Milan_MWC.csv # Data extracted from Milan
| |-- geckodriver.exe # Selenium driver for Firefox
|-- ITM_HW1.ipynb # Main notebook
|-- hw1.pdf # Document with project requirements
|-- README.md # Description of the project structure
|-- requirements.txt # Dependencies required to run
|-- setup.py # Installation and setup script
To install the required dependencies, run:
pip install -r requirements.txt
Selenium Setup:
- Download geckodriver.exe for Firefox, or use the appropriate driver for Chrome.
- Place it in the project folder.
Run the Files:
python packages/scraper.py
python packages/dataloading.py
python packages/processing.py
These scripts run searches on Booking, extract the data that match our search criteria, and preprocess the description of each hotel.
Data Analysis:
- Run the ITM_HW1.ipynb notebook in Jupyter Notebook to view the exploratory analysis, data cleaning, and the DiD regression. The notebook uses pipelines to call the functions defined in the .py files.
- Rental price data is collected from Booking for Barcelona and Milan.
- Navigation through multiple result pages is automated.
- Accommodation descriptions are also extracted for text analysis.
- Text preprocessing is performed by removing stopwords and applying stemming.
- Wordclouds are generated before and after preprocessing.
- Terms associated with higher prices are explored.
- The impact of the event on prices is estimated using a difference-in-differences model.
- Additional controls based on text descriptions are included.
- Heterogeneous effects are explored according to accommodation quality.
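As a rough illustration of the text preprocessing listed above (stopword removal followed by stemming), the sketch below uses only the standard library. The stopword set and the suffix-stripping `naive_stem` function are simplified stand-ins invented for this example; they are not the stopword list or stemmer used in processing.py, which this README does not specify.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller one
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "is", "from", "for"}

# Suffixes removed by this deliberately naive stemmer, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(description):
    """Lowercase, tokenize on letters, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]

tokens = preprocess("Rooms with stunning views of the beach")
print(tokens)
```

A real stemmer (for example a Porter-style algorithm) handles many more cases than this suffix list; the point here is only the order of the steps: tokenize, filter stopwords, then stem.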
This project was developed as part of an academic assignment. It is recommended to follow good practices in web scraping and respect the terms of service of the platforms used.