Data collection tools for scraping adoption websites and missing children databases. This repository contains the scrapers and raw datasets used in the missing-children-ai-search project for identifying missing Ukrainian children through facial recognition AI.
## Data Sources
This repository collects data from three main sources:
- Russian data: usynovite.ru (Russian adoption website)
- Belarusian data: dadomu.by (Belarusian adoption website)
- Ukrainian data: childrenofwar.gov.ua (Ukrainian government database for displaced children)
## Requirements

- Python 3.x
- Jupyter Notebook
- `curl`, `awk`, and `jq` (for image downloads)
## Usage

Install the Python dependencies:

```
pip install -r requirements.txt
```

Then run the Jupyter notebooks to download profile data from each website:

- `children_of_war.ipynb` - Ukrainian data
- `dadomy.ipynb` - Belarusian data
- `usynovite.ipynb` - Russian data
Use the bash script to download the images referenced in the scraped data files:

```
./download_images.sh
```

Images are saved with filenames matching the profile ID from the source URL.
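`download_images.sh` is the canonical downloader; as an illustration only, the same naming convention (image filename = profile ID taken from the source URL) could be sketched in Python. The assumption that the profile ID is the last path segment of `profile_link`, and the function names themselves, are hypothetical, not taken from the script:

```python
import json
import os
import urllib.request
from urllib.parse import urlparse


def profile_id_from_url(url: str) -> str:
    """Derive a profile ID from a source URL.

    Assumes the ID is the last path segment,
    e.g. https://example.org/profiles/12345 -> "12345".
    """
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def download_images(jsonl_path: str, out_dir: str) -> None:
    """Download every image_url in a JSONL file, naming each
    saved file after the ID extracted from profile_link."""
    os.makedirs(out_dir, exist_ok=True)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # Keep the original image extension when present.
            ext = os.path.splitext(urlparse(record["image_url"]).path)[1] or ".jpg"
            dest = os.path.join(out_dir, profile_id_from_url(record["profile_link"]) + ext)
            if not os.path.exists(dest):  # skip files already downloaded
                urllib.request.urlretrieve(record["image_url"], dest)
```

The skip-if-exists check makes re-runs cheap, mirroring how a `curl`-based loop can be resumed after an interrupted download.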
## Data

The `data/` directory contains:

- Profile data: JSONL files with `profile_link`, `image_url`, and `description` columns
- Images: ZIP archives of the downloaded profile images (image filename = profile ID)
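The JSONL layout makes the profile files easy to inspect programmatically. A minimal reader, assuming one JSON object per line with the three columns listed above (the function name is illustrative):

```python
import json


def load_profiles(jsonl_path: str) -> list[dict]:
    """Read a JSONL profile file into a list of dicts with
    profile_link, image_url, and description keys,
    skipping blank lines."""
    records = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Loading line by line (rather than parsing the whole file as one JSON array) keeps memory use flat even for large scrapes.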
## Notes

- The Russian website (usynovite.ru) was accessed via a VPN exiting in Serbia
- The Belarusian website (dadomu.by) requires login credentials, which are not shared in order to protect the account owner
The facial recognition analysis built on this data is available at texty/missing-children-ai-search.