Standardize Data#122

Open
zaeema-n wants to merge 10 commits into LDFLK:main from zaeema-n:fix_data_again

Conversation

@zaeema-n
Member

@zaeema-n zaeema-n commented Feb 1, 2026

No description provided.

@github-actions

github-actions bot commented Feb 1, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://LDFLK.github.io/datasets/pr-preview/pr-122/

Built to branch gh-pages at 2026-02-01 06:52 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@gemini-code-assist
Contributor

Summary of Changes

Hello @zaeema-n, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the consistency and structure of the statistical data stored in JSON files. It systematically removes redundant 'Year' columns, standardizes various column names, and cleans up aggregated 'Total' entries that can be derived programmatically. These changes improve the clarity and usability of the raw data. Additionally, a new automation script is introduced to facilitate the ingestion process for multiple years, ensuring a more efficient and controlled data pipeline.
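
As an illustration of how the removed 'Total' entries could be derived programmatically, here is a minimal Python sketch. It assumes each row is a list whose first cell is a label and whose remaining cells are numeric, which matches the row layout visible in this PR but is not confirmed for every dataset. The function names and sample values are hypothetical.

```python
def total_row(rows, label="Total"):
    """Build a derived 'Total' row by summing each numeric column.

    Assumes the first cell of every row is a label and the rest are numbers.
    """
    numeric_columns = zip(*(row[1:] for row in rows))
    return [label, *(sum(col) for col in numeric_columns)]


def with_total_column(rows):
    """Return new rows with a derived per-row total appended."""
    return [[*row, sum(row[1:])] for row in rows]


# Hypothetical sample data, not taken from the repository.
rows = [["District A", 3, 4], ["District B", 1, 2]]
derived_total_row = total_row(rows)
rows_with_totals = with_total_column(rows)
```

Computing totals at read time keeps the stored data free of values that can drift out of sync with the cells they summarize.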

Highlights

  • Data Structure Standardization: The 'Year' column has been removed from numerous 'data.json' files across various datasets (e.g., Annual Tourism Receipts, Complaints, SLBFE registrations, Workers Remittances) for years 2019-2024, as the year is implicitly part of the file path.
  • Metadata Updates: Corresponding 'metadata.json' files have been updated to reflect the reduced 'column_count' after the removal of the 'Year' column. An exception is 'data/statistics/2019/datasets/Tourist Attractions Revenue and Visitors/metadata.json' where the column count increased from 3 to 6, indicating additional columns were added or corrected.
  • Column Name Consistency: Several datasets (e.g., Tourist Arrivals by Age, Asylum Seekers By Nationality, Prime Minister Recurrent Expenditure) have had column names standardized for better readability and consistency (e.g., 'Age Group' to 'Age', 'No. of Persons' to 'No. of persons', 'Ministry Office' to 'Institution').
  • Redundant Data Removal: 'Total' rows or columns have been removed from some datasets (e.g., SLBFE Registration by District, SLBFE Registration through Private Sources, SLBFE registration by country) to streamline the raw data.
  • Automated Data Ingestion Script: A new shell script, 'ingest_all_years.sh', has been added to automate the ingestion of statistics data for years 2020 through 2025, incorporating a 3-minute delay between each year's processing.
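
The first two highlights (dropping 'Year' and updating 'column_count') can be sketched as a single step so the two files cannot drift apart. This is a rough Python sketch, assuming 'data.json' holds a "columns" list and a "rows" list of lists; that layout, the function name, and the sample values are assumptions, not the repository's confirmed schema or tooling.

```python
def drop_column(table, metadata, name):
    """Return copies of table and metadata with column `name` removed.

    `table` is assumed to look like {"columns": [...], "rows": [[...], ...]}.
    `column_count` in the returned metadata always matches the new columns.
    """
    columns = list(table["columns"])
    rows = [list(row) for row in table["rows"]]
    if name in columns:
        idx = columns.index(name)
        del columns[idx]
        for row in rows:
            del row[idx]
    new_table = dict(table, columns=columns, rows=rows)
    new_metadata = dict(metadata, column_count=len(columns))
    return new_table, new_metadata


# Hypothetical sample values for illustration only.
table = {"columns": ["Year", "Male", "Female"], "rows": [[2020, 10, 12]]}
metadata = {"storage_type": "tabular", "dataset_name": "Example dataset", "column_count": 3}
new_table, new_metadata = drop_column(table, metadata, "Year")
```

Deriving 'column_count' from the column list, rather than editing it by hand, is one way to avoid the kind of mismatch a manual update can introduce.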

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions bot commented Feb 1, 2026

📦 Documentation Build

Status               Artifact
✅ Build successful  Download docs-build-pr-122

To preview locally:

  1. Click the artifact link above
  2. Scroll to "Artifacts" section and download docs-build-pr-122
  3. Extract the zip file
  4. Run npx serve . in the extracted folder
  5. Open http://localhost:3000

Built from commit bf4a213

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request focuses on standardizing various data files by removing redundant 'Year' and 'Total' columns/rows, and harmonizing column names for better consistency. These are positive changes that improve data quality. A new script to automate data ingestion for multiple years is also a valuable addition. My review identifies a few critical issues where metadata (column_count) was not correctly updated to match the data changes, which could lead to processing errors. I also found one instance of a country name that does not follow the repository's standardization rules, and a bug in the new ingestion script that will cause it to fail. Addressing these points will ensure the data and scripts are accurate and robust.

  "storage_type": "tabular",
  "dataset_name": "SLBFE registration by gender",
- "column_count": 6,
+ "column_count": 5,

critical

The column_count is incorrect. The corresponding data.json file was updated to have 2 columns ("Male", "Female"), but the column_count here is set to 5. It should be updated to 2 to match the data.

Suggested change
- "column_count": 5,
+ "column_count": 2,

  "storage_type": "tabular",
  "dataset_name": "SLBFE registration by manpower level",
- "column_count": 16,
+ "column_count": 15,

critical

The column_count is incorrect. The corresponding data.json file was updated to have 14 columns after removing the "Year" and "Total" columns, but the column_count here is set to 15. It should be updated to 14 to reflect the actual number of columns.

Suggested change
- "column_count": 15,
+ "column_count": 14,

  "storage_type": "tabular",
  "dataset_name": "Workers Remittances",
- "column_count": 6,
+ "column_count": 5,

critical

The column_count is incorrect. The corresponding data.json file was updated to have 4 columns after removing the "Year" and "Middle East as a % of total remittance" columns, but the column_count here is set to 5. It should be updated to 4.

Suggested change
- "column_count": 5,
+ "column_count": 4,
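
The three column_count comments above describe the same class of mismatch, which a small consistency check could catch before review. The following is a minimal Python sketch, assuming 'data.json' exposes its header as a "columns" list (an assumption about the file layout); the function name is hypothetical and this is not an existing check in the repository.

```python
def column_count_mismatch(table, metadata):
    """Return a message if metadata's column_count disagrees with the data.

    Returns None when the declared count matches the number of columns.
    `table` is assumed to carry its header under the "columns" key.
    """
    actual = len(table["columns"])
    declared = metadata.get("column_count")
    if declared == actual:
        return None
    name = metadata.get("dataset_name", "<unnamed dataset>")
    return f"{name}: column_count is {declared}, data has {actual} columns"


# Values taken from the first review comment: two columns, declared count 5.
message = column_count_mismatch(
    {"columns": ["Male", "Female"], "rows": []},
    {"dataset_name": "SLBFE registration by gender", "column_count": 5},
)
```

Running such a check over every dataset directory in CI would turn these review comments into automated failures, though how the repository wires up validation is outside what this PR shows.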

REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$REPO_ROOT"

for year in 2020 2021 2022 2023 2024 2025; do

high

The loop includes the year 2025, but there is no data for 2025 in the repository. The ingestion script will fail when it tries to access data/statistics/2025/data_hierarchy_2025.yaml because the file does not exist and the script is configured to exit on error (set -e). The loop should end at 2024 to prevent this failure.

Suggested change
- for year in 2020 2021 2022 2023 2024 2025; do
+ for year in 2020 2021 2022 2023 2024; do
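
An alternative to hard-coding the final year is to ingest only the years whose hierarchy file actually exists. The sketch below is in Python rather than the script's bash, and is only an illustration of that guard: the path pattern data/statistics/<year>/data_hierarchy_<year>.yaml comes from the comment above, while the function name is hypothetical.

```python
from pathlib import Path


def years_with_data(repo_root, candidates):
    """Return the candidate years that have a hierarchy file on disk.

    Checks for data/statistics/<year>/data_hierarchy_<year>.yaml under
    repo_root and silently skips years where it is missing.
    """
    root = Path(repo_root)
    return [
        year
        for year in candidates
        if (root / "data" / "statistics" / str(year) / f"data_hierarchy_{year}.yaml").is_file()
    ]


# Example call against the current directory; the result depends on what is on disk.
available = years_with_data(".", range(2020, 2026))
```

With a guard like this, adding 2025 data later would need no change to the loop, at the cost of silently skipping a year whose file was removed by mistake.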

["Papua New Guinea", 0, 11, 0, 3, 0, 1, 0, 0, 1, 0, 1, 0, 1],
["Philippines", 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 1],
["Qatar", 24, 441, 51, 4606, 438, 7, 137, 26, 315, 61, 682, 196, 2705],
["Republic of Dominica", 0, 11, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0],

medium

The country name "Republic of Dominica" should be standardized to "Dominica" to maintain consistency across datasets, as per the repository's general rules.

Suggested change
- ["Republic of Dominica", 0, 11, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0],
+ ["Dominica", 0, 11, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0],

References
  1. Standardize country names across datasets to avoid confusion between similarly named countries. For example, "Republic of Dominican" should be standardized to "Dominican Republic", and "Republic of Dominica" should be standardized to "Dominica".
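
The rule above can be applied mechanically with an alias table. This is a minimal Python sketch containing only the two mappings the rule names; it assumes the country name is the first cell of each row, as in the rows shown in this review, and the identifiers are hypothetical rather than part of the repository.

```python
# Only the two aliases named in the repository rule; extend as needed.
COUNTRY_ALIASES = {
    "Republic of Dominica": "Dominica",
    "Republic of Dominican": "Dominican Republic",
}


def standardize_country(name):
    """Map a known alias to its standard name; leave other names unchanged."""
    cleaned = name.strip()
    return COUNTRY_ALIASES.get(cleaned, cleaned)


def standardize_rows(rows):
    """Return new rows with the first cell (assumed country name) standardized."""
    return [[standardize_country(row[0]), *row[1:]] for row in rows]


# Row taken from the flagged line in this review.
fixed = standardize_rows([["Republic of Dominica", 0, 11, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0]])
```

An exact-match table is deliberately conservative: it never guesses between "Dominica" and "Dominican Republic" for a name it has not seen.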
