@kmoscoe

@kmoscoe kmoscoe commented Jan 12, 2026

This PR does the following:

  • Removes variable definitions from config.json and puts them in variables.mcf instead
  • Adds more metadata to the variable definitions
  • Adds a unit column to a CSV
  • Fixes the provenance definitions, which were incorrect and pointed to the wrong URLs

Also removes the examples/ directory and all of its files, as they were of little use for external purposes.

I have started up a local instance with these files and everything looks good: https://bullie.svl.corp.google.com:8080
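For reviewers who have not worked with MCF before, here is a minimal Python sketch of how blank-line-separated `Node:` blocks like those in variables.mcf can be read. It assumes the usual `property: value` layout of MCF; the property values in the sample are illustrative placeholders rather than the actual contents of the file in this PR, and `parse_mcf` is a hypothetical helper, not part of any Data Commons tool.

```python
def parse_mcf(text: str) -> list[dict[str, str]]:
    """Parse blank-line-separated MCF nodes into property -> value dicts."""
    nodes: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the node being built, if any.
            if current:
                nodes.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        # Split on the first colon only, so "dcid:foo" values stay intact.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"')
    if current:
        nodes.append(current)
    return nodes


# Illustrative sample only; not copied from variables.mcf.
SAMPLE_MCF = """
Node: dcid:average_annual_wage
typeOf: dcid:StatisticalVariable
name: "Average annual wage"
statType: dcid:meanValue

Node: dcid:gender_wage_gap
typeOf: dcid:StatisticalVariable
name: "Gender wage gap"
"""

nodes = parse_mcf(SAMPLE_MCF)
for node in nodes:
    print(node["Node"], "->", node.get("name"))
```

A quick pass like this makes it easy to spot nodes that lack a property such as `statType` before loading the data.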

hqpho and others added 30 commits November 6, 2024 19:13
Highlights:
- Pins the version of `transformers` in nl_requirements.txt
- Adds support for a schema update mode in the data management
container, documented
[here](https://docs.datacommons.org/custom_dc/troubleshooting.html#schema-check-failed).
Schema check errors should also now have a direct link to the
troubleshooting page.
- Updates services container to exit as soon as any of the mixer, NL, or
website servers fails to start up
# Highlights
- SQL queries have been parameterized.
- The /translate API endpoint and the /translator web endpoint have been
removed.
- Some base CSS files have been modified as part of ongoing visual
refreshes of the main Data Commons site.
- A new config file option `includeInputSubdirs` is now available.
Please note that while this feature has undergone basic testing, it may
have rough edges and is not yet documented on our docsite. The
associated feature request is
https://issuetracker.google.com/issues/369945544.
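
As a general illustration of the query parameterization mentioned in the first highlight, here is a minimal sketch using Python's standard `sqlite3` module. It is not the code changed in this release; the `observations` table, its columns, and the `values_for` helper are hypothetical, and the stored value is a placeholder.

```python
import sqlite3

# Hypothetical table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (entity TEXT, variable TEXT, value REAL)")
conn.execute(
    "INSERT INTO observations VALUES (?, ?, ?)",
    ("country/USA", "average_annual_wage", 1.0),  # placeholder value
)


def values_for(conn: sqlite3.Connection, entity: str, variable: str) -> list[float]:
    """Look up values with bound parameters instead of string concatenation."""
    rows = conn.execute(
        "SELECT value FROM observations WHERE entity = ? AND variable = ?",
        (entity, variable),
    ).fetchall()
    return [row[0] for row in rows]


# Input that would alter the query if it were spliced into the SQL text.
hostile = "x' OR '1'='1"

print(values_for(conn, "country/USA", "average_annual_wage"))
print(values_for(conn, hostile, "average_annual_wage"))
```

Because the driver binds the values separately from the SQL text, the hostile string is compared as a literal entity name and matches nothing.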

# Submodule diffs
- Mixer:
datacommonsorg/mixer@656512f...b5d6d7c
- Import:
datacommonsorg/import@5d14167...98cd40c
# Highlights
* Visual and performance improvements to place pages
* Improved webdriver tests

# Submodule Diffs
* Mixer:
datacommonsorg/mixer@b5d6d7c...b960d36
# Highlights
- Bug fix for data imports failing if the GCS input path was not
initialized with an empty blob
- Bug fix for custom variables not showing up in custom Stat Var
Explorer
- Bug fix for services container not starting up if embeddings file is
empty
- Improvements to data load log format
- Services container no longer creates SQL tables if they don't exist
(the data management container should be used to create the tables
instead)
- Web admin removed (was deprecated after data management container
launch)
* Added support for schema-only updates without requiring any input
files.
* Updated the terraform scripts to use the above feature to create the
DB schema and start the services at the outset.
* Enabled AdminArea1 and AdminArea2 place types in the tools explorer
for multiple countries.
# Highlights
- Fixes feature check at startup for custom DC
- Parallelizes Place page API calls
# Highlights
- Parametrize the maximum examined Topics during NL fulfillment
([website PR#4986](datacommonsorg#4986))
- Improve performance of queries to getStatVarSummaries ([import
PR#365](datacommonsorg/import#365))
# Highlights
- Added `maxCharts` option to `/api/explore/fulfill` and
`/api/explore/detect-and-fulfill` endpoints allowing clients to fetch
additional data
# Highlights
- Enabled the new revamped place pages
# Highlights

- Add support for facet requests for SQL backends.
([datacommonsorg#1549](datacommonsorg/mixer#1549))
- Return correct facet info in series facet responses.
([datacommonsorg#1550](datacommonsorg/mixer#1550))
# Highlights
### Imported datasets:
* Open Data for Africa including:
    * Nigeria Statistics
    * Ivory Coast Census
    * Egypt Census (under source Statistics Egypt)

### Refreshed datasets:
* U.S. Bureau of Labor Statistics (BLS)
    * Current Employment Statistics (CES)
# Highlights
- Removed generateTopics from the config and now always generates
embeddings for both SVs and topics if they are present
- Added support for slashes in entity names
- Data updates
# Highlights
* UI improvements
* Data updates
# Highlights
* Fixed a bug in Custom DC data imports where generated stat var
grouping with long IDs would cause the import to fail
* Fixed a bug where Custom DC imports with stat var groupings disabled
would cause NL search to fail

Note: this is an off-cycle release, and is not in sync with the
datacommons.org production image

Changes were verified as follows:
* Verified "URLS to check" locally from go/dconcall
* Verified locally that the data import changes work
* Verified with the UN that the data import changes work in cloud run
* Verified with the UN that there are no breaking changes to their
undata site.
# Highlights
- Fixed a bug in the slider web component to keep it in sync with other
components on the page.
# Highlights
- Mixer: Adds a new API endpoint for filtering stat vars by
place/entity.
- Website changes are listed below.
- Specific callout: datacommonsorg#5329 updated the map tooltip to show
the full variable name instead of the DCID
# Highlights
- Adds a temporary search-indicators endpoint in website for MCP
compatibility: datacommonsorg#5448
- [flag-gated] Updates header of explore more results, metadata modal
and dataset selection (see list of PRs below)
# Highlights
- Includes a [fix](datacommonsorg#5522)
to better discern custom DCs, which will let us turn on the dataset
selector and metadata modal for custom DC users
# Highlights
- UI updates to debug info, metadata, facet selector, tab component,
highlight tile
- Some UI page redirects for visualization tools
- Updates to StatVar autocomplete
# Highlights

## Mixer

Logs are now JSON structured logs. This should ensure they are ingested
by GCP at the appropriate severity level and can be queried by field.
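
For context, here is a minimal sketch of the general idea in Python. Mixer's own logging is not shown here, and the logger name and the extra `logger` field are assumptions made for illustration. The sketch relies on the Cloud Logging convention of reading `severity` and `message` from a JSON log line.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "severity": record.levelname,    # e.g. INFO, WARNING, ERROR
                "message": record.getMessage(),  # message with args applied
                "logger": record.name,           # extra queryable field
            }
        )


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("mixer_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.warning("cache miss for %s", "country/USA")

entry = json.loads(buffer.getvalue())
print(entry)
```

In a deployed service the stream would normally be stdout or stderr rather than an in-memory buffer; the buffer is used here only so the output can be inspected.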

## Website

You now have the option to disable Google Map components in your
instance. This is helpful for folks who don’t need the map functionality
and want to limit Maps API costs. To disable Google Maps, use the
“disable_google_maps” terraform setting as described in our docs
(https://docs.datacommons.org/custom_dc/deploy_cloud.html#optional).

In response to user feedback, we’ve updated the language in our
visualization tools to use the more precise term “facet” instead of the
vaguer “data source”.

We’ve fixed a bug in our download modal such that the facet being
downloaded will now always match the facet selected in the chart.

We’ve improved the readability of our download dialogs by adding CSV
highlighting to the CSV previews and displaying each line on its own,
making the previews easier to scan.

Similarly, we’ve added syntax highlighting to our API dialog to make its
contents more readable.
# Highlights
- Improved logging
- Fixes to the default Custom DC search CSS
@kmoscoe kmoscoe requested a review from keyurva January 12, 2026 22:30
@gemini-code-assist

Summary of Changes

Hello @kmoscoe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the custom Data Commons sample by transitioning to a more explicit and standardized schema. The changes involve updating data formats in CSV files to align with Data Commons ID conventions, centralizing variable definitions into a new MCF file, and enhancing the accuracy and detail of data provenance links. These modifications aim to improve the sample's adherence to Data Commons best practices and make the data more interpretable.

Highlights

  • Data Format Standardization: CSV files (average_annual_wage.csv and gender_wage_gap.csv) have been updated to explicitly use Data Commons IDs (DCIDs) for entities and variables, and a unit column was added to average_annual_wage.csv.
  • Externalized Variable Definitions: Variable metadata, previously embedded in config.json, is now defined in a dedicated variables.mcf file, enhancing modularity and clarity.
  • Improved Provenance Information: The config.json file now contains more specific provenance descriptions and accurate URLs for the data sources.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a great step towards standardizing the data schema. I've identified a few areas for improvement to enhance correctness and consistency:

  • In variables.mcf, the statType for average_annual_wage seems incorrect (median vs. mean).
  • The schema for gender_wage_gap could be more explicit by adding a unit and correcting its statType.
  • The column order in gender_wage_gap.csv is inconsistent with average_annual_wage.csv.

My detailed comments provide specific suggestions to address these points.
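
The column-order point can be checked mechanically. Below is a minimal Python sketch, assuming both files carry a header row; the header strings are hypothetical stand-ins, not the actual headers of average_annual_wage.csv and gender_wage_gap.csv, and `header` and `same_order` are illustrative helpers.

```python
import csv
import io


def header(csv_text: str) -> list[str]:
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def same_order(a: list[str], b: list[str], optional: tuple[str, ...] = ("unit",)) -> bool:
    """Compare column order, ignoring columns that only one file may have."""
    filtered_a = [col for col in a if col not in optional]
    filtered_b = [col for col in b if col not in optional]
    return filtered_a == filtered_b


# Hypothetical headers for illustration only.
WAGE_CSV = "entity,variable,date,value,unit\n"
GAP_CSV = "variable,entity,date,value\n"
GAP_CSV_FIXED = "entity,variable,date,value\n"

print(same_order(header(WAGE_CSV), header(GAP_CSV)))
print(same_order(header(WAGE_CSV), header(GAP_CSV_FIXED)))
```

Treating `unit` as optional keeps the check from flagging a file only because it lacks a column the other one has.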


@keyurva keyurva left a comment


Very cool - thanks Kara!

I'm assuming you've already given this a try and it works? Knowing you, you must have, but calling it out explicitly!

@kmoscoe

kmoscoe commented Jan 13, 2026

> Very cool - thanks Kara!
>
> I'm assuming you've already given this a try and it works? Knowing you, you must have, but calling it out explicitly!

Yep, you can see it in action at https://bullie.svl.corp.google.com:8080. (I mentioned this in the initial description.)

@kmoscoe kmoscoe merged commit 5193f13 into datacommonsorg:master Jan 13, 2026
9 of 10 checks passed