Convert custom DC sample to explicit schema #5895

kmoscoe · 2026-01-12T22:30:42Z

This PR does the following:

- Removes variable definitions from config.json and puts them in variables.mcf instead
- Adds more metadata to the variable definitions
- Adds a unit column to a CSV
- Fixes the provenance definitions, which were incorrect and pointed to the wrong URLs

Also removes the examples/ directory and all of its files, since they were of little use to external users.

I have started up a local instance with these files and everything looks good: https://bullie.svl.corp.google.com:8080
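For readers unfamiliar with the explicit-schema layout, here is a minimal sketch of what "one observation per row, with a unit column" can look like. It assumes a variable-per-row CSV with `entity`, `variable`, `date`, `value`, and `unit` columns; the input column names, the entity IDs, the `USDollar` unit, and the numbers are illustrative assumptions, not the contents of the sample files in this PR.

```python
import csv
import io

# Hypothetical wide-format input: one column per variable, no units.
WIDE_CSV = """country,year,average_annual_wage
country/USA,2020,100
country/CAN,2020,90
"""


def to_variable_per_row(wide_text, unit_by_variable):
    """Reshape wide rows into one observation per row, adding a unit column."""
    reader = csv.DictReader(io.StringIO(wide_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["entity", "variable", "date", "value", "unit"])
    for row in reader:
        # Emit one output row per (input row, variable) pair.
        for variable, unit in unit_by_variable.items():
            writer.writerow(
                [row["country"], variable, row["year"], row[variable], unit]
            )
    return out.getvalue()


# The unit name is an assumption made for this example.
long_csv = to_variable_per_row(WIDE_CSV, {"average_annual_wage": "USDollar"})
print(long_csv)
```

The point of the reshaping is that each row names its entity and variable by ID, so the loader does not have to infer them from column headers.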

Highlights:
- Pins the version of `transformers` in nl_requirements.txt
- Adds support for a schema update mode in the data management container, documented [here](https://docs.datacommons.org/custom_dc/troubleshooting.html#schema-check-failed). Schema check errors should also now have a direct link to the troubleshooting page.
- Updates services container to exit as soon as any of the mixer, NL, or website servers fails to start up

# Highlights
- SQL queries have been parameterized.
- The /translate API endpoint and the /translator web endpoint have been removed.
- Some base CSS files have been modified as part of ongoing visual refreshes of the main Data Commons site.
- A new config file option `includeInputSubdirs` is now available. Please note that while this feature has undergone basic testing, it may have rough edges and is not yet documented on our docsite. The associated feature request is https://issuetracker.google.com/issues/369945544.

# Submodule diffs
- Mixer: datacommonsorg/mixer@656512f...b5d6d7c
- Import: datacommonsorg/import@5d14167...98cd40c

# Highlights
* Visual and performance improvements to place pages
* Improved webdriver tests

# Submodule Diffs
* Mixer: datacommonsorg/mixer@b5d6d7c...b960d36

# Highlights
- Bug fix for data imports failing if the GCS input path was not initialized with an empty blob
- Bug fix for custom variables not showing up in custom Stat Var Explorer
- Bug fix for services container not starting up if embeddings file is empty
- Improvements to data load log format
- Services container no longer creates SQL tables if they don't exist (the data management container should be used to create the tables instead)
- Web admin removed (was deprecated after data management container launch)

* Added support for schema-only updates without requiring any input files.
* Updated the terraform scripts to use the above feature to create the DB schema and start the services at the outset.
* Enabled AdminArea1 and AdminArea2 place types in the tools explorer for multiple countries.

# Highlights
- Fixes feature check at startup for custom DC
- Parallelizes Place page API calls

# Highlights
- Parametrize the maximum examined Topics during NL fulfillment ([website PR#4986](datacommonsorg#4986))
- Improve performance of queries to getStatVarSummaries ([import PR#365](datacommonsorg/import#365))

# Highlights
- Added `maxCharts` option to `/api/explore/fulfill` and `/api/explore/detect-and-fulfill` endpoints allowing clients to fetch additional data

# Highlights
- Enabled the new revamped place pages

# Highlights
- Add support for facet requests for sql backends. ([datacommonsorg#1549](datacommonsorg/mixer#1549))
- Return correct facet info in series facet responses. ([datacommonsorg#1550](datacommonsorg/mixer#1550))

# Highlights
### Imported datasets:
* Open Data for Africa including:
* Nigeria Statistics
* Ivory Coast Census
* Egypt Census (under source Statistics Egypt)

### Refreshed datasets:
* U.S. Bureau of Labor Statistics (BLS)
* Current Employment Statistics (CES)

# Highlights
- Removed generateTopics from the config and now always generates embeddings for both SVs and topics if they are present
- Added support for slashes in entity names
- Data updates

# Highlights
* UI improvements
* Data updates

# Highlights
* Fixed a bug in Custom DC data imports where generated stat var grouping with long IDs would cause the import to fail
* Fixed a bug where Custom DC imports with stat var groupings disabled would cause NL search to fail

Note: this is an off-cycle release, and it is not in sync with the datacommons.org production image.

Verified changes by:
* Verified "URLS to check" locally from go/dconcall
* Verified locally that the data import changes work
* Verified with the UN that the data import changes work in cloud run
* Verified with the UN that there are no breaking changes to their undata site.

# Highlights
- Fixed a bug in the slider web component to keep it in sync with other components on the page.

# Highlights
- Mixer: Adds a new API endpoint for filtering stat vars by place/entity.
- Website changes are listed below.
- Specific call out for datacommonsorg#5329 Updated map tooltip to show full variable name instead of dcid

# Custom DC Highlights
- datacommonsorg#5428
- datacommonsorg#5423

# Highlights
- Adds a temporary search-indicators endpoint in website for MCP compatibility: datacommonsorg#5448
- [flag-gated] Updates header of explore more results, metadata modal and dataset selection (see list of PRs below)

[Release notes](https://github.com/datacommonsorg/website/releases/tag/v3.2.2)

# Highlights
- Includes a [fix](datacommonsorg#5522) to better discern custom DCs which will let us turn on the dataset selector and metadata modal for custom dc users

# Highlights
- UI updates to debug info, metadata, facet selector, tab component, highlight tile
- Some UI page redirects for visualization tools
- Updates to StatVar autocomplete

# Highlights
## Mixer
Logs are now JSON structured logs. This should ensure they are ingested by GCP at the appropriate severity level and can be queried by field.

## Website
- You now have the option to disable Google Map components in your instance. This is helpful for folks who don’t need the map functionality and want to limit Maps API costs. To disable Google Maps, use the `disable_google_maps` terraform setting as described in our docs (https://docs.datacommons.org/custom_dc/deploy_cloud.html#optional).
- In response to user feedback, we’ve updated the language in our visualization tools to use the more precise term “facet” instead of the vaguer “data source”.
- We’ve fixed a bug in our download modal such that the facet being downloaded will now always match the facet selected in the chart.
- We’ve improved the readability of our download dialogs by adding some CSV highlighting to the CSV previews and displaying each line on its own, making it easier for the user to scan.
- Similarly, we’ve added syntax highlighting to our API dialog to make its contents more readable.

# Highlights
- Improved logging
- Fixes to the default Custom DC search CSS


gemini-code-assist · 2026-01-12T22:31:17Z

Summary of Changes

Hello @kmoscoe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the custom Data Commons sample by transitioning to a more explicit and standardized schema. The changes involve updating data formats in CSV files to align with Data Commons ID conventions, centralizing variable definitions into a new MCF file, and enhancing the accuracy and detail of data provenance links. These modifications aim to improve the sample's adherence to Data Commons best practices and make the data more interpretable.

Highlights

- Data Format Standardization: CSV files (average_annual_wage.csv and gender_wage_gap.csv) have been updated to explicitly use Data Commons IDs (DCIDs) for entities and variables, and a unit column was added to average_annual_wage.csv.
- Externalized Variable Definitions: Variable metadata, previously embedded in config.json, is now defined in a dedicated variables.mcf file, enhancing modularity and clarity.
- Improved Provenance Information: The config.json file now contains more specific provenance descriptions and accurate URLs for the data sources.
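To make "variable metadata defined in an MCF file" more concrete, the sketch below renders one MCF-style node as text: a `Node:` line followed by `property: value` lines. The helper `render_mcf_node` is hypothetical, and the properties and values shown are illustrative assumptions rather than the actual contents of variables.mcf.

```python
def render_mcf_node(dcid, properties):
    """Render one MCF node: a 'Node:' line followed by 'property: value' lines."""
    lines = [f"Node: dcid:{dcid}"]
    for prop, value in properties.items():
        lines.append(f"{prop}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical definition; the property values are illustrative only.
node = render_mcf_node(
    "average_annual_wage",
    {
        "typeOf": "dcid:StatisticalVariable",
        "name": '"Average annual wage"',
        "description": '"Illustrative description text"',
    },
)
print(node)
```

Keeping definitions like this in a separate file means config.json only has to describe input files and provenance, while the variable metadata lives in one place.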


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a great step towards standardizing the data schema. I've identified a few areas for improvement to enhance correctness and consistency:

- In variables.mcf, the statType for average_annual_wage seems incorrect (median vs. mean).
- The schema for gender_wage_gap could be more explicit by adding a unit and correcting its statType.
- The column order in gender_wage_gap.csv is inconsistent with average_annual_wage.csv.

My detailed comments provide specific suggestions to address these points.
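The column-order point can be checked mechanically. The sketch below compares the header rows of two CSVs and reports whether the columns they share appear in the same relative order. The header strings are hypothetical stand-ins, since the actual headers of the sample files are not shown in this conversation.

```python
import csv
import io


def header_of(csv_text):
    """Return the header row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def shared_columns_in_same_order(csv_a, csv_b):
    """True if the columns present in both CSVs appear in the same order."""
    header_a, header_b = header_of(csv_a), header_of(csv_b)
    shared = set(header_a) & set(header_b)
    order_a = [col for col in header_a if col in shared]
    order_b = [col for col in header_b if col in shared]
    return order_a == order_b


# Hypothetical headers, for illustration only.
WAGE_CSV = "entity,variable,date,value,unit\n"
GAP_CSV_BEFORE = "variable,entity,date,value\n"
GAP_CSV_AFTER = "entity,variable,date,value\n"

print(shared_columns_in_same_order(WAGE_CSV, GAP_CSV_BEFORE))  # False
print(shared_columns_in_same_order(WAGE_CSV, GAP_CSV_AFTER))   # True
```

Comparing only the shared columns lets the check pass even when one file has an extra column, such as a unit.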

custom_dc/sample/gender_wage_gap.csv

custom_dc/sample/variables.mcf

keyurva

Very cool - thanks Kara!

I'm assuming you've already given this a try and it works? Knowing you, you must have, but calling it out explicitly!

custom_dc/sample/average_annual_wage.csv

kmoscoe · 2026-01-13T18:53:13Z

> Very cool - thanks Kara!
>
> I'm assuming you've already given this a try and it works? Knowing you, you must have, but calling it out explicitly!

Yep, you can see it in action at https://bullie.svl.corp.google.com:8080. (I mentioned this in the initial description.)


hqpho and others added 30 commits November 6, 2024 19:13

2024-12-17 Custom DC stable release (datacommonsorg#4800)

4dc0bb1


2025-02-10 Custom DC stable release (datacommonsorg#4944)

4cad3e9


2025-03-10 Custom DC stable release (datacommonsorg#5012)

f95b598


2025-03-24 Custom DC stable release (datacommonsorg#5050)

2717b9e


2025-04-01 Custom DC stable release (datacommonsorg#5086)

7d2dc72


2025-04-08 Custom DC stable release (datacommonsorg#5109)

ea9e7bc


2025-06-11 Custom DC stable release (datacommonsorg#5210)

b5dfed4


2025-06-24 Custom DC stable release (datacommonsorg#5231)

55482c5


2025-07-21 Custom DC stable release (datacommonsorg#5319)

e5c9834


2025-09-03 Custom DC stable release (datacommonsorg#5466)

427255a


2025-09-08 Custom DC stable release (datacommonsorg#5479)

0696aec


Custom DC v3.2.2 release (datacommonsorg#5495)

01d6ab0


2025-09-30 Custom DC stable release (datacommonsorg#5564)

16ed35a


2025-10-06 Custom DC stable release (datacommonsorg#5591)

806557f


2025-11-17 Custom DC stable release (datacommonsorg#5720)

a9fd8bd


Replace getopt in run_cdc_dev_docker to make compatible with macOS

df89749

remove obsolete garbage handling

216c8b1

support --flag=val arg format

e97db65

suport -- as terminator

f92ce66

reduce repeated code

d5ede9a

update helper to echo instead of modifiying global var

02ac6ea

final nit, i promise

56faf89

kmoscoe added 10 commits December 17, 2025 11:49

Merge branch 'master' of https://github.com/datacommonsorg/website in…

3a03379

…to christie

Merge branch 'customdc_stable' of https://github.com/datacommonsorg/w…

0e77383

…ebsite into christie

Merge branch 'master' of https://github.com/datacommonsorg/website in…

e3b6157

…to christie

Update sample to use explicit schema

bfe0822

convert sample to explicit schema

0d6f03f

Merge branch 'master' of https://github.com/datacommonsorg/website in…

eb1ff7d

…to explicit

Merge branch 'master' of https://github.com/datacommonsorg/website in…

b4f9439

…to explicit

Convert to explicit schema

ad91baf

remove incorrect file

f63d7de

Fix empty rows

ae5618b

kmoscoe requested a review from keyurva January 12, 2026 22:30

Remove examples/ directory and contents

10dd785

gemini-code-assist bot reviewed Jan 12, 2026

custom_dc/sample/gender_wage_gap.csv (outdated)

custom_dc/sample/variables.mcf (outdated)

custom_dc/sample/variables.mcf (outdated)

imple,ent changes suggested by Gemini

a6cf42e

keyurva approved these changes Jan 13, 2026

custom_dc/sample/average_annual_wage.csv

kmoscoe and others added 4 commits January 13, 2026 10:55

remove redundant dcid: prefixes

7d8ea41

Merge branch 'master' into explicit

94beae2

Merge branch 'master' of https://github.com/datacommonsorg/website in…

5bd2493

…to explicit

Merge branch 'explicit' of https://github.com/kmoscoe/website into ex…

81e8c54

…plicit

kmoscoe merged commit 5193f13 into datacommonsorg:master Jan 13, 2026
9 of 10 checks passed
