Convert custom DC sample to explicit schema #5895
Highlights:
- Pins the version of `transformers` in nl_requirements.txt
- Adds support for a schema update mode in the data management container, documented [here](https://docs.datacommons.org/custom_dc/troubleshooting.html#schema-check-failed). Schema check errors should also now have a direct link to the troubleshooting page.
- Updates services container to exit as soon as any of the mixer, NL, or website servers fails to start up

# Highlights
- SQL queries have been parameterized.
- The /translate API endpoint and the /translator web endpoint have been removed.
- Some base CSS files have been modified as part of ongoing visual refreshes of the main Data Commons site.
- A new config file option `includeInputSubdirs` is now available. Please note that while this feature has undergone basic testing, it may have rough edges and is not yet documented on our docsite. The associated feature request is https://issuetracker.google.com/issues/369945544.

# Submodule diffs
- Mixer: datacommonsorg/mixer@656512f...b5d6d7c
- Import: datacommonsorg/import@5d14167...98cd40c
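As an illustration of what "SQL queries have been parameterized" means in practice, here is a minimal Python sketch using the standard-library `sqlite3` module. It is not the mixer or import code: the `observations` table, its columns, and the `observations_for` helper are hypothetical names chosen for the example.

```python
import sqlite3


def observations_for(conn, variable):
    """Fetch rows for one variable using a bound parameter.

    The `?` placeholder makes the driver treat `variable` strictly as data,
    so it can never change the structure of the SQL statement.
    """
    cursor = conn.execute(
        "SELECT entity, value FROM observations WHERE variable = ?",
        (variable,),
    )
    return cursor.fetchall()


# Hypothetical in-memory table, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (entity TEXT, variable TEXT, value TEXT)")
conn.execute(
    "INSERT INTO observations VALUES (?, ?, ?)",
    ("country/USA", "average_annual_wage", "100"),
)

normal_rows = observations_for(conn, "average_annual_wage")
# An injection-style input is matched literally and finds nothing.
hostile_rows = observations_for(conn, "x' OR '1'='1")
```

Had the query been built by string concatenation, the second call would have matched every row; with a bound parameter it is compared as an ordinary string.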
# Highlights
* Visual and performance improvements to place pages
* Improved webdriver tests

# Submodule Diffs
* Mixer: datacommonsorg/mixer@b5d6d7c...b960d36

# Highlights
- Bug fix for data imports failing if the GCS input path was not initialized with an empty blob
- Bug fix for custom variables not showing up in custom Stat Var Explorer
- Bug fix for services container not starting up if embeddings file is empty
- Improvements to data load log format
- Services container no longer creates SQL tables if they don't exist (the data management container should be used to create the tables instead)
- Web admin removed (was deprecated after data management container launch)

* Added support for schema-only updates without requiring any input files.
* Updated the terraform scripts to use the above feature to create the DB schema and start the services at the outset.
* Enabled AdminArea1 and AdminArea2 place types in the tools explorer for multiple countries.

# Highlights
- Fixes feature check at startup for custom DC
- Parallelizes Place page API calls

# Highlights
- Parametrize the maximum examined Topics during NL fulfillment ([website PR#4986](datacommonsorg#4986))
- Improve performance of queries to getStatVarSummaries ([import PR#365](datacommonsorg/import#365))

# Highlights
- Added `maxCharts` option to `/api/explore/fulfill` and `/api/explore/detect-and-fulfill` endpoints allowing clients to fetch additional data

# Highlights
- Enabled the new revamped place pages

# Highlights
- Add support for facet requests for sql backends. ([datacommonsorg#1549](datacommonsorg/mixer#1549))
- Return correct facet info in series facet responses. ([datacommonsorg#1550](datacommonsorg/mixer#1550))
# Highlights
### Imported datasets:
* Open Data for Africa including:
* Nigeria Statistics
* Ivory Coast Census
* Egypt Census (under source Statistics Egypt)
### Refreshed datasets:
* U.S. Bureau of Labor Statistics (BLS)
* Current Employment Statistics (CES)
# Highlights
- Removed generateTopics from the config and now always generates embeddings for both SVs and topics if they are present
- Added support for slashes in entity names
- Data updates

# Highlights
* UI improvements
* Data updates
# Highlights
* Fixed a bug in Custom DC data imports where generated stat var groupings with long IDs would cause the import to fail
* Fixed a bug where Custom DC imports with stat var groupings disabled would cause NL search to fail

Note: this is an off-cycle release, and it is not in sync with the datacommons.org production image.

Verified changes by:
* Verified "URLS to check" locally from go/dconcall
* Verified locally that the data import changes work
* Verified with the UN that the data import changes work in cloud run
* Verified with the UN that there are no breaking changes to their undata site.
# Highlights
- Fixed a bug in the slider web component to keep it in sync with other components on the page.

# Highlights
- Mixer: Adds a new API endpoint for filtering stat vars by place/entity.
- Website changes are listed below.
- Specific call out for datacommonsorg#5329: Updated map tooltip to show full variable name instead of dcid

# Custom DC Highlights
- datacommonsorg#5428
- datacommonsorg#5423

# Highlights
- Adds a temporary search-indicators endpoint in website for MCP compatibility: datacommonsorg#5448
- [flag-gated] Updates header of explore more results, metadata modal and dataset selection (see list of PRs below)

# Highlights
- Includes a [fix](datacommonsorg#5522) to better discern custom DCs which will let us turn on the dataset selector and metadata modal for custom dc users

# Highlights
- UI updates to debug info, metadata, facet selector, tab component, highlight tile
- Some UI page redirects for visualization tools
- Updates to StatVar autocomplete
# Highlights

## Mixer
Logs are now JSON structured logs. This should ensure they are ingested by GCP at the appropriate severity level and can be queried by field.

## Website
You now have the option to disable Google Maps components in your instance. This is helpful for folks who don’t need the map functionality and want to limit Maps API costs. To disable Google Maps, use the “disable_google_maps” terraform setting as described in our docs (https://docs.datacommons.org/custom_dc/deploy_cloud.html#optional).

In response to user feedback, we’ve updated the language in our visualization tools to use the more precise term “facet” instead of the vaguer “data source”.

We’ve fixed a bug in our download modal so that the facet being downloaded now always matches the facet selected in the chart.

We’ve improved the readability of our download dialogs by adding CSV highlighting to the CSV previews and displaying each line on its own, making the contents easier to scan. Similarly, we’ve added syntax highlighting to our API dialog to make its contents more readable.
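To make the idea of JSON structured logs concrete, here is a minimal Python sketch using only the standard library. Mixer itself is not written in Python and its actual log fields are not shown in these notes, so the `JsonFormatter` class, the `mixer_sketch` logger name, and the choice of fields are assumptions for illustration. The `severity` and `message` keys follow the field names that Cloud Logging recognizes in JSON payloads.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps(
            {
                "severity": record.levelname,  # e.g. INFO, WARNING, ERROR
                "message": record.getMessage(),
                "logger": record.name,
            }
        )


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("mixer_sketch")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Capture output in memory so the emitted line can be inspected.
buffer = io.StringIO()
log = make_logger(buffer)
log.warning("cache miss for %s", "country/USA")
entry = json.loads(buffer.getvalue())
```

Because every line is a self-describing JSON object, a log backend can filter on `severity` or any other field instead of pattern-matching free text.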
# Highlights
- Improved logging
- Fixes to the default Custom DC search CSS
Summary of Changes

Hello @kmoscoe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the custom Data Commons sample by transitioning to a more explicit and standardized schema. The changes involve updating data formats in CSV files to align with Data Commons ID conventions, centralizing variable definitions into a new MCF file, and enhancing the accuracy and detail of data provenance links. These modifications aim to improve the sample's adherence to Data Commons best practices and make the data more interpretable.
Code Review
This pull request is a great step towards standardizing the data schema. I've identified a few areas for improvement to enhance correctness and consistency:
- In `variables.mcf`, the `statType` for `average_annual_wage` seems incorrect (median vs. mean).
- The schema for `gender_wage_gap` could be more explicit by adding a `unit` and correcting its `statType`.
- The column order in `gender_wage_gap.csv` is inconsistent with `average_annual_wage.csv`.
My detailed comments provide specific suggestions to address these points.
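The column-order point in the review can be checked mechanically. The sketch below is one way to do it in Python; the header names and rows are made up for the example and are not the actual contents of `average_annual_wage.csv` or `gender_wage_gap.csv`, and `column_order_mismatches` is a hypothetical helper.

```python
import csv
import io
from itertools import zip_longest


def header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def column_order_mismatches(reference, candidate):
    """List (position, reference_name, candidate_name) wherever the headers differ."""
    return [
        (index, ref, cand)
        for index, (ref, cand) in enumerate(zip_longest(reference, candidate))
        if ref != cand
    ]


# Made-up file contents, standing in for the two sample CSVs.
wage_csv = "entity,variable,date,value\ncountry/USA,average_annual_wage,2020,1\n"
gap_csv = "entity,date,variable,value\ncountry/USA,2020,gender_wage_gap,1\n"

mismatches = column_order_mismatches(header(wage_csv), header(gap_csv))
```

An empty result means the two files list their columns in the same order; here positions 1 and 2 are reported because `variable` and `date` are swapped.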
keyurva
left a comment
Very cool - thanks Kara!
I'm assuming you've already given this a try and it works? Knowing you, you must have, but calling it out explicitly!
Yep, you can see it in action at https://bullie.svl.corp.google.com:8080. (I mentioned this in the initial description.)
This PR does the following:
Also removes the examples/ directory and all files, as it was somewhat useless for any external purposes.
I have started up a local instance with these files and everything looks good: https://bullie.svl.corp.google.com:8080