@daniel-thom
Collaborator

This is a prototype, mostly generated by Claude. The goal is to see if we can simplify interaction with dsgrid when running Spark jobs. The current SQLAlchemy-based code requires a separate Spark session: dsgrid creates a session with pyspark, while chronify relies on an Apache Thrift Server (Hive).

We would get the following benefits by migrating:

  • dsgrid could pass an Ibis Table object to chronify for time validation, instead of only a path to a Parquet file.
  • We would drop the dependency on PyHive (the SQLAlchemy driver for Hive), whose continued support is unclear.
  • dsgrid could drop much of its special handling for Spark vs. DuckDB (DuckDB has an experimental Spark API, but it is incomplete and has an uncertain future). Ibis appears to be a better long-term solution.

We would lose the following SQLAlchemy functionality:

  • Database transactions with rollback. Ibis does not support these natively.
  • We currently allow the user to ingest rows from multiple DataFrames into an existing table. If the first DataFrame is valid but the second is not, we roll back, leaving the database in its original state. With Ibis, we have no code to delete the added rows. (It could be done with special-casing for backends that support it.)
  • This is not important for dsgrid, as we do not ingest data like this, but we need to ask other chronify users.
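The transactional ingest pattern we would lose can be sketched directly with SQLAlchemy (the table and column names are illustrative, not chronify's actual schema):

```python
# Multi-batch ingest under one transaction: if any batch fails, the whole
# ingest rolls back and the table is left in its original (empty) state.
from sqlalchemy import Column, Float, MetaData, Table, create_engine, func, insert, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
load_data = Table("load_data", metadata, Column("value", Float, nullable=False))
metadata.create_all(engine)

batches = [
    [{"value": 1.0}, {"value": 2.0}],  # valid batch
    [{"value": None}],                 # invalid batch: NOT NULL violation
]

try:
    with engine.begin() as conn:  # one transaction covering all batches
        for batch in batches:
            conn.execute(insert(load_data), batch)
except Exception:
    pass  # engine.begin() already rolled back on the failure

with engine.connect() as conn:
    count = conn.execute(select(func.count()).select_from(load_data)).scalar_one()
print(count)  # 0: the valid first batch was rolled back along with the bad one
```

With Ibis there is no equivalent of `engine.begin()`, so the first batch's rows would remain in the table unless we wrote backend-specific cleanup code.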

Outstanding work:

  • Some tests are failing due to time zone / DST handling with Spark.
  • Talk to other chronify users about dropping transaction support.
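The exact Spark failures are still being investigated, but one common class of DST bug is the ambiguous local hour when clocks fall back, illustrated here with the standard library rather than Spark (the time zone choice is just an example):

```python
# During the fall-back transition, one wall-clock time maps to two UTC offsets.
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
# 01:30 on 2020-11-01 occurs twice: clocks fall back from 02:00 EDT to 01:00 EST.
first = datetime(2020, 11, 1, 1, 30, tzinfo=tz)           # first occurrence (EDT)
second = datetime(2020, 11, 1, 1, 30, fold=1, tzinfo=tz)  # second occurrence (EST)
print(first.utcoffset() != second.utcoffset())  # True: same wall clock, two offsets
```

Any code path that round-trips timestamps through a local time zone must disambiguate such hours, which is where backends (and their tests) tend to disagree.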

@codecov-commenter

codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 87.17625% with 203 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (ll/local_time2@4f2a3f3). Learn more about missing BASE report.

Files with missing lines                   Patch %   Lines
src/chronify/ibis/spark_backend.py          73.01%   34 Missing ⚠️
src/chronify/ibis/functions.py              77.61%   30 Missing ⚠️
src/chronify/ibis/types.py                  51.92%   25 Missing ⚠️
src/chronify/store.py                       88.00%   21 Missing ⚠️
src/chronify/time_series_mapper_base.py     86.61%   19 Missing ⚠️
src/chronify/ibis/duckdb_backend.py         79.74%   16 Missing ⚠️
src/chronify/ibis/sqlite_backend.py         80.72%   16 Missing ⚠️
src/chronify/ibis/base.py                   80.00%   14 Missing ⚠️
src/chronify/time_zone_localizer.py         82.75%    5 Missing ⚠️
tests/conftest.py                           87.80%    5 Missing ⚠️
... and 9 more
Additional details and impacted files
@@                Coverage Diff                @@
##             ll/local_time2      #62   +/-   ##
=================================================
  Coverage                  ?   91.85%           
=================================================
  Files                     ?       55           
  Lines                     ?     4962           
  Branches                  ?        0           
=================================================
  Hits                      ?     4558           
  Misses                    ?      404           
  Partials                  ?        0           


@lixiliu
Collaborator

lixiliu commented Jan 24, 2026


For rollback behavior, can't we just make a copy of the original dataframe to do the next operation so we have something to fall back on?
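The snapshot idea could look like the following sketch, shown with in-memory pandas DataFrames for brevity (the `validate` step and names are illustrative):

```python
# Snapshot-based fallback: copy the table's current state before ingesting,
# and restore the copy if any batch fails validation.
import pandas as pd


def validate(df: pd.DataFrame) -> None:
    """Illustrative validation: reject batches containing nulls."""
    if df["value"].isna().any():
        raise ValueError("null values not allowed")


table = pd.DataFrame({"value": [1.0, 2.0]})
batches = [pd.DataFrame({"value": [3.0]}), pd.DataFrame({"value": [None]})]

snapshot = table.copy()  # fallback state taken before any ingest
try:
    for batch in batches:
        validate(batch)
        table = pd.concat([table, batch], ignore_index=True)
except ValueError:
    table = snapshot  # restore the pre-ingest state
```

For tables that live in a database backend rather than in memory, the snapshot would have to be a table copy on the backend (e.g. `CREATE TABLE snapshot AS SELECT ...`), which could be expensive for large tables, so this trades memory/storage for the transactional guarantee.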
