Conversation
… SQLGlot bug discovered along the way [RUN CI]
… into John/df_collection
```py
# Query from the materialized table - direct method call works for simple queries
result = asian_tmp.CALCULATE(name)
```
Can we add the actual result for each example? I think it would be really helpful for understanding what this API creates.
documentation/usage.md
Outdated
The first argument it takes in is the PyDough node for the collection being materialized. The second argument is the name of the view/table to create. It can optionally take in the following keyword arguments:

- `as_view`: If `True`, create a VIEW. If `False` (default), create a TABLE.
- `replace`: If `True`, drop table/view if exists and then create the table/view. For Snowflake, use `CREATE OR REPLACE` to allow replacing an existing view/table. Default is `False`.

What happens if replace=False and the user tries to create a view/table that already exists? Can we specify it here?

Agreed. Also, let's make the format of how defaults are declared consistent between this vs as_view.
Not sure what you mean?
If replace=False and the user tries to create one that already exists, it will fail, as expected by all SQL engines.
I'll add the note, but I want to make sure I understand: you mean just stating it, and I'm not missing something?
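To make the semantics discussed above concrete, here is a minimal sketch of how `replace`, `as_view`, and `temp` could map to DDL statements. This is not PyDough's actual code; the function and parameter names are illustrative. When `replace=False` and the object already exists, the emitted `CREATE` simply fails on the database side:

```python
def build_create_sql(select_sql, name, as_view=False, replace=False, temp=False,
                     supports_or_replace=True):
    """Illustrative sketch: map to_table-style flags to DDL statements."""
    kind = "VIEW" if as_view else "TABLE"
    stmts = []
    if replace and supports_or_replace:
        # Dialects like Snowflake can replace atomically.
        head = f"CREATE OR REPLACE {'TEMP ' if temp else ''}{kind}"
    else:
        if replace:
            # Dialects without CREATE OR REPLACE fall back to DROP + CREATE.
            stmts.append(f"DROP {kind} IF EXISTS {name}")
        # With replace=False, this CREATE fails if the object already exists.
        head = f"CREATE {'TEMP ' if temp else ''}{kind}"
    stmts.append(f"{head} {name} AS {select_sql}")
    return stmts
```

For example, `build_create_sql("SELECT 1", "t", replace=True, supports_or_replace=False)` yields the two-statement DROP + CREATE fallback.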
```py
- actual_temp is the final temp value (may differ from input due to dialect limitations)
"""
# Handle differences in CREATE syntax for different databases.
create_caps = CREATE_CAPABILITIES[db_dialect]
)

# Check if we can use CREATE OR REPLACE
can_replace = create_caps.replace_view if as_view else create_caps.replace_table

raise PyDoughException(
    f"TEMPORARY views are not supported for {session.database.dialect.name}"
)
# session.metadata = graph
```

Is this an actual comment?

Nope, outdated comment. Removed.
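As a sketch of what a `CREATE_CAPABILITIES` lookup like the one above might contain: the dataclass and the per-dialect entries below are my guesses based on the dialect table discussed in this PR, not PyDough's actual definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateCapabilities:
    # Whether CREATE OR REPLACE is supported natively for each object kind
    # (hypothetical field names; the real structure may differ).
    replace_table: bool
    replace_view: bool
    # Whether TEMPORARY objects are supported for each kind.
    temp_table: bool
    temp_view: bool

# Illustrative entries only.
CREATE_CAPABILITIES = {
    "sqlite": CreateCapabilities(False, False, True, True),
    "snowflake": CreateCapabilities(True, True, True, False),
}

def can_use_or_replace(dialect: str, as_view: bool) -> bool:
    caps = CREATE_CAPABILITIES[dialect]
    return caps.replace_view if as_view else caps.replace_table
```

Dialects where this returns `False` would take the DROP + CREATE fallback path instead.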
```py
the created view/table.

"""
_validate_table_name(name)
```
There is a function from error_utils called is_valid_sql_identifier() that can be used here. There are more functions that can be used for validation like unique_properties_predicate.verify() for unique_columns. Also don't forget to manage quoted names for the table name and for columns. For reference, see how I use normalize_column_name in create_constant_table. I think you can use it as well.
Thanks. I missed that.
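Here is a rough stand-in for the kind of validation the reviewer suggests. These are simplified look-alikes of the `error_utils` helpers (`is_valid_sql_identifier`, `normalize_column_name`), not the real implementations, just to illustrate handling quoted names:

```python
import re

_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_sql_identifier(name: str) -> bool:
    """Simplified stand-in for the error_utils helper."""
    return bool(_IDENT_RE.match(name))

def normalize_column_name(name: str) -> str:
    """Strip surrounding double quotes so quoted names compare consistently
    (modeled loosely on the helper used by create_constant_table)."""
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        return name[1:-1]
    return name

def validate_table_name(name: str) -> None:
    # Normalize first so quoted identifiers like "orders" pass validation.
    if not is_valid_sql_identifier(normalize_column_name(name)):
        raise ValueError(f"Invalid table name: {name!r}")
```

The same normalize-then-validate pattern would apply to column names passed for uniqueness.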
```
@@ -0,0 +1,126 @@
"""
A user-defined collection representing a database [temporary] view/table.
```
Because this is inside the user_collections folder can we add the documentation on the README.md file? Just a brief description of this class would be fine.
```py
# then CALCULATE on materialized view
pytest.param(
    PyDoughPandasTest(
        "asian_nations = nations.WHERE(region.name == 'ASIA')\n"
```

Can we create a view collection from a range/dataframe collection? If so, we should add a test. Can we combine user-generated collections? For example, CROSS a dataframe/range collection with a view collection.
We should also have tests where the uniqueness columns from to_table come into play (e.g. .BEST where the per=... ancestor is the to_table collection).
tests/test_pipeline_tpch_custom.py
Outdated
```py
PyDoughPandasTest(
    "asian_nations = nations.WHERE(region.name == 'ASIA').CALCULATE(nation_key=key, nation_name=name)\n"
    "asian_tmp = pydough.to_table(asian_nations, name='asian_nations_t4', replace=True)\n"
    "result = CROSS(asian_tmp).CALCULATE(nation_key, nation_name).ORDER_BY(nation_key.ASC())",
```
Can we use CROSS like this without anything before? I thought it must have something before like collection1.CROSS(collection2). (Just making sure)
EDIT: Based on Kian's comment elsewhere, this behavior should not happen. Updated code to error if this is used and updated the tests.
knassre-bodo
left a comment
Initial review done; overall great work but some things that need to get iterated on.
```
| SQLite     | No (uses DROP + CREATE) | Yes | No (uses DROP + CREATE) | Yes |
| Snowflake  | Yes                     | Yes | Yes                     | No  |
| PostgreSQL | No (uses DROP + CREATE) | Yes | Yes                     | No  |
| MySQL      | No (uses DROP + CREATE) | Yes | Yes                     | No  |
```

Let's include Oracle and BodoSQL here (which reminds me... this PR will probably need some brief tests for both of those)
Sure. The PR was up for review before these were merged; as of today only BodoSQL is, so I'll work on that for now.
EDIT: BodoSQL relies on Bodo, which dropped Mac/Intel support. I'll disable the feature for BodoSQL until I make the machine switch and can resume it.
#### Example 1: Basic Table Materialization

Below is an example of using `pydough.to_table` to materialize a filtered query as a temporary table, then query from it:

```py
%%pydough
# Create a temporary table with Asian nations
asian_nations = nations.WHERE(region.name == 'ASIA')
asian_tmp = pydough.to_table(asian_nations, name='asian_nations', temp=True)

# Query from the materialized table - direct method call
result = asian_tmp.CALCULATE(name)
pydough.to_df(result)
```
Let's also show an example with the sql for both steps: what does the DDL sql look like, and what does the final to_df sql look like?
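To illustrate the two-step shape being asked for, here is a self-contained emulation on an in-memory SQLite database. The schema, the DDL, and the final query are hand-written stand-ins, not the SQL PyDough actually generates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Tiny TPC-H-like fixture (illustrative data, not the real benchmark tables).
con.executescript("""
CREATE TABLE nation(n_nationkey INT, n_name TEXT, n_regionkey INT);
CREATE TABLE region(r_regionkey INT, r_name TEXT);
INSERT INTO region VALUES (1, 'ASIA'), (2, 'EUROPE');
INSERT INTO nation VALUES (10, 'JAPAN', 1), (20, 'FRANCE', 2);
""")

# Step 1: the kind of DDL a to_table(..., temp=True) call would execute.
con.execute(
    "CREATE TEMP TABLE asian_nations AS "
    "SELECT n.n_name AS name FROM nation n "
    "JOIN region r ON n.n_regionkey = r.r_regionkey "
    "WHERE r.r_name = 'ASIA'"
)

# Step 2: the final to_df query reads from the materialized table.
rows = con.execute("SELECT name FROM asian_nations").fetchall()
print(rows)  # [('JAPAN',)]
```

The point is that the second query no longer references `nation`/`region` at all; it only touches the materialized object.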
```py
# Handle the case where the ancestor is a ChildOperatorChildAccess
# (which happens when using CROSS at the top level with a
# generated collection). In that case, unwrap it and process the
# inner child_access (typically a GlobalContext).
# Only do this when parent is None (top-level), otherwise let normal
# ChildOperatorChildAccess handling below process it.
ancestor_context = node.ancestor_context
if (
    isinstance(ancestor_context, ChildOperatorChildAccess)
    and parent is None
):
    ancestor_context = ancestor_context.child_access
hybrid = self.make_hybrid_tree(ancestor_context, parent, is_aggregate)
```

I think the problem here is that something else should have raised an error earlier but didn't. Using CROSS in that way without a context doesn't make sense, since CROSS has to be combining two different sides together, but just doing CROSS(asian_tmp) by itself makes no sense.
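A sketch of the raise-instead-of-unwrap behavior the reviewer is asking for. The class and function here are dummies standing in for PyDough's internals; only the shape of the check is the point:

```python
class ChildOperatorChildAccess:
    """Dummy stand-in for PyDough's ChildOperatorChildAccess node."""
    def __init__(self, child_access):
        self.child_access = child_access

def check_cross_context(ancestor_context, parent):
    # Instead of silently unwrapping a top-level ChildOperatorChildAccess,
    # reject it: CROSS must combine two sides, so a bare CROSS(x) is invalid.
    if isinstance(ancestor_context, ChildOperatorChildAccess) and parent is None:
        raise ValueError(
            "CROSS cannot be used at the top level; it must combine two "
            "collections, e.g. collection1.CROSS(collection2)."
        )
    return ancestor_context
```

This matches the resolution noted earlier in the thread: the code now errors on top-level CROSS and the tests were updated.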
```py
# TODO: (gh #175) enable typed DataFrames.
data = self.cursor.fetchall()
return pd.DataFrame(data, columns=column_names)

def execute_ddl(self, sql: str) -> None:
```

We may need to revise how this works (in the DatabaseContext dataclass) to account for BodoSQL, since the way that PR works, it revises DatabaseContext.connection to be either a DatabaseConnector or a BodoSQLContext.
will be addressed in followup PR as discussed offline
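For reference, a minimal sketch of what an `execute_ddl` method looks like, demonstrated on SQLite. This is a simplified stand-in, not PyDough's `DatabaseConnection` (which, per the comment above, may also need a BodoSQL-aware shape):

```python
import sqlite3

class DatabaseConnection:
    """Simplified sketch; PyDough's real DatabaseConnection differs."""

    def __init__(self, connection):
        self.connection = connection
        self.cursor = connection.cursor()

    def execute_ddl(self, sql: str) -> None:
        # DDL produces no result set; commit so later queries see the object.
        self.cursor.execute(sql)
        self.connection.commit()

conn = DatabaseConnection(sqlite3.connect(":memory:"))
conn.execute_ddl("CREATE TABLE t(x INT)")
conn.execute_ddl("DROP TABLE IF EXISTS t")
# After the drop, no user tables remain.
tables = conn.cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

Unlike the query path, nothing is fetched and no DataFrame is built; the method exists purely for statements like `CREATE ... TABLE/VIEW` and `DROP ... IF EXISTS`.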
```py
"name": "regions",
"type": "simple table",
"table path": "TPCH_SF1.REGION",
"table path": "SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.REGION",
```
Also, have you re-run pdunit_update -m "not execute" on all the tests? Because I imagine this change to the graph would change the SQL for all of our TPCH snowflake SQL tests.
.gitignore
Outdated
```
# Ignore tpch.db file
tpch.db
```

We already ignore *.db earlier, so this is redundant.

Yes, I noticed that and forgot to remove it.
```py
"as_view, replace, temp",
[
    (False, False, False),
    (False, False, True),
    (True, False, True),
    (True, False, False),
    (False, True, False),
    (False, True, True),
    (True, True, False),
    (True, True, True),
],
)
```

Won't it be highly problematic to run these tests multiple times, especially with some contexts like Snowflake, since if temp is False it is just adding a bunch of tables which will still be there during the next test, or worse during the next pytest run? Won't we need some kind of cleanup step?
There's a cleanup step that handles that in run_e2e_test_to_table
```py
cleanup_statement = f"DROP {table_or_view} IF EXISTS {table_name}"
```
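The cleanup idea can be sketched end-to-end on SQLite; the helper name below is illustrative (the real logic lives in `run_e2e_test_to_table`), but it shows why `IF EXISTS` makes the step safe to run repeatedly:

```python
import sqlite3

def cleanup_to_table(connection, table_name, as_view=False):
    # DROP ... IF EXISTS succeeds whether or not the object is present,
    # so cleanup is idempotent across tests and pytest runs.
    table_or_view = "VIEW" if as_view else "TABLE"
    connection.execute(f"DROP {table_or_view} IF EXISTS {table_name}")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE asian_nations_t4(x INT)")
cleanup_to_table(con, "asian_nations_t4")
cleanup_to_table(con, "asian_nations_t4")  # second run: no error
```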
```py
num_journals=n_jours,
ratio=n_pubs / n_jours,
)
.ORDER_BY(year.ASC(na_pos="last"))
```

Not related to this PR.
As I'm the lucky person to have tests fail on my runs, this Snowflake test decided to fail for me 🤣
Fix: match the SQLite text in "defog_sql_text_academic_gen14" with `ORDER BY publication.year NULLS LAST;` and make the return deterministic.

```
E   AssertionError: DataFrame.iloc[:, 0] (column name="year") are different
DataFrame.iloc[:, 0] (column name="year") values are different (100.0 %)
[index]: [0, 1]
[left]:  [2021, 2020]
[right]: [2020, 2021]
```
```py
schema=schema_name,
)

# Sqlite's datetime functions operate in UTC,
```
Unrelated to the PR.
The defog Snowflake e2e tests compare PyDough results on Snowflake against reference SQL on SQLite. SQLite always uses UTC, but Snowflake defaults to Pacific Time, so time-relative queries ("last week", "today", etc.) diverge in certain day/time runs. This fix ensures the Snowflake test connection sets TIMEZONE = 'UTC' to match SQLite's behavior.
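The statement used to pin a Snowflake session's timezone can be sketched as below; `ALTER SESSION SET TIMEZONE` is standard Snowflake SQL, though whether this PR applies it via a statement or via connection parameters is not shown here, and the builder function is purely illustrative:

```python
def make_session_setup_sql(tz: str = "UTC") -> str:
    # Builds the ALTER SESSION statement that pins the Snowflake session
    # timezone so date-relative queries match SQLite's UTC behavior.
    return f"ALTER SESSION SET TIMEZONE = '{tz}'"

print(make_session_setup_sql())  # ALTER SESSION SET TIMEZONE = 'UTC'
```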
Summary

This PR implements the `to_table` functionality for PyDough, allowing users to materialize PyDough queries as database tables or views, and then use them in subsequent queries.

Workflow

PyDough Query -> `to_table()` -> DDL executed -> `ViewGeneratedCollection` -> use in new PyDough Query

- Write a PyDough query for the collection to materialize
- Call `to_table()` to materialize it (executes DDL such as `CREATE TABLE AS SELECT ...`)
- Use the returned collection (`ViewGeneratedCollection`) in new PyDough queries

Example

Main Changes

Added `to_table()` function:
- `as_view=True` to create views instead of tables
- `replace=True` to replace existing tables/views
- `temp=True` to create temporary tables

Added `ViewGeneratedCollection`.

Added `execute_ddl()` method to DatabaseConnection:
- Executes DDL statements (`CREATE [OR REPLACE TEMP] TABLE/VIEW`, `DROP TABLE/VIEW IF EXISTS`)

Test Infrastructure

- Added `reset_active_session` fixture to automatically reset the global active session after each test, avoiding session overlap which led to some duplicate-writing errors

closes #499