
Support Materialized Views (to_table)#493

Open
hadia206 wants to merge 109 commits into main from Hadia/materialize_view

Conversation


@hadia206 hadia206 commented Feb 13, 2026

Summary
This PR implements the to_table functionality for PyDough, allowing users to materialize PyDough queries as database tables or views, and then use them in subsequent queries.

Workflow
PyDough Query -> to_table() -> DDL executed -> ViewGeneratedCollection -> use in new PyDough Query

  1. User writes PyDough query
  2. User calls to_table() to materialize it
  3. PyDough generates DDL (CREATE TABLE AS SELECT...)
  4. DDL is executed on the database
  5. Returns a collection reference to the new table (ViewGeneratedCollection)
  6. User can use that reference in new PyDough queries
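
The DDL assembly in step 3 can be sketched as follows (a hypothetical simplification for illustration; `build_create_ddl` is not a real PyDough function, and the actual generator also consults per-dialect capabilities):

```python
def build_create_ddl(name: str, select_sql: str, *,
                     as_view: bool = False, temp: bool = False,
                     replace: bool = False) -> str:
    # Assemble "CREATE [OR REPLACE] [TEMPORARY] TABLE/VIEW <name> AS <select>".
    parts = ["CREATE"]
    if replace:
        parts.append("OR REPLACE")
    if temp:
        parts.append("TEMPORARY")
    parts.append("VIEW" if as_view else "TABLE")
    parts.append(name)
    return " ".join(parts) + f" AS {select_sql}"

print(build_create_ddl("asian_nations", "SELECT * FROM nations WHERE ...", temp=True))
# CREATE TEMPORARY TABLE asian_nations AS SELECT * FROM nations WHERE ...
```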

Example

# Step 1: PyDough query
asian_nations = nations.WHERE(region.name == 'ASIA')

# Steps 2-5: Materialize it as a temp table
asian_tmp = pydough.to_table(asian_nations, name='asian_nations', temp=True)

# Step 6: Use the materialized table in subsequent queries
result = asian_tmp.CALCULATE(name).ORDER_BY(name)

# Use with other collections via CROSS
result = regions.CROSS(asian_tmp).WHERE(asian_tmp.region_key == regions.key).CALCULATE(
    nation_name=asian_tmp.name,
    region_name=regions.name
)

Main Changes

  • Added to_table() function:

    • Generates appropriate DDL statements for each database dialect (SQLite, MySQL, PostgreSQL, Snowflake) and returns a collection reference that can be used in subsequent PyDough queries
    • Support for as_view=True to create views instead of tables
    • Support for replace=True to replace existing tables/views
    • Support for temp=True to create temporary tables
  • ViewGeneratedCollection :

    • New collection type representing a user-created table/view
  • Added execute_ddl() method to DatabaseConnection:

    • Execute DDL statements (CREATE [OR REPLACE TEMP] TABLE/VIEW, DROP TABLE/VIEW IF EXISTS)
  • Test Infrastructure

    • Added a reset_active_session fixture that automatically resets the global active session after each test, avoiding session overlap that led to duplicate-write errors
    • Tests for different PyDough queries
    • Tests for different DDL statements
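
As a rough illustration of the `execute_ddl()` behavior described above, here is a minimal sketch against stdlib `sqlite3` (`SimpleConnection` is a hypothetical stand-in for illustration, not PyDough's actual `DatabaseConnection` class):

```python
import sqlite3

class SimpleConnection:
    """Hypothetical minimal stand-in for a database connection,
    sketching how an execute_ddl() method might behave."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.connection = conn

    def execute_ddl(self, sql: str) -> None:
        # DDL statements return no result rows: just execute and commit.
        self.connection.execute(sql)
        self.connection.commit()

db = SimpleConnection(sqlite3.connect(":memory:"))
db.execute_ddl("CREATE TABLE asian_nations AS SELECT 1 AS nation_key")
db.execute_ddl("DROP TABLE IF EXISTS asian_nations")
```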

closes #499


# Query from the materialized table - direct method call works for simple queries
result = asian_tmp.CALCULATE(name)
```
Contributor:

Can we add the actual result for each example? I think it would be really helpful for understanding what this API creates.

Contributor Author:

Done.

The first argument it takes in is the PyDough node for the collection being materialized. The second argument is the name of the view/table to create. It can optionally take in the following keyword arguments:

- `as_view`: If `True`, create a VIEW. If `False` (default), create a TABLE.
- `replace`: If `True`, drop table/view if exists and then create the table/view. For Snowflake, use `CREATE OR REPLACE` to allow replacing an existing view/table. Default is `False`.
Contributor:

What happens if replace=False and the user tries to create a view/table that already exists? Can we specify it here?

Contributor:

Agreed. Also, let's make the format of how defaults are declared consistent between this vs as_view.

Contributor Author:

Not sure what you mean? If `replace=False` and the user tries to create one that already exists, it will fail, as expected, in all SQL engines.
I'll add the note, but I want to make sure I understand: you mean just state this behavior, and I'm not missing something?
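
For illustration, here is what that failure looks like in SQLite (a demonstration with stdlib `sqlite3`, not PyDough code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t AS SELECT 1 AS x")
try:
    # Without OR REPLACE / IF NOT EXISTS, re-creating an existing table
    # is an error (SQLite shown; other engines behave analogously).
    conn.execute("CREATE TABLE t AS SELECT 2 AS x")
except sqlite3.OperationalError as e:
    print(e)  # table t already exists
```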

- actual_temp is the final temp value (may differ from input due to dialect limitations)
"""
# Handle differences in CREATE syntax for different databases.
create_caps = CREATE_CAPABILITIES[db_dialect]
Contributor:

Type hints

)

# Check if we can use CREATE OR REPLACE
can_replace = create_caps.replace_view if as_view else create_caps.replace_table
Contributor:

type hint

raise PyDoughException(
f"TEMPORARY views are not supported for {session.database.dialect.name}"
)
# session.metadata = graph
Contributor:

Is this an actual comment?

Contributor Author:

Nope, outdated comment. Removed.

the created view/table.

"""
_validate_table_name(name)
Contributor:

There is a function from error_utils called is_valid_sql_identifier() that can be used here. There are more functions that can be used for validation like unique_properties_predicate.verify() for unique_columns. Also don't forget to manage quoted names for the table name and for columns. For reference, see how I use normalize_column_name in create_constant_table. I think you can use it as well.
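
A minimal sketch of what such validation might look like (the names and regex here are assumptions for illustration; the real `is_valid_sql_identifier()` and `normalize_column_name` in PyDough's error_utils may differ):

```python
import re

# Hypothetical identifier check: a leading letter/underscore followed
# by letters, digits, or underscores. The real validator may differ.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def looks_like_valid_identifier(name: str) -> bool:
    return bool(_IDENTIFIER.match(name))

def normalize_quoted_name(name: str) -> str:
    # Strip one layer of double quotes, as a sketch of handling
    # quoted table/column names.
    if len(name) >= 2 and name[0] == name[-1] == '"':
        return name[1:-1]
    return name
```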

Contributor Author:

Thanks. I missed that.

@@ -0,0 +1,126 @@
"""
A user-defined collection representing a database [temporary] view/table.
Contributor:

Because this is inside the user_collections folder, can we add documentation to the README.md file? Just a brief description of this class would be fine.

# then CALCULATE on materialized view
pytest.param(
PyDoughPandasTest(
"asian_nations = nations.WHERE(region.name == 'ASIA')\n"
Contributor:

Can we create a view collection from a range/dataframe collection? If so, we should add a test. Can we combine user-generated collections, for example CROSS a dataframe/range collection with a view collection?

Contributor:

We should also have tests where the uniqueness columns from to_table come into play (e.g. .BEST where the per=... ancestor is the to_table collection).

Contributor Author:

Done

PyDoughPandasTest(
"asian_nations = nations.WHERE(region.name == 'ASIA').CALCULATE(nation_key=key, nation_name=name)\n"
"asian_tmp = pydough.to_table(asian_nations, name='asian_nations_t4', replace=True)\n"
"result = CROSS(asian_tmp).CALCULATE(nation_key, nation_name).ORDER_BY(nation_key.ASC())",
Contributor:

Can we use CROSS like this without anything before? I thought it must have something before like collection1.CROSS(collection2). (Just making sure)

Contributor Author (@hadia206, Mar 4, 2026):

EDIT: Based on Kian's comment elsewhere, this behavior should not happen. Updated code to error if this is used and updated the tests.

Contributor @knassre-bodo left a comment:

Initial review done; overall great work but some things that need to get iterated on.


| Dialect | `CREATE OR REPLACE` TABLE | TEMP TABLE | `CREATE OR REPLACE` VIEW | TEMP VIEW |
| --- | --- | --- | --- | --- |
| SQLite | No (uses DROP + CREATE) | Yes | No (uses DROP + CREATE) | Yes |
| Snowflake | Yes | Yes | Yes | No |
| PostgreSQL | No (uses DROP + CREATE) | Yes | Yes | No |
| MySQL | No (uses DROP + CREATE) | Yes | Yes | No |
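
The matrix above could be encoded roughly like this (a sketch: the class name, field names, and dialect keys are assumptions inferred from the `create_caps.replace_view` / `create_caps.replace_table` snippets elsewhere in this PR):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateCapabilities:
    # Whether the dialect supports CREATE OR REPLACE for tables/views,
    # and whether it supports TEMPORARY tables/views.
    replace_table: bool
    temp_table: bool
    replace_view: bool
    temp_view: bool

# Hypothetical encoding of the capability matrix above.
CREATE_CAPABILITIES = {
    "sqlite": CreateCapabilities(False, True, False, True),
    "snowflake": CreateCapabilities(True, True, True, False),
    "postgres": CreateCapabilities(False, True, True, False),
    "mysql": CreateCapabilities(False, True, True, False),
}
```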
Contributor:

Let's include Oracle and BodoSQL here (which reminds me... this PR will probably need some brief tests for both of those)

Contributor Author (@hadia206, Mar 4, 2026):

Sure. The PR was up for review before those were merged; as of today only BodoSQL has been, so I'll work on that for now.

EDIT: BodoSQL relies on Bodo, which dropped Mac/Intel support. I'll disable the feature for BodoSQL until I make the machine switch and can resume it.

Comment on lines +587 to +600
#### Example 1: Basic Table Materialization

Below is an example of using `pydough.to_table` to materialize a filtered query as a temporary table, then query from it:

```py
%%pydough
# Create a temporary table with Asian nations
asian_nations = nations.WHERE(region.name == 'ASIA')
asian_tmp = pydough.to_table(asian_nations, name='asian_nations', temp=True)

# Query from the materialized table - direct method call
result = asian_tmp.CALCULATE(name)
pydough.to_df(result)
```
Contributor:

Let's also show an example with the sql for both steps: what does the DDL sql look like, and what does the final to_df sql look like?

Contributor Author:

Done

Comment on lines +1572 to +1584
# Handle the case where the ancestor is a ChildOperatorChildAccess
# (which happens when using CROSS at the top level with a
# generated collection). In that case, unwrap it and process the
# inner child_access (typically a GlobalContext).
# Only do this when parent is None (top-level), otherwise let normal
# ChildOperatorChildAccess handling below process it.
ancestor_context = node.ancestor_context
if (
isinstance(ancestor_context, ChildOperatorChildAccess)
and parent is None
):
ancestor_context = ancestor_context.child_access
hybrid = self.make_hybrid_tree(ancestor_context, parent, is_aggregate)
Contributor:

I think the problem here is that something else should have raised an error earlier but didn't. Using CROSS in that way without a context doesn't make sense, since CROSS has to combine two different sides, but doing CROSS(asian_tmp) by itself makes no sense.

# TODO: (gh #175) enable typed DataFrames.
data = self.cursor.fetchall()
return pd.DataFrame(data, columns=column_names)
def execute_ddl(self, sql: str) -> None:
Contributor:

We may need to revise how this works (in the DatabaseContext dataclass) to account for BodoSQL, since the way that PR works, it revises DatabaseContext.connection to be either a DatabaseConnector or a BodoSQLContext.

Contributor Author:

will be addressed in followup PR as discussed offline

"name": "regions",
"type": "simple table",
"table path": "TPCH_SF1.REGION",
"table path": "SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.REGION",
Contributor:

Also, have you re-run pdunit_update -m "not execute" on all the tests? Because I imagine this change to the graph would change the SQL for all of our TPCH snowflake SQL tests.

.gitignore Outdated
Comment on lines +6 to +8
# Ignore tpch.db file
tpch.db

Contributor:

We already ignore *.db earlier, so this is redundant.

Contributor Author:

Yes, I noticed that and forgot to remove it.

Comment on lines +6316 to +6327
"as_view, replace, temp",
[
(False, False, False),
(False, False, True),
(True, False, True),
(True, False, False),
(False, True, False),
(False, True, True),
(True, True, False),
(True, True, True),
],
)
Contributor:

Won't it be highly problematic to run these tests multiple times, especially with some contexts like Snowflake? If temp is False, each run adds a bunch of tables that will still be there during the next test, or worse, during the next pytest run. Won't we need some kind of cleanup step?

Contributor Author:

There's a cleanup step that handles that in run_e2e_test_to_table
cleanup_statement = f"DROP {table_or_view} IF EXISTS {table_name}"
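
A sketch of that cleanup pattern, using stdlib `sqlite3` (hypothetical illustration; the real `run_e2e_test_to_table` helper differs):

```python
import sqlite3

def run_to_table_test_sketch(conn: sqlite3.Connection, table_name: str,
                             as_view: bool = False) -> None:
    # The created object is dropped in a finally block, so it never
    # outlives the test even if the test body raises.
    table_or_view = "VIEW" if as_view else "TABLE"
    try:
        conn.execute(f"CREATE {table_or_view} {table_name} AS SELECT 1 AS x")
        # ... the actual test body would query table_name here ...
    finally:
        conn.execute(f"DROP {table_or_view} IF EXISTS {table_name}")
```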



num_journals=n_jours,
ratio=n_pubs / n_jours,
)
.ORDER_BY(year.ASC(na_pos="last"))
Contributor Author (@hadia206, Mar 9, 2026):

Not related to this PR.

As I'm the lucky person to have tests fail on my runs, this Snowflake test decided to fail with me 🤣

Fix to match the SQLite text in "defog_sql_text_academic_gen14" (ORDER BY publication.year NULLS LAST;) and make the return deterministic.

E   AssertionError: DataFrame.iloc[:, 0] (column name="year") are different
   DataFrame.iloc[:, 0] (column name="year") values are different (100.0 %)
   [index]: [0, 1]
   [left]:  [2021, 2020]
   [right]: [2020, 2021]

schema=schema_name,
)

# Sqlite's datetime functions operate in UTC,
Contributor Author:

Unrelated to the PR.

The defog Snowflake e2e tests compare PyDough results on Snowflake against reference SQL on SQLite. SQLite always uses UTC, but Snowflake defaults to Pacific Time, so time-relative queries ("last week", "today", etc.) diverge in certain day/time runs. This fix ensures the Snowflake test connection sets TIMEZONE = 'UTC' to match SQLite's behavior.
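
The SQLite half of that mismatch is easy to demonstrate with stdlib `sqlite3` (illustration only, not PyDough code):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# SQLite evaluates datetime('now') in UTC regardless of the host's
# local timezone, which is why the Snowflake session must also be
# pinned to UTC for the comparison tests to line up.
(sqlite_now,) = conn.execute("SELECT datetime('now')").fetchone()
utc_now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(sqlite_now, utc_now)  # the two timestamps agree (to within clock skew)
```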


Successfully merging this pull request may close these issues: Materialize PyDough Queries as Database Views/Tables.

3 participants