Adds support for filtering data during initial replication #502
raffidahmad wants to merge 5 commits into powersync-ja:main
|
Thanks for the contribution! It will help a lot to be able to specify filters for the tables.

There is an issue with the current approach of associating the filters with each bucket definition. Say you have these two definitions (the example is a little contrived):

bucket_definitions:
  all_users:
    data:
      - SELECT * FROM users
  active_users:
    data:
      - SELECT * FROM users WHERE status = 'active'
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

In this case, the first definition doesn't specify a filter, but the filter for the second definition still affects the data replicated for the first. This can be fixed by disabling filters for the table if there is any definition without a filter. But I think it's better to make the filters more explicit, by changing the filters to be global instead of per-definition.

Other configuration changes we need to consider:
My initial thoughts for other databases:
|
I initially explored the global filters approach for simplicity, but then decided on per-definition filters for optimization. I think your suggestion to disable filters for a table if any definition lacks a filter will be fine.
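To make the "disable filters if any definition lacks a filter" rule concrete, here is a minimal sketch. The function name `combineSnapshotFilters` and the choice to OR the remaining filters together are assumptions for illustration only; they are not taken from this PR's code.

```typescript
// Hypothetical helper, not part of the PowerSync codebase.
// Each entry is the snapshot filter declared by one bucket definition
// for the same source table, or undefined if that definition has none.
function combineSnapshotFilters(filters: Array<string | undefined>): string | undefined {
  // No definition references the table: nothing to filter on.
  if (filters.length === 0) {
    return undefined;
  }

  const unique: string[] = [];
  for (const filter of filters) {
    const trimmed = filter ? filter.trim() : "";
    // A definition without a filter needs every row, so filtering is disabled.
    if (trimmed === "") {
      return undefined;
    }
    if (unique.indexOf(trimmed) === -1) {
      unique.push(trimmed);
    }
  }

  // Assumption: a row is replicated if it matches at least one definition's filter.
  return unique.map((f) => "(" + f + ")").join(" OR ");
}
```

With the contrived example above, `all_users` contributes `undefined`, so the combined result is `undefined` and the whole `users` table is snapshotted.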
We can use the same filter configuration for both initial snapshot and logical replication, which will provide consistency.
Separate keys make sense. Something like
I would infer changestream filters from snapshot filters for consistency. If differences arise later, we can add separate keys. For now, using the same filter simplifies configuration and reduces confusion. |
|
Here is another example that highlights issues with per-definition filters:

bucket_definitions:
  inactive_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'inactive'"
  active_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

To someone not that familiar with the replication process, it may appear that these filters are applied per-definition. In reality, the filters are applied globally, and both definitions will end up with the same data. It is this mismatch between where the filters are specified and how they are used that can cause confusion and potential issues.
Yes, that should work. We just have to check for edge cases, such as special characters in the table name.
Since we support wildcard table names in queries, we should support the same for the filters. A specific use case example is Postgres partitioned tables.
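As a rough illustration of wildcard support for filters, the sketch below looks up the filter for a concrete table name. It assumes a wildcard is a trailing `%` that matches any suffix, and that an exact table name takes precedence over a wildcard; `findSnapshotFilter` and the filter map shape are hypothetical names, not existing API.

```typescript
// Hypothetical lookup, assuming wildcards are written as a trailing "%".
// `filters` maps a table name or wildcard pattern to a WHERE-clause fragment.
function findSnapshotFilter(
  tableName: string,
  filters: { [pattern: string]: string }
): string | undefined {
  // An exact table name wins over any wildcard pattern.
  if (Object.prototype.hasOwnProperty.call(filters, tableName)) {
    return filters[tableName];
  }

  for (const pattern of Object.keys(filters)) {
    const isWildcard = pattern.charAt(pattern.length - 1) === "%";
    if (!isWildcard) {
      continue;
    }
    // Compare against everything before the trailing "%".
    const prefix = pattern.slice(0, -1);
    if (tableName.indexOf(prefix) === 0) {
      return filters[pattern];
    }
  }

  return undefined;
}
```

For a partitioned table, a single `orders_%` entry would then cover `orders_2024`, `orders_2025`, and so on.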
Yes, the plan is to just forward it as-is for the most part. Technically you could use subqueries in the snapshot filters if we forward as-is, for example:
Supporting JOINs in sync streams is something that we're working towards (with limitations), but that is different from using JOINs in the replication process.
Postgres row filters for logical replication must be configured in the source database - we cannot do that when reading the stream. There is a case to be made for configuring these filters automatically based on the sync config, but that's a bigger change. So since those row filters must already be configured on the source database, I think it's best to just re-use the same ones for the snapshot queries, rather than duplicating in sync rules. That would also match how Postgres itself does logical replication. MySQL and MongoDB don't have this same functionality, so there it does make more sense to configure it in the sync config. |
Ah, I see it now. For phase 1 we can keep them global and revisit more granular control in a separate issue. Does that sound okay?
Okay, I will look into it.
Got it.
I would assume that naming the key "snapshot_filter" makes it self-explanatory that this only applies to initial replication.
I will leave JOIN support out for now since we can use sub-queries.
Fair point, leaving it out for this PR then |
|
With a global approach, the inactive_users bucket would be empty initially (or would only contain users who became inactive after the snapshot, via CDC), which is confusing.
Option 1: Union All Bucket Queries (Safest)
Option 2: Explicit Global Filter |
Yes, that is still a potential source of confusion, and I'd consider this an "advanced"/"use with caution" feature for that reason. The main advantage of the global filter approach is that it makes it clear filters are applied to all definitions, not only some of them. I'd recommend this syntax:

source_tables:
  users:
    mysql_snapshot_filter: "status = 'active'"
definitions:
  ...

Explicitly including "mysql" in the key helps to highlight that it only applies to MySQL right now, and that the syntax is MySQL-specific. |
|
@rkistner Please review now, I've updated the PR. I will add support for the other sources and MongoDB storage after that. |
|
Thanks, I think we're converging on a good approach now. I'll only be able to thoroughly review next week or the week after. |
Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause. This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed. The commit also includes tests to verify the functionality of snapshot filtering, including handling of CDC changes and multiple bucket filters. Currently limited to the MySQL source and PostgreSQL storage.
Addresses an issue where the closing curly brace was misplaced, potentially preventing the filter from being applied. Removes an obsolete test case.
|
Added support for all sources and for both MongoDB and PostgreSQL storage. |
Explains that widening filters after the initial snapshot requires a resnapshot or backfill to include previously excluded rows. Also updates the summary of initial snapshot filters to reflect the same.
Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause.
This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed.
Example included in the changeset file
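For readers without the changeset at hand, the following is a minimal sketch of how a configured filter could be appended to a MySQL snapshot query. It is an illustration under stated assumptions, not the code in this PR: `quoteMySqlIdentifier` and `buildSnapshotQuery` are hypothetical names, and the filter is assumed to be forwarded to the database as-is.

```typescript
// Hypothetical helpers, not taken from the PR.

// Quote a MySQL identifier with backticks, doubling any embedded backtick
// so that special characters in table names cannot break the query.
function quoteMySqlIdentifier(name: string): string {
  return "`" + name.replace(/`/g, "``") + "`";
}

// Build the initial snapshot query for one table. When no filter is
// configured (or it is blank), the whole table is selected.
function buildSnapshotQuery(schema: string, table: string, filter?: string): string {
  const base =
    "SELECT * FROM " + quoteMySqlIdentifier(schema) + "." + quoteMySqlIdentifier(table);
  const trimmed = filter ? filter.trim() : "";
  // Assumption: the filter is a trusted WHERE-clause fragment from the sync
  // config and is forwarded verbatim rather than parsed or validated.
  return trimmed === "" ? base : base + " WHERE " + trimmed;
}
```

Because the filter only shapes the initial snapshot, rows excluded here are not picked up later unless they change and arrive through the binlog, or a resnapshot or backfill is run after widening the filter.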