
Adds support for filtering data during initial replication #502

Open

raffidahmad wants to merge 5 commits into powersync-ja:main from raffidahmad:feature/partial-initial-replication

Conversation


@raffidahmad raffidahmad commented Feb 11, 2026

Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause.

This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed.

Example included in the changeset file


changeset-bot bot commented Feb 11, 2026

🦋 Changeset detected

Latest commit: 3e993e1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages
Name Type
@powersync/service-sync-rules Minor
@powersync/service-core Minor
@powersync/service-module-postgres-storage Patch
@powersync/service-module-mysql Minor
@powersync/service-jpgwire Patch
@powersync/service-core-tests Patch
@powersync/lib-services-framework Patch
@powersync/service-module-mongodb-storage Patch
@powersync/service-module-mongodb Patch
@powersync/service-module-mssql Patch
@powersync/service-module-postgres Patch
@powersync/service-module-core Patch
@powersync/service-image Minor
test-client Patch
@powersync/service-schema Minor
@powersync/lib-service-postgres Patch
@powersync/service-rsocket-router Patch
@powersync/lib-service-mongodb Patch



CLAassistant commented Feb 11, 2026

CLA assistant check
All committers have signed the CLA.

@raffidahmad force-pushed the feature/partial-initial-replication branch from 881e732 to 6810fb0 (February 11, 2026 21:23)
@rkistner
Contributor

Thanks for the contribution! It will help a lot to be able to specify filters for the tables.

There is an issue with the current approach of associating the filters with each bucket definition. Say you have these two definitions (the example is a little contrived):

bucket_definitions:
  all_users:
    data:
      - SELECT * FROM users
  
  active_users:
    data:
      - SELECT * FROM users WHERE status = 'active'
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

In this case, the first definition doesn't specify a filter, yet the filter for the second definition affects the data replicated for the first.

This can be fixed by disabling filters for the table if there is any definition without a filter. But I think it's better to make the filters more explicit, by changing the filters to be global instead of per-definition.

Other configuration changes we need to consider:

  1. The config should support specifying tables with different schemas - need to check what the best syntax would be here.
  2. We should consider supporting wildcards in the table names for filters.
  3. We need to consider what the config would look like if we also support Postgres, SQL Server and MongoDB.

My initial thoughts for other databases:

  1. For Postgres, we may want to use row filters defined on the source database, instead of specifying config in the sync rules file.
  2. For MongoDB, the syntax would be a BSON-style query, rather than a SQL string. Should we use a different key for this? And should we use EJSON?
  3. For MongoDB, change stream filters and snapshot query filters are similar, but not exactly the same. Can we infer one from the other, or let the user specify them separately?
  4. I haven't checked the capabilities required for SQL Server yet. But I do know we may want to support other source table config on the same level, such as the CDC table name.


raffidahmad commented Feb 12, 2026

> In this case, the first definition doesn't specify a filter, yet the filter for the second definition affects the data replicated for the first.
>
> This can be fixed by disabling filters for the table if there is any definition without a filter. But I think it's better to make the filters more explicit, by changing the filters to be global instead of per-definition.

I initially explored the global filters approach for simplicity, but then decided on per-definition filters for optimization. I think your suggestion to disable filters for a table if any definition lacks a filter will be fine.

> Other configuration changes we need to consider:
>
> 1. The config should support specifying tables with different schemas - need to check what the best syntax would be here.
> 2. We should consider supporting wildcards in the table names for filters.

1. How about the fully qualified name as the key? It's cleaner.
2. Wildcards could add complexity. We can add wildcard support in a future enhancement if people request it.
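
As a sketch of the fully-qualified-key idea (hypothetical syntax; the table names and filters here are illustrative, not from the PR):

```yaml
# Hypothetical sketch: schema-qualified table names as keys.
source_tables:
  "inventory.products":
    snapshot_filter: "discontinued = 0"
  "sales.orders":
    snapshot_filter: "created_at >= '2025-01-01'"
```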

> For Postgres, we may want to use row filters defined on the source database, instead of specifying config in the sync rules file.

We can use the same filter configuration for both initial snapshot and logical replication, which will provide consistency.

> For MongoDB, the syntax would be a BSON-style query, rather than a SQL string. Should we use a different key for this? And should we use EJSON?

Separate keys make sense. Something like:

snapshot_filters:
  sql: "archived = false"      # MySQL, Postgres, SQL Server
  mongo: { archived: false }   # MongoDB

> For MongoDB, change stream filters and snapshot query filters are similar, but not exactly the same. Can we infer one from the other, or let the user specify them separately?

I would infer changestream filters from snapshot filters for consistency. If differences arise later, we can add separate keys. For now, using the same filter simplifies configuration and reduces confusion.
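
A sketch of how inference could work for simple predicates (this assumes MongoDB change stream events, which nest the changed document under a `fullDocument` field; the key names are illustrative):

```yaml
# Hypothetical sketch: inferring a change-stream match from a snapshot filter.
snapshot_filter: { archived: false }   # applied to the initial collection scan
# Change stream events wrap the document, so the inferred $match stage
# would target the nested path instead:
#   { "fullDocument.archived": false }
```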

@rkistner
Contributor

Here is another example that highlights issues with per-definition filters:

bucket_definitions:
  inactive_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'inactive'"
  
  active_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

To someone not that familiar with the replication process, it may appear that these filters are applied per-definition. In reality, the filters are applied globally, and both definitions will end up with the same data. It is this mismatch between where the filters are specified and how they are used that can cause confusion and potential issues.

> Could [the schema] be solved by just using the table key as the full reference?

Yes, that should work; we just have to check for edge cases, such as special characters in the table name.

> I think explicit table names will be more clear, won't it?

Since we support wildcard table names in queries, we should support the same for the filters. A specific use case example is Postgres partitioned tables.

> I based this on your suggestion that the snapshot filter clause will just be forwarded raw. I would like to add support for joins, though. Are you foreseeing any issues with this approach of forwarding the filter as-is?

Yes, the plan is to just forward it as-is for the most part.

Technically you could use subqueries in the snapshot filters if we forward as-is, for example: [WHERE] user_id NOT IN (SELECT user_id FROM deleted_users). However, I'd discourage this in most cases - the same subqueries or joins (if we did add support) won't work in the binlog streaming. So the only thing you can use it for is for optimizing the data that is loading initially, and I'd recommend rather denormalizing data to do that.
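
To make the forward-as-is behavior concrete, a sketch of the resulting snapshot query (assuming a users table and the subquery filter above; the exact query shape the service generates may differ):

```sql
-- Sketch: the configured snapshot_filter is appended verbatim to the
-- initial snapshot query. It is evaluated once, during the snapshot only;
-- subsequent binlog streaming does not re-apply the subquery.
SELECT * FROM users
WHERE user_id NOT IN (SELECT user_id FROM deleted_users);
```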

Supporting JOINs in sync streams is something that we're working towards (with limitations), but that is different from using JOINs in the replication process.

> We can use the same filter configuration for both initial snapshot and logical replication, which will provide consistency.

Postgres row filters for logical replication must be configured in the source database - we cannot do that when reading the stream. There is a case to be made for configuring these filters automatically based on the sync config, but that's a bigger change.

So since those row filters must already be configured on the source database, I think it's best to just re-use the same ones for the snapshot queries, rather than duplicating in sync rules. That would also match how Postgres itself does logical replication.
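
For reference, Postgres 15+ declares these row filters on the publication, on the source database itself (the publication name here is illustrative):

```sql
-- Postgres 15+: a row filter attached to the publication.
-- Only rows matching the WHERE clause are published via logical replication.
CREATE PUBLICATION powersync_pub FOR TABLE users WHERE (status = 'active');
```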

MySQL and MongoDB don't have this same functionality, so there it does make more sense to configure it in the sync config.


raffidahmad commented Feb 12, 2026

> To someone not that familiar with the replication process, it may appear that these filters are applied per-definition. In reality, the filters are applied globally, and both definitions will end up with the same data. It is this mismatch between where the filters are specified and how they are used that can cause confusion and potential issues.

Ah, I see it now. For phase 1 we can keep them global and revisit more granular control in a separate issue. Does that sound okay?

> Yes, that should work, just have to check for edge cases around special cases, such as special characters in the table name.

Okay, I will look into it.

> Since we support wildcard table names in queries, we should support the same for the filters. A specific use case example is Postgres partitioned tables.

Got it

> Technically you could use subqueries in the snapshot filters if we forward as-is, for example: [WHERE] user_id NOT IN (SELECT user_id FROM deleted_users). However, I'd discourage this in most cases - the same subqueries or joins (if we did add support) won't work in the binlog streaming. So the only thing you can use it for is for optimizing the data that is loading initially, and I'd recommend rather denormalizing data to do that.

I would assume the key being named "snapshot_filter" is self-explanatory, i.e. that it only applies to initial replication.
Alternatively, we could call it "initial_snapshot_filter"?

> Supporting JOINs in sync streams is something that we're working towards (with limitations), but that is different from using JOINs in the replication process.

I will leave JOIN support out for now since we can use sub-queries.

> So since those row filters must already be configured on the source database, I think it's best to just re-use the same ones for the snapshot queries, rather than duplicating in sync rules. That would also match how Postgres itself does logical replication.
>
> MySQL and MongoDB don't have this same functionality, so there it does make more sense to configure it in the sync config.

Fair point; leaving it out for this PR, then.


raffidahmad commented Feb 12, 2026

With a global approach:

initial_snapshot_filters:
  users:
    sql: "status = 'active'"

bucket_definitions:
  inactive_users:
    data:
      - SELECT * FROM users WHERE status = 'inactive'

The inactive_users bucket would be empty initially (or only contain users who became inactive after snapshot via CDC), which is confusing.

Option 1: Union All Bucket Queries (Safest)
Automatically snapshot rows matching ANY bucket's WHERE clause.

Option 2: Explicit Global Filter
User accepts that inactive_users bucket won't have historical data, only new transitions.
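
A sketch of what Option 1 would evaluate to for the earlier two-bucket example (conceptually, the snapshot predicate becomes the union of every bucket's WHERE clause):

```sql
-- Sketch of Option 1: snapshot every row that matches at least one
-- bucket's WHERE clause.
SELECT * FROM users
WHERE status = 'inactive'  -- from the inactive_users bucket
   OR status = 'active';   -- from the active_users bucket
```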

@rkistner
Contributor

> The inactive_users bucket would be empty initially (or only contain users who became inactive after snapshot via CDC), which is confusing.

Yes, that is still a potential source of confusion, and I'd consider this an "advanced"/"use with caution" feature for that reason.

The main advantage of the global filter approach is that it makes it clear filters are applied to all definitions, not only some of them.

I'd recommend this syntax:

source_tables:
  users:
    mysql_snapshot_filter: "status = 'active'"

definitions:
  ...

Explicitly including "mysql" in the key helps to highlight that it only applies to mysql right now, and the syntax is mysql-specific.

@raffidahmad force-pushed the feature/partial-initial-replication branch 2 times, most recently from 76c9c38 to f4385e1 (February 12, 2026 14:12)
@raffidahmad
Author

@rkistner Please review now, I've updated the PR.

I will add support for the other sources and mongo storage after that.

@rkistner
Contributor

Thanks, I think we're converging on a good approach now.

I'll only be able to thoroughly review next week or the week after.

@raffidahmad changed the title from "Adds snapshot filtering for binlog replication" to "Adds support for filtering data during initial replication" (Feb 13, 2026)
Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause.

This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed.

The commit also includes tests to verify the functionality of snapshot filtering, including handling of CDC changes and multiple bucket filters.

Only for MySQL source and PostgreSQL storage.
Addresses an issue where the closing curly brace was misplaced, potentially preventing the filter from being applied.

Removes an obsolete test case.
@raffidahmad force-pushed the feature/partial-initial-replication branch from 7a0a709 to 4c14793 (February 13, 2026 04:08)
@raffidahmad force-pushed the feature/partial-initial-replication branch from 55b1035 to e5c9853 (February 13, 2026 08:01)
@raffidahmad
Author

Added support for all sources and both MongoDB and PostgreSQL storage.
Tested locally with a MySQL source and PostgreSQL storage (not extensively; I will continue to use it and push any fixes if required).

Explains that widening filters after the initial snapshot requires a resnapshot or backfill to include previously excluded rows.
Also updates the summary of initial snapshot filters to reflect the same.