Adds support for filtering data during initial replication by raffidahmad · Pull Request #502 · powersync-ja/powersync-service

raffidahmad · 2026-02-11T21:18:28Z

Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause.

This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed.

An example is included in the changeset file.

changeset-bot · 2026-02-11T21:18:33Z

🦋 Changeset detected

Latest commit: 3e993e1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages

Name	Type
@powersync/service-sync-rules	Minor
@powersync/service-core	Minor
@powersync/service-module-postgres-storage	Patch
@powersync/service-module-mysql	Minor
@powersync/service-jpgwire	Patch
@powersync/service-core-tests	Patch
@powersync/lib-services-framework	Patch
@powersync/service-module-mongodb-storage	Patch
@powersync/service-module-mongodb	Patch
@powersync/service-module-mssql	Patch
@powersync/service-module-postgres	Patch
@powersync/service-module-core	Patch
@powersync/service-image	Minor
test-client	Patch
@powersync/service-schema	Minor
@powersync/lib-service-postgres	Patch
@powersync/service-rsocket-router	Patch
@powersync/lib-service-mongodb	Patch


CLAassistant · 2026-02-11T21:18:34Z

All committers have signed the CLA.

rkistner · 2026-02-12T09:17:33Z

Thanks for the contribution! It will help a lot to be able to specify filters for the tables.

There is an issue with the current approach of associating the filters with each bucket definition. Say you have these two definitions (the example is a little contrived):

bucket_definitions:
  all_users:
    data:
      - SELECT * FROM users
  
  active_users:
    data:
      - SELECT * FROM users WHERE status = 'active'
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

In this case, the first definition doesn't specify a filter, but the filter for the second definition still affects the data replicated for the first.

This can be fixed by disabling filters for the table if there is any definition without a filter. But I think it's better to make the filters more explicit, by changing the filters to be global instead of per-definition.
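
The fallback mentioned above (drop the filter for a table as soon as any definition reads it without one) could look roughly like the sketch below. The `DefinitionTableUse` type and `effectiveSnapshotFilter` function are hypothetical names for illustration, not code from this PR.

```typescript
// Hypothetical shape: one entry per bucket definition that reads from a table.
interface DefinitionTableUse {
  definition: string;
  table: string;
  snapshotFilter?: string; // raw SQL WHERE fragment, if the definition set one
}

// Returns the filter to use for the initial snapshot of `table`,
// or undefined when the whole table has to be copied.
function effectiveSnapshotFilter(uses: DefinitionTableUse[], table: string): string | undefined {
  const relevant = uses.filter((u) => u.table === table);
  if (relevant.length === 0) return undefined;

  const filters: string[] = [];
  for (const u of relevant) {
    // A definition without a filter needs every row, so filtering is disabled.
    if (u.snapshotFilter === undefined) return undefined;
    filters.push(u.snapshotFilter);
  }

  // Otherwise a row is needed if it matches at least one definition's filter.
  const unique = Array.from(new Set(filters));
  return unique.map((f) => `(${f})`).join(' OR ');
}

const exampleUses: DefinitionTableUse[] = [
  { definition: 'all_users', table: 'users' },
  { definition: 'active_users', table: 'users', snapshotFilter: "status = 'active'" }
];

// all_users has no filter, so the users table is snapshotted in full.
console.log(effectiveSnapshotFilter(exampleUses, 'users'));
```

Even with this rule in place, the filter is still applied once per table rather than once per definition, which is the argument for declaring it globally.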

Other configuration changes we need to consider:

The config should support specifying tables with different schemas - need to check what the best syntax would be here.
We should consider supporting wildcards in the table names for filters.
We need to consider what the config would look like if we also support Postgres, SQL Server and MongoDB.

My initial thoughts for other databases:

For Postgres, we may want to use row filters defined on the source database, instead of specifying config in the sync rules file.
For MongoDB, the syntax would be a BSON-style query, rather than a SQL string. Should we use a different key for this? And should we use EJSON?
For MongoDB, change stream filters and snapshot query filters are similar, but not exactly the same. Can we infer one from the other, or let the user specify them separately?
I haven't checked the capabilities required for SQL Server yet. But I do know we may want to support other source table config on the same level, such as the CDC table name.

raffidahmad · 2026-02-12T10:15:58Z

> In this case, the first definition doesn't specify a filter, but the filter for the second definition still affects the data replicated for the first.
>
> This can be fixed by disabling filters for the table if there is any definition without a filter. But I think it's better to make the filters more explicit, by changing the filters to be global instead of per-definition.

I initially explored the global filters approach for simplicity, but then decided on per-definition filters for optimization. I think your suggestion to disable filters for a table if any definition lacks a filter will be fine.

> Other configuration changes we need to consider:
>
> The config should support specifying tables with different schemas - need to check what the best syntax would be here.
>
> We should consider supporting wildcards in the table names for filters.

How about using the fully qualified name as the key? It's cleaner.
Wildcards could add complexity. We can add wildcard support in a future enhancement if people request it.

> For Postgres, we may want to use row filters defined on the source database, instead of specifying config in the sync rules file.

We can use the same filter configuration for both initial snapshot and logical replication, which will provide consistency.

> For MongoDB, the syntax would be a BSON-style query, rather than a SQL string. Should we use a different key for this? And should we use EJSON?

Separate keys make sense. Something like:

snapshot_filters:
      sql: "archived = false"        # MySQL, Postgres, SQL Server
      mongo: {archived: false}       # MongoDB

> For MongoDB, change stream filters and snapshot query filters are similar, but not exactly the same. Can we infer one from the other, or let the user specify them separately?

I would infer changestream filters from snapshot filters for consistency. If differences arise later, we can add separate keys. For now, using the same filter simplifies configuration and reduces confusion.
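
As a rough illustration of the separate `sql` / `mongo` keys proposed in this comment, the sketch below picks the filter that applies to a given source type. All type and function names here (`SnapshotFilterConfig`, `SourceKind`, `resolveFilter`) are assumptions made for the example and are not taken from the PR.

```typescript
// Hypothetical config shape mirroring the `sql` / `mongo` keys proposed above.
interface SnapshotFilterConfig {
  sql?: string; // WHERE fragment for SQL sources
  mongo?: Record<string, unknown>; // query document for MongoDB
}

type SourceKind = 'mysql' | 'postgres' | 'mssql' | 'mongodb';

type ResolvedFilter =
  | { kind: 'sql'; where: string }
  | { kind: 'mongo'; query: Record<string, unknown> }
  | { kind: 'none' };

// Select the filter that matches the source type; fall back to no filter.
function resolveFilter(config: SnapshotFilterConfig | undefined, source: SourceKind): ResolvedFilter {
  if (!config) return { kind: 'none' };
  if (source === 'mongodb') {
    return config.mongo ? { kind: 'mongo', query: config.mongo } : { kind: 'none' };
  }
  return config.sql ? { kind: 'sql', where: config.sql } : { kind: 'none' };
}

const archivedFilter: SnapshotFilterConfig = {
  sql: 'archived = false',
  mongo: { archived: false }
};

console.log(resolveFilter(archivedFilter, 'mysql'));
console.log(resolveFilter(archivedFilter, 'mongodb'));
```

Keeping the two keys separate means a SQL string is never handed to MongoDB, at the cost of having to keep both filters in agreement by hand.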

rkistner · 2026-02-12T10:39:15Z

Here is another example that highlights issues with per-definition filters:

bucket_definitions:
  inactive_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'inactive'"
  
  active_users:
    data:
      - SELECT * FROM users
    source_tables:
      users:
        snapshot_filter: "status = 'active'"

To someone not that familiar with the replication process, it may appear that these filters are applied per-definition. In reality, the filters are applied globally, and both definitions will end up with the same data. It is this mismatch between where the filters are specified and how they are used that can cause confusion and potential issues.

> Could [the schema] be solved by just using the table key as the full reference?

Yes, that should work; we just have to check for edge cases, such as special characters in the table name.

> I think explicit table names will be more clear, won't it?

Since we support wildcard table names in queries, we should support the same for the filters. A specific use case example is Postgres partitioned tables.
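
A minimal sketch of wildcard lookup for filter keys follows. It assumes the wildcard form is a trailing `%` that matches any table name starting with the given prefix; that assumption, the precedence rule, and the function names are illustrative and not confirmed by this thread.

```typescript
// Assumption: a pattern ending in `%` matches any table whose name
// starts with the part before the `%`. Other patterns must match exactly.
function matchesTablePattern(pattern: string, table: string): boolean {
  if (pattern.endsWith('%')) {
    return table.startsWith(pattern.slice(0, -1));
  }
  return pattern === table;
}

// Exact keys win over wildcard keys; among wildcards, the longest pattern wins.
function findFilterForTable(filters: Record<string, string>, table: string): string | undefined {
  if (Object.prototype.hasOwnProperty.call(filters, table)) {
    return filters[table];
  }
  let best: string | undefined;
  for (const pattern of Object.keys(filters)) {
    if (pattern.endsWith('%') && matchesTablePattern(pattern, table)) {
      if (best === undefined || pattern.length > best.length) best = pattern;
    }
  }
  return best === undefined ? undefined : filters[best];
}

// Hypothetical partitioned table: events_2025, events_2026, ...
const partitionFilters: Record<string, string> = {
  'events_%': "created_at > '2026-01-01'",
  events_2025: 'false'
};

console.log(findFilterForTable(partitionFilters, 'events_2026'));
console.log(findFilterForTable(partitionFilters, 'events_2025'));
```

With a rule like this, a single entry can cover every partition of a partitioned table while still allowing a per-partition override.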

> I based this on your suggestion that the snapshot filter clause will just be forwarded raw. I would like to add support for joins, though. Are you foreseeing any issues with this approach of forwarding the filter as is?

Yes, the plan is to just forward it as-is for the most part.

Technically you could use subqueries in the snapshot filters if we forward as-is, for example: [WHERE] user_id NOT IN (SELECT user_id FROM deleted_users). However, I'd discourage this in most cases - the same subqueries or joins (if we did add support) won't work in the binlog streaming. So the only thing you can use it for is for optimizing the data that is loading initially, and I'd recommend rather denormalizing data to do that.
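
To make "forward it as-is" concrete, here is a small sketch of how a snapshot query might be assembled for MySQL. The function names are hypothetical; the only fixed point taken from the discussion is that the filter text is passed through unchanged as the WHERE clause.

```typescript
// Quote a MySQL identifier with backticks, doubling any embedded backtick.
function quoteMySqlIdentifier(identifier: string): string {
  return '`' + identifier.replace(/`/g, '``') + '`';
}

// Build the snapshot SELECT. The filter is inserted verbatim, so it must
// come from trusted configuration, never from end-user input.
function buildSnapshotQuery(schema: string, table: string, filter?: string): string {
  const from = `${quoteMySqlIdentifier(schema)}.${quoteMySqlIdentifier(table)}`;
  const base = `SELECT * FROM ${from}`;
  const trimmed = filter?.trim();
  // Parentheses keep a filter containing OR from changing the query's meaning
  // if further conditions are ever appended.
  return trimmed ? `${base} WHERE (${trimmed})` : base;
}

console.log(buildSnapshotQuery('app', 'users', "status = 'active'"));
console.log(buildSnapshotQuery('app', 'users', 'user_id NOT IN (SELECT user_id FROM deleted_users)'));
```

The second call shows why a subquery works here: the database evaluates it during the snapshot. Nothing equivalent exists when rows arrive through the binlog, which is the limitation described above.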

Supporting JOINs in sync streams is something that we're working towards (with limitations), but that is different from using JOINs in the replication process.

> We can use the same filter configuration for both initial snapshot and logical replication, which will provide consistency.

Postgres row filters for logical replication must be configured in the source database - we cannot do that when reading the stream. There is a case to be made for configuring these filters automatically based on the sync config, but that's a bigger change.

So since those row filters must already be configured on the source database, I think it's best to just re-use the same ones for the snapshot queries, rather than duplicating in sync rules. That would also match how Postgres itself does logical replication.

MySQL and MongoDB don't have this same functionality, so there it does make more sense to configure it in the sync config.

raffidahmad · 2026-02-12T11:49:32Z

> To someone not that familiar with the replication process, it may appear that these filters are applied per-definition. In reality, the filters are applied globally, and both definitions will end up with the same data. It is this mismatch between where the filters are specified and how they are used that can cause confusion and potential issues.

Ah, I see it now. For phase 1 we can keep them global and revisit more granular control in a separate issue. Does that sound okay?

> Yes, that should work; we just have to check for edge cases, such as special characters in the table name.

Okay, I will look into it.

> Since we support wildcard table names in queries, we should support the same for the filters. A specific use case example is Postgres partitioned tables.

Got it

> Technically you could use subqueries in the snapshot filters if we forward as-is, for example: [WHERE] user_id NOT IN (SELECT user_id FROM deleted_users). However, I'd discourage this in most cases - the same subqueries or joins (if we did add support) won't work in the binlog streaming. So the only thing you can use it for is for optimizing the data that is loading initially, and I'd recommend rather denormalizing data to do that.

I would assume the key name "snapshot_filter" is self-explanatory: it applies only to initial replication.
Alternatively, we could call it "initial_snapshot_filter"?

> Supporting JOINs in sync streams is something that we're working towards (with limitations), but that is different from using JOINs in the replication process.

I will leave JOIN support out for now since we can use sub-queries.

> So since those row filters must already be configured on the source database, I think it's best to just re-use the same ones for the snapshot queries, rather than duplicating in sync rules. That would also match how Postgres itself does logical replication.
>
> MySQL and MongoDB don't have this same functionality, so there it does make more sense to configure it in the sync config.

Fair point, leaving it out for this PR then.

raffidahmad · 2026-02-12T13:44:14Z

With a global approach:

initial_snapshot_filters:
  users:
    sql: "status = 'active'"

bucket_definitions:
  inactive_users:
    data:
      - SELECT * FROM users WHERE status = 'inactive'

The inactive_users bucket would be empty initially (or only contain users who became inactive after snapshot via CDC), which is confusing.

Option 1: Union All Bucket Queries (Safest)
Automatically snapshot rows matching ANY bucket's WHERE clause.

Option 2: Explicit Global Filter
User accepts that inactive_users bucket won't have historical data, only new transitions.
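
The difference between the two options can be simulated in memory. The sketch below uses plain predicates as stand-ins for the SQL conditions in the example above; the row data and all names are made up for illustration.

```typescript
interface UserRow {
  id: number;
  status: 'active' | 'inactive';
}
type RowPredicate = (row: UserRow) => boolean;

const sourceRows: UserRow[] = [
  { id: 1, status: 'active' },
  { id: 2, status: 'inactive' },
  { id: 3, status: 'inactive' }
];

// Stand-ins for: initial_snapshot_filters.users.sql = "status = 'active'"
// and the inactive_users bucket's WHERE status = 'inactive'.
const globalFilter: RowPredicate = (r) => r.status === 'active';
const inactiveUsersBucket: RowPredicate = (r) => r.status === 'inactive';

// The snapshot copies only rows that pass the given filter.
function snapshotWith(filter: RowPredicate): UserRow[] {
  return sourceRows.filter(filter);
}

// Option 1: snapshot every row that at least one bucket query would select.
function unionOfBuckets(buckets: RowPredicate[]): RowPredicate {
  return (row) => buckets.some((b) => b(row));
}

// Option 2: explicit global filter, then evaluate the bucket on what was copied.
const option2Bucket = snapshotWith(globalFilter).filter(inactiveUsersBucket);
const option1Bucket = snapshotWith(unionOfBuckets([inactiveUsersBucket])).filter(inactiveUsersBucket);

console.log(option2Bucket.length, option1Bucket.length); // 0 2
```

Under Option 2 the bucket starts empty because no inactive row was ever copied; under Option 1 both inactive rows are present, but the snapshot no longer honours the user's explicit filter.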

rkistner · 2026-02-12T13:52:23Z

> The inactive_users bucket would be empty initially (or only contain users who became inactive after snapshot via CDC), which is confusing.

Yes, that is still a potential source of confusion, and I'd consider this an "advanced"/"use with caution" feature for that reason.

The main advantage of the global filter approach is that it makes it clear filters are applied to all definitions, not only some of them.

I'd recommend this syntax:

source_tables:
  users:
    mysql_snapshot_filter: "status = 'active'"

definitions:
  ...

Explicitly including "mysql" in the key helps to highlight that it only applies to mysql right now, and the syntax is mysql-specific.
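
Assuming the config is parsed into a plain object, reading the recommended top-level `source_tables` section might look like the sketch below. The interfaces and helper names are hypothetical; only the `source_tables` and `mysql_snapshot_filter` keys come from the suggestion above.

```typescript
// Hypothetical parsed form of the top-level `source_tables` section.
interface SourceTableOptions {
  mysql_snapshot_filter?: string;
}
interface SyncConfig {
  source_tables?: Record<string, SourceTableOptions>;
}

// Look up the MySQL snapshot filter for a table; blank filters count as absent.
function mysqlSnapshotFilter(config: SyncConfig, table: string): string | undefined {
  const filter = config.source_tables?.[table]?.mysql_snapshot_filter?.trim();
  return filter ? filter : undefined;
}

const KNOWN_KEYS = ['mysql_snapshot_filter'];

// Report unknown keys so a misspelled filter key does not silently
// result in the whole table being replicated.
function unknownSourceTableKeys(raw: Record<string, Record<string, unknown>>): string[] {
  const problems: string[] = [];
  for (const table of Object.keys(raw)) {
    for (const key of Object.keys(raw[table] ?? {})) {
      if (KNOWN_KEYS.indexOf(key) === -1) problems.push(`${table}.${key}`);
    }
  }
  return problems;
}

const parsedConfig: SyncConfig = {
  source_tables: { users: { mysql_snapshot_filter: "status = 'active'" } }
};

console.log(mysqlSnapshotFilter(parsedConfig, 'users'));
console.log(unknownSourceTableKeys({ users: { mysql_snapshot_fliter: 'x' } }));
```

Because the filter lives outside any bucket definition, there is a single lookup per table, which matches how the snapshot actually applies it.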

raffidahmad · 2026-02-12T14:14:18Z

@rkistner Please review now; I've updated the PR.

I will add support for the other sources and mongo storage after that.

rkistner · 2026-02-12T14:39:47Z

Thanks, I think we're converging on a good approach now.

I'll only be able to thoroughly review next week or the week after.

Enables filtering of rows during the initial snapshot phase of binlog replication, based on a configurable SQL WHERE clause. This allows for partial snapshots, replicating only a subset of data based on specified criteria, which is particularly useful for large tables or scenarios where only recent data is needed. The commit also includes tests to verify the functionality of snapshot filtering, including handling of CDC changes and multiple bucket filters. Currently supported only for the MySQL source with PostgreSQL storage.

Addresses an issue where the closing curly brace was misplaced, potentially preventing the filter from being applied. Removes an obsolete test case.

raffidahmad · 2026-02-13T08:02:27Z

Added support for all sources and for both MongoDB and PostgreSQL storage.
Tested locally with a MySQL source and PostgreSQL storage (not extensively; I will continue to use it and push any fixes if required).

Explains that widening filters after the initial snapshot requires a resnapshot or backfill to include previously excluded rows. Also updates the summary of initial snapshot filters to reflect the same.

raffidahmad force-pushed the feature/partial-initial-replication branch from 881e732 to 6810fb0 Compare February 11, 2026 21:23

raffidahmad force-pushed the feature/partial-initial-replication branch 2 times, most recently from 76c9c38 to f4385e1 Compare February 12, 2026 14:12

raffidahmad changed the title ~~Adds snapshot filtering for binlog replication~~ Adds support for filtering data during initial replication Feb 13, 2026

raffidahmad added 2 commits February 13, 2026 04:08

hotfix: Typo fixes

be0c7cd


raffidahmad force-pushed the feature/partial-initial-replication branch from 7a0a709 to 4c14793 Compare February 13, 2026 04:08

feat: implement global initial_snapshot_filters for all database modules

e5c9853

raffidahmad force-pushed the feature/partial-initial-replication branch from 55b1035 to e5c9853 Compare February 13, 2026 08:01

raffidahmad added 2 commits February 13, 2026 10:21

Clarifies filter update process in docs.

f19190e


Merge branch 'main' into feature/partial-initial-replication

3e993e1

raffidahmad wants to merge 5 commits into powersync-ja:main from raffidahmad:feature/partial-initial-replication


3 participants
