[SPEC | CORE] : Allow table level override for scan planning#14867
[SPEC | CORE] : Allow table level override for scan planning#14867singhpk234 wants to merge 2 commits intoapache:mainfrom
Conversation
a04195b to
474e982
Compare
|
while the changes make sense to me, we may actually want to discuss this in the broader community to decide whether we want to override server-side scan planning at the table level |
3144cd9 to
f68fbe2
Compare
10123db to
7f2a05b
Compare
|
Thanks for raising this @singhpk234! I like the direction here. As a user today, there are two modes either use scan planning or not. Which begs the question, when should I use one versus the other? And right now, there is no clear insight or story from the user's perspective. Now from a catalog's perspective, the modes make sense. For instance, If the catalog is using planning to enforce governance, the |
|
Thanks for the feedback @geruh !
Optional in this context is that the catalog really doesn't have an opinion on what the client decides, it can choose local and remote, the way i was thinking is lets say you are running a lot of concurrent queries in your spark cluster and your driver is slim, even though, we are using spark, we may prefer spark. That being said yes in this impl what i did if the catalog supports plan endpoint and the catalog doesn't have any opinion on this, in java impl we always do scan planning, yes being able to toggle this based on client side config would be ideal may be when the server sends optional from the server side, and from the client side we have configured required we should not allow overwritting the key to optionals and reuse the optional ? WDYT I see @RussellSpitzer has similar feedback in ML thread too, let me take a deeper look on their feedback and respond there as well |
Decision matrix : scan planning mode (required | optional | none) :
Decision matrix : scan planning mode(client only | client preferred | catalog preferred | catalog only)
|
|
I wasn't thinking about it quite that way. I was assuming the client is configured independently of the catalog. The client can either have a user preference or none. If none, it does whatever the catalog feeds back to it if the client supports that mode. Otherwise it does what is manually specified. So Client (None) -> Follow Catalog Config (Client Only, Client Preferred, CatalogPreferred or Catalog Only), Fail if the client doesn't support config A user without a preference or who wants the Catalog to make the determination just leaves this unset on the client. Or a Client which wants to override the catalog can set a specific mode and fail fast if the Catalog doesn't support it. |
9b1adae to
66eb57a
Compare
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
860ba21 to
1f9bdcb
Compare
core/src/main/java/org/apache/iceberg/rest/ScanPlanningNegotiator.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
open-api/rest-catalog-open-api.py
Outdated
| - **Both PREFERRED**: When both are PREFERRED (different types), client config wins | ||
| - **Both same**: When both have the same value, use that planning type | ||
| - **Only one configured**: Use the configured side (client or server) | ||
| - **Neither configured**: Use default (`client-preferred`) |
There was a problem hiding this comment.
I have a concern about some catalogs starting to make every table CATALOG_ONLY, which would essentially lock users to the catalog without providing a way to migrate the data to another catalog.
Maybe we add a sentence in the spec to enforce, that there should be some users where the catalog MUST provide access to the metadata files.
WDYT?
There was a problem hiding this comment.
the catalog without providing a way to migrate the data to another catalog
we would still have a way to migrate, mostly because in the loadTable we give back the metadata.json pointer (which is self describing the table state), and its the catalog ADMIN would be able to use that pointer and register table to another REST or Metastore backed catalog. In the model where storage is decoupled from compute its the administrator of the catalog who has given access the catalog to vend storage creds and it can very well take it back.
This feature is mostly like i want to read the table, can you help me with the data | delete files that corresponds to the table. Nevertheless i believe CATALOG_ONLY we think to be used primarily for gov cases also for things like scanning huge tables where planning can cause a lot of pressure on JVM (trino coordinar unstablity | spark requiring distributed planning) where catalog can do some efficient indexing (stuff like Redis) etc to help these engine.
All in all IMHO i believe vendor lock in and not being able to migrate would not be possible by exposing this option, please let me know if i am missing something.
There was a problem hiding this comment.
The specification states:
The corresponding file location of table metadata should be returned in the `metadata-location` field
However, it does not specify that this location must be readable by any user. (Perhaps this is something we should clarify going forward.)
Before the introduction of CATALOG_ONLY tables, users were required to have direct read access to the metadata files in order to plan queries on the table. That implied an access requirement, even though it was never explicitly documented. With the introduction of CATALOG_ONLY, this implicit requirement no longer applies, and we currently do not have an explicit requirement defined in the specification either.
885c41e to
6f20b79
Compare
6f20b79 to
5800b45
Compare
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java
Outdated
Show resolved
Hide resolved
| String clientModeConfig = properties().get(RESTCatalogProperties.SCAN_PLANNING_MODE); | ||
| String serverModeConfig = tableConf.get(RESTCatalogProperties.SCAN_PLANNING_MODE); | ||
|
|
||
| // Validate that client and server configs don't conflict |
There was a problem hiding this comment.
We have a choice here. We can either hard fail on conflicting configs like is done below or we can just let one override the other.
I think the least surprising thing from a client perspective is to just override using the client config in case of conflicts. On average I don't expect people to be fiddling with this config. In the chance they do, they're probably intentionally trying to set it and we should attempt to respect that. Even if there's some FGAC and creds and all aren't returned and it fails later on, that's better to me.
I guess I could go either way here (as long as we log there's a conflict and we're overriding using some mechanism), but I think any approach seems better than just hard failing since that seems unneccessary.
There was a problem hiding this comment.
I think the least surprising thing from a client perspective is to just override using the client config in case of conflicts
It can cause an issue in a sense that let say i am low resource client i can't run scan planning at all, so i say use catalog and the catalog says do client ? i believe if we let catalog judgement win it may lead to failure ?
But i think other way to look at it we any way override the client config with the server one if server says for some configs we can do the same then here ?
I am also open to both !
5800b45 to
607e44f
Compare
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
Simplified scan planning configuration from boolean to 2-mode enum: - client (default): Use client-side scan planning - catalog: Use server-side scan planning if supported Changes: - RESTCatalogProperties: Added ScanPlanningMode enum with CLIENT and CATALOG - RESTSessionCatalog: Updated to check table config for scan planning mode - Updated OpenAPI specs with simplified documentation The mode can be configured per-table via LoadTableResponse.config() to allow fine-grained control over which tables use server-side planning.
a5cf1a5 to
3f2bce3
Compare
3f2bce3 to
e886045
Compare
About the change
Scan Planning Modes
Single enum ScanPlanningMode with 2 values:
client- MUST use client-side planningserver- MUST use server-side planningML : https://lists.apache.org/thread/z1g4y8b4ogdrn0jjtjlgg7yjgxdbzpvg