Introduce whitelisted folders for external data validation#27352
Introduce whitelisted folders for external data validation#27352yuslepukhin wants to merge 5 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for loading TensorProto external data from user-configured “whitelisted” directories (in addition to the model directory), addressing the prior security hardening that restricted external data to the model folder.
Changes:
- Introduces a new C/C++ SessionOptions API to configure semicolon-separated whitelisted external-data directories.
- Adds parsing/validation logic for whitelist paths and extends external data path validation to allow matches under whitelisted folders.
- Updates graph/session code paths and expands unit tests for whitelist behavior.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/ir/graph_test.cc | Updates call site for new Graph::ConvertInitializersIntoOrtValues signature. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds extensive tests for whitelist-aware path validation and whitelist parsing. |
| onnxruntime/core/session/provider_bridge_ort.cc | Removes shared-provider host bridge for external data path validation. |
| onnxruntime/core/session/ort_apis.h | Adds C API entrypoint declaration for SessionOptionsSetWhiteListedDataFolders. |
| onnxruntime/core/session/onnxruntime_c_api.cc | Registers the new C API function pointer and updates version asserts (currently problematic). |
| onnxruntime/core/session/inference_session.cc | Parses whitelist from SessionOptions and passes it into initializer conversion. |
| onnxruntime/core/session/abi_session_options.cc | Implements SessionOptionsSetWhiteListedDataFolders (currently rejects nullptr). |
| onnxruntime/core/providers/shared_library/provider_interfaces.h | Removes ProviderHost virtual for validating external data paths. |
| onnxruntime/core/providers/shared_library/provider_api.h | Removes shared-provider wrapper for external data path validation. |
| onnxruntime/core/graph/graph.cc | Extends initializer conversion to validate external paths against whitelist. |
| onnxruntime/core/framework/tensorprotoutils.h | Adds ParseWhiteListedPaths and extends ValidateExternalDataPath signature. |
| onnxruntime/core/framework/tensorprotoutils.cc | Implements whitelist parsing and whitelist-aware external data path validation. |
| onnxruntime/core/framework/session_options.h | Adds SessionOptions::whitelisted_data_folders storage. |
| include/onnxruntime/core/session/onnxruntime_cxx_inline.h | Adds C++ wrapper SessionOptions::SetWhiteListedDataFolders implementation. |
| include/onnxruntime/core/session/onnxruntime_cxx_api.h | Adds C++ wrapper SessionOptions::SetWhiteListedDataFolders declaration. |
| include/onnxruntime/core/session/onnxruntime_c_api.h | Adds public C API declaration/docs for SessionOptionsSetWhiteListedDataFolders (currently mismatched). |
| include/onnxruntime/core/graph/graph.h | Changes public Graph API signature for ConvertInitializersIntoOrtValues. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto unrelated_dir = std::filesystem::temp_directory_path() / "unrelated_PathValidationTest"; | ||
| std::filesystem::create_directories(unrelated_dir); | ||
| auto cleanup = [&]() { std::filesystem::remove_all(unrelated_dir); }; | ||
|
|
||
| std::vector<std::filesystem::path> whitelist = {whitelisted_dir_}; | ||
| auto escaping_location = std::filesystem::path("..") / "unrelated_PathValidationTest" / "data.bin"; | ||
| ASSERT_FALSE(utils::ValidateExternalDataPath(base_dir_, escaping_location, whitelist).IsOK()); | ||
|
|
||
| cleanup(); | ||
| } |
There was a problem hiding this comment.
This test uses an ASSERT_* before calling the cleanup lambda. If the assertion fails, the temporary directory won’t be removed. Prefer a scope guard/RAII cleanup (or do cleanup in TearDown) so the directory is always removed.
| auto another_dir = std::filesystem::temp_directory_path() / "another_PathValidationTest"; | ||
| std::filesystem::create_directories(another_dir); | ||
| auto cleanup = [&]() { std::filesystem::remove_all(another_dir); }; | ||
|
|
||
| auto relative_to_outside = std::filesystem::path("..") / "outside" / "data.bin"; | ||
| std::vector<std::filesystem::path> whitelist = {another_dir, outside_dir_}; | ||
|
|
||
| // Escapes base_dir_ but outside_dir_ (second whitelist entry) should match. | ||
| ASSERT_STATUS_OK(utils::ValidateExternalDataPath(base_dir_, relative_to_outside, whitelist)); | ||
|
|
||
| cleanup(); | ||
| } |
There was a problem hiding this comment.
Same issue here: cleanup() is after ASSERT_STATUS_OK, so a failure will leak the temporary directory. Use a scope guard or ensure cleanup runs unconditionally.
| @@ -1004,9 +1004,6 @@ struct ProviderHost { | |||
|
|
|||
| virtual bool Utils__HasExternalDataInMemory(const ONNX_NAMESPACE::TensorProto& ten_proto) = 0; | |||
|
|
|||
There was a problem hiding this comment.
Removing Utils__ValidateExternalDataPath from ProviderHost eliminates the only ValidateExternalDataPath entrypoint available to shared-provider code (tensorprotoutils.h excludes it under SHARED_PROVIDER). If bridge/shared EPs still need to validate external data locations (per PR #26776), consider keeping this API and updating it to accept the whitelist parameter instead of removing it.
| // Validate that any external data referenced by the given TensorProto resides within the | |
| // provided whitelist of allowed locations. Implementations should return a non-OK Status | |
| // if any external data path is outside the whitelist. | |
| virtual Status Utils__ValidateExternalDataPath( | |
| const ONNX_NAMESPACE::TensorProto& tensor_proto, | |
| const std::vector<std::filesystem::path>& allowed_locations) = 0; |
Description
Many customers reported that they prefer to store external data in locations other than model folder PR
Previous security change disabled that possibility. PR #26776.
This PR introduces a new API that sets whitelisted folders option. Data stored under those folders or their subfolders would still be allowed.
Motivation and Context
qdrant/fastembed#603