feat: overhaul data export from nominal#398

Open
drake-nominal wants to merge 4 commits into main from deidukas/export-data

Conversation

drake-nominal (Contributor) commented Jun 27, 2025

Description:

Allow users to call get_read_stream on a DataSource.

Currently, I only support streaming pandas dataframes, but leave the framework in place for adding other export types (e.g. pyarrow tables, polars dataframes, raw protos, etc.).
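The "framework for other export types" pattern can be sketched as a generic stream that pairs a batch source with a pluggable converter. This is a minimal illustration, not the PR's actual implementation; the ExportStream internals and the to_upper stand-in converter here are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ExportStream(Generic[T]):
    """Lazily yields export batches converted to the requested type T."""

    _batches: Iterator[bytes]
    _convert: Callable[[bytes], T]

    def __iter__(self) -> Iterator[T]:
        # Conversion happens per batch, so a pandas converter today and a
        # pyarrow/polars converter later can plug in without touching the stream.
        for raw in self._batches:
            yield self._convert(raw)


def to_upper(raw: bytes) -> str:
    """Stand-in converter for the sketch; a real one would build a DataFrame."""
    return raw.decode().upper()


stream = ExportStream(iter([b"abc", b"def"]), to_upper)
print(list(stream))  # ['ABC', 'DEF']
```

Keeping the converter as the only type-specific piece is what makes the "add polars later" extension cheap.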

I did some benchmarking on my machine using a sample dataset containing ten minutes of very high-rate data.
From these benchmarks, I determined that with proper tuning, downloads run significantly (~30%) faster using process pools rather than thread pools, and come much closer to saturating my network connection over the course of the download. I suspect this is due to the gzipping at play.

As a result, I let users configure everything: by default, we use a single-threaded thread pool to perform requests, but users can opt in to upgrading their stream to an N-worker process pool if they'd like improved performance.
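The opt-in pool selection described above can be sketched with the standard library's executors. The PoolType and DEFAULT_POOL_TYPE names appear in the diff below; the make_executor helper and its signature are my own illustration, not the PR's API.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from enum import Enum


class PoolType(Enum):
    THREAD = "thread"
    PROCESS = "process"


DEFAULT_POOL_TYPE = PoolType.THREAD


def make_executor(pool_type: PoolType = DEFAULT_POOL_TYPE, workers: int = 1) -> Executor:
    """Default to a single-worker thread pool; callers opt in to processes."""
    if pool_type is PoolType.PROCESS:
        # Processes sidestep the GIL during gzip decompression, which is
        # where the ~30% speedup in the benchmarks likely comes from.
        return ProcessPoolExecutor(max_workers=workers)
    return ThreadPoolExecutor(max_workers=workers)


with make_executor() as pool:
    results = list(pool.map(len, ["abc", "de"]))
print(results)  # [3, 2]
```

The conservative default matters: process pools require picklable work items and have higher startup cost, so they should be a deliberate choice rather than the baseline.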

Benchmarks:

Test with 254,582,023 points (nans=0, non-nans=254,582,023) across 108 channels.

Both pool types deliver roughly 2 million points/second over a ~100 Mbps connection.

Process pool (16 workers, 5_000_000 points per request)

  • Total time: ~110-130s

Thread pool (16 workers, 5_000_000 points per request)

  • Total time: 120-140s

Thread pool (32 workers, 1_000_000 points per request)

  • Total time: ~5 minutes

Thread pool (16 workers, 10_000_000 points per request)

  • Total time: 110s

Thread pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)

  • Total time: 120-130s

Process pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)

  • Total time: 90s

drake-nominal requested a review from alkasm on June 27, 2025 02:43
drake-nominal force-pushed the deidukas/export-data branch from 3a28df9 to 4b93ea5 on June 30, 2025 20:09
drake-nominal force-pushed the deidukas/export-data branch 2 times, most recently from 26ad9b8 to e80fcba on July 1, 2025 13:45
drake-nominal force-pushed the deidukas/export-data branch from e80fcba to 514d817 on July 1, 2025 13:59
pool_type: PoolType = DEFAULT_POOL_TYPE,
) -> ExportStream[pd.DataFrame]: ...

def get_read_stream(

Reviewer (Contributor):
I'm hesitant to name this as a "stream" as it's not really using a streaming API.

Comment on lines +221 to +224
@dataclasses.dataclass(frozen=True, unsafe_hash=True, order=True)
class TimeRange:
start_time: IntegralNanosecondsUTC
end_time: IntegralNanosecondsUTC

Reviewer (Contributor):
frozen, hashable, orderable - probably more direct to use a named tuple here!
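The reviewer's suggestion works because named tuples get immutability, hashing, and lexicographic ordering from plain tuple semantics. A minimal sketch, with the IntegralNanosecondsUTC alias assumed to be an int as in the surrounding module:

```python
from typing import NamedTuple

IntegralNanosecondsUTC = int  # alias assumed from the surrounding module


class TimeRange(NamedTuple):
    """Frozen, hashable, and orderable for free via tuple semantics."""

    start_time: IntegralNanosecondsUTC
    end_time: IntegralNanosecondsUTC


a = TimeRange(0, 10)
b = TimeRange(5, 15)
print(a < b, hash(a) == hash(TimeRange(0, 10)))  # True True
```

This replaces the dataclass flags (frozen=True, unsafe_hash=True, order=True) with behavior the tuple already has, and adds index/unpacking access as a bonus.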

Comment on lines +248 to +250
# Mapping of channel names to their respective datasource rids
# channel_sources: Mapping[str, str]
channels: Sequence[Channel]

Reviewer (Contributor):
comment says mapping but it's a sequence

Comment on lines +459 to +460
sub_offset = datetime.timedelta(seconds=self._points_per_request / channel_rate)
sub_offset_ns = int(sub_offset.total_seconds() * 1e9)

Reviewer (Contributor):
nit but the intermediate type doesn't really do anything for us
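The nit is that the timedelta round-trip adds nothing: the nanosecond offset can be computed directly from the same inputs. A small sketch with illustrative values for points_per_request and channel_rate:

```python
import datetime

points_per_request = 5_000_000
channel_rate = 1_000  # points per second, illustrative

# Original: round-trips through a timedelta intermediate.
sub_offset = datetime.timedelta(seconds=points_per_request / channel_rate)
via_timedelta = int(sub_offset.total_seconds() * 1e9)

# Suggested: compute nanoseconds directly from the ratio.
direct = int(points_per_request / channel_rate * 1e9)

print(via_timedelta == direct)  # True
```

Beyond brevity, skipping the intermediate avoids timedelta's microsecond-granularity rounding, which could matter at nanosecond precision.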
