feat: overhaul data export from nominal#398

Open
drake-nominal wants to merge 4 commits into main from deidukas/export-data

Conversation

drake-nominal (Contributor) commented Jun 27, 2025

Description:

Allow users to call get_read_stream on a DataSource.

Currently, I only support streaming pandas dataframes, but leave the framework in place for adding other export types (e.g. pyarrow tables, polars dataframes, raw protos, etc.).
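The "framework for other export types" pattern can be sketched as a generic stream that pairs a batch source with a pluggable converter. This is a minimal illustration, not the PR's actual implementation; the ExportStream internals and the to_upper stand-in converter here are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ExportStream(Generic[T]):
    """Lazily yields export batches converted to the requested type T."""

    _batches: Iterator[bytes]
    _convert: Callable[[bytes], T]

    def __iter__(self) -> Iterator[T]:
        # Conversion happens per batch, so a pandas converter today and a
        # pyarrow/polars converter later can plug in without touching the stream.
        for raw in self._batches:
            yield self._convert(raw)


def to_upper(raw: bytes) -> str:
    """Stand-in converter for the sketch; a real one would build a DataFrame."""
    return raw.decode().upper()


stream = ExportStream(iter([b"abc", b"def"]), to_upper)
print(list(stream))  # ['ABC', 'DEF']
```

Keeping the converter as the only type-specific piece is what makes the "add polars later" extension cheap.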

I did some benchmarking on my machine using a sample dataset containing ten minutes of very high-rate data.
From these benchmarks, I determined that with proper tuning, downloads run significantly (~30%) faster using process pools rather than thread pools, and come much closer to saturating my network connection over the course of the download. I suspect this is due to the gzipping at play.

As a result, I let users configure everything: by default, we use a single-threaded thread pool to perform requests, but users can opt in to upgrading their stream to an N-worker process pool if they'd like improved performance.
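The opt-in pool selection described above can be sketched with the standard library's executors. The PoolType and DEFAULT_POOL_TYPE names appear in the diff below; the make_executor helper and its signature are my own illustration, not the PR's API.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from enum import Enum


class PoolType(Enum):
    THREAD = "thread"
    PROCESS = "process"


DEFAULT_POOL_TYPE = PoolType.THREAD


def make_executor(pool_type: PoolType = DEFAULT_POOL_TYPE, workers: int = 1) -> Executor:
    """Default to a single-worker thread pool; callers opt in to processes."""
    if pool_type is PoolType.PROCESS:
        # Processes sidestep the GIL during gzip decompression, which is
        # where the ~30% speedup in the benchmarks likely comes from.
        return ProcessPoolExecutor(max_workers=workers)
    return ThreadPoolExecutor(max_workers=workers)


with make_executor() as pool:
    results = list(pool.map(len, ["abc", "de"]))
print(results)  # [3, 2]
```

The conservative default matters: process pools require picklable work items and have higher startup cost, so they should be a deliberate choice rather than the baseline.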

Benchmarks:

Test with 254,582,023 points (nans=0, non-nans=254,582,023) across 108 channels.

Both pool types deliver roughly 2 million points/second over a ~100 Mbps connection.

Process pool (16 workers, 5_000_000 points per request)

  • Total time: ~110-130s

Thread pool (16 workers, 5_000_000 points per request)

  • Total time: 120-140s

Thread pool (32 workers, 1_000_000 points per request)

  • Total time: ~5 minutes

Thread pool (16 workers, 10_000_000 points per request)

  • Total time: 110s

Thread pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)

  • Total time: 120-130s

Process pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)

  • Total time: 90s

drake-nominal requested a review from alkasm on June 27, 2025 02:43
drake-nominal force-pushed the deidukas/export-data branch from 3a28df9 to 4b93ea5 on June 30, 2025 20:09
drake-nominal force-pushed the deidukas/export-data branch 2 times, most recently from 26ad9b8 to e80fcba on July 1, 2025 13:45
drake-nominal force-pushed the deidukas/export-data branch from e80fcba to 514d817 on July 1, 2025 13:59
pool_type: PoolType = DEFAULT_POOL_TYPE,
) -> ExportStream[pd.DataFrame]: ...

def get_read_stream(

Reviewer (Contributor):
I'm hesitant to name this as a "stream" as it's not really using a streaming API.

Comment on lines +221 to +224
@dataclasses.dataclass(frozen=True, unsafe_hash=True, order=True)
class TimeRange:
start_time: IntegralNanosecondsUTC
end_time: IntegralNanosecondsUTC

Reviewer (Contributor):
frozen, hashable, orderable - probably more direct to use a named tuple here!
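The reviewer's suggestion works because named tuples get immutability, hashing, and lexicographic ordering from plain tuple semantics. A minimal sketch, with the IntegralNanosecondsUTC alias assumed to be an int as in the surrounding module:

```python
from typing import NamedTuple

IntegralNanosecondsUTC = int  # alias assumed from the surrounding module


class TimeRange(NamedTuple):
    """Frozen, hashable, and orderable for free via tuple semantics."""

    start_time: IntegralNanosecondsUTC
    end_time: IntegralNanosecondsUTC


a = TimeRange(0, 10)
b = TimeRange(5, 15)
print(a < b, hash(a) == hash(TimeRange(0, 10)))  # True True
```

This replaces the dataclass flags (frozen=True, unsafe_hash=True, order=True) with behavior the tuple already has, and adds index/unpacking access as a bonus.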

Comment on lines +248 to +250
# Mapping of channel names to their respective datasource rids
# channel_sources: Mapping[str, str]
channels: Sequence[Channel]

Reviewer (Contributor):
comment says mapping but it's a sequence

Comment on lines +459 to +460
sub_offset = datetime.timedelta(seconds=self._points_per_request / channel_rate)
sub_offset_ns = int(sub_offset.total_seconds() * 1e9)

Reviewer (Contributor):
nit but the intermediate type doesn't really do anything for us
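The nit is that the timedelta round-trip adds nothing: the nanosecond offset can be computed directly from the same inputs. A small sketch with illustrative values for points_per_request and channel_rate:

```python
import datetime

points_per_request = 5_000_000
channel_rate = 1_000  # points per second, illustrative

# Original: round-trips through a timedelta intermediate.
sub_offset = datetime.timedelta(seconds=points_per_request / channel_rate)
via_timedelta = int(sub_offset.total_seconds() * 1e9)

# Suggested: compute nanoseconds directly from the ratio.
direct = int(points_per_request / channel_rate * 1e9)

print(via_timedelta == direct)  # True
```

Beyond brevity, skipping the intermediate avoids timedelta's microsecond-granularity rounding, which could matter at nanosecond precision.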
