Open
Conversation
alkasm
reviewed
Jun 27, 2025
3a28df9 to
4b93ea5
Compare
26ad9b8 to
e80fcba
Compare
e80fcba to
514d817
Compare
alkasm
reviewed
Jul 1, 2025
| pool_type: PoolType = DEFAULT_POOL_TYPE, | ||
| ) -> ExportStream[pd.DataFrame]: ... | ||
|
|
||
| def get_read_stream( |
Contributor
There was a problem hiding this comment.
I'm hesitant to name this as a "stream" as it's not really using a streaming API.
alkasm
reviewed
Jul 1, 2025
Comment on lines
+221
to
+224
| @dataclasses.dataclass(frozen=True, unsafe_hash=True, order=True) | ||
| class TimeRange: | ||
| start_time: IntegralNanosecondsUTC | ||
| end_time: IntegralNanosecondsUTC |
Contributor
There was a problem hiding this comment.
frozen, hashable, orderable - probably more direct to use a named tuple here!
Comment on lines
+248
to
+250
| # Mapping of channel names to their respective datasource rids | ||
| # channel_sources: Mapping[str, str] | ||
| channels: Sequence[Channel] |
Contributor
There was a problem hiding this comment.
comment says mapping but it's a sequence
Comment on lines
+459
to
+460
| sub_offset = datetime.timedelta(seconds=self._points_per_request / channel_rate) | ||
| sub_offset_ns = int(sub_offset.total_seconds() * 1e9) |
Contributor
There was a problem hiding this comment.
nit but the intermediate type doesn't really do anything for us
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description:
Allow users to
get_read_streamon aDataSource.Currently, I only support streaming pandas dataframes, but leave the framework available for adding other export types (e.g. pyarrow tables, polars, raw protos, etc.).
I did some benchmarking on my machine using a sample dataset featuring really high rate data for 10 minutes of data.
From these benchmarks, I was able to determine that with proper tuning, I was able to get my downloads to run significantly (~30%) faster by using process pools rather than thread pools, and was able to get much closer to saturating my network connection over the course of the download-- I suspect this is due to the gzipping at play.
As a result, I let the users configure everything-- by default, we use a single threaded threadpool to perform requests, but let users upgrade their stream to an N-worker process pool in a very opt-in way if they'd like to receive improved performance.
Benchmarks:
Test with 254,582,023 points (nans=0, non-nans=254,582,023) across 108 channels.
Both cases provide performance approx on par with 2 million points / second with ~100mbps internet.
Process pool (16 workers, 5_000_000 points per request)
Thread pool (16 workers, 5_000_000 points per request)
Thread pool (32 workers, 1_000_000 points per request)
Thread pool (16 workers, 10_000_000 points per request)
Thread pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)
Process pool (32 workers, 10_000_000 points/request, 300_000_000 points/batch)