@reese reese commented Jan 18, 2026

At the very beginning of each file, we construct a LineIndex to record the byte offset at which each line starts, and we also work out which lines contain Ruby code as opposed to only comments or whitespace. Each of these passes is expensive, since it iterates byte-by-byte over the entire source, and we currently do two passes where one would suffice.

This PR collapses those into a single pass. It also changes lines_with_ruby from a BTreeSet to a sorted Vec. A BTreeSet doesn't buy us much here: lookups are O(log n), the same as a Vec with binary_search, but every insertion is O(log n) as well, whereas pushing onto a Vec is amortized O(1). We could use a HashSet, but unless we're formatting 100k-line Ruby files, the hashing overhead seems to outweigh the difference in lookup performance, so a Vec is the better choice for almost all normal usage.
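
For illustration, here is a minimal sketch of the single-pass idea with a sorted Vec. It is not the code in this PR: `ScanResult`, `scan`, and `has_ruby` are hypothetical names, and the comment check (first non-whitespace byte is `#`) is a simplifying assumption that ignores things like `=begin`/`=end` blocks and heredocs.

```rust
/// Hypothetical result of the scan; names are illustrative only.
#[derive(Debug, Default)]
pub struct ScanResult {
    /// Byte offset at which each line starts (the first line starts at 0).
    pub line_starts: Vec<usize>,
    /// 1-based numbers of lines that contain code, in ascending order.
    pub lines_with_ruby: Vec<usize>,
}

/// Walks the source once, recording line starts and code lines together.
/// Assumption: a line "has Ruby" if its first non-whitespace byte is not `#`.
pub fn scan(source: &[u8]) -> ScanResult {
    let mut result = ScanResult {
        line_starts: vec![0],
        lines_with_ruby: Vec::new(),
    };
    let mut line = 1;
    // Whether we have already seen a non-whitespace byte on the current line.
    let mut seen_first = false;

    for (offset, &byte) in source.iter().enumerate() {
        match byte {
            b'\n' => {
                // The next line starts right after the newline.
                result.line_starts.push(offset + 1);
                line += 1;
                seen_first = false;
            }
            b' ' | b'\t' | b'\r' => {}
            _ if !seen_first => {
                seen_first = true;
                if byte != b'#' {
                    // Lines are visited in order, so pushing keeps the Vec sorted.
                    result.lines_with_ruby.push(line);
                }
            }
            _ => {}
        }
    }
    result
}

impl ScanResult {
    /// O(log n) membership test on the sorted Vec.
    pub fn has_ruby(&self, line: usize) -> bool {
        self.lines_with_ruby.binary_search(&line).is_ok()
    }
}
```

Calling `scan(b"# c\n\nputs 1\n")` and then `has_ruby(3)` returns true under these assumptions, while lines 1 and 2 (a comment and a blank line) are not recorded. Because line numbers are pushed in ascending order, the Vec never needs an explicit sort.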

This also pulls in the memchr crate, which uses SIMD-accelerated routines for byte searches. It seems to be a few times faster than a standard .iter() for our use case of searching for newline characters. memchr is already a transitive dependency of several crates we use, so this doesn't add much to compilation time.

Looking at profiling data, this was previously taking about 3% of total CPU time on average, and this PR cuts that at least in half on my laptop.

@froydnj froydnj left a comment

I was looking at this last night and was going to write up a Prism patch that exposes the newline vector, which will probably make this faster? But until we have that information, this is an excellent fix. (Prism release schedules are also not conducive to getting Rust crate changes out quickly.)

reese commented Jan 18, 2026

Yeah that would be an excellent addition to the Prism bindings -- it's always felt a little silly that we have to do this, so it'd be great to have that done during parsing for us.

@reese reese merged commit 5290e84 into trunk Jan 18, 2026
8 checks passed
@reese reese deleted the reese-comments-single-pass branch January 18, 2026 12:06