Conversation
Fixes JuliaData#464 (or at least improves it quite a bit). The new precompile.jl file is a script I ran to generate precompile statements for CSV.jl, Parsers.jl, and SentinelArrays.jl (which seem to be the biggest targets and the ones that live in JuliaData). The Parsers.jl output ended up being insignificant for now, so that wasn't committed, but the SentinelArrays.jl output was. I've added two new "precompile.csv" and "precompile_small.csv" files that are used for snooping; they include a column of each type, which should hopefully cover a good chunk of the codepaths we're compiling. We can adjust more later if there are certain paths that could use it and are causing people problems. All in all, this cuts TTFP (time-to-first-parse) in half on my machine.
Add some precompiles
Fixes JuliaData#668. The issue here is that when column names were passed manually, the code path that "skipped" to the datarow passed in the starting position as 1 instead of the `pos` variable. This used to not be an issue because `pos` was almost always 1 anyway. With `IOBuffer`, we now start `pos` at `io.ptr`, so we'll have more cases where it's critical to start reading at the right position.
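A minimal sketch of the affected case: an `IOBuffer` that has already been partially read, so its position is past byte 1 (input contents here are hypothetical).

```julia
using CSV

# an IOBuffer whose read position (io.ptr) is past byte 1
io = IOBuffer("# preamble to skip\n1,2\n3,4\n")
readline(io)                    # advance io.ptr past the first line

# with manually provided column names, parsing must start at io.ptr,
# not at byte 1 of the underlying buffer
f = CSV.File(io; header = [:a, :b])
```

With the fix, `f.a` contains `[1, 3]` rather than data parsed from the start of the buffer.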
When column names passed manually, ensure we respect starting position
I've wanted to do this for a while; previously we were only using the estimate from the first 10 rows. This hooks into the "chunking" code, which inspects `tasks` # of chunks of a file to find the start of rows for each; we now keep track of the # of bytes we saw while doing those row checks and use those totals, plus the original 10 rows, to form a better estimate of the total # of rows.
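The arithmetic behind the improved estimate can be sketched as follows (illustrative names and numbers only, not CSV.jl internals):

```julia
# combine bytes/rows seen in the initial 10-row sample with the bytes/rows
# observed while finding chunk boundaries, then scale to the whole file
sample_bytes = 420          # bytes covered by the first 10 rows (hypothetical)
sample_rows  = 10
chunk_bytes  = 2_150        # bytes scanned during the per-chunk row checks (hypothetical)
chunk_rows   = 40           # rows seen during those checks
total_bytes  = 1_000_000    # file size

avg_row_bytes  = (sample_bytes + chunk_bytes) / (sample_rows + chunk_rows)
estimated_rows = ceil(Int, total_bytes / avg_row_bytes)
```

Averaging over all sampled regions instead of just the first 10 rows makes the per-row byte estimate much less sensitive to an unrepresentative file header region.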
Improve accuracy of estimated rows for multithreaded parsing
Make the automatic pooled=>string column promotion more efficient
…r invalid rows
added documentation for the dateformats option
Fixes JuliaData#679; alternative fix to JuliaData#681. When a column is dropped, we essentially turn it into a `Missing` column type and ignore it when parsing. There was a check later in file parsing, however, that said if no missing values were found in a column, to ensure its type is `Vector{T}` instead of `Vector{Union{Missing, T}}`. The core problem in issue JuliaData#679 was that these dropped columns, while completely `missing`, didn't get "flagged" as having `missing` values.
Fixes JuliaData#680. Before custom types, the `typemap` keyword argument was really only about mapping between the standard, supported types. With custom types, we still only support certain type mappings (Int to Float, any type to String), but we also want to support type mappings like `Int64 => Int32`. This PR readjusts how typemap works when detecting column types to account for the possibility of custom Integer or AbstractFloat type mappings for Int64 & Float64, and moves directly to String if that's specified.
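A hedged sketch of the newly supported kind of mapping, using the `typemap` keyword with a custom-width integer target (input data is hypothetical):

```julia
using CSV

# map auto-detected Int64 columns to Int32 instead
io = IOBuffer("a,b\n1,2.5\n3,4.5\n")
f = CSV.File(io; typemap = Dict(Int64 => Int32))
```

Column `a` is detected as `Int64` but stored as `Int32` per the mapping; column `b` is unaffected.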
Ensure dropped columns are ignored in later file processing
Fix typemap for our new custom types world
Additional fix for JuliaData#679. There was an alternative issue discovered in JuliaData#679 where a type was given for a dropped column, like `CSV.File(file; types=Dict(:id=>Int32), select=["first"])`. During parsing, it knew we would drop the `id` column, so it parsed it as `Missing`, but then tried to "set" the value to a `Vector{Int32}` which threw an error. When dropping a column, we can basically ignore any provided type information instead, so in the `Header` routine, we make sure to mark the type of a column as `Missing` if it will be dropped.
Fix case when type is provided for dropped columns
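A small sketch of the case from JuliaData#679 that this fixes (data is hypothetical):

```julia
using CSV

# a type is provided for a column that is dropped via `select`
io = IOBuffer("first,id\nx,1\ny,2\n")
f = CSV.File(io; types = Dict(:id => Int32), select = ["first"])
# the dropped `id` column's provided type is ignored; no error is thrown
```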
Add docs for reading non-UTF-8 files
Fixes JuliaData/DataFrames.jl#2309. After much discussion, people really like the `CSV.read` function call naming, so it was decided to keep it around, while still allowing a break from DataFrames.jl as a dependency. I also realized that `CSV.read` does indeed provide one bit of value/functionality: we can wrap `CSV.File` in `Tables.CopiedColumns`. What this means is that columnar sinks can safely use the columns passed without needing to make copies; i.e. they can assume ownership of the columns. In `CSV.read`, the user is essentially saying, "I want to make a `CSV.File` and pass it directly to `sink`" which also implies that `CSV.File` doesn't need to "own" its own columns. The only question left in my mind is, with the 1.0 release, what to do with `CSV.read(file)` when no `sink` argument is passed. Suggestions have included just returning a `CSV.File`, since it's a valid table anyway. Or allowing DataFrames.jl to define the no-sink-arg version and return a `DataFrame`; that one's a bit awkward as a form of "blessed" type piracy, but could also be useful for users as convenience (they just have to remember to do `using DataFrames` before trying to use it). The other awkward part is that we're currently warning users of the deprecation and that they should explicitly spell out `DataFrame` whereas we might not actually require that.
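The retained call shape looks like this (sketch; `DataFrames` is the sink being discussed, input data is hypothetical):

```julia
using CSV, DataFrames

# CSV.read materializes the file directly into the sink; because the
# columns are wrapped in Tables.CopiedColumns, the sink may take
# ownership of them without making its own copies
df = CSV.read(IOBuffer("a,b\n1,2\n"), DataFrame)
```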
This is another optimization I've wanted to do for a while; we're already checking up to 5 rows from various locations in the file to find row boundaries for multithreaded parsing, so this just also samples the types from those locations. The end result is that we go into full parsing having already checked `tasks * 5` # of rows for column types. The 2 main advantages of adding this are that we can avoid running `detect` in all tasks in some cases, i.e. if we've already sampled that a column is `Int64`, then each task will immediately start parsing `Int64`s instead of having to `detect`, then `allocate`, etc. The other advantage is if we happen to catch a rogue string value in a part of the file, it can be promoted immediately instead of having to parse entire chunks as, say, `Int64`, and then re-parse later. It's not super robust in that case because we're only sampling 5 rows from each chunk, but it's better than nothing and doesn't add extra cost to what we're already doing. This PR also adds an optimization: when a `PooledString` column is promoted to `String`, we don't need to reparse, but can just allocate the `String` array and put the ref values in directly according to the codes.
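The pooled-to-`String` promotion trick can be sketched in isolation (hedged, self-contained illustration; `pool` and `codes` are hypothetical names, not CSV.jl internals):

```julia
# given the pool (code => string) and the per-row codes, the String column
# can be materialized directly, with no re-parsing of the underlying bytes
pool  = Dict{UInt32, String}(1 => "low", 2 => "high")
codes = UInt32[1, 2, 2, 1]
column = [pool[c] for c in codes]
```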
There was a case where we were still assuming `header` is always a vector; this led to a confusing error when passing `writeheader=false`.
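The previously failing call shape, sketched with hypothetical data:

```julia
using CSV

# suppress the header row entirely rather than supplying header names
io = IOBuffer()
CSV.write(io, (a = [1, 2], b = [3, 4]); writeheader = false)
out = String(take!(io))   # only the two data rows, no header line
```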
* Update to 0.10.14
* Update julia setup action
* Add dependabot

* finalize memory if 1.11
* don't download busybox in test
* test windows on lts

* fix decchar handling in writecell() for AbstractFloat
* test for JuliaData#1109 fix decchar handling in writecell() for AbstractFloat
* fix newline format

Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>

* fix INT128_MIN write
* add write INT128_MIN test
* use Base.uabs
* add BigInt test

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>

* Bump codecov/codecov-action from 4 to 5

  Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.
  - [Release notes](https://github.com/codecov/codecov-action/releases)
  - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
  - [Commits](codecov/codecov-action@v4...v5)

* Update .github/workflows/ci.yml

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Co-authored-by: Chengyu Han <cyhan.dev@outlook.com>
I gotta be honest that I don't love this approach. It feels like we could restructure in a different way where the repo structure doesn't feel convoluted. I'd be open to generating some of the larger files on the fly, or compressing, or hosting them separately. I think it should be much simpler to shuffle a few things around and still get 95% of the benefit.
Moving files elsewhere (another branch or repository) seems more convoluted than moving files around in here. The new repository structure mimics a monorepo, which is adopted by other packages, e.g.: I don't think having a monorepo structure with a single package ( Generating
I just personally dislike the idea of having to restructure the repo via a subdir package vs. just fixing the problem directly. I've restructured things on the
Whatever reduces the downloaded files is fine. I don't think it's very developer-friendly to have to download the test files and clean them up afterwards.
This is a request to restructure the repository in order to avoid shipping `testfiles` (about 29MB) when installing `CSV.jl` (similar to what is proposed in jheinen/GR.jl#575 or JuliaGraphics/ColorSchemes.jl#154).

So we move the `src` and `test` directories to a dedicated dir named `CSV`: only those files will be downloaded when a user installs the `CSV` package through `Pkg`, thus saving bandwidth. Xref: JuliaLang/Pkg.jl#3798.

This will need a corresponding PR to https://github.com/JuliaRegistries/General/blob/master/C/CSV/Package.toml, adding a `subdir = "CSV"` line.

Tests pass locally, and building docs also.

I've also refreshed the `ci.yml` action.

Fixes #1177.
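For reference, the registry change would amount to one added line in the package's `Package.toml` (a sketch of the file's shape only; the existing entries are illustrative, not copied from the General registry):

```toml
# C/CSV/Package.toml in the General registry (existing entries elided)
name = "CSV"
repo = "https://github.com/JuliaData/CSV.jl.git"
subdir = "CSV"   # new: tells Pkg to install from the CSV/ subdirectory
```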