
restructure repository #1176

Closed
t-bltg wants to merge 790 commits into JuliaData:main from t-bltg:move-testfiles

Conversation

@t-bltg commented Jan 10, 2026

This is a request to restructure the repository in order to avoid shipping test files (about 29 MB) when installing CSV.jl (similar to what is proposed in jheinen/GR.jl#575 or JuliaGraphics/ColorSchemes.jl#154).

$ cd .julia/packages
$ du -sh * | sort -h
[...]
11M	MathTeXEngine
23M	ColorSchemes
26M	GR
29M	CSV  # <== this is too much !
94M	DelaunayTriangulation

So we move the src and test directories into a dedicated directory named CSV: only those files will be downloaded when a user installs the CSV package through Pkg, thus saving bandwidth.
Xref: JuliaLang/Pkg.jl#3798.

This will need a corresponding PR to https://github.com/JuliaRegistries/General/blob/master/C/CSV/Package.toml, adding a subdir = "CSV" line.
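For reference, the registry change would look something like the following (a sketch only; the existing `Package.toml` fields are kept as-is and the uuid is elided, not a real value):

```toml
# C/CSV/Package.toml in JuliaRegistries/General (sketch; existing fields unchanged)
name = "CSV"
uuid = "..."   # unchanged, elided here
repo = "https://github.com/JuliaData/CSV.jl.git"
subdir = "CSV" # the new line: Pkg then fetches only this subdirectory's tree
```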

Tests pass locally, and the docs build as well.
I've also refreshed the ci.yml workflow.

Fixes #1177.

quinnj and others added 30 commits June 27, 2020 11:13
Fixes JuliaData#464 (or at least improves it quite a bit).

A new precompile.jl file is a script I ran to get some precompile
statements for CSV.jl, Parsers.jl, and SentinelArrays.jl (which seem to
be the biggest targets and ones that live in JuliaData). The Parsers.jl
output ended up being insignificant for now, so that wasn't committed,
but SentinelArrays.jl was. I've added two new "precompile.csv" and
"precompile_small.csv" files that are used for snooping; they include a
column of each type, which should hopefully cover a good chunk of
codepaths we're compiling. We can adjust more later if there are certain
paths that could use it and are causing people problems.

All in all, this cuts TTFP (time-to-first-parse) in half on my machine.
Fixes JuliaData#668. The issue here is that when column names were passed
manually, the code path that "skipped" to the datarow passed in the
starting position as 1 instead of `pos` variable. This used to not be an
issue because the `pos` was almost always 1 anyway. With `IOBuffer`, we
now start `pos` at `io.ptr`, so we'll have more cases where it's
critical to start reading at the right position.
When column names are passed manually, ensure we respect the starting position
I've wanted to do this for a while; previously we were only using the
estimate from the first 10 rows. This hooks into the "chunking" code,
which looks at `tasks` # of chunks of a file to find the start of rows
for each; we now keep track of the # of bytes we saw when doing those
row checks and use those totals plus the original 10 rows to form a
better estimate of the total # of rows.
Improve accuracy of estimated rows for multithreaded parsing
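The estimation described above can be sketched roughly as follows (a simplified Python illustration of the idea, not CSV.jl's actual code; the function and argument names are invented):

```python
def estimate_total_rows(file_size, sampled_bytes, sampled_rows):
    """Extrapolate the total row count from every region already scanned.

    sampled_bytes/sampled_rows accumulate the initial 10-row sample plus the
    bytes and rows seen while locating each chunk's row boundaries, so the
    bytes-per-row estimate draws from several locations in the file rather
    than only its beginning.
    """
    bytes_per_row = sampled_bytes / sampled_rows
    return round(file_size / bytes_per_row)

# e.g. a 10 MB file where 64 KB of samples covered 800 rows (~82 bytes/row)
print(estimate_total_rows(10_000_000, 65_536, 800))
```

Sampling from multiple chunks matters because row widths often drift through a file (e.g. growing id columns), which a head-only sample systematically misses.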
Make the automatic pooled=>string column promotion more efficient
added documentation for the dateformats option
Fixes JuliaData#679; alternative fix to JuliaData#681. When a column is dropped, we
essentially turn it into a `Missing` column type and ignore it when
parsing. There was a check later in file parsing, however, that said if
no missing values were found in a column, to ensure its type is
`Vector{T}` instead of `Vector{Union{Missing, T}}`. The core problem in
issue JuliaData#679 was that these dropped columns, while completely `missing`,
didn't get "flagged" as having `missing` values.
Fixes JuliaData#680. Before custom types, the `typemap` keyword argument was
really only about mapping between the standard, supported types. With
custom types, we still only support certain type mappings (Int to Float,
Any type to String), but we also want to support type mappings like
`Int64 => Int32`. This PR readjusts how typemap works when detecting
column types to account for the possibility of custom Integer or
AbstractFloat type mappings for Int64 & Float64, and moving directly to
String if that's specified.
Ensure dropped columns are ignored in later file processing
Fix typemap for our new custom types world
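The typemap resolution described above amounts to something like this (a hypothetical Python sketch; the real logic operates on Julia types during column detection):

```python
def resolve_type(detected, typemap):
    """Apply a user-supplied typemap to a detected column type (sketch).

    Mirrors the supported mappings described above: a direct entry such as
    "Int64" -> "Int32" takes effect, while a column with no entry keeps its
    detected type. Mapping a type to "String" would skip further detection
    and parse the column as strings.
    """
    if detected in typemap:
        return typemap[detected]
    return detected

# Int64 columns are narrowed; everything else detects as usual
print(resolve_type("Int64", {"Int64": "Int32"}))    # -> Int32
print(resolve_type("Float64", {"Int64": "Int32"}))  # -> Float64
```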
Additional fix for JuliaData#679. There was an alternative issue discovered in
 JuliaData#679 where a type was given for a dropped column, like `CSV.File(file;
 types=Dict(:id=>Int32), select=["first"])`. During parsing, it knew we
 would drop the `id` column, so it parsed it as `Missing`, but then
 tried to "set" the value to a `Vector{Int32}` which threw an error.
 When dropping a column, we can basically ignore any provided type
 information instead, so in the `Header` routine, we make sure to mark
 the type of a column as `Missing` if it will be dropped.
Fix case when type is provided for dropped columns
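The interaction between `types` and `select`/`drop` described in these two fixes can be illustrated as follows (a Python sketch with invented names, not the actual `Header` routine):

```python
def planned_column_types(header, provided_types, selected):
    """Decide each column's parse type up front (sketch).

    A dropped column is marked "Missing" even when the user supplied a type
    for it, so later stages never try to store parsed values into a typed
    vector for a column that is being ignored.
    """
    return [
        "Missing" if name not in selected
        else provided_types.get(name, "detect")
        for name in header
    ]

# `id` has a provided type but is not selected, so it is parsed as Missing
print(planned_column_types(["id", "first"], {"id": "Int32"}, {"first"}))
```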
Add docs for reading non-UTF-8 files
Fixes JuliaData/DataFrames.jl#2309. After much
discussion, people really like the `CSV.read` function call naming, so
it was decided to keep it around, while still allowing a break from
DataFrames.jl as a dependency.

I also realized that `CSV.read` does indeed provide one bit of
value/functionality: we can wrap `CSV.File` in `Tables.CopiedColumns`.
What this means is that columnar sinks can safely use the columns passed
without needing to make copies; i.e. they can assume ownership of the
columns. In `CSV.read`, the user is essentially saying, "I want to make
a `CSV.File` and pass it directly to `sink`" which also implies that
`CSV.File` doesn't need to "own" its own columns.

The only question left in my mind is, with the 1.0 release, what to do
with `CSV.read(file)` when no `sink` argument is passed. Suggestions
have included just returning a `CSV.File`, since it's a valid table
anyway. Or allowing DataFrames.jl to define the no-sink-arg version and
return a `DataFrame`; that one's a bit awkward as a form of "blessed"
type piracy, but could also be useful for users as convenience (they
just have to remember to do `using DataFrames` before trying to use it).
The other awkward part is that we're currently warning users of the
deprecation and that they should explicitly spell out `DataFrame`
whereas we might not actually require that.
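The ownership idea behind `Tables.CopiedColumns` can be sketched like this (illustrative Python with hypothetical names; the real mechanism is the Tables.jl interface in Julia):

```python
class CopiedColumns:
    """Marker wrapper: the producer promises these columns are freshly
    materialized, so a sink may take ownership of them without copying."""
    def __init__(self, columns):
        self.columns = columns

def sink_columns(table):
    """A columnar sink (sketch): copy defensively unless told it is safe."""
    if isinstance(table, CopiedColumns):
        return table.columns  # safe to alias: no defensive copy needed
    return {name: list(col) for name, col in table.items()}  # defensive copy

cols = {"a": [1, 2, 3]}
assert sink_columns(CopiedColumns(cols)) is cols   # ownership transferred
assert sink_columns(cols) is not cols              # plain table was copied
```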
This is another optimization I've wanted to do for a while; we're
already checking up to 5 rows from various locations in the file to find
row boundaries for multithreaded parsing, so this just also samples the
types from those locations. The end result is that we already go into
full parsing having checked `tasks * 5` # of rows for column types. The
2 main advantages of adding this are that we can avoid running `detect`
in all tasks in some cases, i.e. if we've already sampled that a column
is `Int64`, then each task will immediately start parsing `Int64`s
instead of having to `detect`, then `allocate`, etc. The other advantage
is if we happen to catch a rogue string value in a part of the file, it
can be promoted immediately instead of having to parse entire chunks as,
say, `Int64`, and then re-parsed later. It's not super robust in that
case because we're only sampling 5 rows from each chunk, but it's better
than nothing and doesn't add extra cost to what we're already doing.

This PR also adds an optimization when a `PooledString` column is
promoted to `String`, we don't need to reparse, but can just allocate
the `String` array and put the ref values in directly according to the
codes.
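The pooled-to-string promotion is cheap because the ref codes already index into the pool of parsed values; roughly (a Python sketch, not the actual implementation):

```python
def promote_pooled_to_strings(refs, pool):
    """Materialize a String column from a pooled column's ref codes.

    No bytes are reparsed: each per-row code is simply looked up in the pool
    of already-parsed strings.
    """
    return [pool[code] for code in refs]

# pool holds the unique values; refs holds per-row indices into the pool
print(promote_pooled_to_strings([0, 1, 0, 2], ["a", "b", "c"]))
```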
There was a case where we were still assuming header is always a vector;
this led to a confusing error when passing writeheader=false.
ViralBShah and others added 17 commits April 9, 2024 20:05
* Update to 0.10.14

* Update julia setup action

* Add dependabot
* finalize memory if 1.11

* don't download busybox in test

* test windows on lts
* fix decchar handling in writecell() for AbstractFloat

* test for JuliaData#1109 fix decchar handling in writecell() for AbstractFloat

* fix newline format

---------

Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
* fix INT128_MIN write

* add write INT128_MIN test

* use Base.uabs

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>

* add BigInt test

---------

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>
* Bump codecov/codecov-action from 4 to 5

Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v4...v5)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .github/workflows/ci.yml

Co-authored-by: Chengyu Han <cyhan.dev@outlook.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Co-authored-by: Chengyu Han <cyhan.dev@outlook.com>
codecov bot commented Jan 10, 2026

Codecov Report

❌ Patch coverage is 90.19934% with 177 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@bd4c8f3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
CSV/src/utils.jl 86.47% 53 Missing ⚠️
CSV/src/context.jl 88.68% 37 Missing ⚠️
CSV/src/file.jl 94.14% 33 Missing ⚠️
CSV/src/rows.jl 84.29% 30 Missing ⚠️
CSV/src/detection.jl 94.71% 11 Missing ⚠️
CSV/src/write.jl 90.69% 8 Missing ⚠️
CSV/src/CSV.jl 70.00% 3 Missing ⚠️
CSV/src/chunks.jl 91.30% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1176   +/-   ##
=======================================
  Coverage        ?   90.55%           
=======================================
  Files           ?        9           
  Lines           ?     2319           
  Branches        ?        0           
=======================================
  Hits            ?     2100           
  Misses          ?      219           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.

@t-bltg (Author) commented Jan 11, 2026

@quinnj, since you implemented the (now broken) compress test in #883, are you OK with the fix for #1177 implemented in a333eef?

Side note: I think nightly would be fixed by #1170; can we get that PR in as well?

@quinnj (Member) commented Jan 11, 2026

I gotta be honest that I don't love this approach. It feels like we could restructure in a different way where the repo structure doesn't feel convoluted. I'd be open to generating some of the larger files on the fly, or compressing them, or hosting them separately. I think it should be much simpler to shuffle a few things around and still get 95% of the benefit.

@t-bltg (Author) commented Jan 11, 2026

Moving files elsewhere (another branch or repository) seems more convoluted than moving files around in here.

The new repository structure mimics a monorepo, a layout adopted by other packages as well.

I don't think having a monorepo structure with a single package (CSV) looks odd or convoluted.

Generating CSV files on the fly seems like a lot of work, and I don't think we can replace all the test files with this approach.

@quinnj (Member) commented Jan 12, 2026

I just personally dislike the idea of having to restructure the repo via a subdir package vs. just fixing the problem directly. I've restructured things on the main branch so that we serve a compressed tarball via a GitHub release artifact that gets used in test runs/CI. It seems to have shrunk the size of the repo for me from 28 MB to ~3 MB. I'd prefer going with this kind of approach.

@t-bltg (Author) commented Jan 13, 2026

Whatever reduces the downloaded files is fine.

I've restructured things on the main branch where we serve a compressed tarball via github release artifact

I don't think it's very developer friendly to have to download the test files and clean them up afterwards.

t-bltg closed this Jan 13, 2026


Development

Successfully merging this pull request may close these issues.

compress test seems broken