Conversation
Fixes JuliaData#464 (or at least improves it quite a bit). The new precompile.jl file is a script I ran to generate precompile statements for CSV.jl, Parsers.jl, and SentinelArrays.jl (which seem to be the biggest targets and the ones that live in JuliaData). The Parsers.jl output ended up being insignificant for now, so that wasn't committed, but the SentinelArrays.jl output was. I've added two new "precompile.csv" and "precompile_small.csv" files that are used for snooping; they include a column of each type, which should hopefully cover a good chunk of the codepaths we're compiling. We can adjust more later if there are certain paths that could use it and are causing people problems. All in all, this cuts TTFP (time-to-first-parse) in half on my machine.
Add some precompiles
Fixes JuliaData#668. The issue here is that when column names were passed manually, the code path that "skipped" to the datarow passed in the starting position as 1 instead of the `pos` variable. This used to not be an issue because `pos` was almost always 1 anyway. With `IOBuffer`, we now start `pos` at `io.ptr`, so we'll have more cases where it's critical to start reading at the right position.
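A minimal sketch of the affected case: an `IOBuffer` that has already been partially read, so its position is past byte 1 (input contents here are hypothetical).

```julia
using CSV

# an IOBuffer whose read position (io.ptr) is past byte 1
io = IOBuffer("# preamble to skip\n1,2\n3,4\n")
readline(io)                    # advance io.ptr past the first line

# with manually provided column names, parsing must start at io.ptr,
# not at byte 1 of the underlying buffer
f = CSV.File(io; header = [:a, :b])
```

With the fix, `f.a` contains `[1, 3]` rather than data parsed from the start of the buffer.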
When column names passed manually, ensure we respect starting position
I've wanted to do this for a while; previously we were only using the estimate from the first 10 rows. This hooks into the "chunking" code, which inspects `tasks` # of chunks of a file to find the start of rows for each; we now keep track of the # of bytes we saw while doing those row checks and use those totals, plus the original 10 rows, to form a better estimate of the total # of rows.
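The arithmetic behind the improved estimate can be sketched as follows (illustrative names and numbers only, not CSV.jl internals):

```julia
# combine bytes/rows seen in the initial 10-row sample with the bytes/rows
# observed while finding chunk boundaries, then scale to the whole file
sample_bytes = 420          # bytes covered by the first 10 rows (hypothetical)
sample_rows  = 10
chunk_bytes  = 2_150        # bytes scanned during the per-chunk row checks (hypothetical)
chunk_rows   = 40           # rows seen during those checks
total_bytes  = 1_000_000    # file size

avg_row_bytes  = (sample_bytes + chunk_bytes) / (sample_rows + chunk_rows)
estimated_rows = ceil(Int, total_bytes / avg_row_bytes)
```

Averaging over all sampled regions instead of just the first 10 rows makes the per-row byte estimate much less sensitive to an unrepresentative file header region.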
Improve accuracy of estimated rows for multithreaded parsing
Make the automatic pooled=>string column promotion more efficient
…r invalid rows
added documentation for the dateformats option
Fixes JuliaData#679; alternative fix to JuliaData#681. When a column is dropped, we essentially turn it into a `Missing` column type and ignore it when parsing. There was a check later in file parsing, however, that said if no missing values were found in a column, to ensure its type is `Vector{T}` instead of `Vector{Union{Missing, T}}`. The core problem in issue JuliaData#679 was that these dropped columns, while completely `missing`, didn't get "flagged" as having `missing` values.
Fixes JuliaData#680. Before custom types, the `typemap` keyword argument was really only about mapping between the standard, supported types. With custom types, we still only support certain type mappings (Int to Float, any type to String), but we also want to support type mappings like `Int64 => Int32`. This PR readjusts how typemap works when detecting column types to account for the possibility of custom Integer or AbstractFloat type mappings for Int64 & Float64, and moves directly to String if that's specified.
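A hedged sketch of the newly supported kind of mapping, using the `typemap` keyword with a custom-width integer target (input data is hypothetical):

```julia
using CSV

# map auto-detected Int64 columns to Int32 instead
io = IOBuffer("a,b\n1,2.5\n3,4.5\n")
f = CSV.File(io; typemap = Dict(Int64 => Int32))
```

Column `a` is detected as `Int64` but stored as `Int32` per the mapping; column `b` is unaffected.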
Ensure dropped columns are ignored in later file processing
Fix typemap for our new custom types world
Additional fix for JuliaData#679. There was an alternative issue discovered in JuliaData#679 where a type was given for a dropped column, like `CSV.File(file; types=Dict(:id=>Int32), select=["first"])`. During parsing, it knew we would drop the `id` column, so it parsed it as `Missing`, but then tried to "set" the value to a `Vector{Int32}` which threw an error. When dropping a column, we can basically ignore any provided type information instead, so in the `Header` routine, we make sure to mark the type of a column as `Missing` if it will be dropped.
Fix case when type is provided for dropped columns
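A small sketch of the case from JuliaData#679 that this fixes (data is hypothetical):

```julia
using CSV

# a type is provided for a column that is dropped via `select`
io = IOBuffer("first,id\nx,1\ny,2\n")
f = CSV.File(io; types = Dict(:id => Int32), select = ["first"])
# the dropped `id` column's provided type is ignored; no error is thrown
```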
Add docs for reading non-UTF-8 files
Fixes JuliaData/DataFrames.jl#2309. After much discussion, people really like the `CSV.read` function call naming, so it was decided to keep it around, while still allowing a break from DataFrames.jl as a dependency. I also realized that `CSV.read` does indeed provide one bit of value/functionality: we can wrap `CSV.File` in `Tables.CopiedColumns`. What this means is that columnar sinks can safely use the columns passed without needing to make copies; i.e. they can assume ownership of the columns. In `CSV.read`, the user is essentially saying, "I want to make a `CSV.File` and pass it directly to `sink`" which also implies that `CSV.File` doesn't need to "own" its own columns. The only question left in my mind is, with the 1.0 release, what to do with `CSV.read(file)` when no `sink` argument is passed. Suggestions have included just returning a `CSV.File`, since it's a valid table anyway. Or allowing DataFrames.jl to define the no-sink-arg version and return a `DataFrame`; that one's a bit awkward as a form of "blessed" type piracy, but could also be useful for users as convenience (they just have to remember to do `using DataFrames` before trying to use it). The other awkward part is that we're currently warning users of the deprecation and that they should explicitly spell out `DataFrame` whereas we might not actually require that.
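The retained call shape looks like this (sketch; `DataFrames` is the sink being discussed, input data is hypothetical):

```julia
using CSV, DataFrames

# CSV.read materializes the file directly into the sink; because the
# columns are wrapped in Tables.CopiedColumns, the sink may take
# ownership of them without making its own copies
df = CSV.read(IOBuffer("a,b\n1,2\n"), DataFrame)
```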
This is another optimization I've wanted to do for a while; we're already checking up to 5 rows from various locations in the file to find row boundaries for multithreaded parsing, so this just also samples the types from those locations. The end result is that we go into full parsing having already checked `tasks * 5` # of rows for column types. The 2 main advantages of adding this are that we can avoid running `detect` in all tasks in some cases, i.e. if we've already sampled that a column is `Int64`, then each task will immediately start parsing `Int64`s instead of having to `detect`, then `allocate`, etc. The other advantage is if we happen to catch a rogue string value in a part of the file, it can be promoted immediately instead of having to parse entire chunks as, say, `Int64`, and then re-parse later. It's not super robust in that case because we're only sampling 5 rows from each chunk, but it's better than nothing and doesn't add extra cost to what we're already doing. This PR also adds an optimization: when a `PooledString` column is promoted to `String`, we don't need to reparse, but can just allocate the `String` array and put the ref values in directly according to the codes.
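The pooled-to-`String` promotion trick can be sketched in isolation (hedged, self-contained illustration; `pool` and `codes` are hypothetical names, not CSV.jl internals):

```julia
# given the pool (code => string) and the per-row codes, the String column
# can be materialized directly, with no re-parsing of the underlying bytes
pool  = Dict{UInt32, String}(1 => "low", 2 => "high")
codes = UInt32[1, 2, 2, 1]
column = [pool[c] for c in codes]
```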
There was a case where we were still assuming `header` is always a vector; this led to a confusing error when passing `writeheader=false`.
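The previously failing call shape, sketched with hypothetical data:

```julia
using CSV

# suppress the header row entirely rather than supplying header names
io = IOBuffer()
CSV.write(io, (a = [1, 2], b = [3, 4]); writeheader = false)
out = String(take!(io))   # only the two data rows, no header line
```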
* Update to 0.10.14
* Update julia setup action
* Add dependabot

* finalize memory if 1.11
* don't download busybox in test
* test windows on lts

* fix decchar handling in writecell() for AbstractFloat
* test for JuliaData#1109 fix decchar handling in writecell() for AbstractFloat
* fix newline format

Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>

* fix INT128_MIN write
* add write INT128_MIN test
* use Base.uabs
* add BigInt test

Co-authored-by: Nathan Zimmerberg <39104088+nhz2@users.noreply.github.com>

* Bump codecov/codecov-action from 4 to 5

  Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.
  - [Release notes](https://github.com/codecov/codecov-action/releases)
  - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
  - [Commits](codecov/codecov-action@v4...v5)

* Update .github/workflows/ci.yml

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Co-authored-by: Chengyu Han <cyhan.dev@outlook.com>
I gotta be honest that I don't love this approach. It feels like we could restructure in a different way where the repo structure doesn't feel convoluted. I'd be open to generating some of the larger files on the fly, or compressing, or hosting them separately. I think it should be much simpler to shuffle a few things around and still get 95% of the benefit.
Moving files elsewhere (another branch or repository) seems more convoluted than moving files around in here. The new repository structure mimics a monorepo, which is adopted by other packages, e.g.: I don't think having a monorepo structure with a single package ( Generating
I just personally dislike the idea of having to restructure the repo via a subdir package vs. just fixing the problem directly. I've restructured things on the
Whatever reduces the downloaded files is fine. I don't think it's very developer-friendly to have to download the test files and clean them up afterwards.
This is a request to restructure the repository in order to avoid shipping `testfiles` (about 29MB) when installing `CSV.jl` (similar to what is proposed in jheinen/GR.jl#575 or JuliaGraphics/ColorSchemes.jl#154).

So we move the `src` and `test` directories to a dedicated dir named `CSV`: only those files will be downloaded when a user installs the `CSV` package through `Pkg`, thus saving bandwidth. Xref: JuliaLang/Pkg.jl#3798.

This will need a corresponding PR to https://github.com/JuliaRegistries/General/blob/master/C/CSV/Package.toml, adding a `subdir = "CSV"` line.

Tests pass locally, and building docs also.

I've also refreshed the `ci.yml` action.

Fixes #1177.
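For reference, the registry change would amount to one added line in the package's `Package.toml` (a sketch of the file's shape only; the existing entries are illustrative, not copied from the General registry):

```toml
# C/CSV/Package.toml in the General registry (existing entries elided)
name = "CSV"
repo = "https://github.com/JuliaData/CSV.jl.git"
subdir = "CSV"   # new: tells Pkg to install from the CSV/ subdirectory
```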