forked from digital-preservation/csv-validator
Speed testing
JessFlan edited this page Nov 4, 2016 · 2 revisions
This page compares the speed of the original validator with that of the forked version. The forked version discards information about rows that validate successfully and uses a cache so that each regex is compiled only once.
The original version fails on large files because its memory usage, even without the unique constraint, is proportional to the size of the file being processed. It also incurs considerable overhead when processing regex columns.
| Version | File size | Processing time (real) | Succeeds? |
|---|---|---|---|
| forked | 1 GB | 41.498s | yes |
| forked | 10 GB | 7m2.462s | yes |
| original | 1 GB | 1m20.675s | yes |
| original | 10 GB | 70m4.260s | no (GC overhead limit exceeded) |
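The timings in the table can be turned into ratios with a little arithmetic. The sketch below is only an illustration: `to_seconds` and `speedup` are hypothetical helper names, not part of the validator, and the only inputs are the `real` times reported on this page. The 10 GB run of the original version did not complete, so no speedup is computed for it.

```shell
#!/bin/sh
# Convert a shell `time` value such as "7m2.462s" into seconds.
# (Hypothetical helper, written for this page only.)
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")       # parts[1] = minutes, parts[2] = e.g. "2.462s"
    sub(/s$/, "", parts[2])    # drop the trailing "s"
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Ratio of the first timing to the second, to two decimal places.
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

speedup 1m20.675s 0m41.498s   # 1 GB, original vs forked: prints 1.94
speedup 7m2.462s 0m41.498s    # forked, 10 GB vs 1 GB:    prints 10.18
```

On these figures the forked version is roughly twice as fast on the 1 GB file, and its run time grows roughly in line with the file size (the larger file is reported as 9.8G).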
curl https://raw.githubusercontent.com/textcreationpartnership/Texts/master/TCP.csv >TCP.csv
curl https://raw.githubusercontent.com/digital-preservation/csv-schema/master/example-schemas/TCP.csvs > TCP.csvs
cp TCP.csv TCP-10GB.csv
for count in {1..360}; do sed -n 2,61316p TCP.csv >> TCP-10GB.csv; done;
cp TCP.csv TCP-1GB.csv
for count in {1..36}; do sed -n 2,61316p TCP.csv >> TCP-1GB.csv; done;
9.8G TCP-10GB.csv
1.0G TCP-1GB.csv
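The two loops above can be generalised into a small function. This is a sketch under stated assumptions, not the commands that were actually run: `grow_csv` and `expected_lines` are hypothetical names, and `tail -n +2` is swapped in for `sed -n 2,61316p` so that the source length does not have to be hard-coded. It assumes the header occupies exactly one line and that the source file ends with a newline. The demonstration uses a tiny generated sample, so it runs without downloading TCP.csv.

```shell
#!/bin/sh
# grow_csv SRC DEST COPIES
# Copy SRC to DEST, then append SRC's data lines (everything after the
# first line) COPIES more times, so DEST keeps exactly one header.
# The body runs in a subshell so its variables do not leak to the caller.
grow_csv() (
  src=$1; dest=$2; copies=$3
  cp "$src" "$dest" || exit 1
  i=0
  while [ "$i" -lt "$copies" ]; do
    tail -n +2 "$src" >> "$dest"   # skip line 1, the header
    i=$((i + 1))
  done
)

# expected_lines SRC_LINES COPIES
# Line count of the grown file: the full source once, plus COPIES
# repetitions of the source without its header line.
expected_lines() {
  echo $(( $1 + $2 * ($1 - 1) ))
}

# Demonstration on a 3-line sample (1 header + 2 data lines), grown 3 times.
sample=$(mktemp)
grown=$(mktemp)
printf 'id,title\n1,a\n2,b\n' > "$sample"
grow_csv "$sample" "$grown" 3
wc -l < "$grown"       # 9 lines
expected_lines 3 3     # prints 9
rm -f "$sample" "$grown"

# Assumption: TCP.csv had exactly 61316 lines, as the sed range suggests.
expected_lines 61316 36    # prints 2268656  (TCP-1GB.csv)
expected_lines 61316 360   # prints 22134716 (TCP-10GB.csv)
```

Comparing `wc -l` on the generated files with these figures is a cheap sanity check before a long timing run. Note that these are physical lines, which may differ from CSV rows if any quoted field contains a line break.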
time ~/csv-validate-fork/csv-validator/csv-validator-cmd/target/csv-validator-cmd-1.2-RC2-SNAPSHOT-application/csv-validator-cmd-1.2-RC2-SNAPSHOT/bin/validate TCP-1GB.csv TCP.csvs --disable-utf8-validation
PASS
real 0m41.498s
user 0m40.685s
sys 0m2.275s

time ~/csv-validate-fork/csv-validator/csv-validator-cmd/target/csv-validator-cmd-1.2-RC2-SNAPSHOT-application/csv-validator-cmd-1.2-RC2-SNAPSHOT/bin/validate TCP-10GB.csv TCP.csvs --disable-utf8-validation
PASS
real 7m2.462s
user 5m58.435s
sys 0m29.979s

Original version (1.2-RC1), downloaded from: http://search.maven.org/remotecontent?filepath=uk/gov/nationalarchives/csv-validator-cmd/1.2-RC1/csv-validator-cmd-1.2-RC1-application.zip
time ./validator/csv-validator-cmd-1.2-RC1\ 8/bin/validate TCP-1GB.csv TCP.csvs --disable-utf8-validation
PASS
real 1m20.675s
user 2m42.279s
sys 0m4.632s

time ./validator/csv-validator-cmd-1.2-RC1\ 8/bin/validate TCP-10GB.csv TCP.csvs --disable-utf8-validation
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scalaz.Apply$class.ap2(Apply.scala:49)
at scalaz.ValidationInstances3$$anon$2.ap2(Validation.scala:520)
at scalaz.Applicative$class.apply2(Applicative.scala:32)
at scalaz.ValidationInstances3$$anon$2.apply2(Validation.scala:520)
at scalaz.std.ListInstances$$anon$1$$anonfun$traverseImpl$3.apply(List.scala:67)
at scalaz.std.ListInstances$$anon$1$$anonfun$traverseImpl$3.apply(List.scala:67)
at scalaz.DList$$anonfun$foldr$1.apply(DList.scala:57)
at scalaz.IList$$anonfun$foldRight$1.apply(IList.scala:151)
at scalaz.IList.foldLeft0$1(IList.scala:145)
at scalaz.IList.foldLeft(IList.scala:147)
at scalaz.IList.foldRight(IList.scala:151)
at scalaz.DList.foldr(DList.scala:57)
at scalaz.std.ListInstances$$anon$1.traverseImpl(List.scala:66)
at scalaz.std.ListInstances$$anon$1.traverseImpl(List.scala:14)
at scalaz.Traverse$Traversal.run(Traverse.scala:50)
at scalaz.Traverse$class.sequence(Traverse.scala:101)
at scalaz.std.ListInstances$$anon$1.sequence(List.scala:14)
at scalaz.syntax.TraverseOps.sequence(TraverseSyntax.scala:27)
at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.rules(AllErrorsMetaDataValidator.scala:47)
at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.rules(CsvValidator.scala:32)
at uk.gov.nationalarchives.csv.validator.MetaDataValidator$class.validateRow(MetaDataValidator.scala:235)
at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.validateRow(CsvValidator.scala:32)
at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.validateRows$1(AllErrorsMetaDataValidator.scala:30)
at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.validateRows(AllErrorsMetaDataValidator.scala:35)
at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.validateRows(CsvValidator.scala:32)
at uk.gov.nationalarchives.csv.validator.MetaDataValidator$$anonfun$11.apply(MetaDataValidator.scala:186)
at uk.gov.nationalarchives.csv.validator.MetaDataValidator$$anonfun$11.apply(MetaDataValidator.scala:150)
at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:86)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
at scala.util.control.Exception$Catch.apply(Exception.scala:103)
at scala.util.control.Exception$Catch.either(Exception.scala:125)
real 70m4.260s
user 221m36.932s
sys 3m4.530s
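One way to read the `real`, `user` and `sys` figures is to divide total CPU time (`user` + `sys`) by wall-clock time (`real`). A value near 1 means a single core was busy for the whole run, and a higher value means several threads were working at once. The sketch below does that arithmetic on the numbers reported above. `to_seconds` and `cpu_ratio` are hypothetical helper names, and the interpretation is an inference, not something measured on this page.

```shell
#!/bin/sh
# Convert a shell `time` value such as "221m36.932s" into seconds.
# (Hypothetical helper, written for this page only.)
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, parts, "m")       # parts[1] = minutes, parts[2] = e.g. "36.932s"
    sub(/s$/, "", parts[2])    # drop the trailing "s"
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# cpu_ratio REAL USER SYS
# Print (user + sys) / real, to two decimal places.
cpu_ratio() {
  awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" \
      -v s="$(to_seconds "$3")" \
    'BEGIN { printf "%.2f\n", (u + s) / r }'
}

cpu_ratio 0m41.498s 0m40.685s 0m2.275s     # forked, 1 GB:    prints 1.04
cpu_ratio 7m2.462s 5m58.435s 0m29.979s     # forked, 10 GB:   prints 0.92
cpu_ratio 1m20.675s 2m42.279s 0m4.632s     # original, 1 GB:  prints 2.07
cpu_ratio 70m4.260s 221m36.932s 3m4.530s   # original, 10 GB: prints 3.21
```

The forked runs stay close to 1, while the failed 10 GB run of the original version used more than three times as much CPU time as wall-clock time. That is consistent with the JVM spending most of its time in multi-threaded garbage collection, which is the condition `GC overhead limit exceeded` reports, though the page does not record which collector or heap settings were in use.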
