JessFlan edited this page Nov 4, 2016 · 2 revisions

Overview

This page compares the speed of the existing (original) validator with that of the forked version. The forked version discards information about rows that validate successfully and uses a cache so that each regex is compiled only once.

The original version fails on large files because its memory usage, even without the unique constraint, is proportional to the size of the file it is processing. It also incurs a lot of overhead when processing regex columns.

| Version | File size | Processing time | Succeeds? |
| --- | --- | --- | --- |
| forked | 1GB | 41.498s | yes |
| forked | 10GB | 7m2.462s | yes |
| original | 1GB | 1m20.675s | yes |
| original | 10GB | 70m4.260s | no (GC overhead limit exceeded) |
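To put the timings above in perspective, the durations can be converted to seconds and compared. The following is a minimal sketch; the helper names `to_seconds` and `ratio` are hypothetical and are not part of the validator.

```shell
# Convert a `time`-style duration (e.g. 7m2.462s or 41.498s) into seconds.
to_seconds() {
  LC_ALL=C awk -v t="$1" 'BEGIN {
    sub(/s$/, "", t)                 # drop the trailing "s"
    n = split(t, parts, "m")         # split minutes from seconds, if present
    if (n == 2) printf "%.3f\n", parts[1] * 60 + parts[2]
    else        printf "%.3f\n", parts[1]
  }'
}

# Print how many times longer the first duration is than the second.
ratio() {
  LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

ratio 1m20.675s 41.498s   # original vs forked on the 1GB file  -> 1.94
ratio 7m2.462s 41.498s    # forked: 10GB run vs 1GB run         -> 10.18
```

On the 1GB file the original validator takes roughly 1.94 times as long as the fork. The fork's run time grows by a factor of about 10.18 between the two files, whose sizes (1.0G and 9.8G) differ by a factor of about 9.8, so its run time scales roughly linearly with file size in these two measurements.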

Get some sample data

curl https://raw.githubusercontent.com/textcreationpartnership/Texts/master/TCP.csv > TCP.csv
curl https://raw.githubusercontent.com/digital-preservation/csv-schema/master/example-schemas/TCP.csvs > TCP.csvs

To build test files that are large enough, copy the original file and then repeatedly append its data rows (everything after the header line):

cp TCP.csv TCP-10GB.csv
for count in {1..360}; do sed -n 2,61316p TCP.csv >> TCP-10GB.csv; done;

cp TCP.csv TCP-1GB.csv
for count in {1..36}; do sed -n 2,61316p TCP.csv >> TCP-1GB.csv; done;
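The duplication loops above hard-code the line range of TCP.csv. A more general version could look like the sketch below. The function name `grow_csv` and the three-line sample file are hypothetical, and the sketch assumes the source file ends with a newline and has no quoted fields containing line breaks.

```shell
# grow_csv SRC DEST COPIES: copy SRC to DEST, then append the data rows
# (everything after the header line) COPIES more times.
grow_csv() {
  src=$1 dest=$2 copies=$3
  cp "$src" "$dest"
  i=0
  while [ "$i" -lt "$copies" ]; do
    tail -n +2 "$src" >> "$dest"   # skip the header, keep the data rows
    i=$((i + 1))
  done
}

# Small self-contained demonstration on a hypothetical sample file.
tmpdir=$(mktemp -d)
printf 'id,title\n1,alpha\n2,beta\n' > "$tmpdir/sample.csv"
grow_csv "$tmpdir/sample.csv" "$tmpdir/big.csv" 3

# Expect 1 header line + 2 data rows * (1 original + 3 copies) = 9 lines.
wc -l < "$tmpdir/big.csv"
```

With the real data, `grow_csv TCP.csv TCP-1GB.csv 36` would correspond to the 1GB loop above, provided TCP.csv has exactly 61316 lines as the `sed` range implies.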

Resulting data sets:

9.8G TCP-10GB.csv
1.0G TCP-1GB.csv

Forked validator

1GB data set

Command:

time ~/csv-validate-fork/csv-validator/csv-validator-cmd/target/csv-validator-cmd-1.2-RC2-SNAPSHOT-application/csv-validator-cmd-1.2-RC2-SNAPSHOT/bin/validate TCP-1GB.csv TCP.csvs --disable-utf8-validation

Output:

PASS

real	0m41.498s
user	0m40.685s
sys	0m2.275s

JConsole Output

10GB data set

Command:

time ~/csv-validate-fork/csv-validator/csv-validator-cmd/target/csv-validator-cmd-1.2-RC2-SNAPSHOT-application/csv-validator-cmd-1.2-RC2-SNAPSHOT/bin/validate TCP-10GB.csv TCP.csvs --disable-utf8-validation

Output:

PASS

real	7m2.462s
user	5m58.435s
sys	0m29.979s

JConsole Output

Original validator

Version from:

http://search.maven.org/remotecontent?filepath=uk/gov/nationalarchives/csv-validator-cmd/1.2-RC1/csv-validator-cmd-1.2-RC1-application.zip

1GB data set

Command:

time ./validator/csv-validator-cmd-1.2-RC1\ 8/bin/validate TCP-1GB.csv TCP.csvs --disable-utf8-validation

Output:

PASS

real	1m20.675s
user	2m42.279s
sys	0m4.632s

JConsole Output

10GB data set

Command:

time ./validator/csv-validator-cmd-1.2-RC1\ 8/bin/validate TCP-10GB.csv TCP.csvs --disable-utf8-validation

Output:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scalaz.Apply$class.ap2(Apply.scala:49)
    at scalaz.ValidationInstances3$$anon$2.ap2(Validation.scala:520)
    at scalaz.Applicative$class.apply2(Applicative.scala:32)
    at scalaz.ValidationInstances3$$anon$2.apply2(Validation.scala:520)
    at scalaz.std.ListInstances$$anon$1$$anonfun$traverseImpl$3.apply(List.scala:67)
    at scalaz.std.ListInstances$$anon$1$$anonfun$traverseImpl$3.apply(List.scala:67)
    at scalaz.DList$$anonfun$foldr$1.apply(DList.scala:57)
    at scalaz.IList$$anonfun$foldRight$1.apply(IList.scala:151)
    at scalaz.IList.foldLeft0$1(IList.scala:145)
    at scalaz.IList.foldLeft(IList.scala:147)
    at scalaz.IList.foldRight(IList.scala:151)
    at scalaz.DList.foldr(DList.scala:57)
    at scalaz.std.ListInstances$$anon$1.traverseImpl(List.scala:66)
    at scalaz.std.ListInstances$$anon$1.traverseImpl(List.scala:14)
    at scalaz.Traverse$Traversal.run(Traverse.scala:50)
    at scalaz.Traverse$class.sequence(Traverse.scala:101)
    at scalaz.std.ListInstances$$anon$1.sequence(List.scala:14)
    at scalaz.syntax.TraverseOps.sequence(TraverseSyntax.scala:27)
    at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.rules(AllErrorsMetaDataValidator.scala:47)
    at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.rules(CsvValidator.scala:32)
    at uk.gov.nationalarchives.csv.validator.MetaDataValidator$class.validateRow(MetaDataValidator.scala:235)
    at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.validateRow(CsvValidator.scala:32)
    at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.validateRows$1(AllErrorsMetaDataValidator.scala:30)
    at uk.gov.nationalarchives.csv.validator.AllErrorsMetaDataValidator$class.validateRows(AllErrorsMetaDataValidator.scala:35)
    at uk.gov.nationalarchives.csv.validator.api.CsvValidator$$anon$2.validateRows(CsvValidator.scala:32)
    at uk.gov.nationalarchives.csv.validator.MetaDataValidator$$anonfun$11.apply(MetaDataValidator.scala:186)
    at uk.gov.nationalarchives.csv.validator.MetaDataValidator$$anonfun$11.apply(MetaDataValidator.scala:150)
    at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:86)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
    at scala.util.control.Exception$Catch.apply(Exception.scala:103)
    at scala.util.control.Exception$Catch.either(Exception.scala:125)

real	70m4.260s
user	221m36.932s
sys	3m4.530s
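One way to read the `time` figures from the four runs is the ratio (user + sys) / real, which approximates the average number of CPU cores kept busy. The sketch below computes it from the reported values; the helper names `to_seconds` and `cpu_ratio` are hypothetical, and attributing the extra CPU time to any particular cause is an assumption rather than something measured on this page.

```shell
# Convert a `time`-style duration (e.g. 70m4.260s) into seconds.
to_seconds() {
  LC_ALL=C awk -v t="$1" 'BEGIN {
    sub(/s$/, "", t)
    n = split(t, parts, "m")
    if (n == 2) printf "%.3f\n", parts[1] * 60 + parts[2]
    else        printf "%.3f\n", parts[1]
  }'
}

# cpu_ratio REAL USER SYS: print (user + sys) / real to two decimals.
cpu_ratio() {
  LC_ALL=C awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" \
    -v s="$(to_seconds "$3")" 'BEGIN { printf "%.2f\n", (u + s) / r }'
}

cpu_ratio 0m41.498s 0m40.685s 0m2.275s      # forked, 1GB    -> 1.04
cpu_ratio 7m2.462s 5m58.435s 0m29.979s      # forked, 10GB   -> 0.92
cpu_ratio 1m20.675s 2m42.279s 0m4.632s      # original, 1GB  -> 2.07
cpu_ratio 70m4.260s 221m36.932s 3m4.530s    # original, 10GB -> 3.21
```

The forked validator keeps about one core busy in both runs, while the original uses about two cores on the 1GB file and more than three on the 10GB file. That pattern is consistent with, but does not prove, additional CPU time being spent in garbage collection before the run fails.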

JConsole Output