I noticed that a lot of mappings were
a) just wrong (e.g. linked to the wrong record, like col a vs col b, or the wrong version number / line item)
b) missing (e.g. no field to capture some data in a record)
c) duplicated (e.g. multiple fields mapped to the same name)
d) inconsistently named
e) not well segregated (e.g. fields in comma- or newline-separated data that themselves contain unescaped commas or newlines)
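To make (e) concrete, here's a rough Python sketch. The row and the expected field count are made up for illustration and aren't taken from the actual data. It just shows how one unescaped comma shifts every later field over by one, and that the same values survive a round trip when they're properly quoted.

```python
import csv
import io

# Made-up row for illustration: the name field contains an unescaped comma.
raw_line = "SA11AI,Smith, John,250.00"
expected_fields = 3  # hypothetical column count for this row type

# Naive splitting treats the comma inside the name as a delimiter.
naive = raw_line.split(",")
is_misaligned = len(naive) != expected_fields

# The same values written with proper quoting parse back unchanged.
buf = io.StringIO()
csv.writer(buf).writerow(["SA11AI", "Smith, John", "250.00"])
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
```

A field-count check like `is_misaligned` catches the simple cases, though it won't notice a row where one field is split and another is missing.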
So I'm working on a major overhaul of the source mapping, deriving it directly from the eFilingFormats file "e-filing headers all versions.xlsx". While I'm at it, I'm having it support versions 1 & 2 as well as deprecated forms.
Since the data import will have to be redone anyway (because of a-c above), I'm being a bit aggressive about making the names consistent and semantic, e.g. total_receipts_ytd instead of col_b_total_receipts. I'm hoping to reduce the total number of canonical field names from the current ~1.2k to something a bit more sane. ;-)
The new version will have a regex-based mapping file that uses US (unit separator, ASCII 31) characters as delimiters and includes field type/size data, both to make it easier to edit in the future and to make it possible to automatically generate a database migration file.
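Roughly what I have in mind, as a Python sketch only. The column order (regex, canonical name, type, size), the patterns, the table name and the helper names are all placeholders I made up here, not the final format. It shows how a US-delimited row can hold a regex and a size like "12,2" without any escaping, and how the same rows can drive both header lookup and a generated CREATE TABLE statement.

```python
import re

US = "\x1f"  # ASCII 31, the unit separator

# Hypothetical mapping rows: header regex, canonical name, SQL type, size.
MAPPING_TEXT = "\n".join([
    US.join([r"^col_?b_?total_?receipts$", "total_receipts_ytd", "numeric", "12,2"]),
    US.join([r"^committee_?name$", "committee_name", "varchar", "200"]),
])

def parse_mapping(text):
    """Split each non-empty line on US and compile its header regex."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pattern, name, sql_type, size = line.split(US)
        rows.append((re.compile(pattern, re.IGNORECASE), name, sql_type, size))
    return rows

def canonical_name(header, rows):
    """Return the canonical name of the first matching pattern, or None."""
    for pattern, name, _sql_type, _size in rows:
        if pattern.search(header.strip()):
            return name
    return None

def migration_sql(table, rows):
    """Build a CREATE TABLE statement from the type/size columns."""
    cols = ",\n".join(
        f"  {name} {sql_type}({size})" for _pattern, name, sql_type, size in rows
    )
    return f"CREATE TABLE {table} (\n{cols}\n);"

rows = parse_mapping(MAPPING_TEXT)
name = canonical_name("COL_B_TOTAL_RECEIPTS", rows)
sql = migration_sql("filings", rows)
```

Since neither regexes nor sizes normally contain ASCII 31, a plain `split` is enough and the file needs no quoting rules.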
I'm expecting to be done in about a week and will make a pull request then. Right now it's not in a fully consistent state.
So @dwillis et al, please hold off on working on this part of the code for the moment.
(Also, I'll be publishing an .sql.gz dump of the full import to date.)