Major overhaul of source mapping

I noticed that a lot of mappings were:
a) just wrong (e.g. linked to the wrong record, like col a vs col b, or to the wrong version number / line item)
b) missing (e.g. no field to capture some data in a record)
c) duplicated (e.g. multiple fields mapped to the same name)
d) inconsistently named
e) not well separated (e.g. unescaped commas or newlines inside fields of data that is itself comma/newline separated)
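
To illustrate (e), here is a minimal Python sketch of how an unescaped comma inside a field throws off a naive comma split, while a control-character delimiter such as ASCII 31 does not. The sample row and the `split_naive` helper are made up for illustration and are not taken from the actual code or data.

```python
US = "\x1f"  # ASCII 31, the "unit separator" control character

# Hypothetical three-field row; the second field contains a comma.
fields = ["SA11AI", "Smith, John", "250.00"]

comma_line = ",".join(fields)  # written out with no escaping or quoting
us_line = US.join(fields)      # same row, delimited with ASCII 31


def split_naive(line, sep):
    """Split on the separator with no escape or quote handling."""
    return line.split(sep)


print(split_naive(comma_line, ","))  # four pieces: the name is cut in two
print(split_naive(us_line, US))      # three pieces, matching the input
```

This only works as long as the delimiter never occurs inside the data, which is far more plausible for a control character than for a comma.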

So I'm working on a major overhaul of the source mapping, deriving directly from the `e-filing headers all versions.xlsx` eFilingFormats file. While at it, I'm having it support versions 1 & 2 as well as deprecated forms.

Since the data import will have to be redone anyway (because of a-c above), I'm being a bit aggressive about making the names consistent and semantic, e.g. `total_receipts_ytd` instead of `col_b_total_receipts`. I'm hoping to reduce the total number of canonical field names from the current ~1.2k to something a bit more sane. ;-)

The new version will have a regex-based mapping file, delimited with the unit separator character (US, ASCII 31) and carrying field type/size data, both to make it easier to edit in the future and to make it possible to automatically output a database migration file.
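
A rough Python sketch of what such a mapping file could look like and how a migration might be generated from it. Everything here is an assumption for illustration: the column layout (form pattern, version pattern, canonical name, type, size), the sample form and version patterns, the `committee_name` and `message_text` names, and the helpers `parse_rules`, `fields_for`, and `migration_sql` are all hypothetical and do not describe the real file.

```python
import re
from dataclasses import dataclass

US = "\x1f"  # ASCII 31, used as the column delimiter in the mapping file

# Hypothetical mapping lines: form regex, version regex, name, type, size.
RULES_TEXT = "\n".join([
    US.join(["F3[A-Z]?", r"[12](\.\d+)?", "total_receipts_ytd", "decimal", "12"]),
    US.join(["F3[A-Z]?", r"[12](\.\d+)?", "committee_name", "varchar", "90"]),
    US.join(["F99", r"2(\.\d+)?", "message_text", "varchar", "4000"]),
])


@dataclass(frozen=True)
class FieldRule:
    form: re.Pattern
    version: re.Pattern
    name: str
    sql_type: str
    size: int


def parse_rules(text):
    """Parse US-delimited mapping lines into FieldRule objects."""
    rules = []
    for line in text.splitlines():
        if not line:
            continue
        form, version, name, sql_type, size = line.split(US)
        rules.append(FieldRule(re.compile(form), re.compile(version),
                               name, sql_type, int(size)))
    return rules


def fields_for(rules, form, version):
    """Return the rules whose patterns fully match this form and version."""
    return [r for r in rules
            if r.form.fullmatch(form) and r.version.fullmatch(version)]


def migration_sql(table, rules):
    """Emit a CREATE TABLE statement, one column per distinct field name."""
    cols, seen = [], set()
    for r in rules:
        if r.name in seen:
            continue
        seen.add(r.name)
        if r.sql_type == "varchar":
            cols.append(f"  {r.name} VARCHAR({r.size})")
        elif r.sql_type == "decimal":
            cols.append(f"  {r.name} DECIMAL({r.size}, 2)")
        else:
            cols.append(f"  {r.name} {r.sql_type.upper()}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"


rules = parse_rules(RULES_TEXT)
print([r.name for r in fields_for(rules, "F3N", "2.02")])
print(migration_sql("filings", rules))
```

Keeping type and size next to each name means the lookup table and the schema come from one source, so they cannot drift apart.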

I'm expecting to be done in about a week and will make a pull request then. Right now it's not in a fully consistent state. 

So @dwillis et al, please hold off on working on this part of the code for the moment.

(Also, I'll be publishing an .sql.gz dump of the full import to date.)


Major overhaul of source mapping #75
