This repository was archived by the owner on Mar 6, 2019. It is now read-only.

Major overhaul of source mapping #75

@saizai

I noticed that a lot of the mappings are:
a) just wrong (e.g. linked to the wrong record, like col a vs. col b, or to the wrong version number / line item)
b) missing (e.g. no field to capture some of the data in a record)
c) duplicated (e.g. multiple fields mapped to the same name)
d) inconsistently named
e) not well segregated (e.g. fields containing unescaped commas or newlines in a format that is itself comma/newline separated)
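
To illustrate (e), here is a minimal sketch of how an unescaped comma inside a field breaks a comma-separated line, while a delimiter that never appears in the data (ASCII 31) keeps the fields intact. The field values are made up for illustration and are not taken from the actual mapping files.

```python
US = "\x1f"  # ASCII 31, the "unit separator" control character

# Hypothetical record with a comma inside one field.
fields = ["SA11AI", "Smith, Jones & Co.", "250.00"]

# Joining with commas and no escaping loses the field boundaries.
comma_line = ",".join(fields)
comma_parts = comma_line.split(",")

# Joining with the unit separator round-trips cleanly.
us_line = US.join(fields)
us_parts = us_line.split(US)

print(len(fields), len(comma_parts), len(us_parts))
```

The comma-separated version comes back with one more field than went in, which is the kind of silent misalignment that leads to (a).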

So I'm working on a major overhaul of the source mapping, deriving it directly from the eFilingFormats file "e-filing headers all versions.xlsx". While I'm at it, I'm having it support versions 1 & 2 as well as deprecated forms.

Because the data import will have to be redone anyway (because of a-c above), I'm being a bit aggressive about making the names consistent and semantic, e.g. total_receipts_ytd instead of col_b_total_receipts. I'm hoping to reduce the total number of canonical field names from the current ~1.2k to something a bit more sane. ;-)

The new version will have a regex-based mapping file that uses US (unit separator, ASCII 31) delimiters and includes field type/size data, both to make it easier to edit in the future and to make it possible to automatically output a database migration file.
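
As a rough sketch of what such a mapping file could look like and how it might be used: the layout below (form regex, version regex, column position, canonical name, type, size) is an assumption for illustration only, not the actual format, and the sample rows, column positions, and function names are all hypothetical.

```python
import re

US = "\x1f"  # ASCII 31, the unit separator

# Hypothetical mapping rows, one per line, fields separated by US:
# form regex, version regex, column position, canonical name, type, size
SAMPLE_MAPPING = "\n".join([
    US.join(["^F3[ANT]?$", r"^[12]\.\d+$", "7", "total_receipts_ytd", "decimal", "12,2"]),
    US.join(["^F3[ANT]?$", r"^[12]\.\d+$", "2", "committee_name", "string", "200"]),
])

def parse_mapping(text):
    """Parse US-delimited mapping rows into a list of rule dicts."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        form_re, version_re, col, name, ftype, size = line.split(US)
        rules.append({
            "form": re.compile(form_re),
            "version": re.compile(version_re),
            "column": int(col),
            "name": name,
            "type": ftype,
            "size": size,
        })
    return rules

def fields_for(rules, form_type, version):
    """Return {column position: canonical name} for one form type and version."""
    return {
        r["column"]: r["name"]
        for r in rules
        if r["form"].search(form_type) and r["version"].search(version)
    }

def migration_columns(rules):
    """Return one column definition per canonical name, in first-seen order."""
    seen = {}
    for r in rules:
        seen.setdefault(r["name"], (r["type"], r["size"]))
    return [f"{name} {ftype}({size})" for name, (ftype, size) in seen.items()]

rules = parse_mapping(SAMPLE_MAPPING)
print(fields_for(rules, "F3N", "2.02"))
print(migration_columns(rules))
```

Because the type and size live next to each canonical name, the same file can drive both the per-version field lookup and the generated migration, so the two can't drift apart.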

I'm expecting to be done in about a week and will make a pull request then. Right now it's not in a fully consistent state.

So @dwillis et al, please hold off on working on this part of the code for the moment.

(Also, I'll be publishing an .sql.gz dump of the full import to date.)
