Filtering implementation epic  #305

@Adoni5

Feature being added - Filtering!

Filter reads and alignments using a flexible "mini language" specified in the TOML.
Filters would go in the config TOML under the section they apply to, i.e. caller_settings or mapper_settings?
They would be applied in the tight targets loop.

# for base calling
filter = [
  "metadata.sequence_length > 0",
  "metadata.sequence_length < 1000",
]

# for alignment
filter = [
  "is_primary",
  "mapq > 40",
  "strand == -1",
]

This is parsed into magic Enums and Classes in _filter.py
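
As a rough illustration of what that parsing step could look like, here is a minimal sketch. The names `Op`, `Rule` and `parse_rule` are hypothetical and not taken from `_filter.py`; the sketch also assumes a bare attribute such as `is_primary` means "attribute is True" and that dotted names are resolved with `getattr`.

```python
import operator
import re
from dataclasses import dataclass
from enum import Enum
from types import SimpleNamespace
from typing import Any


class Op(Enum):
    """Comparison operators understood by this sketch (hypothetical)."""
    LT = "<"
    LE = "<="
    GT = ">"
    GE = ">="
    EQ = "=="
    NE = "!="


_OP_FUNCS = {
    Op.LT: operator.lt, Op.LE: operator.le,
    Op.GT: operator.gt, Op.GE: operator.ge,
    Op.EQ: operator.eq, Op.NE: operator.ne,
}

# "<attr path> <op> <value>", or a bare attribute such as "is_primary"
_RULE_RE = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*(?:(<=|>=|==|!=|<|>)\s*(.+?))?\s*$")


@dataclass(frozen=True)
class Rule:
    attr: str
    op: Op
    value: Any

    def __call__(self, obj: Any) -> bool:
        # Walk dotted paths like "metadata.sequence_length"
        target = obj
        for part in self.attr.split("."):
            target = getattr(target, part)
        return _OP_FUNCS[self.op](target, self.value)


def _coerce(text: str) -> Any:
    """Try int, then float, else fall back to an unquoted string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.strip("\"'")


def parse_rule(expr: str) -> Rule:
    match = _RULE_RE.match(expr)
    if match is None:
        raise ValueError(f"Cannot parse filter expression: {expr!r}")
    attr, symbol, value = match.groups()
    if symbol is None:
        # Assumption: a bare attribute means "attr == True"
        return Rule(attr, Op.EQ, True)
    return Rule(attr, Op(symbol), _coerce(value))


# Usage with a stand-in alignment object
aln = SimpleNamespace(is_primary=True, mapq=60, strand=-1)
rules = [parse_rule(r) for r in ("is_primary", "mapq > 40", "strand == -1")]
print(all(rule(aln) for rule in rules))  # True
```

Keeping each rule as a small callable object means a list of rules from the TOML can be applied with a plain `all(...)`, which is cheap enough for a tight loop.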

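from itertools import chain
from more_itertools import partition  # assumed: returns (items failing pred, items passing pred)

chunks = read_until_client.get_read_batch(...)
filtered_calls, calls = partition(basecall_filter, basecall(chunks))
filtered_aligns, aligns = partition(alignment_filter, align(calls))

# Alignments that passed both filters
for result in aligns:
    print("boo these alignments are trash")

# partition returns lazy iterators, so chain them rather than using `+`
for filtered_item in chain(filtered_calls, filtered_aligns):
    print("Woohoo we have great success in filtering")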

I suppose we would store these on the respective classes? _PluginModule or something, I forget.
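
To make the two-stage partition above concrete, here is a self-contained sketch using only the standard library. The `partition` helper mirrors the "failing first, passing second" ordering assumed above; the dict-based reads, `MAPQ_LOOKUP` and `align` are hypothetical stand-ins, not the readfish API.

```python
from itertools import chain, filterfalse, tee
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def partition(
    pred: Callable[[T], bool], iterable: Iterable[T]
) -> Tuple[Iterator[T], Iterator[T]]:
    """Lazily split into (items failing pred, items passing pred)."""
    failing, passing = tee(iterable)
    return filterfalse(pred, failing), filter(pred, passing)


def basecall_filter(call: dict) -> bool:
    return 0 < call["sequence_length"] < 1000


def alignment_filter(aln: dict) -> bool:
    return aln["is_primary"] and aln["mapq"] > 40


# Hypothetical mapping qualities, standing in for a real aligner
MAPQ_LOOKUP = {"r1": 60, "r2": 60, "r3": 10}


def align(calls: Iterable[dict]) -> Iterator[dict]:
    for call in calls:
        yield {**call, "is_primary": True, "mapq": MAPQ_LOOKUP[call["read_id"]]}


basecalls = [
    {"read_id": "r1", "sequence_length": 400},
    {"read_id": "r2", "sequence_length": 5000},  # fails the basecall filter
    {"read_id": "r3", "sequence_length": 800},   # fails the alignment filter
]

filtered_calls, calls = partition(basecall_filter, basecalls)
filtered_aligns, aligns = partition(alignment_filter, align(calls))

passed = [a["read_id"] for a in aligns]
failed = [f["read_id"] for f in chain(filtered_calls, filtered_aligns)]
print(passed, failed)  # ['r1'] ['r2', 'r3']
```

Because both halves are lazy and share a `tee` buffer, every input item ends up in exactly one of `passed` or `failed`, which is one way to check that nothing is dropped silently.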

Ideas

  • Extend language to startsWith/endsWith
  • and/or/not logical operators
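
One possible shape for these two ideas, sketched as plain predicate combinators. All function names here (`all_of`, `any_of`, `negate`, `starts_with`, `ends_with`) and the `ctg` attribute are hypothetical; how they would be spelled in the TOML mini-language is left open.

```python
from types import SimpleNamespace
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over predicates."""
    return lambda obj: all(p(obj) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda obj: any(p(obj) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda obj: not pred(obj)


def starts_with(attr: str, prefix: str) -> Predicate:
    return lambda obj: str(getattr(obj, attr)).startswith(prefix)


def ends_with(attr: str, suffix: str) -> Predicate:
    return lambda obj: str(getattr(obj, attr)).endswith(suffix)


# Usage: keep alignments on "chr*" contigs that are not alt contigs
keep = all_of(starts_with("ctg", "chr"), negate(ends_with("ctg", "_alt")))

print(keep(SimpleNamespace(ctg="chr1")))      # True
print(keep(SimpleNamespace(ctg="chr1_alt")))  # False
print(keep(SimpleNamespace(ctg="scaffold7"))) # False
```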

Issues that need resolving/clarification

  • VERY footgunny: for example, metadata.sequence_length < 0 or mapq < 0, and goodbye all reads. How can we safeguard against this? I suggest starting with PAF only, and maybe adding some checks in validation.
  • Where do we track filtering status? Options: add it directly to the Result object, or add it straight into the plugin basecall/map_reads methods (which would mean writing separate implementations for new plugins) and have the plugin return two Iterables, one of reads/Result instances that passed and one of those that failed.
  • What do we do with Results that fail validations, unblock or proceed?
    • Fails basecalling filtering
    • Fails alignment filtering
  • Do we add a fails_validation option to the TOML Conditions section, which defaults to proceed? This then relies on the exceeded max chunk behaviour.
  • How and where do we log this?
  • What will it look like/where will it be placed in the config?
  • What will the API between targets and plug-ins look like?
  • How will we ensure that targets doesn’t miss any data?
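
For the footgun question at the top of this list, one validation check that might help is rejecting filter sets that can never match. The sketch below assumes rules have already been parsed into `(attribute, operator, value)` tuples and that the fields are integer-valued. The `FIELD_BOUNDS` table and the `unsatisfiable_fields` name are hypothetical, and `!=` rules are ignored.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Assumed valid ranges for integer-valued fields (illustrative only)
FIELD_BOUNDS: Dict[str, Tuple[float, float]] = {
    "mapq": (0, 255),
    "metadata.sequence_length": (0, math.inf),
}

RuleT = Tuple[str, str, int]  # (attribute, operator, value)


def unsatisfiable_fields(rules: Iterable[RuleT]) -> List[str]:
    """Return attributes whose combined rules exclude every possible value."""
    ranges = dict(FIELD_BOUNDS)
    for attr, op, value in rules:
        lo, hi = ranges.get(attr, (-math.inf, math.inf))
        if op == "<":
            hi = min(hi, value - 1)  # integer fields: x < v means x <= v - 1
        elif op == "<=":
            hi = min(hi, value)
        elif op == ">":
            lo = max(lo, value + 1)
        elif op == ">=":
            lo = max(lo, value)
        elif op == "==":
            lo, hi = max(lo, value), min(hi, value)
        ranges[attr] = (lo, hi)
    # An empty interval means no read or alignment could ever pass
    return sorted(attr for attr, (lo, hi) in ranges.items() if lo > hi)


print(unsatisfiable_fields([("mapq", "<", 0)]))  # ['mapq']
print(unsatisfiable_fields([
    ("metadata.sequence_length", ">", 0),
    ("metadata.sequence_length", "<", 1000),
]))  # []
```

This only catches filters that are impossible given the assumed bounds. A filter that is merely far too strict (say `mapq > 250`) would still pass validation, so logging the pass/fail counts at runtime would probably still be needed.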
### Tasks
- [ ] #304 
- [ ] Describe mini-language
- [ ] Needs mad tests

Labels: enhancement (New feature or request), epic (Feature being added to readfish), needs discussion (A topic/feature that needs discussion from maintainers and users)