Add local module dshbio/fastatoparquet #14

Draft
heuermh wants to merge 8 commits into nf-core:dev from heuermh:add-fasta-to-parquet

Conversation

Contributor

heuermh commented Mar 25, 2025

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/proteinannotator branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

----------------------------------------------------------------------------------------
*/

nextflow.enable.moduleBinaries = true
Contributor Author

Not sure if this is the best config file for this?

label 'process_medium'

conda "${moduleDir}/environment.yml"
container 'community.wave.seqera.io/library/duckdb-cli:1.0.0--a85d12a2a9de17c9'
Contributor Author

heuermh Mar 26, 2025

This should be updated to a more recent version (1.1.3 is current in conda; the latest version is 1.2.1)


script:
def prefix = task.ext.prefix ?: "${meta.id}"
def sql = "INSTALL parquet; LOAD parquet; COPY (WITH p AS (SELECT * FROM read_parquet('${parquet}/*.parquet')), s AS (SELECT unnest(string_to_array(sequence, '')) AS aa FROM p), h AS (SELECT unnest(map_entries(histogram(aa))) AS kv FROM s), e AS (SELECT * from read_csv_auto('amino_acid_properties.tsv')) SELECT h.kv['key'] AS amino_acid, h.kv['value'] AS count, e.* FROM h JOIN e ON h.kv['key'] = e.one_letter_symbol) TO '${prefix}.histogram.tsv' (HEADER, DELIMITER '\t')"
Contributor Author

I would really like it if this SQL query could be stored external to main.nf, ideally in a SQL file with templated variable substitution
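
One way this could look, as a minimal sketch rather than the module's actual code: keep the query in a separate .sql file with placeholders, stage it as a `path` input, and fill the placeholders with `sed` before handing the result to the DuckDB CLI on standard input. The process name, the `query_template` input, and the `@PARQUET@` / `@PREFIX@` placeholder syntax are assumptions made for illustration; none of them come from this PR.

```nextflow
// Hypothetical sketch: process name, placeholders, and the extra input are illustrative.
process HISTOGRAM_FROM_TEMPLATE {
    input:
    tuple val(meta), path(parquet)
    path query_template   // a .sql file containing @PARQUET@ and @PREFIX@ placeholders

    output:
    tuple val(meta), path("*.histogram.tsv"), emit: histogram

    script:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    # Fill in the placeholders, then feed the rendered SQL to the DuckDB CLI on stdin.
    sed -e 's|@PARQUET@|${parquet}|g' -e 's|@PREFIX@|${prefix}|g' ${query_template} > query.sql
    duckdb < query.sql
    """
}
```

With this layout the template would contain, for example, `read_parquet('@PARQUET@/*.parquet')` and `TO '@PREFIX@.histogram.tsv'`, so the SQL can be edited and linted on its own. The `|` delimiter in `sed` assumes that neither value contains a `|` character.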

def prefix = task.ext.prefix ?: "${meta.id}"
def sql = "INSTALL parquet; LOAD parquet; COPY (WITH p AS (SELECT * FROM read_parquet('${parquet}/*.parquet')), s AS (SELECT unnest(string_to_array(sequence, '')) AS aa FROM p), h AS (SELECT unnest(map_entries(histogram(aa))) AS kv FROM s), e AS (SELECT * from read_csv_auto('amino_acid_properties.tsv')) SELECT h.kv['key'] AS amino_acid, h.kv['value'] AS count, e.* FROM h JOIN e ON h.kv['key'] = e.one_letter_symbol) TO '${prefix}.histogram.tsv' (HEADER, DELIMITER '\t')"
"""
create_amino_acid_properties.sh
Contributor Author

This is really dumb, but I can't find a way to stage a static file (i.e. amino_acid_properties.tsv) within a module

Contributor

Why not stage it as an input?

Contributor Author

Maybe? What would that look like?
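
One possible answer, sketched under assumptions rather than taken from the PR: declare the properties table as a second `path` input so that Nextflow stages it into the task directory like any other input file. The process name and the placeholder command are hypothetical; only the file name `amino_acid_properties.tsv` comes from this thread.

```nextflow
// Hypothetical sketch: the process name and the placeholder command are illustrative.
process AMINO_ACID_HISTOGRAM {
    input:
    tuple val(meta), path(parquet)
    path amino_acid_properties   // static table, staged into the task work directory

    output:
    tuple val(meta), path("*.histogram.tsv"), emit: histogram

    script:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    # Placeholder command: the real module would run its DuckDB query here and
    # read the staged table by name instead of generating it with a shell script.
    head -n 5 ${amino_acid_properties} > ${prefix}.histogram.tsv
    """
}
```

The caller would then pass the file as a second argument, for example `AMINO_ACID_HISTOGRAM(ch_parquet, file("${projectDir}/assets/amino_acid_properties.tsv", checkIfExists: true))`, where `ch_parquet` and the asset location are likewise assumed.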

#
# See https://en.wikipedia.org/wiki/Amino_acid

cat <<END_PROPERTIES > amino_acid_properties.tsv
Contributor Author

🤢

heuermh force-pushed the add-fasta-to-parquet branch from 206d78a to 7e19196 on March 28, 2025 00:48
Contributor Author

heuermh commented Mar 28, 2025

Rebased and force-pushed

def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def amino_acid_properties = file("${moduleDir}/assets/amino_acid_properties.tsv")
def query_template = file("${moduleDir}/assets/query_template.sql")
Contributor Author

These do not stage when running with -profile docker
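
A likely explanation, offered as an assumption rather than a confirmed diagnosis: files resolved with `file()` inside the `script:` block are never declared as task inputs, so Nextflow has no reason to stage them into the work directory that is mounted into the container. A sketch of resolving the two assets in the calling workflow and passing them in as inputs instead; the include path, the process name, `params.parquet`, and the two extra `path` inputs on the process are all hypothetical.

```nextflow
// Hypothetical include path and process name. The process is assumed to declare
// two extra inputs: `path amino_acid_properties` and `path query_template`.
include { DSHBIO_AMINOACIDHISTOGRAM } from './modules/local/dshbio/aminoacidhistogram/main'

workflow {
    // Resolve the static assets on the workflow side so they become task inputs.
    def assets = "${projectDir}/modules/local/dshbio/aminoacidhistogram/assets"
    amino_acid_properties = file("${assets}/amino_acid_properties.tsv", checkIfExists: true)
    query_template        = file("${assets}/query_template.sql", checkIfExists: true)

    // Assumed input: a directory of Parquet files given on the command line.
    ch_parquet = Channel.of([[id: 'test'], file(params.parquet)])

    DSHBIO_AMINOACIDHISTOGRAM(ch_parquet, amino_acid_properties, query_template)
}
```

Because the files arrive as declared inputs, they are linked into each task directory and should be visible with `-profile docker` as well as with local execution.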

Collaborator

olgabot commented Jun 24, 2025

Hello, checking in here. I'm grateful for your work! This PR hasn't had any changes since March 27 (~3 months ago). Is this still in progress, or do you mind if someone else takes over? Thank you so much!

3 participants