Conversation
|
The following was used to test version d6706c6 |
|
I keep getting errors that it can't find the new-nf Docker image even though I have built it locally on deep thought, and I even tried pushing it to Docker Hub. Do you have any insight into why that is happening? |
|
To use the local version you now have to set the registry to nothing, e.g. docker.registry = "" in the Nextflow config. But either works right now on the server |
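A minimal sketch of what setting the registry to nothing can look like, based on the docker.registry = "" line the submit script later in this thread writes into its nextflow.config (the config file name and the trimmed run command here are just placeholders):

# write a config that clears the Docker registry prefix so Nextflow resolves
# jasonkwan/autometa:new-nf from the locally built image instead of a remote registry
cat > local.config <<'EOF'
docker.registry = ""
EOF

# pass the config to the run with -c
nextflow run /home/jkwan/Autometa/main.nf -profile docker -c local.config --autometa_image_tag 'new-nf'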
|
I was running on the server before, so I'm not sure why I'm now getting this error:
ERROR ~ Error executing process > 'AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS (Preparing db cache for gtdb)'
Caused by:
Process `AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS (Preparing db cache for gtdb)` terminated with an error exit status (1)
Command executed:
# https://autometa.readthedocs.io/en/latest/scripts/taxonomy/lca.html
autometa-taxonomy-lca \
--blast . \
--lca-output . \
--dbdir . \
--dbtype gtdb \
--cache cache \
--only-prepare-cache
cat <<-END_VERSIONS > versions.yml
"AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS":
autometa: $(autometa --version | sed -e 's/autometa: //g')
END_VERSIONS
Command exit status:
1
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
Traceback (most recent call last):
File "/opt/conda/bin/autometa-taxonomy-lca", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/lca.py", line 698, in main
taxonomy_db = GTDB(args.dbdir)
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/gtdb.py", line 67, in __init__
self.names = self.parse_names()
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/gtdb.py", line 180, in parse_names
fh = open(self.names_fpath)
^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: './names.dmp'
Work dir:
/media/BRIANDATA3/temp/a9/624e869a6c5610e906b5e1e66413e0
Container:
jasonkwan/autometa:new-nf
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
I tried to make the /media/BRIANDATA3/autometa_test directory readable/writable by all users, but I still got the same error messages. It does appear to carry on running despite this, though. |
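A sketch of the kind of permission check and fix being described here ("readable/writable by all users"). The path is the --single_db_dir used in this thread; the assumption is that ./names.dmp in the work dir is staged from that directory, so the mode bits there are what matter, and the exact mode is just an example:

# check who owns the taxonomy dump files that get staged into the work dir
ls -l /media/BRIANDATA3/autometa_test/names.dmp

# recursively open up the database directory (a+rwX keeps directories traversable)
chmod -R a+rwX /media/BRIANDATA3/autometa_test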
|
Update: it ended after about an hour, so this is preventing it from running. I tried running it without pointing to the existing database files, and I got this:
ERROR ~ No such variable: out_ch
-- Check script 'Autometa/./workflows/../subworkflows/local/./././prepare_nr.nf' at line: 133 or see '.nextflow.log' file for more details |
|
That drive has odd group permissions and those files were all assigned to the "storage" group. I chowned the directory just now to be chase:chase, but if that doesn't work I would just try another drive |
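The chown described above would look roughly like this (user:group and path as stated in the thread; whether sudo is needed depends on who currently owns the files):

sudo chown -R chase:chase /media/BRIANDATA3/autometa_test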
|
i.e. it seems to be a system-level file permission issue, not a workflow issue |
|
OK, I think I fixed the permissions issue, but I hadn't noticed that above that error there was a message saying neither nr.dmnd nor nr.gz were found and `--large_downloads_permission` is set to false. Not totally sure why it is not using the stuff that is already there, but I would like to just try allowing it to download new databases, so I tried adding `--large_downloads_permission`. |
|
Can you provide the full commands you are using? |
|
This is my current submit script:
#!/bin/bash
#SBATCH --partition=queue
#SBATCH -N 1 # Nodes
#SBATCH -n 1 # Tasks
#SBATCH --cpus-per-task=1
#SBATCH --error=autometa_test.%J.err
#SBATCH --output=autometa_test.%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jason.kwan@wisc.edu
# Initialize conda/mamba for bash shell
source ~/.bashrc # or your shell rc file
source ~/miniconda3/etc/profile.d/conda.sh
source ~/miniconda3/etc/profile.d/mamba.sh
mamba activate autometa-nf
example_dir="/media/bigdrive1/autometa_test"
sample_sheet="$example_dir/autometa_test_samplesheet.csv"
mkdir -p $example_dir $example_dir/database_directory $example_dir/output
echo "sample,assembly,fastq_1,fastq_2,coverage_tab,cov_from_assembly" > $sample_sheet
echo "78mbp,/media/bigdrive1/autometa_test_data/78Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/78Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/78Mbp/reverse_
reads.fastq.gz,,0" >> $sample_sheet
echo "625Mbp,/media/bigdrive1/autometa_test_data/625Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/625Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/625Mbp/reve
rse_reads.fastq.gz,,0" >> $sample_sheet
# edit the resources for the workflow to use
echo '''
process {
withLabel:process_low {
cpus = { 1 * task.attempt }
memory = { 14.GB * task.attempt }
time = { 24.h * task.attempt }
}
withLabel:process_medium {
cpus = { 12 * task.attempt }
memory = { 42.GB * task.attempt }
time = { 24.h * task.attempt }
}
withLabel:process_high {
cpus = { 36 * task.attempt }
memory = { 200.GB * task.attempt }
time = { 48.h * task.attempt }
}
}
docker.registry = ""
''' > $example_dir/nextflow.config
# run the full workflow + GTDB refinement
nextflow run /home/jkwan/Autometa/main.nf \
-profile docker \
--input $sample_sheet \
--taxonomy_aware \
--outdir ${example_dir}/output \
--single_db_dir /media/BRIANDATA3/autometa_test \
#--single_db_dir ${example_dir}
--autometa_image_tag 'new-nf' \
--use_gtdb \
--gtdb_version '220' \
--large_downloads_permission \
--max_memory '900.GB' \
--max_cpus 90 \
--max_time '20040.h' \
-c $example_dir/nextflow.config \
-w /media/BRIANDATA3/temp \
--large_downloads_permission \
-resume
# run the full workflow without GTDB refinement
nextflow run /home/jkwan/Autometa/main.nf \
-profile docker,slurm \
--input $sample_sheet \
--taxonomy_aware \
--outdir ${example_dir}/output_ncbi_only \
#--single_db_dir ${example_dir}
--single_db_dir /media/BRIANDATA3/autometa_test \
--autometa_image_tag 'new-nf' \
--large_downloads_permission \
--max_memory '900.GB' \
--max_cpus 90 \
--max_time '20040.h' \
-c $example_dir/nextflow.config \
-w /media/BRIANDATA3/temp \
-resume |
|
Internet here is being worked on so I can't test it, but my assumption would be that you added and then commented out the `--single_db_dir` line in the middle of the command, which breaks the backslash line continuation |
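A sketch of why a commented-out flag in the middle of a backslash-continued command breaks it: the # line ends the command early, so everything after it never reaches nextflow, and the remaining flags are then run as a separate command that fails. The excerpt below is abbreviated from the submit script above:

# broken: the comment swallows the line continuation, so --autometa_image_tag,
# --use_gtdb, -resume, etc. are never passed to nextflow
nextflow run /home/jkwan/Autometa/main.nf \
  --single_db_dir /media/BRIANDATA3/autometa_test \
  #--single_db_dir ${example_dir}
  --autometa_image_tag 'new-nf' \
  -resume

# fixed: keep the alternative value outside the command (or delete it)
# alternative: --single_db_dir ${example_dir}
nextflow run /home/jkwan/Autometa/main.nf \
  --single_db_dir /media/BRIANDATA3/autometa_test \
  --autometa_image_tag 'new-nf' \
  -resume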
|
OK, I think that might have been it. I couldn't get it to use the existing databases, so it is currently downloading them. |
|
It did get further along the pipeline, but I am now getting another error in the output:
executor > local (22)
[71/3459b9] AUT…meta_test_samplesheet.csv) | 1 of 1 ✔
[e5/2a5c1b] AUT…gs < 3000 bp, from 625Mbp) | 2 of 2 ✔
[2b/704af7] AUT…(Aligning reads to 625Mbp) | 2 of 2 ✔
[d3/f38305] AUT…OLS_VIEW_AND_SORT (625Mbp) | 2 of 2 ✔
[4d/cec06a] AUT…EDTOOLS_GENOMECOV (625Mbp) | 2 of 2 ✔
[7a/06f6a4] AUT…OVERAGE:PARSE_BED (625Mbp) | 2 of 2 ✔
[- ] AUT…ERAGE:SPADES_KMER_COVERAGE -
[75/75a41d] AUTOMETA:PRODIGAL (625Mbp) | 2 of 2 ✔
[16/0a01fe] AUT…in 625Mbp against nr.dmnd) | 1 of 2
[75/a55f03] AUT…eparing db cache for ncbi) | 1 of 1, cached: 1 ✔
[4c/489840] AUT…inding ncbi LCA for 78mbp) | 1 of 1
[7a/c86131] AUT…on majority vote on 78mbp) | 1 of 1
[09/0e1081] AUT…s into kingdoms for 78mbp) | 1 of 1
[skipped ] AUT…GTDB database version 220) | 1 of 1, stored: 1 ✔
[skipped ] AUT…reparing Diamond database) | 1 of 1, stored: 1 ✔
[- ] AUT…DB_REFINEMENT:EXTRACT_ORFS -
[- ] AUT…TAXON_SPLIT:DIAMOND_BLASTP -
[c1/3e860a] AUT…eparing db cache for gtdb) | 1 of 1, cached: 1 ✔
[- ] AUT…ENT:TAXON_SPLIT:LCA:REDUCE -
[- ] AUT…:TAXON_SPLIT:MAJORITY_VOTE -
[9a/c25c71] AUT…rchaea markers for 625Mbp) | 4 of 4 ✔
Plus 7 more processes waiting for tasks…
ERROR ~ Negative array index [-2] too large for array size 1
-- Check script 'Autometa/./workflows/../subworkflows/local/././taxon_split.nf' at line: 73 or see '.nextflow.log' file for more details
I did look in |
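A sketch of the usual places to look for more detail on a failure like this, following the tip printed in the earlier error. The paths are the ones from this thread (the work dir shown is the one from the earlier PREP_DBS failure), and the grep pattern is just an example:

# in the directory the pipeline was launched from: the workflow-level log
grep -A5 'Negative array index' .nextflow.log

# for a failing task, change into its work dir under -w /media/BRIANDATA3/temp
# (the hash prefix is shown in brackets in the status output above)
cd /media/BRIANDATA3/temp/a9/624e869a6c5610e906b5e1e66413e0
cat .command.sh    # the exact command Nextflow ran
cat .command.err   # the task's stderr
cat .command.out   # the task's stdout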
|
Can take a look when back in the US next week. Can you post the log or email it to my wisc email? |
|
Downloading files and running on a completely new ubuntu instance |
|
Can you try now? If only one type (e.g. bacteria and not archaea) was found then it was splitting the filepath because it wasn't getting a list. The previous commit should fix that. Might have been a difference in databases that caused it |
|
Thanks! It appears to be running now. I will keep an eye on it. |
Old and new work on updating the nf-core standardization, along with incorporating the GTDB code that hadn't yet been added to the Nextflow workflow