Follow the steps below to install and configure HYMET.
The easiest way to install HYMET is through Bioconda:
```bash
conda install -c bioconda hymet
```

After installation, you will need to download the reference databases as described in the Reference Sketched Databases section.
Alternatively, you can clone the repository to your local environment:
```bash
git clone https://github.com/inesbmartins02/HYMET.git
cd HYMET
```

If you prefer using Docker, follow these steps:
1. **Build the Docker image:**

   ```bash
   docker build -t hymet .
   ```

2. **Run the container:**

   ```bash
   docker run -it hymet
   ```

3. **Inside the container:** the environment is already set up with all dependencies installed, so run the tool as needed.
If you cloned the repository, you can create a Conda environment from the included file:

1. **Create the Conda environment:**

   ```bash
   conda env create -f environment.yml
   ```

2. **Activate the environment:**

   ```bash
   conda activate hymet_env
   ```
The tool expects input files in FASTA format (.fna or .fasta). Each file should contain metagenomic sequences with headers in the following format:
```
>sequence_id additional_info
SEQUENCE_DATA
```
- `sequence_id`: A unique identifier for the sequence.
- `additional_info`: Optional metadata (e.g., source organism, length).
- `SEQUENCE_DATA`: The nucleotide sequence.
Place your input files in the directory specified by the `$input_dir` variable in the `main.pl` script.
For example, if your input directory contains the following files:
```
input_dir/
├── sample1.fna
├── sample2.fna
└── sample3.fna
```
Each file (sample1.fna, sample2.fna, etc.) should follow the FASTA format described above.
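To try the expected layout end to end, the snippet below generates a toy input file and confirms its headers parse. The sequence content is invented for illustration and is not real data:

```shell
# Create a toy FASTA input (hypothetical sequences, for illustration only).
mkdir -p input_dir
cat > input_dir/sample1.fna <<'EOF'
>seq1 synthetic test record, 24 bp
ATGCGTACGTTAGCATCGATCGAT
>seq2 synthetic test record, 24 bp
GGGCCCAAATTTGGGCCCAAATTT
EOF

# Every record should start with a '>' header line; count them.
grep -c '^>' input_dir/sample1.fna   # prints 2
```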
Ensure all scripts have execution permissions:
```bash
chmod +x config.pl
chmod +x main.pl
chmod +x scripts/*.sh
chmod +x scripts/*.py
```

After installation and configuration, first run the configuration script to download and prepare the taxonomy files and define the main paths:
```bash
./config.pl
```

Then run the main tool to perform taxonomic identification:
```bash
./main.pl
```

If installed via Conda, you can use:
```bash
hymet-config
hymet
```

The databases required to run the tool are available for download on Google Drive:
1. **Download the files:** click on the links above to download the `.gz` files.

2. **Place the files in the `data/` directory:** move the downloaded files to the `data/` directory of the project.

3. **Unzip the files:**

   ```bash
   gunzip data/sketch1.msh.gz
   gunzip data/sketch2.msh.gz
   gunzip data/sketch3.msh.gz
   ```

   This extracts `sketch1.msh`, `sketch2.msh`, and `sketch3.msh` into the `data/` directory.

4. **Verify the files:** after unzipping, ensure the files are in the correct location:

   ```bash
   ls data/
   ```

   You should see `sketch1.msh`, `sketch2.msh`, and `sketch3.msh`.
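As a small sanity check (paths as in the steps above), the loop below reports any sketch that is still missing from `data/`:

```shell
# Report any missing reference sketch in data/.
for f in data/sketch1.msh data/sketch2.msh data/sketch3.msh; do
  if [ -f "$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f" >&2
  fi
done
```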
- `config.pl`: Configuration script that downloads and prepares taxonomy files.
- `main.pl`: Main script that runs the taxonomic identification pipeline.
- `scripts/`: Directory containing helper scripts in Perl, Python, and Bash.
  - `mash.sh`: Script to run Mash.
  - `downloadDB.py`: Script to download genomes.
  - `minimap.sh`: Script to run Minimap2.
  - `classification.py`: Script for taxonomic classification.
- `taxonomy_files/`: Directory containing downloaded taxonomy files.
- `data/`: Directory for storing intermediate data.
  - `sketch1.msh`
  - `sketch2.msh`
  - `sketch3.msh`
  - `taxonomy_hierarchy.tsv`
- `output/`: Directory where final results are saved.
The tool generates a `classified_sequences.tsv` file in the `output/` directory with the following columns:
- Query: Identifier of the queried sequence.
- Lineage: Identified taxonomic lineage.
- Taxonomic Level: Taxonomic level (e.g., species, genus).
- Confidence: Classification confidence (0 to 1).
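Since the output is a plain TSV, it can be filtered with standard tools. The rows below are invented for illustration (real lineages and scores will differ); the `awk` call keeps classifications at or above a confidence threshold, assuming the four tab-separated columns listed above:

```shell
# Illustrative classified_sequences.tsv (values are made up, not real output).
printf 'Query\tLineage\tTaxonomic Level\tConfidence\n' > classified_sequences.tsv
printf 'seq1\tBacteria;Pseudomonadota;Gammaproteobacteria\tgenus\t0.97\n' >> classified_sequences.tsv
printf 'seq2\tBacteria;Bacillota\tphylum\t0.41\n' >> classified_sequences.tsv

# Keep rows with confidence >= 0.8 (column 4), skipping the header line.
awk -F'\t' 'NR > 1 && $4 >= 0.8' classified_sequences.tsv
```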
- This folder includes scripts to install and prepare all necessary data to replicate the work using our dataset.
- Prerequisites:
  - Before running the scripts in this folder, users need to download the assembly files (`assembly_files.txt`) for each domain from the NCBI FTP site.
- Scripts:
  - `create_database.py`: Downloads 10% of the content from each downloaded assembly file and organizes the datasets by domain.
  - `extractNC.py`: Maps the content of each Genome Collection File (GCF) to its respective sequence identifiers. It generates a CSV file containing this mapping, with one column for the GCF and another for the sequence identifiers (such as NC, NZ, etc.) present in each GCF.
  - `extractTaxonomy.py`: Creates a CSV file containing each GCF and its respective taxonomy, among other information.
  - Additional scripts modify the data format and organization, including:
    - Implementing mutations
    - Converting formats (e.g., FASTA to FASTQ)
    - Formatting into paired-end reads
  - `GCFtocombinedfasta.py`: Combines all GCFs from each domain into a single FASTA file, separating sequences by identifier. Its output is used as input for most of the tools.
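As a rough sketch of the combination step (hypothetical file layout; the actual `GCFtocombinedfasta.py` additionally separates sequences by identifier), the per-domain concatenation looks like:

```shell
# Hypothetical layout: one directory of GCF FASTA files per domain.
mkdir -p genomes/bacteria
printf '>NC_000001.1 toy record\nATGCATGC\n' > genomes/bacteria/GCF_000001.fna
printf '>NZ_000002.1 toy record\nGGCCGGCC\n' > genomes/bacteria/GCF_000002.fna

# Combine every GCF file of the domain into one FASTA.
cat genomes/bacteria/GCF_*.fna > bacteria_combined.fasta
grep -c '^>' bacteria_combined.fasta   # prints 2
```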
For questions or issues, please open an issue in the repository.