Default mode in Locusts
The Default mode implements a safe and comparatively lightweight way of launching and controlling jobs via Locusts. In short, you will collect in a dedicated directory all the input files you want to use, naming them according to locusts's parsing criteria. Then, you will write a short bash command line that corresponds to the task you want to launch a large number of times (possibly with different input files). After creating a separate and safe filesystem, locusts will sort the input files into as many directories as there are tasks to run, and will assign the tasks to the nodes (cores) of the specified runtime machine (cluster). Finally, it will collect the output files you had previously declared and delete the temporary filesystem it created.
The swarm.launch function always needs these three compulsory arguments:
- code: job code that will be used to label jobs/files
- parf: path of the locusts parameter file
- cmd: bash command template
In addition, when Default mode is chosen, four other arguments are needed:
- indir: input directory. This directory must contain all the input files and folders the user's script will need for running.
- outdir: output directory. This is the directory where the results will be stored.
- spcins: specific input file names (template file names).
- outs: expected output file names (template file names).
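Put together, a minimal Default-mode call has the following shape. This is only a sketch with placeholder paths, file names, and command; the complete worked example further down is the authoritative reference.

```python
import locusts.swarm

# Minimal sketch of a Default-mode launch; all paths and names are placeholders.
locusts.swarm.launch(
    code="MyJob",                                 # job code used to label jobs/files
    parf="path/to/locusts.par",                   # locusts parameter file
    cmd="cat input_<id>.txt > output_<id>.txt",   # bash command template
    indir="path/to/inputs/",                      # directory containing all input files
    outdir="path/to/outputs/",                    # directory where results will be stored
    spcins=["input_<id>.txt"],                    # specific (per-task) input templates
    outs=["output_<id>.txt"]                      # expected output templates
)
```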
1. Give consistent names to your input files. Locusts needs instructions for each of the individual runs you are planning. For this reason, files having the same role (e.g. the same kind of input/output) must be composed of a fixed part and a unique identifier (e.g. input_file_15.txt). In the bash command (point 3), we will refer to a template filename: an input filename where the unique identifier has been replaced with the string '<id>' (see the sketch after this list). Consequently, if two different files must be used concurrently in the same run, they must carry different fixed parts and the same unique identifier (e.g. 'structure_18.pdb' and 'sequence_18.fa').
2. Collect all the necessary input files in a directory. This is the directory locusts will look into for finding the required input files.
3. Write a bash command that runs the executable(s) and store it in a variable. This is the command that will be launched by each worker created by locusts, and it must contain the template input and output file names. Remarks:
   - The bash command must mention all declared input and output template filenames. Template filenames are input filenames where the unique ID part has been replaced by the string '<id>'.
   - Input files can include scripts that must be run. This way, you can keep your bash command very simple (just a call to another script that takes the input and output filenames as arguments) and store the actual script in a separate file.
4. Create a list with all the template names of the input files, and another list with all the template names of the output files.
5. [optional] Create a list of shared input files, passed through the shdins argument. If all the tasks you want to run share one or more of the same input files, you can specify them in a separate list, called the shared list. The strings in this list must follow the format "alias:filename", where alias is some unique keyword you will refer to, and filename is the actual name of the file as it appears in the input folder. Each time you refer to the alias of a shared file in the bash command, you must prepend it with the string "<shared>" (see the example below).
You can find this example as part of the distributed locusts code and its PyPI package.
Suppose I have this short bash script that I want to run on a huge number of different files:
```bash
sleep 1
ls -lrth inputfile.txt secondinputfile.txt sharedfile.txt > ls_output.txt
cat inputfile.txt secondinputfile.txt sharedfile.txt > cat_output.txt
```
Whereas I want inputfile.txt and secondinputfile.txt to vary from one run to the next, I want the file sharedfile.txt to always be the same.
I would rename the input files I'd like to use as:
- inputfile_1.txt, inputfile_2.txt, etc.
- secondinputfile_1.txt, secondinputfile_2.txt, etc.
This is the configuration that I'd write:
```python
import locusts.swarm

my_input_dir = "tests/test_manager/my_input_dir/"    # The path of the directory containing the inputs
my_output_dir = "tests/test_manager/my_output_dir/"  # The path of the directory that will store the outputs
batch_job_code = "MngTest"                           # A unique identifier of your choice
specific_inputs = ['inputfile_<id>.txt', 'secondinputfile_<id>.txt']
shared_inputs = ['sf1:sharedfile.txt']
outputs = ['ls_output_<id>.txt', 'cat_output_<id>.txt']
parameter_file = "tests/test_manager/test_manager.par"
command_template = ('sleep 1;'
                    'ls -lrth {0} {1} {2} > {3};'
                    'cat {0} {1} {2} > {4}'
                    ).format('inputfile_<id>.txt',
                             'secondinputfile_<id>.txt',
                             '<shared>sf1',
                             'ls_output_<id>.txt',
                             'cat_output_<id>.txt')

locusts.swarm.launch(
    indir=my_input_dir,
    outdir=my_output_dir,
    code=batch_job_code,
    spcins=specific_inputs,
    shdins=shared_inputs,
    outs=outputs,
    cmd=command_template,
    parf=parameter_file
)
```

I want these tasks to run on the cluster mycluster, to which I have access (please read the Remote location page for details). The locusts parameter file would be:
```
### Generic
run_on_hpc              True       # True -> HPC; False -> local multithreading
### HPC
host_name               mycluster  # HPC name for SSH connection (see README file)
requested_nodes         2          # Number of nodes to be requested on the HPC
cpus_per_node           10         # CPU cores per node (not counting hyperthreading). If variable, choose the minimum value
partition               multinode  # Partition on the HPC
hpc_exec_dir            /data/me/locusts_try  # HPC path where the jobs will be executed and the temp files stored
local_shared_dir                   # Do the HPC service and the local machine share a directory? If so, you can specify a shared folder to use
data_transfer_protocol  tests/test_manager/data_transfer_protocol.sh
```
NB: In this case, I converted my script directly into the bash command. I could have done it differently: write a bash script that takes the input and output files as arguments, place it among the input files as a shared file, and then write a command that only launches that script.
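As a sketch of that alternative: here, the file name run_task.sh and its alias 'runner' are hypothetical, but the alias mechanism is the one described in point 5 above, and the command still mentions all declared input and output template filenames as required.

```python
# Hypothetical variant: the logic of the bash script lives in run_task.sh,
# which sits in my_input_dir and is shipped as a shared input file.
specific_inputs = ['inputfile_<id>.txt', 'secondinputfile_<id>.txt']
shared_inputs = ['sf1:sharedfile.txt', 'runner:run_task.sh']  # 'runner' is an assumed alias
outputs = ['ls_output_<id>.txt', 'cat_output_<id>.txt']

# The command now only invokes the shared script, passing the per-task
# filenames as arguments; run_task.sh would contain the ls/cat logic above.
command_template = ('bash <shared>runner {0} {1} {2} {3} {4}'
                    ).format('inputfile_<id>.txt',
                             'secondinputfile_<id>.txt',
                             '<shared>sf1',
                             'ls_output_<id>.txt',
                             'cat_output_<id>.txt')
```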