Default mode in Locusts

The Default mode implements a safe and comparatively lightweight way of launching and controlling jobs via Locusts. In short, you collect all the input files you want to use in a dedicated directory, naming them according to locusts' parsing criteria. Then, you write a short bash command line corresponding to the task you want to launch a large number of times (possibly with different input files). After creating a separate and safe filesystem, locusts sorts the input files into as many directories as there are tasks to run, and assigns the tasks to the nodes (cores) of the specified runtime machine (cluster). Finally, it collects the output files you declared and deletes the temporary filesystem it created.

Compulsory arguments

The swarm.launch function always needs these three compulsory arguments:

  • code : job code used to label jobs and files
  • parf : path of the locusts parameter file
  • cmd : bash command template

In addition, when the Default mode is chosen, four other arguments are needed (a skeleton call combining all seven appears after this list):

  • indir : input directory. This directory must contain all the input files and folders the user's script needs in order to run.
  • outdir : output directory. This is the directory where the results will be stored.
  • spcins : specific input filenames, given as template filenames (see below)
  • outs : expected output filenames, also given as template filenames
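
Putting the two lists together, a Default-mode call has the following shape. This is only a skeleton keyed to the descriptions above; the worked example at the bottom of this page fills in real values:

import locusts.swarm

locusts.swarm.launch(
    code=...,      # job code used to label jobs and files
    parf=...,      # path of the locusts parameter file
    cmd=...,       # bash command template
    indir=...,     # directory containing all input files
    outdir=...,    # directory where the results will be stored
    spcins=[...],  # template names of the specific input files
    outs=[...]     # template names of the expected output files
)                  # optionally, shdins=[...] for shared input files (see below)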

Setting up the locusts job

Collect/rename input files

  1. Give consistent names to your input files. Locusts needs to get instructions for each of the individual runs you are planning. For this reason, the names of files playing the same role (e.g. the same kind of input or output) must be composed of a fixed part and a unique identifier (e.g. input_file_15.txt). In the bash command (see "Define mandatory locusts variables" below), we will refer to a template filename as an input filename where the unique identifier has been replaced with the string '<id>'. Consequently, if two different files must be used in the same run, they must carry different fixed parts and the same unique identifier (e.g. 'structure_18.pdb' and 'sequence_18.fa').
  2. Collect all the necessary input files in a single directory. This is the directory locusts will look into to find the required input files (one possible way to do the renaming is sketched right after this list).
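
As an illustration, here is a minimal sketch (plain Python, entirely outside locusts) of one way to impose this convention on a set of raw files. The raw naming scheme run015_structure.pdb is purely hypothetical; adapt the parsing to your own filenames:

import os
import shutil

raw_dir = "raw_inputs"       # hypothetical directory with the original files
input_dir = "my_input_dir"   # the directory locusts will read from
os.makedirs(input_dir, exist_ok=True)

for fname in os.listdir(raw_dir):
    # e.g. "run015_structure.pdb" -> fixed part "structure", identifier "15"
    run_tag, rest = fname.split("_", 1)
    fixed_part, ext = os.path.splitext(rest)
    identifier = run_tag[3:].lstrip("0")      # "run015" -> "15"
    # Same role -> same fixed part; same run -> same unique identifier
    shutil.copy(os.path.join(raw_dir, fname),
                os.path.join(input_dir, f"{fixed_part}_{identifier}{ext}"))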

Define mandatory locusts variables

  1. Write a bash command that runs the executable(s) and store it in a variable. This is the command that will be launched by each worker created by locusts, and it must contain the template input and output filenames. Remarks:

    • the bash command must mention all declared input and output template filenames. Template filenames are input filenames where the unique ID part has been replaced by the string '<id>'.
    • input files can include scripts that must be run. This way, you can keep your bash command very simple (just a call to another script that takes the input and output filenames as arguments) and store the actual script in a separate file.
  2. Create a list of all the template names of the input files, and another list with all the template names of the output files (see the short sketch below).
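
For instance, for a hypothetical program myprog that reads one input file and writes one output file per run, the three variables could look like this (all names are illustrative):

command_template = 'myprog input_<id>.txt > result_<id>.txt'
specific_inputs = ['input_<id>.txt']   # every input template used in the command
outputs = ['result_<id>.txt']          # every output template used in the command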

Shared input files

  1. [optional] Create a list of shared input files. If all the tasks you want to run share one or more input files, you can specify them in a separate list, called the shared list. The strings in this list must follow the format "alias:filename", where alias is a unique keyword of your choice and filename is the actual name of the file as it appears in the input folder. Each time you refer to the alias of a shared file in the bash command, you must prepend it with the string "<shared>" (see the example below).

Example

You can find this example as part of the distributed locusts code and PyPI package.

Suppose I have this short bash script that I want to run on a huge number of different files:

sleep 1
ls -lrth inputfile.txt secondinputfile.txt sharedfile.txt > ls_output.txt
cat inputfile.txt secondinputfile.txt sharedfile.txt  > cat_output.txt

While I want inputfile.txt and secondinputfile.txt to vary from one run to the next, sharedfile.txt should always be the same file.

I would rename the input files I'd like to use as:

  • inputfile_1.txt, inputfile_2.txt, etc.
  • secondinputfile_1.txt, secondinputfile_2.txt, etc.

This is the configuration that I'd write:

import locusts.swarm

my_input_dir = "tests/test_manager/my_input_dir/"  # The path of the directory containing the inputs
my_output_dir = "tests/test_manager/my_output_dir/"  # The path of the directory that will store the outputs
batch_job_code = "MngTest"  # A unique identifier of your choice
specific_inputs = ['inputfile_<id>.txt', 'secondinputfile_<id>.txt']
shared_inputs = ['sf1:sharedfile.txt']
outputs = ['ls_output_<id>.txt', 'cat_output_<id>.txt']
parameter_file = "tests/test_manager/test_manager.par"
command_template = ('sleep 1;'
                    'ls -lrth {0} {1} {2} > {3};'
                    'cat {0} {1} {2} > {4}'
                    ).format('inputfile_<id>.txt',
                             'secondinputfile_<id>.txt',
                             '<shared>sf1',
                             'ls_output_<id>.txt',
                             'cat_output_<id>.txt')

locusts.swarm.launch(
    indir=my_input_dir,
    outdir=my_output_dir,
    code=batch_job_code,
    spcins=specific_inputs,
    shdins=shared_inputs,
    outs=outputs,
    cmd=command_template,
    parf=parameter_file
)

I want these tasks to run on the cluster mycluster, to which I have access (please read the Remote location page for details). The locusts parameter file would be:

### Generic
run_on_hpc		True			# True -> HPC; False -> local multithreading

### HPC
host_name		mycluster		# HPC name for SSH connection (see README file)
requested_nodes		2			# Number of nodes to be requested on HPC
cpus_per_node		10			# CPU cores per node (not counting hyperthreading). If variable, choose minimum value
partition		multinode		# Partition on HPC
hpc_exec_dir		/data/me/locusts_try	# HPC path where the jobs will be executed and the temp files stored
local_shared_dir				# Do the HPC service and the local machine share a directory? If so, you can specify a shared folder to use
data_transfer_protocol	tests/test_manager/data_transfer_protocol.sh

NB: In this case, I converted my script directly into the bash command. I could have done it differently: write a bash script taking the input and output files as arguments, place it among the input files, and then write a command that only launches that script (passed as a shared file).
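
A minimal sketch of that alternative, assuming the three-line script above is rewritten to take its five filenames as positional arguments and saved as myscript.sh (a hypothetical name) inside my_input_dir:

sleep 1
ls -lrth "$1" "$2" "$3" > "$4"
cat "$1" "$2" "$3" > "$5"

The shared list and the command template would then become:

shared_inputs = ['sf1:sharedfile.txt', 'scr:myscript.sh']
command_template = ('bash <shared>scr '
                    'inputfile_<id>.txt secondinputfile_<id>.txt <shared>sf1 '
                    'ls_output_<id>.txt cat_output_<id>.txt')

Note that the command still mentions every declared input and output template filename, as required.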
