This repository contains the maintained source code of VuTeCo (VUlnerability TEst COllector), which can scan Java Git repositories to (1) find security-related test cases and (2) match them with CVE identifiers.
Please see the MSR'26 paper for more details about its inner workings.
If you are looking for the MSR'26 version of VuTeCo, please see the Zenodo package.
If you are looking for the dataset Test4Vul, containing some manually-validated tests found in the wild by VuTeCo, please see https://github.com/tuhh-softsec/test4vul.
- Requirements
- Installation
- Running VuTeCo
- Supported Models
- How to Extend VuTeCo with a new Model/Technique
- Future Work
These are the base requirements to run VuTeCo:
- Python 3.11
- Java 8+
- A stable Internet connection (e.g., for downloading the Python packages and cloning remote repositories during inference).
VuTeCo has only been tested on a Linux-based OS so far. Nevertheless, the scripts were implemented to be OS-agnostic, so they should also work on macOS or Windows.
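As a quick sanity check of the requirements listed above, the following minimal sketch reports what looks wrong before installation. It is not part of VuTeCo: the function name `find_requirement_issues` is hypothetical, it assumes that "Python 3.11" means exactly the 3.11 series, and it only checks that a `java` executable is present, not that its version is 8 or later.

```python
import shutil
import sys
from typing import List, Optional, Tuple


def find_requirement_issues(
    python_version: Tuple[int, int],
    java_executable: Optional[str],
) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    issues = []
    # Assumption: the README's "Python 3.11" is read as "exactly the 3.11 series".
    if tuple(python_version) != (3, 11):
        issues.append(
            f"Python 3.11 is required, found {python_version[0]}.{python_version[1]}"
        )
    # Only the presence of a Java executable is checked, not its version.
    if java_executable is None:
        issues.append("No 'java' executable found on PATH (Java 8+ is required)")
    return issues


if __name__ == "__main__":
    # Inspect the interpreter running this script and the current PATH.
    problems = find_requirement_issues(sys.version_info[:2], shutil.which("java"))
    for problem in problems:
        print("WARNING:", problem)
    if not problems:
        print("Basic requirements look satisfied.")
```

Passing the version and the Java path as arguments keeps the check easy to try with different values without touching the actual environment.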
NOTES:
- The following commands assume that `python` is the default alias for the selected Python installation. You can change it to `python3` without issues.
- The `XX` indicates the acronym of an AI model supported in VuTeCo.
- The description of all command-line arguments can be found in the file `vuteco/src/common/cli_args.py`.
Ensure you have sufficient disk space to host the packages downloaded from PyPI (roughly 7 GB).
- This setup will be improved in the future to avoid installing unneeded dependencies if one wants to use VuTeCo without the training-evaluation pipeline.
VuTeCo can be installed in any Python project as a package if it has `pyproject.toml`. If so, add the following line in the dependency list:

    vuteco = { git = "https://github.com/tuhh-softsec/vuteco.git", subdirectory = "vuteco" }

Alternatively, you can add the following line in the `requirements.txt` file:

    vuteco @ git+https://github.com/tuhh-softsec/vuteco.git#subdirectory=vuteco

If you want to install it directly with pip:

    pip install git+https://github.com/tuhh-softsec/vuteco.git#subdirectory=vuteco

The main function to run VuTeCo directly from Python code is `vuteco.main.starter.vuteco_start()`.
VuTeCo will be available on PyPI in the future to simplify this step.
If you want to build VuTeCo locally, clone this repository and move into the `vuteco/` directory:

    cd vuteco/

If this is the first use of this package, create the virtual environment and activate it:

    python -m venv ./venv
    source venv/bin/activate

Ensure you have the right version of `setuptools` and an up-to-date version of `wheel`. Note that a too recent version of `setuptools` (>=82.0.0) may have problems with some dependencies still using `pkg_resources` (they will be replaced in the future):

    python -m pip install --upgrade pip "setuptools<81.0.0" wheel

Install the required dependencies, as listed in `pyproject.toml`, in the virtual environment (this can take some minutes):

    python -m pip install -e .

After this, VuTeCo can be run with the command `vuteco`, which is equivalent to `python -m vuteco.main.cli` (you can use either). This command is usable as long as the virtual environment remains active.
If you get unexpected problems caused by dependencies, use the pinned versions listed in `requirements.txt`:

    python -m pip install -r requirements.txt

If `requirements.txt` happens to have the line `pkg_resources==0.0.0`, remove it.
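If you prefer to script the removal of that line, a minimal sketch could look like the following. The helper names `drop_pkg_resources_pin` and `clean_requirements_file` are hypothetical and not part of VuTeCo; the sketch only removes lines that are exactly `pkg_resources==0.0.0` (ignoring surrounding whitespace) and leaves everything else untouched.

```python
from pathlib import Path


def drop_pkg_resources_pin(requirements_text: str) -> str:
    """Return the text without any line that is exactly 'pkg_resources==0.0.0'."""
    kept = [
        line
        for line in requirements_text.splitlines(keepends=True)
        if line.strip() != "pkg_resources==0.0.0"
    ]
    return "".join(kept)


def clean_requirements_file(path: str = "requirements.txt") -> bool:
    """Rewrite the file in place; return True if a line was removed."""
    file_path = Path(path)
    original = file_path.read_text(encoding="utf-8")
    cleaned = drop_pkg_resources_pin(original)
    if cleaned != original:
        file_path.write_text(cleaned, encoding="utf-8")
        return True
    return False
```

Calling `clean_requirements_file()` from the `vuteco/` directory before running `python -m pip install -r requirements.txt` would apply the fix only when it is needed.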
If packages like `halo` or `ares` raise issues because of `wheel`, run this and try again:

    python -m pip install --upgrade pip "setuptools<81.0.0" wheel

Unsloth could cause dependency problems. If the version listed in `requirements.txt` gives issues, try installing the latest version directly from the repository:

    python -m pip uninstall unsloth unsloth-zoo -y && python -m pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git git+https://github.com/unslothai/unsloth-zoo.gi

The command to run VuTeCo is `vuteco`, which points to `vuteco/src/vuteco/cli.py`. The arguments it accepts are described in the file `vuteco/src/common/cli_args.py`.
VuTeCo can be run in two modes:
- Finding: for each JUnit test method found in a given project repository, it predicts whether the method is security-related (returns a probability).
- Matching: for each pair made of a JUnit test method found in a given project repository and a CVE from the supplied list, it predicts whether the two match (returns a probability).
The user can interpret the predicted probabilities freely (the tool itself does not classify; it only returns probabilities). The recommended threshold for positive classifications is the default, 0.5.
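Turning probabilities into classifications is therefore left to the caller. The sketch below shows one way to do it, under stated assumptions: the input is a plain mapping from a test identifier to its probability (the actual layout of VuTeCo's JSON output is not described here, so adapt the loading step to it), a probability equal to the threshold counts as positive, and the function name `positives` and the test names are purely illustrative.

```python
from typing import Dict, List


def positives(probabilities: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the keys whose probability reaches the threshold, highest first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    # Assumption: a probability exactly equal to the threshold is a positive.
    selected = [name for name, prob in probabilities.items() if prob >= threshold]
    return sorted(selected, key=lambda name: probabilities[name], reverse=True)


# Hypothetical scores, for illustration only.
scores = {"FooTest.testA": 0.91, "FooTest.testB": 0.12, "BarTest.testC": 0.5}
print(positives(scores))        # ['FooTest.testA', 'BarTest.testC']
print(positives(scores, 0.95))  # []
```

Raising the threshold trades recall for precision, which can be useful when the flagged tests are going to be reviewed manually.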
The input projects to analyze can be supplied through the command-line argument `-i` (see this example for guidance). The output is returned as JSON files, one per analyzed project, placed in the directory indicated by the command-line argument `-o`.
- Make sure this directory does not exist or is empty, so that the output does not clash with your existing files.
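A small pre-flight check can enforce this before launching a long run. This is a minimal sketch, not VuTeCo code; the helper name `is_safe_output_dir` is hypothetical.

```python
from pathlib import Path


def is_safe_output_dir(path: str) -> bool:
    """True if the path does not exist yet or is an empty directory."""
    target = Path(path)
    if not target.exists():
        return True
    # An existing regular file, or a directory with any entry, is not safe.
    return target.is_dir() and not any(target.iterdir())
```

For example, a wrapper script could refuse to start when `is_safe_output_dir(output_dir)` returns `False`, instead of mixing new JSON files with unrelated ones.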
The AI model to use can be set with the argument -t. The list of supported models is reported below.
- In Finding mode, the recommended model is `uxc` (UniXcoder);
- In Matching mode, the recommended model is `dsc` (DeepSeek Coder).
VuTeCo can be called in Finding mode in this way (with model `uxc`):

    vuteco -i <PROJECT-LIST-FILE> -o <OUTPUT-DIRECTORY> -r from-file --no-vuln-match --skip-inspected-projects -t uxc

VuTeCo can be called in Matching mode in this way (with model `dsc`):

    vuteco -i <PROJECT-LIST-FILE> -o <OUTPUT-DIRECTORY> -r from-file --no-vuln-find --skip-inspected-projects -t dsc

VuTeCo automatically downloads the weights for the Finding and Matching models from Hugging Face (https://huggingface.co/emaiannone/models). However, should this automatic download encounter issues, you can work around them as follows:
- Manually download the model from the Hugging Face web pages.
- Place the Finding models under `<ANY-DIR>/<MODEL-NAME>/final`; place the Matching models under `<ANY-DIR>/<MODEL-NAME>/e2e/final` (this awkward path naming will be fixed in the future).
- Add this argument to the above commands: `--model-dir <ANY-DIR>`.
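When VuTeCo is driven from a script, the two invocations above and the manual-download layout can be assembled programmatically. The sketch below is a convenience wrapper under stated assumptions: the helper names `build_vuteco_command` and `expected_weights_dir` are hypothetical, and only the flags that appear in the commands above are used.

```python
from pathlib import Path
from typing import List, Optional


def build_vuteco_command(
    mode: str,
    project_list_file: str,
    output_dir: str,
    model: Optional[str] = None,
    model_dir: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the Finding or Matching invocation."""
    if mode == "finding":
        # Finding only: matching is switched off; recommended model is uxc.
        disable_flag, default_model = "--no-vuln-match", "uxc"
    elif mode == "matching":
        # Matching only: finding is switched off; recommended model is dsc.
        disable_flag, default_model = "--no-vuln-find", "dsc"
    else:
        raise ValueError("mode must be 'finding' or 'matching'")
    command = [
        "vuteco",
        "-i", project_list_file,
        "-o", output_dir,
        "-r", "from-file",
        disable_flag,
        "--skip-inspected-projects",
        "-t", model or default_model,
    ]
    if model_dir is not None:
        # Only needed when the weights were downloaded manually.
        command += ["--model-dir", model_dir]
    return command


def expected_weights_dir(model_dir: str, model_name: str, mode: str) -> Path:
    """Directory where manually downloaded weights are expected for a mode."""
    if mode not in ("finding", "matching"):
        raise ValueError("mode must be 'finding' or 'matching'")
    base = Path(model_dir) / model_name
    return base / "final" if mode == "finding" else base / "e2e" / "final"
```

With VuTeCo installed and the virtual environment active, the resulting list can be passed to `subprocess.run(command, check=True)`. Building the arguments as a list avoids shell quoting issues with paths that contain spaces.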
| Acronym | Name |
|---|---|
| cb | CodeBERT |
| uxc | UniXcoder |
| ct5p | CodeT5+ |
| cl | CodeLlama |
| dsc | DeepSeek Coder |
| qc | Qwen Coder |
This guide explains how to add a new model or technique to VuTeCo; it will be further improved in the future.
Assume you want to add a technique named MyPowerfulTechnique, with the acronym mpt.
- Register the technique for training and evaluation. Add the name `mpt` to:
  - `FinderName` if it is for the Finding task, or
  - `End2EndName` if it is for the Matching task.

  Follow the naming conventions used by existing entries (recommended, but not mandatory), e.g., `MYPOWERFULTECHNIQUE_FND = "mpt-fnd"` or `MYPOWERFULTECHNIQUE_E2E = "mpt-e2e"`.
- Register the technique for inference. Add the technique name to `TechniqueName`, again following the existing naming patterns where possible, e.g., `MYPOWERFULTECHNIQUE = "mpt"`.
Create the class that implements your technique under the modeling module, following the style of the existing files.
This is where you define the specific behavior of your technique. If you want it to behave like the existing models in VuTeCo, make sure it inherits from one of the existing superclasses (e.g., `NeuralNetworkFinder`, `LanguageModelFinder`, `NeuralNetworkE2E`, `LanguageModelE2E`) and place the class in the appropriate file. For example, an LLM for Matching should typically go in `modeling_lm_e2e.py`. Follow existing naming conventions where possible. Otherwise, you have to define the custom behavior on your own, possibly in a new file.
- Define an evaluation output directory. Create a new constant pointing to the directory where evaluation results will be written. Follow the naming convention used by existing constants, e.g., `MPT_FND_EXPORT_EVAL_DIRPAT = os.path.join(EVALUATED_MODEL_DIRPATH, FinderName.MYPOWERFULTECHNIQUE_FND.value)`.
- Map the approach to its class and evaluation directory. Add an entry to the appropriate `XYZ_MODELS` dictionary, depending on the technique type. For example:
  - `NN_FINDER_MODELS` for neural-network–based Finding approaches
  - `LM_E2E_MODELS` for LLM-based Matching approaches
- Map the approach to its inference name. Add an entry to the appropriate `VUTECO_XYZ` dictionary to associate the training/evaluation name with the inference name.
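Putting the registration steps of this guide together for a Finding technique, the result could look roughly like the sketch below. Everything in it is a stand-in so that the snippet runs on its own: in VuTeCo, `FinderName`, `TechniqueName`, `NeuralNetworkFinder`, `EVALUATED_MODEL_DIRPATH`, and the dictionaries already exist and you would only add the new entries. The dictionary name `VUTECO_FINDERS`, the shape of the dictionary values, and the direction of the last mapping are assumptions, so check the existing entries in the code before copying anything.

```python
import os
from enum import Enum

# Stand-ins for definitions that already exist in VuTeCo (assumed shapes).
EVALUATED_MODEL_DIRPATH = "evaluated_models"  # placeholder path


class NeuralNetworkFinder:
    """Stand-in for the real superclass of neural-network Finding models."""


class FinderName(Enum):
    # Training/evaluation name for the Finding task.
    MYPOWERFULTECHNIQUE_FND = "mpt-fnd"


class TechniqueName(Enum):
    # Inference name, i.e., the value passed with -t.
    MYPOWERFULTECHNIQUE = "mpt"


class MyPowerfulTechniqueFinder(NeuralNetworkFinder):
    """Technique-specific behavior would be implemented here."""


# Evaluation output directory, named as in the guide above.
MPT_FND_EXPORT_EVAL_DIRPAT = os.path.join(
    EVALUATED_MODEL_DIRPATH, FinderName.MYPOWERFULTECHNIQUE_FND.value
)

# Map the approach to its class and evaluation directory.
# Assumption: values are (class, directory) pairs; the real shape may differ.
NN_FINDER_MODELS = {
    FinderName.MYPOWERFULTECHNIQUE_FND: (
        MyPowerfulTechniqueFinder,
        MPT_FND_EXPORT_EVAL_DIRPAT,
    ),
}

# Map the training/evaluation name to the inference name.
# Assumption: hypothetical dictionary name and mapping direction.
VUTECO_FINDERS = {
    FinderName.MYPOWERFULTECHNIQUE_FND: TechniqueName.MYPOWERFULTECHNIQUE,
}
```

A Matching technique would follow the same pattern with `End2EndName`, a superclass such as `LanguageModelE2E`, and the corresponding `LM_E2E_MODELS` dictionary.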
This repository is under improvement. These are some activities that will be done to improve the usability of VuTeCo and the clarity of this README:
- Align the names with the ones used in the MSR'26 paper (e.g., End2End becomes Matcher).
- Explain better how VuTeCo can be extended (possibly simplifying the code as well).
- Separate VuTeCo (the tool used for inference) from the training-evaluation pipeline (for exporting the models). This also includes separating the required dependencies.
- Clean up dependencies and provide the Docker images to run VuTeCo out of the box.
- Handle TestNG test cases, in addition to JUnit.
Please open new issues for suggestions and bug fixes! They are much appreciated :)