This repository contains scripts to process and analyze the Cyber Attack Manifestations - Log Data Set (CAM-LDS) using LLMs.
If you use data or scripts from this repository, please cite the following publication:
- M. Landauer, W. Hotwagner, T. Boenke, F. Skopik, M. Wurzenberger. CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts. [PDF]
We provide a data set of filtered attack manifestations online. Download and unzip it with the following commands:
wget https://zenodo.org/records/18390561/files/manifestations_filtered.zip
unzip manifestations_filtered.zip
In the manifestations_filtered directory you can find log data and security alerts generated by various cyber attacks. The manifestations are collected for single attack steps, sequences, and techniques. We recommend focusing on attack steps: each step may be associated with multiple techniques, so both the sequences and techniques directories contain duplicated logs.
The plots in the state_charts directory list all attack steps of each scenario, e.g., Scenario 2. For example, Step 4 of Scenario 1 is a web scanning attack. You can find the logs corresponding to that attack step as follows:
head manifestations_filtered/steps/1_autostart_localaccount-4/videoserver/logs/access.log.2
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /tmp HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /components HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /forum HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /cgi-bin HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /profiles HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /objects HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /download HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /img HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /dyn HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /include HTTP/1.1" 404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"
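To get a quick overview of such an access log, the lines can be parsed programmatically. The following is a minimal sketch, assuming the combined log format seen in the excerpt above; the regular expression and the function name `summarize_access_log` are illustrative and not part of this repository.

```python
import re
from collections import Counter

# Pattern for the combined log format seen in the excerpt above (an assumption;
# other log files in the data set may use different formats).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_access_log(lines):
    """Count (status code, user agent) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the assumed format are skipped
            counts[(match["status"], match["agent"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /tmp HTTP/1.1" '
        '404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"',
        '192.42.1.174 - - [22/Sep/2025:18:31:35 +0000] "GET /forum HTTP/1.1" '
        '404 341 "-" "Fuzz Faster U Fool v2.1.0-dev"',
    ]
    for (status, agent), n in summarize_access_log(sample).most_common():
        print(n, status, agent)
```

A large number of 404 responses for a single scanner-like user agent, as in the excerpt, is the kind of pattern this summary makes visible.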
To create LLM prompts based on the log data in the manifestations_filtered/steps directory, use the following script, which creates the file llm_queries.json. With the parameters shown, 5 prompts with randomly sampled logs are generated for each step. Combining --samples 5 with a random seed of 42 reproduces the llm_queries.json file provided in this repository.
python3 create_llm_prompt.py --samples 5 --seed 42
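The role of the seed can be illustrated with a small sketch. This is not the actual logic of create_llm_prompt.py, only a generic example of seeded sampling; the function `sample_log_lines` and its parameters are hypothetical.

```python
import random


def sample_log_lines(lines, n_prompts, lines_per_prompt, seed):
    """Draw n_prompts random subsets of lines using a dedicated, seeded RNG."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    k = min(lines_per_prompt, len(lines))
    return [rng.sample(lines, k) for _ in range(n_prompts)]


if __name__ == "__main__":
    logs = [f"log line {i}" for i in range(100)]
    first = sample_log_lines(logs, n_prompts=5, lines_per_prompt=3, seed=42)
    second = sample_log_lines(logs, n_prompts=5, lines_per_prompt=3, seed=42)
    print(first == second)  # identical seed and input yield identical samples
```

Fixing the seed makes the sampled prompts repeatable across runs, which is what allows a provided query file to be regenerated exactly.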
The generated prompts contain a general task description as well as the lines to be classified. An example could look like this:
You are a MITRE ATT&CK TTP classification expert. Your task is to classify the following system log data. You are provided with samples from one or more hosts and one or more log sources that are captured during execution of one specific MITRE ATT&CK technique. Always output a valid JSON object with the following fields:
- "techniques": A list of top 10 ATT&CK techniques that are most likely related to the sample logs, sorted in descending order. Only print the ID of the techniques without any other descriptions.
- "confidence": An estimate for the certainty that the logs indicate an actual attack rather than normal system or user activity. Provide one of the following estimates: "Certain: Attack", "Almost Certain: Attack", "Somewhat Certain: Attack", "Neutral", "Somewhat Certain: Normal", "Almost Certain: Normal", "Certain: Normal"
- "explanation": A brief explanation (1-2 sentences) why you think that the samples correspond to attacks or normal behavior, e.g., by pointing to specific artifacts or properties of the logs.
videoserver:
logs/access.log.1:
192.42.1.174 - - [24/Sep/2025:07:51:33 +0000] "GET /zm/index.php HTTP/1.1" 200 8978 "-" "Mozilla/5.0"
192.42.1.174 - - [24/Sep/2025:07:51:33 +0000] "POST /zm/index.php HTTP/1.1" 302 756 "-" "Mozilla/5.0"
logs/log/audit/audit.log:
type=SYSCALL msg=audit(1758700293.784:4701): arch=c000003e syscall=59 success=yes exit=0 a0=7fe4cb2e7152 a1=7ffc1def0bd0 a2=7ffc1def3878 a3=8 items=3 ppid=2948 pid=2977 auid=4294967295 uid=33 gid=33 euid=33 suid=33 fsuid=33 egid=33 sgid=33 fsgid=33 tty=(none) ses=4294967295 comm="sh" exe="/usr/bin/dash" subj==unconfined key="T1166_Seuid_and_Setgid" ARCH=x86_64 SYSCALL=execve AUID="unset" UID="www-data" GID="www-data" EUID="www-data" SUID="www-data" FSUID="www-data" EGID="www-data" SGID="www-data" FSGID="www-data"
type=EXECVE msg=audit(1758700293.784:4701): argc=3 a0="sh" a1="-c" a2=2F7573722F62696E2F7A6D75202D4120202D61202D6D20303B736C6565702036
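In the sample above, the a2 argument of the EXECVE record is hex-encoded, which auditd does for arguments containing spaces or special characters. The following minimal sketch decodes such arguments for manual inspection; the helper `decode_execve_args` is illustrative and not part of this repository, and it should only be applied to EXECVE records, since other record types use a0, a1, ... for raw syscall arguments.

```python
import re


def decode_execve_args(record):
    """Return the arguments of an auditd EXECVE record, hex-decoding where needed."""
    args = {}
    # Arguments appear either quoted (a0="sh") or unquoted and hex-encoded (a2=2F75...).
    for idx, quoted, raw in re.findall(r'\ba(\d+)=(?:"([^"]*)"|(\S+))', record):
        if raw:
            try:
                value = bytes.fromhex(raw).decode("utf-8", errors="replace")
            except ValueError:
                value = raw  # not valid hex, keep the original text
        else:
            value = quoted
        args[int(idx)] = value
    return [args[i] for i in sorted(args)]


if __name__ == "__main__":
    record = (
        'type=EXECVE msg=audit(1758700293.784:4701): argc=3 a0="sh" a1="-c" '
        "a2=2F7573722F62696E2F7A6D75202D4120202D61202D6D20303B736C6565702036"
    )
    print(decode_execve_args(record))
    # ['sh', '-c', '/usr/bin/zmu -A  -a -m 0;sleep 6']
```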
To query the LLM with the prompts generated in the previous section, use the following script. Note that you need to set your OPENAI_API_KEY before running this script.
python3 get_llm_responses.py
This command will create the file llm_responses.json. Each line in this file is a JSON object that includes information on the scenario, variant, and step of the attack chain, as well as the prompt, response, and ground truth labels.
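Since each line is a separate JSON object, the file can be read line by line. The sketch below assumes nothing about the field names and simply prints the keys of the first record; the function `parse_json_lines` is illustrative and not part of this repository.

```python
import json
import os


def parse_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    path = "llm_responses.json"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            records = parse_json_lines(fh)
        print(len(records), "records")
        if records:
            print(sorted(records[0].keys()))  # inspect the available fields
```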
Responses from the LLM include the predicted attack techniques for the log data, an estimation whether the logs correspond to benign or malicious behavior, and an explanation. An example could look like this:
"techniques": ["T1059.004", "T1190", "T1068", "T1105", "T1041", "T1071.001", "T1055", "T1036", "T1005", "T1043"],
"confidence": "Somewhat Certain: Attack",
"explanation": "Web access to ZoneMinder (/zm/index.php) is followed by auditd showing apache2 (www-data) spawning /bin/sh with a \"sh -c\" command, which is a strong indicator of command execution via a web application (likely exploitation). The proximity of the remote client IP to the PHP warning and the shell exec makes this more consistent with attacker-driven activity than normal web usage."
By comparing the predicted techniques with the ground truth labels, the classification accuracy can be estimated.
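One possible way to do this is a top-k accuracy: a prediction counts as correct if any of the first k predicted technique IDs appears in the ground truth labels. The sketch below is one such metric, not necessarily the one used in the publication; the function name and the ground truth label in the example are hypothetical.

```python
def top_k_accuracy(predictions, ground_truths, k):
    """Fraction of samples where any of the top-k predicted IDs is a true label.

    predictions:   list of ranked technique ID lists, one per sample
    ground_truths: list of collections of true technique IDs, one per sample
    """
    if not predictions:
        return 0.0
    hits = 0
    for predicted, truth in zip(predictions, ground_truths):
        if set(predicted[:k]) & set(truth):
            hits += 1
    return hits / len(predictions)


if __name__ == "__main__":
    # Ranked list taken from the example response above.
    predicted = [["T1059.004", "T1190", "T1068", "T1105", "T1041",
                  "T1071.001", "T1055", "T1036", "T1005", "T1043"]]
    truth = [["T1190"]]  # hypothetical ground truth label for illustration
    print(top_k_accuracy(predicted, truth, k=1))  # 0.0
    print(top_k_accuracy(predicted, truth, k=3))  # 1.0
```

This comparison matches IDs exactly, so a predicted sub-technique such as T1059.004 does not count as a hit for the parent technique T1059 unless the IDs are normalized first.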
The filtered attack manifestations are based on log data collected from multiple simulation runs. We recommend using the filtered manifestations as described in the previous sections; however, we also provide the script used to generate these manifestations.
First, you need to download the logs collected from all simulation runs from our Zenodo page and store them in a directory called data/scenario<id> within this repository (you need to create these folders). For example, for the simulations in Scenario 2, the following files should exist:
data/scenario2/scenario_2_cron
data/scenario2/scenario_2_rootkit
Note that filtering is based on the normal behavior activities observed in all of the available scenarios; therefore, the resulting manifestations may differ if not all 34 simulation runs (18 from Scenario 1, 2 from Scenario 2, 6 from Scenario 3, 1 from Scenario 4, 1 from Scenario 5, 5 from Scenario 6, and 1 from Scenario 7) are in their respective directories. Then, run the following command to extract the attack manifestations; it generates the manifestations_filtered and manifestations_raw folders.
python3 extract_attack_logs.py
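Because the filtering depends on all simulation runs being present, it can help to check the data directory before running the extraction. The sketch below compares the number of entries in each data/scenario<id> directory with the counts listed above; it assumes that every simulation run is exactly one entry in its scenario directory, and the function `check_runs` is illustrative and not part of this repository.

```python
from pathlib import Path

# Expected number of simulation runs per scenario, as listed above (34 in total).
EXPECTED_RUNS = {1: 18, 2: 2, 3: 6, 4: 1, 5: 1, 6: 5, 7: 1}


def check_runs(data_dir="data"):
    """Return {scenario_id: (found, expected)} for scenarios whose counts differ."""
    mismatches = {}
    for scenario_id, expected in EXPECTED_RUNS.items():
        scenario_dir = Path(data_dir) / f"scenario{scenario_id}"
        found = 0
        if scenario_dir.is_dir():
            # Assumption: one entry per simulation run; hidden entries are ignored.
            found = sum(1 for p in scenario_dir.iterdir() if not p.name.startswith("."))
        if found != expected:
            mismatches[scenario_id] = (found, expected)
    return mismatches


if __name__ == "__main__":
    for scenario_id, (found, expected) in sorted(check_runs().items()):
        print(f"scenario{scenario_id}: found {found} runs, expected {expected}")
```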