
Adding changes to properly pass/fail a scenario if errors occur #1065

Merged
paigerube14 merged 1 commit into krkn-chaos:main from paigerube14:exit_codes
Mar 31, 2026

Conversation

@paigerube14
Collaborator

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Adding return statuses in a few scenarios to properly fail when an exception occurs

Related Tickets & Documents

If there is no related issue, please create one and start the conversation about the desired changes there.

  • Related Issue #:
  • Closes #:

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

<!-- Add the link to the corresponding documentation PR in the website repository -->

Checklist before requesting a review

See testing your changes and run on any Kubernetes or OpenShift cluster to validate your changes

  • I have performed a self-review of my code by running krkn and the specific scenario
  • If it is a core feature, I have added thorough unit tests with above 80% coverage

REQUIRED:
Description of combination of tests performed and output of run

python run_kraken.py
...
<---insert test results output--->

OR

python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->

@paigerube14 paigerube14 force-pushed the exit_codes branch 3 times, most recently from 2a29388 to 8ed7a6d Compare January 19, 2026 19:47
@paigerube14
Collaborator Author

/review

@qodo-code-review

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Control Flow

The new unconditional success return may alter behavior depending on loop/try nesting. Validate that success is returned only after all actions/runs complete and that no paths unintentionally skip cleanup/telemetry publishing or override earlier failures.

                end_time = int(time.time())
                cerberus.get_status(krkn_config, start_time, end_time)
            except Exception as e:
                logging.error(
                    "ManagedClusterScenarioPlugin exiting due to Exception %s"
                    % e
                )
                return 1
return 0
Cleanup Robustness

The AWS network ACL manipulation is wrapped in a broad try/except but lacks a finally-style recovery. If an exception occurs after associations are replaced (or if a subnet has no association IDs), the cluster networking state may not be restored and created ACLs may leak. Consider guarding empty association lists and ensuring rollback/deletion happens even on partial failure.

try:
    vpc_id = scenario_config["vpc_id"]
    subnet_ids = scenario_config["subnet_id"]
    duration = scenario_config["duration"]
    # Add support for user-provided default network ACL
    default_acl_id = scenario_config.get("default_acl_id")
    ids = {}
    acl_ids_created = []
    for subnet_id in subnet_ids:
        logging.info("Targeting subnet_id")
        network_association_ids = []
        associations, original_acl_id = self.cloud_object.describe_network_acls(
            vpc_id, subnet_id
        )
        for entry in associations:
            if entry["SubnetId"] == subnet_id:
                network_association_ids.append(
                    entry["NetworkAclAssociationId"]
                )
        logging.info(
            "Network association ids associated with "
            "the subnet %s: %s" % (subnet_id, network_association_ids)
        )

        # Use provided default ACL if available, otherwise create a new one
        if default_acl_id:
            acl_id = default_acl_id
            logging.info(
                "Using provided default ACL ID %s - this ACL will not be deleted after the scenario",
                default_acl_id
            )
            # Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
        else:
            acl_id = self.cloud_object.create_default_network_acl(vpc_id)
            logging.info("Created new default ACL %s", acl_id)
            acl_ids_created.append(acl_id)

        new_association_id = self.cloud_object.replace_network_acl_association(
            network_association_ids[0], acl_id
        )

        # capture the original_acl_id, created_acl_id and
        # new association_id to use during the recovery
        ids[new_association_id] = original_acl_id

    # wait for the specified duration
    logging.info(
        "Waiting for the specified duration " "in the config: %s" % duration
    )
    time.sleep(duration)

    # replace the applied acl with the previous acl in use
    for new_association_id, original_acl_id in ids.items():
        self.cloud_object.replace_network_acl_association(
            new_association_id, original_acl_id
        )
    logging.info(
        "Waiting for 60 seconds to make sure " "the changes are in place"
    )
    time.sleep(60)

    # delete the network acl created for the run
    for acl_id in acl_ids_created:
        self.cloud_object.delete_network_acl(acl_id)
except Exception as e:
    logging.error(
        f"Network based zone outage scenario failed with exception: {e}"
    )
    return 1

return 0
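Guarding the empty association list called out above could look like this sketch (the function name and arguments are illustrative, mirroring the snippet's variables):

```python
# Sketch only: guard an empty association list before indexing [0],
# so a subnet with no ACL associations fails with a clear error
# instead of an IndexError mid-scenario.
def pick_first_association(network_association_ids, subnet_id):
    if not network_association_ids:
        raise RuntimeError(
            "No network ACL associations found for subnet %s" % subnet_id
        )
    return network_association_ids[0]
```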
Test Reliability

The patched open is not configured as a context manager, which can cause failures when the production code uses with open(...) as f. Using unittest.mock.mock_open() (or setting __enter__/__exit__) would make these tests less brittle.
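A minimal sketch of the suggested `mock_open()` approach (the filename and YAML string are illustrative):

```python
from unittest.mock import mock_open, patch

# mock_open() pre-configures __enter__/__exit__, so production code
# using "with open(...) as f" works under the patch.
m = mock_open(read_data="managedcluster_scenarios: []")
with patch("builtins.open", m):
    with open("test_scenario.yaml") as f:
        content = f.read()

assert content == "managedcluster_scenarios: []"
```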

@patch('time.time')
@patch('builtins.open', create=True)
@patch('yaml.full_load')
@patch('krkn.cerberus.get_status')
def test_run_multiple_actions_executes_all(self, mock_cerberus, mock_yaml, mock_open, mock_time):
    """
    Test that run() executes all actions, not just the first one
    This tests the fix for the early return bug
    """
    mock_time.return_value = 1234567890

    # Setup mock scenario config with multiple actions
    mock_yaml.return_value = {
        "managedcluster_scenarios": [
            {
                "actions": [
                    "managedcluster_start_scenario",
                    "managedcluster_stop_scenario",
                    "managedcluster_reboot_scenario"
                ],
                "managedcluster_name": "test-cluster",
                "runs": 1,
                "instance_count": 1,
                "timeout": 120
            }
        ]
    }

    mock_lib_telemetry = Mock(spec=KrknTelemetryOpenshift)
    mock_lib_telemetry.get_lib_kubernetes.return_value = self.mock_kubecli

    mock_scenario_telemetry = Mock()
    mock_krkn_config = {}

    # Mock inject_managedcluster_scenario to track calls
    call_tracker = []
    original_inject = self.plugin.inject_managedcluster_scenario
    def track_inject(action, *args, **kwargs):
        call_tracker.append(action)

    with patch.object(self.plugin, 'inject_managedcluster_scenario', side_effect=track_inject):
        # Execute the run method
        result = self.plugin.run(
            run_uuid="test-uuid",
            scenario="test_scenario.yaml",
            krkn_config=mock_krkn_config,
            lib_telemetry=mock_lib_telemetry,
            scenario_telemetry=mock_scenario_telemetry
        )

    # Assert all three actions were called
    self.assertEqual(result, 0)
    self.assertEqual(len(call_tracker), 3)
    self.assertIn("managedcluster_start_scenario", call_tracker)
    self.assertIn("managedcluster_stop_scenario", call_tracker)
    self.assertIn("managedcluster_reboot_scenario", call_tracker)

    # Assert cerberus was called 3 times (once per action)
    self.assertEqual(mock_cerberus.call_count, 3)

@patch('time.time')
@patch('builtins.open', create=True)
@patch('yaml.full_load')
def test_run_stops_on_first_error(self, mock_yaml, mock_open, mock_time):
    """
    Test that run() returns 1 and stops executing on first error
    """
    mock_time.return_value = 1234567890

    # Setup mock scenario config with multiple actions
    mock_yaml.return_value = {
        "managedcluster_scenarios": [
            {
                "actions": [
                    "managedcluster_start_scenario",
                    "managedcluster_stop_scenario",
                    "managedcluster_reboot_scenario"
                ],
                "managedcluster_name": "test-cluster",
                "runs": 1,
                "instance_count": 1,
                "timeout": 120
            }
        ]
    }

    mock_lib_telemetry = Mock(spec=KrknTelemetryOpenshift)
    mock_lib_telemetry.get_lib_kubernetes.return_value = self.mock_kubecli

    mock_scenario_telemetry = Mock()
    mock_krkn_config = {}

    # Mock inject_managedcluster_scenario to raise exception on first call
    call_tracker = []
    def track_inject_with_error(action, *args, **kwargs):
        call_tracker.append(action)
        if action == "managedcluster_start_scenario":
            raise Exception("Test failure")

    with patch.object(self.plugin, 'inject_managedcluster_scenario', side_effect=track_inject_with_error):
        # Execute the run method
        result = self.plugin.run(
            run_uuid="test-uuid",
            scenario="test_scenario.yaml",
            krkn_config=mock_krkn_config,
            lib_telemetry=mock_lib_telemetry,
            scenario_telemetry=mock_scenario_telemetry
        )

    # Assert failure and only first action was attempted
    self.assertEqual(result, 1)
    self.assertEqual(len(call_tracker), 1)
    self.assertEqual(call_tracker[0], "managedcluster_start_scenario")

@qodo-code-review

Review Summary by Qodo

Add proper error handling and return codes to scenario plugins

🐞 Bug fix 🧪 Tests


Walkthroughs

Description
• Add proper error handling and return codes to scenario plugins
  - ManagedClusterScenarioPlugin: Fix early return preventing multiple actions
  - ZoneOutageScenarioPlugin: Capture and propagate return values from zone methods
  - ZoneOutageScenarioPlugin: Wrap network_based_zone in try-catch for exception handling
• Add tearDown methods to test classes for proper cleanup
• Add comprehensive test coverage for error handling and return codes
Diagram
flowchart LR
  A["Scenario Plugins"] -->|"Fix early returns"| B["ManagedClusterScenarioPlugin"]
  A -->|"Capture return values"| C["ZoneOutageScenarioPlugin"]
  C -->|"Add exception handling"| D["network_based_zone method"]
  E["Test Classes"] -->|"Add tearDown cleanup"| F["Multiple Test Files"]
  E -->|"Add error handling tests"| G["New Test Cases"]


File Changes

1. krkn/scenario_plugins/managed_cluster/managed_cluster_scenario_plugin.py 🐞 Bug fix +1/-2

Fix early return preventing multiple actions execution

krkn/scenario_plugins/managed_cluster/managed_cluster_scenario_plugin.py


2. krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py 🐞 Bug fix +70/-60

Add return value propagation and exception handling

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py


3. tests/test_managed_cluster_scenario_plugin.py 🧪 Tests +114/-0

Add tests for multiple actions and error handling

tests/test_managed_cluster_scenario_plugin.py


4. tests/test_zone_outage_scenario_plugin.py 🧪 Tests +188/-0

Add tests for return value propagation and failures

tests/test_zone_outage_scenario_plugin.py


5. tests/test_time_actions_scenario_plugin.py 🧪 Tests +75/-0

Add exception handling and error logging tests

tests/test_time_actions_scenario_plugin.py


6. tests/test_application_outage_scenario_plugin.py 🧪 Tests +4/-0

Add tearDown method for test cleanup

tests/test_application_outage_scenario_plugin.py


7. tests/test_container_scenario_plugin.py 🧪 Tests +4/-0

Add tearDown method for test cleanup

tests/test_container_scenario_plugin.py


8. tests/test_native_scenario_plugin.py 🧪 Tests +4/-0

Add tearDown method for test cleanup

tests/test_native_scenario_plugin.py


9. tests/test_openstack_node_scenarios.py 🧪 Tests +11/-0

Add tearDown methods for test cleanup

tests/test_openstack_node_scenarios.py


10. tests/test_pod_disruption_scenario_plugin.py 🧪 Tests +4/-0

Add tearDown method for test cleanup

tests/test_pod_disruption_scenario_plugin.py


11. tests/test_pvc_scenario_plugin.py 🧪 Tests +16/-0

Add tearDown methods for test cleanup

tests/test_pvc_scenario_plugin.py


12. tests/test_syn_flood_scenario_plugin.py 🧪 Tests +12/-0

Add tearDown methods for test cleanup

tests/test_syn_flood_scenario_plugin.py



@qodo-code-review

qodo-code-review bot commented Mar 2, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (1) 📎 Requirement gaps (0)



Action required

1. ZoneOutage run() masks exit codes 📘 Rule violation ✓ Correctness
Description
ZoneOutageScenarioPlugin.run() converts any non-zero return from
network_based_zone()/node_based_zone() into exit code 1, which can hide required health-check
failure codes (3+). This violates the standardized exit-code propagation requirement and also
returns 1 without clearly logging the reason at the decision point.
Code

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[R37-47]

+                    result = self.network_based_zone(scenario_config)
+                    if result != 0:
+                        return 1
                else:
                    kubecli = lib_telemetry.get_lib_kubernetes()
                    if cloud_type.lower() == "gcp":
                        affected_nodes_status = AffectedNodeStatus()
                        self.cloud_object = gcp_node_scenarios(kubecli, kube_check, affected_nodes_status)
-                        self.node_based_zone(scenario_config, kubecli)
+                        result = self.node_based_zone(scenario_config, kubecli)
+                        if result != 0:
+                            return 1
Evidence
Compliance requires propagating standardized exit codes (including 3+ for health check failures)
and logging the reason for the chosen exit code. The new logic maps any non-zero result to `return
1`, losing the original exit code information.

CLAUDE.md
krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[37-47]
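A minimal sketch of the suggested direction, propagating the helper's standardized exit code instead of collapsing it to 1 (the wrapper name and callable are hypothetical):

```python
import logging

# Hypothetical wrapper: return the zone method's result directly so the
# standardized codes (0 success, 1 scenario failure, 2 critical alerts,
# 3+ health check failure) survive, and log the reason at the decision
# point instead of silently returning 1.
def run_zone(zone_method, scenario_config):
    result = zone_method(scenario_config)
    if result != 0:
        logging.error("Zone scenario failed with exit code %s", result)
    return result
```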

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ZoneOutageScenarioPlugin.run()` currently converts any non-zero result from `network_based_zone()` / `node_based_zone()` into exit code `1`. This can mask standardized exit codes like `3+` for health check failures, violating the exit-code propagation requirement.

## Issue Context
The compliance requirement expects standardized exit codes to be used and propagated appropriately (0 success, 1 scenario failure, 2 critical alerts, 3+ health check failure), and the reason for the exit code should be logged clearly.

## Fix Focus Areas
- krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[37-47]



2. Tests call run() incorrectly 🐞 Bug ✓ Correctness
Description
New tests call ScenarioPlugin.run() with an unsupported krkn_config kwarg and assert cerberus
calls that the plugin never makes; this will raise TypeError / fail CI.
Code

tests/test_managed_cluster_scenario_plugin.py[R81-97]

+            result = self.plugin.run(
+                run_uuid="test-uuid",
+                scenario="test_scenario.yaml",
+                krkn_config=mock_krkn_config,
+                lib_telemetry=mock_lib_telemetry,
+                scenario_telemetry=mock_scenario_telemetry
+            )
+        
+        # Assert all three actions were called
+        self.assertEqual(result, 0)
+        self.assertEqual(len(call_tracker), 3)
+        self.assertIn("managedcluster_start_scenario", call_tracker)
+        self.assertIn("managedcluster_stop_scenario", call_tracker)
+        self.assertIn("managedcluster_reboot_scenario", call_tracker)
+        
+        # Assert cerberus was called 3 times (once per action)
+        self.assertEqual(mock_cerberus.call_count, 3)
Evidence
AbstractScenarioPlugin.run() and the plugin implementations define `run(run_uuid, scenario, lib_telemetry, scenario_telemetry)`, with no `krkn_config` parameter. The new tests pass `krkn_config=...`, which will raise `TypeError: got an unexpected keyword argument 'krkn_config'`. The managed-cluster test also asserts cerberus.get_status is called, but ManagedClusterScenarioPlugin.run() never calls it.

krkn/scenario_plugins/abstract_scenario_plugin.py[28-35]
krkn/scenario_plugins/managed_cluster/managed_cluster_scenario_plugin.py[15-45]
tests/test_managed_cluster_scenario_plugin.py[81-87]
tests/test_zone_outage_scenario_plugin.py[87-93]
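A stub illustrating the signature mismatch (the stub is hypothetical; only its parameter list mirrors the canonical `AbstractScenarioPlugin.run()` cited above):

```python
# Hypothetical stub whose parameter list mirrors AbstractScenarioPlugin.run();
# it accepts no krkn_config parameter, so a test passing krkn_config=...
# raises TypeError rather than exercising the plugin.
def run(run_uuid, scenario, lib_telemetry, scenario_telemetry):
    return 0
```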

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
New unit tests call `ScenarioPlugin.run()` with an unsupported `krkn_config` keyword argument and also assert behavior (cerberus calls) that the implementation doesn’t perform. This will fail CI with `TypeError` and/or incorrect assertions.

### Issue Context
`AbstractScenarioPlugin.run()` defines the canonical signature used across plugins. The tests should match that API.

### Fix Focus Areas
- tests/test_managed_cluster_scenario_plugin.py[81-97]
- tests/test_zone_outage_scenario_plugin.py[87-93]
- tests/test_time_actions_scenario_plugin.py[58-64]



3. TimeActions except uses undefined e 🐞 Bug ✓ Correctness
Description
TimeActionsScenarioPlugin.run() references {e} in its error log without binding the exception as
e, causing a NameError and masking the original failure (scenario may crash instead of returning
1).
Code

tests/test_time_actions_scenario_plugin.py[R58-74]

+        result = self.plugin.run(
+            run_uuid="test-uuid",
+            scenario="test_scenario.yaml",
+            krkn_config=mock_krkn_config,
+            lib_telemetry=mock_lib_telemetry,
+            scenario_telemetry=mock_scenario_telemetry
+        )
+        
+        # Assert failure is returned
+        self.assertEqual(result, 1)
+        
+        # Assert logging.error was called with the exception message
+        mock_logging_error.assert_called_once()
+        error_call_args = str(mock_logging_error.call_args)
+        self.assertIn("Test exception message", error_call_args)
+        self.assertIn("TimeActionsScenarioPlugin", error_call_args)
+
Evidence
The implementation’s except (RuntimeError, Exception): block logs an f-string containing {e} but
does not capture the exception into a variable. This will raise NameError: name 'e' is not defined
during error handling, violating the contract that run() must not propagate exceptions and must
return 1 on failure.

krkn/scenario_plugins/time_actions/time_actions_scenario_plugin.py[39-43]
tests/test_time_actions_scenario_plugin.py[45-74]
krkn/scenario_plugins/abstract_scenario_plugin.py[37-40]
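A minimal sketch of the fix, binding the caught exception as `e` so the handler's f-string can reference it (the raised error is a stand-in failure):

```python
import logging

# Sketch of the fix: "except Exception as e" binds the exception, so the
# f-string in the log call does not itself raise NameError, and run()
# returns 1 instead of propagating a secondary exception.
def run():
    try:
        raise RuntimeError("Test exception message")  # stand-in failure
    except Exception as e:
        logging.error(f"TimeActionsScenarioPlugin exiting due to Exception {e}")
        return 1
    return 0
```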

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`TimeActionsScenarioPlugin.run()` has a broken exception handler: it references an undefined variable `e` in the log message. This causes a secondary `NameError`, masking the original exception and potentially propagating out of `run()`.

### Issue Context
Scenario plugins are expected to catch exceptions and return non-zero exit codes.

### Fix Focus Areas
- krkn/scenario_plugins/time_actions/time_actions_scenario_plugin.py[25-45]



4. AWS ACL cleanup missing on error 🐞 Bug ⛯ Reliability
Description
ZoneOutageScenarioPlugin.network_based_zone() can change AWS network ACL associations and create
ACLs, but if an exception occurs mid-run it returns 1 without attempting to restore prior ACL
associations or delete created ACLs, risking prolonged network disruption and leaked ACLs.
Code

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[R138-169]

+                new_association_id = self.cloud_object.replace_network_acl_association(
+                    network_association_ids[0], acl_id
+                )
+
+                # capture the orginal_acl_id, created_acl_id and
+                # new association_id to use during the recovery
+                ids[new_association_id] = original_acl_id
+
+            # wait for the specified duration
            logging.info(
-                "Network association ids associated with "
-                "the subnet %s: %s" % (subnet_id, network_association_ids)
+                "Waiting for the specified duration " "in the config: %s" % duration
            )
-            
-            # Use provided default ACL if available, otherwise create a new one
-            if default_acl_id:
-                acl_id = default_acl_id
-                logging.info(
-                    "Using provided default ACL ID %s - this ACL will not be deleted after the scenario", 
-                    default_acl_id
+            time.sleep(duration)
+
+            # replace the applied acl with the previous acl in use
+            for new_association_id, original_acl_id in ids.items():
+                self.cloud_object.replace_network_acl_association(
+                    new_association_id, original_acl_id
                )
-                # Don't add to acl_ids_created since we don't want to delete user-provided ACLs at cleanup
-            else:
-                acl_id = self.cloud_object.create_default_network_acl(vpc_id)
-                logging.info("Created new default ACL %s", acl_id)
-                acl_ids_created.append(acl_id)
-
-            new_association_id = self.cloud_object.replace_network_acl_association(
-                network_association_ids[0], acl_id
+            logging.info(
+                "Wating for 60 seconds to make sure " "the changes are in place"
            )
+            time.sleep(60)

-            # capture the original_acl_id, created_acl_id and
-            # new association_id to use during the recovery
-            ids[new_association_id] = original_acl_id
-
-        # wait for the specified duration
-        logging.info(
-            "Waiting for the specified duration " "in the config: %s" % duration
-        )
-        time.sleep(duration)
-
-        # replace the applied acl with the previous acl in use
-        for new_association_id, original_acl_id in ids.items():
-            self.cloud_object.replace_network_acl_association(
-                new_association_id, original_acl_id
+            # delete the network acl created for the run
+            for acl_id in acl_ids_created:
+                self.cloud_object.delete_network_acl(acl_id)
+        except Exception as e:
+            logging.error(
+                f"Network based zone outage scenario failed with exception: {e}"
            )
-        logging.info(
-            "Waiting for 60 seconds to make sure " "the changes are in place"
-        )
-        time.sleep(60)
-
-        # delete the network acl created for the run
-        for acl_id in acl_ids_created:
-            self.cloud_object.delete_network_acl(acl_id)
+            return 1
Evidence
The method performs side-effectful operations (replace_network_acl_association /
create_default_network_acl) and only performs restore/delete in the normal path. On any exception,
it immediately returns 1; there is no finally to revert ids associations or delete
acl_ids_created. Since AWS helper methods raise RuntimeError on failure, exceptions are expected
and can occur after partial modification.

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[138-169]
krkn/scenario_plugins/node_actions/aws_node_scenarios.py[161-190]
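One possible shape for the missing rollback is a best-effort helper called from a `finally` block (the `cloud_object` methods mirror the diff above; this is a sketch, not the merged implementation):

```python
import logging

# Best-effort rollback sketch: always attempt to restore the original
# ACL associations and delete ACLs created for the run, even after a
# partial failure, and count (rather than raise on) individual errors.
def rollback_acls(cloud_object, ids, acl_ids_created):
    errors = 0
    for new_association_id, original_acl_id in ids.items():
        try:
            cloud_object.replace_network_acl_association(
                new_association_id, original_acl_id
            )
        except Exception as e:
            errors += 1
            logging.warning(
                "Could not restore association %s: %s", new_association_id, e
            )
    for acl_id in acl_ids_created:
        try:
            cloud_object.delete_network_acl(acl_id)
        except Exception as e:
            errors += 1
            logging.warning("Could not delete ACL %s: %s", acl_id, e)
    return errors
```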

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
AWS network ACL operations are not rolled back on failure. If an exception occurs after one or more subnet ACL associations have been replaced (or after ACL creation), the function returns 1 without restoring the original ACLs or deleting created ACLs.

### Issue Context
This scenario is intentionally disruptive, but it must reliably restore connectivity even when intermediate AWS calls fail.

### Fix Focus Areas
- krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py[100-171]
- krkn/scenario_plugins/node_actions/aws_node_scenarios.py[161-227]




@paigerube14 paigerube14 force-pushed the exit_codes branch 5 times, most recently from 472e8e3 to 2740f71 Compare March 27, 2026 14:21
Signed-off-by: Paige Patton <prubenda@redhat.com>
@paigerube14 paigerube14 merged commit 35ee9d7 into krkn-chaos:main Mar 31, 2026
11 of 12 checks passed
paigerube14 added a commit to paigerube14/kraken that referenced this pull request Mar 31, 2026