99 changes: 99 additions & 0 deletions EFFICIENCY_REPORT.md
@@ -0,0 +1,99 @@
# SapientML Core Efficiency Report

This report documents several areas in the codebase where efficiency improvements could be made.

## 1. Bubble Sort Algorithm in Label Ordering (High Impact)

**File:** `sapientml_core/adaptation/generation/template_based_adaptation.py`
**Lines:** 223-240

The `_sort` method orders preprocessing labels with a hand-written bubble sort, which performs O(n^2) pairwise comparisons. It could be replaced with Python's built-in `sorted()` and a custom comparator, or with a proper topological sort; either would scale better for larger label sets.

```python
def _sort(self, preprocessing_set, label_order):
    n = len(preprocessing_set)
    for i in range(n - 1):
        for j in range(0, n - i - 1):
            combination = preprocessing_set[j + 1] + "#" + preprocessing_set[j]
            if combination in label_order:
                preprocessing_set[j], preprocessing_set[j + 1] = preprocessing_set[j + 1], preprocessing_set[j]
    return preprocessing_set
```

**Recommendation:** Replace with a more efficient sorting approach using `functools.cmp_to_key` or implement a proper topological sort.
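
As a rough illustration of the topological-sort option, the sketch below uses the standard-library `graphlib` module (Python 3.9+). The function name `sort_labels` is hypothetical, and the sketch assumes that labels are unique, that they contain no `#` themselves, and that each entry of `label_order` has the form `"A#B"`, meaning A comes before B. The relative order of labels with no constraint between them is not guaranteed to match the current implementation.

```python
from graphlib import TopologicalSorter


def sort_labels(preprocessing_set, label_order):
    """Order labels so that every "A#B" constraint places A before B.

    Hypothetical helper; assumes unique labels that contain no "#".
    """
    present = set(preprocessing_set)
    # Map each label to the set of labels that must come before it.
    graph = {label: set() for label in preprocessing_set}
    for constraint in label_order:
        before, _, after = constraint.partition("#")
        # Ignore constraints that mention labels outside this set.
        if before in present and after in present and before != after:
            graph[after].add(before)
    # static_order() raises graphlib.CycleError if the constraints conflict.
    return list(TopologicalSorter(graph).static_order())
```

A topological sort runs in time roughly linear in the number of labels plus constraints, and it reports contradictory constraints explicitly instead of silently returning some order.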

## 2. Deprecated pandas Method Usage (Medium Impact)

**File:** `sapientml_core/meta_features.py`
**Line:** 209

The code uses `DataFrame.applymap()`, which has been deprecated since pandas 2.1.0 and will be removed in a future version. Its replacement, `DataFrame.map()`, is available from pandas 2.1.0 onward.

```python
is_basic_type = sampledX.applymap(
    lambda x: isinstance(x, int) or isinstance(x, float) or isinstance(x, bool) or isinstance(x, str)
)
```

**Recommendation:** Replace `applymap` with `map` for forward compatibility and to avoid deprecation warnings.
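
If the project still needs to run on pandas versions older than 2.1.0, a small compatibility shim is one possible approach. The sketch below is only an illustration: `is_basic_type_frame` and `BASIC_TYPES` are hypothetical names, and the function relies on duck typing rather than importing pandas.

```python
BASIC_TYPES = (int, float, bool, str)


def is_basic_type_frame(frame):
    """Return an element-wise boolean result marking basic Python scalars.

    Hypothetical helper: `frame` is expected to be a pandas DataFrame.
    """
    # DataFrame.map exists from pandas 2.1.0; older versions only offer applymap.
    elementwise = frame.map if hasattr(frame, "map") else frame.applymap
    # isinstance accepts a tuple of types, replacing the chained `or` checks.
    return elementwise(lambda x: isinstance(x, BASIC_TYPES))
```

With this helper, the call site would become `is_basic_type = is_basic_type_frame(sampledX)`.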

## 3. Multiple Iterations Over Same Collection (Medium Impact)

**File:** `sapientml_core/generator.py`
**Lines:** 310-322

In the `evaluate` method, the code iterates over `candidate_scripts` three times with separate list comprehensions, two of which apply the same `score is None` filter, when a single pass would suffice:

```python
error_pipelines = [pipeline for pipeline in candidate_scripts if pipeline[1].score is None]
# ...
succeeded_scripts = sorted(
    [x for x in candidate_scripts if x[1].score is not None],
    key=lambda x: x[1].score,
    reverse=(not lower_is_better),
)
failed_scripts = [x for x in candidate_scripts if x[1].score is None]
```

**Recommendation:** Use a single loop to partition the scripts into succeeded and failed lists, avoiding redundant iterations.
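
A single-pass partition could look like the sketch below. `partition_scripts` is a hypothetical helper, and the sample data uses `SimpleNamespace` objects as stand-ins for the real result objects, assuming only that each candidate is a tuple whose second element has a `score` attribute.

```python
from types import SimpleNamespace


def partition_scripts(candidate_scripts, lower_is_better=False):
    """Split candidates into (succeeded, failed) in one pass over the list."""
    succeeded, failed = [], []
    for script in candidate_scripts:
        # A missing score marks a pipeline that failed to run.
        (failed if script[1].score is None else succeeded).append(script)
    # Best score first: ascending when lower is better, otherwise descending.
    succeeded.sort(key=lambda x: x[1].score, reverse=not lower_is_better)
    return succeeded, failed


# Hypothetical stand-in data for illustration only.
candidates = [
    ("script_a", SimpleNamespace(score=0.7)),
    ("script_b", SimpleNamespace(score=None)),
    ("script_c", SimpleNamespace(score=0.9)),
]
succeeded, failed = partition_scripts(candidates, lower_is_better=False)
```

Because `error_pipelines` and `failed_scripts` are built with the same filter, the `failed` list returned here could serve both purposes.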

## 4. Redundant Path Construction (Low Impact)

**File:** `sapientml_core/seeding/predictor.py`
**Lines:** 271-286

The pickle file loading code has redundant path construction patterns:

```python
if python_minor_version in [9, 10, 11]:
    base_path = Path(os.path.dirname(__file__)) / ("../models/PY3" + str(python_minor_version))
    with open(base_path / "pp_models.pkl", "rb") as f:
        pp_model = pickle.load(f)
    with open(base_path / "mp_model_1.pkl", "rb") as f1:
        with open(base_path / "mp_model_2.pkl", "rb") as f2:
            m_model = (pickle.load(f1), pickle.load(f2))
else:
    with open(Path(os.path.dirname(__file__)) / "../models/pp_models.pkl", "rb") as f:
        pp_model = pickle.load(f)
    # ... similar pattern repeated
```

**Recommendation:** Consolidate path construction and simplify the conditional logic.
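
One way to consolidate this is to compute the model directory once and reuse a small loading helper, as in the sketch below. The names `model_dir`, `load_pickle`, and `load_models` are hypothetical, and the sketch assumes that the fallback branch loads `mp_model_1.pkl` and `mp_model_2.pkl` from the unversioned `../models` directory, which the elided part of the snippet above does not confirm.

```python
import pickle
import sys
from pathlib import Path

VERSIONED_MINORS = (9, 10, 11)


def model_dir(package_dir, python_minor_version=sys.version_info.minor):
    """Return the directory holding the pickled models for this Python version."""
    base = Path(package_dir) / ".." / "models"
    if python_minor_version in VERSIONED_MINORS:
        # Version-specific models live in a PY3<minor> subdirectory.
        base = base / f"PY3{python_minor_version}"
    return base


def load_pickle(path):
    """Load one pickle file, closing the handle afterwards."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_models(package_dir, python_minor_version=sys.version_info.minor):
    """Load the preprocessing model and the pair of model predictors."""
    base = model_dir(package_dir, python_minor_version)
    pp_model = load_pickle(base / "pp_models.pkl")
    # Assumption: the same file names are used in the unversioned directory.
    m_model = (load_pickle(base / "mp_model_1.pkl"), load_pickle(base / "mp_model_2.pkl"))
    return pp_model, m_model
```

In `predictor.py`, the call would pass `os.path.dirname(__file__)` as `package_dir`, so the version check happens once and the nested `with` blocks disappear.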

## 5. Inefficient Column-wise Processing (Low Impact)

**File:** `sapientml_core/params.py`
**Lines:** 385-394

In `summarize_dataset`, meta features are generated for each column individually in a loop:

```python
for column_name in df_train.columns:
    meta_features = generate_column_meta_features(df_train[[column_name]])
```

**Recommendation:** Consider batch processing columns where possible to reduce overhead.

## Selected Fix

For this PR, I will fix **Issue #1: Bubble Sort Algorithm**, as it offers the clearest algorithmic improvement: the number of pairwise comparisons drops from O(n^2) to O(n log n), which can have a noticeable impact for larger sets of preprocessing labels.
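
Pairwise constraints generally define only a partial order, so a comparison-based sort (whether the bubble sort or `sorted()` with a comparator) may leave some constraints unsatisfied when unrelated labels sit between constrained ones. A small check such as the sketch below can help confirm that a change in sorting strategy does not alter results on real label sets. `violated_constraints` is a hypothetical helper and assumes the `"A#B"` format means A comes before B.

```python
def violated_constraints(ordered_labels, label_order):
    """Return the "A#B" constraints that the given ordering does not satisfy."""
    position = {label: i for i, label in enumerate(ordered_labels)}
    violations = []
    for constraint in label_order:
        before, _, after = constraint.partition("#")
        # Only constraints whose labels both appear in the ordering apply.
        if before in position and after in position and position[before] > position[after]:
            violations.append(constraint)
    return violations
```

For example, `violated_constraints(["b", "c", "a"], ["a#b"])` returns `["a#b"]`, while an ordering that places `a` before `b` returns an empty list.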
44 changes: 27 additions & 17 deletions sapientml_core/adaptation/generation/template_based_adaptation.py
@@ -17,6 +17,7 @@
 import os
 import sys
 from collections import defaultdict
+from functools import cmp_to_key
 from pathlib import Path
 from typing import Optional

@@ -221,23 +222,32 @@ def _order_labels(self, preprocessing_labels, model_labels):
         return sorted_preprocessing_labels, sorted_model_labels

     def _sort(self, preprocessing_set, label_order):
-        n = len(preprocessing_set)
-
-        # Traverse through all array elements
-        for i in range(n - 1):
-            # range(n) also work but outer loop will repeat one time more than needed.
-
-            # Last i elements are already in place
-            for j in range(0, n - i - 1):
-                # traverse the array from 0 to n-i-1
-                # Swap if the element found is greater
-                # than the next element
-                combination = preprocessing_set[j + 1] + "#" + preprocessing_set[j]
-
-                if combination in label_order:
-                    # logger.debug('combination', combination)
-                    preprocessing_set[j], preprocessing_set[j + 1] = preprocessing_set[j + 1], preprocessing_set[j]
-        return preprocessing_set
+        """Sort preprocessing labels based on pairwise ordering constraints.
+
+        Uses Python's built-in sorted() with a custom comparator for O(n log n)
+        time complexity instead of the previous O(n^2) bubble sort approach.
+
+        Parameters
+        ----------
+        preprocessing_set : list
+            List of preprocessing label names to sort.
+        label_order : list
+            List of ordering constraints in "A#B" format, meaning A should come before B.
+
+        Returns
+        -------
+        list
+            Sorted list of preprocessing labels.
+        """
+
+        def compare(a, b):
+            if a + "#" + b in label_order:
+                return -1
+            elif b + "#" + a in label_order:
+                return 1
+            return 0
+
+        return sorted(preprocessing_set, key=cmp_to_key(compare))

     def _get_adaptation_metric_label(self) -> Optional[str]:
         if self.task.adaptation_metric: