
Conversation

@mohammedahmed18 (Contributor) commented Dec 10, 2025

PR Type

Enhancement


Description

  • Rank refinements by runtime and diff

  • Add normalization and weighting utilities

  • Change refinement request payload to ints

  • Cap refinements to the top 45% of candidates


Diagram Walkthrough

flowchart LR
  A["Valid optimizations"] -- "compute diff, runtime" --> B["Normalize metrics"]
  B["Normalize metrics"] -- "apply weights (runtime,diff)" --> C["Score candidates"]
  C["Score candidates"] -- "select top 45% or <=2 all" --> D["Submit refinements"]
  D["Submit refinements"] -- "parallel futures" --> E["Queue refined candidates"]
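
A minimal, self-contained sketch of this flow, assuming min-max normalization and the 45% / two-candidate thresholds described above; the function name, default weights, and parameter names here are illustrative, not the PR's actual identifiers:

# Illustrative sketch only; helper names, default weights, and thresholds are assumptions.
def rank_refinement_candidates(
    runtimes: list[float],
    diff_lens: list[int],
    runtime_weight: float = 0.5,
    diff_weight: float = 0.5,
    top_ratio: float = 0.45,
    refine_all_threshold: int = 2,
) -> list[int]:
    """Return indices of candidates to refine, best (lowest combined score) first."""
    n = len(runtimes)
    if n <= refine_all_threshold:
        return list(range(n))  # small pools: refine everything

    def minmax(values: list[float]) -> list[float]:
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]

    runtime_norm = minmax(runtimes)
    diff_norm = minmax([float(d) for d in diff_lens])
    # Lower is better for both metrics, so a lower weighted sum means a better candidate.
    scores = [runtime_weight * r + diff_weight * d for r, d in zip(runtime_norm, diff_norm)]
    top_n = max(1, int(top_ratio * n + 0.5))
    return sorted(range(n), key=lambda i: scores[i])[:top_n]

For example, rank_refinement_candidates([12.0, 9.5, 20.1], [40, 120, 15]) returns [0] under these defaults: the first candidate has neither the best runtime nor the smallest diff, but the best combined score.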

File Walkthrough

Relevant files
Enhancement
aiservice.py
Humanize runtime fields; trim logs                                             

codeflash/api/aiservice.py

  • Send runtimes as humanized strings
  • Remove extra debug/console logging
+2/-6     
code_utils.py
Utilities for weighting and normalization                               

codeflash/code_utils/code_utils.py

  • Add choose_weights helper
  • Add normalize utility (a sketch of these helpers follows this walkthrough)
  • Add weighted score dictionary builder
+57/-0   
models.py
Refiner request runtime type to int                                           

codeflash/models/models.py

  • Change refiner request runtimes to int
+2/-2     
function_optimizer.py
Weighted, selective, parallel refinement flow                       

codeflash/optimization/function_optimizer.py

  • Queue-based refinement selection by weighted score
  • Build refiner requests locally with int runtimes
  • Submit refinements selectively and in parallel
  • Inject AIS client/executor into processor
+66/-50 
Configuration changes
config_consts.py
Config for weighted refinement capping                                     

codeflash/code_utils/config_consts.py

  • Add refinement thresholds and weights
  • Introduce top-N refinement ratio
+5/-0     
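
For orientation, a hedged sketch of what the new config_consts entries and code_utils helpers listed above might look like; the names match those used in the review excerpts below, but the bodies and the weight/threshold values are assumptions, not the PR's actual code:

# Not the PR's actual implementation; a plausible shape inferred from how these names are used below.

# config_consts.py (values assumed; the description only states a 45% cap)
REFINED_CANDIDATE_RANKING_WEIGHTS = (0.5, 0.5)  # (runtime weight, diff weight)
TOP_N_REFINEMENTS = 0.45                        # keep the top 45% of scored candidates
REFINE_ALL_THRESHOLD = 2                        # refine every candidate at or below this count

# code_utils.py
def choose_weights(**named_weights: float) -> list[float]:
    """Return weights in call order so they line up with the metric lists passed alongside them."""
    return list(named_weights.values())

def normalize(values: list[float]) -> list[float]:
    """Min-max normalize to [0, 1]; a constant list collapses to all zeros (see the edge case noted below)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]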

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Variable name typo: 'top_indecies' appears to be a misspelling of 'top_indices'. While functional, this reduces readability and may cause confusion; consider renaming.

top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]

for idx in top_indecies:
    data = self.all_refinements_data[idx]
Type Mismatch

Function 'create_score_dictionary_from_metrics' declares return type dict[int, int] but builds a dict of floats; adjust the return type to dict[int, float] or cast appropriately.

def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, int]:
    """Combine multiple metrics into a single weighted score dictionary.

    Each metric is a list of values (smaller = better).
    The total score for each index is the weighted sum of its values
    across all metrics:

        score[index] = Σ (value * weight)

    Args:
        weights: A list of weights, one per metric. Larger weight = more influence.
        *metrics: Lists of values (one list per metric, aligned by index).

    Returns:
        A dictionary mapping each index to its combined weighted score.

    """
    if len(weights) != len(metrics):
        raise ValueError("Number of weights must match number of metrics")

    combined: dict[int, float] = {}

    for weight, metric in zip(weights, metrics):
        for idx, value in enumerate(metric):
            combined[idx] = combined.get(idx, 0) + value * weight

    return combined
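
For illustration (hypothetical values), the returned scores are floats even though the annotation says int, which is why dict[int, float] is the accurate return type:

weights = [0.5, 0.5]
runtime_norm = [0.0, 1.0, 0.25]
diff_norm = [1.0, 0.0, 0.5]
scores = create_score_dictionary_from_metrics(weights, runtime_norm, diff_norm)
# scores == {0: 0.5, 1: 0.5, 2: 0.375}  <- float values, not ints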
Edge Case Handling

'normalize' returns all zeros when values are constant; the ranking then ties and slicing the top N may be arbitrary. Consider deterministic tie-breaking, and ensure at least one candidate is selected when TOP_N_REFINEMENTS rounds to 0 for small lists whose length still exceeds REFINE_ALL_THRESHOLD.

runtime_w, diff_w = REFINED_CANDIDATE_RANKING_WEIGHTS
weights = choose_weights(runtime=runtime_w, diff=diff_w)

runtime_norm = normalize(runtimes_list)
diffs_norm = normalize(diff_lens_list)
# the lower the better
score_dict = create_score_dictionary_from_metrics(weights, runtime_norm, diffs_norm)
top_n_candidates = int((TOP_N_REFINEMENTS * len(runtimes_list)) + 0.5)
top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]

for idx in top_indecies:
    data = self.all_refinements_data[idx]
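
One way to address both points, as a sketch against the same variables (not a verified patch): break ties on the candidate index so the slice is deterministic, and clamp the count to at least one.

top_n_candidates = max(1, int((TOP_N_REFINEMENTS * len(runtimes_list)) + 0.5))
# Sort by (score, index) so equal scores resolve to the earliest candidate deterministically.
top_indices = sorted(score_dict, key=lambda i: (score_dict[i], i))[:top_n_candidates]

for idx in top_indices:
    data = self.all_refinements_data[idx]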

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix score dict typing and validation

The return type is declared as dict[int, int] but the values are floats from
weighted sums, which can cause type misuse. Also validate that all metric lists have
equal length to prevent silent index skew. Update the annotation to dict[int, float]
and add a uniform-length check.

codeflash/code_utils/code_utils.py [72-98]

-def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, int]:
-    """Combine multiple metrics into a single weighted score dictionary.
-
-    Each metric is a list of values (smaller = better).
-    The total score for each index is the weighted sum of its values
-    across all metrics:
-
-        score[index] = Σ (value * weight)
-
-    Args:
-        weights: A list of weights, one per metric. Larger weight = more influence.
-        *metrics: Lists of values (one list per metric, aligned by index).
-
-    Returns:
-        A dictionary mapping each index to its combined weighted score.
-
-    """
+def create_score_dictionary_from_metrics(weights: list[float], *metrics: list[float]) -> dict[int, float]:
+    """Combine multiple metrics into a single weighted score dictionary."""
     if len(weights) != len(metrics):
         raise ValueError("Number of weights must match number of metrics")
+    lengths = {len(m) for m in metrics}
+    if len(lengths) > 1:
+        raise ValueError("All metric lists must have the same length")
 
     combined: dict[int, float] = {}
-
     for weight, metric in zip(weights, metrics):
         for idx, value in enumerate(metric):
-            combined[idx] = combined.get(idx, 0) + value * weight
-
+            combined[idx] = combined.get(idx, 0.0) + value * weight
     return combined
Suggestion importance[1-10]: 7


Why: Correctly identifies a typing mismatch (dict[int, int] vs float values) and adds a sensible length validation; both improve correctness and maintainability without altering logic.

Medium
Ensure at least one refinement

When TOP_N_REFINEMENTS yields 0 (e.g., small lists or low fraction), no candidates
are refined, stalling refinement. Ensure at least one candidate is selected when
there are items, and correct the typo in top_indecies to avoid confusion. Clamp the
count to [1, len(...)].

codeflash/optimization/function_optimizer.py [205-211]

 top_n_candidates = int((TOP_N_REFINEMENTS * len(runtimes_list)) + 0.5)
-top_indecies = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
+if len(runtimes_list) > 0:
+    top_n_candidates = max(1, min(top_n_candidates, len(runtimes_list)))
+top_indices = sorted(score_dict, key=score_dict.get)[:top_n_candidates]
 
-for idx in top_indecies:
+for idx in top_indices:
     data = self.all_refinements_data[idx]
     future_refinements.append(self.refine_optimizations([data]))
Suggestion importance[1-10]: 6


Why: Clamping the number of selected candidates avoids a potential no-op when rounding yields 0 and fixes a typo; useful but not critical since a low TOP_N_REFINEMENTS might be intentional.

Low
General
Guard empty metric lists

Lower normalized values are better, since higher runtime/diff indicate worse candidates, so the scoring line is consistent with that. However, the normalization semantics should be documented explicitly and guarded against negative values or empty lists. Add an early return when the lists are empty to avoid downstream errors.

codeflash/optimization/function_optimizer.py [201-206]

-runtime_norm = normalize(runtimes_list)
-diffs_norm = normalize(diff_lens_list)
-# the lower the better
+if not runtimes_list or not diff_lens_list:
+    self.refinement_done = True
+    return self.get_next_candidate()
+
+runtime_norm = normalize(runtimes_list)  # lower is better
+diffs_norm = normalize(diff_lens_list)   # lower is better
 score_dict = create_score_dictionary_from_metrics(weights, runtime_norm, diffs_norm)
Suggestion importance[1-10]: 5


Why: Adding an explicit empty-list guard improves robustness and aligns with later use; moderate impact as upstream logic likely ensures non-empty inputs, but the check is harmless and clarifies intent.

Low

codeflash-ai bot (Contributor) commented Dec 10, 2025

⚡️ Codeflash found optimizations for this PR

📄 115% (1.15x) speedup for AiServiceClient.optimize_python_code_refinement in codeflash/api/aiservice.py

⏱️ Runtime : 32.8 milliseconds → 15.3 milliseconds (best of 65 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch limit-refined-candidates).


@KRRT7 (Contributor) left a comment

I really like this; just implement the changes that the PR review bot gave you.

@KRRT7 (Contributor) commented Dec 12, 2025

I suspect this will help with the long runtimes of our tracer-replay as well, now that I think of it.

