Skip to content

Tighten release notes prompt and clean input data#6068

Merged
gaiksaya merged 1 commit into
opensearch-project:mainfrom
andrross:tightened-prompt
Mar 31, 2026
Merged

Tighten release notes prompt and clean input data#6068
gaiksaya merged 1 commit into
opensearch-project:mainfrom
andrross:tightened-prompt

Conversation

@andrross
Copy link
Copy Markdown
Member

  • Strip Signed-off-by/Co-authored-by tags from commit messages
  • Strip HTML comments, Check List, Related Issues sections, and DCO boilerplate from PR descriptions
  • Drop PR body for dependabot PRs (upstream changelogs are noise)
  • Tighten AI_RELEASE_NOTES_PROMPT_COMMIT_OPENSEARCH: remove redundant sections, add example rewrite, use template variables for version/URL
  • Simplify Maintenance category to include all dependency updates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 31, 2026

PR Reviewer Guide 🔍

(Review updated until commit 1fe7886)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Tighten AI release notes prompt for OpenSearch commits

Relevant files:

  • src/llms/prompts.py

Sub-PR theme: Clean commit messages and PR bodies before LLM processing

Relevant files:

  • src/release_notes_workflow/release_notes.py

⚡ Recommended focus areas for review

Regex Bug

The _clean_message regex uses a non-greedy match with a lookahead that may not correctly strip all Signed-off-by/Co-authored-by lines, especially when they appear at the end of the string. The \s*$ in the lookahead could cause the last tag to be missed or partially matched. This should be tested with edge cases like a single tag at the end, multiple consecutive tags, and tags with multi-word values.

text = re.sub(r'\s*(Signed-off-by|Co-authored-by):\s*\S+.*?(?=\s*(?:Signed-off-by|Co-authored-by):|\s*$)', '', text)
return re.sub(r'\s{2,}', ' ', text).strip()
Label Check Logic

The dependabot check uses 'dependabot' not in labels where labels is a list of strings. This checks if the string 'dependabot' is an exact element of the list, but dependabot labels are typically something like 'dependencies' or the author is dependabot[bot]. If the intent is to detect dependabot PRs by label, this check may never match and the PR body will always be included for dependabot PRs.

if pr_body and 'dependabot' not in labels:
DCO Regex Fragility

The DCO boilerplate regex r'By submitting this pull request.*?Apache 2\.0 license\.?.*?(?=\n\n|\n#|\Z)' uses re.DOTALL with a non-greedy .*? followed by another .*?, which could be slow or match too little/too much depending on the PR body content. If the boilerplate text varies slightly (e.g., different line endings or wording), it may not be stripped.

text = re.sub(r'By submitting this pull request.*?Apache 2\.0 license\.?.*?(?=\n\n|\n#|\Z)', '', text, flags=re.DOTALL)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 31, 2026

PR Code Suggestions ✨

Latest suggestions up to 1fe7886
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix dependabot label detection using substring match

The labels field is a list of label strings, so checking 'dependabot' not in labels
tests for exact equality with a list element. However, dependabot labels are
typically named something like "dependencies" or contain "dependabot" as a
substring. The check should use any() with a substring match to correctly detect
dependabot PRs.

src/release_notes_workflow/release_notes.py [174]

-if pr_body and 'dependabot' not in labels:
+if pr_body and not any('dependabot' in label.lower() for label in labels):
Suggestion importance[1-10]: 7

__

Why: The current check 'dependabot' not in labels tests for exact list membership, but dependabot-related labels (e.g., "dependencies") don't contain the string "dependabot" as an element. The suggested any() with substring match is more robust for detecting dependabot PRs, though the actual label naming convention matters here.

Medium
Fix regex flag for multiline line-end matching

The current regex uses \s$ as an end anchor, but without the re.MULTILINE flag, $
matches only the end of the entire string, not the end of each line. This means the
pattern may fail to strip Signed-off-by/Co-authored-by lines that appear in the
middle of a multi-line message. Add re.MULTILINE to ensure $ matches end-of-line.
*

src/release_notes_workflow/release_notes.py [37]

-text = re.sub(r'\s*(Signed-off-by|Co-authored-by):\s*\S+.*?(?=\s*(?:Signed-off-by|Co-authored-by):|\s*$)', '', text)
+text = re.sub(r'\s*(Signed-off-by|Co-authored-by):\s*\S+.*?(?=\s*(?:Signed-off-by|Co-authored-by):|\s*$)', '', text, flags=re.MULTILINE)
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that without re.MULTILINE, $ only matches the end of the entire string, which could cause the regex to miss Signed-off-by/Co-authored-by lines in the middle of multi-line messages. However, the regex also uses re.DOTALL behavior implicitly via .*?, so the actual impact depends on the input format.

Low

Previous suggestions

Suggestions up to commit e85a8de
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix dependabot label check to use substring matching

The labels variable is a list, so 'dependabot' not in labels checks for an exact
element match. Dependabot labels are typically strings like "dependencies" or
contain "dependabot" as a substring, not the exact string "dependabot". This
condition will never filter out dependabot PRs as intended. Use a substring check
across all label values instead.

src/release_notes_workflow/release_notes.py [174]

-if pr_body and 'dependabot' not in labels:
+if pr_body and not any('dependabot' in label.lower() for label in labels):
Suggestion importance[1-10]: 8

__

Why: The labels list contains strings like "dependencies" rather than the exact string "dependabot", so 'dependabot' not in labels will never match and dependabot PRs won't be filtered as intended. The fix using any('dependabot' in label.lower() for label in labels) correctly performs substring matching.

Medium
Fix regex to correctly strip sign-off lines

The regex uses [^|]*? which stops at pipe characters, potentially leaving partial
lines in the output. Commit messages rarely contain pipes, but author names or
emails might. Use a pattern that matches until end-of-line or the next tag instead.

src/release_notes_workflow/release_notes.py [37]

-text = re.sub(r'\s*(Signed-off-by|Co-authored-by):\s*[^|]*?(?=\s*(Signed-off-by|Co-authored-by):|\s*$)', '', text)
+text = re.sub(r'\s*(Signed-off-by|Co-authored-by):[^\n]*', '', text)
Suggestion importance[1-10]: 6

__

Why: The current regex uses [^|]*? which stops at pipe characters, potentially leaving partial content. The suggested fix using [^\n]* is more robust and correctly matches to end-of-line, which is the standard behavior for stripping sign-off lines.

Low
General
Make DCO boilerplate removal more robust

The DCO boilerplate regex requires the phrase "signing off your commits" to appear
in the same match, making it fragile if the boilerplate wording varies slightly. A
more robust approach matches from the known start phrase to the end of the paragraph
or a closing parenthesis.

src/release_notes_workflow/release_notes.py [50]

-text = re.sub(r'By submitting this pull request.*?signing off your commits.*?\)\.?\s*', '', text, flags=re.DOTALL)
+text = re.sub(r'By submitting this pull request[^.]*\..*?\)\s*', '', text, flags=re.DOTALL)
Suggestion importance[1-10]: 3

__

Why: The current regex requires the exact phrase "signing off your commits" which could fail if the boilerplate wording varies slightly. However, the improved regex [^.]*\..*?\) is also fragile and may match unintended content, making this a marginal improvement.

Low

@andrross
Copy link
Copy Markdown
Member Author

@gaiksaya @rishabh6788 Kiro suggested improvements to the prompt to be more direct and remove redundancy. I figure kiro knows what the models want :) The only substantive change in the prompt is to remove exclusion rule for "Dependency bumps that only affect test fixtures or build tooling". That actually seems like a difficult judgement call and there seems to be little harm to including all dependency updates so on balance it seems simpler without it.

Stripping things like the checklist and DCO out of the input data actually cuts down on the amount of text to bedrock by half, which should help since that was just noise.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.60%. Comparing base (bf58274) to head (1fe7886).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6068      +/-   ##
==========================================
- Coverage   96.61%   96.60%   -0.01%     
==========================================
  Files         405      405              
  Lines       18851    18906      +55     
==========================================
+ Hits        18212    18265      +53     
- Misses        639      641       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Strip Signed-off-by/Co-authored-by tags from commit messages
- Strip HTML comments, Check List, Related Issues sections, and DCO
  boilerplate from PR descriptions
- Drop PR body for dependabot PRs (upstream changelogs are noise)
- Tighten AI_RELEASE_NOTES_PROMPT_COMMIT_OPENSEARCH: remove redundant
  sections, add example rewrite, use template variables for version/URL
- Simplify Maintenance category to include all dependency updates

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 1fe7886

Comment thread src/llms/prompts.py
@gaiksaya gaiksaya merged commit d024259 into opensearch-project:main Mar 31, 2026
17 checks passed
@github-project-automation github-project-automation Bot moved this from 👀 In Review to ✅ Done in Engineering Effectiveness Board Mar 31, 2026
@andrross andrross deleted the tightened-prompt branch March 31, 2026 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

2 participants