
Fix/cleanup spam list code#12050

Open
shoaib-inamdar wants to merge 2 commits into internetarchive:master from shoaib-inamdar:fix/cleanup_spam_list-code

Conversation

@shoaib-inamdar
Contributor

Closes #11905

fix: add a cleanup script for spam lists

Technical

Testing

Screenshot

Stakeholders

@shoaib-inamdar shoaib-inamdar force-pushed the fix/cleanup_spam_list-code branch from c40c087 to a9b3c10 Compare March 12, 2026 15:42
@shoaib-inamdar
Copy link
Contributor Author

@cdrini, does this script help with cleaning up spam lists from the database?

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Mar 13, 2026
@jimchamp jimchamp marked this pull request as ready for review March 20, 2026 20:17
Copilot AI review requested due to automatic review settings March 20, 2026 20:17
Contributor

Copilot AI left a comment


Pull request overview

Adds a one-time maintenance script to remove existing spam lists from Open Library by finding /type/list records whose name/description match configured spam words and marking those lists as deleted, helping address spam list search results (Issue #11905).

Changes:

  • Added scripts/cleanup_spam_lists.py to scan all lists and detect spam using the spamwords store document.
  • Added deletion flow (with --dry-run) that saves /type/delete docs for matching lists.
  • Added operator output/usage notes, including reminder to reindex Solr after deletions.
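The detection step described in the overview boils down to matching a list's name and description against the configured spam words. A minimal sketch of what `is_spam_list_content` might look like — the function name appears in the diff, but this matching logic (case-insensitive substring search) is an assumption, not the PR's confirmed implementation:

```python
def is_spam_list_content(name: str, description: str, spam_words: list[str]) -> bool:
    """Return True if the list's name or description contains any configured spam word.

    NOTE: illustrative sketch only; the matching logic in the actual PR may differ.
    """
    text = f"{name} {description}".casefold()
    return any(word.casefold() in text for word in spam_words)

print(is_spam_list_content("Buy cheap pills", "great deals", ["cheap pills"]))   # True
print(is_spam_list_content("My sci-fi shelf", "favorite books", ["cheap pills"]))  # False
```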

Comment on lines +73 to +96
def main():
    parser = argparse.ArgumentParser(description='Clean up spam lists')
    parser.add_argument(
        '--dry-run',
        action='store_true',
        help='Print what would be deleted without deleting',
    )
    args = parser.parse_args()

    spam_words = get_spam_words()
    if not spam_words:
        print("No spam words configured. Exiting.")
        return

    print(f"Scanning for spam lists using {len(spam_words)} spam words...")
    spam_lists = find_spam_lists(spam_words)
    print(f"Found {len(spam_lists)} spam lists.")

    for item in spam_lists:
        action = '[DRY RUN] Would delete' if args.dry_run else 'Deleting'
        print(f"  {action}: {item['key']} — \"{item['name']}\"")
        if not args.dry_run:
            delete_list(item['key'])


Copilot AI Mar 20, 2026


This script performs destructive actions by default when run without flags. For one-time cleanup scripts, it’s safer to default to dry-run/test mode and require an explicit confirmation flag (e.g. --delete/--confirm) before calling delete_list(...).

Copilot uses AI. Check for mistakes.
Comment on lines +67 to +71
def delete_list(key: str) -> None:
    """Mark a list as deleted."""
    doc = {'key': key, 'type': {'key': '/type/delete'}}
    web.ctx.site.save(doc, action='lists', comment='Spam cleanup (#11905)')


Copilot AI Mar 20, 2026


Bulk writes via web.ctx.site.save(...) are typically performed under a known account context (e.g. with RunAs('ImportBot'):) and with web.ctx.ip set, so the changeset is attributed/authorized consistently (see patterns in other scripts). As written, deletions may be unauthenticated or attributed inconsistently depending on runtime configuration.

Comment on lines +25 to +31
# Bootstrap the openlibrary app context
sys.path.insert(0, '.')
import infogami
from infogami import config

config.db_parameters = {'dbn': 'postgres', 'db': 'openlibrary'}
infogami._setup()

Copilot AI Mar 20, 2026


This script bootstraps Infogami by mutating infogami.config.db_parameters and calling infogami._setup() at import time. That bypasses the repo’s typical script pattern of import _init_path + load_config(openlibrary.yml) + infogami._setup() inside main() (e.g. scripts/update_stale_work_references.py:13-16) and makes it easy to run against an unintended environment/DB (also, importing the module has side effects). Prefer taking a config path argument, calling openlibrary.config.load_config(...) and infogami._setup() inside main(), and using _init_path/scripts._init_path for PYTHONPATH setup.

@@ -0,0 +1,102 @@
#!/usr/bin/env python

Copilot AI Mar 20, 2026


The shebang uses python, which may resolve to Python 2 on some systems. Since the repo is Python 3-only, update this to python3 (consistent with most other scripts under scripts/).

Comment on lines +55 to +62
for key in results:
    doc = web.ctx.site.get(key)
    if doc is None:
        continue
    name = doc.get('name') or ''
    description = str(doc.get('description') or '')
    if is_spam_list_content(name, description, spam_words):
        spam_lists.append({'key': key, 'name': name})

Copilot AI Mar 20, 2026


find_spam_lists does an N+1 fetch: for each batch of keys returned by site.things(...) it calls web.ctx.site.get(key) individually. For ~250k lists this will be very slow and can put unnecessary load on infobase. Prefer using web.ctx.site.get_many(results) (or batching results further if needed) and iterating the returned docs.

Suggested change
- for key in results:
-     doc = web.ctx.site.get(key)
-     if doc is None:
-         continue
-     name = doc.get('name') or ''
-     description = str(doc.get('description') or '')
-     if is_spam_list_content(name, description, spam_words):
-         spam_lists.append({'key': key, 'name': name})
+ # Fetch all documents for this batch of keys in a single call to avoid N+1 queries.
+ docs = web.ctx.site.get_many(results)
+ for doc in docs:
+     if doc is None:
+         continue
+     name = doc.get('name') or ''
+     description = str(doc.get('description') or '')
+     if is_spam_list_content(name, description, spam_words):
+         spam_lists.append({'key': doc.get('key'), 'name': name})

shoaib-inamdar added a commit to shoaib-inamdar/openlibrary that referenced this pull request Mar 21, 2026
)

Addresses all Copilot review suggestions:
- Use python3 shebang (consistent with other scripts)
- Use _init_path + load_config(ol_config) + infogami._setup() inside
  main() instead of bootstrapping at module level (see update_stale_work_references.py)
- Accept config file path as positional argument
- Default to dry-run mode; require explicit --confirm flag for deletions
- Attribute deletions to ImportBot with web.ctx.ip for consistent auditing
- Fetch docs in batches via site.get_many() instead of N+1 site.get() calls
@shoaib-inamdar shoaib-inamdar force-pushed the fix/cleanup_spam_list-code branch from a9b3c10 to 5d0bcaa Compare March 21, 2026 15:09
@shoaib-inamdar shoaib-inamdar force-pushed the fix/cleanup_spam_list-code branch from 868f2b8 to fbffad6 Compare March 21, 2026 19:16
@shoaib-inamdar
Copy link
Contributor Author

@jimchamp, I have made the changes Copilot requested. Please have a look and let me know if any further changes are needed.


Labels

Needs: Response Issues which require feedback from lead

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove Spam Lists

3 participants