Fix/cleanup spam list code#12050
Fix/cleanup spam list code#12050shoaib-inamdar wants to merge 2 commits intointernetarchive:masterfrom
Conversation
c40c087 to
a9b3c10
Compare
|
@cdrini , does this code script , helps in working with the cleanup of spam lists from the database? |
There was a problem hiding this comment.
Pull request overview
Adds a one-time maintenance script to remove existing spam lists from Open Library by finding /type/list records whose name/description match configured spam words and marking those lists as deleted, helping address spam list search results (Issue #11905).
Changes:
- Added
scripts/cleanup_spam_lists.pyto scan all lists and detect spam using thespamwordsstore document. - Added deletion flow (with
--dry-run) that saves/type/deletedocs for matching lists. - Added operator output/usage notes, including reminder to reindex Solr after deletions.
| def main(): | ||
| parser = argparse.ArgumentParser(description='Clean up spam lists') | ||
| parser.add_argument( | ||
| '--dry-run', | ||
| action='store_true', | ||
| help='Print what would be deleted without deleting', | ||
| ) | ||
| args = parser.parse_args() | ||
|
|
||
| spam_words = get_spam_words() | ||
| if not spam_words: | ||
| print("No spam words configured. Exiting.") | ||
| return | ||
|
|
||
| print(f"Scanning for spam lists using {len(spam_words)} spam words...") | ||
| spam_lists = find_spam_lists(spam_words) | ||
| print(f"Found {len(spam_lists)} spam lists.") | ||
|
|
||
| for item in spam_lists: | ||
| action = '[DRY RUN] Would delete' if args.dry_run else 'Deleting' | ||
| print(f" {action}: {item['key']} — \"{item['name']}\"") | ||
| if not args.dry_run: | ||
| delete_list(item['key']) | ||
|
|
There was a problem hiding this comment.
This script performs destructive actions by default when run without flags. For one-time cleanup scripts, it’s safer to default to dry-run/test mode and require an explicit confirmation flag (e.g. --delete/--confirm) before calling delete_list(...).
scripts/cleanup_spam_lists.py
Outdated
| def delete_list(key: str) -> None: | ||
| """Mark a list as deleted.""" | ||
| doc = {'key': key, 'type': {'key': '/type/delete'}} | ||
| web.ctx.site.save(doc, action='lists', comment='Spam cleanup (#11905)') | ||
|
|
There was a problem hiding this comment.
Bulk writes via web.ctx.site.save(...) are typically performed under a known account context (e.g. with RunAs('ImportBot'):) and with web.ctx.ip set, so the changeset is attributed/authorized consistently (see patterns in other scripts). As written, deletions may be unauthenticated or attributed inconsistently depending on runtime configuration.
scripts/cleanup_spam_lists.py
Outdated
| # Bootstrap the openlibrary app context | ||
| sys.path.insert(0, '.') | ||
| import infogami | ||
| from infogami import config | ||
|
|
||
| config.db_parameters = {'dbn': 'postgres', 'db': 'openlibrary'} | ||
| infogami._setup() |
There was a problem hiding this comment.
This script bootstraps Infogami by mutating infogami.config.db_parameters and calling infogami._setup() at import time. That bypasses the repo’s typical script pattern of import _init_path + load_config(openlibrary.yml) + infogami._setup() inside main() (e.g. scripts/update_stale_work_references.py:13-16) and makes it easy to run against an unintended environment/DB (also, importing the module has side effects). Prefer taking a config path argument, calling openlibrary.config.load_config(...) and infogami._setup() inside main(), and using _init_path/scripts._init_path for PYTHONPATH setup.
scripts/cleanup_spam_lists.py
Outdated
| @@ -0,0 +1,102 @@ | |||
| #!/usr/bin/env python | |||
There was a problem hiding this comment.
The shebang uses python, which may resolve to Python 2 on some systems. Since the repo is Python 3-only, update this to python3 (consistent with most other scripts under scripts/).
scripts/cleanup_spam_lists.py
Outdated
| for key in results: | ||
| doc = web.ctx.site.get(key) | ||
| if doc is None: | ||
| continue | ||
| name = doc.get('name') or '' | ||
| description = str(doc.get('description') or '') | ||
| if is_spam_list_content(name, description, spam_words): | ||
| spam_lists.append({'key': key, 'name': name}) |
There was a problem hiding this comment.
find_spam_lists does an N+1 fetch: for each batch of keys returned by site.things(...) it calls web.ctx.site.get(key) individually. For ~250k lists this will be very slow and can put unnecessary load on infobase. Prefer using web.ctx.site.get_many(results) (or batching results further if needed) and iterating the returned docs.
| for key in results: | |
| doc = web.ctx.site.get(key) | |
| if doc is None: | |
| continue | |
| name = doc.get('name') or '' | |
| description = str(doc.get('description') or '') | |
| if is_spam_list_content(name, description, spam_words): | |
| spam_lists.append({'key': key, 'name': name}) | |
| # Fetch all documents for this batch of keys in a single call to avoid N+1 queries. | |
| docs = web.ctx.site.get_many(results) | |
| for doc in docs: | |
| if doc is None: | |
| continue | |
| name = doc.get('name') or '' | |
| description = str(doc.get('description') or '') | |
| if is_spam_list_content(name, description, spam_words): | |
| spam_lists.append({'key': doc.get('key'), 'name': name}) |
) Addresses all Copilot review suggestions: - Use python3 shebang (consistent with other scripts) - Use _init_path + load_config(ol_config) + infogami._setup() inside main() instead of bootstrapping at module level (see update_stale_work_references.py) - Accept config file path as positional argument - Default to dry-run mode; require explicit --confirm flag for deletions - Attribute deletions to ImportBot with web.ctx.ip for consistent auditing - Fetch docs in batches via site.get_many() instead of N+1 site.get() calls
a9b3c10 to
5d0bcaa
Compare
) Addresses all Copilot review suggestions: - Use python3 shebang (consistent with other scripts) - Use _init_path + load_config(ol_config) + infogami._setup() inside main() instead of bootstrapping at module level (see update_stale_work_references.py) - Accept config file path as positional argument - Default to dry-run mode; require explicit --confirm flag for deletions - Attribute deletions to ImportBot with web.ctx.ip for consistent auditing - Fetch docs in batches via site.get_many() instead of N+1 site.get() calls
868f2b8 to
fbffad6
Compare
for more information, see https://pre-commit.ci
|
@jimchamp , i have made copilot requested changes , have a look at it and let me know if some new changes are needed |
Closes #11905
fix : added a cleanup script for spam lists
Technical
Testing
Screenshot
Stakeholders