A Python bot that automatically identifies and removes duplicate author entries from works in the Open Library database.
This bot addresses a data quality issue in Open Library where some works have the same author listed multiple times. The bot:
- Processes a JSON file containing works with duplicate authors
- Fetches each work's data from Open Library
- Removes duplicate author entries while preserving the original order
- Updates the works with cleaned author lists
- Provides detailed logging and statistics
Open Library works sometimes contain duplicate author IDs in their authors field. For example:
- Work: OL39584341W - Waarheen met Brussel?
- Issue: The same author appears twice in the authors list
This bot was created to clean up the 3,949 affected works identified through data analysis.
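As a rough illustration of the issue, the sketch below finds author IDs that occur more than once in a list. The function name `find_duplicate_author_keys` is hypothetical and is not taken from the bot's source; it only shows one way such duplicates could be detected.

```python
from collections import Counter


def find_duplicate_author_keys(author_keys):
    """Return the author keys that occur more than once, in first-seen order.

    Hypothetical helper for illustration; not part of the bot script.
    """
    counts = Counter(author_keys)
    reported = set()
    duplicates = []
    for key in author_keys:
        # Report each repeated key once, at the position it first appears.
        if counts[key] > 1 and key not in reported:
            reported.add(key)
            duplicates.append(key)
    return duplicates


authors = ["/authors/OL3308154A", "/authors/OL3308154A"]
print(find_duplicate_author_keys(authors))
```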
- Dry Run Mode: Test without making actual changes
- Rate Limiting: Respects server resources with configurable delays
- Comprehensive Logging: Both console and file logging
- Progress Tracking: Real-time progress updates
- Error Handling: Gracefully handles API errors and edge cases
- Statistics: Detailed summary of operations performed
- Batch Processing: Process all works or limit to a test batch
- Python 3.7+
- Virtual environment (recommended)
- Clone the repository
git clone https://github.com/yourusername/openlibrary_bot.git
cd openlibrary_bot
- Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install requests
- Add your data file
  - Place duplicate_authors.json in the project root
  - This file should contain the list of works with duplicate authors
Edit the configuration section in remove_duplicate_authors.py:
# Bot credentials
BOT_USERNAME = "YourBotUsername" # Your Open Library bot account
BOT_PASSWORD = "YourBotPassword" # Your bot account password
# File paths
JSON_FILE = "duplicate_authors.json" # Path to your data file
# Execution settings
DRY_RUN = True # True = simulation only, False = make actual changes
MAX_WORKS_TO_PROCESS = 10 # Number of works to process (None = all)
DELAY_BETWEEN_REQUESTS = 2.0 # Seconds between API calls
The bot expects a JSON file with the following structure:
[
{
"work_id": "/works/OL26463951W",
"duplicate_author_ids": [
"/authors/OL3308154A"
],
"all_authors": [
"/authors/OL3308154A",
"/authors/OL3308154A"
]
},
...
]
Test with a small batch without making any changes:
python remove_duplicate_authors.py
This will:
- Process the first 10 works (configurable)
- Show what changes would be made
- Generate a log file with detailed information
- NOT make any actual updates
Check the generated log file (e.g., bot_run_20251017_005159.log):
- Verify the bot is detecting duplicates correctly
- Check for any errors or warnings
- Review the statistics
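For longer runs, reading the log by eye gets tedious. The sketch below counts log lines per level and collects warning and error messages. It assumes the `timestamp - LEVEL - message` layout seen in the bot's console output and the standard Python `logging` level names; `summarize_log` is a hypothetical helper, not part of the bot.

```python
import re

# Matches lines such as "2025-10-17 00:52:02 - INFO - Login successful!"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\w+) - (.*)$")


def summarize_log(lines):
    """Count log lines per level and collect WARNING/ERROR/CRITICAL messages."""
    counts = {}
    problems = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # banner and separator lines do not follow the format
        _, level, message = match.groups()
        counts[level] = counts.get(level, 0) + 1
        if level in ("WARNING", "ERROR", "CRITICAL"):
            problems.append(message)
    return counts, problems


sample = [
    "2025-10-17 00:52:02 - INFO - Login successful!",
    "2025-10-17 00:52:02 - INFO - Loaded 3949 entries from JSON",
    # Hypothetical error line, invented here only to exercise the parser.
    "2025-10-17 00:52:05 - ERROR - example failure message",
]
counts, problems = summarize_log(sample)
print(counts)
print(problems)
```

To run it against a real log, pass an open file object instead of the sample list, for example `summarize_log(open("bot_run_20251017_005159.log"))`.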
After verifying dry run results:
DRY_RUN = False
MAX_WORKS_TO_PROCESS = 5 # Test with just 5 works
python remove_duplicate_authors.py
Manually verify the changes on Open Library.
Once confident everything works:
DRY_RUN = False
MAX_WORKS_TO_PROCESS = None # Process all works
python remove_duplicate_authors.py
======================================================================
Open Library Duplicate Authors Removal Bot
======================================================================
Mode: DRY RUN (simulation only)
JSON File: duplicate_authors.json
Max works to process: 10
======================================================================
2025-10-17 00:51:59 - INFO - Attempting to login as DuplicateRemoverBot...
2025-10-17 00:52:02 - INFO - Login successful!
2025-10-17 00:52:02 - INFO - Loaded 3949 entries from JSON
2025-10-17 00:52:02 - INFO - Processing first 10 works only (test mode)
2025-10-17 00:52:02 - INFO - Found 10 works to process
2025-10-17 00:52:02 - INFO - Starting processing...
2025-10-17 00:52:02 - INFO - Progress: 1/10
2025-10-17 00:52:02 - INFO - Processing work: OL26463951W
2025-10-17 00:52:02 - INFO - Original author count: 2
2025-10-17 00:52:02 - INFO - Removed 1 duplicate author(s)
2025-10-17 00:52:02 - INFO - New author count: 1
2025-10-17 00:52:02 - INFO - [DRY RUN] Would update work OL26463951W
==================================================
FINAL STATISTICS
==================================================
Total works processed: 10
Successful updates: 8
Failed updates: 0
Skipped works: 2
Total duplicate authors removed: 8
==================================================
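In the sample summary above, the processed total equals successful plus failed plus skipped works (8 + 0 + 2 = 10). The sketch below models the counters that way. The `RunStats` class and its field names are hypothetical, and the assumption that the three outcomes always sum to the total comes from this one sample rather than from the bot's source.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical container for the counters shown in the final summary."""
    successful: int = 0
    failed: int = 0
    skipped: int = 0
    duplicates_removed: int = 0

    @property
    def processed(self):
        # Assumption: every processed work ends in exactly one of three outcomes.
        return self.successful + self.failed + self.skipped


stats = RunStats(successful=8, failed=0, skipped=2, duplicates_removed=8)
print(f"Total works processed: {stats.processed}")
```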
- Go to Open Library Account Creation
- Create a new account with a descriptive name (e.g., DuplicateAuthorRemoverBot)
- Use a valid email address
Before running in production:
- Contact the Open Library team
- Explain your bot's purpose:
Hi! I've created a bot to remove duplicate author entries from ~3,949 works.
The bot has been tested in dry-run mode and I'd like to request bot permissions
to perform the cleanup. Repository: [your-repo-link]
- Wait for approval before running in live mode
- Authentication: Logs into Open Library using bot credentials
- Data Loading: Reads the JSON file with works containing duplicates
- Work Processing: For each work:
- Fetches current work data from Open Library API
- Identifies duplicate authors in the authors field
- Creates a cleaned authors list (keeping first occurrence)
- Updates the work (if not in dry-run mode)
- Rate Limiting: Waits between requests to respect server resources
- Logging: Records all operations to both console and log file
- Statistics: Provides summary of operations performed
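The core step, building a cleaned authors list that keeps the first occurrence of each author, can be sketched as follows. This is a minimal illustration and not the bot's actual code: `author_key` and `dedupe_authors` are hypothetical names, and the sketch assumes each entry is either a plain key string (as in the input JSON) or a role dict with a nested `"author": {"key": ...}` value.

```python
def author_key(entry):
    """Extract the author key from one authors entry.

    Assumption: an entry is a key string or a dict like
    {"type": {...}, "author": {"key": "/authors/OL...A"}}.
    """
    if isinstance(entry, str):
        return entry
    author = entry.get("author") if isinstance(entry, dict) else None
    if isinstance(author, dict):
        return author.get("key")
    return author


def dedupe_authors(authors):
    """Return a new list without repeated authors, preserving original order."""
    seen = set()
    cleaned = []
    for entry in authors:
        key = author_key(entry)
        if key is None:
            cleaned.append(entry)  # keep entries we cannot interpret
            continue
        if key in seen:
            continue  # a later occurrence of an author already kept
        seen.add(key)
        cleaned.append(entry)
    return cleaned


authors = [
    {"type": {"key": "/type/author_role"}, "author": {"key": "/authors/OL3308154A"}},
    {"type": {"key": "/type/author_role"}, "author": {"key": "/authors/OL3308154A"}},
]
cleaned = dedupe_authors(authors)
print(f"Removed {len(authors) - len(cleaned)} duplicate author(s)")
```

A set tracks the keys already seen while a separate list preserves order; deduplicating with `set(authors)` alone would lose the original ordering and would fail on dict entries, which are not hashable.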
openlibrary_bot/
├── remove_duplicate_authors.py # Main bot script
├── duplicate_authors.json # Input data (works with duplicates)
├── bot_run_*.log # Generated log files
├── venv/ # Virtual environment (not in repo)
└── README.md # This file
- Always test in dry-run mode first
- Start with small batches before processing all works
- Respect rate limits - default is 2 seconds between requests
- Review logs regularly to catch any issues
- Get bot permissions before running in production
- Estimated time: ~2.2 hours for all 3,949 works (with 2s delays)
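The time estimate follows from the delay alone: 3,949 works at 2 seconds each is 7,898 seconds, or about 2.2 hours. The sketch below reproduces that arithmetic; `estimated_hours` is a hypothetical helper, and the optional `per_request_seconds` parameter is an assumption added to show how network time would lengthen a real run.

```python
def estimated_hours(work_count, delay_seconds, per_request_seconds=0.0):
    """Estimate total run time in hours for a sequential, rate-limited run."""
    return work_count * (delay_seconds + per_request_seconds) / 3600


print(f"{estimated_hours(3949, 2.0):.1f} hours")  # delays only
print(f"{estimated_hours(3949, 2.0, per_request_seconds=1.0):.1f} hours")  # with an assumed 1s per request
```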
- Verify your username and password are correct
- Check if your account has bot permissions
- Ensure you're not being rate limited
- The work may have been deleted or merged
- The work ID might be incorrect in the JSON file
- Your account may not have bot permissions
- You might not be logged in correctly
- Increase the DELAY_BETWEEN_REQUESTS value
- Process in smaller batches
- Open Library Developer Docs
- Open Library API Documentation
- Writing Bots for Open Library
- Open Library Data Dumps
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Nishant Singh
- GitHub: @NishantSinghhhhh
- Open Library team for maintaining the database
- Ray Berger for identifying the duplicate author issue
- Internet Archive for hosting Open Library
- Total works affected: 3,949
- Date identified: October 2024
- Status: Ready for cleanup
Note: This bot is designed to improve data quality in Open Library. Always test thoroughly before running in production mode.