This repository contains Python-based scrapers for extracting product listings and detailed product information from Tokopedia. These scrapers leverage the Crawlbase Crawling API to handle JavaScript rendering, CAPTCHA challenges, and anti-bot protections. The extracted data is processed using BeautifulSoup for HTML parsing and Pandas for structured storage.
➡ Read the full blog here to learn more.
The Tokopedia Product Listing Scraper (tokopedia_listing_scraper.py) extracts:
- Product Name
- Price
- Product URL
- Shop Name
The scraper follows pagination to collect results beyond the first page, and saves the extracted data to a JSON file.
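As a rough illustration of the pagination and JSON output described above, the sketch below walks numbered result pages until a page returns no products. It is not the code in tokopedia_listing_scraper.py: the search URL, the `page` query parameter, the `fetch_page` callable, and the function names are assumptions made for this example. Only the default output filename comes from this README.

```python
import json
from urllib.parse import urlencode

# Assumed search endpoint and paging parameter; verify both against the live site.
SEARCH_URL = "https://www.tokopedia.com/search"


def build_search_url(query, page):
    """Return the search URL for one results page (hypothetical URL scheme)."""
    return f"{SEARCH_URL}?{urlencode({'q': query, 'page': page})}"


def scrape_all_pages(query, fetch_page, max_pages=5):
    """Collect listings page by page.

    fetch_page is any callable that takes a URL and returns a list of
    product dicts; in a real run it would fetch and parse the page.
    """
    results = []
    for page in range(1, max_pages + 1):
        products = fetch_page(build_search_url(query, page))
        if not products:  # an empty page is treated as the end of the results
            break
        results.extend(products)
    return results


def save_to_json(data, path="tokopedia_search_results.json"):
    """Write the collected records to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```

Passing the fetcher in as an argument keeps the paging loop independent of the network, so it can be exercised with a stub before spending API requests.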
The Tokopedia Product Detail Scraper (tokopedia_product_scraper.py) extracts detailed product information, including:
- Product Name
- Store Name
- Full Description
- Price
- Image URLs
The extracted data is saved in a JSON file.
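The fields listed above could be pulled out of a product page with BeautifulSoup roughly as follows. This is a minimal sketch, not the logic of tokopedia_product_scraper.py: the `data-testid` selectors, the dictionary keys, and the sample HTML are unverified assumptions, and Tokopedia's markup may differ or change.

```python
from bs4 import BeautifulSoup

# Selectors are assumptions for illustration; inspect the live page for the real ones.
SELECTORS = {
    "name": 'h1[data-testid="lblPDPDetailProductName"]',
    "price": 'div[data-testid="lblPDPDetailProductPrice"]',
    "store_name": 'a[data-testid="llbPDPFooterShopName"]',
    "description": 'div[data-testid="lblPDPDescriptionProduk"]',
}
IMAGE_SELECTOR = 'button[data-testid="PDPImageThumbnail"] img'


def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_product(html):
    """Extract the product fields from a page's HTML into a plain dict."""
    soup = BeautifulSoup(html, "html.parser")
    product = {field: text_or_none(soup, css) for field, css in SELECTORS.items()}
    product["image_urls"] = [
        img["src"] for img in soup.select(IMAGE_SELECTOR) if img.get("src")
    ]
    return product


# Made-up markup that mirrors the assumed selectors, for a dry run without network access.
SAMPLE_HTML = """
<h1 data-testid="lblPDPDetailProductName">Wireless Headset</h1>
<div data-testid="lblPDPDetailProductPrice">Rp250.000</div>
<a data-testid="llbPDPFooterShopName"><h2>Example Store</h2></a>
<div data-testid="lblPDPDescriptionProduk">Over-ear headset with microphone.</div>
<button data-testid="PDPImageThumbnail"><img src="https://example.com/img1.jpg"></button>
"""

product = parse_product(SAMPLE_HTML)
```

Returning `None` for a missing field, rather than raising, lets a run continue when one selector stops matching.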
Ensure that Python is installed on your system. Check the version using:
# Use python3 if required (for Linux/macOS)
python --version
Next, install the required dependencies:
pip install crawlbase beautifulsoup4
- Crawlbase – Handles JavaScript rendering and bypasses bot protections.
- BeautifulSoup – Parses and extracts structured data from HTML.
- Sign up for Crawlbase here to get an API token.
- Use the JS token for Tokopedia scraping, as the site uses JavaScript-rendered content.
Replace "YOUR_CRAWLBASE_TOKEN" in the script with your Crawlbase JS Token.
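For orientation, a request through the Crawling API with a JS token might look like the sketch below. The helper names `make_client` and `fetch_html` are hypothetical. The `ajax_wait` and `page_wait` values are illustrative defaults, not settings taken from the scripts.

```python
def make_client(token):
    """Build a Crawlbase client (imported lazily so fetch_html can be tested offline)."""
    from crawlbase import CrawlingAPI
    return CrawlingAPI({"token": token})


def fetch_html(api, url, options=None):
    """Fetch a page through the Crawling API and return its HTML, or None on failure."""
    # ajax_wait / page_wait ask the JavaScript crawler to wait for rendered content;
    # the values here are illustrative assumptions.
    options = options or {"ajax_wait": "true", "page_wait": "5000"}
    response = api.get(url, options)
    if response.get("status_code") != 200:
        return None
    body = response.get("body", b"")
    # The body may arrive as bytes; decode it before handing it to a parser.
    return body.decode("utf-8") if isinstance(body, bytes) else body


# Usage (needs network access and a valid JS token):
# api = make_client("YOUR_CRAWLBASE_TOKEN")
# html = fetch_html(api, "https://www.tokopedia.com/search?q=headset")
```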
# For product listing scraping
python tokopedia_listing_scraper.py
# For product detail scraping
python tokopedia_product_scraper.py
The scraped data will be saved in tokopedia_search_results.json or tokopedia_product_data.json, depending on the script used.
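To sanity-check a run, the output file can be loaded back with the standard library. The sketch below assumes the file holds a JSON list of objects, which this README does not confirm. The helper names are hypothetical.

```python
import json


def load_results(path="tokopedia_search_results.json"):
    """Load a scraper output file (assumed to contain a JSON list of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(records):
    """Return the record count and the sorted set of keys seen across all records."""
    keys = sorted({key for record in records for key in record})
    return len(records), keys


# Usage, after running the listing scraper:
# count, fields = summarize(load_results())
# print(f"{count} records with fields: {fields}")
```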
- Expand scrapers to extract additional product details like discounted prices, seller reputation, and available promotions.
- Optimize data storage and add support for CSV and database integration.
- Implement asynchronous requests to speed up data extraction.
- Enhance scraper efficiency with Crawlbase Smart Proxy to prevent blocks.
- Automate scheduled scraping for real-time price monitoring and product tracking.
- ✔ Bypasses anti-bot protections with Crawlbase.
- ✔ Handles JavaScript-rendered content seamlessly.
- ✔ Extracts accurate and structured product data efficiently.