fmotalleb/scrapper-go

Scrapper-Go

Scrapper-Go is a powerful and flexible Go application that acts as a wrapper around Playwright, enabling you to define and execute web scraping pipelines using simple YAML configuration files. It provides a robust engine for automating browser interactions, extracting data, and handling various web scenarios.

Features

  • YAML-driven Scraping: Define complex scraping workflows using intuitive YAML configurations.
  • Playwright Integration: Leverages the full power of Playwright for browser automation, supporting Chromium, Firefox, and WebKit.
  • API Server: Expose your scraping capabilities as a RESTful API endpoint.
  • Interactive Shell: Interact with the scrapper in a live shell environment for testing and development.
  • Dependency Management: Easily install Playwright browsers and drivers with a dedicated setup command.

Installation

Prerequisites

  • Go (1.18 or higher)
  • Node.js (for Playwright dependencies)

Build from Source

  1. Clone the repository:

    git clone https://github.com/fmotalleb/scrapper-go.git
    cd scrapper-go
  2. Install Playwright dependencies:

    go run main.go setup

    You can specify which browsers to install:

    go run main.go setup --browsers chromium,firefox

    Or skip browser installation:

    go run main.go setup --skip-browsers
  3. Build the application:

    go build -o scrapper-go .

Usage

Executing a Scraping Pipeline

See the Documentation for more information on pipelines and their logic.

You can run a YAML-defined scraping pipeline directly:

./scrapper-go -c path/to/your/config.yaml

Example config.yaml:

# Your YAML scraping configuration here

Subcommands

serve - Start the API Server

Run Scrapper-Go as an API service. By default, it listens on 127.0.0.1:8080.

Note: This application does not support authentication. For production use, it is recommended to run it behind a reverse proxy.

./scrapper-go serve
# Or specify address and port
./scrapper-go serve -a 0.0.0.0 -p 8081

For API usage, see the API Documentation (it is AI-generated and might be slop; look at the code for the actual implementation).

setup - Install Playwright Dependencies

As described in the installation section, this command helps manage Playwright's browsers and drivers.

./scrapper-go setup --browsers webkit

shell - Interactive Shell

Start an interactive shell for direct interaction and testing of scraping steps.

./scrapper-go shell

Configuration

Scrapper-Go looks for a configuration file named .scrapper-go.yaml in your home directory by default. You can specify a different configuration file using the -c or --config flag.

The docs are generated using Gemini, so there will be hiccups somewhere. I won't be writing any docs manually, because this software is used to bypass a JS challenge on our internal hotspot login page and for some minor scraping situations. This is mostly an experiment :).

Contributing

We welcome contributions! Please see CONTRIBUTING.md (if available) for details on how to contribute.

License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.
