Webis - HTML Content Extraction Tool

Webis is an intelligent web data extraction tool that uses AI technology to automatically identify valuable information on web pages, filter out noise, and provide high-quality input for downstream AI training and knowledge base construction.

Installation

Prerequisites

Python 3.10
Conda (recommended for environment management)
NVIDIA GPU (optional, for CUDA support)

It is recommended to create an isolated Conda environment to avoid dependency conflicts:

# Create a new Conda environment named 'webis' with Python 3.10  
conda create -n webis python=3.10 -y  

# Activate the environment  
conda activate webis  

# Install PyTorch (for CUDA 12.1; adjust according to your CUDA version)
# Linux GPU version:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# CPU version:
conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note: Ensure the webis environment is activated for all subsequent steps. If CUDA errors occur, install the CPU version:

conda install pytorch torchvision torchaudio cpuonly -c pytorch

Installing Webis

Method 1: Command Line Installation (Recommended)

# Clone the repository  
git clone https://github.com/TheBinKing/Webis.git  
cd Webis  

# Install dependencies and set up  
chmod +x install.sh  
./install.sh

Method 2: Manual Installation

# Clone the repository  
git clone https://github.com/TheBinKing/Webis.git  
cd Webis  

# Install the package and dependencies  
pip install -e .  

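# Add the bin directory to PATH for the current shell
export PATH="$PATH:$(pwd)/bin"
# Persist it; the double quotes expand $(pwd) now, so ~/.bashrc stores the absolute path
echo "export PATH=\"\$PATH:$(pwd)/bin\"" >> ~/.bashrc
source ~/.bashrc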

Usage

Webis supports both CLI and API service modes. Always start the model server first!

Step 1: Start the Servers

Model Server (port 8000):

python scripts/start_model_server.py

Web API Server (port 8002):

python scripts/start_web_server.py

Note: The default model (Easonnoway/Web_info_extra_1.5b) will be automatically downloaded from HuggingFace. The first run may take some time.
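
Because the model server has to be up before anything else talks to it, it can help to wait until its port accepts connections. The following is a minimal sketch, not part of Webis: wait_for_port is a hypothetical helper name, and only the port numbers (8000 and 8002) come from this README. It checks TCP reachability only, not whether the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port("localhost", 8000) could be called after launching the model server and before starting the web API server on port 8002.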

API Usage Example

The api_usage.py script demonstrates how to process HTML files through the API in both synchronous and asynchronous modes. It is a good starting point for writing your own client.

Synchronous Processing Mode

Ideal for small numbers of files, where the client waits for the server to complete processing:

import requests

# Send an HTML file for synchronous processing
# (files, data, and task_id are placeholders not defined in this excerpt)
response = requests.post(
    "http://localhost:8002/extract/process-html",
    files=files,
    data=data
)

# Download the processed results as a streamed response
response = requests.get(f"http://localhost:8002/tasks/{task_id}/download", stream=True)
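
The download request uses stream=True, so the response body is best written to disk in chunks rather than loaded into memory at once. Here is a small sketch of that step. save_chunks is a hypothetical helper, not a Webis function, and it takes any iterable of byte chunks so it can be tried without a running server.

```python
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], dest: str) -> int:
    """Write an iterable of byte chunks to dest and return the number of bytes written."""
    written = 0
    with Path(dest).open("wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                written += len(chunk)
    return written
```

With the requests library, the chunks would typically come from response.iter_content(chunk_size=8192), for instance save_chunks(response.iter_content(chunk_size=8192), f"{task_id}_results.zip").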

Asynchronous Processing Mode

Ideal for large numbers of files or long processing times; submit the task and periodically check its status:

# Submit an asynchronous processing task
# (files, data, and async_task_id are placeholders not defined in this excerpt)
response = requests.post(
    "http://localhost:8002/extract/process-async",  
    files=files,  
    data=data  
)  

# Monitor task status  
response = requests.get(f"http://localhost:8002/tasks/{async_task_id}")  

# Download results after task completion  
download_response = requests.get(f"http://localhost:8002/tasks/{async_task_id}/download", stream=True)
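
The "periodically check its status" step can be sketched as a simple polling loop. This README does not document the shape of the status response, so the "status" field and the "completed"/"failed" state names below are assumptions made for illustration, and wait_for_task is a hypothetical helper. The status lookup is passed in as a function so the loop itself stays independent of the HTTP client.

```python
import time
from typing import Callable, Dict, Sequence


def wait_for_task(
    fetch_status: Callable[[], Dict],
    done_states: Sequence[str] = ("completed", "failed"),  # assumed state names
    interval: float = 2.0,
    max_checks: int = 150,
) -> Dict:
    """Call fetch_status() until its 'status' value is in done_states."""
    for _ in range(max_checks):
        info = fetch_status()
        # 'status' is an assumed field name; adjust it to the server's real response.
        if info.get("status") in done_states:
            return info
        time.sleep(interval)
    raise TimeoutError(f"task did not finish after {max_checks} checks")
```

With requests, fetch_status could be lambda: requests.get(f"http://localhost:8002/tasks/{async_task_id}").json(), assuming the endpoint returns JSON.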

Running the API Example

# Basic usage  
python samples/api_usage.py  

# Enhance processing results using the DeepSeek API (requires an API key)  
python samples/api_usage.py --use-deepseek --api-key YOUR_API_KEY_HERE

Tip: Ensure there are HTML files in the input_html/ directory. Results will be saved as {task_id}_results.zip (synchronous) and {async_task_id}_async_results.zip (asynchronous).
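
Before running the example, a quick check that the input directory actually contains HTML files can save a failed run. The sketch below is illustrative only: find_html_files is a hypothetical helper, and treating both .html and .htm as valid input is an assumption, since the README does not say which extensions the sample script accepts.

```python
from pathlib import Path
from typing import List


def find_html_files(directory: str = "input_html") -> List[Path]:
    """Return the .html/.htm files directly inside directory, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )
```

An empty result means either the directory is missing or it holds no HTML files. Adjust the path if you run the script from a different working directory.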

CLI Usage Example

The cli_usage.sh script provides quick examples of command-line interface usage, suitable for batch processing or script integration.

Basic Usage

# Process HTML files  
./samples/cli_usage.sh

Note: The script calls the webis extract command and requires a valid API key in place of YOUR_API_KEY_HERE. Results are saved to the output_basic/ directory.

Other Commands

# PROJECT_ROOT is assumed to point to the cloned Webis directory
# View version information
$PROJECT_ROOT/bin/webis version  

# Check API connection  
$PROJECT_ROOT/bin/webis check-api --api-key YOUR_API_KEY  

# View help  
$PROJECT_ROOT/bin/webis --help  
$PROJECT_ROOT/bin/webis extract --help

About the Model

Model Details

Name: Web_info_extra_1.5b
HuggingFace: Easonnoway/Web_info_extra_1.5b
Parameters: 1.5B
Function: DOM tree node classification

Usage Instructions

Downloaded by default to ~/.cache/huggingface/hub.
Use --model-path to specify a local path.
Cache management: Set HF_HOME or TRANSFORMERS_CACHE to customize the location; use huggingface-cli delete-cache to clear the cache.
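
To see where the model will be stored under the settings above, the lookup can be approximated in a few lines. This is a simplified sketch, not the Hugging Face implementation: hub_cache_dir is a hypothetical helper that only considers HF_HOME and the default location, and it deliberately ignores other variables such as TRANSFORMERS_CACHE that the real libraries may also consult.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Simplified lookup: $HF_HOME/hub if HF_HOME is set, else ~/.cache/huggingface/hub."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Calling hub_cache_dir() with no arguments reads the current process environment. Passing a dictionary makes it easy to preview the effect of a setting before exporting it.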

Project Structure

bin/ - Command-line tools
src/ - Source code
  - cli/ - CLI implementation
  - core/ - Core logic
  - server/ - API server
scripts/ - Startup scripts
samples/ - Usage examples (including api_usage.py and cli_usage.sh)
  - input_html/ - Sample HTML files
  - output_basic/ - CLI output results
config/ - Configuration files

Troubleshooting

Invalid API Key: Ensure the API key you supply is valid and the network connection is working.
Server Failure: Check if ports 8000/8002 are occupied (netstat -tuln | grep 8000) or adjust the ports.
Conda Issues: Accept channel ToS (conda tos accept --channel CHANNEL) or remove the channel (conda config --remove channels CHANNEL).

Contributing

Contributions are welcome! Please submit issues or pull requests on GitHub. For support, contact the maintainers or join the community discussion.
