Webis is an intelligent web data extraction tool that uses AI technology to automatically identify valuable information on web pages, filter out noise, and provide high-quality input for downstream AI training and knowledge base construction.
- Python 3.10
- Conda (recommended for environment management)
- NVIDIA GPU (optional, for CUDA support)
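A quick way to confirm the interpreter inside the environment matches the requirement; `check_python` is just an illustrative helper, not part of Webis:

```python
import sys

def check_python(required=(3, 10)) -> bool:
    """Return True if the running interpreter matches the version Webis targets."""
    return sys.version_info[:2] == tuple(required)

if not check_python():
    print(f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} detected; "
          "Webis targets 3.10")
```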
It is recommended to create an isolated Conda environment to avoid dependency conflicts:
# Create a new Conda environment named 'webis' with Python 3.10
conda create -n webis python=3.10 -y
# Activate the environment
conda activate webis
# Install PyTorch (for CUDA 12.1; adjust according to your CUDA version)
## Linux GPU version:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
## CPU version:
conda install pytorch torchvision torchaudio cpuonly -c pytorch

Note: Ensure the webis environment is activated for all subsequent steps. If CUDA errors occur, install the CPU-only build:

conda install pytorch torchvision torchaudio cpuonly -c pytorch

# Clone the repository
git clone https://github.com/TheBinKing/Webis.git
cd Webis
# Install dependencies and set up
chmod +x install.sh
./install.sh

Alternatively, install manually:

# Clone the repository
git clone https://github.com/TheBinKing/Webis.git
cd Webis
# Install the package and dependencies
pip install -e .
# Add the bin directory to PATH
export PATH="$PATH:$(pwd)/bin"
echo 'export PATH="$PATH:$(pwd)/bin"' >> ~/.bashrc
source ~/.bashrc

Webis supports both CLI and API service modes. Always start the model server first!
- Model Server (port 8000):
python scripts/start_model_server.py

- Web API Server (port 8002):
python scripts/start_web_server.py

Note: The default model (Easonnoway/Web_info_extra_1.5b) is downloaded automatically from HuggingFace; the first run may take some time.
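Before sending any requests, you can confirm that both servers are listening on their default ports. A minimal standard-library sketch (port numbers are the defaults above; `port_open` is just an illustrative helper):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check both Webis servers before calling the API
for name, port in [("model server", 8000), ("web API server", 8002)]:
    status = "up" if port_open("localhost", port) else "down"
    print(f"{name} (port {port}): {status}")
```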
The api_usage.py script demonstrates how to process HTML files through the API, in both synchronous and asynchronous modes, and is a good starting point for client integration.
Ideal for small numbers of files, where the client waits for the server to complete processing:
# Send an HTML file for synchronous processing
response = requests.post(
"http://localhost:8002/extract/process-html",
files=files,
data=data
)
# Download the processed results
response = requests.get(f"http://localhost:8002/tasks/{task_id}/download", stream=True)

Ideal for large numbers of files or long processing times; submit the task and periodically poll its status:
# Submit an asynchronous processing task
response = requests.post(
"http://localhost:8002/extract/process-async",
files=files,
data=data
)
# Monitor task status
response = requests.get(f"http://localhost:8002/tasks/{async_task_id}")
# Download results after task completion
download_response = requests.get(f"http://localhost:8002/tasks/{async_task_id}/download", stream=True)

# Basic usage
python samples/api_usage.py
# Enhance processing results using the DeepSeek API (requires an API key)
python samples/api_usage.py --use-deepseek --api-key YOUR_API_KEY_HERE

Tip: Ensure there are HTML files in the input_html/ directory. Results are saved as {task_id}_results.zip (synchronous) and {async_task_id}_async_results.zip (asynchronous).
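The asynchronous flow can be sketched with a generic polling helper. The endpoint path comes from the snippets above, but the `status` response key and the terminal values (`completed`/`failed`) are assumptions about the API:

```python
import time
from typing import Callable, Dict

def poll_task(fetch_status: Callable[[], Dict], interval: float = 2.0,
              max_tries: int = 30) -> Dict:
    """Call fetch_status() repeatedly until the task reaches a terminal state."""
    for _ in range(max_tries):
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("task did not finish in time")

# With the web server running, fetch_status would be something like:
#   lambda: requests.get(f"http://localhost:8002/tasks/{async_task_id}").json()
```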
The cli_usage.sh script provides quick examples of command-line interface usage, suitable for batch processing or script integration.
# Process HTML files
./samples/cli_usage.sh

Note: The script invokes the webis extract command and requires a valid API key in place of YOUR_API_KEY_HERE. Results are saved to the output_basic/ directory.
# View version information
$PROJECT_ROOT/bin/webis version
# Check API connection
$PROJECT_ROOT/bin/webis check-api --api-key YOUR_API_KEY
# View help
$PROJECT_ROOT/bin/webis --help
$PROJECT_ROOT/bin/webis extract --help

- Name: Web_info_extra_1.5b
- HuggingFace: Easonnoway/Web_info_extra_1.5b
- Parameters: 1.5B
- Function: DOM tree node classification
- Downloaded by default to ~/.cache/huggingface/hub.
- Use --model-path to specify a local path.
- Cache management: set HF_HOME or TRANSFORMERS_CACHE to customize the location; use huggingface-cli delete-cache to clear the cache.
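The cache-location rules above can be approximated in code. This is a simplified sketch of HuggingFace's resolution order (the real logic also honors HF_HUB_CACHE and other variables):

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    """Approximate the HuggingFace cache location: TRANSFORMERS_CACHE wins,
    then HF_HOME/hub, then the default ~/.cache/huggingface/hub."""
    if "TRANSFORMERS_CACHE" in os.environ:
        return Path(os.environ["TRANSFORMERS_CACHE"])
    if "HF_HOME" in os.environ:
        return Path(os.environ["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"

print(f"Model cache: {hf_cache_dir()}")
```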
- bin/ - Command-line tools
- src/ - Source code
  - cli/ - CLI implementation
  - core/ - Core logic
  - server/ - API server
- scripts/ - Startup scripts
- samples/ - Usage examples (including api_usage.py and cli_usage.sh)
  - input_html/ - Sample HTML files
  - output_basic/ - CLI output results
- config/ - Configuration files
- Invalid API key: Ensure YOUR_API_KEY is valid and the network is reachable.
- Server failure: Check whether ports 8000/8002 are already in use (netstat -tuln | grep 8000) or adjust the ports.
- Conda issues: Accept the channel ToS (conda tos accept --channel CHANNEL) or remove the channel (conda config --remove channels CHANNEL).
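If the default ports are taken, one option is to ask the OS for a free port; how you point the start scripts at it depends on their options, which are not documented here. A minimal sketch:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port (useful when 8000/8002 are occupied)."""
    with socket.socket() as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]

print(f"Free port: {find_free_port()}")
```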
Contributions are welcome! Please submit issues or pull requests on GitHub. For support, contact the maintainers or join the community discussion.