Real estate listing tracker for Bologna with web frontend and AI-powered analysis.
- Multi-portal scraping: Immobiliare.it and Idealista.it
- Duplicate detection: Matches listings across portals by price, surface, and location
- Sold tracking: Detects when listings disappear from search results
- AI analysis: Gemini extracts missing fields and generates summaries from descriptions/floor plans
- Web frontend: React app with map view, filters, favorites, and image galleries
# Install dependencies
make install
# Run the web app
make backend # API on port 8000
make frontend # React app on port 5173/houses
├── househunter_workflow.ipynb # Data processing notebook
├── scraping/ # Python modules
│ ├── extract_immobiliare.py
│ ├── extract_idealista.py
│ ├── utils.py # Deduplication
│ └── derive_fields.py # AI field extraction
├── src/
│ ├── backend/api/ # FastAPI server
│ └── frontend/ # React + Vite + Tailwind
├── data/
│ ├── scraped/ # Raw HTML + images by date
│ └── processed/ # Final parquet + user state
└── Makefile
- Scrape search results and listings (manual browser)
- Process in Jupyter notebook:
- Parse HTML into dataframe
- Deduplicate across portals
- Derive missing fields with AI
- Save to parquet
- Browse via web frontend with map and filters
The notebook (househunter_workflow.ipynb) runs these steps:
- Load search results from all scrape dates
- Detect new/sold listings by comparing dates
- Parse listing HTML files
- Deduplicate (exact price + surface + 100m distance)
- Derive fields with Gemini (summary, bedrooms, etc.)
- Save to
data/processed/listings.parquet
- Split view: listing cards + interactive map
- Filters: Active, Sold, Favorites, Removed
- Image gallery with lightbox and floor plans
- AI-generated summary on each card
- Click active listings to open on portal
- Click sold listings to view saved details
Uses Gemini with two separate API calls:
-
Attributes (from description text):
- Summary: 2-3 sentences highlighting best features
- Extracted: bedrooms, bathrooms, balconies, terraces, cantina, garage, elevator, heating, AC, condition
-
Beauty score (from property images):
- Score 1-5 rating based on interior quality and finishes
- Notes explaining the rating
Results are cached in data/processed/derived_fields.json and flattened into the parquet file. Only processes listings not already in cache (incremental).
make install # Install dependencies
make backend # Start API server (port 8000)
make frontend # Start React dev server (port 5173)
make jupyter # Open Jupyter notebook
make scrape # Scrape individual listings
make scrape_search # Scrape search result pages- Python 3.11+ with uv
- Node.js 18+ for frontend
- Google Cloud auth for Vertex AI (Gemini)