zenitsu0509/Employee-NLQ

Build a natural language query system for an employee database that dynamically adapts to the actual schema and can handle both structured employee data and unstructured documents. The system should work without hard-coding table names, column names, or relationships.
NLP Query Engine for Employee Data

This project delivers a full-stack AI-powered query system that can ingest employee databases and unstructured documents, automatically discover schemas, and serve natural language queries with production-ready optimizations.

Project Overview

  • Backend: FastAPI service providing ingestion, schema discovery, and query endpoints
  • Frontend: React application for database configuration, document uploads, query interface, and analytics
  • Document Processing: Multi-format ingestion with adaptive chunking and embedding generation
  • Query Engine: Unified SQL and semantic retrieval with caching, optimization, and monitoring hooks

Screenshot

App UI

Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • PostgreSQL (optional, SQLite works for testing)

Backend Setup

  1. Create virtual environment and install dependencies:
cd backend
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r ../requirements.txt
  2. Run the FastAPI server (dev):
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
  3. (Optional) Enable background workers with Redis:
  • Set in config.yml:
queue:
  enabled: true
  redis_url: "redis://localhost:6379/0"
  queue_name: "ingestion"
  • Start Redis and a worker in another shell:
# Start worker using config.yml
python backend/worker.py
  4. Test the API:
curl http://localhost:8000/health

Frontend Setup

  1. Install Node.js dependencies:
cd frontend
npm install
  2. Start the development server:
npm run dev
  3. Point the frontend at the backend API (local dev):

Create frontend/.env.local so Vite knows which backend URL to call. Restart npm run dev after saving.

VITE_API_BASE_URL=http://localhost:8000
  4. Access the web interface:

Open http://localhost:5173 in your browser

Docker Setup (Alternative)

  1. Run with Docker Compose:
docker-compose up --build

This builds and starts all services defined in docker-compose.yml.

Frontend on Vercel / static hosting

  • Set an environment variable in your hosting provider (Vercel → Project Settings → Environment Variables):
VITE_API_BASE_URL=""
  • Trigger a new deployment so the SPA builds with the production API base URL.

Persistent Vector Store (PgVector)

To persist embeddings and enable scalable semantic search, you can switch the vector store to PgVector.

  1. Ensure your Postgres has the pgvector extension installed.
  2. Update config.yml:
vector_store:
  type: pgvector
  connection_string: ${VECTOR_DB_URL}  # e.g., postgresql+psycopg://user:pass@localhost:5432/embeddings
  table_name: document_chunks

If the PgVector initialization fails, the app will fall back to the in-memory FAISS store.
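That fallback pattern can be sketched generically in Python; the function and factory names below are illustrative, not the app's actual API:

```python
import logging

def init_vector_store(primary_factory, fallback_factory):
    """Try the configured store first (e.g. PgVector); on any
    initialization error, fall back to the in-memory store."""
    try:
        return primary_factory()
    except Exception as exc:
        logging.warning("Primary vector store failed (%s); using fallback", exc)
        return fallback_factory()
```

Injecting both stores as zero-argument factories keeps the fallback logic independent of PgVector and FAISS specifics, and makes it trivial to unit-test.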

Usage Examples

1. Connect to Database

curl -X POST http://localhost:8000/api/ingest/database \
  -H "Content-Type: application/json" \
  -d '{"connection_string": "postgresql+psycopg://user:password@localhost:5432/employees"}'

2. Upload Documents

curl -X POST http://localhost:8000/api/ingest/documents \
  -F "files=@resume.pdf" \
  -F "files=@handbook.docx" \
  -F "connection_string=postgresql+psycopg://user:password@localhost:5432/employees"

CSV upload (sample data)

You can quickly test CSV uploads using the provided sample files in sample-data/:

curl -X POST http://localhost:8000/api/ingest/documents \
  -F "files=@sample-data/employees.csv" \
  -F "files=@sample-data/departments.csv" \
  -F "files=@sample-data/salaries.csv" \
  -F "files=@sample-data/skills.csv" \
  -F "connection_string=postgresql+psycopg://user:password@localhost:5432/employees"

Then poll the ingestion job status:

curl -s http://localhost:8000/api/ingest/jobs | jq

2b. Import tabular data (CSV/Excel) into your database

You can upload CSV/XLSX files and have the backend create/append tables in your Postgres (e.g., Neon/Supabase).

# if_exists: append (default) | replace | fail
# table_name: optional; defaults to the sanitized filename
# delimiter: optional for CSV; default "," (.tsv uses tab)
# sheet_name: optional for Excel workbooks
curl -X POST http://localhost:8000/api/ingest/tabular \
  -F "files=@sample-data/employees.csv" \
  -F "files=@sample-data/departments.csv" \
  -F "connection_string=${DATABASE_URL}" \
  -F "if_exists=append" \
  -F "table_name=" \
  -F "delimiter=," \
  -F "sheet_name=Sheet1"

This returns a job id you can monitor with /api/ingest/jobs. Large CSVs are loaded in chunks to avoid memory spikes. By default, tables are appended; use if_exists=replace to overwrite.

3. Query Your Data

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How many employees hired this year?", "top_k": 5}'

4. Modify Data (DML via natural language)

The query engine can also perform safe UPDATE/INSERT/DELETE operations generated by the LLM. The engine never appends a LIMIT clause to DML statements and commits the transaction on success.

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "connection_string": "postgresql+psycopg://user:password@localhost:5432/employees",
    "query": "increase the salary of hr department by 10 percent"
  }'

Response example:

{
  "results": [{"affected_rows": 3, "status": "success"}],
  "query_type": "sql",
  "metrics": {"response_ms": 120.5, "cache_hit": false}
}

Note: Always verify the result (affected_rows) and consider running within a transaction against non-production databases.
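The no-LIMIT-on-DML guard can be sketched as follows (finalize_sql is a hypothetical name, not the engine's actual function):

```python
def finalize_sql(sql: str, limit: int = 100) -> str:
    """Append a LIMIT to SELECT statements only; DML (UPDATE/INSERT/DELETE)
    passes through unchanged so the row guard never corrupts it."""
    stripped = sql.strip().rstrip(";")
    first_keyword = stripped.split(None, 1)[0].upper()
    if first_keyword == "SELECT" and "LIMIT" not in stripped.upper():
        return f"{stripped} LIMIT {limit}"
    return stripped
```

A LIMIT appended to an UPDATE would be a syntax error on Postgres, so gating on the leading keyword is the simplest safe rule.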

5. Optional: List ingestion jobs

If you uploaded documents and lost the job id, you can list current jobs:

curl -s http://localhost:8000/api/ingest/jobs

Testing

Run the backend test suite:

cd backend
python -m pytest tests/ -v

Configuration

Edit config.yml to customize:

  • Database connection pooling
  • Embedding model settings
  • Cache configuration
  • Logging levels
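As an illustration of the TTL-based response caching that config.yml controls, here is a minimal sketch (TTLCache is hypothetical, not the project's class; the injectable clock makes expiry testable):

```python
import time

class TTLCache:
    """Minimal TTL-based response cache keyed by query string."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Lazily evict expired entries on access.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```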

Notes:

  • When using PostgreSQL with SQLAlchemy 2.x, prefer the psycopg v3 driver by specifying postgresql+psycopg://... in the connection string. This avoids requiring psycopg2.
  • If running the backend inside Docker Compose, the connection string host should be the Compose service name:
    • postgresql+psycopg://user:password@postgres:5432/employees
  • If running the backend locally against Dockerized Postgres on your machine, use:
    • postgresql+psycopg://user:password@localhost:5432/employees

Groq (LLM) configuration

This project can generate SQL with an LLM (Llama via Groq) for robust natural language understanding, including JOINs and DML. Configure your key in config.yml:

groq:
  api_key: "<YOUR_GROQ_API_KEY>"
  model: "llama-3.1-8b-instant"

Security tip: Prefer setting GROQ_API_KEY via environment variable or secrets management in production.
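One way to honor that tip is to read the environment first and fall back to config.yml; a small sketch (groq_api_key is a hypothetical helper):

```python
import os

def groq_api_key(config: dict) -> str:
    """Prefer the GROQ_API_KEY environment variable; fall back to the
    parsed config.yml so local development still works."""
    key = os.environ.get("GROQ_API_KEY") or config.get("groq", {}).get("api_key")
    if not key:
        raise RuntimeError(
            "No Groq API key: set GROQ_API_KEY or groq.api_key in config.yml"
        )
    return key
```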

Architecture

Backend Structure

backend/
├── api/
│   ├── routes/          # FastAPI endpoints
│   ├── services/        # Business logic
│   │   ├── schema_discovery.py    # Auto-discover DB schemas
│   │   ├── query_engine.py        # Unified query processing
│   │   ├── document_processor.py  # Document ingestion
│   │   └── vector_store.py        # Semantic search
│   └── models/          # Pydantic schemas
└── tests/               # Unit tests
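To illustrate what schema_discovery.py does conceptually, here is a minimal sketch using SQLite's introspection PRAGMAs (discover_schema is a hypothetical name; the real service works against whatever database you connect):

```python
import sqlite3

def discover_schema(conn: sqlite3.Connection) -> dict:
    """Enumerate tables, columns, and foreign keys with no
    hard-coded names -- the core idea behind schema discovery."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        # PRAGMA foreign_key_list rows: (id, seq, ref_table, from, to, ...)
        fks = [
            {"column": row[3], "references": f"{row[2]}.{row[4]}"}
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        schema[table] = {"columns": columns, "foreign_keys": fks}
    return schema
```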

Frontend Structure

frontend/src/
├── components/
│   ├── AppLayout.tsx           # Main layout
│   └── contexts/              # React contexts
├── App.tsx                    # Root component
└── main.tsx                   # Entry point

Features

  • Schema Discovery: Automatically detects table structures and relationships
  • Natural Language Queries: Translates plain-English questions into SQL and semantic search
  • Document Processing: Handles PDFs, Word docs, and CSVs with intelligent chunking
  • Caching: TTL-based response caching for performance
  • Multi-format Results: Returns structured data and document excerpts
  • Query History: Tracks and caches previous queries
  • LLM-powered SQL: Uses Groq (Llama) to translate natural language to SQL
  • Data Modification (DML): Natural language UPDATE/INSERT/DELETE with safety guards (no LIMIT on DML, affected_rows returned)
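The intelligent chunking is not specified in detail in this README; as a rough illustration only, a sentence-aware chunker with word overlap might look like this (chunk_text and its parameters are hypothetical, not the project's API):

```python
import re

def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-bounded chunks, preferring sentence
    boundaries and carrying a small overlap between chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        if count + len(words) > max_tokens and current:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry overlap into the next chunk
            count = len(current)
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap keeps context that straddles a chunk boundary retrievable from either side, which matters for embedding-based search.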

Troubleshooting

  • No module named 'psycopg2'

    • Cause: SQLAlchemy is attempting to use the old psycopg2 driver when your environment only has psycopg v3 installed (as specified in requirements.txt).
    • Fix: Use a connection string that explicitly selects psycopg v3: postgresql+psycopg://user:password@host:5432/dbname.
    • Optional: You could install psycopg2-binary, but psycopg2 may not support the newest Python versions promptly (e.g. 3.13). Prefer psycopg v3.
  • psycopg import errors on Windows (psycopg v3): "couldn't import psycopg 'c'/'binary' implementation" or "libpq library not found"

    • Cause: Base psycopg package tries to load a compiled implementation and falls back to a pure-Python variant which requires libpq on your system. On Windows, libpq typically isn't present by default.

    • Fix: Install the binary wheel that bundles libpq.

      • We pin this in requirements.txt as: psycopg[binary]==3.2.3.

      • If you already installed dependencies before this change, update your venv:

        # from repo root, using the existing .venv
        source .venv/Scripts/activate  # Git Bash on Windows; PowerShell: .venv\Scripts\Activate.ps1
        pip uninstall -y psycopg2 psycopg2-binary || true
        pip install --upgrade "psycopg[binary]==3.2.3"
    • Ensure your connection string uses psycopg v3: postgresql+psycopg://... (not postgresql:// and not +psycopg2).
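A small guard in application code can normalize incoming connection strings to the psycopg v3 driver (normalize_pg_url is an illustrative helper, not part of this repo):

```python
def normalize_pg_url(url: str) -> str:
    """Rewrite bare or psycopg2-style Postgres URLs to select psycopg v3."""
    for prefix in ("postgresql+psycopg2://", "postgres://", "postgresql://"):
        if url.startswith(prefix):
            return "postgresql+psycopg://" + url[len(prefix):]
    return url
```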
