zenitsu0509/Employee-NLQ

Build a natural language query system for an employee database that dynamically adapts to the actual schema and can handle both structured employee data and unstructured documents. The system should work without hard-coding table names, column names, or relationships.
NLP Query Engine for Employee Data

This project delivers a full-stack AI-powered query system that can ingest employee databases and unstructured documents, automatically discover schemas, and serve natural language queries with production-ready optimizations.

Project Overview

  • Backend: FastAPI service providing ingestion, schema discovery, and query endpoints
  • Frontend: React application for database configuration, document uploads, query interface, and analytics
  • Document Processing: Multi-format ingestion with adaptive chunking and embedding generation
  • Query Engine: Unified SQL and semantic retrieval with caching, optimization, and monitoring hooks

Screenshot

App UI

Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • PostgreSQL (optional, SQLite works for testing)

Backend Setup

  1. Create virtual environment and install dependencies:
cd backend
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r ../requirements.txt
  2. Run the FastAPI server (dev):
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
  3. (Optional) Enable background workers with Redis:
  • Set in config.yml:
queue:
  enabled: true
  redis_url: "redis://localhost:6379/0"
  queue_name: "ingestion"
  • Start Redis and a worker in another shell:
# Start worker using config.yml
python backend/worker.py
  4. Test the API:
curl http://localhost:8000/health

Frontend Setup

  1. Install Node.js dependencies:
cd frontend
npm install
  2. Start the development server:
npm run dev
  3. Point the frontend at the backend API (local dev):

Create frontend/.env.local so Vite knows which backend URL to call. Restart npm run dev after saving.

VITE_API_BASE_URL=http://localhost:8000
  4. Access the web interface:

Open http://localhost:5173 in your browser

Docker Setup (Alternative)

  1. Run with Docker Compose:
docker-compose up --build

This builds and starts all services defined in docker-compose.yml.

Frontend on Vercel / static hosting

  • Set an environment variable in your hosting provider (Vercel → Project Settings → Environment Variables):
VITE_API_BASE_URL=""
  • Trigger a new deployment so the SPA builds with the production API base URL.

Persistent Vector Store (PgVector)

To persist embeddings and enable scalable semantic search, you can switch the vector store to PgVector.

  1. Ensure your Postgres has the pgvector extension installed.
  2. Update config.yml:
vector_store:
  type: pgvector
  connection_string: ${VECTOR_DB_URL}  # e.g., postgresql+psycopg://user:pass@localhost:5432/embeddings
  table_name: document_chunks

If the PgVector initialization fails, the app will fall back to the in-memory FAISS store.
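That fallback pattern can be sketched generically in Python; the function and factory names below are illustrative, not the app's actual API:

```python
import logging

def init_vector_store(primary_factory, fallback_factory):
    """Try the configured store first (e.g. PgVector); on any
    initialization error, fall back to the in-memory store."""
    try:
        return primary_factory()
    except Exception as exc:
        logging.warning("Primary vector store failed (%s); using fallback", exc)
        return fallback_factory()
```

Injecting both stores as zero-argument factories keeps the fallback logic independent of PgVector and FAISS specifics, and makes it trivial to unit-test.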

Usage Examples

1. Connect to Database

curl -X POST http://localhost:8000/api/ingest/database \
  -H "Content-Type: application/json" \
  -d '{"connection_string": "postgresql+psycopg://user:password@localhost:5432/employees"}'

2. Upload Documents

curl -X POST http://localhost:8000/api/ingest/documents \
  -F "files=@resume.pdf" \
  -F "files=@handbook.docx" \
  -F "connection_string=postgresql+psycopg://user:password@localhost:5432/employees"

CSV upload (sample data)

You can quickly test CSV uploads using the provided sample files in sample-data/:

curl -X POST http://localhost:8000/api/ingest/documents \
  -F "files=@sample-data/employees.csv" \
  -F "files=@sample-data/departments.csv" \
  -F "files=@sample-data/salaries.csv" \
  -F "files=@sample-data/skills.csv" \
  -F "connection_string=postgresql+psycopg://user:password@localhost:5432/employees"

Then poll the ingestion job status:

curl -s http://localhost:8000/api/ingest/jobs | jq

2b. Import tabular data (CSV/Excel) into your database

You can upload CSV/XLSX files and have the backend create/append tables in your Postgres (e.g., Neon/Supabase).

# if_exists: append (default) | replace | fail
# table_name: optional; defaults to the sanitized filename
# delimiter: optional for CSV; default "," (.tsv uses tab)
# sheet_name: optional for Excel workbooks
curl -X POST http://localhost:8000/api/ingest/tabular \
  -F "files=@sample-data/employees.csv" \
  -F "files=@sample-data/departments.csv" \
  -F "connection_string=${DATABASE_URL}" \
  -F "if_exists=append" \
  -F "table_name=" \
  -F "delimiter=," \
  -F "sheet_name=Sheet1"

This returns a job id you can monitor with /api/ingest/jobs. Large CSVs are loaded in chunks to avoid memory spikes. By default, tables are appended; use if_exists=replace to overwrite.

3. Query Your Data

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"query": "How many employees hired this year?", "top_k": 5}'

4. Modify Data (DML via natural language)

The query engine can also perform safe UPDATE/INSERT/DELETE operations generated by the LLM. The engine never appends a LIMIT clause to DML statements and commits the transaction on success.

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "connection_string": "postgresql+psycopg://user:password@localhost:5432/employees",
    "query": "increase the salary of hr department by 10 percent"
  }'

Response example:

{
  "results": [{"affected_rows": 3, "status": "success"}],
  "query_type": "sql",
  "metrics": {"response_ms": 120.5, "cache_hit": false}
}

Note: Always verify the result (affected_rows) and consider running within a transaction against non-production databases.
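The no-LIMIT-on-DML guard can be sketched as follows (finalize_sql is a hypothetical name, not the engine's actual function):

```python
def finalize_sql(sql: str, limit: int = 100) -> str:
    """Append a LIMIT to SELECT statements only; DML (UPDATE/INSERT/DELETE)
    passes through unchanged so the row guard never corrupts it."""
    stripped = sql.strip().rstrip(";")
    first_keyword = stripped.split(None, 1)[0].upper()
    if first_keyword == "SELECT" and "LIMIT" not in stripped.upper():
        return f"{stripped} LIMIT {limit}"
    return stripped
```

A LIMIT appended to an UPDATE would be a syntax error on Postgres, so gating on the leading keyword is the simplest safe rule.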

5. Optional: List ingestion jobs

If you uploaded documents and lost the job id, you can list current jobs:

curl -s http://localhost:8000/api/ingest/jobs

Testing

Run the backend test suite:

cd backend
python -m pytest tests/ -v

Configuration

Edit config.yml to customize:

  • Database connection pooling
  • Embedding model settings
  • Cache configuration
  • Logging levels
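As an illustration of the TTL-based response caching that config.yml controls, here is a minimal sketch (TTLCache is hypothetical, not the project's class; the injectable clock makes expiry testable):

```python
import time

class TTLCache:
    """Minimal TTL-based response cache keyed by query string."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Lazily evict expired entries on access.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```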

Notes:

  • When using PostgreSQL with SQLAlchemy 2.x, prefer the psycopg v3 driver by specifying postgresql+psycopg://... in the connection string. This avoids requiring psycopg2.
  • If running the backend inside Docker Compose, the connection string host should be the Compose service name:
    • postgresql+psycopg://user:password@postgres:5432/employees
  • If running the backend locally against Dockerized Postgres on your machine, use:
    • postgresql+psycopg://user:password@localhost:5432/employees

Groq (LLM) configuration

This project can generate SQL with an LLM (Llama via Groq) for robust natural language understanding, including JOINs and DML. Configure your key in config.yml:

groq:
  api_key: "<YOUR_GROQ_API_KEY>"
  model: "llama-3.1-8b-instant"

Security tip: Prefer setting GROQ_API_KEY via environment variable or secrets management in production.
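One way to honor that tip is to read the environment first and fall back to config.yml; a small sketch (groq_api_key is a hypothetical helper):

```python
import os

def groq_api_key(config: dict) -> str:
    """Prefer the GROQ_API_KEY environment variable; fall back to the
    parsed config.yml so local development still works."""
    key = os.environ.get("GROQ_API_KEY") or config.get("groq", {}).get("api_key")
    if not key:
        raise RuntimeError(
            "No Groq API key: set GROQ_API_KEY or groq.api_key in config.yml"
        )
    return key
```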

Architecture

Backend Structure

backend/
├── api/
│   ├── routes/          # FastAPI endpoints
│   ├── services/        # Business logic
│   │   ├── schema_discovery.py    # Auto-discover DB schemas
│   │   ├── query_engine.py        # Unified query processing
│   │   ├── document_processor.py  # Document ingestion
│   │   └── vector_store.py        # Semantic search
│   └── models/          # Pydantic schemas
└── tests/               # Unit tests
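To illustrate what schema_discovery.py does conceptually, here is a minimal sketch using SQLite's introspection PRAGMAs (discover_schema is a hypothetical name; the real service works against whatever database you connect):

```python
import sqlite3

def discover_schema(conn: sqlite3.Connection) -> dict:
    """Enumerate tables, columns, and foreign keys with no
    hard-coded names -- the core idea behind schema discovery."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        # PRAGMA foreign_key_list rows: (id, seq, ref_table, from, to, ...)
        fks = [
            {"column": row[3], "references": f"{row[2]}.{row[4]}"}
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        schema[table] = {"columns": columns, "foreign_keys": fks}
    return schema
```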

Frontend Structure

frontend/src/
├── components/
│   ├── AppLayout.tsx           # Main layout
│   └── contexts/              # React contexts
├── App.tsx                    # Root component
└── main.tsx                   # Entry point

Features

  • Schema Discovery: Automatically detects table structures and relationships
  • Natural Language Queries: Translates plain-English questions into SQL and semantic search
  • Document Processing: Handles PDFs, Word docs, and CSVs with intelligent chunking
  • Caching: TTL-based response caching for performance
  • Multi-format Results: Returns structured data and document excerpts
  • Query History: Tracks and caches previous queries
  • LLM-powered SQL: Uses Groq (Llama) to translate natural language to SQL
  • Data Modification (DML): Natural language UPDATE/INSERT/DELETE with safety guards (no LIMIT on DML, affected_rows returned)
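The intelligent chunking is not specified in detail in this README; as a rough illustration only, a sentence-aware chunker with word overlap might look like this (chunk_text and its parameters are hypothetical, not the project's API):

```python
import re

def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-bounded chunks, preferring sentence
    boundaries and carrying a small overlap between chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        if count + len(words) > max_tokens and current:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry overlap into the next chunk
            count = len(current)
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap keeps context that straddles a chunk boundary retrievable from either side, which matters for embedding-based search.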

Troubleshooting

  • No module named 'psycopg2'

    • Cause: SQLAlchemy is attempting to use the old psycopg2 driver when your environment only has psycopg v3 installed (as specified in requirements.txt).
    • Fix: Use a connection string that explicitly selects psycopg v3: postgresql+psycopg://user:password@host:5432/dbname.
    • Optional: You could install psycopg2-binary, but psycopg2 may not support the newest Python versions promptly (e.g. 3.13). Prefer psycopg v3.
  • psycopg import errors on Windows (psycopg v3): "couldn't import psycopg 'c'/'binary' implementation" or "libpq library not found"

    • Cause: Base psycopg package tries to load a compiled implementation and falls back to a pure-Python variant which requires libpq on your system. On Windows, libpq typically isn't present by default.

    • Fix: Install the binary wheel that bundles libpq.

      • We pin this in requirements.txt as: psycopg[binary]==3.2.3.

      • If you already installed dependencies before this change, update your venv:

        # from repo root, using the existing .venv
        source .venv/Scripts/activate  # Git Bash on Windows; PowerShell: .venv\Scripts\Activate.ps1
        pip uninstall -y psycopg2 psycopg2-binary || true
        pip install --upgrade "psycopg[binary]==3.2.3"
    • Ensure your connection string uses psycopg v3: postgresql+psycopg://... (not postgresql:// and not +psycopg2).
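A small guard in application code can normalize incoming connection strings to the psycopg v3 driver (normalize_pg_url is an illustrative helper, not part of this repo):

```python
def normalize_pg_url(url: str) -> str:
    """Rewrite bare or psycopg2-style Postgres URLs to select psycopg v3."""
    for prefix in ("postgresql+psycopg2://", "postgres://", "postgresql://"):
        if url.startswith(prefix):
            return "postgresql+psycopg://" + url[len(prefix):]
    return url
```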
