TheirStory Interview Archive Portal

Interview archive platform with vector semantic search and Named Entity Recognition (NER). Built with Weaviate, Next.js, and advanced NLP processing.

✨ What is it?

A complete system for archiving, processing, and searching video/audio interviews together with their transcriptions. It enables intelligent semantic search, automatic entity extraction (people, organizations, places), and timestamp-synchronized navigation.

πŸš€ Features

  • Semantic Search: Vector search with local embeddings (no external APIs)
  • Automatic NER: Entity extraction with GLiNER (zero-shot, multilingual)
  • Sentence Chunking: Sentence-based chunking with configurable overlap
  • Multi-format: Video and audio with synchronized transcriptions
  • Live Highlighting: Entities highlighted with clickable timestamps
  • Multi-organization: Centralized configuration system
  • Docker: Local deployment or cloud Weaviate connection
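Semantic search works by comparing embedding vectors rather than matching keywords. A minimal illustration in plain Python (not the portal's actual implementation — Weaviate computes this internally over LaBSE vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real LaBSE vectors have 768 dimensions.
query = [0.9, 0.1, 0.0]
chunks = {
    "chunk about farming": [0.8, 0.2, 0.1],
    "chunk about music":   [0.1, 0.9, 0.3],
}
best = max(chunks, key=lambda k: cosine_similarity(query, chunks[k]))
print(best)  # the chunk whose vector points most in the query's direction
```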

πŸ› οΈ Tech Stack

Backend & NLP:

  • Weaviate (vector database)
  • FastAPI (Python 3.11)
  • GLiNER multi-v2.1 (NER)
  • Sentence Transformers LaBSE (local embeddings)

Frontend:

  • Next.js + TypeScript
  • Material UI
  • Zustand (state)

Requirements:

  • Docker & Docker Compose
  • Node.js ≥18
  • Yarn

πŸš€ Quick Start

Get started in 4 easy steps

# 1. Clone the repository
git clone https://github.com/theirstory/ts-portal.git
cd ts-portal

# 2. Copy config and env files from example
cp config.example.json config.json # Edit config.json with your organization details
cp .env.example .env.local
cp nlp-processor/.env.example nlp-processor/.env

# 3. Start services
docker compose --profile local up

# 4. Open in browser
open http://localhost:3000

Important: Edit config.json to customize your portal with organization name, branding colors, logos, and NER entity labels. See CONFIGURATION.md for all configuration options.

Chat provider configuration

The /discover RAG chat now supports multiple LLM providers through a shared provider abstraction.

  • Set the default non-secret chat settings in config.json under features.chat:
    • provider: anthropic, openai, or openai-compatible
    • model: provider-specific model name
    • baseUrl: optional for OpenAI-compatible endpoints
  • Store API keys in environment variables, not in config.json

Example:

{
  "features": {
    "chat": {
      "enabled": true,
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514",
      "baseUrl": ""
    }
  }
}
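To make the split between config and secrets concrete, here is a hypothetical sketch of how a provider abstraction might resolve the API key for the configured provider (Python for illustration only; the portal's actual chat code lives in the Next.js app):

```python
import os

# Maps each supported provider to the environment variable holding its key.
# The variable names match the env list below; the helper itself is illustrative.
PROVIDER_ENV_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openai-compatible": "AI_API_KEY",
}

def resolve_api_key(chat_config: dict) -> str:
    """Look up the API key for the provider named in features.chat."""
    provider = chat_config.get("provider", "anthropic")
    env_var = PROVIDER_ENV_KEYS.get(provider)
    if env_var is None:
        raise ValueError(f"Unknown chat provider: {provider}")
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"{env_var} is not set; keys belong in env vars, not config.json")
    return key

os.environ["ANTHROPIC_API_KEY"] = "sk-test"  # demo only; set real keys in .env files
print(resolve_api_key({"provider": "anthropic", "model": "claude-sonnet-4-20250514"}))
```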

Environment variables:

ANTHROPIC_API_KEY=
OPENAI_API_KEY=
AI_API_KEY=

First run: it may take several minutes while the GLiNER, embedding, and spaCy models download. Subsequent runs are much faster thanks to cache reuse.

NLP Environment Notes

Default embedding model is sentence-transformers/LaBSE. NER uses urchade/gliner_multi-v2.1. Chunking is sentence-based with configurable sentence overlap.
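Sentence-based chunking with overlap can be sketched as follows. This is a simplified illustration, not the real `sentence_chunker.py` (which uses spaCy for sentence splitting, and whose parameter names may differ):

```python
def chunk_sentences(sentences, chunk_size=4, overlap=1):
    """Group sentences into chunks of `chunk_size`, repeating the last
    `overlap` sentences at the start of the next chunk to preserve context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break
    return chunks

sents = [f"Sentence {i}." for i in range(1, 8)]  # 7 toy sentences
for c in chunk_sentences(sents, chunk_size=3, overlap=1):
    print(c)
```

With `chunk_size=3, overlap=1`, each chunk shares its first sentence with the end of the previous chunk, so an answer spanning a chunk boundary is still retrievable.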

Services:

  • Frontend: localhost:3000
  • Weaviate: localhost:8080
  • NLP Processor: localhost:7070

πŸ“₯ Import Interviews

Getting Interview JSONs from TheirStory

If you have interviews already uploaded to TheirStory, you can easily obtain the JSON files:

  1. Navigate to https://lab.theirstory.io/ts-api-core-demo/v028/
  2. Log in with your TheirStory username and password
  3. Download the JSON files for your interviews

Importing the Data

Recommended approach: create one subfolder per collection with a collection.json file and place interview JSON files inside each folder.

You can use json/interviews/example-collection/collection.json as a copy-paste template for new collections.

Example:

json/interviews/
β”œβ”€β”€ oral-history/
β”‚   β”œβ”€β”€ collection.json
β”‚   └── interview-1.json
└── veterans/
    β”œβ”€β”€ collection.json
    └── interview-2.json

Note: Subfolders are optional; JSON files placed directly under json/interviews/ are imported into the default collection.

# 1. Add your collection subfolders and interview JSON files under:
json/interviews/

# 2. Open a new terminal in the ts-portal root folder and run the manual import
docker compose run --rm weaviate-init

See docs/IMPORTING_INTERVIEWS.md for full details.

Process:

  1. Sentence-based chunking
  2. Embedding generation
  3. NER extraction (people, places, organizations, etc.)
  4. Storage in Weaviate with vectors
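The four steps above can be sketched as a single pass per interview. This is an illustrative skeleton only: `embed_fn`, `ner_fn`, and `store_fn` stand in for the LaBSE, GLiNER, and Weaviate batch calls made by the real nlp-processor service.

```python
def process_interview(sentences, embed_fn, ner_fn, store_fn, chunk_size=4):
    """Chunk -> embed -> NER -> store, producing one record per chunk."""
    for i in range(0, len(sentences), chunk_size):
        text = " ".join(sentences[i:i + chunk_size])
        record = {
            "text": text,
            "vector": embed_fn(text),    # step 2: embedding generation
            "entities": ner_fn(text),    # step 3: NER extraction
        }
        store_fn(record)                 # step 4: storage in Weaviate

# Demo with stand-in functions:
stored = []
process_interview(
    ["Maria grew up in Lisbon.", "She later joined UNESCO.",
     "Her work took her to Nairobi."],
    embed_fn=lambda t: [0.0] * 768,     # fake 768-dim vector (LaBSE's size)
    ner_fn=lambda t: [w for w in ("Maria", "Lisbon", "UNESCO", "Nairobi") if w in t],
    store_fn=stored.append,
    chunk_size=2,
)
print(len(stored))  # number of chunk records "stored"
```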

Verify:

# Count interviews and chunks
curl -s "http://localhost:8080/v1/objects?class=Testimonies" | jq '.objects | length'
curl -s "http://localhost:8080/v1/objects?class=Chunks" | jq '.objects | length'

JSON Format: See docs/IMPORTING_INTERVIEWS.md

🚒 Production Deployment

This production flow works on any Linux host with Docker.

Before running deployment commands:

  • Create a Linux server in your hosting provider (DigitalOcean, AWS, Hetzner, etc.).
  • Connect to that server via SSH (example: ssh root@YOUR_SERVER_IP).

On the server terminal (remote host):

# Install git and clone repo (one time)
sudo apt update && sudo apt install -y git
git clone https://github.com/theirstory/ts-portal.git
cd ts-portal

# Install Docker once (Ubuntu)
sudo bash scripts/deploy/setup-docker-ubuntu.sh
# If prompted about /etc/ssh/sshd_config, choose: keep the local version currently installed

# To use Discover (RAG chat), create and edit .env.production with your API keys
cp -n .env.production.example .env.production
nano .env.production

# Deploy/update
./scripts/deploy/deploy-prod.sh

# Site URL after deploy
# http://YOUR_SERVER_IP:3000

# Optional but recommended: domain + HTTPS + firewall
# sudo bash scripts/deploy/setup-nginx-ssl.sh YOUR_DOMAIN YOUR_EMAIL 3000

First production build can take 15-20 minutes on small servers.

If you already have indexed data locally and want to avoid re-import/NLP processing in production (recommended):

On your local terminal:

# One command: export backup + sync config/json/public + upload backup
./scripts/deploy/export-weaviate-data.sh "$PWD/weaviate-data.tar.gz" ts-portal_weaviate_data root@YOUR_SERVER_IP /root/ts-portal

On the server terminal:

cd /root/ts-portal
./scripts/deploy/restore-weaviate-data.sh /tmp/weaviate-data.tar.gz
./scripts/deploy/deploy-prod.sh

Note: with server parameters, export-weaviate-data.sh also syncs config.json, json/, and public/.

Full guide (DigitalOcean example): docs/DEPLOY_PRODUCTION.md

πŸ“š Documentation

⚑ Common Commands

# Start/stop
docker compose --profile local up     # Start with logs
docker compose down                   # Stop

# Production (with Docker)
./scripts/deploy/deploy-prod.sh
docker compose -f docker-compose.prod.yml ps
docker compose -f docker-compose.prod.yml logs -f

# Logs & debugging
docker compose logs -f nlp-processor  # Follow logs
docker compose ps                     # Service status

# Data
docker compose run --rm weaviate-init # Reimport interviews
docker volume rm portals_weaviate_data # Clear DB

# Verify data
curl -s "http://localhost:8080/v1/objects?class=Testimonies" | jq '.objects | length'
curl -s "http://localhost:8080/v1/objects?class=Chunks" | jq '.objects | length'

# Testing NLP
curl -X POST http://localhost:7070/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "Test sentence"}'

# Health checks
curl http://localhost:8080/v1/.well-known/ready  # Weaviate
curl http://localhost:7070/health | jq          # NLP Processor

See docs/COMMANDS.md for the complete list.

πŸ“ Project Structure

ts-portal/
β”œβ”€β”€ app/                    # Next.js application
β”‚   β”œβ”€β”€ story/[storyUuid]/  # Interview detail pages
β”‚   β”œβ”€β”€ stores/             # Zustand state management
β”‚   └── utils/              # Helper functions
β”œβ”€β”€ components/             # React components
β”œβ”€β”€ config/                 # Organization configuration
β”œβ”€β”€ docs/                   # Documentation
β”‚   β”œβ”€β”€ ARCHITECTURE.md
β”‚   β”œβ”€β”€ COMMANDS.md
β”‚   β”œβ”€β”€ DEPLOY_PRODUCTION.md
β”‚   β”œβ”€β”€ ENVIRONMENT.md
β”‚   β”œβ”€β”€ IMPORTING_INTERVIEWS.md
β”‚   └── TROUBLESHOOTING.md
β”œβ”€β”€ json/interviews/        # Interview JSON files (auto-import)
β”œβ”€β”€ lib/                    # Libraries (Weaviate, theme)
β”œβ”€β”€ nlp-processor/          # Python NLP service
β”‚   β”œβ”€β”€ main.py             # FastAPI application
β”‚   β”œβ”€β”€ sentence_chunker.py # Sentence-based chunking algorithm
β”‚   β”œβ”€β”€ ner_processor.py    # GLiNER NER extraction
β”‚   β”œβ”€β”€ embedding_service.py# Local SentenceTransformer embeddings
β”‚   └── weaviate_client.py  # Weaviate batch operations
β”œβ”€β”€ scripts/                # Import and schema scripts
β”œβ”€β”€ types/                  # TypeScript type definitions
β”œβ”€β”€ config.json             # Portal configuration
β”œβ”€β”€ CONFIGURATION.md        # Config documentation
└── docker-compose.yml      # Container orchestration

πŸ™ Credits

Built with:

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
