Bigdata Briefs v2.0

An AI-powered pipeline that generates financially relevant, novelty-filtered brief reports for a universe of companies. For each entity and reporting window, the service fetches news, extracts key bullet points, filters out previously reported content via embedding-based and search-based novelty checks, and exposes the results through a REST API.

Architecture overview

Pipeline diagram

The pipeline processes each entity through five sequential phases:

  1. Search: exploratory + concept-driven search via Bigdata.com API
  2. Bullet Generation: LLM extracts material bullet points per theme
  3. Novelty Check via Embedding: discards bullets already covered in prior runs
  4. Novelty Check via Search: claim-level verification against recent news evidence
  5. Post-processing: redundancy removal, thematic consolidation, report assembly (in development)
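The five phases can be sketched as a simple sequential flow. All function names below are illustrative stand-ins, not the service's actual internals:

```python
# Illustrative sketch of the five-phase flow for a single entity.
# Every function name here is hypothetical; the real service wires
# these phases together internally.

def run_entity_pipeline(entity_id, window):
    """Run the five sequential phases and return the published bullets."""
    documents = search(entity_id, window)                    # 1. Search
    bullets = generate_bullets(documents)                    # 2. Bullet generation
    bullets = novelty_filter_embedding(entity_id, bullets)   # 3. Embedding novelty
    bullets = novelty_filter_search(entity_id, bullets)      # 4. Search novelty
    return post_process(bullets)                             # 5. Post-processing

# Minimal stubs so the sketch runs end to end.
def search(entity_id, window): return [f"doc for {entity_id}"]
def generate_bullets(docs): return [f"bullet from {d}" for d in docs]
def novelty_filter_embedding(entity_id, bullets): return bullets
def novelty_filter_search(entity_id, bullets): return bullets
def post_process(bullets): return bullets
```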

Prerequisites

  • A Bigdata.com API key
  • An OpenAI API key
  • Docker (option A) or uv (option B)

Quickstart

Option A: Docker

# Build
docker build -t bigdata_briefs .

# Run
docker run -d \
  --name bigdata_briefs \
  -p 8000:8000 \
  -e BIGDATA_API_KEY=<your-bigdata-api-key> \
  -e OPENAI_API_KEY=<your-openai-api-key> \
  bigdata_briefs

Option B: uv (no Docker)

# Install uv if needed: https://docs.astral.sh/uv/getting-started/installation/

# Install dependencies
uv sync

# Copy and edit the environment file
cp .env.example .env

# Run
uv run uvicorn bigdata_briefs.api.app:app --host 0.0.0.0 --port 8000

Verify the service

curl http://localhost:8000/health

Interactive API docs are available at http://localhost:8000/docs; open it in your browser to explore and try every endpoint.

Available endpoints

Run the pipeline

POST /api/v1/batch/run-parallel

Submits a list of entity IDs (or a named universe) to the pipeline. All entities run concurrently up to the configured worker pool size. Returns a single batch_id to monitor progress.

# Run a list of entities
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["0157B1", "D64C6D", "228D42"],
    "force_window_start": "2026-04-22T00:00:00",
    "force_window_end": "2026-04-22T23:59:59"
  }'

# Run an entire pre-defined universe
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{
    "universe": "dow_30",
    "force_window_start": "2026-04-22T00:00:00",
    "force_window_end": "2026-04-22T23:59:59"
  }'

Response:

{
  "batch_id": "3f8a1c2d-...",
  "total": 10,
  "submitted_at": "2026-04-22T09:00:00"
}

entity_ids and universe are mutually exclusive. Available universes: dow_30, eurostoxx_50, top_us_100, top_us_500, top_eu_100, top_eu_500.
Omit force_window_start / force_window_end to use the automatic incremental window.
Re-running a window that overlaps an already-completed run for the same entity is blocked automatically.
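A small client-side helper can mirror these rules when building the request body. The helper and its signature are illustrative, not part of the service:

```python
def build_batch_payload(entity_ids=None, universe=None,
                        force_window_start=None, force_window_end=None):
    """Build a request body for /api/v1/batch/run-parallel.

    entity_ids and universe are mutually exclusive, mirroring the API's
    validation; this is a hypothetical client-side convenience.
    """
    if (entity_ids is None) == (universe is None):
        raise ValueError("Provide exactly one of entity_ids or universe")
    if entity_ids is not None:
        payload = {"entity_ids": entity_ids}
    else:
        payload = {"universe": universe}
    # Omit the forced window entirely to use the automatic incremental window.
    if force_window_start and force_window_end:
        payload["force_window_start"] = force_window_start
        payload["force_window_end"] = force_window_end
    return payload
```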

POST /api/v1/batch/run

Same as run-parallel, but processes entities sequentially. Useful for controlled, lower-concurrency runs.

curl -X POST http://localhost:8000/api/v1/batch/run \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["0157B1", "D64C6D"],
    "force_window_start": "2026-04-22T00:00:00",
    "force_window_end": "2026-04-22T23:59:59"
  }'

Monitor a batch

GET /api/v1/batch/parallel/{batch_id}/status

Returns the real-time status of a batch submitted via run-parallel: how many entities have succeeded, failed, are still running, or have not started yet.

curl http://localhost:8000/api/v1/batch/parallel/3f8a1c2d-.../status

Response:

{
  "batch_id": "3f8a1c2d-...",
  "total": 10,
  "succeeded": 8,
  "failed": 1,
  "running": 1,
  "not_started": 0,
  "runs": [
    {
      "entity_id": "0157B1",
      "run_id": "...",
      "status": "succeeded",
      "started_at": "2026-04-22T09:00:03",
      "completed_at": "2026-04-22T09:04:51"
    }
  ]
}
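The counter fields in the status response make it easy to poll until a batch finishes. A hypothetical helper operating on the response dict:

```python
def batch_summary(status):
    """Summarize a /batch/parallel/{batch_id}/status response dict.

    Returns (done, progress_fraction). Illustrative helper that relies
    only on the counter fields shown in the response above.
    """
    finished = status["succeeded"] + status["failed"]
    done = status["running"] == 0 and status["not_started"] == 0
    return done, finished / status["total"]
```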

Retrieve results

POST /api/v1/batch/bullets

Returns the published bullet points for one or more entities, grouped by run. Pass an empty body to retrieve all entities in the database.

curl -X POST http://localhost:8000/api/v1/batch/bullets \
  -H "Content-Type: application/json" \
  -d '{"entity_ids": ["0157B1", "D64C6D"]}'

# All entities
curl -X POST http://localhost:8000/api/v1/batch/bullets \
  -H "Content-Type: application/json" \
  -d '{}'

Each bullet includes the final text, source citations (headline, chunk text), and novelty metadata (search_action, not_fully_novel).

POST /api/v1/batch/bullets/detail

Returns full pipeline detail for every bullet (both published and discarded) for one or more entities. Pass an empty body to retrieve all entities.

For each bullet you get:

  • Published bullets: relevance score and reasoning that justified publishing
  • Discarded bullets: the stage that eliminated them and the reason:
    • relevance_score: scored too low on financial materiality
    • grounding: text not verifiable against cited sources
    • novelty_embedding: already reported in a previous run
    • novelty_search: per-claim verdicts with the evidence chunks that already covered the information

curl -X POST http://localhost:8000/api/v1/batch/bullets/detail \
  -H "Content-Type: application/json" \
  -d '{"entity_ids": ["0157B1"]}'
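To work with the detail response programmatically, you might group discarded bullets by the stage that eliminated them. The discard_stage field name here is an assumption about the response shape, shown only as a sketch:

```python
from collections import defaultdict

def group_discards_by_stage(bullets):
    """Group discarded bullets by eliminating stage (relevance_score,
    grounding, novelty_embedding, novelty_search).

    The 'discard_stage' key is a hypothetical field name; adapt it to
    the actual detail-response schema.
    """
    groups = defaultdict(list)
    for bullet in bullets:
        stage = bullet.get("discard_stage")
        if stage is not None:  # published bullets carry no discard stage
            groups[stage].append(bullet)
    return dict(groups)
```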

HTML report

GET /api/v1/report/html

Generates a self-contained HTML page that can be opened directly in the browser. Published bullets are shown in green (fully novel) or amber (partially novel, rewritten to surface the new element). Discarded bullets are grouped under a collapsible Discard section showing the reason, stage, and (for novelty search discards) the prior evidence that already covered each claim.

# All entities (open directly in browser)
http://localhost:8000/api/v1/report/html

# Single entity
http://localhost:8000/api/v1/report/html?entity_id=0157B1

# With API key
http://localhost:8000/api/v1/report/html?entity_id=0157B1&api_key=<your-secret-key>

Download via curl:

curl "http://localhost:8000/api/v1/report/html?entity_id=0157B1" -o report.html

Administration

POST /api/v1/admin/reset-db

Drops and recreates all database tables. Use with caution: all run history, embeddings, and saved bullets are permanently deleted.

curl -X POST http://localhost:8000/api/v1/admin/reset-db

Useful when starting a fresh evaluation or clearing test data before a new run.

POST /api/v1/admin/clear-stale-runs

Resets rows that are stuck in running status (e.g. after a service crash). Rows older than the configured threshold are marked as failed so they no longer block re-runs for the same entity.

curl -X POST http://localhost:8000/api/v1/admin/clear-stale-runs
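The staleness rule reduces to a simple time comparison. The 60-minute default below is a hypothetical stand-in for the service's configured threshold:

```python
from datetime import datetime, timedelta

def is_stale(started_at, now, threshold_minutes=60):
    """Decide whether a run still marked 'running' should be failed.

    threshold_minutes is an illustrative placeholder for the service's
    configured staleness threshold.
    """
    return now - started_at > timedelta(minutes=threshold_minutes)
```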

Understanding run windows

Every run covers a time window [start, end). You can either specify it explicitly with force_window_start / force_window_end, or let the pipeline compute it automatically via window_mode.

Explicit window: use force_window_start and force_window_end to target a specific period. One day at a time is recommended: a single-day window gives the model a focused, bounded set of news to analyze, producing sharper bullets and more reliable novelty comparisons. Wider windows are possible, but for entities with a high volume of news they can produce briefs whose temporal references are ambiguous or inconsistent.

Automatic window: omit the dates and let window_mode decide:

daily (default)

Covers [UTC midnight of today → now].

  • If the pipeline already ran today, it resumes from exactly where that run ended.
  • If the last run was yesterday or earlier, it always resets to midnight of today: prior days never influence today's window start.

continuous

Covers [end of last run → now].

  • If the last run was yesterday at 18:00, today's run covers from 18:00 yesterday to now: no gap, no reset.
  • If no previous run exists, falls back to [UTC midnight of today → now], same as daily.

| Scenario | daily | continuous |
| --- | --- | --- |
| No previous run | [today midnight → now] | [today midnight → now] |
| Last run was today at 09:00 | [09:00 → now] | [09:00 → now] |
| Last run was yesterday at 18:00 | [today midnight → now] | [yesterday 18:00 → now] |
| Last run was 3 days ago | [today midnight → now] | [3 days ago end → now] |

Use daily for standard day-by-day monitoring. Use continuous when you need to guarantee no gaps across runs regardless of when they were last triggered.

curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{"universe": "dow_30", "window_mode": "continuous"}'
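The window rules above can be sketched as a pure function. This is an illustrative reconstruction using naive UTC timestamps; the service's internal logic may differ in detail:

```python
from datetime import datetime

def compute_window(now, last_run_end=None, window_mode="daily"):
    """Compute the automatic [start, end) run window (times assumed UTC).

    Sketch of the documented rules: 'daily' never reaches back past
    today's midnight; 'continuous' resumes exactly where the last run ended.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if last_run_end is None:
        start = midnight                 # no history: both modes start at UTC midnight
    elif window_mode == "continuous":
        start = last_run_end             # no gap, no reset
    else:                                # daily
        start = last_run_end if last_run_end >= midnight else midnight
    return start, now
```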

Overlap protection: if the requested window (forced or automatic) overlaps any already-completed run for the same entity, that entity's run is rejected immediately with an error and marked as failed in the batch status. No API calls or LLM calls are made. Use a non-overlapping date range or reset the entity via POST /api/v1/admin/reset-db to clear its history.
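Overlap between two half-open windows [start, end) reduces to the standard interval test, shown here as an illustrative sketch of the check the protection implies:

```python
def windows_overlap(start_a, end_a, start_b, end_b):
    """True if half-open windows [start_a, end_a) and [start_b, end_b) overlap.

    Works with any comparable timestamps, including ISO 8601 strings.
    Back-to-back windows (one ends exactly where the other starts)
    do not overlap.
    """
    return start_a < end_b and start_b < end_a
```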

Pre-defined universes

Six entity universes are bundled with the service:

| Universe | Entities | Description |
| --- | --- | --- |
| dow_30 | 30 | Dow Jones Industrial Average components |
| eurostoxx_50 | 50 | Euro Stoxx 50 components |
| top_us_100 | 100 | Top 100 US companies by market cap |
| top_us_500 | 500 | Top 500 US companies by market cap |
| top_eu_100 | 100 | Top 100 European companies by market cap |
| top_eu_500 | 500 | Top 500 European companies by market cap |

Pass "universe": "top_us_100" (or any name above) to run-parallel or run instead of an explicit entity_ids list.

Configuration reference

| Environment variable | Description | Default |
| --- | --- | --- |
| BIGDATA_API_KEY | Bigdata.com API key (required) | — |
| OPENAI_API_KEY | OpenAI API key (required) | — |
| MAX_CONCURRENT_ENTITIES | Max entities running in parallel | 10 |
| DB_STRING | SQLite connection string | sqlite:///briefs.db |
| LLM_TIMEOUT_SECONDS | LLM call timeout (seconds) | 60 |
| NOVELTY_LOOKBACK_DAYS | Days of history used for novelty checks | 14 |

See .env.example for the full list with descriptions.

Troubleshooting

Service not responding

docker logs bigdata_briefs
curl http://localhost:8000/health

Entity stuck in running for a long time
Call POST /api/v1/admin/clear-stale-runs to reset it, then re-submit the entity.

All bullets discarded
Expected when the entity has no materially new information in the requested window relative to prior runs. Try a different date range or run on a day with more news activity for that entity.
