An AI-powered pipeline that generates financially relevant, novelty-filtered brief reports for a universe of companies. For each entity and reporting window, the service fetches news, extracts key bullet points, filters out previously reported content via embedding-based and search-based novelty checks, and exposes the results through a REST API.
The pipeline processes each entity through five sequential phases:
- Search: exploratory + concept-driven search via Bigdata.com API
- Bullet Generation: LLM extracts material bullet points per theme
- Novelty Check via Embedding: discards bullets already covered in prior runs
- Novelty Check via Search: claim-level verification against recent news evidence
- Post-processing: redundancy removal, thematic consolidation, report assembly (in development)
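The embedding-based novelty check (phase 3) can be illustrated with a minimal sketch: compare a candidate bullet's embedding against embeddings saved from prior runs and discard it when any similarity exceeds a threshold. The threshold value and function names here are illustrative assumptions, not the service's actual implementation.

```python
import math

# Illustrative threshold: bullets at least this similar to a prior bullet
# are treated as already covered (assumed value, not the service's setting).
SIMILARITY_THRESHOLD = 0.85

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def is_novel(candidate: list[float], prior: list[list[float]]) -> bool:
    """A bullet is novel if no prior embedding is too similar to it."""
    return all(cosine_similarity(candidate, p) < SIMILARITY_THRESHOLD for p in prior)
```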
Requirements:

- A Bigdata.com API key
- An OpenAI API key
- Docker (option A) or uv (option B)
# Option A: Docker
# Build
docker build -t bigdata_briefs .
# Run
docker run -d \
--name bigdata_briefs \
-p 8000:8000 \
-e BIGDATA_API_KEY=<your-bigdata-api-key> \
-e OPENAI_API_KEY=<your-openai-api-key> \
  bigdata_briefs

# Option B: uv
# Install uv if needed: https://docs.astral.sh/uv/getting-started/installation/
# Install dependencies
uv sync
# Copy and edit the environment file
cp .env.example .env
# Run
uv run uvicorn bigdata_briefs.api.app:app --host 0.0.0.0 --port 8000

# Health check
curl http://localhost:8000/health

Interactive API docs are available at http://localhost:8000/docs: open it in your browser to explore and try all endpoints interactively.
Submits a list of entity IDs (or a named universe) to the pipeline. All entities run concurrently up to the configured worker pool size. Returns a single `batch_id` to monitor progress.
# Run a list of entities
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
-H "Content-Type: application/json" \
-d '{
"entity_ids": ["0157B1", "D64C6D", "228D42"],
"force_window_start": "2026-04-22T00:00:00",
"force_window_end": "2026-04-22T23:59:59"
}'
# Run an entire pre-defined universe
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
-H "Content-Type: application/json" \
-d '{
"universe": "dow_30",
"force_window_start": "2026-04-22T00:00:00",
"force_window_end": "2026-04-22T23:59:59"
  }'

Response:
{
"batch_id": "3f8a1c2d-...",
"total": 10,
"submitted_at": "2026-04-22T09:00:00"
}
`entity_ids` and `universe` are mutually exclusive. Available universes: `dow_30`, `eurostoxx_50`, `top_us_100`, `top_us_500`, `top_eu_100`, `top_eu_500`.
Omit `force_window_start` / `force_window_end` to use the automatic incremental window.
Re-running a window that overlaps an already-completed run for the same entity is blocked automatically.
Same as `run-parallel`, but processes entities sequentially. Useful for controlled, lower-concurrency runs.
curl -X POST http://localhost:8000/api/v1/batch/run \
-H "Content-Type: application/json" \
-d '{
"entity_ids": ["0157B1", "D64C6D"],
"force_window_start": "2026-04-22T00:00:00",
"force_window_end": "2026-04-22T23:59:59"
  }'

Returns the real-time status of a batch submitted via `run-parallel`: how many entities have succeeded, failed, are still running, or have not started yet.
curl http://localhost:8000/api/v1/batch/parallel/3f8a1c2d-.../status

Response:
{
"batch_id": "3f8a1c2d-...",
"total": 10,
"succeeded": 8,
"failed": 1,
"running": 1,
"not_started": 0,
"runs": [
{
"entity_id": "0157B1",
"run_id": "...",
"status": "succeeded",
"started_at": "2026-04-22T09:00:03",
"completed_at": "2026-04-22T09:04:51"
}
]
}

Returns the published bullet points for one or more entities, grouped by run. Pass an empty body to retrieve all entities in the database.
curl -X POST http://localhost:8000/api/v1/batch/bullets \
-H "Content-Type: application/json" \
-d '{"entity_ids": ["0157B1", "D64C6D"]}'
# All entities
curl -X POST http://localhost:8000/api/v1/batch/bullets \
-H "Content-Type: application/json" \
  -d '{}'

Each bullet includes the final text, source citations (headline, chunk text), and novelty metadata (`search_action`, `not_fully_novel`).
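A consumer of this endpoint can split bullets into fully novel and partially novel items using the `not_fully_novel` flag described above. A minimal sketch, assuming each bullet is a JSON object carrying that field (the exact response shape should be checked against /docs):

```python
# Split a bullets response into fully novel vs partially novel items,
# based on the `not_fully_novel` metadata flag.
def split_by_novelty(bullets: list[dict]) -> tuple[list[dict], list[dict]]:
    fully = [b for b in bullets if not b.get("not_fully_novel")]
    partial = [b for b in bullets if b.get("not_fully_novel")]
    return fully, partial
```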
Returns full pipeline detail for every bullet (both published and discarded) for one or more entities. Pass an empty body to retrieve all entities.
For each bullet you get:
- Published bullets: relevance score and reasoning that justified publishing
- Discarded bullets: the stage that eliminated them and the reason:
  - `relevance_score`: scored too low on financial materiality
  - `grounding`: text not verifiable against cited sources
  - `novelty_embedding`: already reported in a previous run
  - `novelty_search`: per-claim verdicts with the evidence chunks that already covered the information
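When analyzing discard patterns across a run, it can help to bucket discarded bullets by the stage that eliminated them. A sketch, assuming each bullet object exposes a `discard_stage` field matching the stages listed above (the field name is an assumption; verify it against the actual response at /docs):

```python
from collections import defaultdict

# Group discarded bullets by eliminating stage, e.g. to see whether most
# discards come from relevance scoring or from the novelty checks.
def group_by_discard_stage(discarded: list[dict]) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = defaultdict(list)
    for bullet in discarded:
        groups[bullet.get("discard_stage", "unknown")].append(bullet)
    return dict(groups)
```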
curl -X POST http://localhost:8000/api/v1/batch/bullets/detail \
-H "Content-Type: application/json" \
  -d '{"entity_ids": ["0157B1"]}'

Generates a self-contained HTML page that can be opened directly in the browser. Published bullets are shown in green (fully novel) or amber (partially novel, rewritten to surface the new element). Discarded bullets are grouped under a collapsible Discard section showing the reason, stage, and (for novelty search discards) the prior evidence that already covered each claim.
# All entities (open directly in browser)
http://localhost:8000/api/v1/report/html
# Single entity
http://localhost:8000/api/v1/report/html?entity_id=0157B1
# With API key
http://localhost:8000/api/v1/report/html?entity_id=0157B1&api_key=<your-secret-key>
Download via curl:
curl "http://localhost:8000/api/v1/report/html?entity_id=0157B1" -o report.html

Drops and recreates all database tables. Use with caution: all run history, embeddings, and saved bullets are permanently deleted.
curl -X POST http://localhost:8000/api/v1/admin/reset-db

Useful when starting a fresh evaluation or clearing test data before a new run.
Resets rows that are stuck in `running` status (e.g. after a service crash). Rows older than the configured threshold are marked as `failed` so they no longer block re-runs for the same entity.
curl -X POST http://localhost:8000/api/v1/admin/clear-stale-runs

Every run covers a time window `[start, end)`. You can either specify it explicitly with `force_window_start` / `force_window_end`, or let the pipeline compute it automatically via `window_mode`.
Explicit window: use `force_window_start` and `force_window_end` to target a specific period. One day at a time is recommended: a single-day window gives the model a focused, bounded set of news to analyze, producing sharper bullets and more reliable novelty comparisons. Wider windows are possible, but for entities with a high volume of news they can produce briefs whose temporal references are ambiguous or inconsistent.
Automatic window: omit the dates and let `window_mode` decide.

`daily`: covers [UTC midnight of today → now].
- If the pipeline already ran today, it resumes from exactly where that run ended.
- If the last run was yesterday or earlier, it resets to midnight of today: prior days never influence today's window start.

`continuous`: covers [end of last run → now].
- If the last run was yesterday at 18:00, today's run covers from 18:00 yesterday to now: no gap, no reset.
- If no previous run exists, it falls back to [UTC midnight of today → now], same as `daily`.
| Previous run | `daily` | `continuous` |
|---|---|---|
| No previous run | [today midnight → now] | [today midnight → now] |
| Last run was today at 09:00 | [09:00 → now] | [09:00 → now] |
| Last run was yesterday at 18:00 | [today midnight → now] | [yesterday 18:00 → now] |
| Last run was 3 days ago | [today midnight → now] | [3 days ago end → now] |
Use `daily` for standard day-by-day monitoring. Use `continuous` when you need to guarantee no gaps across runs, regardless of when they were last triggered.
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
-H "Content-Type: application/json" \
  -d '{"universe": "dow_30", "window_mode": "continuous"}'

Overlap protection: if the requested window (forced or automatic) overlaps any already-completed run for the same entity, that entity's run is rejected immediately with an error and marked as `failed` in the batch status. No API calls or LLM calls are made. Use a non-overlapping date range or reset the entity via `POST /api/v1/admin/reset-db` to clear its history.
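Because run windows are half-open intervals `[start, end)`, the overlap rule reduces to a standard interval check: two windows overlap exactly when each starts before the other ends. A sketch of that rule (the service's own check may differ in detail):

```python
from datetime import datetime

# Half-open windows [a_start, a_end) and [b_start, b_end) overlap iff
# each window starts strictly before the other one ends. Back-to-back
# windows sharing a boundary do NOT overlap.
def windows_overlap(a_start: datetime, a_end: datetime,
                    b_start: datetime, b_end: datetime) -> bool:
    return a_start < b_end and b_start < a_end
```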
Six entity universes are bundled with the service:
| Universe | Entities | Description |
|---|---|---|
| `dow_30` | 30 | Dow Jones Industrial Average components |
| `eurostoxx_50` | 50 | Euro Stoxx 50 components |
| `top_us_100` | 100 | Top 100 US companies by market cap |
| `top_us_500` | 500 | Top 500 US companies by market cap |
| `top_eu_100` | 100 | Top 100 European companies by market cap |
| `top_eu_500` | 500 | Top 500 European companies by market cap |
Pass `"universe": "top_us_100"` (or any name above) to `run-parallel` or `run` instead of an explicit `entity_ids` list.
| Environment variable | Description | Default |
|---|---|---|
| `BIGDATA_API_KEY` | Bigdata.com API key (required) | - |
| `OPENAI_API_KEY` | OpenAI API key (required) | - |
| `MAX_CONCURRENT_ENTITIES` | Max entities running in parallel | `10` |
| `DB_STRING` | SQLite connection string | `sqlite:///briefs.db` |
| `LLM_TIMEOUT_SECONDS` | LLM call timeout (seconds) | `60` |
| `NOVELTY_LOOKBACK_DAYS` | Days of history used for novelty checks | `14` |
See `.env.example` for the full list with descriptions.
Service not responding
docker logs bigdata_briefs
curl http://localhost:8000/health

Entity stuck in `running` for a long time
Call `POST /api/v1/admin/clear-stale-runs` to reset it, then re-submit the entity.
All bullets discarded
Expected when the entity has no materially new information in the requested window relative to prior runs. Try a different date range or run on a day with more news activity for that entity.
