feat: Weekly DB publish (Parquet) #3115

@pochoclin

Description

Automate a weekly export of the canonical media database to Parquet with a signed manifest and integrity hashes. This export will include all links to all known providers discovered by our Popcorn Time bot (the crawler that scans the web).

As we value our community and want people to build on top of our dataset, we will publish the database weekly in an open, reusable format that can be easily consumed by developers and researchers.

Scale (current, rounded)

  • Links: 11,000,000+ (growing daily)
  • Media (movies + TV): 235,000+ (growing daily)
  • Providers: 800+
  • Countries: ~30; Languages: ~150

Tasks

  • Parquet datasets with zstd compression
  • Versioned manifest per release
  • CI workflow on a weekly schedule that builds, validates, uploads, and tags
  • Backfill job runnable on demand
  • Artifacts (Parquet) hosted on CDN bucket with immutable, versioned paths (/schema=v1/date=YYYY-MM-DD/...)
  • Manifests are published in GitHub Releases (tag db-YYYY-MM-DD), not committed to the repo history
  • Manifest includes canonical CDN URLs and integrity hashes for each file
  • Manifest is signed (GPG or Sigstore); publish signature alongside the asset in the Release
  • CDN Cache-Control headers configured for the versioned paths
  • CDN bucket lifecycle policy: keep N weekly versions; archive or delete older ones
  • Access logs enabled; checksums verified post-upload in CI
  • CORS allows GET,HEAD for programmatic clients (TBD)
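
The path, tag, and lifecycle conventions in the task list can be sketched as a few small helpers. This is only an illustration: the function names (`releasePrefix`, `releaseTag`, `partitionByRetention`) are hypothetical, and whatever follows the `date=` segment in the bucket layout is left out because the issue does not specify it.

```typescript
// Hypothetical helpers; only the prefix and tag formats come from the task list.
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function assertDate(date: string): void {
  if (!DATE_RE.test(date)) {
    throw new Error(`expected YYYY-MM-DD, got "${date}"`);
  }
}

// Immutable, versioned prefix: /schema=v1/date=YYYY-MM-DD/
function releasePrefix(schemaMajor: number, date: string): string {
  assertDate(date);
  return `/schema=v${schemaMajor}/date=${date}/`;
}

// GitHub Release tag that carries the manifest: db-YYYY-MM-DD
function releaseTag(date: string): string {
  assertDate(date);
  return `db-${date}`;
}

// Lifecycle: split release dates into the `keep` most recent ones and the rest,
// which the policy would archive or delete.
function partitionByRetention(
  dates: string[],
  keep: number,
): { kept: string[]; expired: string[] } {
  dates.forEach(assertDate);
  // Drop duplicates, then sort newest first; ISO dates sort chronologically as strings.
  const unique = dates.filter((d, i) => dates.indexOf(d) === i);
  const newestFirst = unique.sort().reverse();
  return { kept: newestFirst.slice(0, keep), expired: newestFirst.slice(keep) };
}
```

For example, `releasePrefix(1, "2025-09-16")` yields `/schema=v1/date=2025-09-16/` and `releaseTag("2025-09-16")` yields `db-2025-09-16`. Because the paths are immutable, the retention step only ever selects whole releases for removal; it never rewrites a published one.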

Parquet constraints

  • Compression: ZSTD (level to be determined by benchmarking)
  • Row group size: 128–256 MB
  • Dictionary encoding enabled where possible

Versioned Manifest

Each weekly release includes a manifest.json file published alongside the Parquet datasets. The manifest is the authoritative contract for schema version, release metadata, and artifact integrity.

Sample TypeScript types, for reference:

type ArtifactKind = "media" | "links" | "providers";

type Manifest = {
  schema_version: string;      // e.g. "1.0.0"
  release_date: string;        // "YYYY-MM-DD"
  git_commit: string;          // commit hash of exporter
  version: string;             // monotonically increasing, e.g. "2025.09.16.1"
  artifacts: Record<ArtifactKind, Artifact>;
};

type Artifact = {
  path: string;                // object storage prefix
  rows: number;                // total row count
  sha256: string;              // aggregate hash of all files
  files: FileMeta[];
};

type FileMeta = {
  name: string;                // filename
  size_bytes: number;          // uncompressed size
  rows: number;                // row count
  sha256: string;              // file hash
};
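
A consumer or the CI validation step could use these types to check that a manifest is internally consistent. The sketch below is an assumption-heavy illustration: the issue does not define how the aggregate `sha256` is derived, so `aggregateSha256` uses a hypothetical rule (hash the sorted `name:sha256` lines), and `checkArtifact` is a made-up helper name. It assumes Node.js for `node:crypto`.

```typescript
import { createHash } from "node:crypto";

// Types repeated from the sample above so the sketch is self-contained.
type FileMeta = { name: string; size_bytes: number; rows: number; sha256: string };
type Artifact = { path: string; rows: number; sha256: string; files: FileMeta[] };

const SHA256_HEX = /^[0-9a-f]{64}$/;

// ASSUMED aggregation rule (not specified in the issue): build one "name:sha256"
// line per file, sort the lines, and hash their concatenation. Sorting makes the
// result independent of the order in which files are listed.
function aggregateSha256(files: FileMeta[]): string {
  const lines = files.map((f) => `${f.name}:${f.sha256}\n`).sort();
  return createHash("sha256").update(lines.join(""), "utf8").digest("hex");
}

// Returns human-readable problems; an empty array means the artifact is consistent.
function checkArtifact(kind: string, artifact: Artifact): string[] {
  const problems: string[] = [];

  // The artifact-level row count should equal the sum of per-file row counts.
  const rowSum = artifact.files.reduce((total, f) => total + f.rows, 0);
  if (rowSum !== artifact.rows) {
    problems.push(`${kind}: rows is ${artifact.rows} but files sum to ${rowSum}`);
  }

  // Every file hash should be a lowercase 64-character hex string.
  for (const f of artifact.files) {
    if (!SHA256_HEX.test(f.sha256)) {
      problems.push(`${kind}: ${f.name} has a malformed sha256`);
    }
  }

  // The aggregate hash should match the one recomputed from the file list.
  if (artifact.sha256 !== aggregateSha256(artifact.files)) {
    problems.push(`${kind}: aggregate sha256 does not match its files`);
  }
  return problems;
}
```

This only checks the manifest against itself. Verifying a downloaded Parquet file still means hashing its bytes and comparing the result with the corresponding `FileMeta.sha256`, and the manifest signature should be verified before any of its hashes are trusted.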

Stable column names and types

Media

  • id
  • title
  • year
  • content_type
  • original_language
  • countries[]
  • genres[]
  • tmdb_id
  • rank
  • updated_at

Links

  • id
  • media_id
  • provider
  • provider_ref
  • country?
  • url (hashed)
  • quality?
  • audios[]
  • subtitles[]
  • updated_at

Providers

  • id
  • name
  • country?
  • kind
  • updated_at
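
For consumers, the column lists above might map to row types like the following. The column names come from the lists; every TypeScript type is an assumption, since the issue does not state column types, and `?` columns are modeled here as nullable. `groupLinksByMedia` is a hypothetical helper showing the one relationship the columns make explicit: `links.media_id` referring to `media.id`.

```typescript
// Column names are from the issue; all types below are ASSUMED for illustration.
type MediaRow = {
  id: string;
  title: string;
  year: number;
  content_type: string;
  original_language: string;
  countries: string[];
  genres: string[];
  tmdb_id: number;
  rank: number;
  updated_at: string;
};

type LinkRow = {
  id: string;
  media_id: string;        // refers to MediaRow.id
  provider: string;
  provider_ref: string;
  country: string | null;  // "country?" in the column list
  url: string;             // hashed, per the column list
  quality: string | null;  // "quality?" in the column list
  audios: string[];
  subtitles: string[];
  updated_at: string;
};

type ProviderRow = {
  id: string;
  name: string;
  country: string | null;  // "country?" in the column list
  kind: string;
  updated_at: string;
};

// Index links by the media item they belong to, so a consumer can look up
// all links for a given MediaRow.id without rescanning the links table.
function groupLinksByMedia(links: LinkRow[]): Map<string, LinkRow[]> {
  const byMedia = new Map<string, LinkRow[]>();
  for (const link of links) {
    const bucket = byMedia.get(link.media_id);
    if (bucket) {
      bucket.push(link);
    } else {
      byMedia.set(link.media_id, [link]);
    }
  }
  return byMedia;
}
```

With roughly 11 million links, an in-memory index like this may be too large for some clients; a query engine that reads Parquet directly would perform the same join without loading everything at once.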

Labels

feature, github_actions, open discussion, rust

Status

In Progress