Open
Labels
feature, github_actions, open discussion, rust
Description
Automate a weekly export of the canonical media database to Parquet with a signed manifest and integrity hashes. This export will include all links to all known providers discovered by our Popcorn Time bot (the crawler that scans the web).
As we value our community and want people to build on top of our dataset, we will publish the database weekly in an open, reusable format that can be easily consumed by developers and researchers.
Scale (current, rounded)
- Links: 11,000,000+ (growing daily)
- Media (movies + TV): 235,000+ (growing daily)
- Providers: 800+
- Countries: ~30; Languages: ~150
Tasks
- Parquet datasets with zstd compression
- Versioned manifest per release
- CI workflow on a weekly schedule that builds, validates, uploads, and tags
- Backfill job runnable on demand
- Artifacts (Parquet) hosted on CDN bucket with immutable, versioned paths (/schema=v1/date=YYYY-MM-DD/...)
- Manifests are published in GitHub Releases (tag db-YYYY-MM-DD), not committed to the repo history
- Manifest includes canonical CDN URLs and integrity hashes for each file
- Manifest is signed (GPG or Sigstore); publish signature alongside the asset in the Release
- CDN cache control
- CDN lifecycle policy: keep N weekly versions; archive or delete older ones
- Access logs enabled; checksums verified post-upload in CI
- CORS allows GET, HEAD for programmatic clients (TBD)
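As a starting point for discussion, the path, tag, and retention conventions in the task list could be sketched as follows. The function names are hypothetical, the date check only validates the `YYYY-MM-DD` shape, and the retention helper assumes "keep N weekly versions" means keeping the N most recent release dates.

```typescript
// Hypothetical helpers; none of these names exist in the exporter yet.

const DATE_SHAPE = /^\d{4}-\d{2}-\d{2}$/;

// Shape check only: it does not reject impossible dates such as 2025-02-31.
function assertDateShape(date: string): void {
  if (!DATE_SHAPE.test(date)) {
    throw new Error(`expected YYYY-MM-DD, got "${date}"`);
  }
}

// Immutable, versioned CDN prefix for one release: /schema=v1/date=YYYY-MM-DD/
function releasePrefix(schemaMajor: number, date: string): string {
  assertDateShape(date);
  return `/schema=v${schemaMajor}/date=${date}/`;
}

// GitHub Release tag that carries the manifest: db-YYYY-MM-DD
function releaseTag(date: string): string {
  assertDateShape(date);
  return `db-${date}`;
}

// Assumption: retention keeps the `keep` newest release dates and
// returns everything older as candidates for archiving or deletion.
function releasesToExpire(releaseDates: string[], keep: number): string[] {
  if (!Number.isInteger(keep) || keep < 0) {
    throw new Error("keep must be a non-negative integer");
  }
  releaseDates.forEach(assertDateShape);
  // ISO dates sort chronologically when sorted as plain strings.
  const newestFirst = [...releaseDates].sort().reverse();
  return newestFirst.slice(keep);
}
```

Because paths are keyed by schema version and date, a re-run for the same date would target the same prefix, so the workflow would need to refuse overwrites to keep paths immutable.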
Parquet constraints
- Compression: ZSTD (determine best level)
- Row group: 128–256 MB
- Dictionary encoding enabled where possible
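These constraints could be captured in a small settings object that CI validates before writing. This is only a sketch: the field names are illustrative and do not correspond to any particular Parquet library, "MB" is interpreted as MiB, and the level range reflects zstd's regular levels 1 to 22.

```typescript
// Hypothetical settings type; field names are assumptions for illustration.
type ParquetWriteOptions = {
  compression: "zstd";
  zstdLevel: number; // best level is still to be determined
  rowGroupTargetBytes: number;
  dictionaryEncoding: boolean;
};

const MiB = 1024 * 1024;
const ROW_GROUP_MIN_BYTES = 128 * MiB; // assumption: MB read as MiB
const ROW_GROUP_MAX_BYTES = 256 * MiB;

// Returns a list of human-readable violations; empty means the options conform.
function validateOptions(o: ParquetWriteOptions): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(o.zstdLevel) || o.zstdLevel < 1 || o.zstdLevel > 22) {
    problems.push("zstdLevel must be an integer between 1 and 22");
  }
  if (o.rowGroupTargetBytes < ROW_GROUP_MIN_BYTES || o.rowGroupTargetBytes > ROW_GROUP_MAX_BYTES) {
    problems.push("rowGroupTargetBytes must be between 128 MiB and 256 MiB");
  }
  if (!o.dictionaryEncoding) {
    problems.push("dictionaryEncoding should be enabled where possible");
  }
  return problems;
}

// Rough estimate of how many row groups a dataset of a given size would produce.
function estimateRowGroups(totalBytes: number, o: ParquetWriteOptions): number {
  return Math.max(1, Math.ceil(totalBytes / o.rowGroupTargetBytes));
}
```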
Versioned Manifest
Each weekly release includes a manifest.json file published alongside the Parquet datasets. The manifest is the authoritative contract for schema version, release metadata, and artifact integrity.
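To illustrate how a consumer (or the post-upload CI check) might use the manifest for integrity verification, here is a minimal sketch using Node's built-in `crypto` module. The helper names are hypothetical. In particular, the issue does not define how the aggregate hash is computed; the version below (SHA-256 over `name:sha256` lines sorted by file name) is just one possible definition to anchor the discussion.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a byte buffer.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Throws if the downloaded bytes do not match the hash listed in the manifest.
function verifyFile(entry: { name: string; sha256: string }, data: Uint8Array): void {
  const actual = sha256Hex(data);
  if (actual !== entry.sha256.toLowerCase()) {
    throw new Error(`hash mismatch for ${entry.name}: expected ${entry.sha256}, got ${actual}`);
  }
}

// ASSUMPTION: one possible aggregate definition, not specified by the issue.
// Hashes the "name:sha256" lines of all files, sorted by name so the result
// does not depend on listing order.
function aggregateSha256(files: { name: string; sha256: string }[]): string {
  const lines = [...files]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map((f) => `${f.name}:${f.sha256.toLowerCase()}\n`);
  return sha256Hex(lines.join(""));
}
```

Whatever definition is chosen, it should be documented next to the manifest so third parties can recompute the aggregate without reading the exporter source.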
Sample TypeScript types for reference:
type ArtifactKind = "media" | "links" | "providers";
type Manifest = {
  schema_version: string; // e.g. "1.0.0"
  release_date: string; // "YYYY-MM-DD"
  git_commit: string; // commit hash of exporter
  version: string; // monotonically increasing, e.g. "2025.09.16.1"
  artifacts: Record<ArtifactKind, Artifact>;
};
type Artifact = {
  path: string; // object storage prefix
  rows: number; // total row count
  sha256: string; // aggregate hash of all files
  files: FileMeta[];
};
type FileMeta = {
  name: string; // filename
  size_bytes: number; // uncompressed size
  rows: number; // row count
  sha256: string; // file hash
};
Stable column names and types
Media
- id
- title
- year
- content_type
- original_language
- countries[]
- genres[]
- tmdb_id
- rank
- updated_at
Links
- id
- media_id
- provider
- provider_ref
- country?
- url (hashed)
- quality?
- audios[]
- subtitles[]
- updated_at
Providers
- id
- name
- country?
- kind
- updated_at
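The column lists above use a compact notation that could be parsed into a machine-checkable spec, for example to validate rows in CI before export. The sketch below assumes that a trailing `?` marks a nullable column, a trailing `[]` marks a list column, and a parenthesized suffix is a free-form note; the issue does not state this explicitly. Column data types are not given in the issue, so none are guessed here. All names are hypothetical.

```typescript
type ColumnSpec = {
  name: string;
  nullable: boolean; // assumption: trailing "?" means the value may be null
  list: boolean; // assumption: trailing "[]" means a list-typed column
  note: string | undefined; // parenthesized remark, e.g. "hashed"
};

const COLUMN_NOTATION = /^([a-z_][a-z0-9_]*)(\[\])?(\?)?(?:\s+\((.+)\))?$/;

// Parses one entry such as "country?", "audios[]" or "url (hashed)".
function parseColumn(text: string): ColumnSpec {
  const m = COLUMN_NOTATION.exec(text.trim());
  const name = m?.[1];
  if (!m || name === undefined) {
    throw new Error(`unrecognized column notation: "${text}"`);
  }
  return { name, list: m[2] !== undefined, nullable: m[3] !== undefined, note: m[4] };
}

// The Links columns exactly as listed in this issue.
const LINK_COLUMNS: ColumnSpec[] = [
  "id", "media_id", "provider", "provider_ref", "country?",
  "url (hashed)", "quality?", "audios[]", "subtitles[]", "updated_at",
].map(parseColumn);

// Names of non-nullable columns that are absent or null in a row.
function missingRequired(row: Record<string, unknown>, columns: ColumnSpec[]): string[] {
  return columns
    .filter((c) => !c.nullable && (row[c.name] === undefined || row[c.name] === null))
    .map((c) => c.name);
}
```

The same `parseColumn` call would work for the Media and Providers lists, so a single spec table could drive both schema checks and generated documentation.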
Status
In Progress