hma-matrix

HMA (Hasher Matcher Actioner) is a tool from Meta to detect content that's been copied (or slightly modified) from sources already identified.

This repository provides Matrix-specific extensions to HMA so that (primarily) the Matrix ecosystem can deploy it in a familiar "Matrix way". Use of this repository is not required to set up an HMA instance - it just wraps up some of HMA's functionality in a form familiar to Matrix developers and server operators.

Usage

This repo provides a Docker image which is layered on top of HMA's image. Configuration is done via environment variables rather than Python (if you'd like to use/override the Python config, HMA directly is probably a better choice).

The following environment variables can be specified:

HMA_DB_HOST (default localhost) - The hostname (with port if required) for your PostgreSQL database running HMA.
HMA_DB_NAME (default hma) - The name of the database on the PostgreSQL server to use.
HMA_DB_USER (default hma) - The username to access the PostgreSQL database.
HMA_DB_PASS (no default - required) - The password for the above user.
HMA_API_KEY (no default - required if HMA_API_KEY_REQUIRED is not false) - The API key to require on requests. The UI may not function with an API key set - use HMA's Docker Compose/development setup for experimentation with the UI.
HMA_API_KEY_REQUIRED (default true) - Set this to false to disable the API key requirement. This can allow you to use the UI properly when there's no API key. Ignored when an API key is specified.
HMA_WORKER_ROLE (no default - required) - The type of functionality to enable on this particular instance. This allows for load balancing some/all aspects of HMA's operations. Multiple roles can be specified as a comma-separated list (e.g., HASHER,MATCHER) to run multiple functions in a single container. Reverse proxying and routing traffic to workers is left as an exercise for the reader 😇.

The role must be one of the following:
- HASHER - Services the /h/* API endpoints for hashing. Note that hashing can be resource intensive.
- MATCHER - Services the /m/* API endpoints for matching. The matcher can additionally have the following environment variables:
  - HMA_INDEX_CACHE_INTERVAL_SECONDS (default 30) - The interval to cache the internal index at.
- CURATOR - Services the /c/* API endpoints for managing content, banks, exchanges, etc.
- CRON - Runs scheduled tasks and builds the internal index for the matcher(s). There should be no more than one of these running at a time. The cron worker can additionally have the following environment variables:
  - HMA_FETCHER_INTERVAL_SECONDS (default 240 (4 minutes)) - The interval to fetch from exchanges at. This is set to 30 seconds in hma-matrix's compose.yaml.
  - HMA_INDEXER_INTERVAL_SECONDS (default 60 (1 minute)) - The interval to rebuild the internal index at.
- UI - Services the /ui/* endpoints. Note that the UI might not function if an API key is set. The UI worker is the only worker that's not required to run a complete HMA instance.
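
The rules above for roles and API keys can be sketched as a small pre-flight check, for example in a deployment script. This is only an illustration of the documented rules, not the logic hma-matrix itself runs at startup: the function name `check_worker_env` and its error messages are hypothetical, and it assumes role names are matched case-sensitively and that only the literal value `false` disables the API key requirement.

```python
KNOWN_ROLES = {"HASHER", "MATCHER", "CURATOR", "CRON", "UI"}


def check_worker_env(env):
    """Return (roles, auth_enabled) or raise ValueError.

    Hypothetical helper: it mirrors the documented rules only.
    """
    # Comma-separated list; tolerating spaces around names is an assumption.
    raw_roles = env.get("HMA_WORKER_ROLE", "")
    roles = [r.strip() for r in raw_roles.split(",") if r.strip()]
    if not roles:
        raise ValueError("HMA_WORKER_ROLE is required")

    unknown = sorted(set(roles) - KNOWN_ROLES)
    if unknown:
        raise ValueError("unknown role(s): " + ", ".join(unknown))

    if not env.get("HMA_DB_PASS"):
        raise ValueError("HMA_DB_PASS is required")

    api_key = env.get("HMA_API_KEY", "")
    key_required = env.get("HMA_API_KEY_REQUIRED", "true") != "false"
    if not api_key and key_required:
        raise ValueError(
            "HMA_API_KEY is required unless HMA_API_KEY_REQUIRED=false"
        )

    # A configured key always enables authentication;
    # HMA_API_KEY_REQUIRED is ignored in that case.
    return roles, bool(api_key)


# Example: a combined UI + matcher container without an API key.
example_env = {
    "HMA_WORKER_ROLE": "UI,MATCHER",
    "HMA_DB_PASS": "change-me",
    "HMA_API_KEY_REQUIRED": "false",
}
print(check_worker_env(example_env))
```

In a real script, `os.environ` could be passed in place of `example_env`.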

The above can then be provided to the hma-matrix Docker image to run a worker:

# Single role
docker run -d -e HMA_WORKER_ROLE=UI [...] -p 127.0.0.1:5100:5100 ghcr.io/matrix-org/hma-matrix:[version]

# Multiple roles in one container (comma-separated)
docker run -d -e HMA_WORKER_ROLE=UI,MATCHER [...] -p 127.0.0.1:5100:5100 ghcr.io/matrix-org/hma-matrix:[version]

See the GHCR repo for available tags/versions.

See the HMA API docs for more information on the API endpoints themselves.

Note

It's possible with this setup to specify different API keys for different functions. All workers of the same type need to use the same API key, but the hashers can use a different API key from the curators, which gives a degree of role-based access control.
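
On the client or proxy side, per-role keys mean picking the key that matches the worker a request will reach. The sketch below shows one way to do that from the path prefixes listed for each role. The helper names, the request paths, and the key values are hypothetical placeholders, and nothing here is part of hma-matrix itself.

```python
# Path prefixes come from the role list in this README.
PREFIX_TO_ROLE = {
    "/h/": "HASHER",
    "/m/": "MATCHER",
    "/c/": "CURATOR",
    "/ui/": "UI",
}


def role_for_path(path):
    """Return the worker role serving `path`, or None if no role matches."""
    for prefix, role in PREFIX_TO_ROLE.items():
        # Also accept the bare prefix without a trailing slash, e.g. "/ui".
        if path.startswith(prefix) or path == prefix.rstrip("/"):
            return role
    return None


def api_key_for_path(path, keys_by_role):
    """Pick the API key for the role that serves `path`.

    Returns None when that role has no key configured.
    """
    role = role_for_path(path)
    if role is None:
        raise ValueError("no worker role serves " + repr(path))
    return keys_by_role.get(role)


# Placeholder keys; hashers and curators deliberately differ.
keys = {"HASHER": "hasher-key", "CURATOR": "curator-key"}
print(api_key_for_path("/h/example", keys))
print(api_key_for_path("/c/example", keys))
```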

Note

If there's a config option you'd like to set (or override) from HMA directly, you can do so by prefixing the option with OMM_. For example, OMM_MAX_REMOTE_FILE_SIZE=1073741824 will set the max remote file size to 1 GiB (1024^3 bytes).
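
To make the prefix convention concrete, here is a minimal sketch of how OMM_-prefixed variables could be turned into config overrides. It assumes values are interpreted the way Flask's `Config.from_prefixed_env` does (parsed as JSON where possible, kept as a plain string otherwise); whether HMA uses exactly that mechanism is not stated in this README, and `omm_overrides` is a hypothetical name.

```python
import json


def omm_overrides(env, prefix="OMM_"):
    """Collect prefixed variables into a dict keyed by the unprefixed name."""
    overrides = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue
        try:
            # "1073741824" becomes an int, "true" becomes True, and so on.
            parsed = json.loads(value)
        except json.JSONDecodeError:
            # Anything that is not valid JSON stays a plain string.
            parsed = value
        overrides[name[len(prefix):]] = parsed
    return overrides


env = {"OMM_MAX_REMOTE_FILE_SIZE": str(1024 ** 3), "HMA_DB_NAME": "hma"}
print(omm_overrides(env))
```

Variables without the prefix, such as HMA_DB_NAME here, are left out of the result.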

For developers

To quickly set up a local HMA stack for developing applications which use HMA, the Docker Compose file from this repo can be used.

1. git clone https://github.com/matrix-org/hma-matrix.git && cd hma-matrix
2. Set the SYNAPSE_ADMIN_ACCESS_TOKEN environment variable. It only needs to be a valid token if you intend to use the Synapse Quarantined Media exchange described below.
3. docker compose up -d
4. Visit http://localhost:5100/ui

Caution

This stack is set up by default without an API key to allow use of the UI. It's recommended to set an API key in production deployments.

When API authentication is enabled, supply the API key as a Bearer token in the Authorization header:

curl -H "Authorization: Bearer ${HMA_API_KEY}" ...
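
For applications written in Python, the same Bearer header can be attached using only the standard library. This is a sketch rather than an official client: `build_request` is a hypothetical helper, the base URL matches the local development stack above, and the `/c/banks` path is an assumed example endpoint (check the HMA API docs for the real routes).

```python
import os
import urllib.request


def build_request(base_url, path, api_key=None):
    """Build a request for an HMA endpoint, adding the Bearer token if set."""
    request = urllib.request.Request(base_url.rstrip("/") + path)
    if api_key:
        request.add_header("Authorization", "Bearer " + api_key)
    return request


# HMA_API_KEY may be unset on the development stack, which runs without a key.
req = build_request(
    "http://localhost:5100", "/c/banks", os.environ.get("HMA_API_KEY")
)
print(req.full_url)

# Sending it requires a running stack:
# with urllib.request.urlopen(req) as response:
#     print(response.status)
```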

Matrix-specific extensions

This repo also contains Matrix-specific extensions for HMA, such as exchanges for importing data from homeservers.

Exchanges need to be configured and enabled from the HMA API (TODO: Link to docs when they exist upstream). Credentials to run these exchanges are stored as environment variables.

Synapse: Quarantined media exchange

Exchange API: synapse_quarantined

Credentials: An access token valid for your Synapse server's Admin API, stored in SYNAPSE_ADMIN_ACCESS_TOKEN.

API JSON template (supplied to HMA API):

{
  "admin_api_url": "https://client-server-api.example.org"
}

Important caveats:

- This exchange does hashing in-process, which can slow down fetch time. Give your cron worker some extra CPU headroom.
- This exchange does not support use of the "unquarantine" API in Synapse. Using that API will not remove media from the HMA bank.
- This exchange uses a batch size of 250 for remote media and 1000 for local media. This can mean it'll take a while to fetch all media from a server if the server has a lot of quarantined media.
- This exchange best supports pdq and video_md5 signal types. Note that video_md5 is actually just an MD5 hash of the input file, regardless of whether it's actually a video.

Future considerations

Ideas that may or may not be implemented:

- Quarantined media import/exchange for MMR
- Maybe policy list support if we can figure out how to make that work safely?
- "flagged as spam" from policyserv/mjolnir/draupnir/meowlnir/etc

Versioning

This repository puts out new versions when HMA does and when there's functionality worth releasing, such as changes to the config.py file.

Structure: [hma-version]-matrix.[increment] where [increment] is on a per-[hma-version] basis.

Example: 1.0.21-matrix.2 is HMA v1.0.21, second release by this repo in the v1.0.21 series.
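
If tooling needs to compare or pin these versions, the structure can be parsed with a short regular expression. The sketch below assumes the HMA part is a dotted numeric version with no leading "v" and that the increment is a plain integer; `parse_version` is a hypothetical helper, not something shipped in this repo.

```python
import re

# [hma-version]-matrix.[increment]
VERSION_RE = re.compile(r"^(?P<hma>\d+(?:\.\d+)*)-matrix\.(?P<increment>\d+)$")


def parse_version(tag):
    """Split a tag into (hma_version, increment), e.g. for sorting releases."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError("not an hma-matrix version: " + repr(tag))
    return match.group("hma"), int(match.group("increment"))


hma_version, increment = parse_version("1.0.21-matrix.2")
print(hma_version, increment)
```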
