HMA (Hasher Matcher Actioner) is a tool from Meta to detect content that's been copied (or slightly modified) from sources already identified.
This repository provides a Matrix-specific extensions to HMA for (primarily) the Matrix ecosystem to benefit from in a familiar "Matrix way" of deploying applications. Use of this repository is not required to set up an HMA instance - it just wraps up some of the functionality to be familiar to Matrix developers/server operators.
This repo provides a Docker image which is layered on top of HMA's image. Configuration is done via environment variables rather than Python (if you'd like to use/override the Python config, HMA directly is probably a better choice).
The following environment variables can be specified:
-
HMA_DB_HOST(defaultlocalhost) - The hostname (with port if required) for your PostgreSQL database running HMA. -
HMA_DB_NAME(defaulthma) - The name of the database on the PostgreSQL server to use. -
HMA_DB_USER(defaulthma) - The username to access the PostgreSQL database. -
HMA_DB_PASS(no default - required) - The password for the above user. -
HMA_API_KEY(no default - required ifHMA_API_KEY_REQUIREDis notfalse) - The API key to require on requests. The UI may not function with an API key set - use HMA's Docker Compose/development setup for experimentation with the UI. -
HMA_API_KEY_REQUIRED(defaulttrue) - Set this tofalseto disable the API key requirement. This can allow you to use the UI properly when there's no API key. Ignored when an API key is specified. -
HMA_WORKER_ROLE(no default - required) - The type of functionality to enable on this particular instance. This allows for load balancing some/all aspects of HMA's operations. Reverse proxying and routing traffic to workers is left as an exercise for the reader 😇.The role must be one of the following:
HASHER- Services the/h/*API endpoints for hashing. Note that hashing can be resource intensive.MATCHER- Services the/m/*API endpoints for matching. The matcher can additionally have the following environment variables:HMA_INDEX_CACHE_INTERVAL_SECONDS(default30) - The interval to cache the internal index at.
CURATOR- Services the/c/*API endpoints for managing content, banks, exchanges, etc.CRON- Runs scheduled tasks and builds the internal index for the matcher(s). There should be no more than one of these running at a time. The cron worker can additionally have the following environment variables:HMA_FETCHER_INTERVAL_SECONDS(default240(4 minutes)) - The interval to fetch from exchanges at. This is set to30seconds in hma-matrix'scompose.yaml.HMA_INDEXER_INTERVAL_SECONDS(default60(1 minute)) - The interval to rebuild the internal index at.
UI- Services the/ui/*endpoints. Note that the UI might not function if an API key is set. The UI worker is the only worker that's not required to run a complete HMA instance.
The above can then be provided to the hma-matrix Docker image to run a worker:
docker run -d -e HMA_WORKER_ROLE=UI [...] -p 127.0.0.1:5100:5100 ghcr.io/matrix-org/hma-matrix:[version]See the GHCR repo for available tags/versions.
See the HMA API docs for more information on the API endpoints themselves.
Note
It's possible with this setup to specify different API keys for different functions. All of the same type of worker will need to be using the same API key, but the hashers can use a different API key from the curators for an amount of role-based access control.
Note
If there's a config option you'd like to set (or override) from HMA directly, you can do so by prefixing the option with OMM_. For example, OMM_MAX_REMOTE_FILE_SIZE=1073741824 will set the max remote file size to 1gb.
To quickly set up a local HMA stack for developing applications which use HMA, the Docker Compose file from this repo can be used.
git clone https://github.com/matrix-org/hma-matrix.git && cd hma-matrix- Set the
SYNAPSE_ADMIN_ACCESS_TOKENenvironment variable. It only needs to be a valid token if you intend to use the Synapse Quarantined Media exchange described below. docker compose up -d- Visit
http://localhost:5100/ui
Caution
This stack is set up by default without an API key to allow use of the UI. It's recommended to set an API key in production deployments.
When API authentication is enabled, supply the API key as a Bearer token in the Authorization header:
curl -h "Authorization: Bearer ${HMA_API_KEY}" ...This repo also contains Matrix-specific extensions for HMA, such as exchanges for importing data from homeservers.
Exchanges need to be configured and enabled from the HMA API (TODO: Link to docs when they exist upstream). Credentials to run these exchanges are stored as environment variables.
Exchange API: synapse_quarantined
Credentials: An access token valid for your Synapse server's Admin API, stored in SYNAPSE_ADMIN_ACCESS_TOKEN.
API JSON template (supplied to HMA API):
{
"admin_api_url": "https://client-server-api.example.org"
}Important caveats:
- This exchange does hashing in-process, which can slow down fetch time. Give your
cronworker some extra CPU headroom. - This exchange does not support use of the "unquarantine" API in Synapse. Using that API will not remove media from the HMA bank.
- This exchange uses a batch size of 250 for remote media and 1000 for local media. This can mean it'll take a while to fetch all media from a server if the server has a lot of quarantined media.
- This exchange best supports
pdqandvideo_md5signal types. Note thatvideo_md5is actually just an MD5 hash of the input file, regardless of whether it's actually a video.
Ideas that may or may not be implemented:
- Quarantined media import/exchange for MMR
- Maybe policy list support if we can figure out how to make that work safely?
- "flagged as spam" from policyserv/mjolnir/draupnir/meowlnir/etc
This repository puts out new versions when HMA does and when there's functionality worth releasing, such as changes to the config.py file.
Structure: [hma-version]-matrix.[increment] where [increment] is on a per-[hma-version] basis.
Example: 1.0.21-matrix.2 is HMA v1.0.21, second release by this repo in the v1.0.21 series.