Refactor Grafana dashboard to use `server_name` label (#19337)

MadLittleMods merged 16 commits into `develop`.

Conversation
@@ -195,7 +195,7 @@
      "datasource": {
        "uid": "${DS_PROMETHEUS}"
      },
-     "expr": "sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le)",
+     "expr": "sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',server_name=\"$server_name\",code=~\"2..\"}[$bucket_size])) by (le)",
Updating all of the `synapse_xxx` server-level metrics to use `$server_name` instead of `$instance`.
-     index: 2
- - targets: ["my.workerserver.here:port"]
-   labels:
-     instance: "my.server"
No longer needed as the recommendation is to rely on the server_name label
This helps when scraping from the same `instance`. You can see that we already did
this kind of thing for `synapse_storage_events_stale_forward_extremities_persisted_bucket`
and `synapse_http_httppusher_http_pushes_processed_total`.
Example data this helps with:
```
synapse_server_name_to_instance_mapping{instance="host.docker.internal:33074"}
synapse_server_name_to_instance_mapping{index="2", instance="host.docker.internal:33074", job="event_persister", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="federation_reader", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="user_dir", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="background_worker", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="event_persister", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="media_repository", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="main", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="federation_inbound", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="stream_writers", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="client_reader", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="pusher", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="event_creator", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="device_lists", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="appservice", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="2", instance="host.docker.internal:33074", job="device_lists", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="synchrotron", server_name="hs1"} 1
synapse_server_name_to_instance_mapping{index="1", instance="host.docker.internal:33074", job="federation_sender", server_name="hs1"} 1
```
```
process_cpu_seconds_total{instance="host.docker.internal:33074"}
process_cpu_seconds_total{index="2", instance="host.docker.internal:33074", job="event_persister"} 1.68
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="federation_reader"} 1.08
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="user_dir"} 0.97
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="background_worker"} 1.02
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="event_persister"} 1.45
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="media_repository"} 0.69
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="main"} 2.5300000000000002
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="federation_inbound"} 1.11
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="stream_writers"} 1.33
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="client_reader"} 0.94
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="pusher"} 0.67
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="event_creator"} 1.25
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="device_lists"} 0.8099999999999999
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="appservice"} 0.69
process_cpu_seconds_total{index="2", instance="host.docker.internal:33074", job="device_lists"} 0.76
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="synchrotron"} 2.21
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="federation_sender"} 1.3299999999999998
```
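To make the ambiguity concrete, here's a small stdlib-Python sketch (using a few of the `process_cpu_seconds_total` samples from the dump above) showing that grouping by `instance` alone collapses every worker into one bucket, while the `(instance, job, index)` triple identifies each worker uniquely:

```python
import re

# A few of the sample series from the dump above (Prometheus exposition format).
SAMPLES = """\
process_cpu_seconds_total{index="2", instance="host.docker.internal:33074", job="event_persister"} 1.68
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="event_persister"} 1.45
process_cpu_seconds_total{index="1", instance="host.docker.internal:33074", job="main"} 2.53
"""

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

def parse_labels(s: str) -> dict:
    """Parse `k="v", k2="v2"` label pairs into a dict."""
    labels = {}
    for pair in s.split(", "):
        key, _, value = pair.partition("=")
        labels[key] = value.strip('"')
    return labels

series = []
for line in SAMPLES.splitlines():
    m = LINE_RE.match(line)
    series.append((m["name"], parse_labels(m["labels"]), float(m["value"])))

# Grouping by `instance` alone cannot tell the workers apart...
by_instance = {labels["instance"] for _, labels, _ in series}
# ...but the (instance, job, index) triple identifies each worker uniquely.
by_full_key = {(labels["instance"], labels["job"], labels["index"])
               for _, labels, _ in series}

assert len(by_instance) == 1   # all three workers share one scrape target
assert len(by_full_key) == 3   # one entry per worker
```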
contrib/grafana/synapse.json
@@ -541,7 +541,7 @@
      "datasource": {
        "uid": "${DS_PROMETHEUS}"
      },
-     "expr": "rate(process_cpu_seconds_total{instance=\"$instance\",job=~\"$job\",index=~\"$index\"}[$bucket_size])",
+     "expr": "rate(process_cpu_seconds_total{job=~\"$job\",index=~\"$index\"}[$bucket_size]) * on (instance, job, index) group_left(server_name)\nsynapse_server_name_to_instance_mapping{server_name=\"$server_name\"}",
Updating all of the process-level metrics with this pattern:
xxx * on(instance, job, index) group_left(server_name)
synapse_server_name_to_instance_mapping{server_name="$server_name"}
With `synapse_server_name_to_instance_mapping`, we look up the `instance` that the `$server_name` lives on to match against the process-level metric.
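As a rough illustration of what that PromQL join does — a plain-Python sketch of the vector-matching semantics with hypothetical sample data, not Prometheus itself:

```python
# Left-hand side: process-level samples as (labels, value) pairs.
cpu_samples = [
    ({"instance": "host.docker.internal:33074", "job": "main", "index": "1"}, 2.53),
    ({"instance": "host.docker.internal:33074", "job": "pusher", "index": "1"}, 0.67),
]

# Right-hand side: the constant-1 mapping metric; its only job is to carry
# the extra `server_name` label for each (instance, job, index) combination.
mapping = [
    ({"instance": "host.docker.internal:33074", "job": "main", "index": "1",
      "server_name": "hs1"}, 1),
    ({"instance": "host.docker.internal:33074", "job": "pusher", "index": "1",
      "server_name": "hs1"}, 1),
]

def group_left(left, right, on, extra):
    """Rough sketch of `left * on(<on>) group_left(<extra>) right`."""
    index = {tuple(labels[k] for k in on): (labels, value)
             for labels, value in right}
    out = []
    for labels, value in left:
        hit = index.get(tuple(labels[k] for k in on))
        if hit is None:
            continue  # unmatched left-hand samples drop out, as in PromQL
        rlabels, rvalue = hit
        # Copy the `extra` labels over and multiply by the (constant 1) value.
        out.append(({**labels, **{k: rlabels[k] for k in extra}}, value * rvalue))
    return out

joined = group_left(cpu_samples, mapping,
                    on=("instance", "job", "index"), extra=("server_name",))
```

Because the right-hand value is always `1`, the multiplication leaves the left-hand values untouched; the join exists purely to attach `server_name`.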
contrib/grafana/synapse.json
@@ -53,7 +53,7 @@
    "uid": "${DS_PROMETHEUS}"
  },
  "enable": true,
- "expr": "changes(process_start_time_seconds{instance=\"$instance\",job=~\"synapse\"}[$bucket_size]) * on (instance, job) group_left(version) synapse_build_info{instance=\"$instance\",job=\"synapse\"}",
+ "expr": "(\n  changes(process_start_time_seconds{job=\"synapse\"}[$bucket_size]) * on (instance, job, index) group_left(server_name)\n  synapse_server_name_to_instance_mapping{server_name=\"$server_name\"}\n) * on (instance, job, index) group_left(version) synapse_build_info{job=\"synapse\"}",
Skip reviewing the annotation query until the end (because it's the most complicated change)
It's the same basic pattern upgrade as the other process-level metrics, but with an extra `* on ...` layer.
@@ -0,0 +1 @@
+ Refactor Grafana dashboard to use `server_name` label (instead of `instance`).
But the `instance` label actually has a special meaning, and we're abusing it by using it that way:

instance: The `<host>:<port>` part of the target's URL that was scraped.
-- prometheus.io/docs/concepts/jobs_instances#automatically-generated-labels-and-time-series
Not really material to this actual PR, but just so you know: I think that's just a 'sane default', but in my experience it's quite conventional to override it when you know better. E.g. using the blackbox exporter, you typically relabel `instance` to be something more meaningful (e.g. `target_label: instance` on https://prometheus.io/docs/guides/multi-target-exporter/).
Thanks for sharing the link and context!
I think that overriding instance to mean something else is a mistake, with vhosting being the shining example of how this falls apart otherwise. I understand how it can work out fine in simple cases though. It's all conventions anyway.
The 'sane default' outlook on this makes sense, as that is what happens by default 🤔. With the way I'm thinking about this, a more specific definition might be "URL of the server process that the metrics came from". This covers all of the complicated scenarios:

- One server per process -> `instance` points to the process
- Multiple servers per process (vhosting) with one metrics endpoint for the process -> `instance` points to the whole process
- Proxying metrics from other servers -> `instance` points to where the metrics came from
There is one layer of complexity that even this metrics setup doesn't handle yet. For example, with Synapse Pro for small hosts, if we instead wanted to pack multiple workers (for the same homeserver) into the same process (instead of multiple monolith homeservers), we wouldn't have the necessary labels to differentiate them. In addition to `server_name`, we would need to label all of the server-level metrics with `worker_name` (or `job`/`index`). (I understand this use case doesn't make sense for us, but it might for another person's setup.)

I feel like a lot of docs are missing around how to handle metrics with vhosting. I would definitely prefer to lean on conventions, but there don't seem to be any.
reivilibre
left a comment
still need to read the big JSON dump :-))
    labelnames=[SERVER_NAME_LABEL],
)
"""
Maps Synapse `server_name`s to the `instance`s they're hosted on.
So if I'm getting this right, it's not really a map as such (the value is meaningless); it's just a dummy metric we can rely on to always be there that will always have `{instance, server_name}` labels.
It seems this type of metric (labels carry data for the length of the process, value is fixed at 1) conventionally has a _info suffix, e.g. see https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#info
Would it make more sense to just call this synapse_server_names_info or something of that ilk.
I suppose in a vhosting model, when virtual-hosts are deregistered, we would remove their label from here?
so if I'm getting this right, it's not really a map as such (the value is meaningless), just that it's a dummy metric we can rely on to always be there that will always have `{instance, server_name}` labels.
I think you have the correct grasp on it. The labels do allow us to map server_name -> instance though (and "mapping" is the purpose of it) 🤷
It seems this type of metric (labels carry data for the length of the process, value is fixed at 1) conventionally has a _info suffix, e.g. see https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#info
Would it make more sense to just call this `synapse_server_names_info` or something of that ilk.
Sounds like a good practice to follow 👍
I suppose in a vhosting model, when virtual-hosts are deregistered, we would remove their label from here?
Yes. My current thinking is to remove it when we hs.shutdown() but I think this is better as a follow-up where we can describe it directly and test it.
Probably the final piece for https://github.com/element-hq/synapse-small-hosts/issues/106
Would it make more sense to just call this `synapse_server_names_info` or something of that ilk.
Renamed to synapse_server_name_info 👍
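For reference, the `_info` convention discussed above (value pinned to a constant `1`, all information carried in the labels) can be sketched in plain Python — a hypothetical minimal stand-in, not Synapse's actual `prometheus_client`-based implementation:

```python
class InfoMetric:
    """Minimal sketch of an `_info`-style metric: constant value 1, data in labels."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = labelnames
        self._children = {}

    def labels(self, **labels):
        key = tuple(labels[n] for n in self.labelnames)
        self._children[key] = 1  # the value never changes; only the label set matters
        return self

    def remove(self, **labels):
        # e.g. when a virtual host is deregistered, drop its label set
        self._children.pop(tuple(labels[n] for n in self.labelnames), None)

    def collect(self):
        return [
            (dict(zip(self.labelnames, key)), value)
            for key, value in sorted(self._children.items())
        ]

server_name_info = InfoMetric("synapse_server_name_info", ["server_name"])
server_name_info.labels(server_name="hs1")
server_name_info.labels(server_name="hs2")
server_name_info.remove(server_name="hs2")  # vhost deregistered
# collect() now yields only the hs1 series, with value 1
```

In the real setup, Prometheus typically attaches `instance`, `job`, and `index` as target labels at scrape time, so the exposed series itself only needs to carry `server_name`.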
reivilibre
left a comment
quite an interesting (if a little boilerplatey) approach, multiplying by some other constant-1 metric to match on more labels.
I think I'm getting it now, I'm happy with this.
I leave the decision around naming / following the convention or not up to you — I don't suspect it matters an awful lot and I don't know if this metric feels like a 'conventional _info metric' beyond matching the basic pattern, or not.
Thanks for the review @reivilibre 🐗
These are automatic changes from importing/exporting from Grafana 12.3.1. In order to verify that I'm not sneaking in any changes, you can follow these steps to get the same output.

Reproduction instructions:

1. Start [Grafana](https://hub.docker.com/r/grafana/grafana)
   ```
   docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
   ```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials: `admin`/`admin`)
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
1. Export the Synapse dashboard. On the dashboard page -> **Export** -> **Export as code** -> Using the **Classic** model -> Check **Export for sharing externally** -> Copy
1. Paste into `contrib/grafana/synapse.json`
1. `git status`/`git diff` to check if there is any diff

Sanity checked the dashboard itself by importing it on https://grafana.matrix.org/ (Grafana 10.4.1 according to https://grafana.matrix.org/api/health). The process-level metrics won't work because #19337 just merged and isn't on `matrix.org` yet. Also, generally, this dashboard works for me locally with the [load-tests](element-hq/synapse-rust-apps#397) I've been doing.

### Motivation

There are a few fixes I want to make to the Grafana dashboard, and it sucks having to manually translate everything back over because we have different formatting. Hopefully after this bulk change, future exports will have exactly what we want to change.
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [element-hq/synapse](https://github.com/element-hq/synapse) | minor | `1.145.0` → `1.146.0` |

---

> ⚠️ **Warning**
>
> Some dependencies could not be looked up. Check the Dependency Dashboard for more information.

---

### Release Notes

<details>
<summary>element-hq/synapse (element-hq/synapse)</summary>

### [`v1.146.0`](https://github.com/element-hq/synapse/releases/tag/v1.146.0)

[Compare Source](element-hq/synapse@v1.145.0...v1.146.0rc1)

### Synapse 1.146.0 (2026-01-27)

No significant changes since 1.146.0rc1.

#### Deprecations and Removals

- [MSC2697](matrix-org/matrix-spec-proposals#2697) (Dehydrated devices) has been removed, as the MSC is closed. Developers should migrate to [MSC3814](matrix-org/matrix-spec-proposals#3814). ([#19346](element-hq/synapse#19346))
- Support for Ubuntu 25.04 (Plucky Puffin) has been dropped. Synapse no longer builds debian packages for Ubuntu 25.04.

### Synapse 1.146.0rc1 (2026-01-20)

#### Features

- Add a new config option [`enable_local_media_storage`](https://element-hq.github.io/synapse/latest/usage/configuration/config_documentation.html#enable_local_media_storage) which controls whether media is additionally stored locally when using configured `media_storage_providers`. Setting this to `false` allows off-site media storage without a local cache. Contributed by Patrice Brend'amour [@dr.allgood](https://github.com/dr.allgood). ([#19204](element-hq/synapse#19204))
- Stabilise support for [MSC4312](matrix-org/matrix-spec-proposals#4312): `m.oauth` User-Interactive Auth stage for resetting cross-signing identity with the OAuth 2.0 API. The old, unstable name (`org.matrix.cross_signing_reset`) is now deprecated and will be removed in a future release. ([#19273](element-hq/synapse#19273))
- Refactor Grafana dashboard to use `server_name` label (instead of `instance`). ([#19337](element-hq/synapse#19337))

#### Bugfixes

- Fix joining a restricted v12 room locally when no local room creator is present but local users with sufficient power levels are. Contributed by [@nexy7574](https://github.com/nexy7574). ([#19321](element-hq/synapse#19321))
- Fixed parallel calls to `/_matrix/media/v1/create` being ratelimited for appservices even if `rate_limited: false` was set in the registration. Contributed by [@tulir](https://github.com/tulir) @ Beeper. ([#19335](element-hq/synapse#19335))
- Fix a bug introduced in 1.61.0 where a user's membership in a room was accidentally ignored when considering access to historical state events in rooms with the "shared" history visibility. Contributed by Lukas Tautz. ([#19353](element-hq/synapse#19353))
- [MSC4140](matrix-org/matrix-spec-proposals#4140): Store the JSON content of scheduled delayed events as text instead of a byte array. This fixes the inability to schedule a delayed event with non-ASCII characters in its content. ([#19360](element-hq/synapse#19360))
- Always rollback database transactions when retrying (avoid orphaned connections). ([#19372](element-hq/synapse#19372))
- Fix `InFlightGauge` typing to allow upgrading to `prometheus_client` 0.24. ([#19379](element-hq/synapse#19379))

#### Updates to the Docker image

- Add [Prometheus HTTP service discovery](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config) endpoint for easy discovery of all workers when using the `docker/Dockerfile-workers` image (see the [*Metrics* section of our Docker testing docs](docker/README-testing.md#metrics)). ([#19336](element-hq/synapse#19336))

#### Improved Documentation

- Remove docs on legacy metric names (no longer in the codebase since 2022-12-06). ([#19341](element-hq/synapse#19341))
- Clarify how the estimated value of room complexity is calculated internally. ([#19384](element-hq/synapse#19384))

#### Internal Changes

- Add an internal `cancel_task` API to the task scheduler. ([#19310](element-hq/synapse#19310))
- Tweak docstrings and signatures of `auth_types_for_event` and `get_catchup_room_event_ids`. ([#19320](element-hq/synapse#19320))
- Replace usage of deprecated `assertEquals` with `assertEqual` in unit test code. ([#19345](element-hq/synapse#19345))
- Drop support for Ubuntu 25.04 'Plucky Puffin', add support for Ubuntu 25.10 'Questing Quokka'. ([#19348](element-hq/synapse#19348))
- Revert "Add an Admin API endpoint for listing quarantined media ([#19268](element-hq/synapse#19268))". ([#19351](element-hq/synapse#19351))
- Bump `mdbook` from 0.4.17 to 0.5.2 and remove our custom table-of-contents plugin in favour of the new default functionality. ([#19356](element-hq/synapse#19356))
- Replace deprecated usage of PyGitHub's `GitRelease.title` with `.name` in release script. ([#19358](element-hq/synapse#19358))
- Update the Element logo in Synapse's README to be an absolute URL, allowing it to render on other sites (such as PyPI). ([#19368](element-hq/synapse#19368))
- Apply minor tweaks to v1.145.0 changelog. ([#19376](element-hq/synapse#19376))
- Update Grafana dashboard syntax to use the latest from importing/exporting with Grafana 12.3.1. ([#19381](element-hq/synapse#19381))
- Warn about skipping reactor metrics when using unknown reactor type. ([#19383](element-hq/synapse#19383))
- Add support for reactor metrics with the `ProxiedReactor` used in worker Complement tests. ([#19385](element-hq/synapse#19385))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3533
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
# Famedly Synapse Release v1.146.0_1

Depends on: famedly/complement#10

## Famedly additions for v1.146.0_1

- feat: trigger CI actions (that are triggered on PRs) in merge queue (FrenchGithubUser)

### Notes for Famedly

#### Deprecations and Removals

- matrix-org/matrix-spec-proposals#2697 (Dehydrated devices) has been removed, as the MSC is closed. Developers should migrate to matrix-org/matrix-spec-proposals#3814. (element-hq/synapse#19346)
- Support for Ubuntu 25.04 (Plucky Puffin) has been dropped. Synapse no longer builds debian packages for Ubuntu 25.04.

#### Updates to the Docker image

- Add [Prometheus HTTP service discovery](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config) endpoint for easy discovery of all workers when using the `docker/Dockerfile-workers` image (see the [Metrics section of our Docker testing docs](https://github.com/famedly/synapse/pull/docker/README-testing.md#metrics)). (element-hq/synapse#19336)

#### Features

- Add a new config option [`enable_local_media_storage`](https://element-hq.github.io/synapse/latest/usage/configuration/config_documentation.html#enable_local_media_storage) which controls whether media is additionally stored locally when using configured `media_storage_providers`. Setting this to `false` allows off-site media storage without a local cache. Contributed by Patrice Brend'amour @dr.allgood. (element-hq/synapse#19204)
- Stabilise support for matrix-org/matrix-spec-proposals#4312: `m.oauth` User-Interactive Auth stage for resetting cross-signing identity with the OAuth 2.0 API. The old, unstable name (`org.matrix.cross_signing_reset`) is now deprecated and will be removed in a future release. (element-hq/synapse#19273)
- Refactor Grafana dashboard to use `server_name` label (instead of `instance`). (element-hq/synapse#19337)
Prometheus recording rules are no longer necessary (see element-hq/synapse#19133); the `instance` label has been removed in favor of the builtin `server_name` label (element-hq/synapse#19337).

https://github.com/element-hq/synapse/blob/v1.147.1/contrib/grafana/synapse.json

Commits: https://github.com/element-hq/synapse/commits/v1.147.1/contrib/grafana/synapse.json
Refactor Grafana dashboard to use `server_name` label:

- Update the `synapse_xxx` (server-level) metrics to use `server_name="$server_name"` instead of `instance="$instance"`
- Add a `synapse_server_name_info` metric to map Synapse `server_name`s to the `instance`s they're hosted on.
- Update the process-level metrics with the pattern: `xxx * on (instance, job, index) group_left(server_name) synapse_server_name_info{server_name="$server_name"}`

All of the changes here are backwards compatible with whatever people were doing before with their Prometheus/Grafana dashboards.
Previously, the recommendation was to use the `instance` label to group everything under the same server (`synapse/docs/metrics-howto.md`, lines 93 to 147 in 803e4b4).
But the `instance` label actually has a special meaning, and we're abusing it by using it that way. Since #18592 (Synapse v1.139.0), we now have the `server_name` label to use instead.

Additionally, the assumption that a single process is serving a single server is no longer true with Synapse Pro for small hosts.
Part of https://github.com/element-hq/synapse-small-hosts/issues/106
Motivating use case
Although this change also benefits Synapse Pro for small hosts (https://github.com/element-hq/synapse-small-hosts/issues/106), it actually spawned from adding Prometheus metrics to our workerized Docker image (#19324, #19336) with a more correct label setup (without `instance`) and wanting the dashboard to be better.

Testing strategy
- Allow Docker containers to reach the host machine (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
  - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
  - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
- Build and run Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile .` (docs)
- Run Prometheus (`prometheus.yml`) and check that it is scraping Synapse (e.g. query `synapse_build_info`)
- Run Grafana (credentials: `admin`/`admin`), add a Prometheus data source pointing at `http://host.docker.internal:9090`, and import `contrib/grafana/synapse.json`
- To test workers, you can use the testing strategy from #19336 (assumes both changes from this PR and the other PR are combined)
Dev notes
How to stress the `deploys` annotation:

- `docker exec -it synapse /bin/bash`
- `vim /usr/local/lib/python3.13/site-packages/synapse/util/__init__.py` and edit `SYNAPSE_VERSION`
- `docker stop synapse`
- `docker start synapse`

Todo
- `server_name` variable to dashboard
- `process_metrics`
- `deploys` annotation

Pull Request Checklist