[BugFix][Metrics] Fix Prometheus Multiprocess Metrics Issues and Add ZMQ Communication Metrics #5185
Conversation

Thanks for your contribution!
fastdeploy/__init__.py
Outdated

import typing

# first import prometheus setup to set PROMETHEUS_MULTIPROC_DIR
# otherwise multi-process mode cannot be configured correctly, because the
# Prometheus package would already have been imported
Pull request overview
This PR addresses critical bugs in Prometheus metrics collection for multi-process environments and adds ZMQ communication observability metrics.
- Fixes metric aggregation failures in multi-process mode by separating Gauge metrics (read from memory) from Counter/Histogram metrics (read from shared filesystem)
- Corrects initialization order by setting PROMETHEUS_MULTIPROC_DIR before the Prometheus client loads in __init__.py
- Adds comprehensive ZMQ metrics (fastdeploy:zmq:*) to monitor message throughput, failures, and latency
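The Gauge-vs-Counter split described above can be sketched with prometheus_client. This is a minimal illustration under assumed names (`num_requests_running`, `requests_total`, `collect_metrics`), not the PR's actual MetricsManager code; a full implementation would also filter Gauge families out of the aggregated output to avoid duplicates.

```python
import os
import tempfile

# Multiprocess mode must be configured before prometheus_client is imported.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp())

from prometheus_client import CollectorRegistry, Counter, Gauge, generate_latest
from prometheus_client import multiprocess

# Gauges stay in a process-local registry: aggregating a Gauge across
# processes is ambiguous (last write? max? sum?), so it is read from memory.
gauge_registry = CollectorRegistry()
num_requests_running = Gauge(
    "num_requests_running", "In-flight requests", registry=gauge_registry
)

# Counters (and Histograms) are persisted to per-process value files under
# PROMETHEUS_MULTIPROC_DIR, so a collector can sum them across processes.
requests_total = Counter("requests_total", "Total requests")
requests_total.inc()

def collect_metrics() -> bytes:
    # Counter/Histogram samples: aggregate the shared value files on disk.
    agg = CollectorRegistry()
    multiprocess.MultiProcessCollector(agg)
    # Gauge samples: read from this process only.
    return generate_latest(agg) + generate_latest(gauge_registry)
```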
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| fastdeploy/__init__.py | Sets up Prometheus multiprocess directory early to ensure proper initialization before client loads |
| fastdeploy/metrics/prometheus_multiprocess_setup.py | New module to handle Prometheus multiprocess directory setup with user environment variable prioritization |
| fastdeploy/metrics/metrics.py | Refactored metric collection logic to properly handle multi-process aggregation; added ZMQ, HTTP, and server metrics; separated Gauge metrics for correct handling |
| fastdeploy/metrics/metrics_middleware.py | New middleware to track HTTP request metrics (requests total, duration) |
| fastdeploy/metrics/stats.py | New dataclass to hold ZMQ metrics statistics |
| fastdeploy/inter_communicator/zmq_server.py | Added ZMQ metrics collection with message wrapping for latency tracking |
| fastdeploy/inter_communicator/zmq_client.py | Added ZMQ metrics collection with message wrapping for latency tracking |
| fastdeploy/entrypoints/openai/api_server.py | Simplified metrics endpoint and integrated PrometheusMiddleware; removed redundant setup calls |
| fastdeploy/entrypoints/openai/utils.py | Added ZMQ metrics recording for dealer connections |
| fastdeploy/entrypoints/openai/serving_chat.py | Updated to use main_process_metrics instead of deprecated work_process_metrics |
| fastdeploy/entrypoints/engine_client.py | Updated to use main_process_metrics instead of deprecated work_process_metrics |
| fastdeploy/splitwise/internal_adapter_utils.py | Simplified metrics collection call by removing unused parameters |
| fastdeploy/metrics/work_metrics.py | Removed deprecated file; metrics moved to main MetricsManager |
| tests/metrics/test_prometheus_multiprocess_setup.py | New test suite for multiprocess setup logic |
| tests/metrics/test_metrics_middleware.py | New test suite for HTTP metrics middleware |
| tests/metrics/test_metrics.py | Updated test to reflect simplified metrics API |
| tests/entrypoints/openai/test_metrics_routes.py | Removed obsolete test for deprecated setup function |
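Several rows above mention "message wrapping for latency tracking". A minimal sketch of the envelope format visible in the diff (`{"__meta": {"send_ts": ...}, "data": ...}`), using plain `pickle` in place of `ForkingPickler` and hypothetical helper names:

```python
import pickle
import time

def wrap_message(payload) -> bytes:
    # Sender side: attach a send timestamp so the receiver can compute latency.
    return pickle.dumps({"__meta": {"send_ts": time.perf_counter()}, "data": payload})

def unwrap_message(data_bytes: bytes):
    # Receiver side: legacy/unwrapped messages pass through with no latency.
    envelope = pickle.loads(data_bytes)
    meta = envelope.get("__meta") if isinstance(envelope, dict) else None
    if isinstance(meta, dict) and "send_ts" in meta:
        latency = time.perf_counter() - meta["send_ts"]
        return envelope["data"], latency
    return envelope, None
```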
_zmq_metrics_stats.msg_bytes_send_total += len(msg)

def recv_json(self):
return self.socket.send(msg, flags=flags)
Missing _ensure_socket() call before using self.socket.send(). This will cause an AttributeError if the socket hasn't been created yet. Add self._ensure_socket() at the beginning of the method, similar to recv_json() at line 89.
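The fix the comment asks for can be sketched as follows. Class and method names here are illustrative stand-ins for the reviewed client, and a stub socket replaces pyzmq:

```python
class LazySocketSender:
    """Illustrative stand-in for the reviewed client: the socket is created
    lazily, so every method touching self.socket must call _ensure_socket()."""

    def __init__(self, socket_factory):
        self._socket_factory = socket_factory
        self.socket = None

    def _ensure_socket(self):
        if self.socket is None:
            self.socket = self._socket_factory()

    def send_json(self, msg: bytes, flags: int = 0):
        self._ensure_socket()  # without this, self.socket may still be None here
        return self.socket.send(msg, flags=flags)

class StubSocket:
    """Records sent messages; a real pyzmq socket would be used in practice."""
    def __init__(self):
        self.sent = []
    def send(self, msg, flags=0):
        self.sent.append(msg)
        return len(msg)

sender = LazySocketSender(StubSocket)
sender.send_json(b'{"k": 1}')  # works even though no socket existed before
```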
if _zmq_metrics_stats is not None:
    _zmq_metrics_stats.msg_recv_total += 1
    if "zmq_send_time" in response:
        _zmq_metrics_stats.zmq_latency = time.perf_counter() - response["zmq_send_time"]
    address = dealer.transport.getsockopt(zmq.LAST_ENDPOINT)
    main_process_metrics.record_zmq_stats(_zmq_metrics_stats, address)
Unnecessary null check: _zmq_metrics_stats is guaranteed to be non-None since it's instantiated on the previous line. The condition if _zmq_metrics_stats is not None: will always be true and can be removed.
Suggested change:

-if _zmq_metrics_stats is not None:
-    _zmq_metrics_stats.msg_recv_total += 1
-    if "zmq_send_time" in response:
-        _zmq_metrics_stats.zmq_latency = time.perf_counter() - response["zmq_send_time"]
-    address = dealer.transport.getsockopt(zmq.LAST_ENDPOINT)
-    main_process_metrics.record_zmq_stats(_zmq_metrics_stats, address)
+_zmq_metrics_stats.msg_recv_total += 1
+if "zmq_send_time" in response:
+    _zmq_metrics_stats.zmq_latency = time.perf_counter() - response["zmq_send_time"]
+address = dealer.transport.getsockopt(zmq.LAST_ENDPOINT)
+main_process_metrics.record_zmq_stats(_zmq_metrics_stats, address)
envelope = ForkingPickler.loads(data_bytes)
if isinstance(envelope, dict):
    if "__meta" in envelope and "send_ts" in envelope["__meta"]:
        _zmq_metrics_stats.msg_recv_total += 1
        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
        _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
        main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
        return envelope["data"]
return envelope
Metrics are not recorded when the envelope is not a dict or doesn't have the __meta key. In these cases, the function returns early without calling record_zmq_stats(). Consider wrapping the logic in a try-finally block similar to recv_json() to ensure metrics are always recorded.
Suggested change:

-envelope = ForkingPickler.loads(data_bytes)
-if isinstance(envelope, dict):
-    if "__meta" in envelope and "send_ts" in envelope["__meta"]:
-        _zmq_metrics_stats.msg_recv_total += 1
-        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
-        _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
-        main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
-        return envelope["data"]
-return envelope
+try:
+    envelope = ForkingPickler.loads(data_bytes)
+    if isinstance(envelope, dict) and "__meta" in envelope and "send_ts" in envelope["__meta"]:
+        _zmq_metrics_stats.msg_recv_total += 1
+        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
+        _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
+        return envelope["data"]
+    else:
+        # Record metrics for malformed envelope
+        _zmq_metrics_stats.msg_recv_total += 1
+        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
+        return envelope
+finally:
+    main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
data_bytes = self.socket.recv(flags=flags)
envelope = ForkingPickler.loads(data_bytes)
if isinstance(envelope, dict):
    if "__meta" in envelope and "send_ts" in envelope["__meta"]:
        _zmq_metrics_stats.msg_recv_total += 1
        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
        _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
        main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
        return envelope["data"]
return envelope
Metrics are not recorded when the envelope is not a dict or doesn't have the __meta key. In these cases, the function returns early without calling record_zmq_stats(). Consider wrapping the logic in a try-finally block similar to recv_json() to ensure metrics are always recorded.
Suggested change:

-data_bytes = self.socket.recv(flags=flags)
-envelope = ForkingPickler.loads(data_bytes)
-if isinstance(envelope, dict):
-    if "__meta" in envelope and "send_ts" in envelope["__meta"]:
-        _zmq_metrics_stats.msg_recv_total += 1
-        _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
-        _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
-        main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
-        return envelope["data"]
-return envelope
+try:
+    data_bytes = self.socket.recv(flags=flags)
+    envelope = ForkingPickler.loads(data_bytes)
+    if isinstance(envelope, dict):
+        if "__meta" in envelope and "send_ts" in envelope["__meta"]:
+            _zmq_metrics_stats.msg_recv_total += 1
+            _zmq_metrics_stats.msg_bytes_recv_total += len(data_bytes)
+            _zmq_metrics_stats.zmq_latency = time.perf_counter() - envelope["__meta"]["send_ts"]
+            return envelope["data"]
+    return envelope
+finally:
+    main_process_metrics.record_zmq_stats(_zmq_metrics_stats, self.address)
if hasattr(main_process_metrics, "spec_decode_draft_acceptance_rate"):
    self.register_speculative_metrics(registry)
The register_all() method doesn't register ZMQ metrics even when they are initialized. When init_zmq_metrics() is called (when FD_DEBUG is enabled), the ZMQ metrics are created but they won't be registered in non-multiprocess mode (line 121). Consider adding a loop to register ZMQ_METRICS if _collect_zmq_metrics is True.
Suggested change:

# Register ZMQ metrics if they are being collected
if getattr(self, "_collect_zmq_metrics", False):
    for metric in getattr(self, "ZMQ_METRICS", []):
        registry.register(metric)
"msg_recv_total": {
    "type": Counter,
    "name": "fastdeploy:zmq:msg_recv_total",
    "description": "Total number of zmq messages recieved",
Typo in description: 'recieved' should be 'received'.
"msg_bytes_recv_total": {
    "type": Counter,
    "name": "fastdeploy:zmq:msg_bytes_recv_total",
    "description": "Total number of bytes recieved over zmq",
Typo in description: 'recieved' should be 'received'.
"""Initializes the Prometheus metrics and starts the HTTP server if not already initialized."""

# set the Prometheus environment variable at module load, before metric registration
setup_multiprocess_prometheus()
The function setup_multiprocess_prometheus() is called both in fastdeploy/__init__.py (line 33) and in MetricsManager.__init__() (line 595). This creates duplicate setup calls. Since setup_multiprocess_prometheus() is already called at module import in __init__.py, the call in MetricsManager.__init__() is redundant and should be removed to avoid confusion.
Suggested change:

-setup_multiprocess_prometheus()
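For reference, the behaviour the PR describes for prometheus_multiprocess_setup.py (user-specified value wins, random UUID fallback, harmless to call more than once) might look roughly like this sketch; the real module's code is not reproduced here:

```python
import os
import tempfile
import uuid

def setup_multiprocess_prometheus() -> str:
    """Set PROMETHEUS_MULTIPROC_DIR before prometheus_client is imported.

    A user-provided value takes priority; otherwise fall back to a random
    UUID directory. Idempotent, so a duplicate call is harmless (though a
    redundant call site should still be removed for clarity).
    """
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        path = os.path.join(tempfile.gettempdir(), f"prom_multiproc_{uuid.uuid4().hex}")
        os.makedirs(path, exist_ok=True)
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = path
    return os.environ["PROMETHEUS_MULTIPROC_DIR"]
```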
_zmq_metrics_stats.msg_bytes_send_total += len(msg)

def recv_json(self):
return self.socket.send(msg, flags=flags)
Missing _ensure_socket() call before using self.socket.send(). This will cause an AttributeError if the socket hasn't been created yet. Add self._ensure_socket() at the beginning of the method, similar to recv_json() at line 91.
test_dir = "/tmp/prom_main_test-uuid"
# create a temporary directory using tmp_path
os.makedirs(test_dir, exist_ok=True)
Hardcoded path /tmp/prom_main_test-uuid is used instead of the tmp_path fixture provided by pytest. This could cause test failures or side effects on systems where /tmp is not writable or tests run in parallel. Consider using str(tmp_path / "prom_main_test-uuid") instead.
23a31cf to a4fa504 (Compare)
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #5185   +/-   ##
=========================================
  Coverage         ?   59.84%
=========================================
  Files            ?      319
  Lines            ?    38974
  Branches         ?     5866
=========================================
  Hits             ?    23325
  Misses           ?    13810
  Partials         ?     1839

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Motivation
This PR addresses two primary concerns: fixing data integrity issues with Prometheus metrics in multi-process environments and enhancing observability for ZMQ communication.
1. Prometheus Multi-process Issues:
- Metrics were read only from the current process, causing Counter and Histogram data from other processes to be lost.
- PROMETHEUS_MULTIPROC_DIR was being set after load_engine(). This caused the Engine process and API Server process to write to different directories (or disabled multi-process mode entirely if the Prometheus client was loaded too early).

2. Lack of ZMQ Observability:
Modifications
Prometheus Fixes:
- Gauge metrics are now exclusively read from the current process memory (as multi-process aggregation for Gauges is ambiguous).
- Counter and Histogram metrics are now correctly read from the multi-process file storage to ensure proper aggregation.
- Moved the setting of PROMETHEUS_MULTIPROC_DIR to the very beginning of __init__. This ensures the environment is configured before the Prometheus client loads, guaranteeing that both the Engine and API Server share the correct directory.
- The setup now prioritizes a user-specified PROMETHEUS_MULTIPROC_DIR environment variable. It now falls back to a random UUID directory only if the user has not specified one.

New ZMQ Metrics:
Added the fastdeploy:zmq:* metric series to monitor ZMQ performance:
- Send: msg_send_total, msg_send_failed_total, msg_bytes_send_total
- Receive: msg_recv_total, msg_bytes_recv_total
- Latency: fastdeploy:zmq:latency (Histogram)

Usage or Command
Verification:
Start the service in a multi-process environment and request the metrics endpoint.
Expected Output (ZMQ Section):
You should see the aggregated metrics and the new ZMQ entries:
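As a rough illustration of the series listed under New ZMQ Metrics, they could be declared with prometheus_client as below. The exact declarations and the `address` label are assumptions based on the PR description (`record_zmq_stats(stats, address)`), not the actual MetricsManager code:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()
labels = ["address"]  # assumed label, based on record_zmq_stats(stats, address)

msg_send_total = Counter("fastdeploy:zmq:msg_send_total",
                         "Total number of zmq messages sent", labels, registry=registry)
msg_send_failed_total = Counter("fastdeploy:zmq:msg_send_failed_total",
                                "Total number of zmq send failures", labels, registry=registry)
msg_bytes_send_total = Counter("fastdeploy:zmq:msg_bytes_send_total",
                               "Total number of bytes sent over zmq", labels, registry=registry)
msg_recv_total = Counter("fastdeploy:zmq:msg_recv_total",
                         "Total number of zmq messages received", labels, registry=registry)
msg_bytes_recv_total = Counter("fastdeploy:zmq:msg_bytes_recv_total",
                               "Total number of bytes received over zmq", labels, registry=registry)
zmq_latency = Histogram("fastdeploy:zmq:latency",
                        "ZMQ transmission latency in seconds", labels, registry=registry)

addr = "ipc:///tmp/engine.sock"  # example endpoint
msg_send_total.labels(address=addr).inc()
zmq_latency.labels(address=addr).observe(0.002)
print(generate_latest(registry).decode())
```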
Accuracy Tests
- Verified that Counter metrics correctly sum up values from multiple worker processes.
- Verified that fastdeploy:zmq:latency correctly records transmission time between the API Server and Engine.
- Verified that the setup respects a PROMETHEUS_MULTIPROC_DIR specified by the environment variable.

Checklist
[BugFix],[Metrics],[Feature]