
[TESTING][OBSERVABILITY]: Metrics Accuracy, Tracing Completeness, and Dashboard Validation #2476

@crivetimihai

Description


Goal

Produce a comprehensive manual test plan for validating that the observability stack provides accurate and complete operational visibility, including Prometheus metrics, distributed tracing, log correlation, and Grafana dashboards.

Why Now?

Observability is essential for production operations:

  1. Incident Response: Accurate metrics enable fast debugging
  2. SLA Monitoring: Need reliable data for SLO tracking
  3. Capacity Planning: Metrics inform scaling decisions
  4. Audit Trail: Traces provide request lifecycle visibility
  5. Dashboard Reliability: Broken dashboards waste engineering time

User Stories

US-1: SRE - Metrics Accuracy

As an SRE
I want Prometheus metrics to accurately reflect system state
So that I can trust alerts and dashboards

Acceptance Criteria:

Feature: Metrics Accuracy

  Scenario: Request counter accuracy
    Given I make 100 API requests
    When I query the request counter metric
    Then the counter should increase by exactly 100
    And the increment should happen atomically

US-2: Developer - Request Tracing

As a developer
I want complete distributed traces for requests
So that I can debug issues across services

Acceptance Criteria:

Feature: Distributed Tracing

  Scenario: Full request trace
    Given a tool invocation request
    When I view the trace in Jaeger
    Then I should see spans for gateway, backend, and tool
    And parent-child relationships should be correct

Architecture

                    OBSERVABILITY STACK
+------------------------------------------------------------------------+
|                                                                        |
|   Gateway                Collectors              Visualization         |
|   -------                ----------              -------------         |
|                                                                        |
|   +-----------+         +------------+         +-------------+         |
|   |  Gateway  |-------->| Prometheus |-------->|  Grafana    |         |
|   |  /metrics |         +------------+         +-------------+         |
|   +-----------+                                                        |
|        |                                                               |
|        |                +------------+         +-------------+         |
|        +--------------->|   Jaeger   |-------->|  Jaeger UI  |         |
|        | (traces)       +------------+         +-------------+         |
|        |                                                               |
|        |                +------------+         +-------------+         |
|        +--------------->|   Loki     |-------->|  Grafana    |         |
|        | (logs)         +------------+         +-------------+         |
|                                                                        |
|   Correlation:                                                         |
|   request_id links logs <-> traces <-> metrics                         |
|                                                                        |
+------------------------------------------------------------------------+
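
The correlation shown above hinges on propagating one set of identifiers across all three signals. Assuming the gateway emits W3C Trace Context headers (the `traceparent` format used by OpenTelemetry; an assumption, not something this issue confirms), the linking trace ID can be pulled out in shell like this:

```shell
# A W3C traceparent header value has the form: version-traceid-spanid-flags.
# The sample value below is illustrative, not from a real trace.
TRACEPARENT="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# Field 2 (splitting on "-") is the 32-hex-digit trace ID that links
# logs, spans, and metric exemplars for one request.
TRACE_ID=$(echo "$TRACEPARENT" | awk -F'-' '{print $2}')
echo "$TRACE_ID"
```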

Test Environment Setup

# Environment variables
export GATEWAY_URL="http://localhost:8000"
export PROMETHEUS_URL="http://localhost:9090"
export JAEGER_URL="http://localhost:16686"
export GRAFANA_URL="http://localhost:3000"
export LOKI_URL="http://localhost:3100"   # Loki's default port; used in OBS-05
# GRAFANA_TOKEN (used in OBS-06) must be created in Grafana and exported separately

# Start observability stack
docker-compose -f docker-compose.observability.yml up -d

# Verify services
curl -s "$PROMETHEUS_URL/-/healthy"
curl -s "$JAEGER_URL/api/services" | jq '.data'
curl -s "$GRAFANA_URL/api/health" | jq '.database'

# Generate test token (assumes JWT_SECRET is already exported)
export TOKEN=$(python -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --secret "$JWT_SECRET")

Manual Test Cases

Case    Component          Validation             Expected Result
------  -----------------  ---------------------  ----------------------
OBS-01  Metrics endpoint   /metrics accessible    All metrics present
OBS-02  Counter accuracy   Request counts         Exact match
OBS-03  Histogram buckets  Latency distribution   Accurate percentiles
OBS-04  Trace propagation  Parent-child spans     Correct hierarchy
OBS-05  Log correlation    Request ID linking     Logs findable by trace
OBS-06  Dashboard queries  Grafana panels         No broken panels
OBS-07  Alerting rules     Test alerts            Fire correctly

OBS-01: Metrics Endpoint Validation

Steps:

# Fetch metrics
curl -s "$GATEWAY_URL/metrics" > /tmp/metrics.txt

# Check for required metrics
REQUIRED_METRICS=(
  "http_requests_total"
  "http_request_duration_seconds"
  "mcp_tool_invocations_total"
  "mcp_tool_duration_seconds"
  "gateway_active_connections"
  "redis_connection_pool_size"
  "database_connection_pool_size"
)

for metric in "${REQUIRED_METRICS[@]}"; do
  grep -q "$metric" /tmp/metrics.txt && \
    echo "PASS: $metric present" || \
    echo "FAIL: $metric missing"
done

Validation:

# Verify metric format (Prometheus text format)
promtool check metrics < /tmp/metrics.txt && echo "PASS: Valid Prometheus format"

Expected Result:

  • All documented metrics present
  • Metrics in valid Prometheus format
  • HELP and TYPE annotations correct
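
For reference, a well-formed exposition for one metric looks like the following; `promtool check metrics` enforces exactly this shape, including the `# HELP` and `# TYPE` lines preceding the samples. A self-contained sketch (the metric name matches the list above; the label values and help text are made up):

```shell
# Write a minimal, well-formed Prometheus text-format sample and check the
# annotations this test case expects.
cat > /tmp/sample_metrics.txt <<'EOF'
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/health",status="200"} 100
EOF

# Every exposed metric should carry both annotations
grep -q '^# HELP http_requests_total' /tmp/sample_metrics.txt && \
grep -q '^# TYPE http_requests_total counter' /tmp/sample_metrics.txt && \
echo "annotations present"
```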

OBS-02: Counter Accuracy Test

Steps:

# Get initial counter value (sum across all label sets; exposed values may be floats)
INITIAL=$(curl -s "$GATEWAY_URL/metrics" | \
  awk 'index($0, "http_requests_total{") == 1 {sum += $2} END {print sum}')

# Make exactly 100 requests
for i in {1..100}; do
  curl -s "$GATEWAY_URL/health" > /dev/null
done

# Get final counter value
FINAL=$(curl -s "$GATEWAY_URL/metrics" | \
  awk 'index($0, "http_requests_total{") == 1 {sum += $2} END {print sum}')

# Calculate difference (awk, because float values break integer shell arithmetic).
# Note: the two /metrics scrapes above are themselves HTTP requests and may be
# counted; filter by a handler/path label if the gateway exposes one.
DIFF=$(awk -v f="$FINAL" -v i="$INITIAL" 'BEGIN {printf "%d", f - i}')
echo "Requests made: 100, Counter increment: $DIFF"
[ "$DIFF" -eq 100 ] && echo "PASS" || echo "FAIL"

Expected Result:

  • Counter increments by exactly 100
  • No missed or duplicate counts
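
A subtlety in this check: `http_requests_total` is usually split across several label sets (method, path, status), and the exposed values are float-formatted, so picking one series with `grep | head -1` undercounts and `$(( ))` chokes on the decimals. Summing every series with awk sidesteps both; a self-contained sketch on made-up sample data:

```shell
# Sum all http_requests_total series from a metrics snapshot; awk copes
# with the float-formatted values that integer shell arithmetic rejects.
sum_requests() {
  awk 'index($0, "http_requests_total{") == 1 {sum += $2} END {printf "%d\n", sum}'
}

sum_requests <<'EOF'
http_requests_total{method="GET",path="/health",status="200"} 60.0
http_requests_total{method="GET",path="/metrics",status="200"} 25.0
http_requests_total{method="POST",path="/mcp/http",status="200"} 15.0
EOF
```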

OBS-03: Histogram Accuracy Test

Steps:

# Make requests with known latency (use slow endpoint if available)
for i in {1..50}; do
  curl -s "$GATEWAY_URL/health" > /dev/null
done

# Query histogram (curl -G + --data-urlencode so PromQL special characters survive)
curl -s -G "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=http_request_duration_seconds_bucket' | jq '.data.result'

# Calculate percentiles
curl -s -G "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))' | \
  jq '.data.result[0].value[1]'

Expected Result:

  • Histogram buckets populated correctly
  • P95 calculation returns reasonable value
  • Bucket boundaries match documented values
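
To judge whether the returned P95 is "reasonable", it helps to know what `histogram_quantile` actually computes: it finds the bucket where the target rank falls and linearly interpolates inside it. A self-contained awk sketch with made-up cumulative bucket counts (the `+Inf` bucket and rate() windowing are omitted for brevity):

```shell
# Each input line: upper bucket bound (le), cumulative count.
# With 100 observations, the P95 rank is 95; it falls in the (0.25, 0.5]
# bucket, and linear interpolation inside that bucket gives the estimate.
awk '
  { le[NR] = $1; cum[NR] = $2; n = NR }
  END {
    total = cum[n]; rank = 0.95 * total
    for (i = 1; i <= n; i++) {
      if (cum[i] >= rank) {
        lower = (i == 1) ? 0 : le[i-1]
        below = (i == 1) ? 0 : cum[i-1]
        in_bucket = cum[i] - below
        est = lower + (rank - below) / in_bucket * (le[i] - lower)
        printf "p95 ~= %.3f\n", est
        exit
      }
    }
  }' <<'EOF'
0.1 50
0.25 90
0.5 100
EOF
```

With these numbers the estimate is 0.25 + (95-90)/10 * 0.25 = 0.375, which is the same arithmetic Prometheus performs per bucket.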

OBS-04: Trace Propagation Test

Steps:

# Make request and capture trace ID
RESPONSE=$(curl -s -D /tmp/headers.txt -X POST "$GATEWAY_URL/mcp/http" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {"name": "test-tool", "arguments": {}}}')

# Extract trace ID from response headers (strip the CR that HTTP header lines carry)
TRACE_ID=$(grep -i "^traceparent:" /tmp/headers.txt | tr -d '\r' | awk -F'-' '{print $2}')
echo "Trace ID: $TRACE_ID"

# Query Jaeger for trace
curl -s "$JAEGER_URL/api/traces/$TRACE_ID" | jq '.data[0].spans | length'

Validation:

# Verify span hierarchy
curl -s "$JAEGER_URL/api/traces/$TRACE_ID" | jq '.data[0].spans[] | {operationName, spanID, references}'

Expected Result:

  • Trace ID present in response headers
  • Multiple spans in trace (gateway -> backend -> tool)
  • Parent-child references correct
  • All spans have same trace ID
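
The parent-child validation can be mechanized: every span's parent reference must name a span ID that exists in the same trace, with exactly the root carrying none. The logic, sketched in awk over a simplified span list (made-up span IDs; real Jaeger JSON would first be flattened to these pairs with jq):

```shell
# Each line: spanID parentID ("-" marks the root span).
# The check fails if any parent reference points at an unknown span.
awk '
  { id[$1] = 1; parent[NR] = $2; n = NR }
  END {
    ok = 1
    for (i = 1; i <= n; i++)
      if (parent[i] != "-" && !(parent[i] in id)) ok = 0
    print (ok ? "hierarchy OK" : "dangling parent reference")
  }' <<'EOF'
span-gateway -
span-backend span-gateway
span-tool span-backend
EOF
```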

OBS-05: Log Correlation Test

Steps:

# Make request with known request ID
REQUEST_ID="test-correlation-$(date +%s)"
curl -s -X POST "$GATEWAY_URL/mcp/http" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Request-ID: $REQUEST_ID" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}'

# Search logs for request ID
docker logs mcpgateway 2>&1 | grep "$REQUEST_ID"

# Or query Loki (requires LOKI_URL; URL-encode the LogQL query)
curl -s -G "$LOKI_URL/loki/api/v1/query" \
  --data-urlencode "query={job=\"mcpgateway\"} |= \"$REQUEST_ID\"" | jq '.data.result'

Expected Result:

  • Request ID appears in all log entries for that request
  • Can find logs by request ID
  • Log entries include trace ID for correlation
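
Assuming the gateway emits structured JSON logs with `request_id` and `trace_id` fields (an assumption; this issue does not pin down the log format), the correlation check reduces to: every line carrying the request ID must also carry the same trace ID. A self-contained grep sketch on made-up sample lines:

```shell
# Sample structured log lines (fabricated); real input would come from
# `docker logs mcpgateway` or a Loki query as shown above.
cat > /tmp/sample_logs.txt <<'EOF'
{"level":"info","request_id":"test-correlation-1700000000","trace_id":"4bf92f35","msg":"request received"}
{"level":"info","request_id":"test-correlation-1700000000","trace_id":"4bf92f35","msg":"tools/list handled"}
{"level":"info","request_id":"other-request","trace_id":"deadbeef","msg":"unrelated"}
EOF

REQUEST_ID="test-correlation-1700000000"
MATCHES=$(grep -c "\"request_id\":\"$REQUEST_ID\"" /tmp/sample_logs.txt)
TRACES=$(grep "\"request_id\":\"$REQUEST_ID\"" /tmp/sample_logs.txt | \
  grep -c '"trace_id":"4bf92f35"')
echo "log lines: $MATCHES, with trace_id: $TRACES"
[ "$MATCHES" -eq "$TRACES" ] && echo "correlated"
```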

OBS-06: Dashboard Validation

Steps:

# List all dashboards
curl -s "$GRAFANA_URL/api/search?type=dash-db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" | jq '.[].title'

# Test each dashboard for broken queries
DASHBOARDS=$(curl -s "$GRAFANA_URL/api/search?type=dash-db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" | jq -r '.[].uid')

for uid in $DASHBOARDS; do
  echo "Testing dashboard: $uid"
  # Get dashboard JSON
  DASHBOARD=$(curl -s "$GRAFANA_URL/api/dashboards/uid/$uid" \
    -H "Authorization: Bearer $GRAFANA_TOKEN")

  # Extract and test queries (-r strips JSON quoting; URL-encode the PromQL)
  echo "$DASHBOARD" | jq -r '.dashboard.panels[]?.targets[]?.expr // empty' 2>/dev/null | \
  while read -r query; do
    [ -n "$query" ] && curl -s -G "$PROMETHEUS_URL/api/v1/query" \
      --data-urlencode "query=$query" | jq '.status'
  done
done

Expected Result:

  • All dashboards load without errors
  • All panel queries return data
  • No "No data" or error states

OBS-07: Alerting Rules Test

Steps:

# List configured alerts
curl -s "$PROMETHEUS_URL/api/v1/rules" | jq '.data.groups[].rules[] | {name, state}'

# Trigger a test alert (e.g., stop the gateway to fire InstanceDown)
docker stop mcpgateway
sleep 60  # wait at least one scrape interval plus the alert rule's "for" duration

# Check for firing alert
curl -s "$PROMETHEUS_URL/api/v1/alerts" | jq '.data.alerts[] | {alertname, state}'

# Restart gateway
docker start mcpgateway

Expected Result:

  • Alerts fire when conditions met
  • Alert resolves when condition clears
  • Alert manager receives notifications (if configured)

Test Matrix

Component   Metric/Feature        Validation Method   Pass Criteria
----------  --------------------  ------------------  -------------------
Metrics     http_requests_total   Counter accuracy    Exact match
Metrics     latency histograms    Percentile calc     P95 accurate
Tracing     Span propagation      Jaeger query        Full trace
Tracing     Context headers       Header inspection   traceparent present
Logging     Request ID            Log search          ID in all logs
Dashboards  All panels            Query validation    No errors
Alerting    Instance down         Stop service        Alert fires

Success Criteria

  • All documented metrics present in /metrics endpoint
  • Request counter accuracy verified (100%)
  • Histogram percentiles match actual latency distribution
  • Distributed traces capture full request lifecycle
  • Log correlation works via request ID
  • All Grafana dashboard panels show data
  • Alerting rules fire correctly under test conditions

Related Files

  • mcpgateway/middleware/metrics.py - Metrics middleware
  • mcpgateway/middleware/tracing.py - Tracing middleware
  • mcpgateway/utils/logging.py - Structured logging
  • deployment/grafana/dashboards/ - Dashboard definitions
  • deployment/prometheus/rules/ - Alerting rules

Related Issues

Labels

  • MUSTP1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • manual-testing: Manual testing / test planning issues
  • observability: Observability, logging, monitoring
  • ready: Validated, ready-to-work-on items
  • testing: Testing (unit, e2e, manual, automated, etc)
