Add testing readiness review and action plan #2

Merged
Chris0Jeky merged 1 commit into main from
codex/review-documents-and-improve-testing-strategy
Nov 18, 2025

Conversation

@Chris0Jeky
Owner

Summary

  • add TESTING_READINESS.md outlining current testing posture
  • document quickstart smoke steps and prioritized testing actions
  • propose medium-term CI and quality gate improvements

Testing

  • not run (documentation-only change)

Codex Task

Chris0Jeky merged commit cfabdd8 into main on Nov 18, 2025
Chris0Jeky deleted the codex/review-documents-and-improve-testing-strategy branch on November 18, 2025 at 03:44
Chris0Jeky added a commit that referenced this pull request Feb 16, 2026
…ove-testing-strategy

Add testing readiness review and action plan
Chris0Jeky added a commit that referenced this pull request Apr 9, 2026
TryConsumeAtomicAsync now includes ExpiresAt > now in the WHERE clause
to close the TOCTOU race window between application-level expiry check
and SQL execution.

DeleteExpiredAsync now uses raw SQL instead of loading all rows into
memory (DoS prevention). Also deletes consumed codes to prevent
unbounded table growth.

Uses EF Core SQLite DateTimeOffset format for correct string comparison.

Addresses findings #2 (CRITICAL), #4 (HIGH), #6 (HIGH), #13 (LOW).
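
The commit message above describes folding the expiry check into the WHERE clause so that the check and the consume happen in one statement. A minimal sketch of that idea follows, using Python's `sqlite3` rather than the project's EF Core code. The `codes` table, its columns, and the function names are hypothetical, and the fixed-width UTC timestamp text used here is an assumption chosen so that string comparison matches time order; it is not necessarily the format EF Core writes.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def iso(ts: datetime) -> str:
    # Fixed-width UTC text, so lexicographic order equals chronological order.
    # Assumption for this sketch; not claimed to be EF Core's storage format.
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")


def make_db() -> sqlite3.Connection:
    # Hypothetical schema: one row per single-use code.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE codes ("
        " code TEXT PRIMARY KEY,"
        " expires_at TEXT NOT NULL,"
        " consumed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def try_consume(conn: sqlite3.Connection, code: str, now: datetime) -> bool:
    # Expiry and single-use checks live in the same UPDATE, so there is no
    # gap between "check" and "use" for a concurrent caller to exploit.
    cur = conn.execute(
        "UPDATE codes SET consumed = 1"
        " WHERE code = ? AND consumed = 0 AND expires_at > ?",
        (code, iso(now)),
    )
    conn.commit()
    return cur.rowcount == 1


def delete_expired(conn: sqlite3.Connection, now: datetime) -> int:
    # Set-based delete: no rows are loaded into application memory.
    # Consumed codes are removed too, so the table cannot grow without bound.
    cur = conn.execute(
        "DELETE FROM codes WHERE consumed = 1 OR expires_at <= ?",
        (iso(now),),
    )
    conn.commit()
    return cur.rowcount
```

The caller treats an affected-row count of exactly one as success; a second attempt on the same code, or an attempt on an expired code, updates zero rows and is rejected without a separate read.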
Chris0Jeky added a commit that referenced this pull request Apr 22, 2026
- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's
  dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use
  _sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with
  CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s)
  and show the full Math.Max formula
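
For readers unfamiliar with the dynamic formula quoted above, here is a small Python rendering of `Math.Max(MaxBatchSize * 20, 100)`. The function name is hypothetical, and how HealthController compares the queue depth against this value is not stated in the commit message, so only the threshold itself is sketched.

```python
def backlog_threshold(max_batch_size: int) -> int:
    # Mirrors Math.Max(MaxBatchSize * 20, 100): twenty batches' worth of
    # items, but never lower than a floor of 100.
    return max(max_batch_size * 20, 100)
```

With this formula the floor of 100 applies for any batch size of 5 or less, which is why a static alert threshold can diverge from the health endpoint's value once `Workers:MaxBatchSize` is raised.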
Chris0Jeky added a commit that referenced this pull request Apr 22, 2026
* docs: define monitoring and alerting rules (OPS-30)

Add docs/ops/ALERTING_RULES.md with 10 alert rules covering API error
rate, latency, worker heartbeat, disk, memory, queue backlog, database
connectivity, health endpoint, CPU, and Redis backplane. Each rule
specifies metric source, threshold, evaluation window, priority (P1/P2),
runbook steps, and escalation triggers.

Includes integration guidance for Grafana, AWS CloudWatch, PagerDuty,
and external uptime monitoring with example PromQL queries and Terraform
alarm definitions.

Closes #868

* docs: add ALERTING_RULES.md to ops README index

Cross-reference the new alerting rules document from the ops directory
index alongside the existing observability docs.

* docs: update OBSERVABILITY_BASELINE alert thresholds and cross-reference

Update the alert threshold baseline section to match the authoritative
thresholds in ALERTING_RULES.md and add a callout directing operators
to the comprehensive alerting rules document.

* docs: add known gaps section to alerting rules

Document three known gaps found during adversarial review:
1. OutboundWebhookDeliveryWorker not monitored by health endpoint
2. Health endpoint staleness thresholds differ from alert thresholds
3. No dedicated LLM provider error rate alert

Also clarify that Alert 3 applies to workers with OTLP metric emission
(LlmQueueToProposalWorker and ProposalHousekeepingWorker only).

* fix: correct alerting rules accuracy issues from adversarial review

- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's
  dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use
  _sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with
  CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s)
  and show the full Math.Max formula
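
The Histogram point in the list above matters because a histogram exposes cumulative `_sum` and `_count` series rather than a single current value. A hedged Python sketch of the arithmetic that a `_sum`/`_count` query performs over a window is shown below; the function and parameter names are hypothetical, and counter resets are deliberately not handled.

```python
from typing import Optional


def window_mean(sum_start: float, count_start: int,
                sum_end: float, count_end: int) -> Optional[float]:
    # Both series are cumulative, so the mean over a window is the increase
    # in _sum divided by the increase in _count across that window.
    delta_count = count_end - count_start
    if delta_count <= 0:
        # No observations in the window (or a counter reset, not handled here).
        return None
    return (sum_end - sum_start) / delta_count
```

One caveat worth keeping in mind: the result is a mean, so it can hide tail latency that a quantile computed from the bucket series would reveal.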