Merged
1 change: 1 addition & 0 deletions docs/features/index.md
@@ -146,6 +146,7 @@
- [Scroll API](opensearch/scroll-api.md)
- [Search Backpressure](opensearch/search-backpressure.md)
- [Search API Enhancements](opensearch/search-api-enhancements.md)
- [Search API Tracker](opensearch/search-api-tracker.md)
- [Search Pipeline](opensearch/search-pipeline.md)
- [Search Request Stats](opensearch/search-request-stats.md)
- [Search Scoring](opensearch/search-scoring.md)
171 changes: 171 additions & 0 deletions docs/features/opensearch/search-api-tracker.md
@@ -0,0 +1,171 @@
# Search API Tracker

## Summary

The Search API Tracker provides visibility into search request outcomes at the coordinator node level. It tracks HTTP response status codes for search operations, enabling operators to monitor partial failures in multi-search (`_msearch`) requests that would otherwise be hidden behind a successful HTTP 200 response.

This feature is particularly valuable for:
- Detecting hidden failures in `_msearch` operations
- Monitoring search API health across the cluster
- Identifying trends in user errors vs system failures
- Building alerting and dashboards based on search success rates

## Details

### Architecture

```mermaid
graph TB
    subgraph "Request Processing"
        Client[Client Request] --> RestLayer[REST Layer]
        RestLayer --> TransportSearchAction[TransportSearchAction]
        TransportSearchAction --> ShardSearch[Shard Search Operations]
    end

    subgraph "Status Tracking Layer"
        TransportSearchAction -->|ActionListener wrapper| StatusTracker[Status Tracker]
        StatusTracker -->|onResponse| SuccessCounter[Increment Success]
        StatusTracker -->|onFailure| ErrorCounter[Increment Error]
    end

    subgraph "Statistics Storage"
        SuccessCounter --> SearchResponseStatusStats[SearchResponseStatusStats]
        ErrorCounter --> SearchResponseStatusStats
        SearchResponseStatusStats --> StatusCounterStats[StatusCounterStats]
        StatusCounterStats --> IndicesService[IndicesService]
    end

    subgraph "Stats Exposure"
        IndicesService --> NodeIndicesStats[NodeIndicesStats]
        NodeIndicesStats --> NodeStatsAPI["Node Stats API<br/>GET /_nodes/stats"]
    end
```

### Data Flow

```mermaid
flowchart TB
    A[Search Request] --> B{Request Type}
    B -->|Single Search| C[TransportSearchAction.doExecute]
    B -->|Multi-Search| D[TransportMultiSearchAction]
    D --> C
    C --> E[Execute Search]
    E --> F{Result}
    F -->|Success| G[searchResponse.status]
    F -->|Failure| H[ExceptionsHelper.status]
    G --> I[SearchResponseStatusStats.inc]
    H --> I
    I --> J[Categorize by Error Type]
    J --> K[Update LongAdder Counter]
```

### Components

| Component | Package | Description |
|-----------|---------|-------------|
| `StatusCounterStats` | `o.o.action.admin.indices.stats` | Container for doc and search response status stats |
| `SearchResponseStatusStats` | `o.o.action.admin.indices.stats` | Tracks search response status by HTTP family |
| `DocStatusStats` | `o.o.action.admin.indices.stats` | Tracks indexing document status by HTTP family |
| `RestStatus.getErrorType()` | `o.o.core.rest` | Maps HTTP status codes to error type categories |

### Configuration

No configuration is required. The feature is enabled by default and has minimal performance overhead due to the use of `LongAdder` for concurrent counter updates.
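
As a rough illustration of why `LongAdder` suits this kind of counter, the sketch below increments one from several threads and reads the total afterwards. The class `LongAdderCounterSketch` and its methods are hypothetical and are not part of OpenSearch; only `java.util.concurrent.atomic.LongAdder` itself is taken from the text above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class; it only illustrates concurrent LongAdder updates.
final class LongAdderCounterSketch {
    private final LongAdder success = new LongAdder();

    void recordSuccess() {
        // Increments spread across internal cells, so contended writers rarely block each other.
        success.increment();
    }

    long successCount() {
        // sum() adds up the cells; it is accurate once all writers have finished.
        return success.sum();
    }

    // Runs `threads` workers that each record `perThread` successes and returns the total.
    static long runConcurrently(int threads, int perThread) {
        LongAdderCounterSketch counters = new LongAdderCounterSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    counters.recordSuccess();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters.successCount();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(8, 10_000));
    }
}
```

The trade-off is that `sum()` is not an atomic snapshot while writers are still active, which is usually acceptable for monitoring counters that are only read occasionally.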

### Usage Example

#### Querying Search API Statistics

```bash
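# Console form: GET /_nodes/stats/indices
# Assumes an unsecured local node on the default port 9200; adjust scheme, host, and credentials as needed.
curl -s "http://localhost:9200/_nodes/stats/indices"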
```

#### Response Structure

```json
{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "my-cluster",
  "nodes": {
    "node-1": {
      "indices": {
        "status_counter": {
          "doc_status": {
            "success": 50000,
            "user_error": 150,
            "system_failure": 2
          },
          "search_response_status": {
            "success": 25000,
            "user_error": 500,
            "system_failure": 10
          }
        }
      }
    }
  }
}
```

#### Monitoring Use Cases

Calculate search error rate:
```
error_rate = (user_error + system_failure) / (success + user_error + system_failure)
```
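
The formula can be sketched in Java as follows. The class and method names are hypothetical, and returning `0.0` when no requests have been counted is an assumption made here to avoid dividing by zero.

```java
// Hypothetical helper; not part of OpenSearch.
final class SearchErrorRateSketch {
    // Fraction of counted search responses that were user errors or system failures.
    static double errorRate(long success, long userError, long systemFailure) {
        long total = success + userError + systemFailure;
        if (total == 0) {
            return 0.0; // assumption: report 0 when nothing has been counted yet
        }
        return (double) (userError + systemFailure) / total;
    }

    public static void main(String[] args) {
        // Values taken from the sample response above.
        System.out.println(errorRate(25000, 500, 10));
    }
}
```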

Detect `_msearch` partial failures by comparing:
- HTTP-level 200 responses (from load balancer/proxy logs)
- `search_response_status.user_error` + `search_response_status.system_failure` counters
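
Because the counters are cumulative, a comparison like this is usually made over an interval, using the difference between two samples of the Node Stats response. The sketch below shows one way to compute that difference. The class and method names are hypothetical, and treating a decreasing counter as a node restart is an assumption based on the counters resetting at startup.

```java
// Hypothetical helper; not part of OpenSearch.
final class CounterDeltaSketch {
    // Increase of a cumulative counter between two samples. A smaller current value is
    // treated as a restart, so the current value itself is taken as the increase.
    static long delta(long previous, long current) {
        return current >= previous ? current - previous : current;
    }

    // Failed searches (user errors plus system failures) recorded between two samples.
    static long failedSearchesInInterval(long prevUserError, long prevSystemFailure,
                                         long curUserError, long curSystemFailure) {
        return delta(prevUserError, curUserError) + delta(prevSystemFailure, curSystemFailure);
    }

    public static void main(String[] args) {
        System.out.println(failedSearchesInInterval(500, 10, 520, 12));
    }
}
```

A non-zero result for an interval in which the proxy logged only HTTP 200 responses suggests that some sub-requests failed inside otherwise successful `_msearch` calls.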

### Error Type Mapping

| Error Type | HTTP Status Family | Examples |
|------------|-------------------|----------|
| `success` | 1xx, 2xx, 3xx | 200 OK, 201 Created |
| `user_error` | 4xx | 400 Bad Request, 404 Not Found, 429 Too Many Requests |
| `system_failure` | 5xx | 500 Internal Server Error, 503 Service Unavailable |
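
The mapping in this table can be sketched as a small function over the numeric status code. This is a stand-in written for illustration: the actual signature and return type of `RestStatus.getErrorType()` are not shown in this document, so the class, method name, and `String` return values below are assumptions.

```java
// Hypothetical stand-in for the categorization described in the table above.
final class ErrorTypeSketch {
    static String errorType(int httpStatus) {
        int family = httpStatus / 100; // 2 for 2xx, 4 for 4xx, and so on
        if (family >= 1 && family <= 3) {
            return "success";
        }
        if (family == 4) {
            return "user_error";
        }
        if (family == 5) {
            return "system_failure";
        }
        throw new IllegalArgumentException("Not an HTTP status code: " + httpStatus);
    }

    public static void main(String[] args) {
        System.out.println(errorType(200));
        System.out.println(errorType(429));
        System.out.println(errorType(503));
    }
}
```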

### Implementation Details

The tracking is implemented by wrapping the `ActionListener` in `TransportSearchAction.doExecute()`:

```java
ActionListener<SearchResponse> searchStatusStatsUpdateListener = ActionListener.wrap(
    (searchResponse) -> {
        // Notify the original listener first, then record the response status.
        listener.onResponse(searchResponse);
        indicesService.getSearchResponseStatusStats().inc(searchResponse.status());
    },
    (e) -> {
        // Notify the original listener first, then record the status derived from the exception.
        listener.onFailure(e);
        indicesService.getSearchResponseStatusStats().inc(ExceptionsHelper.status(e));
    }
);
```
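
The same wrapping pattern can be tried outside OpenSearch with plain Java. In the sketch below, the nested `Listener` interface is a hypothetical stand-in for `ActionListener`, and the tracker keeps only two counters (success and failure) rather than categorizing by status code, which is a simplification.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical, simplified version of the listener-wrapping pattern.
final class StatusTrackingListenerSketch {
    // Stand-in for OpenSearch's ActionListener; not the real interface.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    final LongAdder success = new LongAdder();
    final LongAdder failure = new LongAdder();

    // Returns a listener that forwards to `delegate` and then updates a counter.
    <T> Listener<T> track(Listener<T> delegate) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                delegate.onResponse(response);
                success.increment();
            }

            @Override
            public void onFailure(Exception e) {
                delegate.onFailure(e);
                failure.increment();
            }
        };
    }

    public static void main(String[] args) {
        StatusTrackingListenerSketch tracker = new StatusTrackingListenerSketch();
        Listener<String> listener = tracker.track(new Listener<String>() {
            @Override
            public void onResponse(String response) {
                System.out.println("response: " + response);
            }

            @Override
            public void onFailure(Exception e) {
                System.out.println("failure: " + e.getMessage());
            }
        });
        listener.onResponse("hits");
        listener.onFailure(new RuntimeException("shard failure"));
        System.out.println(tracker.success.sum() + " success, " + tracker.failure.sum() + " failure");
    }
}
```

Calling the delegate before touching the counter means the caller is never delayed by the bookkeeping, at the cost of skipping the count if the delegate throws.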

## Limitations

- Statistics are node-local and cumulative since node startup
- No per-index or per-shard breakdown of search status
- Historical data is not persisted; counters reset on node restart
- Does not track internal search phases (query, fetch) separately
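
Since the counters are node-local, a cluster-wide view has to be assembled by the consumer. The sketch below sums per-node counters that have already been extracted from the Node Stats response into maps. The class name and the map-based representation are assumptions made for illustration; JSON parsing is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of OpenSearch.
final class ClusterTotalsSketch {
    // Adds up counters with the same key (for example "success") across all nodes.
    static Map<String, Long> sumAcrossNodes(List<Map<String, Long>> perNodeCounters) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map<String, Long> node : perNodeCounters) {
            for (Map.Entry<String, Long> entry : node.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = sumAcrossNodes(List.of(
            Map.of("success", 25000L, "user_error", 500L, "system_failure", 10L),
            Map.of("success", 100L, "user_error", 1L, "system_failure", 0L)
        ));
        System.out.println(totals.get("success"));
    }
}
```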

## Related PRs

| Version | PR | Description |
|---------|-----|-------------|
| v3.4.0 | [#18601](https://github.com/opensearch-project/OpenSearch/pull/18601) | Add search API tracker |

## References

- [Issue #18377](https://github.com/opensearch-project/OpenSearch/issues/18377): Feature request for tracking non-successful Search API calls across coordinator nodes
- [Issue #18438](https://github.com/opensearch-project/OpenSearch/issues/18438): Bug report requesting DocStatusStats refactoring
- [Node Stats API](https://docs.opensearch.org/3.0/api-reference/nodes-apis/nodes-stats/): Official documentation for the Node Stats API

## Change History

- **v3.4.0**: Initial implementation with `StatusCounterStats`, `SearchResponseStatusStats`, and refactored `DocStatusStats`
145 changes: 145 additions & 0 deletions docs/releases/v3.4.0/features/opensearch/search-api-tracker.md
@@ -0,0 +1,145 @@
# Search API Tracker

## Summary

This release introduces a Search API Tracker that tracks response status codes for search API calls at the coordinator node level. The feature addresses a gap in observability where `_msearch` requests return HTTP 200 even when individual sub-requests fail, making it difficult to detect partial failures without client-side inspection.

The tracker adds a new `status_counter` section to the Node Stats API response, providing counters for both document indexing (`doc_status`) and search response (`search_response_status`) statuses, categorized by error type: `success`, `user_error`, and `system_failure`.

## Details

### What's New in v3.4.0

- New `StatusCounterStats` class that aggregates both `DocStatusStats` and `SearchResponseStatusStats`
- Search response status tracking in `TransportSearchAction` at the coordinator node
- Refactored `DocStatusStats` from `IndexingStats` to a standalone class with improved architecture
- New `status_counter` section in Node Stats API response under `indices`
- Error type categorization using `RestStatus.getErrorType()` method

### Technical Changes

#### Architecture Changes

```mermaid
graph TB
    subgraph "Search Request Flow"
        Client[Client] --> SearchAPI[Search API]
        SearchAPI --> TransportSearchAction[TransportSearchAction]
        TransportSearchAction --> IndicesService[IndicesService]
    end

    subgraph "Status Tracking"
        TransportSearchAction -->|onResponse/onFailure| StatusCounterStats[StatusCounterStats]
        StatusCounterStats --> SearchResponseStatusStats[SearchResponseStatusStats]
        StatusCounterStats --> DocStatusStats[DocStatusStats]
    end

    subgraph "Stats Exposure"
        StatusCounterStats --> NodeIndicesStats[NodeIndicesStats]
        NodeIndicesStats --> NodeStatsAPI[Node Stats API]
    end
```

#### New Components

| Component | Description |
|-----------|-------------|
| `StatusCounterStats` | Container class holding both `DocStatusStats` and `SearchResponseStatusStats` |
| `SearchResponseStatusStats` | Tracks search response status codes by HTTP status family |
| `DocStatusStats` (refactored) | Moved from `IndexingStats.Stats` to standalone class in `action.admin.indices.stats` |

#### New Configuration

No new configuration settings are required. The feature is enabled by default.

#### API Changes

The Node Stats API (`GET /_nodes/stats`) response now includes a `status_counter` section:

```json
{
  "indices": {
    "status_counter": {
      "doc_status": {
        "success": 1000,
        "user_error": 5,
        "system_failure": 0
      },
      "search_response_status": {
        "success": 500,
        "user_error": 10,
        "system_failure": 2
      }
    }
  }
}
```

### Usage Example

Query node stats to retrieve search API tracking information:

```bash
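# Console form: GET /_nodes/stats/indices
# Assumes an unsecured local node on the default port 9200; adjust scheme, host, and credentials as needed.
curl -s "http://localhost:9200/_nodes/stats/indices"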
```

Response includes the new `status_counter` section showing cumulative counts since node startup:

```json
{
  "nodes": {
    "node_id": {
      "indices": {
        "status_counter": {
          "doc_status": {
            "success": 15000,
            "user_error": 25,
            "system_failure": 0
          },
          "search_response_status": {
            "success": 8500,
            "user_error": 150,
            "system_failure": 5
          }
        }
      }
    }
  }
}
```

### Error Type Categories

| Category | HTTP Status Codes | Description |
|----------|-------------------|-------------|
| `success` | 1xx, 2xx, 3xx | Successful operations |
| `user_error` | 4xx | Client errors (bad request, not found, etc.) |
| `system_failure` | 5xx | Server errors (internal error, service unavailable, etc.) |

### Migration Notes

- The `doc_status` field has been moved from `indices.indexing` to `indices.status_counter`
- The output format changed from HTTP status family codes (e.g., `2xx`, `4xx`) to error type names (`success`, `user_error`, `system_failure`)
- Uses `LongAdder` instead of `AtomicLong` for better concurrent performance
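
For dashboards that still hold data keyed by the old status-family names, the format change can be bridged with a small conversion. The sketch below is illustrative only: the class and method names are hypothetical, and the family keys (`1xx` through `5xx`) are assumed from the `2xx` and `4xx` examples in the note above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for consumers of the old doc_status format; not part of OpenSearch.
final class DocStatusMigrationSketch {
    // Maps an old status-family key to the new error type name.
    static String errorTypeForFamily(String family) {
        switch (family) {
            case "1xx":
            case "2xx":
            case "3xx":
                return "success";
            case "4xx":
                return "user_error";
            case "5xx":
                return "system_failure";
            default:
                throw new IllegalArgumentException("Unknown status family: " + family);
        }
    }

    // Folds family-keyed counts into the three error type counters.
    static Map<String, Long> toErrorTypes(Map<String, Long> byFamily) {
        Map<String, Long> byType = new LinkedHashMap<>();
        byType.put("success", 0L);
        byType.put("user_error", 0L);
        byType.put("system_failure", 0L);
        for (Map.Entry<String, Long> entry : byFamily.entrySet()) {
            byType.merge(errorTypeForFamily(entry.getKey()), entry.getValue(), Long::sum);
        }
        return byType;
    }

    public static void main(String[] args) {
        Map<String, Long> old = new LinkedHashMap<>();
        old.put("2xx", 1000L);
        old.put("4xx", 5L);
        System.out.println(toErrorTypes(old));
    }
}
```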

## Limitations

- Statistics are cumulative since node startup and reset on node restart
- Tracking occurs at the coordinator node level, not at the shard level
- Upstream PR [#18601](https://github.com/opensearch-project/OpenSearch/pull/18601) was closed without being merged as of the investigation date

## Related PRs

| PR | Description |
|----|-------------|
| [#18601](https://github.com/opensearch-project/OpenSearch/pull/18601) | Add search API tracker |

## References

- [Issue #18377](https://github.com/opensearch-project/OpenSearch/issues/18377): Feature request for tracking non-successful Search API calls
- [Issue #18438](https://github.com/opensearch-project/OpenSearch/issues/18438): Bug report requesting DocStatusStats refactoring
- [Node Stats API Documentation](https://docs.opensearch.org/3.0/api-reference/nodes-apis/nodes-stats/): Official documentation

## Related Feature Report

- [Full feature documentation](../../../features/opensearch/search-api-tracker.md)
1 change: 1 addition & 0 deletions docs/releases/v3.4.0/index.md
@@ -16,6 +16,7 @@
- [Node Stats Bugfixes](features/opensearch/node-stats-bugfixes.md) - Fix negative CPU usage values in node stats on certain Linux distributions
- [S3 Repository](features/opensearch/s3-repository.md) - Add LEGACY_MD5_CHECKSUM_CALCULATION to opensearch.yml settings for S3-compatible storage
- [Scroll Query Optimization](features/opensearch/scroll-query-optimization.md) - Cache StoredFieldsReader per segment for improved scroll query performance
- [Search API Tracker](features/opensearch/search-api-tracker.md) - Track search response status codes at coordinator node for observability
- [Security Kerberos Integration](features/opensearch/security-kerberos-integration.md) - Update Hadoop to 3.4.2 and enable Kerberos integration tests for JDK-24+
- [Settings Bugfixes](features/opensearch/settings-bugfixes.md) - Fix duplicate registration of dynamic settings and patch version build issues
- [Stats Builder Pattern Deprecations](features/opensearch/stats-builder-pattern-deprecations.md) - Deprecated constructors in 30+ Stats classes in favor of Builder pattern