[BUG] geotile_grid aggregation on LineString maxes out CPU and stalls cluster #20413

@RLashofRegas

Describe the bug

Running a simple geotile_grid aggregation on a geo_shape field causes the query to time out, even though the index contains only a single document holding a GeoJSON LineString. Data node CPU jumps to a high value and stays there indefinitely until the node is manually restarted.

Related component

Search:Aggregations

To Reproduce

Note 1: I have done this with an Amazon OpenSearch managed cluster but I have not tried to reproduce it locally.
Note 2: I originally saw the issue on a larger cluster with r7g.4xlarge data nodes even though the below steps mention a very small single-node cluster.

  1. Create a new Amazon OpenSearch managed cluster with the following settings:
    • Domain creation method - Standard Create
    • Templates - Dev/test
    • Availability Zones - 1-AZ without standby
    • Engine Version - 2.19
    • Number of Data Nodes - 1 r8g.large.search data node
  2. Create an index with the following definition:
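# Assumption: `opensearch` is an opensearch-py client instance and
# `index_name` is the name of the test index (setup not shown).
opensearch.indices.create(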
    index_name,
    {
        "settings": {
            "index.number_of_shards": 1,
            "index.number_of_replicas": 0
        },
        "mappings": {
            "properties": {
                "geolocation": {"type": "geo_shape"}
            }
        }
    },
)
  3. Index a single document with the following data:
document = {
    "geolocation": {
        "type": "LineString",
        "coordinates": [
            [120.69105000000002, -2.1092199999999366],
            [120.75767000000008, -2.159189999999967],
            [120.82192000000009, -2.1877399999999625],
        ]
    }
}
opensearch.index(index_name, document)
  4. Run the following search query to verify the document was successfully indexed:
import json

from opensearchpy import RequestError  # assuming the opensearch-py client

query = {
    "size": 1,
    "timeout": "60s",
    "query": {
        "match_all": {}
    }
}

try:
    result = opensearch.search(index=index_name, body=query)
except RequestError as e:
    print(e)
    raise

print(json.dumps(result, indent=2))
  5. Run the following query to reproduce the issue:
query = {
    "size": 1,
    "timeout": "60s",
    "query": {
        "match_all": {}
    },
    "aggs": {
        "locations": {
            "geotile_grid": {
                "field": "geolocation",
                "precision": 29
            }
        }
    }
}

try:
    result = opensearch.search(index=index_name, body=query)
except RequestError as e:
    print(e)
    raise

print(json.dumps(result, indent=2))
  6. Note that a TimeoutError occurs. Go to Amazon CloudWatch and plot the Data Node CPUUtilization metric. The CPU has jumped to ~50% and stays there (this instance type has 2 vCPUs, so this is essentially one vCPU handling the single shard and being maxed out).
  7. Monitor for a while and notice that the CPU never goes down. In addition, running POST /_tasks/_cancel does not work. The only option we have found is to restart the node.
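
For a rough sense of scale, the sketch below estimates how many tiles the LineString's bounding box spans at a given precision. It uses the standard Web Mercator (slippy map) tile formulas. The helper names are hypothetical, and whether OpenSearch actually enumerates every tile in the bounding box is an assumption, not something confirmed here.

```python
import math

# Coordinates of the LineString from the reproduction steps, as (lon, lat).
COORDS = [
    (120.69105000000002, -2.1092199999999366),
    (120.75767000000008, -2.159189999999967),
    (120.82192000000009, -2.1877399999999625),
]


def lon_to_tile_x(lon, zoom):
    """Column of the Web Mercator tile containing this longitude."""
    return int((lon + 180.0) / 360.0 * (1 << zoom))


def lat_to_tile_y(lat, zoom):
    """Row of the Web Mercator tile containing this latitude."""
    lat_rad = math.radians(lat)
    return int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * (1 << zoom))


def bbox_tile_span(coords, zoom):
    """Return (width, height) in tiles of the bounding box of coords."""
    xs = [lon_to_tile_x(lon, zoom) for lon, _ in coords]
    ys = [lat_to_tile_y(lat, zoom) for _, lat in coords]
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1


# Print the estimate for a few precisions, ending with the one in the query.
for zoom in (10, 15, 20, 25, 29):
    width, height = bbox_tile_span(COORDS, zoom)
    print(f"precision {zoom}: {width} x {height} = {width * height} tiles")
```

At precision 29 the bounding box comes out on the order of 10^10 tiles, while the line itself can cross only a few hundred thousand of them. If the aggregation walks the bounding box, that would be consistent with a single thread staying busy for a very long time, but this is only an estimate of the work involved.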

Expected behavior

If this type of query is supported, it should not bring down the cluster with only a single document. If it is not supported, it should return an error.

Additional Details

Plugins
Default Amazon OpenSearch configuration as mentioned above.

Host/Environment (please complete the following information):

  • OS: Linux - AWS Managed cluster
