[Feature]GeoIP datasource implementation #6559

@heemin32

Description

This document describes the implementation details of the GeoIP datasource, as part of #5856.

Tasks

Tasks are listed here to track progress on the implementation. One PR can cover multiple tasks if the code change is small.

Create datasource

  • Create API interface
  • Read default value from a cluster configuration property
  • Read manifest file and validate input parameter
  • Store metadata in a system index
  • Schedule a GeoIP database update task for the new datasource

Update datasource

  • Update metadata in a system index
  • Schedule a GeoIP database update task for the existing datasource

Read datasource

  • Return metadata

Delete datasource

  • Return an error if any GeoIP processor is using this GeoIP datasource
  • Update metadata in a system index
  • Schedule a GeoIP database delete task

Update GeoIP database

  • Check whether an update is required
  • Download the zip file and ingest its data into an index without storing it on disk
  • Delete the old index
  • Schedule either the next update or a delete task

Delete GeoIP database

  • Delete the GeoIP datasource index
  • Delete the GeoIP datasource metadata

User scenarios

Create/Update of GeoIP data source

  1. The customer calls the OpenSearch cluster to create a GeoIP data source. The call takes an endpoint and an update interval as parameters. Default values are provided and can be configured through a property.
  2. The data about the GeoIP data source is stored in a system index named .geoip_datasource.
  3. PUT/POST API handler for data source
    1. Read manifest file.
    2. Validate parameter.
      1. Manifest file is reachable.
      2. Manifest file format is correct.
      3. update_interval_in_days is less than the valid_for_in_days value in the manifest file.
    3. Store data in a system index
    4. Schedule the update
      1. If the data source name exists
        1. If there is an ongoing update
          1. Do nothing
        2. If there is no ongoing update
          1. Cancel the scheduled update task
          2. Reschedule the update task after update_interval.
      2. If the data source name does not exist
        1. Schedule an update task
    5. Return OK
  4. Update task
    1. Read the manifest file.
      1. If the md5_hash value is the same as the previous one
        1. Only update the metadata of the data source: expire_after, next_update_at, last_skipped_at.
      2. If the md5_hash value is different from the previous one
        1. Download the database and ingest it into a new system index.
        2. Update the metadata of the data source: md5_hash, expire_after, updated_at, next_update_at, last_succeeded_at, last_processing_time.
        3. Delete the old index.
        4. Schedule the next update task.
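
The branching in the update task can be sketched as below. This is a minimal illustration, not the plugin's implementation: `UpdatePlan` and `plan_update` are hypothetical names, and treating a datasource with no stored hash (a first run) as "changed" is an assumption. The metadata field lists are copied from the steps above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UpdatePlan:
    """What one run of the update task should do (hypothetical type)."""
    download_and_ingest: bool
    metadata_fields: List[str] = field(default_factory=list)


def plan_update(previous_md5: Optional[str], manifest_md5: str) -> UpdatePlan:
    """Compare the stored md5_hash with the manifest's and pick a branch."""
    if previous_md5 is not None and previous_md5 == manifest_md5:
        # Database unchanged: skip the download, refresh bookkeeping only.
        return UpdatePlan(
            download_and_ingest=False,
            metadata_fields=["expire_after", "next_update_at", "last_skipped_at"],
        )
    # Database changed (or, by assumption, never downloaded): ingest into a
    # new index, then record the full set of fields.
    return UpdatePlan(
        download_and_ingest=True,
        metadata_fields=[
            "md5_hash",
            "expire_after",
            "updated_at",
            "next_update_at",
            "last_succeeded_at",
            "last_processing_time",
        ],
    )
```

Keeping the decision separate from the download and ingest side effects makes the two branches easy to unit test.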

Datasource API signature

PUT /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval_in_days": 20
}

GET /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/manifest/geolite2-city",
  "update_interval_in_days": 20,
  "state": "AVAILABLE",
  "expire_after": 12343434,
  "next_update": 12341244,
  "database": {
    "provider": "maxmind",
    "md5_hash": "63d0cea9d550e495fde1b81310951bd7",
    "updated_at": 123123213,
    "valid_for_in_days" : 30,
    "fields": ["latitude", "longitude", "country", "city"]
  },
  "indices": [
    ".geoip_datasource.my-datasource.123123213",
    ".geoip_datasource.my-datasource.123123212"
  ],
  "update_stats": {
    "last_succeeded_at": 123123,
    "last_processing_time_in_millis": 912999,
    "last_failed_at": 123123213123,
    "last_skipped_at": 123123213
  }
}
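
While an update is in flight, the "indices" list can hold both the new and the old index. A hedged sketch of how the two might be told apart follows. It assumes the numeric suffix of each index name grows with every update (in the sample response the newest suffix matches "updated_at"), which this document does not state explicitly. `split_indices` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple


def split_indices(indices: List[str]) -> Tuple[Optional[str], List[str]]:
    """Return (newest_index, stale_indices), ordered by numeric suffix.

    Assumption: names look like ".geoip_datasource.<name>.<number>" and a
    larger number means a more recent database.
    """
    if not indices:
        return None, []

    def suffix(index_name: str) -> int:
        # rsplit from the right so datasource names containing dots still work.
        return int(index_name.rsplit(".", 1)[1])

    ordered = sorted(indices, key=suffix)
    return ordered[-1], ordered[:-1]
```

The stale entries are the candidates for the "delete the old index" step once ingestion into the newest index has finished.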

GeoIP database in an index

Index
/.geoip_datasource.my-datasource.1
{
   "_cidr" : "2a12:49c5:4380::/41",
   "_data" : {
       "country_name" : "Georgia",
       "continent_name" : "Asia",
        ...
    }
}
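
To illustrate how a document of this shape answers a lookup, here is a small in-memory sketch using Python's standard `ipaddress` module. It is only a stand-in for an index query: the linear scan, the `lookup` name, and the longest-prefix tie-break for overlapping ranges are all assumptions, not behavior described in this document.

```python
import ipaddress
from typing import Any, Dict, List, Optional


def lookup(ip: str, documents: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the "_data" of the most specific "_cidr" range containing ip."""
    address = ipaddress.ip_address(ip)
    best_data = None
    best_prefix = -1
    for doc in documents:
        network = ipaddress.ip_network(doc["_cidr"])
        if address.version != network.version:
            continue  # an IPv4 address never matches an IPv6 range
        if address in network and network.prefixlen > best_prefix:
            best_data, best_prefix = doc["_data"], network.prefixlen
    return best_data


docs = [
    {
        "_cidr": "2a12:49c5:4380::/41",
        "_data": {"country_name": "Georgia", "continent_name": "Asia"},
    }
]
result = lookup("2a12:49c5:4380::1", docs)
```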

Manifest.json

{
  "url": "https://d17zozg08cgjfy.cloudfront.net/GeoLite2-ASN-CSV_20221206.zip",
  "db_name": "GeoLite2-ASN.csv",
  "md5_hash": "safasdfaskkkesadfasdf",
  "valid_for_in_days": 30,
  "updated_at": 3134012341236,
  "provider": "maxmind"
}
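
The manifest checks from the PUT/POST handler (correct format, and an update interval shorter than the validity period) could look roughly like this. The sketch assumes the manifest text has already been fetched, so the reachability check is out of scope here. Treating every field of the sample manifest as required is an assumption, and `validate_manifest` is a hypothetical name.

```python
import json
from typing import Any, Dict

# Assumption: all fields shown in the sample manifest are mandatory.
REQUIRED_FIELDS = (
    "url", "db_name", "md5_hash", "valid_for_in_days", "updated_at", "provider",
)


def validate_manifest(manifest_text: str, update_interval_in_days: int) -> Dict[str, Any]:
    """Parse the manifest and return it, or raise ValueError if invalid."""
    try:
        manifest = json.loads(manifest_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"manifest is not valid JSON: {exc}") from exc
    if not isinstance(manifest, dict):
        raise ValueError("manifest must be a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in manifest]
    if missing:
        raise ValueError(f"manifest is missing fields: {missing}")

    # The update interval must be shorter than the validity period, so the
    # database is refreshed before it expires.
    if update_interval_in_days >= manifest["valid_for_in_days"]:
        raise ValueError(
            "update_interval_in_days must be less than valid_for_in_days"
        )
    return manifest
```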

Deletion of GeoIP data source

  1. The customer calls the OpenSearch cluster to delete a GeoIP data source.
  2. The cluster checks whether any GeoIP processor is using the GeoIP data source.
    1. If there is one, return an error.
    2. If there is none
      1. Mark the datasource as deleted.
      2. If there is an ongoing update
        1. Let the update task trigger the delete task when it finishes
      3. If there is no ongoing update
        1. Cancel the scheduled update task
        2. Schedule the delete task immediately.
          1. Delete the GeoIP data index
          2. Delete the GeoIP data source data

DELETE /_geoip/datasource/my-datasource
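
The deletion flow reduces to a three-way decision, sketched below. `DeleteAction` and `plan_delete` are hypothetical names used only for illustration. Marking the datasource as deleted and the actual index removal are left out.

```python
from enum import Enum


class DeleteAction(Enum):
    REJECT = "reject"                       # a GeoIP processor still uses it
    DEFER_TO_UPDATE_TASK = "defer"          # running update triggers the delete
    CANCEL_UPDATE_AND_DELETE_NOW = "now"    # no update running


def plan_delete(processor_count: int, update_in_progress: bool) -> DeleteAction:
    """Decide how to handle a DELETE request for a datasource."""
    if processor_count > 0:
        return DeleteAction.REJECT
    if update_in_progress:
        # The datasource is marked deleted; the ongoing update task is
        # expected to trigger the delete task when it finishes.
        return DeleteAction.DEFER_TO_UPDATE_TASK
    # Cancel the scheduled update task and schedule the delete immediately.
    return DeleteAction.CANCEL_UPDATE_AND_DELETE_NOW
```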

Cluster manager node failure

All work related to a GeoIP datasource is executed on the cluster manager node, which keeps the scheduled tasks in memory. When the cluster manager node fails, the role fails over to one of the cluster-manager-eligible nodes. The new cluster manager node scans all existing GeoIP datasources and schedules their tasks again. It uses the "next_update" field of each GeoIP datasource to set the correct time to update the GeoIP databases.
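
A rough sketch of the rescheduling step after failover is shown below. It assumes "next_update" and the current time are expressed in the same epoch unit, and that an update whose time has already passed should run immediately. Neither detail is specified in this document, and `reschedule_delays` is a hypothetical name.

```python
from typing import Dict


def reschedule_delays(datasources: Dict[str, Dict[str, int]], now: int) -> Dict[str, int]:
    """Map each datasource name to the delay before its next update.

    Overdue updates get a delay of 0, i.e. they run right away (assumption).
    """
    return {
        name: max(0, metadata["next_update"] - now)
        for name, metadata in datasources.items()
    }


delays = reschedule_delays(
    {"my-datasource": {"next_update": 150}, "overdue": {"next_update": 50}},
    now=100,
)
```

Because the schedule is derived only from stored metadata, the new cluster manager node needs no state from the failed node.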

Labels: enhancement, feature