Add API worker websocket and HTTP endpoints for FL to PyGrid

The various worker libraries will need to communicate with PyGrid according to an API that's defined in PyGrid. I currently believe that we should aim to support both WebSocket messages and HTTPS endpoints to accomplish this, hopefully with this philosophy becoming a standard of PyGrid.

All socket calls should follow the format of:
```
{
  "type": "the type of the message",
  "data": {}
}
```
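As a minimal sketch of this envelope, a worker library could wrap and unwrap socket messages as below. The helper names `buildSocketMessage` and `parseSocketMessage` are hypothetical and not part of any existing syft.js API; only the `type`/`data` shape comes from this proposal.

```javascript
// Hypothetical helper: wraps a payload in the { type, data } envelope.
function buildSocketMessage(type, data = {}) {
  if (typeof type !== 'string' || type.length === 0) {
    throw new TypeError('type must be a non-empty string');
  }
  return JSON.stringify({ type, data });
}

// Hypothetical helper: parses a raw message and checks the envelope shape.
function parseSocketMessage(raw) {
  const message = JSON.parse(raw);
  const isObject = (value) => value !== null && typeof value === 'object';
  if (!isObject(message) || typeof message.type !== 'string' || !isObject(message.data)) {
    throw new Error('Malformed socket message');
  }
  return message;
}

const encoded = buildSocketMessage('federated/authenticate', {
  auth_token: 'MY_AUTH_TOKEN',
});
const decoded = parseSocketMessage(encoded);
console.log(decoded.type); // prints "federated/authenticate"
```

Validating the envelope on both sides keeps a malformed message from reaching the handler for a specific `type`.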

I'd like for the following endpoints to be added:

## Authentication with PyGrid
Method in worker library:
```js
const worker = new syft({
  url: 'https://localhost:3000',
  auth_token: MY_AUTH_TOKEN
});
```
HTTP endpoint: `POST /federated/authenticate`
Socket `"type"`: `federated/authenticate`
Request data:
```json
{
  "auth_token": "MY_AUTH_TOKEN"
}
```

_Note that `auth_token` supplied above is an optional argument depending on the setup of PyGrid._

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to authenticate as a worker with PyGrid.

In order to guarantee the identity of a worker, it's important to have some sort of authentication workflow present. While this isn't strictly required, it is an important mechanism in our federated learning workflow for preventing a variety of attacks, most notably a "Sybil attack". This happens when a worker generates multiple identities for itself, so that the same worker trains on the same data many times under unique worker IDs, which would overfit the model. To prevent this, we strongly suggest that every deployment of PyGrid's FL system implement some sort of OAuth 2.0 protocol.

In this circumstance, a worker would be logged in to their application via OAuth and would be given an authentication token with which to make secure web requests inside the app. Assuming that PyGrid has also been set up to use this same OAuth mechanism, a worker could forward this `auth_token` to PyGrid, which then validates the token against the same OAuth provider to confirm it belongs to an actual user. It's important to do this because it spares PyGrid from having to incorporate its own authentication system, and instead farms this responsibility out to a third-party system.

In the event that the administrator of the PyGrid gateway does not want to add OAuth support, or there is no login capability within the web or mobile app the worker is running on, this authentication process is skipped and a `worker_id` is assigned. This is insecure and open to attacks. It's not recommended, but it must be supported as part of our system.

There are three possible responses, one success and two error responses:

**Success** - _triggered when there is no OAuth flow required by PyGrid OR when there is a required OAuth flow in PyGrid and the `auth_token` sent by the worker is validated by the third-party provider as belonging to an existing user_
```json
{
  "worker_id": "ID OF THE WORKER"
}
```

**Error** - _triggered when there is an OAuth flow required by PyGrid and no `auth_token` is sent_
```json
{
  "error": "Authentication is required, please pass an 'auth_token'."
}
```

**Error** - _triggered when there is an OAuth flow required by PyGrid and the `auth_token` that was sent is invalid_
```json
{
  "error": "The 'auth_token' that you passed is invalid."
}
```

The success response will include a `worker_id` which should be cached for long-term use. **This will be passed with all subsequent calls to PyGrid.**
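The three responses above could be handled on the worker side roughly as follows. This is a sketch only: `handleAuthResponse` is a hypothetical name, and the `Map` stands in for whatever persistent storage a given worker library actually uses to cache the `worker_id`.

```javascript
// Hypothetical helper: interprets the body of an authentication response
// and caches the worker_id on success.
function handleAuthResponse(body, cache) {
  if (typeof body.error === 'string') {
    // Covers both error cases: missing auth_token and invalid auth_token.
    return { ok: false, error: body.error };
  }
  if (typeof body.worker_id !== 'string') {
    return { ok: false, error: 'Unexpected response from PyGrid.' };
  }
  // Cache for long-term use; it is sent with all subsequent calls.
  cache.set('worker_id', body.worker_id);
  return { ok: true, workerId: body.worker_id };
}

// Assumption: a Map stands in for persistent storage in this sketch.
const workerCache = new Map();

const success = handleAuthResponse({ worker_id: 'ID OF THE WORKER' }, workerCache);
const failure = handleAuthResponse(
  { error: "The 'auth_token' that you passed is invalid." },
  workerCache
);
console.log(success.ok, failure.ok); // prints "true false"
```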

## Connection Speed Test
Method in worker library: `job.start()`
HTTP endpoint: `GET /federated/speed-test` and `POST /federated/speed-test`
Socket `"type"`: N/A
Query string: `?random=RANDOM HASH VALUE&worker_id=ID OF THE WORKER`

**This endpoint is HTTP only.**

We need some way of getting a reliable average upload and download speed for a worker in order to potentially qualify them for joining an FL worker cycle. In order to do this, we need two endpoints at the same location: a `GET` route for testing worker download speed and a `POST` route for testing worker upload speed. In each route, a random query string value must be appended onto the end of the request to prevent the server or the worker from caching the result after multiple rounds.

When performing the download speed test, PyGrid will generate a random file of a certain size (to be determined) which the worker may download. The time it takes the worker to download will be captured by the worker and stored.

When performing the upload speed test, the worker will generate a random file of a certain size (to be determined) which will be uploaded to PyGrid (and then discarded). The time it takes the worker to upload will also be captured by the worker and stored.
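The arithmetic behind both tests can be sketched as below, assuming the worker knows the transferred size in bytes and has timed the transfer in milliseconds. `toMbps` and `buildSpeedTestUrl` are hypothetical helper names, and the way the random value is generated is only one possible choice.

```javascript
// Hypothetical helper: converts a timed transfer into megabits per second.
function toMbps(bytes, elapsedMs) {
  if (elapsedMs <= 0) {
    throw new RangeError('elapsedMs must be positive');
  }
  const bits = bytes * 8;
  const seconds = elapsedMs / 1000;
  return bits / seconds / 1e6;
}

// Hypothetical helper: builds the speed test URL with a cache-busting
// random value and the worker_id in the query string.
function buildSpeedTestUrl(baseUrl, workerId) {
  const url = new URL('/federated/speed-test', baseUrl);
  // Assumption: any sufficiently random string works as the "random" value.
  const random = Math.random().toString(36).slice(2) + Date.now().toString(36);
  url.searchParams.set('random', random);
  url.searchParams.set('worker_id', workerId);
  return url;
}

// A 5,000,000 byte file transferred in one second is 40 Mbps.
console.log(toMbps(5000000, 1000)); // prints 40

const speedTestUrl = buildSpeedTestUrl('https://localhost:3000', 'ID OF THE WORKER');
console.log(speedTestUrl.pathname); // prints "/federated/speed-test"
```

The same URL builder serves both the `GET` (download) and `POST` (upload) routes, since they live at the same location.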

_Note: The above is merely a proposal of how this workflow should work. The real-world solution should be determined and this document will be modified to fit the best solution we come up with. This paradigm should be heavily tested against real-world connection speed tests to ensure a reliable result. @Prtfw please do some extra research on this to cover our bases._

## FL Worker Cycle Request
Method in worker library: _Also part of `job.start()` behind the scenes_
HTTP endpoint: `POST /federated/cycle-request`
Socket `"type"`: `federated/cycle-request`
Request data:
```json
{
  "worker_id": "ID OF THE WORKER",
  "model": "my-federated-model",
  "version": "0.1.0",
  "ping": "8ms",
  "download": "46.3mbps",
  "upload": "23.7mbps"
}
```
_Note that `version` supplied above is an optional argument._

This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) is attempting to join an active federated learning cycle. PyGrid will accept or reject the worker depending on the current state of the cycle, the speed of the worker's connection, and how many workers have already been chosen.

Given this information, PyGrid will send **one of two responses**:

**Rejection**
```json
{
  "status": "rejected",
  "timeout": 2700,
  "model": "my-federated-model",
  "version": "0.1.0"
}
```

This means that the worker was rejected from the current cycle and asked to request to join another cycle in 2700 seconds. The number of seconds will depend on when the next cycle is expected to start. If a timeout is not sent, this means that it's the last cycle and there will not be another one to join.

**Accepted**
```json
{
  "status": "accepted",
  "model": "my-federated-model",
  "version": "0.1.0",
  "request_key": "LONG HASH VALUE",
  "plans": { "training_plan": "ID OF THE TRAINING PLAN", "another_plan": "ID OF ANOTHER PLAN" },
  "client_config": "CLIENT CONFIG OBJECT",
  "protocols": { "secure_agg_protocol": "ID OF THE PROTOCOL" },
  "model_id": "ID OF THE MODEL"
}
```

In the event that the worker is accepted into the current cycle, they will be sent a named list of the IDs of the plans they need to execute, a named list of the IDs of the protocols they need to execute, the ID of the model, and the client config. The plans, protocols, and model will not be downloaded in this response. Instead, the worker will need to make an additional request to receive them (due to the size constraints of the response). They will pass the `request_key` given above as a form of "authenticating" the download request. This key is specific to the relationship between the worker AND the cycle and cannot be reused for future cycles or other workers. This is detailed in the ["Plan Download" section](#plan-download).

_Note that it is not possible for a worker to participate in the same cycle multiple times. The client creates a "job" request. If they are accepted, they should not be allowed to submit another job request for the same cycle._
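A worker library might branch on the two cycle responses roughly like this. The sketch assumes the response bodies shown above; `handleCycleResponse` and the returned `action` values are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: decides what the worker should do next based on
// the cycle request response.
function handleCycleResponse(body) {
  if (body.status === 'rejected') {
    // No timeout means this was the last cycle, so there is nothing to retry.
    if (typeof body.timeout !== 'number') {
      return { action: 'stop' };
    }
    // The timeout is expressed in seconds.
    return { action: 'retry', retryInMs: body.timeout * 1000 };
  }
  if (body.status === 'accepted') {
    // Only IDs are returned; the assets are fetched in separate requests.
    return {
      action: 'download',
      requestKey: body.request_key,
      planIds: Object.values(body.plans || {}),
      protocolIds: Object.values(body.protocols || {}),
      modelId: body.model_id,
      clientConfig: body.client_config,
    };
  }
  throw new Error(`Unknown cycle status: ${body.status}`);
}

const rejected = handleCycleResponse({ status: 'rejected', timeout: 2700 });
console.log(rejected.retryInMs); // prints 2700000

const accepted = handleCycleResponse({
  status: 'accepted',
  request_key: 'LONG HASH VALUE',
  plans: { training_plan: 'ID OF THE TRAINING PLAN' },
  protocols: { secure_agg_protocol: 'ID OF THE PROTOCOL' },
  model_id: 'ID OF THE MODEL',
  client_config: 'CLIENT CONFIG OBJECT',
});
console.log(accepted.planIds.length); // prints 1
```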

## Plan Download
Method in worker library: _Also part of `job.start()` behind the scenes_
HTTP endpoint: `GET /federated/get-plan`
Socket `"type"`: N/A
Query string: `?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&plan_id=ID OF THE PLAN&receive_operations_as=list`

**This endpoint is HTTP only.**

This method will allow a worker that has been accepted into a cycle to request the download of a plan from PyGrid. They need to submit their `request_key` provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

The worker also needs to specify how it would like to receive plans: either a list of operations (`"list"`) or TorchScript (`"torchscript"`) depending on the type of worker requesting (https://github.com/OpenMined/PyGrid/issues/437). This is set in the `receive_operations_as` parameter of the query string.

Response: _This downloads the plan to the worker._
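Building the download URL could look like the sketch below. `buildPlanUrl` is a hypothetical helper, and the short IDs in the usage example are placeholders; the parameter names come from the query string above.

```javascript
// The two plan formats named in this proposal.
const PLAN_FORMATS = ['list', 'torchscript'];

// Hypothetical helper: builds the plan download URL for an accepted worker.
function buildPlanUrl(baseUrl, { workerId, requestKey, planId, receiveOperationsAs = 'list' }) {
  if (!PLAN_FORMATS.includes(receiveOperationsAs)) {
    throw new RangeError(`receive_operations_as must be one of: ${PLAN_FORMATS.join(', ')}`);
  }
  const url = new URL('/federated/get-plan', baseUrl);
  // URLSearchParams percent-encodes values, so IDs and hashes are safe to pass as-is.
  url.search = new URLSearchParams({
    worker_id: workerId,
    request_key: requestKey,
    plan_id: planId,
    receive_operations_as: receiveOperationsAs,
  }).toString();
  return url.toString();
}

const planUrl = buildPlanUrl('https://localhost:3000', {
  workerId: 'w1',
  requestKey: 'k1',
  planId: 'p1',
});
console.log(planUrl);
// prints "https://localhost:3000/federated/get-plan?worker_id=w1&request_key=k1&plan_id=p1&receive_operations_as=list"
```

The protocol and model downloads follow the same pattern with a different path and ID parameter.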

## Protocol Download
Method in worker library: _Also part of `job.start()` behind the scenes_
HTTP endpoint: `GET /federated/get-protocol`
Socket `"type"`: N/A
Query string: `?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&protocol_id=ID OF THE PROTOCOL`

**This endpoint is HTTP only.**

This method will allow a worker that has been accepted into a cycle to request the download of a protocol from PyGrid. They need to submit their `request_key` provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: _This downloads the protocol to the worker._

## Model Download
Method in worker library: _Also part of `job.start()` behind the scenes_
HTTP endpoint: `GET /federated/get-model`
Socket `"type"`: N/A
Query string: `?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&model_id=ID OF THE MODEL`

**This endpoint is HTTP only.**

This method will allow a worker that has been accepted into a cycle to request the download of a model from PyGrid. They need to submit their `request_key` provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.

Response: _This downloads the model to the worker._

## Report
Method in worker library: `job.report()`
HTTP endpoint: `POST /federated/report`
Socket `"type"`: `federated/report`
Request data:
```json
{
  "worker_id": "ID OF THE WORKER",
  "request_key": "LONG HASH VALUE",
  "diff": "FINAL MODEL DIFF FROM TRAINING"
}
```

This method will allow a worker that has been accepted into a cycle and finished training a model on their device to upload the resulting model diff.

If the worker was not given a protocol to execute after the plan(s), then they will simply submit their entire model diff. If they want to manually add noise to this diff as a layer of protection, they may do so at the developer's discretion from inside the worker implementation.

If the worker did execute a protocol and they have finished the secure aggregation protocol with other workers, they will now receive a share of the resulting securely aggregated model diff. In this case, they will submit the share of the diff, rather than their original model diff. PyGrid will handle the decryption of the shares once they're all submitted.

Response: `{ "status": "success" }`

The success body is sent when the HTTP response status is 200. **The worker should not be informed if the model diff was accepted or denied as part of the global model update.**
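A sketch of the report call from the worker's side, assuming the diff (or the share of the aggregated diff) has already been serialized to a string. The helper names are hypothetical, and the serialization of the diff itself is not specified in this proposal.

```javascript
// Hypothetical helper: builds the request data for the report call.
function buildReportRequest(workerId, requestKey, diff) {
  for (const [name, value] of Object.entries({ workerId, requestKey, diff })) {
    if (typeof value !== 'string' || value.length === 0) {
      throw new TypeError(`${name} must be a non-empty string`);
    }
  }
  return { worker_id: workerId, request_key: requestKey, diff };
}

// Hypothetical helper: wraps the same data for the socket transport.
function buildReportSocketMessage(reportData) {
  return JSON.stringify({ type: 'federated/report', data: reportData });
}

// Hypothetical helper: the report only confirms receipt, never whether
// the diff was used in the global model update.
function isReportAcknowledged(httpStatus, body) {
  return (
    httpStatus === 200 &&
    body !== null &&
    typeof body === 'object' &&
    body.status === 'success'
  );
}

const reportData = buildReportRequest(
  'ID OF THE WORKER',
  'LONG HASH VALUE',
  'FINAL MODEL DIFF FROM TRAINING'
);
console.log(isReportAcknowledged(200, { status: 'success' })); // prints true
```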
