Add API worker websocket and HTTP endpoints for FL to PyGrid #445
Description
The various worker libraries will need to communicate with PyGrid according to an API that's defined in PyGrid. I currently believe that we should aim to support both WebSocket messages and HTTPS endpoints to accomplish this, hopefully with this philosophy becoming a standard of PyGrid.
All socket calls should follow the format of:
{
"type": "the type of the message",
"data": {}
}
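As a minimal sketch of the envelope above, the helpers below serialize and parse a socket message. The function names `make_message` and `parse_message` are hypothetical and not part of PyGrid or any worker library; treating a missing "data" key as an empty object is also an assumption.

```python
import json
from typing import Optional, Tuple


def make_message(msg_type: str, data: Optional[dict] = None) -> str:
    """Serialize a socket message as {"type": ..., "data": ...}."""
    if not msg_type:
        raise ValueError("message type must be a non-empty string")
    return json.dumps({"type": msg_type, "data": data if data is not None else {}})


def parse_message(raw: str) -> Tuple[str, dict]:
    """Parse a socket message and return (type, data)."""
    message = json.loads(raw)
    if not isinstance(message, dict) or "type" not in message:
        raise ValueError("message must be an object with a 'type' key")
    # Assumption: a missing "data" key is treated as an empty object.
    return message["type"], message.get("data", {})
```

For example, `make_message("federated/authenticate", {"auth_token": "MY_AUTH_TOKEN"})` would produce the authentication message described in the next section.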
I'd like for the following endpoints to be added:
Authentication with PyGrid
Method in worker library:
const worker = new syft({
url: 'https://localhost:3000',
auth_token: MY_AUTH_TOKEN
});
HTTP endpoint: POST /federated/authenticate
Socket "type": federated/authenticate
Request data:
{
"auth_token": "MY_AUTH_TOKEN"
}
Note that auth_token supplied above is an optional argument depending on the setup of PyGrid.
This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) authenticates as a worker with PyGrid.
In order to guarantee the identity of a worker, it's important to have some sort of authentication workflow present. While this isn't strictly required, it is an important mechanism in our federated learning workflow for preventing a variety of attacks, most notably a "Sybil attack". This happens when a single worker generates multiple identities for itself, steering model training to be done by the same worker on the same data under different worker IDs, which would overfit the model to that data. To prevent this, we strongly suggest that every deployment of PyGrid's FL system implement some sort of OAuth 2.0 protocol.
In this circumstance, a worker would be logged in to their application via OAuth and would be given an authentication token with which to make secure web requests inside the app. Assuming that PyGrid has also been set up to use the same OAuth provider, the worker could forward this auth_token to PyGrid, which then validates the token against that provider to confirm it belongs to an actual user. This is important because it means PyGrid does not have to implement its own authentication system, and instead delegates that responsibility to a third-party system.
In the event that the administrator of the PyGrid gateway does not want to add OAuth support, or there is no login capability within the web or mobile app the worker is running on, this authentication process is skipped and a worker_id is assigned without validation. This is insecure and open to attacks. It's not recommended, but our system must support it.
There are three possible responses, one success and two error responses:
Success - triggered when there is no OAuth flow required by PyGrid OR when there is a required OAuth flow in PyGrid and the auth_token sent by the worker is validated against the third-party provider
{
"worker_id": "ID OF THE WORKER"
}
Error - triggered when there is an OAuth flow required by PyGrid and no auth_token is sent
{
"error": "Authentication is required, please pass an 'auth_token'."
}
Error - triggered when there is an OAuth flow required by PyGrid and the auth_token that was sent is invalid
{
"error": "The 'auth_token' that you passed is invalid."
}
The success response will include a worker_id which should be cached for long-term use. This will be passed with all subsequent calls to PyGrid.
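The three responses can be sketched as a single decision function on the PyGrid side. This is only an illustration under stated assumptions: `authenticate`, `oauth_required`, and `validate_token` are hypothetical names, the token validator stands in for a call to the third-party provider, and generating the worker_id with `uuid4` is an assumption rather than the defined ID scheme.

```python
import uuid
from typing import Callable, Optional


def authenticate(
    auth_token: Optional[str],
    oauth_required: bool,
    validate_token: Callable[[str], bool],
) -> dict:
    """Return one of the three responses described above."""
    if oauth_required:
        if not auth_token:
            return {"error": "Authentication is required, please pass an 'auth_token'."}
        # Assumption: validate_token asks the third-party provider
        # whether the token belongs to a real user.
        if not validate_token(auth_token):
            return {"error": "The 'auth_token' that you passed is invalid."}
    # No OAuth flow required, or the token was valid: assign a worker_id.
    # Assumption: a random UUID is used as the ID.
    return {"worker_id": uuid.uuid4().hex}
```

When `oauth_required` is false the token is ignored entirely, which mirrors the insecure fallback described above.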
Connection Speed Test
Method in worker library: job.start()
HTTP endpoint: GET /federated/speed-test and POST /federated/speed-test
Socket "type": N/A
Query string: ?random=RANDOM HASH VALUE&worker_id=ID OF THE WORKER
This endpoint is HTTP only.
We need some way of getting a reliable average upload and download speed for a worker in order to potentially qualify them for joining an FL worker cycle. To do this, we need two endpoints at the same location: a GET route for testing worker download speed and a POST route for testing worker upload speed. In each route, a random query string value must be appended to the request to prevent the server or the worker from caching the result across multiple rounds.
When performing the download speed test, PyGrid will generate a random file of a certain size (to be determined) which the worker may download. The time it takes the worker to download will be captured by the worker and stored.
When performing the upload speed test, the worker will generate a random file of a certain size (to be determined) which will be uploaded to PyGrid (and then discarded). The time it takes the worker to upload will also be captured by the worker and stored.
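A rough sketch of the worker-side pieces of this proposal: building a cache-busting URL and converting a timed transfer into megabits per second. The helper names are hypothetical, the random value is generated with `secrets.token_hex` as one possible choice, and "Mbps" is taken to mean 1,000,000 bits per second.

```python
import secrets
from urllib.parse import urlencode


def speed_test_url(base_url: str, worker_id: str) -> str:
    """Build the speed-test URL with a fresh random value to defeat caching."""
    query = urlencode({"random": secrets.token_hex(16), "worker_id": worker_id})
    return f"{base_url}/federated/speed-test?{query}"


def throughput_mbps(num_bytes: int, seconds: float) -> float:
    """Convert a transfer of num_bytes in `seconds` to megabits per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    # Assumption: 1 Mbps = 1,000,000 bits per second.
    return (num_bytes * 8) / (seconds * 1_000_000)
```

The worker would time the GET or POST itself (for instance with `time.perf_counter()`) and pass the elapsed seconds and payload size to `throughput_mbps`.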
Note: The above is merely a proposal of how this workflow should work. The real-world solution should be determined and this document will be modified to fit the best solution we come up with. This paradigm should be heavily tested against real-world connection speed tests to ensure a reliable result. @Prtfw please do some extra research on this to cover our bases.
FL Worker Cycle Request
Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: POST /federated/cycle-request
Socket "type": federated/cycle-request
Request data:
{
"worker_id": "ID OF THE WORKER",
"model": "my-federated-model",
"version": "0.1.0",
"ping": "8ms",
"download": "46.3mbps",
"upload": "23.7mbps"
}
Note that version supplied above is an optional argument.
This endpoint is where the worker library (syft.js, KotlinSyft, or SwiftSyft) attempts to join an active federated learning cycle. PyGrid decides whether to accept the worker depending on the current state of the cycle, the speed of the worker's connection, and how many workers have already been chosen.
Given this information, PyGrid will send one of two responses:
Rejection
{
"status": "rejected",
"timeout": 2700,
"model": "my-federated-model",
"version": "0.1.0"
}
This means that the worker was rejected from the current cycle and asked to request to join another cycle in 2700 seconds. The number of seconds will depend on when the next cycle is expected to start. If a timeout is not sent, this means that it's the last cycle and there will not be another one to join.
Accepted
{
"status": "accepted",
"model": "my-federated-model",
"version": "0.1.0",
"request_key": "LONG HASH VALUE",
"plans": { "training_plan": "ID OF THE TRAINING PLAN", "another_plan": "ID OF ANOTHER PLAN" },
"client_config": "CLIENT CONFIG OBJECT",
"protocols": { "secure_agg_protocol": "ID OF THE PROTOCOL" },
"model_id": "ID OF THE MODEL"
}
In the event that the worker is accepted into the current cycle, they will be sent a named list of the IDs of the plans they need to execute, a named list of the IDs of the protocols they need to execute, the ID of the model, and the client config. The plans, protocols, and model will not be downloaded in this response. Instead, the worker will need to make additional requests to receive them (due to the size constraints of the response). They will pass the request_key given above as a form of "authenticating" each download request. This key is specific to the relationship between the worker AND the cycle and cannot be reused for future cycles or other workers. The download requests are detailed in the "Plan Download" section below.
Note that it is not possible for a worker to participate in the same cycle multiple times. The client creates a "job" request. If they are accepted, they should not be allowed to submit another job request for the same cycle.
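On the worker side, handling the two responses could look roughly like the sketch below. The function `next_action` and the shape of its return value are hypothetical and not part of any worker library; only the response keys come from the examples above.

```python
def next_action(response: dict) -> dict:
    """Map a cycle-request response to the worker's next step."""
    status = response.get("status")
    if status == "accepted":
        return {
            "action": "download",
            "request_key": response["request_key"],
            "plan_ids": list(response.get("plans", {}).values()),
            "protocol_ids": list(response.get("protocols", {}).values()),
            "model_id": response["model_id"],
        }
    if status == "rejected":
        timeout = response.get("timeout")
        if timeout is None:
            # No timeout: this was the last cycle, so do not retry.
            return {"action": "stop"}
        return {"action": "retry", "after_seconds": timeout}
    raise ValueError(f"unexpected status: {status!r}")
```

A worker would then either schedule another cycle request after `after_seconds`, stop, or move on to the download requests.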
Plan Download
Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-plan
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&plan_id=ID OF THE PLAN&receive_operations_as=list
This endpoint is HTTP only.
This method will allow a worker that has been accepted into a cycle to request the download of a plan from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.
The worker also needs to specify how it wants to receive plans: either as a list of operations ("list") or as TorchScript ("torchscript"), depending on the type of worker making the request (#437). This is set in the receive_operations_as parameter of the query string.
Response: This downloads the plan to the worker.
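As an illustration of the query string above, a worker might assemble the plan download URL as follows. `plan_download_url` is a hypothetical helper, and rejecting formats other than "list" and "torchscript" on the client is an assumption about where that validation should live.

```python
from urllib.parse import urlencode

# The two formats named in this issue.
ALLOWED_FORMATS = ("list", "torchscript")


def plan_download_url(
    base_url: str,
    worker_id: str,
    request_key: str,
    plan_id: str,
    receive_operations_as: str = "list",
) -> str:
    """Build the GET /federated/get-plan URL for one plan."""
    if receive_operations_as not in ALLOWED_FORMATS:
        raise ValueError(f"receive_operations_as must be one of {ALLOWED_FORMATS}")
    query = urlencode(
        {
            "worker_id": worker_id,
            "request_key": request_key,
            "plan_id": plan_id,
            "receive_operations_as": receive_operations_as,
        }
    )
    return f"{base_url}/federated/get-plan?{query}"
```

`urlencode` percent-encodes the values, so IDs or keys containing spaces or reserved characters remain safe to send.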
Protocol Download
Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-protocol
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&protocol_id=ID OF THE PROTOCOL
This endpoint is HTTP only.
This method will allow a worker that has been accepted into a cycle to request the download of a protocol from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.
Response: This downloads the protocol to the worker.
Model Download
Method in worker library: Also part of job.start() behind the scenes
HTTP endpoint: GET /federated/get-model
Socket "type": N/A
Query string: ?worker_id=ID OF THE WORKER&request_key=LONG HASH VALUE&model_id=ID OF THE MODEL
This endpoint is HTTP only.
This method will allow a worker that has been accepted into a cycle to request the download of a model from PyGrid. They need to submit their request_key provided in the cycle request call above. This provides an extra means of authentication for PyGrid to ensure it's sending data to the right worker.
Response: This downloads the model to the worker.
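All three download endpoints rely on PyGrid checking the request_key against the worker and cycle it was issued for. One way this could be sketched on the server side is shown below. Everything here is an assumption for illustration: the class `RequestKeyRegistry`, the in-memory storage, the `cycle_id` argument (the server is assumed to know which cycle the worker was accepted into), and the use of `secrets.token_hex` to generate keys.

```python
import hmac
import secrets
from typing import Dict, Tuple


class RequestKeyRegistry:
    """In-memory store of request keys, one per (worker, cycle) pair."""

    def __init__(self) -> None:
        self._keys: Dict[Tuple[str, str], str] = {}

    def issue(self, worker_id: str, cycle_id: str) -> str:
        """Create and remember a key when a worker is accepted into a cycle."""
        key = secrets.token_hex(32)
        self._keys[(worker_id, cycle_id)] = key
        return key

    def is_valid(self, worker_id: str, cycle_id: str, request_key: str) -> bool:
        """Check that the key was issued to this worker for this cycle."""
        expected = self._keys.get((worker_id, cycle_id))
        if expected is None:
            return False
        # Constant-time comparison avoids leaking key contents via timing.
        return hmac.compare_digest(expected.encode(), request_key.encode())
```

Because the key is stored per (worker, cycle) pair, a key from one cycle or worker fails validation for any other, matching the reuse restriction described in the cycle request section.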
Report
Method in worker library: job.report()
HTTP endpoint: POST /federated/report
Socket "type": federated/report
Request data:
{
"worker_id": "ID OF THE WORKER",
"request_key": "LONG HASH VALUE",
"diff": "FINAL MODEL DIFF FROM TRAINING"
}
This method will allow a worker that has been accepted into a cycle and finished training a model on their device to upload the resulting model diff.
If the worker was not given a protocol to run after the plan(s) were executed, then they will simply submit their entire model diff. If they want to manually add noise to this diff as a layer of protection, they may do so at the developer's discretion from inside the worker implementation.
If the worker did execute a protocol and they have finished the secure aggregation protocol with other workers, they will now receive a share of the resulting securely aggregated model diff. In this case, they will submit the share of the diff, rather than their original model diff. PyGrid will handle the decryption of the shares once they're all submitted.
Response: { "status": "success" }
The success status is sent when the request completes with an HTTP 200. The worker should not be informed whether the model diff was accepted or rejected as part of the global model update.
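The choice between submitting the raw diff and submitting a secure aggregation share can be sketched as below. `build_report` and its `aggregated_share` parameter are hypothetical names, and representing the diff as an already-serialized string is an assumption, since the wire format of the diff is not specified here.

```python
import json
from typing import Optional


def build_report(
    worker_id: str,
    request_key: str,
    diff: str,
    aggregated_share: Optional[str] = None,
) -> str:
    """Build the federated/report socket message.

    If a protocol was executed, the share of the aggregated diff is sent
    in place of the worker's own diff.
    """
    payload = diff if aggregated_share is None else aggregated_share
    return json.dumps(
        {
            "type": "federated/report",
            "data": {
                "worker_id": worker_id,
                "request_key": request_key,
                "diff": payload,
            },
        }
    )
```

For the HTTP route, the same "data" object would be sent as the body of POST /federated/report.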