Conversation
@remyleone if you want to have a look.
@JulienVdG 😍 thank you for taking a look at this issue. I definitely love the approach. I think that keeping a map of discovered endpoints in each node is great and will indeed help maintain stability in the cluster!
Also, thank you for adding the tests! I know these are very verbose and daunting.
One question: don't we need to remove the requirement in node.Ready for the node to have an endpoint?
Edit: as commented in #109 (comment), we don't need to modify node.Ready
```diff
-	updateNATEndpoints(nodes, peers, oldConf)
+	natEndpoints := updateNATEndpoints(nodes, peers, oldConf, m.logger)
+	nodes[m.hostname].DiscoveredEndpoints = natEndpoints
+	m.nodes[m.hostname].DiscoveredEndpoints = natEndpoints
```
nodes is a map of pointers to Nodes, so I think updating the nodes map should suffice and we should not need to update m.nodes as well
The applyTopology function starts by doing shallow copies of the Ready Nodes ...
Updating nodes will provide the updated endpoint to NewTopology.
Updating m.nodes ensures that the next call to checkIn will publish the DiscoveredEndpoints to k8s.
There are other ways to do it, but this looked easy enough.
Yes I definitely agree that we want both nodes[m.hostname] and m.nodes[m.hostname] to include the updated endpoints so that the checkIn method can update the endpoints annotation.
What I don't see is why we need to explicitly set .DiscoveredEndpoints = natEndpoints for both of them. Since nodes is a shallow copy, updating nodes[m.hostname] will also update m.nodes[m.hostname], no?
Well, the copy is not a pointer copy but a shallow copy, so it is more like:
https://play.golang.com/p/istIdrj__0_P
For it to work, I would need to make the endpoint list a pointer so that the copy points to the same map, e.g.:
https://play.golang.com/p/M2uunAQXRyI
I can do that if you think it's better that way.
Thinking about this some more, it occurs to me that the only reason we really had to do the shallow copies was to prevent the endpoints of the nodes from being overwritten by the updateNATEndpoints func, which used to mutate its parameters. Now that the func no longer mutates data, we could probably switch from shallow copies to just copying the pointers. We can do this in a follow-up, however.
```go
n.Endpoint = peer.Endpoint
level.Debug(logger).Log("msg", "WireGuard Update NAT Endpoint", "node", n.Name, "endpoint", peer.Endpoint, "former-endpoint", n.Endpoint, "same", n.Endpoint.Equal(peer.Endpoint))
// Only public ip ? (should check location leader but only in topology ... or have topology handle that list)
// Better check wg latest-handshake
```
So the idea is to only set the endpoint if the last handshake is not older than some constant, instead of how it is now?
So nodes that have an outdated endpoint will eventually forget about it and get the updated endpoint from another node's annotation?
That would be a great feature to add eventually
The bare minimum is: when latest-handshake is 0, WireGuard hasn't seen any connection yet, so it's better not to advertise the endpoint (in that case it's still equal to n.Endpoint anyway). Then yes, we could also detect outdated endpoints.
Also, in topology, instead of a static ordering we could sort by latest-handshake. The fact that we give priority to the currently detected endpoint already ensures stability, but adding more noise would allow trying the different endpoints one node could have (in case of a really complicated network).
The cost, however, is another call to wg (i.e. wg show kilo0 latest-handshakes), as the wg showconf call does not return the latest handshake.
```makefile
PKG := github.com/squat/$(PROJECT)
REGISTRY ?= index.docker.io
IMAGE ?= squat/$(PROJECT)
ifneq ($(REGISTRY),index.docker.io)
```
Would you mind splitting this up into a new PR so we can keep each chunk of work separate and still squash the commits related to NAT?
I missed the drop from this PR, sorry...
Compare IP first by default and compare DNS name first when we know the Endpoint was resolved.
Now that updateNATEndpoints was renamed to discoverNATEndpoints, and the endpoints are overridden by the topology instead of mutating the nodes and peers objects, we can safely drop this copy.
This one should be fine now, do you want me to squash the commits?
My approach is highly inspired by @squat's comments in #109, with one difference in the annotation usage:
The annotation is not used for a node to present its own endpoint as detected by others, but the other way around: a node publishes the NAT endpoints it has detected.
This has 2 advantages:
Then I updated the topology code to resolve the detected endpoint so that the local node will use:
I tested this on a cluster with public nodes and 2 NAT regions, and the NAT regions could then see each other.
The code probably still needs polishing.