Add LSQ leashing 0.30.0.1#18
Add LSQ leashing 0.30.0.1#18tweag-ev-ak wants to merge 12 commits intoleashing-stub-branch-for-prfrom
Conversation
* Update ouroboros-network ref * Add support for LeashID * Support explicit unleashing
There was a problem hiding this comment.
This review is probably more than you had intended to ask for! You offered to have a call to discuss this code with me and I said "looks small enough for me to look over myself first", and suddenly here I am saying things like "this code needs comments", which is a bit unfair since you just wanted to discuss the ideas with me.
But I figured a more thorough/serious review than requested will likely be more useful, so quit trying to prevent myself from settling into "review mode" and let it rip. I hope the result is helpful in the way you were hoping 🤞And we can still have that call if there's particular things you'd like to discuss synchronously.
| -- is called. | ||
| ChainDB.LoEEnabled $ | ||
| AF.Empty AF.AnchorGenesis | ||
| gnkaGDDArgs <- for (gcGDDConfig gcfg) $ \p -> do |
There was a problem hiding this comment.
TODO doesn't need to be monadic anymore
| leashingState <- readLeashingState | ||
| if not $ Map.null leashingState then | ||
| ChainDB.LoEEnabled <$> readLoEFragment | ||
| -- readLoEFragment >>= \case |
| -- ^ The LoE fragment. | ||
| -> StrictTVar m (ChainDB.GetLoEFragment m blk) | ||
| -> m () | ||
| setGetLoEFragment readLeashingState readGsmState readGenesisLoEFragment readLoEFragment varGetLoEFragment = |
There was a problem hiding this comment.
What's the difference between readLoEFragment and readGenesisLoEFragment?
My suspicion is that readLoEFragment should have a more distinct name now. TODO return here when I have an opinion about what that name should be
| {-# LANGUAGE TypeOperators #-} | ||
| {-# LANGUAGE UndecidableInstances #-} | ||
|
|
||
| {-# OPTIONS_GHC -Wno-unused-matches -Wno-unused-local-binds #-} |
There was a problem hiding this comment.
What's this about? Different versions of GHC disagree or something?
| import qualified Ouroboros.Network.AnchoredFragment as AF | ||
| import Ouroboros.Network.Protocol.LocalStateQuery.Type (LeashID) | ||
|
|
||
| type LeashingFingerprint blk = ([(LeashID, ChainHash (HeaderWithTime blk))], Maybe (ChainHash (HeaderWithTime blk))) |
There was a problem hiding this comment.
Please add comment explaining what the two components are.
I'm guessing: one LeashID per LSQ client in Acquired state, and the Just is from GSM's Syncing mode
| oldLoEFrag <- atomically $ swapTVar varLoEFrag loeFrag | ||
| -- The chain selection only depends on the LoE tip, so there | ||
| -- is no point in retriggering it if the LoE tip hasn't changed. | ||
| when (AF.headHash oldLoEFrag /= AF.headHash loeFrag) $ |
There was a problem hiding this comment.
Was moved to the Leashing.hs module ✔️
| , getDiffusionPipeliningSupport :: | ||
| DiffusionPipeliningSupport | ||
| , getBlockchainTime :: BlockchainTime m | ||
| , getLeashingStateVar :: StrictTVar m (ChainDB.LeashingState blk) |
There was a problem hiding this comment.
My instincts are that LeashingState (and so getLeashingStateVar etc) should be explicitly mentioning LSQ.
Recap of where the jargon comes from:
- The Limit on Eagerness (LoE) prevents a node syncing via Ouroboros Genesis from committing to a chain that would force it to disconnect from any peer while it's syncing. The
gddWatcheris already inspecting those peers' candidates, and so it is a convenient place to compute the least restrictive LoE anchor fragment that protects the current peers. - "Leashing" is a way an adversarial peer could abuse the LoE to prevent the syncing node from making progress. The Limit on Patience (LoP) and the Genesis Density Disconnector (GDD) cooperate to mitigate that attack vector.
- The PR that this feature adds is to intentionally let LocalStateQuery (LSQ) peers reuse the existing LoE in a way that intentionally leashes the node (regardless of whether it's syncing).
So I think "leashing state" is too vague. But "LsqLoE" or "LsqLeashing" etc would be sufficiently indicative.
There was a problem hiding this comment.
And the PR should be "Add LSQ leashing". Etc.
There was a problem hiding this comment.
For example, that would make it easier to notice/remember that getLeashingStateVar only exists so that the LSQ servers can keep that state up-to-date.
There was a problem hiding this comment.
I agree, that's a great idea 👍
| -- | The LoE fragment. It starts at a (recent) immutable tip and ends at | ||
| -- the common intersection of the candidate fragments. | ||
| StrictTVar m (AnchoredFragment (HeaderWithTime blk)) -> | ||
| StrictTVar m (Maybe (AnchoredFragment (HeaderWithTime blk))) -> |
There was a problem hiding this comment.
It wasn't obvious to me why this must be a Maybe. But I think the answer is: the leashingWatcher needs to know whether or not the GDD's written LoE fragment should be included. So this value is Nothing while Genesis isn't even bothering to keep it up-to-date but it's Just while Genesis is actually active.
A comment explaining that seems very useful.
There was a problem hiding this comment.
TODO double-check that a Nothing is definitely written to this TVar the Genesis State Machine (GSM) transitions into CaughtUp (aka "turns off" the GDD etc).
There was a problem hiding this comment.
It wasn't obvious to me why this must be a Maybe. But I think the answer is: the leashingWatcher needs to know whether or not the GDD's written LoE fragment should be included. So this value is Nothing while Genesis isn't even bothering to keep it up-to-date but it's Just while Genesis is actually active.
That's correct. Will add a comment.
There was a problem hiding this comment.
TODO double-check that a Nothing is definitely written to this TVar the Genesis State Machine (GSM) transitions into CaughtUp (aka "turns off" the GDD etc).
Good point, added the write action 👍
| (genesisArgs, setLoEinChainDbArgs) <- | ||
| mkGenesisNodeKernelArgs llrnGenesisConfig | ||
| genesisArgs <- mkGenesisNodeKernelArgs llrnGenesisConfig | ||
| varGetLoEFragment <- newTVarIO $ pure ChainDB.LoEDisabled |
There was a problem hiding this comment.
TODO I didn't scrutinize this very much. Seems plausible as just a copy-paste from what used to be in mkGenesisNodeKernelArgs.
| else | ||
| readGenesisLoEFragment >>= \case | ||
| Just glf -> do | ||
| -- leashing is disabled, run old behavior |
There was a problem hiding this comment.
See my other big comment about "names should mention LSQ".
I think "leashing is disabled" should say "LSQ leashing is disabled", for example.
There was a problem hiding this comment.
Separate concern: I don't think we need mutual exclusion "either the LSQ LoE or the GDD LoE". Because we can intersect them to correctly combine them, can't we?
There was a problem hiding this comment.
Separate concern: I don't think we need mutual exclusion "either the LSQ LoE or the GDD LoE". Because we can intersect them to correctly combine them, can't we?
Yes, that's what we do right now in the leashing watcher:
wNotify :: (LsqLeashingWatcherState blk) -> m ()
wNotify LsqLeashingWatcherState{..} = do
let
lsqLeashingCandidates = Map.toList lsqLeashingState
prefix = maybe curChain id genesisLoEFrag
loeFrag = fst $ sharedCandidatePrefix prefix lsqLeashingCandidates
if there is a genesis LoE fragment, we use it as a base for sharedCandidatePrefix, otherwise it's current chain. Is it a correct approach?
Only later, in setGetLoEFragment, we decide which the LoE fragment to return, basing on both fragment variables, the lsqLeashingState and the gsm state.
readLoEFragment in setGetLoEFragment returns the resulting LoE fragment calculated by lsqLeashingWatcher.
readGenesisLoEFragment returns the genesis loe fragment calculated by gddWatcher.
Sorry that I confused you because of all these variables mess. I wanted to keep the original implementation as much as possible and extend it rather than changing the existent one. I thought it would be easier to understand the additions. Turned out it only messed up the understanding.
We can do that final calculation in the lsq leashing watcher instead of setGetLoEFragment. I just couldn't decide what's the correct approach. Either we do all the logic in the watchers and remove the getLoEFragment from setGetLoEFragment (it will become just plain readLoEFragment) or combine the whole calculation in setGetLoEFragment (what I ended up with to keep the old behaviour and make our additions more clear).
|
Uh oh.... I'm anticipating a burdensome amount of conflicts with IntersectMBO#1891, since it's changing how Forkers are acquired. Maybe it won't be too bad 🤞 |
0ecf3e1 to
ce1fc45
Compare
ce1fc45 to
9274b49
Compare
9274b49 to
4210cb3
Compare
axman6
left a comment
There was a problem hiding this comment.
Just a few comments while reading through
| let | ||
| leashingFragment = case mpt of | ||
| ImmutableTip -> AF.Empty $ AF.anchor currentChain | ||
| SpecificPoint p -> AF.takeWhileOldest (\(HeaderWithTime h _) -> headerPoint h <= p) currentChain |
There was a problem hiding this comment.
There are several functions for splitting an AnchoredFragment efficiently which might be better than takeWhileOldest https://ouroboros-network.cardano.intersectmbo.org/ouroboros-network/api/Ouroboros-Network-AnchoredFragment.html#g:8
| ImmutableTip -> AF.Empty $ AF.anchor currentChain | ||
| SpecificPoint p -> AF.takeWhileOldest (\(HeaderWithTime h _) -> headerPoint h <= p) currentChain | ||
| VolatileTip -> currentChain | ||
| let newState = Map.insert leashId leashingFragment lsqLeashingState |
There was a problem hiding this comment.
Do we need to check if this leashId exists already? I think it's fine, I'm pretty sure we always want to update to this fragment, but just want to check it isn't a situation we haven't considered.
| Right forker | ||
| | Just leashId <- mLeashId -> do | ||
| traceM $ "My leash id " <> show leashId | ||
| atomically $ do |
There was a problem hiding this comment.
One thing that's worth adding to this discussion is that the main way that clients use LSQ is via short lived connections, so things like the Conn type will change for every query.
The querying functions explicitly always do Acquire -> Query -> Release -> Done (which also means a new thread is spun up for every query).
I've played around with making a slightly longer lived querying client but wasn't happy with it (hopefully the overhead of the extra round-trips isn't too large compared to the query itself). We might want to look at that again if the resource overhead on the node is a problem.
- Adds leashId to MsgDone message - Removes leashId from MsgRelease message
No description provided.