-
-
Notifications
You must be signed in to change notification settings - Fork 174
Description
Describe the bug
In a new production system that we are setting up for our customer, AM appears to hang under specific conditions. The exact conditions for causing this hang isn't very clear, but I need to fix it. I will write what I know about it, in this report.
For at least now, I believe it is related to SessionConstraint by catching a never-ending loop in the logs, related to calls to refresh(false) within DestroyOldestSession. The transaction ID remains the same, so I believe it is the same request that is never ending. Since this runs in Kubenetes, I couldn't really figure out how to attach a debugger to it... so it is regretable that nothing more concrete can be determined yet.
I will explain why I think this is so, below. I will also paste the relevant log sections in another comment.
To Reproduce
Steps to reproduce the behavior:
- Have 2FA, session constraints enabled. I use "Destroy oldest" strategy and limited the number of sessions to 1. I do not know if 2FA matters.
- Log in and out of the system.
- After some time, it may encounter excessive memory use and/or become unresponsive. After some time, it will likely crash due to OOM.
Expected behavior
OpenAM should enforce session constraints without issue.
Screenshots
Stack trace, in one of the threads that trigger never-ending reading of tokens:
In other case (not in the same run), there are too many instances of LDAPConnectionFactory etc, which is related to equally many instances of CachedConnectionPool:
Personally, I think that the memory issue is a symptom of the actual problem and is not really the problem itself. I don't really have strong evidence for this part, but I remember seeing connection pools getting created in the loop. If it happens fast enough, perhaps they are getting created faster than OpenDJ can handle.
Desktop (please complete the following information):
- Windows 10 Pro x64, 22H2
- Firefox x64
Additional context
I commented out this loop that was added here, like this. And the problem appears to have gone away:
...
// Step 3: checking the constraints
if (sessions != null && SystemProperties.getAsBoolean("org.openidentityplatform.openam.cts.quota.exhaustion.enabled", true)) {
sessions.remove(internalSession.getSessionID().toString());
//while (sessions.size() >= quota) {
reject = getQuotaExhaustionAction().action(internalSession, sessions);
//if (reject)
//break;
//}
...
However, I will only know for sure whether this matters, tomorrow. Usually, it happens very often whenever my customer is at work. I don't know why, but I assume it is caused by more sessions getting created and destroyed.
Within DestroyOldestAction, the action() method now looks like this:
public boolean action(InternalSession is, Map<String, Long> sessions) {
long smallestExpTime = Long.MAX_VALUE;
String oldestSessionID = null;
for (Map.Entry<String, Long> entry : sessions.entrySet())
if (!StringUtils.equals(is.getSessionID().toString(), entry.getKey())){
try {
Session session = new Session(new SessionID(entry.getKey()));
session.refresh(false);
long expTime = session.getTimeLeft();
if (expTime <= smallestExpTime) {
smallestExpTime = expTime;
oldestSessionID = entry.getKey();
}
} catch (SessionException ssoe) {}
}
if (oldestSessionID != null) {
destroy(oldestSessionID,sessions);
}
return false;
}
There is a call to refresh(), as seen inside the stack trace above.
If we go back further (before 9f36814), this same method looked like this at some point:
if (oldestSessionID != null) {
SessionID sessID = new SessionID(oldestSessionID);
try {
Session s = sessionCache.getSession(sessID);
debug.warning("destroy {}", sessID);
s.destroySession(s);
} catch (SessionException e) {
if (debug.messageEnabled()) {
debug.message("Failed to destroy the next expiring session.", e);
}
// deny the session activation request
// in this case
return true;
}
The exception flow may occur if the token no longer exists. The Session.refresh() method would throw this exception:
try {
RestrictedTokenContext.doUsing(activeContext,
new RestrictedTokenAction() {
public Object run() throws Exception {
doRefresh(reset);
return null;
}
});
} catch (Exception e) {
sessionCache.removeSID(sessionID);
if (sessionDebug.messageEnabled()) {
sessionDebug.message("session.Refresh " + "Removed SID:"
+ sessionID);
}
throw new SessionException(e);
}
This means that no token can be removed. However, the SessionConstraint class now has a loop, that will never exit if there is not at least 1 session destroyed.
I think that the original logic without the loop existed, because of this possibility that the session cannot be destroyed because it was already destroyed. In the original version, the creation of a new session would be blocked too.
But because of the code change by in 36677ca, it also now means that the session may always be created (if the loop in SessionConstraint does exit). I think, that could be permissible, if we can assume that the mechanism always works: the outcome is that the session is destroyed or that there is just no valid session to destroy. If not, then we may need to revert this change too.
But because there are no comments on why these changes were made in the first place, I do not know how to suggest a proper fix.
No longer blocking session creation, if old session cannot be destroyed: https://github.com/OpenIdentityPlatform/OpenAM/blame/02b7dd00e64302c7efe90a35fb43946c346dbb8c/openam-core/src/main/java/org/forgerock/openam/session/service/DestroyOldestAction.java#L82
Why exactly, do we need this loop? https://github.com/OpenIdentityPlatform/OpenAM/blame/28f00ee2f3601c5b2e22388f4b786abd5aafae78/openam-core/src/main/java/com/iplanet/dpro/session/service/SessionConstraint.java#L158
@vharseko would you be able to help explain why these changes were made in such a way, so that I can help to suggest a fix? Or would you already know what to do? Thank you.