[RTFACT-8735] Improving the HA recovery mechanism when a cluster member experiences an OOM issue Created: 07/Dec/15 Updated: 20/May/19 Resolved: 09/Aug/16
|Project:||Artifactory Binary Repository|
|Affects Version/s:||3.9.4, 4.3.0|
|Reporter:||Shay Bagants||Assignee:||Ofer Cohen (Inactive)|
When an HA cluster member experiences an OutOfMemory issue, the JVM can be left in an unstable state: some of the Artifactory processes and operations are forcibly stopped or fail due to the OOM, while others may still run.
|Comment by Doug Strick [ 24/Feb/16 ]|
I'd like to add that we just experienced another issue, where our primary node running Artifactory 3.9.4 in a 2-node cluster went down due to the OS becoming unresponsive. It looks like that was due to high IO activity. The primary node didn't receive a heartbeat from the slave within 30 seconds, so it removed it from the cluster. The primary node wasn't responding fast enough, so the load balancer marked it down and all traffic went to the slave node. Whenever users tried to upload, the slave node just returned a Hazelcast error. Even after the primary was restarted and everything showed the slave was added to the cluster, the slave still gave Hazelcast errors and had to be restarted.
|Comment by Matan Katz [ 09/Aug/16 ]|
An Artifactory node suffering from an OOM may end up in a limbo mode: other nodes may detect it as up and running, but it is not actually functional.
We added a flag to the JVM, "-XX:OnOutOfMemoryError=\"kill -9 %p\"", which stops Artifactory in case of an OutOfMemoryError, preventing Artifactory from remaining in limbo mode.
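As a rough sketch of how such a flag could be applied, the option can be appended to the JVM startup arguments. The variable name and file location below are assumptions for illustration; the actual place to set JVM options depends on how Artifactory is installed and launched.

```shell
# Hypothetical startup configuration snippet (exact file and variable
# name depend on the installation; shown here only as an illustration).
# On an OutOfMemoryError the JVM runs the given command, where %p is
# replaced by the JVM's own PID, so the process is killed outright
# instead of lingering in a half-functional "limbo" state.
export JAVA_OPTIONS="${JAVA_OPTIONS} -XX:OnOutOfMemoryError=\"kill -9 %p\""
```

Note that on JDK 8u92 and later, the simpler `-XX:+ExitOnOutOfMemoryError` flag achieves a similar effect (the JVM exits on the first OutOfMemoryError) without shelling out to `kill`.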