[RTFACT-8735] Improving the HA recovery mechanism when a cluster member experiences an OOM issue Created: 07/Dec/15  Updated: 20/May/19  Resolved: 09/Aug/16

Status: Resolved
Project: Artifactory Binary Repository
Component/s: High Availability
Affects Version/s: 3.9.4, 4.3.0
Fix Version/s: 4.11.1

Type: Improvement Priority: Normal
Reporter: Shay Bagants Assignee: Ofer Cohen (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Assigned QA: Matan Katz

 Description   

When an HA cluster member experiences an OutOfMemory error, the JVM can be left in an unstable state in which some Artifactory processes and operations are forcibly stopped or fail due to the OOM, while others may still run.
An example of such a state is a cluster node that no longer functions correctly because of an OOM, while the job responsible for updating the DB with the node's last heartbeat still runs. This can cause the other cluster members to keep trying to re-join this node, since the DB shows it as active, and is a potential source of further issues.



 Comments   
Comment by Doug Strick [ 24/Feb/16 ]

I'd like to add that we just experienced another issue where the primary node of our 2-node cluster, running Artifactory 3.9.4, went down due to the OS becoming unresponsive, apparently because of high IO activity. The primary node didn't receive a heartbeat from the slave within 30 seconds, so it removed the slave from the cluster. Because the primary node wasn't responding fast enough, the load balancer marked it down and all traffic went to the slave node. Whenever users tried to upload, the slave node just returned a Hazelcast error. Even after the primary was restarted and everything showed the slave as added back to the cluster, the slave still gave Hazelcast errors and had to be restarted.

Comment by Matan Katz [ 09/Aug/16 ]

An Artifactory node suffering from an OOM may end up in a limbo mode, where the other nodes may still detect it as up and running even though it is not actually functional.
The best practice is to kill an Artifactory node suffering from an OOM so that the other nodes can detect it as unavailable.

We added the JVM flag "-XX:OnOutOfMemoryError=\"kill -9 %p\"", which stops Artifactory in case of an "out of memory" error and prevents it from remaining in limbo mode.
This flag is added to "$ARTIFACTORY_HOME/bin/artifactory.default".
If you are upgrading from a previous version (RPM/Deb distributions or service installation), please make sure to add this flag to the /etc/opt/jfrog/artifactory/default file (or copy artifactory.default as-is to this location and rename it to default).
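For illustration, a minimal sketch of how the flag might be appended to the JVM options in that file; the JAVA_OPTIONS variable name and any existing option values are assumptions, so adapt the line to the actual contents of your installation's default file:

    # $ARTIFACTORY_HOME/bin/artifactory.default (or /etc/opt/jfrog/artifactory/default after an upgrade)
    # On OutOfMemoryError, kill the JVM process immediately (%p expands to the JVM's PID),
    # so the node stops updating its heartbeat and the other cluster members mark it unavailable.
    export JAVA_OPTIONS="$JAVA_OPTIONS -XX:OnOutOfMemoryError=\"kill -9 %p\""

Killing the process outright is intentional here: a half-alive node that keeps writing heartbeats is worse for the cluster than a node that is clearly down.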
