We're running Artifactory as a high availability cluster inside our EKS-hosted Kubernetes cluster. While doing some routine maintenance, we discovered that Artifactory never clears out HA nodes that have failed their heartbeats, even after long periods of time.
I assume this behavior exists because Artifactory was originally designed to run on bare metal or VMs that were expected to be durable. On Kubernetes, however, our nodes are containers that are constantly restarted, rescheduled, and terminated, so nodes that have failed their heartbeats and are never coming back accumulate over time, leaving us with a large amount of cruft.
According to JFrog support, the only way to clean these up is to delete the old nodes manually via the UI, which is not an enjoyable experience (for reference, I'm currently staring at 19 pages' worth of these dead nodes on our production cluster). Furthermore, the documentation I found on deleting HA nodes mentions that these old nodes can cause problems for the cluster overall: https://www.jfrog.com/confluence/display/RTF6X/Managing+the+HA+Cluster#ManagingtheHACluster-RemovinganUnusedNode
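In the meantime, the cleanup could in principle be scripted against the database instead of the UI. The sketch below just builds the SQL; it assumes the HA registry lives in an `artifactory_servers` table with a `last_heartbeat` column stored as epoch milliseconds. That schema is an assumption on my part and should be confirmed with JFrog support before running anything against a production database.

```python
import time


def build_stale_node_delete_sql(threshold_hours=24, now_ms=None):
    """Build a DELETE for HA nodes whose last heartbeat is older than the threshold.

    Assumes (unverified) that the HA node registry is the
    artifactory_servers table and that last_heartbeat is stored
    as epoch milliseconds.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    cutoff_ms = now_ms - threshold_hours * 3600 * 1000
    return (
        "DELETE FROM artifactory_servers "
        f"WHERE last_heartbeat < {cutoff_ms};"
    )
```

Even as a one-off, something like this would beat paging through the UI 19 times, but it only treats the symptom rather than the underlying accumulation.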
My ask is to have a garbage collector added to Artifactory that can be configured to clean up nodes whose heartbeats have been failing for longer than a set threshold. For example, I could enable this garbage collector on our cluster to remove any HA node that has failed its heartbeat for more than 24 hours, with the collector running once per day on a schedule.
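To illustrate the requested behavior, here is a minimal sketch of the selection logic such a collector might use. The node record shape, field names, and both functions are hypothetical illustrations of the proposal, not any existing Artifactory API:

```python
import time

# Configurable threshold: purge nodes whose heartbeat has been failing
# for more than 24 hours (hypothetical setting for the proposed feature).
STALE_THRESHOLD_SECONDS = 24 * 60 * 60


def find_stale_nodes(nodes, now=None):
    """Return nodes whose last successful heartbeat is older than the threshold."""
    now = now if now is not None else time.time()
    return [
        node for node in nodes
        if now - node["last_heartbeat"] > STALE_THRESHOLD_SECONDS
    ]


def purge_stale_nodes(nodes, now=None):
    """Drop stale nodes from the registry (illustrative; side effects elided).

    In the real feature this would run on a schedule (e.g. once per day)
    and delete the same dead entries the UI currently lists as failed.
    """
    stale = find_stale_nodes(nodes, now)
    for node in stale:
        print(f"removing dead HA node {node['id']}")
    return [n for n in nodes if n not in stale]
```

The key points are that the threshold and the schedule are both operator-configurable, so clusters with different churn rates (bare metal vs. Kubernetes) can tune it or leave it disabled entirely.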