When enabling event-based pull replication for a large number of repositories (say, several hundred), the target server can reach a thread pool exhaustion state on two levels:
- The simpler issue is with Tomcat HTTP threads. Event-based pull replication opens one connection per replicated repository, so 200 repos entail 200 open HTTP threads on each server (source and target: an "inbound" channel thread on the target and an "outbound" channel thread on the source). The customer will need to increase the maximum number of threads for inbound connections when working with a large number of repos.
- The more complex and severe issue, which is also harder to triage, is with async threads. On HA setups, each channel needs to use HA propagation to open corresponding channels on the other nodes in the cluster:
POST node1/.../channels/establishChannel -> | -----> POST node2/.../channels/establishChannel
                                            | -----> POST node3/.../channels/establishChannel
POST node2/.../channels/establishChannel -> | -----> POST node1/.../channels/establishChannel
                                            | -----> POST node3/.../channels/establishChannel
POST node3/.../channels/establishChannel -> | -----> POST node1/.../channels/establishChannel
                                            | -----> POST node2/.../channels/establishChannel
Each of the nested POST requests requires an art-exec thread from our cached async thread pool. By default, the async thread pool spawns up to num_of_cores * 4 threads. When working with large volumes, this thread pool is significantly over-utilized, sometimes creating a bottleneck of hundreds of propagation tasks in the async queue, leading to the following propagation error:
2020-01-21 16:48:39,192 [art-exec-19] [ERROR] (o.a.a.h.p.HaPropagationServiceImpl:301) - Error waiting for propagation event: null
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.artifactory.addon.ha.propagate.HaPropagationServiceImpl.getResponse(HaPropagationServiceImpl.java:297)
<SEE FULL STACK IN ATTACHED FILE>
This situation is very bad because of the snowball effect it creates - all art-exec threads are used up, and they are only cleaned up after all the queued tasks time out 3 times each (due to propagation retries), which can take up to 90 seconds per task (3 * 30s, the default propagation timeout). One obvious side effect is that the Heartbeat job will not be able to run for the entire period of exhaustion, impeding the node's availability.
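The starvation pattern can be reproduced with a minimal, self-contained sketch (plain Java, not Artifactory code): a bounded pool whose tasks block on nested tasks submitted to the same pool. The pool size, channel count, and timeout below are illustrative assumptions standing in for art-exec, the number of event-based repos, and the 30s propagation timeout.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Sketch of the described starvation: outer "establishChannel" tasks occupy all
// pool threads while blocking on nested "propagation" tasks queued behind them.
public class PropagationStarvationSketch {

    public static void main(String[] args) throws Exception {
        int poolSize = Runtime.getRuntime().availableProcessors() * 4; // default art-exec sizing
        ExecutorService artExec = Executors.newFixedThreadPool(poolSize);

        int channels = 200;       // e.g. 200 event-based replication channels
        int otherNodes = 2;       // 3-node HA cluster -> 2 propagation targets per channel
        long timeoutSeconds = 2;  // stand-in for the 30s default propagation timeout

        List<Future<?>> outer = new ArrayList<>();
        for (int c = 0; c < channels; c++) {
            outer.add(artExec.submit(() -> {
                // Each channel handler fans out to the other nodes and blocks its
                // pool thread while waiting for their responses.
                List<Future<String>> nested = new ArrayList<>();
                for (int n = 0; n < otherNodes; n++) {
                    nested.add(artExec.submit(() -> "OK")); // nested propagation call
                }
                for (Future<String> f : nested) {
                    try {
                        f.get(timeoutSeconds, TimeUnit.SECONDS);
                    } catch (TimeoutException e) {
                        // Same failure mode as "Error waiting for propagation event"
                        System.out.println("propagation timed out on " + Thread.currentThread().getName());
                    } catch (InterruptedException | ExecutionException e) {
                        Thread.currentThread().interrupt();
                    }
                }
                return null;
            }));
        }

        for (Future<?> f : outer) {
            f.get(); // wait for the demo to finish
        }
        artExec.shutdown();
    }
}

Because the nested tasks sit in the queue behind the remaining outer tasks, the threads that hold the pool spend the full timeout waiting for work that cannot start, which is the exhaustion window described above.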
Suggested solutions:
- Event-based pull replication should use a dedicated thread pool whose size is calculated dynamically from the number of repos configured with event-based pull, or use a significantly higher max-threads limit (say, up to 100), configurable via a well-documented property (see the sketch after this list).
- Until something like #1 is implemented, document this limitation and instruct users to increase the async core pool size accordingly when needed.
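A sketch of suggestion #1, under the assumption that the number of event-based repos is known when the pool is created. The class name, thread names, and the property name are hypothetical illustrations, not an existing Artifactory API.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Dedicated pool for event-based pull propagation, sized from the number of
// repos configured with event-based pull and capped by a documented property.
public class EventReplicationPoolFactory {

    // Hypothetical property name, used here only for illustration.
    private static final String MAX_THREADS_PROP = "replication.event.propagation.maxThreads";
    private static final int DEFAULT_MAX_THREADS = 100;

    public static ExecutorService create(int eventBasedRepoCount) {
        int cap = Integer.getInteger(MAX_THREADS_PROP, DEFAULT_MAX_THREADS);
        // One propagation thread per event-based repo, bounded by the configured
        // cap and never fewer than the current default of cores * 4.
        int floor = Runtime.getRuntime().availableProcessors() * 4;
        int size = Math.max(floor, Math.min(eventBasedRepoCount, cap));

        ThreadFactory namedThreads = new ThreadFactory() {
            private final AtomicInteger counter = new AtomicInteger();
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "event-repl-propagation-" + counter.incrementAndGet());
                t.setDaemon(true);
                return t;
            }
        };
        return Executors.newFixedThreadPool(size, namedThreads);
    }
}

Keeping propagation work on its own pool would also prevent a replication backlog from starving unrelated async jobs such as the Heartbeat.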
Related issue: RTFACT-23649 - Event-based pull replication fails for https request with Nginx and Apache (status: Done)