Artifactory Binary Repository
RTFACT-21262

Exhaustion of async pool when enabling event-based pull replication on HA


    Details

    • Type: Bug
    • Status: Done
    • Priority: 3 - High
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 7.5.0, 6.20.0
    • Component/s: None
    • Severity: High
    • At Risk: High

      Description

      When enabling event-based pull replication for a large number of repositories (say, several hundred), the target server can reach thread-pool exhaustion on two levels:

      1. The simpler issue is with Tomcat HTTP threads. Event-based pull replication opens one connection per replicated repository, so 200 repos entail 200 open HTTP threads on each server: an "inbound" channel thread on the target and an "outbound" channel thread on the source. Customers will need to raise the maximum number of inbound connection threads when working with large volumes of repos (see the server.xml sketch below).
      2. The more complex and severe issue, which is also harder to triage, is with async threads. On HA setups, each channel needs HA propagation to open corresponding channels on the other nodes in the cluster:
      POST node1/.../channels/establishChannel ->
        |
        -----> POST node2/.../channels/establishChannel
        |
        -----> POST node3/.../channels/establishChannel
      
      POST node2/.../channels/establishChannel ->
        |
        -----> POST node1/.../channels/establishChannel
        |
        -----> POST node3/.../channels/establishChannel
      
      POST node3/.../channels/establishChannel ->
        |
        -----> POST node1/.../channels/establishChannel
        |
        -----> POST node2/.../channels/establishChannel

      Each of the nested POST requests requires an art-exec thread from our cached async thread pool. By default, the async thread pool spawns up to num_of_cores * 4 threads. At large volumes this pool becomes significantly over-utilized, sometimes creating a backlog of hundreds of propagation tasks in the async queue and leading to the following propagation error:

      2020-01-21 16:48:39,192 [art-exec-19] [ERROR] (o.a.a.h.p.HaPropagationServiceImpl:301) - Error waiting for propagation event: null
      java.util.concurrent.TimeoutException: null
      at java.util.concurrent.FutureTask.get(FutureTask.java:205)
      at org.artifactory.addon.ha.propagate.HaPropagationServiceImpl.getResponse(HaPropagationServiceImpl.java:297)
      
      <SEE FULL STACK IN ATTACHED FILE>
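
      The mechanism is easy to reproduce outside Artifactory with a bounded pool and a timed Future.get, mirroring the defaults described above. A minimal, self-contained sketch (not Artifactory code; the pool size, task counts, and timings are illustrative):

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.*;

          public class AsyncPoolExhaustionDemo {
              public static void main(String[] args) throws Exception {
                  int cores = Runtime.getRuntime().availableProcessors();
                  // Default async pool cap described above: num_of_cores * 4
                  ExecutorService artExec = Executors.newFixedThreadPool(cores * 4);

                  // Channel threads are long-lived, so a saturated pool stays saturated.
                  Callable<Void> channelTask = () -> {
                      Thread.sleep(120_000);
                      return null;
                  };

                  // 200 event-pull repos on a 3-node cluster: each establishChannel
                  // fans out to the 2 other nodes, so ~400 propagation tasks arrive.
                  List<Future<Void>> pending = new ArrayList<>();
                  for (int i = 0; i < 200 * 2; i++) {
                      pending.add(artExec.submit(channelTask));
                  }

                  try {
                      // The propagation service waits 30s for each response; tasks
                      // queued behind the saturated pool never even start.
                      pending.get(pending.size() - 1).get(30, TimeUnit.SECONDS);
                  } catch (TimeoutException e) {
                      System.out.println("Error waiting for propagation event: " + e);
                  } finally {
                      artExec.shutdownNow();
                  }
              }
          }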

      This situation is particularly bad because of the snowball effect it creates: all art-exec threads are used up and are reclaimed only after every queued task times out three times (due to propagation retries), which can take up to 90 seconds per task (3 x the default 30s propagation timeout). One obvious side effect is that the Heartbeat job cannot run for the entire period of exhaustion, impeding the node's availability.
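
      For the HTTP-thread side (item 1 above), the inbound cap is plain Tomcat configuration. A sketch, assuming the stock HTTP connector in server.xml (the port and values are illustrative; Tomcat's default maxThreads is 200):

          <!-- $ARTIFACTORY_HOME/tomcat/conf/server.xml -->
          <Connector port="8081" protocol="HTTP/1.1"
                     maxThreads="500" connectionTimeout="20000"/>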

       

      Suggested solutions:

      1. Event-based pull replication should use a dedicated thread pool whose size is calculated dynamically from the number of repositories configured with event-based pull, or a significantly higher max-thread limit (say, up to 100, configurable via a well-documented property); see the sketch after this list.
      2. Until something like #1 is implemented, document this limitation and instruct users to increase the async core pool size accordingly when needed.
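
      A minimal sketch of what #1 could look like (the class, method, and property names are hypothetical, not the actual implementation):

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          public final class ReplicationChannelPools {
              // Hypothetical property capping the dedicated pool (default 100, as suggested above).
              private static final int CAP =
                      Integer.getInteger("replication.eventPull.maxThreads", 100);

              // One pool sized by the number of repos configured with event-based pull.
              public static ExecutorService forEventBasedPull(int eventPullRepoCount) {
                  int size = Math.max(1, Math.min(eventPullRepoCount, CAP));
                  ThreadPoolExecutor pool = new ThreadPoolExecutor(
                          size, size, 60L, TimeUnit.SECONDS,
                          new LinkedBlockingQueue<>(),
                          r -> new Thread(r, "repl-channel"));
                  pool.allowCoreThreadTimeOut(true); // shrink the pool when replication is idle
                  return pool;
              }
          }

      For #2, the async pool size is controlled by the artifactory.async.corePoolSize system property, e.g.:

          # $ARTIFACTORY_HOME/etc/artifactory.system.properties
          artifactory.async.corePoolSize = 128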

        Attachments

                People

                Assignee: nadavy Nadav Yogev
                Reporter: uriahl Uriah Levy
                Votes: 5
                Watchers: 15
