Uploaded image for project: 'Artifactory Binary Repository'
  1. Artifactory Binary Repository
  2. RTFACT-20629

While using docker pull the execution time is 1.6 secs with DISTINCT in db query. Execution time reduces drastically without DISTINCT in the DB query



    • Type: Improvement
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: 6.10.8
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      Customer is using Enterprise+ 6.10.8 with Postgres. On this and previous versions they have been seeing sporadic but persistent read timeouts for pulls from our docker registry (v2 registry – 7 million artifacts). They have also been seeing CPU usage spikes in our database coinciding with the timeouts and also latency of 1.8 seconds on response from Artifactory also coincident with these timeouts. The machines in question are adequately provisioned.

      The offending query is as follows:

      select distinct
      {{ n.repo as itemRepo,}}
      {{ n.node_path as itemPath,}}
      {{ n.node_name as itemName,}}
      {{ n.created as itemCreated,}}
      {{ n.modified as itemModified,}}
      {{ n.updated as itemUpdated,}}
      {{ n.created_by as itemCreatedBy,}}
      {{ n.modified_by as itemModifiedBy,}}
      {{ n.node_type as itemType,}}
      {{ n.bin_length as itemSize,}}
      {{ n.node_id as itemId,}}
      {{ n.depth as itemDepth,}}
      {{ n.sha1_actual as itemActualSha1,}}
      {{ n.sha1_original as itemOriginalSha1,}}
      {{ n.md5_actual as itemActualMd5,}}
      {{ n.md5_original as itemOriginalMd5,}}
      {{ n.sha256 as itemSha2}}
      from nodes n
      where ( n.node_name like $1 escape '^' and n.repo = $2) and n.node_type = $3

      Analysis on the query is as follows:

      Limit (cost=417490.41..417490.86 rows=10 width=408) (actual time=16088.515..16088.569 rows=10 loops=1)
      -> Unique (cost=417490.41..452121.47 rows=769579 width=408) (actual time=16088.514..16088.565 rows=10 loops=1)
      -> Sort (cost=417490.41..419414.36 rows=769579 width=408) (actual time=16088.513..16088.556 rows=10 loops=1)
      Sort Key: repo, node_path, node_name, created, modified, updated, created_by, modified_by, node_type, bin_length, node_id, depth, sha1_actual, sha1_original, md5_a
      ctual, md5_original, sha256
      Sort Method: external merge Disk: 318648kB
      -> Seq Scan on nodes n (cost=0.00..58161.79 rows=769579 width=408) (actual time=0.004..273.049 rows=769419 loops=1)
      Planning time: 0.125 ms
      Execution time: 16161.358 ms
      (8 rows)

      This is 1.6 secs which is slow. They identified that the use of DISTINCT in the query as the problem. The customer states: “distinct on non-indexed fields needs to compare all the values of those fields and since that didn't fit in the allowed sort memory buffer for the db connection, postgres had no other option then to sort them on disk”

      See the line in the previously quoted log: “Sort Method: external merge Disk: 318648kB”

      Without the DISTINCT the execution time drops from 1.6 seconds to 0.025ms:

      Limit (cost=0.00..0.76 rows=10 width=408) (actual time=0.005..0.008 rows=10 loops=1)
      -> Seq Scan on nodes n (cost=0.00..58161.79 rows=769579 width=408) (actual time=0.005..0.008 rows=10 loops=1)
      Planning time: 0.052 ms
      Execution time: 0.025 ms

      The customer wants to remove this latency and these sporadic timeouts for the docker pull.

      The customer has the below questions:

      1. What causes this query to be generated – it looks like it might be generated from a simple GET for a specific artifact.
      2. Whether the use of DISTINCT is actually necessary for this call as a starting point.




            swarnenduk Swarnendu Kayal
            0 Vote for this issue
            2 Start watching this issue