  Artifactory Binary Repository / RTFACT-14296

Docker hitting scaling issues when a particular base layer is a component of thousands of images (Part 2)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: High
    • Resolution: Unresolved
    • Affects Version/s: 4.12.1, 4.14.2, 5.3.1
    • Fix Version/s: None
    • Component/s: Docker
    • Labels: None

      Description

      This issue was partially addressed in RTFACT-12909; however, the query is still very slow and still needs work. I have updated the text to reflect only the remaining work required.

      We are experiencing significant slowdowns with Docker repositories hosted in Artifactory. One of the issues appears to be a scaling problem in how Docker layers are handled: when the same layer is shared by thousands of images, Docker operations involving that layer become progressively slower.

      Query #1:

      [2017-05-28 04:06:39.326 UTC] <artifactory@127.0.0.1(37430)> artifactory: PostgreSQL JDBC Driver: LOG:  duration: 145.988 ms  execute S_14: select distinct  n.repo as itemRepo,n.node_id as itemId,n.node_path as itemPath,n.node_name as itemName,n.node_type as itemType from  nodes n where ( n.repo = $1 and n.depth >= $2 and n.node_name = $3 and n.node_type = $4) 
      [2017-05-28 04:06:39.326 UTC] <artifactory@127.0.0.1(37430)> artifactory: PostgreSQL JDBC Driver: DETAIL:  parameters: $1 = 'docker-v2-local', $2 = '1', $3 = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', $4 = '1'
      [2017-05-28 04:06:39.340 UTC] <artifactory@127.0.0.1(33921)> artifactory: PostgreSQL JDBC Driver: LOG:  duration: 169.690 ms  execute S_11: select distinct  n.repo as itemRepo,n.node_id as itemId,n.node_path as itemPath,n.node_name as itemName,n.node_type as itemType from  nodes n where ( n.repo = $1 and n.depth >= $2 and n.node_name = $3 and n.node_type = $4) 
      [2017-05-28 04:06:39.340 UTC] <artifactory@127.0.0.1(33921)> artifactory: PostgreSQL JDBC Driver: DETAIL:  parameters: $1 = 'docker-v2-local', $2 = '1', $3 = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', $4 = '1'
      [2017-05-28 04:06:39.349 UTC] <artifactory@127.0.0.1(39528)> artifactory: PostgreSQL JDBC Driver: LOG:  duration: 183.926 ms  execute <unnamed>: select distinct  n.repo as itemRepo,n.node_id as itemId,n.node_path as itemPath,n.node_name as itemName,n.node_type as itemType from  nodes n where ( n.repo = $1 and n.depth >= $2 and n.node_name = $3 and n.node_type = $4) 
      [2017-05-28 04:06:39.349 UTC] <artifactory@127.0.0.1(39528)> artifactory: PostgreSQL JDBC Driver: DETAIL:  parameters: $1 = 'docker-v2-local', $2 = '1', $3 = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', $4 = '1'
      [2017-05-28 04:06:39.364 UTC] <artifactory@127.0.0.1(39484)> artifactory: PostgreSQL JDBC Driver: LOG:  duration: 164.312 ms  execute <unnamed>: select distinct  n.repo as itemRepo,n.node_id as itemId,n.node_path as itemPath,n.node_name as itemName,n.node_type as itemType from  nodes n where ( n.repo = $1 and n.depth >= $2 and n.node_name = $3 and n.node_type = $4) 
      [2017-05-28 04:06:39.364 UTC] <artifactory@127.0.0.1(39484)> artifactory: PostgreSQL JDBC Driver: DETAIL:  parameters: $1 = 'docker-v2-local', $2 = '1', $3 = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', $4 = '1'
      

      Here is the PostgreSQL "explain analyze" output for this query:

      artifactory=# explain analyze select distinct  n.repo as itemRepo,n.node_id as itemId,n.node_path as itemPath,n.node_name as itemName,n.node_type as itemType from  nodes n where ( n.repo = 'docker-v2-local' and n.depth >= 1 and n.node_name = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4' and n.node_type = 1);
                                                                           QUERY PLAN                                                                     
      ----------------------------------------------------------------------------------------------------------------------------------------------------
       Unique  (cost=21833.97..21863.11 rows=3885 width=101) (actual time=107.094..124.709 rows=15495 loops=1)
         ->  Sort  (cost=21833.97..21843.68 rows=3885 width=101) (actual time=107.088..115.748 rows=15495 loops=1)
               Sort Key: node_id, node_path
               Sort Method: external sort  Disk: 2200kB
               ->  Bitmap Heap Scan on nodes n  (cost=627.29..21602.35 rows=3885 width=101) (actual time=15.222..66.333 rows=15495 loops=1)
                     Recheck Cond: ((node_name)::text = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'::text)
                     Filter: ((depth >= 1) AND ((repo)::text = 'docker-v2-local'::text) AND (node_type = 1))
                     Rows Removed by Filter: 142
                     Heap Blocks: exact=11839
                     ->  Bitmap Index Scan on nodes_node_name_idx  (cost=0.00..626.32 rows=15436 width=0) (actual time=10.049..10.049 rows=15641 loops=1)
                           Index Cond: ((node_name)::text = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'::text)
       Planning time: 0.532 ms
       Execution time: 129.258 ms
      (13 rows)
      

      Because this SHA-256 appears more than 15,000 times, a large portion of the table is scanned and returned. As the log excerpt above shows, this query also runs frequently (many times a second).
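
      As a side note, the plan above spills its sort to disk ("external sort  Disk: 2200kB"). A quick session-level experiment, assuming direct database access, shows how much of the ~129 ms is sort overhead rather than the scan itself:

      -- Raise work_mem for this session only, so the sort behind DISTINCT can
      -- run in memory instead of spilling to disk, then re-run the same query.
      SET work_mem = '32MB';
      EXPLAIN ANALYZE
      SELECT DISTINCT n.repo AS itemRepo, n.node_id AS itemId, n.node_path AS itemPath,
             n.node_name AS itemName, n.node_type AS itemType
      FROM nodes n
      WHERE n.repo = 'docker-v2-local'
        AND n.depth >= 1
        AND n.node_name = 'sha256__a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
        AND n.node_type = 1;

      Even with the sort in memory, all 15,495 matching rows still have to be read and de-duplicated, so this only trims a constant factor; it does not fix the scaling problem.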

      Query #2:

      This is the query I had originally raised a concern about.

      [2017-05-28 04:08:57.262 UTC] <artifactory@127.0.0.1(39532)> artifactory: PostgreSQL JDBC Driver: LOG:  duration: 724.550 ms  execute S_13: select distinct  n.repo as itemRepo,n.node_path as itemPath,n.node_name as itemName,n.created as itemCreated,n.modified as itemModified,n.updated as itemUpdated,n.created_by as itemCreatedBy,n.modified_by as itemModifiedBy,n.node_type as itemType,n.bin_length as itemSize,n.node_id as itemId,n.depth as itemDepth,n.sha1_actual as itemActualSha1,n.sha1_original as itemOriginalSha1,n.md5_actual as itemActualMd5,n.md5_original as itemOriginalMd5 from  nodes n left outer join node_props np100 on np100.node_id = n.node_id where ( n.repo = $1 and( np100.prop_key = $2 and  substr(np100.prop_value::text, 1, 255) = substr($3, 1, 255))) and n.node_type = $4 limit  10 
      [2017-05-28 04:08:57.262 UTC] <artifactory@127.0.0.1(39532)> artifactory: PostgreSQL JDBC Driver: DETAIL:  parameters: $1 = 'docker-v2-local', $2 = 'sha256', $3 = 'a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', $4 = '1'
      

      The explain analyze output:

      artifactory=# explain analyze select distinct  n.repo as itemRepo,n.node_path as itemPath,n.node_name as itemName,n.created as itemCreated,n.modified as itemModified,n.updated as itemUpdated,n.created_by as itemCreatedBy,n.modified_by as itemModifiedBy,n.node_type as itemType,n.bin_length as itemSize,n.node_id as itemId,n.depth as itemDepth,n.sha1_actual as itemActualSha1,n.sha1_original as itemOriginalSha1,n.md5_actual as itemActualMd5,n.md5_original as itemOriginalMd5 from  nodes n left outer join node_props np100 on np100.node_id = n.node_id where ( n.repo = 'docker-v2-local' and( np100.prop_key = 'sha256' and  substr(np100.prop_value::text, 1, 255) = substr('a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', 1, 255))) and n.node_type = 1 limit  10;
                                                                                                           QUERY PLAN
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=37076.77..37077.15 rows=10 width=262) (actual time=475.014..475.037 rows=10 loops=1)
         ->  Unique  (cost=37076.77..37130.77 rows=1440 width=262) (actual time=475.011..475.033 rows=10 loops=1)
               ->  Sort  (cost=37076.77..37080.37 rows=1440 width=262) (actual time=475.010..475.019 rows=10 loops=1)
                      Sort Key: n.node_path, n.node_name, n.created, n.modified, n.updated, n.created_by, n.modified_by, n.bin_length, n.node_id, n.depth, n.sha1_actual, n.sha1_original, n.md5_actual, n.md5_original
                     Sort Method: external merge  Disk: 4424kB
                     ->  Nested Loop  (cost=706.84..37001.23 rows=1440 width=262) (actual time=14.082..280.463 rows=15495 loops=1)
                           ->  Bitmap Heap Scan on node_props np100  (cost=706.42..10819.79 rows=5719 width=8) (actual time=14.038..90.829 rows=15637 loops=1)
                                 Recheck Cond: (substr((prop_value)::text, 1, 255) = 'a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'::text)
                                 Filter: ((prop_key)::text = 'sha256'::text)
                                 Heap Blocks: exact=7311
                                 ->  Bitmap Index Scan on node_props_prop_value_idx  (cost=0.00..704.99 rows=15925 width=0) (actual time=12.277..12.277 rows=31172 loops=1)
                                       Index Cond: (substr((prop_value)::text, 1, 255) = 'a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'::text)
                           ->  Index Scan using nodes_pk on nodes n  (cost=0.42..4.57 rows=1 width=262) (actual time=0.010..0.011 rows=1 loops=15637)
                                 Index Cond: (node_id = np100.node_id)
                                 Filter: (((repo)::text = 'docker-v2-local'::text) AND (node_type = 1))
                                 Rows Removed by Filter: 0
       Planning time: 1.467 ms
       Execution time: 477.136 ms
      (18 rows)
      

      This involves the same layer and is much more expensive. Which of the two queries is the bigger concern depends on what users are doing; when this one runs multiple times a second, it overloads our server. The LIMIT 10 that was added, along with the effective use of the substr index, helps, but the core problem is still here.

      I think the fundamental problem here is that:

      1) nodes has tens of thousands of matches for the repo key.
      2) node_props has 15,000+ matches for the SHA-256.
      3) The results cannot be produced without an expensive join between the two tables.

      If you have the option, I would strongly suggest pulling the SHA-256 up into the nodes table. This would give more efficient storage and more efficient searching: with an index on the SHA-256 and the repo key, results could be identified very quickly.
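
      A rough sketch of what I mean (the column and index names here are made up, and a real change would of course need a data migration):

      -- Hypothetical: store the layer digest directly on nodes, indexed
      -- together with the repo key.
      ALTER TABLE nodes ADD COLUMN sha256 CHAR(64);
      CREATE INDEX nodes_repo_sha256_idx ON nodes (repo, sha256);

      -- A digest lookup then becomes a single index scan, with no join
      -- through node_props at all:
      SELECT n.repo, n.node_id, n.node_path, n.node_name
      FROM nodes n
      WHERE n.repo = 'docker-v2-local'
        AND n.sha256 = 'a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'
        AND n.node_type = 1
      LIMIT 10;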

      For the second query, I have seen evidence that dropping DISTINCT can bring the query under 5 milliseconds, compared to the current 400+ ms. This is because, with the LIMIT 10, the nested loop join (rather than a hash join) is very effective at finding the first matches quickly. This does not help with the first query, however.
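
      For reference, the variant I am describing is query #2 with the DISTINCT removed (select list abbreviated here; the real query returns all sixteen item columns):

      -- Without DISTINCT there is no sort step, so the nested loop join can
      -- stream rows out and stop as soon as the LIMIT is satisfied.
      EXPLAIN ANALYZE
      SELECT n.repo AS itemRepo, n.node_path AS itemPath, n.node_name AS itemName,
             n.node_id AS itemId, n.node_type AS itemType
      FROM nodes n
      LEFT OUTER JOIN node_props np100 ON np100.node_id = n.node_id
      WHERE n.repo = 'docker-v2-local'
        AND np100.prop_key = 'sha256'
        AND substr(np100.prop_value::text, 1, 255) =
            substr('a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4', 1, 255)
        AND n.node_type = 1
      LIMIT 10;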

      I think this issue still needs to be addressed to ensure that Artifactory scales when supporting Docker. This SHA-256 refers to the image layer produced when a Dockerfile step adds environment variables but does not change any file content, which is an extremely common situation with Docker.
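
      To illustrate (a made-up Dockerfile): the ENV instruction below changes image metadata only and adds no file content, so clients record the well-known "empty layer" digest (sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4) for that step, and thousands of otherwise unrelated images end up referencing the same layer in the registry.

      # Hypothetical Dockerfile: FROM pulls real filesystem content, while ENV
      # produces a metadata-only step that maps to the shared empty layer.
      FROM alpine:3.6
      ENV APP_ENV=production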

      People

      • Assignee: Unassigned
      • Reporter: mark.mielke (Mark Mielke)
      • Votes: 4
      • Watchers: 4
