[RTFACT-12909] Docker hitting scaling issues when a particular base layer is a component of thousands of images Created: 28/Nov/16 Updated: 15/Oct/18 Resolved: 26/Mar/17
|Project:||Artifactory Binary Repository|
|Affects Version/s:||4.12.1, 4.14.2|
|Reporter:||Mark Mielke||Assignee:||Yuval Reches|
|Sprint:||Leap 7, Leap 9|
We are experiencing significant slowdown due to our use of Docker repository hosting with Artifactory. One of the issues we are facing appears to be a scaling issue in the Docker handling of layers: when the same layer is used by thousands of images, the processing overhead for Docker operations grows progressively. This is the problematic query:
In our case, we have a particular layer that shows up in 10,000+ images. Note that in the below results, it is taking advantage of an index I created, as described here:
This previously took 250+ milliseconds with 12,000+ matching nodes. We removed several hundred obsolete Docker images and brought it down to just over 10,000 matching nodes. This now takes 170 - 200 milliseconds to query:
To join and output 10,000 records it is taking 173 milliseconds. Does the Docker plugin really need to get ALL of the results from this query? Or might it be satisfied with only a few results?
In our case, this query is happening many times a second during peak, where peak is during continuous build operations that invoke Docker push and pull from Artifactory. Before I implemented the index hack, it was reporting a variety of SHA-256 values. After I implemented the index hack to speed up how quickly it finds the SHA-256 values, only a few values such as the one above remain a concern. Because the 10,000+ results are being prepared and processed multiple times a second, it is really affecting the performance of Artifactory as a whole. It is causing Artifactory to be laggy during the day for all users - whether of the UI, Maven, Docker, or many of the others.
|Comment by Mark Mielke [ 30/Nov/16 ]|
After further investigation, we determined that this particular layer is common to most images, as it comes from a very basic set of Dockerfile instructions. I expect that if you query any Artifactory database that supports Docker and has a large number of images, you will find that some layers are much more common than others, and you may actually have this exact same layer in your own database.
|Comment by Brandon Sanders [ 11/Jan/17 ]|
We have this same layer on 190,000 images in our repository and due to regulatory requirements it is problematic for us to prune those images. Even with the index from
|Comment by Mark Mielke [ 13/Mar/17 ]|
I was investigating whether we could work around this in any way until JFrog deploys a patch, and I encountered two aspects I want to make sure get covered:
1) The "SELECT DISTINCT" forces PostgreSQL to fetch all the records and de-duplicate them before it can start returning rows. So I think the "DISTINCT" needs to be dropped to turn this into any sort of partial query. In our case, we're up to 17,000 matching rows for the "empty layer" used to create environment variables, and the PostgreSQL side alone is taking 500+ ms.
2) The current implementation uses an "AqlEagerResult". If I am understanding this correctly, findBlobsGlobally() runs canRead() and the other checks on all 17,000 results before dumping them into a List and passing it to the caller. So not only is the PostgreSQL side doing quite a lot of processing, but the Java side is too.
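The DISTINCT behaviour in point 1 can be sketched with SQLite as a stand-in for PostgreSQL (the `nodes`/`sha256` schema below is a hypothetical simplification, not Artifactory's real schema): with DISTINCT the engine must de-duplicate the whole result set before emitting anything, whereas a plain query streams rows and a LIMIT can stop the scan early.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER, sha256 TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(i, "abc123") for i in range(1000)],  # one layer digest shared by many nodes
)

# DISTINCT: the engine builds a temp structure to de-duplicate before returning rows.
plan_distinct = conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT sha256 FROM nodes"
).fetchall()

# Plain query: a simple streaming scan, so LIMIT 1 can return immediately.
plan_plain = conn.execute(
    "EXPLAIN QUERY PLAN SELECT sha256 FROM nodes LIMIT 1"
).fetchall()

print(plan_distinct)  # typically shows "USE TEMP B-TREE FOR DISTINCT"
print(plan_plain)     # a plain table scan, no temp structure
```

PostgreSQL's planner differs in detail (HashAggregate or Sort/Unique instead of a temp b-tree), but the principle is the same: DISTINCT blocks partial results.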
As a minimal fix here, I am thinking it should use AqlLazyResult with a streamable query from PostgreSQL (i.e. not "SELECT DISTINCT") and process one item at a time, exiting the loop as soon as it gets a first result that sufficiently meets requirements. In our case, of the 17,000 rows, probably 16,990 would be acceptable candidates, and it should normally find one within the first 10 results.
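A minimal sketch of that streaming early-exit idea, again using SQLite and a hypothetical schema; `acceptable()` stands in for canRead() and the other per-row checks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER, repo TEXT, sha256 TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(i, "docker-local", "abc123") for i in range(17000)],
)

def acceptable(row):
    # Stand-in for canRead() and the other checks; pretend the first
    # few candidates are rejected.
    return row[0] >= 5

examined = 0
found = None
# No DISTINCT: the cursor streams rows as the engine produces them.
for row in conn.execute(
    "SELECT node_id, repo FROM nodes WHERE sha256 = ?", ("abc123",)
):
    examined += 1
    if acceptable(row):
        found = row
        break  # early exit: the remaining ~17,000 rows are never touched

print(found, examined)
```

The eager approach materializes and checks all 17,000 rows before the caller sees anything; the loop above stops after a handful.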
For other use cases, I wonder if something more sophisticated might be required. If you had 500 repositories, each owned by a different team, and they all used this very common "empty image" layer, the canRead() check might have to scan 500 or more rows before finding a result. That would still be faster than scanning 17,000 rows, but I would wonder if there was some way to make it more likely that a local result is found early. For example, maybe the query should start by looking for rows where the "sha256" property has the particular value and the repository is the same as the caller's?
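That suggestion could be sketched as an ORDER BY that sorts the caller's own repository first (hypothetical schema and repo names, SQLite standing in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER, repo TEXT, sha256 TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(i, "team-%d" % (i % 500), "abc123") for i in range(5000)],
)

local_repo = "team-42"
# (repo != ?) evaluates to 0 for the caller's repo and 1 otherwise, so
# ascending order puts local rows first.
first = conn.execute(
    """SELECT node_id, repo FROM nodes
       WHERE sha256 = ?
       ORDER BY (repo != ?)
       LIMIT 1""",
    ("abc123", local_repo),
).fetchone()

print(first)  # a row from the caller's own repository
```

Note that a sort still has to visit every matching row to find the minimum, so on its own this doesn't avoid the 17,000-row scan; an index that covers the digest and repository columns would be needed to make the lookup cheap as well.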
Maybe you already had all these ideas and are working on a better fix. I just want to make sure we don't end up getting this feature and still seeing 500+ ms queries because one of the elements of the problem wasn't recognized.
|Comment by Mark Mielke [ 28/May/17 ]|
This issue was partially addressed. Our server has gone from failing to surviving, but the system load is still quite high during peak, and we remain concerned. I opened a new issue to capture the remaining concerns: RTFACT-14296