Uploaded image for project: 'Artifactory Binary Repository'
  1. Artifactory Binary Repository
  2. RTFACT-20490

Better performance is needed for api/storageinfo



    • Type: Performance
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: REST API
    • Labels:


      When trying to view the Admin>Advanced>Storage or the REST API /api/storageinfo there is a risk that the request will time out for Artifactory instances with a lot of repos and artifacts.

      For our instance with 126 local repos and about 26M artifacts we get a timeout for the API after 600 seconds (due to how we have configured our haproxy).

      But the reality is that a REST API endpoint should not take 10 minutes before it returns any information.

      In this case it would be fair to make sure that the database query doesn't contain any SUM() calls that can take forever to return if one has a larger amount of artifacts.

      A solution for this could be that a thread in the application adds or subtracts from a repo size summary field on every create/delete artifact event. Or a scheduled event could sum up the total and update the summary fields asynchronously. Sure, the sum can differ from reality by a fraction if not all events are processed yet, but considering that the current API can take more than 10 minutes to return it woudln't be an issue. I don't think anyone expects a byte exact number anyway (also, the storage sizes are turned into kB/MB/GB in the UI anyway).


      Please fix this since currently the functionality of this is inaccessible to us.


        "errors" : [ {
          "status" : 500,
          "message" : "Repository storage summary failed with exception: Communications link failure\n\nThe last packet successfully received from the server was 600,004 milliseconds ago.  The last packet sent successfully to the server was 600,005 milliseconds ago."
        } ]



      I've done some analyzing on a locally deployed instance and debugged what queries are actually executed on the database.

      There are two queries:

      SELECT repo, SUM(CASE WHEN node_type = 0 THEN 1 ELSE 0 END) as folders, SUM(CASE WHEN node_type = 1 THEN 1 ELSE 0 END) as files, SUM(bin_length) FROM nodes GROUP BY repo;
      SELECT count(b.sha1), sum(b.bin_length) FROM binaries b WHERE b.sha1 NOT LIKE '##%';

      These queries take an insanely long time to complete due to the SUM statements and that it looks at the whole `nodes` table without limiting the number of rows its going to summarize.

      Just as an example: executing the top query but replacing the `GROUP BY repo` statement with `WHERE repo = "reponame"` for one of our mid-sized repositories takes almost a minute if the query wasn't cached.

      repo                 folders   files     SUM(bin_length)
      firmware-releases    144144    495623    9532328981425

      You can just imagine how long that operation would take for the repos we have that has 10M artifacts in them. I don't know how long it actually takes because the DB connection times out after 10 minutes.


      This performance issue is also visible on the front page where a SELECT COUNT( * ) FROM nodes WHERE node_type = 1; query is executed to get the number of artifacts just to be able to present it. The data shown in that message is mostly just a bragging statement and has no value to normal users visiting Artifactory (RTFACT-20532).

      This performance issue is also visible when one clicks the "Show" link for getting the "Artifact Count / Size" on the General page of a repo (as laid out in RTFACT-17669).


      Is it we who have a badly configured database or is it just that Artifactory isn't really optimized for the amount of artifacts we have in our instance?




            stefanga Stefan Gangefors
            3 Vote for this issue
            4 Start watching this issue