[RTFACT-8841] Artifactory remote repositories lock up after heavy use Created: 22/Dec/15  Updated: 14/Nov/17  Resolved: 27/Dec/15

Status: Resolved
Project: Artifactory Binary Repository
Component/s: HttpRepo
Affects Version/s: 4.3.1, 4.3.3
Fix Version/s: 4.4.0

Type: Bug Priority: Critical
Reporter: Mark Mielke Assignee: Michael Pasternak (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

Infrequently we had a few repository lock ups cleared by an Artifactory restart. But, when we started using it for Yocto at remote sites the use case changed to reproducible lock up. I was able to reproduce the failure without Yocto using "wget -r" on a remote repository. The behaviour would be that after restart the remote would work for many requests, but lock up after a few hundred. From a clean restart, it would lock up on the same artifact. However, if I fetched that artifact first, and then "wget -r", it would lock up on a different artifact.

This issue has plagued us for some time now. I had tried many things to resolve it. Finally, I examined the jstack traces and found the following code did not properly close the CloseableHttpResponse:

diff '--exclude=.svn' -r -U8 4.3.3/backend/core/src/main/java/org/artifactory/service/SmartRepoServiceImpl.java 4.3.3-new/backend/core/src/main/java/org/artifactory/service/SmartRepoServiceImpl.java
--- 4.3.3/backend/core/src/main/java/org/artifactory/service/SmartRepoServiceImpl.java	2015-12-22 14:09:12.969950882 -0500
+++ 4.3.3-new/backend/core/src/main/java/org/artifactory/service/SmartRepoServiceImpl.java	2015-12-22 12:41:09.416713364 -0500
@@ -496,16 +496,17 @@
     private boolean pullAndUpdateProperties(RepoResource resource, HttpRepo repo) {
         log.debug("Downloading remote properties for artifact {}", resource);
 
         boolean ok, notFound;
         String targetArtifact = buildPropertiesRestApiUrl(repo, resource);
         HttpGet getMethod = new HttpGet(HttpUtils.encodeQuery(targetArtifact));
         try {
             CloseableHttpResponse getResponse = repo.executeMethod(getMethod);
+            try {
             ok = HttpStatus.SC_OK == getResponse.getStatusLine().getStatusCode();
             notFound = HttpStatus.SC_NOT_FOUND == getResponse.getStatusLine().getStatusCode();
             if (ok || notFound) {
                 InputStream stream = getResponse.getEntity().getContent();
                 Properties properties = (Properties) InfoFactoryHolder.get().createProperties();
                 if (ok && stream != null) {
                     ItemProperties itemProperties = JacksonReader.streamAsClass(stream, ItemProperties.class);
                     RepoRequests.logToContext("Received remote property content '{}'", itemProperties);
@@ -515,16 +516,19 @@
                         if (!entry.getKey().startsWith(ReplicationAddon.PROP_REPLICATION_PREFIX)) {
                             properties.putAll(entry.getKey(), values);
                         }
                     }
                 }
                 setProperties(resource, properties);
             }
             repositoryService.unexpireIfExists(repo.getLocalCacheRepo(),resource.getRepoPath().getPath());
+            } finally {
+                getResponse.close();
+            }
         } catch (Exception e) {
             log.debug("Cannot update remote properties: {}", e);
             return false;
         }
         return notFound ? false : ok;
     }
 
     private boolean resourceIsExpirable(RepoResource resource, HttpRepo repo) {

I patched the above code into our jfrog-artifactory-4.3.3-pro.zip and now the problem is gone. I am unable to reproduce our issue.

Because this can easily happen, and because it could affect any Artifactory to Artifactory ("smart repo") proxy caching, I think this should be a critical issue for you to fix.



 Comments   
Comment by Michael Pasternak (Inactive) [ 27/Dec/15 ]

"wget -r" is not a right way to check SmartRepo, by performing a recursive download your
server can hang up due to variety of reasons cause it downloads all resources it meets,
such as html pages, images, etc.,

while SmartRepo is triggered only when artifacts are downloaded from smart remote repository,
(which i tend to believe is not the case with 'wget -r'),

also this issue effects mostly remoteDeletedSync which is off be default in 4.3.3, so unless
you explicitly enabled it, it should not occur, and you should not see any connection starvation

only place were you should feel the difference is before upgrading to 4.3.3, as 4.3.3 switched off
remoteDeletedSync which was always on before and produced extra content synchronization

Comment by Mark Mielke [ 27/Dec/15 ]

Hi Michael:

Many of the assertions made above about effect are not correct in my experience:

1) "wget -r"... I didn't "wget -r" on the Artifactory URL. I did it on a specific repository, such as "https://artifactory.COMPANY.COM/MY.REPOSITORY/". This meant that of several hundred artifacts in the repository, it fetched them all in sequence. It did include the index.html from the directories, but surely this should never lock up the repository so that it can no longer be used until a restart.

2) It was not just when delete sync was enabled. I thought of these fancy smart repo features early on, and even with all the features off it happened. Underneath, it was still querying /api/ to get the properties from the artifact, and getting HTTP 404 (by memory). I think it is the HTTP 404 path that was leaking the connections.

3) The problem did not automatically clear after a time. While I believe you could have an idle cleaner thread running, I think that as long as the Apache HttpClient connection pool is not exhausted, it could work fine. But, as soon as the issue is hit - the repository locks up semi-permanently, and requires some sort of change to reset such as an application restart, or I think I may have unfreezed it before by adjusting the URL from http:// to https:// and back. But this didn't always work.

4) 4.3.3 did not fix the problem.

5) The patch I provide above did fix the problem.

I think probably that there are more cases than you think that cause this method to be called, and that once the connections leak to the point that the connection pool is exhausted, even a background cleaner thread cannot fix the problem.

Thanks for fixing this!

Comment by Mark Mielke [ 27/Dec/15 ]

Also to clarify on 1)... "wget -r"... the purpose was to reproduce the exact issue that our users were experiencing. In the case of the Yocto builds, where we were trying to leverage Artifactory as a binary management tool with replication features, Yocto quickly pulls a whole large number of .tar.gz from a single repository as it builds a Yocto Linux image. My purpose in "wget -r" was to subtract Yocto from the equation, to reproduce the exact issue without Yocto. I think it is acceptable for me to do this. I wasn't testing Smart Repo exactly... I was testing "what Yocto does" which is pull hundreds of artifacts quickly in sequence. That the problem was with Smart Repo was discovered after.

Comment by Mark Mielke [ 27/Dec/15 ]

"while SmartRepo is triggered only when artifacts are downloaded from smart remote repository,
(which i tend to believe is not the case with 'wget -r')," ... My experience of how it works in practice, is that it calls this code if the cache has expired and it wants to re-verify that what it has in cache is still good. In the case of a Artifactory restart (which was required to clear the problem), when the Yocto (or Yocto "simulation" using "wget -r") was launched, Artifactory would call the remote Artifactory repository with the /api/ path to fetch all properties the first time the artifact was referenced after restart. After a certain number of these around 200 - 300 (I never pinned down the exact number, but it wasn't 256!), it would lock up and get semi-permanently stuck for that one repository alone. If I queried certain paths that locked up first, then it would lock up on different artifacts.

If I completely removed the remote and re-created it empty, then it would sometimes (not always) work and fetch all the artifacts. Possibly this was due to the cleaner, and the longer time to fetch instead of just check that the cached artifacts were ok? In this case, I think it continued to work for a time. But, when I restarted Artifactory again, it went back to the freeze after a few hundred.

All of this was with as many features as I could turn off, off. Including property synchronization. I think Smart Repository cannot be fully disabled, and it's only the advanced features that can be disabled. And Smart Repository seems to call the above pullAndUpdateProperties method even with most or all Smart Repository features intentionally disabled.

Comment by Roy Zanbel (Inactive) [ 31/Dec/15 ]

Mark Mielke Thank you for taking the time to investigate.
The issue is indeed related to remote repository and not only to the smart remote repository as mentioned in previous comment.
Based on your input we were able to find the root cause. The issue was fixed and will be delivered in the next Artifactory version.

Generated at Sun Nov 17 20:21:24 UTC 2019 using JIRA 7.6.16#76018-sha1:9ed376192612a49536ac834c64177a0fed6290f5.