RTFACT-17638 introduces a mechanism that eagerly copies image layers from tag X to tag Y when the manifest.json is pulled. The deadlock described below is a combination of an app-level lock and a DB lock.
Pulling a Docker manifest concurrently (3 or more threads) creates the following deadlock:
- Thread #1 - the fastest of the 3 threads. It moves layer X of the image to its new path and commits the transaction.
- Thread #2 - decides that layer X needs to be copied too (no lock at the Docker business level prevents the same layer from being copied twice, and nothing checks whether the target path already exists). This thread then proceeds to overwrite the already existing layer at the target path. When overwriting, Artifactory does the following:
  - #2.1 - It removes the target file & saves the session (the lock on the path is released before the TX is committed!)
  - #2.2 - The target path is then re-created (another repoPath lock is required)
- Thread #3 - initially waits to acquire the lock on the target path. When Thread #2 releases the lock at step #2.1, this thread "steals" the lock on the repoPath. Thread #2 therefore cannot re-acquire the lock (as required by step #2.2), yet it still holds a read-lock on the row in the database.
- Thread #2 wants to proceed with re-creating the overwritten path, but the lock on the path was stolen by Thread #3.
- The situation is a deadlock since Thread #2 waits for the app-level lock that Thread #3 holds, and Thread #3 waits for the DB lock that Thread #2 holds.
(See http-nio-8081-exec-3 and http-nio-8081-exec-2 in the attached thread dump.)
- This was reproduced on Derby & Postgres.
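
The circular wait between Thread #2 and Thread #3 can be illustrated with a minimal Java sketch. This is not Artifactory code: the class name `LockCycleSketch`, the method names, and the use of two `ReentrantLock` objects to stand in for the repoPath lock and the database row lock are all assumptions made for illustration. Each worker takes one lock and then asks for the other with a timeout, so the sketch reports the cycle instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names come from Artifactory.
final class LockCycleSketch {

    /**
     * Runs two workers that each hold one lock and request the other.
     * Returns {thread2GotPathLock, thread3GotDbRowLock}.
     */
    static boolean[] run(long timeoutMillis) throws Exception {
        ReentrantLock pathLock = new ReentrantLock();   // stand-in for the app-level repoPath lock
        ReentrantLock dbRowLock = new ReentrantLock();  // stand-in for the DB row lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);

        // "Thread #2": already holds the DB row lock, needs the path lock again.
        Callable<Boolean> thread2 =
                () -> attempt(dbRowLock, pathLock, bothHoldFirst, bothTried, timeoutMillis);
        // "Thread #3": took over the path lock, now needs the DB row lock.
        Callable<Boolean> thread3 =
                () -> attempt(pathLock, dbRowLock, bothHoldFirst, bothTried, timeoutMillis);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> r2 = pool.submit(thread2);
            Future<Boolean> r3 = pool.submit(thread3);
            return new boolean[] {r2.get(), r3.get()};
        } finally {
            pool.shutdownNow();
        }
    }

    private static boolean attempt(ReentrantLock held, ReentrantLock wanted,
                                   CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                   long timeoutMillis) throws InterruptedException {
        held.lock();
        try {
            // Wait until the other worker also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();

            // A plain lock() here would block forever; the timeout exposes the cycle.
            boolean acquired = wanted.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                wanted.unlock();
            }

            // Keep the first lock until both workers have finished their attempt.
            bothTried.countDown();
            bothTried.await();
            return acquired;
        } finally {
            held.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = run(200);
        System.out.println("Thread #2 acquired path lock: " + result[0]);
        System.out.println("Thread #3 acquired DB row lock: " + result[1]);
    }
}
```

Because each worker keeps its first lock until both attempts have completed, neither `tryLock` call can succeed. This mirrors the state captured in the thread dump, where neither thread can make progress without the other releasing a lock first.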