How do I use Amazon Elastic File System (EFS) with Artifactory HA?

Artifactory High Availability (HA) on AWS can use either S3 for scalable storage or Amazon Elastic File System (EFS) as an NFS filestore. An EFS-based design must take into account certain aspects of how EFS works.

This document explains how EFS's baseline performance and burst credit model work, and how to account for them to get the best performance from EFS with Artifactory HA. For more information about Amazon EFS, see What is Amazon Elastic File System?

Amazon EFS Performance Overview - Baseline and Burst Throughput

Throughput on Amazon EFS scales as a file system grows. A file system can drive throughput continuously at its baseline rate.  Additionally, Amazon EFS is designed to burst to high throughput levels for periods of time.

For full documentation on EFS performance, see Throughput Scaling in Amazon EFS.

Amazon EFS uses a credit system to determine when file systems can burst. Each file system earns credits over time at a baseline rate that is determined by the size of the file system, and uses credits whenever it reads or writes data. Whenever a file system is inactive or driving throughput below its baseline rate, the file system accumulates burst credits.
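The credit mechanics described above can be sketched as simple arithmetic. This is a rough model, not AWS's actual accounting: it assumes credits change at a rate equal to the difference between the baseline rate and the actual throughput, and it ignores both the cap on the credit balance and the upper limit on burst throughput. All names and the example figures (100 GiB stored, a 20 MiB/s load, a 1 TiB starting balance) are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 ** 4


def credit_balance_after(start_credits, baseline_rate, actual_rate, seconds):
    """Credits (bytes) after `seconds` at a constant `actual_rate` (bytes/s).

    Credits grow when actual_rate < baseline_rate and shrink when it is higher.
    Simplification: no upper cap on the balance is modelled.
    """
    balance = start_credits + (baseline_rate - actual_rate) * seconds
    return max(0.0, balance)


def seconds_until_exhausted(start_credits, baseline_rate, actual_rate):
    """How long a constant load can run before the credit balance reaches zero."""
    drain = actual_rate - baseline_rate
    if drain <= 0:
        return float("inf")  # at or below baseline: credits never run out
    return start_credits / drain


# Hypothetical example: 100 GiB stored (baseline 100 * 50 KiB/s, using the
# per-GiB figure quoted later in this article), a sustained 20 MiB/s load,
# and an assumed starting balance of 1 TiB of credits.
baseline = 100 * 50 * KIB
hours = seconds_until_exhausted(1 * TIB, baseline, 20 * MIB) / 3600
print(f"Burst credits last about {hours:.1f} hours")
```

Running this kind of estimate against your own expected load gives a first indication of whether the file system would spend most of its time bursting, which is the situation this article advises against.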

If your file system has no burst credits available, I/O throughput is limited to the baseline rate until the credits replenish. Running at the baseline rate may severely degrade Artifactory performance. The best practice is therefore to size the file system so that the baseline throughput alone covers your needs, and to rely on burst credits rarely, if at all. You can monitor the burst credit balance using the Amazon CloudWatch metrics for Amazon EFS.
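As an illustration of monitoring the balance, the sketch below builds a CloudWatch query for the EFS BurstCreditBalance metric and, optionally, runs it with boto3. It assumes boto3 is installed and AWS credentials are configured; the function names, the six-hour window, and the file system ID in the usage note are hypothetical choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def burst_credit_query(file_system_id, hours=6, period_s=300, now=None):
    """Build the parameters for a CloudWatch get_metric_statistics call."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EFS",
        "MetricName": "BurstCreditBalance",
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period_s,          # seconds per datapoint
        "Statistics": ["Minimum"],   # lowest balance seen in each period
    }


def lowest_burst_credit_balance(file_system_id, hours=6):
    """Return the lowest balance (bytes) in the window, or None if no data."""
    import boto3  # third-party; imported lazily so the query builder has no dependency

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **burst_credit_query(file_system_id, hours)
    )
    points = response["Datapoints"]
    return min(p["Minimum"] for p in points) if points else None
```

A call such as lowest_burst_credit_balance("fs-12345678") (a placeholder ID) would return the lowest balance over the last six hours. A value that trends toward zero suggests the filestore is relying on burst credits more than it should.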

How to avoid exhausting the burst credit balance

For the best results, consider your workload and the capacity required to sustain it. Put simply: the more data you store, the higher your baseline and burst throughput.

Many JFrog Artifactory servers scale into the TiB range of storage, at which point EFS throughput is often ample for the use case. For smaller filestores, however, the baseline throughput is quite low: 1 GiB of stored data earns 50 KiB/s, and the rate scales linearly with each additional GiB. If your workload requires more throughput than the baseline and burst model provides for your capacity, there are two options:
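The linear scaling above is easy to turn into a sizing check. The following minimal sketch uses only the 50 KiB/s per GiB figure from this article; the function names and the example targets are hypothetical.

```python
import math

BASELINE_KIB_PER_S_PER_GIB = 50  # figure quoted in this article


def baseline_kib_per_s(stored_gib):
    """Baseline throughput (KiB/s) for a file system holding `stored_gib` GiB."""
    return stored_gib * BASELINE_KIB_PER_S_PER_GIB


def required_gib(target_kib_per_s):
    """Smallest whole number of stored GiB whose baseline meets the target."""
    return math.ceil(target_kib_per_s / BASELINE_KIB_PER_S_PER_GIB)


print(baseline_kib_per_s(10))    # a 10 GiB filestore
print(baseline_kib_per_s(1024))  # a 1 TiB filestore, i.e. 50 MiB/s
print(required_gib(10 * 1024))   # GiB needed for a 10 MiB/s baseline
```

Comparing required_gib for your expected sustained throughput with the actual filestore size shows how large the gap is before choosing between the two options.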

Adding capacity to the file system

You can add capacity to the file system to reach the baseline level of performance that you need. Adding capacity is fairly simple: either upload a number of large artifacts to Artifactory, or add files of sufficient size to a directory on the EFS volume outside the filestore directory, until your storage usage reaches the desired amount (if need be, using random files of arbitrary length). As the filestore grows over time, you can remove these filler files. Note that some customers have had success placing other workloads with large storage needs and low transfer rates on the same EFS volume as the Artifactory filestore.
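One way to create such filler files is sketched below. It writes random data in fixed-size chunks so that memory use stays small. The function name, chunk size, and the path in the usage note are hypothetical. Keep in mind that writing the filler is itself I/O against EFS, so it draws on the same throughput allowance.

```python
import os


def write_filler_file(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write `size_bytes` of random data to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            # The last chunk may be shorter than chunk_bytes.
            n = min(chunk_bytes, size_bytes - written)
            f.write(os.urandom(n))
            written += n
    return written
```

For example, write_filler_file("/mnt/efs/padding/filler-0001.bin", 10 * 1024 ** 3) would add 10 GiB, assuming /mnt/efs is where the EFS volume is mounted and the padding directory lies outside the Artifactory filestore directory.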

Using local EBS caches

You can use local EBS caches on the Artifactory node instances. This option reduces the load on the EFS filestore and, with it, the use of the burst credit balance. Configure Artifactory to use a cache on EBS on each node. This local cache stores artifacts on the individual node and serves them directly to the client instead of pulling them from EFS. Each node may hold duplicate cache entries (since any node can serve any request), but this greatly reduces access to EFS. Note that the same mechanism can also improve performance when S3 is used as the binary store. Caching may leave burst capacity available more often, but it should still only be relied on when the baseline throughput is at a reasonable value. In particular, bear in mind that an artifact that is not yet cached must first be streamed into the cache at EFS speed and only then sent to the client, so if EFS is running very slowly because the burst credit balance is exhausted, client timeouts may result.

 

For example, if a file is downloaded 1000 times without a local cache, EFS serves 1000 * filesize of download traffic. With the local cache enabled, this may drop to as little as 2 * filesize (for example, one fetch per node in a two-node cluster). If you do not want to refill the cache whenever a new instance is created, keep it on a persistent (non-ephemeral) EBS volume. For more information about configuring a cache-fs provider, see the notes on configuring the filestore in the binarystore.xml file.
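The arithmetic in this example can be written out as a small function. It assumes the best case for the cache: each caching node fetches the file from EFS once and never evicts it. The function name and the 100 MiB file size are hypothetical.

```python
def efs_read_bytes(file_size, downloads, caching_nodes=0):
    """Bytes read from EFS to serve `downloads` requests for one file.

    caching_nodes=0 means no local cache: every download reads from EFS.
    Otherwise each node reads the file from EFS at most once (best case).
    """
    if caching_nodes == 0:
        return file_size * downloads
    return file_size * min(downloads, caching_nodes)


MIB = 1024 * 1024
without_cache = efs_read_bytes(100 * MIB, 1000)
with_cache = efs_read_bytes(100 * MIB, 1000, caching_nodes=2)
print(without_cache // MIB, "MiB vs", with_cache // MIB, "MiB")
```

In practice the cached figure will be somewhat higher, because entries can be evicted when the cache fills up and new or replaced nodes start with whatever their EBS volume already holds.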

 

Symptoms of inadequate throughput/exhausting the burst credit balance

Symptoms of an inadequate storage design include slow responses to ping requests (api/system/ping) and very slow downloads and uploads. In your Artifactory log, the delays may show up as timeout errors or as broken connections caused by clients timing out and disconnecting.
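A simple way to watch for the first symptom is to time the ping endpoint from a client machine. The sketch below uses only the Python standard library. The base URL in the usage note, the one-second threshold, and the function names are hypothetical, and depending on your configuration the ping request may need credentials, which this sketch does not send.

```python
import time
import urllib.request


def time_ping(base_url, timeout_s=10.0):
    """Return the seconds taken by one GET of <base_url>/api/system/ping.

    Raises an exception from urllib on HTTP errors or if the request times out.
    """
    url = base_url.rstrip("/") + "/api/system/ping"
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()
    return time.monotonic() - start


def slow_pings(latencies_s, threshold_s=1.0):
    """Return the latencies (seconds) that exceed the threshold, in order."""
    return [latency for latency in latencies_s if latency > threshold_s]
```

For example, collecting time_ping("http://localhost:8081/artifactory") every minute and passing the results to slow_pings gives a rough picture of when responses degrade, which can then be compared against the burst credit balance over the same period.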

 

Artifactory versions prior to 5.0 implemented on S3 storage

If you implemented an Artifactory HA version prior to 5.0 with S3 storage, you needed a cluster-wide write cache (called the eventual cache) on a shared NFS mount. JFrog does not recommend EFS for this cache, because it typically holds little data but sees heavy use, which tends to exhaust the burst credit balance quickly. For Artifactory versions 5.0 and higher, the caches for an S3 implementation should be local EBS disk caches.