How to use File Sharding to easily expand storage and build advanced configurations?

Single instances: File sharding allows easy filestore expansion for a new or already running single Artifactory instance. When the need arises to add storage, add drives and declare their paths in binarystore.xml in the form of what we call binary providers (a binary provider translates to a filestore; therefore S3 and other forms of cloud storage are also considered binary providers).

* Please note that some of the binary providers may require an Enterprise license.
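For illustration, a single instance that started with one filestore and later gained a second drive could declare both drives as sub-providers of a sharding chain. This is a minimal sketch modeled on the examples further down this page; the paths are placeholders, and the redundancy and write strategy should be set to match your needs:

<config version="1">
    <chain>
        <provider id="cache-fs" type="cache-fs">
            <provider id="sharding" type="sharding">
                <sub-provider id="shard1" type="state-aware"/>  <!-- original disk -->
                <sub-provider id="shard2" type="state-aware"/>  <!-- newly added drive -->
            </provider>
        </provider>
    </chain>
    <provider id="sharding" type="sharding">
        <readBehavior>roundRobin</readBehavior>
        <writeBehavior>percentageFreeSpace</writeBehavior>
        <redundancy>1</redundancy>
    </provider>
    <provider id="shard1" type="state-aware">
        <fileStoreDir>/storage/disk1/filestore</fileStoreDir>
    </provider>
    <provider id="shard2" type="state-aware">
        <fileStoreDir>/mnt/newdisk/filestore</fileStoreDir>
    </provider>
</config>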

HA clusters: When setting up a new Artifactory cluster, or when thinking about changing/expanding your current storage, the following questions should be asked:

  1. Will this be a brand new HA cluster? Will we be adding new nodes as part of the upgrade/install?

  2. How many disks do we want, and how should they be laid out across our storage and nodes?

  3. Do we want storage redundancy, and how will artifacts be distributed under our setup?

  4. Are binaries going to be stored on the nodes' native filesystems?

 

More questions can be asked, and file sharding can provide a solution for them as well:

  1. Do we want to prioritize between storage providers (e.g. write to some filestores before others)?

  2. Will we be migrating and splitting our currently used filestore(s) to other mounts?

  3. Can we combine local/mounted filestores with cloud/object-based storage (e.g. Amazon S3)?

 

Although some of these questions can be answered by reading through our Wiki pages, this page aims to provide some practical examples and explanations to get you started.

For more general information about HA installation and setup, please refer to our wiki pages: HA Installation and Setup, Configuring the Filestore, or File Sharding.

 

** If you care about being able to predict how artifacts/binaries are distributed across filestores, then the best practice is to keep the filestores symmetric, so that it is easy to predict how artifacts will be copied across the binary providers.

 

** The statement above about redundancy is even more relevant for the sharding-cluster provider, especially when a single node has more than a single shard: you will need to add two shards for each node in a symmetric way (including paths), since binarystore.xml is shared between instances and is not intended for per-node configuration.

 

Here are some configuration options and example binarystore.xml files to help you build your desired Artifactory High Availability <> Storage topology:

 

1. NFS mounts in a Mesh topology:

Diagram:

    Artifactory_Node_1                    Artifactory_Node_2
        |          |                               |
        |          |                               |
        |          |                               |
     Shard 1    Shard 2                         Shard 3

** What storage scenarios does this strategy help with?

>> This storage configuration opens the door to a number of possible strategies for adding filestores to new/existing HA nodes (* the NFS mounts need to exist in a full-mesh topology). You can use this configuration when you are looking to add filestores to your cluster and possibly create redundancy between shards; it is especially useful if you would like to add a single shard to a single node.

* If you are looking to create a specific separation between filestores that allows I/O prioritization, and/or you want to group filestores, refer to the Zones strategy below.

* If you are looking to avoid using NFS mounts and want to use Artifactory's capabilities for sharing files across the filestores in your cluster, refer to the Remote Binary provider and multiple file shards solution below.

This setup's binarystore.xml requires adjusting the local paths to match both the local filesystem and the remote node's filesystem (see the example binarystore.xml below). The recommended write strategy here is percentageFreeSpace, so that the disks fill up evenly.

 

Advantages:

  1. Easily add a single filestore in case you run out of space

  2. More flexible in creating asymmetric storage configurations

  3. Allows configuring and increasing redundancy conveniently

  4. Matches our recommendation to work with NFS mounts, with which we and other customers have the most experience

Disadvantages:

  1. Needs more mounting work

  2. In case redundancy is being used, you will be unable to predict how binaries/artifacts will be distributed across the shards.

  3. If percentageFreeSpace is used (which we recommend), writes may go only to the shard with the most free space.

Example binarystore.xml:

<config version="1">  <chain>      <provider id="cache-fs" type="cache-fs">                <!-- This is a cached filestore -->          <provider id="sharding" type="sharding">                <!-- This is a sharding provider -->              <sub-provider id="shard1" type="state-aware"/> <!-- There are three mounts -->              <sub-provider id="shard2" type="state-aware"/>              <sub-provider id="shard3" type="state-aware"/>          </provider>      </provider>  </chain>// Specify the read and write strategy and redundancy for the sharding binary provider <provider id="sharding" type="sharding">      <readBehavior>roundRobin</readBehavior>                           <writeBehavior>percentageFreeSpace</writeBehavior>      <redundancy>2</redundancy></provider>//The primary node will have its 1+2 native shards, but will need to have shard 3 mounted locally under the same path as on the secondary node//For each sub-provider (mount), specify the filestore location  <provider id="shard1" type="state-aware">      <fileStoreDir>filestore1</fileStoreDir>  </provider>  <provider id="shard2" type="state-aware">      <fileStoreDir>filestore2</fileStoreDir>  </provider>//HA-Node 2 will have its native shard as shard3, but will need to have shards 1+2 mounted locally under the same path as on the primary node  <provider id="shard3" type="state-aware">      <fileStoreDir>filestore3</fileStoreDir>  </provider></config>

 

2. Remote Binary provider (Cloud Native Storage) and multiple file shards:

Diagram:

    Artifactory_Node_1                    Artifactory_Node_2
        |          |                          |          |
        |          |                          |          |
        |          |                          |          |
     Shard 1    Shard 2                    Shard 3    Shard 4

>> This removes the reliance on any NFS mounts by using a new (Artifactory 5.x) HTTP-based mechanism to deploy and fetch binaries between Artifactory cluster nodes. This configuration is useful when you are looking to add storage to nodes (in a symmetric way) while wanting to avoid NFS mounts and instead utilize Artifactory's capabilities for sharing files across the filestores in your cluster.

Advantages:

  1. Very easy to get started with - no NFS mounts need to be prepared in advance

  2. Low overhead, Artifactory takes care of accessing files between connected shards

  3. No need to mount drives in cross-node fashion.

  4. Advanced, with highly configurable elements under binarystore.xml

Disadvantages:

  1. The HTTP protocol is used, which could potentially be slower than NFS

  2. Could require more tuning and some additional resources

  3. Not as flexible as the Sharding NFS and Zones strategy

 

* You will need to add two shards for each node in a symmetric way (including paths), since the binarystore.xml is shared between instances and is not intended for per-node configuration.

* If you want to predict the distribution of files across the shards in your cluster, the best practice is to set up symmetric filestores with an even value for <redundancy>.

* It is not possible to combine shards with sharding-cluster.

Example binarystore.xml:

<config version="1">  <chain>      <provider id="cache-fs" type="cache-fs">                <!-- This is a cached filestore -->          <provider id="sharding" type="sharding">                <!-- This is a sharding provider -->              <sub-provider id="shard1" type="state-aware"/> <!-- There are three mounts -->              <sub-provider id="shard2" type="state-aware"/>              <sub-provider id="shard3" type="state-aware"/>          </provider>      </provider>  </chain>// Specify the read and write strategy and redundancy for the sharding binary provider <provider id="sharding" type="sharding">      <readBehavior>roundRobin</readBehavior>                           <writeBehavior>percentageFreeSpace</writeBehavior>      <redundancy>2</redundancy></provider>//The primary node will have its 1+2 native shards, but will need to have shard 3 mounted locally under the same path as on the secondary node//For each sub-provider (mount), specify the filestore location  <provider id="shard1" type="state-aware">      <fileStoreDir>filestore1</fileStoreDir>  </provider>  <provider id="shard2" type="state-aware">      <fileStoreDir>filestore2</fileStoreDir>  </provider>//HA-Node 2 will have its native shard as shard3, but will need to have shards 1+2 mounted locally under the same path as on the primary node  <provider id="shard3" type="state-aware">      <fileStoreDir>filestore3</fileStoreDir>  </provider></config>

3. Zones configuration:

Diagram:

    Artifactory_Node_1        Artifactory_Node_2        Artifactory_Node_3        Artifactory_Node_4
            |                         |                         |                         |
            |                         |                         |                         |
            |                         |                         |                         |
        Shard 1                   Shard 2                   Shard 3                   Shard 4

** What storage scenarios does this strategy help with?

 

>> This solution offers a strategy for adding sharded filestores to new and existing HA nodes (* the NFS mounts need to exist in a full-mesh topology) and for configuring an additional layer of artifact distribution on top of your storage, e.g. when separating filestores and nodes into 2 separate logical datacenters.

 

When using the Zones strategy (useful for HA configurations), a write is first attempted in the zone of the node that received the request (a node in Artifactory Zone East will try to write to one of the sub-binary providers in Zone East); should it fail, Artifactory will continue to Zone West. Zone priority is managed by the order of the zone names given for each node in the configuration file:

<property name="art-east.zones" value="east,west"/>
 

A note about the error recovery and retry mechanism: each time the garbage collector runs, an optimizer job runs right after it and takes care of balancing any necessary spread of files, for example copying an artifact from one shard to another, or from Zone A to Zone B.

  • The chain should be protected by a cache-fs layer, which decreases latency if a need arises to fetch files from a remote NFS share (see the cache-fs sketch after this list).

  • A state-aware binary provider knows when it is disconnected (i.e. it knows it is down), so upload-handling threads are created only for the active binary providers.

  • The optimizer will balance out the shards in case the redundancy is > 1
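As a reference for the cache-fs protection mentioned above, the cache size and location can be tuned on the cache-fs provider. The values below are illustrative only (maxCacheSize is in bytes):

<provider id="cache-fs" type="cache-fs">
    <maxCacheSize>10000000000</maxCacheSize>      <!-- keep roughly 10 GB of recently used binaries on fast local disk -->
    <cacheProviderDir>cache</cacheProviderDir>    <!-- relative path, typically resolved under the node's local data directory -->
</provider>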

 

Advantages:

  1. Smarter in terms of creating location/datacenter based topology

  2. Zones are priority-definable, so that each node would have the ability to write locally before writing to the remote storage

  3. The Zones strategy allows concurrent filestore writes (e.g. a 1GB file is written to both zones at the same time, cutting the time needed in half). Consequently, two threads are activated to handle the upload.

Disadvantages:

  1. Allows only round-robin strategy internally between shards

  2. Limitation: if a shard (binary provider) reaches full capacity, it is deactivated (even for read operations!) - this means you need to be able to monitor disk space externally of Artifactory

  3. Currently only round-robin is possible within a zone; an internal freeSpace strategy would be the best practice

  4. Needs migration of files to distribute them across your filestores more evenly.

 

Example binarystore.xml:
 

<config version="3">  <chain>      <provider id="cache-fs" type="cache-fs">          <provider id="sharding" type="sharding">              <redundancy>2</redundancy>              <readBehavior>zone</readBehavior>              <writeBehavior>zone</writeBehavior>              <minSpareUploaderExecutor>2</minSpareUploaderExecutor>              <concurrentStreamWaitTimeout>1</concurrentStreamWaitTimeout>              <sub-provider id="shard-zoneA-1" type="state-aware"/>              <sub-provider id="shard-zoneA-2" type="state-aware"/>              <sub-provider id="shard-zoneB-3" type="state-aware"/>              <sub-provider id="shard-zoneB-4" type="state-aware"/>              <property name="art1.zones" value="b,a"/>              <property name="art2.zones" value="b,a"/>              <property name="art3.zones" value="a,b"/>              <property name="art4.zones" value="a,b"/>          </provider>      </provider>  </chain>  <!-- Shards FS provider configuration -->  <!-- ZONE A / DATACENTER A -->  <provider id="shard-zoneA-1" type="state-aware">      <fileStoreDir>/home/support/Downloads/servers/filestores/filestore1</fileStoreDir>      <zone>a</zone>  </provider>  <provider id="shard-zoneA-2" type="state-aware">      <fileStoreDir>/home/support/Downloads/servers/filestores/filestore2</fileStoreDir>      <zone>a</zone>  </provider>  <!-- ZONE B / DATACENTER B -->  <provider id="shard-zoneB-3" type="state-aware">      <fileStoreDir>/home/support/Downloads/servers/filestores/filestore3</fileStoreDir>      <zone>b</zone>  </provider>  <provider id="shard-zoneB-4" type="state-aware">      <fileStoreDir>/home/support/Downloads/servers/filestores/filestore4</fileStoreDir>      <zone>b</zone>  </provider></config>