Writing Performant User Plugins for an HA Environment

Abstract:

Daryl Spartz / Yahoo, May 2016: In a large installation with large amounts of artifacts to remove, the execution of the artifact Cleanup plugin can have adverse effects stemming from overloaded disk activity. This presentation will describe our modifications to the plugin that allowed us to successfully remove over 13TB of data with no system impact or response time loss.

Talk Transcription:

My name is Daryl Spartz. I work with Yahoo. I’m relatively new with Artifactory. I started last October, I believe.

My first assignment was to clean up artifacts. Everyone’s probably been in the situation where you just keep growing and growing and growing. I’ll probably repeat this later, but we had so many artifacts that we were getting pretty full. We had allocated a 27 terabyte filer volume, and we were somewhere around 23 terabytes full. We had tried to clean this up in the past and we ran into problems. So, my assignment was to do this without being disruptive, and that’s where this presentation comes from. We are in a high availability environment, so we have two filers that are snap-mirrored between each other and all that. And I’ll go a little bit more into that.

So, what does it mean to be performant? We’ll discuss that briefly. And it really revolves around a case study using the Artifactory Cleanup plugin. And I’ll just discuss our high availability architecture. And as a sideline, after I got this all working, we migrated from Artifactory 3.9 to 4.7. And although we already used some Chef, we switched a whole lot more stuff to Chef. And I’ll go over the cookbook and some of the recipes that I used to do this and how that all works. And then hopefully we’ll have time for questions and answers.

So what does it mean to be performant? It’s like any piece of software: if it’s not done right, it can hurt the system it runs in. A plugin runs in the context of Artifactory, so you could be affecting Artifactory itself, or at least the threads that are running. You know, in terms of CPU usage, the amount of memory that you’re consuming, and in this particular case that I talk about, the cleanup, you’re hitting the I/O system by removing the artifacts.

So, as a plugin you’re somewhat limited in what you can control. You’re using the functions or API calls within Artifactory. You can’t really control that, but what you can control is the number of times you call a particular API, you know, and the frequency. If you have something that’s long running, you’re going to be consuming a lot of resources, so you may want to handle it differently or throttle yourself somehow. And I’ll cover that.

So, in this particular case, I started with the Artifactory Cleanup plugin. It’s available on GitHub under JFrog’s organization, and the purpose of it is to remove artifacts that basically aren’t being used. And so I started with the original script, and the script does an API call to get a list of all the artifacts that are a particular age and haven’t been downloaded in a particular amount of time. And then the script just goes through a loop for each one that it found and deletes it.

And that turned out to be an issue with us. Again, we had 23 terabytes full. By the time it ended, we cleaned up, like, nine terabytes. So we had lots and lots of stuff. So every time we delete something, we had all the activity on the filer and it turned out if we let it go as it was, it tied up our database. So we got some locks running. Long running locks. And it caused us issues and we ended up having to call JFrog support to figure out can we just kill the transactions in progress or what, so.

So I started with that. And, again, when you’re on a high availability cluster, you’re using an NFS mount point. And when we ran it full bore, we did notice that we were impacting the service on the filer because of all the I/Os that were going on. We’re not the only tenant on the filer, so we’re not being a good citizen with the other users of the system. Also, when we ran this, we picked a particular host to run it from, and the CPU load on that particular instance of Artifactory, I don’t remember if it was a standby or whatever, spiked, and our monitoring started alerting us that there was so much CPU going on and stuff like that. It wasn’t necessarily a catastrophic event, but if it was part of the load balancing cluster that you’re hitting, you could affect the people that are directed to that particular host. So again, you want to be a good citizen and not affect your service level agreements by adding excess load.

So this is again what we found out. We ended up with a lock wait timeout when we initially ran it. And we had to manage it. So what I discovered when I started digging into this, and you probably know this, is that plugins are loaded at startup time. So, Artifactory reads from the plugins directory and loads them, basically loading the class inside the JVM. And then you can define multiple endpoints within the class to hit to invoke the different functions.

So those were important points once I discovered them, because what I wanted to do was to be able to control the plugin. Not just let it run full out, and not hard-code a specific throttling mechanism. I wanted to manage it: maybe I wanted to pause it, maybe I wanted to stop it, or whatever.

This is a quick diagram of our HA cluster here. And so we have a CNAME that all users or applications hit. And then that goes through a rotation between these two sides, labeled GQ1 and BF1. I forgot to change that, but those are the names of our colocations. And BF1 is like our disaster recovery, so in this particular case, no traffic’s really going to BF1. So when it comes to the GQ1 side, we have a load balancing VIP that directs the traffic to three separate instances. One is the primary, and then the other two are slaves. And then the database is behind. We have a dual master database between the two colos and a slave replication between the master and the slave. And this is, I believe, the way JFrog helped us architect the BCP, our disaster recovery mechanism.

So, what I ended up doing is I took the original Artifactory Groovy plugin and I came up with enhancements to it. I added an optional pacing parameter that you could add to it on the REST API call, or actually even on the scheduled job, as we’ll see in a minute, so that you’re not going as fast as you can. You can basically slow it down so it doesn’t hammer the disk drives or the network activity that’s going on, it being NFS. And I made that dynamically adjustable, so if I see it’s taking too long and there are no ill effects, I can decrease the throttling time so it can run a little faster, and we can watch our monitoring metrics to see if we’re impacting any services.
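As an illustration of the pacing idea, here is a minimal sketch in Python. It is not the plugin’s actual Groovy code; the names `Pacer` and `paced_delete` are hypothetical, and the `delete` callback stands in for whatever call removes one artifact. The point it shows is that the delay is re-read on every iteration, so it can be changed while the loop is running.

```python
import time
from typing import Callable, Iterable


class Pacer:
    """Holds a delay in milliseconds that can be changed while a loop is running."""

    def __init__(self, pace_ms: int = 0) -> None:
        self.pace_ms = pace_ms


def paced_delete(
    items: Iterable[str],
    delete: Callable[[str], None],
    pacer: Pacer,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete each item, sleeping pacer.pace_ms after each delete. Returns the count."""
    count = 0
    for item in items:
        delete(item)
        count += 1
        # Re-read the delay every time so an operator can adjust it mid-run.
        if pacer.pace_ms > 0:
            sleep(pacer.pace_ms / 1000.0)
    return count
```

In a real plugin the `Pacer` object would presumably be shared with a separate control endpoint; here the `sleep` parameter is injectable only so the loop can be exercised without actually waiting.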

I added pause, resume, and stop controls. So, again, if we think we’re affecting performance somewhere, I can pause it. And then the state that it had, the number of files that it’s going to remove, will remain there and I can resume it later, maybe off hours or something. That’s also there if I didn’t want to do the adjusting of the pause time.
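The pause, resume, and stop controls can be pictured as a small piece of shared state that the delete loop checks before each item. The following Python sketch is an assumption about the general shape, not the plugin’s Groovy implementation; `CleanupControl`, `State`, and `run_cleanup` are hypothetical names.

```python
import enum
import time
from typing import Callable, Iterable


class State(enum.Enum):
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


class CleanupControl:
    """Shared state that a separate control endpoint could flip while the loop runs."""

    def __init__(self) -> None:
        self.state = State.RUNNING

    def pause(self) -> None:
        if self.state is State.RUNNING:
            self.state = State.PAUSED

    def resume(self) -> None:
        if self.state is State.PAUSED:
            self.state = State.RUNNING

    def stop(self) -> None:
        self.state = State.STOPPED


def run_cleanup(
    items: Iterable[str],
    delete: Callable[[str], None],
    control: CleanupControl,
    poll: Callable[[], None] = lambda: time.sleep(1.0),
) -> int:
    """Delete items in order and return how many were deleted."""
    done = 0
    for item in items:
        # While paused, wait without losing our position in the list.
        while control.state is State.PAUSED:
            poll()
        if control.state is State.STOPPED:
            break
        delete(item)
        done += 1
    return done
```

Because the loop keeps its position while it waits, a resume continues with the remaining items instead of re-querying the list, which matches the behavior described in the talk.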

I added a bunch of logging messages that I found really important or useful. They tell me where I am in the process. So in one case we had several million artifacts. Where was I? Was I in the first 10 thousand or was I near the end? And I was curious to know how much space was being removed, so I logged that as well.

One of the other things that I added is an enhancement to the policies. The original plugin used a properties file to define policies, and it was a single policy. You can list as many repos as you want, but it was the same policy. So if you wanted to look back three months, meaning it’s been up there for three months but hasn’t been downloaded for three months, that’s it for everyone. So I changed that so you can have one or more repos per policy, or however you want to do that.

And so I contributed that back to the open source and you can download it, the latest version, there. The one caveat is that you need the latest Artifactory 4.x so that its Groovy support handles the ConfigSlurper processing properly. You also want that because in the previous Groovy support there was a vulnerability anyway.

So these are some of the curl requests that you can do and how you specify them. Unfortunately, that one got underlined. But basically it’s the same call, but I’ve added a couple of options where you specify the months, the repo name or list of names, and I added a dry run option. So if you want to check this out to make sure it’s not going to mess anything up, it basically just tells you what it was going to do and doesn’t do it. And then there’s the pacing parameter. It’s in milliseconds, so in this case it’s basically two seconds per delete. Then the other commands are stop, pause, and resume, like I’ve mentioned. So again this is all out there on GitHub. You can download this and use it.
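The slide with the curl requests is not reproduced in this transcript. As a rough sketch of how such a request URL might be assembled, the Python helper below follows Artifactory’s general `api/plugins/execute/<name>?params=...` convention for user plugins. The execution name `cleanup` and the parameter names `months`, `repos`, `dryRun`, and `paceTimeMS` are assumptions based on the options named in the talk, and the host and port are placeholders; check the plugin’s README for the exact spelling.

```python
from typing import Mapping


def plugin_execute_url(base_url: str, execution: str, params: Mapping[str, str]) -> str:
    """Build a plugin execution URL with parameters joined as key=value;key=value.

    Values are assumed to be URL-safe already; no escaping is done here.
    """
    joined = ";".join(f"{key}={value}" for key, value in params.items())
    return f"{base_url.rstrip('/')}/api/plugins/execute/{execution}?params={joined}"


# Hypothetical example: three-month look-back on two repos, dry run, 2 s per delete.
example = plugin_execute_url(
    "http://localhost:8081/artifactory",
    "cleanup",
    {"months": "3", "repos": "mobile,fastbreak", "dryRun": "true", "paceTimeMS": "2000"},
)
```

The resulting string would be passed to curl (or any HTTP client) as a POST with admin credentials.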

Here’s an example of a run now with the logging that I’ve added. So when it starts up I have two repos in this particular policy. They’re called Mobile and Fastbreak. And then the number of months I look back is three, so artifacts have to be older than three months. And I have the pacing set to one second per delete. One second is probably very conservative; you know, maybe half a second or less would do. And again, I can dynamically adjust that. So when it starts out, it tells you the parameters and then starts deleting. And if you notice on the last couple of lines, it tells you how many files it’s deleting. It’s two of 257,000. And the total bytes returned so far. I believe that’s like 10 megabytes. Or am I off? That’s a gigabyte. So, anyway, that information goes out there. It’s all logged at the info level, and that, of course, can be adjusted through the logback.xml of Artifactory.

And here’s the policy. The single policy is on the bottom. This is how the old configuration properties file was. And the one above is the new format; again, Groovy’s ConfigSlurper supports this properly, so you can have as many policies as you want, and then within each policy you define the repo or repos, and then you can optionally put these other values on there: whether you want a pacing parameter or not, or whether you want a dry run or not. So that is the way that works now.
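The slide showing the two configuration formats is not part of the transcript. To make the multi-policy idea concrete, here is a small Python sketch of the data shape: a list of policies, each with its own repos and look-back period, plus optional pacing and dry-run values. The key names (`repos`, `months`, `paceTimeMS`, `dryRun`) and the `Policy` class are illustrative assumptions, not the plugin’s actual ConfigSlurper syntax.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class Policy:
    repos: Tuple[str, ...]
    months: int
    pace_time_ms: int = 0   # 0 means no pacing between deletes
    dry_run: bool = False


def load_policies(raw: Sequence[Mapping[str, Any]]) -> List[Policy]:
    """Normalize raw policy entries, filling in defaults for the optional values."""
    policies = []
    for entry in raw:
        repos = entry["repos"]
        if isinstance(repos, str):  # allow a single repo name instead of a list
            repos = [repos]
        policies.append(
            Policy(
                repos=tuple(repos),
                months=int(entry["months"]),
                pace_time_ms=int(entry.get("paceTimeMS", 0)),
                dry_run=bool(entry.get("dryRun", False)),
            )
        )
    return policies


# Two hypothetical policies with different look-back periods.
policies = load_policies([
    {"repos": ["mobile", "fastbreak"], "months": 3, "paceTimeMS": 1000},
    {"repos": "libs-snapshot-local", "months": 6, "dryRun": True},
])
```

The old single-policy format corresponds to a list with exactly one entry, which is why the change can stay backward compatible in spirit.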

So that’s kind of what it did. That actually, like I said, cleaned up, like, nine terabytes of data. We set it at, I believe, originally, like, two seconds. I got it down to a second. And it took us like a week to do it, but it’s free, you know. It doesn’t hurt to let it run unless you really are tight on space.

Yes?

[Audience] A couple questions. Did you break your snap mirror before running or did you just get the backup or did you use the […] instance? Because I think if you deleted the master to […]

Yes.

[Audience]

No, we did not break the snap mirror, we let it go. We do take snapshots, so if I need to recover anything there are hourly, weekly snapshots taken. So, even after running this, we didn’t see the space returned back to us until the snapshots ended up rolling off.

[Audience] You got rid of the snapshots […].

Yeah.

[Audience] The second question that I have is […] Docker images as part of your […]?

So the question was do I delete any Docker images. At this point, no. The Mobile one is the iOS and Android images. And Fastbreak, I’m not sure exactly what that was, but it was not Docker. We don’t use Artifactory for Docker at this point.

Any other questions? Yes.

[Audience] So, you’re using the Groovy script. But […] could you also use the CLI […].

So the question is whether the CLI, scripting it through CLI would be better or not. To be honest, I haven’t used CLI in this particular case. But I make just one call to get the list of all of the artifacts to be deleted. I don’t know how that would be returned back in the CLI. Some sort of object, I guess. And you, I assume, do the same thing. You want to be able to pace it. This has the pacing built in. You’d have to script it with the CLI if you wanted to pace it or not. But I think, essentially, it would be the same thing.

The only advantage of a plugin is that, in this particular case, there’s a scheduled job that’s defined within the Groovy script. So, it’s set to run, I believe, on a Sunday at 5 A.M. or something like that, so it’s a regularly occurring thing. You can set up a cron job to run the CLI as well if you wanted to, but this seemed a little bit easier. It’s configurable: I choose the repos I want by just updating the properties file. Again, you can do the same thing with the CLI. You can set up a properties file. So I don’t know if there’s any real advantage to it. It’s just packaged that way. It’s there for you already.

Any other questions? Yes?

[Audience] So did you have to get the […] cleanup as well because sometimes the artifacts are not […]?

Yes. There’s the other plugin, delete something or other, I can’t remember the name of it. But yes, we did do that, and right now that’s a manual step to kick that one off. I’m sure we can figure out a way to automate that as well.

Any other questions?

[Audience]

I’m sorry?

[Audience]

Yeah. We’ve basically worked with the organization to let us know what kind of policy was right for them. We picked the three biggest consumers of all of our space as a starting point. We haven’t gone beyond that. We have had discussions about having a way to tag particular artifacts, and then we clean up the ones that are tagged, or not tagged, that sort of thing too. ’Cause with this, it’s anything that hasn’t been downloaded in three months; well, that doesn’t mean it’s not actually out in production. Right? You could eventually get rid of something that’s in production and you didn’t want to. So this isn’t the ideal thing. This is where, maybe, the scripting idea of using the CLI comes in. You can do it that way and have a different means of tagging and checking the tags, or maybe even enhance this plugin to optionally look for certain tags to identify it, but yeah, it’s something that we negotiated with the groups.

Any other questions? It’s hard to see. That’s better.

So the next part I was going to talk about, because the first thing that I did was the cleanup, is that next we wanted to move from Artifactory 3.9 to 4.7. We had some stuff using Chef, the ancillary stuff, but the main stuff was our own proprietary packaging and configuration. So for Artifactory itself, we completely moved it to Chef. Everything from installing to configuring Artifactory, whether it’s a stand-alone or an HA configuration, is done through Chef recipes. And then there are the start and stop functions of Artifactory. I looked on the Chef marketplace and I didn’t find anything that did exactly everything that I wanted, so hopefully my stuff will go out there at some point.

So, the way I approached this is that I wanted it data driven. So everything I did was pretty much data driven. The version of Artifactory to pull down is a Chef attribute. It’s defined in the cookbook but it can be overridden in the different environments. So we have a test environment, a staging environment, and a production environment. And we could potentially override that attribute and have different versions if we wanted to test it out. The same goes for the particular package name and the location to download it from. We set up a remote caching mirror to JFrog, so when we first run this, we get the latest or the version of Artifactory we want and then we cache it. So, if I do that in our test environment, when I go to staging, I don’t have to go back out to JFrog. It’s now cached. And same with production.

But all that stuff is data driven. Everything is set up that way. Like in an HA environment, the filer volume names and the actual hosts that have the volumes are all attributes. And they can all be configured either in the environment or in the role’s attribute settings. So when you assign a node to an environment, for example staging, that would mean certain things, and then I can set different filer volumes for staging versus production versus anything else.

So a lot of the resources are just standard resources. I use yum_repository to point to our caching mirror, and package to define the enterprise version name and then the version number. So now if I want to upgrade to 4.8, now that it came out, I change one line in the default attributes in our cookbook, commit it, and our pipeline will deploy it. Pretty simple. All of our instances obviously have to have a license. I put them in Chef Vault, as in any other mechanism. So now when a particular deployment is going on, on a particular host, it goes to Chef Vault and it gets its particular license, and I use the file resource to place it in the right location in Artifactory’s configuration for that particular host.

I use templates for a number of items. The storage properties in particular; I’m trying to remember exactly what’s in that one, but it’s a template. The system properties and then the cluster properties, they’re all templates. Everything comes from attributes. Nothing is hard-coded anywhere in any of the recipes. And then again, those can vary. So I have one attribute that defines if this installation is a stand-alone or it’s a high availability one, and then where these files get located differs based on that. The recipe handles where that goes.

There are a number of places where we have XML files that we need to modify even though we don’t own them. And I didn’t want to take the whole file and put it under a template. In particular, our instances are set up for SSL and so I need to modify the server.xml. And so I used a Chef cookbook that I found on the marketplace. It’s called xml_file. And I found it did a few things, but I needed more out of it. So I extended it and I contributed that back to the owner of that package. One set of changes he adopted; on the second set I haven’t heard back from him. But it basically allows me to modify an existing XML file. For example, for the connectors, I can replace them with SSL connectors, and we use the APR Tomcat native library for faster connections. So I have to delete the ones that come with Artifactory and replace them with our specific XML segments.

I also use it for the plugins. Again, we take selected plugins. We only have a handful right now that we get from the open source. But I wanted to be able to turn on the logging and I didn’t want to manually go and edit that, so I use the same xml_file resource to go in and add a logger entry to the logback configuration to say, for this particular plugin I want to log at level debug or info or warning or whatever.

And it works great. As you saw a few slides back, that was part of the properties file: you define the logging level in that now. You can default it to nothing, so no logger entry is created for it. And that’s the case for the delete empty folder piece, but for the cleanup I wanted to see, I wanted to know, how many files and, you know, how many bytes were saved, basically. So in the properties I set info, and when that’s detected, the xml_file resource will go out and add a logger entry if it doesn’t already exist. And if I at some point want to change that, xml_file is smart enough not to add another entry, and I don’t have to delete the old one. It’ll find just that segment and update just that segment. So that was a useful thing.
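The add-if-missing, update-if-present behavior described here is an idempotent XML edit. The Python sketch below shows the same idea with the standard library. It is not the xml_file cookbook (which is Ruby and Chef), and the logger layout, a `<logger name="...">` element with a nested `<level value="..."/>`, is an assumption about how the entry looks in logback.xml; `ensure_logger` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def ensure_logger(logback_xml: str, name: str, level: str) -> str:
    """Return the XML with exactly one <logger name=...> entry set to the given level."""
    root = ET.fromstring(logback_xml)

    # Look for an existing logger entry with this name directly under the root.
    logger = None
    for candidate in root.findall("logger"):
        if candidate.get("name") == name:
            logger = candidate
            break

    # Add the entry only if it is missing, so repeated runs never duplicate it.
    if logger is None:
        logger = ET.SubElement(root, "logger", {"name": name})

    # Update just this segment: create or overwrite the nested level element.
    level_el = logger.find("level")
    if level_el is None:
        level_el = ET.SubElement(logger, "level")
    level_el.set("value", level)

    return ET.tostring(root, encoding="unicode")
```

Running it twice with different levels leaves a single entry carrying the latest level, which is the property that makes it safe to apply on every Chef run.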

For HA, again, I use a bunch of templates for the node properties. Directories for creating the cluster home where you have the […] directory, the data directory, and the backup directory. I used the mount resource for mounting the NFS volumes. So you define, as properties, the host, the volume name, and any options on the mount that you particularly have. And so you have two sets. You have the primary and then the secondary for an HA configuration. And this was done at the […] location level as part of the role that you define in Chef. So, on the other side, you kind of just flip it around: they use their volumes and then they point back to each other. So it works really well. I use link for symlinking from the main Artifactory path to the filer volumes.

And then I created the start and stop. They’re basically library functions in this cookbook for starting and stopping. For starting I added retry functionality, ’cause I noticed when we migrated from 3.9 to 4.x, we came to a point where Artifactory didn’t come up. It basically put out a log message that there was a problem and to just restart it. And it turned out that on restarting it came up, so I added the retry logic. I wait for a period of time and check whether it’s up, and I detect that by doing, like, a curl to the application, ’cause if you went out to the system, the Java process was still there but it was just not responding on the port it’s supposed to be on. So if the curl works then we’re fine. If it doesn’t, I wait a little bit and then retry. It just saves a little bit of getting alerts and having to handle it.
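The start-with-retry logic can be sketched as follows in Python. This is an illustration under assumptions, not the cookbook’s Ruby library code: `start_with_retry` is a hypothetical name, `start` stands in for whatever launches the service, and `is_up` stands in for the curl-style probe against the application port. The attempt count and wait time are placeholders.

```python
import time
from typing import Callable


def start_with_retry(
    start: Callable[[], None],
    is_up: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Start the service, wait, and probe it; restart if the probe fails.

    Returns True as soon as a probe succeeds, False if all attempts fail.
    """
    for _ in range(attempts):
        start()
        # Give the application time to come up before probing it.
        sleep(wait_seconds)
        # Probe the application itself rather than checking for a process,
        # since the process can exist without answering on its port.
        if is_up():
            return True
    return False
```

Probing over HTTP rather than checking for the Java process is the important design choice here, since the talk notes the process was still present even when the application was not responding.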

The other thing that I spent a bit of time on was creating the kitchen environment. And this was really, really useful ’cause I could test out all these configuration options. For a kitchen environment, again, I could override the attributes that are defined in the cookbook. I can say this is a standalone and I want to give the host this name. You know, I can set this all up, and all on Vagrant, on my laptop, I can get an Artifactory instance installed, up and running, testing the SSL capabilities that were there and everything. I can change that in my kitchen to an HA. I didn’t go as far as to create two different Vagrant VMs. I didn’t do that. But for the one, I exported some NFS shares from my laptop so that the Vagrant VM could then mount those, so I could test the mounting aspect. All that sort of stuff. And it was really, really useful. I can do all this without committing code and watching our pipeline fail somewhere down the road. So it was very, very useful.

And I think that’s it. So I just want to thank JFrog for giving me the chance to talk to you guys, and I appreciate your coming to listen to me. And here are some references: the name of the plugin and where you can find the artifact cleanup, the one I committed, and then the xml_file cookbook. I point to my repo because he hasn’t accepted my latest change. It turns out that you can have a partial XML update but you can only have one, and I extended it so you can have multiples. I’m not sure why he hasn’t responded back on that, you know, positive or negative on it, but I’ll probably try and contact him again on that.

So, questions? Does anyone have questions?