The first 5 million is the hardest: How Cisco went from 0 to 5 million artifacts

Abstract:

Prathibha Ayyappan, Cisco, May 2016: The Cisco team has deployed Artifactory in multiple Cisco data centers around the world with support for different types of repositories like maven, npm, python, yum and docker. How we made this happen in one year and a team of 5 is the essence of this talk. Come learn about our complex architecture, check out our custom application that completely automates repository management using Artifactory REST APIs and our stats application that gathers statistics about our Artifactory service.

Talk Transcription:

Ok. So, hi everyone. Thanks for coming to my talk and to see how we went from one to five million artifacts in a year with just five engineers.

But before we get started I wanted to give a quick introduction of what we are, what we do, and just the dynamics of where we fit into Cisco.

I belong to the Global Architecture and Technology Services organization. Is this on too? I have one here. Can everyone hear me? So like I was saying we belong to the GATS, Global Architecture and Technology Services, org in Cisco and what we do is we provide services, like tooling services, to engineers. It doesn’t mean we are not engineers, but the engineering org of Cisco is the teams that work on the real products that Cisco delivers to the market.

My name is Prathibha Ayyappan and I’ve been with Cisco, this is my 5th year, and this is my 4th year doing the […]. I was initially a Java full stack developer, so we still do that, but I’m also focused on the CI/CD tools. Our team specifically is called Build Management Services and what we do is we provide CI/CD tools and services to engineers, so we do Jenkins, Artifactory, SonarQube, Bamboo, Coverity. Those are the five we do for now and then we keep adding to it.

Like I said our target audience is the whole of the engineering community. The one good thing, I don’t know if it’s a good thing, but one thing in Cisco is there is no top-down approach, there’s no hard and fast rule to go a certain route. Every team has the flexibility to do whatever they want. So we work with teams that want to work with us and then once we have the success stories we go back to teams that don’t want to work with us and get them on board.

So these screenshots are from last week. This is our Artifactory master site, so we are at eight million two hundred and something thousand artifacts. We have twelve hundred local repositories in our master site. These numbers are really not very clear but the graph here shows that we got close to ten million requests during a time frame, so around April 6 we got ten million requests. That’s just total downloads and uploads to our Artifactory server.

We had close to 330 million download requests and 58.2 million upload requests. So that’s the scale of the Artifactory instance we’re operating. And these numbers, at least the request one, is only for the San Jose site, it’s not even raw data. Those are from the San Jose site as well. This is just one instance of Artifactory and this is how big our installation and our service is.

So today what I’m going to be talking about is how to do this, or how we got here, because we were new to the product and the product was fairly new too. So if you’re looking to run Artifactory as a service, which I know, I’ve been talking to a few of you in the audience and you are trying to do similar things in your company, so it’s good to have familiar faces. And if you are aiming for a zero downtime, no outages service, because when developers are developing and they want to upload and download, get dependencies, push the artifacts, they don’t want even, you know, ten seconds or even a minute of downtime. So if you are aiming to get that kind of a service level agreement for your customers. If you want to have global presence, and I know EMC, Bloomberg, a lot of companies here have global presence, so if you have developers that are working from outside of the United States and you want to have sites for them. And if you just want to be innovative and do something new, and not be like a typical IT team where people go and create cases and then you have a support team and a tier 1 and a tier 2 and a tier 3 and then finally it comes to you, then probably you should look at what we’re trying to do.

So let’s go back and get some history. What did we do? Where did we start? So there was this instance of Nexus that someone was running out of sheer goodness. He’s a developer and he was just maintaining it for a couple of orgs. And it was close to 500 gigs of data and he was fed up with maintaining it, so he wanted someone to take it over, and we were kind of starting our CI/CD team. So we thought that would be a great time to start and, you know, just take over. But we didn’t want to do things the way they were being done till then, so we did a bake-off of all the tools that are in the market and Artifactory popped up. So we did a bake-off between Artifactory and Nexus and we chose Artifactory for the obvious reason that Artifactory has support for many more artifact types, not just Maven and Gradle builds. At that point, when we started in 2014, it did have Yum support, which was big for us because Cisco, as you all know, is a company that has been doing embedded development and RPMs are the major artifact that gets created. So Yum support in Artifactory was big for us. So we went the Artifactory route.

Now what did we do first? So that’s our architecture take one. Our team is based out of RTP, the Raleigh-Durham area in North Carolina, and we have an engineering data center out there, so we did a stand-alone plain installation of Artifactory. So if you see, we had an Apache proxy that talked to an application server. CEL 6.2 is our operating system. It’s a flavor of Red Hat Linux which has some more Cisco components on top of it. So Tomcat, and then it had a filer, a two terabyte filer, and if you see it was just two terabytes but that’s how we started, maybe that was good as […] 500 gigs and two terabytes was huge. And then we connected to an Oracle database, it was […] at that time so it was not 12c. That’s how it looked and everything worked fine, or at least we thought it did, and then we realized that most of our engineering teams operate out of San Jose. They thought that the performance was not great if they were trying to upload and download from RTP, and the maven.cisco.com instance that we initially migrated was in San Jose, so they noticed the performance lag immediately.

So that’s our architecture take two. Thankfully Artifactory did have full replication support, so you could do an application level replication, and what we did was we did the same thing and we set up a site in San Jose. And we made San Jose our primary so all the users started going to San Jose and we started just using RTP as our DR instance.

So I know it’s very backwards but we set up our DR instance first and that’s how it all began. But that was our take two, and what we realized was if San Jose went down we had to manually fail over or change proxies to go to the RTP server, and it was not automated. And then we found a team within Cisco that was providing a product called Global Site Selector as a service. That is a Cisco home-grown product, but basically what it does is DNS, geolocation, and fail over kind of configurations, so if you configure GSS to work in fail over mode it will know immediately if the site goes down and it will automatically fail over to the other site.

So we did that and then we had a real solution. We had a primary site, we had a DR site, we had GSS that worked great, and we tested this in fail over mode and it worked fine.

So that was our architecture take 3 and we were very happy with it. But then we realized that Cisco is a global company and people from Bangalore and Israel and Green Park and all other places in the world did not find our service very good. It was only the US customers that were happy.

So we had to do this. So basically we set up two new sites, and it’s again Artifactory: it has a proxy, it has an app server, it has an Oracle backend. All that is the same, but we used GSS again, the Global Site Selector, to do geolocation-based load balancing, so if someone tries to use an Artifactory instance from some other part of the world they will not get directed to the master. And we also had to switch URLs a little. So our master site had a new URL and our global site had another URL, so there are two URLs we operate under. And this was our take four, which is fairly up to date. We still have only these four main sites and we’re looking to add more in the future, but these are our main ones.

But then we had to fail over to RTP twice last year, once because of an Oracle upgrade and once because the filer was going through an outage or something like that. And because by then people were using it a lot, it took us hours to replicate back from RTP to San Jose. It’s not easy. I mean, we’re a networking company but, you know, there is only so much the network can do. There’s only so much bandwidth in the data centers, so if you have to replicate back a hundred thousand artifacts it’s going to take some time. And if you went to the Mission Control talk this morning I think […] also mentioned that it can take almost weeks to copy everything back. So this was not ideal for us because we had only one node and if something happened then we would always have to fail over to RTP and then replicate across the country all the time.

So this is how our architecture looks right now. We have two buildings in San Jose. They’re technically part of the same data center but they’re still different buildings and they are part of different network racks and all that, so generally our system admins are pretty understanding and if it’s a planned downtime they don’t take everything down at once. So what we did was we made use of that and we made use of Artifactory HA, and we set up two more nodes for Artifactory and they’re in different buildings. We also have two HAProxy servers that are in different buildings and they are load balanced with an ACE load balancer, which basically works in fail over mode, so the point is we don’t want to have a single point of failure. Just saying. So if our HAProxy server goes down, the ACE load balancer will automatically fail over to the other HAProxy. If one of our nodes goes down we are still okay because we have the other two nodes that are still working. We have a SnapMirror in place between the two NAS, like the […], and if you see, from two terabytes we are now at 50 terabytes. But that’s how it is right now. So if one filer goes down we can work with the storage team to break the SnapMirror and we can still, you know, push data to the other filer.

Our only single point of failure now is our database, which we are working with the database admins to, you know, give us more of. And it is on a, like a, USC P2 class […] which has HA, so if the VM itself that the database is hosted on goes down they will move it automatically. But if they do any database work, like upgrades or patches, then we have a single point of failure and we do have to fail over to RTP.

Sure. Yeah. […] So the replication in Artifactory is from San Jose to RTP, and this is just a filer backup. It’s not even a backup, it’s like a hot stand-by, because if this filer goes down then we can just break the SnapMirror, we don’t have to take the whole site down.

[…] Yeah I’m gonna talk about that next.

Okay. Looks good. That’s how we are right now.

So like I was saying there is one site in Bangalore, there’s one site in Green Park, and San Jose, and you can see all the traffic from, say, Latin America goes to San Jose. Everything from the east coast goes to San Jose because, though the RTP site is real, it’s our DR site and nobody really goes to that site, and that’s how the redirection works. And if, say, Green Park goes down, all our traffic will get routed to San Jose, and if San Jose goes down then all our traffic goes to RTP.
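
The redirection behavior described here can be sketched as a small lookup: each region has a preferred site, and traffic falls back along a fixed chain when a site is down. This is only an illustration, not how GSS is implemented or configured. The region names, the mapping of regions to Green Park and Bangalore, the fallback for Bangalore, and the `route` function are all assumptions made for the example.

```python
# Hypothetical sketch of geolocation routing with fail over.
# Site names come from the talk; region keys and function names are illustrative.

PREFERRED_SITE = {
    "us-west": "san-jose",
    "us-east": "san-jose",       # RTP is the DR site, so east coast users still go to San Jose
    "latin-america": "san-jose",
    "europe": "green-park",      # assumption: the talk does not list the regions per site
    "india": "bangalore",        # assumption
}

# Where traffic goes when a site is unavailable.
FALLBACK = {
    "green-park": "san-jose",
    "bangalore": "san-jose",     # assumption: the talk only states this for Green Park
    "san-jose": "rtp",
}


def route(region, healthy_sites):
    """Return the site that should serve a request from `region`."""
    site = PREFERRED_SITE[region]
    seen = set()
    while site not in healthy_sites:
        if site in seen or site not in FALLBACK:
            raise RuntimeError(f"no healthy site available for {region}")
        seen.add(site)
        site = FALLBACK[site]
    return site
```

With all four sites healthy, a request from the assumed `europe` region stays at Green Park; with Green Park removed from `healthy_sites` it lands in San Jose, and with San Jose also removed it ends up at RTP.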

Okay. So, coming to your questions: How did we set up read through caches, replication and all that fun stuff?

So we can just treat the whole master setup as an abstraction for now, but basically what we do is every write from either Green Park or Bangalore, any of our global sites, will get replicated directly to the local repositories in the master. And they really don’t care if it’s the San Jose or the RTP master because they can completely abstract it. And what we do is we set up read-through caches. If you are aware of how the Artifactory icons look, those are the local repositories, those are the remote repositories, and the virtual repositories aggregate the locals and the remotes. Basically what we do is we rely heavily on the remote and virtual repository support.

So we create remote repositories that pull from the virtual groups, or the virtual repos, from our master, so we don’t have to replicate everything everywhere. We can get by with just writing everything to all the sites and replicating everything to the master, but we don’t have to do the reverse. And if someone wants an artifact that was uploaded to the master and that’s not available in their local site, they can use the virtual repo there and that will basically get it from the master and then cache it on the local site.
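
As a rough illustration of this read-through cache setup, the sketch below builds the kind of JSON repository configurations that Artifactory's repository REST endpoint (`PUT /api/repositories/{repoKey}`) accepts: a local repository for writes, a remote repository whose URL points at the master's virtual repository, and a virtual repository that aggregates both. The host name, the repository key naming scheme, and the helper function are hypothetical, and the field set is kept to a minimum, so check the configuration schema for your Artifactory version before relying on it.

```python
MASTER_URL = "https://artifactory-master.example.com/artifactory"  # hypothetical host


def read_through_configs(team, package_type="maven"):
    """Build repository configs for one global (non-master) site."""
    local_key = f"{team}-local"
    remote_key = f"{team}-master-cache"
    virtual_key = f"{team}-group"
    return [
        # Writes land here and are replicated to the master separately.
        {"key": local_key, "rclass": "local", "packageType": package_type},
        # Remote repo that proxies, and caches from, the master's virtual repo.
        # Assumption: the master exposes a virtual repo with the same key.
        {
            "key": remote_key,
            "rclass": "remote",
            "packageType": package_type,
            "url": f"{MASTER_URL}/{virtual_key}",
        },
        # Users resolve from the virtual repo, which aggregates both.
        {
            "key": virtual_key,
            "rclass": "virtual",
            "packageType": package_type,
            "repositories": [local_key, remote_key],
        },
    ]
```

Each returned dictionary could be serialized with `json.dumps` and sent as the body of one repository creation request. Nothing is sent over the network here.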

So there are a couple of things we always have to expose to […] because till […] started support for deploying to virtuals our users would always have to deploy to the local repository directly, but they always have to fetch from the virtual repositories because they really don’t know what site they are going to. So that’s one thing, but that is more like a user training thing that we got around slowly. We’re still getting around it. But our bigger problem is that we are restricted to the master site if Artifactory does not have support for remote and virtual repositories. So Yum and Debian are big examples. Docker and […] were examples but I can’t quote them anymore because Artifactory did introduce support for virtual and remote repositories for […] and Docker. But our provisioning logic is basically based on repository type, so we have to make sure that we know what repository the team needs, and our provisioning logic totally differs depending on what kind of repository they want.
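
One way to picture the type-dependent provisioning logic is a function that decides, per repository type, which sites get a repository. This is a simplified sketch rather than the team's actual code: the set of master-only types reflects just the examples named in the talk (Yum and Debian), and the site identifiers, function name, and exact repository classes created per site are assumptions.

```python
MASTER_SITES = ["san-jose", "rtp"]
GLOBAL_SITES = ["green-park", "bangalore"]

# Types named in the talk as lacking remote/virtual repository support at the time.
MASTER_ONLY_TYPES = {"yum", "debian"}


def provisioning_plan(package_type):
    """Return the repository classes to create at each site for one request."""
    package_type = package_type.lower()
    plan = {site: ["local"] for site in MASTER_SITES}
    if package_type in MASTER_ONLY_TYPES:
        # No read-through cache is possible, so the repo lives on the master only.
        return plan
    for site in MASTER_SITES:
        plan[site].append("virtual")
    for site in GLOBAL_SITES:
        # Local for writes, remote as the cache of the master, virtual for reads.
        plan[site] = ["local", "remote", "virtual"]
    return plan
```

A Yum request yields a plan that touches only San Jose and RTP, while a Maven request yields entries for all four sites.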

We have other challenges. Like I was alluding to in the beginning of my talk, our team is very small and we do development and we basically set things up, but we are not the support team. So if our support team gets questions like, you know, why am I not able to get this artifact back, then they need to know the provisioning logic. They don’t have the rights to go and create repositories.

They are not efficient. So you know a developer is trying to run his build or Jenkins is trying to run a build and it’s failing, and our developer cannot wait for two days to basically understand our global setup.

So just training, lack of knowledge, all these are challenges that every service faces and we face that too. And then the actual manual provisioning process. So say I actually had to go and create repositories in four sites: San Jose, RTP, Green Park, and Bangalore. And set up replication properly, set up […]. For example if I just check a box it would delete everything from the master site because using delete […]. And stuff like that, there is so much scope for human error and it takes so long. We actually did this exercise and if we had to create Maven repositories it would take us 45 clicks. If we had to create Debian or Yum repositories, those are only on the master, and with the way we set up replication and set up users, permission targets, groups, it was taking 25 clicks. And this is considering there was no human error. If there is human error, you have to redo the whole thing, remember.

So that’s just productivity lost and our customers were not happy. At all. We tried to automate it and that is what I was going to talk about next, on how we changed our provisioning time. Oh, and by the way, if we did Docker then we also had an extra piece to do the Apache provisioning, so if you know how the Docker provisioning works, every repository gets its own port. Unless you have your own sub-domains to do different Docker repositories, and we can’t get that because every time you have to create your own sub-domain you have to get your request […]. That’s more time consuming than just giving them different Apache ports. So that’s added work there. So we did this exercise manually too. Our Docker provisioning process was taking 90 minutes. That’s one and a half hours to provision one Docker repository.
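
The per-repository port assignment for Docker can be sketched as a simple allocator that hands each new repository the next unused port. The port range, repository names, and class below are assumptions for illustration. The talk does not describe how ports were actually chosen or how the Apache configuration was generated.

```python
class DockerPortAllocator:
    """Assigns one proxy port per Docker repository (hypothetical sketch)."""

    def __init__(self, first_port=5000, last_port=5999):
        self._first = first_port
        self._last = last_port
        self._by_repo = {}

    def allocate(self, repo_key):
        # Re-requesting an existing repo returns its port, so retries are safe.
        if repo_key in self._by_repo:
            return self._by_repo[repo_key]
        used = set(self._by_repo.values())
        for port in range(self._first, self._last + 1):
            if port not in used:
                self._by_repo[repo_key] = port
                return port
        raise RuntimeError("no free ports left in the configured range")
```

Keeping the mapping from repository to port in one place is what lets an automated step write the matching proxy entry without a person tracking which ports are taken.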

PyPI, NPM, Yum, and Maven were almost the same. It was not taking 90 minutes but it was taking almost an hour to set up everything everywhere. Not fun. So what we did was we built an […], basically, that automates everything. It’s our pseudo Mission Control. This was before Mission Control was there. So what we wanted to do was automate provisioning orchestration. And the way we built this app, it’s completely based on plugins, so it currently supports Jenkins, Artifactory, SonarQube, and if we added Bamboo support in the future then we can just go and plug it into the system. And the one you see here is our Artifactory instance, that’s our pseudo […] version, so that is our Artifactory instance in the DMZ, so we could automatically provision all these instances by just using one application.

And that is what I’m going to demo next. Are there any questions?

Okay. So those are just some screenshots. But basically what we did was, we are all Java developers so we stuck with Java. We stuck with the Spring framework. We have an AngularJS frontend. We use REST APIs heavily and then we have a service layer which basically sanitizes, validates, and then provisions. And like I was saying, all these API clients are pluggable. So if I’m trying to make a call to the Artifactory instances then I’ll use the Artifactory client.
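
The sanitize, validate, provision flow with pluggable clients might look roughly like the following. The original application is written in Java with Spring; Python is used here only to keep the sketch short, and every class, method, and field name is invented for the example.

```python
from abc import ABC, abstractmethod


class ToolClient(ABC):
    """Interface each tool plugin (Artifactory, Jenkins, ...) implements."""

    @abstractmethod
    def provision(self, request):
        ...


class FakeArtifactoryClient(ToolClient):
    """Stand-in plugin; a real one would call the Artifactory REST API."""

    def provision(self, request):
        return f"created {request['type']} repo {request['name']}"


class ProvisioningService:
    def __init__(self):
        self._clients = {}

    def register(self, tool, client):
        # Adding support for a new tool means registering one more client.
        self._clients[tool] = client

    def handle(self, tool, request):
        # Sanitize: normalize every field to a trimmed, lower-case string.
        request = {k: str(v).strip().lower() for k, v in request.items()}
        # Validate: both fields must be present and non-empty.
        if not request.get("name") or not request.get("type"):
            raise ValueError("request needs a name and a type")
        if tool not in self._clients:
            raise KeyError(f"no plugin registered for {tool}")
        # Provision: delegate to the plugin for the requested tool.
        return self._clients[tool].provision(request)
```

Because the service layer only depends on the `ToolClient` interface, supporting another tool means writing one more client class and registering it.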

We also have an Oracle database that it works with, so that way we treat our account management database as our source of truth. So we know exactly who made the request, who wanted the repositories. And when you work in a huge org that makes a lot of sense because you want to track which business unit is trying […] with you, and to have all that in place there’s no way you can track that in Artifactory. And we also wanted to know which teams are using all our services: Jenkins, Artifactory, and SonarQube. So that was helpful for us as well.

So account management: I can log in as myself, it has LDAP integration. Like I was saying, we have customers and teams, so we create records for them that go into the database so we know who is trying to use our services. And then we have the service layer, which I will talk about in detail because that makes more sense.

There is also access management. Because I was an admin I could see all this, the admin functionality. But if you were just a user you will not see all that, and all you can see is the services that are tied to your account. So if you are a customer that belongs to, say, the Internet of Things, the most advanced thing Cisco is trying to do right now, the IODR, then you will see only the services that belong to you. And what we can do basically is the admins in Artifactory make you the admin in account management without having to actually explain how Artifactory works or how Artifactory is configured, and you can auto-provision your repositories for your teams. And we have self-service APIs in the background.

So if you were a user and you were trying to make an Artifactory service, you just type up a name, and I use swampup, and you can fill in the users that need access, and we give you a default deployer ID. This is mainly for accounts that need, like, a generic user to run builds from Jenkins, or say you don’t want to expose your real […] credentials. So it’s pretty neat that we can give out a generic ID. Again that is tracked in the database so we can retrieve it for you or you can go change it if you forget it.

So you can do that and you can add your hosted repos. So you create repos, you see we have support for different types, you can say if it’s Maven, if it’s Yum, you can say if you want to calculate the metadata automatically, you can tell what depth your metadata needs to be. You can go change it in the future if you decide you don’t want your metadata to be at three. You can also create Docker repositories and it will do the Apache provisioning automatically. And we also have this neat documentation that shows you what your URLs should be, how you need to fetch your dependencies, what goes in your POM file, and that again, you know, depends on what kind of repository you’re trying to provision.

And then you can go to the virtual groups and actually change the configuration of your virtual group. So if you want two more hosted repos to be aggregated within your virtual groups you can do that. Which is what I did here. And you can also pull in some remote repositories if you care about that. And then you just click okay and submit and wait for a couple of minutes. But we have a great logging system. And we also have an audit trail. Every time someone tries to update a repository, or the whole service, you will know exactly what changed and who changed it. So that’s great for tracking purposes. And once we save you can also modify our service like I was saying, and if you wanted to add new users that need access to that repository you can totally do that.
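
An audit trail of this kind can be approximated by recording, for each update, who made it and which settings changed. The sketch below is a guess at the general shape, with hypothetical field names, not a description of the actual account management schema.

```python
from datetime import datetime, timezone


def audit_entry(user, service, before, after):
    """Describe what `user` changed on `service` as a single audit record."""
    # Keep only the settings whose value differs between the two snapshots.
    changes = {
        key: {"old": before.get(key), "new": after.get(key)}
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }
    return {
        "user": user,
        "service": service,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "changes": changes,
    }
```

Storing only the changed keys keeps each record small while still answering the two questions mentioned in the talk: what changed and who changed it.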

So what did we learn? We saved a lot of time. We saved close to 1800 hours just, you know, not doing manual provisioning, not going into […] sites, not making human errors. And account management takes like two minutes instead of 90 minutes to do the same thing. So that’s a whole lot of productivity gain for us and you know time is money so we made Cisco some money, or at least that’s what we think.

And that is also our usage ramp, our user ramp, so you know, from close to 100 users we went to 4500 users in account management, which shows that, you know, we’re doing something right, they like to use it. They don’t have to do anything with Artifactory, they don’t have to know anything about it, and they get their repositories.

So before I talk about the future, I just want to preface by saying that all that was possible only because Artifactory has great REST APIs. All we’re doing internally is using the REST APIs.

But Artifactory has Mission Control now so we’re going to investigate, prototype, and then integrate. I know we cannot get rid of account management completely because we still want to track customers and teams and all that stuff, so what we’ll do is remove the API client part so we’ll not integrate with Artifactory directly. Instead we’ll just integrate with Mission Control and then we’ll have all our logic that does the provisioning, and Mission Control will create the groups that […] was showing us this morning. So we’ll have like a master group and then, of course, they have DR, and then we’ll have the global sites configured properly, and then we can have something that has the repository provisioning in this and how to configure the repositories, and then probably just integrate with Mission Control from account management. That’s the plan for now.

Questions?