Docker at scale


Jagan Subramanian, JFrog, and Viraj Purang, Oracle, May 2016: At Oracle, numerous product teams use Docker as part of their cloud continuous development. Each product team can use its own internal Docker registry based on Artifactory’s Docker support, enabling teams to manage their projects in a distinct registry, exercise better access control over their Docker images, and share images across the organization.

Talk Transcription:

[Jagan] Okay Viraj, so why don’t we start by telling us a little bit more about your team.

[Viraj] Okay. Well, I have to start with that.

[Jagan] Oh I forgot that.

[Viraj] This is what our lawyers require us to, you know, put up there.

[Viraj] So that’s a little bit about my team. So, we do a bunch of stuff pretty much like any one of you out here. If you’re working in the DevOps space, you know, we do CI/CD, we write pipelines, we have a DSL through which we are able to, you know, create a bunch of CI/CD pipelines that do, you know, a bunch of stuff. Of, you know, running tests on the farm, doing builds, doing unit tests, you know, creating Docker instances, deploying them in production, all of that is kind of driven through those pipelines. And our team is responsible for that kind of stuff.

[Viraj] We do a bunch of tool development in and around Docker, around Artifactory, around internal tools like Carson, which is something, you know, that one of my friends is going to talk about in one of the later presentations during the day. And we do analysis, we do POCs. So whenever we have a new product coming in or whenever we have a new version of Docker coming in or even when we started using Docker for the first time, we are the team who tries out things first. We write up the specs about it, give the recommendations around it and, you know, we come up with the documentation that drives the whole process.

[Viraj] In terms of, you know, doing pretty much anything in production, we interface with other teams who take some of our different services in production.

[Viraj] We are very strongly focused on testing so all our deployments into production, all our deployments into the Artifactory data centers, anything pretty much happens on the basis of automated testing we do. We try not to manually deploy anything. That’s how we can separate ourselves from a bunch of other teams within Oracle or outside of Oracle which do the same thing.

[Viraj] We also do a lot of graphing and, you know, we look at numbers. We instrument our code a lot and we look at, you know, pretty much all HTTP requests. What times we are doing. We try to bring them down. It’s just, you know, that’s how we function.

[Viraj] You don’t have to renovate this fast.

[Jagan] They’re getting directions to get out earlier, huh. Okay.

[Viraj] I’m not that bad.

[Jagan] So last year, Viraj, you know, there was a huge increase in terms of the Maven builds and very rapid adoption within Oracle, right? So how has this changed since last year?

[Viraj] So, last year he was the one who was giving out these numbers so he should probably be the last person to ask me these questions. But, yeah, we had a pretty large repository last year. I’m not sure what the exact numbers were but we were doing somewhere around 40 terabytes in terms of just general storage. Right now we are at 80 terabytes and this is after deleting about a million artifacts a day. And that’s just general Artifactory usage. It includes a bunch of different types of packaging systems. But...

[Jagan] I think the slide did not change, actually.

[Viraj] But there’s a lot of changes that are happening around us. There’s internal changes, there’s external changes, and then there’s changes that are happening on the Artifactory side itself. And I’ll go through all of them in order.

[Viraj] So the first one, like right here, you know, last year I guess this was pretty much the same thing but we weren’t as cognizant about that. But approximately 90 percent of the people, here or outside, they already know about Docker. Out of that, 70 percent have tried something on Docker. And most of them are running something at the very least in development if not in production.

[Viraj] StackEngine is one of the companies Oracle bought sometime last year. These are the numbers that they provided but the core reasons, the core drivers why they’re doing it, are, you know, the hybrid cloud, the VMware cost, et cetera.

[Viraj] Apart from that, we had some internal drivers. And if you look at Oracle’s footprint over the last year, this is a little bit of a dated slide, but we have about 800 petabytes of storage. That’s our cloud footprint as of now. And all of that, at least, like I’d say, a bunch of that is driven through Docker related or Docker deployed applications. So we’re really drinking from the firehose at this point of time.

[Viraj] And then, you know, there’s artifact diversity. So, in 2015 we were just using Maven. There were, you know, maybe a couple of repositories here and there for different types of packaging systems. Our database, obviously, is on Oracle. We use, you know, Oracle Linux on which Artifactory is deployed.

[Viraj] But if you look at 2016 numbers, we have a lot of them. Like, you know, Python. We have RubyGems. We have Yum repository. And obviously, Docker is one of them. And this is just within one year of the explosion that has happened. So, you know, things are really changing around us.

[Jagan] Wow. That’s a large set of technologies there. So how does the JFrog platform help you, you know, adapt to these diverse packaging types and builds and all those things that you have?

[Viraj] So I think a bunch of it is fairly obvious. But, even though it’s obvious, it is really what drives us. So, let’s say you look at the types of repositories that I’m showing right here. You know, if I had to go and install a Yum repository and a NuGet repository and a Node.js npm repository or, you know, a Bower repository or any one of these, I would have had to work with different companies, install each one of them separately. You know, configure load balancers for each one of them separately. Configure […] proxies for each one of them separately. It’s just, I mean, the explosion just goes on. And from our perspective just being able to do it all in Artifactory, you know, nothing beats that. So that was one big reason why we went with Artifactory.

[Viraj] Another big thing that, you know, it really allows us to do is reuse a bunch of services that we developed on top of Artifactory. So there’s a workflow engine that we have that’s driven based on certain events. It drives how we provision our systems. It drives how we provision users on our systems. It drives, you know, the mitigating of different repositories. We didn’t have to write any of that because we already wrote that for Artifactory and, you know, once done, all we had to do was to just add a type and we were good to go.

[Jagan] So what you’re saying is that the universal aspect of Artifactory really benefits you. You don’t have to look for other custom solutions.

[Viraj] Exactly. Yeah. And there’s like the standard things like Logstash or Sensu monitoring. All of that is built in, you know, we also have our own internal, you know, agents, like EM and so on and so forth. It’s all there, ready to go. And we are able to scale. So for example our systems are running in about six or seven different data centers. We didn’t have to set up all of that separately for any one of these repositories. We’re good to go right from the beginning.

[Jagan] So clearly the scale is large, right? So, you know, when you were moving through this evolution, why would you even choose Docker in the first place? Or what were the challenges to get to this point?

[Viraj] So, there’s a bunch of reasons and I’ll come to that in a second but one thing that I think you should probably look at is our Docker numbers. Okay. And. Okay. So this is where we are right now. And if you look at the numbers that should just generally give you an idea of how much we’re using Docker. And so, we have 80 terabytes of total artifacts. And out of that four terabytes are, you know, just pure Docker. And we’re doing 1.5 Docker specific requests at this point of time. So that’s the depth of our integration with Docker at this level.

[Viraj] Now coming to the reason why we are doing it. Well, for one, the numbers. Right? Like I’ve had friends in the industry that had issues with Docker Hub purely because of numbers. And we’re able to get all this information because, you know, we have systems, existing systems, that can provide us this data. But the reason why we went for Docker was actually multi-fold. So, there’s two different, you know, paths, if I look at it. There’s development use cases. And there’s, you know, a bunch of services we offer in production that are either Docker based or Docker itself as a container. There are some services which are not yet out in production. And I’ll get to that slide in a second.

[Viraj] So one big reason why we went for Docker was because of, you know, requirements that came from continuous integration. And, you know, one big thing is we have a bunch of products. I can’t even count the number at this point of time. But each one of those products has a lengthy list of test cases that are run on a day-to-day basis. And every time they run the test cases the systems have to be brought down. There is cleanup that needs to be done. There’s pre-configs that need to be installed. And we need something that can combine all of this into one atomic unit. You know, we have installers which do some of it but that capability to be able to do everything in one shot is something that we were missing at the very least. And so that is something that drove our core, you know, drive to Docker.

[Viraj] Another big reason is that, you know, you can patch Docker. You know, you can make small changes into the Docker image and you’re good to go. There’s obviously, you know, smaller issues that relate to infrastructure consolidation. With Docker we are able to run most of our stuff at 100 percent utilization. With VMs we were, like, two percent, five percent, 10 percent. So there’s a lot of, you know, numbers that we were able to crunch on that.

[Viraj] There’s obviously, you know, a bunch of farms that have been created within the company. There’s, you know, a lot of DIY PaaS services so, you know, there’s teams that do Kubernetes. There’s teams that do Mesos and Docker Swarm.

[Viraj] And then obviously, I mean, we don’t have to go through installing JVMs or app servers and then deploy the artifacts into them. We just build a container once, we deploy it everywhere. And I think that part is really the most important one from my perspective. The containers should be free of state and configuration. Something that you can’t do with installers at this point in time.

[Viraj] And another part of it, which is, you know, not focused as much on the developers but more on what we offer to customers, is the services that Oracle actually offers on the cloud. And this can be divided into like four different pieces. There’s, you know, existing Oracle products that we bundle inside Docker containers. Currently Oracle Linux, WebLogic, Tuxedo, HTTP server, and Coherence are certified. And there’s a slide that I have towards the end of this presentation where, you know, we provide the links to where you can actually go and get those.

[Viraj] Then, you know, there’s infrastructure as a service. So basically customers can do their own deployments or management or orchestration within the Compute Cloud Service that Oracle offers. And so you can use your own third party tools or open source software. Then, there’s a new service that, you know, is not yet out. I don’t have the dates specifically for that, but you can actually design the stack of containers that you want to deploy and provide images and Oracle will run those images, manage your orchestrations and Docker workloads.

[Viraj] And then you have, you know, a bunch of services which, you know, you use but, you know, you don’t have access to the Docker APIs or to the Docker client. But internally we are deploying them using Docker. So consolidation of, like, these five things. So the developer workflows and the, you know, the four different types of services that I’ve talked about, you know, that’s really what is driving our push into, you know, Docker workspace.

[Jagan] Cool. Looks pretty impressive. So clearly there’s a lot of registries that you guys have, you know, from the previous slide that you had. So how are these registries used in your workflow?

[Viraj] Okay.

[Viraj] They didn’t quite get that so they had to repeat it again. That’s what it meant, right? Hi Siri.

[Jagan] So yeah — so — so yeah about the […]. Go ahead.

[Viraj] Okay. So in most cases, I mean, if you look here. Right. So we got a bunch of developers checking in their code changes. They have declarative dependencies that are mentioned in their Dockerfile. And once they do their builds, whether they’re in Jenkins or Hudson or TeamCity or whatever, those builds, you know, they create their images and they go into the registries. Now for each one of the services that we’re using, there’s typically a one-to-one mapping to an image. There could be more. And those images, we combine them into stacks.

[Viraj] So you can have a stack where you have MySQL in it, Tomcat in it, a bunch of Tomcat instances. Those stacks, you know, they’re basically what our service footprint is. And then we use etcd or some other, you know, flavor to discover the services because none of the host names or IP addresses are specifically embedded into the services. The ops teams then, I mean, they help us with the resource pools and groups. And if we have images that actually go through the whole round of testing without any issues then we deploy those images into production. It’s actually a pretty, you know, simple workflow. But really, I mean, that’s all there is to it.
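
The stack and discovery idea described here can be sketched roughly as follows. This is only an illustration, not Oracle's tooling: the in-memory `ServiceRegistry` class stands in for whatever key-value store is used for discovery, and the stack contents, host names, and port numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """In-memory stand-in for a key-value store used for service discovery."""
    _entries: dict = field(default_factory=dict)

    def register(self, service: str, host: str, port: int) -> None:
        # Record one running instance of a service.
        self._entries.setdefault(service, []).append((host, port))

    def lookup(self, service: str) -> list:
        # Return every known (host, port) pair for a service.
        return list(self._entries.get(service, []))


# Hypothetical stack: image name -> number of instances.
STACK = {"mysql": 1, "tomcat": 3}


def deploy_stack(stack: dict, registry: ServiceRegistry) -> None:
    """Pretend to start each container and record where it landed."""
    port = 30000  # assumed starting port, purely illustrative
    for image, count in stack.items():
        for index in range(count):
            registry.register(image, f"host-{index}.example.internal", port)
            port += 1


registry = ServiceRegistry()
deploy_stack(STACK, registry)

# Clients ask the registry for endpoints instead of hard-coding them.
for host, port in registry.lookup("tomcat"):
    print(f"tomcat instance at {host}:{port}")
```

Because callers resolve endpoints at run time, the same images can move between test and production hosts without being rebuilt.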

[Jagan] Okay. So, you know, clearly you have a large number of products.

[Viraj] Right.

[Jagan] Releases and modules that need to come together, right. So can you give us a sense of how this complex […] is managed? What does the pipeline look like for you?

[Viraj] The CI/CD pipeline that we have.

[Jagan] Yeah.

[Viraj] Okay. So traditionally, you know, we had a pipeline that would look somewhat like this. Straight flow, you know, there’d be some, you know, forks on, like, up or down or somewhere but for the better part it was mostly linear. But each one of these pipelines was doing something different. So you’d have service one which would be using an internal tool that, you know, creates VMs. And in those VMs we have a bunch of, you know, […] clusters and, you know, WebLogic servers and whatnot. That would be running. There’d be another service which would be doing something very similar but inside Docker. So instead of using virtual machines, they’re using Docker instances or Docker containers to do the same thing.

[Viraj] So with that kind of complexity, with that kind of difference of, you know, topology that was available, what we had to do was to look for something that doesn’t actually go and change the way they do things because they’ve already gone on that path. But we needed to provide standards that can be used to develop another set of tools or provide management reporting and a bunch of stuff like that.

[Viraj] So what we did was use an event broker, and again there’s another team from our company that’s going to talk about this stuff a couple of presentations down the line, how that event broker topology actually works. But what we did was something like this. So we divided our pipelines into stages and these stages are, you know, pretty standard. These names are pretty standard across the board, you know, there’d be the same thing. But in each one of those stages the specific actions that take place are different depending on the product that you’re using.

[Viraj] And so the jobs, so these orange dots that you see, these are Jenkins or Hudson jobs. We divided them up to different stages. And so if you look at this, you know, you could have one job that would do environment cleanup, another job that does this, you know, service registry. But there could be multiple jobs that actually create the part for you. And so on and so forth. So this allows us to, you know, have a consistent interface on the top while at the bottom these jobs could be completely different. Our tools interact with the upper layer — the stages that we talked about and these specific, you know, pieces. So the create parts or delete part could actually be a Docker command that is being run.
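
The layering described here, a fixed set of stage names on top and product-specific jobs underneath, might look something like the sketch below. It is an assumption-laden illustration: the stage order, the product names, and every job function are hypothetical, and real jobs would be Jenkins or Hudson jobs rather than Python functions.

```python
from typing import Callable, Dict, List

# Standard stage names shared by every pipeline (order is an assumption).
STAGES: List[str] = ["environment_cleanup", "create", "service_registry", "delete"]

Job = Callable[[], str]


# Hypothetical jobs; each returns a short description of what it did.
def release_vms() -> str:
    return "released virtual machines"


def create_vms() -> str:
    return "created virtual machines"


def remove_containers() -> str:
    return "removed old containers"


def pull_images() -> str:
    return "pulled images from the registry"


def run_containers() -> str:
    return "started containers"


def register_service() -> str:
    return "registered service endpoints"


# Same stage names for both products, different jobs behind each stage.
PIPELINES: Dict[str, Dict[str, List[Job]]] = {
    "service-one": {
        "environment_cleanup": [release_vms],
        "create": [create_vms],
        "service_registry": [register_service],
        "delete": [release_vms],
    },
    "service-two": {
        "environment_cleanup": [remove_containers],
        "create": [pull_images, run_containers],
        "service_registry": [register_service],
        "delete": [remove_containers],
    },
}


def run_pipeline(product: str) -> List[str]:
    """Run every job of every stage in order and report stage-level events."""
    events = []
    for stage in STAGES:
        for job in PIPELINES[product].get(stage, []):
            events.append(f"{stage}: {job()}")
    return events


for event in run_pipeline("service-two"):
    print(event)
```

Reporting and management tools only need to understand the stage names in `STAGES`; they never have to know whether a stage was implemented with virtual machines or with containers.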

[Jagan] Well, looks like a tremendous journey there. I’m sure it was super easy to do this, right? Can you tell us a little bit about what the challenges were, you know, what was the evolution here?

[Viraj] Yeah. So we went through a little bit of, you know, pain on this one. So when you are running an 80-terabyte repository, nothing comes easy. So, backups and restores are almost impractical to do. Most filers start running out of space at 100 terabytes and, you know, Docker itself is pretty. So I’m actually going to switch over to the next slide. Are you able to see this?

[Audience] No.

[Viraj] No?

[Jagan] […] read through it.

[Viraj] Okay. So, you know, there’s a few things that we faced. One was the fact that the frequency of releases in Docker is, I mean, it’s amazing. So, seven days and they come out with a release. Our systems that are based on Maven, on the other hand, they’re okay with even, you know, one Artifactory release in six months. And because of that, there’s a huge amount of flux that happens on the Maven side while we’re updating to the latest versions of Docker. You know, because, let’s say Docker moved from API version v1 to v2. Or the compatibility of the clients changes so, you know, we had to switch our repository to 4.7. The Oracle Linux version that was supposed to run version 1.10, that changed.

[Viraj] There’s a lot of these, like, interrelated things that keep happening and so what we ended up doing was bifurcating our deployment. So, Docker repositories go into a separate Artifactory server and the other Maven or Bower or, you know, whatever else was there, they stay on one repository. And it’s not really about Docker versus other repository types. I think it’s more about the frequency of deployments. If you have repository types whose versions change quite significantly, I mean, and fairly frequently, then we put them on a separate repository. Otherwise we put them on the main stable repo that we have.
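
The placement rule described here, split by how often a repository type forces upgrades rather than by the type itself, can be written down as a small sketch. The interval figures, the 30-day threshold, and the server labels are assumptions chosen for illustration only.

```python
# Rough number of days between upgrades each package type tends to force.
# Values are illustrative assumptions, not measured data.
UPGRADE_INTERVAL_DAYS = {
    "docker": 7,
    "maven": 180,
    "bower": 180,
}

# Assumed cut-off: anything that changes more often than this is "fast-moving".
THRESHOLD_DAYS = 30


def choose_server(package_type: str) -> str:
    """Pick the Artifactory deployment a repository type should live on."""
    interval = UPGRADE_INTERVAL_DAYS[package_type]
    return "fast-moving" if interval < THRESHOLD_DAYS else "stable"


for package_type in UPGRADE_INTERVAL_DAYS:
    print(f"{package_type} -> {choose_server(package_type)} server")
```

Keying the decision on cadence means a new repository type needs only one more entry in the table, not a new rule.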

[Viraj] Apart from that, you know, we’ve had deletion related issues. So version one of the API, we tried deleting tags, we tried deleting images and there were issues with it. Like we would delete the images, sorry, we would delete the tags and the images would not get deleted. The space would not be reclaimed and, you know, given the fact that we were already at, like, 80 terabytes or something, we were actually running into a situation where we were concerned that, you know, we’ll run out of space on our filers.
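
One general reason deleting a tag may not free space is that a tag is only a pointer to layers, and layers can be shared between images. The sketch below illustrates that idea with made-up names and sizes; it is not a description of how Artifactory or the v1 API actually stores or deletes images.

```python
# Hypothetical tags, each pointing at an ordered list of layer IDs.
tags = {
    "app:1": ["base", "app-1-layer"],
    "app:2": ["base", "app-2-layer"],
}

# Hypothetical layer sizes in megabytes.
layers = {"base": 200, "app-1-layer": 50, "app-2-layer": 60}


def delete_tag(tag_map: dict, name: str) -> None:
    """Remove only the pointer; no layer data is touched."""
    tag_map.pop(name, None)


def collect_garbage(tag_map: dict, layer_store: dict) -> int:
    """Delete layers no tag references and return the megabytes freed."""
    referenced = {layer for ids in tag_map.values() for layer in ids}
    freed = 0
    for layer_id in list(layer_store):
        if layer_id not in referenced:
            freed += layer_store.pop(layer_id)
    return freed


delete_tag(tags, "app:1")
print("used after tag delete:", sum(layers.values()))

freed = collect_garbage(tags, layers)
print("freed by garbage collection:", freed)
print("used after garbage collection:", sum(layers.values()))
```

In this toy model the shared `base` layer survives because `app:2` still references it, so only the layer unique to the deleted tag is reclaimed.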

[Viraj] So our solution was to, I mean, it was pretty simple, just move to v2. But then converting a one terabyte repository from version one to version two, the migration itself takes, like, a couple of hours. So, you know, there’s problems like that that we faced over the years. And then there’s specific issues that are related to Docker itself. So, you know, they changed their mechanism for storing their configs from .dockercfg to config.json, which is in the .docker folder. And most users have to do a docker login and run their commands and stuff like that.
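
The client config change mentioned here can be illustrated with a small conversion sketch. It assumes the legacy file is a flat JSON map from registry address to credentials and that the newer `config.json` nests the same map under an `auths` key; the registry host and credentials are hypothetical, and running `docker login` again remains the simpler route for most users.

```python
import base64
import json


def convert_legacy_config(legacy: dict) -> dict:
    """Wrap legacy per-registry entries under the 'auths' key."""
    return {"auths": {registry: dict(entry) for registry, entry in legacy.items()}}


# Hypothetical legacy content: registry address -> credentials.
legacy_config = {
    "registry.example.internal": {
        # The auth value is base64 of "user:password" (dummy values here).
        "auth": base64.b64encode(b"user:pass").decode("ascii"),
        "email": "user@example.com",
    }
}

new_config = convert_legacy_config(legacy_config)
print(json.dumps(new_config, indent=2))
```

The function copies each entry rather than reusing it, so later edits to the converted structure do not alter the legacy data.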

[Viraj] So there’s a lot of training related to, you know, just making sure that people understand that the version of the repository just changed from, like, 4.2 to 4.7. So on and so forth. So we have an internal team that, you know, actually sits down, just runs these tests against the latest ones, sees how it’s different from, you know, what was done earlier and then publishes them as a guideline.

[Viraj] Just trying to think of other things. Yeah. I think that’s pretty much it. I mean, there’s other smaller things.

[Jagan] So, yeah. It seems like it has been quite a journey for you guys and there’s a lot of things that have changed at Oracle and in the implementation that you guys have with JFrog Artifactory. So what is the next set of things you guys are focusing on?

[Viraj] Okay. So, this is what we’re focusing on now. So our CI/CD pipeline is the one that I just showed you with the orange dots and we have a bunch of services that are actually following that. But we are focusing on getting more and more teams onboarded on that one. But more than that, especially from a Docker perspective, you know, testware. So all the test cases that you write that are supposed to run in production, that are supposed to run in any other, you know, topology environment, we’re starting to bundle them as Docker images.

[Viraj] So, you know, all the configurations that are related to those tests, those will be driven through Docker images. Then the deployers themselves: if you need to deploy something, while you’re deploying you have to run a workflow. That itself is run inside the Docker image that we created. Obviously this is not standard. This is not across the board. There’s a bunch of services that we work with to do something like this. But the idea is to roll forward and try out, you know, this in the coming year. I think the most important one that we are looking at right now is using Bintray in production to… So whatever artifacts we have in our artifact repository, we want to be able to push them into production in a more standard fashion.

[Jagan] So I’m guessing the announcements earlier today about distribution repositories should make this easier. Can you talk a little bit more about, you know, what the use case is, how the production infrastructure is different, and what the challenges are that you have today that you’re trying to solve or look for a solution to on Bintray?

[Viraj] Right. So, you know, the biggest problem that I think he was alluding to during the initial talk was about firewalls. So that’s, I think, pretty standard. Everyone has that problem. How to pull bits from the DMZ or, like, when your client is in the DMZ and you want to pull something from your internal corporate network. So that’s one, you know, big thing. We have alternative solutions that are in place. There, you know, you can write up a synchronization server, for example. But that doesn’t give you integration with LDAP. That doesn’t give you integration with a bunch of other services that are running in the cloud. And so the idea is to have Bintray, you know, go there and, like, do pretty much whatever it gives.

[Jagan] Okay.

[Viraj] Entitlements, and authentication, authorizations. These are basic things. But out there, you know, because of the fact that you can’t integrate with other services, we just have simple services that do point A to point B, not the whole.

[Jagan] Great.

[Viraj] Set of actions.

[Jagan] Well, looks really great. So I think we’re coming up on time. Maybe we’ll open up for some questions if there are any.