Gilad Garon and Kiril Nesenko – VMware, May 2016: VMware’s Common SaaS Platform (CSP) is a brand new offering designed to enhance the productivity of developers and cloud providers by equipping them with a set of common and configurable capabilities (such as Identity, Telemetry, Account Management, Billing etc.), thus enabling them to focus on their core businesses.
CSP is distributed to numerous cloud providers around the globe, used by developers and IT alike to empower their services and better answer the business needs of their customers.
Please join us and witness how we take continuous delivery to the next step, where sometimes the target environment is not under our control, and still seamlessly manage and deliver our unique collection of capabilities, packaged as a platform for ease of use, using the best and shiniest tools the frogs can provide.
[Gilad] All right. So we’re going to talk about how we do the CI/CD process in VMware. We’re going to begin with a little bit of context on what the Common SaaS Platform is. It’s just something that VMware is developing internally. Don’t worry, this is not a product speech — I’m not a product guy and we don’t sell it, so you’re good on that. After that we’re going to talk thoroughly about the CI/CD process for the Common SaaS Platform — just CSP. Then we’re going to go into how we upgrade CSP once it’s in production. And if we have time, I will talk a bit about Xenon, which is a distributed control plane open source framework from VMware.
So my name is Gilad. I’m an architect at VMware, and alongside here with me is Kiril Nesenko, our DevOps lead. We work in the cloud provider software […], which is just a fancy way of saying we develop services for cloud providers.
And you might be asking yourself: Services? VMware? How many people in here know that VMware has services? Except for the VMware guys, of course. Yeah, we’re mostly known for vSphere, but VMware is transitioning from a product-based company to a services-based company. That doesn’t mean we’re ditching vSphere — vSphere is a great product and we’re going to continue to develop it. But we’re making a transition, we’re starting to develop SaaS offerings, and we’ve been doing that for the last two years. In the process of developing SaaS, we noticed that many services inside VMware have a common ground that they need to hook up to. Most of the services need to hook up to the VMware billing systems and identity systems — VMware ID — so when you log into a service, you don’t need to create a new account; you can just use your existing VMware ID account if you have one. And other capabilities such as monitoring and telemetry.
And what we discovered was not surprising: developers prefer to write business logic, right? Who here has written an integration with their billing system? That’s not a fun procedure. In the good case, you have a SOAP-based interface; in the worst case, you need to send huge amounts of XML to some AMQP servers, which is not really fun.
So, like good engineers, we decided to […] our efforts and create a platform that internal VMware services will run on and integrate with. The platform does the heavy lifting for them and integrates with the billing and identity systems inside VMware.
And that platform is what Kiril and I have been building. It’s called the Common SaaS Platform, or CSP. When designing a SaaS platform, we had a set of design principles that we decided upon. Our first principle is that our platform should be cloud agnostic: it should run on any cloud provider we want, outside of VMware or inside VMware. Of course, you can’t do SaaS without being highly available and scalable, so those are two more design principles. Our platform should have great public APIs, so that internal services using our platform will have an awesome experience and not a bad one. Our platform should be modular: we should be able to add capabilities to the platform and even distribute the platform with some capabilities turned off or missing. And finally, our platform should be easy to operate once in production and easy to develop during the development phase.
And how do these design principles turn out in practice? In order to be cloud agnostic, we decided that the only thing we require from the infrastructure running our platform is support for containers. So our main artifact in CSP is a container running a jar, and I believe most cloud providers can support containers. For high availability, we use tunable consistency, and most of our data is eventually consistent. Our platform is also stateful, not stateless: we do not have a separate database backend; we store all of our data inside the platform itself and use the filesystem just for backup and recoverability. For scalability, our platform runs on a dynamic cluster using the SWIM protocol, so we can just add nodes on the fly in an ad-hoc manner.
In order to facilitate great public APIs, we decided there are no internal APIs in CSP. All of our capabilities — the developers themselves inside CSP — use their own public APIs, and not the kind of shortcuts that platforms usually create. So, no internal APIs.
Modularity means that our capabilities come as libraries on our classpath. Each capability is a separate jar, and coupling between those capabilities — say, the billing module wants to talk with the identity module — goes through the public APIs, not direct code access. And to be able to develop and operate it simply, our platform is deployed as a single jar. No Tomcat containers, just one jar that you run. It’s not Spring Boot based, it’s […] based — our internal framework — and that gives us a lot of flexibility.
So in practice, when we deploy our platform, it looks like this. It proved to us that when we stuck to our principles, we ended up with a pretty simple platform. This is CSP deployed at some cloud provider. I don’t know if you can see it in the back because of the projector issues, but all these green squares are just containers with a single jar inside them. That’s the platform. And these containers use an ad-hoc network to connect to each other. If we want to add another container and scale our platform, we just pull one in, give it a peer IP address, and we’re good to go. But I could talk about CSP architecture all day, and this is not really an architecture conference — and fortunately for us, we have a pretty cool CI/CD process. For that I’ll invite Kiril to the stage to present our CI/CD process, and I’ll be back towards the end. Thank you.
[Kiril] Okay. So, I’m going to talk about how we do the CI/CD process for the CSP product. This is the use case that works for us; in your case, you might choose different tools. But 95 percent of what I’m going to present is fully open source, so you can reuse it. We have some internal processes that I won’t share because you won’t be able to reproduce them, but you can use the same patterns that we use.
So, this is the high-level overview of our infrastructure. We use a Gerrit server for code review and Git for source control management, Jenkins as a CI server, and Artifactory. We use Bintray to deliver our artifacts to our customers, and we use a few environments. As Gilad mentioned, we use Docker containers. To be able to orchestrate all this stuff, you currently have a few options — Kubernetes, Mesos, Swarm — and we chose Mesos because that’s what currently works for us. Maybe in the future we’ll switch, but currently we’re based on Mesos, and we use it to orchestrate all this stuff.
So we have a few environments: an automation environment, R&D, staging, production, and also the customer environments that use our products. So, how does the flow go? A developer sends a commit; it goes through our review system, and from the review system it goes to Jenkins. Jenkins does all the work — testing, builds, code analysis. Then we publish the artifact — the artifact is a Docker image — into our internal Docker Artifactory. Then we deploy to our automation environment, which is Mesos, and run tests, everything. And if everything is okay, we can deploy to R&D, staging, and production.
This part is kind of optional. Currently we do not deploy there automatically — we could, the infrastructure is ready — but we do not deploy on each commit. We deploy once a day or something like that.
After we see that the artifacts in our production environments are okay, we promote the artifact. Then we push it into Bintray, and the customers can take those artifacts from Bintray. And if we have customers that are not satisfied with the Bintray solution, we can just push to their Docker registry. In our case it’s Artifactory, but it could be a simple Docker registry — we just push the image, and then they can use it.
So this is the high-level overview of what happens and how we deliver the product. Now I’m going to deep dive into the left side — what happens in CI/CD and how we do it.
So this is our Mesos infrastructure; this is how we deploy CSP. You can see here we have three masters. We use Marathon as our scheduler to run tasks on top of Mesos, and we use Docker slaves. For load balancing we use Marathon-lb, which is open source as well. So all the traffic goes through the Marathon load balancers, and we use one external load balancer — currently it depends which one — which forwards the traffic to the internal load balancers.
So which tools do we use for CSP CI/CD? Artifacts: Artifactory and Bintray. CI: Jenkins. Source control: Git. Code review: Gerrit — an open source project developed by the Google folks, with great integration with Jenkins. For Jenkins slaves, we use Docker. What does that mean? We don’t use static slaves. We have a few Docker servers, so we have slaves on demand, depending on the load. Each time we run a job, we provision a new slave — a Docker container — and run the tests there. So we can have as many slaves as we want, and each time we execute a job, we are sure we’re starting with a clean environment. Which is cool.
So, this is our internal infrastructure. We have approximately 300 Jenkins jobs and we’re growing, because we keep adding new pipelines. We have separate Git repositories per project. We used to have one, and it created a lot of problems for us: from the CI perspective, it was very hard to understand which project was changed inside, and we needed to create wrappers inside the jobs. It didn’t work for us, so we decided to separate them. As I mentioned, we have the on-the-fly Jenkins slaves, and Jenkins and Slack integration. And these are the technologies we use for Mesos: Marathon as the scheduler, Marathon-lb for load balancing, Mesos-DNS, Calico for the networking solution, and Chronos to execute tasks on our cluster.
And as I mentioned, we have a lot of Jenkins jobs and we are growing. For those of you who use Jenkins: it’s very hard to manage those from the UI. Just a mess.
So this is an example. Which jobs do you want to change? I want to change all Gradle jobs. Let’s say you have 200 Gradle jobs and you want to bump the Gradle version. If you’re using the UI, you need to go through each job and change the version. Which is not cool.
So for this purpose, we use Jenkins Job Builder, an open source project developed by the OpenStack folks. What it gives you is the ability to store your jobs in YAML format. When you create a new job on the Jenkins side and click save, on the backend Jenkins creates an XML file on the filesystem. XML is not very human readable; you cannot really check that file into Git and maintain it. YAML is more human readable, so you create your jobs in YAML format. If you want to change a job, or create a new one, you go through the same process as a developer: you check it into Git — you create a new patch, change the YAML, send it to our reviewing system — we test it, and then we deploy. When you deploy, the tool takes those YAML files, transforms them into XML, and deploys them to the Jenkins side. Everything happens for you automatically — nothing to do manually in the UI.
It also saves you configuration duplication — I’ll show you how. You can include shell, Groovy, Python scripts, whatever, inside those jobs. Let’s say you have the same script that you want to include in different jobs: you just save it in one place on your filesystem and include the file in the YAML. And you can actually test it before deploying — that’s what we do. And you can organize it on the filesystem: all builders, all deployment jobs, whatever.
And of course it serves as a backup, because the whole configuration is saved in Git. So I really don’t care if some developer removed a job, or changed a job and it failed — each time we can run the redeployment job and everything will be redeployed. So our developers don’t use the UI; we use only YAML files.
I don’t know if you can see it, because the resolution is not good, but this is an example of the YAML file — the description of a job. You can see the job name is docker_clean_images_container: whether you want to support concurrency or not, which parameters you want to include in this job, when you want this job to be triggered, which Git repository to clone, which builder — which script — to execute. You just fill it in.
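In Jenkins Job Builder syntax, a definition along those lines looks roughly like this — only the job name comes from the slide; the parameter, trigger, repository URL, and script path are invented for illustration:

```yaml
# Hypothetical JJB job definition, checked into Git and reviewed like code.
- job:
    name: docker_clean_images_container
    concurrent: false
    parameters:
      - string:
          name: DOCKER_SERVER
          description: 'Docker host to clean up (assumed parameter)'
    triggers:
      - timed: '@daily'
    scm:
      - git:
          url: 'ssh://gerrit.example.com/ci-tools'
    builders:
      - shell:
          !include-raw: scripts/docker-clean-images.sh
```

The `!include-raw` directive is how the shared scripts mentioned above get pulled in from a single place on the filesystem.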
And this is the end result — how it looks on the Jenkins side. You create the YAML file, you run the tool, it deploys to Jenkins, and this is what you get. But still, it’s not enough, because if you have 200 Gradle jobs, you still need one file per job. That doesn’t solve the problem.
That’s why there are templates. You can create a template for common jobs, like Gradle jobs or builder jobs, and a job can inherit from the template and just fill in the missing parameters. Say, for the Gradle jobs: when you build your Java projects, you want everything to be common across all jobs — environment variables, the same Gradle version, everything the same except one parameter, the project that you want to build. So you create the template and reuse it, just forwarding the missing parameters into the job, and the job is created for you. In this example, if you have 200 Gradle jobs and you want to bump the Gradle version, you change it in only one place — the template — and everything is redeployed for you. No manual work through the UI; everything goes through Git.
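As a hedged sketch in JJB style, a Gradle template might look like this — the project names, path, and version are invented; the real templates are internal:

```yaml
# Hypothetical job-template: the Gradle version lives in one place.
- job-template:
    name: '{name}-gradle-build'
    builders:
      - shell: |
          # Bumping Gradle for all 200 jobs means editing this template only.
          export GRADLE_HOME=/opt/gradle-2.13
          $GRADLE_HOME/bin/gradle clean build

# Each project fills in only the missing parameter: its name.
- project:
    name: billing
    jobs:
      - '{name}-gradle-build'

- project:
    name: identity
    jobs:
      - '{name}-gradle-build'
```

Running the tool against this generates a `billing-gradle-build` job and an `identity-gradle-build` job from the single template.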
This is an example of Jenkins redeploying its jobs. The job is running, redeploying each one, and we’re good. Deployed.
Okay, so which job types do we have? We currently have three major job types. The first one is the gating job. The purpose of the gating job is to listen for patch-set-created events. What does that mean? Each time a developer sends a commit, the gating job is executed: we run tests, code coverage — this is the next slide.
Then the build. The build is the actual job that builds something for you — an rpm, for example, a war, a jar, a Docker container, whatever. And the listeners: the listener is the job that listens for change-merged events, which means that when a patch is merged, this kind of job is executed. It constructs the pipeline for you dynamically — we’ll explain how we do it.
So, the gating job. For each patch we run a gating job, and each Git project has its own gating job. As I mentioned before, we split our projects so each one is in its own Git repository, and for each Git repository we have its own gating job — which means we currently have 20. The purpose of this job is to build, test, and post results to our reviewing system. When a developer creates a commit and sends a patch, the job listens for the patch-set-created event, is executed automatically, builds and tests the project, and posts the results to Gerrit. So within 10 minutes, even for a big project, the developer already has the results on the Gerrit side and can see whether it’s good or not.
This is the flow of how the gating job works. A developer sends a patch. Jenkins fetches the patch set, runs the tests against the patch, and posts the results to Gerrit. Then the developer decides whether to merge or not. Also, in our case, developers are not allowed to merge patches without a human review — we need one more developer to review the patch. If the patch is merged, we start the pipeline, which deploys the product for us.
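The gating decision itself boils down to something like the sketch below — a Python simplification of the flow just described; the real jobs are Jenkins builds, and the function names here are made up:

```python
def gate(patch, build, test):
    """Build and test a fetched patch set; return the vote that
    Jenkins would post back to Gerrit (+1 pass, -1 fail)."""
    try:
        build(patch)   # e.g. compile the project on a clean Docker slave
        test(patch)    # run the test suite against the patch
    except Exception:
        return -1      # minus one from Jenkins: the patch cannot be merged
    return +1          # plus one: the patch still needs a human review

def broken_build(patch):
    raise RuntimeError("compilation failed")

# A passing patch gets +1; a failing build gets -1.
ok = gate("refs/changes/42/1", lambda p: None, lambda p: None)
bad = gate("refs/changes/43/1", broken_build, lambda p: None)
```

The merge itself still requires a second developer's approval on top of the Jenkins vote, as described above.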
So this is an example from Gerrit — these are screenshots from our instance. As you can see, this is a change: someone sent a change and added a few more lines.
This is an example of a Jenkins failure. As you can see, this is the gating job: a developer sent a commit and, for example, the build failed. You can see that we got a minus one from Jenkins, so this patch cannot be merged. And we really don’t care if the build works on your laptop, or in your environment — it should work on Jenkins, because each time we provision a new slave with a clean environment. So, make it happen on Jenkins. And if there are […] problems, which we might have — because the Jenkins slaves are Docker images, a developer can just pull the image to his laptop and execute the tests in his own environment. So he can easily reproduce the build.
This is an example of a Sonar failure. The Jenkins job might pass — the tests and the build are okay — but Sonar, the code coverage tool, gave a minus one; that’s the first line here. So the developer should fix it. The reason might be that the developer added code complexity, or there are major comments on the patch. Everything is available through this UI: you just click on the links — they’re […] links — and you can see the results there. The developer should fix it.
And the last one is a Gerrit failure. So what is a Gerrit failure — how can Gerrit, the reviewing system, fail a patch? When you work with Git, you have two kinds of hooks. The first kind is Git hooks, which are executed on the client side: when you commit, they can run some scripts on your laptop and check something for you. That’s the client-side hook, the Git hook.
And Gerrit hooks are a special kind of hook executed on the server side. What can you do with a hook executed on the server side? You can decide on which event you want to execute each hook. For example, on a patch-set-created event you can do X, Y, Z, and if I merge the patch, I can do something else.
So what do we currently do with those hooks? We check the commit message style. We don’t allow commits like “fix bug” or something like that, because we have an internal release process tool which collects the commits from all projects and generates the release notes on the fly — that’s why we need a specific commit template. We check for trailing whitespace. And, for example, let’s say you have a bug and you put the bug link into your commit message: when you merge and everything is okay, from Gerrit you can integrate with an external system like Bugzilla or Jira and update the ticket — close it, reopen it, do whatever you want.
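A server-side check of this kind can be sketched in a few lines of Python — an illustration of the idea, not the actual hook; the message pattern and function name are invented:

```python
import re

def check_patch(commit_message, added_lines):
    """Return the problems a Gerrit hook would flag on a patch set.
    added_lines are the diff's added lines, newline already stripped."""
    problems = []
    # Reject vague messages like "fix bug"; the real hook enforces a
    # commit template so release notes can be generated automatically.
    if re.fullmatch(r"(fix(ed)?|bugfix)( bug)?\.?",
                    commit_message.strip(), re.IGNORECASE):
        problems.append("commit message does not follow the template")
    # Flag trailing whitespace on any added line.
    if any(line != line.rstrip() for line in added_lines):
        problems.append("trailing whitespace")
    return problems
```

A patch that fails any of these checks gets a negative vote before a human ever looks at it, just like the Jenkins and Sonar gates.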
This is an example of Gerrit failing a patch. Here you can see the patch had trailing spaces, and the developer should remove them. So in order to be able to merge a patch, you need to pass the Jenkins side — tests, code coverage — you need to pass Gerrit, and you need to pass a review. A human review. After you’ve managed to actually merge your patch, we start the dynamic pipelines.
So, who is responsible for the dynamic pipeline? The listener jobs, which are executed on the change-merged event. When we have a merge event on the Gerrit side, the job is executed on the Jenkins side, and we orchestrate the pipeline dynamically. How do we do it? We use the Build Flow plugin, which is a way to orchestrate your builds through code. In Jenkins, usually, when you create a job it’s kind of static: you do one, two, three, and if one of the steps fails, everything fails. Here you can create your pipeline dynamically through Groovy code. You write the code and, depending on the event, you can decide however you want to construct your pipeline.
Same as for gating jobs, each Git project has its own listener job, and each listener job might create a different pipeline. But all of them run the same code base — we do not want to maintain different code bases for different listeners. So we use one code base, and the pipeline is orchestrated dynamically for you.
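Build Flow pipelines are written in Groovy, but the pattern — one code base that decides the steps per project at runtime — can be sketched in Python like this; the project types and step names are invented for illustration:

```python
def build_pipeline(project_type, promote=False):
    """Decide the pipeline steps for a project at runtime, the way a
    single listener code base would for every Git repository."""
    steps = ["build", "unit-tests", "publish-docker-image"]
    if project_type == "service":
        # Services additionally get deployed and integration-tested.
        steps += ["deploy-to-automation", "integration-tests"]
    if promote:
        steps.append("push-to-bintray")
    return steps

# A library gets the short pipeline; a promoted service gets the long one.
short = build_pipeline("library")
full = build_pipeline("service", promote=True)
```

This is why one commit's flow can look completely different from another's, while both listeners run identical code.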
And of course, on a failure, the user is notified on Slack. So when the pipeline starts and you fail, let’s say, in the deployment phase, we know how to notify the user. This is an example: this guy broke the build, and we attach a link to the job, so the developer can access the job and see what happened and why it failed. And this is me shouting at this guy to fix the build.
So, an example, to see what actually happens. As you can see, we have the listeners on the left side. We commit, and one flow might go like this, and another one can go like that. Why does that happen? Because it depends on the project type — we can decide it programmatically.
So, the parallel deployments — how do we do it? A developer sends a commit to our Gerrit server; from Gerrit it goes to our Jenkins; and from Jenkins, after we’ve run all the tests, the code analysis, Gerrit, everything I showed before, we push a Docker image to Artifactory. Then we deploy to our automation environment, where we run more tests. And here you can see that in the R&D, staging, and production environments we already have a running deployment — a running […], a running cluster. So what happens when the automation passes: we push to Bintray, and then we deploy a new cluster to the R&D environment and perform the migration of the data — we’re doing blue-green deployments. We deploy a new one, perform the migration, and destroy the old environment. The same happens for staging, and the same might happen for production. And as you can see, another developer brings in a new change, and it goes through the same flow. Everything is the same, the same flow.
To be able to do this — to continuously deploy to your environments — you need to have a strong infrastructure and good tests, because if you cannot say for sure that your product is 100 percent tested, you cannot continuously deploy to your environments. Right? So we have a strong testing infrastructure on the Jenkins side.
And these are examples of the pipelines — screenshots I took from Jenkins. You can see that one pipeline is very short, the second might be much longer, and the third might look like this — and all of them run the same codebase. Everything is constructed dynamically for you. What I didn’t mention: we have a special kind of job that goes through all the projects and extracts the open source dependencies that we have, and we automatically integrate with VMware internal systems to notify them which open source projects we use and under which licenses. But that’s off topic.
So, that’s the CI/CD perspective, at a very high level — how we do it. I can show you more afterwards, and Gilad will now explain how we upgrade the platform.
[Gilad] Okay, thanks. So as you can see, we have a comprehensive pipeline that takes us straight from the developer machine, through internal build pipelines, and through corporate pipelines such as legal — to check whether we can use our open source dependencies — and security vulnerability checks. And that’s just a single pipeline towards production.
And when we reach production, we need to upgrade our platform, right? CSP is a stateful platform, which means, again, that all the data is stored in memory — some of it on disk, but that’s not really relevant here. And it all happens in the same tier: the business logic, the state, and the persistence layer are all inside the same tier, and that poses some issues when upgrading.
Upgrading is not really a new notion, right? There’s blue-green, red-black — I call them old-new, it’s simpler for me — and there are also rolling upgrades. When we researched how we were going to do the upgrade, we had two main goals. One is to have minimal service interruptions, so that our external customers won’t feel that we’re upgrading our platform. And two, because the state and the business logic are coupled together, we cannot do a schema upgrade first, if necessary, and then perform the business logic upgrade separately — we need to do both in the same phase. And of course, the other main goal is just deploying the new bits and bytes to the production environments.
So we faced some challenges. CSP is a symmetrical cluster, which means we cannot add nodes with different code into it — we need to use blue-green for that. And, like I said, since state and business logic are on the same tier, we cannot separate our schema upgrades from the business logic upgrade.
So we designed an upgrade system that does the following. Now, you have to remember that while we upgrade the platform, the platform is live, receiving traffic, and its state is changing all the time. So what we did was think about how we can do it in migration cycles. A migration cycle means that the new cluster — the new CSP deployment — pulls data from the old platform, performs schema transformation on it, and then returns some meaningful metrics to the upgrade orchestrator running in the background, which also runs on the new cluster. Then we inspect those metrics and try to make an educated guess: can we stop traffic for a minimal amount of time that will not cause service interruptions, or do we need to continue and do another migration cycle?
So the process is: run a migration cycle; check the threshold, which analyzes the metrics that the previous migration cycle returned. If the threshold is crossed, we queue the traffic on our proxy, perform one more migration cycle to compensate for the data that was changed during the upgrade itself, and then release the traffic and route it to the new cluster.
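That loop can be sketched as follows — a schematic with a hypothetical one-second threshold, not CSP's actual orchestrator:

```python
def run_upgrade(pull_delta, max_pause_secs=1.0):
    """Repeat migration cycles until a cycle is fast enough to hide
    behind a short traffic pause, then do the final cutover cycle.

    pull_delta() copies and schema-transforms the documents changed
    since the last cycle and returns (documents, seconds)."""
    cycles = []
    while True:
        docs, secs = pull_delta()
        cycles.append((docs, secs))
        if secs <= max_pause_secs:   # threshold crossed: delta is small enough
            break
    # Traffic is now queued on the proxy; migrate the final delta,
    # then release the traffic toward the new (green) cluster.
    cycles.append(pull_delta())
    return cycles

# Replaying the numbers from the talk: 50M docs / 25 s, 6M / 5 s,
# 90K / 0.5 s, and a final 10K-document cutover cycle.
deltas = iter([(50_000_000, 25.0), (6_000_000, 5.0),
               (90_000, 0.5), (10_000, 0.1)])
cycles = run_upgrade(lambda: next(deltas))
```

Each cycle's delta shrinks because only the documents changed since the previous cycle need to be copied again.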
We have an example of that here — I don’t know if you in the back can see it properly; sorry, projector issues. This is the existing situation: we have the blue node group — a node group is a cluster in CSP which runs several nodes — and we’ve decided to start a migration. So we deploy a new CSP with the updated code, and we have a state discovery phase in which we discover what we want to copy and where it resides on the blue cluster. Then we start the migration-transformation cycles: the new cluster pulls the state from the blue one and performs the necessary schema transformations.
So in this example, on the first run we pulled 50 million documents, and that took us 25 seconds. We run the threshold check on it, to see if these metrics are good — and of course they’re not, because if the next cycle also takes 25 seconds, HTTP clients will start getting errors. So we perform another migration cycle, and this one pulled just 6 million — state keeps updating in the platform because it’s live, so six million documents had been changed — and this only took five seconds. As you can see, the delta shrinks each time we do a migration cycle. In the final migration cycle, we only migrate 90K documents, and that takes half a second. When we run our custom threshold check, it tells us that’s probably a good metric, and the next migration cycle will probably take even less than that. So at this point in time, we just stop the traffic to the blue cluster and queue it in our proxy, perform another migration cycle — which pulls the last remaining state that hasn’t been copied, in this case only 10K documents, in a really small amount of time — and after that’s done, we reroute the traffic to the new cluster. We keep the blue one around in case something goes wrong in the meanwhile. And that is how we do upgrades in CSP.
So I have a few more minutes before the Q&A, so I can say a few words about Xenon. CSP is based on Xenon, which is VMware’s open source distributed control plane. When you build a SaaS platform, you usually need to stand up a number of nodes containing the following layers, right?
You need an orchestration layer and a container — you’ll probably use Spring Boot. You need a persistence layer — you can go with Cassandra or Mongo, the popular choices. You need a translation layer, or an ORM if you want one. Because you’re running in a cluster, you probably need ZooKeeper or etcd to configure and maintain your cluster. You also need a UI server — you can use Node.js. Some cache layer, because you care about […] and performance — like Redis. And some message bus, eventually.
And while all these technologies are proven — we’ve all used them here in this room, I believe — it sort of creates a technological zoo. Because if you want to deploy a service with all these layers, it can get a bit complicated: you need to start up your Cassandra cluster, starting the nodes one after another; and after the Cassandra cluster has started, you can boot up your own cluster with Spring Boot, connect it to a Redis cluster, and configure it in ZooKeeper.
And while this is a valid approach, it’s a bit complex to operate, and I think it’s also a bit complex to develop for, because in your development environment you need to run all of this, or run embedded mocks or embedded versions of the technology.
So the Xenon framework from VMware is a batteries-included framework. That’s not the most popular way to do things today in the industry, but still, that’s the way we went. It gives you some of these layers inside, so you can use them out of the box: a REST API, so you can create your own microservices — or, with Xenon, it’s more of a nanoservices approach. You get a persistence layer based on […], which is a proven technology and distributes very easily. And you also get leader election capabilities, publish-subscribe capabilities, statistics, metrics, and even UI serving capabilities.
So that’s it on our side. Xenon is open source — you can read about it on GitHub. And Jenkins Job Builder, which Kiril mentioned, is also a great project by the OpenStack folks. And that’s it. Thank you.