DevOps @ Research & Artifactory

Abstract:

Christian Vecchiola / IBM – IBM Research accounts for about 3,000 people, a third of whom are software engineers. As IBM embraces DevOps culture and practices, so does Research.

In this talk I will provide insights on the DevOps journey at Research, and how infrastructure and tooling, in particular artifact management, have influenced how we collaborate, do research, and build software.

Talk Transcription:

Thank you for coming by for this last talk of the day. I am mindful of the fact that there are parties coming up, so let’s keep to time.

This talk is about applying DevOps in a slightly different setting. My name is Christian Vecchiola and I’m interested in cloud computing and development operations and, more importantly, in how these two things enable us to do new and interesting things.

I work at the Australia research lab of IBM, which is one of the youngest additions to the family of research labs that IBM has spread across the world. Research is not located only at these places; in fact, we used to say that the world is our lab, which means we go wherever a research contribution is needed. What we are interested in doing is developing projects that have worldwide reach and an impact on practical applications.

So overall, what we do in IBM Research is essentially to be the headlights of the whole company: shedding light on the future, understanding what the new breakthroughs will be, and eventually inventing the next big thing, the next revolutionary thing.

So some of you might ask: okay, we are not really developing products or applications within the research division. We do experimental work. So why should we care about DevOps? Where does development operations fit into this picture, and why should we put effort into it? Well, it turns out that, at least in my lab, there is a considerable amount of effort not only in research but also in software engineering, primarily to make all the things that we do accessible outside the research community through applications and services. And for that, we need to be fast. We need to be able to roll out new prototypes and new updates to our projects pretty quickly. We need to be able to iterate between phases very fast, and we need to do it in a smart way.

Here is a typical scenario of what happens in our lab, and not only in our lab. Someone has an idea, or is presented with a customer problem that requires different thinking. Generally that is what triggers the research: the investigation, the understanding of how we can solve the problem in a better way. Whatever the problem would […]. Once the research team has made up its mind about how we should go about something, this is where development kicks in. Through a lot of coffee and tests, we eventually reach our very first prototype, which we can use internally, also to explore our own research and understanding. This cycle can be repeated many times until we are comfortable with something we can showcase and use to engage with customers. And eventually this will send us back to the drawing board for additional iterations.

So where does development operations help in all of this? First, it helps us run our models faster, and run our experiments on environments that we can provision on demand and deprovision when the experiments are done. Sometimes we run very heavy computations, so being able to provision large environments on demand is a pretty handy thing. It helps us deploy more quickly and more reliably; we want to be able to remove human error from the loop. But more importantly, it helps us automate a lot of this stuff so that we can actually focus on what really matters and […] us, which is the research work. Obviously, we didn’t get to the point of automating much of our production pipeline overnight.
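
To illustrate the idea of environments that exist only for the duration of an experiment, here is a minimal Python sketch. It is not taken from the talk: `FakeCloud`, `ephemeral_environment`, and `run_experiment` are hypothetical names, and the in-memory class only stands in for a real provider API.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider (assumption)."""

    def __init__(self):
        self.active = set()
        self._next_id = 0

    def provision(self, nodes):
        # A real provider would allocate `nodes` machines here.
        self._next_id += 1
        env_id = f"env-{self._next_id}"
        self.active.add(env_id)
        return env_id

    def deprovision(self, env_id):
        self.active.discard(env_id)


@contextmanager
def ephemeral_environment(cloud, nodes):
    """Provision an environment on demand and always release it afterwards."""
    env_id = cloud.provision(nodes)
    try:
        yield env_id
    finally:
        # Runs even if the experiment raises, so nothing is left running.
        cloud.deprovision(env_id)


def run_experiment(cloud, nodes, model):
    """Run `model` inside a freshly provisioned, short-lived environment."""
    with ephemeral_environment(cloud, nodes) as env_id:
        return model(env_id)
```

Tying the release step to a `finally` clause is one way to make sure a failed experiment does not leave a large environment running.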

And actually, our journey has been quite long, I would say. My lab started in 2011, and at that point we were, as we would say, in search of infrastructure. When the lab opened, we had only one small local cluster that we used to run every computation for the different scientific models, and there were many laptops. That was our infrastructure. Most of the project development at that time, in the very first four to six months, consisted of ad-hoc solutions by researchers who had to be their own system administrators and software engineers.

We have come a long way from that, and in 2014 we entered what I would call the year of DevOps. We had about 90 people, the entire research staff. Most of the work that we did was team-based projects, which means the ad-hoc solutions could not work anymore because they were not sustainable. By the second half of 2014 we were able, as a lab, to have a centralized pipeline for everyone in the lab. And actually we were the first lab, across the 12 labs of IBM, to realize that goal: to have one single CI pipeline that could be used by all the research groups in our lab.

Moving forward to today and this year, we are in the process of having DevOps champions in each lab who can spread the culture and enforce good practices. Some labs already have them; others are in the process of getting them. We are starting a global rollout of DevOps services from the research division for all the labs. And we are starting to develop the phenomenon of social coding, so that researchers from different labs try to leverage as much as possible the code, the services, and the capabilities that other labs have developed.

What are the tools and platforms that we use internally? Starting from source control: in 2013 we started using a Git-based system, and we had the very first installation, based on GitLab, in our lab that year. The year after, a global GitLab instance was deployed at the Yorktown lab. Source control has been used quite aggressively, at least in our lab. As of today we have more than 800 projects in source control, and we are about 120 people, growing to 150 towards September. And we are now moving towards GitHub Enterprise.

In terms of build services, Jenkins is our primary continuous integration tool. We have, again, multiple installations. We started in our lab in about 2012, I think even before source control, and now we have different instances, primarily in Australia and the US. The installations we have in the United States account for most of the other labs. What do we do with Jenkins? Essentially we build most of our projects, which are primarily Java-based, Node-based, or Python-based builds. And recently we’ve been building a lot of Docker images that we use internally.

In terms of artifact repositories, we use JFrog Artifactory. And I’ve witnessed in these five years a very interesting pattern in the adoption of DevOps tool chains. The very first thing that teams generally get into is source control; that is the very first DevOps tool they get to use. They want to have a place to share code, and for us that has generally been Git. Later on they start to integrate build services, and that identifies the next level of maturity of the project. Once you have source control fixed and you have the build fixed, you start thinking about how to assetize the different libraries and components that you build, and this is when you need artifact management.

So in 2013 we rolled out an instance of Artifactory in our lab, primarily for Java builds, and about two years later an instance of Artifactory for the worldwide labs was deployed in the United States. In our lab we primarily use it for Java, but in other parts of the world Artifactory serves as a repository for Java, obviously, and also for Node packages and Docker images.

In terms of infrastructure management, instead, most of our core services are based on Chef recipes, and we use Chef to generate an immutable and repeatable setup for infrastructure.

As of today, 2016, we have some of the core basic services codified as Chef recipes, such as license servers, monitoring servers, Docker registries, and our internal Mesos cluster. We started this effort about a year and a half ago, and we are slowly replacing most of the installations that we had for the core services, those that we will not use from the global services, with Chef-based installations.

Now, rather than talking just about the tools, I would like to give you a couple of examples of how we use all these tools together to make something: to support project development or our internal deployments. So let’s look at some of the DevOps pipelines at work when we apply them to specific projects.

This is one example. This project is called Surgical Unit Resource Optimization. Essentially, what it does is combine statistical modeling and mathematical optimization to predict what the demand for elective surgeries will be over time for a hospital, and to optimize this demand to provide a better allocation of surgeries, operating theaters, and wards. Now, I will not talk about the details of the research behind it. What I want to focus on here today is the software engineering effort that takes this research from being something that you write in a research paper for a conference to a system that people can use and try.

So this is essentially a picture of what the development team, working together with the researchers, works with every day. They generally do local development, and all of our code for the project is in source control through Git. Whenever they commit, continuous integration kicks in: it takes most of the dependencies from our Artifactory solution, builds, tests the software, runs code analysis, and redeploys the build artifacts. Some of the software metrics go into SonarQube, which gives us an understanding of how much technical debt we have on the project and how much time it would take to fix it, whether we have enough test coverage, or whether we should apply some other good practices, for instance for Java.
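
The per-commit sequence described here can be pictured as an ordered list of stages that stops at the first failure. The Python sketch below only illustrates that control flow; it is not the Jenkins configuration from the talk, the stage names are paraphrased, and each stage action is a placeholder.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first one that fails.

    Returns (succeeded, names of the stages that completed).
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


def example_stages(tests_pass: bool = True) -> List[Stage]:
    """Hypothetical stage list; every action here is a stub (assumption)."""
    return [
        ("resolve dependencies", lambda: True),  # e.g. from an artifact repository
        ("build", lambda: True),
        ("test", lambda: tests_pass),
        ("code analysis", lambda: True),
        ("deploy artifacts", lambda: True),
        ("report metrics", lambda: True),  # e.g. to a code quality dashboard
    ]
```

With `tests_pass=False`, `run_pipeline(example_stages(False))` stops after the build stage, so nothing is deployed or reported for a broken commit.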

And if the build is successful, as it is in most cases, we deploy automatically onto test environments: one in Bluemix and one on our internal SoftLayer infrastructure. This process occurs at every commit. Deployment to what we call the production environment, which we use for demos and for discussions and presentations, occurs at a major release milestone, about every two months for us, because we are not extremely fast in releasing new features, especially when we need to code up research improvements to the system. And, obviously, the production environments are monitored through a combination of Sensu and Kibana, so that we are always sure that at any given time our systems are healthy, that we can show our research, and that I don’t get called at night or at some other inappropriate time.

We do most of our project management through a combination of Slack and Redmine. Redmine is primarily for future planning, milestones, roadmaps, keeping track of bugs and issues, and also project documentation, while we use Slack as a tool for discussing the project and for a much more person-to-person, let’s-get-together-and-solve-this-issue type of discussion. Some of our Slack channels are integrated with the build systems, so we get feedback in Slack as notifications whenever a build fails or whenever someone commits code. Slack has plenty of additional plugins that you can configure for a channel, which comes in very handy if you want to integrate some components of the development operations pipeline into your project management and discussion activities.
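
The talk does not say how the build notifications were wired up. One common way to do it, shown here only as an assumption, is a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the job name in the usage note, and the webhook URL passed by the caller are all hypothetical.

```python
import json
import urllib.request


def build_failure_message(job: str, build_number: int, committer: str) -> dict:
    """Build the JSON payload for a failed-build notification."""
    return {
        "text": f"Build failed: {job} #{build_number} (last commit by {committer})"
    }


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the payload to an incoming-webhook URL supplied by the caller.

    Returns the HTTP status code of the response.
    """
    data = json.dumps(message).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A build job could call `post_to_slack(url, build_failure_message("suro-build", 42, "alice"))` from a post-build step, where `url` is the webhook configured for the project channel.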

Another place where the DevOps pipeline is used is testing Chef recipes with Jenkins. Most of our Chef recipes for our services and systems are stored in source control; basically we have all our Chef cookbooks there. Whenever our infrastructure and services team makes a change to a cookbook, because they need to upgrade the version of Jenkins or to script the installation of a new plugin, there is a commit into source control, continuous integration kicks in, and this is essentially the process of the Jenkins job. We have a Ruby validation of the recipe to be sure that there are no syntactic or semantic errors. There is an enforcer of best practices for Chef recipes, and once these two stages are passed, we automatically provision a […] machine in SoftLayer, which is our […] infrastructure. We execute and test the recipe automatically on that machine to be sure that the recipe is not only syntactically and semantically correct, but also does what we expect it to do. And if the test passes, we deprovision the machine, bump up the version, and promote the recipe to be used for deployment the next time we need to deploy any of this infrastructure.
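
The cookbook job described here can be sketched as a small gate function: static checks first, then a test run on a throwaway machine, then a version bump if everything passed. This Python sketch is an illustration under assumptions, not the actual Jenkins job: every callable is supplied by the caller, the version is assumed to be a `major.minor.patch` string, and the sketch releases the test machine whether or not the test passes.

```python
from typing import Callable, Optional


def bump_patch(version: str) -> str:
    """Increment the last component of a major.minor.patch version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def verify_cookbook(
    version: str,
    validate: Callable[[], bool],
    lint: Callable[[], bool],
    provision: Callable[[], str],
    converge_and_test: Callable[[str], bool],
    deprovision: Callable[[str], None],
) -> Optional[str]:
    """Return the version to promote, or None if the cookbook is rejected."""
    # Cheap static checks run first, before any machine is created.
    if not (validate() and lint()):
        return None
    machine = provision()
    try:
        passed = converge_and_test(machine)
    finally:
        # Assumption: the test machine is released in every case.
        deprovision(machine)
    return bump_patch(version) if passed else None
```

Running the static checks before provisioning means a cookbook with a syntax error never costs a machine.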

I think this is a pretty interesting use of the development operations pipeline, because here it is not used primarily for building software but to verify and test configuration management.

And I would like to conclude with some reflections on what DevOps has meant to us in the past five years. It has been primarily a revolution in terms of culture, processes, and tools.

It has been a revolution of culture because we started sharing more. Because we share more, we were more compelled to test our code and to document it properly, and we also started to learn how to automate some of the deployments and to learn from the experience so that we can do better next time.

This obviously has required us to develop some processes for build, test, and deployment, for iterations in project management, and for automation and tracking, and to be able to continuously repeat this process.

And obviously tools, which are important but are not the only thing. We started putting more structure into our approach to source control, build environments, artifact management, project tracking, and configuration management, but also code coverage and code analysis, which have really helped us improve the quality of the software that we deliver and ensure that what we put into our production environment is unlikely to break. At least not very often.

So this is what we have done so far. What is next? What am I expecting to see in the coming two years? I would expect to see a considerable increase in social coding within the research community. I would be happy to see an improvement in convergence in terms of the tool chains and pipelines that we use. As I mentioned before, we are now in the process of rolling out GitHub Enterprise for the entire research division. Rather than having several GitLab repositories for different labs, this will allow us to have one single source control system where everyone can see the code, do a Git pull, fork some code, or contribute more actively to the projects that, for instance, another lab or another team in the same lab is working on. And obviously we will keep learning and practicing, right? So that at every round we get better and better at what we do.

So I have kind of concluded most of my talk. I will be able to take questions since we have a considerable amount of time left.