Managing huge files on the right storage with Git LFS

Abstract:

Tim Pettersen, Senior Developer / Developer Advocate at Atlassian, May 2016: I’m an engaged, detail-oriented software developer with excellent communication skills and a passion for API design and product integration. I have seven years of experience working with Java and a broad range of related technologies working for one of the world’s premier vendors of tools for software teams: Atlassian Software.

Talk Transcription:

All right folks. We’ve got quite a bit of content, so we’ll see if we can get through it all before the end.

So, you might be wondering why I’ve got this big picture of the Eiffel Tower in the background. Like many great stories, the one I’m going to tell you today starts in Paris. It was the eighth of April, two thousand and fifteen, just a little bit over a year ago, and a bunch of engineers, much like yourselves, were sitting at the Git Contributor Summit, which is this big summit that happens the day before the annual Git Merge Conference. What happens is most of the Git contributors get together in a room and talk about what they’re going to do over the next year to move Git, everyone’s favorite distributed version control system, forward.

And at this conference, something really interesting happened. There was a conversation between two companies that typically are fairly staunch competitors. They were Nicola Paolucci and John Garcia from Atlassian on the Bitbucket team and a guy named Rick Olson from GitHub.

And the conversation went something like this. Nick said, hey Rick, at Atlassian we’ve been working on this really interesting new bit of technology to solve the problem that Git has with storing large binary files. So we’ve built this new extension to Git, written in Go because we wanted it to be cross-platform, and we plan on announcing it during our Git Merge session tomorrow at the conference. And Rick from GitHub said, that’s kind of interesting because at GitHub we’ve also been working on a solution to Git’s problem with large binaries. And you know what, we’ve built a new tool in Go that extends Git in order to track these binary files better. And we’re going to announce it during our session at Git Merge directly after yours.

And Nicola was like, what? That’s a picture of James Watt, by the way. And Rick was like, totes. Isn’t that crazy? What do you call your tool? And this is where it gets really weird. Nick from Atlassian said, well, we call ours Git LOB. It stands for large objects, like you might find in a database, cause we’ve got these big binary chunks of data that we need to track. And Rick said, ah, that’s weird, we called ours Git LFS, another three-letter acronym, which stands for large file storage. Now after a bit of back and forth between the two companies, Atlassian decided to open source and archive Git LOB and start porting some of the features we’d built to the Git LFS project.

We did this for a couple of reasons. The first is that we didn’t want to fragment the community. We wanted to build one open source tool that solved the problem of tracking large binaries for every user of Git out there. And the second reason was that they were both written in Go, both extended Git using the same extension points and hooks, and were both fairly similarly designed. So it was pretty easy to port those features over.

And that’s why I’m here, from Atlassian, having previously worked on the Bitbucket Server project and JIRA, to talk about this joint project between Atlassian and GitHub: how to track large files on the right storage using Git LFS.

Now this is really exciting. I’ve done quite a lot of Git evangelism in the past, talking about branching workflows and rebasing and how not to force push and screw over the rest of your team. And sometimes after a presentation, I run into this really awkward conversation where I’ll be talking to someone who, for whatever reason, just can’t use Git. Now there are some engineers out there who have a really hard time using Git in its native form because of this problem with tracking large binary content.

If you’re a game developer and you want to track game texture assets or audio files or full motion video, that’s really hard to do inside of a repository. Similarly, if you’re a researcher and you want to track large data sets, maybe, you know, gigabytes or terabytes, alongside the scripts and other modeling bits of programming that you’re writing, which would be convenient in your Git repository, you can’t do that because of the performance problems you run into with large binary files.

So today I’m going to first talk a little bit about why Git has such a hard time tracking binary content. Then I’ll talk a little bit about how Git LFS solves this problem and we’ll take a look under the hood at how Git LFS actually extends Git. Cause it’s kind of interesting to see how Git itself has been built in such an extensible manner. Then I’ll give you some tips for converting your existing repositories over to use Git LFS if you’re already tracking binary content in them. Then we’ll look at how you can use your existing Git hosting provider with Artifactory to effectively store your large binaries all in one place. And then finally I’ll give you some tips for using Git LFS in a team context to make sure you don’t clobber the rest of your team’s large binary files when you’re working on them.

So, first of all to understand why Git has such a hard time with large binaries you need to know a little bit about the Git data model. Now when you think about the changes to a code base over time, as it relates to version control, you’re probably thinking about a set of revisions or commits. Now in Git, these revisions, actually they’re called commits in Git, these commits don’t just float in space. They’re related to each other through ancestry. Specifically each commit has a reference to its immediate parent or parents in the case of a merge commit. Now if we put in branches or tags, more commonly known as refs in Git parlance, you start to get a picture of what the Git data model looks like. Now this data structure is known as a DAG, or a Directed Acyclic Graph. Basically each node represents a commit, or as we’ll see in a second, another object sitting inside your Git data store. And it’s directed because each object references another object that was created at some point in time before it. The A in DAG stands for acyclic because each of these objects is immutable. When you create a new object, it refers to an object that already exists. So you can never create a cycle inside this data structure.
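To make that picture a bit more concrete, here is a minimal sketch of the commit graph in Python. It is only an illustration, not how Git stores anything: the short IDs like "a1" are made up (real Git uses SHA-1 hashes), and the `Commit` class and `ancestors` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: an object never changes once it exists
class Commit:
    message: str
    parents: Tuple[str, ...] = ()  # IDs of commits that already exist


def ancestors(store: Dict[str, Commit], start: str) -> List[str]:
    """Return every commit ID reachable from `start`, in depth-first order."""
    seen, order, stack = set(), [], [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        order.append(commit_id)
        stack.extend(store[commit_id].parents)
    return order


# Made-up IDs; "d4" is a merge commit with two parents.
store = {
    "a1": Commit("initial commit"),
    "b2": Commit("add elephant", parents=("a1",)),
    "c3": Commit("tint elephant", parents=("a1",)),
    "d4": Commit("merge", parents=("b2", "c3")),
}
refs = {"master": "d4"}  # a branch is just a name pointing at a commit

print(ancestors(store, refs["master"]))
```

Because a commit can only name parents that already exist, following parent links always leads backwards and never loops, which is the acyclic part.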

Now to understand why Git has such a hard time tracking binary content we need to look a little bit deeper and see what the structure of these commit objects actually is. So Git ships with a command called cat-file which you can use to inspect any object inside your Git repo. If you pass it the -p flag, it will print the object out in a pretty format which is human readable.

So commits are actually pretty simple. They’re just a few bytes. First of all you have a reference to the commit’s immediate parent or parents. And that hexadecimal SHA-1 address that you see, which looks kind of scary when you first move to Git from a centralized version control system, is actually a SHA-1 hash of the Git object’s contents. So you might have heard of Git being referred to as a content addressable file system, and that’s because every single Git object is referenced by one of these SHA-1 hashes. And this has some really important properties as to how Git stores this data. One of the most valuable properties is that it makes it easy to detect duplicate objects. And you’ll see why that’s important in just a second.
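Content addressing is easy to try out for the simplest kind of object, a blob. Git hashes a short header ("blob", the content length, and a NUL byte) followed by the content itself. The sketch below reproduces that in Python; the function name `git_blob_id` is just an illustrative choice.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


first = git_blob_id(b"hello\n")
second = git_blob_id(b"hello\n")
changed = git_blob_id(b"hello!\n")

print(first)             # same ID `git hash-object` reports for this content
print(first == second)   # True: identical content, identical address
print(first == changed)  # False: any change yields a brand-new object
```

Identical content always lands on the same address, so a duplicate is detected simply by noticing that the ID already exists.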

And now the next thing that this commit object has is a tree. This is another SHA-1 hash, which is the address of that tree’s contents. And the tree is analogous to the root level directory that’s being tracked by this Git repository. So we’ll see in a second, but that tree actually contains references to every single object, or file, that’s being tracked in our Git repo. Then we have a little bit of commit metadata. So the committer, who created the commit, the author, who originally authored the code, these are usually the same person, depending on your workflow, and then you have a commit message as well. But for most commits, this is basically all you’re going to see.

Now if we whip out our cat-file command again, we can point at that tree object and look at how it’s structured. And you’ll see it looks a lot like a directory on a file system. We have file modes, we have nested trees, which are basically the sub-directories at the root level of your Git repository, and then we also have blobs. And these blobs are again SHA-1 hashes that point to the actual content of this file on disk. And then finally we have the path. So all of the names of every file and directory in your repo are actually stored inside these tree objects.

And if we keep recursing down into our Git tree, we can see the entire contents of our repository. And if we whack in another one of those refs, or branches, or tags, what we have here is pretty close to the complete Git data model. So it’s actually pretty simple when you look at it. To be honest, understanding this was one of the best ways that I, kind of, got how Git works.

There’s a really good book by a gentleman named Scott Chacon called Pro Git. It sort of introduces this concept very early on and then starts talking about all of the different Git commands that you use in your workflows as transformations on this Git data model. So if you’re having trouble, or if you’re starting to learn Git, reading that book and kind of grokking this is one of the best ways to understand what’s going on under the hood.

Now the reason why Git has such a hard time with large binary content is because we actually end up creating a new one of these trees every time we create a commit. So every time we change a file in a repository, and add it and commit it, we create a new one of these blobs, and that means that every tree that is either the immediate parent or an ancestor of that blob has to be rewritten as well. Now fortunately because Git is content addressable, it’s very easy for Git to not create duplicates of objects that haven’t changed. But if an object changes, Git has to create an entirely new blob. Now Git does eventually do some delta encoding of these blobs, but depending on what the binary format is, it may or may not compress very well.

So let’s take a really simple example where we have a repository which is literally one directory with one file in it, say a ridiculously high resolution file of an elephant stored in some sort of raw file format. Now because it’s raw, each pixel is going to be encoded as a set of bytes. So if we go and do something like change the hue of the elephant. Sorry, the color differentiation is not very good here but that elephant is supposed to be pink. Then when we add and commit it to our repository, it’s going to create an entirely new blob and it’s going to double the size of the repo. And so on and so forth as we make more changes to this image.

So as you have large binaries, the repository bloats and bloats and bloats. Now with traditional centralized version control systems, this wasn’t such a huge problem because the entire history of your repo is stored on a central server and each developer is typically only retrieving a single commit at a time and working with that, so you only have kind of the snapshot of the latest version. The elephant with the little mum tattoo. But with Git, it’s a distributed version control system. So you’re actually copying around the entire history of your repository every time you need to do a push or a pull or a clone. So that means that I’m pushing every single version of this file that I’ve touched up to the server and the rest of my team has to pull down every single version of that file.

Similarly, if I’ve got deploy scripts or continuous integration builds, they can sometimes do shallow clones depending on their exact use cases. But often you end up having to copy around this rich history, which is incredibly heavy. And that not only blows up the size of the repo, it slows down those pushes and pulls. And it means the load on your Git server grows until everything comes to a grinding halt.

Now Git LFS is a solution to this problem. Now as I’ve mentioned, it’s an extension to Git. And this isn’t the first extension to Git that’s attempted to solve the problem with large files. There have been tools like Git Annex, Git Media, Git BigFiles, and Git Fat is another one, and all of these have attempted to solve this problem. But Git LFS has taken a slightly new approach and tried to make it as transparent as possible with your existing Git workflows. So as you’ll see in a second, once you have Git LFS installed, you can just work with your existing repository as usual.

Now Git does have a fairly famously steep learning curve. So I think one of the reasons that Git LFS is becoming more successful than its precursors is because it doesn’t add to that learning curve. It’s actually just as simple as using raw Git.

So at a high level, basically what Git LFS does is, instead of storing all of these large binary files in your Git objects directory as part of your repo, it replaces them with lightweight pointer files which contain references to these large objects. And they actually get committed as part of that DAG that we looked at, whereas these large objects are completely divorced from that DAG and stored in completely separate storage.

So when you run git push, those large files get transferred to a separate storage and your Git DAG gets transferred to your Git repository as normal. Then, and this is where the magic happens, when a developer does a clone, or a fetch, or a pull, that DAG is transferred back down to that developer’s computer. And then only the versions of those large files that the developer wants to work with, which is typically the tip of the branch that they just checked out, get retrieved from large file storage. So you don’t get every single version of these large files in your history.

Now let’s take a little look at the structure of one of these pointer files. It is literally just a few bytes. So if you use that cat-file command again, you can see that it’s three fields. We’ve got a version, which identifies the version of the LFS pointer format, we have an object ID, which is a SHA-256 hash of the object’s contents, and then we have the size of that object in bytes. So instead of a multi-megabyte or multi-gigabyte file inside our Git repo slowing everything down, we have a handful of bytes.
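Those three fields are simple enough to build and read back by hand. The sketch below follows the key/value layout of the published Git LFS pointer spec (version URL, `oid sha256:...`, `size`), but `make_pointer` and `parse_pointer` are hypothetical helper names, not part of the Git LFS tooling.

```python
import hashlib

# Version line used by the v1 pointer format.
SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Build the text of a pointer file describing `content`."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {SPEC_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> dict:
    """Split a pointer file back into its three fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    return {
        "version": fields["version"],
        "oid": fields["oid"].split(":", 1)[1],  # drop the "sha256:" prefix
        "size": int(fields["size"]),
    }


big_file = b"\x00" * 5_000_000        # stand-in for a large binary
pointer = make_pointer(big_file)

print(pointer)
print(len(pointer), "bytes of pointer for", len(big_file), "bytes of content")
```

However large the tracked file gets, the pointer stays at roughly the same handful of bytes.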

Now you might notice that this is a SHA-256 hash instead of a SHA-1 hash. Now there’s a couple of reasons for this. The first is that Git just celebrated its eleventh birthday in April. And when Git was first designed, there were no, you know, known theoretical weaknesses in SHA-1. These days, SHA-1 is potentially slightly weaker than the original design constraints assumed. So SHA-256 is a more modern standard which we believe is more secure.

There’s also a couple of interesting properties that SHA-256 has. One of the big ones is that S3 has built-in support for SHA-256 validation of large files. So if you were to use S3 as a Git LFS backend, then it can automatically validate the contents of your objects. So it’s kind of a practical choice from that perspective as well.

So installing Git LFS is ridiculously easy. It’s written in Go so there are binaries available for every platform that Git is available on. And it’s as simple as installing it using your favorite package manager. I like Homebrew on OS X. And then once you have the package installed locally, you can simply run git lfs install and that will set it up for you.

What this does under the hood is it adds this new thing called a filter configuration to your global Git configuration. And what this does is it maps a new clean and smudge filter to files that are being tracked using LFS. Now clean and smudge filters are a native Git concept that allow you to intercept git add and git checkout commands. We’ll see how that works in just a second.

Once you’ve got Git LFS set up you can run the git lfs track command to tell LFS which file patterns or file names you want to track in LFS as opposed to adding directly to your Git objects directory. Now what this does is it adds a new entry to your .gitattributes file, which again is another native Git concept that you can use for extending Git. And it binds that pattern to the LFS smudge and clean filters.
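As a rough illustration of what ends up in that file, the sketch below appends the kind of line that current versions of git lfs track write for a pattern. The `track` function is a hypothetical stand-in written for this example, and the exact attribute list may differ between Git LFS versions.

```python
import tempfile
from pathlib import Path


def track(repo_root: Path, pattern: str) -> str:
    """Append an LFS entry for `pattern` to .gitattributes, once."""
    line = f"{pattern} filter=lfs diff=lfs merge=lfs -text"
    attributes = repo_root / ".gitattributes"
    existing = attributes.read_text().splitlines() if attributes.exists() else []
    if line not in existing:  # tracking the same pattern twice is a no-op
        existing.append(line)
        attributes.write_text("\n".join(existing) + "\n")
    return line


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # pretend this is the root of a repository
    track(root, "*.png")
    track(root, "*.mp4")
    print((root / ".gitattributes").read_text())
```

The `filter=lfs` attribute is the part that ties each pattern to the clean and smudge filters that git lfs install registered.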

So how do these filters work? Let’s take a look at the clean filter first. When a developer runs the git add command and passes it the name of a file, instead of adding that directly to the Git index and creating an object in your Git objects directory, Git hands it off to this git lfs clean command. And it takes a SHA-256 hash of that object’s contents and adds that to the special object cache sitting under .git/lfs/objects. So it looks very similar to the Git objects directory but it’s namespaced underneath lfs. So again, Git LFS is trying to follow the existing Git patterns to make it as transparent as possible.

Once it’s stored that content there, with the SHA-256 hash as the file name, so again it’s content addressable, it adds a new pointer file with the SHA-256 hash and the size of the object into your Git objects directory. So instead of adding, you know, megabytes or gigabytes, it’s just a few bytes which are being added to the size of your repo.

Then when a developer does a fetch or a clone or a pull, and actually wants to work with these large files, the Git LFS smudge filter kicks in. So when a developer runs checkout, the pointer file is handed off to the smudge filter. It goes off and looks in your local Git LFS objects cache and tries to find a file that matches that SHA-256 hash. If it can’t find it there, it actually reads through to your back-end LFS store, which is going to be hosted either with your Git version control provider, like Bitbucket, or potentially on separate storage like Artifactory. Then once it’s located your object’s contents, it writes it out in your local working copy under the name of the original file.
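The clean and smudge round trip can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Git LFS implementation: `clean` and `smudge` are hypothetical functions, the cache is a temporary directory laid out like .git/lfs/objects (two levels of hash-prefix directories), and where a real client would fall back to the LFS server this sketch just raises an error.

```python
import hashlib
import tempfile
from pathlib import Path

SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def _cache_path(cache: Path, oid: str) -> Path:
    # Objects are stored under their own hash, so the cache is content addressable.
    return cache / oid[:2] / oid[2:4] / oid


def clean(content: bytes, cache: Path) -> str:
    """On `git add`: stash the real content, hand Git a small pointer."""
    oid = hashlib.sha256(content).hexdigest()
    target = _cache_path(cache, oid)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return f"version {SPEC_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def smudge(pointer: str, cache: Path) -> bytes:
    """On checkout: turn a pointer back into the real content."""
    fields = dict(line.split(" ", 1) for line in pointer.splitlines() if line)
    oid = fields["oid"].split(":", 1)[1]
    target = _cache_path(cache, oid)
    if not target.exists():
        # A real client would download the object from the LFS server here.
        raise FileNotFoundError(f"object {oid} is not in the local cache")
    return target.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".git" / "lfs" / "objects"
    texture = b"pretend this is a huge game texture"
    pointer = clean(texture, cache)
    print(pointer)
    print(smudge(pointer, cache) == texture)  # True: the round trip is lossless
```

Only the pointer text ever reaches the Git object store in this model; the bytes of the texture live in the side cache.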

So while this is all kind of interesting, I have kind of been wasting your time because you don’t actually need to know anything about this pointer file format. As a developer when you run git add, you’re running that against the original file that you’ve just finished editing or creating and when you run git checkout, when that command exits, you’re just going to have that new version of that large file sitting on your local disk. You never actually see these pointer files. They’re an implementation detail of Git LFS. So everything is transparent when you work with Git LFS as part of your workflow.

So that’s how you create and retrieve large objects with LFS. Now I want to talk a little bit about how you transfer these large objects to and from your server. So the way that Git LFS intercepts a push, where you transfer these objects up to the server, is using a Git hook. Has anyone here worked with Git hooks before? Couple of people. Okay cool. So for those of you who haven’t, look in any Git repository that you have created locally. Inside the .git directory, which is where Git stores all of the data associated with your repo, you’ll see a hooks directory, and in there, there’ll be a whole bunch of sample scripts that show cool things you can do when you intercept Git invocations.

You can do things like pre-populate commit messages for convenience. So one of the ones that I use, being an […] developer, is we name all of our Git branches after JIRA issues. So I’ve got this prepare-commit-msg hook that plucks the issue key out of the branch name and then inserts that at the beginning of my commit message so I don’t have to go and manually type that out every time.
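The core of a hook like that is only a few lines. Here is a sketch of the logic in Python, assuming issue keys look like ABC-123 and branches are named something like feature/ABC-123-fix-typo; both conventions, and the function names, are assumptions made for this example rather than anything from the talk.

```python
import re
from typing import Optional

# Assumed issue key shape: uppercase project key, dash, number (e.g. ABC-123).
ISSUE_KEY = re.compile(r"[A-Z][A-Z0-9]+-\d+")


def issue_key_from_branch(branch: str) -> Optional[str]:
    """Pluck the first issue key out of a branch name, if there is one."""
    match = ISSUE_KEY.search(branch)
    return match.group(0) if match else None


def prefix_message(message: str, branch: str) -> str:
    """Put the issue key at the start of the message unless it is already there."""
    key = issue_key_from_branch(branch)
    if key is None or message.startswith(key):
        return message
    return f"{key} {message}"


print(prefix_message("Fix typo in README", "feature/ABC-123-fix-typo"))
print(prefix_message("Tidy up", "master"))  # no key in the branch: unchanged
```

Wired up as a real prepare-commit-msg hook, the script would read the message from the file path Git passes as its first argument, look up the current branch, and write the prefixed message back to that file.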

You can also create things like pre-commit hooks that can do things like run the unit tests before you create a commit. Which is a really powerful idea because that means that every single commit that you create by default is already passing the tests. So it’s sort of like a pre-CI stage. Now the pre-push hook that Git LFS uses predictably allows you to intercept the Git push operation. So when you run git push, and you’ve created some new large objects locally that need to be transferred to the server, you’ll see this output saying, hey I’m transferring some files to your Git LFS repo and here’s the status as it gets uploaded.

Now what’s really nice here is that you don’t get any authentication prompt. Git LFS actually piggybacks off your existing Git credentials whether you’re using HTTP or SSH. A Git LFS server will handle both of those use cases. So you don’t have to go off and end up, you know, typing additional commands or adding it to your keychain. It just works transparently with your existing model of authentication. I’ve actually got some slides on how this auth process works at the very end, so if I’ve got time I’ll jump into that because it’s a really, really cool use of […], actually.

Similarly when you run git pull, you’ll see some output from that smudge filter as it reads through to that backing LFS store saying that it’s downloading these large files. And as I’ve mentioned before it works transparently with SSH as well. Which is one of the killer features of Git LFS as opposed to some of its earlier competitors like Git Media.

So that’s in a nutshell how Git LFS works, and I want to talk a little bit about how you can convert your existing repositories over to use Git LFS. And I may have misled you a little bit earlier when I was talking about that git lfs track command. Because if you use that on an existing repository that already has these big binary files, it will convert those large files going forward into pointers, which is good. But unfortunately it won’t do anything for the already bloated size of your repository because you still have these large blobs sitting in your repo history.

Now if you’re fairly professional, well not a professional, but if you’ve got a lot of experience with Git, specifically with rewriting history, at this point you might reach for the git filter-branch command. Has anyone used filter-branch before? Yes. Kind of scary and painful, isn’t it? I made this slide and it took me, like, 10 minutes to remember what this was actually doing. So what it’s doing here is actually going back and removing all of those large blobs from your repository history. That’s not a great idea. It will dramatically cut the size of your repo but unfortunately it means that those blobs have actually been obliterated from your history. So if you do need to go back and look up one of those large blobs, it’s going to be gone. Which is not great for auditing purposes or if you need to roll back.

Now Andy Neff, one of the core Git LFS contributors came up with this truly genius slash insane git filter branch command that will actually go back and rewrite all of those blobs to be Git LFS pointers instead. Which is truly impressive, but it turns out there is actually an easier way to do this these days.

There’s this awesome tool called the BFG Repo-Cleaner. It’s by a developer who works at The Guardian called Roberto Tyley. Now Roberto, when he initially built this tool, had a problem. A developer had accidentally committed something sensitive to an earlier version of a properties file sitting in his repo. And this was a problem because to remove that he’d have to go back and basically, you know, run filter-branch and eviscerate that from history. Now filter-branch is awesome, it’s kind of the Swiss Army knife of repository history rewriting, but it is also pretty intimidating to use and it can be extremely slow. Now one of the things about Git is that at the end of the day, it’s a collection of bash scripts and underlying other commands written in C. But it doesn’t really have a system process. So filter-branch typically actually walks your entire DAG and reprocesses the same objects and trees over and over again.

Now the BFG Repo-Cleaner is a tool that’s been specifically built to kill bits of your history and it’s built on top of JGit, which is a complete reimplementation of Git in pure Java. And it does have a system process. So it actually goes back and memoizes every single tree and object that it’s processed as it walks through it and, because it’s content addressable, knows that it doesn’t need to go and reprocess those again. So it’s actually 10 to 720 times faster than filter-branch.
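The payoff from remembering what has already been processed shows up even on a tiny example. The sketch below is a toy model and not how the BFG or JGit are implemented: the object store is a made-up dictionary with readable names standing in for hashes, and `process` is a hypothetical function.

```python
from typing import Dict, List

# Made-up object store: tree ID -> child IDs. Anything not listed is a blob.
TREES: Dict[str, List[str]] = {
    "root-v1": ["src", "assets-v1"],
    "root-v2": ["src", "assets-v2"],  # only assets changed; "src" is shared
    "src": ["main.c"],
    "assets-v1": ["elephant-v1.raw"],
    "assets-v2": ["elephant-v2.raw"],
}


def process(object_id: str, memo: Dict[str, str], visits: List[str]) -> str:
    """Rewrite one object, reusing the stored result for IDs seen before."""
    if object_id in memo:
        return memo[object_id]  # same ID means same content: nothing to redo
    visits.append(object_id)
    for child in TREES.get(object_id, []):
        process(child, memo, visits)
    memo[object_id] = "rewritten:" + object_id
    return memo[object_id]


memo: Dict[str, str] = {}
visits: List[str] = []
for root in ["root-v1", "root-v2"]:  # one root tree per commit in history
    process(root, memo, visits)

print(visits)               # every object appears exactly once
print(visits.count("src"))  # 1, even though both commits contain "src"
```

Without the memo, the unchanged src tree would be walked again for every commit that contains it, which is where the repeated work in a long history comes from.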

If you go to the BFG Repo-Cleaner homepage, they’ve got this spreadsheet of open source projects that they’ve run it on to rewrite history, and it’s pretty impressive the stats they’ve got on how much quicker it is.

Now it was originally built to kill history, so it can actually delete files, like obliterate files, from your history, or entire folders, or even strings within files. So if you ever, like, committed a credential or, you know, a password or an AWS key or something like that into a properties file, you can go back and replace that with hash marks or something like that using it. But as of version 1.12.5, Roberto’s built in support for Git LFS. So instead of this genius slash insane filter-branch command, you can simply install the BFG, which is written in Java, so it’s incredibly portable and will work on pretty much any platform that Git will, and then run it with the handy --convert-to-git-lfs option and pass it the patterns or names of files that you want to track with LFS.

Now there is one slightly arcane flag you have to pass called --no-blob-protection. What that does is it means it will rewrite the tip commit of your branch as well as your entire history. By default the BFG assumes that your repository is in a good state now, so it doesn’t touch that tip commit for safety reasons, but when you’re rewriting your history with LFS you usually want to rewrite your tip commits as well. By tip, I just mean the latest commit on the branch you currently have checked out.
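Put together, the invocation looks roughly like the command assembled below. The two BFG options are the ones described above; the jar file name, the repository path, and the `bfg_convert_command` helper are placeholders invented for this sketch, so adjust them to your own setup.

```python
import shlex
from typing import List


def bfg_convert_command(jar: str, extensions: List[str], repo: str) -> List[str]:
    """Build the argument list for converting matching files to LFS pointers."""
    if not extensions:
        raise ValueError("at least one file extension is required")
    if len(extensions) == 1:
        pattern = f"*.{extensions[0]}"
    else:
        pattern = "*.{" + ",".join(extensions) + "}"
    return [
        "java", "-jar", jar,
        "--convert-to-git-lfs", pattern,
        "--no-blob-protection",  # also rewrite the tip commit
        repo,
    ]


# "bfg.jar" and "my-repo.git" are placeholder names.
command = bfg_convert_command("bfg.jar", ["png", "mp4"], "my-repo.git")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Building the arguments as a list means the glob reaches the BFG untouched if you hand `command` to something like `subprocess.run(command, check=True)`. Since this rewrites history, it is worth trying on a fresh clone first.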

So that’s how you can use the BFG to rewrite your history. But there’s this question that you get to, particularly with big repositories or repositories that have very long histories, and that is: which files should I be tracking with LFS? Now at Git Merge, earlier this year in New York, I heard this awesome talk from a guy named Charles Bailey who works for Bloomberg, and he’s just open sourced this set of tools called Repofactor. And what Repofactor does is it helps you identify large blob chains sitting inside your repository. Now when I say chain, I mean a blob delta chain. So basically it looks at the first blob that was created for a particular file, and then it finds all of the other blobs stored in your repository that were created for newer versions of that file. Then it calculates the average size of that blob over time. And this is really good cause that size is sort of a function of how well that file compresses. So if it doesn’t compress very well using zlib compression, which is what Git uses under the hood, then that makes it a really good candidate for LFS cause you’re gonna save a huge amount of disk space when you convert to LFS.

Now it is a set of command line tools rather than just sort of, you know, an application that you plug your Git repository into. So after a bit of messing around, this is kind of the most effective way that I’ve found to use it. First of all you’re gonna want to use the generate larger than command, and you pass it an integer saying this is the threshold, in bytes, above which I consider a file large. And then it’s going to spit out the SHA-1 hash of each one of these blobs, followed by the size on disk of that blob, and then it’s going to give you the average blob size of that particular chain of blob deltas. Once you’ve got that, you can pipe it to the add file info command, which uses the Unix file command to sniff out the content types of each of these blobs. Then you can start to see a little bit more information about it. So now we know that these blobs represent PNG files of a reasonable resolution. Then you can pipe it to sort and start sorting by that average blob size. So the ones that are, kind of, better candidates for LFS float towards the top. And then if you write that out to a file, you can use the report on large objects command to generate the actual file names of each of these blobs. And from there we can use that as input into our BFG invocation to start tracking some of these large files using LFS.
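The ranking idea behind that pipeline can be sketched independently of the tools. The code below is not Repofactor and does not read a real repository: it takes made-up (path, size on disk) pairs, averages the sizes per file, and sorts the heaviest chains to the top. The `rank_candidates` name and the sample numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def rank_candidates(
    blobs: Iterable[Tuple[str, int]], threshold: int
) -> List[Tuple[str, float]]:
    """Average each file's stored blob sizes and rank files at or above threshold."""
    sizes: Dict[str, List[int]] = defaultdict(list)
    for path, size_on_disk in blobs:
        sizes[path].append(size_on_disk)
    averages = [(path, mean(values)) for path, values in sizes.items()]
    large = [(path, avg) for path, avg in averages if avg >= threshold]
    return sorted(large, key=lambda item: item[1], reverse=True)


# Made-up compressed sizes, one entry per stored version of each file.
history = [
    ("art/elephant.png", 900_000),
    ("art/elephant.png", 1_100_000),
    ("src/main.c", 4_000),
    ("video/intro.mov", 50_000_000),
]

for path, average in rank_candidates(history, threshold=500_000):
    print(f"{average:>12.0f}  {path}")
```

Files whose versions stay large even after compression rise to the top of the list, and those are the patterns worth handing to the BFG.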

So that’s how you can convert your existing repositories over to use Git LFS. But now you have to start thinking about where you’re actually going to store this stuff on the server. Now if you’re using Bitbucket Server, enabling Git LFS is really easy. You check the allow LFS checkbox in your repository settings and save it. Bitbucket also reads through transparently to your underlying Git LFS store, so all of the cool image diffing and previews of your binaries will just work automatically inside your repository.

However, if you’re using JFrog Artifactory, and odds are you probably are if you’re at this conference, you might want to start using Artifactory to manage your binaries instead because of all the awesome features that it has for wrangling your binary files. You can actually set up a new local repository to track your Git LFS objects and then you can browse your LFS repos and actually see all of these large blobs sitting inside Artifactory using the tree browser. You can also use AQL to query those large objects which is pretty neat. You can do everything else, like set watches and things like that. So if you want to store all of your binaries inside Artifactory you can totally do that.

Now the way this works is that your developer, or the person who first initializes LFS for your repository, will need to configure that repo to point to Artifactory for your large files. And the way you do this is by creating a .lfsconfig file that sits in the root of your repo. And this can act like any other file in your repository. So you can add it, commit it, and then push that to the server and it will just start working for all of your other devs.

Now you don’t have to remember the format of this because the awesome Artifactory Set Me Up button works for LFS as well. It will actually generate that configuration for you based on the location of your Artifactory instance. And what it’s doing is overriding the location of the LFS API. So instead of pointing to Bitbucket Server, it points to Artifactory instead.
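For a sense of how small that configuration is, the sketch below writes out an `[lfs]` section with a single `url` key, which is the setting that tells an LFS client where the API lives. The server name and repository key in the URL are placeholders, so use whatever Set Me Up generates for your instance; `render_lfsconfig` is a hypothetical helper.

```python
import configparser
import io


def render_lfsconfig(lfs_url: str) -> str:
    """Render the text of a minimal LFS config pointing at `lfs_url`."""
    parser = configparser.ConfigParser(interpolation=None)  # keep URLs literal
    parser["lfs"] = {"url": lfs_url}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Placeholder host and repository key, not a real endpoint.
text = render_lfsconfig("https://artifactory.example.com/artifactory/api/lfs/lfs-local")
print(text)
```

Saved as .lfsconfig in the root of the repository and committed, this is picked up by everyone who clones the repo, so nobody has to configure the endpoint by hand.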

Now there is one major drawback to using a local Artifactory repository for this. And that is that Bitbucket won’t be aware of where these LFS files are being tracked. So that means that instead of having beautiful image diffs inside the UI, you’re going to see diffs of those little pointer files. Which is not an optimal experience. So instead, as of Artifactory 4.7, they’ve released support for remote LFS repositories. So that means you can actually set up Artifactory to act as a proxy or a cache of these LFS files that are tracked in Bitbucket Server. So that means your Bitbucket Server instance is still tracking these large objects for you but Artifactory is going to be aware of them, so you can do things like set watches and query them using AQL. Which is really powerful.

And the other thing you can do: with four dot seven they also released support for virtual repos. So you can create a virtual Artifactory repo that aggregates a set of these LFS stores across Bitbucket and Artifactory, if you want a, kind of, virtual view over a set of these LFS repos. And I think you can actually push directly to a virtual repo as well and then it’ll have a default LFS store that it’ll push onto.

So that’s how you can use Artifactory alongside Bitbucket Server or whatever you’re using for Git hosting. Now I want to talk a little bit about how you can use Git LFS in a team context. Because so far I’ve mainly been talking about how an individual developer interacts with your repo. Now the first and biggest thing to remember is you have to be very careful with merge conflicts. Now Git is very, very, very good at handling merges with text files but unfortunately it knows nothing about the binary files that you are tracking in your repository and has no good way to perform a semantic merge. And there is nothing more depressing than working all day on a large binary file only to find out that someone else on your team has also been modifying that file on a different branch. Because that means, you know, think of a large video file for example, you might spend all day rendering this thing, and then find out that someone else has been doing the same thing and there is literally no way you can resolve those conflicts. You have to go back and reapply all of your work on top of theirs or make them do the same thing.

So really at the moment, Git LFS doesn’t have a concept of locking. There’s actually a Google Summer of Code applicant who was successfully approved and is working on a locking spec for Git LFS. So fingers crossed that something will come out of that in the next few months. But in the meantime the best thing you can do is just tap the rest of your team on the shoulder or, you know, anyone else who’s likely to be modifying the same file and, you know, pass the mutex so to speak and tell them that they should maybe hold off on modifying that file until you’re finished.

Now I’ve mainly been talking about how Git LFS is awesome because it only retrieves the latest version of your content, which is likely what you want to work with. But in some cases you might actually want to fetch more than just the large binaries that are referenced by the commit you’re checking out.

So there’s this command called git lfs fetch which by default will just inflate those pointer files from your tip commit. But you can pass the --recent flag and that means it’s going to go off and retrieve content that is associated with recently updated branches. How it defines recent is any branch that has a commit on it that’s newer than seven days. Which is kind of handy. And if you like this behavior, you can make it the default by setting the lfs.fetchrecentalways property to true. This is kind of handy, in particular, if you’re about to jump on a plane or something like that and you need to work with multiple branches, so you can just run git lfs fetch --recent quickly and it will pull it all down for you.
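The seven-day sliding window can be pictured with a small Python model. This is only a sketch of the selection rule as described in the talk, not Git LFS’s actual implementation; the branch names and dates in the usage are invented.

```python
from datetime import datetime, timedelta


def recent_branches(tip_dates, now, refs_days=7):
    """Return the names of branches whose tip commit is newer than refs_days."""
    cutoff = now - timedelta(days=refs_days)
    return sorted(name for name, tip in tip_dates.items() if tip >= cutoff)


if __name__ == "__main__":
    now = datetime(2016, 5, 20)
    tips = {
        "master": now - timedelta(days=1),
        "feature/video": now - timedelta(days=6),
        "old-spike": now - timedelta(days=30),
    }
    # Only branches touched in the last seven days are considered recent.
    print(recent_branches(tips, now))
```

With the default window, old-spike falls outside the cutoff, so its large files would not be pulled down by a recent fetch.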

Now you can tweak this behavior as well. There are a few properties you can set. The first is lfs.fetchrecentrefsdays. And that basically changes how long that sliding window is in terms of what it considers a recent branch. You can also set the lfs.fetchrecentcommitsdays property. Now this defaults to zero, and the reason for that is that setting it any higher means it’s actually going to retrieve large content for commits that aren’t branch tips. So if I’ve got a branch that has 10 commits on it in the last day, it’s actually going to go off and retrieve all of the large files for every one of those. This is a pretty unusual situation. The only, kind of, you know, sensible use case I can think of for this is if you’re about to jump on a plane or lose internet connectivity and you need to do some repo surgery. Like, you know, maybe do a binary search and walk back through your history or you’re planning on rebasing or doing some cherry picking or something like that.
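The effect of the commits-days property on a single recent branch can be sketched the same way. This is a simplified model under one assumption worth flagging: the window is measured back from the branch’s tip commit, which is how the Git LFS documentation appears to describe it. The function name and inputs are hypothetical.

```python
from datetime import datetime, timedelta


def commits_to_fetch(commit_dates, commits_days=0):
    """Pick the commits on one recent branch whose large files get fetched.

    commit_dates is ordered newest first, so commit_dates[0] is the branch tip.
    Assumption: the window is counted back from the tip commit's date.
    """
    if not commit_dates:
        return []
    tip = commit_dates[0]
    if commits_days <= 0:
        # Default behavior: only the branch tip is considered.
        return [tip]
    cutoff = tip - timedelta(days=commits_days)
    return [date for date in commit_dates if date >= cutoff]
```

With the default of zero, a branch with ten commits from today contributes one commit’s worth of large files; with a value of one, it contributes all ten.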

And if you like, you can also set the lfs.fetchrecentremoterefs property. Now this only really makes sense in a very small team or a very slow moving repository because what it does is it goes off and retrieves large file content that’s associated with any branch or tag in your repository. So, you know, if you have a team with 10 people and, you know, 30 branches, it’s going to go off and retrieve 30 commits’ worth of large files. Which is going to be incredibly expensive. So I would only really enable that if you’re working in a fairly small team and you really need to get all that history locally for some reason.

You can also use the git lfs prune command, which is kind of the opposite, and it goes off and reclaims disk space for you in that local LFS cache. It has a couple of different settings as well. By default it will only delete things that it considers old. Which is basically seven, or whatever you’ve got the fetch recent refs days property set to, plus the lfs.pruneoffsetdays offset, which defaults to three. So this is going to delete any large files that are associated with commits that are over 10 days old. Now it is a little bit smarter than that. It won’t delete anything that’s referenced by a commit that hasn’t yet been pushed to the server. And what I also recommend doing is enabling the lfs.pruneverifyremotealways property. Which means that it checks with your LFS store to make sure that the LFS object it’s about to clean up has actually been pushed to that server at some point and still exists. So that’s kind of a nice little fail safe. You might want to unset that if you are actually trying to prune large files from commits that you are genuinely wanting to delete. So, you know, if you create a bunch of different, like, little, interstitial versions of a Keynote file or an image then, you know, you might want to actually prune those without actually pushing them to the server.
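The arithmetic behind “over 10 days old” can be written out as a short Python sketch. This is a simplified model of the rules described above (recent window plus offset, never prune unpushed or checked-out content), not the real prune algorithm; the function and parameter names are hypothetical.

```python
def retention_days(recent_refs_days=7, prune_offset_days=3):
    """Days a pushed large file is kept locally before it counts as old."""
    return recent_refs_days + prune_offset_days


def is_prunable(age_days, pushed, checked_out=False,
                recent_refs_days=7, prune_offset_days=3):
    """Decide whether a large file referenced by one commit may be pruned."""
    if checked_out or not pushed:
        # Never prune content you are working on or have not pushed yet.
        return False
    return age_days > retention_days(recent_refs_days, prune_offset_days)
```

With the defaults, the retention works out to 7 + 3 = 10 days, so an 11-day-old pushed file is a candidate and a 5-day-old one is not.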

So that’s how you can fetch more large objects. And in some cases, more than you could possibly want. Now I’m going to talk a little bit about how you can actually fetch fewer objects and leave some of those pointer files sitting around on disk instead of retrieving all of that large content. Now why would you want to do this? Well in some cases you might have something like a continuous integration build that doesn’t need to pull down all of those large assets. So say we had a, you know, like a Unity computer game and we had some unit tests built to test our physics engine, or something like that, we don’t want to pull down all of those heavy textures and full motion video and sound files. So we might just exclude our entire assets directory and let that build. Alternatively, if we’re a specialist, like an audio engineer, and we just want to work with our sound files, we can make everything a lot more efficient by just using one of these include flags and just including the audio assets for our project so we can pull those down. You can also configure these to be permanent settings using the lfs.fetchexclude and lfs.fetchinclude properties.

There was one more thing I wanted to say about this. Oh yes. The patterns there are just like regular […]. So if you’re used to using Ant, or in fact the .gitignore syntax, which it matches as well, they’re really easy to use.
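As a sketch of how the two scenarios above might be scripted, the Python helpers below build the argument lists for a one-off filtered fetch and for making the filters permanent. The --include and --exclude flags and the lfs.fetchinclude and lfs.fetchexclude properties belong to Git LFS; the helper names and the Assets paths are hypothetical.

```python
def lfs_fetch_command(include=(), exclude=()):
    """Build the argument list for a one-off git lfs fetch with path filters."""
    command = ["git", "lfs", "fetch"]
    if include:
        command.append("--include=" + ",".join(include))
    if exclude:
        command.append("--exclude=" + ",".join(exclude))
    return command


def persistent_filter_commands(include=(), exclude=()):
    """Build git config invocations that make the filters permanent."""
    commands = []
    if include:
        commands.append(["git", "config", "lfs.fetchinclude", ",".join(include)])
    if exclude:
        commands.append(["git", "config", "lfs.fetchexclude", ",".join(exclude)])
    return commands
```

A CI build might use lfs_fetch_command(exclude=["Assets"]), while an audio engineer might use lfs_fetch_command(include=["Assets/Audio"]); either list can be handed to subprocess.run inside a repository.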

Now if you’re a developer who prefers to avoid the sharp edges of command line Git there’s also a set of IDEs and GUI tools that have support for Git LFS. I haven’t seen an official statement from JetBrains yet but it appears that IntelliJ and AppCode and the rest of the JetBrains family do seem to work with Git LFS provided you have the binaries sitting on your path. I was at EclipseCon earlier this year and spoke to Matthias Sohn, who is one of the maintainers, or the maintainer, of the EGit project, and if you’re using Eclipse, it works out of the box with version four dot two plus. You just need to make sure that you have Git LFS on the path that Eclipse is using. So that means that you should either start Eclipse from your command line with the path variable set up properly or otherwise configure your path correctly.
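A quick way to check the “on your path” requirement is to look for the git-lfs executable the same way a launched process would. This is a minimal Python sketch; the helper name is hypothetical, and it only tells you about the path of the process that runs it, which may differ from the one your IDE was started with.

```python
import os
import shutil


def find_git_lfs(path=None):
    """Return the full path of the git-lfs executable, or None if not found.

    When path is None, the current process's PATH is searched.
    """
    return shutil.which("git-lfs", path=path)


if __name__ == "__main__":
    location = find_git_lfs(os.environ.get("PATH"))
    print(location or "git-lfs was not found on this PATH")
```

Running it from the same terminal you launch the IDE from gives a reasonable hint as to whether the IDE will be able to shell out to Git LFS.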

So EGit is based on the JGit project, which again is a Java reimplementation of Git but it still shells out to the native version of Git LFS for the time being. Though I think they are planning on doing a Java reimplementation at some point. NetBeans as well theoretically works. I spoke to […] who is one of the core maintainers of that. He’s planning on putting out a webinar on using it but again provided Git LFS is on your path, it should just work straight out of the box for you. Because NetBeans shells out to Git under the hood for all of its Git integration.

Sadly, the bad news is Xcode doesn’t appear to support Git LFS yet. Apple actually have their own fork of Git that they maintain and it’s based off a fairly old version that doesn’t yet support Git LFS. Or doesn’t support some of the extension points that Git LFS uses. So unfortunately if you’re using Xcode or if you are an iOS developer, for the time being you’ll probably need to use either the command line or SourceTree, which is a free (not open source) Git GUI that Atlassian maintains and which has full LFS support. So you can add and stage and commit files and it will actually do all the Git LFS invocations under the hood for you. It also has some explicit commands built into it and some other nice stuff, like it will repair broken Git LFS repos, so if it detects that the pre-push hook isn’t installed for some reason, it will go off and fix that up for you. It also does binary previews, like the image diffing that we saw before in Bitbucket, and reads through automatically to your backend Git LFS store.

So, that’s pretty much all I have in terms of Git LFS. As I’ve mentioned before, it is a joint project between Atlassian and GitHub and we’re super stoked to be collaborating with them on such an important project. It is fully supported in Bitbucket Server as of version four dot three and we have an internal alpha running with it in Bitbucket Cloud. So it will be coming to the public very soon as well.

If you like, you can follow me on Twitter for occasional updates and tweets on random Git arcana and Bitbucket and Jira trivia. Thank you very much for your time.