Google has recently publicized a new method for compressing Chrome patches, which they call "Courgette". They argue that the generic bsdiff tool isn't efficient enough for them, and they can save a lot of bandwidth by reducing patch size. That makes it more practical for them to patch more often, automatically.
They offer one example of the effect of this new method for compression. For a 9.9MB full update, the bsdiff patch is 6.8% of that size (688KB), and a Courgette update is 0.76% (77KB). This number is certainly impressive, but it is only one example. I'd like to know if it is reasonable to expect this level of compression on all updates. Existing work in this area does consider updates to many different packages, across many types of updates, and the particular update does make a difference.
Though they compare against bsdiff, their technique sounds a lot like Exediff. The Exediff algorithm goes something like this. First, try to match instructions in the original executable against instructions in the updated version. All unmatched instructions are "primary differences," and these changes must be encoded in the patch file. Then, based on these changes, predict how the matched instructions should have changed. When the change is predictable, it doesn't need to be encoded in the update. Otherwise, the change is a "secondary difference" and must be include in the update. Exediff uses knowledge of the instruction set to predict how matched instructions will change, if at all. Despite this, the author of bsdiff showed that across many different types of patches, Exediff and bsdiff have similar performance.
So, both Exediff and Courgette appear to do the same thing. How then does Courgette do so much better at compressing? It looks like the "adjust" step is where the magic happens. Courgette changes the updated binary so that it has fewer differences, letting the patch be smaller. That updated binary is then reconstructed by the client, not the original update.
This leads to some interesting questions, the first concerning repeated patching. For instance, if I have a base version A which I patch to version B, how should I generate the patch to version C? Do I generate the binary for C so that the patch B -> C is minimized, or should the patch A -> C be minimized? What happens for patch M, or Z, or AAA?
Deciding which patch to optimize would seem to depend on which upgrade path is more popular. With the goal of decreasing total bandwidth usage, you could use server logs (or other usage data) to predict an initial distribution of versions currently in use. Based on historical data, you could then predict which users are going to update their software before the next patch is released. The aggregate of this data gives you a distribution of patch downloads. Pick the most popular one, and optimize that. Possibly, the others will be similarly small. Or, given such a distribution, can that information be used in the adjustment process to generate a binary that minimizes changes over a set of patches, rather than one in particular?
My guess is that Courgette optimizes each patch against only the latest version, assuming that most everyone will be running it. With auto-updating software, a user would have to fail to open the program for an extended amount of time, longer than it takes to push out two patches. This may not be likely.
I also wonder what side effects the binary modifications might have on the updated binary. By reordering or moving code, could it get slower?
These are interesting points that I hope Google engineers may address if they ever decide to publish a paper describing Courgette in greater depth. I hope they do.