There's also patience diff[1], which does a better job in some situations[2] (you might not normally see these inferior diffs because the diffs are massaged after the LCS algorithm).
I don't understand why there isn't a better code-aware diff tool available, like an AST diff. When I split a method up, or add an if statement, can't we detect that most of the code didn't change, it was just moved around?
This is the biggest frustration to me when doing and sending code reviews, having to explain "I know it's a 5 thousand line change, but I really didn't change any of the logic, I just moved the class into a subclass which caused an indentation change."
I'm working on one on-and-off (more off than on lately) that compares parse trees to group related changes together and tries to do some high-level analysis (like finding moves). I finally got my core algorithm working reasonably well last year, but then started a new job and haven't had the energy to work on it more since—I'm hoping to get back to it in a month or two.
Anyway, if you're curious, I have the code and a (very) rough write-up up on GitLab[1]. Keep in mind that it's very preliminary, and that my approach is by no means the only way to do this! There's been a fair amount of research on this in the software engineering community (ie SIGSOFT). I've looked through some of the existing papers, and there have been a range of algorithms and tools with different metrics, different levels of language-awareness and different capabilities.
AST-assisted diffing is something we've looked into for Review Board, and it's something we'll probably add at some point. There's certainly some benefit to doing it.
However, you don't need it for the case of "I've moved the class into a subclass which caused an indentation change." We handle that case by tracking indentation-only changes and by tracking moved code.
For instance, if you were to move a class or a method or some logic around within a file, we show that it was moved but not modified, saving you some review time. You can move an entire function into a class, with no modifications (or just a line changed here and there to reference `self` or `this`) and we intelligently show that. Saves a ton of time during code review.
If you nested a bunch of things inside a new conditional or subclass without otherwise changing the lines, we highlight just the indentation changes while making it clear the rest of the line is the same. Nest a 5 thousand line function inside of a new class, and you'll only have to worry about reviewing the class definition, not the indented 5 thousand lines.
You could get some benefits to this logic with AST-assisted diffs, to help you better know where a function/class/loop/etc. begins and ends (particularly with languages like Python), which can help you better represent some changes. You also get the benefit of knowing up-front about syntax errors. I've also given some thought to being able to "zoom out" of the line-by-line changes to a file and let you focus on the high-level structure of the classes/methods/docs of a file and how those have changed.
The downside to AST-assisted diffs is compatibility with language changes. It's important to have a sane fallback when the file isn't exactly as you expect it to be (such as if the language adds some new form of syntactic sugar but you're on an older version of the product doing the AST-assisted diffing, or you're parsing something that claims to be in a certain language but is using special syntax meant to be pre-processed during builds).
I remember the incredibly elegant AST-based diffing in Eclipse (which I haven't used in about 10 years; I can only assume it's still fantastic), which showed you an outline of which classes, methods etc. have changed.
AST-based diff naturally requires a parser for every language, and for some languages like C and C++ you need to bake preprocessor stuff (macros, ifdefs, includes etc.) into the node tree. But it's doable.
Except in Python where indentation changes are meaningful.
def coalesce(*args):
while args:
next = args.pop()
if next is not None:
return next
return None
def coalesce(*args):
while args:
next = args.pop()
if next is not None:
return next
return None
All string diff algorithms in alignment derive from previous innovations on raw strings. SW is a derivative of NW, which itself (although the paper does not admit it) derives from the work done on edit distance in the mid 60s). It's actually interesting that this algorithm was rediscovered independently multiple times within a short time frame.
Also, the "Myers" in the article has worked a bit with sequence alignment. He helped design the BLAST algorithm (with Altschul) as well as suffix arrays (with Manber), the algorithms that reassembled the human genome for Celera, and finally, dropped the ultimate bomb: https://academic.oup.com/bioinformatics/article/21/suppl_2/i...
All string diff algorithms in alignment derive from previous innovations on raw strings.
For some applications, it's more useful to just take the hash of each line and compute line diffs from an array of hashes. Diff algorithms which ignore whitespace changes sometimes do this.
Few algorithms recognize text re-ordering. This is a problem with Wikipedia diffs, where rearranging text is commonplace. Displaying the changes might require big arrows, but there's no problem with that today.
[1]: http://alfedenzo.livejournal.com/170301.html [2]: http://bramcohen.livejournal.com/73318.html