Wikidiff2 is partly based on the original wikidiff, partly on DifferenceEngine.php, and partly my own work. It performs word-level (space-delimited) diffs on general text, and character-level diffs on text composed of characters from the Japanese and Thai alphabets and the unified han. Japanese, Chinese and Thai do not use spaces to separate words. The input is assumed to be UTF-8 encoded. Invalid UTF-8 sequences will be passed through with no error issued, but it is recommended that the user converts the input text to UTF-8 rather than relying on this property. The input text should have unix-style line endings.

The output is an HTML fragment -- a number of HTML table rows with the rest of the document structure omitted. The characters "<", ">" and "&" will be HTML-escaped in the output.

A number of test files are provided:

* test-a1 and test-a2, expected output test-a.diff
* test-b1 and test-b2, expected output test-b.diff

These pairs together give 85.9% coverage of DiffEngine.h. The remaining untested part is mostly in _DiffEngine::_shift_boundaries(). 

* chinese-reverse-1.txt and chinese-reverse-2.txt (in chinese-reverse.zip)

These files are 2.3MB each, and give a worst-case performance test. Performance in the worst case is sensitive to the performance of the associative array class used to cross-reference the strings. I tried using an STL map and a Judy array. The Judy array gave an 11% improvement in execution time over the map, which could probably be increased to 15% with further optimisation work. I don't consider that to be a sufficient improvement to warrant adding a library dependency, but the code has been left in for the benefit of Judy fans and performance perfectionists. It can be enabled by compiling with -DUSE_JUDY. The C++ wrapper for JudyHS might be of use to someone.

Wikidiff2 can be compiled as either a PHP extension (with the help of swig), or as a standalone executable. The standalone is mainly intended for testing and development purposes.

Tim Starling
February 2006


vim: wrap
