Line normalization before diff

Description

The DiffRowGenerator class offers the lineNormalizer property. By default, it replaces < and > with their escaped versions &lt; and &gt;.
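
To make the default behaviour concrete, here is a minimal JDK-only sketch of such a normalizer. The class and field names are made up for illustration; this is not the library's implementation, it only mirrors the escaping described above.

```java
import java.util.function.Function;

public class EscapingNormalizerSketch {
    // Hypothetical stand-in for the default lineNormalizer: escapes only < and >.
    static final Function<String, String> ESCAPE_ANGLE_BRACKETS =
            line -> line.replace("<", "&lt;").replace(">", "&gt;");

    public static void main(String[] args) {
        // prints: hello &lt;world&gt;
        System.out.println(ESCAPE_ANGLE_BRACKETS.apply("hello <world>"));
    }
}
```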

The lineNormalizer is applied to the input texts before the diff is calculated. While I see this as a useful feature, with the default settings it can be surprising that the resulting text is no longer properly HTML-escaped:

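import com.github.difflib.text.DiffRow;
import com.github.difflib.text.DiffRowGenerator;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.lang3.StringUtils;

// Default lineNormalizer, word-based inline diffs, old and new text merged into one line
final var generator = DiffRowGenerator.create() //
        .mergeOriginalRevised(true) //
        .showInlineDiffs(true) //
        .inlineDiffByWord(true) //
        .build();

final var rows = generator.generateDiffRows(List.of("hello <world>"), List.of("bye >world<"));

// With mergeOriginalRevised(true), the merged markup is available via getOldLine()
final var resultingText = rows.stream() //
        .map(DiffRow::getOldLine) //
        .collect(Collectors.joining(StringUtils.LF));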

The resulting text is

<span class="editOldInline">hello</span><span class="editNewInline">bye</span> &<span class="editOldInline">lt</span><span class="editNewInline">gt</span>;world&<span class="editOldInline">gt</span><span class="editNewInline">lt</span>;

Note that the & is treated as an unchanged part of the text, because both replacements, &lt; and &gt;, start with an ampersand. The span tags therefore end up inside the character entities, so the resulting text is no longer valid HTML.
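
The effect can be reproduced without the library. The following sketch splits the two already-escaped lines into words, keeping delimiters as separate tokens. The delimiter set is an assumption chosen for illustration (whitespace plus <, >, ;, : and &), not a copy of the library's SPLIT_BY_WORD_PATTERN, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmpersandSplitSketch {
    // Assumed delimiter set for illustration: whitespace plus <, >, ;, : and &.
    static final Pattern WORD_DELIMITERS = Pattern.compile("\\s+|[<>;:&]+");

    // Splits a line into tokens and keeps every delimiter run as its own token.
    static List<String> splitKeepingDelimiters(String line, Pattern delimiters) {
        List<String> tokens = new ArrayList<>();
        Matcher m = delimiters.matcher(line);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(line.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos < line.length()) {
            tokens.add(line.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The inputs are the lines as they look after the escaping normalizer has run.
        System.out.println(splitKeepingDelimiters("hello &lt;world&gt;", WORD_DELIMITERS));
        System.out.println(splitKeepingDelimiters("bye &gt;world&lt;", WORD_DELIMITERS));
    }
}
```

Both token lists contain "&" and ";" as standalone tokens at the same positions, so a word-level diff only sees "lt" versus "gt" as changed and wraps just those in span tags.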

In order for this behaviour to be a problem, the following conditions must all be true:

  1. The inlineDiffByWord must be used
  2. The default lineNormalizer must be used
  3. The two provided texts must differ at a position which starts with a character that is replaced by the lineNormalizer
  4. A release >= 4.15 must be used.

Workaround

Override the lineNormalizer, e.g. by using the SPLIT_BY_WORD_PATTERN of release 4.12, in which the ampersand was not considered a character that splits words.
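
As a rough illustration of why a delimiter set without the ampersand helps, the sketch below splits the escaped lines with a pattern that contains neither & nor ;. The delimiter set is an assumption and not the actual 4.12 pattern; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class EntityPreservingSplitSketch {
    // Assumed delimiter set without & and ; (not the actual 4.12 pattern).
    private static final String DELIMITER = "[\\s,.()]";

    // Splits before and after every delimiter character, keeping the delimiters as tokens.
    static List<String> split(String line) {
        return Arrays.asList(line.split("(?<=" + DELIMITER + ")|(?=" + DELIMITER + ")"));
    }

    public static void main(String[] args) {
        System.out.println(split("hello &lt;world&gt;"));
        System.out.println(split("bye &gt;world&lt;"));
    }
}
```

Here each escaped entity stays inside a single token, so a word-level diff can only mark "&lt;world&gt;" as a whole and never places a tag inside an entity.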

Solution approaches

IMHO, the SPLIT_BY_WORD_PATTERN of release 4.15+ is fine and I do not consider it to be the problem.

The library could offer one of the following features:

  1. A parameter that defines when the lineNormalizer is applied (before or after diffing)
  2. A second type of line normalizer that is applied after diffing
  3. An option to have the library apply the processDiffs function to non-diffs as well
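
What "normalizing after diffing" could look like is sketched below, using only the JDK. All names are hypothetical and nothing here is existing library API. A position-wise token comparison stands in for a real diff algorithm, and the token lists are given by hand; the point is only that escaping happens per token after the comparison.

```java
import java.util.List;
import java.util.function.Function;

public class NormalizeAfterDiffSketch {
    // Hypothetical post-diff normalizer: escapes < and > in a single token.
    static final Function<String, String> ESCAPE_ANGLE_BRACKETS =
            s -> s.replace("<", "&lt;").replace(">", "&gt;");

    // Merges two equally long token lists; differing tokens are wrapped in spans.
    // Tokens are compared in raw form and escaped only when they are written out.
    static String merge(List<String> oldTokens, List<String> newTokens) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < oldTokens.size(); i++) {
            String o = oldTokens.get(i);
            String n = newTokens.get(i);
            if (o.equals(n)) {
                html.append(ESCAPE_ANGLE_BRACKETS.apply(o));
            } else {
                html.append("<span class=\"editOldInline\">")
                        .append(ESCAPE_ANGLE_BRACKETS.apply(o)).append("</span>");
                html.append("<span class=\"editNewInline\">")
                        .append(ESCAPE_ANGLE_BRACKETS.apply(n)).append("</span>");
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Raw, unescaped tokens of "hello <world>" and "bye >world<".
        List<String> oldTokens = List.of("hello", " ", "<", "world", ">");
        List<String> newTokens = List.of("bye", " ", ">", "world", "<");
        System.out.println(merge(oldTokens, newTokens));
    }
}
```

Because the comparison runs on the raw characters, every entity in the output is written in one piece, either inside a span or outside of it.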