Coalesce common plain filters in a bucket into a single trie

This issue is opened somewhat a posteriori, as work for this has already been committed in gorhill/uBlock@c229003 (the new STrieContainer code was just a matter of reusing existing HNTrieContainer code and modifying it to deal with plain string rather than hostname, i.e. simplifying really).

However I want to document and further investigate pushing the trie usage farther, hence this issue.

Done

In the static network filtering engine, filter buckets are used to hold collections of filters which share the same token. Tokens are used to store/lookup filters or collections of filters so as to minimize the number of filters visited each time a network request needs to be matched against the content of the filter lists.

With gorhill/uBlock@a594b3f, I introduced a method to help investigate the content of all filter buckets. To do so, call the method from uBO's dev console as follow:

µBlock.staticNetFilteringEngine.bucketHistogram()

This will output a list of all filter buckets in the static network filtering engine, from largest to smallest. Before the work on trie usage was committed, bucketHistogram() would show the content of the largest buckets as follow (top ten):

bits: 0, token: "ad", 926 entries
bits: 0, token: "ads", 636 entries
bits: 65, token: "phncdn", 253 entries
bits: 0, token: "analytic", 174 entries
bits: 0, token: "tracking", 155 entries
bits: 72, token: "http", 146 entries
bits: 72, token: "https", 139 entries
bits: 88, token: "http", 122 entries
bits: 0, token: "adv", 121 entries
bits: 88, token: "https", 118 entries
bits: 0, token: "advertis", 102 entries

As can be seen the ad and ads bucket are quite large, and anytime uBO encounter a URL with the token ad or ads, it must potentially visit all or a large number of the entries in these buckets to find out whether there is a matching filter.

Oftentimes, a majority of filters in these buckets are plain filters which can be coalesced into a single trie entity, and a single scan of the trie will tell whether any of the filter has been matched.

After the work to coalesce trieable filters from the same bucket into a single trie was committed, the buckets looked like this:

bits: 0, token: "ad", 57 entries + 2 tries coalescing 869 strings
bits: 0, token: "ads", 78 entries + 1 trie coalescing 558 strings
bits: 65, token: "phncdn", 22 entries + 1 trie coalescing 231 strings
bits: 0, token: "analytic", 35 entries + 1 trie coalescing 139 strings
bits: 0, token: "tracking", 17 entries + 1 trie coalescing 138 strings
bits: 72, token: "http", 146 entries
bits: 72, token: "https", 139 entries
bits: 88, token: "http", 122 entries
bits: 0, token: "adv", 26 entries + 1 trie coalescing 95 strings
bits: 88, token: "https", 118 entries
bits: 0, token: "advertis", 18 entries + 1 trie coalescing 84 strings

Sometimes, buckets end up with 0 entries and 1 trie as all the filters in the bucket can be coalesced into a single trie, for example:

bits: 0, token: "adserver", 0 entries + 1 trie coalescing 29 strings

The result as per µBlock.staticNetFilteringEngine.benchmark() method was a measurable performance and memory usage gain. I do not expect the performance gain to be noticeable from a user's point of view, although the memory efficiency can be noticed, especially when uBO loads from a selfie (typically well under 40 MB at launch with default settings/lists).

To do

Investigate pushing the coalecing-into-tries idea farther in the context of filter buckets. There are still many entries left in some filter buckets which can theoretically be coalesced into a trie.

However this would require a "bidi" trie, i.e. a trie which can scan left and right of the token position in the URL -- current implementation scans just to the right.

The ability to scan both directions would open the door to also coalesce FitlerPlain-based filters into the same trie currently used by the FilterPlainPrefix1-based filters (a specialized version of FitlerPlain).

For examples, this would further reduce the number of entries to scan to 19 entries in the ad bucket (from 57 entries with current trie, or 926 entries with no trie), and to just 8 entries in the ads bucket (from 78 entries with current trie, or 636 entries with no trie).