Release Notes v2023.08.0

Posted: 2023-08-22

This release mainly aims to improve the operational side of the search engine, with an emphasis of automating tedious manual processes and optimizing crawling and data processing to use fewer resources.

Conventionally I try to link to relevant commits in these notes, but some of the changes were so sweeping and protracted it was hard to narrow it down to individual commits; in those cases I’ll link to the relevant code instead.

New Features

Better Feature Detection and Blog Filter

The FeatureExtractor which analyzes websites’ HTML for things like advertisements and tracking code has been improved a fair bit. Website generator detection was also improved in this process.

Curated via a publicly available set of domains, the new filter selects for blogs and similar websites. These domains are also given slightly different processing rules on the assumption they are blogs.

Commit: cbbf60

Crawler - Smart Recrawling

The crawler has been enhanced to be able to make use of older crawl data to do optional fetching via the ETag and Last-Modified headers. This saves bandwidth and processing power for the server.

Code: CrawlDataReference CrawlerRetriever$recrawl

Operator’s GUI

A new user interface has been built for operating Marginalia Search. It was previously operated via command line instructions, direct SQL commands, and the like. This manual operation was both tedious and error prone.

The UI allows basic administrative operations such as dealing with domain complaints, creating API keys, blocking websites; but also has abstractions for triggering crawls and managing the heavier processes in the system.

Code: control-service

Message Queue / Actor Abstraction

To enable automation of the system several new abstractions have been introduced, including a message queue and an Actor abstraction on top of that. See /log/85-mq_sm_actor_ui for a detailed break down of this functionality.

Code: message-queue

Better language identification

Instead of using a naive home-made language identification algorithm, the fasttext library (via jfasttext) was used. It is much better at language identification, and as the name implies, pretty fast albeit not quite as fast when you run it via JNI. FastText is a very pleasant classifier library that will likely find other additional uses in the project in the future.

Commit: 46d761

Optimizations

There have been a lot of optimizations of the processes, these are just some of the bigger ones.

Converter - Reduced Memory Footprint and Increased Speed

The converter was keeping more items in memory than was necessary due to loading its input data up front by domain, and then iterating over each item. Streaming processing was introduced instead, which reduced the memory footprint so much that several previous memory optimizations such as transparent string compression became unnecessary, which in turn sped up the process a fair bit.

Commits: 507f26

Converter/Loader - Side Loading (experimental)

Some websites such as for example Wikipedia or Stack Overflow are too big to exhaustively crawl in a traditional sense, but they have data dumps available. Experimental support for side-loading Wikipedia was built.

This functionality is very immature.

To permit side loading large domains, the loader was also modified to reduce the amount of data it keeps in memory while loading. This was mainly accomplished by re-arranging the order the loading instructions are written by the converter.

Commits: f11103

Other Changes

Better feature detection and a new approach to advertisement filtering

A bit of effort was spent trying to figure out the modern advertisement ecosystem, and lessons learned were incorporated into the feature detection logic of the search engine.

A major shift in operation is to instead of looking for ads, the search engine will instead look for ad-tech tracking. This is much easier to do with the sort of static analysis Marginalia does, and probably what you want anyway. It turns out you can’t really run ads with no tracking without exposing yourself to click fraud, and you need to be pretty aggressive with how you do the tracking in a way that’s not easy to hide.

Commits: 0f9b90

Bugfix: Loader Stop Bug

There was a fairly trivial error in the loader process where it would stop loading documents from a website if any of their URLs were for some reason not loaded, typically because they were too long. This primarily affected large wordpress-style websites.

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    return;
}

should have been

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    continue;
}

Fixing the bug had the unanticipated side-effect of severely decreasing the average quality of the websites in the index, since large wordpress-style websites are often not very good.

To mitigate the quality problem, the ranking algorithm was modified to penalize large websites with kebab-case urls. This was a relatively invasive change that meant routing additional feature bits into the forward index. An upside of this is that the index has more information available for ranking websites, and it’s possible to e.g. apply a penalty to sites with adtech or likely affiliate links on them.

Commits: 4598c7 704de5

Bugfix: Crash on excluding keywords that are not known by the search engine

A rare bug was found that caused an error when excluding documents that contain a keyword where the keyword was not known to the search engine. This was due to a piece of debug logging that wouldn’t even have printed, yet still managed to trigger an index out of bounds error.

Commits: cb55c7

Upgraded dependencies – expected JDK version increased to 18+

Dependencies with security vulnerabilities were upgraded, which introduced a strange interaction with JDK 17, the previous default version, where non-ASCII letters would become garbled when reading crawl data. The exact cause of this is unknown, but a solution that works is to use JDK 18+ instead.

Flyway Migrations

Database migrations are now managed via Flyway. This eliminates manual database upgrades.

Commits: 58556a