Release Notes

These are release notes for Marginalia Search. They are also mirrored on 🌎 GitHub.

See also 📁 marginalia-search and 📁 status.

Release Notes v2023.10.0

Posted: 2023-10-07

This is a mostly technical release. It takes the index from 106M to 164M documents.

Zero Downtime Upgrades and halved memory consumption

The initial focus of the release was to address the sometimes lengthy downtimes that have plagued the project when loading a new index.

There is a somewhat lengthy write-up about this here, but the short version is that this was very successful: a drastic optimization that not only removed the needed downtime, but also added neat new features and slashed the RAM requirements in half!

Pull Request #42

An annoyance-fueled optimization methodology also slashed the index construction time in half at a later point. Pull Request #52.

Java 21 PREVIEW

There were unintended consequences of the changes above, and the system needed an upgrade to Java 21 with preview features enabled. This has to do with off-heap memory lifecycle management. Up until Java 21 (preview), Java offered no way of explicitly closing off-heap memory, including memory mapped files. This caused the filesystem to hold onto references to the mapped data even after the associated files had been deleted, which vastly increased the amount of disk space required to construct the index using the new method of recursive merging.

A positive side-effect of this is that using the new foreign memory API is a lot faster than Java’s old byte buffers, since the size can exceed 2 GB without userspace paging.
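As a rough illustration of the lifecycle described above, here is a minimal sketch of mapping a file through the foreign memory API and unmapping it deterministically. The class name, the scratch file and the 4096-byte size are assumptions made for the example, not code from the project. It needs Java 21 with --enable-preview, or Java 22 and later where the API is final.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class, not part of Marginalia.
final class MappedFileSketch {

    /** Maps a scratch file, writes and reads back one long, then unmaps and deletes the file. */
    static long roundTrip(long value) {
        try {
            Path file = Files.createTempFile("mapped-sketch", ".dat");
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MemorySegment segment;
                long result;
                try (Arena arena = Arena.ofConfined()) {
                    // Offsets and sizes are longs, so a segment may exceed 2 GB.
                    segment = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096, arena);
                    segment.set(ValueLayout.JAVA_LONG, 0, value);
                    result = segment.get(ValueLayout.JAVA_LONG, 0);
                } // closing the arena releases the mapping right here
                if (segment.scope().isAlive()) {
                    throw new IllegalStateException("segment should be closed by now");
                }
                return result;
            } finally {
                // The channel is already closed at this point, so the file can go.
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L));
    }
}
```

With the older MappedByteBuffer there is no equivalent of the arena's close(); the mapping only goes away once the garbage collector gets around to it.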

There are some stray vestigial remains of the old way of memory mapping files still lingering, to be rooted out in the next release.

Writeup: https://www.marginalia.nu/log/89-disk-usage-mystery/ Commits: d0aa75

Parquet files in converter and crawl specs

For a long time, compressed JSON files have been used to store much of the unprocessed and half-processed crawl data. This is very easy to use, but tends to be a bit awkward when you have millions of files. It’s also not the most performant format in the world: for example, it doesn’t announce how long a string is up front, so you need to just keep reading to find out.

Parquet is a clever format popular in big data applications that largely solves these problems. Parquet in Java is not so great, however, since the only(?) implementation is deeply tied to the Hadoop ecosystem, and separating the two isn’t entirely trivial.

Thankfully there’s a helpful library called parquet-floor that tries to do this. It is a bit on the basic side, but its technological and biological distinctiveness was added to our own, and now it does what’s necessary.

The biggest benefit of this is that the data is much easier to interact with. Previously, to inspect some processed data, you’d need to use some combination of Unix command line tools and jq to get at it. With Parquet, much more convenient tools are available. The entire dataset can be queried with SQL using, for example, DuckDB!

The parquetification of the project is still ongoing. The crawl data needs to be addressed too, but that is left for a future release.

Pull Request #48

Improved sideloading support

There’s been kinda-sorta support for sideloading encyclopedia data from Wikipedia already, but it’s been pretty shaky. This release introduces the ability to sideload not only Wikipedia data, but also Stackexchange dumps and just directories with HTML for e.g. javadocs.

These will not go live in the production index until it can be figured out how to make such large popular websites not show up as the first result for every query.

I wrote some rough documentation for how to do this.

Commits: 70aa04 5b0a6d 6bbf40 98bcdf 9b385e 5e5aaf

Notable bugfixes:

  • A concurrency bug was causing some of the position data to be corrupted. This had a fairly adverse effect on the quality of the search results, causing bad matches to be promoted and good matches to be dismissed as irrelevant. a433bb

Release Notes v2023.08.0

Posted: 2023-08-22

This release mainly aims to improve the operational side of the search engine, with an emphasis on automating tedious manual processes and optimizing crawling and data processing to use fewer resources.

Conventionally I try to link to relevant commits in these notes, but some of the changes were so sweeping and protracted it was hard to narrow it down to individual commits; in those cases I’ll link to the relevant code instead.

New Features

Better Feature Detection and Blog Filter

The FeatureExtractor which analyzes websites’ HTML for things like advertisements and tracking code has been improved a fair bit. Website generator detection was also improved in this process.

Curated via a publicly available set of domains, the new filter selects for blogs and similar websites. These domains are also given slightly different processing rules on the assumption they are blogs.

Commit: cbbf60

Crawler - Smart Recrawling

The crawler can now use older crawl data to do conditional fetching via the ETag and Last-Modified headers, skipping documents that have not changed. This saves bandwidth and processing power for the server.
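To make the recrawling idea concrete, here is a minimal sketch of a conditional request built with the JDK's java.net.http client. The class and method names are made up for the example, and the actual crawler may well use a different HTTP client. The stored ETag goes into If-None-Match and the stored Last-Modified value into If-Modified-Since, and a 304 response means the old copy is still good.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical example class, not part of Marginalia.
final class ConditionalFetchSketch {

    /** Builds a GET request that asks the server to reply 304 if nothing has changed. */
    static HttpRequest conditionalGet(String url, String etag, String lastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (etag != null) {
            builder.header("If-None-Match", etag);             // value of the old ETag header
        }
        if (lastModified != null) {
            builder.header("If-Modified-Since", lastModified); // value of the old Last-Modified header
        }
        return builder.build();
    }

    /** 304 Not Modified carries no body; the document from the previous crawl can be reused. */
    static boolean canReuseStoredCopy(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        HttpRequest request = conditionalGet("https://www.example.com/",
                "\"abc123\"", "Sat, 07 Oct 2023 00:00:00 GMT");
        System.out.println(request.headers().map());
    }
}
```

The request is only built here, not sent, so the sketch runs without network access.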

Code: CrawlDataReference CrawlerRetriever$recrawl

Operator’s GUI

A new user interface has been built for operating Marginalia Search. It was previously operated via command line instructions, direct SQL commands, and the like. This manual operation was both tedious and error prone.

The UI allows basic administrative operations such as dealing with domain complaints, creating API keys, blocking websites; but also has abstractions for triggering crawls and managing the heavier processes in the system.

Code: control-service

Message Queue / Actor Abstraction

To enable automation of the system, several new abstractions have been introduced, including a message queue and an Actor abstraction on top of that. See /log/85-mq_sm_actor_ui for a detailed breakdown of this functionality.

Code: message-queue

Better language identification

Instead of using a naive home-made language identification algorithm, the fastText library (via jfasttext) is now used. It is much better at language identification and, as the name implies, pretty fast, albeit not quite as fast when you run it via JNI. FastText is a very pleasant classifier library that will likely find additional uses in the project in the future.

Commit: 46d761

Optimizations

There have been a lot of optimizations of the processes; these are just some of the bigger ones.

Converter - Reduced Memory Footprint and Increased Speed

The converter was keeping more items in memory than was necessary due to loading its input data up front by domain, and then iterating over each item. Streaming processing was introduced instead, which reduced the memory footprint so much that several previous memory optimizations such as transparent string compression became unnecessary, which in turn sped up the process a fair bit.

Commits: 507f26

Converter/Loader - Side Loading (experimental)

Some websites, such as Wikipedia or Stack Overflow, are too big to exhaustively crawl in a traditional sense, but they have data dumps available. Experimental support for side-loading Wikipedia was built.

This functionality is very immature.

To permit side loading large domains, the loader was also modified to reduce the amount of data it keeps in memory while loading. This was mainly accomplished by re-arranging the order the loading instructions are written by the converter.

Commits: f11103

Other Changes

Better feature detection and a new approach to advertisement filtering

A bit of effort was spent trying to figure out the modern advertisement ecosystem, and lessons learned were incorporated into the feature detection logic of the search engine.

A major shift in operation is that instead of looking for ads, the search engine now looks for ad-tech tracking. This is much easier to do with the sort of static analysis Marginalia does, and probably what you want anyway. It turns out you can’t really run ads with no tracking without exposing yourself to click fraud, and you need to be pretty aggressive with how you do the tracking in a way that’s not easy to hide.

Commits: 0f9b90

Bugfix: Loader Stop Bug

There was a fairly trivial error in the loader process where it would stop loading documents from a website if any of its URLs were for some reason not loaded, typically because they were too long. This primarily affected large WordPress-style websites.

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    return;
}

should have been

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    continue;
}

Fixing the bug had the unanticipated side-effect of severely decreasing the average quality of the websites in the index, since large wordpress-style websites are often not very good.

To mitigate the quality problem, the ranking algorithm was modified to penalize large websites with kebab-case urls. This was a relatively invasive change that meant routing additional feature bits into the forward index. An upside of this is that the index has more information available for ranking websites, and it’s possible to e.g. apply a penalty to sites with adtech or likely affiliate links on them.

Commits: 4598c7 704de5

Bugfix: Crash on excluding keywords that are not known by the search engine

A rare bug was found that caused an error when excluding documents that contain a keyword where the keyword was not known to the search engine. This was due to a piece of debug logging that wouldn’t even have printed, yet still managed to trigger an index out of bounds error.
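The notes don't say exactly how the log statement managed this, but one plausible mechanism is that the arguments to a logging call are evaluated before the logger decides whether to print anything. The sketch below illustrates that failure mode with hypothetical names and a stand-in logger; it is not the code from the commit.

```java
import java.util.function.Supplier;

// Hypothetical example class, not part of Marginalia.
final class EagerLoggingSketch {
    // Stand-in for a logger whose debug level is switched off.
    static final boolean DEBUG_ENABLED = false;

    static void debugEager(String message) {
        if (DEBUG_ENABLED) System.out.println(message);
    }

    static void debugLazy(Supplier<String> message) {
        if (DEBUG_ENABLED) System.out.println(message.get());
    }

    /** The message is built before debugEager() runs, so a bad index throws regardless. */
    static boolean eagerCallThrows(int[] knownIds, int index) {
        try {
            debugEager("Excluding term id " + knownIds[index]);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    /** The lambda is never invoked while debug is off, so the bad index is never read. */
    static boolean lazyCallThrows(int[] knownIds, int index) {
        try {
            debugLazy(() -> "Excluding term id " + knownIds[index]);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] noKnownIds = new int[0]; // the keyword is unknown, so there is nothing to look up
        System.out.println("eager throws: " + eagerCallThrows(noKnownIds, 0));
        System.out.println("lazy throws: " + lazyCallThrows(noKnownIds, 0));
    }
}
```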

Commits: cb55c7

Upgraded dependencies – expected JDK version increased to 18+

Dependencies with security vulnerabilities were upgraded, which introduced a strange interaction with JDK 17, the previous default version, where non-ASCII letters would become garbled when reading crawl data. The exact cause of this is unknown, but a solution that works is to use JDK 18+ instead.

Flyway Migrations

Database migrations are now managed via Flyway. This eliminates manual database upgrades.

Commits: 58556a

Release Notes v2023.06.0

Posted: 2023-06-29

New Features

Generator keywords

To provide additional ways of selecting search results, a synthetic keyword has been added for the <meta name="generator" content="..."> tag. This is basically a vanity tag that is used by some HTML generators to advertise themselves, and it’s also common for hand-edited HTML to include this tag with a string like “vim” or “myself”, as a wink to human readers of the code.

The generator keywords have the form generator:value. For example, to search for websites made with Hugo, you can use generator:hugo. Generator categories have also been added as searchable keywords, for example generator:wiki, generator:forum, generator:docs.

These last keywords have been added as options in the search engine’s filters.
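As a rough sketch of how such a keyword could be derived, the example below pulls the generator tag out of a document with a regular expression and keeps the first word of its content, lowercased. Both the regex-based parsing and the "first word" normalization are assumptions made for illustration; the real converter works on a parsed DOM and its normalization rules may differ.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not part of Marginalia.
final class GeneratorKeywordSketch {
    // Only matches the attribute order shown in the notes above: name first, then content.
    private static final Pattern GENERATOR_META = Pattern.compile(
            "<meta\\s+name=\"generator\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns e.g. "generator:hugo" for content="Hugo 0.111.3", or empty if there is no tag. */
    static Optional<String> generatorKeyword(String html) {
        Matcher matcher = GENERATOR_META.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String content = matcher.group(1).trim();
        if (content.isEmpty()) {
            return Optional.empty();
        }
        // Assumption: drop version numbers and the like by keeping only the first word.
        String name = content.split("\\s+")[0].toLowerCase(Locale.ROOT);
        return Optional.of("generator:" + name);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"generator\" content=\"Hugo 0.111.3\"></head></html>";
        System.out.println(generatorKeyword(html).orElse("(none)"));
    }
}
```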

a9a2960e d86e8522

Crawler support for sitemaps

To ensure the crawler is able to find all the pages of a website, while wasting minimal time and bandwidth on dead links, the crawler now supports the sitemap protocol. Implementing this support was relatively straightforward as a site map parser was already available within Crawler Commons, a library which is already used for parsing robots.txt files.

The crawler will look for a sitemap directive in robots.txt, and will also look for /sitemap.xml in the root of the server, as well as parse RSS and Atom feeds for links if they are found in the root document of the website.
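The first two discovery steps can be sketched as follows. This is a simplified illustration with made-up names that parses robots.txt by hand; the crawler itself relies on Crawler Commons for this, and the RSS and Atom handling is left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of Marginalia.
final class SitemapDiscoverySketch {
    private static final String DIRECTIVE = "sitemap:";

    /** Lists sitemap URLs named in robots.txt, followed by the conventional /sitemap.xml. */
    static List<URI> candidateSitemaps(URI siteRoot, String robotsTxt) {
        List<URI> candidates = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String value = trimmed.substring(DIRECTIVE.length()).trim();
                if (!value.isEmpty()) {
                    // A sketch-level shortcut: malformed URLs would throw here.
                    candidates.add(siteRoot.resolve(value));
                }
            }
        }
        URI fallback = siteRoot.resolve("/sitemap.xml");
        if (!candidates.contains(fallback)) {
            candidates.add(fallback);
        }
        return candidates;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private\n"
                + "Sitemap: https://www.example.com/maps/main.xml\n";
        System.out.println(candidateSitemaps(URI.create("https://www.example.com/"), robotsTxt));
    }
}
```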

ecc940e36

Crawler specialization for Lemmy, Discourse and Mediawiki

Some server software for larger websites has a lot of valid links, but also many links that are highly ephemeral (such as a Mastodon feed, or the index of a forum). To help the crawler only index the pages that don’t change that often, specialized logic has been introduced for Lemmy, Discourse and Mediawiki.

This also saves processing power for the server, as these applications often have relatively expensive rendering logic.

This is a bit of an experiment. Implementing these specializations is relatively easy, and if it pans out it will be extended to other software.

ed373eef

Improved Site Info

The site information view has been improved to show better placeholder information for unknown domains, including a link to the git repository for submitting websites to be crawled.

a6a66c6d

Bug Fixes

Pub-date validation

The published date of a page is now validated against the plausible range of the HTML standard it’s written in. It’s impossible that an HTML5 document was written in 1997, and unlikely that an HTML2 document was written in 2021. 7326ba74
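A minimal sketch of this kind of range check is shown below. The standards listed and, in particular, the year bounds are illustrative assumptions chosen for the example, not the values the search engine uses.

```java
// Hypothetical example class, not part of Marginalia.
final class PubDateValidationSketch {

    /** Year ranges are assumptions for illustration only. */
    enum HtmlStandard {
        HTML_2(1995, 2000),
        HTML_4(1997, 2015),
        HTML_5(2008, Integer.MAX_VALUE);

        final int earliestYear;
        final int latestYear;

        HtmlStandard(int earliestYear, int latestYear) {
            this.earliestYear = earliestYear;
            this.latestYear = latestYear;
        }
    }

    /** A claimed publish year is kept only if it falls inside the standard's range. */
    static boolean isPlausible(HtmlStandard standard, int year) {
        return year >= standard.earliestYear && year <= standard.latestYear;
    }

    public static void main(String[] args) {
        System.out.println(isPlausible(HtmlStandard.HTML_5, 1997));
        System.out.println(isPlausible(HtmlStandard.HTML_2, 2021));
        System.out.println(isPlausible(HtmlStandard.HTML_5, 2021));
    }
}
```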

A bug was also discovered in the JSON-LD parser that caused rare null pointer exceptions. This code is a bit of a hack and could definitely be cleaned up further. 21125206

Optimizations

The converter process, which extracts keywords and metadata from HTML documents, has been optimized to run about 20-25% faster. The crawler has also been modified to spend less effort on domains that have historically proven not to have a lot of viable pages. As a result, crawling is twice as fast, and processing takes about 24 hours instead of 60+ hours.

The converter optimization was achieved by replacing expensive string operations (like toLower()) with custom logic that doesn’t require allocation.

BigString

The BigString is an object for transparent storage of compressed strings in memory. It enables the processor to load the full contents of a website into memory at once, and then unpack each document as it’s being processed.

BigString was optimized to use fixed buffers, since allocating large arrays in Java is expensive and the garbage collector has to work hard to clean up the mess. Sharing fixed buffers introduces some lock contention, but it is still significantly faster than the previous version.

Another small speed-up is from using java.lang.String’s char[] constructors instead of byte[]-constructors, reducing unnecessary back-and-forth charset conversion.
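The general idea of a transparently compressed string can be sketched like this. The example uses the JDK's built-in DEFLATE streams purely for convenience; which codec BigString uses is not stated here, and the fixed-buffer reuse described above is deliberately left out, so this is an illustration of the concept rather than of the implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical example class, not part of Marginalia.
final class CompressedStringSketch {
    private final byte[] compressed;

    /** Compresses the text once, up front, and keeps only the compressed bytes. */
    CompressedStringSketch(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(out)) {
            deflate.write(value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.compressed = out.toByteArray();
    }

    /** Unpacks the text on demand, e.g. right before the document is processed. */
    String decode() {
        try (InflaterInputStream inflate =
                     new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            return new String(inflate.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int compressedSize() {
        return compressed.length;
    }

    public static void main(String[] args) {
        String html = "<p>lorem ipsum dolor sit amet</p>".repeat(500);
        CompressedStringSketch big = new CompressedStringSketch(html);
        System.out.println(html.length() + " chars held in " + big.compressedSize() + " bytes");
        System.out.println("round trip ok: " + big.decode().equals(html));
    }
}
```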

Commit: e4372289

RDRPosTagger

The RDRPosTagger library, which does Part Of Speech tagging, already impressively fast, already aggressively modified to be faster, has been further optimized to be faster still, and its Java object tree design was replaced with flat integer arrays.

This was always an expensive operation, but now it’s much faster. The speed-up comes from replacing string comparisons with integer comparisons, as well as re-ordering the data in memory to reduce the cache thrashing that is typically associated with walking a branching tree structure. Part of this is from eliminating Java object headers.
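To show what "flat integer arrays instead of an object tree" can look like, here is a generic decision-tree sketch. It is not the rule layout RDRPosTagger actually uses; the array names and the single-feature test per node are assumptions made to keep the example small. Each node is just an index, and its fields live at that index in parallel int arrays, so there are no per-node objects or headers and the comparisons are plain integer comparisons.

```java
// Hypothetical example class, not part of Marginalia or RDRPosTagger.
final class FlatTreeSketch {
    // Node i is described by element i of each array.
    private final int[] testFeature; // which feature to inspect, or -1 for a leaf
    private final int[] testValue;   // value the feature is compared against
    private final int[] ifMatch;     // next node when the feature equals testValue
    private final int[] ifNoMatch;   // next node otherwise
    private final int[] conclusion;  // result id stored at leaf nodes

    FlatTreeSketch(int[] testFeature, int[] testValue,
                   int[] ifMatch, int[] ifNoMatch, int[] conclusion) {
        this.testFeature = testFeature;
        this.testValue = testValue;
        this.ifMatch = ifMatch;
        this.ifNoMatch = ifNoMatch;
        this.conclusion = conclusion;
    }

    /** Walks from the root (node 0) to a leaf using only array lookups and int comparisons. */
    int classify(int[] features) {
        int node = 0;
        while (testFeature[node] >= 0) {
            boolean match = features[testFeature[node]] == testValue[node];
            node = match ? ifMatch[node] : ifNoMatch[node];
        }
        return conclusion[node];
    }

    public static void main(String[] args) {
        // Node 0 asks "is feature 0 equal to 7?"; nodes 1 and 2 are leaves.
        FlatTreeSketch tree = new FlatTreeSketch(
                new int[] {0, -1, -1},
                new int[] {7, 0, 0},
                new int[] {1, 0, 0},
                new int[] {2, 0, 0},
                new int[] {0, 100, 200});
        System.out.println(tree.classify(new int[] {7}));
        System.out.println(tree.classify(new int[] {3}));
    }
}
```

In a setup like this, strings such as words and tags would be mapped to integer ids once, ahead of time, which is what makes the integer comparisons possible.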

Commit: 186a02

Release Notes v2023.03.2

Posted: 2023-05-25

This is primarily a bugfix release that addresses some issues with metadata corruption introduced in the previous release.

New Features

File keywords

To provide more tools for navigating the web, the converter now generates synthetic keywords for documents that link to files on the same server based on their file ending.

If the document contains a link such as

<a href="file.zip">Download</a>

then the document will be tagged with the keyword file:zip as well as file:archive.

The category keywords are file:audio, file:video, file:image, file:document, file:archive.

Since earlier, the converter has also generated keywords based on filenames, even if the filename itself doesn’t appear in the visible portion of the document. So in the example above, file.zip would also be a relevant keyword for the document.
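The file keyword generation can be sketched roughly as below. Only the zip-to-archive pairing comes from the example above; the rest of the extension table is an assumption, and the check that the link points to the same server is omitted.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical example class, not part of Marginalia.
final class FileKeywordSketch {
    // Assumed mapping from file ending to category, for illustration.
    private static final Map<String, String> CATEGORY_BY_ENDING = Map.of(
            "zip", "archive",
            "tar", "archive",
            "mp3", "audio",
            "mp4", "video",
            "png", "image",
            "pdf", "document");

    /** Returns the ending keyword and the category keyword for a link, or nothing if unknown. */
    static List<String> keywordsForLink(String href) {
        // Ignore any query string or fragment after the path.
        String path = href.split("[?#]", 2)[0];
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return List.of();
        }
        String ending = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        String category = CATEGORY_BY_ENDING.get(ending);
        if (category == null) {
            return List.of(); // ordinary pages such as .html produce no file keywords here
        }
        return List.of("file:" + ending, "file:" + category);
    }

    public static void main(String[] args) {
        System.out.println(keywordsForLink("file.zip"));
        System.out.println(keywordsForLink("about.html"));
    }
}
```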

Commit: a9f7b4c4

Bug fixes

Metadata corruption

As a workaround for the limitations of the Java language, document metadata is encoded through explicit bit twiddling. It’s basically a manual implementation of a C struct on top of a 64-bit long. This is a great performance improvement and allows for very compact storage of the metadata, but the approach is also notoriously error prone and difficult to do in a safe way. It’s basically the programming equivalent of running with scissors.

A bug crept in where parts of the document metadata were garbled. This made it impossible to search by year, and also broke the ‘blog’ and ‘vintage’ filters, and may also have deteriorated the search result quality a bit.

The bug wasn’t directly caused by the bit twiddling, but by mispopulating the fields in a constructor. It’s a fairly trivial error, but it was hard to detect since it was not immediately obvious that the data was corrupted given the limited visibility into the “struct”, and reproducing the error in a test proved difficult since the test used the constructor correctly.
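For readers unfamiliar with the technique, here is a minimal sketch of a "struct in a long". The field names, widths and positions are hypothetical and chosen for the example; they are not the project's actual metadata layout.

```java
// Hypothetical example class, not part of Marginalia.
final class DocumentMetadataSketch {
    // Assumed layout, lowest bits first: 8 bits of flags, 16 bits of year, 8 bits of quality.
    private static final int FLAGS_SHIFT = 0;
    private static final int YEAR_SHIFT = 8;
    private static final int QUALITY_SHIFT = 24;
    private static final long FLAGS_MASK = 0xFFL;
    private static final long YEAR_MASK = 0xFFFFL;
    private static final long QUALITY_MASK = 0xFFL;

    /** Packs the three fields into one long; values too wide for their field are truncated. */
    static long encode(int flags, int year, int quality) {
        return ((flags & FLAGS_MASK) << FLAGS_SHIFT)
                | ((year & YEAR_MASK) << YEAR_SHIFT)
                | ((quality & QUALITY_MASK) << QUALITY_SHIFT);
    }

    static int flags(long encoded) {
        return (int) ((encoded >>> FLAGS_SHIFT) & FLAGS_MASK);
    }

    static int year(long encoded) {
        return (int) ((encoded >>> YEAR_SHIFT) & YEAR_MASK);
    }

    static int quality(long encoded) {
        return (int) ((encoded >>> QUALITY_SHIFT) & QUALITY_MASK);
    }

    public static void main(String[] args) {
        long metadata = encode(0b101, 2023, 12);
        System.out.println(flags(metadata) + " " + year(metadata) + " " + quality(metadata));

        // Arguments in the wrong order still compile, since every field is an int.
        long swapped = encode(2023, 0b101, 12);
        System.out.println(flags(swapped) + " " + year(swapped) + " " + quality(swapped));
    }
}
```

The second call in main hints at why mispopulated fields are easy to miss: nothing fails, the values are simply wrong, and they are only visible by decoding the long again.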

Despite testing on a pre-production environment, the bug was not discovered until it was deployed to production. If anything, I think it highlights a need for finding better testing strategies. This functionality is fairly smeared out over the code path, difficult to isolate, and it’s often not immediately apparent when it’s broken. All this makes it a continuous struggle to test in a systematic way. In general it’s very hard to test this sort of logic, as it requires a large and relatively realistic corpus of data to test against, which makes isolating behavior harder. The outcome is also never clearly right or wrong, but a matter of this-feels-right or this-seems-wrong.

Commit: 2ab26f37

Publish Date Detection

The order of the heuristics in the publish date detection has been improved to reduce the number of false positives. The support for JSON-LD has also been improved to cover additional cases.

Marginalia uses a long list of different heuristics to try to detect the publish date of a document. It was previously assumed that HTML5’s <time[pubdate="pubdate"]> element would generally contain a valid publish date for the current document, but this is not always the case, as some blogging platforms also include <article>-tags, including <time> for snippets of other articles. The heuristics have been reordered to try to detect the date from other sources first, and then fall back to the <time> element as one of the less reliable heuristics.

Commit: 619fb8ba

Response cache for the API service to help misconfigured clients

It’s been a long-standing problem that some misconfigured API consumers spam the API endpoint with the same query multiple times in a row, very rapidly consuming the rate limit. A cache has been added before the rate limit that will return the same result for the same query within a short time window “for free”.
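A minimal sketch of such a short-lived cache is given below. The class name, the injected clock and the backend function standing in for "rate limiter plus search" are assumptions made for the example, and eviction of stale entries is left out for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical example class, not part of Marginalia.
final class ShortLivedCacheSketch {
    private record Entry(String response, long expiresAtMillis) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so the time window is easy to test

    ShortLivedCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Serves a repeated query from the cache; only misses reach the rate-limited backend. */
    String get(String query, Function<String, String> rateLimitedBackend) {
        long now = clock.getAsLong();
        Entry cached = entries.get(query);
        if (cached != null && cached.expiresAtMillis() > now) {
            return cached.response(); // "free": the rate limit is never consulted
        }
        String fresh = rateLimitedBackend.apply(query);
        entries.put(query, new Entry(fresh, now + ttlMillis));
        return fresh;
    }

    public static void main(String[] args) {
        long[] now = {0};
        int[] backendCalls = {0};
        ShortLivedCacheSketch cache = new ShortLivedCacheSketch(1_000, () -> now[0]);
        Function<String, String> backend = q -> {
            backendCalls[0]++;
            return "results for " + q;
        };
        cache.get("plato", backend);
        cache.get("plato", backend); // same query inside the window
        now[0] = 2_000;              // window has passed
        cache.get("plato", backend);
        System.out.println("backend calls: " + backendCalls[0]);
    }
}
```

In production code the clock would typically be System::currentTimeMillis.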

This was also a good opportunity to clean up the API service a bit and improve the test coverage.

Commit: 112f43b3

Minor Fixes

  • Stopgap fix for a bug in dealing with quote terms containing stop words. 6fae51a8
  • Fix data loading bug where domains with some IPv6 addresses would blow up. d42ab191
  • Fix bug where some synthetic keywords would fail to return results. df1850bd

Experiments That Never Made It

A wise man once said “it’s not R&D if you aren’t throwing away half your work”. Here are some of the experiments that didn’t make it into production.

A synthetic keyword for image filenames that look like they come out of a smartphone

Alongside the file keywords, an experiment was run with generating a synthetic keyword for image filenames that look like they come out of a smartphone, e.g. filenames with the format “IMG_nnnnnn.jpg”. While very easy to build, this turned out to be not very useful. The idea was scrapped.

LDA topic modeling

Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that’s often used to extract topics from a corpus of documents. The idea was to use this to offer additional ways of navigating the web. The idea was scrapped because the results were not useful enough. The main work involved porting the LDA implementation in Mallet from a very old style of Java to a modern one. Since this was a fairly large task, it was decided to keep the code around in a branch in case it could be useful for other purposes.

Performance-wise, it might be plausible to do something with LDA in the future. The branch with the patched Mallet code is available here.