Release Notes v2023.06.0

Posted: 2023-06-29

New Features

Generator keywords

To provide additional ways of selecting search results, a synthetic keyword has been added for the <meta name="generator" content="..."> tag. This is basically a vanity tag that is used by some HTML generators to advertise themselves, and it’s also common for hand-edited HTML to include this tag with a string like “vim” or “myself”, as a wink to human readers of the code.

The generator keywords have the form generator:value. For example, to search for websites made with Hugo, you can use generator:hugo. Generator categories have also been added as searchable keywords, for example generator:wiki, generator:forum, generator:docs.

These last keywords have been added as options in in the search engine’s filters.

a9a2960e d86e8522

Crawler support for sitemaps

To ensure the crawler is able to find all the pages of a website, while wasting minimal time and bandwidth on dead links, the crawler now supports the sitemap protocol. Implementing this support was relatively straightforward as a site map parser was already available within Crawler Commons, a library which is already used for parsing robots.txt files.

The crawler will look for a sitemap directive in robots.txt, and will also look for /sitemap.xml in the root of the server, as well as parse RSS and Atom feeds for links if they are found in the root document of the website.

ecc940e36

Crawler specialization for Lemmy, Discourse and Mediawiki

Some server software for larger websites have a lot of valid links, but also many links that are highly ephemeral (such a mastdon feed, or the index of a forum). To help the crawler only index the pages that don’t change that often, has specialized logic has been introduced for Lemmy, Discourse and Mediawiki.

This also saves processing power for the server, as these applications often have relatively expensive rendering logic.

This is a bit of an experiment. Implementing these specializations is relatively easy, and if it pans out it will be extended to other software.

ed373eef

Improved Site Info

The site information view has been improved to show better placeholder information for unknown domains, including a link to the git repository for submitting websites to be crawled.

a6a66c6d

Bug Fixes

Pub-date validation

The published date of a page is now validated against the plausible range of the HTML standard it’s written in. It’s impossible that a HTML5 document was written in 1997, and unlikely that a HTML2 document was written in 2021. 7326ba74

A bug was also discovered in the JSON+LD parser, that caused rare null pointer exceptions. This code is a bit of a hack and could definitely be cleaned up further. 21125206

Optimizations

The converter process, which extracts keywords and meta data from HTML documents, has been optimized to run about 20-25% faster. The crawler has also been modified to spend less effort on domains that historically have demonstrated to not have a lot of viable pages. As a result, crawling is twice as fast, processing takes about 24 hours instead of 60+ hours.

The converter optimization was achieved by replacing expensive string operations (like toLower()) with custom logic that doesn’t require allocation.

BigString

The BigString is an object for transparent storage of compressed strings in memory that enables the processor to work load the full contents of a website into memory at once, and then unpack each document as it’s being processed.

BigString was optimized to use fixed buffers. Allocating large arrays in Java is expensive, and the garbage collector has to work hard to clean up the mess. This introduces some lock contention, but it is still significantly faster than the previous version.

Another small speed-up is from using java.lang.String’s char[] constructors instead of byte[]-constructors, reducing unnecessary back-and-forth charset conversion.

Commit: e4372289

RDRPosTagger

The RDRPosTagger library, which does Part Of Speech tagging, already impressively fast, already aggressively modified to be faster, has been further optimized to be faster still, and its Java object tree design was replaced with flat integer arrays.

This was always an expensive operation, but now it’s much faster. The speed-up comes from replacing string comparisons with integer comparisons, as well as re-ordering the data in memory to reduce the cache thrashing that is typically associated with walking a branching tree structure. Part of this is from eliminating Java object headers.

Commit: 186a02