To provide additional ways of selecting search results, a synthetic keyword
has been added for the
<meta name="generator" content="..."> tag. This is basically a vanity
tag that is used by some HTML generators to advertise themselves, and it’s also
common for hand-edited HTML to include this tag with a string like “vim” or “myself”,
as a wink to human readers of the code.
The generator keywords have the form
generator:value. For example, to search for websites made
with Hugo, you can use
generator:hugo. Generator categories have also been added as searchable
keywords, for example
These last keywords have been added as options in in the search engine’s filters.
Crawler support for sitemaps
To ensure the crawler is able to find all the pages of a website, while
wasting minimal time and bandwidth on dead links, the crawler now supports the
sitemap protocol. Implementing this support
was relatively straightforward as a site map parser was already available within
Crawler Commons, a library
which is already used for parsing
The crawler will look for a sitemap directive in robots.txt, and will also look for
/sitemap.xml in the root of the server,
as well as parse RSS and Atom feeds for links if they are found in the root document of the website.
Crawler specialization for Lemmy, Discourse and Mediawiki
Some server software for larger websites have a lot of valid links, but also many links that are highly ephemeral (such a mastdon feed, or the index of a forum). To help the crawler only index the pages that don’t change that often, has specialized logic has been introduced for Lemmy, Discourse and Mediawiki.
This also saves processing power for the server, as these applications often have relatively expensive rendering logic.
This is a bit of an experiment. Implementing these specializations is relatively easy, and if it pans out it will be extended to other software.
Improved Site Info
The site information view has been improved to show better placeholder information for unknown domains, including a link to the git repository for submitting websites to be crawled.
The published date of a page is now validated against the plausible range of the HTML standard it’s written in. It’s impossible that a HTML5 document was written in 1997, and unlikely that a HTML2 document was written in 2021. 7326ba74
A bug was also discovered in the JSON+LD parser, that caused rare null pointer exceptions. This code is a bit of a hack and could definitely be cleaned up further. 21125206
The converter process, which extracts keywords and meta data from HTML documents, has been optimized to run about 20-25% faster. The crawler has also been modified to spend less effort on domains that historically have demonstrated to not have a lot of viable pages. As a result, crawling is twice as fast, processing takes about 24 hours instead of 60+ hours.
The converter optimization was achieved by replacing expensive string operations (like toLower()) with custom logic that doesn’t require allocation.
BigString is an object for transparent storage of compressed strings in memory
that enables the processor to work load the full contents of a website into memory at once, and then unpack each document as
it’s being processed.
BigString was optimized to use fixed buffers. Allocating large arrays in Java is expensive, and the garbage collector has
to work hard to clean up the mess. This introduces some lock contention, but it is still significantly faster than the
Another small speed-up is from using
char constructors instead of
unnecessary back-and-forth charset conversion.
The RDRPosTagger library, which does Part Of Speech tagging, already impressively fast, already aggressively modified to be faster, has been further optimized to be faster still, and its Java object tree design was replaced with flat integer arrays.
This was always an expensive operation, but now it’s much faster. The speed-up comes from replacing string comparisons with integer comparisons, as well as re-ordering the data in memory to reduce the cache thrashing that is typically associated with walking a branching tree structure. Part of this is from eliminating Java object headers.