This is a mostly technical release. It takes the index from 106M to 164M documents.
Zero Downtime Upgrades and halved memory consumption
The initial focus of the release was to address the sometimes lengthy downtimes that have plagued the project when loading a new index.
There is a somewhat lengthy write-up about this here; but the short version is that this was very successful and a drastic optimization, removed not only the needed downtime, but added neat new features and slashed the RAM requirements in half!
A annoyance fueled optimization methodology also slashed the index construction time in half at later point. Pull Request #52.
Java 21 PREVIEW
There were unintended consequences of the changes above, and the system needed an upgrade to Java 21 with enabled preview features. This has to do with off-heap memory lifecycle management. Up until Java 21 (preview), Java offered no way of explicitly closing off-heap memory, including memory mapped files. This caused the filesystem to hold onto references to the mapped data even after the associated files had been deleted, which vastly increased the amount of disk required to construct the index using the new method of recursive merging.
A positive side-effect of this is that using the new foreign memory API is a lot faster than Java’s old byte buffers, since the size can exceed 2 GB without userspace paging.
There are some stray vestigial remains of the old way of memory mapping files still lingering, to be rooted out in the next release.
Writeup: https://www.marginalia.nu/log/89-disk-usage-mystery/ Commits: d0aa75
Parquet files in converter and crawl specs
For a long time, compressed json files have been used to store much of the unprocessed and half-processed crawl data. This is very easy to use, but tends to be a bit awkward when you have millions of the files. It’s also not the most performant format in the world, since e.g. it doesn’t announce how long a string is upfront, you need to just keep reading to find out.
Parquet is a clever format popular in big data applications that largely solves these problems. Parquet in Java is not so great, however, since the only(?) implementation is deeply tied to the Hadoop ecosystem, and separating the two isn’t entirely trivial.
Thankfully there’s a helpful library called parquet-floor that tries to do this. It is a bit on the basic side, but its technological and biological distinctiveness was added to our own, and now it does what’s necessary.
The biggest benefit of this is that it’s much easier to interact with. Previously to inspect some processed data, you’d need to use some combination of unix command line tools and jq to get at it. With parquet, much more convenient tools are available. The entire dataset can be queried with SQL using for example DuckDB!
The parquetification of the project is still ongoing. The crawl data needs to be addressed too, but this is in a future release.
Improved sideloading support
There’s been kinda-sorta support for sideloading encyclopedia data from Wikipedia already, but it’s been pretty shaky. This release introduces the ability to sideload not only Wikipedia data, but also Stackexchange dumps and just directories with HTML for e.g. javadocs.
These will not go live in the production index until it can be figured out how to make such large popular websites not show up as the first result for every query.
I wrote a rough documentation for how to do this.
- A concurrency bug was casuing some of the position data to be corrupted. This had a fairly adverse effect on the quality of the search results, causing bad matches to be promoted and good matches to be dismissed as irrelevant. a433bb