This is primarily a bugfix release that primarily addresses some issues with a metadata corruption that was introduced in the previous release.
To provide more tools for navigating the web, the converter now generates synthetic keywords for documents that link to files on the same server based on their file ending.
If the file contains a link such as
then he document will be tagged with the keyword
file:zip as well as
The category keywords are
Since earlier, the converter has also generated keywords based on filenames, even if the filename itself doesn’t appear in the visible portion of the document.
So in the example above,
file.zip would also be a relevant keyword for the document.
As a workaround for the limitations of the Java language, document metadata is encoded through explicit bit twiddling. It’s basically a manual implementation of a C struct on top a 64 bit long. This is a great performance improvement and allows for very compact storage of the metadata, but the approach is also notoriously error prone and difficult to do in a safe way. It’s basically the programming equivalen tof running with scissors.
A bug crept in where parts of the document metadata was garbled. This made it impossible to search by year, and also broke the ‘blog’ and ‘vintage’ filters, and may also have deteriorated the search result quality a bit.
The bug wasn’t directly caused by the bit twiddling, but by mispopulating the fields in a constructor. It’s a fairly trivial error, but it was hard to detect since it was not immediately obvious that the data was corrupted given the limited visibility into the “struct”, and reproducing the error in a test proved difficult since the test used the constructor correctly.
Despite testing on a pre-production environment, the bug was not discovered until it was deployed to production. If anything I think it highlights a need for finding better testing strategies. This functionality is fairly smeared out over the code path, the functionality is difficult to isolate and it’s often not immediately apparent when it’s broken, all this makes it a continuous struggle to test in a systematic way. In general it’s very hard to test this sort of logic, as it requires a large and relatively realistic corpus of data to test against which makes isolating behavior harder, and the outcome is also never clearly right or wrong, but a matter of this-feels-right or this-seems-wrong.
Publish Date Detection
The order of the heuristics in the publish date detection has been improved to reduce the number of false positives, the support for JSON+LD has also been improved to support additional cases.
Marginalia uses a long list of different heuristics to try to detect the publish date of a document. It was previously assumed that HTML5’s
<time[pubdate="pubdate"]> element would generally contain a valid publish date for the current document, but this is not always the case, as some blogging platforms also include
<time> for snippets of other articles. The heuristics have been reordered to try to detect the date from other sources first, and then fall back to the
<time> element as one of the less reliable heuristics.
Response cache for the API service to help misconfigured clients
It’s been a long standing problem that some misconfigured API consumers spam the API endpoint with the same query multiple times in a row, very rapidly consuming the rate limit. A cache has been added before the rate limit that will return the same result for the same query within a short time window “for free”.
This was also a good opportunity to clean up the API service a bit and improve the test coverage.
- Stopgap fix for a bug in dealing with quote terms containing stop words. 6fae51a8
- Fix data loading bug where domains with some IPv6 addresses would blow up. d42ab191
- Fix bug where some synthetic keywords would fail to return results. df1850bd
Experiments That Never Made It
A wise man once said “it’s not R&D if you aren’t throwing away half your work”. Here are some of the experiments that didn’t make it into production.
A synthetic keyword for image filenames that look like they come out of a smartphone
Alongside the file keywords, an experiment was run with generating a synthetic keyword for image filenames that look like they come out of a smartphone, e.g. filenames with the format “IMG_nnnnnn.jpg”. While very easy to build, this turned out to be not very useful. The idea was scrapped.
LDA topic modeling
Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that’s often used to extract topics from a corpus of documents. The idea was to use this to offer additional ways of navigating the web. The idea was scrapped because the results were not quite useful. The main work involved porting the LDA implementation in Mallet from a very old style of Java to a modern one. Since this was a fairly large task, it was decided to keep the code around in a branch in case it could be useful for other purposes.
Performance wise it might be plausible to do something with LDA in the future. The branch with the patched Mallet code is available here.