Release Notes

These are release notes for Marginalia Search. These are also mirrored on 🌎 GitHub

See also 📁 marginalia-search and 📁 status.

Release Notes v2024.10.0

Posted: 2024-10-14

This is a new major release of Marginalia Search, mostly leaning toward the technical side.

Emphasis has been on ensuring the search engine has the technical capabilities to serve more types of queries, especially longer queries which it previously did not handle very well.

Effort has also been put toward making sure it’s possible to install and run outside of docker. There is still some work to be done to streamline the installation process, but we’re getting there.

Search Improvements

Query Parsing

The query parsing and evaluation model has been re-written from scratch, as the original model was very flaky and hard to maintain. The new model performs query segmentation, and introduces a graph based model for bag-of-words query evaluation. Writeup, PR#89

Position Index and Phrase Matching

Full phrase matching has been implemented, allowing not just “quoted search terms” to function better, but the result ranking to consider the word order in the query. This required the introduction of high accuracy keyword position data, and a re-write of the index. Writeup, PR#99
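As a rough illustration of why phrase matching needs accurate position data, here is a minimal sketch of a positional phrase check. It is not the actual index code: the class and method names are made up, and it assumes each query term comes with a sorted array of word positions for a single document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Marginalia index implementation
final class PhraseMatchSketch {
    /**
     * positions.get(i) holds the sorted word positions of the i-th query term
     * in one document. The phrase matches if some start position p exists
     * such that term i occurs at position p + i for every i.
     */
    static boolean containsPhrase(List<int[]> positions) {
        if (positions.isEmpty()) return false;
        for (int start : positions.get(0)) {
            boolean match = true;
            for (int i = 1; i < positions.size() && match; i++) {
                // Binary search works because each position array is sorted
                match = Arrays.binarySearch(positions.get(i), start + i) >= 0;
            }
            if (match) return true;
        }
        return false;
    }
}
```

The same position data can in principle feed into ranking as well, for instance by rewarding documents where the terms appear close together and in query order.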

The constraints on valid keywords have also been relaxed, meaning it should be easier to search for e.g. program code, accepting tokens such as “$var” or “strlen()”.

Slop

As part of this change, ephemeral data is now stored in the built-for-Marginalia Slop format, with a new Marginalia-adjacent library for reading and writing this data. As a result of this, and optimizations surrounding it, index construction time is now reduced by something like 80% in production.

Since this is a bespoke format, portability is ensured via the format being self-documenting and simple to a fault, with the explicit design goal that anyone should be able to parse it by just looking at the file names, which may look like e.g.

cities.0.dat.s8[].gz
cities.0.dat-len.varint.bin
population.0.dat.s32le.bin
average-age.0.dat.f64le.gz
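To show how far the file names alone get you, here is a small sketch that splits such a name into parts. It does not use the Slop library. The interpretation of the dot-separated fields (column name, page number, function, data type, storage or compression) is an assumption read off the examples above, and the class and record names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; field meanings are inferred from the example names
final class SlopFileNameSketch {
    record Parsed(String column, int page, String function, String type, String storage) {}

    static Parsed parse(String fileName) {
        String[] parts = fileName.split("\\.");
        if (parts.length < 5) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        int n = parts.length;
        // Read the last four fields from the right, so a column name containing dots still works
        String column = String.join(".", Arrays.copyOfRange(parts, 0, n - 4));
        return new Parsed(column, Integer.parseInt(parts[n - 4]), parts[n - 3], parts[n - 2], parts[n - 1]);
    }
}
```

Under that reading, cities.0.dat.s8[].gz would be page 0 of the column "cities", holding gzip-compressed arrays of signed 8-bit values.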

‘search.marginalia.nu’ Application

Capture Function

A new screenshot capture function has been added. Screenshots are fetched/refreshed by page views on the site:-viewer, whether by human browsing or GoogleBot’s rambling. Request throttling and re-fetch timers are in place to ensure this can’t be used for abuse. This ensures that frequently viewed sites are kept up to date, and has helped the screenshot library grow quite considerably. PR#120

Pagination

Pagination has been added for the search results. This is in a sense fake pagination, made possible because each index node fetches the total number of requested results, but only the best results across all nodes are selected in the query service, and the pagination is done within this set. It’s unlikely paging beyond page 8 or 9 is going to be helpful anyway. PR#119
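The scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the query service code: the Result record, the idea that a higher score is better, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of paginating within the merged best-results set
final class FakePaginationSketch {
    record Result(String url, double score) {}

    /** Merge per-node results, keep the best `limit`, then cut out one page (pages are 1-based). */
    static List<Result> page(List<List<Result>> perNode, int limit, int pageSize, int page) {
        List<Result> merged = new ArrayList<>();
        perNode.forEach(merged::addAll);
        // Assumption: higher score means a better result
        merged.sort(Comparator.comparingDouble(Result::score).reversed());
        List<Result> best = merged.subList(0, Math.min(limit, merged.size()));
        int from = Math.min((page - 1) * pageSize, best.size());
        int to = Math.min(from + pageSize, best.size());
        return List.copyOf(best.subList(from, to));
    }
}
```

Because every page is cut from the same bounded set, paging past the end simply yields an empty list rather than fetching deeper results from the index nodes.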

Base application

Added a new domain management view that permits inspection and index node assignment on a domain level, as well as the easy addition of new domains to be crawled.

Screenshot of the new domain management view

Because we can now manage the domains table directly, the crawl spec abstraction has been retired. This was historically used to specify which domains to be crawled, but was clunky and difficult to interact with.

Architecture

The project has been migrated to JDK 22.

A new dependence on zookeeper has been introduced, to let the project self-manage routing and port mapping via a new service discovery registry. Zookeeper is a key value store that is well suited to distributed state management and configuration keeping.

This is to liberate the project from being dependent on docker, and as a result the system can now, again, run on a bare metal Linux installation, without having to go through the rigmarole of manually mapping each service to a port and IP-address.

A side effect is that the code can be configured to run in any number of configurations. Users with only one machine may not want a bunch of small services, so they can in theory assemble the entire application as one service instead.

As part of this, the old client-service library was overhauled, and a lot of questionable technical choices were expunged. The services now all talk gRPC instead of HTTP.

PR#81 PR#90 PR#92

Misc

Findings from UX and Security assessments have been addressed. These were mostly small things. PR#93 PR#101

Error recovery and logging has been improved for the “download sample crawl data” actor, as it was previously a bit opaque with what it was actually doing.

The CSS was given a bit of an overhaul and dark mode was revived, work by @samstorment PR#94 PR#98

A new actor has been added that periodically polls certain links for outbound links (e.g. hacker news), and adds them to the list of domains to be crawled, to help automatically discover interesting links.

Work In Progress

The Crawler now attempts to fetch favicons. The aim is to be able to present them along with the search results in the future, but for now we’re just gathering them.

Content-type probing via HEAD requests is disabled for now, while the use of the Accept header is evaluated instead. The crawler would previously attempt to, depending on how much the path looked like it might be some file format we’re not interested in, probe URL endpoints with a HEAD first and fetch the content-type. It’s questionable whether this was ever a good idea.

Notable Bugfixes

Fixed bug that caused some domains to fail to fully crawl. The exact circumstances are a bit flaky, but in some cases, the crawler would halt at the first document, and fail to load links from it.
The crawler was not properly stripping the W/-prefix from weak E-tags, when making conditional requests, causing unnecessary traffic. This has been corrected.
Fixed bug where the summarizer would pick up the contents of <noscript> tags. This caused escaped HTML to sometimes show up in the document summaries, most commonly goat counter’s code.
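For the weak E-tag fix in the list above, the normalization step could look something like this. It is a minimal sketch with a hypothetical helper name, not the crawler's actual code.

```java
// Hypothetical helper illustrating the W/ prefix stripping described above
final class WeakEtagSketch {
    /** Drops a leading W/ weak-validator marker, leaving the quoted opaque tag untouched. */
    static String stripWeakPrefix(String etag) {
        if (etag == null) return null;
        String trimmed = etag.trim();
        return trimmed.startsWith("W/") ? trimmed.substring(2) : trimmed;
    }
}
```

A stored value of W/"abc123" would then be sent as "abc123" in the conditional request, while strong E-tags pass through unchanged.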

Bugs found in dependencies

Identified the cause of transient instabilities during index construction as being caused by a JVM compiler error. This should be corrected in the latest version of GraalVM JDK 22. Writeup

Release Notes v2024.01.0

Posted: 2024-01-24

This is a major new release of the search engine software, corresponding to nearly four months of changes. In these months, the state of the code hasn’t been stable enough for a new release, but it’s now been brought to a stable point.

Release Highlights:

The installation procedure has been cleaned up.
It’s now possible to run the search engine in a white label/bare-bones mode, without any of the Marginalia Search branding or logic.
The Marginalia Search web interface has been overhauled. The site-info page has especially been given a large upgrade.
The search engine can use anchor texts to supplement keywords.
The search engine can use multiple index shards.
The operations GUI has been overhauled.
An operations manual has been written.
The crawler can now resume crawls in progress thanks to intermediate WARCs.
The search engine can import several formats without external pre-processing.
The Academia filter has been improved
The Recipe filter has been improved
The system now penalizes documents that have obvious hallmarks of being written by ChatGPT in its quality assessment.

Other technical changes:

Several bugfixes in the ranking algorithm have improved search result precision
The domain link graph has moved out of the database, improving processing time
The system can be configured to automatically perform db migrations
Ranking algorithm improvements

Known Limitations:

Service discovery is currently a bit limited, making it only possible to run the system within docker (or similar) at this point, as host names and ports are not configurable. This is not intended to be a permanent state of affairs.
The Marginalia Search website has lost its dark mode.
There might be an off-heap resource leak in the crawler. It’s primarily a problem with very long crawl runs.

Barebones Install

The system can be configured to run in a barebones mode, which only starts the minimal number of services necessary to serve search queries. A HTTP/JSON interface is provided to enable the search engine to act as a search backend.

There aren’t really any good “off the shelf” ways of running your own internet search engine. Marginalia Barebones wants to address that. Out of the box it offers both a traditional crawling-based workflow, as well as sideloading workflows for various formats, such as WARC or just directory trees, if you’d rather crawl with wget.

As this is the first time it’s been possible to run the search engine in this fashion, it’s at this stage not very configurable, and a lot of the opinionated takes of the Marginalia search engine are hard coded in. These are intended to be relaxed and made more configurable in upcoming releases.

The barebones install mode is made possible in part due to an overhauled installation procedure. A new install script has been written offering a basic install wizard. Configuration has also been broken out into mostly being a single properties file.

A video demoing the install and basic operations of this is available here:

Overhauled Web Interface

The Marginalia Search web interface has been overhauled. The old card-based design didn’t really work out, and has been replaced with something a bit more traditional. The filters have moved out of a dropdown next to the search query and into a sidebar, making them more visible.

Screenshot of the new search results page

The site info view has been significantly overhauled, integrating several discovery/exploration features. Experimental RSS support is added, as well as

The site info view also presents information about the site’s IP and ASN, both of which are searchable. You can also include (or exclude) autonomous systems by name in the search query, e.g. as:amazon.

Among other features, site crosslinks are made explorable.

Anchor Text Support

The search engine can now use anchor texts to supplement the keywords in a document. This has had a very large positive impact on the search result quality! An in-depth write-up is available going over the details of this change.

Marginalia Search makes its anchor text data freely available, along with the other data exports.

PR 59

Multiple index shard support

The system now has support for multiple backing indices. This permits a basic distributed set-up, but can also e.g. allow pinning different parts of the index to specific physical disks. There is a write-up going over the details of this change.

Some of the internal APIs have also been migrated off REST to GRPC. This is an ongoing process, and several more APIs are slated for migration in future releases.

PR 55

New Operations GUI

This concludes the final polishing pass on the operations GUI. The GUI offers control over all of the operations of the search engine, as well as monitoring and configuration.

Screenshot of the control GUI, crawler running

Most operations are now available via user-friendly guides with inline documentation.

A manual is also available at https://docs.marginalia.nu/, explaining the concepts in depth.

Screenshot of the control GUI, export wizard

Commits

Crawler Modifications

The crawler can now resume in-progress crawls, as these are now stored in the WARC format while underway. Upon completion of a domain, the WARC is converted to parquet. The system can be configured to keep the WARCs for archival purposes, but this is not the default behavior as WARC files are very large, even when compressed.

Previously the crawler would restart crawling a domain from scratch if it crashed or was restarted somehow. Thanks to this change, this is no longer the case.

The crawl data is no longer stored in compressed JSON, as before, but in parquet. This change is still not 100% complete. This is due to the needs of data migration. To avoid data loss, it needs to be done in multiple phases.

In implementing this, a few inefficiencies in dealing with very large crawl data were discovered in the subsequent processing steps. A special processing mode was implemented for dealing with extremely large domains. This runs with a simplified processing logic, but is also largely not bounded by RAM at all.

Improved Sideloading Support

The previously available sideloading support for stackexchange and wikipedia-data has been polished, and no longer needs 3rd party tools to pre-process the data. It’s all done automatically, and is available from an easy guide in the control GUI.

The index nodes have been given upload-directories, to make it easier to figure out where to put the sideload data. The contents of these directories are visible from the control GUI.

Screenshot of one of the new sideloading wizards

(also a few others)

New Search Keywords:

as:ASN – search result must have an IP belonging to ASN
as:asn-name – search result must have an AS with an org information containing the string
ip:country – search result must be geolocated in country
special:academia – includes only results with a tld like .edu, .ac.uk, .ac.jp, etc.
count>10 – keyword must match at least 10 results on domain (this will likely be removed later)

Release Notes v2023.10.0

Posted: 2023-10-07

This is a mostly technical release. It takes the index from 106M to 164M documents.

Zero Downtime Upgrades and halved memory consumption

The initial focus of the release was to address the sometimes lengthy downtimes that have plagued the project when loading a new index.

There is a somewhat lengthy write-up about this here; but the short version is that this was very successful and a drastic optimization, which not only removed the need for downtime, but also added neat new features and slashed the RAM requirements in half!

Pull Request #42

An annoyance-fueled optimization methodology also slashed the index construction time in half at a later point. Pull Request #52.

Java 21 PREVIEW

There were unintended consequences of the changes above, and the system needed an upgrade to Java 21 with enabled preview features. This has to do with off-heap memory lifecycle management. Up until Java 21 (preview), Java offered no way of explicitly closing off-heap memory, including memory mapped files. This caused the filesystem to hold onto references to the mapped data even after the associated files had been deleted, which vastly increased the amount of disk required to construct the index using the new method of recursive merging.

A positive side-effect of this is that using the new foreign memory API is a lot faster than Java’s old byte buffers, since the size can exceed 2 GB without userspace paging.

There are some stray vestigial remains of the old way of memory mapping files still lingering, to be rooted out in the next release.

Writeup: https://www.marginalia.nu/log/89-disk-usage-mystery/ Commits: d0aa75

Parquet files in converter and crawl specs

For a long time, compressed json files have been used to store much of the unprocessed and half-processed crawl data. This is very easy to use, but tends to be a bit awkward when you have millions of the files. It’s also not the most performant format in the world, since e.g. it doesn’t announce how long a string is upfront, you need to just keep reading to find out.

Parquet is a clever format popular in big data applications that largely solves these problems. Parquet in Java is not so great, however, since the only(?) implementation is deeply tied to the Hadoop ecosystem, and separating the two isn’t entirely trivial.

Thankfully there’s a helpful library called parquet-floor that tries to do this. It is a bit on the basic side, but its technological and biological distinctiveness was added to our own, and now it does what’s necessary.

The biggest benefit of this is that it’s much easier to interact with. Previously to inspect some processed data, you’d need to use some combination of unix command line tools and jq to get at it. With parquet, much more convenient tools are available. The entire dataset can be queried with SQL using for example DuckDB!

The parquetification of the project is still ongoing. The crawl data needs to be addressed too, but this is in a future release.

Pull Request #48

Improved sideloading support

There’s been kinda-sorta support for sideloading encyclopedia data from Wikipedia already, but it’s been pretty shaky. This release introduces the ability to sideload not only Wikipedia data, but also Stackexchange dumps and just directories with HTML for e.g. javadocs.

These will not go live in the production index until it can be figured out how to make such large popular websites not show up as the first result for every query.

I wrote some rough documentation for how to do this.

Commits: 70aa04 5b0a6d 6bbf40 98bcdf 9b385e 5e5aaf

Notable bugfixes:

A concurrency bug was causing some of the position data to be corrupted. This had a fairly adverse effect on the quality of the search results, causing bad matches to be promoted and good matches to be dismissed as irrelevant. a433bb

Release Notes v2023.08.0

Posted: 2023-08-22

This release mainly aims to improve the operational side of the search engine, with an emphasis on automating tedious manual processes and optimizing crawling and data processing to use fewer resources.

Conventionally I try to link to relevant commits in these notes, but some of the changes were so sweeping and protracted it was hard to narrow it down to individual commits; in those cases I’ll link to the relevant code instead.

New Features

Better Feature Detection and Blog Filter

The FeatureExtractor which analyzes websites’ HTML for things like advertisements and tracking code has been improved a fair bit. Website generator detection was also improved in this process.

Curated via a publicly available set of domains, the new filter selects for blogs and similar websites. These domains are also given slightly different processing rules on the assumption they are blogs.

Commit: cbbf60

Crawler - Smart Recrawling

The crawler has been enhanced to be able to make use of older crawl data to do conditional fetching via the ETag and Last-Modified headers. This saves bandwidth and processing power for the server.

Code: CrawlDataReference CrawlerRetriever$recrawl
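For readers unfamiliar with the mechanism, a request that reuses validators from an older crawl can be built like this. This is a minimal sketch using the JDK's java.net.http API rather than the crawler's own HTTP stack, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch; the crawler's actual recrawl logic lives in the classes linked above
final class ConditionalFetchSketch {
    /** Builds a GET that lets the server answer 304 Not Modified if nothing changed. */
    static HttpRequest conditionalGet(URI uri, String etag, String lastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        // Validators come from the previous crawl of the same URL; either may be missing
        if (etag != null) builder.header("If-None-Match", etag);
        if (lastModified != null) builder.header("If-Modified-Since", lastModified);
        return builder.build();
    }
}
```

If the server responds with status 304, the body from the older crawl data can be reused instead of being downloaded and rendered again.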

Operator’s GUI

A new user interface has been built for operating Marginalia Search. It was previously operated via command line instructions, direct SQL commands, and the like. This manual operation was both tedious and error prone.

The UI allows basic administrative operations such as dealing with domain complaints, creating API keys, blocking websites; but also has abstractions for triggering crawls and managing the heavier processes in the system.

Code: control-service

Message Queue / Actor Abstraction

To enable automation of the system several new abstractions have been introduced, including a message queue and an Actor abstraction on top of that. See /log/85-mq_sm_actor_ui for a detailed break down of this functionality.

Code: message-queue

Better language identification

Instead of using a naive home-made language identification algorithm, the fasttext library (via jfasttext) was used. It is much better at language identification, and as the name implies, pretty fast albeit not quite as fast when you run it via JNI. FastText is a very pleasant classifier library that will likely find other additional uses in the project in the future.

Commit: 46d761

Optimizations

There have been a lot of optimizations of the processes, these are just some of the bigger ones.

Converter - Reduced Memory Footprint and Increased Speed

The converter was keeping more items in memory than was necessary due to loading its input data up front by domain, and then iterating over each item. Streaming processing was introduced instead, which reduced the memory footprint so much that several previous memory optimizations such as transparent string compression became unnecessary, which in turn sped up the process a fair bit.

Commits: 507f26

Converter/Loader - Side Loading (experimental)

Some websites such as for example Wikipedia or Stack Overflow are too big to exhaustively crawl in a traditional sense, but they have data dumps available. Experimental support for side-loading Wikipedia was built.

This functionality is very immature.

To permit side loading large domains, the loader was also modified to reduce the amount of data it keeps in memory while loading. This was mainly accomplished by re-arranging the order the loading instructions are written by the converter.

Commits: f11103

Other Changes

Better feature detection and a new approach to advertisement filtering

A bit of effort was spent trying to figure out the modern advertisement ecosystem, and lessons learned were incorporated into the feature detection logic of the search engine.

A major shift in operation is that instead of looking for ads, the search engine will now look for ad-tech tracking. This is much easier to do with the sort of static analysis Marginalia does, and probably what you want anyway. It turns out you can’t really run ads with no tracking without exposing yourself to click fraud, and you need to be pretty aggressive with how you do the tracking in a way that’s not easy to hide.

Commits: 0f9b90 …

Bugfix: Loader Stop Bug

There was a fairly trivial error in the loader process where it would stop loading documents from a website if any of their URLs were for some reason not loaded, typically because they were too long. This primarily affected large wordpress-style websites.

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    return;
}

should have been

if (urlId <= 0) {
    logger.warn("Failed to resolve ID for URL {}", doc.url());
    continue;
}

Fixing the bug had the unanticipated side-effect of severely decreasing the average quality of the websites in the index, since large wordpress-style websites are often not very good.

To mitigate the quality problem, the ranking algorithm was modified to penalize large websites with kebab-case urls. This was a relatively invasive change that meant routing additional feature bits into the forward index. An upside of this is that the index has more information available for ranking websites, and it’s possible to e.g. apply a penalty to sites with adtech or likely affiliate links on them.

Commits: 4598c7 704de5

Bugfix: Crash on excluding keywords that are not known by the search engine

A rare bug was found that caused an error when excluding documents that contain a keyword where the keyword was not known to the search engine. This was due to a piece of debug logging that wouldn’t even have printed, yet still managed to trigger an index out of bounds error.

Commits: cb55c7
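The notes don't show the offending code, so the following is only a guess at the general class of bug: a log argument is evaluated before the logger checks whether the level is enabled. The sketch uses java.util.logging to stay self-contained, and all names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of eager log-argument evaluation, not the actual fix
final class EagerLoggingSketch {
    private static final Logger logger = Logger.getLogger("sketch");
    static { logger.setLevel(Level.INFO); } // FINE (debug-level) messages are disabled

    /** Throws for an unknown keyword id such as -1, even though nothing is ever printed. */
    static void eager(String[] keywords, int id) {
        // The message string, including keywords[id], is built before the level check
        logger.fine("Excluding keyword " + keywords[id]);
    }

    /** The supplier is only invoked when FINE is enabled, so the lookup never runs here. */
    static void lazy(String[] keywords, int id) {
        logger.fine(() -> "Excluding keyword " + keywords[id]);
    }
}
```

The same pitfall exists with parameterized loggers: passing array[index] as an argument still performs the array access, whether or not the message is printed.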

Upgraded dependencies – expected JDK version increased to 18+

Dependencies with security vulnerabilities were upgraded, which introduced a strange interaction with JDK 17, the previous default version, where non-ASCII letters would become garbled when reading crawl data. The exact cause of this is unknown, but a solution that works is to use JDK 18+ instead.

Flyway Migrations

Database migrations are now managed via Flyway. This eliminates manual database upgrades.

Commits: 58556a

Release Notes v2023.06.0

Posted: 2023-06-29

New Features

Generator keywords

To provide additional ways of selecting search results, a synthetic keyword has been added for the <meta name="generator" content="..."> tag. This is basically a vanity tag that is used by some HTML generators to advertise themselves, and it’s also common for hand-edited HTML to include this tag with a string like “vim” or “myself”, as a wink to human readers of the code.

The generator keywords have the form generator:value. For example, to search for websites made with Hugo, you can use generator:hugo. Generator categories have also been added as searchable keywords, for example generator:wiki, generator:forum, generator:docs.

These last keywords have been added as options in the search engine’s filters.
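A minimal sketch of how a generator keyword could be derived from the tag is shown below. The real converter works on parsed HTML; this version uses a regular expression that only matches the attribute order shown above, and the normalization (first word, lower case) is an assumption chosen so that "Hugo 0.111.3" becomes generator:hugo.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; normalization rules are assumed, not taken from the converter
final class GeneratorKeywordSketch {
    private static final Pattern GENERATOR = Pattern.compile(
            "<meta\\s+name=\"generator\"\\s+content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static Optional<String> keyword(String html) {
        Matcher m = GENERATOR.matcher(html);
        if (!m.find()) return Optional.empty();
        // Keep only the first word, dropping version numbers and the like
        String firstWord = m.group(1).trim().split("\\s+")[0];
        if (firstWord.isEmpty()) return Optional.empty();
        return Optional.of("generator:" + firstWord.toLowerCase(Locale.ROOT));
    }
}
```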

a9a2960e d86e8522

Crawler support for sitemaps

To ensure the crawler is able to find all the pages of a website, while wasting minimal time and bandwidth on dead links, the crawler now supports the sitemap protocol. Implementing this support was relatively straightforward as a site map parser was already available within Crawler Commons, a library which is already used for parsing robots.txt files.

The crawler will look for a sitemap directive in robots.txt, and will also look for /sitemap.xml in the root of the server, as well as parse RSS and Atom feeds for links if they are found in the root document of the website.

ecc940e36
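The discovery step can be sketched as follows. The crawler itself relies on Crawler Commons for parsing; this hand-rolled version only illustrates where the candidate sitemap URLs come from, and its names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real crawler uses Crawler Commons for robots.txt and sitemaps
final class SitemapDiscoverySketch {
    /** Collects Sitemap: directives from robots.txt, plus /sitemap.xml in the server root. */
    static List<String> candidateSitemaps(String robotsTxt, String siteRoot) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String url = trimmed.substring(8).trim();
                if (!url.isEmpty() && !urls.contains(url)) urls.add(url);
            }
        }
        String wellKnown = siteRoot + "/sitemap.xml";
        if (!urls.contains(wellKnown)) urls.add(wellKnown);
        return urls;
    }
}
```

Feed links found in the root document would be added to the same candidate list, but that requires HTML parsing and is left out here.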

Crawler specialization for Lemmy, Discourse and Mediawiki

Some server software for larger websites has a lot of valid links, but also many links that are highly ephemeral (such as a Mastodon feed, or the index of a forum). To help the crawler only index the pages that don’t change that often, specialized logic has been introduced for Lemmy, Discourse and Mediawiki.

This also saves processing power for the server, as these applications often have relatively expensive rendering logic.

This is a bit of an experiment. Implementing these specializations is relatively easy, and if it pans out it will be extended to other software.

ed373eef

Improved Site Info

The site information view has been improved to show better placeholder information for unknown domains, including a link to the git repository for submitting websites to be crawled.

a6a66c6d

Bug Fixes

Pub-date validation

The published date of a page is now validated against the plausible range of the HTML standard it’s written in. It’s impossible that a HTML5 document was written in 1997, and unlikely that a HTML2 document was written in 2021. 7326ba74
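The idea can be sketched as a lookup of plausible years per HTML standard. The enum below is hypothetical and the year ranges are illustrative assumptions: the lower bounds roughly follow when each standard appeared, while the upper bounds are arbitrary cut-offs, not the values used in the search engine.

```java
// Hypothetical sketch; the year ranges are illustrative assumptions
final class PubDateValidationSketch {
    enum HtmlStandard {
        HTML123(1993, 2005),
        HTML4(1997, 2015),
        XHTML(2000, 2020),
        HTML5(2008, Integer.MAX_VALUE);

        private final int firstYear;
        private final int lastYear;

        HtmlStandard(int firstYear, int lastYear) {
            this.firstYear = firstYear;
            this.lastYear = lastYear;
        }

        /** True if a document written in this standard could plausibly be from the given year. */
        boolean isPlausible(int year) {
            return year >= firstYear && year <= lastYear;
        }
    }
}
```

A detected date that fails this check would be discarded, so that the remaining heuristics get a chance to find a better one.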

A bug was also discovered in the JSON+LD parser, that caused rare null pointer exceptions. This code is a bit of a hack and could definitely be cleaned up further. 21125206

Optimizations

The converter process, which extracts keywords and meta data from HTML documents, has been optimized to run about 20-25% faster. The crawler has also been modified to spend less effort on domains that have historically turned out not to have many viable pages. As a result, crawling is twice as fast, and processing takes about 24 hours instead of 60+ hours.

The converter optimization was achieved by replacing expensive string operations (like toLower()) with custom logic that doesn’t require allocation.

BigString

The BigString is an object for transparent storage of compressed strings in memory that enables the processor to load the full contents of a website into memory at once, and then unpack each document as it’s being processed.

BigString was optimized to use fixed buffers. Allocating large arrays in Java is expensive, and the garbage collector has to work hard to clean up the mess. This introduces some lock contention, but it is still significantly faster than the previous version.

Another small speed-up is from using java.lang.String’s char[] constructors instead of byte[]-constructors, reducing unnecessary back-and-forth charset conversion.

Commit: e4372289

RDRPosTagger

The RDRPosTagger library, which does Part Of Speech tagging, already impressively fast, already aggressively modified to be faster, has been further optimized to be faster still, and its Java object tree design was replaced with flat integer arrays.

This was always an expensive operation, but now it’s much faster. The speed-up comes from replacing string comparisons with integer comparisons, as well as re-ordering the data in memory to reduce the cache thrashing that is typically associated with walking a branching tree structure. Part of this is from eliminating Java object headers.

Commit: 186a02
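To give a feel for what replacing an object tree with flat integer arrays means, here is a small sketch of a rule tree stored in parallel int arrays. It is not the modified RDRPosTagger code. The node layout (a feature slot compared against an integer value, a conclusion, and two child indices) is a simplifying assumption, and all names are hypothetical.

```java
// Hypothetical sketch of a rule tree flattened into parallel int arrays
final class FlatRuleTreeSketch {
    // Node i fires when features[slot[i]] == value[i]; child indices use -1 for "none"
    private final int[] slot, value, conclusion, onMatch, onMiss;

    FlatRuleTreeSketch(int[] slot, int[] value, int[] conclusion, int[] onMatch, int[] onMiss) {
        this.slot = slot;
        this.value = value;
        this.conclusion = conclusion;
        this.onMatch = onMatch;
        this.onMiss = onMiss;
    }

    /** Walks the tree from node 0, keeping the conclusion of the last rule that fired. */
    int classify(int[] features, int defaultTag) {
        int tag = defaultTag;
        int node = 0;
        while (node >= 0) {
            if (features[slot[node]] == value[node]) { // integer comparison, no strings
                tag = conclusion[node];
                node = onMatch[node];
            } else {
                node = onMiss[node];
            }
        }
        return tag;
    }
}
```

With words and tags mapped to integer ids up front, the walk touches a handful of compact arrays instead of chasing pointers between separately allocated node objects.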

Release Notes v2023.03.2

Posted: 2023-05-25

This is primarily a bugfix release, addressing some issues with metadata corruption introduced in the previous release.

New Features

File keywords

To provide more tools for navigating the web, the converter now generates synthetic keywords for documents that link to files on the same server based on their file ending.

If the document contains a link such as

<a href="file.zip">Download</a>

then the document will be tagged with the keyword file:zip as well as file:archive.

The category keywords are file:audio, file:video, file:image, file:document, file:archive.

Since earlier, the converter has also generated keywords based on filenames, even if the filename itself doesn’t appear in the visible portion of the document. So in the example above, file.zip would also be a relevant keyword for the document.

Commit: a9f7b4c4
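The mapping from a link to file keywords can be sketched like this. The five categories are the ones listed above, but the extension table is a small assumed sample and the class name is hypothetical. This sketch only tags extensions it knows about and ignores query strings.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the extension table is an assumed sample, not the converter's list
final class FileKeywordSketch {
    private static final Map<String, String> CATEGORY = Map.of(
            "zip", "archive", "tar", "archive",
            "mp3", "audio", "ogg", "audio",
            "mp4", "video",
            "png", "image", "jpg", "image",
            "pdf", "document");

    static Set<String> keywordsForLink(String href) {
        Set<String> keywords = new LinkedHashSet<>();
        String fileName = href.substring(href.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return keywords;
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String category = CATEGORY.get(ext);
        if (category == null) return keywords; // unknown file ending: no keywords
        keywords.add("file:" + ext);
        keywords.add("file:" + category);
        return keywords;
    }
}
```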

Bug fixes

Metadata corruption

As a workaround for the limitations of the Java language, document metadata is encoded through explicit bit twiddling. It’s basically a manual implementation of a C struct on top of a 64 bit long. This is a great performance improvement and allows for very compact storage of the metadata, but the approach is also notoriously error prone and difficult to do in a safe way. It’s basically the programming equivalent of running with scissors.
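For readers who haven't seen the technique, a manual struct on top of a long looks roughly like this. The field names, widths and bit offsets are hypothetical and chosen only for illustration; they are not the actual metadata layout.

```java
// Hypothetical layout: bits 0-7 flags, bits 8-19 year, bits 20-27 rank
final class DocumentMetadataSketch {
    private static final int FLAGS_SHIFT = 0;
    private static final long FLAGS_MASK = 0xFF;
    private static final int YEAR_SHIFT = 8;
    private static final long YEAR_MASK = 0xFFF;
    private static final int RANK_SHIFT = 20;
    private static final long RANK_MASK = 0xFF;

    /** Packs the three fields into one long; values wider than their mask are truncated. */
    static long encode(int flags, int year, int rank) {
        return ((flags & FLAGS_MASK) << FLAGS_SHIFT)
             | ((year & YEAR_MASK) << YEAR_SHIFT)
             | ((rank & RANK_MASK) << RANK_SHIFT);
    }

    static int flags(long encoded) { return (int) ((encoded >>> FLAGS_SHIFT) & FLAGS_MASK); }
    static int year(long encoded)  { return (int) ((encoded >>> YEAR_SHIFT) & YEAR_MASK); }
    static int rank(long encoded)  { return (int) ((encoded >>> RANK_SHIFT) & RANK_MASK); }
}
```

Note that every field is a plain int, so the compiler cannot catch arguments passed in the wrong order, and a swapped pair still produces a perfectly valid-looking long.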

A bug crept in where parts of the document metadata were garbled. This made it impossible to search by year, and also broke the ‘blog’ and ‘vintage’ filters, and may also have deteriorated the search result quality a bit.

The bug wasn’t directly caused by the bit twiddling, but by mispopulating the fields in a constructor. It’s a fairly trivial error, but it was hard to detect since it was not immediately obvious that the data was corrupted given the limited visibility into the “struct”, and reproducing the error in a test proved difficult since the test used the constructor correctly.

Despite testing on a pre-production environment, the bug was not discovered until it was deployed to production. If anything I think it highlights a need for finding better testing strategies. This functionality is fairly smeared out over the code path, difficult to isolate, and it’s often not immediately apparent when it’s broken. All this makes it a continuous struggle to test in a systematic way. In general it’s very hard to test this sort of logic, as it requires a large and relatively realistic corpus of data to test against, which makes isolating behavior harder, and the outcome is also never clearly right or wrong, but a matter of this-feels-right or this-seems-wrong.

Commit: 2ab26f37

Publish Date Detection

The order of the heuristics in the publish date detection has been improved to reduce the number of false positives. The support for JSON+LD has also been improved to cover additional cases.

Marginalia uses a long list of different heuristics to try to detect the publish date of a document. It was previously assumed that HTML5’s <time[pubdate="pubdate"]> element would generally contain a valid publish date for the current document, but this is not always the case, as some blogging platforms also include <article>-tags, including <time> for snippets of other articles. The heuristics have been reordered to try to detect the date from other sources first, and then fall back to the <time> element as one of the less reliable heuristics.

Commit: 619fb8ba

Response cache for the API service to help misconfigured clients

It’s been a long standing problem that some misconfigured API consumers spam the API endpoint with the same query multiple times in a row, very rapidly consuming the rate limit. A cache has been added before the rate limit that will return the same result for the same query within a short time window “for free”.

This was also a good opportunity to clean up the API service a bit and improve the test coverage.

Commit: 112f43b3
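The arrangement can be sketched as a small time-limited cache sitting in front of the rate-limited call. This is an illustration, not the API service code: the class name, the per-key cache key and the injectable clock are assumptions, and expired entries are never evicted in this simplified version.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of a response cache placed before the rate limiter
final class CachedApiSketch {
    private record Entry(String response, long expiresAtMs) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final LongSupplier clockMs;
    private final long ttlMs;
    private final Function<String, String> rateLimitedSearch;

    CachedApiSketch(LongSupplier clockMs, long ttlMs, Function<String, String> rateLimitedSearch) {
        this.clockMs = clockMs;
        this.ttlMs = ttlMs;
        this.rateLimitedSearch = rateLimitedSearch;
    }

    String query(String apiKey, String query) {
        String cacheKey = apiKey + "\n" + query;
        long now = clockMs.getAsLong();
        Entry hit = cache.get(cacheKey);
        if (hit != null && hit.expiresAtMs() > now) {
            return hit.response(); // served "for free", the rate limiter is never consulted
        }
        String response = rateLimitedSearch.apply(query); // counts against the rate limit
        cache.put(cacheKey, new Entry(response, now + ttlMs));
        return response;
    }
}
```

In production code the clock would typically be System::currentTimeMillis, and some form of eviction would be needed to keep the map from growing without bound.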

Minor Fixes

Stopgap fix for a bug in dealing with quote terms containing stop words. 6fae51a8
Fix data loading bug where domains with some IPv6 addresses would blow up. d42ab191
Fix bug where some synthetic keywords would fail to return results. df1850bd

Experiments That Never Made It

A wise man once said “it’s not R&D if you aren’t throwing away half your work”. Here are some of the experiments that didn’t make it into production.

A synthetic keyword for image filenames that look like they come out of a smartphone

Alongside the file keywords, an experiment was run with generating a synthetic keyword for image filenames that look like they come out of a smartphone, e.g. filenames with the format “IMG_nnnnnn.jpg”. While very easy to build, this turned out to be not very useful. The idea was scrapped.

LDA topic modeling

Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that’s often used to extract topics from a corpus of documents. The idea was to use this to offer additional ways of navigating the web. The idea was scrapped because the results were not quite useful. The main work involved porting the LDA implementation in Mallet from a very old style of Java to a modern one. Since this was a fairly large task, it was decided to keep the code around in a branch in case it could be useful for other purposes.

Performance wise it might be plausible to do something with LDA in the future. The branch with the patched Mallet code is available here.