Information relating to the Marginalia Search project.
Marginalia Search is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren’t aware of in favor of the sort of sites you probably already knew existed.
You may also be interested in the 🏷️ search-engine tag.
Documents

It’s been three years since the inception of Marginalia Search, then a dinky experiment to find where the heck the cool Internet has gone, now my full time job.
While there’s always things that can be improved, it’s fair to say the search engine has never worked as well as it does right now.
A great number of milestones have been reached; perhaps biggest of all, the search engine has moved out of my living room and into a proper enterprise server.

One of the great joys of working on a search engine is that you get to reverse engineer SEO spam, and overall study how it evolves over time.
I’ve been noticing the search engine spam strategy of adding ‘reddit’ to page titles for a few years now, but it feels like it’s been growing a lot recently. I don’t think it’s actually working, but it’s so cute that they are trying.

Marginalia Search very recently gained the ability to filter results by Autonomous System, not only searching by ASN but by the organization information for that AS. At a glance this seems like a somewhat frivolous feature, but it has interesting effects.
Autonomous Systems are part of the Internet’s routing infrastructure. If your mental model of an IP number is that it is the phone number of the computer, this is something akin to a postal code.

The Marginalia Crawler has seen improvements! A long term problem with the crawler design is that if for whatever reason the crawler shuts down, it needs to restart from zero the fetching of whatever domains it was traversing when it terminated.
This isn’t fantastic, since not only does crawling a website take a fair bit of time, it’s a nuisance for the server admins to re-crawl stuff that was already fetched, and a real liability for ending up in robots.

I’ve been working on getting anchor tag keywords into the search engine, basically using link texts to complement the keywords on a webpage.
The problem I’m attempting to address is that many websites don’t really describe themselves particularly well. As Steve Ballmer’s stage performance once illustrated, merely repeating a word doesn’t on its own make what you’re saying relevant to the term.
Another good example of how it falls short is PuTTY’s website, which will be used as a pilot case to improve.

So a bit of an update on what I’ve been working on. This will be adapted into release notes in a while, but I haven’t quite wrapped a bow on the change set yet.
Still, it has certainly been a few weeks. Didn’t quite land how busy I’ve been until I sat down to draft this post. Them’s some changes, and I’m skipping a few to keep this meandering post at a sane length.

So the search engine is moving to a new server soon, thanks to the generous grant mentioned recently.
If you visit search.marginalia.nu now, it may use either the old or the new server. It’ll be like this for a while, since I need them both for testing and maintenance type work.
I’ll also apologize if this post is a bit chaotic. It is a reflection of a very chaotic couple of weeks that apart from setting up this migration also involved a very short notice invitation for a presentation at ossym23.

I’m happy to announce that the generous people at FUTO have granted the project $15,000 with no strings attached to help the search engine out with some more server power.
FUTO is a young Austin, TX-based organization “dedicated to developing, both through in-house engineering and investment, technologies that frustrate centralization and industry consolidation”. It’s one to keep an eye on, I believe their heart is in the right place and they have every possibility of making a real difference.

So… I’ve had the most unreal week of coding. Zero exaggeration, I’ve halved the RAM requirements of the search engine, removed the need to take the system offline during an upgrade, removed hard limits on how many documents can be indexed, and quadrupled soft limits on how many keywords can be in the corpus.
It’s been a long term goal to keep it possible to run and operate the system on low-powered hardware, and so far improvements have been made, to the point where my 32 GB RAM developer machine feels spacious rather than cramped, but this set of changes takes it several notches further.

This is a bit of a “what I’ve been working on” style of post. It’s also a bit of a complement for the release notes of the upcoming release which should be dropping in a week or so. There’s some spit and polish still missing from these things, but if I don’t write about them now too much will have been ejected from the cache to make a well written post about it.

I’m working on Marginalia Search full time.
I left the office for the last time today, and it’s the strangest feeling. I’ve quit jobs, taken time off work, been laid off, but this is different from any of those things. This is deliberate.
There’s a note of relief. I’ve essentially been working two pretty demanding jobs; one for pay and one for passion and the joy of making a difference.