Marginalia Search

Information relating to the Marginalia Search project.

Marginalia Search is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren’t aware of in favor of the sort of sites you probably already knew existed.

URL:  🌎 https://search.marginalia.nu/
Git:  🌎 https://git.marginalia.nu/

You may also be interested in the 🏷️ search-engine tag.

Documents

Name  Date
📁 ../  2024-05-16
📄 FAQ  2023-03-28
📄 API  2023-03-23
📄 About Marginalia Search  2022-12-23
📄 For Webmasters  2022-10-28
📄 Privacy Considerations  2022-09-22
📄 Donate To This Project  2022-09-05

Recent Posts in 🏷️ search-engine

2024-05-16 Experiment in Java native calls

I’ve experimentally replaced some of the Java implementations of quicksort and binary search with calls to C++ code, and saw huge benefits for the sorting code but the same or worse performance for binary search. The Marginalia Search engine is mainly written in Java, which is a language that is good at many things, but not particularly pleasant to work with when it comes to low level systems programming. Unfortunately, a part of building an internet search engine involves database-adjacent low level programming.

2024-04-17 Query Parsing and Understanding

Been working on improving Marginalia Search query parsing and understanding. This is going to be a pretty long update, as it’s a few months’ work. Apart from cleaning up the somewhat messy query parsing code, a problem I’m trying to address is that the search engine is currently only good at dealing with fairly focused queries. They don’t need to be short, but if you try to qualify a search that is too broad by adding more terms, it often doesn’t produce anything useful.

2024-04-10 Deep Bug

The project has been haunted by a mysterious bug since sometime in February. It relates to the code that constructs the index, particularly the code that merges partial indices. In short, the search engine constructs the reverse index through successive merging of smaller indices, which reduces the overall memory requirement. You can conceptualize the reverse index itself as two files, one with offset pointers into another file, which has sorted numbers. This code runs after each partition finishes crawling and processing its data, and has a run time of about 4 hours.
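
To make the two-file picture concrete, here is a minimal in-memory sketch in Python. It is not the project's actual code or on-disk layout: the string term keys, the function names, and the use of a dict and a list in place of files are all assumptions made for illustration. Each partial index holds an offsets table pointing into one flat list of sorted document ids, and two partial indices are combined with an ordinary sorted merge.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the two "files":
#   offsets[term] = (start, end) range into `postings`
#   postings      = flat list holding each term's sorted document ids
Index = Tuple[Dict[str, Tuple[int, int]], List[int]]


def build_index(term_to_docs: Dict[str, List[int]]) -> Index:
    """Build a small partial index from a term -> document ids mapping."""
    offsets: Dict[str, Tuple[int, int]] = {}
    postings: List[int] = []
    for term in sorted(term_to_docs):
        docs = sorted(set(term_to_docs[term]))
        offsets[term] = (len(postings), len(postings) + len(docs))
        postings.extend(docs)
    return offsets, postings


def lookup(index: Index, term: str) -> List[int]:
    """Return the sorted document ids for a term, or [] if it is absent."""
    offsets, postings = index
    if term not in offsets:
        return []
    start, end = offsets[term]
    return postings[start:end]


def merge_sorted(a: List[int], b: List[int]) -> List[int]:
    """Merge two sorted id lists into one sorted list without duplicates."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:  # same id in both inputs: keep one copy
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_indices(left: Index, right: Index) -> Index:
    """Combine two partial indices into one, term by term."""
    offsets: Dict[str, Tuple[int, int]] = {}
    postings: List[int] = []
    for term in sorted(set(left[0]) | set(right[0])):
        docs = merge_sorted(lookup(left, term), lookup(right, term))
        offsets[term] = (len(postings), len(postings) + len(docs))
        postings.extend(docs)
    return offsets, postings
```

Merging a list of partial indices pairwise with `merge_indices` only ever needs two inputs and one output at a time, which is roughly the memory argument the excerpt makes for successive merging.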

2024-02-28 The Yak Shave

I set out a little over a week ago to add a service registry to Marginalia Search, primarily to reduce its dependence on docker. I would like it to be able to run on bare metal as well, which poses a problem since configuring the application manually is a bit of a headache with dozens of ports that need to be set up. It would also be desirable to be able to run multiple instances of important services in order to eliminate downtime during upgrades.

2024-02-25 Marginalia: 3 Years

It’s been three years since the inception of Marginalia Search, then a dinky experiment to find where the heck the cool Internet has gone, now my full time job. While there’s always things that can be improved, it’s fair to say the search engine has never worked as well as it does right now. A great number of milestones have been reached, perhaps biggest of all the search engine has moved out of my living room and into a proper enterprise server.

2024-02-07 Best SEO spam 2024 reddit

One of the great joys of working on a search engine is that you get to reverse engineer SEO spam, and overall study how it evolves over time. I’ve been noticing the search engine spam strategy of adding ‘reddit’ to page titles for a few years now, but it feels like it’s been growing a lot recently. I don’t think it’s actually working, but it’s so cute that they are trying.

2023-12-22 A Frivolous Feature

Marginalia Search very recently gained the ability to filter results by Autonomous System, not only searching by ASN but by the organization information for that AS. At a glance this seems like a somewhat frivolous feature, but it has interesting effects. Autonomous Systems are part of the Internet’s routing infrastructure. If your mental model of an IP number is that they are the phone number of the computer, this is something akin to a postal code.

2023-12-20 WARC'in the crawler

The Marginalia Crawler has seen improvements! A long term problem with the crawler design is that if for whatever reason the crawler shuts down, then it needs to re-start fetching whatever domains it was currently traversing during the termination from zero. This isn’t fantastic, since not only does crawling a website take a fair bit of time, it’s a nuisance for the server admins to re-crawl stuff that was already fetched, and a real liability for ending up in robots.

2023-11-07 Anchor Tags

I’ve been working on getting anchor tag keywords into the search engine, basically using link texts to complement the keywords on a webpage. The problem I’m attempting to address is that many websites don’t really describe themselves particularly well. As Steve Ballmer’s stage performance once illustrated, merely repeating a word doesn’t on its own make what you’re saying relevant to the term. Another good example of how it falls short is PuTTY’s website, which will be used as a pilot case to improve.
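
As a rough illustration of the idea of using link texts as extra keywords, the Python sketch below collects the words from anchor texts per link target and adds them to a page's own keywords. The function names, the tokenization rule, and the example URLs are hypothetical; the excerpt does not describe how the search engine actually extracts or weights these keywords.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def anchor_keywords(links: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group lowercase words from anchor texts by the URL they link to.

    `links` is an iterable of (target_url, anchor_text) pairs.
    """
    keywords: Dict[str, Set[str]] = defaultdict(set)
    for target, text in links:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            keywords[target].add(word)
    return dict(keywords)


def complemented_keywords(page_keywords: Set[str],
                          url: str,
                          anchors: Dict[str, Set[str]]) -> Set[str]:
    """Return the page's own keywords plus any anchor-text keywords for it."""
    return page_keywords | anchors.get(url, set())


# Hypothetical input: how other pages describe a link target.
links = [
    ("https://example.com/putty", "PuTTY SSH client"),
    ("https://example.com/putty", "download putty"),
    ("https://example.org/", "home"),
]
anchors = anchor_keywords(links)
combined = complemented_keywords({"putty"}, "https://example.com/putty", anchors)
```

Here a page that only mentions its own name still picks up words like "ssh" and "client" from the pages that link to it.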

2023-10-30 Partitioning The Index

So a bit of an update on what I’ve been working on. This will be adapted into release notes in a while, but I haven’t quite wrapped a bow on the change set yet. Still, it has certainly been a few weeks. Didn’t quite land how busy I’ve been until I sat down to draft this post. Them’s some changes, and I’m skipping a few to keep this meandering post at a sane length.

2023-10-07 Moving Marginalia to a New Server

So the search engine is moving to a new server soon, thanks to the generous grant mentioned recently. If you visit search.marginalia.nu now, it may or may not use the old or new server. It’ll be like this for a while, since I need them both for testing and maintenance type work. I’ll also apologize if this post is a bit chaotic. It is a reflection of a very chaotic couple of weeks that apart from setting up this migration also involved a very short notice invitation for a presentation at ossym23.

2023-09-15 Marginalia Search receives FUTO Grant

I’m happy to announce that the generous people at FUTO have granted the project $15,000 with no strings attached to help the search engine out with some more server power. FUTO is a young Austin, TX-based organization “dedicated to developing, both through in-house engineering and investment, technologies that frustrate centralization and industry consolidation”. It’s one to keep an eye on, I believe their heart is in the right place and they have every possibility of making a real difference.

2023-08-30 Absurd Success

So… I’ve had the most unreal week of coding. Zero exaggeration, I’ve halved the RAM requirements of the search engine, removed the need to take the system offline during an upgrade, removed hard limits on how many documents can be indexed, and quadrupled soft limits on how many keywords can be in the corpus. It’s been a long term goal to keep it possible to run and operate the system on low-powered hardware, and so far improvements have been made, to the point where my 32 GB RAM developer machine feels spacious rather than cramped, but this set of changes takes it several notches further.