<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>search-engine on marginalia.nu</title><link>https://www.marginalia.nu/tags/search-engine/</link><description>Recent content in search-engine on marginalia.nu</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 30 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.marginalia.nu/tags/search-engine/index.xml" rel="self" type="application/rss+xml"/><item><title>An NSFW filter for Marginalia Search</title><link>https://www.marginalia.nu/log/a_134_nsfw/</link><pubDate>Mon, 30 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_134_nsfw/</guid><description>&lt;p&gt;&amp;hellip; optional, that is.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been working on an NSFW filter for Marginalia Search,
as that is something some people have asked for,
primarily API consumers.&lt;/p&gt;
&lt;p&gt;The search engine has had some domain based filtering for a while,
based on the UT1 lists, but that isn&amp;rsquo;t a very comprehensive approach.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll land on a single hidden layer neural network approach,
implemented from scratch, but before landing on that,
many other things were tried along the way.&lt;/p&gt;</description></item><item><title>Index Compression, Query Execution Improvements</title><link>https://www.marginalia.nu/log/a_131_index_compression/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_131_index_compression/</guid><description>&lt;p&gt;The Marginalia Search index has recently seen some design tweaks
to make it perform better, primarily the introduction of postings list compression.&lt;/p&gt;
&lt;p&gt;Last year, the index was partially &lt;a href="https://www.marginalia.nu/log/a_123_index_io/"&gt;re-implemented with SSDs in mind&lt;/a&gt;.
This was largely a success, but left some lingering issues with tail latencies that sometimes weren&amp;rsquo;t what they needed to be.&lt;/p&gt;
&lt;p&gt;To ensure predictable execution times,
the query execution is provided a timeout value,
after which it will wrap up and return the best results it&amp;rsquo;s found.
Query execution was so flaky that the &lt;em&gt;actual&lt;/em&gt; timeout used when terminating the execution used to be something like 50ms lower than the provided value.
This is obviously not a fantastic state of affairs.&lt;/p&gt;</description></item><item><title>Trust in Ranking</title><link>https://www.marginalia.nu/log/a_130_trust_in_ranking/</link><pubDate>Sat, 31 Jan 2026 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_130_trust_in_ranking/</guid><description>&lt;p&gt;The Marginalia Search default ranking algorithm recently saw a fairly radical improvement, due to a new domain trust system that drastically reduces the number of content farm results. As long as there are human results, it usually finds them across all the usual test queries.&lt;/p&gt;
&lt;p&gt;Recently fixing a few bugs that made the search engine work more correctly had the unexpected and undesired side-effect of also making it surface more search engine spam and content farm-type results.&lt;/p&gt;</description></item><item><title>New Search Filtering in Web and API</title><link>https://www.marginalia.nu/log/a_127_index_filtering/</link><pubDate>Mon, 08 Dec 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_127_index_filtering/</guid><description>&lt;p&gt;The search engine recently exposed a fair number of new tools for custom filtering to the API consumers and users of the new UI.&lt;/p&gt;
&lt;p&gt;This was originally going to be an incredibly chaotic update, both announcing the new features and doing a technical walkthrough of the changes, but that ambition turned out a bit &lt;em&gt;too&lt;/em&gt; chaotic, so let&amp;rsquo;s split them up and focus on the feature announcement bit today.&lt;/p&gt;
&lt;h2 id="new-search-filtering-gui"&gt;New Search Filtering GUI&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s now possible to define a custom filter in the GUI, on the &lt;code&gt;marginalia-search.com&lt;/code&gt; version of the website!&lt;/p&gt;</description></item><item><title>Language Support for Marginalia Search</title><link>https://www.marginalia.nu/log/a_126_multilingual/</link><pubDate>Mon, 06 Oct 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_126_multilingual/</guid><description>&lt;p&gt;One of the big ambitions for the search engine this year has been to enable searching in more languages than English, and a pilot project for this has just been completed, allowing experimental support for German, French and Swedish.&lt;/p&gt;
&lt;p&gt;These changes are now live for testing, but with an extremely small corpus of documents.&lt;/p&gt;
&lt;p&gt;As the search engine has been up to this point built with English in mind, some anglo-centric assumptions made it into its code. A lot of the research on search engines generally seems to embed similar assumptions.&lt;/p&gt;</description></item><item><title>Faster Index I/O with NVMe SSDs</title><link>https://www.marginalia.nu/log/a_123_index_io/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_123_index_io/</guid><description>&lt;p&gt;The Marginalia Search index has been partially rewritten to perform much better, using new data structures designed to make better use of modern hardware. This post will cover the new design, and will also touch upon some of the unexpected and unintuitive performance characteristics of NVMe SSDs when it comes to read sizes.&lt;/p&gt;
&lt;p&gt;The index is already fairly large, but can sometimes feel smaller than it is, and paradoxically, query performance is a big part of why. If each query has a budget of 100-250ms, a design that finds and ranks results faster in that time period will produce better search results. There are other limitations as well: query understanding is still somewhat limited, and only minor changes to a query can unearth dozens of new related results.&lt;/p&gt;</description></item><item><title>Finding Dead Websites</title><link>https://www.marginalia.nu/log/a_122_dead_websites/</link><pubDate>Tue, 17 Jun 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_122_dead_websites/</guid><description>&lt;p&gt;As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change.&lt;/p&gt;
&lt;p&gt;This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data quality, and for detecting when websites have significant changes including ownership transfers and parking.&lt;/p&gt;
&lt;h1 id="table-of-contents"&gt;Table Of Contents&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#rationale"&gt;Feature Rationale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#repr"&gt;Data Representation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#livedata"&gt;Live Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#eventdata"&gt;Event Data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#details"&gt;Change Detection Details&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#availability"&gt;Availability Detection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#ownership"&gt;Ownership Changes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#dns"&gt;DNS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#hurdles"&gt;Implementation Hurdles&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#scheduling"&gt;Scheduling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#certvalid"&gt;Certificate Validation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/a_122_dead_websites/#conclusion"&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a name="rationale"&gt;&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Profiling Websites</title><link>https://www.marginalia.nu/log/a_121_profiling_websites/</link><pubDate>Thu, 29 May 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_121_profiling_websites/</guid><description>&lt;p&gt;The most recent change to the search engine is a system that profiles websites based on their rendered DOM. The goal is identifying advertisements, trackers, nuisance popovers, and similar elements.&lt;/p&gt;
&lt;p&gt;The search engine already tries to do this, but isn&amp;rsquo;t very good at it because it&amp;rsquo;s only looking at static code.&lt;/p&gt;
&lt;p&gt;It turns out to be somewhat difficult to determine what a website that has non-trivial JavaScript will look like based on its source code alone, as this would require us to, among other things, solve the halting problem.&lt;/p&gt;</description></item><item><title>PDF to Text, a challenging problem</title><link>https://www.marginalia.nu/log/a_119_pdf/</link><pubDate>Tue, 13 May 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_119_pdf/</guid><description>&lt;p&gt;The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months.&lt;/p&gt;
&lt;p&gt;Extracting text information from PDFs is a significantly bigger challenge than it might seem.
The crux of the problem is that the file format isn&amp;rsquo;t a text format at all, but a graphical format.&lt;/p&gt;
&lt;p&gt;It doesn&amp;rsquo;t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on &amp;ldquo;paper&amp;rdquo;. These
glyphs may be rotated, overlap, and appear out of order, with very little semantic information
attached to them.&lt;/p&gt;</description></item><item><title>Crawl Order and Disorder</title><link>https://www.marginalia.nu/log/a_117_crawl_order/</link><pubDate>Thu, 27 Mar 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_117_crawl_order/</guid><description>&lt;p&gt;A problem the search engine&amp;rsquo;s crawler has struggled with for some time is that it takes a fairly long time to finish up, usually spending several days wrapping up the final few domains.&lt;/p&gt;
&lt;p&gt;This has become more pressing recently: the migration to slop crawl data has dropped the memory requirements of the crawler by something like 80%, so I&amp;rsquo;ve been able to increase the number of crawling tasks. This has led to a bizarre case where 99.9% of the crawling is done in 4 days, and the remaining 0.1% takes a week.&lt;/p&gt;</description></item><item><title>Marginalia Search receives second nlnet grant</title><link>https://www.marginalia.nu/log/a_116_grant_2.0/</link><pubDate>Tue, 25 Mar 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_116_grant_2.0/</guid><description>&lt;p&gt;I&amp;rsquo;m happy and grateful to announce that the Marginalia Search
project has been accepted for a second &lt;a href="https://nlnet.nl/"&gt;nlnet&lt;/a&gt; grant.&lt;/p&gt;
&lt;p&gt;All the details are not yet finalized, but tentatively the grant will go toward addressing most of the items in the project
&lt;a href="https://github.com/MarginaliaSearch/MarginaliaSearch/blob/master/ROADMAP.md"&gt;roadmap for 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve already been working full time on the project since &lt;a href="https://www.marginalia.nu/log/83_full_time/"&gt;summer 2023&lt;/a&gt;, and this grant secures additional development time, and extends the runway to a comfortable degree.&lt;/p&gt;
&lt;p&gt;Will post more details as they are finalized.&lt;/p&gt;</description></item><item><title>Marginalia Search: 4 Years</title><link>https://www.marginalia.nu/log/a_114_4_years/</link><pubDate>Mon, 03 Mar 2025 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_114_4_years/</guid><description>&lt;p&gt;This update is a few days late, the canonical birth date of the project is Feb 26.&lt;/p&gt;
&lt;p&gt;It has been another year of Marginalia Search. The project is still ongoing and still my full time job, although it is entering a somewhat more mature phase of development: most of the big pieces are in place and do a decent job at what they do.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/MarginaliaSearch/MarginaliaSearch/blob/master/ROADMAP.md"&gt;roadmap for the project&lt;/a&gt; is available on GitHub.&lt;/p&gt;</description></item><item><title>RSS Feeds and Real Time Crawling</title><link>https://www.marginalia.nu/log/a_113_rtc/</link><pubDate>Thu, 26 Dec 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_113_rtc/</guid><description>&lt;p&gt;A while back an update went live that, with some caveats, changes the time it takes for an update on a website to reflect in the search engine index from up to 2 months to 1-2 days. The condition is that the website has an RSS or Atom feed.&lt;/p&gt;
&lt;p&gt;The big crawl job takes about two months, and is run partition by partition, meaning there&amp;rsquo;s typically a slice of the index that is two months stale at any given point in time. To help compensate for this, a new crawler and index partition has been added that focuses on recently updated content.&lt;/p&gt;</description></item><item><title>Notes on binary soup</title><link>https://www.marginalia.nu/log/a_112_slop_ideas/index.md/</link><pubDate>Tue, 05 Nov 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_112_slop_ideas/index.md/</guid><description>&lt;p&gt;I recently put together a small library called &lt;a href="https://github.com/MarginaliaSearch/SlopData"&gt;Slop&lt;/a&gt;, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before.&lt;/p&gt;
&lt;p&gt;This post isn&amp;rsquo;t so much an attempt to convince anyone else to use this library, as it makes trade-offs catering to a fairly niche use case, but to explore some of its design ideas, as it all came together very nicely, in the hopes that other libraries can draw ideas from it.&lt;/p&gt;</description></item><item><title>Release Notes v2024.10.0</title><link>https://www.marginalia.nu/release-notes/v2024-10-0/</link><pubDate>Mon, 14 Oct 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2024-10-0/</guid><description>&lt;p&gt;This is a new major release of Marginalia Search, mostly leaning toward the technical side.&lt;/p&gt;
&lt;p&gt;Emphasis has been on ensuring the search engine has the technical capabilities to serve more types of queries, especially longer queries which it previously did not handle very well.&lt;/p&gt;
&lt;p&gt;Effort has also been put toward making sure it&amp;rsquo;s possible to install and run outside of docker. There is still some work to be done to streamline the installation process, but we&amp;rsquo;re getting there.&lt;/p&gt;</description></item><item><title>Phrase Matching in Marginalia Search</title><link>https://www.marginalia.nu/log/a_111_phrase_matching/</link><pubDate>Mon, 30 Sep 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_111_phrase_matching/</guid><description>&lt;p&gt;Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document exactly in the same order as they do in the query.&lt;/p&gt;
&lt;p&gt;This is a write-up about implementing this change. This is going to be a relatively long post, as it represents about 4 months of work.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also happy and grateful to announce that the nlnet people reached out after the run of &lt;a href="../a_107_nlnext"&gt;the grant&lt;/a&gt; was over and asked me if I had more work in the pipe, and agreed to fund this change as well!&lt;/p&gt;</description></item><item><title>One year of solo dev, wrapping up the grant-funded work</title><link>https://www.marginalia.nu/log/a_107_nlnext/</link><pubDate>Tue, 18 Jun 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_107_nlnext/</guid><description>&lt;p&gt;&lt;a href="https://www.marginalia.nu/log/83_full_time/"&gt;A year ago&lt;/a&gt; I walked out of the office for the last time. I handed in my corpo laptop, said some good-byes, and since then I have been my own boss.&lt;/p&gt;
&lt;p&gt;This first year has been funded by an &lt;a href="https://nlnet.nl/" rel="external noopener"&gt;NLnet&lt;/a&gt; grant, which I&amp;rsquo;m in the midst of wrapping up. As of now, the work is all done, the final request for payment has been sent.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a similar last-day-of-school levity to both these events.&lt;/p&gt;</description></item><item><title>Experiment in Java native calls</title><link>https://www.marginalia.nu/log/a_106_native_calls/</link><pubDate>Thu, 16 May 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_106_native_calls/</guid><description>&lt;p&gt;I&amp;rsquo;ve experimentally replaced some of the Java implementations of quicksort and binary search with calls to C++ code, and saw huge benefits for the sorting code but the same or worse performance for binary search.&lt;/p&gt;
&lt;p&gt;The Marginalia Search engine is mainly written in Java, which is a language that is good at many things, but not particularly pleasant to work with when it comes to low level systems programming.&lt;/p&gt;
&lt;p&gt;Unfortunately, a part of building an internet search engine involves database-adjacent low level programming.&lt;/p&gt;</description></item><item><title>Query Parsing and Understanding</title><link>https://www.marginalia.nu/log/a_103_query_parsing/</link><pubDate>Wed, 17 Apr 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_103_query_parsing/</guid><description>&lt;p&gt;Been working on improving Marginalia Search query parsing and understanding. This is going to be a pretty long update, as it&amp;rsquo;s a few months&amp;rsquo; work.&lt;/p&gt;
&lt;p&gt;Apart from cleaning up the somewhat messy query parsing code, a problem I&amp;rsquo;m trying to address is that the search engine is currently only good at dealing with fairly focused queries. They don&amp;rsquo;t need to be short, but if you try to qualify a search that is too broad by adding more terms, it often doesn&amp;rsquo;t produce anything useful.&lt;/p&gt;</description></item><item><title>Deep Bug</title><link>https://www.marginalia.nu/log/a_104_dep_bug/</link><pubDate>Wed, 10 Apr 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_104_dep_bug/</guid><description>&lt;p&gt;The project has been haunted by a mysterious bug since sometime in February. It relates to the code that constructs the index, particularly the code that merges partial indices.&lt;/p&gt;
&lt;p&gt;In short, the search engine constructs the reverse index through successive merging of smaller indices, which reduces the overall memory requirement.&lt;/p&gt;
&lt;p&gt;You can conceptualize the reverse index itself as two files, one with offset pointers into another file, which has sorted numbers. This code runs after each partition finishes crawling and processing its data, and has a run time of about 4 hours.&lt;/p&gt;</description></item><item><title>The Yak Shave</title><link>https://www.marginalia.nu/log/a_102_yak_shave/</link><pubDate>Wed, 28 Feb 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_102_yak_shave/</guid><description>&lt;p&gt;I set out a little over a week ago to add a service registry to Marginalia Search,
primarily to reduce its dependence on docker. I would like it to be able to run
on bare metal as well, which poses a problem since configuring the application manually
is a bit of a headache with dozens of ports that need to be set up. It would also be
desirable to be able to run multiple instances of important services in order to eliminate
downtime during upgrades.&lt;/p&gt;</description></item><item><title>Marginalia: 3 Years</title><link>https://www.marginalia.nu/log/a_101_marginalia-3-years/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_101_marginalia-3-years/</guid><description>&lt;p&gt;It&amp;rsquo;s been three years since the inception of Marginalia Search, then
a dinky experiment to find where the heck the cool Internet has gone,
now my full time job.&lt;/p&gt;
&lt;p&gt;While there&amp;rsquo;s always things that can be improved, it&amp;rsquo;s fair to say
the search engine has never worked as well as it does right now.&lt;/p&gt;
&lt;p&gt;A great number of milestones have been reached, perhaps biggest
of all the search engine has moved out of my living room and into
a proper enterprise server.&lt;/p&gt;</description></item><item><title>Best SEO spam 2024 reddit</title><link>https://www.marginalia.nu/log/a_100_reddit_spam/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/a_100_reddit_spam/</guid><description>&lt;p&gt;One of the great joys of working on a search engine is that you get to reverse engineer SEO spam, and overall study how it evolves over time.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been noticing the search engine spam strategy of adding &amp;lsquo;reddit&amp;rsquo; to page titles for a few years now, but it feels like it&amp;rsquo;s been growing a lot recently. I don&amp;rsquo;t think it&amp;rsquo;s actually &lt;em&gt;working&lt;/em&gt;, but it&amp;rsquo;s so cute that they are trying.&lt;/p&gt;</description></item><item><title>Release Notes v2024.01.0</title><link>https://www.marginalia.nu/release-notes/v2024-01-0/</link><pubDate>Wed, 24 Jan 2024 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2024-01-0/</guid><description>&lt;p&gt;This is a major new release of the search engine software, corresponding to nearly four months of changes. In these months, the state of the code hasn&amp;rsquo;t been stable enough for a new release, but it&amp;rsquo;s now been brought to a stable point.&lt;/p&gt;
&lt;p&gt;Release Highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The installation procedure has been cleaned up.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s now possible to run the search engine in a white label/bare-bones mode, without any of the Marginalia Search branding or logic.&lt;/li&gt;
&lt;li&gt;The Marginalia Search web interface has been overhauled. The site-info page has especially been given a large upgrade.&lt;/li&gt;
&lt;li&gt;The search engine can use anchor texts to supplement keywords.&lt;/li&gt;
&lt;li&gt;The search engine can use multiple index shards.&lt;/li&gt;
&lt;li&gt;The operations GUI has been overhauled.&lt;/li&gt;
&lt;li&gt;An operations manual has been written.&lt;/li&gt;
&lt;li&gt;The crawler can now resume crawls in progress thanks to intermediate WARCs.&lt;/li&gt;
&lt;li&gt;The search engine can import several formats without external pre-processing.&lt;/li&gt;
&lt;li&gt;The Academia filter has been improved.&lt;/li&gt;
&lt;li&gt;The Recipe filter has been improved.&lt;/li&gt;
&lt;li&gt;The system now penalizes documents that have obvious hallmarks of &lt;a href="https://github.com/MarginaliaSearch/MarginaliaSearch/commit/e53bb70bef7dc833c88f689d6fbf052f45c9f3cb"&gt;being written by ChatGPT&lt;/a&gt; in its quality assessment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other technical changes:&lt;/p&gt;</description></item><item><title>A Frivolous Feature</title><link>https://www.marginalia.nu/log/96_frivolous_asn/</link><pubDate>Fri, 22 Dec 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/96_frivolous_asn/</guid><description>&lt;p&gt;Marginalia Search very recently gained the ability to filter results by Autonomous System,
not only searching by ASN but by the organization information for that AS. At a glance this
seems like a somewhat frivolous feature, but it has interesting effects.&lt;/p&gt;
&lt;p&gt;Autonomous Systems are part of the Internet&amp;rsquo;s routing infrastructure. If your mental model of an IP
number is that it is the phone number of the computer, this is something akin to a postal code.
Digging much deeper than that into BGP and autonomous systems is not really in the scope of this article, but
&lt;a href="https://en.wikipedia.org/wiki/Autonomous_system_%28Internet%29"&gt;Wikipedia&lt;/a&gt; has a relatively lucid article on this.&lt;/p&gt;</description></item><item><title>WARC'in the crawler</title><link>https://www.marginalia.nu/log/94_warc_warc/</link><pubDate>Wed, 20 Dec 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/94_warc_warc/</guid><description>&lt;p&gt;The Marginalia Crawler has seen improvements! A long term problem with the crawler design is
that if for whatever reason the crawler shuts down, then it needs to re-start fetching whatever
domains it was traversing at the time of termination from scratch.&lt;/p&gt;
&lt;p&gt;This isn&amp;rsquo;t fantastic, since not only does crawling a website take a fair bit of time,
it&amp;rsquo;s a nuisance for the server admins to re-crawl stuff that was already fetched, and
a real liability for ending up in robots.txt or some iptables ruleset.&lt;/p&gt;</description></item><item><title>Anchor Tags</title><link>https://www.marginalia.nu/log/93_atags/</link><pubDate>Tue, 07 Nov 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/93_atags/</guid><description>&lt;p&gt;I&amp;rsquo;ve been working on getting anchor tag keywords into the search engine,
basically using link texts to complement the keywords on a webpage.&lt;/p&gt;
&lt;p&gt;The problem I&amp;rsquo;m attempting to address is that many websites don&amp;rsquo;t really describe
themselves particularly well. As Steve Ballmer&amp;rsquo;s stage performance once illustrated,
merely repeating a word doesn&amp;rsquo;t on its own make what you&amp;rsquo;re saying relevant to the term.&lt;/p&gt;
&lt;p&gt;Another good example of how it falls short is &lt;a href="https://www.chiark.greenend.org.uk/~sgtatham/putty/"&gt;PuTTY&amp;rsquo;s website&lt;/a&gt;,
which will be used as a pilot case to improve.&lt;/p&gt;</description></item><item><title>Partitioning The Index</title><link>https://www.marginalia.nu/log/92_multirack_drifting/</link><pubDate>Mon, 30 Oct 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/92_multirack_drifting/</guid><description>&lt;p&gt;So a bit of an update on what I&amp;rsquo;ve been working on. This will be adapted into release notes in
a while, but I haven&amp;rsquo;t quite wrapped a bow on the change set yet.&lt;/p&gt;
&lt;p&gt;Still, it has certainly been a few weeks. Didn&amp;rsquo;t quite land how busy I&amp;rsquo;ve been until I sat down to
draft this post. Them&amp;rsquo;s some changes, and I&amp;rsquo;m skipping a few to keep this meandering post at a sane length.&lt;/p&gt;</description></item><item><title>Moving Marginalia to a New Server</title><link>https://www.marginalia.nu/log/90-new-server-design/</link><pubDate>Sat, 07 Oct 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/90-new-server-design/</guid><description>&lt;p&gt;So the search engine is moving to a new server soon, thanks to the generous grant
&lt;a href="https://www.marginalia.nu/log/88-futo-grant/"&gt;mentioned recently&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you visit search.marginalia.nu now, it may or may not use the old or new server. It&amp;rsquo;ll be like this for
a while, since I need them both for testing and maintenance type work.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll also apologize if this post is a bit chaotic. It is a reflection of a very chaotic couple of weeks that
apart from setting up this migration also involved a very short notice invitation for a
presentation at &lt;a href="https://opensearchfoundation.org/en/events-osf/5th-international-open-search-symposium-ossym2023/"&gt;ossym23&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Release Notes v2023.10.0</title><link>https://www.marginalia.nu/release-notes/v2023-10-0/</link><pubDate>Sat, 07 Oct 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2023-10-0/</guid><description>&lt;p&gt;This is a mostly technical release. It takes the index from 106M to 164M documents.&lt;/p&gt;
&lt;h2 id="zero-downtime-upgrades-and-halved-memory-consumption"&gt;Zero Downtime Upgrades and halved memory consumption&lt;/h2&gt;
&lt;p&gt;The initial focus of the release was to address the sometimes lengthy downtimes that have plagued the project when loading a new index.&lt;/p&gt;
&lt;p&gt;There is a somewhat &lt;a href="https://www.marginalia.nu/log/87_absurd_success/"&gt;lengthy write-up about this here&lt;/a&gt;, but the short version is that this was a very successful and drastic optimization that not only removed the needed downtime, but also added neat new features and &lt;strong&gt;slashed the RAM requirements in half&lt;/strong&gt;!&lt;/p&gt;</description></item><item><title>Marginalia Search receives FUTO Grant</title><link>https://www.marginalia.nu/log/88-futo-grant/</link><pubDate>Fri, 15 Sep 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/88-futo-grant/</guid><description>&lt;p&gt;I&amp;rsquo;m happy to announce that the generous people at &lt;a href="https://futo.org/"&gt;FUTO&lt;/a&gt; have granted the project $15,000 with no strings attached to help the search engine out with some more server power.&lt;/p&gt;
&lt;p&gt;FUTO is a young Austin, TX-based organization &amp;ldquo;&lt;em&gt;dedicated to developing, both through in-house engineering and investment, technologies that frustrate centralization and industry consolidation&lt;/em&gt;&amp;rdquo;. It&amp;rsquo;s one to keep an eye on, I believe their heart is in the right place and they have every possibility of making a real difference.&lt;/p&gt;</description></item><item><title>Absurd Success</title><link>https://www.marginalia.nu/log/87_absurd_success/</link><pubDate>Wed, 30 Aug 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/87_absurd_success/</guid><description>&lt;p&gt;So&amp;hellip; I&amp;rsquo;ve had the most unreal week of coding. Zero exaggeration, I&amp;rsquo;ve halved the
RAM requirements of the search engine, removed the need to take the system
offline during an upgrade, removed hard limits on how many documents can be indexed,
and quadrupled soft limits on how many keywords can be in the corpus.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s been a long term goal to keep it possible to run and operate the system
on low-powered hardware, and so far improvements have been made, to the point
where my 32 GB RAM developer machine feels spacious rather than cramped, but this
set of changes takes it several notches further.&lt;/p&gt;</description></item><item><title>Release Notes v2023.08.0</title><link>https://www.marginalia.nu/release-notes/v2023-08-0/</link><pubDate>Tue, 22 Aug 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2023-08-0/</guid><description>&lt;p&gt;This release mainly aims to improve the operational side of the search engine, with an emphasis on automating tedious manual processes and optimizing crawling and data processing to use fewer resources.&lt;/p&gt;
&lt;p&gt;Conventionally I try to link to relevant commits in these notes, but some of the changes were so sweeping and protracted it was hard to narrow it down to individual commits; in those cases I&amp;rsquo;ll link to the relevant code instead.&lt;/p&gt;</description></item><item><title>Message Queues, State Machines, Actors, UI</title><link>https://www.marginalia.nu/log/85-mq_sm_actor_ui/</link><pubDate>Sat, 12 Aug 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/85-mq_sm_actor_ui/</guid><description>&lt;p&gt;This is a bit of a &lt;em&gt;what I&amp;rsquo;ve been working on&lt;/em&gt; style of post. It&amp;rsquo;s also a bit of a complement for the
release notes of the upcoming release which should be dropping in a week or so. There&amp;rsquo;s some spit and
polish still missing from these things, but if I don&amp;rsquo;t write about them now too much will have been
ejected from the cache to make a well written post about it.&lt;/p&gt;</description></item><item><title>Release Notes v2023.06.0</title><link>https://www.marginalia.nu/release-notes/v2023-06-0/</link><pubDate>Thu, 29 Jun 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2023-06-0/</guid><description>&lt;h2 id="new-features"&gt;New Features&lt;/h2&gt;
&lt;h3 id="generator-keywords"&gt;Generator keywords&lt;/h3&gt;
&lt;p&gt;To provide additional ways of selecting search results, a synthetic keyword
has been added for the &lt;code&gt;&amp;lt;meta name=&amp;quot;generator&amp;quot; content=&amp;quot;...&amp;quot;&amp;gt;&lt;/code&gt; tag. This is basically a vanity
tag that is used by some HTML generators to advertise themselves, and it&amp;rsquo;s also
common for hand-edited HTML to include this tag with a string like &amp;ldquo;vim&amp;rdquo; or &amp;ldquo;myself&amp;rdquo;,
as a wink to human readers of the code.&lt;/p&gt;
&lt;p&gt;The generator keywords have the form &lt;code&gt;generator:value&lt;/code&gt;. For example, to search for websites made
with Hugo, you can use &lt;code&gt;generator:hugo&lt;/code&gt;. Generator categories have also been added as searchable
keywords, for example &lt;code&gt;generator:wiki&lt;/code&gt;, &lt;code&gt;generator:forum&lt;/code&gt;, &lt;code&gt;generator:docs&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Full Time</title><link>https://www.marginalia.nu/log/83_full_time/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/83_full_time/</guid><description>&lt;p&gt;I&amp;rsquo;m working on Marginalia Search full time.&lt;/p&gt;
&lt;p&gt;I left the office for the last time today, and it&amp;rsquo;s the strangest feeling. I&amp;rsquo;ve quit jobs, taken time off work, been laid off, but this is different from any of those things. This is deliberate.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a note of relief. I&amp;rsquo;ve essentially been working two pretty demanding jobs; one for pay and one for passion and the joy of making a difference.&lt;/p&gt;</description></item><item><title>Release Notes v2023.03.2</title><link>https://www.marginalia.nu/release-notes/v2023-03-2/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/release-notes/v2023-03-2/</guid><description>&lt;p&gt;This is primarily a bugfix release that addresses some issues with a metadata corruption that was introduced in the previous release.&lt;/p&gt;
&lt;h1 id="new-features"&gt;New Features&lt;/h1&gt;
&lt;h2 id="file-keywords"&gt;File keywords&lt;/h2&gt;
&lt;p&gt;To provide more tools for navigating the web, the converter now generates synthetic keywords for documents that link to files on the same server based on their file ending.&lt;/p&gt;
&lt;p&gt;If the document contains a link such as&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-html" data-lang="html"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&amp;lt;&lt;span style="color:#f92672"&gt;a&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;href&lt;/span&gt;&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;file.zip&amp;#34;&lt;/span&gt;&amp;gt;Download&amp;lt;/&lt;span style="color:#f92672"&gt;a&lt;/span&gt;&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;then the document will be tagged with the keyword &lt;code&gt;file:zip&lt;/code&gt; as well as &lt;code&gt;file:archive&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>Going Github</title><link>https://www.marginalia.nu/log/77-going-github/</link><pubDate>Sat, 25 Mar 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/77-going-github/</guid><description>&lt;p&gt;I&amp;rsquo;ve moved Marginalia&amp;rsquo;s sources to Github. Can&amp;rsquo;t pick every battle.&lt;/p&gt;
&lt;p&gt;The main reason is I&amp;rsquo;m kind of tired of the amount of spam bots that keep signing up to my Gitea. The juice of self-hosting a public-access git forge, even locked down to prevent arbitrary repo creation, that juice just isn&amp;rsquo;t worth the squeeze.&lt;/p&gt;
&lt;p&gt;This is not without some consideration.&lt;/p&gt;
&lt;p&gt;To be blunt, I don&amp;rsquo;t like Github. Their use of dark patterns leaves a real nasty after-taste. I&amp;rsquo;m also old enough to remember the Microsoft of the early 2000s very vividly.&lt;/p&gt;</description></item><item><title>Search Result Quality For Multiple Terms</title><link>https://www.marginalia.nu/log/76-search-result-quality-for-multiple-terms/</link><pubDate>Thu, 23 Mar 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/76-search-result-quality-for-multiple-terms/</guid><description>&lt;p&gt;This is a bit of a follow up to the previous post.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/75-grand-restructuring.gmi"&gt;The Grand Code Restructuring [ 2023-03-17 ]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Marginalia&amp;rsquo;s search result quality has, for a long while, been pretty good as long as your search query is a single term, but for multiple search terms it&amp;rsquo;s been a bit hit-and-miss. Marginalia was never great at this, but the quality of results in this usage pattern has taken a bit of a dive recently due to a re-write of the index last fall.&lt;/p&gt;</description></item><item><title>The Grand Code Restructuring</title><link>https://www.marginalia.nu/log/75-grand-restructuring/</link><pubDate>Fri, 17 Mar 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/75-grand-restructuring/</guid><description>&lt;p&gt;In general I don&amp;rsquo;t like to fuss over code, but this is exactly what I&amp;rsquo;ve been doing in preparation for the NLnet funded work. I&amp;rsquo;ve spent the last month restructuring Marginalia&amp;rsquo;s code base. It&amp;rsquo;s not completely done, but I&amp;rsquo;ve made great headway.&lt;/p&gt;
&lt;p&gt;Things got the way they got because in general for experimental solo-development projects, I think it makes sense to be fairly tolerant of technical debt.&lt;/p&gt;
&lt;p&gt;Since refactoring is something that is extremely difficult to break up into parallel tracks or do in small iterations, the cost of refactoring is effectively multiplied by the number of people that could be working on the code.&lt;/p&gt;</description></item><item><title>Marginalia Search: 2 years, big news</title><link>https://www.marginalia.nu/log/74-marginalia-2-years/</link><pubDate>Sun, 26 Feb 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/74-marginalia-2-years/</guid><description>&lt;p&gt;No time like the project&amp;rsquo;s two year anniversary to drop this particular bomb&amp;hellip;&lt;/p&gt;
&lt;p&gt;Marginalia&amp;rsquo;s gotten an NLnet grant. This means I&amp;rsquo;ll be able to work full time on this project for at least a year.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nlnet.nl/project/Marginalia/"&gt;https://nlnet.nl/project/Marginalia/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This grant is essentially the best-case scenario for funding this project. It&amp;rsquo;ll be able to remain independent, open-source, and non-profit.&lt;/p&gt;
&lt;p&gt;I won&amp;rsquo;t start in earnest for a few months as I&amp;rsquo;ve got loose ends to tie up before I can devote that sort of time. More details to come, but I&amp;rsquo;ll say this much: the first step is a tidying up of the sources and a move off my self-hosted git instance to an external git host yet to be decided.&lt;/p&gt;</description></item><item><title>Faster Index Joins</title><link>https://www.marginalia.nu/log/70-faster-index-joins/</link><pubDate>Tue, 03 Jan 2023 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/70-faster-index-joins/</guid><description>&lt;p&gt;The most common (and most costly) operation of the marginalia search engine&amp;rsquo;s index is something like: given a set of documents containing one keyword, find each document containing another keyword.&lt;/p&gt;
&lt;p&gt;The naive approach is to just iterate over each document identifier in the first set and do a membership test in the b-tree containing the second. This is an O(m log n)-operation, which on paper is pretty fast.&lt;/p&gt;
&lt;p&gt;It turns out it can be made faster.&lt;/p&gt;</description></item><item><title>Creepy Website Similarity</title><link>https://www.marginalia.nu/log/69-creepy-website-similarity/</link><pubDate>Mon, 26 Dec 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/69-creepy-website-similarity/</guid><description>&lt;p&gt;This is a write-up about an experiment from a few months ago, in how to find websites that are similar to each other. Website similarity is useful for many things, including discovering new websites to crawl, as well as suggesting similar websites in the Marginalia Search random exploration mode.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://explore2.marginalia.nu/"&gt;A link to a slapdash interface for exploring the experimental data.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The approach chosen was to use the link graph to look for websites that are linked to from the same websites. This turned out to work remarkably well.&lt;/p&gt;</description></item><item><title>About Marginalia Search</title><link>https://www.marginalia.nu/marginalia-search/about/</link><pubDate>Fri, 23 Dec 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/marginalia-search/about/</guid><description>&lt;div style="border: 1px solid red; padding-left: 1ch; padding-right: 1ch;"&gt;
&lt;p&gt;&lt;strong&gt;This information is outdated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The marginalia search project info now lives on &lt;a href="https://about.marginalia-search.com/"&gt;about.marginalia-search.com&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Ever feel like the Internet has gotten a bit&amp;hellip; I don&amp;rsquo;t know, samey? There&amp;rsquo;s funny images scrolling by and you blow some air through your nose and keep scrolling and then someone has done something upsetting and you write an angry comment and then you scroll some more.&lt;/p&gt;
&lt;p&gt;Remember when you used to explore the Internet, when you used to discover cool little websites made by people and it wasn&amp;rsquo;t just a bunch of low effort content mill listicles and blog spam?&lt;/p&gt;</description></item><item><title>Carbon Dating HTML</title><link>https://www.marginalia.nu/log/66-carbon-dating/</link><pubDate>Thu, 27 Oct 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/66-carbon-dating/</guid><description>&lt;p&gt;One of the more common feature requests I&amp;rsquo;ve gotten for Marginalia Search is the ability to search by date. I&amp;rsquo;ve been a bit reluctant because this has the smell of a surprisingly hard problem. Or rather, a surprisingly large number of easy problems.&lt;/p&gt;
&lt;p&gt;The initial hurdle we&amp;rsquo;ll encounter is that among structured data, pubDate is available in RDFa, OpenGraph, JSON-LD, and Microdata.&lt;/p&gt;
&lt;p&gt;A few examples:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;&amp;lt;meta property=&amp;#34;datePublished&amp;#34; content=&amp;#34;2022-08-24&amp;#34; /&amp;gt;
&amp;lt;meta itemprop=&amp;#34;datePublished&amp;#34; content=&amp;#34;2022-08-24&amp;#34; /&amp;gt;
&amp;lt;meta property=&amp;#34;article:published_time&amp;#34; content=&amp;#34;2022-08-24T14:39:14Z&amp;#34; /&amp;gt;
&amp;lt;script type=&amp;#34;application/ld+json&amp;#34;&amp;gt;
{&amp;#34;datePublished&amp;#34;:&amp;#34;2022-08-24T14:39:14Z&amp;#34;}
&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So far, not so bad. This is at least a case where the web site tells you that here is the pub-date; the exact format of the date may vary, but this is solvable.&lt;/p&gt;</description></item><item><title>Marginalia's Index Reaches 100,000,000 Documents</title><link>https://www.marginalia.nu/log/64-hundred-million/</link><pubDate>Fri, 21 Oct 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/64-hundred-million/</guid><description>&lt;p&gt;A very brief note to announce reaching a long term goal and major milestone for marginalia search.&lt;/p&gt;
&lt;p&gt;The search engine now indexes 106,857,244 documents!&lt;/p&gt;
&lt;p&gt;The previous record was a bit south of seventy million. A hundred million has been a pie-in-the-sky goal for a very long time. It&amp;rsquo;s seemed borderline impossible to index that many documents on a PC. Turns out it&amp;rsquo;s not. It&amp;rsquo;s more than possible.&lt;/p&gt;
&lt;p&gt;Twice this may even be technically doable, but is way past the pain point of sheer logistics. It&amp;rsquo;s already a real headache to deal with this much data.&lt;/p&gt;</description></item><item><title>Fragments of the Old Web</title><link>https://www.marginalia.nu/links/fragments-old-web/</link><pubDate>Thu, 15 Sep 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/links/fragments-old-web/</guid><description>&lt;p&gt;The following is a list of curiosities I&amp;rsquo;ve found while crawling the internet looking for websites. A sort of greatest hits from my search engine.&lt;/p&gt;
&lt;p&gt;There is no real system or theme to the links, other than the fact that they have made me go &amp;ldquo;huh, that&amp;rsquo;s neat&amp;rdquo; while visiting them.&lt;/p&gt;
&lt;p&gt;Most of these are effectively impossible to find on Google, since they don&amp;rsquo;t use HTTPS, and aren&amp;rsquo;t optimized for mobile, and aren&amp;rsquo;t plastered in ads and tracking scripts.&lt;/p&gt;</description></item><item><title>Donate To This Project</title><link>https://www.marginalia.nu/marginalia-search/supporting/</link><pubDate>Mon, 05 Sep 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/marginalia-search/supporting/</guid><description>&lt;div style="border: 1px solid red; padding-left: 1ch; padding-right: 1ch;"&gt;
&lt;p&gt;&lt;strong&gt;This information is outdated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The marginalia search project info now lives on &lt;a href="https://about.marginalia-search.com/"&gt;about.marginalia-search.com&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I&amp;rsquo;m just one guy building all of this on my own. I&amp;rsquo;d like to expand the search engine and make it more useful. My hope is that it will bring value to its users and enable a thriving independent Internet.&lt;/p&gt;
&lt;p&gt;The search engine doesn&amp;rsquo;t have any secret sauce, all the &lt;a href="https://git.marginalia.nu/"&gt;source code is publicly available&lt;/a&gt; and as far as is legally and logistically possible, the &lt;a href="https://downloads.marginalia.nu/"&gt;data&lt;/a&gt; is also available.&lt;/p&gt;</description></item><item><title>The Evolution of Marginalia's crawling</title><link>https://www.marginalia.nu/log/63-marginalia-crawler/</link><pubDate>Tue, 23 Aug 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/63-marginalia-crawler/</guid><description>&lt;p&gt;In the primordial days of Marginalia Search, it used a dynamic approach to crawling the Internet.&lt;/p&gt;
&lt;p&gt;It ran a number of crawler threads, 32 or 64 or some such, that fetched jobs from a director service, which grabbed them straight out of the URL database. These jobs were batches of 100 or so documents that needed to be crawled.&lt;/p&gt;
&lt;p&gt;Crawling was not planned ahead of time; rather, a combination of how much of a website had been visited and the quality score of that website determined where to go next. It also promoted crawling websites adjacent to high quality websites.&lt;/p&gt;</description></item><item><title>Marginaliacoin, and hidden forums</title><link>https://www.marginalia.nu/log/62-marginaliacoin/</link><pubDate>Thu, 18 Aug 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/62-marginaliacoin/</guid><description>&lt;p&gt;I discovered someone has made a cryptocurrency called &amp;ldquo;Memex Marginalia Inu&amp;rdquo;. It appears to have been created February 23, which is around when the entry &amp;ldquo;I Have No Capslock And I Must Scream&amp;rdquo; went absurdly viral to the point where Elon Musk tweeted a link to it.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.marginalia.nu/log/48-i-have-no-capslock.gmi"&gt;I Have No Capslock&amp;hellip;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mr Musk&amp;rsquo;s twitter orbit is exceptionally strange. The tweet was followed by a deluge of bizarre activity, strange emails with calls about stonk canine lunar expeditions, and apparently also a cryptocurrency land-grab of sorts. I can&amp;rsquo;t claim to understand why, but many of the emails I got after the tweet were on the theme &amp;ldquo;what does this mean?&amp;rdquo;, almost as though Elon&amp;rsquo;s tweet was some sort of prophetic omen.&lt;/p&gt;</description></item><item><title>Fun with Anchor Text Keywords</title><link>https://www.marginalia.nu/log/59-anchor-text/</link><pubDate>Thu, 23 Jun 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/59-anchor-text/</guid><description>&lt;p&gt;Anchor texts are a very useful source of keywords for a search engine, and in an older version of the search engine, it used the text of such hyperlinks as a supplemental source for keywords, but due to a few redesigns, this feature has fallen off.&lt;/p&gt;
&lt;p&gt;The last few days have been spent trying to re-implement it in a new and more powerful fashion. This has largely been enabled by a crawler re-design from a few months ago, which offers the crawled data in a lot more useful fashion and allows a lot more flexible post-processing.&lt;/p&gt;</description></item><item><title>marginalia.nu goes open source</title><link>https://www.marginalia.nu/log/58-marginalia-open-source/</link><pubDate>Fri, 27 May 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/58-marginalia-open-source/</guid><description>&lt;p&gt;After a bit of soul searching with regards to the future of the website, I&amp;rsquo;ve decided to open source the code for marginalia.nu, all of its services, including the search engine, encyclopedia, memex, etc.&lt;/p&gt;
&lt;p&gt;A motivating factor is the search engine has sort of grown to a scale where it&amp;rsquo;s becoming increasingly difficult to productively work on as a personal solo project. It needs more structure. What&amp;rsquo;s kept me from open sourcing it so far has also been the need for more structure. The needs of the marginalia project, and the needs of an open source project have effectively aligned.&lt;/p&gt;</description></item><item><title>Uncertain Future For Marginalia Search</title><link>https://www.marginalia.nu/log/56-uncertain-future/</link><pubDate>Thu, 28 Apr 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/56-uncertain-future/</guid><description>&lt;p&gt;I found myself effectively without a job on short notice.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m not at all worried about finding another one, I have savings, and I have experience, and I have demonstrable skill. What I am concerned about is finding a source of income that&amp;rsquo;s compatible with putting some time into my personal projects.&lt;/p&gt;
&lt;p&gt;Last bunch of years, I&amp;rsquo;ve been working 32 hour weeks, which is a pretty sweet deal especially combined with the zero hour commute you get working from home during the pandemic. Not every employer is fine with that, and while I do have options, I&amp;rsquo;m in a worse bargaining position than I have been before.&lt;/p&gt;</description></item><item><title>Lexicon Architectural Rubberducking</title><link>https://www.marginalia.nu/log/55-lexicon-rubberduck/</link><pubDate>Mon, 11 Apr 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/55-lexicon-rubberduck/</guid><description>&lt;p&gt;I&amp;rsquo;m going to think out loud for a moment about a problem I&amp;rsquo;m considering.&lt;/p&gt;
&lt;p&gt;RAM is a precious resource on any server. Look at VPS servers, and you&amp;rsquo;ll be hard pressed to find one with much more than 32 Gb. Look at leasing a dedicated server, and it&amp;rsquo;s the RAM that really drives up the price. My server has 128 Gb, and it&amp;rsquo;s so full it needs to unbutton its pants to sit down comfortably. Anything I can offload to disk is great.&lt;/p&gt;</description></item><item><title>The Bargain Bin B-Tree</title><link>https://www.marginalia.nu/log/54-bargain-bin-btree/</link><pubDate>Thu, 07 Apr 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/54-bargain-bin-btree/</guid><description>&lt;p&gt;I&amp;rsquo;ve been working lately on a bit of an overhaul of how the search engine does indexing. How it indexes its indices. &amp;ldquo;Index&amp;rdquo; is a bit of an overloaded term here, and it&amp;rsquo;s not the first that will crop up.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s start from the beginning and build up and examine the problem of searching for a number in a list of numbers. You have a long list of numbers, let&amp;rsquo;s sort them because why not.&lt;/p&gt;</description></item><item><title>Growing Pains</title><link>https://www.marginalia.nu/log/52-growing-pains/</link><pubDate>Wed, 23 Mar 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/52-growing-pains/</guid><description>&lt;p&gt;The search engine index has grown quite considerably the last few weeks. It&amp;rsquo;s actually surpassed 50 million documents, which is quite some milestone. In February it was sitting at 27-28 million or so.&lt;/p&gt;
&lt;p&gt;About 80% of this is side-loading all of stackoverflow and stackexchange, and part of it is additional crawling.&lt;/p&gt;
&lt;p&gt;The crawler has to date fetched 91 million URLs, but only about a third of what is fetched actually qualifies for indexing for various reasons, some links may be dead, some may be redirects, some may just have too much javascript and cruft to qualify.&lt;/p&gt;</description></item><item><title>Marginalia Search: 1 year</title><link>https://www.marginalia.nu/log/49-marginalia-1-year/</link><pubDate>Fri, 25 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/49-marginalia-1-year/</guid><description>&lt;p&gt;I&amp;rsquo;ve caught some bug and don&amp;rsquo;t have the energy to write more than a brief note.&lt;/p&gt;
&lt;p&gt;I want to commemorate the fact that work on the Marginalia search engine started one year ago. The first commit was on February 26th 2021, and contained a sketch for a website crawler and some data models.&lt;/p&gt;
&lt;p&gt;In many ways, the paint is barely dry, yet it feels like this project has been around for a long while.&lt;/p&gt;</description></item><item><title>The Anatomy of Search Engine Spam</title><link>https://www.marginalia.nu/log/46-anatomy-of-search-engine-spam/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/46-anatomy-of-search-engine-spam/</guid><description>&lt;p&gt;Black hat SEO is an endlessly fascinating phenomenon to study. This post is about some tactics they use to make their sites rank higher.&lt;/p&gt;
&lt;p&gt;The goal of blackhat SEO is to boost the search engine ranking of a page nobody particularly wants to see, usually ePharma, escort services, online casinos, shitcoins, hotel bookings; the bermuda pentagon of shady websites.&lt;/p&gt;
&lt;p&gt;The theory behind most modern search engines is that if you get links from a high ranking domain, then your domain gets a higher ranking as well, which increases the traffic. The reality is a little more complicated than that, but this is a sufficient mental model to understand the basic how-to.&lt;/p&gt;</description></item><item><title>Can we unfuck internet discoverability?</title><link>https://www.marginalia.nu/log/45-unfuck-internet-discoverability/</link><pubDate>Fri, 04 Feb 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/45-unfuck-internet-discoverability/</guid><description>&lt;p&gt;I&amp;rsquo;ve been thinking a lot about how difficult it has become to discover quality content on the Internet, not because it isn&amp;rsquo;t there, but because the signal to noise ratio is really bad, and most venues of discovery don&amp;rsquo;t seem to be able to handle it.&lt;/p&gt;
&lt;p&gt;Recommendation algorithms seem to work almost too well, to the point where it&amp;rsquo;s all kind of just showing you things you already like, rarely anything new that you might like. It&amp;rsquo;s an absolute tragedy both for small websites and for their potential audience.&lt;/p&gt;</description></item><item><title>Discovery and Design Considerations</title><link>https://www.marginalia.nu/log/44-discovery-and-design/</link><pubDate>Tue, 18 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/44-discovery-and-design/</guid><description>&lt;p&gt;It&amp;rsquo;s been a productive several weeks. I&amp;rsquo;ve got the feature pulling updates from RSS working, as mentioned earlier.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve spent the last weeks designing the search engine&amp;rsquo;s web design, and did the MEMEX too for good measure.&lt;/p&gt;
&lt;p&gt;It needed to be done as the blog theme that the design was previously based on had several problems, including loading a bunch of unnecessary fonts, and not using the screen space of desktop browsers well at all.&lt;/p&gt;</description></item><item><title>Search Result Relevance</title><link>https://www.marginalia.nu/log/41-search-result-relevance/</link><pubDate>Fri, 10 Dec 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/41-search-result-relevance/</guid><description>&lt;p&gt;This entry is about a few problems the search engine has been struggling with lately, and how I&amp;rsquo;ve been attempting to remedy them.&lt;/p&gt;
&lt;p&gt;Before the article starts, I wanted to share an amusing new thing in the world of Internet spam.&lt;/p&gt;
&lt;p&gt;For a while, people have been adding things like &amp;ldquo;reddit&amp;rdquo; to the end of their Google queries to get less blog spam. Well, guess what? The blog spammers are adding &amp;ldquo;reddit&amp;rdquo; to the end of their titles now.&lt;/p&gt;</description></item><item><title>Old and New</title><link>https://www.marginalia.nu/log/38-old-and-new/</link><pubDate>Fri, 12 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/38-old-and-new/</guid><description>&lt;p&gt;I&amp;rsquo;ve been thinking recently about the emphasis put on &amp;ldquo;new&amp;rdquo;, specifically for search engines, but the discussion has some merit even in a wider context. I will start wide and narrow down.&lt;/p&gt;
&lt;p&gt;It is common to conflate new with good, and most people who were young sometime between 1950 and 2000 will indeed have seen marvellous improvements in quality of life and technology with each passing year. In the light of that, it&amp;rsquo;s at least easy to explain how one might confuse the two.&lt;/p&gt;</description></item><item><title>A Jaunt Through Keyword Extraction</title><link>https://www.marginalia.nu/log/37-keyword-extraction/</link><pubDate>Thu, 11 Nov 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/37-keyword-extraction/</guid><description>&lt;p&gt;Search results are only as good as the search engine&amp;rsquo;s ability to figure out what a page is about. Sure a keyword may appear in a page, but is it the topic of the page, or just some off-hand mention?&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t really know anything about data mining or keyword extraction starting out, so I&amp;rsquo;ve had to learn on the fly. I&amp;rsquo;m just going to briefly list some of my first naive attempts at keyword extraction, just to give a context.&lt;/p&gt;</description></item><item><title>Shaking N-gram needles from large haystacks</title><link>https://www.marginalia.nu/log/31-ngram-needles/</link><pubDate>Fri, 22 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/31-ngram-needles/</guid><description>&lt;p&gt;A recurring problem when searching for text is identifying which parts of the text are in some sense useful. A first order solution is to just extract every word from the text, and match documents against whether they contain those words. This works really well if you don&amp;rsquo;t have a lot of documents to search through, but as the corpus of documents grows, so does the number of matches.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s possible to bucket the words based on where they appear in the document, but this is not something I&amp;rsquo;m doing at the moment and not something I will implement in the foreseeable future.&lt;/p&gt;</description></item><item><title>The Mystery of the Ceaseless Botnet DDoS</title><link>https://www.marginalia.nu/log/29-botnet-ddos/</link><pubDate>Sun, 10 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/29-botnet-ddos/</guid><description>&lt;p&gt;I&amp;rsquo;ve been dealing with a botnet for the last few days, that&amp;rsquo;s been sending junk search queries at an increasingly aggressive rate. They were reasonably easy to flag and block but just kept increasing the rate until that stopped working.&lt;/p&gt;
&lt;p&gt;Long story short, my patience ran out and I put my website behind Cloudflare. I didn&amp;rsquo;t want to have to do this, because it does introduce a literal man in the middle and that kinda undermines the whole point of HTTPS, but I just don&amp;rsquo;t see any way around it. I just can&amp;rsquo;t spend every waking hour playing whac-a-mole with thousands of compromised servers flooding me with 50,000 search requests an hour. That&amp;rsquo;s five-six times more than when I was on the front page of HackerNews, and the attempts only increased.&lt;/p&gt;</description></item><item><title>Web Browsing</title><link>https://www.marginalia.nu/log/28-web-browsing/</link><pubDate>Sat, 09 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/28-web-browsing/</guid><description>&lt;p&gt;An idea I&amp;rsquo;ve had for a long time with regards to navigating the web is to find a way to browse it.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Browse&amp;rdquo; is a difficult word to use, because it has a newer connotation of just using a web browser. I mean it in the old pre-Internet sense: browse like when you flip through a magazine, or peruse an antiques shop, not really looking for anything in particular, just sort of seeing if anything catches your eye.&lt;/p&gt;</description></item><item><title>Getting with the times</title><link>https://www.marginalia.nu/log/27-getting-with-the-times/</link><pubDate>Wed, 06 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/27-getting-with-the-times/</guid><description>&lt;p&gt;Since my search engine has expanded its scope to include blogs as well as primordial text documents, I&amp;rsquo;ve done some thinking about how to keep up with newer websites that actually grow and see updates.&lt;/p&gt;
&lt;p&gt;Otherwise, as the crawl goes on, it tends to find fewer and fewer interesting web pages, and as the interesting pages are inevitably crawled to exhaustion, accumulate an ever growing amount of junk.&lt;/p&gt;
&lt;p&gt;Re-visiting each page and looking for new links in previously visited pages is probably off the table; that&amp;rsquo;s something I can maybe do once a month.&lt;/p&gt;</description></item><item><title>Experimenting with Personalized PageRank</title><link>https://www.marginalia.nu/log/26-personalized-pagerank/</link><pubDate>Sat, 02 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/26-personalized-pagerank/</guid><description>&lt;p&gt;The last few days I&amp;rsquo;ve felt like my first attempt at a ranking algorithm for the search engine was pretty good, like it was producing some pretty interesting results. It felt close to what I wanted to accomplish.&lt;/p&gt;
&lt;p&gt;The first ranking algorithm was a simple link-counting algorithm that did some weighting to promote pages that look a certain way. It did seem to keep the page quality up, but as a strange side effect it also seemed to promote very &amp;ldquo;1996&amp;rdquo;-looking websites. This isn&amp;rsquo;t quite what I wanted to accomplish; I wanted to promote new sites as well, as long as they were rich in content.&lt;/p&gt;</description></item><item><title>Astrolabe - The October Update</title><link>https://www.marginalia.nu/log/25-october-update/</link><pubDate>Fri, 01 Oct 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/25-october-update/</guid><description>&lt;ul&gt;
&lt;li&gt;&lt;a href="https://search.marginalia.nu"&gt;https://search.marginalia.nu&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The October Update is live. It introduced drastically improved topic identification and an actual ranking algorithm; and the result is interesting to say the least. What&amp;rsquo;s striking is how much it&amp;rsquo;s beginning to feel like a search engine. When it fails to find stuff, you can kinda see how.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve played with it for a while now and it does seem to produce relevant results for a lot of topics. A trade down in whimsical results but a big step up if you are looking for something specific, at least within the domain of topics where there are results to find.&lt;/p&gt;</description></item><item><title>Against the Flood</title><link>https://www.marginalia.nu/log/22-against-the-flood/</link><pubDate>Sun, 19 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/22-against-the-flood/</guid><description>&lt;p&gt;So hacker news apparently discovered my search engine, and really took a liking to the idea. Actually that&amp;rsquo;s a bit of an understatement, the thread has gotten 3.3k points and lingered on the front page for half a week. And I wasn&amp;rsquo;t planning for it to go quite that public yet. It has quietly been online for a while, but it was only very recently it started to feel like it was really coming together. It wasn&amp;rsquo;t perfect, there was still a lot of jankiness and limitations that could have been fixed with more time. The index was half the size it should have been. Someone discovered it and shared it. It took off like a rocket, and I&amp;rsquo;m still at a loss for words at the reception it&amp;rsquo;s gotten. I have received so many encouraging comments, emails, offers of collaboration, a few have even joined the patreon. I&amp;rsquo;ve been working through all the messages and I aim to reply to them all, but it takes time. 
I&amp;rsquo;m very grateful for all of this, since I half thought I was alone in this.&lt;/p&gt;</description></item><item><title>New Solutions Creating Old Problems</title><link>https://www.marginalia.nu/log/21-new-solutions-old-problems/</link><pubDate>Tue, 14 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/21-new-solutions-old-problems/</guid><description>&lt;p&gt;I&amp;rsquo;ve spent some time the last week optimizing how the search engine identifies appropriate search results, putting far more consideration into where and how the search terms appear in the page when determining the order they are presented.&lt;/p&gt;
&lt;p&gt;Search-result relevance is a pretty difficult problem, but I do think the changes have taken the search engine in a very good direction.&lt;/p&gt;
&lt;p&gt;A bit simplified, I&amp;rsquo;m building tiered indices, ranging from&lt;/p&gt;</description></item><item><title>The Curious Case of the Dot-Com Link Farms</title><link>https://www.marginalia.nu/log/20-dot-com-link-farms/</link><pubDate>Thu, 09 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/20-dot-com-link-farms/</guid><description>&lt;p&gt;I spent some time today weeding out yet more link-farms from my search engine&amp;rsquo;s index.&lt;/p&gt;
&lt;p&gt;Typically what I would do is just block the subnet assigned to the VPS provider they&amp;rsquo;re on, and that does seem to work rather well. The cloud providers that don&amp;rsquo;t police what they host are almost always home to quite a lot of this stuff, so I don&amp;rsquo;t particularly mind scorching some earth in the name of a clean index.&lt;/p&gt;</description></item><item><title>The Small Website Discoverability Crisis</title><link>https://www.marginalia.nu/log/19-website-discoverability-crisis/</link><pubDate>Wed, 08 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/19-website-discoverability-crisis/</guid><description>&lt;p&gt;There are a lot of small websites on the Internet: Interesting websites, beautiful websites, unique websites.&lt;/p&gt;
&lt;p&gt;Unfortunately they are incredibly hard to find. You cannot find them on Google or Reddit, and while you can stumble onto them with my search engine, it is not in a very directed fashion.&lt;/p&gt;
&lt;p&gt;It is an unfortunate state of affairs. Even if you do not particularly care for becoming the next big thing, it&amp;rsquo;s still discouraging to put work into a website and get next to no traffic beyond the usual bots.&lt;/p&gt;</description></item><item><title>Soaring High</title><link>https://www.marginalia.nu/log/18-soaring-high/</link><pubDate>Thu, 02 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/18-soaring-high/</guid><description>&lt;p&gt;I&amp;rsquo;m currently indexing with my search engine. This isn&amp;rsquo;t an always-on sort of an affair, but rather something I turn on and off as it tends to require at least some degree of babysitting.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also been knocked out by the side-effects of the vaccine shot I got the other day, so it&amp;rsquo;s been mostly hands-off &amp;ldquo;parenting&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;What I&amp;rsquo;m trying to figure out is just how far I can take it. I really don&amp;rsquo;t know. I took some backups and just let it do its thing relatively unmonitored.&lt;/p&gt;</description></item><item><title>The Astrolabe Part II: The Magic Power of Sampling Bias</title><link>https://www.marginalia.nu/log/10-astrolabe-2-sampling-bias/</link><pubDate>Tue, 03 Aug 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/10-astrolabe-2-sampling-bias/</guid><description>&lt;p&gt;As I have mentioned earlier, perhaps the biggest enemy of PageRank is the hegemony of PageRank-style algorithms. Once an algorithm like that becomes not only dominant, but known, it also creates a market for leveraging its design particulars.&lt;/p&gt;
&lt;p&gt;Homogenous ecosystems are almost universally bad. It doesn&amp;rsquo;t really matter if it&amp;rsquo;s every computer running Windows XP, or every farmer planting genetically identical barley, what you get is extreme susceptibility to exploitation.&lt;/p&gt;</description></item><item><title>Index Optimizations</title><link>https://www.marginalia.nu/log/06-optimization/</link><pubDate>Fri, 23 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/06-optimization/</guid><description>&lt;blockquote&gt;
&lt;p&gt;Don&amp;rsquo;t chase small optimizations&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Said some smart person at some particular time, probably. If not, he ought to have; if worse comes to worst, I&amp;rsquo;m declaring it now. The cost of 2% here and 0.5% there is high, and the benefits are (by definition) low.&lt;/p&gt;
&lt;p&gt;I have been optimizing Astrolabe, my search engine. The different kind of Search Engine Optimization. I&amp;rsquo;ve spent a lot of time recently doing soft optimization, improving the quality and relevance of search results, to great results. I&amp;rsquo;ll write about that later.&lt;/p&gt;</description></item><item><title>On Link Farms</title><link>https://www.marginalia.nu/log/04-link-farms/</link><pubDate>Wed, 14 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/04-link-farms/</guid><description>&lt;p&gt;I&amp;rsquo;m in the midst of rebuilding the index of my search engine to allow for better search results, and I&amp;rsquo;ve yet again found need to revisit how I handle link farms. It&amp;rsquo;s an ongoing arms race between search engines and link farmers to adjust (and circumvent) the detection algorithms. Detection and mitigation of link farms is something I&amp;rsquo;ve found I need to modify very frequently, as they are constantly evolving to look more like real websites.&lt;/p&gt;</description></item><item><title>The Astrolabe Part I: Lenscraft</title><link>https://www.marginalia.nu/log/01-astrolabe/</link><pubDate>Wed, 07 Jul 2021 00:00:00 +0000</pubDate><guid>https://www.marginalia.nu/log/01-astrolabe/</guid><description>&lt;p&gt;Something you probably know, but may not have thought about a lot is that the Internet is large. It is unbelievably vast beyond any human comprehension. What you think of as &amp;ldquo;The Internet&amp;rdquo; is a tiny fraction of that vast space with its billions upon billions of websites.&lt;/p&gt;
&lt;p&gt;We use various technologies, such as link aggregators and search engines, to find our way and make sense of it all. Our choices in navigational aids also shape the experience we have of the Internet. These convey a warped sense of what the Internet truly is. There is no way of not doing that. Since nothing can communicate the raw reality of the Internet to a human mind, concessions need to be made. Some content needs to be promoted, other content needs to be de-emphasized. An objective rendering is a pipe dream; even a fair random sample is a noisy, incomprehensible mess.&lt;/p&gt;</description></item></channel></rss>