Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary

176 points by hiAndrewQuinn a day ago on hackernews | 31 comments

lscharen | 23 hours ago

I was halfway through the article and began thinking that his described data structure sounded very familiar to something I used about 20 years ago.

Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:

DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG)

[OP] hiAndrewQuinn | 22 hours ago

Apparently the structure itself has a bit of a history. The word 'rediscovery' tipped me off to go to Wikipedia myself and read up more about this.

First Blumer et al., 1983 came up with a "DAWG", but reading the abstract [1] I was left a little confused as to how exactly we get from 'here is how we store all substrings of a string in O(|string|) space, with "is this a substring [yn]" recognition in O(|substring|) time' to the modern DAFSA, as cool and useful as that is. Come to think of it I bet I could use that in some LeetCode problems.

But the structure we actually think of as a DAWG or DAFSA (or FST, I guess, thanks to this Rust crate) is in the paper "The World’s Fastest Scrabble Program". That worked but you had to construct a whole trie first, then compact it down, so build time was a memory hog. Then Dr. Daciuk of 3city sharpened the blade in 2000 by proving that this was as good as it gets in the unsorted case, but on a sorted set you could build the DAFSA incrementally, because an increasingly large part of the graph you were building was already optimized.

And then from there BurntSushi got involved with the implementation and the rest is history.

[1]: https://www.sciencedirect.com/science/article/pii/0304397585...

(That's what I can glean from ~30 minutes of not particularly focused reading. Forgive me if I have made any mistakes.)

lscharen | 18 hours ago

An interesting property of DAWGs is that the compact/compressed variation (CDAWG) can be built in linear time.

https://moodle2.units.it/pluginfile.php/718375/mod_resource/...

[OP] hiAndrewQuinn | 15 hours ago

oh now that's rad!

carefulfungi | 21 hours ago

I recently learned about a related data structure designed for scrabble style word searches: https://en.wikipedia.org/wiki/GADDAG.

zozbot234 | 18 hours ago

Isn't this just a regular expression?

Hendrikto | 22 hours ago

> I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing.5 The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.

This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.

baublet | 22 hours ago

Came here to add this, too. Sometimes the most valuable thing a solution can buy you is time to think of a better solution.

chrisweekly | 22 hours ago

Yes - tho I'd refine it to say it's not just more time, it's time plus the forcing function of a working solution requiring more deeply understanding the problem space.

simmonmt | 21 hours ago

It jumped out at me too, but because I wondered what it would look like in the AI version of this story. Having had it build the SQL version do you ... a) miss the leap because you don't understand how it works, don't care to know, and go off to vibe the next thing b) ask it lots of questions because reasons to develop that deep understanding then make the leap or c) rely on it (prompt: "this can't be good enough do better") to go make the leap for you.

(Assuming for the sake of argument that you guided it to the SQL version first)

embedding-shape | 21 hours ago

Depends on what your overall goal is with what you're building. Is it to rush out as many features as you possible can, before VC-funding falls through the floor so you too can get a slice of the pie before the party is over? Or are you "retired" in your 30s and now have time to build the perfect software for you? Do you need to publish and release an experiment to see how people react to it or use it, before you can know if it's the right thing or not?

Almost everything needs to be contextualized before you can even begin to answer what the right way forward is, depends so heavily on what situation you're in.

goosejuice | 20 hours ago

Fwiw, giving opus 4.7 two sentences about building a cli doing Finnish to English translation and looking for a space efficient solution leads to an answer pointing to fst. For the same reasons stated in the blog. This is without a search tool.

The K shaped LLM scenario makes a lot of sense to me. Educated and experienced devs get better output because they know what to ask.

moron4hire | 20 hours ago

Technical debt, like all forms of debt, can be used for leverage.

Imustaskforhelp | 20 hours ago

I feel like it is important to manage the risk and to clearly manage this debt. personally I try to stay safe from both debt and technical debt until there are sound reasons for both.

It is KISS stack for me personally (Keep it stupid simple)

I would still consider technical debt to be different than other forms of debt though, It feels way more of a tradeoff to me but perhaps all debt can be classified as such. Either way I think it makes for an interesting decision nonetheless.

kanwisher | 19 hours ago

If the product fails all technical debt goes away. So using technical debt to prove out a product is very different. Cause often times you don’t have the money or resources to build correctly at first since you don’t know if the expected payoff will be there

moron4hire | 13 hours ago

Yeah, how you end up acquiring technical debt can mean a lot. If it's because the system is underspecified, then your technical debt is really a prototype. If it's because your team doesn't know what they are doing, you're kind of screwed either way. But if you have the right requirements and the right team and you know you're trading money today for time tomorrow, it can be the difference between meeting a deadline or getting canned.

NooneAtAll3 | 18 hours ago

problem with such approach is that sometimes "obvious easy thing" becomes so entrenched and affecting everything, that ripping it out becomes unproportionally monumental task

miki123211 | 16 hours ago

A modern spin on this technique is as follows: Write (or use an LLM to write) something so simple that it is both obviously correct and very easy to verify the correctness of. Then, use that same LLM to create a comprehensive suite of tests, which further prove the correctness of the simple implementation. Once the tests are there, let the LLM run wild and ask it to optimize the hell out of the implementation while keeping the tests untouched.

IshKebab | 14 hours ago

Terrible advice. In the real world you can't fully rewrite a project when it turns out that your obvious stupid solution isn't good enough. You're stuck with it forever.

asibahi | 22 hours ago

This was a very interesting read. I wonder if similar techniques can apply to Turkish or Japanese dictionaries?

comonoid | 19 hours ago

I don't know about Japanese, but to Turkish definitely can.

macleginn | 17 hours ago

An old C++ library using a related approach (DAWG) was seemingly created by a Japanese programmer: https://code.google.com/archive/p/dawgdic/

It was later used to power a popular Python package for processing Russian morphology (pymorphy).

cadamsdotcom | 19 hours ago

Why was the download 3gb, if the solution created a 300x reduction primarily by sharing suffixes? Wouldn’t vanilla compression have dealt with that and achieved a decent (not ideal) amount of compression of the database?

andriy_koval | 19 hours ago

I think there is no vanilla compression in SQLite.

woadwarrior01 | 19 hours ago

There's a commercial extension that can do this.

https://sqlite.org/com/cerod.html

Arcuru | 17 hours ago

But there's probably not encryption either, so the sqlite database file probably has a lot of duplicated data inside it that's visible externally. I'm also curious how well it compresses by just running the sqlite file through a compression tool. I don't know enough about sqlite storage internals to guess.

wood_spirit | 19 hours ago

This was a fun read! Thanks for the great introduction to Finite State Transducers. I hadn't heard the formal term before, but your article gave me serious déjà vu.

Years ago, I entered a Scrabble programming contest and needed to compress a GADDAG dictionary to fit into my 6MB L3 cache. Without knowing the official name for it, I ended up using the exact same suffix-compression mechanism by moving characters to the edges instead of the nodes to merge overlapping paths.

Sharing my old write-up here in case you or other data-structure nerds find the overlap interesting! https://williame.github.io/post/87682811573.html

rishkewl | 18 hours ago

Nice reminder that a lot of these wins come from removing identity that the algorithm doesn’t actually need.

[OP] hiAndrewQuinn | 6 hours ago

On my list to pore over, and I will return once I have grokked it :) It was a little too much for my tired brain last night.

hmokiguess | 19 hours ago

This was such a pleasant read, thank you for sharing your experience, I feel motivated now to go solve the same problems twice!

rishkewl | 18 hours ago

This feels like a good case of "SQLite was the right bad solution."