In many cases not. E.g. for caching with python, diskcache is a good choice.
For small amounts of data, a JSON file does the job (you pointed to JSONL as an option).
But for larger collections, that should be searchable/processable, postgres is a good choice.
Memory of course, as you wrote, also seems reasonable in many cases.
people wildly underestimate the os page cache and modern nvme drives tbh. disk io today is basically ram speeds from 10 years ago. seeing startups spin up managed postgres + redis clusters + prisma on day 1 just to collect waitlist emails is peak feature vomit.
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.
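To make the flat-file suggestion concrete, here's a minimal sketch of the jsonl waitlist in Python rather than Go (the function names `append_signup` and `load_signups` are made up for illustration):

```python
import json

def append_signup(path, email):
    # One JSON object per line; appending a small record is a single write.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"email": email}) + "\n")

def load_signups(path):
    # Any realistic waitlist fits comfortably in the OS page cache,
    # so repeated reads of this file are effectively memory-speed.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Note this sketch doesn't fsync, so a crash can lose the last few writes sitting in the page cache, which is exactly the trade-off discussed further down the thread.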
The irony isn’t lost on us, trust me. We spent a while debating whether to even publish this one.
But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.
props for actually publishing it tbh. transparent engineering takes are so rare now, usually its just seo fluff.
weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.
mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.
I feel like I could write another post: do you even need serverless/cloud? We've also been brainwashed into thinking we need to spend hundreds or thousands a month on AWS when a tiny VPS will do.
You are both right, with the exception that it requires knowledge and taste to accomplish, both of which are in short supply in the industry.
Why set up a go binary and a json file? Just use Google Forms and move on, or pay someone for a dead simple form system so you can capture and communicate with customers.
People want to do the things that make them feel good - writing code to fit in just the right size, spending money to make themselves look cool, getting "the right setup for the future so we can scale to all the users in the world!" - most people don't consider the business case.
What they "need" is an interesting one because it requires a forecast of what the actual work to be done in the future is, and usually the head of any department pretends they do that when in reality they mostly manage a shared delusion about how great everything is going to go until reality hits.
I have worked for companies getting billions of hits a month and ones that I had to get the founder to admit there's maybe 10k users on earth for the product, and neither of them was good at planning based on "what they need".
Serverless is cheap as hell at low volumes. Your tiny VPS can't scale to zero. If you're doing sustained traffic, your tiny VPS might win though. The real value in Cloud is turning capex spend into opex spend. You don't have to wait weeks or months to requisition equipment.
id 100% read that post. the jump from free tier serverless to why is my aws bill $400 this month for a hobby project is a rite of passage at this point. a $5 hetzner or digitalocean box with dokku/docker-compose is basically a superpower that most newer devs just bypass entirely now.
> weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.
I don't think it makes any sense to presume everyone around you is brainwashed and you are the only soul cursed with reasoning powers. Might it be possible that "we" are actually able to analyse tradeoffs and understand the value of, say, having complete control over deployments, with out-of-the-box support for things like deployment history, observability, rollback control, and infrastructure as code?
Or is it brainwashing?
Let's put your claim to the test. If you believe only brainwashed people could see value in things like SQLite or Kubernetes, what do you believe are reasonable choices for production environments?
i think you missed the "on day 1" part of my comment. k8s, iac, and observability are incredible tools when you actually have the scale and team to justify them.
my point is strictly about premature optimization. ive seen teams spend their first month writing helm charts and terraform before they even have a single paying user. if you have product-market fit and need zero-downtime rollbacks, absolutely use k8s. but if youre just validating an mvp, a vps and docker-compose (or sqlite) is usually enough to get off the ground.
Except that eventually you'll find you lose a write when things go down because the page cache is write behind. So you start issuing fsync calls. Then one day you'll find yourself with a WAL and buffer pool wondering why you didn't just start with sqlite instead.
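A rough sketch of the first step in that progression, in Python (the helper name `durable_append` is invented here):

```python
import os

def durable_append(path, record: bytes):
    # A plain write() only reaches the kernel's write-behind page cache;
    # a crash before writeback loses it. fsync forces it to the device.
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # Python's userspace buffer -> kernel
        os.fsync(f.fileno())  # kernel page cache -> storage device
```

Even this isn't the whole story (a newly created file also needs the directory entry fsynced, and torn writes need detection), which is how you end up reinventing a WAL.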
Definitely appreciate the post and the discussion that has come from it... While I'm still inclined to just reach for SQLite as a default starting point, the alternatives are often worth considering depending on your needs.
In practice, I almost always separate the auth chain from the service chain(s), in that if auth gets kicked over under a DDoS, at least already-authenticated users stand a chance of still being able to use the apps. I've also designed auth systems essentially abstracted to key/value storage, with adapters for differing databases (including SQLite) for deployments...
Would be interested to see how LevelDB might perform for your testing case, in that it seems to be a decent option for how your example is using data.
> people wildly underestimate the os page cache and modern nvme drives
And worse, overestimate how safe their data is!
All this fancy talk about not using an RDBMS could only have been true if the APIs and actual implementations across ALL of the IO path were robust and RELIABLE.
But they're not!
EVERY LAYER LIES
ALL of them
ALL THE TIME
That is why one of the biggest reasons to build a real database (whatever the flavor) is that there is no way to avoid paying performance taxes all over the place: you can't trust the IO, and having a single file (or a few files) hammered over and over makes this painfully obvious.
One of the most sobering experiences is to write your IO with all the care in the world, let your brand new DB run for hours on good, even great, hardware, and find in less than a week that it breaks in funny ways.
I'm so old I remember working on databases that were designed to use RAW, not files. I'm betting some databases still do, but probably only for mainframe systems nowadays.
> Oracle® Database Platform Guide 10g Release 2 (10.2) for Microsoft Windows Itanium (64-Bit)
Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.
I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge oracle cluster that still employed physical database admins and logical database admins as separate roles...I would bet they're still paying millions for an out of date version of Oracle and using RAW.
I need a filesystem that does some database things. We got teased with that with WinFS and Beos's BFS, but it seems the football always gets yanked away, and the mainstream of filesystems always reverts back to the APIs established in the 1980s.
Transactions are one thing I want the most, and that's not going to happen on S3. Sure, I can reinvent them by hand, but the point is I want that baked in.
Yeah, the closest thing there is MS-SQL FILESTREAM, but even that has flaws and severe limitations... you can do similar transaction implementations for binary data storage in any given RDBMS, or do something similar by reading/writing to the filestream along with a transactional row lock that corresponds to the underlying data. But that gets its own complexities.
Your article completely ignores operational considerations: backups, schema changes, replication/HA. As well as security, i.e. your application has full permissions to completely destroy your data file.
Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.
but it's so trivial to implement SQLite, in almost any app or language...there are sufficient ORMs to do the joins if you don't like working with SQL directly...the B-trees are built in and you don't need to reason about binary search, and your app doesn't have 300% test coverage with fuzzing like SQLite does
you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
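For example, both points, trivial adoption and atomic backups, can be sketched with Python's stdlib `sqlite3` (the helper names are made up; `Connection.backup` is the real API):

```python
import sqlite3

def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
    )
    return conn

def atomic_backup(conn, backup_path):
    # Connection.backup takes a transactionally consistent snapshot of the
    # database, without stopping the application or copying a half-written file.
    dest = sqlite3.connect(backup_path)
    with dest:
        conn.backup(dest)
    dest.close()
```

That's the entire "database layer" for a small app: no server, no connection pool, and the B-trees, crash recovery, and fuzz-tested parser come for free.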
> and your app doesn't have 300% test coverage with fuzzing like SQLite does
Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
That's true, but relying on a highly tested component like SQLite means that you can focus your tests on the interface and your business logic, i.e. you can test that you are persisting to your datastore rather than testing that your datastore implementation is valid.
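A sketch of that testing split, with a hypothetical `UserRepo` wrapper: the test targets the repository interface against an in-memory SQLite database, exercising the real SQL path without re-testing SQLite's internals:

```python
import sqlite3

class UserRepo:
    """Thin persistence interface; SQLite's own correctness is trusted, not re-tested."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY)")

    def add(self, email):
        with self.conn:  # implicit transaction, rolled back on error
            self.conn.execute("INSERT INTO users VALUES (?)", (email,))

    def exists(self, email):
        row = self.conn.execute(
            "SELECT 1 FROM users WHERE email = ?", (email,)
        ).fetchone()
        return row is not None

def test_roundtrip():
    # Business-logic test against :memory:: fast, isolated, no mocks.
    repo = UserRepo(sqlite3.connect(":memory:"))
    repo.add("a@example.com")
    assert repo.exists("a@example.com")
    assert not repo.exists("b@example.com")
```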
Your business logic tests will already, by osmosis, exercise the backing data store in every conceivable way to the fundamental extent that is possible with testing given finite time. If that's not the case, your business logic tests have cases that have been overlooked. Choosing SQLite does mean that it will also be tested for code paths that your application will never touch, but who cares about that? It makes no difference if code that is never executed is theoretically buggy.
Then your business logic contains unspecified behaviour. Maybe you have a business situation where power loss conditions being unspecified is perfectly acceptable, but if that is so it doesn't really matter what happens to your backing data store either.
Came here to also throw in a vote for it being so much easier to just use SQLite. You get so much for so very little. There might be a one-time up-front learning effort for tweaking settings, but that is a lot less effort than what you're going to spend on fiddling with stupid issues with data files all day, every day, for the rest of the life of your project.
Even then... I'd argue for at least LevelDB over raw jsonl files... and I say this as someone who would regularly do ETL and backups to jsonl file formats in prior jobs.
Honestly, there is zero chance you will implement anything close to sqlite.
What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.
Sometimes yes, I've seen it. It even tends to happen on NoSQL databases as well. Three times I've seen apps start on top of Dynamo DB, and then end up re-implementing relational databases at the application level anyway. Starting with postgres would have been the right answer for all three of those. Initial dev went faster, but tech debt and complexity quickly started soaking up all those gains and left a hard-to-maintain mess.
This always confuses me because we have decades of SQL and all its issues as well. Hundreds of experienced devs talking about all the issues in SQL and the quirks of queries when your data is not trivial.
One would think that for a startup of sorts, where things change fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities is known, going for SQL becomes a natural path.
There are also cases for having both, and there are cases for graph-oriented databases or even columnar-oriented ones such as duckdb.
Seems to me, with my very limited experience of course, everything leads to the same boring fundamental issue: rarely does the problem lie in the infrastructure; it's mostly bad design decisions and poor domain knowledge. Realistically, how many times is the bottleneck indeed the type of database versus the quality of the code and the implementation of the system design?
It's almost always a system design issue. Outside of a few specific use cases with big data, I struggle to imagine when I'd use NoSQL, especially in an application or data analytics scenario. At the end of the day, your data should be structured in a predictable manner, and it most likely relates to other data. So just use SQL.
You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:
• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.
• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.
• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).
• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.
So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
I generally limit Oracle to where you are in a position to have a dedicated team to the design, deployment and management of just database operations. I'm not really a fan of Oracle in general, but if you're in a position to spend upwards of $1m/yr or more for dedicated db staff, then it's probably worth considering.
Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.
That was true years ago but these days there's the autonomous database offering, where DB operations are almost all automated. You can rent them in the cloud and you just get the connection strings/wallet and go. Examples of stuff it automates: backups, scaling up/down, (as mentioned) adding indexes automatically, query plan A/B testing to catch bad replans, you can pin plans if you need to, rolling upgrades without downtime, automated application of security patches (if you want that), etc.
So yeah running a relational DB used to be quite high effort but it got a lot better over time.
At that point, you can say the same for PostgreSQL, which is more broadly supported across all major and minor cloud platforms with similar features and I'm assuming a lower cost and barrier of entry. This is without signing with Oracle, Inc... which tends to bring a lot of lock-in behaviors that come with those feature sets.
TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes. My experiences with the corporate entity itself leave a lot to be desired, and just getting set up/started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get set up, but more recently has had a lot of difficulties, in particular with docker/dev instances, and more so under ARM (mac) than alternatives.
I'm a pretty big fan of PG, which is, again, very widely available and supported.
Autonomous DB can run on-premises or in any cloud, not just Oracle's cloud. So it's not quite the same.
I think PG doesn't have most of the features I named, I'm pretty sure it doesn't have integrated queues for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system), but also, bear in mind the "postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres, that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.
Local dev is the same as other DBs:
docker run -d --name oracle-db container-registry.oracle.com/database/free:latest
Works on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.
Cost isn't necessarily much lower. At one point I specced out a DB equivalent to what a managed Postgres would cost for OpenAI's reported workload:
> I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (under-estimate) and HA option. Even with a 1 year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache.
Good points, but Postgres has all those, along with a much better local testing story, easier and more reliable CDC, better UDFs (in Python, Go, etc.), a huge ecosystem of extensions for e.g. GIS data, no licensing issues ever, API compatibility with DuckDB, Doris, and other DBs, and (this is the big one) is not Oracle.
Unless I’ve missed something, Postgres doesn’t have automatic index creation, nor does it have JSON introspection to automatically convert it to a normalized schema (which is insane; I love it). It also doesn’t do any kind of sharding on its own, though of course forks like Citus exist. It definitely doesn’t do RAC / Exadata (not sure which part this falls under), where multiple nodes are connected and use RDMA to treat a bunch of SSDs as local storage.
I love Postgres, and am not a huge fan of Oracle as a corporation, but I can’t deny that their RDBMS has some truly astounding capabilities.
RAC is a default part of any cloud Oracle DB, I think! I must admit I'm not an expert in all the different SKUs so there might be some that aren't, but if you rent an autonomous DB in the cloud you're probably running on ExaData/RAC. That's why the uptime is advertised as 99.95% even without a separate region.
I was ambiguous. That's an answer telling how to create indexes manually, and saying that you get an index for primary keys and unique constraints automatically. Sure, all databases do that. Oracle can create arbitrary indexes for any relation in the background without it being requested, if it notices that common queries would benefit from them.
Forgetting to create indexes is one of the most common issues people face when writing database apps because the performance will be fine on your laptop, or when the feature is new, and then it slows down when you scale up. Or worse you deploy to prod and the site tanks because a new query that "works fine on my machine" is dog slow when there's real world amounts of data involved. Oracle will just fix it for you, Postgres will require a manual diagnosis and intervention. So this isn't the same capability.
What that utility is doing is quite different. For one, it assumes you start with a schema already. Oracle can infer a schema from a collection of documents even if you don't have one by figuring out which fields are often repeated, which values are unique, etc.
For another, what you get after running that utility is relational tables that you have to then access relationally via normal SQL. What JSON duality views give you is something that still has the original document layout and access mode - you GET/PUT whole documents - and behind the scenes that's mapped to a schema and then through to the underlying SQL that would be required to update the tables that the DB generated for you. So you get the performance of normalized relations but you don't have to change your code.
The nice thing about this is it lets developers focus on application features and semantics in the early stages of a startup by just reshaping their JSON documents at will, whilst someone else focuses on improving performance and data rigor fully asynchronously. The app doesn't know how the data is stored, it just sees documents, and the database allows a smooth transition from one data model to another.
I don't think Postgres has anything like this. If it does it'll be in the form of an obscure extension that cloud vendors won't let you use, because they don't want to/can't support every possible Postgres extension out there.
No: when things change fast and unpredictably, NoSQL is worse than when they are well-known and stable.
NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard tasks in data migration for an insurmountable mess of data inconsistency bugs that you'll never actually get to the end of.
> is mostly bad design decisions and poor domain knowledge
Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
For sure, though with databases it's usually pretty clear even at the start whether your "objects" will be relational in nature. I can't think of a single time that hasn't been the case, over hundreds of apps/services I've been part of. Things like asynchronous jobs, message queues, even object storage, I fully agree though.
Makes sense. But in that case, why does NoSQL exist? What problems does it solve, and when should it be considered? I'm being naive, but a fast-changing environment has been one of the main advantages I was taught by devs when it comes to NoSQL vs SQL (NoSQL being the choice for flexible schemas). So is it more about BASE vs ACID?
NoSQL was created to deal with scales where ACID becomes a bottleneck. It has also shown itself useful for dealing with data that doesn't actually have a schema.
If you have either of those problems, you will know it very clearly.
Also, ironically, Postgres became one of the most scalable NoSQL databases out there, and one of the most flexible for unstructured data too.
Agreed. In my experience (YMMV), there was also a real adoption push in the js world from primarily front-end people that wanted to do some backend but didn't want to learn/deal with SQL databases. I don't say that with malice, I was also on-board the NoSQL train for a bit before I actually gained experience with the headaches it caused. The appeal of "just dump your JSON blob straight in" was (and still is) strong. Software is all about learning, and sometimes that learning is expensive. We've all built something we later regretted.
Probably the best use case would be something like a Facebook profile page for a given user.
It may not have a very rigid schema, you may later add several other optional fields.
You need very large scale (as in number of concurrent accesses), and you want to shard the data by e.g. location. But also, the data is not "critical": your highschool not being visible temporarily for certain users is not an issue.
You mostly use the whole dataset "at the same time", you don't do a lot of WHERE, JOIN on some nested value.
In every other case I would rather reach for postgres with a JSONB column.
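As a rough illustration of the documents-in-a-relational-DB pattern: this sketch uses SQLite's JSON1 functions (`json_extract`) as a stand-in for Postgres's JSONB operators, since it runs without a server. With Postgres you'd write `doc->>'location'` and can index that expression:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO profiles (doc) VALUES (?)",
    ('{"name": "alice", "location": "berlin", "highschool": "somewhere"}',),
)
# json_extract plays the role of Postgres's doc->>'location'; optional
# fields simply come back as NULL instead of forcing a schema migration.
row = conn.execute(
    "SELECT json_extract(doc, '$.location') FROM profiles "
    "WHERE json_extract(doc, '$.name') = ?",
    ("alice",),
).fetchone()
```

You keep the flexible-schema ergonomics while the rest of the row (the integer primary key here) stays relational and joinable.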
As a data architect I dislike the term NoSQL and often recommend that my coworkers not use it in technical discussions, as it is too vague. Document, key-value and graph DBs are usually considered NoSQL, but they have fairly different use cases (and I'd argue that search DBs like Elastic / OpenSearch are in their own category as well).
To me write scaling is the main current advantage of KV and document DBs. They can generally do schema evolution fairly easily, but nowadays so can many SQL DBs, with semi-structured column types. Also, you need to keep in mind that KV and document DBs are (mostly) non-relational. The more relational your data, the less likely you are to actually benefit from using those DBs over a relational, SQL DB.
I think part of it is how much scale has changed over the past decade and a half... The hardware and vertical scale you could get in 2010 is dramatically different from today.
A lot of the bespoke NoSQL data stores really started to come to the forefront around 2010 or so. At that time, having 8 CPU cores and 10k RPM SAS spinning drives was a high-end server. Today, we have well over 100 cores, TBs of RAM, and PCIe Gen 4/5 NVMe storage (U.x) that is thousands of times faster, at a total cost lower than those 2010-era servers, which your average laptop can outclass today.
You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.
There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
What about the microservices/serverless functions world? This was another common topic over the years: that using SQL with this type of system was not optimal, I believe the issue being the number of connections to the SQL database.
I think a lot of the deference to microservices/serverless is for similar reasons... you can work around some of this if you use a connection proxy, which is pretty common for PostgreSQL...
That said, I've leaned toward avoiding breaking things up into microservices unless/until you need them... I'm also not opposed to combining CQRS-style workflows if/when you do need microservices. Usually if you need them, you're breaking off certain compute/logic workflows first, where the async/queued nature lends itself to your needs. My limited experience with a heavily microserviced application combined with GraphQL was somewhat painful, in that the infrastructure and orchestration weren't appropriately backed by dedicated teams, leading to excess complexity and extra job duties for a project that would have scaled just fine in a more monolithic approach.
YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection sharing heuristics depending again on your infrastructure and needs... I've got worker pools that mostly operate of a queue, perform heavy compute loads then interact with the same API service(s) as everything else.
microservices are about the fact that administrative overhead for a software system increases exponentially w.r.t. the complexity of the system. Or to put it another way, microservices are a way to make a complex system without having the architecture explode in size. They have nothing to do with making more efficient software systems. They are about making complex systems that trade dev costs for operational costs.
There's plenty of middle ground between an unchanging SQL schema and the implicit schemas of "schemaless" databases. You can have completely fluid schemas with the full power of relational algebra (e.g. untyped datalog). You shouldn't be using NoSQL just because you want to easily change schemas.
> One would think that for a startup of sorts, where things changes fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities are known, going for SQL becomes a natural path.
NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.
No, no it isn't. It never is. Just as building your house on a rubber foundation isn't the correct answer either. This is just cope. Unless your use cases don't care about losing data or data corruption at all, NoSQL isn't the correct answer.
Same. DynamoDB is almost never a good default choice unless you've thought very carefully about your current and future use cases. That's not to say it's always bad! At previous startups we did some amazing things with Dynamo.
I've never used DynamoDB in production, but it always struck me as the type of thing where you'd want to start with a typical relational database, and only transition the critical read/write paths when you get to massive scale and have a very good understanding of your data access patterns.
Here are two checks using joins, one with sqlite, one with the join builtin of ksh93:
check_empty_vhosts () {
# Check which vhost adapter doesn't have any VTD mapped
start_sqlite
tosql "SELECT l.vios_name,l.vadapter_name FROM vios_vadapter AS l
LEFT OUTER JOIN vios_wwn_disk_vadapter_vtd AS r
USING (vadapter_name,vios_name)
WHERE r.vadapter_name IS NULL AND
r.vios_name IS NULL AND
l.vadapter_name LIKE 'vhost%';"
endsql
getsql
stop_sqlite
}
check_empty_vhosts_sh () {
# same as above, but on the shell
join -v 1 -t , -1 1 -2 1 \
<(while IFS=, read vio host slot; do
if [[ $host == vhost* ]]; then
print ${vio}_$host,$slot
fi
done < $VIO_ADAPTER_SLOT | sort -t , -k 1)\
<(while IFS=, read vio vhost vtd disk; do
if [[ $vhost == vhost* ]]; then
print ${vio}_$vhost
fi
done < $VIO_VHOST_VTD_DISK | sort -t , -k 1)
}
Wow, I'm sorry you have to work with such coworkers. For reference, joins are just an expensive use case. DBs do them about 10x faster than you can do them by hand. But if you need a join, you probably should either a) do it periodically and cache the result (making your data inconsistent) or b) just do it in a DB. Confusing caching the result with doing the join efficiently is an amazing misunderstanding of basic Computer Science.
Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.
Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.
You can avoid the overhead of working with the database. If you want to work with json data and prefer the advantages of text files, this solution will be better when you're starting out. I'm not going to argue in favor of a particular solution because that depends on what you're doing. One could turn the question around and ask what's special about SQLite.
If your language supports it, what is the overhead of working with SQLite?
What's special about SQLite is that it already solves most of the things you need for data persistence without adding the same kind of overhead or trade offs as Postgres or other persistence layers, and that it saves you from solving those problems yourself in your json text files...
Like by all means don't use SQLite in every project. I have projects where I just use files on the disk too. But it's kinda inane to pretend it's some kind of burdensome tool that adds so much overhead it's not worth it.
Battle-tested, extremely performant, easier to use than a homegrown alternative?
By all means, hack around and make your own pseudo-database file system. Sounds like a fun weekend project. It doesn't sound easier or better or less costly than using SQLite in a production app though.
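For what it's worth, the "overhead" in question is small enough to show in full. A minimal sketch using Python's stdlib sqlite3 module (an in-memory database so the example is self-contained; a file path works the same way):

```python
import sqlite3

# The entire setup cost of SQLite in Python: one stdlib import and a
# connection string. No server, no daemon, no config.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
con.execute("INSERT INTO kv VALUES (?, ?)", ("greeting", "hello"))
con.commit()

(value,) = con.execute(
    "SELECT value FROM kv WHERE key = ?", ("greeting",)
).fetchone()
print(value)  # hello
```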
“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
Counterpoint: Meta is currently (and for the last decade) trying to rewrite MySQL so it is basically Postgres. They could just change their code so it works with Postgres and retrain their ops on Postgres. But for some reason they think it's easier to rewrite MySQL. Now, that is almost certainly more about office politics than technical matters... but it could also be the case that they have so much code that only works with MySQL that it's true (seriously doubtful).
You are just mislabeling good architecture as 'premature optimization'. So I will give you another platitude...
"There is nothing so permanent as a temporary software solution"
I interpret YAGNI to mean that you shouldn't invest extra work and extra code complexity to create capabilities that you don't need.
In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.
It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.
The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.
I avoided DBs like the plague early in my career, in favor of serialized formats on disk. I still think there's a lot of merit to that, but at this point in my career I see a lot more use case for sqlite and the relational features it comes with. At the least, I've spent a lot less time chasing down data corruption bugs since changing philosophy.
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
Pain in the ass was way too strong, I retract that. Mainly I meant relative. For example `nvim <filename>.json` and then /search for what I want, versus tracking down the sqlite file, opening, examining the schema, figuring out where the most likely place is that I care about, writing a SQL statement to query, etc.
> Well, you still need to track down the <filename> part and knowing what you want to search, so you need to examine the schema anyway.
Yes agreed, but it's usually a lot easier to find the filename part, especially if the application follows XDG. Sqlite databases are usually buried somewhere because they aren't expected to be looked at.
SqliteBrowser will let you open up your tables in an Excel-type view. You can also edit directly from the GUI. Still not as frictionless as a plain text file, and I'm not sure how good the search functionality is, but it lets you skip having to write any SQL.
I agree. Databases are useless. You don't even need to load the data into memory; reading it from disk whenever you need it is fine. I don't buy the claim that there are billions of records so the database must be something optimized for handling them. That volume of records is most likely something like access logs, which I think shouldn't be stored at all in such a case.
Even Postgres is still files on disk. If you need something like partitioning the data, it's much easier to write the partitioning code yourself.
If you need to add data through text inputs, checkboxes, etc., a database with its admin tools may be a good thing. If the data gets imported and exported a lot, a database may be a good thing too. But I still don't believe in such cases; in my ten-something years as a software developer, nothing like that has ever come up.
It isn't sarcasm. I don't really see a case where a database with its own query language like SQL is needed. It's no different from storing a JSON file and filtering the contents with a for loop; the dev (e.g. me) will be returning JSON from a REST API in the end. A query language may be a good thing if you're working on a team, but that's it. SQL really isn't a good thing.
Um, so your use cases are extremely narrow and limited. That's an astonishing failure of imagination and a lack of understanding of real-world computer systems if you cannot understand why people have a real need for both the power of SQL and the performance of RDBMSs.
PostGIS is an extension of PostgreSQL. It claims to have geography features, but I don't think it's really related to databases. It brings only a function (haversine) that computes the distance between two points on Earth. That's a couple of lines of code. It's not really a software project, just a detail of how Earth coordinates are calculated, and I think it's a totally made-up story. The real computation works like a math.sqrt function.
I worked as a software engineer for 30 years before being forced to use a database, and that was for a web site. I've been coding actively, daily, since the 70's. Forever we just wrote proprietary files to disk, and that was the norm, for decades. Many a new developer can't even imagine writing their own proprietary file formats, the idea literally scares them. The engineers produced today are a shadow of what they used to be.
Not to nitpick, but it would be interesting to see profiling info of the benchmarks
Different languages and stdlib methods can often spend time doing unexpected things that makes what looks like apples-to-apples comparisons not quite equivalent
Honestly, I have been thinking about the same topic for some time, and I do realize that direct files could be faster.
In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and directly access the user ID instead of running some file to find them -- again, just 'cause the examples used, search by name would be a pain, and one point where databases would handle things better.
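The `user/[id]/info.json` layout sketched above, assuming a Python app and made-up record fields:

```python
import json
import os
import tempfile

# Filesystem-as-KV-store sketch: one directory per user ID, so a lookup
# by ID is a single path construction -- no scanning, no index.
def save_user(root, user_id, record):
    d = os.path.join(root, "user", str(user_id))
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "info.json"), "w") as f:
        json.dump(record, f)

def load_user(root, user_id):
    with open(os.path.join(root, "user", str(user_id), "info.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as root:
    save_user(root, 42, {"name": "ada"})
    result = load_user(root, 42)
print(result)  # {'name': 'ada'}
```

Lookup by anything other than the ID (e.g. name) still means walking every directory, which is exactly where a database starts to earn its keep.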
Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.
As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years, the family took over and had everything redone (it now runs Wordpress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.
Sorry to break it to you, but the article doesn't describe that at all. In fact, the reason why a DB has such great performance isn't mentioned at all. Learn what a datapath architecture is and how a DB kernel works if you are interested in that topic. And then there is how an optimizer works, how the prepare time works, how metadata is handled, etc...
Separate from performance, I feel like databases are a sub-specialty that has its own cognitive load.
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
The SQLite "faster than filesystem" page is specifically about reading small blobs where the overhead of individual filesystem calls (open/read/close per blob) exceeds SQLite reading from a single already-open file. Once you're talking about reading one big JSON file sequentially, that overhead disappears and you're just doing a single read - which is basically the best case for the filesystem and the worst case for SQLite (which still has to parse its B-tree, check schemas, etc).
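A hedged micro-benchmark of that point (Python; the absolute numbers depend on OS, filesystem, and page-cache state, so only correctness is asserted):

```python
import os
import tempfile
import time

# Read 1,000 small blobs as individual files (one open/read/close per
# blob) vs. one sequential read of a single concatenated file.
blobs = [os.urandom(512) for _ in range(1000)]

with tempfile.TemporaryDirectory() as d:
    for i, b in enumerate(blobs):
        with open(os.path.join(d, f"{i}.bin"), "wb") as f:
            f.write(b)
    big = os.path.join(d, "all.bin")
    with open(big, "wb") as f:
        f.write(b"".join(blobs))

    t0 = time.perf_counter()
    many = []
    for i in range(1000):
        with open(os.path.join(d, f"{i}.bin"), "rb") as f:
            many.append(f.read())
    t_many = time.perf_counter() - t0

    t0 = time.perf_counter()
    with open(big, "rb") as f:
        data = f.read()
    t_one = time.perf_counter() - t0

one = [data[i * 512:(i + 1) * 512] for i in range(1000)]
print(f"per-file: {t_many:.4f}s  single read: {t_one:.4f}s")
```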
Don't know if it counts, but my London cinema listings website just uses static json files that I upload every weekend. All of the searching and stuff is done client side. Although I do use sqlite to create the files locally.
Total hosting costs are £0 ($0) other than the domain name.
Sharks vs. dinosaurs seems indeed an appropriate metaphor.
During the Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today; e.g., there were big sharks that differed very little from the white sharks and tiger sharks of today.
Then the dinosaurs disappeared, together with the pterosaurs and the mosasaurs, and were replaced by other animals, but the sharks have continued to live until today with little change, because they had already reached an optimized design that was hard to improve.
Besides the sharks, during the Cretaceous there already existed alongside the dinosaurs two other groups of big predators that have changed little since then: crocodiles and big constrictor snakes similar to the pythons of today.
Therefore all 3 (sharks, crocodiles, and big constrictor snakes) are examples of locally optimal designs reached more than 70 million years ago, needing no major upgrades since.
"Do not cite the deep magic to me witch, I was there when it was written"
If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.
Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.
A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).
They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".
It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.
Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node and Typescript and code at all is unnecessary complexity.
PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:
concurrency, access controls, backups, transactions: recovery, rollback, committing, etc., ability to do aggregations, joins, indexing, arbitrary queries, etc. etc.
These are not just "nice to have" in the vast, vast majority of projects.
"The cases where you'll outgrow flat files:"
Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.
If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.
It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)
I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.
Not to mention now you need a AGENTS.md that says "We use our own home-grown database nonsense if you want to query the JSON file in a different way just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.
SRE here. My "Huh, neat" side of my brain is very interested. The SRE side of my brain is screaming "GOD NO, PLEASE NO"
Overhead in any project is understanding it and onboarding new people to it. Keeping to the "mainline" path is key to lowering friction here. All 3 languages have well-supported ORMs that support SQLite.
I'm mostly with you here... it's amazing how many devs don't have a certain amount of baseline knowledge to understand file I/O, let alone thin abstractions for custom data and indexing like TFA's. Then again, most devs also don't understand the impacts of database normalization under load either.
Sorry, this I think is a dangerous attitude: for me it is not about onboarding. Every newcomer reading `Huh, neat` is poised to repeat the mistakes of us and our ancestors.
Writing your own storage is a great way to understand how databases work (if you do it efficiently, keeping indexes, correct data structures, etc.) and to come to the conclusion that if your intention wasn't just tinkering, you should've used a database from day 1.
I love this article as it shows how fast computers really are.
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.
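A sketch of that hybrid in Python: a plain LRU wrapper around a disk read. Cache invalidation on writes is the hard part and is deliberately ignored here.

```python
import functools
import os
import tempfile

# In-memory cache over file-backed data: first call hits the disk,
# subsequent calls for the same path are served from memory.
@functools.lru_cache(maxsize=1024)
def get_record(path):
    with open(path) as f:
        return f.read()

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

first = get_record(path)   # disk read
second = get_record(path)  # served from the LRU cache
os.unlink(path)
```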
SQLite has become my new go-to when starting any project that needs a DB. The performance is very fast, and if anything is ever successful enough to outgrow SQLite, it wouldn't be that hard to switch it out for Postgres. Not having to maintain/backup/manage a separate database server is cheaper and easier.
You remove a bit of complexity. Sure, Postgres is not hard to set up and connect to, but SQLite is just opening a file. It being a file also makes it very easy to test or debug your application.
postgres is great and is also a good default choice. It needs a bit more setup than sqlite. Unless I need a capability that postgres provides, I go with sqlite. It just works.
The one that gets me a lot, which is similar in practice to your point, is when I need server redundancy, even if one server is otherwise plenty for my task. As soon as I'm not running in one place, you need network data storage, and that kicks pretty hard in the direction of a network-accessible database. S3 works sometimes and the recent work on being able to atomically claim files has helped with some of the worst rough edges but it still doesn't take a lot to disqualify it, at least as the only store.
Yeah, files as a database is fun, but I find you ultimately reinvent the wheel when sqlite is pretty battle tested, free and easier to get right or scale up.
I can't talk though because I actually find myself doing this a lot
If you look at those "When do you actually need a database?" constraints, I think it's missing consistency, which prevents bugs and makes debugging easier.
When you combine all those a database is a better alternative for all but the simplest cases.
A few months back I decided to write an embedded db for my firm's internal JS framework. Learned a lot about how/why databases work the way they do. I use stuff like reading memory cached markdown files for static sites, but there are certain things that a database gives you (chief of which for me was query ergonomics—I loved MongoDB's query language but grew too frustrated with the actual runtime) that you'll miss once you move past a trivial data set.
I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"
For me, I just wish MongoDB had scaling options closer to how Elastic/Cassandra and other horizontally scalable databases work, in that the data is sharded in a ring with redundancy settings... as opposed to Mongo, which afaik is still limited to either sharding or replication (or layers of them). FWIW, I wish that RethinkDB had seen more attention and success, and for that matter I might be more inclined to use CockroachDB over Mongo, where I can get some of the scaling features while still having some level of structured data.
I feel like someone who works for a DB company ought to mention at least some of the pitfalls in file-based backing stores (data loss due to crashes, file truncation, fsync weirdness, etc)
You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.
This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. RocksDB on Windows a couple of years back is a simple example: it regularly had corruption issues during development.
Honestly, at this point, if I had a design that required making atomic changes to files, I'd redo the design to use SQLite. The other way around sounds crazy to me.
"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.
That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.
Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
The problem is that most of the time when you want "atomic changes to files" the only safe API is copy the file, mutate it, then rename. That doesn't factor in concurrent writers or advisory locks.
If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.
> They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
This, 100%. Development today is driven by appearances though, you can take advantage of that. Give it a cute name, make sure you have AI generate an emoji-rich README for it, publish it as an open source npm package, then trigger CI a few thousand times to get a pretty download count. They will happily continue using it without fear!
If you start a new job and on your first day they go "Yeah the last guy said we don't need a database, so he rolled his own." are you gonna be excited, or sweating?
Exception being perhaps "The last team chose to build their own data layer, and here's the decision log and architecture docs proving why it was needed."
Serious question, why are people here acting as if formatted files are somehow more reliable than a DB? That just simply isn't true. For most of software development's history, using flat files for persistence of data was the wrong thing to do with good reason. Flat files can easily be corrupted, and that happens much more often than a DB gets corrupted. The reason you might think otherwise is just sampling bias.
I do believe that you are missing a healthy dose of sarcasm. Such as faking downloads to give yourself inflated statistics so that your employer will trust untested and AI-written garbage.
That said, there really are good use cases for readable formatted files. For example configuration files that are checked into source control are far more trackable than a SQLite database for the same purpose. For another example, the files are convenient for a lot of data transfer purposes.
But for updateable data? You need a really good reason not to simply use a database. I've encountered such edge cases. But I've encountered a lot more people who thought that they had an edge case, than really did.
For the simple case, it isn't necessarily that fragile. Write the entire database to a temp file, then after flushing, move the temp file to overwrite the old file. All Unix filesystems will ensure the move operation is atomic. Lots of "we dump a bunch of JSON to the disk" use cases could be much more stable if they just did this.
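A sketch of that pattern in Python, including the fsync steps that are easy to forget. The directory fsync is for durability of the rename itself, per the sibling comments, and may be unsupported on some platforms:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write-temp-then-rename: readers see either the old file or the
    new one, never a partial write. os.replace is atomic within one
    filesystem on POSIX."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # flush file contents to disk
        os.replace(tmp, path)     # atomic swap
        try:
            dirfd = os.open(d, os.O_RDONLY)
            os.fsync(dirfd)       # persist the directory entry (durability)
            os.close(dirfd)
        except OSError:
            pass  # directory fsync unsupported on some platforms
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "state.json")
    atomic_write_json(p, {"users": 2})
    with open(p) as f:
        restored = json.load(f)
```

Note this is whole-file replacement only; it does nothing for concurrent writers, which still need locking or a real WAL.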
Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.
Correct, the file must be flushed to disk! I'm so used to libraries handling it that I forgot to mention that.
I believe syncing the parent directory is only needed for durability and not atomicity, but if you're using this method you're probably not caring about durability in the first place, because you've designed a system that doesn't let you do durability in a fine-grained way without severe performance loss.
"All Unix filesystems will ensure the move operation is atomic."
This is false, but most filesystems will. However, there are a lot of fs calls you have to make, that you probably don't know about, to make the operations atomic.
PS: The way you propose is probably the hardest way to do an atomic FS operation. It will have the highest probability of failure and the longest period of operations and service disruption. There is good reason we move rows one at a time or in batches sized to match OS buffers.
I mean, if your atomic unit is a single file and you can tolerate simple consistency models, flat files are perfectly fine. There are many use cases that fit here comfortably where a whole database would be overkill.
I'm a big fan of using S3 as a database. A lot of apps can get a lot of mileage just doing that for a good chunk of their data; that which just needs lookup by a single field (usually ID, but doesn't have to be).
I worked in an org where a lot of records were denormalized to be used in a search database... since I went through that level of work anyway, I also fed the exports into S3 records for a "just in case" backup. That backup path became really useful in practice, since there was a need for eventually a "pending" version of records, separate from the "published" version.
In practice, the records themselves took no less than 30 joins for a flat view of the record data that was needed for a single view of what could/should have been one somewhat denormalied record in practice. In the early 2010's that meant the main database was often kicked over under load, and it took a lot of effort to add in appropriate caching and the search db, that wound up handling most of the load on a smaller server.
I'd argue for using LevelDB or similar if I just wanted to store arbitrary data based on a single indexable value like TFA. That said, I'd probably just default to SQLite myself since the access, backup, restore patterns are relatively well known and that you can port/grow your access via service layers that include Turso or Cloudflare D1, etc.
Embedded KV stores like LevelDB are great for what they are, but I’ve often found that I’ll need to add an index to search the data in a different way.
And then another index. And at some point you want to ensure uniqueness or some other constraint.
And then you’re rewriting a half-complete and buggy SQLite. So I’ve come around to defaulting to SQLite/PostgresQL unless I have a compelling need otherwise. They’re usually the right long-term choice for my needs.
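For illustration, the constraints mentioned here are one statement each in SQLite (Python stdlib; the schema is made up):

```python
import sqlite3

# A secondary index and a uniqueness constraint: one CREATE statement
# each, versus hand-rolled index maintenance on top of a KV store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.execute("CREATE UNIQUE INDEX users_email ON users(email)")

con.execute("INSERT INTO users (email) VALUES ('a@example.com')")
duplicate_rejected = False
try:
    con.execute("INSERT INTO users (email) VALUES ('a@example.com')")
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the constraint did its job
```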
Absolutely... I was just bringing it up, as it seems to have in the box support for a lot of what TFA is discussing. I'm generally more inclined to just use SQLite most of the time anyway.
That it's now in the box (node:sqlite) for Deno/TS makes it that much more of an easy button option.
I've used foreign keys and unique indexes to enforce validity on even the smallest, most disposable toy applications I've ever written. These benchmarks are really interesting, but the idea that performance is the only consideration is kind of silly.
Hmm... Sure, if you do not need a database then do not use a database.
Don't use a sports-car to haul furniture or a garbage truck as an ambulance.
For the use case and scale mentioned in the article it's obvious not to use a database.
Am I missing something? I guess many people are just using the tools they are familiar with and rarely question whether they are really applicable. Is that the message?
I think a more interesting question is whether you will need a single source of truth. If you don't you can scale on many small data sets without a database.
I will say this before I shut up with my rant: if you start with a design that scales, you will have an easier time scaling when it's time, without re-engineering your stack. Whether you think you will need to scale depends on your projected growth and the nature of your problem (do you need a single source of truth, etc.).
I worked one place that shoehorned SQL Server into a system to hold a small amount of static data that could easily have been a config file or even (eek) hard-coded.
I suggest every developer write a database from scratch at least once, and use it for something real. Or, even better, let somebody else use it for something real. Then you will know "why database".
It's indeed an amazing design and implementation space to explore. If distributed it is nearly comprehensive in scope. (However, did lol @ your "every developer" - that's being super kind and generous or "developer" is doing heavy lifting here.)
My first time was with a Bukkit plugin as a kid. One of my updates broke existing flat json files. Someone asked me if it has MySQL support, I didn't know what that was, then realized oh this is nice.
There are also things besides databases that I'll DIY and then still wonder why so many people use a premade tool for it, like log4j
I suggest every developer learn how to replicate, backup and restore the very database they are excited about, from scratch at least once. I propose this will teach them what takes to build a production ready system and gain some appreciation for other ways of managing state.
The "if" is doing a lot of heavy lifting there, but with that out of the way you do make a strong case about "local first" apps. Nothing beats the simplicity of a simple `cp`, but there are other tradeoffs to be made there.
That's why I was encouraging people to make backups and restores of their DB because personally, it has made me appreciate why different designs exist and when to use them vs just slapping a DB connection to an RDS instance and calling it a day.
In order to ask this question it's important to understand the lifecycle of the data in question. If it is constantly being updated and requires "liveness" (updates are reflected in queries immediately), the simple answer is: yes, you need a database.
But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.
If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.
I should have been clear with the assumption baked into that statement: the data in question is in a single file, with fixed size fields and sorted by primary keys. That precludes "looser" datasets, but I believe my point stands for the given context.
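A sketch of that layout in Python: fixed-width records of (uint32 key, 12-byte payload), sorted by key, binary-searched through an mmap. The record format is illustrative.

```python
import mmap
import os
import struct
import tempfile

# Fixed-width record: 4-byte little-endian key + 12-byte payload.
REC = struct.Struct("<I12s")

def lookup(mm, key):
    """Binary search over fixed-width, key-sorted records in an mmap."""
    lo, hi = 0, len(mm) // REC.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, payload = REC.unpack_from(mm, mid * REC.size)
        if k == key:
            return payload.rstrip(b"\x00")
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None

with tempfile.NamedTemporaryFile(delete=False) as f:
    for k, v in [(1, b"ada"), (5, b"bob"), (9, b"eve")]:  # sorted by key
        f.write(REC.pack(k, v))  # struct pads the payload with NULs
    path = f.name

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    found = lookup(mm, 5)
    missing = lookup(mm, 7)
os.unlink(path)
print(found)  # b'bob'
```

The OS page cache makes repeated lookups essentially memory-speed; what this sketch waves away is everything around updates: keeping the file sorted, crash safety, and concurrent writers.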
> Binary search beats SQLite... For a pure ID lookup, you're paying for machinery you're not using.
You'll likely end up quite a chump if you follow this logic.
SQLite has pretty strong durability and consistency mechanisms that their toy disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
Please …
Every few years the pendulum swings. First it was “relational databases are too rigid, just use NoSQL.” Then “NoSQL is a mess, just go back to Postgres.” Now: “do you even need a database at all, just use flat files.”
Each wave is partially right. But… each wave is about to rediscover, the hard way, exactly why the previous generation made the choices they did.
SQLite is the answer to every painful lesson learned, every scar from a long debugging night the last time someone thought "a JSON file is basically a database."
Michael Stonebraker used to write long, scathing critiques of modern data storage/retrieval fads, and how they were forgetting important historical lessons.
They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!
I've been really happy with DuckDB, and I love the fact that it can operate directly on things like JSON and JSONL. It's become my data Swiss Army knife.
File systems are nice if you need to do manual or transparent script-based manipulations. Like 'oh hey, I just want to duplicate this entry and hand-modify it, and put these others in an archive.' Or use your OS's access control and network sharing easily with heterogeneous tools accessing the data from multiple machines. Or if you've got a lot of large blobs that aren't going to get modified in place.
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.
Then proceeds to (poorly) implement a database on top of files.
Sure, a hash map taking ~400 MB of memory will offer fast lookups. You can argue some workloads will never reach that size, but what are you losing by using SQLite?
What happens when the service shuts down mid-write? Corruption, which later results in a (poorly) implemented WAL being added?
SQLite also showed something important: it was consistent in all benchmarks regardless of dataset size.
I dunno. Even in embedded systems every time I've started without a database I've eventually come to need something like a database, and in every case I've found myself building essentially an ad-hoc poorly managed database into the application including marshalling/unmarshalling, file management, notification, and so on because each new feature over the top of regular files was just that much easier to add versus switching to a database system.
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.
Sure. Go ahead and use JSONL files and implement every feature of a SQL query engine. Congrats, you just reinvented a database while trying to prove you don't need a database.
While this is certainly cool to see, and I love seeing how fast webservers can go... the counter question "Do you even need 25,000 RPS and sub-ms latency?" comes to mind.
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.
I think this whole article and post is an attention/points-seeking exercise. It is hard to imagine a programmer who would not know the difference between a DBMS and just a bunch of files, and when to use which.
I have a vague recollection that 4chan (At least at one point) didn't use any kind of backend database, they just rewrote the static pages with new content and that was it.
That's why it could handle massive traffic with very little issues.
It is a typical solution from the 90s, first steps of interactivity after hand-written HTML pages served by Apache. POST request is handled by some Perl script that rewrites the HTML page, then redirects to it or directly sends it as a reply. See the most basic frame-based chats (no Javascript, no nothing).
It only handles massive traffic if reads of those static pages are frequent, and updates are rare. When thousands of users are posting, you have to either block everyone on each write, or subdivide the required synchronisation between boards. Also, the regenerated pages might be used just a couple of times before the next rewrite happens (or not viewed at all, like far-away board pages, but you still have to rewrite all involved pages as a batch), and not much caching happens. In addition to that, each short comment causes multiple full long pages to be regenerated, and stored on disk. You basically get a very ineffective database with very ineffective primitives.
So usually every image board owner starts with decoupling of message posting and page updates to only regenerate pages once in a while, and process multiple edits at once, then some things stop working, as they assume each state change happens in full in multiple places, then they try to rewrite everything or switch to a real database engine.
Not to destroy the article's author, and to appreciate his effort to prove something: it might be useful in an extreme case of optimization, with a limited amount of data and NO NEED to update/write the files. Just a read-only cache.
If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you are not going to corrupt/break your data on disk, among other safety measures.
you can get surprisingly far with files, but the moment you care about things like concurrent writes or not losing data on a crash, the whole thing changes. at that point you're not choosing speed vs simplicity anymore - you're choosing how much risk you're willing to carry
SQLite did decently well but I think they should’ve done an additional benchmark with the database loaded completely into memory.
Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
This is a cool exercise, but I would hesitate to choose files over SQLite or another Dockerised relational database in production.
They are overoptimising for the simplest part of writing the application; the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
and then there is a decent amount of software that's mostly "one and done" and has immense performance constraints - games and anything that has to do with real-time. for game engines, a custom data format from the very start makes a lot of sense because your budget is usually less than 17ms and 8 threads on low-end hardware and 8.(3)ms across 16 threads on high-end. there, "smart data structures and dumb code beat dumb data structures and smart algorithms" couldn't be more true.
yet, for a generic app or server, just don't fuck your brains and go with SQLite
If you’re going to do something bizarre like this, then why store it as human-readable? If you have fixed-size fields as they do here, make your own serialization format. You’re still doing byte-offset seeks, but it’ll be much faster.
At the very least, use a monotonic key, so you can avoid having to periodically sort.
I remember reading a story from Robert C. Martin, if I recall correctly, about writing an application and trying to decide which DB to use. In the end, they put the DB access paths behind an abstraction and decided that they'd just use the file system to start with, and easily switch it out later. In the end, they shipped, and never did need to use a real DB.
The article is fine, but I wanted to call this out.
"Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open()."
Technically not true. Applications like SQLite use mmap to map the file into a locally addressable memory space. This lets you skip the syscalls when reading and writing. The kernel can map that data in dynamically much faster than a userland process can.
Later in the article they go over the process of reading in the entire file into memory, again mmap is much better at this. Would have been nice to see that approach used.
The article definitely oversimplifies the IO happening in a database.
That said, depending who you are talking to, they may not agree with you on "mmap is much better than this". Some people will say you should do what you need in the application logic instead of depending on APIs from the OS. (although not necessarily for the specific example here)
The backing store being used by map() is still a file in a filesystem, so I would say their overall claim is technically true. It's the "exactly like your code does when it calls open()" part that oversimplifies a little (though, again, remains technically true -- it's just giving an example of a thing you can do with a file, not exhaustively listing all the things you can do with a file).
> The kernel can map that data in dynamically much faster than a userland process can.
Not necessarily. The kernel's mmap implementation has quite a strong bias towards certain kinds of access patterns; deviate from them and it can become slower than read(2).
We tried using mmap(2) for audio file i/o in Ardour, and it got notably less bandwidth than just using read(2).
I have come at it from another angle, so maybe it’s useful perspective.
A lot of data pipeline work can function pretty well with each step producing artifacts that the next can consume. If you need unique items, use uuids. If you want to check whether an item exists, use a hash of something (maybe a url). And each day goes in its own subdirectory. It works.
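The hash-as-identity trick above can be sketched in a few lines. The key derivation here is an assumption for illustration, not any particular pipeline's convention:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seenKey derives a stable, filesystem-safe identifier from a URL.
// Two pipeline runs that encounter the same URL compute the same key,
// so checking whether artifacts/<key>.json exists doubles as a dedup check.
func seenKey(url string) string {
	sum := sha256.Sum256([]byte(url))
	return hex.EncodeToString(sum[:]) // 64 hex chars, safe as a filename
}

func main() {
	fmt.Println(seenKey("https://example.com/item/42"))
}
```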
Sometimes, though, you end up reinventing database constructs. Joins, indexing, versioning. A little of that is ok, but it gets gnarly fast. If you mess up, it could mean corrupting your data in a way that’s irreversible. Or, corrupting your data in a way that’s reversible but really hard.
So a database has the benefit of enforcing consistency that a glob over json files doesn’t. It doesn’t mean every one-off script needs a blue-green Postgres DB, but there can be an inflection point.
I sympathize with this so hard. I frequently conduct system design interviews where the problem could easily be handled on a single machine with a flat file, let alone SQLite. Only the rare candidate mentions this; mostly I get a horde of microservices and queues and massive distributed databases that are totally unnecessary.
I mostly agree but I had plenty of cases where the project / team was forced to reinvent a query engine over flat files and/or in-memory caches.
From that POV I moved from the extreme of "you don't need a state at all (and hence no database)" to the bit more moderate "you usually don't need a file or a DB but you almost certainly will want to query whatever state you store so just get a DB early".
I strongly sympathize with "no microservices" of course. That's an overkill for at least 99% of all projects I've ever seen (was a contractor for a long time). But state + querying is an emergent property as a lot of projects move beyond the prototype phase.
I love SQLite, I love the idea, I love having something mature and lightweight, but like the author I discovered it's overkill or dare I say insufficient for certain use cases.
I was building a client side dictionary app with search functionality and thought that using sqlite's wasm port would be the perfect solution.
I used SQLite for a couple years and it started to wear on me. The database files were bigger than they should be, they didn't compress super well, the loading time was a bit slow. Linear search wasn't very fast either. Editing SQLite files and doing joins and stuff was slow and frustrating as well. I wanted something simpler.
I didn't need atomics, I didn't need writes, so I just handrolled a set of indexes from the source tsv files and compressed those with zstd and decompressed them via wasm on every load. Decompressing and loading was now faster than straight loading via SQLite and since my module was 52kb of wasm instead of 800kb for SQLite I could load as many instances of the module as I wanted with no regrets. I use stringzilla for linear scans and it's ridiculously fast.
SQLite is great, but it's not the solution for every problem.
The author of stringzilla, Ash Vardanian, has a bunch of really cool talks/lectures/demos. Highly recommend checking him out if you like that kind of thing :)
I will still reach for a database 99% of the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data; extracting insights, graphing trends, etc. It's not high-volume data, so I decided to use just the file system, with data backed by YAML files, with some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume.
In this particular case having something that was human readable, and more importantly diffable, was more valuable to me than outright performance.
Having said that, I will still gladly reach for a database with a query language and all the guarantees that comes with 99% of the time.
Let's put it this way. I always end up needing the functionality and ACID guarantees of a database. I always wish I had a database. But some times I'm forced to use the project's legacy data stores (often flat-file data lakes) and watch every wheel get reinvented as we struggle to glue consistency, transactions, a bespoke query language, etc. onto an unwilling pile of unstructured data.
Made the performance on my machine go from 27,700.99 r/s to 89,687.36 r/s.
I also tried making the get user a prepared statement, and storing the timestamp as an unix timestamp integer, but that didn't make much difference for me.
> The index format is simple: one line per record, exactly 58 bytes: <36-char UUID>:<20-digit byte offset in data file>\n.
It would be much better to write all of this as binary data, omitting separators.
• Since it’s fixed-width and simple, inspecting the data is still pretty easy—there are tools for working with binary data with a declared schema, or you could write a few-liner to convert it yourself. You don’t lose much by departing from ASCII.
• You might want to complicate it a little by writing a version tag at the start of the file or outside it so you can change the format more easily (e.g. if you ever add a third column). I will admit the explicit separators do make that easier. You can also leave that for later, it probably won’t hurt.
• It removes one type of error altogether: now all bit patterns are syntactically valid.
• It’ll use less disk space, be cheaper to read, be cheaper to write, and probably take less code.
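A rough sketch of what such a binary entry could look like, assuming a raw 16-byte UUID plus a big-endian uint64 offset (24 bytes instead of 58) — the layout is my assumption, not the article's format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A binary index entry: raw 16-byte UUID + 8-byte big-endian offset.
// 24 bytes per record instead of 58, no separators to mis-parse,
// and every bit pattern is a syntactically valid entry.
const entrySize = 24

func encodeEntry(uuid [16]byte, offset uint64) [entrySize]byte {
	var e [entrySize]byte
	copy(e[:16], uuid[:])
	binary.BigEndian.PutUint64(e[16:], offset)
	return e
}

func decodeEntry(e [entrySize]byte) (uuid [16]byte, offset uint64) {
	copy(uuid[:], e[:16])
	return uuid, binary.BigEndian.Uint64(e[16:])
}

func main() {
	var id [16]byte
	id[0] = 0xde
	e := encodeEntry(id, 123456)
	_, off := decodeEntry(e)
	fmt.Println(off) // 123456
}
```

Big-endian byte order keeps entries comparable with plain bytes.Compare, so a sorted file stays binary-searchable.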
I also want to register alarm at the sample code given for func FindUserBinarySearch. To begin with, despite a return type of (*User, error), it always returns nil error—it swallows all I/O errors and ignores JSON decode errors. Then:
so the article benchmarks speed, but I don't really think most people are making the choice based on speed. Robustness is really the driving force (well at least for me). So for CRUD operations I have way more confidence with something like SQLite than using files. Files are workable in many situations, but much easier if you default to things like SQLite and use files only when you think there is a specific advantage.
There's a whole thing this days about building solvers (e.g. SAT or Ising) out of exotic hardware that does compute in memory. A while back I wondered if one could leverage distributed DB logic to build solvers for massive problems, something like compute in DB.
For anything that starts as a side project, SQLite covers it. The moment you need to ask this question the answer is usually "not yet." Ship first, migrate when it actually hurts
If there is no database, where should one persist state?
Most software is stateful and needs to persist state across restarts, so I would argue that one needs at least SQLite.
On SQLite being a safe default: in practice it means supporting multiple databases, say SQLite and Postgres, which is more complicated than supporting just Postgres. As soon as a project leaves localhost and enters cloud development you need to talk to a database over the network, which warrants MySQL or Postgres.
Which is more complicated: supporting a docker container with MySQL or Postgres for local development, OR supporting multiple databases in the project?
Of course, the answer could be “it depends”, but I would not call SQLite a default choice. It would be if you were writing a desktop or mobile app, but for anything like a web app it’s a questionable choice.
My personal toolbox’s section for databases has three items: files (usually JSON), SQLite, Postgres. I haven’t and I doubt I will ever run into a situation where none of those fit. They exist, I’m sure, but I’m too general purpose to ever need anything else.
Not directly related, but this is a good example of why I love dependency injection. In most systems, I typically define the interface, implement something super simple at first, and as I iterate I re-evaluate, and I can easily* swap between implementations.
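A minimal sketch of that seam in Go; the interface and names are illustrative, not from any particular project:

```go
package main

import (
	"errors"
	"fmt"
)

// UserStore is the seam: handlers depend on this interface, and the
// backing implementation (map, flat file, SQLite, Postgres) is injected.
type UserStore interface {
	Get(id string) (string, error)
	Put(id, name string) error
}

// memStore is the "something super simple at first" implementation —
// swap it for a file- or SQL-backed one later without touching handlers.
type memStore struct{ m map[string]string }

func newMemStore() *memStore { return &memStore{m: map[string]string{}} }

func (s *memStore) Get(id string) (string, error) {
	name, ok := s.m[id]
	if !ok {
		return "", errors.New("not found")
	}
	return name, nil
}

func (s *memStore) Put(id, name string) error {
	s.m[id] = name
	return nil
}

// greet only knows about the interface, never the implementation.
func greet(store UserStore, id string) (string, error) {
	name, err := store.Get(id)
	if err != nil {
		return "", err
	}
	return "hello, " + name, nil
}

func main() {
	store := newMemStore() // later: inject a SQLite-backed UserStore here
	store.Put("u1", "ada")
	msg, err := greet(store, "u1")
	if err != nil {
		panic(err)
	}
	fmt.Println(msg) // hello, ada
}
```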
> Binary search beats SQLite. This was unexpected. Plain sorted files with a hand-rolled index outperform SQLite's B-tree by about 1.7x at every scale. SQLite does more work per lookup than a hand-rolled binary search, even for a simple primary key read. That overhead is worth it when you need the features. For a pure ID lookup, you're paying for machinery you're not using.
1. You're paying very little for 1000x the features.
2. The chances your application will keep doing "pure ID lookups" forever are zero. The moment you need to query the data in any other way or have more than one writer you're going to have to throw all this nonsense code into the trash.
3. Do you need the 1.7x speedup? No, of course you don't. It's just optimizing for the sake of optimizing.
I'd have just used sqlite to begin with and not benchmarked anything. No custom code, no need to waste any time, great performance, amazing flexibility — all with minimum effort.
the_inspector | 22 hours ago
Memory of course, as you wrote, also seems reasonable in many cases.
vovanidze | 22 hours ago
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.
[OP] upmostly | 22 hours ago
But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.
vovanidze | 21 hours ago
weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.
mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.
[OP] upmostly | 21 hours ago
Similar sentiment.
hilariously | 21 hours ago
Why set up a go binary and a json file? Just use google forms and move on, or pay someone for a dead simple form system so you can capture and communicate with customers.
People want to do the things that make them feel good - writing code to fit in just the right size, spending money to make themselves look cool, getting "the right setup for the future so we can scale to all the users in the world!" - most people don't consider the business case.
What they "need" is an interesting one because it requires a forecast of what the actual work to be done in the future is, and usually the head of any department pretends they do that when in reality they mostly manage a shared delusion about how great everything is going to go until reality hits.
I have worked for companies getting billions of hits a month and ones that I had to get the founder to admit there's maybe 10k users on earth for the product, and neither of them was good at planning based on "what they need".
hooverd | 19 hours ago
vovanidze | 15 hours ago
locknitpicker | 19 hours ago
I don't think it makes any sense to presume everyone around you is brainwashed and you are the only soul cursed with reasoning powers. Might it be possible that "we" are actually able to analyse tradeoffs and understand the value of, say, having complete control over deployments with out-of-the-box support for things like deployment history, observability, rollback control, and infrastructure as code?
Or is it brainwashing?
Let's put your claim to the test. If you believe only brainwashed people could see value in things like SQLite or Kubernetes, what do you believe are reasonable choices for production environments?
vovanidze | 15 hours ago
my point is strictly about premature optimization. ive seen teams spend their first month writing helm charts and terraform before they even have a single paying user. if you have product-market fit and need zero-downtime rollbacks, absolutely use k8s. but if youre just validating an mvp, a vps and docker-compose (or sqlite) is usually enough to get off the ground.
its all about trade-offs tbh.
grep_it | 20 hours ago
tracker1 | 18 hours ago
In practice, I almost always separate the auth chain from the service chain(s), in that if auth gets kicked over under a DDoS, at least already-authenticated users stand a chance of still being able to use the apps. I've also designed auth systems essentially abstracted to key/value storage with adapters for differing databases (including SQLite) for deployments...
Would be interested to see how LevelDB might perform for your testing case, in that it seems to be a decent option for how your example is using data.
skapadia | 30 minutes ago
phillipcarter | 18 hours ago
I'm pretty sure most startups just use a quick and easy CRM that makes this process easy, and that tool will certainly use a database.
mamcx | 9 hours ago
And worse, they overestimate how safe their data is!
All this fancy stuff about not using an RDBMS could have been true only if the APIs and actual implementations across the WHOLE IO path were robust and RELIABLE.
But they are not!
EVERY LAYER LIES
ALL of them
ALL OF THE TIME
That is the biggest lesson from building a real database (whatever the flavor): there is no way to avoid paying performance taxes all over the place, because you can't trust the IO, and having a single file (or a few files) getting hammered over and over makes this painfully obvious.
One of the most sobering experiences: you write your IO with all the care in the world, let your brand-new DB run for hours on good, even great, hardware, and in less than a week you will find that it breaks in funny ways.
P.S.: I was part of a team building a DB.
ghc | 21 hours ago
bob1029 | 21 hours ago
ghc | 18 hours ago
Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.
I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge oracle cluster that still employed physical database admins and logical database admins as separate roles...I would bet they're still paying millions for an out of date version of Oracle and using RAW.
evanelias | 11 hours ago
I seem to remember Oracle 10g was first released over 20 years ago? It has been EOL for much longer than 5 years...
chuckadams | 21 hours ago
tracker1 | 18 hours ago
chuckadams | 18 hours ago
tracker1 | 18 hours ago
z3ugma | 21 hours ago
gorjusborg | 21 hours ago
[OP] upmostly | 21 hours ago
evanelias | 21 hours ago
Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.
z3ugma | 21 hours ago
you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
9rx | 21 hours ago
Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
RL2024 | 21 hours ago
9rx | 20 hours ago
wmanley | 19 hours ago
9rx | 19 hours ago
moron4hire | 21 hours ago
tracker1 | 18 hours ago
gorjusborg | 21 hours ago
What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.
z3ugma | 17 hours ago
e.g. worry about what makes your app unique. Data storage is not what makes your app unique. Outsource thinking about that to SQLite
hirvi74 | 21 hours ago
freedomben | 21 hours ago
leafarlua | 21 hours ago
One would think that for a startup of sorts, where things change fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of the entities is known, going for SQL becomes a natural path.
There are also cases for having both, and there are cases for graph-oriented databases or even columnar-oriented ones such as DuckDB.
It seems to me, with my very limited experience of course, that everything leads to the same boring fundamental issue: rarely does the problem lie in infrastructure; it's mostly bad design decisions and poor domain knowledge. Realistically, how many times is the bottleneck really the type of database versus the quality of the code and the implementation of the system design?
dalenw | 20 hours ago
greenavocado | 20 hours ago
mike_hearn | 19 hours ago
You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:
• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.
• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.
• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).
• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.
So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
freedomben | 19 hours ago
tracker1 | 18 hours ago
Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.
mike_hearn | 18 hours ago
So yeah running a relational DB used to be quite high effort but it got a lot better over time.
tracker1 | 18 hours ago
TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes... My experiences with the corporate entity itself leave a lot to be desired, and just getting set up/started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get set up, but more recently has had a lot of difficulties, in particular with docker/dev instances, and more under arm (mac) than alternatives.
I'm a pretty big fan of PG, which is, again, very widely available and supported.
mike_hearn | 18 hours ago
I think PG doesn't have most of the features I named, I'm pretty sure it doesn't have integrated queues for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system), but also, bear in mind the "postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres, that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.
Local dev is the same as other DBs:
See https://container-registry.oracle.com
Works on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.
Cost isn't necessarily much lower. At one point I specced out a DB equivalent to what a managed Postgres would cost for OpenAI's reported workload:
> I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (under-estimate) and HA option. Even with a 1 year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache.
tracker1 | 17 hours ago
chrisweekly | 17 hours ago
mike_hearn | 17 hours ago
Yes I meant it in this sense: "If you knock something up, you make it or build it very quickly, using whatever materials are available."
https://www.collinsdictionary.com/dictionary/english/knock-u...
alexisread | 18 hours ago
rezonant | 14 hours ago
sgarland | 14 hours ago
I love Postgres, and am not a huge fan of Oracle as a corporation, but I can’t deny that their RDBMS has some truly astounding capabilities.
alexisread | 4 hours ago
Index creation https://stackoverflow.com/questions/23876479/will-postgresql...
JSON->DB schema https://jsonschema2db.readthedocs.io/en/latest/index.html
Pg shared-disk failover is similar, but RAC is quite unique; you're not going to use that with a rented cluster, though?
https://www.postgresql.org/docs/current/different-replicatio...
Personally for me any technical advantages don’t outweigh the business side, YMMV :)
mike_hearn | 3 hours ago
> Index creation https://stackoverflow.com/questions/23876479/will-postgresql...
I was ambiguous. That's an answer telling how to create indexes manually, and saying that you get an index for primary keys and unique constraints automatically. Sure, all databases do that. Oracle can create arbitrary indexes for any relation in the background without it being requested, if it notices that common queries would benefit from them.
Forgetting to create indexes is one of the most common issues people face when writing database apps because the performance will be fine on your laptop, or when the feature is new, and then it slows down when you scale up. Or worse you deploy to prod and the site tanks because a new query that "works fine on my machine" is dog slow when there's real world amounts of data involved. Oracle will just fix it for you, Postgres will require a manual diagnosis and intervention. So this isn't the same capability.
> JSON->DB schema https://jsonschema2db.readthedocs.io/en/latest/index.html
Again I didn't provide enough detail, sorry.
What that utility is doing is quite different. For one, it assumes you start with a schema already. Oracle can infer a schema from a collection of documents even if you don't have one by figuring out which fields are often repeated, which values are unique, etc.
For another, what you get after running that utility is relational tables that you have to then access relationally via normal SQL. What JSON duality views give you is something that still has the original document layout and access mode - you GET/PUT whole documents - and behind the scenes that's mapped to a schema and then through to the underlying SQL that would be required to update the tables that the DB generated for you. So you get the performance of normalized relations but you don't have to change your code.
The nice thing about this is it lets developers focus on application features and semantics in the early stages of a startup by just reshaping their JSON documents at will, whilst someone else focuses on improving performance and data rigor fully asynchronously. The app doesn't know how the data is stored, it just sees documents, and the database allows a smooth transition from one data model to another.
I don't think Postgres has anything like this. If it does it'll be in the form of an obscure extension that cloud vendors won't let you use, because they don't want to/can't support every possible Postgres extension out there.
OtomotO | 17 hours ago
Never!
danny_codes | 15 hours ago
So.
Yeah no sane person would be that stupid
marcosdumay | 19 hours ago
NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard-to-do tasks in data migration for an insurmountable mess of data inconsistency bugs that you'll never actually get to the end of.
> is mostly bad design decisions and poor domain knowledge
Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
james_marks | 18 hours ago
Stop and go ask more questions until you have a better understanding of the problem.
jampekka | 17 hours ago
freedomben | 16 hours ago
ranger_danger | 14 hours ago
leafarlua | 17 hours ago
marcosdumay | 17 hours ago
If you have either of those problems, you will know it very clearly.
Also, ironically, Postgres became one of the most scalable NoSQL bases out there, and one of the most flexible to use unstructured data too.
freedomben | 16 hours ago
gf000 | 15 hours ago
It may not have a very rigid schema, you may later add several other optional fields.
You need very large scale (as in the number of concurrent accesses), and you want to shard the data by e.g. location. But also, the data is not "critical": your high school not being visible temporarily for certain users is not an issue.
You mostly use the whole dataset "at the same time", you don't do a lot of WHERE, JOIN on some nested value.
In every other case I would rather reach for postgres with a JSONB column.
yfontana | 13 hours ago
To me write scaling is the main current advantage of KV and document DBs. They can generally do schema evolution fairly easily, but nowadays so can many SQL DBs, with semi-structured column types. Also, you need to keep in mind that KV and document DBs are (mostly) non-relational. The more relational your data, the less likely you are to actually benefit from using those DBs over a relational, SQL DB.
tracker1 | 19 hours ago
A lot of the bespoke NoSQL data stores really started to come to the forefront around 2010 or so. At that time, having 8 CPU cores and 10k RPM SAS spinning drives was a high-end server. Today we have well over 100 cores, TBs of RAM, and PCIe Gen 4/5 NVMe (U.x) storage that is thousands of times faster, all at a total cost lower than those 2010-era servers, which your average laptop can outclass today.
You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.
There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
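The JSONB point above can be sketched with SQLite's built-in JSON functions, used here as a stand-in for Postgres JSONB so the snippet runs anywhere Python does (table and field names are made up):

```python
import sqlite3

# In-memory DB; a denormalized JSON document lives alongside structured columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, profile TEXT)")
conn.execute(
    "INSERT INTO users (profile) VALUES (?)",
    ('{"name": "Ada", "tags": ["admin", "beta"]}',),
)

# Query inside the document with plain SQL -- no schema migration needed
# to add a new field to the JSON blob.
name = conn.execute(
    "SELECT json_extract(profile, '$.name') FROM users"
).fetchone()[0]
print(name)  # Ada
```

This assumes the linked SQLite library was built with JSON support, which is the default in recent versions and in CPython's bundled builds.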
leafarlua | 17 hours ago
tracker1 | 17 hours ago
That said, I've leaned into avoiding breaking an app up into a lot of microservices unless/until you need them... I'm also not opposed to combining CQRS-style workflows if/when you do need microservices. Usually if you need them, you're first breaking off certain compute/logic workflows where the async/queued nature lends itself to your needs. My limited experience with a heavy microservice application combined with GraphQL was somewhat painful, in that the infrastructure and orchestration weren't appropriately backed by dedicated teams, leading to excess complexity and job duties for a project that would have scaled just fine in a more monolithic approach.
YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection-sharing heuristics, depending again on your infrastructure and needs... I've got worker pools that mostly operate off a queue, perform heavy compute loads, then interact with the same API service(s) as everything else.
hunterpayne | 11 hours ago
AlotOfReading | 18 hours ago
ignoramous | 15 hours ago
NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.
Prior: https://news.ycombinator.com/item?id=22249490
hunterpayne | 12 hours ago
No, no it isn't. It never is. Just as building your house on a rubber foundation isn't the correct answer either. This is just cope. Unless your use cases don't care about losing data or data corruption at all, NoSQL isn't the correct answer.
icedchai | 18 hours ago
tshaddox | 18 hours ago
akdev1l | 17 hours ago
This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not
Hence make the storage dumb and have the application do the joins and now your app scales right up
(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)
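A minimal sketch of that "application does the joins" pattern, with hypothetical in-memory dicts standing in for a dumb KV store:

```python
# Two "dumb" KV stores, e.g. what a document DB would hand back.
users = {
    "u1": {"name": "Ada", "org_id": "o1"},
    "u2": {"name": "Grace", "org_id": "o1"},
}
orgs = {"o1": {"name": "Acme"}}

def user_with_org(user_id):
    # The application-level join: fetch both records, stitch them in code.
    user = users[user_id]
    return {**user, "org": orgs[user["org_id"]]}

print(user_with_org("u1")["org"]["name"])  # Acme
```

The storage layer stays trivially shardable (each store can live on a different node), at the cost of the app re-implementing what a JOIN gives you for free.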
noveltyaccount | 21 hours ago
pgtan | 19 hours ago
goerch | 18 hours ago
datadrivenangel | 17 hours ago
hunterpayne | 11 hours ago
bachmeier | 19 hours ago
Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.
locknitpicker | 19 hours ago
Why waste time screwing around with ad-hoc file reads, then?
I mean, what exactly are you buying by rolling your own?
bachmeier | 19 hours ago
pythonaut_16 | 19 hours ago
What's special about SQLite is that it already solves most of the things you need for data persistence without adding the same kind of overhead or trade offs as Postgres or other persistence layers, and that it saves you from solving those problems yourself in your json text files...
Like by all means don't use SQLite in every project. I have projects where I just use files on the disk too. But it's kinda inane to pretend it's some kind of burdensome tool that adds so much overhead it's not worth it.
locknitpicker | 19 hours ago
What overhead?
SQLite is literally more performant than fread/fwrite.
cleversomething | 19 hours ago
ablob | 19 hours ago
cleversomething | 19 hours ago
Battle-tested, extremely performant, easier to use than a homegrown alternative?
By all means, hack around and make your own pseudo-database file system. Sounds like a fun weekend project. It doesn't sound easier or better or less costly than using SQLite in a production app though.
whalesalad | 19 hours ago
“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
mrec | 18 hours ago
https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
randyrand | 19 hours ago
Wait until you actually need it.
[OP] upmostly | 18 hours ago
Premature optimisation I believe that's called.
I've seen it play out many times in engineering over the years.
hunterpayne | 11 hours ago
You are just mislabeling good architecture as 'premature optimization'. So I will give you another platitude... "There is nothing so permanent as a temporary software solution"
dkarl | 18 hours ago
In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.
It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.
The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.
goerch | 17 hours ago
hackingonempty | 18 hours ago
trgn | 17 hours ago
herpdyderp | 17 hours ago
zabzonk | 16 hours ago
banana_giraffe | 17 hours ago
akdev1l | 17 hours ago
trgn | 15 hours ago
hunterpayne | 11 hours ago
freedomben | 21 hours ago
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
giva | 21 hours ago
How? With SQL it's super easy to search, compare, and update data. That's what it's built for.
freedomben | 21 hours ago
giva | 21 hours ago
However, if all your application state can be represented in a single JSON file of less than a dozen MB, then yes, a database can be overkill.
freedomben | 20 hours ago
Yes agreed, but it's usually a lot easier to find the filename part, especially if the application follows XDG. Sqlite databases are usually buried somewhere because they aren't expected to be looked at.
the_pwner224 | 19 hours ago
fatih-erikli-cg | 21 hours ago
Even if it's Postgres, it is still a file on disk. If there's a need for something like partitioning the data, it is much easier to write the code that partitions the data yourself.
If there's a need to add something with text inputs, checkboxes, etc., a database with its admin tools may be a good thing. If the data is something that gets imported, exported, etc., a database may be a good thing too. But I still don't believe in such cases; in my ten-something years of software development, something like that never happened.
Sharlin | 21 hours ago
fatih-erikli-cg | 15 hours ago
Sharlin | 13 hours ago
fatih-erikli-cg | 10 hours ago
bsenftner | 21 hours ago
anonymars | 21 hours ago
vlapec | 20 hours ago
…and it won’t get better anytime soon.
pythonaut_16 | 19 hours ago
That alone is a terrible thing. Open formats are so much more user friendly
zeroonetwothree | 20 hours ago
gavinray | 21 hours ago
Different languages and stdlib methods can often spend time doing unexpected things that make what looks like an apples-to-apples comparison not quite equivalent
srslyTrying2hlp | 21 hours ago
I ended up just buying a VPS, putting openclaw on it, and letting it Postgres my app.
I feel like this article is outdated since the invention of OpenClaw/Claude Opus level AI Agents. The difficulty is no longer programming.
fifilura | 21 hours ago
jbiason | 21 hours ago
In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and access the user ID directly instead of reading through some file to find them. Then again, given the examples used, search by name would be a pain, and that's one point where databases would handle things better.
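A rough sketch of that `user/[id]/info.json` layout (paths and fields are illustrative; a temp directory stands in for the data root):

```python
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())  # stand-in for the app's data directory

def save_user(user_id, info):
    # user/[id]/info.json -- the directory tree itself is the index.
    path = root / "user" / str(user_id) / "info.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(info))

def load_user(user_id):
    # Direct lookup by ID: no scanning, just a path construction.
    return json.loads((root / "user" / str(user_id) / "info.json").read_text())

save_user(42, {"name": "Ada"})
print(load_user(42))  # {'name': 'Ada'}
```

Lookup-by-ID is O(1)-ish here, but any secondary query (find users by name) means walking the whole tree, which is exactly where a database starts to earn its keep.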
m6z | 21 hours ago
forinti | 21 hours ago
As the years went by, I expected the client to move to something better, but he just stuck with it until he died, after about 20 years. The family took over and had everything redone (it now runs WordPress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.
da02 | 19 hours ago
forinti | 19 hours ago
da02 | 19 hours ago
forinti | 19 hours ago
I should have charged him a percentage. Even if I had charged 0.5%, I would have made more money.
da02 | 17 hours ago
jwitchel | 21 hours ago
goerch | 17 hours ago
hunterpayne | 11 hours ago
randusername | 21 hours ago
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
goerch | 18 hours ago
ForHackernews | 21 hours ago
ethan_smith | 19 hours ago
XorNot | 21 hours ago
So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one it'll be obvious.
But you probably do.
JohnMakin | 21 hours ago
Joeboy | 21 hours ago
Total hosting costs are £0 ($0) other than the domain name.
shafoshaf | 21 hours ago
The very small bonus you get on small apps is hardly worth the time you spend reinventing the wheel.
adrian_b | 18 hours ago
During Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today, e.g. there were big sharks that differed very little from the white sharks and tiger sharks of today.
Then the dinosaurs have disappeared, together with the pterosaurs and the mosasaurs, and they have been replaced by other animals, but the sharks have continued to live until today with little changes, because they had already reached an optimized design that was hard to improve.
Besides the sharks, during Cretaceous there already existed along the dinosaurs other 2 groups of big predators that have changed little since then, crocodiles and big constrictor snakes similar to the pythons of today.
Therefore all 3 (sharks, crocodiles and big constrictor snakes) are examples of locally optimum designs that have been reached more than 70 million years ago, without needing after that any major upgrades.
MattRogish | 20 hours ago
If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.
Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.
A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).
They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".
It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.
Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node and Typescript and code at all is unnecessary complexity.
PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:
concurrency, access controls, backups, transactions: recovery, rollback, committing, etc., ability to do aggregations, joins, indexing, arbitrary queries, etc. etc.
These are not just "nice to have" in the vast, vast majority of projects.
"The cases where you'll outgrow flat files:"
Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.
If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.
Didn't we just have this discussion like yesterday (https://ultrathink.art/blog/sqlite-in-production-lessons)?
It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)
I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.
Not to mention now you need a AGENTS.md that says "We use our own home-grown database nonsense if you want to query the JSON file in a different way just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.
This is madness.
stackskipton | 20 hours ago
Overhead in any project is understanding it and onboarding new people to it. Keeping on "mainline" path is key to lower friction here. All 3 languages have well supported ORM that supports SQLite.
tracker1 | 18 hours ago
goerch | 17 hours ago
matja | 20 hours ago
ktzar | 20 hours ago
koliber | 20 hours ago
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.
My winning combo is sqlite + in-memory cache.
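That combo can be sketched in a few lines (names are hypothetical; the hard part in practice is invalidating the cache on writes, e.g. calling `cache_clear()`):

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")

@lru_cache(maxsize=1024)
def get_setting(key):
    # Hot reads come straight from memory; SQLite is only hit on a cache miss.
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

assert get_setting("theme") == "dark"  # first call hits SQLite
assert get_setting("theme") == "dark"  # second call is served from the cache
```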
[OP] upmostly | 18 hours ago
koliber | 17 hours ago
pseudosavant | 18 hours ago
koliber | 17 hours ago
I'm also a convert.
oliver236 | 6 hours ago
Croak | 5 hours ago
koliber | 4 hours ago
jerf | 16 hours ago
nine_k | 12 hours ago
If your downtime does not cost much, you can host many things on a single tiny computer.
ryang2718 | 8 hours ago
I can't talk though because I actually find myself doing this a lot
graemep | 13 minutes ago
When you combine all those a database is a better alternative for all but the simplest cases.
rglover | 19 hours ago
I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"
tracker1 | 18 hours ago
nishagr | 19 hours ago
swiftcoder | 19 hours ago
ozgrakkurt | 19 hours ago
This is also why many databases have persistence issues and can easily corrupt on-disk data on a crash. RocksDB on Windows a couple of years back is a simple example: it was regularly having corruption issues during development.
dkarl | 18 hours ago
"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.
That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.
Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
duped | 16 hours ago
If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.
ricardobeat | 14 hours ago
This, 100%. Development today is driven by appearances though, you can take advantage of that. Give it a cute name, make sure you have AI generate an emoji-rich README for it, publish it as an open source npm package, then trigger CI a few thousand times to get a pretty download count. They will happily continue using it without fear!
pythonaut_16 | 14 hours ago
If you start a new job and on your first day they go "Yeah the last guy said we don't need a database, so he rolled his own." are you gonna be excited, or sweating?
Exception being perhaps "The last team chose to build their own data layer, and here's the decision log and architecture docs proving why it was needed."
hunterpayne | 13 hours ago
btilly | 13 hours ago
That said, there really are good use cases for readable formatted files. For example configuration files that are checked into source control are far more trackable than a SQLite database for the same purpose. For another example, the files are convenient for a lot of data transfer purposes.
But for updateable data? You need a really good reason not to simply use a database. I've encountered such edge cases. But I've encountered a lot more people who thought that they had an edge case, than really did.
goerch | 18 hours ago
creatonez | 18 hours ago
Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.
jeffffff | 17 hours ago
creatonez | 11 hours ago
I believe syncing the parent directory is only needed for durability and not atomicity, but if you're using this method you're probably not caring about durability in the first place, because you've designed a system that doesn't let you do durability in a fine-grained way without severe performance loss.
hunterpayne | 12 hours ago
This is false, though most filesystems will. However, there are a lot of fs calls you have to make, which you probably don't know about, to make the fs operations atomic.
PS: The way you propose is probably the hardest way to do an atomic FS operation. It will have the highest probability of failure and the longest period of operations and service disruption. There is a good reason we move rows one at a time, or in batches sized to match OS buffers.
vector_spaces | 17 hours ago
noselasd | 16 hours ago
At least write to a temp file (in the same filesystem), fsync the file and its folder, and rename it over the original.
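That write-temp/fsync/rename recipe, sketched for POSIX systems (the directory fsync at the end does not translate to Windows):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory (same filesystem),
    # fsync it, then atomically rename it over the target. Finally
    # fsync the directory so the rename itself survives a crash.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic on POSIX filesystems
        dir_fd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

path = os.path.join(tempfile.mkdtemp(), "state.json")
atomic_write(path, b'{"users": 1}')
result = open(path, "rb").read()
```

Readers either see the old file or the new one, never a half-written mix, which is most of what people actually want when they say "atomic file writes".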
wasabi991011 | 16 hours ago
gavinray | 15 hours ago
tclancy | 14 hours ago
sgarland | 14 hours ago
yabutlivnWoods | 10 hours ago
https://docs.kernel.org/filesystems/ext4/atomic_writes.html
827a | 19 hours ago
tracker1 | 18 hours ago
In practice, the records themselves took no less than 30 joins for a flat view of the record data that was needed for a single view of what could/should have been one somewhat denormalized record. In the early 2010s that meant the main database was often kicked over under load, and it took a lot of effort to add in appropriate caching and the search DB, which wound up handling most of the load on a smaller server.
winrid | 19 hours ago
Just have to use locks to be careful with writes.
I figured I'd migrate it to a database after maybe 10k users or so.
mfro | 19 hours ago
winrid | 17 hours ago
Sadly no solution for non-rooted consoles.
goerch | 17 hours ago
winrid | 17 hours ago
tracker1 | 18 hours ago
moregrist | 18 hours ago
And then another index. And at some point you want to ensure uniqueness or some other constraint.
And then you’re rewriting a half-complete and buggy SQLite. So I’ve come around to defaulting to SQLite/PostgreSQL unless I have a compelling need otherwise. They’re usually the right long-term choice for my needs.
tracker1 | 18 hours ago
That it's now in the box (node:sqlite) for Deno/TS makes it that much more of an easy button option.
allknowingfrog | 18 hours ago
amw-zero | 18 hours ago
linuxhansl | 18 hours ago
Don't use a sports-car to haul furniture or a garbage truck as an ambulance. For the use case and scale mentioned in the article it's obvious not to use a database.
Am I missing something? I guess many people are the using the tools they are familiar with and rarely question whether they are really applicable. Is that the message?
I think a more interesting question is whether you will need a single source of truth. If you don't you can scale on many small data sets without a database.
I will say this before I shut up with my rant: if you start with a design that scales, you will have an easier time scaling when it is time, without re-engineering your stack. Whether you think you will need to scale depends on your projected growth and the nature of your problem (do you need a single source of truth, etc.).
Edits: Spelling
cratermoon | 18 hours ago
jrecursive | 18 hours ago
goerch | 18 hours ago
yubblegum | 18 hours ago
traderj0e | 18 hours ago
There are also things besides databases that I'll DIY and then still wonder why so many people use a premade tool for it, like log4j
subhobroto | 17 hours ago
I suggest every developer learn how to replicate, back up, and restore the very database they are excited about, from scratch, at least once. I propose this will teach them what it takes to build a production-ready system and gain some appreciation for other ways of managing state.
gf000 | 15 hours ago
subhobroto | 14 hours ago
That's why I was encouraging people to make backups and restores of their DB because personally, it has made me appreciate why different designs exist and when to use them vs just slapping a DB connection to an RDS instance and calling it a day.
rasengan | 18 hours ago
traderj0e | 18 hours ago
pstuart | 18 hours ago
But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.
If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.
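A sketch of that approach: fixed-width records sorted by key, binary-searched through `mmap` (the record layout here is made up for illustration):

```python
import mmap
import struct
import tempfile

# Fixed-width records: 8-byte big-endian key + 8-byte big-endian value,
# sorted by key so the file itself acts as the index.
RECORD = struct.Struct(">QQ")

with tempfile.NamedTemporaryFile(delete=False) as f:
    for key, value in [(3, 30), (7, 70), (42, 420), (99, 990)]:
        f.write(RECORD.pack(key, value))
    path = f.name

def lookup(path, key):
    # Binary search directly over the mmapped bytes; the OS page cache
    # keeps the hot pages in RAM, so repeated lookups are essentially free.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            k, v = RECORD.unpack_from(mm, mid * RECORD.size)
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid
    return None

print(lookup(path, 42))  # 420
```

This only works because the data is static and pre-sorted; the moment you need in-place updates or variable-width records, you are back to rebuilding the file (or rebuilding SQLite).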
goerch | 17 hours ago
pstuart | 17 hours ago
jmull | 18 hours ago
You'll likely end up quite a chump if you follow this logic.
sqlite has pretty strong durability and consistency mechanisms that their toy disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
goerch | 18 hours ago
oliviergg | 18 hours ago
debo_ | 17 hours ago
They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!
kevstev | 15 hours ago
debo_ | 15 hours ago
http://www.redbook.io/
goerch | 17 hours ago
mdp | 16 hours ago
waldrews | 18 hours ago
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.
0x457 | 18 hours ago
Then proceeds to (poorly) implement database on files.
Sure, a hash map that takes ~400 MB in memory is going to offer you fast lookups. "Some workloads will never reach this size" can be made as an argument, but what are you losing by using SQLite?
What happens when the service shuts down mid-write? Corruption that later results in a (poorly) implemented WAL being added?
SQLite also showed something important - it was consistent in all benchmarks regardless of dataset size.
throwway120385 | 17 hours ago
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.
charcircuit | 17 hours ago
It's the opposite. A file system is a database. And databases can recursively store their data within another database.
tonymet | 17 hours ago
hnlmorg | 17 hours ago
Nope. There are non-persistent in-memory databases too.
In fact, a database can be a plethora of things and the stuff they were building is just a subset of a subset (persistent, local relational database)
jmaw | 17 hours ago
Also notable mention for JSON5 which supports comments!: https://json5.org/
zkmon | 16 hours ago
jmaw | 16 hours ago
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.
FpUser | 16 hours ago
mikenikles | 8 hours ago
theshrike79 | 16 hours ago
That's why it could handle massive traffic with very little issues.
flomo | 15 hours ago
ogurechny | 15 hours ago
It only handles massive traffic if reads of those static pages are frequent and updates are rare. When thousands of users are posting, you have to either block everyone on each write, or subdivide the required synchronisation between boards. Also, the regenerated pages might be used just a couple of times before the next rewrite happens (or not viewed at all, like faraway board pages, but you still have to rewrite all involved pages as a batch), and not much caching happens. In addition to that, each short comment causes multiple full, long pages to be regenerated and stored on disk. You basically get a very ineffective database with very ineffective primitives.
So usually every image board owner starts with decoupling message posting from page updates, regenerating pages only once in a while and processing multiple edits at once; then some things stop working, because they assumed each state change happened in full in multiple places; then they try to rewrite everything, or switch to a real database engine.
agustechbro | 15 hours ago
If you ever need to update a single byte in your data, please USE A PROPER DATABASE. Databases do a lot of fancy things to ensure you don't corrupt or break your data on disk, among other safety guarantees.
cold_tom | 15 hours ago
orthogonal_cube | 15 hours ago
Since they’re using Go to accept requests and forward them to their SQLite connection, it may have been worthwhile to build the same interface in Rust to show whether SQLite itself was hitting its performance limit or whether Go had a hand in it.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
devilsdata | 15 hours ago
They are overoptimising for the simplest part of writing the application; the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.
See: https://danluu.com/file-consistency/
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
bpavuk | 13 hours ago
yet, for a generic app or server, just don't fuck your brains and go with SQLite
allthetime | 14 hours ago
sgarland | 14 hours ago
At the very least, use a monotonic key, so you can avoid having to periodically sort.
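One way to sketch that suggestion in Go (the key scheme and the `newKey` name here are hypothetical, not from the article): a zero-padded millisecond timestamp plus an atomic counter produces keys whose lexicographic order matches creation order, so appended records are already sorted and never need a periodic re-sort.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var seq atomic.Uint64 // process-wide sequence number

// newKey returns a key that is strictly increasing within a process:
// a zero-padded millisecond timestamp plus an atomic sequence number.
// Because both parts are fixed-width and zero-padded, lexicographic
// order matches creation order.
func newKey() string {
	return fmt.Sprintf("%013d-%012d", time.Now().UnixMilli(), seq.Add(1))
}

func main() {
	a, b := newKey(), newKey()
	fmt.Println(a < b) // later keys always sort after earlier ones
}
```

Note this only guarantees ordering within a single process; multiple writers would need a coordinated scheme (or something ULID-like) instead.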
tejohnso | 13 hours ago
briandw | 13 hours ago
"Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open()."
Technically not always true. Applications like SQLite can use mmap to map the file into a locally addressable memory space (in SQLite's case, when `mmap_size` is configured). This lets you skip the syscalls when reading and writing. The kernel can map that data in dynamically much faster than a userland process can.
Later in the article they go over reading the entire file into memory; again, mmap is much better at this. It would have been nice to see that approach used.
fg137 | 11 hours ago
That said, depending on who you are talking to, they may not agree with you that "mmap is much better at this". Some people will say you should do what you need in the application logic instead of depending on APIs from the OS (although not necessarily for the specific example here).
https://db.cs.cmu.edu/mmap-cidr2022/
akoboldfrying | 11 hours ago
PaulDavisThe1st | 8 hours ago
Not necessarily. The kernel's mmap implementation has quite a strong bias towards certain kinds of access patterns; deviate from them and it can become slower than read(2).
We tried using mmap(2) for audio file i/o in Ardour, and it got notably less bandwidth than just using read(2).
debatem1 | 8 hours ago
PaulDavisThe1st | 6 hours ago
janalsncm | 12 hours ago
A lot of data pipeline work can function pretty well with each step producing artifacts that the next can consume. If you need unique items, use uuids. If you want to check whether an item exists, use a hash of something (maybe a url). And each day goes in its own subdirectory. It works.
Sometimes, though, you end up reinventing database constructs. Joins, indexing, versioning. A little of that is ok, but it gets gnarly fast. If you mess up, it could mean corrupting your data in a way that’s irreversible. Or, corrupting your data in a way that’s reversible but really hard.
So a database has the benefit of enforcing consistency that a glob over json files doesn’t. It doesn’t mean every one-off script needs a blue-green Postgres DB, but there can be an inflection point.
a34729t | 12 hours ago
pdimitar | 10 hours ago
From that POV I moved from the extreme of "you don't need a state at all (and hence no database)" to the bit more moderate "you usually don't need a file or a DB but you almost certainly will want to query whatever state you store so just get a DB early".
I strongly sympathize with "no microservices" of course. That's an overkill for at least 99% of all projects I've ever seen (was a contractor for a long time). But state + querying is an emergent property as a lot of projects move beyond the prototype phase.
tnelsond4 | 11 hours ago
I was building a client side dictionary app with search functionality and thought that using sqlite's wasm port would be the perfect solution.
I used SQLite for a couple years and it started to wear on me. The database files were bigger than they should be, they didn't compress super well, the loading time was a bit slow. Linear search wasn't very fast either. Editing SQLite files and doing joins and stuff was slow and frustrating as well. I wanted something simpler.
I didn't need atomics, I didn't need writes, so I just handrolled a set of indexes from the source tsv files and compressed those with zstd and decompressed them via wasm on every load. Decompressing and loading was now faster than straight loading via SQLite and since my module was 52kb of wasm instead of 800kb for SQLite I could load as many instances of the module as I wanted with no regrets. I use stringzilla for linear scans and it's ridiculously fast.
SQLite is great, but it's not the solution for every problem.
jfreds | 8 hours ago
tnelsond4 | 8 hours ago
It's a little out of the scope of stringzilla but I should still be able to use it for the heavy lifting
thutch76 | 10 hours ago
I will still reach for a database 99% of the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data; extracting insights, graphing trends, etc. It's not high-volume data, so I decided to use just the file system, with data backed by YAML files, with some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume.
In this particular case having something that was human readable, and more importantly diffable, was more valuable to me than outright performance.
Having said that, I will still gladly reach for a database with a query language and all the guarantees that comes with 99% of the time.
perrygeo | 9 hours ago
WatchDog | 9 hours ago
Just adding:
That made the performance on my machine go from 27,700.99 r/s to 89,687.36 r/s. I also tried making the get-user query a prepared statement, and storing the timestamp as a Unix timestamp integer, but neither made much difference for me.
chrismorgan | 8 hours ago
It would be much better to write all of this as binary data, omitting separators.
• Since it’s fixed-width and simple, inspecting the data is still pretty easy—there are tools for working with binary data of declared schema, or you could write a few-liner to convert it yourself. You don’t lose much by departing from ASCII.
• You might want to complicate it a little by writing a version tag at the start of the file or outside it so you can change the format more easily (e.g. if you ever add a third column). I will admit the explicit separators do make that easier. You can also leave that for later, it probably won’t hurt.
• UUID: 36 bytes → 16 bytes.
• Offset: 20 bytes (zero-padded base-ten integer) → 8 bytes.
• It removes one type of error altogether: now all bit patterns are syntactically valid.
• It’ll use less disk space, be cheaper to read, be cheaper to write, and probably take less code.
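Sketching that layout in Go (the `Record` struct and its method names are illustrative; the comment only specifies a raw 16-byte UUID plus an 8-byte integer with no separators):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const recordSize = 16 + 8 // raw UUID + big-endian uint64 offset

// Record is a fixed-width row: 16-byte UUID, 8-byte offset.
type Record struct {
	UUID   [16]byte
	Offset uint64
}

// encode writes the record as exactly recordSize bytes, no separators.
func (r Record) encode() []byte {
	buf := make([]byte, recordSize)
	copy(buf[:16], r.UUID[:])
	binary.BigEndian.PutUint64(buf[16:], r.Offset)
	return buf
}

// decodeRecord parses one fixed-width record. Every 24-byte input is
// syntactically valid, so there is no parse error to handle.
func decodeRecord(b []byte) Record {
	var r Record
	copy(r.UUID[:], b[:16])
	r.Offset = binary.BigEndian.Uint64(b[16:24])
	return r
}

func main() {
	var r Record
	copy(r.UUID[:], []byte("0123456789abcdef")) // stand-in for raw UUID bytes
	r.Offset = 42

	enc := r.encode()
	fmt.Println(len(enc), decodeRecord(enc).Offset)
}
```

Using big-endian for the offset also means records compare correctly with plain bytewise comparison, which keeps binary search trivial.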
I also want to register alarm at the sample code given for func FindUserBinarySearch. To begin with, despite a return type of (*User, error), it always returns a nil error—it swallows all I/O errors and ignores JSON decode errors. Then:
• That strings.TrimRight will only do anything if your data is corrupted. Not important when you control the writing, but worth noting that UUID string comparison is case-insensitive.
• Superfluous: ParseInt doesn’t mind leading zeroes, and it’ll probably skip them faster than a separate TrimLeft call.
• That’s begging to make data corruption difficult to debug. Most corruption will now become dataOffset 0. Congratulations! You are now root.
keithnz | 7 hours ago
amarant | 7 hours ago
inasio | 7 hours ago
thegdsks | 7 hours ago
kantselovich | 6 hours ago
Most software is stateful and needs to persist state across restarts, so I would argue that one needs at least SQLite.
On SQLite being a safe default: in practice it means supporting multiple databases, say SQLite and Postgres, and that is more complicated than supporting just Postgres. As soon as a project leaves localhost and enters cloud deployment you need to talk to a database over the network, which warrants MySQL or Postgres.
Which is more complicated: supporting a docker container with mysql or Postgres for local development OR supporting multiple databases in the project?
Of course, the answer could be “it depends”, but I would not call SQLite a default choice. It would be if you were writing a desktop or mobile app, but for anything like a web app it’s a questionable choice.
Waterluvian | 6 hours ago
afpx | 5 hours ago
* Not always super easy
jwilliams | 5 hours ago
Mawr | 3 hours ago
1. You're paying very little for 1000x the features.
2. The chances your application will keep doing "pure ID lookups" forever are zero. The moment you need to query the data in any other way or have more than one writer you're going to have to throw all this nonsense code into the trash.
3. Do you need the 1.7x speedup? No, of course you don't. It's just optimizing for the sake of optimizing.
I'd have just used sqlite to begin with and not benchmarked anything. No custom code, no need to waste any time, great performance, amazing flexibility — all with minimum effort.