The hidden cost of PostgreSQL arrays

21 points by sjamaan 17 hours ago on lobsters | 7 comments

mqudsi | 5 hours ago

PostgreSQL 14 introduced LZ4 as an alternative to the default pglz compression

Postgres 16 introduced ZSTD, which is even better. (Though LZ4 might still be a better choice for smaller TOAST columns.)
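
For anyone who hasn't played with it, this is roughly how you pick the compression method on 14 and later (table and column names here are made up):

    -- Make LZ4 the default TOAST compression for newly created columns (PostgreSQL 14+)
    SET default_toast_compression = 'lz4';

    -- Or choose it per column on a hypothetical table
    CREATE TABLE measurements (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        samples double precision[] COMPRESSION lz4
    );

    -- Existing columns can be switched too; only values written afterwards
    -- use the new method, old TOAST data stays as it was until rewritten
    ALTER TABLE measurements ALTER COLUMN samples SET COMPRESSION lz4;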

simonw | 3 hours ago

When creating a table, you might expect strict typing, and that's true for everything except the array dimensions. You might think integer[][] enforces a 2D matrix, but it does not: the [] syntax is effectively syntactic sugar, and PostgreSQL does not enforce the number of dimensions of sub-arrays at the schema level at all by default.
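
A quick demo of what I mean, with a throwaway table:

    -- The declared dimensions are effectively documentation only
    CREATE TABLE grids (cells integer[][]);

    INSERT INTO grids VALUES ('{1,2,3}');        -- 1-D array: accepted
    INSERT INTO grids VALUES ('{{1,2},{3,4}}');  -- 2-D array: also accepted

    SELECT cells, array_ndims(cells) FROM grids; -- reports 1 and 2 dimensions

    -- If you want an actual 2-D guarantee you have to add it yourself, e.g.
    --   CHECK (array_ndims(cells) = 2)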

That surprised me. I wonder why PostgreSQL is designed that way - I normally expect it to be more strict than that.

Signez | 17 hours ago

Great overview of all the pitfalls of using arrays in Postgres (something I've found very tempting at times, but which has always bitten me at some point).

I am surprised though by the mention of 2024 at the beginning of the section about vectors, in a blogpost that seems to have been published yesterday… is it a rerun? :)

doctor_eval | 15 hours ago

It depends on your use case, but arrays can be a very significant performance benefit, simply because a link table requires not just extra columns (at a minimum the foreign key, plus an ordinality column since arrays also carry ordering), but also a pkey index and possibly an fkey index, depending on how you structure the table. If the linked table is sparse then you can have very significant read amplification relative to arrays.

I’ve used arrays extensively over the years and there are definite performance advantages, even though they aren’t “relational” in the dogmatic sense. They aren’t for every use case and should be deployed with care, but link tables can suffer from the N+1 problem, and arrays can help you avoid it.
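
To make that concrete, here is a rough sketch of the two layouts (all names invented):

    -- Link-table layout: one row per membership, a position column to keep
    -- ordering, a composite primary key, and usually an extra index.
    CREATE TABLE track (
        id    bigint PRIMARY KEY,
        title text NOT NULL
    );

    CREATE TABLE playlist (
        id   bigint PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE playlist_track (
        playlist_id bigint NOT NULL REFERENCES playlist (id),
        track_id    bigint NOT NULL REFERENCES track (id),
        position    int    NOT NULL,
        PRIMARY KEY (playlist_id, position)
    );
    CREATE INDEX ON playlist_track (track_id);

    -- Array layout: ordering is implicit in element order, reads need no join,
    -- and the whole list lives next to the owning row (or in its TOAST data).
    CREATE TABLE playlist_arr (
        id        bigint PRIMARY KEY,
        name      text   NOT NULL,
        track_ids bigint[] NOT NULL DEFAULT '{}'
    );

The trade-off, of course, is that the array version gives up declarative foreign keys on the individual elements, so integrity has to be handled elsewhere.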

[OP] sjamaan | 15 hours ago

Absolutely. In one early project where I started using Postgres for storing simulation results, I ran into disk space issues. Instead of storing rows of essentially (simulation_nr, time, variable, value), I stored (simulation_nr, variable, values[]). Since it was a simulation, time here was simply a monotonically increasing integer which would be the position in the array.

The row overhead (24 bytes, IIRC) of storing individual values was quite high, and then there was the timestamp in each row. Of course, querying wasn't very fast either. With arrays, the queries flew and fetching all the measurements for a given variable to display in a plot was almost instantaneous because they were stored neatly together.
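
Roughly, the two layouts looked like this (made-up names rather than the original schema):

    -- One row per sample: ~24 bytes of tuple header for every single value,
    -- plus the repeated key columns
    CREATE TABLE samples_rowwise (
        simulation_nr int  NOT NULL,
        t             int  NOT NULL,  -- monotonically increasing step number
        variable      text NOT NULL,
        value         double precision NOT NULL
    );

    -- One row per (simulation, variable): the step number is just the array
    -- index, so a whole series is stored together and a plot is a single fetch
    CREATE TABLE samples_arrays (
        simulation_nr int  NOT NULL,
        variable      text NOT NULL,
        vals          double precision[] NOT NULL,
        PRIMARY KEY (simulation_nr, variable)
    );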

doctor_eval | 15 hours ago

Yes! Data locality is a performance superpower!

buffalo7 | an hour ago

Text flow on that page is totally broken on mobile, I presume because the code blocks don’t have a fixed width or scroll bars. Too bad.