How our protagonist discovered that a key service that powers our support was absurdly vulnerable to overload, and what we did to fix it.
Part of our support infrastructure at work is an in-memory datastore that lets us query our outstanding support work over various dimensions, such as work type or whether it's been put on hold for some reason. It's functionally equivalent to a single table in an SQL database: a single dataset, boolean filters and configurable sorting.
At work, we have an in-memory datastore that powers part of our support infrastructure. It's kind of analogous to having bitmap filters with post-hoc filtering, so any use of sort/limit will sort the entire result set. And the key part here is that the result sets can be large enough that sorts can take one or two seconds.
And for a bit of context, this service deployment wasn't autoscaled at the time, and upstream services will retry failed requests. Sometimes after a relatively short timeout. Which is fun.
So, one day, this service had more query load than it could handle; and because of the inelasticity, it got overloaded, and queries started to take way longer (like, up to a minute vs. a typical 1-2s). Unfortunately, because this was an incident, and sometimes the panic sets in, one of my theories was that memory had gotten slower. Which of course was absurd, but under time pressure, incident brain can be very real.
However, as foreshadowed earlier, this service had simply become overloaded, so we not only had slightly higher than average demand, but also failure demand from retries. Most of the time in a Go service, we pass around a context, so that when the caller gives up on us, we can cancel the operation, short-circuit and bail early.
However, when we were able to get a CPU profile and take a look, the vast majority of the CPU time was taken up in the sort phase of the query. In Go, none of the sort functions support cancellation (reasonably so, as normally you're either in a batch context, or sorting small enough counts that the time taken isn't significant). So, what to do?
Normally, context cancellation has leaf functions check for an error, and then propagate it via the typical errors-as-values mechanism. However, none of the sort functions (e.g. slices.SortFunc) take a context, or allow returning an error.
Thankfully, Go has another, non-local signalling mechanism for handling errors (e.g. if you've dereferenced a nil pointer), in the form of panics. This tends not to be used much for error handling per se, because the non-local flow control can be harder to reason about, but it can make sense within a single narrowly defined context.
For example, the encoding/json package does this, throwing via json.(*encodeState).error(…), and recovering within the scope of the top-level json.(*encodeState).marshal(…) function. So no client code actually sees the non-local control flow, and no engineers experience unexpected panics.
So we changed the code from something like this:
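func execute(ctx context.Context) (results, error) {
	resultSet := query.filter(someTable)
	// Sorts the entire result set; nothing in here can notice ctx being cancelled.
	slices.SortFunc(resultSet, func(a, b Row) int {
		return query.compare(a, b)
	})
	return resultSet, nil
}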
To something like this:
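// nonLocalCancellation is the panic payload used to unwind out of the sort.
type nonLocalCancellation struct{ err error }

func execute(ctx context.Context) (res results, err error) {
	resultSet := query.filter(someTable)
	defer func() {
		// Ref: https://go.dev/blog/defer-panic-and-recover
		if r := recover(); r != nil {
			if c, ok := r.(nonLocalCancellation); ok {
				// After a recovered panic the function returns straight away,
				// so the error has to be set through the named results.
				res, err = nil, c.err
			} else {
				panic(r) // Not ours, so let it keep propagating.
			}
		}
	}()
	slices.SortFunc(resultSet, func(a, b Row) int {
		// The caller has given up, so unwind out of the sort.
		if ctxErr := ctx.Err(); ctxErr != nil {
			panic(nonLocalCancellation{err: ctxErr})
		}
		return query.compare(a, b)
	})
	return resultSet, nil
}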
Which is a lot of messing about (it's an ugly solution to an ugly problem), but does mean that if the caller gives up on the query, we don't waste time sorting a result for someone who will never care about it.