Open sourcing Dicer: Databricks's auto-sharder

69 points by vivek-jain 8 hours ago on Hacker News | 10 comments
Does anyone else have something similar?

What are some use cases where you've found it useful?

[OP] vivek-jain | 7 hours ago

Sharded in-memory caching turns out to be rather useful at scale :)

Some of the key examples highlighted on our blog are Unity Catalog, which is essentially the metadata layer for Databricks, our Query Orchestration Engine, and our distributed remote cache. See the blog post for more!

louis-paul | 5 hours ago

atuladya | 4 hours ago

It is similar to Slicer in terms of the abstraction (I built Slicer at Google), but the architecture, implementation, and algorithms have a lot of differences.

bigwheels | 4 hours ago

Did you also work on this Databricks dicery?
Yes he did. I attended a talk he gave on it, so that's how I know.

WookieRushing | 5 hours ago

These show up once you reach a certain scale where not sharding becomes cost-inefficient or the hot spots are very dynamic. They also try to avoid adding latency by being eventually consistent sidecars rather than proxies.

I’ve seen them used for traffic routing, storage-system metadata, distributed caches, etc.

khaki54 | 6 hours ago

Seems weird to call it sharding since it's not sharding indexed datasets or anything like that. Is this just a tool to mitigate Databricks’ internal service-scaling challenges?

atuladya | 4 hours ago

Right - this is not about sharding data/datasets. This is for sharding in-memory state that a service might have. The problem of building services at low cost, high scale, low latency and high throughput is common in many environments including our services at Databricks, and Dicer helps with that.
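
To make that concrete, here's a rough sketch of the idea in Go (all names hypothetical, not Dicer's actual API): keys are hashed into shards, each shard has an owning pod, and requests for a key are routed to that owner, so only that pod needs to keep the key's in-memory state warm. A real auto-sharder computes and moves these assignments dynamically based on load; the static table below is just to show the shape of the thing.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Assignment maps a fixed number of shards to the pods that currently own
    // them. In a real auto-sharder this mapping is computed centrally from load
    // signals and distributed to every pod; here it is a static table for
    // illustration only.
    type Assignment struct {
        numShards int
        owners    []string // owners[shard] = pod address
    }

    // OwnerOf hashes a key into a shard and returns the pod that owns that shard.
    func (a *Assignment) OwnerOf(key string) string {
        h := fnv.New64a()
        h.Write([]byte(key))
        shard := int(h.Sum64() % uint64(a.numShards))
        return a.owners[shard]
    }

    func main() {
        // Hypothetical 4-shard assignment spread across two pods.
        a := &Assignment{
            numShards: 4,
            owners:    []string{"pod-a:8080", "pod-a:8080", "pod-b:8080", "pod-b:8080"},
        }

        // Requests for the same key always land on the same pod, so that pod can
        // keep the key's in-memory state (cache entry, session, counters) hot.
        for _, key := range []string{"table:orders", "table:users", "table:events"} {
            fmt.Printf("%-14s -> %s\n", key, a.OwnerOf(key))
        }
    }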

charleshn | 4 hours ago

> Application pods learn the current assignment through a library called the Slicelet (S for server side). The Slicelet maintains a local cache of the latest assignment by fetching it from the Dicer service and watching for updates. When it receives an updated assignment, the Slicelet notifies the application via a listener API.

For a critical control plane component like this, I tend to prefer a constant work pattern [0], to avoid metastable failures [1], e.g. periodically pull the data instead of relying on notifications.

[0] https://aws.amazon.com/builders-library/reliability-and-cons...

[1] https://brooker.co.za/blog/2021/05/24/metastable.html
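
To sketch the contrast (hypothetical names, not the actual Slicelet listener API): instead of reacting to change notifications, each pod re-fetches the full assignment on a fixed interval and swaps it in atomically, so steady-state load on the control plane equals worst-case load and a burst of reassignments can't amplify into a watcher stampede.

    package main

    import (
        "log"
        "sync/atomic"
        "time"
    )

    // Assignment is whatever the control plane hands out (shard -> owner map, etc.).
    type Assignment struct {
        Version int64
        Owners  map[int]string
    }

    // fetchAssignment stands in for a call to the sharding control plane. In a
    // constant-work design it always returns the full current assignment, never a
    // delta, so applying it costs the same whether or not anything changed.
    func fetchAssignment() (*Assignment, error) {
        // ... an RPC to the assignment service would go here ...
        return &Assignment{Version: time.Now().Unix(), Owners: map[int]string{0: "pod-a:8080"}}, nil
    }

    func main() {
        // The request-serving path would read current.Load() lock-free.
        var current atomic.Pointer[Assignment]

        // Constant-work refresh loop: poll on a fixed interval instead of reacting
        // to change notifications.
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            a, err := fetchAssignment()
            if err != nil {
                // Keep serving with the last known assignment and retry on the
                // next tick; the assignment is eventually consistent anyway.
                log.Printf("assignment refresh failed, keeping the old one: %v", err)
                continue
            }
            current.Store(a)
        }
    }

The trade-off is staleness bounded by the poll interval, which is usually acceptable for an assignment that is already eventually consistent.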