Friday, August 27, 2021

Why observability requires a distributed column store

Honeycomb is known for its incredibly fast performance: you can sift through billions of rows, comparing high-cardinality data across thousands of fields, and get fast answers to your queries. And none of that is possible without our purpose-built distributed column store.

This post is an introduction to what a distributed column store is, how it functions, and why a distributed column store is a fundamental requirement for achieving observability.

Why build a datastore?

Here at Honeycomb, we talk a lot about constructing wide events with rich context, the need to examine high-cardinality data, and why fast queries are essential for encouraging system exploration. But none of that would be possible without our backend. So what makes our backend tick?

At its core, Honeycomb’s backend is a purpose-built distributed column store. We’ve put in all the work making it fast and reliable, and the secret sauce isn’t all that secret. The origins and inspiration can be traced to Facebook Scuba. Back in 2017, former Honeycomb engineer Sam Stokes gave a brilliant overview of Honeycomb’s internals in his talk, “Why We Built Our Own Distributed Column Store.”

In her post, “So You Want to Build an Observability Tool” our CTO Charity Majors even goes so far as to say, “You pretty much need to use a distributed column store in order to have observability.” 

What makes a distributed column store so special that Honeycomb actually needed to build one in order to make observability possible?

Taken all together, Honeycomb’s core feature set and desired functionality is very much responsible for many of its architecture decisions:

  • There should be no rigid schemas, so it’s easy to make your events as wide as possible.
  • There should be no indexes, so you don’t need to decide which column is “fast” before you start writing data.
  • Queries should always be fast, so there’s no need to rely on pre-aggregated data (like metrics) to understand application behavior.

That last point is worth unpacking because it’s often too subtle. Storing and querying your raw data is what unlocks its full potential. Honeycomb never aggregates or throws away any of your data. You never know what dimensions you might need to examine until you’re debugging an issue you could have never predicted in advance. The backend data store must have the power to slice and dice your aggregate computations by any of the rich context you’ve attached to your traces, any way you see fit, any time you need it, in any way you need it. It turns out that the only way to readily store that data, and process it quickly, is with a columnar database.

You don’t have to be a database engineer to understand the principles that make this storage structure a fundamental requirement. With a little bit of background, you’ll see how Honeycomb’s capabilities emerge as a consequence of how this type of storage system is designed.

Rows vs. Columns

A column store is just a type of database. Conceptually, databases organize data into tables, which are labeled by columns and made up of individual rows.

In Honeycomb, each dataset is stored as a separate table where the columns correspond to fields from your events. Each span is then written as one row in the table. For example, a dataset for HTTP requests might have spans with fields like the request method, URL path, query string, and so on. This is represented as a table like so:

method
path
query
GET
/
GET
/search
?q=cool%20stuff
POST
/comment

Of course, with computer storage, we don’t really have two-dimensional tables. Instead, we have contiguous chunks of “flat” memory. To simplify, think of it like a text file where every cell of the table has to be written as a separate line.

Traditional database systems like MySQL and PostgreSQL are said to be row-oriented. This means data is stored row by row, where one entire row’s data is written before the next one’s. In the above example, the flat text file would look like this:

GET
/
(null)
GET
/search
?q=cool%20stuff
POST
/comment
(null)

There are three columns, so each row is represented by three lines. Lines 1-3 store row 1, lines 4-6 store row 2, lines 7-9 store row 3, and so on. The columns are always in the same order for each row. (For clarity, we’ll insert nulls throughout our examples. In the real world, you could use a sparse format that saves space by not writing out redundant nulls.)

In contrast, column-oriented databases like Amazon Redshift and Apache Cassandra maintain separate files for each column. In our example, there would be three files:

  1. The file for the method column:
    GET
    GET
    POST
  2. The file for the path column:
    /
    /search
    /comment
  3. The file for the query column:
    (null)
    ?q=cool%20stuff
    (null)

When you put together line 1 from each of the files, you get row 1. Line 2 of each file gives you row 2, line 3 gives row 3, and so on.

While the implementation details are naturally more complicated, this mental model already lets us analyze the trade-offs between row- and column-oriented storage. This will give us insight into how and why Honeycomb behaves the way it does!

Writing

Let’s start from the beginning, when Honeycomb ingests your events. As each one comes in, we add a row to the table. So if the event{"method": "GET", "path": "/comments", "query": "?page=2"} comes into our example dataset, the table will be amended to look like this:

method
path
query
GET
/
GET
/search
?q=cool%20stuff
POST
/comment
GET
/comments
?page=2

In a row-oriented database, we’d append 3 lines to the table’s file—1 line for each column in this new row. In a column-oriented database, we’d append 1 line to each of 3 column files for the table. Either way, appending the data isn’t very expensive—just write to the ends of some files.

But what if an event comes in with a new field? Traditional row-oriented databases require you to predefine a rigid schema specifying which columns exist. With Honeycomb, you don’t need to predefine schemas. You can add as many fields as you’d like—the wider your event (row), the more context you’ll have to observe the inner workings of your system. Defining a separate schema ahead of time can be a blocker to iteratively adding any context you want, as you need it, whenever you need it. Instead, when you send Honeycomb data with a new field, we dynamically expand the table. This is natural to do using a column-oriented database.

To see why, suppose we get another event with some browser information attached to a new field, like {"method": "GET", "path": "/login", "browser": "Firefox"}. In a row-oriented format, we can no longer simply append all the lines to the end of the file, because we have a new fourth column. Our old rows only took up 3 lines, whereas this new row takes up 4 lines. With inconsistent sizes, reading the rows back out would be ambiguous. This is why row-based formats use a predefined schema: essentially, it tells us how many lines each row takes up. We could change the schema, but that forces us to go back and update our file to use 4 lines per row, filling in null values for historical data:

  GET
  /
  (null)
+ (null)
  GET
  /search
  ?q=cool%20stuff
+ (null)
  POST
  /comment
  (null)
+ (null)
  GET
  /comments
  ?page=2
+ (null)
+ GET
+ /login
+ (null)
+ Firefox

Before, row 2 started at line 4 of the file. But now, row 1 takes up lines 1-4 and row 2 starts at line 5. Everything had to move up by one, and this effect ripples through the whole table. Shuffling this storage around is very computationally expensive, especially for large tables, since the entire file needs to be rewritten. Depending on your database vendor’s pricing scheme, this could also be financially expensive. Furthermore, in traditional relational databases, this type of update is done as a migration that takes the table offline until the file is finished updating!

However, in a column-oriented database, nothing needs to be moved. For the 3 original columns, we can still append the incoming values to the existing column files as usual:

  1. The file for the method column:
      GET
      GET
      POST
      GET
    + GET
  2. The file for the path column:
      /
      /search
      /comment
      /comments
    + /login
  3. The file for the query column:
      (null)
      ?q=cool%20stuff
      (null)
      ?page=2
    + (null)

Additionally, we create a fourth file for the new browser column:

+ (null)
+ (null)
+ (null)
+ (null)
+ Firefox

Lines 1-4 are null because the past rows did not have browser information. In practice, a sparse format would avoid writing out all these nulls by using some metadata that says “this file starts with row 5.” But the principle is the same: The column-oriented layout does not require us to shuffle around old values. This saves you from having to define a rigid schema ahead of time, while also saving Honeycomb from going offline whenever we have to add a new column.

Reading

Once values are written, we’ll want to query them. Before we consider computing aggregates like counts, averages, and so on, we first have to read the raw data from the files. The way we lay out this data in memory has a big impact on the I/O performance.

As a rule of thumb, the principle of locality says that the closer two pieces of data are to each other in memory, the quicker they are to access. This is easy to imagine on hard disks where a mechanical arm has to scan the surface of a spinning magnetic platter: it’s physically moving from point A to point B, so the shorter the distance, the faster the read. While solid-state drives change this picture, the general principle is nonetheless accurate at various other levels of hardware and software.

So, broadly speaking, row-oriented databases are more efficient for queries where you want to read all the columns from a single row. A canonical example would be a web app where you have a users table with columns like name, email, and avatar. When you render the user’s home page, you fetch all of that data at once using a SQL statement like SELECT * FROM users WHERE id = 1;. In terms of the flat file, all the cells in a single row would be stored on subsequent lines, so the principle of locality suggests that it’s fast to read all those values in order. If instead they were in separate column files, the cells for the row would be scattered across multiple locations in memory, not all close together.

On the other hand, column-oriented databases are suited for queries where you want to read many row values from a few columns. Queries like SELECT AVG(duration_ms) FROM requests; must analyze the duration_ms column of every row in the hypothetical requests table in order to compute the average. Because the whole column is in one contiguous file, scanning through it is going to be quick. In the row-oriented file, each duration_ms would be stored some distance away from the next value, interleaved with a bunch of columns you don’t care about.

Queries with many rows and few columns are Honeycomb’s bread and butter. The query engine computes aggregates like COUNT, AVG, P95, and HEATMAP on the fly that work over a handful of fields at a time. So it makes the most sense to store events in a column-oriented fashion. That way, reading the raw data is efficient enough that we aren’t as bottlenecked by I/O.

One way that row-based stores make reads more efficient is by indexing certain columns (or combinations of columns). That is, they maintain a separate data structure used to quickly access values for a particular column without having to scan through the rows in the flat file. Each index entry will then point back to a line number in the row file, much like the index in the back of a textbook.

In a lot of ways, indexes basically emulate the column files. But there’s more overhead for maintaining references to the rows. For example, a migration would have to recompute the indexes because all the line numbers in the flat file will have changed. Since Honeycomb doesn’t care that much about rows in the first place, it’s not really necessary to maintain this extra information. However, there’s still some merit to the idea, so we’ll come back to indexing in a minute.

Arithmetic

Assuming that data can be read from storage efficiently, the other component of fast querying is actually computing the aggregate functions. On the surface, it’s not complicated to calculate something like AVG(duration_ms): just read all the values from the duration_ms column file, sum them together, and divide by the number of rows. But when the table gets very large, doing this in one shot becomes a bottleneck. This is where distributed computing comes into play.

The basic idea is that two heads are better than one. Instead of having one machine reading from one column file that has all the data, we could split the data into two halves. One machine would compute its aggregates on the first half of the column file. At the same time, another machine would calculate the same aggregates on the second half of the column file. Once both of them were done, the results could be combined together (as long as we’re careful about the math). By working in parallel, we’ve halved the elapsed time it took to do the whole calculation. This works recursively, so in theory we could keep splitting the work up across lots of different machines until eventually the cost of splitting up the data outweighs the benefits.

How do we split the data up, though? In a way, this brings us back to indexing. In row-oriented stores, indexes are more than just column files. Because they’re separate data structures, we’re free to design them to be easy to search through. For example, you could keep an index on timestamps sorted. Then a query over a specific time range could find matching rows by doing a binary search through the timestamps instead of a linear scan.

At Honeycomb, the closest thing we have to such an index is the way we split up the column files into segments. Every event that comes into our system requires a timestamp. Because we’re designed to ingest real-time traces, and since time only moves forward, the events will typically come in already sorted! Similar to log rotation, after enough time we can close the current segment of the column, mark it with some metadata for the minimum/maximum timestamps, and start the next segment. Each machine in a distributed computation is then responsible for handling some subset of the segments.

Ever notice how every Honeycomb query has a time range? This is crucial for our aggregate calculations because then a worker machine can quickly check against the metadata in its particular segments. If a segment doesn’t intersect with the desired time range, we don’t have to read anything from it. It’s not quite the same as a binary search through an index, but it’s a similar way of limiting the amount of storage we have to scan as we split the work up across different machines.

Distributed computation is possible with row-oriented formats, of course. But at the end of the day, it’s natural to do with the column-oriented format, which already has other key benefits for Honeycomb. Thus, from ingest to query to result, our distributed column store works for us every step of the way!

The result is blazing-fast query performance

Observability isn’t just a fancy new way of saying “monitoring” or “debugging.” What sets observability apart is the ability to slice and dice your data, any way you see fit, to unlock new understandings about how your applications truly behave when experienced by end users. Many different capabilities are what make that functionally possible.

Removing blockers like predefining schemas is what lets you quickly iterate the data you send to Honeycomb, and it encourages you to add more rich context to your wide events as you discover new relevant information you need to know. Removing indexes means you don’t need to decide what is fast because everything needs to be fast. And, most importantly, you need unaggregated data that can be sliced and diced and processed along any dimension needed, anytime you need it.

Putting those capabilities together is what lets you dig through billions of rows of data, searching for that hidden needle in your haystack, getting answers to questions you could have never predicted asking in advance. That’s the experience that sets observability apart. 

So yeah, as Charity put it: “You pretty much need to use a distributed column store in order to have observability.”

Try it for yourself by signing up for a free account today.



from Hacker News https://ift.tt/3xC9JjL

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.