Most web applications use a data store of some kind, often a relational database. When a web app becomes successful, it can become all too easy to start "hoarding" data in the database. But hoarding data leads to the unbounded growth of database tables (both row count and stored data size).
While this works fine up to a point, it pays to prevent data bloat where you can, or, if you cannot prevent it, to plan your infrastructure ahead of time so it can adequately manage growth.
Before we dive in, let's see how we might end up with bloated applications.
More Data is Not Always Great
Most of the applications we work on tend to get bigger with time.
If you use a cloud provider for your database, you might hit your allocated storage limit. Once that happens, you will need to upgrade to a different instance type. Heroku PostgreSQL databases have such a limit: a "hobby" tier instance, for example, is capped at 1 GB of data.
Having more data also has implications for query speeds. What used to be possible without an index becomes impossible with larger tables. Some row range scans will get slower. More locks will be acquired to perform database UPDATE and DELETE operations.
How Database Tables Grow
Data hoarding happens gradually. What is not a problem today can easily become a problem in a month or six months. The most treacherous thing about data hoarding is that it is so easy to miss. Consider a few very classic scenarios:
- For compliance reasons, you adopt a gem like paper_trail and get an audit_log_entries table. Every operation of any significance in your application creates a row in the audit_log_entries table. Those audit log entries never get archived.
- You accept uploads and are using ActiveStorage. You never delete the uploads, so your active_storage_blobs table keeps getting bigger and bigger.
- You run a shared CMS for publishing on the Web, and you allow article segments to be stored in your database. Your platform becomes successful, but most of your authors write book-sized articles. The pages table becomes very sizeable even though it only contains a few thousand pages.
- You allow user-uploaded content, but you do not delete the data for compliance reasons, and instead use something like paranoia to delete via a flag. Your user_items table grows indefinitely and, without you noticing, passes 10M rows.
These patterns have implications if not caught in time. If you have a good view of the size of your tables, you can anticipate when you will have to upgrade and schedule maintenance at off-peak times to perform upgrades with less user impact.
Making projections also becomes much easier. For example:
- Today, our events table fits in memory. At the current growth rate, it won't fit in memory in seven months.
- We are at 30% storage use for our RDS instance type. We will hit 90% in January next year.
- We have a query that performs a full table scan on payments to compute a function over every row. We know that the payments table will exceed two million rows in three weeks and 20 million by January next year.
These all have the potential to present as an incident or an outage. But if you act early enough, you can mitigate them fairly easily. For example:
- Set up archiving for all data older than 30 days in the events table (see the sketch after this list).
- Upgrade your RDS instance in four months.
- To limit the full table scan, add an additional WHERE condition to the query so it computes the function over a much smaller subset of rows.
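As a rough sketch of the first option, a recurring job can delete (or archive) old rows in small batches. The Event model name and the 30-day cutoff here are assumptions for illustration only:

# A minimal sketch: prune events older than 30 days in small batches
# so that each DELETE holds locks only briefly. `Event` is a hypothetical model.
class PruneOldEventsJob < ApplicationJob
  queue_as :default

  def perform
    Event.where("created_at < ?", 30.days.ago).in_batches(of: 1_000) do |batch|
      batch.delete_all # or copy the rows to cold storage before deleting
    end
  end
end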
Gain Visibility Over Your Database Growth
To keep an eye on your database growth, you have two possibilities:
- Install specialized tooling (like a statistics exporter for MySQL) and connect it to your metrics collection engine via Prometheus, Telegraf, or other tools.
- Use AppSignal, especially if you're already using it for your application.
AppSignal can store several metric types, and one of the types it supports is called a gauge. A gauge is a single time series per environment (like production, staging, or development) that you can update from time to time. AppSignal metrics also allow tags, so we can automatically create some tagged gauge metrics for our database tables. Let's do that:
- db.approximate_row_count for the approximate number of rows in our tables, per table (we'll get to approximates later)
- db.data_size_bytes for the number of bytes a table is using
- db.index_size_bytes for the number of bytes a table's indices are using
Note that I am suffixing the metric name with the value type — this helps later on when we define how the metric will be displayed. The separate "data" and "indices" metrics are also important. If a table contains many rows and more than a few indices, the size of those indices might be two or three times the size of the stored data (since every index stores derived data for itself, creating some storage overhead).
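For illustration, updating one of these gauges for a single table looks roughly like the snippet below. The Appsignal.set_gauge call with a tags hash is the same one we will use later; the table name and values are placeholders:

# A minimal sketch: one tagged gauge data point per metric for a hypothetical "events" table.
Appsignal.set_gauge("db.approximate_row_count", 1_250_000, table: "events")
Appsignal.set_gauge("db.data_size_bytes", 512 * 1024 * 1024, table: "events")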
Counting Database Table Rows Quickly
The "approximate" part of the number of rows is essential. Normally, if you want a precise number of rows in a table, you can do a query like SELECT COUNT(1) FROM my_table. However, it might be slower than we would like.
When you COUNT, the database ensures that the number of rows in the table does not change during the query, so it will either lock the table or read from a consistent snapshot while the query runs. The larger the table, the slower the query: more rows will be scanned, and more locks will accumulate.
Therefore, when all we want is a "close enough" approximation of the row count (for performance estimation, being off by 10-30 thousand rows is perfectly fine), we can query the database engine's internal statistics for this data instead.
Most databases optimize access to table data by writing rows out in "pages". The database engine keeps track of which rows are assigned to which pages, roughly in insertion order:
- Rows 1-100 are on page 1
- Rows 101-200 are on page 2
- Rows 201-300 are on page 3
and so on.
Since the engine knows how many pages it has allocated per table and roughly how many rows fit on a page, it can multiply the two to produce an estimate. This has a few advantages: the count is very fast, and the table does not need to be locked while the query runs.
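To get a feel for how cheap this is, here is a small sketch that reads PostgreSQL's planner statistics for a single table instead of counting rows; the payments table name is just an example:

# A minimal sketch (PostgreSQL): read the planner's row estimate from pg_class
# instead of running a full COUNT(1) scan. "payments" is an example table name.
estimate = ActiveRecord::Base.connection.select_value(
  "SELECT reltuples::bigint FROM pg_class WHERE relname = 'payments'"
)
puts "payments has roughly #{estimate} rows"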
Record Table Sizes for MySQL
What we need to do next differs per database, since we have to ask the database engine itself for its table statistics. Let's start with MySQL. Here's the query we need to run:
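SHOW TABLE STATUS;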
Then, grab a few columns from its output. The ones we are interested in are Data_length, Index_length, and Rows. The size of a table is defined as the sum of Data_length + Index_length, and the approximate row count based on pages is in Rows.
Let's package it into a code block that collects this data for your entire database. Since we are not working with any ActiveModel classes, we are going to use the query methods provided directly by ActiveRecord:
# Fetch engine-level statistics for every table in the current database.
table_information_rows = ActiveRecord::Base.connection.select_all("SHOW TABLE STATUS").to_ary

table_information_rows.each do |table_info|
  table_properties = {
    data_size_bytes: table_info.fetch('Data_length'),
    index_size_bytes: table_info.fetch('Index_length'),
    approximate_row_count: table_info.fetch('Rows'),
  }
  # Emit one gauge per property, tagged with the table name.
  table_properties.each_pair do |metric_name, metric_value|
    Appsignal.set_gauge("db.#{metric_name}", metric_value.to_i, table: table_info.fetch('Name'))
  end
end
Record Table Sizes for PostgreSQL
We need a more elaborate query for PostgreSQL since we do not have a shortcut query like SHOW TABLE STATUS. We have to query internal PostgreSQL tables:
query = <<~EOS
  SELECT
    table_name AS Name,
    full_table_name AS Full_name,
    pg_table_size(full_table_name) AS Data_length,
    pg_indexes_size(full_table_name) AS Index_length,
    row_count_estimate AS Rows
  FROM (
    SELECT
      t.table_name AS table_name,
      (t.table_schema || '.' || t.table_name) AS full_table_name,
      c.reltuples::bigint AS row_count_estimate
    FROM information_schema.tables AS t
    INNER JOIN pg_class AS c ON c.relname = t.table_name
    INNER JOIN pg_namespace AS n ON n.oid = c.relnamespace AND n.nspname = t.table_schema
    WHERE t.table_schema = 'public'
  ) AS all_tables
EOS

table_information_rows = ActiveRecord::Base.connection.select_all(query).to_ary

table_information_rows.each do |table_info|
  # PostgreSQL folds unquoted column aliases to lowercase, hence the lowercase keys.
  table_properties = {
    data_size_bytes: table_info.fetch('data_length'),
    index_size_bytes: table_info.fetch('index_length'),
    approximate_row_count: table_info.fetch('rows'),
  }
  # Emit one gauge per property, tagged with the table name.
  table_properties.each_pair do |metric_name, metric_value|
    Appsignal.set_gauge("db.#{metric_name}", metric_value.to_i, table: table_info.fetch('name'))
  end
end
Note that this only accounts for the public schema (the default one you are likely using). If you want to include other schemas, you will need to remove the WHERE t.table_schema = 'public' condition and replace table_info.fetch('name') with table_info.fetch('full_table_name').
Make Sure Metrics Update Regularly
This block has to run at regular intervals, so it is a good idea to put it into the Sidekiq scheduler or a Rake task that runs from cron. For example, if you use good_job, you can add it to your "cron" section like so:
config.good_job = {
  # ... your other configuration
  cron: {
    record_database_metrics: {
      cron: "@hourly",
      class: "RecordDatabaseMetricsJob"
    }
  }
}
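A minimal sketch of the RecordDatabaseMetricsJob referenced above, assuming a standard ActiveJob setup, could simply wrap the collection code from the previous sections:

# A minimal sketch, assuming ActiveJob: good_job's cron entry above enqueues
# this job every hour, and perform runs the MySQL collection block from earlier.
# Swap in the PostgreSQL query if that is your database.
class RecordDatabaseMetricsJob < ApplicationJob
  queue_as :default

  def perform
    table_information_rows = ActiveRecord::Base.connection.select_all("SHOW TABLE STATUS").to_ary
    table_information_rows.each do |table_info|
      {
        data_size_bytes: table_info.fetch('Data_length'),
        index_size_bytes: table_info.fetch('Index_length'),
        approximate_row_count: table_info.fetch('Rows'),
      }.each_pair do |metric_name, metric_value|
        Appsignal.set_gauge("db.#{metric_name}", metric_value.to_i, table: table_info.fetch('Name'))
      end
    end
  end
end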
Create a Dashboard
Once you have data coming in, build a dashboard. You can copy the following dashboard definition and adapt it to your needs:
{
  "title": "Database Tables",
  "description": "",
  "visuals": [
    {
      "title": "Row Counts",
      "line_label": "%name% %table%",
      "display": "LINE",
      "format": "number",
      "draw_null_as_zero": false,
      "metrics": [
        {
          "name": "db.approximate_row_count",
          "fields": [
            {
              "field": "GAUGE"
            }
          ],
          "tags": [
            {
              "key": "table",
              "value": "*"
            }
          ]
        }
      ],
      "type": "timeseries"
    },
    {
      "title": "Table Sizes",
      "line_label": "%table%",
      "display": "LINE",
      "format": "size",
      "format_input": "byte",
      "draw_null_as_zero": false,
      "metrics": [
        {
          "name": "db.data_size_bytes",
          "fields": [
            {
              "field": "GAUGE"
            }
          ],
          "tags": [
            {
              "key": "table",
              "value": "*"
            }
          ]
        }
      ],
      "type": "timeseries"
    }
  ]
}
Click Add dashboard and then Import dashboard in the modal dialog that appears. The dashboard will give you a row counts graph and a table sizes graph for your database. Note how we use wildcard tags to automatically build graph lines for every table in the database.
Interpreting the Data
When looking at your data, watch out for linear or exponential growth in either row count or size: in other words, a table that just keeps getting bigger.
If you see this, you have a couple of options. One is to architect for this unbounded growth: know when you will have to upgrade, and whether the next instance size up will be enough. Another is to set up a regular deletion task; my former colleague Wander Hillen has written a great article on the topic.
For example, in the graphs above, some tables shrink regularly; this is when the regular cleanup tasks run. Data accumulates up to a certain point, but while the ingress of rows stays fairly constant, sizes and row counts periodically drop.
As long as those drops keep happening and the amount of data in your tables grows at a steady rate, your database won't surprise you by suddenly hitting a limit.
Sum Up: Avoid Data Bloat from Database Tables
In this post, we looked at how you can gain visibility over the growth of your database and keep your data bloat down.
Data bloat is a real risk for applications that become a success in the market. By setting up metrics for your database tables, you can better anticipate when to scale your database vertically and whether you need to install regular stale data cleanups. AppSignal gauges with tags, combined with a little SQL, can get you this data in a convenient and pleasant format.
Don't let your database surprise you!