Tuesday, July 27, 2021

What time-weighted averages are and why you should care

Learn how time-weighted averages are calculated, why they’re so powerful for data analysis, and how to use TimescaleDB hyperfunctions to calculate them faster – all using SQL.

Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant. Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.

Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series.

Irregularly sampled time-series data just adds another level of complexity – and is more common than you may think. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in:

  • Industrial IoT, where teams “compress” data by only sending points when the value changes
  • Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations
  • Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered)
  • ...and many, many more

At Timescale, we’re always looking for ways to make developers’ lives easier, especially when they’re working with time-series data. To this end, we introduced hyperfunctions, new SQL functions that simplify working with time-series data in PostgreSQL. One of these hyperfunctions enables you to compute time-weighted averages quickly and efficiently, so you gain hours of productivity.


Read on for examples of time-weighted averages, how they’re calculated, how to use the time-weighted averages hyperfunctions in TimescaleDB, and some ideas for how you can use them to get a productivity boost for your projects, no matter the domain.

If you’d like to get started with the time_weight hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you create a new service, you’re all set to use them.

If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub, after which you’ll be able to use time_weight and other hyperfunctions.

Finally, we love building in public and continually improving:

What are time-weighted averages?

I’ve been a developer at Timescale for over 3 years and worked in databases for about 5 years, but I was an electrochemist before that. As an electrochemist, I worked for a battery manufacturer and saw a lot of charts like these:

Battery discharge curve showing cell voltage on the y-axis and capacity in amp-hours on the x-axis. The curve starts high, decreases relatively rapidly through the exponential zone, then stays relatively constant for a long period through the nominal zone, after which the voltage drops quite rapidly as it reaches its fully discharged state.
Example battery discharge curve, which describes how long a battery can power something. (Also a prime example of where time-weighted averages are 💯 necessary) Source: https://www.nrel.gov/docs/fy17osti/67809.pdf

That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged.

When we’d do R&D for new battery formulations, we’d cycle many batteries many times to figure out which formulations make batteries last the longest.

If you look more closely at the discharge curve, you’ll notice that there are only two “interesting” sections:

The same battery discharge curve as in the previous image but with the “interesting bits” circled, namely where the voltage decreases rapidly at the beginning and the end of the discharge curve.
Example battery discharge curve, calling out the “interesting bits” (the points in time where data changes rapidly)

These are the parts at the beginning and end of the discharge where the voltage changes rapidly. Between these two sections, there’s that long period in the middle, where the voltage hardly changes at all:

The same battery discharge curve again, except now the “boring” part of the curve is highlighted, which is the middle section where the voltage hardly changes.
Example battery discharge curve, calling out the “boring bits” (the points in time where the data remains fairly constant)

Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew a lot more than I did.

But, I was often better than them at working with data, so I’d do things like programming the potentiostat, the piece of equipment you hook the battery up to in order to perform these tests.

For the interesting parts of the discharge cycle (those parts at the start and end), we could have the potentiostat sample at its max rate, usually a point every 10 milliseconds or so. We didn’t want to sample as many data points during the long, boring parts where the voltage didn’t change because it would mean saving lots of data with unchanging values and wasting storage.

To reduce the boring data we’d have to deal with without losing the interesting bits, we’d set up the program to sample every 3 minutes, or when the voltage changed by a reasonable amount, say more than 5 mV.

In practice, what would happen is something like this:

The same battery discharge curve again, this time with data points superimposed on the image. The data points are spaced close together in the “interesting bits,” where the voltage changes quickly at the beginning and end of the discharge curve. The data points are spaced further apart during the “boring” part in the middle, where the voltage hardly changes at all.
Example battery discharge curve with data points superimposed to depict rapid sampling during the interesting bits and slower sampling during the boring bits.

By sampling the data in this way, we'd get more data during the interesting parts and less data during the boring middle section. That’s great!

It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data. But, here’s a question: given this dataset, how do we find the average voltage during the discharge?

That question is important because it was one of the things we could compare between this discharge curve and future ones, say 10 or 100 cycles later. As a battery ages, its average voltage drops, and how much it dropped over time could tell us how well the battery’s storage capacity held up during its lifecycle – and if it could turn into a useful product.

The problem is that the data in the interesting bits are sampled more frequently (i.e., there are more data points for the interesting bits), which would give it more weight when calculating the average, even though it shouldn't.

The same battery curve again, with the same data points superimposed and the “interesting bits” circled again, however this time noting that the “interesting bits” shouldn’t count extra even though there are more data points included in the circled area.
Example battery discharge curve, with illustrative data points to show that while we collect more data during the interesting bits, they shouldn’t count “extra.”

If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same!

We could easily overlook any of the differences we were trying to identify – and any clues about how we could improve the batteries could just get lost in the variation of our sampling protocol.

Now, some people will say: well, why not just sample at max rate of the potentiostat, even during the boring parts? Well, these discharge tests ran really long. They’d take 10 to 12 hours to complete, but the interesting bits could be pretty short, from seconds or minutes. If we sampled at the highest rate, one every 10ms or so, it would mean orders of magnitude more data to store even though we would hardly use any of it! And orders of magnitude more data would mean more cost, more time for analysis, all sorts of problems.

So the big question is: how do we get a representative average when we’re working with irregularly spaced data points?

Let’s get theoretical for a moment here:

(This next bit is a little equation-heavy, but I think they’re relatively simple equations, and they map very well onto their graphical representation. I always like it when folks give me the math and graphical intuition behind the calculations – but if you want to skip ahead to just see how time-weighted average is used, the mathy bits end here.)

Mathy Bits: How to derive a time-weighted average

Let’s say we have some points like this:

A graph showing value on the y-axis and time on the x-axis. There are four points: open parens t 1 comma v 1 close parens to open parens t 4 comma v 4 close parens spaced unevenly in time on the graph.
A theoretical, irregularly sampled time-series dataset

Then, the normal average would be the sum of the values, divided by the total number of points:

\begin{equation} avg = \frac{(v_1 + v_2 + v_3 + v_4)}{4} \end{equation}

But, because they’re irregularly spaced, we need some way to account for that.

One way to think about it would be to get a value at every point in time, and then divide it by the total amount of time. This would be like getting the total area under the curve and dividing by the total amount of time ΔT.

The same graph as above but with the area under the curve shaded in gray. The area under the curve is drawn by drawing a line through each pair of points and then shading down to the x-axis. The total time spanned by the points from t 1 to t 4 is denoted as Delta T.
The area under an irregularly sampled time-series dataset

\begin{equation} better\_avg = \frac{area\_under\_curve}{\Delta T} \end{equation}

(In this case, we’re doing a linear interpolation between the points). So, let’s focus on finding that area. The area between the first two points is a trapezoid:

The same graph as above, except there is a trapezoid shaded in blue bounded on top by the line connecting the first two points and vertical lines connecting the points to the x-axis. The distance between the two points on the x-axis is denoted delta t 1.
A trapezoid representing the area under the first two points

Which is really a rectangle plus a triangle:

The same graph as the previous, except now the trapezoid, has been divided into a rectangle and a triangle. The rectangle is the height of the first point v 1. The triangle is a right triangle with the line connecting the first two points as the hypotenuse. The distance on the y-axis between the first two points is denoted as delta v 1.
That same trapezoid broken down into a rectangle and a triangle.

Okay, let’s calculate that area:

\begin{equation} area = \Delta t_1 v_1 + \frac{\Delta t_1 \Delta v_1}{2} \end{equation}

So just to be clear, that’s:

a r e a = Δ t 1 v 1 ⏟ area of rectangle + Δ t 1 Δ v 1 2 ⏟ area of triangle \begin{equation} area = \underbrace{\Delta t_1 v_1}_\text{area of rectangle} + \underbrace{\frac{\Delta t_1 \Delta v_1}{2}}_\text{area of triangle} \end{equation}

Okay. So now if we notice that:

\begin{equation} \Delta v_1 = v_2 - v_1 \end{equation}

We can simplify this equation pretty nicely:

Start with:
\begin{equation} \Delta t_1 v_1 + \frac{\Delta t_1 (v_2 - v_1)}{2} \end{equation}

Factor out: \begin{equation}(\frac{\Delta t_1}{2} ) \end{equation}

\begin{equation} \frac{\Delta t_1}{2} (2v_1 + (v_2 - v_1)) \end{equation}

Simplify:
\begin{equation} \frac{\Delta t_1}{2} (v_1 + v_2) \end{equation}

One cool thing to note is that this gives us a new way to think about this solution: it’s the average of each pair of adjacent values, weighted by the time between them:

\begin{equation} area = \underbrace{\frac{(v_1 + v_2)}{2}}_{\text{average of } v_1 \text{ & } v_2} \Delta t_1 \end{equation}

It’s also equal to the area of the rectangle drawn to the midpoint between v1 and v2:

The same graph as the previous, except that now there is a rectangle imposed on the trapezoid. The rectangle is the same width as the others and goes to a height of v 1 plus v 2 over 2.
The area of the trapezoid and of the rectangle, drawn to the midpoint between the two points, is the same.

Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Then all we need to do is sum that up, and that will be the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may actually remember some of this from when they were learning about integrals and integral approximations!)

With the total area under the curve calculated,  all we have to do is divide the time-weighted sum by the overall  ΔT and we have our time-weighted average. 💥

Now that we've worked through our time-weighted average in theory, let’s test it out in SQL.

How to compute time-weighted averages in SQL

Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10℉) so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop.

The air temperature in the freezer will vary a bit more dramatically as folks open and close the door, but the ice cream temperature takes longer to change. Thus, problems (melting, pesky ice crystals) will only happen if it's exposed to extreme temperatures for a prolonged period. By measuring this data, the ice cream manufacturer can impose quality controls on each batch of product they’re storing in the freezer.

Taking this into account, the sensors in the freezer measure temperature in the following way: when the door is closed and we’re in the optimal range, the sensors take a measurement every 5 minutes; when the door is opened, the sensors take a measurement every 30 seconds until the door is closed, and the temperature has returned below 10℉.

To model that we might have a simple table like this:

CREATE TABLE freezer_temps (
        freezer_id int,
        ts timestamptz,
        temperature float);

And some data like this:

INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);

The period after the door opens, minutes 31-36, has a lot more data points. If we were to take the average of all the points, we would get a misleading value. The freezer was only above the threshold temperature for 5 out of 45 minutes (11% of the time period), but those minutes make up 10 out of 20 data points (50%!) because we sample freezer temperature more frequently after the door is opened.

To find the more accurate, time-weighted average temperature, let’s write the SQL for the formula above that handles that case. We’ll also get the normal average just for comparison’s sake. (Don’t worry if you have trouble reading it, we’ll write a much simpler version later).

WITH setup AS (
        SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
                extract('epoch' FROM ts) as ts_e, 
                extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
                * 
        FROM  freezer_temps), 
nextstep AS (
        SELECT CASE WHEN prev_temp is NULL THEN NULL 
                ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
                * 
        FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
        sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average -- our derived average
FROM nextstep
GROUP BY freezer_id;
 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111

It does return what we want, and gives us a much better picture of what happened, but it’s not exactly fun to write, is it?

We’ve got a few window functions in there, some case statements to deal with nulls, and several CTEs to try to make it reasonably clear what’s going on. This is the kind of thing that can really lead to code maintenance issues when people try to figure out what’s going on and tweak it.

Code is all about managing complexity. A long, complex query to accomplish a relatively simple task makes it much less likely that the developer who comes along next (ie you in 3 months) will understand what’s going on, how to use it, or how to change it if they (or you!) need a different result. Or, worse, it means that the code will never get changed because people don’t quite understand what the query’s doing, and it just becomes a black box that no one wants to touch (including you).

TimescaleDB hyperfunctions to the rescue!

This is why we created hyperfunctions, to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the hyperfunctions for computing time-weighted averages:

SELECT freezer_id, 
        avg(temperature), 
        average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;
 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111

Isn’t that so much more concise?! Calculate a time_weight with a 'Linear' weighting method (that’s the kind of weighting derived above ), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!).

What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to compose (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.

Let’s imagine we’re not satisfied with the average over our entire dataset, and we want to get the time-weighted average for every 10-minute bucket:

SELECT time_bucket('10 mins'::interval, ts) as bucket, 
        freezer_id, 
        avg(temperature), 
        average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY bucket, freezer_id;

We added a time_bucket, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable.

Continuing with our ice cream example, let’s say that we’ve set our threshold because we know that if the ice cream spends more than 15 minutes above 15 ℉, it’ll develop those ice crystals that make it all sandy/grainy tasting. We can use the time-weighted average in a window function to see if that happened:

SELECT *, 
    average(time_weight('Linear', ts, temperature) OVER fifteen_min) as rolling_twa
FROM freezer_temps
WINDOW fifteen_min AS 
    (PARTITION BY freezer_id ORDER BY ts RANGE  '15 minutes'::interval PRECEDING)
ORDER BY freezer_id, ts;

 freezer_id |           ts           | temperature |    rolling_twa     
------------+------------------------+-------------+--------------------
          1 | 2020-01-01 00:00:00+00 |           4 |                   
          1 | 2020-01-01 00:05:00+00 |         5.5 |               4.75
          1 | 2020-01-01 00:10:00+00 |           3 |                4.5
          1 | 2020-01-01 00:15:00+00 |           4 |  4.166666666666667
          1 | 2020-01-01 00:20:00+00 |         3.5 | 3.8333333333333335
          1 | 2020-01-01 00:25:00+00 |           8 |  4.333333333333333
          1 | 2020-01-01 00:30:00+00 |           9 |                  6
          1 | 2020-01-01 00:31:00+00 |        10.5 |  7.363636363636363
          1 | 2020-01-01 00:31:30+00 |          11 |  7.510869565217392
          1 | 2020-01-01 00:32:00+00 |          15 |  7.739583333333333
          1 | 2020-01-01 00:32:30+00 |          20 |               8.13
          1 | 2020-01-01 00:33:00+00 |        18.5 |  8.557692307692308
          1 | 2020-01-01 00:33:30+00 |          17 |  8.898148148148149
          1 | 2020-01-01 00:34:00+00 |        15.5 |  9.160714285714286
          1 | 2020-01-01 00:34:30+00 |          14 |   9.35344827586207
          1 | 2020-01-01 00:35:00+00 |        12.5 |  9.483333333333333
          1 | 2020-01-01 00:35:30+00 |          11 | 11.369047619047619
          1 | 2020-01-01 00:36:00+00 |          10 | 11.329545454545455
          1 | 2020-01-01 00:40:00+00 |           7 |             10.575
          1 | 2020-01-01 00:45:00+00 |           5 |  9.741666666666667

The window here is over the previous 15 minutes, ordered by time. And it looks like we stayed below our ice-crystallization temperature!

We also provide a special rollup function so you can re-aggregate time-weighted values from subqueries. For instance:

SELECT average(rollup(time_weight)) as time_weighted_average 
FROM (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
                freezer_id, 
                time_weight('Linear', ts, temperature)
        FROM freezer_temps
        GROUP BY bucket, freezer_id) t;
time_weighted_average 
-----------------------
    6.636111111111111

This will give us the same output as a grand total of the first equation because we’re just re-aggregating the bucketed values.

But this is mainly there so that you can do more interesting analysis, like, say, normalizing each ten-minute time-weighted average by freezer to the overall time-weighted average.

WITH t as (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
                freezer_id, 
                time_weight('Linear', ts, temperature)
        FROM freezer_temps
        GROUP BY bucket, freezer_id) 
SELECT bucket, 
        freezer_id, 
        average(time_weight) as bucketed_twa,  
        (SELECT average(rollup(time_weight)) FROM t) as overall_twa, 
        average(time_weight) / (SELECT average(rollup(time_weight)) FROM t) as normalized_twa
FROM t; 

This kind of feature (storing the time-weight for analysis later) is most useful in a continuous aggregate, and it just so happens that we’ve designed our time-weighted average to be usable in that context!

We’ll be going into more detail on that in a future post, so be sure to subscribe to our newsletter so you can get notified when we publish new technical content.

Try time-weighted averages today

If you’d like to get started with the time_weight hyperfunction - and many more - right away, spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. Hyperfunctions are pre-loaded on each new database service on Timescale Forge, so after you create a new service, you’re all set to use them!

If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub, after which you’ll be able to use time_weight and all other hyperfunctions.

**If you have questions or comments on this blog post, we’ve started a discussion on our GitHub page, and we’d love to hear from you. (And, if you like what you see, GitHub ⭐ are always welcome and appreciated too!)

  • We love building in public, and you can view our upcoming roadmap on GitHub for a list of proposed features, features we’re currently implementing, and features available to use today.

We’d like to give a special thanks to @inselbuch, who submitted the GitHub issue that got us started on this project (as well as the other folks who 👍’d it and let us know they wanted to use it.)

We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an open issue or in a discussion thread in GitHub.)

Lastly, in future posts, we’ll give some more context around our design philosophy, decisions we’ve made around our APIs for time-weighted averages (and other features), and detailing how other hyperfunctions work. So, if that’s your bag, you’re in luck – but you’ll have to wait a week or two.



from Hacker News https://ift.tt/3BKcUcm

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.