Instead Of MAX Use LAST_VALUE For Time-Based Data

There is a big difference between two built-in functions that you need to consider to avoid working with bad results.

By Den Delimarsky in Data

February 15, 2020

I was working on some data analysis recently, that involved me dealing with data snapshots. These are effectively point-in-time representations of specific data. For example, if you’d be dealing with the amount of items at a warehouse, you could imagine that you’d be taking a snapshot through the day of how many items you have available during each hour. That will give you a pretty good idea of the in-flow and out-flow of materials.

Now, let’s say that you want to analyze the number of items that are left at your warehouse at the end of the day. And let’s also say that you’re using SQLite to store the data. I’ve noticed an answer on a popular forum today that mentioned that you can just use the MAX function per day, and use that for aggregation. While that will give you a number, it will also likely be the wrong number. Here is why:

Items Time
12 2020-02-12 12:12AM
13 2020-02-12 1:10AM
11 2020-02-12 3:00PM
10 2020-02-12 5:33PM
14 2020-02-12 6:00PM
12 2020-02-12 11:50PM

If we use MAX(), when we would analyze the data we would get 14 as the answer for units available at the end of the day. However, that is the maximum number of units through the day. If you’d be tracking Twitter followers for an account the same way, you probably don’t care about the maximum number of followers your account got in a day, but rather what you ended up with at the end of the day.

Lucky for us, SQLite gives us the ability to use what’s known as a window function - LAST_VALUE. This allows us to get the value for the last row given a “window” - a specific segment of a table (I intentionally simplify this concept here). And in our case, the “window” is the date. Here is what this might look like in the form of a SQLite query using the Twitter follower example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SELECT * FROM (
SELECT 
	last_value(follower_count)
	OVER (
		PARTITION BY target_date
	) followers,
	last_value(following_count)
	OVER (
		PARTITION BY target_date
	) following,
	target_date
FROM (
	SELECT
		strftime('%Y-%m-%d', time_of_capture) target_date,
		following_count,
		follower_count
	FROM
		(SELECT
			*
		FROM
			user_snapshots
		WHERE
			user_id = 'foo_bar'
		ORDER BY time_of_capture))
) GROUP BY target_date, followers, following

This might look confusing, but the query is actually really simple (and you can optimize it further - it’s verbose for clarity).

  1. We are getting all the user snapshots for a given ID to get a list of user metrics sorted by the time those were captured.
  2. Now that the list is ordered, we are simplifying the date to be just the year, month and day.
  3. We are getting the last value for each partition (“window” in a table) that is created by the simplified date.
  4. We are grouping the values by the simplified date (which already excludes all the hour/minute/second details).

Here is also a tricky part - if you are used to the way various query languages work, you might think that LAST_VALUE() produces a row - after all, isn’t that what we’re after here? Well, yes - with the caveat that the last value will be appended as a column, so the GROUP BY clause at the end allows us to get just one row per date. What this means is that if you look at our table above, the LAST_VALUE, assuming that we partitioned the table by the simplified date and ordered by time, would look something like this:

Items Time LastValue
12 2020-02-12 12
13 2020-02-12 12
11 2020-02-12 12
10 2020-02-12 12
14 2020-02-12 12
12 2020-02-12 12

The conclusion here is simple - MAX() is useful, but do not conflate it with LAST_VALUE() in time-based data. You might be skewing your results.

Subscribe to The Den:

A monthly newsletter about product management, engineering, and tinkering with code.

Feedback

Have any thoughts? Let me know on Twitter!