I was working on some data analysis recently that involved data snapshots. These are effectively point-in-time representations of specific data. For example, if you were dealing with the number of items at a warehouse, you might take a snapshot every hour recording how many items are available. That gives you a pretty good idea of the in-flow and out-flow of materials.
Now, let’s say that you want to analyze the number of items that are left at your warehouse at the end of the day. And let’s also say that you’re using SQLite to store the data. I noticed an answer on a popular forum today suggesting that you can just use the `MAX()` function per day and aggregate on that. While that will give you a number, it will also likely be the wrong number. Here is why:
If we use `MAX()` to analyze the data, we would get 14 as the number of units available at the end of the day. However, that is actually the maximum number of units throughout the day. If you were tracking Twitter followers for an account the same way, you probably don’t care about the maximum number of followers the account reached during the day, but rather what it ended up with at the end of the day.
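To make the problem concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The table name matches the query later in the post; the snapshot values themselves are hypothetical, picked so the intraday peak differs from the closing value:

```python
import sqlite3

# Hypothetical hourly snapshots for one day: the count peaks at 14
# mid-day, but the day actually ends at 12.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_snapshots (time_of_capture TEXT, follower_count INTEGER)"
)
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?)",
    [
        ("2023-01-01 09:00:00", 10),
        ("2023-01-01 12:00:00", 14),  # intraday peak
        ("2023-01-01 18:00:00", 12),  # end-of-day value
    ],
)

# MAX() reports the peak, not the closing value.
peak = conn.execute("SELECT MAX(follower_count) FROM user_snapshots").fetchone()[0]
print(peak)  # 14, even though the day ended at 12
```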
Luckily for us, SQLite gives us the ability to use what’s known as a window function - `LAST_VALUE()`. This allows us to get the value from the last row of a given “window” - a specific segment of a table (I’m intentionally simplifying the concept here). In our case, the “window” is the date. Here is what this might look like in the form of a SQLite query, using the Twitter follower example:
```sql
SELECT * FROM (
    SELECT
        last_value(follower_count) OVER (PARTITION BY target_date) followers,
        last_value(following_count) OVER (PARTITION BY target_date) following,
        target_date
    FROM (
        SELECT
            strftime('%Y-%m-%d', time_of_capture) target_date,
            following_count,
            follower_count
        FROM (
            SELECT * FROM user_snapshots
            WHERE user_id = 'foo_bar'
            ORDER BY time_of_capture
        )
    )
)
GROUP BY target_date, followers, following
```
This might look confusing, but the query is actually really simple (and you can optimize it further - it’s verbose for clarity).
- We get all the snapshots for a given user ID, producing a list of user metrics sorted by the time they were captured.
- Now that the list is ordered, we simplify the capture date down to just the year, month, and day.
- We take the last value in each partition (a “window” in the table) created by that simplified date.
- We group the values by the simplified date (which already excludes all the hour/minute/second details).
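The steps above can be run end to end with Python’s `sqlite3` module. This is a hedged sketch with hypothetical sample data; note that I also add an explicit `ORDER BY` and frame clause inside `OVER()`, which makes “the last row” well-defined rather than relying on the subquery’s ordering (window functions require SQLite 3.25+):

```python
import sqlite3

# Hypothetical sample data matching the schema used in the post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_snapshots (
        user_id TEXT,
        time_of_capture TEXT,
        follower_count INTEGER,
        following_count INTEGER
    )
""")
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?, ?, ?)",
    [
        ("foo_bar", "2023-01-01 09:00:00", 10, 5),
        ("foo_bar", "2023-01-01 12:00:00", 14, 5),  # intraday peak
        ("foo_bar", "2023-01-01 18:00:00", 12, 6),  # day 1 closing value
        ("foo_bar", "2023-01-02 10:00:00", 13, 6),
        ("foo_bar", "2023-01-02 17:00:00", 15, 7),  # day 2 closing value
    ],
)

# Same shape as the query in the post, with an explicit ORDER BY and
# frame inside OVER() so the "last row" per day is unambiguous.
rows = conn.execute("""
    SELECT * FROM (
        SELECT
            strftime('%Y-%m-%d', time_of_capture) AS target_date,
            last_value(follower_count) OVER (
                PARTITION BY strftime('%Y-%m-%d', time_of_capture)
                ORDER BY time_of_capture
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS followers,
            last_value(following_count) OVER (
                PARTITION BY strftime('%Y-%m-%d', time_of_capture)
                ORDER BY time_of_capture
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS "following"
        FROM user_snapshots
        WHERE user_id = 'foo_bar'
    )
    GROUP BY target_date, followers, "following"
    ORDER BY target_date
""").fetchall()

print(rows)  # [('2023-01-01', 12, 6), ('2023-01-02', 15, 7)]
```

Each day collapses to its closing values (12/6 and 15/7), not its intraday maximums.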
Here is also a tricky part - if you are used to the way various query languages work, you might think that `LAST_VALUE()` produces a row - after all, isn’t that what we’re after here? Well, yes - with the caveat that the last value is appended as a column to every row in the window, so the `GROUP BY` clause at the end is what allows us to get just one row per date. What this means is that if you look at our table above, the `LAST_VALUE` column, assuming that we partitioned the table by the simplified date and ordered by time, would look something like this:
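As a runnable sketch of that intermediate, pre-`GROUP BY` result (using Python’s `sqlite3` and hypothetical numbers for a single day):

```python
import sqlite3

# Hypothetical snapshots for one day, to show that LAST_VALUE() is
# appended as a column on every row rather than collapsing rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_snapshots (time_of_capture TEXT, follower_count INTEGER)"
)
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?)",
    [
        ("2023-01-01 09:00:00", 10),
        ("2023-01-01 12:00:00", 14),
        ("2023-01-01 18:00:00", 12),
    ],
)

rows = conn.execute("""
    SELECT time_of_capture, follower_count,
           last_value(follower_count) OVER (
               PARTITION BY strftime('%Y-%m-%d', time_of_capture)
               ORDER BY time_of_capture
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS followers
    FROM user_snapshots
""").fetchall()

for row in rows:
    print(row)
# Every row carries the same end-of-day value (12) in the extra column;
# the GROUP BY is what collapses these to one row per date.
```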
The conclusion here is simple - `MAX()` is useful, but do not conflate it with `LAST_VALUE()` in time-based data. You might be skewing your results.