I was working on some data analysis recently that involved data snapshots. These are effectively point-in-time representations of specific data. For example, if you were dealing with the number of items at a warehouse, you might take a snapshot every hour recording how many items are available. That gives you a pretty good idea of the in-flow and out-flow of materials.
Now, let’s say that you want to analyze the number of items that are left at your warehouse at the end of the day. And let’s also say that you’re using SQLite to store the data. I noticed an answer on a popular forum today suggesting that you can just use the `MAX()` function per day and aggregate on that. While that will give you a number, it will also likely be the wrong number. Here is why:
If we use `MAX()` to analyze the data, we would get 14 as the number of units available at the end of the day. However, that is actually the maximum number of units throughout the day. If you were tracking Twitter followers for an account the same way, you probably don’t care about the maximum number of followers the account reached during the day, but rather what it ended up with at the end of the day.
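To make the problem concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The table name matches the query later in the post; the snapshot values themselves are hypothetical, picked so the intraday peak differs from the closing value:

```python
import sqlite3

# Hypothetical hourly snapshots for one day: the count peaks at 14
# mid-day, but the day actually ends at 12.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_snapshots (time_of_capture TEXT, follower_count INTEGER)"
)
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?)",
    [
        ("2023-01-01 09:00:00", 10),
        ("2023-01-01 12:00:00", 14),  # intraday peak
        ("2023-01-01 18:00:00", 12),  # end-of-day value
    ],
)

# MAX() reports the peak, not the closing value.
peak = conn.execute("SELECT MAX(follower_count) FROM user_snapshots").fetchone()[0]
print(peak)  # 14, even though the day ended at 12
```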
Luckily for us, SQLite gives us the ability to use what’s known as a window function - `LAST_VALUE()`. This allows us to get the value from the last row of a given “window” - a specific segment of a table (I’m intentionally simplifying the concept here). In our case, the “window” is the date. Here is what this might look like in the form of a SQLite query, using the Twitter follower example:
```sql
SELECT * FROM (
    SELECT
        last_value(follower_count) OVER (PARTITION BY target_date) followers,
        last_value(following_count) OVER (PARTITION BY target_date) following,
        target_date
    FROM (
        SELECT
            strftime('%Y-%m-%d', time_of_capture) target_date,
            following_count,
            follower_count
        FROM (
            SELECT * FROM user_snapshots
            WHERE user_id = 'foo_bar'
            ORDER BY time_of_capture
        )
    )
)
GROUP BY target_date, followers, following
```
This might look confusing, but the query is actually really simple (and you can optimize it further - it’s verbose for clarity).
- We get all the snapshots for a given user ID, producing a list of user metrics sorted by the time they were captured.
- Now that the list is ordered, we simplify the capture date down to just the year, month, and day.
- We take the last value in each partition (a “window” in the table) created by that simplified date.
- We group the values by the simplified date (which already excludes all the hour/minute/second details).
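The steps above can be run end to end with Python’s `sqlite3` module. This is a hedged sketch with hypothetical sample data; note that I also add an explicit `ORDER BY` and frame clause inside `OVER()`, which makes “the last row” well-defined rather than relying on the subquery’s ordering (window functions require SQLite 3.25+):

```python
import sqlite3

# Hypothetical sample data matching the schema used in the post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_snapshots (
        user_id TEXT,
        time_of_capture TEXT,
        follower_count INTEGER,
        following_count INTEGER
    )
""")
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?, ?, ?)",
    [
        ("foo_bar", "2023-01-01 09:00:00", 10, 5),
        ("foo_bar", "2023-01-01 12:00:00", 14, 5),  # intraday peak
        ("foo_bar", "2023-01-01 18:00:00", 12, 6),  # day 1 closing value
        ("foo_bar", "2023-01-02 10:00:00", 13, 6),
        ("foo_bar", "2023-01-02 17:00:00", 15, 7),  # day 2 closing value
    ],
)

# Same shape as the query in the post, with an explicit ORDER BY and
# frame inside OVER() so the "last row" per day is unambiguous.
rows = conn.execute("""
    SELECT * FROM (
        SELECT
            strftime('%Y-%m-%d', time_of_capture) AS target_date,
            last_value(follower_count) OVER (
                PARTITION BY strftime('%Y-%m-%d', time_of_capture)
                ORDER BY time_of_capture
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS followers,
            last_value(following_count) OVER (
                PARTITION BY strftime('%Y-%m-%d', time_of_capture)
                ORDER BY time_of_capture
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS "following"
        FROM user_snapshots
        WHERE user_id = 'foo_bar'
    )
    GROUP BY target_date, followers, "following"
    ORDER BY target_date
""").fetchall()

print(rows)  # [('2023-01-01', 12, 6), ('2023-01-02', 15, 7)]
```

Each day collapses to its closing values (12/6 and 15/7), not its intraday maximums.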
Here is also a tricky part - if you are used to the way various query languages work, you might think that `LAST_VALUE()` produces a row - after all, isn’t that what we’re after here? Well, yes - with the caveat that the last value is appended as a column to every row in the window, so the `GROUP BY` clause at the end is what allows us to get just one row per date. What this means is that if you look at our table above, the `LAST_VALUE` column, assuming that we partitioned the table by the simplified date and ordered by time, would look something like this:
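As a runnable sketch of that intermediate, pre-`GROUP BY` result (using Python’s `sqlite3` and hypothetical numbers for a single day):

```python
import sqlite3

# Hypothetical snapshots for one day, to show that LAST_VALUE() is
# appended as a column on every row rather than collapsing rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_snapshots (time_of_capture TEXT, follower_count INTEGER)"
)
conn.executemany(
    "INSERT INTO user_snapshots VALUES (?, ?)",
    [
        ("2023-01-01 09:00:00", 10),
        ("2023-01-01 12:00:00", 14),
        ("2023-01-01 18:00:00", 12),
    ],
)

rows = conn.execute("""
    SELECT time_of_capture, follower_count,
           last_value(follower_count) OVER (
               PARTITION BY strftime('%Y-%m-%d', time_of_capture)
               ORDER BY time_of_capture
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS followers
    FROM user_snapshots
""").fetchall()

for row in rows:
    print(row)
# Every row carries the same end-of-day value (12) in the extra column;
# the GROUP BY is what collapses these to one row per date.
```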
The conclusion here is simple - `MAX()` is useful, but do not conflate it with `LAST_VALUE()` in time-based data. You might be skewing your results.