Instead Of MAX Use LAST_VALUE For TimeBased Data
There is a big difference between two builtin functions that you need to consider to avoid working with bad results.
By Den Delimarsky in Data
February 15, 2020
I was working on some data analysis recently, that involved me dealing with data snapshots. These are effectively pointintime representations of specific data. For example, if you’d be dealing with the amount of items at a warehouse, you could imagine that you’d be taking a snapshot through the day of how many items you have available during each hour. That will give you a pretty good idea of the inflow and outflow of materials.
Now, let’s say that you want to analyze the number of items that are left at your warehouse at the end of the day. And let’s also say that you’re using SQLite to store the data. I’ve noticed an answer on a popular forum today that mentioned that you can just use the MAX function per day, and use that for aggregation. While that will give you a number, it will also likely be the wrong number. Here is why:
Items  Time 

12  20200212 12:12AM 
13  20200212 1:10AM 
11  20200212 3:00PM 
10  20200212 5:33PM 
14  20200212 6:00PM 
12  20200212 11:50PM 
If we use MAX()
, when we would analyze the data we would get 14 as the answer for units available at the end of the day. However, that is the maximum number of units through the day. If you’d be tracking Twitter followers for an account the same way, you probably don’t care about the maximum number of followers your account got in a day, but rather what you ended up with at the end of the day.
Lucky for us, SQLite gives us the ability to use what’s known as a window function  LAST_VALUE. This allows us to get the value for the last row given a “window”  a specific segment of a table (I intentionally simplify this concept here). And in our case, the “window” is the date. Here is what this might look like in the form of a SQLite query using the Twitter follower example:


This might look confusing, but the query is actually really simple (and you can optimize it further  it’s verbose for clarity).
 We are getting all the user snapshots for a given ID to get a list of user metrics sorted by the time those were captured.
 Now that the list is ordered, we are simplifying the date to be just the year, month and day.
 We are getting the last value for each partition (“window” in a table) that is created by the simplified date.
 We are grouping the values by the simplified date (which already excludes all the hour/minute/second details).
Here is also a tricky part  if you are used to the way various query languages work, you might think that LAST_VALUE()
produces a row  after all, isn’t that what we’re after here? Well, yes  with the caveat that the last value will be appended as a column, so the GROUP BY
clause at the end allows us to get just one row per date. What this means is that if you look at our table above, the LAST_VALUE
, assuming that we partitioned the table by the simplified date and ordered by time, would look something like this:
Items  Time  LastValue 

12  20200212  12 
13  20200212  12 
11  20200212  12 
10  20200212  12 
14  20200212  12 
12  20200212  12 
The conclusion here is simple  MAX()
is useful, but do not conflate it with LAST_VALUE()
in timebased data. You might be skewing your results.
Feedback
Have any thoughts? Let me know on Twitter!