Exploring Navigation Functions in BigQuery

A guide to understanding a subset of BigQuery window functions

Tazki Anida Asrul
5 min readAug 19, 2024
Photo by Jamie Street on Unsplash

When working with data in BigQuery, we often need to perform data manipulations over a group of rows. If we want to perform a simple calculation that applies to a whole group, the GROUP BY clause should be enough. But what if we are required to do a calculation that is specific for each row within a group? Do we need a lengthy block of code or only a line of code?

One powerful function for this purpose is the window function. It contains the OVER() clause to specify the window of rows being evaluated within a group. There are several types of window functions. Besides the common aggregation like COUNT() and numbering functions such as ROW_NUMBER(), there is also a subset of the window function called the navigation function.

So, what are the navigation functions?

The navigation functions are functions that allow us to navigate the rows in a partition from the perspective of the current row. In the navigation functions, we can examine the previous or following column values for each row. Below are the functions that are classified as navigation functions.

1. First, Last, and Nth Value

Given the San Francisco film locations data in a BigQuery public dataset. We are required to get the first, last, or fifth movie for each location based on the movie release year. How do we achieve that?

Let’s look at the query.

SELECT   
title,
locations,
release_year,
FIRST_VALUE(title) OVER(PARTITION BY locations ORDER BY release_year) AS first_movie
LAST_VALUE(title) OVER(PARTITION BY locations ORDER BY release_year) AS last_movie,
NTH_VALUE(title,5) OVER(PARTITION BY locations ORDER BY release_year) AS fifth_movie
FROM
`bigquery-public-data.san_francisco.film_locations`
Image by author

Based on the result above, we can see that the first_movie value is consistent across every row, which is A Jitney Experiment . However, the values differ between each row for the last_movie and fifth_movie columns. Why is it?

The window frame for a navigation function starts from the first row and ends on the current row by default. Therefore, in the example above, the last_movie value is always similar to the current row value. Also, the fifth_movie value for the first until the fourth row is NULL. It’s not an accurate answer for the case.

To fix the issue, we need to adjust the window frame for the last_movie and the fifth_movie fields. The default behavior of the window frame with the ORDER BY clause is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. What we should do is specify the window frame to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. It will ensure that the scope of operation for each row is between the first value and the last value in each partition group. This is the query sample.

SELECT   
title,
locations,
release_year,
LAST_VALUE(title) OVER(PARTITION BY locations ORDER BY release_year ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_movie,
NTH_VALUE(title,5) OVER(PARTITION BY locations ORDER BY release_year ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS fifth_movie
FROM
`bigquery-public-data.san_francisco.film_locations`

And here is the final result.

Image by author

As a side note, these functions always include the NULL values in the calculation. If we want to exclude the NULL values in the calculation, we can use the IGNORE NULLS expression in the function as below.

SELECT 
FIRST_VALUE(title IGNORE NULLS) OVER(PARTITION BY locations ORDER BY release_year) AS first_movie
FROM
`bigquery-public-data.san_francisco.film_locations`

2. Lead and Lag

If we are interested in seeing the previous or following row value in an ordered window frame, we can use the LAG() and LEAD() functions. To get a more detailed example, please take a look at this query.

SELECT
title,
locations,
release_year,
LEAD(title) OVER(PARTITION BY locations ORDER BY release_year) AS following_movie,
LAG(title) OVER(PARTITION BY locations ORDER BY release_year) AS previous_movie,
LEAD(title,2) OVER(PARTITION BY locations ORDER BY release_year) AS two_movies_after,
LAG(title, 3, 'none') OVER(PARTITION BY locations ORDER BY release_year) AS three_movies_before,
FROM
`bigquery-public-data.san_francisco.film_locations`

These functions can skip more than one preceding or following row, depending on the integer value that we define as the second argument. If the current row doesn’t have any preceding or following row, we can put a default value in the third argument instead of returning it as a NULL value.

For example, we put the none value as the default if the movie does not have value for the three_movies_before as the result below.

Image by author

3. Percentile

Here we are at the last navigation function, the percentile. There are two types of percentile functions:

1. PERCENTILE_CONT()

This function calculates a continuous percentile, which means it interpolates between values if the exact percentile value does not exist in the data set. It’s more required for cases that are related to the continuous distribution of data, like financial (e.g. salary) or scientific (e.g. temperature) data

2. PERCENTILE_DISC()

This function calculates a discrete percentile that returns an exact value from the data set. Below is the sample of the discrete percentile function to calculate the 80 percentile of movie release year for each location.

SELECT
title,
release_year,
locations,
PERCENTILE_DISC(release_year, 0.8) OVER(PARTITION BY locations) AS percent_disc_80,
FROM
`bigquery-public-data.san_francisco.film_locations`
WHERE locations like '%War %'
ORDER BY locations DESC

As a note, the percentile value should be in the range of [0, 1]. Therefore, we use 0.8 as the percentile in the query above.

Image by author

Conclusion

So, that’s all about navigation functions. As the subset of window functions, navigation functions can be classified as one of the advanced analytics techniques in BigQuery. By using the navigation functions, we can see the relationship between each row within a specified window frame. Understanding window functions can be challenging, but I hope the explanation above provides more clarity. Thanks for reading!

--

--