Reconstructing historical data using BigQuery

Imagine having a database with your platform or product subscription’s data. The table has the subscriptions creation date, and if they’ve been cancelled, you have that date too – but how can you know how many active subscriptions you had on any given day in the past? You can’t. At least, not without reconstructing the historical data.

This is a common challenge across many businesses where historical states aren’t preserved, making trend analysis difficult. While this article focuses on subscription data, the same technique can be applied to any scenario where you need to reconstruct history from point-in-time data – whether that’s user statuses, inventory levels, or order states. While it’s easy to count what’s active today, reconstructing past metrics requires careful consideration of how your data has changed over time.

Table of contents

Data structure and challenges

Our data source is a BigQuery table stg_og__subscriptions containing subscription records with two key timestamps:

The challenges

Our subscription table only shows the latest state of each subscription; When they started, and if they’re cancelled, when that happened. You don’t see any history in between. 

This makes it challenging when you want to answer questions like “how many active subscriptions did we have last month?” because you need to piece together historical counts from these final states.

Let’s illustrate this with an example. Suppose we have these subscription records:

subscription_idcreatedcancelled_date
12023-01-01 09:30:002023-01-15 14:20:00
22023-01-02 11:45:00NULL
32023-01-01 08:15:002023-01-02 09:00:00

On January 2nd, 2023:

This creates two distinct points of measurement:

For January 2nd:

While the net change was zero (2 active at start, 2 active at end), the underlying subscription base changed. This is why we need to track:

The challenge is that our source data only shows the final state of each subscription. To get these daily metrics, we need to:

This reconstruction becomes essential for answering questions like:

Without tracking these daily state changes, we can’t accurately measure growth, churn, or perform cohort analysis.

Limitation of this approach

The key limitation of this reconstruction method lies in its inability to capture intermediate state transitions. Our data only preserves the first creation date and final cancellation date, meaning we lose visibility of any state changes that occurred between these points.

Consider this scenario:

subscription_idcreatedcancelled_date
123452023-01-01 09:30:002023-01-15 14:20:00

While this appears as a simple 15-day subscription, the actual history might have been:

Our reconstruction method would show this as one continuous subscription from January 1 to January 15, missing the five-day gap when it was inactive.

Depending on your use case, this limitation might or might not be significant. If you primarily need to track overall subscription growth and general business health, this reconstruction method works well for analyzing historical data.

For historical data spanning long periods, you’ll run the reconstruction once on your entire dataset. For ongoing tracking, we recommend maintaining a simple daily metrics table that tracks new subscriptions, cancellations, and total active subscriptions. If you need to track every state change, you can implement a streaming solution – though this requires direct integration with your subscription system (see our Moving forward section below).

The key difference is that while historical reconstruction helps you analyze the past, daily metrics tracking is simpler, more efficient, and more accurate for ongoing monitoring.

Solution and query breakdown

Let’s first look at the complete query, then break down each component to understand how it works:

Full query:
				
					WITH subscription_metrics AS (
  -- Track new subscriptions
  SELECT 
    DATE(created) AS date_day,
    1 AS new_subscription,
    0 AS cancelled_subscription
  FROM `stg_subscriptions`

  UNION ALL

  -- Track cancellations
  SELECT 
    DATE(cancelled_date) AS date_day,
    0 AS new_subscription,
    1 AS cancelled_subscription
  FROM `stg_subscriptions`
  WHERE cancelled_date IS NOT NULL
),
daily_aggregates AS (
  -- Aggregate daily changes
  SELECT
    date_day,
    SUM(new_subscription) AS new_subscriptions,
    SUM(cancelled_subscription) AS cancelled_subscriptions
  FROM subscription_metrics 
  GROUP BY date_day
),
date_spine AS (
  -- Create continuous date range
  SELECT date_day
  FROM UNNEST(GENERATE_DATE_ARRAY(
    (SELECT MIN(date_day) FROM daily_aggregates),
    CURRENT_DATE()
  )) AS date_day
),
running_totals AS (
  -- Calculate running totals
  SELECT
    d.date_day,
    COALESCE(a.new_subscriptions, 0) AS new_subscriptions,
    COALESCE(a.cancelled_subscriptions, 0) AS cancelled_subscriptions,
    SUM(COALESCE(a.new_subscriptions, 0) - COALESCE(a.cancelled_subscriptions, 0)) 
      OVER (ORDER BY d.date_day) AS net_subscriptions
  FROM date_spine d
  LEFT JOIN daily_aggregates a ON d.date_day = a.date_day
)
SELECT 
  date_day,
  LAG(net_subscriptions, 1, 0) OVER (ORDER BY date_day) AS active_subscriptions_at_start,
  new_subscriptions,
  cancelled_subscriptions,
  net_subscriptions AS active_subscriptions_at_end
FROM running_totals
ORDER BY date_day;


				
			

Breaking down the components

1. Subscription metrics CTE

				
					WITH subscription_metrics AS (
SELECT 
 DATE(created) AS date_day,
 1 AS new_subscription,
 0 AS cancelled_subscription
FROM stg_subscriptions

UNION ALL

SELECT 
 DATE(cancelled_date) AS date_day,
 0 AS new_subscription,
 1 AS cancelled_subscription
FROM stg_subscriptions
WHERE cancelled_date IS NOT NULL
)
				
			

This CTE treats creations and cancellations as separate events. Each creation adds 1 to new_subscription, and each cancellation adds 1 to cancelled_subscription on their respective days.

2. Daily aggregates CTE

				
					
daily_aggregates AS (
SELECT
 date_day,
 SUM(new_subscription) AS new_subscriptions,
 SUM(cancelled_subscription) AS cancelled_subscriptions
FROM subscription_metrics 
GROUP BY date_day
)


				
			

This step combines all events happening on the same day. If we had 5 new subscriptions and 2 cancellations on a given day, we’ll see those totals here.

3. Date spine CTE

				
					
date_spine AS (
SELECT date_day
FROM UNNEST(GENERATE_DATE_ARRAY(
 (SELECT MIN(date_day) FROM daily_aggregates),
 CURRENT_DATE()
)) AS date_day
)



				
			

Creates a continuous sequence of dates, ensuring we don’t miss any days, even when there are no events.

4. Running totals CTE

				
					
running_totals AS (
SELECT
 d.date_day,
 COALESCE(a.new_subscriptions, 0) AS new_subscriptions,
 COALESCE(a.cancelled_subscriptions, 0) AS cancelled_subscriptions,
 SUM(COALESCE(a.new_subscriptions, 0) - COALESCE(a.cancelled_subscriptions, 0)) 
   OVER (ORDER BY d.date_day) AS net_subscriptions
FROM date_spine d
LEFT JOIN daily_aggregates a ON d.date_day = a.date_day
)


				
			

This crucial step joins our date spine with daily events, uses COALESCE to handle days with no events, and calculates a running sum of net changes. Together, this ensures we maintain an accurate running count of subscriptions over time.

5. Final select

				
					
SELECT 
date_day,
LAG(net_subscriptions, 1, 0) OVER (ORDER BY date_day) AS active_subscriptions_at_start,
new_subscriptions,
cancelled_subscriptions,
net_subscriptions AS active_subscriptions_at_end
FROM running_totals
ORDER BY date_day


				
			

The final step uses LAG to get the previous day’s end count as current day’s start count, includes daily changes (new and cancelled), and shows end-of-day count as the running total. This gives us a complete picture of subscription activity for each day.

Output review

The query produces these key metrics for each date (date_day):

Sample output:

date_daystartnewcancelledend
2023-01-0110053103
2023-01-0210373107

Moving forward: Tracking options

For ongoing tracking, there are two main approaches:

1. Daily aggregation (recommended)

2. Change tracking (streaming)

The daily aggregation method is ideal for most business needs. The historical reconstruction approach we discussed would only be needed for backfilling old data where you don’t have daily aggregates.

Remember: simple solutions are often the most reliable. Only implement streaming if you truly need to track every state change.

Conclusion

Historical data reconstruction provides a way to analyze past metrics when only current states are available. This technique can be applied to various scenarios beyond subscriptions – from tracking user statuses to inventory levels. While this approach has limitations regarding intermediate states, it remains an efficient solution for analyzing historical data.

For ongoing tracking, the simplest and most efficient approach is to maintain a daily metrics table with new subscriptions, cancellations, and total active counts. Only implement a streaming solution if you need complete audit trails and have access to every state change. The key is to use complex reconstruction only for historical data, while keeping future tracking as simple as possible.

Do you have historical data that needs reconstruction? Whether it’s subscription metrics, user activity, or inventory changes, we can help you uncover valuable insights from your point-in-time data. At Systematik, we build solutions that handle the complexity so you can focus on growing your business. Contact us to learn how we can help.

SHARE POST

Thank You!

Please check your email for the download links to our Ultimate Guide on How to Build a Data Strategy.

P.S. If you don’t see the email in your inbox within a few minutes, please check your spam or junk folder.