Blog

Reconstructing historical data using BigQuery

Jan 04 — 20257 min read

Imagine having a database with your platform or product subscription data. The table has the subscription creation date, and if it has been cancelled, you have that date too. But how can you know how many active subscriptions you had on any given day in the past? You cannot. At least, not without reconstructing the historical data.

This is a common challenge across many businesses where historical states are not preserved, making trend analysis difficult. While this article focuses on subscription data, the same technique can be applied to any scenario where you need to reconstruct history from point-in-time data, whether that is user statuses, inventory levels, or order states.

After reading this guide, you will be able to:

  • Reconstruct daily active counts from a table that only stores creation and cancellation dates
  • Write a BigQuery query that generates a complete daily timeline of subscription activity
  • Understand the tradeoffs and limitations of historical reconstruction
  • Choose the right approach for ongoing tracking after the backfill is complete

Data structure and challenges

Our data source is a BigQuery table stg_og__subscriptions containing subscription records with two key timestamps:

  • subscription_id — Unique identifier for tracking individual subscriptions
  • created — Timestamp for when the subscription started
  • cancelled_date — Timestamp for when the subscription was cancelled. NULL if still active

Why you cannot query historical state directly

Our subscription table only shows the latest state of each subscription: when it started, and if it was cancelled, when that happened. You do not see any history in between.

This makes it challenging when you want to answer questions like "how many active subscriptions did we have last month?" because you need to piece together historical counts from these final states.

Let's illustrate with an example. Suppose we have three subscription records and we want to know the state on January 2nd, 2023:

MetricCountDetail
Active at start of day2Subscriptions 1 and 3
Created during the day1Subscription 2 (11:45 AM)
Cancelled during the day1Subscription 3 (9:00 AM)
Active at end of day2Subscriptions 1 and 2

While the net change was zero (2 active at start, 2 active at end), the underlying subscription base changed completely. This is why we need to track all four metrics, not just the end-of-day count.

To reconstruct these daily metrics from point-in-time data, we need to:

  • Treat each subscription's created date as a "+1" event on that day
  • Treat each subscription's cancelled_date as a "-1" event on that day
  • Calculate running totals to determine the active count at any point in time

Without tracking these daily state changes, we cannot accurately measure growth, churn, or perform cohort analysis.

Limitation of this approach

The key limitation of this reconstruction method is its inability to capture intermediate state transitions. Our data only preserves the first creation date and final cancellation date, meaning we lose visibility of any state changes that occurred between these points.

Example: A subscription created on January 1 and cancelled on January 15 appears as a continuous 15-day subscription. But the actual history might have been: created January 1, cancelled January 5, reactivated January 10, cancelled again January 15. The reconstruction misses the five-day gap when it was inactive.

Depending on your use case, this limitation might or might not be significant. If you primarily need to track overall subscription growth and general business health, this reconstruction method works well for analyzing historical data.

Key point: For historical data, run the reconstruction once on your entire dataset. For ongoing tracking, maintain a simple daily metrics table that records new subscriptions, cancellations, and total active subscriptions each day. This is simpler, more efficient, and more accurate than repeatedly running the full reconstruction.

Solution and query breakdown

Let's first look at the complete query, then break down each component to understand how it works:

WITH subscription_metrics AS (
  SELECT
    subscription_id,
    DATE(created) AS event_date,
    1 AS new_subscription,
    0 AS cancelled_subscription
  FROM `project.dataset.stg_og__subscriptions`
  WHERE created IS NOT NULL

  UNION ALL

  SELECT
    subscription_id,
    DATE(cancelled_date) AS event_date,
    0 AS new_subscription,
    1 AS cancelled_subscription
  FROM `project.dataset.stg_og__subscriptions`
  WHERE cancelled_date IS NOT NULL
),

daily_aggregates AS (
  SELECT
    event_date,
    SUM(new_subscription) AS new_subscriptions,
    SUM(cancelled_subscription) AS cancelled_subscriptions,
    SUM(new_subscription) - SUM(cancelled_subscription) AS net_change
  FROM subscription_metrics
  GROUP BY event_date
),

date_spine AS (
  SELECT date_day
  FROM UNNEST(
    GENERATE_DATE_ARRAY(
      (SELECT MIN(event_date) FROM daily_aggregates),
      CURRENT_DATE(),
      INTERVAL 1 DAY
    )
  ) AS date_day
),

running_totals AS (
  SELECT
    ds.date_day,
    COALESCE(da.new_subscriptions, 0) AS new_subscriptions,
    COALESCE(da.cancelled_subscriptions, 0) AS cancelled_subscriptions,
    SUM(COALESCE(da.net_change, 0))
      OVER (ORDER BY ds.date_day) AS active_subscriptions_at_end
  FROM date_spine ds
  LEFT JOIN daily_aggregates da
    ON ds.date_day = da.event_date
)

SELECT
  date_day,
  LAG(active_subscriptions_at_end, 1, 0)
    OVER (ORDER BY date_day) AS active_subscriptions_at_start,
  new_subscriptions,
  cancelled_subscriptions,
  active_subscriptions_at_end
FROM running_totals
ORDER BY date_day

Subscription metrics CTE

This CTE treats creations and cancellations as separate events using UNION ALL. Each row from the source table produces up to two event rows: one for the created date (with new_subscription = 1) and one for the cancelled_date (with cancelled_subscription = 1). This flattens the data into a single stream of daily events.

-- Creation events
SELECT
  subscription_id,
  DATE(created) AS event_date,
  1 AS new_subscription,
  0 AS cancelled_subscription
FROM `project.dataset.stg_og__subscriptions`
WHERE created IS NOT NULL

UNION ALL

-- Cancellation events
SELECT
  subscription_id,
  DATE(cancelled_date) AS event_date,
  0 AS new_subscription,
  1 AS cancelled_subscription
FROM `project.dataset.stg_og__subscriptions`
WHERE cancelled_date IS NOT NULL

Daily aggregates CTE

This step combines all events happening on the same day. If we had 5 new subscriptions and 2 cancellations on a given day, this CTE rolls them up into a single row with those totals and a net_change of 3.

SELECT
  event_date,
  SUM(new_subscription) AS new_subscriptions,
  SUM(cancelled_subscription) AS cancelled_subscriptions,
  SUM(new_subscription) - SUM(cancelled_subscription) AS net_change
FROM subscription_metrics
GROUP BY event_date

Date spine CTE

Creates a continuous sequence of dates from the earliest event to today using GENERATE_DATE_ARRAY. This ensures we do not miss any days where no subscriptions were created or cancelled. Without the date spine, gaps in activity would create gaps in our timeline.

SELECT date_day
FROM UNNEST(
  GENERATE_DATE_ARRAY(
    (SELECT MIN(event_date) FROM daily_aggregates),
    CURRENT_DATE(),
    INTERVAL 1 DAY
  )
) AS date_day

Running totals CTE

This step joins the date spine with daily events, uses COALESCE to fill in zeros for days with no activity, and calculates a running sum of net changes using a window function. Together, this produces an accurate running count of active subscriptions for every single day.

SELECT
  ds.date_day,
  COALESCE(da.new_subscriptions, 0) AS new_subscriptions,
  COALESCE(da.cancelled_subscriptions, 0) AS cancelled_subscriptions,
  SUM(COALESCE(da.net_change, 0))
    OVER (ORDER BY ds.date_day) AS active_subscriptions_at_end
FROM date_spine ds
LEFT JOIN daily_aggregates da
  ON ds.date_day = da.event_date

Final select

The final step uses LAG to get the previous day's end count as the current day's start count, includes daily changes (new and cancelled), and shows the end-of-day count as the running total. This gives us a complete picture of subscription activity for each day.

SELECT
  date_day,
  LAG(active_subscriptions_at_end, 1, 0)
    OVER (ORDER BY date_day) AS active_subscriptions_at_start,
  new_subscriptions,
  cancelled_subscriptions,
  active_subscriptions_at_end
FROM running_totals
ORDER BY date_day

Output review

The query produces four key metrics for each date:

  • active_subscriptions_at_start — Active subscriptions at beginning of day
  • new_subscriptions — New subscriptions created that day
  • cancelled_subscriptions — Subscriptions cancelled that day
  • active_subscriptions_at_end — Active subscriptions at end of day

Here is what the output looks like for the first 10 days of data:

date_daystartnewcancelledend
2023-01-010303
2023-01-023113
2023-01-033205
2023-01-045005
2023-01-055124
2023-01-064004
2023-01-074316
2023-01-086006
2023-01-096217
2023-01-107108

Notice how the start value for each row matches the end value from the previous row. Days with no activity (like January 4, 6, and 8) still appear in the output thanks to the date spine, with zeros for new and cancelled subscriptions.

Moving forward: tracking options

The reconstruction query is designed for backfilling historical data. For ongoing tracking, there are two approaches depending on your needs.

Daily aggregation (recommended)

Create a daily metrics table that stores the date, new subscription count, cancelled subscription count, and total active subscriptions. Each day, count new and cancelled subscriptions and update the total. This approach uses minimal storage and computation, is simple to maintain and understand, and covers the vast majority of business reporting needs.

Change tracking (streaming)

This approach requires integration with your subscription system to capture every state change as it happens. It must be implemented at the source to ensure no changes are missed. It has higher storage requirements and is more complex to maintain, but it provides a complete audit trail. Use this only if you need to track every intermediate state change for compliance or detailed analysis.

Key point: Simple solutions are often the most reliable. The daily aggregation method handles most business needs. Only implement streaming if you truly need to track every state change for audit trails or compliance.

Conclusion

Historical reconstruction gives you an approximate but useful view of the past from data that was never designed to preserve it. The tradeoff is straightforward: you get a complete daily timeline at low cost, but you miss intermediate state changes (a subscription that was cancelled and reactivated between snapshots). For most business reporting, that tradeoff is worth it.

The more important decision is what you do going forward. Once you have backfilled the historical data, switch to a simple daily aggregation table for ongoing tracking. It is cheaper to run, easier to maintain, and produces accurate results without the approximation. If you have historical data that needs reconstruction, whether it is subscription metrics, user activity, or inventory changes, contact us and we can help you build the right solution.

Have a project in mind?

Let's talk

Book a free 30-minute consultation to discuss your current challenges, explore where you'd like to be in 6 months, and see if we're the right fit.

Book a free consultation