Imagine having a database with your platform or product subscription’s data. The table has the subscriptions creation date, and if they’ve been cancelled, you have that date too – but how can you know how many active subscriptions you had on any given day in the past? You can’t. At least, not without reconstructing the historical data.
This is a common challenge across many businesses where historical states aren’t preserved, making trend analysis difficult. While this article focuses on subscription data, the same technique can be applied to any scenario where you need to reconstruct history from point-in-time data – whether that’s user statuses, inventory levels, or order states. While it’s easy to count what’s active today, reconstructing past metrics requires careful consideration of how your data has changed over time.
Table of contents
Data structure and challenges
Our data source is a BigQuery table stg_og__subscriptions containing subscription records with two key timestamps:
- subscription_id: A unique identifier used to track individual subscriptions.
- created: The timestamp indicating when the subscription started.
- cancelled_date: The timestamp indicating when the subscription was cancelled. If this field is NULL, the subscription is still active.
The challenges
Our subscription table only shows the latest state of each subscription; When they started, and if they’re cancelled, when that happened. You don’t see any history in between.
This makes it challenging when you want to answer questions like “how many active subscriptions did we have last month?” because you need to piece together historical counts from these final states.
Let’s illustrate this with an example. Suppose we have these subscription records:
subscription_id | created | cancelled_date |
---|---|---|
1 | 2023-01-01 09:30:00 | 2023-01-15 14:20:00 |
2 | 2023-01-02 11:45:00 | NULL |
3 | 2023-01-01 08:15:00 | 2023-01-02 09:00:00 |
On January 2nd, 2023:
- Subscription 1 was active all day
- Subscription 2 started during the day (11:45 AM)
- Subscription 3 was cancelled during the day (9:00 AM)
This creates two distinct points of measurement:
- Start of day (00:00:00): We need to know how many subscriptions were active at the beginning of the day
- End of day (23:59:59): We need to know how many subscriptions were active at day's end
For January 2nd:
- Start of day: Subscriptions 1 and 3 were active (2 total)
- During the day:
- Subscription 3 was cancelled (-1)
- Subscription 2 was created (+1)
- End of day: Subscriptions 1 and 2 were active (2 total)
While the net change was zero (2 active at start, 2 active at end), the underlying subscription base changed. This is why we need to track:
- Active subscriptions at start of day
- New subscriptions created during the day
- Subscriptions cancelled during the day
- Active subscriptions at end of day
The challenge is that our source data only shows the final state of each subscription. To get these daily metrics, we need to:
- Consider each subscription's creation date as a "+1" event on that day
- Consider each subscription's cancellation date as a "-1" event on that day
- Calculate running totals to determine the active count at any point in time
This reconstruction becomes essential for answering questions like:
- How many active subscriptions did we have on January 2nd?
- What was our daily net change in subscriptions?
- What was our highest/lowest active subscription count historically?
Without tracking these daily state changes, we can’t accurately measure growth, churn, or perform cohort analysis.
Limitation of this approach
The key limitation of this reconstruction method lies in its inability to capture intermediate state transitions. Our data only preserves the first creation date and final cancellation date, meaning we lose visibility of any state changes that occurred between these points.
Consider this scenario:
subscription_id | created | cancelled_date |
---|---|---|
12345 | 2023-01-01 09:30:00 | 2023-01-15 14:20:00 |
While this appears as a simple 15-day subscription, the actual history might have been:
- Created on January 1
- Cancelled on January 5
- Reactivated on January 10
- Finally cancelled on January 15
Our reconstruction method would show this as one continuous subscription from January 1 to January 15, missing the five-day gap when it was inactive.
Depending on your use case, this limitation might or might not be significant. If you primarily need to track overall subscription growth and general business health, this reconstruction method works well for analyzing historical data.
For historical data spanning long periods, you’ll run the reconstruction once on your entire dataset. For ongoing tracking, we recommend maintaining a simple daily metrics table that tracks new subscriptions, cancellations, and total active subscriptions. If you need to track every state change, you can implement a streaming solution – though this requires direct integration with your subscription system (see our Moving forward section below).
The key difference is that while historical reconstruction helps you analyze the past, daily metrics tracking is simpler, more efficient, and more accurate for ongoing monitoring.
Solution and query breakdown
Let’s first look at the complete query, then break down each component to understand how it works:
Full query:
WITH subscription_metrics AS (
-- Track new subscriptions
SELECT
DATE(created) AS date_day,
1 AS new_subscription,
0 AS cancelled_subscription
FROM `stg_subscriptions`
UNION ALL
-- Track cancellations
SELECT
DATE(cancelled_date) AS date_day,
0 AS new_subscription,
1 AS cancelled_subscription
FROM `stg_subscriptions`
WHERE cancelled_date IS NOT NULL
),
daily_aggregates AS (
-- Aggregate daily changes
SELECT
date_day,
SUM(new_subscription) AS new_subscriptions,
SUM(cancelled_subscription) AS cancelled_subscriptions
FROM subscription_metrics
GROUP BY date_day
),
date_spine AS (
-- Create continuous date range
SELECT date_day
FROM UNNEST(GENERATE_DATE_ARRAY(
(SELECT MIN(date_day) FROM daily_aggregates),
CURRENT_DATE()
)) AS date_day
),
running_totals AS (
-- Calculate running totals
SELECT
d.date_day,
COALESCE(a.new_subscriptions, 0) AS new_subscriptions,
COALESCE(a.cancelled_subscriptions, 0) AS cancelled_subscriptions,
SUM(COALESCE(a.new_subscriptions, 0) - COALESCE(a.cancelled_subscriptions, 0))
OVER (ORDER BY d.date_day) AS net_subscriptions
FROM date_spine d
LEFT JOIN daily_aggregates a ON d.date_day = a.date_day
)
SELECT
date_day,
LAG(net_subscriptions, 1, 0) OVER (ORDER BY date_day) AS active_subscriptions_at_start,
new_subscriptions,
cancelled_subscriptions,
net_subscriptions AS active_subscriptions_at_end
FROM running_totals
ORDER BY date_day;
Breaking down the components
1. Subscription metrics CTE
WITH subscription_metrics AS (
SELECT
DATE(created) AS date_day,
1 AS new_subscription,
0 AS cancelled_subscription
FROM stg_subscriptions
UNION ALL
SELECT
DATE(cancelled_date) AS date_day,
0 AS new_subscription,
1 AS cancelled_subscription
FROM stg_subscriptions
WHERE cancelled_date IS NOT NULL
)
This CTE treats creations and cancellations as separate events. Each creation adds 1 to new_subscription, and each cancellation adds 1 to cancelled_subscription on their respective days.
2. Daily aggregates CTE
daily_aggregates AS (
SELECT
date_day,
SUM(new_subscription) AS new_subscriptions,
SUM(cancelled_subscription) AS cancelled_subscriptions
FROM subscription_metrics
GROUP BY date_day
)
This step combines all events happening on the same day. If we had 5 new subscriptions and 2 cancellations on a given day, we’ll see those totals here.
3. Date spine CTE
date_spine AS (
SELECT date_day
FROM UNNEST(GENERATE_DATE_ARRAY(
(SELECT MIN(date_day) FROM daily_aggregates),
CURRENT_DATE()
)) AS date_day
)
Creates a continuous sequence of dates, ensuring we don’t miss any days, even when there are no events.
4. Running totals CTE
running_totals AS (
SELECT
d.date_day,
COALESCE(a.new_subscriptions, 0) AS new_subscriptions,
COALESCE(a.cancelled_subscriptions, 0) AS cancelled_subscriptions,
SUM(COALESCE(a.new_subscriptions, 0) - COALESCE(a.cancelled_subscriptions, 0))
OVER (ORDER BY d.date_day) AS net_subscriptions
FROM date_spine d
LEFT JOIN daily_aggregates a ON d.date_day = a.date_day
)
This crucial step joins our date spine with daily events, uses COALESCE to handle days with no events, and calculates a running sum of net changes. Together, this ensures we maintain an accurate running count of subscriptions over time.
5. Final select
SELECT
date_day,
LAG(net_subscriptions, 1, 0) OVER (ORDER BY date_day) AS active_subscriptions_at_start,
new_subscriptions,
cancelled_subscriptions,
net_subscriptions AS active_subscriptions_at_end
FROM running_totals
ORDER BY date_day
The final step uses LAG to get the previous day’s end count as current day’s start count, includes daily changes (new and cancelled), and shows end-of-day count as the running total. This gives us a complete picture of subscription activity for each day.
Output review
The query produces these key metrics for each date (date_day):
- active_subscriptions_at_start (start): Active subscriptions at beginning of day
- new_subscriptions (new): New subscriptions created that day
- cancelled_subscriptions (cancelled): Subscriptions cancelled that day
- active_subscriptions_at_end (end): Active subscriptions at end of day
Sample output:
date_day | start | new | cancelled | end |
---|---|---|---|---|
2023-01-01 | 100 | 5 | 3 | 103 |
2023-01-02 | 103 | 7 | 3 | 107 |
Moving forward: Tracking options
For ongoing tracking, there are two main approaches:
1. Daily aggregation (recommended)
- Create a daily metrics table storing:
- Date
- New subscriptions count
- Cancelled subscriptions count
- Total active subscriptions
- Each day, simply count new/cancelled subscriptions and update the total
- Minimal storage and computation
- Simple to maintain and understand
- Perfect for business metrics
2. Change tracking (streaming)
- Requires integration with your subscription system to capture every state change
- Must be implemented at the source to ensure no changes are missed
- Higher storage requirements
- More complex to maintain
- Better for audit trails and compliance
- Use only if you need complete historical state changes
The daily aggregation method is ideal for most business needs. The historical reconstruction approach we discussed would only be needed for backfilling old data where you don’t have daily aggregates.
Remember: simple solutions are often the most reliable. Only implement streaming if you truly need to track every state change.
Conclusion
Historical data reconstruction provides a way to analyze past metrics when only current states are available. This technique can be applied to various scenarios beyond subscriptions – from tracking user statuses to inventory levels. While this approach has limitations regarding intermediate states, it remains an efficient solution for analyzing historical data.
For ongoing tracking, the simplest and most efficient approach is to maintain a daily metrics table with new subscriptions, cancellations, and total active counts. Only implement a streaming solution if you need complete audit trails and have access to every state change. The key is to use complex reconstruction only for historical data, while keeping future tracking as simple as possible.
Do you have historical data that needs reconstruction? Whether it’s subscription metrics, user activity, or inventory changes, we can help you uncover valuable insights from your point-in-time data. At Systematik, we build solutions that handle the complexity so you can focus on growing your business. Contact us to learn how we can help.