Tuesday, January 26, 2016

SQL to de-dupe overlapping ranges of dates

I came across a situation today where I was given a set of date ranges (begin date to end date) and needed to de-duplicate the ranges so any overlapping dates are "collapsed" so the final result would only contain the distinct range(s) of dates. There are probably a number of solutions to this problem, but I rather liked what I came up with so I decided to post it here for posterity. Code first, explanation after:

-- Declare a table variable for test data
DECLARE @Data TABLE
(
 BeginDate DATETIME,
 EndDate  DATETIME
);
-- Fill the test data
INSERT INTO @Data
SELECT '2016-01-01', '2016-01-05'
UNION ALL SELECT '2016-01-03', '2016-01-10'
UNION ALL SELECT '2016-01-12', '2016-01-15'
UNION ALL SELECT '2016-01-14', '2016-01-18'
UNION ALL SELECT '2016-01-16', '2016-01-25'
UNION ALL SELECT '2016-01-16', '2016-01-20'
UNION ALL SELECT '2016-01-16', '2016-01-20'
UNION ALL SELECT '2016-02-01', '2016-03-13'
UNION ALL SELECT '2016-05-01', '2016-05-30'
UNION ALL SELECT '2016-05-28', '2016-06-10';

WITH S AS (
 SELECT BeginDate, MAX(EndDate) AS EndDate -- Select the starting ranges
 FROM @Data D
 WHERE NOT EXISTS(SELECT * FROM @Data WHERE BeginDate < D.BeginDate AND EndDate >= D.BeginDate)
 GROUP BY BeginDate
UNION ALL
 SELECT S.BeginDate, D.EndDate -- recursively expand the ranges
 FROM S
 INNER JOIN @Data D ON D.BeginDate BETWEEN S.BeginDate AND S.EndDate
   AND D.EndDate > S.EndDate
)
SELECT BeginDate, MAX(EndDate)
FROM S
GROUP BY BeginDate;

In this example, I used a common table expression (CTE) for my test data. The work is done by the "S" CTE. It starts by selecting any date range that doesn't have another date range that overlaps its begin date, grouping by that begin date, to select the max end date (to eliminate any ranges that start on the same day). It then recursively joins back to the original data, bringing in any date range that overlaps and has an end date further out. In the final output, you select the BeginDate and MAX(EndDate) to get your distinct list of date ranges.

If you have additional columns you need to group by, you'll need to add them in. For example:

-- Declare a table variable for test data
DECLARE @Data TABLE
(
 ID   INT,
 BeginDate DATETIME,
 EndDate  DATETIME
);
-- Fill the test data
INSERT INTO @Data
SELECT 1, '2016-01-01', '2016-01-05'
UNION ALL SELECT 1, '2016-01-03', '2016-01-10'
UNION ALL SELECT 1, '2016-01-12', '2016-01-15'
UNION ALL SELECT 1, '2016-01-14', '2016-01-18'
UNION ALL SELECT 1, '2016-01-16', '2016-01-25'
UNION ALL SELECT 2, '2016-01-16', '2016-01-20'
UNION ALL SELECT 2, '2016-01-16', '2016-01-20'
UNION ALL SELECT 2, '2016-02-01', '2016-03-13'
UNION ALL SELECT 2, '2016-05-01', '2016-05-30'
UNION ALL SELECT 2, '2016-05-28', '2016-06-10';

WITH S AS (
 SELECT ID, BeginDate, MAX(EndDate) AS EndDate -- Select the starting ranges
 FROM @Data D
 WHERE NOT EXISTS(SELECT * FROM @Data WHERE ID = D.ID AND BeginDate < D.BeginDate AND EndDate >= D.BeginDate)
 GROUP BY ID, BeginDate
UNION ALL
 SELECT S.ID, S.BeginDate, D.EndDate -- recursively expand the ranges
 FROM S
 INNER JOIN @Data D ON S.ID = D.ID
   AND D.BeginDate BETWEEN S.BeginDate AND S.EndDate
   AND D.EndDate > S.EndDate
)
SELECT ID, BeginDate, MAX(EndDate)
FROM S
GROUP BY ID, BeginDate;