星期五, 3月 06, 2015

[SQL] 重疊天數分析

論壇問題
希望計算每個人在日期區間內的實際天數有幾天,但會遇到日期重複的問題,不知道該如何處理
IF OBJECT_ID('Datas') IS NOT NULL 
    DROP TABLE Datas

CREATE TABLE Datas 
(
    ID char(1),
    StartDate date,
    EndDate date
)

INSERT INTO Datas VALUES
    ('A','20140101','20140115'),
    ('A','20140114','20140117'),
    ('B','20140215','20140220'),
    ('B','20140220','20140225'),
    ('C','20140301','20140305'),
    ('C','20140320','20140321');
  • 原始資料
[SQL] 重疊天數分析-1
  • 呈現結果
[SQL] 重疊天數分析-3

傳統作法
;
WITH cteStartDate AS
(
    SELECT 
        DISTINCT ID, startdate
    FROM Datas AS S1
    WHERE NOT EXISTS
    (
        SELECT * 
        FROM Datas AS S2 
        WHERE S2.ID = S1.ID
            AND S2.startdate < S1.startdate
            AND S2.enddate >= S1.StartDate
    )
)
,
cteEndDate AS
(
    SELECT DISTINCT ID, enddate
    FROM Datas AS S1
    WHERE NOT EXISTS
    (
        SELECT * 
        FROM Datas AS S2
        WHERE S2.ID = S1.ID
            AND S2.enddate > S1.enddate
            AND S2.startdate <= S1.enddate)
)
SELECT
    ID,
    startdate,
    (
        SELECT MIN(enddate)
        FROM cteEndDate AS E
        WHERE E.ID = S.ID
            AND enddate >= startdate
    ) AS enddate
FROM cteStartDate AS S


視窗函數作法

該做法只適用於 SQL Server 2012 以上版本
; 
WITH cteDate AS
(
    SELECT 
        ID , 
        StartDate AS dates ,
        1 AS type ,
        1 AS sub
    FROM Datas
    UNION ALL
    SELECT
        ID ,
        EndDate ,
        -1 ,
        0
    FROM Datas
)
,
cteRunningTotal AS
(
    SELECT 
        * ,
        SUM(type) OVER 
            (
                PARTITION BY ID 
                ORDER BY Dates  , Type desc 
                ROWS BETWEEN UNBOUNDed preceding AND current Row 
            ) - sub AS cnt
    FROM cteDate 
)
,
cteGrp AS
(
    SELECT * ,
        ((ROW_NUMBER() OVER (PARTITION BY ID , cnt ORDER BY Dates) - 1) / 2) + 1 AS Grp
    FROM cteRunningTotal
    WHERE Cnt = 0
)
SELECT 
    ID , 
    MIN(dates) AS StartDate , 
    MAX(dates) AS EndDate
FROM cteGrp
GROUP BY ID , Grp
簡易 Step By Step 分析

Step1:cteStartDate
把起始日期和結束日期放在同一個欄位,並新增
  • Type 欄位:起始日期該值為 1、結束日期該值為 -1
  • GrpStartCheck 欄位:該欄用是用來判斷資料群組的起始日期是哪一筆
[SQL] 重疊天數分析-4

Step2:cteRiunningTotal
利用 SQL Server 2012 開始提供的功能,把 Type 欄位進行"累計加總",可以發現
  • 當累計加總為資料群組終點,RunningTotal 欄位累計加總為 0
  • 因為每個資料區間只會有一筆起始日期,每個資料區間 RunningTotal 的第一筆起始資料必定為 1,當該資料減去 GrpStartCheck 時,資料區間的起始和結束就都是 cnt = 0
[SQL] 重疊天數分析-5

Step3:cteGrp
cnt 為 0 的資料代表資料區間的起始和結束,再加上 ROW_NUMBER() 來進行資料分組,每兩筆資料為一個群組

[SQL] 重疊天數分析-6

Step4:對資料進行轉置
針對 ID 和 Grp 欄位進行群組,並利用MIX() 和 MAX() 來找出起始和結束日期,就可以得到結果

[SQL] 重疊天數分析-3

沒有留言:

張貼留言