您好,登錄后才能下訂單哦!
這篇文章主要講解了“如何使用SQL窗口函數(shù)進(jìn)行商務(wù)數(shù)據(jù)分析”,文中的講解內(nèi)容簡(jiǎn)單清晰,易于學(xué)習(xí)與理解,下面請(qǐng)大家跟著小編的思路慢慢深入,一起來(lái)研究和學(xué)習(xí)“如何使用SQL窗口函數(shù)進(jìn)行商務(wù)數(shù)據(jù)分析”吧!
本文主要分析只涉及一張訂單表orders,操作過(guò)程在Hive中完成,具體數(shù)據(jù)如下:
-- 建表
CREATE TABLE orders(
order_id int,
customer_id string,
city string,
add_time string,
amount decimal(10,2));
-- 準(zhǔn)備數(shù)據(jù)
INSERT INTO orders VALUES
(1,"A","上海","2020-01-01 00:00:00.000000",200),
(2,"B","上海","2020-01-05 00:00:00.000000",250),
(3,"C","北京","2020-01-12 00:00:00.000000",200),
(4,"A","上海","2020-02-04 00:00:00.000000",400),
(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
(6,"C","北京","2020-02-19 00:00:00.000000",300),
(7,"A","上海","2020-03-01 00:00:00.000000",150),
(8,"E","北京","2020-03-05 00:00:00.000000",500),
(9,"F","上海","2020-03-09 00:00:00.000000",250),
(10,"B","上海","2020-03-21 00:00:00.000000",600);
在業(yè)務(wù)方面,第m1個(gè)月的收入增長(zhǎng)計(jì)算如下:100 *(m1-m0)/ m0
其中,m1是給定月份的收入,m0是上個(gè)月的收入。因此,從技術(shù)上講,我們需要找到每個(gè)月的收入,然后以某種方式將每個(gè)月的收入與上一個(gè)收入相關(guān)聯(lián),以便進(jìn)行上述計(jì)算。計(jì)算當(dāng)時(shí)如下:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入
FROM monthly_revenue
)
SELECT
month,
revenue,
prev_month_revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
結(jié)果輸出
month | revenue | prev_month_revenue | revenue_growth |
---|---|---|---|
2020-01-01 | 650 | NULL | NULL |
2020-02-01 | 1250 | 650 | 92.3 |
2020-03-01 | 1500 | 1250 | 20 |
我們還可以按照按城市分組進(jìn)行統(tǒng)計(jì),查看某個(gè)城市某個(gè)月份的收入增長(zhǎng)情況
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
city,
sum(amount) as revenue
FROM orders
GROUP BY 1,2
)
,prev_month_revenue as (
SELECT
month,
city,
revenue,
lag(revenue) over (partition by city order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1
結(jié)果輸出
month | city | revenue | revenue_growth |
---|---|---|---|
2020-01-01 | 上海 | 450 | NULL |
2020-02-01 | 上海 | 950 | 111.1 |
2020-03-01 | 上海 | 1000 | 5.3 |
2020-01-01 | 北京 | 200 | NULL |
2020-02-01 | 北京 | 300 | 50 |
2020-03-01 | 北京 | 500 | 66.7 |
累計(jì)匯總,即當(dāng)前元素和所有先前元素的總和,如下面的SQL:
WITH
monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1
結(jié)果輸出
month | revenue | running_total |
---|---|---|
2020-01-01 | 650 | 650 |
2020-02-01 | 1250 | 1900 |
2020-03-01 | 1500 | 3400 |
我們還可以使用下面的組合方式進(jìn)行分析,SQL如下:
SELECT
order_id,
customer_id,
city,
add_time,
amount,
sum(amount) over () as amount_total, -- 所有數(shù)據(jù)求和
sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- 累計(jì)求和
sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- 滾動(dòng)求平均
FROM orders
ORDER BY 1
結(jié)果輸出:
order_id | customer_id | city | add_time | amount | amount_total | running_sum | running_sum_by_customer | trailing_avg |
---|---|---|---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 3400 | 200 | 200 | 200 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 3400 | 450 | 250 | 225 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 3400 | 650 | 200 | 216.666667 |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 3400 | 1050 | 600 | 262.5 |
5 | D | 上海 | 2020-02-05 00:00:00.000000 | 250 | 3400 | 1300 | 250 | 260 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 3400 | 1600 | 550 | 266.666667 |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 3400 | 1900 | 500 | 283.333333 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3400 | 2050 | 750 | 266.666667 |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 3400 | 2550 | 500 | 316.666667 |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 3400 | 2800 | 250 | 291.666667 |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 3400 | 3400 | 850 |
從上面的數(shù)據(jù)可以看出,存在兩條重復(fù)的數(shù)據(jù)**(5,"D","上海","2020-02-05 00:00:00.000000",250), (5,"D","上海","2020-02-05 12:00:00.000000",300),**顯然需要對(duì)其進(jìn)行清洗去重,保留最新的一條數(shù)據(jù),SQL如下:
我們先進(jìn)行分組排名,然后保留最新的那條數(shù)據(jù)即可:
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
) t
WHERE rank=1
結(jié)果輸出:
t.order_id | t.customer_id | t.city | t.add_time | t.amount | t.rank |
---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 1 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 1 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 1 |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 1 |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 1 |
經(jīng)過(guò)上面的清洗過(guò)程,對(duì)數(shù)據(jù)進(jìn)行了去重。重新計(jì)算上面的需求1,正確SQL腳本為:
WITH
orders_cleaned as (
SELECT *
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
)
,monthly_revenue as (
SELECT
trunc(add_time,'MM') as month,
sum(amount) as revenue
FROM orders_cleaned
GROUP BY 1
)
,prev_month_revenue as (
SELECT
month,
revenue,
lag(revenue) over (order by month) as prev_month_revenue
FROM monthly_revenue
)
SELECT
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
結(jié)果輸出:
month | revenue | revenue_growth |
---|---|---|
2020-01-01 | 650 | NULL |
2020-02-01 | 1000 | 53.8 |
2020-03-01 | 1500 | 50 |
將清洗后的數(shù)據(jù)創(chuàng)建成視圖,方便以后使用
CREATE VIEW orders_cleaned AS
SELECT
order_id,
customer_id,
city,
add_time,
amount
FROM (
SELECT *,
row_number() over (partition by order_id order by add_time desc) as rank
FROM orders
)t
WHERE rank=1
分組取topN是最長(zhǎng)見(jiàn)的SQL窗口函數(shù)使用場(chǎng)景,下面的SQL是計(jì)算每個(gè)月份的top2訂單金額,如下:
WITH orders_ranked as (
SELECT
trunc(add_time,'MM') as month,
*,
row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
FROM orders_cleaned
)
SELECT
month,
order_id,
customer_id,
city,
add_time,
amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
下面的SQL計(jì)算重復(fù)購(gòu)買率:重復(fù)購(gòu)買的人數(shù)/總?cè)藬?shù)*100%以及第一筆訂單金額與第二筆訂單金額之間的典型差額:avg(第二筆訂單金額/第一筆訂單金額)
WITH customer_orders as (
SELECT *,
row_number() over (partition by customer_id order by add_time) as customer_order_n,
lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases,-- 重復(fù)購(gòu)買率
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- 重復(fù)購(gòu)買較上次購(gòu)買差異,第一筆訂單金額與第二筆訂單金額之間的典型差額
FROM customer_orders
結(jié)果輸出:
WITH結(jié)果輸出:
orders_cleaned.order_id | orders_cleaned.customer_id | orders_cleaned.city | orders_cleaned.add_time | orders_cleaned.amount | customer_order_n | prev_order_amount |
---|---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 | NULL |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 2 | 200 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3 | 400 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 | NULL |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 2 | 250 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 | NULL |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 2 | 200 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 | NULL |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 | NULL |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 |
最終結(jié)果輸出:
repeat_purchases | revenue_expansion |
---|---|
50 | 1.9666666666666668 |
感謝各位的閱讀,以上就是“如何使用SQL窗口函數(shù)進(jìn)行商務(wù)數(shù)據(jù)分析”的內(nèi)容了,經(jīng)過(guò)本文的學(xué)習(xí)后,相信大家對(duì)如何使用SQL窗口函數(shù)進(jìn)行商務(wù)數(shù)據(jù)分析這一問(wèn)題有了更深刻的體會(huì),具體使用情況還需要大家實(shí)踐驗(yàn)證。這里是億速云,小編將為大家推送更多相關(guān)知識(shí)點(diǎn)的文章,歡迎關(guān)注!
免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。