从小红书[数据宝典]中摘取的题目
Jul
最高峰同时直播人数
快手大数据工程师 SQL 题
题目背景: 有一张主播上下播的数据表,记录了直播平台的主播上播和下播时间,根据该数据表计算出平台最高峰同时直播人数。
mysql> show create table t_user\G;
*************************** 1. row ***************************
Table: t_user
Create Table: CREATE TABLE `t_user` (
`user_id` int NOT NULL COMMENT '主播id',
`start_time` datetime DEFAULT NULL COMMENT '上播时间',
`end_time` datetime DEFAULT NULL COMMENT '下播时间'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> select * from t_user;
+---------+---------------------+---------------------+
| user_id | start_time | end_time |
+---------+---------------------+---------------------+
| 1 | 2024-08-22 01:00:00 | 2024-08-22 02:01:05 |
| 2 | 2024-08-22 01:05:00 | 2024-08-22 02:03:18 |
| 3 | 2024-08-22 02:00:00 | 2024-08-22 04:03:22 |
| 4 | 2024-08-22 03:15:07 | 2024-08-22 04:33:21 |
| 5 | 2024-08-22 03:34:16 | 2024-08-22 06:10:45 |
| 6 | 2024-08-22 05:22:00 | 2024-08-22 07:01:08 |
| 7 | 2024-08-22 06:11:03 | 2024-08-22 09:26:05 |
| 3 | 2024-08-22 08:00:00 | 2024-08-22 12:34:27 |
| 1 | 2024-08-22 11:00:00 | 2024-08-22 16:03:18 |
| 8 | 2024-08-22 15:00:00 | 2024-08-22 17:01:05 |
+---------+---------------------+---------------------+
10 rows in set (0.01 sec)
解题思路
-
将上播时间和下播时间转换成事件列表,每个事件记录时间和类型。
-
对事件列表按时间进行排序。
-
遍历排序后的事件列表,计算每个时间点的同时在线人数并记录最高值。
1WITH events AS (
2 SELECT start_time AS event_time, 1 AS event_type
3 FROM t_user
4 UNION ALL
5 SELECT end_time AS event_time, -1 AS event_type
6 FROM t_user
7),
8sorted_events AS (
9 SELECT event_time, event_type
10 FROM events
11 ORDER BY event_time
12),
13running_count AS (
14 SELECT event_time,
15 SUM(event_type) over (ORDER by event_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS live_count
16 FROM sorted_events
17)
18SELECT MAX(live_count) AS peak_live_count
19FROM running_count;
逻辑解释:
- Common Table Expression (CTE)
- line1~12 将上播时间和下播时间转换成事件列表,记录每个事件的时间和类型(上播为 1,下播为 -1)。此时 sorted_events 内容如下:
mysql> WITH events AS (
SELECT start_time AS event_time, 1 AS event_type
FROM t_user
UNION ALL
SELECT end_time AS event_time, -1 AS event_type
FROM t_user
),
sorted_events AS (
SELECT event_time, event_type
FROM events
ORDER BY event_time
)
SELECT * FROM sorted_events;
+---------------------+------------+
| event_time | event_type |
+---------------------+------------+
| 2024-08-22 01:00:00 | 1 |
| 2024-08-22 01:05:00 | 1 |
| 2024-08-22 02:00:00 | 1 |
| 2024-08-22 02:01:05 | -1 |
| 2024-08-22 02:03:18 | -1 |
| 2024-08-22 03:15:07 | 1 |
| 2024-08-22 03:34:16 | 1 |
| 2024-08-22 04:03:22 | -1 |
| 2024-08-22 04:33:21 | -1 |
| 2024-08-22 05:22:00 | 1 |
| 2024-08-22 06:10:45 | -1 |
| 2024-08-22 06:11:03 | 1 |
| 2024-08-22 07:01:08 | -1 |
| 2024-08-22 08:00:00 | 1 |
| 2024-08-22 09:26:05 | -1 |
| 2024-08-22 11:00:00 | 1 |
| 2024-08-22 12:34:27 | -1 |
| 2024-08-22 15:00:00 | 1 |
| 2024-08-22 16:03:18 | -1 |
| 2024-08-22 17:01:05 | -1 |
+---------------------+------------+
-
line13~17 使用窗口函数
SUM
计算每个时间点的同时在线人数。SUM(event_type) over (ORDER by event_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
:对 event_type 列进行累积求和,按照 event_time 升序排序。ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
表示从窗口的开始到当前行。ROWS
:指定窗口范围是按行数来计算的BETWEEN ... AND ...
:定义窗口的起始和结束边界。UNBOUNDED PRECEDING
:表示从窗口的最开始位置(即第一行)开始。CURRENT ROW
:表示窗口结束在当前行。
此时 sorted_events 内容如下:
...
running_count AS (
SELECT event_time,
SUM(event_type) over (ORDER by event_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS live_count
FROM sorted_events
)
SELECT * FROM running_count;
+---------------------+------------+
| event_time | live_count |
+---------------------+------------+
| 2024-08-22 01:00:00 | 1 |
| 2024-08-22 01:05:00 | 2 |
| 2024-08-22 02:00:00 | 3 |
| 2024-08-22 02:01:05 | 2 |
| 2024-08-22 02:03:18 | 1 |
| 2024-08-22 03:15:07 | 2 |
| 2024-08-22 03:34:16 | 3 |
| 2024-08-22 04:03:22 | 2 |
| 2024-08-22 04:33:21 | 1 |
| 2024-08-22 05:22:00 | 2 |
| 2024-08-22 06:10:45 | 1 |
| 2024-08-22 06:11:03 | 2 |
| 2024-08-22 07:01:08 | 1 |
| 2024-08-22 08:00:00 | 2 |
| 2024-08-22 09:26:05 | 1 |
| 2024-08-22 11:00:00 | 2 |
| 2024-08-22 12:34:27 | 1 |
| 2024-08-22 15:00:00 | 2 |
| 2024-08-22 16:03:18 | 1 |
| 2024-08-22 17:01:05 | 0 |
+---------------------+------------+
在 SQL 中,窗口函数(window functions)允许你在查询结果的某个窗口范围内进行计算。
ROWS BETWEEN
子句用于定义这个窗口的范围。
用户行为轨迹
阿里巴巴大数据工程师面试 SQL 题
题目背景: 现有一张用户在各个地铁站进出的时间表和一张用户在商场扫码的行为表,其中商场扫码会存在多次取最新的数据。请查询并返回出用户整个行为轨迹列表?
mysql> show create table t_subway\G;
*************************** 1. row ***************************
Table: t_subway
Create Table: CREATE TABLE `t_subway` (
`user_id` varchar(5) DEFAULT NULL COMMENT '用户id',
`station_id` int DEFAULT NULL COMMENT '地铁站id',
`in_time` datetime DEFAULT NULL COMMENT '入站时间',
`out_time` datetime DEFAULT NULL COMMENT '出站时间'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> select * from t_subway;
+---------+------------+---------------------+---------------------+
| user_id | station_id | in_time | out_time |
+---------+------------+---------------------+---------------------+
| 001 | 1 | 2024-08-22 09:01:01 | NULL |
| 001 | 2 | NULL | 2024-08-22 09:21:08 |
| 001 | 3 | 2024-08-22 11:01:41 | NULL |
| 001 | 4 | NULL | 2024-08-22 11:23:12 |
| 001 | 4 | 2024-08-22 15:33:29 | NULL |
| 001 | 1 | NULL | 2024-08-22 15:45:41 |
+---------+------------+---------------------+---------------------+
mysql> show create table t_market\G;
*************************** 1. row ***************************
Table: t_market
Create Table: CREATE TABLE `t_market` (
`user_id` varchar(5) DEFAULT NULL COMMENT '用户id',
`market_id` varchar(5) DEFAULT NULL COMMENT '商场id',
`check_time` datetime DEFAULT NULL COMMENT '扫码时间'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> select * from t_market;
+---------+-----------+---------------------+
| user_id | market_id | check_time |
+---------+-----------+---------------------+
| 001 | 1001 | 2024-08-22 10:23:04 |
| 001 | 1001 | 2024-08-22 10:25:38 |
| 001 | 1002 | 2024-08-22 10:45:01 |
| 001 | 1004 | 2024-08-22 13:56:27 |
| 001 | 1003 | 2024-08-22 14:37:24 |
+---------+-----------+---------------------+
-- 期望输出
+---------+---------------------------------+
| user_id | place_list |
+---------+---------------------------------+
| 001 | 1,2,1001,1002,3,4,1004,1003,4,1 |
+---------+---------------------------------+
代码实现
-- 创建视图,将所有的地铁进出站和商场扫码记录合并。
CREATE VIEW combined_data AS
SELECT user_id, station_id AS place_id, in_time AS action_time FROM t_subway WHERE in_time IS NOT NULL
UNION ALL
SELECT user_id, station_id AS place_id, out_time AS action_time FROM t_subway WHERE out_time IS NOT NULL
UNION ALL
SELECT user_id, market_id AS place_id, MAX(check_time) AS action_time FROM t_market GROUP BY user_id, market_id;
-- 得到结果
SELECT user_id, GROUP_CONCAT(place_id) AS place_list
FROM (SELECT user_id, place_id, action_time FROM combined_data ORDER BY user_id, action_time) ordered_data
GROUP BY user_id;
解题思路
我们需要将用户在地铁站进出和商场扫码的行为数据合并在一起。并按照时间顺序排列,最终返回用户的行为轨迹。
Aug
求所有连续段的起始位置和结束位置
拼多多大数据工程师 SQL 题
题目背景
有一张表 t_id
记录了 id,id 不重复,但是会存在间断。求出所有连续段的起始位置和结束位置。
mysql> show create table t_id\G;
*************************** 1. row ***************************
Table: t_id
Create Table: CREATE TABLE `t_id` (
`id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> select * from t_id;
+------+
| id |
+------+
| 1 |
| 2 |
| 3 |
| 5 |
| 6 |
| 8 |
| 10 |
| 12 |
| 13 |
| 14 |
| 15 |
+------+
思路解析
- 使用
ROW_NUMBER()
窗口函数计算每一行的编号, 并通过id - ROW_NUMBER()
计算每一行与其编号的差值, 命名为gap
。这个gap
用于标识连续的id
段(对于连续的id
,这个差值gap
将保持不变)。
代码实现
1WITH t_id_with_gap AS (
2 SELECT
3 id,
4 id - ROW_NUMBER() OVER (ORDER BY id) AS gap
5 FROM t_id
6),
7groups_xxx AS (
8 SELECT
9 id,
10 MIN(id) OVER (PARTITION BY gap) AS start_id,
11 MAX(id) OVER (PARTITION BY gap) AS end_id
12 FROM t_id_with_gap
13)
14SELECT
15 DISTINCT start_id AS start,
16 end_id AS end
17FROM
18 groups_xxx
19ORDER BY start_id;
代码解析
- CTE group_xxx
PARTITION BY gap
:根据gap
值进行分区,即对于每一个gap
值,计算分区内的最小和最大id
。MIN(id) OVER (PARTITION BY gap) AS start_id
:在每个gap
分区内计算最小的id
,即每个连续id
范围的起始id
。MAX(id) OVER (PARTITION BY gap) AS end_id
:在每个gap
分区内计算最大的id
,即每个连续id
范围的结束id
。
groups_xxx AS (
SELECT
id,
MIN(id) OVER (PARTITION BY gap) AS start_id,
MAX(id) OVER (PARTITION BY gap) AS end_id
FROM t_id_with_gap
)
SELECT * FROM groups_xxx;
+------+----------+--------+
| id | start_id | end_id |
+------+----------+--------+
| 1 | 1 | 3 |
| 2 | 1 | 3 |
| 3 | 1 | 3 |
| 5 | 5 | 6 |
| 6 | 5 | 6 |
| 8 | 8 | 8 |
| 10 | 10 | 10 |
| 12 | 12 | 15 |
| 13 | 12 | 15 |
| 14 | 12 | 15 |
| 15 | 12 | 15 |
+------+----------+--------+
SELECT DISTINCT start_id AS start, end_id AS end
DISTINCT
:用于去除重复的行。确保每个(start_id, end_id)
对是唯一的,即每个连续id
范围只显示一次。