我有一个 3000w 行的数据表,我需要在这个表上统计从某一年开始新参保的人数,原数据库是 ORACLE ,用的是
AND A.AAC001 NOT EXISTS
(SELECT 1
FROM AC02_TEMP AS B WHERE A.AAC001 = B.AAC001 AND
B.AAC030 < '2018-01-01 00:00:00')
的语法,在 clickhouse 上我试了 LEFT JOIN 和 NOT IN ,性能均不理想
SELECT COUNT(1) AS "新参保人数"
FROM AC02_TEMP AS A
WHERE A.AAB301 IN (SELECT AAB301 FROM AA26 WHERE AAA148 = '130800')
AND A.AAE200 = '41'
AND A.AAC031 = '1'
AND A.AAC030 >= '2018-01-01 00:00:00'
AND A.AAC001 NOT IN
(SELECT B.AAC001
FROM AC02_TEMP AS B
WHERE B.AAC030 < '2018-01-01 00:00:00');
以下是 explain
CreatingSets (Create sets before main query execution)
Expression ((Projection + Before ORDER BY))
Aggregating
Expression (Before GROUP BY)
ReadFromMergeTree (default.AC02_TEMP)
Indexes:
PrimaryKey
Keys:
AAC001
AAE200
" Condition: and((AAC001 notIn 18692488-element set), (AAE200 in ['41', '41']))"
Parts: 2/2
Granules: 4821/4821
CreatingSet (Create set for subquery)
Expression ((Projection + Before ORDER BY))
ReadFromMergeTree (default.AA26)
Indexes:
PrimaryKey
Condition: true
Parts: 1/1
Granules: 1/1
我是 clickhouse 新手,目前没什么头绪,求大佬帮助 0.0
1
sunrealzhang OP NOT IN 里的数据大概有两千多万
|
2
sss15 2023-12-13 17:39:20 +08:00
这样子可以不?
SELECT COUNT( 1 ) AS "新参保人数" FROM AC02_TEMP AS A LEFT JOIN ( SELECT B.AAC001 AS TID FROM AC02_TEMP AS B WHERE B.AAC030 < '2018-01-01 00:00:00' ) AS t0 ON t0.AAC001 = A.AAC001 WHERE A.AAB301 IN ( SELECT AAB301 FROM AA26 WHERE AAA148 = '130800' ) AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 >= '2018-01-01 00:00:00' AND t0.TID IS NULL |
3
OOKAMI 2023-12-13 17:41:13 +08:00
用 UNION ALL 来做减法汇总试试?
SELECT SUM(CNT) FROM (SELECT COUNT(1) AS CNT FROM AC02_TEMP AS A WHERE A.AAB301 IN (SELECT AAB301 FROM AA26 WHERE AAA148 = '130800') AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 >= '2018-01-01 00:00:00' UNION ALL SELECT -1 * COUNT(1) AS CNT FROM AC02_TEMP AS A WHERE A.AAB301 IN (SELECT AAB301 FROM AA26 WHERE AAA148 = '130800') AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 >= '2018-01-01 00:00:00' AND EXISTS (SELECT FROM AC02_TEMP AS B WHERE B.AAC001 = A.AAC001 AND B.AAC030 < '2018-01-01 00:00:00')); |
4
sunrealzhang OP @sss15 感谢您的回复,首先连接处应该是 ON t0.TID = A.AAC001 ,然后,AAC001 是这张表的联合主键之一,类型为 Int64 ,未关联上时值不是 NULL 而是 0 (这一点我也不知道为什么要这么实现,按理说未关联上不应该是 NULL 吗),然后我执行了 该查询,依然很慢
``` EXPLAIN indexes = 1 SELECT COUNT( 1 ) AS "新参保人数" FROM AC02_TEMP AS A LEFT JOIN ( SELECT B.AAC001 AS TID FROM AC02_TEMP AS B WHERE B.AAC030 < '2018-01-01 00:00:00' ) AS t0 ON t0.TID = A.AAC001 WHERE A.AAB301 IN ( SELECT AAB301 FROM AA26 WHERE AAA148 = '130800' ) AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 >= '2018-01-01 00:00:00' AND t0.TID = 0; ``` 执行计划 ``` CreatingSets (Create sets before main query execution) Expression ((Projection + Before ORDER BY)) Aggregating Expression (Before GROUP BY) Filter (WHERE) Join (JOIN FillRightFirst) Filter (( + Before JOIN)) ReadFromMergeTree (default.AC02_TEMP) Indexes: PrimaryKey Keys: AAE200 " Condition: (AAE200 in ['41', '41'])" Parts: 2/2 Granules: 4821/4821 Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY)))) ReadFromMergeTree (default.AC02_TEMP) Indexes: PrimaryKey Condition: true Parts: 2/2 Granules: 4821/4821 CreatingSet (Create set for subquery) Expression ((Projection + Before ORDER BY)) ReadFromMergeTree (default.AA26) Indexes: PrimaryKey Condition: true Parts: 1/1 Granules: 1/1 ``` |
5
sunrealzhang OP @OOKAMI 感谢您的回复,实际上,clickhouse 对 EXISTS 语法不完全支持,特别是子查询中无法引用外部表和列,这会导致执行时提示 Missing columns: 'A.AAC001',这也是我将 oracle 中的 NOT EXISTS 改成 NOT IN 的原因
|
6
sunrealzhang OP 这个 sql 给我整不会了,开始感觉如果不对表结构和数据进行处理,无法通过这个数据库来满足我们的统计分析需求 0.0
|
7
9yu 2023-12-13 18:05:58 +08:00 via iPhone
不是很懂 SQL 但是楼主的回复都很有礼貌和条理。我绝对想和楼主这样认真的人做同事。
|
8
OOKAMI 2023-12-13 18:07:15 +08:00
不懂 clickhouse ,硬要一个 SQL 出来结果的话,这个行不行,按日期分两段去重查 KEY ,再统计,最后去除重复的,这样也没有大数据集匹配
SELECT COUNT(1) FROM (SELECT AAC001, COUNT(1) AS cnt FROM (SELECT DISTINCT AAC001 FROM AC02_TEMP AS A WHERE A.AAB301 IN (SELECT AAB301 FROM AA26 WHERE AAA148 = '130800') AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 >= '2018-01-01 00:00:00' UNION ALL SELECT DISTINCT AAC001 FROM AC02_TEMP AS A WHERE A.AAB301 IN (SELECT AAB301 FROM AA26 WHERE AAA148 = '130800') AND A.AAE200 = '41' AND A.AAC031 = '1' AND A.AAC030 < '2018-01-01 00:00:00')) GROUP BY AAC001) WHERE CNT = 1; |
9
hicdn 2023-12-13 19:11:17 +08:00 via Android
看场景,你是要一次性导出数据还是在业务系统里需要动态查询?
如果是一次性导出数据,直接 dump 整表,然后在 python 里用 set 加载,判断 in |
10
Terry166 2023-12-13 19:45:37 +08:00
In Oracle we can only put up to 1000 values into an IN clause.
Oracle In 字句最多只能包含 1000 个值,否则影响性能。 解决方案: 1 ,用 subquery 把数据集分开处理; 2 ,创建临时表或者物理表来存储中间数据集; 3 ,用 join 来替代 in |
11
512357301 2023-12-13 20:14:32 +08:00 via Android
ck 的 join 是大短板,更何况是子查询。。。
子查询换成 join 试试? |
12
Maboroshii 2023-12-13 21:15:37 +08:00
离线查出所有人的首次参保时间, 然后新增数据的时候,也新增到这个首次参保时间表? 然后直接查这个首次参保时间就可以了。
|
13
sunrealzhang OP @hicdn 动态查询,我们准备把 oracle 的部分可能会导致高负载的查询统计 sql 转移到统计分析类数据库 clickhouse 上
|
14
sunrealzhang OP @512357301 感谢您的回复,我也尝试了 LEFT JOIN 语法,基本等同于 2 楼的方案,性能没有提升
|
15
sunrealzhang OP @9yu 感谢您的回复,实际上在 Oracle 上我们使用的 NOT EXISTS ,我想在 clickhouse 上,我们需要在数据同步时对数据本身进行额外的预处理来缓存首次参保状态,原封不动的使用 clickhouse 来支持完成业务需求是我们的一厢情愿 0.0
|
16
sunrealzhang OP @9yu 谢谢嗷 0.0
|
17
sunrealzhang OP @Maboroshii 感谢回复,我想大概也得这样,同步数据时需要对数据进行额外处理,在 clickhouse 上缓存首次参保状态
|
18
sunrealzhang OP @OOKAMI 感谢您的回复,我仔细看了下,语义应该和我的需求不符,如果某个 AAC001 未命中第一段 join 而只命中了第二段 join ,依然能被查出来,而它并不是 2018 年开始新参保,而是 2018 年之前有过参保记录。
|
19
sunrealzhang OP @sunrealzhang 说错了,是 UNION..
|
20
512357301 2023-12-15 08:23:16 +08:00 via Android
@sunrealzhang 类型为 Int64 ,未关联上时值不是 NULL 而是 0 (这一点我也不知道为什么要这么实现,按理说未关联上不应该是 NULL 吗)
这是 ck 的特性,未关联上的,会根据字段类型返回默认值,int 是 0 ,string 是'',其他的没研究 |
21
dyv9 307 天前 via Android
做报表分析就不该实时查询,先建表,将首次参保读取保存,可分段构建这个表,比如跑循环一年一年地构建。 数据仓库 dtl 软件像 pentaho kettle 可支持这种场景,还可调度任务定时处理。做报表就要放弃直接在实时查询,一定要分步骤加工数据到基本维度表,然后叠加。
|