分析PostgreSQL中的大表连接
这篇文章主要介绍“分析PostgreSQL中的大表连接”,在日常操作中,相信很多人在分析PostgreSQL中的大表连接问题上存在疑惑,小编查阅了各式资料,整理出简单好用的操作方法,希望对大家解答”分析PostgreSQL中的大表连接”的疑惑有所帮助!接下来,请跟着小编一起来学习吧!
让客户满意是我们工作的目标,不断超越客户的期望值来自于我们对这个行业的热爱。我们立志把好的技术通过有效、简单的方式提供给客户,将通过不懈努力成为客户在信息化领域值得信任、有价值的长期合作伙伴,公司提供的服务项目有:域名与空间、网络空间、营销软件、网站建设、鄞州网站维护、网站推广。
数据库配置
主机CPU 4核,内存4G,PG共享缓存128MB,work_mem 4MB。
测试数据
创建4张表,每张表1000w行,数据量约1G,是PG共享内存的8倍。
drop table t_big_1; drop table t_big_2; drop table t_big_3; drop table t_big_4; create table t_big_1(id int,c1 varchar(30),c2 varchar(30),c3 varchar(30)); create table t_big_2(id int,c1 varchar(30),c2 varchar(30),c3 varchar(30)); create table t_big_3(id int,c1 varchar(30),c2 varchar(30),c3 varchar(30)); create table t_big_4(id int,c1 varchar(30),c2 varchar(30),c3 varchar(30)); insert into t_big_1 select x,rpad('c1'||x,30,'c1'),rpad('c2'||x,30,'c2'),rpad('c3'||x,30,'c3') from generate_series(1,10000000) as x; insert into t_big_2 select x,rpad('c1'||x,30,'c1'),rpad('c2'||x,30,'c2'),rpad('c3'||x,30,'c3') from generate_series(1,10000000) as x; insert into t_big_3 select x,rpad('c1'||x,30,'c1'),rpad('c2'||x,30,'c2'),rpad('c3'||x,30,'c3') from generate_series(1,10000000) as x; insert into t_big_4 select x,rpad('c1'||x,30,'c1'),rpad('c2'||x,30,'c2'),rpad('c3'||x,30,'c3') from generate_series(1,10000000) as x; show shared_buffers; show effective_cache_size; show work_mem; select pg_size_pretty(pg_table_size('t_big_1')); select pg_size_pretty(pg_table_size('t_big_2')); select pg_size_pretty(pg_table_size('t_big_3')); select pg_size_pretty(pg_table_size('t_big_4')); analyze t_big_1,t_big_2,t_big_3,t_big_4; explain verbose select a.* from t_big_1 a join t_big_2 b on a.c1 = b.c1; explain verbose select a.id,b.c1,c.c2,d.c3 from t_big_1 a,t_big_2 b,t_big_3 c,t_big_4 d where a.id = b.id and b.id = c.id and c.id = d.id; explain verbose select a.id,b.c1,c.c2,d.c3 from t_big_1 a,t_big_2 b,t_big_3 c,t_big_4 d where a.id = b.id and b.c1 = c.c1 and c.c2 = d.c2;
大表连接
未分析数据表前
[local:/data/run/pg12]:5120 pg12@testdb=# explain verbose pg12@testdb-# select a.id,b.c1,c.c2,d.c3 pg12@testdb-# from t_big_1 a,t_big_2 b,t_big_3 c,t_big_4 d pg12@testdb-# where a.id = b.id and b.c1 = c.c1 and c.c2 = d.c2; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Merge Join (cost=164722831406.26..1096915306139605248.00 rows=73127676034285903872 width=238) Output: a.id, b.c1, c.c2, d.c3 Merge Cond: ((b.c1)::text = (c.c1)::text) -> Sort (cost=58799667920.13..59102008117.66 rows=120936079012 width=82) Output: a.id, b.c1 Sort Key: b.c1 -> Merge Join (cost=2124653.55..1816202724.10 rows=120936079012 width=82) Output: a.id, b.c1 Merge Cond: (a.id = b.id) -> Sort (cost=894232.27..906527.40 rows=4918050 width=4) Output: a.id Sort Key: a.id -> Seq Scan on public.t_big_1 a (cost=0.00..213115.50 rows=4918050 width=4) Output: a.id -> Materialize (cost=1230421.27..1255011.52 rows=4918050 width=82) Output: b.c1, b.id -> Sort (cost=1230421.27..1242716.40 rows=4918050 width=82) Output: b.c1, b.id Sort Key: b.id -> Seq Scan on public.t_big_2 b (cost=0.00..213115.50 rows=4918050 width=82) Output: b.c1, b.id -> Materialize (cost=105923163486.13..106527843881.19 rows=120936079012 width=234) Output: c.c2, c.c1, d.c3 -> Sort (cost=105923163486.13..106225503683.66 rows=120936079012 width=234) Output: c.c2, c.c1, d.c3 Sort Key: c.c1 -> Merge Join (cost=3066006.55..1817144077.10 rows=120936079012 width=234) Output: c.c2, c.c1, d.c3 Merge Cond: ((c.c2)::text = (d.c2)::text) -> Sort (cost=1533003.27..1545298.40 rows=4918050 width=156) Output: c.c2, c.c1 Sort Key: c.c2 -> Seq Scan on public.t_big_3 c (cost=0.00..213115.50 rows=4918050 width=156) Output: c.c2, c.c1 -> Materialize (cost=1533003.27..1557593.52 rows=4918050 width=156) Output: d.c3, d.c2 -> Sort (cost=1533003.27..1545298.40 rows=4918050 width=156) Output: d.c3, d.c2 Sort Key: d.c2 -> Seq Scan on public.t_big_4 d (cost=0.00..213115.50 rows=4918050 width=156) Output: d.c3, d.c2 (41 rows)
可以看到,未分析前,执行计划使用merge join,计划的cost是一个大数。
执行分析后
[local:/data/run/pg12]:5120 pg12@testdb=# explain (analyze,buffers,verbose) select a.id,b.c1,c.c2,d.c3 from t_big_1 a,t_big_2 b,t_big_3 c,t_big_4 d where a.id = b.id and b.c1 = c.c1 and c.c2 = d.c2; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=896126.19..2564935.91 rows=9999844 width=97) (actual time=393803.655..404902.025 rows=10000000 loops=1) Output: a.id, b.c1, c.c2, d.c3 Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=266 read=655676 dirtied=486717 written=486414, temp read=473954 written=486068 -> Parallel Hash Join (cost=895126.19..1563951.51 rows=4166602 width=97) (actual time=393672.896..398825.027 rows=3333333 loops=3) Output: a.id, b.c1, c.c2, d.c3 Hash Cond: ((c.c2)::text = (d.c2)::text) Buffers: shared hit=266 read=655676 dirtied=486717 written=486414, temp read=473954 written=486068 Worker 0: actual time=393629.565..399028.498 rows=3549817 loops=1 Buffers: shared hit=118 read=218079 dirtied=161599 written=161495, temp read=162307 written=161880 Worker 1: actual time=393585.994..399049.295 rows=3609509 loops=1 Buffers: shared hit=119 read=217313 dirtied=161014 written=160913, temp read=163324 written=160736 -> Parallel Hash Join (cost=592683.65..1070481.02 rows=4166681 width=66) (actual time=328335.871..378143.916 rows=3333333 loops=3) Output: a.id, b.c1, c.c2 Hash Cond: ((b.c1)::text = (c.c1)::text) Buffers: shared hit=63 read=491773 dirtied=352782 written=352575, temp read=267125 written=274312 Worker 0: actual time=328475.430..378240.528 rows=3325497 loops=1 Buffers: shared hit=25 read=164024 dirtied=117445 written=117373, temp read=88941 written=91448 Worker 1: actual time=328084.038..377943.176 rows=3311112 loops=1 Buffers: shared hit=29 read=163900 dirtied=117550 written=117481, temp read=88747 written=91320 -> Parallel Hash Join (cost=290238.33..609558.42 rows=4166681 width=35) (actual time=158380.042..198763.345 rows=3333333 loops=3) Output: a.id, b.c1 Hash Cond: (a.id = b.id) Buffers: shared hit=63 read=327838 dirtied=218847 written=218710, temp read=98317 written=100856 Worker 0: actual time=158518.764..199077.411 rows=3331104 loops=1 Buffers: shared hit=25 read=109394 dirtied=72893 written=72845, temp read=32790 written=33668 Worker 1: actual time=158520.409..198920.394 rows=3332824 loops=1 Buffers: shared hit=29 read=109323 dirtied=73002 written=72956, temp read=32934 written=33560 -> Parallel Seq Scan on public.t_big_1 a (cost=0.00..205601.81 rows=4166681 width=4) (actual time=239.830..75704.152 rows=3333333 loops=3) Output: a.id Buffers: shared read=163935 dirtied=109449 written=109391 Worker 0: actual time=239.584..75677.703 rows=3327794 loops=1 Buffers: shared read=54554 dirtied=36489 written=36468 Worker 1: actual time=240.355..75258.837 rows=3347802 loops=1 Buffers: shared read=54882 dirtied=36486 written=36467 -> Parallel Hash (cost=205601.81..205601.81 rows=4166681 width=35) (actual time=65812.428..65812.431 rows=3333333 loops=3) Output: b.c1, b.id Buckets: 65536 Batches: 256 Memory Usage: 3328kB Buffers: shared hit=32 read=163903 dirtied=109398 written=109319, temp written=70136 Worker 0: actual time=65812.900..65812.904 rows=3345876 loops=1 Buffers: shared hit=11 read=54840 dirtied=36404 written=36377, temp written=23428 Worker 1: actual time=65812.873..65812.875 rows=3321816 loops=1 Buffers: shared hit=15 read=54441 dirtied=36516 written=36489, temp written=23320 -> Parallel Seq Scan on public.t_big_2 b (cost=0.00..205601.81 rows=4166681 width=35) (actual time=1.490..47839.237 rows=3333333 loops=3) Output: b.c1, b.id Buffers: shared hit=32 read=163903 dirtied=109398 written=109319 Worker 0: actual time=1.464..47814.446 rows=3345876 loops=1 Buffers: shared hit=11 read=54840 dirtied=36404 written=36377 Worker 1: actual time=1.470..47104.413 rows=3321816 loops=1 Buffers: shared hit=15 read=54441 dirtied=36516 written=36489 -> Parallel Hash (cost=205601.81..205601.81 rows=4166681 width=62) (actual time=113720.080..113720.080 rows=3333333 loops=3) Output: c.c2, c.c1 Buckets: 65536 Batches: 512 Memory Usage: 2432kB Buffers: shared read=163935 dirtied=133935 written=133865, temp written=103856 Worker 0: actual time=113719.124..113719.124 rows=3332395 loops=1 Buffers: shared read=54630 dirtied=44552 written=44528, temp written=34648 Worker 1: actual time=113720.557..113720.558 rows=3329197 loops=1 Buffers: shared read=54577 dirtied=44548 written=44525, temp written=34576 -> Parallel Seq Scan on public.t_big_3 c (cost=0.00..205601.81 rows=4166681 width=62) (actual time=0.126..80608.068 rows=3333333 loops=3) Output: c.c2, c.c1 Buffers: shared read=163935 dirtied=133935 written=133865 Worker 0: actual time=0.260..80737.065 rows=3332395 loops=1 Buffers: shared read=54630 dirtied=44552 written=44528 Worker 1: actual time=0.049..80943.448 rows=3329197 loops=1 Buffers: shared read=54577 dirtied=44548 written=44525 -> Parallel Hash (cost=205601.02..205601.02 rows=4166602 width=62) (actual time=10279.722..10279.722 rows=3333333 loops=3) Output: d.c3, d.c2 Buckets: 65536 Batches: 512 Memory Usage: 2400kB Buffers: shared hit=32 read=163903 dirtied=133935 written=133839, temp written=103004 Worker 0: actual time=10222.812..10222.812 rows=3297904 loops=1 Buffers: shared hit=9 read=54055 dirtied=44154 written=44122, temp written=34236 Worker 1: actual time=10222.839..10222.839 rows=3258559 loops=1 Buffers: shared hit=6 read=53413 dirtied=43464 written=43432, temp written=33504 -> Parallel Seq Scan on public.t_big_4 d (cost=0.00..205601.02 rows=4166602 width=62) (actual time=0.163..7282.409 rows=3333333 loops=3) Output: d.c3, d.c2 Buffers: shared hit=32 read=163903 dirtied=133935 written=133839 Worker 0: actual time=0.108..7244.071 rows=3297904 loops=1 Buffers: shared hit=9 read=54055 dirtied=44154 written=44122 Worker 1: actual time=0.034..7223.191 rows=3258559 loops=1 Buffers: shared hit=6 read=53413 dirtied=43464 written=43432 Planning Time: 1.134 ms Execution Time: 405878.841 ms (83 rows) [local:/data/run/pg12]:5120 pg12@testdb=#
可以看到,执行计划中的成本回归一个正常的数值,算法使用Hash Join。由于内存不足,PG把数据拆分为N份,使用临时表来临时缓存Hash Table,使用不同的Batch来执行Join。
到此,关于“分析PostgreSQL中的大表连接”的学习就结束了,希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习,快去试试吧!若想继续学习更多相关知识,请继续关注创新互联网站,小编会继续努力为大家带来更多实用的文章!
本文标题:分析PostgreSQL中的大表连接
网站地址:http://myzitong.com/article/ihcdgp.html