data skew hash join优化方法

Skew join optimization | Databricks on AWS

Skew Join Optimization in Hive. Handling skewed data with Hive UDF | by Ambrish Bhargava | Expedia Group Technology | Medium

data skew hash join问题是这样的:

解决办法可以从broadcast hash join找灵感:使用了broadcast join的话,那么其实左表不用做任何shuffle, 这样就不会出现单点压力了。但是其实我们只需要部分的broadcast hash join, 针对skewed key进行broadcast join就行,而对其他的key还是走shuffle join.

所以解决办法有点类似于partial broadcast join + partial shuffle join:


下面是实现方式

产生salt_table 首先产生一个 `salt_table`. 这个table针对 skewed value 进行复制

with A as (select 1),
salt_table as (
  select unnest as salt_name,  generate_series as salt_rand from A, unnest(['CHINA', 'UNITED STATES']), generate_series(0, 10)
) select * from salt_table;

以tpch.nation来看,输出结果就是

+---------------+-----------+
| salt_name     | salt_rand |
+---------------+-----------+
| CHINA         |         0 |
| CHINA         |         1 |
| CHINA         |         2 |
| CHINA         |         3 |
| CHINA         |         4 |
| CHINA         |         5 |
| CHINA         |         6 |
| CHINA         |         7 |
| CHINA         |         8 |
| CHINA         |         9 |
| CHINA         |        10 |
| UNITED STATES |         0 |
| UNITED STATES |         1 |
| UNITED STATES |         2 |
| UNITED STATES |         3 |
| UNITED STATES |         4 |
| UNITED STATES |         5 |
| UNITED STATES |         6 |
| UNITED STATES |         7 |
| UNITED STATES |         8 |
| UNITED STATES |         9 |
| UNITED STATES |        10 |
+---------------+-----------+

扩展右表 然后在salt_table基础上扩展右表:右表和salt_tabl做left join. 如果没有join上,那么产生固定的值,否则使用 `salt_rand`.

with A as (select 1),
salt_table as (
  select unnest as salt_name,  generate_series as salt_rand from A, unnest(['CHINA', 'UNITED STATES']), generate_series(0, 10)
),
right_table as (
  select *, ifnull(salt_rand, 0) as rand_col
    from nation left join salt_table
   on nation.n_name = salt_table.salt_name
) select * from right_table;

以tpch.nation来看,输出结果如下

+-------------+----------------+-------------+---------------+-----------+----------+
| n_nationkey | n_name         | n_regionkey | salt_name     | salt_rand | rand_col |
+-------------+----------------+-------------+---------------+-----------+----------+
|           0 | ALGERIA        |           0 | NULL          |      NULL |        0 |
|           1 | ARGENTINA      |           1 | NULL          |      NULL |        0 |
|           2 | BRAZIL         |           1 | NULL          |      NULL |        0 |
|           3 | CANADA         |           1 | NULL          |      NULL |        0 |
|           4 | EGYPT          |           4 | NULL          |      NULL |        0 |
|           5 | ETHIOPIA       |           0 | NULL          |      NULL |        0 |
|           6 | FRANCE         |           3 | NULL          |      NULL |        0 |
|           7 | GERMANY        |           3 | NULL          |      NULL |        0 |
|           8 | INDIA          |           2 | NULL          |      NULL |        0 |
|           9 | INDONESIA      |           2 | NULL          |      NULL |        0 |
|          10 | IRAN           |           4 | NULL          |      NULL |        0 |
|          11 | IRAQ           |           4 | NULL          |      NULL |        0 |
|          12 | JAPAN          |           2 | NULL          |      NULL |        0 |
|          13 | JORDAN         |           4 | NULL          |      NULL |        0 |
|          14 | KENYA          |           0 | NULL          |      NULL |        0 |
|          15 | MOROCCO        |           0 | NULL          |      NULL |        0 |
|          16 | MOZAMBIQUE     |           0 | NULL          |      NULL |        0 |
|          17 | PERU           |           1 | NULL          |      NULL |        0 |
|          18 | CHINA          |           2 | CHINA         |        10 |       10 |
|          18 | CHINA          |           2 | CHINA         |         9 |        9 |
|          18 | CHINA          |           2 | CHINA         |         8 |        8 |
|          18 | CHINA          |           2 | CHINA         |         7 |        7 |
|          18 | CHINA          |           2 | CHINA         |         6 |        6 |
|          18 | CHINA          |           2 | CHINA         |         5 |        5 |
|          18 | CHINA          |           2 | CHINA         |         4 |        4 |
|          18 | CHINA          |           2 | CHINA         |         3 |        3 |
|          18 | CHINA          |           2 | CHINA         |         2 |        2 |
|          18 | CHINA          |           2 | CHINA         |         1 |        1 |
|          18 | CHINA          |           2 | CHINA         |         0 |        0 |
|          19 | ROMANIA        |           3 | NULL          |      NULL |        0 |
|          20 | SAUDI ARABIA   |           4 | NULL          |      NULL |        0 |
|          21 | VIETNAM        |           2 | NULL          |      NULL |        0 |
|          22 | RUSSIA         |           3 | NULL          |      NULL |        0 |
|          23 | UNITED KINGDOM |           3 | NULL          |      NULL |        0 |
|          24 | UNITED STATES  |           1 | UNITED STATES |        10 |       10 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         9 |        9 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         8 |        8 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         7 |        7 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         6 |        6 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         5 |        5 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         4 |        4 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         3 |        3 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         2 |        2 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         1 |        1 |
|          24 | UNITED STATES  |           1 | UNITED STATES |         0 |        0 |
+-------------+----------------+-------------+---------------+-----------+----------+

扩展左表 skewed value随机产生,non skewed value使用0. 这样就可以和上面左表join上.

with left_table as (
select n_nationkey, n_name, n_regionkey, case when n_name in ('CHINA', 'UNITED STATES') then round(rand() * 10) else 0 end as rand_col from nation
)select * from left_table;

输出如下,因为CHINA和US只有一个所以只产生了一个值。如果有多个的话,那么rand_col是不同的。

+-------------+----------------+-------------+----------+
| n_nationkey | n_name         | n_regionkey | rand_col |
+-------------+----------------+-------------+----------+
|           0 | ALGERIA        |           0 |        0 |
|           1 | ARGENTINA      |           1 |        0 |
|           2 | BRAZIL         |           1 |        0 |
|           3 | CANADA         |           1 |        0 |
|           4 | EGYPT          |           4 |        0 |
|           5 | ETHIOPIA       |           0 |        0 |
|           6 | FRANCE         |           3 |        0 |
|           7 | GERMANY        |           3 |        0 |
|           8 | INDIA          |           2 |        0 |
|           9 | INDONESIA      |           2 |        0 |
|          10 | IRAN           |           4 |        0 |
|          11 | IRAQ           |           4 |        0 |
|          12 | JAPAN          |           2 |        0 |
|          13 | JORDAN         |           4 |        0 |
|          14 | KENYA          |           0 |        0 |
|          15 | MOROCCO        |           0 |        0 |
|          16 | MOZAMBIQUE     |           0 |        0 |
|          17 | PERU           |           1 |        0 |
|          18 | CHINA          |           2 |       10 |
|          19 | ROMANIA        |           3 |        0 |
|          20 | SAUDI ARABIA   |           4 |        0 |
|          21 | VIETNAM        |           2 |        0 |
|          22 | RUSSIA         |           3 |        0 |
|          23 | UNITED KINGDOM |           3 |        0 |
|          24 | UNITED STATES  |           1 |        7 |
+-------------+----------------+-------------+----------+