Spark SQL: NULL Semantics
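- 1. Null handling in comparison operators
- 2. Null handling in logical operators
- 3. Null handling in expressions
- 3.1 Null handling in null-intolerant expressions
- 3.2 Null handling in expressions that can process NULL operands
- 3.3 Null handling in built-in aggregate expressions
- 4. Null handling in condition expressions of WHERE, HAVING and JOIN clauses
- 5. Null handling in GROUP BY and DISTINCT
- 6. Null handling in ORDER BY
- 7. Null handling in UNION, INTERSECT, EXCEPT
- 8. Null handling in EXISTS and NOT EXISTS subqueries
- 9. Null handling in IN and NOT IN subqueries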
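A table consists of a set of rows, and each row contains a set of columns. A column is associated with a data type and represents a specific attribute of an entity (for example, age is a column of an entity called person). Sometimes the value of a column for a particular row is not known when the row comes into existence. In SQL, such values are represented as NULL. This section details the semantics of NULL handling in various operators, expressions, and other SQL constructs.
The following shows the schema layout and data of a table named person. The data contains NULL values in the age column, and the table is used in the examples throughout the sections below.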
TABLE: person
Id | Name | Age |
---|---|---|
100 | Joe | 30 |
200 | Marry | NULL |
300 | Mike | 18 |
400 | Fred | 50 |
500 | Albert | NULL |
600 | Michelle | 30 |
700 | Dan | 50 |
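1. Null handling in comparison operators
Apache Spark supports the standard comparison operators such as ">", ">=", "=", "<" and "<=". When one or both operands are unknown or NULL, the result of these operators is unknown, that is, NULL. To compare NULL values for equality, Spark provides a null-safe equal operator ("<=>"), which returns False when exactly one operand is NULL and True when both operands are NULL. The following table shows how the comparison operators behave when one or both operands are NULL: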
Left Operand | Right Operand | > | >= | = | < | <= | <=> |
---|---|---|---|---|---|---|---|
NULL | Any value | NULL | NULL | NULL | NULL | NULL | False |
Any value | NULL | NULL | NULL | NULL | NULL | NULL | False |
NULL | NULL | NULL | NULL | NULL | NULL | NULL | True |
Examples:
-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- The null-safe equal operator returns `False` when one of the operands is `NULL`.
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- The null-safe equal operator returns `True` when both operands are `NULL`.
SELECT NULL <=> NULL AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+
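The difference between "=" and "<=>" can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Spark code: `None` stands in for NULL, and the helper names `sql_eq` and `null_safe_eq` are hypothetical.

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """Model of `=`: the result is unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: Any, b: Any) -> bool:
    """Model of `<=>`: always yields a definite True or False."""
    if a is None or b is None:
        # True only when both sides are NULL.
        return a is None and b is None
    return a == b


print(sql_eq(5, None))           # None
print(sql_eq(None, None))        # None
print(null_safe_eq(5, None))     # False
print(null_safe_eq(None, None))  # True
```

The other comparison operators (">", ">=", "<", "<=") follow the same pattern as `sql_eq`: a NULL on either side short-circuits to NULL.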
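2. Null handling in logical operators
Spark supports the standard logical operators AND, OR and NOT. These operators take Boolean expressions as arguments and return a Boolean value, or NULL when the result is unknown.
The following tables show how the logical operators behave when one or both operands are NULL.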
Left Operand | Right Operand | OR | AND |
---|---|---|---|
True | NULL | True | NULL |
False | NULL | NULL | False |
NULL | True | True | NULL |
NULL | False | NULL | False |
NULL | NULL | NULL | NULL |
operand | NOT |
---|---|
NULL | NULL |
Examples:
-- `OR` returns `true` when one operand is `true`, even if the other is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- `OR` returns `NULL` when one operand is `NULL` and the other is `false`.
SELECT (null OR false) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- `NOT` returns `NULL` when its operand is `NULL`.
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+
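The truth tables above amount to three-valued logic. The following plain-Python sketch reproduces them, with `None` standing in for NULL; the function names `sql_and`, `sql_or` and `sql_not` are hypothetical and not part of any Spark API.

```python
from typing import Optional

Tri = Optional[bool]  # True, False, or None (unknown)


def sql_and(a: Tri, b: Tri) -> Tri:
    """AND: a definite False on either side decides the result."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_or(a: Tri, b: Tri) -> Tri:
    """OR: a definite True on either side decides the result."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_not(a: Tri) -> Tri:
    """NOT: unknown stays unknown."""
    return None if a is None else not a


print(sql_or(True, None))    # True
print(sql_or(None, False))   # None
print(sql_and(False, None))  # False
print(sql_not(None))         # None
```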
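3. Null handling in expressions
Comparison operators and logical operators are treated as expressions in Spark. Besides these two kinds, Spark supports other forms of expressions, such as function expressions and cast expressions. Expressions in Spark can be broadly classified as:
- Null-intolerant expressions
- Expressions that can process NULL operands; the result of these expressions depends on the expression itself.
3.1 Null handling in null-intolerant expressions
Null-intolerant expressions return NULL when one or more of their arguments are NULL. Most expressions fall into this category.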
Examples:
SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+
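One way to picture a null-intolerant expression is as an ordinary function wrapped in a guard that returns NULL as soon as any argument is NULL. The sketch below is plain Python, with `None` for NULL; the decorator `null_intolerant` and the toy `concat` are hypothetical illustrations, not Spark's implementation.

```python
import functools


def null_intolerant(fn):
    """Wrap fn so that any None argument makes the whole result None."""
    @functools.wraps(fn)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper


@null_intolerant
def concat(*parts: str) -> str:
    # Only reached when every argument is non-NULL.
    return "".join(parts)


print(concat("John", None))    # None
print(concat("John", " Doe"))  # John Doe
```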
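3.2 Null handling in expressions that can process NULL operands
Expressions of this class are designed to handle NULL values, and their result depends on the expression itself. For example, the function expression isnull returns true on NULL input and false on non-NULL input, whereas coalesce returns the first non-NULL value in its list of operands and returns NULL only when all of its operands are NULL. Below is an incomplete list of expressions in this category.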
- COALESCE
- NULLIF
- IFNULL
- NVL
- NVL2
- ISNAN
- NANVL
- ISNULL
- ISNOTNULL
- ATLEASTNNONNULLS
- IN
Examples:
SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of a non-`NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`.
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+
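The behaviour of a few of these functions can be sketched in plain Python, with `None` standing in for NULL. The functions below are simplified models that borrow the SQL names; they are assumptions for illustration, not Spark's own code, and they ignore type coercion.

```python
def isnull(value) -> bool:
    """True for NULL input, False otherwise; never returns NULL itself."""
    return value is None


def coalesce(*values):
    """First non-NULL operand, or NULL when every operand is NULL."""
    for value in values:
        if value is not None:
            return value
    return None


def nvl2(expr1, expr2, expr3):
    """expr2 when expr1 is not NULL, otherwise expr3."""
    return expr2 if expr1 is not None else expr3


def nullif(a, b):
    """NULL when a equals b (both non-NULL), otherwise a."""
    if a is not None and b is not None and a == b:
        return None
    return a


print(isnull(None))                    # True
print(coalesce(None, None, 3, None))   # 3
print(coalesce(None, None))            # None
print(nvl2(None, "set", "missing"))    # missing
print(nullif(5, 5))                    # None
```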
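3.3 Null handling in built-in aggregate expressions
Aggregate functions compute a single result by processing a set of input rows. The rules for how aggregate functions handle NULL values are as follows.
- NULL values are ignored by all aggregate functions during processing.
- The only exception to this rule is the COUNT(*) function.
- Some aggregate functions return NULL when all input values are NULL or the input data set is empty. These functions are: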
- MAX
- MIN
- SUM
- AVG
- EVERY
- ANY
- SOME
Examples:
-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person WHERE 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of the maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person WHERE 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+
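The aggregate rules can be sketched in plain Python over the age column of the person table, with `None` for NULL. The helper names (`count_star`, `count`, `sql_max`, `sql_avg`) are hypothetical, and the code only models the NULL handling, not Spark's execution.

```python
ages = [30, None, 18, 50, None, 30, 50]  # age column of the person table


def count_star(rows):
    """COUNT(*): counts rows, NULL or not."""
    return len(rows)


def count(values):
    """COUNT(col): skips NULL values."""
    return sum(1 for v in values if v is not None)


def sql_max(values):
    """MAX: ignores NULLs; NULL when nothing is left to aggregate."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


def sql_avg(values):
    """AVG: sum and count are both taken over non-NULL values only."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None


print(count_star(ages))  # 7
print(count(ages))       # 5
print(sql_max(ages))     # 50
print(sql_avg(ages))     # 35.6
print(count_star([]))    # 0
print(sql_max([]))       # None
```

Note that the average is 178 / 5, not 178 / 7: the two NULL ages are left out of the denominator as well as the numerator.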
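4. Null handling in condition expressions of WHERE, HAVING and JOIN clauses
The WHERE and HAVING operators filter rows based on a user-specified condition. The JOIN operator combines rows from two tables based on a join condition. For all three operators, the condition expression is a Boolean expression that can return True, False or Unknown (NULL). A condition is satisfied only when it evaluates to True; rows for which it evaluates to False or NULL are filtered out.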
Examples:
-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- The `IS NULL` expression is used in a disjunction to also select the persons
-- with unknown (`NULL`) age.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- The group with unknown (`NULL`) age is filtered out, because
-- `max(age) > 18` evaluates to `NULL` for it.
SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
    WHERE p1.age = p2.age
    AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age columns from both legs of the join are compared using null-safe equal,
-- which is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
    WHERE p1.age <=> p2.age
    AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+
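The filtering rule (keep a row only when the condition is True, never when it is False or NULL) can be sketched in plain Python. `None` stands in for NULL, and `gt`, `sql_or` and `where` are hypothetical helper names used only for this illustration.

```python
person = [
    ("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
    ("Albert", None), ("Michelle", 30), ("Dan", 50),
]


def gt(a, b):
    """Model of `>`: unknown (None) when either operand is NULL."""
    return None if a is None or b is None else a > b


def sql_or(a, b):
    """Three-valued OR."""
    if a is True or b is True:
        return True
    return None if a is None or b is None else False


def where(rows, predicate):
    """Keep only rows whose predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


# WHERE age > 0: the two rows with NULL age evaluate to None and are dropped.
known = where(person, lambda row: gt(row[1], 0))
print([name for name, _ in known])

# WHERE age > 0 OR age IS NULL: IS NULL yields a definite True for those rows.
everyone = where(person, lambda row: sql_or(gt(row[1], 0), row[1] is None))
print(len(everyone))  # 7
```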
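5. Null handling in GROUP BY and DISTINCT
As discussed in section 1 on comparison operators, two NULL values are not considered equal. For the purposes of grouping and distinct processing, however, two or more values with NULL data are grouped into the same bucket. This behaviour conforms to the SQL standard and matches other enterprise database management systems.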
Examples:
-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+
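Python dictionaries and sets happen to treat `None` as equal to `None`, so they give a convenient model of the bucketing described above. This is an analogy in plain Python with `None` for NULL, not a description of how Spark implements grouping.

```python
from collections import Counter

ages = [30, None, 18, 50, None, 30, 50]  # age column of the person table

# GROUP BY age with count(*): both NULL ages land in the same bucket.
groups = Counter(ages)
print(groups[None])  # 2
print(groups[50])    # 2
print(groups[18])    # 1

# SELECT DISTINCT age: NULL counts as a single distinct value.
distinct_ages = set(ages)
print(len(distinct_ages))  # 4
```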
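6. Null handling in ORDER BY
Spark SQL supports a null ordering specification in the ORDER BY clause. When processing ORDER BY, Spark places all NULL values either first or last, depending on the null ordering specification. By default, NULL values are placed first in ascending order and last in descending order.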
Examples:
-- `NULL` values are shown first and other values
-- are sorted in ascending order.
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- order and `NULL` values are shown last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Column values other than `NULL` are sorted in descending
-- order and `NULL` values are shown last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+
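Null ordering can be pictured as sorting the non-NULL values and then attaching the NULLs at one end. The sketch below is plain Python with `None` for NULL; `order_by` and its parameters are hypothetical, and the default chosen here (NULLs first for ascending, last for descending) is an assumption about Spark's defaults when no null ordering is given.

```python
def order_by(values, descending=False, nulls_first=None):
    """Sort values, placing None entries at the front or the back."""
    if nulls_first is None:
        # Assumed default: NULLS FIRST for ASC, NULLS LAST for DESC.
        nulls_first = not descending
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + non_null if nulls_first else non_null + nulls


ages = [30, None, 18, 50, None, 30, 50]

print(order_by(ages))                                      # ORDER BY age
print(order_by(ages, nulls_first=False))                   # ORDER BY age NULLS LAST
print(order_by(ages, descending=True, nulls_first=False))  # ORDER BY age DESC NULLS LAST
```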
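7. Null handling in UNION, INTERSECT, EXCEPT
In the context of set operations, NULL values are compared for equality in a null-safe manner. This means that when comparing rows, two NULL values are considered equal, unlike with the regular EqualTo (=) operator.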
Examples:
CREATE VIEW unknown_age AS SELECT * FROM person WHERE age IS NULL;

-- Only common rows between the two legs of `INTERSECT` are in the
-- result set. The comparison between columns of the rows is done
-- in a null-safe manner.
SELECT name, age FROM person
    INTERSECT
    SELECT name, age FROM unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- Rows with `NULL` age that appear in both legs of the `EXCEPT` are not in the output.
-- This shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
    EXCEPT
    SELECT age, name FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs a `UNION` operation between two sets of data.
-- The comparison between columns of the rows is done in a
-- null-safe manner.
SELECT name, age FROM person
    UNION
    SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+
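Python sets of tuples compare `None` with `None` as equal, so they behave like the null-safe row comparison used by the set operators. The following is a rough analogy in plain Python, with `None` for NULL, rather than a model of Spark's implementation.

```python
person = {
    ("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
    ("Albert", None), ("Michelle", 30), ("Dan", 50),
}

# Stand-in for the unknown_age view: persons whose age is NULL.
unknown_age = {row for row in person if row[1] is None}

intersect = person & unknown_age  # INTERSECT
except_ = person - unknown_age    # EXCEPT
union = person | unknown_age      # UNION (duplicates removed)

print(sorted(name for name, _ in intersect))  # ['Albert', 'Marry']
print(len(except_))                           # 5
print(len(union))                             # 7
```

Had the rows been compared with ordinary "=" semantics, a row with NULL age would never match its counterpart, so the intersection would be empty and the difference would keep all seven rows.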
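8. Null handling in EXISTS and NOT EXISTS subqueries
In Spark, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause. These are Boolean expressions that return either TRUE or FALSE. In other words, EXISTS is a membership condition that returns TRUE when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS is a non-membership condition that returns TRUE when the subquery returns no rows. Neither expression is affected by the presence of NULL in the subquery result. They are normally faster because they can be converted to semijoins / anti-semijoins without special provisions for null awareness.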
Examples:
-- Even if the subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- The `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- the subquery produces no rows. In this case, the subquery returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The `NOT EXISTS` expression returns `TRUE` because the subquery produces no rows.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+
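Because EXISTS only asks whether the subquery produced any row, its result never depends on the values inside those rows. A minimal plain-Python sketch, with `None` for NULL and a hypothetical helper name `exists`:

```python
def exists(subquery_rows) -> bool:
    """True when the subquery returned at least one row, whatever it contains."""
    return len(subquery_rows) > 0


one_null_row = [(None,)]  # result of (SELECT null)
no_rows = []              # result of (SELECT 1 WHERE 1 = 0)

print(exists(one_null_row))      # True
print(not exists(one_null_row))  # False
print(not exists(no_rows))       # True
```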
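9. Null handling in IN and NOT IN subqueries
In Spark, IN and NOT IN expressions are allowed inside the WHERE clause of a query. Unlike the EXISTS expression, the IN expression can return TRUE, FALSE or UNKNOWN (NULL). Conceptually, an IN expression is semantically equivalent to a set of equality conditions separated by the disjunctive operator (OR). For example, c1 IN (1, 2, 3) is semantically equivalent to (c1 = 1 OR c1 = 2 OR c1 = 3).
As far as NULL handling is concerned, the semantics can be deduced from the NULL handling in the comparison operator (=) and the logical operator (OR). To summarize, the rules for computing the result of an IN expression are as follows.
- TRUE is returned when the non-NULL value in question is found in the list
- FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
- UNKNOWN is returned when the value is NULL, or when the non-NULL value is not found in the list and the list contains at least one NULL value
When the list contains NULL, NOT IN never returns TRUE, regardless of the input value. If the value is not found in a list that contains NULL, IN returns UNKNOWN, and NOT UNKNOWN is again UNKNOWN; if the value is found, IN returns TRUE and NOT IN returns FALSE.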
Examples:
-- The subquery has only a `NULL` value in its result set. Therefore,
-- the result of the `IN` predicate is UNKNOWN and no rows qualify.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has a `NULL` value in the result set as well as a valid
-- value `50`. Rows with age = 50 are returned.
SELECT * FROM person
    WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since the subquery has a `NULL` value in the result set, the `NOT IN`
-- predicate can never return TRUE: it is FALSE for age 50 and UNKNOWN
-- for every other row. Hence, no rows qualify for this query.
SELECT * FROM person
    WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+
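The IN rules follow mechanically from folding three-valued OR over "=" comparisons. The sketch below models that in plain Python, with `None` for NULL; all function names are hypothetical, and the list is assumed to be non-empty.

```python
def sql_eq(a, b):
    """Model of `=`: unknown (None) when either operand is NULL."""
    return None if a is None or b is None else a == b


def sql_or(a, b):
    """Three-valued OR."""
    if a is True or b is True:
        return True
    return None if a is None or b is None else False


def sql_in(value, items):
    """value IN (items): value = i1 OR value = i2 OR ..."""
    result = False
    for item in items:
        result = sql_or(result, sql_eq(value, item))
    return result


def sql_not_in(value, items):
    """value NOT IN (items): NOT applied to the IN result; unknown stays unknown."""
    found = sql_in(value, items)
    return None if found is None else not found


print(sql_in(50, [50, None]))      # True
print(sql_in(30, [50, None]))      # None
print(sql_not_in(50, [50, None]))  # False
print(sql_not_in(30, [50, None]))  # None
print(sql_not_in(30, [50, 18]))    # True
```

Since a WHERE clause keeps only rows whose condition is TRUE, both the FALSE and the UNKNOWN results of `sql_not_in` lead to the row being dropped, which is why the last query above returns no rows.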