Spark SQL: NULL Semantics

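1. Null handling in comparison operators
2. Null handling in logical operators
3. Null handling in expressions
- 3.1 Null handling in null-intolerant expressions
- 3.2 Null handling in expressions that can process NULL value operands
- 3.3 Null handling in built-in aggregate expressions
4. Null handling in WHERE, HAVING and JOIN conditions
5. Null handling in GROUP BY and DISTINCT
6. Null handling in ORDER BY
7. Null handling in UNION, INTERSECT, EXCEPT
8. Null handling in EXISTS and NOT EXISTS subqueries
9. Null handling in IN and NOT IN subqueries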

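A table consists of a set of rows, and each row contains a set of columns. A column is associated with a data type and represents a specific attribute of an entity (for example, age is a column of an entity called person). Sometimes the value of a column for a particular row is not known at the time the row comes into existence. In SQL, such values are represented as NULL. This section details the semantics of NULL handling in various operators, expressions and other SQL constructs.
The following shows the schema layout and data of a table named person. The data contains NULL values in the age column, and this table is used in the examples throughout the sections below.
TABLE: person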

Id	Name	Age
100	Joe	30
200	Marry	NULL
300	Mike	18
400	Fred	50
500	Albert	NULL
600	Michelle	30
700	Dan	50

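1. Null handling in comparison operators

Apache Spark supports the standard comparison operators such as ">", ">=", "=", "<" and "<=". The result of these operators is unknown (NULL) when one or both of the operands are unknown or NULL. To compare NULL values for equality, Spark provides a null-safe equal operator ("<=>"), which returns False when exactly one of the operands is NULL and True when both operands are NULL. The following table shows the behaviour of the comparison operators when one or both operands are NULL: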

Left Operand	Right Operand	>	>=	=	<	<=	<=>
NULL	Any value	NULL	NULL	NULL	NULL	NULL	False
Any value	NULL	NULL	NULL	NULL	NULL	NULL	False
NULL	NULL	NULL	NULL	NULL	NULL	NULL	True
Examples:

-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator returns `False` when one of the operands is `NULL`.
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- Null-safe equal operator returns `True` when both operands are `NULL`.
SELECT NULL <=> NULL AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+
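
The table above can be mirrored in a few lines of plain Python. This is only an illustrative model, not Spark code: NULL is represented by `None`, and the function names `sql_gt`, `sql_eq` and `null_safe_eq` are hypothetical helpers chosen for this sketch.

```python
def sql_gt(a, b):
    """Model of `>`: the result is unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a > b


def sql_eq(a, b):
    """Model of `=`: the result is unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Model of `<=>`: always returns True or False, never None."""
    if a is None or b is None:
        # True only when both sides are NULL, False when exactly one is.
        return a is None and b is None
    return a == b


print(sql_gt(5, None))           # None
print(sql_eq(None, None))        # None
print(null_safe_eq(5, None))     # False
print(null_safe_eq(None, None))  # True
```

The key design point is the return type: the ordinary operators have three possible results, while the null-safe operator has only two.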

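2. Null handling in logical operators

Spark supports the standard logical operators AND, OR and NOT. These operators take Boolean expressions as arguments and return a Boolean value.
The following tables show the behaviour of the logical operators when one or both operands are NULL.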

Left Operand	Right Operand	OR	AND
True	NULL	True	NULL
False	NULL	NULL	False
NULL	True	True	NULL
NULL	False	NULL	False
NULL	NULL	NULL	NULL

operand	NOT
NULL	NULL

Examples:

-- `OR` returns `true` when one operand is `true`, even if the other is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- `OR` returns `NULL` when one operand is `NULL` and the other is `false`.
SELECT (null OR false) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- `NOT` returns `NULL` when its operand is `NULL`.
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+
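
The truth tables above follow three-valued logic: a definite operand decides the result whenever it can, otherwise the unknown propagates. A minimal Python sketch, assuming NULL is modelled as `None` (the names `sql_and`, `sql_or` and `sql_not` are illustrative, not Spark APIs):

```python
def sql_and(a, b):
    """Model of AND: a single False decides the result, otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_or(a, b):
    """Model of OR: a single True decides the result, otherwise NULL propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_not(a):
    """Model of NOT: the negation of unknown is still unknown."""
    return None if a is None else (not a)


print(sql_or(True, None))    # True
print(sql_or(None, False))   # None
print(sql_and(False, None))  # False
print(sql_not(None))         # None
```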

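3. Null handling in expressions

Comparison operators and logical operators are treated as expressions in Spark. Besides these two kinds, Spark supports other forms of expressions, such as function expressions and cast expressions. Expressions in Spark can be broadly classified as:

Null-intolerant expressions
Expressions that can process NULL value operands
- The result of these expressions depends on the expression itself.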

3.1 Null handling in null-intolerant expressions

A null-intolerant expression returns NULL when one or more of its arguments are NULL. Most expressions fall into this category.
Examples:

SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+
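
Null intolerance can be pictured as a guard that runs before the function body. The sketch below is a plain-Python illustration, not how Spark implements it; the `null_intolerant` decorator and the Python versions of `concat` and `positive` are hypothetical stand-ins, with NULL modelled as `None`.

```python
from functools import wraps


def null_intolerant(fn):
    """Wrap fn so that any NULL (None) argument short-circuits to NULL."""
    @wraps(fn)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper


@null_intolerant
def concat(*parts):
    # Only reached when every argument is non-NULL.
    return "".join(parts)


@null_intolerant
def positive(x):
    return +x


print(concat("John", None))    # None
print(concat("John", " Doe"))  # John Doe
print(positive(None))          # None
```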

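3.2 Null handling in expressions that can process NULL value operands

This class of expressions is designed to handle NULL values. The result depends on the expression itself. For example, the function expression isnull returns true on NULL input and false on non-NULL input, whereas the function coalesce returns the first non-NULL value in its list of operands. However, coalesce returns NULL when all of its operands are NULL. Below is an incomplete list of expressions in this category.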

COALESCE
NULLIF
IFNULL
NVL
NVL2
ISNAN
NANVL
ISNULL
ISNOTNULL
ATLEASTNNONNULLS
IN

Examples:

SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of non `NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`.
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+
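
Unlike null-intolerant expressions, these functions inspect NULL inputs instead of propagating them. A rough Python model of the three functions used in the examples above, assuming NULL is `None` (these are illustrative re-implementations, not Spark's code):

```python
import math


def isnull(x):
    """True for NULL input, False otherwise; never returns NULL itself."""
    return x is None


def coalesce(*args):
    """Return the first non-NULL argument, or NULL if there is none."""
    for arg in args:
        if arg is not None:
            return arg
    return None


def isnan(x):
    """False for NULL input; True only for a floating-point NaN."""
    return isinstance(x, float) and math.isnan(x)


print(isnull(None))                    # True
print(coalesce(None, None, 3, None))   # 3
print(coalesce(None, None))            # None
print(isnan(None))                     # False
```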

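3.3 Null handling in built-in aggregate expressions

Aggregate functions compute a single result by processing a set of input rows. The rules for how aggregate functions handle NULL values are as follows.

NULL values are ignored by all aggregate functions during processing.
- The only exception to this rule is the COUNT(*) function.
Some aggregate functions return NULL when all input values are NULL or the input data set is empty. These functions are: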
- MAX
- MIN
- SUM
- AVG
- EVERY
- ANY
- SOME

Examples:

-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person where 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person where 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+
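
The aggregate rules can be checked against the age column of the person table with a small Python model. This is a sketch under the assumption that NULL is `None`; `count_star`, `count_col` and `sql_max` are hypothetical helper names.

```python
# The age column of the person table, in row order.
ages = [30, None, 18, 50, None, 30, 50]


def count_star(rows):
    """count(*): counts rows, so NULLs are included."""
    return len(rows)


def count_col(values):
    """count(col): NULL values are skipped."""
    return sum(1 for v in values if v is not None)


def sql_max(values):
    """max(col): NULLs are skipped; NULL is returned if nothing remains."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


print(count_star(ages))  # 7
print(count_col(ages))   # 5
print(sql_max(ages))     # 50
print(count_star([]))    # 0
print(sql_max([]))       # None
```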

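4. Null handling in WHERE, HAVING and JOIN conditions

The WHERE and HAVING operators filter rows based on a user-specified condition. The JOIN operator combines rows from two tables based on a join condition. For all three operators, the condition is a Boolean expression that can return True, False or Unknown (NULL). A condition is "satisfied" only if its result is True; rows for which it evaluates to False or Unknown are dropped.
Examples: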

-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- `IS NULL` expression is used in disjunction to select the persons
-- with unknown (`NULL`) records.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- Persons with unknown (`NULL`) ages are skipped from processing.
SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
WHERE p1.age = p2.age
AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
WHERE p1.age <=> p2.age
AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+
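
The filtering rule, where only a True condition keeps a row, can be sketched in plain Python. The model below is illustrative rather than Spark's implementation: NULL is `None`, rows are `(name, age)` tuples, and `where` and `self_join` are hypothetical helpers.

```python
person = [
    ("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
    ("Albert", None), ("Michelle", 30), ("Dan", 50),
]


def sql_gt(a, b):
    """Model of `>`: unknown (None) if either operand is NULL."""
    return None if a is None or b is None else a > b


def sql_eq(a, b):
    """Model of `=`: unknown (None) if either operand is NULL."""
    return None if a is None or b is None else a == b


def null_safe_eq(a, b):
    """Model of `<=>`: two NULLs are equal, one NULL is not equal to anything."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def where(rows, condition):
    """Keep a row only when the condition is exactly True (not False, not None)."""
    return [row for row in rows if condition(row) is True]


def self_join(rows, age_equal):
    """Self join on age (using the given equality) and on name."""
    return [
        (p1, p2) for p1 in rows for p2 in rows
        if age_equal(p1[1], p2[1]) is True and p1[0] == p2[0]
    ]


known_age = where(person, lambda p: sql_gt(p[1], 0))
print(len(known_age))                        # 5
print(len(self_join(person, sql_eq)))        # 5
print(len(self_join(person, null_safe_eq)))  # 7
```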

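5. Null handling in GROUP BY and DISTINCT

As discussed in section 1 (Null handling in comparison operators), two NULL values are not equal. However, for the purposes of grouping and distinct processing, two or more NULL values are placed in the same bucket. This behaviour conforms to the SQL standard and matches other enterprise database management systems.
Examples: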

-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+
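
Python's own dictionaries and sets happen to treat `None` the same way grouping treats NULL: as a single key. That makes the bucket behaviour easy to reproduce in a short sketch, assuming NULL is modelled as `None` and using the age column of the person table:

```python
from collections import Counter

# The age column of the person table, in row order.
ages = [30, None, 18, 50, None, 30, 50]

# GROUP BY age with count(*): every NULL lands in the same bucket.
counts = Counter(ages)

# DISTINCT age: all NULLs collapse into one distinct value.
distinct_ages = set(ages)

print(counts[None])        # 2
print(counts[50])          # 2
print(len(distinct_ages))  # 4
```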

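6. Null handling in ORDER BY

Spark SQL supports a null ordering specification in the ORDER BY clause. When processing ORDER BY, Spark places all NULL values either first or last, depending on that specification. By default, NULL values are placed first for ascending order; for descending order the default is NULLS LAST.
Examples: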

-- `NULL` values are shown at first and other values
-- are sorted in ascending way.
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- way and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Column values other than `NULL` are sorted in descending
-- order and `NULL` values are shown last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+
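
One way to picture the null ordering specification is to sort the non-NULL values on their own and then attach the NULL rows at the front or the back. The following Python sketch does exactly that. It is an illustration only: `order_by_age` and its `nulls_first` flag are hypothetical, NULL is `None`, and rows are `(age, name)` tuples.

```python
person = [
    (30, "Joe"), (None, "Marry"), (18, "Mike"), (50, "Fred"),
    (None, "Albert"), (30, "Michelle"), (50, "Dan"),
]


def order_by_age(rows, descending=False, nulls_first=True):
    """Sort rows by age, placing NULL ages first or last as requested."""
    nulls = [row for row in rows if row[0] is None]
    non_nulls = sorted(
        (row for row in rows if row[0] is not None),
        key=lambda row: row[0],
        reverse=descending,
    )
    return nulls + non_nulls if nulls_first else non_nulls + nulls


# ORDER BY age
print([age for age, _ in order_by_age(person)])
# ORDER BY age NULLS LAST
print([age for age, _ in order_by_age(person, nulls_first=False)])
# ORDER BY age DESC NULLS LAST
print([age for age, _ in order_by_age(person, descending=True, nulls_first=False)])
```

Keeping the NULL rows out of the comparison entirely avoids having to define how NULL compares with a number.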

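7. Null handling in UNION, INTERSECT, EXCEPT

In the context of set operations, NULL values are compared for equality in a null-safe manner. This means that when rows are compared, two NULL values are considered equal, unlike with the regular EqualTo (=) operator.
Examples: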

CREATE VIEW unknown_age AS SELECT * FROM person WHERE age IS NULL;

-- Only common rows between two legs of `INTERSECT` are in the
-- result set. The comparison between columns of the row is done
-- in a null-safe manner.
SELECT name, age FROM person
INTERSECT
SELECT name, age FROM unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- `NULL` values from two legs of the `EXCEPT` are not in output.
-- This basically shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
EXCEPT
SELECT age, name FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs `UNION` operation between two sets of data.
-- The comparison between columns of the row is done in
-- null-safe manner.
SELECT name, age FROM person
UNION
SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+
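
Null-safe row comparison is what Python tuples already do when NULL is modelled as `None`, because `("Marry", None) == ("Marry", None)` is true. Under that assumption, Python sets give a rough model of the three set operations; the variable names are illustrative.

```python
person = [
    ("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
    ("Albert", None), ("Michelle", 30), ("Dan", 50),
]

# Equivalent of the unknown_age view: rows whose age is NULL.
unknown_age = [row for row in person if row[1] is None]

# Rows compare equal column by column, with None equal to None.
intersect_rows = set(person) & set(unknown_age)
except_rows = set(person) - set(unknown_age)
union_rows = set(person) | set(unknown_age)

print(sorted(name for name, _ in intersect_rows))  # ['Albert', 'Marry']
print(len(except_rows))                            # 5
print(len(union_rows))                             # 7
```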

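8. Null handling in EXISTS and NOT EXISTS subqueries

In Spark, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause. These are Boolean expressions that return either TRUE or FALSE. In other words, EXISTS is a membership condition and returns TRUE when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS is a non-membership condition and returns TRUE when the subquery returns no rows. These two expressions are not affected by the presence of NULL in the result of the subquery. They are normally faster because they can be converted to semi-joins / anti-semi-joins without special provisions for null awareness.
Examples: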

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- subquery produces no rows. In this case, it returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+
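
Because EXISTS only asks whether the subquery produced any row, the values inside those rows never matter. A tiny Python model makes this explicit, assuming a subquery result is a list of row tuples and NULL is `None`; `exists` is a hypothetical helper.

```python
def exists(subquery_rows):
    """EXISTS: True if the subquery returned at least one row, whatever it contains."""
    return len(subquery_rows) > 0


one_null_row = [(None,)]  # models the result of: SELECT null
no_rows = []              # models the result of: SELECT 1 WHERE 1 = 0

print(exists(one_null_row))      # True
print(not exists(one_null_row))  # False
print(not exists(no_rows))       # True
```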

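9. Null handling in IN and NOT IN subqueries

In Spark, IN and NOT IN expressions are allowed inside the WHERE clause of a query. Unlike the EXISTS expression, the IN expression can return TRUE, FALSE or UNKNOWN (NULL). Conceptually, an IN expression is semantically equivalent to a set of equality conditions separated by the disjunctive operator (OR). For example, c1 IN (1, 2, 3) is semantically equivalent to (c1 = 1 OR c1 = 2 OR c1 = 3).
As far as NULL handling is concerned, the semantics can be deduced from the NULL handling of the comparison operator (=) and the logical operator (OR). To summarize, the rules for computing the result of an IN expression are as follows.

TRUE is returned when the non-NULL value in question is found in the list
FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
UNKNOWN is returned when the value is NULL, or when the non-NULL value is not found in the list and the list contains at least one NULL value

When the list contains NULL, NOT IN never returns TRUE, regardless of the input value: it returns FALSE when the value is found in the list and UNKNOWN otherwise. This is because IN returns UNKNOWN if the value is not found in a list containing NULL, and NOT UNKNOWN is again UNKNOWN. Either way, no rows qualify.
Examples: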

-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has `NULL` value in the result set as well as a valid
-- value `50`. Rows with age = 50 are returned.
SELECT * FROM person
WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
SELECT * FROM person
WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+
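
The three IN rules translate directly into a small function. The sketch below is a plain-Python model, not Spark code: NULL and UNKNOWN are both `None`, the list is assumed to be non-empty, and `sql_in` and `sql_not_in` are hypothetical names.

```python
def sql_in(value, candidates):
    """Three-valued IN over a non-empty list: returns True, False or None (UNKNOWN)."""
    if value is None:
        return None   # a NULL value can never be confirmed as a member
    if any(c is not None and c == value for c in candidates):
        return True   # found a definite match
    if any(c is None for c in candidates):
        return None   # no match, but a NULL in the list might have been one
    return False      # no match and no NULLs: definitely absent


def sql_not_in(value, candidates):
    """NOT IN is NOT applied to IN, so UNKNOWN stays UNKNOWN."""
    result = sql_in(value, candidates)
    return None if result is None else (not result)


print(sql_in(50, [50, None]))      # True
print(sql_in(30, [50, None]))      # None
print(sql_in(30, [50, 18]))        # False
print(sql_not_in(30, [50, None]))  # None
print(sql_not_in(50, [50, None]))  # False
```

With a NULL in the list, `sql_not_in` can only produce False or None, which is why the last query above returns no rows.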