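Spark SQL Transformation and Action Operations

Transformation operations

Operations similar to RDDs

map, filter, flatMap, mapPartitions, sample, randomSplit, limit, distinct, dropDuplicates, describe. All of these are commonly used in production, so they are covered together here in a single file.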

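import spark.implicits._ // provides the Encoder that map needs

val df1 = spark.read.json("src/main/resources/people.json")
// Use map to keep only certain fields (index 1 is age, because the inferred JSON columns are sorted by name)
df1.map(row => row.getAs[Long](1)).withColumnRenamed("value","age").show()
//df1.map(row => row.getAs[String]("address")).show()
//df1.map(row => row.getString(0)).show()

// randomSplit: splits the dataset according to the weights in the array; useful for machine learning
// (weights that do not sum to 1 are normalized)
val df2 = df1.randomSplit(Array(0.5, 0.6, 0.7))
df2(0).count
df2(1).count
df2(2).count

// limit: take 5 rows to produce a new Dataset
val df3 = df1.limit(5)
df3.show()

// distinct: remove duplicates
val df4 = df1.union(df1)
df4.distinct.count

// Called without arguments, dropDuplicates deduplicates on all columns and keeps one row for each distinct value.
df4.dropDuplicates.show
// The argument is a sequence of column names: rows are deduplicated on those columns only, again keeping one row per distinct combination
//def dropDuplicates(colNames: Seq[String])
df4.dropDuplicates("name", "age").show
df4.dropDuplicates("name").show

// describe: statistics (count, mean, stddev, min, max) for all columns
df4.describe().show
// describe: statistics for the given columns only
df4.describe("age").show
df4.describe("name", "age").show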
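The list above also names filter, flatMap, mapPartitions and sample, which the snippet does not exercise. The following is a minimal sketch of those four, assuming a local SparkSession; the `Person` case class and its three in-memory rows are hypothetical stand-ins for people.json.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the rows of people.json
case class Person(name: String, age: Long, address: String)

val spark = SparkSession.builder().master("local[*]").appName("rdd-like-ops").getOrCreate()
import spark.implicits._

val people = Seq(
  Person("Tom", 35L, "Beijing"),
  Person("Ann", 28L, "Shanghai"),
  Person("Bob", 41L, "Beijing")
).toDS()

// filter: keep only the rows that satisfy the predicate
val over30 = people.filter((p: Person) => p.age > 30)

// flatMap: each input row may produce zero or more output elements
val tokens = people.flatMap((p: Person) => Seq(p.name, p.address))

// mapPartitions: process a whole partition (an Iterator) at a time
val namesUpper = people.mapPartitions((it: Iterator[Person]) => it.map(_.name.toUpperCase))

// sample: draw roughly half of the rows without replacement; the seed makes the draw repeatable
val sampled = people.sample(withReplacement = false, fraction = 0.5, seed = 42L)

over30.show()
tokens.show()
namesUpper.show()
sampled.show()
```

The lambdas carry explicit parameter types on purpose: several of these methods are overloaded for Java functional interfaces, and spelling out the type keeps overload resolution unambiguous.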

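Storage-related

persist, checkpoint, unpersist, cache

Note: the default storage level of a Dataset is MEMORY_AND_DISK

val df1 = spark.read.json("src/main/resources/people.json")
import org.apache.spark.storage.StorageLevel
spark.sparkContext.setCheckpointDir("src/main/resources/data/checkpoint")
df1.show()
// checkpoint returns a new, checkpointed Dataset; the return value is discarded here
df1.checkpoint()
// The default storage level is MEMORY_AND_DISK
df1.cache()
// df1 is already cached at this point, so call unpersist first if MEMORY_ONLY should actually take effect
df1.persist(StorageLevel.MEMORY_ONLY)
println(df1.count())
df1.unpersist(true)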
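To see which storage level is actually in effect, `Dataset.storageLevel` can be inspected after each call. This is a sketch only, assuming a local SparkSession on Spark 3.x and a small hypothetical in-memory DataFrame instead of people.json.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("storage-levels").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("Tom", 35), ("Ann", 28)).toDF("name", "age")

// cache() is shorthand for persist() with the default level
df.cache()
df.count() // an action materializes the cache
val cachedLevel = df.storageLevel

// Drop the cached data before asking for a different level
df.unpersist(blocking = true)
val clearedLevel = df.storageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()
val memoryOnlyLevel = df.storageLevel
df.unpersist(blocking = true)

println(s"after cache: $cachedLevel")
println(s"after unpersist: $clearedLevel")
println(s"after persist(MEMORY_ONLY): $memoryOnlyLevel")
```

Calling unpersist between the two persist calls is the design choice worth noting: the second level is requested on a Dataset that no longer has a cache entry.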

select-related

Several ways to refer to columns, plus select, selectExpr, drop, withColumn, withColumnRenamed, cast (a built-in function)

import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = spark.read.json("src/main/resources/people.json")
// Several ways to refer to a column: ', "", $"", col(), df("")
// Note: do not mix them; import spark.implicits._ when needed; not every form is valid everywhere
df1.select('name, 'age, 'address).show
df1.select("name", "age", "address").show
df1.select($"name", $"age", $"address").show
df1.select(col("name"), col("age"), col("address")).show
df1.select(df1("name"), df1("age"), df1("address")).show

// The following forms are invalid and raise an error
// df1.select("name", "age"+10, "address").show
// df1.select("name", "age+10", "address").show

// This is the correct syntax
df1.select($"name", $"age"+10, $"address").show
df1.select('name, 'age+10, 'address).show

// expr can also be used (the expression is passed as a quoted string)
df1.select(expr("name"), expr("age+100"), expr("address")).show
df1.selectExpr("name as ename").show
df1.selectExpr("power(age, 2)", "address").show
df1.selectExpr("round(age, -3) as newAge", "name", "address").show

// drop, withColumn, withColumnRenamed, casting
// drop: removes one or more columns and returns a new DataFrame
df1.drop("name")
df1.drop("name", "age")

// withColumn: changes the values of a column
val df2 = df1.withColumn("age", $"age"+10)
df2.show

// withColumnRenamed: renames a column
df1.withColumnRenamed("name", "ename")
// Note: drop, withColumn and withColumnRenamed all return a DataFrame

// Two ways to cast a type
df1.selectExpr("cast(age as string)").printSchema
import org.apache.spark.sql.types._
df1.select('age.cast(StringType)).printSchema

where-related

val df1 = spark.read.json("src/main/resources/people.json")
// Filtering
df1.filter("age>30").show
df1.filter("age>30 and name=='Tom'").show
// where calls the filter operator under the hood
df1.where("age>30").show
df1.where("age>30 and name=='Tom'").show
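The same conditions can also be written as Column expressions instead of SQL strings, which lets the compiler catch malformed expressions. A small sketch, assuming a local SparkSession and hypothetical in-memory rows:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("where-columns").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("Tom", 35), ("Ann", 28), ("Tom", 25), ("Bob", 40)).toDF("name", "age")

// SQL-string form, as in the snippet above
val byString = df.where("age > 30 and name = 'Tom'")

// Column form: && combines conditions, === compares a column with a value
val byColumn = df.where($"age" > 30 && $"name" === "Tom")

byString.show()
byColumn.show()
```

Note that the Column form uses `===` rather than `==`; the latter would compare the Column object itself and yield a plain Boolean.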

groupBy-related

groupBy, agg, max, min, avg, sum, count (the last five are built-in functions)

import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = spark.read.json("src/main/resources/people.json")

// Built-in sum, max, min, avg, count
df1.groupBy("address").sum("age").show
df1.groupBy("address").max("age").show
df1.groupBy("address").min("age").show
df1.groupBy("address").avg("age").show
df1.groupBy("address").count.show

// Similar to a HAVING clause
df1.groupBy("address").avg("age").where("avg(age) > 20").show
df1.groupBy("address").avg("age").where($"avg(age)" > 20).show

// agg
df1.groupBy("address").agg("age"->"max", "age"->"min", "age"->"avg", "age"->"sum", "age"->"count").show

// This form is easier to read
df1.groupBy("address").agg(max("age"), min("age"), avg("age"), sum("age"), count("age")).show

// Renaming a column afterwards
df1.groupBy("address").agg(max("age"), min("age"), avg("age"), sum("age"), count("age"))
  .withColumnRenamed("min(age)", "minAge").show

// Aliasing the columns directly, which is the simplest way
df1.groupBy("address").agg(
  max("age").as("maxAge"), min("age").as("minAge"), avg("age").as("avgAge"),
  sum("age").as("sumAge"), count("age").as("countAge")).show

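orderBy-related

import spark.implicits._
val df1 = spark.read.json("src/main/resources/people.json")

// sort: the first three statements are equivalent (ascending order)
df1.sort("age").show
df1.sort($"age").show
df1.sort($"age".asc).show
// Descending order: use .desc, or negate a numeric column
df1.sort($"age".desc).show
df1.sort(-$"age").show
// Negation is only meaningful for numeric columns, so use .desc for the string column name
df1.sort($"age".desc, $"name".desc).show

// orderBy still calls sort under the hood
df1.orderBy("age").show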

join-related

Apache Spark 3.x currently supports the following seven join types:

INNER JOIN

CROSS JOIN

LEFT OUTER JOIN

RIGHT OUTER JOIN

FULL OUTER JOIN

LEFT SEMI JOIN

LEFT ANTI JOIN

In the implementation, the seven join types map to the following classes (the type string is matched case-insensitively and underscores are ignored):

object JoinType {
  def apply(typ: String): JoinType = typ.toLowerCase(Locale.ROOT).replace("_", "") match {
    case "inner" => Inner
    case "outer" | "full" | "fullouter" => FullOuter
    case "leftouter" | "left" => LeftOuter
    case "rightouter" | "right" => RightOuter
    case "leftsemi" | "semi" => LeftSemi
    case "leftanti" | "anti" => LeftAnti
    case "cross" => Cross
    case _ =>
      val supported = Seq(
        "inner",
        "outer", "full", "fullouter", "full_outer",
        "leftouter", "left", "left_outer",
        "rightouter", "right", "right_outer",
        "leftsemi", "left_semi", "semi",
        "leftanti", "left_anti", "anti",
        "cross")
      throw new IllegalArgumentException(s"Unsupported join type '$typ'. " +
        "Supported join types include: " + supported.mkString("'", "', '", "'") + ".")
  }
}
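The normalization in `JoinType.apply` means that several spellings select the same join, and that an unknown string is rejected. A minimal sketch of both behaviors, assuming a local SparkSession; the two small DataFrames and the join type "sideways" are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().master("local[*]").appName("join-type-aliases").getOrCreate()
import spark.implicits._

// Hypothetical sample data: customer 104 has no order
val customer = Seq((101, "ds"), (102, "ds_hadoop"), (104, "ds002")).toDF("customerId", "name")
val order = Seq((1, 101, 2500), (2, 102, 1110)).toDF("paymentId", "customerId", "amount")

// All four spellings normalize to "leftouter"
val aliases = Seq("left_outer", "leftouter", "left", "LEFT_OUTER")
val counts = aliases.map(a => customer.join(order, Seq("customerId"), a).count())

// An unsupported join type fails; Try captures whatever exception is thrown
val bad = Try(customer.join(order, Seq("customerId"), "sideways"))

println(counts)
println(bad.isFailure)
```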

Preparing the data

    // Prepare the data
    import spark.implicits._ // needed for toDF
    val order = spark.sparkContext.parallelize(Seq((1, 101, 2500), (2, 102, 1110), (3, 103, 500), (4, 102, 400))).toDF("paymentId", "customerId", "amount")
    val customer = spark.sparkContext.parallelize(Seq((101, "ds"), (102, "ds_hadoop"), (103, "ds001"), (104, "ds002"), (105, "ds003"), (106, "ds004"))).toDF("customerId", "name")

The order table

The customer table

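INNER JOIN

In Spark, if no join type is specified, the default is INNER JOIN. An INNER JOIN returns only the rows that satisfy the join condition, and it is probably the join used most often in production. For example:

    // inner join
    // Join on a single column
    customer.join(order, "customerId").show
    // Join on multiple columns, e.g. Seq("customerId", "name")
    customer.join(order, Seq("customerId")).show

Result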

CROSS JOIN

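This type of join is also called the Cartesian product: every row of the left table is joined with every row of the right table, so the result has m*n rows. Avoid it in production whenever possible. Here is an example of CROSS JOIN:

    // cross join
    // Cartesian product
    customer.crossJoin(order).show()
    // If both tables contain a column with the same name, select it through its DataFrame, similar to customer.name and order.amount
    customer.crossJoin(order).select(customer("name"), order("amount")).show

Result of the first statement

Result of the second statement: only the selected columns are shown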
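The m*n claim is easy to check with the order and customer data from above (4 and 6 rows). The sketch below assumes a local SparkSession and builds the DataFrames directly from Seq rather than through parallelize.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cross-join").getOrCreate()
import spark.implicits._

val order = Seq((1, 101, 2500), (2, 102, 1110), (3, 103, 500), (4, 102, 400))
  .toDF("paymentId", "customerId", "amount")
val customer = Seq((101, "ds"), (102, "ds_hadoop"), (103, "ds001"),
  (104, "ds002"), (105, "ds003"), (106, "ds004")).toDF("customerId", "name")

// Every customer row is paired with every order row
val crossed = customer.crossJoin(order)
val crossedCount = crossed.count()

// customerId exists on both sides, so columns are picked through their DataFrame
val picked = crossed.select(customer("name"), order("amount"))

println(s"${customer.count()} x ${order.count()} = $crossedCount")
picked.show()
```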

LEFT OUTER JOIN

LEFT OUTER JOIN is equivalent to LEFT JOIN: it returns every row of the left table, with nulls in the right-hand columns where there is no match. The following three forms are equivalent:

    // The join column has the same name in both tables
    customer.join(order, Seq("customerId"), "left_outer").show
    customer.join(order, Seq("customerId"), "leftouter").show
    customer.join(order, Seq("customerId"), "left").show

    val order2 = spark.sparkContext.parallelize(Seq((1, 101, 2500), (2, 102, 1110), (3, 103, 500), (4, 102, 400))).toDF("paymentId", "custId", "amount")
    // If the two tables are joined on differently named columns, use the triple equals sign ===
    customer.join(order2, $"customerId"===$"custId", "left").show

Result

RIGHT OUTER JOIN

Like LEFT OUTER JOIN, RIGHT OUTER JOIN is equivalent to RIGHT JOIN. The following three forms are also equivalent:

    order.join(customer, Seq("customerId"), "right").show
    order.join(customer, Seq("customerId"), "right_outer").show
    order.join(customer, Seq("customerId"), "rightouter").show

FULL OUTER JOIN

FULL OUTER JOIN should be familiar as well: it returns all rows from both the left and the right table. It can be written in the following four ways:

    order.join(customer, Seq("customerId"), "outer").show
    order.join(customer, Seq("customerId"), "full").show
    order.join(customer, Seq("customerId"), "full_outer").show
    order.join(customer, Seq("customerId"), "fullouter").show

LEFT SEMI JOIN

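LEFT SEMI JOIN returns only the rows of the left table that have a match in the right table, and it returns only the left table's columns; columns from the right table are not shown. The following three forms are equivalent: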

order.join(customer, Seq("customerId"), "leftsemi").show
order.join(customer, Seq("customerId"), "left_semi").show
order.join(customer, Seq("customerId"), "semi").show

LEFT SEMI JOIN can in fact be rewritten with IN/EXISTS:

select * from order where customerId in (select customerId from customer)

LEFT ANTI JOIN

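The opposite of LEFT SEMI JOIN, LEFT ANTI JOIN returns only the rows of the left table that have no match in the right table. The following three forms are also equivalent: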

order.join(customer, Seq("customerId"), "leftanti").show
order.join(customer, Seq("customerId"), "left_anti").show
order.join(customer, Seq("customerId"), "anti").show

LEFT ANTI JOIN can in fact be rewritten with NOT IN/NOT EXISTS (NOT IN matches only as long as the subquery returns no NULLs):

select * from order where customerId not in (select customerId from customer)
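The equivalence between the semi/anti joins and EXISTS/NOT EXISTS can be checked directly. This is a sketch under a few assumptions: a local SparkSession, temp views named `orders` and `customers` (hypothetical names, chosen to avoid the SQL keyword ORDER), and customer as the left table so that the anti join returns something.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semi-anti").getOrCreate()
import spark.implicits._

val order = Seq((1, 101, 2500), (2, 102, 1110), (3, 103, 500), (4, 102, 400))
  .toDF("paymentId", "customerId", "amount")
val customer = Seq((101, "ds"), (102, "ds_hadoop"), (103, "ds001"),
  (104, "ds002"), (105, "ds003"), (106, "ds004")).toDF("customerId", "name")

// Hypothetical view names for the SQL side
order.createOrReplaceTempView("orders")
customer.createOrReplaceTempView("customers")

// Customers that have at least one order; each customer appears once
val semiApi = customer.join(order, Seq("customerId"), "left_semi")
val semiSql = spark.sql(
  "SELECT * FROM customers c WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customerId = c.customerId)")

// Customers without any order
val antiApi = customer.join(order, Seq("customerId"), "left_anti")
val antiSql = spark.sql(
  "SELECT * FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customerId = c.customerId)")

// Collect the matching customer ids for comparison
val semiIds = semiApi.select("customerId").as[Int].collect().toSet
val antiIds = antiApi.select("customerId").as[Int].collect().toSet

println(semiIds)
println(antiIds)
```

Customer 102 has two orders but shows up only once in the semi join, which is the practical difference from an inner join followed by a projection.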

Set-related

union, unionAll, intersect, except

def main(args: Array[String]): Unit = {
  import spark.implicits._ // encoders for createDataset and toDS
  val lst = List(
    StudentAge(1, "Alice", 18),
    StudentAge(2, "Andy", 19),
    StudentAge(3, "Bob", 17),
    StudentAge(4, "Justin", 21),
    StudentAge(5, "Cindy", 20))
  val ds1 = spark.createDataset(lst)
  ds1.show()
  val rdd = spark.sparkContext.makeRDD(List(
    StudentHeight("Alice", 160),
    StudentHeight("Andy", 159),
    StudentHeight("Bob", 170),
    StudentHeight("Cindy", 165),
    StudentHeight("Rose", 160)))
  val ds2 = rdd.toDS

  // union, unionAll, intersect, except: set union, intersection and difference
  val ds3 = ds1.select("name")
  val ds4 = ds2.select("sname")
  // union: set union without deduplication; use distinct to deduplicate
  ds3.union(ds4).show
  // unionAll still calls union under the hood
  ds3.unionAll(ds4).show
  // intersect: intersection
  ds3.intersect(ds4).show
  // except: difference
  ds3.except(ds4).show
}

// Define the first dataset
case class StudentAge(sno: Int, name: String, age: Int)
// Define the second dataset
case class StudentHeight(sname: String, height: Int)

Intersection

Difference
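For reference, the set operations on the two name columns can be reduced to plain Scala sets and row counts. The sketch assumes a local SparkSession and uses only the name values from the datasets above; the variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("set-ops").getOrCreate()
import spark.implicits._

// Only the name columns of the two datasets matter here
val names1 = Seq("Alice", "Andy", "Bob", "Justin", "Cindy").toDF("name")
val names2 = Seq("Alice", "Andy", "Bob", "Cindy", "Rose").toDF("sname")

// union matches columns by position and keeps duplicates
val unioned = names1.union(names2)
val unionedCount = unioned.count()
val distinctCount = unioned.distinct().count()

// intersect: names present in both; except: names only in the first
val common = names1.intersect(names2).as[String].collect().toSet
val onlyInFirst = names1.except(names2).as[String].collect().toSet

println(s"union: $unionedCount rows, distinct: $distinctCount rows")
println(s"intersect: $common")
println(s"except: $onlyInFirst")
```

Because union resolves columns by position, the result keeps the left side's column name (`name`) even though the right side calls it `sname`.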

Null handling

na.fill, na.drop, na.replace, and filtering with isNull / isNotNull

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val df1 = spark.read.json("src/main/resources/data/people.json")
    // NA stands for a missing value; it is short for "not available"
    // Drop rows that contain a null in any column
    df1.na.drop.show
    // Drop rows where the given column is null or NaN
    df1.na.drop(Array("age")).show
    // Fill all columns (a string value only fills string columns)
    df1.na.fill("NULL").show
    // Fill a single column; fill several columns
    df1.na.fill("NULL", Array("address")).show
    df1.na.fill(Map("age" -> 0, "address" -> "NULL")).show
    // Replace specific values
    df1.na.replace(Array("address"), Map("NULL" -> "Shanghai")).na.replace(Array("age"), Map(0 -> 100)).show
    // Query rows where a column is null or not null; isNull and isNotNull are built-in functions
    df1.where("address is null").show
    df1.where($"address".isNull).show
    df1.where(col("address").isNull).show
    df1.filter("address is not null").show
    df1.filter(col("address").isNotNull).show
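The effect of each na call is easier to follow on data whose nulls are known. A minimal sketch, assuming a local SparkSession; the three rows and the placeholder value "unknown" are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-handling").getOrCreate()
import spark.implicits._

// Hypothetical rows: None becomes a null in the DataFrame
val df = Seq[(String, Option[Long], Option[String])](
  ("Tom", Some(35L), Some("Beijing")),
  ("Ann", None, Some("Shanghai")),
  ("Bob", Some(41L), None)
).toDF("name", "age", "address")

// Rows without any null: only Tom
val complete = df.na.drop()

// Rows whose age is present: Tom and Bob
val withAge = df.na.drop(Seq("age"))

// Per-column defaults; the value type must suit the column type
val defaults: Map[String, Any] = Map("age" -> 0L, "address" -> "unknown")
val filled = df.na.fill(defaults)

// Replace the placeholder that fill just wrote
val replaced = filled.na.replace("address", Map("unknown" -> "Shanghai"))

complete.show()
withAge.show()
filled.show()
replaced.show()
```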

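Action operations

Operations similar to RDDs

show, collect, collectAsList, head, first, count, take, takeAsList, reduce

    // Implicit conversions
    import spark.implicits._
    // show: displays the result; by default it shows 20 rows and truncates long values (truncate = true)
    spark.read.json("src/main/resources/data/people.json").show(100, false)
    val df = spark.read.json("src/main/resources/data/people.json")
    println(df.count())
    // collect returns an Array
    df.collect().foreach(println)
    // collectAsList returns a java.util.List
    df.collectAsList().forEach(row => println(row))
    // head(3) returns the first 3 rows as an Array
    df.head(3).foreach(println)
    println(df.head(3)) // prints the array reference, not its contents
    // first returns the first row; it is equivalent to head()
    println(df.first())
    // take calls head under the hood and returns an Array
    df.take(3).foreach(println)
    // takeAsList calls take, which in turn calls head
    df.takeAsList(3).forEach(row => println(row))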
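reduce appears in the list above but not in the snippet. Here is a small sketch on a typed Dataset, assuming a local SparkSession; the three age values are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("reduce-action").getOrCreate()
import spark.implicits._

// Hypothetical ages as a Dataset[Long]
val ages = Seq(35L, 28L, 41L).toDS()

// reduce combines the elements pairwise with a commutative, associative function
val total = ages.reduce((a: Long, b: Long) => a + b)

// take returns an Array holding the requested number of elements
val firstTwo = ages.take(2)

// first and head return the same single element
val sameFirst = ages.first() == ages.head()

println(total)
println(firstTwo.mkString(", "))
```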

Operations for inspecting structure

printSchema, explain, columns, dtypes, col

    val df1 = spark.read.json("src/main/resources/data/people.json")
    // Structural properties
    df1.columns.foreach(println) // column names: address, age, name
    df1.dtypes.foreach(println) // column names and types: (address,StringType) (age,LongType) (name,StringType)
    df1.explain() // show the execution plan
    println(df1.col("name")) // get a single column
    df1.printSchema // commonly used
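The same properties can be read programmatically rather than printed. The following sketch assumes a local SparkSession on Spark 3.x and a one-row in-memory DataFrame shaped like people.json.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().master("local[*]").appName("schema-inspection").getOrCreate()
import spark.implicits._

// Hypothetical row with the same columns as people.json
val df = Seq(("Beijing", 35L, "Tom")).toDF("address", "age", "name")

// columns: just the names, in schema order
val columnNames = df.columns.toSeq

// dtypes: (name, type name) pairs as strings
val columnTypes = df.dtypes.toSeq

// schema gives typed access to a single field
val ageType = df.schema("age").dataType

println(columnNames)
println(columnTypes)
df.printSchema()
```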