- **Reading data from a ClickHouse database**

      import scala.collection.mutable.ArrayBuffer
      import java.util.Properties
      import org.apache.spark.sql.SaveMode
      import org.apache.spark.sql.SparkSession

      // Build the JDBC connection properties for ClickHouse.
      def getCKJdbcProperties(
          batchSize: String = "100000",
          socketTimeout: String = "300000",
          numPartitions: String = "50",
          rewriteBatchedStatements: String = "true"): Properties = {
        val properties = new Properties
        properties.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
        properties.put("user", "default")
        properties.put("password", "<database password>")
        properties.put("batchsize", batchSize)
        properties.put("socket_timeout", socketTimeout)
        properties.put("numPartitions", numPartitions)
        properties.put("rewriteBatchedStatements", rewriteBatchedStatements)
        properties
      }

      // Read the ClickHouse table into a DataFrame.
      val today = "2023-06-05"
      val ckProperties = getCKJdbcProperties()
      val ckUrl = "jdbc:clickhouse://233.233.233.233:8123/ss"
      val ckTable = "ss.test"
      var ckDF = spark.read.jdbc(ckUrl, ckTable, ckProperties)
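- **show**

  Displays the data, similar to `select * from test`.

  - `ckDF.show`: shows the first 20 rows by default.
  - `ckDF.show(3)`: shows the specified number of rows.
  - `ckDF.show(false)`: shows the first 20 rows; the Boolean controls whether values longer than 20 characters are truncated, and `false` prints them in full.
  - `ckDF.show(3, 0)`: shows 3 rows; the second argument is the truncation width in characters, and 0 disables truncation.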
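The `show` overloads listed above can be tried out without a ClickHouse connection. The following is a minimal sketch, assuming a local SparkSession and a small in-memory DataFrame (`sampleDF`, with a hypothetical `user_agent` column) standing in for `ckDF`; neither is part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session replaces the cluster session used in the text.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("show-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for ckDF; the second column is
// deliberately longer than 20 characters so that truncation is visible.
val sampleDF = Seq(
  ("10.0.0.1", "a-user-agent-string-longer-than-twenty-characters"),
  ("10.0.0.2", "another-long-user-agent-string-for-the-example"),
  ("10.0.0.3", "short")
).toDF("ip_src", "user_agent")

sampleDF.show()        // up to 20 rows, cells longer than 20 characters truncated
sampleDF.show(2)       // first 2 rows, cells still truncated
sampleDF.show(false)   // up to 20 rows, cells printed in full
sampleDF.show(2, 0)    // first 2 rows, truncation disabled (width 0)
sampleDF.show(2, 10)   // first 2 rows, cells truncated to 10 characters
```

Comparing the output of `show()` and `show(false)` on the long column makes the effect of the truncation flag easy to see.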
- **ckDF.collect**

  Fetches all the data in `ckDF` and returns it as an `Array` object.

- **ckDF.collectAsList**

  Works like `collect`, except that the result is returned as a `List` object.

- **ckDF.describe("ip_src").show(3)**

  Gets summary statistics for the specified fields.

      scala> ckDF.describe("ip_src").show(3)
      +-------+------+
      |summary|ip_src|
      +-------+------+
      |  count|855035|
      |   mean|  null|
      | stddev|  null|
      +-------+------+
      only showing top 3 rows
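To make the return types of `collect`, `collectAsList`, and `describe` concrete, here is a minimal sketch. It assumes a local SparkSession and a small in-memory DataFrame (`sampleDF`, with a hypothetical numeric `port` column) in place of `ckDF`; these names are illustrative and not from the original.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Assumption: a local session replaces the cluster session used in the text.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("collect-describe-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for ckDF.
val sampleDF = Seq(
  ("10.0.0.1", 80),
  ("10.0.0.2", 443),
  ("10.0.0.3", 8080)
).toDF("ip_src", "port")

// collect pulls every row to the driver as a Scala Array[Row].
val rows: Array[Row] = sampleDF.collect()

// collectAsList does the same but returns a java.util.List[Row].
val rowList: java.util.List[Row] = sampleDF.collectAsList()

// describe returns a new DataFrame with a "summary" column holding
// count, mean, stddev, min and max for each requested column.
val stats: DataFrame = sampleDF.describe("ip_src", "port")
stats.show()
```

For a string column such as `ip_src`, `mean` and `stddev` come back as null, which matches the output shown above for the ClickHouse table.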
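- **first, head, take, takeAsList**

  Get the first few rows.

  - `first`: gets the first row.
  - `head`: gets the first row; `head(n: Int)` gets the first n rows.
  - `take(n: Int)`: gets the first n rows.
  - `takeAsList(n: Int)`: gets the first n rows and returns them as a `List`.

  These methods return one or more rows as a `Row` or an `Array[Row]`. `first` and `head` do the same thing. `take` and `takeAsList` bring the fetched rows back to the driver, so keep n small enough for driver memory when using them, to avoid an `OutOfMemoryError` on the driver.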
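The row-fetching methods above differ mainly in their return types. The following minimal sketch shows each one, assuming a local SparkSession and a small in-memory DataFrame (`sampleDF`, with a hypothetical `port` column) in place of `ckDF`.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session replaces the cluster session used in the text.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("first-head-take-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for ckDF.
val sampleDF = Seq(
  ("10.0.0.1", 80),
  ("10.0.0.2", 443),
  ("10.0.0.3", 8080)
).toDF("ip_src", "port")

val firstRow: Row = sampleDF.first()          // a single Row
val headRow: Row = sampleDF.head()            // same as first()
val headRows: Array[Row] = sampleDF.head(2)   // first 2 rows as an Array[Row]
val taken: Array[Row] = sampleDF.take(2)      // first 2 rows as an Array[Row]
val takenList: java.util.List[Row] = sampleDF.takeAsList(2) // first 2 rows as a java.util.List[Row]

// All results now live in driver memory, so n should stay small.
println(s"first = $firstRow, take(2) returned ${taken.length} rows")
```

Keeping n small is the simplest way to stay within driver memory; the amount of data returned grows with n.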