Advantages and Usage Analysis of the VariantType in Spark 4.0

Background

This article is based on Spark 4.0.

Summary

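For semi-structured data, there are generally two ways to store it:
The first is to store it as a JSON string. This keeps the data schema-free, but the string has to be parsed as JSON each time it is used before any computation can run on it.
The second is to store it as a Struct type. This performs well, but the schema is fixed and cannot change.
The Variant type was introduced to address both problems: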

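It is schema-free and offers better query performance than a JSON string, which makes processing semi-structured data fast and simple.
The Variant data type stores semi-structured data flexibly, with no need to define a schema in advance.
The Variant binary encoding also allows data to be processed faster than parsing strings.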

Analysis

Go straight to the getFieldByKey method of Variant (compared with a JSON string, this method is the equivalent of fetching the value for a given key of a JSON object):

    public Variant getFieldByKey(String key) {
      return handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
        // Use linear search for a short list. Switch to binary search when the length reaches
        // `BINARY_SEARCH_THRESHOLD`.
        final int BINARY_SEARCH_THRESHOLD = 32;
        if (size < BINARY_SEARCH_THRESHOLD) {
          for (int i = 0; i < size; ++i) {
            int id = readUnsigned(value, idStart + idSize * i, idSize);
            if (key.equals(getMetadataKey(metadata, id))) {
              int offset = readUnsigned(value, offsetStart + offsetSize * i, offsetSize);
              return new Variant(value, metadata, dataStart + offset);
            }
          }
        } else {
          int low = 0;
          int high = size - 1;
          while (low <= high) {
            // Use unsigned right shift to compute the middle of `low` and `high`. This is not only a
            // performance optimization, because it can properly handle the case where `low + high`
            // overflows int.
            int mid = (low + high) >>> 1;
            int id = readUnsigned(value, idStart + idSize * mid, idSize);
            int cmp = getMetadataKey(metadata, id).compareTo(key);
            if (cmp < 0) {
              low = mid + 1;
            } else if (cmp > 0) {
              high = mid - 1;
            } else {
              int offset = readUnsigned(value, offsetStart + offsetSize * mid, offsetSize);
              return new Variant(value, metadata, dataStart + offset);
            }
          }
        }
        return null;
      });
    }
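The comment about the unsigned right shift in the binary search branch can be checked in isolation. The following is a minimal sketch, not part of Spark: the class MidpointDemo and its two helper methods are hypothetical names introduced only for illustration. It compares a midpoint computed with signed division against one computed with `>>> 1` when `low + high` exceeds the int range.

```java
// Hypothetical demo class, not part of Spark.
final class MidpointDemo {
    // Midpoint via signed division: gives a wrong (negative) result
    // once low + high overflows int.
    static int midSigned(int low, int high) {
        return (low + high) / 2;
    }

    // Midpoint via unsigned right shift, the form used in the listing above.
    // The overflowed sum is reinterpreted as an unsigned 32-bit value,
    // so the result is still the true midpoint for non-negative inputs.
    static int midUnsigned(int low, int high) {
        return (low + high) >>> 1;
    }

    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000;
        // low + high = 3_500_000_000, which does not fit in a signed int.
        System.out.println("signed:   " + midSigned(low, high));
        System.out.println("unsigned: " + midUnsigned(low, high));
    }
}
```

In getFieldByKey the field count is far too small for this overflow to occur in practice, so the shift there is a defensive habit rather than a fix for an observable bug.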

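Here the handleObject method decodes the header of the Variant object (the field count, the byte widths of the field ids and offsets, and the start position of each region) and passes these values to the handler: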

    public static <T> T handleObject(byte[] value, int pos, ObjectHandler<T> handler) {
      checkIndex(pos, value.length);
      int basicType = value[pos] & BASIC_TYPE_MASK;
      int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
      if (basicType != OBJECT) throw unexpectedType(Type.OBJECT);
      // Refer to the comment of the `OBJECT` constant for the details of the object header encoding.
      // Suppose `typeInfo` has a bit representation of 0_b4_b3b2_b1b0, the following line extracts
      // b4 to determine whether the object uses a 1/4-byte size.
      boolean largeSize = ((typeInfo >> 4) & 0x1) != 0;
      int sizeBytes = (largeSize ? U32_SIZE : 1);
      int size = readUnsigned(value, pos + 1, sizeBytes);
      // Extracts b3b2 to determine the integer size of the field id list.
      int idSize = ((typeInfo >> 2) & 0x3) + 1;
      // Extracts b1b0 to determine the integer size of the offset list.
      int offsetSize = (typeInfo & 0x3) + 1;
      int idStart = pos + 1 + sizeBytes;
      int offsetStart = idStart + size * idSize;
      int dataStart = offsetStart + (size + 1) * offsetSize;
      return handler.apply(size, idSize, offsetSize, idStart, offsetStart, dataStart);
    }


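Following the object header layout described in the code comments (one header byte, then the field count, then the field id list, then the offset list, then the field data), handleObject obtains the size of the object (its number of fields), the byte width of each entry in the field id list, the byte width of each entry in the offset list, the start position of the id list, the start position of the offset list, and the start position of the data.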
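To make the header arithmetic concrete, here is a minimal sketch that applies the same bit extraction to a hand-built byte array. It is an illustration under stated assumptions, not Spark's implementation: the class ObjectHeaderSketch and its Layout holder are hypothetical, the constant values (BASIC_TYPE_BITS = 2, OBJECT = 2, and so on) and the little-endian readUnsigned are assumed to match the ones the listing refers to, and the four data bytes are opaque placeholders rather than valid encoded values.

```java
// Hypothetical sketch, not part of Spark.
final class ObjectHeaderSketch {
    // Assumed values for the constants referenced by handleObject.
    static final int BASIC_TYPE_BITS = 2;
    static final int BASIC_TYPE_MASK = 0x3;
    static final int TYPE_INFO_MASK = 0x3F;
    static final int OBJECT = 2;
    static final int U32_SIZE = 4;

    // Hypothetical holder for the values handleObject hands to its handler.
    static final class Layout {
        final int size, idSize, offsetSize, idStart, offsetStart, dataStart;
        Layout(int size, int idSize, int offsetSize, int idStart, int offsetStart, int dataStart) {
            this.size = size; this.idSize = idSize; this.offsetSize = offsetSize;
            this.idStart = idStart; this.offsetStart = offsetStart; this.dataStart = dataStart;
        }
    }

    // Reads a little-endian unsigned integer of numBytes bytes (assumed byte order).
    static int readUnsigned(byte[] bytes, int pos, int numBytes) {
        int result = 0;
        for (int i = 0; i < numBytes; i++) {
            result |= (bytes[pos + i] & 0xFF) << (8 * i);
        }
        return result;
    }

    static Layout decode(byte[] value, int pos) {
        int basicType = value[pos] & BASIC_TYPE_MASK;
        int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
        if (basicType != OBJECT) throw new IllegalArgumentException("not an object");
        boolean largeSize = ((typeInfo >> 4) & 0x1) != 0;   // bit b4
        int sizeBytes = largeSize ? U32_SIZE : 1;
        int size = readUnsigned(value, pos + 1, sizeBytes);
        int idSize = ((typeInfo >> 2) & 0x3) + 1;           // bits b3b2
        int offsetSize = (typeInfo & 0x3) + 1;              // bits b1b0
        int idStart = pos + 1 + sizeBytes;
        int offsetStart = idStart + size * idSize;
        int dataStart = offsetStart + (size + 1) * offsetSize;
        return new Layout(size, idSize, offsetSize, idStart, offsetStart, dataStart);
    }

    // Object with 2 fields, 2-byte field ids, 1-byte offsets.
    // Header byte: typeInfo = 0b000100 (idSize - 1 = 1), basicType = 2 -> 0x12.
    static final byte[] SAMPLE = {
        0x12,             // header
        2,                // field count (1 byte because b4 is 0)
        0, 0, 1, 0,       // field ids 0 and 1, little-endian, 2 bytes each
        0, 2, 4,          // size + 1 offsets, 1 byte each
        0, 0, 0, 0        // placeholder data bytes
    };

    public static void main(String[] args) {
        Layout l = decode(SAMPLE, 0);
        System.out.println("size=" + l.size + " idSize=" + l.idSize + " offsetSize=" + l.offsetSize);
        System.out.println("idStart=" + l.idStart + " offsetStart=" + l.offsetStart
            + " dataStart=" + l.dataStart);
    }
}
```

The offset list holds size + 1 entries rather than size, which is why dataStart adds (size + 1) * offsetSize: the extra entry marks the end of the last field's data.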

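Next, getFieldByKey walks the fields and calls getMetadataKey to obtain the actual string of each key (its length is given by offset[i+1] - offset[i]), then compares it with the requested key. On a match, it reads that field's offset and returns a new Variant(value, metadata, dataStart + offset) object, which carries the starting position of the value for that key. If no field matches, it returns null.
To get the concrete typed value, call the corresponding accessor directly, for example getString.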

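Note: the search strategy depends on the number of fields in the object, not on its byte length. If the object has fewer than 32 fields (BINARY_SEARCH_THRESHOLD), a linear search is used; otherwise, a binary search is used.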
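The choice between the two search strategies can be modeled without any Variant encoding. The sketch below is a simplified stand-in, not Spark code: the class FieldLookupSketch is hypothetical, and a sorted String array takes the place of the field id list and metadata dictionary. It assumes, as the compareTo-based branch in getFieldByKey implies, that the fields are ordered by key.

```java
// Hypothetical sketch, not part of Spark.
final class FieldLookupSketch {
    static final int BINARY_SEARCH_THRESHOLD = 32;

    // True when an object with this many fields takes the binary search branch.
    static boolean usesBinarySearch(int fieldCount) {
        return fieldCount >= BINARY_SEARCH_THRESHOLD;
    }

    // Returns the index of key in sortedKeys, or -1 if it is absent.
    // sortedKeys must be in ascending String.compareTo order.
    static int findField(String[] sortedKeys, String key) {
        int size = sortedKeys.length;
        if (!usesBinarySearch(size)) {
            // Short list: a plain scan.
            for (int i = 0; i < size; ++i) {
                if (key.equals(sortedKeys[i])) return i;
            }
            return -1;
        }
        int low = 0;
        int high = size - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = sortedKeys[mid].compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    // Builds keys k00, k01, ... which are already in lexicographic order for n <= 100.
    static String[] makeKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("k%02d", i);
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] small = makeKeys(3);    // 3 fields: linear scan
        String[] large = makeKeys(40);   // 40 fields: binary search
        System.out.println(usesBinarySearch(small.length) + " " + findField(small, "k01"));
        System.out.println(usesBinarySearch(large.length) + " " + findField(large, "k17"));
        System.out.println(findField(large, "missing"));
    }
}
```

An object with exactly 32 fields already takes the binary search branch, because the condition in the listing is `size < BINARY_SEARCH_THRESHOLD`.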
