PySpark: How to Modify a Nested Struct Field

In our adventures trying to build a data lake, we use a dynamically generated Spark cluster to ingest data from MongoDB, our production database, into BigQuery. To do that, we use PySpark DataFrames, and since MongoDB doesn't enforce a schema, we infer the schema from the data.

# First pass: sample the collection to infer its schema.
collection_schema = spark.read.format("mongo") \
    .option("database", db) \
    .option("collection", coll) \
    .option("sampleSize", 50000) \
    .load() \
    .schema

# Second pass: load the data using the repaired schema.
ingest_df = spark.read.format("mongo") \
    .option("database", db) \
    .option("collection", coll) \
    .load(schema=fix_spark_schema(collection_schema))

Our fix_spark_schema method just converts NullType columns to String.

In the users collection, we have the groups field, which is an array, because users can join multiple groups.

root
|-- groups: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- programs: struct (nullable = true)
| | | |-- { program id }: struct (nullable = true)
| | | | |-- Date: timestamp (nullable = true)
| | | | |-- Name: string (nullable = true)
| | | | |-- Some_Flags: struct (nullable = true)
| | | | | |-- abc: boolean (nullable = true)
| | | | | |-- def: boolean (nullable = true)
| | | | | |-- ghi: boolean (nullable = true)
| | | | | |-- xyz: boolean (nullable = true)

Also, each group offers its own programs that users can join. So under programs we store a JSON object whose keys are the ids of the programs the user has joined and whose values hold extra data, such as the date they joined. The data looks like this:

"groups" : [
    {
        … some other fields …
        "programs" : {
            "123c12b123456c1d76a4f265f10f20a0" : {
                "name" : "test_program_1",
                "some_flags" : {
                    "abc" : true,
                    "def" : true,
                    "ghi" : false,
                    "xyz" : true
                },
                "date" : ISODate("2019-11-16T03:29:00.000+0000")
            }
        }
    }
]

As a result, BigQuery creates a new column for each program id, and we end up with hundreds of columns, most of them empty for most users. So how can we fix that? We can convert programs from a struct to a string and store the whole JSON in it. That adds some friction for anyone who wants to access those fields, but it keeps our columns much cleaner.
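
To make the trade-off concrete, here is a small plain-Python sketch (no Spark involved) using the sample document above. The variable names are illustrative only, and the date is kept as a plain string for simplicity:

```python
import json

group = {
    "programs": {
        "123c12b123456c1d76a4f265f10f20a0": {
            "name": "test_program_1",
            "some_flags": {"abc": True, "def": True, "ghi": False, "xyz": True},
            "date": "2019-11-16T03:29:00.000+0000",
        }
    }
}

# Struct form: every distinct program id turns into its own nested column.
struct_columns = ["programs." + program_id for program_id in group["programs"]]

# String form: the whole object is serialized into a single value in one column.
flattened = {"programs": json.dumps(group["programs"])}

# The extra friction: consumers must parse the string before reading a field.
parsed = json.loads(flattened["programs"])
print(parsed["123c12b123456c1d76a4f265f10f20a0"]["name"])  # prints test_program_1
```

With the struct form the number of columns grows with the number of distinct program ids, while the string form always needs exactly one column.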

Attempt 1:

If the field weren't nested, we could simply cast it to a string.

ingest_df

But since it's nested, this doesn't work. The following command works only for root-level fields, so it would work if we wanted to convert the whole groups field or if we moved programs to the root level.

ingest_df

Attempt 2:

After a lot of research and many different attempts, I realized that if we want to change the type of, edit, rename, add, or remove a nested field, we need to modify the schema. The steps we have to follow are these:

  1. Iterate through the schema of the nested struct and make the changes we want.

  2. Create a JSON version of the root-level field, in our case groups, name it for example groups_json, and drop groups.

  3. Convert the groups_json field back to groups using the modified schema we created in step 1.

If we know the schema and are sure it won't change, we could hardcode it, but we can do better. We can write (that is, search Stack Overflow and modify) a dynamic function that iterates through the whole schema and changes the type of the fields we want. The following method converts the fields listed in fields_to_change to strings, but you can modify it to produce whatever type you need.

from pyspark.sql.types import ArrayType, StringType, StructField, StructType


def change_nested_field_type(schema, fields_to_change, parent=""):
    # Non-struct types (for example the elements of an array of strings)
    # have no fields to visit, so return them unchanged.
    if not isinstance(schema, StructType):
        return schema
    new_schema = []
    for field in schema:
        # Build the dotted path of this field, e.g. "programs.some_id".
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name
        if full_field_name not in fields_to_change:
            if isinstance(field.dataType, StructType):
                inner_schema = change_nested_field_type(field.dataType, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType):
                inner_schema = change_nested_field_type(field.dataType.elementType, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, ArrayType(inner_schema)))
            else:
                new_schema.append(StructField(field.name, field.dataType))
        else:
            # Here we change the field type to String
            new_schema.append(StructField(field.name, StringType()))
    return StructType(new_schema)

Now we can do the conversion like this:

new_schema = ArrayType(change_nested_field_type(df.schema["groups"].dataType.elementType, ["programs"]))
df = df.withColumn("
df = df.withColumn("groups", from_json("

And voilà! groups.programs is converted to a string.

Translated from: https://medium.com/swlh/pyspark-how-to-modify-a-nested-struct-field-8105ebe83d09
