How to do text similarity search and document clustering in BigQuery

BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Storm reports data

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

SELECT 
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
ST_GeogPoint(longitude, latitude) AS location,
comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
LIMIT 10

The result looks like this:

Image for post

Let’s say that we want to build a SQL query to search for comments that look like “power line down on a home”.

Steps:

  • Load a machine learning model that creates an embedding (essentially a compact numerical representation) of some text.
  • Use the model to generate the embedding of our search term.
  • Use the model to generate the embedding of every comment in the wind reports table.
  • Look for rows where the two embeddings are close to each other.
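
The four steps above can be sketched outside BigQuery in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words from a tiny fixed vocabulary), and the comments are made-up examples, not rows from the wind reports table.

```python
import math

# Step 1 (stand-in): a hypothetical "model" with one dimension per vocabulary
# word, holding how often that word occurs in the text.
VOCAB = ["power", "line", "down", "tree", "home", "roof"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(v)) for v in VOCAB]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(term, comments, limit=10):
    term_embedding = toy_embed(term)              # step 2: embed the search term
    scored = [
        (distance(term_embedding, toy_embed(c)), c)  # step 3: embed every comment
        for c in comments
    ]
    return sorted(scored)[:limit]                 # step 4: keep the closest rows

comments = [
    "Tree down on a roof",
    "Power line down on a home",
    "Large tree down",
]
results = search("power line down on a home", comments, limit=2)
for dist, comment in results:
    print(f"{dist:.2f}  {comment}")
```

The rest of the article follows the same shape, with a real embedding model and SQL in place of these toy functions.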

Loading a text embedding model into BigQuery

TensorFlow Hub has a number of text embedding models. For best results, you should use a model that has been trained on data that is similar to your dataset and which has a sufficient number of dimensions so as to capture the nuances of your text.

For this demonstration, I’ll use the Swivel embedding which was trained on Google News and has 20 dimensions (i.e., it is pretty coarse). This is sufficient for what we need to do.

The Swivel embedding layer is already available in TensorFlow SavedModel format, so we simply need to download it, extract it from the tarred, gzipped file, and upload it to Google Cloud Storage:

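FILE=swivel.tar.gz
# Create the working directory before downloading into it.
mkdir -p tmp
# Quote the URL so the shell does not try to glob the "?".
wget --quiet -O tmp/${FILE} "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1?tf-hub-format=compressed"
cd tmp
tar xvfz ${FILE}
# Remove the archive so that only the extracted model files are uploaded.
rm ${FILE}
cd ..
mv tmp swivel
gsutil -m cp -R swivel gs://${BUCKET}/swivel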

Once the model files are on GCS, we can load them into BigQuery as an ML model:

CREATE OR REPLACE MODEL advdata.swivel_text_embed
OPTIONS(model_type='tensorflow', model_path='gs://BUCKET/swivel/*')

Trying out the embedding model in BigQuery

To try out the model in BigQuery, we need to know its input and output schema. These would be the names of the Keras layers when it was exported. We can get them by going to the BigQuery console and viewing the “Schema” tab of the model:

Image for post

Let’s try this model out by getting the embedding for a famous August speech, passing the input text in a column named sentences and knowing that we will get back an output column named output_0:

SELECT output_0 FROM
ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT "Long years ago, we made a tryst with destiny; and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially." AS sentences))

The result has 20 numbers as expected, the first few of which are shown below:

Image for post

Document similarity search

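Define a function to compute the Euclidean distance between a pair of 20-dimensional embeddings (the helper td returns the squared difference at one index, and term_distance takes the square root of the sum of those squared differences):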

CREATE TEMPORARY FUNCTION td(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>, idx INT64) AS (
  (a[OFFSET(idx)] - b[OFFSET(idx)]) * (a[OFFSET(idx)] - b[OFFSET(idx)])
);

CREATE TEMPORARY FUNCTION term_distance(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
  SELECT SQRT(SUM( td(a, b, idx))) FROM UNNEST(GENERATE_ARRAY(0, 19)) idx
));
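
As a check on the arithmetic, here is a plain-Python sketch of the same two-part computation. The function `squared_distance` is a hypothetical name for what `SUM(td(...))` accumulates, and the two-dimensional vectors are made up to keep the numbers easy to verify by hand.

```python
import math

def squared_distance(a, b):
    """Sum of squared per-dimension differences (the SUM over td)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def term_distance(a, b):
    """Euclidean distance: the square root of the squared distance."""
    return math.sqrt(squared_distance(a, b))

a = [0.0, 3.0]
b = [4.0, 0.0]
print(squared_distance(a, b))  # 25.0
print(term_distance(a, b))     # 5.0
```

Because the square root is monotonic, ordering rows by either value returns the same top matches.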

Then, compute the embedding for our search term:

WITH search_term AS (
SELECT output_0 AS term_embedding FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(SELECT "power line down on a home" AS sentences))
)

and compute the distance between each comment’s embedding and the term_embedding of the search term (above):

SELECT
term_distance(term_embedding, output_0) AS termdist,
comments
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT comments, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
)), search_term
ORDER BY termdist ASC
LIMIT 10

The result is:

Image for post

Remember that we searched for “power line down on a home”. Note that the top two results are “power line down on house”: the text embedding has been helpful in recognizing that home and house are similar in this context. The next set of top matches are all about power lines, the most distinctive pair of words in our search term.

Document clustering

Document clustering involves using the embeddings as an input to a clustering algorithm such as K-Means. We can do this in BigQuery itself, and to make things a bit more interesting, we’ll use the location and day-of-year as additional inputs to the clustering algorithm.
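
To make the idea concrete, here is a minimal K-Means sketch in plain Python. Everything in it is an assumption made for illustration: the rows are invented, the embedding has only two dimensions instead of twenty, the first k rows seed the centroids so the run is deterministic, and no feature scaling is applied. It is not how BigQuery ML implements clustering.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain K-Means; the first k points seed the centroids (an assumption
    made here for determinism)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Each row: a 2-dimensional toy embedding, then julian_day, longitude, latitude.
rows = [
    [0.1, 0.2, 100, -97.0, 35.0],
    [0.9, 0.8, 200, -80.0, 40.0],
    [0.2, 0.1, 102, -96.5, 35.5],
    [0.8, 0.9, 198, -80.5, 40.5],
]
assignment, centroids = kmeans(rows, k=2)
print(assignment)  # [0, 1, 0, 1]
```

In BigQuery, the whole procedure is a single CREATE MODEL statement with model_type='kmeans':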

CREATE OR REPLACE MODEL advdata.storm_reports_clustering
OPTIONS(model_type='kmeans', NUM_CLUSTERS=10) AS
SELECT
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))

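The embedding (output_0) is an array, but BigQuery ML currently wants named inputs. The workaround is to convert the array to a struct with a helper function. Because it is a temporary function, it has to be defined in the same query, ahead of the CREATE MODEL statement that calls it: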

CREATE TEMPORARY FUNCTION arr_to_input_20(arr ARRAY<FLOAT64>)
RETURNS
STRUCT<p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64,
p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64,
p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64,
p17 FLOAT64, p18 FLOAT64, p19 FLOAT64, p20 FLOAT64>
AS (
STRUCT(
arr[OFFSET(0)]
, arr[OFFSET(1)]
, arr[OFFSET(2)]
, arr[OFFSET(3)]
, arr[OFFSET(4)]
, arr[OFFSET(5)]
, arr[OFFSET(6)]
, arr[OFFSET(7)]
, arr[OFFSET(8)]
, arr[OFFSET(9)]
, arr[OFFSET(10)]
, arr[OFFSET(11)]
, arr[OFFSET(12)]
, arr[OFFSET(13)]
, arr[OFFSET(14)]
, arr[OFFSET(15)]
, arr[OFFSET(16)]
, arr[OFFSET(17)]
, arr[OFFSET(18)]
, arr[OFFSET(19)]
));
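
The same reshaping can be sketched in Python for illustration: each array position becomes a named field, p1 through p20. The helper name `arr_to_named_inputs` and the sample values are hypothetical.

```python
def arr_to_named_inputs(arr, size=20):
    """Map a fixed-length list to named fields p1..p<size>."""
    if len(arr) != size:
        raise ValueError(f"expected {size} values, got {len(arr)}")
    return {f"p{i + 1}": value for i, value in enumerate(arr)}

embedding = [i / 10 for i in range(20)]
named = arr_to_named_inputs(embedding)
print(named["p1"], named["p20"])  # 0.0 1.9
```

Note the off-by-one between the two naming schemes: field p1 holds the value at offset 0, and p20 holds the value at offset 19.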

The resulting ten clusters can be visualized in the BigQuery console:

Image for post

What do the comments in cluster #1 look like? The query is:

SELECT sentences 
FROM ML.PREDICT(MODEL `ai-analytics-solutions.advdata.storm_reports_clustering`,
(
SELECT
sentences,
arr_to_input_20(output_0) AS comments_embed,
EXTRACT(DAYOFYEAR from timestamp) AS julian_day,
longitude, latitude
FROM ML.PREDICT(MODEL advdata.swivel_text_embed,(
SELECT timestamp, longitude, latitude, LOWER(comments) AS sentences
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports`
WHERE EXTRACT(YEAR from timestamp) = 2019
))))
WHERE centroid_id = 1

The result shows that these are mostly short, uninformative comments:

Image for post

How about cluster #3? Most of these reports seem to have something to do with verification by radar!!!

Image for post

Enjoy!

Links

TensorFlow Hub has several text embedding models. You don’t have to use Swivel, although Swivel is a good general-purpose choice.

Full queries are in my notebook on GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.

Translated from: https://towardsdatascience.com/how-to-do-text-similarity-search-and-document-clustering-in-bigquery-75eb8f45ab65
