Building a Machine Learning System for Production

When businesses plan to incorporate machine learning into their solutions, they more often than not assume it is mostly about algorithms and analytics. Most blogs and training material on the subject likewise stop at reading fixed-format files, training models, and printing the results. Naturally, businesses conclude that hiring good data scientists should get the job done. What they often fail to appreciate is that it is also a good old system and data engineering problem, with the data models and algorithms sitting at the core.

A few years ago, at an organisation I was working in, the business deliberated on using machine learning models to improve user engagement. The use cases initially planned revolved around content recommendations. Later, as we worked more in the field, we started applying it to more diverse problems such as topic classification, keyword extraction, and newsletter content selection.

I will use our experience of designing and running machine-learnt models in production to illustrate the engineering and human aspects of building a data science application and team.

In our big data analysis, training models was the crux of the data science application. But to make things work in production, many missing pieces of the puzzle also had to be put in place.

These were:

  1. Getting data into the system on a regular basis from multiple sources.
  2. Cleaning and transforming data into more than one structure for use.
  3. Training and retraining models, and saving and reusing them as required (a minimal sketch follows this list).
  4. Applying incremental changes.
  5. Exposing model outputs for consumption through APIs.
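
On item 3 specifically: since our processing ran on Spark (described under System Architecture below), saving and reloading models maps naturally onto Spark ML's persistence API. The following is a minimal sketch, not our production pipeline; the stages, the toy inline dataset, and the HDFS path are all hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("model-persistence-sketch").getOrCreate()

# Toy training data; the real jobs read feature vectors from HDFS.
training_df = spark.createDataFrame(
    [("cricket match report", 1.0), ("election results announced", 0.0)],
    ["text", "label"],
)

# Train a small pipeline and persist it for later reuse.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(training_df)
model.write().overwrite().save("hdfs:///data/output/models/topic_classifier/v1")

# A later scoring or retraining job reloads the same model.
scorer = PipelineModel.load("hdfs:///data/output/models/topic_classifier/v1")
scorer.transform(training_df).select("text", "prediction").show()
```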

Scaling the consumption APIs was also a concern for us. In our existing system, content was mostly static and served from a CDN cache. Certain content-related data was served by application servers, but all users got the same data. It was served from a cache updated every 5-10 seconds, and it pertained to only around 7,000-odd items on any particular day. Hence, overall memory consumption and the number of writes were low.

Now, personalized content output was to be produced for around 35 million users, new content arrived every 10 minutes or so, and everything had to be served by our application servers. This meant a far higher number of writes, and a far larger cache, than anything we had handled earlier.

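To see why this shift mattered, a rough back-of-envelope calculation helps. The user count and refresh cadence come from the scenario above; the per-user payload size is purely an assumed figure for illustration.

```python
# Back-of-envelope sizing for the personalized output cache.
users = 35_000_000
payload_bytes = 2 * 1024      # assumption: ~2 KB of recommendations per user
refresh_seconds = 10 * 60     # new content roughly every 10 minutes

cache_gib = users * payload_bytes / 2**30
writes_per_sec = users / refresh_seconds  # if every entry refreshes each cycle

print(f"~{cache_gib:.0f} GiB cache, ~{writes_per_sec:,.0f} writes/sec")
# -> ~67 GiB cache, ~58,333 writes/sec: a different class of problem from
#    a 7,000-item cache refreshed every 5-10 seconds.
```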

The challenge for us was to design a system that did all of this. A data science / ML project was therefore not limited to building vectors and running models; it involved designing a complete system with data as the lead player.

When we started building our solution, we found that our decisions needed to cater to three facets, namely System, Data, and Team. We will discuss our approach to each of these three aspects separately.

Data Architecture:

We had data in multiple types of databases backing our various applications, with structures ranging from tabular to document to key-value. We had also decided to use Hadoop ecosystem frameworks such as Spark and Flink for our processing. We therefore chose HDFS as the storage system for analytics.

We built a three-tier data storage system.

  1. Raw Data Layer: This is essentially our data lake and the foundation layer. Data is ingested into it from all our sources, from databases as well as Kafka streams.
  2. Cleaned / Transformed / Enriched Data Layer: This layer stores data in structures that are directly consumable by our analytics or machine learning applications. Jobs take data from the lake, clean it, and transform it into standardised structures, creating Primary Data; other jobs merge changes to maintain an updated state. Primary Data is further enriched to create Secondary or Tertiary Data. Jobs also create and save feature vectors in this layer, designed for reuse across multiple subsequent algorithms. For example, the content feature vector is used for section/topic classification; the same vector, enhanced over time with consumption information, was used for newsletter candidate selection and recommendation.
  3. Processed Output Layer: Analytics and model outputs are stored in this layer. Trained models are also stored here for subsequent use. (A hypothetical directory layout for all three layers follows this list.)
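
Concretely, the three tiers can be kept apart with a simple HDFS path convention. The layout below is hypothetical rather than our exact directory structure, but it shows the separation and where the reusable feature vectors sit:

```
/data/raw/                    # 1. data lake: ingested as-is from DBs and Kafka
/data/primary/                # 2. cleaned, standardised structures
/data/enriched/               # 2. secondary / tertiary data
/data/features/               # 2. feature vectors, reused across models
/data/output/                 # 3. analytics and model outputs
/data/output/models/          # 3. trained models saved for reuse
```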

System Architecture:

Every job or application we wrote catered to data ingestion, processing, or output consumption. We therefore organised our processing into a three-tier application stack as well.

  • Data Ingestion Layer: This layer includes batch jobs that import data from RDBMS and document storage, for which we used Apache Sqoop. A set of jobs ingests data from Kafka message streams, for example user activity data: an Apache Netty based REST API server collects activity data and pushes it to Kafka, and Apache Flink jobs consume it from Kafka, generate basic statistics, and push the data on to HDFS.
  • Data Processing: We used Apache Spark jobs for all our processing, including cleaning, enrichment, feature vector building, ML models, and model output generation. The jobs are written in Java, Scala, and Python (a minimal sketch of such a job follows this list).
  • Result Consumption: Processed output is pushed to RDBMS and Redis for consumption, by jobs built on either Spark or Sqoop. The output is exposed through Spring Boot REST API endpoints, and the same results are also pushed out on event streams for further downstream processing or consumption.
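
To give a flavour of the processing tier, here is a minimal PySpark sketch of a cleaning/transformation job: it reads raw ingested records, standardises them into a primary structure, and writes them back for the feature builders downstream. The paths, columns, and dedup key are hypothetical.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("clean-content-sketch").getOrCreate()

# Raw layer in, primary layer out; the real jobs covered many more sources.
raw = spark.read.json("hdfs:///data/raw/content/dt=2020-01-01/")

primary = (
    raw.filter(F.col("content_id").isNotNull() & F.col("title").isNotNull())
       .withColumn("title", F.trim(F.lower(F.col("title"))))
       .withColumn("published_at", F.to_timestamp("published_at"))
       .withColumn("dt", F.to_date("published_at"))
       .dropDuplicates(["content_id"])   # the merge/update step, simplified
       .select("content_id", "title", "body", "published_at", "dt")
)

# Standardised Primary Data, partitioned for downstream jobs.
primary.write.mode("overwrite").partitionBy("dt").parquet(
    "hdfs:///data/primary/content/"
)
```

On the consumption side, a Spark job can push per-user output into Redis with plain redis-py. This is a hedged sketch, not our exact job: the Redis host and the shape of the recommendations DataFrame are assumptions.

```python
import redis

# Continuing with the SparkSession from the sketch above; toy output rows.
recommendations_df = spark.createDataFrame(
    [(42, '["story-1","story-2"]'), (43, '["story-9"]')],
    ["user_id", "payload"],
)

def push_partition(rows):
    # One client and one pipeline per partition; the host is hypothetical.
    r = redis.Redis(host="redis.internal", port=6379)
    pipe = r.pipeline()
    for row in rows:
        pipe.set(f"reco:{row['user_id']}", row["payload"], ex=3600)
    pipe.execute()

recommendations_df.rdd.foreachPartition(push_partition)
```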

Team Setup:

This was the most crucial aspect of the entire enterprise's success. Two needs had to be fulfilled:

  1. Skill in understanding and applying machine learning principles and tools.
  2. Knowledge of our domain: a deep understanding of the content types, the important aspects of content, how it matures and dies, what affects it, and so on.

When it became known that we were planning ML-based products, a lot of people in our existing team wanted to be part of the initiative, and it was important for us to cater to the aspirations of our existing team members too.

Moreover, the overall system design meant that the problem had two distinct parts: the core ML section, and the periphery, which was more like good old software engineering.

We decided to build our team from a combination of two sets of people:

  1. Data science experts, whom we hired. They were entrusted with the data science part of the puzzle. They also taught other team members and mentored their learning.
  2. A system development team, picked from our existing staff. They built the ingestion pipelines, stream processing engines, output consumption APIs, and so on.

By drawing on our existing team, we were also able to get ingestion pipeline development going while we were still hiring the data science people. Figuratively speaking, we could kick-start the work from day one.

As our experience illustrates, building a set of applications that train models and generate output is only the beginning. Building a system and a team to harness them is an entirely different proposition.

Translated from: https://medium.com/@bmallick/building-a-ml-system-for-production-667923c4389e
