Machine Learning's Secret Source: Curation

Methods for Successful Machine Learning / Artificial Intelligence

It is often said that data is the new oil, and like oil, data must be refined before it can be put to good use. The power of a machine learning model depends heavily on the quality of its data; I'm not saying anything new here.

As AI development and its applications become more pervasive, ML engineers everywhere are confronted with a grim reality. Once stakeholders have overcome their biases or scepticism, bought in, and identified a use case with proven ROI, they are eager to jump aboard the AI ship. Data curation is then usually neglected and does not get the attention it deserves, often because of a quick-win mentality and the fact that it's not sexy!

Even within technology groups, many assume that AI only needs to be fed data collected and combined at large scale; in most cases, this gravely backfires. Inaccurate datasets come in many forms, ranging from factually incorrect information to knowledge gaps to wrong guidelines. Among many other problems, an uncurated dataset can be:

  • Biased: recently, several popular AI systems used for image recognition displayed disturbing gender and racial bias.

  • Inaccurate, unreliable or falsely represented

  • Error-ridden or ambiguous

Using raw datasets that have not been refined or curated is widely known to decrease feature quality and to limit the evaluation and application of transfer tasks. So how should datasets be treated so that they serve the exact purpose ML needs? This depends heavily on the use case the ML engineers are trying to address.

Types of Datasets for Machine Learning

ML engineers depend on data at each step of their AI journey, from model choice to training and testing. These datasets typically fall under three classifications:

  • Training sets

  • Validation sets

  • Testing sets

Every ML project starts with two dataset categories: the training dataset and the testing dataset.

  • The training dataset is used to train an algorithm so that it can learn concepts, discover patterns, and produce results.

  • The testing dataset is used to check how well the model trained on the training data performs on data it has not seen. Training data is not reused for testing because the model has already seen it and will simply produce the expected outputs.
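
The split into training, validation, and testing sets can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions made for the example, not something the article prescribes.

```python
import random


def split_dataset(records, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle records and split them into training, validation, and test lists."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical usage: 100 integer IDs stand in for real records.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled copy, no record appears in more than one set, which is the property that keeps the test data unseen during training.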

Image created by author Steve Leven

Data Needs for Machine Learning

Data scientists collect data from various sources, integrate it into one form, and then validate, manipulate, archive, preserve, retrieve, and present it.

The process of curating datasets for machine learning starts well before the datasets are put to use.

My suggestions:

  • Identify the aim of the AI

  • Identify what dataset you will require to solve the problem

  • Keep a record of your hypotheses while selecting the data

  • Strive to collect diverse and meaningful data from both external and internal sources

  • Create datasets that are hard for your competitors to copy (defensibility)

If you have a small dataset, a good approach can be to take a model pre-trained on large datasets and use your small dataset to fine-tune it.

Once you have accumulated the right data, you can proceed to create the training set. This step of putting data into the optimal format is called feature transformation, and it involves four stages:

Formatting: Collected data arrives in different formats, and formatting brings it together in one place. For example, consumer data can come with different currencies, semantics, and so on. These need to be compiled into one format so that the foundation is uniform.
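
As a rough sketch of the formatting stage, the snippet below brings consumer records with mixed currencies into one schema. The field names, the `normalise_record` helper, and the exchange rates are all hypothetical placeholders chosen for illustration; a real pipeline would load dated rates from a trusted source.

```python
# Hypothetical exchange rates to USD, hard-coded only for this example.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.25}


def normalise_record(record):
    """Return a copy of the record with a tidy customer name and the amount in USD."""
    currency = record["currency"].strip().upper()
    if currency not in RATES_TO_USD:
        # Fail loudly rather than guess a rate for an unknown currency.
        raise ValueError(f"unknown currency: {currency}")
    return {
        "customer": record["customer"].strip().lower(),
        "amount_usd": round(float(record["amount"]) * RATES_TO_USD[currency], 2),
    }


raw = [
    {"customer": " Alice ", "amount": "10", "currency": "eur"},
    {"customer": "BOB", "amount": 8, "currency": "GBP"},
]
uniform = [normalise_record(r) for r in raw]
print(uniform)
```

Raising an error on an unknown currency is a deliberate choice here: silently passing such rows through would reintroduce the inconsistency the formatting stage is meant to remove.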

Labelling: Labelling ensures the dataset works for the specific model choice. For example, an autonomous car requires image data labelled as cars, pedestrians, road signs, or walkways.
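
One simple check that supports labelling is an audit of the labels already attached to the data. The sketch below reuses the four classes from the autonomous-car example; the sample structure (a dict with `file` and `label` keys) and the `audit_labels` helper are assumptions made for illustration.

```python
from collections import Counter

# Classes taken from the autonomous-car example in the text.
ALLOWED_LABELS = {"car", "pedestrian", "road_sign", "walkway"}


def audit_labels(samples):
    """Return (counts per valid label, samples whose label is missing or unknown)."""
    counts = Counter()
    rejected = []
    for sample in samples:
        label = sample.get("label")
        if label in ALLOWED_LABELS:
            counts[label] += 1
        else:
            rejected.append(sample)
    return counts, rejected


samples = [
    {"file": "img_001.png", "label": "car"},
    {"file": "img_002.png", "label": "pedestrian"},
    {"file": "img_003.png", "label": "cat"},  # not a class the model expects
    {"file": "img_004.png"},                  # label missing entirely
]
counts, rejected = audit_labels(samples)
print(counts, rejected)
```

The per-label counts also give a first, crude view of class balance, which is relevant to the bias problem mentioned earlier.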

Cleansing: Unwanted characters need to be removed, and missing values are handled according to how important each field is.
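
The two cleansing tasks can be sketched as below. Treating control characters as the unwanted ones and filling gaps with the median are assumptions chosen for this example; the article does not say which characters to drop or how missing values should be filled.

```python
import re
from statistics import median


def clean_text(value):
    """Replace ASCII control characters with spaces and collapse whitespace runs."""
    without_control = re.sub(r"[\x00-\x1f\x7f]", " ", value)
    return re.sub(r"\s+", " ", without_control).strip()


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


print(clean_text("  hello\x00\tworld \n"))
print(impute_median([1, None, 3, 10]))
```

For a field that matters a great deal to the model, dropping the incomplete rows may be safer than imputing them; that trade-off is a per-field judgment.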

Extraction: Features are examined and optimised to find those that are essential for predictive capability, faster computation, and lower memory consumption.
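
A very simple instance of this stage is dropping features that never vary, since they cannot help prediction but still cost memory and compute. The variance threshold below is one illustrative technique, not the method the article specifies, and `select_by_variance` is a hypothetical helper name.

```python
from statistics import pvariance


def select_by_variance(rows, feature_names, threshold=0.0):
    """Keep features whose population variance across rows exceeds threshold."""
    kept = [
        name for name in feature_names
        if pvariance([row[name] for row in rows]) > threshold
    ]
    # Rebuild each row with only the surviving features.
    reduced = [{name: row[name] for name in kept} for row in rows]
    return kept, reduced


rows = [
    {"age": 31, "visits": 2, "country_code": 1},
    {"age": 45, "visits": 7, "country_code": 1},
    {"age": 28, "visits": 4, "country_code": 1},
]
kept, reduced = select_by_variance(rows, ["age", "visits", "country_code"])
print(kept)
```

In this toy data `country_code` is constant, so it is the only feature removed; real feature extraction usually also weighs each feature's relationship to the prediction target.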

The Bottom Line

A dataset alone can determine the success or failure of a machine learning model. Data curation is one of the fundamental aspects of machine learning, and if done correctly, it can unleash tremendous potential. The methods and subsequent processes can appear time-consuming; however, they help keep your dataset aligned with the goals of your machine learning project at each step.

Introducing data curation processes, and the procedures that follow them, into your data team will appear time-consuming and expensive in the short term; organisations must therefore carefully analyse their current objectives and develop a strategy that supports curation as a function. Managed services and unsupervised methods trained on curated data are available and marketed by advisory and technology firms. Choose carefully; this will play a key role in your AI future.

Translated from: https://towardsdatascience.com/machine-learnings-secret-source-curation-e8c3107dcc13
