uni-app清理緩存數據_數據清理-從哪里開始?

uni-app清理緩存數據

It turns out that Data Scientists and Data Analysts will spend most of their time on data preprocessing and EDA rather than training a machine learning model. As one of the most important job, Data Cleansing is very important indeed.

事實證明,數據科學家和數據分析師將把大部分時間花在數據預處理和EDA上,而不是訓練機器學習模型。 作為最重要的工作之一,數據清理確實非常重要。

We all know that we need to clean the data. I guess most people know that. But where to start? In this article, I will provide a generic guide/checklist. So, once we start on a new dataset, we can start the Data Cleansing as such.

我們都知道我們需要清理數據。 我想大多數人都知道。 但是從哪里開始呢? 在本文中,我將提供通用指南/清單。 因此,一旦我們開始一個新的數據集,我們就可以像這樣開始數據清洗。

方法論(CRAI) (Methodology (C-R-A-I))

Image for post
Photo by geralt on Pixabay
照片由Geralt在Pixabay上發布

If we ask ourselves “why do we need to clean the data?”, I think it is obvious that it is because we want our data to follow some standards in order to be fed into some algorithm or visualised on a consistent scale. Therefore, let’s firstly summarise what are the “standards” that we want our data to have.

如果我們問自己“為什么需要清理數據?”,我認為很明顯是因為我們希望我們的數據遵循某些標準,以便輸入某種算法或以一致的規模可視化。 因此,讓我們首先總結一下我們希望數據具有的“標準”。

Here, I summarised 4 major criteria/standards that a cleansed dataset should have. I would call it “CRAI”.

在這里,我總結了清洗后的數據集應具有的4個主要標準/標準。 我稱之為“ CRAI”。

  • Consistency

    一致性

Every column of the data should be consistent on the same scale.

數據的每一列都應在相同范圍內保持一致。

  • Rationality

    理性

All the values in each column should comply with common sense.

每列中的所有值均應符合常識。

  • Atomicity

    原子性

The data entries should not be duplicated and the data column should not be dividable.

數據條目不應該重復,并且data列不能分開。

  • Integrity

    廉潔

The data entries should have all the features available unless null value makes sense.

數據條目應具有所有可用功能,除非使用空值有意義。

OK. Just bear in mind with these 4 criteria. I will explain them with more examples so that hopefully they will become something that you can remember.

好。 只要記住這四個標準即可。 我將通過更多示例來說明它們,以便希望它們將成為您可以記住的東西。

一致性 (Consistency)

Image for post
Photo by AbsolutVision on Pixabay
照片由AbsolutVision在Pixabay上發布

It will be helpful to plot a histogram of a column regardless it is continuous or categorical. We need to pay attention to the min/max values, average values and the shape of the distribution. Then, use common sense to find out whether there is any potential inconsistency.

無論列是連續的還是分類的,繪制柱狀圖都是有幫助的。 我們需要注意最小值/最大值,平均值和分布形狀。 然后,使用常識找出是否存在任何潛在的不一致之處。

For example, if we sampled some people and one column of the data is for their weights. Let’s say the histogram is as follows.

例如,如果我們對一些人進行了抽樣,則數據的一列是他們的體重。 假設直方圖如下。

Image for post

It is obvious that there is almost nothing between 80 and 100. Using our common sense is quite enough to find out the problem. That is, some of the weights are in kg while the others are in lbs.

顯然,在80到100之間幾乎沒有任何東西。使用我們的常識就足以找出問題所在。 也就是說,一些重量以千克為單位,而另一些重量以磅為單位。

This kind of issue is commonly found when we have multiple data sources. Some of them might use different units for the same data fields.

當我們有多個數據源時,通常會發現這種問題。 他們中的一些人可能對相同的數據字段使用不同的單位。

After cleansing, we may end up with distribution like this, which looks good.

清洗后,我們可能最終會得到這樣的分布,看起來不錯。

Image for post

理性 (Rationality)

Image for post
Photo by MMillustrates on Pixabay
照片由MMillustrates在Pixabay上發布

This also relies on our common sense, but it is usually easier to be found. Some common example:

這也依賴于我們的常識,但通常更容易找到。 一些常見的例子:

  • Human age, weight and height should not be negative.

    人的年齡,體重和身高不應為負數。
  • Some categorical data such as gender will have a certain enumeration of values. otherwise, it is not valid.

    一些分類數據(例如性別)將具有一定的值枚舉。 否則無效。
  • Most types of textual values such as human names and product names should not have leading and tailing spaces.

    大多數類型的文本值(例如,人名和產品名)都不應使用前導和尾部空格。
  • Sometimes we may also need to pay attention to special characters. Most of the time they should be stripped out.

    有時我們可能還需要注意特殊字符。 在大多數情況下,應將其剝離。

原子性 (Atomicity)

Image for post
Photo by WikimediaImages on Pixabay
照片由WikimediaImages在Pixabay上發布

This is easy to understand. We should not have any duplicated rows in our dataset. It happens commonly when we have multiple data sources, where different data source may store overlapped data.

這很容易理解。 我們的數據集中不應有任何重復的行。 當我們有多個數據源時,通常會發生這種情況,其中不同的數據源可能會存儲重疊的數據。

It is important to leave uniqueness checking later than the consistency and rationality because it will difficult to find out the duplicated rows if we didn’t fix the consistency and rationality issues. For example, a person’s name might be presented in different ways in different data sources, such as González and Gonzalez. Once we realise that there are some non-English names existing, we need to pay attention to this kind of problems.

在一致性和合理性之后保留唯一性檢查很重要,因為如果我們不解決一致性和合理性問題,將很難找出重復的行。 例如,可以在不同的數據源(例如GonzálezGonzalez以不同的方式顯示一個人的名字。 一旦意識到存在一些非英語名稱,就需要注意這種問題。

Therefore, although it is usually not difficult to get rid of duplicated rows, the other of doing it may impact the final quality of the cleansed data.

因此,盡管通常不難擺脫重復的行,但另一步可能會影響已清理數據的最終質量。

Another type of violation of the atomicity is that one column may be dividable, which means that there are actually multiple features hidden in one column. To maximise the value of the dataset, we should divide them.

違反原子性的另一種類型是,一列可能是可分割的,這意味著實際上一列中隱藏了多個特征。 為了最大化數據集的價值,我們應該將它們分開。

For example, we may have a column that represents customer names. Sometimes it might necessary to split it into first name and last name.

例如,我們可能有一個代表客戶名稱的列。 有時可能需要將其拆分為名字和姓氏。

廉潔 (Integrity)

Image for post
Photo by AbsolutVision on Pixabay
照片由AbsolutVision在Pixabay上發布

Depending on the data source and how the data structure was designed, we may have some data missing in our raw dataset. This data missing sometimes doesn’t mean we lose some data entries, but we may lose some values for certain columns.

根據數據源以及數據結構的設計方式,原始數據集中可能會缺少一些數據。 有時丟失這些數據并不意味著我們會丟失一些數據條目,但是對于某些列,我們可能會丟失一些值。

When this happened, it is important to identify whether the “null” or “NaN” values make sense in the dataset. If it doesn’t, we may need to get rid of the row.

發生這種情況時,重要的是要確定數據集中“ null”或“ NaN”值是否有意義。 如果沒有,我們可能需要擺脫該行。

However, eliminating the rows having missing values is not always the best idea. Sometimes we may use average values or other technique to fill the gaps. This is depending on the actual case.

但是,消除具有缺失值的行并不總是最好的主意。 有時我們可能會使用平均值或其他技術來填補空白。 這取決于實際情況。

CRAI方法論的使用 (Usage of CRAI Methodology)

Image for post
Photo by blickpixel on Pixabay
照片由blickpixel在Pixabay上發布

Well, I hope the above examples explained what is “CRAI” respectively. Now, let’s take a real example to practice!

好吧,我希望以上示例分別解釋什么是“ CRAI”。 現在,讓我們以一個真實的例子進行練習!

Suppose we have an address column in our dataset, what should we do to make sure it is cleaned?

假設我們的數據集中有一個address列,應該怎么做才能確保它被清除?

C —一致性 (C — Consistency)

  1. The address might contain street suffixes, such as “Street” and “St”, “Avenue” and “Ave”, “Crescent” and “Cres”. Check whether we have both terms in those pairs.

    該地址可能包含街道后綴,例如“街道”和“ St”,“大道”和“ Ave”,“新月”和“克雷斯”。 檢查在這些對中是否同時存在這兩個術語。
  2. Similarly, if we have state names in the address, we may need to make sure they are consistent, such as “VIC” and “Victoria”.

    同樣,如果地址中有州名,則可能需要確保它們是一致的,例如“ VIC”和“ Victoria”。
  3. Check the unit number presentation of the address. “Unit 9, 24 Abc St” should be the same as “9/24 Abc St”.

    檢查地址的單位編號顯示。 “ 9號街24號Abc St”應與“ 9/24 Abc St”相同。

R-理性 (R — Rationality)

  1. One example would be that the postcode must follow a certain format in a country. For example, it must be 4 digits in Australia.

    一個示例是郵政編碼必須在某個國家/地區遵循某種格式。 例如,在澳大利亞必須為4位數字。
  2. If there is a country in the address string, we may need to pay attention to that. For example, the dataset is about all customers in an international company that runs its business in several countries. If an irrelevant country appears, we might need to further investigate.

    如果地址字符串中有一個國家,我們可能需要注意。 例如,數據集涉及在多個國家/地區經營業務的一家國際公司中的所有客戶。 如果出現無關的國家,我們可能需要進一步調查。

A —原子性 (A — Atomicity)

  1. The most obvious atomicity issue is that we should split the address into several fields such as the street address, suburb, state and postcode.

    最明顯的原子性問題是我們應該將地址分成幾個字段,例如街道地址,郊區,州和郵政編碼。
  2. If our dataset is about households, such as a health insurance policy. It might be worth to pay more attention to those duplicated addresses of different data entries.

    如果我們的數據集是關于家庭的,例如健康保險單。 可能需要更多注意不同數據條目的那些重復地址。

我-誠信 (I — Integrity)

  1. This is about the entire data entry (row-wise) in the dataset. Of course, we need to drop the duplicated rows.

    這是關于數據集中的整個數據條目(按行)。 當然,我們需要刪除重復的行。
  2. Check whether there are any data entries that have their address missing. Depending on the business rules and data analysis objectives, we may need to eliminate the rows that miss addresses. Also, addresses should not be able to be derived usually, so we should not need to fill gaps because we cannot.

    檢查是否有任何數據條目的地址丟失。 根據業務規則和數據分析目標,我們可能需要消除丟失地址的行。 同樣,地址通常不能被導出,因此我們不需要填補空白,因為我們不能。
  3. If in the Atomicity checking, we found that some rows having exactly the same addresses and based on the business rules this can be determined as duplicated, we may need to combine or eliminate them.

    如果在原子性檢查中發現某些地址具有完全相同的行,并且根據業務規則可以將其確定為重復行,則可能需要合并或消除它們。

摘要 (Summary)

Image for post
Photo by Pexels on Pixabay
照片由Pexels在· Pixabay上的免費圖片

Here is the “CRAI” methodology I usually follow myself when I first time gets along with a new dataset.

這是我初次接觸新數據集時通常會遵循的“ CRAI”方法。

It turns out that this method is kind of generic, which means that it should be applicable in almost all the scenarios. However, it is also because of its universal, we should pay more attention to the domain knowledge that will influence how you apply CRAI to clean your datasets.

事實證明,這種方法是通用的,這意味著它應該適用于幾乎所有場景。 但是,也是由于它的通用性,我們應該更加注意將影響您如何應用CRAI清理數據集的領域知識。

That’s all I want to share in this article. Hope it helps :)

這就是我要在本文中分享的全部內容。 希望能幫助到你 :)

翻譯自: https://towardsdatascience.com/data-cleansing-where-to-start-90802e95cc5d

uni-app清理緩存數據

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/390623.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/390623.shtml
英文地址,請注明出處:http://en.pswp.cn/news/390623.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

高級人工智能之群體智能:蟻群算法

群體智能 鳥群: 魚群: 1.基本介紹 蟻群算法(Ant Colony Optimization, ACO)是一種模擬自然界螞蟻覓食行為的優化算法。它通常用于解決路徑優化問題,如旅行商問題(TSP)。 蟻群算法的基本步驟…

JavaScript標準對象:地圖

The Map object is a relatively new standard built-in object that holds [key, value] pairs in the order that theyre inserted. Map對象是一個相對較新的標準內置對象,按插入順序保存[key, value]對。 The keys and values in the Map object can be any val…

leetcode 483. 最小好進制

題目 對于給定的整數 n, 如果n的k(k>2)進制數的所有數位全為1,則稱 k(k>2)是 n 的一個好進制。 以字符串的形式給出 n, 以字符串的形式返回 n 的最小好進制。 示例 1: 輸入:“13” 輸…

圖像灰度變換及圖像數組操作

Python圖像灰度變換及圖像數組操作 作者:MingChaoSun 字體:[增加 減小] 類型:轉載 時間:2016-01-27 我要評論 這篇文章主要介紹了Python圖像灰度變換及圖像數組操作的相關資料,需要的朋友可以參考下使用python以及numpy通過直接操…

npx npm區別_npm vs npx —有什么區別?

npx npm區別If you’ve ever used Node.js, then you must have used npm for sure.如果您曾經使用過Node.js ,那么一定要使用npm 。 npm (node package manager) is the dependency/package manager you get out of the box when you install Node.js. It provide…

找出性能消耗是第一步,如何解決問題才是關鍵

作者最近剛接手一個新項目,在首頁列表滑動時就感到有點不順暢,特別是在滑動到有 ViewPager 部分的時候,如果是熟悉的項目,可能會第一時間會去檢查代碼,但前面說到這個是剛接手的項目,同時首頁的代碼邏輯比較…

bigquery_如何在BigQuery中進行文本相似性搜索和文檔聚類

bigqueryBigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.BigQuery可以加載TensorFlow SavedModel并執行預測。 此功能…

bzoj 1996: [Hnoi2010]chorus 合唱隊

Description 為了在即將到來的晚會上有吏好的演出效果&#xff0c;作為AAA合唱隊負責人的小A需要將合唱隊的人根據他們的身高排出一個隊形。假定合唱隊一共N個人&#xff0c;第i個人的身髙為Hi米(1000<Hi<2000),并已知任何兩個人的身高都不同。假定最終排出的隊形是A 個人…

移動應用程序開發_什么是移動應用程序開發?

移動應用程序開發One of the most popular forms of coding in the last decade has been the creation of apps, or applications, that run on mobile devices.在過去的十年中&#xff0c;最流行的編碼形式之一是創建在移動設備上運行的應用程序。 Today there are two main…

leetcode 1600. 皇位繼承順序(dfs)

題目 一個王國里住著國王、他的孩子們、他的孫子們等等。每一個時間點&#xff0c;這個家庭里有人出生也有人死亡。 這個王國有一個明確規定的皇位繼承順序&#xff0c;第一繼承人總是國王自己。我們定義遞歸函數 Successor(x, curOrder) &#xff0c;給定一個人 x 和當前的繼…

vlookup match_INDEX-MATCH — VLOOKUP功能的升級

vlookup match電子表格/索引匹配 (SPREADSHEETS / INDEX-MATCH) In a previous article, we discussed about how and when to use VLOOKUP functions and what are the issues that we might face while using them. This article, on the other hand, will take you to a jou…

java基礎-BigDecimal類常用方法介紹

java基礎-BigDecimal類常用方法介紹 作者&#xff1a;尹正杰 版權聲明&#xff1a;原創作品&#xff0c;謝絕轉載&#xff01;否則將追究法律責任。 一.BigDecimal類概述 我們知道浮點數的計算結果是未知的。原因是計算機二進制中&#xff0c;表示浮點數不精確造成的。這個時候…

節點對象轉節點_節點流程對象說明

節點對象轉節點The process object in Node.js is a global object that can be accessed inside any module without requiring it. There are very few global objects or properties provided in Node.js and process is one of them. It is an essential component in the …

PAT——1018. 錘子剪刀布

大家應該都會玩“錘子剪刀布”的游戲&#xff1a;兩人同時給出手勢&#xff0c;勝負規則如圖所示&#xff1a; 現給出兩人的交鋒記錄&#xff0c;請統計雙方的勝、平、負次數&#xff0c;并且給出雙方分別出什么手勢的勝算最大。 輸入格式&#xff1a; 輸入第1行給出正整數N&am…

leetcode 1239. 串聯字符串的最大長度

題目 二進制手表頂部有 4 個 LED 代表 小時&#xff08;0-11&#xff09;&#xff0c;底部的 6 個 LED 代表 分鐘&#xff08;0-59&#xff09;。每個 LED 代表一個 0 或 1&#xff0c;最低位在右側。 例如&#xff0c;下面的二進制手表讀取 “3:25” 。 &#xff08;圖源&am…

flask redis_在Flask應用程序中將Redis隊列用于異步任務

flask redisBy: Content by Edward Krueger and Josh Farmer, and Douglas Franklin.作者&#xff1a; 愛德華克魯格 ( Edward Krueger) 和 喬什法默 ( Josh Farmer )以及 道格拉斯富蘭克林 ( Douglas Franklin)的內容 。 When building an application that performs time-co…

CentOS7下分布式文件系統FastDFS的安裝 配置 (單節點)

背景 FastDFS是一個開源的輕量級分布式文件系統&#xff0c;為互聯網量身定制&#xff0c;充分考慮了冗余備份、負載均衡、線性擴容等機制&#xff0c;并注重高可用、高性能等指標&#xff0c;解決了大容量存儲和負載均衡的問題&#xff0c;特別適合以文件為載體的在線服務&…

如何修復會話固定漏洞_PHP安全漏洞:會話劫持,跨站點腳本,SQL注入以及如何修復它們...

如何修復會話固定漏洞PHP中的安全性 (Security in PHP) When writing PHP code it is very important to keep the following security vulnerabilities in mind to avoid writing insecure code.在編寫PHP代碼時&#xff0c;記住以下安全漏洞非常重要&#xff0c;以避免編寫不…

劍指 Offer 38. 字符串的排列

題目 輸入一個字符串&#xff0c;打印出該字符串中字符的所有排列。 你可以以任意順序返回這個字符串數組&#xff0c;但里面不能有重復元素。 示例: 輸入&#xff1a;s “abc” 輸出&#xff1a;[“abc”,“acb”,“bac”,“bca”,“cab”,“cba”] 限制&#xff1a; 1…

前饋神經網絡中的前饋_前饋神經網絡在基于趨勢的交易中的有效性(1)

前饋神經網絡中的前饋This is a preliminary showcase of a collaborative research by Seouk Jun Kim (Daniel) and Sunmin Lee. You can find our contacts at the bottom of the article.這是 Seouk Jun Kim(Daniel) 和 Sunmin Lee 進行合作研究的初步展示 。 您可以在文章底…