How to Implement a Successful Data Cleaning Process

Clean data is the foundation of discovery and insight. If the data is dirty, the effort your team puts into analyzing, cultivating, and visualizing it is a complete waste of time. Of course, dirty data isn’t new. It plagued decisions long before computers became commonplace. Now that computer technology pervades everyday life, the problem has only compounded.

The first thing a company needs to determine is whether it has dirty data in its midst. Fortunately, this is easily done. The answer is “Yes.” Everyone has dirty data, and that means you have dirty data. Now that we’re over that hurdle, the next two questions are more challenging to answer: “Which data is dirty?” and “How do we clean our data?”

Over the years I’ve seen many companies and teams mount a herculean effort to clean their data. These efforts involve large dedicated teams, major project schedules, and months or years of work. All of them have ended in much the same way: failure. At least, the original goal of having the company’s data clean at the end of the project was not met.

The truth is that the shape of the project tends to change. It morphs from a linear project, with a beginning, an end, and an expected deliverable of clean data, into a cyclical process that persists indefinitely. Even so, the ultimate goal of clean data for analysis and insights is achieved.

An example of failure…

Before I explain the steps involved in a successful cyclical process for cleaning data, let’s take a moment to explore why gargantuan data cleaning projects like those described above will always fail. The linear project approach to cleaning data rests on an inherent assumption that leads to repeated failure: that the data you are cleaning is somewhat static in nature. New data will enter your system, but it will be entered correctly and will not be dirty. And if dirty data is somehow introduced into the source systems again, the processes you have put in place will account for every form of dirty data that could come your way in the future.

This is not a realistic picture of how source systems work. For example, let’s say we have software used by the human resources department. It allows the HR staff to enter employee names and track each employee’s progress through all the on-boarding requirements, from the initial date of hire to being fully trained for the position.

This software has been written (or at least customized) by a third-party software development team your company hired to tailor it to your particular on-boarding processes. The software saves all data into a SQL transactional database, as we would expect. One of the fields in the database stores each employee’s score for each assessment throughout the on-boarding process. Of course, if a particular assessment has not been completed, no score is entered. Your initial review of the data reveals that incomplete assessments contain an empty string in the score field, whereas completed assessments contain an actual score value.

Your team implements cleaning transformations to ensure these empty string values are changed into NULL values in your analytics data tables, so that you can easily find and ignore them during your calculations. Problem solved! Or so you think…
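To make the transformation concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for illustration; the HR system in the story is not specified in that much detail, and your own database would likely use a different SQL dialect.

```python
import sqlite3

# Hypothetical source table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (employee TEXT, score TEXT)")
conn.executemany(
    "INSERT INTO assessments VALUES (?, ?)",
    [("ana", "90"), ("ben", ""), ("cy", "70")],  # "" marks an incomplete assessment
)

# Cleaning transformation: empty string -> NULL, everything else -> a number.
conn.execute(
    """
    CREATE TABLE analytics_assessments AS
    SELECT employee, CAST(NULLIF(score, '') AS REAL) AS score
    FROM assessments
    """
)

rows = conn.execute(
    "SELECT employee, score FROM analytics_assessments ORDER BY employee"
).fetchall()

# AVG skips NULLs, so only the two completed assessments are averaged.
average = conn.execute(
    "SELECT AVG(score) FROM analytics_assessments"
).fetchone()[0]
```

NULLIF returns NULL when its two arguments are equal, which is why the empty string disappears from the analytics table while real scores pass through unchanged.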

Months after your data cleaning efforts are complete, one of the HR employees finds an issue in the HR software itself and raises the concern with the software development team. The third-party team chooses to resolve the problem by changing the behavior of their software: it no longer uses an empty string for incomplete assessment scores and instead uses a zero as the incomplete placeholder. At first, this seems to solve the problem for HR, but soon the reports HR receives from the analysts begin reflecting extremely low score averages across the on-boarding groups.

When your team is finally asked to investigate, you realize the issue comes from the averaging method being used. Since empty strings were initially the incomplete indicators and you converted them to NULLs, the average of the scores could be computed with a simple average function, because it ignores NULLs. It does not ignore zeros. Therefore, the averages now include the zeros from the incomplete assessments.
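The effect is easy to reproduce. The sketch below, again with invented table and column names, averages the same three assessments before and after the vendor’s change, and adds one simple check that could have flagged the change early. Whether a zero is ever a legitimate score is an assumption you would need to confirm with HR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (era TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [
        ("before", 90.0), ("before", None), ("before", 70.0),  # incomplete = NULL
        ("after", 90.0), ("after", 0.0), ("after", 70.0),      # incomplete = 0
    ],
)

# AVG ignores NULL but counts 0 as a real score.
averages = dict(conn.execute("SELECT era, AVG(score) FROM scores GROUP BY era"))

# A simple recurring check: a sudden appearance of zeros deserves a question.
zero_count = conn.execute(
    "SELECT COUNT(*) FROM scores WHERE score = 0"
).fetchone()[0]
```

The "before" group averages 80, while the "after" group drops to roughly 53.3 even though no employee performed any worse.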

Back to the drawing board!


A cyclical cleaning process will work

While the above example is a simple one and real-life problems are much more complex, it clearly demonstrates the problems inherent in linear data cleaning approaches.

In reality, we constantly change and add to the processes affecting the data throughout our business, and so we continually create new opportunities for problematic data. To make matters worse, we not only change the processes, systems, sources, fields, and numerous other elements involved in the collection of data, we also add new data at an ever-increasing rate.

By the time you’ve created a method for cleaning the data and implemented it, many or all of these factors have changed, many times over.

The linear data cleaning path is doomed to failure and should be abandoned. Instead, let’s begin with a systemized version of the cyclical process everyone ends up adopting anyway. Rather than wasting time, energy, and resources on the linear version and then landing, like everyone else, on a disorganized cyclical method compounded by your team’s feeling of failure, you can structure the cyclical process correctly from the beginning. Everyone will be successful, know why they are successful, and do their work with a feeling of success. Doesn’t that sound better?

[Image: Rod Castor]

Think of Your Cleaning Process like ER Triage

The cyclical process for data cleaning is rather simple. It’s composed of 5 stages similar to those of a hospital emergency room. While hospitals vary in their exact implementation of ER procedures, the same basic stages apply. The first phase is triage. In this phase, medical professionals assess the patient, assign the patient a priority based on the severity of the condition, and assign the patient to the proper group for treatment. The second phase is treatment. It consists of actually treating the patient and assigning any required follow-up, both before and after the patient’s discharge.

Using these ER and triage processes as a guide, let’s consider what cyclical data cleaning looks like.

Analyze: A Patient Walks into the ER

People don’t just show up at the ER to hang out or get a cup of coffee. They are there for a reason. The context of the visit is unmistakable even if the cause of their ailment is yet to be diagnosed. The same is true for data cleaning.

One of the difficulties with the more traditional linear approach to data cleaning is the abstract nature of the situation. Let’s think about this for a minute. If we’re looking at data in our database but not actually trying to analyze anything for the company (e.g., monthly sales trends or attrition), dirty data may or may not be obvious. But once we begin calculating revenue, profit, churn, and the other typical items our business wants to know, dirty data rears its ugly head. Data, out of context, can easily pass for clean data. So, in the linear approach, we often miss many data fields that actually contain dirty data. The resulting “clean” data at the end of a linear cleaning project will need to be revisited the first time an analyst discovers that the numbers in the profit column simply do not add up.

In the cyclical process of data cleaning, we begin with analysis. Just go ahead and turn the analysts loose on the data. Tell them to make requests of the data engineers, work their analytical magic, and not be bashful about raising questions when something in the data doesn’t look correct. This is the help and context our cleaning process desperately needs in order to correctly and holistically clean the data we currently have.

Whenever an analyst finds something concerning in the data, this becomes a reason to send the patient to the ER. It’s the stomach pain, shortness of breath, or high fever that brings the data to you for cleaning. Of course, once an analyst sends a “data” patient to your ER, your data triage staff must be ready to perform.

Assess: Stick out your tongue and say ‘Ahhh.’

When someone walks into the emergency room, the first thing the medical team does is assess the patient. They take the patient’s temperature and blood pressure, itemize the medications the patient is taking, get a description of the symptoms, and so on. The same is true when beginning to assess a “data” patient.

[Image: Rod Castor]

Of course, in a medical situation, most of this information is compared to a known normal range. Nurses and doctors know if your temperature or blood pressure is higher than the acceptable range. Those normals have already been established. But in a data situation, the normals may need to be established before you can verify the patient’s health.
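One simple way to establish a normal is to derive a range from values the business already trusts and then flag anything outside it. The sketch below is only an illustration: the function names, the 10% margin, and the sample values are all assumptions, and a real range should be agreed with the analysts and business owners of the data.

```python
def establish_normal_range(trusted_values, margin=0.10):
    """Derive an acceptable range from values already verified as correct.

    The margin widens the observed range slightly; 10% is an arbitrary choice.
    """
    low, high = min(trusted_values), max(trusted_values)
    pad = (high - low) * margin
    return low - pad, high + pad


def flag_abnormal(values, normal_range):
    """Return the non-missing values that fall outside the normal range."""
    low, high = normal_range
    return [v for v in values if v is not None and not (low <= v <= high)]


# Hypothetical assessment scores that HR has confirmed are valid.
trusted = [62, 71, 75, 80, 88, 93]
normal_range = establish_normal_range(trusted)

# New data to examine; None represents a missing value and is not flagged here.
suspects = flag_abnormal([78, 0, 85, None, 400], normal_range)
```

Flagged values are only symptoms. Deciding whether a 0 is a placeholder or a genuinely poor score still requires the context gathered from the analyst.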

Your “data” triage team will assess the patient sent to them by the analyst (the referring physician). In this stage of our cyclical process, your team will work alongside the analyst, and possibly other business team members, to verify the data is indeed dirty. You will also want to assess the business cases that use this data and its impact on the company or on subsections of the company.

The first question to ask in this phase is, “What should this data look like: summed, averaged, used as a dimension, or something else?” Without a proper understanding of what the data should be, we have a very high likelihood of making changes that do not clean the data. We may merely change the data into another form of dirty data.

Next, you will need to understand the impact and importance of the data. Does the CEO or CFO use this data to make market decisions or product decisions, or to report company progress to the Street? Is this data used by customer care to better your clients’ experience? Does marketing use this data to plan its next ad strategy? Is this data stored and used for trend analysis later this year for the board of directors? There are endless possibilities, but the impact and importance of the data will help you properly prioritize this patient.

Assign Priority: Even emergencies have degrees of severity

Even patients in the ER have varying degrees of urgency. One patient may be feeling nauseous while another has extreme chest and arm pain. Yet another may be unconscious or suffering from burns. While they all need medical attention, there are only so many medical professionals present to assist them, and the worst situation merits the top priority. Of course, many factors are used to determine the severity of someone’s ailment and the priority assigned as a consequence. The same is true for “data” patients.

Once you understand, in context, what the data should be, along with its importance and impact, you will need to assign it a priority. All teams have limited resources, and your data cleaning team is no different. This exposes yet another problem with the linear cleaning approach. When you clean data linearly, every identified occurrence of dirty data tends to get equal priority, because it becomes part of a large project that will only consider the data clean at the end. Tackling all problems with equal priority significantly delays the team in releasing its most important and impactful results.

[Image: Rod Castor]

Your data team will need to establish its own rules for prioritization. Perhaps anything coming from a customer’s request or trouble ticket jumps to the top of the list. Or maybe C-suite requests receive top priority. Every company culture has its own dynamic, and your team should work together with company leaders to determine the best course of action for prioritizing requests. Once those criteria are defined, your team should combine them with the context, impact, and importance gathered in the assessment and assign each request a priority accordingly.
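Once the rules exist, applying them can be mechanical. The following sketch uses Python’s heapq module as a priority queue. The requester weights, the 1 to 5 impact scale, and the sample requests are hypothetical; your own criteria would come out of the conversation with company leaders described above.

```python
import heapq
from itertools import count

# Hypothetical rule: customer requests first, then C-suite, then internal.
REQUESTER_WEIGHT = {"customer": 3, "c_suite": 2, "internal": 1}
_arrival = count()  # tie-breaker so equal priorities are served in arrival order


def admit(queue, name, requester, impact):
    """Add a "data" patient; impact is an assumed scale from 1 (low) to 5 (high)."""
    # heapq is a min-heap, so values are negated to pop the most urgent first.
    priority = (-REQUESTER_WEIGHT[requester], -impact, next(_arrival))
    heapq.heappush(queue, (priority, name))


def next_patient(queue):
    """Remove and return the name of the most urgent request."""
    return heapq.heappop(queue)[1]


queue = []
admit(queue, "duplicate customer ids", "internal", 4)
admit(queue, "wrong invoice totals", "customer", 5)
admit(queue, "board report churn figure", "c_suite", 5)
admit(queue, "misspelled city names", "customer", 2)

order = [next_patient(queue) for _ in range(len(queue))]
```

With these particular weights, both customer requests are treated before the C-suite request, regardless of impact. If that ordering feels wrong for your company, that is a signal to change the rule, not to work around the queue.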

Determine the Proper Processes and Team(s) to Clean the Data

Hospitals often employ an ER doctor for each shift. This doctor will see every patient and address every condition coming into the ER during her shift. However, should an ER patient come in who requires a specialist, the shift doctor will work to stabilize the patient and then assign him to a department. The doctor on call for that department will take over the case. If a patient requires multiple specialists, multiple departments can be assigned to assist in the overall treatment plan.

The practice of assigning a generalist, or possibly even multiple specialists, is the same when cleaning data. You must determine the best teams to involve for the best outcome. To do this, your team of data specialists, or perhaps generalists, needs to learn more about the dirty data in question.

Begin by determining the sources of this data and any transformations the data undergoes before it’s ultimately saved in the database for the analyst to use. Again, your team may need to secure the assistance of data engineers, data wranglers, database administrators, or more business team members to properly assess the patient. You might even enlist the help of a software developer if you determine the application behind the source data is saving it to the database incorrectly. Do not be afraid to work with others and ask for assistance.

By outlining the source systems for the data and any ETL, you can more easily and quickly acquire the necessary resources from the correct teams. If it’s accounting data, you may need to engage the data team for those servers. Or if it’s geoscience data, you may need to reach out to some of the geoscientists to understand the ETL process they helped design with the data engineers. You get the idea.
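Even a very small, hand-maintained lineage record makes this lookup quick. The sketch below is one possible shape for it; every field name, system name, ETL job name, and team name is invented for illustration.

```python
# Hypothetical lineage catalogue: all names below are invented examples.
LINEAGE = {
    "finance.profit": {
        "source_system": "accounting_db",
        "etl_job": "nightly_finance_load",
        "teams": ["accounting data team", "data engineering"],
    },
    "geo.well_depth": {
        "source_system": "geoscience_store",
        "etl_job": "weekly_geo_import",
        "teams": ["geoscientists", "data engineering"],
    },
}


def teams_to_involve(dirty_fields):
    """Return the distinct teams behind the given fields, sorted by name."""
    teams = set()
    for name in dirty_fields:
        teams.update(LINEAGE[name]["teams"])
    return sorted(teams)


involved = teams_to_involve(["finance.profit", "geo.well_depth"])
```

A dictionary in a shared repository is enough to start with. The point is that the answer to “who do we call?” is written down before the next patient arrives.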

Performing this upfront work places your “data” patient in the best possible care and ensures the quickest “recovery,” or cleaning time.

Determine the Necessary Steps to Clean the Data

With the proper team in place, the treatment of the patient can now begin. The assembled team will need to decide how best to clean the data for use and storage in the database used by the analysts. This could be a fast process, or it could take some time, depending on the complexity of the existing ETL or the dirty nature of the source systems involved.

[Image: Rod Castor]

Once the necessary steps have been identified, tested, and agreed upon, they need to be documented and, of course, implemented. It’s always best to use a development, test, and production environment architecture and implement the change in development first. Then promote it to the testing environment, and only once everything is verified as correct, promote the final solution to production. These environments and deployment steps differ from company to company, so you’ll need to follow your organization’s outlined process.

Automate the Cleaning Steps you just Implemented

In a hospital, once patients have been treated and are considered well enough for discharge, they may still have follow-up appointments or tasks to perform, like taking medication. With data cleaning, multiple follow-up tasks may also be necessary, but the one follow-up task that is always required is automation.

No matter what your company’s change process is, the one thing you must do is make the changes persistent through automation. Automation comes in many forms. You may need to change the existing ETL process or introduce an automated process that cleans the data after ETL. Automation can be achieved through any language or system that works for you and your company: SQL, Python, C#, SAS, and the list goes on. A common automation system for companies using Microsoft products is SQL Server Integration Services (SSIS). The scheduled execution of these tasks can be as simple as cron, Windows Task Scheduler, or SQL Server Agent. It doesn’t necessarily need to be sophisticated. But it needs to be automated.
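As a minimal sketch of what “automated” can mean, the cleaning step below is written so that it is safe to run on every schedule tick: a second run finds nothing left to fix. The table, the sample rows, and the script path in the cron comment are assumptions for illustration, and an in-memory SQLite database stands in for your real one.

```python
import sqlite3

# A crontab entry such as the following (hypothetical path) would run the
# step nightly at 02:00:
#   0 2 * * * /usr/bin/python3 /opt/jobs/clean_scores.py


def clean_scores(conn):
    """Idempotent cleaning step: replace empty-string scores with NULL.

    Returns the number of rows changed, which is worth logging on each run.
    """
    cur = conn.execute("UPDATE assessments SET score = NULL WHERE score = ''")
    conn.commit()
    return cur.rowcount


# Stand-in database with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (employee TEXT, score TEXT)")
conn.executemany(
    "INSERT INTO assessments VALUES (?, ?)",
    [("ana", "90"), ("ben", "")],
)

first_run = clean_scores(conn)   # fixes the one dirty row
second_run = clean_scores(conn)  # nothing left to fix
```

Logging the returned row count costs nothing, and a sudden change in that number is often the first visible symptom that a source system has changed its behavior.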

If you allow the cleaning process to remain manual, you will very quickly overwhelm your team with recurring manual work, and any hope of taking on new data cleaning efforts will be lost.

I’ve said this elsewhere, but the quickest way to render your team useless is to overwhelm them with recurring manual work. All exploratory, cleaning, and initial data wrangling work is manual. But once the process is clearly defined, it must be automated if your team has any hope of continuing to impact your company’s insights and decisions.

Repeat

You’ve now taken the cleaning process all the way through: analyzing, assessing, prioritizing, assigning teams, establishing the cleaning steps, and automating them. It’s time to repeat the process. Analysts will confirm the result of your team’s work and be thankful, but they will also send you new “data” patients, and the process starts again for these new patients who are ill and need your healing touch.

A final reminder to formally establish your cyclical data cleaning process…

The doctors and nurses saving lives in the hospital emergency room aren’t just winging it. They’ve been training for years. They can practically perform their work in their sleep because it has been ingrained in them through hours of formalized training. Not only have they been diligently trained in medical procedures and knowledge; the policies and procedures used in the ER to triage, admit, prioritize, and treat a patient have also been thoroughly studied and formalized, in order to give any patient coming through their doors the best chance of survival.

Medical personnel and hospital administrators know that a strong, formalized plan leaves much less chance of error, and that leads to greater success. The exact same benefits of formalizing a plan hold for data cleaning.

As outlined in the opening of this article, large linear projects for data cleaning typically end up with this same cyclical process anyway. The linear process failed them, and because the work must still be accomplished, the team tackles each problem as it comes to light. But arriving at this process after a failed linear attempt often leaves it in an informal state. It’s really just performed in an ad hoc manner.

Avoid this pitfall!


Without formalization, the process outcomes will ebb and flow between success and failure. The successful cleaning of each newly discovered dirty data issue will always be in doubt. Your team will work inconsistently, and their motivation for the job will come and go. Even if your ad hoc approach begins to work well over time as your team gels, it’s upended every time a team member leaves or a new member is hired. And here’s why: there’s no plan to follow.

Some days your team will knock it out of the park when they are feeling good about what they do. But on other days the recurring work, and the memory of failure from the initial data cleaning project, will push them to question their work and the nature of their job.

“Why can’t we create a process to fix all of this instead of constantly facing a fire every day?”


“Why doesn’t management care that we are already overworked? They just keep sending us more requests.”


“When will we ever catch up?”


If you take the time and make the effort to formalize the cyclical process of cleaning your data, your team will have a roadmap to follow, and the other departments in your organization will have a guide for properly interacting with your team. In many ways, it’s just perspective, but formalizing the process removes the ambiguity and gives purpose to the work being done. Formalizing is a necessary step in ensuring a consistently successful outcome to data cleaning for your team.

Rod Castor helps companies Get Analytics Right! He works with both international organizations and small businesses to start or improve their efforts in data analytics, data science, tech strategy, and tech leadership. In addition to consulting, Rod also enjoys public speaking, teaching, and writing. You can discover more about Rod and his work at rodcastor.com and appliedai.us.

Translated from: https://towardsdatascience.com/how-to-implement-a-successful-data-cleaning-process-701e565e6575
