Agile Data Science
TL;DR;
- I have encountered a lot of resistance in the data science community to agile methodology, and to the Scrum framework specifically;
- I don't see it this way and claim that most disciplines would improve by adopting an agile mindset;
- We will go through a typical Scrum sprint to highlight the compatibility of the data science process and the agile development process;
- Finally, we discuss when Scrum is not an appropriate process to follow: if you are a consultant working on many projects at a time, or if your work requires deep concentration on a single, narrow issue (narrow enough that you alone can solve it).
I recently came across a Medium post which claims that Scrum is awful for data science. I'm afraid I have to disagree, and I would like to make a case for Agile Data Science.
The ideas in this post are heavily influenced by the book Agile Data Science 2.0 (which I highly recommend) and by personal experience. I am eager to hear about other experiences, so please share them in the comments.
First, we need to agree on what data science is and how it solves business problems so we can investigate the process of data science and how agile (and specifically Scrum) can improve it.
What is Data Science?
There are countless definitions online. For example, Wikipedia gives such a description:
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.
In my opinion, it is quite an accurate definition of what data science tries to accomplish. But I would simplify this definition further.
Data Science solves business problems by combining business understanding, data and algorithms.
Compared to the definition in Wikipedia, I would like to stress that data scientists should aim to solve business problems rather than “extract knowledge and insights.”
How Does Data Science Solve Business Problems?
So data science is here to solve business problems. We need to accomplish a few things along the way:
- Understand the business problem;
- Identify and acquire available data;
- Clean, transform, and prepare the data;
- Select and fit an appropriate "model" for the given data;
- Deploy the model to "production" (this is our attempt at solving the given problem);
- Monitor performance.
As with everything, there are countless ways to go about implementing those steps, but I will try to persuade you that the agile (incremental and iterative) approach brings the most value to the company and the most joy to data scientists.
Agile Data Science Manifesto
I took this from page 6 in the Agile Data Science 2.0 book, so you are encouraged to read the original, but here it is:
- Iterate, iterate, iterate: tables, charts, reports, predictions.
- Ship intermediate output. Even failed experiments have output.
- Prototype experiments over implementing tasks.
- Integrate the tyrannical opinion of data in product management.
- Climb up and down the data-value pyramid as you work.
- Discover and pursue the critical path to a killer product.
- Get meta. Describe the process, not just the end state.
Not all the steps are self-explanatory, and I encourage you to go and read what Russell Jurney had to say, but I hope that the main idea is clear: we share intermediate output, and we iterate to achieve value.
Given the above preliminaries, let us go over a standard week for a Scrum team. We will assume a one-week sprint.
Scrum Team Sprint
Day 1
There are many variations of sprint structure, but I will assume that planning is done on Monday morning. The team decides which user stories from the product backlog will be moved to the Sprint backlog. The most pressing issue for our business, as evident from the backlog ranking, is customer fraud: fraudulent transactions are driving our valuable customers off our platform. During the previous backlog refinement session, the team already discussed this task, and the product owner got additional information from the Fraud Investigation team. So during the meeting, the team decides to start with a simple experiment (and is already thinking of interesting iterations further down the road): an initial model based on simple features of the transaction and the participating users. Work is split so that the data scientist can have a look at the data the team identified for this problem. The data engineer will set up the pipeline for integrating model output into the data warehouse (DWH) systems, and the full-stack engineer starts to set up a transaction review page and an alert system for the Fraud Investigation team.
Day 2
At the start of Tuesday, the whole team gathers and shares progress. The data scientist shows a few graphs which indicate that even with limited features, we will have a decent model. At the same time, the data engineer is already halfway through setting up the system to score incoming transactions with the new model. The full-stack engineer is also progressing nicely, and after just a few minutes, everyone is back at their desks working on the agreed tasks.
Day 3
As with Tuesday, the team starts Wednesday with a standup meeting to share their progress. A simple model has already been built, along with some accuracy and error rate numbers. The data engineer shows the infrastructure for transaction scoring, and the team discusses how the features arrive at the system and what needs to be done to make them ready for the algorithm. The full-stack engineer shows the admin panel where transaction metadata is displayed, along with the triggering mechanism. Another discussion follows on the threshold value at which the model output triggers a message to a fraud analyst. The team agrees that this value needs to be adjustable, since different models might have different score distributions, and, depending on other variables, we might want to increase or decrease the number of approved transactions.
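The adjustable threshold the team discusses could be sketched as follows. This is a minimal illustration rather than the team's actual implementation: the names `AlertConfig` and `flag_for_review`, the default of 0.8, and the example scores are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class AlertConfig:
    # Hypothetical setting: kept outside the model so it can be tuned
    # without retraining when the score distribution changes.
    threshold: float = 0.8


def flag_for_review(scores, config):
    """Return ids of transactions whose fraud score meets the threshold."""
    return [tx_id for tx_id, score in scores.items() if score >= config.threshold]


# Made-up fraud scores keyed by transaction id (illustrative only).
scores = {"tx1": 0.95, "tx2": 0.40, "tx3": 0.81, "tx4": 0.79}

strict = flag_for_review(scores, AlertConfig(threshold=0.8))
lenient = flag_for_review(scores, AlertConfig(threshold=0.5))
print(strict)
print(lenient)
```

Lowering the threshold from 0.8 to 0.5 sends one more transaction (`tx4`) to the analysts, which is the kind of trade-off the team wants to be able to control.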
Day 4
On Thursday, the team already has all the pieces, and during the standup they discuss how to integrate them. The team also outlines how best to monitor models in production, so that model performance can be evaluated and degradation detected before it causes any real damage. They agree that a simple dashboard for monitoring accuracy and error rates will suffice for now.
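The numbers behind such a dashboard could be computed along these lines. This is only a sketch: the function name `monitoring_metrics` and the choice of false positive and false negative rates as the "error rates" are assumptions, and the example labels are made up.

```python
def monitoring_metrics(predicted, actual):
    """Compute accuracy and error rates from fraud predictions and confirmed outcomes.

    Both arguments are equal-length sequences of booleans (True means fraud).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # fraud correctly flagged
    tn = sum(1 for p, a in pairs if not p and not a)  # legitimate, not flagged
    fp = sum(1 for p, a in pairs if p and not a)      # legitimate but flagged
    fn = sum(1 for p, a in pairs if not p and a)      # fraud that was missed

    return {
        "accuracy": (tp + tn) / len(pairs),
        # A rate is None when its denominator is zero (no such cases observed).
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Made-up predictions and analyst-confirmed outcomes (illustrative only).
predicted = [True, False, True, False, False]
actual = [True, False, False, True, False]
metrics = monitoring_metrics(predicted, actual)
print(metrics)
```

Tracking these values per day or per week would let the team see degradation as a trend rather than as a single number.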
Day 5
Friday is demo day. During standup, the team discusses the last remaining issues with the first iteration of transaction fraud detection. Team members prepare for the meeting with the fraud analysts who will be using this solution.
During the demo, the team shows what they have built for the fraud analysts. The team presents performance metrics and their implications for the fraud analysts. All feedback is converted to tasks for future sprints.
Another vital part of the Sprint is the retrospective, a meeting where the team discusses three things:
1. What went well in the Sprint;
2. What could be improved;
3. What we will commit to improving in the next Sprint.
Further down the road
During the next Sprint, the team works on the next most important item from the product backlog. It might be feedback from the fraud analysts, or it might be something else that the product owner thinks will improve the overall business the most. Meanwhile, the team closely monitors the performance of the initial version of the solution, and it will continue to do so because ML solutions are sensitive to changes in the assumptions the model made about the data distribution.
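One way to watch for such changes in the data distribution is to compare a feature's recent values against the values seen at training time. The sketch below uses the Population Stability Index (PSI), which is a technique chosen for this example and not one named in the original post; the function name, the bin count, and the sample data are likewise assumptions.

```python
import math


def population_stability_index(expected, observed, bins=10):
    """PSI between a baseline sample and a recent sample of one numeric feature."""
    if not expected or not observed:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = proportions(expected)
    o = proportions(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))


# Made-up transaction feature: baseline values and a shifted recent sample.
baseline = [i / 100 for i in range(100)]
recent = [v + 0.3 for v in baseline]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, recent)
print(psi_same, psi_shifted)
```

A PSI of zero means the two samples fall into the bins identically, and larger values indicate a larger shift. The level at which the team raises an alert is a judgment call that would need to be agreed with the stakeholders.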
Discussion
Above is a relatively "clean" exposition of the Scrum process for data science solutions. The real world is rarely that tidy, but I wanted to convey a few points:
- Data Science cannot stand on its own. To impact the real world, we have to collaborate in a cross-functional team, so data science should be part of a wider team;
- Iteration is critical in data science, and we should expose the artifacts of those iterations to our stakeholders to receive feedback as fast as possible;
- Scrum is a framework designed for iterative progress, which makes it a perfect fit for data science work.
However, it is not a framework for every endeavor. If your job requires you to think deeply for days, then Scrum and agile would probably be very disruptive and counterproductive. Also, if your work requires you to handle many small, unrelated data science tasks, following Scrum would be inappropriate, and Kanban may be worth considering instead. Typical product data science work is not like that, though. Iteration is king, and getting feedback fast is key to providing the right solutions to business problems.
In summary
Data science is a perfect fit for Scrum with a single modification: we do not expect to ship finished models. Instead, we ship artifacts of our work and solicit feedback from our stakeholders so we can make progress faster. Project managers might dislike data science for the unpredictability of its progress, but iteration is not at fault; it is the only way forward.
I would like to know what you think about agile data science. What has worked for you and your team? What didn't work? I hope you will leave a comment!
Original article: https://towardsdatascience.com/agile-data-science-data-science-can-and-should-be-agile-c719a511b868