How Important Domain Knowledge Really Is in Data Science
Jeremie Harris: “In a way, it’s almost like a data scientist or a data analyst has to be like a private investigator more than just a technical person.”
Ian Scott: “Well, for sure. I have nothing against Kaggle, but when people say Kaggle, I have done this competition, I kind of roll my eyes, because they’re focusing on sort of 10% of the problem. (…) A lot of these projects go to die, because 90% of the other things weren’t done correctly, including stakeholder alignment.”
Reading this, you might think that Ian Scott, Chief Science Officer at Omnia AI, Deloitte Canada, considers Kaggle competitions useless. Think again. This is only an excerpt from Jeremie Harris’s interview with Ian for the TDS podcast (minute 14), and it doesn’t show the bigger picture. I will come back to it later in the article, when discussing the importance of context and what can happen when information is taken out of its context.
In this article I try to answer the question “How important is domain knowledge really for a data scientist?”, which came to mind while I was listening to this interview. I do so by looking at both the insights from their discussion and job postings for Data Scientists in the US, in fields such as Health Care, Financial Services, Retail, Real Estate, Government Administration, Market Research, and Information Technology.
The importance of context in Data Science
In the interview, Ian talks about things that are in the data, which can be analyzed and provide valuable information, and things that are not captured in the data and which, as a result, the models built on it will not take into account. How can data scientists deal with these cases? How can features be designed so that all relevant information and contextual data is included in the model?
This is where domain knowledge comes in. Ian mentions an approach they sometimes use to get a better understanding of the problem being solved and of the desired outputs: putting the members of his team and the clients in the same room for several hours, so that they can work together on understanding the real problem, understanding the shared output, and building the context collectively.
The key word here is context. Ian: “Data trumps context, trumps algorithm. Algorithm is the third thing I think about, not the first thing. What data do you have, give me the context of the data, the things that are in the data, that are about the data that I don’t see in the data.” (minute 7)
One way to interpret this could be:
Data is the starting point: we first check what data is available and what we already have.
Context. Some say that context is as important as the data itself. To capture it, we must go beyond what the data shows us and look at things such as changes in the political landscape. Or, as I heard in an example given by Michael Bamberger, Independent Evaluation Consultant, in a presentation on “Evaluation in the Age of Big Data”, information collected from phone apps might be of little use for finding insights about women in certain areas, as their access to phones might be restricted by their partners or other family members. That is why we need a good grasp of the contextual factors of the problems we wish to solve through Data Science, factors that might even influence the results. Another example would be to look only at part of this interview with Ian Scott. Someone who looked only at the statement “Well, for sure. I have nothing against Kaggle, but when people say Kaggle, I have done this competition, I kind of roll my eyes.”, without listening to the entire interview, would think that Ian Scott sees no value in Kaggle competitions. Instead, it seems to me that he only wants to underline that technical skills alone are not enough, and that the way people think and their ability to go for the simple answers matter too: “we ask a lot of questions around people’s thought processes, as opposed to what they know”. This is what taking information out of context means, and it can sometimes be very harmful.
Algorithm: an algorithm that now makes great use of both the data available and its context.
Interpretation and presentation of results. I am adding this fourth element because I believe that, here, context should take center stage again. Results should be interpreted and presented in accordance with the needs of stakeholders and clients, and Data Scientists need to be able to communicate their results adequately, even to non-technical audiences.
I remember how, in my first evaluation job, I had to quickly learn a lot of domain-specific knowledge. Even though I was part of a team with specialists from that field, I needed to know what they were talking about and be able to build an evaluation model that was appropriate to that specific context. So, beyond working on the task, I read up on the topic as much as I could.
Later on, in another case, when I had to contribute to an evaluation of a healthcare program, I felt that evaluators with previous experience in that field were ahead of me. Yes, I knew evaluation methods, but I knew nothing about the history of the program or its specifics. However, the structure and dynamics of the team were different, and the input expected from me was technical rather than specific to the subject matter. Within the team, we complemented each other and managed to provide the needed outputs.
A look into Data Science jobs in the US
Now, let’s look at some Data Science job postings from different fields, as they appeared on LinkedIn (on August 16 and 17).
We will start with Hospital and Health Care, the field where I would most expect domain knowledge to be required.
Hospital and Health Care

For this topic, I looked at job postings from all levels of experience: internships, entry-level jobs, Associate/ Mid-Senior level and Executive and Director levels.
At the highest seniority levels (Executive and Director), there were only five job postings published in total. Three of them listed previous experience and knowledge in Healthcare as a must, one mentioned it as an advantage for candidates, and one didn’t mention it at all.
For the Associate/Mid-Senior levels, I looked at the postings from the last 24 hours to get a feel for how much domain knowledge is required. Of the 10 relevant job announcements, 6 asked explicitly for previous experience in the Health Care industry, 2 considered it preferable or a plus, and the other 2, which were in fact from Health, Wellness and Fitness, did not ask for domain-related knowledge at all.
For those asking for previous experience, the requirements read as follows: “Familiarity with a health care environment and EHR data”; “Familiarity with genomics and medical imaging data”; “A year or more in healthcare, preferably at a healthcare provider or health system”; “2+ years experience working with healthcare data (EHR/EMR healthcare information systems preferred)”, etc.
Regarding entry-level jobs, the situation is similar to that of senior level jobs. There are postings that ask for previous experience and some that don’t. Example: “Must be knowledgeable of healthcare terminology, concepts and value chain, with intermediate knowledge of one or more key healthcare domains, e.g. client, customer, product, provider, and/or claim.”
When it comes to internships, things change, as there are no references to previous experience or educational background in the healthcare field. Besides the usual technical skills, some ask for a genuine interest in the industry or in their mission.
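As a minimal sketch, the counts reported above for the Executive/Director and Associate/Mid-Senior levels can be turned into shares with a few lines of Python. The dictionary layout, the `shares` helper, and the labels "required", "preferred" and "not mentioned" are shorthand chosen for this illustration, not LinkedIn fields, and with samples this small the percentages are only indicative.

```python
# Counts copied from the paragraphs above (LinkedIn, Hospital and Health Care).
# The category labels are illustrative shorthand, not LinkedIn fields.
postings = {
    "Executive/Director": {"required": 3, "preferred": 1, "not mentioned": 1},
    "Associate/Mid-Senior": {"required": 6, "preferred": 2, "not mentioned": 2},
}


def shares(counts):
    """Return each category's share of the postings at one level, as a fraction."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for level, counts in postings.items():
    total = sum(counts.values())
    summary = ", ".join(
        f"{label}: {share:.0%}" for label, share in shares(counts).items()
    )
    print(f"{level} ({total} postings) -> {summary}")
```

Keeping the raw counts next to the percentages makes it harder to over-read a share that rests on only five or ten postings.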
Financial Services

In this case, even though I applied a filter for jobs in Financial Services, LinkedIn also showed results from other sectors, such as Retail, Staffing and Recruiting, or Information Technology. However, from the postings I managed to look at, I noticed that some jobs do ask for things such as: product design; experience in providing business insights; passion for building data products, particularly in the real estate/financial space; proven ability to apply findings to business problems to lift revenue and profits; and business acumen (leveraging business judgment to shape strategy, based on an understanding of operational, financial, and organizational requirements and capabilities). Here, as in Retail and Real Estate, some job postings ask for previous experience in a business setting, or mention it as preferable, but there are also jobs that list only the necessary technical requirements.
Government Administration
Here, there are job postings of both types: some that don’t require subject matter expertise, and some that ask for it. In addition, they might require candidates to be U.S. citizens and to pass a background check, which is to be expected, as the employers are government agencies.
Sectors that are less likely to ask for specific domain knowledge, as they appear on LinkedIn: Staffing and Recruiting, Information Technology, Internet, Market Research, Marketing & Advertising, Media Production.
Conclusions and key takeaways
Domain knowledge definitely helps in making better sense of the data and of the problem’s context. In some cases, Data Scientists might also need strong subject matter expertise in addition to technical skills; in other cases, depending on the industry or on how the organization they work for is structured, they might not. In cases such as Omnia AI (Deloitte Canada), even though Data Scientists are required to have certain soft skills in addition to technical ones, they work within larger teams and can draw on the expertise of their team members or even clients. In some cases, as in evaluation, teams can be of great use, and what Data Scientists don’t know can be covered with the help of subject matter experts.
When it comes to Data Science job postings, some of them do ask for previous experience in a specific field, but not all. Some sectors require subject matter expertise more strongly (such as Health Care), and some mostly require technical skills and expertise (such as Market Research or Marketing & Advertising). Others, somewhere in the middle, ask for previous experience in a business setting, or consider it preferable.
Senior-level jobs are more likely to ask for domain-specific knowledge than internships and entry-level jobs, which is to be expected. However, this does not happen in all cases, perhaps because companies in certain sectors are only now starting to add Data Scientists to their teams.
When preparing to become a Data Scientist, also try to identify the field in which you wish to work, if you haven’t done so already, and look at what employers look for in job candidates. Make a list of those requirements, see how well you meet them, and make a plan for honing the necessary skills. Additionally, try to do projects in your target field and provide solutions that are already relevant to potential employers.
Sources:
“Data Science at Deloitte”, interview with Ian Scott by Jeremie Harris for the TDS Podcast, https://towardsdatascience.com/data-science-at-deloitte-133457084a5
LinkedIn Job postings for Data Scientist in the US, https://www.linkedin.com/
FLOWINGDATA, article “Why context is as important as the data itself”, interview with John Allen Paulos, math professor at Temple University, May 21, 2010, https://flowingdata.com/2010/05/21/why-context-is-as-important-as-the-data-itself/
IEG gLocal Event: Rewiring Evaluation Approaches, June 3, 2020, Presentation given by Michael Bamberger on “Evaluation in the Age of Big Data”, https://www.youtube.com/watch?v=bFRzcaDc0lU&list=PL23ljg-NcGGf_X-90O_WhTHSI8xvuU6Hn&index=37&t=3214s
Translated from: https://towardsdatascience.com/how-important-domain-knowledge-really-is-in-data-science-19d833d98698