Data Science, Opinion

“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’” — Brendan Tierney

I believe the word "some" in the above quote includes communication and domain knowledge. You have probably read many articles that focus on the technical facets of data science. In this article, we will discuss some not-so-technical facets that data scientists encounter in their day-to-day work by picturing a scenario.

Scenario

I work as a data practitioner in the online department of Eastside, a large retail company. My manager passes my desk on his way to a meeting, asks me to figure out “our best customers”, and is gone in a flash.

What does "best" mean here? Does it mean the customers who have spent the most, or the customers who buy the most often? Notice that spending the most and buying many items are two completely different things.
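
The gap between the two readings of "best" can be made concrete with a small sketch. The toy transactions below are invented purely for illustration; only the column names customer_id and tran_amount follow the names used later in this article.

```python
import pandas as pd

# Hypothetical toy transactions, made up for illustration only.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B", "B", "C"],
    "tran_amount": [300, 200, 20, 30, 25, 25, 150],
})

# Total spend and number of purchases per customer.
summary = toy.groupby("customer_id")["tran_amount"].agg(
    amount_spent="sum", no_of_transactions="count"
)

top_spender = summary["amount_spent"].idxmax()          # spends the most
most_frequent = summary["no_of_transactions"].idxmax()  # buys the most often
print(top_spender, most_frequent)
```

In this toy data the two definitions pick different customers, which is why the request needs to be clarified before any analysis starts.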

The situation above is a common occurrence in the field of data: the use of fuzzy (vague) language. People often express their ideas in natural language that sounds fine at first but turns out to be ill-defined on closer inspection.

The situation above shows how poor communication can have an adverse impact. A LinkedIn study states that communication is the most sought-after soft skill. Even though my manager was not precise in his request, I could have asked for clarification. If we find out the end goal of the request, i.e. why he wants to know the best customers, we can decide on our approach.

When I reach out to my manager, he explains that there is $1000 left in the marketing budget and that he wants to use it to convert some physical store customers to the online store by emailing them free coupons. One caveat is that we should not steal the customers of the physical store, as that may create a problem for the head of the physical store. He also mentions that this task must be completed within two hours!

This is where domain knowledge comes into the picture. What does stealing customers mean? It means we should not send coupons to active customers of the physical store, as that may keep them from going to the physical store. Instead, we can send coupons to some of the best churned customers.

Customer churn means that a customer ceases to be a customer. (For example, I bought a 3-month Netflix subscription and later cancelled it, so I am a churned customer.)

I explain to my manager that we will consider a customer churned if they have not purchased anything from the physical store in the past 3 months. (Most customers buy only groceries from the physical store, so it is reasonably safe to assume that someone who has not purchased anything for 3 months has churned.) My manager agrees and gives me the dataset of all the physical store customers.

You may think this definition of churned customers is not perfect. In data science, situations arise in which we cannot know the truth, either because of time constraints or because it cannot be measured. In such cases we use approximations that are close to the truth. These are called proxies. When a request is urgent, it is common to use proxies.

Approach

Let us explore the physical store data we were given.

import pandas as pd
import datetime

data = pd.read_csv("/content/es_phy_store.txt")

There are 125,000 rows and 3 columns in total. Our goal is to return the IDs of the best customers. None of the columns contains null values, which suggests that the data is clean.
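
As a rough sketch of how such a check might look, the snippet below inspects a tiny stand-in for the real file. The three rows are invented, and the column names are assumptions taken from the code used elsewhere in this article (customer_id, transaction_date, tran_amount); the real file may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; the values are invented.
csv_text = """customer_id,transaction_date,tran_amount
CS1001,2020-01-15,35
CS1002,2020-06-03,80
CS1001,2020-07-21,52
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                # (number of rows, number of columns)
null_counts = sample.isna().sum()  # number of missing values per column
print(null_counts)
sample.info()                      # column types and non-null counts
```

A column full of non-null values is a good sign, but it does not rule out other problems such as duplicated rows or implausible amounts.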

To find the churned customers, we will group the data by customer_id and find the latest transaction_date of each customer.

group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()
last_transaction.head(5)
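
One caveat: max() returns the chronologically latest date only if the column holds datetime values. If transaction_date was read as plain text, the values are compared alphabetically. The sketch below uses invented dates in a day-month-year text format, which is an assumption, since the date format of the real file is not shown here.

```python
import pandas as pd

# Invented example; the day-month-year text format is an assumption.
toy = pd.DataFrame({
    "customer_id": ["A", "A"],
    "transaction_date": ["09-12-2019", "02-03-2020"],
})

# Compared as text, "09-12-2019" sorts after "02-03-2020".
latest_as_text = toy.groupby("customer_id")["transaction_date"].max()

# Parse to datetime first so that max() is chronological.
toy["transaction_date"] = pd.to_datetime(toy["transaction_date"], format="%d-%m-%Y")
latest_as_date = toy.groupby("customer_id")["transaction_date"].max()

print(latest_as_text["A"], latest_as_date["A"])
```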

Since customers are considered churned if their last transaction was more than three months ago, we will use a cutoff date of May 1st, 2020, and label the customers accordingly: anyone whose last transaction is before the cutoff counts as churned.

We will create a separate data frame called best_churn, which consists of the customer_id, the transaction_date, and a boolean column churned that denotes whether the customer has churned.
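
The code for this step is not shown here, so the following is only a minimal sketch of one way to build such a data frame. The customer IDs and dates are invented, customer_id is kept as the index (an assumption that lets later per-customer columns align by index), and churned is stored as 1/0 rather than True/False to match the later filter on a churned value of 1.

```python
import pandas as pd

# Invented last-transaction dates, indexed by customer_id.
last_transaction = pd.Series(
    pd.to_datetime(["2020-02-10", "2020-06-18", "2020-04-30"]),
    index=pd.Index(["CS1001", "CS1002", "CS1003"], name="customer_id"),
    name="transaction_date",
)

cutoff = pd.Timestamp("2020-05-01")  # cutoff date from the article

best_churn = last_transaction.to_frame()
# 1 if the last purchase happened before the cutoff, otherwise 0.
best_churn["churned"] = (best_churn["transaction_date"] < cutoff).astype(int)
print(best_churn)
```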

Ranking the Customers

We have identified the churned customers. The main aim, however, is to find the best churned customers. First, we need to rank the customers based on some criteria, and then we need to find a threshold value to identify the best ones.

Due to the time constraints, we cannot use a complex ML/DL model. Instead, we can use a simple weighted-sum model to rank customers. This model assigns a number (score) to each customer that denotes how good they are. In our case, we need to consider two criteria: the amount spent and the number of purchases made. Both are given the same weight, i.e. a customer who spends a lot is treated as equivalent to a customer who makes many purchases. So we can define the customer score as:

Score = (1/2 × Number of purchases) + (1/2 × Amount spent)

For example, if a customer made 2 purchases worth $500 in total, their score would be (1/2 × 2) + (1/2 × 500) = 251.
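
The score formula translates directly into a small function. This is only an illustrative sketch; the function name weighted_score and its parameter names are hypothetical, and the default weights of 1/2 are the ones chosen above.

```python
def weighted_score(no_of_purchases, amount_spent, w_purchases=0.5, w_amount=0.5):
    """Weighted-sum score of a customer from two criteria."""
    return w_purchases * no_of_purchases + w_amount * amount_spent


# The customer from the example: 2 purchases worth $500 in total.
example_score = weighted_score(2, 500)
print(example_score)  # 251.0
```

Keeping the weights as parameters makes it easy to try a different balance between the two criteria later.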

Let us find the number of transactions per customer and store it in a separate column. This can be done by grouping the data by customer_id and using the size() method. We can also find the total amount spent by applying the sum() method to the tran_amount column. Finally, we will drop the transaction_date column, which is no longer required.

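# Number of transactions per customer
best_churn["no_of_transactions"] = group_by_customer.size()
# Total amount spent per customer (sum only the amount column)
best_churn["amount_spent"] = group_by_customer["tran_amount"].sum()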

Everything seems fine, but a closer look at the formula reveals a defect. We saw that a customer who spent $500 across 2 purchases gets a score of 251. A customer who spent $400 across 20 different purchases gets (1/2 × 20) + (1/2 × 400) = 210, which seems unfair: the second customer buys more regularly than the first and shows more potential to spend in the long run. This happens mainly for two reasons. 1) The amount spent is almost always a far larger number than the number of transactions. 2) We use the same weight for both criteria.

Let us find out the min and max number of transactions and amounts from the best_churn data frame.

best_churn[["no_of_transactions", "amount_spent"]].describe().loc[["min", "max"]]

We can see that the number of transactions is far smaller than the amount spent. To overcome this problem we will use min-max scaling, which puts values measured on different scales into a common range so that they can be compared in a meaningful way. The formula for min-max scaling is:

x_scaled = (x - min(x)) / (max(x) - min(x))

Let us apply the above formula to our no_of_transactions and amount_spent columns, compute the score from the scaled values, and sort the data frame by score in descending order.
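
The code for this step is not shown here, so the following is a minimal sketch under stated assumptions. The four customers are invented (C1 and C2 reuse the $500 / 2 purchases and $400 / 20 purchases examples from above), and the helper min_max_scale and the scaled column names are hypothetical.

```python
import pandas as pd

# Invented customers; C1 and C2 mirror the two examples discussed above.
toy = pd.DataFrame(
    {"no_of_transactions": [2, 20, 1, 25], "amount_spent": [500, 400, 50, 1000]},
    index=pd.Index(["C1", "C2", "C3", "C4"], name="customer_id"),
)


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())


toy["scaled_tran"] = min_max_scale(toy["no_of_transactions"])
toy["scaled_amount"] = min_max_scale(toy["amount_spent"])

# Same equal weights as before, now applied to comparable scales.
toy["score"] = 0.5 * toy["scaled_tran"] + 0.5 * toy["scaled_amount"]

ranked = toy.sort_values("score", ascending=False)
print(ranked[["no_of_transactions", "amount_spent", "score"]])
```

With scaled values, the regular customer C2 now ranks above C1. Note that this helper divides by zero if a column holds a single repeated value, so real data may need a guard for that case.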

How do we find out the threshold score value?

Should we choose the first 20 customers, the first 50, or the top 10%? What should the criterion be? Again, domain knowledge plays a crucial role here. We know that the budget is $1000. The value of each coupon is not specified, so we must decide it ourselves. The coupon value cannot be too high, because a higher value reduces the number of customers we can reach.

A 30% discount on one transaction is generally considered a decent deal. So, let us find the mean value of all 125,000 transactions in the initial data frame and take 30% of that mean.

coupon = data["tran_amount"].mean() * 0.3

Output: 19.4976

Let us round this up to 20, so each coupon is worth $20. Our budget is $1000, and 1000 / 20 = 50. Hence we can select the top 50 customers from the best_churn data whose churned value is 1 and send them the coupons.
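
The budget arithmetic can be sketched as follows. The variable names are hypothetical, and rounding up with math.ceil is an assumption about how 19.4976 becomes 20, since rounding to the nearest integer would give 19.

```python
import math

budget = 1000         # marketing budget in dollars
coupon_raw = 19.4976  # 30% of the mean transaction value, as computed above

coupon_value = math.ceil(coupon_raw)  # round up to a whole dollar amount
n_coupons = budget // coupon_value    # whole coupons the budget covers

print(coupon_value, n_coupons)  # 20 50
```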

top_50_churned = best_churn.loc[best_churn["churned"] == 1].head(50)

Conclusion

In this article, we saw the importance of communication and how fuzzy language can be a hindrance. We also walked through a real-life scenario and solved a problem that had many constraints and required communication skills, domain knowledge, and quick decision-making. I hope you learned something new today.

If you would like to get in touch, connect with me on LinkedIn.

Source: https://medium.com/towards-artificial-intelligence/data-science-in-business-8266fae71a87
