Data Science in Business

Data Science, Opinion

“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’” — Brendan Tierney


I believe the word “some” in the above quote includes communication and domain knowledge. You may have read many articles focusing on the technical facets of data science. In this article, we will discuss some not-so-technical facets that data scientists encounter in their day-to-day lives by walking through a scenario.

Scenario

I work as a data practitioner for the online department of Eastside, a large retail company. My manager passes by my desk on his way to a meeting, asks me to figure out “our best customers”, and leaves in a flash.

What does “best” mean here? Does it mean the customers who have spent the most? Or does it mean the customers who buy most often? Notice that spending the most and making the most purchases are two completely different things.

The situation above is a common occurrence in the field of data: the use of fuzzy (vague) language. Quite often, people express their ideas in natural language that sounds fine at first but turns out to be ill-defined on close inspection.

The situation above shows how poor communication can have an adverse impact. A LinkedIn study states that communication is the most sought-after soft skill. Even though my manager was not precise in his request, I could have asked for clarification. If we find out the end goal of the request, i.e. why he wants to know the best customers, we can decide on our approach.

When I reach out to my manager, he explains that there is $1000 left in the marketing budget and that he wants to use that money to convert some physical store customers to the online store by emailing them free coupons. One caveat is that we should not steal the physical store's customers, as that may create a problem for the head of the physical store. He also mentions that this task must be accomplished within two hours!

This is where domain knowledge comes into the picture. What does stealing customers mean? It means we should not send coupons to active customers of the physical store, as that may stop them from going to the physical store. Instead, we can send coupons to some of the best churned customers.

Customer churn means that a customer ceases to be a customer. (I bought a Netflix subscription for 3 months and later cancelled it. I am a churned customer.)

I explain to my manager that we will consider a customer churned if they have not purchased anything from the physical store in the past 3 months. (Most customers buy only groceries from the physical store, so it is reasonably safe to assume that someone who has not purchased anything for 3 months has churned.) My manager agrees and gives me the dataset of all the physical store customers.

You may think this definition of churned customers is not perfect. In data science there are situations where we cannot know the truth, either because of time constraints or because it cannot be measured. In such cases we use approximations that are close to the truth. These are called proxies. When a request is urgent, it is common to use proxies.

Approach

Let us explore the physical store data we were given.

import pandas as pd
import datetime

data = pd.read_csv("/content/es_phy_store.txt")


There are 125,000 rows in total and 3 columns. Our goal is to return the IDs of the best customers. None of the columns contains null values, which suggests that the data is clean.

To find the churned customers, we will group the data by customer_id and find the latest transaction_date of each customer.

group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()
last_transaction.head(5)


Since a customer is considered churned if their last transaction was more than three months ago, we will set a cutoff date of May 1st, 2020, and label the customers accordingly.

We will create a separate data frame called best_churn, which consists of the customer_id, the transaction_date, and a boolean column churned denoting whether the customer has churned.
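The code for this labeling step is not shown here, so the following is only a minimal sketch of how it could look. The toy rows, the `cutoff` variable, and the choice to treat a last purchase strictly before the cutoff as churned are all assumptions made for illustration; the real data comes from the file loaded earlier.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 125,000-row file.
data = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "transaction_date": ["2020-01-10", "2020-06-15", "2020-02-20",
                         "2020-04-30", "2020-05-01"],
    "transaction_amount": [40, 60, 25, 80, 35],
})

# Parse the dates so that comparisons are chronological, not string-based.
data["transaction_date"] = pd.to_datetime(data["transaction_date"])

# Cutoff date taken from the text: May 1st, 2020.
cutoff = pd.Timestamp("2020-05-01")

# Latest transaction per customer; the result is indexed by customer_id.
last_transaction = data.groupby("customer_id")["transaction_date"].max()

# One row per customer, with customer_id as the index.
best_churn = last_transaction.to_frame()

# Assumption: a last purchase strictly before the cutoff counts as churned.
best_churn["churned"] = best_churn["transaction_date"] < cutoff

print(best_churn)
```

Keeping customer_id as the index means that later per-customer results computed with groupby line up with best_churn automatically when they are assigned as new columns.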


Ranking the Customers

We have found the churned customers. The main aim is to find the best churned customers. First, we need to rank the customers based on some criteria; next, we need to find a threshold value to identify the best customers.

Due to the time constraints, we cannot use a complex ML/DL model. We can use a simple weighted-sum model to rank customers. This model assigns a number (score) to each customer denoting how good they are. In our case, we need to consider two criteria: the amount spent and the number of purchases made. Both are given the same weight, i.e. spending a lot counts as much as making many purchases. So we can define the customer score as: Score = (1/2 × Number of purchases) + (1/2 × Amount spent)

For example, if a customer made 2 purchases worth $500 in total, their score would be (1/2 × 2) + (1/2 × 500) = 251.
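The weighted-sum score can be written as a small function. This is an illustrative sketch only: the function name `weighted_score` and its parameter names are hypothetical, and the default weights of 1/2 each follow the formula above.

```python
def weighted_score(n_purchases: float, amount_spent: float,
                   w_purchases: float = 0.5, w_amount: float = 0.5) -> float:
    """Weighted sum of the two criteria; weights default to 1/2 each."""
    return w_purchases * n_purchases + w_amount * amount_spent


# The worked example from the text: 2 purchases worth $500 in total.
example = weighted_score(n_purchases=2, amount_spent=500)
print(example)  # 251.0
```

Exposing the weights as parameters makes it easy to try a different balance between the two criteria later.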

Let us find the number of transactions per customer and store it in a separate column. This can be done by grouping the data by customer_id and using the size() method. We can also find the total amount spent by using the sum() method on the transaction_amount column. We will also drop the transaction_date column, which is no longer required.

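best_churn["no_of_transactions"] = group_by_customer.size()
# Sum only the amount column; summing the whole group would include other columns.
best_churn["amount_spent"] = group_by_customer["transaction_amount"].sum()
best_churn = best_churn.drop(columns=["transaction_date"])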


Everything seems fine, but a closer look at the formula reveals a defect. We saw that a customer who spent $500 across 2 purchases gets a score of 251. A customer who spent $400 across 20 purchases would get a score of (1/2 × 20) + (1/2 × 400) = 210, which seems unfair: the second customer is more regular than the first and shows more potential to spend in the long run. This happens mainly for two reasons: 1) the amount spent is almost always far larger than the number of transactions, and 2) we use the same weights for both criteria.

Let us find the min and max number of transactions and amounts spent in the best_churn data frame.

best_churn[["no_of_transactions", "amount_spent"]].describe().loc[["min", "max"]]

We can see that the number of transactions is far smaller than the amount spent. To overcome this problem we will use min-max scaling, which rescales each column to the range 0 to 1 so that values on different scales can be compared in a meaningful way. The formula for min-max scaling is:

x_scaled = (x − x_min) / (x_max − x_min)

Let us apply the above formula to our no_of_transactions and amount_spent columns, compute the score from the scaled values, and sort the data frame by score.
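The code for this step is not shown here, so the following is a minimal sketch under stated assumptions. The three-row data frame is a hypothetical stand-in for best_churn, and the helper `min_max_scale` and the column names `scaled_transactions`, `scaled_amount`, and `score` are illustrative choices, not names confirmed by the article.

```python
import pandas as pd

# Hypothetical stand-in for best_churn, indexed by customer_id.
best_churn = pd.DataFrame(
    {
        "no_of_transactions": [2, 20, 11],
        "amount_spent": [500.0, 400.0, 100.0],
        "churned": [True, True, False],
    },
    index=pd.Index(["C1", "C2", "C3"], name="customer_id"),
)


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a column to [0, 1]; assumes the column is not constant."""
    return (column - column.min()) / (column.max() - column.min())


best_churn["scaled_transactions"] = min_max_scale(best_churn["no_of_transactions"])
best_churn["scaled_amount"] = min_max_scale(best_churn["amount_spent"])

# Same equal weights as before, now applied to the scaled values.
best_churn["score"] = (0.5 * best_churn["scaled_transactions"]
                       + 0.5 * best_churn["scaled_amount"])

# Highest score first, so the best customers are at the top.
best_churn = best_churn.sort_values("score", ascending=False)
print(best_churn[["scaled_transactions", "scaled_amount", "score"]])
```

With these toy values, the customer with 20 purchases worth $400 (C2) now ranks above the customer with 2 purchases worth $500 (C1), which addresses the unfairness that the unscaled formula produced.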


How do we find out the threshold score value?

Should we choose the first 20 customers? The first 50? The top 10%? What should the criterion be? Again, domain knowledge plays a crucial role here. We know that the budget is $1000. The value of each coupon is not specified, so we must decide it ourselves. The coupon value cannot be too high, because that reduces the number of customers we can reach.

A 30% discount on one transaction is generally considered a pretty decent deal. So, let us find the mean value of all 125,000 transactions in the initial data frame and take 30% of that mean.

coupon = data["transaction_amount"].mean() * 0.3
Output: 19.4976

Let us round this up to 20, so each coupon is worth $20. We know that our budget is $1000. Dividing 1000 by 20 yields 50. Hence we can select the top 50 customers from the best_churn data whose churned value is 1 and email them the coupons.
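The budget arithmetic can be checked in a few lines. This is only a sketch: the variable names are hypothetical, the value 19.4976 is the output reported above, and rounding up with `math.ceil` is an assumption about how 19.4976 becomes 20 (ordinary rounding to the nearest integer would give 19).

```python
import math

budget = 1000                      # remaining marketing budget in dollars
thirty_percent_of_mean = 19.4976   # 30% of the mean transaction amount, as reported

# Assumption: round up to the next whole dollar to get a clean coupon value.
coupon_value = math.ceil(thirty_percent_of_mean)

# Number of customers who can each receive one coupon within the budget.
n_customers = budget // coupon_value

print(coupon_value, n_customers)  # 20 50
```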

top_50_churned = best_churn.loc[best_churn["churned"] == 1].head(50)


Conclusion

In this article, we saw the importance of communication and how fuzzy language can be a hindrance. We also walked through a real-life scenario and solved a problem that had many constraints and required communication skills, domain knowledge, and quick decision-making. I hope you learned something new today.

If you would like to get in touch, connect with me on LinkedIn.


Translated from: https://medium.com/towards-artificial-intelligence/data-science-in-business-8266fae71a87
