Predicting a Movie’s Revenue

As J. Valenti, CEO of the Motion Picture Association of America (MPAA), put it: “No one can tell you how a movie is going to do in the marketplace. Not until the film opens in a darkened theater and sparks fly up between the screen and the audience.”

Cigdem Tuncer
Aug 9

The modern film industry, worth nearly 10 billion dollars per year, is a cutthroat business.

Each year in the United States, hundreds of films are released to domestic audiences in the hope that they will become the next “blockbuster.” Predicting how well a movie will perform at the box office is hard because there are so many factors involved in success.

The goal of this project is to develop a computational model that predicts movie revenues from public data extracted from the Boxofficemojo.com online movie database.

The project has five phases. The first phase is web scraping: different types of features are extracted from Boxofficemojo.com, as described later. The second phase is data cleaning: after scraping the data from the source, I cleaned it, mainly by removing movies with missing features. The third phase is exploratory data analysis, where I create graphics to understand the data. The fourth phase is feature engineering, where features for the machine learning model are created from the raw text data. The fifth phase is model analysis, where I apply one of the machine learning algorithms to the data set.

Web Scraping

Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large datasets, the ability to scrape data from the web is a useful skill to have.

It’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data from the Modern Web.

For this project:

· BeautifulSoup Library is used for data extraction from the web.

· Pandas Library is used for data manipulation and cleaning.

· Matplotlib and Seaborn are used for data visualization.

My data set contains 8319 movies released between 2010 and 2019. Movies from 2020 were not included because few movies were released that year due to Covid-19. I collected the Title, Distributor, Release, MPAA, Time, Genre, Domestic, International, Worldwide, Opening, Budget, and Actors information.
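To make the extraction step concrete, here is a minimal BeautifulSoup sketch. The HTML fragment, the class names, and the `parse_movie` helper are all hypothetical; the real Boxofficemojo.com markup is not shown in this article and will differ, so treat this as an illustration of the technique rather than the scraper used for the project.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one movie page (assumed structure).
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example Movie</h1>
  <div class="summary">
    <div class="row"><span class="label">Distributor</span><span class="value">Example Pictures</span></div>
    <div class="row"><span class="label">Budget</span><span class="value">$50,000,000</span></div>
    <div class="row"><span class="label">MPAA</span><span class="value">PG-13</span></div>
  </div>
</body></html>
"""


def parse_movie(html: str) -> dict:
    """Collect the title and every label/value pair from one page."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"Title": soup.find("h1", class_="title").get_text(strip=True)}
    for row in soup.find_all("div", class_="row"):
        label = row.find("span", class_="label").get_text(strip=True)
        value = row.find("span", class_="value").get_text(strip=True)
        record[label] = value
    return record


movie = parse_movie(SAMPLE_HTML)
print(movie)
```

In a real run, the HTML string would come from an HTTP request for each movie page, and the resulting dictionaries would be collected into a Pandas data frame.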

Data Cleaning

At the beginning, my data set had 8319 movies. I then realized that many movies did not have all of the data available, so missing features were the main reason for eliminating movies from my data set.

Most of the movies don’t have budget data available, so rows with null values were deleted.

The dtype of the numeric columns was converted from “object” to “float”.
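The two steps above (dropping rows with nulls and converting the dtype) might look roughly like this in Pandas. The sample rows and the assumption that money values arrive as strings such as "$50,000,000" are mine, for illustration only.

```python
import pandas as pd

# Hypothetical rows in the shape a scraper might return: strings everywhere.
raw = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Budget": ["$50,000,000", None, "$1,200,000"],
    "Domestic": ["$120,500,000", "$3,000,000", "$900,000"],
})

# Drop every row that is missing any feature (here: movie B has no budget).
clean = raw.dropna().copy()

# Strip "$" and "," and convert the money columns from object to float.
for col in ["Budget", "Domestic"]:
    clean[col] = clean[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(clean.dtypes)
```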

The “Release” data was checked for leap-year dates, and the affected entries were corrected. The dtype of the “Release” column was converted from “object” to “datetime”. Unrelated information was removed from the “Distributor” column.

Duplicate rows were deleted from the data set.
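As a rough sketch of the date conversion and de-duplication steps, assuming release dates are stored as strings like "Feb 29, 2016" (the actual format in the scraped data is not shown here):

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates.
df = pd.DataFrame({
    "Title": ["A", "A", "B"],
    "Release": ["Feb 29, 2016", "Feb 29, 2016", "Jul 14, 2019"],
})

# Convert the Release column from object to datetime.
# The format string is an assumption about how the dates were scraped.
df["Release"] = pd.to_datetime(df["Release"], format="%b %d, %Y")

# Remove exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

Note that a Feb 29 date only parses when it is paired with a leap year, which is one reason release dates need a check before conversion.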

After removing those movies, I ended up with a data set of 1293 movies for which all information is available.

Exploratory Data Analysis (EDA)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Let’s look at the relationship between “Domestic Total Gross” and “Budget” for each year.

While there are an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.

A Heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features in a data set. We can now use either Matplotlib or Seaborn to create the heatmap. To get the correlation of the features inside a data set we can call <dataset>.corr(), which is a Pandas dataframe method. This will give us the correlation matrix.
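A minimal sketch of the correlation step, using a small made-up table whose column names mirror this article’s features (the values are invented):

```python
import pandas as pd

# Made-up numbers; "Opening" is exactly 0.2 * "Budget" to show a correlation of 1.
df = pd.DataFrame({
    "Budget":   [10.0, 20.0, 30.0, 40.0, 50.0],
    "Opening":  [2.0, 4.0, 6.0, 8.0, 10.0],
    "Domestic": [30.0, 22.0, 41.0, 35.0, 60.0],
})

# DataFrame.corr() returns the pairwise Pearson correlation matrix by default.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would then draw the matrix as a colored grid with the values printed in each cell.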

Feature Engineering

Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.

Machine learning fits mathematical models to the data in order to derive insights. The models take features as input. A feature is generally a numeric representation of an aspect of a real-world phenomenon or of the data. Just as there are dead ends in a maze, the path through the data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal of insights.

Let’s look at the description of the data set and at the distribution of the target column.

We want the target variable that the model predicts to have a normal distribution. When we examine the distribution of our target variable, we see that it is right-skewed. We can correct this by applying a logarithmic transformation to the target variable.
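To illustrate the effect of the transformation, here is a small sketch with invented revenue figures that have a long right tail. The numbers are not from the project’s data set.

```python
import numpy as np
import pandas as pd

# Made-up revenues: most values are small, a few are very large.
revenue = pd.Series([1e5, 3e5, 5e5, 8e5, 1e6, 2e6, 5e6, 2e7, 9e7, 6e8])

# Natural log of the target; valid because all revenues are positive.
log_revenue = np.log(revenue)

# Sample skewness before and after; values near 0 indicate symmetry.
skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(skew_before, skew_after)
```

Predictions made on the log scale can be mapped back to dollars with `np.exp`.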

An ordinary least-squares (OLS) model describes the relationship between one or more explanatory variables and a continuous (or at least interval) outcome variable. Its coefficients are chosen to minimize the sum of squared errors, where an error is the difference between the actual and the predicted value of the outcome variable.
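Written out, the OLS criterion looks as follows. The notation is my own assumption for illustration: $y_i$ is the actual outcome for movie $i$, $x_{ij}$ is its $j$-th feature, and $\beta_j$ are the coefficients.

```latex
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```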

When I fit an OLS model with two numerical features from the data set, I got a low condition number (Cond. No.) but also a low R-squared score. To increase the R-squared score, I did feature engineering to add new features derived from the categorical variables in our data set.
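The two diagnostics mentioned above can be computed by hand. The sketch below uses plain NumPy on synthetic data instead of the regression library used in the notebook, so it shows the idea rather than the project’s actual output.

```python
import numpy as np

# Synthetic data: two numeric features and a target that depends on both.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared: share of the target's variance explained by the fit.
residuals = y - A @ beta
r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Condition number of the design matrix; large values hint at collinearity
# or badly scaled features.
cond_no = np.linalg.cond(A)
print(beta, r2, cond_no)
```

Adding many dummy columns tends to raise the condition number, which matches the trade-off described later in the model analysis.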

· The “year” column and four season columns were created from the “Release” column.

· Four dummy columns were created from the “MPAA” column.

· A running time (min) column was created from the “Time” column.

· New columns were created for all distributors with more than 49 rows.

· Log-transformed versions of the “Budget” and “Opening” columns were created.
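The steps in the list above can be sketched in Pandas as follows. The sample rows, the "2 hr 12 min" time format, the month-to-season mapping, and the new column names are all assumptions made for this illustration. The row threshold is lowered from 49 to 2 so that it has an effect on four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned rows.
df = pd.DataFrame({
    "Release": pd.to_datetime(["2015-01-16", "2016-04-15", "2017-07-21", "2018-11-02"]),
    "MPAA": ["R", "PG-13", "PG", "R"],
    "Time": ["2 hr 12 min", "1 hr 46 min", "1 hr 30 min", "2 hr 5 min"],
    "Distributor": ["Studio A", "Studio A", "Studio B", "Studio A"],
    "Budget": [58e6, 175e6, 20e6, 52e6],
    "Opening": [89e6, 103e6, 5e6, 51e6],
})

# Year and season dummies from the release date (Dec-Feb counted as winter).
df["year"] = df["Release"].dt.year
season_names = ["winter", "spring", "summer", "fall"]
df["season"] = (df["Release"].dt.month % 12 // 3).map(lambda i: season_names[i])
df = pd.concat([df, pd.get_dummies(df["season"], dtype=int)], axis=1)

# One dummy column per MPAA rating.
df = pd.concat([df, pd.get_dummies(df["MPAA"], prefix="MPAA", dtype=int)], axis=1)

# Running time in minutes from the text column.
parts = df["Time"].str.extract(r"(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?").fillna("0").astype(int)
df["running_time_min"] = parts[0] * 60 + parts[1]

# Dummy columns only for distributors with more than MIN_ROWS movies.
MIN_ROWS = 2  # the article uses 49 on the full data set
counts = df["Distributor"].value_counts()
for name in counts[counts > MIN_ROWS].index:
    df[f"dist_{name}"] = (df["Distributor"] == name).astype(int)

# Log-transformed money columns.
df["log_budget"] = np.log(df["Budget"])
df["log_opening"] = np.log(df["Opening"])

print(df.columns.tolist())
```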

Model Analysis

Now it is time to split our data into training, validation, and test sets. Let’s rerun our model and finally compare the Ridge, Lasso, and Polynomial regression results.

The data set was split into training (60%), validation (20%), and test (20%) sets. The tuning parameter (alpha) of the Lasso and Ridge models was chosen from a wide range of values using 10-fold cross-validation.
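One way to get a 60/20/20 split and a cross-validated alpha with scikit-learn is sketched below. The synthetic data, the alpha grid, and the random seeds are assumptions; the notebook may do this differently.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the movie features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=500)

# Hold out 20% for the final test, then split the remaining 80% into
# 60% train and 20% validation (0.25 * 0.8 = 0.2 of the full data).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Search a wide, log-spaced range of alphas with 10-fold cross-validation.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
lasso = LassoCV(alphas=alphas, cv=10).fit(X_train, y_train)

print(ridge.alpha_, lasso.alpha_)
print(ridge.score(X_val, y_val), lasso.score(X_val, y_val))
```

The test set stays untouched until the very end, so the validation scores are the ones to use when comparing models.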

When we added the engineered features to the model, the R-squared score of the OLS model increased to 0.759, but the condition number (Cond. No.) increased as well. Lasso Regression and Ridge Regression gave us the same results, and the result of Linear Regression was very close to them. The best result came from Degree 2 Polynomial Regression, followed by Ridge Polynomial Regression.
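The model variants compared above could be set up with scikit-learn pipelines like this. The data is synthetic and deliberately contains a squared term and an interaction, so the scores only illustrate why a degree 2 expansion can help; they say nothing about the movie data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic target with curvature that a purely linear model cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.0 + X[:, 0] + X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

models = {
    "linear": LinearRegression(),
    "poly2": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
    ),
    "ridge_poly2": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), StandardScaler(), Ridge(alpha=1.0)
    ),
}

# Fit on the training part and report R-squared on the validation part.
scores = {name: model.fit(X_train, y_train).score(X_val, y_val) for name, model in models.items()}
print(scores)
```

Scaling before Ridge matters because the penalty treats all coefficients alike, and squared features live on a different scale than the raw ones.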

Now it’s time to do Cross Validation (CV) and look at the Mean Absolute Error (MAE) score. When we cross-validate each model (kfold = 10), we see only a small drop in scores.
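A minimal sketch of 10-fold cross-validation scored by MAE, again on synthetic data and with a Ridge model chosen only as an example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the engineered movie features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=1)

# scikit-learn reports errors as negative scores so that higher is better,
# so the sign is flipped to get the usual MAE.
neg_mae = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print(mae_per_fold.mean(), mae_per_fold.std())
```

If the target was log-transformed, the MAE is on the log scale too and has to be interpreted accordingly.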

Conclusion

Finally, when we look at the mean absolute errors of the fitted models, we can say that Ridge Polynomial Regression gives us the most accurate results.

The five fundamental assumptions of linear regression analysis were checked; the checks can be seen in the Jupyter Notebook.

The GitHub repository for the web scraping and data processing is here.

Thank you for your time and for reading my article. Please feel free to contact me if you have any questions or would like to share your comments.

Translated from: https://medium.com/analytics-vidhya/predicting-a-movies-revenue-3709fb460604
