The CEO of the Motion Picture Association of America (MPAA), Jack Valenti, once remarked: "No one can tell you how a movie is going to do in the marketplace. Not until the film opens in a darkened theater and sparks fly up between the screen and the audience."

The modern film industry, a business of nearly 10 billion dollars per year, is fiercely competitive.
Each year in the United States, hundreds of films are released to domestic audiences in the hope that they will become the next “blockbuster.” Predicting how well a movie will perform at the box office is hard because there are so many factors involved in success.
The goal of this project is to develop a computational model that predicts revenue from public movie data extracted from the Boxofficemojo.com online movie database.
The project has five phases. The first phase is web scraping: different types of features, described later, are extracted from Boxofficemojo.com. The second phase is data cleaning: after scraping the data from our source, we cleaned it mainly by removing records with unavailable features. The third phase is exploratory data analysis, where we create graphics to understand the data. The fourth phase is feature engineering, where features for the machine learning model are created from the raw text data. The fifth phase is model analysis, where I applied machine learning algorithms to our data set.

Web Scraping
Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large data sets, the ability to scrape data from the web is a useful skill to have.
It’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data from the Modern Web.
For this project:
· BeautifulSoup Library is used for data extraction from the web.
· Pandas Library is used for data manipulation and cleaning.
· Matplotlib and Seaborn are used for data visualization.
My data set contains 8319 movies released between 2010 and 2019. More recent movies were not selected because, due to Covid-19, few movies were released in 2020. For each movie I collected Title, Distributor, Release, MPAA, Time, Genre, Domestic, International, Worldwide, Opening, Budget, and Actors information.
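As a minimal sketch of the extraction step, the snippet below parses a Box Office Mojo-style table row with BeautifulSoup. The HTML structure and class names are made up for illustration; the site's actual markup differs.

```python
from bs4 import BeautifulSoup

def parse_movie_row(html):
    """Extract the title and domestic gross from one (hypothetical) table row."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("td", class_="title").get_text(strip=True)
    gross = soup.find("td", class_="gross").get_text(strip=True)
    # Strip "$" and thousands separators so the value can be stored as a number.
    value = float(gross.replace("$", "").replace(",", ""))
    return {"Title": title, "Domestic": value}

row = '<tr><td class="title">Example Movie</td><td class="gross">$1,234,567</td></tr>'
record = parse_movie_row(row)
```

In the real scraper each movie page is fetched with an HTTP client, and every field (Distributor, MPAA, Budget, and so on) gets a similar extraction step.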



Data Cleaning
At the beginning my data set had 8319 movies. I then noticed that many movies did not have all of their data available, so unavailability of features was the main reason for eliminating movies from my data set.
Most movies do not have budget data available, so rows with null budgets have been deleted.

Dtype is converted from “Object” to “float” for numeric columns.

The “Release” data was checked for leap-year dates, and the affected values were corrected. The dtype of the “Release” column was converted from “object” to “datetime”, and unrelated information was cleaned from the “Distributor” column.

Duplicate rows have been deleted from the data set.

After removing those movies, I was left with a data set of 1293 movies that have all information available.
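The cleaning steps above can be sketched in Pandas on a toy frame; the column names mirror the scraped data, but the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie C"],
    "Budget": ["10,000,000", None, "5,000,000", "5,000,000"],
    "Release": ["2015-06-12", "2016-02-29", "2018-11-02", "2018-11-02"],
})

df = df.dropna(subset=["Budget"])                   # drop rows with no budget
df["Budget"] = df["Budget"].str.replace(",", "", regex=False).astype(float)  # object -> float
df["Release"] = pd.to_datetime(df["Release"])       # object -> datetime
df = df.drop_duplicates().reset_index(drop=True)    # remove duplicate rows
```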

Exploratory Data Analysis (EDA)
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Let’s look at the data relation between “Domestic Total Gross” and “Budget” for each year.

While there are an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.

A heatmap is a graphical representation of data in which the individual values contained in a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features in a data set. We can use either Matplotlib or Seaborn to create the heatmap. To get the correlations between the features in a data set, we can call <dataset>.corr(), a Pandas dataframe method, which gives us the correlation matrix.
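A sketch of the correlation heatmap, again on synthetic data with assumed column names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
budget = rng.uniform(5e6, 2e8, 200)
df = pd.DataFrame({
    "Budget": budget,
    "Opening": budget * rng.uniform(0.1, 0.5, 200),
    "Domestic": budget * rng.uniform(0.5, 3.0, 200),
})

corr = df.corr()  # pairwise Pearson correlations -> the correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
```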

Feature Engineering
Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.
Machine learning fits mathematical models to data in order to derive insights. These models take features as input. A feature is generally a numeric representation of an aspect of real-world phenomena or data. Just as there are dead ends in a maze, the path through the data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal of insights.
Let’s look at the description of the data set and the distribution of the target column.


We want the target variable predicted by the model to have a normal distribution. When we examine the distribution of our target variable, we see that it is right-skewed rather than normal. We can correct this by applying a logarithmic transformation to the target variable.
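A minimal sketch of that transformation on a synthetic, right-skewed gross column:

```python
import numpy as np
import pandas as pd

# Grosses span several orders of magnitude, so a lognormal sample is a
# reasonable stand-in for the skewed target.
rng = np.random.default_rng(1)
gross = pd.Series(rng.lognormal(mean=17, sigma=1.2, size=1000))

log_gross = np.log(gross)  # use np.log1p if zero values are possible

# The raw series is heavily right-skewed; the log-transformed one is not.
raw_skew, log_skew = gross.skew(), log_gross.skew()
```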

Ordinary least-squares (OLS) models assume that the analysis is fitting a model of a relationship between one or more explanatory variables and a continuous or at least interval outcome variable that minimizes the sum of square errors, where an error is the difference between the actual and the predicted value of the outcome variable.

When I built an OLS model with two numerical features from the data set, I got a low condition number but also a low R-squared score. To increase the R-squared score, I did feature engineering to add new features derived from the categorical variables in the data set.
· The “year” column and four season columns were created from the “Release” column.
· Four Dummy columns were created from “MPAA” column.
· A running time (min) column was created from the “Time” column.
· New columns were created for all distributors with more than 49 rows.
· Log columns of “Budget” and “Opening” were created.
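The bullet points above can be sketched with Pandas on a toy frame. The column names follow the scraped data, the values are invented, and the season encoding is one plausible choice, not necessarily the one used in the project:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Release": pd.to_datetime(["2015-06-12", "2017-12-20", "2019-03-01"]),
    "MPAA": ["PG", "R", "PG-13"],
    "Time": ["2 hr 15 min", "1 hr 48 min", "2 hr 2 min"],
    "Budget": [1.5e8, 6.0e7, 9.0e7],
    "Opening": [7.0e7, 2.5e7, 3.3e7],
})

# Year and four season indicator columns from the release date
df["year"] = df["Release"].dt.year
season = df["Release"].dt.month % 12 // 3  # 0=winter, 1=spring, 2=summer, 3=autumn
for i, name in enumerate(["winter", "spring", "summer", "autumn"]):
    df[name] = (season == i).astype(int)

# Dummy columns from the MPAA rating
df = pd.concat([df, pd.get_dummies(df["MPAA"])], axis=1)

# Running time in minutes from the "X hr Y min" string
parts = df["Time"].str.extract(r"(\d+) hr (\d+) min").astype(int)
df["Running time (min)"] = parts[0] * 60 + parts[1]

# Log transforms of the money columns
df["log_Budget"] = np.log(df["Budget"])
df["log_Opening"] = np.log(df["Opening"])
```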
Model Analysis
Now is the time to split our data into sets of training, testing and validation. Let’s rerun our model and finally compare the Ridge, Lasso and Polynomial regression results.

The data set was split into train (60%), validation (20%), and test (20%) sets. The tuning parameter (alpha) of the Lasso and Ridge models was chosen from a wide range of values using 10-fold cross-validation.
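A sketch of that split and alpha search with scikit-learn, on synthetic stand-ins for the engineered features:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1293, 8))                    # stand-in feature matrix
y = X @ rng.normal(size=8) + rng.normal(0, 0.5, 1293)

# 60/20/20: carve off 20% for test, then 25% of the remainder for validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Choose alpha from a wide grid with 10-fold cross-validation.
grid = {"alpha": np.logspace(-4, 4, 20)}
ridge = GridSearchCV(Ridge(), grid, cv=10).fit(X_train, y_train)
lasso = GridSearchCV(Lasso(max_iter=10_000), grid, cv=10).fit(X_train, y_train)
```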
When we included the variables created by feature engineering in the model, the OLS R-squared score increased to 0.759, but the condition number increased as well. Lasso regression and Ridge regression gave us the same results, and linear regression was also very close to them. The best result came from a degree-2 polynomial regression, and the second best from Ridge polynomial regression.


Now it’s time to do Cross Validation (CV) and look at Mean Absolute Error (MAE) score. When we cross validate each model (kfold = 10), we see little drop in scores.
現在是時候進行交叉驗證(CV)和查看平均絕對誤差(MAE)分數了。 當我們交叉驗證每個模型(kfold = 10)時,我們看到分數幾乎沒有下降。
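Scoring a model by MAE under 10-fold CV is a one-liner in scikit-learn. The sketch below uses synthetic data; note that sklearn reports the negated MAE, so the sign is flipped back:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + rng.normal(0, 0.3, 500)

# One MAE score per fold; flip the sign of "neg_mean_absolute_error".
scores = -cross_val_score(Ridge(alpha=1.0), X, y,
                          cv=10, scoring="neg_mean_absolute_error")
mean_mae = scores.mean()
```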

Conclusion
Finally, when we look at the mean absolute errors on the established models, we can say that Ridge Polynomial Regression will bring us the most accurate results.
The five fundamental assumptions of linear regression analysis were checked; these can be seen in the Jupyter Notebook.
The GitHub repository for web scraping and data processing is here.
Thank you for your time and reading my article. Please feel free to contact me if you have any questions or would like to share your comments.
Translated from: https://medium.com/analytics-vidhya/predicting-a-movies-revenue-3709fb460604