20. 🔗 Code notebook for this chapter 📓 (a VPN 🪜 is needed to reach it): 01_the_machine_learning_landscape.ipynb - Colab (google.com)
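If you would rather not download this chapter's notebook from the official link above, you can also get it from the attachment to this post. Still, I recommend supporting the original book and the original site.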
21. Reference answers (original English text):
Machine Learning is about building systems that can learn from data.
Learning means getting better at some task, given some performance measure.
Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
The two most common supervised tasks are regression and classification.
Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semi-supervised learning problem, but it would be less natural.
If you don’t know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
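To make the clustering case concrete, here is a minimal sketch of k-means on a single feature, written in plain Python. The `kmeans_1d` helper, the evenly spread initial centroids, and the monthly-spend figures are all assumptions made up for illustration; a real project would more likely use a library implementation.

```python
def kmeans_1d(values, k=2, n_iter=10):
    """Tiny 1-D k-means for k >= 2; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Assumption: spread the initial centroids evenly between min and max
    # so that the result is deterministic.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Hypothetical monthly spend per customer; no group labels are given.
spend = [12.0, 15.0, 11.0, 95.0, 102.0, 99.0]
centroids, labels = kmeans_1d(spend, k=2)
```

No labels are supplied here: the two segments emerge from the data alone, which is what separates this from the classification route described above.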
Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their labels (spam or not spam).
An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer’s main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
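The mini-batch idea can be sketched in a few lines of plain Python. Everything named here is hypothetical: `synthetic_stream` stands in for a dataset too large to load at once, and the model is a one-feature linear regression trained by gradient steps on squared error, chosen only because it is short.

```python
import random

def mini_batches(stream, batch_size):
    """Yield lists of at most batch_size items, one batch at a time."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def sgd_step(w, b, batch, lr):
    """One gradient step on mean squared error for the model y = w * x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

def synthetic_stream(n, seed=0):
    """Hypothetical data source: yields (x, y) with y = 3x + 1, never storing all pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 3.0 * x + 1.0

w, b = 0.0, 0.0
for batch in mini_batches(synthetic_stream(20000), batch_size=32):
    # Only the current mini-batch is held in memory.
    w, b = sgd_step(w, b, batch, lr=0.1)
```

Because the generator hands over one batch at a time, memory use depends on `batch_size` rather than on the size of the full dataset.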
An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
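A minimal sketch of that idea, assuming Euclidean distance as the similarity measure and a k-nearest-neighbors average as the prediction rule (both are illustrative choices, as are the `predict_knn` name and the toy data):

```python
import math

def predict_knn(train, new_x, k=3):
    """Average the targets of the k stored instances closest to new_x."""
    # "Learning by heart": train is simply the stored list of (features, target) pairs.
    nearest = sorted(train, key=lambda inst: math.dist(inst[0], new_x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy training set: one feature per instance.
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
prediction = predict_knn(train, (2.1,), k=3)
```

There is no training step beyond storing the data; all the work happens at prediction time, when distances to the stored instances are computed.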
A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
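The distinction can be illustrated with a one-parameter ridge regression, sketched below under the assumption of a model y = theta * x with no intercept. Here `theta` is the model parameter found by the learning algorithm, while `alpha` is a hyperparameter fixed before training; the function name and data are made up for the example.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Return the slope minimizing sum((theta * x - y)^2) + alpha * theta^2."""
    # Closed-form solution for this one-parameter model.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta_plain = fit_ridge_slope(xs, ys, alpha=0.0)   # no regularization
theta_reg = fit_ridge_slope(xs, ys, alpha=14.0)    # strong regularization shrinks the slope
```

Changing `alpha` changes which `theta` the algorithm returns, but `alpha` itself is never learned from the data.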
Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance’s features into the model’s prediction function, using the parameter values found by the learning algorithm.
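As a rough sketch of those two pieces, the snippet below defines a prediction function and a cost function for a linear model with one feature. The choice of mean squared error, the penalty on the slope only, and the names `predict` and `cost` are assumptions made for illustration.

```python
def predict(theta0, theta1, x):
    """Prediction function of a linear model with one feature."""
    return theta0 + theta1 * x

def cost(theta0, theta1, data, alpha=0.0):
    """Mean squared error on the training data plus an optional complexity penalty."""
    mse = sum((predict(theta0, theta1, x) - y) ** 2 for x, y in data) / len(data)
    return mse + alpha * theta1 ** 2

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # generated from y = 1 + 2x

good = cost(1.0, 2.0, data)                  # parameters that fit the data exactly
bad = cost(0.0, 0.0, data)                   # parameters that ignore the data
penalized = cost(1.0, 2.0, data, alpha=0.5)  # same fit, with a regularization penalty added
new_prediction = predict(1.0, 2.0, 3.0)      # prediction for a new instance
```

A learning algorithm would search for the `theta0` and `theta1` that make `cost` as small as possible, then reuse them in `predict`.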
Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
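One simple way to obtain the test set and validation set described above is to shuffle the data once and cut it into three parts. The sketch below is an assumption-laden illustration: the `split_dataset` helper, the 60/20/20 proportions, and the fixed seed are arbitrary choices, not a recommendation from the original text.

```python
import random

def split_dataset(instances, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and cut it into train, validation, and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Integers stand in for real instances here.
train, val, test = split_dataset(range(100))
```

Models and hyperparameters would be compared on `val`, and `test` would be touched only once, at the end, to estimate the generalization error.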
The train-dev set is used when there is a risk of mismatch between the training data and the data used in the validation and test datasets (which should always be as close as possible to the data used once the model is in production). The train-dev set is a part of the training set that’s held out (the model is not trained on it). The model is trained on the rest of the training set, and evaluated on both the train-dev set and the validation set. If the model performs well on the training set but not on the train-dev set, then the model is likely overfitting the training set. If it performs well on both the training set and the train-dev set, but not on the validation set, then there is probably a significant data mismatch between the training data and the validation + test data, and you should try to improve the training data to make it look more like the validation + test data.
If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).