Principal Component Analysis, Visualized

If you have ever taken an online course on Machine Learning, you must have come across Principal Component Analysis (PCA) for dimensionality reduction, or in simple terms, for compression of data. Guess what, I took such courses too, but I never really understood the graphical significance of PCA because all I saw was matrices and equations. It took me quite a lot of time to understand this concept from various sources, so I decided to compile it all in one place.

In this article, we will take a visual (graphical) approach to understand PCA and how it can be used to compress data. Basic knowledge of linear algebra and matrices is assumed. If you are new to this concept, just follow along; I have tried my best to keep this as simple as possible.

Introduction

These days, datasets containing a large number of dimensions are increasingly common and are often difficult to interpret. One example is a database of face photographs of, let's say, 1,000,000 people. If each face photograph has a dimension of 100x100, then the data of each face is 10,000-dimensional (there are 100x100 = 10,000 unique values to be stored for each face). Now, if 1 byte is required to store the information of each pixel, then 10,000 bytes are required to store 1 face. Since there are 1,000,000 faces in the database, 10,000 x 1,000,000 bytes = 10 GB will be needed to store the dataset.
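
As a quick sanity check, that arithmetic can be reproduced in a couple of lines of Python (the 1-byte-per-pixel figure is the assumption stated above):

    # Storage for 1,000,000 face photos of 100x100 pixels at 1 byte per pixel.
    bytes_per_face = 100 * 100           # 10,000 unique values, 1 byte each
    total_bytes = bytes_per_face * 1_000_000
    print(total_bytes / 10**9)           # 10.0, i.e. about 10 GB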

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, exploiting the fact that the images in these datasets have something in common. For instance, in a dataset consisting of face photographs, each photograph will have facial features like eyes, a nose, and a mouth. Instead of encoding this information pixel by pixel, we could make a template of each type of these features and then just combine these templates to generate any face in the dataset. In this approach, each template will still be 100x100 = 10,000 dimensional, but since we will be reusing these templates (basis functions) to generate each face in the dataset, the number of templates required will be very small. PCA does exactly this.

How does PCA work?

This part is going to be a bit technical, so bear with me! I will try to explain the working of PCA with a simple example. Let's consider the data shown below, containing 100 points, each 2-dimensional (an x and a y coordinate are needed to represent each point).

[Figure: the example dataset of 100 two-dimensional points. Image by Author]

Currently, we are using 2 values to represent each point. Let's explain this situation in a more technical way. We are currently using 2 basis functions, x as (1, 0) and y as (0, 1). Each point in the dataset is represented as a weighted sum of these basis functions. For instance, the point (2, 3) can be represented as 2(1, 0) + 3(0, 1) = (2, 3). If we omit either of these basis functions, we will not be able to represent the points in the dataset accurately. Therefore, both dimensions are necessary, and we can't just drop one of them to reduce the storage requirement. This set of basis functions is actually the Cartesian coordinate system in 2 dimensions.
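
To make the "weighted sum of basis functions" idea concrete, here is a minimal NumPy sketch of the (2, 3) example:

    import numpy as np

    # The two basis functions: x as (1, 0) and y as (0, 1).
    x = np.array([1.0, 0.0])
    y = np.array([0.0, 1.0])

    # The point (2, 3) as a weighted sum of the basis functions.
    point = 2 * x + 3 * y
    print(point)   # [2. 3.]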

If we look closely, we can see that the data approximately follows a line, as shown by the red line below.

[Figure: the dataset with the red line it approximates. Image by Author]

Now, let's rotate the coordinate system such that the x-axis lies along the red line. Then, the y-axis (green line) will be perpendicular to this red line. Let's call these new x and y axes the a-axis and the b-axis respectively. This is shown below.

[Figure: the rotated axes, with the a-axis along the red line and the b-axis perpendicular to it. Image by Author]

Now, if we use a and b as the new set of basis functions (instead of using x and y) for this dataset, it wouldn't be wrong to say that most of the variance in the dataset is along the a-axis. Now, if we drop the b-axis, we can still represent the points in the dataset very accurately, using just the a-axis. Therefore, we now need only half as much storage to store the dataset and reconstruct it accurately. This is exactly how PCA works.

PCA is a 4-step process. Starting with a dataset containing n dimensions (requiring n axes to be represented):

  • Find a new set of basis functions (n axes) where some axes contribute to most of the variance in the dataset while others contribute very little.

  • Arrange these axes in decreasing order of variance contribution.

  • Now, pick the top k axes to be used and drop the remaining n-k axes.

  • Now, project the dataset onto these k axes.

After these 4 steps, the dataset will be compressed from n dimensions to just k dimensions (k < n).
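
Before walking through these steps on the example dataset, here is what the whole 4-step process looks like as a minimal NumPy sketch. This is an illustrative implementation written for clarity, not the author's Colab code; it assumes the data is given as an array X of shape (n_samples, n), and it centers the data first, a standard preprocessing step that the outline above glosses over:

    import numpy as np

    def pca_compress(X, k):
        """Compress X of shape (n_samples, n) down to k dimensions."""
        # Center the data so the covariance matrix is meaningful.
        mean = X.mean(axis=0)
        X_centered = X - mean

        # Step 1: the eigenvectors of the covariance matrix are the new axes.
        cov = np.cov(X_centered, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)

        # Step 2: arrange the axes in decreasing order of variance contribution.
        order = np.argsort(eigenvalues)[::-1]
        eigenvectors = eigenvectors[:, order]

        # Step 3: keep the top k axes, drop the remaining n - k.
        W = eigenvectors[:, :k]          # projection matrix, shape (n, k)

        # Step 4: project the dataset onto the k retained axes.
        return X_centered @ W, W, mean   # compressed data: shape (n_samples, k)

Production implementations (for example, scikit-learn's PCA) typically use an SVD of the centered data instead of an explicit eigendecomposition, which is numerically more stable, but the idea is the same.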

Steps

For the sake of simplicity, let's take the above dataset and apply PCA to it. The steps involved are technical, and basic knowledge of linear algebra is assumed. You can view the Colab Notebook here:

Step 1

Since this is a 2-dimensional dataset, n = 2. The first step is to find the new set of basis functions (a & b). In the explanation above, we saw that the dataset had the maximum variance along a line, and we manually chose that line as the a-axis and the line perpendicular to it as the b-axis. In practice, we want this step to be automated.

To accomplish this, we can find the eigenvalues and eigenvectors of the covariance matrix of the dataset. Since the dataset is 2-dimensional, we will get 2 eigenvalues and their corresponding eigenvectors. The 2 eigenvectors are then the two basis functions (new axes), and the two eigenvalues tell us the variance contribution of the corresponding eigenvectors. A large eigenvalue implies that the corresponding eigenvector (axis) contributes more towards the total variance of the dataset.
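
In code, this step is just two NumPy calls. A sketch, assuming the example dataset is stored as a (100, 2) array X:

    import numpy as np

    # Center the data, then compute the 2x2 covariance matrix.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # Two eigenvalues and their corresponding eigenvectors (as columns).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)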

[Image by Author]

Step 2

Now, sort the eigenvectors (axes) in order of decreasing eigenvalues. Here, we can see that the eigenvalue for the a-axis is much larger than that of the b-axis, meaning that the a-axis contributes more towards the dataset variance.

[Image by Author]

The percentage contribution of each axis towards the total dataset variance can be calculated as:

contribution of axis i (%) = eigenvalue_i / (sum of all eigenvalues) × 100

[Image by Author]

The above numbers show that the a-axis contributes 99.7% of the dataset variance and that we can drop the b-axis and lose just 0.28% of the variance.
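
A sketch of this sorting and percentage calculation, continuing from the eigenvalues and eigenvectors computed in Step 1:

    # Sort the axes by decreasing eigenvalue (variance contribution).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Percentage of the total variance contributed by each axis.
    percentages = eigenvalues / eigenvalues.sum() * 100
    print(percentages)   # approximately [99.7, 0.28] for this dataset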

Step 3

Now, we will drop the b-axis and keep only the a-axis.

[Image by Author]
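
In code, dropping the b-axis amounts to keeping only the first column of the sorted eigenvector matrix:

    # Keep the top k = 1 axis (the a-axis) and drop the rest.
    k = 1
    top_axes = eigenvectors[:, :k]   # shape (2, 1)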

Step 4

Now, reshape the first eigenvector (a-axis) into a 2x1 matrix, called the projection matrix. It will be used to project the original dataset of shape (100, 2) onto the new basis function (a-axis), thus compressing it to (100, 1).

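The code for this step was embedded as images in the original post; a minimal equivalent sketch, continuing from the variables in the previous snippets:

    # The first eigenvector (the a-axis), reshaped to a 2x1 projection matrix.
    projection_matrix = eigenvectors[:, :1]    # shape (2, 1)

    # Project the (100, 2) dataset onto the a-axis, compressing it to (100, 1).
    X_compressed = X_centered @ projection_matrix
    print(X_compressed.shape)                  # (100, 1)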

Reconstruct the data

Now, we can use the projection matrix to expand the data back to its original size, with, of course, a small loss of variance (0.28%).

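The original snippet here was also an image; the reconstruction is just the reverse projection, with the mean added back since the data was centered:

    # Expand the compressed data back to its original 2D shape.
    X_reconstructed = X_compressed @ projection_matrix.T + X.mean(axis=0)
    print(X_reconstructed.shape)               # (100, 2)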

The reconstructed data is shown below:

[Figure: the reconstructed dataset. Image by Author]

Please note that the variance along the b-axis (0.28%) is lost, as is evident from the figure above.

That's all folks!

If you made it this far, hats off to you! In this article, we took a graphical approach to understand how Principal Component Analysis works and how it can be used for data compression. In my next article, I will show how PCA can be used to compress Labeled Faces in the Wild (LFW), a large-scale dataset consisting of 13,233 human-face images.

If you have any suggestions, please leave a comment. I write articles regularly, so you should consider following me to get more such articles in your feed.

If you liked this article, you might as well love these:

Visit my website to learn more about me and my work.

Translated from: https://towardsdatascience.com/principal-component-analysis-visualized-17701e18f2fa
