K-means algorithm, bisecting k-means algorithm
Have you ever seen a Caribbean reef? Well, if you haven’t, prepare yourself.
Today, we will be answering a question that, at face value, appears quite simple: “What does a Caribbean reef look like?” However, this question can be decomposed into many complex layers. So to avoid ambiguity, let’s refine the question to: “What are the non-mobile components of a Caribbean reef and how are they related?”
That seems reasonable; we’ll have to look at fish another day.
Now we’re not going to roll out beautiful images of underwater cities teeming with diversity. Instead, we have bar charts. Without further ado, let’s dive in.
What Makes Up a Typical Reef?
To start, we have developed a baseline graph (Figure 1) of the components of all Caribbean reefs. Here we have the median percent cover for nine substrate types. Now, if you haven’t conducted a scuba transect before, it may be helpful to break down that sentence. First, percent cover is how coral reef composition is measured: in other words, from a bird’s-eye view, what percent of the sea floor is hard coral, sponge, rock, etc. Second, substrate types are broad categories of sea floor, such as silt or sand. If you’re curious about the sampling methods or specific substrate definitions, check this out.
Ok, so in Figure 1 we’re looking at the median percent cover for each of the nine substrate types. For example, in the Hard Coral column, we can see that hard coral’s median percent cover is roughly 17%. Good to know.
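To make "median percent cover" concrete, here is a minimal sketch of how such a value could be computed per substrate type. The three surveys and their numbers are synthetic placeholders invented for illustration (they are not Reef Check data), and the function name `median_cover` is hypothetical.

```python
from statistics import median

# Synthetic percent-cover readings for three hypothetical surveys.
# These are NOT Reef Check data; each survey sums to 100 percent.
surveys = [
    {"hard_coral": 20, "rock": 50, "sand": 30},
    {"hard_coral": 10, "rock": 30, "sand": 60},
    {"hard_coral": 15, "rock": 45, "sand": 40},
]

def median_cover(rows):
    """Return the median percent cover for each substrate type."""
    substrates = rows[0].keys()
    return {s: median(row[s] for row in rows) for s in substrates}

medians = median_cover(surveys)
print(medians)  # {'hard_coral': 15, 'rock': 45, 'sand': 40}
```

Each bar in a chart like Figure 1 would then correspond to one entry of `medians`.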
Diving deeper into the chart, it appears that most Caribbean reefs are primarily composed of four substrate types: rock, hard coral, nutrient indicator algae (NI Algae), and sand. Together, these four categories account for 91% of the sum of the nine median values. On the other hand, recently killed coral (RK Coral) and silt both have median values of 0, so they’re relatively rare.
We have learned that Caribbean reefs are rocky and sandy. Lovely.
But here’s an alarming analogy: the average number of children per US family is 1.93. If we take that number to be representative of the data, we might conclude that most families have 1.93 children, which I find hard to believe. Even worse, we have no understanding of the underlying distribution that led to an average of 1.93. There could be one family with 94 children and 99 families with one child each. Instead, it would be useful to see if there are common counts for the number of kids per family.
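A quick sketch shows how a single summary number can hide the distribution behind it. The sample below is a made-up set of 100 families chosen only so that the mean comes out to 1.93; it is not real census data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample: 99 families with one child and one family with 94.
children = [1] * 99 + [94]

print(mean(children))    # 1.93
print(median(children))  # 1.0

# Counting how often each family size occurs reveals the actual shape.
counts = Counter(children)
print(counts.most_common())  # [(1, 99), (94, 1)]
```

The mean alone suggests "about two kids per family", while the counts show that almost every family in this toy sample has exactly one.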
K-Means Demo
Applying this logic to reef composition, we will explore whether there are groups of coral reefs using the above substrate categories. This is where unsupervised classification comes into play. Unsupervised algorithms fit data where we don’t know the “correct” answer. And one of the simplest methods of all is the k-means algorithm.
Without getting too technical, k-means attempts to split data into k clusters. The algorithm does this by minimizing the total squared distance from each point to the center of its cluster (the cluster mean). And because of this simple fitting criterion, it’s really easy to interpret. So let’s see an example…
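For readers who want to see the mechanics, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. It is an illustration only, not the code used for this analysis; the six points are synthetic (hard coral %, NI algae %) pairs, and the function names are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and updating centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, labels

# Synthetic (hard coral %, NI algae %) pairs, invented for illustration.
points = [(30, 5), (35, 8), (28, 4), (5, 40), (8, 45), (6, 38)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On data this well separated, the first three points end up in one cluster and the last three in the other. On real data the result can depend on the starting centers, so it is common to rerun with several seeds.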

In Figure 2 we have created two clusters (k=2 in this case) using two substrate categories: hard coral and nutrient indicator algae. As you can see, there appears to be a clear divide between the two clusters. But let’s not get into interpretation quite yet.
Instead, let’s consider the case where we add another variable. Here, the k-means algorithm would categorize each point using three dimensions instead of two. But as you increase the number of dimensions, you lose the ability to visualize; it’s pretty hard to think in five or eight dimensions. However, we can still see where the cluster centers are numerically located in hyperspace.
Now that we have a basic understanding of what k-means does, let’s move on to the interesting graphs.
Top 4 Substrate Types (k=3)
In Figure 3 (below) we have fit three clusters (k=3) using the four most prevalent substrate types. Each bar represents a substrate category. The height of each bar represents the difference between the cluster mean and the overall mean for that substrate. Blue bars correspond to a cluster mean greater than the overall mean for that category; conversely, red bars correspond to a cluster mean less than the overall mean.
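The bar heights described here can be sketched as "cluster mean minus overall mean" for each substrate. The rows and cluster labels below are synthetic and assigned by hand purely for illustration; `cluster_deviations` is a hypothetical helper name.

```python
def column_means(rows):
    """Mean of each column across a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cluster_deviations(rows, labels):
    """For each cluster, return (cluster mean - overall mean) per column."""
    overall = column_means(rows)
    deviations = {}
    for label in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == label]
        means = column_means(members)
        deviations[label] = [m - o for m, o in zip(means, overall)]
    return deviations

# Synthetic (hard coral %, NI algae %) rows with hand-assigned clusters.
rows = [(30, 10), (20, 20), (10, 60), (20, 30)]
labels = [0, 0, 1, 1]

deviations = cluster_deviations(rows, labels)
print(deviations)  # {0: [5.0, -15.0], 1: [-5.0, 15.0]}
```

A positive entry would be drawn as a blue bar and a negative entry as a red bar.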

When classifying Caribbean reefs into three clusters, there appear to be sensible groupings: sand-dominated, rock-dominated, and algae-dominated. Interestingly, hard coral showed relatively little change even though it was the second most abundant substrate category. Conversely, nutrient indicator algae, which is often found on degraded reefs, had an extremely high signal relative to its abundance.
We can also observe that sand-dominated reefs had the most hard coral, at roughly 10 percentage points above the overall average. Rock-dominated reefs were net positive for hard coral, but only slightly. And finally, as most people would expect, the evil nutrient indicator algae appears to go hand in hand with fairly strong negative values for all other substrate types.
Ok, we’re starting to get somewhere. Now let’s increase the number of substrate types by including all categories that had a median value greater than zero; only silt and recently killed coral are excluded.
Non-Zero-Median Substrate Types (k=3)

In Figure 4 it appears the categories we found above hold steady. Sand/rubble-dominated reefs seem to support the most life, with above-average values in hard coral, soft coral, and sponge. Rocky reefs also exhibit life-supporting ability, although less than their sandy counterparts. And finally, nutrient indicator algae reefs show below-average percent cover in all other substrate types observed.
Now you might be wondering what the deal is with NI Algae. Well, nutrient indicator algae are often found on degraded reefs because they thrive in waters with elevated nutrient levels, such as nitrogen and phosphorus; Reef Check added this category to monitor the infamous algal blooms. Conversely, these high levels of nutrients can be harmful to corals. Thus, we would expect to see an inverse relationship between nutrient indicator algae and the other living substrate types, namely sponges, soft corals, and hard corals.
This stuff is pretty cool.
Fitting Using the Non-Zero Substrate Values (k=4)
In our final chart, we will try increasing the number of clusters to four because who’s to say there are only three types of Caribbean reefs? Well, technically there are statistical methods to show reasonable values that k can take. In this case the elbow method was implemented and three to five clusters were deemed sensible.
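As a rough sketch of the elbow method: fit k-means for several values of k, record the total within-cluster squared distance (often called inertia), and look for the k after which the improvement flattens out. The code below is a plain-Python illustration on synthetic points, not the analysis code; `kmeans` and `inertia` are hypothetical helper names.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means: assign points to nearest center, then update centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

def inertia(points, centers, labels):
    """Total squared distance from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Synthetic (hard coral %, NI algae %) pairs, invented for illustration.
points = [(30, 5), (35, 8), (28, 4), (5, 40), (8, 45), (6, 38)]

inertias = {}
for k in range(1, 5):
    centers, labels = kmeans(points, k)
    inertias[k] = inertia(points, centers, labels)
    print(k, round(inertias[k], 1))
```

With two obvious groups in this toy data, the drop from k=1 to k=2 is large and later drops are small, so the "elbow" sits at k=2. On real data the bend is usually less sharp, which is why a range of k values can look reasonable.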

As shown in Figure 5, a fourth category has emerged, as expected. Boasting extremely high values of hard and soft corals, this coral-dominated reef appears to be the “healthiest” of the four.
Now why did increasing the number of clusters suddenly create this magical healthy reef category? Well, with only three clusters, the high levels of hard and soft corals were lumped into the sand-dominated and rock-dominated classifications. By allowing for a fourth category, the data could be subset more cleanly.
In a similar vein, why can’t we conclude that there are five types of reefs? To answer your outstanding question, k-means with k=5 was plotted; however, the categories created were not intuitive. Moreover, because four central substrate categories compose 91% of the median total, limiting to four clusters is intuitive.
Ok, final question: how can we tell whether three or four clusters are better? Another outstanding question, but unfortunately there isn’t a clear answer.
From an ecological perspective, there is no reason why rock and sand-dominated reefs can’t support corals and sponges, which argues for k=3. It’s also simpler. However, by creating four clusters we can develop clear-cut classifications that appear to correspond to health, which argues for k=4. Those categories are:
- High health: coral-dominated
- Medium health: sand/rubble-dominated, rock-dominated
- Low health: algae-dominated
As with many applied statistics problems, humans have to make judgement calls based on subject-matter knowledge. Here, there are good arguments for both k=3 and k=4.
Conclusion
I’m glad you now understand why bar charts are superior to pretty pictures. Even though you have no idea what a Caribbean reef looks like, you have a better understanding of what makes up a Caribbean reef (which is pretty cool).
What else can we conclude?
- Caribbean reefs tend to be dominated by sand, rock, hard coral, and nutrient indicator algae. However, ratios differ greatly at the tails of the distributions.
- One of the most consistent reef classifications was algae-dominated reefs. Algal blooms tend to occur in areas with high levels of sunlight, nutrients, and CO2 (excess nutrient enrichment is called eutrophication), so from an ecological standpoint, it makes sense that coral cover would have an inverse relationship with algae. That being said, further research is required, specifically a species breakdown of the NI algae.
- All classifications that do not include nutrient indicator algae have the ability to support coral. That being said, sand-dominated reefs show a higher “life capacity” than rock-dominated reefs.
Got any other ideas?
Sources
Algae can function as indicators of water pollution. (n.d.). Retrieved August 21, 2020, from http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/
Barott, K. L., Rodriguez-Mueller, B., Youle, M., Marhaver, K. L., Vermeij, M. J., Smith, J. E., & Rohwer, F. L. (2011). Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proceedings of the Royal Society B: Biological Sciences, 279(1733), 1655–1664. doi:10.1098/rspb.2011.2155
Duffin, P. (2020, January 13). Average number of own children per family U.S. Retrieved August 20, 2020, from https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/
The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. There were 1576 unique entries for the Caribbean, ranging from 1997-05-24 to 2019-08-24. The date of the dive was not taken into account; however, in future iterations it would be interesting to see how these cluster centers change over time. The only modification to the traditional k-means algorithm was the inclusion of weights that correspond to the median percent cover of each substrate category.
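The text does not spell out how the weights enter the algorithm, so the following is only one plausible reading: treat them as per-feature weights in the squared distance. Under that assumption, weighting is equivalent to multiplying each feature by the square root of its weight and then running ordinary k-means. The weights and points below are hypothetical values, not the ones used in the analysis.

```python
import math

def weighted_sq_distance(p, q, weights):
    """Squared distance where each feature contributes in proportion to its weight."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights))

def scale_features(p, weights):
    """Multiply each feature by sqrt(weight) so plain distance matches the weighted one."""
    return tuple(math.sqrt(w) * a for a, w in zip(p, weights))

# Hypothetical per-substrate weights (assumption: not the article's values).
weights = (0.5, 0.3, 0.2)
p = (30.0, 10.0, 5.0)
q = (10.0, 40.0, 5.0)

direct = weighted_sq_distance(p, q, weights)
scaled = sum(
    (a - b) ** 2
    for a, b in zip(scale_features(p, weights), scale_features(q, weights))
)
print(direct, scaled)
```

The two numbers agree up to floating-point rounding, which means pre-scaling the columns is enough to get this kind of weighting out of any standard k-means implementation.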
Here is the code.
Note: These are my findings. If you would like to contact me, leave a message here. All criticisms are welcome.
Translated from: https://medium.com/data-diving/classification-of-caribbean-coral-reefs-using-k-means-51a66997a989