I analyzed every book ever mentioned on Stack Overflow. Here are the most popular ones.
by Vlad Wetzel
Finding your next programming book is hard, and it’s risky.
As a developer, your time is scarce, and reading a book takes up a lot of that time. You could be programming. You could be resting. But instead you’re allocating precious time to read and expand your skills.
So which book should you read? My colleagues and I often discuss books, and I’ve noticed that our opinions on a given book vary wildly.
So I decided to take a deeper look into the problem. My idea: to parse the most popular programmer resource in the world for links to a well-known book store, then count how many mentions each book has.
Fortunately, Stack Exchange (the parent company of Stack Overflow) had just published their data dump. So I sat down and got to coding.
“If you’re curious, the overall top recommended book is Working Effectively with Legacy Code, with Design Pattern: Elements of Reusable Object-Oriented Software coming in second. While the titles for these are as dry as the Atacama Desert, the content should still be quality. You can sort books by tags, like JavaScript, C, Graphics, and whatever else. This obviously isn’t the end-all of book recommendations, but it’s certainly a good place to start if you’re just getting into coding or looking to beef up your knowledge.” — review on Lifehacker.com
Shortly afterward, I launched dev-books.com, which allows you to explore all the data I gathered and sorted. I got more than 100,000 visitors and received lots of feedback asking me to describe the whole technical process.
So, as promised, I’m going to describe how I built everything right now.
Getting and importing the data
I grabbed the Stack Exchange database dump from archive.org.
From the very beginning, I realized it would not be possible to import a 48GB XML file into a freshly created database (PostgreSQL) using popular methods like myxml := pg_read_file('path/to/my_file.xml'), because I didn't have 48GB of RAM on my server. So, I decided to use a SAX parser.
All the values were stored between <row> tags, so I used a Python script to parse it:
After three days of importing (almost half of the XML was imported during this time), I realized that I'd made a mistake: the ParentID attribute should have been ParentId.
At this point, I didn’t want to wait for another week, and moved from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). But this still didn’t speed up the process.
Next decision — batch insert:
StringIO lets you treat an in-memory string as a file and hand it to the copy_from function, which uses COPY. This way, the whole import process only took one night.
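The batch-insert snippet itself didn't survive in this copy. Assuming psycopg2 (whose cursor.copy_from accepts any file-like object), the idea can be sketched like this; the table and column names are placeholders of mine, not from the original:

```python
import io

def batch_insert(cursor, table, columns, rows):
    """Buffer many rows in a StringIO and load them with a single COPY."""
    buf = io.StringIO()
    for row in rows:
        # COPY's default text format: tab-separated columns, one row per line
        buf.write("\t".join(str(value) for value in row) + "\n")
    buf.seek(0)
    cursor.copy_from(buf, table, columns=columns)
```

A single COPY avoids the per-statement overhead of millions of individual INSERTs, which is why the import could finish overnight.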
OK, time to create indexes. In theory, GiST indexes are slower than GIN, but take up less space. So I decided to use GiST. After one more day, I had an index that took up 70GB.
When I tried a couple of test queries, I realized that processing them took way too much time. The reason? Disk IO waits. An SSD (a GOODRAM C40 120GB) helped a lot, even though it's not the fastest SSD around.
I created a brand new PostgreSQL cluster:
initdb -D /media/ssd/postgresq/data
Then I made sure to change the path in my service config (I used Manjaro OS):
vim /usr/lib/systemd/system/postgresql.service
Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid
I reloaded my config and started PostgreSQL:
systemctl daemon-reload
systemctl start postgresql
This time it took a couple of hours to import, but I used GIN. The indexing took 20GB of space on the SSD, and simple queries took less than a minute.
Extracting books from the database
With my data finally imported, I started to look for posts that mentioned books, then copied them over to a separate table using SQL:
CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';
The next step was to find all the hyperlinks within those:
CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';
At this point I realized that Stack Overflow proxies all book links like this: rads.stackoverflow.com/[$isbn]/
I created another table with all posts with links:
CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';
I used regular expressions to extract all the ISBNs, and extracted Stack Overflow tags into another table through regexp_split_to_table.
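The exact regular expression isn't shown in this copy. Here's a plausible sketch, assuming the proxy links embed the ISBN directly after the rads.stackoverflow.com domain as a 10- or 13-character ISBN; both the pattern and the function name are my own illustration:

```python
import re

# ISBN-13 is thirteen digits; ISBN-10 is nine digits plus a final digit or X.
ISBN_RE = re.compile(r"rads\.stackoverflow\.com/(\d{13}|\d{9}[\dX])")

def extract_isbns(body):
    """Return every proxied ISBN found in a post body."""
    return ISBN_RE.findall(body)
```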
Once I had the most popular tags extracted and counted, the top 20 most-mentioned books turned out to be quite similar across all tags.
My next step: refining tags.
The idea was to take the top 20 most-mentioned books from each tag and exclude books that had already been processed.
Since it was a one-time job, I decided to use PostgreSQL arrays. I wrote a script to create a query like so:
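That generator script isn't included in this copy either. A rough sketch of the idea, using a PostgreSQL array to exclude already-processed ISBNs; the http_books table follows the queries above, while the tag/isbn column names and the function itself are my own illustration:

```python
def build_tag_query(tag, processed_isbns):
    """Build a per-tag top-20 query that skips books already counted."""
    # Quote each ISBN for the ARRAY[...] literal (fine for a one-off offline
    # job with trusted data; don't build SQL this way from untrusted input).
    excluded = ", ".join("'%s'" % isbn for isbn in processed_isbns)
    return (
        "SELECT isbn, COUNT(*) AS mentions FROM http_books "
        "WHERE tag = '%s' AND isbn <> ALL(ARRAY[%s]) "
        "GROUP BY isbn ORDER BY mentions DESC LIMIT 20;" % (tag, excluded)
    )
```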
With the data in hand, I headed for the web.
Building the web app
Since I’m not a web developer — and certainly not a web user interface expert — I decided to create a very simple single-page app based on a default Bootstrap theme.
I created a “search by tag” option, then extracted the most popular tags to make each search clickable.
I visualized the search results with a bar chart. I tried out Highcharts and D3, but they were geared more toward dashboards: they had some issues with responsiveness and were quite complex to configure. So I created my own responsive chart based on SVG. To make it responsive, the chart has to be redrawn on the screen orientation change event.
Web server failure
Right after I published dev-books.com, I had a huge crowd checking out my web site. Apache couldn't serve more than 500 visitors at the same time, so I quickly set up Nginx and switched over to it. I was really surprised when real-time visitors shot up to 800.
Conclusion
I hope I explained everything clearly enough for you to understand how I built this. If you have any questions, feel free to ask. You can find me on Twitter and Facebook.
As promised, I will publish my full report from Amazon.com and Google Analytics at the end of March. The results so far have been really surprising.
Make sure you click on the green heart below and follow me for more stories about technology :)
Stay tuned at dev-books.com
Translated from: https://www.freecodecamp.org/news/i-analyzed-every-book-ever-mentioned-on-stack-overflow-here-are-the-most-popular-ones-eee0891f1786/