How I automated my job search by building a web crawler from scratch

The story of how it began

It was midnight on a Friday, my friends were out having a good time, and yet I was nailed to my computer screen typing away.

Oddly, I didn’t feel left out.

I was working on something that I thought was genuinely interesting and awesome.

I was right out of college, and I needed a job. When I left for Seattle, I had a backpack full of college textbooks and some clothes. I could fit everything I owned in the trunk of my 2002 Honda Civic.

I didn’t like to socialize much back then, so I decided to tackle this job-finding problem the best way I knew how. I tried to build an app to do it for me, and this article is about how I did it.

Getting started with Craigslist

I was in my room, furiously building some software that would help me collect, and respond to, people who were looking for software engineers on Craigslist. Craigslist is essentially the marketplace of the Internet, where you can go and find things for sale, services, community posts, and so on.

At that point in time, I had never built a fully fledged application. Most of the things I worked on in college were academic projects that involved building and parsing binary trees, computer graphics, and simple language processing models.

I was quite the “newb.”

That said, I had always heard about this new “hot” programming language called Python. I didn’t know much Python, but I wanted to get my hands dirty and learn more about it.

So I put two and two together, and decided to build a small application using this new programming language.

The journey to build a (working) prototype

For development, I used a secondhand BenQ laptop that my brother had given me when I left for college.

It wasn’t the best development environment by any measure. I was using Python 2.4 and an older version of Sublime Text, yet the process of writing an application from scratch was truly an exhilarating experience.

I didn’t know what I needed to do yet. I was trying various things out to see what stuck, and my first approach was to find out how I could access Craigslist data easily.

I looked up Craigslist to find out if they had a publicly available REST API. To my dismay, they didn’t.

However, I found the next best thing.

Craigslist had an RSS feed that was publicly available for personal use. An RSS feed is essentially a computer-readable summary of updates that a website sends out. In this case, the RSS feed would allow me to pick up new job listings whenever they were posted. This was perfect for my needs.

Next, I needed a way to read these RSS feeds. I didn’t want to go through the RSS feeds manually myself, because that would be a time-sink and that would be no different than browsing Craigslist.

Around this time, I started to realize the power of Google. There’s a running joke that software engineers spend most of their time Googling for answers. I think there’s definitely some truth to that.

After a little bit of Googling, I found this useful post on Stack Overflow that described how to search through a Craigslist RSS feed. It was sort of a filtering functionality that Craigslist provided for free. All I had to do was pass in a specific query parameter with the keyword I was interested in.

I was focused on searching for software-related jobs in Seattle. With that, I typed up this specific URL to look for listings in Seattle that contained the keyword “software”.

https://seattle.craigslist.org/search/sss?format=rss&query=software
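
For readers who want to try this step, here is a minimal sketch in modern Python 3 (not the original Python 2.4 script). It builds the same kind of URL from a city and a keyword, then pulls titles and links out of a feed using only the standard library. The function names, the sample feed, and the example.org links are hypothetical; the exact XML layout of a real Craigslist feed is an assumption, so the parser matches elements by their local tag name rather than relying on a particular namespace.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_feed_url(city, query, category="sss"):
    """Build a Craigslist search URL that asks for RSS output."""
    params = urlencode({"format": "rss", "query": query})
    return f"https://{city}.craigslist.org/search/{category}?{params}"


def local_name(tag):
    """Strip an XML namespace such as '{http://...}item' down to 'item'."""
    return tag.rsplit("}", 1)[-1]


def parse_listings(feed_xml):
    """Return a list of {'title': ..., 'link': ...} dicts, one per <item>."""
    listings = []
    for element in ET.fromstring(feed_xml).iter():
        if local_name(element.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        listings.append({"title": fields.get("title", ""), "link": fields.get("link", "")})
    return listings


# Hypothetical sample feed; a real one would be downloaded from the URL above.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Software Engineer</title><link>https://example.org/post/1.html</link></item>
  <item><title>Python Developer</title><link>https://example.org/post/2.html</link></item>
</channel></rss>"""

print(build_feed_url("seattle", "software"))
for listing in parse_listings(SAMPLE_FEED):
    print(listing["title"], "->", listing["link"])
```

Keeping the URL builder separate from the parser means the parsing logic can be tried on a saved feed without hitting the site at all.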

And voilà! It worked beautifully.

The most beautiful soup I’ve ever tasted

I wasn’t convinced, however, that my approach would work.

First, the number of listings was limited. My data didn’t contain all the available job postings in Seattle. The returned results were merely a subset of the whole. I was looking to cast as wide a net as possible, so I needed to know all the available job listings.

Second, I realized that the RSS feed didn’t include any contact information. That was a bummer. I could find the listings, but I couldn’t contact the posters unless I manually filtered through these listings.

I’m a person of many skills and interests, but doing repetitive manual work isn’t one of them. I could’ve hired someone to do it for me, but I was barely scraping by with 1-dollar ramen cup noodles. I couldn’t splurge on this side project.

That was a dead end. But it wasn’t the end.

Continuous iteration

From my first failed attempt, I learned that Craigslist had an RSS feed that I could filter on, and each posting had a link to the actual posting itself.

Well, if I could access the actual posting, then maybe I could scrape the email address off of it? 🧐 That meant I needed to find a way to grab email addresses from the original postings.

Once again, I pulled up my trusted Google, and searched for “ways to parse a website.”

With a little Googling, I found a cool little Python tool called Beautiful Soup. It’s essentially a nifty tool that allows you to parse an entire DOM tree and helps you make sense of how a web page is structured.

My needs were simple: I needed a tool that was easy to use and would let me collect data from a webpage. BeautifulSoup checked off both boxes, and rather than spending more time picking out the best tool, I picked a tool that worked and moved on. Here’s a list of alternatives that do something similar.

Side note: I found this awesome tutorial that talks about how to scrape websites using Python and BeautifulSoup. If you’re interested in learning how to scrape, then I recommend reading it.

With this new tool, my workflow was all set.

I was now ready to tackle the next task: scraping email addresses from the actual postings.

Now, here’s the cool thing about open-source technologies. They’re free and work great! It’s like getting free ice-cream on a hot summer day, and a freshly baked chocolate-chip cookie to go.

BeautifulSoup lets you search for specific HTML tags, or markers, on a web page. And Craigslist has structured their listings in such a way that it was a breeze to find email addresses. The tag was something along the lines of “email-reply-link,” which basically points out that an email link is available.

From then on, everything was easy. I relied on the built-in functionality BeautifulSoup provided, and with just some simple manipulation, I was able to pick out email addresses from Craigslist posts quite easily.
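
To make the idea concrete, here is a rough sketch of picking an email address out of a posting's HTML. It deliberately swaps in Python's built-in html.parser instead of BeautifulSoup so that it runs without installing anything, which means it is not the original script. It also assumes that the marker is a CSS class named "email-reply-link" on a link whose href is a mailto: address; the real markup may have differed, and the class name, sample HTML, and address below are hypothetical.

```python
from html.parser import HTMLParser


class ReplyLinkParser(HTMLParser):
    """Collect addresses from <a class="email-reply-link" href="mailto:...">."""

    def __init__(self, css_class="email-reply-link"):
        super().__init__()
        self.css_class = css_class  # assumed marker name, see the note above
        self.emails = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href") or ""
        if tag == "a" and self.css_class in classes and href.startswith("mailto:"):
            # Drop the scheme and any '?subject=...' suffix.
            self.emails.append(href[len("mailto:"):].split("?", 1)[0])


def extract_emails(html):
    """Return every reply address found in one posting's HTML."""
    parser = ReplyLinkParser()
    parser.feed(html)
    return parser.emails


# Hypothetical posting markup used only for illustration.
SAMPLE_POSTING = """
<html><body>
  <h2>Software engineer wanted</h2>
  <a class="email-reply-link" href="mailto:jobs@example.com?subject=Software%20engineer">reply</a>
  <a href="/about">about</a>
</body></html>
"""

print(extract_emails(SAMPLE_POSTING))
```

A parsing library such as BeautifulSoup does the same job with less code; the point here is only that the address sits behind a recognizable marker.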

Putting things together

Within an hour or so, I had my first MVP. I had built a web scraper that could collect email addresses and respond to people looking for software engineers within a 100-mile radius of Seattle.

I added various add-ons on top of the original script to make life much easier. For example, I saved the results into a CSV and HTML page so that I could parse them quickly.
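
As an illustration of the CSV part, the sketch below writes one row per listing with Python's csv module. The column names and sample rows are hypothetical, and an in-memory buffer stands in for a real file so the example leaves nothing on disk.

```python
import csv
import io


def write_results(rows, out_file):
    """Write one CSV row per listing with title, link, and email columns."""
    writer = csv.DictWriter(out_file, fieldnames=["title", "link", "email"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical results; the second listing has no contact address.
results = [
    {"title": "Software Engineer", "link": "https://example.org/post/1.html", "email": "jobs@example.com"},
    {"title": "Python Developer", "link": "https://example.org/post/2.html", "email": ""},
]

buffer = io.StringIO()  # stands in for open("results.csv", "w", newline="")
write_results(results, buffer)
print(buffer.getvalue())
```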

Of course, there were many other notable features lacking, such as:

the ability to log the email addresses I had already sent messages to
fatigue rules to prevent over-sending emails to people I’d already reached out to
special cases, such as some emails requiring a Captcha before they’re displayed to deter automated bots (which I was)
a way around Craigslist’s ban on scrapers: I would get banned if I ran the script too often (I tried switching between various VPNs to “trick” Craigslist, but that didn’t work), and
the ability to retrieve all postings on Craigslist
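
The first two missing features could look something like the sketch below. This was never part of the script described here; it is a hypothetical illustration of a sent log with a cool-down ("fatigue") rule. The class name and the 30-day window are arbitrary assumptions.

```python
from datetime import datetime, timedelta


class SentLog:
    """Remember when each address was last emailed and enforce a cool-down."""

    def __init__(self, cooldown=timedelta(days=30)):
        self.cooldown = cooldown
        self.last_sent = {}  # email address -> datetime of the last message

    def can_send(self, email, now):
        """True if the address was never contacted or the cool-down has passed."""
        previous = self.last_sent.get(email.lower())
        return previous is None or now - previous >= self.cooldown

    def record(self, email, now):
        """Log that a message went out to this address at time `now`."""
        self.last_sent[email.lower()] = now


log = SentLog(cooldown=timedelta(days=30))
start = datetime(2024, 1, 1)
log.record("jobs@example.com", start)
print(log.can_send("jobs@example.com", start + timedelta(days=1)))   # too soon
print(log.can_send("jobs@example.com", start + timedelta(days=31)))  # cool-down over
print(log.can_send("other@example.com", start))                      # never contacted
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the rule easy to test.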

The last one was a kicker. But I figured if a posting had been sitting for a while, then maybe the person who posted it was not even looking anymore. It was a trade-off I was OK with.

The whole experience felt like a game of Tetris. I knew what my end goal was, and my real challenge was fitting the right pieces together to achieve that specific end goal. Each piece of the puzzle brought me on a different journey. It was challenging but enjoyable nonetheless, and I learned something new each step of the way.

Lessons learned

It was an eye-opening experience. I ended up learning a little more about how the Internet (and Craigslist) works and how different tools can work together to solve a problem. Plus, I got a cool little story I can share with friends.

In a way, that’s a lot like how technologies work these days. You find a big, hairy problem that you need to solve, and you don’t see any immediate, obvious solution to it. You break down the big hairy problem into multiple different manageable chunks, and then you solve them one chunk at a time.

Looking back, my problem was this: how can I use this awesome directory on the Internet to reach people with specific interests quickly? There were no known products or solutions available to me at the time, so I broke it down into multiple pieces:

Find all listings on the platform
Collect contact information about each listing
Send an email to them if the contact information exists
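
The three pieces above can be sketched as a tiny pipeline. This is an illustrative outline rather than the original script: the function names, sample listings, and addresses are hypothetical, and the example only composes messages with Python's email.message module instead of actually sending anything.

```python
from email.message import EmailMessage


def build_message(sender, recipient, listing_title):
    """Compose (but do not send) a short introduction email."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Re: {listing_title}"
    message.set_content("Hi, I saw your posting and would love to talk.")
    return message


def run_pipeline(listings, find_contact, sender):
    """Step 1 is the `listings` input; steps 2 and 3 happen per listing."""
    outbox = []
    for listing in listings:
        recipient = find_contact(listing)  # step 2: collect contact info
        if recipient:  # step 3: only write an email if a contact exists
            outbox.append(build_message(sender, recipient, listing["title"]))
    return outbox


# Hypothetical data standing in for scraped listings and their contact emails.
listings = [
    {"title": "Software Engineer", "link": "https://example.org/post/1.html"},
    {"title": "Python Developer", "link": "https://example.org/post/2.html"},
]
contacts = {"https://example.org/post/1.html": "jobs@example.com"}

outbox = run_pipeline(listings, lambda listing: contacts.get(listing["link"]), "me@example.com")
for message in outbox:
    print(message["To"], "|", message["Subject"])
```

Passing `find_contact` in as a function keeps each chunk swappable, which mirrors the idea of solving the problem one piece at a time.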

That’s all there was to it. Technology merely acted as a means to the end. If I could’ve used an Excel spreadsheet to do it for me, I would’ve opted for that instead. However, I’m no Excel guru, and so I went with the approach that made the most sense to me at the time.

Areas of Improvement

There were many areas in which I could improve:

I picked a language I wasn’t very familiar with, so there was a learning curve in the beginning. It wasn’t too awful, because Python is very easy to pick up. I highly recommend it as a first language for any beginning software enthusiast.
I relied too heavily on open-source technologies. Open-source software has its own set of problems, too. Several of the libraries I used were no longer in active development, so I ran into issues early on: sometimes I could not import a library, or a library would fail for seemingly innocuous reasons.
Tackling a project by yourself can be fun, but it can also cause a lot of stress. You need a lot of momentum to ship something. This project was quick and easy, but it did take me a few weekends to add in the improvements. As the project went on, I started to lose motivation and momentum. After I found a job, I completely ditched the project.

Resources and Tools I used

The Hitchhiker’s Guide to Python: Great book for learning Python in general. I recommend Python as a beginner’s first programming language, and I talk about how I used it to land offers from multiple top-tier companies in my article here.

DailyCodingProblem: It’s a service that sends out daily coding problems to your email, and has some of the most recent programming problems from top-tier tech companies. Use my coupon code, zhiachong, to get $10 off!

BeautifulSoup — The nifty utility tool I used to build my web crawler

Web Scraping with Python — A useful guide to learning how web scraping with Python works.

Lean Startup - I learned about rapid prototyping and creating an MVP to test an idea from this book. I think the ideas in here are applicable across many different fields and also helped drive me to complete the project.

Evernote — I used Evernote to compile my thoughts together for this post. Highly recommend it — I use this for basically _everything_ I do.

My laptop: This is my current at-home laptop, set up as a work station. It’s much, much easier to work with than an old BenQ laptop, but both would work for just general programming work.

Credits:

Brandon O’brien, my mentor and good friend, for proof-reading and providing valuable feedback on how to improve this article.

Leon Tager, my coworker and friend who proofreads and showers me with much-needed financial wisdom.

Sign up here to get industry news and random tidbits, and to be the first to know when I publish new articles.

Zhia Chong is a software engineer at Twitter. He works on the Ads Measurement team in Seattle, measuring ads impact and ROI for advertisers. The team is hiring!

You can find him on Twitter and LinkedIn.
