Kaggle Dataset
“arXiv is a free distribution service and an open-access archive for 1.7 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics”, as stated by its editors. arXiv is a gold mine of knowledge. The more you dig into it, the more valuable information you find. It also makes it easier to follow trends in science.
If you are into data science, you have probably read articles on arXiv. If you haven’t yet, you should. Since data science is still an evolving field, new papers that lead to new improvements are published every day. This makes platforms like arXiv even more valuable.
arXiv has made its entire corpus available as a dataset on Kaggle. The dataset contains relevant features such as article titles, authors, categories, content (both abstract and full text) and citations of the 1.7 million scholarly articles available on arXiv.
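As a rough illustration of what working with these features could look like, here is a minimal sketch that streams records from a file with one JSON object per line. The file name `arxiv-metadata-oai-snapshot.json` comes from the Kaggle link at the end of this post; the one-object-per-line layout and the field names (`id`, `title`, `authors`, `categories`, `abstract`) are assumptions to check against the actual download, and `iter_records` is a hypothetical helper, not part of any library.

```python
import json
from typing import Iterable, Iterator

# Assumed field names; verify them against the real snapshot file.
FIELDS = ("id", "title", "authors", "categories", "abstract")


def iter_records(lines: Iterable[str], fields=FIELDS) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line, keeping only the chosen fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing fields become None instead of raising KeyError.
        yield {name: record.get(name) for name in fields}


# Made-up sample lines so the sketch runs without downloading anything.
sample_lines = [
    '{"id": "demo-1", "title": "First demo paper", "authors": "A. Author", '
    '"categories": "cs.LG stat.ML", "abstract": "Placeholder.", "update_date": "2020-01-01"}',
    "",
    '{"id": "demo-2", "title": "Second demo paper", "authors": "B. Author", '
    '"categories": "math.CO", "abstract": "Placeholder."}',
]

papers = list(iter_records(sample_lines))
for paper in papers:
    print(paper["id"], "|", paper["title"], "|", paper["categories"])

# With the real file (the path is an assumption), read lazily to keep memory low:
# with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
#     for paper in iter_records(f):
#         ...
```

Reading line by line rather than loading the whole file at once is a deliberate choice here, since a file describing 1.7 million articles may not fit comfortably in memory.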
This dataset is an amazing resource for machine learning and deep learning applications. Some of the possible applications are:
- Natural language processing (NLP) and understanding (NLU) use cases
- Text generation with deep learning using the content of articles
- Predictive analytics such as category prediction of articles
- Trend analysis of topics in different scientific fields
- Paper recommender engine
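To make the trend-analysis idea a bit more concrete, the sketch below counts how many papers fall into each category per year. It assumes each record has a `categories` field holding space-separated category codes and an `update_date` field formatted as `YYYY-MM-DD`; both are assumptions about the dataset, and the last update date is only a rough proxy for when a paper appeared. The function name and the sample records are made up for illustration.

```python
from collections import Counter


def category_counts_by_year(records):
    """Count papers per (year, category) pair.

    Assumes 'categories' is a space-separated string and
    'update_date' starts with a four-digit year.
    """
    counts = Counter()
    for rec in records:
        year = rec["update_date"][:4]
        for category in rec["categories"].split():
            counts[(year, category)] += 1
    return counts


# Made-up records, not real arXiv entries.
sample_records = [
    {"categories": "cs.LG stat.ML", "update_date": "2019-05-02"},
    {"categories": "cs.LG", "update_date": "2020-03-14"},
    {"categories": "cs.LG cs.CL", "update_date": "2020-07-30"},
]

counts = category_counts_by_year(sample_records)
for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```

A paper listed in several categories is counted once in each of them, which seems reasonable for spotting which fields are growing but would overstate the total number of papers if the counts were summed.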

Deep learning models are data hungry. With the advancements in computing and processing, models can absorb more data than ever. Such a big dataset of scientific text is a highly valuable raw material for NLP, NLU and text generation. We may even have a model that writes scholarly articles on some topics. OpenAI’s new text generator, GPT-3, makes us think beyond the limits. Thus, I don’t think we are far from having a deep learning model that writes about science.
Eleonora Presani, arXiv's executive director, said that “by offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format”. I definitely agree with her about the learning opportunities. Having all of these articles as a dataset allows us to go beyond learning by reading. A ton of valuable insights can be discovered in this gold mine of articles through data analysis and machine learning. For instance, some not-so-obvious connections between different technologies may come to light.
Converting the entire arXiv corpus into a well-structured and organized dataset has the potential to accelerate scientific discoveries. Science grows and advances by building on itself. There is no need to reinvent the wheel when we can focus on improving it. By analyzing this arXiv dataset, we can obtain a concise summary of what science has been up to and shed light on what we need to focus on going forward.
There is just so much to do with this dataset. I highly encourage you to at least take a look at it. You don’t have to build a machine learning product; the dataset is also a helpful resource for practicing data analysis and processing skills.
Thank you for reading. Please let me know if you have any feedback.
https://www.kaggle.com/Cornell-University/arxiv?select=arxiv-metadata-oai-snapshot.json
https://blogs.cornell.edu/arxiv/2020/08/05/leveraging-machine-learning-to-fuel-new-discoveries-with-the-arxiv-dataset/
Translated from: https://towardsdatascience.com/a-dataset-of-1-7-million-arxiv-articles-available-on-kaggle-8a11075cac32