計算機科學必讀書籍_5篇關于數據科學家的產品分類必讀文章

計算機科學必讀書籍

Product categorization/product classification is the organization of products into their respective departments or categories. As well, a large part of the process is the design of the product taxonomy as a whole.

產品分類/產品分類是將產品組織到各自部門或類別中。 同樣,整個過程的很大一部分是整個產品分類的設計。

Product categorization was initially a text classification task that analyzed the product’s title to choose the appropriate category. However, numerous methods have been developed which take into account the product title, description, images, and other available metadata. The following papers on product categorization represent essential reading in the field and offer novel approaches to product classification tasks.

產品分類最初是一個文本分類任務,用于分析產品標題以選擇適當的類別。 但是,已經開發出許多方法來考慮產品標題,描述,圖像和其他可用的元數據。 以下有關產品分類的論文代表了該領域的重要閱讀內容,并為產品分類任務提供了新穎的方法。

1.不要分類,翻譯 (1. Don’t Classify, Translate)

In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization. The experiment uses the Rakuten Data Challenge and Rakuten Ichiba datasets. Their method translates or converts a product’s description into a sequence of tokens which represent a root-to-leaf path to the correct category. Using this method, they are also able to propose meaningful new paths in the taxonomy.

在本文中,新加坡國立大學和樂天技術學院的研究人員提出并解釋了一種新穎的機器翻譯方法來進行產品分類。 該實驗使用了Rakuten Data Challenge和Rakuten Ichiba數據集。 他們的方法將產品的描述轉換或轉換為一系列標記,這些標記代表從根到葉的正確類別路徑。 使用這種方法,他們還能夠在分類法中提出有意義的新路徑。

The researchers state that their method outperforms many of the existing classification algorithms commonly used in machine learning today.

研究人員指出,他們的方法優于當今機器學習中常用的許多現有分類算法。

Published/Last Updated — Dec. 14, 2018

發布/最新更新— 2018年12月14日

Authors and Contributors — Maggie Yundi Li (National University of Singapore), Stanley Kok (National University of Singapore), and Liling Tan (Rakuten Institute of Technology)

作者和撰稿人:李Mag(新加坡國立大學),斯坦利·科克(新加坡國立大學)和譚麗玲(樂天技術學院)

Read Now

現在讀

2.使用神經注意模型對日本商品名稱進行大規模分類 (2. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models)

The authors of this paper propose attention convolutional neural network (ACNN) models over baseline convolutional neural network (CNN) models and gradient boosted tree (GBT) classifiers. The study uses Japanese product titles taken from Rakuten Ichiba as training data. Using this data, the authors compare the performance of the three methods (ACNN, CNN, and GBT) for large-scale product categorization. While differences in accuracy can be less than 5%, even minor improvements in accuracy can result in millions of additional correct categorizations.

本文的作者提出了關注卷積神經網絡(ACNN)模型,而不是基線卷積神經網絡(CNN)模型和梯度提升樹(GBT)分類器。 該研究使用從Rakuten Ichiba獲得的日語產品標題作為培訓數據。 利用這些數據,作者比較了三種方法(ACNN,CNN和GBT)用于大規模產品分類的性能。 盡管精度差異可以小于5%,但即使精度略有提高,也可以導致數百萬種其他正確的分類。

Lastly, the authors explain how an ensemble of ACNN and GBT models can further minimize false categorizations.

最后,作者解釋了ACNN和GBT模型的集成如何進一步減少錯誤分類。

Published/Last Updated — April, 2017 for EACL 2017

已發布/最新更新— 2017年4月,適用于EACL 2017

Authors and Contributors — From the Rakuten Institute of Technology: Yandi Xia, Aaron Levine, Pradipto Das Giuseppe Di Fabbrizio, Keiji Shinzato and Ankur Datta

作者和撰稿人—來自樂天技術學院:夏彥迪,亞倫·萊文,Pradipto Das Giuseppe Di Fabbrizio,京急新zato和安庫·達塔

Read Now

現在讀

3.地圖集:電子商務服裝產品分類的數據集和基準 (3. Atlas: A Dataset and Benchmark for Ecommerce Clothing Product Classification)

Image for post

Researchers at the University of Colorado and Ericsson Research (Chennai, India) have created a large product dataset known as Atlas. In this paper, the team presents their dataset which includes over 186,000 images of clothing products along with their product titles. Furthermore, they introduce related work in the field that has influenced their study. Finally, they test their dataset using a Resnet34 classification model and a Seq to Seq model to categorize the products. The data is taken from Indian ecommerce stores, so some of the categories used may not be applicable to Western markets. However, the dataset has been open-sourced and is available on Github.

科羅拉多大學和愛立信研究公司(印度金奈)的研究人員創建了一個名為Atlas的大型產品數據集。 在本文中,研究小組展示了他們的數據集,其中包括超過186,000種服裝產品的圖像以及產品標題。 此外,他們介紹了影響他們的研究領域的相關工作。 最后,他們使用Resnet34分類模型和Seq to Seq模型對產品進行測試,以對產品進行分類。 數據來自印度的電子商務商店,因此使用的某些類別可能不適用于西方市場。 但是,該數據集已經開源,可以在Github上使用。

Published/Last Updated — Aug. 19, 2019

發布/最后更新— 2019年8月19日

Authors and Contributors — Venkatesh Umaashankar (Ericsson Research), Girish Shanmugam (Ericsson Research), and Aditi Prakash (University of Colorado)

作者和撰稿人— Venkatesh Umaashankar(愛立信研究中心),Girish Shanmugam(愛立信研究中心)和Aditi Prakash(科羅拉多大學)

Read Now

現在讀

4.使用結構化和非結構化屬性的大規模產品分類 (4. Large Scale Product Categorization using Structured and Unstructured Attributes)

In this study, a team at WalmartLabs compares hierarchical models to flat models for product categorization.

在這項研究中,沃爾瑪實驗室的一個團隊將層次模型與平面模型進行了比較,以進行產品分類。

The researchers employ deep-learning based models which extract features from each product to create a product signature. In the paper, the researchers describe a multi-LSTM and multi-CNN based approach to this extreme classification task. Furthermore, they present a novel way to use structured attributes. The team states that their methods can be scaled to take into account any number of product attributes during categorization.

研究人員采用了基于深度學習的模型,該模型從每個產品中提取功能以創建產品簽名。 在論文中,研究人員描述了一種基于多LSTM和多CNN的方法來完成這種極端分類任務。 此外,它們提供了一種使用結構化屬性的新穎方法。 該團隊指出,他們的方法可以擴展,以在分類過程中考慮任何數量的產品屬性。

Published/Last Updated — Mar. 1, 2019

已發布/最新更新— 2019年3月1日

Authors and Contributors — From WalmartLabs: Abhinandan Krishnan and Abilash Amarthaluri

作者和貢獻者—來自沃爾瑪實驗室:Abhinandan Krishnan和Abilash Amarthaluri

Read Now

現在讀

5.使用多模式融合模型進行多標簽產品分類 (5. Multi-Label Product Categorization Using Multi-Modal Fusion Models)

In this paper, researchers from New York University and U.S. Bank investigate multi-modal approaches to categorize products on Amazon. Their approach utilizes multiple classifiers trained on each type of input data from the product listings. Using a dataset of 9.4 million Amazon products, they developed a tri-modal model for product classification based on product images, titles, and descriptions. Their tri-modal late fusion model retains an F1 score of 88.2%.

在本文中,來自紐約大學和美國銀行的研究人員研究了多模式方法來對亞馬遜上的產品進行分類。 他們的方法利用了針對產品列表中每種輸入數據類型進行訓練的多個分類器。 他們使用940萬個Amazon產品的數據集,開發了一種基于產品圖像,標題和描述的產品分類的三峰模型。 他們的三峰后期融合模型保留了88.2%的F1分數。

The findings of their study demonstrate that increasing the number of modalities could improve performance in multi-label product categorization.

他們研究的結果表明,增加模式數量可以改善多標簽產品分類的性能。

Published/Last Updated — June 30, 2019

發布/最新更新— 2019年6月30日

Authors and Contributors — Pasawee Wirojwatanakul (New York University) and Artit Wangperawong (U.S. Bank)

作者和貢獻者— Pasawee Wirojwatanakul(紐約大學)和Artit Wangperawong(美國銀行)

Read Now

現在讀

In the papers on product categorization above, the researchers trained their models on open datasets which included millions of products. However, if you are building a product categorization model for commercial use, many open datasets may not be available to you.

在上面有關產品分類的論文中,研究人員在包含數百萬種產品的開放數據集上訓練了他們的模型。 但是,如果您要構建用于商業用途的產品分類模型,則可能無法使用許多開放數據集。

Looking for training data for your product classification model? Check out this training data guide and these open datasets.

尋找針對您的產品分類模型的培訓數據? 查閱本培訓數據指南和這些開放的數據集 。

翻譯自: https://medium.com/analytics-vidhya/5-must-read-papers-on-product-categorization-for-data-scientists-19c98421cef3

計算機科學必讀書籍

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/389410.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/389410.shtml
英文地址,請注明出處:http://en.pswp.cn/news/389410.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

es6解決回調地獄問題

本文摘抄自阮一峰老師的 http://es6.ruanyifeng.com/#docs/generator-async 異步 所謂"異步",簡單說就是一個任務不是連續完成的,可以理解成該任務被人為分成兩段,先執行第一段,然后轉而執行其他任務,等做好…

交替最小二乘矩陣分解_使用交替最小二乘矩陣分解與pyspark建立推薦系統

交替最小二乘矩陣分解pyspark上的動手推薦系統 (Hands-on recommender system on pyspark) Recommender System is an information filtering tool that seeks to predict which product a user will like, and based on that, recommends a few products to the users. For ex…

莫煩Matplotlib可視化第四章多圖合并顯示代碼學習

4.1Subplot多合一顯示 import matplotlib.pyplot as plt import numpy as npplt.figure() """ 每個圖占一個位置 """ # plt.subplot(2,2,1) #將畫板分成兩行兩列,選取第一個位置,可以去掉逗號 # plt.plot([0,1],[0,1]) # # plt.su…

python 網頁編程_通過Python編程檢索網頁

python 網頁編程The internet and the World Wide Web (WWW), is probably the most prominent source of information today. Most of that information is retrievable through HTTP. HTTP was invented originally to share pages of hypertext (hence the name Hypertext T…

Python+Selenium自動化篇-5-獲取頁面信息

1.獲取頁面title title:獲取當前頁面的標題顯示的字段from selenium import webdriver import time browser webdriver.Chrome() browser.get(https://www.baidu.com) #打印網頁標題 print(browser.title) #輸出內容:百度一下,你就知道 2.…

火種 ctf_分析我的火種數據

火種 ctfOriginally published at https://www.linkedin.com on March 27, 2020 (data up to date as of March 20, 2020).最初于 2020年3月27日 在 https://www.linkedin.com 上 發布 (數據截至2020年3月20日)。 Day 3 of social distancing.社會疏離的第三天。 As I sit on…

莫煩Matplotlib可視化第五章動畫代碼學習

5.1 Animation 動畫 import numpy as np import matplotlib.pyplot as plt from matplotlib import animationfig,ax plt.subplots()x np.arange(0,2*np.pi,0.01) line, ax.plot(x,np.sin(x))def animate(i):line.set_ydata(np.sin(xi/10))return line,def init():line.set…

data studio_面向營銷人員的Data Studio —報表指南

data studioIn this guide, we describe both the theoretical and practical sides of reporting with Google Data Studio. You can use this guide as a comprehensive cheat sheet in your everyday marketing.在本指南中,我們描述了使用Google Data Studio進行…

人流量統計系統介紹_統計介紹

人流量統計系統介紹Its very important to know about statistics . May you be a from a finance background, may you be data scientist or a data analyst, life is all about mathematics. As per the wiki definition “Statistics is the discipline that concerns the …

pyhive 連接 Hive 時錯誤

一、User: xx is not allowed to impersonate xxx 解決辦法&#xff1a;修改 core-site.xml 文件&#xff0c;加入下面的內容后重啟 hadoop。 <property><name>hadoop.proxyuser.xx.hosts</name><value>*</value> </property><property…

樂高ev3 讀取外部數據_數據就是新樂高

樂高ev3 讀取外部數據When I was a kid, I used to love playing with Lego. My brother and I built almost all kinds of stuff with Lego — animals, cars, houses, and even spaceships. As time went on, our creations became more ambitious and realistic. There were…

圖像灰度化與二值化

圖像灰度化 什么是圖像灰度化&#xff1f; 圖像灰度化并不是將單純的圖像變成灰色&#xff0c;而是將圖片的BGR各通道以某種規律綜合起來&#xff0c;使圖片顯示位灰色。 規律如下&#xff1a; 手動實現灰度化 首先我們采用手動灰度化的方式&#xff1a; 其思想就是&#…

分析citibike數據eda

數據科學 (Data Science) CitiBike is New York City’s famous bike rental company and the largest in the USA. CitiBike launched in May 2013 and has become an essential part of the transportation network. They make commute fun, efficient, and affordable — no…

jvm感知docker容器參數

docker中的jvm檢測到的是宿主機的內存信息&#xff0c;它無法感知容器的資源上限&#xff0c;這樣可能會導致意外的情況。 -m參數用于限制容器使用內存的大小&#xff0c;超過大小時會被OOMKilled。 -Xmx: 默認為物理內存的1/4。 4核CPU16G內存的宿主機 java 7 docker run -m …

Flask之flask-script 指定端口

簡介 Flask-Scropt插件為在Flask里編寫額外的腳本提供了支持。這包括運行一個開發服務器&#xff0c;一個定制的Python命令行&#xff0c;用于執行初始化數據庫、定時任務和其他屬于web應用之外的命令行任務的腳本。 安裝 用命令pip和easy_install安裝&#xff1a; pip install…

上采樣(放大圖像)和下采樣(縮小圖像)(最鄰近插值和雙線性插值的理解和實現)

上采樣和下采樣 什么是上采樣和下采樣&#xff1f; ? 縮小圖像&#xff08;或稱為下采樣&#xff08;subsampled&#xff09;或降采樣&#xff08;downsampled&#xff09;&#xff09;的主要目的有 兩個&#xff1a;1、使得圖像符合顯示區域的大小&#xff1b;2、生成對應圖…

r語言繪制雷達圖_用r繪制雷達蜘蛛圖

r語言繪制雷達圖I’ve tried several different types of NBA analytical articles within my readership who are a group of true fans of basketball. I found that the most popular articles are not those with state-of-the-art machine learning technologies, but tho…

java 分裂數字_分裂的補充:超越數字,打印物理可視化

java 分裂數字As noted in my earlier Nightingale writings, color harmony is the process of choosing colors on a Color Wheel that work well together in the composition of an image. Today, I will step further into color theory by discussing the Split Compleme…

Java 集合 之 Vector

http://www.verejava.com/?id17159974203844 import java.util.ArrayList; import java.util.Enumeration; import java.util.List; import java.util.Vector;public class Test {/*** param args the command line arguments*/public static void main(String[] args) {//打印…

前端電子書單大分享~~~

前言 純福利&#xff0c; 如果你不想買很多書&#xff0c;只想省錢看電子書&#xff1b; 如果你找不到很多想看書籍的電子書版本&#xff1b; 那么&#xff0c;請保存或者下載到自己的電腦或者手機或者網盤吧。 不要太著急&#xff0c;連接在最后呢 前端 前端框架 node html-cs…