電子表格轉換成數據庫
Part 1: Creating an Entity Relational Diagram (ERD)
第1部分:創建實體關系圖(ERD)
A Relational Database Management System (RDMS) is a program that allows us to create, update, and manage a relational database. Structured Query Language (SQL) is a programming language used to communicate with data stored in the RDMS. The SQL skill for using a RDMA is required for many data-related positions these days. In the social media forums like Quora or Reddit, there are many people who search for a public database for practicing their SQL querying skills. However, although there are many public data sets in a single spreadsheet, there are not many public databases online. Even if you found a data set for the topic you have interest in, the format of the data is usually just one spreadsheet, not a database for most cases. Therefore, it will be very useful to know how to convert a data set in one spreadsheet to a database with multiple tables fitting a relational database format. Knowing the process of this conversion can give us many chances to practice SQL querying skills with a variety of databases.
關系數據庫管理系統 (RDMS)是允許我們創建,更新和管理關系數據庫的程序。 結構化查詢語言 (SQL)是一種編程語言,用于與RDMS中存儲的數據進行通信。 如今,與許多數據相關的職位都需要使用RDMASQL技能。 在Quora或Reddit等社交媒體論壇中,有很多人在搜索公共數據庫以練習其SQL查詢技能。 但是,盡管單個電子表格中有許多公共數據集,但在線的公共數據庫并不多。 即使您找到了感興趣的主題的數據集,數據的格式通常也只是一個電子表格,在大多數情況下不是數據庫。 因此,了解如何將一個電子表格中的數據集轉換為具有多個符合關系數據庫格式的表的數據庫將非常有用。 了解這種轉換的過程可以使我們有很多機會來練習各種數據庫SQL查詢技能。
This is the first article in a two part series. The goal of this series is to show how to create a relational database for SQL. The whole process is to convert a data in one spreadsheet to a relational database for SQL. In this first article I create an Entity Relational Diagram (ERD) which is a graphical representation showing the relationships between entities.
這是分兩部分的系列文章中的第一篇。 本系列的目的是展示如何為SQL創建關系數據庫。 整個過程是將一個電子表格中的數據轉換為SQL的關系數據庫。 在第一篇文章中,我創建一個實體關系圖(ERD) ,該圖是顯示實體之間關系的圖形表示。
數據集 (The Data Set)
Let’s find a data set for this practice. There are many public data on the Kaggle dataset. Among a bunch of data sets, I selected a data set named Sample Sales Data. The following pictures show the original format of the data which is contained in one spreadsheet.
讓我們為該練習找到一個數據集。 Kaggle數據集上有許多公共數據。 在一系列數據集中,我選擇了一個名為Sample Sales Data的數據集。 下圖顯示了一個電子表格中包含的數據的原始格式。


The data has 25 columns and 2824 rows including headers. The list of the headers is as follows:
數據有25列和2824行,包括標題。 標頭列表如下:
ORDERNUMBER: the identification number for each order
ORDERNUMBER :每個訂單的標識號
QUANTITYORDERED: the quantity ordered
QUANTITYORDERED :訂購的數量
PRICEEACH: the actual price paid for the transaction (variable across transactions)
PRICEEACH :為交易支付的實際價格(隨交易而變化)
ORDERLINENUMBER: the number of the order line
ORDERLINENUMBER :訂單行的編號
SALES: the amount of sales
銷售 :銷售額
ORDERDATE: the order date
ORDERDATE :訂單日期
STATUS: the shipping status (Shipped, Resolved, Cancelled, On Hold, Disputed, and In Progress)
狀態 :運輸狀態(已發貨,已解決,已取消,保留,有爭議和進行中)
QTR_ID: the quarter of the order date
QTR_ID :訂購日期的四分之一
MONTH_ID: the month of the order date
MONTH_ID :訂購日期的月份
YEAR_ID: the year of the order date
YEAR_ID :訂單日期的年份
PRODUCTLINE: the category of products
PRODUCTLINE :產品類別
MSRP: the manufacture’s suggested retail price (constant across transactions)
MSRP :制造商的建議零售價(在交易中保持不變)
PRODUCTCODE: the identification code for each product
PRODUCTCODE :每個產品的識別碼
CUSTOMERNAME: the names of customers
客戶名稱:客戶名稱
PHONE: the phone numbers of customers
電話 :客戶的電話號碼
ADDRESSLINE1: addressline 1 for customers
ADDRESSLINE1 :客戶的地址行1
ADDRESSLINE2: address line 2 for customers
ADDRESSLINE2 :客戶的地址行2
CITY: city names for customers
城市 :客戶的城市名稱
STATE: state names for customers (only for customers located in the US)
STATE :客戶的州名(僅適用于位于美國的客戶)
POSTALCODE: postal codes for customers
郵政編碼 :為客戶郵政編碼
COUNTRY: countries for customers
國家 :客戶所在的國家
TERRITORY: the regional names of each country (NA, EMEA, Japan, and APAC)
地區 :每個國家的區域名稱(NA,EMEA,日本和亞太地區)
CONTACTLASTNAME and CONTACTFIRSTNAME: the last and first names of customers
CONTACTLASTNAME和CONTACTFIRSTNAME :客戶的姓氏和名字
DEALSIZE: the deal sizes of orders
DEALSIZE :訂單的交易大小
創建實體關系圖(ERD) (Creating an Entity Relationship Diagram (ERD))
The information contained in the data can be basically divided into three entities: Customers, Products, and Orders.
數據中包含的信息基本上可以分為三個實體: 客戶 , 產品和訂單 。
顧客 (Customers)
Based on the columns in the data, the attributes to be included in the entity for Customers are CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME, PHONE, ADDRESSLINE1, ADDRESSLINE2, POSTALCODE, CITY, STATE, COUNTRY, and TERRITORY. CustomerID is added into the entity as the primary key. And for convenience sake, the cases of the attributes are changed as follows:
根據數據中的列,要包含在“客戶”實體中的屬性是CUSTOMERNAME , CONTACTLASTNAME,CONTACTFIRSTNAME, PHONE , ADDRESSLINE1 , ADDRESSLINE2 , POSTALCODE , CITY,STATE,COUNTRY和TERRITORY 。 CustomerID作為主鍵添加到實體中。 為了方便起見,屬性的情況如下更改:

The problem of keeping City, State, Country, and Territory in the entity for Customers is that there will be many duplicate rows in the table for Customers. Therefore, entities for City, Country, and Territory are separately created, and then the links between them and Customers are made using foreign keys. The reason why the entity for State is not created is that there are many null values in State so that states do not determine countries and territories in this data.
保留城市 , 州 , 國家和地區的問題 在客戶實體中 是表中的客戶會有很多重復的行。 因此,將分別創建用于城市 , 國家和地區的實體,然后使用外鍵在它們和客戶之間建立鏈接。 為什么不建立國有實體的原因是,在許多國家空值,這樣的狀態并不確定這個數據的國家和地區。
In the ERD, there are three types of relationships between entities: One-to-One, One-to-Many(or Many-to-One), and Many-to-Many. The relationships between Territories and Countries, Countries and Cities, and Cities and Customers are One-to-Many relationships. For example, each customer lives in one city in this data while one city has many customers. Therefore, using the Crow’s Foot Notation, the ERD between Customers, Cities, Countries, and Territories is as follows:
在ERD中,實體之間存在三種類型的關系: 一對一,一對多(或多對一)和多對多 。 區域與國家 , 國家和城市以及城市與客戶之間的關系是一對多關系。 例如,在此數據中,每個客戶居住在一個城市中,而一個城市有許多客戶。 因此,使用烏鴉腳符號 , 客戶之間的ERD, 城市 , 國家 ,和地區如下:

產品展示 (Products)
The attributes to be included in the entity for Products are PRODUCTCODE, PRODUCTLINE, MSRP, and ORDERNUMBER. Among these attributes, PRODUCTCODE can be a candidate for the primary key in the entity since it is unique and does not have null values. However, in order to keep the format of the primary keys consistent across entities, ProductID is added into the entity and is set to be the primary key. The letter cases are changed as follows:
產品實體中要包含的屬性是PRODUCTCODE , PRODUCTLINE , MSRP和ORDERNUMBER 。 在這些屬性中, PRODUCTCODE可以是實體中主鍵的候選者,因為它是唯一的并且沒有空值。 但是,為了使主鍵的格式在各個實體之間保持一致, 產品編號 被添加到實體中并設置為主鍵。 字母大小寫更改如下:

As we saw in the case for the entity for Customers, including Productline in this entity generates duplicate rows as well. Therefore, the entity for ProductLines is created separately, and is linked to the entity for Products using a foreign key. Since each product is classified into one product line while each product line has many products, the relationship between ProductLines and Products is One-to-Many.
正如我們在“ 客戶 ”實體案例中看到的那樣,在該實體中包括Productline也會生成重復的行。 因此, ProductLine的實體是單獨創建的,并使用外鍵鏈接到Products的實體。 由于每個產品分類為一個產品線,而每個產品線具有許多產品,因此ProductLine和Products之間的關系為一對多 。

This is not the final version for Products since OrderNumber will be included in the entity for Orders as well. Including OrderNumber in the entity for Orders will make some changes in the entity for Products. This will be covered after the entity for Orders is created.
這并非以來ORDERNUMBER 產品將包含在訂單的實體以及最終版本。 在訂單的實體,包括訂單號碼將在產品實體的一些變化。 創建訂單實體后將對此進行介紹。
命令 (Orders)
The entity for Orders should include ORDERNUMBER, ORDERDATE, PRODUCTCODE, STATUS, QUARTER_ID, MONTH_ID, YEAR_ID,and CUSTOMERNAME as its attributes. OrderID is added into this entity as the primary key. The names and cases of the attributes are changed as follows for convenience sake:
訂單實體應包括ORDERNUMBER , ORDERDATE , PRODUCTCODE , STATUS , QUARTER_ID , MONTH_ID , YEAR_ID和CUSTOMERNAME作為其屬性。 OrderID作為主鍵添加到該實體中。 為了方便起見,屬性的名稱和大小寫更改如下:

In order to remove duplicate rows generated by including Status and CustomerName in this entity, the entities for Status and CustomerNames are created separately and are linked to the entity for Orders using foreign keys. The entity for CustomerNames (which is equivalent to the Customers) was already created previously. The relationships of Orders with Status and Customers are Many-to-One, respectively.
為了刪除通過在此實體中包含Status和CustomerName生成的重復行,分別創建了Status和CustomerNames的實體,并使用外鍵將其鏈接到Order的實體。 CustomerNames的實體(等同于Customer )是先前已創建的。 訂單與狀態和客戶的關系分別是多對一的 。

訂單和產品 (Orders and Products)
Each order contains a list of products in it. This means that Orders and Products are related to each other. What is the relationship between these two entries? Each product can be ordered by multiple orders and each order can contain multiple products. Therefore, the relationship is many-to-many.
每個訂單中都包含一個產品列表。 這意味著訂單和產品相互關聯。 這兩個條目之間是什么關系? 每個產品可以按多個訂單訂購,每個訂單可以包含多個產品。 因此,這種關系是多對多的 。

The way to create tables which are in the many-to-many relationship is to create a junction table between the two original tables and then to make one-to-many relationships between the original tables and the junction table. The junction table is the table for the combination of Orders and Products, and then one-to-many relationships are made between Orders and the junction, and between Products and the junction. The primary key for this junction entity should be the composite of OrderID and ProductID.
創建作為在許多一對多的關系表的方法是將原來的兩個表之間創建一個結表,然后做一個對許多原始表和聯接表之間的關系。 聯結表是Orders和Products組合的表,然后在Orders和聯結之間以及Products和聯結之間建立一對多關系。 此聯結實體的主鍵應為OrderID的組合 和ProductID 。
The remaining columns which are not included in any entity so far are QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, and DEALSIZE. These columns cannot be determined by Orders or Products by themselves. For example, the quantity of a product in an order cannot be identified by an order or a product alone. Only the combination of an order and a product can identify it. Therefore, those columns are included in the junction entity as its attributes. The entities and their relationships for Orders and Products are as follows:
到目前為止,尚未包含在任何實體中的其余列為QUANTITYORDERED , PRICEEACH , ORDERLINENUMBER , SALES和DEALSIZE 。 這些列不能由訂單或產品自行確定。 例如,訂單中的產品數量無法通過訂單或單獨的產品來識別。 只有訂單和產品的組合才能識別它。 因此,這些列作為其屬性包含在聯結實體中。 訂單和產品的實體及其關系如下:

Due to duplicate rows, the entity for DealSizes is separately created and linked to the entity for Orders_Products using foreign key for SizeID.
由于重復行,為DealSizes實體被單獨創建并鏈接到實體使用SizeID外鍵Orders_Products。
整個ERD (The Entire ERD)
Now, we completed creating all the entities for this data and making relationships between them. The entire ERD is as follows:
現在,我們完成了為該數據創建所有實體并在它們之間建立關系。 整個ERD如下:

The usefulness of the ERD is that it makes us to easily understand the relationships between the columns (or the entities) in the data set and to see the entire structure of the data set at a glance.
ERD的有用之處在于,它使我們能夠輕松了解數據集中的列(或實體)之間的關系,并一目了然地看到數據集的整個結構。
下一步是什么? (What’s Next?)
This article covered how to create an Entity Relationship Diagram using a data set. The data for this practice is Sample Sales Data obtained from the Kaggle dataset. It is briefly covered how to divide the original data set into entities and make relationships between the entities.
本文介紹了如何使用數據集創建實體關系圖。 該實踐的數據是從Kaggle數據集獲得的樣本銷售數據 。 簡要介紹了如何將原始數據集劃分為實體以及如何在實體之間建立關系。
In the next article, the practice of converting a data in a single spread sheet to a relational database by using MySQL will be covered.
在下一篇文章中 ,將介紹通過使用MySQL將單個電子表格中的數據轉換為關系數據庫的實踐。
翻譯自: https://medium.com/swlh/creating-a-database-converting-a-spreadsheet-to-a-relational-database-part-1-2a9a228bf77a
電子表格轉換成數據庫
本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。 如若轉載,請注明出處:http://www.pswp.cn/news/389965.shtml 繁體地址,請注明出處:http://hk.pswp.cn/news/389965.shtml 英文地址,請注明出處:http://en.pswp.cn/news/389965.shtml
如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!