Database test data generation

by Tom Winter

How our test data generator makes fake data look real

We recently released DataFairy, a free tool that generates test data. But first, let me tell you the story of how it came about.

This is the story of how we turned a fun open source side project into something that has turned out to be really useful.

This is not about fake news or tricking the masses. But the fact remains that for developers, software testers, and really anyone who has ever given a demo, fake data is essential and is surprisingly difficult to make up off the top of your head.

Our story with fake data starts back when we first developed our SaaS tool, Devskiller. Like all applications, we needed users. We weren’t even looking for paying users at this point. We just needed candidate profiles for our application. What we needed was dummy data that looked real.

We needed a test data generator

We needed fake data for a couple of reasons:

1. We needed to see if our system worked

This meant that we needed to build a number of different dummy profiles to see if the system stored and displayed them correctly.

2. We needed to sell our product

We needed to do demos for our first prospective customers. We wanted to show our customers what the system would look like after 6 months of inviting and testing hundreds of candidates.

Our first thought was to look for an available test data generator. But the problem is that data is hard to fake convincingly.

A lot of data is validated algorithmically

If it was easy to make convincing data, we probably wouldn’t need a tool. But generating data can be tricky for a couple of reasons.

Fake data is more than just random numbers. Take the example of a credit card number. Most credit card numbers are validated with a checksum called the Luhn algorithm. To explain this we are going to use the example of a Visa card:

How to check if a credit card number is valid

Before you start, it’s important to know that all Visa card numbers start with a 4. Also, they all have either 16 or 13 digits.
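
Before running any checksum, the two rules above can be used as a cheap first filter. This is a minimal sketch under the assumptions stated in this article (leading 4, 13 or 16 digits); the function name `looks_like_visa` is made up for illustration.

```python
def looks_like_visa(number: str) -> bool:
    """Shape check only: ASCII digits, leading 4, length 13 or 16.

    This does not prove the number is valid; the checksum still has to pass.
    """
    if not (number.isascii() and number.isdigit()):
        return False
    return number.startswith("4") and len(number) in (13, 16)


print(looks_like_visa("4574487405351567"))  # True
print(looks_like_visa("5574487405351567"))  # False, does not start with 4
```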

Take this Visa card number:

4574487405351567

The first thing you need to do to validate the number is to double every second digit, starting with the first digit in the sequence. (For a 16 digit number this is the same as the usual Luhn rule of starting with the digit just left of the final check digit and working leftwards.)

(4x2), (7x2), (4x2), (7x2), (0x2), (3x2), (1x2), (6x2)

8, 14, 8, 14, 0, 6, 2, 12

If the doubling that you’ve just done results in a number with two digits, add them together to get a single digit number.

8, 5, 8, 5, 0, 6, 2, 3

You then need to go back to the original credit card number and replace the digits that you doubled with the new values.

8554885405652537

Each replaced digit is either the doubled value itself, when that is a single digit, or the sum of its two digits. Now add it all up.

8+5+5+4+8+8+5+4+0+5+6+5+2+5+3+7=80

And then check to see if the sum is evenly divisible by 10. In this case it is, so the number is valid.
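
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code any particular library uses; the function names are our own. It walks the number from the right, which for a 16 digit number doubles exactly the digits shown above.

```python
def luhn_total(number: str) -> int:
    """Return the Luhn sum of a string of digits."""
    total = 0
    # Walk from the rightmost digit. The check digit (offset 0) is left alone,
    # and every second digit after it is doubled.
    for offset, ch in enumerate(reversed(number)):
        digit = int(ch)
        if offset % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same result as adding the two digits together
        total += digit
    return total


def luhn_is_valid(number: str) -> bool:
    """A number passes when its Luhn sum is evenly divisible by 10."""
    return luhn_total(number) % 10 == 0


print(luhn_total("4574487405351567"))     # 80
print(luhn_is_valid("4574487405351567"))  # True
```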

You need some sort of computational algorithm to validate credit card numbers at scale. But credit card numbers are relatively easy pieces of data to get right. We didn’t just need individual pieces of verifiable data, we needed entire profiles.
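
To show why a lone card number is the easy case, here is a hedged sketch of generating one: pick 15 digits starting with 4, then compute the one check digit that makes the Luhn sum divisible by 10. The names `luhn_check_digit` and `fake_visa_number` are hypothetical, and this is not taken from JFairy's source.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Return the digit that, appended to payload, makes the Luhn sum end in 0."""
    total = 0
    # The rightmost payload digit will sit next to the check digit,
    # so it is the first one to be doubled.
    for offset, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if offset % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def fake_visa_number(rng: random.Random) -> str:
    """Build a 16 digit, Visa-shaped number with a valid check digit."""
    payload = "4" + "".join(str(rng.randint(0, 9)) for _ in range(14))
    return payload + str(luhn_check_digit(payload))


print(luhn_check_digit("457448740535156"))  # 7, matching the example above
print(fake_visa_number(random.Random()))
```

Passing in a `random.Random` instance keeps the generator repeatable when a seed is supplied, which is handy for test fixtures.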

Verifiable profiles need different kinds of data that relate to each other logically

Credit card numbers are relatively easy to generate, because they only relate to themselves. But personal identity numbers often relate to other things about a person. Take the Swedish personal identity number, commonly called the personnummer.

For those of you who don’t know, personnummers are designed for paying taxes, sort of like an American Social Security number. But they’re also used as a way to access services like healthcare and schools as well as non-governmental services like credit ratings.

The format of a personnummer is slightly different than that of a credit card. It is a 10 digit number split into a six digit section and a four digit section connected by a hyphen.

Cool fact: Swedes over the age of 100 replace the hyphen in their personnummer with a plus sign.

The first six digits in the personnummer are simple and correspond to the person’s birthday using a YYMMDD format. Of the second 4 digit section, the first three are a serial number. The third serial number digit is odd for males and even for females. The last number is a checksum digit.

So if you take the personnummer:

601128-9235

You know that it is for a man born November 28th, 1960.

60 (year) 11 (month) 28 (day) - (under 100 years old) 92 (unique numbers) 3 (unique odd number for male) 5 (checksum digit)
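
Reading those fields back out of a personnummer can be sketched as follows. This is a simplified illustration: the function name and the returned dictionary are our own invention, and the century is guessed from a reference date plus the separator, which is an assumption about how a "+" should be interpreted.

```python
from datetime import date


def parse_personnummer(pnr: str, today: date) -> dict:
    """Split a YYMMDD-NNNC string into birth date, sex, and checksum digit."""
    birth, separator, tail = pnr[:6], pnr[6], pnr[7:]
    yy, mm, dd = int(birth[:2]), int(birth[2:4]), int(birth[4:6])

    # Most recent year ending in yy that is not after the reference year.
    year = today.year - (today.year - yy) % 100
    if separator == "+":
        year -= 100  # a plus sign marks someone aged 100 or more

    return {
        "born": date(year, mm, dd),
        # The third serial digit is odd for males and even for females.
        "sex": "male" if int(tail[2]) % 2 == 1 else "female",
        "checksum": int(tail[3]),
    }


print(parse_personnummer("601128-9235", date(2020, 1, 1)))
```

With a reference date in 2020, the example number comes back as a male born on 1960-11-28 with checksum digit 5.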

To calculate the checksum, multiply the individual digits in the identity number by the corresponding digits in the number 212121-212.

(6x2)(0x1)(1x2)(1x1)(2x2)(8x1)(9x2)(2x1)(3x2)

12, 0, 2, 1, 4, 8, 18, 2, 6

Just like with the Visa card above, if the product of any of these numbers results in a two digit number, simply add the two digits together.

3, 0, 2, 1, 4, 8, 9, 2, 6

Add all the remaining products together.

3+0+2+1+4+8+9+2+6=35

To get the checksum digit, subtract the last digit of the added products from 10 (the exception is that if the last digit is zero, the checksum is also zero).

10-5=5
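
The checksum calculation above can be sketched like this. The function name `personnummer_checksum` is hypothetical; the arithmetic follows the steps described in this section.

```python
def personnummer_checksum(nine_digits: str) -> int:
    """Compute the tenth digit from the first nine (YYMMDD plus serial)."""
    total = 0
    for position, ch in enumerate(nine_digits):
        # Weights follow the pattern 2, 1, 2, 1, ... from the left.
        product = int(ch) * (2 if position % 2 == 0 else 1)
        # A two digit product contributes the sum of its digits.
        total += product // 10 + product % 10
    # Subtract the last digit of the total from 10; a last digit of 0 gives 0.
    return (10 - total % 10) % 10


print(personnummer_checksum("601128923"))  # 5
```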

So if you were going to generate a profile of this person, it couldn’t be of a woman born on April 10th, 1916. Her personnummer would have to be something like: 160410+1244. In other words, you couldn’t just come up with a random number and expect it to work with just any fake profile you’ve generated.
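
Going the other way, a number has to be built from the profile rather than picked at random. The sketch below is only an illustration of that idea and not JFairy's actual implementation; `fake_personnummer` and its parameters are names we made up, and real serial numbers may follow extra rules that are ignored here.

```python
import random
from datetime import date


def personnummer_checksum(nine_digits: str) -> int:
    """Tenth digit: weights 2, 1, 2, 1, ... with digit sums, then 10 minus last digit."""
    total = 0
    for position, ch in enumerate(nine_digits):
        product = int(ch) * (2 if position % 2 == 0 else 1)
        total += product // 10 + product % 10
    return (10 - total % 10) % 10


def fake_personnummer(born: date, sex: str, today: date, rng: random.Random) -> str:
    """Build a number whose date, sex digit, separator, and checksum all agree."""
    birth = f"{born.year % 100:02d}{born.month:02d}{born.day:02d}"
    # Two free serial digits, then an odd digit for males or an even one for females.
    sex_digit = rng.choice([1, 3, 5, 7, 9] if sex == "male" else [0, 2, 4, 6, 8])
    serial = f"{rng.randint(0, 99):02d}{sex_digit}"
    # Simplification: compare years only when deciding on the plus sign.
    separator = "+" if today.year - born.year >= 100 else "-"
    return f"{birth}{separator}{serial}{personnummer_checksum(birth + serial)}"


print(fake_personnummer(date(1916, 4, 10), "female", date(2020, 1, 1), random.Random()))
```

Every field is derived from the profile first, and the checksum is computed last, so the number can never contradict the person it belongs to.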

We needed logical test data

The pieces of data would need to relate to each other in a logical way, since the personnummer isn’t the only piece of data that is built on outside information. Most types of identification numbers relate to other information in some way. We simply couldn’t find a test data generator which would do that, so we decided to build our own. It looks like we weren’t the only ones having this problem.

JFairy

As regular contributors to the open source community, we decided that the best way to generate the test data we needed was to build our own library. We called it JFairy, and our goal was for it to generate sets of data that were all verifiable and logically connected.

This way we could populate our app with users. Our user data couldn’t be gibberish or else it couldn’t be entered into the system. So we put the library to work and it performed better than we could have expected. It even matches real people from time to time. We found this out because we used Gravatar, which looks up avatars by email address, to show the candidate pictures. We were surprised when a real photo appeared on our test account.

This was really useful when we started shopping our app around. We wanted to show enterprise clients an account with 300 different test candidates on the platform. If we hadn’t built JFairy, we might have all tried to use the app a few times, but there were only five of us on the team. It would have been impractical for the five of us to come up with 300 logically connected fake profiles.

The data generated by JFairy proved to be so convincing that new customers were puzzled as to where we had gotten all of these people to test. In fact, they asked us if we could help them with sourcing new developers, as clearly we were in touch with a number of people who have technical backgrounds, some of whom actually had validated skills.

We needed to let the open source community have a look at JFairy

We realized that this was becoming something bigger than ourselves, so we decided to release the system as open source. The first reason is that we are all avid users of open source code. We know that it’s important to give back to that community in order to get things in return. But on top of that, open source can bring real benefits back to the product. By putting our project out there so that a number of different developers can take a look at it, we can get some new ideas that we would never have considered.

The most notable contributions were the inclusion of new languages. We only built JFairy to generate data for English speakers and Polish speakers. After all, we are rather limited by the languages we know well. But of course, it could be a useful tool for people from any number of different countries. Through open source contributions, we’ve been able to add support for data in Spanish, French, German, Swedish, and Chinese.

We also realized that while we were reaching a great group of users in software developers, JFairy had applications well beyond a community whose members know how to code. So we decided to build on the success of the library and create an app which could support its use for more applications and more people.

DataFairy gives everyone access to fake data

JFairy proved to be super useful for developers who knew how to code, but they weren’t the only people out there who would use the data JFairy generated. Software testers need to be able to populate their systems to see if they work. Salespeople and marketers need data to make their demos look realistic. To make JFairy useful to the most people, we had to make its fake data easy to access.

With that goal in mind, we built DataFairy. DataFairy is an app powered by JFairy so you can access our fake data without having to learn to code first. The data is presented in a neat notebook interface. To get more than one fake profile, you can either generate a new profile or export a bulk list of up to 100 profiles to a CSV file. It is a free and easy way to populate your software with logically connected valid data.

Our plans for DataFairy’s future

DataFairy can always be improved upon and have new features added to it. In addition to our own efforts, we want to stick to the tenets of the open source community. We continue to solicit new languages that we can add to our roster and we have an open GitHub project. We would also love to eventually have users add sample data. This will help us build a community of participants who will help DataFairy grow and become more useful for more people.

Whether you need to download large batches of logically validated data or simply want to have fun reading the profiles that pop up, check out DataFairy.
