Preface

Test data is often needed during software development and testing. Typical scenarios include:

  • After back-end developers create a new table, they need to populate it with test data and produce interface data for the front end to use.
  • Database performance testing needs a large volume of test data.

How to write a test data generation script

The first step is to define the structure of the data, including field names, types, value ranges, and so on.

If the data has dependencies, decide on the final form of the data first, then work backwards to the form of the data it depends on. For example, suppose you need a user tag called "life insurance customer value". This tag depends on three other tags: the life insurance policy amount, the time of the most recent life insurance purchase or renewal, and the number of life insurance purchases or renewals within the past 2 years. The policy amount tag in turn relies on a series of event attributes. Starting from the "life insurance customer value" tag, you can therefore derive many of the required data structures in reverse.
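The backward deduction above can be sketched as a dependency graph that is sorted so every tag is generated after the data it depends on. This is a minimal sketch: the identifiers paraphrase the life insurance example, and "policy_events" is a hypothetical stand-in for the unspecified event attributes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key depends on the names in its set.
# Only the policy amount is described as relying on event attributes,
# so the other two tags are left without dependencies here.
DEPENDENCIES = {
    "life_insurance_customer_value": {
        "policy_amount",
        "last_purchase_or_renewal_time",
        "purchase_or_renewal_frequency_2y",
    },
    "policy_amount": {"policy_events"},
    "last_purchase_or_renewal_time": set(),
    "purchase_or_renewal_frequency_2y": set(),
    "policy_events": set(),
}


def generation_order(dependencies):
    """Return the names ordered so that every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = generation_order(DEPENDENCIES)
print(order)
```

The final tag always comes last in the order, so a generator can walk the list from left to right and build each structure from data that already exists.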

The second step is to design how the data is distributed across the different ranges.

The quality of the data affects the end result, so its distribution should be reasonable. The right distribution depends on the business. The best option is to follow the distribution of the original business data. Failing that, in decreasing order of reliability, you can use a distribution based on the experience of business staff, a distribution for a similar business found on the Internet, or a distribution the developers estimate themselves.
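When a sample of real business data is available, the weights can be computed from it instead of being guessed. The sketch below assumes a hypothetical exported column of marital status values and uses only the standard library.

```python
import random
from collections import Counter, OrderedDict


def weights_from_sample(values):
    """Turn observed values into an ordered value -> proportion mapping."""
    counts = Counter(values)
    total = sum(counts.values())
    return OrderedDict(
        (value, count / total) for value, count in counts.most_common()
    )


# Hypothetical sample standing in for a column exported from real business data.
sample = ["married"] * 70 + ["unmarried"] * 20 + ["divorced"] * 10
weights = weights_from_sample(sample)

rng = random.Random(42)  # fixed seed so repeated runs give the same data
generated = rng.choices(list(weights), weights=list(weights.values()), k=1000)
```

An OrderedDict of this shape (value mapped to weight) is also what the weighted fake.random_element(elements=...) calls in the example code of this article take.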

The third step is to specify how the data is generated, including the number of records, their order, and how they are stored.

The fourth step is to confirm how the data will be imported.
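Steps three and four can be sketched together: generate a fixed number of records in order, store them in a file, and bulk-import them. The record shape, the table name and the use of SQLite as the target database are all assumptions made for illustration.

```python
import json
import random
import sqlite3


def generate_rows(count, seed=0):
    """Yield `count` hypothetical records with sequential ids, in order."""
    rng = random.Random(seed)
    for row_id in range(1, count + 1):
        yield {"id": row_id, "age": rng.randint(0, 99)}


def export_json_lines(rows, path):
    """Storage: write one JSON object per line so large files can be streamed."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def import_rows(connection, rows):
    """Import: bulk-insert the records with a single executemany call."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, age INTEGER)"
    )
    connection.executemany("INSERT INTO users (id, age) VALUES (:id, :age)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # stand-in for the real target database
import_rows(connection, generate_rows(1000))
print(connection.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Because generate_rows is a generator, the same records can be streamed either to export_json_lines or to import_rows without holding them all in memory.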

The relevant framework

After surveying various Python frameworks, I chose joke2k/faker. Its documentation describes a way to generate data according to weights, which meets my needs.

Documentation: faker.readthedocs.io/en/stable/p…

Simple usage

  1. Install the dependency
 pip install faker
  2. Localization: a Chinese-locale Faker
   fake = Faker('zh_CN')
  3. Simple API calls
from faker import Faker

fake = Faker()

fake.name()
# 'Lucy Cechtelar'

fake.address()
# '426 Jordy Lodge
#  Cartwrightshire, SC 88120-6700'

fake.text()
# 'Sint velit eveniet. Rerum atque repellat voluptatem quia rerum. Numquam excepturi
#  beatae sint laudantium consequatur. Magni occaecati itaque sint et sit tempore. Nesciunt…

For more information on using Faker, see the documentation.

Concrete example

Generate the following user attribute data and save it as a JSON file.

User attributes

| Attribute name | Display name | Type | Range | Distribution (%) |
|---|---|---|---|---|
| name | User name | STRING | random | |
| age | Age | NUMBER | 0-18, 18-26, 26-35, 36-45, 45-55, 55 and above | (2,8,20,34,21,15) |
| sex | Gender | NUMBER | Male, female | (50.03, 49.97) |
| city | City of residence | STRING | Beijing/Shanghai/Guangzhou/Shenzhen, Shenyang, Jinan, Tianjin, Xi'an, Hohhot, Wenzhou, Huangshan | (40,30,20,10) |
| province | Province | STRING | Beijing, Shanghai and Guangzhou, Liaoning, Shandong, Tianjin, Shaanxi, Inner Mongolia, Jiangsu, Anhui | (40,30,20,10) |
| annual_income | Annual income | NUMBER | 0-6W, 6-15W, 15-30W, 30W-80W, 80W and above | (15,45,33,5,2) |
| married | Marital status | STRING | Unmarried, married, divorced | (20,70,10) |
| occupation | Occupation | STRING | White-collar worker, teacher, worker, civil servant, sales | (45,10,20,10,15) |
| work_state | Working state | STRING | On the job, retired, freelance | (45,35,20) |
| family_size | Family size | NUMBER | 1-6, other | (5,15,18,22,22,15,5) |
| children_size | Number of children | NUMBER | 0-3, other | (33,30,20,12,5) |
| have_car | Whether the user has a car | BOOL | | (20,80) |
| vip_level | Membership level | STRING | 0-5 (ordinary member to diamond member) | (40,30,15,10,5) |
| membership_points | Membership points | NUMBER | 0, 1-1000, 1001-2000, 2000-5000, 5000 or more | (20,30,30,15,5) |
| is_valid | Whether valid | BOOL | | (30,70) |
| education | Education level | STRING | High school or below, bachelor, master, doctor | (35,45,15,5) |
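A range-plus-weight row of this table can be sampled in two stages: pick a bucket by weight, then pick a value inside the bucket. The sketch below does this for age with the standard library only. The exact bucket bounds are an assumption, because the ranges in the table overlap at their edges and the last one is open-ended.

```python
import random

# Age buckets as inclusive (low, high) bounds with the weights from the table.
# Assumption: overlapping edges (0-18 and 18-26, ...) are made disjoint here,
# and "55 and above" is capped at 99.
AGE_BUCKETS = [
    ((0, 18), 2),
    ((19, 26), 8),
    ((27, 35), 20),
    ((36, 45), 34),
    ((46, 55), 21),
    ((56, 99), 15),
]


def sample_from_buckets(buckets, rng=random):
    """Pick a bucket by weight, then a uniform integer inside that bucket."""
    ranges = [bounds for bounds, _ in buckets]
    weights = [weight for _, weight in buckets]
    low, high = rng.choices(ranges, weights=weights, k=1)[0]
    return rng.randint(low, high)


rng = random.Random(2021)  # fixed seed for reproducible test data
ages = [sample_from_buckets(AGE_BUCKETS, rng) for _ in range(10_000)]
share_36_45 = sum(36 <= age <= 45 for age in ages) / len(ages)
print(f"share aged 36-45: {share_36_45:.2%}")
```

Choosing the bucket first means only one random value is drawn per record, whereas building a candidate value for every bucket and then choosing among them does the extra draws on every call.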

The relevant code

user_faker.py

import json
from collections import OrderedDict
from datetime import datetime, date
from typing import Optional

from pydantic import BaseModel

from faker_config import fake
from snowflake import id_worker


class User(BaseModel):
    user_id: int
    first_id: Optional[int] = None
    second_id: Optional[int] = None
    time: Optional[datetime] = None
    name: str
    age: int
    sex: str
    city: str
    province: str
    annual_income: int
    married: str
    occupation: str
    work_state: str
    family_size: int
    children_size: int
    have_car: int
    vip_level: str
    membership_points: str
    is_valid: int
    education: str
    create_time: Optional[datetime]
    create_date: Optional[date]


def generate_user():
    time = fake.past_datetime(start_date='-2y')
    user = {
        "user_id": id_worker.get_id(),
        "first_id": None,
        "second_id": None,
        "time": time,
        "name": fake.name(),
        "age": user_faker.age(),
        "sex": user_faker.sex(),
        "annual_income": user_faker.annual_income(),
        "married": user_faker.married(),
        "occupation": user_faker.occupation(),
        "work_state": user_faker.work_state(),
        "family_size": user_faker.family_size(),
        "children_size": user_faker.children_size(),
        "have_car": user_faker.have_car(),
        # The model declares these two fields as str, so convert explicitly.
        "vip_level": str(user_faker.vip_level()),
        "membership_points": str(user_faker.membership_points()),
        "is_valid": user_faker.is_valid(),
        "education": user_faker.education(),
        "create_time": time,
        "create_date": time.date(),
    }
    # province_and_city() returns a JSON string holding both keys.
    user.update(json.loads(user_faker.province_and_city()))
    return User(**user)


class UserFaker:
    """Each method returns one value chosen according to the given weights."""

    def age(self):
        elements = OrderedDict([
            (fake.random_int(min=0, max=18), 0.02),
            (fake.random_int(min=19, max=26), 0.08),
            (fake.random_int(min=27, max=35), 0.2),
            (fake.random_int(min=36, max=45), 0.34),
            # This bucket was lost in the source; bounds are inferred from the table.
            (fake.random_int(min=46, max=54), 0.21),
            (fake.random_int(min=55, max=99), 0.15),
        ])
        return fake.random_element(elements=elements)

    def province_and_city(self):
        elements = OrderedDict([
            ('{"province": "Beijing", "city": "Beijing"}', 0.4),
            ('{"province": "Liaoning Province", "city": "Shenyang"}', 0.3),
            ('{"province": "Shaanxi Province", "city": "Xi\'an"}', 0.2),
            # The city was lost in the source; Huangshan is inferred from the table.
            ('{"province": "Anhui Province", "city": "Huangshan"}', 0.1),
        ])
        return fake.random_element(elements=elements)

    def annual_income(self):
        elements = OrderedDict([
            (fake.random_int(min=0, max=6), 0.15),
            (fake.random_int(min=7, max=15), 0.45),
            (fake.random_int(min=16, max=30), 0.33),
            (fake.random_int(min=31, max=80), 0.05),
            # Garbled in the source; the upper bound of 200 is an assumption.
            (fake.random_int(min=81, max=200), 0.02),
        ])
        return fake.random_element(elements=elements)

    def married(self):
        elements = OrderedDict([
            ('unmarried', 0.2),
            ('married', 0.7),
            ('divorced', 0.1),
        ])
        return fake.random_element(elements=elements)

    def sex(self):
        elements = OrderedDict([('male', 0.52), ('female', 0.48)])
        return fake.random_element(elements=elements)

    def occupation(self):
        elements = OrderedDict([
            ('white-collar worker', 0.45),
            ('teacher', 0.1),
            ('worker', 0.2),
            ('civil servant', 0.1),
            ('sales', 0.15),
        ])
        return fake.random_element(elements=elements)

    def work_state(self):
        elements = OrderedDict([
            ('on-the-job', 0.45),
            ('retired', 0.35),
            ('freelance', 0.20),
        ])
        return fake.random_element(elements=elements)

    def family_size(self):
        elements = OrderedDict([
            (1, 0.05), (2, 0.15), (3, 0.18), (4, 0.22), (5, 0.22), (6, 0.15),
            # Weight garbled in the source; 0.03 makes the weights sum to 1.
            (fake.random_int(min=7, max=10), 0.03),
        ])
        return fake.random_element(elements=elements)

    def children_size(self):
        elements = OrderedDict([
            (1, 0.33), (2, 0.35), (3, 0.20), (4, 0.07),
            (5, 0.05),  # weight lost in the source; 0.05 makes the sum 1
        ])
        return fake.random_element(elements=elements)

    def have_car(self):
        # Lost in the source; rebuilt from the table (20, 80).
        # Which value gets which weight is an assumption.
        elements = OrderedDict([(1, 0.2), (0, 0.8)])
        return fake.random_element(elements=elements)

    def vip_level(self):
        elements = OrderedDict([
            (1, 0.40), (2, 0.30), (3, 0.15), (4, 0.10),
            (5, 0.05),  # weight garbled in the source; taken from the table
        ])
        return fake.random_element(elements=elements)

    def membership_points(self):
        elements = OrderedDict([
            (fake.random_int(min=0, max=0), 0.2),
            (fake.random_int(min=1, max=1000), 0.3),
            (fake.random_int(min=1001, max=2000), 0.3),
            (fake.random_int(min=2001, max=5000), 0.15),
            # Garbled in the source; the upper bound of 10000 is an assumption.
            (fake.random_int(min=5001, max=10000), 0.05),
        ])
        return fake.random_element(elements=elements)

    def is_valid(self):
        # Lost in the source; rebuilt from the table (30, 70).
        # Which value gets which weight is an assumption.
        elements = OrderedDict([(0, 0.3), (1, 0.7)])
        return fake.random_element(elements=elements)

    def education(self):
        elements = OrderedDict([
            ('high school or below', 0.35),
            ('bachelor', 0.45),
            ('master', 0.15),
            ('doctor', 0.05),
        ])
        return fake.random_element(elements=elements)


user_faker = UserFaker()
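user_faker.py imports id_worker from a snowflake module that the article does not show. The following is a hypothetical stand-in, not the author's module: a minimal Snowflake-style generator that packs a millisecond timestamp, a worker id and a per-millisecond sequence into one integer. The bit widths and the epoch are assumptions.

```python
import threading
import time


class IdWorker:
    """Minimal Snowflake-style generator: timestamp | worker id | sequence."""

    EPOCH_MS = 1288834974657  # custom epoch in ms (the one Twitter Snowflake used)
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id=0):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def get_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                now = self.last_ms  # clock moved backwards: keep ids increasing
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) % (1 << self.SEQUENCE_BITS)
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self.sequence
            )


id_worker = IdWorker(worker_id=1)
ids = [id_worker.get_id() for _ in range(5000)]
```

Saved as snowflake.py next to user_faker.py, this would satisfy the import. Any other source of unique integers, such as a simple counter, would work equally well for test data.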

faker_config.py

from faker import Faker

fake = Faker('zh_CN')

main.py

import datetime
import json

from user_faker import generate_user, User


class DateEncoder(json.JSONEncoder):
    """JSON encoder that also handles datetime, date and User objects."""

    def default(self, obj):
        # datetime must be checked before date because it is a subclass of date.
        if isinstance(obj, datetime.datetime):
            return obj.strftime("%Y-%m-%d %H:%M:%S")
        if isinstance(obj, User):
            return obj.dict()
        if isinstance(obj, datetime.date):
            return obj.strftime("%Y-%m-%d")
        else:
            return json.JSONEncoder.default(self, obj)


def generate_data(row):
    print(f"Generating data ========>{row} items", datetime.datetime.now())
    users = []
    for i in range(row):
        user = generate_user()
        users.append(user)
    with open('./user.json', 'w', encoding='utf-8') as fObj:
        json.dump(users, fObj, ensure_ascii=False, cls=DateEncoder)
    print(f"Generating test data ========>{row} items, completed", datetime.datetime.now())


if __name__ == '__main__':
    # Number of records to generate
    row = 10000
    generate_data(row)

The result

Generating data ========>10000 items 2021-07-23 11:13:44.739249
Generating test data ========>10000 items, completed 2021-07-23 11:13:48.923505

user.json

[
  {
    "user_id": 1418409069177348096,
    "first_id": null,
    "second_id": null,
    "time": "2019-12-31 12:27:59",
    "name": "LuBing",
    "age": 38,
    "sex": "female",
    "city": "Shenyang",
    "province": "Liaoning Province",
    "annual_income": 2,
    "married": "married",
    "occupation": "white-collar worker",
    "work_state": "on-the-job",
    "family_size": 4,
    "children_size": 2,
    "have_car": 0,
    "vip_level": "2",
    "membership_points": "275",
    "is_valid": 1,
    "education": "high school or below",
    "create_time": "2019-12-31 12:27:59",
    "create_date": "2019-12-31"
  }
]
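Once the file exists, it is worth checking that the generated values actually follow the intended distribution. The sketch below compares observed shares of one field with the target weights. The records list is a hypothetical stand-in for the contents of user.json.

```python
from collections import Counter


def observed_shares(records, field):
    """Return value -> share of records for one field."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def max_deviation(observed, expected):
    """Largest absolute gap between observed and expected shares."""
    keys = set(observed) | set(expected)
    return max(abs(observed.get(key, 0.0) - expected.get(key, 0.0)) for key in keys)


# Hypothetical records standing in for the list loaded from user.json.
records = (
    [{"married": "married"}] * 70
    + [{"married": "unmarried"}] * 20
    + [{"married": "divorced"}] * 10
)
expected = {"married": 0.7, "unmarried": 0.2, "divorced": 0.1}

deviation = max_deviation(observed_shares(records, "married"), expected)
print(f"largest deviation: {deviation:.4f}")
```

For a real run, load the records with json.load and accept a small tolerance, since random sampling never matches the weights exactly.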

Commonly used APIs

  • bothify: generates a string of digits and letters

bothify(text='## ??', letters='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'). Number signs ('#') are replaced with a random digit (0 to 9). Question marks ('?') are replaced with a random character from letters.

eg:
for _ in range(5):
    fake.bothify(letters='ABCDE')
  • lexify(text='????', letters='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'): random letters
Generates a string with each question mark ('?') in text replaced with a random character from letters.
  • numerify(text='###'): random digits

Number signs ('#') are replaced with a random digit (0 to 9). Percent signs ('%') are replaced with a random non-zero digit (1 to 9). Exclamation marks ('!') are replaced with a random digit or an empty string. At symbols ('@') are replaced with a random non-zero digit or an empty string.

>>> Faker.seed(0)
>>> for _ in range(5):
...     fake.numerify(text='Intel Core i%-%%##K vs AMD Ryzen % %%##X')
  • random_digit(): a random digit
    Generates a random digit (0 to 9).
  • random_choices(elements=('a', 'b', 'c'), length=None)
length is the number of elements to return.
  • random_elements(elements=('a', 'b', 'c'), length=None, unique=False, use_weighting=None)
fake.random_elements(
    elements=OrderedDict([
        ("variable_1", 0.5),  # Generates "variable_1" 50% of the time
        ("variable_2", 0.2),  # Generates "variable_2" 20% of the time
        ("variable_3", 0.2),  # Generates "variable_3" 20% of the time
        ("variable_4", 0.1),  # Generates "variable_4" 10% of the time
    ]),
    unique=False,
)
  • random_int(min=0, max=9999, step=1): a random integer
  • random_letter(): a random letter
  • date_between_dates(date_start=None, date_end=None): a random date between two dates
  • past_datetime(start_date='-30d', tzinfo=None): a random datetime in the past
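To make the placeholder rules of bothify, lexify and numerify listed above concrete, here is a plain-Python sketch of the substitution. It is an illustration only, not Faker's implementation, and it leaves out the '!' and '@' placeholders. The function name fill_pattern is hypothetical.

```python
import random
import string


def fill_pattern(text, letters=string.ascii_letters, rng=random):
    """Replace '#' with a digit 0-9, '%' with a digit 1-9 and '?' with a letter."""
    out = []
    for char in text:
        if char == "#":
            out.append(str(rng.randint(0, 9)))
        elif char == "%":
            out.append(str(rng.randint(1, 9)))
        elif char == "?":
            out.append(rng.choice(letters))
        else:
            out.append(char)  # every other character is copied unchanged
    return "".join(out)


rng = random.Random(0)  # fixed seed so the output is repeatable
code = fill_pattern("SKU-%##-??", letters="ABCDE", rng=rng)
print(code)
```

Patterns like this are useful for fields with a fixed format, such as order numbers or product codes, where only some positions should vary.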