00 Background

Specifications constrain the entire process of data warehouse construction and serve as the reference for subsequent iteration, operation, and maintenance. The warehouse specification document, along with the architectural design document, should be distributed to everyone involved before warehouse development begins, and it is a convention that all must strictly follow.

Some people ask: can we just do the work without rules? Of course, in some temporary, short-term projects, where seeing results quickly matters most, there is no need to enforce a specification at the cost of efficiency. But from a professional perspective, even when a project has no formal specification, there are still good conventions to follow: indentation, line breaks, blank lines, comments, and so on.

Search the Internet and you can find many relevant articles, but they are badly fragmented. This article tries to cover warehouse specifications thoroughly, so that you not only understand their purpose, content, and boundaries, but also learn how to introduce and implement them within an enterprise.

You are welcome to combine this with the actual situation of your own company to build and improve your own warehouse specification system. Then we can exchange ideas and make progress together.

01 Why specifications?

As the saying goes, without rules nothing can be accomplished; without specifications, everything is a mess. In my view, a specification solves the efficiency and coordination problems of working in a group, and is a strong guarantee of final delivery quality.

Have you run into problems like these at work?

  • Table A seems usable; so does table B. I asked colleague A, who said he always pulls from table C. After exploring all three tables for a long time, I found that none of them matched each other. In the end I gave up, recalculated from the source, and produced yet another table, D.
  • There are thousands of tables in the database, but I seem to use only a dozen. What are the others for? I asked around and nobody knows. Delete them? Nobody dares to touch them.
  • A job fails and the leader asks me to take a look. I open it up and the code is unreadable; on top of that, it takes ages just to find the upstream dependencies.
  • A colleague is leaving, and someone else takes over the part he was responsible for. After days of hard work the successor still cannot make sense of it, and in frustration that person leaves too.

Because of problems like these, the warehouse team's overall development efficiency, output quality, job satisfaction, and maintenance costs all deteriorate. And as turnover happens, it is often the people who worked hard and stayed loyal to the company who suffer.

I believe anyone who has done data development has run into at least some of the troubles above. In my view, the root cause is usually that there are no rules, or that the rules are not enforced. It is understandable that people sometimes take shortcuts to meet business deadlines, but the technical debt should be repaid as soon as possible, and the organization should not berate employees; leadership should take the responsibility. When leadership takes it seriously, everyone takes it seriously; when leadership does not, everyone just does their own thing.

The data warehouse is the invisible product of us data engineers. The data specification is the "language" of data warehouse construction, the manual and interpreter for data use, and the escort of data quality. For the long-term, healthy development of the data system, data warehouse management should gradually shift from depending on individuals to institutionalization, standardization, and tooling.

02 How should the specification be implemented?

Drafting the specification

From zero to one: at this stage, the leader or architect should fully consider the company's actual situation, refer to industry standards or established norms, and produce a comprehensive, unified draft.

The specification can also be split up and written by the core developers of each part, then integrated by the leader or architect. For example, in our previous team, model designers were responsible for the model design specifications, ETL engineers for the ETL development specifications, BI developers for the front-end development specifications, and the deployment specifications directly adopted the existing project conventions.

In general, the first draft should aim for completeness of the specification and compatibility between its parts.

Discussing the specification

After the first draft is complete, there will inevitably be oversights. At this point it is best for the leader to take charge and organize a few core members (not too many; three to five is enough, since too many people leads to confusion, difficult decision-making, or nobody speaking up except the leader) to further refine the details and correct the shortcomings of the first draft.

In theory, a specification refined by several people should have no major problems.

Promulgating the specification

Once finalized, the specification is ready for full promotion and can be issued to all team members.

  • This can be done through the group chat or a formal email; of course, to get everyone's attention, a dedicated team meeting can also be held.
  • After promulgation comes the enforcement stage: everyone must strictly comply. Violations receive a warning, serious cases are penalized, and repeated offenses can mean losing the year-end raise, and so on.

To ensure the specification is enforced, in addition to drawing everyone's attention through the two points above, organizational, institutional, and process safeguards are also needed.

  • The data model should have a single gatekeeper, such as a data architect, who periodically checks models for compliance.
  • Organize data developers to review everyone's code regularly, but keep it impersonal, and certainly do not single people out. The purpose is to let everyone learn, through comparison and discussion, what good code looks like, so that "writing good code" eventually becomes basic literacy. If conditions do not allow this, have a leader do regular inspections and point out problems privately to help team members gradually conform.
  • New staff, after becoming familiar with the specifications, should also receive dedicated guidance, with compliance checks as the focus.

Does anyone see a problem here?

The enforcement and supervision of specifications, as described above, depend largely on institutional processes and the awareness of the people involved, and institutional processes themselves depend on people. This leads to several problems:

  • Persisting short-term is fine, but keeping up the attention long-term is hard.
  • Sometimes people are busy: fast output or following the standard, which do you choose? Do we still do the code review? Do I still need a data architect to review my new tables?
  • Data modeling is best done by a dedicated person or a small team while others consume the models, but this often hurts overall efficiency, so in practice models are usually built by whoever uses them; once that work is spread out, it is genuinely difficult to rely on people alone to check compliance.

If conditions permit, it is best to introduce corresponding tools to strengthen supervision.

For example, we have metric system metadata, word-root library metadata, table-creation metadata, ETL flow metadata, and so on.

We can then build reports or other pages that let people check compliance through a UI, or enforce it by inspecting the metadata directly: for example, whether comments are null, whether the roots used in field or table names exist in the root library, whether the metrics used in tables or pages exist in the metric system, and whether the data lineage contains cycles or isolated nodes.

Readers who have done data governance will find this familiar, right? Warehouse construction needs to consider these things from the very beginning; the best governance is prevention rather than cure.

Refining the specification

The issued draft should be largely sound, but details may have been overlooked. When problems or obstacles arise during promulgation or enforcement, the specification should be adjusted according to the actual situation; only through testing can it be continuously improved. After a period of sustained practice, the specification will become part of the organizational culture. In this way, communication costs fall, development efficiency rises, and delivery quality is guaranteed: a win for both the team and the individual.

03 What specifications does a data warehouse need?

To give you a picture of the overall landscape of warehouse specifications, I spent considerable effort putting together the following classification. You are welcome to adapt and use it. Since it is just one person's opinion, if you have different views, better schemes, or additions, please add me on WeChat so we can study it together.

Here I divide warehouse specifications into four categories: design specifications, process specifications, quality control specifications, and security specifications.

  • The design specifications are divided into four parts: data model design, naming conventions, indicator system design, and the word-root library.
  • The process specifications mainly constrain, from the perspective of warehouse management, the various processes involved in working with the data warehouse. There are five core processes: requirements submission, model design, ETL development, front-end development, and release.
  • Quality control specifications are listed separately because data quality, like model design, decides the success or failure of warehouse construction. Who would use a data warehouse whose data quality cannot be guaranteed? From the perspective of data flow, data quality specifications fall into three categories: source-side control, in-warehouse management, and application-side control.
  • As for security specifications: on the one hand, countries, society, and enterprises pay ever more attention to data; on the other hand, the popularity of the Internet makes personal privacy ever harder to protect, and data leaks occur from time to time. Data security has become so important to data warehouses that security specifications are listed separately. At a high level they are divided into three categories: network security, account security, and data security.

04 Design Specifications

Data model design

Horizontal layering
  • Description

    • Layered design is one of the outputs of data architecture design and is followed as a mandatory specification in model design.
  • Layering specification

    • ODS

      • Stays close to the source: raw data is stored unchanged, or with only the simplest completion.
      • Data domains are divided according to the data sources.
    • DWD

      • Source data is cleaned, converted, completed, and encoded, then loaded into the detail data layer.
      • Data domains are divided according to the subject domains described below.
    • DWS

      • Summary data layer + wide subject tables.
      • Data domains are divided according to the subject domains described below.
    • ADS

      • Application layer, oriented toward the final applications.
      • Subject areas are divided according to the final application, and the life cycle is synchronized with the application's.
  • Layer invocation specification

    • Reverse calls are prohibited.
    • ODS can only be called by DWD.
    • DWD can be called by DWS and ADS.
    • DWS can only be called by ADS.
    • Data applications can call DWD, DWS, and ADS, but it is recommended to use highly summarized data.
    • ODS->DWD->DWS->ADS
    • ODS->DWD->ADS
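These invocation rules are easy to enforce mechanically once lineage metadata is available. Below is a minimal Python sketch (the mapping and function names are illustrative, not from the original text) that validates a single lineage edge against the layering rules, with "APP" standing for external data applications:

```python
# Allowed downstream consumers for each layer, per the rules above.
# "APP" represents data applications; names are illustrative assumptions.
ALLOWED_CALLS = {
    "ODS": {"DWD"},
    "DWD": {"DWS", "ADS", "APP"},
    "DWS": {"ADS", "APP"},
    "ADS": {"APP"},
}

def is_valid_call(upstream: str, downstream: str) -> bool:
    """Return True if `downstream` is allowed to read from `upstream`."""
    return downstream in ALLOWED_CALLS.get(upstream, set())
```

A compliance tool could run this check over every edge extracted from ETL flow metadata and flag reverse or layer-skipping reads such as a DWS task reading ODS directly.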
Subject domains
  • Definition

    • A subject domain is usually a collection of closely related data topics, organized to make data easy to find and use.
  • Basic principles

    • High cohesion, low coupling.
    • Not too many; no more than ten are recommended.
    • They must be stable: they should cover all existing business requirements, and when new business arrives it should either fit into an existing data domain or be added as a new one without disruption.
    • They must fit the actual situation of the team and the business, such as how stable the business is and the modeling skill of team members.
    • Moderate abstraction: too low and they cannot adapt to change; too high and they are hard to understand and use.
  • Classification

    • Data/business subject domains

      • Divided along business processes; relatively easy to implement.
    • Analysis subject domains

      • Oriented toward analysis scenarios; harder to implement, requiring strong business understanding and abstraction ability.
  • Division criteria

    • By business or business process: for example, a portal site that sells advertising slots may have an advertising domain, a customer domain, etc., and the advertising domain may contain topics such as ad inventory, sales analysis, and internal placement analysis.
    • By demand side: for example, if the demand side is the finance department, a finance subject domain can be set up, which might include topics such as employee salary analysis and return-on-investment analysis.
    • By function or application: for example, data domains for WeChat Moments, group chat, etc.; the Moments data domain might contain topics such as user activity and advertising.
    • By department: for example, there may be an operations domain, a technology domain, etc.; the operations domain might contain topics such as salary expenditure analysis and campaign effectiveness analysis.
Basic principles
  • High cohesion and low coupling
  • The core model is separated from the extension model
  • Common processing logic is pushed down and kept in one place
  • Balance cost and performance
  • Data can be rolled back
  • Consistency
  • Clear and understandable naming
Additional fields
  • Dimension table: creation time, update time
  • Fact table: ETL date, update time
Other requirements
  • Comments on tables and fields must be as concise as possible while still describing things clearly.
  • Constrain field types: for example String and Int; year, month, and date values are stored as String, such as yyyyMMdd.
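One practical reason for the yyyyMMdd string convention is that the format is fixed-width and zero-padded, so lexicographic comparison of the strings matches chronological order, and date partitions remain filterable with plain string predicates. A small Python illustration:

```python
from datetime import date

def to_yyyymmdd(d: date) -> str:
    """Render a date in the zero-padded yyyyMMdd string format."""
    return d.strftime("%Y%m%d")

a = to_yyyymmdd(date(2023, 9, 5))   # "20230905"
b = to_yyyymmdd(date(2023, 10, 1))  # "20231001"

# Lexicographic order equals chronological order for this format.
assert a < b
```

A non-padded format like "202395" would break this property, which is why the zero padding matters.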

Naming conventions

General rules
  • Use snake_case naming: word roots separated by underscores.
  • Give priority to existing keywords in the root library (root management is a standard part of data warehouse construction), and regularly review new names for irregularities.
  • Non-standard abbreviations are prohibited.
  • All names are lowercase and must start with a letter.
  • Names should not be too long.
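Several of these rules can be checked automatically. Here is a minimal, illustrative Python validator; the regex and the 30-character cap are assumptions distilled from the rules above, not part of the original specification:

```python
import re

# Lowercase snake_case, starting with a letter; the pattern and the
# length cap below are illustrative assumptions.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
MAX_LEN = 30

def check_name(name: str) -> list:
    """Return a list of rule violations; an empty list means compliant."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("must be lowercase snake_case starting with a letter")
    if len(name) > MAX_LEN:
        problems.append(f"longer than {MAX_LEN} characters")
    return problems
```

Such a check can run as part of table-creation tooling or a periodic metadata scan, so naming review does not depend on a human remembering to look.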
Specific rules
  • Tables

    • Layer + data domain + root + time period
    • Formal tables: layer name + data domain + table description + time period or load strategy, e.g., incremental, snapshot, zipper / hour, day, week, month, quarter, year
    • Intermediate tables: the corresponding formal table name + _mid + an Arabic numeral
    • Temporary tables: z + creator's name abbreviation + table name
  • Views

    • Follow the table naming convention + _v
  • Fields

    • Prefer names taken from the root library; roots that appear repeatedly should be added to the library
  • Tasks

    • Same as the target table name
  • Indicators

    • Atomic indicators

      • Business modifier + root
    • Derivative indicators

      • Atomic indicator + time period (optional)
    • Derived indicators

      • One atomic indicator + multiple modifiers (optional) + time period

Code design specification

  • Are there comments in the script, especially for complex calculation logic?
  • Is the task idempotent, producing the same output on reruns? Plain insert into statements should not be used.
  • Are partitioned tables filtered by partition keys, with effective partition pruning?
  • Are outer-join conditions used correctly? For example, in a left join, putting right-table filter conditions in the where clause silently turns it into an inner join.
  • Is the /*+ MAPJOIN */ hint used when joining small tables?
  • References to other tasks' temporary tables are not allowed.
  • In principle, one task is not allowed to update multiple target tables.
  • Is there a Cartesian product?
  • Do not use DDL statements such as drop, create, and rename in code.
  • When dynamic partitioning is used, check whether the partition key can be NULL.
  • Are DQC quality-monitoring rules configured for important tasks? Running them unmonitored is strictly prohibited.
  • Does the code adequately guard against data skew?
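Some items on this checklist lend themselves to automated scanning before human code review. The sketch below is a rough, illustrative lint pass over SQL text, not a real parser; the patterns, rule set, and messages are assumptions:

```python
import re

# Simple textual rules approximating two checklist items above:
# idempotency (no plain INSERT INTO) and no DDL in task code.
RULES = [
    (re.compile(r"\binsert\s+into\b", re.I),
     "use INSERT OVERWRITE so reruns are idempotent"),
    (re.compile(r"\b(drop|create|rename)\s", re.I),
     "DDL such as DROP/CREATE/RENAME is not allowed in task code"),
]

def lint_sql(sql: str) -> list:
    """Return the messages of all rules the SQL text violates."""
    return [msg for pattern, msg in RULES if pattern.search(sql)]
```

A real implementation would parse the SQL (or reuse the engine's parser) to avoid false positives, e.g., keywords appearing inside string literals, but even a crude scan catches the most common violations cheaply.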

Indicator system construction

  • Indicator hierarchy levels

    • By analysis subject

      • Primary classification
      • Secondary classification
    • By business process

      • Primary classification
      • Secondary classification
      • Tertiary classification
  • Indicator definition

    • Content

      • Classification
      • Indicator type
      • Name
      • Description
      • Caliber/algorithm
      • Unit of measurement
      • Applicable dimensions
      • ...
    • Principles

      • Uniqueness
      • Extensibility
      • Ease of understanding
    • Categories

      • Atomic indicator (a measure of a business event's behavior; an indivisible indicator). Example: order amount.
      • Derivative indicator (the result of arithmetic operations on atomic indicators).
      • Derived indicator (statistical period + statistical granularity + business qualifier + atomic indicator). Example: latest day + newly created + number of orders. (Alibaba's "The Road to Big Data" defines it as: derived indicator = atomic indicator + time-period modifier + other modifiers, a data field that uniquely belongs to and inherits from an atomic indicator.)
    • Note: opinions on indicator classification vary across the Internet; what matters is that we agree on what each term refers to. I searched Alibaba's "The Road to Big Data" and found no concept of "derivative indicator". Interpretation 1: derivative indicator = derived indicator; in that case use my definition of derived indicator above. Interpretation 2: derivative indicators come from arithmetic operations on atomic indicators; in that case a derivative indicator is just an atomic indicator with a few modifiers added or removed, or its time period expanded or contracted, which makes the concept feel somewhat redundant and it may simply collapse into an atomic or derived indicator.

  • Indicator management process

    • New indicator application
    • Preliminary review: clarify the indicator's caliber and check whether the indicator library already contains it
    • Second review: check whether all elements required by the indicator definition are accurate and complete
    • Indicator release
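The naming formula for derived indicators (atomic indicator + modifiers + time period) can be made concrete in code. A hypothetical sketch, with illustrative class and field names not taken from the original text:

```python
from dataclasses import dataclass, field

# Sketch of "derived indicator = atomic indicator + modifiers + time period",
# following the definition above; all names are illustrative assumptions.
@dataclass
class DerivedIndicator:
    atomic: str                                     # e.g. "order_cnt"
    period: str                                     # e.g. "1d", "7d"
    modifiers: list = field(default_factory=list)   # e.g. ["new_user"]

    def name(self) -> str:
        """Compose a snake_case indicator name in the order atomic, modifiers, period."""
        parts = [self.atomic] + list(self.modifiers) + [self.period]
        return "_".join(parts)
```

Encoding the formula this way makes an indicator library enforceable: every registered derived indicator must decompose into a known atomic indicator plus registered modifiers and a period, which also supports the uniqueness check in the review steps above.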

Root library

  • Definition

    • Unified names for words and phrases that may be used repeatedly, to ensure global consistency.
  • Content

    • Category
    • Name
    • English abbreviation
    • Data type
    • Remarks
  • Classification

    • Common roots: the smallest units that describe something, e.g., trade.
    • Proper roots: conventional or industry-specific terms, e.g., USD.
  • Public fields

    • Common field = root combination + other keywords
    • Strictly speaking, common fields do not belong in the root library, but storing them there lets field names be reused directly and reduces the risk of inconsistent naming, so it is recommended for companies that are not yet fully tool-supported.
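Checking that every part of a snake_case name exists in the root library, as suggested in the tooling discussion earlier, is straightforward once the library is available as metadata. A minimal sketch with an illustrative root set:

```python
# Illustrative root library; in practice this would be loaded from
# the root-library metadata rather than hard-coded.
ROOT_LIBRARY = {
    "ods", "dwd", "dws", "ads",          # layer prefixes
    "trade", "order", "user",            # common roots
    "amt", "cnt", "df", "di",            # abbreviations / load strategies
}

def unknown_roots(name: str) -> set:
    """Return the parts of a snake_case name that are not registered roots."""
    return {part for part in name.split("_") if part not in ROOT_LIBRARY}
```

Running this over all table and field names from the metadata store turns root-library compliance from a manual review item into a report that anyone can inspect.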

05 Process Specifications

Requirements Submission Process

  • Submit the requirement

    • Requirement proposer: submit the requirement in document form (clearly stating the requirement content, deliverables, and expected delivery date) and send it to the data warehouse leader.
  • Discuss the requirement

    • The warehouse leader assigns the requirement to the relevant people and negotiates the actual delivery date.
    • If the proposer's expected delivery date and the warehouse team's feasible date differ, the two parties negotiate further to reach agreement.
  • Develop and deliver

    • The requirement acceptor delivers on time according to the agreed delivery date.

Model design process

  • Data research, business research, demand research

  • Data modeling

    • Overall approach

      • Following the existing layers and domains, divide and conquer, tackling each part in turn.
    • Combine multiple methods

      • Determine the business process -> declare the granularity -> determine the dimensions -> determine the facts
      • Business modeling -> logical modeling -> physical modeling
      • Build the bus matrix
      • Build the indicator system
  • Review

    • In addition to the model designers, include the necessary developers, business staff, analysts, product managers, warehouse operations staff, etc.
  • Go online, iterate, improve

ETL development process

(This will be covered in more detail in a later chapter -ETL)

  • Requirement understanding
  • Data exploration
  • Application development
  • Process dependencies
  • Configure scheduling

Front-end development specification

(This will be covered in more detail in the following chapter – Application)

  • Interface specification
  • Code deployment specification

On-line process

  • Application

    • Go-live time
    • Scope of the go-live
    • Impact on other modules, upstream and downstream dependencies
    • List of go-live support teams
    • Detailed go-live procedure
    • Test report
    • Rollback plan
  • Review

    • Code Review
    • Upstream and downstream impact analysis
  • Go live

    • Go-live support team in place
    • Strictly follow the go-live procedure
    • Roll back on failure

06 Quality Control Specifications

Source-side control

  • Changes on the source side must be announced in advance.
  • If possible, use tools to monitor key changes on the source side.

In-warehouse management

  • Issue warnings and penalties when existing specifications are not followed: modeling specifications, development specifications, and release specifications.

  • Use tools to strengthen data-quality monitoring, with timely notifications and alerts for problems.

  • Establish a data-quality resolution mechanism, with responsibility assigned to specific people.

  • Regular checks

    • Turn frequent major problems into alert rules
    • Coordinate with the source side to resolve data-quality problems
    • For problems caused by the storage model, ETL development, the release process, etc., develop appropriate solutions
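Two of the most common tool-enforced data-quality (DQC) rules are row-count fluctuation and null-rate checks. A minimal Python sketch; the thresholds and data shape are illustrative assumptions:

```python
def check_row_count(today: int, yesterday: int, max_change: float = 0.5) -> bool:
    """Pass if today's row count changed by at most max_change vs yesterday."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= max_change

def check_null_rate(values: list, max_null_rate: float = 0.01) -> bool:
    """Pass if the share of NULL (None) values does not exceed max_null_rate."""
    if not values:
        return True
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) <= max_null_rate
```

In practice these checks run right after each task finishes, and a failing rule blocks downstream tasks and triggers an alert, which is what "running unmonitored is strictly prohibited" amounts to operationally.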

Application-side control

  • Unified indicator definitions
  • Unified indicator calibers
  • A unified hub for external data output

07 Security Specifications

Network security

  • The intranet and extranet are isolated; accessing the intranet from outside requires logging in via VPN.
  • Core data stores and functional modules are open only to a select few.

Account security

  • Each person is assigned an independent account with appropriate permissions; lending accounts to each other is prohibited.

  • Multiple role accounts are created for databases and big-data components, such as read-only, read-write on selected tables, administrator, etc. These can of course be subdivided according to actual needs; Hive and ODPS can even achieve one account per person.

  • Server logins also use one account per person.

  • Internal company application accounts: one account per person.

Data security

  • Implement at least table-level permission control; if that is not possible, split the data into separate databases.
  • The ODS layer is not open to the public and is visible only to the developers working on the ODS-to-DWD layer.
  • Especially sensitive data, such as user age, phone number, ID number, and address, should be placed in a dedicated database, with the main database storing only user IDs and other required fields. Fields like age should be desensitized into age ranges, or dedicated UDF conversion functions should be developed.
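Desensitizing age into ranges, as suggested above, can be done with a simple UDF-style function. The band boundaries below are illustrative assumptions, not part of the original text:

```python
# Sketch of a desensitization UDF that maps an exact age to a coarse band.
def age_band(age: int) -> str:
    """Map an exact age to a privacy-preserving age range label."""
    if age < 0:
        return "unknown"
    # (exclusive upper bound, label) pairs; boundaries are illustrative.
    bands = [(18, "0-17"), (30, "18-29"), (45, "30-44"), (60, "45-59")]
    for upper, label in bands:
        if age < upper:
            return label
    return "60+"
```

Registering such a function as a UDF in the warehouse engine lets analysts query age distributions without ever seeing exact ages, keeping the raw value confined to the dedicated sensitive database.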

08 Summary

This article has elaborated data warehouse specifications from four aspects: design specifications, process specifications, quality control, and data security. It should cover every aspect of warehouse specifications; if there are omissions or better classification methods, feel free to add me on WeChat to discuss.

The original intention of this article was to find a reasonable classification and exhaustively list the data specifications, so that everyone can see the overall picture. In actual practice, however, you may not need all of it; there is no best, only the most suitable, and you should select the subset worth implementing according to your real scenario.

In the companies and projects I have experienced, no project has used all of the above specifications in full. Internet big-data companies use fewer specifications, with different emphases from traditional data warehouse projects. Big-data companies, perhaps because of their Internet genes, emphasize data security and tooling more, with lower demands on data quality and data models. The traditional warehouses I worked on had high requirements for data modeling and data quality (a colleague of mine was once held up by the client's CFO for a whole day over a one-dollar discrepancy in the data), while data security mattered less since the network environment was mostly isolated; perhaps because these were delivered projects, other tooling received little attention, management relied largely on people, and metadata was basically documents.

Recommended reading:

System design an index system

How to build a data warehouse

Dimensional modeling of data warehouses

OneData construction exploration road: SaaS cashier operation warehouse construction

How is ID-Mapping, the core technology of One ID, implemented?

NetEase Cloud Music data services

NetEase Cloud Music data warehouse dimensional modeling practice: model design

Public account: Data warehouse and big data

The author is OTW30, known as "Bo Ge". I hope to systematically summarize knowledge of data warehousing and big data, providing a complete set of learning materials for colleagues in the field.