What is NoSQL?

Data models

Structured data

  • Structured data is data that is logically expressed and implemented as two-dimensional tables, with strict rules for data format and length

Unstructured data

  • Unstructured data is data whose structure is irregular or incomplete. It has no predefined data model and cannot easily be represented as two-dimensional logical tables. Examples include text, images, HTML, video, and audio

Semi-structured data

  • Semi-structured data is a form of structured data. Although it does not conform to the two-dimensional table model, it contains tags that separate semantic elements and organize records and fields into a hierarchy. XML and JSON are examples

  • Semi-structured data is stored in a tree or graph data structure

  • For structured data, the structure is usually defined first and the data follows; for semi-structured data, the data comes first and the structure follows from it
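
The difference between "structure first" and "data first" can be illustrated with a minimal Python sketch. The column names, sample records, and the `fields_of` helper are hypothetical and exist only for illustration.

```python
import json

# Structured: the schema (column names) is fixed first; every row must fit it.
columns = ("id", "name", "age")
rows = [(1, "Alice", 30), (2, "Bob", 25)]

# Semi-structured: each record carries its own tags, so fields can vary and nest.
documents = [
    json.loads('{"id": 1, "name": "Alice", "tags": ["admin"]}'),
    json.loads('{"id": 2, "name": "Bob", "address": {"city": "Paris"}}'),
]

def fields_of(doc):
    """Discover the structure of one record from the data itself."""
    return sorted(doc.keys())

# Each document reports a different set of fields.
structures = [fields_of(doc) for doc in documents]
```

Every tuple in `rows` has to match `columns`, while the two documents describe themselves and do not need to share the same fields.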

 

Relational database

Architectural evolution of relational database storage

  • Phase 1: In the early stage of an enterprise, a single application server is paired with a relational database, and every request reads from and writes to that database

  • Phase 2: As the enterprise grows, the application server becomes the performance bottleneck. More application servers are added, with Nginx as a load-balancing layer at the traffic entrance

  • Phase 3: As the enterprise continues to grow, the database becomes the performance bottleneck. Reads and writes are separated: writes go to the primary database and reads go to the secondary databases, which are kept in sync with the primary through the binlog

  • Phase 4: As the business keeps growing, the load on the read/write-separated databases continues to rise. More databases are added and the data is sharded, with tables split vertically and databases split horizontally
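
Phases 3 and 4 can be sketched as a small routing layer. This is a minimal illustration under stated assumptions, not a production design: the `Router` class, the database names, and the choice of a CRC32 hash are all hypothetical.

```python
import zlib

class Router:
    """Hypothetical router: pick a shard by key, then primary or replica by operation."""

    def __init__(self, shards):
        # shards: list of {"primary": name, "replicas": [names]}
        self.shards = shards
        self._next = 0

    def shard_for(self, user_id):
        # Horizontal split: a stable hash of the key chooses the database.
        return zlib.crc32(str(user_id).encode()) % len(self.shards)

    def route(self, user_id, is_write):
        shard = self.shards[self.shard_for(user_id)]
        # Read/write separation: writes always go to the primary.
        if is_write or not shard["replicas"]:
            return shard["primary"]
        # Reads rotate across the replicas of the chosen shard.
        self._next += 1
        return shard["replicas"][self._next % len(shard["replicas"])]

shards = [
    {"primary": "db0-primary", "replicas": ["db0-r1", "db0-r2"]},
    {"primary": "db1-primary", "replicas": ["db1-r1"]},
]
router = Router(shards)
write_target = router.route(42, is_write=True)
read_target = router.route(42, is_write=False)
```

A simple modulo hash like this one forces data migration whenever the number of shards changes, which is one of the horizontal scaling problems listed later in this article.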

Advantages and disadvantages of relational databases

Advantages

  • Easy to operate: Standard SQL makes relational databases convenient to work with and supports joins and other complex queries

  • Data consistency: ACID transactions maintain data consistency

  • Data stability: Data is persisted to disk, which keeps the risk of data loss low, and massive data storage is supported

  • Stable service: The most commonly used relational database products, MySQL and Oracle, offer excellent performance and stable service

Disadvantages

  • High I/O pressure under high concurrency: Data is stored by row. Even if an operation touches only one column, the entire row is read from the storage device into memory, resulting in high I/O

  • High index maintenance costs: Every data update also updates the affected secondary indexes, which reduces the read and write performance of the database. The more indexes there are, the worse the performance

  • High cost of maintaining data consistency: The SQL standard defines four transaction isolation levels, from lowest to highest: read uncommitted, read committed, repeatable read, and serializable. The higher the isolation level, the worse the read and write performance

  • Problems with horizontal scaling: Data migration, cross-database joins, and distributed transactions all have to be considered once the data has been sharded

  • Inconvenient changes to the table structure: Modifying a table requires executing DDL, which can lock the table and make some services unavailable
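
The first point in the list, row-oriented I/O, can be made concrete with a small sketch. The sample rows and the two helper functions are hypothetical; they count how many values have to be touched to sum one column in a row layout and in a column layout.

```python
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Carol", "age": 41},
]

def sum_age_row_store(rows):
    """Row layout: every field of every row is loaded to reach one column."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        total += row["age"]
    return total, values_read

# Column layout: the same data pivoted so each column is stored contiguously.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def sum_age_column_store(columns):
    """Column layout: only the requested column is loaded."""
    ages = columns["age"]
    return sum(ages), len(ages)

row_result = sum_age_row_store(rows)           # (96, 9)
column_result = sum_age_column_store(columns)  # (96, 3)
```

Both functions return the same total, but the row layout touches every field of every row, and the gap widens as tables gain more columns.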

 

Non-relational databases

A non-relational database (NoSQL, short for Not Only SQL) is a database management system that differs from traditional relational databases. It is mainly used to meet the demands of highly concurrent reads and writes, mass storage, and high scalability.

Advantages of NoSQL databases

  • High scalability: Data items in NoSQL have no relationships to one another, so the data is very easy to scale out

  • High performance: The absence of relationships between data items also gives NoSQL high read and write performance

  • Flexible data model: NoSQL does not require fields to be defined in advance for the data to be stored, and custom data formats can be stored at any time

Classification of NoSQL databases

|                         | Key-value store  | Column store                             | Document store    | Graph store          |
|-------------------------|------------------|------------------------------------------|-------------------|----------------------|
| Storage structure       | Key/value pairs  | Column families                          | JSON-like objects | Graph structure      |
| Application scenarios   | Content caching  | Distributed data storage and management  | Web applications  | Relationship graphs  |
| Typical representatives | Redis, Memcached | Cassandra, HBase                         | MongoDB, CouchDB  | Neo4j, InfiniteGraph |
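
To give a feel for the four categories, the sketch below models each storage structure with plain Python containers. It does not use any real database client, and all keys, field names, and the `followers_of` helper are hypothetical.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:1": '{"name": "Alice"}'}

# Column store: row key -> column family -> column -> value.
column_family = {
    "user1": {
        "info": {"name": "Alice"},
        "stats": {"logins": "3"},
    }
}

# Document store: one nested, JSON-like object per record.
document = {"_id": 1, "name": "Alice", "orders": [{"sku": "A1", "qty": 2}]}

# Graph store: nodes plus typed edges between them.
nodes = {"alice": {"type": "user"}, "bob": {"type": "user"}}
edges = [("alice", "FOLLOWS", "bob")]

def followers_of(name):
    """Walk the edge list to answer a relationship query."""
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == name]
```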

CAP + BASE

NoSQL systems are usually deployed across multiple nodes and use BASE theory to ensure data consistency

CAP theory

In July 2000, Professor Eric Brewer of the University of California, Berkeley proposed the CAP conjecture at the ACM PODC conference. Two years later, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology proved it formally. Since then, CAP has been an accepted theorem in distributed computing.

  • A distributed system can satisfy at most two of Consistency, Availability, and Partition tolerance at the same time

    • Strong consistency: Once a successful update has been returned to the client, the data on all nodes is consistent at the same moment. (Weak consistency and eventual consistency are outside the scope of CAP theory.)

    • Availability: The service is always available and responds within a normal time

    • Partition tolerance: The loss or failure of any messages in the system does not stop the system from continuing to operate

  • The trade-offs between C, A, and P

    • CP without A: Once a network fault or message loss occurs, the system serves requests only after all data is consistent again. Examples are distributed storage systems such as Redis and HBase and distributed coordination components such as ZooKeeper, which require data consistency

    • AP without C: Once a network problem occurs, each node can serve requests only from its local data, which leads to global data inconsistency. To provide highly available services, many web applications give up strong consistency and settle for eventual consistency (see BASE theory below).
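
The CP and AP choices can be simulated with a toy two-replica system. This is a sketch under stated assumptions: the `Cluster` class and its `mode` flag are hypothetical, the client is assumed to be connected to replica `a`, and a partition is modeled as a simple boolean.

```python
class Replica:
    def __init__(self):
        self.value = None

class Cluster:
    """Toy two-replica system; mode is "CP" or "AP" (hypothetical)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let replicas diverge.
                raise RuntimeError("unavailable: cannot reach all replicas")
            # AP: accept the write locally; replica b keeps its old value.
            self.a.value = value
            return
        # No partition: both replicas are updated together.
        self.a.value = self.b.value = value

cp = Cluster("CP")
cp.write(1)
cp.partitioned = True
try:
    cp.write(2)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

ap = Cluster("AP")
ap.write(1)
ap.partitioned = True
ap.write(2)
```

After the partition, the CP cluster has rejected the write and both of its replicas still agree, while the AP cluster has accepted the write and its replicas now hold different values.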

BASE theory

BASE theory dates from 2008, when eBay architect Dan Pritchett published it through the ACM.

  • BASE stands for Basically Available, Soft state, and Eventually consistent

  • Eventually consistent: All copies of the data in a system eventually reach a consistent state after a period of synchronization
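
Eventual consistency can be sketched with two replicas that diverge and then converge after a synchronization round. The last-write-wins rule based on timestamps is one possible conflict resolution strategy chosen here as an assumption; the `Replica` class and the `sync` function are hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, ts):
        # Keep the entry only if it is newer than what is already stored.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def sync(x, y):
    """Exchange entries in both directions; the newest timestamp wins."""
    for key, (ts, value) in list(x.data.items()):
        y.put(key, value, ts)
    for key, (ts, value) in list(y.data.items()):
        x.put(key, value, ts)

r1, r2 = Replica(), Replica()
r1.put("k", "old", ts=1)
r2.put("k", "new", ts=2)
before = (r1.get("k"), r2.get("k"))  # replicas disagree (soft state)
sync(r1, r2)
after = (r1.get("k"), r2.get("k"))   # replicas agree
```

Between the writes and the call to `sync`, a reader may see either value depending on which replica it asks; after synchronization both replicas return the same one.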

 

What is HBase?

HBase is a highly available, high-performance, multi-version distributed NoSQL database built on Apache Hadoop. It is an open source implementation of Google Bigtable and provides high-performance random reads and writes over massive data.

 

HBase Storage Structure

Data model

  • HBase is essentially a key-value database

  • The key consists of RowKey + ColumnFamily + Column Qualifier + TimeStamp (the version) + KeyType, and the value is the actual stored value
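
The composite key can be modeled as a small named tuple. This is an illustrative sketch, not HBase's actual on-disk encoding: the `CellKey` type, the `"Put"` and `"Delete"` strings, and the helper functions are hypothetical, and the sort order (newest timestamp first within a column) is an assumption that mirrors how HBase cell ordering is commonly described.

```python
from typing import NamedTuple

class CellKey(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    key_type: str  # illustrative values: "Put" or "Delete"

def sort_order(key):
    # Negating the timestamp places the newest version of a column first.
    return (key.row, key.family, key.qualifier, -key.timestamp)

def latest(cells, row, family, qualifier):
    """Return the newest value of one column, or None if absent or deleted."""
    versions = [
        k for k in cells
        if (k.row, k.family, k.qualifier) == (row, family, qualifier)
    ]
    if not versions:
        return None
    newest = min(versions, key=sort_order)
    return cells[newest] if newest.key_type == "Put" else None

cells = {
    CellKey(b"user1", b"info", b"name", 1, "Put"): b"Alice",
    CellKey(b"user1", b"info", b"name", 2, "Put"): b"Alicia",
    CellKey(b"user1", b"info", b"email", 1, "Put"): b"a@example.com",
}
current_name = latest(cells, b"user1", b"info", b"name")
```

Because the timestamp is part of the key, writing a new value does not overwrite the old cell; it adds a newer version next to it, which is what makes HBase multi-version.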

System architecture

  • Client: Provides interfaces for accessing HBase and maintains a cache to speed up access

  • ZooKeeper: Stores HBase metadata. The Client obtains the metadata from ZooKeeper to learn which machine to read data from and write data to

  • HRegionServer: Processes read and write requests from clients and interacts with HDFS. It is the node that does the actual work

  • HMaster: Processes metadata changes and monitors the status of the HRegionServers
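
The interaction between the Client cache and the metadata lookup can be sketched as follows. This is a simplified model, not the HBase client API: the `MetaTable` and `Client` classes, the region boundaries, and the server names are hypothetical, and the metadata is reduced to a sorted list of row key ranges.

```python
import bisect

class MetaTable:
    """Hypothetical stand-in for region metadata: (start, end, server) ranges."""

    def __init__(self, regions):
        # end is None for the last region, meaning "no upper bound".
        self.regions = sorted(regions, key=lambda r: r[0])
        self.starts = [r[0] for r in self.regions]
        self.lookups = 0

    def locate(self, row_key):
        self.lookups += 1
        # Find the last region whose start key is <= row_key.
        i = bisect.bisect_right(self.starts, row_key) - 1
        return self.regions[i]

class Client:
    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # regions already located

    def server_for(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no metadata round trip
        region = self.meta.locate(row_key)
        self.cache.append(region)
        return region[2]

meta = MetaTable([("", "m", "rs1"), ("m", None, "rs2")])
client = Client(meta)
first = client.server_for("apple")
second = client.server_for("banana")  # same region, served from the cache
third = client.server_for("zebra")
```

Only the first request for each region reaches the metadata; later requests for row keys in the same range are answered from the client's cache.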

HRegionServer structure

  • The data in a table is split horizontally into HRegions by RowKey. The HRegion is the smallest unit of distributed storage and load balancing in HBase. One HRegionServer can host multiple HRegions

  • The data in an HRegion is split vertically into Stores by ColumnFamily. The Store is the core storage unit of HBase and consists of a MemStore and StoreFiles

  • HBase writes data to the MemStore first. When the MemStore exceeds a certain threshold, the data in memory is written to disk to form a StoreFile

  • A StoreFile is stored in the HFile format at the lowest layer. HFile is the format in which HBase stores data

  • If a machine goes down before the MemStore is flushed to disk, the data in memory would be lost. To prevent this, every write to the MemStore is also written to an HLog
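
The write path described above (HLog first, then MemStore, then a flush to a StoreFile) can be sketched with a toy in-memory model. The `Store` class, its method names, and the flush threshold of three entries are hypothetical, and clearing the log on flush is a simplification.

```python
class Store:
    """Toy model of one Store: an HLog, a MemStore, and a list of StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.hlog = []         # append-only log of writes not yet flushed
        self.memstore = {}     # in-memory writes
        self.store_files = []  # flushed files, each a sorted list of pairs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.hlog.append((key, value))  # log first so the write survives a crash
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the MemStore out as one sorted StoreFile.
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.hlog = []  # simplification: flushed entries no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest file first
            for k, v in store_file:
                if k == key:
                    return v
        return None

    def recover(self):
        """Rebuild the MemStore from the HLog after a crash."""
        self.memstore = dict(self.hlog)

store = Store(flush_threshold=3)
for k, v in [("a", "1"), ("b", "2"), ("c", "3")]:
    store.put(k, v)        # the third put triggers a flush
store.put("d", "4")
store.memstore = {}        # simulate a crash that wipes memory
store.recover()            # replay the HLog
```

After recovery, `d` is readable again even though it never reached a StoreFile, which is the purpose of writing to the HLog alongside the MemStore.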

 

 
