Introduction to Minio

Minio is an object storage system written in Go and released under the Apache License 2.0. It is designed for massive data storage, artificial intelligence, and big data analytics. It is fully compatible with the Amazon S3 API and is well suited to storing large unstructured objects ranging from tens of KB up to a maximum of 5 TB. In short, it is a small but polished piece of open source distributed storage software.

Characteristics

Minimalist but not simplistic: Minio adopts a simple and reliable clustering scheme. It forgoes complex large-scale cluster scheduling and management, which reduces risk and performance bottlenecks, and focuses on the product's core goals: a highly available cluster, flexible scalability, and strong performance. The intent is to build many small and medium-sized clusters that are easy to manage and can be aggregated into a large resource pool across data centers, rather than one huge, centrally managed distributed cluster.

Minio is cloud native and integrates well with orchestration systems such as Kubernetes, Docker, and Swarm for flexible deployment. Deployment itself is simple: there is a single executable with few parameters, and one command starts a Minio system. For high performance, Minio uses a design with no metadata database, which prevents a metadata database from becoming the bottleneck of the whole system and confines faults to a single cluster so that other clusters are not affected. Minio is fully compatible with the S3 interface and can also act as a gateway that exposes S3 access to the outside. Minio uses both erasure coding and checksums to protect against hardware failure: even if you lose as many as half of your hard drives, you can still recover the data. A distributed deployment likewise tolerates (N/2)-1 node failures.
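
For example, a single-node instance starts with one command, and a distributed cluster needs only the list of nodes and drives (the host names below are placeholders):

    minio server /data

    minio server http://host{1...4}.example.com/data{1...4}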

Architecture

Decentralized architecture

Minio uses a decentralized, shared-nothing architecture. Object data is scattered across multiple disks on different nodes. It exposes a unified namespace externally, and requests are spread across servers by a load balancer or DNS round-robin.

Unified namespace

Minio supports two cluster deployment modes: ordinary local distributed cluster deployment and federated deployment. In local distributed deployment, the Minio service runs on multiple local server nodes that form a single distributed storage cluster, providing a unified namespace and a standard S3 access interface. In federated deployment, multiple local Minio clusters are logically combined into one unified namespace, enabling nearly unlimited expansion and management of data at massive scale. The member clusters can be local or spread across different data centers.

Distributed lock management

Like a distributed database, Minio faces data consistency problems: while one client is reading an object, another client may be modifying or deleting it. To avoid inconsistency, Minio designed and implemented Dsync, a distributed lock manager, to control data consistency.

  • A lock request from any node is broadcast to all online nodes in the cluster
  • If N/2+1 nodes agree, the lock is successfully acquired
  • There is no master node; all nodes are peers, and a stale-lock detection mechanism is used to determine node liveness and lock-holding status
  • The design is deliberately simple and somewhat coarse, so it has limitations: at most 32 nodes are supported, and lock-loss scenarios cannot be ruled out entirely, but it meets practical availability requirements (a sketch of the quorum rule follows the benchmark table below)
Measured Dsync lock performance on EC2:

  EC2 Instance Type      Nodes  Locks/server/sec    Total Locks/sec  CPU Usage
  c3.8xlarge (32 vCPU)   8      min=2601, max=2898  21996            10%
  c3.8xlarge (32 vCPU)   8      min=4756, max=5227  39932            20%
  c3.8xlarge (32 vCPU)   8      min=7979, max=8517  65984            40%
  c3.8xlarge (32 vCPU)   8      min=9267, max=9469  74944            50%
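
To make the quorum rule concrete, here is a minimal, self-contained Go sketch of the majority vote described above. It is not the real Dsync implementation (which broadcasts over the network and adds stale-lock detection); the grant functions merely stand in for RPC calls to the N peer nodes.

    package main

    import "fmt"

    // tryLock asks every peer for the lock and counts the grants. In Dsync the
    // asks are network broadcasts; here each func stands in for one peer's reply.
    func tryLock(peers []func(resource string) bool, resource string) bool {
        grants := 0
        for _, ask := range peers {
            if ask(resource) {
                grants++
            }
        }
        quorum := len(peers)/2 + 1 // majority: N/2+1 nodes must agree
        return grants >= quorum
    }

    func main() {
        // Simulate 4 peers; one of them (say, offline) refuses the lock.
        peers := []func(string) bool{
            func(string) bool { return true },
            func(string) bool { return true },
            func(string) bool { return true },
            func(string) bool { return false },
        }
        fmt.Println(tryLock(peers, "mybucket/myobject")) // true: 3 >= 4/2+1
    }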

Data structure

The Minio object storage system organizes storage resources into a tenant / bucket / object hierarchy (a minimal client-side sketch follows the list):

  • Object: similar to an entry in a hash table; the name is the key and the content is the value
  • Bucket: a logical abstraction over a number of objects; the container that objects live in
  • Tenant: isolates storage resources; buckets and objects are created under a tenant
  • User: an account created under a tenant for accessing its buckets. You can use the mc command-line tool provided by Minio to grant different users access to each bucket
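
As an illustration of the bucket/object model, here is a minimal sketch using the minio-go v7 client; the endpoint, credentials, and names are placeholders.

    package main

    import (
        "context"
        "log"
        "strings"

        "github.com/minio/minio-go/v7"
        "github.com/minio/minio-go/v7/pkg/credentials"
    )

    func main() {
        ctx := context.Background()

        // Connect as a user (an account under a tenant); placeholders throughout.
        client, err := minio.New("play.min.io", &minio.Options{
            Creds:  credentials.NewStaticV4("ACCESS-KEY", "SECRET-KEY", ""),
            Secure: true,
        })
        if err != nil {
            log.Fatal(err)
        }

        // Bucket: the container objects live in.
        if err := client.MakeBucket(ctx, "mybucket", minio.MakeBucketOptions{}); err != nil {
            log.Fatal(err)
        }

        // Object: the name is the key, the content is the value.
        body := strings.NewReader("hello object storage")
        if _, err := client.PutObject(ctx, "mybucket", "greeting.txt", body,
            body.Size(), minio.PutObjectOptions{ContentType: "text/plain"}); err != nil {
            log.Fatal(err)
        }
    }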

Unified domain name access

After a Minio federation is extended with a new cluster or bucket, object storage clients still need to reach the data through a single unified domain name/URL. This is achieved with etcd and CoreDNS.
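
A hypothetical sketch of the mechanism, assuming CoreDNS's etcd plugin (which reads SkyDNS-style JSON records): federation publishes a DNS record for each bucket into etcd, and CoreDNS then resolves the bucket's domain name to the cluster that owns it. The key layout and addresses below are illustrative only.

    package main

    import (
        "context"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // Publish "mybucket.example.com" -> the cluster node that owns mybucket.
        // CoreDNS's etcd plugin serves this record to S3 clients.
        _, err = cli.Put(context.Background(),
            "/skydns/com/example/mybucket",
            `{"host":"10.0.0.5","port":9000}`)
        if err != nil {
            panic(err)
        }
    }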

Storage mechanism

Minio uses erasure coding and checksums to protect data against hardware failure and silent data corruption. Even if half (N/2) of the hard disks are lost, data can still be recovered.

Erasure codes are a mathematical technique for recovering lost or corrupted data. Erasure code technology in distributed storage systems currently falls into three categories: array codes (RAID 5, RAID 6, etc.), Reed-Solomon (RS) codes, and Low-Density Parity-Check (LDPC) codes. Erasure coding adds M parity pieces to N pieces of original data, and the original data can be restored from any N of the N+M pieces; in other words, as long as no more than M pieces are lost, the remaining pieces suffice to restore the data.

Minio uses Reed-Solomon codes to split an object into N/2 data blocks and N/2 parity blocks. With 12 disks, for example, an object is split into 6 data blocks and 6 parity blocks; any 6 disks can be lost (regardless of whether they hold data blocks or parity blocks) and the data can still be recovered from the remaining disks.
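
Here is a minimal sketch of that 12-disk example using the klauspost/reedsolomon library (Minio's erasure coding builds on the same Reed-Solomon approach): split into 6 data and 6 parity shards, discard any 6, and reconstruct.

    package main

    import (
        "bytes"
        "fmt"

        "github.com/klauspost/reedsolomon"
    )

    func main() {
        // 6 data shards + 6 parity shards, as in the 12-disk example above.
        enc, err := reedsolomon.New(6, 6)
        if err != nil {
            panic(err)
        }

        data := bytes.Repeat([]byte("minio object data "), 100)
        shards, err := enc.Split(data) // 12 shards: 6 data + 6 (empty) parity
        if err != nil {
            panic(err)
        }
        if err := enc.Encode(shards); err != nil { // fill in the parity shards
            panic(err)
        }

        // Lose any 6 of the 12 shards (a mix of data and parity)...
        for i := 0; i < 12; i += 2 {
            shards[i] = nil
        }
        // ...and the remaining 6 are still enough to rebuild everything.
        if err := enc.Reconstruct(shards); err != nil {
            panic(err)
        }
        ok, _ := enc.Verify(shards)
        fmt.Println("recovered:", ok)
    }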

In a distributed Minio deployment with N nodes, your data remains safe as long as N/2 nodes are online, but at least N/2+1 nodes are required for writes.

After a file is uploaded to Minio, the layout on disk looks like this:
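
An illustrative layout (the bucket and object names are hypothetical):

    mybucket/
    └── myobject/
        ├── part.1    first data shard of the object
        └── xl.json   metadata for the object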

Here xl.json is the object's metadata file, and part.1 is the object's first data shard. (In a distributed deployment, each node stores two files: a data block and a parity block.) When reading data, Minio computes a HighwayHash over each encoded block and verifies it to ensure the block is correct. Between erasure coding and HighwayHash-based bit rot protection, Minio provides high data reliability.
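
For the bit rot check, here is a minimal sketch using the minio/highwayhash package; the key and data are placeholders, and in practice the expected checksum is stored alongside each block.

    package main

    import (
        "bytes"
        "fmt"

        "github.com/minio/highwayhash"
    )

    func main() {
        key := make([]byte, 32) // HighwayHash uses a 256-bit key
        data := []byte("contents of part.1")

        // Hash the block when it is written...
        h, err := highwayhash.New(key)
        if err != nil {
            panic(err)
        }
        h.Write(data)
        stored := h.Sum(nil)

        // ...and re-hash on read to detect silent corruption.
        h2, _ := highwayhash.New(key)
        h2.Write(data)
        fmt.Println("block intact:", bytes.Equal(stored, h2.Sum(nil)))
    }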

Lambda computation and continuous backup

Minio supports lambda compute notifications: objects in a bucket can trigger event notifications. Supported event types currently include object upload, download, deletion, and copy; supported event targets include Redis, NATS, AMQP, Kafka, MySQL, Elasticsearch, and others.
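
As a sketch of how a consumer might subscribe from the client side, minio-go v7 exposes ListenBucketNotification; the bucket name and event filters below are placeholders.

    package watcher

    import (
        "context"
        "log"

        "github.com/minio/minio-go/v7"
    )

    // watchBucket logs created/removed objects as the events arrive.
    func watchBucket(ctx context.Context, client *minio.Client, bucket string) {
        events := client.ListenBucketNotification(ctx, bucket, "", "", []string{
            "s3:ObjectCreated:*",
            "s3:ObjectRemoved:*",
        })
        for info := range events {
            if info.Err != nil {
                log.Println("notification error:", info.Err)
                continue
            }
            for _, record := range info.Records {
                log.Println(record.EventName, record.S3.Object.Key)
            }
        }
    }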

Object notification enhances Minio's extensibility by letting users build functionality that Minio itself does not provide, such as metadata-based search or business-specific computation. The same mechanism also enables fast and efficient incremental backups.

Object Storage Gateway

In addition to serving as a storage system, Minio can also act as a gateway. The backend can be a distributed file system such as NAS or HDFS, or a third-party storage system such as S3 or OSS. With the Minio gateway, you can put an S3-compatible API in front of these backend systems for easier management and portability, since the S3 API is the de facto standard in the object storage world.

Users request storage resources through the unified S3 API, and the S3 API router routes each request to the corresponding ObjectLayer; each ObjectLayer implements the full set of object-operation APIs for its storage system. For example, once the Google Cloud Storage (GCS) backend implements the ObjectLayer interface, its operations on backend storage are carried out by the GCS SDK. When a client fetches the bucket list through the S3 API, the implementation ultimately calls the GCS service via the GCS SDK to obtain the bucket list and returns it to the client in the standard S3 structure.
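
The interface below is a simplified, hypothetical rendering of that design, not Minio's actual definition: each backend implements one object-operations contract, and the S3 API router dispatches requests to it.

    package gateway

    import (
        "context"
        "io"
        "time"
    )

    type BucketInfo struct {
        Name    string
        Created time.Time
    }

    // ObjectLayer is the per-backend contract (heavily reduced). A GCS gateway
    // would implement it with the GCS SDK, a NAS gateway with file I/O, etc.
    type ObjectLayer interface {
        ListBuckets(ctx context.Context) ([]BucketInfo, error)
        GetObject(ctx context.Context, bucket, object string) (io.ReadCloser, error)
        PutObject(ctx context.Context, bucket, object string, data io.Reader, size int64) error
    }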