Preface

This article aims to build an Iceberg data lake environment as a foundation for learning Iceberg's features. Accessing Iceberg through Trino is relatively easy to understand, and there are already projects on GitHub we can learn from, so we only need to adapt them. The whole project runs on Alibaba Cloud; if necessary, I will also build some demos on Huawei Cloud or Tencent Cloud.

Below is the basic architecture of the Iceberg data lake.

Trino is a big data query engine that uses long-running processes and purely in-memory computation to achieve low-latency, high-concurrency queries. It is generally used in near-real-time scenarios ranging from seconds to a few minutes. There are plenty of introductory articles about it online.

Below is the design diagram I planned for the whole demo project:

We use a low-spec ECS instance as the unified gateway to the public network.

Apply for several Alibaba Cloud pay-as-you-go burstable instances (cheap and suitable for a demo). You can start with two 4-core 8G instances and then create the K8S cluster on them.

For simplicity, object storage is installed on the Internet-facing cloud host using MinIO's standalone binary (this makes it easy to inspect the stored data directly through the web console).

In the K8S cluster, we first install MariaDB, which mainly stores the object storage paths of the current snapshots in the Iceberg data lake.

We then install Hive Metastore to manage metadata.

Finally, we install Trino and the stack is complete.

Code repository address

gitee.com/nativesailo…

Note: The YAML configurations in this article may not be the final versions; refer directly to the configuration in the code repository.

This article is divided into two parts: the first covers environment setup, and the second is a simple hands-on experience of storing data in the Iceberg data lake through Trino.

Environment setup

Cloud host "little A" enters the scene

Applying for a Cloud Host

Reference: juejin.cn/post/684516…

Prepare an ECS server; we will call it "little A".

Remotely connect to the cloud host

Use the PEM key to log in
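
A minimal sketch of the SSH login, assuming the key pair file downloaded from Alibaba Cloud is named Happy_123.pem (the name used later in this article) and 39.103.234.55 is little A's public IP:

chmod 400 Happy_123.pem    # the key file must not be world-readable
ssh -i Happy_123.pem root@39.103.234.55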

Install git

sudo yum install git

Download the Github code

Search GitHub for "trino-on-k8s" to find the project: github.com/joshuarobin…

Clone the code to the cloud host

git clone github.com/joshuarobin…

View the README

Main application process

  1. Build Docker image for Hive Metastore

  2. Deploy Hive Metastore: MariaDB (pvc and deployment), init-schemas, Metastore

  3. Deploy Trino services (coordinator, workers, and cli)

Install the MinIO server on little A

Download MinIO

wget dl.min.io/server/mini…

Add executable permissions

chmod +x minio

Set the access key for logging in to MinIO

export MINIO_ACCESS_KEY=minioadmin

Set the secret key for logging in to MinIO

export MINIO_SECRET_KEY=minioadmin

Start MinIO in the background

The console port is fixed to 9090 via --console-address:

nohup ./minio server --console-address :9090 /data/minio/ > /data/minio/minio.log 2>&1 &

Start MinIO in the foreground

Again the console port is fixed to 9090:

[root@iZ8vbgn00yu75o270lb2azZ code]# ./minio server --console-address :9090 /data/minio/
WARNING: MINIO_ACCESS_KEY and MINIO_SECRET_KEY are deprecated.
         Please use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD
API: http://172.28.166.95:9000  http://127.0.0.1:9000
RootUser: minioadmin
RootPass: minioadmin

Console: http://172.28.166.95:9090 http://127.0.0.1:9090
RootUser: minioadmin
RootPass: minioadmin

Command-line: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc alias set myminio http://172.28.166.95:9000 minioadmin minioadmin

Documentation: https://docs.min.io

Access

Use little A's public IP address 39.103.234.55 and port 9090 to access the web console.

Install the MinIO client on little A

Download the MinIO client

wget http://dl.minio.org.cn/client/mc/release/linux-amd64/mc

Add permissions

chmod +x mc
./mc --help

Add the host

./mc config host add minio http://172.28.166.95:9000 minioadmin minioadmin --api s3v4

Access the storage

./mc ls minio/datalake
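
If the datalake bucket does not exist yet, it can be created with the same client (a sketch; the bucket name matches the s3a://datalake/warehouse path used later, and creating it through the MinIO web console works just as well):

./mc mb minio/datalake   # create the bucket
./mc ls minio            # list all buckets to confirm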

Build a K8S cluster

Apply for two burstable cloud hosts

Reference for creating burstable instances: help.aliyun.com/document_de…

Pay-as-you-go billing; 2-core 4G burstable performance instances;

Do not assign a public IP address; use the same security group as little A;

Select the same key pair as little A;

Click Create

Create a K8S cluster

ACK managed edition; standard version; pay-as-you-go;

Do not configure SNAT; choose the basic type for API Server access; do not expose the API Server with an EIP;

Select the two ECS instances just created as worker nodes; select CentOS 7.9 as the operating system.

For component configuration keep only Ingress, use a private network for its load balancer, and choose the smallest load balancer specification. (This blog does not use the cluster's Ingress feature, so you may also skip the Ingress component.)

Click Create Cluster, then OK.

Install kubectl client

yum install -y kubectl

Little A configures cluster access credentials

vi $HOME/.kube/config

Paste in the kubeconfig shown on the cluster's connection information page in the ACK console.

[root@iZ8vbgn00yu75o270lb2azZ nfs]# kubectl get nodes
NAME                            STATUS   ROLES    AGE   VERSION
cn-zhangjiakou.172.28.166.120   Ready    <none>   27m   v1.18.8-aliyun.1
cn-zhangjiakou.172.28.166.121   Ready    <none>   27m   v1.18.8-aliyun.1

NFS

Go to the trino-on-k8s directory and check maria_pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maria-pv-claim
spec:
  storageClassName: pure-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

MariaDB relies on a StorageClass for its data volume, so let's define a StorageClass in the cluster.

Reference:

blog.csdn.net/huwh_/artic…

www.cnblogs.com/allmdzz/p/1…

Create an NFS share on little A

  1. Install the NFS server

dnf install nfs-utils

  2. Start and enable the NFS server

systemctl start nfs-server && systemctl enable nfs-server

  3. Create the shared directory

mkdir /data/nfs

  4. Configure the shared directory

vi /etc/exports

/data/nfs *(insecure,rw,async,no_root_squash)

  5. Export the shared directory

exportfs -rv

  6. Check the export list

showmount -e
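
Optionally, verify from one of the worker nodes that the share is reachable. A quick sketch, assuming nfs-utils is installed on the worker and 172.28.166.95 is little A's private IP used throughout this article:

yum install -y nfs-utils
mount -t nfs 172.28.166.95:/data/nfs /mnt
ls /mnt        # should show the contents of the shared directory
umount /mnt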

Creating a namespace

kubectl create ns datalake

Create an SA and bind permissions to it

rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-client-provisioner
  namespace: datalake
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-client-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    namespace: datalake
roleRef:
  kind: ClusterRole
  name: nfs-client-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
  namespace: datalake
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    namespace: datalake
roleRef:
  kind: Role
  name: leader-locking-nfs-client-provisioner
  apiGroup: rbac.authorization.k8s.io

Apply rbac.yaml

kubectl create ns datalake
kubectl apply -f rbac.yaml -n datalake

Create storageclass

nfs-StorageClass.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-block
provisioner: qgg-nfs-storage
parameters:
  archiveOnDelete: "false"

The provisioner name must be the same as the PROVISIONER_NAME environment variable in the provisioner deployment file;

The metadata.name in nfs-StorageClass.yaml must be the same as the storageClassName specified in maria_pvc.yaml.

kubectl apply -f nfs-StorageClass.yaml -n datalake
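
Confirm the StorageClass exists (StorageClasses are cluster-scoped, so no namespace is needed):

kubectl get storageclass pure-block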

Create an Alibaba Cloud private image repository

Create a repository under the sailor-datalake namespace.

Prepare the image

Because the K8S cluster currently has no Internet access, we need to pull the image on little A, re-tag it, and push it to the private (internal network) repository.

  • Little A pulls the latest nfs-client-provisioner image

docker pull quay.io/external_storage/nfs-client-provisioner:latest

  • Re-tag the image

docker tag quay.io/external_storage/nfs-client-provisioner:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest

  • Log in to the image repository

docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com

Note: fill in the account and address according to your Alibaba Cloud Container Registry information.

  • Push the image to the repository

docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest

Configure docker login on the worker nodes of the K8S cluster

Configure access to cluster worker nodes

  1. Configure hosts

Open the configuration file:

vi /etc/hosts

Add the internal IP addresses of the ECS worker nodes and their corresponding names

172.28.166.xxx worker1
172.28.166.yyy worker2

  2. Delete the old host key (if a worker is recreated, delete the previously recorded entry)

/root/.ssh/known_hosts

  3. SSH access

ssh-agent bash
ssh-add -k Happy_123.pem
ssh worker1
# or: ssh -i Happy_123.pem worker1

Log in to the worker node and run docker login

I found that the nfs-client-provisioner:latest image I pushed to my own repository (registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest) could not be pulled by the nfs-client-provisioner deployment because of a permission failure, which caused the pod to fail. No other image had this problem.

The workaround is to log in to each worker node, run docker login manually, pull the deployment image on the node, and set the imagePullPolicy of the nfs-client-provisioner deployment to IfNotPresent (see the fragment sketch below). I'll explain why in another blog post.

ssh worker1
sudo docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
# password: Happy@123
sudo docker pull registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest

Note: use the internal network address (registry-vpc).
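
For reference, the IfNotPresent setting mentioned above would sit in the container spec of nfs-provisioner.yaml roughly like this (a sketch; the full YAML below omits it and relies on the image pulled manually on the node):

      containers:
        - name: nfs-client-provisioner
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
          imagePullPolicy: IfNotPresent   # use the locally pulled image instead of always pulling from the registry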

Add the workers to little A's security group

Note: we could also have applied for the ECS instances first (configuring little A's security group directly) and then selected them when creating the K8S cluster. When selecting ECS instances, be careful not to select little A, otherwise little A's system will be reinstalled and all your data and configuration will be lost!

Create the NFS provisioner

nfs-provisioner.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client-provisioner
  labels:
    app: nfs-client-provisioner
  namespace: datalake
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nfs-client-provisioner
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner
      containers:
        - name: nfs-client-provisioner
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
          volumeMounts:
            - name: nfs-client-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME
              value: qgg-nfs-storage
            - name: NFS_SERVER
              value: 172.28.166.95
            - name: NFS_PATH
              value: /data/nfs
      volumes:
        - name: nfs-client-root
          nfs:
            server: 172.28.166.95
            path: /data/nfs

The NFS volume path is the /data/nfs directory created on little A earlier.

Our NFS server is installed on little A, so NFS_SERVER and the volume's server address must be set to little A's private network address.

PROVISIONER_NAME is the provisioner name; make sure it matches the provisioner field in nfs-StorageClass.yaml.

kubectl apply -f nfs-provisioner.yaml -n datalake
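
Check that the provisioner pod comes up (the label matches the deployment above):

kubectl get pod -n datalake -l app=nfs-client-provisioner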

Install MariaDB

Apply the PVC

Change the storage capacity in the PVC file to 5Gi:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maria-pv-claim
spec:
  storageClassName: pure-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
      
Apply:

kubectl apply -f maria_pvc.yaml -n datalake
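
The claim should be bound by the NFS provisioner shortly after; check with:

kubectl get pvc maria-pv-claim -n datalake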

Install Docker on little A

yum -y install docker

systemctl start docker

systemctl status docker

Prepare the image

Because the K8S cluster currently has no Internet access, we again pull the image on little A, re-tag it, and push it to the private (internal network) repository.

  • Little A pulls the latest MariaDB server image

    docker pull mariadb/server:latest

  • Re-tag the image

    docker tag mariadb/server:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest

  • Log in to the image repository

    docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com

    Note: fill in the account and address according to your Alibaba Cloud Container Registry information.

  • Push the image to the repository

    docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest

Install the MariaDB service

cd /opt/code/trino-on-k8s
vi maria_deployment.yaml

Modify the yaml

---
apiVersion: v1
kind: Service
metadata:
  name: metastore-db
spec:
  ports:
  - port: 13306
    targetPort: 3306
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mariadb
        image: "registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest"
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: Happy_123
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mariadb-for-hive
          mountPath: /var/lib/mysql
        resources:
          requests:
            memory: "1G"
            cpu: 0.5
      volumes:
      - name: mariadb-for-hive
        persistentVolumeClaim:
          claimName: maria-pv-claim

Modify the image address, change the MariaDB password, and reduce the requested resources to small values.

kubectl apply -f maria_deployment.yaml -n datalake

Check the metastore_db database

  1. Install the mysql client on little A

    yum install mysql -y

  2. Expose a NodePort from the cluster

    vi maria_deployment.yaml

Modify the Service in the YAML: set the type to NodePort and add nodePort 32005

---
apiVersion: v1
kind: Service
metadata:
  name: metastore-db
spec:
  type: NodePort
  ports:
  - port: 13306
    targetPort: 3306
    nodePort: 32005
  selector:
    app: mysql

  3. View the database

[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# mysql -h worker1 -P 32005 -u root -p
Enter password: Happy_123
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 26
Server version: 10.5.10-MariaDB-1:10.5.10+maria~bionic mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore_db       |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> use metastore_db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [metastore_db]> show tables;

You can connect either by node name or by node IP:

mysql -h worker1 -P 32005 -u root -p
mysql -h 172.28.166.120 -P 32005 -u root -p

172.28.166.120 is the IP address of any cluster node, and 32005 is the NodePort of the MariaDB service.

Install the metastore

Execute the script and build the image

Edit the build script:

cd /opt/code/trino-on-k8s/hive_metastore
vi build_image.sh

Set the image repository address and TAG, and specify the datalake namespace when creating the ConfigMap.

set -e

REPONAME=registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake
TAG=metastore:v1.0.1

docker build -t $TAG .

# Tag and push to the docker repository.
docker tag $TAG $REPONAME/$TAG
docker push $REPONAME/$TAG

# Update configmaps
kubectl create configmap metastore-cfg --dry-run --from-file=metastore-site.xml --from-file=core-site.xml -n datalake -o yaml | kubectl apply -f - -n datalake

Initialize the schema

vi hive-initschema.yaml

Change the password (it must match the MYSQL_ROOT_PASSWORD set for MariaDB, Happy_123 in this article) and the image name.

apiVersion: batch/v1
kind: Job
metadata:
  name: hive-initschema
spec:
  template:
    spec:
      containers:
      - name: hivemeta
        image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
        command: ["/opt/hive-metastore/bin/schematool"]
        args: ["--verbose" ,"-initSchema" , "-dbType", "mysql" , "-userName", "root",
          "-passWord", "mypass" , "-url", "jdbc:mysql://metastore-db:13306/metastore_db?createDatabaseIfNotExist=true"]
      restartPolicy: Never
  backoffLimit: 4

Apply it:

[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl apply -f hive-initschema.yaml -n datalake
job.batch/hive-initschema created
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME                                      READY   STATUS      RESTARTS   AGE
hive-initschema-9zd75                     0/1     Completed   0          14s
mysql-8466d8694c-mmlqk                    1/1     Running     0          75m
nfs-client-provisioner-75bc5bc89f-xmbw8   1/1     Running     0          37m

Modify core-site.xml

The original core-site.xml is not complete enough to access S3 storage.

Refer to trino.io/docs/curren…

<configuration>
<property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
</property>
<property>
        <name>fs.s3a.endpoint</name>
        <value>http://172.28.166.95:9000</value>
</property>
<property>
        <name>hive.s3a.aws-access-key</name>
        <value>minioadmin</value>
</property>
<property>
        <name>hive.s3a.aws-secret-key</name>
        <value>minioadmin</value>
</property>
<property>
        <name>fs.s3a.access.key</name>
        <value>minioadmin</value>
</property>
<property>
        <name>fs.s3a.secret.key</name>
        <value>minioadmin</value>
</property>
<property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
</property>
<property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
        <name>fs.s3a.fast.upload</name>
        <value>true</value>
</property>
</configuration>

Change fs.s3a.endpoint to little A's private network address plus MinIO's port.

Modify metastore-site.xml

<configuration>
<property>
        <name>metastore.task.threads.always</name>
        <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
</property>
<property>
        <name>metastore.expression.proxy</name>
        <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306/metastore_db</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Happy_123</value>
</property>
<property>
        <name>metastore.warehouse.dir</name>
        <value>s3a://datalake/warehouse/</value>
</property>
<property>
        <name>metastore.thrift.port</name>
        <value>9083</value>
</property>
</configuration>

Change the database access password, and change the namespace in the connection URL to datalake.

Run the script to update the ConfigMap
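
If you only want to refresh the ConfigMap without rebuilding the image, the last step of build_image.sh can also be run on its own (a sketch repeating the command shown earlier):

cd /opt/code/trino-on-k8s/hive_metastore
kubectl create configmap metastore-cfg --dry-run --from-file=metastore-site.xml --from-file=core-site.xml -n datalake -o yaml | kubectl apply -f - -n datalake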

Install the MetaStore service

  1. Modify the yaml
---
apiVersion: v1
kind: Service
metadata:
  name: metastore
spec:
  ports:
  - port: 9083
  selector:
    app: metastore
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metastore
spec:
  selector:
    matchLabels:
      app: metastore
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: metastore
    spec:
      containers:
      - name: metastore
        image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: secret-key
        ports:
        - containerPort: 9083
        volumeMounts:
        - name: metastore-cfg-vol
          mountPath: /opt/hive-metastore/conf/metastore-site.xml
          subPath: metastore-site.xml
        - name: metastore-cfg-vol
          mountPath: /opt/hadoop/etc/hadoop/core-site.xml
          subPath: core-site.xml
        command: ["/opt/hive-metastore/bin/start-metastore"]
        args: ["-p", "9083"]
        resources:
          requests:
            memory: "1G"
            cpu: 0.5
        imagePullPolicy: Always
      volumes:
      - name: metastore-cfg-vol
        configMap:
          name: metastore-cfg
  2. Create a new Secret configuration file, my-s3-keys.yaml

apiVersion: v1
kind: Secret
metadata:
   name: my-s3-keys
type:
  Opaque
data:
   access-key: bWluaW9hZG1pbg==
   secret-key: bWluaW9hZG1pbg==
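
The access-key and secret-key values are base64-encoded. A quick way to produce them from the MinIO credentials used earlier:

echo -n minioadmin | base64   # prints bWluaW9hZG1pbg==, used here for both access-key and secret-key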

Apply:

kubectl apply -f my-s3-keys.yaml -n datalake
kubectl apply -f metastore.yaml -n datalake

Install Trino

Modify trino configuration

vi trino-cfgs.yaml

Edit yaml

---
kind: ConfigMap
apiVersion: v1
metadata:
  name: trino-configs
data:
  jvm.config: |-
    -server
    -Xmx2G
    -XX:-UseBiasedLocking
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ExitOnOutOfMemoryError
    -XX:+UseGCOverheadLimit
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:ReservedCodeCacheSize=512M
    -Djdk.attach.allowAttachSelf=true
    -Djdk.nio.maxCachedBufferSize=2000000
  config.properties.coordinator: |-
    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery-server.enabled=true
    discovery.uri=http://trino:8080
  config.properties.worker: |-
    coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery.uri=http://trino:8080
  node.properties: |-
    node.environment=test
    spiller-spill-path=/tmp
    max-spill-per-node=4TB
    query-max-spill-per-node=1TB
  hive.properties: |-
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore:9083
    hive.allow-drop-table=true
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=172.28.166.95
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  iceberg.properties: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://metastore:9083
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=172.28.166.95
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  mysql.properties: |-
    connector.name=mysql
    connection-url=jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306
    connection-user=root
    connection-password=Happy_123

Apply:

kubectl apply -f trino-cfgs.yaml -n datalake

Edit trino.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: trino
spec:
  ports:
  - port: 8080
  selector:
    app: trino-coordinator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trino-coordinator
spec:
  selector:
    matchLabels:
      app: trino-coordinator
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: trino-coordinator
    spec:
      containers:
      - name: trino
        image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
        ports:
        - containerPort: 8080
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: secret-key
        volumeMounts:
        - name: trino-cfg-vol
          mountPath: /etc/trino/jvm.config
          subPath: jvm.config
        - name: trino-cfg-vol
          mountPath: /etc/trino/config.properties
          subPath: config.properties.coordinator
        - name: trino-cfg-vol
          mountPath: /etc/trino/node.properties
          subPath: node.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/hive.properties
          subPath: hive.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/iceberg.properties
          subPath: iceberg.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/mysql.properties
          subPath: mysql.properties
        resources:
          requests:
            memory: "1G"
            cpu: 0.5
        imagePullPolicy: Always
      volumes:
      - name: trino-cfg-vol
        configMap:
          name: trino-configs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trino-worker
spec:
  serviceName: trino-worker
  replicas: 1
  selector:
    matchLabels:
      app: trino-worker
  template:
    metadata:
      labels:
        app: trino-worker
    spec:
      securityContext:
        fsGroup: 1000
      containers:
      - name: trino
        image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
        ports:
        - containerPort: 8080
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: my-s3-keys
              key: secret-key
        volumeMounts:
        - name: trino-cfg-vol
          mountPath: /etc/trino/jvm.config
          subPath: jvm.config
        - name: trino-cfg-vol
          mountPath: /etc/trino/config.properties
          subPath: config.properties.worker
        - name: trino-cfg-vol
          mountPath: /etc/trino/node.properties
          subPath: node.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/hive.properties
          subPath: hive.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/iceberg.properties
          subPath: iceberg.properties
        - name: trino-cfg-vol
          mountPath: /etc/trino/catalog/mysql.properties
          subPath: mysql.properties
        - name: trino-tmp-data
          mountPath: /tmp
        resources:
          requests:
            memory: "1G"
            cpu: 0.5
        imagePullPolicy: Always
      volumes:
      - name: trino-cfg-vol
        configMap:
          name: trino-configs
  volumeClaimTemplates:
  - metadata:
      name: trino-tmp-data
    spec:
      storageClassName: pure-block
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 40Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trino-cli
spec:
  containers:
  - name: trino-cli
    image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
    command: ["tail", "-f", "/dev/null"]
    imagePullPolicy: Always
  restartPolicy: Always

Modify the image name; change the trino-tmp-data volume size to 40Gi; set the number of workers to 1; and reduce the resources requested by the workers and the coordinator.

Apply trino.yaml

kubectl apply -f trino.yaml -n datalake

[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME                                      READY   STATUS      RESTARTS   AGE
hive-initschema-9zd75                     0/1     Completed   0          71m
metastore-7df56797cc-j7dcd                1/1     Running     0          11m
mysql-8466d8694c-mmlqk                    1/1     Running     0          146m
nfs-client-provisioner-75bc5bc89f-xmbw8   1/1     Running     0          108m
trino-cli                                 1/1     Running     0          64s
trino-coordinator-54f78c7984-zqf8t        1/1     Running     0          64s
trino-worker-0                            0/1     Pending     0          64s

If cluster resources are insufficient (the trino-worker pod stays Pending), apply for another ECS worker node and add it to the cluster.
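
To see why the worker pod stays Pending (usually insufficient allocatable CPU or memory on the nodes):

kubectl describe pod trino-worker-0 -n datalake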

Binary client

wget https://repo1.maven.org/maven2/io/trino/trino-cli/361/trino-cli-361-executable.jar
mv trino-cli-361-executable.jar trino
chmod +x trino
./trino --server localhost:8080 --catalog iceberg

Test

kubectl exec -it trino-cli -n datalake  -- trino --server trino:8080 --catalog iceberg --schema default
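
Once inside the CLI, a couple of quick sanity checks (standard Trino statements) confirm that the catalogs and the metastore connection are working:

SHOW CATALOGS;
SHOW SCHEMAS FROM iceberg;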

Access the Trino UI

Since Trino is installed inside the cluster and the K8S cluster has no external network access, the UI can only be reached through little A. The solution is to install Nginx on little A and point the Nginx backend at a NodePort exposed by Trino.
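
The trino Service defined above is ClusterIP-only, so a NodePort has to be exposed first. One way to do it is to change the Service definition in trino.yaml along these lines (a sketch, not from the code repository; 32001 matches the port used below):

apiVersion: v1
kind: Service
metadata:
  name: trino
spec:
  type: NodePort
  ports:
  - port: 8080
    nodePort: 32001
  selector:
    app: trino-coordinator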

Install Nginx on little A

Reference: blog.csdn.net/lxh_worldpe…

Add port 32001 to the security group (32001 is the NodePort exposed by Trino).

Configure nginx

proxy_pass points to a K8S cluster node IP address and the NodePort exposed by Trino.

cd /usr/local/nginx/conf
vim nginx.conf

server {
    listen 32001;
    server_name localhost;
    location / {
        proxy_read_timeout 300;
        proxy_connect_timeout 300;
        proxy_redirect off;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Frame-Options SAMEORIGIN;
        proxy_pass http://172.28.166.135:32001;
    }
    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}

Refer to www.cnblogs.com/fps2tao/p/9…

Nginx needs to be reloaded for the configuration change to take effect:

/usr/local/nginx/sbin/nginx -t          # check that the modified configuration is valid

/usr/local/nginx/sbin/nginx -s reload   # reload the configuration

Web access

http://39.103.234.55:32001/ui/

Experience Iceberg with Trino

Create a table

CREATE TABLE iceberg.default.events (level VARCHAR, event_time TIMESTAMP(6), message VARCHAR, call_stack ARRAY(VARCHAR)) WITH (partitioning= ARRAY['day(event_time)']);

Insert three rows of data

INSERT INTO iceberg.default.events
VALUES
(
  'ERROR',
  timestamp '2021-04-01 12:00:00.000001',
  'Oh noes',
  ARRAY['Exception in thread "main" java.lang.NullPointerException']
),
(
  'ERROR',
  timestamp '2021-04-02 15:55:55.555555',
  'Double oh noes',
  ARRAY['Exception in thread "main" java.lang.NullPointerException']
),
(
  'WARN',
  timestamp '2021-04-02 00:00:11.1122222',
  'Maybeh oh noes?',
  ARRAY['Bad things could be happening??']
);

View the data

SELECT * FROM iceberg.default.events WHERE event_time < timestamp '2021-04-02';
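
Since Iceberg snapshots are the focus here, it is also interesting to inspect the table's snapshot history through the connector's metadata tables (per the Trino Iceberg connector documentation):

SELECT snapshot_id, committed_at, operation FROM iceberg.default."events$snapshots";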

View the metadata stored in MariaDB

MariaDB [(none)]> select * from metastore_db.TABLE_PARAMS;
+--------+----------------------------+----------------------------------------------------------------------------------------------------+
| TBL_ID | PARAM_KEY                  | PARAM_VALUE                                                                                          |
+--------+----------------------------+----------------------------------------------------------------------------------------------------+
|      1 | EXTERNAL                   | TRUE                                                                                                 |
|      1 | metadata_location          | s3a://datalake/warehouse/events/metadata/00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json   |
|      1 | numFiles                   | 6                                                                                                    |
|      1 | previous_metadata_location | s3a://datalake/warehouse/events/metadata/00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json   |
|      1 | table_type                 | iceberg                                                                                              |
|      1 | totalSize                  | 13167                                                                                                |
|      1 | transient_lastDdlTime      | 1633849635                                                                                           |
+--------+----------------------------+----------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)

View the data stored in Minio

[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/metadata
[2021-10-10 14:32:25 CST] 1.9KiB 00000-185b2ca2-92c7-435c-5218-d74d2605b884.metadata.json
[2021-10-10 15:02:04 CST] 1.9KiB 00000-21e6307c-8b63-41f6-987c-fbc86de000f9.metadata.json
[2021-10-10 15:07:14 CST] 1.9KiB 00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json
[2021-10-10 13:45:22 CST] 1.9KiB 00000-b88a389c-e3ca-491b-bd9e-e901a01f82fe.metadata.json
[2021-10-10 13:58:33 CST] 1.9KiB 00000-fe989767-841a-4412-a2e3-bc094e148ab1.metadata.json
[2021-10-10 15:09:51 CST] 2.8KiB 00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json
[2021-10-10 15:09:51 CST] 5.9KiB a10aa346-3b55-437b-93a3-20219f7d8ad1-m0.avro
[2021-10-10 15:07:14 CST] 3.3KiB snap-1941033720141809751-1-2b0c70bb-ab8d-4db8-b377-9722e4926e08.avro
[2021-10-10 15:09:51 CST] 3.3KiB snap-4869152224340281922-1-a10aa346-3b55-437b-93a3-20219f7d8ad1.avro
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data
[2021-10-10 23:49:32 CST]     0B event_time_day=2021-04-01/
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data/event_time_day=2021-04-01/

Stop using the cluster

Stop the ECS instances

1. In shutdown-saving mode, computing resources (vCPUs and memory), fixed public IP addresses, and bandwidth are not charged.

2. Charged resources include system disks, data disks, elastic public IP addresses and bandwidth (in fixed-bandwidth mode), and paid images.

Disable load balancing

After stopping, we found that load balancing was still being charged, and the ECS instances were still charged but only at about 25% of the original cost. That already saves a lot.

Resume the cluster

Start the ECS instances

Enable load balancing

After both are enabled, you can see that the instances in the cluster recover.

Related articles

Hands-on with Iceberg – trino_on_k8s

Hands-on with Iceberg – spend_wisely

References

  1. Iceberg Connector

trino.io/docs/curren…

  2. Starburst

Starburst Enterprise for Presto is a commercial version of the Presto open source distributed SQL query engine for finding and analyzing data residing in a variety of distributed data sources.

The StarBurst blog has many excellent trino articles to recommend.

Among them, the Trino on Ice blog series introduces many interesting practices for understanding Iceberg through Trino.

blog.starburst.io/trino-on-ic…

  3. Cluster setup

juejin.cn/post/684516…
juejin.cn/post/684790…