Author: XYZ, member of Crawlab Development Group

Today we will talk about how to deploy the Crawlab platform in a Kubernetes (K8S) cluster. This article assumes familiarity with Docker and some basic concepts of K8S.

First of all, let’s introduce some important concepts of K8S.

Pod

A Pod is the most basic unit in a K8S cluster. A Pod contains a set of one or more containers. For example, in the Sidecar pattern, a service container writes logs while a log-collection agent container reads and forwards them, with both mounted to the same Volume. In our practice, there is only one business container per Pod; for logs, we connect the SDK directly to the Aliyun log service. So for our purposes it is easy to think of a Pod as a single container.
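
As a minimal sketch (the names and image here are illustrative, not part of the Crawlab deployment below), a single-container Pod can be described like this:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    # In our practice there is only one business container per Pod
    - name: demo
      image: nginx:1.17
      ports:
        - containerPort: 80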

Deployment

The Deployment object is used to control Pods: the number of Pod replicas (service instances), the rolling update configuration of the Pods, the node affinity of the Pods, and so on. In general, you manage and configure the Deployment rather than the Pods directly. A Deployment is described by writing a YAML file.
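
For reference, once such a YAML file is written (say, deployment.yml; the file name is just an example), it is applied and inspected with standard kubectl commands:

# Create or update the Deployment described in the file
kubectl apply -f deployment.yml

# Check the Deployment and the Pod replicas it controls
kubectl get deployment
kubectl get pods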

Service

In a K8S cluster, there are multiple worker nodes. Each time a Pod is redeployed, it is placed on a node according to the Deployment configuration, so the IP of the Pod is different every time. For this reason K8S provides the Service object. A Service can be understood as a load balancer in front of Pods: when a Deployment controls one or more Pod replicas, traffic can be proxied to these Pods through the IP of the Service.
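
To see this proxying in action, you can compare the stable Service IP with the endpoints behind it. The Service name here anticipates the crawlab-master Service created later in this article:

# The Service keeps a stable ClusterIP...
kubectl get svc crawlab-master -n dev

# ...while its endpoints track the current Pod IPs behind it
kubectl get endpoints crawlab-master -n dev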

Well, after introducing Pod, Deployment, and Service, let’s move on to Crawlab. As of 5 October 2019, the latest version of Crawlab is tikazyq/crawlab:0.3.2.

In version 0.3.2, Crawlab’s crawler synchronization mechanism between the worker nodes and the master node was changed to take MongoDB as the source of truth: after the user uploads a crawler through the front end, it is stored in MongoDB GridFS. Both worker and master nodes run a timer that pulls files from GridFS, using the MD5 value of the file object to decide whether a sync is needed.

The flow chart of the crawler upload process is as follows:

The flow chart of the node crawler synchronization process is as follows:

The master node and worker nodes are decoupled: files are synchronized between them through GridFS, and they communicate through the Redis pub/sub mechanism. As for MD5 values: an MD5 value is automatically generated after a crawler is saved to GridFS. When the crawler is synchronized to a node, this value is written to md5.txt in the root directory of the crawler and is used to determine whether the local crawler files are consistent with the files in GridFS.
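
As an illustration (the crawler name my_spider is hypothetical), you can verify this on any node by looking at the md5.txt inside a synchronized crawler directory, which lives under the spider path configured in config.yml below:

# /opt/crawlab/spiders is the spider.path value from config.yml
cat /opt/crawlab/spiders/my_spider/md5.txt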

OK, knowing Crawlab’s crawler synchronization mechanism, we can deploy our crawler platform on K8S.

Deploying the Master

There are three points to note:

1. Create a ConfigMap object. We write the required configuration file information into the ConfigMap and then mount it into the container to form the config.yml file.
2. Configure the CRAWLAB_SERVER_MASTER environment variable in the Deployment. Since the master node and the worker nodes share one ConfigMap object, this variable requires a separate configuration.
3. After the Service object is created, take the resulting ClusterIP and configure it into the CRAWLAB_API_ADDRESS environment variable in the Deployment, since this is the address the front end uses to access the back end. In a production environment, an Ingress object is typically configured instead.

ConfigMap configuration files are as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: crawlab-conf
  namespace: dev
data:
  config.yml: |-
    api:
      address: "localhost:8000"
    mongo:
      host: "192.168.235.26"
      port: 27017
      db: crawlab_xyz
      username: "root"
      password: "example"
      authSource: "admin"Redis: address: 192.168.235.0 Password: redis-1.0 database: 18 port: 16379log:
      level: info
      path: "/opt/crawlab/logs"
      isDeletePeriodically: "N"
      deleteFrequency: "@hourly"Host: 0.0.0.0 port: 8000 master:"Y"
      secret: "crawlab"
      register:
        # MAC address or IP address. If it is an IP address, you need to manually specify the IP address
        type: "mac"
        ip: ""
    spider:
      path: "/opt/crawlab/spiders"
    task:
      workers: 4
    other:
      tmppath: "/opt/crawlab/tmp"
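
Assuming the manifest above is saved as crawlab-configmap.yml (the file name is arbitrary), it is created with:

kubectl apply -f crawlab-configmap.yml

# Verify that the ConfigMap exists in the dev namespace
kubectl get configmap crawlab-conf -n dev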

The Deployment and Service configurations are as follows:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    app: crawlab-master
  name: crawlab-master
  namespace: dev
spec:
  replicas: 1
  # Label selector
  selector:
    matchLabels:
      app: crawlab-master
  # Rolling deployment strategy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    # Metadata for the Pod
    metadata:
      labels:
        app: crawlab-master
    spec:
      containers:
        - env:
            - name: CRAWLAB_API_ADDRESS
              value: "cluster_ip:8000"
            # The worker node and the master node share a ConfigMap,
            # so CRAWLAB_SERVER_MASTER requires additional configuration
            - name: CRAWLAB_SERVER_MASTER
              value: "Y"Image: 192.168.224.194:5001 / vanke center/crawlab: 0.3.2 imagePullPolicy: Always name: crawlab - master# Resource allocation
          resources:
            limits:
              cpu: '2'
              memory: 1024Mi
            requests:
              cpu: 30m
              memory: 256Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          # Config file mount configuration
          volumeMounts:
           - mountPath: /app/backend/conf/
             name: crawlab-conf
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      # Mount volumes from the ConfigMap
      - configMap:
          defaultMode: 420
          name: crawlab-conf
        name: crawlab-conf

---

apiVersion: v1
kind: Service
metadata:
  name: crawlab-master
  namespace: dev
spec:
  ports:
    - port: 8000
      protocol: TCP
      # The port that the crawlab back-end service listens on
      targetPort: 8000
      name: backend
    - port: 80
      protocol: TCP
      # crawlab front-end listening port
      targetPort: 8080
      name: frontend
  selector:
    # metadata.labels must match the labels of the Pods defined in the Deployment so the Service selects them for traffic proxying
    app: crawlab-master
  sessionAffinity: None
  type: ClusterIP
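
Assuming this manifest is saved as crawlab-master.yml (again, an arbitrary file name), both objects can be created in one step, since they are separated by ---:

kubectl apply -f crawlab-master.yml

# Wait until the master Pod is up
kubectl rollout status deployment/crawlab-master -n dev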

After you have deployed the Service, you can access the login screen through the ClusterIP of the Service.

From outside a K8S cluster, services are generally accessed through an Ingress or a NodePort. Because our local network is directly connected to the K8S cluster network, we can access the ClusterIP of the Service object directly, which would normally not be reachable from outside the cluster.
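
For reference, a minimal Ingress sketch might look like the following. It assumes an ingress controller (such as nginx-ingress) is installed in the cluster, and crawlab.example.com is a placeholder host name; neither is covered by this article:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: crawlab-ingress
  namespace: dev
spec:
  rules:
    - host: crawlab.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: crawlab-master
              # Route to the front-end port of the Service defined above
              servicePort: 80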

However, you will find that you cannot log in, because the third point mentioned above has not been handled yet. In Chrome, press F12 to view the following exception information.

If this looks familiar, it is because it is the CRAWLAB_API_ADDRESS environment variable we just configured in the Deployment. We need to replace its placeholder value with the ClusterIP of the Service.

Run the following command to view the ClusterIP of the Service:

kubectl get svc -n dev | grep crawlab

So we need to replace the value of the CRAWLAB_API_ADDRESS environment variable, changing the corresponding fragment of the Deployment as follows:

      containers:
        - env:
            - name: CRAWLAB_API_ADDRESS
              value: "172.21.9.55:8000"
            # The worker node and the master node share a ConfigMap,
            # so CRAWLAB_SERVER_MASTER requires additional configuration
            - name: CRAWLAB_SERVER_MASTER
              value: "Y"

Then re-apply the Deployment; the master Pod will be recreated with the updated configuration.
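
Concretely, assuming the same crawlab-master.yml file name as above:

kubectl apply -f crawlab-master.yml

# Changing an environment variable in the Pod template triggers a
# rolling update; watch it complete
kubectl rollout status deployment/crawlab-master -n dev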

After refreshing the page, you can log in to the admin console.

Deploying the Worker

The worker’s Deployment file is basically the same as the master’s. Just change the value of the CRAWLAB_SERVER_MASTER environment variable to N, delete the Service object definition, and of course rename the objects from crawlab-master to crawlab-worker. The result is as follows:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    app: crawlab-worker
  name: crawlab-worker
  namespace: dev
spec:
  replicas: 1
  # Label selector
  selector:
    matchLabels:
      app: crawlab-worker
  # Rolling deployment strategy
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    # Metadata for the Pod
    metadata:
      labels:
        app: crawlab-worker
    spec:
      containers:
        - env:
            # The worker node and the master node share a ConfigMap,
            # so CRAWLAB_SERVER_MASTER requires additional configuration
            - name: CRAWLAB_SERVER_MASTER
              value: "N"Image: 192.168.224.194:5001 / vanke center/crawlab: 0.3.2 imagePullPolicy: Always name: crawlab - worker# Resource allocation
          resources:
            limits:
              cpu: '2'
              memory: 1024Mi
            requests:
              cpu: 30m
              memory: 256Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          # Config file mount configuration
          volumeMounts:
           - mountPath: /app/backend/conf/
             name: crawlab-conf
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      # Mount volumes from the ConfigMap
      - configMap:
          defaultMode: 420
          name: crawlab-conf
        name: crawlab-conf

# Worker does not need to define a Service because it does not need to expose the access address
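
Assuming the worker manifest is saved as crawlab-worker.yml, it is deployed the same way. Since the workers pull their crawler files from GridFS, more workers can later be added simply by scaling the Deployment:

kubectl apply -f crawlab-worker.yml

# Scale out to three workers, for example
kubectl scale deployment crawlab-worker --replicas=3 -n dev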

After the worker deployment is complete, the worker can be seen in the node list in the admin console.

Note that the IP shown is not the Service IP but the Pod IP.

At this point, the deployment of the Crawlab crawler platform on K8S is complete.