1. Functional requirements of the architecture

In a recent project, I needed to set up a monitoring architecture for all switches in a large organization, as well as for its N7000 core switches. The main requirements were the following:

  • The N7000 is configured with detailed upstream/downstream traffic monitoring and error counting on all ports.
  • The common switches are configured with upstream/downstream traffic monitoring and error counting on TEL1/0/1 and TEL1/0/2.
  • All switches are configured with uptime and CPU usage monitoring.

The common switches in this project are all Cisco C2960 series.

2. Design approach

Since this is a production deployment and the required data update interval is 5 seconds, the data pressure on the whole framework is not very high. In line with the principles of easy migration and easy maintenance, the whole framework is deployed in a Docker container environment. Since the N7000 requires port-by-port monitoring, two data collection containers are configured, and their output is aggregated into the same database for visualization.

Therefore, I decided to use the open-source InfluxData tools: InfluxDB, Telegraf, Chronograf and Kapacitor, corresponding respectively to the database, data collection, data visualization and custom alerting functions, and commonly known as the TICK stack. In this case, Chronograf and Kapacitor were not needed, because there was a better alternative: Grafana.

All the tools are hosted on the same virtual machine; a small-to-medium data flow like this does not require a cluster or a Hadoop-style toolchain. If data safety is required, a dynamic incremental backup of InfluxDB and the Telegraf configuration files can be kept on another physical machine.
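For example, a minimal sketch of such a backup, assuming InfluxDB 1.5+ (whose influxd backup -portable produces portable backups) and a reachable machine named backup-host (the name and paths are my own illustration):

$ influxd backup -portable /backup/influxdb/$(date +%F)
$ rsync -az /backup/influxdb/ backup-host:/srv/monitoring-backup/influxdb/
$ rsync -az /etc/telegraf/ backup-host:/srv/monitoring-backup/telegraf/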

The specific process is as follows:

Data capture and processing are all configured inside the containers.

3. Specific process

① Create a virtual machine

Use a tool such as VMware to configure a VM with more than four cores on the server. This VM serves as the container host.

② Install and configure Telegraf and InfluxDB

First, install Telegraf and InfluxDB on the virtual machine and set up /etc/influxdb/influxdb.conf and /etc/telegraf/telegraf.conf. The purpose of this step is to be able to mount these configuration files directly when the containers are brought up later, and to make them easy to modify.

  • If you have already configured the InfluxDB and Telegraf files on another device, or obtained the original .conf files from the official website, you can simply mkdir /etc/influxdb/ and mkdir /etc/telegraf/, cp the corresponding .conf files into place, and edit the configuration:
$ vim /etc/influxdb/influxdb.conf
$ vim /etc/telegraf/telegraf.conf
  • Telegraf installation method
Ubuntu:
$ wget https://dl.influxdata.com/telegraf/releases/telegraf_1.9.1-1_amd64.deb
$ sudo dpkg -i telegraf_1.9.1-1_amd64.deb
RedHat/CentOS:
$ wget https://dl.influxdata.com/telegraf/releases/telegraf-1.9.1-1.x86_64.rpm
$ sudo yum localinstall telegraf-1.9.1-1.x86_64.rpm
  • InfluxDB installation method
Ubuntu:
$ wget https://dl.influxdata.com/influxdb/releases/influxdb_1.7.2_amd64.deb
$ sudo dpkg -i influxdb_1.7.2_amd64.deb
RedHat/CentOS:
$ wget https://dl.influxdata.com/influxdb/releases/influxdb-1.7.2.x86_64.rpm
$ sudo yum localinstall influxdb-1.7.2.x86_64.rpm
  • Grafana can be run directly as a Docker container, with no need to install it locally.

As for the configuration of /etc/influxdb/influxdb.conf and /etc/telegraf/telegraf.conf, you can look up the basic settings on sites such as Stack Overflow and adjust them to your needs.

③ Configure the container environment

After the .conf files are configured, start the Telegraf and InfluxDB containers. During this process the system will automatically pull the images from docker.io; please be patient. We can also pull them manually with the following commands:

$ docker pull telegraf
$ docker pull influxdb

Alternatively, start the containers directly:

  • Start the InfluxDB container and map it to port 8086 (or any port you prefer).
$ docker run -d \
--name influxdb \
-p 8086:8086 \
-v /etc/influxdb/influxdb.conf:/etc/influxdb/influxdb.conf \
-v /var/lib/influxdb:/var/lib/influxdb \
docker.io/influxdb
  • Start the Grafana container
$ docker run \
  -d \
  -p 3000:3000 \
  --name=grafana \
  -e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
  -e "GF_SECURITY_ADMIN_PASSWORD=secret" \
  grafana/grafana
  • Start the Telegraf container
$ docker run -d \
--name telegraf \
--network host \
-v /etc/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf \
docker.io/telegraf

It is important to note that since we need to monitor both the N7000 and the C2960 switches, and the requirements and OID mappings of the two machines are completely different, it is strongly recommended to use two different telegraf.conf files and collect the data in two separate containers, to avoid a large number of invalid and empty entries; see the sketch below.
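A minimal sketch of that split, assuming the two configuration files are named telegraf-n7000.conf and telegraf-c2960.conf (the file and container names are my own illustration):

$ docker run -d --name telegraf-n7000 --network host \
-v /etc/telegraf/telegraf-n7000.conf:/etc/telegraf/telegraf.conf \
docker.io/telegraf
$ docker run -d --name telegraf-c2960 --network host \
-v /etc/telegraf/telegraf-c2960.conf:/etc/telegraf/telegraf.conf \
docker.io/telegraf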

At this point, the basic container environment of the monitoring framework is complete, and we can start configuring the SNMP data capture. The Grafana graphical monitoring console is reachable at the following URL:

http://localhost:3000/

The remaining settings are described below.

④ Capture SNMP switch data

SNMP is short for Simple Network Management Protocol, a lightweight protocol that provides simple network functions such as management, data collection and control of packet sending. Its OID system is well suited to small and medium-scale data capture, and Telegraf, the collection tool chosen here, has excellent SNMP support.

First, enter the Telegraf configuration file with $ vim /etc/telegraf/telegraf.conf. In command mode, type /inputs.snmp to jump to the SNMP configuration section (Telegraf supports many input plugins, so the configuration file runs to thousands of lines), then follow the instructions given in the configuration file.

For writing OIDs, the [[inputs.snmp.field]] capture scheme is recommended. An OID (Object Identifier) corresponds to a specific item of status information on hardware such as switches. A possible example is as follows:

[[inputs.snmp]]
   agents = ["192.168.0.1"]
   ## Timeout for each SNMP query.
   timeout = "5s"
   ## Number of retries to attempt within timeout.
   retries = 2
   ## SNMP version, values can be 1, 2, or 3
   version = 2

   ## SNMP community string.
   community = "YourCommunity"
   ## The GETBULK max-repetitions parameter
   max_repetitions = 30

   ## measurement name
   name="snmpd"
   [[inputs.snmp.field]]
     name="sysuptime"
     oid="1.3.6.1.2.1.1.3.0"
   [[inputs.snmp.field]]
     name="cpu5sec"
     oid=".1.3.6.1.4.1.9.9.109.1.1.1.1.6.1"
   [[inputs.snmp.field]]
     name="cpu1min"
     oid=".1.3.6.1.4.1.9.9.109.1.1.1.1.7.1"
   [[inputs.snmp.field]]
     name="cpu5min"
     oid=".1.3.6.1.4.1.9.9.109.1.1.1.1.8.1"

With an OID reference table, we can accurately capture dynamic data about the relevant switches' hardware and software. The Net-SNMP suite also provides a tool, snmpwalk, for verifying an OID's availability directly; CentOS is used as the example here:

$ yum install -y net-snmp
$ yum install -y net-snmp-utils

You can then check the validity of an OID using the following command format. Note that every OID written in the inputs.snmp section of telegraf.conf must be a leaf of the OID tree: a non-leaf OID returns a whole data table instead of a single value and cannot be written to InfluxDB.

$ snmpwalk -v 2c -c YourCommunity <IP/URL> <oid>
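As a quick illustration of the leaf-node rule (the agent IP is taken from the example above):

# A leaf OID (note the trailing .0 instance) returns a single value:
$ snmpwalk -v 2c -c YourCommunity 192.168.0.1 1.3.6.1.2.1.1.3.0
# A non-leaf OID walks a whole table (one row per interface), which
# the [[inputs.snmp.field]] scheme cannot store:
$ snmpwalk -v 2c -c YourCommunity 192.168.0.1 1.3.6.1.2.1.2.2.1.10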

Once all the OIDs check out, you can edit telegraf.conf, save it, and start a new Telegraf container to collect the data. If all goes well, the data will immediately be written to the corresponding database in InfluxDB, and we can move on to the Grafana platform.
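To confirm that points are actually arriving, you can query InfluxDB from inside its container. A quick sketch, assuming Telegraf is writing to its default database, telegraf:

$ docker exec -it influxdb influx -database telegraf -execute 'SHOW MEASUREMENTS'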

⑤ Common OID table of Cisco C2960

Considering how user-hostile Cisco's OID query system is, and that most people will not go to the C2960 site to download the OID mapping manual, here are some of the most commonly used OIDs for Cisco's C2960 series. For the N7000 series port mappings, please download the official manual, since it is an unordered table; see: Cisco Device OID comparison query tool.

# Get port Index
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.1
# Get the list of ports and their descriptions
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.2
# Get the port MAC address
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.6
# Get the index of the IP address
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.4.20.1.2
# Get the up/down status of the port
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.8
# Get port incoming traffic (bytes)
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.10
# Get port outgoing traffic (bytes)
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.2.2.1.16
# Get the CPU load for the last 5 seconds
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.2.1.56.0
# Get the CPU load for the last 1 minute
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.2.1.57.0
# Get the CPU load for the last 5 minutes
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.2.1.58.0
# Get the memory currently used (bytes)
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.9.48.1.1.1.5
# Get the memory currently free (bytes)
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.9.48.1.1.1.6
# Get the device serial number
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.47.1.1.1.1.11.1
# or
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.4.1.9.3.6.3.0
# Get the device name
snmpwalk -v 2c -c public 192.168.232.25 1.3.6.1.2.1.1.5.0
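For per-port monitoring such as the N7000 requirement, the table OIDs above become leaf OIDs once a port's ifIndex is appended. A hedged sketch of the corresponding telegraf.conf fields (ifIndex 1 and the field names are my own illustration; verify the real index values with the port Index query above):

[[inputs.snmp.field]]
  name = "ifInOctets_port1"
  oid = ".1.3.6.1.2.1.2.2.1.10.1"    # incoming traffic (bytes) on ifIndex 1
[[inputs.snmp.field]]
  name = "ifOutOctets_port1"
  oid = ".1.3.6.1.2.1.2.2.1.16.1"    # outgoing traffic (bytes) on ifIndex 1
[[inputs.snmp.field]]
  name = "ifInErrors_port1"
  oid = ".1.3.6.1.2.1.2.2.1.14.1"    # inbound error count on ifIndex 1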

⑥ Grafana graphical monitoring platform configuration

After the several Telegraf collection containers with their corresponding SNMP settings, the InfluxDB container and the Grafana container are all in place, we can finally configure the Grafana web graphical monitoring platform.

  • First we need to add the corresponding database. Go to the Grafana settings screen and add an InfluxDB data source:

    You need to enter a user-defined Name, the address and port of the database (http://localhost:8086 by default), and then the name and credentials of the database to be read. After Save & Test, the data source has been added.
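    If you prefer scripting over clicking, Grafana also exposes this through its HTTP API. A hedged sketch, reusing the admin password set on the container above and assuming Telegraf's default database name, telegraf:

$ curl -X POST http://admin:secret@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name":"InfluxDB","type":"influxdb","access":"proxy","url":"http://localhost:8086","database":"telegraf"}'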

  • After adding the database, we can start customizing the graphs. First create a dashboard, then set up the individual panels in it; see the official Grafana docs for the details of panel configuration. The real trick here is Grafana's $Variable feature, which is very valuable when working with databases. You can set filter criteria on a database tag or field to generate a variable group, and one or more of these groups appear as drop-down lists at the top of the dashboard, acting as filters. With these variables we can present a large amount of data with very few panels, even when facing data structures with many parallel series. Query holds the database filter statement used to obtain all values of a tag or field; Regex is a regular-expression filter that keeps only the variables needed (it does not support union). For example:
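    A possible variable definition, assuming the snmpd measurement from the earlier example (agent_host is the tag Telegraf's SNMP input adds for each polled device):

Query:  SHOW TAG VALUES FROM "snmpd" WITH KEY = "agent_host"
Regex:  /^192\.168\./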

  • Since Grafana is an open-source tool, it also supports a JSON editing mode, which lets you generate panels directly from JSON code. For panels in the tens to hundreds, generating the JSON directly with a Python script is the most practical approach. We can build one panel group as the smallest repeating unit, then open the Dashboard Settings page, where the complete script for the whole dashboard can be seen under JSON Model. This code is highly regular, so we can simply write a script that loops it out:

def SwhFuckingNum(i):
    # Replace the 'thefuckingport' placeholder in the panel JSON template
    # file with port number i, rewriting ./file2 in place.
    with open('./file2', 'r+') as d1:
        infos = d1.readlines()
        d1.seek(0, 0)
        for line in infos:
            yaxe = (i - 1) * 5  # y coordinate of this panel's top-left anchor
            line = line.replace('thefuckingport', str(i))
            d1.write(line)
        d1.truncate()  # drop leftover bytes; 'with' closes the file automatically

Two points deserve attention here. First, mind the relationship between each panel's top-left anchor point (x, y) and the panel size, and compute the positions automatically. Second, each panel's JSON object ends with ',' except the last one, which must not. Finally, put the generated set of panel JSON objects (possibly tens or even hundreds of them) into the dashboard's panels array. That said, beyond roughly 400 panels it is not advisable to keep stacking single panels: Grafana is not especially well optimized for this, and heavily multi-panel dashboards respond slowly. I will cover Grafana's panel JSON construction in more detail in a future update; a minimal sketch of the looping idea follows.
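A minimal sketch of that loop, assuming the panel JSON template has been loaded as a dict (the size constants and title scheme are my own illustration, not the original script):

import json

PANEL_W, PANEL_H = 6, 5   # assumed panel size in Grafana grid units
COLS = 4                  # a dashboard row is 24 grid units wide, so 4 panels across

def build_panels(template, n_ports):
    """Stamp out one panel dict per port, computing each top-left (x, y) anchor."""
    panels = []
    for i in range(n_ports):
        p = json.loads(json.dumps(template))   # cheap deep copy of the template dict
        p["title"] = "Port %d" % (i + 1)       # hypothetical title scheme
        p["gridPos"] = {"x": (i % COLS) * PANEL_W,
                        "y": (i // COLS) * PANEL_H,
                        "w": PANEL_W, "h": PANEL_H}
        panels.append(p)
    return panels

# The resulting list goes into the dashboard JSON's "panels" array, e.g.:
# dashboard["panels"] = build_panels(panel_template, 48)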

  • The result is a very dense and readable monitoring panel, as shown below: