Requirements

Whether operations work is event-driven or self-driven is a question we rarely stop to consider. Event-driven operations end once the failure is fixed, while self-driven operations end only once the capability has been built. Sustainable operations requires an automated operations system, so where should we start?

In fact, the earlier series of articles “Operation and Maintenance Thinking” has already given the answer: build in layers, and lay a solid foundation starting from the operations framework. Remember that “great oaks from little acorns grow”, and that a high platform should not be built on shifting sand.

Operations framework

When we talk about building out operations, the first image that comes to mind is usually a tangled mess, because this is not the work of one person or one role but of a whole team. We therefore untangle the mess into layers, from the bottom up:

  • IT infrastructure layer

    Owned mainly by the basic operations team. It covers hardware such as storage, network, servers, and security equipment.

  • Data layer

    Owned mainly by the DBA team and the big data team. It covers databases, caches, data warehouses, and so on.

  • Application layer

    Owned mainly by the application operations team. It covers basic services, business applications, middleware, and so on.

  • Management layer

    Owned mainly by the configuration management team, the security team, and the application operations team. It covers the various automated operations, security management, monitoring management, and so on.

  • Presentation layer

    Managed by each team for its own area. It covers the various management tools, monitoring tools, and so on.

By decomposing the operations framework and logically separating the various resources, each team can clearly see the current state of its operations work and where it falls short. If we keep the framework under continuous review, we can tell at a glance which team has gaps and where each team should focus next.

Operations fundamentals

If the operations framework is not detailed enough for you, what follows breaks down the work at each layer of the framework. We call these the operations fundamentals.

For each of these fundamentals we can take targeted measures, such as establishing standards and automated processes, and in doing so steadily enrich each team's policies, standards, and processes.

1. Infrastructure layer

In managing the underlying hardware, the work focuses on:

  • Network partitioning and isolation

    Network zones include the Internet access zone, the common production zone, the data zone, and the external connection zone. Zoning ensures that each kind of traffic is only allowed where it belongs.

    Network isolation keeps environments such as test and production separate from one another, so that access permissions do not get mixed up.

  • CMDB asset management

    The CMDB manages assets at the infrastructure layer and provides data to the applications built on top of it. It must be tightly integrated with business applications: once it is detached from real business use, the CMDB becomes mere window dressing.

    For related scenarios, see “Operation and Maintenance Thinking: Building Down-to-Earth Operations Automation”.

  • Internal DNS

    Internal DNS decouples applications from IP addresses: when an IP address changes, no code change is required. Even so, production environments should keep such changes to a minimum.

  • Fast server provisioning

    To keep up with growing business demand, there should be a set of automated processes covering fast server provisioning and real-time recording of the new assets in the CMDB.

  • Network permission changes

    Register and grant network permissions quickly according to application requirements.

And so on.
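As a minimal sketch of the zoning idea above, the following maps an address to its network zone so that tooling (for example, a permission-change workflow) can tell which zone a host belongs to. The CIDR ranges and the `zone_of` helper are hypothetical placeholders, not values from this article; a real deployment would load the ranges from the CMDB or the network team's records.

```python
import ipaddress
from typing import Optional

# Hypothetical zone layout: these CIDR ranges are illustrative assumptions only.
ZONES = {
    "internet-access": ipaddress.ip_network("10.10.0.0/16"),
    "common-production": ipaddress.ip_network("10.20.0.0/16"),
    "data": ipaddress.ip_network("10.30.0.0/16"),
    "external-connection": ipaddress.ip_network("10.40.0.0/16"),
}


def zone_of(ip: str) -> Optional[str]:
    """Return the name of the zone containing ip, or None if it is in no known zone."""
    addr = ipaddress.ip_address(ip)
    for name, network in ZONES.items():
        if addr in network:
            return name
    return None
```

A permission request could then be checked by comparing `zone_of(source)` with `zone_of(destination)` against whatever access policy the team has agreed on.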

2. Database layer

Beyond the clustering that each database provides on its own, consider processes such as database work orders and SQL review and optimization.

3. System applications

  • Capacity planning

    Capacity planning is a periodic evaluation based on baseline data such as the growth of user traffic and the existing capacity. Where possible, combine it with real load-test results to make the figures more accurate. Good capacity planning keeps server specifications under control and avoids surplus resources.

  • Environment maintenance and deployment

    To avoid problems caused by differences between environments, application deployments in every environment must follow a unified directory layout, use the same automated deployment method, and keep application configuration files separate from the application.

And so on.
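The capacity-planning evaluation above comes down to simple arithmetic. The sketch below is one possible way to do it, assuming compound monthly traffic growth and a per-server throughput figure taken from a load test; the function name, its parameters, and the 30% default headroom are illustrative assumptions rather than anything prescribed by this article.

```python
import math


def servers_needed(
    peak_qps: float,
    monthly_growth: float,
    months: int,
    qps_per_server: float,
    headroom: float = 0.3,
) -> int:
    """Estimate how many servers are needed after `months` of compound growth.

    peak_qps        current peak requests per second
    monthly_growth  growth rate per month, e.g. 0.10 for 10%
    qps_per_server  throughput one server sustained in a load test
    headroom        fraction of each server kept in reserve (assumed default: 30%)
    """
    projected_qps = peak_qps * (1 + monthly_growth) ** months
    usable_per_server = qps_per_server * (1 - headroom)
    return math.ceil(projected_qps / usable_per_server)
```

For example, with a current peak of 1000 QPS, 10% monthly growth, a six-month horizon, and 500 QPS per server, the projected peak is about 1772 QPS and each server contributes 350 usable QPS, so `servers_needed(1000, 0.10, 6, 500)` returns 6.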

4. Configuration management

  • Unified Account Management

    Every platform and management tool that involves user login should be connected to LDAP account management, so that a single account can log in to all systems.

  • Automated configuration center

    Following the idea of infrastructure as code, use Ansible as the configuration center. It centrally manages system initialization, environment initialization, component initialization, automatic backup, and similar tasks at the operating system level, so that every environment is delivered servers built to the same specification.

  • Process management

    Use Jira or a similar workflow tool to manage operations work through defined processes.

And so on.

5. CI/CD

Given unified operations standards, CI/CD is what actually puts the ideas and solutions from the layers above into practice. CI/CD capability therefore largely determines how far our automated operations can go.

  • Continuous integration

    Code quality testing, unit testing, packaging testing, automated testing, etc.

  • Operating system delivery

    Comply with unified O&M specifications, deliver a unified operating system, and register resources on each management node of the O&M platform.

  • Release

    Support smooth release, rollback, and restart.

  • Automatic packaging

    Android and iOS builds are packaged automatically and uploaded to the app stores.
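To make "smooth release and rollback" concrete, here is a minimal sketch of a batch-by-batch release that rolls back as soon as a batch fails its health check. It is an illustration under stated assumptions, not a description of any particular tool: `deploy` and `is_healthy` are hypothetical callables that the caller supplies (for example, wrappers around the team's existing deployment scripts and monitoring checks).

```python
from typing import Callable, Dict, List


def rolling_release(
    hosts: List[str],
    previous: Dict[str, str],
    new_version: str,
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> bool:
    """Deploy new_version in batches; roll back every updated host on the first failure.

    previous maps each host to the version it ran before the release.
    Returns True if all hosts were updated, False if the release was rolled back.
    """
    updated: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            deploy(host, new_version)
            updated.append(host)
        if not all(is_healthy(host) for host in batch):
            # Restore every host touched so far, most recent first.
            for host in reversed(updated):
                deploy(host, previous[host])
            return False
    return True
```

Releasing in small batches limits how many hosts are exposed to a bad build, and recording the previous version per host keeps the rollback path as simple as another deploy.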

6. Monitoring system

  • System construction

    Collect and analyze monitoring data along multiple dimensions to raise alarms at different severity levels.

    Analyzing this multi-dimensional data also makes self-healing of faults possible.

  • Monitoring management

    Monitoring must ensure not only that alarms are raised but also that they are accurate. Pay special attention to how alarm severity, alarm convergence, and fault self-healing policies are managed.
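Alarm convergence can take many forms; one simple form is suppressing repeats of the same alarm within a time window. The sketch below shows that idea only. The `AlarmConverger` class, the choice of (host, check, severity) as the alarm key, and the 300-second default window are all assumptions made for illustration.

```python
from typing import Dict, Tuple


class AlarmConverger:
    """Suppress repeated alarms with the same key inside a time window."""

    def __init__(self, window_seconds: float = 300) -> None:
        self.window = window_seconds
        # Maps (host, check, severity) to the time the last notification was sent.
        self._last_sent: Dict[Tuple[str, str, str], float] = {}

    def should_notify(self, host: str, check: str, severity: str, now: float) -> bool:
        """Return True if this alarm should be sent, False if it is a repeat to suppress."""
        key = (host, check, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[key] = now
        return True
```

Because severity is part of the key, an alarm that escalates from warning to critical is still sent immediately, while repeats at the same severity are held back until the window has passed.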

7. Security protection

In addition to security protection and traffic analysis through the necessary security devices such as WAF, IDS, and firewalls, problems should be actively sought out through penetration testing.

8. Data analysis

The centralized analysis and display of application data, service data, and operation data helps us better understand the system running status.

Conclusion

We hope the operations framework and fundamentals laid out above serve as a starting point, and that you will think them through against your own situation and go beyond what is listed here.

Of course, automated operations is not built overnight. It has to be realized step by step, together with the standards, policies, and processes that support it.

Remember that building out operations is a process, not just a goal. We need to follow technology trends and keep optimizing and enriching that process.