The move from the carriage to the car was made to improve transportation efficiency. As times have changed, we now hope to free the driver from the physical labor of driving, using autonomous driving to increase operating efficiency and reduce the traffic accident rate. Enterprises want the same thing from intelligent operation and maintenance.
The move from manual to automated operation and maintenance was made to cut labor costs, reduce operational risk, and improve efficiency. However, automated operation and maintenance is in essence still a mode that combines people with automation tools, and it still has limitations. To keep providing high-quality operation and maintenance services for large-scale, highly complex systems, intelligent operation and maintenance (AIOps) came into being.
In this article, Kangaroo Cloud shares a specific application of the intelligent operation and maintenance big data platform (an out-of-the-box operation and maintenance monitoring platform) in the Oracle database operation and maintenance scenario.
I. Data collection
The first step in using the platform is data access. What data support is needed for Oracle operation and maintenance? Based on our daily experience running and maintaining Oracle, the following types of data are particularly important:
Instance and database basic information
Includes the instance version, patch level, start time, instance parameters, and basic host configuration.
Database health check
Checks whether the database can be connected to normally and whether read and write response times are normal.
Instance basic performance data
Includes business QPS and TPS, instance and host CPU utilization, memory utilization, connection utilization, SQL parsing, database logical reads, physical reads, database lock wait status, and RAC cluster communication status.
Oracle wait events
Collects the type, count, and elapsed time of wait events inside Oracle. Wait events can be used to judge the overall health of an instance and to locate instance bottlenecks.
Database space usage information
Includes the space occupied by tablespace files, tablespace usage, temporary tablespace usage, and UNDO tablespace usage. Tablespace usage needs to be monitored in real time to avoid failures caused by a tablespace filling up.
Database session information
Session information records the SQL currently running on the instance, along with details of any sessions that are currently blocking others, most commonly through lock waits. With session information, blocking inside an instance can be located quickly.
Database backup information
In database operation and maintenance, nothing weighs more than backups. The database backup needs to be checked every day, including whether it succeeded, how long it took, and how much space it used.
DataGuard status
DataGuard is one of the most commonly used high-availability solutions for Oracle. Its health needs to be monitored in real time, including whether log transfer is normal and how far log application lags behind.
Database logs
The database alarm log and the TNS listener log. These logs reveal errors raised inside the database, abnormal client connections, and so on.
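As a rough illustration of what can be pulled out of the alarm log, the sketch below counts Oracle error codes in a handful of log lines. It is not how the platform itself parses logs; the function name `count_ora_errors` and the sample lines are made up for the example, and it assumes error codes appear in the usual `ORA-` plus five digits form.

```python
import re
from collections import Counter

# Assumption: error codes are written as "ORA-" followed by five digits.
ORA_PATTERN = re.compile(r"\bORA-(\d{5})\b")


def count_ora_errors(lines):
    """Count how often each ORA-xxxxx code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for code in ORA_PATTERN.findall(line):
            counts["ORA-" + code] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    "Thu Jan 01 10:00:00 2021",
    "ORA-00600: internal error code, arguments: [hypothetical], [], []",
    "Thu Jan 01 10:05:00 2021",
    "ORA-01555 caused by SQL statement below",
    "ORA-00600: internal error code, arguments: [hypothetical], [], []",
]
error_counts = count_ora_errors(sample_log)
print(error_counts.most_common())
```

Counting by code is only a starting point; a real analysis would also keep the timestamps so that bursts of the same error can be seen over time.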
The data collection mentioned above is already integrated into the product. Users only need to configure access information in the database performance acquisition module, and these data will be collected automatically.
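To make the health check item above more concrete, here is a minimal sketch of a connectivity and response-time probe. It is an assumption about how such a check could look, not the product's implementation. `check_database` and `slow_threshold_s` are hypothetical names, the probe only covers a read, and SQLite stands in for Oracle so the example runs anywhere.

```python
import sqlite3
import time


def check_database(connect, probe_sql, slow_threshold_s=1.0):
    """Connect, run one probe query, and report status plus response time.

    `connect` is any zero-argument callable returning a DB-API 2.0 connection.
    """
    started = time.perf_counter()
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(probe_sql)
            cur.fetchone()
        finally:
            conn.close()
    except Exception as exc:  # any driver error is treated as "not connectable"
        return {"status": "down", "error": str(exc), "elapsed_s": None}
    elapsed = time.perf_counter() - started
    status = "slow" if elapsed > slow_threshold_s else "ok"
    return {"status": status, "error": None, "elapsed_s": elapsed}


# Stand-in demo with SQLite. Against Oracle the probe would typically be
# "SELECT 1 FROM dual", with connect() supplied by an Oracle driver.
result = check_database(lambda: sqlite3.connect(":memory:"), "SELECT 1")
print(result["status"])


def failing_connect():
    # Simulates an instance that cannot be reached.
    raise ConnectionError("listener unreachable (simulated)")


down = check_database(failing_connect, "SELECT 1")
print(down["status"], down["error"])
```

Passing the connection factory in as a callable keeps the probe independent of any particular driver, which is why the same function can be exercised here without an Oracle instance.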
II. Data application
Once the data has been brought in, the product uses it in several ways:
The system ships with common dashboards for Oracle scenarios by default. Users can also build custom dashboards through SPL to suit their own habits.
Once data is collected, it can be used to configure alarms. Common monitoring alarms are built into the system, and custom alarms can be configured through SPL.
The system supports custom inspection rules and inspects the database regularly at a user-defined interval.
On top of basic log search and monitoring alarms, log analysis scenarios can also be configured over the Oracle alarm log and TNS listener log collected by the system.
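As one illustration of an alarm built on collected data, the sketch below turns a DataGuard lag value into seconds and compares it with a threshold. It does not use SPL and is not the platform's alarm engine. It assumes the lag arrives as a day-to-second interval string such as `+00 00:01:30` (the form reported by `v$dataguard_stats`), and the 300-second threshold is an arbitrary example.

```python
import re

# Assumption: lag values look like "+DD HH:MM:SS", e.g. "+00 00:01:30".
LAG_PATTERN = re.compile(r"^\+(\d+) (\d{2}):(\d{2}):(\d{2})$")


def lag_to_seconds(value):
    """Convert a day-to-second interval string into whole seconds."""
    match = LAG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised lag value: {value!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def lag_alarm(value, threshold_s=300):
    """Return True when the lag exceeds the (hypothetical) threshold."""
    return lag_to_seconds(value) > threshold_s


print(lag_to_seconds("+00 00:01:30"))
print(lag_alarm("+00 00:01:30"), lag_alarm("+00 00:10:00"))
```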
This article focuses on the use of the dashboard.
Dashboards are the basic form of data visualization and give users an intuitive view of the overall running state of the system.
1. Oracle instance overview
The Oracle instance overview dashboard consists of the following parts:
- Instance statistics, including the total number of instances, the number of exceptional instances, the number of databases, and the distribution of instance versions. These metrics give a general picture of the instances connected to the system.
- TOP instances, including the TOP instances by busyness rate and the TOP instances by number of active sessions. These two metrics are used to locate busy instances.
- List of exceptional instances. This table shows all instances that cannot be connected to, together with the connection error messages.
- TOP performance trend charts, which use the core metrics of the database to give an overall picture of instance health. Metrics selected:
· DB Time utilization: reflects the overall busyness of the instance
· DB CPU utilization: utilization of CPU resources
· Number of active sessions: reflects the SQL backlog
· Session count utilization: utilization of session resources
· QPS/TPS: throughput of business requests
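Throughput metrics such as QPS and TPS are usually derived from cumulative counters sampled at two points in time. The sketch below shows that calculation. The article does not say how the platform defines QPS and TPS, so this assumes one common convention: QPS from the growth of `execute count`, and TPS from the growth of `user commits` plus `user rollbacks`, all of which are statistic names in `v$sysstat`. The function name and sample values are made up.

```python
def rate_per_second(prev, curr, interval_s, names):
    """Sum the growth of the named cumulative counters and divide by the interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = sum(curr[name] - prev[name] for name in names)
    if delta < 0:
        # Cumulative counters start over after an instance restart,
        # so a negative delta means this sample cannot be used.
        return None
    return delta / interval_s


# Two made-up snapshots taken 60 seconds apart.
prev = {"execute count": 1_000_000, "user commits": 50_000, "user rollbacks": 500}
curr = {"execute count": 1_030_000, "user commits": 50_580, "user rollbacks": 520}

qps = rate_per_second(prev, curr, 60, ["execute count"])
tps = rate_per_second(prev, curr, 60, ["user commits", "user rollbacks"])
print(qps, tps)
```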
2. Oracle instance details
This dashboard shows the health details of an individual instance. It is divided into the following main parts.
1) Instance information
Displays the basic information of the instance, including the host, instance running state, instance version, database role, read/write mode, and so on.
2) Instance running status
Displays the core performance metrics of the instance.
- Number of blocked sessions/number of active sessions
- DB Time usage
- Instance current session count utilization
- CPU usage trend
- Instance session count trends
- SQL execution/SQL parsing
- Instance logical reads/physical reads
- Instance network traffic
- Number of instance IO requests
3. Oracle instance space overview
This dashboard shows the space usage of instances. It consists of the following main parts:
1) Total space distribution across instances
Shows how space is distributed across all instances.
2) TOP instances by space usage
Shows the space usage trend of the TOP instances by space usage.
3) Instance tablespace related information
Shows the number of tablespaces, the total space of the selected instance, space usage compared to last year, and the UNDO space, TEMP space, and flashback space of the selected instance.
4) Instance tablespace usage and occupancy ranking
5) TOP trend in instance tablespace usage
6) List of instance tablespaces
Shows space usage for all tablespaces in the instance.
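The ranking and list views above come down to computing a usage percentage per tablespace and sorting by it. A minimal sketch of that step follows, assuming the used and total sizes have already been collected. `rank_tablespaces`, the 85% warning line, and the sample figures are all hypothetical.

```python
def rank_tablespaces(rows, warn_pct=85.0):
    """Rank tablespaces by usage and flag those at or above warn_pct.

    rows: iterable of (name, used_mb, total_mb) tuples.
    """
    report = []
    for name, used_mb, total_mb in rows:
        if total_mb <= 0:
            continue  # skip rows with no usable size information
        used_pct = round(100.0 * used_mb / total_mb, 1)
        report.append(
            {"name": name, "used_pct": used_pct, "warn": used_pct >= warn_pct}
        )
    # Highest usage first, as in a TOP ranking.
    return sorted(report, key=lambda item: item["used_pct"], reverse=True)


# Made-up figures for illustration.
sample = [
    ("SYSTEM", 820, 1024),
    ("USERS", 9300, 10240),
    ("UNDOTBS1", 512, 4096),
    ("TEMP", 100, 2048),
]
ranking = rank_tablespaces(sample)
for row in ranking:
    print(row["name"], row["used_pct"], "WARN" if row["warn"] else "")
```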
4. Oracle blocking sessions
This dashboard shows the blocking sessions in an instance and is composed of several parts.
1) TOP blocking session trend chart
Shows the trend in the number of blocked sessions for all instances in the system. Any blocking session deserves special attention.
2) Instance wait event distribution chart
Shows the distribution of wait events for the blocked sessions of the selected instance.
3) Blocking source analysis
Shows which sessions cause other sessions to block.
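Blocking source analysis amounts to following each blocked session up its chain of blockers until a session that is not itself waiting is reached. The sketch below shows that idea on made-up data. It assumes each session is reduced to a pair of session ID and blocking session ID, as in the `SID` and `BLOCKING_SESSION` columns of `v$session`. The function name is hypothetical and this is not necessarily how the dashboard computes it.

```python
from collections import defaultdict


def root_blockers(sessions):
    """Group blocked sessions under the session at the head of their chain.

    sessions: dict mapping session id -> blocking session id (or None).
    Returns a dict mapping root blocker id -> sorted list of waiting ids.
    """
    def find_root(sid):
        seen = {sid}
        current = sid
        while sessions.get(current) is not None:
            current = sessions[current]
            if current in seen:  # a cycle, e.g. an unresolved deadlock
                return None
            seen.add(current)
        return current

    waiting = defaultdict(list)
    for sid, blocker in sessions.items():
        if blocker is None:
            continue  # this session is not blocked
        root = find_root(sid)
        if root is not None:
            waiting[root].append(sid)
    return {root: sorted(sids) for root, sids in waiting.items()}


# Made-up sessions: 101 blocks 102 and 104, 102 in turn blocks 103.
sessions = {101: None, 102: 101, 103: 102, 104: 101, 200: None, 201: 200, 300: None}
blockers = root_blockers(sessions)
print(blockers)
```

Grouping by the head of the chain matters because killing or fixing an intermediate session such as 102 would not release 104, while dealing with 101 releases everything behind it.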
4) Wait event trend
Shows the wait event trend for the instance.
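A wait event trend is typically built from the difference between two snapshots of cumulative wait times. The sketch below ranks events by the time waited during one interval. It assumes the snapshots map event name to cumulative microseconds waited, as `TIME_WAITED_MICRO` in `v$system_event` would provide. The function name and figures are made up.

```python
def top_wait_events(prev, curr, top_n=3):
    """Rank wait events by time waited between two cumulative snapshots.

    prev, curr: dicts mapping event name -> cumulative microseconds waited.
    Returns a list of (event, delta_us, share_of_total) tuples.
    """
    deltas = {}
    for event, total_us in curr.items():
        delta = total_us - prev.get(event, 0)
        if delta > 0:
            deltas[event] = delta
    total = sum(deltas.values())
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return [(event, delta, delta / total) for event, delta in ranked[:top_n]]


# Made-up snapshots for illustration.
prev = {
    "db file sequential read": 5_000_000,
    "log file sync": 2_000_000,
    "enq: TX - row lock contention": 100_000,
}
curr = {
    "db file sequential read": 6_000_000,
    "log file sync": 2_500_000,
    "enq: TX - row lock contention": 2_600_000,
}
top_events = top_wait_events(prev, curr)
for event, delta_us, share in top_events:
    print(f"{event}: {delta_us} us ({share:.1%})")
```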
5) Blocking session list
The details of the blocking sessions are presented in table form, including:
· Session ID
· Login time
· Current state
· ID of the session causing the blocking
· Blocked object ID
· Wait event
· Wait time
· Login user information, including user name, login terminal, and application name
· Executed SQL information, including the SQL ID and SQL statement
From these dashboards, you can get an overview of the basic health of all instances, as well as an in-depth analysis of an individual instance, down to the specific SQL being executed. You can get an overall picture of space usage trends across all databases, and you can also see space usage for individual tablespaces.
III. Summary
The case above is one specific application of the intelligent operation and maintenance big data product, in the Oracle database operation and maintenance scenario.
In fact, the product is by no means limited to database operation and maintenance.
It is highly extensible in both data collection and data application.
1) Automatic inspection
Any metric can be configured as an inspection item. The system supports custom scheduling cycles (at hourly granularity), inspects the running status of the system on that schedule, and sends the results as DingTalk messages or emails.
2) Monitoring of the whole link
The above only covers the database scenario. In fact, the system supports data collection and analysis across the whole link. Collection currently supported by the system includes:
· Physical devices (CPU fans, disks, temperature, and power state of physical machines)
· Network devices (switches, firewalls, wireless APs)
· Alibaba Cloud products, with data integration for dozens of cloud products
· Common software (Docker, Tomcat, messaging middleware)
· Web access logs, firewall logs, host logs
· Application log data
· APM application call chain data
3) Intelligent algorithms
Automatic baseline learning monitors the system for abnormal running states without the need to configure alarms.
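The article does not describe the baseline algorithm the platform uses. Purely to illustrate the idea of learning a normal range from history instead of hand-setting a threshold, here is a deliberately simple stand-in: a band of mean plus or minus k standard deviations. The function names, the choice of k=3, and the sample values are all assumptions.

```python
import statistics


def learn_baseline(history, k=3.0):
    """Learn a (low, high) band from past values as mean +/- k standard deviations."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (mean - k * std, mean + k * std)


def is_anomalous(value, baseline):
    """Return True when the value falls outside the learned band."""
    low, high = baseline
    return value < low or value > high


# Made-up CPU utilization history (percent) for one instance.
cpu_history = [40, 42, 38, 41, 39, 40, 43, 37]
baseline = learn_baseline(cpu_history)
print(baseline)
print(is_anomalous(41, baseline), is_anomalous(85, baseline))
```

A production baseline would normally also account for daily and weekly patterns, which a single global band like this one cannot capture.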
“Intelligent Operation and Maintenance Big Data Platform”
The platform is an out-of-the-box operation and maintenance monitoring platform. It brings an enterprise's infrastructure, applications, and logs under unified management, and provides unified collection, storage, correlation analysis, and monitoring to keep the business running stably, efficiently, and securely. Using offline computing, real-time computing, and machine learning, it also enables sharing, development, and processing of operation and maintenance data, so that development, operations, and business teams can work together to build and improve software applications and to understand the business and how users use it. Organizations use it for digital transformation and cloud migration: to drive collaboration between development, operations, and business teams, shorten application time-to-launch, reduce the time needed to resolve problems, understand user behavior, and track key business metrics.
DTStack is a cloud-native, one-stop data platform PaaS. We also have an interesting open source project on GitHub and Gitee: FlinkX, a unified batch and stream data synchronization tool based on Flink. It can collect static data as well as data that changes in real time, making it a global, heterogeneous data synchronization engine for both batch and stream. If you like it, please give us a star!
GitHub open source project: https://github.com/DTStack/fl…
Gitee open source project: https://gitee.com/dtstack_dev…