Foreword
Operation and maintenance is the most basic task of enterprise IT, and it is also the task with the most pain points and the most to complain about. Massive data volumes, frequent alarms, difficult troubleshooting, and merciless complaints are enough to drive operation and maintenance engineers to despair…
Building on ITOA (IT Operations Analytics), Gartner put forward the concept of AIOps. At the time, AIOps stood for “Algorithmic IT Operations”. As the AI boom arrived, Gartner followed the trend: in a 2017 report it redefined AIOps as “Artificial Intelligence for IT Operations”, which is the meaning in common use today.
AIOps applies artificial intelligence algorithms such as machine learning and deep learning to the large data sets collected by IT operation and maintenance tools and business systems, with the aim of building an intelligent operation and maintenance management platform that simulates human behavior (such as discovery, judgment, and response).
AIOps aims to equip operation and maintenance management with algorithmic and machine learning capabilities so that, through continuous learning, operation and maintenance becomes more intelligent and operation and maintenance personnel are freed from tedious daily work.
More than two years on, is AIOps still only a concept and a vision, or has it become a solution that can actually be implemented?
With these questions in mind, XingXingbao interviewed Mr. Li Cheng, vice president of Cloud Wisdom, a frontier explorer of AIOps technology in China and a Gartner AIOps Sample Vendor.
The following are highlights of what Mr. Li Cheng shared during the live broadcast. We hope they prove inspiring and helpful.
1
The concept, application scenarios and user value of AIOps
Li Weiliang: In which O&M scenarios can AIOps be applied?
Li Cheng: AIOps has a wide range of application scenarios and can address many pain points of traditional operation and maintenance, for example anomaly detection, fault prediction, correlation analysis, root cause analysis, alarm suppression, and automatic fault handling.
Li Weiliang: How does Cloud Wisdom understand the concept of AIOps?
Li Cheng: In Cloud Wisdom’s view, IT is the business. We therefore understand AIOps as “intelligent business operation and maintenance”, and in 2016 we released the intelligent business operation and maintenance platform DOCP (Digital Operation Central Platform). DOCP includes solutions for big data operation and maintenance, business operation and maintenance, and intelligent operation and maintenance, and aims to help users comprehensively improve IT operation efficiency and strengthen the business value of IT. Cloud Wisdom’s intelligent business operation and maintenance combines Gartner’s AIOps concept with China’s IT operation and maintenance practice, which makes it more scenario-oriented and more down to earth.
Li Weiliang: In the past two years, in which industries has Cloud Wisdom’s AIOps solution been applied? What value does it bring?
Li Cheng: In the past two years, Cloud Wisdom’s intelligent business operation and maintenance solutions have been successfully deployed in China in the business scenarios of large enterprises across banking, insurance, securities, aviation, medicine, manufacturing, consumer goods, and other fields.
Intelligent business operation and maintenance solutions greatly improve operation and maintenance efficiency through automation, intelligence, and empowerment of the IT team. They also make operation and maintenance more scientific: they reduce excessive dependence on personal experience, overcome the instability of manual work, and greatly improve quality. Finally, they free operation and maintenance personnel from heavy, tedious, and repetitive labor so that they can devote more energy to IT and business innovation.
In recognition of Cloud Wisdom’s contribution and efforts in the AIOps field, Gartner named Cloud Wisdom a Sample Vendor in the AIOps field in its newly released report “Hype Cycle for ICT in China, 2018”.
2
AIOps case studies from practice
Li Weiliang: Could you walk us through some specific industry cases?
Li Cheng:
☉ Application scenario 1: Anomaly monitoring
One of our customers is in the aviation industry. In the course of its business, 600 business application systems (including the ticketing system, the refund system, the warehouse entry system, and the order query system) generate massive log data (7TB/100 million incremental records generated in 2 hours). The customer wanted to analyze this data in real time, detect business fluctuations promptly, and issue early warnings. The requirements are characterized by large data volume, high metric complexity, and strict real-time demands (data collection, analysis, and presentation completed within 1 minute).
Cloud Wisdom has served this customer since 2016 and has built a real-time monitoring and analysis platform for business operations that delivers business anomaly warnings, business baseline warnings, operational monitoring and analysis, and real-time log query.
Using distributed big data processing, in-memory computing, and other technologies, we achieved real-time analysis and processing of 100,000 concurrent data records and second-level alarm processing for this customer. By applying algorithms such as deep learning and time series prediction, prediction accuracy has improved greatly: the predicted results deviate from the actual values by only 3%.
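The interview does not say how the 3% deviation is measured or which models produce the forecast. As a minimal sketch, assuming the deviation is a mean absolute percentage error (MAPE), and substituting a simple seasonal-naive forecast for the deep learning models mentioned, the calculation could look like this. All function names and numbers are hypothetical.

```python
from statistics import mean


def seasonal_naive_forecast(history, season_length, horizon):
    """Predict each future point as the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.03 means 3%)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


# Made-up business metric (for example, orders per interval), two seasons long.
history = [100, 120, 150, 130, 100, 120, 150, 130]
actual_next = [103, 117, 154, 127]

forecast = seasonal_naive_forecast(history, season_length=4, horizon=4)
deviation = mape(actual_next, forecast)
print(f"forecast={forecast} deviation={deviation:.1%}")
```

A production system would compare several models on held-out data; the point here is only how a deviation percentage of this kind can be computed.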
☉ Application scenario 2: Correlation analysis
One of our financial industry customers is a large financial institution that is digitizing relatively quickly. It has 3 data centers, 600 business application systems, and tens of thousands of physical devices in China. The calling relationships between the systems are complicated, and some core businesses depend strongly on one another.
These application systems generate massive log data and alarm information every day, and the log and message data were processed and analyzed slowly and inefficiently. Overall IT operation and maintenance efficiency had become an obstacle to the enterprise’s digital development.
For this company, Cloud Wisdom established a unified view of business and IT, drawing on the technology and experience accumulated over the years in products such as Monitoring Treasure, Perspective Treasure, and Pressure Measurement Treasure. We clarified the internal relationships among the various metric data, log data, and event data, and carried out unified modeling and analysis.
On this basis, Cloud Wisdom’s intelligent business operation and maintenance platform delivered prediction and anomaly detection of key business indicators and experience indicators for this customer, improved the efficiency of business operations and IT management, and took the first steps toward digitized and intelligent IT operations.
Figure: Large-screen display at the financial control center
☉ Application scenario 3: Intelligent alarm
When an IT failure occurs, multiple systems send out alarms at the same time. This causes great trouble for operation and maintenance personnel and greatly reduces the efficiency of fault handling. The phenomenon is called an “alarm storm”. Alarm storms are common in IT operation and maintenance, and handling them is one of the typical applications of AIOps.
One of our pharmaceutical customers has nearly 10 online products and office systems serving all types of customers. With the rapid development of its business, it has built 3 data centers across the country and has tens of thousands of physical devices. The calling relationships between the systems are complex, and some core businesses have strong dependencies.
The operation and maintenance team receives nearly 10,000 fault alarm notifications every day, an average of 100-200 per person, with frequent missed alarms and false alarms. When a fault occurs, several departments must coordinate to locate and solve the problem, and the average resolution time is more than 1 hour. The customer currently has 5 monitoring systems, each of which generates alarm notifications independently. When a large-scale failure occurs, operation and maintenance personnel receive a large number of alarm notifications from every system at the same time, which seriously disrupts normal work.
For this company, we deployed an intelligent alarm platform that connects to the various monitoring systems through REST APIs, agent-based collection, and other methods, so that the alarm messages of every system are unified and aggregated on the platform. This integration allows operation and maintenance personnel to handle all failures on one platform.
After the intelligent alarm platform was officially deployed, we reduced the number of alarms by 93%; that is, every 100 alarms are compressed to 7. The system can also classify the alarm information systematically and route it to the right person in time.
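How the platform groups alarms is not described in the interview. As a minimal sketch, assuming alarms from different monitoring systems are merged when they concern the same host and check and arrive close together in time, compression and the resulting rate could be computed as follows. The field names, source names, and the 300-second window are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str      # monitoring system that emitted the alarm (hypothetical names)
    host: str
    check: str
    timestamp: int   # seconds since some reference point


def compress(alarms, window_seconds=300):
    """Group alarms that share (host, check) and follow the previous alarm
    of that group within window_seconds. Returns a list of groups."""
    groups = []
    open_group = {}  # (host, check) -> index of the most recent group
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.host, alarm.check)
        idx = open_group.get(key)
        if idx is not None and alarm.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alarm)
        else:
            groups.append([alarm])
            open_group[key] = len(groups) - 1
    return groups


def compression_rate(raw_count, grouped_count):
    """Fraction of notifications removed by grouping."""
    return 1 - grouped_count / raw_count


alarms = [
    Alarm("monitor_a", "db01", "disk_full", 0),
    Alarm("monitor_b", "db01", "disk_full", 60),
    Alarm("monitor_c", "db01", "disk_full", 120),
    Alarm("monitor_a", "web01", "http_5xx", 30),
    Alarm("monitor_b", "web01", "http_5xx", 90),
    Alarm("monitor_a", "db01", "disk_full", 1000),  # too late to join the first group
]
groups = compress(alarms)
print(len(alarms), "alarms ->", len(groups), "notifications")
print(f"rate for 100 -> 7: {compression_rate(100, 7):.0%}")
```

Keying on host and check is the simplest possible grouping rule; a real platform would more likely also use topology or call-chain relationships between systems, which this sketch does not attempt.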
The intelligent alarm platform has greatly shortened the average response time (MTTA) of the entire operation and maintenance team, from a previous average of 25 minutes 23 seconds to 4 minutes 16 seconds. Through technologies such as dynamic baselines, the false positive rate was reduced from 22.4% to 8.5% and the false negative rate from 9.3% to 3.8%.
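The interview names dynamic baselines without describing them. One common form, assumed here purely for illustration, compares each new value with the mean and standard deviation of a rolling window of recent values instead of a fixed threshold. The window size, the multiplier k, and the minimum sigma are hypothetical parameters.

```python
from collections import deque
from statistics import mean, stdev


def dynamic_baseline_anomalies(values, window=5, k=3.0, min_sigma=1e-9):
    """Return the indexes of values that fall outside mean +/- k * stdev
    of the `window` values immediately preceding them."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(recent) == window:
            baseline = mean(recent)
            sigma = max(stdev(recent), min_sigma)  # avoid a zero-width band
            if abs(value - baseline) > k * sigma:
                anomalies.append(i)
        # Every value, anomalous or not, enters the window, so the baseline
        # adapts to level shifts at the cost of a temporarily wider band.
        recent.append(value)
    return anomalies


# Made-up response times in milliseconds with one obvious spike.
latencies = [50, 52, 51, 49, 50, 51, 120, 50, 52]
spikes = dynamic_baseline_anomalies(latencies)
print("anomalous indexes:", spikes)
```

Because the band follows recent behavior, a metric that is normally noisy gets a wide band (fewer false positives) and a normally stable metric gets a narrow one (fewer missed anomalies), which is the kind of effect the reported rates describe.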
On this basis, we recently implemented a “fault prediction” function that helps users learn about possible IT problems in advance and minimize the impact of IT failures on the business.
3
Deployment method and implementation methodology
Li Weiliang: What methodology is needed to implement AIOps?
Li Cheng:
Intelligent operation and maintenance cannot be implemented overnight. It needs to go through three stages:
The first stage is big data operation and maintenance: build a unified monitoring platform and achieve unified management and control of IT resources. Use big data to collect and analyze IT monitoring data such as infrastructure, network, and log data. Through real-time processing and analysis of massive IT data, eliminate data silos, unify alarms, and improve the efficiency of operation and maintenance management.
The second stage is business operation and maintenance: comprehensively improve user experience and business system health, and achieve a two-way drive between business and IT. User experience and business efficiency are the two core indicators of digital business. Through the two-way drive of IT and business, business operation and maintenance helps companies understand the impact of IT failures on the business, how IT can better support business transformation, and how to minimize business losses.
The third stage is intelligent operation and maintenance: build an intelligent IT operation management and control system and continuously improve business value. Through intelligent alarms, anomaly monitoring, root cause analysis, automatic handling, and failure prediction, it greatly improves IT operation and maintenance efficiency, guarantees business continuity, and reduces business losses.
Among these, the big data platform is the foundation of the entire intelligent business operation and maintenance system. Enterprise users can first lay a solid big data foundation and then gradually add application modules on top of it, taking small, quick steps and accumulating experience along the way, so that AIOps can be deployed successfully in their own enterprises.
Li Weiliang: What deployment methods does Cloud Wisdom’s intelligent business operation and maintenance platform support?
Li Cheng: Cloud Wisdom’s intelligent business operation and maintenance platform adopts a hybrid cloud architecture and supports both on-premises private deployment and SaaS deployment based on the public cloud. As the first domestic business operation and maintenance solution provider to apply AIOps across industry scenarios, Cloud Wisdom can provide users with a full range of services, from big data platforms, to intelligent operation and maintenance modules, to experts and implementation, meeting both the basic and the individual needs of enterprises and promoting the development of their digital business.