On October 6, 2018, Elastic became one of the few companies in the open-source software space to go public. Elastic is best known for its open-source search product, Elasticsearch.

Elasticsearch is a distributed, scalable, real-time search and analytics engine used by many well-known companies around the world in a wide variety of scenarios. At Uber, Instacart, and Tinder, it matches riders with nearby drivers, returns relevant results and suggestions for online shoppers, and matches people who might like each other. In traditional IT, operations, and security settings, it is used to aggregate pricing, quotation, and business data, process a billion log events per day, and support security operations for networks of thousands of devices and their critical data.

What about Elasticsearch and log analysis? Elasticsearch does not position itself as a log analysis system, so what role can it play in log analysis, and how can it be optimized and improved? What other open-source search engines are out there, and what are Elasticsearch's pros and cons compared with them? InfoQ caught up with Li Wuping, vice president of log analysis vendor Rizhiyi, to talk about Elasticsearch and log analysis.

Li Wuping: Logs are a very broad concept in computer systems. Almost any program can output logs: the operating system kernel, application servers of all kinds, and so on. Logs also vary in content, size, and purpose. Here are just a few scenarios.

In web or mobile apps, logs are often used to record user access behavior. They can be used to analyze service traffic, improve the system, and build user profiles, which in turn can feed advertising, recommendation, and other services.

Much internal enterprise software records all user operations. These logs can be used for user auditing or enterprise security purposes.

In addition, for development and operations personnel, logs are often the first resort for fault analysis, alerting and monitoring, and similar scenarios.

Li Wuping: For relatively small data sizes, you can process logs with a single script. This approach is simple and fast, but when several different analyses are run over the same logs, it can produce a large amount of repetitive code for parsing and cleaning the data. At that point a more suitable tool, such as a database, may be appropriate.

When using a database for log analysis, a key question is how to import various heterogeneous log files into the database, because a database first requires a table structure in a fixed format, a process commonly referred to as ETL. Once the data is imported, you can analyze the logs with familiar SQL.
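As a minimal sketch of this database approach (the log format, table, and field names here are hypothetical, not from the interview), a script can parse raw lines into a fixed schema and then answer questions in SQL:

```python
import re
import sqlite3

# Hypothetical access-log format: "2018-10-06 12:00:01 GET /cart 200"
LINE_RE = re.compile(r"(\S+ \S+) (\S+) (\S+) (\d+)")

def etl(lines, conn):
    """Parse raw log lines into a fixed table schema (the T and L of ETL)."""
    conn.execute("CREATE TABLE IF NOT EXISTS access_log "
                 "(ts TEXT, method TEXT, path TEXT, status INTEGER)")
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # skip lines that do not match the expected format
            ts, method, path, status = m.groups()
            conn.execute("INSERT INTO access_log VALUES (?, ?, ?, ?)",
                         (ts, method, path, int(status)))

conn = sqlite3.connect(":memory:")
etl([
    "2018-10-06 12:00:01 GET /cart 200",
    "2018-10-06 12:00:02 GET /cart 500",
    "2018-10-06 12:00:03 POST /pay 200",
], conn)

# Once loaded, familiar SQL answers analysis questions directly.
errors = conn.execute(
    "SELECT COUNT(*) FROM access_log WHERE status >= 500").fetchone()[0]
print(errors)
```

The fixed `CREATE TABLE` schema is exactly the constraint the interview mentions: every log variant must be mapped onto it up front.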

For large data volumes, distributed technologies are typically used. One approach is to store the log data in Hadoop and analyze it with MapReduce jobs or Spark. If you want to use SQL for analysis, you can use Hive, a database-like system on top of Hadoop. Systems of this type are suited to batch analysis; if real-time analysis is required, a stream-processing system must be introduced.
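The batch style of analysis described above can be illustrated locally. This sketch mimics the map and reduce phases over log lines in plain Python; a real deployment would run the same logic as MapReduce jobs or Spark tasks across many machines:

```python
from collections import Counter
from itertools import chain

logs = [
    "2018-10-06 12:00:01 GET /cart 200",
    "2018-10-06 12:00:02 GET /cart 500",
    "2018-10-06 12:00:03 POST /pay 200",
]

# Map phase: each line emits a (key, 1) pair, here keyed by HTTP status.
mapped = chain.from_iterable([(line.split()[-1], 1)] for line in logs)

# Reduce phase: sum the counts per key, as a reducer does per partition.
counts = Counter()
for key, n in mapped:
    counts[key] += n

print(dict(counts))  # {'200': 2, '500': 1}
```

Because both phases operate on independent records, each can be split across machines, which is what makes the approach scale to large volumes.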

There is also a class of log analysis built on search engines. In China there are currently many log analysis applications based on Elasticsearch, and there are commercial products in this space as well: Splunk and Sumo Logic abroad, and Rizhiyi domestically.

Li Wuping: The biggest problem in log analysis is the complexity of field extraction. First, extracting fields from all logs is a great deal of work. Second, logs change as product versions are updated, and it is difficult to predict, at ingestion time, which fields later analyses will need. Hence the need for powerful, easy-to-use field extraction, as well as the ability to extract fields on demand at search time.
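Search-time field extraction can be sketched as applying a user-supplied pattern only when a query needs it, instead of fixing every field at ingestion. The log line, pattern, and field name below are hypothetical:

```python
import re

def extract_fields(raw_line, pattern):
    """Extract named fields on demand; returns {} if the pattern does not match."""
    m = re.search(pattern, raw_line)
    return m.groupdict() if m else {}

line = "2018-10-06 12:00:02 GET /cart 500 latency=231ms"

# 'latency_ms' was never indexed as a field; it is extracted only for this query.
fields = extract_fields(line, r"latency=(?P<latency_ms>\d+)ms")
print(fields)  # {'latency_ms': '231'}
```

The benefit is exactly the one described above: when the log format changes in a new product version, only the pattern changes, not the stored data.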

The second issue is the need for flexible, configurable analytical capabilities. If every analysis requires writing code, the workload is large and the barrier to entry is high. Database solutions can use SQL, and some commercial products have introduced their own query languages, such as SPL (Search Processing Language).
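To illustrate why pipeline-style languages such as SPL lower the barrier, a hypothetical query like `search status=500 | stats count by path` can be emulated as a short chain of stages (the events and field names are invented for the example):

```python
from collections import Counter

events = [
    {"path": "/cart", "status": 500},
    {"path": "/cart", "status": 200},
    {"path": "/pay",  "status": 500},
    {"path": "/cart", "status": 500},
]

# Stage 1 ("search status=500"): filter the event stream.
hits = (e for e in events if e["status"] == 500)

# Stage 2 ("stats count by path"): aggregate the filtered stream.
stats = Counter(e["path"] for e in hits)
print(dict(stats))  # {'/cart': 2, '/pay': 1}
```

Each stage consumes the previous stage's output, so an analyst composes queries by chaining small steps rather than writing a program for each new question.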

Other common issues are real-time behavior and performance. As the volume of log data grows, real-time monitoring and analysis inevitably affect system performance; balancing and optimizing the two depends largely on operations experience.

Li Wuping: In log analysis, the search engine is mainly used for data reading and writing: it receives the latest log data in real time, indexes it, and makes it available to users for search and statistical analysis in real time or near real time.
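The read/write role described above rests on an inverted index. A toy version, mapping each token to the IDs of the log lines containing it, shows why lookups stay fast as new data streams in (this is an illustrative sketch, not how Lucene is actually implemented):

```python
from collections import defaultdict

index = defaultdict(set)  # token -> ids of log lines containing it
docs = {}

def ingest(doc_id, line):
    """Index a new log line as it arrives (the write path)."""
    docs[doc_id] = line
    for token in line.lower().split():
        index[token].add(doc_id)

def search(*tokens):
    """Full-text AND search via set intersection (the read path)."""
    sets = [index.get(t.lower(), set()) for t in tokens]
    return set.intersection(*sets) if sets else set()

ingest(1, "GET /cart 200")
ingest(2, "GET /cart 500")
ingest(3, "POST /pay 200")

print(search("get", "/cart"))  # {1, 2}
```

Each query touches only the postings for its tokens rather than scanning every stored line, which is what keeps search near real time even as ingestion continues.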

Compared with similar database systems, a search engine's main characteristics are support for full-text search, faster queries, higher data-processing throughput, and good real-time behavior.

Li Wuping: The main open-source search engine projects include Solr from the Apache community, Vespa from Yahoo, Sensei from LinkedIn, and Elasticsearch. Elasticsearch is currently the one most widely used for log analysis by companies in China, including Internet companies.

In log analysis, Elasticsearch has the following advantages: it is easy to use, extensible through plug-ins, and supports near-real-time search and statistical analysis. For example, Elasticsearch supports rich aggregations over large data sets and exposes a rich RESTful interface. The Elasticsearch community is also very active, and many problems can be solved quickly there.
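As a concrete example of the aggregations and RESTful interface mentioned above, a terms aggregation counting recent log events per status code is submitted as a JSON body to the `_search` endpoint. The index pattern `logs-*` and the field names are hypothetical, and this sketch only builds the request body; it does not contact a cluster:

```python
import json

# Request body for: POST /logs-*/_search on an Elasticsearch cluster.
query = {
    "size": 0,  # return only the aggregation, not the matching documents
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "by_status": {"terms": {"field": "status"}},
    },
}
body = json.dumps(query)
print(body)
```

The same HTTP-plus-JSON shape covers search, aggregation, and cluster administration, which is a large part of why Elasticsearch is considered easy to adopt.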

Li Wuping: First of all, Rizhiyi has always been a deep user of Elasticsearch. Many of our customers have very large data volumes: product clusters are deployed across hundreds of machines and generate tens of terabytes of new data every day. Because our products are mostly deployed in customers' production environments, fault handling and debugging are much harder than when a company runs Elasticsearch internally, so our requirements for functionality, performance, and stability are very high.

Our work on Elasticsearch proceeds along the following lines:

The first is optimizing Elasticsearch's functionality, performance, and stability. Problems surface during actual use. For example, Elasticsearch does not allow field type conflicts: when indexing, it checks whether a new field's type conflicts with existing mappings across the entire cluster, which can cause data to pile up and the cluster to freeze. We optimize Elasticsearch for specific application scenarios, including the mapping update logic, long-term index storage, and the shard allocation logic for multiple storage paths. In addition, we have improved and optimized the Lucene search library that Elasticsearch calls, touching hundreds of source files and more than ten thousand lines of code.
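The mapping-conflict problem can be illustrated with a small checker in the spirit of what Elasticsearch does at index time: once a field's type is recorded, a later document that uses a different type for the same field is flagged. This is an illustrative sketch, not Elasticsearch's actual implementation:

```python
mapping = {}  # field name -> type recorded the first time the field is seen

def check_doc(doc):
    """Return the fields whose type conflicts with the recorded mapping."""
    conflicts = []
    for field, value in doc.items():
        recorded = mapping.setdefault(field, type(value))
        if recorded is not type(value):
            conflicts.append(field)
    return conflicts

print(check_doc({"status": 200, "path": "/cart"}))      # [] - first doc sets the types
print(check_doc({"status": "200 OK", "path": "/pay"}))  # ['status'] - int vs str
```

In a real cluster this check happens against cluster-wide mapping state, which is why a flood of conflicting log documents can stall indexing, as described above.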

Second is specializing the search engine for logs. Elasticsearch itself is a general-purpose search engine, not a dedicated log analysis system, so it carries many features that log analysis does not need. Logs, for example, are immutable and rarely need updates or joins. Indeed, because Elasticsearch targets a wider range of scenarios, some of the advantages mentioned above come with limitations.

Rizhiyi's log search engine greatly improves log search performance. The main improvements are as follows:

Optimizing field conflicts. Elasticsearch is schemaless, but schemaless does not mean no schema: field conflicts still occur. This is particularly problematic with log data, which comes in many different formats. Rizhiyi does its own processing in this area.

Improving statistical analysis and the RESTful interface. Elasticsearch's statistical analysis and RESTful interface can become a source of "JSON hell" when you face complex analysis requirements. Here, too, Rizhiyi has made corresponding improvements.
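For a concrete taste of that "JSON hell": even a modest requirement such as "error count per service, per 5-minute bucket, with one sample hit" already needs several levels of nested aggregations. The index, field, and aggregation names below are hypothetical:

```python
nested = {
    "size": 0,
    "aggs": {
        "per_5m": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "5m"},
            "aggs": {
                "per_service": {
                    "terms": {"field": "service"},
                    "aggs": {
                        "errors": {"filter": {"range": {"status": {"gte": 500}}}},
                        "sample": {"top_hits": {"size": 1}},
                    },
                },
            },
        },
    },
}

def depth(node):
    """Count how deeply 'aggs' blocks nest in a request body."""
    if not isinstance(node, dict):
        return 0
    return max(((k == "aggs") + depth(v) for k, v in node.items()), default=0)

print(depth(nested))  # 3 levels of nested aggregations for one question
```

The response mirrors this nesting, so client code must walk bucket arrays inside bucket arrays, which is what a pipeline-style query language is designed to avoid.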

Real-time behavior and performance are also the focus of improvement and optimization for large-scale clusters. The engine we redeveloped has been reworked and optimized to varying degrees in its replica strategy, segment merging strategy, deduplication strategy, and DocValues-based statistics.

If you want to explore our Elasticsearch optimization experience further, please join me at the CNUTCon log processing track in Shanghai in November.

Elastic recently went public on the NYSE, one of the few successful listings in the open-source software field. The listing will further improve Elastic's brand influence, financial strength, and ability to attract talent. With that comes greater investment, which will certainly increase the influence of search engines in the technology community and the acceptance of search-engine-based log analysis among IT professionals. The sector should continue to flourish, and I expect more new scenarios to emerge.