In the era of big data, enterprises place ever higher demands on DBAs, while NoSQL, a technology that has emerged in recent years, draws more and more attention. Based on the DBA and big-data operations work that Mr. Meng Xianyao is responsible for, this article shares two main threads: 1. the evolution of the company's KV storage architecture and the operational problems that had to be solved along the way; 2. how to select the right NoSQL solution, and some thoughts on its future development.
According to official statistics, as of this writing (April 20, 2018) there are 225 NoSQL solutions. Any single company uses only a very small subset of them; the products marked in blue in the picture are the ones Getui currently uses.
The origin of NoSQL
In 1946 the first general-purpose computer was born, but it was not until the RDBMS appeared in 1970 that a universal data storage solution was found. In the 21st century, the DT (data technology) era has made data volume the hardest problem. Google and Amazon each proposed their own NoSQL solutions in response; Google, for example, published Bigtable in 2006. The term NoSQL was formally coined at a technology conference in 2009, and there are now 225 such solutions.
NoSQL differs from the RDBMS mainly on two points. First, schemaless flexibility: NoSQL supports very flexible schema changes. Second, scalability: a native RDBMS suits only a single machine or a small cluster, while NoSQL is distributed from the start, which solves read/write and capacity scalability. These two points are also the root causes of NoSQL's rise.
There are two main means of achieving distribution: replication and sharding. Replication solves read scalability and HA (high availability), but not write or capacity scalability; sharding solves read, write, and capacity scalability alike. Typical NoSQL solutions combine the two.
Sharding is about partitioning the data, mainly either by range (such as HBase's RowKey partitioning) or by hash. To address the monotonicity and balance problems of hash partitioning, the industry today mainly uses virtual nodes; Codis, described later, also uses them. A virtual node in effect adds a layer of mapping between a data shard and the server hosting it.
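As an illustration (not Getui's actual code), a minimal Python sketch of hash sharding with virtual nodes might look like this; the node names and vnode count are made up:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes_per_node=128):
        self._ring = []  # (hash, physical node) pairs, sorted by hash
        for node in nodes:
            for i in range(vnodes_per_node):
                self._ring.append((self._hash(f"{node}#vn{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after hash(key).
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]

ring = ConsistentHashRing(["redis-a", "redis-b", "redis-c"])
print(ring.node_for("device:12345"))  # maps the key to one of the three nodes
```

Because each physical node owns many small slices of the ring, adding or removing a node moves only a proportional fraction of the keys, which is the balance and monotonicity property mentioned above.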
NoSQL solutions are currently classified mainly by data model and access method.
A few commonly used NoSQL solutions
The scale of Getui's Redis system is shown in the figure below. Here are some of the problems we ran into while operating it.
The first topic is the evolution of the technical architecture. Getui began as a push-notification service for app developers. Before 2012 the business volume was relatively small; we used Redis for caching and MySQL for persistence. From 2012 to 2016, as the push business grew rapidly, a single node could no longer cope. Since MySQL could not deliver the required QPS and TPS, we developed our own Redis sharding solution. We also wrote our own Redis client and used it to implement basic cluster functions: configurable read/write ratios, monitoring and isolation of faulty nodes, slow-query monitoring, and health checks on each node (a rough sketch of the routing idea follows below). However, this architecture gave little consideration to operational efficiency and lacked operations tooling.
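As referenced above, here is a toy Python sketch of one shard's master/slave routing inside such a client; the class and addresses are hypothetical, and the real client also handled sharding and background health checks:

```python
import random

import redis  # redis-py, assumed available

class ShardedClient:
    """Toy sketch: writes always go to the master; a configurable share of
    reads goes to the slave, and a slave that errors out is isolated until
    a health checker (not shown) marks it good again."""

    def __init__(self, master, slave, slave_read_ratio=0.5):
        self.master = redis.Redis(*master)
        self.slave = redis.Redis(*slave)
        self.slave_read_ratio = slave_read_ratio
        self.slave_healthy = True

    def set(self, key, value):
        return self.master.set(key, value)  # writes only hit the master

    def get(self, key):
        if self.slave_healthy and random.random() < self.slave_read_ratio:
            try:
                return self.slave.get(key)
            except redis.RedisError:
                self.slave_healthy = False  # isolate the faulty node
        return self.master.get(key)

client = ShardedClient(("10.0.0.1", 6379), ("10.0.0.2", 6379),
                       slave_read_ratio=0.3)  # 30% of reads go to the slave
```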
When we set out to improve the operations tooling, we found that the Wandoujia team had open-sourced Codis, which gave us a good option.
The advantages of Getui's Codis+
Codis uses a proxy-based architecture that supports native Redis clients, offers web-based cluster operation and monitoring, and integrates Redis Sentinel. It improved our operational efficiency and made HA much easier to put into practice.
In the course of using it, however, we found some limitations, so we proposed Codis+, a set of functional enhancements to Codis.
First, adopt a 2N+1 replica scheme to remove the single point of failure at the Master.
Second, make Redis replication semi-synchronous: set a threshold so that a slave is readable only when it is within, say, 5 seconds of its master (see the sketch after this list).
Third, resource pooling: capacity can be expanded simply by adding servers, much like adding a RegionServer in HBase.
In addition, there are rack-awareness and cross-IDC features. Redis itself was designed for a single data center and does not consider these issues.
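For the semi-synchronous rule above, a hedged sketch of how the 5-second readability check could be made from the client side, using fields that `INFO replication` reports on a replica (this approximates the behaviour rather than reproducing Codis+ internals):

```python
import redis

MAX_SLAVE_LAG_SECONDS = 5  # the threshold from the Codis+ rule above

def replica_is_readable(replica: redis.Redis) -> bool:
    """Allow reads only if the replica's link to its master is fresh enough."""
    info = replica.info("replication")
    return (info.get("role") == "slave"
            and info.get("master_link_status") == "up"
            and info.get("master_last_io_seconds_ago", 10**6)
                <= MAX_SLAVE_LAG_SECONDS)
```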
So why not use native Redis Cluster? For three reasons. First, the native cluster couples routing/forwarding and actual data management into a single component, so a failure in one can cause data problems in the other. Second, in a large cluster, a P2P architecture takes a long time to converge to a consistent state; Codis's tree architecture does not have this problem. Third, native Redis Cluster had not yet been proven out by a large platform.
Beyond Redis, we are currently looking at a newer NoSQL solution, Aerospike, which we position as a replacement for part of our Redis clusters. Redis's problem is that all data lives in memory, which is costly; we hope to use Aerospike to lower the TCO. Aerospike has the following features:
First, Aerospike data can be stored in memory or on SSD, and it is optimized for SSD.
Second, resource pooling, which keeps driving operation and maintenance costs down.
Third, it supports rack awareness and cross-IDC synchronization, though these are enterprise-edition features.
At present two internal businesses are using Aerospike. In our own measurements, a single physical machine fitted with a single Intel 4600-series SSD can get close to 100,000 QPS. For businesses with large capacity but modest QPS requirements, Aerospike is a good way to save TCO. Basic usage is sketched below.
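For reference, basic access through the open-source Aerospike Python client looks like this; the host address, namespace, set, and bin names are placeholders, not our production settings:

```python
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}           # placeholder address
client = aerospike.client(config).connect()

key = ("test", "demo", "device:12345")              # (namespace, set, user key)
client.put(key, {"token": "abc123", "status": 1})   # write a record
_, meta, bins = client.get(key)                     # read it back
print(bins)                                         # {'token': 'abc123', 'status': 1}

client.close()
```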
During this NoSQL evolution we also ran into some operation and maintenance problems.
Standardized installation
We standardize three things: the OS, the Redis file and directory layout, and the Redis parameters, all implemented with SaltStack + CMDB (a toy rendering sketch follows).
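As a toy illustration of parameter standardization (the parameter set, paths, and CMDB fields here are assumptions; in practice this logic lives in SaltStack states fed by CMDB data):

```python
STANDARD_PARAMS = {          # organisation-wide defaults, applied everywhere
    "tcp-backlog": 2048,
    "repl-timeout": 60,
    "maxmemory-policy": "noeviction",
}

def render_redis_conf(cmdb_record: dict) -> str:
    """Merge the standard parameters with one instance's CMDB overrides."""
    params = {**STANDARD_PARAMS, **cmdb_record.get("overrides", {})}
    lines = [f"port {cmdb_record['port']}",
             f"dir /data/redis/{cmdb_record['port']}"]
    lines += [f"{k} {v}" for k, v in sorted(params.items())]
    return "\n".join(lines)

print(render_redis_conf({"port": 6380,
                         "overrides": {"maxmemory-policy": "allkeys-lru"}}))
```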
Capacity expansion and reduction
As the technical architecture evolved, expanding and shrinking capacity became steadily easier, partly because Codis alleviated some of the problems. And with Aerospike, these operations are very easy.
Do monitoring well to reduce operation and maintenance costs
Most operations engineers should read the SRE book ("Site Reliability Engineering: How Google Runs Production Systems") carefully; it puts forward many very valuable methodologies at both the theoretical and the practical level and is highly recommended.
The complexity of Getui's Redis monitoring
Three cluster architectures: self-developed, Codis 2, and Codis 3, each of which exposes its data differently.
Three types of monitoring objects: clusters, instances, and hosts. Metadata is needed to maintain their logical relationships and to aggregate them globally.
Three kinds of personalized configuration: among Getui's Redis clusters, some need multiple replicas and some do not; some nodes may let the cache fill up and some may not; persistence strategies differ too, with some clusters doing no persistence, some doing persistence, and some doing persistence plus off-site backup. These business characteristics demand great flexibility from our monitoring (a toy aggregation example follows).
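A toy example of why that metadata matters: rolling instance-level metrics up to the cluster level needs the instance-to-cluster mapping. All values below are invented; in reality the mapping comes from the CMDB and the metrics from the collection program:

```python
from collections import defaultdict

cmdb = {"10.0.0.1:6379": "cluster-a",     # instance -> cluster (from CMDB)
        "10.0.0.2:6379": "cluster-a",
        "10.0.0.3:6379": "cluster-b"}
instance_qps = {"10.0.0.1:6379": 12000,   # per-instance metrics (collected)
                "10.0.0.2:6379": 9000,
                "10.0.0.3:6379": 30000}

cluster_qps = defaultdict(int)
for instance, qps in instance_qps.items():
    cluster_qps[cmdb[instance]] += qps

print(dict(cluster_qps))  # {'cluster-a': 21000, 'cluster-b': 30000}
```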
Zabbix is a very complete monitoring system, and for about three years we used it as our main monitoring platform. But it has two shortcomings: it uses MySQL as its backend storage, which puts an upper limit on TPS, and it is not flexible enough. For example, when a cluster spans a hundred machines, aggregating its indicators is very difficult.
Xiaomi's open-falcon solves these problems, but it brings some new ones: alarm functionality is limited, strings are not supported, and some manual operations are needed. We later supplemented it with our own functionality and have not hit any major problems since.
The picture below shows Getui's operation and maintenance platform.
The first component is the IT hardware resource platform, which maintains physical information at the host level: which rack a host sits in, which switch it connects to, which floor of which data center it lives on, and so on. This is the foundation for rack awareness and the cross-IDC features.
The second is the CMDB, which maintains the software information on each host: which instances are installed, which clusters those instances belong to, which ports we use, and the personalized parameter configuration of each cluster, including its particular alarm policy. All of this is realized through the CMDB. The CMDB's data consumers include the Grafana monitoring system and our self-developed collection program; consumers are what keep the CMDB data alive, because static data with no consumers inevitably drifts out of sync.
The Grafana monitoring system aggregates data from multiple IDCs, so for daily operations we only need to watch the big screen.
SaltStack is used for automated releases, enforcing the standards and improving work efficiency.
The collection program is developed in-house and highly customized to the company's business characteristics (a minimal sketch follows). There is also an ELK stack (without Logstash; we use Filebeat instead) as the log center.
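A minimal sketch of such a collector, assuming our setup rather than describing it exactly: it reads `INFO` from one Redis instance and pushes two gauges to the local open-falcon agent's standard push API (by default on port 1988). Endpoint and metric names are illustrative:

```python
import json
import time
import urllib.request

import redis

def collect_and_push(host="127.0.0.1", port=6379):
    info = redis.Redis(host, port).info()
    now = int(time.time())
    payload = [
        {"endpoint": f"{host}:{port}", "metric": "redis.qps",
         "timestamp": now, "step": 60, "counterType": "GAUGE",
         "value": info["instantaneous_ops_per_sec"], "tags": ""},
        {"endpoint": f"{host}:{port}", "metric": "redis.used_memory",
         "timestamp": now, "step": 60, "counterType": "GAUGE",
         "value": info["used_memory"], "tags": ""},
    ]
    req = urllib.request.Request(
        "http://127.0.0.1:1988/v1/push",            # open-falcon agent push API
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    collect_and_push()  # would run once per minute from a scheduler
```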
Through all of the above, we built an entire monitoring system.
Let me describe a few of the pitfalls we hit while building all this.
First, a master-slave resync (full resynchronization) makes the pressure on the master node spike, to the point where it can no longer serve requests.
There are many causes of a master-slave resync.
A low Redis version makes resyncs more likely: Redis 3 greatly reduces their probability compared with Redis 2, and Redis 4 supports partial resynchronization even after a node restart. Redis itself has improved a lot here.
We mainly run 2.8.20, which is relatively prone to full resyncs.
A Redis master-slave resync is generally triggered by one of the following conditions.
1. repl-backlog-size too small: the default is 1MB, and heavy write traffic easily overruns this buffer. 2. repl-timeout: master and slave ping each other every ten seconds by default, and if no ping gets through within 60 seconds the pair resyncs; the cause may be network jitter or a master too overloaded to answer the packet. 3. tcp-backlog: the default is 511, but the operating system caps it at 128 by default; both can be raised moderately (we raise them to 2048), which tolerates a certain amount of network packet loss.
All of the above can trigger a master-slave resync, and the consequences are severe: the master's load soars until it cannot serve requests, the business marks the node as unavailable, response times grow, and every node on the master's host is affected. The parameter adjustments are sketched below.
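A sketch of applying the tuned replication parameters at runtime via redis-py (they should be persisted in redis.conf as well); the exact values follow the text above and are starting points, not universal recommendations:

```python
import redis

r = redis.Redis("127.0.0.1", 6379)
r.config_set("repl-backlog-size", "268435456")  # 256MB; 1MB default overflows easily
r.config_set("repl-timeout", "120")             # default 60s, with pings every 10s

# tcp-backlog (default 511) normally needs a redis.conf edit plus a restart,
# and is further capped by the kernel's net.core.somaxconn (default 128), so
# raise that too, e.g.:  sysctl -w net.core.somaxconn=2048
```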
Second, oversized nodes, which are partly a man-made problem. One cause is that nodes are split too slowly, much more slowly than the company's business grows. Another is too few shards: we use 500, Codis uses 1024, and native Redis Cluster uses 16384. If you build a self-developed distributed solution, set the shard count generously to avoid business growth exceeding your expectations. Once a node grows too large, persistence takes longer: to persist a 30G node, the host needs more than 30G of free memory, and if it does not have that, Swap is used and the host's persistence time increases dramatically, so persisting a 30G node can take 4 hours. The excessive load can in turn trigger a master-slave resync, setting off a chain reaction.
Having covered the pitfalls, let me next share a few practical cases.
The first case is a master-slave resync that happened two days before the Spring Festival, which is the peak period for the message push business. Briefly reconstructing the failure: large-scale message delivery drove the load up; pressure on the Redis master rose, TCP packets backed up, and the OS started dropping packets; the dropped master-slave ping packets tripped the 60-second repl-timeout threshold and the pair resynced; at the same time, because the nodes were too large, Swap and I/O saturation approached 100%. The remedy was very simple: we disconnected the slaves from the masters first. The root causes of the failure were, first, unreasonable parameters, most of which were defaults, and second, oversized nodes, which magnified the failure's impact.
The second case is a problem we hit recently with Codis, and it is a classic failure scenario. After a host died, Codis performed a master-slave switch; the business was unaffected, but when we tried to reattach the slave it could not connect and reported an error. The cause was not hard to find: a parameter set too small, again a default value (in our case the slave client-output-buffer-limit). While the slave pulls data from the master, new writes pile up in the master's buffer; if the slave cannot drain it in time, the buffer exceeds its limit, the master drops the connection, and the resync starts over in an endless loop.
Based on these cases, we put together a set of best practices.
One, configure CPU affinity. Redis is a single-process, single-threaded design, and letting it bounce between cores hurts CPU efficiency.
Two, keep node size under 10G.
Three, the host's free memory should exceed the largest node size plus 10G. A master-slave resync needs as much memory again as the node itself; if not enough is reserved and Swap is used, the resync will struggle to succeed.
Four, try hard not to use Swap. Answering a request in 500 milliseconds is still better than being dead.
Five, increase tcp-backlog, repl-backlog-size, and repl-timeout moderately.
Six, the Master does no persistence, and the Slave does AOF plus scheduled rewrites (a toy audit of rule two is sketched below).
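As a minimal illustration of checking rule two against a live instance (the other rules need host-level data that Redis 2.8 itself does not expose); the addresses are placeholders:

```python
import redis

TEN_GB = 10 * 1024 ** 3  # best-practice ceiling for a single node

def audit_node_size(host: str, port: int) -> None:
    """Flag nodes whose in-memory dataset exceeds the 10G ceiling."""
    used = redis.Redis(host, port).info("memory")["used_memory"]
    size_gb = used / 1024 ** 3
    if used > TEN_GB:
        print(f"{host}:{port}: {size_gb:.1f}G used, over 10G, plan a split")
    else:
        print(f"{host}:{port}: {size_gb:.1f}G used, within limits")

audit_node_size("127.0.0.1", 6379)
```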
Finally, some personal thoughts and suggestions. To choose the NoSQL that suits you, there are five selection principles:
1. Business logic. First understand the characteristics of your own business. If the data is KV-shaped, look among the KV stores; if it is a graph, look among the graph databases. This alone narrows the field by 70%-80%.
2. Load characteristics: QPS, TPS, and response time. Measure candidate solutions against these indicators: what does a single machine deliver under a given configuration? For Redis, given adequate hosts, 400,000 to 500,000 QPS on a single machine is entirely achievable.
3. Data scale. The larger the data scale, the more issues need considering and the fewer the options. At the hundreds-of-terabytes or petabyte level, there is hardly any choice beyond the Hadoop ecosystem.
4. Operation and maintenance cost: is monitoring available, and can the system be scaled out and in easily?
5. Others: are there successful cases, complete documentation and an active community, and official or corporate support? Once others have stepped on the pits, we can pass through smoothly; the cost of stepping on them ourselves is quite high.
Conclusion: on the interpretation of NoSQL, a quip has circulated online: from Know SQL in 1980, to Not only SQL in 2005, to No SQL! today. The development of the Internet is accompanied by the renewal of technical concepts and the improvement of the related capabilities, and behind the technical progress stands every engineer's continuous learning, careful thinking, and unremitting experimentation.