This article is based on "The TiDB Tool Chain and Ecosystem", a talk given by Liu Yin, head of PingCAP's commercial product team, at TiDB DevCon 2018 earlier this year. It introduces TiDB's peripheral tools and ecosystem in detail.
Good afternoon everyone, my name is Liu Yin. At PingCAP I am mainly responsible for the product development of TiDB's commercial tools, and I also do SRE work for the company. This afternoon I will introduce the peripheral tools and ecosystem around TiDB.
Today's talk covers several areas. First is the deployment of TiDB, which is the first thing many TiDB users care about. Next, I will introduce TiDB's data import tools and data migration/synchronization tools, as well as tools for management, configuration, and data visualization.
Everyone is probably familiar with TiDB's architecture. TiDB is a distributed system composed of several modules that depend on and coordinate with each other to form a cluster, and together they constitute the TiDB database. Compared with a standalone database such as MySQL, such an architecture is not so easy for users to deploy, operate, and maintain. So let's first look at how to quickly deploy a TiDB cluster instance. We recently published the pingcap/tidb-docker-compose project, which makes it very easy to run TiDB in a local development or test environment — a single docker-compose up command brings everything up.

docker-compose is a very convenient tool in the Docker ecosystem. It lets us describe all of TiDB's components, including its monitoring and visualization tools, in a single yaml file, which is very convenient. It can start not only from our official Docker images, but also from local binaries. For example, if I compile a special build on my own machine, I can build a local image directly and start from it, and it even supports compiling the source code on the fly and starting from that. So this is very handy for our own development and testing. In addition, we provide a very simple configuration file. For example, if I don't want to run the default 3 TiKVs but want 5 or more, a single change to the configuration gets it done.
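As a minimal sketch (assuming Docker and docker-compose are already installed locally; the repository URL is the project just mentioned), bringing up a local cluster looks roughly like this:

```bash
# Clone the compose project and start a local TiDB cluster in the background.
git clone https://github.com/pingcap/tidb-docker-compose.git
cd tidb-docker-compose
docker-compose up -d

# TiDB serves the MySQL protocol on port 4000 by default.
```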
For deploying and operating a production environment, you often face a large-scale cluster, and the docker-compose approach is not enough. We recommend the Ansible-based deployment we provide. The user first describes the desired TiDB cluster topology in an Inventory file, and then runs the ansible-playbook scripts we provide to quickly deploy and operate a production TiDB cluster. Many of our online users deploy this way today.
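As a rough sketch of that workflow — the host IPs are placeholders, and the group and playbook names follow tidb-ansible conventions as I recall them, so double-check them against the project — it looks something like this:

```bash
# inventory.ini: describe the cluster topology with host groups (IPs are placeholders).
cat > inventory.ini <<'EOF'
[pd_servers]
172.16.10.1
172.16.10.2
172.16.10.3

[tidb_servers]
172.16.10.1
172.16.10.2

[tikv_servers]
172.16.10.3
172.16.10.4
172.16.10.5

[monitoring_servers]
172.16.10.1
EOF

# Check the environment, deploy the binaries and configs, then start the cluster.
ansible-playbook bootstrap.yml
ansible-playbook deploy.yml
ansible-playbook start.yml
```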
TiDB Ansible not only deploys clusters on bare metal, but also supports deployment on the cloud. For example, with the components Ansible provides, we can create a TiDB cluster on AWS / Azure / GCP with one click, and domestic public cloud platforms will be supported in the future. Second, the cluster topology can be customized to the user's needs in fine detail, including the deployment of some of TiDB's peripheral tools, such as the TiDB Binlog component. Third, it provides configuration management: the many parameters of TiDB and TiKV are integrated so that the configuration of the whole cluster can be managed in one place. In addition, we have made a series of optimizations to the operation and maintenance scripts, which makes deploying a large-scale cluster extremely convenient. I would also like to mention that during the Ansible deployment process we run strict checks on the hardware and system environment. Some users, for testing purposes, may use slower mechanical disks that do not meet the minimum requirements for running TiDB, so we enforce minimum requirements and prompt interactively during installation.
TiDB is a distributed database that scales elastically and horizontally. It is designed for the cloud from the ground up, and we have embraced containers from the very beginning. I believe everyone knows the advantages of containers well. First, they provide a consistent environment: users do not need to adapt to a variety of different systems or build runtime binaries separately for each. Second, containers are very convenient to start and run, whether in a development environment or a production deployment. Third, containers provide resource isolation: through capabilities of modern operating systems such as namespaces and cgroups, resources inside and outside the container can be isolated and limited.
When it comes to containers, we have to mention container orchestration. We have done a lot of work to integrate TiDB with K8s, implementing automatic management of TiDB clusters, rapid deployment, scaling in and out, and automatic failure recovery on top of K8s. It also supports multi-tenant management on a cloud platform better: by limiting the resources of each tenant and using containers for isolation, we ensure tenants do not affect each other, so a user running a high-load query or heavy writes will not affect other users' instances on the same host. However, TiDB's storage is stateful. When deploying on K8s, how to manage stateful services, meet the demanding requirements on storage IOPS and latency, and still guarantee high availability of the service becomes a problem.
With the native storage solution provided by K8s, a PV means attaching external storage, that is, network storage. But for a database system, especially under heavy random reads and sequential writes, the performance of a network disk cannot meet the requirements. So from the very beginning, the design of Cloud TiDB has mainly explored a local PV solution for K8s. K8s 1.9 has started to support Local PV to some extent; we implemented a Local Storage Manager on 1.7, and some of our work is gradually being merged into the upstream K8s community. In addition, TiDB itself is a complex cluster: besides storage, the network and the management of peripheral tools also need to be considered. To automate operations with this domain-specific knowledge, we created TiDB Operator. The Operator pattern was originally borrowed from CoreOS's Etcd Operator. TiDB Operator reduces the complexity of deploying and operating TiDB and implements automatic scaling and failover. At the same time, the Operator manages multiple TiDB clusters on K8s simultaneously — this is how unified multi-tenant management is achieved on the two public clouds Tencent Cloud and UCloud. The Local PV mechanism we implemented is essentially a unified management of all local disks in the cluster, giving them a life cycle so that they can participate in K8s scheduling as a type of resource.

Meanwhile, newer versions of K8s are evolving toward being the operating system of the cloud, with increasingly open resources and APIs, so we do not need to modify K8s itself; we extend it to implement our own scheduling logic. For example, we use K8s affinity to let services of the same kind run on the same physical machine and make full use of hardware resources. Conversely, PD and TiKV must not share the same SSD, otherwise their IO would interfere with each other, so we use anti-affinity to keep PD and TiKV apart as much as possible when scheduling. Another scheduling example: the TiDB cluster itself understands the geographic attributes of nodes, and PD schedules at the data level according to them, trying to spread the replicas of the same piece of data across regions. So when K8s deploys TiDB nodes, it also needs to take regional attributes into account — for example, deploying a cluster's nodes across regions and availability zones and passing the geographic information to TiKV and PD, so that replicas are as dispersed as possible. This knowledge is not built into K8s, so we extend the K8s scheduler to pass these principles into it.
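To make the anti-affinity point concrete, here is a minimal sketch of the kind of K8s stanza involved; the labels and values are assumptions for illustration, not the Operator's actual manifests:

```bash
# Illustrative pod spec fragment: keep TiKV pods off any host already running a PD pod.
cat > tikv-anti-affinity.yaml <<'EOF'
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: pd                       # hypothetical label for PD pods
      topologyKey: kubernetes.io/hostname
EOF
```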
The Operator contains our TiDB-specific Controller and Scheduler extensions, but that alone is not enough. We need to wrap another layer on top of it to expose a unified operations and management interface: this is Cloud Manager. Cloud Manager exposes standardized interfaces to the cloud platform's front-end console, so that all K8s and TiDB cluster resources can be managed end-to-end from the front end.
The DBaaS architecture diagram shows the layered structure of Cloud TiDB. The bottom layer is the container cloud; the middle layer is K8s' own service management and API Server. We extend it with various Controllers and Schedulers plus a Volume Manager for local storage, and finally expose everything through the RESTful API provided by Cloud Manager. You can easily attach a front-end Dashboard, or use the CLI directly, to manage TiDB cluster instances in a unified way.
This picture shows some of the details mentioned earlier. The left half is Kube's own components, and the right half is our extensions. We have also defined some TiDB resource types as CRDs (Custom Resource Definitions). For example, the TiDB Cluster resource object can describe how many TiKVs and how many TiDBs to start; there is also a series of objects such as TiDB Set / TiKV Set / PD Set to describe the configuration of a particular service.
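As an illustration of what applying such a custom resource might look like, here is a minimal sketch; the apiVersion, kind, and field names are my assumptions about the schema, not an exact copy of the Operator's CRD:

```bash
# Declare and apply a hypothetical TidbCluster custom resource.
cat > tidb-cluster.yaml <<'EOF'
apiVersion: pingcap.com/v1alpha1   # assumed group/version
kind: TidbCluster
metadata:
  name: demo-cluster
spec:
  pd:
    replicas: 3
  tikv:
    replicas: 5
  tidb:
    replicas: 2
EOF
kubectl apply -f tidb-cluster.yaml
```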
This is a screenshot on Tencent Cloud.
And this is a screenshot on UCloud.
Both of these products are now in public beta; interested readers are welcome to check them out.
In addition, we provide a Helm chart installation method for the Operator. Using the Helm tool, you can bring up a full set of TiDB instances through the Operator with one click.
This works much like installing an RPM package, except on K8s: it deploys the services and manages the dependencies among them. With a single command you get the official Cloud TiDB core components. If you have a K8s cluster, or are using one provided by a public cloud, you can quickly run TiDB Operator and a TiDB cluster this way.
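A sketch of what that looks like; the chart paths, release names, and namespaces here are illustrative assumptions (Helm 2 syntax), so consult the chart's README for the exact invocation:

```bash
# Install the Operator first, then a TiDB cluster from the bundled charts.
helm install ./charts/tidb-operator --name tidb-operator --namespace tidb-admin
helm install ./charts/tidb-cluster  --name demo-cluster  --namespace tidb
```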
This is an example of the configuration. Unpack the chart archive and you will find the corresponding yaml configuration file.
Each line of the configuration carries a detailed comment. For example, you can set parameters such as the number of replicas, the CPU and memory limits, how many TiDB instances to start, how many TiKV instances to start, and so on.
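For example, a values file along these lines (the key names are illustrative, not the chart's exact schema) pins the replica counts and resource limits and is applied with -f:

```bash
# Illustrative overrides: instance counts and CPU/memory limits per component.
cat > my-values.yaml <<'EOF'
tikv:
  replicas: 5
  resources:
    limits:
      cpu: "8"
      memory: 16Gi
tidb:
  replicas: 2
  resources:
    limits:
      cpu: "4"
      memory: 8Gi
EOF
helm upgrade demo-cluster ./charts/tidb-cluster -f my-values.yaml
```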
That covers the deployment tools. In the next part we move on to the tools around TiDB, some of which you may already have used.
The first is Syncer, a small tool already used in many production environments. It is a real-time synchronization tool between MySQL and TiDB. The principle is very simple: it pretends to be a MySQL slave, pulls binlog from the upstream MySQL in real time, restores it to SQL, and replays it downstream (into TiDB).
Syncer supports simple rule-based filtering, and also supports merging sharded databases and tables. We can run multiple Syncers at the same time to synchronize multiple upstream MySQL databases into one large TiDB cluster.

The main features of Syncer: first, it supports synchronization by GTID. What is GTID? It is a feature of MySQL's own replication mechanism. MySQL master-slave replication originally described the synchronization position by binlog pos (file name + offset), but that design has an obvious flaw. Consider a scenario with 1 master and 2 slaves: when the master goes down, one slave is promoted to be the new master and the other slave continues to replicate from it. That only works if the binlog pos of the new master lines up with that of the old master, but different MySQL instances record binlog differently, so a globally unique ID is needed to identify each binlog entry — that is GTID. GTID support is solid from MySQL 5.6 onward, and most production environments enable it. Besides pos-based synchronization, Syncer also supports GTID. Syncer can synchronize from public cloud RDS as well; for example, we have tested extensively against Alibaba Cloud and Tencent Cloud. On cloud platforms, failures or maintenance of backend machines make master-slave switches more frequent, while the virtual IP stays unchanged and the user perceives nothing — so if Syncer did not support GTID well, the data would become inconsistent as soon as a switch happened.

The second feature is the merging of sharded databases and tables. Whether the upstream is split by database, split by table, or a mix of both, Syncer supports it well, described through the configuration file. There is also the question of synchronization performance. Because binlog is a one-way data stream, synchronizing it with a single thread is simple but may perform poorly. With multiple threads, we must respect the causal order of operations on the same row: unrelated rows can be applied in parallel, while related rows must be applied sequentially. In MySQL, each binlog event is a transaction containing operations on multiple tables and rows, so Syncer splits the transaction and applies it in parallel. The price is that Syncer does not preserve upstream transaction atomicity during synchronization, but eventual consistency is not a problem. Syncer also supports some simple filtering rules — you can include or exclude specified databases or tables — as well as simple table-name mapping transformations.
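To tie these features together, here is a sketch of a Syncer configuration; the key names follow the syncer.toml conventions as I remember them and should be treated as illustrative:

```bash
# Illustrative syncer.toml: act as a MySQL slave of the upstream below.
cat > syncer.toml <<'EOF'
server-id = 101
worker-count = 16            # parallel workers applying rows downstream
meta = "./syncer.meta"       # persisted binlog position / GTID set

[from]                       # the upstream MySQL (or cloud RDS) instance
host = "127.0.0.1"
port = 3306
user = "syncer"
password = ""

replicate-do-db = ["order_db"]   # simple include filter

# Merge sharded tables order_db.order_* into a single TiDB table.
[[route-rules]]
pattern-schema = "order_db"
pattern-table  = "order_*"
target-schema  = "order_db"
target-table   = "order"
EOF
```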
In a company's early days, the business may spread quickly, with each business line using its own MySQL database and data isolated between them. Later, as the business grows more complex, slave libraries may be attached to MySQL, dedicated to data analysis scenarios so that they do not affect the online reads and writes on the primary. As things develop further, data analysis has to cross business lines, and cross-database statistical queries such as Joins and subqueries become basically impossible. In this scenario, we can use one TiDB cluster as the slave of all the online MySQL instances and let Syncer do the synchronization. The data analysis team can then run complex join queries and analysis in TiDB, no differently than in MySQL. And because Syncer's synchronization is close to real time, the downstream analysis can be very fresh.
Next, let me introduce TiDB Binlog. It should be stated up front that TiDB Binlog is essentially different from MySQL binlog: our binlog format is not MySQL's. TiDB uses a self-describing protobuf-format binlog, each TiDB Server writes its own binlog, and one transaction is one binlog event. A small program called Pump then collects the binlogs and writes them into a Kafka cluster. Why not let Pump simply write to the local disk? Because keeping the log only on a local disk carries the risk of a single point of failure, so we use Kafka or a distributed file system to solve that. Downstream, a component called Drainer consumes the data from Kafka. Drainer's job is to restore the binlog to SQL in transaction order and synchronize it to a downstream database, such as MySQL or another TiDB cluster; it can also write to a file stream to implement incremental data backup.
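As an illustrative sketch only — the key names below are my assumptions about how a Drainer pointing at a downstream MySQL might be described, not the exact configuration schema:

```bash
# Hypothetical drainer.toml: consume binlog from Kafka and replay it downstream.
cat > drainer.toml <<'EOF'
data-dir = "./drainer-data"

[syncer]
db-type = "mysql"            # downstream type: mysql / tidb / file ...

[syncer.to]                  # the downstream database to replay SQL into
host = "127.0.0.1"
port = 3306
user = "drainer"
password = ""
EOF
```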
What Drainer does is actually somewhat difficult, because TiDB, unlike MySQL, is a distributed system. Think about it. First, how do we guarantee the integrity of the transaction? As you know, TiDB transactions are two-phase: a transaction may commit successfully while its binlog fails to be written, or the binlog may be sent out while the transaction fails to commit — either case leads to inconsistency. Second, how do we restore the causal order among distributed transactions? TiDB transactions are submitted to TiKV for execution, each transaction runs in two phases, the transaction sequence number is issued by PD, and multiple transactions may execute concurrently on the same TiDB node, so the generated binlog transaction sequence numbers are not guaranteed to be monotonically increasing — how do we restore the order and still output in real time? Third, the network itself may be unreliable: a transaction written first on the TiDB side may arrive later after network transmission, and with multiple machines producing binlogs, what finally reaches Drainer is out of order — how do we restore the order? This sounds a bit like TCP, but it is not the same: the global transaction sequence numbers in TiDB are not contiguous, so when Drainer receives a binlog it never knows what the sequence number of the next one will be. As for the implementation, we designed a fairly sophisticated dynamic window algorithm; due to time constraints I won't go into it here, but you can think about it if you are interested.
In terms of scenarios, TiDB Binlog lets us do many things. For example, we can hang another slave cluster off the TiDB cluster, or synchronize to MySQL as a slave database. Some customers are cautious when they first put TiDB online: they start by attaching TiDB behind MySQL via Syncer as a slave, run it for a while to verify there are no problems, then switch TiDB to be the primary and use binlog to synchronize back to MySQL in reverse. After running that for a while and feeling it is safe, they remove the MySQL slave and thus complete a grayscale rollout. We can also use binlog to synchronize to other heterogeneous databases, data warehouses, or distributed storage products. We are developing our own OLAP storage engine as well, and in the future real-time data synchronization to it will also be done through binlog — it only requires writing a different Adapter plug-in for Drainer.
TiDB Binlog can also be used for incremental data backup: find the most recent full backup point, then replay the binlog from that period onward to restore the data to any point in time. There are other scenarios as well. For example, some businesses want to implement event subscription based on binlog: by watching the binlog, they can push a message into Kafka whenever a particular piece of business data changes, similar to a trigger. Since the binlog is described in a general protobuf format, it can also drive stream computing engines to fulfill asynchronous/streaming analysis requirements. Binlog has a wide range of usage scenarios and can be used flexibly in real business.
Another tool to introduce is Lightning. You may not have used Lightning yet, because it is still in the final testing and optimization stage. It is a fast import tool for TiDB. The tool we provided previously was MyDumper, a general data export tool for MySQL, which is paired with MyLoader; on that basis we built TiDB Loader, but that approach essentially executes SQL — the data files produced by MyDumper are just a lot of SQL text. When importing into TiDB with Loader, you may feel that the import is slow. That is because importing this way forces the Regions in TiKV's underlying storage to keep splitting and moving; moreover, data is generally written sequentially and table primary keys are usually increasing, which creates write hotspots, so not all TiKV nodes can be mobilized at the same time and the advantage of the distributed system is lost. So how does Lightning do it? First, it converts the input data directly into ordered KV key-value pairs, bypassing the SQL parser and optimizer, and processes them in batches. With PD's help, it pre-computes the Region distribution of the newly inserted data, then directly generates SST files and ingests them into TiKV — very close to a physical data import. In our internal tests it is 7 to 10 times faster than the old Loader method, and 1 TB of data can be imported within 5 hours. We expect to release it soon.
Lightning takes MyDumper-format files as input. The conversion from SQL to KV is done first, carried out in parallel by a number of distributed workers. The optimizer is bypassed and a continuous KV stream is generated, globally sorted through a dedicated Ingest Server. Meanwhile, the Regions can be pre-computed and the target nodes scheduled in advance via PD, so the whole process is very efficient.
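A rough sketch of the intended workflow, assuming Lightning keeps the MyDumper front end; the lightning invocation is illustrative, since the tool was still being finalized at the time of this talk:

```bash
# 1. Export the source data with mydumper (a standard MySQL export tool).
mydumper -h 127.0.0.1 -P 3306 -u root -B order_db -o /data/export

# 2. Feed the export to Lightning, which converts rows into sorted KV pairs
#    and ingests generated SST files into TiKV directly.
tidb-lightning -config tidb-lightning.toml   # config path is a placeholder
```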
Next is a commercial tool called Wormhole. You can think of it as a Syncer with a control panel, but it is more powerful: it supports multi-source and multi-destination data synchronization, and it is itself a distributed service with high availability and parallel execution. It has better support for sharded databases and tables, and its configuration is visual. Its pre-synchronization checks are also stricter; for example, when synchronizing from MySQL it checks in advance that the table structure is compatible with TiDB and that row-mode binlog is enabled, to avoid errors being discovered only after the job is running. Wormhole also supports some simple ETL transformation rules, such as simple mapping computations and UDFs on certain fields of a table during synchronization. For the merging of sharded tables, for instance, if each shard has its own auto-increment primary key, the merged rows may collide on the primary key when inserted into TiDB. Wormhole can merge the primary keys through configuration, or add a new field to act as the real primary key, keeping the original table's primary key by name but dropping its uniqueness constraint.
I took some screenshots of the interface. You can see the progress of the whole synchronization process, including the speed of both the full and the incremental synchronization, and you can pause it at any time or perform specific operations on it. For multi-source/multi-destination synchronization, with a configuration like this I can read out all the table structures in the database directly and decide, on the interface, which tables and databases to synchronize and how the fields map.
The third part is about data visualization in TiDB. The TiDB Vision project is open source; it implements data visualization on top of the interfaces provided by PD.
From the figure we can clearly see the distribution of Regions across the nodes and the relationship between Region leaders. Each segment of the ring represents a TiKV store, each small cell of a store represents a Region, and green marks the leader. The line segments in the middle are animated at runtime: Region splits, transfers, and leader transfers are each shown with animations in different colors. This reflects the real scheduling process of PD, which we can observe intuitively through the visualization. Another thing is hotspots, which may not show in this picture: if a Region becomes a hotspot, you can see red dots on the interface. In addition, the edge of the ring displays the network traffic scheduled by PD, and we also show some TiKV traffic information in real time. If a node goes down, its segment on the edge takes on a particular color; for example, when a TiKV goes offline — those familiar with TiDB know it does not go offline immediately, but first becomes Offline and then Tombstone — this can be seen intuitively in the figure. The tool itself is very simple and lives in the TiDB Vision open-source project. Interested students are welcome to contribute more skins to make the display cooler and more helpful for monitoring the business.
This is the enterprise Dashboard we are working on, which may look different from Grafana and the existing open-source interfaces you have seen. Here are screenshots of some parts of it. You can see the status of every process on every node, including the node's runtime logs and the health of its services. Through the Dashboard, the topology and running status of the whole cluster can be displayed. In this interface you can choose how many TiDB and TiKV nodes to create and select their specifications; on the left you can select a new version of the TiDB components to upgrade to and perform things like a rolling upgrade.
Finally, let's talk about TiDB monitoring. In the backend we use the well-known Prometheus project to store the metrics of the various services in the database. Each TiDB and TiKV component reports its status to Prometheus (actually Prometheus pulls it), and we collect the status of each host through Node Exporter. For K8s clusters, metrics are collected through cAdvisor and aggregated into Prometheus. Grafana is used for monitoring visualization. If you click the edit button on a Grafana panel that we have preconfigured, you can see the corresponding Prometheus query expression: with a SQL-like query statement, you can easily pull monitoring data out of Prometheus and feed it into your own monitoring platform. Alertmanager is also a tool in the Prometheus ecosystem: it receives alert events from Prometheus and pushes them out through various alerting channels. For logs, we use the popular EFK stack: in a K8s cluster, the logs of each Pod are collected, stored in ES, and then displayed through Kibana.
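For example, the same kind of expression can be pulled over Prometheus' standard HTTP query API; the metric name below is an assumption used purely for illustration:

```bash
# Query Prometheus directly with the same style of expression Grafana uses.
curl -s 'http://prometheus:9090/api/v1/query' \
     --data-urlencode 'query=rate(tidb_server_query_total[1m])'   # metric name is illustrative
```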
These are a few screenshots of monitoring, which may be familiar to everyone.
Finally, a brief word on the TiDB ecosystem. TiDB's biggest advantage is its compatibility with the MySQL protocol, so not only command-line clients, but also MySQL's own tools such as MySQL Workbench, traditional commercial tools such as Navicat, and the old web management tool phpMyAdmin can connect directly to a TiDB instance. We keep optimizing TiDB's compatibility, because it still differs from MySQL in some ways — these tools may read certain MySQL system tables, and we try our best to stay compatible with MySQL. There are also convenient features such as grabbing the schema and drawing ER diagrams, which we also hope run smoothly on TiDB, so that users accustomed to the various MySQL management tools can switch to TiDB very smoothly.
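For example, the stock mysql command-line client connects to TiDB exactly as it does to MySQL (4000 is TiDB's default port for the MySQL protocol):

```bash
# Connect with the ordinary MySQL client; tidb_version() is TiDB's own helper function.
mysql -h 127.0.0.1 -P 4000 -u root -e "SELECT tidb_version();"
```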
This is the main content I introduced today, thank you everyone!
Extended reading:
Wu Di: TiDB’s Practice in Today’s Toutiao
Li Yulai: Query Cache in TiDB