NoSQL modeling techniques for big data architects

1. Preface

To meet the demands of big data application scenarios, emerging architectures such as Hadoop and NoSQL, which differ fundamentally from traditional enterprise platforms, are rapidly gaining ground. A revolution in the underlying technology base inevitably affects the superstructure built on it: data models and algorithms. Simply copying a traditional model designed for normalized, structured relational databases onto the new engines is like cutting your feet to fit your shoes: it not only increases the difficulty and complexity of big data application development, it also fails to unlock the potential of the new frameworks.

How, then, should we build data models on NoSQL? The publicly available knowledge is either simple "denormalization", crude wide tables (all the fields that queries and applications need to access lined up side by side in a single structured table with many columns), or implementation details for specific tools and scenarios (such as the discussion of HBase row key design in "HBase: The Definitive Guide"). There is no methodology at the level of model architecture that can be followed the way design patterns are followed in programming.

When comparing different NoSQL databases, people usually rely on criteria other than features, such as scalability, performance, and consistency. Because these qualities are usually the original motivation for adopting NoSQL, they have been studied in depth both theoretically and practically, and fundamental results about distributed systems, such as the CAP theorem, also apply to NoSQL systems. NoSQL data modeling, on the other hand, has not been studied nearly as well and lacks the kind of systematic theory found in relational databases.

In this article, I make a relatively simple comparison of the NoSQL system families from the perspective of data modeling and briefly introduce several common modeling techniques.

2. NoSQL data model view

To explore data modeling techniques, we must first start with a systematic view of NoSQL data models, which can more or less reveal their development trends and the relationships between them. The following figure depicts a virtual "evolution" of the main NoSQL system families, namely key-value stores, BigTable-style databases, document databases, full-text search engines, and graph databases:

[Figure: the virtual "evolution" of the main NoSQL system families]

First of all, we should note that, generally speaking, SQL and the relational model were designed long ago for interaction with end users. This user-oriented nature has far-reaching consequences:

End users are often interested in aggregated report information rather than individual data items, so SQL does a great deal of work in this area.

Users, as natural persons, cannot be expected to explicitly control concurrency, integrity, consistency, or data type validity. This is why SQL puts so much effort into transactional guarantees, schemas, and referential integrity.

Software applications, on the other hand, are often not very interested in in-database aggregation and, at least in many cases, can control integrity and validity themselves. In addition, removing these features has an extremely important impact on the performance and scalability of the storage.

From here, the evolution of the new data models unfolds:

Key-value storage is a very simple but very powerful model. Many of the techniques described below are fully applicable to this model.

One of the most serious shortcomings of the key-value model is that it is poorly suited to processing keys by range. The ordered key-value model removes this limitation and significantly improves aggregation capabilities.

The ordered key-value model is very powerful, but it does not provide any framework for modeling values. In general, value modeling can be left to the application, but BigTable-style databases go further: they model values as a map-of-maps-of-maps, namely column families, columns, and timestamped versions.

Document databases propose two obvious improvements over the BigTable model. First, values can be declared with arbitrarily complex schemas, not just as a map-of-maps. Second, at least some products provide database-managed indexes. In this sense, full-text search engines can also be regarded as offering flexible schemas and automatic indexing. The main difference between them is that document databases group indexes by field name, while search engines group indexes by field value. It is also worth noting that key-value stores like Oracle Coherence are gradually evolving toward document databases by adding indexes and embedded entry processors.

Finally, the graph data model can be seen as an evolution of the ordered key-value model in a different direction. Graph databases allow business entities and the relationships between them (this thing depends on that thing) to be modeled very transparently, while hierarchical modeling techniques achieve comparable results on other data models. Graph databases are closely related to document databases, because many implementations allow values to be modeled as maps or documents.

3. General considerations for NoSQL data modeling

Unlike relational modeling, NoSQL data modeling usually starts from the specific queries the application will run:

Relational modeling is typically driven by the structure of the data at hand. The design mainly revolves around "What answers do I have?"

NoSQL data modeling is usually driven by the access patterns of a specific application, such as the types of queries that need to be supported. The design mainly revolves around "What questions do I have?"

NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. In this article I introduce several well-known data structures that, while not unique to NoSQL, are very useful for practical NoSQL modeling.

Data replication and denormalization are first-class citizens.

Relational databases are not very convenient for modeling and processing hierarchical or graph data. Graph databases are obviously the perfect solution in this area, but in fact most NoSQL stores are also quite good at such problems. That is why this article has a separate section on modeling hierarchical data.

Although the data modeling techniques are largely independent of any specific implementation, I still list the products I had in mind while writing this article:

Key-value stores: Oracle Coherence, Redis, Kyoto Cabinet

BigTable style database: Apache HBase, Apache Cassandra

Document database: MongoDB, CouchDB

Full text search engine: Apache Lucene, Apache Solr

Graph databases: Neo4j, FlockDB

4. Conceptual techniques

This section specifically introduces the basic principles of NoSQL data modeling.

1. Denormalization

Denormalization can be defined as copying the same data into multiple documents or tables in order to simplify or optimize query processing, or to fit the user's data into a particular data model. Most of the techniques in this article use denormalization in one way or another.

Generally speaking, denormalization is used for the following trade-offs:

The trade-off between the volume of data (or IO) per query and the total volume of data. Denormalization allows all the data required by a query to be combined and stored in the same place. This usually means that different queries over the same data access different combinations of it, so the data must be duplicated, which increases the total data volume.

The trade-off between processing complexity and total data volume. Normalization during modeling, and the joins in the corresponding queries, significantly increase the complexity of the query processor, especially in distributed systems. Denormalization allows data to be stored in a query-friendly way, thereby simplifying query processing; a minimal sketch follows.
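
To make the trade-off concrete, here is a minimal sketch (the entities, fields, and in-memory dictionaries below are illustrative stand-ins, not from the original article): the normalized layout needs an extra lookup per query, while the denormalized layout answers the query with a single read at the cost of duplicated data.

```python
# Hedged, minimal sketch of denormalization (hypothetical entities and fields).
# Normalized form: order rows reference a separate customer record,
# so rendering an order requires a second lookup (the "join").
customers = {"c1": {"name": "Alice", "city": "Austin"}}
orders = {"o1": {"customer_id": "c1", "total": 42.0}}

def render_order_normalized(order_id):
    order = orders[order_id]
    customer = customers[order["customer_id"]]   # extra lookup = the join
    return {"total": order["total"], "customer_name": customer["name"]}

# Denormalized form: the customer fields the query needs are copied into the
# order record itself. One read answers the query, at the cost of storing
# (and later updating) the copied data in every order.
orders_denormalized = {
    "o1": {"total": 42.0, "customer_name": "Alice", "customer_city": "Austin"}
}

def render_order_denormalized(order_id):
    return orders_denormalized[order_id]          # single lookup, no join

print(render_order_normalized("o1"))
print(render_order_denormalized("o1"))
```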

Applicability: Key-value storage, document database, BigTable style database

2. Aggregates

All mainstream NoSQL stores provide loose schema (soft schema) support in one form or another:

Key-value stores and graph databases usually place no constraints on values, so a value may be in any format. In addition, a business entity can be represented as multiple records using composite keys. For example, a user account can be modeled as a set of entities keyed by UserID_name, UserID_email, UserID_messages, and so on. If the user has no email or messages, the corresponding entities are simply not recorded (a sketch of this keying scheme follows these examples).

The BigTable model also supports loose schemas, because a column family is a variable set of columns and a cell can store a variable number of timestamped versions of the data.

Document databases are schema-less by nature, although some of them allow user-defined schemas to validate incoming data.
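
As referenced above, here is a minimal sketch of the UserID_name / UserID_email style of composite keys, assuming nothing more than a generic key-value API; the in-memory dictionary and helper functions are hypothetical:

```python
# Hedged sketch: one business entity (a user account) spread over several
# key-value records via composite keys. `kv` stands in for any key-value
# store client with put/get semantics.
kv = {}

def put_user(user_id, name, email=None, messages=None):
    kv[f"{user_id}_name"] = name
    if email is not None:                 # optional attributes simply have
        kv[f"{user_id}_email"] = email    # no record at all (loose schema)
    if messages:
        kv[f"{user_id}_messages"] = list(messages)

def load_user(user_id):
    # Reassemble the entity from whichever records exist.
    entity = {"name": kv.get(f"{user_id}_name")}
    if f"{user_id}_email" in kv:
        entity["email"] = kv[f"{user_id}_email"]
    entity["messages"] = kv.get(f"{user_id}_messages", [])
    return entity

put_user("u42", "Alice", email="alice@example.com")
print(load_user("u42"))   # {'name': 'Alice', 'email': ..., 'messages': []}
```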

A loose schema allows entity classes to be built with complex internal structures (nested entities) and also allows the structure of individual entities to vary. This brings two important conveniences:

Nested entities minimize one-to-many relationships and therefore reduce joins.

Heterogeneous business entities can be modeled with a single collection of documents or a single table; the loose schema masks the "technical" differences between such entities.

We use the following figure to illustrate these conveniences. It depicts the modeling of product entities in the e-commerce domain. First, we can assume that all products have an ID, a price (Price), and a description (Description). Looking further, we find that different types of products have different attributes: books carry author information, while jeans have a length attribute. Some of these attributes are inherently one-to-many or many-to-many, such as the tracks on a music album.

Furthermore, some entities may be impossible to model with a fixed type. For example, the attributes of jeans vary from brand to brand, and jeans produced by different manufacturers do not have consistent attribute sets. Although these problems can be solved in a normalized relational data model, the solutions are cumbersome. A loose schema allows all types of products and their attributes to be modeled with a single aggregate (product):

[Figure: heterogeneous product entities modeled as a single product aggregate]
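
A sketch of such a product aggregate, assuming a document store that accepts arbitrary dictionaries; the type-specific field names are illustrative:

```python
# Hedged sketch: heterogeneous products stored in one "products" collection.
# Each document shares a few common fields; everything else is type-specific.
products = [
    {"id": 1, "price": 12.5, "description": "A novel",
     "type": "book", "author": "J. Doe", "pages": 320},
    {"id": 2, "price": 49.0, "description": "Slim-fit jeans",
     "type": "jeans", "length": 32, "manufacturer_attrs": {"fit": "slim"}},
    {"id": 3, "price": 9.9, "description": "Live album",
     "type": "album", "tracks": ["Intro", "Song A", "Song B"]},  # nested one-to-many
]

# Queries over the common fields work uniformly across all product types.
cheap = [p["description"] for p in products if p["price"] < 20]
print(cheap)
```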

Denormalization with nested entities has a significant impact on the performance and consistency of update operations, so pay special attention to the update flow.

Applicability: Key-value storage, document database, BigTable style database

3. Application-Side Joins

Few NoSQL solutions support joins. A consequence of the "question-oriented" nature of NoSQL is that joins are usually handled at design time, whereas the relational model handles joins at query execution time. Handling joins at query time almost certainly carries a performance penalty, but in many cases joins can be avoided through denormalization and aggregation, i.e., by embedding nested entities. Of course, in many cases joins are unavoidable and must be handled by the application. The main use cases are:

Many-to-many relationships are often modeled through links, which require joins.

Aggregates are often unsuitable when internal entities are modified frequently. It is usually better to record what happened as new entries and to join all the entries at query time rather than to change an existing value. For example, a messaging system could be modeled as a User entity with nested Message entities. However, if messages are appended frequently, a better approach may be to extract Message as a separate entity and join it with the User at query time (a sketch follows the figure below):

[Figure: Message entities stored separately and joined with the User entity at query time]
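
A minimal sketch of that application-side join, with Message records kept separate and stitched onto the User at query time; the collections and field names are assumptions for illustration:

```python
# Hedged sketch of an application-side join: Users and Messages live in
# separate collections; the application performs the join when reading.
users = {"u1": {"name": "Alice"}}
messages = [
    {"user_id": "u1", "text": "hello"},
    {"user_id": "u1", "text": "second message"},
]

def get_user_with_messages(user_id):
    user = dict(users[user_id])
    # The "join": filter the message collection by the foreign key ourselves.
    user["messages"] = [m["text"] for m in messages if m["user_id"] == user_id]
    return user

print(get_user_with_messages("u1"))
```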

Applicability: key-value storage, document database, BigTable style database, graph database

5. General modeling techniques

In this section, we will discuss general modeling techniques applicable to various NoSQL implementations.

1. Atomic Aggregates

Many NoSQL solutions provide only limited transaction support, and some provide none at all. In some cases transactional behavior can still be achieved with distributed locks or application-managed MVCC, but it is common to model the data with the aggregation technique so as to guarantee some of the ACID properties.

A powerful transaction mechanism is indispensable for relational databases, in part because normalized data usually has to be updated in multiple places. Aggregates, on the other hand, allow a single business entity to be stored as one document, row, or key-value pair, so that it can be updated atomically:

[Figure: a business entity stored as a single aggregate and updated atomically]

Of course, as a data modeling technique, atomic aggregation is not a complete transactional solution, but if the store provides atomicity, locks, or test-and-set (TAS) guarantees, atomic aggregates are workable.

(Translator's note: put the business data that must be updated transactionally together and store it in a data structure with atomic operations, whether provided by the NoSQL store or by the application. When using HBase, storing all the data of one business item for a user in a single row, as in the figure above, is an application of this pattern.)
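
A minimal sketch of the idea, assuming a store in which a single-record write is atomic (typical of document stores and of single-row writes in BigTable-style systems); the order entity and its fields are illustrative:

```python
# Hedged sketch: an entire business entity (an order with its line items and
# payment status) kept in one document, so a single put() replaces it atomically.
store = {}

def place_order(order_id, items, paid=False):
    # Everything that must change together lives in one aggregate.
    store[order_id] = {"items": list(items), "paid": paid, "version": 1}

def mark_paid(order_id):
    doc = dict(store[order_id])          # read the whole aggregate
    doc["paid"] = True                   # modify it in memory
    doc["version"] += 1                  # a version field supports test-and-set
    store[order_id] = doc                # one atomic write, no partial updates

place_order("o7", [{"sku": "A", "qty": 2}])
mark_paid("o7")
print(store["o7"])
```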

Applicability: key-value storage, document database, BigTable style database

2. Enumerable Keys

Perhaps the biggest advantage of the unordered key-value data model is that entities can be spread across multiple servers by hashing the key. Ordering makes things more complicated, but even if the store does not provide such a capability, applications can sometimes benefit from ordered keys. Let us take the modeling of email messages as an example (a sketch of both variants follows the two options below):

Some NoSQL stores provide atomic counters that can generate sequential IDs. In that case, messages can be stored under the composite key userID_messageID. If the latest message ID is known, earlier messages can be traversed; likewise, from any given message ID one can traverse forward or backward.

Messages can also be divided into buckets, for example putting each day's data in its own bucket. This allows a mailbox to be traversed forward or backward starting from any specified date or from the current date.
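
A sketch of both variants over an ordered key space, with zero-padded counters so that lexicographic order matches numeric order; the counter and bucket formats are assumptions:

```python
# Hedged sketch of enumerable keys for a mailbox.
message_counter = {}       # per-user counter (many stores offer atomic counters)
mailbox = {}               # stand-in for an ordered key-value store

def next_message_id(user_id):
    message_counter[user_id] = message_counter.get(user_id, 0) + 1
    return message_counter[user_id]

def store_message(user_id, body, day):
    mid = next_message_id(user_id)
    # Variant 1: userID_messageID, zero-padded so keys sort numerically.
    mailbox[f"{user_id}_{mid:010d}"] = body
    # Variant 2: daily buckets, e.g. userID_YYYYMMDD_messageID.
    mailbox[f"{user_id}_{day}_{mid:010d}"] = body

store_message("u1", "first", "20240101")
store_message("u1", "second", "20240102")

# Knowing the latest ID (or a date), neighbours can be enumerated directly.
latest = message_counter["u1"]
print(mailbox[f"u1_{latest - 1:010d}"])   # the previous message
```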

Applicability: Key-value storage

(Translator's note: using natural or business characteristics of the key to turn random reads and writes into sequential ones improves traversal performance and makes application logic easier to write. But pay attention to the impact of concurrent writes in a distributed deployment and to coupling the key too tightly to the business. For a discussion of unordered versus ordered keys, see the schema design chapter of "HBase: The Definitive Guide".)

3. Dimensionality Reduction

Dimensionality reduction technology allows a multi-dimensional data model to be mapped to a key-value model or other non-multidimensional models.

Traditional geographic information systems use variants of Quadtrees or R-trees for indexing. These structures must be updated in place, so maintenance is quite expensive when data volumes are large. Another approach is to traverse the two-dimensional structure and flatten it into an ordinary list of entries. A well-known example of this technique is Geohash. Geohash scans the two-dimensional space along a Z-shaped route, encoding each move as 0 or 1 according to its direction, with moves alternating between longitude and latitude. The encoding process is illustrated in the figure below, where the black and red bits represent longitude and latitude respectively:

[Figure: Geohash encoding; black bits are longitude, red bits are latitude]

As the figure shows, an important feature of Geohash is that the proximity of two regions can be estimated from the similarity of their bit-by-bit codes. Geohash encoding allows geographic information to be stored in simple, common data models, for example preserving spatial relationships in a store with ordered keys. [6.1] describes the dimensionality reduction technique used in BigTable. More information about Geohash and related techniques can be found in [6.2] and [6.3].
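
A simplified sketch of the interleaving idea only (real Geohash additionally packs the bits into a base-32 string): each step halves the current longitude or latitude interval, emits 1 if the point lies in the upper half and 0 otherwise, and the two axes alternate.

```python
# Hedged sketch of Geohash-style bit interleaving (simplified).
def interleaved_code(lat, lon, bits=16):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    code = []
    for i in range(bits):
        # Even bit positions refine longitude, odd ones refine latitude.
        rng, value = (lon_range, lon) if i % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            code.append("1")
            rng[0] = mid
        else:
            code.append("0")
            rng[1] = mid
    return "".join(code)

# Nearby points share long prefixes, so sorting by this code keeps them close.
print(interleaved_code(48.86, 2.35))    # Paris
print(interleaved_code(48.85, 2.29))    # also Paris: long shared prefix
print(interleaved_code(-33.87, 151.21)) # Sydney: diverges immediately
```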

Applicability: key-value storage, document database, BigTable style database

(Translator's note: mapping multi-dimensional spatial data into a one-dimensional key-value store is a very important modeling pattern: data that is adjacent in the multi-dimensional space, at different zoom levels, is still stored sequentially and can be traversed efficiently; at the same time, the longer the shared prefix of two keys, the closer the corresponding points are in space, so positional "similarity" can be judged by a simple ordered comparison of key values.

Its applications go far beyond representing geographic information; any data with several dimensional attributes can use this technique for granular representation. For example, offline sales transaction data usually carries time and branch-organization information. When users query sales data, they usually start with a large region and a long time period, then narrow the region and the time range to locate the details. Here the branch hierarchy and time play the role of the longitude and latitude axes, and building the key by interleaving the time and branch codes serves this scenario well. Simply using time or the branch alone as the key or index cannot accommodate it.)

4. Index Table

An index table is a very simple technique that provides index support on top of stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table keyed the way the access pattern requires. For example, suppose there is a master table of user accounts that can be accessed directly by user ID. A query for all users in a given city can be supported by an additional table with the city as the key:

[Figure: an index table keyed by city referencing the user account master table]
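
A minimal sketch of maintaining such an index table by hand; the table layouts and field names are illustrative:

```python
# Hedged sketch: a master table keyed by user ID plus a manually maintained
# index table keyed by city.
users_by_id = {}            # master table: user_id -> account
user_ids_by_city = {}       # index table:  city -> list of user_ids

def put_user(user_id, account):
    old = users_by_id.get(user_id)
    if old is not None:
        # Keep the index consistent when the indexed field changes.
        user_ids_by_city[old["city"]].remove(user_id)
    users_by_id[user_id] = account
    user_ids_by_city.setdefault(account["city"], []).append(user_id)

def users_in_city(city):
    return [users_by_id[uid] for uid in user_ids_by_city.get(city, [])]

put_user("u1", {"name": "Alice", "city": "Berlin"})
put_user("u2", {"name": "Bob", "city": "Berlin"})
print(users_in_city("Berlin"))
```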

The index table can be updated on every update of the master table or in batch mode. Either way, it incurs an additional performance cost and introduces data consistency issues.

An index table can be regarded as an analogue of a materialized view in a relational database.

Applicability: BigTable style database

5. Composite Key Index

Composite keys are a very common technique, and they become extremely useful when the store keeps its keys in sorted order. A composite key combined with a secondary sort dimension builds a multi-dimensional index, similar in principle to the dimensionality reduction technique described above. For example, suppose we have a set of records, each of which is a user statistic. If we want to aggregate these statistics by the region the user comes from, we can use a key of the form (State:City:UserID). If the store supports selecting a key range by partial matching (as BigTable-style databases do), we can then traverse the records of a specific state (State) or city (City):

[Figure: range scans over a composite State:City:UserID key]
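
A sketch of the State:City:UserID scheme over a sorted key space, using prefix matching to select a range; the records, separator, and the "\xff" sentinel trick are illustrative assumptions:

```python
# Hedged sketch: composite keys in an ordered store allow range/prefix scans.
import bisect

records = {
    "CA:LosAngeles:u17": {"visits": 3},
    "CA:SanFrancisco:u02": {"visits": 7},
    "CA:SanFrancisco:u31": {"visits": 1},
    "NY:NewYork:u05": {"visits": 9},
}
sorted_keys = sorted(records)   # an ordered store keeps keys sorted for us

def scan_prefix(prefix):
    # Everything matching the prefix forms one contiguous key range.
    # "\xff" sorts after any ASCII character, so it bounds the range
    # (assuming keys themselves never contain "\xff").
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return {k: records[k] for k in sorted_keys[start:end]}

print(scan_prefix("CA:"))               # all records for the state
print(scan_prefix("CA:SanFrancisco:"))  # narrowed down to one city
```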

Applicability: BigTable style database

(Translator's note: when using a BigTable-style database such as HBase, building the row key from a single field is simply asking for trouble. Combining multiple fields into the key not only avoids key collisions, but also improves the performance of secondary searches over subsets of the data.)

6. Aggregation with Composite Keys

Composite keys can be used not only for indexing but also for different kinds of grouping. Let us look at an example. There is a huge array of log records containing information about internet users and their visits to different sites (a clickstream). Our goal is to count the clicks on each site for each unique user; in SQL terms, this is a query that groups the clicks by user and site.

We can model this situation with a composite key prefixed by the user ID:

[Figure: clickstream records grouped under a composite key prefixed with the user ID]

The idea is to keep all of a user's records together, so that they can be loaded into memory in their entirety (one user does not generate too many events) and duplicate sites merged with a hash table or a similar structure. An alternative technique is to use the user as the key and append the site to the value each time an event arrives; in most implementations, however, modifying a value is generally less efficient than inserting a new record.
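
A sketch of the first idea: keys prefixed with the user ID keep one user's events adjacent, so each group can be loaded and merged in memory; the key format and log contents are illustrative:

```python
# Hedged sketch: clickstream events keyed "userID:eventID" in an ordered store,
# so a plain scan delivers each user's events as one contiguous group.
from itertools import groupby

events = sorted([
    ("u1:001", "site-a.com"), ("u1:002", "site-a.com"), ("u1:003", "site-b.com"),
    ("u2:001", "site-a.com"), ("u2:002", "site-b.com"),
])

for user, group in groupby(events, key=lambda kv: kv[0].split(":")[0]):
    per_site = {}                      # one user's events fit in memory;
    for _, site in group:              # the dict merges duplicate sites
        per_site[site] = per_site.get(site, 0) + 1
    print(user, per_site)              # e.g. u1 {'site-a.com': 2, 'site-b.com': 1}
```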

Applicability: ordered key-value storage, BigTable style database

7. Inverted Search - Direct Aggregation

This technique is more of a data processing pattern than a data modeling pattern, but the data model is affected by its use. The main idea is to use an index to find data that meets certain criteria, but to perform the aggregation itself over the original records or with a full scan. Let us consider an example. There is a set of log records containing information about internet users and their visits to different sites (a clickstream). Assume that each record contains the user ID, the categories the user belongs to (male, female, blogger, etc.), the city the user comes from, and the site visited. Our goal is to find the audience that meets some criteria (site, city, etc.) and to break down the distinct users in that audience (i.e., the set of users meeting the criteria) by category.

Obviously, the users who meet the criteria can be found very efficiently with inverted indexes such as {category -> [user IDs]} or {site -> [user IDs]}. Intersections and unions of the corresponding user ID lists can be computed (very efficiently if the IDs are stored as sorted lists or bitmaps), yielding the target audience. The report itself, however, is an aggregate query over that audience, such as counting how many of the selected users fall into each category.

If the number of categories is large, such a query cannot be processed efficiently with inverted indexes alone. To solve this, a direct index of the form {user ID -> [categories]} can be built and traversed to produce the final report. This architecture is shown in the figure below:

[Figure: inverted indexes used to select the audience, a direct index used for aggregation]

Finally, note that randomly accessing the record for each user ID in the target audience can be very inefficient. This can be addressed with batch query processing: a number of audiences (for different report criteria) can be accumulated first, and all reports for that batch can then be computed in one full scan of the direct or inverted index table.
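
A sketch combining the two index types: inverted indexes select the audience, and a direct index from user to categories is then traversed to build the per-category report; all index contents are illustrative:

```python
# Hedged sketch: inverted indexes find matching users; a direct index is then
# used to aggregate those users by category.
inverted_by_site = {"site-a.com": {"u1", "u2", "u3"}, "site-b.com": {"u2", "u4"}}
inverted_by_city = {"Berlin": {"u1", "u2"}, "Paris": {"u3", "u4"}}
direct_index = {                      # user -> categories the user belongs to
    "u1": ["male", "blogger"],
    "u2": ["female"],
    "u3": ["male"],
    "u4": ["female", "blogger"],
}

# Step 1: set operations on inverted indexes produce the target audience.
audience = inverted_by_site["site-a.com"] & inverted_by_city["Berlin"]   # {u1, u2}

# Step 2: traverse the direct index (here only for the audience; a batch job
# could instead do one full scan for many audiences) and count per category.
report = {}
for user in audience:
    for category in direct_index[user]:
        report[category] = report.get(category, 0) + 1
print(report)   # e.g. {'male': 1, 'blogger': 1, 'female': 1}
```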

Applicability: Key-value storage, BigTable style database, document database

(Translator's note: if you need to look up entities quickly by attribute or category, this kind of index is indispensable. User profiling with big data is very popular now; doesn't it need exactly a structure like {user tag -> user group}?)

Hierarchy Modeling Techniques

8. Tree Aggregation

A tree, or even an arbitrary graph (with the aid of denormalization), can be modeled as a single record or document.

This technique is very efficient when the tree is accessed in one go (for example, when the entire comment tree of a blog post is read and displayed on the article page); a sketch follows the figure below.

Searching for or randomly accessing an arbitrary entry can be problematic.

In most NoSQL implementations, updates are inefficient compared with storing the nodes as independent records.

[Figure: a comment tree stored as a single nested document]
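
A sketch of a comment tree stored, and read, as one nested document, matching the single-read access pattern described above; field names are illustrative:

```python
# Hedged sketch: an entire comment tree nested inside one document, so the
# whole discussion is fetched with a single read.
article = {
    "title": "NoSQL modeling",
    "comments": [
        {"author": "alice", "text": "Nice overview",
         "replies": [
             {"author": "bob", "text": "Agreed", "replies": []},
         ]},
        {"author": "carol", "text": "What about graphs?", "replies": []},
    ],
}

def render(comments, depth=0):
    # One pass over the aggregate renders the full tree; no joins needed.
    for c in comments:
        print("  " * depth + f"{c['author']}: {c['text']}")
        render(c["replies"], depth + 1)

render(article["comments"])
```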

Applicability: key-value storage, document database

(Translator's note: this is where document-oriented NoSQL shines. For scenarios that do not require joins across trees, it not only offers high read and write efficiency but also supports localized, transactional updates.)

9. Adjacency Lists

An adjacency list is a simple way to model a graph: each node is stored as a separate record containing an array of its immediate ancestors or an array of its descendants. It allows a node to be found by the identifier of its parent or child, and a graph can be traversed one hop per query. However, this approach is usually inefficient for retrieving the entire subtree of a given node, whether the traversal is depth-first or breadth-first.
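
A sketch of adjacency-list records and a hop-by-hop breadth-first traversal, which also illustrates why deep subtree queries need many lookups; the node IDs are illustrative:

```python
# Hedged sketch: each node is its own record with an array of child IDs.
from collections import deque

nodes = {
    "clothes": {"parent": None, "children": ["men", "women"]},
    "men":     {"parent": "clothes", "children": ["shoes"]},
    "women":   {"parent": "clothes", "children": []},
    "shoes":   {"parent": "men", "children": []},
}

def subtree(root):
    # Breadth-first traversal: one record fetch ("query") per node visited.
    found, queue = [], deque([root])
    while queue:
        node_id = queue.popleft()
        found.append(node_id)
        queue.extend(nodes[node_id]["children"])
    return found

print(subtree("clothes"))   # ['clothes', 'men', 'women', 'shoes']
```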

Applicability: key-value storage, document database

10. Materialized Paths

Materialized paths are a technique that helps avoid recursive traversal of tree structures; it can also be regarded as a kind of denormalization. The idea is to attach to each node the identifiers of all of its ancestors or descendants, so that all ancestors or descendants of a node can be determined without traversal:

[Figure: materialized paths for an eCommerce category tree]

Because this technique turns a hierarchical structure into flat documents, it is particularly useful for full-text search engines. As the figure above shows, all products and subcategories under the Men's Shoes category can be retrieved with a short query on a single category name.

A materialized path can be stored as a collection of IDs or as a single string of concatenated IDs. The latter allows regular expressions to be used to find nodes whose path matches certain conditions. This variant is shown in the figure below (the path includes the node itself):

[Figure: materialized paths stored as concatenated ID strings and queried with regular expressions]
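
A sketch of the string variant: each item stores its path of ancestor IDs, and a subtree is selected by matching a path prefix without any recursive traversal; the paths and names are illustrative:

```python
# Hedged sketch: materialized paths stored as strings; a whole subtree is
# selected by matching a path prefix.
import re

items = [
    {"name": "Men's Shoes", "path": "clothes/men/shoes"},
    {"name": "Running shoe X", "path": "clothes/men/shoes/running-shoe-x"},
    {"name": "Women's Dresses", "path": "clothes/women/dresses"},
]

def subtree(prefix):
    # Match the prefix followed by a separator or the end of the path.
    pattern = re.compile(r"^" + re.escape(prefix) + r"(/|$)")
    return [i["name"] for i in items if pattern.match(i["path"])]

print(subtree("clothes/men/shoes"))   # the category and everything under it
```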

Applicability: key-value storage, document database, search engine

11. Nested Sets

Nested sets are a standard technique for modeling tree-like structures. They are widely used in relational databases, but they are also fully applicable to key-value stores and document databases. The idea is to store the leaf nodes of the tree in an array (translator: each leaf corresponds to a position index in the array) and to map every non-leaf node to a range of leaves, given by the positions of its first and last leaf in the array, as shown in the figure below:

[Figure: nested sets mapping non-leaf nodes to ranges over the leaf array]

This structure is very efficient for immutable data, because it has a small memory footprint and all the leaves of a given node can be obtained without traversing the tree. However, inserting a leaf forces a large number of position-index updates, so inserts and updates are quite expensive.
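
A sketch of the nested-sets layout: the leaves sit in one ordered array and every non-leaf node stores the start and end positions of its leaves; the categories and products are illustrative:

```python
# Hedged sketch of nested sets: leaf products in one array, non-leaf
# categories mapped to index ranges over that array.
leaves = ["sneaker-a", "boot-b", "dress-c", "skirt-d", "hat-e"]

categories = {                     # category -> (first leaf index, last leaf index)
    "men's shoes":  (0, 1),
    "women's wear": (2, 3),
    "accessories":  (4, 4),
    "all products": (0, 4),
}

def products_in(category):
    start, end = categories[category]
    return leaves[start:end + 1]   # all leaves of a subtree, no traversal

print(products_in("women's wear"))  # ['dress-c', 'skirt-d']
# Inserting a new leaf in the middle would shift positions and force updates
# to every range to its right -- the update cost mentioned above.
```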

Applicability: key-value storage, document database

(Translator’s note: applicable to dictionary tables, historical log data tables sorted by date, etc.)

12. Flattened Nested Documents: Numbered Field Names

Search engines usually work with flat documents, i.e., each document is a flat list of field names and their values. The goal of data modeling is to map business entities onto such plain documents, which is difficult if the internal structure of the entity is complex. A typical difficulty is a hierarchical model, such as a document with embedded documents that must be mapped onto a plain document. Let us consider the following example:

[Figure: a resume document with nested Skill and Level attributes]

Each business entity here is a kind of resume, containing a person's name and a list of his or her skills with the corresponding skill levels. An obvious way to model such an entity is to create a plain document with lists of Skill and Level fields. This model allows searching for a person by skill or by level, but combining the two fields in one query easily produces false matches, as shown in the figure above. (Translator: a simple AND cannot see the correspondence between a skill and its level.)

One way to solve this problem is proposed in [4.6]. The main idea is to combine each skill and its level into a pair and to number the pairs with a subscript, as Skill_i and Level_i. When searching, all of these pairs must be queried together (the number of OR'ed conditions in the query equals the maximum number of skills a person can have):

[Figure: a flattened document with numbered Skill_i and Level_i fields and the corresponding pairwise query]
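
A sketch of the numbered-field idea, with flat in-memory documents standing in for the search engine: each (skill, level) pair becomes Skill_i / Level_i, and the query ORs together one AND-pair per possible index.

```python
# Hedged sketch: flattening nested (skill, level) pairs into numbered fields
# and querying them pairwise so skill and level cannot be mixed up.
resumes = [
    {"Name": "Alice", "Skill_0": "math",   "Level_0": "excellent",
                      "Skill_1": "poetry", "Level_1": "poor"},
    {"Name": "Bob",   "Skill_0": "poetry", "Level_0": "excellent"},
]
MAX_SKILLS = 4   # the query needs one OR branch per possible pair index

def find(skill, level):
    hits = []
    for doc in resumes:
        # OR over i of (Skill_i == skill AND Level_i == level)
        if any(doc.get(f"Skill_{i}") == skill and doc.get(f"Level_{i}") == level
               for i in range(MAX_SKILLS)):
            hits.append(doc["Name"])
    return hits

print(find("poetry", "excellent"))   # ['Bob'] -- Alice is not a false match
```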

This approach does not really scale, because query complexity grows rapidly with the number of nested structures.

Applicability: search engine

(Translator's note: when there is no obvious way to speed up queries with indexes, especially when you do not know in advance how users will query the data and therefore cannot optimize the model for it, this is the last straw: use a full-text search tool. It can at least find something within an acceptable time ^o^.)

13. Flattened Nested Documents: Proximity Queries

Another technique for handling nested documents is also described in [4.6]. The idea is to use proximity queries, which limit the acceptable distance between words in the document. In the figure below, all skills and levels are indexed into a single field called SkillAndLevel, and a query for "Excellent" and "Poetry" will only match entries where these two words appear next to each other:

[Figure: skills and levels indexed into a single SkillAndLevel field and matched with a proximity query]

[4.3] describes a successful case of using this technology on Solr.

Applicability: search engine

(Translator's note: same as above, but because recall is high, there can be false matches and dirty data.)

14. Batch Graph Processing

Graph databases like Neo4j perform surprisingly well when exploring the neighbors of a given node or the relationships between two or a few nodes. However, general-purpose graph databases are not very efficient at global processing of large graphs because they scale poorly. Distributed graph processing can be implemented with MapReduce or a message-passing pattern; my previous article introduced one such pattern. This approach, built on key-value stores, document databases, or BigTable-style databases, can handle large graphs.
