PostgreSQL – best way to make bulk updates to an indexed set of flags

The context of this question is PostgreSQL 9.6.5 on AWS RDS.

The question is about the best schema design and batch update strategy for a 300-million-row table with the following logical data model:

> id: primary key, a string of up to 40 characters
> code: integer, 1-999
> year: integer year
> flags: a variable number (about 1000) of flags, each associated with a name; new flags are added over time. Ideally, a flag should be treated as having three values: absent (null), on (true / 1) and off (false / 0). It is possible, at the cost of additional updates (see below), to treat flags as simple bits (on or off, with no absent state). “On” values are usually very sparse: <1/1000.

Queries are usually Boolean expressions involving the presence of one or more flags (by name), together with code and year.

Data is updated in batches via Apache Spark, i.e., an update can be expressed as a flat file (for example, in COPY format) or as SQL operations. Only one update is active at any time. Updates to code and year are very rare. Updates to flags affect 1-5% of rows (3-15 million rows) per update. Update rows can include all flags and their values, only the “on” flags to be updated, or only the flags whose values have changed. In the first case, Spark would need to query the data to get the current flag values.

There will be a small read load during the update.

The question is about the best schema to support these queries, and the associated strategy for the updates described above.

Some research notes so far:

> Using 1,000 Boolean columns would give a very efficient row representation but, apart from some DDL complexity, it would require 1,000 indexes.
> Bit strings would be great if there were a way to index individual bits. Also, they offer no good way to represent absent flags. This approach requires maintaining a lookup table between flag names and bit IDs. If needed, merge updates could be done with ||, but given PostgreSQL's MVCC there seems to be little benefit in updating only the flags rather than replacing the entire row.
> JSONB fields offer indexing. They also offer a representation of absence (null), but at a price: all “off” flags would need to be set explicitly, which would make the fields very large. If we ignore the absent state, the JSONB fields would be relatively small. To shrink them further, we could use short 1-3 character field names and a lookup table. The same comment about merge updates applies as for bit strings.
> tsvector / tsquery: no experience with this data type but, in theory, it seems to be an exact representation of the set of names of “on” flags. A lookup table would be needed to map flag names to tokens, with the additional requirement of ensuring there are no collisions due to stemming.
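For comparison, the JSONB option from the notes above could look roughly like the following. This is only a sketch under assumptions: the table name data_jsonb, the index name, the short flag keys f1, f2, f7 and the literal code/year values are all hypothetical and do not come from the question.

```sql
-- Hypothetical variant of the main table with all flags in one JSONB column.
-- Only flags that are present are stored; a missing key means "absent".
CREATE TABLE data_jsonb (
    id    varchar(40) PRIMARY KEY,
    code  smallint NOT NULL,
    year  smallint NOT NULL,
    flags jsonb NOT NULL DEFAULT '{}'
);

-- GIN index with the jsonb_path_ops operator class, which supports
-- the containment operator @>.
CREATE INDEX data_jsonb_flags_idx
    ON data_jsonb USING gin (flags jsonb_path_ops);

-- Rows where the (hypothetical) flag f1 is on, for a given code and year.
SELECT id
FROM data_jsonb
WHERE flags @> '{"f1": true}'
  AND code = 42
  AND year = 2017;

-- Merge update with ||: keys on the right-hand side overwrite or extend
-- the existing object. Under MVCC this still writes a new row version.
UPDATE data_jsonb
SET flags = flags || '{"f2": true, "f7": false}'
WHERE id = 'some-id';
```

As the notes point out, the merge operator saves nothing at the storage level, since each update still rewrites the whole row.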

Do not store the flags in the main table.

Assuming the main table is called data, define the following:

CREATE TABLE flag_names (
    id smallint PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE flag (
    flagname_id smallint NOT NULL REFERENCES flag_names(id),
    data_id text NOT NULL REFERENCES data(id),
    value boolean NOT NULL,
    PRIMARY KEY (flagname_id, data_id)
);

If a new flag is created, insert a new row into flag_names.

If the flag is set to TRUE or FALSE, insert or update a row in the flag table.
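A batch produced by Spark could be applied to the flag table roughly as follows. This is a sketch, not a tuned procedure: the staging table flag_stage is a hypothetical name, and it assumes each batch contains at most one row per (flagname_id, data_id) pair, because INSERT ... ON CONFLICT DO UPDATE raises an error if one statement tries to update the same row twice.

```sql
-- Hypothetical staging table, loaded once per batch.
CREATE TEMP TABLE flag_stage (
    flagname_id smallint NOT NULL,
    data_id     text     NOT NULL,
    value       boolean  NOT NULL
);

-- Load the flat file from the client side. On RDS there is no access to
-- the server's file system, so use \copy in psql or COPY ... FROM STDIN
-- through a driver, for example:
--   \copy flag_stage FROM 'batch.tsv'

-- Insert new flag rows and update existing ones in a single statement
-- (ON CONFLICT is available from PostgreSQL 9.5 onwards).
INSERT INTO flag (flagname_id, data_id, value)
SELECT flagname_id, data_id, value
FROM flag_stage
ON CONFLICT (flagname_id, data_id)
DO UPDATE SET value = EXCLUDED.value
-- Skip rows whose value is unchanged, to avoid writing useless row versions.
WHERE flag.value IS DISTINCT FROM EXCLUDED.value;
```

With this layout an update touches only the narrow flag rows, and the batch only needs to contain flags whose values were set or changed.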

Join data with flag to test whether a certain flag is set.
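Such a query might look like the sketch below. The flag names flag_a and flag_b and the literal code, year and id values are hypothetical, and the second query assumes that names in flag_names are unique, which the table definition above does not enforce.

```sql
-- Rows with a given code and year where flag_a is on and flag_b is not on
-- (flag_b may be off or absent).
SELECT d.id, d.code, d.year
FROM data AS d
WHERE d.code = 42
  AND d.year = 2017
  AND EXISTS (
        SELECT 1
        FROM flag AS f
        JOIN flag_names AS n ON n.id = f.flagname_id
        WHERE f.data_id = d.id
          AND n.name = 'flag_a'
          AND f.value
      )
  AND NOT EXISTS (
        SELECT 1
        FROM flag AS f
        JOIN flag_names AS n ON n.id = f.flagname_id
        WHERE f.data_id = d.id
          AND n.name = 'flag_b'
          AND f.value
      );

-- Three-valued read of a single flag for one row:
-- NULL = absent (no row in flag), true = on, false = off.
SELECT d.id, f.value AS flag_a
FROM data AS d
LEFT JOIN flag AS f
       ON f.data_id = d.id
      AND f.flagname_id = (SELECT id FROM flag_names WHERE name = 'flag_a')
WHERE d.id = 'some-id';
```

Because an absent flag is simply a missing row, the three states requested in the question come without storing explicit “off” or null entries for every flag.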

