Reviewing and Modifying a Table Definition
Overview
To properly define a table, you must:
- Reduce the data volume scanned by using the partition pruning mechanism.
- Minimize random I/Os by using clustering or partial clustering.
The table definition is created during the database design and is reviewed and modified during the SQL statement optimization.
Selecting a Storage Model
During database design, some key factors about table design will greatly affect the subsequent query performance of the database. Table design affects data storage as well. A good table design reduces I/O operations and minimizes memory usage, improving the query performance.
Selecting a model for table storage is the first step of table definition. Select a proper storage model for your service based on the following table:
Using Partial Cluster Keys (PCKs)
PCK is a column-store technology. It can minimize or maximize sparse indexes to quickly filter base tables. PCKs can be used to specify multiple columns. However, it is recommended that a maximum of two columns be specified. Use the following principles to specify PCKs:
- The specified PCKs must be restricted by simple expressions in base tables. Such constraints are usually represented by col op const, in which col indicates the column name, op indicates operators (including =, >, >=, <=, and <), and const indicates constants.
- Select columns that are frequently selected (to filter much more undesired data) in simple expressions.
- List the most frequently selected columns at the top.
- List the columns of the enumerated type at the top.
Using Partitioned Tables
Partitioned tables split what is logically one table into smaller physical blocks based on a specific scheme. The table based on the logic is called a partitioned table, and each physical block is called a partition. A partitioned table is a logical table and does not store data. Data is actually stored in partitions. A partitioned table has the following advantages over an ordinary table:
High query performance: You can specify partitions when querying partitioned tables, improving query efficiency.
High availability: If a certain partition in a partitioned table is faulty, data in the other partitions is still available.
Easy maintenance: To fix a partitioned table having a faulty partition, you only need to fix the partition.
Partitioned tables supported by the openGauss database are level-1 and level-2 partitioned tables. Level-1 partitioned tables include range partitioned tables, interval partitioned tables, list partitioned tables, and hash partitioned tables. Level-2 partitioned tables include nine combinations of any two of range partitioned tables, list partitioned tables, and hash partitioned tables.
- Range partitioned table: Data within a certain range is mapped to each partition. The range is determined by the partition key specified when the partitioned table is created. This partitioning method is most commonly used. The partition key is usually a date. For example, sales data is partitioned by month.
- Interval partitioned table: a special type of range partitioned tables. Compared with range partitioned tables, interval value definition is added. When no matching partition can be found for an inserted record, a partition can be automatically created based on the interval value.
- List partitioned table: Key values contained in the data are stored in different partitions, and the data is mapped to each partition in sequence. The key values contained in the partitions are specified when the partitioned table is created.
- Hash partitioned table: Data is mapped to each partition based on the internal hash algorithm. The number of partitions is specified when the partitioned table is created.
- Level-2 partitioned table: a partitioned table obtained by randomly combining range partitioning, list partitioning, and hash partitioning. Both level-1 and level-2 partitions can be defined in the preceding three ways.
Selecting a Data Type
Efficient data types include the following:
Select data types that facilitate data calculation.
Generally, the calculation of integers (including common comparison calculations, such as =, >, <, ≥, ≤, and ≠ and GROUP BY) is more efficient than that of strings and floating point numbers. For example, if you need to perform a point query on a column-store table whose numeric column is used as a filter criterion, the query will take over 10s. If you change the data type from NUMERIC to INT, the query will be reduced to about 1.8s.
Select data types with a short length.
Data types with a short length reduce both the data file size and the memory used for computing, improving the I/O and computing performance. For example, use SMALLINT instead of INT, and INT instead of BIGINT.
Use the same data type for a join.
You are advised to use the same data type for a join. To join columns with different data types, the database needs to convert them to the same type, which leads to additional performance overheads.