Index Recommendation

openGauss index recommendation has three functions, including single-query index recommendation, virtual index recommendation, and workload-level index recommendation.

Single-query Index Recommendation

The single-query index recommendation function supports direct operations in the database and generates recommended indexes for a single query statement entered by users based on the semantic information of the query statement and the statistics data of the database. This function supports only a single SELECT statement and does not support other types of SQL statements. This function depends on the gs_index_advise function.

Table 1 Single-query index recommendation APIs

Function	Parameter	Description
gs_index_advise	SQL statement string	Generates a recommendation index for a single query statement.

Use the preceding function to obtain the recommendation index generated for the query. The recommendation result consists of the table name and column name of the index. To use the tool, do as follows:

openGauss=# select "table", "column" from gs_index_advise('SELECT c_discount from bmsql_customer where c_w_id = 10');
     table      |  column  
----------------+----------
 bmsql_customer | (c_w_id)
(1 row)

The preceding statement indicates that an index should be created on the c_w_id column of the bmsql_customer table. To create the index, run the following command:

CREATE INDEX idx on bmsql_customer(c_w_id);

A composite index may be recommended for some SQL statements, for example:

openGauss=# select "table", "column" from gs_index_advise('select name, age, sex from t1 where age >= 18 and age < 35 and sex = ''f'';');
 table | column
-------+------------
 t1    | (age, sex)
(1 row)

The preceding statement indicates that a composite index (age, sex) should be created on table t1. To create the index, run the following command:

CREATE INDEX idx1 on t1(age, sex);

Virtual Index Recommendation

The virtual index recommendation function supports direct operations in the database and simulates the creation of a real index to avoid the time and space overhead required for creating a real index. Based on the virtual index, users can estimate cost of the index on the specified query statement by using the optimizer. For details about the functions and GUC parameters related to virtual indexes, see Table 2 and Table 3.

Table 2 Virtual index functions

Function	Parameter	Description
hypopg_create_index	Character string of the statement for creating an index	Creates a virtual index.
hypopg_display_index	None	Displays information about all created virtual indexes.
hypopg_drop_index	OID of the index	Deletes a specified virtual index.
hypopg_reset_index	None	Clears all virtual indexes.
hypopg_estimate_size	OID of the index	Estimates the space required for creating a specified index.

Table 3 GUC parameters related to virtual indexes

Parameter	Description	Default Value
enable_hypo_index	Whether to enable the virtual index function.	off

To use the tool, do as follows:

Use the **hypopg_create_index **function to create a virtual index.

openGauss=> select * from hypopg_create_index('create index on bmsql_customer(c_w_id)');
 indexrelid |              indexname              
------------+-------------------------------------
     329726 | <329726>btree_bmsql_customer_c_w_id
(1 row)

Enable the GUC parameter enable_hypo_index. This parameter determines whether the database optimizer considers the created virtual index when executing the EXPLAIN statement. By executing EXPLAIN on a specific query statement, you can evaluate whether the index can improve the execution efficiency of the query statement based on the execution plan provided by the optimizer. For example:

Before enabling the GUC parameter, run the combined statement of keyword EXPLAIN and the query statement.

openGauss=> explain SELECT c_discount from bmsql_customer where c_w_id = 10;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..52963.06 rows=31224 width=4)
   Filter: (c_w_id = 10)
(2 rows)

To enable the GUC parameters, run the following command:

openGauss=> set enable_hypo_index = on;
SET

After enabling the GUC parameter, run the combined statement of keyword EXPLAIN and the query statement.

openGauss=> explain SELECT c_discount from bmsql_customer where c_w_id = 10;
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 [Bypass]
 Index Scan using <329726>btree_bmsql_customer_c_w_id on bmsql_customer  (cost=0.00..39678.69 rows=31224 width=4)
   Index Cond: (c_w_id = 10)
(3 rows)

By comparing the two execution plans, you can find that the index may reduce the execution cost of the specified query statement. Then, you can consider creating a real index.

(Optional) Use the hypopg_display_index function to display all created virtual indexes. For example:

openGauss=> select * from hypopg_display_index();
                 indexname                  | indexrelid |     table      |      column      
--------------------------------------------+------------+----------------+------------------
 <329726>btree_bmsql_customer_c_w_id        |     329726 | bmsql_customer | (c_w_id)
 <329729>btree_bmsql_customer_c_d_id_c_w_id |     329729 | bmsql_customer | (c_d_id, c_w_id)
(2 rows)

(Optional) Use the hypopg_estimate_size function to estimate the space (in bytes) required for creating a virtual index. For example:

openGauss=> select * from hypopg_estimate_size(329730);
 hypopg_estimate_size 
----------------------
             15687680
(1 row)

Delete the virtual index.
Use the hypopg_drop_index function to delete the virtual index of a specified OID. For example:
```
openGauss=> select * from hypopg_drop_index(329726);
 hypopg_drop_index 
-------------------
 t
(1 row)
```
Use the hypopg_reset_index function to clear all created virtual indexes at a time. For example:
```
openGauss=> select * from hypopg_reset_index();
 hypopg_reset_index 
--------------------

(1 row)
```

Workload-level Index Recommendation

For workload-level indexes, you can run scripts outside the database to use this function. This function uses the workload of multiple DML statements as the input to generate a batch of indexes that can optimize the overall workload execution performance.

Prepare a file that contains multiple DML statements as the input workload. Each statement in the file occupies a line. You can obtain historical service statements from the offline logs of the database.
Run the Python script index_advisor_workload.py:
```
python index_advisor_workload.py [p PORT] [d DATABASE] [f FILE] [--h HOST] [-U USERNAME] [-W PASSWORD][--schema SCHEMA]
[--max_index_num MAX_INDEX_NUM][--max_index_storage MAX_INDEX_STORAGE] [--multi_iter_mode] [--multi_node]  [--json] [--driver] [--show_detail]
```
The input parameters are as follows:
- PORT: port of the connected database.
- DATABASE: name of the connected database.
- FILE: file path that contains the workload statement.
- HOST (optional): ID of the host that connects to the database.
- USERNAME (optional): username for connecting to the database.
- PASSWORD (optional): password for connecting to the database.
- SCHEMA: schema name.
- MAX_INDEX_NUM (optional): maximum number of recommended indexes.
- MAX_INDEX_STORAGE (optional): maximum size of the index set space.
- multi_node (optional): specifies whether the current instance is a distributed database instance.
- multi_iter_mode (optional): algorithm mode. You can switch the algorithm mode by setting this parameter.
- json (optional): specifies the file path format of the workload statement as JSON after SQL normalization. By default, each SQL statement occupies one line.
- driver (optional): specifies whether to use the Python driver to connect to the database. By default, gsql is used for the connection.
- show_detail (optional): specifies whether to display the detailed optimization information about the current recommended index set.
For example:
```
python index_advisor_workload.py 6001 postgres tpcc_log.txt --schema public --max_index_num 10 --multi_iter_mode
```
The recommendation result is a batch of indexes, which are displayed on the screen in the format of multiple create index statements. The following is an example of the result.
```
create index ind0 on public.bmsql_stock(s_i_id,s_w_id);
create index ind1 on public.bmsql_customer(c_w_id,c_id,c_d_id);
create index ind2 on public.bmsql_order_line(ol_w_id,ol_o_id,ol_d_id);
create index ind3 on public.bmsql_item(i_id);
create index ind4 on public.bmsql_oorder(o_w_id,o_id,o_d_id);
create index ind5 on public.bmsql_new_order(no_w_id,no_d_id,no_o_id);
create index ind6 on public.bmsql_customer(c_w_id,c_d_id,c_last,c_first);
create index ind7 on public.bmsql_new_order(no_w_id);
create index ind8 on public.bmsql_oorder(o_w_id,o_c_id,o_d_id);
create index ind9 on public.bmsql_district(d_w_id);
```