Native DB4AI Engine

The current version of openGauss supports native DB4AI capabilities. By introducing native AI operators, openGauss simplifies the operation process and fully utilizes the optimization and execution capabilities of the database optimizer and executor to deliver high-performance in-database model training. With a simpler model training and prediction process and higher performance, developers can focus on model tuning and data analysis, avoiding fragmented technology stacks and redundant code.

Keyword Parsing

Table 1 DB4AI syntax and keywords

| Type | Name | Description |
| ---- | ---- | ----------- |
| Syntax | CREATE MODEL | Creates a model, trains it, and saves the model. |
| Syntax | PREDICT BY | Uses an existing model for prediction. |
| Syntax | DROP MODEL | Deletes a model. |
| Keyword | TARGET | Target column name of a training or prediction task. |
| Keyword | FEATURES | Data feature column name of a training or prediction task. |
| Keyword | MODEL | Model name of a training task. |
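
The three statements combine into a minimal workflow. The following sketch only shows where each clause goes; model_name, algorithm_name, the column names, and table_name are placeholders, and complete, runnable examples are given in the Usage Guide below.

    CREATE MODEL model_name USING algorithm_name
        FEATURES feature_column [, ...]
        [TARGET target_column_or_expression]
        FROM table_name
        [WITH hyperparameter = value [, ...]];

    SELECT PREDICT BY model_name (FEATURES feature_column [, ...]) FROM table_name;

    DROP MODEL model_name;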

Usage Guide

  1. Check the algorithms supported in this version.

    DB4AI in the current version supports the following algorithms:

    Table 2 Supported algorithms

    | Optimization Algorithm | Algorithm |
    | ---------------------- | --------- |
    | GD | logistic_regression, linear_regression, svm_classification, PCA, multiclass |
    | Kmeans | kmeans |
    | xgboost | xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma |

  2. Learn about the model training syntax.

    • CREATE MODEL

      You can run the CREATE MODEL statement to create, train, and save a model. The following SQL statement uses the public Iris dataset for model training.

    • The example specifies sepal_length, sepal_width, petal_length, and petal_width as feature columns in the tb_iris_1 training set, and uses the xgboost_regression_logistic algorithm to create and save the iris_classification_model model.

      openGauss=# CREATE MODEL iris_classification_model USING xgboost_regression_logistic FEATURES sepal_length, sepal_width,petal_length,petal_width TARGET target_type < 2 FROM tb_iris_1 WITH nthread=4, max_depth=8;
      MODEL CREATED. PROCESSED 1
      

      In the preceding command:

      • The CREATE MODEL statement is used to train and save a model.

      • USING specifies the algorithm name.

      • FEATURES specifies the feature columns used to train the model; they must correspond to column names in the training data table.

      • TARGET specifies the training target of the model. It can be the column name of the data table required for training or an expression, for example, price > 10000.

      • WITH specifies the hyperparameters used for model training. If a hyperparameter is not set by the user, the framework uses its default value.

        The framework supports various hyperparameter combinations for different operators.

        Table 3 Hyperparameters supported by operators

        | Operator | Hyperparameter |
        | -------- | -------------- |
        | GD (logistic_regression, linear_regression, svm_classification) | optimizer(char*), verbose(bool), max_iterations(int), max_seconds(double), batch_size(int), learning_rate(double), decay(double), tolerance(double); lambda(double) applies to SVM only |
        | K-Means | max_iterations(int), num_centroids(int), tolerance(double), batch_size(int), num_features(int), distance_function(char*), seeding_function(char*), verbose(int), seed(int) |
        | GD (pca) | batch_size(int), max_iterations(int), max_seconds(int), tolerance(float8), verbose(bool), number_components(int), seed(int) |
        | GD (multiclass) | classifier(char*); other hyperparameters of multiclass depend on the selected classifier |
        | xgboost (xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma) | batch_size(int), booster(char*), tree_method(char*), eval_metric(char*), seed(int), nthread(int), max_depth(int), gamma(float8), eta(float8), min_child_weight(int), verbosity(int) |

        The default value and value range of each hyperparameter are as follows:

        Table 4 Default values and value ranges of hyperparameters

        | Operator | Default Hyperparameter Value | Value Range | Hyperparameter Description |
        | -------- | ---------------------------- | ----------- | -------------------------- |
        | GD: logistic_regression, linear_regression, svm_classification, pca | optimizer = gd (gradient descent) | gd or ngd (natural gradient descent) | Optimizer |
        | | verbose = false | T or F | Log display |
        | | max_iterations = 100 | (0, 10000] | Maximum iterations |
        | | max_seconds = 0 (the running duration is not limited) | [0, INT_MAX_VALUE] | Running duration |
        | | batch_size = 1000 | (0, 1048575] | Number of data records selected per training |
        | | learning_rate = 0.8 | (0, DOUBLE_MAX_VALUE] | Learning rate |
        | | decay = 0.95 | (0, DOUBLE_MAX_VALUE] | Weight decay rate |
        | | tolerance = 0.0005 | (0, DOUBLE_MAX_VALUE] | Tolerance |
        | | seed = 0 (0 means a random seed) | [0, INT_MAX_VALUE] | Seed |
        | | svm_classification only: kernel = "linear" | linear, gaussian, or polynomial | Kernel function |
        | | svm_classification only: components = MAX(2*features, 128) | [0, INT_MAX_VALUE] | Number of dimensions of the high-dimensional space |
        | | svm_classification only: gamma = 0.5 | (0, DOUBLE_MAX_VALUE] | Gaussian kernel function parameter |
        | | svm_classification only: degree = 2 | [2, 9] | Polynomial kernel function parameter |
        | | svm_classification only: coef0 = 1.0 | [0, DOUBLE_MAX_VALUE] | Polynomial kernel function parameter |
        | | svm_classification only: lambda = 0.01 | (0, DOUBLE_MAX_VALUE) | Regularization parameter |
        | | pca only: number_components | (0, INT_MAX_VALUE] | Target dimension after dimension reduction |
        | GD: multiclass | classifier = "svm_classification" | svm_classification or logistic_regression | Classifier used for multiclass tasks |
        | K-Means | max_iterations = 10 | [1, 10000] | Maximum iterations |
        | | num_centroids = 10 | [1, 1000000] | Number of clusters |
        | | tolerance = 0.00001 | (0, 1] | Central point error |
        | | batch_size = 10 | [1, 1048575] | Number of data records selected per training |
        | | num_features = 2 | [1, INT_MAX_VALUE] | Number of sample features |
        | | distance_function = "L2_Squared" | L1, L2, L2_Squared, or Linf | Distance measurement method |
        | | seeding_function = "Random++" | "Random++" or "KMeans\|\|" | Method for initializing seed points |
        | | verbose = 0U | {0, 1, 2} | Verbose mode |
        | | seed = 0U | [0, INT_MAX_VALUE] | Seed |
        | xgboost: xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_gamma, xgboost_regression_squarederror | n_iter = 10 | (0, 10000] | Number of iterations |
        | | batch_size = 10000 | (0, 1048575] | Number of data records selected per training |
        | | booster = "gbtree" | gbtree, gblinear, or dart | Booster type |
        | | tree_method = "auto" | auto, exact, approx, hist, or gpu_hist (Note: gpu_hist requires a GPU library; otherwise, DB4AI does not support this value.) | Tree construction algorithm |
        | | eval_metric = "rmse" | rmse, rmsle, map, mae, auc, or aucpr | Data verification metric |
        | | seed = 0 | [0, INT_MAX_VALUE] | Seed |
        | | nthread = 1 | [0, 100] | Concurrency |
        | | max_depth = 5 | (0, MAX_MEMORY_LIMIT] | Maximum depth of a tree; valid only for tree boosters |
        | | gamma = 0.0 | [0, DOUBLE_MAX_VALUE] | Minimum loss required for further partitioning on a leaf node |
        | | eta = 0.3 | [0, 1] | Step size used in the update to prevent overfitting |
        | | min_child_weight = 1 | [0, INT_MAX_VALUE] | Minimum sum of instance weights required in a child node |
        | | verbosity = 1 | 0 (silent), 1 (warning), 2 (info), or 3 (debug) | Printing level |

        MAX_MEMORY_LIMIT: maximum number of tuples that can be loaded in memory.

        GS_MAX_COLS: maximum number of attributes in a database table.
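
        Putting the WITH clause together with a few of these hyperparameters, the following is a hedged sketch. The model name patient_logistic_model is hypothetical; the patients table (with the second_attack, treatment, and trait_anxiety columns) is the one used in the EXPLAIN and troubleshooting examples later in this section.

        openGauss=# CREATE MODEL patient_logistic_model USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH optimizer='ngd', learning_rate=0.05, max_iterations=500, tolerance=0.0001;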

    • If the model is saved successfully, the following information is returned:

      MODEL CREATED. PROCESSED x
      
  3. View the model information.

    After the training is complete, the model is stored in the gs_model_warehouse system catalog. You can view information about the model and training process in the gs_model_warehouse system catalog.
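
    For example, you can list the saved models by querying the catalog directly (a minimal sketch; it assumes the catalog exposes a modelname column):

    openGauss=# SELECT modelname FROM gs_model_warehouse;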

    The model details are stored in the system catalog in binary mode. You can use the gs_explain_model function to view the model details. The statement is as follows:

    openGauss=# select * from gs_explain_model('iris_classification_model');
     DB4AI MODEL
    -------------------------------------------------------------
     Name: iris_classification_model
     Algorithm: xgboost_regression_logistic
     Query: CREATE MODEL iris_classification_model
     USING xgboost_regression_logistic
     FEATURES sepal_length, sepal_width,petal_length,petal_width
     TARGET target_type < 2
     FROM tb_iris_1
     WITH nthread=4, max_depth=8;
     Return type: Float64
     Pre-processing time: 0.000000
     Execution time: 0.001443
     Processed tuples: 78
     Discarded tuples: 0
     n_iter: 10
     batch_size: 10000
     max_depth: 8
     min_child_weight: 1
     gamma: 0.0000000000
     eta: 0.3000000000
     nthread: 4
     verbosity: 1
     seed: 0
     booster: gbtree
     tree_method: auto
     eval_metric: rmse
     rmse: 0.2648450136
     model size: 4613
    
  4. Use an existing model to perform a prediction task.

    Use the SELECT and PREDICT BY keywords to perform a prediction task based on an existing model.

    Query syntax: SELECT… PREDICT BY… (FEATURES…)… FROM…;

    openGauss=# SELECT id, PREDICT BY iris_classification (FEATURES sepal_length,sepal_width,petal_length,petal_width) as "PREDICT" FROM tb_iris limit 3;
    
    id  | PREDICT
    -----+---------
      84 |       2
      85 |       0
      86 |       0
    (3 rows)
    

    For the same prediction task, a given model always returns stable results, and models trained with the same hyperparameters on the same training set are likewise stable. However, because AI model training involves randomness (for example, the per-batch data distribution in stochastic gradient descent), the computing performance and results of different trained models can vary slightly.
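
    To sanity-check predictions, you can select the prediction next to the original label in the same query. A hedged sketch, assuming the tb_iris table also stores the label in a target_type column:

    openGauss=# SELECT id, target_type, PREDICT BY iris_classification (FEATURES sepal_length, sepal_width, petal_length, petal_width) AS "PREDICT" FROM tb_iris LIMIT 5;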

  5. View the execution plan.

    You can use the EXPLAIN statement to analyze the execution plan of model training or prediction in CREATE MODEL and PREDICT BY. The EXPLAIN keyword can be followed by a CREATE MODEL or PREDICT BY clause and, optionally, by the following parameters:

    Table 5 Parameters supported by EXPLAIN

    | Parameter | Description |
    | --------- | ----------- |
    | ANALYZE | Boolean variable, which adds run-time information such as the running time and the number of loops |
    | VERBOSE | Boolean variable, which determines whether to output the training run information to the client |
    | COSTS | Boolean variable |
    | CPU | Boolean variable |
    | DETAIL | Boolean variable; unavailable |
    | NODES | Boolean variable; unavailable |
    | NUM_NODES | Boolean variable; unavailable |
    | BUFFERS | Boolean variable |
    | TIMING | Boolean variable |
    | PLAN | Boolean variable |
    | FORMAT | Optional format type: TEXT, XML, JSON, or YAML |

    Example:

    openGauss=# Explain CREATE MODEL patient_logistic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
                                   QUERY PLAN
    -------------------------------------------------------------------------
     Train Model - logistic_regression  (cost=0.00..0.00 rows=0 width=0)
       ->  Materialize  (cost=0.00..41.08 rows=1776 width=12)
             ->  Seq Scan on patients  (cost=0.00..32.20 rows=1776 width=12)
    (3 rows)
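
    The optional parameters are written in the same way as in a regular EXPLAIN statement. A hedged sketch that combines ANALYZE with a PREDICT BY query (it assumes a model named patient_logistic_regression already exists):

    openGauss=# EXPLAIN ANALYZE SELECT id, PREDICT BY patient_logistic_regression (FEATURES second_attack, treatment) FROM patients;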
    
  6. Perform troubleshooting in case of exceptions.

    • Training phase

      • Scenario 1: When a hyperparameter value is outside its valid range, model training fails and an error message is returned. For example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients WITH optimizer='aa';
        ERROR:  Invalid hyperparameter value for optimizer. Valid values are: gd, ngd.
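
        Re-running the statement with one of the valid values listed in the error message (gd or ngd) is a hedged sketch of the fix:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety FROM patients WITH optimizer='ngd';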
        
      • Scenario 2: If the model name already exists, the model fails to be saved, and an error message with the cause is displayed. For example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients;
        ERROR:  The model name "patient_linear_regression" already exists in gs_model_warehouse.
        
      • Scenario 3: If * is specified in the FEATURES or TARGET clause, an error message with the cause is displayed. For example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES *  TARGET trait_anxiety  FROM patients;
        ERROR:  FEATURES clause cannot be *
        -----------------------------------------------------------------------------------------------------------------------
        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET *  FROM patients;
        ERROR:  TARGET clause cannot be *
        
      • Scenario 4: If the FEATURES or TARGET clause required by a supervised learning algorithm is missing, or the TARGET keyword is used with an unsupervised learning algorithm, an error message with the cause is displayed. For example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
        ERROR:  Supervised ML algorithms require TARGET clause
        -----------------------------------------------------------------------------------------------------------------------------
        CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety  FROM patients;   
        ERROR:  Supervised ML algorithms require FEATURES clause
        
      • Scenario 5: If there is only one category in the TARGET column, an error message with the cause is displayed. For example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7 TARGET cat FROM (SELECT * FROM db4ai_ecoli WHERE cat='cp');
        ERROR:  At least two categories are needed
        
      • Scenario 6: DB4AI filters out rows that contain null values during training. If no valid training data remains, an error message with the cause is displayed. For example:

        openGauss=# create model iris_classification_model using xgboost_regression_logistic features message_regular target error_level from error_code;
        ERROR:  Training data is empty, please check the input data.
        
      • Scenario 7: The DB4AI algorithms support a limited set of data types. If a column's data type is not in the whitelist, an error message containing the OID of the unsupported type is returned; you can look up this OID in pg_type to identify the data type (see the query after this example). For example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7, cat TARGET cat FROM db4ai_ecoli ;
        ERROR:  Oid type 1043 not yet supported
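
        For example, OID 1043 in the message above can be mapped back to a type name with a catalog lookup (pg_type is a standard system catalog, so this query can be run as-is):

        openGauss=# SELECT oid, typname FROM pg_type WHERE oid = 1043;
          oid  | typname
        -------+---------
          1043 | varchar
        (1 row)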
        
      • Scenario 8: If the GUC parameter statement_timeout is set, a CREATE MODEL statement whose training runs longer than the configured timeout is terminated, and the model training fails. The size of the training set, the number of training iterations, the early-stop conditions (tolerance and max_seconds), and the number of parallel threads (nthread) all affect the training duration, so adjust the timeout accordingly for long training jobs.
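
        A hedged sketch of checking and relaxing the limit for the current session before starting a long training job (the 30-minute value is only an illustration):

        openGauss=# SHOW statement_timeout;
        openGauss=# SET statement_timeout = '30min';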

    • Model parsing

      • Scenario 9: If the model name cannot be found in the system catalog, an error message with the cause is displayed. For example:

        openGauss=# select gs_explain_model("ecoli_svmc");
        ERROR:  column "ecoli_svmc" does not exist
        
    • Prediction phase

      • Scenario 10: If the model name cannot be found in the system catalog, an error message with the cause is displayed. For example:

        openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
        ERROR:  There is no model called "patient_logistic_regression".
        
      • Scenario 11: If the number or type of the FEATURES columns supplied for prediction is inconsistent with that of the training set, an error message with the cause is displayed. For example:

        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 1, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pred
        -------------------------------------------------------------------------------------------------------------------------------------
        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 3, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pre
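
        A hedged sketch of the corrected call, passing exactly the two feature columns the model was trained on:

        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack,treatment) FROM patients;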
        

NOTE: The DB4AI feature requires data access for computing and is not applicable to encrypted databases.
