DB4AI

The current version of openGauss supports the native DB4AI capability. By introducing native AI operators, openGauss simplifies the operation process and fully utilizes the optimization and execution capabilities of the database optimizer and executor to obtain the high-performance model training capability in the database. With a simpler model training and prediction process and higher performance, developers can focus on model tuning and data analysis in a shorter period of time, avoiding fragmented technology stacks and redundant code implementation.

DB4AI of the current version supports logistic regression (binary classification tasks), linear regression, and vector machine algorithms (classification tasks) based on the SGD operator, as well as the K-Means clustering algorithm based on the K-Means operator.

Keyword Parsing

Table 1 DB4AI syntax and keywords

Name

Description

Statement

CREATE MODEL

Creates a model, trains it, and saves the model.

PREDICT BY

Uses an existing model for prediction.

Keywords

TARGET

Target column name of a training or prediction task.

FEATURES

Data feature column name of a training or prediction task.

MODEL

Model name of a training task.

Usage Guide

  1. You can run the CREATE MODEL statement to create and train a model.

    Taking dataset kmeans_2d as an example, the data content of the table is as follows:

    openGauss=# select * from kmeans_2d;
     id |              position
    ----+-------------------------------------
      1 | {74.5268815685995,88.2141939294524}
      2 | {70.9565760521218,98.8114827475511}
      3 | {76.2756086327136,23.8387574302033}
      4 | {17.8495847294107,81.8449544720352}
      5 | {81.2175785354339,57.1677675866522}
      6 | {53.97752255667,49.3158342130482}
      7 | {93.2475341879763,86.934042100329}
      8 | {72.7659293473698,19.7020415100269}
      9 | {16.5800288529135,75.7475957670249}
     10 | {81.8520747194998,40.3476078575477}
     11 | {76.796671198681,86.3827232690528}
     12 | {59.9231450678781,90.9907738864422}
     13 | {70.161884885747,19.7427458665334}
     14 | {11.1269539105706,70.9988166182302}
     15 | {80.5005071521737,65.2822235273197}
     16 | {54.7030725912191,52.151339428965}
     17 | {103.059707058128,80.8419883321039}
     18 | {85.3574452036992,14.9910179991275}
     19 | {28.6501615960151,76.6922890325077}
     20 | {69.7285806713626,49.5416352967732}
    (20 rows)
    

    The data type of the position field in this table is double precision[].

    Specify position as a feature column in the kmeans_2d training set, and use the K-Means algorithm to create and save the point_kmeans model.

    openGauss=# CREATE MODEL point_kmeans USING kmeans FEATURES position FROM kmeans_2d WITH num_centroids=3;
    NOTICE:  Hyperparameter max_iterations takes value DEFAULT (10)
    NOTICE:  Hyperparameter num_centroids takes value 3
    NOTICE:  Hyperparameter tolerance takes value DEFAULT (0.000010)
    NOTICE:  Hyperparameter batch_size takes value DEFAULT (10)
    NOTICE:  Hyperparameter num_features takes value DEFAULT (2)
    NOTICE:  Hyperparameter distance_function takes value DEFAULT (L2_Squared)
    NOTICE:  Hyperparameter seeding_function takes value DEFAULT (Random++)
    NOTICE:  Hyperparameter verbose takes value DEFAULT (0)
    NOTICE:  Hyperparameter seed takes value DEFAULT (0)
    MODEL CREATED. PROCESSED 1
    

    In the preceding command:

    • The CREATE MODEL statement is used to train and save a model.

    • USING specifies the algorithm name.

    • FEATURES specifies the features of the training model and needs to be added based on the column name of the training data table.

    • TARGET specifies the training target of the model. It can be the column name of the data table required for training or an expression, for example, price > 10000.

    • WITH specifies the hyperparameters used for model training. When the hyperparameters are not set by the user, the framework uses the default value.

      The framework supports various hyperparameter combinations for different operators. For details, see Table 2.

      Table 2 Hyperparameters supported by operators

      Operator

      Hyperparameter

      GD

      (logistic_regression, linear_regression, and svm_classification)

      optimizer(char*); verbose(bool); max_iterations(int); max_seconds(double); batch_size(int); learning_rate(double); decay(double); tolerance(double)

      SVM limits the hyperparameter lambda(double).

      Kmeans

      max_iterations(int); num_centroids(int); tolerance(double); batch_size(int); num_features(int); distance_function(char*); seeding_function(char*); verbose(int);seed(int)

      Table 3 shows the default value and value range of each hyperparameter.

      Table 3 Default value and value range of each hyperparameter

      Operator

      Default Hyperparameter Value

      Value Range

      Hyperparameter Description

      GD (logistic_regression, linear_regression, and svm_classification)

      optimizer = gd (gradient descent)

      gd or ngd (natural gradient descent)

      Optimizer

      verbose = false

      T/F

      Log display

      max_iterations = 100

      (0, INT_MAX_VALUE]

      Maximum iterations

      max_seconds = 0 (The running duration is not limited.)

      [0,INT_MAX_VALUE]

      Running duration

      batch_size = 1000

      (0, MAX_MEMORY_LIMIT]

      Number of samples per training

      learning_rate = 0.8

      (0, DOUBLE_MAX_VALUE]

      Learning rate

      decay = 0.95

      (0, DOUBLE_MAX_VALUE]

      Weight decay rate

      tolerance = 0.0005

      (0, DOUBLE_MAX_VALUE]

      Tolerance

      seed = 0 (random value of seed)

      [0, INT_MAX_VALUE]

      Seed

      just for SVM:lambda = 0.01

      (0, DOUBLE_MAX_VALUE)

      Regularization parameter

      Kmeans

      max_iterations = 10

      [1, INT_MAX_VALUE]

      Maximum iterations

      num_centroids = 10

      [1, MAX_MEMORY_LIMIT]

      Number of clusters

      tolerance = 0.00001

      (0,1)

      Central point error

      batch_size = 10

      [1, MAX_MEMORY_LIMIT]

      Number of samples per training

      num_features = 2

      [1, GS_MAX_COLS]

      Number of sample features

      distance_function = "L2_Squared"

      L1\L2\L2_Squared\Linf

      Regularization method

      seeding_function = "Random++"

      "Random++"\"KMeans||"

      Method for initializing seed points

      verbose = 0U

      { 0, 1, 2 }

      Verbose mode

      seed = 0U

      [0, INT_MAX_VALUE]

      Seed

      MAX_MEMORY_LIMIT = Maximum number of tuples loaded in memory

      GS_MAX_COLS = Maximum number of attributes in a database table

    If the model is saved successfully, the following information is returned:

    MODEL CREATED. PROCESSED x
    
  2. View the model information.

    After the training is complete, the model is stored in the gs_model_warehouse system catalog. You can view information about the model and training process in the gs_model_warehouse system catalog.

    You can view a model by viewing the system catalog. For example, run the following SQL statement to view the model named point_kmeans:

    openGauss=# select * from gs_model_warehouse where modelname='point_kmeans';
    -[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    modelname             | point_kmeans
    modelowner            | 10
    createtime            | 2021-04-30 17:30:39.59044
    processedtuples       | 20
    discardedtuples       | 0
    pre_process_time      | 6.2001e-05
    exec_time             | .000185272
    iterations            | 5
    outputtype            | 23
    modeltype             | kmeans
    query                 | CREATE MODEL point_kmeans USING kmeans FEATURES position FROM kmeans_2d WITH num_centroids=3;
    modeldata             |
    weight                |
    hyperparametersnames  | {max_iterations,num_centroids,tolerance,batch_size,num_features,distance_function,seeding_function,verbose,seed}
    hyperparametersvalues | {10,3,1e-05,10,2,L2_Squared,Random++,0,0}
    hyperparametersoids   | {23,23,701,23,23,1043,1043,23,23}
    coefnames             | {original_num_centroids,actual_num_centroids,dimension,distance_function_id,seed,coordinates}
    coefvalues            | {3,3,2,2,572368998,"(77.282589,23.724434)(74.421616,73.239455)(18.551682,76.320914)"}
    coefoids              |
    trainingscoresname    |
    trainingscoresvalue   |
    modeldescribe         | {"id:1,objective_function:542.851169,avg_distance_to_centroid:108.570234,min_distance_to_centroid:1.027078,max_distance_to_centroid:297.210108,std_dev_distance_to_centroid:105.053257,cluster_size:5","id:2,objective_function:5825.982139,avg_distance_to_centroid:529.634740,min_distance_to_centroid:100.270449,max_distance_to_centroid:990.300588,std_dev_distance_to_centroid:285.915094,cluster_size:11","id:3,objective_function:220.792591,avg_distance_to_centroid:55.198148,min_distance_to_centroid:4.216111,max_distance_to_centroid:102.117204,std_dev_distance_to_centroid:39.319118,cluster_size:4"}
    
  3. Use an existing model to perform a prediction task.

    Use the SELECT and PREDICT BY keywords to complete the prediction task based on the existing model.

    Query syntax: SELECT… PREDICT BY… (FEATURES…)… FROM…;

    openGauss=# SELECT id, PREDICT BY point_kmeans (FEATURES position) as pos FROM (select * from kmeans_2d limit 10);
     id | pos
    ----+-----
      1 |   2
      2 |   2
      3 |   1
      4 |   3
      5 |   2
      6 |   2
      7 |   2
      8 |   1
      9 |   3
     10 |   1
    (10 rows)
    

    For the same prediction task, the results of the same model are stable. In addition, models trained based on the same hyperparameter and training set are stable. AI model training is random (random gradient descent of data distribution each batch). Therefore, the computing performance and results of different models can vary slightly.

  4. View the execution plan.

    You can use the EXPLAIN statement to analyze the execution plan in the model training or prediction process of CREATE MODEL and PREDICT BY. The keyword EXPLAIN can be followed by a CREATE MODEL or PREDICT BY statement/clause or an optional parameter. Table 4 shows the supported parameters.

    Table 4 Parameters supported by EXPLAIN

    Parameter

    Description

    ANALYZE

    Boolean variable, which is used to add description information such as the running time and number of loop times

    VERBOSE

    Boolean variable, which determines whether to output the training running information to the client

    COSTS

    Boolean variable

    CPU

    Boolean variable

    DETAIL

    Boolean variable, which is unavailable

    NODES

    Boolean variable, which is unavailable

    NUM_NODES

    Boolean variable, which is unavailable.

    BUFFERS

    Boolean variable

    TIMING

    Boolean variable

    PLAN

    Boolean variable

    FORMAT

    Optional format type: TEXT, XML, JSON, and YAML

    Examples:

    openGauss=# Explain CREATE MODEL patient_logisitic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
    NOTICE:  Hyperparameter batch_size takes value 10
    NOTICE:  Hyperparameter decay takes value DEFAULT (0.950000)
    NOTICE:  Hyperparameter learning_rate takes value 0.050000
    NOTICE:  Hyperparameter max_iterations takes value DEFAULT (100)
    NOTICE:  Hyperparameter max_seconds takes value DEFAULT (0)
    NOTICE:  Hyperparameter optimizer takes value DEFAULT (gd)
    NOTICE:  Hyperparameter tolerance takes value DEFAULT (0.000500)
    NOTICE:  Hyperparameter seed takes value DEFAULT (0)
    NOTICE:  Hyperparameter verbose takes value DEFAULT (FALSE)
    NOTICE:  GD shuffle cache size 212369
                                QUERY PLAN
    -------------------------------------------------------------------
     Gradient Descent  (cost=0.00..0.00 rows=0 width=0)
       ->  Seq Scan on patients  (cost=0.00..32.20 rows=1776 width=12)
    (2 rows)
    
  5. Perform troubleshooting in case of exceptions.

    • Training phase

      • Scenario 1: When the value of the hyperparameter exceeds the value range, the model training fails and an error message is returned. For example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients WITH optimizer='aa';
        NOTICE:  Hyperparameter batch_size takes value DEFAULT (1000)
        NOTICE:  Hyperparameter decay takes value DEFAULT (0.950000)
        NOTICE:  Hyperparameter learning_rate takes value DEFAULT (0.800000)
        NOTICE:  Hyperparameter max_iterations takes value DEFAULT (100)
        NOTICE:  Hyperparameter max_seconds takes value DEFAULT (0)
        NOTICE:  Hyperparameter optimizer takes value aa
        ERROR:  Invalid hyperparameter value for optimizer. Valid values are: gd, ngd. (default is gd)
        
      • Scenario 2: If the model name already exists, the model fails to be saved, and an error message with the cause is displayed:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients;
        NOTICE:  Hyperparameter batch_size takes value DEFAULT (1000)
        NOTICE:  Hyperparameter decay takes value DEFAULT (0.950000)
        NOTICE:  Hyperparameter learning_rate takes value DEFAULT (0.800000)
        NOTICE:  Hyperparameter max_iterations takes value DEFAULT (100)
        NOTICE:  Hyperparameter max_seconds takes value DEFAULT (0)
        NOTICE:  Hyperparameter optimizer takes value DEFAULT (gd)
        NOTICE:  Hyperparameter tolerance takes value DEFAULT (0.000500)
        NOTICE:  Hyperparameter seed takes value DEFAULT (0)
        NOTICE:  Hyperparameter verbose takes value DEFAULT (FALSE)
        NOTICE:  GD shuffle cache size 5502
        ERROR:  The model name "patient_linear_regression" already exists in gs_model_warehouse.
        
      • Scenario 3: If the value in the FEATURE or TARGETS column is *, ERROR is returned with the error cause:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES *  TARGET trait_anxiety  FROM
        patients;
        ERROR:  FEATURES clause cannot be *
        -----------------------------------------------------------------------------------------------------------------------,
        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET *  FROM patients;
        ERROR:  TARGET clause cannot be *
        
      • Scenario 4: If the keyword TARGET is used in the unsupervised learning method or is not applicable to the supervised learning method, ERROR is returned with the error cause:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
        ERROR:  Supervised ML algorithms require TARGET clause
        -----------------------------------------------------------------------------------------------------------------------------
        CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety  FROM patients;   ERROR:  Supervised ML algorithms require FEATURES clause
        
      • Scenario 5: If the GUC parameter statement_timeout is set, the statement that is executed due to training timeout will be terminated. In this case, execute the CREATE MODEL statement. Parameters such as the size of the training set, number of training rounds (iteration), early termination conditions (tolerance and max_seconds), and number of parallel threads (nthread) affect the training duration. When the duration exceeds the database limit, the statement execution is terminated and model training fails.

    • Prediction phase

      • Scenario 6: If the model name cannot be found in the system catalog, the database reports ERROR:

        openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
        ERROR:  There is no model called "patient_logistic_regression".
        
      • Scenario 7: If the data dimension and data type of the FEATURES task are inconsistent with those of the training set, ERROR is reported with the error cause. For example:

        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 1, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pred
        -------------------------------------------------------------------------------------------------------------------------------------
        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 3, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pre
        
Feedback
编组 3备份
    openGauss 2024-05-07 00:46:52
    cancel