Array operation | - | Array and vector operations, including addition, subtraction, multiplication, division, exponentiation, root extraction, cosine, sine, absolute value, and variance. |
Principal component analysis for dimensionality reduction | PCA | Reduces the dimensionality of data by computing its principal components. |
Encoding categorical variable | - | One-hot and dummy encoding are currently supported. Dummy encoding is typically used to compare the other groups of a predictor variable against one specific group, which is called the reference group. One-hot encoding is similar, except that it creates a numeric 0/1 indicator column for every category value. In each row of data (corresponding to one data point), only one of these indicator columns can be 1. |
Matrix operation | - | Matrix addition, subtraction, multiplication, and division, extremum, mean, rank calculation, inversion, matrix decomposition (QR, LU, Cholesky), and eigenvalue extraction. Decomposing a large matrix into a product of simpler matrices can greatly reduce the difficulty and volume of computation. |
Norms and distance functions | - | Computes norms, cosine similarity, and distances between vectors. |
Sparse vector | - | Implements a sparse vector type. If a vector contains a large number of repeated values, it can be compressed to save space. |
Pivot | - | Pivot tables meet common row-to-column transposition requirements in OLAP and reporting systems. The pivot function performs basic row-to-column conversion on data stored in a table and writes the aggregated result to another table, which makes such conversions easier and more flexible. |
Path | - | Performs regular pattern matching over a sequence of rows and extracts useful information about the matches. That information can be a simple match count or something more involved, such as the result of an aggregate or window function. |
Sessionize | - | The sessionize function reconstructs time-oriented sessions from a dataset that contains event sequences. A defined period of inactivity marks the end of one session and the start of the next. It can be used for web analytics, network security, manufacturing, finance, and operational analytics. |
Conjugate gradient | - | A method for numerically solving systems of linear equations whose coefficient matrix is symmetric positive definite. |
Stemming | - | Stemming reduces a word to its stem. It can be used, for example, to build a topic-focused search engine. The optimization effect is most evident on English websites and can serve as a reference for websites in other languages. |
Train-Test Split | - | Splits a dataset into a training set and a test set. The training set is used to train a model, and the test set is used to validate it. |
Cross validation | - | Performs cross validation. |
Prediction metric | - | Evaluates the quality of model predictions, with metrics including mean squared error, AUC, confusion matrix, and adjusted R-squared. |
Mini-batch preprocessor | - | Packs the data into small batches for training. Mini-batch training typically performs better than stochastic gradient descent (the default MADlib optimizer) and converges faster and more smoothly. |
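
To make two of the rows above more concrete, the following is a minimal Python sketch of the ideas behind categorical encoding and sessionize. It is not MADlib code and does not use the MADlib API: the function names `one_hot`, `dummy`, and `sessionize`, the sample data, and the 30-minute inactivity threshold are all assumptions chosen for illustration. The sketch also assumes that dummy encoding omits the indicator column of the reference group and that session numbering starts at 1.

```python
from datetime import datetime, timedelta


def one_hot(values):
    """Build one 0/1 indicator column per distinct category value."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def dummy(values, reference):
    """Like one_hot, but leave out the column of the reference group
    (assumption: the reference group is encoded as all zeros)."""
    columns = one_hot(values)
    columns.pop(reference)
    return columns


def sessionize(timestamps, max_gap):
    """Assign a session id to each event in a time-ordered sequence.
    A gap longer than max_gap ends one session and starts the next."""
    session_ids = []
    current = 1
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous > max_gap:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


# Hypothetical sample data, for illustration only.
colors = ["red", "green", "red", "blue"]
one_hot_columns = one_hot(colors)
dummy_columns = dummy(colors, reference="red")

events = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 50),  # 45 minutes of inactivity before this event
    datetime(2024, 1, 1, 10, 55),
]
sessions = sessionize(events, max_gap=timedelta(minutes=30))
```

In this sketch, `one_hot_columns` has a column for each of the three colors and every row has a 1 in exactly one of them, while `dummy_columns` has only the `blue` and `green` columns, so rows belonging to the reference group `red` are all zeros. The 45-minute gap before the third event exceeds the threshold, so `sessions` places the first two events in session 1 and the last two in session 2.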