SQL on Spark

Overview

SQL on Spark allows you to access a Spark database instance, sending SQL statements to it and processing the returned results. For compatibility with earlier versions, Extension Connector retains the original interface function exec_hadoop_sql(text, text, text). However, the exec_hadoop_sql function has security risks and can be used only after permission is granted, so you are advised to use the exec_on_extension function instead.

When connecting to a Spark database instance, the exec_hadoop_sql and exec_on_extension functions have the following differences:

  • Parameters and their meanings:

    The three text parameters of exec_hadoop_sql(text, text, text) indicate the DSN, SQL statement, and encoding, respectively.

    The two text parameters of exec_on_extension(text, text) indicate the data source name and SQL statement, respectively.

  • Required ODBC configuration:

    For exec_hadoop_sql, ODBC needs to be configured only on the current execution node.

    For exec_on_extension, ODBC must be configured on all openGauss nodes.

  • Execution:

    exec_hadoop_sql does not require a data source to be created before execution.

    exec_on_extension requires that a data source be created before execution; both call forms are shown in the example after this list.
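
The following minimal sketch illustrates the parameter difference. It assumes the DSN spark_odbc, the data source spark_ds, and the Spark table t1 (c1 int, c2 string) used in Examples below:

-- exec_hadoop_sql(DSN, SQL, encoding): references the ODBC DSN directly.
postgres=# SELECT * FROM exec_hadoop_sql('spark_odbc', 'select * from t1;', '') AS (c1 int, c2 text);
-- exec_on_extension(data source, SQL): references a data source that was created beforehand.
postgres=# CREATE DATA SOURCE spark_ds OPTIONS(dsn 'spark_odbc');
postgres=# SELECT * FROM exec_on_extension('spark_ds', 'select * from t1;') AS (c1 int, c2 text);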

Supported Spark Data Types

The following table lists data types supported by SQL on Spark.

Table 1 Data types supported by SQL on Spark

General Data Type | Spark Data Type | Local openGauss Data Type
------------------+-----------------+--------------------------
NUMERIC           | TINYINT         | SMALLINT
NUMERIC           | SMALLINT        | SMALLINT
NUMERIC           | INTEGER         | INTEGER
NUMERIC           | BIGINT          | BIGINT
NUMERIC           | FLOAT           | DOUBLE PRECISION
NUMERIC           | DOUBLE          | DOUBLE PRECISION
NUMERIC           | DECIMAL         | DOUBLE PRECISION
STRING            | STRING          | TEXT
STRING            | VARCHAR         | TEXT
STRING            | CHAR            | TEXT
BOOLEAN           | BOOLEAN         | BOOLEAN
DATE              | TIMESTAMP       | TIMESTAMP
DATE              | DATE            | DATE

NOTICE:

  • To receive data of a Spark type, specify its corresponding openGauss data type in the AS clause (see the sketch after this notice). If the type returned by Spark is not listed in the table, or its corresponding data type is not specified, the type conversion may produce an incorrect result or fail.
  • For a string of the CHAR(n) type in Spark, if its length is less than n, it is automatically padded with spaces. These spaces are retained when the string is transferred to openGauss and converted to the TEXT type.
  • If the encoding mode of openGauss is set to SQL_ASCII, the length() function returns the number of bytes of the string rather than the actual number of characters. For example, for the query select c2,length(c2) from exec_on_extension('spark','select * from a;') as (c1 int, c2 text); the second column returned contains the number of bytes of each string, not the number of characters. The same applies to exec_hadoop_sql.
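
The AS clause must declare one openGauss data type from Table 1 for each column returned by Spark. The following is a minimal sketch; the Spark table s1 (id INT, name CHAR(10), created TIMESTAMP) and the data source spark_ds are hypothetical and used only for illustration:

-- Hypothetical Spark table s1 (id INT, name CHAR(10), created TIMESTAMP).
-- Each returned column is declared with its openGauss counterpart from Table 1.
postgres=# SELECT * FROM exec_on_extension('spark_ds', 'select id, name, created from s1;') AS (id integer, name text, created timestamp);
-- Spark pads CHAR(10) values with trailing spaces and the padding is kept in the TEXT result;
-- rtrim() removes it if it is not wanted.
postgres=# SELECT rtrim(name) FROM exec_on_extension('spark_ds', 'select name from s1;') AS (name text);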

Interconnection Configuration

The following procedure uses exec_on_extension as an example. When using exec_hadoop_sql to connect to Spark, you do not need to create any data sources.

  1. Log in as the OS user omm to the primary node of the database.

  2. Prepare package.zip and put it in $GAUSSHOME/utilslib/fc_conf/$DSN. If the path does not exist, create it first. $DSN indicates a folder named after the DSN. The DSN name must consist of letters, digits, and underscores (_). The package contains the following files:

    • KrbClient.zip
    • datasight-hiveodbc-1.0.0-1.el6.x86_64.rpm
    • kuser_1544261037803_keytab.tar

    Run the following command to generate the package:

    zip -r package.zip KrbClient.zip datasight-hiveodbc-1.0.0-1.el6.x86_64.rpm kuser_1544261037803_keytab.tar
    

    KrbClient.zip indicates the Kerberos client to connect to Spark, kuser_1544261037803_keytab.tar indicates the user authentication credential used for connection, and datasight-hiveodbc-1.0.0-1.el6.x86_64.rpm indicates the Spark ODBC installation package. Perform the following steps to obtain the files:

    1. Log in to the peer Spark database instance to be connected.
    2. In the upper right corner of the home page, choose More > Download Client. Decompress the downloaded client package to obtain the FusionInsight_Services_ClientConfig.tar package. Decompress that package, go to the FusionInsight_Services_ClientConfig directory, and compress the KrbClient directory into KrbClient.zip in the same way as package.zip is prepared above.
    3. Go to the Spark2x directory in FusionInsight_Services_ClientConfig and decompress DataSight_Spark_V100R002C50RC2SPC200_HiveODBC.zip. Go to the Linux directory generated after the decompression and select the .rpm package based on the OS version: Red Hat/CentOS: datasight-hiveodbc-1.0.0-1.el6.x86_64.rpm; SUSE: datasight-hiveodbc-1.0.0-1.x86_64.rpm.
    4. At the top of the home page, click System to go to the user management page. Select the user that is used for connecting to Spark, click More, and download the authentication credential kuser_1544261037803_keytab.tar.
  3. Generate the DSN.ini file in $GAUSSHOME/utilslib/fc_conf/$DSN. Ensure that the file content and format are correct. For example:

    [spark]
    MODE=0
    HOST=10.185.178.151
    PORT=23040
    DATABASE=zyl
    PRINCIPAL=spark/hadoop.hadoop.com@HADOOP.COM
    FRAMED=0
    NAMESPACE=sparkthriftserver
    principal="kuser@HADOOP.COM"
    
  4. Install datasight-hiveodbc.

    mkdir $GAUSSHOME/utilslib/fc_conf/$DSN/sparkodbc
    

    Red Hat/CentOS environment:

    rpm -ivh --nofiledigest $GAUSSHOME/utilslib/fc_conf/$DSN/datasight-hiveodbc-1.0.0-1.el6.x86_64.rpm --nodeps --force --prefix=$GAUSSHOME/utilslib/fc_conf/$DSN/sparkodbc
    

    SUSE environment:

    rpm -ivh --nofiledigest $GAUSSHOME/utilslib/fc_conf/$DSN/datasight-hiveodbc-1.0.0-1.x86_64.rpm --nodeps --force --prefix=$GAUSSHOME/utilslib/fc_conf/$DSN/sparkodbc
    
  5. Run the following command to automatically deploy Extension Connector. After the deployment is complete, run the kinit command on each node to generate a ticket for the user.

    gs_om -t ec -m add -N DSN -U username --type=spark
    gs_om -t ec -m add -N DSN -U username --type=spark  -L  #-L indicates the local mode.
    kinit kuser    # The kuser user refers to the user who connects to and accesses the peer database instance.
    

    NOTE: If the local mode is used, perform steps 1 to 3 on each node.

  6. Run the following command to restart the database instance and stop the om_monitor process so that the openGauss process can detect the environment variable change:

    gs_om -t ec -m restart -U username
    gs_om -t ec -m restart -U username -L   #-L indicates the local mode.
    gs_om -t stop && gs_om -t start         #Run this command only in local mode.
    

    NOTE:

    • If the local mode is used, perform steps 1 to 4 on each node.
    • In local mode, the preceding command does not start or stop the database instance. Therefore, you need to manually run the commands to stop and start the database instance.
  7. Create a data source.

    Perform the following steps:

    1. Perform the steps in Connecting to a Database.

    2. Create a data source.

      postgres=# CREATE DATA SOURCE spark_ds OPTIONS(DSN 'spark_odbc', encoding 'utf8');
      

      The field meanings and modification methods are similar to those described in step 9 of “SQL on Oracle”.
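
      If the DSN or encoding needs to be changed later, one simple approach, sketched here using only the statements already shown in this section, is to drop and re-create the data source:

      postgres=# DROP DATA SOURCE spark_ds;
      postgres=# CREATE DATA SOURCE spark_ds OPTIONS(DSN 'spark_odbc', encoding 'utf8');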

  8. Connect to Spark.

    For details, see Examples.

Precautions

  1. The prepared authentication information and files are used to connect to Spark; therefore, you do not need to provide a username or password in the data source. If they are provided, they are transferred to unixODBC.
  2. Ensure that the content in package.zip is complete and that the content in the DSN.ini file is correct. Do not modify them after the installation and deployment are complete. Otherwise, the environment may become unusable.
  3. Extension Connector can connect to only one Spark database instance at a time. If multiple Spark connections are set up, authentication conflicts occur.
  4. Changing the user password of the Spark database instance invalidates the authentication credential. At the top of the home page, click System to go to the user management page. Select the user that is used for connecting to Spark, click More, and download the authentication credential again. Upload the downloaded credential to the $GAUSSHOME/utilslib/fc_conf/$DSN/ directory, where $DSN indicates the user-defined DSN name. Decompress the credential to replace the existing files, and change the owner of the credential files to the database instance user.
  5. Spark ODBC must be compiled by using GCC 5.4 or later.

Examples

-- Assume that table t1 (c1 int, c2 string) exists in Spark. The content is as follows:
-- 1  spark
-- 2  hive
--Query data by using the exec_hadoop_sql function. spark_odbc is the DSN of Spark in odbc.ini.
postgres=# SELECT * FROM exec_hadoop_sql('spark_odbc', 'select * from t1;', '') AS (c1 int, c2 text);
 c1 |  c2   
----+-------
  1 | spark
  2 | hive
(2 rows)
--Query data by using the exec_on_extension function.
postgres=# CREATE DATA SOURCE spark_ds OPTIONS(dsn 'spark_odbc');
postgres=# SELECT * FROM exec_on_extension ('spark_ds', 'select * from t1;') AS (c1 int, c2 text);
 c1 |  c2   
----+-------
  1 | spark
  2 | hive
(2 rows)
postgres=# DROP DATA SOURCE spark_ds;

Troubleshooting

For details about common issues that occur when Extension Connector interconnects with Spark, see Table 2.

Table 2 Common troubleshooting for interconnection between Extension Connector and Spark

  • Deployment error

    Error information: [GAUSS-50201] : The $GAUSSHOME/utilslib/fc_conf/ora/ does not exist.

    Troubleshooting: View automatic deployment logs and rectify the fault based on the error information in the logs. Then perform the deployment again. The log path is as follows:

      • The default path is $GAUSSLOG/om.
      • If the gs_om tool is used for automatic deployment and -l is specified, logs are stored in the specified directory.

  • Connection error

    Error information: ERROR: source "spark_ds" does not exist

    Troubleshooting: Create a data source. For details, see CREATE DATA SOURCE.

  • Execution error

    Error information: ERROR: invalid input syntax for integer

    Troubleshooting: Check the Spark data types supported by Extension Connector. For details, see Table 1 and the sketch after this table.

  • Execution error

    Error information: ERROR: dn_6033_6034: DSN:spark2x,Fail to get data from ODBC connection! Detail can be found in node log of 'dn_6033_6034'.
    DETAIL: SQL_ERROR: cast unexpected type to date/timestamp.

    Troubleshooting: This error is reported when the data type returned by Spark does not match the data type expected in the AS clause. If other syntax errors occur during SQL statement execution, Extension Connector also returns the Spark errors.
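
For the type-mismatch errors above, correcting the AS clause so that it matches Table 1 usually resolves the problem. The following minimal sketch reuses the table t1 (c1 int, c2 string) from Examples; the first query is intentionally wrong and is shown only to illustrate the error:

-- c2 is a Spark STRING; declaring it as an integer in the AS clause can trigger
-- "invalid input syntax for integer" when a non-numeric value is returned.
postgres=# SELECT * FROM exec_on_extension('spark_ds', 'select * from t1;') AS (c1 int, c2 int);
-- Declare c2 with its openGauss counterpart from Table 1 (STRING maps to TEXT).
postgres=# SELECT * FROM exec_on_extension('spark_ds', 'select * from t1;') AS (c1 int, c2 text);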

Related Information

CREATE DATA SOURCE; “Server Tools > gs_om” in the Tool Reference
