This article introduces Greenplum R, a new R development library built on Greenplum. Greenplum R provides two functions, gpapply and gptapply, which send R programs to Greenplum for parallel execution. This avoids moving data out of the database and makes R code run more efficiently.

R is an open source programming language focused on statistical analysis, with a rich set of statistical extension packages. With the rise of big data, data analysts now use R widely for large-scale analysis, and the major data platforms have strengthened their support for it.

The Greenplum big data analysis platform has good support for R. There are two ways to do data analysis with R in Greenplum: connect to the Greenplum database over ODBC and read the data, or write user-defined functions (UDFs) in PL/R. Each method has advantages and disadvantages. With the first, there is no need to learn the special syntax of PL/R; a standard R program is enough. The drawback is that the data must be read from Greenplum to the client before it can be processed. The second approach does not move any data and lets R programs execute in parallel, but the user has to learn PL/R syntax, and UDFs written in PL/R are not easy to debug.

So is there a way to have it both ways? Can you get the parallel computing benefits of the Greenplum platform without the special syntax of PL/R? The answer is yes: the Greenplum R library allows standard R programs to be executed in parallel in Greenplum.

Greenplum R overview

The idea behind Greenplum R is to convert standard R programs into R UDFs that can be executed in parallel on the Greenplum platform. The Greenplum R library provides two important functions, gpapply and gptapply, that convert an R function into an R UDF, send it to Greenplum for parallel execution, and then return the combined results to the caller.

Greenplum R executes R code as follows. The user calls gpapply, which converts the R code into a UDF and sends it to Greenplum for execution. The Greenplum master node distributes the UDF to each Greenplum segment for execution, then gathers the results from all segments and returns them to the caller.
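The flow described above can be mimicked locally to get a feel for it. The following is a minimal sketch in plain R that needs no database: simulate_gpapply, n_segments and double_value are hypothetical names invented for this illustration, and the round-robin row assignment is only a stand-in for however Greenplum actually distributes a table across segments.

```r
# Local simulation only: no Greenplum connection or Greenplum R call is involved.
# simulate_gpapply and n_segments are hypothetical names for this sketch.
simulate_gpapply <- function(data, fun, n_segments = 2, ...) {
  # Assign rows to "segments" round-robin, standing in for table distribution.
  segment_id <- (seq_len(nrow(data)) - 1) %% n_segments
  chunks <- split(data, segment_id)
  # Each "segment" applies the same function to its own chunk of rows.
  partial <- lapply(chunks, fun, ...)
  # The "master" gathers the per-segment results into one result set.
  combined <- do.call(rbind, partial)
  rownames(combined) <- NULL
  combined
}

# Example function to ship to the segments: doubles the value column.
double_value <- function(frame) {
  data.frame(id = frame$id, value = frame$value * 2)
}

input <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))
result <- simulate_gpapply(input, double_value, n_segments = 3)
print(result)
```

Note that the combined rows come back grouped by segment rather than in the original order, which is typical of results gathered from parallel workers.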

From the R developer's perspective, a table in Greenplum is abstracted as a db.data.frame object. Both the data that the R program operates on and the result set it returns are therefore represented as db.data.frame objects.

Greenplum R example

Here’s a simple example of how to use the Greenplum R Library.

Suppose we have an R function that sums the first column of a table and then adds that sum, plus an adjust value, to each value in the second column.

# Input is a data frame. Sum the first column, then add that sum and the
# adjust value to each element of the second column, and return a new table.
x <- function(frame, adjust)
{
  # sum() adds up the whole column; looping over a data frame would iterate columns, not rows
  total <- sum(frame$column1)
  res <- data.frame(column1 = frame$column1, column2 = frame$column2 + total + adjust)
  return(res)
}
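Before sending a function like this to the database, it can be checked locally on an ordinary data.frame. The sketch below defines a corrected standalone copy of the function under the hypothetical name adjust_second_column and runs it on made-up sample values; the column names column1 and column2 follow the example above.

```r
# Standalone local check on a plain data.frame; no database is involved.
# adjust_second_column and the sample values are invented for this sketch.
adjust_second_column <- function(frame, adjust) {
  # Sum of the first column over all rows of the frame.
  total <- sum(frame$column1)
  # New table: first column unchanged, second column shifted by total + adjust.
  data.frame(column1 = frame$column1,
             column2 = frame$column2 + total + adjust)
}

sample_frame <- data.frame(column1 = c(1L, 2L, 3L),
                           column2 = c(10L, 20L, 30L))
local_result <- adjust_second_column(sample_frame, adjust = 5L)
print(local_result)
# column2 becomes 21, 26, 31, because the column1 total is 6 and adjust is 5
```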

This function can be sent to Greenplum for parallel execution with the db.gpapply function:

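# Output signature: column names and SQL types of the result table
res <- list(c(column1 = 'int', column2 = 'int'))
inputTable <- db.data.frame('test1')
# adjust is required by x; it is assumed here to be forwarded to FUN through the extra arguments
db.gpapply(inputTable, output.name = 'results', FUN = x, output.signature = res, adjust = 1)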

The internal mechanics of Greenplum R

The Greenplum R library achieves parallel execution of R code by converting it into user-defined functions (UDFs). Greenplum offers two ways to run R UDFs: the standard PL/R mechanism, or PL/Container.

When PL/R is used, Greenplum creates a Query Executor (QE) process to execute the R UDF. This approach is simple and efficient, but it carries a security risk: PL/R is an untrusted language, so a UDF may contain malicious code that has serious consequences for the database.

The other way is to use PL/Container, in which case Greenplum executes the R UDF inside a Docker container. This method is much more secure: even if the R UDF contains malicious code, it runs inside a container and cannot affect the Greenplum database. In addition, resource groups can be used to limit the computing resources (CPU and memory) available to the R UDF.
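Both mechanisms ultimately run a UDF whose body is R source code, and the difference is visible in the function definition itself. The sketch below shows, in plain R, how an R function could be turned into the text of such a definition. It is an illustration only: build_udf_sql is a hypothetical helper, the runtime id plc_r_shared is an assumed value, and the output is not the SQL that the Greenplum R library actually generates.

```r
# Illustration only: build_udf_sql is a hypothetical helper and does not
# reproduce the SQL that the Greenplum R library actually generates.
build_udf_sql <- function(name, fun, arg_type, return_type,
                          language = c("plr", "plcontainer"),
                          runtime_id = "plc_r_shared") {
  language <- match.arg(language)
  # deparse(body(fun)) turns the function body back into lines of R source.
  body_lines <- deparse(body(fun))
  # A PL/Container body starts with a comment naming the container runtime.
  if (language == "plcontainer") {
    body_lines <- c(paste("# container:", runtime_id), body_lines)
  }
  # Use the first formal argument of the R function as the SQL argument name.
  arg_name <- names(formals(fun))[1]
  paste0(
    "CREATE OR REPLACE FUNCTION ", name, "(", arg_name, " ", arg_type, ")\n",
    "RETURNS ", return_type, " AS $$\n",
    paste(body_lines, collapse = "\n"), "\n",
    "$$ LANGUAGE ", language, ";"
  )
}

add_one <- function(x) {
  x + 1
}

plr_sql <- build_udf_sql("add_one", add_one, "float8", "float8", "plr")
plc_sql <- build_udf_sql("add_one", add_one, "float8", "float8", "plcontainer")
cat(plr_sql, "\n\n")
cat(plc_sql, "\n")
```

The two generated definitions differ only in the LANGUAGE clause and the container comment, which is why the same R function can be run either way.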

Resources

Greenplum R code repository:

github.com/greenplum-d…

Greenplum R documentation:

gpdb.docs.pivotal.io/6-6/analyti…