Let’s use C99 and C++11 for common data science tasks.


While languages like Python and R are increasingly popular for data science, C and C++ can be a strong choice for efficient and effective data science. In this article, we will use C99 and C++11 to write a program that works with the Anscombe’s quartet dataset, which is explained below.

I wrote about my motivation for continually learning programming languages in an article on Python and GNU Octave, which is worth reviewing. All of the programs here are meant to be run on the command line, not with a graphical user interface (GUI). The complete examples are available in the Polyglot_FIT repository.

Programming tasks

The programs you will write in this series:

  • Read data from a CSV file
  • Interpolate the data with a straight line (i.e., f(x) = m ⋅ x + q)
  • Draw the results to an image file

This is a common situation that many data scientists encounter. The example data is the first set of Anscombe’s quartet, shown in the table below. It is a set of artificially constructed data that gives the same results when fitted with a straight line, although the plots of the sets look very different. The data file is a text file that uses tabs as column separators and has a few header lines. This task uses only the first set (that is, the first two columns).

C language

C is a general-purpose programming language and is among the most widely used languages today (according to the TIOBE Index, the RedMonk Programming Language Rankings, the Popularity of Programming Language Index, and GitHub’s State of the Octoverse). It is a fairly old language (circa 1973), and many successful programs have been written in it (the Linux kernel and Git, to name just two). It is also one of the languages closest to the inner workings of the computer, as it is used to manipulate memory directly. It is a compiled language; therefore, the source code has to be translated into machine code by a compiler. Its standard library is small and light on features, so other libraries have been developed to provide the missing functionality.

It is the language I use the most for number crunching, mostly because of its performance. I find it rather tedious to use, as it needs a lot of boilerplate code, but it is well supported in a variety of environments. The C99 standard is a more recent revision that adds some nifty features and is well supported by compilers.

Along the way, I’ll cover the necessary background in C and C++ programming so that beginners and advanced users alike can continue learning.

Installation

To develop with C99, you need a compiler. I normally use Clang, but GCC is another valid open source compiler. For linear fitting, I chose to use the GNU Scientific Library (GSL). I could not find any sensible library for plotting, so this program relies on an external program: Gnuplot. The example also uses a dynamic data structure, defined in the Berkeley Software Distribution (BSD), to store the data.

Installing in Fedora is easy:

sudo dnf install clang gnuplot gsl gsl-devel

Code comments

In C99, comments are formatted by putting // at the beginning of the line; the rest of the line is discarded by the compiler. Alternatively, anything between /* and */ is discarded as well.

// This is a comment ignored by the compiler

Necessary libraries

Libraries are composed of two parts:

  • A header file that contains a description of the functions
  • A source file that contains the functions’ definitions

Header files are included in the source, and the libraries’ sources are linked against the executable. Therefore, the header files needed for this example are:

// Input/output utilities
#include <stdio.h>
// The standard library
#include <stdlib.h>
// String manipulation utilities
#include <string.h>
// BSD queue
#include <sys/queue.h>
// GSL scientific utilities
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>

The main function

In C, the program must be inside a special function called main():

int main(void) {
    ...
}

This differs from Python, covered in the previous tutorial, which runs whatever code it finds in the source files.

Define variables

In C, variables have to be declared before they are used, and they have to be associated with a type. Whenever you want to use a variable, you have to decide what kind of data to store in it. You can also specify whether you intend to use a variable as a constant value. This is not required, but the compiler can benefit from this information. The following is from the fitting_C99.c program in the repository:

const char *input_file_name = "anscombe.csv";
const char *delimiter = "\t";
const unsigned int skip_header = 3;
const unsigned int column_x = 0;
const unsigned int column_y = 1;
const char *output_file_name = "fit_C99.csv";
const unsigned int N = 100;

Arrays in C are not dynamic, in the sense that their length has to be decided in advance (i.e., before compilation):

int data_array[1024];

Since you normally do not know how many data points are in a file, use a singly linked list. This is a dynamic data structure that can grow indefinitely. Luckily, BSD provides linked lists. Here is an example definition:

struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point) head = SLIST_HEAD_INITIALIZER(head);
SLIST_INIT(&head);

This example defines a data_list list made of structured values of type struct data_point, each of which contains both an x value and a y value. The syntax is rather complicated but intuitive, and describing it in detail would be too wordy.
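
To make the list macros a little more concrete, here is a minimal sketch of how nodes could be pushed onto such a list and counted. The helper names push_point() and count_points() are hypothetical and are not taken from the repository; only the SLIST_* macros come from <sys/queue.h>.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

// One measurement; SLIST_ENTRY adds the "next" pointer used by the list macros.
struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

// Declares `struct data_list`, the type of the list head.
SLIST_HEAD(data_list, data_point);

// Hypothetical helper: allocate a node and push it onto the front of the list.
// Returns 0 on success, -1 if malloc() fails.
int push_point(struct data_list *head, double x, double y) {
    assert(head != NULL);

    struct data_point *datum = malloc(sizeof *datum);

    if (datum == NULL) {
        return -1;
    }

    datum->x = x;
    datum->y = y;

    SLIST_INSERT_HEAD(head, datum, entries);

    return 0;
}

// Hypothetical helper: count the nodes by walking the list.
size_t count_points(struct data_list *head) {
    assert(head != NULL);

    size_t count = 0;
    struct data_point *datum;

    SLIST_FOREACH(datum, head, entries) {
        count += 1;
    }

    return count;
}
```

A head declared as struct data_list head = SLIST_HEAD_INITIALIZER(head); can be passed to both helpers. Because SLIST_INSERT_HEAD() always inserts at the front, the most recently pushed point is the first one in the list.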

Printing output

To print on the terminal, you can use the printf() function, which works like Octave’s printf() function (described in the first article):

printf("#### Anscombe's first set with C99 ####\n");

The printf() function does not automatically add a newline at the end of the printed string, so you have to add it yourself. The first argument is a string that can contain format information about the other arguments passed to the function, such as:

printf("Slope: %f\n", slope);
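
The same format strings work with the other functions of the printf() family. The following sketch uses snprintf(), which writes into a buffer instead of the terminal, to show what %s and %f produce. The format_result() helper is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: format a labelled value the way
// printf("Slope: %f\n", slope) would, but into a caller-supplied buffer.
// Returns whatever snprintf() returns: the number of characters that the
// full output needs, not counting the terminating '\0'.
int format_result(char *buffer, size_t size, const char *label, double value) {
    assert(buffer != NULL && label != NULL);

    // %s inserts a string, %f a double with six decimal places
    return snprintf(buffer, size, "%s: %f\n", label, value);
}
```

Calling format_result(buffer, sizeof buffer, "Slope", 0.5) with a 64-byte buffer leaves the text "Slope: 0.500000" followed by a newline in the buffer, which is 16 characters in total.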

Read the data

Now comes the hard part… There are some libraries for parsing CSV files in C, but none seemed stable or popular enough to be in the Fedora package repository. Instead of adding a dependency for this tutorial, I decided to write this part myself. Again, going into the details would be too wordy, so I will only explain the general idea. Some lines in the source code are omitted for brevity, but you can find the complete example code in the repository.

First, open the input file:

FILE* input_file = fopen(input_file_name, "r");

Then read the file line by line until an error occurs or the file ends:

while (!ferror(input_file) && !feof(input_file)) {
    size_t buffer_size = 0;
    char *buffer = NULL;
   
    getline(&buffer, &buffer_size, input_file);

    ...
}
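
As a self-contained variant of this loop, the sketch below counts the lines of an open stream. It differs from the excerpt above in two ways that are my own assumptions, not the repository's code: it stops on the return value of getline() instead of testing feof() and ferror(), and it reuses a single buffer that is freed once at the end. The count_lines() name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  // exposes getline() when compiling with -std=c99

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: count the lines of an already opened stream.
// getline() returns -1 at end of file or on error, which ends the loop.
size_t count_lines(FILE *input_file) {
    assert(input_file != NULL);

    char *buffer = NULL;     // getline() allocates and grows this as needed
    size_t buffer_size = 0;
    size_t lines = 0;

    while (getline(&buffer, &buffer_size, input_file) != -1) {
        lines += 1;
    }

    free(buffer);            // one free() is enough because the buffer is reused

    return lines;
}
```

Checking the return value directly avoids processing a stale buffer after the last line, which can happen when the end-of-file condition is only tested before the next read.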

The getline() function is a nice recent addition from the POSIX.1-2008 standard. It can read a whole line from a file and takes care of allocating the necessary memory. Each line is then split into tokens with the strtok() function. Loop over the tokens and select the columns that you want:

char *token = strtok(buffer, delimiter);

while (token != NULL) {
    double value;
    sscanf(token, "%lf", &value);

    if (column == column_x) {
        x = value;
    } else if (column == column_y) {
        y = value;
    }

    column += 1;
    token = strtok(NULL, delimiter);
}
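
The token loop can be wrapped in a function that also reports whether both columns were actually found, which is one way to tell header lines from data lines. This is only a sketch: parse_columns() is a hypothetical name, and checking the sscanf() return value is an addition that the excerpt above does not show.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: extract two numeric columns from one delimited line.
// strtok() modifies `line` in place, so pass a writable buffer.
// Returns 1 when both columns were found and parsed, 0 otherwise.
int parse_columns(char *line, const char *delimiter,
                  unsigned int column_x, unsigned int column_y,
                  double *x, double *y) {
    assert(line != NULL && delimiter != NULL && x != NULL && y != NULL);

    unsigned int column = 0;
    int found_x = 0;
    int found_y = 0;

    char *token = strtok(line, delimiter);

    while (token != NULL) {
        double value;

        // sscanf() returns the number of successfully converted items
        if (sscanf(token, "%lf", &value) == 1) {
            if (column == column_x) {
                *x = value;
                found_x = 1;
            }
            if (column == column_y) {
                *y = value;
                found_y = 1;
            }
        }

        column += 1;
        token = strtok(NULL, delimiter);
    }

    return found_x && found_y;
}
```

Note that strtok() treats consecutive delimiters as one, so this approach assumes that the file has no empty fields.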

Finally, when x and y values are selected, insert the new data point into the linked list:

struct data_point *datum = malloc(sizeof(struct data_point));
datum->x = x;
datum->y = y;

SLIST_INSERT_HEAD(&head, datum, entries);

The malloc() function dynamically allocates (reserves) some persistent memory for new data points.

Fitting the data

The GSL linear fitting function gsl_fit_linear() expects simple arrays for its input. Therefore, since you cannot know in advance the size of the arrays you create, you have to allocate their memory manually:

const size_t entries_number = row - skip_header - 1;

double *x = malloc(sizeof(double) * entries_number);
double *y = malloc(sizeof(double) * entries_number);

Then, iterate through the linked list to save the relevant data to the array:

SLIST_FOREACH(datum, &head, entries) {
    const double current_x = datum->x;
    const double current_y = datum->y;

    x[i] = current_x;
    y[i] = current_y;

    i += 1;
}

Now that you are done with the linked list, clean it up. Always release memory that has been manually allocated to prevent a memory leak. Memory leaks are bad, bad, bad. Every time memory is not released, a garden gnome loses its head:

while (!SLIST_EMPTY(&head)) {
    struct data_point *datum = SLIST_FIRST(&head);

    SLIST_REMOVE_HEAD(&head, entries);

    free(datum);
}
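
The copy and cleanup steps can be collected into two small functions. The sketch below repeats the node definition so that it stands on its own; list_to_arrays() and free_list() are hypothetical names, and the capacity check is an extra safeguard that is not part of the excerpts above.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

// Same node layout as in the example definition earlier in the article.
struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point);

// Hypothetical helper: copy at most `capacity` points into x[] and y[].
// Returns the number of points copied.
size_t list_to_arrays(struct data_list *head,
                      double *x, double *y, size_t capacity) {
    assert(head != NULL && x != NULL && y != NULL);

    size_t i = 0;
    struct data_point *datum;

    SLIST_FOREACH(datum, head, entries) {
        if (i == capacity) {
            break;              // never write past the end of the arrays
        }

        x[i] = datum->x;
        y[i] = datum->y;

        i += 1;
    }

    return i;
}

// Hypothetical helper: release every node; the list is empty afterwards.
void free_list(struct data_list *head) {
    assert(head != NULL);

    while (!SLIST_EMPTY(head)) {
        struct data_point *datum = SLIST_FIRST(head);

        SLIST_REMOVE_HEAD(head, entries);
        free(datum);
    }
}
```

Since the points were inserted at the head of the list, the arrays end up in the reverse of the file order. That does not matter for a least-squares fit, which does not depend on the order of the points.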

At last, at last! You can now fit your data:

gsl_fit_linear(x, 1, y, 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x, 1, y, 1, entries_number);

printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);
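
To see what a straight-line fit computes, here is a plain C sketch of ordinary least squares that does not use GSL. It is not how gsl_fit_linear() is implemented; it only applies the textbook formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The struct line_fit type and the fit_line() function are hypothetical names, and the sketch reports the square of the correlation coefficient rather than the coefficient itself so that it does not need the math library.

```c
#include <assert.h>
#include <stddef.h>

// Result of a straight-line fit y = intercept + slope * x.
struct line_fit {
    double slope;
    double intercept;
    double r_squared;   // square of the correlation coefficient
};

// Hypothetical helper: ordinary least squares over n points.
// Requires n >= 2 and that neither x nor y is constant.
struct line_fit fit_line(const double *x, const double *y, size_t n) {
    assert(x != NULL && y != NULL && n >= 2);

    double mean_x = 0.0;
    double mean_y = 0.0;

    for (size_t i = 0; i < n; i += 1) {
        mean_x += x[i];
        mean_y += y[i];
    }

    mean_x /= (double)n;
    mean_y /= (double)n;

    // Sums of squared deviations and of cross products
    double sxx = 0.0;
    double syy = 0.0;
    double sxy = 0.0;

    for (size_t i = 0; i < n; i += 1) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;

        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }

    struct line_fit fit;

    fit.slope = sxy / sxx;
    fit.intercept = mean_y - fit.slope * mean_x;
    fit.r_squared = (sxy * sxy) / (sxx * syy);

    return fit;
}
```

For the eleven points of the first Anscombe set, this gives a slope of about 0.5001 and an intercept of about 3.0001, in line with the program output shown in the results.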

Plotting

You must use an external program for the plotting. Therefore, save the fitting function to an external file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    fprintf(output_file, "%f\t%f\n", current_x, current_y);
}
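
The sampling arithmetic in this loop can be isolated from the file output, which makes it easier to check. In the sketch below, sample_line() is a hypothetical helper that fills two arrays instead of writing to a file; the step and start values follow the same expressions as the loop above.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: fill xs[] and ys[] with n points of the fitted line,
// starting one unit before min_x and stepping toward max_x + 1.
void sample_line(double intercept, double slope,
                 double min_x, double max_x,
                 size_t n, double *xs, double *ys) {
    assert(xs != NULL && ys != NULL && n > 0);

    const double start_x = min_x - 1;
    const double step_x = ((max_x + 1) - start_x) / (double)n;

    for (size_t i = 0; i < n; i += 1) {
        xs[i] = start_x + step_x * (double)i;
        ys[i] = intercept + slope * xs[i];
    }
}
```

With min_x = 4, max_x = 14, and n = 4, the sampled x values are 3, 6, 9, and 12. The last sample lies one step before max_x + 1, because the index stops at n - 1.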

The Gnuplot command for plotting both files is:

plot 'fit_C99.csv' using 1:2 with lines title 'Fit', 'anscombe.csv' using 1:2 with points pointtype 7 title 'Data'

Results

Before you can run the program, you must compile it:

clang -std=c99 -I/usr/include/ fitting_C99.c -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_C99

This command tells the compiler to use the C99 standard, read the fitting_C99.c file, link against the gsl and gslcblas libraries, and save the result to fitting_C99. The output on the command line is:

#### Anscombe's first set with C99 ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421

Here is the resulting image generated with Gnuplot:

Plot and fit of the dataset obtained with C99

The C++11 way

C++ is a general-purpose programming language that is also among the most popular languages in use today. It was created in 1983 as a successor of C, with an emphasis on object-oriented programming (OOP). C++ is commonly regarded as a superset of C, so a C program should be able to be compiled with a C++ compiler. This is not exactly true, as there are some corner cases where the two behave differently. In my experience, C++ needs less boilerplate than C, but its syntax is more difficult if you want to develop objects. The C++11 standard is a more recent revision that adds some nifty features and is largely supported by compilers.

Since C++ is largely compatible with C, I will just highlight the differences between the two. If I do not cover something in this part, it means that it is the same as in C.

Installation

The C++ example has the same dependencies as the C example. On Fedora, run:

sudo dnf install clang gnuplot gsl gsl-devel

Necessary libraries

Libraries work in the same way as in C, but the include directives are slightly different:

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

extern "C" {
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>
}

Since the GSL library is written in C, you must inform the compiler of this special case.

Define variables

C++ supports more data types (classes) than C. For example, its string type has many more features than its C counterpart. Update the variable definitions accordingly:

const std::string input_file_name("anscombe.csv");

For structured objects like strings, you can define variables without using the = symbol.

Printing output

You can use the printf() function, but the cout object is more idiomatic. Use the << operator to indicate the string (or objects) that you want to print out with cout:

std::cout << "#### Anscombe's first set with C++11 ####" << std::endl;

...

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Read the data

The scheme is the same as before. The file is opened and read line by line, but with a different syntax:

std::ifstream input_file(input_file_name);

while (input_file.good()) {
    std::string line;

    getline(input_file, line);

    ...
}

The line tokens are extracted with the same function as in the C99 example. Instead of using standard C arrays, use two vectors. Vectors are an extension of C arrays in the C++ standard library that allows dynamic management of memory without explicitly calling malloc():

std::vector<double> x;
std::vector<double> y;

// Adding an element to x and y:
x.emplace_back(value);
y.emplace_back(value);

Fitting the data

To fit in C++, you don’t have to walk through lists because vectors guarantee contiguous memory. You can pass the pointer to the vector buffer directly to the fit function:

gsl_fit_linear(x.data(), 1, y.data(), 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x.data(), 1, y.data(), 1, entries_number);

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Plotting

Plotting is done with the same approach as before. Write to a file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    output_file << current_x << "\t" << current_y << std::endl;
}

output_file.close();

Then plot using Gnuplot.

Results

Before you can run the program, you must compile it with a similar command:

clang++ -std=c++11 -I/usr/include/ fitting_Cpp11.cpp -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_Cpp11

The output on the command line is:

#### Anscombe's first set with C++11 ####
Slope: 0.500091
Intercept: 3.00009
Correlation coefficient: 0.816421

This is the resulting image generated with Gnuplot:

Plot and fit of the dataset obtained with C++11

Conclusion

This article provides examples of a data fitting and plotting task written in C99 and C++11. Since C++ is largely compatible with C, this article exploited their similarities to write the second example. In some respects, C++ is easier to use because it partially relieves the burden of explicitly managing memory, but its syntax is more complex because it introduces the possibility of writing classes for OOP. However, it is still possible to write software in C with the OOP approach. Since OOP is a style of programming, it can be used in any language. There are some great examples of OOP in C, such as the GObject and Jansson libraries.

For number crunching, I prefer to do it in C99 because it has a simpler syntax and is widely supported. Until recently, C++11 was not widely supported, and I tend to avoid the rough edges of previous versions. For more complex software, C++ may be a good choice.

Do you also use C or C++ for data science? Share your experiences in the comments.


Via: opensource.com/article/20/…

By Cristiano L. Fontana, lujun9972

This article is originally compiled by LCTT and released in Linux China