SQL, in the field of data processing and analysis is basically “mandarin” status, is almost a necessary ability, but to use SQL, and the relational database system, also known as RDBMS, this is like Mandarin is mainly in China to speak. Of course, if learning English is worth it in order to go to the UK and THE US, it’s like learning Python or Hadoop, and going out to find a job is a skill. But if you’re going to do analysis on plain text data, or Excel spreadsheets, like going to a small country like Kiribati, do you have to roll up your tongue and start memorizing words and grammar in order to have fun shopping? This kind of time, be afraid what think of for the first time, go up some treasure namely, Amoy a good translator.

Set calculator, in this problem, can be said to be a home travel essential artifact!

In fact, working with structured text or Excel data in SQL is a natural way to think about it, in addition to being lazy. A file or table consists of rows of data, each of which is either delimited by certain delimiters (Spaces, commas, tabs… Separate items, or specify a fixed length for each item. This representation is almost identical to that of a Table in a relational database, with variable-length fields and fixed-length fields appearing identical. The difference is that there are no primary keys, data types, nullable concepts on the file. In addition, even the description of the relationship between files is not as clear as the database, often only as business rules or experience, exist in the user’s head or some of the files shown to people.

The idea behind aggregators is to automatically parse structured text or Excel files, map them to “tables,” and on top of that, fully support SQL syntax and functionality.

All right, let’s cut to the chase. Let’s use two associated files as sample files to see how to easily perform query analysis without “install database -> create database table -> import data” :

First, take a look at the sample data. There are two files: employee information (employee.txt) and state basic information (state.xlsx). Here we use two kinds of files, one is formatted TXT text, the other is Excel spreadsheet, that is to say, the aggregator can connect different types of data sources at the same time, isn’t it amazing? Are you surprised?

Even more magical, the collector can automatically identify and read four file types according to the file suffix! TXT, Excel, XLS, and CSV files.

The following two graphs show sample data for employee information and STATE information, respectively, and the two files are associated by the STATE item in employee information (column 5) and the STATEID item in STATE information (column 1).

Sample employee information data:

Sample state information data:

All right, let’s get to work. GENDER= ‘F’; SALARY>10000; GENDER= ‘F’; GENDER= ‘F’;

A
1 =connect()
2 = a1.query (” select * from C :/ SQL /employee.txt where gender= ‘F’ and salary>10000 order by eid “)

Yes, it’s that simple, it’s that familiar! Step 1, connect to the database… Query () is used to execute the SQL query, which is exactly the same as the database query, except that the name of the table after from is changed to the name of the table. The query results are as follows:

Note that in Windows, file paths in the aggregator use a slash “/” instead of a backslash “\”, which is consistent with the Java language.

Select * from employees who were born on January 1, 1980 and earn more than 10000 salary:

A
1 $() select * from C :/ SQL /employee where BIRTHDAY>=date(‘ 1980-01-01 ‘) and SALARY>10000

$() is the equivalent of connect(), followed by SQL. In fact, multiple data sources can be connected simultaneously by writing different data source names in parentheses.

In addition, this example uses the string-to-date function in SQL called date().

Next, the key to SQL databases as distinct from single files is associated queries. For women making more than 10,000 dollars, let’s look at what states they live in:

A
1 =connect()
2 = a1. query(” select t1.eid eid,t1.name name,t1.gender gender,t2.name state,t2.population population,t1.salary from TXT T1 left join C :/ SQL /state.xlsx T2 on t1.state=t2.stateid where t1.gender= ‘F’ and t1.salary>10000 “)

Well, replacing the table name with a filename is a bit long, so we used the SQL alias usage and the result is as follows:

In addition to using aliases instead of absolute paths to files, for extremely long paths or for many files, you can also configure home directories in the set operator – menu – tools – options to facilitate writing and legible reading. This allows you to use file names or relative paths directly in SQL. Is this more like specifying a database and accessing its tables directly?

The configuration method is as follows:

Select * from department where total payroll is greater than 100000;

A
1 =connect()
2 Query (” select dept,count(1) c,sum(salary) s from employee.txt group by dept having s>100000 “)

The query results are as follows:

Let’s go into some details:

1) The set operator supports logical operations and, or, and not, e.g., to query an employee whose surname is Smith or Robinson and is not a male employee in the Sales department:

A
1 =connect()
2 = a1. query(” select * from employee.txt where (surname= ‘Smith’ or surname= ‘Robinson’) and gender= ‘M’ and not dept= ‘Sales’ “)

2) In the set operator, is null can be used to determine whether a surname is null, and is not null can be used to determine whether a surname is not null. For example:

A
1 =connect()
2 = a1. query(” select EID,NAME,SURNAME from employee.txt where SURNAME is null “)

In addition, the coalesce function can be used to process null values. For example, if the surname field is empty, UNKNOWN is displayed in the result:

A
1 =connect()
2 = A1. Query (” select EID, NAME, SURNAME, coalesce (SURNAME, “UNKNOWN”) as SURNAME1 from the employee. TXT “)

Query results are as follows:



Note: the field alias in the aggregator must not be the same as the field name in the file.

3) The set operator supports Case when. For example, if the gender field is “F”, it should be displayed as “female” and if the gender field is “M”, it should be displayed as “male”.

A
1 =connect()
2 = a1.query (” Select EID,NAME,GENDER,(case GENDER when ‘F’ then ‘female’ else ‘male’ end) as GENDER1 from employee.txt “)

Query results are as follows:

4) The set calculator supports fuzzy query with the keyword like. For example, in the case of staff, query the staff whose surname field contains “son”.

A
1 =connect()
2 = a1. query(” select * from employee.txt where surname like ‘%son%’ “)

% is a wildcard character, indicating one or more characters. In addition, the “_” represents a character. If you want to query a surname ending in ‘son’ and preceded by three characters, you can write it as’ surname like ‘___son’; [WJ] indicates a list of characters containing W and J. Surname like ‘[WJ]%’ indicates that surname starts with ‘W’ or ‘J’. Surname like ‘[!WJ]%’ does not begin with ‘W’ or ‘J’.

5) Set operator support In keyword query data In multiple values. For example, query the employees of Finance, Sales, and R&D departments.

A
1 =connect()
2 = a1. query(” select * from employee.txt where dept in (‘ Finance ‘, ‘Sales’,’ R&D ‘) “)

6) The set operator supports defining an external table with T as (x). For example, the state field in employee. TXT is left joined with the state ID field in the state table of another data source, the Demo database, to find the name and population of each employee’s state:

A
1 =connect()
2 = a1. query(” with T2 as (connect(\ “demo\”).query(\ “select * from states\”)) select t1.eid eid,t1.name name,t1.gender gender,t2.name state,t2.population population,t1.salary salary from employee.txt t1 left join t2 on T1. STATE = t2. STATEID “)

In this SQL:

With T2 as (connect(\ “demo\”).query(\ “select * from States \))) defines an external table T2 that connects to the demo data source (which is actually the HSQL demo database of the set operator), Execute the SQL “Select * from States” with the query function. (where \ “is an escape using double quotation marks in a string)

Select t1.eid… Left join t2 on t1.STATE=t2.STATEID “then use the left join t2 and employee. TXT to find the name and population of each employee’s STATE.

This query is a typical combination of a database and a text file. In fact, the with keyword can define the data found in various data sources, making it very flexible to implement federated queries across heterogeneous data sources.

7) The set calculator supports into to output the query results into the file. For example, deptresult.xlsx is used to query the number of employees and the total salary of the department whose total salary is greater than 100000. In this case, the new file is like a new table in a relational database.

A
1 =connect()
2 Query (” select dept,count(1) c,sum(salary) s into deptresult.xlsx from employee.txt group by dept having s>100000 “)

Having said so much, it can be seen that through the set operator, we can basically achieve in structured text data (TXT, CSV, etc.) and Excel files (XLS, XLSX) on the easy, direct use of SQL.

Of course, the aggregator is not a complete “translation” copy of THE ABILITY of SQL, for the SQL sub-query, aggregator is not directly supported, but will be more flexible, convenient, intuitive step-by-step computing to solve. At the same time, for some special join calculations, the aggregator is slower than the traditional database.

Finally, let’s look at the additional benefits of performing SQL calculations with a set operator:

1) Dynamic calculation based on input parameters:

When performing data queries, it is often necessary to perform calculations based on different conditions, which is called dynamic execution. At this point, we can define “grid parameters” to reserve places for conditions that may change. For example, if you want to find out who are the highest-paid young employees in your company, but you are not sure about the age and salary starting line, you can define two parameters in the Program/Grid Parameters menu of the Aggregator IDE: Birthday and salary:

Then use the placeholder “? “in the query statement. Write out the SQL and specify the corresponding grid parameter names as input in sequence:

A
1 =connect()
2 = a1. query(” select * from employee. TXT where BIRTHDAY>=? And SALARY >?” ,birthday,salary)

If you specify a specific value when defining grid parameters and do not check “Set parameters before each run” then the run step will specify the value directly. If Set Parameters before each run is selected, the Set Parameter Value window is displayed every time the script is run. This way, we can enter the parameter values we need at any time, and the query results will change accordingly:

2) Use SQL to query files on the command line

In Windows or Linux system, we can also call the compiled collector script from the command line to query the file data directly. If combined with the timing task mechanism of the operating system, the batch data calculation can be completed at the specified time.

Let’s start with an example that does not return a result set. Periodically provide the finance department with the number of employees and payroll for departments with a payroll greater than 100,000, writing the results to deptresult.xlsx (which can then be sent to the relevant person by email or other means).

First, write a set operator script and save it as deptresult.dfx.

A
1 =connect()
2 Query (” select dept,count(1) c,sum(salary) s into deptresult.xlsx from employee.txt group by dept having s>100000 “)
3 > the output (” create deptResult. XLSX successfully!” )

Then, on the command line, execute the esprocx.exe command (in the bin folder of the collector installation directory) and execute the result:

C:\Program Files\raqsoft\esProc\bin>esprocx.exe deptResult.dfxcreate deptResult.xlsx successfully!

The second line contains the prompt Output by the Output function, which can be used to monitor program execution and debug.

Let’s look at another example of returning a result set, the same query requirement, but instead of requiring output to a file, the result is viewed directly. This time we’ll change the name of the set operator script we wrote to deptQuery.dfx.

A
1 =connect()
2 Query (” select dept,count(1) c,sum(salary) s from employee.txt group by dept having s>100000 “)
3 return A2

Execute on the command line and view the results:

C:\Program Files\raqsoft\esProc\bin>esprocx.exe -R deptQuery.dfx

Result in A3:

Finance 24 177500

HR 19 138000

Marketing 99 733500

Production 91 663000

R&D 29 239000

Sales 187 1362500

Technology 47 344000

Furthermore, the aggregator can write a complete SQL statement directly from the command line and return the query results directly from the file. Is it as convenient as a database command line query tool?

Define a parameter SQL to pass in the SQL statement to be queried.

Then write the following set operator script, save it as query.dfx,

A
1 =connect()
2 =A1.query(sql)
3 return A2

When you run a command, write an SQL statement in the command line interface. The result is as follows:

C: Program Files\raqsoft\esProc\bin> esprocx.exe -r query. DFX \ “select dept,count(1) C,sum(salary) s from employee Group by dept having s>100000\”

Result in A3:

Finance 24 177500

HR 19 138000

Marketing 99 733500

Production 91 663000

R&D 29 239000

Sales 187 1362500

Technology 47 344000

In combination with the dynamic calculation by parameters described earlier, you can also achieve some interaction when using command line calculations. Again, let’s take the example of looking for higher-paid young employees in the company:

In the Program/Grid Parameters menu of the Collector IDE, define two parameters: Birthday and Salary.

Write the following set operator script, save it as empQueryparam.dfx,

A
1 =connect()
2 = a1. query(” select * from employee. TXT where BIRTHDAY>=? And SALARY >?” ,birthday,salary)
3 return A2

When the command is executed, the two parameters are supplied with numeric values in order. The result is as follows:

C:\Program Files\raqsoft\esProc\bin>esprocx.exe -R empQueryParam.dfx

6 Matthew Johnson M 5 1984-07-07 2005-07-07 Sales 11000

22 Jacob Davis M 43 1985-05-07 2001-05-07 R&D 16000

36 Emily Smith F 13 1983-12-27 2003-12-27 Production 12000

57 Megan Thompson F 38 1983-06-15 2003-06-15 Sales 12000

At this point, we have learned enough to use the set calculator, you can use the “diamond diamond” of SQL to do data files these porcelain work. In fact, in this story, set calculator is the real “Diamond diamond”! In addition to treating data files directly as “tables” as described in this article, the truly powerful Arsenal of an aggregator goes far beyond that. With this lightweight data analysis tool, you can easily process data in your database or file system.