Preface

Meteorological data is typical big data: the volume is large, timeliness requirements are high, and the data types are varied. Much of it is spatio-temporal data that records observed or simulated values of physical quantities at points in time and space. The amount generated every day is often tens to hundreds of TB and keeps growing rapidly. Storing and querying this data efficiently is an increasingly difficult problem.

Traditional solutions often combine a relational database with a file system to store such meteorological data and serve real-time queries. This approach has weaknesses in scalability, maintainability, and performance, and they become more obvious as the data grows. In recent years, the industry has increasingly turned to distributed NoSQL stores, such as TableStore, to store and query meteorological grid data. TableStore is a distributed NoSQL service developed by Alibaba that offers very large storage capacity, supports highly concurrent access with low latency, and is well suited to the scale and query-performance problems of meteorological data.

We previously published a related solution article, “Massive Meteorological Data Storage and query Scheme based on Distributed NoSQL on the cloud”, and some customers have built systems based on it. To lower the development effort for customers and to provide a general-purpose implementation, we recently developed the TableStore-grid library. With this library, users can conveniently store, query, and manage meteorological grid data. This article is a practical guide that explains the design and use of this solution.

Background

Characteristics of grid data

Grid data is inherently multidimensional. Taking the data produced by one run of a numerical model system as an example, it generally has the following five dimensions:

  1. Physical quantity (element): temperature, humidity, wind direction, wind speed, and so on.
  2. Forecast time: the next 3 hours, 6 hours, 9 hours, up to 72 hours, and so on.
  3. Height.
  4. Longitude.
  5. Latitude.

When we fix an element and a forecast time, height, longitude, and latitude form a three-dimensional grid, as shown in the figure below. Each grid point represents a point in three-dimensional space, and the value at that point is the forecast value of a physical quantity (such as temperature) at a given forecast time (such as the next three hours).



Suppose a three-dimensional grid contains 10 planes at different heights, each plane has 2880 x 570 grid points, and each grid point holds 4 bytes of data. The data volume of this three-dimensional grid is then 2880 x 570 x 4 x 10 = 65,664,000 bytes, about 66 MB. This is only the forecast of one physical quantity at one forecast time from one model system, so the total volume of model data is very large.
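The arithmetic can be checked with a few lines of Java. This is a minimal sketch; the class and method names (GridSizeEstimate, volumeBytes) are illustrative and are not part of TableStore or the TableStore-grid library.

```java
// Illustrative helper, not part of TableStore or the TableStore-grid library.
final class GridSizeEstimate {
    /** Bytes needed for one 3D grid: levels x (xSize x ySize) points x bytesPerPoint. */
    static long volumeBytes(int levels, int xSize, int ySize, int bytesPerPoint) {
        // Cast to long first so that larger grids do not overflow int arithmetic.
        return (long) levels * xSize * ySize * bytesPerPoint;
    }

    public static void main(String[] args) {
        long bytes = volumeBytes(10, 2880, 570, 4);
        System.out.println(bytes + " bytes"); // prints "65664000 bytes"
    }
}
```

Multiplying this figure by the number of physical quantities and forecast times gives the size of one complete model run, which is why a single data set easily reaches several GB.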

Grid data query method

Forecasters browse model data (grid data) through a web page when making numerical model forecasts. The page needs to support several ways of querying the model data, such as:

  1. Query the grid data of a latitude-longitude plane: for example, the global surface temperature forecast for the next three hours, or the surface temperature forecast for Zhejiang Province for the next three hours.
  2. Query the time series of a grid point: for example, the temperature at the location of Alibaba Cloud for the next 3 hours, 6 hours, and so on up to the next 72 hours.
  3. Query data of different physical quantities: for example, the forecast values of all physical quantities at a given forecast time, height, and point.
  4. Query data generated by different model systems: for example, the data generated by a model of the European center and the corresponding data generated by the meteorological agency in China.

As mentioned above, a grid data set is generally a five-dimensional structure, and each query method is in effect a slice of this five-dimensional data: a plane, a cross-section, a sequence at a point, a three- or four-dimensional subspace, and so on. The design must guarantee query performance under all of these query conditions, which is the main technical difficulty in data query.

TableStore-based scheme design

Standardized grid data model

First, we define a regular five-dimensional grid data set as a GridDataSet. In dimension order, its five dimensions are:

  1. Variable: the variable, such as a physical quantity.
  2. Time: the time dimension.
  3. Z: the Z-axis, generally the height.
  4. X: the X-axis, usually longitude or latitude.
  5. Y: the Y-axis, usually longitude or latitude.

GridDataSet = F(variable, time, z, x, y).
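One way to picture F is as a lookup from five indices to one value. The sketch below maps (variable, time, z, x, y) indices to an offset in a dense row-major array. It only illustrates the model; the class name FiveDimIndex is hypothetical, and this is not how the library lays data out in TableStore.

```java
// Illustrative sketch of the five-dimensional model as one flat array index.
// FiveDimIndex is a hypothetical name, not part of the TableStore-grid library.
final class FiveDimIndex {
    private final int[] shape; // lengths of {variable, time, z, x, y}

    FiveDimIndex(int variables, int times, int zs, int xs, int ys) {
        this.shape = new int[] {variables, times, zs, xs, ys};
    }

    /** Row-major offset of the grid point (v, t, z, x, y); y varies fastest. */
    long offset(int v, int t, int z, int x, int y) {
        int[] index = {v, t, z, x, y};
        long offset = 0;
        for (int d = 0; d < shape.length; d++) {
            if (index[d] < 0 || index[d] >= shape[d]) {
                throw new IndexOutOfBoundsException("dimension " + d + ": " + index[d]);
            }
            offset = offset * shape[d] + index[d];
        }
        return offset;
    }

    /** Total number of grid points in the data set. */
    long totalPoints() {
        long total = 1;
        for (int length : shape) {
            total *= length;
        }
        return total;
    }
}
```

Fixing some of the five indices and letting the others vary gives the planes, point sequences, and subspaces that the query methods ask for.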

In addition to the five-dimensional data and the length of each dimension, a GridDataSet also carries some other information:

  1. GridDataSetId: the ID that uniquely identifies this GridDataSet.
  2. Attributes: custom attribute information, such as when the data was generated, where it came from, the forecast type, and so on. Users can define custom attributes freely and can create indexes on some of them. Once indexes are created, users can query the data sets that match a combination of conditions.

For example, suppose a weather forecast predicts various physical quantities at each height, longitude, and latitude for every hour of the next 72 hours. This forecast is a standard five-dimensional data set and forms one GridDataSet; the next run of the same forecast is another data set. The two data sets need different GridDataSetIds. They are similar except for the forecast issue time. Because the issue time is not part of the five-dimensional model (the time dimension refers to the different future moments within one forecast), the two runs belong to different data sets, and the issue time can be stored as a custom attribute of the data set. This solution also supports retrieving data sets by conditions on custom attributes.

Data storage scheme

We designed two tables that store the meta and the data of a GridDataSet respectively. Meta holds the metadata of the data set, such as the GridDataSetId, dimension lengths, and custom attributes. Data holds the actual grid data and is much larger than meta.

Meta and data are stored separately for two main reasons:

  1. Users may need to query data sets by various criteria, such as which data sets were recently added, or which data sets of a certain type exist. Traditional solutions mainly keep this information in MySQL or another relational database. In this scheme, we store it in a separate meta table and use TableStore's search index feature to support multi-condition combination queries and various sort orders, which is easier to use than the traditional approach.
  2. Before querying grid data, one generally needs to know the length of each dimension and other information, which is stored in the meta table. That is, the meta table is queried first and the data table second. Because meta is small, querying it is much cheaper than querying data, so the extra query does not noticeably increase latency.

Meta table design

The meta table design is simple. The primary key has a single column that stores the GridDataSetId, because the GridDataSetId uniquely identifies a GridDataSet. System attributes and custom attributes are stored in the attribute columns of the meta table.


There are two ways to query the meta table. One is to look up a row directly by GridDataSetId. The other is to query through a search index, which supports combinations of conditions on multiple attributes, for example filtering data sets of a certain type and returning them from newest to oldest by storage time.

Data table design

The data table design must make queries efficient for every way of slicing the five-dimensional data, so the data cannot simply be stored as is.

To maximize query efficiency, we first minimize the amount of data a query has to scan. A data set may hold several GB of data, while a query often needs only a few MB. If the required data cannot be located efficiently, the query has to scan all several GB to filter out the range it needs, which is clearly very inefficient. So how can we locate the required data efficiently?

We start with the table structure, which uses four primary key columns:

  1. GridDataSetId: the data set ID that uniquely identifies the data set.
  2. Variable: the variable name, the first dimension of the five-dimensional model.
  3. Time: the second dimension of the five-dimensional model.
  4. Z: the height, the third dimension of the five-dimensional model.


These four primary key columns identify one row in TableStore. The row holds the data of the remaining two dimensions, that is, one grid plane.

With this design, the first three of the five dimensions are located through primary key values: each combination of values of the first three dimensions corresponds to one row in TableStore. Because these dimensions represent variable, time, and height, they are generally small, each ranging from a few to a few dozen values. Queries that span several rows can be sped up by reading the rows in parallel.
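To make the row mapping concrete, the sketch below lists the rows a query would touch for a given set of variables and ranges of time and height. It is a minimal illustration of the primary key design; RowKey and rowsFor are hypothetical names and not library API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: RowKey and rowsFor are hypothetical names, not library API.
final class RowKeyEnumerator {
    /** Mirrors the four primary key columns of the data table. */
    static final class RowKey {
        final String dataSetId;
        final String variable;
        final int t;
        final int z;

        RowKey(String dataSetId, String variable, int t, int z) {
            this.dataSetId = dataSetId;
            this.variable = variable;
            this.t = t;
            this.z = z;
        }

        @Override
        public String toString() {
            return dataSetId + "/" + variable + "/" + t + "/" + z;
        }
    }

    /** Lists every row touched by a query over the given variables, time range, and height range. */
    static List<RowKey> rowsFor(String dataSetId, List<String> variables,
                                int tOrigin, int tCount, int zOrigin, int zCount) {
        List<RowKey> keys = new ArrayList<>();
        for (String variable : variables) {
            for (int t = tOrigin; t < tOrigin + tCount; t++) {
                for (int z = zOrigin; z < zOrigin + zCount; z++) {
                    keys.add(new RowKey(dataSetId, variable, t, z));
                }
            }
        }
        return keys;
    }
}
```

A 24-step time series of one variable at one height level maps to 24 independent keys. Because each key addresses exactly one row, the rows can be requested concurrently.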

What remains is how the last two dimensions are stored and queried. These two dimensions form a horizontal plane, usually a latitude-longitude grid. They are much larger than the first three dimensions, with each ranging from hundreds to thousands of points, and the grid keeps growing rapidly as numerical forecasts become finer. For such dense grid data, we cannot store each grid point in its own column, because the number of columns would be huge and storage efficiency very low. On the other hand, if we store a whole plane in a single column, reading the full plane is efficient, but reading a single point pulls in a large amount of unneeded data. We therefore adopt a compromise and slice the two-dimensional plane into smaller data blocks. A query then reads only the blocks it needs instead of the whole plane, which greatly improves query performance.
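The effect of slicing a plane into blocks can be sketched as follows. Given a block size, the code computes which blocks a rectangular query overlaps. The article does not state the block size or layout the library uses, so the 100 x 100 block size in the note below and the names PlaneBlocks, blockCount, and blocksFor are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: PlaneBlocks and any block size passed to it are assumptions, not library API.
final class PlaneBlocks {
    /** Number of blocks needed to cover an xSize x ySize plane (ceiling division). */
    static int blockCount(int xSize, int ySize, int blockX, int blockY) {
        int blocksAlongX = (xSize + blockX - 1) / blockX;
        int blocksAlongY = (ySize + blockY - 1) / blockY;
        return blocksAlongX * blocksAlongY;
    }

    /** Returns the {blockXIndex, blockYIndex} of every block overlapping the query rectangle. */
    static List<int[]> blocksFor(int x0, int y0, int width, int height, int blockX, int blockY) {
        if (x0 < 0 || y0 < 0 || width <= 0 || height <= 0 || blockX <= 0 || blockY <= 0) {
            throw new IllegalArgumentException("invalid rectangle or block size");
        }
        int firstBx = x0 / blockX;
        int lastBx = (x0 + width - 1) / blockX;
        int firstBy = y0 / blockY;
        int lastBy = (y0 + height - 1) / blockY;
        List<int[]> blocks = new ArrayList<>();
        for (int bx = firstBx; bx <= lastBx; bx++) {
            for (int by = firstBy; by <= lastBy; by++) {
                blocks.add(new int[] {bx, by});
            }
        }
        return blocks;
    }
}
```

With a 2880 x 570 plane and an assumed 100 x 100 block size, the plane is covered by 29 x 6 = 174 blocks. A single-point query touches one block, and a 100 x 20 rectangle starting at (150, 90) touches four, so most of the plane is never read.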


Scheme implementation

Based on the storage scheme above, we implemented the TableStore-grid library, which provides the following interfaces:

public interface GridStore {
    /**
     * Creates the meta and data tables. Call this before writing any data.
     * @throws Exception
     */
    void createStore() throws Exception;

    /**
     * Writes the meta information of a GridDataSet.
     * @param meta
     * @throws Exception
     */
    void putDataSetMeta(GridDataSetMeta meta) throws Exception;

    /**
     * Updates the meta information.
     * @param meta
     * @throws Exception
     */
    void updateDataSetMeta(GridDataSetMeta meta) throws Exception;

    /**
     * Gets the meta by GridDataSetId.
     * @param dataSetId
     * @return
     * @throws Exception
     */
    GridDataSetMeta getDataSetMeta(String dataSetId) throws Exception;

    /**
     * Creates an index on the meta table.
     * @param indexName
     * @param indexSchema
     * @throws Exception
     */
    void createMetaIndex(String indexName, IndexSchema indexSchema) throws Exception;

    /**
     * Queries the data sets that match a combination of query conditions.
     * @param indexName the index name.
     * @param query the query conditions, which can be built with QueryBuilder.
     * @param queryParams query parameters, including offset, limit, sort, etc.
     * @return
     * @throws Exception
     */
    QueryGridDataSetResult queryDataSets(String indexName, Query query, QueryParams queryParams) throws Exception;

    /**
     * Gets a GridDataWriter for writing data.
     * @param meta
     * @return
     */
    GridDataWriter getDataWriter(GridDataSetMeta meta);

    /**
     * Gets a GridDataFetcher for reading data.
     * @param meta
     * @return
     */
    GridDataFetcher getDataFetcher(GridDataSetMeta meta);

    /**
     * Releases resources.
     */
    void close();
}

public interface GridDataWriter {
    /**
     * Writes a two-dimensional plane.
     * @param variable the variable name.
     * @param t the value of the time dimension.
     * @param z the value of the height dimension.
     * @param grid2D the plane data.
     * @throws Exception
     */
    void writeGrid2D(String variable, int t, int z, Grid2D grid2D) throws Exception;
}

public interface GridDataFetcher {
    /**
     * Sets the variables to query.
     * @param variables
     * @return
     */
    GridDataFetcher setVariablesToGet(Collection<String> variables);

    /**
     * Sets the starting point and size to read in each dimension.
     * @param origin the starting point in each dimension.
     * @param shape the size in each dimension.
     * @return
     */
    GridDataFetcher setOriginShape(int[] origin, int[] shape);

    /**
     * Fetches the data.
     * @return
     * @throws Exception
     */
    GridDataSet fetch() throws Exception;
}

Examples of data entry, data query, and data set retrieval are given below.

Data entry

The data entry process has three steps:

  1. Write the meta information of the data set through the putDataSetMeta interface.
  2. Write the data of the whole data set through GridDataWriter.
  3. Update the meta information of the data set through the updateDataSetMeta interface to mark the data as fully written.

In the following example, we read a NetCDF file (a common format for meteorological grid data) and write its data to TableStore through GridDataWriter. GridDataWriter writes only one two-dimensional plane per call, so we need three nested loops that enumerate the values of the variable, time, and height dimensions, and then read and write the corresponding two-dimensional plane.

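// NetCDF-Java and JDK imports used below. GridDataWriter, GridDataSetMeta and Grid2D
// come from the TableStore-grid library.
import java.util.List;
import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public void importFromNcFile(GridDataSetMeta meta, String ncFileName) throws Exception {
    GridDataWriter writer = tableStoreGrid.getDataWriter(meta);
    // try-with-resources closes the NetCDF file even if a read or write fails.
    try (NetcdfFile ncFile = NetcdfFile.open(ncFileName)) {
        List<Variable> variables = ncFile.getVariables();
        for (Variable variable : variables) {
            // Import only the variables declared in the data set meta.
            if (meta.getVariables().contains(variable.getShortName())) {
                for (int t = 0; t < meta.gettSize(); t++) {
                    for (int z = 0; z < meta.getzSize(); z++) {
                        // Read one full x-y plane at (t, z) and write it as one Grid2D.
                        Array array = variable.read(new int[]{t, z, 0, 0}, new int[]{1, 1, meta.getxSize(), meta.getySize()});
                        Grid2D grid2D = new Grid2D(array.getDataAsByteBuffer(), variable.getDataType(),
                                new int[] {0, 0}, new int[] {meta.getxSize(), meta.getySize()});
                        writer.writeGrid2D(variable.getShortName(), t, z, grid2D);
                    }
                }
            }
        }
    }
}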

Data query

GridDataFetcher supports reading any sub-range of the five-dimensional data. The first dimension is the variable dimension; the variables to read are set through the setVariablesToGet interface. For the other four dimensions, any sub-range can be read by setting origin and shape.
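The sketch below shows how typical queries could be expressed as origin and shape arrays. It assumes that the four entries follow the model's dimension order {time, z, x, y}; the helper names (OriginShapeExamples, Range, plane, timeSeries, subRegion) are hypothetical and not part of the library.

```java
// Illustrative only: helper names are hypothetical; the {t, z, x, y} order is assumed
// from the dimension order of the GridDataSet model.
final class OriginShapeExamples {
    /** The two arrays that would be passed to setOriginShape(origin, shape). */
    static final class Range {
        final int[] origin;
        final int[] shape;

        Range(int[] origin, int[] shape) {
            this.origin = origin;
            this.shape = shape;
        }

        /** Number of grid points covered, per variable. */
        long pointCount() {
            long count = 1;
            for (int length : shape) {
                count *= length;
            }
            return count;
        }
    }

    /** One full horizontal plane at forecast step t and height level z. */
    static Range plane(int t, int z, int xSize, int ySize) {
        return new Range(new int[] {t, z, 0, 0}, new int[] {1, 1, xSize, ySize});
    }

    /** The full time series of the single grid point (x, y) at height level z. */
    static Range timeSeries(int tSize, int z, int x, int y) {
        return new Range(new int[] {0, z, x, y}, new int[] {tSize, 1, 1, 1});
    }

    /** A rectangular sub-region of one plane, such as a province inside a global grid. */
    static Range subRegion(int t, int z, int x0, int y0, int width, int height) {
        return new Range(new int[] {t, z, x0, y0}, new int[] {1, 1, width, height});
    }
}
```

For a 2880 x 570 plane, plane(0, 0, 2880, 570) covers 1,641,600 points, while timeSeries(24, 0, 10, 20) covers only 24, one per forecast step.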

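public Array queryByTableStore(String dataSetId, String variable, int[] origin, int[] shape) throws Exception {
    // Query the meta table first, then create a fetcher for the data table.
    GridDataSetMeta meta = this.tableStoreGrid.getDataSetMeta(dataSetId);
    GridDataFetcher fetcher = this.tableStoreGrid.getDataFetcher(meta);
    // First dimension: the variables to read.
    fetcher.setVariablesToGet(Arrays.asList(variable));
    // Remaining four dimensions: starting point and size of the range to read.
    fetcher.setOriginShape(origin, shape);
    Grid4D grid4D = fetcher.fetch().getVariable(variable);
    return grid4D.toArray();
}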

Multi-conditional retrieval of data sets

In this scheme, once a search index has been created on the meta table, data sets can be retrieved by combinations of conditions. This capability is very important for a meteorological data management system.

As an example, suppose we want to find forecasts that have been fully stored, were created within the last day, come from ECMWF (European Centre for Medium-Range Weather Forecasts) or NMC (National Meteorological Center), have a precision of 1 km, and are sorted by creation time from newest to oldest. We can use the following code:

Search criteria: (status == DONE) and (create_time > System.currentTimeMillis() - 86400000) and (source == "ECMWF" or source == "NMC") and (accuracy == "1km")

QueryGridDataSetResult result = tableStoreGrid.queryDataSets(
        ExampleConfig.GRID_META_INDEX_NAME,
        QueryBuilder.and()
                .equal("status", "DONE")
                // 86400000 ms = 24 hours
                .greaterThan("create_time", System.currentTimeMillis() - 86400000)
                .equal("accuracy", "1km")
                .query(QueryBuilder.or()
                        .equal("source", "ECMWF")
                        .equal("source", "NMC")
                        .build())
                .build(),
        // offset 0, limit 10, newest first
        new QueryParams(0, 10, new Sort(Arrays.<Sort.Sorter>asList(new FieldSort("create_time", SortOrder.DESC)))));

As the example shows, such a query is simple to express. This part of the library builds on TableStore's search index, which supports multi-field combination queries, fuzzy queries, full-text search, sorting, range queries, nested queries, spatial queries, and more, and provides a powerful foundation for metadata management scenarios.


This article is original content of the Yunqi Community and may not be reproduced without permission.