The recent Winter Olympics went smoothly in Beijing, China. It was a matter of national joy. There’s been a lot of data coming out of the Olympics. Can we use Elastic Stack to analyze that data and provide some insight into the sport in our country?

Recently, the children’s school assigned a winter holiday of science and technology information function. As follows:

1, use the network and other methods, data collection, collection, 2014, 2018, 2022, three Olympic winter games related data, such as host countries, cities, dates, mascot, won the top 20 countries (in gold), gold, silver and bronze medal number and other relevant information, students can according to the hobby, choose five kinds of their favorite sport, And find this three winter Olympics, China, the United States, Russia, Japan, Germany in this five projects on the medal situation. Students can collect more relevant data or information according to their own thinking.

Students think :(1) how to ensure the accuracy of the collected data.

(2) Through sorting out and analyzing the above data, I can put forward my own suggestions on ice sports education for teenagers in China from the information of Chinese athletes’ winning data.

2. After obtaining the above data, students can fill them in EXCEL and other data processing software according to their needs. You can decide if you want to put them in one file, multiple files, collate them in one worksheet, or put them in different worksheets.

3. After sorting out the data, you can use charts to express it vividly. According to the data tables and charts, you can write your own opinions on China’s youth ice sports education, such as the problems found, the causes of the problems, and your solutions.

First of all, I have to say that I am not an Excel spreadsheet expert, and I have no idea how to use Excel to do the above work. I think Elastic Stack would be the perfect solution to this problem.

In today’s presentation, I’m going to use Elastic Stack 8.0 to solve this problem, although other versions of the Elastic Stack aren’t that different. One of the biggest changes to Elastic Stack 8.0 is the shift of the Index Patterns interface to Data Views.

To prepare data

I did a lot of data searching on the Internet. So far I think the most complete data is on Kaggle, although it doesn’t have data from the last two sessions. Its data was collated in 2018. On Baidu, there are a lot of data, some are gold ranking, some are analysis charts of some countries, but generally speaking, there is not a complete data. If anyone has found the data that can be downloaded in full, please let me know.

On Kaggle, I can download two files like this:

  • athlete_events.csv
  • noc_regions.csv

In the athlete_events.csv file, we can see the following:

athlete_events.csv

1. "ID","Name","Sex","Age","Height","Weight","Team","NOC","Games","Year","Season","City","Sport","Event","Medal" 2. "1", "A Dijiang", "M", 24180, "China" and "CHN", "Summer" 1992, 1992, "Summer", "Barcelona", "also", "also Men 's Also, "NA 3." 2 ", "A Lamusi", "M", 23170, 60, "China" and "CHN", "2012 Summer", 2012, "Summer" and "London", "Judo," "Judo Men 's Extra-Lightweight",NA 4. "3","Gunnar Nielsen Aaby","M",24,NA,NA,"Denmark","DEN","1920 Summer",1920,"Summer","Antwerpen","Football","Football Men's Football",NACopy the code

In this table, we can see a runner’s name, gender, age, height, weight, team, country, year, winter or summer Olympics, parameter type, MEDALS won.

In noc_regions.csv, we can see the following:

noc_regions.csv



1.  NOC,region,notes
2.  AFG,Afghanistan,
3.  AHO,Curacao,Netherlands Antilles
4.  ALB,Albania,
5.  ALG,Algeria,
6.  AND,Andorra,
7.  ANG,Angola,
8.  ANT,Antigua,Antigua and Barbuda
9.  ANZ,Australia,Australasia
10.  ARG,Argentina,
11.  ARM,Armenia,
12.  ARU,Aruba,
13.  ASA,American Samoa,
14.  AUS,Australia,


Copy the code

So this table, obviously, can be a form where we look for acronyms, like the names of countries that AFG stands for. In our design, we used ingest Pipeline to enrich the names of the countries in this table into the final table, because most of us are familiar with the names of the countries, not the names represented by abbreviations like AND, ARG, etc.

Create regions index

First, let’s take the noc_regions.csv file. Create an index called Regions:

We chose to download the noc_regions.csv file. Click the Import:

We can click Advanced to view the Mapping for the Regions index. If we are not happy with the mapping, we can modify it. We are not making any adjustments at this time. Click the Import:

It shows that we have successfully ingested 230 documents.

Create rich data policies

Since we want to enrich regions in the Regions index into the final data, we must create an enrich policy:

For information on Enrich Policy, read my previous article “Elasticsearch: How to Enrich logs and metrics with Elasticsearch ingest nodes”. We define the following enrich policy:



1.  PUT /_enrich/policy/enrich-region-policy
2.  {
3.    "match": {
4.      "indices": "regions",
5.      "match_field": "NOC",
6.      "enrich_fields": [
7.        "region"
8.      ]
9.    }
10.  }


Copy the code

We use the Execute Enrich Policy API to create an enrich index for this policy:

PUT /_enrich/policy/enrich-region-policy/_execute
Copy the code

We have successfully created a rich policy so far. This strategy will be used when taking in athlete information below.

Cleans Althelete event data

While we could also use Kibana to consume athlete_events.csv directly, let’s clean the data in Python before doing so. We can use Jupyter to help us process the data:



1.  import pandas as pd

3.  df = pd.read_csv('./athlete_events.csv')

5.  df.head()


Copy the code

Above, we use panda to read athlete_event.csv. We found 271,116 pieces of data.



1.  nan_values = df.isna()
2.  nan_columns = nan_values.any()

4.  columns_with_nan = df.columns[nan_columns].tolist()
5.  print(columns_with_nan)


Copy the code

Let’s see which terms are not numeric:

That is, Age, Height, Weight, and Medal contain non-numeric entries.



1.  df[['Age','Height','Weight']] = df[['Age','Height','Weight']].fillna(0)

3.  df.Medal = df.Medal.fillna('None')

5.  df.Age = df.Age.astype(int)


Copy the code

We fill the non-numeric image with 0, and in Medal, if there is no value, we fill in “None”.

Finally, we named the cleaned file cleaned_Olympic.csv and saved it in the current file directory. In the following operations, we will use this file instead of the previous althlete_events.csv file.



1.  $ pwd
2.  /Users/liuxg/data/olympic
3.  $ ls 
4.  archive (13).zip    cleaned_olympic.csv noc_regions.csv
5.  athlete_events.csv  ingest.py           olympic.ipynb


Copy the code

Intake of athlete information

In this step, we will import the just-generated cleaned_Olympic.csv into Elasticsearch. We open the file and add a header to the file:

id,name,sex,age,height,weight,team,NOC,game,year,season,city,sports,event,medal
Copy the code

The cleaned_Olympic.csv file would look like this:

With this header in hand, let’s use Kibana to ingest this file:

We select the cleaned_Olympic.csv file as the input file:

We chose Olympic as the name of the index. Next, we need to adjust the index mapping:

We set event to keyword in order to count MEDALS later. The number of MEDALS can not be counted according to sports, because in team events, a medal may be owned by many athletes, and for male and female teams, the same sport, sports are the same. Only by counting events can we really get the number of MEDALS. In addition, year is defined as keyword first. We need to use UTC to set its time value. By default, it uses the time zone of the current machine for the conversion.

Next we modify the pipeline on the right:

We add the following date processor:

 1.      {
2.        "date" : {
3.          "field" : "year",
4.          "target_field" : "@timestamp",
5.          "formats" : ["yyyy"],
6.          "timezone" : "UTC"
7.        }
8.      }
Copy the code

We add a field called @timestamp. Its value is converted from the year field. We set the time to UTC time.

Next, we need to apply the richness policy configured earlier. We want to add the region field to the index of ingestion as well. We need to add the following enrich Processor:

 1.      {
2.        "enrich": {
3.          "policy_name": "enrich-region-policy",
4.          "field": "NOC",
5.          "target_field": "region1",
6.          "max_matches": "1"
7.        }
8.      }
Copy the code

Regions in the index of regions are automatically added to the index when NOC is match. After this enrich Processor is processed, there is a field called region1.region. Next, we want to recommand the value of this field into the region field so that we can analyze the data more easily. Let’s add the following rename processor:

 1.      {
2.        "rename": {
3.          "if": "ctx.containsKey('region1')",
4.          "field": "region1.region",
5.          "target_field": "region"
6.        }
7.      }
Copy the code

After this processing, we can delete the previous REGION1 field:

 1.      {
2.        "remove": {
3.          "if": "ctx.containsKey('region1')",
4.          "field": "region1"
5.        }
6.      } 
Copy the code

So far, our mapping and pipeline have been set up. We click the Import button above:

As shown above, we have successfully ingested 271,116 documents. This is also the data of all athletes at the time when the data is collated. So let’s say View Index in Discover.

We can see all the data. We can see a field called region, which comes from the Regions index, if there is a matching NOC. We also see a field called @timestamp. This comes from the Date Processor.

In the Discover interface above, although we see a field called @timestamp, it is not displayed in a time-domain format. We need to adjust.

Let’s delete the Created Olympic Data View. Let’s create a new data View:

This creates a brand new Olympic Data View. Let’s go back to the Discover J screen and re-select the index:

In the view above, we can clearly see the profile of all the athletes from the Olympic Games until 2016. To verify the data, let’s search for the total number of athletes who participated in 2008:

From the above, we can see clearly that the total number of athletes participating in the 2008 Beijing Games is 13602.

Well, that’s all for today’s presentation. In this article, we explained how to import an Olympic CSV file into Elasticsearch. We’ll cover visualizing this data in detail in our next article, “Analyzing Olympic Data using Elastic Stack (PART 2).”