Baseline for the First China ECG Intelligence Contest (Keras, Val_ACC: 0.88)

Personal website: www.yansongsong.cn

Github address: github.com/xiaosongshi…

Preliminary introduction: blog.csdn.net/xiaosongshi…

Open-source solution: "Deep Application" in the First China ECG Intelligence Contest (1st place, score 0.841484)

Welcome to Xiao Song's public account "Minimalist AI", where I teach deep learning:

Centered on deep learning theory and application development, I regularly share practical deep learning content there. If you run into problems while learning or applying deep learning, you can also reach me through it.

From CSDN blog expert and Zhihu deep learning columnist @Xiaosong

 

Competition introduction

In response to the Healthy China national strategy and policies promoting the integrated development of healthcare and big data, the First China ECG Intelligence Competition has officially launched, jointly sponsored by Tsinghua University's School of Clinical Medicine and Institute of Data Science, the Jingjin High-tech Innovation Park in Wuqing District, Tianjin, and a number of key hospitals. Registration is open worldwide until 24:00 on March 31, 2019, and the total prize pool is expected to reach one million yuan. The official registration website is now live; universities, hospitals, startup teams, and anyone interested in the development of Chinese ECG artificial intelligence are welcome to participate.

Official registration website of the First China ECG Intelligence Competition: mdi.ids.tsinghua.edu.cn

 

Data description

The complete training and test sets contain 1,000 routine ECG recordings in total: 600 in the training set and 400 in the test set, drawn from multiple public datasets. Teams must design and implement algorithms using the training data, which is labeled normal or abnormal, and produce predictions on the unlabeled test set.

The sampling rate of the ECG data is 500 Hz. To make the data easy to read from different programming languages, all recordings are stored in MAT format; each file holds the voltage signals of the 12 leads. The labels for the training data are stored in a TXT file, where 0 indicates normal and 1 indicates abnormal.

Download address

 

Problem analysis

A brief analysis: the preliminary dataset contains 1,000 samples, 600 in the training set and 400 in the test set. The 600 training samples are labeled and can be used to train our model; the 400 test samples are unlabeled, and we must predict their labels with the trained model.

This is a binary classification problem, and the solution should cover the following steps:

  1. Data reading and processing
  2. Network model building
  3. Model training
  4. Model application and submission of prediction results

 

Practical application

Having analyzed the problem, we split the work into four sub-tasks. The first is:

1. Data reading and processing

The sampling rate of the ECG data is 500 Hz. To make the data easy to read from different programming languages, all recordings are stored in MAT format; each file holds the voltage signals of the 12 leads. The labels for the training data are stored in a TXT file, where 0 indicates normal and 1 indicates abnormal.

From the description above, we learn:

  • Our data is stored in MAT-format files (this determines how we will read the data later; see the inspection snippet after this list).
  • The sampling rate is 500 Hz, i.e. 500 points are collected per second. This fact is not used much directly, but since each recording turns out to have 5,000 points, each sample is a 10-second ECG trace.
  • Each file holds the voltage signals of 12 leads. Using 12 leads can be loosely compared to measuring body temperature with 12 thermometers: more measurements give more reliable information. Note that since all 12 leads are provided, we should use all of them; training and predicting on a single lead is possible, but experience shows that using more features gives better results.
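
As a quick sanity check, you can open a single recording with scipy's loadmat and confirm these numbers. A minimal sketch, assuming the file path and the "data" key used in the loading code later in this post:

from scipy.io import loadmat

# Inspect one training sample (path and "data" key follow the loading code below)
mat = loadmat("preliminary/TRAIN/TRAIN101.mat")
sig = mat["data"]
print(sig.shape)  # expected: (12, 5000) -> 12 leads, 10 s at 500 Hz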

 

Data processing function definition:

import keras
from scipy.io import loadmat
import matplotlib.pyplot as plt
import glob
import numpy as np
import pandas as pd
import math
import os
from keras.layers import *
from keras.models import *
from keras.objectives import *


BASE_DIR = "preliminary/TRAIN/"

# Normalize each lead: subtract the per-lead mean and divide by the
# per-lead maximum (the small epsilon avoids division by zero)
def normalize(v):
    return (v - v.mean(axis=1).reshape((v.shape[0], 1))) / (v.max(axis=1).reshape((v.shape[0], 1)) + 2e-12)

# Open a MAT file with loadmat and return the normalized 12-lead signal,
# transposed to shape (5000, 12) for Conv1D
def get_feature(wav_file, Lens=12, BASE_DIR=BASE_DIR):
    mat = loadmat(BASE_DIR + wav_file)
    dat = mat["data"]
    feature = dat[0:Lens]
    return normalize(feature).transpose()


# Convert a label index to a one-hot vector
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[index] = 1
    return hot

TXT_DIR = "preliminary/reference.txt"
MANIFEST_DIR = "preliminary/reference.csv"
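
A quick, hypothetical check of these helpers on dummy data (random values standing in for one 12-lead recording):

dummy = np.random.randn(12, 5000)  # stands in for one 12-lead, 10-second recording
print(normalize(dummy).shape)      # (12, 5000): per-lead normalization keeps the shape
print(convert2oneHot(1, 2))        # [0. 1.]: label 1 as a one-hot vector of length 2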

Read one sample and display it

if __name__ == "__main__":
    dat1 = get_feature("TRAIN101.mat")  # BASE_DIR is prepended inside get_feature
    print(dat1.shape)
    # one sample's shape is (5000, 12) after the transpose in get_feature
    plt.plot(dat1[:, 0])  # plot the first lead
    plt.show()

From the above we can see that each lead is a sequence of 5,000 points, so with 12 leads each sample is a 12×5000 matrix, similar to a photo with a resolution of 12×5000 (we transpose it to 5000×12 before feeding it to the network).

All we need to do is read each sample, normalize it, and feed it to the network for training.

Label processing

def create_csv(TXT_DIR=TXT_DIR):
    lists = pd.read_csv(TXT_DIR, sep=r"\t", header=None)
    lists = lists.sample(frac=1)  # shuffle the rows
    lists.to_csv(MANIFEST_DIR, index=None)
    print("Finish save csv")

Here I read the labels from reference.txt, shuffle them, and save them to reference.csv. Note that the data must be shuffled, otherwise training will suffer, because in the raw file all the 1 labels come first and all the 0 labels after.

Data iterator

Batch_size = 20
def xs_gen(path=MANIFEST_DIR, batch_size=Batch_size, train=True):

    img_list = pd.read_csv(path)
    if train:
        img_list = np.array(img_list)[:500]   # first 500 samples for training
        print("Found %s train items." % len(img_list))
        print("list 1 is", img_list[0])
    else:
        img_list = np.array(img_list)[500:]   # remaining 100 samples for validation
        print("Found %s test items." % len(img_list))
        print("list 1 is", img_list[0])
    steps = math.ceil(len(img_list) / batch_size)  # number of batches per epoch
    while True:
        for i in range(steps):

            batch_list = img_list[i * batch_size: i * batch_size + batch_size]
            np.random.shuffle(batch_list)
            batch_x = np.array([get_feature(file) for file in batch_list[:, 0]])
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, 1]])

            yield batch_x, batch_y

I read the data with a generator so it can be loaded batch by batch, which speeds up training; you can also load everything at once if you prefer. For more on generators, see this blog post of mine.
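
If generators are new to you, the core pattern is just an endless loop that yields one batch at a time, so only one batch needs to be in memory. A minimal sketch on toy data (not the competition code):

def toy_gen(items, batch_size=4):
    # Loop forever, handing out consecutive slices of batch_size items
    while True:
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]

g = toy_gen(list(range(10)), batch_size=4)
print(next(g))  # [0, 1, 2, 3]
print(next(g))  # [4, 5, 6, 7]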

 

2. Network model building

With the data processed, the next step is to build the model. I used Keras, which is simple and convenient to work with; TensorFlow, PyTorch, or scikit-learn would work too, so follow your own preference.

For the network you can choose a CNN, an RNN, an Attention structure, or a fusion of several models; this baseline is only meant as a starting point to attract better solutions. The baseline adopts a one-dimensional CNN; see the linked 1D CNN tutorial for background.
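
For reference, an RNN variant could look like the sketch below. This is a hypothetical alternative, not part of the baseline, and the layer sizes are illustrative only:

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GRU, Dropout, Dense

def build_gru_model(input_shape=(5000, 12), num_classes=2):
    model = Sequential()
    # Downsample the 5000-step sequence first; a GRU over all 5000 steps is very slow
    model.add(Conv1D(32, 16, strides=4, activation='relu', input_shape=input_shape))
    model.add(MaxPooling1D(4))
    model.add(GRU(64))
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model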

Model structure:

TIME_PERIODS = 5000
num_sensors = 12
def build_model(input_shape=(TIME_PERIODS, num_sensors), num_classes=2):
    model = Sequential()
    model.add(Conv1D(16, 16, strides=2, activation='relu', input_shape=input_shape))
    model.add(Conv1D(16, 16, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 8, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 8, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(128, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(128, 4, strides=2, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(256, 2, strides=1, activation='relu', padding="same"))
    model.add(MaxPooling1D(2))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model

The network summary printed by model.summary() is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
reshape_1 (Reshape)          (None, 5000, 12)          0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 2493, 16)          3088
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1247, 16)          4112
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 623, 16)           0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 312, 64)           8256
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 156, 64)           32832
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 78, 64)            0
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 39, 128)           32896
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 20, 128)           65664
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 10, 128)           0
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 10, 256)           65792
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 10, 256)           131328
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 5, 256)            0
_________________________________________________________________
global_average_pooling1d_1 ( (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514
=================================================================
Total params: 344482
Trainable params: 344482
Non-trainable params: 0
_________________________________________________________________

The model has relatively few parameters (about 340K), and you can modify the architecture as you like.
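
As a sanity check on these numbers, a Conv1D layer has filters × kernel_size × input_channels weights plus one bias per filter. For the first convolution:

# conv1d_1: 16 filters, kernel size 16, 12 input channels (leads)
filters, kernel_size, in_channels = 16, 16, 12
print(filters * kernel_size * in_channels + filters)  # 3088, matching the summary above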

3. Network model training

Model training

if __name__ == "__main__":
    """dat1 = get_feature("TRAIN101.mat") print("one data shape is",dat1.shape) #one data shape is (5000, 12) plt.plot(dat1[:,0]) plt.show()"""
    if not os.path.exists(MANIFEST_DIR):
        create_csv()
    train_iter = xs_gen(train=True)
    test_iter = xs_gen(train=False)
    model = build_model()
    print(model.summary())
    # save the weights whenever val_acc improves
    ckpt = keras.callbacks.ModelCheckpoint(
        filepath='best_model.{epoch:02d}-{val_acc:.2f}.h5',
        monitor='val_acc', save_best_only=True, verbose=1)
    model.compile(loss='categorical_crossentropy',
                optimizer='adam', metrics=['accuracy'])
    model.fit_generator(
        generator=train_iter,
        steps_per_epoch=500 // Batch_size,
        epochs=20,
        initial_epoch=0,
        validation_data=test_iter,
        validation_steps=100 // Batch_size,  # nb_val_samples is the deprecated Keras 1 name
        callbacks=[ckpt],
        )

Training output (best result: loss: 0.0565 - acc: 0.9820 - val_loss: 0.8307 - val_acc: 0.8800):

Epoch 10/20
25/25 [==============================] - 1s 37ms/step - loss: 0.2329 - acc: 0.9040 - val_loss: 0.4041 - val_acc: 0.8700
Epoch 00010: val_acc improved from 0.85000 to 0.87000, saving model to best_model.10-0.87.h5
Epoch 11/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1633 - acc: 0.9380 - val_loss: 0.5277 - val_acc: 0.8300
Epoch 00011: val_acc did not improve from 0.87000
Epoch 12/20
25/25 [==============================] - 1s 40ms/step - loss: 0.1394 - acc: 0.9500 - val_loss: 0.4916 - val_acc: 0.7400
Epoch 00012: val_acc did not improve from 0.87000
Epoch 13/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1746 - acc: 0.9220 - val_loss: 0.5208 - val_acc: 0.8100
Epoch 00013: val_acc did not improve from 0.87000
Epoch 14/20
25/25 [==============================] - 1s 38ms/step - loss: 0.1009 - acc: 0.9720 - val_loss: 0.5513 - val_acc: 0.8000
Epoch 00014: val_acc did not improve from 0.87000
Epoch 15/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0565 - acc: 0.9820 - val_loss: 0.8307 - val_acc: 0.8800
Epoch 00015: val_acc improved from 0.87000 to 0.88000, saving model to best_model.15-0.88.h5
Epoch 16/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0261 - acc: 0.9920 - val_loss: 0.6443 - val_acc: 0.8400
Epoch 00016: val_acc did not improve from 0.88000
Epoch 17/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0178 - acc: 0.9960 - val_loss: 0.7773 - val_acc: 0.8700
Epoch 00017: val_acc did not improve from 0.88000
Epoch 18/20
25/25 [==============================] - 1s 38ms/step - loss: 0.0082 - acc: 0.9980 - val_loss: 0.8875 - val_acc: 0.8600
Epoch 00018: val_acc did not improve from 0.88000
Epoch 19/20
25/25 [==============================] - 1s 37ms/step - loss: 0.0045 - acc: 1.0000 - val_loss: 1.0057 - val_acc: 0.8600
Epoch 00019: val_acc did not improve from 0.88000
Epoch 20/20
25/25 [==============================] - 1s 37ms/step - loss: 0.0012 - acc: 1.0000 - val_loss: 1.1088 - val_acc: 0.8600
Epoch 00020: val_acc did not improve from 0.88000

 

4. Model application and submission of prediction results

Predicting the test data

if __name__ == "__main__":
    """dat1 = get_feature("TRAIN101.mat") print("one data shape is",dat1.shape) #one data shape is (5000, 12) plt.plot(dat1[:,0]) plt.show()"""
    # (the training code from step 3 is commented out here)
    PRE_DIR = "sample_codes/answers.txt"
    model = load_model("best_model.15-0.88.h5")
    pre_lists = pd.read_csv(PRE_DIR, sep=r" ", header=None)
    print(pre_lists.head())
    pre_datas = np.array([get_feature(item, BASE_DIR="preliminary/TEST/") for item in pre_lists[0]])
    pre_result = model.predict_classes(pre_datas)  # returns the predicted class (0 or 1) for each sample
    print(pre_result.shape)
    pre_lists[1] = pre_result
    pre_lists.to_csv("sample_codes/answers1.txt", index=None, header=None)
    print("predict finish")
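
One caveat: predict_classes only exists on Sequential models in older Keras releases and was removed later. If your version lacks it (an assumption about your environment, not the contest code), the equivalent is:

# Equivalent to predict_classes on Keras versions that removed it
pre_probs = model.predict(pre_datas)       # (N, 2) softmax probabilities
pre_result = np.argmax(pre_probs, axis=1)  # class index 0 or 1 per sample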

Here are the first 10 predictions:

TEST394,0
TEST313,1
TEST484,0
TEST288,0
TEST261,1
TEST310,0
TEST286,1
TEST367,1
TEST149,1
TEST160,1

Note that my prediction output differs from the official format; you need to format your submission according to the competition's requirements.

 

Looking ahead

This baseline reaches 88% validation accuracy with the simplest one-dimensional convolution (the figure can fluctuate a little because of random initialization). You can also try GRU, Attention, or ResNet structures to push test accuracy above 95%.
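
If you want to try a ResNet-style model, a 1D residual block might look like the sketch below (a hypothetical building block with illustrative sizes, not a tuned architecture):

from keras import backend as K
from keras.layers import Conv1D, BatchNormalization, Activation, add

def res_block_1d(x, filters, kernel_size=8):
    # One 1D residual block: two convolutions plus a skip connection
    shortcut = x
    y = Conv1D(filters, kernel_size, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv1D(filters, kernel_size, padding='same')(y)
    y = BatchNormalization()(y)
    # Project the shortcut with a 1x1 convolution if channel counts differ
    if K.int_shape(shortcut)[-1] != filters:
        shortcut = Conv1D(filters, 1, padding='same')(shortcut)
    y = add([y, shortcut])
    return Activation('relu')(y)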

My ability is limited, so criticism and corrections of any poorly written parts are welcome.

Personal website: www.yansongsong.cn/

Github address: github.com/xiaosongshi…

Welcome to Fork+Star ><

 

 

TensorFlow electrocardiogram recognition tutorial

I have bought this course; the quality is very high, covering both theory and hands-on practice. Highly recommended. I am building my ECG deep learning recognizer with the help of this tutorial.

 

Welcome to join my Knowledge Planet: "AI Deep Learning Application Road"

Centered on deep learning theory and application development, I regularly share practical deep learning content there. If you run into problems while learning or applying deep learning, you can also reach me through it.

A one-year subscription costs about as much as a few cups of milk tea, says Xiao Song, CSDN blog expert and Zhihu deep learning columnist.