Numpy Data Analysis (2)

“This is the 23rd day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021”

The knowledge points for this review are as follows:

Boolean array and data filtering
Construction of multidimensional arrays
Use numpy to save text files
Matplotlib line chart drawing
How to set common properties of matplotlib diagrams
Chart saving

About Data Sources

After publishing the last article, I realized I had forgotten to add the link to the data source. It has since been added in the comment section by xianyu, so readers who want hands-on practice can get it there.

Analysis goal

The data set from last time distinguishes two categories of users: members and non-members. This time we compare the average ride duration of the two categories.

Data reading and data cleaning

Following the process diagram from last time, we work through the same steps.

In practice, however, the data this time turned out to be very clean, so we can merge the data-reading and data-cleaning code into a single function and keep the code short.

The simplified code is:

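# Data reading, data cleaning
import os
import numpy as np

# data_path and data_filenames are assumed to be defined as in the last article
def data_collection():
    clndata_arr_list = []
    for data_filename in data_filenames:
        file = os.path.join(data_path, data_filename)
        # Read every field as a string, skipping the header row
        data_arr = np.loadtxt(file, skiprows=1, delimiter=',', dtype=bytes).astype(str)
        # Strip the double quotes around each field (np.char.replace is the same function as np.core.defchararray.replace)
        cln_data_arr = np.char.replace(data_arr, '"', '')
        clndata_arr_list.append(cln_data_arr)
    return clndata_arr_list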

The implementation details were explained in the last article and are not repeated here.
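For readers who skipped the previous article, the read-and-clean step can be tried on a tiny in-memory sample. This is only a sketch: the two-row CSV below is made up for illustration (the real files have more columns), and it reads with dtype=str directly instead of reading bytes and converting afterwards.

```python
import io
import numpy as np

# Hypothetical two-row sample standing in for one quarterly CSV file.
sample_csv = io.StringIO(
    'Duration (ms),Member Type\n'
    '"301295","Member"\n'
    '"557887","Casual"\n'
)

# Read everything as strings, skipping the header row.
raw = np.loadtxt(sample_csv, skiprows=1, delimiter=',', dtype=str)

# Strip the double quotes that wrap every field.
clean = np.char.replace(raw, '"', '')
print(clean)
```

After cleaning, every field is a plain string without quotes, which is why the later steps still call astype('float') on the duration column.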

Data analysis

For this analysis we need two columns: Duration (ms), the first column, and Member Type, the last column.

We can use Boolean arrays to filter our data. For details, see the following article:

We can take a look at the finished code first:

# Data analysis
def mean_data(clndata_arr_list, member_type):
    duration_mean_list = []
    for cln_data in clndata_arr_list:
        # Boolean array: True where the last column equals member_type
        bool_arr = cln_data[:, -1] == member_type
        filter_arr = cln_data[bool_arr]
        # Convert the duration from milliseconds to minutes before averaging
        duration_mean = np.mean(filter_arr[:, 0].astype('float') / 1000 / 60)
        duration_mean_list.append(duration_mean)
    return duration_mean_list

Here we pass in member_type and compare this single scalar with the vector cln_data[:, -1] to pick out the rows we need.

In mathematics a scalar and a vector cannot be compared directly, because their dimensions differ. Numpy's broadcasting mechanism handles this for us: the scalar is treated as if it were stretched to the same shape as the vector, so the comparison runs element by element and produces the Boolean array we use to filter the data.

Computing the mean and the other operations on the data were covered in the last article, so we move on to data visualization.
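The comparison and the filtering can be seen on a toy array. The three rows below are made up for illustration; only the column layout (duration in ms first, member type last) follows the article.

```python
import numpy as np

# Made-up rows: duration in ms, then member type (both stored as strings).
cln_data = np.array([
    ['600000', 'Member'],
    ['1200000', 'Casual'],
    ['300000', 'Member'],
])

# The scalar 'Member' is broadcast against every element of the last column.
bool_arr = cln_data[:, -1] == 'Member'
print(bool_arr)            # [ True False  True]

# Indexing with the Boolean array keeps only the rows marked True.
filter_arr = cln_data[bool_arr]
mean_minutes = np.mean(filter_arr[:, 0].astype(float) / 1000 / 60)
print(mean_minutes)        # 7.5
```

The two member rows last 10 and 5 minutes, so the mean is 7.5 minutes.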

Results

Generated line chart:

Generated CSV table:

The chart is considerably more polished than last time. The implementation is in the code below, and each styling attribute is explained in the comments.

# Result display
import matplotlib.pyplot as plt

def show_data(member_mean_duration_list, casual_mean_duration_list):
    # Construct a multidimensional array: one row per quarter, one column per user type
    # fmt sets the output format of each value; the default is scientific notation
    mean_duration_arr = np.array([member_mean_duration_list, casual_mean_duration_list]).transpose()
    np.savetxt('./mean_duration.csv', mean_duration_arr, delimiter=',',
               header='Member Mean Duration,Casual Mean Duration', fmt='%.4f',
               comments='')

    # Generate a blank canvas
    plt.figure()
    # color specifies the line color
    # linestyle specifies the line style
    # marker specifies the marker drawn at each data point
    plt.plot(member_mean_duration_list, color='g', linestyle='-', marker='o', label='Member')
    plt.plot(casual_mean_duration_list, color='r', linestyle='-', marker='*', label='Casual')
    plt.title('Member vs Casual')
    # rotation specifies the angle of the tick labels
    plt.xticks(range(0, 4), ['1st', '2nd', '3rd', '4th'], rotation=45)
    # xlabel and ylabel set the axis titles
    plt.xlabel('Quarter')
    plt.ylabel('Mean duration (min)')
    plt.legend(loc='best')
    plt.tight_layout()

    plt.savefig('./duration_trend.png')
    plt.show()
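To see what the array construction and the savetxt options do without touching the disk, the sketch below writes into an in-memory buffer. The quarterly means are made-up numbers, and the buffer is an assumption used here in place of the CSV file.

```python
import io
import numpy as np

# Made-up quarterly means in minutes, one list per user type.
member_means = [12.5, 13.25, 14.0, 11.75]
casual_means = [40.0, 42.5, 45.25, 38.0]

# Stacking the two lists gives shape (2, 4); transposing gives (4, 2),
# i.e. one row per quarter and one column per user type.
table = np.array([member_means, casual_means]).transpose()

buffer = io.StringIO()  # stands in for a file on disk
np.savetxt(buffer, table, delimiter=',', fmt='%.4f',
           header='Member Mean Duration,Casual Mean Duration', comments='')
csv_text = buffer.getvalue()
print(csv_text)
```

With comments='' the header line is written without the default '# ' prefix, and fmt='%.4f' replaces the default scientific notation with four decimal places.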

The knowledge points for this review are as follows:

Numpy reshape and shape in action
Matplotlib pie chart drawing

Analysis goal

The data set from last time distinguishes two categories of users: members and non-members. This time we analyze the proportion of each category in the data.

Data reading and data cleaning

The code here is:

# Data reading, data cleaning
def read_clean_data():
    clndata_arr_list = []
    for data_filename in data_filenames:
        file = os.path.join(data_path, data_filename)
        data_arr = np.loadtxt(file, skiprows=1, delimiter=',', dtype=bytes).astype(str)
        # Keep only the last column (Member Type) and strip the double quotes
        cln_arr = np.char.replace(data_arr[:, -1], '"', '')
        # Turn the 1-D column into shape (n, 1); -1 lets numpy infer n
        cln_arr = cln_arr.reshape(-1, 1)
        clndata_arr_list.append(cln_arr)
    # Stack the per-file columns into one array for the whole year
    year_cln_arr = np.concatenate(clndata_arr_list)
    return year_cln_arr

Two points to note here:

Because the data set is large and we do not know the number of rows in each file in advance, we can call reshape(-1, 1) so that numpy works out the row count itself.
We no longer need the mean duration, so we only take the last column of each file and stack the pieces together with the concatenate method for further processing.
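Both points can be checked on two short made-up columns of different length, standing in for the last columns of two data files:

```python
import numpy as np

# Made-up last columns from two files of different length.
q1 = np.array(['Member', 'Casual', 'Member'])
q2 = np.array(['Casual', 'Member'])

# -1 lets numpy infer the row count; 1 fixes a single column.
q1_col = q1.reshape(-1, 1)   # shape (3, 1)
q2_col = q2.reshape(-1, 1)   # shape (2, 1)

# concatenate joins along axis 0 by default, stacking the rows.
year = np.concatenate([q1_col, q2_col])
print(year.shape)            # (5, 1)
```

The row counts differ between the inputs, but since every piece has exactly one column the pieces stack cleanly along axis 0.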

Data analysis

For this analysis we only need the last column, Member Type.

The previous step already collected all the values, so here we only need to filter and count the member and non-member entries.

We can take a look at the finished code first:

# Data analysis
def mean_data(year_cln_arr):
    # shape[0] of the filtered array is the number of matching entries
    member = year_cln_arr[year_cln_arr == 'Member'].shape[0]
    casual = year_cln_arr[year_cln_arr == 'Casual'].shape[0]
    users = [member, casual]
    print(users)
    return users


Here the shape attribute of each filtered array gives the number of users in that category.
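The counting step can be reproduced on a small made-up column. The sketch also shows two other standard numpy calls, np.count_nonzero and np.unique, that give the same counts; these are alternatives added for comparison, not what the article's code uses.

```python
import numpy as np

# Made-up (n, 1) column of member types.
year_cln_arr = np.array([['Member'], ['Casual'], ['Member'], ['Member']])

# Boolean indexing on a 2-D array returns a flat 1-D array of the matches,
# so shape[0] is simply the number of matching elements.
member = year_cln_arr[year_cln_arr == 'Member'].shape[0]

# Counting the True values in the mask gives the same number without
# building the intermediate array.
member_alt = np.count_nonzero(year_cln_arr == 'Member')

# np.unique with return_counts=True tallies every category at once.
labels, counts = np.unique(year_cln_arr, return_counts=True)
tally = dict(zip(labels.tolist(), counts.tolist()))
print(member, member_alt)   # 3 3
print(tally)                # {'Casual': 1, 'Member': 3}
```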

Results

The resulting pie chart:

Here is the code to generate the pie chart:

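# Result display
# users is the [member, casual] list returned by mean_data; output_path is assumed to be defined elsewhere
plt.figure()
# autopct formats the percentage label; explode pulls the first wedge out slightly
plt.pie(users, labels=['Member', 'Casual'], autopct='%.2f%%', shadow=True, explode=(0.05, 0))
# Equal axis scaling keeps the pie circular
plt.axis('equal')
plt.tight_layout()
plt.savefig(os.path.join(output_path, 'piechart.png'))
plt.show()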
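The percentage that autopct='%.2f%%' prints on each wedge is the wedge's value divided by the total, times 100. The arithmetic can be checked by hand with made-up counts:

```python
users = [750, 250]  # made-up counts: [member, casual]

total = sum(users)
# Same format string as autopct: two decimals followed by a literal percent sign.
wedge_labels = ['%.2f%%' % (100 * count / total) for count in users]
print(wedge_labels)  # ['75.00%', '25.00%']
```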
