Introduction:

Python is fast becoming the language of choice for data scientists, and for good reason: as a general-purpose language it has a broad ecosystem, along with many excellent scientific computing libraries. If you're new to Python, take a look at a Python learning path first.

Of the many scientific libraries, I find pandas the most useful for data science. Pandas, together with scikit-learn, provides almost all the tools a data scientist needs. This article presents 12 ways to manipulate data in Python, along with tips to make your work easier.

Before moving on, I recommend reviewing some basic data exploration code.

To make things concrete, this article performs all operations and manipulations on a specific data set: the Loan Prediction problem dataset.

Getting started

First, I import the modules I need and load the data set into the Python environment.
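A minimal sketch of this step. Since the competition file cannot be bundled here, the block below builds a tiny stand-in with the same columns; the commented-out call shows the actual loading (the file name "train.csv" and index column "Loan_ID" are assumptions):

```python
import io
import pandas as pd

# With the real Loan Prediction file you would load it like this
# (the file name "train.csv" is an assumption):
#   data = pd.read_csv("train.csv", index_col="Loan_ID")

# Tiny stand-in with the same columns, for illustration only.
csv_text = """Loan_ID,Gender,Married,Education,Self_Employed,LoanAmount,Credit_History,Loan_Status
LP001002,Male,No,Graduate,No,,1,Y
LP001003,Male,Yes,Graduate,No,128,1,N
LP001005,Female,Yes,Not Graduate,Yes,66,1,Y
LP001006,Female,No,Not Graduate,No,120,,N
"""
data = pd.read_csv(io.StringIO(csv_text), index_col="Loan_ID")
print(data.head())
```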

1. Boolean Indexing

What if you want to filter the rows of a DataFrame using conditions on several columns? For example, suppose we want a list of all women who are not graduates and whose loans were approved. Boolean indexing can be used here. The code is as follows:

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]


2. Apply function

Apply is a function that is often used to manipulate data and create new variables. It returns values after applying a given function to each row or column of a DataFrame. The function can be either built-in or user-defined. For example, here we can use it to count the number of missing values in each row and column:

Output result:

So we get what we want.

Note: The second output uses the head() function because the data contains too many rows.
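The counting described above can be sketched as follows; `num_missing` is an illustrative helper name, applied on a small stand-in frame:

```python
import pandas as pd

# Toy frame standing in for the loan data loaded earlier.
data = pd.DataFrame({
    "Gender":     ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, 66.0, None, 120.0],
})

def num_missing(x):
    # Count missing entries in a row or column.
    return sum(x.isnull())

# Per column: apply down each column (axis=0).
print("Missing values per column:")
print(data.apply(num_missing, axis=0))

# Per row: apply across each row (axis=1).
print("Missing values per row:")
print(data.apply(num_missing, axis=1).head())
```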

3. Replacing missing values

‘fillna()’ solves this problem in one go: it replaces missing values with, for example, the mean/mode/median of the column. Let's impute the ‘Gender’ column with its mode. Computing the mode (here with scipy.stats.mode) gives:

ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both the mode and its number of occurrences. Remember that the mode can be an array, since multiple values may tie for the highest frequency. By default we take the first one:

mode(data['Gender']).mode[0]

Now you can fill in the missing values and verify them using the technique from the previous step.
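A minimal sketch of the fill-and-verify step on a small stand-in column. This version uses pandas' own Series.mode() rather than the scipy call shown above; .iloc[0] picks the first mode in case of ties:

```python
import pandas as pd

# Stand-in column with missing entries; "Male" is the mode.
data = pd.DataFrame({"Gender": ["Male", "Male", None, "Female", None]})

# Fill missing values with the column's mode.
data["Gender"] = data["Gender"].fillna(data["Gender"].mode().iloc[0])

# Verify with the technique from the previous step: count remaining NaNs.
print(data["Gender"].isnull().sum())
```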

Thus, the missing values are replaced. Note that this is the most basic form of imputation; other, more sophisticated techniques, such as modeling the missing values or filling them with group-wise means/modes/medians, will be covered in future articles.

4. Pivot tables

Pandas can be used to create Excel-style pivot tables. For example, the key column ‘LoanAmount’ has missing values. We can replace them with the mean of each group defined by ‘Gender’, ‘Married’ and ‘Self_Employed’. The mean ‘LoanAmount’ of each group can be determined as follows:
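A sketch of the pivot-table step; the stand-in frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data.
data = pd.DataFrame({
    "Gender":        ["Male", "Male", "Female", "Female"],
    "Married":       ["Yes", "Yes", "No", "No"],
    "Self_Employed": ["No", "No", "No", "No"],
    "LoanAmount":    [100.0, 140.0, 80.0, np.nan],
})

# Mean LoanAmount per (Gender, Married, Self_Employed) group;
# rows with a missing LoanAmount are ignored by the aggregation.
impute_grps = data.pivot_table(values="LoanAmount",
                               index=["Gender", "Married", "Self_Employed"],
                               aggfunc="mean")
print(impute_grps)
```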

5. Multiple indexes

You may have noticed a strange property of the output from the previous step: each index is a combination of three values. This is called a MultiIndex, and it helps perform operations quickly.

Continuing the example above: we now have the mean for each group, but we have not performed the substitution yet. This task can be accomplished by combining several of the techniques learned so far.
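One way to sketch that substitution end to end, on stand-in data; the loop demonstrates the tuple indexing and the .values[0] suffix discussed in the notes:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data.
data = pd.DataFrame({
    "Gender":        ["Male", "Male", "Female", "Female"],
    "Married":       ["Yes", "Yes", "No", "No"],
    "Self_Employed": ["No", "No", "No", "No"],
    "LoanAmount":    [100.0, 140.0, np.nan, 80.0],
})

# Group means, as in the pivot-table step.
impute_grps = data.pivot_table(values="LoanAmount",
                               index=["Gender", "Married", "Self_Employed"],
                               aggfunc="mean")

# Iterate only over rows with a missing LoanAmount and look up the
# group mean with a tuple index into the pivot table.
for i, row in data.loc[data["LoanAmount"].isnull()].iterrows():
    ind = tuple([row["Gender"], row["Married"], row["Self_Employed"]])
    data.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]

# Verify: no missing values remain.
print(data["LoanAmount"].isnull().sum())
```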



Note: with a MultiIndex, loc requires a tuple to specify the group. This tuple is used inside the loop.

The .values[0] suffix is required because, by default, the element returned carries an index that does not match that of the DataFrame; in that case a direct assignment raises an error.

6. Two-way tables (crosstab)

This feature can be used to get an initial “feel” (overview) of the data and to test some basic hypotheses. For example, in this case ‘Credit_History’ is expected to significantly affect the loan status. This can be verified with the following two-way table:

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)

These numbers are absolute counts. Percentages, however, are more helpful for a quick read of the data. We can use the apply function to convert them:
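A sketch of that conversion; `percConvert` is an illustrative helper name, applied row-wise over a crosstab built from stand-in data:

```python
import pandas as pd

# Toy frame standing in for the loan data.
data = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N"],
})

def percConvert(ser):
    # Divide each count in the row by the row total (the "All" margin,
    # which is the last element of the row).
    return ser / float(ser.iloc[-1])

ct = pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)
pct = ct.apply(percConvert, axis=1)
print(pct)
```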

It is now clear that those with a credit history are more likely to get a loan: 80% of those with a credit history get a loan, compared with only 9% of those without.

But there’s more to it than that. Since having a credit history is clearly important, what if we simply use credit history to predict loan approval? Perhaps surprisingly, we would be right 460 times out of 614 trials, a good 75% of the time!
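That check can be sketched as follows, on stand-in data (on the real training set the same rule yields the 460/614 ≈ 75% figure quoted above):

```python
import pandas as pd

# Toy frame standing in for the loan data.
data = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N", "Y"],
})

# Naive rule: predict approval exactly when a credit history exists.
pred = data["Credit_History"].map({1.0: "Y", 0.0: "N"})
accuracy = (pred == data["Loan_Status"]).mean()
print(accuracy)
```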

I don’t blame you if, at this point, you’re wondering why we even need statistical models. But trust me: improving accuracy by even 0.001 percent beyond this baseline is challenging. Are you willing to accept this challenge?

Note: the 75% is on the training set. The results on the test set differ slightly but are similar. In the meantime, I hope this example shows why even a 0.05% gain in accuracy can mean a jump of 500 places on the Kaggle leaderboard.