From Towards Data Science, by Admond Lee. Compiled by Geek AI and Qian Zhang.

Data cleansing is a hard job for data scientists. To make this work less painful, the author of this article shares his own set of data cleaning code.

Real-world data is often of poor quality, and part of a data scientist's job is sometimes to clean it. A data scientist should be able to perform this cleansing step before any analysis or modeling work, to ensure the data is of the best possible quality.

To cut a long story short, having worked in data science for a long time, I really get a sense of how painful it can be to clean data before you can analyze, visualize, and model it.

Whether you admit it or not, data cleansing is not an easy task; most of the time it is time-consuming and tedious, but it is also very important.

If you’ve been through the data cleansing process, you know what I mean. That’s what this article is all about — making data cleansing easier for the reader.

In fact, I realized some time ago that data tends to show similar patterns when it comes to cleaning. Since then, I've been organizing and compiling some data cleaning code (see below) that I think applies to other common scenarios as well.

Because these common scenarios involve different types of data sets, this article focuses on showing and explaining what the code can do, so that readers can use it more easily.

My data cleaning kit

In the code snippets below, the data cleaning code is wrapped in functions, and the purpose of each is straightforward. You can also use the code directly without embedding it in functions; only minimal changes to the parameters are required.

1. Delete multiple columns of data

def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df

Sometimes, not all of the columns are useful for our data analysis work. Therefore, df.drop is a convenient way to drop the columns you select.
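For instance, a minimal usage sketch (the dataframe and column names here are made up for illustration):

import pandas as pd

# a made-up dataframe for illustration
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'], 'notes': ['x', 'y']})

# drop the two columns we no longer need
df = drop_multiple_col(['name', 'notes'], df)
print(df.columns.tolist())  # ['id']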

2. Convert Dtypes

def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')

When we are faced with larger data sets, we need to convert dtypes to save memory. If you are interested in learning how to use Pandas to handle big data, I highly recommend reading “Why and How to Use Pandas with Large Data” (https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c).
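As a rough sketch of how to verify the saving (the column names here are made up), you can compare memory usage before and after with df.memory_usage():

import pandas as pd

df = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': [0.1, 0.2, 0.3]})
print(df.memory_usage(deep=True).sum())  # bytes used before the conversion

change_dtypes(col_int=['col_a'], col_float=['col_b'], df=df)
print(df.memory_usage(deep=True).sum())  # smaller: 64-bit dtypes downcast to 32-bit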

3. Convert categorical variables to numeric variables

def convert_cat2num(df):
    # Convert categorical variables to numerical variables
    num_encode = {'col_1': {'YES': 1, 'NO': 0},
                  'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
    df.replace(num_encode, inplace=True)

Some machine learning models require their input variables to be numeric. In that case, we need to convert the categorical variables into numeric variables before using them as model inputs. For data visualization tasks, however, I recommend keeping the categorical variables, so that the visualization results are easier to explain and understand.
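A minimal sketch of the mapping in action, using the column names and labels hard-coded in the function above:

import pandas as pd

df = pd.DataFrame({'col_1': ['YES', 'NO'], 'col_2': ['WON', 'DRAW']})
convert_cat2num(df)
print(df)
#    col_1  col_2
# 0      1      1
# 1      0      0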

4. Check for missing data

def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)

If you want to check how much missing data is in each column, this is probably the fastest way to do it. This approach will give you a clearer idea of which columns have more missing data and help you decide what actions to take next in your data cleansing and analysis efforts.
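For example, on a small made-up dataframe with a few NaN values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [1, np.nan, np.nan], 'col_2': ['a', 'b', np.nan]})
print(check_missing_data(df))
# col_1    2
# col_2    1
# dtype: int64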

5. Remove strings in columns

def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'].replace('\n', ' ', regex=True, inplace=True)

    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'].replace('&#.*', '', regex=True, inplace=True)

Sometimes you might see newline characters or other strange symbols in a string column. You can easily handle this problem with df['col_1'].replace, where 'col_1' is one of the columns in dataframe df.
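A quick sketch of the effect on made-up values. One caveat: on recent pandas versions with copy-on-write enabled, chained inplace replacement like df['col_1'].replace(..., inplace=True) may not modify df, so assigning the result back with df['col_1'] = df['col_1'].replace(...) is the safer pattern.

import pandas as pd

df = pd.DataFrame({'col_1': ['hello\nworld', 'text&#123;junk']})
remove_col_str(df)
print(df['col_1'].tolist())  # ['hello world', 'text']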

6. Remove white space in a column

def remove_col_white_space(df, col):
    # remove white space at the beginning of string
    df[col] = df[col].str.lstrip()

When the data is messy, a lot of unexpected things can happen. It is quite common for strings to have some spaces at the beginning. This approach is useful when you want to remove spaces from the beginning of the strings in a column.
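A minimal usage sketch, passing the column to strip as an argument (the values are made up):

import pandas as pd

df = pd.DataFrame({'col_1': ['  leading space', ' another']})
remove_col_white_space(df, 'col_1')
print(df['col_1'].tolist())  # ['leading space', 'another']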

7. Concatenate two columns of string data (under certain conditions)

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', ' ', regex=True, inplace=True)  # replace the 'pil' with empty space

This is useful when you want to combine two columns of string data under certain conditions. For example, you might want to concatenate column 1 and column 2 when the first column ends with certain letters. If you want, you can also delete those ending letters after the concatenation is complete.
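A sketch of the pattern on made-up data. Since the function above keeps col_new local, in practice you would return it or assign it back to the dataframe:

import pandas as pd

df = pd.DataFrame({'col_1': ['stapil', 'other'], 'col_2': ['_ple', '_row']})
mask = df['col_1'].str.endswith('pil', na=False)
col_new = df[mask]['col_1'] + df[mask]['col_2']
col_new = col_new.replace('pil', '', regex=True)  # drop the ending letters after concatenation
print(col_new.tolist())  # ['sta_ple']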

8. Convert timestamps (from string type to datetime format)

import pandas as pd

def convert_str_datetime(df):
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
    INPUT  -> df
    OUTPUT -> updated df with new datetime format
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

When working with time series data, you may encounter timestamp columns stored as strings. This means we may need to convert the string-format data into the datetime format we want, so that we can use the data for meaningful analysis and presentation.
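For example, with a made-up dataframe whose string timestamps live in a transdate column, as the function above assumes:

import pandas as pd

df = pd.DataFrame({'id': [1], 'amount': [9.99],
                   'transdate': ['2019-02-28 17:30:45.678']})
convert_str_datetime(df)
print(df['timestamp'].dtype)  # datetime64[ns]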

The original link: towardsdatascience.com/the-simple-…