One of the most searched (and discussed) questions about pandas is how to iterate over the rows in a DataFrame. This question tends to come up immediately for new users who have loaded some data into a DataFrame and now want to do something useful with it. For most programmers, the natural next step is to write a loop. They may not yet know the “correct” way to use DataFrames, but even experienced pandas and NumPy developers will consider iterating over the rows of a DataFrame to solve a problem. Rather than trying to find the single right answer about iteration, it makes more sense to understand the issues involved and know how to choose the best solution for your situation.

By far the most popular question on Stack Overflow tagged “pandas” is about how to iterate over DataFrame rows. As it turns out, its answer is also the most copied code block on the entire site. Stack Overflow’s developers say thousands of people view the answer every week and copy its code to solve their problems. Clearly, people want to iterate over DataFrame rows!

Iterating over DataFrame rows using the top answer’s solution does have serious consequences. The other answers to that question (especially the second-highest-rated one) do a fairly good job of presenting alternatives, but the full list of 26 (and counting!) answers is very confusing. Rather than asking how to iterate over DataFrame rows, it makes more sense to understand what options are available, what their advantages and disadvantages are, and then choose the one that makes sense for you. In some cases, the most-upvoted iteration answer may well be the best choice!

But I’ve heard iteration is wrong, right?

First, choosing to iterate over the rows of a DataFrame is not automatically the wrong way to solve a problem. In most cases, however, what a beginner would do with iteration can be done better another way. Still, no one should feel bad about writing a first solution that uses iteration rather than some other (perhaps better) approach. Iteration is often the best way to learn, and you can think of that first solution as the first draft of an essay, to be improved with some editing.

What are we going to do with our DataFrame?

Let’s start with the basics. If we look at the original question on Stack Overflow, both the question and the answer simply print the contents of the DataFrame. First, let’s all agree that this is not a good way to look at the contents of a DataFrame. The standard rendering of a DataFrame, whether produced by print, by display in a Jupyter notebook, or as the output of a cell, is much better than printing the contents in a custom format.

If the DataFrame is large, only some columns and rows may be visible by default. Use head and tail to get a feel for the data. If you want to see only a subset of the DataFrame, rather than writing a loop that displays just those rows, use pandas’ powerful indexing capabilities. With a little practice, you can select any combination of rows and columns to display. Let’s start there.
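For instance, here is a small sketch (with made-up data standing in for your real DataFrame) of head, tail, and boolean indexing replacing a display loop:

```python
import pandas as pd

# a small made-up DataFrame standing in for your real data
df = pd.DataFrame({'city': ['Longberg', 'Brentside', 'Judithport'],
                   'balance': [493, 427, 587]})

print(df.head(2))  # the first rows, to get a feel for the data
print(df.tail(1))  # the last row

# indexing instead of a display loop: select exactly the rows and columns you want
print(df.loc[df['balance'] > 450, ['city', 'balance']])
```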

Now, instead of a trivial print example, let’s look at how to actually use the data in the rows of a DataFrame with some real logic.

Example

Let’s create an example DataFrame to work with. I’ll do this by generating some fake data (using Faker). Note that the columns have different data types (we have some strings, an integer, and dates).

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()
today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(),
                    fake.last_name(),
                    fake.date_this_decade(),
                    fake.date_between_dates(today, next_month),
                    fake.city(),
                    fake.state(),
                    fake.zipcode(),
                    fake.random_int(-100, 1000)]
                   for r in range(100)],
                  columns=['first_name', 'last_name', 'start_date', 'end_date',
                           'city', 'state', 'zipcode', 'balance'])
df['start_date'] = pd.to_datetime(df['start_date'])  # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])
df.dtypes
first_name            object
last_name             object
start_date    datetime64[ns]
end_date      datetime64[ns]
city                  object
state                 object
zipcode               object
balance                int64
dtype: object
df.head()
  first_name last_name start_date   end_date               city      state  \
0  Katherine     Moody 2020-02-04 2021-06-28           Longberg   Maryland   
1      Sarah   Merritt 2021-03-02 2021-05-30  South Maryborough  Tennessee   
2      Karen   Hensley 2020-02-29 2021-06-23          Brentside   Missouri   
3      David  Ferguson 2020-02-02 2021-06-14         Judithport   Virginia   
4    Phillip     Davis 2020-07-17 2021-06-04          Louisberg  Minnesota   

  zipcode  balance  
0   20496      493  
1   18495      680  
2   63702      427  
3   66787      587  
4   98616      211  

First try

Assume that our DataFrame contains customer data and that we have a customer scoring function that uses multiple customer attributes to give each customer a score between ‘A’ and ‘F’. Any customer with a negative balance gets an ‘F’, anyone with a balance above 500 gets an ‘A’, and the logic after that depends on whether the customer is a “legacy” customer and which state they live in.

Note that I tested this function; see my Jupyter unit testing post for more details on how to unit test in Jupyter.

from dataclasses import dataclass

@dataclass
class Customer:
    first_name: str
    last_name: str
    start_date: datetime
    end_date: datetime
    city: str
    state: str
    zipcode: str
    balance: int

def score_customer(customer: Customer) -> str:
    """Give a customer a credit score.
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", -5))
    'F'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", 50))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", 50))
    'D'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", 150))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", 250))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023, 1, 1), "Chicago", "Illinois", "66666", 350))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023, 1, 1), "Santa Fe", "California", "88888", 350))
    'A'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023, 1, 1), "Santa Fe", "California", "88888", 50))
    'C'
    """
    if customer.balance < 0:
        return 'F'
    if customer.balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if customer.start_date > datetime(2020, 1, 1):
        if customer.balance < 100:
            return 'D'
        elif customer.balance < 200:
            return 'C'
        elif customer.balance < 300:
            return 'B'
        else:
            if customer.state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if customer.balance < 100:
            return 'C'
        else:
            return 'A'

import doctest
doctest.testmod()
TestResults(failed=0, attempted=8)

Scoring our customers

OK, now that we have a concrete example, how do we get all the customer scores? Let’s go straight to the top answer to the Stack Overflow question, DataFrame.iterrows. This is a generator that yields the index of each row along with the row itself as a Series. If you’re not familiar with generators, you can think of this one as a function you can iterate over, so calling next on it gives you the first element.

next(df.iterrows())
(0,
 first_name              Katherine
 last_name                   Moody
 start_date    2020-02-04 00:00:00
 end_date      2021-06-28 00:00:00
 city                     Longberg
 state                    Maryland
 zipcode                     20496
 balance                       493
 Name: 0, dtype: object)

This looks promising! It’s a tuple containing the index of the first row and the row data itself. Maybe we can pass the row right into our function. Let’s give it a try and see what happens. Although the row is a Series, its index labels match the attribute names of our Customer class, so we may be able to pass it directly into our scoring function.

score_customer(next(df.iterrows())[1])
'A'

Wow, that seems to work. Can we just score the whole table?

df['score'] = [score_customer(c[1]) for c in df.iterrows()]

Is this our best option?

Wow, that was so simple. You can see why this is the most-upvoted answer: it seems to be exactly what we want. So why is this answer controversial?

As is often the case with pandas (and with any software engineering problem, really), picking the ideal solution depends on your inputs. Let’s summarize the possible problems with the various design choices. If a given problem doesn’t apply to your particular use case, iterating with iterrows may be a perfectly acceptable solution! I won’t judge you; I’ve used it plenty. I’ll conclude with a summary of how to decide among the possible solutions.

The arguments for and against using iterrows fall into the following categories.

  1. Efficiency (speed and memory)
  2. Problems caused by mixing types in a row
  3. Readability and maintainability

Speed and memory

In general, if you want things to be fast in pandas (or NumPy, or any framework that offers vectorized computation), you won’t want to iterate over elements; you’ll opt for a vectorized solution instead. However, even when a solution _can_ be vectorized, writing it that way can be taxing for programmers, especially beginners. Other answers to the Stack Overflow question suggest a host of alternatives. Most of them fall into the following categories, in rough order of speed, fastest first.

  1. Vectorization
  2. Cython routines
  3. List comprehensions (vanilla for loops)
  4. DataFrame.apply()
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

Vectorization

The main problem with always telling people to vectorize everything is that vectorized solutions can sometimes be cumbersome to write, debug, and maintain. The examples used to prove that vectorization is preferable often show trivial operations, such as simple multiplication. But since the example in this article isn’t just a single calculation, I decided to write a possible vectorized solution to the problem.

def vectorized_score(df):
    return np.select(
        [df['balance'] < 0,
         df['balance'] > 500,  # technically not needed, would fall through to the default
         ((df['start_date'] > datetime(2020, 1, 1)) & (df['balance'] < 100)),
         ((df['start_date'] > datetime(2020, 1, 1)) & (df['balance'] >= 100) & (df['balance'] < 200)),
         ((df['start_date'] > datetime(2020, 1, 1)) & (df['balance'] >= 200) & (df['balance'] < 300)),
         ((df['start_date'] > datetime(2020, 1, 1)) & (df['balance'] >= 300) & df['state'].isin(['Illinois', 'Indiana'])),
         ((df['start_date'] <= datetime(2020, 1, 1)) & (df['balance'] < 100))],  # conditions
        ['F', 'A', 'D', 'C', 'B', 'B', 'C'],  # choices
        'A')  # default score

assert (df['score'] == vectorized_score(df)).all()

Of course, there’s more than one way to do this. I chose np.select (you can read more about it, and various other ways to update DataFrames, in my article on using where and mask). I rather like np.select when there are multiple conditions like this, even though it’s not very readable. We could also do this with more code, applying a vectorized update at each step to make it more readable; it would probably be about the same in terms of speed.
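For illustration, the more explicit step-by-step variant might look something like this sketch, using a small made-up DataFrame (the values are hypothetical) and only the scoring rules described above. Each boolean mask overwrites the scores from the previous one, so the order of assignments matters:

```python
import pandas as pd
from datetime import datetime

# made-up data covering the interesting cases
df = pd.DataFrame({
    'balance': [-10, 50, 700, 250],
    'start_date': pd.to_datetime(['2021-02-01', '2019-05-01', '2021-03-01', '2021-06-01']),
    'state': ['Illinois', 'Ohio', 'Texas', 'Indiana'],
})

score = pd.Series('A', index=df.index)  # start from the default
legacy = df['start_date'] <= datetime(2020, 1, 1)

# non-legacy tiers: later assignments overwrite earlier ones for smaller balances
score[~legacy & (df['balance'] < 300)] = 'B'
score[~legacy & (df['balance'] < 200)] = 'C'
score[~legacy & (df['balance'] < 100)] = 'D'
score[~legacy & (df['balance'] >= 300) & df['state'].isin(['Illinois', 'Indiana'])] = 'B'
score[legacy & (df['balance'] < 100)] = 'C'
# these two rules take precedence over everything above
score[df['balance'] > 500] = 'A'
score[df['balance'] < 0] = 'F'
print(score.tolist())  # ['F', 'C', 'A', 'B']
```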

I personally find the np.select version hard to read, but perhaps with some good comments it could be clearly explained to future maintainers (or my future self). Still, the reason we write vectorized code is to make it faster. How does it perform on our sample DataFrame?

%timeit vectorized_score(df)
2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s also time our original solution.

%timeit [score_customer(c[1]) for c in df.iterrows()] 
13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

OK, so the vectorized version is almost five times faster, even on our small data set. That difference doesn’t matter much at this size, but on large data sets a simple rewrite can buy a significant speedup. I believe that with a little thought and profiling, an even faster vectorized version could be written. Hang on to see how things perform on bigger data sets.
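As a sketch of how such a size comparison might be run, one could time each approach across growing inputs with timeit. Here a toy scoring rule and made-up data stand in for the real function; the function names are hypothetical:

```python
import timeit
import numpy as np
import pandas as pd

def make_df(n):
    # made-up balances standing in for the Faker data
    rng = np.random.default_rng(0)
    return pd.DataFrame({'balance': rng.integers(-100, 1000, n)})

def iterative(frame):
    # toy stand-in for scoring each row via iterrows
    return ['F' if row['balance'] < 0 else 'A' for _, row in frame.iterrows()]

def vectorized(frame):
    # the same toy rule, vectorized
    return np.where(frame['balance'] < 0, 'F', 'A')

for n in (100, 1000, 5000):
    frame = make_df(n)
    t_iter = timeit.timeit(lambda: iterative(frame), number=3)
    t_vec = timeit.timeit(lambda: vectorized(frame), number=3)
    print(f"n={n}: iterrows {t_iter:.4f}s vs vectorized {t_vec:.4f}s")
```

The gap between the two timings widens as n grows, which is the pattern the chart at the end of this article shows.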

Cython

Cython is a project that makes it easy to write C extensions for Python using (mostly) Python syntax. I admit that I’m far from a Cython expert, but I’ve found that even a little effort in Cython can make Python code much faster. In this case, we’ve already shown that we can write a vectorized solution, so applying Cython to a non-vectorized solution may not be worth pursuing as a first choice. However, I did write a simple Cython version that, with even a little effort, was the fastest of the non-vectorized solutions for smaller inputs. Especially for cases where each row requires a lot of computation that cannot be vectorized, Cython can be a good choice, but it does take some investment of time.

List comprehension

Now, the next option is a little different. I admit I don’t use this technique very often, but the idea is to use a list comprehension that calls your function with each element from the DataFrame. Note that our first solution already used a list comprehension, but it did so with iterrows. This time, instead of using iterrows, we pull the data directly out of the individual columns of the DataFrame and iterate over those; no Series is created. If your function takes more than one argument, you can use zip to make tuples of arguments, passing in the DataFrame columns in the order the arguments expect. To do this, I need a modified scoring function, because my DataFrame doesn’t contain already-built Customer objects, and creating them just to call the function would add another layer. I only use three attributes of the customer, so the rewrite is simple.

def score_customer_attributes(balance:int, start_date:datetime, state:str) -> str:
    if balance < 0:
        return 'F'
    if balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if start_date > datetime(2020, 1, 1):
        if balance < 100:
            return 'D'
        elif balance < 200:
            return 'C'
        elif balance < 300:
            return 'B'
        else:
            if state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if balance < 100:
            return 'C'
        else:
            return 'A'

Here is what the first element of the zipped arguments looks like, before the list comprehension calls the function.

next(zip(df['balance'], df['start_date'], df['state']))
(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')

We will now create a list of all the scores for the entire DataFrame.

df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()

Now how fast is that?

%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
100 µs ± 100 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Wow, that’s a lot faster: more than 70 times faster than the original version. Simply by taking the raw data and calling a plain Python function, the scores are computed quickly in Python space, with no need to convert each row to a Series.

Note that we could also call our original function; we would just need to build a Customer object and pass it in. That’s a little ugly, but still quite fast.

%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]
254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

DataFrame.apply

We can also use DataFrame.apply. Note that to apply the function to rows, you need to pass axis=1, because by default apply works on each column. The axis argument determines which index the object passed into your function will have. We want each object to be a row of customer data, with the column names as its index.

assert (df.apply(score_customer, axis=1) == df['score']).all()
%timeit df.apply(score_customer, axis=1)
3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The performance here is more than three times faster than iterrows. It’s also very readable, and it lets us use our original function, which is easy to read and maintain. But it is still slower than the list comprehension, because it constructs a Series object for each row.

DataFrame.iteritems and DataFrame.itertuples

Now we’ll look at the general iteration methods in more detail. For DataFrames, there are three iter functions: iteritems, itertuples, and iterrows. DataFrames also support iteration directly, but these functions don’t all iterate over the same thing. Since it can be confusing to figure out what each method does just from its name, let’s review them here.

  • iter(df) (invokes the DataFrame.__iter__ method). Iterates over the information axis, which for DataFrames is the column names, not the values.
next(iter(df)) # 'first_name'
'first_name'
  • iteritems. Iterates over the columns, returning a tuple of the column name and the column as a Series.
next(df.iteritems())
next(df.items())       # these two are equivalent
('first_name',
 0       Katherine
 1           Sarah
 2           Karen
 3           David
 4         Phillip
          ...     
 95         Robert
 96    Christopher
 97        Kristen
 98       Nicholas
 99       Caroline
 Name: first_name, Length: 100, dtype: object)
  • items. This is the same as iteritems above; iteritems really just calls items.
  • iterrows. We’ve already seen this one. It iterates over the rows, returning a tuple of the index and the row as a Series.
next(df.iterrows())
(0,
 first_name              Katherine
 last_name                   Moody
 start_date    2020-02-04 00:00:00
 end_date      2021-06-28 00:00:00
 city                     Longberg
 state                    Maryland
 zipcode                     20496
 balance                       493
 score                           A
 score3                          A
 Name: 0, dtype: object)
  • itertuples. Iterates over the rows, returning a namedtuple for each row. You can optionally change the name of the tuple and disable returning the index.
next(df.itertuples())
Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')
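For example, here is a small sketch (with made-up data) of the name and index parameters of itertuples mentioned above:

```python
import pandas as pd

# a tiny stand-in for the article's customer DataFrame (made-up values)
df = pd.DataFrame({'first_name': ['Katherine'], 'balance': [493]})

# rename the tuple and drop the index
row = next(df.itertuples(index=False, name='Customer'))
print(row)  # Customer(first_name='Katherine', balance=493)

# name=None returns plain tuples, skipping namedtuple construction entirely
plain = next(df.itertuples(index=False, name=None))
print(plain)  # ('Katherine', 493)
```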

Using itertuples

Since we’ve already looked at iterrows, we only need to look at itertuples. As you can see, the returned value, a namedtuple, can be used with our original function.

assert ([score_customer(t) for t in df.itertuples()]  == df['score']).all()
%timeit [score_customer(t) for t in df.itertuples()] 
858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance here is pretty good, over 12 times faster than iterrows. Building a namedtuple for each row is much faster than building a Series.

Mixed types in a row

Now is a good time to bring up another difference between iterrows and itertuples. A namedtuple can properly represent any mix of types in a row. In our case, we have strings, dates, and integers. A pandas Series, however, must hold a single data type across the whole Series. Because our data types are varied enough that they are all represented as object, and so retain their underlying types, this causes no functional issues for us. But that is not always the case!

For example, if your columns have different numeric types, in iterrows they will be upcast to a common type that can represent all of them. This can cause itertuples and iterrows to return slightly different data for the same row, so be careful.

dfmixed = pd.DataFrame({'integer_column': [1, 2, 3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes
integer_column      int64
float_column      float64
dtype: object
next(dfmixed.itertuples())
Pandas(Index=0, integer_column=1, float_column=1.1)
next(dfmixed.iterrows())
(0,
 integer_column    1.0
 float_column      1.1
 Name: 0, dtype: float64)

Column names

One more word of warning. If your DataFrame has column names that are not valid Python identifiers, you will not be able to access them with dot syntax on an itertuples namedtuple. If you have a column named 2b or My Column, you will have to access it by positional name (for example, the first such column will be called _1). With iterrows, the row is a Series, so you can use [“2b”] or [“My Column”] to access the column.
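A quick sketch (with made-up columns) showing the positional fallback:

```python
import pandas as pd

# made-up columns whose names are not valid Python identifiers
df = pd.DataFrame({'2b': [1], 'My Column': [2]})

row = next(df.itertuples())
print(row)  # invalid names are replaced positionally: Pandas(Index=0, _1=1, _2=2)
assert getattr(row, '_1') == 1  # no dot syntax for '2b', only the positional name

# with iterrows the row is a Series, so bracket access still works
_, series_row = next(df.iterrows())
assert series_row['My Column'] == 2
```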

Other options

Of course, there are other iteration options. For example, you could increment an integer offset and use the iloc indexer on the DataFrame to select any row. Of course, this is really no different from any other iteration, and it is also non-standard, so other people reading your code may find it hard to follow. I built a naive version of this in the performance comparison code summarized below, if you want to see it (its performance is terrible).
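For illustration, that naive offset-based loop might look like this sketch (made-up data and a toy rule; this is the style to avoid):

```python
import pandas as pd

# made-up balances
df = pd.DataFrame({'balance': [493, -20, 680]})

scores = []
for i in range(len(df)):
    row = df.iloc[i]  # builds a new Series on every pass through the loop
    scores.append('F' if row['balance'] < 0 else 'A')
print(scores)  # ['A', 'F', 'A']
```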

Choosing well

Choosing the right solution basically depends on two factors.

  1. How big is your data set?
  2. What can you easily write (and maintain)?

In the image below, you can see the runtimes of the solutions we have considered (the code to generate it is here). As you can see, only the vectorized solution holds up well as data sizes grow. If your data set is large, a vectorized solution may be your only reasonable option.

A comparison of the running times of various methods on our DataFrame.

However, depending on how many times you need to run the code, how long it takes you to write it correctly, and how well you can maintain it, you might choose any of the other solutions and be fine. All of the iterating solutions grow linearly with the size of the data.

Maybe one way to think about this is not just big-O notation, but "big-U" notation. In other words, how long will it take you to write the correct solution? If that is less than the running time of your code, an iterative solution may be perfectly fine. However, if you are writing production code, take the time to learn how to vectorize.

One more thing: sometimes it is easier to write an iterative solution against a smaller data set first, and you may want to do that before writing the vectorized version. Verify your results with the iterative solution to make sure you have it right, then use the vectorized version on the larger, complete data set.
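That workflow might look like this sketch, with a toy scoring rule and made-up data standing in for the real ones:

```python
import numpy as np
import pandas as pd

# made-up data; a real data set would be much larger
df = pd.DataFrame({'balance': [-10, 50, 700]})

def score_iterative(frame):
    # the easy-to-trust first draft
    return ['F' if row['balance'] < 0 else 'A' if row['balance'] > 500 else 'C'
            for _, row in frame.iterrows()]

def score_vectorized(frame):
    # the faster rewrite
    return np.select([frame['balance'] < 0, frame['balance'] > 500], ['F', 'A'], 'C')

# develop on a small sample and check the fast version against the slow one...
sample = df.head(100)
assert list(score_vectorized(sample)) == score_iterative(sample)

# ...then run only the vectorized version on the full data set
df['score'] = score_vectorized(df)
```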

I hope you found this deep dive into DataFrame iteration interesting. I know I learned some useful things along the way.

The post How to iterate over DataFrame rows (and should you?) appeared first on wrighters.io.