Pandas Style Guide #1

if 'cycle' in df.columns:
    cycle_seq = df['cycle'].values[:max_epoch]
    for i in range(1, len(cycle_seq)):
        if cycle_seq[i] != cycle_seq[i - 1]:
            ...

This code, found in a stray Jupyter notebook,^[1] makes me... sad? No, that's insufficient. Let me consult the Feelings Wheel. Shocked? Dismayed? Overwhelmed? Embarrassed is listed twice, a subset of both Hurt and Disapproving. And yet, I am also Inspired to write about this, Confident I can help, Unfocussed because I doubt the word should have that second s. I myself go in cycles on this wheel, comparing where I am to where I was. There must be a better way.

Indents and Hidden Conditions

Python's style guide tells us to limit our lines to 79 characters. After all, IBM punchcards only held 80. Yes, yes, IBM expanded to 96 punch-able columns in 1969, but that space age technology is a bit too cutting edge; let's stay backward compatible.

Each level of indentation eats through our virtual punch card and adds a bit more cognitive overhead. Soon, we find ourselves three levels deep in a conditional stack. A common solution is to invert these conditions, removing one level of indentation and one thing to think about:

def process_dataframe(df):
    ...
    if 'cycle' not in df.columns:
        return

    cycle_seq = df['cycle'].values[:max_epoch]
    for i in range(1, len(cycle_seq)):
        if cycle_seq[i] != cycle_seq[i - 1]:
            ...  # This ellipsis is valid Python code. It does nothing!

In this case, it makes little difference. It's more useful if we have a dozen spinning plates to unstack. The more relevant question is not how we're checking this, but why. Is "cycle" an optional parameter? Is this function sometimes called with a "cycle" and sometimes without? What if our dataframe used the column name "cycle" for something else? What if we accidentally called it "Cycle" with a capital C? What if my name is Cycle, and I show up riding a (bi/uni)cycle, launching an uproarious Who's On First routine?

How would we learn about this "cycle" behavior? Think of a function signature as an invitation. The function invites us to a gala, merely telling us to bring a dataframe. Halfway through the gala, the hosts announce they're starting a game of axe throwing, but only for guests who wore closed-toed shoes and brought an axe. If we didn't, we're not kicked out of the party or, say, struck by an axe, but it does sting a bit (the axe). We had no way to anticipate this was an option. Maybe we would have liked to join in. But we brought no "cycle" column, no axe, and no shoes, and now our feet are cold. We feel despondent. Is that a feeling? Let me check. No. We feel Betrayed.

This is a problem with Pandas in general. The columns of a dataframe essentially define a type, but there's no good way to annotate that type. When possible, we should try to provide a heads-up:

def process_dataframe(df, cycle_column=None):
    ...

    if cycle_column:
        cycle_seq = df[cycle_column].values[:max_epoch]
        for i in range(1, len(cycle_seq)):
            if cycle_seq[i] != cycle_seq[i - 1]:
                ...

Now our guests know what to expect. It's clear that we can bring a cycle_column, but the default of None makes it clear we don't have to. We can call the column whatever we want. If we name a column which does not exist, it's an error, and it's obviously our error. If we screw up, we'll ~~raise an exception~~ get struck by an axe. We will feel embarrassed (hurt), but not betrayed. We also don't need to write the string "cycle" twice. Less risk of a tpyo.
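To see the invitation in action, here is a minimal sketch. The assumptions are mine: a toy dataframe, a stand-in body that merely counts cycle changes (the original elides what really happens inside the loop), and a max_epoch parameter added so the snippet runs on its own. Naming a column that doesn't exist raises a KeyError at the lookup, which points straight back at the caller.

```python
import pandas as pd


def count_cycle_changes(df, cycle_column=None, max_epoch=100):
    """Hypothetical stand-in for process_dataframe: counts cycle changes."""
    if cycle_column is None:
        return 0  # No cycle column offered, so there is nothing to check

    # df[cycle_column] raises KeyError if the caller named a missing column
    cycle_seq = df[cycle_column].values[:max_epoch]
    changes = 0
    for i in range(1, len(cycle_seq)):
        if cycle_seq[i] != cycle_seq[i - 1]:
            changes += 1
    return changes


df = pd.DataFrame({"cycle": [1, 1, 2, 2, 3]})

without_column = count_cycle_changes(df)        # skips the check entirely
with_column = count_cycle_changes(df, "cycle")  # two changes: 1->2 and 2->3

try:
    count_cycle_changes(df, "Cycle")            # capital C: our mistake
    typo_error = None
except KeyError as err:
    typo_error = err                            # the axe, delivered promptly
```

The capital-C typo now fails loudly at the call site instead of silently skipping the check.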

If we're still worried about indentation, we can make sure a cycle column exists:

def process_dataframe(df, cycle_column=None):
    ...
    if not cycle_column:
        cycle_column = '__cycle'  # Pick a name that won't conflict with anything real
        df[cycle_column] = -1     # Populate with a dummy value

    cycle_seq = df[cycle_column].values[:max_epoch]
    for i in range(1, len(cycle_seq)):
        if cycle_seq[i] != cycle_seq[i - 1]:
            ...

In some cases, this can simplify our code, but it has downsides. It gives the function a confusing side effect, modifying the dataframe that's passed in. If the input is actually a dataframe slice, things get messy. We could copy the dataframe and work off the copy, but that can slow things down and eat up memory. For now, let's live with the indentation. We can always punch another card.
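The side effect is easy to demonstrate. This is a small sketch, assuming a hypothetical helper named add_dummy_cycle that contains only the default-column step from above and a made-up "epoch" column:

```python
import pandas as pd


def add_dummy_cycle(df, cycle_column=None):
    """Hypothetical helper showing only the default-column step."""
    if not cycle_column:
        cycle_column = "__cycle"
        df[cycle_column] = -1  # Mutates the caller's dataframe in place
    return cycle_column


original = pd.DataFrame({"epoch": [0, 1, 2]})
columns_before = list(original.columns)

add_dummy_cycle(original)
columns_after = list(original.columns)  # the caller now sees "__cycle" too

# Working on a copy leaves the caller's dataframe alone, at the cost of memory
untouched = pd.DataFrame({"epoch": [0, 1, 2]})
add_dummy_cycle(untouched.copy())
columns_untouched = list(untouched.columns)
```

The caller handed over one column and got back two, which is the sort of surprise a function signature can't warn anyone about.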

Off by One Errors

Counting is hard. Babies rarely learn to count until they're at least a day old.

cycle_seq = df[cycle_column].values[:max_epoch]
for i in range(1, len(cycle_seq)):
    ...

But even adults get confused. Quick: how many items are in range(x, y)? The answer is in this footnote.^[2]

An array with n elements has a length of n, but since its indices start at 0, the largest valid index into that array is n-1. A Python range with two arguments is inclusive of the first argument but exclusive of the second. If we index into an array using the indices in range(1, len(array)), we first access the second element, which is at index 1, then end with the index corresponding to the array's length minus 1, which represents the last element of the array. Are you still reading this? Did your eyes glaze over? I'm typing words on the computer. Hello. Welcome back. Quick: something in this paragraph is wrong. What?^[3]

As written, this function works. If cycle_seq is empty, range(1, 0) is empty, so the loop will not run. If cycle_seq has one element, range(1, 1) is empty, so the loop will not run. With at least two elements, this code correctly checks adjacent pairs.
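Rather than trust my counting, we can check those three cases directly. A minimal sketch, assuming a hypothetical helper named adjacent_changes that records the index of each mismatch in place of the elided loop body:

```python
def adjacent_changes(cycle_seq):
    """Return the indices i where cycle_seq[i] differs from cycle_seq[i - 1]."""
    changes = []
    for i in range(1, len(cycle_seq)):
        if cycle_seq[i] != cycle_seq[i - 1]:
            changes.append(i)
    return changes


empty = adjacent_changes([])                 # range(1, 0): loop never runs
single = adjacent_changes([7])               # range(1, 1): loop never runs
several = adjacent_changes([7, 7, 8, 8, 9])  # changes at indices 2 and 4
```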

But the cognitive overhead here is startlingly high for something so simple. This range syntax is an unintuitive way to explain what we want: to compare whether pairs of adjacent values are the same. In translating between implementation and intent, we're on guard for off-by-one errors and edge cases. It's a trap waiting to happen, like the pit of scorpions under my bed. I'm pretty sure I locked the trapdoor correctly, but it would be nice not to worry at all. I should probably switch rooms.

We can simplify this slightly by consolidating the two bounds checks:

cycle_seq = df[cycle_column].values
for i in range(1, min(max_epoch, len(cycle_seq))):  # min() keeps us in bounds if max_epoch exceeds the length
    ...

Better yet, rewrite this to not use an index at all:

previous = None
for element in cycle_seq:
    if element != previous:
        ...
    previous = element

This will mark the first row as the start of a cycle, while the earlier code would not. If we don't want that behavior, we can add another check (if element != previous and previous is not None). If we still need the index, we can count on enumerate to count on our behalf.

previous = None
for i, element in enumerate(cycle_seq):
    if element != previous:
        ...
    previous = element
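
Here is a runnable sketch of both behaviors side by side. The function name, the include_first flag, and the decision to collect indices in a list are my own stand-ins for the elided loop body, and the None check assumes None never shows up as a real cycle value.

```python
def new_cycle_indices(cycle_seq, include_first=True):
    """Return the indices where a new cycle starts (hypothetical helper)."""
    previous = None
    starts = []
    for i, element in enumerate(cycle_seq):
        is_first = previous is None  # Assumes None is never a real value
        if element != previous and (include_first or not is_first):
            starts.append(i)
        previous = element
    return starts


cycles = ["a", "a", "b", "c", "c"]
with_first = new_cycle_indices(cycles)                          # [0, 2, 3]
without_first = new_cycle_indices(cycles, include_first=False)  # [2, 3]
nothing = new_cycle_indices([])                                 # []
```

With include_first=False, this flags the same rows as the original index-based loop.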

Loops over Dataframes

Python is slow. Pandas dispatches most of its work to a C layer, and it's designed to work in bulk. Imagine sending 80-character punch cards, or, better yet, a telegram, via some sort of very slow snake. Say, an asp. He slithers over, picks up our message, slithers over to the recipient, reads out our telegram, ending every sentence with an elongated "sssstop", which is cute at first but quickly gets old. Check the first pair, sssstop, slither back, get the next message, slither away, now check the ssssecond pair, sssstop... No, stop, we'll be here all day. We want to send one message: Check all pairs(sss).

When we find ourselves looping through a dataframe, there's probably a more concise way to express what we want. Today, we can use shift():

cycle_data = ['a','b','c','d']
df = pd.DataFrame(cycle_data, columns=['cycle'])
df['previous_cycle'] = df['cycle'].shift(1)  # We could also shift by more than 1
df

  cycle previous_cycle
0     a           None
1     b              a
2     c              b
3     d              c

is_new_cycle = df['cycle'] != df['cycle'].shift(1)

If the reason for previously pulling out .values was to sidestep the dataframe's index, we can use df.reset_index() instead. Checking for new cycles this way is about 4x faster than looping by hand. Under the hood it's still doing that loop, but it's happening in C, not ~~Python~~ asp.
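
To convince ourselves the two approaches agree, here is a small sketch on made-up integer cycle data. The only difference is the first row: shift(1) leaves a missing value there, and anything compared against a missing value counts as different.

```python
import pandas as pd

df = pd.DataFrame({"cycle": [1, 1, 2, 2, 3]})

# Hand-written loop: flag row i when it differs from row i - 1
cycle_seq = df["cycle"].values
loop_flags = [False] * len(cycle_seq)
for i in range(1, len(cycle_seq)):
    loop_flags[i] = bool(cycle_seq[i] != cycle_seq[i - 1])

# Vectorized: compare the whole column against a shifted copy of itself
is_new_cycle = df["cycle"] != df["cycle"].shift(1)

# Row 0 is compared against a missing value, so it always comes out True
shift_flags = is_new_cycle.tolist()
```

From the second row onward the flags are identical, which is why the combined version needs a separate first-row check if we want to match the loop exactly.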

Putting it together

If, instead of this:

def process_dataframe(df):
    ...
    if 'cycle' in df.columns:
        cycle_seq = df['cycle'].values[:max_epoch]
        for i in range(1, len(cycle_seq)):
            if cycle_seq[i] != cycle_seq[i - 1]:
                ...

We write this:

def process_dataframe(df, cycle_column=None):
    ...
    if cycle_column:
        is_new_cycle = df[cycle_column] != df[cycle_column].shift(1)
        is_epoch_constrained = df.index < max_epoch
        is_first_row = df.index == 0

        new_cycle_df = df[is_new_cycle & is_epoch_constrained & ~is_first_row]
        ...

We will not feel Perplexed, Embarrassed (hurt), or Embarrassed (disapproving). Let us be Pleased (approving). Let us be content.
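
For anyone who would like to run that final version, here is a sketch under some loudly stated assumptions, all mine: max_epoch becomes a parameter (the snippets above read it from somewhere offscreen), the function is renamed and returns the filtered rows, and row position comes from NumPy rather than df.index, so it still behaves when the index isn't 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd


def new_cycle_rows(df, cycle_column=None, max_epoch=None):
    """Hypothetical variant of process_dataframe that returns new-cycle rows."""
    if cycle_column is None:
        return df.iloc[0:0]  # Empty frame with the same columns

    position = np.arange(len(df))  # Row position, independent of df.index
    is_new_cycle = df[cycle_column] != df[cycle_column].shift(1)
    is_first_row = position == 0
    if max_epoch is None:
        is_epoch_constrained = np.ones(len(df), dtype=bool)
    else:
        is_epoch_constrained = position < max_epoch

    return df[is_new_cycle & is_epoch_constrained & ~is_first_row]


# A non-default index, to show the positional masks still line up
df = pd.DataFrame({"cycle": [1, 1, 2, 2, 3, 4]}, index=list("uvwxyz"))
result = new_cycle_rows(df, "cycle", max_epoch=5)
```

Here the change at the sixth row falls outside max_epoch, so only the rows labeled "w" and "y" come back.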

[1] Incorporated into this blog with the author's enthusiastic permission, I think?
[2] The answer is not (y-x). range(3, 1) does not return -2 elements, it returns 0. The answer is max(y - x, 0).
[3] If an array is empty (n = 0), its largest valid index is not -1. It does not have a valid index at all.