How to write really good Python code for Data Engineering

Writing high-performance code without creating more problems for yourself in the future.

If you’re reading this, you likely want to get better at writing code for data engineering. In this guide, I’ll show you how I approach writing code as a means of problem-solving.

I’m a Senior Data Engineer at Headspace Health and have been practicing data engineering for 5+ years. More on my personal website.

I. BACKGROUND CONCEPTS

Throughout this article, I will be referring to code as declarative vs. non-declarative (imperative):

  • Imperative code (non-declarative) tells your compiler what a program does, step by step. The compiler cannot skip steps, because each step depends entirely on the previous one.
  • Declarative code tells your compiler what a program’s desired state should be, abstracting away the steps for achieving it. The compiler can skip or combine steps, since it can determine all of the program’s states ahead of time. (See the sketch below.)
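
As a minimal illustration of the two styles in Python (Python isn’t compiled the way C is, but the contrast in mindset still holds):

numbers = [1, 2, 3, 4]

# Imperative: spell out each step; every iteration depends on the last
total = 0
for n in numbers:
    total += n

# Declarative: state the desired result, let the runtime work out the steps
assert sum(numbers) == total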

High-performance software (code which is very fast) is achieved by doing the maximum amount of work in the smallest number of steps, as elegantly stated by David Farley. [1]

Modern compilers have all sorts of tricks to make code run faster and more efficiently on modern hardware. The more a compiler can predict a program’s state, the more “tricks” it can employ, resulting in fewer instructions and significant performance benefits.

Below is an architecture diagram of a compiler’s interface with a processing unit (a piece of hardware). Let the compiler help you! Give it simpler, more predictable instructions.

Hardware architecture for a processing unit. (credit: Science Direct)

To sum it all up: declarative code takes full advantage of modern compilers’ optimizations and results in higher performance.

Okay, let’s get to writing some code!

II. PROBLEM STATEMENT

You have a list of things (maybe it’s empty, maybe it has millions of items) and you need the first non-null value:

# A small list
things_small = [0, 1]

# An impossibly big list
things_big = list(range(1_000_000))

# A list with nulls and other stuff
things_with_nulls = [None, "", object()]

  • The function should return accurate results, and not discriminate against 0’s or empty strings ("").
  • The solution shouldn’t be slow. It’s probably reasonable to aim for O(1) in the best case and O(k) in the worst case, where k is the size of the list.

Here’s the expected behavior:

>>> get_first_non_null([1, 2, 3])
1

>>> get_first_non_null([None, 2, 3])
2

>>> get_first_non_null([None, 0, 3])
0

>>> get_first_non_null([None, None, None])
None

>>> get_first_non_null([])
None

>>> get_first_non_null([None, "", 1])
""

III. [CODE] SOLUTIONS:

Here are different approaches for solving this, starting with the simplest.

I’m sure every data engineer has solved this problem dozens of times over. Easy beans! Iterate over the list and see which items are not null, then return the first value:

def get_first_non_null_list_comp(my_vals: list, default=None):
    """
    Get first non-null value using
    a list comprehension.
    """
    filtered_vals = [x for x in my_vals if x is not None]

    if len(filtered_vals) > 0:
        return filtered_vals[0]
    else:
        return default

But this is clearly not performant:

  • You need to operate on every element of the list. This can be slow if your list is MASSIVE 🐘
  • A list comprehension essentially copies the list, so it may be memory-intensive (see the sketch after this list). We could assign in place (my_vals[:] = [x for x in my_vals if x is not None]), but that overwrites the caller’s original list, so we should avoid doing that.
  • Accessing the first element with list[0] is non-declarative, meaning your program has no guarantee what the value will be until it fetches it. This is OK in Python, most of the time. But as you get to writing more “organizationally used” code, you tend to see examples where indexing into a list goes awry. For example: customer_email = response["data"][0]["custom_attributes"][-1]["email"] << EEEEEK! 😱
  • There are 2 return statements. This is partially non-declarative and increases the code’s complexity (potentially limiting extensibility).
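
To make the memory point concrete, here’s a minimal sketch (exact sizes vary by Python version and platform, and sys.getsizeof reports only the list’s pointer array, not the elements themselves):

import sys

big_list = list(range(1_000_000))
copied = [x for x in big_list if x is not None]

# Two separate, comparably sized allocations now exist side by side
print(sys.getsizeof(big_list))  # roughly 8 MB of pointer storage
print(sys.getsizeof(copied))    # a second, similar-sized allocation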

So we can change the function to iterate through the list without processing all values:

def get_first_non_null_loop(my_vals: list, default=None):
    """
    Get first non-null value using
    a loop.
    """
    for x in my_vals:
        if x is not None:
            return x

    # Otherwise, return the default value
    return default

This is not bad and would pass most code reviews. ⭐️

But there are still shortcomings:

  • The code is loopy and would not benefit from vectorization 🚫 🚀
  • The code is not declarative, so our compiler is sad. 😢
  • Similar to solution 1 above, there are 2 return statements. I want there to be only 1.

But what if you want to take your code to the next level? 💡

Dynamically load values using Python’s built-in filter function, which returns a lazy iterator, allowing us to access and evaluate each element only when we ask for it:

from operator import is_not
from functools import partial


def get_first_non_null_generator(my_vals: list, default=None):
    """
    Get first non-null value using
    a generator (via filter).
    """
    # Create a lazy iterator of non-null values
    filtered_vals = filter(partial(is_not, None), my_vals)

    # Advance the iterator and get the first non-null value
    return next(filtered_vals, default)

What are some of the advantages of doing it this way?

  • filter returns a lazy iterator, meaning it only evaluates the items it needs to. Since we are using the next function, values are loaded lazily (demonstrated after this list).
  • The partial function binds None as the first argument of is_not, so each element is checked with the fastest Python evaluation: x is not None. If we instead used a truthiness check like [x for x in my_list if x], then 0’s and empty strings would be excluded.
  • The next function gets the next item from the iterator. Memory doesn’t explode, since we only materialize 1 value at a time. The default is set explicitly; otherwise, next would raise a StopIteration once the iterator is exhausted.
  • The declarative nature allows for vectorization 🚀 (and compilation enhancements).
  • It also allows for just-in-time compilation, if we want to extend for further optimization.
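
To see the laziness for yourself, here is a minimal sketch using an infinite iterator (itertools.count), which an eager list comprehension could never finish processing:

from itertools import count
from functools import partial
from operator import is_not

endless = count()  # yields 0, 1, 2, ... forever

# filter + next only consume elements until the first match
first = next(filter(partial(is_not, None), endless), None)
assert first == 0

# Only one element was consumed; the iterator is still live
assert next(endless) == 1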

Glad I got your interest! I’ll explain in detail another time. Meanwhile, you can read a little bit about it here: Vectorization: A Key Tool To Improve Performance On Modern CPUs
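
Don’t take my word on the relative performance, though. Here’s a quick benchmark sketch using the three functions defined above (absolute numbers will vary by machine and Python version):

import timeit

data = [None] * 10 + list(range(1_000_000))

# The generator and loop versions stop after 11 elements;
# the list comprehension scans (and copies) the entire list,
# so expect it to take noticeably longer.
for fn in ("get_first_non_null_generator",
           "get_first_non_null_loop",
           "get_first_non_null_list_comp"):
    seconds = timeit.timeit(f"{fn}(data)", globals=globals(), number=100)
    print(fn, round(seconds, 4))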

IV. EXTENDING OUR SOLUTION

Getting the first non-empty value from a dictionary.

Getting the first item of a list is somewhat straightforward. But what about getting the first non-empty value from a dictionary, based on a set of keys?

For example, take the following document:

{
    "key": {
        "field_1": "one",
        "field_2": "two"
    }
}

Say you want the value of field_1; if it doesn’t exist, get the value of field_2; and if that also doesn’t exist, return a default value.

Since our 3rd solution, get_first_non_null_generator(), takes in any iterable, we can create a mapper which binds our document to lookup keys and use it in our function like so:

my_doc = {
    "field_1": "one",
    "field_2": "two"
}

# Get the first non-empty value from the dictionary:
res = get_first_non_null_generator(
    map(my_doc.get, ("field_1", "field_2"))
)

# We should get the first non-empty value
assert res == "one"
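
The same pattern reaches into the nested document from earlier. A minimal sketch, assuming the outer "key" is present:

doc = {
    "key": {
        "field_1": "one",
        "field_2": "two"
    }
}

# Point the mapper at the inner dictionary
res = get_first_non_null_generator(
    map(doc["key"].get, ("field_1", "field_2"))
)
assert res == "one"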

Here’s a slightly longer example (which more closely resembles the use case I had for writing this code):

# A dict of fields with default and example values
my_dict = {
    "name": {
        "example": "Willy Wonka"
    },
    "country": {
        "default": "USA",
        "example": "Wonka-land"
    },
    "n_wonka_bars": {
        "default": 0,
        "example": 11
    },
    "has_golden_ticket": {
        "default": False
    },
    "is_an_oompa_loompa": {
        "description": "Is this person an Oompa Loompa?"
    }
}

# Now I want to build an example record from the example/default vals:
expected_result = {
    "name": "Willy Wonka",
    "country": "Wonka-land",
    "n_wonka_bars": 11,
    "has_golden_ticket": False,
    "is_an_oompa_loompa": None
}

# Iterate through the fields, though if we wanted to
# get crazy, we could compress this to a single line (shown below)
example_record = {}
for key, value in my_dict.items():
    # We want "example" before "default", if any
    example_record[key] = get_first_non_null_generator(
        map(value.get, ("example", "default"))
    )

# We should get the expected result from above
assert example_record == expected_result
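
For the curious, here’s that single-line version: a dict comprehension doing exactly what the loop above does:

example_record = {
    key: get_first_non_null_generator(map(value.get, ("example", "default")))
    for key, value in my_dict.items()
}

assert example_record == expected_result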

Here’s a really sophisticated use case: accessing class attributes using attribute getters and mappers:

from typing import Any, Optional
from operator import attrgetter


class FieldAttributes:
    """
    Field attributes.
    We will want to access these dynamically.
    """
    example: Any
    default: Any
    description: Optional[str]

    def __init__(self, example=None, default=None, description=None):
        self.example = example
        self.default = default
        self.description = description


class Field(FieldAttributes):
    """Class representing a field"""
    name: str
    attrs: FieldAttributes

    def __init__(self, name, **kwargs):
        self.name = name
        self.attrs = FieldAttributes(**kwargs)


class UserData:
    """Class representing our user data"""

    name = Field("name", example="Willy Wonka")
    country = Field("country", default="USA", example="Wonka-land")
    n_wonka_bars = Field("n_wonka_bars", default=0, example=11)
    has_golden_ticket = Field("has_golden_ticket", default=False)
    is_an_oompa_loompa = Field(
        "is_an_oompa_loompa",
        description="Is this person an Oompa Loompa?"
    )

    # Access all the fields here
    fields = (
        name,
        country,
        n_wonka_bars,
        has_golden_ticket,
        is_an_oompa_loompa
    )

# ------------------------------------------------

# We could compress it all down to something even tighter:
example_record = {
    k.name: get_first_non_null_generator(
        map(
            k.attrs.__getattribute__,
            ("example", "default")
        )
    )
    for k in UserData.fields
}

assert example_record == expected_result

"""
If we were concerned with high-performance (at the expense
of readibility), we could compress everything further
into a single context – which could translate
neatly within a vectorized library. But this is way overkill
"""
example_record = dict(
zip(
map(attrgetter('name'), UserData.fields),
map(
get_first_non_null_generator,
map(
attrgetter("attrs.example", "attrs.default"),
UserData.fields
)
)
)
)
assert example_record == expected_result

Very often, code should solve both current problems and future ones. My goal in writing these advanced (and somewhat weird) cases is to get you thinking about the possible uses of the code you are writing. Is vectorization a future concern? Is human readability? Will you be interfacing with a weird thingy that will need to be wildly extended? Or do you simply need to return an element from a list?

Consider as many of these factors as you can when writing your code upfront, and document your assumptions. (It’s okay to take shortcuts! As long as you tell your future self in the documentation.)

V. THINKING ABOUT SOLUTIONS

A big part of being a senior developer is how you think about problems. Most problems data teams (and software teams) face are a combination of technical and organizational concerns.

To demonstrate: in our case, we are writing code to find the first non-null value in a list. But over time, our solution will be used by other teams in ways we didn’t design for. For example, someone may try to find the first non-null value in a dictionary, given a list of keys. This isn’t necessarily a bad thing: it is inevitable that developers will use your code in ways you didn’t anticipate when first writing it.

Without intervention, the complexity of a codebase is guaranteed to increase over time. If the complexity gets too severe, you will end up with a big ball of mud. Knowing this, how can you protect the future state of your codebase?

If our organization were small, we could leave a comment in the code saying: # This code only works with flat lists. Contact YBressler if you have problems.

In other words, decouple the technical and organizational concerns and solve for each separately. (Technical = write code. Organizational = leave a comment.)

Honestly, this is a great solution if your team is small. But once an organization reaches a certain size, or people leave the org, this solution becomes problematic.

A better solution takes the concerns of the software development lifecycle into consideration. This usually means ensuring that your code is straightforward, easy to test, and performant. Such code allows future developers to refactor with ease, enabling them to repurpose and extend your original solution for later needs without increasing the complexity of the codebase.

In other words, our code should solve both the technical and organizational concerns. Technical = the code works. Organizational = the code is easy to understand and can be refactored with ease.
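
To make the “easy to test” part concrete, here’s a minimal sketch using pytest (the module name first_non_null is hypothetical; adjust the import to wherever the function actually lives):

import pytest
from first_non_null import get_first_non_null_generator  # hypothetical module

@pytest.mark.parametrize("values, expected", [
    ([1, 2, 3], 1),
    ([None, 0, 3], 0),    # 0 is a valid value, not a null
    ([None, "", 1], ""),  # so is the empty string
    ([None, None], None),
    ([], None),
])
def test_get_first_non_null(values, expected):
    assert get_first_non_null_generator(values) == expected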

In our code examples, I am certainly concerned with the performance of a solution. But I am equally concerned with how future developers will interact with it.

If a code solution is not easy to understand (i.e. it has a high degree of complexity), people will be afraid to make changes to it. They’ll either use it in an even more complex way, or write more code around it, both of which further increase a codebase’s complexity.

VI. CONCLUSION:

In conclusion, senior data engineers [try to] write code which is easy to understand and high-performance, but most importantly, which solves future problems by reducing a codebase’s complexity.

As demonstrated above, the final solution get_first_non_null_generator() is clever, easy to read, and performant. Most importantly, it aims to reduce complexity in a codebase.

  1. Farley, D. (2022). Modern Software Engineering: Doing What Works to Build Better Software Faster (p. 128). Addison-Wesley.

