Using Generators to Simplify Recursion

Introduction

Complex recursion can get tricky. I have recently been working with trees in Python, and I came across a nice pattern for structuring recursive code that I thought I would share.

When working with traversal problems we can often break them into two parts:

  1. Traversal - Recursively producing new nodes to explore
  2. Consumption - Performing work on the produced nodes

Commonly you'd write a single recursive function that consumes the data while traversing the tree, but this approach has several drawbacks.

Basic Example

As an example, let's look at a basic traversal of a filesystem.

from typing import Optional

def get_large_files(
    file_system: dict,
    min_size_kb: int,
    exclude_dirs: set[str],
    current_path: str = "",
    results: Optional[list[str]] = None,
):
    """
    Traverses the tree while applying exclusion rules.
    Finds large files and accumulates them in list.
    """
    if results is None:
        results = []

    for name, content in file_system.items():
        full_path = f"{current_path}/{name}" if current_path else name

        if name in exclude_dirs:
            continue

        # if content is a folder recursively call get_large_files        
        if isinstance(content, dict):
            get_large_files(
                content, min_size_kb, exclude_dirs, full_path, results
            )
        # otherwise if file > size, add to results
        elif content > min_size_kb:
            results.append(full_path)

    return results

This function recursively walks the dictionary, collecting the paths of files over a given size into a list, which it then returns.

Example Usage

exclude_dirs = {".git", "node_modules"}
file_system = {
    "src": {
        "components": {"header.tsx": 5, "footer.tsx": 3},
        "utils": {"helpers.ts": 12, "validators.ts": 8},
        "legacy": {"old_parser.py": 250}
    },
    "node_modules": {    
        "react": {"index.js": 500}, 
        "lodash": {"index.js": 150}
    },
    ".git": { 
        "HEAD": 1
    },
    "build": {
        "bundle.js": 1200,
        "bundle.map": 2500
    },
    "README.md": 2
}
large_files = get_large_files(file_system, 100, exclude_dirs)
for f in large_files:
    print(f)
# Output:
# src/legacy/old_parser.py
# build/bundle.js
# build/bundle.map

Issues

This function works, but there are a few things we could improve:

  1. Our function violates the first principle of the Unix philosophy:
    • Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.
  2. Our function is not modular. If we wanted a new function to get the files below a certain size, or a list of all the files, none of this logic is reusable (a side effect of 1).
  3. It is a common mistake to write a mutable default such as results=[], which causes subtle bugs because the same list is shared across every call.
  4. Debugging issues in the consumption logic is complicated by its coupling with the recursive traversal.
  5. We always walk the entire tree, even if we only want the first few results.
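Issue 3 is easy to reproduce in isolation. A minimal demonstration of the shared-default pitfall (the function name collect is just illustrative):

```python
def collect(item, results=[]):  # BUG: the default list is created once, at definition time
    results.append(item)
    return results

first = collect("a")
second = collect("b")
print(second)          # ['a', 'b'] - both calls appended to the same shared list
print(first is second) # True
```

This is why the original function uses the results=None sentinel and creates a fresh list inside the body.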

Separation of Concerns

To improve this function, we can separate the consumption logic from the traversal logic.

We'll first write the traversal logic using the yield expression and the yield from statement.

Each yield temporarily suspends processing, remembering the execution state (including local variables and pending try-statements). When the generator iterator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation). — Python documentation on yield expressions
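A tiny generator makes that suspended state visible:

```python
def countdown(n):
    while n > 0:
        yield n  # execution pauses here; n and the loop position are remembered
        n -= 1

gen = countdown(3)
print(next(gen))  # 3
print(next(gen))  # 2 - resumed right after the yield, with the updated n
```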

def walk_files(file_system: dict, exclude_dirs: set[str], current_path: str = ""):
    for name, content in file_system.items():
        full_path = f"{current_path}/{name}" if current_path else name

        if name in exclude_dirs:
            continue

        if isinstance(content, dict):
            yield from walk_files(content, exclude_dirs, full_path)
        else:
            yield full_path, content

The yield from statement is syntactic sugar that delegates part of a generator's operations to a sub-generator, yielding every value the sub-generator (or any iterable) produces, e.g.

# these are equivalent:
yield from walk_files(content, exclude_dirs, full_path)

for file_path, size in walk_files(content, exclude_dirs, full_path):
    yield file_path, size

Now we can simply re-write our old function to use the generator.

def get_large_files(file_system: dict, exclude_dirs: set[str], min_size: int):
    walker = walk_files(file_system, exclude_dirs)
    return [
        file_path
        for file_path, size in walker
        if size > min_size
    ]

Benefits

Now if we compare the two programs, the new version is much easier to read due to the separation of concerns, and we can easily create new functions that utilize the same traversal logic.

For example, to get small files:

def get_small_files(file_system: dict, exclude_dirs: set[str], max_size: int):
    walker = walk_files(file_system, exclude_dirs)
    return [
        file_path
        for file_path, size in walker 
        if size <= max_size
    ]

large_files = get_large_files(file_system, exclude_dirs, 100)
small_files = get_small_files(file_system, exclude_dirs, 10)

Note: We could also write generator functions to create lazy versions of these if we want to process files one at a time without building a full list.

def generate_small_files(file_system: dict, exclude_dirs: set[str], max_size: int):
    walker = walk_files(file_system, exclude_dirs)
    for file_path, size in walker:
        if size <= max_size:
            yield file_path

small_file_gen = generate_small_files(file_system, exclude_dirs, 10)
for i, small_file in enumerate(small_file_gen):
    # only process first 5 small files lazily
    if i >= 5:
        break
    print(small_file)

# and we can still create a full list if needed
small_files = list(generate_small_files(file_system, exclude_dirs, 10))
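As an aside, the manual counter-and-break above can be replaced with itertools.islice, which stops pulling from the underlying generator once it has enough items. Using an infinite generator as a stand-in for the walker:

```python
from itertools import islice

def naturals():
    """An endless generator - a stand-in for any lazy walker."""
    n = 0
    while True:
        yield n
        n += 1

# islice advances the generator only five times; the rest is never computed
first_five = list(islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```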

This pattern separates the traversal logic from the consumption logic, allowing us to create multiple functions that utilize the same traversal.

This may seem like overkill for simple traversals, but when handling many types of complex recursions, this pattern becomes extremely convenient.

Tree Example

If we need to implement an algorithm that requires both post-order and pre-order traversals of a tree, like the Reingold-Tilford tree-drawing algorithm, we can create two generators.

from typing import Any

class TreeNode:
    def __init__(self, value: Any):
        self.value = value
        self.children: list["TreeNode"] = []
    
    def pre_order(self):
        yield self.value
        for child in self.children:
            yield from child.pre_order()

    def post_order(self):
        for child in self.children:
            yield from child.post_order()
        yield self.value

Here's how the traversals work:

root = TreeNode(1)
child_a = TreeNode(2)
child_b = TreeNode(3)
child_c = TreeNode(4)

root.children.extend([child_a, child_b])
child_a.children.append(child_c)
#      Tree
#        1
#      /   \
#     2     3
#    /
#   4

print("Pre-order Traversal:")
for value in root.pre_order():
    # do work in pre-order
    print(value)
# 1
# 2
# 4
# 3
print("Post-order Traversal:")
for value in root.post_order():
    # do work in post-order
    print(value)
# 4
# 2
# 3
# 1

Now, to create a function that requires both a pre-order and a post-order traversal, we don't have to write one complex recursive function; we can simply combine both generators.

def combined_traversal(root: TreeNode):
    for value in root.post_order():
        print(f"Post-order visit: {value}")
        # initialize values during first walk
    for value in root.pre_order():
        print(f"Pre-order visit: {value}")
        # modify final values during second walk
combined_traversal(root)
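The details of Reingold-Tilford are beyond the scope of this post, but here is a self-contained sketch of the two-pass idea. The LayoutNode class and leaf_counts function are illustrative names of my own, not part of the algorithm: a node-yielding post-order generator computes a bottom-up quantity (the number of leaves under each node), which a pre-order pass could then consume to assign coordinates.

```python
from typing import Any, Iterator

class LayoutNode:
    """Sketch: a tree node whose traversal yields nodes rather than values."""
    def __init__(self, value: Any):
        self.value = value
        self.children: list["LayoutNode"] = []

    def post_order_nodes(self) -> Iterator["LayoutNode"]:
        for child in self.children:
            yield from child.post_order_nodes()
        yield self

def leaf_counts(root: LayoutNode) -> dict[int, int]:
    """Post-order guarantees children are visited before their parent,
    so each node's count is built from already-computed child counts."""
    counts: dict[int, int] = {}
    for node in root.post_order_nodes():
        if not node.children:
            counts[id(node)] = 1
        else:
            counts[id(node)] = sum(counts[id(c)] for c in node.children)
    return counts

# same shape as the tree above: 1 -> (2 -> 4, 3)
demo_root, n2, n3, n4 = LayoutNode(1), LayoutNode(2), LayoutNode(3), LayoutNode(4)
demo_root.children.extend([n2, n3])
n2.children.append(n4)
print(leaf_counts(demo_root)[id(demo_root)])  # 2 - the leaves are nodes 4 and 3
```

The consumption logic (counting leaves) never mentions recursion at all; the generator handles the traversal order.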