On the usefulness of topological sort

The topological algorithm is very useful, yet it is quite easy to understand and implement. I've implemented it quite frequently and want to share some of the examples where it was useful for solving practical tasks.

Topological sort

Topological sort can be defined as follows.

First, we have a graph G(V, E) that consists of a set of vertices V and a set of edges E.

For the purposes of the topological sort, an edge A→B models a dependency of vertex A on vertex B. That is, we want the processing of vertex A to happen only after the processing of B is completed. The topological sort of the graph is the permutation of its vertices (i.e., every vertex must be used only) such that a vertex is listed after all its dependencies.

For a given graph, the topological sort might not exist (if and only if there are cyclical dependencies) or there might be many topological sorts (as is usually the case).

fig1

There are two possible topological sorts for this graph:

  • 4, 5, 3, 1, 2
  • 5, 4, 3, 2, 1

I won't recite the algorithm here, but here are some links that have detailed descriptions of several algorithms for topological sorting:

Uses of topological sort

Whenever we have a task that requires ordering items or actions based on some criteria, topological sorting can be useful. Here are some cases where I have used it in real-world projects.

  1. Verifying the correctness of workflows in business process management software.

    I've been making a workflow system in which users would edit their workflows via the GUI.

    Workflows consist of several steps. Some of the steps can be performed in parallel while some are constrained to be performed one after another. When a user finishes editing a workflow, it has to be verified that all of the steps could be completed in some order. Topological sort of workflow steps does exactly that - if topological sort exists then the workflow is valid.

  2. Output ordering for code generation.

    The order of definitions and statements in generated code oftentimes matters.

    For example, to produce an SQL code to create tables, indices, and constraints, we have to ensure that:

    • an index is created only after the corresponding tables and columns
    • a foreign key constraint is created only after the corresponding tables, columns, and unique constraints

    Another example is the bundling of javascript code. Before the introduction of ES6 classes and modules, we had to rely on libraries to emulate classes. Class definitions using those libraries were inherently order-dependent: class could only be defined after its superclass was defined. This necessitates ordering bundled modules in the order of topological sorting.

  3. Code initialization order

    I was designing a modular system that loads and initializes plugins and modules. The initialization of a module could invoke components provided by another module and register its components, as well as do some side effects. This means that modules depend on each other, and the order of module initialization is specified by the dependencies of modules.

  4. Type inference in nested data queries

    In several BI systems, we have let users define custom data queries that could refer to other data queries. To infer the types of result data table columns, queries had to be processed in the order specified by their interdependencies.

  5. Database migrations

    I was designing a database migration framework for a larger framework for building web apps. It was designed for modularity: web apps are composed of modules (some are pre-built, some are custom-built) built by different teams.

    The framework was designed with one big shared database in mind, and each module was responsible for its own set of tables in the shared database. To manage changes to the database schema, we employed migrations but quickly ran into the problem of ordering the execution of migrations. The complexity of database schema management was exacerbated by the concurrency of development: different teams released versions of their modules on different schedules without too much coordination.

    Since all modules were in the same database, they executed DDL statements (such as CREATE TABLE or ALTER TABLE) or DML statements (such as INSERT INTO or UPDATE) that referred to the tables owned by other modules. Essentially, each migration is required to be executed before or after some other migrations (that created or dropped the corresponding tables or columns).

    By specifying this ordering in the migration itself, we could then find an appropriate order for database migrations.

Issues with topological sort

Some fine points must be considered when employing topological sort:

  • When a graph contains a cycle, the toposort algorithm just gives up. But in some cases, we want to process the graph anyway, even if some parts of it cannot be sorted

  • The topological sort definition is ambiguous. A graph may have multiple topological sorts. This issue comes in two dimensions:

    • Different algorithms may produce different order
    • Subtle changes in the graph may produce drastically different topological order