
How single-iteration InstCombine improves LLVM compile time

December 7, 2023
Nikita Popov
Related topics: Compilers, Rust
Related products: Red Hat Enterprise Linux

    For languages like Rust and C++, one of the main obstacles to developer productivity is long compilation times. Build times for large code bases can easily range from tens of minutes to hours, depending on the available hardware. The LLVM compiler toolchain, which provides the optimization and code-generation capabilities for many programming languages, is a case in point. Even on a 16-core machine, my standard build configuration takes about fifteen minutes for a non-incremental build. You can imagine how long this takes on a laptop.

    Figure 1 illustrates a comedic look at compiling time (courtesy of xkcd).

    Figure 1: What you do while compiling.

    Unfortunately, compilers suffer from the "flat profile" syndrome. There usually aren't any significant performance hotspots on which optimization efforts can be focused. Instead, time is spent across hundreds of different analysis and optimization passes.

    Even a large improvement in a single pass usually only translates into a small improvement in end-to-end compilation times. As such, compile-time changes (both improvements and regressions) are typically the result of accumulating many small changes on the order of 0.1%.

    This article covers one of the rare exceptions to this rule. A change to the InstCombine pass resulted in a ~4% end-to-end compile-time reduction on CTMark (Figure 2). This change will be part of LLVM 18.

    Figure 2: This graph shows compile-time change in percent over time for different compiler configurations. The big jump on the left is the InstCombine change.

    The instruction combining pass

    The instruction combining pass is responsible for performing peephole optimizations, where a small sequence of instructions is replaced with some other, hopefully better, sequence of instructions. For example, x / 2 could be replaced by x >> 1, or x + y == y by x == 0.
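    Peephole folds like the two above can be sketched as pattern-matching rewrites. The following is a toy illustration on a made-up tuple IR; the operation names and representation are illustrative, not LLVM's:

```python
# A toy peephole "fold" in the spirit of InstCombine, on a tiny tuple IR
# of the form (op, operand, operand). Illustrative only.

def fold(expr):
    """Return a simpler equivalent expression, or None if no fold applies."""
    op = expr[0]
    # x / 2  ->  x >> 1  (valid for unsigned x)
    if op == "div" and expr[2] == 2:
        return ("shr", expr[1], 1)
    # x + y == y  ->  x == 0  (valid for wrapping integer arithmetic)
    if op == "eq" and isinstance(expr[1], tuple) and expr[1][0] == "add" \
            and expr[1][2] == expr[2]:
        return ("eq", expr[1][1], 0)
    return None

print(fold(("div", "x", 2)))                 # ('shr', 'x', 1)
print(fold(("eq", ("add", "x", "y"), "y")))  # ('eq', 'x', 0)
```

    The real pass has hundreds of such rules, each guarded by correctness preconditions, as the next paragraph explains.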

    The InstCombine pass consists of hundreds (likely thousands) of folds of varying complexity. Some of them are as simple as the previous examples, while others have to inspect larger instruction graphs or prove non-trivial preconditions using analysis passes.

    InstCombine is the proverbial death by a thousand cuts. Each individual transform may be cheap, but if you add up hundreds of them, it comes as no surprise that InstCombine tends to be the single most expensive pass in the optimization pipeline. As new transforms get added to InstCombine on a daily basis, this is a situation that only gets worse with time. It is one of the ways in which compile-times keep slowly creeping upwards, and why continuous effort is required to push them back down.

    Worklists and fixpoints

    One way to improve InstCombine performance is to optimize individual transforms and the analyses they use. I have done my fair share of this kind of optimization over the years, which results in exactly the kind of small, incremental improvements previously mentioned. The other way is to improve the high-level algorithm, which is what I will talk about here. To that end, we should first discuss how InstCombine works on a technical level.

    The basic task of InstCombine is simple: visit each instruction in the function and apply some folds to it. The interesting part is what happens when a fold succeeds and produces new instructions. It's possible that the new instructions enable new folds. If we only perform a single pass over the function, we will miss such opportunities.

    InstCombine solves this problem by using the worklist algorithm. Initially, the worklist is populated with all instructions in the function. On each iteration, one instruction is removed from the worklist and visited. If any fold applies, the changed or newly created instructions are added back to the worklist, together with any instructions using them to be reprocessed in a future iteration.
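    The worklist loop can be sketched as follows. This is a minimal toy model, not LLVM code: `fold` returns a replacement instruction (or `None`), and `users` returns the instructions that use a given one; both stand in for InstCombine's real machinery.

```python
def run_worklist(instructions, fold, users):
    # A deduplicated stack: instructions are pushed in reverse program
    # order so that they are popped (visited) in program order.
    worklist = list(reversed(instructions))
    on_list = set(worklist)
    visited = []
    while worklist:
        inst = worklist.pop()
        on_list.discard(inst)
        visited.append(inst)
        replacement = fold(inst)
        if replacement is not None:
            # Reprocess the new instruction and everything using the old one.
            for dep in [replacement, *users(inst)]:
                if dep not in on_list:
                    worklist.append(dep)
                    on_list.add(dep)
    return visited

# Example: "b" folds to "b2", and "c" uses "b".
fold = lambda inst: {"b": "b2"}.get(inst)
users = lambda inst: ["c"] if inst == "b" else []
print(run_worklist(["a", "b", "c"], fold, users))  # ['a', 'b', 'b2', 'c']
```

    Note how "c" is visited only once here because it was still on the worklist when "b" folded; had it already been visited, it would have been pushed again.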

    In an ideal world, this approach would be sufficient to find and exploit all optimization opportunities. In practice, this does not quite work out due to transforms that do something unusual or fail to manage the worklist correctly. We'll discuss examples in the following sections. To mitigate these cases, InstCombine combined the worklist algorithm with an additional fixpoint iteration. After folding all instructions with the worklist algorithm, it would check whether any changes have been made. If yes, it would start again from scratch, folding all instructions a second time. This would continue until no more changes were made. That is, until a fixpoint was reached.

    This approach is quite wasteful. In the common case, InstCombine would first perform one iteration that does useful work, and then a second iteration that serves no purpose other than verifying that a fixpoint has indeed been reached. The compile-time improvement comes down to removing this fixpoint iteration and making InstCombine only perform a single iteration. This is easier said than done because we first have to fix the various ways in which the worklist algorithm may fail to reach a fixpoint in one iteration.

    Convergence failures

    Fixing failures to reach a fixpoint involved many dozens of individual changes, so I will only discuss some of the more important ones here. For the most part, these come down to incorrect worklist management.

    IRBuilder polymorphism

    All instructions that are newly created by InstCombine need to be added back to the worklist for reprocessing. For this purpose, InstCombine uses a special IRBuilder, which will automatically add new instructions to the worklist.

    As long as all instructions are created through this IRBuilder, everything is fine. However, InstCombine sometimes calls into other transformation utilities such as SimplifyLibCalls, which historically could not use InstCombine's IRBuilder, because it was templated over InstCombine implementation details. For this reason, I migrated IRBuilder from using templates to virtual methods, so that there is a common base class that can be freely shared across passes. This also had various benefits outside this work.

    Deferred worklist

    Even if all instructions get added to the worklist, we have to be careful about their order. If an instruction C depends on instructions A and B, we want A and B to be simplified before C. Ignoring the cross-block case, this can be achieved by visiting instructions in program order. On a technical level, the worklist is actually a deduplicated stack, from which the top element is popped on each iteration. Initially, all instructions are pushed to this stack in reverse program order, so that they will be popped in program order.

    If a transform creates new instructions A, B and C, these used to be pushed to the worklist stack in that order, which means they were popped in the order C, B, A. That's the exact opposite of what we want. To fix this, an additional deferred worklist was introduced, into which newly created instructions are initially pushed. Before visiting the next instruction, these will be popped from the deferred worklist and pushed to the normal worklist, reversing them in the process and establishing the desired order.
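    The double reversal is easy to see with plain Python lists standing in for the two worklists (a toy sketch, not the actual data structures):

```python
# Newly created instructions go onto a deferred list first; draining that
# list onto the main stack reverses them once, and popping the main stack
# reverses them again, restoring creation order A, B, C.

main_stack = []
deferred = []

# A transform creates three new instructions, in creation order.
for new_inst in ["A", "B", "C"]:
    deferred.append(new_inst)

# Before visiting the next instruction, drain the deferred worklist.
while deferred:
    main_stack.append(deferred.pop())

pop_order = [main_stack.pop() for _ in range(len(main_stack))]
print(pop_order)  # ['A', 'B', 'C']
```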

    The deferred worklist also serves the additional purpose of removing dead instructions earlier. Unused instructions without side-effects will be removed when processing the deferred worklist, so that dead code is no longer present when the next instruction is visited. This is beneficial, because many transforms are use-count sensitive.

    In particular, it is common that transforms have a one-use restriction. For example, InstCombine will convert (A * B) + (C * B) into (A + C) * B. But what happens if the expressions (A * B) and (C * B) are also used somewhere else and cannot be removed? In that case, this transform would not save one multiply, but instead add one. The transform is only profitable under a one-use restriction. As such, we should also revisit users if a transform makes the use count of an instruction drop to one.
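    A back-of-the-envelope operation count shows why the fold needs the one-use restriction. The toy cost model below simply counts multiply nodes before and after the (A * B) + (C * B) to (A + C) * B rewrite; it is an illustration, not how InstCombine reasons internally:

```python
def cost(extra_uses_of_products):
    # Before the fold: two multiplies and one add.
    before = {"mul": 2, "add": 1}
    # After the fold: one multiply and one add for the new expression...
    after = {"mul": 1, "add": 1}
    if extra_uses_of_products:
        # ...but if (A*B) and (C*B) are also used elsewhere, they must be
        # kept, so the fold adds a multiply instead of removing one.
        after["mul"] += 2
    return before, after

before, after = cost(extra_uses_of_products=False)
print(after["mul"] - before["mul"])  # -1: the fold saves one multiply

before, after = cost(extra_uses_of_products=True)
print(after["mul"] - before["mul"])  # +1: the fold adds one multiply
```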

    Reverse post order

    One case I brushed over is cross-block dependencies. There is no well-defined static notion of program order if the function contains control flow. Consider a simple if/else structure like this:

    A;
    if (...) {
        B;
    } else {
        C;
    }
    D;

    Historically, InstCombine visited blocks using a depth-first search. That is, it would either use the order A B D C or A C D B. This is bad, because it violates our goal of making sure that folds always see already simplified operands. If D depends on values from B and C, then it will see unsimplified values in one of the blocks.

    The solution is to use reverse post-order (RPO) instead, which results in the order A B C D or A C B D. RPO guarantees that predecessors are visited before successors, except in the presence of loop backedges, where this is fundamentally impossible. The downside is that RPO is expensive to compute, making this a small compile-time sacrifice in pursuit of the greater good.
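    RPO is the reverse of a depth-first post-order. A minimal sketch for the if/else CFG above (assuming an acyclic graph for simplicity, so backedges are ignored):

```python
def rpo(cfg, entry):
    """Reverse post-order: record a block after all successors, then reverse."""
    post, seen = [], set()
    def dfs(block):
        seen.add(block)
        for succ in cfg[block]:
            if succ not in seen:
                dfs(succ)
        post.append(block)
    dfs(entry)
    return list(reversed(post))

# A -> {B, C} -> D, matching the if/else example above.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(rpo(cfg, "A"))  # ['A', 'C', 'B', 'D'] -- both B and C precede D
```

    A plain depth-first pre-order over the same graph would yield A B D C, visiting D before one of its predecessors, which is exactly the problem described above.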

    Unreachable code elimination

    As a result of instruction combining, we may determine that the argument to some conditional branches is either true or false. In this case we could, in theory, remove the dead edge from the control flow graph, as well as any resulting unreachable code.

    InstCombine is not actually allowed to do this, because it is a control-flow preserving pass. This limitation allows it to preserve certain expensive analysis passes, such as the dominator tree. Other passes like SimplifyCFG are responsible for cleaning up the control-flow later.

    However, removing unreachable code can often expose new optimization opportunities in InstCombine, for example, by reducing use-counts. For that reason, we make a compromise, where InstCombine will not modify the control-flow graph but still remove all the instructions from unreachable blocks.

    Historically, this was done during initial worklist population, which means it only happened on the next InstCombine iteration. What we now do instead is to keep track of all control-flow edges that are known to be dead and add new ones if a branch condition turns into a constant. When this renders new code unreachable, it is directly removed as part of visiting the branch instruction.
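    Conceptually, tracking dead edges means computing reachability over the CFG while skipping those edges; the sketch below illustrates the idea on the earlier if/else graph (a toy model, not the actual implementation):

```python
def reachable(cfg, entry, dead_edges):
    """Blocks reachable from entry via edges not known to be dead."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        for succ in cfg[block]:
            if (block, succ) not in dead_edges:
                stack.append(succ)
    return seen

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# The branch in A is found to always go to B, so the A->C edge is dead;
# instructions in C can be deleted even though the CFG itself is untouched.
print(sorted(reachable(cfg, "A", {("A", "C")})))  # ['A', 'B', 'D']
```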

    The long tail

    The previous changes are some of the more high level ones. Many others were fixes to individual transforms. InstCombine provides a number of utilities to perform certain operations (like replacing or removing instructions) in a way that correctly manages the worklist. However, quite a few transforms failed to use these (especially the older ones), resulting in failures to reach the fixpoint.

    Another common issue was transforms that don't just look at instruction operands, but also at adjacent instructions in program order. For transforms like these, it is important to scan instructions in the correct direction. For example, when trying to eliminate dead stores, it's necessary to walk upwards to find a store to the same pointer, instead of downwards. That's because the preceding instructions have already been simplified, while the following have not.
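    The upward scan can be sketched like this, on a toy block of (op, pointer, value) tuples; the representation and helper are illustrative only:

```python
def find_dead_store(block, index):
    """Given a store at `index`, walk upwards to find an earlier store to
    the same pointer that it makes dead; return its index, or None."""
    op, ptr, _ = block[index]
    assert op == "store"
    for i in range(index - 1, -1, -1):
        prev_op, prev_ptr, _ = block[i]
        if prev_op == "load" and prev_ptr == ptr:
            return None  # the earlier store is observed, so it is not dead
        if prev_op == "store" and prev_ptr == ptr:
            return i     # overwritten without being read: dead
    return None

block = [
    ("store", "p", 1),    # index 0: dead, overwritten by index 2
    ("load",  "q", None),
    ("store", "p", 2),    # index 2
]
print(find_dead_store(block, 2))  # 0
```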

    Removing the fixpoint iteration

    There will always be cases where InstCombine fails to reach a fixpoint in one iteration. However, at some point they become so rare that paying the compile-time cost of performing the fixpoint iteration is no longer worthwhile. Before switching InstCombine to single-iteration mode, I collected some statistics on how many iterations it performs on llvm-test-suite. The result was that in only 0.04% of cases was an additional iteration needed to reach a fixpoint. Conversely, in 22.3% of cases, InstCombine performed an extra iteration that did no useful work and only verified that the fixpoint was reached.

    Paying 4% of total compile-time to only see a difference in 0.04% of cases is clearly a bad tradeoff, especially when taking into account that InstCombine runs at least 8 times during the optimization pipeline, which means that something not caught by one run can still be handled by a later one.

    An important consideration here is how to avoid backsliding. If we only perform one iteration, it would be easy to introduce new cases that fail to reach the fixpoint without anyone noticing. We prevent this by enabling fixpoint-verification in our tests, which runs two iterations and checks that the second one made no changes. The verification is disabled for normal compilations.
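    The verification mode amounts to running one extra iteration and asserting it changes nothing. A toy sketch of the idea, with a made-up `iterate` function that fully simplifies each element in one sweep:

```python
def run_pass(ir, one_iteration, verify_fixpoint=False):
    """`one_iteration` returns (new_ir, changed)."""
    ir, _ = one_iteration(ir)
    if verify_fixpoint:
        # Test-only mode: the second iteration must be a no-op.
        ir, changed = one_iteration(ir)
        assert not changed, "did not reach a fixpoint in one iteration"
    return ir

def iterate(ir):
    # Toy "fold": strip all factors of two from each element. Because each
    # element is fully simplified when visited, one sweep reaches a fixpoint.
    def simplify(x):
        while x > 0 and x % 2 == 0:
            x //= 2
        return x
    new = [simplify(x) for x in ir]
    return new, new != ir

print(run_pass([8, 3, 6], iterate, verify_fixpoint=True))  # [1, 3, 3]
```

    Had `iterate` halved each even number only once, the assertion would fire; that is precisely the kind of backsliding the verification mode is meant to catch.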

    Lessons learned

    The basic idea behind this change is simple: don't run a transform in a loop. There is nothing groundbreaking here. Making optimization algorithms sparse (that is, reprocessing only what is strictly necessary instead of whole functions) is a core tenet of optimizing compilers. The choice to add fixpoint iteration to InstCombine was made 15 years ago, when InstCombine was much simpler and smaller. Undoing that choice took a good bit of effort over the course of three years. This shows the importance of engineering optimization passes for compile-time efficiency from the start and tracking compile-time changes over time. It is much more expensive to fix mistakes decades later than to avoid introducing them in the first place.
