I stumbled on a difficult realization this week: a carefully engineered algorithm is not necessarily better than a very simple code.
One of the critical steps of my program used to rely on a quick and dirty heuristic – a hack relying on a very simple local search. Since it is an important optimization pass, I tried to adapt a mathematically clean and efficient algorithm, (based on minimum cost flows, which appear everywhere). It can optimize on a larger scale, and I hoped to get much better results.
Unsurprisingly, it turns out that it was a lost of time: for my datasets, the runtime of the quick heuristic is much lower to obtain the same results. Each pass is so much faster that its suboptimality doesn’t make a difference.
What is more disturbing is that same-quality results are not equivalent for the other optimization passes: when evaluating related cost functions, the clean algorithm tends to do much worse. I still don’t know if it is due to some kind of overfitting or to a specific feature of the heuristic, but I’d very much like to know what it is about.
There is a striking similarity between VLSI placement and scheduling. Digital electronics circuits are built from “standard cells”, logic gates with a unique height. They are organized in rows, so that the power wires run between two rows of standard cells.
Tasks must be scheduled on machines so that they do not conflict, standard cells must be placed in rows so that they do not overlap. The only difference between scheduling tasks and placing logic gates is the cost function, generally more complex for placement than for scheduling.
Placement problems have never been formulated as scheduling problems: it is no surprise that there has been few communication between the fields, the journals and laboratories being separate and the goals completely different. But it seems to me that we can solve some placement subproblems using scheduling techniques.
Scheduling is the problem of assigning tasks to machines. They are constrained by their execution times, with independent costs for each task.
In standard cell placement, the cost is usually much more complicated: the wirelength involves distances between tasks, and cannot be expressed a a sum of independent task costs. However, for a given ordering of the cells in the rows, it is possible to find the corresponding optimal positions by using linear programming (more precisely, network flow). This gives us a simple optimization pass to use during placement but it is still a bit expensive.
Therefore, for standard cell placement as well as scheduling, researchers have considered the simpler problem where there is only one row or one machine, still with fixed ordering. The cost function is a sum of independent piecewise linear functions, and hopefully there are faster algorithms to solve the problem.
Fast algorithms for the ordered single row problem
Indeed, this problem is much easier to solve in practice. The clumping algorithm solves this problem in worst case quadratic time, and is usually much faster. The idea behind it is to add the cells at the end of the row and push it to the next pin until its placement is optimal: the final placement is optimal due to the convexity of the cost function.
This quadratic worst case complexity is annoying: several publications managed to improve this runtime up to m log² m where m is the number of pins. However, the datastructures involved are a special type of balanced trees, both slow and difficult to code.
I discovered a simpler m log n algorithm based on a priority queue, but by chance I found out that it is already known for scheduling, as the specialized cascading descent algorithm. The cascading descent is equivalent to the clumping algorithm. The basic idea is to index the pins with a value that does not depend on the cell: this way, merging groups of cells takes constant time. The algorithm uses a single priority queue to hold the pins, hence its better complexity.
Applications to VLSI placement
It isn’t a breakthrough: even naive linear programming is fast enough to optimize a large part of a circuit, so that this improvement might be barely noticeable compared with approximate or quadratic algorithms.
However, there are two extensions that may prove particularly useful: local search and non-convex optimization.
Local search is the process of modifying the placement through simple modifications, like cell swapping. One of my goals is to modify the algorithm in order to support online placement modification with logarithmic complexity. That is, keep an optimal placement through small ordering modifications.
Non-convex optimization is trickier. There can be several local optimas, which makes most such problems extremely difficult. For the special structure of the ordered single row problem, however, there are quadratic algorithms. It means that we can integrate more complex cost functions directly in the ordered single row optimization, for example cell orientation.
Both ideas are relatively new: cell swapping, orientation and position optimization are almost always separate steps (although it is possible, even integer programming models in the litterature do not integrate all of them). I wonder how these methods will play compared to independent optimization passes.
Floorplanning is the process of placing a set of rectangular blocks on a chip. Huge blocks, like a RAM or a whole circuit module. As such, it is a very specific domain in placement: floorplanning usually handles less than 100 cells (to be compared with the ~100000 handled by other methods) but with entirely different algorithms.
Since the laboratory where I work has an important analog toolchain – for the design of circuits that are not purely digital, typically for wireless transmissions or sensors – I gave a try to algorithms to place the resulting transistors.
The constraints for analog circuit placement
I am not an analog designer myself, so I tried to gather informations about the needs in the analog world. The recurring answer was “they want to chose” between different possibilities, because the designers’ knowledge is not easily mimicked by an algorithm yet.
Providing this freedom is probably the hard part, but there were more satisfying answers for me too, that could lead directly to a toy implementation. Symmetry constraints, in order to mitigate process variations. Proximity constraints. Area minimization. Routing corridors. Those are easier to translate into an algorithm.
Moreover, analog circuits have some freedom during placement: the transistors are big, and it is possible to change their aspect ratio.
Integer programming for floorplanning
Most tools for floorplanning work on some kind of topological representation: block a is above block b, which is on the left of block c. This is usually limited to area optimization only, but linear programming can handle more complex situations.
In those situations, including deformable blocks and wirelength minimization, linear programming can yield a solution for the given topology. Rather than writing complex data structures to represent the topology, I included it in an integer programming model: it is simpler, requires less code, and ideally it would even prove the solution’s optimality.
With the tool I use (GLPK), I obtain what I consider to be good results. On a small real benchmark with 7 cells, it proves optimality relatively quickly. It is generally not able to prove optimality as soon as there are more than 10 cells, although local search heuristics are extremely efficient.
It remains to be seen whether such a floorplanner would be useful to an analog designer. It seems to me that obtaining a good enough placement quickly is important, but they would surely want to fine-tune it. For this trial-and-error process, tuning the model may be the way to go, giving block deformation and shifting for free whenever a cost or a constraint is modified.
This work gave me some insight in linear programming for digital detailed placement: I am experimenting with similar models to place standard cells in the main tool.
As I said in the previous post, Coloquinte will be built around an analytical placement algorithm alternating between convex placement optimization and cell spreading: it is one of the most efficient methods published so far.
My early prototypes worked, but the spreading algorithm was a bit too simple to cope with the benchmarks. More importantly, they lacked a clean architecture.
The new goal is to work well on a large class of circuits and the tool should be more modular. It is meant to interface loosely with Paris 6 university’s tools: the Coriolis toolchain from the Lip6 laboratory, where I work part-time from next week, and its database Hurricane.
It should be splitted between 3 independent blocks that transform the Hurricane representation: the global placer, the legalizer and the detailed placer. In order to keep control over the optimization passes applied, each tool is scriptable in Python. This is particularly important for the global placer, where everything is tunable from the spreading schedule to the net models.
The global placer
The global placer uses a technique known as quadratic placement where the wires are modelled as elastic springs: this yields a sparse linear system, which can be solved approximately in linearithmic time. It is not as artificial as it seems: it is in fact very close to Newton’s method for optimization.
This quadratic placement alternates with a spreading phase, whose result is fed back to “anchor” the next placement iteration. Everything is tunable: I intend to let the user chose the net model, the legalization algorithm and how strong the cells are pulled toward the legal position. Each of these components gets a Python interface so that the whole optimization is scriptable.
You may note that I include “Companion tools” that could interface with the global placer: such tools are necessary when optimizing for complex objectives. I build Coloquinte so that it can cope with other tools with minor modifications in case we want to extend it.
The detailed placer and legalizer need not as much controllability, but I am going to use the same architecture nonetheless: the more the designer can control, the better the result. Ideally, we would try multiple combinations of the optimization passes to obtain the best tradeoffs: they can be distributed as several Python scripts, possibly completely different.
Today, I begin to program the legalizer. Since this algorithm is completely new, I am probably not going to talk about it anytime soon, but I have a (partially) new one for detailed placement that I am going to present in the next post.
The choice between analytical placement and partitioning is a crucial and early one. For Coloquinte, however, I made this choice a long time ago.
There are two obvious reasons: analytical placement generally performs a little better and I know it better.
However, there are deeper reasons why I want to commit to using analytical placement. In my opinion, analytical placement yields better flexibility and modularity.
First, it is tunable with simple but numerous parameters. By changing the cost function, you can in theory optimize for timing and power. The spreading forces control the tradeoff between speed of convergence and solution quality.
With partitioning-based placement, you need to design whole new algorithms for power, timing and congestion optimization (which many papers did). With a complete analytical placer, you can test parameter tuning within minutes: change the cost function and the spreading schedule, then execute. And there are a lot of cost functions to chose from.
Another important property of some analytical placement algorithms is that they modify an existing placement at each step, not build it from scratch. For this reason, it is convenient to perform netlist or placement changes. Whether to interface it with other programs (performing cell resizing and buffering for example) or to enable engineering change orders (ECOs), it is a useful feature. I suppose it makes it easier to adapt the placer to new problems, with all the performance-critical code factored between the solver and the legalizer.
Therefore, I am going to use an analytical placement that continuously spreads the cells: if region constraints are used, circuit modifications are not as easy. Additionally, this method is the leading one on benchmarks: using it should improve both flexibility and efficiency.
That doesn’t mean writing a complete placement tool is easy: external tools are needed to analyze timing and congestion, but with analytical placement the interface may be very loose and quite modular. I am going to show Coloquinte’s interface in the next post.
The last post gave a quick introduction to the synthesis flow. Since it is the purpose of the tool, I am going to focus on placement algorithms here. Let’s leave all the other topics for now, even if high-level optimization, boolean optimization and technology mapping are equally interesting.
Placement may optimize the integrated circuit for several metrics: power, timing, area and, more importantly, congestion. That is, it must pass a feasible problem to the subsequent routing stage.
The most basic objective is wirelength: it correlates well with all the other metrics and is the primary one for most benchmarks.
So, how does a placer work? Early placers used meta-heuristics, such as simulated annealing. Even with clustering techniques, it becomes difficult to deal with large circuits. Both the industry and the academic world stopped using this technique during the 80s, and they began using two steps: a simplified global placement problem is solved to obtain approximate positions for the cells. It is followed by a legalization to obtain a correct circuit and a detailed placement phase then corrects local suboptimalities.
The global placement phase may be the most important one since it captures the problem’s complexity, while detailed placement is mostly about refinement. Two classes of algorithms have been developped for global placement: partitioning-based placement and analytical placement.
Placement through graph partitioning
You can see the netlist to place as an (hyper)graph: the cells are the vertices and the nets are the (hyper)edges. Placing the circuit with minimal wirelength is the problem of cutting the graph with minimal cost: you assign parts of the circuit to placement regions with as few wires between them as possible. By partitioning into gradually smaller regions, you can obtain a circuit with few global wiring. This weighted min-cut problem is complex but can be solved efficiently with heuristics: it is at the heart of partitioning-based placement.
Today, the partitioning-based tools are much more complex to capture timing and congestion: all tools, from Capo from the university of Michigan to Dragon and Fengshui have further heuristics to drive partitioning.
On the other hand, you may model the wirelength as a continuous function of the cells’ positions: this is the idea behind analytical placement. With a convex model, you may use efficient optimization algorithms. However, the obvious optimal solution is to cluster all cells together, whereas you want to return an almost overlap-free placement. In order to spread the cells, analytical placement needs a companion algorithm that modifies the cost function or constrains the cells to separate regions. The algorithms alternate between optimization and spreading until a sufficiently refined solution is reached.
The choice of these spreading heuristics varies considerably between placement tools. Early tools constrained the cells to regions: it is still the method used by the commercial tool BonnPlace. More recent academic tools pull the cells toward a legal position, the calculation of which is generally purely heuristic. Kraftwerk simulated repulsion with an electrostatic potential while tools like Fastplace, SimPL or MPl just try to spread the cells with partitioning-like heuristics. Notable exceptions are tools that model the legalization problem as a flow problem, which have more mathematical justifications.
Note that no tool that I know of is available for more than benchmarking: they all provide binary-only versions. This is a sufficient reason to create an open tool, and I hope that Coloquinte will fill this gap.
I will not publish code before I have something clean enough for the global placer, but I hope to open the Git within two (one?) months and to gradually publish the algorithms.
In the next post, I will discuss the choice of a specific analytical placement algorithm.
I am currently writing a tool for electronic design automation. I prototyped it last year on my spare time, and it is now an official research project targeting integrated circuit placement. I called it Coloquinte, hence the name of this blog: its purpose is mostly to present and discuss algorithms.
Electronic design automation (EDA) is a need when integrated circuits feature blocks with millions of logic gates, or when the designs and the processes are as complex as they are now. Therefore, there are many fields in EDA, targeting both circuit synthesis and simulation/verification. The synthesis is the whole process to transform a high-level specification of a circuit into a bare-metal chip.
The high-level specification it receives is usually a dedicated hardware description language (HDL) like Verilog, which provides a description that is still close to the final circuit. Nowadays, this complexity is a burden and higher-level languages are called for: it should is possible to transform C or OpenCL into hardware.
The synthesis flow works as a multipass compiler: it performs optimizations on various representations to output a manufacturable chip. It will transform the specification into boolean logic, place this logic on a chip, then route the wires between them and finally verify that it did not introduce any error. Since an integrated circuit is more constrained than a computer program, these steps often fail and still require a lot of designer input: compiling a circuit is difficult, expensive and time-consuming.
Since the synthesis has such a tremendous influence on cost (area, manufacturing yield and time-to-market) and performance (frequency and power), it has been an area of active research during the past decades. Coloquinte targets part of this synthesis flow, the placement of cells in a block.
Current placement tools are known to be at least 50% away from optimal on some benchmarks. A lot of algorithms have been published but few tools are available in the open: Coloquinte is both a research project and an attempt to have a good placement tool available in a complete toolchain. It is supposed to integrate with the toolchain from Paris VI university, Coriolis, and its database Hurricane.
My next post will be a short presentation of placement algorithms before I can explain the architecture I chosed for Coloquinte and begin posting algorithms and ideas.