## My optimization toolbox

The principle of optimization is simple: get the best possible solution on a problem, subject to a given cost and problem-specific constraints. It is important in many areas, from economics and planning to electronics design. Although the theory is useful and very important for tool writers, it is more of an art for the user. However, the more optimization problems I see, the more I learn to reuse the same old tricks. I wanted to look back on what I learned and why.

When I tackled my first optimization problems, I had a strong tendency to write custom algorithms, whether basic local-search algorithms, brute-force or problem-specific clever algorithms. Now I see this approach as a big loss of time: we have a huge toolbox of powerful tools and modeling approaches that usually work better, and are much faster to try in any case. Even for the rare cases where these tools wouldn’t work in practice, they are very helpful to get a better idea of your problem. The big question is “which tool and which model?”.

Most of the work I did could be tackled with integer programming or continuous optimization, and I think all programmers who need to optimize something — in particular people working on automation of electronic design — should at least know when to try them.

### Integer programming

Integer linear programming is probably the most important tool you will ever use: it can model almost anything if you know a few tricks, while being efficient on most problems. It optimizes a linear cost function, subject to linear inequalities on integer or real variables. Even on non-linear problems, an approximate linear programming model can be much better and faster than anything else.

There are some good free tools (I use GLPK and its modeling language, GMPL). On most problems it is my first modelling attempt. I discovered recently that it was able to use Minisat for pseudo-boolean problems (when all the variables are boolean): you get a completely different optimization algorithm for free, that can be much more efficient on some problems.

### Huge problems

On the other hand, I had to deal with some huge uncontrained problems, in the millions of variables. It may not be something you need to use every day, but the conjugate gradient method and Nesterov’s method are interesting to learn about. There is a bunch or libraries you can use for convex optimization in C++, R, Python, Julia…

For complex and huge problems, such as a big travelling salesman or vehicle routing problem, you can’t do much: either there is a library for this specific problem, or you write some ad-hoc heuristics from scratch.

Now I  don’t think it is a good idea to ever consider this as a first approach if you want to solve a problem optimally, but for big or badly defined problems a heuristic algorithm is generally the only way. Simple local search is generally a good starting point.

Depending on the problem, even greedy algorithms can give good result. I often try simple permutations first, like the Lin-Kerningham heuristic, and maybe simulated annealing. I find genetic algorithms are more of a buzzword, and quite overrated.

### Back to exact techniques

Even if an exact technique will not work for a difficult problem, it is often useful to try it anyway. It constrains you to model your problem, simplify it, and make it fit in a mathematical framework. Moreover, it can give you a hint of how bad your current algorithms are on small instances, and provide a subroutine for local search on your big problem.

In design automation, integer programming and the like aren’t used very often. In other fields, in particular planning, integer programming and the likes have proven invaluable long ago: give it a try.

## Gridless routing?

During a few weeks, I have been working on the router. I abandonned the project for a moment – there is still a lot of low hanging fruit in placement research – but it is nonetheless an interesting project .

Routing is the step where the wires are drawn, and on ASIC circuits (I don’t know about FPGAs) it consists of two steps: a global routing steps that balances congestion and makes choices regarding the overall shapes, and a detailed routing step which actually places the wires, hopefully without overlaps or design rules violations.

### Router limitations

Routers tend to be limited in what they can do, for physical and algorithmical reasons. For example, all practical digital routers use Manhattan wiring, with only horizontal and vertical wires. This is actually worth it: it greatly simplifies the algorithms and probably even the manufacturing processes. In fact, every attempt to generalize to non-Manhattan routing layers failed.

Another sensible limitation is the definition of a preferred direction for each routing layer: it is friendlier to sequential routing methods, where wires would tend to obstruct each other if no constraint is introduced.

The limitation I am writing about is the use of a routing grid during detailed routing. That is, the wires are placed using a constant pitch in each layer, generally wide enough to guarantee design rules conformance. This scheme is amenable to efficient representations: as a 2D array or, with a preferred direction, as a list of segments for each track. All usual routing algorithms (maze routing, channel or switchbox routing) work on such a data structure.

### Is the grid a problem?

It would be better if the router was not limited by the grid, but the grid isn’t necessarily a bad thing: if every wire is drawn at minimum width with a uniform spacing anyway, there is no gain to expect from a gridless router.

However, it isn’t necessarily true on newer design processes. Drawing wires of various widths can improve the RC for critical wires. On the other hand, non-uniform spacing can potentially improve crosstalk and is a less pessimistic approach to design rule conformance when minimum spacing varies with the wire’s length and shape. Research on gridless routing can make the router much more flexible on wire sizing and spacing.

### Data structure bloat in gridless routers

The difficulty of gridless routing is the definition of an efficient datastructure. Detailed routing needs to find paths for each net and typically will request neighbouring segments or whitespace, or segments in a given area. Such requests consume most of the detailed router runtime. Switching to a gridless architecture will make this accesses slower and is going to have a HUGE effect on runtime.

#### Tiling and corner stitching

The most versatile algorithm, maze routing, works on non-uniform grids as well, and a straightforward approach to gridless routing is to define such a grid based on the boundaries of already placed segments.

In memory, this tiling is generally represented with pointers to the neighbours. In “corner stitching”, used by the Magic layout tool, there are only 4 pointers at the corners. The problem with such structures is that they involve a lot  of pointer chasing and are heavy on memory. In the case of corner stitching, the algorithms to access and modify the datastructure tend to be complicated.

#### Spatial indexing

Another approach is to apply a coarse-grained grid or a quadtree on the routing area: this makes area queries much faster and the datastructure smaller. However, maze routing, which relies on an explicit tiling, cannot be performed with this structure.

I think that structures based on such a coarse-grained grid are more efficient than tiling: they are smaller and should yield simpler and faster queries… but they are not suitable for current routing algorithms.

### Simpler algorithms make fast gridless routing possible

The most common routing algorithm, maze routing, is almost unusable for gridless routing: it is painfully slow on complex tiled structures, and not amenable to spatial indexing. There is a need to reconsider routing algorithms for gridless routing, and luckily it turns out that this work has already been done by the maintainer of Coriolis.

#### Wires are mostly straight

When we look at a routed circuit, a striking result is that most wires are straight. Although they may use a few doglegs, or make detours in some extreme cases, complex patterns are uncommon even on highly congested instances. Coriolis Kite’s approach has been to focus on handling this common case efficiently.

While Kite uses a grid, it doesn’t use a classical maze, channel or switchbox router and tries to place the segments without breaking them. The main loop finds free space for a straight segment and doesn’t need a fully fledged maze algorithm. This makes it very promising for gridless routing.

#### Kite’s algorithm to get rid of tiling

If there is no need for maze routing except for a few difficult nets, the main representation can be much simpler, such as a coarse-grained grid. Finding a segment embedding is more expensive than on a track-based structure, but can still be relatively quick.

I will try to make it work before the end of my internship. It could be straightforward, but I expect it to be relatively painful: I didn’t design Kite’s algorithms and they took a lot of experience and trial-error steps to get right. However, they seem to be just the right approach for gridless routing, and I can’t wait to see how the datastructure will perform in practice.

## Sidenote: there is no such thing as independent optimization passes

During my work on the router, my first attempt was not on a gridless router but on a simple amelioration of the global router. It was apparently promising… but although it seemed better on routing metrics than the old global router, it failed spectacularly during detailed routing.

It turns out that all tools are intertwined, and in some cases you can’t compare two good tools based on simple metrics: you need to take into account every surrounding tools. This is one of the reasons why this discussion remains mostly theoretical: writing a new router is not a simple plug-and-play operation, especially when the existing tool is already pretty optimized.

## Coriolis’ legalizer is unexpectedly good

I spent part of the weekend trying to improve the legalization algorithm in Coloquinte – now integrated in Coriolis. It turns out that I didn’t manage to obtain any improvement: every change I tried obtained worse results than the current version.

The current algorithm doesn’t even legalize the true placement: it works on a modified placement obtained by heuristically solving a flow problem. Although this placement is supposed to be close to the best legal placement, I didn’t expect it to stand against more sophisticated heuristics that use it only as a hint.

I think that what makes it so good is that this first pass is inherently imprecise and optimistic: within the finer grained window, it doesn’t move the cells as much as it should. They are moved very close to their final position, but still a bit off in the direction of their target position. Therefore, the second pass receives a good hint placement which is still a bit biased toward the target.

Whatever the reason is, I’m finished with the legalizer. I don’t think there is much more work to do in this area, and I’m convinced that it is better than published works like Abacus and HiBinLegalizer. Sadly, benchmarking against them involves finding and installing them and MPL, which is all but trivial. Our legalizer should be good enough for us now (at least for standard cells, big macros are still a pain).

## Benchmark results

The ISPD conference organizes a design automation contest each year. It provides a lot of public benchmarks and results to compare to. Although Coloquinte does not yield the best results yet, it compares favorably against most published results. Here are some screenshots of wirelength-driven placements on some ISPD05 benchmarks.

## Low density placements

Variable densities are placements are now possible in Coloquinte. I am quite happy with the way it is done, by changing the placement region density rather than the cells’ areas like in other tools.

I implemented it using a line sweep algorithm, which makes it flexible: it can be fed overlapping regions and handles both macros and limited density regions in the same way. It remains to be seen if I can make a satisfying routing-driven placer.

## 97% density and a clock tree

Further testing and debugging during the last few days. Coloquinte hit some bugs in Coriolis that we (well, my advisor) solved, and we uncovered minor bugs in both tools in the process.

We tested Coloquinte with the clocktree generator, and our small test design is easily placed and routed. Since the clocktree generator yields a lot of preplaced repeaters and used to crash during post-processing I expected to find bugs in Coloquinte, but the python interface turned out to be guilty.

While my advisor makes Coriolis’ routing and clocktree synthesis work with Coloquinte, I am making it routing aware: it must be able to balance its target density with the routing demand.

It doesn’t account for congestion yet, and just packs the cell as densely as it can. For this small design, routing is not a problem even at full density, but a feedback loop will be needed for bigger ones.

## First routed circuits

Coloquinte is now used by the Coriolis toolchain. This means that you can now route the circuits obtained – and with an open-source tool.

Here are some screenshots of a circuit obtained with Coloquinte and Kite (the router). The next big step is to make congestion-aware placements: although Coloquinte and Kite optimize well enough that there is no routing failure on small circuits, this feature is mandatory in an industrial tool.

I find it very motivating to finally see what the algorithms do. Running the algorithm on a true circuit rather than ISPD benchmarks has caught some silly bugs, but now I’m adding features again.

It has been three months since I began this project, and it is time for a checkpoint. I just implemented the last mandatory algorithm, and Coloquinte’s algorithms are now usable to build a circuit.

It does not support standard formats, and since I still didn’t program the backend the results are still ~15% from the best state of the art. But all the core algorithms have been implemented.

### The optimizer

The core wirelength optimizer has been the first part I implemented. I chosed to write the sparse linear algebra code from scratch rather than using a library. The whole thing is in fact closer to a true non-linear optimizer, since it never completely solves the linear system, but overall it is very similar to other placement tools.

This tool modifies a solution so that it meets cell density constraints: it drives the optimizer toward an almost “legal” solution, where there are no more cells in a region than allowed by its area. It is although used as a pre-processing step of a true legalization.

It is very simple but extremely different from other algorithms. It distributes the cells in the regions and improves the distribution between nearby regions. There are powerful algorithms to do it, but the simplest heuristics seem to work best.

### Legalizer

The legalizer is the final step to obtain a correct placement. It places the cells so that they do not overlap, hopefully as close to the optimized solution as possible.

The naive way to do it is to sort the cells by position and chose the best position for each cell sequentially. Coloquinte’s legalizer uses a single row problem, which allows it to push already legalized cells to obtain a better solution (in Coloquinte it optimizes the displacement, but other legalizers did it for the quadratic displacement). Compared with the other solution, it is extremely stable.

### Next steps

Obviously, a good global placement doesn’t replace a detailed placement optimization: it is the next big step. There are still a lot of things to be done, and a lot of naive heuristics where we could do better, but detailed placement should finally close the gap with state of the art tools.

## Optimal is not always better

I stumbled on a difficult realization this week: a carefully engineered algorithm is not necessarily better than a very simple code.

One of the critical steps of my program used to rely on a quick and dirty heuristic – a hack relying on a very simple local search. Since it is an important optimization pass, I tried to adapt a mathematically clean and efficient algorithm, (based on minimum cost flows, which appear everywhere). It can optimize on a larger scale, and I hoped to get much better results.

Unsurprisingly, it turns out that it was a lost of time: for my datasets, the runtime of the quick heuristic is much lower to obtain the same results. Each pass is so much faster that its suboptimality doesn’t make a difference.

What is more disturbing is that same-quality results are not equivalent for the other optimization passes: when evaluating related cost functions, the clean algorithm tends to do much worse. I still don’t know if it is due to some kind of overfitting or to a specific feature of the heuristic, but I’d very much like to know what it is about.

## On scheduling algorithms applied to VLSI placement

There is a striking similarity between VLSI placement and scheduling. Digital electronics circuits are built from “standard cells”, logic gates with a unique height. They are organized in rows, so that the power wires run between two rows of standard cells.

Tasks must be scheduled on machines so that they do not conflict, standard cells must be placed in rows so that they do not overlap. The only difference between scheduling tasks and placing logic gates is the cost function, generally more complex for placement than for scheduling.

Placement problems have never been formulated as scheduling problems: it is no surprise that there has been few communication between the fields, the journals and laboratories being separate and the goals completely different. But it seems to me that we can solve some placement subproblems using scheduling techniques.

### Problem formulation

Scheduling is the problem of assigning tasks to machines. They are constrained by their execution times, with independent costs for each task.

In standard cell placement, the cost is usually much more complicated: the wirelength involves distances between tasks, and cannot be expressed a a sum of independent task costs. However, for a given ordering of the cells in the rows, it is possible to find the corresponding optimal positions by using linear programming (more precisely, network flow). This gives us a simple optimization pass to use during placement but it is still a bit expensive.

Therefore, for standard cell placement as well as scheduling, researchers have considered the simpler problem where there is only one row or one machine, still with fixed ordering. The cost function is a sum of independent piecewise linear functions, and hopefully there are faster algorithms to solve the problem.

### Fast algorithms for the ordered single row problem

Indeed, this problem is much easier to solve in practice. The clumping algorithm solves this problem in worst case quadratic time, and is usually much faster. The idea behind it is to add the cells at the end of the row and push it to the next pin until its placement is optimal: the final placement is optimal due to the convexity of the cost function.

This quadratic worst case complexity is annoying: several publications managed to improve this runtime up to m log² m where m is the number of pins. However, the datastructures involved are a special type of balanced trees, both slow and difficult to code.

I discovered a simpler m log n algorithm based on a priority queue, but by chance I found out that it is already known for scheduling,  as the specialized cascading descent algorithm. The cascading descent is equivalent to the clumping algorithm. The basic idea is to index the pins with a value that does not depend on the cell: this way, merging groups of cells takes constant time. The algorithm uses a single priority queue to hold the pins, hence its better complexity.

### Applications to VLSI placement

It isn’t a breakthrough: even naive linear programming is fast enough to optimize a large part of a circuit, so that this improvement might be barely noticeable compared with approximate or  quadratic algorithms.

However, there are two extensions that may prove particularly useful: local search and non-convex optimization.

Local search is the process of modifying the placement through simple modifications, like cell swapping. One of my goals is to modify the algorithm in order to support online placement modification with logarithmic complexity. That is, keep an optimal placement through small ordering modifications.

Non-convex optimization is trickier. There can be several local optimas, which makes most such problems extremely difficult. For the special structure of the ordered single row problem, however, there are quadratic algorithms. It means that we can integrate more complex cost functions directly in the ordered single row optimization, for example cell orientation.

Both ideas are relatively new: cell swapping, orientation and position optimization are almost always separate steps (although it is possible, even integer programming models in the litterature do not integrate all of them). I wonder how these methods will play compared to independent optimization passes.