In the electronic industry, there is one common denominator to all the tools and all the teams. Whether we are doing ASIC design, programming FPGAs or formally verifying chips, there is one thing we all use in our software: it is the representation of the design, called the netlist.
The netlist is the representation of the connections of the design. Although it probably originated from physical design representations, it is now a common representation throughout the compiler, from the high-level code to the final physical layout.
This post is the first in a series on the netlist, to present the basic concepts. Next ones will go more in depth on the operations performed on a netlist, the APIs that are provided to access it, and the actual representation in a programming language.
What’s in a netlist
A netlist represents the connections between the components of a design. Here is what it looks like:
Let’s call the component “cells” and the connections “nets”. Each cell has an interface: each element of this interface is called a “port”. Note that everything has a name, and that the ports generally have a direction as well. For the computer scientists among us, this is a lot like a graph.
Now, imagine we reuse mod in another netlist. It might just be a cell among others. some of those cells may actually be complex components, containing other, smaller cells: this creates a hierarchy of nested cells, some of which may be reused.
For example, a toy CPU: core is defined once, but used twice, creating two “instances”, core1 and core2. Like in software design, reuse is encouraged: a “module” (core, ALU) will be instantiated several times (core1, core2, core1.ALU1, core1.ALU2…). This reuse is there to make the designers’ and tools’ jobs easier: once we create the chip, there will be one copy of core for each core1, core2….
We have seen that the netlist came from the physical design world, but at the same time its quirks are present in “high-level” languages like Verilog and VHDL.
The consequence is that a netlist is the go-to format for everyone working in EDA… but there are many different netlist representations. For example:
Netlists targeting physical design with representations of the actual wiring as polygons. Coriolis’ Hurricane is a good example
Netlists targeting logical optimization, with specialized bit-level representations. This is the case in ABC
Higher-level netlists for synthesis and verification, with high-level operators, like Yosys
Generic netlists, that provide a generic API without application-specific utilities, like GBL
Since they are central to all those tools, their design as a lot of impact. First, on the overall development of the tools: an easy-to-use API is a must. Second, on the performance, to process millions to billions of gates efficiently. But let’s keep those for the next post.
I just realized that the report and presentation I wrote about Coloquinte were never made public. Here they are, under CC-BY-SA. Please use it as you see fit: to learn about placement algorithms, as a bibliography, or maybe to improve Coloquinte or other tools!
During a few weeks, I have been working on the router. I abandonned the project for a moment – there is still a lot of low hanging fruit in placement research – but it is nonetheless an interesting project .
Routing is the step where the wires are drawn, and on ASIC circuits (I don’t know about FPGAs) it consists of two steps: a global routing steps that balances congestion and makes choices regarding the overall shapes, and a detailed routing step which actually places the wires, hopefully without overlaps or design rules violations.
Routers tend to be limited in what they can do, for physical and algorithmical reasons. For example, all practical digital routers use Manhattan wiring, with only horizontal and vertical wires. This is actually worth it: it greatly simplifies the algorithms and probably even the manufacturing processes. In fact, every attempt to generalize to non-Manhattan routing layers failed.
Another sensible limitation is the definition of a preferred direction for each routing layer: it is friendlier to sequential routing methods, where wires would tend to obstruct each other if no constraint is introduced.
The limitation I am writing about is the use of a routing grid during detailed routing. That is, the wires are placed using a constant pitch in each layer, generally wide enough to guarantee design rules conformance. This scheme is amenable to efficient representations: as a 2D array or, with a preferred direction, as a list of segments for each track. All usual routing algorithms (maze routing, channel or switchbox routing) work on such a data structure.
Is the grid a problem?
It would be better if the router was not limited by the grid, but the grid isn’t necessarily a bad thing: if every wire is drawn at minimum width with a uniform spacing anyway, there is no gain to expect from a gridless router.
However, it isn’t necessarily true on newer design processes. Drawing wires of various widths can improve the RC for critical wires. On the other hand, non-uniform spacing can potentially improve crosstalk and is a less pessimistic approach to design rule conformance when minimum spacing varies with the wire’s length and shape. Research on gridless routing can make the router much more flexible on wire sizing and spacing.
Data structure bloat in gridless routers
The difficulty of gridless routing is the definition of an efficient datastructure. Detailed routing needs to find paths for each net and typically will request neighbouring segments or whitespace, or segments in a given area. Such requests consume most of the detailed router runtime. Switching to a gridless architecture will make this accesses slower and is going to have a HUGE effect on runtime.
Tiling and corner stitching
The most versatile algorithm, maze routing, works on non-uniform grids as well, and a straightforward approach to gridless routing is to define such a grid based on the boundaries of already placed segments.
In memory, this tiling is generally represented with pointers to the neighbours. In “corner stitching”, used by the Magic layout tool, there are only 4 pointers at the corners. The problem with such structures is that they involve a lot of pointer chasing and are heavy on memory. In the case of corner stitching, the algorithms to access and modify the datastructure tend to be complicated.
Another approach is to apply a coarse-grained grid or a quadtree on the routing area: this makes area queries much faster and the datastructure smaller. However, maze routing, which relies on an explicit tiling, cannot be performed with this structure.
I think that structures based on such a coarse-grained grid are more efficient than tiling: they are smaller and should yield simpler and faster queries… but they are not suitable for current routing algorithms.
Simpler algorithms make fast gridless routing possible
The most common routing algorithm, maze routing, is almost unusable for gridless routing: it is painfully slow on complex tiled structures, and not amenable to spatial indexing. There is a need to reconsider routing algorithms for gridless routing, and luckily it turns out that this work has already been done by the maintainer of Coriolis.
Wires are mostly straight
When we look at a routed circuit, a striking result is that most wires are straight. Although they may use a few doglegs, or make detours in some extreme cases, complex patterns are uncommon even on highly congested instances. Coriolis Kite’s approach has been to focus on handling this common case efficiently.
While Kite uses a grid, it doesn’t use a classical maze, channel or switchbox router and tries to place the segments without breaking them. The main loop finds free space for a straight segment and doesn’t need a fully fledged maze algorithm. This makes it very promising for gridless routing.
Kite’s algorithm to get rid of tiling
If there is no need for maze routing except for a few difficult nets, the main representation can be much simpler, such as a coarse-grained grid. Finding a segment embedding is more expensive than on a track-based structure, but can still be relatively quick.
I will try to make it work before the end of my internship. It could be straightforward, but I expect it to be relatively painful: I didn’t design Kite’s algorithms and they took a lot of experience and trial-error steps to get right. However, they seem to be just the right approach for gridless routing, and I can’t wait to see how the datastructure will perform in practice.
Sidenote: there is no such thing as independent optimization passes
During my work on the router, my first attempt was not on a gridless router but on a simple amelioration of the global router. It was apparently promising… but although it seemed better on routing metrics than the old global router, it failed spectacularly during detailed routing.
It turns out that all tools are intertwined, and in some cases you can’t compare two good tools based on simple metrics: you need to take into account every surrounding tools. This is one of the reasons why this discussion remains mostly theoretical: writing a new router is not a simple plug-and-play operation, especially when the existing tool is already pretty optimized.
I spent part of the weekend trying to improve the legalization algorithm in Coloquinte – now integrated in Coriolis. It turns out that I didn’t manage to obtain any improvement: every change I tried obtained worse results than the current version.
The current algorithm doesn’t even legalize the true placement: it works on a modified placement obtained by heuristically solving a flow problem. Although this placement is supposed to be close to the best legal placement, I didn’t expect it to stand against more sophisticated heuristics that use it only as a hint.
I think that what makes it so good is that this first pass is inherently imprecise and optimistic: within the finer grained window, it doesn’t move the cells as much as it should. They are moved very close to their final position, but still a bit off in the direction of their target position. Therefore, the second pass receives a good hint placement which is still a bit biased toward the target.
Whatever the reason is, I’m finished with the legalizer. I don’t think there is much more work to do in this area, and I’m convinced that it is better than published works like Abacus and HiBinLegalizer. Sadly, benchmarking against them involves finding and installing them and MPL, which is all but trivial. Our legalizer should be good enough for us now (at least for standard cells, big macros are still a pain).
The ISPD conference organizes a design automation contest each year. It provides a lot of public benchmarks and results to compare to. Although Coloquinte does not yield the best results yet, it compares favorably against most published results. Here are some screenshots of wirelength-driven placements on some ISPD05 benchmarks.
Variable densities are placements are now possible in Coloquinte. I am quite happy with the way it is done, by changing the placement region density rather than the cells’ areas like in other tools.
I implemented it using a line sweep algorithm, which makes it flexible: it can be fed overlapping regions and handles both macros and limited density regions in the same way. It remains to be seen if I can make a satisfying routing-driven placer.
Further testing and debugging during the last few days. Coloquinte hit some bugs in Coriolis that we (well, my advisor) solved, and we uncovered minor bugs in both tools in the process.
We tested Coloquinte with the clocktree generator, and our small test design is easily placed and routed. Since the clocktree generator yields a lot of preplaced repeaters and used to crash during post-processing I expected to find bugs in Coloquinte, but the python interface turned out to be guilty.
While my advisor makes Coriolis’ routing and clocktree synthesis work with Coloquinte, I am making it routing aware: it must be able to balance its target density with the routing demand.
It doesn’t account for congestion yet, and just packs the cell as densely as it can. For this small design, routing is not a problem even at full density, but a feedback loop will be needed for bigger ones.
Coloquinte is now used by the Coriolis toolchain. This means that you can now route the circuits obtained – and with an open-source tool.
Here are some screenshots of a circuit obtained with Coloquinte and Kite (the router). The next big step is to make congestion-aware placements: although Coloquinte and Kite optimize well enough that there is no routing failure on small circuits, this feature is mandatory in an industrial tool.
I find it very motivating to finally see what the algorithms do. Running the algorithm on a true circuit rather than ISPD benchmarks has caught some silly bugs, but now I’m adding features again.
It has been three months since I began this project, and it is time for a checkpoint. I just implemented the last mandatory algorithm, and Coloquinte’s algorithms are now usable to build a circuit.
It does not support standard formats, and since I still didn’t program the backend the results are still ~15% from the best state of the art. But all the core algorithms have been implemented.
The core wirelength optimizer has been the first part I implemented. I chosed to write the sparse linear algebra code from scratch rather than using a library. The whole thing is in fact closer to a true non-linear optimizer, since it never completely solves the linear system, but overall it is very similar to other placement tools.
This tool modifies a solution so that it meets cell density constraints: it drives the optimizer toward an almost “legal” solution, where there are no more cells in a region than allowed by its area. It is although used as a pre-processing step of a true legalization.
It is very simple but extremely different from other algorithms. It distributes the cells in the regions and improves the distribution between nearby regions. There are powerful algorithms to do it, but the simplest heuristics seem to work best.
The legalizer is the final step to obtain a correct placement. It places the cells so that they do not overlap, hopefully as close to the optimized solution as possible.
The naive way to do it is to sort the cells by position and chose the best position for each cell sequentially. Coloquinte’s legalizer uses a single row problem, which allows it to push already legalized cells to obtain a better solution (in Coloquinte it optimizes the displacement, but other legalizers did it for the quadratic displacement). Compared with the other solution, it is extremely stable.
Obviously, a good global placement doesn’t replace a detailed placement optimization: it is the next big step. There are still a lot of things to be done, and a lot of naive heuristics where we could do better, but detailed placement should finally close the gap with state of the art tools.
There is a striking similarity between VLSI placement and scheduling. Digital electronics circuits are built from “standard cells”, logic gates with a unique height. They are organized in rows, so that the power wires run between two rows of standard cells.
Tasks must be scheduled on machines so that they do not conflict, standard cells must be placed in rows so that they do not overlap. The only difference between scheduling tasks and placing logic gates is the cost function, generally more complex for placement than for scheduling.
Placement problems have never been formulated as scheduling problems: it is no surprise that there has been few communication between the fields, the journals and laboratories being separate and the goals completely different. But it seems to me that we can solve some placement subproblems using scheduling techniques.
Scheduling is the problem of assigning tasks to machines. They are constrained by their execution times, with independent costs for each task.
In standard cell placement, the cost is usually much more complicated: the wirelength involves distances between tasks, and cannot be expressed a a sum of independent task costs. However, for a given ordering of the cells in the rows, it is possible to find the corresponding optimal positions by using linear programming (more precisely, network flow). This gives us a simple optimization pass to use during placement but it is still a bit expensive.
Therefore, for standard cell placement as well as scheduling, researchers have considered the simpler problem where there is only one row or one machine, still with fixed ordering. The cost function is a sum of independent piecewise linear functions, and hopefully there are faster algorithms to solve the problem.
Fast algorithms for the ordered single row problem
Indeed, this problem is much easier to solve in practice. The clumping algorithm solves this problem in worst case quadratic time, and is usually much faster. The idea behind it is to add the cells at the end of the row and push it to the next pin until its placement is optimal: the final placement is optimal due to the convexity of the cost function.
This quadratic worst case complexity is annoying: several publications managed to improve this runtime up to m log² m where m is the number of pins. However, the datastructures involved are a special type of balanced trees, both slow and difficult to code.
I discovered a simpler m log n algorithm based on a priority queue, but by chance I found out that it is already known for scheduling, as the specialized cascading descent algorithm. The cascading descent is equivalent to the clumping algorithm. The basic idea is to index the pins with a value that does not depend on the cell: this way, merging groups of cells takes constant time. The algorithm uses a single priority queue to hold the pins, hence its better complexity.
Applications to VLSI placement
It isn’t a breakthrough: even naive linear programming is fast enough to optimize a large part of a circuit, so that this improvement might be barely noticeable compared with approximate or quadratic algorithms.
However, there are two extensions that may prove particularly useful: local search and non-convex optimization.
Local search is the process of modifying the placement through simple modifications, like cell swapping. One of my goals is to modify the algorithm in order to support online placement modification with logarithmic complexity. That is, keep an optimal placement through small ordering modifications.
Non-convex optimization is trickier. There can be several local optimas, which makes most such problems extremely difficult. For the special structure of the ordered single row problem, however, there are quadratic algorithms. It means that we can integrate more complex cost functions directly in the ordered single row optimization, for example cell orientation.
Both ideas are relatively new: cell swapping, orientation and position optimization are almost always separate steps (although it is possible, even integer programming models in the litterature do not integrate all of them). I wonder how these methods will play compared to independent optimization passes.