
An introduction to netlists

In the electronics industry, there is one common denominator across all the tools and all the teams. Whether we are doing ASIC design, programming FPGAs or formally verifying chips, there is one thing we all use in our software: the representation of the design, called the netlist.

The netlist describes the connections of the design. Although it probably originated in physical design, it is now a common representation throughout the compiler, from the high-level code to the final physical layout.

This post is the first in a series on the netlist and presents the basic concepts. The next ones will go more in depth into the operations performed on a netlist, the APIs provided to access it, and its actual representation in a programming language.

What’s in a netlist

A netlist represents the connections between the components of a design. Here is what it looks like:

A simple module, mod: two cells (an AND and an OR gate) and four nets; net names are not shown

Let’s call the components “cells” and the connections “nets”. Each cell has an interface: each element of this interface is called a “port”. Note that everything has a name, and that ports generally have a direction as well. For the computer scientists among us, this looks a lot like a graph.

Now, imagine we reuse mod in another netlist. It might just be one cell among others. Some of those cells may actually be complex components, containing other, smaller cells: this creates a hierarchy of nested cells, some of which may be reused.

A toy CPU netlist, with cell reuse for the cores and ALUs

Take the toy CPU above: core is defined once but used twice, creating two “instances”, core1 and core2. Like in software design, reuse is encouraged: a “module” (core, ALU) will be instantiated several times (core1, core2, core1.ALU1, core1.ALU2…). This reuse is there to make the designers’ and tools’ jobs easier: once we create the chip, there will be one physical copy of core for each of core1, core2….
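To make the vocabulary concrete, here is a minimal C++ sketch of what such a hierarchical netlist could look like. The type and field names are mine, not those of any particular tool.

#include <string>
#include <vector>

// Minimal hierarchical netlist sketch (illustrative names, not a real API).
struct Port {
    enum class Direction { Input, Output, InOut };
    std::string name;
    Direction direction;
};

struct Net {
    std::string name;
    // A connection points to one port of one instance (or of the module itself).
    struct Connection {
        int instance; // index into Module::instances, or -1 for the module's own port
        int port;     // index into the ports of the instantiated module
    };
    std::vector<Connection> connections;
};

struct Instance {
    std::string name; // e.g. "core1"
    int module;       // index into Netlist::modules: the module being instantiated
};

struct Module {
    std::string name; // e.g. "core", "ALU"
    std::vector<Port> ports;
    std::vector<Instance> instances; // nested cells: this is where the hierarchy comes from
    std::vector<Net> nets;
};

struct Netlist {
    // "core" is stored once here, however many times it is instantiated.
    std::vector<Module> modules;
};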

Netlists everywhere

We have seen that the netlist came from the physical design world, but at the same time its quirks are present in “high-level” languages like Verilog and VHDL.

The consequence is that a netlist is the go-to format for everyone working in EDA… but there are many different netlist representations. For example:

  • Netlists targeting physical design with representations of the actual wiring as polygons. Coriolis’ Hurricane is a good example
  • Netlists targeting logical optimization, with specialized bit-level representations. This is the case in ABC
  • Higher-level netlists for synthesis and verification, with high-level operators, like Yosys
  • Generic netlists, that provide a generic API without application-specific utilities, like GBL

Since they are central to all those tools, their design has a lot of impact. First, on the overall development of the tools: an easy-to-use API is a must. Second, on performance: a netlist must handle millions to billions of gates efficiently. But let’s keep those topics for the next post.

Gridless routing?

For a few weeks, I have been working on the router. I abandoned the project for a moment – there is still a lot of low-hanging fruit in placement research – but it is nonetheless an interesting project.

Routing is the step where the wires are drawn, and on ASIC circuits (I don’t know about FPGAs) it consists of two steps: a global routing step that balances congestion and makes choices regarding the overall shapes, and a detailed routing step that actually places the wires, hopefully without overlaps or design rule violations.

Router limitations

Routers tend to be limited in what they can do, for physical and algorithmic reasons. For example, all practical digital routers use Manhattan wiring, with only horizontal and vertical wires. This is actually worth it: it greatly simplifies the algorithms and probably even the manufacturing processes. In fact, every attempt to generalize to non-Manhattan routing layers has failed.

Another sensible limitation is the definition of a preferred direction for each routing layer: it is friendlier to sequential routing methods, where wires would tend to obstruct each other if no constraint is introduced.

The limitation I am writing about is the use of a routing grid during detailed routing. That is, the wires are placed using a constant pitch in each layer, generally wide enough to guarantee design rule conformance. This scheme is amenable to efficient representations: as a 2D array or, with a preferred direction, as a list of segments for each track. All the usual routing algorithms (maze routing, channel or switchbox routing) work on such a data structure.

The grid is amenable to efficient representations. Maze routing algorithms generally use a 2D grid (a), while Kite is track-oriented (b)
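As an illustration – the type names are mine, not from any particular router – here is a rough C++ sketch of the two representations: a plain 2D occupancy array (a), and a per-track list of occupied segments for a layer with a preferred direction (b).

#include <cstdint>
#include <vector>

// (a) Plain 2D grid: one occupancy flag per routing grid cell of a layer.
struct GridLayer {
    int width = 0, height = 0;
    std::vector<uint8_t> occupied; // row-major, width * height cells

    bool isFree(int x, int y) const { return occupied[y * width + x] == 0; }
};

// (b) Track-oriented view: for a layer with a preferred direction, each track
// keeps a sorted list of the segments already placed on it.
struct Segment {
    int begin, end; // occupied interval [begin, end) in routing grid units
};

struct TrackLayer {
    std::vector<std::vector<Segment>> tracks; // tracks[t] is sorted by begin
};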

Is the grid a problem?

It would be better if the router was not limited by the grid, but the grid isn’t necessarily a bad thing: if every wire is drawn at minimum width with a uniform spacing anyway, there is no gain to expect from a gridless router.

However, this isn’t necessarily true on newer design processes. Drawing wires of various widths can improve the RC delay of critical wires. On the other hand, non-uniform spacing can potentially reduce crosstalk and is a less pessimistic approach to design rule conformance when the minimum spacing varies with a wire’s length and shape. Research on gridless routing can make the router much more flexible on wire sizing and spacing.

Data structure bloat in gridless routers

The difficulty of gridless routing is the definition of an efficient data structure. Detailed routing needs to find paths for each net, and typically requests neighbouring segments or whitespace, or the segments in a given area. Such requests consume most of the detailed router’s runtime. Switching to a gridless architecture makes these accesses slower and is going to have a HUGE effect on runtime.

Tiling and corner stitching

The most versatile algorithm, maze routing, works on non-uniform grids as well, and a straightforward approach to gridless routing is to define such a grid based on the boundaries of already placed segments.

The same routing pattern and its corresponding tiling

In memory, this tiling is generally represented with pointers to the neighbours. In “corner stitching”, used by the Magic layout tool, there are only 4 pointers at the corners. The problem with such structures is that they involve a lot of pointer chasing and are heavy on memory. In the case of corner stitching, the algorithms to access and modify the data structure tend to be complicated.
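As a rough illustration – this is not Magic’s actual data structure – a corner-stitched tile boils down to a rectangle plus four neighbour pointers; reaching any other neighbour means following stitches, which is where the pointer chasing and the algorithmic complexity come from.

// Hedged sketch of a corner-stitched tile (see Ousterhout's corner stitching
// paper and Magic for the real thing). The four "stitches" sit at two corners:
// two at the top-right corner and two at the bottom-left corner.
struct Tile {
    int xLow, yLow, xHigh, yHigh; // rectangle covered by this tile

    Tile* topAtRight;    // neighbour above, reached from the top-right corner
    Tile* rightAtTop;    // neighbour to the right, reached from the top-right corner
    Tile* bottomAtLeft;  // neighbour below, reached from the bottom-left corner
    Tile* leftAtBottom;  // neighbour to the left, reached from the bottom-left corner
};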

Spatial indexing

Another approach is to apply a coarse-grained grid or a quadtree to the routing area: this makes area queries much faster and the data structure smaller. However, maze routing, which relies on an explicit tiling, cannot be performed with this structure.

I think that structures based on such a coarse-grained grid are more efficient than tiling: they are smaller and should yield simpler and faster queries… but they are not suitable for current routing algorithms.
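For concreteness, here is a minimal sketch of such a coarse-grained index; the names and layout are mine. Segments are bucketed by the coarse bins they overlap, and an area query only visits the bins intersecting the query rectangle.

#include <vector>

// Minimal coarse-grained spatial index (illustrative, not a real router's).
// Coordinates are half-open, non-negative routing grid units: [xLow, xHigh).
struct Rect { int xLow, yLow, xHigh, yHigh; };

struct CoarseGrid {
    int binSize; // side of a coarse bin
    int nbBinsX, nbBinsY;
    std::vector<std::vector<int>> bins; // bins[y * nbBinsX + x] holds segment indices

    CoarseGrid(int width, int height, int binSize)
      : binSize(binSize),
        nbBinsX((width + binSize - 1) / binSize),
        nbBinsY((height + binSize - 1) / binSize),
        bins(nbBinsX * nbBinsY) {}

    void insert(int segmentIndex, const Rect& r) {
        for (int y = r.yLow / binSize; y <= (r.yHigh - 1) / binSize; ++y)
            for (int x = r.xLow / binSize; x <= (r.xHigh - 1) / binSize; ++x)
                bins[y * nbBinsX + x].push_back(segmentIndex);
    }

    // Returns candidate segments in the area; a segment spanning several bins
    // may be returned several times, and an exact overlap test should follow.
    std::vector<int> query(const Rect& r) const {
        std::vector<int> result;
        for (int y = r.yLow / binSize; y <= (r.yHigh - 1) / binSize; ++y)
            for (int x = r.xLow / binSize; x <= (r.xHigh - 1) / binSize; ++x)
                for (int s : bins[y * nbBinsX + x])
                    result.push_back(s);
        return result;
    }
};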

Simpler algorithms make fast gridless routing possible

The most common routing algorithm, maze routing, is almost unusable for gridless routing: it is painfully slow on complex tiled structures, and not amenable to spatial indexing. There is a need to reconsider routing algorithms for gridless routing, and luckily it turns out that this work has already been done by the maintainer of Coriolis.

Wires are mostly straight

Even when the wiring is quite dense, there is no need to introduce many bends.

When we look at a routed circuit, a striking result is that most wires are straight. Although they may use a few doglegs, or make detours in some extreme cases, complex patterns are uncommon even on highly congested instances. Coriolis Kite’s approach has been to focus on handling this common case efficiently.

While Kite uses a grid, it doesn’t use a classical maze, channel or switchbox router and tries to place the segments without breaking them. The main loop finds free space for a straight segment and doesn’t need a fully fledged maze algorithm. This makes it very promising for gridless routing.
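This is not Kite’s actual code, but the core query can be sketched as follows: given the sorted occupied intervals of a track, find a free gap able to host a straight segment.

#include <algorithm>
#include <limits>
#include <optional>
#include <vector>

// Illustrative sketch (not Kite's implementation). A track keeps its occupied
// intervals sorted and non-overlapping; hosting a straight segment [begin, end)
// means finding a free gap that contains it. Track bounds are ignored here:
// the sentinels stand for the ends of the track.
struct Interval { int begin, end; }; // half-open [begin, end)

std::optional<Interval> freeSpanAround(const std::vector<Interval>& occupied,
                                       int begin, int end) {
    int low = std::numeric_limits<int>::min(); // start of the current free gap
    for (const Interval& o : occupied) {
        if (o.begin >= end && low <= begin)
            return Interval{low, o.begin}; // the gap [low, o.begin) contains the segment
        low = std::max(low, o.end);
        if (low > begin)
            return std::nullopt; // the gap that contained 'begin' is blocked
    }
    if (low <= begin)
        return Interval{low, std::numeric_limits<int>::max()};
    return std::nullopt;
}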

Kite’s algorithm to get rid of tiling

If there is no need for maze routing except for a few difficult nets, the main representation can be much simpler, such as a coarse-grained grid. Finding a segment embedding is more expensive than on a track-based structure, but can still be relatively quick.

I will try to make it work before the end of my internship. It could be straightforward, but I expect it to be relatively painful: I didn’t design Kite’s algorithms, and they took a lot of experience and trial-and-error to get right. However, they seem to be just the right approach for gridless routing, and I can’t wait to see how the data structure will perform in practice.

Sidenote: there is no such thing as independent optimization passes

During my work on the router, my first attempt was not a gridless router but a simple improvement of the global router. It looked promising… but although it seemed better on routing metrics than the old global router, it failed spectacularly during detailed routing.

It turns out that all tools are intertwined, and in some cases you can’t compare two good tools based on simple metrics: you need to take into account all the surrounding tools. This is one of the reasons why this discussion remains mostly theoretical: writing a new router is not a simple plug-and-play operation, especially when the existing tool is already pretty optimized.

Coriolis’ legalizer is unexpectedly good

I spent part of the weekend trying to improve the legalization algorithm in Coloquinte – now integrated in Coriolis. It turns out that I didn’t manage to obtain any improvement: every change I tried gave worse results than the current version.

The current algorithm doesn’t even legalize the true placement: it works on a modified placement obtained by heuristically solving a flow problem. Although this placement is supposed to be close to the best legal placement, I didn’t expect it to hold up against more sophisticated heuristics that use it only as a hint.

I think that what makes it so good is that this first pass is inherently imprecise and optimistic: within the finer grained window, it doesn’t move the cells as much as it should. They are moved very close to their final position, but still a bit off in the direction of their target position. Therefore, the second pass receives a good hint placement which is still a bit biased toward the target.

Whatever the reason is, I’m finished with the legalizer. I don’t think there is much more work to do in this area, and I’m convinced that it is better than published works like Abacus and HiBinLegalizer. Sadly, benchmarking against them involves finding and installing them and MPL, which is far from trivial. Our legalizer should be good enough for us now (at least for standard cells; big macros are still a pain).

Benchmark results

The ISPD conference organizes a design automation contest each year. It provides a lot of public benchmarks and results to compare to. Although Coloquinte does not yield the best results yet, it compares favorably against most published results. Here are some screenshots of wirelength-driven placements on some ISPD05 benchmarks.

Bigblue1 is an “easy” placement, where there are few fixed macros in the center
Adaptec2 is more difficult: the fixed macros make it difficult to spread the placement

 

Low density placements

Variable-density placements are now possible in Coloquinte. I am quite happy with the way it is done: by changing the placement regions’ density rather than the cells’ areas as in other tools.

A uniform density placement after the rough legalization pass
The legalized placement: this placement is 50% whitespace
The same circuit after a few detailed placement passes. There are some artifacts since the detailed placement is not density-aware yet

I implemented it using a line sweep algorithm, which makes it flexible: it can be fed overlapping regions and handles both macros and limited density regions in the same way. It remains to be seen if I can make a satisfying routing-driven placer.
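This is not Coloquinte’s implementation, but the line-sweep idea can be sketched in one dimension: sweep the region boundaries along a row and keep, for every elementary interval, the most restrictive active density.

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// 1-D sketch of the line sweep (illustrative, not Coloquinte's code): given
// possibly overlapping density regions along a row, compute for each
// elementary interval the most restrictive active limit (1.0, i.e. full
// density, where no region applies). Regions are assumed to be clipped to
// [rowBegin, rowEnd).
struct DensityRegion { int begin, end; double maxDensity; }; // [begin, end)
struct DensitySpan   { int begin, end; double density; };

std::vector<DensitySpan> sweepDensities(const std::vector<DensityRegion>& regions,
                                        int rowBegin, int rowEnd) {
    // Sweep events: a region's limit becomes active at begin, inactive at end.
    struct Event { int x; double limit; bool open; };
    std::vector<Event> events;
    for (const DensityRegion& r : regions) {
        events.push_back({r.begin, r.maxDensity, true});
        events.push_back({r.end, r.maxDensity, false});
    }
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.x < b.x; });

    std::multiset<double> active; // limits of the currently open regions
    std::vector<DensitySpan> result;
    int prevX = rowBegin;
    std::size_t i = 0;
    while (prevX < rowEnd) {
        double density = active.empty() ? 1.0 : *active.begin();
        int nextX = (i < events.size()) ? std::min(events[i].x, rowEnd) : rowEnd;
        if (nextX > prevX)
            result.push_back({prevX, nextX, density});
        while (i < events.size() && events[i].x <= nextX) {
            if (events[i].open) active.insert(events[i].limit);
            else active.erase(active.find(events[i].limit));
            ++i;
        }
        prevX = std::max(prevX, nextX);
    }
    return result;
}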

 

97% density and a clock tree

Further testing and debugging during the last few days. Coloquinte hit some bugs in Coriolis that we (well, my advisor) solved, and we uncovered minor bugs in both tools in the process.

We tested Coloquinte with the clocktree generator, and our small test design is easily placed and routed. Since the clocktree generator yields a lot of preplaced repeaters and used to crash during post-processing, I expected to find bugs in Coloquinte, but the Python interface turned out to be the culprit.

A placement at 97% density with ~5000 cells, and a routed clocktree

While my advisor makes Coriolis’ routing and clocktree synthesis work with Coloquinte, I am making it routing aware: it must be able to balance its target density with the routing demand.

It doesn’t account for congestion yet, and just packs the cells as densely as it can. For this small design, routing is not a problem even at full density, but a feedback loop will be needed for bigger ones.

First routed circuits

Coloquinte is now used by the Coriolis toolchain. This means that you can now route the circuits obtained – and with an open-source tool.

Close view of the routed circuit

Here are some screenshots of a circuit obtained with Coloquinte and Kite (the router). The next big step is to make congestion-aware placements: although Coloquinte and Kite optimize well enough that there is no routing failure on small circuits, this feature is mandatory in an industrial tool.

A legal placement of the circuit

I find it very motivating to finally see what the algorithms do. Running the algorithms on a real circuit rather than the ISPD benchmarks caught some silly bugs, but now I’m adding features again.

The circuit after routing

The road so far

It has been three months since I began this project, and it is time for a checkpoint. I just implemented the last mandatory algorithm, and Coloquinte’s algorithms are now usable to build a circuit.

It does not support standard formats yet, and since I still haven’t programmed the backend, the results are still ~15% away from the best state of the art. But all the core algorithms have been implemented.

The optimizer

The core wirelength optimizer was the first part I implemented. I chose to write the sparse linear algebra code from scratch rather than using a library. The whole thing is in fact closer to a true non-linear optimizer, since it never completely solves the linear system, but overall it is very similar to other placement tools.
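As an illustration of what never completely solving the linear system can mean – this is a generic sketch, not Coloquinte’s code – here are a few conjugate gradient iterations on a sparse system such as the one a quadratic wirelength model produces; the matrix format and the stopping rule are my own choices.

#include <cstddef>
#include <vector>

// Generic sketch: a sparse symmetric positive-definite system A x = b, solved
// only approximately by stopping conjugate gradient after a few iterations,
// starting from the previous placement.
struct SparseMatrix { // compressed sparse row storage
    std::vector<int> rowStart, col;
    std::vector<double> val;

    int size() const { return (int)rowStart.size() - 1; }

    std::vector<double> apply(const std::vector<double>& x) const {
        std::vector<double> y(size(), 0.0);
        for (int i = 0; i < size(); ++i)
            for (int k = rowStart[i]; k < rowStart[i + 1]; ++k)
                y[i] += val[k] * x[col[k]];
        return y;
    }
};

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void approximateSolve(const SparseMatrix& A, const std::vector<double>& b,
                      std::vector<double>& x, int maxIterations = 20) {
    std::vector<double> r = A.apply(x);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i]; // residual b - A x
    std::vector<double> p = r;
    double rr = dot(r, r);
    for (int it = 0; it < maxIterations && rr > 1e-12; ++it) {
        std::vector<double> Ap = A.apply(p);
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rrNew = dot(r, r);
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rrNew / rr) * p[i];
        rr = rrNew;
    }
}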

Look-ahead legalization

This tool modifies a solution so that it meets cell density constraints: it drives the optimizer toward an almost “legal” solution, where there are no more cells in a region than allowed by its area. It is also used as a pre-processing step for the true legalization.

It is very simple but quite different from other algorithms. It distributes the cells among the regions and improves the distribution between nearby regions. There are powerful algorithms to do this, but the simplest heuristics seem to work best.

Legalizer

The legalizer is the final step to obtain a correct placement. It places the cells so that they do not overlap, hopefully as close to the optimized solution as possible.

The naive way to do it is to sort the cells by position and choose the best position for each cell sequentially. Coloquinte’s legalizer solves a single-row problem instead, which allows it to push already legalized cells to obtain a better solution (in Coloquinte it optimizes the displacement, but other legalizers have done it for the quadratic displacement). Compared with the naive approach, it is extremely stable.

Next steps

Obviously, a good global placement doesn’t replace detailed placement optimization: that is the next big step. There are still a lot of things to be done, and a lot of naive heuristics that could be improved, but detailed placement should finally close the gap with state-of-the-art tools.

 

On scheduling algorithms applied to VLSI placement

There is a striking similarity between VLSI placement and scheduling. Digital electronic circuits are built from “standard cells”, logic gates that all share the same height. They are organized in rows, so that the power wires run between two rows of standard cells.

Tasks must be scheduled on machines so that they do not conflict; standard cells must be placed in rows so that they do not overlap. The only difference between scheduling tasks and placing logic gates is the cost function, which is generally more complex for placement than for scheduling.

Placement problems have never been formulated as scheduling problems: it is no surprise, as there has been little communication between the fields, the journals and laboratories being separate and the goals completely different. But it seems to me that we can solve some placement subproblems using scheduling techniques.

Problem formulation

Scheduling is the problem of assigning tasks to machines. The tasks are constrained by their execution times, with an independent cost for each task.

In standard cell placement, the cost is usually much more complicated: the wirelength involves distances between cells, and cannot be expressed as a sum of independent per-cell costs. However, for a given ordering of the cells in the rows, it is possible to find the corresponding optimal positions using linear programming (more precisely, network flow). This gives us a simple optimization pass to use during placement, but it is still a bit expensive.

Therefore, for standard cell placement as well as scheduling, researchers have considered the simpler problem where there is only one row or one machine, still with fixed ordering. The cost function is a sum of independent piecewise linear functions, and hopefully there are faster algorithms to solve the problem.

A standard cell with one pin on the same row and two on other rows (a), an equivalent placement of the pins (b), and the associated cost function (c)
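With my own notation – a hedged formalisation, not taken from any specific paper – cell i in the row has position x_i and width w_i, and each pin p belongs to cell c(p) at offset o_p and is pulled toward a fixed target t_p given by the rest of the placement. The ordered single row problem is then

\min_{x_1, \dots, x_n} \; \sum_{p} \left| x_{c(p)} + o_p - t_p \right|
\quad \text{subject to} \quad x_{i+1} \ge x_i + w_i \quad (1 \le i < n),

a sum of independent convex piecewise-linear functions of the cell positions, constrained only by the fixed ordering.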

Fast algorithms for the ordered single row problem

Indeed, this problem is much easier to solve in practice. The clumping algorithm solves it in worst-case quadratic time, and is usually much faster. The idea behind it is to add the cells one by one at the end of the row and push each one (merging it with its predecessors when they collide) from pin to pin until its placement is optimal: the final placement is optimal due to the convexity of the cost function.

A group of cells is pushed while it decreases wirelength. Each time a pin is passed, the derivative of the wirelength is updated and the algorithm may stop and add the next cell to the row.
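To make the clumping idea concrete, here is a sketch for the simpler quadratic displacement cost (the Abacus-style variant) rather than the piecewise-linear wirelength discussed here; the piecewise-linear version works on the same principle but tracks the derivative of the cost pin by pin. The names and structure are mine.

#include <cstddef>
#include <vector>

// Clumping with a quadratic displacement cost (illustrative, not Coloquinte's
// code). Cells are in fixed left-to-right order, each with a width, a target
// position and a positive weight. Cells are added one by one; overlapping
// groups are merged into "clusters", each cluster sitting at its optimal
// (weighted average) position. Row boundaries are ignored for brevity.
struct Cell { double width, target, weight; };

struct Cluster {
    double e, q, w; // total weight, weighted target sum, total width
    int firstCell;
    double position() const { return q / e; } // optimal left edge of the cluster
};

std::vector<double> clump(const std::vector<Cell>& cells) {
    std::vector<Cluster> clusters;
    for (int i = 0; i < (int)cells.size(); ++i) {
        Cluster c;
        c.e = cells[i].weight;
        c.q = cells[i].weight * cells[i].target;
        c.w = cells[i].width;
        c.firstCell = i;
        // Merge with the previous cluster while the two overlap.
        while (!clusters.empty() &&
               clusters.back().position() + clusters.back().w > c.position()) {
            Cluster prev = clusters.back();
            clusters.pop_back();
            prev.q += c.q - c.e * prev.w; // c's cells are shifted right by prev's width
            prev.e += c.e;
            prev.w += c.w;
            c = prev; // the merged cluster keeps prev's first cell
        }
        clusters.push_back(c);
    }
    // Recover the position of every cell from its cluster's position.
    std::vector<double> x(cells.size());
    for (std::size_t k = 0; k < clusters.size(); ++k) {
        int last = (k + 1 < clusters.size()) ? clusters[k + 1].firstCell
                                             : (int)cells.size();
        double pos = clusters[k].position();
        for (int i = clusters[k].firstCell; i < last; ++i) {
            x[i] = pos;
            pos += cells[i].width;
        }
    }
    return x;
}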

This quadratic worst-case complexity is annoying: several publications managed to improve the runtime to m log² m, where m is the number of pins. However, the data structures involved are a special type of balanced tree, both slow and difficult to code.

I discovered a simpler m log n algorithm based on a priority queue, but by chance I found out that it is already known in scheduling as the specialized cascading descent algorithm. The cascading descent is equivalent to the clumping algorithm. The basic idea is to index the pins with a value that does not depend on the cell: this way, merging groups of cells takes constant time. The algorithm uses a single priority queue to hold the pins, hence its better complexity.

The absolute position of a cell is the corresponding position of the first cell if all cells were clustered. The position of a pin is defined relative to the cell it connects to.

 

Applications to VLSI placement

It isn’t a breakthrough: even naive linear programming is fast enough to optimize a large part of a circuit, so this improvement might be barely noticeable compared with approximate or quadratic algorithms.

However, there are two extensions that may prove particularly useful: local search and non-convex optimization.

Local search is the process of modifying the placement through simple moves, like cell swapping. One of my goals is to modify the algorithm so that it supports online placement modification with logarithmic complexity – that is, keeping an optimal placement through small ordering modifications.

Non-convex optimization is trickier. There can be several local optima, which makes most such problems extremely difficult. For the special structure of the ordered single row problem, however, there are quadratic algorithms. This means that we can integrate more complex cost functions directly into the ordered single row optimization, for example cell orientation.

Both ideas are relatively new: cell swapping, orientation and position optimization are almost always separate steps (although it is possible, even the integer programming models in the literature do not integrate all of them). I wonder how these methods will fare compared to independent optimization passes.