Tuesday, 18 August 2009

Hardware/Software Co-design

I'll preface this post by saying: I don't know much about hardware.

I have, however, spent a lot of time around hardware engineers. As a very naive programmer, I generally assumed that what they did was programming with some "extra bits" added, though I've long since been corrected.

Hardware is fast

How fast? For FPGA implementations, I believe many applications can run around 200 times faster than on a general-purpose CPU. I believe even a reasonably inexperienced hardware designer can achieve speed-ups of 50-100 times on many applications. It's also worth thinking about power consumption: an FPGA can often draw a small fraction of the power that commercially-available microprocessors would need to hit the same performance constraints.

This all sounds wonderful. Why aren't we all doing this? Well, for me, the problem is basically me.

I have neither the tools, nor the experience, to design non-trivial solutions to problems in hardware. If I were to sit down and try to solve a problem in hardware, I would probably end up writing VHDL or Verilog in much the same way that I would write C or Java. The result would be really, really awful hardware: inefficient, with a very low maximum clock-rate, and finding a device big enough to fit the resulting design on could become an expensive undertaking.

That said, I still really like the idea of using hardware to solve problems. While specialist hardware engineers will always have the job of designing high-performance hardware with tight performance constraints, is there any low-hanging fruit which we (as programmers) could target for using reconfigurable logic devices as part of our programs?

The most obvious way of achieving this seems to be with the general idea of "hardware acceleration", where a special-purpose device (such as a PCI-X card with an ASIC or FPGA implementation of some algorithm) takes over some of the hard work from the CPU. The device manufacturer provides some sort of software library that hides the hardware interaction behind nice high-level function calls that are familiar territory for programmers.

These are great for specific applications, but I'm not aware of any "general purpose" cards that would permit programmers to integrate them into an arbitrary application without having some knowledge of hardware design.

Perhaps the next obvious alternative is compiling software into hardware. This is dangerous territory. There have been so many claims of software-to-hardware compilers that I won't even bother to list them. I'm sure you can find hundreds with a quick Google search. The problem with most of these is that they are actually just hardware semantics hidden underneath a veneer of some mainstream programming language. A programmer familiar with the source language will still not be able to produce efficient hardware without knowing something about hardware.

I think there could be two possible solutions, which I will sketch here:

Component-based Approach

Presumably we could get experienced hardware engineers to implement a wide variety of software-style components as efficient hardware. We could then use the same approach as the domain-specific hardware accelerator manufacturers, hiding the complexity of hardware interaction behind some nice software libraries. If you had a significant portion of the expensive operations from a programming language's standard library available as hardware components, it wouldn't take a particularly smart compiler to replace certain standard library calls with calls to the (functionally identical) hardware-wrapper library.
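As a sketch of what the wrapper side might look like (in Python, with every name here hypothetical, and device discovery stubbed out): the library exposes hardware implementations behind the same signatures as their standard-library counterparts, and falls back to software when no device is present.

```python
# Hypothetical wrapper library: each accelerated routine keeps the same
# signature as its standard-library counterpart, so a compiler (or here, a
# trivial selection function) can substitute one for the other transparently.

def fpga_available():
    # Stand-in for real device discovery (e.g. probing a vendor driver).
    return False

def hw_sorted(xs):
    # Would marshal `xs` to an FPGA sorting core and read back the result;
    # no device exists in this sketch.
    raise NotImplementedError("no device present in this sketch")

def accelerated(sw_impl, hw_impl):
    """Pick the hardware implementation only when a device is present."""
    return hw_impl if fpga_available() else sw_impl

sort = accelerated(sorted, hw_sorted)

print(sort([3, 1, 2]))  # → [1, 2, 3] (software fallback in this sketch)
```

The point is only that identical signatures make the substitution mechanical; everything hard lives in deciding *when* the substitution pays off.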

But there are some problems that I can see.

It is unlikely to be efficient, in terms of communication, area or power consumption, to simply push a whole lot of small, infrequently-used library calls into hardware. There would need to be a lot of static analysis (or runtime profiling?) to determine where there are efficiencies to be gained.

The compiler would be forced to "schedule" components into the finite area available on the FPGA, which would have implications for latency and throughput. It might be necessary to buffer large numbers of calls before executing them.

The performance gains from doing operations in hardware are likely to be eaten up by the communication overhead of moving data from registers in the CPU (or from processor cache, or from main memory) to the FPGA.

Finally, in a traditional imperative setting, this is going to interrupt program control: the program will have to stop and wait for the results from the FPGA. The alternative is concurrency, but as I've previously said, concurrency is hard, and this is the type of heterogeneous task-level parallelism that is really hard.
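The buffering idea can be illustrated with back-of-envelope arithmetic. Suppose each CPU-to-FPGA round trip has a fixed overhead plus a small per-item cost (both numbers invented for illustration); batching a thousand calls into one transfer changes the picture dramatically:

```python
# Invented figures: a fixed cost per CPU<->FPGA round trip, plus a
# marginal cost per item transferred (think microseconds).
TRANSFER_OVERHEAD = 100.0
PER_ITEM_COST = 1.0

def cost_unbatched(n_calls):
    # One round trip per call: the fixed overhead is paid every time.
    return n_calls * (TRANSFER_OVERHEAD + PER_ITEM_COST)

def cost_batched(n_calls):
    # One round trip for the whole batch: the overhead is paid once.
    return TRANSFER_OVERHEAD + n_calls * PER_ITEM_COST

print(cost_unbatched(1000))  # → 101000.0
print(cost_batched(1000))    # → 1100.0
```

Of course, batching buys this throughput by sacrificing latency, which is exactly the trade-off the compiler would have to reason about.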

So it appears that this approach would require putting a lot of smarts into the compiler. We would need good performance heuristics, something resembling auto-vectorisation and automatic parallelisation, and likely an ability to trade-off latency for throughput.

It all seems a little bit challenging, given the state of the art of each of those areas. It would appear that some of the purely functional languages would be best-suited to this kind of approach, but even in the absence of side-effects it seems uh... difficult, at best.

Language-based Approach

So if we're not going to expect the compiler to perform a rather sophisticated brand of super-compilation on our programs, can we instead structure our programs (and programming languages) to make this kind of hardware/software co-design easier?

I believe that this is one of the possible future directions for the JStar project, although I'm not aware of the technical details in this case. The data-structure-free approach may permit the kinds of analyses required to allow for efficient scheduling of components onto an external FPGA, so it definitely seems like an interesting avenue to keep an eye on.

Another avenue has to be concurrent (and functional?) languages. These are designed to express concurrent computation, which is essentially what we are trying to maximise, while minimising the presence of mutable state. With mutable state largely out of the way, you are free to have copies of data lying around. There are still questions about how we maximise efficiency (and indeed, how we would even predict it).

I'm aware of efforts to compile Esterel to hardware, though I have no idea to what degree this is practical or whether it is used in industry.

It strikes me that a model-based approach to software development could be used to make the entire process a lot less painful. By starting with a high-level model, it would be possible to refine towards both hardware and software implementations of different parts of the system. Provided you could contrive acceptable heuristics for predicting performance, it seems like you could start to play this sort of game:

I have a high-level abstract specification S.

I have a set of (hardware) components:

H = {h1,h2,h3}

I have some software components that are synthesised from S, or are user-supplied:

W = {w1,w2,w3}

We want to choose subsets X = {x1, ..., xm} ⊆ H and Y = {y1, ..., yn} ⊆ W, such that:

S ⊑ x1 || ... || xm || y1 || ... || yn

Where '⊑' is some appropriate notion of refinement and '||' is parallel composition. I'm not sure if this is more tractable than simply solving the problem outright for a given language, but it certainly seems like it would be considerably easier to verify the correctness of this decomposition of the program into software and hardware components than it would be to verify the equivalent transformation performed inside a traditional compiler.
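To make the game concrete, here is a toy brute-force sketch of the selection step. Everything in it is invented: the task sets, the cost and area numbers, and the coverage check standing in for '⊑'. A real tool would need genuine performance heuristics and a real refinement check, but the shape of the search is the same.

```python
from itertools import chain, combinations

# Invented: tasks the abstract specification S requires, and an FPGA
# area budget (arbitrary units).
SPEC = {"fft", "filter", "sort"}
AREA_BUDGET = 80

# Components as (name, tasks implemented, predicted cost, FPGA area).
# Hardware components are cheap to run but consume area; software
# components are slower but free in area. All numbers are made up.
H = [("h1", {"fft"}, 1.0, 50), ("h2", {"filter"}, 0.5, 40), ("h3", {"sort"}, 2.0, 60)]
W = [("w1", {"fft"}, 20.0, 0), ("w2", {"filter"}, 8.0, 0), ("w3", {"sort"}, 12.0, 0)]

def subsets(pool):
    # All subsets of `pool`, from the empty set up to the whole pool.
    return chain.from_iterable(combinations(pool, r) for r in range(len(pool) + 1))

def best_partition():
    best, best_cost = None, float("inf")
    for x in subsets(H):          # candidate X ⊆ H
        for y in subsets(W):      # candidate Y ⊆ W
            chosen = list(x) + list(y)
            covered = set().union(*(c[1] for c in chosen)) if chosen else set()
            area = sum(c[3] for c in chosen)
            cost = sum(c[2] for c in chosen)
            # "Refinement" here is just task coverage within the area budget.
            if covered >= SPEC and area <= AREA_BUDGET and cost < best_cost:
                best, best_cost = chosen, cost
    return [c[0] for c in best], best_cost

print(best_partition())  # → (['h1', 'w2', 'w3'], 21.0)
```

Even this toy version is exponential in the number of components, which hints at why decent heuristics (rather than exhaustive search) would matter in practice.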

Is anyone aware of any projects that attempt this kind of thing? I'd be interested to hear from hardware and software people alike about their opinions on enabling programmers with no specialist hardware knowledge to use hardware in a manner that is both efficient and reasonably familiar.

1 comment:

  1. Hey, scott_s on HN. I came here after I realized the person I'm discussing lines of code as a metric with is the same person who made the comment on Perl's "compilation."

    I've gone through some of the same lines of thought regarding FPGAs and high level languages. I took a course on configurable computing, which pushed me to think these things through. You might be interested in my reactions to some of the course assignments: http://www.cs.vt.edu/~scschnei/ece5530

    I'm more of a systems hacker with an interest in programming languages, but both of us are cognitively more alike than our hardware brethren.

    Something to keep in mind is the cost of transferring data to an accelerator. When it's something like an FPGA or a GPU connected to the motherboard by PCIe, data transfer time may dominate over computation time. If your accelerated computation is lightning fast, but you can stream that data from main memory to CPU registers faster than you can transfer it to the accelerator card, it still might be faster to do the computation on the CPU.