Review of Jack Ganssle’s Seminar
“Better Firmware Faster”
J. L. Sloan
Digital Aggregates (www.diag.com) footed the bill to send me to Jack Ganssle's one-day seminar "The Best Ideas for Developing Better Firmware Faster". I've been a fan of Ganssle (and his website, www.ganssle.com) for years. He's the technical editor and regular contributor to EMBEDDED SYSTEMS DESIGN magazine, the author of several books on firmware development, and (according to him) the founder of three companies which he later sold. His seminar was held at the Sheraton Denver Tech Center. There were about 25 or so people in attendance. His handouts are available at www.ganssle.com/misc/bi.pdf. (Warning: this review does not follow his slides sequentially.)
Digital Aggregates’ website states that they "apply object-oriented design to embedded, hard and soft real-time, message-oriented, and device applications", so I was pretty sure Ganssle's seminar would be very applicable to DIAG's mission, but his stuff is typically much more broadly applicable to any kind of software development.
Overall Ganssle's presentation was about 70% on managing the development process (step 1: have a process), and about 30% technical tricks that are fun to know and tell (how to build an R-2R resistor ladder so you can measure instantaneous CPU utilization with a $15 voltmeter).
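The voltmeter trick is worth sketching. This is my sketch of the idea, not Ganssle's code, and the register name and address are hypothetical: drive an output pin high while the CPU is doing real work and low in the idle loop. A DC voltmeter averages the resulting square wave, so the reading is proportional to CPU load; the R-2R ladder across several pins just buys you finer resolution.

```c
/* Hypothetical memory-mapped GPIO register; on real hardware substitute
 * your part's actual output port address. */
#define GPIO_OUT (*(volatile unsigned char *)0x40001000uL)

void work_enter(void) { GPIO_OUT |= 0x01; }   /* entering real work */
void idle_enter(void) { GPIO_OUT &= ~0x01; }  /* back in the idle loop */

/* What the meter displays: the time-average of the pin voltage. */
double meter_reading(double vcc_volts, double busy_fraction)
{
    return vcc_volts * busy_fraction; /* 5.0 V supply at 50% load reads 2.5 V */
}
```

The averaging is done for free by the meter's own slow response; no instrumentation code beyond the two pin writes is needed.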
He mentioned that although there is a lot of concern about outsourcing and continued employment, employment among embedded developers is near an all-time high, the amount of embedded firmware continues to rise dramatically, and sales of 8-bit, 16-bit, and 32-bit embedded CPUs continue to rise (4-bit CPUs pretty much died around 2003).
As far as languages go, Java has only 1% of the embedded space, a "non-starter" (it occurred to me that I was probably the only developer in the room that has actually shipped a Java-based embedded product, ca. 1999). C++ has about 15% of the 32-bit market, and there was a lot of extracurricular interest from other attendees when I mentioned that I had a long history of using C++ in the embedded space. (My first use of C++ in an embedded/real-time system was around 1988 or so, IIRC, on embedded DEC LSI-11 processors, using a "roll my own" RTOS. To show how young and stupid I was, I dropped it in favor of FORTH.) Ganssle mentioned EC++, a dialect of C++ for embedded use that originated in Japan and is supported by many compilers; Doug Gibbons has mentioned it to me on more than one occasion. Ganssle also mentioned the MISRA dialect of C and C++, which came out of the automotive industry in the U.K. (Another hero of mine, Les Hatton, a British pundit who writes about safety-critical software systems, also pushes the MISRA dialect.) The massive DIAG library has a hardcopy of the MISRA C specification somewhere around here. MISRA is the Motor Industry Software Reliability Association.
When comparing languages, Ganssle notes that in one study, 40% of C bugs took more than two hours to fix, while 70% of C++ bugs took more than two hours to fix. He admitted he didn't know how to interpret this. (Were the C++ bugs design problems rather than simple coding problems? Was C++ used in larger, more complex projects, and hence the bugs were more complicated? Were the C++ developers less experienced in the language? I dunno. I'll have to track down his reference.) [This is from Les Hatton's paper "Does OO really match the way we think?" which appeared in IEEE Software.]
Ganssle had several remarks on the pace of technological change and our ability to make use of new technology. The power density of an Intel P4 processor is within less than one order of magnitude of that of a nuclear reactor; getting the heat out of the high-end chips is a growing problem. Gutenberg may have invented the printing press, but he still had his Bibles hand-illuminated, limiting his ability to mass-produce them.
Ganssle spoke at length on code reuse. He cites my favorite reuse book, CONFESSIONS OF A USED PROGRAM SALESMAN by Wil Tracz. (Of course, Tracz didn't have a choice: at the time he was preaching reuse, he was working for a defense contractor, and the DoD was mandating code reuse; Tracz made a pretty nice career out of it.) Ganssle made all of the usual arguments towards reuse, and while I agree with all of them, I was kind of tired of hearing about it. Having had a major reuse initiative failure in my career, Ganssle was hard pressed to make any arguments that I hadn't already made myself. Still, he had some interesting points.
Firmware is the most expensive thing in the universe: the F-4 jet fighter cost about $3M in inflation-adjusted 2006 US dollars and had a firmware content of 0%. The F-22 cost about $30B, and had 50% firmware content. In fact, Ganssle claims that the manufacturer boosted the firmware content in the F-22 in a deliberate attempt to add to the bottom line. (Same basic rationale as why everyone wants to be a software company: nearly zero manufacturing cost once you amortize R&D.) [Some research has led me to believe Ganssle is quoting the original development cost, not the per-unit cost of the F-22.]
Typical bug rates in C are about 50 to 100 bugs per KLOC. This alone is a good argument for reuse: debug once, use again and again.
IBM says that the schedule grows faster than productivity: as projects get bigger in terms of staff-years, the monthly output of code per developer drops.
(I didn't plot this, but it occurs to me that on very large projects, code production falls asymptotically to zero.) Some of this is due to human communication overhead (the old (N*(N-1))/2 interconnection issue, which is O(N^2)). [Sharp-eyed friend and former colleague David Hemmendinger noted that it was easy to see the successive halving in the code productivity with the order of magnitude increases in schedule.]
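The interconnection arithmetic is easy to check; a one-liner of mine, for illustration:

```c
/* Pairwise communication channels among n developers: n*(n-1)/2, which
 * grows as O(n^2).  Ten developers already need 45 channels; a hundred
 * need 4,950. */
unsigned channels(unsigned n)
{
    return n * (n - 1) / 2;
}
```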
Software estimation guru Barry Boehm says the basic formula is
TIME = C * (KLOC ^ M)
where TIME is the time to complete the project, and C and M are constants that depend on, for example, whether you have any real-time constraints. Ganssle says typical M for embedded systems is 1.5 to 2.0.
Ganssle cited another hero of mine, Fred Brooks, author of THE MYTHICAL MAN MONTH: "Adding people to a late project only makes it later". It's easy to see how this happens, since it adds a lot of communication overhead but relatively little (or no) productivity. Ganssle also showed how the McCabe Cyclomatic Complexity metric (a measure of algorithmic complexity) rose non-linearly with KLOC.
Ganssle says the only way to battle this is to relentlessly partition the problem at all levels. This includes at the hardware level: it's better to have multiple processors, and split the developers and the task at hand among them, than to kid yourself that you're lowering COGS by putting everything on a single CPU. (It occurs to me this is one of the driving forces behind the rise in multicore chips.) And it's better schedule-wise to put your best developers on the smaller stuff: productivity falls as KLOC rises, and as KLOC rises the productivity differential between your most-productive and your least-productive developers narrows.
[I just finished reading Peter Morville's book AMBIENT FINDABILITY, where he cites multiple studies that show that if you graph information flow density (X) against decision quality (Y) you get a hump-backed curve indicating that more information is not better. There is a sweet spot of information bandwidth in which every human operates. Too little OR too much information leads to a degradation in decision quality. This seems to be a recurring theme when you read post-hoc analyses about disasters or near-disasters. I certainly find that I am personally bandwidth limited – sometimes I give up keeping up with electronic mail or voice mail. Marketing folks talk about the battle for consumers' attention. I suspect that as a project gets more complex, the additional information and greater connectivity among participants drives developers into information overload and leads to poor decisions. And while I'm at it: "continuous partial attention" is a myth, an excuse people use when they should just say "you're too boring, so I'm reading email".]
Ganssle also promotes a lot of other ideas that sound familiar: simulate, use evaluation boards, cross develop on PCs or workstations, use object oriented design (particularly encapsulation), and ruthlessly manage the schedule by trimming features. (In fact, 99% of what Ganssle said on Wednesday sounded familiar, because most of his "best practices" were all exploited ten years earlier by the Lucent Technologies Inc. Definity ATM firmware and hardware teams, the former led by Tamarra Noirot and Randy Billinger, the latter by Bob Kalisch and Mike Ross. I thought these folks were brilliant when I worked with them in 1996, but now I'm just beginning to appreciate how spoiled I was and that the rest of the embedded community is still trying to catch up to what they were preaching a decade ago. Way to go, kids.)
Ganssle diverted into some hardware stuff that was pretty interesting (but may be old hat to my hardware engineer readers). Ringing on signal lines causes even slow systems to act as if they were really fast, due to decomposition (think Fourier Transform) of the bouncing into high-frequency components. A perfect square wave ("thank god we don't actually have them") Fourier transforms into an infinite series of ever-higher-frequency harmonics. The moral: if you're dealing with firmware handling signals (e.g. a bit in a status register), don't be lulled into thinking you have a "slow" system just because the clock rate is slow. (I vividly remember Bob Kalisch patiently explaining this to me one time, and he even cited the same hardware book as Ganssle: HIGH-SPEED DIGITAL DESIGN: A HANDBOOK OF BLACK MAGIC by Johnson and Graham, also in the massive DIAG library.)
Ganssle recommended a lot of modern practices: test-driven development, code inspections (inspections are far cheaper than debugging, and far more cost effective), and refactoring problematic modules. Barry Boehm says that 80% of the bugs fall in 20% of the modules (another classic 80/20 rule); IBM says 57% of the bugs are in 7% of the modules; Gerald Weinberg said 80% in 2%. Ganssle: carefully measure bug rates, and refactor/rewrite/otherwise toss out problematic modules: left alone, they'll end up costing you about 4x as much as if you'd dealt with them; plan on tossing about 5% of your code for just this reason. Also, check out static analysis tools. (DIAG has evaluated several static analysis tools, sort of like super Lint processors. They're great with C, but aren't much better than turning the warning level on the GNU C++ compiler to "incredibly picky".) Funniest recommendation: do prototyping, but do so in a language that cannot ship.
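Once you are measuring per-module bug counts, identifying the hot spots is mechanical. Here is my sketch of the bookkeeping (mine, not a tool Ganssle endorsed): sort modules by defect count and see how few of them account for, say, 80% of the bugs.

```c
#include <stdlib.h>

/* Comparator: sort defect counts in descending order. */
static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

/* How few modules hold the given share (e.g. 0.80) of all recorded bugs?
 * Those are the refactor/rewrite/toss candidates.  Sorts bugs[] in place. */
int modules_holding(int *bugs, int n, double share)
{
    int total = 0, running = 0, i;
    qsort(bugs, (size_t)n, sizeof *bugs, desc);
    for (i = 0; i < n; i++)
        total += bugs[i];
    for (i = 0; i < n; i++) {
        running += bugs[i];
        if (running >= share * total)
            return i + 1; /* count of worst-offender modules */
    }
    return n;
}
```

If the 80/20 rule holds for your project, the returned count will be around a fifth of your module total, and those are the modules to consider tossing.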
He mentioned the same "broken windows" theory that was described by Malcolm Gladwell in his book THE TIPPING POINT, and also by the hosts of the NPR show CAR TALK. If a window is broken in a building in an urban neighborhood and isn't promptly repaired, it is the beginning of the decay of the entire neighborhood. It is an "epidemic" theory of urban decay that has been observed in New York City. The CAR TALK guys apply it to automobiles: if something minor goes wrong with your car, and you don't fix it promptly, you're already on the way to the junk yard. Ganssle argues this is the case (and I agree) when it comes to software. If you start ignoring known bugs in software, it is the beginning of the end. Not only are you sending the wrong quality message ("it's okay to ship crap"), not only are you doing harm to your image among your customers, but if you accumulate enough bugs, the product will eventually not be recoverable. And remember, your customer is ultimately the only judge of quality for your product: "good" is what the customer says is "good". Your opinion doesn't matter.
The waterfall method does not work. (I would claim: it never worked. Development is by nature fractally iterative.) Even the DoD has dropped the waterfall method in favor of iterative processes. Ganssle talked about feature-driven development (the DIAG library has a book by Palmer and Felsing on this very topic). An "Observable Feature" is one the customer can see; observable features are ones you can negotiate out in order to ship earlier. Derived features are ones that come with the territory ("get the RTOS working", "upgrade the GNU compiler", etc.) and are not really customer observable. Ganssle also pushed the Wideband Delphi estimation technique, which I've actually used (the underlying Delphi method came out of the RAND Corporation in the 1950s). (I won't belabor it here, lots of web resources on it, and the basic technique has been copied by many Agile processes, many of whom would have you believe they invented it.)
It's okay to make quick hacks in order to meet a schedule: time to market can be vitally important. But you must have time to go back and make the real fix. Entropy, a.k.a. Software Rot, will kill you eventually: bad code penalizes you exponentially (due to the cost in maintaining it over time), but cleaning up and refactoring penalizes you linearly. You absolutely must pay off the "technical debt" of quick hacks. (One of my favorite papers, “Software Aging” by David Parnas, is on this very topic.)
Other factors that you might not expect will influence the schedule: developing a system that will run at 90% CPU utilization will roughly double development time; 95% utilization will triple it. (Example: the DSPs on one project I worked on, in which space and time were so tight that you had to mine the code for individual machine instructions to remove just to make a simple fix. And then you counted machine cycles because you were worried that you would run a few microseconds over your time budget during the worst-case load. As you might expect, that led to very, very brittle code that no one wanted to touch.)
Ganssle cited a lot of stuff from another hero of mine, Tom DeMarco, specifically DeMarco and Lister's book PEOPLEWARE. For example: it takes about 15 minutes from an interruption for a developer to re-achieve a "state of flow". Developers are typically interrupted every 11 minutes. (This is why I typically come in early and stay late: most of my hard thinking gets done before anyone else comes in and after they leave. It is also why I try to work at home one day a week.)
The CMU SEI Capability Maturity Model came up -- personal bias: I don't buy it. Linux was developed by an organization that wasn't even at CMM Level 1, and (as Les Hatton has argued) is arguably the most stable, reliable piece of software in the world. My interpretation: the CMM is one way to achieve software quality, but it is not the only way, and perhaps not even the best way. Amusingly, Ganssle presented the CiMM, the Capability Immaturity Model:
Level 0 Negligent (Indifference)
Level -1 Obstructive (Counter Productive)
Level -2 Contemptuous (Arrogance)
Level -3 Undermining (Sabotage)
and argued that many development organizations were operating at one of these levels.
He also talked a little about Extreme Programming (XP), and I took him to task. My opinion: XP makes several economic assumptions: communication among developers is cheap, testing is cheap, and refactoring is cheap. If this isn't the case for your project, XP will not be effective. All of the XP process steps are interdependent: skip any one of them and the process breaks down. [David Hemmendinger notes that the XP proponents would no doubt concede my point.]
Towards the end of the seminar, Ganssle talked about some horror stories of firmware failures, many of which would be familiar to anyone that's read Ganssle or Les Hatton (www.leshatton.org). All of them had the following common elements:
o inadequate testing
o no code inspections
o untested or crappy exception handlers
o tired people make mistakes
o lack of adequate tools (specifically a version control system)
These were firmware failures that either killed people or cost hundreds of millions of dollars. (Don't think a telephony product can kill people? Consider what happens if E911 doesn't work, or is misdirected, or if the caller-id used by the E911 operator isn't correct.)
That last factor astounds me, particularly since it seems to be a common sin among hardware folk. Am I the only one who personally knows of a manager (and there has been more than one) who had to go dumpster diving for the only PC that had the FPGA source code for a shipping product on it? Not that software folk are without sin: in 1989 I joined a multi-developer project that had no source control, and the first thing I did was institute the use of SCCS (because it came with SunOS, our development platform). I used SCCS when I worked at a university, even though I was the only developer on the project. And I use CVS (and before that, RCS) on my own open source work at home. I suppose if I didn't use a version control system, DIAG would have problems firing me (because I own the company), but lord knows, they should try.
(On the subject of disasters, I can also recommend the book INVITING DISASTER by James Chiles. Most of it reads like a Stephen King novel EXCEPT IT IS ALL TRUE! Disasters that “couldn’t happen” happen because a bunch of tiny, otherwise insignificant, failures all line up in time, and the reactor core melts down, or the airliner crashes, etc.)
Ganssle summed up his seminar:
o Manage the schedule by prioritizing features.
o Give your boss options.
o Manage development by adopting incremental, test-driven development.
o Manage complexity by partitioning the problem (hardware and firmware).
o Manage bugs by measurement, and by not letting them slide.
o Manage your technical debt by replacing quick hacks with real fixes.
o Use a version control system, even for the hardware.
o Build quiet productive work areas.
o Adopt a coding standard.
o Do code inspections.
o Measure code productivity rates.
o Study software engineering.
Was Ganssle's seminar worth the ~$700 and one work day that DIAG paid for it? Yeah, I think so. And it was a great networking opportunity too. (One of the guys I ate lunch with has two Harleys in his garage. :-) If I have any criticism, it is that Ganssle should include references and footnote his slides. I'm the type to do more research and chase down sources. However, he was generously forthcoming with sources when I followed up with questions via e-mail. A day spent learning useful stuff sure beats working. Thanks, Jack, well done!