Vaster than Empires and More Slow: the Dimensions of Scalability
J. L. Sloan
2005-08-22
Updated 2006-04-03

At the close of the last century, I found myself deeply immersed in the study of
large-scale hierarchical storage systems. By large-scale, I mean petabytes,
each of which is on the order of ten to the fifteenth bytes, or a million
gigabytes. Such systems may seem almost mundane today, in a world of gigabyte
flash MP3 players the size of your thumb. But back in 1994, the system that I
worked with
seemed colossal. The storage hierarchy ranged from tens of thousands of offline
tapes, to immense robotic tape libraries, to disk farms, to processor memories.
Each of those components was implemented as its own hierarchy of technologies, lending
a fractal aspect to the overall architecture. The system was finely tuned in
terms of tradeoffs of capacity and speed, and, not being an academic exercise
but a production system in daily use by many people, reliability and usability.

I found myself thinking a lot about cache behavior. Most folks think of a cache as
a piece of silicon in the latest microprocessor chip, or maybe just a number on
a data sheet. But caches are all over the place, at all levels in every storage
hierarchy, right down to the yellow sticky notes you place around your computer
screen to remind you to meet a friend for lunch or what your password is.

I tried, more or less successfully, to model the behavior of the storage system
with trace-driven computer simulations of cache behavior, trying to capture,
for example, the working set size of the cache formed by the disk farm. As hard
as it is to believe, disks, at least of the speed, capacity, and quantity that
we needed to keep a half-dozen supercomputers fed so that their applications
did not stall waiting for data, were expensive. Finding the right tradeoff for
disk farm capacity versus system performance was financially important enough
that it made the modeling, which was itself a supercomputer application,
worthwhile. I was sufficiently successful in my quest that I published several
papers on the topic.
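
A trace-driven simulation of this kind can be sketched in a few lines. The
Python below is only an illustration, not the model described above: it assumes
a least-recently-used (LRU) replacement policy, and the function name
`lru_hit_rate` and the sample trace are hypothetical. It replays a sequence of
dataset requests through a cache of a given capacity and reports the fraction
of requests that were hits.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Replay a reference trace through an LRU cache; return the hit rate."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[item] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: dataset names in the order they were requested.
trace = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "c"]

# Sweep the cache capacity to see where extra capacity stops helping.
hit_rates = {capacity: lru_hit_rate(trace, capacity) for capacity in range(1, 6)}
for capacity, rate in hit_rates.items():
    print(f"capacity {capacity}: hit rate {rate:.2f}")
```

For this toy trace the hit rate stops improving once the capacity reaches four,
the number of distinct datasets in the trace. A knee like that, found on a real
trace, is the sort of result that informs how much disk is worth buying.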

One of the papers I read by a fellow researcher in this area remarked that ultimately,
all problems in computer science boiled down to cache behavior. I knew this
wasn’t literally the case, but it had a ring of truth in it that I liked. All
forms of human endeavor are interconnected. If you study stamp collecting to
sufficient depth, you will probably find that most if not all of the history of
human civilization and all of its science is encapsulated in the domain of
stamps, postage, ink, glue, and the transmission of messages. Or, you could
study motorcycles, going back to the invention of the wheel, and even further
to the use of logs to roll big stone blocks. Author and television personality
James Burke has made a career of observing the connections between
technological developments that might otherwise seem unrelated.

But in computing, caches show up with almost alarming frequency for another reason.
Caches are a vital weapon in the never-ending war for scalability.
Caches exist because the microprocessor can
consume data faster than the memory can deliver it, so it pays to keep the most
often used data near at hand. Or in my case, because it takes too long for the
robot in the tape library to find the correct tape cartridge, so it pays to
keep datasets used frequently by a supercomputer user on disk. Web browsers
cache often referenced web pages on a local disk because my spousal unit is too
impatient to wait for the network to deliver the page from some far off server.

The differences in performance of microprocessors and memories, supercomputers and
tape drives, web servers and spousal units, are all well known by technologists
in those problem domains. These dimensions of scalability were painfully
familiar to the architects of these systems. But while I studied cache behavior
with the goal of improving system performance and maybe even saving money, it
occurred to me that there was another dimension of cache behavior that was
frequently ignored by system architects in their quest for scalability. This
dimension was how the relative performance of different system components
changed over time.

I began looking for the first derivatives of the performance of the various
system components, for example, how microprocessor speeds changed versus how
network bandwidth changed over time. It occurred to me that a system architected
for the appropriate balance of performance of its different subsystems today might
become sub-optimal a few years down the road.

I collected data from a variety of sources, ranging from
Microprocessor speed doubles every 2 years.
Memory density doubles every 1.5 years.
Bus speed doubles every 10 years.
Bus width doubles every 5 years.
Network connectivity doubles every year.
Network bandwidth increases by a factor of 10 every 10 years.
Secondary storage density increases by a factor of 10 every 10 years.
Minimum feature size halves (density doubles) every 7 years.
Die size halves (density doubles) every 5 years.
Transistors per die double every 2 years.
CPU cores per microprocessor chip double every 1.5 years.

What we have here is a set of mostly disparate power
curves that illustrate how the performance of major components changes with
respect to one another over time. And not much time, either. For example, in a
decade, microprocessor speed will increase by more than a factor of thirty, while
bus speed will only double, and network bandwidth will increase by an order of
magnitude. These power curves are illustrated logarithmically in Figure 1. Note that
some of the curves fall on top of one another, making them a little hard to see.
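
The arithmetic behind those decade figures is simple compounding. The Python
sketch below takes a subset of the rates listed above at face value and
projects each one over ten years. The names `growth_factor`, `RATES`, and
`decade` are hypothetical, introduced only for illustration.

```python
def growth_factor(multiple, period_years, horizon_years):
    """Compound growth: the quantity multiplies by `multiple` every period."""
    return multiple ** (horizon_years / period_years)


# (multiple, period in years), copied from the list of rates in the text.
RATES = {
    "microprocessor speed": (2, 2),
    "memory density": (2, 1.5),
    "bus speed": (2, 10),
    "bus width": (2, 5),
    "network connectivity": (2, 1),
    "network bandwidth": (10, 10),
    "secondary storage density": (10, 10),
}

HORIZON_YEARS = 10
decade = {
    name: growth_factor(multiple, period, HORIZON_YEARS)
    for name, (multiple, period) in RATES.items()
}

# Print the fastest-growing components first.
for name, factor in sorted(decade.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: x{factor:,.1f} in {HORIZON_YEARS} years")
```

Doubling every two years compounds to a factor of 32 over a decade, which is
the "more than a factor of thirty" quoted for microprocessor speed, while bus
speed comes out at 2 and network bandwidth at 10. Changing `HORIZON_YEARS`
shows how quickly a balance struck today drifts apart.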

Admittedly, some of these power curves from my research in 1994 have gotten a little shaky
lately as the manufacturers have had to resort to more and more arcane measures
to maintain their rate of improvement. Somewhat ironically, many of these measures
involve the introduction of yet more caches. But if you just buy into the
concept that different technologies are on very different exponential curves of
performance improvement, then you are pretty much forced to admit that the
balanced system architecture you designed today might not cut the mustard just
a few years down the road. Design decisions which made sense at the time for
the trade-off of processor speed versus network bandwidth may not seem as wise
much later.

In one of my favorite papers, “Software Aging” [Proc. 16th IEEE Int. Conf. on
Soft. Eng., May 1994], computer
scientist David Parnas describes how software systems bow to entropy just as
mechanical systems do, mostly due to the cumulative effects of changes made over
the life span of the system, as new features are added, or as the system is
adapted to changes in its environment. This is like radiation damage to DNA;
eventually, slowly, the cumulative effects become lethal. This temporal dimension of scalability illustrated
by the power curves in Figure
1 is an example of the kinds of environmental changes – network bandwidth,
faster servers, more remote users – that are likely to occur for any system.
Note that the system architect does not have a choice here. Given world enough
and time, technology marches on, and failing to adapt a system to the changing
climate would be equally lethal.

Lawrence Bernstein was a Bell Labs wonk at a time when Bell Labs was still capable of
generating Nobel Prizes, before it too joined the list of sad, neutered
corporate research labs. In 1997, he wrote a paper called “Software Investment
Strategy” [Bell Labs Tech. J., Summer
1997] in which he made the following observation: improvement in programmer
productivity, as measured by the ratio of source lines of code to machine
instructions, was also on a sort of power curve, albeit with a few twists and turns
here and there. The technologies contributing to this improvement in
productivity ranged from high level languages early on, to timesharing, and
later to object-oriented design and implementation. He predicted large-scale code reuse by 2000.
It would be easy to think he missed the mark on this one. But when you stop to
consider the impact made by the C++ Standard Template Library (STL), the
exploding popularity of design patterns, the industry that has grown up around
reusable managed components for Java or Microsoft’s .NET, or even the open
source movement, Bernstein may not have been off the mark. His curve of
programmer productivity is shown in Figure 2.

This brings us to another dimension of scalability: process, that is, the techniques and tools we use to design,
develop, and implement the systems we build. Many years ago, my beloved and
frequently exasperating mentor, Bob Dixon, observed that technological
development was recursive: you could design more powerful microprocessors
because the tools you were using to do so were based on the prior generation of
slightly less powerful microprocessors. This creates a kind of positive feedback
loop. The same observation could probably be made of mechanical design all the
way back to the start of the Bronze Age, and maybe much earlier. More than one author
(Vernor Vinge and his science fiction and non-fiction writing on the
Singularity immediately come to mind) has made fruitful use of this idea.

The processes we use to build our systems leverage, in part anyway, off those same
power curves. Should our processes and tools not keep pace with the technology,
we will surely have increasing difficulty grappling with the construction of
those systems as they get ever larger and faster. If you are looking for an
argument for moving from procedural to object-oriented languages, for using
libraries like the STL, for trying out an integrated development environment, for
upgrading your servers and desktops, or even for replacing an ad hoc
development process with a more formal one, this might be it.

Where we apply those processes and tools matters just as much to the scalability of
our systems as how we apply them. In his book Object-Oriented and Classical Software Engineering [McGraw-Hill,
2002], Stephen Schach describes the relative proportion of cost for each of the
phases of the life-cycle of a software development project. Schach’s numbers
are shown in Figure 3.
He comes to a conclusion that some (but not all) will find startling: 67% of the
cost of software is in its maintenance, that is, changes made to the software after the
project is deemed complete. Some organizations with experience in maintaining
large code bases, and by large here I mean millions of lines of code, place
this number closer to 70% or even higher.

This harkens back to Parnas’ idea of software entropy. Software systems become more
and more expensive to modify over time due to the cumulative effect of changes.
Then it is no surprise that the bulk of the cost of developing software is in
making these changes. This is as much a limit to scalability as processor speed
or network bandwidth. Increases in efficiencies in tools and processes must be
applied not just to new code development, but to long-term code maintenance as
well. This is an area where code
refactoring – the ability to substantially improve the design of code,
including improving its ability to be maintained, without altering its external
behavior – will continue to play a major role. Likewise, this calls for more
thought into designing new code to be easy to modify, since the effort spent in
changing code after the fact is the bulk of the cost of software development.
In my experience, this issue is largely ignored among software developers
except among the refactoring proponents. Most developers, and truth be told
their managers as well, if widely used development processes are any indication,
are happy if code passes unit testing and makes it into the code base
anywhere near the delivery deadline.

I find that this long-term cost is seldom taken into account, and its omission
arises in sometimes surprising ways. I once heard a presentation from a
software development outsourcing company. It happened to be based in

I almost leaped from my chair, not because I was angry, but to go found a
software development company based on this very business model. The idea of
low-balling the initial estimate then making a killing on the 67% of the
software life-cycle cost pie was a compelling one. Only two things stopped me.
First, I had already founded a software development company. And second, I had
read a similar suggestion made by Dogbert in a recent Dilbert cartoon strip, so
I knew that everyone else already had the same idea. It is as if once we
deliver a line of code to the base, we think that the investment in that code
is over. In fact, Schach tells us it has just begun. Every single line of code
added to a code base contributes to an ever increasing total cost of code
ownership.
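
To put a number on that, here is a back-of-the-envelope sketch in Python. It
assumes, purely for illustration, that Schach's 67% maintenance share holds and
that everything spent up to delivery makes up the remaining 33%. The function
names `lifetime_cost` and `maintenance_cost` and the sample figure are
hypothetical.

```python
def lifetime_cost(initial_cost, maintenance_share=0.67):
    """Estimate total life-cycle cost from the cost spent up to delivery.

    Assumes the cost up to delivery is the remaining (1 - maintenance_share)
    of the total, which is an illustrative simplification.
    """
    if not 0.0 <= maintenance_share < 1.0:
        raise ValueError("maintenance_share must be in [0, 1)")
    return initial_cost / (1.0 - maintenance_share)


def maintenance_cost(initial_cost, maintenance_share=0.67):
    """Estimate the cost incurred after delivery."""
    return lifetime_cost(initial_cost, maintenance_share) - initial_cost


# Hypothetical project: 100 units of cost spent up to delivery.
delivered = 100.0
total = lifetime_cost(delivered)
after = maintenance_cost(delivered)
print(f"up to delivery: {delivered:.0f}")
print(f"maintenance:    {after:.0f}")
print(f"lifetime total: {total:.0f}")
```

Under these assumptions, every unit spent getting code to delivery implies
about two more units spent maintaining it, which is why a low initial bid can
still be good business for whoever holds the maintenance contract.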

There are more dimensions to system scalability than just balancing the performance
of the various components. You must take into account how the relative
performance of those components changes over time. You must apply scalable
processes to the development of the system. And you must consider the long-term
maintenance of the code. Failure to take any of these issues into account
limits the scalability of your system just as surely as if you had designed it
around obsolete technology.
(The author would
like to acknowledge a debt to Andrew Marvell, whose poem “To His Coy Mistress”
inspired this article, and which is probably the best pick-up line of the 17th
or any other century.)
