From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 22 Jul 2005 00:30:56 -0700
Message-ID: <[email protected]>

Tom Linden wrote:
> On Wed, 20 Jul 2005 12:07:33 -0700, glen herrmannsfeldt
> <[email protected]> wrote:
>
> > That was 1986, though I don't believe VAX was a viable architecture
> > by 1996, even though it wasn't out of address bits.
>
> It certainly could have been viable, had DEC continued its development
> and not squandered its resources on the Alpha adventure.

Sigh ... that opinion is strongly at variance with the facts in the real
world; the best engineers in the world (and DEC had plenty) couldn't have
implemented viable (in the sense of being truly competitive) VAXen in
1996... I think the last VAXen were shipped in 2000, as installed base
always has inertia...

I'd suggest reading a fine paper by a couple of the best computer
architecture performance people around, both of whom were senior DEC
engineers:

Dileep Bhandarkar, Douglas W. Clark, "Performance from Architecture:
Comparing a RISC and a CISC with Similar Hardware Organization," ACM
SIGARCH CAN, 1991 [and a couple other places]. A copy can be found:
http://www.cs.mipt.ru/docs/comp/eng/hardware/common/comparing_risc_and_cisc_proc/m

A long, serious, competent analysis by world-class DEC people includes:

                      VAX 4000/300   MIPS M/2000
  system ship date    1990           1989 [really 4Q88]
  cycle time          28ns           40ns
  List price          $100K          $80K
  SPEC89 integer      7.7            19.7
  SPEC89 float        8.1            16.3

====
"So, while VAX may 'catch up' to *current* single-instruction-issue RISC
performance, RISC designs will push on with earlier adoption of advanced
implementation techniques, achieving still higher performance. The VAX
architectural disadvantage might thus be viewed as a time lag of some
number of years."

The summary: "RISC as exemplified by MIPS offers a significant processor
performance advantage over a VAX of comparable hardware organization."

And this proved to be true, even with superb CMOS VAX implementations
done in the early 1990s.

When computer companies design successive implementations of some ISA,
they tend to accumulate more statistics to help tune designs, they keep
tricks that work, they avoid ones that don't, i.e., it is a huge
implementation advantage to have done multiple designs over many years.
The paper especially discusses VAX designs that shipped in 1986 (VAX
8700) and 1990 (VAX 4000/300), 8 and 12 years, and many implementations,
after 1978's VAX 11/780. These were done by experienced design teams
backed by a large, profitable revenue stream.

By comparison, the MIPS R2000 CPU first shipped in 1986, the R2010 FPU in
1987, and the R3000/R3010 in 4Q88. The R3000 used an R2000 core with some
improvements to the cache interface, and the R3010 was a shrunken R2010,
i.e., there wasn't a lot of architectural tuning, and of course, that was
done by a small startup that did *not* have a big revenue stream :-)

BOTTOM LINE: DEC had every motivation in the world to keep extending the
VAX as long as possible, as it was a huge cash cow. DEC had plenty of
money, numerous excellent designers, long experience in implementing
VAXen. BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...
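One way to read that table is to separate the clock-rate difference from
the architectural difference. A quick back-of-envelope sketch in Python
(my own arithmetic on the numbers quoted above, not a calculation from
the paper, which uses a more careful per-benchmark methodology):

```python
# SPEC89 integer results and cycle times from the comparison above.
vax_specint, mips_specint = 7.7, 19.7      # VAX 4000/300 vs MIPS M/2000
vax_cycle_ns, mips_cycle_ns = 28.0, 40.0   # note the VAX has the FASTER clock

# Raw ratio: how much faster the MIPS system is on SPEC89 integer.
raw_ratio = mips_specint / vax_specint

# Normalize away the clock difference to isolate the architectural
# (cycles-per-program) gap: the MIPS wins despite a ~40% slower clock.
per_clock_ratio = raw_ratio * (mips_cycle_ns / vax_cycle_ns)

print(f"raw: {raw_ratio:.2f}x, per-clock: {per_clock_ratio:.2f}x")
```

The raw ratio comes out around 2.6x, and normalizing for the VAX's faster
clock pushes the per-cycle architectural gap well past 3x, which is what
makes "catching up" so hard.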
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 22 Jul 2005 14:38:18 -0700
Message-ID: <[email protected]>

Anton Ertl wrote:
> "John Mashey" <[email protected]> writes:
> >Sigh ... that opinion is strongly at variance with the facts in the
> >real world; the best engineers in the world (and DEC had plenty)
> >couldn't have implemented viable (in the sense of being truly
> >competitive) VAXen in 1996...
>
> How do you know? AFAIK in the real world in which I live the
> engineers at DEC did not try to implement a new VAX in 1996; instead,
> VAX development stopped pretty soon after the release of the Alpha.

I'm surprised any long-time participant in comp.arch would ask, since
this topic has been discussed here numerous times, including the topic of
architectural issues that made the (elegant) VAX and (fairly elegant)
68020 more difficult to do cost-effective fast implementations for than
the (simpler) 68010 and the (inelegant) X86.

Once again, I will remind people that RISC is a label for a specific
style of ISA design, and the bunch of ISAs that more-or-less fit that are
relatively similar. CISC was coined to describe "the rest", which covered
an immense range of different ISAs. IBM S/360, X86, and VAX are
*different*, and an analysis comparing X86 to (a) RISC does nothing to
invalidate an earlier comparison of VAX to (a) RISC.

So, how do I know? Well, in the *real* world of CPU architecture, there
were a fairly modest number of people who actually did this for a living,
and many of them knew each other, moved among companies, went to the same
conferences, recruited each other for panels, worked on committees (like
SPEC), traded benchmarks, and talked informally in bars. I've said before
here that I'd more than once been kidded by VAX implementors about how
easy RISC guys had it, and then they'd cite various horrible weird
special cases that had to be handled and that got in the way of
performance. I wouldn't attribute that to anyone, of course.

But, the bottom line is that the engineers who *knew* the VAX best, who
implemented it many times, and many of whom were *really, really good*
engineers, came to believe that they simply could not keep implementing
competitive CPUs that were VAX ISA. Some of them were already starting to
think that in the mid-1980s, but lots more thought so a few years later,
and so did certain DEC sales managers, who knew that if the customer
wanted VMS, they won, but if the customer wanted some UNIX, they just
couldn't compete. The VAX 9000 fiasco didn't help. There were some fine
CMOS VAX chips done, but it was just too hard, even with mature, good
compilers.

There were various internal DEC RISC efforts, and at a certain point
[when DEC chose to use MIPS R3000s for some products], the most
irritating thing to some DEC engineers was that they kept getting grabbed
back off RISC investigations to help do more VAXen. In the *real world*
of (very competent) designers who made their living doing VAXen, they
just couldn't figure out how to keep doing it competitively.

I will point out that many of the Alpha folks had done VAX
implementations and software and performance analysis, i.e., guys like
Witek, Sites, Dobberpuhl, Uhler, Bhandarkar, Supnik. Anyone who has the
opinion that it was reasonable to be designing new VAXen in 1996,
expecting them to be competitive, has to believe these guys are clueless
idiots.

NOTE: that doesn't mean that I am claiming "so, they had to do Alpha, and
do it the way they did it", as there were other options. I'm just saying
that continuing on a VAX-only path wasn't believable to their experienced
engineers, whose technical skills I hold in high regard, often from
repeated first-hand contact.

> >I'd suggest reading a fine paper by a couple of the best computer
> >architecture performance people around, both of whom were senior DEC
> >engineers:
> >
> >Dileep Bhandarkar, Douglas W. Clark, "Performance from Architecture:
> >Comparing a RISC and a CISC with Similar Hardware Organization," ACM
> >SIGARCH CAN, 1991 [and a couple other places]. A copy can be found:
> >http://www.cs.mipt.ru/docs/comp/eng/hardware/common/comparing_risc_and_cisc_proc/m
>
> Well, a few years later Dileep Bhandarkar, then employed at Intel,
> wrote a paper where he claimed (IIRC) that the performance advantage
> of RISCs had gone (which I did not take very seriously at the time);
> unfortunately I don't know which of his papers that is; I just looked
> at "RISC versus CISC: a tale of two chips" and it looks more balanced
> than what I remember.

That was another fine paper from Dileep, but the conclusion:
   X86 can be made competitive with RISC
is not the same as:
   VAX can be made competitive with RISC

> >BOTTOM LINE:
> >
> >DEC had every motivation in the world to keep extending the VAX as
> >long as possible, as it was a huge cash cow. DEC had plenty of money,
> >numerous excellent designers, long experience in implementing VAXen.
> >
> >BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...
>
> Looking at what Intel and AMD did with the 386 architecture, I am
> convinced that it is technically possible to design competitive VAXen;
> I don't see any additional challenges that the VAX poses over the 386
> that cannot be addressed with known techniques; out-of-order execution
> of micro-instructions with in-order commit seems to solve most of the
> problems that the VAX poses, and the decoding could be addressed
> either with pre-decode bits (as used in various 386 implementations),
> or with a trace cache as in the Pentium 4.

You're entitled to your opinion, which was shared by the VAX 9000
implementors. Many important senior VAX implementors disagreed. I've
posted some of the reasons why VAX was harder than X86, years ago. Of
course you can do these things, but different ISAs get different mileage
from the same techniques.

> Of course, on the political level it stopped being possible to design
> competitive VAXen, because DEC had decided to switch to Alpha, and
> thus would not finance such an effort, and of course nobody else
> would, either.

Ken Olsen loved the VAX and would have kept it forever. Key salespeople
told him it was getting uncompetitive, and engineers told him they
couldn't fix that problem, and they'd better start doing something else.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 28 Jul 2005 23:56:29 -0700
Message-ID: <[email protected]>

Tom Linden wrote:
> But again, back to the familiar theme, had VAX
> received the billions that alpha received it too would be spinning today
> at 4GHz.

Nonsense. Not in the Real World where economics and ROI actually matter.

1) When the VAX was designed (1975-), PL/I may have been the third most
important language, after FORTRAN and COBOL, especially if one wanted to
attack the IBM mainframe market. Of course, the VAX itself was created by
the agonizing decision that the PDP-11 could not be extended further
upward, but the VAX certainly catered to PL/I and COBOL.

2) In 1988, the tradeoffs were different, and the fraction of new code
being written in PL/I had decreased, and DEC's priorities were different.
I understand that the VAX->Alpha transition might be less than optimal,
especially for a PL/I vendor (like Kednos). That's Life.

3) Read the article by Bob Supnik:
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

Speaking of 1988:

"Nonetheless, senior managers and engineers saw trouble ahead.
Workstations had displaced VAX VMS from its original technical market.
Networks of personal computers were replacing timesharing. Application
investment was moving to standard, high-volume computers. Microprocessors
had surpassed the performance of traditional mid-range computers and were
closing in on mainframes. And advances in RISC technology threatened to
aggravate all of these trends.

Accordingly, the Executive Committee asked Engineering to develop a
long-term strategy for keeping Digital's systems competitive. Engineering
convened a task force to study the problem. The task force looked at a
wide range of potential solutions, from the application of advanced
pipelining techniques in VAX systems to the deployment of a new
architecture. A basic constraint was that the proposed solution had to
provide strong compatibility with current products. After several months
of study, the team concluded that only a new RISC architecture could meet
the stated objective of long-term competitiveness, and that only the
existing VMS and UNIX environments could meet the stated constraint of
strong compatibility. Thus, the challenge posed by the task force was to
design the most competitive RISC systems that would run the current
software environments."
....
"The original study team was called the "RISCy VAX Task Force." The
advanced development work was labeled "EVAX." When the program was
approved, the Executive Committee demanded a neutral code name, hence
"Alpha."

Put another way, the EVAX was *supposed* to be an aggressive VAX
implementation, whose goals were to be extended to 64-bit, and close the
performance gap with RISCs, but after serious work, a fine engineering
team (my opinion) concluded the problems weren't solvable. Just as in
1975, they decided they *had* to make an architecture change.

4) DEC designers certainly understood the application of OOO techniques
by 1991/1992, and knew the basics of what EV6 was going to look like, and
could have applied that to the VAX for 1996. BUT, I don't think anybody
serious believed it was possible for a VAX design to match an Alpha
design in performance, given similar design costs and time-to-market.
Either the VAX would need a huge design team [way beyond the 100 or so
microprocessor designers they had in the 1980s], or it would take a lot
longer to do.

[I have a bunch of email from senior ex-DEC engineers involved in VAX and
Alpha implementations, and I wouldn't quote them without their
permission, but words like "uninformed opinion" and "revisionist history"
were prominent in descriptions of this thread.]

5) I don't have time or interest in being serious about sketching an OOO
VAX design, and of course, no one in their right mind would do that
without access to the traces from many VAX codes, and lots of CPU cycles
for simulation, since of course, serious designs can only be done with
statistics. Handwaving has zero credibility. On the other hand, I have
had some email discussions with engineers who've implemented VAX
microprocessors, and vetted some of the ideas about performance
bottlenecks. I know this might seem strange, but I actually ascribe high
credibility on this topic to competent people who have implemented
multiple VAX micros...

If I get time in the next week, I'll try to consolidate that, at least to
sketch the sorts of VAX architectural issues deemed to be hard.... and
the main issues *aren't* the decoding of complex instructions, as much as
the later-stage execution issues that make it difficult to get as much
parallelism as you might expect in any reasonable implementation. The
best I can do is a sketch, because I don't have the statistics ... but
the people that did made the decision.

ONE LAST TIME: it wasn't politics that ended the VAX, it was engineering
judgement by excellent (IMHO) DEC engineers, and the economics of doing
relatively low-volume chips [low-volume compared to X86]. Nobody could
afford to keep the VAX ISA competitive at DEC volumes....
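The low-volume economics are just division, but the division is brutal.
A minimal sketch (Python; the $50M design cost is the illustrative figure
used later in this thread, and the MicroVAX II volume is the figure cited
there from the Wheelers' post):

```python
# Amortized engineering cost per chip: fixed design cost / lifetime volume.
DESIGN_COST = 50_000_000   # illustrative $50M chip design cost

def cost_per_chip(volume):
    return DESIGN_COST / volume

# MicroVAX II, a *successful* VAX chip, shipped ~65,000 units in 1985-87;
# commodity x86 vendors ship millions. Same division, different business.
for volume in (65_000, 5_000_000):
    print(f"{volume:>9,} units -> ${cost_per_chip(volume):,.0f}/chip engineering cost")
```

At VAX volumes the design cost alone is hundreds of dollars per chip,
which only works when high-margin system sales subsidize it; at x86
volumes it is pocket change.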
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 29 Jul 2005 14:38:34 -0700
Message-ID: <[email protected]>

glen herrmannsfeldt wrote:
> John Mashey wrote:
> (snip regarding VAX and Alpha)
>
> > Put another way, the EVAX was *supposed* to be an aggressive VAX
> > implementation, whose goals were to be extended to 64-bit, and close
> > the performance gap with RISCs, but after serious work, a fine
> > engineering team (my opinion) concluded the problems weren't solvable.
> > Just as in 1975, they decided they *had* to make an architecture
> > change.
>
> Not to disagree, (because I can tell you know this much better
> than I do), but I always thought DEC wanted more from Alpha.
>
> I always felt that they wanted to be known for something, for breaking
> open the 64 bit world when everyone else was stuck in the 32 bit world.
> To show that they could do something other companies couldn't.
>
> Maybe sort of the way Cray felt about the supercomputer market.
>
> They could have had a darn fast processor with 32 bit addresses,
> maybe even 64 bit registers, and easily beat the VAX. It was many
> years before others decided to go for 64 bits, and even now I see little
> need for 64 bits for 98% of the people. An x86 processor with a 36 bit
> MMU would have gone a long way to satisfying users' addressing needs.

Well, the truth is not that way...

1) Recall that DEC was famously burnt by running out of address bits too
quickly on the PDP-11. Google comp.arch: mashey gordon bell pdp-11
gets you my BYTE article on 64-bit that references Gordon's comment. DEC
engineers could plot a straight line on a log-scale chart as well as MIPS
could, i.e., for practical DRAM sizes. Personally, I think good computer
vendors are supposed to think ahead, and have hardware ready, so that
software can get ready, so that when customers were ready to buy bigger
memory and use it, they could ... and not, once again, recreate the
awkward workarounds that have occurred many times when running out of
address bits.

2) A lot of readers of this newsgroup don't understand how interconnected
the working computer architecture community could be. The following just
notes some of the Stanford / DEC / SGI / MIPS relationships. Among other
combinations:

- "MIPS: A VLSI Processor Architecture", John Hennessy, Norman Jouppi,
  Forest Baskett, and John Gill, Stanford Tech Report 223, June 1983.
- Forest left to run DECWRL, right across El Camino Real from Stanford,
  until he became CTO @ SGI later in 1986. Norm was over at DECWRL as
  well.
- Hennessy, of course, took sabbatical in 1984-1985 to co-found MIPS.
- DECWRL had many RISC fans; the first MIPS presentation there (that I
  was involved in, anyway) was April 1986, shortly after the R2000 was
  announced.
- DEC, of course, had various on-again, off-again RISC projects in the
  1980s, with none getting critical mass. I wouldn't attempt to describe
  the politics, even what I know, except the phrase "VAX uber alles" was
  heard now and then :-)

3) Most people know that DEC did a deal with MIPS in 1988. As it happens,
I initiated the process that led to that deal, via a chance meeting with
an old Bell Labs colleague [Bob Rodriguez], then at DEC in NH, at a
Uniforum in early 1988. I'd given Bob a MIPS Performance Brief. That
evening, at the conference beer bust, Bob said "these look fast!" and
evinced a wish to "port Ultrix to this and wheel it in and show Ken Olsen
that it doesn't take 3 years and hundreds of people." I talked our Boston
sales office into loaning Bob two MIPS systems, noting that the
likelihood of DEC ever buying an outside CPU was low, but if there was
somebody who could do a quick Ultrix port, it was Bob. After a month or
so of paperwork, Bob and a couple of friends got Ultrix working pretty
well in less than 3 weeks [late April/early May].

That incited the DEC Palo Alto workstation people, who were having a
rough time competing in the workstation business with VAX chips, except
when VMS was required. [BTW, I knew these folks also, if for no other
reason than MIPS' participation in the Hamilton Group [of non-Sun UNIX
folks], so named because the first meeting was held at DEC on Hamilton
Avenue in Palo Alto.] A frenzied sequence of meetings then ensued, as DEC
Palo Alto and various Ultrix folks clamored to use MIPS chips so they
could build competitive workstations *soon*.

4) In any case, there was a meeting in the Summer in which a team of
DEC's most senior engineers [in VLSI, systems, compilers, and OS] was
gathered together by Ken Olsen with a few days' notice and sent to
Sunnyvale to do a solid day of technical due diligence. I'm pretty sure
we knew by then that the R4000 should be 64-bit, and discussed that with
DEC (and SGI, and a few other close customers, but DEC and SGI were the
ones who were especially desirous of 64-bit). If it wasn't at that
meeting [it was a lonnnngg day], it was shortly thereafter.

The DEC engineers were *quite* competent, and asked lots of tough
questions [although the OS part was trivial, given that Ultrix was
already running and benchmarking well. :-)] The CPU designers & compiler
people, of course, were from one of the premier places for doing this in
the world, had been building one of the most successful computer lines in
history, and DEC was at the height of its profitability. Some had been
advocating RISC for years, but were always getting grabbed away, so the
projects were on-again, off-again, which meant, among other things, that
there was little progress on software, and of course, DEC was investing
big-time in ECL, much to the Hudson, MA CMOS guys' consternation. Given
all that, they got a mission to come see if DEC should use MIPS chips ...
and it would have been really easy to have done a political NIH hatchet
job, but they didn't.

They went back the next day, wrote their report, and said OK, although
I'm sure it was really, really painful for some of them. I RESPECT THAT A
LOT, which is why I thought some of the postings in this thread were
simply nonsense...

At that point, MIPS was a few hundred people in two little buildings in
Sunnyvale. At the end of the day, we gave them a tour of our computer
room. Some of the DEC Hudson CMOS engineers were badly shocked. On the
way out the door, I happened to overhear one say to another, in a stunned
voice: "This little startup has more compute power for chip simulation
than we do at Hudson!! We're DEC, how can that be?!" Answer (in a voice
cold enough to freeze air): "Yes, that's why we've got a bad problem and
we'd better fix it."

This was Summer of 1988 ... now look again at:
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

Of course, by the time of the 64-bit working group meetings in mid-1992,
both SGI/MIPS and DEC had chosen the same flavor of C model. It was well
understood at that point (in that group) that HP, Sun, and IBM were
working on 64-bit designs.

5) SGI shipped the R4000 in the Crimson in 1Q92, albeit running a 32-bit
OS. DEC shipped production Alpha systems ~4Q92, running 64-bit OSs. Of
course they marketed 64-bit hard... I would too. Around 4Q94, both DEC
and SGI first shipped SMPs with *both* of the following:
- 64-bit OSs
- plausibly-purchasable memories above 4GB, i.e., where >4GB user address
  space was starting to be needed.

I knew customers who bought Power Challenges, got 4GB+ of memory, and
immediately recompiled code to use it all, in one (parallelized) program.
[I.e., that's a one-line change in some big Computational Fluid Dynamics
code :-), changing some array size.] During 1989-1992, there was plenty
of discussion among MIPS, SGI, and DEC people about 64-bit programming
models, for example.

6) SUMMARY

DEC couldn't figure out how to make the VAX competitively fast and
64-bit, and they could look at DRAM and Moore's Law charts like the rest
of us. Privately, they (and we) were a little surprised that other RISC
vendors were waiting a generation. DEC certainly knew exactly what MIPS
was doing, and certainly knew SGI intended to ship 64-bit OSs, if for no
other reason than the amount of back-door communication amongst
like-minded software people, and they certainly had a good idea of what
everybody else was up to.
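The straight-line-on-log-paper argument in point 1 is easy to make
concrete. A back-of-envelope sketch (Python; the ~64MB baseline for a
large 1985 system and the 4x-per-3-years growth rate are my illustrative
assumptions, not figures from these posts):

```python
# When does a flat 32-bit (4 GB) address space run out, if practical
# memory sizes quadruple roughly every 3 years (= 2 address bits / 3 yr)?
# Assumed baseline: ~64 MB (26 bits) in a large 1985 system.
BITS_PER_YEAR = 2.0 / 3.0

def address_bits_needed(year, base_year=1985, base_bits=26):
    return base_bits + BITS_PER_YEAR * (year - base_year)

for year in (1985, 1991, 1994, 2000):
    print(year, round(address_bits_needed(year), 1))
```

On these assumptions the line crosses 32 bits around 1994, consistent
with the 4Q94 >4GB SMPs described above; which is exactly why the
hardware had to be 64-bit years earlier, so the software could be ready
when customers bought the memory.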
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 29 Jul 2005 20:01:20 -0700
Message-ID: <[email protected]>

Eric P. wrote:
> The other problems might be:
>
> - Strong memory ordering prevents any access reordering
> - Any number of idiosyncrasies wrt the order that data values
>   are read or written vs the order that auto increment/decrement
>   deferred operations are performed will inject pipeline stalls due
>   to potential memory aliases that probably never actually happen.
>   This combines with strong ordering to basically serialize everything.
> - Having the program counter in a general register that can be
>   manipulated by auto increment addressing modes probably
>   causes many pipeline problems later to feed the value forward
> - 16 integer registers with many having predefined functions
>   is too small and causes lots of register spills.
> - Having a single integer and float register set means extra
>   time moving float values over to float registers and back.
> - Small combined register set means spilling float values a lot
> - Small page size requires lots of TLB entries, which should be
>   fully assoc. for performance, which means big TLB chip real estate.
> - The worst case of requiring 46 TLB entries to be resident
>   to ensure the ADDP6 instruction can complete. Not really a
>   performance limit so much as a pain in the ass to design around.

I think you have the general idea; more later, when I get time.
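The auto-increment idiosyncrasies in that list are easy to see
concretely. A toy model (Python; my own sketch of the documented VAX
operand rules, not production simulator code): in an instruction like
ADDL3 (R0)+,(R0)+,(R0)+, operand specifiers are evaluated left to right
and each one both uses and updates R0, so the three memory addresses
form a serial dependence chain *within one instruction*:

```python
# Toy model of VAX left-to-right operand evaluation with (Rn)+
# autoincrement. Each longword operand uses the *current* Rn, then bumps
# it by 4, so every operand's address depends on the side effects of the
# operand before it -- one reason the execution stages are hard to
# parallelize, independent of instruction decoding.

def addl3_autoincrement(mem, regs, rn=0):
    """Evaluate ADDL3 (Rn)+,(Rn)+,(Rn)+ : dst = src1 + src2."""
    a1 = regs[rn]; src1 = mem[a1]; regs[rn] += 4         # operand 1
    a2 = regs[rn]; src2 = mem[a2]; regs[rn] += 4         # operand 2 needs updated Rn
    a3 = regs[rn]; mem[a3] = src1 + src2; regs[rn] += 4  # result store, again
    return a1, a2, a3

mem = {100: 5, 104: 7, 108: 0}
regs = [100]
addresses = addl3_autoincrement(mem, regs)
print(addresses, mem[108], regs[0])
```

The three addresses (100, 104, 108) cannot be computed in parallel from
the architectural R0; hardware must either serialize them or speculate
on the increments and recover, which is the kind of "horrible weird
special case" the earlier posts allude to.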
From: "John Mashey" <[email protected]> Newsgroups: comp.arch Subject: Re: Code density and performance? [really Part 1 of 3: Micro economics 101] Date: 30 Jul 2005 16:40:53 -0700 Message-ID: <[email protected]> John Mashey wrote: > I think you have the general idea; more later, when I get time. PART 1 - Microprocessor economics 101 (this post) PART 2 - Applying all this to DEC, NVAX, Alpha, competition (this post) PART 3 - Why it seems difficult to make an OOO VAX competitve (later) PART 1 - Microprocessor economics 101 (simplified) This thread is filled with *fantasies* about cheap/fast/timely VAXen, because the issue isn't (just) what's technically feasible, it's what you can do that's cost-competitive and time-to-market competitive. The following is over-simplified, of course, but hopefully it will bring some reality to this discussion. DESIGN COST Suppose it costs $50M to design a chip and get into to production. The amortized cost/chip for engineering versus total unit volume is: Cost/chip Volume $1,000,000 50 $100,000 500 $10,000 5,000 $1,000 50,000 $100 500,000 $10 5,000,000 $1 50,000,000 Alternatively, if your volumes happen to be 5,000,000, you could spend $500M on development, and still only have an engineering cost/chip of $100. INTEL AND AMD CAN MAKE FAST X86S BECAUSE THEY HAVE VOLUME. Anne&Lynn Wheelers' post a while ago pointed at VAX unit volumes, which as of 1987 had MicroVAX II (a very successful product) having shipped 65,000 units over 1985-1987. If the unit volumes are inherently lower, you either have to get profit based on *system* value that yields unusually high margins, so that the system profit essentially subsidizes the use of such chips. This works for a systems vendor when the market and customer switching costs allow high margins, i.e., IBM mainframes to this day, and VAXen in the 1980s. [The first MIPS were designed using 2 VAX 11/780s, plus (later) an 8600, plus some Apollos ... 
and the VAXen seemed expensive, but they were what we needed, so we paid.] SYSTEMS COMPANIES Mainframes and minicomputer systems companies thrived when the design style was to integrate a large number of lower-density components, with serious value-add in the design of CPUs from such components. [Look at a VAX 11/780 set of boards.] As microprocessors came in, and became usefully competitive, most such companies struggled very hard with the change in economics, and the internal struggles to maintain upward compatibility. Most systems companies had real problems with this. There were pervasive internal wars among different strategies, especially in multi-division companies: A "We can do this ourselves, especially with new ECL (or even GaAs) [Supercomputer and mainframe, and minicomputers chasing mainframes] B "We can build CMOS micros that are almost as fast as ECL, and much cheaper, and enough better than commodity in feature, function, and performance." [IBM, DEC, HP ... and later Sun] C "We should put our money in system design and software, and buy micros." [Apollo, Convergent, Sun. SGI, Sequent, and various divisions of older companies]. Of course, later there came: D "We should buy micros, add value in systems design, and use Microsoft" Hence, IBM PC division, Compaq, etc. and yet later: E "We should minimize engineering costs and focus on distribution" I.e., Dell. Many companies did one internal design too many. Most of the minicomputer vendors went out of business. IBM, DEC, and HP were among the few that actually had the expertise to do CMOS micro technology, albeit not without internal wars. [I was invovled in dozens of these wars. One of the most amusing was when IBM friends asked me to come and participate in a mostly-IBM-internal conference at T. J. Watson, ~1992. 
What they *really* wanted, it turned out, was for somebody outside IBM politics (i.e., "safe") to wave a red flag in front of the ECL mainframe folks, by showing a working 100Mhz 64-bit R4000.] SEMICONDUCTOR COMPANIES, SYSTEMS COMPANIES, AND FABS In the 1980s, if you were a semiconductor company, you owned one or more fabs, which were expensive at the time, but nothing like they are now. You designed chips that either had huge volumes, or high ASPs, or best, both! Volume experience improves yield of chips/wafer, and of course, volume amortizes not only the fab cost, but the design cost. If you are a systems vendor, and only one fab can make your chips, you have the awkward problem that if the fab isn't full, you have a lot of capital tied up, and if it is full, you are shipment-constrained. When you built a fab, first you ran high-value wafers, but even after the fab had aged, and was no longer state-of-the-art, quite often you could run parts that didn't need the most advanced technology. In the 1980s, if you wanted to have access to leading-edge VLSI technology for your own designs, EITHER: - You were a semiconductor company, i.e., the era of "Only real men own fabs" (sorry, sexist, but that was the quote of the day). OR - You were a systems company big enough to afford the fab treadmill. IBM [which always had great process technology]. DEC usually had 1-2 CMOS fabs [more on that later]. HP had at least 1, but at least sometimes there was conflict between the priorities for high-speed CPU design and high-volume lower-cost parts. Fujitsu, NEC, etc were both chip and systems companies. In any case, you must carefully amortize the FAB costs. - You were a fabless systems company who could convice a chip company to partner with you, where your designs were built in their fabs, and either you had enough volume alone, or (better) they could sell the chips to a large audience of others. 
[Example: Sun & TI] OR - You were a small systems/chip company [MIPS] that was convincing various other systems companies and embedded designers to use the chips, and thus able to convince a few chip partners to do long-term deals to make the chips, and sell most of them to other companies, and as desired, have licenses to make variations of their own from the designs. Motivations might be that a chip salesperson could get in the door with CPUs, and be able to sell many other parts, like SRAMs [motivation for IDT/MIPS and Cypress/SPARC], or be able to do ASIC versions [motivation for LSI Logic]. In MIPS Case, the first few years, the accessible partners were small/medium chip vendors, and it was only in 1988 that we were able to (almost) do a deal with Motorola and were able to do ones with NEC and Toshiba, i.e., high-volume vendors with multiple fabs. Now, you might say, why wouldn't a company like Sun just go to a foundry with its designs, or in MIPS case, why wouldn't it just be a normal fabless semiconductor vendor of which there are many? A: Accessible foundries, geared to producing outside designs, with state-of-the-art fabs ... didn't really exist. TSMC was founded in 1987, and it took a long time for it to grow. ON HAVING A FAB, OR NOT If you own the process, you can diddle it somewhat to fit what you're building. If your engineers are close with the fab process people, and you have wizard circuit designers, you can do all sorts of things to get higher clock rates. If you aren't, you use the factory design rules ... or maybe you can do a little negotiation with them. In any case, there is a tradeoff between owning a fab ($$$) and getting higher clock rate, and not owning the fab, and being less able to tune designs. SYSTEMS COMPANIES THAT DESIGN CHIPS, SOLD TO OTHERS OR NOT There is a distinct difference in approach between the extremes of: - We're designing these chips for our systems, running our OSes and compilers. 
We might sell chips to a few close partners for strategic reasons.

VERSUS

- We expect to use these chips in our systems, but we expect that large numbers will be used in other systems, with other software. We will include features that will never be used in our own systems. We will invest in design documentation, appropriate technical support, sampling support, debugging support, software licensing, etc, etc to enable others to be successful.

IBM still has fabs, but of course IBM Microelectronics does foundry work for others (to amortize the fab cost). POWER was really geared to RS/6000; PPC made various changes to allow wider use outside, and IBM really sought this (volume).

Sun never had a fab, did a lot of industry campaigning to spread SPARC, but in practice, outside of a few close partners, most of the $ sales of SPARC chips went to Sun.

HP has sold PA-RISC chips to a few close partners, but in general, never was set up in the business of selling microprocessors.

MIPS started to do chips, but had enough work to do on systems that it needed to build systems (and, if you understand the volume issues above, needed systems revenue, since in the early days, it couldn't possibly get enough chip volume to make money; i.e., systems business can work at low volumes, whereas chip business doesn't).

DEC, of course, after its original business (modules) was much more set up as a systems vendor, and never really had a chip-supplier mindset, although Alpha, of course, was forced to try to do that (volume, again). Somebody suggested they should have been selling VAX chips, and that may be so, but it is really hard to make that sort of thing happen, as it requires serious investment to enable other customers to be successful, and it requires the right mindset, and it's really hard to make that work in a big systems company.

(I'm DEC and I sell you VAX chips. What OS do you run? Oh, VMS; OK we'll license you that. What sort of systems do you want to build?
You want to build lower-cost minicomputers to sell to the VAX/VMS installed-base? Oh.... actually, it looks like we're out of VAX chips this quarter, sorry.)

(Or one might recall that Sun talked Solbourne into using SPARC, and Solbourne designed their own CPUs and built SMPs. If a Sun account wanted an SMP, and somebody like SGI was knocking at the door, Sun would point at Solbourne (to keep SPARC), but if Solbourne was infringing on a Sun sale, it was not so friendly - I once got a copy of a Sun memo to the salesforce about how to clobber Solbourne.)

Anyway, a *big* systems vendor, to be motivated to the bother of successfully selling its otherwise proprietary CPU chips, has to find other, essentially non-competitive users of them, who can be successful. The most successful example of that is the IBM PPC -> Apple case. Probably the most interesting Alpha case was its use in the Cray T3 systems, fine supercomputers, but not exactly high-volume.

ON DESIGNS AND ECONOMICS

People probably know the old project management adage: "features, cost, schedule: you pick two, I'll tell you the other." In CPU design, you could, these days, use:

- FPGA
- Structured ASIC
- ASIC, fully synthesized logic
- Custom, with some synthesized and some custom logic/layout design, and maybe with some special circuit design.

Better tools help ... but they're expensive, especially because people pushing the state of the art tend to need to build some of their own. This is in increasing order of design cost.

- An FPGA will be big, use more power, and run at lower clock rate.
- The more custom a chip is, the faster it can go, but it either takes more people, or longer to design, or (usually) both.

Companies like Intel have often produced an original design with a lot of synthesized logic, with one design team, and then another team right behind them, to use the same logic, but tune the critical paths for higher clock rate, shrink the die with more custom design, work on yield improvements, etc.
Put another way, if you have enough volume, and good ASPs, you can afford to spend a lot of engineering effort to tune designs, even to overcome ISA problems.

PART 2 - Applying all this to DEC, NVAX, Alpha, Competition

DEC (at least some people) understood the importance of VLSI CMOS. DEC had excellent CPU and systems designers, software people, and invested in fabs (for better or worse - some of us could never quite figure out how they could afford the fabs in the 1990s). They had some super-wizard circuit designers, who even impressed some of the best circuit designers I've known. However, in the 1980s, they never had more than about 100 VLSI CPU designers, which in practice meant that at any one time, they could realistically be doing one brand-new design, and one {shrink, variation}. They of course were doing the ECL VAX 9000, but that was a whole different organization.

The problem that DEC faced was that their VAX cash cow was under attack, and they simply couldn't figure out how to keep the VAX competitive, first in the technical markets [versus PA-RISC, SPARC, and MIPS], and then in commercial [PA-RISC]. I think Supnik's article described this reasonably well. http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

As a good head-to-head comparison, NVAX and the Alpha 21064 were built:
- in the same process
- about the same time
- with the same design tools
- with similar-sized teams

... although the NVAX team had the advantage of having implemented pipelined CMOS VAXen before, a long history of diagnostics, test programs, statistics on behavior, etc, whereas the Alpha team didn't already have as much of that. The ISA difference between VAX and Alpha was such that the NVAX team had to spend a lot more effort on micro-architecture, whereas the Alpha team could spend that effort on aggressive implementation, such that the MHz difference was something like 80-90MHz for NVAX/NVAX+, and up to 200MHz for 21064.
Around 1992, modulo maybe a year difference in software, that gave numbers like:

        SGI          DEC          DEC
        Crimson      VAX7000/610  DEC7000/610
        MIPS         VAX          Alpha
        R4000        NVAX         21064
        1.3M         1.3M         1.68M       # transistors
        184 mm^2     237 mm^2     234 mm^2    # size
        1.0 micron   .75 micron   .75 micron  # process
        2-metal      3-metal      3-metal     # metals
        1MB L2       4MB L2       4MB L2      # L2
        100MHz       90MHz        182MHz      # clock rate
        61           34           95          # SPECint89
        78           58           244         # SPECfp89

Now, we all know SPECint/SPECfp aren't everything, and the exact numbers don't matter much, but that's still a big difference. I threw in the MIPS chip to illustrate that even a well-designed NVAX was outperformed by a single-issue chip that was 3/4 the size, in a substantially less dense technology [1.0 micron versus .75, and 2-metal versus 3], required to meet generic design rules across multiple fab vendors, was 64-bit, and still had a higher clock rate. None of this was due to incompetence on the NVAX team; that was a *fine*, successful design to be proud of.

But once again, go back to the economics. It's a classical move to try to take market share and build volume via all-out performance, selling first to those with the most portable code and willing to pay for performance. It's a lot harder to do that with an NVAX that was 60-80% of the performance (on these, anyway) of something like an R4000 that, if not a commodity, was a lot closer to that.

A bit later, I'll post Part 3, my analysis of why I think it would have been hard to build a *competitive* OOO VAX. In the real world, it wasn't enough to build an OOO VAX, it had to be competitive on time-to-market, performance, and cost. This post has covered the economic issues, the next will discuss some of the ISA issues.
But, as a teaser, I note that there are some ISA attributes of the VAX:

a) Not found in RISCs
b) Not found in X86
c) Some of which are found in the S/360 family, but less often

Some of them are the same ones that make other aggressive implementations hard, but some *really* cause trouble for OOO implementations, and in particular, make it very hard to get as much mileage from the X86 convert->micro-op style of designs.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Part 1 of 3: Micro economics 101 (was: Code density ...)
Date: 31 Jul 2005 17:48:37 -0700
Message-ID: <[email protected]>

Anton Ertl wrote:
> "John Mashey" <[email protected]> writes:
> >This thread is filled with *fantasies* about cheap/fast/timely VAXen,
> >because the issue isn't (just) what's technically feasible, it's what
> >you can do that's cost-competitive and time-to-market competitive.
>
> Well, my impression was that you made a claim about technical
> feasibility.

Once again, I don't know why a long-time participant in comp.arch would think that. I've posted on this topic off and on for years, including the old April 18, 1991 "RISC vs CISC (very long)" post that's been referenced numerous times. (Google: mashey risc cisc very long) It said, among other things:

"General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal architecture may actually be harder to make run fast than one that is less clean."

In context, this was ~ "it might be easier to deal with X86 than VAX"; Decoded Instruction Cache ~ Intel "trace" cache.

And in March 8, 1993 (Google: mashey vax complex addressing modes), I said:

"Urk, maybe I didn't say this right:
a) Decoding complexity.
b) Execution complexity, especially in more aggressive (more parallel) designs.
I'm generally much more worried about the latter than the former, since there are reasonable things to do about the former (i.e., decoded instruction caches, which at least help some)."
If I've ever posted anything that seemed to imply it was impossible to build an OOO VAX, I apologize for the ambiguity, but I think I've consistently expressed this as "difficult" or "complex", or "needs a lot of gates", or "likely to incur extra gate delays", NOT as "impossible". I've various times discussed the 360/91 or the VAX 9000 as things that went fast for their clock rate, but at the cost of high cost and complexity. After all, the key issues around OOO were published in Jan 1967 (Anderson, Sparacio, Tomasulo on 360/91). An aphorism of the mid-1980s amongst CPU designers was "Sometime we'll get enough gates to catch up with the 360/91."

In the real world, engineers have to juggle design/verification cost, product cost, and time-to-market; there are plenty of things that are "technically feasible" but have no ROI...
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance? [really Part 2b of 3: Micro economics 101]
Date: 3 Aug 2005 23:16:04 -0700
Message-ID: <[email protected]>

From side questions, here's an update to Part 2, and you will definitely want to use Fixed Font... Here's a better one, with a few more CPUs to give context, and you may want to print.

TABLE 1 - MIPS, VAX, Alpha, Intel

CODE     A        B         C         D         E
SHIP     1Q92     3Q92      4Q92      3Q93      3Q95
CO       SGI      DEC       DEC       SGI       DEC
PROD     Crimson  7000/610  7000/610  Chal XL   600 5/266
ARCH     MIPS     VAX       Alpha     MIPS      Alpha
CPU      R4000    NVAX+     21064     R4400     21164
XSTRS    1.3M     1.3M      1.68M     2.3M      9.7M
mm^2     184      237       234       184       209
Micron   1.0      0.75      0.75      .8        0.35
Metals   2        3         3         2         4
L1       8KI+8KD  2KI+8KD   8K+8K     16K+16K   8KI+8KD
L2       1MB      4MB       4MB       1MB       96K L3
MHz      100      90        182       150       266
Type     1P       1P        2SS       1P        4SS
Bus      64       128       128       64        128

SPEC89
Issue    Jun92    Sep92     Mar93     -         -
Si89     61       34        95        -         -
Sfp89    78*      58*       244*      -         -
 *All of these have the matrix300-raised numbers

SPEC92
Issue    June92   June92*   Mar93     Jun93     Sep95
Si92     58       34E**     95        88        289
Sfp92    62       46E**     182       97        405
 **My estimate, noting that MIPS & Alpha derated by .75-.8 going from
 SPECfp89 to SPECfp92. Take with many grains of salt. I couldn't easily
 find any SPEC92 numbers for VAX.

CODE     F        G         H         I         J
SHIP     1991     3Q92      3Q93      2Q94      2Q96
CO       Intel    CPQ       Intel     Intel     Intel
PROD     Xpress   Deskpro   Xpress    Xpress    Alder
ARCH     IA-32    IA-32     IA-32     IA-32     IA-32
CPU      486DX    486DX2    Pentium   P54C      PentiumPro
XSTRS    1.2M     1.2M+     3.1M      3.2M      5.5M
mm^2     81       ?         295?      147       196
Micron   0.8      ?         0.8       0.6       0.35
Metals   2?       2?        3         4         4
L1       8K       8K        8KI+8KD   8KI+8KD   8KI+8KD
L2       256K     256K      256K      512K      256K
MHz      50       66        66        100       200
Type     1P       1P        2SS       2SS       3-OOO
Bus      32       32        64        64        64

SPEC92
Issue    Mar92    June92    Jun93     Jun94     Dec95
Si92     30       32        65        100       366
Sfp92    14       16        60        75        283

Type:
 1P:    1-issue, pipelined
 2SS:   2-issue, superscalar
 4SS:   4-issue, superscalar
 3-OOO: 3-issue, out-of-order

=================================

What I've done is: Show the SPEC89 numbers for VAXen, because I can't find SPEC92 numbers.
Then I've done a gross estimate of the equivalent SPEC92, so that I can get all of the machines on the same scale, noting of course that benchmarks degrade over time due to compiler-cracking. I used the highest NVAX numbers I have handy, from my old paper SPEC Newsletters. I'm ignoring costs, below, and the dates in the table must be taken with lots of salt, for numerous reasons, and as always SPECint and SPECfp aren't everything [spoken as an old OS person].

NVAX shipped in 4Q91, NVAX+ in 1992. The first R4000s shipped in systems in 1Q92, so these are contemporaneous, as they are with the 486DX and 486DX2.

The NVAX+ is about 75-80% of a MIPS R4000 on integer and FP here, despite using a better process [.75 micron, 3-metal, versus 1.0 micron, 2-metal], a larger die [237 versus 184], and being 32-bit rather than 64-bit. [It is somewhat an accident of history and business arrangements that the R4000 was done in 2-metal, but that forced it to be superpipelined, 1-issue, rather than the original plan of 2-issue superscalar. As a result, the R4000/R4400 often had lower SPECfp numbers than the contemporaneous HP and IBM RISCs, although for compensation it sometimes had better integer performance, and sometimes could afford bigger L2 caches, because the R4000/R4400 themselves were relatively small.]

In any case, on SPEC FP performance, in late 1992, the fastest NVAX+ was outperformed by {IBM, HP, MIPS, Sun (maybe)}, and Alpha (by 3-4X). The NVAX+ was 3X faster than a 66MHz 486DX2. In SPECint, in late 1992, the NVAX was outperformed, generally by the RISCs ... but worse, there wasn't much daylight between it and a 66MHz 486DX2, or even a 50MHz 486DX.

The real problem of course (not just for the VAX, but for everybody) was the bottom right corner of the Table. Intel had the resources and volume to "pipeline" major design teams [Santa Clara & Portland] plus variants and shrinks, and there was an incredible proliferation in these years.
It's worth comparing [B] NVAX+ with [H] Pentium. Suppose one were a VAX customer in 1992:

If you were using VAX/VMS:
- commercial: committed to VMS for a long time.
- technical (FP-part): RISC competitors keep coming by with their numbers

If you were using Ultrix:
- FP: serious pressure from RISC competitors
- Integer: serious pressure already from RISC competitors, and horrors! Intel getting close to parity on performance

I'm not going to comment on DEC's handling of Alpha, fabs, announcements, alternate strategy variations. But this part should make clear that there was real pressure on the VAX architecture, from above (in terms of performance) and below (Intel coming up).

One might imagine, that had there been no Alpha, and everybody at Hudson had kept working on VAXen, that they could have gotten:

[X] a 2SS superscalar [like Pentium], in 1994, perhaps
OR
[Y] some OOO CPU [like Pentium Pro], in 1996, perhaps

as well as doing the required shrinks and variants. From the resources I've heard described, I find it difficult to believe they could have done both [X] and [Y] (and note, world-class design teams don't grow on trees). I could be convinced otherwise, but (as one of the NVAX designers says), only by "members of the NVAX and Alpha design teams, plus Joel Emer" :-), i.e., well-informed people.

In Part 3, I'll sketch some of the tough issues of implementing the VAX, as best I can, and in particular, note the ISA features that might make things harder for VAX than for X86, even for 2SS, 4SS, or OOO designs. In particular, what this means is that you can implement a type of microarchitecture, but it gains you more or less performance depending on the ISA and the rest of the microarchitecture. For instance, the NVAX design at one point was going to decode 2 operands/cycle, and it was found to add much complexity and only gain 2%.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 7 Aug 2005 18:48:10 -0700
Message-ID: <[email protected]>

(You will want Fixed Font). The earlier parts were:
- (posted Jul 30) PART 1 - Microprocessor economics 101
  PART 2 - Applying all this to DEC, NVAX, Alpha, competition
- (posted Aug 3) Really part 2b (updated Table 1 and added more discussion)

FUNDAMENTAL PROBLEM

Certain VAX ISA features complexify high-performance parallel implementations, compared not only to high-performance RISCs, but also to IA-32. The key issue is highlighted by Hennessy & Patterson [1, E-21]:

"The VAX is so tied to microcode we predict it will be impossible to build the full VAX instruction set without microcode."

Unsaid, presumably because it was taken for granted, is: for any higher-performance, more parallel micro-architecture, designers try to reduce the need for microcode (ideally to zero!). Some kinds of microcoded instructions make it very difficult to decouple:

A) Instruction fetch, decode, and branching
B) Memory accesses
C) Integer, FP, and other operations that act on registers

Instead, they tend to make A&B, or A&C, or A, B & C have to run more in lockstep. It is hard to achieve much Instruction Level Parallelism (ILP) in a simple microcoded implementation, so in fact, implementations have evolved to do more prefetch, sometimes predecode, branch prediction, in-order superscalar issue with multiple function units, decoupled memory accesses, etc, etc. ISAs often had simple microcoded implementations [360/30, VAX-11/780, Intel 8086] and then evolved to allow more pipelining. Current OOO CPUs go all-out to decouple A), B), and C), to improve the ILP actually achieved, at the expense of complex designs, die space, and power usage. Some ISAs are more suitable for aggressive implementations, and some make it harder.
The canonical early comparison was the CDC 6600 versus the IBM 360/91; the even stronger later one would be Alpha versus VAX. A widespread current belief is that the complexity, die cost, and propensity for long wires of high-end OOOs may have reached diminishing returns, compared to multi-core designs with simpler cores, where the high-speed signals can be kept in compact blocks on-chip.

IA-32 has baroque, inelegant instruction encoding, but once decoded, most frequently-used instructions can be converted to a small number (typically 1-4) of micro-ops that are RISC-like in their semantic complexity, and certainly don't need typical microcode. As noted earlier in this sequence, the IA-32 volumes can pay for heroic design efforts.

The VAX ISA is orthogonal, general, elegant, and easier to understand, but that generality also makes decoding difficult when trying to do several operands in parallel. Worse, numerous cases are possible that tend to lockstep together 2 or 3 of A), B), or C), lowering ILP, or requiring hardware designs that tend to slow clock rate or create difficult chip layouts. Even worse, a few of the cases are even common in some or many workloads, not just potential. As one VAX implementor wrote me: "it doesn't take much of a percentage of micro-coded instructions to kill the benefits of the micro-ops." That is a *crucial* observation, but of course, the people who really know the numbers tend to be the implementers...

It is interesting to note that the same things that made VAX pipelining hard, and inhibited the use of a 2-issue superscalar, also make OOO hard. Some problems are easier to solve, but others just move around and manifest themselves in different ways.
- decode complexity
- indirect addressing
- multiple side-effects
- some very complex instructions
- subroutine call mechanism

Following is a more detailed analysis, showing REFERENCES first (easier to read on Web), briefly describing OOO, then going through a sample of troublesome VAX features and comparing them to IA-32, and sometimes S/360, plus a CONCLUSION that wraps all this together with DEC's CMOS roadmap in the early 1990s to show the difficulty of keeping the VAX competitive.

==========================

CAVEAT: I've never helped design VAXen, although I used them off-and-on between 1979-1986. I have participated (modestly) in several OOO designs, the MIPS R10000 and successors, plus one that never shipped. I've had lots of informal discussions over the years with VAX implementors, and I have reviewed some of the ideas below with at least one of them. I don't have the statistics that a professional needs to really do an OOO VAX design, so at best I can sketch some of the problems.

With enough gates, you can do almost anything ... but complexity incurs design cost, design time, and often chip layout problems and gate delays. Unlike software on general-purpose systems, where adding a bit of code rarely bothers much, the blocks of a chip have to fit on a 2-dimensional layout, and their physical relationships *matter*. Sometimes minor-seeming differences cause real problems.

==========================

REFERENCES (placed here for convenience):

Assumed reading:
[0] Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, 2003. Chapters 2, 5, and especially 3, plus Appendix A.

Brief explanation, and detailed reference of the VAX:
[1] Hennessy & Patterson, "Another alternative to RISC: the VAX Architecture", www.mkp.com/CA3, Appendix E.
[2] Digital Equipment, VAX Architecture Handbook, 1981.
Superb VAX performance analyses of the early 1980s by DEC people; ironically, invaluable to RISC designers [at MIPS, used to settle arguments]:
[3] Clark and Levy, "Measurement and analysis of instruction use in the VAX-11/780", ACM SIGARCH CAN 10, No. 3 (April 1982), 9-17.
[4] Wiecek, "A case study of VAX-11 instruction set usage for compiler execution", ASPLOS 1982, 177-184.
[5] Emer and Clark, "A characterization of processor performance in the VAX-11/780", Proc. ISCA, 1984, 301-310.
[6] Clark and Emer, "Performance of the VAX-11/780 Translation Buffer: Simulation and Measurement", ACM TOCS 3, No. 1 (Feb 1985), 31-62.

Important analysis of ILP, discussed at length in [0]:
[7] Wall, Limits of Instruction-Level Parallelism, DEC WRL Report 93/6.

Another superb performance analysis by two of the best:
[8] Bhandarkar and Clark, "Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization", 1991.

The NVAX:
[9] Uhler, Bernstein, Biro, Brown, Edmondson, Pickholtz, and Stamm, "The NVAX and NVAX+ High-performance VAX Microprocessors". http://research.compaq.com/wrl/DECarchives/DTJ/DTJ701/DTJ701SC.TXT

A good intro to the IA-32:
[10] Hennessy and Patterson, "An alternative to RISC: The Intel 80x86", www.mkp.com/CA3, Appendix D.

4-issue superscalar in-order Alpha versus OOO PentiumPro:
[11] Bhandarkar, "RISC versus CISC: A Tale of Two Chips", ACM SIGARCH CAN 25, Issue 1 (March 1997), 1-12.

IBM ES/9000 (1992) was superscalar OOO in Bipolar, but in CMOS, they went back to simpler designs:
[12] Slegel, Pfeiffer, Magee, "The IBM eServer z990 microprocessor", IBM J. Res. Dev. 48, No. 3/4 (May/July 2004), 294-309.
[13] Heller and Farrell, "Millicode in an IBM zSeries processor", IBM J. Res. Dev. 48, No. 3/4 (May/July 2004), 425-434.

[Some of these can be found on WWW, some are in the ACM Digital Library (subscription), many are discussed in detail in [0] anyway.]
INTRODUCTION - OOO (Out-of-Order) (see [0, Chapter 3]):

OOO CPUs try to maximize ILP as follows:

A) Fetch instructions in-order, with extensive branch prediction;
- decode (and maybe even cache the decoded instructions),
- apply register renaming to convert logical registers to physical,
- put the resulting operations (decoded instructions using renamed registers) into internal queue(s) (reorder buffer, active-list, etc), such that an operation can be performed (often OOO) whenever its inputs and the necessary functional units are available.

B) A load/store unit tries to discover cache misses and start refills quickly. Loads can (depending on ordering model) be done out of order, and stores can at least (sometimes) profitably fetch the target cache lines, although the final store operation must wait. Decoupling this unit as much as possible is absolutely crucial to getting good ILP, given the increasing relative latency to memory.

C) Other instructions are executed by appropriate function units, commonly 1-2 integer ALUs, and a collection of independent FP units.

A) Again, since it's more related to A): Completed instructions are retired in-order. If it turns out that the fetch unit has mispredicted a branch, when that is discovered, the register state, condition codes, etc are rolled back to those just before the branch, and the branch is followed in the other direction. If an instruction generates an exception, the exception normally doesn't take effect until the instruction is retired, in which case the following instructions are cancelled. Something similar occurs with asynchronous interrupts.

OOO CPUs run most of the time speculating, i.e., working on multiple instructions that might or might not actually be reached, which is why people worry so much about good branch prediction, because the penalty for bad prediction gets worse as the design gets {longer pipelines, more parallelism}.
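The renaming step in A) can be sketched in a few lines. This is a toy model only (all names are mine, not from any real design): a map table from logical to physical registers plus a free list, where each destination gets a fresh physical register, which is exactly what removes false (write-after-write, write-after-read) dependencies.

```python
# Toy sketch of register renaming, assuming a simple map table + free list.
class RenameMap:
    def __init__(self, num_logical, num_physical):
        # Each logical register starts out mapped to a physical one.
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free = [f"p{i}" for i in range(num_logical, num_physical)]

    def rename(self, dest, srcs):
        # Sources read the *current* mapping; the destination gets a
        # fresh physical register, so older in-flight readers are safe.
        renamed_srcs = [self.map[s] for s in srcs]
        new_phys = self.free.pop(0)
        self.map[dest] = new_phys
        return new_phys, renamed_srcs

rm = RenameMap(num_logical=4, num_physical=8)
d1, s1 = rm.rename("r1", ["r2", "r3"])   # r1 = r2 + r3
d2, s2 = rm.rename("r1", ["r1", "r2"])   # r1 = r1 + r2
assert s2[0] == d1   # true dependence preserved through the rename map
assert d2 != d1      # WAW hazard removed: each write gets a new physical reg
```

A real design also has to reclaim physical registers at retirement and roll the map back on a branch mispredict, which this sketch omits.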
Also, they hope for code patterns that help confirm branch directions early. Once upon a time, it was easy to know how many cycles an instruction would use, but that was long ago, with real-memory, uncached designs :-) It is very difficult to know what an OOO CPU is up to.

There are also serious hardware tradeoff problems that arise, even though invisible to most programmers. There is never as much die space as there are potential uses for it, and the payoffs of different choices must be carefully analyzed across large workloads, especially because there can be serious discontinuities in gate-count, or worse, gate-delays caused by "minor" changes in things like queue sizes. For instance, unlike limits on logical (programmer-visible) registers, there is no a priori limit on the number of physical registers, but in practice, these register numbers are used in large numbers of comparators, and one would think hard in going from 64 to 65, or 128 to 129. Likewise, load/store queues have big associative CAMs so that the next address can be quickly checked against all the outstanding memory operations.

Quite likely, load queues are filled with outstanding memory references, with multiple cache misses outstanding. If a queue is filled, but an instruction needs a piece of data *right now*, just to be decoded into micro-ops, it either has to have special-case hardware, or it will have to wait until a queue entry is available. [The VAX has this problem, unlike IA-32 or RISCs.]

Some instructions (like ones changing major state/control registers, or memory mapping, etc) are inherently *serializers*, that is, their actions cannot take effect until all logically older instructions have been retired. Also, partially-executed instructions *following* the serializer may need to be redone. The decoder might recognize such serializers and stop fetching, if it is deemed a waste of time to go beyond them.
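The load/store-queue CAM check mentioned above can be illustrated with a toy sketch (Python stands in for hardware; the names are mine): each incoming address is compared against every outstanding queue entry at once, which in silicon means one comparator per entry, and is why "minor" changes to queue size cost real gates and wires.

```python
# Toy model of a content-addressable (CAM) check against a load/store queue.
# Hardware does all comparisons in parallel, one comparator per entry;
# Python just loops, but the result is the same: which entries match.
def cam_match(queue, addr):
    return [i for i, entry in enumerate(queue) if entry == addr]

lsq = [0x100, 0x140, 0x180, 0x140]   # outstanding memory-op addresses
assert cam_match(lsq, 0x140) == [1, 3]   # conflicts found at entries 1 and 3
assert cam_match(lsq, 0x1c0) == []       # no conflict: proceed out of order
```

Growing the queue from N to N+1 entries adds another comparator to *every* such check, which is the gate-count discontinuity the text describes.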
Unlike loads, where all the work can be done speculatively, stores cannot be completed until they are retired, because they can't be sanely undone. Cleverness can preserve sequential (strict) ordering while pulling loads ahead of earlier stores, by redoing them if it turns out they conflict [0, p. 619 on R10000; discussed in US Patent 6,216,200]. The VAX's strict ordering might *not* have been a real problem.

WHY DO ALL THIS OOO COMPLEXITY?

a. Speculate into memory accesses as early as possible and get cache misses going, to deal with the increasing latency to memory. Overlap address calculations, get actual load data early, and fetch cache-line targets of stores early. Also, try to smooth the flow of cache accesses to lower latency and increase effective bandwidth. It was often said: "The main reason to do OOO is to find cache misses as early as possible."

b. Extract more ILP by overlapping non-dependent ALU/FP operations in a bigger window [40+ typical] than is available for in-order superscalars, which typically examine no more than 4 instructions/clock. This is especially valuable for long-cycle operations like integer multiply/divide, or many FP ops.

c. Alleviate decoding delays of messier variable-length instructions; this obviously applies less to RISCs, although some have done modest pre-decode when storing instructions in the I-cache.

d. Reduce pressure on small register sets by using register renaming to create more physical registers than logical ones. This also eliminates false dependencies, even in RISCs with large register sets, but it does help VAX (and IA-32) somewhat more, as both are short of registers. That moves the problem around, as it puts more pressure on load/store queues, and efficient handling of load-after-store-to-same-address.

The 360/91 used OOO for a. (it had no cache), b. (long FP-cycle ops), and d. (only 4 FP registers). I think c. was less important, as S/360 instruction length decode is easy.
ILP, NORMAL INSTRUCTIONS, AND IA-32 VERSUS VAX

Consider the normal unprivileged instructions that need to be executed quickly, meaning with high degrees of ILP, and with minimal stalls from the memory system. RISC instructions make 0-1 memory references per operation. Despite the messy encodings, *most* IA-32 instructions (dynamic count) can be directly decoded into a fixed, small number of RISC-like micro-ops, with register references renamed onto the (larger) set of physical registers. Both IA-32 and VAX allow unaligned operations, so I'll ignore that extra source of complexity in the load/store unit. In an OOO design, the front-end provides memory references to a complex, highly-asynchronous load/store/cache control unit, and then goes on.

In one case [string instructions with REP prefix], IA-32 needs the equivalent of a microcode loop to issue a stream of micro-ops whose number is dependent on an input register, or dynamically, on repeated tests of operands. Such operations tend to lessen the parallelism available, because the effect is of a microcode loop that needs to tie together front-end, rename registers, and load/store unit into something like a lock-step. Although this doesn't require that all earlier instructions be retired before the first string micro-ops are issued, it is likely a partial serializer, because it's difficult to do much useful work beyond an instruction that can generate arbitrary numbers of memory references (especially stores!) during its execution.

However, the VAX has more cases, and some frequent ones, where the instruction bits alone (or even with register values) are insufficient to know even the number of memory references that will be made, and this is disruptive of normal OOO flow, and is likely to force difficult [read: complex, high-gate-count or long-wire] connections among functional blocks on a chip.
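The REP-prefix point above can be shown with a hedged sketch (this is NOT real IA-32 decode; the mnemonics and micro-op names are purely illustrative): an ordinary memory-operand MOV expands to a micro-op count knowable from the instruction bits alone, while REP MOVSB cannot be expanded until the run-time value of ECX is known.

```python
# Illustrative decoder sketch: fixed expansion vs. data-dependent expansion.
def decode(insn, regs=None):
    if insn == "MOV eax, [ebx]":
        # Fixed: the instruction bits alone determine the micro-ops.
        return ["LOAD p_eax <- [p_ebx]"]
    if insn == "REP MOVSB":
        # Data-dependent: the count lives in ECX, so the front end must
        # wait for that register value before it knows how many
        # load/store micro-ops to issue -- the "microcode loop" effect.
        n = regs["ecx"]
        ops = []
        for i in range(n):
            ops.append(f"LOAD  tmp <- [esi+{i}]")
            ops.append(f"STORE [edi+{i}] <- tmp")
        return ops
    raise ValueError(f"unhandled instruction: {insn}")

assert len(decode("MOV eax, [ebx]")) == 1
assert len(decode("REP MOVSB", {"ecx": 3})) == 6   # 3 loads + 3 stores
```

Even so, as the text notes, the IA-32 case is relatively benign: the count is just a register value, which the normal OOO machinery already tracks.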
Hence, while the VAX decoding complexity can be partially ameliorated by a speculative OOO design with a decoded cache [I alluded to this in the RISC CISC 1991 posting], it doesn't fix the other problems, which either create microcode lock-steps between decode, load/store, and other execution units, or require other difficult solutions. In some VAX instructions, it can take a dependent chain of 2 memory references to find a length!

VAX EXAMPLES [1], [2], especially compared to IA-32 [10] and sometimes S/360. Specific areas are:

- Decimal string ops
- Character string ops
- Indirect addressing interactions with above
- VAX Condition Codes (maybe)
- Function calls, especially CALL*/RET, PUSHR/POPR.

DECIMAL STRING OPERATIONS:

MOVP, CMPP, ADDP, SUBP, MULP, DIVP, CVT*, ASHP, and especially EDITPC: these are really, really difficult without looping microcode. [S/360 has the same problem, which is why (efficient) non-microcoded implementations generally omitted them.] The VAX versions, especially the 3-address forms, are even more complex than the 2-address ones on S/360, and there are weird cases. DIVP may allocate 16 bytes on the stack, and then restore the SP later. These instructions somewhat resemble the (interruptible) S/370 MVCL, but are more complex, including the infamous ADDP6. They all set 4-6 registers upon completion or interrupt. EDITPC is like the S/360 EDMK operation, but even more baroque. "The destination length is specified exactly by the pattern operators in the pattern string." [2, p. 336] I.e., you know the beginning address of the destination, but you can't tell the ending address of a written field without completely executing the instruction. The IA-32 doesn't have these memory-memory decimal operations.
One might argue that C, FORTRAN, BLISS, PASCAL, etc. couldn't care less about these, but COBOL and PL/I do care, so if they are a customer's priority, the customer may not be happy with the performance they get on an OOO VAX. I.e., C speeds up, FORTRAN speeds up, but decimal operations are unlikely to speed up as much, as these certainly look like microcode that tends to serialize resources.

CHARACTER STRING AND CRC OPERATIONS: MOVC, MOVTC, MOVTUC, CMPC, SCANC, SPANC, LOCC, SKPC, MATCHC, CRC: also tough without looping microcode, and generally more complex than the S/360 equivalents. MOVTUC is a fine example: it has 3 memory addresses, and copies/translates bytes until it finds an escape character. Hence, at decode time, it is impossible to know how many memory addresses will be fetched from, and worse, stored into... The IA-32 REPEAT/string operations have some of the same issues, but are simpler, with the length and 2 string addresses supplied in registers.

VAX INDIRECT ADDRESSING AND CHARACTER OR DECIMAL OPS

For any of the above, note that most operands, INCLUDING the lengths, can be given by indirect (DEC "deferred") addresses:
@D(Rn)  Displacement deferred  [2.7%, according to [5, Table 4]]
@(Rn)+  Auto-increment deferred [2.1%, according to [5, Table 4]]

The first adds the displacement to register Rn to address a memory word; the second uses the address in Rn (followed by an auto-increment) to address the memory word. That word contains the address of the actual operand value. This makes it impossible for the front-end to know the length early. Rather than being able to hand off load/store operations unidirectionally to the load/store unit, the front-end has to wait for the load/store unit to supply the operand value, just to know the character string length. I have no idea how frequent this is, but VAXen pass arguments on the stack, and a call-by-reference that passes a length argument will do this: @D(SP).
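The deferred-length problem above can be sketched as follows (names and memory layout invented for illustration): with a literal or register specifier the length is available at decode or rename time, but with @D(Rn) the front end needs a dependent chain of two memory references before it even knows how many micro-ops the string instruction will need.

```python
# Sketch of the front-end problem with deferred length specifiers:
# for @D(Rn) the length operand itself lives behind a pointer in
# memory, so the decoder must round-trip through the load/store unit
# just to learn the length.

def resolve_length(mode, regs, mem, disp=0, rn=None):
    if mode == "literal":            # length in the instruction stream
        return disp                  # known at decode time
    if mode == "register":           # length in Rn
        return regs[rn]              # known once rename resolves
    if mode == "disp_deferred":      # @D(Rn): pointer to the length
        ptr = mem[regs[rn] + disp]   # first memory reference...
        return mem[ptr]              # ...then fetch the actual length

regs = {"sp": 100}
mem = {104: 200, 200: 16}            # argument slot holds a pointer
assert resolve_length("disp_deferred", regs, mem, disp=4, rn="sp") == 16
```

The call-by-reference @D(SP) case in the text is exactly the `disp_deferred` path: two dependent loads before the length is known.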
Consider how much easier the regular VAX MOV* instructions are, each of whose length is fixed. Each of those is easily translated into:
Load (1 value, 1, 2, or 4 bytes) into (renamed register); store that value
Or
Load (2 or 4 longwords); store (2 or 4 longwords)
(Of course, one might like the MOV to just act in the load/store unit, but that's not quite possible, due to the MOV and Condition Codes issue described later.)

IA-32 doesn't have this problem, as the length for a REP-string op is just taken from a register. Of course, that value must be available, but that falls out as part of the normal OOO process. The closest the S/360 gets is the use of an Execute instruction to supply length(s) to a Move Character (MVC) or other SS instruction. That's a somewhat irksome control transfer: think of it as replacing the EX with the MVC after ORing the length in, but at least you know the length at that point, without having to ask the load/store unit to do (possibly) multiple memory references in the middle of the instruction, which requires some special cases in the front-end <-> load/store unit interaction.

VAX CONDITION CODES AND MOVES [CONJECTURE ON MY PART]

OOO processors typically use rename maps to map logical registers to physical ones. Condition Codes (CC) require an additional rename map of their own, in ISAs that have them. Each micro-op has an extra dependency on the CC of some predecessor, and produces a CC, just as it produces a result. Register renaming uses massive bunches of comparators to keep track of dependencies, and CCs just add more maps and more wires and comparators. IA-32 and S/360 would need this also, but the VAX is slightly different, in that its data movement instructions affect the CC.

S/360: CVB, CVD, DR, D, IC, LA, LR, L, LH, LM, MVC, MVI, MVN, MVO, MVZ, MR, M, MH, PACK, SLL, SRDL, SRL, ST, STC, STCM, STH, TR, UNPK do *not* set the CC, i.e., most data movement instructions do NOT affect CC.

IA-32: MOV does not affect any flags.
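The CC-renaming idea just described can be sketched with a minimal rename map (purely illustrative; no real design is this simple): the condition codes behave like one more renamed "register", so every CC-writing micro-op allocates a new physical tag and every CC reader picks up a dependency on the most recent writer.

```python
# Minimal rename-map sketch: CC is treated as just another logical
# name, so a CC-setting VAX MOV creates a new tag that a following
# conditional branch must depend on.

class RenameMap:
    def __init__(self):
        self.map = {}            # logical name -> physical tag
        self.next_tag = 0

    def read(self, name):
        return self.map.get(name)

    def write(self, name):
        self.map[name] = self.next_tag
        self.next_tag += 1
        return self.map[name]

rm = RenameMap()
rm.write("r1")               # some earlier producer of r1
t_cc = rm.write("cc")        # a VAX MOV sets CC as a side effect...
assert rm.read("cc") == t_cc # ...so a later branch depends on that MOV
```

On an ISA where plain moves don't touch flags, the `write("cc")` in the move case simply wouldn't happen, and the branch would depend on some earlier, possibly more distant, producer.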
VAX: almost everything affects flags, including all the MOVes (except MOVA and PUSHA, which do address arithmetic), so the simple equivalents of LOAD and STORE on other ISAs now have to set the CC. It's hard to say whether this matters or not without a lot of statistics. It does complexify some advanced optimizations. For instance, there are some kinds of store/load sequences where one wants everything to be done in a L/S unit (which generally knows nothing about CCs): one may recognize that a pending store has the same address as a later load, and simply hand the store data directly to the load without incurring a cache access [I think Pentium 4 does something like this]. This can easily happen when a bunch of arguments are quickly pushed onto the stack, and the stores are queued in the L/S unit (because they arrive faster than the cache can service them), but later loads quickly appear to fetch the arguments. This seems to imply extra complexity, because the L/S unit must compute the CC and get it back to the rest of the CPU.

NOTE: upon exception or interrupt, the CC must be set appropriately, which means that it has to be tracked. However, it also means that most conditional branches depend on the immediately-preceding instruction, and that may (or may not) make it harder to extract ILP.

AND SAVING THE BEST FOR LAST: "SHOTGUN" INSTRUCTIONS LIKE VAX CALLS, CALLG

In the NVAX, these shut down the pipeline because the scoreboard couldn't keep track of them, so sequences of simpler instructions were faster. The VAX ISA makes it harder than usual for a decoder to turn an instruction into a small, known set of micro-ops. CALLS and CALLG generate long sequences of actions, most of which can be turned into micro-ops straightforwardly.
However, one thing is painful:
CALLS numarg.rl, dst.ab and CALLG arglist.ab, dst.ab

The decoder cannot tell from the instruction how many registers will get saved on the stack, because the dst.ab argument (which could be indirect) yields the address, not of the first instruction of the called routine, but of a *mask*, 12 of whose bits correspond to registers R11 through R0, showing which ones need to be saved onto the stack, along with everything else and all the register side-effects. This means that, in the middle of decoding the instruction, the decoder has to hand dst.ab to the address calculator and get the result back [OK so far], but then it has to fetch the target mask, scan the bits, and generate one micro-op store per register save. Presumably, in an OOO design with trace cache, and with fully-resolved subroutine addresses, one could do this OK, but it's a pain, because of the potential variability. Of course, in C, with pointers to functions, an indirect call through a pointer is awkward ... but then, it's awkward for everybody.

RET inverts this, but the trace cache approach doesn't help, in that it POPs a word from the stack that has the register mask, scans the mask, and restores the marked registers from the stack. This is another thing that wants to generate a variable number of memory operations based on another memory operand, so the micro-ops are not easily generatable from the instruction bits alone, or even from instruction+register values.

PUSHR and POPR push and pop multiple registers, using a similar register mask, but at least, in common practice, the mask would be an immediate operand ... although of course, it is possible the mask was some indirect-addressed value, sigh.
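The CALLS register-save step described above can be sketched like this (memory layout and micro-op names invented for illustration): the save mask lives in memory at the call target, so the decoder must fetch it before it can emit one store micro-op per set bit.

```python
# Toy expansion of the CALLS register-save step: the entry mask sits
# at the destination address, so the number of register-save stores is
# only known after a memory fetch in the middle of decode.

def calls_save_uops(mem, dst):
    mask = mem[dst]                 # fetch the entry mask at the target
    uops = []
    for r in range(12):             # bits correspond to R0..R11
        if mask & (1 << r):
            uops.append(("store_reg", "r%d" % r))
    return uops

mem = {0x2000: 0b000011000101}      # callee saves R0, R2, R6, R7
assert calls_save_uops(mem, 0x2000) == [
    ("store_reg", "r0"), ("store_reg", "r2"),
    ("store_reg", "r6"), ("store_reg", "r7")]
```

Contrast this with an IA-32 CALL or an S/360 STM, where the instruction bits alone determine the expansion: no memory fetch is needed before the micro-op count is known.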
Of course, some use the simpler:
BSB/JSB to subroutine
Subroutine: PUSHR, plus other prolog code
Body
POPR and other epilog code
RSB

However, at least as seen in [3], [4], [5], CALLS/RET certainly got used: [4, Table 3] has 4.86% of instructions being CALLS+RET, for instructions executed by compilers for BASIC, BLISS, PASCAL, PL/I, COBOL, FORTRAN. [3] has instruction distributions [by frequency and time], with [3, Table 7] showing the distributions across all modes. This lowers the % of CALLS/RET, since the VMS kernel and supervisor don't use them. Still, one would guess that ~1% of executions each for CALL and RET, with about 12% of total cycles, would fit the VAX 11/780. [5, Table 1] gives 3.2% for CALL/RET. Of course, the semantics of these things tend to incur bursts of stores or loads, which means the load/store queues had better be well-sized to accept them.

IA-32: While the CALL looks amazingly baroque, it's not as bad as it looks. That is, there are a bunch of combinations to decode, and they do different things, but once you decode the instruction bits, you know what each will do. RET doesn't restore registers, it just jumps back; again, there is a complex set of alternatives, but each is relatively simple, especially the heavily-used ones (I think). PUSHA/PUSHAD and POPA/POPAD simply push/pop all the general registers... a fixed set of micro-ops.

S/360: This has LM/STM (Load/Store Multiple), but the register numbers are encoded in the instructions, with the only indirect case being an Execute of an LM/STM, something I've rarely seen.

NOT IMPOSSIBLE, BUT HARD

If my job had been to keep the VAX competitive, I'd probably have been thinking about software tricks to lessen the number of CALL/RETs executed, but that's just one of many issues. Maybe there are other implementation tricks, but in general, this stuff is *hard*, and solutions that are straightforward in RISCs, and somewhat so in IA-32, are different/tricky for VAXen.
To see how complex it can get to make an older architecture go fast, see [12] on the z990. IBM did OOO CPUs in the 360/91 (1967) and ES/9000 (around 1992), but has reverted to in-order superscalars since. The recent z990 (2-core) is a 2-decode, 3-issue, in-order superscalar. The chip is in 130nm, with 8 metals, and has 121M transistors, of which 84M are in arrays, and the rest (37M!!) are combinatorial logic. That is two cores, so figure each side is ~60M, with ~17M in combinatorial logic. That's still big. It has 3 integer pipelines (X, Y, and Z), of which one does complex instructions (fixed-point multiplies and decimal), and sometimes (as for Move Character and other SS instructions), X and Y are combined into a virtual Z, with a 16-byte datapath. "Instructions that use the Z pipeline always execute alone." The millicode approach [13] might help a VAX, but again, this is not simple.

AND MORE

I picked out a couple of the obvious issues. In my experience, the people who *really* know the warts and weird problems of implementing an ISA are those who've actually done it a couple of times, and I haven't ever done a VAX. If one of those implementers says they knew how to fix all the issues, I'd at least listen to their solutions with interest, but I do know that a lot of the issues are statistical things, not just simple feature descriptions.

CONCLUSION

Earlier in this thread, I noted [11], which says: "The VAX architectural disadvantage might thus be viewed as a time lag of some number of years." The data in Table 1 in my previous post agrees, as does the clear evidence of the late 1980s. DEC understood the VAX quite well; there were superb architectural analysis articles [3-6] from the 1980s. Serious CPU designers gather massive data and simulate alternatives, and DEC folks were very good at this process.

Nth REMINDER: there is *architecture* (ISA) and there is *implementation*; they interact, but they are different.
If this isn't familiar, go back and read old postings. One might be able to say that one ISA is simpler than another, because the minimum gate count for a reasonable implementation is lower than the other's. One might say that the design complexity of similar implementations differs between the two ISAs.

VAX VS RISC [TABLE 1, FIRST GROUP]

By the late 1980s, some system-RISCs were selling in volumes similar to VAXen, i.e., in workstation/server markets [SPARC, HP PA, MIPS], and hence, none had the vast volumes of the PC market to allow extraordinarily expensive designs, but all could design faster CPUs than contemporaneous VAXen, at lower design cost. It was certainly clear by 1988 that RISCs were causing trouble for the VAX franchise, at least on the Ultrix side of it. Reference [11] was discussed earlier in this thread, and its conclusions recognized that. Of course, IBM re-entered the RISC fray in 1991 with (aggressive) POWER CPUs.

It is not at all unreasonable that RISC ISAs, first shipped in 1986 [HP PA, MIPS], 1987 [SPARC], 1991 [IBM POWER], and 1992 [DEC Alpha], should be more cost-effectively implementable than the VAX, first shipped in 1978. Even tiny MIPS was able to do that, over most of that period. Hence, one of the jaws closing on the VAX was higher-performance RISCs, delivered at lower cost, in similar volumes. The other jaw, as discussed in the previous post, was the performance rise of high-volume IA-32 CPUs, which allowed the use of larger design resources to deal with the complexities of IA-32. The second group in Table 1 showed a few examples of that. The VAX ISA is far cleaner, more orthogonal, more elegant, and easier to comprehend than the IA-32 ISA, as it was in 1993. The Intel Pentium offered 2-issue superscalar (1993), the PentiumPro (1996) went OOO, and the Pentium 4 (2000) went even further, with a decoded instruction cache (trace cache).
It took substantial resources, which DEC didn't have, to do that, including "pipelining" design teams at Intel (Santa Clara, Portland). In 1992, at Microprocessor Forum, Michael Uhler showed a chart that included:

        CMOS-4  CMOS-5  CMOS-6
        1991    1993    1996    manufacturing year
        .75     .5      .35     Min feature
        3       4       4       Metals
        7.2     16      32      Relative logic density
        2.2     2.9     3.7     Relative gate speed
                1.3X    1.7X    Gate speed relative to CMOS-4

It should be pretty clear from Table 1 of the previous post that straight shrinks from CMOS-4 to CMOS-5 and then CMOS-6 wouldn't have made the VAX competitive again, because you wouldn't even get the clock-rate gain, given increasing relative memory latency. At the least, you'd have to redo the layout and increase the cache sizes. If you got the gate speed improvement, you'd have:
1993: 1.3*90MHz = 117MHz, 44E SPECint92, 60E SPECfp92, compared to 65 Si and 60 Sfp for Pentium.
1996: 1.7*90MHz = 153MHz, 58E Si92, 99E Sfp92, compared to 366 Si92, 283 Sfp92 for PentiumPro.

For various reasons, I doubt that DEC would ever have built a Pentium-like 2-issue superscalar. In particular, the NVAX team found that it didn't help much (2%) to do multiple-operand decode (and it was complex hardware), because the bottlenecks were later in the pipeline. I conjecture that it is hard to get much ILP just looking at 1-2 VAX instructions, as lots of them have immediate dependencies. Hence (if there had been no Alpha, just VAX), it would have been more plausible to target an OOO design for 1996, but I'd guess it would also have had to make the big change to 64-bit at that point. It's hard to believe they could have gotten to a trace cache design by then [neither Intel nor AMD had], and the tougher VAX decode might well incur more branch-delay penalty than the IA-32's.
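The clock projection above is simple arithmetic; redone as a sketch, it just scales a 90 MHz NVAX-class clock by the relative gate speeds from Uhler's chart (the SPEC figures quoted in the text are the post's estimates, marked "E", not mine).

```python
# Scale a 90 MHz NVAX-era clock by Uhler's relative gate speeds,
# reproducing the 1993 and 1996 projections in the text above.
base_mhz = 90
projected = {year: round(base_mhz * speedup)
             for year, speedup in [(1993, 1.3), (1996, 1.7)]}
# 117 MHz projected for 1993; 153 MHz projected for 1996
```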
Given DEC's design resources, one can sort of imagine doing:
a) Clock spins and minor reworks on NVAXen, to keep the installed base from bolting, while holding out hope that all would get well in 1996; but that's 3 years with not much performance scaling: a very tough market.
b) Simultaneously doing a 64-bit OOO VAX, because 1999 would have been late.

However, as has been pointed out in detail, there are just lots of extra complexities in the VAX ISA, and all this stuff adds up. Professionals don't design CPUs using vague handwaving, because it doesn't work. Anyway, DEC's gamble with Alpha didn't work [for various reasons], but at least it was a gutsy call to recreate the "one architecture" rule at DEC. Of course, personally, I would rather they had done something else :-)

But the bottom line is: the VAX ISA was very difficult to keep competitive. The obvious decoding complexity is always there, in one form or another, but the more serious problem is execution complexity that lessens effective ILP and is thus a continual drag on performance with reasonable implementations.

VAX: one of the great computer families, built around a clean ISA appropriate to its time, but increasingly difficult to implement competitively. R.I.P.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 8 Aug 2005 07:39:55 -0700
Message-ID: <[email protected]>

Among the problems with comp.arch is that it fills up with opinions that don't survive even minimal perusal of the literature...

1) One can argue about the PC, but if one reads the VAX-study references I quoted, one finds [Emer & Clark] that Immediates (PC)+ were 2.4% of the specifiers, and Absolute @(PC)+ were 0.6%, or 3% of the total. Personally, I didn't think that was worth the other problems, and neither did many of the other RISC designers (ARM being a notable exception), nor X86 nor 68K, but it does help with code size.

2) Peter thinks it's a bad idea to use a GPR as the SP. Most designers of real chips in recent years have thought otherwise, because SP-relative addressing is common, and it is ugly to special-case it.

3) The VAX equivalents of the IA-32 LEA are MOVA and PUSHA.

Peter "Firefly" Lund wrote:
> > b) The PC and SP are general registers for historical reasons - upwards
> > compatibility with the PDP-11.
>
> Compatibility is a good reason -- a /very/ good one.
>
> But the VAX didn't have binary compatibility, it just had a mapping from
> PDP-11 registers, addressing modes, and instructions onto the VAX ones.
>
> That made it easy to transliterate assembly source code. Emulating (or
> even JITting) is also made easier.
>
> But would it really have hurt so much if the VAX had provided one or two
> more general purpose registers and hid away the SP and PC? A couple of
> extra registers for the emulator to play with internally could have been
> nice (but there were already eight more in the VAX than in the PDP-11 so I
> guess it wouldn't have mattered much).
>
> Instructions that accessed the SP and PC would have had to be
> special-cased in the transliterator and the emulator -- but I'm not sure
> it would have been difficult or expensive (you would need special handling
> of the PC register anyway, since PDP-11 code addresses wouldn't match VAX
> code addresses, and of the SP register since 16-bit values on the stack
> for calls/returns won't match the native 32-bit values).
>
> What do you need to do with SP? Push, pop, call/ret, the occasional
> add/sub, SP-relative addressing for loading/storing parameters/return
> values/local variables. If you can move the SP to/from a GPR then what
> else would you need?
>
> What do you need to do with PC? Conditional/unconditional branches,
> calls, returns, and PC-relative loads and stores.
>
> Maybe we would like an equivalent of the IA-32 LEA instruction, too, for
> creating absolute pointers to values with SP/PC-relative addresses.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 14 Aug 2005 21:13:20 -0700
Message-ID: <[email protected]>

Eric P. wrote:
> "John Mashey" <[email protected]> writes:
> >
> > But the bottom line is: the VAX ISA was very difficult to keep
> > competitive. The obvious decoding complexity is always there, in one
> > form or another, but the more serious problem is execution complexity
> > that lessens effective ILP and is thus a continual drag on performance
> > with reasonable implementations.
>
> In case anyone is still interested in this topic,
> there are a bunch of papers by Bob Supnik at
> http://simh.trailing-edge.com/papers.html
> covering a variety of DEC design issues.

Great material; thanks for posting; Bob is doing a dandy job preserving old stuff. In particular, if somebody actually wants to build things, it is really useful to get insight about design processes and tradeoffs. The HPS postings were useful too.

> The one labeled "VLSI VAX Micro-Architecture" is from 1988
> (marked "For Internal Use Only, Semiconductor Engineering Group")
> and mentions at the end the ways a VAX might get lower CPI. It says
>
> "However the VAX architecture is highly resistant to macro-level
> parallelism:
> - Variable length specifiers make parallel decoding of specifiers
>   difficult and expensive
> - Interlocks within and between instructions make overlap of
>   specifiers with instruction execution difficult and expensive
>
> Most (but not all) VAX architects feel that the costs of macro-level
> parallelism outweigh the benefits; hence this approach is
> not being actively pursued."
>
> So it would seem that the designers felt at that time that decode
> was a major impediment.

I actually hadn't read this before I posted, but obviously, I'd talked to VAX implementors in the late 1980s, and what they complained about sank in. Anyway, thanks for posting.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 22 Jul 2005 14:38:18 -0700
Message-ID: <[email protected]>

Anton Ertl wrote:
> "John Mashey" <[email protected]> writes:
> >Sigh ... that opinion is strongly at variance with the facts in the
> >real world; the best engineers in the world (and DEC had plenty)
> >couldn't have implemented viable (in the sense of being truly
> >competitive) VAXen in 1996...
>
> How do you know? AFAIK in the real world in which I live in the
> engineers at DEC did not try to implement a new VAX in 1996; instead,
> VAX development stopped pretty soon after the release of the Alpha.

I'm surprised any long-time participant in comp.arch would ask, since this topic has been discussed here numerous times, including the topic of architectural issues that made the (elegant) VAX and (fairly elegant) 68020 more difficult to do cost-effective fast implementations for than the (simpler) 68010 and the (inelegant) X86.

Once again, I will remind people that RISC is a label for a specific style of ISA design, and the bunch of ISAs that more-or-less fit it are relatively similar. CISC was coined to describe "the rest", which covered an immense range of different ISAs. IBM S/360, X86, and VAX are *different*, and an analysis comparing X86 to (a) RISC does nothing to invalidate an earlier comparison of VAX to (a) RISC.

So, how do I know? Well, in the *real* world of CPU architecture, there were a fairly modest number of people who actually did this for a living, and many of them knew each other, moved among companies, went to the same conferences, recruited each other for panels, worked on committees (like SPEC), traded benchmarks, and talked informally in bars. I've said before here that I'd more than once been kidded by VAX implementors about how easy RISC guys had it, and then they'd cite various horrible weird special cases that had to be handled and that got in the way of performance.
I wouldn't attribute that to anyone, of course. But the bottom line is that the engineers who *knew* the VAX best, who had implemented it many times, and many of whom were *really, really good* engineers, came to believe that they simply could not keep implementing competitive CPUs with the VAX ISA. Some of them were already starting to think that in the mid-1980s, but lots more thought so a few years later, and so did certain DEC sales managers, who knew that if the customer wanted VMS, they won, but if the customer wanted some UNIX, they just couldn't compete. The VAX 9000 fiasco didn't help. There were some fine CMOS VAX chips done, but it was just too hard, even with mature, good compilers.

There were various internal DEC RISC efforts, and at a certain point [when DEC chose to use MIPS R3000s for some products], the most irritating thing to some DEC engineers was that they kept getting grabbed back off RISC investigations to help do more VAXen. In the *real world* of (very competent) designers who made their living doing VAXen, they just couldn't figure out how to keep doing it competitively. I will point out that many of the Alpha folks had done VAX implementations and software and performance analysis, i.e., guys like Witek, Sites, Dobberpuhl, Uhler, Bhandarkar, Supnik. Anyone who has the opinion that it was reasonable to be designing new VAXen in 1996, expecting them to be competitive, has to believe these guys are clueless idiots.

NOTE: that doesn't mean that I am claiming "so, they had to do Alpha, and do it the way they did it", as there were other options. I'm just saying that continuing on a VAX-only path wasn't believable to their experienced engineers, whose technical skills I hold in high regard, often from repeated first-hand contact.

> >I'd suggest reading a fine paper by a couple of the best computer
> >architecture performance people around, both of whom were senior DEC
> >engineers:
> >
> >Dileep Bhandarkar, Douglas W.
Clark, "Performance from Architecture:
> >Comparing a RISC and a CISC with Similar Hardware Organization," ACM
> >SIGARCH CAN, 1991 [and a couple other places]. A copy can be found:
> >http://www.cs.mipt.ru/docs/comp/eng/hardware/common/comparing_risc_and_cisc_proc/m
>
> Well, a few years later Dileep Bhandarkar, then employed at Intel,
> wrote a paper where he claimed (IIRC) that the performance advantage
> of RISCs had gone (which I did not take very seriously at the time);
> unfortunately I don't know which of his papers that is; I just looked
> at "RISC versus CISC: a tale of two chips" and it looks more balanced
> than what I remember.

That was another fine paper from Dileep, but the conclusion:
X86 can be made competitive with RISC
is not the same as:
VAX can be made competitive with RISC

> >BOTTOM LINE:
> >
> >DEC had every motivation in the world to keep extending the VAX as long
> >as possible, as it was a huge cash cow. DEC had plenty of money,
> >numerous excellent designers, long experience in implementing VAXen.
> >
> >BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...
>
> Looking at what Intel and AMD did with the 386 architecture, I am
> convinced that it is technically possible to design competitive VAXen;
> I don't see any additional challenges that the VAX poses over the 386
> that cannot be addressed with known techniques; out-of-order execution
> of micro-instructions with in-order commit seems to solve most of the
> problems that the VAX poses, and the decoding could be addressed
> either with pre-decode bits (as used in various 386 implementations),
> or with a trace cache as in the Pentium 4.

You're entitled to your opinion, which was shared by the VAX 9000 implementors. Many important senior VAX implementors disagreed. I've posted some of the reasons why the VAX was harder than X86, years ago. Of course you can do these things, but different ISAs get different mileage from the same techniques.
> Of course, on the political level it stopped being possible to design
> competitive VAXen, because DEC had decided to switch to Alpha, and
> thus would not finance such an effort, and of course nobody else
> would, either.

Ken Olsen loved the VAX and would have kept it forever. Key salespeople told him it was getting uncompetitive, and engineers told him they couldn't fix that problem, and that they'd better start doing something else.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 28 Jul 2005 23:56:29 -0700
Message-ID: <[email protected]>

Tom Linden wrote:
> But again, back to the familiar theme, had VAX
> received the billions that alpha received it too would be spinning today
> at 4GHz.

Nonsense. Not in the Real World, where economics and ROI actually matter.

1) When the VAX was designed (1975-), PL/I may have been the third most important language, after FORTRAN and COBOL, especially if one wanted to attack the IBM mainframe market. Of course, the VAX itself was created by the agonizing decision that the PDP-11 could not be extended further upward, but the VAX certainly catered to PL/I and COBOL.

2) In 1988, the tradeoffs were different: the fraction of new code being written in PL/I had decreased, and DEC's priorities were different. I understand that the VAX->Alpha transition might be less than optimal, especially for a PL/I vendor (like Kednos). That's Life.

3) Read the article by Bob Supnik:
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

Speaking of 1988:

"Nonetheless, senior managers and engineers saw trouble ahead. Workstations had displaced VAX VMS from its original technical market. Networks of personal computers were replacing timesharing. Application investment was moving to standard, high-volume computers. Microprocessors had surpassed the performance of traditional mid-range computers and were closing in on mainframes. And advances in RISC technology threatened to aggravate all of these trends. Accordingly, the Executive Committee asked Engineering to develop a long-term strategy for keeping Digital's systems competitive. Engineering convened a task force to study the problem. The task force looked at a wide range of potential solutions, from the application of advanced pipelining techniques in VAX systems to the deployment of a new architecture.
A basic constraint was that the proposed solution had to provide strong compatibility with current products. After several months of study, the team concluded that only a new RISC architecture could meet the stated objective of long-term competitiveness, and that only the existing VMS and UNIX environments could meet the stated constraint of strong compatibility. Thus, the challenge posed by the task force was to design the most competitive RISC systems that would run the current software environments." .... "The original study team was called the "RISCy VAX Task Force." The advanced development work was labeled "EVAX." When the program was approved, the Executive Committee demanded a neutral code name, hence "Alpha." Put another way, the EVAX was *supposed* to be an aggressive VAX implementation, whose goals were to be extended to 64-bit, and close the performance gap with RISCs, but after serious work, a fine engineering team (my opinion) concluded the problems weren't solvable. Just as in 1975, they decided they *had* to make an architecture change. 4) DEC designers certainly understood the application of OOO techniques by 1991/1992, and knew the basics of what EV6 was going to look like, and could have applied that to the VAX for 1996. BUT, I don't think anybody serious believed it was possible for a VAX design to match an Alpha design in performance, given similar design costs and time-to-market. Either the VAX would need a huge design team [way beyond the 100 or so microprocessor designers they had in the 1980s], or it would take a lot longer to do. [I have a bunch of email from senior ex-DEC engineers involved in VAX and Alpha implementations, and I wouldn't quote them without their permission, but words like "uninformed opinion" and "revisionist history" were prominent in descriptions of this thread.] 
5) I don't have time or interest in being serious about sketching an OOO VAX design, and of course, no one in their right mind would do that without access to the traces from many VAX codes, and lots of CPU cycles for simulation, since, of course, serious designs can only be done with statistics. Handwaving has zero credibility. On the other hand, I have had some email discussions with engineers who've implemented VAX microprocessors, and vetted some of the ideas about performance bottlenecks. I know this might seem strange, but I actually ascribe high credibility on this topic to competent people who have implemented multiple VAX micros...

If I get time in the next week, I'll try to consolidate that, at least to sketch the sorts of VAX architectural issues deemed to be hard.... and the main issues *aren't* the decoding of complex instructions, so much as the later-stage execution issues that make it difficult to get as much parallelism as you might expect in any reasonable implementation. The best I can do is a sketch, because I don't have the statistics ... but the people who did have them made the decision.

ONE LAST TIME: it wasn't politics that ended the VAX, it was engineering judgement by excellent (IMHO) DEC engineers, and the economics of doing relatively low-volume chips [low-volume compared to X86]. Nobody could afford to keep the VAX ISA competitive at DEC volumes....
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 29 Jul 2005 14:38:34 -0700
Message-ID: <[email protected]>

glen herrmannsfeldt wrote:
> John Mashey wrote:
> (snip regarding VAX and Alpha)
>
> > Put another way, the EVAX was *supposed* to be an aggressive VAX
> > implementation, whose goals were to be extended to 64-bit, and close
> > the performance gap with RISCs, but after serious work, a fine
> > engineering team (my opinion) concluded the problems weren't solvable.
> > Just as in 1975, they decided they *had* to make an architecture
> > change.
>
> Not to disagree, (because I can tell you know this much better
> than I do), but I always thought DEC wanted more from Alpha.
>
> I always felt that they wanted to be known for something, for breaking
> open the 64 bit world when everyone else was stuck in the 32 bit world.
> To show that they could do something other companies couldn't.
>
> Maybe sort of the way Cray felt about the supercomputer market.
>
> They could have had a darn fast processor with 32 bit addresses,
> maybe even 64 bit registers, and easily beat the VAX. It was many
> years before others decided to go for 64 bits, and even now I see little
> need for 64 bits for 98% of the people. An x86 processor with a 36 bit
> MMU would have gone a long way to satisfying users addressing needs.

Well, the truth is not that way...

1) Recall that DEC was famously burnt by running out of address bits too quickly on the PDP-11. Google comp.arch: mashey gordon bell pdp-11 gets you my BYTE article on 64-bit that references Gordon's comment. DEC engineers could plot a straight line on a log-scale chart as well as MIPS could, i.e., for practical DRAM sizes. Personally, I think good computer vendors are supposed to think ahead, and have hardware ready, so that software can get ready, so that when customers were ready to buy bigger memory and use it, they could ...
and not, once again, recreate the awkward workarounds that have occurred many times when running out of address bits.

2) A lot of readers of this newsgroup don't understand how interconnected the working computer architecture community could be. The following just notes some of the Stanford / DEC / SGI / MIPS relationships. Among other combinations:

- "MIPS: A VLSI Processor Architecture", John Hennessy, Norman Jouppi, Forest Baskett, and John Gill, Stanford Tech Report 223, June 1983.
- Forest left to run DECWRL, right across El Camino Real from Stanford, until he became CTO @ SGI later in 1986. Norm was over at DECWRL as well.
- Hennessy, of course, took sabbatical in 1984-1985 to co-found MIPS.
- DECWRL had many RISC fans; the first MIPS presentation there (that I was involved in, anyway) was April 1986, shortly after R2000 was announced.
- DEC, of course, had various on-again, off-again RISC projects in the 1980s, with none getting critical mass. I wouldn't attempt to describe the politics, even what I know, except the phrase "VAX uber alles" was heard now and then :-)

3) Most people know that DEC did a deal with MIPS in 1988. As it happens, I initiated the process that led to that deal, via a chance meeting with an old Bell Labs colleague [Bob Rodriguez], then at DEC in NH, at a Uniforum in early 1988. I'd given Bob a MIPS Performance Brief. That evening, at the conference beer bust, Bob said "these look fast!" and evinced a wish to "port Ultrix to this and wheel it in and show Ken Olsen that it doesn't take 3 years and hundreds of people." I talked our Boston sales office into loaning Bob two MIPS systems, noting that the likelihood of DEC ever buying an outside CPU was low, but if there was somebody who could do a quick Ultrix port, it was Bob. After a month or so of paperwork, Bob and a couple of friends got Ultrix working pretty well in less than 3 weeks [late April/early May].
That incited the DEC Palo Alto workstation people, who were having a rough time competing in the workstation business with VAX chips, except when VMS was required. [BTW, I knew these folks also, if for no other reason than MIPS' participation in the Hamilton Group [of non-Sun UNIX folks], so named because the first meeting was held at DEC on Hamilton Avenue in Palo Alto.] A frenzied sequence of meetings then ensued, as DEC Palo Alto and various Ultrix folks clamored to use MIPS chips so they could build competitive workstations *soon*.

4) In any case, there was a meeting in the Summer in which a team of DEC's most senior engineers [in VLSI, systems, compilers, and OS] was gathered together by Ken Olsen with a few days' notice and sent to Sunnyvale to do a solid day of technical due diligence. I'm pretty sure we knew by then that the R4000 should be 64-bit, and discussed that with DEC (and SGI, and a few other close customers, but DEC and SGI were the ones who were especially desirous of 64-bit). If it wasn't at that meeting [it was a lonnnngg day], it was shortly thereafter. The DEC engineers were *quite* competent, and asked lots of tough questions [although the OS part was trivial, given that Ultrix was already running and benchmarking well. :-)]

The CPU designers & compiler people, of course, were from one of the premier places for doing this in the world, had been building one of the most successful computer lines in history, and DEC was at the height of its profitability. Some had been advocating RISC for years, but were always getting grabbed away, so the projects were on-again, off-again, which meant, among other things, that there was little progress on software, and of course, DEC was investing big-time in ECL, much to the Hudson, MA CMOS guys' consternation. Given all that, they got a mission to come see if DEC should use MIPS chips ... and it would have been really easy to have done a political NIH hatchet job, but they didn't.
They went back the next day, wrote their report, and said OK, although I'm sure it was really, really painful for some of them. I RESPECT THAT A LOT, which is why I thought some of the postings in this thread were simply nonsense... At that point, MIPS was a few hundred people in two little buildings in Sunnyvale. At the end of the day, we gave them a tour of our computer room. Some of the DEC Hudson CMOS engineers were badly shocked. On the way out the door, I happened to overhear one say to another, in a stunned voice: "This little startup has more compute power for chip simulation than we do at Hudson!! We're DEC, how can that be?!" Answer (in voice cold enough to freeze air) "Yes, that's why we've got a bad problem and we'd better fix it." This was Summer of 1988 ... now look again at: http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt Of course, by the time of the 64-bit working group meetings in mid-1992, both SGI/MIPS and DEC had chosen the same flavor of C model. It was well understood at that point (in that group) that HP, Sun, and IBM were working on 64-bit designs. 5) SGI shipped the R4000 in the Crimson in 1Q92, albeit running a 32-bit OS. DEC shipped production Alpha systems ~4Q92, running 64-bit OSs. Of course they marketed 64-bit hard... I would too. Around 4Q94, both DEC and SGI first shipped SMPs with *both* of the following: - 64-bit OSs - plausibly-purchasable memories above 4GB, i.e., where >4GB user address space was starting to be needed. I knew customers who bought Power Challenges, got 4GB+ of memory, and immediately recompiled code to use it all, in one (parallelized) program. [I.e., that's a one-line change in some big Computational Fluid Dynamics code :-), changing some array size.] During 1989-1992, there was plenty of discussion among MIPS, SGI, and DEC people about 64-bit programming models, for example. 
6) SUMMARY DEC couldn't figure out how to make the VAX competitively fast and 64-bit, and they could look at DRAM and Moore's Law charts like the rest of us. Privately, they (and we) were a little surprised that other RISC vendors were waiting a generation. DEC certainly knew exactly what MIPS was doing, and certainly knew SGI intended to ship 64-bit OSs, if for no other reason than the amount of back-door communication amongst like-minded software people, and they certainly had a good idea of what everybody else was up to.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 29 Jul 2005 20:01:20 -0700
Message-ID: <[email protected]>

Eric P. wrote:
> The other problems might be:
>
> - Strong memory ordering prevents any access reordering
> - Any number of idiosyncrasies wrt the order that data values
>   are read or written vs the order that auto increment/decrement
>   deferred operation are performed will inject pipeline stalls due
>   to potential memory aliases that probably never actually happen.
>   This combines with strong ordering to basically serialize everything.
> - Having program counter in a general registers that can be
>   manipulated by auto increment addressing modes probably
>   causes many pipeline problems later to feed value forward
> - 16 integer registers with many having predefined functions
>   is too small and causes lots of register spills.
> - Having a single integer and float register set means extra
>   time moving float values over to float registers and back.
> - Small combined register set means spilling float values a lot
> - Small page size requires lots of TLB entries, which should be
>   fully assoc. for performance, which means big TLB chip real estate.
> - The worst case of requiring 46 TLB entries to be resident
>   to ensure the ADDP6 instruction can complete. Not really a
>   performance limit so much as a pain in the ass to design around.

I think you have the general idea; more later, when I get time.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance? [really Part 1 of 3: Micro economics 101]
Date: 30 Jul 2005 16:40:53 -0700
Message-ID: <[email protected]>

John Mashey wrote:
> I think you have the general idea; more later, when I get time.

PART 1 - Microprocessor economics 101 (this post)
PART 2 - Applying all this to DEC, NVAX, Alpha, competition (this post)
PART 3 - Why it seems difficult to make an OOO VAX competitive (later)

PART 1 - Microprocessor economics 101 (simplified)

This thread is filled with *fantasies* about cheap/fast/timely VAXen, because the issue isn't (just) what's technically feasible, it's what you can do that's cost-competitive and time-to-market competitive. The following is over-simplified, of course, but hopefully it will bring some reality to this discussion.

DESIGN COST

Suppose it costs $50M to design a chip and get it into production. The amortized cost/chip for engineering versus total unit volume is:

    Cost/chip        Volume
    $1,000,000               50
    $100,000                500
    $10,000               5,000
    $1,000               50,000
    $100                500,000
    $10               5,000,000
    $1               50,000,000

Alternatively, if your volumes happen to be 5,000,000, you could spend $500M on development, and still only have an engineering cost/chip of $100. INTEL AND AMD CAN MAKE FAST X86S BECAUSE THEY HAVE VOLUME.

Anne & Lynn Wheeler's post a while ago pointed at VAX unit volumes, which as of 1987 had the MicroVAX II (a very successful product) having shipped 65,000 units over 1985-1987. If the unit volumes are inherently lower, you have to get profit based on *system* value that yields unusually high margins, so that the system profit essentially subsidizes the use of such chips. This works for a systems vendor when the market and customer switching costs allow high margins, i.e., IBM mainframes to this day, and VAXen in the 1980s. [The first MIPS were designed using 2 VAX 11/780s, plus (later) an 8600, plus some Apollos ...
and the VAXen seemed expensive, but they were what we needed, so we paid.]

SYSTEMS COMPANIES

Mainframe and minicomputer systems companies thrived when the design style was to integrate a large number of lower-density components, with serious value-add in the design of CPUs from such components. [Look at a VAX 11/780 set of boards.] As microprocessors came in, and became usefully competitive, most such companies struggled very hard with the change in economics, and the internal struggles to maintain upward compatibility. Most systems companies had real problems with this. There were pervasive internal wars among different strategies, especially in multi-division companies:

A "We can do this ourselves, especially with new ECL (or even GaAs)."
  [Supercomputer and mainframe, and minicomputers chasing mainframes]

B "We can build CMOS micros that are almost as fast as ECL, and much
  cheaper, and enough better than commodity in feature, function, and
  performance." [IBM, DEC, HP ... and later Sun]

C "We should put our money in system design and software, and buy
  micros." [Apollo, Convergent, Sun, SGI, Sequent, and various
  divisions of older companies]

Of course, later there came:

D "We should buy micros, add value in systems design, and use
  Microsoft." Hence, IBM PC division, Compaq, etc.

and yet later:

E "We should minimize engineering costs and focus on distribution."
  I.e., Dell.

Many companies did one internal design too many. Most of the minicomputer vendors went out of business. IBM, DEC, and HP were among the few that actually had the expertise to do CMOS micro technology, albeit not without internal wars. [I was involved in dozens of these wars. One of the most amusing was when IBM friends asked me to come and participate in a mostly-IBM-internal conference at T. J. Watson, ~1992.
What they *really* wanted, it turned out, was for somebody outside IBM politics (i.e., "safe") to wave a red flag in front of the ECL mainframe folks, by showing a working 100MHz 64-bit R4000.]

SEMICONDUCTOR COMPANIES, SYSTEMS COMPANIES, AND FABS

In the 1980s, if you were a semiconductor company, you owned one or more fabs, which were expensive at the time, but nothing like they are now. You designed chips that either had huge volumes, or high ASPs, or best, both! Volume experience improves yield of chips/wafer, and of course, volume amortizes not only the fab cost, but the design cost. If you are a systems vendor, and only one fab can make your chips, you have the awkward problem that if the fab isn't full, you have a lot of capital tied up, and if it is full, you are shipment-constrained. When you built a fab, first you ran high-value wafers, but even after the fab had aged, and was no longer state-of-the-art, quite often you could run parts that didn't need the most advanced technology.

In the 1980s, if you wanted to have access to leading-edge VLSI technology for your own designs, EITHER:

- You were a semiconductor company, i.e., the era of "Only real men own fabs" (sorry, sexist, but that was the quote of the day). OR
- You were a systems company big enough to afford the fab treadmill. IBM [which always had great process technology]. DEC usually had 1-2 CMOS fabs [more on that later]. HP had at least 1, but at least sometimes there was conflict between the priorities for high-speed CPU design and high-volume lower-cost parts. Fujitsu, NEC, etc. were both chip and systems companies. In any case, you must carefully amortize the fab costs. OR
- You were a fabless systems company who could convince a chip company to partner with you, where your designs were built in their fabs, and either you had enough volume alone, or (better) they could sell the chips to a large audience of others.
[Example: Sun & TI] OR
- You were a small systems/chip company [MIPS] that was convincing various other systems companies and embedded designers to use the chips, and thus able to convince a few chip partners to do long-term deals to make the chips, and sell most of them to other companies, and as desired, have licenses to make variations of their own from the designs. Motivations might be that a chip salesperson could get in the door with CPUs, and be able to sell many other parts, like SRAMs [motivation for IDT/MIPS and Cypress/SPARC], or be able to do ASIC versions [motivation for LSI Logic]. In MIPS' case, for the first few years, the accessible partners were small/medium chip vendors, and it was only in 1988 that we were able to (almost) do a deal with Motorola and were able to do ones with NEC and Toshiba, i.e., high-volume vendors with multiple fabs.

Now, you might say, why wouldn't a company like Sun just go to a foundry with its designs, or in MIPS' case, why wouldn't it just be a normal fabless semiconductor vendor, of which there are many? A: Accessible foundries, geared to producing outside designs, with state-of-the-art fabs ... didn't really exist. TSMC was founded in 1987, and it took a long time for it to grow.

ON HAVING A FAB, OR NOT

If you own the process, you can diddle it somewhat to fit what you're building. If your engineers are close with the fab process people, and you have wizard circuit designers, you can do all sorts of things to get higher clock rates. If you aren't, you use the factory design rules ... or maybe you can do a little negotiation with them. In any case, there is a tradeoff between owning a fab ($$$) and getting higher clock rate, and not owning the fab, and being less able to tune designs.

SYSTEMS COMPANIES THAT DESIGN CHIPS, SOLD TO OTHERS OR NOT

There is a distinct difference in approach between the extremes of:

- We're designing these chips for our systems, running our OSes and compilers.
  We might sell chips to a few close partners for strategic reasons.

VERSUS

- We expect to use these chips in our systems, but we expect that large numbers will be used in other systems, with other software. We will include features that will never be used in our own systems. We will invest in design documentation, appropriate technical support, sampling support, debugging support, software licensing, etc., etc. to enable others to be successful.

IBM still has fabs, but of course IBM Microelectronics does foundry work for others (to amortize the fab cost). POWER was really geared to the RS/6000; PPC made various changes to allow wider use outside, and IBM really sought this (volume). Sun never had a fab, did a lot of industry campaigning to spread SPARC, but in practice, outside of a few close partners, most of the $ sales of SPARC chips went to Sun. HP has sold PA-RISC chips to a few close partners, but in general, never was set up in the business of selling microprocessors. MIPS started to do chips, but had enough work to do on systems that it needed to build systems [and, if you understand the volume issues above, needed systems revenue, since in the early days, it couldn't possibly get enough chip volume to make money. I.e., a systems business can work at low volumes, whereas a chip business doesn't.]

DEC, of course, after its original business (modules), was much more set up as a systems vendor, and never really had a chip-supplier mindset, although Alpha, of course, was forced to try to do that (volume, again). Somebody suggested they should have been selling VAX chips, and that may be so, but it is really hard to make that sort of thing happen, as it requires serious investment to enable other customers to be successful, and it requires the right mindset, and it's really hard to make that work in a big systems company.

(I'm DEC and I sell you VAX chips. What OS do you run? Oh, VMS; OK, we'll license you that. What sort of systems do you want to build?
You want to build lower-cost minicomputers to sell to the VAX/VMS installed base? Oh.... actually, it looks like we're out of VAX chips this quarter, sorry.)

(Or one might recall that Sun talked Solbourne into using SPARC, and Solbourne designed their own CPUs and built SMPs. If a Sun account wanted an SMP, and somebody like SGI was knocking at the door, Sun would point at Solbourne (to keep SPARC), but if Solbourne was infringing on a Sun sale, it was not so friendly - I once got a copy of a Sun memo to the salesforce about how to clobber Solbourne.)

Anyway, a *big* systems vendor, to be motivated to the bother of successfully selling its otherwise proprietary CPU chips, has to find other, essentially non-competitive users of them, who can be successful. The most successful example of that is the IBM PPC -> Apple case. Probably the most interesting Alpha case was its use in the Cray T3 systems, fine supercomputers, but not exactly high-volume.

ON DESIGNS AND ECONOMICS

People probably know the old project management adage: "features, cost, schedule: you pick two, I'll tell you the other." In CPU design, you could, these days, use:

- FPGA
- Structured ASIC
- ASIC, fully synthesized logic
- Custom, with some synthesized and some custom logic/layout design, and maybe with some special circuit design.

Better tools help ... but they're expensive, especially because people pushing the state of the art tend to need to build some of their own. This is in increasing order of design cost.

- An FPGA will be big, use more power, and run at lower clock rate.
- The more custom a chip is, the faster it can go, but it either takes more people, or longer to design, or (usually) both.

Companies like Intel have often produced an original design with a lot of synthesized logic, with one design team, and then another team right behind them, to use the same logic, but tune the critical paths for higher clock rate, shrink the die with more custom design, work on yield improvements, etc.
Put another way, if you have enough volume, and good ASPs, you can afford to spend a lot of engineering effort to tune designs, even to overcome ISA problems.

PART 2 - Applying all this to DEC, NVAX, Alpha, Competition

DEC (at least some people) understood the importance of VLSI CMOS. DEC had excellent CPU and systems designers, software people, and invested in fabs (for better or worse - some of us could never quite figure out how they could afford the fabs in the 1990s). They had some super-wizard circuit designers, who even impressed some of the best circuit designers I've known. However, in the 1980s, they never had more than about 100 VLSI CPU designers, which in practice meant that at any one time, they could realistically be doing one brand-new design, and one {shrink, variation}. They of course were doing the ECL VAX 9000, but that was a whole different organization.

The problem that DEC faced was that their VAX cash cow was under attack, and they simply couldn't figure out how to keep the VAX competitive, first in the technical markets [versus PA-RISC, SPARC, and MIPS], and then in commercial [PA-RISC]. I think Supnik's article described this reasonably well.
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

As a good head-to-head comparison, NVAX and the Alpha 21064 were built:
- in the same process
- about the same time
- with the same design tools
- with similar-sized teams

... although the NVAX team had the advantage of having implemented pipelined CMOS VAXen before, a long history of diagnostics, test programs, statistics on behavior, etc., whereas the Alpha team didn't already have as much of that. The ISA difference between VAX and Alpha was such that the NVAX team had to spend a lot more effort on micro-architecture, whereas the Alpha team could spend that effort on aggressive implementation, such that the MHz difference was something like 80-90MHz for NVAX/NVAX+, and up to 200MHz for the 21064.
Around 1992, modulo maybe a year difference in software, that gave numbers like:

                 SGI         DEC          DEC
                 Crimson     VAX7000/610  DEC7000/610
                 MIPS        VAX          Alpha
                 R4000       NVAX         21064
  transistors    1.3M        1.3M         1.68M
  size           184 mm^2    237 mm^2     234 mm^2
  process        1.0 micron  .75 micron   .75 micron
  metals         2-metal     3-metal      3-metal
  L2             1MB L2      4MB L2       4MB L2
  clock rate     100MHz      90MHz        182MHz
  SPECint89      61          34           95
  SPECfp89       78          58           244

Now, we all know SPECint/SPECfp aren't everything, and the exact numbers don't matter much, but that's still a big difference. I threw in the MIPS chip to illustrate that even a well-designed NVAX was outperformed by a single-issue chip that was 3/4 the size, in a substantially less dense technology [1.0 micron versus .75, and 2-metal versus 3-metal], required to meet generic design rules across multiple fab vendors, was 64-bit, and still had a higher clock rate. None of this was due to incompetence on the NVAX team; that was a *fine*, successful design to be proud of.

But once again, go back to the economics. It's a classical move to try to take market share and build volume via all-out performance, selling first to those with the most portable code and willing to pay for performance. It's a lot harder to do that with an NVAX that was 60-80% of the performance (on these, anyway) of something like an R4000 that, if not a commodity, was a lot closer to that.

A bit later, I'll post Part 3, my analysis of why I think it would have been hard to build a *competitive* OOO VAX. In the real world, it wasn't enough to build an OOO VAX, it had to be competitive on time-to-market, performance, and cost. This post has covered the economic issues, the next will discuss some of the ISA issues.
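The ratios behind the claims above can be checked directly from the table's numbers. A minimal sketch (my own arithmetic, not from the post):

```python
# Die-size and performance ratios from the R4000/NVAX/21064 table:
# the R4000 is roughly 3/4 the NVAX die size, while the NVAX delivers
# roughly 56-74% of the R4000's SPEC89 numbers.
r4000 = {"mm2": 184, "specint89": 61, "specfp89": 78}
nvax  = {"mm2": 237, "specint89": 34, "specfp89": 58}

size_ratio = r4000["mm2"] / nvax["mm2"]              # ~0.78, i.e., ~3/4
int_ratio  = nvax["specint89"] / r4000["specint89"]  # ~0.56
fp_ratio   = nvax["specfp89"] / r4000["specfp89"]    # ~0.74

print(f"R4000/NVAX die size:  {size_ratio:.2f}")
print(f"NVAX/R4000 SPECint89: {int_ratio:.2f}")
print(f"NVAX/R4000 SPECfp89:  {fp_ratio:.2f}")
```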
But, as a teaser, I note that there are some ISA attributes of the VAX:

a) Not found in RISCs
b) Not found in X86
c) Some of which are found in the S/360 family, but less often

Some of them are the same ones that make other aggressive implementations hard, but some *really* cause trouble for OOO implementations, and in particular, make it very hard to get as much mileage from the X86 convert->micro-op style of designs.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Part 1 of 3: Micro economics 101 (was: Code density ...)
Date: 31 Jul 2005 17:48:37 -0700
Message-ID: <[email protected]>

Anton Ertl wrote:
> "John Mashey" <[email protected]> writes:
> >This thread is filled with *fantasies* about cheap/fast/timely VAXen,
> >because the issue isn't (just) what's technically feasible, it's what
> >you can do that's cost-competitive and time-to-market competitive.
>
> Well, my impression was that you made a claim about technical
> feasibility.

Once again, I don't know why a long-time participant in comp.arch would think that. I've posted on this topic off and on for years, including the old April 18, 1991 "RISC vs CISC (very long)" post that's been referenced numerous times. (Google: mashey risc cisc very long) It said, among other things:

"General comment: this may sound weird, but in the long term, it might be easier to deal with a really complicated bunch of instruction formats, than with a complex set of addressing modes, because at least the former is more amenable to pre-decoding into a cache of decoded instructions that can be pipelined reasonably, whereas the pipeline on the latter can get very tricky (examples to follow). This can lead to the funny effect that a relatively "clean", orthogonal architecture may actually be harder to make run fast than one that is less clean."

In context, this was ~ "it might be easier to deal with X86 than VAX"
Decoded Instruction Cache ~ Intel "trace" cache

And in March 8, 1993 (Google: mashey vax complex addressing modes), I said:

"Urk, maybe I didn't say this right:
a) Decoding complexity.
b) Execution complexity, especially in more aggressive (more parallel) designs.
I'm generally much more worried about the latter than the former, since there are reasonable things to do about the former (i.e., decoded instruction caches, which at least help some)."
If I've ever posted anything that seemed to imply it was impossible to build an OOO VAX, I apologize for the ambiguity, but I think I've consistently expressed this as "difficult" or "complex", or "needs a lot of gates", or "likely to incur extra gate delays", NOT as "impossible". I've various times discussed the 360/91 or the VAX 9000 as things that went fast for their clock rate, but at the cost of high cost and complexity. After all, the key issues around OOO were published in Jan 1967 (Anderson, Sparacio, Tomasulo on the 360/91). An aphorism of the mid-1980s amongst CPU designers was "Sometime we'll get enough gates to catch up with the 360/91."

In the real world, engineers have to juggle design/verification cost, product cost, and time-to-market; there are plenty of things that are "technically feasible" but have no ROI...
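The decoded-instruction-cache idea Mashey quotes above (pre-decode complex instructions once, then refetch the decoded form) can be illustrated with a toy model. This is my own sketch, not anything from the posts; the instruction names and micro-op format are made up for illustration:

```python
# Toy model of a decoded-instruction (micro-op) cache: complex
# instructions are decoded into micro-ops once, then refetched from the
# cache by PC, so decode complexity is paid only on a miss, off the
# per-execution critical path.
decode_count = 0

def decode(instruction: str) -> list:
    """Pretend decoder: split a complex instruction into micro-ops."""
    global decode_count
    decode_count += 1
    opcode, *operands = instruction.split()
    # e.g. "ADDL3 r1 r2 r3" -> read sources, execute, write destination
    return ([f"uop_read {op}" for op in operands[:-1]]
            + [f"uop_{opcode.lower()}"]
            + [f"uop_write {operands[-1]}"])

uop_cache = {}

def fetch(pc: int, instruction: str) -> list:
    if pc not in uop_cache:            # miss: decode and fill
        uop_cache[pc] = decode(instruction)
    return uop_cache[pc]               # hit: no decode work needed

# A loop body re-executes the same PCs; only the first trip decodes.
for _ in range(10):
    fetch(0x100, "ADDL3 r1 r2 r3")
    fetch(0x104, "MOVL r3 r4")
assert decode_count == 2  # decoded once per PC, not once per execution
```

The point of the sketch is only the economics: 20 instruction executions cost 2 decodes. Mashey's caveat still applies, of course; caching decode helps the front end, while the VAX's hard problems were in later-stage execution.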
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: Code density and performance? [really Part 2b of 3: Micro economics 101]
Date: 3 Aug 2005 23:16:04 -0700
Message-ID: <[email protected]>

From side questions, here's an update to Part 2, and you will definitely want to use Fixed Font... Here's a better one, with a few more CPUs to give context, and you may want to print.

TABLE 1 - MIPS, VAX, Alpha, Intel

CODE     A        B         C         D         E
SHIP     1Q92     3Q92      4Q92      3Q93      3Q95
CO       SGI      DEC       DEC       SGI       DEC
PROD     Crimson  7000/610  7000/610  Chal XL   600 5/266
ARCH     MIPS     VAX       Alpha     MIPS      Alpha
CPU      R4000    NVAX+     21064     R4400     21164
XSTRS    1.3M     1.3M      1.68M     2.3M      9.7M
mm^2     184      237       234       184       209
Micron   1.0      0.75      0.75      .8        0.35
Metals   2        3         3         2         4
L1       8KI+8KD  2KI+8KD   8K+8K     16K+16K   8KI+8KD
L2       1MB      4MB       4MB       1MB       96K (+L3)
MHz      100      90        182       150       266
Type     1P       1P        2SS       1P        4SS
Bus      64       128       128       64        128

SPEC89
Issue    Jun92    Sep92     Mar93     -         -
Si89     61       34        95        -         -
Sfp89    78*      58*       244*     -         -
  *All of these have the matrix300-raised numbers

SPEC92
Issue    June92   June92*   Mar93     Jun93     Sep95
Si92     58       34E**     95        88        289
Sfp92    62       46E**     182       97        405
  **My estimate, noting that MIPS & Alpha derated by .75-.8 going
  from SPECfp89 to SPECfp92. Take with many grains of salt. I
  couldn't easily find any SPEC92 numbers for VAX.

CODE     F        G         H         I         J
SHIP     1991     3Q92      3Q93      2Q94      2Q96
CO       Intel    CPQ       Intel     Intel     Intel
PROD     Xpress   Deskpro   Xpress    Xpress    Alder
ARCH     IA-32    IA-32     IA-32     IA-32     IA-32
CPU      486DX    486DX2    Pentium   P54C      PentiumPro
XSTRS    1.2M     1.2M+     3.1M      3.2M      5.5M
mm^2     81       ?         295?      147       196
Micron   0.8      ?         0.8       0.6       0.35
Metals   2?       2?        3         4         4
L1       8K       8K        8KI+8KD   8KI+8KD   8KI+8KD
L2       256K     256K      256K      512K      256K
MHz      50       66        66        100       200
Type     1P       1P        2SS       2SS       3-OOO
Bus      32       32        64        64        64

SPEC92
Issue    Mar92    June92    Jun93     Jun94     Dec95
Si92     30       32        65        100       366
Sfp92    14       16        60        75        283

Type:  1P:    1-issue, pipelined
       2SS:   2-issue, superscalar
       4SS:   4-issue, superscalar
       3-OOO: 3-issue, out-of-order

=================================

What I've done is: Show the SPEC89 numbers for VAXen, because I can't find SPEC92 numbers.
Then I've done a gross estimate of the equivalent SPEC92, so that I can get all of the machines on the same scale, noting of course that benchmarks degrade over time due to compiler-cracking. I used the highest NVAX numbers I have handy, from my old paper SPEC Newsletters. I'm ignoring costs, below, and the dates in the table must be taken with lots of salt, for numerous reasons, and as always SPECint and SPECfp aren't everything [spoken as an old OS person].

NVAX shipped in 4Q91, NVAX+ in 1992. The first R4000s shipped in systems in 1Q92, so these are contemporaneous, as they are with the 486DX and 486DX2. The NVAX+ is about 75-80% of a MIPS R4000 on integer and FP here, despite using a better process [.75 micron, 3-metal, versus 1.0 micron, 2-metal], a larger die [237 versus 184], and being 32-bit rather than 64-bit.

[It is somewhat an accident of history and business arrangements that the R4000 was done in 2-metal, but that forced it to be superpipelined, 1-issue, rather than the original plan of 2-issue superscalar. As a result, the R4000/R4400 often had lower SPECfp numbers than the contemporaneous HP and IBM RISCs, although in compensation it sometimes had better integer performance, and could sometimes afford bigger L2 caches, because the R4000/R4400 themselves were relatively small.]

In any case, on SPEC FP performance, in late 1992, the fastest NVAX+ was outperformed by IBM, HP, MIPS, Sun (maybe), and Alpha (by 3-4X). The NVAX+ was 3X faster than a 66MHz 486DX2. In SPECint, in late 1992, the NVAX was outperformed, generally, by the RISCs ... but worse, there wasn't much daylight between it and a 66MHz 486DX2, or even a 50MHz 486DX.

The real problem of course (not just for the VAX, but for everybody) was the bottom right corner of the Table. Intel had the resources and volume to "pipeline" major design teams [Santa Clara & Portland] plus variants and shrinks, and there was an incredible proliferation in these years.
It's worth comparing [B] NVAX+ with [H] Pentium. Suppose one were a VAX customer in 1992:

If you were using VAX/VMS:
- commercial: committed to VMS for a long time.
- technical (FP part): RISC competitors keep coming by with their numbers

If you were using Ultrix:
- FP: serious pressure from RISC competitors
- Integer: serious pressure already from RISC competitors, and horrors! Intel getting close to parity on performance

I'm not going to comment on DEC's handling of Alpha, fabs, announcements, alternate strategy variations. But this part should make clear that there was real pressure on the VAX architecture, from above (in terms of performance) and from below (Intel coming up).

One might imagine, had there been no Alpha, and had everybody at Hudson kept working on VAXen, that they could have gotten:
[X] a 2SS superscalar [like Pentium], in 1994, perhaps OR
[Y] some OOO CPU [like Pentium Pro], in 1996, perhaps
as well as doing the required shrinks and variants. From the resources I've heard described, I find it difficult to believe they could have done both [X] and [Y] (and note, world-class design teams don't grow on trees). I could be convinced otherwise, but (as one of the NVAX designers says) only by "members of the NVAX and Alpha design teams, plus Joel Emer" :-), i.e., well-informed people.

In Part 3, I'll sketch some of the tough issues of implementing the VAX, as best I can, and in particular note the ISA features that might make things harder for VAX than for X86, even for 2SS, 4SS, or OOO designs. In particular, what this means is that you can implement a type of microarchitecture, but it gains you more or less performance depending on the ISA and the rest of the microarchitecture. For instance, the NVAX design at one point was going to decode 2 operands/cycle, and it was found to add much complexity and gain only 2%.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 7 Aug 2005 18:48:10 -0700
Message-ID: <[email protected]>

(You will want Fixed Font). The earlier parts were:
- (posted Jul 30) PART 1 - Microprocessor economics 101
                  PART 2 - Applying all this to DEC, NVAX, Alpha, competition
- (posted Aug 3)  Really Part 2b (updated Table 1 and added more discussion)

FUNDAMENTAL PROBLEM

Certain VAX ISA features complexify high-performance parallel implementations, compared not only to high-performance RISCs, but also to IA-32. The key issue is highlighted by Hennessy & Patterson [1, E-21]: "The VAX is so tied to microcode we predict it will be impossible to build the full VAX instruction set without microcode." Unsaid, presumably because it was taken for granted, is: for any higher-performance, more parallel micro-architecture, designers try to reduce the need for microcode (ideally to zero!).

Some kinds of microcoded instructions make it very difficult to decouple:
A) Instruction fetch, decode, and branching
B) Memory accesses
C) Integer, FP, and other operations that act on registers
Instead, they tend to make A&B, or A&C, or A, B & C have to run more in lockstep.

It is hard to achieve much Instruction Level Parallelism (ILP) in a simple microcoded implementation, so in fact, implementations have evolved to do more prefetch, sometimes predecode, branch prediction, in-order superscalar issue with multiple function units, decoupled memory accesses, etc, etc. ISAs often had simple microcoded implementations [360/30, VAX-11/780, Intel 8086] and then evolved to allow more pipelining. Current OOO CPUs go all-out to decouple A), B), and C), to improve the ILP actually achieved, at the expense of complex designs, die space, and power usage. Some ISAs are more suitable for aggressive implementations, and some make it harder.
The canonical early comparison was the CDC 6600 versus the IBM 360/91; the even stronger later one would be Alpha versus VAX. A widespread current belief is that the complexity, die cost, and propensity for long wires of high-end OOOs may have reached diminishing returns, compared to multi-core designs with simpler cores, where the high-speed signals can be kept in compact blocks on-chip.

IA-32 has baroque, inelegant instruction encoding, but once decoded, most frequently-used instructions can be converted to a small number (typically 1-4) of micro-ops that are RISC-like in their semantic complexity, and certainly don't need typical microcode. As noted earlier in this sequence, the IA-32 volumes can pay for heroic design efforts.

The VAX ISA is orthogonal, general, elegant, and easier to understand, but that generality also makes decoding difficult when trying to do several operands in parallel. Worse, numerous cases are possible that tend to lockstep together 2 or 3 of A), B), or C), lowering ILP, or requiring hardware designs that tend to slow clock rate or create difficult chip layouts. Even worse, a few of the cases are actually common in some or many workloads, not just potential. As one VAX implementor wrote me: "it doesn't take much of a percentage of micro-coded instructions to kill the benefits of the micro-ops." That is a *crucial* observation, but of course, the people who really know the numbers tend to be the implementers...

It is interesting to note that the same things that made VAX pipelining hard, and inhibited the use of a 2-issue superscalar, also make OOO hard. Some problems are easier to solve, but others just move around and manifest themselves in different ways.
- decode complexity
- indirect addressing
- multiple side-effects
- some very complex instructions
- subroutine call mechanism

Following is a more detailed analysis, showing REFERENCES first (easier to read on the Web), briefly describing OOO, then going through a sample of troublesome VAX features and comparing them to IA-32, and sometimes S/360, and finally a CONCLUSION that wraps all this together with DEC's CMOS roadmap in the early 1990s to show the difficulty of keeping the VAX competitive.

==========================

CAVEAT: I've never helped design VAXen, although I used them off-and-on between 1979 and 1986. I have participated (modestly) in several OOO designs, the MIPS R10000 and successors, plus one that never shipped. I've had lots of informal discussions over the years with VAX implementors, and I have reviewed some of the ideas below with at least one of them. I don't have the statistics that a professional needs to really do an OOO VAX design, so at best I can sketch some of the problems.

With enough gates, you can do almost anything ... but complexity incurs design cost, design time, and often chip layout problems and gate delays. Unlike software on general-purpose systems, where adding a bit of code rarely bothers much, the blocks of a chip have to fit on a 2-dimensional layout, and their physical relationships *matter*. Sometimes minor-seeming differences cause real problems.

==========================

REFERENCES (placed here for convenience):

Assumed reading:
[0] Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, 2003. Chapters 2, 5, and especially 3, plus Appendix A.

Brief explanation, and detailed reference, of the VAX:
[1] Hennessy & Patterson, "Another alternative to RISC: the VAX Architecture", www.mkp.com/CA3, Appendix E.
[2] Digital Equipment, VAX Architecture Handbook, 1981.
Superb VAX performance analyses of the early 1980s by DEC people; ironically, invaluable to RISC designers [at MIPS, used to settle arguments]:
[3] Clark and Levy, "Measurement and analysis of instruction use in the VAX-11/780", ACM SIGARCH CAN 10, no. 3 (April 1982), 9-17.
[4] Wiecek, "A case study of VAX-11 instruction set usage for compiler execution", ASPLOS 1982, 177-184.
[5] Emer and Clark, "A characterization of processor performance in the VAX-11/780", Proc. ISCA, 1984, 301-310.
[6] Clark and Emer, "Performance of the VAX-11/780 Translation Buffer: Simulation and Measurement", ACM TOCS 3, no. 1 (Feb 1985), 31-62.

Important analysis of ILP, discussed at length in [0]:
[7] Wall, Limits of Instruction-Level Parallelism, DEC WRL Report 93/6.

Another superb performance analysis by two of the best:
[8] Bhandarkar and Clark, "Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization", 1991.

The NVAX:
[9] Uhler, Bernstein, Biro, Brown, Edmondson, Pickholtz, and Stamm, "The NVAX and NVAX+ High-performance VAX Microprocessors". http://research.compaq.com/wrl/DECarchives/DTJ/DTJ701/DTJ701SC.TXT

A good intro to the IA-32:
[10] Hennessy and Patterson, "An alternative to RISC: The Intel 80x86", www.mkp.com/CA3, Appendix D.

4-issue superscalar in-order Alpha versus OOO PentiumPro:
[11] Bhandarkar, "RISC versus CISC: A Tale of Two Chips", ACM SIGARCH CAN 25, no. 1 (March 1997), 1-12.

IBM ES/9000 (1992) was superscalar OOO in bipolar, but in CMOS, they went back to simpler designs:
[12] Slegel, Pfeffer, and Magee, "The IBM eServer z990 microprocessor", IBM J. Res. Dev. 48, no. 3/4 (May/July 2004), 294-309.
[13] Heller and Farrell, "Millicode in an IBM zSeries processor", IBM J. Res. Dev. 48, no. 3/4 (May/July 2004), 425-434.

[Some of these can be found on the WWW, some are in the ACM Digital Library (subscription), and many are discussed in detail in [0] anyway].
INTRODUCTION - OOO (Out-of-Order) (see [0, Chapter 3]):

OOO CPUs try to maximize ILP as follows:

A) Fetch instructions in-order, with extensive branch prediction,
- decode (and maybe even cache the decoded instructions),
- apply register renaming to convert logical registers to physical ones,
- put the resulting operations (decoded instructions using renamed registers) into internal queue(s) (reorder buffer, active-list, etc), such that an operation can be performed (often OOO) whenever its inputs and the necessary functional units are available.

B) A load/store unit tries to discover cache misses and start refills quickly. Loads can (depending on the ordering model) be done out of order, and stores can at least (sometimes) profitably fetch their target cache lines, although the final store operation must wait. Decoupling this unit as much as possible is absolutely crucial to getting good ILP, given the increasing relative latency to memory.

C) Other instructions are executed by appropriate function units, commonly 1-2 integer ALUs, and a collection of independent FP units.

A) again, since retirement is more related to A): Completed instructions are retired in-order. If it turns out that the fetch unit has mispredicted a branch, when that is discovered, the register state, condition codes, etc. are rolled back to those just before the branch, and the branch is followed in the other direction. If an instruction generates an exception, the exception normally doesn't take effect until the instruction is retired, in which case the following instructions are cancelled. Something similar occurs with asynchronous interrupts.

OOO CPUs run most of the time speculating, i.e., working on multiple instructions that might or might not actually be reached, which is why people worry so much about good branch prediction: the penalty for bad prediction gets worse as designs get longer pipelines and more parallelism.
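The register-renaming step in A) can be sketched in a few lines. This is a toy illustration only, not any real design: the class and names are mine, and real renamers also track free-list reclamation at retirement, which is omitted here.

```python
# Toy sketch of register renaming: logical registers are mapped onto a
# larger pool of physical registers, so each new write gets a fresh
# physical register and false (write-after-write/read) hazards disappear.
# Illustrative only; reclamation of physical registers at retire is omitted.

class RenameMap:
    def __init__(self, n_logical, n_physical):
        self.free = list(range(n_logical, n_physical))  # unassigned physical regs
        self.map = {r: r for r in range(n_logical)}     # logical -> current physical

    def rename(self, op, dst, srcs):
        """Return a micro-op with logical register names replaced by physical ones."""
        phys_srcs = [self.map[s] for s in srcs]   # reads use the current mappings
        new_dst = self.free.pop(0)                # each write allocates a fresh reg
        self.map[dst] = new_dst
        return (op, new_dst, phys_srcs)

rm = RenameMap(n_logical=4, n_physical=8)
# Two back-to-back writes to logical r1 get distinct physical registers,
# so the second ADD need not wait for consumers of the first.
u1 = rm.rename("ADD", dst=1, srcs=[2, 3])
u2 = rm.rename("ADD", dst=1, srcs=[2, 0])
```

The point of the sketch is the last two lines: because each write gets its own physical register, the queues in A) can hold both ADDs at once with no false dependence between them.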
Also, they hope for code patterns that help confirm branch directions early. Once upon a time, it was easy to know how many cycles an instruction would use, but that was long ago, with real-memory, uncached designs :-) It is very difficult to know what an OOO CPU is up to.

There are also serious hardware tradeoff problems that arise, even though invisible to most programmers. There is never as much die space as there are potential uses for it, and the payoffs of different choices must be carefully analyzed across large workloads, especially because there can be serious discontinuities in gate-count, or worse, gate-delays, caused by "minor" changes in things like queue sizes. For instance, unlike the limits on logical (programmer-visible) registers, there is no a priori limit on the number of physical registers, but in practice, these register numbers are used in large numbers of comparators, and one would think hard about going from 64 to 65, or 128 to 129. Likewise, load/store queues have big associative CAMs so that the next address can be quickly checked against all the outstanding memory operations.

Quite likely, load queues are filled with outstanding memory references, with multiple cache misses outstanding. If a queue is full, but an instruction needs a piece of data *right now*, just to be decoded into micro-ops, it either has to have special-case hardware, or it will have to wait until a queue entry is available. [The VAX has this problem, unlike IA-32 or RISCs.]

Some instructions (like ones changing major state/control registers, or memory mapping, etc) are inherently *serializers*, that is, their actions cannot take effect until all logically older instructions have been retired. Also, partially-executed instructions *following* the serializer may need to be redone. The decoder might recognize such serializers and stop fetching, if it is deemed a waste of time to go beyond them.
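The associative store-queue check mentioned above can be sketched as follows. This is purely illustrative (the size and names are mine): in hardware, the `match` loop is one comparator per entry firing in parallel, i.e., a CAM, which is exactly why growing the queue is not free.

```python
# Toy sketch of a load/store queue with an associative address check:
# every incoming load address is compared against all outstanding stores
# (in hardware, simultaneously, via a CAM). If the queue is full, a new
# store must stall. Illustrative only; sizes and names are invented.

STORE_QUEUE_SIZE = 8

class StoreQueue:
    def __init__(self):
        self.entries = []                     # (address, data), oldest first

    def full(self):
        return len(self.entries) >= STORE_QUEUE_SIZE

    def push(self, addr, data):
        if self.full():
            raise RuntimeError("stall: no free store-queue entry")
        self.entries.append((addr, data))

    def match(self, load_addr):
        """All entries 'compared' at once; the youngest match supplies its data."""
        for addr, data in reversed(self.entries):
            if addr == load_addr:
                return data
        return None                           # no match: go to the cache

sq = StoreQueue()
sq.push(0x1000, 42)
sq.push(0x1008, 7)
```

Going from 8 entries to 9 in this sketch is one line of Python; in silicon it is another full-width comparator, more wiring, and possibly a slower cycle, which is the discontinuity the text describes.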
Unlike loads, where all the work can be done speculatively, stores cannot be completed until they are retired, because they can't be sanely undone. Cleverness can preserve sequential (strict) ordering while pulling loads ahead of earlier stores, by redoing them if it turns out they conflict [0, p. 619 on R10000; discussed in US Patent 6,216,200]. The VAX's strict ordering might *not* have been a real problem.

WHY DO ALL THIS OOO COMPLEXITY?

a. Speculate into memory accesses as early as possible and get cache misses going, to deal with the increasing latency to memory. Overlap address calculations, get actual load data early, and fetch the target cache lines of stores early. Also, try to smooth the flow of cache accesses to lower latency and increase effective bandwidth. It was often said: "The main reason to do OOO is to find cache misses as early as possible."

b. Extract more ILP by overlapping non-dependent ALU/FP operations in a bigger window [40+ typical] than is available to in-order superscalars, which typically examine no more than 4 instructions/clock. This is especially valuable for long-cycle operations like integer multiply/divide, or many FP ops.

c. Alleviate decoding delays of messier variable-length instructions; this obviously applies less to RISCs, although some have done modest pre-decode when storing instructions in the I-cache.

d. Reduce pressure on small register sets by using register renaming to create more physical registers than logical ones. This also eliminates false dependencies, even in RISCs with large register sets, but it does help the VAX (and IA-32) somewhat more, as both are short of registers. That moves the problem around, as it puts more pressure on load/store queues, and on efficient handling of load-after-store-to-same-address.

The 360/91 used OOO for a. (it had no cache), b. (long-cycle FP ops), and d. (only 4 FP registers). I think c. was less important, as S/360 instruction length decode is easy.
ILP, NORMAL INSTRUCTIONS, AND IA-32 VERSUS VAX

Consider the normal unprivileged instructions that need to be executed quickly, meaning with high degrees of ILP, and with minimal stalls from the memory system. RISC instructions make 0-1 memory references per operation. Despite the messy encodings, *most* IA-32 instructions (by dynamic count) can be directly decoded into a fixed, small number of RISC-like micro-ops, with register references renamed onto the (larger) set of physical registers. Both IA-32 and VAX allow unaligned operations, so I'll ignore that extra source of complexity in the load/store unit. In an OOO design, the front-end provides memory references to a complex, highly-asynchronous load/store/cache control unit, and then goes on.

In one case [string instructions with the REP prefix], IA-32 needs the equivalent of a microcode loop to issue a stream of micro-ops whose number is dependent on an input register, or dynamically, on repeated tests of operands. Such operations tend to lessen the parallelism available, because the effect is of a microcode loop that needs to tie together the front-end, rename registers, and load/store unit into something like a lock-step. Although this doesn't require that all earlier instructions be retired before the first string micro-ops are issued, it is likely a partial serializer, because it's difficult to do much useful work beyond an instruction that can generate arbitrary numbers of memory references (especially stores!) during its execution.

However, the VAX has more cases, and some frequent ones, where the instruction bits alone (or even with register values) are insufficient to know even the number of memory references that will be made, and this is disruptive of normal OOO flow, and is likely to force difficult [read: complex, high-gate-count, or long-wire] connections among functional blocks on a chip.
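The contrast above, a fixed micro-op template versus a count that depends on a run-time register value, can be sketched as follows. This is a toy illustration; no real decoder works this way, and the micro-op tuples are invented.

```python
# Toy sketch of why a REP-prefixed string op acts like a microcode loop:
# the number of load/store micro-ops depends on the run-time value of ECX,
# so the front end cannot emit a fixed template the way it can for an
# ordinary ADD. Purely illustrative; micro-op names are invented.

def expand_rep_movsb(ecx_value):
    """Expand REP MOVSB into micro-ops, once ECX is actually known."""
    uops = []
    for i in range(ecx_value):          # one load + one store per byte copied
        uops.append(("LOAD.B",  "tmp", ("ESI", i)))
        uops.append(("STORE.B", "tmp", ("EDI", i)))
    return uops

def expand_add(dst, src):
    """An ordinary register ADD: always the same, fixed template."""
    return [("ADD", dst, src)]

fixed = expand_add("EAX", "EBX")        # always exactly 1 micro-op
variable = expand_rep_movsb(3)          # 6 micro-ops, but only because ECX=3
```

For IA-32 this is one awkward case, and at least ECX is just a register. The VAX cases that follow are worse: the count can depend on values that must first be *fetched from memory*.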
Hence, while the VAX decoding complexity can be partially ameliorated by a speculative OOO design with a decoded-instruction cache [I alluded to this in the 1991 RISC vs. CISC posting], that doesn't fix the other problems, which either create microcode lock-steps between decode, load/store, and other execution units, or require other difficult solutions. In some VAX instructions, it can take a dependent chain of 2 memory references to find a length!

VAX EXAMPLES [1], [2], especially compared to IA-32 [10] and sometimes S/360. Specific areas are:
- Decimal string ops
- Character string ops
- Indirect addressing interactions with the above
- VAX Condition Codes (maybe)
- Function calls, especially CALL*/RET, PUSHR/POPR.

DECIMAL STRING OPERATIONS: MOVP, CMPP, ADDP, SUBP, MULP, DIVP, CVT*, ASHP, and especially EDITPC are really, really difficult without looping microcode. [S/360 has the same problem, which is why (efficient) non-microcoded implementations generally omitted them.] The VAX versions, especially the 3-address forms, are even more complex than the 2-address ones on S/360, and there are weird cases. DIVP may allocate 16 bytes on the stack, and then restore the SP later. These instructions somewhat resemble the (interruptible) S/370 MVCL, but are more complex, including the infamous ADDP6. They all set 4-6 registers upon completion or interrupt. EDITPC is like the S/360 EDMK operation, but even more baroque: "The destination length is specified exactly by the pattern operators in the pattern string." [2, p. 336] I.e., you know the beginning address of the destination, but you can't tell the ending address of a written field without completely executing the instruction. The IA-32 doesn't have these memory-memory decimal operations.
One might argue that C, FORTRAN, BLISS, PASCAL, etc. couldn't care less about these, but COBOL and PL/I do care, so if they are a customer's priority, that customer may not be happy with the performance they get on an OOO VAX, i.e., C speeds up, FORTRAN speeds up, but decimal operations are unlikely to speed up as much, as these certainly look like microcode that tends to serialize resources.

CHARACTER STRING AND CRC OPERATIONS: MOVC, MOVTC, MOVTUC, CMPC, SCANC, SPANC, LOCC, SKPC, MATCHC, CRC: also tough without looping microcode, and generally more complex than the S/360 equivalents. MOVTUC is a fine example: it has 3 memory addresses, and copies/translates bytes until it finds an escape character. Hence, at decode time, it is impossible to know how many memory addresses will be fetched from, and worse, stored into... The IA-32 REP/string operations have some of the same issues, but are simpler, with the length and 2 string addresses supplied in registers.

VAX INDIRECT ADDRESSING AND CHARACTER OR DECIMAL OPS

For any of the above, note that most operands, INCLUDING the lengths, can be given by indirect (DEC "deferred") addresses:
@D(Rn)  Displacement deferred   [2.7%, according to [5, Table 4]]
@(Rn)+  Auto-increment deferred [2.1%, according to [5, Table 4]]
The first adds the displacement to register Rn to address a memory word; the second uses the address in Rn (followed by an auto-increment) to address the memory word. That word contains the address of the actual operand value.

This makes it impossible for the front-end to know the length early. Rather than being able to hand off load/store operations, unidirectionally, to the load/store unit, the front-end has to wait for the load/store unit to supply the operand value, just to know the character string length. I have no idea how frequent this is, but VAXen pass arguments on the stack, and a call-by-reference that passes a length argument will do this: @D(SP).
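The deferred-length problem above can be sketched concretely. This is a toy model, with a fake flat memory and invented names: the point is only that with @D(SP) the decoder needs a dependent chain of two memory reads (fetch the pointer at SP+D, then fetch the length it points to) before it can even count the micro-ops of a string instruction.

```python
# Toy sketch of the indirect-length problem: a displacement-deferred
# operand @D(SP) needs TWO dependent loads before the decoder knows the
# string length, versus one for plain D(SP). `memory` is a fake flat
# memory; addresses and values are invented for illustration.

def read32(memory, addr):
    return memory[addr]                 # stand-in for a load/store-unit request

def decode_length_operand(memory, sp, disp, deferred):
    if not deferred:
        # D(SP): one load gives the length directly
        return read32(memory, sp + disp), 1
    # @D(SP): the first load yields only a pointer; a second, dependent
    # load yields the length
    ptr = read32(memory, sp + disp)
    return read32(memory, ptr), 2

memory = {104: 200, 200: 16}            # SP+4 holds a pointer to the length, 16
length, loads_needed = decode_length_operand(memory, sp=100, disp=4, deferred=True)
# The front end stalls for `loads_needed` round-trips to the load/store
# unit before it can expand the instruction into length-many micro-ops.
```

In an OOO machine this inverts the normal, unidirectional hand-off: decode must ask the load/store unit for data and wait, mid-instruction, which is exactly the lock-step the text describes.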
Consider how much easier the regular VAX MOV* instructions are, each of whose lengths is fixed. Each is easily translated into:
  Load (1 value, of 1, 2, or 4 bytes) into (renamed register); store that value
or
  Load (2 or 4 longwords); store (2 or 4 longwords)
(Of course, one might like the MOV to just act in the load/store unit, but that's not quite possible, due to the MOV and Condition Codes issue described later.)

IA-32 doesn't have this problem, as the length for a REP-string op is just taken from a register. Of course, that value must be available, but that falls out as part of the normal OOO process. The closest the S/360 gets is the use of an Execute instruction to supply length(s) to a Move Character (MVC) or other SS instruction. That's a somewhat irksome control transfer: think of it as replacing the EX with the MVC after ORing the length in. But at least you know the length at that point, without having to ask the load/store unit to do (possibly) multiple memory references in the middle of the instruction, which requires some special cases in the front-end <-> load/store unit interaction.

VAX CONDITION CODES AND MOVES [CONJECTURE ON MY PART]

OOO processors typically use rename maps to map logical registers to physical ones. Condition Codes (CC) require an additional rename map of their own, in ISAs that have them. Each micro-op has an extra dependency on the CC of some predecessor, and produces a CC, just as it produces a result. Register renaming uses massive bunches of comparators to keep track of dependencies, and CCs just add more maps and more wires and comparators. IA-32 and S/360 would need this also, but the VAX is slightly different, in that its data movement instructions affect the CC.

S/360: CVB, CVD, DR, D, IC, LA, LR, L, LH, LM, MVC, MVI, MVN, MVO, MVZ, MR, M, MH, PACK, SLL, SRDL, SRL, ST, STC, STCM, STH, TR, UNPK do *not* set the CC, i.e., most data movement instructions do NOT affect the CC.
IA-32: MOV does not affect any flags.
VAX: almost everything affects the flags, including all the MOVes (except MOVA and PUSHA, which do address arithmetic), so that the simple equivalents of LOAD and STORE on other ISAs now have to set the CC. It's hard to say whether this matters or not without a lot of statistics. It does complexify some advanced optimizations. For instance, there are some kinds of store/load sequences where one wants everything to be done in the L/S unit (which generally knows nothing about CCs): one may recognize that a pending store has the same address as a later load, and simply hand the store data directly to the load without incurring a cache access. [I think the Pentium 4 does something like this.] This easily happens when a bunch of arguments are quickly pushed onto the stack, and the stores are queued in the L/S unit (because they arrive faster than the cache can service them), but later loads quickly appear to fetch the arguments. For the VAX, this seems to imply extra complexity, because the L/S unit must compute the CC and get it back to the rest of the CPU.

NOTE: upon exception or interrupt, the CC must be set appropriately, which means that it has to be tracked. It also means that most conditional branches depend on the immediately-preceding instruction, and that may (or may not) make it harder to extract ILP.

AND SAVING THE BEST FOR LAST: "SHOTGUN" INSTRUCTIONS LIKE VAX CALLS, CALLG

In the NVAX, these shut down the pipeline, because the scoreboard couldn't keep track of them, so sequences of simpler instructions were faster. The VAX ISA makes it harder than usual for a decoder to turn an instruction into a small, known set of micro-ops. CALLS and CALLG generate long sequences of actions, most of which can be turned into micro-ops straightforwardly.
However, one thing is painful:
  CALLS numarg.rl, dst.ab
  CALLG arglist.ab, dst.ab
The decoder cannot tell from the instruction how many registers will get saved on the stack, because the dst.ab argument (which could be indirect) yields the address, not of the first instruction of the called routine, but of a *mask*, 12 of whose bits correspond to registers R11 through R0, showing which ones need to be saved onto the stack, along with everything else, including all the register side-effects. This means that, in the middle of decoding the instruction, the decoder has to hand dst.ab to the address calculator and get the result back [OK so far], but then it has to fetch the target mask, scan the bits, and generate one micro-op store per register save. Presumably, in an OOO design with a trace cache, and with fully-resolved subroutine addresses, one could do this OK, but it's a pain, because of the potential variability. Of course, in C, with pointers to functions, an indirect call through a pointer is awkward ... but then, it's awkward for everybody.

RET inverts this, and the trace-cache approach doesn't help, in that RET pops a word from the stack that has the register mask, scans the mask, and restores the marked registers from the stack. This is another thing that wants to generate a variable number of memory operations based on another memory operand, so the micro-ops are not easily generatable from the instruction bits alone, or even from instruction + register values.

PUSHR and POPR push and pop multiple registers, using a similar register mask, but at least, in common practice, the masks would be immediate operands ... although of course, it is possible the mask was some indirect-addressed value, sigh.
Of course, some use the simpler:
  BSB/JSB to subroutine
  Subroutine: PUSHR, plus other prolog code
              Body
              POPR and other epilog code
              RSB
However, at least as seen in [3], [4], [5], CALLS/RET certainly got used. [4, Table 3] has 4.86% of instructions being CALLS+RET, for instructions executed by compilers for BASIC, BLISS, PASCAL, PL/I, COBOL, and FORTRAN. [3] has instruction distributions [by frequency and time], with [3, Table 7] showing the distributions across all modes. This lowers the % of CALLS/RET, since the VMS kernel and supervisor don't use them. Still, one would guess that ~1% of executions each for CALL and RET, with about 12% of total cycles, would fit the VAX-11/780. [5, Table 1] gives 3.2% for CALL/RET. Of course, the semantics of these things tend to incur bursts of stores or loads, which means the load/store queues had better be well-sized to accept them.

IA-32: While CALL looks amazingly baroque, it's not as bad as it looks; that is, there are a bunch of combinations to decode, and they do different things, but once you decode the instruction bits, you know what each will do. RET doesn't restore registers, it just jumps back; again, there is a complex set of alternatives, but each is relatively simple, especially the heavily-used ones (I think). PUSHA/PUSHAD and POPA/POPAD simply push/pop all the general registers... a fixed set of micro-ops.

S/360: This has LM/STM (Load/Store Multiple), but the register numbers are encoded in the instructions, with the only indirect case being an Execute of an LM/STM, something rarely seen by me.

NOT IMPOSSIBLE, BUT HARD

If my job had been to keep the VAX competitive, I'd probably have been thinking about software tricks to lessen the number of CALLS/RETs executed, but that's just one of many issues. Maybe there are other implementation tricks, but in general, this stuff is *hard*, and solutions that are straightforward in RISCs, and somewhat so in IA-32, are different/tricky for VAXen.
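The CALLS register-save-mask fetch described above can be sketched as follows. This is a toy model with invented names: it shows only the mask-driven expansion, assuming (per [2]) that bit i of the entry mask, for i = 0..11, marks Ri for saving; all the other CALLS side-effects are omitted.

```python
# Toy sketch of the CALLS problem: the register-save mask lives in memory
# at the *call target*, so the decoder must complete an address calculation
# and a memory fetch mid-instruction, then emit a data-dependent number of
# store micro-ops. `memory` is a fake flat memory; names are invented, and
# the many other CALLS side-effects (stack frame, FP/AP/PC saves) are omitted.

def expand_calls_saves(memory, dst_addr):
    """Fetch the entry mask at the call target; emit one push per set bit."""
    mask = memory[dst_addr]                   # the mid-decode memory fetch!
    uops = []
    for reg in range(12):                     # bits 0..11 <-> R0..R11
        if mask & (1 << reg):
            uops.append(("PUSH", f"R{reg}"))
    return uops

memory = {0x8000: 0b0000_1100_0101}           # mask saving R0, R2, R6, R7
uops = expand_calls_saves(memory, 0x8000)
# The number of micro-ops varies with a value the decoder had to go fetch,
# so it cannot be determined from the instruction bits (or registers) alone.
```

Contrast this with IA-32 PUSHA or S/360 STM, where the set of registers is fixed by the instruction bits themselves and the micro-op template is known at decode time.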
To see how complex it can get to make an older architecture go fast, see [12] on the z990. IBM did OOO CPUs in the 360/91 (1967) and the ES/9000 (around 1992), but has reverted to in-order superscalars since. The recent z990 (2-core) is a 2-decode, 3-issue, in-order superscalar. The chip is in 130nm, with 8 metals, and has 121M transistors, of which 84M are in arrays, and the rest (37M!!) are combinatorial logic. That is two cores, so figure each core is 60M, with 17M in combinatorial logic. That's still big. It has 3 integer pipelines (X, Y, and Z), of which one does complex instructions (fixed-point multiplies and decimal), and sometimes (as for Move Character and other SS instructions), X and Y are combined into a virtual Z, with a 16-byte datapath. "Instructions that use the Z pipeline always execute alone." The millicode approach [13] might help a VAX, but again, this is not simple.

AND MORE

I picked out a couple of the obvious issues. In my experience, the people who *really* know the warts and weird problems of implementing an ISA are those who've actually done it a couple of times, and I haven't ever done a VAX. If one of those implementers says they knew how to fix all the issues, I'd at least listen to their solutions with interest, but I do know that a lot of the issues are statistical things, not just simple feature descriptions.

CONCLUSION

Earlier in this thread, I noted [8], which says: "The VAX architectural disadvantage might thus be viewed as a time lag of some number of years." The data in Table 1 in my previous post agrees, as does the clear evidence of the late 1980s. DEC understood the VAX quite well; there were superb architectural analysis articles [3-6] from the 1980s. Serious CPU designers gather massive data and simulate alternatives, and DEC folks were very good at this process.

Nth REMINDER: there is *architecture* (ISA) and there is *implementation*, and they interact, but they are different.
If this isn't familiar, go back and read old postings. One might be able to say that one ISA is simpler than another, because the minimum gate count for a reasonable implementation is lower than the other's. One might say that the design complexity of similar implementations differs between the two ISAs.

VAX VS RISC [TABLE 1, FIRST GROUP]

By the late 1980s, some system-RISCs were selling in volumes similar to VAXen, i.e., in workstation/server markets [SPARC, HP PA, MIPS], and hence, none had the vast volumes of the PC market to allow extraordinarily-expensive designs, but all could design faster CPUs than contemporaneous VAXen, at lower design cost. It was certainly clear by 1988 that RISCs were causing trouble for the VAX franchise, at least on the Ultrix side of it. Reference [8] was discussed earlier in this thread, and its conclusions recognized that. Of course, IBM re-entered the RISC fray in 1991 with (aggressive) POWER CPUs.

It is not at all unreasonable that RISC ISAs, first shipped in 1986 [HP PA, MIPS], 1987 [SPARC], 1991 [IBM POWER], and 1992 [DEC Alpha], should be more cost-effectively implementable than the VAX, first shipped in 1978. Even tiny MIPS was able to do that, over most of that period. Hence, one of the jaws closing on the VAX was higher-performance RISCs, delivered at lower cost, in similar volumes.

The other jaw, as discussed in the previous post, was the performance rise of high-volume IA-32 CPUs, whose volumes allowed the use of larger design resources to deal with the complexities of IA-32. The second group in Table 1 showed a few examples of that. The VAX ISA is far cleaner, more orthogonal, more elegant, and easier to comprehend than the IA-32 ISA, as it was in 1993. Even so, the Intel Pentium offered 2-issue superscalar (1993), the PentiumPro (1996) went OOO, and the Pentium 4 (2000) went even further, with a decoded instruction cache (trace cache).
It took substantial resources, which DEC didn't have, to do that, including "pipelining" design teams at Intel (Santa Clara, Portland). In 1992, at Microprocessor Forum, Michael Uhler showed a chart that included:

            CMOS-4   CMOS-5   CMOS-6
            1991     1993     1996     manufacturing year
            .75      .5       .35      min feature (microns)
            3        4        4        metals
            7.2      16       32       relative logic density
            2.2      2.9      3.7      relative gate speed
                     1.3X     1.7X     gate speed relative to CMOS-4

It should be pretty clear from Table 1 of the previous post that straight shrinks from CMOS-4 to CMOS-5 and then CMOS-6 wouldn't have put the VAX back in competitive shape, because you wouldn't even get the clock-rate gain, given increasing relative memory latency. At the least, you'd have to redo the layout and increase the cache sizes. If you got the full gate-speed improvement, you'd have:

1993: 1.3*90MHz = 117MHz, 44E SPECint92, 60E SPECfp92, compared to 65 Si and 60 Sfp for Pentium.
1996: 1.7*90MHz = 153MHz, 58E Si92, 99E Sfp92, compared to 366 Si92, 283 Sfp92 for PentiumPro.

For various reasons, I doubt that DEC would ever have built a Pentium-like 2-issue superscalar. In particular, the NVAX team found that it didn't help much (2%) to do multiple-operand decode (and it was complex hardware), because the bottlenecks were later in the pipeline. I conjecture that it is hard to get much ILP just looking at 1-2 VAX instructions, as lots of them have immediate dependencies. Hence, (if there had been no Alpha, just VAX), it would have been more plausible to target an OOO design for 1996, but I'd guess it would also have had to make the big change to 64-bit at that point. It's hard to believe they could have gotten to a trace-cache design by then [neither Intel nor AMD had], and the tougher VAX decode might well incur more branch-delay penalty than the IA-32.
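The scaling arithmetic above is simple enough to check mechanically. The sketch below is illustrative, not from the original post: it assumes a 90MHz NVAX-class baseline at roughly 34 SPECint92 (the value implied by the 44E and 58E projections) and scales clock and integer SPEC linearly with Uhler's relative gate speeds, an optimistic upper bound since, as noted, memory latency would not have kept pace.

```python
def project(base_mhz, base_specint, gate_speedup):
    """Optimistically scale clock rate and SPECint92 linearly with
    gate speed; in reality, relative memory latency grows and erodes
    both numbers, so these are upper bounds."""
    return round(base_mhz * gate_speedup), round(base_specint * gate_speedup)

# Assumed NVAX-class baseline (illustrative): 90 MHz, ~34 SPECint92.
BASE_MHZ, BASE_SI92 = 90, 34

cmos5 = project(BASE_MHZ, BASE_SI92, 1.3)   # CMOS-5 gate speed, 1993
cmos6 = project(BASE_MHZ, BASE_SI92, 1.7)   # CMOS-6 gate speed, 1996

print(cmos5)  # (117, 44) vs. 65 SPECint92 for the 1993 Pentium
print(cmos6)  # (153, 58) vs. 366 SPECint92 for the 1996 PentiumPro
```

Even granting the full gate-speed gain, the projected VAX falls further behind each generation, which is the "time lag" of [11] in numeric form.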
Given DEC's design resources, one can sort of imagine doing:

a) Clock spins and minor reworks on NVAXen, to keep the installed base from bolting, while holding out hope that all would get well in 1996, but that's 3 years with not much performance scaling; a very tough market.

b) Simultaneously doing a 64-bit OOO VAX, because 1999 would have been late.

However, as has been pointed out in detail, there are just lots of extra complexities in the VAX ISA, and all this stuff adds up. Professionals don't design CPUs using vague handwaving, because it doesn't work.

Anyway, DEC's gamble with Alpha didn't work [for various reasons], but at least it was a gutsy call to recreate the "one architecture" rule at DEC. Of course, personally, I would rather they had done something else :-)

But the bottom line is: the VAX ISA was very difficult to keep competitive. The obvious decoding complexity is always there, in one form or another, but the more serious problem is execution complexity that lessens effective ILP and is thus a continual drag on performance in reasonable implementations.

VAX: one of the great computer families, built around a clean ISA appropriate to its time, but increasingly difficult to implement competitively. R.I.P.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 8 Aug 2005 07:39:55 -0700
Message-ID: <[email protected]>

Among the problems with comp.arch is that it fills up with opinions that don't survive even minimal perusal of the literature...

1) One can argue about the PC, but if one reads the VAX-study references I quoted, one finds [Emer & Clark] that Immediates (PC)+ were 2.4% of the specifiers, and Absolute @(PC)+ were 0.6%, or 3% of the total. Personally, I didn't think that was worth the other problems, and neither did many of the other RISC designers (ARM being a notable exception), nor X86 nor 68K, but it does help with code size.

2) Peter thinks it's a bad idea to use a GPR as the SP. Most designers of real chips in recent years have thought otherwise, because SP-relative addressing is common, and it is ugly to special-case it.

3) The VAX equivalents of IA-32 LEA are MOVA and PUSHA.

Peter "Firefly" Lund wrote:
> > b) The PC and SP are general registers for historical reasons - upwards
> > compatibility with the PDP-11.
>
> Compatibility is a good reason -- a /very/ good one.
>
> But the VAX didn't have binary compatibility, it just had a mapping from
> PDP-11 registers, addressing modes, and instructions onto the VAX ones.
>
> That made it easy to transliterate assembly source code. Emulating (or
> even JITting) is also made easier.
>
> But would it really have hurt so much if the VAX had provided one or two
> more general purpose registers and hid away the SP and PC? A couple of
> extra registers for the emulator to play with internally could have been
> nice (but there were already eight more in the VAX than in the PDP-11 so I
> guess it wouldn't have mattered much).
>
> Instructions that accessed the SP and PC would have had to be
> special-cased in the transliterator and the emulator -- but I'm not sure
> it would have been difficult or expensive (you would need special handling
> of the PC register anyway, since PDP-11 code addresses wouldn't match VAX
> code addresses, and of the SP register since 16-bit values on the stack
> for calls/returns won't match the native 32-bit values).
>
> What do you need to do with SP? Push, pop, call/ret, the occasional
> add/sub, SP-relative addressing for loading/storing parameters/return
> values/local variables. If you can move the SP to/from a GPR then what
> else would you need?
>
> What do you need to do with PC? Conditional/unconditional branches,
> calls, returns, and PC-relative loads and stores.
>
> Maybe we would like an equivalent of the IA-32 LEA instruction, too, for
> creating absolute pointers to values with SP/PC-relative addresses.
From: "John Mashey" <[email protected]>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive (really long)
Date: 14 Aug 2005 21:13:20 -0700
Message-ID: <[email protected]>

Eric P. wrote:
> "John Mashey" <[email protected]> writes:
> >
> > But the bottom line is: the VAX ISA was very difficult to keep
> > competitive. The obvious decoding complexity is always there, in one
> > form or another, but the more serious problem is execution complexity
> > that lessens effective ILP and is thus a continual drag on performance
> > with reasonable implementations.
>
> In case anyone is still interested in this topic,
> there are a bunch of papers by Bob Supnik at
> http://simh.trailing-edge.com/papers.html
> covering a variety of DEC design issues.

Great material; thanks for posting; Bob is doing a dandy job preserving old stuff. In particular, if somebody actually wants to build things, it is really useful to get insight about design processes and tradeoffs. The HPS postings were useful too.

> The one labeled "VLSI VAX Micro-Architecture" is from 1988
> (marked "For Internal Use Only, Semiconductor Engineering Group")
> and mentions at the end the ways a VAX might get lower CPI. It says
>
> "However the VAX architecture is highly resistant to macro-level
> parallelism:
> - Variable length specifiers make parallel decoding of specifiers
>   difficult and expensive
> - Interlocks within and between instructions make overlap of
>   specifiers with instruction execution difficult and expensive
>
> Most (but not all) VAX architects feel that the costs of macro-level
> parallelism outweighs the benefits; hence this approach is
> not being actively pursued."
>
> So it would seem that the designers felt at that time that decode
> was a major impediment.

I actually hadn't read this before I posted, but obviously, I'd talked to VAX implementors in the late 1980s, and what they complained about sank in. Anyway, thanks for posting.
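[Editor's sketch, not part of the original posts.] Supnik's point about variable-length specifiers can be made concrete with a toy decoder. The mode encodings below are the real VAX operand-specifier modes (short literal, indexed, register forms, autoincrement with (PC)+ as immediate, displacement forms), but the decoder itself is a simplified illustration, not DEC code. The key property: the byte offset of specifier N+1 is unknown until specifier N has been fully parsed, so the loop is inherently serial, and parallel decode means speculating on every possible length.

```python
def specifier_length(buf, i, operand_size=4):
    """Byte length of the VAX operand specifier starting at buf[i].
    Simplified, but the mode encodings are the real VAX ones."""
    mode, reg = buf[i] >> 4, buf[i] & 0xF
    if mode <= 3:                         # short literal (6-bit)
        return 1
    if mode == 4:                         # indexed: index byte + base specifier
        return 1 + specifier_length(buf, i + 1, operand_size)
    if mode in (5, 6, 7):                 # Rn, (Rn), -(Rn)
        return 1
    if mode == 8:                         # (Rn)+; with PC it is an immediate
        return 1 + (operand_size if reg == 15 else 0)
    if mode == 9:                         # @(Rn)+; with PC it is 32-bit absolute
        return 1 + (4 if reg == 15 else 0)
    return 1 + {0xA: 1, 0xB: 1, 0xC: 2,   # displacement modes: byte/word/long,
                0xD: 2, 0xE: 4, 0xF: 4}[mode]  # plain and deferred

def decode_specifiers(buf, start, nspec, operand_size=4):
    """Find where each of nspec specifiers begins.  Offset N+1 depends on
    the length of specifier N, so this cannot be parallelized cheaply --
    exactly the 'difficult and expensive' decode Supnik describes."""
    offsets, i = [], start
    for _ in range(nspec):
        offsets.append(i)
        i += specifier_length(buf, i, operand_size)
    return offsets, i

# ADDL3 #1, 4(R2), R3: opcode 0xC1, then immediate (PC)+ with 4 data
# bytes, a byte-displacement specifier, and a register specifier.
buf = bytes([0xC1, 0x8F, 0x01, 0x00, 0x00, 0x00, 0xA2, 0x04, 0x53])
print(decode_specifiers(buf, 1, 3))  # ([1, 6, 8], 9)
```

In this one instruction the three specifiers occupy five, two, and one bytes; a fixed-length RISC decoder knows all its field positions from bit 0, which is much of the decode-cost gap discussed throughout this thread.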