From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS-UNIX-context switch
Date: 11 Jul 1995 19:38:02 GMT

In article <3teqdm$hk6@data.interserv.net>, levine@amarex.com writes:
|> Organization: Amarex Technology - High Speed Glitch Division
|>
|> I have a question which I would like to direct only to those who have
|> MIPS 3000 knowledge.  Given a MIPS 3000 chip, an I-cache, a D-cache
|> and main mem.  If this configuration were made into a UNIX-based
|> machine, where would the majority of time be spent for every context
|> switch?  e.g. Saving regs, clearing TLB...

a) Not in saving regs: figure that you save:
   33 integer registers [R1-R31 + HI + LO],
   some number of CP0 registers, let's say 7,
   and you might or might not arrange to save 32 32-bit FP registers,
   depending on how your OS wants to work.
   Assuming the interrupt sequence is in the I-cache, and a good memory
   system, saving 40 registers = 40 cycles; @ 40 MHz, = 1 microsecond.
   Real systems would likely be slower, so guess a couple of microseconds.
   Restoring another register set: likely to be cache misses, so takes a
   few microseconds more.

b) You don't need to clear the TLB, since there are Address-Space IDs,
   such that you only need to flush the TLB every time you see >64
   distinct processes.  You would normally reload a handful of TLB
   entries, then let other missed entries fault in.  Base cost: a few
   microseconds.  Most OS's use the trickery of the MIPS TLB
   direct-mapped region to avoid TLB misses for kernel code.  Caches are
   physically-tagged, so you get whatever sharing is really there.

c) In UNIX, most of the time goes to UNIXy scheduling & overhead, and
   executing unpredictable code paths and accessing state data likely to
   be cache misses.  Register save/restore is likely a factor only in
   very tight embedded control systems.

-john mashey  DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com  DDD: 415-390-3090  FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
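[A back-of-the-envelope rendering of the arithmetic in (a) above, as a
minimal C sketch; the register count, one-cycle-per-store cost, and
clock rate are the assumptions stated in the post, not measurements of
any real system.]

    /* Cost of saving registers on a context switch, using the
     * assumptions above: ~40 registers, one cycle per store
     * (interrupt sequence in the I-cache, good memory system),
     * 40 MHz clock.  Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz  = 40.0;  /* assumed R3000-class clock     */
        int    regs_saved = 40;    /* 33 integer + ~7 CP0 registers */
        double cycles     = regs_saved;        /* 1 cycle per store */

        printf("save cost: %.0f cycles = %.1f microseconds\n",
               cycles, cycles / clock_mhz);
        /* prints: save cost: 40 cycles = 1.0 microseconds */
        return 0;
    }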
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Stack vs GPR and Multi-threading (was Re: A Series Compilers)
Date: 12 Jul 1995 17:44:10 GMT

In article <1995Jul12.143336.21769@il.us.swissbank.com>, gerryg@il.us.swissbank.com (Gerald Gleason) writes:
|> If I'm interpreting what you are saying correctly, it is that in terms of
|> total system performance, register save/restore is a much smaller
|> opportunity than the latency associated with bad locality in various
|> forms.  A multi-threaded processor might be able to fill in most of what
|> would be idle time waiting for cache misses doing useful work on another
|> thread.  The issue of multi-threading is somewhat orthogonal to GPR vs

Yes, and there is a reasonable separate thread running on multi-threaded
CPUs, including contributions from people who have built or are building
them.

But for sure, I think that so much of the worry many people have about
register save/restore is that it's simpler to worry about than, for
example, all these latency and probabilistic arguments, i.e., it's the
equivalent of the "coffee-fund paradox":
a) If the coffee-pot fund is running low, a committee will debate long
   and hard about the solution thereof.
b) But then the committee must vote on $10B appropriations.  Little
   debate: how many people really grasp $10B? :-)

This is not to say register save/restore time is unimportant ... but
every time I've done the cycle-by-cycle counts on a real implementation,
running a general-purpose OS, I got convinced I should worry about other
things more.

-john mashey  DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com  DDD: 415-390-3090  FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 18:31:04 GMT

In article <63vhbo$hmk$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> Yes, it is.  But even with hardware reloads, a TLB miss is often
|> much more expensive than a cache miss (sometimes 5-10 times more).
|> With software reloads, they are death on wheels :-(

Since "death on wheels" is difficult to evaluate, but clearly conveys the
thought that this is a bad idea, let us observe: software-reloaded TLBs
are widely used; in fact, many of the microprocessor types commonly used
to run large programs on large datasets "happen" to do this, specifically:
- PA-RISC & MIPS, from 1986 onward
- DEC Alphas, 1992-
- Sun UltraSPARCs, 1995-

Consider the kind of code used to start this example: FORTRAN code with
big floating point arrays, an area of interest to RISC chips.  Of the 5
major RISC micro families, 4 have chosen to use software-reloaded TLBs,
with IBM being the main exception.

Now, when we published info about MIPS RISC in 1986, most people (outside
of HP & MIPS) thought software-reloaded TLBs were crazy ... but from 1986
through 2000, I count 6 *new* micro architectures used in systems where
large memories & TLBs might be relevant: [PA-RISC, MIPS, SPARC, IBM
POWER/PPC, Alpha, IA64], and of those 6, the current implementations of 4
use software-reloaded TLBs, 1 doesn't, and one (IA64) remains to be seen.

There are many reasons of flexibility and debuggability to have a
software TLB, of which some were covered in 1986 COMPCON, "Operating
System Support on a RISC", DeMoney, Moore, Mashey.  It should be no
surprise that designers of chips study TLB-miss overhead, and try to
allocate resources appropriately.

In modern systems:
1) A TLBmiss, in software, may actually take *less* time than actually
doing a cache miss.  Why is that?

TLBmiss:
  a) Miss
  b) Trap
  c) Refill TLB, making one or more memory references, which *may* well
     hit in the offchip data cache.
  d) Return

Cache miss:
  a) Miss
  b) Schedule cache miss to memory, which can be a very long time in
     some ccNUMA systems, but is easily 300-600ns in many SMPs.

With clock cycles in the 2-5ns range, that's 60-300 clocks, and with
2-4-issue superscalar chips, that's 120-1200 instructions.

Now, of course, there are also TLBmisses that take longer than cache
misses, but in fact, whether a refill is done by a trap to software, or
by a hardware engine, the time is:
	T = C + N * M
	C = ~constant overhead time
	M = time for cache miss
	N = number of cache misses caused by doing TLB processing

If there are a lot of TLBmisses, the TLBmiss code ends up living in the
on-chip L1 I-cache.  If the PTE structures for hardware or software
versions are the same, there will be about the same number of accesses
to memory.  In some cases historically, the complexity of TLB
table-walks in memory has demanded that PTEs *not* be cacheable, hence
giving up the ability to use the cache as a backing store for the TLB
... which is trivial and straightforward to accomplish in a
software-controlled TLB.

TLBs are famous for the weird bugs and odd cases in many early micros,
which is why OS people were often the ones who preferred
software-controlled ones as less troublesome.

For the long term, one can either make TLBs larger (more entries), or
allow entries to have multiple sizes ... and the industry seems to be
tending towards the latter; the R4000, in 1992, went this way, because
we couldn't figure out how to keep up with 4X/3 years in memory sizes,
for on-chip data structures that had to be fast.

Bottom line: Nick's characterization of software TLBs as "death on
wheels", in general, flies in the face of increasing use of this
technique by very experienced CPU designers.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
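[The refill-cost formula above drops straight into a few lines of C; the
parameter values below (a ~50 ns trap overhead, 400 ns per cache miss)
are illustrative assumptions, not figures for any particular chip.]

    /* T = C + N*M: C = constant trap overhead, M = cost of one cache
     * miss, N = misses taken while fetching PTEs.  Values assumed. */
    #include <stdio.h>

    static double tlb_refill_ns(double c_ns, int n_misses, double m_ns)
    {
        return c_ns + n_misses * m_ns;
    }

    int main(void)
    {
        /* best case: the PTE hits in the off-chip cache, N = 0 */
        printf("PTE in cache : %4.0f ns\n", tlb_refill_ns(50.0, 0, 400.0));
        /* slower case: two misses all the way to memory, N = 2 */
        printf("PTE in memory: %4.0f ns\n", tlb_refill_ns(50.0, 2, 400.0));
        return 0;
    }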
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 21:50:52 GMT

In article <63vrmv$nd5$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> >In modern systems:
-------^^^^^^
|> >1) A TLBmiss, in software, may actually take *less* time than actually
|> >doing a cache miss.  Why is that?
|>
|> Well, on the machines that I have tried (and they DO include some of the
|> ones you mentioned), a TLB miss is usually significantly more expensive.
|> The factor of 5-10 times was based both on measurement, as well as some
|> figures given by current hardware designers for their chips.

It would be helpful to quote some of these, since (date of system) is
fairly important in this discussion, given that, in 1986, we had 8 MHz
(125 ns) single-issue CPUs, and DRAM with raw read times of ~120ns,
while we now have 2-4-issue CPUs in the 200-500MHz (5-2ns) range, and
raw DRAMs ~60ns, and total cache miss times in the 200-500ns range for
SMPs and local nodes, 400-1000 for low-latency ccNUMAs, and maybe
200-3000ns for higher-latency ones.

|> >Now, of course, there are also TLBmisses that take longer than cache misses,
|> >but in fact, whether a refill is done by a trap to software, or by a
|> >hardware engine, the time is:
|> >	T = C + N * M
|> >	C = ~constant overhead time
|> >	M = time for cache miss
|> >	N = number of cache misses caused by doing TLB processing
|>
|> I think that you are being misleading - in fact, I am certain.  In many
|> or most architectures, handling a miss in software involves a context
|> switch.  Not a full context switch, to be sure, but the CPU has to move
|> from DAT-on in user mode to DAT-off in kernel mode.  This means that the
|> constant overhead is potentially a great deal larger than for the
|> hardware solution.

You may be certain, but you are incorrect.  There is no context switch
(as most people use the term, i.e., from one user task to another user
task).  I don't recall exactly what the Alpha & UltraSPARC folks do, but
they're not idiots, so presumably they do something similar to what HP &
MIPS have done for a long time: there is a special, low-overhead trap to
the OS, and it has nothing to do with turning DAT on & off.  HP provided
some special registers to make this faster; MIPS used a "hack" of
telling user code that there were 2 registers they could expect to be
trashed at any time, so the kernel doesn't even have to save/restore
these registers; there were enough registers to get away with this.
Various chunks of hardware are added to make extractions or virtual
references (in Alpha's case, anyway) faster, where the issue is a series
of dependent operations that are trivial to do in hardware, leaving the
sequencing and control in software.

Note: some of the beliefs here come from a long discussion in a
Cupertino bar with PA-RISC architects, of the form "why did you do this?
we did that..."  There was some head-slapping when I said we hadn't had
to do special registers, although I had tried to get 3 rather than 2 for
the kernel, but the compiler people wouldn't give me the third one.

I wrote the original MIPS version of such code in early 1985, Steve
Stone tuned it up, and we identified various simple hardware that could
help.  I have the original code somewhere, but couldn't find it; here
was Steve's version as of April 1 1985:

From scs Mon Apr 1 16:02:19 1985
From: scs (Steve Stone)
Subject: user TLB miss.
I have been trying to reduce the number of instructions involved in
resolving a user tlbmiss.  The best that I can do (with some hardware
changes assumed) is around 15 cycles (assuming no cache misses).  The
following is a first cut at the problem.

The following hardware features are assumed:
- There are separate UTLBMISS/KTLBMISS cause bits.
- The EPC is predecremented by hardware if the branch delay bit is set
  in the SR.  I know this is difficult to implement.  One possible way
  around this is to separate out UTLBMISS in a branch delay slot from
  other UTLBMISSes.
- At the time of a UTLBMISS, the TLBENHI register is set up correctly
  (the TLBPID is or'd in and the VPN is correct).
- There are two registers usable by the kernel only.  The state of these
  registers is never saved and can only be trusted while interrupts are
  disabled (called RT1 and RT2).

Here is the exception handler code:

	/*
	 * Grab the cause bits.  User tlbmiss should be handled quickly
	 * if possible (i.e. the only cause for the exception).
	 */
	mfcause	RT1
	sub	RT1,CAUSE_UTLBMISS
	bne	RT1,r0,exc_noutlbmiss
	/*
	 * - Grab the VPN/TLBPID register from CP0.
	 * - Isolate the VPN in the low order bits * 4.
	 * - Add in the USERPTBASE constant (in kseg3).  The high order
	 *   bit of the VPN will have been set in the TLBENHI.  This
	 *   should be taken into consideration when choosing the
	 *   USERPTBASE location.
	 */
	mfc0	RT1,TLBENHI
	lsr	RT1,TLBPIDSZ-2
	and	RT1,~3
	la	RT2,USERPTBASE
	add	RT1,RT2
	/*
	 * We now have a pointer to the TLB entry.  Grab it.  A fault
	 * may occur here.  If so, the KTLBMISS handler will have to
	 * be smart enough to reset RT1 to be the original PTE pointer
	 * and reset the c0 registers so the following code will work.
	 */
	lw	RT1,0(RT1)
	/*
	 * If the PTE is invalid, handle the long way.
	 */
	and	RT2,TLB_V,RT1
	beq	RT2,r0,exc_upteinval
	mtc0	RT1,TLBENLO
	c0	TLBWRITE
	nop
	rfe
	nop

|> You are effectively saying that this case has been optimised so much
|> that it is no longer significant.  That is most interesting.

Actually, I didn't say that.  It is sometimes significant for certain
programs.  However, truly big programs often want big pages anyway, so
once you figure out how to do that in the general case, you are better
off than shaving a few cycles off something with a terrible TLB miss
rate.  This problem gets studied every time, and the general approach is
to give the TLB some more resource, but worry a lot more about cache
misses, which are way more frequent for most codes.

a) If a program has a low TLB-miss rate:
   a1) if the cache-miss rate is low, all is well.
   a2) if the cache-miss rate is high, then that's the problem.
b) If the program has a high TLB-miss rate:
   b1) If the cache-miss rate is high, you're down to DRAM speed, and
       either you have a problem for a vector machine, or you need to
       be doing cache-blocking anyway.
   b2) If the cache-miss rate is low, then the TLB is actually the
       bottleneck.

Many designers have never been able to find enough (b2) programs to
justify huge amounts of hardware to help the TLB.  Note of course, that
IBM RS/6000s have a fairly different philosophy in various ways.

|> >TLBs are famous for the weird bugs and odd cases in many early micros,
|> >which is why OS people were often the ones who preferred
|> >software-controlled ones as less troublesome.
|>
|> Don't you really mean that the bugs are easier to fix, and hence less
|> embarrassing :-)

Not exactly.
What I meant was that almost any OS person involved in the design of the
first round of RISC chips had had experience with early micros, and with
running into weird-case bugs late in the development cycle, with complex
hardware logic that took full chip spins to fix.  It isn't a question of
embarrassment, it's a question of whether or not you can ship a product.
The following has been known to happen, when designing new systems with
brand new micros:
(a) System comes up.
(b) Debug it, looks good.
(c) Get a bunch of systems ready, be running QA.
(d) Fix a bug in the C compiler.
(e) Some instruction moves 2 bytes, crosses a page boundary, regression
    tests start breaking 4 weeks before shipment; it takes 2 weeks to
    figure out exactly what is happening.
(f) Then you realize that the odd case could potentially happen with any
    user-compiled program, and it is a bug in the microcode, and it's
    going to be 3 months before it gets fixed ... and you're dead.

The MIPS utlbmiss codes have often been diddled to work around some odd
hardware error, so that you can get beyond the first ones to see what
else there is.

|> >Bottom line: Nick's characterization of software TLBs as "death on wheels",
|> >in general, flies in the face of increasing use of this technique by
|> >very experienced CPU designers.
|>
|> I accept your correction!  I stand by my point that TLB misses are
|> generally "death on wheels", but it is very likely that I have been
|> using software implementations that I thought were hardware :-)
|>
|> I also take your point that TLB misses are becoming less expensive as
|> time goes on, in a way that cache misses are not.  But I don't believe
|> that the turnover point has yet arrived!

Hmmm.  I thought your point was that "death on wheels" was equivalent to
"software-reloaded TLBs are a bad idea and should be done away with."
Was that a misinterpretation?

I'm not sure what "turnover point" means.  A CPU designer has to provide
a set of facilities, which for systems-type chips, includes cache + MMU,
with various tradeoffs.  All that's been happening is that countless
studies have convinced many designers that they can avoid a bunch of
complex microcode, or worse, a lot of random logic with touchy special
cases, in favor of a low-overhead trap to a small piece of code, and
that if it takes a few more cycles to do the logic, it takes less die
space, is more flexible, and the times are increasingly dominated by the
time to fetch PTEs from memory anyway.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
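[For readers who don't speak MIPS assembly, here is a hedged C rendering
of the lookup in Steve's handler above: the names TLBPIDSZ, TLB_V, and
the linear user page table come from the listing, but the field widths
and bit positions are assumptions for illustration, not the real R2000
values.]

    /* C rendering of the utlbmiss lookup: index a linear user page
     * table by the faulting VPN and check the valid bit.  The shift
     * by TLBPIDSZ strips the TLBPID bits below the VPN; the assembly
     * does >> (TLBPIDSZ-2) to produce a byte offset (VPN * 4)
     * directly.  Field widths assumed. */
    #include <stdint.h>
    #include <stdio.h>

    #define TLBPIDSZ 12           /* bits below the VPN in TLBENHI (assumed) */
    #define TLB_V    0x00000200u  /* PTE valid bit (assumed position)        */

    /* returns the PTE to load into TLBENLO, or 0 -> take the slow path */
    uint32_t utlbmiss_lookup(const uint32_t *userptbase, uint32_t tlbenhi)
    {
        uint32_t vpn = tlbenhi >> TLBPIDSZ;  /* virtual page number          */
        uint32_t pte = userptbase[vpn];      /* may itself fault -> KTLBMISS */

        if (!(pte & TLB_V))
            return 0;                        /* invalid: handle the long way */
        return pte;                          /* mtc0 to TLBENLO, TLBWRITE    */
    }

    int main(void)
    {
        uint32_t pt[16] = {0};               /* toy 16-entry page table */
        pt[3] = 0x00400000u | TLB_V;         /* one valid PTE at VPN 3  */
        printf("pte = 0x%x\n", utlbmiss_lookup(pt, 3u << TLBPIDSZ));
        return 0;
    }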
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 9 Nov 1997 06:31:38 GMT

In article <641ecg$ot1$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> Also, I am talking about the TOTAL effect on application speed, and not
|> just the raw cost of processing the problem.  The problem with TLB misses
|> (and, generally, anything that needs a trap) is that the indirect costs
|> are often larger than the direct ones.  Things like conflict for the
|> first-level cache, interference with coprocessors and so on.
|> Well, in MY book, that is a partial context switch, and the TLB refilling
|> is being done by a hybrid hardware/software solution!  But I accept your
|> point that the TLB miss handler 'context' is both minimal and permanently
|> available.
...
|> >Hmmm.  I thought your point was that "death on wheels" was equivalent
|> >to "software-reloaded TLBs are a bad idea and should be done away with."
|> >Was that a misinterpretation?
|>
|> Yes and no.  It IS what I meant, but we were clearly talking about
|> different things!  I have no problem with the solutions that you have
|> described, but I would call them hybrid solutions.

"But `glory' doesn't mean `a nice knock-down argument,'" Alice objected.
"When *I* use a word," Humpty Dumpty said, in a rather scornful tone,
"it means just what I choose it to mean - neither more nor less."

Occasionally, discussion threads get going where people attempt to
modify "standard" terminology, resulting in massive confusion.  Usually
I stop reading the thread at that point.  *I* use the terms "context
switch", "trap", "hardware TLB", and "software-reloaded TLB (or just
software TLB)" the same way as other people do who actually design chips
and OS's for a living, and I propose to people reading this newsgroup
that more things will make sense if they do the same, that is:

1) A context-switch switches state from one process/task to another.
   Maybe someone uses the term "partial context-switch" to mean "trap";
   I'll admit I've never heard it.

2) A *trap* directs the program flow to a kernel address, which:
   - Takes action and returns very quickly, as in a normal MIPS UTLBMISS
     trap.
   - Takes action and returns more slowly, as in a 2-level UTLBMISS, or
     some system calls.
   - Takes action that eventually turns into a context-switch, as in a
     system call that causes a real I/O & a reschedule to another
     process, or a UTLBMISS that is discovered to actually be a page
     fault.

3) A "hardware TLB" usually means a TLB which, if the desired entry is
   not present, performs a tablewalk, or other appropriate mechanism, to
   reload the TLB entry from memory, without causing a trap for normal
   refills.  Such mechanisms were used in the 360/67, 370..., VAX, and
   most early micro TLBs.  Depending on the implementation, the actual
   implementation might use part of the microcode (if there is any), or
   else logic state machines, and whatever it does is hardwired as part
   of the CPU spec.

4) A "software-reloaded, software-refilled, or just software TLB" causes
   a trap (or TLB miss, or TLB miss fault) to the OS, which uses
   instruction sequences to do the reload, and return.  The hardware
   does exactly what it does for other kinds of traps: record the
   necessary information, in a convenient form, and then get to the OS
   as fast as it can.  Occasionally, CPUs provide instructions targeted
   at helping this process, but most such sequences are regular
   instructions, executed by the CPU, i.e., "software".
Calling 4) "hybrid" is an unfamiliar usage to me.

Of course, TLBs may be used before cache access (Physical Index,
Physical Tag), in parallel (Virtual Index, Physical Tag), or after
(Virtual Index, Virtual Tag, translate after miss).

Bottom line: there certainly exist programs where TLBmisses are a
serious factor, just as (for instance) big non-zero-stride vector codes
run slowly on most cached machines.  People who design CPUs always try
to make TLBs bigger, or map bigger pages, or have less overhead per miss
... but so far, when people have to do this in the real world and ship
CPUs, TLBs have to fight for their die space along with everything else,
and they get their share, but the studies keep telling people to spend
die space on other things.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
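[A small C sketch of why the "in parallel" (Virtual Index, Physical Tag)
arrangement mentioned above works; the cache geometry here is an assumed
example, not any particular chip's.]

    /* With a VIPT cache whose index bits fit inside the page offset,
     * the set index is identical for virtual and physical addresses,
     * so the cache can be indexed while the TLB translates in
     * parallel; only the tag compare needs the physical address.
     * Geometry (4 KB pages, 32-byte lines, 128 sets) is assumed. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SHIFT 5            /* 32-byte lines                 */
    #define N_SETS     128          /* 128 sets * 32 B = 4 KB = page */

    static unsigned cache_index(uint32_t addr)
    {
        return (addr >> LINE_SHIFT) & (N_SETS - 1);
    }

    int main(void)
    {
        uint32_t vaddr = 0x00403ABCu;   /* some virtual address         */
        uint32_t paddr = 0x1F7E3ABCu;   /* translation keeps low 12 bits */

        /* identical indexes: translation never changes bits 5..11 */
        printf("v index %u, p index %u\n",
               cache_index(vaddr), cache_index(paddr));
        return 0;
    }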
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 11 Nov 1997 05:47:11 GMT

In article <mark.879201015@hubcap>, mark@hubcap.clemson.edu (Mark Smotherman) writes:
|> And I think I've seen this distinction also used in other texts.  Might
|> this reflect earlier IBM OS terminology?  [raise your hand if you remember
|> the definition of "data set" ;-)]

Maybe, just haven't seen it lately.

|> I would agree that with current usage, "process switch" and "context
|> switch" are not usefully distinct.
|>
|> Actually I'm enjoying the thread, and wanted to ask about PTEs:
|>
|> 1) What is the hardware consensus on updating reference and dirty bits?
|> I.e., is it the standard to set the reference bit of the PTE in memory
|> upon hardware (or software) table walk and have the processor cause a
|> trap [interrupt/fault/exception - if you want to talk about other
|> debatably useful terms in distinguishing various processor actions
|> and stack-frame formats ;-)] upon the first write to a page?

We wrote about this in the COMPCON paper I mentioned before; for various
reasons:
(a) One must think very carefully about hardware that updates PTEs
    directly, i.e., without any traps, for reasons discussed below.
    Note that even on some systems that do hardware Dirty bits, people
    often turn them off some of the time to allow for Copy-on-Write.
(b) In a software-managed TLB, the software normally wants to do this,
    since, after all, the hardware need not have *any* idea where page
    tables are.  In fact, there might not actually *be* any page tables
    in memory.  For example, if you were allocating contiguous pages in
    an embedded environment as segments, you might be able to just
    *compute* the page frame from the VPN with no memory references.
(c) Note that the transition not-Dirty -> Dirty doesn't happen "very
    often", compared to the number of TLB misses, i.e., it's on the
    order of (pagein of writable pages) + (allocation of writable pages).

The MIPS TLBs, starting with the R2000:
(1) Have a Dirty bit per TLB entry, but it is NEVER changed directly by
    hardware.  A write access to a page not marked Dirty is trapped, and
    the OS figures out whether it is:
    - an illegal write: error.
    - a write to a writable page that is not yet dirty: find the PTE,
      lock it, mark it dirty, reload the TLB entry and mark it Dirty.
    - a write to a writable page, that is dirty, but is temporarily
      marked non-Dirty because it is a Copy-on-Write: do a lot of work.
    - and there are various other more complex cases.
    (A minimal C sketch of this decision tree appears just below.)
(2) Have no hardware-set Reference bits, which are simulated by software
    in the time-honored ways.  Reference bits, after all, are just
    hints...

Since Reference bits are just hints, inconsistencies are no problem, and
oddly enough, stale copies showing not-Dirty are OK also.  If a write
access really occurs, the Dirty bit in the PTE had better get set, but
strangely enough, one need not bother to notify any other TLBs that have
copies showing the page as not-Dirty: the next reference will trap, and
the TLB will be reloaded anyway.  On the other hand, if there is a
transition from Dirty to not-Dirty, you have to flush any stale entries
(still showing Dirty) from all the TLBs that might have them.

|> 2) If you reset the reference and/or dirty bit of a PTE (e.g., clock
|> replacement algorithm), I assume you need to flush that PTE from the
|> TLB - correct?

No, for Reference; no, if changing not-Dirty to Dirty; yes, if changing
Dirty to not-Dirty.
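[As promised above, a minimal C sketch of the write-to-clean-page
decision tree; the PTE bit names, the enum, and the helper are
hypothetical illustrations of the MIPS scheme just described, not real
kernel interfaces.]

    /* Decision tree the OS runs when a store hits a page whose TLB
     * entry is not marked Dirty (the R2000 scheme described above).
     * Bit positions and names are assumed for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_WRITABLE 0x1u
    #define PTE_DIRTY    0x2u
    #define PTE_COW      0x4u   /* writable but temporarily not-Dirty */

    enum write_fault { WF_ERROR, WF_MARKED_DIRTY, WF_COPY_ON_WRITE };

    enum write_fault write_to_clean_page(uint32_t *pte)
    {
        if (!(*pte & PTE_WRITABLE))
            return WF_ERROR;          /* illegal write: error         */
        if (*pte & PTE_COW)
            return WF_COPY_ON_WRITE;  /* copy the page: a lot of work */
        /* writable, just not yet dirty: lock the PTE, mark it dirty,
         * and reload the TLB entry with Dirty set                    */
        *pte |= PTE_DIRTY;
        return WF_MARKED_DIRTY;
    }

    int main(void)
    {
        uint32_t pte = PTE_WRITABLE;               /* clean, writable */
        enum write_fault r = write_to_clean_page(&pte);
        printf("result %d, pte now 0x%x\n", r, pte);  /* Dirty now set */
        return 0;
    }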
In a complex environment, one becomes happier with TLBs that are
software-refilled, and disconnected from memory.  Consider a simple MMU
design:

(1) The MMU refills the TLB by doing a page-walk, with
physically-addressed data structures that cannot be cached.  The MMU
sets the Dirty bit in the PTE in memory, and the Reference bit when
needed.
==> Not so bad with a CPU whose speed is not much faster than DRAM, and
whose memory system is happy with data transfers the size of one PTE.
Not so good with current CPUs, wide memory systems, long latencies.
Some implementations may require an atomic read-modify-write to the PTE,
so need to lock this ... which is unappealing in SMP systems, and really
unappealing in ccNUMAs.
==> paging of PTEs is probably not so nice.

Consider the 3 cases:
(1) First reference to a clean, but writable page is a read.
(2) First reference to a clean, but writable page is a write.
(3) First reference is a read, but there is a later write.

Case (1) just loads the TLB, and maybe sets a Reference bit; case (2)
loads the TLB, and must set the Dirty bit.  Case (3) (assuming an
all-hardware design) has to change the state of the entry inside the
TLB, and either read-modify-write that entry back to memory right then,
or be prepared to do so if the entry is ever replaced.  The former is
likely to be less buggy.  Among the various considerations are: uncached
accesses versus cached ones [suppose some piece of code is scanning the
PTEs, which it would prefer to do using cached accesses]; good use of
bus bandwidth; buffering and queue sizes inside the processor
(especially for an aggressive out-of-order superscalar).

(2) So, how about: the MMU does cached table walks, and uses the normal
cache-coherency mechanisms as it modifies PTEs.  Some of the previous
issues go away.  On the other hand, consider a ccNUMA like an SGI
Origin, and suppose that R10000s had hardware-refilled TLBs that wrote
Dirty bits and Reference bits automatically, with caching.
(a) Program issues a store, which enters the load/store queue.
(b) The store gets to the head of the queue, to the TLB, but misses in
    the TLB.
(c) The MMU does a tablewalk, and finds the PTE in its cache.
(d) Examining the PTE, the MMU discovers that it needs to modify the
    PTE, but discovers that the cache line containing the PTE was not
    Exclusive, so it must issue a coherency request to the cache line's
    home node, to upgrade to Exclusive, but it gets nacked because
    another CPU got there earlier with the same request, and in fact,
    this CPU gets an invalidate back to invalidate the cache line, which
    means the MMU, in the middle of this, must discard the PTE it has
    already gotten, and reissue the access to the cache to find the PTE
    (misses this time), then go to memory, and get what is probably a
    shared copy (which may well be satisfied via an intervention from
    the other node that has just modified this).

You'd probably prefer that PTE fetches normally be done as ordinary
reads, rather than as read-exclusive (intent to write), since they
probably don't change the state of the PTE very often, but of course,
when they do, it gets complicated.  Any table-walking fetches that occur
must also check for pending stores, i.e., stores to the same address
that happened logically earlier, but have not yet graduated, and thus
have not actually been written into the cache.  [This will hardly ever
happen, but it could, so you have to check for it, i.e., queues of loads
and stores have to be carefully checked.]
Sooner or later, after potentially multiple memory accesses, and some
fairly complicated churnings of state machines, you have the line of
PTEs in the cache, the PTE you wanted copied into the TLB, and any
modifications of the entry set in the TLB, and into the (by-now)
Exclusive copy of the cache line that contains the PTE.  Of course, you
may have done a writeback of a Dirty cache line, to get a line into
which the PTE line can be placed.  Now (assuming this is a PI-PT cache),
you are ready to actually do the cache access for the store itself.

There are, of course, a myriad of cases, and very complex state machines
... and of course, an o-o-o machine wants to have a long load/store
queue, and be overlapping memory accesses, which means all of this
should be interleaved with other TLB refills, cache misses, writebacks,
coherency traffic, etc.

Summary: has been said before, but worth saying again: design ideas that
work fine in sequential uniprocessor systems often get very complex when
you have to use them in multiprocessors, and especially with
out-of-order CPUs, even RISCs with no more than one memory address per
instruction.  It is very easy to get a design that is simple, but forced
to be slower than necessary in normal cases in order to be correct in
all cases, or else to create extraordinarily complex designs to get
performance (the VAX 9000 comes to mind).

As an especially terrifying example, Dave Cutler once pointed out to me
that a VAX addp6, with all operands indirect, and all of the addresses
crossing pages, and being aligned just right to cross page table
boundaries, could require an amazing:

41 pages (including pages in the page tables, of course) to complete.

Also, autoincrements (early in an instruction) that have to be undone if
a translation (later in the instruction) fails are one more wonderful
thing to deal with.

Finally, all of this is a continued argument that intuition is often
bad, and that computer engineering makes forward progress with better
measurements ... akin to improvements in science from getting another
decimal place.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: jfc@athena.mit.edu (John F Carr)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 01:53:45 GMT

In article <648rgv$46m$1@murrow.corp.sgi.com>,
John R. Mashey <mash@mash.engr.sgi.com> wrote:

>(1) MMU refills TLB by doing a page-walk, with physically-addressed
>data structures that cannot be cached.

Physical addressing and cacheability can be independent.  On the Alpha
21164 the virtual I-cache supports cacheable physical accesses for
PALcode.  By locking the page tables' own mappings in the TLB, the page
tables can use virtual addresses.

>Any table-walking fetches that occur must also check for pending stores,
>i.e., stores to the same address that happened logically earlier,
>but have not yet graduated, and thus have not actually been written into
>the cache.

Since this is only relevant for supervisor code, why not require a
memory barrier instruction between the store of a PTE and a reference to
it?  I think that's what Alpha does (at least when multiprocessor
consistency is required).

>There are of course, a myriad of cases, and very complex state machines...

The RSC implementation of the RS/6000 (maybe the 601 too?) had a
separate microprocessor to handle TLB misses.

>It is very easy to get a design that is simple, but forced to be slower
>than necessary in normal cases in order to be correct in all cases,
>or else create extraordinary complex designs to get performance
>(VAX 9000 comes to mind).

And they still didn't get that one right.  I think as of the final
microcode version branch prediction didn't work and there was an invalid
form of a nonprivileged instruction which would crash the system.

>As an especially terrifying example, Dave Cutler once pointed out to me
>that a VAX addp6, with all operands indirect, and all of the addresses
>crossing pages, and being aligned just right to cross page table boundaries,
>could require an amazing:
>
>41 pages (including pages in the page tables of course) to complete.

I have a table of instruction times for the CVAX processor.  The final
example, which I'm sure the authors enjoyed constructing, follows
(r/w = read/write, 1 cycle if cache hit).

INDEX d(r),@d(r),(r)[rx],@(r)+,-(r),@d(r)[rx] - all memory operands
unaligned across page boundaries, all memory operands take TB misses on
both reads, M bit clear, cache hits

	specifier 1 time                1+1r
	specifier 1 cross page          8
	specifier 1 TB miss x 2         12+2r
	specifier 1 unaligned           1r
	specifier 2 time                1+2r
	specifier 2 cross page x 2      16
	specifier 2 TB miss x 4         24+4r
	specifier 2 unaligned x 2       2r
	specifier 3 time                2+1r
	specifier 3 cross page          8
	specifier 3 TB miss x 2         12+2r
	specifier 3 unaligned           1r
	specifier 4 time                2+2r
	specifier 4 cross page x 2      16
	specifier 4 TB miss x 4         24+4r
	specifier 4 unaligned x 2       2r
	specifier 5 time                2+1r
	specifier 5 cross page          8
	specifier 5 TB miss x 2         12+2r
	specifier 5 unaligned           1r
	specifier 6 time                3+1r+1w
	specifier 6 read cross page     8
	specifier 6 write cross page    7+2r
	specifier 6 TB miss x 4         24+4r
	specifier 6 M bit clear x 2     18+4r+2w
	specifier 6 read unaligned      1r
	specifier 6 write unaligned     1+1w
	execute, fetch time             39
	                                ----
	total                           248+38r+4w

(that's 7 microseconds at 40 MHz).  Note that this requires more TLB
entries than the CVAX has.  I think the CVAX had a 28 entry TLB and the
uVAX II 8.  Some big VAXes had 1K.

-- 
John Carr (jfc@mit.edu)
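[A quick sanity check of that total, as a C sketch: assume every r and w
is a one-cycle cache hit, and a 25 ns cycle at 40 MHz, as the post's
parenthetical implies.]

    /* CVAX worst case above: 248+38r+4w, with r = w = 1 cycle on a
     * cache hit, at 25 ns/cycle (40 MHz).  Inputs are the figures
     * quoted in the post. */
    #include <stdio.h>

    int main(void)
    {
        int cycles = 248 + 38 * 1 + 4 * 1;   /* = 290 cycles */
        printf("%d cycles = %.2f microseconds\n",
               cycles, cycles * 25.0 / 1000.0);
        /* prints: 290 cycles = 7.25 microseconds -- i.e., the
         * "7 microseconds at 40 MHz" in the post */
        return 0;
    }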
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 03:18:22 GMT
Keywords: locality

In article <64ce5v$p5i$1@flood.weeg.uiowa.edu>, dsiebert@icaen.uiowa.edu (Doug Siebert) writes:
|> Organization: The University of Iowa
|>
|> Joe Keane <jgk@jgk.org> writes:
|>
|> >Recently i saw a table about TLBs.  I was surprised that the biggest
|> >number of entries was only 128 entries, for some Power chip, and many
|> >chips had considerably less.  It seems to me that that is not enough,
|> >and even 128 is a bit low to run well in many cases.
|>
|> I know some guys doing CAD/CAM using HP workstations.  Back when the C110
|> (120MHz PA-7200) was the fastest desktop box, that's what they were using.
|> Later, the C160 & C180 came out with the PA-8000, and SPEC results showed
|> them to be 100-200% faster.  But for their codes, they were no improvement,
|> and actually slower in some cases, because the TLB in the 7200 was (from
|> memory) 120 entries, and it was only 96 entries on the PA-8000.  The PA-8200
|> is back to 120 entries, and the PA-8500 has I believe 160 entries.  But HP
|> is also going the way of variable page sizes, which for CAD software that
|> runs in a gigabyte of data space, is probably better than trying to find a
|> way to have a massive TLB.

Can you say more about why they believed that the TLB was the issue?  Do
the profiling tools there offer such statistics?  It is quite possible
that the TLB was the issue, but it is also possible that:
(a) Different cache design was the issue.
(b) The PA8000 SPEC #s were with new compilers, tuned to the PA8000.
From other experience, it usually takes a while for real apps to get
compiled with new compilers, and if the binaries don't run well on the
installed base, it may take a long time.  Anyway, in numerous
competitive cases, there were PA7000 binaries whose instruction
scheduling seemed to interact badly with PA8000 chips, and performance
didn't go up until the programs were recompiled, but then performance
did go up, even with an identical # of TLB entries.

As usual, what one really wants is a profile of the program that gives
the # of TLB misses, cache misses, and cycle costs of them plus
instructions, stalls, etc ... because it is all too easy for human
intuition to be incorrect.

I agree on the bigger pages, of course ... which is why professional CPU
designers have been doing that more and more in recent years.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 03:41:00 GMT
Keywords: locality

People ask why I don't look in on comp.arch so often any more...

In article <64b95a$jqn$1@rocky.jgk.org>, Joe Keane <jgk@jgk.org> writes:
|> TLB contention is a real problem.  In many realistic situations people
|> find that it severely dents performance.  It happens a fair amount, and
|> people know about it, but it's not something really exciting.

Since terms like "real problem", "realistic", "severely dents", and
"fair amount" are hard to evaluate, so is this statement...

|> >People who design CPUs always try to make TLBs bigger, or map bigger
|> >pages, or have less overhead per miss ...
|>
|> Bigger pages is not a solution.
|>
|> Bigger pages is a mistake.

It would be good if jgk would back up these strong statements with some
recent professional CPU design experience.  Many professional CPU
designers have recently chosen to broaden their support for multiple
page sizes, and OS's are starting to use them.

|> >but so far, when people have to do this in the real world and ship
|> >CPUs, TLBs have to fight for their die space along with everything
|> >else, and they get their share, but the studies keep telling people to
|> >spend die space on other things.
|>
|> I think the studies are not very good.  What do they look at?
|>
|> Running a nice, simple benchmark, say matrix multiply, this may not show
|> up as close to a problem, and the TLB transistors hardly get warmed up.

jgk: is the above an assertion that this is all that CPU designers do?
If so, you are essentially saying that CPU designers are total fools...
You are also implying that you have good access to the internal studies,
not just the ones that get published.

|> But it's a different story running lots of programs on a big machine
|> with tons of RAM and big, complicated software and lots of users.

There were Power Challenges in 1995 with 16GB of RAM; there are many
1-rack Origins with 8-32GB; there is at least one multi-rack Origin
complex with 112GB of RAM.  The bigger machines use 4MB of cache per
CPU.  I assume jgk is talking about his experience with larger machines
than these, running more complicated software.  Oh, maybe not:

|> My Pentium's L2 cache is 512 KB.  A rule of thumb is that one `object'
|> But wait, say this is a current high-end machine, not my Pentium, then
|> it has an L3 cache of 8 MB.  Even with completely optimal locality, with
|> the cache taking whole pages from memory, that is still a lot of pages.
|> So even in the optimal case, accessing L3 without misses is right out.
|> It looks like you could be TLB missing pretty frequently, when the data
|> is in the L3 cache.  The big question, of course, is how frequently.
|> That is what we want to know.  That is what decides it.

Yes, and that is exactly the kind of thing that professional CPU
designers study, although they only occasionally publish the results.
Unfortunately, when the answer is "20% of the time is in TLBmisses", the
solution may not be obvious: does more associativity help?  does a
bigger TLB help?  how much bigger?  do bigger pages help?  how much?
How much does it cost to reduce TLB overhead by 10% across a specified
workload?  How much performance is lost by taking that die space from
other functions?

One more time: human intuition is pretty bad; questions need precise
formulations to even have a chance of being answered.
CPU designers spend a lot of time acquiring the best data they can, and
still have to make tradeoffs.  Measurement is crucial.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 22:29:50 GMT
Keywords: locality

In article <64eqkc$de6$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> |> Can you say more about why they believed that the TLB was the issue?
|> |> Do the profiling tools there offer such statistics?
|>
|> I manage a system where I can measure TLB misses on a PA-RISC
|> architecture, but very few hardware architectures or vendors'
|> systems provide any decent (or indeed ANY) tools for detailed
|> performance investigations.  This means that the poor sods who
|>
|> to make tradeoffs.  Measurement is crucial.
|>
|> Right.  Now please explain why most modern systems make it impossible
|> for application designers and tuners to measure any of the critical
|> performance factors.

I can't help what other vendors do or don't do.  SGI systems supply
elaborate facilities for doing such performance analysis, especially in
R10K-based machines, which added a pair of hardware counters, and a
whole set of tools for controlling them, analyzing them, handling
multi-thread and multiprocessor cases, running on individual programs,
setting up to analyze global system behavior, etc.  A good place to
start would be perfex(1) and ecstats(1), or prof(1) on any of the
machines.  For instance, with no particular preparation:

	perfex -e 1 -e 23 command args

runs command, and then prints the counts for events 1 & 23, which happen
to be issued instructions and user TLB misses.  Countable events
(although not all combinations are allowed in one run) include:

	 0 = Cycles
	 1 = Issued instructions
	 2 = Issued loads
	 3 = Issued stores
	 4 = Issued store conditionals
	 5 = Failed store conditionals
	 6 = Decoded branches
	 7 = Quadwords written back from scache
	 8 = Correctable scache data array ECC errors
	 9 = Primary instruction cache misses
	10 = Secondary instruction cache misses
	11 = Instruction misprediction from scache way prediction table
	12 = External interventions
	13 = External invalidations
	14 = Virtual coherency conditions
	15 = Graduated instructions
	16 = Cycles
	17 = Graduated instructions
	18 = Graduated loads
	19 = Graduated stores
	20 = Graduated store conditionals
	21 = Graduated floating point instructions
	22 = Quadwords written back from primary data cache
	23 = TLB misses
	24 = Mispredicted branches
	25 = Primary data cache misses
	26 = Secondary data cache misses
	27 = Data misprediction from scache way prediction table
	28 = External intervention hits in scache
	29 = External invalidation hits in scache
	30 = Store/prefetch exclusive to clean block in scache
	31 = Store/prefetch exclusive to shared block in scache

There are numerous flags for various combinations ... but this is not
too much harder to use than time(1) ... to get details at basic-block
level, one uses speedshop(1).

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 15 Nov 1997 18:34:08 GMT

In article <64i61p$crv$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> Yes, mainly.  But there has been some confusion over what people meant
|> by cache misses.  TLB misses are typically a LOT more expensive than
|> cache misses that are filled from later caches, and can very easily
|> have indirect costs that take them beyond even memory references.
|> And this may get worse in the future, on at least some architectures.

(a) Cache misses are cache misses.  Some TLB designs can also generate
cache misses, although the impact of I-cache misses inside TLBmiss
handlers is minimal: if there are a lot of misses, the code lives in the
I-cache.

(b) I'm not sure what "TLB misses are typically a LOT more expensive ...
than later caches" means.  One more time: those cache misses that cannot
be prefetched (either explicitly or automatically) are expensive, and
getting worse, relative to cycle time.  If a TLB miss generates multiple
such cache misses, then it is indeed costly, but TLB misses that go
through the cache, and hit there, are not very expensive, and it is very
hard to make generalizations without having the numbers handy.

(c) There indeed can be indirect costs.

|> For example, what happens if you have multiple threads, coprocessors
|> etc.?  The TLB miss handler typically has to get into a state where it

(a) Well, both PA-RISC & MIPS have had FP coprocessors, separate in the
early days, and have managed OK.
(b) Multiple threads: as has been discussed here, multiple threads need
multiple resources, and it's one of the issues in studies of
multi-threaded CPUs: it's not enough to duplicate the PC and user
registers ... and this is true whether or not TLB refills are done by
hardware or software.

|> is addressing real memory and interrupts are disabled (the PA-RISC
|> architecture document describes this as a fast context switch), and
|> restore user state on return.  Mere cache misses don't necessarily
|> do anything more than block one thread.

This is what PA-RISC does.  Others do other things, at least with regard
to addressing real memory: the very first R2000 used virtual addressing
for its PTEs.

|> So what happens to instructions that are in the process of being
|> executed?  They can be suspended, nullified or completed before the
|> handler is invoked, or (heaven help us!) carried on in parallel if
|> they don't affect any registers the TLB miss handler needs.  The
|> last is clearly the fastest solution, but ....

In o-o-o chips that do this, of which several exist:
- logically-earlier instructions are usually completed
- logically-later instructions usually must be nullified

It is a fair complaint that the first group needs to be done before
entering the miss handler.  Whether this matters much or not depends on
the statistics, not on human intuition, which is especially bad
regarding what's happening inside such chips.

|> Incidentally, I vaguely remember an interesting security report on
|> one of the very early multi-threaded systems (perhaps a 360/91).

Must have been something else: I think Nick is confusing "out-of-order
speculative" and "multithreaded", which do *not* mean the same thing.
360/91s were not multi-threaded machines, at least if the term is used
the way most people use it.
-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 17 Nov 1997 06:51:05 GMT

In article <slrn66ta2f.tue.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> On 15 Nov 1997 18:34:08 GMT, John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >In article <64i61p$crv$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> >(a) Cache misses are cache misses.  Some TLB designs can also generate
|> >cache misses, although the impact of I-cache misses inside TLBmiss handlers
|> >is minimal: if there are a lot of misses, the code lives in the I-cache.
|>
|> I don't know if this is a good assumption.  Suppose we have a cluster
|> of TLB misses, a big working set that fills i-cache, and a process
|> switch.  If the TLB miss handler needs an i-cache fill every
|> 0.5 milliseconds, won't that add a noticeable drag to performance?

For any system design, it is possible to generate awkward worst cases
... and real-time system designers worry about them a lot, and for most
others, people try to optimize overall throughput and do the best they
can to round off the bad cases.

(These numbers aren't exactly right, but they're close enough.)
Assume a 200MHz R10000 (5ns cycle).
Assume 50ns = 10 clocks for an I-cache miss from L1 to L2.
Assume a 500ns cache miss from L1 to L2 to main memory [some better,
some worse] = 100 clocks.
L1: I & D caches each 16KB, 2-set-associative, 128B cache lines (i.e.,
128 lines of 128 bytes each).  The most common UTLBMISS handler is less
than one cache line.
L2: 1MB, 2MB, or 4MB, also 2-set-associative.

.5 milliseconds = 500 microseconds, so if a TLBMISS needs an I-cache
fill, all the way from DRAM, and this takes .5 microseconds, then we
have a "noticeable drag" of .5/500 = 1/1000 ... but of course, with a
substantial 2-set-associative L2 cache, *most* of the time it's more
like .05/500 = 1/10000, IF that is the rate of misses (which it may or
may not be).  If I-caches don't "work", there are worse problems than
TLBmiss handlers :-)

One more time: computer architecture needs numbers.  Some of the numbers
may be hard to get, but back-of-the-envelope analyses at least give some
ideas.  Hopefully, the frequent characterization of systems as "blazing
fast" is not rotting peoples' analytical capacities as badly as I fear :-)

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
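[That back-of-the-envelope drag estimate, restated as a C sketch; the
inputs are exactly the assumptions stated in the post above.]

    /* Drag from a worst-case TLBmiss-handler I-cache fill, per the
     * assumptions above: a 0.5 us fill from DRAM, once per 500 us. */
    #include <stdio.h>

    int main(void)
    {
        double fill_us     = 0.5;    /* I-cache fill all the way from DRAM */
        double interval_us = 500.0;  /* one such event every 0.5 ms        */

        printf("drag = %.4f (1 part in %.0f)\n",
               fill_us / interval_us, interval_us / fill_us);
        /* prints: drag = 0.0010 (1 part in 1000); with the fill
         * satisfied from L2 (~0.05 us) it drops to 1 part in 10000 */
        return 0;
    }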
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 17 Nov 1997 19:41:29 GMT

In article <slrn671p2q.6or.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> On 17 Nov 1997 06:51:05 GMT, John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >In article <slrn66ta2f.tue.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> >|> I don't know if this is a good assumption.  Suppose we have a cluster
|> >|> of TLB misses, a big working set that fills i-cache, and a process
|> >|> switch.  If the TLB miss handler needs an i-cache fill every
|> >|> 0.5 milliseconds, won't that add a noticeable drag to performance?
|> >
|> >For any system design, it is possible to generate awkward worst cases ...
|> >and real-time system designers worry about them a lot, and for most others,
|>
|> And as you noted above, for my real-time system, 1/2 microsecond
|> is significant.  If we pessimize your calculations a little it gets worse.

I don't know what axe Victor is trying to grind, or why, but it would be
nice to stop confusing people in this newsgroup with inconsistent
nonsense.  This discussion is equivalent to saying: "If F-16s were
Chevys, their speed in reverse would be only 3.2 MPH."  But jet planes
aren't cars.

1) "Suppose we have a cluster of TLB misses, a big working set that
   fills i-cache, and a process switch..."
2) "my real-time system, 1/2 microsecond is significant."
3) I'm trying to understand what system has *both* these
   characteristics; I must admit I've never seen them, so I have a
   suspicion that Victor has changed the field of discussion.

a) Designers of embedded real-time systems hate surprises and
   irregularity, and do whatever they must to avoid them.  They lock
   code into I-caches, play games with D-caches, and use TLBs far
   differently than the general-purpose systems that were being
   discussed.  They try pretty hard not to have TLB misses at all once
   they get initialized, and for cost reasons, they tend to use
   lower-end chips.
b) People who do real-time systems that are not embedded (and actually,
   SGI systems are heavily used for this) try to do the same thing,
   although it is far easier to provide tighter guarantees of interrupt
   response on low-end embedded systems.  IRIX allows dedication of CPU
   resources, lockdown, etc., etc., for these purposes, but it's hard to
   beat an R5000 or R4300i, with code pre-loaded into I-cache, TLBs
   preloaded and explicitly swapped on context-switch, etc.

|> Suppose the tlb miss handler falls across a cache line --- and it is
|> not obvious that padding the handlers to 128b boundaries is a win.

This is inconsistent: IF his argument is that an I-cache miss for a
TLBMISS is a disaster, then halving the number of such misses cannot
help but be good.

|> Now suppose that the memory/bus is PC quality and a cache line fill
|> takes 4 microseconds, say two lines get filled in 5 microseconds.  If the
|> TLB handler is not in icache, is it likely that the pte and pointers
|> to the pte are in d-cache?  If the handler was in i-cache, the loads
|> for the pte could start while the pipeline is emptying --- essentially
|> for free.  But we now need to load the icache first, begin execution of
|> the handler, then reach for the ptes.  So add another 4 microseconds
|> for the needed loads.  We now have 9 microseconds delay.
|> If this happens 1 time every 500 microseconds, we have 491 microseconds
|> of 5ns/instruction plus 9 microseconds for 32 instructions, giving us
|> 5.09ns/instruction, an almost 2% slowdown.  Make the cache line fill a
|> little slower or improve things otherwise and it looks worse.  And
|> that's the average case.

IF jet planes were cars...

|> have no idea if it would correspond to anything in reality.

Yes.

|> >If I-caches don't "work", there are worse problems than TLBmiss handlers :-)
|>
|> But there may be cases, interesting cases, where i-cache works great
|> on average but causes trainwrecks regularly.

This is not new news.

|> >One more time: computer architecture needs numbers.  Some of the numbers
|> >may be hard to get, but back-of-the-envelope analyses at least give some
|> >ideas.  Hopefully, the frequent characterization of systems as
|> >"blazing fast" is not rotting peoples' analytical capacities as badly as
|> >I fear :-)
|>
|> Good sentiment, but basing your analysis on quantitative methods makes
|> it tempting to focus on the easily quantified case and that is the
|> average case behavior of programs that can already run on average hardware.
|> If you did that, all systems might converge on the same general performance
|> and the company with the largest advertising and development budget would
|> win out.

Do you understand that you've just categorized as idiots people who:
- Design chips for real-time, including real-time-tuned variants of
  general chips
- Design real-time hardware
- Design real-time applications for a living

I've worked with many such people over the years (and of course, I'm the
person at MIPS who insisted on certain features in the R2000 to allow it
to be useful outside UNIX environments, including embedded).  Maybe
there are ones dumb enough to only be interested in average
characteristics, but I never met any such.  They love things like cache
lockdown, worst-case interrupt response, and TLB lockdowns [they love
TLB wired entries, for instance; they love set-associative caches with
lockable banks, etc.].

However, the "context" for this discussion didn't start as real-time
systems; if you want to start a separate thread on "real-time systems
versus complicated CPUs with caches", that's a worthy thread in its own
right, but it doesn't have much to do with PTE & TLB management in a
general-purpose system like HP/UX or IRIX or DEC UNIX or VMS or NT or
SOLARIS.

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 18 Nov 1997 03:37:23 GMT

In article <slrn6729k2.7jp.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> >|> Good sentiment, but basing your analysis on quantitative methods makes
|> >|> it tempting to focus on the easily quantified case and that is the
|> >|> average case behavior of programs that can already run on average hardware.
|> >|> If you did that, all systems might converge on the same general performance
|> >|> and the company with the largest advertising and development budget would
|> >|> win out.
|> >
|> >Do you understand that you've just categorized as idiots people who:
|>
|> I did not categorize as, call, insinuate, or in any way imply
|> that anyone is an idiot.  Please don't invent opinions for me.

Sorry, I did not mean that Victor was insulting people on purpose.
However, if I said anything like this to people who do production R/T
for a living, i.e., like controllers in fighter planes, or cars, they
would take it as an insult to their intelligence and get angry.  This is
*not* speculation, I've seen it happen (not because I did it).
Likewise, chip vendors have produced a wide variety of chip variants
with carefully crafted features aimed exactly at real-time response, and
didn't do it by accident, or because they were only thinking of average
properties.  Some of us (including me) have spent numerous hours working
through worst-case latency examples for customers, strangely enough for
the exact combination of real-time response plus floating-point
described above.

|> one wants both real-time and a general purpose OS on a commodity PC.
|> Even without these systems, 1Ghz ethernet, plus video, plus
|> a simulation is not an unlikely combination in the next couple of
|> years.  In this job mix an extra couple of microseconds here and there
|> can make a serious difference and big working sets are going to mix
|> with small ones.

1) PCs use hardware-refilled TLBs, so they do not have the issue being
theorized about.  The simplest thing is to get a faster PC, but of
course, that may not do the job either.

2) It is perfectly reasonable to *want* to mix large programs and
real-time on one CPU ... it is just very difficult to make the response
*guarantee* get very close to the *typical* response, especially if you
demand an environment where the non-realtime tasks are permitted to
consume 100% of the resources.

3) If you need a real-time response of 10 microsecs, and you have done
everything to guarantee 11, and it takes 1 for a TLBmiss, then it
matters.  If you need a real-time response of 10, and the actual
response varies between 5 and 200, then the 1 is not the problem...

4) So, to get back to reality, how about posting some numbers about the
RTLinux example mentioned, and anything known about where the time *is*
going in the 10 microsec interval (because it certainly isn't going into
software TLBmisses).

-- 
-john mashey  DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 18 Nov 1997 23:40:49 GMT

In article <slrn674hj8.cbq.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> Even in the x86 world Cyrix has lockable cache lines, and Linux and
|> OS's like L4 go to some lengths to avoid TLB reloads despite the hardware.
|> We found that the 603 PPC with software-controlled TLBs could match the
|> performance of faster 604 PPCs with bigger L1 caches. I think that
|> we could get a major performance increase on x86s with more control
|> of TLB loading.

Fewer cycles are always good; as usual, the question is where all the time
is actually going ... and even worse, where it *might* go in various weird
cases (having worked with people who thought airplanes would crash if
events were missed...).

|> Obviously you can't give away more than 100% of resources,
|> but most PCs are idle most of the time. Certainly

Since we're talking about R/T (of the guaranteed-latency flavor), I agree
with your earlier comment that averages don't mean much.

|> a 300mhz PII or 700mhz Samsung Alpha should have the power to
|> run unix and do data acquisition during pauses between keystrokes.
|> >3) If you need a real-time response of 10 microsecs,
|> >and you have done everything to guarantee 11, and it takes 1 for a TLBmiss,
|> >then it matters. If you need a real-time response of 10,
|> >and the actual response varies between 5 and 200, then the 1 is not the
|> >problem...
|>
|> It is possible to get a 486/33 ISA-based PC to run UNIX and still offer
|> 30us guaranteed response time.

Do you have the *typical* (i.e., the mode) response time for that same
system? It's always interesting to know how close the worst case is to the
typical.

|> The real barrier right now seems to be in access to off-chip i/o
|> and irq controllers during periods of heavy i/o. I don't have access
|> to a good enough bus analyzer, but I'd bet that the board chip set
|> holds up the processor when the PCI bus is too busy.

Yes, and it may be even worse than this...

|> The sequence:
|>     read global clock        % an io action itself
|>     write i/o port
|>     read global clock
|>     compute difference in clocks
|>
|> can report 20us but is usually between 0 and 2us for a slow port.
|> The 20us is the problem.

Since I don't know what your systems look like, here are some interesting
questions:
(1) In this particular configuration, what is the longest transfer size on
the bus for any device?
(2) How are DRAM refreshes done, and how long do they take?

|> Which brings us back to a process switch that would require a tlb fetch
|> and instruction reload from main memory -- not anywhere near the critical
|> problem right now, but not completely dismissable either.

OK, now we agree :-) As always, the issue is understanding the worst
conceivable case, which is likely to be something like:

(a) Run a big user program that consumes the cache & TLB.
(b) An external interrupt occurs (and this may or may not be an issue on a
    particular chip, but there is usually some worst-case instruction that
    has to be completed, especially in an in-order-issue,
    out-of-order-completion design; this would often be something like a
    double sqrt, or in the earliest R2000s even an integer divide, whose
    latency you picked up when you did a move-from to save the hi and lo
    registers).
(c) Most CPUs get nonpreemptible for a while.
        Guaranteed response required to NEW interrupt here
        Worst-case cache & TLB misses
(d) Continues until preemptible again.
(e) Preempts to handle guaranteed interrupt.
(f) Guaranteed interrupt starts execution, but:
    just before it gets the PCI bus, disk DMA commences and ties up the bus
    for some length of time, which can be difficult to figure out, and
    then the code runs into memory refresh cycle(s).

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
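
A rough C analog of the quoted clock/port/clock measurement sequence, with
POSIX clock_gettime() standing in for the original global-clock read and
do_io_write() a hypothetical placeholder for the port write. As the
discussion stresses, what matters is the distribution (best vs. worst
case), not the average:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    static void do_io_write(void) { /* placeholder for "write i/o port" */ }

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void)
    {
        uint64_t worst = 0, best = UINT64_MAX;

        for (int i = 0; i < 1000000; i++) {
            uint64_t t0 = now_ns();
            do_io_write();
            uint64_t dt = now_ns() - t0;
            if (dt > worst) worst = dt;
            if (dt < best)  best  = dt;
        }
        /* a 20us worst case against a 0-2us typical case is exactly the
         * gap between *guarantee* and *typical* discussed above */
        printf("best %llu ns, worst %llu ns\n",
               (unsigned long long)best, (unsigned long long)worst);
        return 0;
    }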
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 20 Nov 1997 02:11:22 GMT

In article <34919c56.183150240@philos.philosys.de>, Emil.Naepflein@philosys.de (Emil Naepflein) writes:
|> The real question is:
|> Isn't it possible to use a hardware solution or other solution for
|> such a problem?
|> Is it worth investing all the money in tuning the software on
|> limited hardware, instead of building a special solution?

Yes, we agree; that is, the point is that there are *lots* of issues to
worry about, the closer you want guarantees to approach the typical.

|> > Just before it gets PCI bus, disk DMA commences, and ties up the bus for
|> > some length of time, which can be difficult to figure out, and then,
|> > the code runs into memory refresh cycle(s).
|>
|> Why not pin down the corresponding TLB entry?
|>
|> Use a MIPS processor and you will have absolutely no problem with
|> the TLB for realtime applications. ;-)

Yes, we agree! These applications are part of the reason why:
(a) MIPS memory-mapping has always included some unmapped space that
    requires zero TLB entries.
(b) There are wired entries that don't get replaced by normal refill
    sequences.

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
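
A toy C model of point (b), the wired-entry idea: on MIPS, the Random
register used by tlbwr never selects a slot below the Wired boundary, so
pinned entries survive every refill. This is a behavioral sketch, not real
CP0 programming; the 64-entry size and all names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    #define TLB_SLOTS 64

    struct tlb_entry { unsigned vpn, pfn; int valid; };
    static struct tlb_entry tlb[TLB_SLOTS];

    static int wired = 8;   /* slots 0..7 pinned, e.g. for R/T mappings */

    static void tlb_write_random(unsigned vpn, unsigned pfn)
    {
        /* replacement index in [wired, TLB_SLOTS) -- mirrors tlbwr */
        int slot = wired + rand() % (TLB_SLOTS - wired);
        tlb[slot] = (struct tlb_entry){ vpn, pfn, 1 };
    }

    static void tlb_write_wired(int slot, unsigned vpn, unsigned pfn)
    {
        /* explicit indexed write below the wired boundary -- mirrors tlbwi */
        if (slot < wired)
            tlb[slot] = (struct tlb_entry){ vpn, pfn, 1 };
    }

    int main(void)
    {
        tlb_write_wired(0, 0x1000, 0x8000);   /* pin a real-time mapping */
        for (int i = 0; i < 100000; i++)      /* hammer the TLB with misses */
            tlb_write_random(rand(), rand());
        printf("wired entry still valid: %d\n", tlb[0].valid);  /* always 1 */
        return 0;
    }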
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 20 Nov 1997 02:16:16 GMT

In article <34939fa4.183996343@philos.philosys.de>, Emil.Naepflein@philosys.de (Emil Naepflein) writes:
|> If the performance of the database application is critical, the problem
|> can be solved by using more than 1 GB memory and big pages. Putting
|> more memory into systems to avoid paging is a standard procedure for
|> mission-critical performance tuning. By combining this with
|> intelligent use of big pages the TLB problem just disappears.
|> If we look for a cost-effective way to solve the problem, it's probably
|> much cheaper to attack the problem in the processor by adding hundreds
|> or thousands of TLB entries. This seems to me much cheaper than
|> buying GBs of memory.

Unfortunately, it is not always cost-effective to add TLB entries: it
really depends on tradeoffs, and (sometimes) the impact on cycle time;
people already use 2-level TLBs sometimes.

One of the reasons most vendors have started to include multiple page sizes
as a general feature (i.e., not just a handful for special cases) is: if
you handle, for example, 4KB -> 16MB, then 16MB pages can map 4096X as much
space as 4KB ones, whereas it is fairly difficult to wrestle the chip folks
into 4000X-bigger TLBs :-)

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
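
The coverage arithmetic, spelled out in a small C sketch (the 64-entry TLB
size is an assumed, typical figure, not from the post):

    #include <stdio.h>

    int main(void)
    {
        long entries = 64;              /* assumed TLB size */
        long small   = 4L << 10;        /* 4 KB pages  */
        long big     = 16L << 20;       /* 16 MB pages */

        printf("4KB pages:  %ld KB mapped\n", entries * small >> 10); /* 256 KB  */
        printf("16MB pages: %ld MB mapped\n", entries * big >> 20);   /* 1024 MB */
        printf("ratio: %ldx\n", big / small);                         /* 4096    */
        return 0;
    }

The same 64 entries reach 256KB with 4KB pages but a full gigabyte with
16MB pages: the 4096X factor comes entirely from the page size.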
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: page sizes (was: Re: Cache and context switches)
Date: 20 Nov 1997 20:05:03 GMT

In article <01bcf5bf$d3512380$ed0abf81@peglarr.network.com>, "Network Systems Corporation" <robp@network.com> writes:
|> Organization: Network Systems Corporation
|>
|> Another episode of "there's nothing new under the sun..."
|> > One of the reasons most vendors have started to include multiple page
|> > sizes as a general feature (i.e., not just a handful for special cases)
|> > is: if you handle, for example, 4KB -> 16MB, then 16MB pages can map
|> > 4096X as much space as 4KB ones, whereas it is fairly difficult to
|> > wrestle the chip folks into 4000X-bigger TLBs :-)
|>
|> John is right (of course) but the '...started to include...' part is
|> interesting.
|> Multiple page sizes, including relatively large-sized pages, have been
|> around for nearly 3 decades. The CDC Star-100 was the first machine (AFAIK) that

Yes; hardly anything is new. We still haven't caught up with everything
360/91s did, for instance. Maybe I should have said more: "started to
include" meant that more than one vendor is doing it, it's not an odd case,
and it appears in products that are reasonably widely available.

This is along the lines of: every mistake happens at least 3 times:
    first the mainframe folks,
    then the minicomputer folks,
    and then the micro folks, at least once

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com  DDD: 650-933-3090  FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: John Mashey <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Programmable TLB management?
Date: Wed, 27 Jun 2007 23:27:34 -0700
Message-ID: <1183012054.119993.3680@i38g2000prf.googlegroups.com>

On Jun 25, 6:54 am, vandy...@gmail.com wrote:
> My experience with this was on a MIPS processor (still, IMHO, one of
> the best architected CPUs I've ever had the pleasure to port an OS
> to). While in theory this flexibility is wonderful, in practice a
> two-level PTE tree with 4K pages works so well, for such a broad range
> of usage patterns, that all you end up doing is trying to code up
> two-level PTE walks as efficiently as possible.

Thanks for the kind words, but note that it depends on what you're doing:

a) BSD UNIX was first ported to the original MIPS R2000.

b) A few months later, System V was ported, and for reasons I can't recall
(it being 20 years later), it had a slightly different UTLBMISS code. This
actually exposed a hardware bug that BSD had never encountered, but since
it was software, we just tweaked the code to work around it until it got
fixed. This bug, of course, confirmed the argument of the OS people who'd
fought with bugs for years in complex hardware MMUs, and had begged the
chip designers for the most minimalist MMU we could get ... and even that
had a bug. It was unsurprising that early-1980s micros often had MMU bugs,
and OS programmers hated them.

c) Some people doing embedded systems have managed not to have PTEs in
memory at all, or at least none for some parts of their address space.
I.e., if there are user tasks, but they can be direct-mapped to some
regions of memory, it may be just as easy for the UTLBMISS code to simply
compute TLB entries, rather than loading PTEs from memory.

d) When the R4000 introduced the wide range of page sizes, and then as
64-/32-bit systems came into use, some systems had a group of different
UTLBMISS routines, selected on a per-process basis. This was especially
true of the big SGI ccNUMA systems, which supported multiple page sizes and
all ran mixtures of 64- and 32-bit processes.

e) Hardware assists are especially useful if the primary use of a CPU is
for software systems owned by the chip vendor, but if a chip has to be
useful to a wide range of OSes, it gets harder to get that right, and
software MMU handling tends to make for quick ports, which was at least
important in the 1980s.

f) I've long ago pointed out that in cache-coherent multiprocessors,
especially ccNUMAs, any MMU table-walk hardware must be very careful not to
cause coherency problems, especially if it goes around the cache at some
point. Software TLB handlers just use the existing coherency hardware.
Hardware TLB handlers can also, but there were bugs in 1980s micros where
this was a problem in cache-coherent SMPs.

g) TLB misses *CAN* add noticeable performance overhead for some kinds of
codes, on some kinds of CPUs, but one always has to measure and analyze the
real overhead, not do it from intuition. In many designs, no matter what
you do, a PTE that is a cache miss all the way to main memory is the main
cost anyway. Both MIPS and PA-RISC managed to get good performance for many
years with (mostly) software-managed TLBs.

h) If I could go back and do things differently in 1985, knowing everything
we do now, I'd probably tweak a few minor details in the original 32-bit
version to make selectable UTLBMISS routines more efficient, to do a little
more for kernel misses, and to make the transition to 64/32-bit a little
cleaner...
I might allow for a mechanism to support optional hardware assist, although the first version wouldn't have had that. But I still would have kept a similar software-reloaded TLB.
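
A behavioral C sketch of the two refill styles described above: the
two-level PTE walk that the real UTLBMISS handler performs in a few CP0
instructions (mfc0/lw/mtc0/tlbwr/rfe on the R2000), and point (c)'s
computed, PTE-less mapping for direct-mapped embedded tasks. All names, the
4KB page size, and the table layout are illustrative assumptions, not the
actual MIPS handler:

    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SHIFT 12                  /* assumed 4KB pages */
    #define PTES_PER_PAGE 1024             /* assumed table width */

    typedef uint32_t pte_t;

    /* Style 1: two-level walk -- a root table of pointers to PTE pages,
     * as in the quoted "two-level PTE tree with 4K pages". */
    static pte_t *root[PTES_PER_PAGE];

    static pte_t refill_from_tables(uint32_t vaddr)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        pte_t *leaf = root[vpn / PTES_PER_PAGE];
        if (leaf == NULL)
            abort();                       /* real handlers take a nested fault */
        return leaf[vpn % PTES_PER_PAGE];  /* this load may itself miss the
                                              cache: often the dominant cost,
                                              per point g) */
    }

    /* Style 2: point (c) -- direct-mapped tasks can compute the entry
     * instead of keeping PTEs in memory at all. */
    static pte_t refill_computed(uint32_t vaddr, uint32_t region_base_pfn)
    {
        return ((region_base_pfn + (vaddr >> PAGE_SHIFT)) << 1) | 1; /* PFN|valid */
    }

    int main(void)
    {
        static pte_t leaf0[PTES_PER_PAGE];
        leaf0[5] = (123u << 1) | 1;                  /* fake PTE for vpn 5 */
        root[0] = leaf0;

        pte_t a = refill_from_tables(5u << PAGE_SHIFT);
        pte_t b = refill_computed(5u << PAGE_SHIFT, 0x800);
        return (a == leaf0[5] && b != 0) ? 0 : 1;
    }

The per-process-handler idea in d) amounts to swapping in a different
refill function per task; the hardware only ever sees whatever entry the
handler writes.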
From: John Mashey <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Programmable TLB management?
Date: Sun, 08 Jul 2007 23:08:19 -0700
Message-ID: <1183961299.397369.250160@g37g2000prf.googlegroups.com>

On Jun 28, 4:26 am, Anne & Lynn Wheeler <l...@garlic.com> wrote:
> 360/67 had a bug in the associative array (i.e. used by 360/67 for TLB)
> that charlie (aka compare&swap, CAS, charlie) found circa 1970 (nearly
> 40yrs ago). .... then there
> was some chance the virtual address space execution and/or the system
> might have some explained anomalous behavior. Kernel software work
> around was to make sure that LCTL was always done.

Ahh, my 2nd S/360 machine... Needless to say, there was no implication that
the 1980s micros were the first to have weird MMU bugs; they merely
followed the traditions from mainframes and minis. I think "explained"
above was supposed to be "unexplained".

Systems programmers *loathe* MMU (& related exception-handling) hardware
bugs because they:
- often create effects noticeable only long after the damage has been done
  [i.e., like causing a few words to be overwritten in some page somewhere],
- may have timing dependencies (i.e., related to clock interrupts),
- may be hard to replicate, as when inserting debug code eliminates the bad
  behavior.

They often seem to incorporate an unholy mixture of Murphy's Law and a
Heisenberg Uncertainty Principle analog. The main hope is that once one
understands the problem, there is a simple fix, and that the bug is found
early enough that you haven't already shipped a bunch of them.