Friday, February 25, 2022

Software-Refilled TLBs

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: MIPS-UNIX-context switch
Date: 11 Jul 1995 19:38:02 GMT

In article <3teqdm$hk6@data.interserv.net>, levine@amarex.com writes:
|> Organization: Amarex Technology - High Speed Glitch Division
|> 
|> I have a question which I would like to direct only to those who have
|> MIPS 3000 knowledge.  Given a MIPS 3000 chip, an I-cache, a D-cache,
|> and main mem.  If this configuration were made into a UNIX-based
|> machine, where would the majority of time be spent for every context
|> switch?  e.g. Saving regs, clearing TLB...

a) Not saving regs: figure that you save:
        33 integer registers [R1-R31 + HI + LO]
        some number of CP0 registers, let's say 7
        and you might or might not arrange to save 32 32-bit FP registers,
                depending on how your OS wants to work.
        Assuming the interrupt sequence is in the I-cache, and a good memory
        system, saving 40 registers = 40 cycles; @ 40 MHz, = 1 microsecond.
        Real systems would likely be slower, so guess a couple microseconds.

        Restoring another register set: likely to be cache misses, so
        takes a few microsecs more.

b) You don't need to clear the TLB, since there are Address-Space IDs, such
that you only need to flush the TLB every time you see >64 distinct processes.
You would normally reload a handful of TLB entries, then let other missed
entries fault in.  Base cost: a few microseconds.  Most OS's use the trickery
of the MIPS TLB direct-mapped region to avoid TLB misses for kernel code.
Caches are physically-tagged, so you get whatever sharing is really there.
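
As a hedged sketch of the ASID trick (not actual kernel code: the names
and the generation counter are invented, and 64 IDs matches an
R3000-class 6-bit ASID), a context switch only has to do this:

        #define NASID 64                /* R3000-class: 6-bit ASID */

        extern void tlb_flush_all(void);
        extern void set_tlb_asid(unsigned asid);

        struct proc { unsigned asid; unsigned asid_gen; };

        static unsigned cur_gen = 1;    /* which "round" of ASIDs is live */
        static unsigned next_asid = 1;

        void switch_address_space(struct proc *p)
        {
                if (p->asid_gen != cur_gen) {   /* no ASID this round yet */
                        if (next_asid == NASID) {
                                tlb_flush_all();  /* only after 64 distinct */
                                cur_gen++;        /* processes: start over */
                                next_asid = 1;
                        }
                        p->asid = next_asid++;
                        p->asid_gen = cur_gen;
                }
                set_tlb_asid(p->asid);  /* other ASIDs' entries stay resident */
        }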

c) In UNIX, most of the time goes to UNIXy scheduling & overhead, and
executing unpredictable code paths and accessing state data likely to be cache
misses.
        Register save/restore is likely a factor only in very tight embedded
        control systems.


-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Stack vs GPR and Multi-threading (was Re: A Series Compilers)
Date: 12 Jul 1995 17:44:10 GMT

In article <1995Jul12.143336.21769@il.us.swissbank.com>,
gerryg@il.us.swissbank.com (Gerald Gleason) writes:

|> If I'm interpreting what you are saying correctly, it is that in terms of  
|> total system performance, register save/restore is a much smaller  
|> opportunity than the latency associated with bad locality in various  
|> forms.  A multi-threaded processor might be able to fill in most of what  
|> would be idle time waiting for cache misses doing useful work on another  
|> thread.  The issue of multi-threading is somewhat orthogonal to GPR vs

Yes, and there is a reasonable separate thread running on multi-threaded
CPUs, including contributions from people who have/are building them.
But for sure, I think that much of the worry many people have about
register save/restore exists because it's simpler to worry about than,
for example, all these latency and probabilistic arguments, i.e., it's
the equivalent of the "coffee-fund paradox":
        a) If the coffee-pot fund is running low, a committee will debate
           long and hard about the solution thereof.
        b) But when the same committee must vote on a $10B appropriation,
           there is little debate: how many people really grasp $10B? :-)

This is not to say register save/restore time is unimportant ... but
every time I've done the cycle-by-cycle counts on a real implementation,
running a general-purpose OS, I got convinced I should worry about other
things more.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 18:31:04 GMT

In article <63vhbo$hmk$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:


|> Yes, it is.  But even with hardware reloads, a TLB miss is often
|> much more expensive than a cache miss (sometimes 5-10 times more).
|> With software reloads, they are death on wheels :-(

Since "death on wheels" is difficult to evaluate, but clearly conveys the
thought that this is a bad idea, let us observe:

Software-reloaded TLBs are widely-used; in fact, many of the microprocessor
types commonly-used to run large programs on large datasets "happen" to do
this, specifically:

- PA-RISC & MIPS, from 1986 onward
- DEC Alphas, 1992-
- Sun UltraSparcs, 1995-

Consider the kind of code used to start this example: FORTRAN code with
big floating point arrays, an area of interest to RISC chips.
Of the 5 major RISC micro families, 4 have chosen to use software-reloaded
TLBs, with IBM being the main exception.

Now, when we published info about MIPS RISC in 1986, most people
(outside of HP & MIPS) thought software-reloaded TLBs were crazy ...
but from 1986 through 2000, I count 6 *new* micro architectures used
in systems where large memories & TLBs might be relevant:
[PA-RISC, MIPS, SPARC, IBM POWER/PPC, Alpha, IA64],
and of those 6, the current implementations of 4 use software-reloaded
TLBs, 1 doesn't, and one (IA64) remains to be seen.

There are many reasons of flexibility and debuggability to have a software
TLB, of which some were covered in 1986 COMPCON, "Operating System
Support on a RISC", DeMoney, Moore, Mashey.

It should be no surprise that designers of chips study TLB-miss overhead,
and try to allocate resources appropriately.

In modern systems:
1) A TLBmiss, in software, may actually take *less* time than
doing a cache miss.  Why is that?
TLBmiss:
        a) Miss
        b) Trap
        c) Refill TLB, making one or more memory references, which *may*
        well hit in the off-chip data cache.
        d) Return
Cache miss:
        a) Miss
        b) Schedule cache miss to memory, which can be a very long time
                in some ccNUMA systems, but is easily 300-600ns in many
                SMPs.  With clock cycles in the 2-5ns range, that's
                60-300 clocks, and with 2-4-issue superscalar chips, that's
                120-1200 instructions.

Now, of course, there are also TLBmisses that take longer than cache misses,
but in fact, whether a refill is done by a trap to software, or by a
hardware engine, the time is:
        T = C + N * M
        C = ~constant overhead time
        M = time for cache miss
        N = number of cache misses caused by doing TLB processing
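
To put illustrative numbers in (these are assumptions, not measurements
of any particular chip): with C ~ 20 cycles of trap overhead and N = 1
PTE reference that hits in the off-chip cache at M ~ 10 cycles, T ~ 30
cycles, i.e., much cheaper than the 60-300 clock miss to DRAM above;
if that one PTE reference itself goes to DRAM, T is about one ordinary
cache miss plus trap overhead.  A hardware walker pays the same N * M
term and only shrinks C.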

If there are a lot of TLBmisses, the TLBMiss code ends up living in the
on-chip L1 I-cache.  If the PTE structures for hardware or software versions
are the same, there will be about the same number of accesses to memory.
In some cases historically, the complexity of TLB-table-walks in memory
has demanded that PTEs *not* be cacheable, hence giving up the ability to
use the cache as a backing store for the TLB ... which is trivial and
straightforward to accomplish in a software-controlled TLB.

TLBs are famous for the weird bugs and odd cases in many early micros,
which is why OS people were often the ones who preferred software-controlled
ones as less troublesome.

For the long-term, one can either make TLBs larger (more entries), or
allow entries to have multiple sizes ... and the industry seems to be
tending towards the latter; the R4000, in 1992, went this way, because
we couldn't figure out how to keep up with 4X/3 years in memory sizes,
for on-chip data structures that had to be fast.

Bottom line: Nick's characterization of software TLBs as "death on wheels",
in general, flies in the face of increasing use of this technique by
very experienced CPU designers.





--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 7 Nov 1997 21:50:52 GMT

In article <63vrmv$nd5$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:

|> >In modern systems:
-------^^^^^^
|> >1) A TLBmiss, in software, may actually take *less* time than actually
|> >doing a cache miss.  Why is that?
|>
|> Well, on the machines that I have tried (and they DO include some of the
|> ones you mentioned), a TLB miss is usually significantly more expensive.
|> The factor of 5-10 times was based both on measurement, as well as some
|> figures given by current hardware designers for their chips.

It would be helpful to quote some of these, since (date of system) is
fairly important in this discussion, given that, in 1986, we had
8MHz (125 ns) single-issue CPUs, and DRAM with raw read times of ~120ns,
while we now have 2-4-issue CPUs in the 200-500MHz (5-2ns) range, and
raw DRAMs ~60ns, and total cache miss times in the 200-500ns range for
SMPs and local nodes, 400-1000 for low-latency ccNUMAs, and maybe
200-3000ns for higher latency ones.

|> >Now, of course, there are also TLBmisses that take longer than cache misses,
|> >but in fact, whether a refill is done by a trap to software, or by a
|> >hardware engine, the time is:
|> >        T = C + N * M
|> >        C = ~constant overhead time
|> >        M = time for cache miss
|> >        N = number of cache misses caused by doing TLB processing
|>
|> I think that you are being misleading - in fact, I am certain.  In many
|> or most architectures, handling a miss in software involves a context
|> switch.  Not a full context switch, to be sure, but the CPU has to move
|> from DAT-on in user mode to DAT-off in kernel mode.  This means that the
|> constant overhead is potentially a great deal larger than for the
|> hardware solution.

You may be certain, but you are incorrect.
There is no context switch (as most people use the term, i.e., from one user
task to another user task.)  I don't recall exactly what the Alpha &
UltraSPARC folks do, but they're not idiots, so presumably they do something
similar to what HP & MIPS have done for a long time:

There is a special, low-overhead trap to the OS, and it has nothing
to do with turning DATs on & off.  HP provided some special registers
to make this faster, MIPS used a "hack"  of telling user
code that there were 2 registers they could expect to be trashed
at any time, so the kernel doesn't even have to save/restore these
registers; there were enough registers to get away with this.
Various chunks of hardware are added to make extractions or virtual
references (in Alpha's case, anyway)  faster, where the issue is
a series of dependent operations that are trivial to do in hardware,
leaving the sequencing and control in software.

Note: some of the beliefs here come from a long discussion
in a Cupertino bar with PA-RISC architects, of the form
"why did you do this? we did that..." There was some head-slapping when
I said we hadn't had to do special registers, although I had tried to
get 3 rather than 2 for the kernel, but the compiler people wouldn't give
me the third one.

I wrote the original MIPS version of such code in early 1985, Steve Stone
tuned it up, and we identified various simple hardware that could help.
I have the original code somewhere, but couldn't find it; here is Steve's
version as of April 1 1985:

From scs Mon Apr  1 16:02:19 1985
From: scs (Steve Stone)
Subject: user TLB miss.

   I have been trying to reduce the number of instructions involved in
resolving a user tlbmiss.  The best that I can do (with some hardware
changes assumed) is around 15 cycles (assuming no cache misses).


   The following is a first cut at the problem.  The following hardware
features are assumed:

        - There are separate UTLBMISS/KTLBMISS cause bits.
        - The EPC is predecremented by hardware if the branch delay bit
          is set in the SR.  I know this is difficult to implement.  One
          possible way around this is to separate out UTLBMISS in a
          branch delay slot from other UTLBMISSes.
        - At the time of a UTLBMISS, the TLBENHI register is set up
          correctly (the TLBPID is or'd in and the VPN is correct).
        - There are two registers usable by the kernel only.  The state
          of these registers is never saved and can only be trusted
          while interrupts are disabled (called RT1 and RT2).

   Here is the exception handler code:

        /*
         * Grab the cause bits.  User tlbmiss should be handled quickly
         * if possible (i.e. the only cause for the exception).
         */
        mfcause RT1
        sub     RT1,CAUSE_UTLBMISS
        bne     RT1,r0,exc_noutlbmiss
        /*
         * - Grab the VPN/TLBPID register from CP0.
         * - Isolate the VPN in the low order bits * 4.
         * - Add in the USERPTBASE constant (in kseg3).  The high order
         *   bit of the VPN will have been set in the TLBENHI.  This
         *   should be taken into consideration when choosing the
         *   USERPTBASE location.
         */
        mfc0    RT1,TLBENHI
        lsr     RT1,TLBPIDSZ-2
        and     RT1,~3
        la      RT2,USERPTBASE
        add     RT1,RT2

        /*
         * We now have a pointer to the TLB entry.  Grab it.  A fault
         * may occur here.  If so, the KTLBMISS handler will have to
         * be smart enough to reset RT1 to be the original PTE pointer
         * and reset the c0 registers so the following code will work.
         */
        lw      RT1,0(RT1)
        /*
         * If the PTE is invalid, handle the long way.
         */
        and     RT2,TLB_V,RT1
        beq     RT2,r0,exc_upteinval
        mtc0    RT1,TLBENLO
        c0      TLBWRITE
        nop
        rfe
        nop
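
For readers who don't read MIPS assembly, here is a rough C paraphrase
of the fast path above.  This is a sketch only: pte_t, the constants,
and the helper names are invented stand-ins for the CP0 operations, and
the real handler is the ~15 instructions shown, not compiled C:

        typedef unsigned int pte_t;

        #define TLBPIDSZ   6            /* TLBPID (ASID) bits: illustrative */
        #define TLB_V      0x0200       /* PTE valid bit: position illustrative */
        #define USERPTBASE 0xc0000000u  /* linear user page table, in kseg3 */

        extern unsigned read_c0_tlbenhi(void);  /* mfc0 TLBENHI */
        extern void write_c0_tlbenlo(pte_t);    /* mtc0 TLBENLO */
        extern void tlb_write(void);            /* c0 TLBWRITE */
        extern void handle_invalid_upte(void);  /* exc_upteinval: slow path */

        void utlbmiss(void)
        {
                /* TLBENHI holds VPN|TLBPID; the shift yields a word
                   index into the linear user page table, exactly as the
                   lsr/and/add sequence above computes a byte offset. */
                unsigned idx = read_c0_tlbenhi() >> TLBPIDSZ;
                pte_t *ptep = (pte_t *)USERPTBASE + idx;
                pte_t pte = *ptep;      /* may itself miss: the KTLBMISS case */

                if (!(pte & TLB_V)) {   /* invalid PTE: handle the long way */
                        handle_invalid_upte();
                        return;
                }
                write_c0_tlbenlo(pte);  /* TLBENHI was set up by hardware */
                tlb_write();            /* drop the entry into the TLB */
                /* rfe: return and re-run the faulting user reference */
        }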





|> You are effectively saying that this case has been optimised so much
|> that it is no longer significant.  That is most interesting.

Actually, I didn't say that.  It is sometimes significant for certain
programs.  However, truly big programs often want big pages anyway,
so once you figure out how to do that in the general case, you are better
off than shaving a few cycles off something with a terrible TLB miss rate.
This problem gets studied every time, and the general approach is to give the
TLB some more resource, but worry a lot more about cache misses, which are
way more frequent for most codes.
a) If a program has a low TLB-miss rate
        a1) if the cache-miss rate is low, all is well.
        a2) if the cache-miss rate is high, then that's the problem.
b) If the program has a high TLB-miss rate.
        b1) If the cache-miss rate is high, you're down to DRAM speed,
        and either you have a problem for a vector machine, or you need to
        be doing cache-blocking anyway.
        b2) If the cache-miss rate is low, then the TLB is actually the
        bottleneck.

Many designers have never been able to find enough (b2) programs to justify
huge amounts of hardware to help the TLB.  Note of course, that IBM RS/6000s
have a fairly different philosophy in various ways.

|> >TLBs are famous for the weird bugs and odd cases in many early micros,
|> >which is why OS people were often the ones who preferred
|> >software-controlled ones as less troublesome.
|>
|> Don't you really mean that the bugs are easier to fix, and hence less
|> embarrassing :-)

Not exactly.  What I meant was that almost any OS person involved in the
design of the first round of RISC chips had had experience with
early micros, and running into weird-case bugs late in the development
cycle, with complex hardware logic that took full chip spins to fix.
It isn't a question of embarrassment, it's a question of whether or not
you can ship a product.  The following has been known to happen, when designing
new systems with brand new micros:
        (a) System comes up.
        (b) Debug it, looks good.
        (c) Get a bunch of systems ready, be running QA.
        (d) Fix a bug in C compiler.
        (e) Some instruction moves 2 bytes, crosses a page boundary,
        regression tests start breaking 4 weeks before shipment;
        it takes 2 weeks to figure out exactly what is happening.
        (f) Then you realize that the odd case could potentially happen
        with any user-compiled program, and it is a bug in the microcode,
        and it's going to be 3 months before it gets fixed ... and you're dead.
The MIPS utlbmiss codes have often been diddled to work around some
odd hardware error, so that you can get beyond the first ones to see what
else there is.

|> >Bottom line: Nick's characterization of software TLBs as "death on wheels",
|> >in general, flies in the face of increasing use of this technique by
|> >very experienced CPU designers.
|>
|> I accept your correction!  I stand by my point that TLB misses are
|> generally "death on wheels", but it is very likely that I have been
|> using software implementations that I thought were hardware :-)
|>
|> I also take your point that TLB misses are becoming less expensive as
|> time goes on, in a way that cache misses are not.  But I don't believe
|> that the turnover point has yet arrived!

Hmmm.  I thought your point was that "death on wheels" was equivalent
to "software-reloaded TLBs are a bad idea and should be done away with."
Was that a misinterpretation?

I'm not sure what "turnover point" means. A CPU designer has to provide
a set of facilities, which for systems-type chips, includes cache + MMU,
with various tradeoffs. All that's been happening is that countless
studies have convinced many designers that they can avoid a bunch of
complex microcode, or worse, a lot of random logic with touchy special cases,
in favor of a low-overhead trap to a small piece of code, and that if it
takes a few more cycles to do the logic, it takes less die space,
is more flexible, and the times are increasingly dominated by the time to
fetch PTEs from memory anyway.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 9 Nov 1997 06:31:38 GMT

In article <641ecg$ot1$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:

|> Also, I am talking about the TOTAL effect on application speed, and not
|> just the raw cost of processing the problem.  The problem with TLB misses
|> (and, generally, anything that needs a trap) is that the indirect costs
|> are often larger than the direct ones.  Things like conflict for the
|> first-level cache, interference with coprocessors and so on.

|> Well, in MY book, that is a partial context switch, and the TLB refilling
|> is being done by a hybrid hardware/software solution!   But I accept your
|> point that the TLB miss handler 'context' is both minimal and permanently
|> available.

...

|> >Hmmm.  I thought your point was that "death on wheels" was equivalent
|> >to "software-reloaded TLBs are a bad idea and should be done away with."
|> >Was that a misinterpretation?
|>
|> Yes and no.  It IS what I meant, but we were clearly talking about
|> different things!  I have no problem with the solutions that you have
|> described, but I would call them hybrid solutions.

"But `glory' doesn't mean `a nice knock-down argument,'" Alice objected.

"When *I* use a word," Humpty Dumpty said, in a rather scornful tone,
"it means just what I choose it to mean-neither more nor less."

Occasionally, discussion threads get going where people attempt to
modify "standard" terminology, resulting in massive confusion.
Usually I stop reading the thread at that point.

*I* use the terms "context switch", "trap", "hardware TLB", "software-reloaded
TLB (or just software TLB)" the same way as other people do, who actually
design chips and OS's for a living, and I propose to people reading this
newsgroup that more things will make sense if they do the same, that is:

1) A context-switch switches state from one process/task to another.
   Maybe someone uses the term "partial context-switch" to mean "trap";
   I'll admit I've never heard it.

2) A *trap* directs the program flow to a kernel address, which:
   - Takes action and returns very quickly, as in a normal MIPS UTLBMISS trap.
   - Takes action and returns more slowly, as in a 2-level UTLBMISS,
      or some system calls
   - Takes action that eventually turns into a context-switch, as in
      a system call that causes a real I/O & a reschedule to another process,
      or UTLBMISS that is discovered to actually be a page fault.

3) A "hardware TLB" usually means a TLB, which, if the desired entry is not
present, performs a tablewalk, or other appropriate mechanism to reload the
TLB entry from memory, without causing a trap for normal refills.
Such mechanisms were used in 360/67, 370..., VAX, and most early micro TLBs.
Depending on the implementation, the refill might use
part of the microcode (if there is any), or else logic state machines,
and whatever it does is hardwired as part of the CPU spec.

4) A "software-reloaded, software-refilled, or just software TLB" causes
a trap (or TLB miss, or TLB miss fault)
to the OS, which uses instruction sequences to do the reload,
and return.  The hardware does exactly what it does for other kinds of
traps: record the necessary information, in a convenient form,
and then get to the OS as fast as it can.  Occasionally, CPUs provide
instructions targeted at helping this process, but most such sequences
are regular instructions, executed by the CPU, i.e., "software".

Calling 4) "hybrid" is an unfamiliar term to me.

Of course, TLBs may be used before cache access (Physical Index, Physical
tag), in parallel (Virtual Index, Physical tag), or after (Virtual Index,
Virtual Tag, translate after miss).

Bottom line: there certainly exist programs where TLBmisses are a
serious factor, just as (for instance) big non-zero stride vector codes
run slowly on most cached machines.

People who design CPUs always try
to make TLBs bigger, or map bigger pages, or have less overhead per miss ...
but so far, when people have to do this in the real world and ship CPUs,
TLBs have to fight for their die space along with everything else,
and they get their share, but the studies keep telling people to spend die
space on other things.


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 11 Nov 1997 05:47:11 GMT

In article <mark.879201015@hubcap>, mark@hubcap.clemson.edu (Mark
Smotherman) writes:

|> And I think I've seen this distinction also used in other texts.  Might
|> this reflect earlier IBM OS terminology?  [raise your hand if you remember
|> the definition of "data set" ;-)]

Maybe, just haven't seen it lately.

|> I would agree that with current usage, "process switch" and "context
|> switch" are not usefully distinct.
|>
|>
|> Actually I'm enjoying the thread, and wanted to ask about PTEs:
|>
|> 1) What is the hardware consensus on updating reference and dirty bits?
|>    I.e., is it the standard to set the reference bit of the PTE in memory
|>    upon hardware (or software) table walk and have the processor cause a
|>    trap [interrupt/fault/exception - if you want to talk about other
|>    debatably useful terms in distinguishing various processor actions
|>    and stack-frame formats ;-)] upon the first write to a page?

We wrote about this in the COMPCON paper I mentioned before;
for various reasons:
        (a) One must think very carefully about hardware that updates
        PTEs directly, i.e., without any traps, for reasons discussed below.
        Note that even on some systems that do hardware Dirty bits,
        people often turn them off some of the time to allow for Copy-on-Write.

        (b) In a software-managed TLB, the software normally wants to do
        this, since, after all, the hardware need not have *any* idea
        where page tables are.  In fact, there might not actually *be*
        any page tables in memory.  For example, if you were allocating
        contiguous pages in an embedded environment as segments, you
        might be able to just *compute* the page frame from the VPN with
        no memory references.

        (c) Note that the transition not-Dirty -> Dirty doesn't happen
        "very often", compared to the number of TLB misses, i.e., it's
        on the order of (pagein of writable pages) + (allocation of
        writable pages)

        The MIPS TLBs, starting with R2000:
        (1) Have a Dirty bit per TLB entry, but it is NEVER changed
        directly by hardware.  A write acceess to a page no marked Dirty
        is trapped, and the OS figures out whether it is:
        - an illegal write: error.
        - a write to a writable page that is not yet dirty:
                find the PTE, lock it, mark it dirty, reload the TLB entry
                and mark it Dirty.
        - a write to a writable page, that is dirty, but is temporarily
                marked non-Dirty because it is a Copy on Write:
                do a lot of work.
        - and there are various other more complex cases.
        (2) Have no hardware-set Reference bits, which are simulated by
        software in the time-honored ways.  Reference bits, after all,
        are just hints...

        Since Reference bits are just hints, inconsistencies are no problem,
        and oddly enough, stale copies showing not-Dirty are OK also.
        If a Write access really occurs, the Dirty bit in the PTE better
        get set, but strangely enough, one need not bother to notify any
        other TLBs that have copies showing the page as not-Dirty:
        the next reference will trap, and the TLB will be reloaded anyway.
        On the other hand, if there is a transition from Dirty to not-Dirty,
        you have to flush any not-Dirty entries from all the TLBs that
        might have them.
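
A C sketch of that write-to-not-Dirty trap logic (all names are
hypothetical, and the error/COW paths are stubs):

        struct proc;
        typedef unsigned int pte_t;
        typedef unsigned long vaddr_t;

        #define PTE_WRITABLE 0x1        /* bit positions illustrative */
        #define PTE_DIRTY    0x2
        #define PTE_COW      0x4

        extern pte_t *lookup_pte(struct proc *p, vaddr_t va);
        extern void lock_pte(pte_t *ptep);
        extern void unlock_pte(pte_t *ptep);
        extern void send_sigsegv(struct proc *p);
        extern void do_copy_on_write(struct proc *p, vaddr_t va);
        extern void tlb_reload(vaddr_t va, pte_t pte);

        void tlbmod_trap(struct proc *p, vaddr_t va)
        {
                pte_t *ptep = lookup_pte(p, va);

                if (!(*ptep & PTE_WRITABLE))     /* illegal write: error */
                        send_sigsegv(p);
                else if (*ptep & PTE_COW)        /* dirty page temporarily */
                        do_copy_on_write(p, va); /* marked non-Dirty: work */
                else {                           /* writable, not yet dirty */
                        lock_pte(ptep);
                        *ptep |= PTE_DIRTY;      /* mark the PTE dirty... */
                        unlock_pte(ptep);
                        tlb_reload(va, *ptep);   /* ...and this TLB entry */
                }
        }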



|> 2) If you reset the reference and/or dirty bit of a PTE (e.g., clock
|>    replacement algorithm), I assume you need to flush that PTE from the
|>    TLB - correct?

No, for reference; no, if changing not-Dirty to Dirty;
yes, if changing Dirty to not-Dirty.
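
In C, reusing the hypothetical names from the sketch above (PTE_REF and
tlb_shootdown_all are likewise invented), the rule reads:

        extern void tlb_shootdown_all(vaddr_t va);
        #define PTE_REF 0x8             /* reference bit: illustrative */

        void pte_clear_ref(pte_t *ptep)
        {
                *ptep &= ~PTE_REF;      /* hint only: no TLB flush needed */
        }

        void pte_clear_dirty(pte_t *ptep, vaddr_t va)
        {
                *ptep &= ~PTE_DIRTY;
                tlb_shootdown_all(va);  /* Dirty -> not-Dirty: flush stale
                                           Dirty copies from all TLBs */
        }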

In a complex environment, one becomes happier with TLBs that are
software-refilled, and disconnected from memory.
Consider a simple MMU design:

(1) MMU refills TLB by doing a page-walk, with physically-addressed
data structures that cannot be cached.  MMU sets Dirty bit in PTE in
memory, and Reference bit when needed.
==> Not so bad with CPU whose speed is not much faster than DRAM,
and whose memory system is happy with data transfers the size of one PTE.
Not so good with current CPUs, wide memory systems, long latencies.
Some implementations may require atomic read-modify-write to PTE,
so need to lock this ... which is unappealing in SMP systems,
and really unappealing in ccNUMAs.
==> paging of PTEs is probably not so nice.

Consider the 3 cases:
(1) First reference to a clean, but writable page is a read.
(2) First reference to a clean, but writable page is a write.
(3) First reference is a read, but there is a later write.

Case (1) just loads the TLB, and maybe sets a Reference bit;
Case (2) loads the TLB, and must set the Dirty bit.
Case (3) (assuming all hardware design) has to change the state of
the entry inside the TLB, and either read-modify-write that entry back
to memory right then, or be prepared to do so if the entry is ever replaced.
The former is likely to be less buggy.

Among the various considerations are: uncached access versus cached ones
[suppose some piece of code is scanning the PTEs, which it would prefer
to do using cached accesses]; good use of bus bandwidth; buffering and
queue sizes inside the processor (especially for an aggressive out-of-order
superscalar).

(2) So, how about: the MMU does cached table walks, and uses the normal
cache-coherency mechanisms as it modifies PTEs.

Some of the previous issues go away. On the other hand, consider
a ccNUMA like an SGI Origin, and suppose that R10000s had
hardware-refilled TLBs that wrote Dirty bits and Reference bits
automatically, with caching.

(a) Program issues a store, which enters the load/store queue.
(b) The store gets to head of queue, to the TLB,  but misses in TLB.
(c) MMU does tablewalk, finds the PTE in its cache.
(d) Examining the PTE, the MMU discovers that it needs to modify the PTE,
but discovers that the cache line containing the PTE was not Exclusive,
so must issue a coherency request to the cache line's home node,
to upgrade to exclusive, but it gets nacked because another CPU got there
earlier with the same request, and in fact, this CPU gets an invalidate
back to invalidate the cache line, which means the MMU, in the middle of this,
must discard the PTE it has already gotten, and reissue the access to the
cache to find the PTE (misses this time), then go to memory,
and get what is probably a shared copy (which may well be satisfied via
an intervention from the other node that has just modified this).
You'd probably prefer that PTE fetches normally be done as ordinary reads,
rather than as read-exclusive (intent to write), since they probably
don't change the state of the PTE very often, but of course, when they do,
it gets complicated.

Any table-walking fetches that occur must also check for pending stores,
i.e., stores to the same address that happened logically earlier,
but have not yet graduated, and thus have not actually been written into
the cache.  [This will hardly ever happen, but it could, so you have to check
for it, i.e., queues of loads and stores have to be carefully checked.]

Sooner or later, after potentially multiple memory accesses, and some
fairly complicated churnings of state machines, you have the line of PTEs in
the cache, and the PTE you wanted copied into the TLB, and any modifications
of the entry set in the TLB, and into the (by-now) Exclusive copy of the
cache line that contains the PTE.
Of course, you may have done a writeback of a Dirty cache line, to get a
line into which the PTE line can be placed.


Now, (assuming this is PI-PT cache), you are ready to actually do the
cache access for the store itself.

There are of course, a myriad of cases, and very complex state machines...
and of course, an o-o-o machine wants to have a long load/store queue, and be
overlapping memory accesses, which means all of this should be interleaved
with other TLB refills, cache misses, writebacks, coherency traffic, etc.

Summary: has been said before, but worth saying again:

Design ideas that work fine in sequential uniprocessor systems,
often get very complex when you have to use them in multiprocessors,
and especially with out-of-order CPUs, even RISCs with no more than
one memory address per instruction.

It is very easy to get a design that is simple, but forced to be slower
than necessary in normal cases in order to be correct in all cases,
or else create extraordinary complex designs to get performance
(VAX 9000 comes to mind).

As an especially terrifying example, Dave Cutler once pointed out to me
that a VAX addp6, with all operands indirect, and all of the addresses
crossing pages, and being aligned just right to cross page table boundaries,
could require an amazing:

41 pages (including pages in the page tables of course) to complete.

Also, autoincrements (early in an instruction) that have to be undone if
a translation (later in the instruction) fails, are one more
wonderful thing to deal with.


Finally, all of this is a continued argument that intuition is often bad,
and that computer engineering makes forward progress with better
measurements ... akin to improvements in science from getting
another decimal place.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: jfc@athena.mit.edu (John F Carr)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 01:53:45 GMT

In article <648rgv$46m$1@murrow.corp.sgi.com>,
John R. Mashey <mash@mash.engr.sgi.com> wrote:

>(1) MMU refills TLB by doing a page-walk, with physically-addressed
>data structures that cannot be cached.

Physical addressing and cacheability can be independent.  On the Alpha
21164 the virtual I-cache supports cacheable physical accesses for PALcode.
By locking the mappings for the page tables into the TLB, the page
tables themselves can use virtual addresses.

>Any table-walking fetches that occur must also check for pending stores,
>i.e., stores to the same address that happened logically earlier,
>but have not yet graduated, and thus have not actually been written into
>the cache.

Since this is only relevant for supervisor code, why not require a
memory barrier instruction between the store of a PTE and a reference
to it?  I think that's what Alpha does (at least when multiprocessor
consistency is required).
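
A sketch of that discipline (names invented; on Alpha the barrier would
be the mb instruction, with GCC's __sync_synchronize standing in here):

        typedef unsigned long pte_t;

        /* Publish a PTE so a later table-walk reference, by this CPU or
           another, is guaranteed to see it: store, barrier, then use. */
        void install_pte(volatile pte_t *slot, pte_t newval)
        {
                *slot = newval;
                __sync_synchronize();   /* memory barrier, cf. Alpha "mb" */
                /* references through this mapping are now safe */
        }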


>There are of course, a myriad of cases, and very complex state machines...

The RSC implementation of the RS/6000 (maybe the 601 too?) had a separate
microprocessor to handle TLB misses.

>It is very easy to get a design that is simple, but forced to be slower
>than necessary in normal cases in order to be correct in all cases,
>or else create extraordinary complex designs to get performance
>(VAX 9000 comes to mind).

And they still didn't get that one right.  I think as of the final
microcode version branch prediction didn't work and there was an
invalid form of a nonprivileged instruction which would crash the system.


>As an especially terrifying example, Dave Cutler once pointed out to me
>that a VAX addp6, with all operands indirect, and all of the addresses
>crossing pages, and being aligned just right to cross page table boundaries,
>could require an amazing:
>
>41 pages (including pages in the page tables of course) to complete.

I have a table of instruction times for the CVAX processor.  The final
example, which I'm sure the authors enjoyed constructing, follows
(r/w = read/write, 1 cycle if cache hit).

        INDEX d(r),@d(r),(r)[rx],@(r)+,-(r),@d(r)[rx] - all memory operands
                unaligned across page boundaries, all memory operands take
                TB misses on both reads, M bit clear, cache hits

                specifier 1 time                1+1r
                specifier 1 cross page          8
                specifier 1 TB miss x 2         12+2r
                specifier 1 unaligned           1r
                specifier 2 time                1+2r
                specifier 2 cross page x 2      16
                specifier 2 TB miss x 4         24+4r
                specifier 2 unaligned x 2       2r
                specifier 3 time                2+1r
                specifier 3 cross page          8
                specifier 3 TB miss x 2         12+2r
                specifier 3 unaligned           1r
                specifier 4 time                2+2r
                specifier 4 cross page x 2      16
                specifier 4 TB miss x 4         24+4r
                specifier 4 unaligned x 2       2r
                specifier 5 time                2+1r
                specifier 5 cross page          8
                specifier 5 TB miss x 2         12+2r
                specifier 5 unaligned           1r
                specifier 6 time                3+1r+1w
                specifier 6 read cross page     8
                specifier 6 write cross page    7+2r
                specifier 6 TB miss x 4         24+4r
                specifier 6 M bit clear x 2     18+4r+2w
                specifier 6 read unaligned      1r
                specifier 6 write unaligned     1+1w
                execute, fetch time             39
                                                ----
                total                           248+38r+4w

(that's 7 microseconds at 40 MHz).

Note that this requires more TLB entries than the CVAX has.
I think the CVAX had a 28 entry TLB and the uVAX II 8.  Some
big VAXes had 1K.

--
    John Carr (jfc@mit.edu)


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 03:18:22 GMT
Keywords: locality

In article <64ce5v$p5i$1@flood.weeg.uiowa.edu>, dsiebert@icaen.uiowa.edu
(Doug Siebert) writes:

|> Organization: The University of Iowa
|>
|> Joe Keane <jgk@jgk.org> writes:
|>
|> >Recently i saw a table about TLBs.  I was surprised that the biggest
|> >number of entries was only 128 entries, for some Power chip, and many
|> >chips had considerably less.  It seems to me that that is not enough,
|> >and even 128 is a bit low to run well in many cases.
|>
|>
|> I know some guys doing CAD/CAM using HP workstations.  Back when the C110
|> (120MHz PA-7200) was the fastest desktop box, that's what they were using.
|> Later, the C160 & C180 came out with the PA-8000, and SPEC results showed
|> them to be 100-200% faster.  But for their codes, they were no improvement,
|> and actually slower in some cases, because the TLB in the 7200 was (from
|> memory) 120 entries, and it was only 96 entries on the PA-8000.  The PA-8200
|> is back to 120 entries, and the PA-8500 has I believe 160 entries.  But HP
|> is also going the way of variable page sizes, which for CAD software that
|> runs in a gigabyte of data space, is probably better than trying to find a
|> way to have a massive TLB.

Can you say more about why they believed that the TLB was the issue?
Do the profiling tools there offer such statistics?

It is quite possible that the TLB was the issue, but it is also
possible that:
        (a) Different cache design was the issue.
        (b) The PA8000 spec #s were with new compilers, tuned to PA8000.
            From other experience, it usually takes a while for real apps to
            get compiled with new compilers, and if the binaries don't run
            well on the installed base, it may take a long time.

Anyway, in numerous competitive cases, there were PA7000 binaries whose
instruction scheduling seemed to interact badly with PA8000 chips,
and performance didn't go up until the programs were recompiled, but then
performance did go up, even with identical # of TLB entries.

As usual, what one really wants is a profile of the program that
gives # of TLB misses, cache misses, and cycle costs of them plus
instructions, stalls, etc ... because it is all too easy for human
intuition to be incorrect.

I agree on the bigger pages, of course ... which is why professional
CPU designers have been doing that more and more in recent years.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 03:41:00 GMT
Keywords: locality

People ask why I don't look in on comp.arch so often any more...

In article <64b95a$jqn$1@rocky.jgk.org>, Joe Keane <jgk@jgk.org> writes:

|> TLB contention is a real problem.  In many realistic situations people
|> find that it severely dents performance.  It happens a fair amount, and
|> people know about it, but it's not something really exciting.

Since terms like "real problem", "realistic", "severely dents",
"fair amount" are hard to evaluate, so is this statement...


|> >People who design CPUs always try to make TLBs bigger, or map bigger
|> >pages, or have less overhead per miss ...
|>
|> Bigger pages is not a solution.
|>
|> Bigger pages is a mistake.

It would be good if jgk would back up these strong statements with
some recent professional CPU design experience.  Many professional
CPU designers have recently chosen to broaden their support for
multiple page sizes, and OS's are starting to use them.

|> >but so far, when people have to do this in the real world and ship
|> >CPUs, TLBs have to fight for their die space along with everything
|> >else, and they get their share, but the studies keep telling people to
|> >spend dies space on other things.
|>
|> I think the studies are not very good.  What do they look at?
|>
|> Running a nice, simple benchmark, say matrix multiply, this may not show
|> up as close to a problem, and the TLB transistors hardly get warmed up.

jgk: is the above an assertion that this is all that CPU designers do?
If so, you are essentially saying that CPU designers are total fools...
You are also implying that you have good access to the internal studies,
not just the ones that get published.

|> But it's a different story running lots of programs on a big machine
|> with tons of RAM and big, complicated software and lots of users.

There were Power Challenges in 1995 with 16GB of RAM,
there are many 1-rack Origins with 8-32GB; there is at least one
multi-rack Origin complex with 112GB of RAM.
The bigger machines use 4MB cache per CPU.
I assume jgk is talking about his experience with larger machines than these,
running more complicated software.

Oh, maybe not:

|> My Pentium's L2 cache is 512 KB.  A rule of thumb is that one `object'

|> But wait, say this is a current high-end machine, not my Pentium, then
|> it has an L3 cache of 8 MB.  Even with completely optimal locality, with
|> the cache taking whole pages from memory, that is still a lot of pages.
|> So even in the optimal case, accessing L3 without misses is right out.
|> It looks like you could be TLB missing pretty frequently, when the data
|> is in the L3 cache.  The big question, of course, is how frequently.
|> That is what we want to know.  That is what decides it.

Yes, and that is exactly the kind of thing that professional CPU designers
study, although they only occasionally publish the results.
Unfortunately, when the answer is "20% of the time is in TLBmisses",
the solution may not be obvious: does more associativity help?
does a bigger TLB help? how much bigger? do bigger pages help? how much?
How much does it cost to reduce TLB overhead by 10% across a specified
workload?  How much performance is lost by taking that die space from
other functions?

One more time: human intuition is pretty bad; questions need precise
formulations to even have a chance to be answered.  CPU designers spend a lot
of time acquiring the best data they can, and still have to make tradeoffs.
Measurement is crucial.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 13 Nov 1997 22:29:50 GMT
Keywords: locality

In article <64eqkc$de6$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:

|> |> Can you say more why they believed that the TLB was the issue?
|> |> Do the profiling tools there offer such statistics?
|>
|> I manage a system where I can measure TLB misses on a PA-RISC
|> architecture, but very few hardware architectures or vendors'
|> systems provide any decent (or indeed ANY) tools for detailed
|> performance investigations.  This means that the poor sods who

|> |> to make tradeoffs.  Measurement is crucial.
|>
|> Right.  Now please explain why most modern systems make it impossible
|> for application designers and tuners to measure any of the critical
|> performance factors.

I can't help what other vendors do or don't do.

SGI systems supply elaborate facilities for doing such performance
analysis, especially in R10K-based machines,
which added a pair of hardware counters,
and a whole set of tools for controlling them, analyzing them,
handling multi-thread and multiprocessor cases,
running on individual programs, setting up to analyze global system
behavior, etc.  A good place to start would be perfex(1) and ecstats(1),
or prof(1) on any of the machines.

For instance, with no particular preparation:

        perfex -e 0 -e 23 command args

runs command, and then prints the counts for events 0 & 23,
which happen to be cycles and TLB misses.  Countable
events (although not all combinations allowed in one run) include:
          0 = Cycles
          1 = Issued instructions
          2 = Issued loads
          3 = Issued stores
          4 = Issued store conditionals
          5 = Failed store conditionals
          6 = Decoded branches
          7 = Quadwords written back from scache
          8 = Correctable scache data array ECC errors
          9 = Primary instruction cache misses
          10 = Secondary instruction cache misses
          11 = Instruction misprediction from scache way prediction table
          12 = External interventions
          13 = External invalidations
          14 = Virtual coherency conditions
          15 = Graduated instructions
          16 = Cycles
          17 = Graduated instructions
          18 = Graduated loads
          19 = Graduated stores
          20 = Graduated store conditionals
          21 = Graduated floating point instructions
          22 = Quadwords written back from primary data cache
          23 = TLB misses
          24 = Mispredicted branches
          25 = Primary data cache misses
          26 = Secondary data cache misses
          27 = Data misprediction from scache way prediction table
          28 = External intervention hits in scache
          29 = External invalidation hits in scache
          30 = Store/prefetch exclusive to clean block in scache
          31 = Store/prefetch exclusive to shared block in scache


There are numerous flags for various combinations ... but this is not
too much harder to use than time(1) ...  to get details at basic-block
level, one uses speedshop(1).
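
As a back-of-the-envelope use of those two counts (the per-miss handler
cost and the sample counts below are assumptions, not R10000
measurements):

        #include <stdio.h>

        #define HANDLER_CYCLES 25.0     /* assumed average utlbmiss cost */

        int main(void)
        {
                double cycles  = 1.2e9; /* event 0 from a sample perfex run */
                double tlbmiss = 3.0e6; /* event 23 from the same run */

                printf("TLB overhead ~ %.2f%% of cycles\n",
                       100.0 * tlbmiss * HANDLER_CYCLES / cycles);
                return 0;
        }

which prints "TLB overhead ~ 6.25% of cycles" for these sample numbers.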

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 15 Nov 1997 18:34:08 GMT

In article <64i61p$crv$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:

|> Yes, mainly.  But there has been some confusion over what people meant
|> by cache misses.  TLB misses are typically a LOT more expensive than
|> cache misses that are filled from later caches, and can very easily
|> have indirect costs that take them beyond even memory references.
|> And this may get worse in the future, on at least some architectures.

(a) Cache misses are cache misses.  Some TLB designs can also generate
cache misses, although the impact of I-cache misses inside TLBmiss handlers
is minimal: if there are a lot of misses, the code lives in the I-cache.
(b) I'm not sure what "TLB misses are typically a LOT more expensive...
than later caches" means.  One more time: those caches misses that
cannot be prefetched (either explicitly or automatically) are expensive,
and getting worse, relative to cycle time.  If a TLB miss generates
multiple of such cache misses, then it is indeed costly, but TLB misses
that go through the cache, and hit there, are not very expensive,
and it is very hard to make generalizations without having the numbers handy.


(c) There indeed can be indirect costs.


|> For example, what happens if you have multiple threads, coprocessors
|> etc.?  The TLB miss handler typically has to get into a state where it

(a) Well, both PA-RISC & MIPS have had FP coprocessors, separate in
the early days, and have managed OK.
(b) Multiple threads: as has been discussed here, multiple threads need
multiple resources, and it's one of the issues in studies of
multi-threaded CPUs: it's not enough to duplicate the PC and user
registers...  and this is true whether or not TLB refills are done by
hardware or software.

|> is addressing real memory and interrupts are disabled (the PA-RISC
|> architecture document describes this as a fast context switch), and
|> restore user state on return.  Mere cache misses don't necessarily
|> do anything more than block one thread.

This is what PA-RISC does. Others do other things, at least with
regard to addressing real memory: the very first R2000 used virtual
addressing for its PTEs.

|> So what happens to instructions that are in the process of being
|> executed?  They can be suspended, nullified or completed before the
|> handler is invoked, or (heaven help us!) carried on in parallel if
|> they don't affect any registers the TLB miss handler needs.  The
|> last is clearly the fastest solution, but ....

In o-o-o chips that do this, of which several exist:
- logically-earlier instructions are usually completed
- logically-later instructions usually must be nullified
It is a fair complaint that the first group need to be done before
entering the miss handler.  Whether this matters much or not depends on the
statistics, not on human intuition, which is especially bad regarding
what's happening inside such chips.


|> Incidentally, I vaguely remember an interesting security report on
|> one of the very early multi-threaded systems (perhaps a 360/91).

Must have been something else: I think Nick is confusing
"out-of-order speculative" and "multithreaded", which do *not*
mean the same thing. 360/91s were not multi-threaded machines, at least if
the term is used the way most people use the term.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 17 Nov 1997 06:51:05 GMT

In article <slrn66ta2f.tue.yodaiken@chelm.cs.nmt.edu>,
yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:

|> On 15 Nov 1997 18:34:08 GMT, John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >In article <64i61p$crv$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
|> >(a) Cache misses are cache misses.  Some TLB designs can also generate
|> >cache misses, although the impact of I-cache misses inside TLBmiss handlers
|> >is minimal: if there are a lot of misses, the code lives in the I-cache.
|>
|> I don't know if this is a good assumption. Suppose we have a cluster
|> of TLB misses, a big working set that fills i-cache, and a process
|> switch.  If the TLB miss handler needs an i-cache fill every
|> 0.5 milliseconds, won't that add a noticeable drag to performance?

For any system design, it is possible to generate awkward worst cases ...
and real-time system designers worry about them a lot, and for most others,
people try to optimize overall throughput and do the best they can to
round off the bad cases.

(These numbers aren't exactly right, but they're close enough):
Assume a 200MHz R10000 (5ns cycle).
Assume 50ns = 10 clocks I-cache miss from L1 to L2.
Assume a 500ns cache miss from L1 to L2 to main memory [some better, some
worse], i.e., 100 clocks.

L1: I & D caches each 16KB, 2-set-associative, 128B cache lines
        (i.e., 128 lines of 128 bytes each).
        The most common UTLBMISS handler is less than one cache line.
L2: 1MB, 2MB, or 4MB, also 2-set-associative.


.5 millisecs = 500 microsecs, so if TLBMISS needs an I-cache fill,
all the way from DRAM, and this takes .5 microseconds, then we have the
"noticeable drag" of .5/500 = 1/1000 ... but of course, with a substantial
2-set-associative L2 cache, *most* of the time it's more like
.05/500 = 1/10000, IF that is the rate of misses (which it may or may not be).

If I-caches don't "work", there are worse problems than TLBmiss handlers :-)

One more time: computer architecture needs numbers.  Some of the numbers
may be hard to get, but back-of-the-envelope analyses at least give some
ideas.  Hopefully, the frequent characterizations of systems as
"blazing fast" are not rotting people's analytical capacities as badly as
I fear :-)

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 17 Nov 1997 19:41:29 GMT

In article <slrn671p2q.6or.yodaiken@chelm.cs.nmt.edu>,
yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:

|> On 17 Nov 1997 06:51:05 GMT, John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >In article <slrn66ta2f.tue.yodaiken@chelm.cs.nmt.edu>, yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:
|> >|> I don't know if this is a good assumption. Suppose we have a cluster
|> >|> of TLB misses, a big working set that fills i-cache, and a process
|> >|> switch.  If the TLB miss handler needs an i-cache fill every
|> >|> 0.5 milliseconds, won't that add a noticeable drag to performance?
|> >
|> >For any system design, it is possible to generate awkward worst cases ...
|> >and real-time system designers worry about them a lot, and for most others,

|> And as you noted above, for my real-time system, 1/2 microsecond
|> is significant. If we pessimize your calculations a little it gets worse.

I don't know what axe Victor is trying to grind, or why, but it would
be nice to stop confusing people in this newsgroup with inconsistent nonsense.
This discussion is equivalent to saying: "If F-16s were Chevys, their
speed in Reverse would be only 3.2 MPH."  But jet planes aren't cars.

1) "Suppose we have a cluster of TLB misses, a big working set that
fills i-cache, and a process switch..."

2) "my real-time system, 1/2 microsecond is significant."

3) I'm trying to understand what system has *both* these characteristics;
I must admit I've never seen them, so I have a suspicion that Victor
has changed the field of discussion.
        a) Designers of embedded real-time systems hate surprises and
        irregularity, and do whatever they must to avoid them.
        They lock code into I-caches, play games with D-caches, and use
        TLBs far differently than the general-purpose systems that were
        being discussed. They try pretty hard not to have TLB misses at
        all once they get initialized, and for cost reasons, they tend to
        use lower-end chips.

        b) People who do real-time systems that are not embedded
        (and actually, SGI systems are heavily used for this), try to do the
        same thing, although it is far easier to provide tighter guarantees
        of interrupt response on low-end embedded systems.  IRIX allows
        dedication of CPU resources, lockdown, etc, etc, for these
        purposes, but it's hard to beat an R5000 or R4300i, with code
        pre-loaded into I-cache, TLB's preloaded, and explicitly swapped on
        context-switch, etc.
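
On MIPS-style parts, that TLB preloading amounts to filling the wired
(never-replaced) slots at initialization; a sketch, with the CP0
accessor names invented:

        extern void write_tlb_index(int slot);
        extern void write_entryhi(unsigned vpn_asid);
        extern void write_entrylo(unsigned pfn_flags);
        extern void tlb_write_indexed(void);    /* tlbwi */
        extern void set_wired(int n);           /* R4000-style Wired reg */

        /* Pin n translations into the low TLB slots so steady-state
           execution takes no TLB misses at all. */
        void preload_wired_tlb(const unsigned hi[], const unsigned lo[], int n)
        {
                for (int i = 0; i < n; i++) {
                        write_tlb_index(i);
                        write_entryhi(hi[i]);
                        write_entrylo(lo[i]);
                        tlb_write_indexed();
                }
                set_wired(n);   /* random replacement skips slots < n */
        }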


|> Suppose the tlb miss handler falls across a cache line --- and it is
|> not obvious  that padding the handlers to 128b boundaries is a win.
|> This is inconsistent:

IF his argument is that an I-cache miss for a TLBMISS is a disaster, then
halving the number of such cannot help but be good.


|> Now suppose that the memory/bus is PC quality and a cache line fill
|> takes 4 microseconds, say two lines get filled in 5 microseconds. If the
|> TLB handler is not in icache, is it likely that the pte and pointers
|> to the pte are in d-cache? If the handler was in i-cache, the loads
|> for the pte could start while the pipeline is emptying --- essentially
|> for free. But we now need to load the icache first, begin execution of
|> the handler, then reach for the ptes. So add another 4 microseconds
|> for the needed loads. We now have 9 microseconds delay. If this happens
|> 1 time every 500 microseconds, we have 491 microseconds of 5ns/instruction
|> plus 9 microseconds for 32 instructions, giving us 5.09ns/instruction, an
|> almost 2% slowdown. Make the cache line fill a little slower or
|> improve things otherwise and it looks worse. And that's the average
|> case.

IF jet planes were cars...


|> have no idea if it would correspond to anything in reality.

Yes.

|> >If I-caches don't "work", there are worse problems than TLBmiss handlers :-)
|>
|> But there may be cases, interesting cases, where i-cache works great
|> on average but causes trainwrecks regularly.

This is not new news.

|> >One more time: computer architecture needs numbers.  Some of the numbers
|> >may be hard to get, but back-of-the-envelope analyses at least give some
|> >ideas.  Hopefully, the frequent characterization of systems as
|> >"blazing fast" is not rotting peoples' analytical capacities as badly as
|> >I fear :-)
|>
|> Good sentiment, but basing your analysis on quantitative methods makes
|> it tempting to focus on the easily quantified case and that is the
|> average case behavior of programs that can already run on  average hardware.
|> If you did that, all systems might converge on the same general performance
|> and the company with the largest advertising and development budget would
|> win out.

Do you understand that you've just categorized as idiots people who:
        - Design chips for real-time, including real-time-tuned variants
                of general chips
        - Design real-time hardware
        - Design real-time applications for a living

I've worked with many such people over the years (and of course, I'm
the person at MIPS who insisted on certain features in the R2000 to
allow it to be useful outside UNIX environments, including embedded).
Maybe there are ones dumb enough to only be interested in average
characteristics, but I never met any such.
They love things like cache lock down, worst-case interrupt
response, TLB lockdowns [they love TLB wired entries, for instance;
they love set-associative caches with lockable banks, etc].

However, the "context" for this discussion didnt' start as real-time
systems; if you want to start a separate thread on "real-time systems
versus complicated CPUs with caches", that's a worthy thread in its own
right, but it doesn't have much to do with PTE & TLB management in a general-
purpose system like HP/UX or IRIX or DEC UNIX or VMS or NT or SOLARIS.



--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 18 Nov 1997 03:37:23 GMT

In article <slrn6729k2.7jp.yodaiken@chelm.cs.nmt.edu>,
yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:

|> >|> Good sentiment, but basing your analysis on quantitative methods makes
|> >|> it tempting to focus on the easily quantified case and that is the
|> >|> average case behavior of programs that can already run on  average hardware.
|> >|> If you did that, all systems might converge on the same general performance
|> >|> and the company with the largest advertising and development budget would
|> >|> win out.
|> >
|> >Do you understand that you've just categorized as idiots people who:
|>
|> I did not categorize as, call, insinuate, or  in anyway imply
|> that  anyone is  an idiot. Please don't invent opinions for me.

Sorry, I did not mean that Victor was insulting people on purpose.
However, if I said anything like this to people who do production R/T for
a living, i.e., like controllers in fighter planes, or cars, they would
take it as an insult to their intelligence and get angry.  This is *not*
speculation, I've seen it happen (not because I did it).  Likewise, chip
vendors have produced a wide variety of chip variants with carefully
crafted features aimed exactly at real-time response, and didn't do it by
accident, or because they were only thinking of average properties.  Some
of us (including me) have spent numerous hours working through worst-case
latency examples for customers, strangely enough for the exact
combination of real-time response plus floating-point described above.

|> one wants both real-time and a general purpose OS on a commodity PC.
|> Even without these systems, 1Ghz ethernet, plus video, plus
|> a simulation is not an unlikely combination in the next couple of
|> years. In this job mix an extra couple of microseconds here and there
|> can make a serious difference and big working sets are going to mix
|> with small ones.

1) PCs use hardware-refilled TLBs, so they do not have the
issue being theorized about (see the sketch at the end of this post).
The simplest thing is to get a faster PC, but of course, that may not
do the job either.

2) It is perfectly reasonable to *want* to mix large programs and
real-time on one CPU ... it is just very difficult to make the
response *guarantee* get very close to the *typical* response,
especially if you demand an environment where the non-realtime
tasks are permitted to consume 100% of the resources.

3) If you need a real-time response of 10 microsecs,
and you have done everything to guarantee 11, and it takes 1 for a TLBmiss,
then it matters.  If you need a real-time response of 10,
and the actual response varies between 5 and 200, then the 1 is not the
problem...

4) So, to get back to reality, how about posting some numbers about the
RTLinux example mentioned, and anything known about where the time *is*
going in the 10 microsec interval (because it certainly isn't going into
software TLBmisses).
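
To make point 1 above concrete, here is the work a refill does either
way: on x86 a hardware walker does it invisibly; on a MIPS-style CPU a
short trap handler does the same few loads.  This is only a toy C model
(the names, the two-level 4KB layout, and the PTE format are
assumptions, not any real kernel's code):

        #include <stdint.h>

        #define PAGE_SHIFT 12                /* assumed 4KB pages */
        #define PTE_VALID  0x1u

        typedef uint32_t pte_t;
        static pte_t *page_dir[1024];        /* level 1: 1024 level-2 tables */

        /* The two-level walk an x86 does in hardware; a MIPS-style
           UTLBMISS handler does it in a handful of instructions. */
        static int tlb_refill(uint32_t vaddr, pte_t *out)
        {
                uint32_t vpn = vaddr >> PAGE_SHIFT;
                pte_t *l2 = page_dir[vpn >> 10];    /* top 10 bits of VPN */
                if (l2 == 0)
                        return -1;                  /* take a real fault */
                pte_t pte = l2[vpn & 0x3ffu];       /* low 10 bits of VPN */
                if (!(pte & PTE_VALID))
                        return -1;
                *out = pte;                         /* software case: tlbwr here */
                return 0;
        }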


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 18 Nov 1997 23:40:49 GMT

In article <slrn674hj8.cbq.yodaiken@chelm.cs.nmt.edu>,
yodaiken@chelm.cs.nmt.edu (Victor Yodaiken) writes:

|> Even in the x86 world Cyrix has lockable cache lines and  Linux and
|> OS's like L4 go  to some lengths to avoid TLB reloads despite the hardware.
|> We found that the 603PPC with software controlled TLBs could match
|> performance of faster 604PPCs with bigger L1 cache. I think that
|> we could get a major performance increase on x86s with more control
|> of TLB loading.

Fewer cycles are always good; as usual, the question is where all the time
is actually going ... and even worse, where it *might* go in various
weird cases (having worked with people who thought airplanes would crash
if events were missed...)

|> Obviously you can't give away more than 100% of resources,
|> but most PCs are idle most of the time. Certainly

Since we're talking about R/T (of guaranteed latency flavor),
I agree with your earlier comment that averages don't mean much.

|> a 300mhz PII or 700mhz Samsung Alpha should have the power to
|> run unix and do data acquisition during pauses between keystrokes.

|> >3) If you need a real-time response of 10 microsecs,
|> >and you have done everything to guarantee 11, and it takes 1 for a TLBmiss,
|> >then it matters.  If you need a real-time response of 10,
|> >and the actual response varies between 5 and 200, then the 1 is not the
|> >problem...

|> It is possible to  get a 486/33 ISA based PC to run UNIX and still offer
|> 30us guaranteed response time.

Do you have the *typical* (i.e., the mode) response time for that same
system?  It's always interesting to know how close the worst-case is to
the typical.

|> The real barrier right now  seems to be in access to off chip i/o
|> and irq controllers during periods of heavy  i/o. I don't have access
|> to a good enough bus analyzer, but I'd bet that the board chip set
|> holds up the processor when the PCI bus is too busy.

Yes, and it may be even worse than this...

|> The sequence:
|> read global clock % an io action itself
|> write i/o port
|> read global clock
|> compute difference in clocks
|>
|> can report 20us but is usually between 0 and 2us for a slow port.
|> The 20us is the problem.

Since I don't know what your systems look like, here are some interesting
questions:
        (1) In this particular configuration, what is the longest
        transfer size on the bus for any device?
        (2) How are DRAM refreshes done, and how long do they take?
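
The quoted measurement sequence is also easy to run in bulk, keeping the
maximum rather than the average; a runnable sketch (the monotonic clock
and the volatile dummy store are stand-ins for a board-level global
clock and a real port write such as outb()):

        #include <stdio.h>
        #include <stdint.h>
        #include <time.h>

        static volatile uint8_t fake_port;   /* stand-in for a slow I/O port */

        static uint64_t read_clock_ns(void)  /* stand-in for the global clock */
        {
                struct timespec ts;
                clock_gettime(CLOCK_MONOTONIC, &ts);
                return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
        }

        int main(void)
        {
                uint64_t worst = 0;
                for (int i = 0; i < 1000000; i++) {
                        uint64_t t0 = read_clock_ns();
                        fake_port = 0;       /* real code: outb(0, port) */
                        uint64_t t1 = read_clock_ns();
                        if (t1 - t0 > worst)
                                worst = t1 - t0;
                }
                /* The rare 20us outlier, not the 0-2us mode, is the problem. */
                printf("worst case: %llu ns\n", (unsigned long long)worst);
                return 0;
        }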


|> Which brings us back to a process switch that would require a tlb fetch
|> and instruction reload from main memory -- not  anywhere near the critical
|> problem right now, but not completely dismissable either.

OK, now we agree :-) as always, the issue is understanding the worst
conceivable case, which is likely to be something like:
        (a) Run a big user program that consumes the cache & TLB.
        (b) An external interrupt occurs (and this may or may not be
                an issue on a particular chip, but there is usually some
                worst-case instruction that has to be completed, especially
                in an in-order-issue, out-of-order-completion design;
                this would often be something like double sqrt, or in the
                earliest R2000s even integer divide, whose latency you picked
                up when you did a move-from to save the HI and LO registers).
        (c) Most CPUs get nonpreemptable for a while.
Guaranteed response required to NEW interrupt here
        Worst-case cache & TLB misses
        (d) Continues until preemptible again.
        (e) Preempts to handle guaranteed interrupt.
        (f) Guaranteed interrupt starts execution, but:
Just before it gets PCI bus, disk DMA commences, and ties up the bus for
some length of time, which can be difficult to figure out, and then,
the code runs into memory refresh cycle(s).
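
One way to keep such a scenario honest is to write the budget down and
add up worst cases, never averages; a back-of-the-envelope sketch, in
which every figure is a made-up placeholder rather than a measurement:

        #include <stdio.h>

        /* Hypothetical worst-case contributions, in microseconds,
           keyed to the steps above. */
        int main(void)
        {
                double worst_instr  = 3.0;  /* (b) e.g. double sqrt completing */
                double nonpreempt   = 5.0;  /* (c)-(d) nonpreemptible window   */
                double cache_tlb    = 4.0;  /* worst-case cache + TLB misses   */
                double bus_dma_hold = 9.0;  /* (f) PCI bus held by disk DMA    */
                double dram_refresh = 1.0;  /* (f) memory refresh cycle(s)     */

                printf("worst-case response: %.1f us\n",
                       worst_instr + nonpreempt + cache_tlb
                       + bus_dma_hold + dram_refresh);
                return 0;
        }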



--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 20 Nov 1997 02:11:22 GMT

In article <34919c56.183150240@philos.philosys.de>,
Emil.Naepflein@philosys.de (Emil Naepflein) writes:

|> The real question is:
|> Isn't it possible to use a hardware solution or other solution for
|> such a problem?
|> Is it worth it to invest all the money into tuning the software on
|> limited hardware, instead of building a special solution?

Yes, we agree: the point is that there are *lots* of issues to
worry about, the closer you want guarantees to approach the typical.


|> > Just before it gets PCI bus, disk DMA commences, and ties up the bus for
|> > some length of time, which can be difficult to figure out, and then,
|> > the code runs into memory refresh cycle(s).
|>
|> Why not pinning down the corresponding TLB entry?
|>
|> Use a MIPS processor and you will have absolutely no problem with
|> the TLB for realtime applications. ;-)

Yes, we agree! These applications are part of the reason why:
        (a) MIPS memory-mapping has always included some unmapped
        space that requires zero TLB entries.
        (b) There are wired entries that don't get replaced by normal
        refill sequences.
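
Point (a) is the familiar kseg0/kseg1 arithmetic: a 32-bit MIPS kernel
reaches physical memory through fixed unmapped windows, so code like the
sketch below never consumes a TLB entry (the macros mirror the standard
layout, but treat this as an illustration, not any particular kernel's
headers):

        #include <stdint.h>

        /* Standard 32-bit MIPS layout: kseg0/kseg1 are translated by
           fixed address arithmetic, never by the TLB. */
        #define KSEG0_BASE 0x80000000u   /* cached, unmapped   */
        #define KSEG1_BASE 0xA0000000u   /* uncached, unmapped */

        /* Reach a physical address (first 512MB) with zero TLB entries. */
        static inline void *phys_to_kseg0(uint32_t pa)
        {
                return (void *)(uintptr_t)(KSEG0_BASE + pa);
        }

        static inline void *phys_to_kseg1(uint32_t pa)  /* device registers */
        {
                return (void *)(uintptr_t)(KSEG1_BASE + pa);
        }

Point (b) is the wired-entry mechanism: wired slots sit below the
random-replacement range, so a normal refill can never evict them.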

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Cache and context switches
Date: 20 Nov 1997 02:16:16 GMT

In article <34939fa4.183996343@philos.philosys.de>,
Emil.Naepflein@philosys.de (Emil Naepflein) writes:

|> If the performance of the database application is critical the problem
|> can be solved by using more than 1 GB memory and big pages. Putting
|> more memory into systems to avoid paging is a standard procedure for
|> mission critical performance tuning. By combining this with
|> intelligent use of big pages the TLB problem just disappears.
|> If we look for a cost-effective way to solve the problem it's probably
|> much cheaper to attack the problem in the processor by adding hundreds
|> or thousands of TLB entries. This seems to me much cheaper than
|> buying GBs of memory.

Unfortunately, it is not always cost-effective to add TLB entries:
it really depends on tradeoffs, and (sometimes) on impact on cycle times;
people already use 2-level TLBs.
One of the reasons most vendors have started to include multiple page
sizes as a general feature (i.e., not just a handful for special cases)
is:  if you handle, for example, 4KB -> 16MB, 16MB pages can map
4096X as much space as 4KB ones, whereas it is fairly difficult to wrestle
the chip folks into 4000X bigger TLBs :-)
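
The arithmetic behind that: TLB reach is just entries times page size,
so the ratio between reaches is the ratio between page sizes.  A minimal
check, assuming a hypothetical 64-entry TLB:

        #include <stdio.h>

        int main(void)
        {
                unsigned long entries = 64;            /* hypothetical TLB */
                unsigned long small   = 4UL << 10;     /* 4KB page         */
                unsigned long big     = 16UL << 20;    /* 16MB page        */

                printf("reach with 4KB pages : %lu KB\n",
                       entries * small / 1024);              /* 256 KB  */
                printf("reach with 16MB pages: %lu MB\n",
                       entries * big / (1024 * 1024));       /* 1024 MB */
                printf("ratio: %luX\n", big / small);        /* 4096X   */
                return 0;
        }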
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: page sizes (was: Re: Cache and context switches)
Date: 20 Nov 1997 20:05:03 GMT

In article <01bcf5bf$d3512380$ed0abf81@peglarr.network.com>, "Network
Systems Corporation" <robp@network.com> writes:

|> Organization: Network Systems Corporation
|>
|> Another episode of  "there's nothing new under the sun..."

|> > One of the reasons most vendors have started to include multiple page
|> > sizes as a general feature (i.e., not just a handful for special cases)
|> > is:  if you handle, for example, 4KB -> 16MB, 16MB pages can map
|> > 4096X as much space as 4KB ones, whereas it is fairly difficult to wrestle
|> > the chip folks into 4000X bigger TLBs :-)
|>
|> John is right (of course) but the '...started to include...' part is
|> interesting.  Multiple page sizes, including relatively large-sized
|> pages, have been around for nearly 3 decades.  The CDC Star-100 was
|> the first machine (AFAIK) that

Yes; hardly anything is new; we still haven't caught up with
everything the 360/91s did, for instance.
Maybe I should have said more:
"started to include" means: more than one vendor is doing it,
it's not an odd case, and it appears in products that are reasonably
widely available.

This is along the lines of:
        every mistake happens at least 3 times:
        first the mainframe folks
        then the minicomputer folks
        and then micro folks, at least once


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From:  John Mashey <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Programmable TLB management?
Date: Wed, 27 Jun 2007 23:27:34 -0700
Message-ID: <1183012054.119993.3680@i38g2000prf.googlegroups.com>

On Jun 25, 6:54 am, vandy...@gmail.com wrote:

> My experience with this was on a MIPS processor (still, IMHO, one of
> the best architected CPUs I've ever had the pleasure to port an OS
> to).  While in theory this flexibility is wonderful, in practice a two-
> level PTE tree with 4K pages works so well, for such a broad range of
> usage patterns, that all you end up doing is trying to code up two-
> level PTE walks as efficiently as possible.

Thanks for the kind words, but note that it depends on what you're
doing:

a) BSD UNIX was first ported to the original MIPS R2000.

b) A few months later, SYSTEM V was ported, and for reasons I can't
recall (it being 20 years later), it had a slightly different UTLBMISS
code.  This actually exposed a hardware bug that BSD had never
encountered, but since it was software, we just tweaked the code to
work around it until it got fixed.

This bug, of course, confirmed the argument of the OS people who'd
fought with bugs for years in complex hardware MMUs, and had begged
the chip designers for the most minimalist MMU we could get ... and
even that had a bug.  It was unsurprising that early 1980s micros
often had MMU bugs, and OS programmers hated them.

c) Some people doing embedded systems have managed not to have PTEs in
memory at all, or at least none for some parts of their address
space.  I.e., if there are user tasks, but they can be direct-mapped
to some regions of memory, it may be just as easy for the UTLBMISS
code to simply compute TLB entries, rather than loading PTEs from
memory (a toy sketch of this appears at the end of this post).

d) When the R4000 introduced the wide range of page sizes, and then as
64-/32-bit systems came into use, some systems had a group of different
UTLBMISS routines, selected on a per-process basis.  This was
especially true of the big SGI ccNUMA systems, which supported
multiple page sizes and all ran mixtures of 64- and 32-bit processes.

e) Hardware assists are especially useful if the primary use of a CPU
is for software systems owned by the chip vendor, but if a chip has to
be useful to a wide range of OSes, it gets harder to get that right,
and software MMU handling tends to make for quick ports, which was
important, at least in the 1980s.

f) I've long ago pointed out that in cache-coherent multiprocessors,
especially ccNUMAs, any MMU table-walk hardware must be very careful
not to cause coherency problems, especially if it goes around the cache
at some point.  Software TLB handlers just use the existing coherency
hardware.  Hardware TLB handlers can also, but there were bugs in 1980s
micros where this was a problem in cache-coherent SMPs.

g) TLBmisses *CAN* add noticeable performance overhead for some kinds
of codes, in some kinds of CPUs, but one always has to measure and
analyze the real overhead, not do it from intuition.  In many designs,
no matter what you do, a PTE that is a cache miss all the way to main
memory is the main cost anyway.  Both MIPS and PA-RISC managed to get
good performance for many years with (mostly) software managed TLBs.

h) If I could go back and do things differently in 1985, knowing
everything we do now, I'd probably tweak a few minor details in the
original 32-bit version to make selectable UTLBMISS routines more
efficient, to do a little more for kernel misses, and to make
transition to 64/32-bit a little cleaner...  I might allow for a
mechanism to support optional hardware assist, although the first
version wouldn't have had that.

But, I still would have kept a similar software-reloaded TLB.
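
As an illustration of (c) above: if a task's pages sit at a fixed offset
in physical memory, the refill can synthesize the entry from the
faulting address by pure arithmetic, with no PTE load at all.  A toy
sketch (the base address, names, and entry format are assumptions, not
real MIPS encodings):

        #include <stdint.h>

        #define PAGE_SHIFT 12
        #define PHYS_BASE  0x01000000u  /* hypothetical fixed task region */

        /* Compute a TLB entry for a direct-mapped task: nothing is
           loaded from memory. */
        static uint32_t compute_tlb_entry(uint32_t bad_vaddr)
        {
                uint32_t vpn = bad_vaddr >> PAGE_SHIFT;
                uint32_t pfn = (PHYS_BASE >> PAGE_SHIFT) + vpn;
                return (pfn << PAGE_SHIFT) | 0x1u;  /* PFN plus a "valid" bit */
        }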



From:  John Mashey <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Programmable TLB management?
Date: Sun, 08 Jul 2007 23:08:19 -0700
Message-ID: <1183961299.397369.250160@g37g2000prf.googlegroups.com>

On Jun 28, 4:26 am, Anne & Lynn Wheeler <l...@garlic.com> wrote:

> 360/67 had a bug in the associative array (i.e. used by 360/67 for TLB)
> that charlie (aka compare&swap, CAS, charlie) found circa 1970 (nearly
> 40yrs ago).
....
> then there
> was some chance the virtual address space execution and/or the system
> might have some explained anomalous behavior. Kernel software work
> around was to make sure that LCTL was always done.

Ahh, my 2nd S/360 machine... needless to say, there was no implication
that the 1980s micros were the first to have weird MMU bugs.  They
merely followed the traditions from mainframes and minis.

I think "explained" above was supposed to be "unexplained".

Systems programmers *loathe* MMU (& related exception-handling)
hardware bugs because they
- often create effects noticeable only long after the damage has been
done [i.e., like causing a few words to be overwritten in some page
somewhere],
- may have timing dependencies (i.e., related to clock interrupts),
- may be hard to replicate, as when inserting debug code eliminates
the bad behavior.

They often seem to incorporate an unholy mixture of Murphy's Law and a
Heisenberg Uncertainty Principle analog.  The main hope is that once
one understands the problem, there is a simple fix, and that the
bug is found early enough that you haven't already shipped a bunch of
them.



Index Home About Blog

