CS162 Operating Systems and Systems Programming Lecture 13

Address Translation (Finished), Caching

# October 12<sup>th</sup>, 2015 Prof. John Kubiatowicz http://cs162.eecs.Berkeley.edu

Acknowledgments: Lecture slides are from the Operating Systems course taught by John Kubiatowicz at Berkeley, with few minor updates/changes. When slides are obtained from other sources, a a reference will be noted on the bottom of that slide, in which case a full list of references is provided on the last slide.

# Recall: Simple Segmentation (16 bit addresses)



10/12/15

# Recall: Paging



- Page Table (One per process)
  - Resides in physical memory
  - Contains physical page and permission for each virtual page » Permissions include: Valid bits, Read, Write, etc
- Virtual address mapping
  - Offset from Virtual address copied to Physical Address
     » Example: 10 bit offset ⇒ 1024-byte pages
  - Virtual page # is all remaining bits
     » Example for 32-bits: 32-10 = 22 bits, i.e. 4 million entries
     » Physical page # copied from table into physical address
     Check Page Table bounds and permissions

10/12/15

# Recall: Simple Page Table Discussion

- What needs to be switched on a context switch?
  - Page table pointer and limit
- Analysis
  - Pros
    - » Simple memory allocation
    - » Easy to Share
  - Con: What if address space is sparse?
    - » E.g. on UNIX, code starts at 0, stack starts at (2<sup>31</sup>-1).
       » With 1K pages, need 2 million page table entries!
  - Con: What if table really big?
    - » Not all pages used all the time ⇒ would be nice to have working set of page table in memory
- How about combining paging and segmentation?
  - Segments with pages inside them?
  - Need some sort of multi-level translation

# Memory Layout for Linux 32-bit



http://static.duartes.org/img/blogPosts/linuxFlexibleAddressSpaceLayout.png

10/12/15



# What is in a Page Table Entry?

- What is in a Page Table Entry (or PTE)?
  - Pointer to next-level page table or to actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE:
  - Address same format previous slide (10, 10, 12-bit offset)
  - Intermediate page tables called "Directories"

| Page Frame Number      |                                                      | Free        | OL    | DA    | PC         |     |
|------------------------|------------------------------------------------------|-------------|-------|-------|------------|-----|
| (Physical Page Number) |                                                      | <u>(OS)</u> |       |       | <b>Ö</b> A |     |
| 31-12                  |                                                      | 11-9        | 8 7   | 65    | 43         | 210 |
| P:                     | Present (same as "valid" bit in other architectures) |             |       |       |            |     |
| <b>W</b> :             | Writeable                                            |             |       |       |            |     |
| U:                     | User accessible                                      |             |       |       |            |     |
| PWT:                   | Page write transparent: external cache write-through |             |       |       |            |     |
| PCD:                   | Page cache disabled (page cannot be cached)          |             |       |       |            |     |
| <b>A</b> :             | Accessed: page has been accessed recently            |             |       |       |            |     |
| D:                     | Dirty (PTE only): page has been modified recently    |             |       |       |            |     |
| L:                     | $L=1 \rightarrow 4MB$ page (directory only).         |             |       |       |            |     |
| Bottom                 | 22 bits of virtue                                    | al address  | serve | as of | fset       |     |

# Examples of how to use a PTE

• How do we use the PTE?

Invalid PTE can imply different things:
 » Region of address space is actually invalid or

- » Page/directory is just somewhere else than memory
- Validity checked first

» OS can use other (say) 31 bits for location info

• Usage Example: Demand Paging

- Keep only active pages in memory

- Place others on disk and mark their PTEs invalid
- Usage Example: Copy on Write
  - UNIX fork gives copy of parent address space to child
     » Address spaces disconnected after child created
  - How to do this cheaply?

» Make copy of parent's page tables (point at same memory) » Mark entries in both sets of page tables as read-only

- » Mark entries in both sets of page tables as read-only
- » Page fault on write creates two copies
- Usage Example: Zero Fill On Demand
  - New data pages must carry no information (say be zeroed)
  - Mark PTEs as invalid; page fault on use gets zeroed page
  - Often, OS creates zeroed pages in background

## Summary: Two-Level Paging



## Summary: Two-Level Paging



# Recall: Segments + Pages

- What about a tree of tables?
  - Lowest level page table > memory still allocated with bitmap
  - Higher levels often segmented
- Could have any number of levels. Example (top segment):



- What must be saved/restored on context switch?
  - Contents of top-level segment registers (for this example)

- Pointer to top-level table (page table) 10/12/15 Kubiatowicz C5162 ©UCB Fall 2015

# Recall Sharing (Complete Segment)



- · Pros:
  - Only need to allocate as many page table entries as we need for application

» In other wards, sparse address spaces are easy

- Easy memory allocation
- Easy Sharing

» Share at segment or page level

- Cons:
  - One pointer per page (typically 4K 16K pages today)
  - Page tables need to be contiguous
    - » However, previous example keeps tables to exactly one page in size
  - Two (or more, if >2 levels) lookups per reference

» Seems very expensive!

Making it real: X86 Memory model with segmentation (16/32-bit)



### X86\_64: Four-level page table!



10/12/15

# IA64: 64bit addresses: Six-level page table?!?

9 bits 12 bits 9 bits 9 bits 9 bits 7 bits 9 bits 64bit Virtual Virtual Virtual Virtual Virtual Virtual Address: Virtual Offset index P3 index P6 index P4 index P5 index

# No!

# Too slow Too many almost-empty tables

10/12/15

# **Inverted Page Table**

- With all previous examples ("Forward Page Tables")
  - Size of page table is at least as large as amount of virtual memory allocated to processes
  - Physical memory may be much less
    - » Much of process space may be out on disk or not in use



- Answer: use a hash table
  - Called an "Inverted Page Table"
  - Size is independent of virtual address space
  - Directly related to amount of physical memory
  - Very attractive option for 64-bit address spaces
- Cons: Complexity of managing hash changes

- Often in hardware! 10/12/15 Kubiatowi

## **IPT** address translation

Need an associative map from VM page to IPT address:



10/12/15

### **Summary: Inverted Table**



## **Address Translation Comparison**

|  |                                  | Advantages                                                         | Disadvantages                                                  |  |  |
|--|----------------------------------|--------------------------------------------------------------------|----------------------------------------------------------------|--|--|
|  | Simple<br>Segmentation           | Fast context<br>switching: Segment<br>mapping maintained<br>by CPU | External fragmentation                                         |  |  |
|  | Paging<br>(single-level<br>page) | No external fragmentation, fast easy allocation                    | Large table size ~ virtual<br>memory<br>Internal fragmentation |  |  |
|  | Paged segmentation               | Table size ~ # of pages in virtual                                 | Multiple memory references per page                            |  |  |
|  | Two-level<br>pages               | memory, fast easy<br>allocation                                    | access                                                         |  |  |
|  | Inverted Table                   | Table size ~ # of<br>pages in physical<br>memory                   | Hash function more complex                                     |  |  |

# How is the translation accomplished?

Physical

Addresses



- What, exactly happens inside MMU?
- One possibility: Hardware Tree Traversal
  - For each virtual address, takes page table base pointer and traverses the page table in hardware
  - Generates a "Page Fault" if it encounters invalid PTE
     » Fault handler will decide what to do
     » More on this next lecture
  - Pros: Relatively fast (but still many memory accesses!)
  - Cons: Inflexible, Complex hardware
- Another possibility: Software
  - Each traversal done in software
  - Pros: Very flexible
  - Cons: Every translation must invoke Fault!
- In fact, need way to cache translations for either case! 10/12/15 Kubiatowicz CS162 ©UCB Fall 2015

- Can a process modify its own translation tables?
   NO!
  - If it could, could get access to all of physical memory
  - Has to be restricted somehow
- Recall: To Assist with Protection, Hardware provides at least two modes (Dual-Mode Operation):
  - "Kernel" mode (or "supervisor" or "protected")
  - "User" mode (Normal program mode)
  - Mode set with bits in special control register only accessible in kernel-mode
- Certain operations restricted to Kernel mode:
  - Including modifying the page table (CR3 in ×86), and segment registers
  - Have to transition into Kernel mode before you can change them

# How to get from Kernel→User

- What does the kernel do to create a new user process?
  - Allocate and initialize address-space control block
  - Read program off disk and store in memory
  - Allocate and initialize translation table
     » Point at code in memory so program can execute
    - » Possibly point at statically initialized data
  - Run Program:
    - » Set machine registers
    - » Set hardware pointer to translation table
    - » Set processor status word for user mode
    - » Jump to start of program
- How does kernel switch between processes?
  - Same saving/restoring of registers as before
  - Save/restore PSL (hardware pointer to translation table)

# Recall: User→Kernel (System Call)

- Can't let inmate (user) get out of padded cell on own
   Would defeat purpose of protection!
  - So, how does the user program get back into kernel?



• System call: Voluntary procedure call into kernel

- Hardware for controlled User  $\rightarrow$  Kernel transition
- Can any kernel routine be called?
   » No! Only specific ones.
- System call ID encoded into system call instruction
   » Index forces well-defined interface with kernel

10/12/15

# User -> Kernel (Exceptions: Traps and Interrupts)

- A system call instruction causes a synchronous exception (or "trap")
  - In fact, often called a software "trap" instruction
- Other sources of Synchronous Exceptions ("Trap"):
  - Divide by zero, Illegal instruction, Bus error (bad address, e.g. unaligned access)
  - Segmentation Fault (address out of range)
  - Page Fault (for illusion of infinite-sized memory)
- Interrupts are Asynchronous Exceptions
  - Examples: timer, disk ready, network, etc....
  - Interrupts can be disabled, traps cannot!
- On system call, exception, or interrupt:
  - Hardware enters kernel mode with interrupts disabled
  - Saves PC, then jumps to appropriate handler in kernel
  - For some processors (x86), processor also saves registers, changes stack, etc.
- Actual handler typically saves registers, other CPU state, and switches to kernel stack

10/12/15

# Closing thought: Protection without Hardware

- Does protection require hardware support for translation and dual-mode behavior?
  - No: Normally use hardware, but anything you can do in hardware can also do in software (possibly expensive)
- Protection via Strong Typing
  - Restrict programming language so that you can't express program that would trash another program
  - Loader needs to make sure that program produced by valid compiler or all bets are off
  - Example languages: LISP, Ada, Modula-3 and Java
- Protection via software fault isolation:
  - Language independent approach: have compiler generate object code that provably can't step out of bounds
    - » Compiler puts in checks for every "dangerous" operation (loads, stores, etc). Again, need special loader.
    - » Alternative, compiler generates "proof" that code cannot do certain things (Proof Carrying Code)
  - Or: use virtual machine to guarantee safe behavior (loads and stores recompiled on fly to check bounds)

10/12/15

# **Caching Concept**



 Cache: a repository for copies that can be accessed more quickly than the original

- Make frequent case fast and infrequent case less dominant

- Caching underlies many of the techniques that are used today to make computers fast
  - Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc...
- Only good if:
  - Frequent case frequent enough and
  - Infrequent case not too expensive
- Important measure: Average Access time =

(Hit Rate x Hit Time) + (Miss Rate x Miss Time)



### Another Major Reason to Deal with Caching



- Cannot afford to translate on every access
  - At least three DRAM accesses per actual DRAM access
  - Or: perhaps I/O if page table partially on disk!
- Even worse: What if we are using caching to make memory access faster than DRAM access???
- Solution? Cache translations!

```
- Translation Cache: TLB ("Translation Lookaside Buffer")
10/12/15 Kubiatowicz CS162 ©UCB Fall 2015
```

Why Does Caching Help? Locality!



- Temporal Locality (Locality in Time):
  - Keep recently accessed data items closer to processor
- Spatial Locality (Locality in Space):
  - Move contiguous blocks to the upper levels



Memory Hierarchy of a Modern Computer System

- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at speed offered by the fastest technology



- Compulsory (cold start or process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: If you are going to run "billions" of instruction, Compulsory Misses are insignificant

Capacity:

- Cache cannot contain all blocks access by the program
- Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Coherence (Invalidation): other process (e.g., I/O) updates memory

10/12/15



- Index Used to Lookup Candidates in Cache
  - Index identifies the set
- Tag used to identify actual copy
  - If no candidates match, then declare cache miss
- Block is minimum quantum of caching
  - Data select field used to select data within block
  - Many caching applications don't have data select field

10/12/15

**Review: Direct Mapped Cache** 

- Direct Mapped 2<sup>N</sup> byte cache:
  - The uppermost (32 N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2<sup>M</sup>)
- Example: 1 KB Direct Mapped Cache with 32 B Blocks
  - Index chooses potential block
  - Tag checked to verify block
  - Byte select chooses byte within block



- Page Tables
  - Memory divided into fixed-sized chunks of memory
  - Virtual page number from virtual address mapped through page table to physical page number
  - Offset of virtual address same as physical address
  - Large page tables can be placed into virtual memory
- Multi-Level Tables
  - Virtual address mapped to series of tables
  - Permit sparse population of address space
- Inverted page table
  - Size of page table related to physical memory size
- PTE: Page Table Entries
  - Includes physical page number
  - Control info (valid bit, writeable, dirty, user, etc)

- The Principle of Locality:
  - Program likely to access a relatively small portion of the address space at any instant of time.
    - » Temporal Locality: Locality in Time
    - » Spatial Locality: Locality in Space
- Three (+1) Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity
  - Capacity Misses: increase cache size
  - Coherence Misses: Caused by external processors or I/O devices