codingfreak

3.22.2009

Cache Memory - Part2

Interaction Policies with Main Memory
READ operations dominate processor cache accesses, since instruction fetches are READ operations and most instructions do not WRITE to memory. When the address of the block to be READ is available, the tag is checked; if it is a HIT, the data is READ from the cache.

In case of a miss the READ policies are:

Read Through - the block is read directly from main memory to the CPU.
No Read Through - the block is read from main memory into the cache and then from the cache to the CPU, so the cache is updated as well.

A miss is comparatively slow because the data must be transferred from main memory, which is much slower than cache memory, and because of the overhead of recording the new data in the cache before it is delivered to the processor. To take advantage of Locality of Reference, the CPU copies data into the cache whenever it accesses an address not present in the cache. Since it is likely the system will access that same location again shortly, the system saves wait states by having that data in the cache. Caching single locations in this way handles the temporal aspects of memory access, but not the spatial aspects.

Caching memory locations only when they are accessed won't speed up program execution if we constantly access consecutive memory locations (Spatial Locality of Reference), because every new location is a miss. To solve this problem, most caching systems read several consecutive bytes from memory when a cache miss occurs. 80x86 CPUs, for example, read between 16 and 64 bytes at a shot (depending upon the CPU) upon a cache miss. If you read 16 bytes, why read them as a block rather than as you need them? As it turns out, most memory chips available today have special modes which let you quickly access several consecutive memory locations on the chip. The cache exploits this capability to reduce the average number of wait states needed to access memory.
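The effect of fetching a whole block on a miss can be sketched in a few lines of Python. This is only an illustration: the function name `count_misses` and the idealized cache that never evicts anything are assumptions, not a model of any particular CPU.

```python
def count_misses(addresses, block_size):
    """Count misses for an idealized cache that never evicts.

    On a miss the whole aligned block of `block_size` bytes is fetched,
    so later accesses to the same block are hits.
    """
    cached_blocks = set()
    misses = 0
    for addr in addresses:
        block = addr // block_size  # block that contains this byte
        if block not in cached_blocks:
            misses += 1
            cached_blocks.add(block)
    return misses


sequential = range(64)  # bytes 0..63, each accessed once, in order
print(count_misses(sequential, 1))   # 64: one miss per byte
print(count_misses(sequential, 16))  # 4: one miss per 16-byte block
```

With one-byte fetches a sequential walk never hits; with 16-byte blocks, 15 out of every 16 accesses hit.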

Cache READ operation

WRITE operations are different. Modifying a block cannot begin until the tag is checked to see if the address is a hit. Also, the processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of the block can be changed. In contrast, reads can access more bytes than necessary without a problem.

The Cache WRITE policies on write hit often distinguish cache designs:

Write Through - the modified data is written to both the block in the cache memory and main memory.

Advantages:
1. A READ miss never results in writes to main memory.
2. Easy to implement.
3. Main Memory always has the most current copy of the data (consistent).

Disadvantages:
1. A WRITE operation is slower, as we have to update both Main Memory and Cache Memory.
2. Every write needs a main memory access and, as a result, uses more memory bandwidth.

Write Back - the modified data is first written only to the block in the cache memory. The modified cache block is written to main memory only when it is replaced. In order to reduce the frequency of writing back blocks on replacement, a dirty bit (a status bit) is commonly used to indicate whether the block is dirty (modified while in the cache) or clean (not modified). If the block is clean, it is not written back to main memory when a miss replaces it.

Advantages:
1. WRITEs occur at the speed of the cache memory.
2. Multiple WRITEs within a block require only one WRITE to main memory and, as a result, use less memory bandwidth.

Disadvantages:
1. Harder to implement.
2. Main Memory is not always consistent with the cache.
3. READs that result in replacement may cause writes of dirty blocks to main memory.
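The bandwidth difference between the two write-hit policies can be shown with a small Python sketch. The function `main_memory_writes` is a hypothetical helper written for this illustration; it models a single cached block that receives some write hits and is then replaced.

```python
def main_memory_writes(write_hits, policy):
    """Main-memory writes caused by `write_hits` write hits to one cached
    block, followed by that block being replaced."""
    if policy == "write-through":
        # every write hit updates main memory too; replacement writes nothing
        return write_hits
    if policy == "write-back":
        dirty = write_hits > 0      # dirty bit is set by the first write hit
        return 1 if dirty else 0    # only a dirty block is written on replacement
    raise ValueError(f"unknown policy: {policy}")


print(main_memory_writes(10, "write-through"))  # 10
print(main_memory_writes(10, "write-back"))     # 1
print(main_memory_writes(0, "write-back"))      # 0 (clean block, nothing to write)
```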

In case of a cache write miss we have two options:

Write Allocate - the memory block is first loaded into cache memory from main memory on a write miss, followed by the write-hit action.
No Write Allocate - the block is directly modified in the main memory and not loaded into the cache memory.

Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).
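A rough Python sketch of the two write-miss options follows. The helper `count_write_hits` is an assumption made for illustration; it looks at repeated writes to one block that starts outside the cache and ignores evictions.

```python
def count_write_hits(num_writes, write_allocate):
    """Count cache hits for repeated writes to one block that starts uncached."""
    in_cache = False
    hits = 0
    for _ in range(num_writes):
        if in_cache:
            hits += 1
        elif write_allocate:
            in_cache = True  # write miss loads the block, then performs the write
        # no write allocate: main memory is updated directly, block stays uncached
    return hits


print(count_write_hits(5, write_allocate=True))   # 4: only the first write misses
print(count_write_hits(5, write_allocate=False))  # 0: every write goes to memory
```

This is why write allocate pairs naturally with write back: once the block is loaded, later writes are captured by the cache.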

The data in main memory being cached may be changed by other entities, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the CPU updates the data in the cache, copies of data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as cache coherence protocols.

3.18.2009

Principle of Locality

When a program executes on a computer, its memory references are not spread uniformly over all of memory; most of them are made to a small number of locations. This is where Locality of Reference matters.

Locality of Reference, also known as the Principle of Locality, is the phenomenon of the same value or related storage locations being frequently accessed. Locality occurs in time (temporal locality) and in space (spatial locality).

Temporal Locality refers to the reuse of specific data and/or resources within relatively small time durations.
Spatial Locality refers to the use of data elements within relatively close storage locations. Sequential locality, a special case of spatial locality, occurs when data elements are arranged and accessed linearly, e.g. traversing the elements in a one-dimensional array.

Put simply, a program exhibits spatial locality when it accesses consecutive memory locations, and temporal locality when it repeatedly accesses the same memory location during a short time period. Both forms of locality occur in the following Pascal code segment:

  for i := 0 to 10 do
   A [i] := 0;

In the above Pascal code, the variable 'i' is referenced several times in the for loop: 'i' is compared against 10 to see if the loop is complete, and it is incremented by one at the end of each iteration. This shows temporal locality of reference in action, since the CPU accesses 'i' at different points in a short time period.

This program also exhibits spatial locality of reference. The loop itself zeros out the elements of array A by writing a zero to the first location in A, then to the second location in A, and so on. Assuming Pascal stores the elements of A in consecutive memory locations, each loop iteration accesses a memory location adjacent to the previous one.
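The two kinds of locality can be made visible by recording a simplified address trace of the loop. The Python sketch below is only a model: the addresses `I_ADDR` and `A_BASE`, the element size, and the exact number of times 'i' is touched per iteration are assumptions, not what any real Pascal compiler is guaranteed to generate.

```python
I_ADDR = 1000      # hypothetical address of the loop variable i
A_BASE = 2000      # hypothetical base address of array A
ELEM_SIZE = 4      # assumed size of one array element in bytes

trace = []
for i in range(0, 11):                    # for i := 0 to 10 do
    trace.append(I_ADDR)                  # read i for the loop test
    trace.append(I_ADDR)                  # read i to index A
    trace.append(A_BASE + i * ELEM_SIZE)  # A[i] := 0
    trace.append(I_ADDR)                  # increment i

i_accesses = trace.count(I_ADDR)
a_addrs = [addr for addr in trace if addr != I_ADDR]
strides = {b - a for a, b in zip(a_addrs, a_addrs[1:])}

print(i_accesses)  # 33: the same address is reused many times (temporal locality)
print(strides)     # {4}: each array access is adjacent to the last (spatial locality)
```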

Cache Memory - Set Associative Mapped Cache

Set Associative mapping scheme combines the simplicity of Direct mapping with the flexibility of Fully Associative mapping. It is more practical than Fully Associative mapping because the associative portion is limited to just a few slots that make up a set.

In this mapping mechanism, the cache memory is divided into 'v' sets, each consisting of 'n' cache lines. A block from Main memory is first mapped onto a specific cache set, and then it can be placed anywhere within that set. This type of mapping offers a good trade-off between implementation cost and efficiency. The set is usually chosen by

Cache set number = (Main memory block number) MOD (Number of sets in the cache memory)

If there are 'n' cache lines in a set, the cache placement is called n-way set associative, i.e. if there are two blocks or cache lines per set, it is a 2-way set associative cache mapping, and if there are four blocks or cache lines per set, it is a 4-way set associative cache mapping.

Let us assume we have a Main Memory of size 4GB (2³² bytes), with each byte directly addressable by a 32-bit address. We will divide Main memory into blocks of 32 bytes (2⁵) each. Thus there are 128M (i.e. 2³²/2⁵ = 2²⁷) blocks in Main memory.

We have a Cache memory of 512KB (i.e. 2¹⁹ bytes), divided into blocks of 32 bytes (2⁵) each. Thus there are 16K (i.e. 2¹⁹/2⁵ = 2¹⁴) blocks, also known as Cache slots or Cache lines, in cache memory. It is clear from the above numbers that there are more Main memory blocks than Cache slots.

NOTE: The Main memory is not physically partitioned in the given way, but this is the view of Main memory that the cache sees.

NOTE: We are dividing both Main Memory and cache memory into blocks of the same size, i.e. 32 bytes.

Let us try 2-way set associative cache mapping i.e. 2 cache lines per set. We will divide 16K cache lines into sets of 2 and hence there are 8K (2¹⁴/2 = 2¹³) sets in the Cache memory.

Cache Size = (Number of Sets) * (Number of cache lines per set) * (Cache line size)

Using the above formula we can also find the number of sets in the Cache memory, i.e.

2¹⁹ = (Number of Sets) * 2 * 2⁵

Number of Sets = 2¹⁹ / (2 * 2⁵) = 2¹³.
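The same arithmetic can be checked with a short Python sketch. The helper name `cache_geometry` is an assumption made for illustration, and the bit-count shortcut relies on the number of sets being a power of two, as it is here.

```python
def cache_geometry(cache_bytes, line_bytes, ways):
    """Return (cache lines, sets, set-field bits) for an n-way cache."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    set_bits = sets.bit_length() - 1  # log2(sets); sets is a power of two here
    return lines, sets, set_bits


print(cache_geometry(512 * 1024, 32, 2))  # (16384, 8192, 13)
print(cache_geometry(512 * 1024, 32, 4))  # (16384, 4096, 12)
```

Doubling the associativity halves the number of sets, so the set field loses one bit and the tag gains one.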

When an address is mapped to a set, the direct mapping scheme is used, and then associative mapping is used within a set.

The format for an address has 13 bits in the set field, which identifies the set in which the addressed word will be found if it is in the cache. There are five bits for the word field, as before, and a 14-bit tag field; together these make up the 32 bits of the address:

  Tag (14 bits) | Set (13 bits) | Word (5 bits)

As an example of how the set associative cache views a Main memory address, consider again the address (A035F014)₁₆. The leftmost 14 bits form the tag field, followed by 13 bits for the set field, followed by five bits for the word field:

  Tag = 10100000001101 | Set = 0111110000000 | Word = 10100
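The split of (A035F014)₁₆ into its three fields can be reproduced with shifts and masks. This is a minimal Python sketch; the function name `split_set_associative` is hypothetical, and the field widths are the ones used in this example (5-bit word, 13-bit set, 14-bit tag).

```python
WORD_BITS = 5
SET_BITS = 13


def split_set_associative(addr):
    """Split a 32-bit address into (tag, set, word) fields."""
    word = addr & ((1 << WORD_BITS) - 1)
    set_index = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_index, word


tag, set_index, word = split_set_associative(0xA035F014)
print(hex(tag), hex(set_index), hex(word))  # 0x280d 0xf80 0x14
```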

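As an example, take block 14 from Main memory and compare where the different block placement schemes allow it to go in a cache of 8 frames. In a Direct Mapped cache it can be placed only in Frame 6, since 14 mod 8 = 6. In a 2-way Set Associative cache with 4 sets it can be placed in either frame of set 2, since 14 mod 4 = 2. In a Fully Associative cache it can be placed in any frame.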
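The placement of block 14 can be worked out with one small function covering all three schemes, since direct mapping is 1-way and fully associative is "all frames in one set". This Python sketch assumes a cache of 8 frames and assumes the frames of a set are numbered contiguously; `candidate_frames` is a name made up for illustration.

```python
def candidate_frames(block, frames, ways):
    """Return (set number, list of frames) where `block` may be placed."""
    sets = frames // ways
    set_number = block % sets
    first = set_number * ways  # assumes frames of a set are contiguous
    return set_number, list(range(first, first + ways))


print(candidate_frames(14, 8, 1))  # (6, [6])        direct mapped
print(candidate_frames(14, 8, 2))  # (2, [4, 5])     2-way set associative
print(candidate_frames(14, 8, 8))  # (0, [0, 1, 2, 3, 4, 5, 6, 7])  fully associative
```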

Check out one more solved problem below.

References

1. Computer Architecture Tutorial - By Gurpur M. Prabhu.
2. Computer Architecture and Organization: An Integrated Approach - By Miles Murdocca & Vincent Heuring.

Cache Memory - Fully Associative Mapped Cache

If a Main memory block can be placed in any of the Cache slots, then the cache is said to be fully associative.

Let us assume we have a Main Memory of size 4GB (2³² bytes), with each byte directly addressable by a 32-bit address. We will divide Main memory into blocks of 32 bytes (2⁵) each. Thus there are 128M (i.e. 2³²/2⁵ = 2²⁷) blocks in Main memory.

We have a Cache memory of 512KB (i.e. 2¹⁹ bytes), divided into blocks of 32 bytes (2⁵) each. Thus there are 16K (i.e. 2¹⁹/2⁵ = 2¹⁴) blocks, also known as Cache slots or Cache lines, in cache memory. It is clear from the above numbers that there are more Main memory blocks than Cache slots.

NOTE: The Main memory is not physically partitioned in the given way, but this is the view of Main memory that the cache sees.

NOTE: We are dividing both Main Memory and cache memory into blocks of the same size, i.e. 32 bytes.

In fully associative mapping any one of the 128M (i.e. 2²⁷) Main memory blocks can be mapped into any single Cache slot. To keep track of which one of the 2²⁷ possible blocks is in each slot, a 27-bit tag field is added to each slot which holds an identifier in the range from 0 to 2²⁷ – 1. The tag field is the most significant 27 bits of the 32-bit memory address presented to the cache.

In an associative mapped cache, each Main memory block can be mapped to any slot. The mapping from main memory blocks to cache slots is performed by partitioning an address into fields for the tag and the word (also known as the “byte” field):

  Tag (27 bits) | Word (5 bits)

When a reference is made to a Main memory address, the cache hardware intercepts the reference and searches the cache tag memory to see if the requested block is in the cache. For each slot, if the valid bit is 1, then the tag field of the referenced address is compared with the tag field of the slot. All of the tags are searched in parallel, using an associative memory. If any tag in the cache tag memory matches the tag field of the memory reference, then the word is taken from the position in the slot specified by the word field. If the referenced word is not found in the cache, then the main memory block that contains the word is brought into the cache and the referenced word is then taken from the cache. The tag, valid, and dirty fields are updated, and the program resumes execution.
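The lookup described above can be sketched in Python. This is a simplified model under stated assumptions: the `Slot` class and the `lookup` and `fill` functions are hypothetical names, the cache has only 4 slots, the tag comparison is a sequential loop where real hardware compares all tags in parallel, and no replacement policy or dirty bit is modeled.

```python
WORD_BITS = 5  # 32-byte blocks, so the tag is the upper 27 address bits


class Slot:
    def __init__(self):
        self.valid = False
        self.tag = 0


def lookup(slots, addr):
    """Return the index of the slot holding addr's block, or None on a miss."""
    tag = addr >> WORD_BITS
    for index, slot in enumerate(slots):  # hardware checks all slots at once
        if slot.valid and slot.tag == tag:
            return index
    return None


def fill(slots, addr):
    """Bring addr's block into the first invalid slot and return its index."""
    for index, slot in enumerate(slots):
        if not slot.valid:
            slot.valid = True
            slot.tag = addr >> WORD_BITS
            return index
    raise RuntimeError("cache full; a replacement policy would be needed here")


slots = [Slot() for _ in range(4)]  # tiny 4-slot cache for illustration
print(lookup(slots, 0xA035F014))    # None: miss, nothing is cached yet
print(fill(slots, 0xA035F014))      # 0: block loaded into slot 0
print(lookup(slots, 0xA035F000))    # 0: same 32-byte block, so a hit
print(lookup(slots, 0xA035F020))    # None: next block, miss
```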

Associative mapped cache has the advantage of placing any main memory block into any cache line. This means that regardless of how irregular the data and program references are, if a slot is available for the block, it can be stored in the cache. This flexibility comes at the cost of considerable hardware overhead for cache bookkeeping.

Although this mapping scheme is powerful enough to satisfy a wide range of memory access situations, there are two implementation problems that limit performance.

1. The process of deciding which slot should be freed when a new block is brought into the cache can be complex. This process requires a significant amount of hardware and introduces delays in memory accesses.
2. When the cache is searched, the tag field of the referenced address must be compared with all 2¹⁴ tag fields in the cache.

Cache Memory - Direct Mapped Cache

If each block from main memory has only one place it can appear in the cache, the cache is said to be Direct Mapped. In order to determine which Cache line a main memory block is mapped to, we can use the formula shown below

Cache Line Number = (Main memory Block number) MOD (Number of Cache lines)

Let us assume we have a Main Memory of size 4GB (2³² bytes), with each byte directly addressable by a 32-bit address. We will divide Main memory into blocks of 32 bytes (2⁵) each. Thus there are 128M (i.e. 2³²/2⁵ = 2²⁷) blocks in Main memory.

We have a Cache memory of 512KB (i.e. 2¹⁹ bytes), divided into blocks of 32 bytes (2⁵) each. Thus there are 16K (i.e. 2¹⁹/2⁵ = 2¹⁴) blocks, also known as Cache slots or Cache lines, in cache memory. It is clear from the above numbers that there are more Main memory blocks than Cache slots.

NOTE: The Main memory is not physically partitioned in the given way, but this is the view of Main memory that the cache sees.

NOTE: We are dividing both Main Memory and cache memory into blocks of the same size, i.e. 32 bytes.

A set of 8K (i.e. 2²⁷/2¹⁴ = 2¹³) Main memory blocks is mapped onto each Cache slot. In order to keep track of which of the 2¹³ possible Main memory blocks is in each Cache slot, a 13-bit tag field is added to each Cache slot which holds an identifier in the range from 0 to 2¹³ – 1.
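The numbers above follow directly from the MOD formula, as this small Python sketch shows. The helper names `cache_line_number` and `tag_of` are made up for illustration; the sizes are the ones used in this example.

```python
MAIN_BLOCKS = 2**32 // 2**5   # 2^27 main memory blocks
CACHE_LINES = 2**19 // 2**5   # 2^14 cache lines


def cache_line_number(block):
    """Cache line a main memory block maps to (the MOD formula)."""
    return block % CACHE_LINES


def tag_of(block):
    """Which of the blocks sharing a cache line this one is."""
    return block // CACHE_LINES


blocks_per_line = MAIN_BLOCKS // CACHE_LINES
print(blocks_per_line)                    # 8192 blocks compete for each line
print(blocks_per_line.bit_length() - 1)   # 13 tag bits
print(cache_line_number(5), cache_line_number(5 + CACHE_LINES))  # 5 5
print(tag_of(5), tag_of(5 + CACHE_LINES))                        # 0 1
```

Blocks 5 and 5 + 2¹⁴ land in the same cache line and are told apart only by their tags.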

All the tags are stored in a special tag memory where they can be searched in parallel. Whenever a new block is stored in the cache, its tag is stored in the corresponding tag memory location.

When a program is first loaded into Main memory, the Cache is cleared, and so while a program is executing, a valid bit is needed to indicate whether or not the slot holds a block that belongs to the program being executed. There is also a dirty bit that keeps track of whether or not a block has been modified while it is in the cache. A slot that is modified must be written back to the main memory before the slot is reused for another block. When a program is initially loaded into memory, the valid bits are all set to 0. The first instruction that is executed in the program will therefore cause a miss, since none of the program is in the cache at this point. The block that causes the miss is located in the main memory and is loaded into the cache.

This scheme is called "direct mapping" because each cache slot corresponds to an explicit set of main memory blocks. For a direct mapped cache, each main memory block can be mapped to only one slot, but each slot can receive more than one block.

The mapping from main memory blocks to cache slots is performed by partitioning a main memory address into fields for the tag, the slot, and the word:

  Tag (13 bits) | Slot (14 bits) | Word (5 bits)

The 32-bit main memory address is partitioned into a 13-bit tag field, followed by a 14-bit slot field, followed by a 5-bit word field. When a reference is made to a main memory address, the slot field identifies in which of the 2¹⁴ cache slots the block will be found if it is in the cache.

If the valid bit is 1, then the tag field of the referenced address is compared with the tag field of the cache slot. If the tag fields are the same, then the word is taken from the position in the slot specified by the word field. If the valid bit is 1 but the tag fields are not the same, then the slot is written back to main memory if the dirty bit is set, and the corresponding main memory block is then read into the slot. For a program that has just started execution, the valid bit will be 0, and so the block is simply written to the slot. The valid bit for the block is then set to 1, and the program resumes execution.
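The hit and miss handling described above can be sketched as a small Python model. Several things here are assumptions made for illustration: the names `split_direct` and `DirectMappedCache`, the use of a dictionary in which a missing entry stands for valid bit = 0, a write-back policy with write allocate, and counting write-backs instead of moving real data.

```python
WORD_BITS, SLOT_BITS = 5, 14


def split_direct(addr):
    """Split a 32-bit address into (tag, slot, word) fields."""
    word = addr & ((1 << WORD_BITS) - 1)
    slot = (addr >> WORD_BITS) & ((1 << SLOT_BITS) - 1)
    tag = addr >> (WORD_BITS + SLOT_BITS)
    return tag, slot, word


class DirectMappedCache:
    def __init__(self):
        # slot number -> [tag, dirty]; a missing key models valid bit = 0
        self.slots = {}
        self.write_backs = 0

    def access(self, addr, is_write=False):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        tag, slot, _word = split_direct(addr)
        entry = self.slots.get(slot)
        hit = entry is not None and entry[0] == tag
        if not hit:
            if entry is not None and entry[1]:
                self.write_backs += 1   # dirty victim is written back first
            entry = [tag, False]        # load block: valid, clean
            self.slots[slot] = entry
        if is_write:
            entry[1] = True             # modified in cache: set dirty bit
        return hit


cache = DirectMappedCache()
print(cache.access(0xA035F014))                 # False: valid bit was 0
print(cache.access(0xA035F000))                 # True: same block
print(cache.access(0xA035F014, is_write=True))  # True: hit, block becomes dirty
print(cache.access(0xA03DF014))                 # False: same slot, different tag
print(cache.write_backs)                        # 1: the dirty block was written back
```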

Check out one more solved problem below

References

1. Computer Architecture Tutorial - By Gurpur M. Prabhu.
