Computer Architecture quick walkthrough-----processor (My master's project)
=============================================================
Abstract
This paper is not some research paper which digs as deep as possible. It is actually a summary for what we have learned in computer architecture from the textbook :< Computer Architecture- A Quantitative Approach - 4th edition>. It introduces various basic concepts in the design of a processor through a very natural flow. This paper majorly covers the pipelining, cache, and virtual memory portions in the computer architecture. It does not have everything, but it does give a general introduction to processor design and some good explanations for some tricky problems in the textbook. Most of the frequently asked questions in the interview are at least listed and some of them are given detailed explanation in the paper.
============================================================= 

First, let’s imagine that you are a computer architect who is going to design a processor, whether simple or complex. How do you start designing? Of course, we first need to be clear about what a processor’s job is. A processor fetches instructions from memory, executes them, and stores results back to memory.

We can divide the job of a processor into five basic tasks: 
1. fetch an instruction from memory (get the next instruction to be executed)
2. decode the instruction (obtain the opcode, operands, etc.)
3. execute the instruction (ALU operations)
4. memory access (load data from or store data to memory)
5. write back (write the result back to the register file)

Now a question arises: should the processor perform all five tasks in one clock cycle, or one task per clock cycle?
Let’s consider the design that performs all five tasks in one clock cycle first. After the clock’s positive edge, the fetch unit retrieves an instruction from memory and passes it to the decode unit. The decode unit decodes the instruction, gets the operands from the register file, configures the ALU and other parts of the circuit, and passes the instruction to the execution unit (ALU), and so on. We have to ask: what is the fetch unit doing after it passes the instruction to the next unit? In fact, it is doing nothing useful, just holding the same state. This is a waste of hardware! And it is where we can try to improve performance.

What about performing one task per clock cycle? Well, it now takes 5 clock cycles to finish one instruction, although the clock cycle period is much shorter. It still leaves hardware units idle, since in each clock cycle only one unit is activated to do useful work. However, it provides a good base for pipelining.

So what is pipelining? Why do we need pipelining in the processor?
To solve the idleness problem exposed above, we now consider using pipelining in our processor. Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of the parallelism that exists among the actions needed to execute an instruction. In simple words, it takes those idle units, assigns new work to them, and hence keeps them busy. Now let’s take a look at the big picture of a pipeline.

IF: instruction fetch
ID: instruction decode
EX: instruction execution
MEM: memory access
WB: write back to the register file
In each clock cycle, the five hardware units are processing different parts of different instructions. Therefore, processing of multiple instructions can be overlapped.
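To make the overlap concrete, here is a minimal sketch in C (my own illustration, not from the textbook) that prints which stage each instruction occupies in each clock cycle of an ideal 5-stage pipeline:

    #include <stdio.h>

    /* Ideal 5-stage pipeline: instruction i enters stage s in cycle i + s. */
    int main(void) {
        const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
        const int num_stages = 5;
        const int num_insts  = 4;      /* assumed number of instructions */
        const int num_cycles = num_insts + num_stages - 1;

        printf("cycle:");
        for (int c = 0; c < num_cycles; c++)
            printf("%5d", c + 1);
        printf("\n");

        for (int i = 0; i < num_insts; i++) {
            printf("inst%d:", i + 1);
            for (int c = 0; c < num_cycles; c++) {
                int s = c - i;         /* stage index of instruction i in cycle c */
                printf("%5s", (s >= 0 && s < num_stages) ? stages[s] : "");
            }
            printf("\n");
        }
        return 0;
    }

The printed table shows that N instructions finish in N + 4 cycles instead of 5N cycles: in any given cycle, the five units are busy with different instructions.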

The advantage of pipelining is that it increases instruction throughput, although the processing time for an individual instruction actually increases due to the overhead of pipelining.
Now we may consider this question: how much faster is a pipelined processor compared to one without pipelining?

The ideal speedup depends on the number of pipeline stages, which is 5 in our example. The ideal CPI is 1, which means one clock cycle per instruction. However, life is not always that good. The reality is that there are several kinds of dependencies which can cause hazards that severely restrict the parallelism among instructions.
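As a rough sketch of the ideal case (assuming perfectly balanced stages, no hazards, and no pipeline register overhead), the speedup can be computed like this:

    #include <stdio.h>

    int main(void) {
        const double stages = 5.0;     /* pipeline depth in our example  */
        const double n      = 1000.0;  /* assumed number of instructions */

        /* Unpipelined: each instruction takes all 5 stage-times.
           Pipelined: the first instruction takes 5 cycles, then one
           instruction completes every cycle. */
        double t_unpipelined = n * stages;
        double t_pipelined   = stages + (n - 1.0);

        printf("speedup = %.2f (ideal limit = %.0f)\n",
               t_unpipelined / t_pipelined, stages);
        return 0;
    }

For 1000 instructions this prints a speedup of about 4.98, approaching the pipeline depth of 5 as the instruction count grows.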

Let’s take a look at some common dependencies: data dependence, name dependence and control dependence.
And some common hazards: data hazards, branch hazards, and structural hazards.
Data hazards are further divided into RAW, WAR, and WAW hazards.
For the details, you can refer to the textbook (p. 68); I am not going to copy the definitions here. What I want to highlight is the relation between dependencies and hazards.

Dependencies are a property of the program; whether a dependence actually becomes a hazard, and how much of a stall it causes, is a property of the pipeline organization.

Data dependences may potentially cause RAW data hazards, but not necessarily. If the distance between two dependent instructions is long enough, the hazard can be avoided while the dependence is still respected. A common approach is rescheduling the code, which is usually done statically by the compiler. Another good approach for minimizing data hazards is forwarding. Consider this situation: the result produced by an instruction i is needed by the following instruction j. Without rescheduling the code at compile time, the pipeline would have to stall until instruction i has written the result back to the register file. But we observe that the result is actually produced as soon as the execution stage finishes. Therefore, we don't have to wait for the result to go through the MEM and WB stages. We can forward the result directly to the ALU input for the next instruction’s use once the execution stage has finished and the ALU output has been stored in the intermediate (pipeline) register. Be careful: the result is not fed back to the ALU input from the ALU output itself, but from the pipeline register where the previous ALU output is stored.
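A minimal sketch of the forwarding (bypass) check in C, written as pseudo-hardware; the field names and the MIPS-style "register 0 is always zero" convention are my own simplifying assumptions:

    /* Decide whether the ALU inputs of the instruction currently in EX
       must come from the EX/MEM pipeline register instead of the register
       file (the classic RAW bypass).  ex_mem_rd is the destination of the
       older instruction that just left EX; id_ex_rs / id_ex_rt are the
       source registers of the instruction now in EX. */
    typedef struct {
        int ex_mem_reg_write;    /* does the older instruction write a register? */
        int ex_mem_rd;           /* its destination register number              */
        int id_ex_rs, id_ex_rt;  /* source registers of the instruction in EX    */
    } pipeline_regs_t;

    int forward_a_from_ex_mem(const pipeline_regs_t *p) {
        return p->ex_mem_reg_write
            && p->ex_mem_rd != 0             /* assume register 0 is hardwired to zero */
            && p->ex_mem_rd == p->id_ex_rs;
    }

    int forward_b_from_ex_mem(const pipeline_regs_t *p) {
        return p->ex_mem_reg_write
            && p->ex_mem_rd != 0
            && p->ex_mem_rd == p->id_ex_rt;
    }

When either function returns 1, a multiplexer in front of the ALU selects the value held in the EX/MEM pipeline register rather than the stale value read from the register file.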

Name dependences cause WAR and WAW hazards. The essence of a name dependence is a conflict over a storage name (a register), not a true flow of data. If instruction j is name dependent on instruction i, the value in the register is not what instruction j is interested in; instruction j only needs the register itself as a place to put its own value. A hazard occurs when instruction j writes this register before instruction i has finished using it. A simple approach to this problem is register renaming: assign a different register to the dependent instruction so that no name conflict remains.
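Below is a tiny sketch of the renaming idea (my own simplified illustration, with no register reclamation): each time an instruction writes an architectural register, it is given a fresh physical register, so WAR and WAW conflicts on the old name disappear.

    #include <stdio.h>

    #define NUM_ARCH_REGS 8

    /* rename_table[r] = physical register currently holding architectural r. */
    static int rename_table[NUM_ARCH_REGS];
    static int next_free_phys = NUM_ARCH_REGS;   /* naive free list: just count up */

    /* A destination write gets a brand-new physical register... */
    int rename_dest(int arch_reg) {
        rename_table[arch_reg] = next_free_phys++;
        return rename_table[arch_reg];
    }

    /* ...while a source read uses whatever physical register currently maps it. */
    int rename_src(int arch_reg) {
        return rename_table[arch_reg];
    }

    int main(void) {
        for (int r = 0; r < NUM_ARCH_REGS; r++)
            rename_table[r] = r;                 /* initial identity mapping */

        /* i: R1 = R2 + R3    j: R2 = R4 * R5   (WAR conflict on R2 in a pipeline) */
        int i_src = rename_src(2);               /* i reads the old R2             */
        int j_dst = rename_dest(2);              /* j writes a new physical reg    */

        printf("i reads R2 as p%d, j writes R2 as p%d -> no conflict\n",
               i_src, j_dst);
        return 0;
    }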

Control dependences cause branch hazards. A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be. It is difficult to completely remove control hazards, but we can reduce the penalty with several approaches. One approach is to keep executing the instructions that immediately follow the branch, as if there were no branch at all. If the branch turns out not to be taken, nothing needs to change. If the branch is taken, which means the PC jumps to the target address, the processor needs to flush the pipeline by turning the already-fetched instructions into no-ops that do not affect processor state. This approach is effectively a static prediction. Speaking of prediction, there are better and more sophisticated dynamic schemes based on a branch history table, such as the 2-bit predictor and the correlating branch predictor. The details can be found in the textbook (p. 81).
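To illustrate the idea, here is a sketch of a 2-bit saturating-counter predictor in C; the table size and the PC indexing are my own example choices, not taken from the textbook:

    #include <stdint.h>

    #define BHT_ENTRIES 1024   /* assumed branch history table size (power of 2) */

    /* Each entry is a 2-bit saturating counter:
       0,1 = predict not taken; 2,3 = predict taken. */
    static uint8_t bht[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);    /* drop byte offset, mask */
    }

    int predict_taken(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    /* Called once the branch outcome is known: move the counter one step
       toward the actual outcome, saturating at 0 and 3.  A single
       misprediction does not immediately flip a strongly biased branch. */
    void train(uint32_t pc, int taken) {
        uint8_t *c = &bht[bht_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }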
However, one question may occur to us: why does prediction work?

The answer is based on the observation that most branches are strongly biased. For example, consider the following code segment:
        Mov  R1, #0;
start:  Add  R1, R1, #4;
        Cmp  R1, #80;
        Bneq start;      // if R1 is not equal to 80, go back to label start
        Mul  R4, R1, #5;
The program executes the branch instruction 20 times; the branch is taken 19 times and not taken only once. If we simply predict the branch as taken, the misprediction rate in this case is 1/20 = 5%, which is acceptable.
Other widely used approaches for reducing the branch penalty are loop unrolling (p. 75) and the branch delay slot (A-23).
There are, of course, many other solutions to the hazards listed above. What I am trying to explain here is the basic, general idea of what problems we may face in designing a pipelined processor and how to solve them.

Now that we’ve gone through the concept of pipelining, the factors that limit its performance, and the solutions for eliminating or minimizing those limitations, we can’t help asking: what do we design next?

Let’s recall that in the first pipeline stage the processor fetches instructions from memory, and in the fourth stage it loads data from or stores data to memory (in this paper I leave main memory design alone and focus on the processor). But how long, in clock cycles, does a memory access take? The answer is on the order of hundreds of clock cycles, or even more. In short, main memory is far too slow compared to the processor. If a memory access is not finished, the pipeline has to stall until it is done, which obviously drags down the performance of the CPU. We need something the processor can access quickly. Right, that is the cache! This huge speed gap between the CPU and main memory answers the question: why do we need a cache?
The cache is the highest, or first, level of the memory hierarchy. It holds a mapping of data in main memory, that is, a small portion of that data. Compared to main memory (MB~GB), a cache is much smaller (KB~MB), but much faster and more expensive. Let’s have a look at how the processor, cache, and main memory fit together.


The CPU gets data and instructions directly from the cache rather than from main memory. The unit of data transferred between the cache and main memory is the block: a contiguous group of bytes, perhaps 64 bytes. So each time a word in main memory is requested, all the data in its block is transferred to the cache. Two kinds of locality make this caching effective.

Temporal locality: we are likely to need this same word in the near future, so it is useful to place it in the cache so it can be accessed quickly.
Spatial locality: there is a high probability that the other data in this block will be needed soon.
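A tiny C example of both kinds of locality (my own illustration): the sequential walk over the array exploits spatial locality, while the repeated use of sum and the small hist array exploits temporal locality.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];
        for (int i = 0; i < N; i++)
            a[i] = i & 0xFF;

        long sum = 0;
        int hist[4] = {0};

        /* Spatial locality: a[i], a[i+1], ... lie in the same (or the next)
           cache block, so one miss brings in data for many later accesses.
           Temporal locality: 'sum' and 'hist' are touched on every iteration
           and therefore stay in the cache (in practice, in registers too). */
        for (int i = 0; i < N; i++) {
            sum += a[i];
            hist[a[i] & 3]++;
        }

        printf("sum=%ld hist0=%d\n", sum, hist[0]);
        return 0;
    }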

If the data requested by the processor is in the cache, we call it a cache hit; if not, it is a cache miss. Since the cache is far too small to hold all the data in main memory, cache misses are inevitable. Here comes a question: what happens on a cache miss?
First we need to know that there are two types of cache miss: 1. read miss; 2. write miss.
For a read miss, the cache retrieves the block containing the requested data from main memory, and the time spent on this transfer is called the miss penalty.
For a write miss, there are two policies the cache could apply: 1. write allocate; 2. no-write allocate.
For write allocate, the missing block is brought into the cache and the write is then performed there, so future accesses to that block can hit in the cache.
For no-write allocate, the write only updates main memory and the block is not brought into the cache.
We now also have to answer another question: what happens on a cache hit?
For a read hit, it is straightforward: the processor simply fetches the requested data. But a write hit is a little more complicated. There are two write policies when writing to the cache, and they distinguish cache designs.

For write through, the information is written to both the block in the cache and the block in the main memory.
For write back, the information is written only to the block in the cache and is written back to main memory only when the block is replaced. Here we need a dirty bit to indicate whether the block has been modified. If it is dirty, we write it back to main memory on replacement; otherwise we can simply discard it, since the cache and main memory hold the same copy of the data.
A detailed comparison can be found in the textbook (C-11). Here I will consider another problem: when a block is written to main memory, the cache has to wait for the write to complete. This is called a write stall. To alleviate it, we can add a buffer that takes care of the write to main memory so that the cache can continue; this buffer is called a write buffer (C-11).
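Putting the hit and miss policies together, here is a rough sketch in C of the write path of a write-back, write-allocate, direct-mapped cache. The structure layout, sizes, and memory helper functions are my own simplifications, not a real cache controller.

    #include <stdint.h>
    #include <string.h>

    #define NUM_SETS   64
    #define BLOCK_SIZE 64                        /* bytes; assumed parameters   */

    typedef struct {
        int      valid;
        int      dirty;                          /* set on a write (write-back) */
        uint64_t tag;
        uint8_t  data[BLOCK_SIZE];
    } cache_line_t;

    static cache_line_t cache[NUM_SETS];         /* direct-mapped for simplicity */
    static uint8_t main_memory[1 << 20];         /* 1 MiB stand-in for DRAM;
                                                    addresses assumed to fit     */

    static void mem_read_block(uint64_t addr, uint8_t *buf) {
        memcpy(buf, &main_memory[addr], BLOCK_SIZE);
    }
    static void mem_write_block(uint64_t addr, const uint8_t *buf) {
        memcpy(&main_memory[addr], buf, BLOCK_SIZE);
    }

    void cache_write_byte(uint64_t addr, uint8_t value) {
        uint64_t block_addr = addr / BLOCK_SIZE;
        unsigned set        = block_addr % NUM_SETS;
        uint64_t tag        = block_addr / NUM_SETS;
        cache_line_t *line  = &cache[set];

        if (!line->valid || line->tag != tag) {              /* write miss      */
            if (line->valid && line->dirty)                  /* evict old block */
                mem_write_block((line->tag * NUM_SETS + set) * BLOCK_SIZE,
                                line->data);                 /* write-back      */
            mem_read_block(block_addr * BLOCK_SIZE, line->data); /* write-allocate */
            line->valid = 1;
            line->dirty = 0;
            line->tag   = tag;
        }

        line->data[addr % BLOCK_SIZE] = value;               /* write hit in cache */
        line->dirty = 1;                                     /* memory updated later */
    }

A write-through, no-write-allocate design would instead update main memory on every write and skip the allocation step, trading more memory traffic for simpler bookkeeping (often combined with the write buffer mentioned above).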

Before we think about other approaches to improving cache performance, we need a basic understanding of the structure of a cache. A cache consists of multiple block frames. But how do we map a block in main memory to a block frame in the cache?
There are three categories of cache organization: 1. direct mapped; 2. set associative; 3. fully associative.


For a direct-mapped cache, a block from memory can be placed in exactly one designated block frame in the cache. Hence this mapping makes it cheapest to find a block in the cache, but it has the highest miss rate of the three.

A set-associative cache is divided into multiple sets, where a set is a group of block frames. If each set contains n block frames, the placement is called n-way set associative. When a block is brought into the cache, it can be placed in any block frame of its designated set. This improves the utilization of the block frames and thus reduces the miss rate. However, it increases the cost of searching for a block, since the block could be anywhere in its set: the cache has to compare the tags of all the block frames in that set simultaneously.

For a fully associative cache, a block can be placed anywhere in the cache, which further reduces the miss rate but increases the search cost.
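To see where a given address lands under each organization, here is a small sketch in C of splitting an address into tag, index, and block offset; the cache parameters are example values I chose, not fixed by the textbook.

    #include <stdio.h>
    #include <stdint.h>

    /* Example: 32 KiB cache, 64-byte blocks, 4-way set associative.
       Number of sets = 32768 / (64 * 4) = 128, so we need
       6 offset bits (64 = 2^6) and 7 index bits (128 = 2^7). */
    #define OFFSET_BITS 6
    #define INDEX_BITS  7

    int main(void) {
        uint32_t addr   = 0x12345678;                        /* arbitrary address */
        uint32_t offset =  addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);

        /* Direct mapped: the block may live only in frame 'index'.
           4-way set associative: compare 'tag' against the 4 frames of set 'index'.
           Fully associative: no index field at all; compare against every frame. */
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }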
Now we can consider the next question: how do we improve cache performance?
A good measure of memory hierarchy performance is average memory access time (AMAT).
AMAT = Hit time + Miss rate × Miss penalty.
Thus we will try to improve the performance based on these three factors.
To reduce hit time, we keep the cache small. A time-consuming part of a cache hit is using the index portion of the address to read the tag memory and compare it to the address. Smaller hardware can be faster, so a small cache helps reduce hit time. Another method is to keep the cache simple, for example by using direct mapping. One benefit of a direct-mapped cache is that the designer can overlap the tag check with the transmission of the data, which effectively reduces hit time.

To reduce the miss rate, we have three basic approaches: 1. larger blocks; 2. larger caches; 3. higher associativity.
Larger blocks exploit spatial locality and larger caches capture more of the working set, so their benefit is straightforward. The obvious drawbacks are a potentially longer hit time and higher cost and power; overly large blocks also increase the miss penalty.

For higher associativity, the trade-off is similar: fewer conflict misses, but a potentially longer hit time.

To reduce the miss penalty, we can design a multilevel cache. The first-level cache can be small enough to match the clock cycle time of the fast processor, while the lower-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty. However, we need to modify the AMAT formula, since it assumes only one level of cache. For simplicity, consider a 2-level cache. If the first level misses, we go to the second level for the requested data. Therefore the level-1 miss penalty is the time to access the level-2 cache: Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2). Putting everything together: AMAT = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2)).
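A quick numeric sketch of the two-level formula, using example numbers I made up purely for illustration:

    #include <stdio.h>

    int main(void) {
        /* Assumed example numbers (times in clock cycles): */
        double hit_l1 = 1.0,  miss_rate_l1 = 0.05;    /* small, fast L1          */
        double hit_l2 = 10.0, miss_rate_l2 = 0.20;    /* larger, slower L2       */
        double penalty_l2 = 200.0;                    /* main-memory access time */

        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
        double amat            = hit_l1 + miss_rate_l1 * miss_penalty_l1;

        printf("L1 miss penalty = %.1f cycles\n", miss_penalty_l1);  /* 50.0 */
        printf("AMAT            = %.1f cycles\n", amat);             /* 3.5  */
        return 0;
    }

With these numbers, adding the L2 cache cuts the average memory access time from 1 + 0.05 × 200 = 11 cycles down to 3.5 cycles.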

Here I have listed only a few basic approaches for improving cache performance. For more advanced and complicated optimizations, you can refer to the textbook (p. 293).

Now that we have a pipelined processor and a multilevel cache, what do we design next? First, consider this situation: you have a large program to run, but physical memory is not large enough to hold it. Programmers used to solve this problem by dividing the large program into several small pieces called overlays and running them one by one. Obviously, this was a burden on programmers. Nowadays the operating system (OS), with hardware support, takes care of it through the concept of virtual memory. The OS maps virtual memory to both physical memory and secondary storage, such as a hard disk drive; therefore the virtual address space can be much larger than the physical address space.

The mapping information between virtual memory and physical memory is stored in a structure called the page table, located in main memory. The addresses flowing inside the processor are virtual addresses, and the processor architecture determines their width. Now a new problem arises: every time we need data or instructions, we must first access main memory (the page table) to obtain the physical address. This extra memory reference obviously hurts performance. We need something similar to the cache, but holding address translations instead of data. This answers the question: why do we need a TLB?

A TLB (translation lookaside buffer) is nothing more than a cache that holds a portion of the page table from main memory, and hence reduces the time spent accessing memory during address translation.
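Here is a rough sketch in C of what happens on every memory reference: check the TLB first, and only walk the page table in main memory on a TLB miss. The structures, the fully associative lookup loop, and the page_table_walk helper are my own simplifications.

    #include <stdint.h>

    #define PAGE_SHIFT  12                 /* 4 KiB pages (assumed)       */
    #define TLB_ENTRIES 64                 /* small fully associative TLB */

    typedef struct {
        int      valid;
        uint64_t vpn;                      /* virtual page number         */
        uint64_t ppn;                      /* physical page number        */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Assumed helper: the slow path that reads the page table in main
       memory (and would raise a page fault if the page is not resident). */
    uint64_t page_table_walk(uint64_t vpn);

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & ((1ull << PAGE_SHIFT) - 1);

        /* TLB hit: translation found without touching main memory. */
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << PAGE_SHIFT) | offset;

        /* TLB miss: walk the page table, then cache the translation. */
        uint64_t ppn = page_table_walk(vpn);
        static int victim = 0;             /* naive round-robin replacement */
        tlb[victim] = (tlb_entry_t){ 1, vpn, ppn };
        victim = (victim + 1) % TLB_ENTRIES;
        return (ppn << PAGE_SHIFT) | offset;
    }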
Now we can take a look at what we’ve got for the processor:


This is only a basic architecture for a processor. The cache and TLB are usually on-chip, as parts of the processor, while main memory is off-chip and connected through sockets on the mainboard. Real architectures are much more complicated, but this paper focuses on the basic concepts of processor design, which are the foundation for building more complex systems.

This paper walks through the textbook, links various computer architecture concepts together, and gives some useful explanations of tricky problems in the book.
