It is given that each instruction takes 5 clock cycles to execute in the non-pipelined architecture, so time taken to execute each instruction = 5 * 0.4 = 2ns
In the pipelined architecture the clock cycle time = 1/2G = 0.5 ns
In the pipelined architecture there are stalls due to memory instructions and branch instructions.
In the pipeline, the updated clocks per instruction CPI = (1+stall frequency due to memory operations * stalls of memory instructions + stall frequency due to branch operations * stalls due to branch instructions)
Out of the total instructions , 30% are memory instructions. Out of those 30%, only 5% cause stalls of 50 cycles each.
Stalls per instruction due to memory operations = 0.3*0.05*50 = 0.75
Out of the total instructions 10% are branch instructions. Out of those 10% of instructions 50% of them cause stalls of 2 cycles each.
Stalls per instruction due to branch operations = 0.1*0.5*2 = 0.1
The updated CPI in pipeline = 1 + 0.75 + 0.1 = 1.85
The execution time in the pipeline = 1.85 * 0.5 = 0.925 ns
The speed up = Time in non-pipelined architecture / Time in pipelined architecture = 2 / 0.925 = 2.16
Instruction execution in a processor is divided into 5 stages, Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Execute (EX), and Write Back (WB). These stages take 5, 4, 20, 10 and 3 nanoseconds (ns) respectively. A pipelined implementation of the processor requires buffering between each pair of consecutive stages with a delay of 2 ns. Two pipelined implementations of the processor are contemplated:
(i) a naive pipeline implementation (NP) with 5 stages and
(ii) an efficient pipeline (EP) where the OF stage is divided into stages OF1 and OF2 with execution times of 12 ns and 8 ns respectively.
The speedup (correct to two decimal places) achieved by EP over NP in executing 20 independent instructions with no hazards is _________.
The stage delays are 5, 4, 20, 10 and 3. And buffer delay = 2ns
So clock cycle time = max of stage delays + buffer delay
= max(5, 4, 20, 10,3)+2
Execution time for n-instructions in a pipeline with k-stages = (k+n-1) clock cycles
= (k+n-1)* clock cycle time
In this case execution time for 20 instructions in the pipeline with 5-stages
Efficient Pipeline implementation:
OF phase is split into two stages OF1, OF2 with execution times of 12ns, 8ns
New stage delays in this case = 5, 4, 12, 8, 10, 3
Buffer delay is the same 2ns.
So clock cycle time = max of stage delays + buffer delay
= max(5, 4, 12, 8, 10,3) + 2
Execution time = (k+n-1) clock cycles
= (k+n-1)* clock cycle time
In this case no. of pipeline stages, k = 6
No. of instructions = 20
Execution time = (6+20-1)*14 = 25*14 = 350ns
Speed up of Efficient pipeline over native pipeline
= Naive pipeline execution time / efficient pipeline execution time
= 528 / 350
The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The ﬁrst stage (with delay 800 picoseconds) is replaced with a functionally equivalent design involving two stages with respective delays 600 and 350 picoseconds. The throughput increase of the pipeline is ________ percent.
Cycle time = max of all stage delays.
In the first case max stage delay = 800.
So throughput = 1/800 initially.
After replacing this stage with two stages of delays 600, 350... the cycle time = maximum stage delay = 600.
So the new throughput = 1/600.
The new throughput > old throughput.
And the increase in throughput = 1/600 - 1/800.
We calculate the percentage increase in throughput w.r.t initial throughput, so the % increase in throughput
= (1/600 - 1/800) / (1/800) * 100
= ((800 / 600) - 1) * 100
= ((8/6) -1) * 100
Consider a 3 GHz (gigahertz) processor with a three-stage pipeline and stage latencies τ1, τ2, τ3 and such that τ1 = 3τ2/4 = 2τ3. If the longest pipeline stage is split into two pipeline stages of equal latency, the new frequency is _________ GHz, ignoring delays in the pipeline registers.
Given ,τ1 = 3 τ2/4 = 2 τ3
Put τ1 = 6t, we get τ2 = 8t, τ3 = 3t
Now largest stage time is 8t.
So, frequency is 1/8t
⇒ 1/8t = 3 GHz
⇒ 1/t = 24 GHz
From the given 3 stages, τ 1 = 6t, τ 2 = 8t and τ 3 = 3t
So, τ 2 > τ1 > τ3.
The longest stage is τ2 = 8t and we will split that into two stages of 4t & 4t.
New processor has 4 stages - 6t, 4t, 4t, 3t.
Now largest stage time is 6t.
So, new frequency is = 1/6t
We can substitute 24 in place of 1/t, which gives the new frequency as 24/6 = 4 GHz
Consider the sequence of machine instruction given below
MUL R5,R0,R1 DIV R6,R2,R3 ADD R7,R5,R6 SUB R8,R7,R4 In the above sequence, R0 to R8 are general purpose registraters. In the instructions shown, the first register stores the result of the operation performed on the second and the third registers. This sequence of instructions is to be executed in a pipelined instruction processor with the following 4 stages: (1) Instruction Fetch and Decode (IF), (2) Operand Fetch (OF), (3) Perfom Operations (PO) and (4) Write back the result (WB). The IF, OF and WB stages take 1 clock cycle each instruction The PO stahe takes 1 clock cycle for ADD or SUB instruction, 3 clock cycles for MUL instruction and 5 clock cycles for DIV instructions. The pipelined processor uses operand foewarding from the PO stage to the OF stage. The number of clock cycles taken for the execution of the aboce sequence of instructions is _______
O ⇒ Operand Fetch
P ⇒ Perform the operation
W ⇒ write back the result
Time --> ----------------------------- 1 2 3 4 5 ----------------------------- S1 | X | | | | X | S2 | | X | | X | | S3 | | | X | | |The minimum average latency (MAL) is __________
S1 is needed at time 1 and 5, so its forbidden latency is 5-1 = 4.
S2 is needed at time 2 and 4, so its forbidden latency is 4-2 = 2.
So, forbidden latency = (2,4,0) (0 by default is forbidden)
Allowed latency = (1,3,5) (any value more than 5 also).
Collision vector (4,3,2,1,0) = 10101 which is the initial state as well.
From initial state we can have a transition after "1" or "3" cycles and we reach new states with collision vectors
(10101 >> 1 + 10101 = 11111) and (10101 >> 3 + 10101 = 10111) respectively.
These 2 becomes states 2 and 3 respectively.
For "5" cycles we come back to state 1 itself.
From state 2 (11111), the new collision vector is 11111.
We can have a transition only when we see first 0 from right.
So, here it happens on 5th cycle only which goes to initial state. (Any transition after 5 or more cycles goes to initial state as we have 5 time slices).
From state 3 (10111), the new collision vector is 10111.
So, we can have a transition on 3, which will give (10111 >> 3 + 10101 = 10111) third state itself. For 5, we get the initial state.
Thus all the transitions are complete. State\Time 1 3 5 1 (10101) 2 3 1 2 (11111) - - 1 3 (10111) - 3 1 So, minimum length cycle is of length 3 either from 3-3 or from 1-3. So the minimum average latency is also 3.
OP Ri, Rj, Rkwhere operation OP is performed on contents of registers Rj and Rk and the result is stored in register Ri.
I1 : ADD R1, R2, R3 I2 : MUL R7, R1, R3 I3 : SUB R4, R1, R5 I4 : ADD R3, R2, R4 I5 : MUL R7, R8, R9Consider the following three statements:
S1: There is an anti-dependence between instructions I2 and I5. S2: There is an anti-dependence between instructions I2 and I4. S3: Within an instruction pipeline an anti-dependence always creates one or more stalls.
Only S1 is true
Only S2 is true
Only S1 and S3 are true
Only S2 and S3 are true
S2: True. There is WAR dependency between I2 and I4.
S3: False. Because WAR or antidependency can be resolved by register renaming.
There were 2 stall cycles for pipelining for 25% of the instructions.
So pipeline time =(1+(25/100)*2)=3/2=1.5
Speed up =Non-pipeline time / Pipeline time=6/1.5=4
Instruction Meaning of instruction I0 :MUL R2 ,R0 ,R1 R2 ¬ R0 *R1 I1 :DIV R5 ,R3 ,R4 R5 ¬ R3/R4 I2 :ADD R2 ,R5 ,R2 R2 ¬ R5+R2 I3 :SUB R5 ,R2 ,R6 R5 ¬ R2-R6
I and II only
I and III only
II and III only
I, II and III
II. True. Register renaming can eliminate all WAR Hazard as well as WAW hazard.
III. If this statement would have said that
"Control hazard penalties can be completely eliminated by dynamic branch prediction", then it is false. But it is only given that "Control hazard penalties can be eliminated by dynamic branch prediction". So, it is true.
Hence, none of the given Option is Correct.
The instruction following the conditional branch instruction in memory is executed.
The first instruction in the fall through path is executed.
The first instruction in the taken path is executed.
The branch takes longer to execute than any other instruction.
From the given set of instructions I3 is updating R1, and the branch condition is based on the value of R1 so I3 can’t be executed in the delay slot.
Instruction I1 is updating the value of R2 and R2 is used in I3. So I1 also can’t be executed in the delay slot.
Instruction I2 is updating R4, and at the memory location represented by R4 the value of R1 is stored. So if I2 is executed in the delay slot then the memory location where R1 is to be stored as part of I4 will be in a wrong place. Hence between I2 and I4, I2 can’t be executed after I4. Hence I2 can’t be executed in the delay slot.
Instruction I4 can be executed in the delay slot as this is storing the value of R1 in a memory location and executing this in the delay slot will have no effect. Hence option D is the answer.
A non pipelined single cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five stages requiring 2.5 nsec, 1.5 nsec, 2 nsec, 1.5 nsec and 2.5 nsec, respectively. The delay of the latches is 0.5 nsec. The speedup of the pipeline processor for a large number of instructions is
For pipelined system = Max(stage delay) + Max(latch delay) = 2.5 + 0.5 = 3.0
Speedup = Time in non-pipelined system/Time in pipelined system = 10/3 = 3.33
IF: Instruction Fetch ID: Instruction Decode and Operand Fetch EX: Execute WB: Write BackThe IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage dependson the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?
ADD R2, R1, R0 R2 <- R0 + R1 MUL R4, R3, R2 R4 <- R3 * R2 SUB R6, R5, R4 R6 <- R5 - R4
So, total no. of clock cycles needed to execute the given 3 instructions is 8.
For a non-pipelined processor each instruction takes 12 cycles.
So for n instructions total execution time be 12 × n = 12n
For a pipelined processor each instruction takes
max (3, 2, 5, 4, 6, 2) =6
So for n instructions total execution time be,
(1×6 + (n-1) × 1) × 6
= (6 + n - 1) × 6
= (5 + n) × 6
= 30 + 6n
So, if n is very large,
20% are condition branches out of 109
⇒ 20/100 × 109
⇒ 2 × 108
In third stage of pipeline it consists of 2 stage cycles.
Total cycle penalty = 2 × 2 × 108 = 4 × 108
Clock speed = 1 GHz
Each Instruction takes 1 cycle i.e., 109 instructions.
Total execution time of a program is
= (109 / 109) +((4× 108) / 109) = 1+0.4 = 1.4seconds
A pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode (ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX stage. The sequence of instructions corresponding to the statement X = (S - R * (P + Q))/T is given below. The values of variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the instruction sequence.
The number of Read-After-Write (RAW) dependencies, Write-After-Read( WAR) dependencies, and Write-After-Write (WAW) dependencies in the sequence of instructions are, respectively,
2, 2, 4
3, 2, 3
4, 2, 2
3, 3, 2
I1 - I2 (R5)
I2 - I3 (R6)
I3 - I4 (R5)
I4 - I5 (R6)
I2 - I3 (R5)
I3 - I4 (R6)
I1 - I3 (R5)
I3 - I4 (R6)
IF — Instruction fetch from instruction memory, RD — Instruction decode and register read, EX — Execute: ALU operation for data and address computation, MA — Data memory access - for write access, the register read at RD stage is used, WB — Register write back. Consider the following sequence of instructions: I1 : L R0, 1oc1; R0 <= M[1oc1] I2 : A R0, R0; R0 <= R0 + R0 I3 : S R2, R0; R2 <= R2 - R0 Let each stage take one clock cycle.What is the number of clock cycles taken to complete the above sequence of instructions starting from the fetch of I1 ?
If we don't use operator forwarding:
Total clock cycles = 8/11
There is no '11' in option.
Then no. of cycles = 8
A 4-stage pipeline has the stage delays as 150, 120, 160 and 140 nanoseconds respectively. Registers that are used between the stages have a delay of 5 nanoseconds each. Assuming constant clocking rate, the total time taken to process 1000 data items on this pipeline will be
1st instruction × 4 × clock time + 999 instruction × 1 × clock time
1 × 4 × 165ns + 999 × 1 × 165ns
1. The j + 1-st instruction uses the result of the j-th instruction as an operand 2. The execution of a conditional jump instruction 3. The j-th and j + 1-st instructions require the ALU at the same timeWhich of the above can cause a hazard ?
I and II only
II and III only
All the three
II is belongs to the Control hazard.
III is belongs to the Structural hazard.
→ Hazards are the problems with the instruction pipeline in CPU micro architectures.
the pipeline stages have different delays
consecutive instructions are dependent on each other
the pipeline stages share hardware resources
All of the above
If pipeline stages can’t have different delays, no dependency among consecutive instructions and sharing of hardware resources should not be there.
T1 ≤ T2
T1 ≥ T2
T1 < T2
T1 is T2 plus the time taken for one instruction fetch cycle
Pipelining is an implementation technique where multiple instructions are overlapped in execution. It has a high throughput (amount of instructions executed per unit time). In pipelining, many instructions are executed at the same time and execution is completed in fewer cycles. The pipeline is filled by the CPU scheduler from a pool of work which is waiting to occur. Each execution unit has a pipeline associated with it, so as to have work pre-planned. The efficiency of pipelining system depends upon the effectiveness of CPU scheduler.
NON- PIPELINING SYSTEM:
All the actions (fetching, decoding, executing of instructions and writing the results into the memory) are grouped into a single step. It has a low throughput.
Only one instruction is executed per unit time and execution process requires more number of cycles. The CPU scheduler in the case of non-pipelining system merely chooses from the pool of waiting work when an execution unit gives a signal that it is free. It is not dependent on CPU scheduler.
Theory Explanation is given below.