Throughput is defined as how much data can be transferred, or how much work can be done, per unit time. The more data transferred or work completed in that time, the higher the throughput. Performance, bandwidth, and throughput are sometimes used interchangeably. For a CPU, throughput is how many instructions can be executed per unit time. For a DDR memory, it is how much data can be written to or read from the memory per unit time.
Another term that comes up alongside throughput is latency. As we will see later, latency and throughput are not the same. When we talk about throughput, we generally mean sustained throughput. To determine sustained throughput, we measure the amount of data transferred or the number of instructions executed over a period of time. This gives an average rather than an instantaneous value, which may be momentarily high or low.
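As a minimal illustration (the numbers below are made up for the example), sustained throughput is simply the total work divided by the length of the measurement window, so momentary bursts or stalls do not skew the figure:

    def sustained_throughput(transfer_sizes_bytes, elapsed_seconds):
        # Average over the whole window rather than any single transfer.
        return sum(transfer_sizes_bytes) / elapsed_seconds   # bytes per second

    # Example: 1000 transfers of 4 KiB completed in 0.5 s
    bps = sustained_throughput([4096] * 1000, 0.5)
    print(f"sustained throughput: {bps / 1e6:.2f} MB/s")     # ~8.19 MB/s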
Latency, on the other hand, is the minimum amount of time it takes to get your first result. For example, suppose you have planted a lemon tree and are waiting for the fruit. It will probably take a year to get the first batch. There is not much you can do to reduce this latency (the one-year period). However, you can increase your throughput of lemon production by planting 10 trees. You will get 10 times the fruit, but you still have to wait one year before the first batch arrives. There are ways to reduce latency, but often there is not much you can do. We will talk about ways to reduce latency later.
In some applications throughput is the deciding factor, and in others low latency is critical. For example, when writing a large burst of data to DRAM, high throughput is what matters. On the other hand, in online transactions where a server is processing random IO operations, it is critical that the storage system serve those random IO requests with the least amount of latency; the results need to come back as soon as possible.
When we architect a system or a chip, we strive for high throughput, and there are various knobs that can be used to increase it. If we understand how these knobs work, not only by themselves but also with one another, we can choose the best possible combination to achieve the desired goals. Here are the knobs that can be turned to increase throughput.
Modern digital systems are synchronous systems that operate on one or more clocks. One obvious way to increase the number of instructions executed or the amount of data transferred is to use a higher-frequency clock. This was the philosophy of processor design during the years when clock frequencies kept climbing. However, there is a side effect: dynamic power consumption is directly proportional to the operating clock frequency. If you are designing a low-power application, increasing the frequency to meet the throughput requirement may not be a sound idea; you probably need to look at the other knobs instead of bumping up the clock. The current trend in processor design is not solely focused on clock frequency. It looks at overall system throughput, for example by putting in multiple processor cores and by using multiple memory channels.
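To make the frequency and power trade-off concrete, the sketch below uses the standard CMOS dynamic power relation P = alpha * C * V^2 * f (switching activity, switched capacitance, supply voltage, clock frequency). The numbers are purely illustrative, not taken from any particular design:

    def dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
        # Classic CMOS dynamic power estimate: P = alpha * C * V^2 * f
        return alpha * cap_farads * vdd_volts ** 2 * freq_hz

    # Doubling the clock doubles dynamic power (and in practice a higher
    # frequency often needs a higher Vdd as well, which hurts even more).
    p_1ghz = dynamic_power(alpha=0.2, cap_farads=1e-9, vdd_volts=1.0, freq_hz=1e9)
    p_2ghz = dynamic_power(alpha=0.2, cap_farads=1e-9, vdd_volts=1.0, freq_hz=2e9)
    print(p_1ghz, p_2ghz)   # 0.2 W vs 0.4 W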
You can haul more data by making the datapath wider. At the same clock frequency, a wider datapath increases throughput, just as more cars can travel on a 4-lane road than on a 2-lane road. PCI Express, for example, allows the link to be 1 lane, 2 lanes, or up to 32 lanes wide. Let us say you are designing a SAS PCIe controller card or a PCIe SSD card. Which PCI Express link width do you need? First, figure out the HDD or SSD bandwidth you will have on the storage side. Then decide whether you need an x1, x2, x4, or even x8 PCIe connection. The bandwidth of the PCI Express link needs to be equal to or more than the bandwidth you are targeting on the storage side. In fact it needs to be more, because you must factor in the overhead of the PCIe protocol; as with any protocol, not all of the bits transferred carry payload data.
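As a rough sizing sketch (the per-lane rate and efficiency figure below are illustrative assumptions, not numbers from the text), you can estimate the minimum link width by dividing the target storage bandwidth by the usable bandwidth of one lane:

    import math

    def min_pcie_lanes(target_mb_per_s, lane_raw_mb_per_s, protocol_efficiency):
        # Usable bandwidth per lane after encoding and packet overhead.
        usable_per_lane = lane_raw_mb_per_s * protocol_efficiency
        lanes_needed = math.ceil(target_mb_per_s / usable_per_lane)
        # PCIe links come in fixed widths: x1, x2, x4, x8, x16, x32.
        for width in (1, 2, 4, 8, 16, 32):
            if width >= lanes_needed:
                return width
        raise ValueError("target exceeds a x32 link")

    # Example: ~1.5 GB/s of SSD bandwidth on a Gen2-class link
    # (roughly 500 MB/s raw per lane), assuming ~80% is usable after overhead.
    print(min_pcie_lanes(1500, 500, 0.8))   # -> 4, i.e. an x4 link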
One of the dilemmas often encountered is determining the width of the internal datapath. Based on the system requirements, the bandwidth at the connector is decided. Now we need to choose the internal datapath width and frequency. Do I make the internal datapath 64 bits wide at 800 MHz, 128 bits at 400 MHz, or 256 bits at 200 MHz? With a smaller process geometry, you can use a higher-frequency clock and still meet timing. A lower frequency is easier to work with for timing closure, but the datapath has to be wider. Designs with a wider datapath are typically larger in terms of silicon area (gate count). A wider datapath at a lower frequency also makes it harder to process back-to-back data packets.
Consider a scenario where back-to-back packets are each 64 bits wide, so a wide, slow datapath can receive four of these packets in a single clock period. Processing this kind of traffic with a slower clock becomes tricky and complicated. A slower clock also means larger latency, since it takes more time to process the data. If gate count is not a concern and latency is not a big factor, one can manufacture the chip in a relatively older process to keep the cost down; the slower clock provides a bigger clock period in which to meet timing in the slower process. If you want to reduce latency and your budget permits the newest process geometry, going with a higher frequency and a narrower datapath is the better choice.
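Note that the three width/frequency options above all carry the same raw bandwidth; a quick calculation using the numbers from the example makes that explicit:

    def raw_bandwidth_gb_per_s(width_bits, freq_mhz):
        # bytes per second = (bits / 8) * cycles per second
        return (width_bits / 8) * (freq_mhz * 1e6) / 1e9

    for width, freq in ((64, 800), (128, 400), (256, 200)):
        print(width, "bits @", freq, "MHz ->",
              raw_bandwidth_gb_per_s(width, freq), "GB/s")   # 6.4 GB/s in each case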
All modern processors use pipelining to increase throughput. We have a detailed discussion of pipeline concepts in a separate section. If an instruction or product takes n stages to complete after we start the process, we call it an n-stage pipeline. When the first instruction moves to the second stage, we can start a new instruction, and so on. After the first n cycles, a new instruction completes every clock cycle. If we do not pipeline, and start the next instruction or product only when the first one completes, we get only one instruction completed or one product ready every n cycles. The pipeline concept is used not only in processors but also in many system applications. A PCIe IO card can launch memory read requests to main memory one after another before the data for the first read request comes back. While architecting a product, think about how a pipeline can be used to increase throughput. You can use pipeline concepts to build high-speed pipelined adders and multipliers. If you are architecting a protocol, think about how pipelining can help overall throughput.
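A quick way to see the benefit is to count cycles: k instructions through an n-stage pipeline take n + (k - 1) cycles once the pipe is full, versus n * k cycles if each instruction must finish before the next one starts. A minimal sketch:

    def pipelined_cycles(n_stages, k_instructions):
        # First instruction takes n cycles; after that, one completes per cycle.
        return n_stages + (k_instructions - 1)

    def unpipelined_cycles(n_stages, k_instructions):
        # Each instruction waits for the previous one to finish all n stages.
        return n_stages * k_instructions

    n, k = 5, 1000
    print(pipelined_cycles(n, k), unpipelined_cycles(n, k))   # 1004 vs 5000 cycles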
Parallel processing is different from pipelining. In parallel processing, more resources are used at the same time. Some modern processors have multiple execution lines, where each line operates in a pipelined fashion. If each line can execute one instruction per clock period, it completes m instructions per second, where m is the clock frequency. With n such lines, (m)(n) instructions can be processed in one second.
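For example, with an illustrative 2 GHz clock and one instruction per cycle per line (m = 2e9), four lines give a peak of 8e9 instructions per second:

    clock_hz = 2e9        # illustrative: 2 GHz, one instruction per cycle per line
    lines = 4             # n parallel pipelined execution lines
    print(clock_hz * lines)   # 8e9 instructions per second (peak, not sustained)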
A real-life example of multiple pipelined lines is the set of cash registers in a supermarket, Safeway, for example. Each register line is a two-stage pipelined process: one person rings up the goods and the other person bags them. During crunch time, when more people are waiting, more lines are opened up to process more people and more goods.
My wife gave me six errands (written down as a list on a piece of paper) to complete on a beautiful Saturday morning. As long as all six of them get completed within the time period, she is happy. She does not care which one I do first, which one next, and so on, unless they depend on one another. I look at the list, make a mental map of the places I need to go, come up with an order, and then complete the tasks. My reordering may be based on the least amount of travel (saving gas), or I may simply leave the errand I like least for the end. That may not be the most efficient way to do it, but the human mind is not necessarily the most efficient engine.
The ability to perform n tasks in any order, rather than in the strict order in which they were presented, provides the opportunity to finish all of them in less time, which increases throughput. The SATA and SAS protocols support a feature called NCQ (Native Command Queuing). The OS sends a set of commands to the hard drive, and the disk drive controller can reorder them based on seek time. The data is stored on the magnetic disk, and depending on the relative position of the read/write head and the data, an individual command may take more or less time. The disk drive controller looks at all the commands given to it and, based on some algorithm (least seek time, for example), comes up with its own order of execution.
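A very simplified model of this kind of reordering is a greedy shortest-seek-first pass over the queued commands. Real NCQ firmware also accounts for rotational position and fairness, so treat the sketch below (the function and names are mine) as illustrative only:

    def reorder_shortest_seek_first(commands, head_position):
        # commands: list of (command_id, target_track)
        # Greedily pick whichever queued command is closest to the head.
        pending = list(commands)
        order = []
        while pending:
            nxt = min(pending, key=lambda cmd: abs(cmd[1] - head_position))
            pending.remove(nxt)
            order.append(nxt)
            head_position = nxt[1]
        return order

    queue = [("A", 900), ("B", 120), ("C", 500), ("D", 130)]
    print(reorder_shortest_seek_first(queue, head_position=100))
    # -> [('B', 120), ('D', 130), ('C', 500), ('A', 900)]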
Modern processors break instructions into smaller micro-operations (micro-ops) and send them to an execution pool, where the micro-ops are executed out of order. However, the instructions are retired in the same order in which they were fetched. Consider an example: instruction 2 has had all of its micro-ops executed, but instruction 1 is still waiting for one piece of information. In this case, instruction 2 cannot be retired and has to wait for instruction 1 to complete. Even so, the flexibility of executing micro-ops from multiple instructions out of order allows an overall shorter completion time and hence higher throughput.
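The sketch below models only the retirement rule: micro-ops may finish in any order, but results are committed strictly in program order, so a finished younger instruction waits behind an unfinished older one. The structure and names are a simplification, not a description of any particular core:

    def retire_in_order(program_order, finished):
        # program_order: instruction ids in fetch order
        # finished: set of ids whose micro-ops have all executed
        retired = []
        for instr in program_order:
            if instr not in finished:
                break                  # an older instruction is still executing
            retired.append(instr)      # safe to commit: all older ones are done
        return retired

    # Instruction 2 finished early, but instruction 1 has not, so nothing retires yet.
    print(retire_in_order([1, 2, 3], finished={2, 3}))     # -> []
    print(retire_in_order([1, 2, 3], finished={1, 2, 3}))  # -> [1, 2, 3]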
Another example is how memory controllers execute memory commands. A PCIe card can send multiple memory read commands to main memory. The memory controller can reorder the reads (based on which banks are open, for example). When it returns the completion data, the order can differ from the order in which the PCIe card issued the reads. The PCIe card must handle this correctly: if it needs the data from an earlier read command, it has to wait for that data to arrive even if completion data from a later read command has already arrived. This flexibility allows the memory controller to get the best throughput.
Processors use memory caches to improve performance. After boot-up, system programs such as the OS and user programs are loaded from the hard drive into RAM. All the instructions are kept in RAM. When the processor wants to execute instructions, it fetches them from memory, and similarly, when it wants to read or write data, it accesses RAM. In theory this is how a computer works, but there is a problem: the access time of RAM is much longer than the processor's clock cycle time or instruction execution time.
The processor therefore uses a smaller memory close to it, where not all of the instructions and data, but some of them, are stored. This smaller, faster memory is called a cache. The cache is closer to the CPU and operates much faster, at nearly the CPU's internal speed. When the processor wants to fetch an instruction or data, it gets it from the cache instead of from memory; it goes out to main memory only when the requested instruction or data is not available in the cache, which should not happen often. Many processor architectures use separate instruction and data caches, and there are typically two or more levels of cache (L1 and L2, for example) between the CPU and main memory.
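A tiny behavioral model shows the basic idea: look in the small, fast structure first and fall back to main memory only on a miss. This sketch is a direct-mapped cache keyed by address, purely for illustration; real caches work on cache lines and add associativity, replacement policies, and write handling:

    class DirectMappedCache:
        def __init__(self, num_sets, backing_memory):
            self.num_sets = num_sets
            self.sets = [None] * num_sets        # each entry: (tag, data) or None
            self.memory = backing_memory         # dict: address -> data
            self.hits = self.misses = 0

        def read(self, address):
            index = address % self.num_sets
            tag = address // self.num_sets
            entry = self.sets[index]
            if entry is not None and entry[0] == tag:
                self.hits += 1                   # fast path: data found in the cache
                return entry[1]
            self.misses += 1                     # slow path: go to main memory
            data = self.memory[address]
            self.sets[index] = (tag, data)       # fill the cache for next time
            return data

    mem = {addr: addr * 10 for addr in range(64)}
    cache = DirectMappedCache(num_sets=8, backing_memory=mem)
    for addr in (3, 3, 3, 11, 3):                # 11 maps to the same set as 3
        cache.read(addr)
    print(cache.hits, cache.misses)              # -> 2 hits, 3 misses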
Caches and caching concepts can be used wherever two components of a system differ greatly in operating speed. Since reading from a hard drive is much slower than reading from DRAM, the drive uses a local cache to store the data it reads from the disk. When a read request comes to the hard drive controller, it can quickly return the data from the local cache rather than taking the longer path of reading from the disk. Caches are also used to improve write performance. When the CPU writes data to main memory, the write takes a long time to complete, and the CPU could be held up, unable to execute further instructions. One option is to have a write cache: the data is written to the buffer, and the CPU is told that the write is complete. The CPU can then continue executing instructions, and the data can be written (flushed) to memory later.
Solid-state drives also use local caches to improve write and read performance. One of the peculiarities of solid-state (flash) drives is that the memory supports only a limited number of write/erase cycles. A write-data cache helps by absorbing the write and giving a quick ACK back to the CPU, which improves write performance. There is another side benefit to having a write cache: if there are multiple writes to the same logical address, we need to write only the last one instead of writing to the flash memory many times. This reduces the overall number of writes to the flash memory and improves the usable life of the SSD.
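A minimal model of such a write cache (the names and structure are mine, not a description of any actual controller): acknowledge the write as soon as it lands in the buffer, keep only the newest data per logical address, and program each address to flash at most once when the buffer is flushed.

    class WriteBackCache:
        def __init__(self, flash_write_fn):
            self.pending = {}                    # logical address -> latest data
            self.flash_write = flash_write_fn

        def write(self, lba, data):
            # Overwrites to the same LBA simply replace the buffered data,
            # so only the final version ever reaches the flash.
            self.pending[lba] = data
            return "ACK"                         # host sees the write as complete

        def flush(self):
            for lba, data in self.pending.items():
                self.flash_write(lba, data)      # one flash program per dirty LBA
            self.pending.clear()

    flash_ops = []
    cache = WriteBackCache(lambda lba, data: flash_ops.append((lba, data)))
    for value in ("v1", "v2", "v3"):
        cache.write(lba=42, data=value)          # three host writes to the same LBA
    cache.flush()
    print(flash_ops)                             # -> [(42, 'v3')], one flash write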
The use of caches is not limited to hardware; it is found in software as well. Search engines crawl the internet and store the information in local server storage. When a new search comes in, the server can provide the results much faster from these web caches instead of going out to the internet.
Pre-fetching is a technique in which more data than is immediately required is accessed and stored in a buffer. Since many data accesses are sequential in nature, subsequent data can then be supplied from the pre-fetch buffer rather than from the slower medium.
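A minimal sequential prefetcher can be sketched as follows (illustrative only): on every demand miss, also fetch the next few blocks into a small buffer, so later sequential reads are served from the buffer instead of stalling on the slow medium.

    class SequentialPrefetcher:
        def __init__(self, read_block_fn, depth=4):
            self.read_block = read_block_fn      # slow access to the medium
            self.depth = depth                   # how many blocks to fetch ahead
            self.buffer = {}                     # block number -> prefetched data
            self.demand_misses = 0

        def read(self, block):
            if block in self.buffer:
                return self.buffer.pop(block)    # served from the prefetch buffer
            self.demand_misses += 1              # must wait for the medium
            data = self.read_block(block)
            for b in range(block + 1, block + 1 + self.depth):
                self.buffer[b] = self.read_block(b)   # fetch ahead speculatively
            return data

    pf = SequentialPrefetcher(read_block_fn=lambda b: f"block-{b}", depth=4)
    for b in range(10):                          # a purely sequential access stream
        pf.read(b)
    print(pf.demand_misses)                      # -> 2 stalls instead of 10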
Modern processors use multiple cores (8 or 16 cores, for example). Each core is a complete CPU, and each core can itself run multiple threads in parallel (multi-threading).
In a high-performance memory controller, there could be multiple DDR controllers working independently on separate memory channels. Also, a high-performance SSD controller can have multiple flash controllers that can access the flash memory chips (channels) independently and simultaneously.