Performance – single-core memory bandwidth

On modern multi-core platforms, the parallel performance of memory-bandwidth-limited applications usually does not scale well with the number of cores. Typically, a speedup is observed up to a certain number of cores, after which performance saturates. The canonical synthetic example is the well-known STREAM benchmark, which is often used to report the achievable memory bandwidth, i.e., the memory bandwidth at the saturation point.

Consider the following STREAM benchmark (Triad) results on a single Xeon E5-2680; the peak memory bandwidth is 42.7 GB/s (DDR3-1333):

1 core    16 GB/s
2 cores   30 GB/s
3+ cores  36 GB/s

STREAM scales well from 1 to 2 cores, but from 3 cores upward the performance stays roughly the same.
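For reference, the Triad kernel that produces these numbers is `a[i] = b[i] + s*c[i]`, and STREAM derives the bandwidth figure as bytes moved divided by elapsed time. The sketch below is illustrative, not the actual STREAM code; the 0.015 s timing is a hypothetical value chosen to reproduce the 16 GB/s single-core figure above:

```python
def triad(a, b, c, scalar):
    """The STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def triad_bytes(n, elem_size=8):
    """Bytes STREAM counts for one Triad pass over arrays of n doubles:
    read b, read c, write a -> 3 arrays of n * 8 bytes."""
    return 3 * n * elem_size

# Hypothetical example: 10M-element arrays finishing one pass in 15 ms
# would be reported as 240e6 bytes / 0.015 s = 16.0 GB/s.
n = 10_000_000
seconds = 0.015  # assumed measurement, chosen to match the figure above
bandwidth_gbs = triad_bytes(n) / seconds / 1e9  # -> 16.0
```

STREAM deliberately counts only the bytes the kernel logically touches; actual DRAM traffic can be higher (e.g. write-allocate reads the destination line before writing it).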

My question is: what determines the memory bandwidth that a single CPU core can achieve? Since that question is certainly too broad, I narrow it down to the architecture above: how could one predict, from the specifications of the E5-2680, or by looking at hardware counters, etc., that a 1-thread STREAM run will get 16 GB/s?

For a single core, the main factors are the CPU frequency and the CPU microarchitecture, that is, how fast the core can issue requests to the bus and how well the CPU predicts which memory locations you are going to access. If memory access is random and code execution depends on the data, you have to take memory-access latency into account; CPU designers go to great lengths to hide the effects of latency and make things look faster than they really are. If you simply read a big chunk of sequential data, you will see something close to the full bandwidth. But for a single core, the absolute upper limit is set by the clock speed.
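One standard back-of-the-envelope model for the single-core limit (not stated in the answer above, but it makes the "how fast the core can issue requests" point concrete) is Little's law: sustained bandwidth ≈ outstanding cache-line requests × line size / memory latency. The numbers below (10 line-fill buffers, 64-byte lines, ~70 ns latency) are illustrative assumptions for a Sandy Bridge-class core, not measured values:

```python
def littles_law_bandwidth(outstanding, line_bytes, latency_s):
    """Sustained bandwidth one core can extract while keeping a fixed
    number of cache-line misses in flight (Little's law: N / latency)."""
    return outstanding * line_bytes / latency_s

# Assumed figures (illustrative only): ~10 L1 line-fill buffers,
# 64-byte cache lines, ~70 ns effective memory latency.
bw = littles_law_bandwidth(outstanding=10, line_bytes=64, latency_s=70e-9)
print(f"{bw / 1e9:.1f} GB/s")  # on the order of 9 GB/s
```

Demand misses alone thus fall short of the observed 16 GB/s; hardware prefetchers effectively raise the number of lines in flight, which is consistent with a single thread exceeding this naive estimate.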

For multi-threaded access, the bottleneck will be the bus and RAM architecture on the motherboard and the northbridge/memory controller. So it depends on your motherboard: you can use DRAM that is 50% slower but run four channels in parallel and come out faster, or vice versa.

However, the question is very broad. If you want to learn more about memory from a programmer's perspective, read "What Every Programmer Should Know About Memory"; it gives an in-depth treatment of the various factors.

This is a very in-depth topic.

PS: as for prediction, it is unlikely to work, or at least not very practical. Measuring is better: unless you have access to very detailed CPU, chipset, motherboard and RAM specifications you cannot calculate it, and even then the result is just an educated guess. For your specific workload, you are better off measuring in real life.
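In the spirit of "just measure it", a crude methodology sketch: time a large memory copy and divide bytes moved by the best elapsed time. This is not STREAM (a pure-Python Triad loop would measure interpreter overhead, so a `bytes()` copy, which runs at C speed, is used instead), and it measures copy bandwidth rather than Triad bandwidth; the buffer size is an assumption meant to exceed the last-level cache:

```python
import time

def copy_bandwidth(n_bytes=256 * 1024 * 1024, repeats=5):
    """Rough measured memory-copy bandwidth: time a large bytes() copy
    and count read + write traffic (2 * n_bytes per copy). Best of
    several repeats reduces timer and warm-up noise."""
    src = bytearray(n_bytes)  # sized to blow out the last-level cache
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full read plus one full write of n_bytes
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best

print(f"{copy_bandwidth() / 1e9:.1f} GB/s")
```

Hardware counters (e.g. uncore/IMC read and write counters via `perf`) give the same answer more directly, including traffic the software byte count misses.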

