TY - GEN
T1 - Selectively GPU cache bypassing for un-coalesced loads
AU - Zhao, Chen
AU - Wang, Fei
AU - Lin, Zhen
AU - Zhou, Huiyang
AU - Zheng, Nanning
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - GPUs are widely used to accelerate general-purpose applications and can hide memory latency through massive multithreading. However, multithreading can increase contention for the L1 data cache (L1D). This problem is exacerbated when an application contains irregular memory references, which lead to un-coalesced memory accesses. In this paper, we propose a simple yet effective GPU cache Bypassing scheme for Un-Coalesced Loads (BUCL). BUCL makes bypassing decisions at two granularities. At the instruction level, when the number of memory accesses generated by a non-coalesced load instruction exceeds a threshold, referred to as the threshold of un-coalescing degree (TUCD), all the accesses generated from this load bypass the L1D. The reason is that cache data filled by un-coalesced loads typically has a low probability of being reused. At the level of individual memory accesses, when the L1D is stalled, the accessed data likely has low locality; if the utilization of the target memory sub-partition is also not high, the access may bypass the L1D as well. Our experiments show that BUCL achieves 36% and 5% performance improvements over the baseline GPU for memory un-coalesced and memory coherent benchmarks, respectively, and also significantly outperforms prior GPU cache bypassing and warp throttling schemes.
AB - GPUs are widely used to accelerate general-purpose applications and can hide memory latency through massive multithreading. However, multithreading can increase contention for the L1 data cache (L1D). This problem is exacerbated when an application contains irregular memory references, which lead to un-coalesced memory accesses. In this paper, we propose a simple yet effective GPU cache Bypassing scheme for Un-Coalesced Loads (BUCL). BUCL makes bypassing decisions at two granularities. At the instruction level, when the number of memory accesses generated by a non-coalesced load instruction exceeds a threshold, referred to as the threshold of un-coalescing degree (TUCD), all the accesses generated from this load bypass the L1D. The reason is that cache data filled by un-coalesced loads typically has a low probability of being reused. At the level of individual memory accesses, when the L1D is stalled, the accessed data likely has low locality; if the utilization of the target memory sub-partition is also not high, the access may bypass the L1D as well. Our experiments show that BUCL achieves 36% and 5% performance improvements over the baseline GPU for memory un-coalesced and memory coherent benchmarks, respectively, and also significantly outperforms prior GPU cache bypassing and warp throttling schemes.
KW - Cache Bypassing
KW - Data Cache
KW - GPU
KW - Memory divergence
KW - Un-Coalesced Load Instruction
UR - https://www.scopus.com/pages/publications/85018495148
U2 - 10.1109/ICPADS.2016.0122
DO - 10.1109/ICPADS.2016.0122
M3 - Conference contribution
AN - SCOPUS:85018495148
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 908
EP - 915
BT - Proceedings - 22nd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2016
A2 - Liao, Xiaofei
A2 - Lovas, Robert
A2 - Shen, Xipeng
A2 - Zheng, Ran
PB - IEEE Computer Society
T2 - 22nd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2016
Y2 - 13 December 2016 through 16 December 2016
ER -