Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications

Authors: Mark W. Stephenson, Aditya Ukarande, Marc Blackstein

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 3

Article No.: 17, Pages 1 - 26

https://doi.org/10.1145/3394284

Published: 03 August 2020

Abstract

In this article, we first characterize register operand value locality in the shader programs of modern gaming applications and observe that, for several multiply, logical-and, and similar operations, one of the register operands is very likely to be zero at run time. We provide intuition, examples, and a quantitative characterization of how zeros originate dynamically in these programs. Next, we show that this dynamic behavior can be gainfully exploited with a profile-guided code optimization called Zeroploit, which transforms targeted code regions into a zero-(value-)specialized fast path and a default slow path. The fast path benefits from zero-specialization in two ways: (a) the backward slice of the other operand of a given multiply or logical-and can be skipped dynamically, provided the only use of that other operand is in the given instruction, and (b) the forward slice of instructions originating at the given instruction can be zero-specialized, potentially triggering further backward-slice specializations from operations in that forward slice. Such specialization helps the fast path avoid redundant dynamic computations as well as memory fetches, while the fast-slow versioning transform helps preserve functional correctness. With an offline value profiler and manually optimized shader programs, we demonstrate that Zeroploit achieves an average speedup of 35.8% for targeted shader programs, amounting to an average frame-rate speedup of 2.8% across a collection of modern gaming applications on an NVIDIA® GeForce RTX™ 2080 GPU.
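The fast-path/slow-path versioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which targets shader programs; it is a scalar Python analogy. The names `expensive_term`, `shade_baseline`, and `shade_versioned` are hypothetical, and the sketch assumes all values are finite, since under IEEE 754 a zero multiplied by NaN or infinity does not yield zero.

```python
import math


def expensive_term(x, calls):
    """Hypothetical stand-in for the backward slice of the non-zero operand."""
    calls.append(x)  # record that the slice actually executed
    return math.sin(x) * math.cos(x) + math.sqrt(abs(x))


def shade_baseline(weight, x, calls):
    """Unspecialized code: always computes the expensive operand."""
    term = expensive_term(x, calls)   # backward slice of the multiply's other operand
    contribution = weight * term      # the targeted multiply
    return 1.0 + contribution         # forward slice of the multiply


def shade_versioned(weight, x, calls):
    """Versioned code: a zero-specialized fast path plus the default slow path."""
    if weight == 0.0:
        # Fast path (assumes term would be finite): weight * term folds to 0,
        # so the backward slice is skipped and the forward slice 1.0 + 0
        # folds to the constant 1.0.
        return 1.0
    # Slow path: the original, unspecialized computation.
    term = expensive_term(x, calls)
    return 1.0 + weight * term


if __name__ == "__main__":
    calls = []
    print(shade_versioned(0.0, 2.0, calls), len(calls))  # fast path taken
    print(shade_versioned(0.5, 2.0, calls), len(calls))  # slow path taken
```

In this sketch, skipping `expensive_term` is safe only because its result has no other use, which mirrors condition (a) in the abstract. The slow path is kept unchanged so that non-zero inputs produce the same result as the baseline.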

Cited By

You X, Yang H, Lei K, Luan Z, Qian D, Aamodt T, Jerger N, Swift M (2023). VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 892-904. DOI: 10.1145/3575693.3576934. Online publication date: 27-Jan-2023.
https://dl.acm.org/doi/10.1145/3575693.3576934

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 3

September 2020

200 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3415154

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 August 2020

Online AM: 07 May 2020

Accepted: 01 April 2020

Revised: 01 April 2020

Received: 01 January 2020

Published in TACO Volume 17, Issue 3

Qualifiers

Research-article
Research
Refereed
