research-article
Open access

Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications

Published: 03 August 2020

Abstract

In this article, we first characterize register operand value locality in shader programs of modern gaming applications and observe that there is a high likelihood of one of the register operands of several multiply, logical-and, and similar operations being zero, dynamically. We provide intuition, examples, and a quantitative characterization for how zeros originate dynamically in these programs. Next, we show that this dynamic behavior can be gainfully exploited with a profile-guided code optimization called Zeroploit that transforms targeted code regions into a zero-(value-)specialized fast path and a default slow path. The fast path benefits from zero-specialization in two ways, namely: (a) the backward slice of the other operand of a given multiply or logical-and can be skipped dynamically, provided the only use of that other operand is in the given instruction, and (b) the forward slice of instructions originating at the given instruction can be zero-specialized, potentially triggering further backward slice specializations from operations of that forward slice as well. Such specialization helps the fast path avoid redundant dynamic computations as well as memory fetches, while the fast-slow versioning transform helps preserve functional correctness. With an offline value profiler and manually optimized shader programs, we demonstrate that Zeroploit is able to achieve an average speedup of 35.8% for targeted shader programs, amounting to an average frame-rate speedup of 2.8% across a collection of modern gaming applications on an NVIDIA® GeForce RTX™ 2080 GPU.
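The fast-slow versioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which targets GPU shader programs. It is a plain Python analogy in which every name (`expensive_term`, `shade_default`, `shade_versioned`, `calls`) is hypothetical. It also assumes the skipped operand is always finite, since under IEEE 754 arithmetic zero times infinity or NaN is not zero.

```python
import math

# Hypothetical counter, used only to show how often the backward slice runs.
calls = {"expensive": 0}


def expensive_term(x: float) -> float:
    """Stand-in for the backward slice of the 'other' multiply operand."""
    calls["expensive"] += 1
    return math.sin(x) * math.exp(-x * x)


def shade_default(mask: float, x: float) -> float:
    """Unspecialized code: both operands of the multiply are always computed."""
    term = expensive_term(x)
    return mask * term + 1.0


def shade_versioned(mask: float, x: float) -> float:
    """Fast-slow versioned code, specialized on mask == 0."""
    if mask == 0.0:
        # Fast path: mask * term is known to be zero (assuming term is finite),
        # so (a) the backward slice expensive_term(x) is skipped, and
        # (b) the forward slice 0 + 1.0 is folded to the constant 1.0.
        return 1.0
    # Slow path: the original code, which keeps the result correct
    # whenever the profiled assumption does not hold.
    return shade_default(mask, x)
```

When `mask` is zero, `shade_versioned` returns the same value as `shade_default` without ever calling `expensive_term`. For any other `mask` it falls through to the unmodified code, so the only added cost is the zero check.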

Cited By

  • (2023) VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. DOI: 10.1145/3575693.3576934 (892-904). Online publication date: 27-Jan-2023

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 17, Issue 3
    September 2020
    200 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3415154

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 August 2020
    Online AM: 07 May 2020
    Accepted: 01 April 2020
    Revised: 01 April 2020
    Received: 01 January 2020
    Published in TACO Volume 17, Issue 3

    Author Tags

    1. GPUs
    2. Profile guided optimization
    3. gaming applications
    4. shader programs
    5. value specialization

    Qualifiers

    • Research-article
    • Research
    • Refereed
