Software
- benchmark (Google)
- Celero
- hayai: the C++ benchmarking framework
- moodycamel::microbench
- geiger: A micro benchmark library in C++ that supports hardware performance counters
- Nonius: A C++ micro-benchmarking framework
Readings
Memory
Memory - Benchmarking
- Intel Memory Latency Checker (MLC)
- Memory Bandwidth Benchmark
- MBW determines the "copy" memory bandwidth available to userspace programs. Its simplistic approach models that of real applications. It is not tuned to extremes and it is not aware of hardware architecture, just like your average software package.
- https://github.com/raas/mbw
- pmbw: Parallel Memory Bandwidth Benchmark / Measurement
- Spatter: Benchmark for measuring the performance of sparse and irregular memory access
- STREAM: Sustainable Memory Bandwidth in High Performance Computers
- tinymembench: simple benchmark for memory throughput and latency
Memory - Profiling
- Heaptrack - A Heap Memory Profiler for Linux
- How to Write a Heap Memory Profiler
- MALT & NUMAPROF: Memory Profiling for HPC Applications
- NUMAPROF: a NUMA memory profiler based on Pintool to track remote memory accesses
- MALT: a MALloc Tracker to find where and how you made your memory allocations in C/C++/Fortran applications
- MALT: A Malloc Tracker
- FOSDEM 2019; Sébastien Valat
- Memoro: A Detailed Heap Profiler
- memory-profiler: A memory profiler for Linux
- memtrail: A LD_PRELOAD based memory profiler and leak detector for Linux
- memusage - profile memory usage of a program
- MTuner - a C/C++ memory profiler and memory leak finder for Windows, PlayStation 4, PlayStation 3, etc.
- PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
- Tool for memory performance analysis based on Linux perf.
- https://github.com/helchr/perfMemPlus
- PerfMemPlus: A Tool for Automatic Discovery of Memory Performance Problems
- On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency
- Typegrind
- Valgrind
Microarchitecture
- Tools for microarchitectural benchmarking
- asmbench: A Benchmark Toolkit for Assembly Instructions Using the LLVM JIT
- BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models
- Intel Architecture Code Analyzer (IACA)
- ibench: Measure instruction latency and throughput
- Ithemal: Instruction THroughput Estimator using MAchine Learning
- llvm-exegesis – LLVM Machine Instruction Benchmark
- llvm-mca - LLVM Machine Code Analyzer
- nanoBench: A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs
- Used to run the microbenchmarks that produce the latency, throughput, and port usage data available on http://uops.info
- https://github.com/andreas-abel/nanoBench
- https://uops.info/
- nanoBench Cache Analyzer
- uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures
- nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
- Open Power/Performance Analysis Tool (OPPAT)
- a cross-OS, cross-architecture Power and Performance Analysis Tool
- cross-OS: supports Windows ETW trace files and Linux/Android perf/trace-cmd trace files
- cross-architecture: supports hardware events on Intel and ARM chips (using perf and/or PCM)
- https://patinnc.github.io/
- https://github.com/patinnc/oppat
- OSACA: Open Source Architecture Code Analyzer
- https://github.com/RRZE-HPC/osaca
- https://hpc.fau.de/research/tools/
- Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
- Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) 2018
- Jan Laukemann, Julian Hammer, Johannes Hofmann, Georg Hager, Gerhard Wellein
- https://arxiv.org/abs/1809.00912
- Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
- Cross-Architecture Automatic Critical Path Detection For In-Core Performance Analysis
- PMEvo: Portable Inference of Port Mappings for Out-of-Order Processors by Evolutionary Optimization
- timing-harness: Harness for profiling arbitrary basic blocks.
- uarch-bench: A benchmark for low-level CPU micro-architectural features
Optimization
- BOLT: Binary Optimization and Layout Tool
- A Linux command-line utility used for optimizing performance of binaries
- https://github.com/facebookincubator/BOLT
- Accelerate large-scale applications with BOLT
- Building Binary Optimizer with LLVM
- BOLT: A Practical Binary Optimizer for Data Centers and Beyond
- MAQAO (Modular Assembly Quality Analyzer and Optimizer)
Profiling
- Agner Fog's test programs for measuring clock cycles and performance monitoring
- BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
- Caliper: A Performance Analysis Toolbox in a Library
- Coz: Finding Code that Counts with Causal Profiling
- https://github.com/plasma-umass/coz/
- Charlie Curtsinger, Emery Berger
- SOSP 2015
- ;login: 41(2) (2016)
- Performance Matters - Strange Loop 2019; Emery Berger
- Coz vs. Sampling Profilers
- easy_profiler: Lightweight cross-platform profiler library for C++
- Event Tracing for Windows (ETW) / Windows Performance Toolkit – Xperf
- gperftools (originally Google Performance Tools)
- gprof2dot
- HawkTracer
- Hotspot - the Linux perf GUI for performance analysis
- Likwid: Performance monitoring and benchmarking suite
- microprofile: an embeddable profiler
- Optick: C++ Profiler For Games
- perf
- perf-tools - https://github.com/brendangregg/perf-tools
- perf_events: The Unofficial Linux Perf Events Web-Page
- perfmon2 - http://perfmon2.sourceforge.net/
- "Perfmon2 aims to be a portable interface across all modern processors. It is designed to give full access to a given PMU and all the corresponding hardware performance counters. Typically the PMU hardware implementations use a different number of registers, counters with different length and possibly other unique features, a complexity that the software has to cope with. Although processors have different PMU implementations, they usually use configurations registers and data registers. Perfmon2 provides a uniform abstract model of these registers and exports read/write operations accordingly."
- Performance Application Programming Interface (PAPI)
- pmu tools: Intel PMU profiling tools
- Processor Counter Monitor (PCM)
- Remotery: Single C file, Realtime CPU/GPU Profiler with Remote Web Viewer
- sysdig
- Tracy Profiler
- Tracy is a real time, nanosecond resolution frame profiler that can be used for remote or embedded telemetry of your application. It can profile CPU (C++, Lua), GPU (OpenGL, Vulkan) and memory. It also can display locks held by threads and their interactions with each other.
- https://bitbucket.org/wolfpld/tracy
- Introduction to the Tracy profiler - https://www.youtube.com/watch?v=fB5B46lbapc
Timing
- low-overhead-timers: Very low-overhead timer/counter interfaces for C on Intel 64 processors
Visualization
- Flame Graphs
- FlameScope: a visualization tool for exploring different time ranges as Flame Graphs
- pprof - a tool for visualization and analysis of profiling data
Tags: tool, reading
Last modified 16 December 2024