Saturday, 6 January 2018

Tracy frame profiler

Tracy is a real time, nanosecond resolution frame profiler that can be used for remote or embedded telemetry of your application. It can profile both CPU (C++, Lua) and GPU (OpenGL). It also can display locks held by threads and their interactions with each other.

Tracy requires compiler support for C++11, Thread Local Storage and a way to workaround static initialization order fiasco. There are no other requirements. The following platforms are confirmed to be working (this is not a complete list):
  • Windows (x86, x64)
  • Linux (x86, x64, ARM, ARM64)
  • Android (ARM, x86)
  • FreeBSD (x64)
  • Cygwin (x64)
  • WSL (x64)
  • OSX (x64)
The following compilers are supported:
  • MSVC
  • gcc
  • clang

Source code and more information:

A quick FAQ:

Q: I already use VTune/perf/Very Sleepy/callgrind/MSVC profiler. 
A: These are statistical profilers, which can be used to find hot spots in the code. This is very useful, but it won't show you the underlying reason for semi-random frame stutter that may occur every couple of seconds.

Q: You can use Telemetry for that.
A: Telemetry license costs about 8000 $ per year. Tracy is open source software. Telemetry doesn't have Lua bindings. 

Q: You can use the free Brofiler. Crytek does use it, so it has to be good.
A: After a cursory look at the Brofiler code I can tell that the timer resolution there is at 300 ns. Tracy can achieve 5 ns timer resolution. Brofiler event logging infrastructure seems to be overengineered. Brofiler can't track lock contention, nor does it have Lua bindings.

Q: So tracy is supposedly faster?
A: My measurements show that logging a single zone with tracy takes only 15 ns. In theory, if the program was doing nothing else, tracy should be able to log 66 million zones per second.

Q: Bullshit, RAD is advertising that they are able only to log about a million zones, over the network nevertheless: "Capture over a million timing zones per second in real-time!"
A: Tracy can perform network transfer of 15 million zones per second. Should the client and server be on separate machines, this number will be even higher, but you will need more than a gigabit link to achieve the maximum throughput.

Q: Can I connect to my application at any time and start profiling at this moment?
A: No, all events are registered from the beginning of program's execution and are waiting in a queue.

Q: Am I seeing correctly that the profiler allocates one gigabyte of memory per second?
A: Only in extreme cases. Normal usage has much lower memory pressure.

Q: Why do you do magic with the static initialization order? Everyone says that's a bad practice.
A: It allows tracking construction of static objects and memory allocations performed before main() is entered.

Q: There's no support for consoles.
A: Welp. But there's mobile support.

Q: I do need console support.
A: The code is open. Write your own, then send a patch.

Following is the annotated assembly code (generated from C++ sources) that's responsible for logging start of the zone:
call        qword ptr [__imp_GetCurrentThreadId]
mov         r14d,eax
mov         qword ptr [rsp+0F0h],r14        // save thread id for later use
mov         r12d,10h
mov         rax,qword ptr gs:[58h]          // TLS
mov         r15,qword ptr [rax]             // queue address
mov         rdi,qword ptr [r12+r15]         // data address
mov         rbp,qword ptr [rdi+20h]         // buffer counter
mov         rbx,rbp
and         ebx,7Fh                         // 128 item buffer
jne         Application::InnerLoop+66h --+
mov         rdx,rbp                      |
mov         rcx,rdi                      |
call        enqueue_begin_alloc          |  // reclaim/alloc next buffer
shl         rbx,5  <---------------------+  // buffer items are 32 bytes
add         rbx,qword ptr [rdi+40h]
mov         byte ptr [rbx],4                // queue item type
mov         dword ptr [rbx+19h],ecx         // cpu id
shl         rdx,20h
or          rax,rdx                         // 64 bit timestamp
mov         qword ptr [rbx+1],rax
mov         qword ptr [rbx+9],r14           // thread id
lea         rax,[__tracy_source_location]   // static struct address
mov         qword ptr [rbx+11h],rax
lea         rax,[rbp+1]                     // increment buffer counter
mov         qword ptr [rdi+20h],rax