Tuesday, 9 October 2018

Tracy Profiler 0.4

The new version of Tracy Profiler has been released. Learn more about it in this video:




Complete release notes:

- Renamed "standalone" utility to "profiler".
- Added trace update utility, which will convert files saved in previous
  versions of tracy to be up-to-date.
  - Optional high compression (--hc) mode is available that will increase
    the compression level, at the cost of considerably longer compression
    time.
- Fix regression causing varying size of profiler window for different
  captures.
- Added support for on-demand tracing.
  - If a client application is compiled with the TRACY_ON_DEMAND macro
    defined, tracing will not begin until a connection to server is
    established.
  - Since data is not fully captured in this mode, the resulting trace will
    be less precise, until application state is appropriately reset. For
    example, locks need to be fully released, zone stacks need to be
    flushed. This is an automatic process.
  - All tracing macros are able to work in the on-demand mode.
- Improved compatibility with various system setups.
- Aside from using TRACY_NO_EXIT define you can also set the same-named
  environmental variable to 1 to get the same effect.
- Added ability to show/hide all threads and plots.
- Performance improvements.
- Improvements to memory data presentation.
  - Added memory allocation info window.
  - Selecting memory allocation on a plot will draw time range of the
    allocation.
  - Middle clicking on a memory allocation address (or on a button in
    memory allocation info window) will zoom the view to the allocation
    range.
- Find zone menu improvements:
  - Zones can now also be grouped by call stacks.
  - Zone groups can now also be sorted by time spent in each zone.
  - Zone groups list now displays group times.
  - Average and median zone times are now displayed on the histogram.
  - Selected zones will be highlighted on the timeline view.
- Added named versions of tracing macros that allow specifying scoped
  variable name.
- The main profiler window is now kept at the bottom of windows stack.
- The "profiler" utility will now use a custom embedded font.
- Microseconds are now displayed using correct symbol ('μ' instead of 'u').
- Unix builds of the "profiler" utility will now ask for a file name when
  saving a trace.
- Progress popup is now displayed when a trace file is loading.
- Zones that share source location with a zone that is hovered over are now
  highlighted.
- Added ability to zoom-in to a selection range made using middle mouse
  button.
  - Holding the ctrl key will switch to zoom-out mode.
- The "profiler" utility will use less resources when its window is
  out-of-focus or minimized.
- Added support for cross-DLL profiling.
- Items in options menu (locks, threads, etc.) are now described with number
  of events.
  - Source location of lock declaration is also provided.
- Created an extensive user manual for the profiler.
- Added ability to capture multiple frame sets.
  - Viewer will display multiple frame ranges at once.
  - Only one frame set can be active at once. The selected one is used for
    the frame navigation graph, frame navigation buttons and drawing frame
    separators.
  - The active frame set will be highlighted, and the rest will be dimmed
    out.
  - Frames can now also be discontinuous.
- Frames and zones too small to be displayed will be marked with a zig-zag
  pattern.
- General improvements to message list and message markers.
  - Hovering over a message in the list will highlight its marker (previously
    it only worked the other way).
  - Left clicking on a message marker will focus the message list on the
    selected message.
  - Middle clicking on a message marker will center it on screen.
- Added trace information window.
  - This includes frame time statistics and histogram.
- Displayed memory sizes are now properly formatted.
- Added call stack tree for memory allocations.
  - You can display allocations list for each call stack tree entry.
- The source code of the profiled application may now be viewed in the
  profiler.
  - BIG FAT WARNING: The actual profiled program source code is not known to
    the profiler. It only checks if there is a file on your disk that
    matches the file name of the captured source location. Even if the file
    is displayed, it may be out of date.
  - CPU and GPU zones will have a "Source" button if the source file can be
    opened.
  - Source files for call stack traces can be opened by right-clicking on
    the file name. Since in this case there is no button that can be hidden,
    a small animation will be played to notify user if the source cannot be
    opened.
- The main profiler view will now occupy the whole window. Previous behavior
  is still available for embedded use cases.
- Many button labels are now accompanied by icons.
- Fonts should now be less blurry.
- "Go to parent" button in zone info window won't be displayed if there is
  no parent to go to.
- Improvements to the compare traces menu.
  - There are now colored markers to make it easier to distinguish "this" and
    "external" traces.
  - The amount of saved time is now displayed (a difference between total
    run times of both traces).
- Tracy will now collect host information, like CPU name, amount of system
  memory, etc.
- Windows builds of the "profiler" utility will perform a check of supported
  CPU instruction set and match it against the one required by the binary
  (by default AVX2 is used). If the program cannot be executed on the
  processor, a message dialog with workaround instructions will be
  displayed.
- Tracy can intercept crashes and finish sending data from a dying process.
  - Currently this is only implemented on Windows, Linux and Android.
- Call stack window may now display addresses of the frames, instead of
  source file locations.
- Memory events will now properly register their thread.
- Profiler settings are now stored in a persistent location.
  - On Windows settings are stored in %APPDATA%/tracy.
  - On other platforms settings are stored in $XDG_CONFIG_HOME/tracy or
    $HOME/.config/tracy, if the variable is not set.
  - The main profiler window position, size and maximized state are saved
    and restored.
  - The size and position of internal windows no longer depend on the
    runtime directory of the profiler executable.
- Added connection handshake.
  - Server won't be able to connect to client if there's a protocol version
    mismatch.
  - Client not in on-demand mode will refuse connections after the first
    connection was made and the initial event buffers were cleared.
- A single server will no longer try to connect to multiple clients.
- The capture utility will now display time span of the ongoing capture.
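
The settings location rule from the notes above ($XDG_CONFIG_HOME/tracy, or $HOME/.config/tracy if the variable is not set) can be sketched roughly as below. This is not the profiler's actual code. The function names are made up for illustration, and treating an empty variable as unset and falling back to "." when $HOME is missing are my assumptions.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not taken from the tracy sources): picks the settings
// directory from the two environment values used on non-Windows platforms.
std::string ResolveSettingsDir( const char* xdgConfigHome, const char* home )
{
    // Assumption: an empty XDG_CONFIG_HOME is treated the same as an unset one.
    if( xdgConfigHome && *xdgConfigHome )
    {
        return std::string( xdgConfigHome ) + "/tracy";
    }
    // Assumption: fall back to the current directory if HOME is missing too.
    return std::string( home ? home : "." ) + "/.config/tracy";
}

// Convenience wrapper that reads the real environment.
std::string SettingsDir()
{
    return ResolveSettingsDir( std::getenv( "XDG_CONFIG_HOME" ), std::getenv( "HOME" ) );
}
```

Keeping the decision in a function that takes plain strings makes the rule easy to check without touching the process environment.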

Wednesday, 11 July 2018

etcpak 0.6

A new version of etcpak has been released. There are a couple of small changes in 0.6, but the main one is newly added support for compressing ETC2 RGBA textures.

Example compression result (only showing alpha channel):


16K x 16K image benchmark:
ETC1: 113 ms (only RGB part)
ETC2 RGB: 213 ms (only RGB part)
ETC2 RGBA: 404 ms
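
To put the timings above into perspective, they can be converted to throughput. The helper below is only a sketch. It assumes that "16K" means 16384 pixels per side, and the function name is made up.

```cpp
#include <cstdint>

// Hypothetical helper: megapixels per second for a square image with `side`
// pixels per edge that was compressed in `ms` milliseconds.
constexpr double MegapixelsPerSecond( uint64_t side, double ms )
{
    return double( side * side ) / ( ms / 1000.0 ) / 1e6;
}
```

Under that assumption, MegapixelsPerSecond( 16384, 113.0 ) gives the ETC1 figure, and the same call with 213.0 or 404.0 gives the ETC2 RGB and ETC2 RGBA ones.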

Tracy Profiler 0.3

A new version of tracy has been released. A short summary of the new features:


Complete list of features:

- Breaking change: the format of trace files has changed.
  - Previous tracy version will crash when trying to open new traces.
  - Loading of traces saved by previous version is supported.
  - Tracy will no longer crash when trying to load traces saved by future
    versions. Instead, a dialog advising to update will be displayed.
  - Tracy will no longer crash in most cases when trying to open files that
    are not traces. Some crashes are still possible, due to support of old,
    header-less traces.
- Ability to track every memory allocation in profiled program.
  - Allocation event queuing must be done in order, which requires exclusive
    access to the serialized queue on the client side. This has no effect on
    the rest of events, which are stored in a concurrent queue, as before.
  - You can search for a memory address and see where it was allocated, for
    how long, etc. This lists all matching allocations since the program was
    started.
  - All active (non-freed) allocations may be listed. This shows the current
    memory state by default, but can go back to any point in time.
  - Graphical representation of process memory map may be displayed. New
    allocations/frees are displayed in a bright color and fade out with
    time. This feature also can look back in time.
  - Memory usage plot is automatically generated.
  - Basic allocation information is displayed in memory plot tooltips.
  - A summary of memory events within a zone (and its children) is now
    printed in zone info window.
- Support loading profile dumps with no memory allocation data (generated by
  v0.2).
- Added ability to display global statistics of a selected zone from the
  zone info window.
- Fixed regression with lock announce processing that appeared during
  worker/viewer split.
- Allow selecting/unselecting all locks for display.
- Performance improvements.
- Don't save unneeded lock information in trace file.
- Don't save trash in message list data.
- Allow expanding view span up to one hour, instead of one minute.
- Added trace comparison window.
  - An external trace has to be loaded first.
  - Zone query in both traces (current and external).
  - Both results are overlaid on the same histogram.
  - Graphs can be adjusted as if the same number of zones had been
    collected.
- Read time directly from a hardware register on ARM/ARM64, if possible.
  - User-space access to the timer needs to be enabled in the kernel, so
    tracy will perform run-time checks and fall back to the old method if the
    check fails.
- Prevent connections in a TIME-WAIT state from blocking new listen
  connections.
- Display y-range of plots.
- Added ability to unload traces loaded from files. To do so close the main
  profiler window. You will return to the connect/open selection dialog.
  Live captures cannot be terminated this way.
- Zones previously displayed in zone info window are remembered and you can
  go back to them. Closing the zone info window or switching between CPU and
  GPU zones will clear the memory.
- Improved message list window.
  - Messages are now displayed in columns.
  - Originating thread of each message is now included in the list.
- You can now navigate to next and previous frame.
- Zone statistics can now be displayed using only self times.
- Support for tracing GPU events using Vulkan.
- Timeline will now display "OpenGL context" or "Vulkan context" instead of
  "GPU context".
- Fixed regression causing invalid display of GPU context appearance time.
- Fixed regression causing invalid reporting of an active CPU in zone end
  events, if MSVC rdtscp optimization was not enabled.
- Ability to collect true call stacks.
  - Supported on Windows, Linux, Android.
  - The following events can collect call stacks:
    - Memory alloc/free.
    - Zone begin.
    - GPU zone begin.
  - Zone stack trace now also displays frames from a real call trace.
  - On Linux call stack frame name resolution requires a call to dladdr,
    which in turn requires linking with libdl.
- Allow manual entry of GPU time drift value.
- Unix build system no longer shares object files between different build
  units.
  - Fixes inability to build debug and release versions of a single utility
    without "make clean".
  - Fixes incompatibility between "standalone" and "capture" utilities due
    to different set of used feature flags.
- On Windows "standalone" utility now adapts to system DPI setting.
- Optional per-call zone naming.
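
The "active allocations at any point in time" feature from the list above boils down to replaying the ordered stream of alloc/free events up to the selected time. Below is a minimal sketch of that idea. It is not the data structure the profiler really uses, and the MemEvent struct and ActiveAt function are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical event record, for illustration only.
struct MemEvent
{
    int64_t time;     // timestamp of the event
    uint64_t ptr;     // address returned by the allocator
    uint64_t size;    // allocation size, ignored for frees
    bool isFree;      // false = allocation, true = free
};

// Replays events (assumed to be sorted by time) up to and including `until`
// and returns the allocations that are still live, as address -> size.
std::map<uint64_t, uint64_t> ActiveAt( const std::vector<MemEvent>& events, int64_t until )
{
    std::map<uint64_t, uint64_t> live;
    for( const auto& ev : events )
    {
        if( ev.time > until ) break;
        if( ev.isFree )
        {
            live.erase( ev.ptr );
        }
        else
        {
            live[ev.ptr] = ev.size;
        }
    }
    return live;
}
```

The replay only gives correct results if a free can never be seen before its matching allocation, which is why the memory events have to be queued in order.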

Thursday, 29 March 2018

Introduction to the tracy profiler

A short feature presentation and integration guide for the tracy profiler.


Saturday, 6 January 2018

Tracy frame profiler

Tracy is a real time, nanosecond resolution frame profiler that can be used for remote or embedded telemetry of your application. It can profile both CPU (C++, Lua) and GPU (OpenGL). It also can display locks held by threads and their interactions with each other.



Tracy requires compiler support for C++11, Thread Local Storage and a way to work around the static initialization order fiasco. There are no other requirements. The following platforms are confirmed to be working (this is not a complete list):
  • Windows (x86, x64)
  • Linux (x86, x64, ARM, ARM64)
  • Android (ARM, x86)
  • FreeBSD (x64)
  • Cygwin (x64)
  • WSL (x64)
  • OSX (x64)
The following compilers are supported:
  • MSVC
  • gcc
  • clang

Source code and more information: https://bitbucket.org/wolfpld/tracy

A quick FAQ:

Q: I already use VTune/perf/Very Sleepy/callgrind/MSVC profiler. 
A: These are statistical profilers, which can be used to find hot spots in the code. This is very useful, but it won't show you the underlying reason for semi-random frame stutter that may occur every couple of seconds.

Q: You can use Telemetry for that.
A: A Telemetry license costs about $8000 per year. Tracy is open source software. Telemetry doesn't have Lua bindings.

Q: You can use the free Brofiler. Crytek does use it, so it has to be good.
A: After a cursory look at the Brofiler code I can tell that the timer resolution there is at 300 ns. Tracy can achieve 5 ns timer resolution. Brofiler event logging infrastructure seems to be overengineered. Brofiler can't track lock contention, nor does it have Lua bindings.

Q: So tracy is supposedly faster?
A: My measurements show that logging a single zone with tracy takes only 15 ns. In theory, if the program was doing nothing else, tracy should be able to log 66 million zones per second.

Q: Bullshit, RAD is advertising that they are only able to log about a million zones, and over the network at that: "Capture over a million timing zones per second in real-time!"
A: Tracy can perform network transfer of 15 million zones per second. Should the client and server be on separate machines, this number will be even higher, but you will need more than a gigabit link to achieve the maximum throughput. https://www.youtube.com/watch?v=DSMIHShKGAc

Q: Can I connect to my application at any time and start profiling at this moment?
A: No, all events are registered from the beginning of the program's execution and are waiting in a queue.

Q: Am I seeing correctly that the profiler allocates one gigabyte of memory per second?
A: Only in extreme cases. Normal usage has much lower memory pressure.

Q: Why do you do magic with the static initialization order? Everyone says that's a bad practice.
A: It allows tracking construction of static objects and memory allocations performed before main() is entered.

Q: There's no support for consoles.
A: Welp. But there's mobile support.

Q: I do need console support.
A: The code is open. Write your own, then send a patch.
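
The "66 million zones per second" figure from the FAQ above is just one second divided by the 15 ns cost of a single zone. A tiny sketch of that arithmetic follows. The function name is made up, and the result is a theoretical ceiling, not a measurement.

```cpp
#include <cstdint>

// Hypothetical helper: how many zones fit in one second if each zone costs
// `nsPerZone` nanoseconds and the program does nothing else.
constexpr uint64_t ZonesPerSecond( uint64_t nsPerZone )
{
    return 1000000000ull / nsPerZone;
}

// 1 s / 15 ns, rounded down.
static_assert( ZonesPerSecond( 15 ) == 66666666ull, "about 66 million zones per second" );
```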

The following is the annotated assembly code (generated from C++ sources) that's responsible for logging the start of a zone:
call        qword ptr [__imp_GetCurrentThreadId]
mov         r14d,eax
mov         qword ptr [rsp+0F0h],r14        // save thread id for later use
mov         r12d,10h
mov         rax,qword ptr gs:[58h]          // TLS
mov         r15,qword ptr [rax]             // queue address
mov         rdi,qword ptr [r12+r15]         // data address
mov         rbp,qword ptr [rdi+20h]         // buffer counter
mov         rbx,rbp
and         ebx,7Fh                         // 128 item buffer
jne         Application::InnerLoop+66h --+
mov         rdx,rbp                      |
mov         rcx,rdi                      |
call        enqueue_begin_alloc          |  // reclaim/alloc next buffer
shl         rbx,5  <---------------------+  // buffer items are 32 bytes
add         rbx,qword ptr [rdi+40h]
mov         byte ptr [rbx],4                // queue item type
rdtscp
mov         dword ptr [rbx+19h],ecx         // cpu id
shl         rdx,20h
or          rax,rdx                         // 64 bit timestamp
mov         qword ptr [rbx+1],rax
mov         qword ptr [rbx+9],r14           // thread id
lea         rax,[__tracy_source_location]   // static struct address
mov         qword ptr [rbx+11h],rax
lea         rax,[rbp+1]                     // increment buffer counter
mov         qword ptr [rdi+20h],rax
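
For readers who don't speak assembly, here is a rough C++ reconstruction of what the listing above does. It is a sketch, not the real tracy source: the ZoneQueue type and its members are hypothetical, the block storage is simplified to a vector of heap buffers, and the timestamp and CPU id are passed in as parameters instead of being read with rdtscp. The item offsets, the 128 item block size and the 32 byte item size are taken from the comments in the listing.

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

constexpr uint64_t BlockItems = 128;   // matches `and ebx,7Fh`
constexpr uint64_t ItemSize = 32;      // matches `shl rbx,5`

// Hypothetical single-producer queue, for illustration only.
struct ZoneQueue
{
    std::vector<std::unique_ptr<uint8_t[]>> blocks;
    uint8_t* current = nullptr;   // start of the block being filled
    uint64_t counter = 0;         // total number of items written so far

    // Returns the slot for the next item. A new block is started every 128 items.
    uint8_t* NextSlot()
    {
        const uint64_t idx = counter & ( BlockItems - 1 );
        if( idx == 0 )
        {
            // Stands in for the `enqueue_begin_alloc` call in the listing.
            blocks.push_back( std::unique_ptr<uint8_t[]>( new uint8_t[BlockItems * ItemSize]() ) );
            current = blocks.back().get();
        }
        return current + idx * ItemSize;
    }

    void ZoneBegin( uint64_t timestamp, uint64_t threadId, const void* srcloc, uint32_t cpu )
    {
        uint8_t* item = NextSlot();
        const uint64_t loc = reinterpret_cast<uintptr_t>( srcloc );
        item[0] = 4;                                  // queue item type
        std::memcpy( item + 0x01, &timestamp, 8 );    // 64 bit timestamp
        std::memcpy( item + 0x09, &threadId, 8 );     // thread id
        std::memcpy( item + 0x11, &loc, 8 );          // static struct address
        std::memcpy( item + 0x19, &cpu, 4 );          // cpu id
        counter++;                                    // increment buffer counter
    }
};
```

The common path is a mask, a shift, a handful of stores and a counter increment. The allocation branch is only taken once per 128 zones.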