Skip to content

[PTI-SDK] Device / context-based buffers instead of thread-based buffers #54

@Thyre

Description

@Thyre

Device / context-based buffers instead of thread-based buffers

While continuing to evaluate how we may be able to use PTI-SDK for support of Level Zero as an adapter in Score-P, I've ran into the following issue:

Right now, PTI-SDK collects events for different kinds of activities on accelerators, which can be enabled through ptiViewSetCallbacks. At some point during program execution, the implemented buffer_request function will be called. If requested or when a buffer is full, the SDK may dispatch a callback for buffer evaluation. This is totally fine. However, I noticed a detail, significantly complicating the handling of programs using multiple threads to dispatch events.

To illustrate the issue, we can look at the following (very simple) OpenMP offload program:

int main(void)
{
    #pragma omp parallel num_threads( 2 )
    {
        unsigned long long int x = 0;
        for(int i = 0; i < 10; ++i) {
            #pragma omp target map(tofrom: x)
            {
	        ++x;
	    }
        }
    }
}

We have eight threads working in parallel on a single accelerator. This does work and events are correctly captured by PTI-SDK. Now, lets look at how they are captured.

How PTI-SDK PoC currently captures events

Events can be generally found in view_handler.h. For simplicity, we focus on MemCopyEvent but others follow the same principle.

At the end of the event method, a call to Instance().InsertRecord(...) is being done. This is a templated method with the following code

template <typename T>
inline void InsertRecord(const T& view_record) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "One can only insert trivially copyable types into the "
                  "ViewBuffer (view records)");
    auto& buffer = view_buffers_[std::this_thread::get_id()];

    if (buffer.IsNull()) {
        RequestNewBuffer(buffer);
    }

    buffer.Insert(view_record);
    static_assert(SizeOfLargestViewRecord() != 0, "Largest record not avaiable on compile time");
    if (buffer.FreeBytes() >= SizeOfLargestViewRecord()) {
        // There's space to insert more records. No need for swap.
        return;
    }

    buffer_queue_.Push(std::move(buffer));
}

Note the way we determine the buffer. This is done through the unique id of the thread writing the event. In the parallel OpenMP region, this is the executing thread. Looking further at how the buffers are implemented, we end up here: using ViewBufferTable = ThreadSafeHashTable<KeyT, ViewBuffer>;.
This means, that events are stored in a buffer and accessed through a hash table with the thread id being the key.

What the current implementation does

Regardless on the devices, contexts, and command queues being used by a thread, events are stored on a thread basis. This can cause issues if tools require events to be written in a certain way. In Score-P for example, we require our locations (where we store our events) to write events in timestamp order. With PTI-SDK however, this is quite difficult. Let's look at the output of the example above with some interface:

Click to open
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061487291 ns
Ze Kernel End Time: 1704727757061490207 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 15
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061632913 ns
Ze Kernel End Time: 1704727757061635829 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 16
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061857128 ns
Ze Kernel End Time: 1704727757061859523 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 18
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061907886 ns
Ze Kernel End Time: 1704727757061910281 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 20
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061985554 ns
Ze Kernel End Time: 1704727757061987949 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 22
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062035503 ns
Ze Kernel End Time: 1704727757062038003 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 24
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062084320 ns
Ze Kernel End Time: 1704727757062086715 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 26
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062133252 ns
Ze Kernel End Time: 1704727757062135647 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 28
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062183094 ns
Ze Kernel End Time: 1704727757062185489 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 30
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062232119 ns
Ze Kernel End Time: 1704727757062234514 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 32
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062280055 ns
Ze Kernel End Time: 1704727757062282555 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 33
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
Reached End of buffer
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757057692365 ns
Memory Op End Time: 1704727757057696219 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 1
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059441071 ns
Memory Op End Time: 1704727757059443883 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 2
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059629202 ns
Memory Op End Time: 1704727757059632952 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 3
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059699457 ns
Memory Op End Time: 1704727757059702790 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 4
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059772795 ns
Memory Op End Time: 1704727757059776232 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 5
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757059906260 ns
Memory Op End Time: 1704727757059910114 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 6
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060523755 ns
Memory Op End Time: 1704727757060527088 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 7
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060623218 ns
Memory Op End Time: 1704727757060626447 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 8
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060690014 ns
Memory Op End Time: 1704727757060693347 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 9
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060761374 ns
Memory Op End Time: 1704727757060765124 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 10
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060830073 ns
Memory Op End Time: 1704727757060833406 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 11
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060907619 ns
Memory Op End Time: 1704727757060911264 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 12
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757061004800 ns
Memory Op End Time: 1704727757061008445 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 13
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061145847 ns
Ze Kernel End Time: 1704727757061148763 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 14
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061840664 ns
Ze Kernel End Time: 1704727757061842955 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 17
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061897810 ns
Ze Kernel End Time: 1704727757061900205 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 19
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061948646 ns
Ze Kernel End Time: 1704727757061951041 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 21
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061996369 ns
Ze Kernel End Time: 1704727757061998660 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 23
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062046902 ns
Ze Kernel End Time: 1704727757062049297 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 25
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062095791 ns
Ze Kernel End Time: 1704727757062098186 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 27
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062144432 ns
Ze Kernel End Time: 1704727757062146723 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 29
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062196928 ns
Ze Kernel End Time: 1704727757062199323 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 31
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
Reached End of buffer

The output is pretty large, but shows a weird thing. The following entry can be found in the buffer for Kernel Thread Id = 670104, even though the event is from another Kernel Thread Id

Found Kernel Record
Ze Kernel Start Time: 1704727757061632913 ns
Ze Kernel End Time: 1704727757061635829 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 16
Kernel Thread Id : 670096

If we evaluate the first buffer first and then the second one, we will end up with timestamp errors coming from Score-P, since 1704727757057692365 (first event of second buffer) < 1704727757061632913 (wrong event in first buffer).

The issue

From my understanding, each thread will execute events on a separate command queue, if possible. My question here is: Is it possible that command queues are used by multiple threads at the same time?
In general, I am a bit skeptical about using thread ids as the key. If a buffer is not completely filled, but contains events for a context, device, or command queue and is flushed at the end of the program, performance tools need to store all events happening during program execution because there might be an event which gets missed or cause other issues otherwise.

For the behavior shown above, there seem to be events stored incorrectly, as I wouldn't expect to see a thread id for another thread in that buffer.

Side note

It seems like this isn't the only issue with multiple threads. When running the program multiple times, I've also ran into the following error:

a.out: /opt/apps/sources/PTI-SDK/9ee0e46cafa145856eaeeefe5f26ec046462300f/sdk/src/levelzero/ze_collector.h:1446: void ZeCollector::GetHostTime(const ZeKernelCommand *, const ze_kernel_timestamp_result_t &, uint64_t &, uint64_t &): Assertion `host_start > command->submit_time' failed.
[1]    669066 IOT instruction  ./a.out

Reproducer

You can use the following code to reproduce the issue:
pti_sdk_openmp_world.zip

To run the example, use the following command:

$ source ~/Env/oneAPI.sh 
$ icpx main.cpp -fiopenmp -fopenmp-targets=spir64 -lpti -lpti_view
$ ./a.out

Environment

  • OS: Ubuntu 22.04 LTS
  • Compiler: Intel oneAPI 2024.0 (Base Toolkit + HPC Toolkit)
  • CPU / GPU: Intel i7-1260P with Integrated Graphics
  • Memory: 16GiB
  • PTI-SDK: 9ee0e46
  • Level Zero:
    • level-zero-dev 1.14.0-744~22.04
    • intel-level-zero-gpu 1.3.27191.42-775~22.04

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions