Tiled rendering runs multiple QPU programs in parallel, each reading through the TMU and writing to its own dedicated region of the VPM.
qpu_debug_tiled demonstrates the tiling pattern and VPM writing. All QPUs can write to their part of the VPM simultaneously, and this works fine even without mutex synchronization.
qpu_blit_tiled is structurally identical, but adds TMU loads and writes the loaded values instead of the debug pattern. However, just adding the TMU load instructions breaks the program. Commenting them out and writing debug values makes the program work again. From what I gathered, timing is not the issue: replacing the TMU access with some nop operations does not trigger the behaviour.
Executing the programs sequentially, one right after another, works fine, however, so the functionality itself is correct.
So I tried adding a mutex to synchronize the QPUs at several different granularities: the whole program, each line, and each VPM access. The whole-program mutex works, but only at low framerates (e.g. 10); without the mutex, even that breaks. This indicates the mutex does work to a degree. However, when increasing the framerate, the QPUs quickly (after a few frames) start overwriting the whole memory without apparent reason.
Mutex synchronization per line or even per VPM access seems to only worsen this behaviour.
So there are three parts to this problem that I do not understand:
- Why does accessing the TMU affect VPM access? Or is it that, with predictable timing, the QPU programs previously just never interfered by chance?
- Why does the mutex break at higher framerates? Acquiring it at high frequency seems to break it, yet I've seen others (e.g. gpu_fft) acquire the mutex around each VPM access in a multi-program environment.
- And finally, is the mutex required at all when each QPU only uses a fixed, small part of the VPM exclusively? In the current qpu_blit_tiled, each QPU only uses 4 vectors (so 12 x 4 = 48 out of the 64 I reserved for user programs).
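To make the VPM budget in the last question concrete, here is a minimal sketch of the partitioning arithmetic. It assumes a contiguous layout in which QPU q owns vectors 4q to 4q+3; the actual programs may lay the regions out differently, and the names `vpm_regions` and `regions_are_disjoint` are hypothetical helpers, not part of the referenced code. It only checks that the regions do not overlap and fit in the reserved space, and says nothing about whether shared hardware state still requires a mutex.

```python
# Assumed values taken from the description: 12 QPUs, 4 vectors each,
# 64 VPM vectors reserved for user programs.
NUM_QPUS = 12
VECTORS_PER_QPU = 4
VPM_RESERVED = 64


def vpm_regions(num_qpus, vectors_per_qpu):
    """Return one range of VPM vector indices per QPU (assumed contiguous layout)."""
    return [
        range(q * vectors_per_qpu, (q + 1) * vectors_per_qpu)
        for q in range(num_qpus)
    ]


def regions_are_disjoint(regions):
    """Check that no VPM vector index is claimed by more than one QPU."""
    seen = set()
    for region in regions:
        for row in region:
            if row in seen:
                return False
            seen.add(row)
    return True


regions = vpm_regions(NUM_QPUS, VECTORS_PER_QPU)
total_used = sum(len(r) for r in regions)

print(f"vectors used: {total_used} of {VPM_RESERVED}")
print(f"regions disjoint: {regions_are_disjoint(regions)}")
```

Under these assumptions the address ranges themselves cannot collide, so any interference would have to come from something other than overlapping VPM regions.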
Any help is greatly appreciated. The referenced programs can be easily tested with the commands found in commands.txt.