Calculating compression on a ~10% filled device with multiple threads led to ~10% of the actual threads being used for the majority of work.
It looks like the current algo evenly divides the data space among them, without regard to content. In this instance ~90% of threads finished very fast because they were reading unallocated data, and the remaining ~10% took very long due to threading limitations.
I propose making a work queue of much smaller (but still manageable) chunks that can put non-contiguous work per every thread to more evenly allocate work when there are trivial data regions. i.e.
- Program is invoked across large data space with 10 threads
- Main thread generates work items
items[10000] evenly distributed across data space, and program is called with 10 threads
- Thread 1 takes
items[0], thread 2 takes items[1] ... thread 10 takes items[9]
- When each thread finishes the current work, it will take the next un-serviced work in
items[] (or exit, and a new thread can be spawned)
Calculating compression on a ~10% filled device with multiple threads led to ~10% of the actual threads being used for the majority of work.
It looks like the current algo evenly divides the data space among them, without regard to content. In this instance ~90% of threads finished very fast because they were reading unallocated data, and the remaining ~10% took very long due to threading limitations.
I propose making a work queue of much smaller (but still manageable) chunks that can put non-contiguous work per every thread to more evenly allocate work when there are trivial data regions. i.e.
items[10000]evenly distributed across data space, and program is called with 10 threadsitems[0], thread 2 takesitems[1]... thread 10 takesitems[9]items[](or exit, and a new thread can be spawned)