Tasking Compiler

That world is gone. For nearly two decades, the primary driver of computational performance has not been faster clock speeds but parallelism. Modern processors are not single workers; they are orchestras: multiple CPU cores, SIMD vector units, GPUs with thousands of tiny cores, and specialized accelerators (NPUs, FPGAs). To write software that runs fast today is to write concurrent, parallel, and distributed software.

One job of a tasking compiler is choosing task granularity: a task per loop iteration would drown the runtime in scheduling overhead, so the compiler coarsens fine-grained loops into chunks.

```c
// Original: too fine-grained (one unit of work per iteration)
#pragma omp parallel for
for (i = 0; i < 1000000; i++)
    a[i] = sqrt(b[i]);

// Compiler transforms to: one task per 10,000-iteration chunk
#pragma omp parallel for schedule(static, 10000)
for (i = 0; i < 1000000; i += 10000)
    task for (j = i; j < i + 10000; j++)  // pseudocode: each chunk becomes one task
        a[j] = sqrt(b[j]);
```

The single biggest cost in parallel computing is moving data: between caches, between cores, between CPU and GPU, and across a network. A tasking compiler performs data affinity analysis: it tracks which tasks access which data and attempts to schedule each task on the core or GPU where that data already resides.
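As a rough illustration of what affinity-aware scheduling can look like at runtime (a minimal sketch, not any particular compiler's implementation: `spawn_with_affinity`, `block_owner_core`, and the per-core queue layout are all invented here), the runtime records which core's cache is warm for a data block and enqueues new tasks there:

```c
#include <stddef.h>

#define NUM_CORES 8

typedef struct Task {
    void (*run)(void *arg);
    void *arg;               /* the data block this task touches */
    struct Task *next;
} Task;

/* One run queue per core (hypothetical runtime structure). */
static Task *run_queue[NUM_CORES];

/* Which core's cache is likely warm for this block. A real runtime
 * would record the owner when the producing task completes; this
 * sketch just hashes the block's address. */
static int block_owner_core(const void *block) {
    return (int)(((size_t)block >> 6) % NUM_CORES);
}

/* Affinity-aware dispatch: push the task onto the queue of the core
 * where its input data most likely already resides. */
static void spawn_with_affinity(Task *t) {
    int core = block_owner_core(t->arg);
    t->next = run_queue[core];
    run_queue[core] = t;
}
```

An idle core would typically fall back to stealing work from other queues, so the affinity hint is an optimization, never a correctness requirement.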

In the compiler's intermediate representation, tasks are first-class: spawn and await are instructions like any other.

```
task @main()
    %t1 = spawn @compute_pi(0, 1000000)
    %t2 = spawn @compute_pi(1000000, 2000000)
    %res1 = await %t1
    %res2 = await %t2
    %total = fadd %res1, %res2
```
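For concreteness, here is one way the same spawn/await pattern can look in source code, sketched with OpenMP tasks in C. The body of `compute_pi` is a hypothetical stand-in (a midpoint-rule estimate of pi over an index range), since the IR above only names the function.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical worker: estimates a slice of pi by summing
 * 4/(1+x^2) over iterations [start, end) of a 2,000,000-step sum. */
static double compute_pi(long start, long end) {
    const long total = 2000000;
    double step = 1.0 / (double)total;
    double sum = 0.0;
    for (long i = start; i < end; i++) {
        double x = ((double)i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}

int main(void) {
    double res1 = 0.0, res2 = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        /* Each task is the moral equivalent of `spawn` in the IR. */
        #pragma omp task shared(res1)
        res1 = compute_pi(0, 1000000);
        #pragma omp task shared(res2)
        res2 = compute_pi(1000000, 2000000);
        #pragma omp taskwait   /* the `await` points */
    }
    printf("pi ~= %f\n", res1 + res2);   /* the `fadd` */
    return 0;
}
```

Compiled with `cc -fopenmp`, the two tasks run concurrently; without OpenMP the pragmas are ignored and the program degrades gracefully to a serial sum.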