How RenderScript Supports Parallel Computation

0x1 Architecture

RenderScript is a framework for running computationally intensive tasks at high performance on Android. It is similar to OpenCL, a cross-platform specification for parallel computation. RenderScript is primarily oriented toward data-parallel computation. The RenderScript runtime parallelizes work across all processors available on a device, such as multi-core CPUs, GPUs, or DSPs. RenderScript is especially useful for applications performing image processing, computational photography, or computer vision. (From Android's documentation.)

Here is the architecture of RenderScript:

[Figure: RenderScript architecture]

RS Wrapper is the wrapper layer for RenderScript; it provides the RenderScript API mapping and resource management.
cpu ref is the software implementation of RenderScript on the CPU.
gpu rs is the hardware implementation of RenderScript on the GPU.
slang/llvm provides the front-end and back-end compiler support for RenderScript's C99-derived language.

In this article, we will discuss how RenderScript is supported in software on multi-core CPUs.

0x2 Software Implementation

[Figure: cpu ref software implementation]

0x21 Create Threads

Worker threads for parallel computation are created based on the number of CPU cores:

bool RsdCpuReferenceImpl::init(uint32_t version_major, uint32_t version_minor,
                               sym_lookup_t lfn, script_lookup_t slfn) {
    // ...
    GetCpuInfo();

    int cpu = sysconf(_SC_NPROCESSORS_CONF);
    if (mRSC->props.mDebugMaxThreads) {
        cpu = mRSC->props.mDebugMaxThreads;
    }
    if (cpu < 2) {
        mWorkers.mCount = 0;
        return true;
    }

    // Subtract one from the cpu count because we also use the command thread as a worker.
    mWorkers.mCount = (uint32_t)(cpu - 1);

    for (uint32_t ct = 0; ct < mWorkers.mCount; ct++) {
        status = pthread_create(&mWorkers.mThreadId[ct], &threadAttr, helperThreadProc, this);
        if (status) {
            mWorkers.mCount = ct;
            ALOGE("Created fewer than expected number of RS threads.");
            break;
        }
    }
    // ...
}

0x22 Thread Implementation

Here is the worker thread's source code.

void * RsdCpuReferenceImpl::helperThreadProc(void *vrsc) {
    RsdCpuReferenceImpl *dc = (RsdCpuReferenceImpl *)vrsc;
    uint32_t idx = __sync_fetch_and_add(&dc->mWorkers.mLaunchCount, 1);

    dc->mWorkers.mLaunchSignals[idx].init();
    dc->mWorkers.mNativeThreadId[idx] = gettid();

    memset(&dc->mTlsStruct, 0, sizeof(dc->mTlsStruct));
    int status = pthread_setspecific(gThreadTLSKey, &dc->mTlsStruct);
    if (status) {
        ALOGE("pthread_setspecific %i", status);
    }

    while (!dc->mExit) {
        dc->mWorkers.mLaunchSignals[idx].wait();
        if (dc->mWorkers.mLaunchCallback) {
            // idx +1 is used because the calling thread is always worker 0.
            dc->mWorkers.mLaunchCallback(dc->mWorkers.mLaunchData, idx + 1);
        }
        __sync_fetch_and_sub(&dc->mWorkers.mRunningCount, 1);
        dc->mWorkers.mCompleteSignal.set();
    }

    return nullptr;
}

dc->mWorkers.mLaunchSignals[idx].wait() waits for a task to be dispatched to this worker.

dc->mWorkers.mLaunchCallback invokes the actual processing routine.

dc->mWorkers.mCompleteSignal.set() signals that this worker has finished its share of the work.

0x23 Launch Threads

dc->mWorkers.mLaunchSignals[idx].wait() is signaled in RsdCpuReferenceImpl::launchThreads().

In launchThreads(), mWorkers.mCompleteSignal.wait() waits for the worker threads to finish.

The WorkerCallback_t cbk passed in performs the actual processing.

void RsdCpuReferenceImpl::launchThreads(WorkerCallback_t cbk, void *data) {
    mWorkers.mLaunchData = data;
    mWorkers.mLaunchCallback = cbk;

    // Mark all helper threads as running before waking them, so the
    // completion loop below has something to wait on.
    mWorkers.mRunningCount = mWorkers.mCount;
    __sync_synchronize();

    for (uint32_t ct = 0; ct < mWorkers.mCount; ct++) {
        mWorkers.mLaunchSignals[ct].set();
    }

    // We use the calling thread as one of the workers so we can start without
    // the delay of the thread wakeup.
    if (mWorkers.mLaunchCallback) {
        mWorkers.mLaunchCallback(mWorkers.mLaunchData, 0);
    }

    while (__sync_fetch_and_or(&mWorkers.mRunningCount, 0) != 0) {
        mWorkers.mCompleteSignal.wait();
    }
}

Here are the call sites where launchThreads() is invoked:

Line 767: launchThreads(walk_3d_reduce, mtls);
Line 770: launchThreads(walk_2d_reduce, mtls);
Line 773: launchThreads(walk_1d_reduce, mtls);
Line 851: launchThreads(walk_general_foreach, mtls);
Line 873: launchThreads(walk_2d_foreach, mtls);
Line 895: launchThreads(walk_1d_foreach, mtls);

0x24 Thread Execution

Each thread picks up work according to the current mSliceNum: it computes yStart and yEnd, then executes the kernel on the rows from yStart to yEnd.
The kernel is set in RsdCpuScriptIntrinsic::invokeForEach()
or RsdCpuScriptImpl::forEachKernelSetup().

RsdCpuScriptIntrinsic::invokeForEach() sets up the kernel for an intrinsic.

RsdCpuScriptImpl::forEachKernelSetup() sets up a user-defined kernel written in a *.rs file.

static void walk_2d_foreach(void *usr, uint32_t idx) {
    MTLaunchStructForEach *mtls = (MTLaunchStructForEach *)usr;
    RsExpandKernelDriverInfo fep = mtls->fep;
    fep.lid = idx;
    ForEachFunc_t fn = mtls->kernel;

    while (1) {
        uint32_t slice  = (uint32_t)__sync_fetch_and_add(&mtls->mSliceNum, 1);
        uint32_t yStart = mtls->start.y + slice * mtls->mSliceSize;
        uint32_t yEnd   = yStart + mtls->mSliceSize;

        yEnd = rsMin(yEnd, mtls->end.y);
        if (yEnd <= yStart) {
            return;
        }

        for (fep.current.y = yStart; fep.current.y < yEnd; fep.current.y++) {
            FepPtrSetup(mtls, &fep, mtls->start.x, fep.current.y);
            fn(&fep, mtls->start.x, mtls->end.x, fep.outStride[0]);
        }
    }
}

0x25 Kernel Implementation

Let's use IntrinsicBlur as an example. In its kernel function kernelU1(), the output pixels from xstart to xend are produced.
The algorithm is based on Gaussian weights, which are initialized in ComputeGaussianWeights().

void RsdCpuScriptIntrinsicBlur::kernelU1(const RsExpandKernelDriverInfo *info,
                                         uint32_t xstart, uint32_t xend,
                                         uint32_t outstep) {
    float buf[4 * 2048];
    RsdCpuScriptIntrinsicBlur *cp = (RsdCpuScriptIntrinsicBlur *)info->usr;
    if (!cp->mAlloc.get()) {
        ALOGE("Blur executed without input, skipping");
        return;
    }
    const uchar *pin = (const uchar *)cp->mAlloc->mHal.drvState.lod[0].mallocPtr;
    const size_t stride = cp->mAlloc->mHal.drvState.lod[0].stride;

    uchar *out = (uchar *)info->outPtr[0];
    uint32_t x1 = xstart;
    uint32_t x2 = xend;

    float *fout = (float *)buf;
    int y = info->current.y;
    if ((y > cp->mIradius) && (y < ((int)info->dim.y - cp->mIradius - 1))) {
        const uchar *pi = pin + (y - cp->mIradius) * stride;
        OneVFU1(fout, pi, stride, cp->mFp, cp->mIradius * 2 + 1, 0, info->dim.x);
    } else {
        x1 = 0;
        while (info->dim.x > x1) {
            OneVU1(info, fout, x1, y, pin, stride, cp->mFp, cp->mIradius);
            fout++;
            x1++;
        }
    }

    x1 = xstart;
    while ((x1 < x2) &&
           ((x1 < (uint32_t)cp->mIradius) || (((uintptr_t)out) & 0x3))) {
        OneHU1(info, out, x1, buf, cp->mFp, cp->mIradius);
        out++;
        x1++;
    }

    while (x2 > x1) {
        OneHU1(info, out, x1, buf, cp->mFp, cp->mIradius);
        out++;
        x1++;
    }
}