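CUDA: three nested loops, the innermost accumulates every computed value into device memory; the outer two are parallel, so what should I do with the inner one?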

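I have three nested loops in CUDA. The outer two are parallelized, but the innermost one adds each computed value to data in device (global) memory. The current program produces correct results; the problem is that the third loop cannot be parallelized, so it has to stay a sequential loop, and that loop sits outside the kernel function, so the program runs very slowly. What can I do?
The program is roughly as follows: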
__global__ void GpuLineKernel()
{

int xindex=blockDim.x*blockIdx.x+threadIdx.x; // detector column index
int yindex=blockIdx.y; // detector row index
            int intx,inty,intz; 
            double u1,v1;      
            double xr, yr;        
            double x,y,z;    
            double t1,t2,t3;  
            double Sinangle,Cosangle;
            double temp1,temp2,temp3,temp4,temp5,temp6;
            float xcenter = (float)(ImageWidth - 1) / 2.f;
            float ycenter = (float)(ImageHeight - 1) / 2.f;
            float zcenter = (float)(LayersOfImage - 1) / 2.f;
            float dycenter = (float)(ColumnsOfDetector - 1) / 2.f+ channel_offset_h;
            float dzcenter = (float)(RowsOfDetector - 1) / 2.f+ channel_offset_v;
            Sinangle = sin(angle);
            Cosangle = cos(angle);
            u1=(xindex - dycenter)*HSpaceOfDetector; 
            v1=(dzcenter-yindex)*VSpaceOfDetector;   
            for(int numHeight=0;numHeight<ImageHeight;numHeight++) 
            // How can this loop be accelerated? Note the accumulation
            // d[yindex*ColumnsOfDetector+xindex] += (float)(temp5*(1-t1)+temp6*t1)
            // performed inside it.
            {
                    xr=(numHeight-xcenter) * HSpaceOfObject;  
                    yr=u1*(SourceToOriginal+ xr)/SourceToD; 
                    z =v1*(SourceToOriginal+ xr)/SourceToD;

                    y=xr*Cosangle+yr*Sinangle;
                    x=-xr*Sinangle+yr*Cosangle;
                    z=z/DSpaceOfObject+(1-indx)*zcenter;
                    y=y/HSpaceOfObject+ycenter;  
                    x=x/HSpaceOfObject+xcenter;  
                    intx=(int)floor(x);
                    inty=(int)floor(y);
                    intz=(int)floor(z);
                    t1=x-intx;
                    t2=y-inty;
                    t3=z-intz;
                    if(intx>=0 && inty>=0 && intz>=0 && intz+1<LayersOfImage/numstep && inty+1<ImageWidth && intx+1<ImageHeight)
                    {
                        temp1=(double)d_backprojectdata[intz*ImageWidth*ImageHeight+inty*ImageWidth+intx]*(1-t3)+(double)d_backprojectdata[(intz+1)*ImageWidth*ImageHeight+inty*ImageWidth+intx]*t3;
                        temp2=(double)d_backprojectdata[intz*ImageWidth*ImageHeight+inty*ImageWidth+intx+1]*(1-t3)+(double)d_backprojectdata[(intz+1)*ImageWidth*ImageHeight+inty*ImageWidth+intx+1]*t3;
                        temp3=(double)d_backprojectdata[intz*ImageWidth*ImageHeight+(inty+1)*ImageWidth+intx]*(1-t3)+(double)d_backprojectdata[(intz+1)*ImageWidth*ImageHeight+(inty+1)*ImageWidth+intx]*t3;
                        temp4=(double)d_backprojectdata[intz*ImageWidth*ImageHeight+(inty+1)*ImageWidth+intx+1]*(1-t3)+(double)d_backprojectdata[(intz+1)*ImageWidth*ImageHeight+(inty+1)*ImageWidth+intx+1]*t3;
                        
                        temp5=temp1*(1-t2)+temp3*t2;
                        temp6=temp2*(1-t2)+temp4*t2;
                        d[yindex*ColumnsOfDetector+xindex] +=  (float)(temp5*(1-t1)+temp6*t1);
                    }
                }

}
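
Two common ways to handle an accumulation like the one in the kernel above can be sketched as follows. This is only a minimal, self-contained illustration under assumptions, not a drop-in replacement: `sampleAlongRay`, `projectRegisterSum`, `projectAtomic` and `maxErrorOfBothKernels` are hypothetical names, and the trilinear interpolation is replaced by a simple stand-in function so that the example can be checked against a CPU reference. Option A keeps the loop inside the kernel but sums into a register and writes global memory once per thread. Option B maps the third loop onto the z dimension of the grid and combines the contributions with `atomicAdd` on `float`.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the interpolated value computed in the loop body.
__host__ __device__ inline float sampleAlongRay(int col, int row, int step)
{
    return 0.001f * (float)((col + 2 * row + 3 * step) % 7);
}

// Option A: one thread per detector pixel. The loop stays in the kernel and
// accumulates in a register, so global memory is updated once per thread.
__global__ void projectRegisterSum(float* d, int cols, int rows, int steps)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockIdx.y;
    if (col >= cols || row >= rows) return;

    float sum = 0.0f;
    for (int s = 0; s < steps; ++s)
        sum += sampleAlongRay(col, row, s);
    d[row * cols + col] += sum;  // single read-modify-write
}

// Option B: the third loop becomes blockIdx.z. Threads that share a pixel
// combine their contributions with atomicAdd.
__global__ void projectAtomic(float* d, int cols, int rows, int steps)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockIdx.y;
    int s = blockIdx.z;
    if (col >= cols || row >= rows || s >= steps) return;
    atomicAdd(&d[row * cols + col], sampleAlongRay(col, row, s));
}

// Runs both kernels and returns the largest deviation from a CPU reference
// (INFINITY if any CUDA call fails).
float maxErrorOfBothKernels(int cols, int rows, int steps)
{
    const size_t n = (size_t)cols * rows;
    const size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return INFINITY;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return INFINITY; }
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);

    const int threads = 128;
    const dim3 gridA((cols + threads - 1) / threads, rows);
    const dim3 gridB((cols + threads - 1) / threads, rows, steps);
    projectRegisterSum<<<gridA, threads>>>(dA, cols, rows, steps);
    projectAtomic<<<gridB, threads>>>(dB, cols, rows, steps);

    cudaError_t status = cudaGetLastError();
    if (status == cudaSuccess) status = cudaDeviceSynchronize();

    std::vector<float> hA(n), hB(n);
    if (status == cudaSuccess)
        status = cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);
    if (status == cudaSuccess)
        status = cudaMemcpy(hB.data(), dB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    if (status != cudaSuccess) return INFINITY;

    float maxErr = 0.0f;
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            float ref = 0.0f;
            for (int s = 0; s < steps; ++s)
                ref += sampleAlongRay(col, row, s);
            const size_t i = (size_t)row * cols + col;
            maxErr = fmaxf(maxErr, fabsf(hA[i] - ref));
            maxErr = fmaxf(maxErr, fabsf(hB[i] - ref));
        }
    }
    return maxErr;
}

int main()
{
    const float err = maxErrorOfBothKernels(100, 8, 64);
    printf("max abs error vs CPU reference: %g\n", err);
    assert(err < 1e-4f);
    return 0;
}
```

Which option is faster depends on the problem size. When the detector already has many pixels, Option A is often enough, since it removes the repeated global-memory read-modify-write from the loop. Option B exposes more parallelism but pays for atomic contention, and because the order of floating-point additions is no longer fixed, results may differ in the last bits between runs.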
