Three-dimensional photoacoustic tomography based on graphics-processing-unit-accelerated finite element method

Kuan Peng,1 Ling He,2 Ziqiang Zhu,1 Jingtian Tang,1 and Jiaying Xiao1,*

1Department of Biomedical Engineering, School of Info-Physics Geomatics Engineering, Central South University, Hunan, Changsha 410083, China
2Obstetrics & Gynecology Department, Second Xiangya Hospital, Central South University, Hunan, Changsha 410011, China
*Corresponding author: [email protected]

Received 16 September 2013; accepted 20 October 2013; posted 29 October 2013 (Doc. ID 197396); published 25 November 2013

Compared with commonly used analytical reconstruction methods, the frequency-domain finite-element-method (FEM) based approach has proven to be an accurate and flexible algorithm for photoacoustic tomography. However, the FEM-based algorithm is computationally demanding, especially for three-dimensional cases. To enhance the algorithm's efficiency, in this work a parallel computational strategy is implemented in the framework of the FEM-based reconstruction algorithm using a graphics-processing-unit parallel framework named the "compute unified device architecture." A series of simulation experiments is carried out to test the accuracy and the accelerating effect of the improved method. The results obtained indicate that the parallel calculation does not change the accuracy of the reconstruction algorithm, while its computational cost is significantly reduced, by a factor of up to 38.9 with a GTX 580 graphics card. © 2013 Optical Society of America

OCIS codes: (170.5120) Photoacoustic imaging; (110.6960) Tomography; (170.3010) Image reconstruction techniques; (200.4960) Parallel processing.
http://dx.doi.org/10.1364/AO.52.008270

1. Introduction

Photoacoustic tomography (PAT) has received great attention in the past few years for its ability to combine high optical-absorption contrast with high ultrasonic spatial resolution for deep-tissue imaging in the optical diffusion regime [1–3]. By scanning with an unfocused transducer at various locations around the tissue and then applying an image reconstruction algorithm, the optical-absorption distribution inside the tissue can be recovered. An effective reconstruction algorithm is therefore a critical component of PAT. The most commonly used reconstruction algorithms in PAT are analytical methods, which rely on the analytical solution of the photoacoustic wave equation under numerous assumptions [4–6]. For example, the analytical solution is limited to measurement configurations with regular shapes; moreover, the heterogeneous acoustic wave equation cannot be solved analytically. To overcome these limitations, the finite-element-method (FEM) based reconstruction algorithm was proposed by Yuan and co-workers [2,7–12]. With the FEM-based method, arbitrary measurement configurations can be used for data collection; furthermore, both acoustically and optically heterogeneous media can be considered. However, the FEM-based method is time-consuming, especially for 3D cases. Since plenty of calculations can be executed independently in this algorithm, parallel calculation represents a promising way to improve its computational efficiency.

A graphics processing unit (GPU) provides a low-cost, powerful tool for parallel calculation on a personal computer. The parallelization of FEM calculations on the GPU has already been investigated by several researchers [13–15]; their results indicate that GPU-based parallelization can efficiently speed up the generation and solution of an FEM linear system. These works, however, are all concerned with processing one single FEM linear system with parallel computation. Since the frequency-domain FEM-based PAT reconstruction algorithm includes several independent FEM linear systems, different FEM linear systems can be processed in parallel to improve efficiency. In this work, a parallel computational structure is used for the frequency-domain FEM-based PAT reconstruction algorithm: a GPU-based sparse symmetric linear solver is built to solve the FEM linear systems, and multiple FEM linear systems of different frequencies are generated and solved in parallel. Extensive simulations are conducted to test the implemented parallel method.

The remainder of this paper is organized as follows: in Section 2, we first introduce the basic principle of the FEM-based PAT reconstruction algorithm. The GPU-based parallelization of this algorithm is then presented in Section 3. The performance of the proposed method is verified by simulations in Section 4, and conclusions are given in the last section.

APPLIED OPTICS / Vol. 52, No. 34 / 1 December 2013

2. FEM-Based PAT Reconstruction Algorithm

The propagation of the photoacoustic wave in biological tissue can be described by the following frequency-domain photoacoustic wave equation [2,7–12,16]:

    ∇²P(r, ω) + k₀²(1 + O)P(r, ω) = ik₀c₀βΦ(r)/C_p,   (1)

where P(r, ω) denotes the pressure wave, determined by the acoustic angular frequency ω and the spatial position r; k₀ = ω/c₀ is the wave number, defined by dividing the angular frequency by the acoustic speed c₀ in a reference or coupling medium; the absorbed light energy density Φ(r) is the product of the optical absorption coefficient μ_a(r) and the incident laser fluence φ(r); O is the acoustic coefficient described by the acoustic speed c and the attenuation α; β is the thermal expansion coefficient; and C_p is the specific heat. Since α is quite small in most biological tissues, it is ignored in this work. Applying the Galerkin method to Eq. (1) and using the divergence theorem, we obtain the following FEM element linear system [7]:

    A^{e,ω} P^{e,ω} = B^{e,ω} Φ^e,   e = 1, 2, …, N_e,   (2)

where A^{e,ω} and B^{e,ω} are the element matrices at frequency ω, N_e is the number of tetrahedral elements in the mesh, and P^{e,ω} and Φ^e are the pressure wave and the absorbed light energy density on the vertices of element e. Equation (2) can be further assembled into the global FEM linear system at frequency ω:

    A^ω P^ω = B^ω Φ.   (3)

Since the unknown absorbed light energy density Φ is independent of the frequency, using multifrequency measurements can improve the reconstruction results by increasing the amount of measured data. With the Marquardt–Tikhonov regularization method proposed by Yuan and co-workers [2,8,11] and the dual-mesh strategy proposed by Gu et al. [16], the unknown absorbed light energy density can be updated iteratively through the following objective function:

    T_co ΔΦ_co = (J_co)^T (P^c − P^o),   (4)

where P^c is the calculated detection value based on the fine mesh and P^o is the observed detection value. The Jacobian matrix J_co is defined as

    J_co = [ J^{1,1}_{R,co}   ⋯  J^{1,N^n_co}_{R,co}    J^{1,1}_{I,co}   ⋯  J^{1,N^n_co}_{I,co}
                 ⋮            ⋱        ⋮                    ⋮            ⋱        ⋮
             J^{N_ω,1}_{R,co} ⋯  J^{N_ω,N^n_co}_{R,co}  J^{N_ω,1}_{I,co} ⋯  J^{N_ω,N^n_co}_{I,co} ],   (5)

with

    J^{ω,n}_{R/I,co} = (Ψ^ω)^T B^ω ∂Φ_fi/∂Φ^n_{R/I,co},   (6)

where Φ^n_{R/I,co} are the real/imaginary parts of the absorbed light energy density on the nth coarse mesh node, N^n_co is the total number of coarse mesh nodes, and Ψ^ω is the solution of the following equation:

    A^ω Ψ^ω = Q,   (7)

where Q is a matrix defined as

    Q_{i,j} = { 1, if d_j = i
              { 0, else,        i = 1, 2, …, N^n_fi,   j = 1, 2, …, N_d.   (8)

In Eq. (8), d_j is the index of the jth detection node in the fine mesh, N^n_fi is the total number of fine mesh nodes, and N_d is the total number of detection nodes. A^ω and B^ω are the FEM system matrices based on the fine mesh, T_co = (J_co)^T J_co + λI is a full real matrix, ΔΦ_co is the update of the unknown absorbed light energy density on the coarse mesh, λ is the regularization parameter, and I is the unit matrix. Φ_fi, the unknown absorbed light energy density on the fine mesh nodes, can be expressed by that of the coarse mesh through the following formula:

    χ = (χ_1, …, χ_{N^n_fi})^T = ( Σ_{k=1}^{4} ψ^{c1,k}_co χ^{c1,k}_co, …, Σ_{k=1}^{4} ψ^{cN^n_fi,k}_co χ^{cN^n_fi,k}_co )^T,   (9)

where χ_n is the unknown value on the nth fine mesh node, cn is the coarse mesh element that contains the nth fine mesh node, ψ^{cn,k}_co is the kth FEM shape function of cn, and χ^{cn,k}_co is the value on the kth vertex of cn. The flow diagram of the algorithm is shown in Fig. 1.

With the scheme provided by [8], the distribution of the absorption coefficient μ_a can be obtained by solving μ_a = Φ/φ iteratively, where φ is the flux density calculated by solving the diffusion equation. However, we found that significant errors can be observed at locations far away from the illumination source in 3D PAT. Since a location far away from one illumination source can be close to another illumination source in a multiple-illumination scheme, the absorption coefficients reconstructed from the different illuminations are averaged in this work to improve the calculation:

    μ_a = (1/N_i) Σ_{i=1}^{N_i} μ_a^i,   (10)

where N_i is the number of illumination sources and μ_a^i is the absorption coefficient reconstructed from the ith illumination source.

3. Parallelization of FEM-Based Reconstruction Algorithm

The parallel calculation in our work is based on the "compute unified device architecture" (CUDA), a hardware and software architecture for general-purpose computing on a GPU. In CUDA, the calculation is divided into many parts that are executed independently by thousands of threads. The threads are organized into a series of blocks. The threads in the same block are carried out by a hardware structure named a multiprocessor, so they can be synchronized and can exchange data through the high-speed on-chip shared memory, while threads belonging to different blocks cannot be synchronized and can only exchange data through the low-speed on-board global memory. The number of threads in one block is denoted blockdim in this work, and the total number of blocks is denoted griddim. Both blockdim and griddim are three-component vectors, so the threads and blocks can be identified using a 1D, 2D, or 3D index. The parallel calculation in CUDA is packed into a user-defined C function called a kernel; the operation defined in a kernel is carried out in parallel for different data according to the preset blockdim and griddim.

A. GPU-Based Construction of FEM Linear System

The FEM linear system is constructed by generating the two system matrices A^ω and B^ω, which are assembled from the element matrices A^{e,ω} and B^{e,ω}, respectively. Since the calculations of different element matrices are independent of each other, they can be carried out in parallel; furthermore, the system matrices of different frequencies are also generated in parallel in this work. The parallelized generation of the FEM system matrices is realized by employing one thread to calculate and assemble one element matrix for one frequency. A memory access control method named "atomic operation" is utilized here to avoid memory access conflicts during assembly. The FEM system matrix is symmetric, so only the lower triangular part of the matrix needs to be stored. The system matrix is also sparse, so the row/column compression (RC/CC) storage method is employed to further reduce the memory requirement and the amount of calculation. With the RC/CC method, a sparse matrix M is stored in three 1D arrays: the first stores the values of the nonzero elements; the second stores the column/row coordinates of the nonzero elements; and the third stores, for each row/column, the index of its first nonzero element in the first two arrays.

B. GPU-Based Linear Calculators

Fig. 1. Flow diagram of the FEM-based PAT reconstruction algorithm.
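The iteration summarized in Fig. 1 hinges on the regularized update of Eq. (4), ΔΦ_co = (J_coᵀJ_co + λI)⁻¹ J_coᵀ(P^c − P^o). As a minimal serial reference (not the paper's GPU code; the dense toy system and the Gaussian-elimination solver are illustrative assumptions), the update can be sketched in C++ as:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Solve the regularized normal equations (J^T J + lambda*I) dPhi = J^T r,
// where J is m x n (m >= n) and r is the residual of length m.
inline Vec tikhonov_update(const Mat& J, const Vec& r, double lambda) {
    const std::size_t m = J.size(), n = J[0].size();
    Mat A(n, Vec(n, 0.0));
    Vec b(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < m; ++k) A[i][j] += J[k][i] * J[k][j];
        A[i][i] += lambda;                      // Tikhonov regularization term
        for (std::size_t k = 0; k < m; ++k) b[i] += J[k][i] * r[k];
    }
    // Dense Gaussian elimination with partial pivoting on the n x n system.
    for (std::size_t c = 0; c < n; ++c) {
        std::size_t p = c;
        for (std::size_t q = c + 1; q < n; ++q)
            if (std::fabs(A[q][c]) > std::fabs(A[p][c])) p = q;
        std::swap(A[c], A[p]);
        std::swap(b[c], b[p]);
        for (std::size_t q = c + 1; q < n; ++q) {
            double f = A[q][c] / A[c][c];
            for (std::size_t j = c; j < n; ++j) A[q][j] -= f * A[c][j];
            b[q] -= f * b[c];
        }
    }
    Vec x(n, 0.0);
    for (std::size_t c = n; c-- > 0;) {
        double s = b[c];
        for (std::size_t j = c + 1; j < n; ++j) s -= A[c][j] * x[j];
        x[c] = s / A[c][c];
    }
    return x;
}
```

With λ → 0 and a square, well-conditioned J this reduces to solving J ΔΦ = r directly; larger λ damps the update, which is the role of the regularization parameter in Eq. (4).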

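The RC/CC compressed storage described in Section 3.A can be sketched as follows. This is a generic CSR-style layout for the lower triangle of a symmetric matrix, with illustrative names rather than the paper's actual data structures:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-compressed (CSR) storage of the lower triangle of a symmetric matrix,
// mirroring the three 1D arrays of the RC/CC method.
struct CsrLower {
    std::vector<double> val;       // values of the nonzero elements, row by row
    std::vector<std::size_t> col;  // column coordinate of each nonzero element
    std::vector<std::size_t> ptr;  // ptr[i] = index in val/col of row i's first nonzero
};

// Build the compressed form from a dense symmetric matrix, keeping i >= j.
inline CsrLower compress_lower(const std::vector<std::vector<double>>& A) {
    CsrLower m;
    for (std::size_t i = 0; i < A.size(); ++i) {
        m.ptr.push_back(m.val.size());
        for (std::size_t j = 0; j <= i; ++j)
            if (A[i][j] != 0.0) {
                m.val.push_back(A[i][j]);
                m.col.push_back(j);
            }
    }
    m.ptr.push_back(m.val.size());  // sentinel: one past the last row
    return m;
}

// Read entry (i, j), i >= j, by scanning row i's compressed range.
inline double at_lower(const CsrLower& m, std::size_t i, std::size_t j) {
    for (std::size_t k = m.ptr[i]; k < m.ptr[i + 1]; ++k)
        if (m.col[k] == j) return m.val[k];
    return 0.0;
}
```

The column-compressed variant is identical with the roles of rows and columns swapped; on the GPU, the ptr array is what lets one thread (or block) jump straight to its own row's nonzeros.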

Two kinds of GPU-based linear calculators, the multiply accumulator (MA) and the matrix multiplier (MM), are defined in this part. The major part of the calculation in the FEM-based algorithm for PAT consists of these two linear calculators. Multiple threads are employed to accelerate the linear calculation within a single calculator; moreover, multiple calculators can be carried out in parallel to further improve efficiency.

1. GPU-Based Multiply Accumulator

Since the threads belonging to different blocks cannot be synchronized and cannot exchange data through the high-speed on-chip shared memory, it is more efficient to execute the calculation of a single MA by the threads belonging to one block. The basic principle of the GPU-based MA is shown in Fig. 2, where blockdim is set to (4,1,1) to simplify the description. The multiply-accumulate is first carried out by four threads in parallel, and the intermediate results are stored in shared memory. The intermediate results are then added together by parallel-reduction summing to obtain the final result. At last, a single thread is employed to carry out the subsequent calculation with the result of the MA.

2. GPU-Based Matrix Multiplier

The result of the MM is divided into a series of submatrices, which are calculated by the MA of submatrices of the multiplier and multiplicand, as shown in Fig. 3. Since the submatrices of the multiplier and multiplicand are small enough to be prestored in shared memory, the efficiency can be improved by reducing the accesses to the low-efficiency on-board memory to 1/sizes compared with the undivided method, where sizes is the size of the submatrix; it is set to 32 in this work. The submatrices of the result are calculated in parallel by the threads belonging to different blocks. The number of threads in one block is equal to the number of elements in a submatrix of the result. The flow diagram of the block employed to calculate the submatrix in the ith row and jth column of the result is shown in Fig. 4.
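The submatrix decomposition of Fig. 3 corresponds to standard blocked matrix multiplication. A serial C++ reference of the blocking scheme (the GPU version stages each tile in shared memory; the tile size is 32 in the paper, and is a parameter here for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// C = A * B computed in sizeS x sizeS tiles, mirroring the shared-memory
// blocking of the GPU matrix multiplier: each (it, jt) output tile accumulates
// the products of the corresponding tiles of A and B, so on the GPU every
// global-memory tile is loaded once per tile-pair instead of once per element.
inline Mat tiled_multiply(const Mat& A, const Mat& B, std::size_t sizeS) {
    const std::size_t n = A.size(), m = B[0].size(), k = B.size();
    Mat C(n, std::vector<double>(m, 0.0));
    for (std::size_t it = 0; it < n; it += sizeS)          // output tile row
        for (std::size_t jt = 0; jt < m; jt += sizeS)      // output tile column
            for (std::size_t kt = 0; kt < k; kt += sizeS)  // reduction tile
                for (std::size_t i = it; i < std::min(it + sizeS, n); ++i)
                    for (std::size_t j = jt; j < std::min(jt + sizeS, m); ++j)
                        for (std::size_t l = kt; l < std::min(kt + sizeS, k); ++l)
                            C[i][j] += A[i][l] * B[l][j];
    return C;
}
```

On the GPU, the innermost accumulation over l is exactly the MA of Section 3.B.1, executed by the threads of one block on the prestaged tiles.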

C. Solvers for FEM Linear System

A GPU-based sparse symmetric linear solver, which is based on the LDLᵀ factorization (LDLF), is introduced in this part. After factorizing the left system matrix A^ω into the product of a lower triangular matrix L^ω and an upper triangular matrix (L1^ω)^T = (L^ω D^ω)^T, the FEM linear system can be solved with forward and backward substitutions. Although the factorization produces some fill-in elements, L^ω is still a sparse matrix. The sparse structure of L^ω, symbolized by S_SL, is independent of the frequency ω; moreover, S_SL is shared by L1^ω and L^ω. S_SL is calculated before the factorization by the reduced factorization presented in Eq. (11), where sign(·) is the sign function and bandwidth is the maximum bandwidth of A^ω, which is determined by the mesh structure. Since the maximum bandwidth of a matrix is not changed by the factorization, the factorization can be restricted within the bandwidth to accelerate the calculation. The reduced factorization is executed iteratively for each column; different elements in a column are calculated in parallel by the threads of different blocks. The flow diagram of the GPU-based reduced LDLᵀ factorization is shown in Fig. 5, where Lm_i is the number of elements in the ith column that need to be calculated in the reduced factorization and MA represents the GPU-based MA; blockdim·x is set to 256 herein. S_SL can be obtained by counting the nonzero elements in L1 and is stored with the RC/CC method.

Fig. 2. Flow diagram of the GPU-based MA.

Fig. 3. Decomposition of the matrix multiplication.

Fig. 4. Flow diagram of the block employed to calculate the submatrix of the ith row and jth column in the product.

Fig. 5. Flow diagram of the GPU-based reduced LDLᵀ factorization.

    L1_{i,j} = { 1, if A_{i,j} ≠ 0 and i ≥ j
               { 0, else

    for i = 1:N_n
        L1_{k,i} = sign( L1_{k,i} + Σ_{l=m1}^{i−1} L1_{i,l} L1_{k,l} ),
            with k = i, i+1, …, m2,
            m1 = max(1, i+1−bandwidth),  m2 = min(i+bandwidth−1, N_n)
    end

    S_SL = {(i,j) | L1_{i,j} = 1}.   (11)

With the information provided by S_SL, the LDLᵀ factorization is only carried out for the nonzero elements of each column; moreover, the number of elements that must be considered in the MA operation is also reduced. The sparse LDLᵀ factorization can be described as Eq. (12), where H^L_{bk} and L^L_i are the numbers of nonzero elements in the bkth row and the ith column of the factorization result:

    L1^ω_{i,j} = { A^ω_{i,j}, if A_{i,j} ≠ 0 and i ≥ j
                 { 0,        else

    for i = 1:N_n
        L1^ω_{bk,i} = L1^ω_{bk,i} − Σ_{l=1}^{H^L_{bk}} L1^ω_{bk,cl} L1^ω_{i,cl} / L1^ω_{cl,cl},
            with k = 1, 2, …, L^L_i;  (bk,i) ∈ S_SL;  (bk,cl) ∈ S_SL;  cl ≠ i
    end

    L^ω_{i,j} = { L1^ω_{i,j} / L1^ω_{j,j}, if (i,j) ∈ S_SL
                { 0,                       else.   (12)

The flow diagram of the GPU-based LDLᵀ factorization is quite similar to that shown in Fig. 5. A 2D blockdim (blockdim·x, blockdim·y, 1) is employed here. The data of different frequencies are calculated in parallel by the threads belonging to different rows; blockdim·y is the number of frequencies whose data are calculated in the same block. The sparse information, which is independent of the frequency, is stored in shared memory and shared by the calculations of the different frequencies. Because the low-efficiency accesses to the on-board memory are reduced, using one 2D block to process the data of multiple frequencies is more efficient than employing multiple 1D blocks for different frequencies. Since the maximum number of threads in a block is limited to 1024 by CUDA, blockdim·y can be smaller than the total number of frequencies N_ω; multiple blocks are needed in this case to calculate one nonzero element of a column for all frequencies in parallel. We found that the best efficiency is achieved by setting blockdim·x and blockdim·y to 32 and 8, respectively. When the LDLᵀ factorization of the left system matrix A^ω has been obtained, the FEM linear system in Eq. (3) can be solved with a forward substitution and a backward substitution, as described in Eq. (13):

    for i = 1:N_n
        y^ω_i = b^ω_i − Σ_{j=1}^{H^L_i − 1} L^ω_{i,cj} y^ω_{cj},  with (i,cj) ∈ S_SL, cj ≠ i
    end
    for i = N_n:1
        P^ω_i = ( y^ω_i − Σ_{j=2}^{L^L_i} L1^ω_{cj,i} P^ω_{cj} ) / L1^ω_{c1,i},  with (cj,i) ∈ S_SL, cj ≠ i
    end   (13)
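Equations (12) and (13) amount to an LDLᵀ-style factorization (with the diagonal D kept inside L1) followed by forward and backward substitutions. A dense serial C++ reference, omitting the sparsity pattern S_SL and the bandwidth restriction that the GPU implementation exploits:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Factor a symmetric positive-definite A in place into "L1", whose diagonal
// holds D and whose strict lower triangle holds L*D (so L = L1 with each
// column divided by its diagonal), then solve A x = b by a forward
// substitution, a diagonal scaling, and a backward substitution.
inline Vec ldlt_solve(Mat A, Vec b) {
    const std::size_t n = A.size();
    // Column-by-column factorization, the dense analog of Eq. (12).
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = i; k < n; ++k)
            for (std::size_t l = 0; l < i; ++l)
                A[k][i] -= A[k][l] * A[i][l] / A[l][l];
    // Forward substitution with unit lower L = L1(:,j)/L1(j,j).
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            b[i] -= (A[i][j] / A[j][j]) * b[j];
    // Backward substitution; dividing by the diagonal of L1 plays the role
    // of the division by L1_{c1,i} in Eq. (13).
    for (std::size_t i = n; i-- > 0;) {
        b[i] /= A[i][i];
        for (std::size_t j = i + 1; j < n; ++j)
            b[i] -= (A[j][i] / A[i][i]) * b[j];
    }
    return b;
}
```

The GPU version performs the same recurrences, but only over the index pairs in S_SL and within the matrix bandwidth, with the different columns, frequencies, and right-hand sides distributed over blocks as described above.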

where b^ω are the right-hand vectors of the linear systems and the superscript ω represents the ωth frequency. It is notable that S_SL is utilized here to improve the efficiency. A kernel is carried out repeatedly in each of the iterative steps to perform the forward or backward substitution. In the ith iterative step, P^ω_i or y^ω_i is calculated in parallel for all of the frequencies. The kernel consists of a series of GPU-based MAs. A 2D block is used here: the threads of one row in a block are employed to carry out one MA, calculating the data for one frequency, and N_ω/blockdim·y blocks are needed to process the data of all frequencies. blockdim·x and blockdim·y are set to 32 and 8 in this work to achieve the best efficiency. The linear system in Eq. (7) can be solved by the forward and backward substitutions presented in Eq. (14):

    for i = 1:N_n
        y^{ω,d}_i = b^{ω,d}_i − Σ_{j=1}^{H^L_i − 1} L^ω_{i,cj} y^{ω,d}_{cj},  with (i,cj) ∈ S_SL, cj ≠ i
    end
    for i = N_n:1
        Ψ^{ω,d}_i = ( y^{ω,d}_i − Σ_{j=2}^{L^L_i} L1^ω_{cj,i} Ψ^{ω,d}_{cj} ) / L1^ω_{c1,i},  with (cj,i) ∈ S_SL, cj ≠ i
    end   (14)

where the superscript (ω,d) represents the ωth frequency and the dth detection location. It can be seen in Eq. (14) that the same elements of L^ω and L1^ω are used by the calculations of different detection locations; for this reason, the data of different locations for the same frequency are computed in the same blocks. The elements of L^ω/L1^ω and S_SL are written into shared memory before the calculation, so the latency of the memory accesses is reduced. Just like the kernels used to solve Eq. (3), 2D blocks are employed in the kernel to carry out a series of GPU-based MAs for the forward and backward substitutions described in Eq. (14); the threads of one row in a block are employed to calculate the data of one detection location. A total of N_ω·N_d/blockdim·y blocks is needed to calculate Eq. (14) in parallel. blockdim·x and blockdim·y are also set to 32 and 8 here.

D. Parallelization of Generating and Solving the Objective Function

Before generating the objective function in Eq. (4), the Jacobian matrix J_co should be calculated first. Once Ψ^ω has been obtained by the method described in the previous section, J_co can be calculated by matrix multiplication according to the formulas presented in Eqs. (5) and (6). Since B^ω and ∂Φ/∂Φ^n_{R/I,co} are highly sparse, the dividing strategy mentioned in Section 3.B.2 is not necessary here. The threads of one block are employed to calculate one J^{ω,n}_{R/I,co}: a multiply-accumulate operation is executed by each of the threads to obtain one element of J^{ω,n}_{R/I,co} in parallel, and all of the J^{ω,n}_{R/I,co} are calculated in parallel by multiple blocks to obtain J_co.

With the calculated J_co, the objective function can be constructed by matrix multiplications and summations. The matrix multiplications are carried out with the GPU-based matrix multiplier introduced in Section 3.B.2, and the matrix summation is carried out in parallel by using one thread to calculate one element of the result. Since T_co is a full matrix and there is only one indivisible linear system to be solved here, the LDLᵀ decomposition of T_co can be calculated by a kernel with a structure similar to that described in Fig. 5. The subsequent forward and backward substitutions are computed with a single GPU-based MA, which is executed in a 1D block.

To carry out the dual-mesh strategy, the fine mesh nodes need to be mapped to the coarse mesh. The coarse mesh element that contains a fine mesh node is found according to the following formula:

    E^{mc}_i = { j | min_{j=1∼N^{co}_e} V_{j,i} },   V_{j,i} = | Σ_{k=1}^{4} V^k_{j,i} − V_j |,   (15)

where E^{mc}_i is the index of the coarse mesh element that contains the ith fine mesh node, N^{co}_e is the number of elements in the coarse mesh, V_j is the volume of the jth coarse mesh element, and V^k_{j,i} is the volume of the tetrahedron constructed by the ith fine mesh node and the kth boundary face of the jth coarse mesh element. In this work, the mapping of one fine mesh node is executed by the threads of one 1D block. The flow diagram of the ith block is shown in Fig. 6. The values V_{j,i} are first calculated by multiple threads; each V_{j,i} is then compared with one value in the array S, which is located in shared memory, and that value is replaced with V_{j,i} if it is larger than V_{j,i}. When all of the V_{j,i} have been processed, a parallel odd-even sorter is employed to sort the elements of the array S to obtain E^{mc}_i. The mapping formula of Eq. (9) can then be obtained by substituting the coordinates of the ith fine mesh node into the FEM shape functions of E^{mc}_i.

4. Results and Discussion

To verify the performance of the proposed method, several simulation experiments were carried out with a cylindrical configuration. First, we verified the accuracy of our GPU-accelerated FEM method: the results reconstructed by the GPU- and CPU-based methods are compared to evaluate the influence of the parallel calculation on the accuracy. Then the acceleration effect of the proposed method was evaluated by comparing the computational time of our method performed on the GPU with that of the serial calculation carried out on the CPU. All simulation experiments were carried out on a desktop computer with an Intel i7-3770 CPU, 32 GB of RAM, and an NVIDIA GTX 580 graphics card with 3 GB of on-board memory. The serial computation on the CPU and the parallel computation on the GPU are both programmed in C++ with the same data structures and calculation process.

The geometrical information of the cylindrical phantom is shown in Fig. 7. Two spherical targets with a radius of 1 mm are set in a cylindrical phantom, whose center is taken as the origin of the coordinate system. The optical and acoustic parameters are set according to [12], as listed in Table 1, in which D and μ_a are the diffusion coefficient and the absorption coefficient, respectively, and c is the speed of the acoustic wave; the thermal expansion coefficient β, the specific heat C_p, and the acoustic speed in the reference medium c₀ are set to β = 1e−5 °C⁻¹, C_p = 1.0 cal·(g·°C)⁻¹, and c₀ = 1459e3 mm/s, respectively.

Both the top and bottom surfaces of the phantom are irradiated, respectively, by a uniform excitation source to produce the photoacoustic signal. Two distributions of μ_a are reconstructed from the two illumination positions, and the final result is obtained by averaging these two distributions to improve the accuracy of the reconstruction. The measurements are taken from the side surface of the phantom at three heights, z = −2.5 mm, z = 0 mm, and z = 2.5 mm. Ninety-six detection locations are distributed uniformly at each height, so the total number of detection locations is 288. One coarse and three fine tetrahedral meshes are used to discretize the imaging area. To avoid the inverse crime, a special forward mesh containing the target with the highest resolution is employed to calculate the simulated measurements. The basic information of these meshes is listed in Table 2. Five frequency sets with different numbers of frequencies were employed in the simulations; the interval between the frequencies in these sets is 10 kHz in all cases, as shown in Table 3.

Fig. 6. Flow diagram of the parallel mesh mapping.

Fig. 7. Geometrical information of the cylindrical phantom. (a) Lateral view. (b) Sectional view.

Table 1. Optical and Acoustic Parameters of the Phantom

             μa (mm⁻¹)   D (mm)   c (mm/s)
Background   0.01        0.33     1440e3
Target       0.10        0.03     1553e3

Table 2. Information of Spatial Discrete Meshes

Mesh Index     Number of Nodes   Number of Tetrahedrons
Forward mesh   60141             343637
Fine mesh 1    14571             80386
Fine mesh 2    10966             60073
Fine mesh 3    7382              40117
Coarse mesh    3381              16920

Table 3. Information of Frequency Sets

Set Index   Number of Frequencies   Frequency Range (kHz)
1           56                      50–600
2           48                      50–520
3           40                      50–440
4           32                      50–360
5           24                      50–280

A. Verification of Accuracy

Verification studies are performed using the cylindrical phantom. To ensure the accuracy of the result, fine mesh 1 and frequency set 1 are employed in this simulation. We first employ the serial computation to investigate the necessity of the averaging strategy with multiple illuminations. The simulations are carried out with the two excitation positions (top and bottom surfaces) separately. The reconstructed distribution of μ_a on the plane x = 0 mm is compared with the real distribution in Fig. 8. The reconstructed results are processed by a threshold of 0.5·max(μ_a(r)). A positioning deviation in depth can be found as a result of either single excitation location, while the accuracy is improved significantly by averaging the two results. Because of the ill-conditioning of the reconstruction problem, it is difficult to reliably recover sources with a small photoacoustic signal strength. Since the strength of the photoacoustic signal in PAT is proportional to the flux density of the illumination light, a positioning deviation can usually be found for a target that is far away from the illumination source. However, a location that is far away from one illumination source can be close to another illumination source in a multiple-illumination scheme, so averaging the results from different illumination locations improves the accuracy.

Fig. 8. Reconstruction results of different excitation schemes on the plane of x = 0 mm. (a) Real distribution. (b) Reconstructed distribution of top surface illumination. (c) Reconstructed distribution of bottom surface illumination. (d) Average of (b) and (c).

The reconstruction algorithm is then carried out by parallel computation on the GPU to evaluate the influence of the parallel computation on the accuracy. The reconstructed absorption coefficient on four planes (z = 2.5 mm, z = −2.5 mm, x = 0 mm, and y = 3 mm) is sampled from the results of both the parallel and serial computations for comparison, as shown in Fig. 9. Two targets are distinguished clearly in both reconstruction results, and good positioning agreement can be found between the reconstructed targets and the real targets. However, because the resolution of the mesh is not high enough, the shape of the target is not reconstructed properly in either result. Moreover, the reconstructed quantity of target 2 is only 86% of the real value; this is also caused by the ill-posedness of the reconstruction problem. The reconstruction results of the parallel and serial computations are quite similar: the relative root-mean-square errors (RRMSEs) between them on the four planes are listed in Table 4. The error between the two results on these planes is smaller than 1.5% and is caused by the subtle calculation differences between the arithmetic units of the CPU and GPU.

Fig. 9. Comparisons of the CPU-based reconstruction results and GPU-based reconstruction results on the planes of z = 2.5, y = 3, and x = 0. (a) Real distribution of the absorption coefficient on these four planes. (b) CPU-based reconstruction results. (c) GPU-based reconstruction results.
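The relative root-mean-square error (RRMSE) reported in Table 4 can be computed, for example, as the RMS of the difference between the two reconstructions normalized by the RMS of the reference result; the paper does not spell out its exact normalization, so this is one common convention:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative root-mean-square error between a test vector and a reference,
// normalized here by the RMS of the reference values.
inline double rrmse(const std::vector<double>& test,
                    const std::vector<double>& ref) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        num += (test[i] - ref[i]) * (test[i] - ref[i]);
        den += ref[i] * ref[i];
    }
    return std::sqrt(num / den);
}
```

Applied to the sampled absorption-coefficient values on one plane, with the CPU result as the reference, this yields a single percentage per plane as in Table 4.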


Table 4. RRMSE between GPU- and CPU-based Reconstruction Results

Plane    z = 2.5 mm   z = −2.5 mm   x = 0 mm   y = 3 mm
RRMSE    1.15%        1.47%         0.91%      1.28%

B. Evaluation of Acceleration Effect

Several simulation experiments were carried out to evaluate the acceleration performance of the proposed method in this section. The number of the iterative steps is set to be 14 for the serial and parallel calculations to ensure convergence of the iteration. The acceleration ratio, which is defined by the ratio between time cost of the CPU and GPU, was used to evaluate the acceleration effect of our method. 1. Acceleration Effect with Different Number of Frequencies and Discrete Meshes The acceleration effect of our method is first evaluated with multiple frequency sets and multiple meshes. The simulations are carried out for each combination of the fine meshes in Table 2 and the frequency sets in Table 3. The time cost of serial computation and parallel computation in each simulation are listed in Table 5. The acceleration ratio of parallel computation in each simulation is shown in Table. 6. It can be found that the parallel computation accelerates the reconstruction algorithm significantly in all cases. The calculation independence, which can be utilized by the parallel computation, is proportional to the number of the frequencies and the resolution of the mesh. However, the acceleration ratio doesn’t change significantly as expected in the simulations with a different number of frequencies and a different resolution of meshes. This phenomenon is caused by the limited capability of parallel calculation of our GPU. There are 18 multiprocessors in the GTX580, and the maximum number of resident blocks per multiprocessor is eight. This means that the max number of the blocks, which are carried out in parallel, is 144. For the number of blocks, which is much larger than 144, increasing that number can only improve the parallel efficiency by hiding the latency of memory access, which is relatively limited. The number of the blocks, which are used in a major part of our calculation, is far larger than 144. 
For example, in the simulation with the smallest number of frequencies (24) and the fine mesh with the lowest resolution (fine mesh 3), the numbers of blocks needed in the LDL^T factorization, the solution of Eq. (10), and the computation of J_co^T J_co are already as large as 1344, 864, and 43969, respectively. As a result, when the number of blocks increases further in the simulations with more frequencies and finer meshes, the efficiency improvement is small. In addition, the algorithm requires many branch and loop operations, which are executed more efficiently on a CPU than on a GPU. Moreover, a large amount of data is transferred between the host RAM and the on-board memory, and the limited bandwidth of the PCI-E bus also reduces efficiency. Consequently, the acceleration ratio we obtained is smaller than the theoretical value.

Table 5. Calculation Time on CPU/GPU with Different Spatial Discrete Meshes and Frequency Sets in Simulations^a

Number of     CPU/GPU Time       CPU/GPU Time       CPU/GPU Time
Frequencies   (10^3 s), FM1      (10^3 s), FM2      (10^3 s), FM3
56            8.55/0.22          6.33/0.17          4.77/0.13
48            7.63/0.20          5.79/0.16          4.30/0.12
40            6.81/0.18          5.02/0.14          3.84/0.11
32            5.61/0.15          4.40/0.12          3.42/0.10
24            4.76/0.13          3.78/0.11          3.06/0.09

^a FMi denotes the ith fine mesh.

Table 6. Acceleration Ratio with Different Spatial Discrete Meshes and Frequency Sets in Simulations^a

Number of     Acceleration   Acceleration   Acceleration
Frequencies   Ratio, FM1     Ratio, FM2     Ratio, FM3
56            38.9           37.2           36.7
48            38.2           36.2           35.9
40            37.8           35.9           34.9
32            37.4           35.0           34.2
24            36.6           34.4           34.0

^a FMi denotes the ith fine mesh.

8278    APPLIED OPTICS / Vol. 52, No. 34 / 1 December 2013

2. Acceleration Effect in Each Module of Calculation

We further studied the acceleration ratios of our method in the different modules of the calculation. Fine mesh 1 in Table 2 and frequency set 1 in Table 3 are employed in this simulation. The acceleration ratios are listed in Table 7. Since the FEM linear system does not change between iterative steps in our algorithm, only the last three modules need to be executed repeatedly in each iterative step. Because of their relatively small computational independence, the acceleration ratios obtained by the parallel calculation are relatively low for these modules. Since many iterative steps are carried out in our simulations, nearly 50% of the total time cost is consumed by these three modules, which pulls down the overall acceleration ratio of the parallel computation significantly.

Table 7. Acceleration Ratio of Different Modules of Calculation

Calculation Module        Acceleration   Proportion in    Proportion in Each
                          Ratio          Total Time (%)   Iterative Step (%)
Mesh mapping              1200.0         0.004            /
A(ω)P(ω) = B(ω)Φ          11.4           0.091            0.17
A(ω) = L(ω)D(ω)L(ω)^T     62.5           16               27
Generating J_co           65.7           22               38
Generating T_co           86.1           6.4              11
Generating P^c            3.5            22               2.6
r = J_co^T (P^c − P^o)    1.46           1.5              0.2
ΔΦ = T_co^(−1) r          31.2           25               3.1

However, for reconstruction problems in which the FEM linear system is updated iteratively, modules 2 to 5 must be carried out in each iterative step. In such problems the proportion of the time cost consumed by the last three modules decreases notably, as shown in the last column of Table 7. As a result, better parallel efficiency can be obtained when our method is applied to these problems.

5. Conclusion

Although the parallel computational scheme in this work is implemented for the PAT reconstruction algorithm, it can also be employed without much difficulty for other FEM-based inverse problems, such as frequency-domain ultrasound imaging and optical imaging with multiple spectra or irradiation positions. Moreover, for problems in which the FEM system matrix must be updated at each iterative step, the proposed method can achieve an even better acceleration effect than that reported here.

Although the simple averaging strategy employed in this work can partly relieve the reconstruction error caused by a single-illumination scheme, a more sophisticated reconstruction algorithm that exploits the information of multiple illuminations more effectively may be needed for 3D quantitative PAT. We will investigate this kind of algorithm in our future work. Moreover, the number of fill-in elements in the factorization matrix can be reduced significantly by reordering the rows and columns of the sparse matrix; with a sparser factorization matrix, both the amount of calculation and the memory requirements are reduced. The efficiency of the GPU-based parallel computation can also be improved by increasing the degree of parallelism.

In conclusion, a GPU-based parallel computation scheme is proposed to reduce the time cost of the FEM-based PAT reconstruction algorithm. This scheme exploits the computational independence not only within ordinary linear-algebra calculations but also between the calculations for different measured frequencies. An averaging strategy with a multiple-illumination scheme is employed to relieve the reconstruction error caused by the nonuniform distribution of the illumination flux density. As the simulations in Section 4 show, the parallel computation does not change the accuracy of the algorithm, and the largest acceleration ratio obtained by the parallel calculation relative to the serial calculation is 38.9.
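The row/column reordering mentioned above can be illustrated with a small self-contained sketch (not the authors' code): a reverse Cuthill-McKee pass reduces the bandwidth of a sparse symmetric pattern, which in turn bounds the fill-in produced by an LDL^T factorization. The scrambled path graph below is a toy stand-in for a FEM mesh adjacency.

```python
from collections import deque

def rcm_order(adj):
    """Reverse Cuthill-McKee: breadth-first search started from a
    minimum-degree node, visiting neighbors in order of increasing
    degree, with the resulting ordering reversed at the end."""
    visited = set()
    order = []
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order[::-1]

def bandwidth(adj, perm):
    """Largest index distance |i - j| between coupled nodes under an ordering."""
    pos = {v: i for i, v in enumerate(perm)}
    return max(abs(pos[v] - pos[w]) for v in adj for w in adj[v])

# A chain (path graph) whose nodes carry scrambled labels: under the
# natural 0..7 ordering the matrix pattern looks scattered even though
# the underlying coupling is one-dimensional.
labels = [3, 6, 0, 5, 2, 7, 1, 4]
adj = {v: set() for v in labels}
for a, b in zip(labels, labels[1:]):
    adj[a].add(b)
    adj[b].add(a)

print(bandwidth(adj, sorted(adj)))      # natural ordering: bandwidth 6
print(bandwidth(adj, rcm_order(adj)))   # RCM ordering: bandwidth 1
```

A factorization of the reordered matrix stays inside the narrow band, so far fewer fill-in elements are generated than with the original ordering.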

This study was funded by the National Natural Science Foundation of China (Grant No. 31200748), the Postdoctoral Science Fund of Central South University (Grant No. 20120312), and the Freedom Explore Program of Central South University (Grant No. Z12048).

References
1. R. A. Kruger, P. Liu, and Y. Fang, “Photoacoustic ultrasound (PAUS)-reconstruction tomography,” Med. Phys. 22, 1605–1609 (1995).
2. Z. Yuan, Q. Zhang, and H. Jiang, “Simultaneous reconstruction of acoustic and optical properties of heterogeneous media by quantitative photoacoustic tomography,” Opt. Express 14, 6749–6754 (2006).
3. Q. Zhang, Z. Liu, P. R. Carney, and H. Jiang, “Noninvasive imaging of epileptic seizures in vivo using photoacoustic tomography,” Phys. Med. Biol. 53, 1921–1931 (2008).
4. R. A. Kruger, D. Reinecke, and G. Kruger, “Thermoacoustic computed tomography-technical considerations,” Med. Phys. 26, 1832–1837 (1999).
5. S. J. Norton and T. Vo-Dinh, “Optoacoustic diffraction tomography: analysis of algorithms,” J. Opt. Soc. Am. A 20, 1859–1866 (2003).
6. M. Xu and L. V. Wang, “Universal back-projection algorithm for photoacoustic computed tomography,” Phys. Rev. E 71, 016706 (2005).
7. Z. Yuan, C. Wu, H. Zhao, and H. Jiang, “Imaging of small nanoparticle-containing objects by finite-element-based photoacoustic tomography,” Opt. Lett. 30, 3054–3056 (2005).
8. Z. Yuan and H. Jiang, “Quantitative photoacoustic tomography: recovery of optical absorption coefficient maps of heterogeneous media,” Appl. Phys. Lett. 88, 231101 (2006).
9. H. Jiang, Z. Yuan, and X. Gu, “Spatially varying optical and acoustic property reconstruction using finite-element-based photoacoustic tomography,” J. Opt. Soc. Am. 23, 878–888 (2006).
10. Z. Yuan, H. Zhao, C. Wu, Q. Zhang, and H. Jiang, “Finite-element-based photoacoustic tomography: phantom and chicken bone experiments,” Appl. Opt. 45, 3177–3183 (2006).
11. Z. Yuan and H. Jiang, “Three-dimensional finite-element-based photoacoustic tomography: reconstruction algorithm and simulations,” Med. Phys. 34, 538–546 (2007).
12. Y. Sun and H. Jiang, “Quantitative three-dimensional photoacoustic tomography of the finger joints: phantom studies in a spherical scanning geometry,” Phys. Med. Biol. 54, 5457–5467 (2009).
13. J. Prakash, V. Chandrasekharan, V. Upendra, and P. K. Yalavarthy, “Accelerating frequency-domain diffuse optical tomographic image reconstruction using graphics processing units,” J. Biomed. Opt. 15, 066009 (2010).
14. D. Wang, H. Qiao, X. Song, Y. Fan, and D. Li, “Fluorescence molecular tomography using a two-step three-dimensional shape-based reconstruction with graphics processing unit acceleration,” Appl. Opt. 51, 8731–8744 (2012).
15. D. M. Fernandez, M. M. Dehnavi, W. J. Gross, and D. Giannacopoulos, “Alternate parallel processing approach for FEM,” IEEE Trans. Magn. 48, 399–402 (2012).
16. X. Gu, Y. Xu, and H. Jiang, “Mesh-based enhancement schemes in diffuse optical tomography,” Med. Phys. 30, 861–869 (2003).

