Ultramicroscopy 141 (2014) 32–37


Stacked-Bloch-wave electron diffraction simulations using GPU acceleration

Robert S. Pennington*, Feng Wang, Christoph T. Koch

ELIM, Institut für Experimentelle Physik, Universität Ulm, Albert-Einstein-Allee 11, 89081 Ulm, Germany

Article history: Received 11 November 2013; received in revised form 28 February 2014; accepted 9 March 2014; available online 17 March 2014.

Abstract

In this paper, we discuss the advantages of Bloch-wave simulations performed using graphics processing units (GPUs), based on approximating the matrix exponential directly instead of performing a matrix diagonalization. Our direct matrix-exponential algorithm yields a functionally identical electron scattering matrix to that generated with matrix diagonalization. Using the matrix-exponential scaling-and-squaring method with a Padé approximation, direct GPU-based matrix-exponential double-precision calculations are up to 20× faster than CPU-based calculations and up to approximately 70× faster than matrix diagonalization. We compare the precision and runtime of scaling-and-squaring methods with either the Padé approximation or a Taylor expansion. We also discuss the stacked-Bloch-wave method, and show that our stacked-Bloch-wave implementation yields the same electron scattering matrix as traditional Bloch-wave matrix diagonalization.

© 2014 Elsevier B.V. All rights reserved.

Keywords: Electron diffraction; Bloch-wave; TEM simulation; Parallel processing; GPU programming

1. Introduction

While the transmission electron microscope (TEM) is a versatile tool for micro-, nano-, and sub-nano-scale materials characterization [1], it has long been known that quantitative contrast and intensity interpretation for many real specimens requires dynamical diffraction [2,3]. Two main computational algorithms have been developed to simulate dynamical diffraction in the TEM [4,5]: Bloch-wave (BW) [3,6–8] and multislice (MS) [9–12]. An example of Bloch-wave calculations for convergent-beam and large-angle rocking-beam electron diffraction patterns is shown in Fig. 1. In addition to direct comparison between experimental data and simulation through programs like JEMS [13], these dynamical-diffraction simulation algorithms are at the core of several information-reconstruction techniques such as structure refinement [14,15] and orientation mapping [16]. Thus, improvements in these algorithms benefit multiple applications. Recent reports of MS algorithms implemented on graphics processing units (GPUs) describe notable speed increases over CPU-based MS simulations for various applications, including simulated electron diffraction patterns and high-resolution images [14,17–19]. Additionally, an extension to the BW algorithm allows sequential stacked BW (SBW) matrices to represent inhomogeneous specimens through a layer-by-layer approach [20–22].

* Corresponding author. E-mail address: [email protected] (R.S. Pennington).

DOI: http://dx.doi.org/10.1016/j.ultramic.2014.03.003. © 2014 Elsevier B.V. All rights reserved.

In this paper, we discuss and present GPU-based calculations for Bloch-wave TEM data simulation. In Section 2, we conceptually compare GPU-based Bloch-wave simulations with GPU-based multislice, and we compare both to the stacked-Bloch-wave algorithm. In Section 3, we present several results from our GPU-based Bloch-wave simulations, and in Section 4 we discuss possible further innovations.

2. Theory

In this section, first, the differences between the stacked-Bloch-wave (SBW) algorithm and the standard Bloch-wave algorithm for TEM data simulation are discussed. Then, the computational requirements of different dynamical-diffraction simulation algorithms are discussed.

The SBW algorithm, introduced and discussed elsewhere [20–22], is here briefly summarized and compared with standard BW. Conventional BW takes a single scattering matrix (S) to represent specimen–beam interactions between the initial beam state ψ_0(q) and the final beam state

    ψ_f(q) = S · ψ_0(q)                                        (1)

with S = e^{(π i t / k_n) A}, where the matrix A contains specimen properties and excitation-error information, and k_n = n̂ · k, where k is the incident radiation wave vector with amplitude |k| = 1/λ and λ is the electron wavelength [2].

The primary difference between SBW and conventional BW is that SBW simulates the specimen–beam interaction as a sequential stack of reciprocal-space scattering S-matrices, similar to the MS method's sequential stack of real-space transmission potentials [9]. In the SBW algorithm, the final beam state ψ_f(q) is calculated by

    ψ_f(q) = S_N · (…(S_2 · (S_1 · ψ_0(q)))…)                  (2)

with S_j = e^{(π i t / k_n) A_j} = e^{T A_j}; in [21], S_j is decomposed into eigenvalues and eigenvectors. Calculating ψ_f(q) using SBW requires N matrix–vector multiplications, discussed later. Naturally, in the case of a single S-matrix, SBW's one-matrix "stack" should yield identical results to conventional BW.
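As a concrete illustration of Eqs. (1) and (2), the following minimal Python/NumPy sketch propagates a beam vector through a stack of layers. It is written for illustration only, not taken from our production code: the per-layer matrices A_j, the thickness t and k_n below are arbitrary stand-ins for values that a real simulation would build from the crystal potential and the experimental geometry.

import numpy as np
from scipy.linalg import expm

def sbw_propagate(psi0, A_layers, t, k_n):
    """Stacked-Bloch-wave propagation: psi_f = S_N ... S_2 S_1 psi_0 with S_j = expm[(pi*1j*t/k_n) A_j]."""
    T = np.pi * 1j * t / k_n
    psi = psi0
    for A_j in A_layers:              # one S-matrix and one matrix-vector product per layer
        S_j = expm(T * A_j)
        psi = S_j @ psi
    return psi

# Toy usage: 3 layers, 200 beams, random stand-ins for the A-matrices (arbitrary units)
rng = np.random.default_rng(1)
n_beams, n_layers = 200, 3
A_layers = [0.01 * rng.standard_normal((n_beams, n_beams)) for _ in range(n_layers)]
psi0 = np.zeros(n_beams, dtype=complex)
psi0[0] = 1.0                         # incident beam only
psi_f = sbw_propagate(psi0, A_layers, t=1.0, k_n=50.0)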


The two major TEM data-simulation methods, Bloch-wave (BW) and multislice (MS), are subject to different computational requirements and, thus, benefit in different ways from GPU acceleration. For the MS algorithm, the major rate-limiting step is the fast Fourier transforms (FFTs) required to move from one slice to the next [9]. GPUs have proven effective at speeding up the MS algorithm through GPU-based FFTs [17,19]. For the BW algorithm, two approaches are possible. Traditionally, the major rate-limiting step has been a matrix diagonalization to determine eigenvalues and eigenvectors, followed by the matrix exponential evaluated from the result [2,4,25]. Alternatively, a direct matrix-exponential approximation can be used to generate the S-matrix directly from the A-matrix [26,27], through e.g. a scaling-and-squaring algorithm [28]. The scaling-and-squaring algorithm, as described in [28] for the matrix exponential, first requires scaling the input matrix using the matrix norm M and calculating the integer scaling factor j = ceil(log_2 M). The algorithm then scales down the input matrix by r = 2^j, performs the matrix exponential through e.g. a Padé approximation, and squares the result j times:

    S = (e^{(T/r)A})^r                                          (3)

    S = (…((e^{(T/r)A})^2)^2…)^2    (j squarings)               (4)
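A minimal CPU-side sketch of Eqs. (3) and (4) in Python/NumPy is given below. It is an illustration rather than our production code: the input B stands for (π i t / k_n) A, the Padé order q and the random test matrix are arbitrary, and a GPU implementation would route the linear solve and matrix products through CULA/CUBLAS rather than LAPACK.

import numpy as np

def expm_scaling_squaring(B, q=7):
    """exp(B) by scaling and squaring (Eqs. (3)-(4)) with a q-th order diagonal Pade approximant."""
    n = B.shape[0]
    norm = np.linalg.norm(B, np.inf)
    j = int(np.ceil(np.log2(norm))) if norm > 1.0 else 0   # scaling factor; doubling ||B|| adds one squaring
    C = B / (2.0 ** j)                                     # scale down by r = 2^j
    # Build numerator N and denominator D of the diagonal Pade approximant of exp(C)
    X = np.eye(n, dtype=complex)
    N = np.eye(n, dtype=complex)
    D = np.eye(n, dtype=complex)
    c = 1.0
    for k in range(1, q + 1):
        c *= (q - k + 1.0) / ((2.0 * q - k + 1.0) * k)
        X = C @ X                                          # C^k
        N += c * X
        D += ((-1.0) ** k) * c * X
    S = np.linalg.solve(D, N)                              # linear solve (Zgesv for complex doubles)
    for _ in range(j):                                     # undo the scaling: square j times (Zgemm)
        S = S @ S
    return S

# Toy usage with an arbitrary complex test matrix
rng = np.random.default_rng(0)
B = 0.1 * (rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200)))
S = expm_scaling_squaring(B)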

Fig. 1. Electron diffraction patterns generated by our GPU-based simulation program for convergent-beam electron diffraction (CBED) and large-angle rocking-beam electron diffraction (LARBED) [23], both presented in [24] and used here as an illustrative example. (a) CBED pattern for silicon, simulated, 130 nm at 120 kV, maximum beam convergence semi-angle 4.2 mrad. (b) LARBED pattern for TlNb7O18, simulated, 10 nm at 120 kV. The disc radius corresponds to a beam tilt range of 70 mrad. Post-specimen beam-tilt compensation was 99.2%, where 100% would lead to a spot pattern like that from precession electron diffraction.

Because T/r is a function of {t, k_n, 2^j}, changing the specimen thickness, accelerating voltage, or the A-matrix can change the number of matrix multiplications j required to calculate the matrix exponential. However, the integer j changes in a stepwise fashion, so doubling the thickness leads to only one more matrix multiplication. The matrix-exponential approach combines well with the SBW algorithm, because the SBW algorithm uses S-matrices directly.

SBW has the same computational requirements as the matrix-exponential approach to BW, with the added requirement of matrix–vector multiplications between scattering matrices for each layer. When the beam propagates through the stack, matrix–vector multiplications are performed between each scattering matrix in the stack and the beam vector. Thus, the two important calculation steps for SBW are matrix–vector multiplications and the matrix exponential. However, for the common situation where the number of included Bloch waves is much greater than the number of slices in the SBW stack, if the matrix-exponential approximation involves matrix–matrix multiplications (as the scaling-and-squaring algorithm does [28]), then, to a first approximation, the most important routine for SBW is the matrix exponential. This is especially true if each layer's S-matrix is different due to inhomogeneity.
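The stepwise dependence of j on specimen thickness noted above can be seen directly in a toy calculation; the starting norm below is an arbitrary placeholder, not a value from our simulations.

import numpy as np

norm_TA = 37.0                                 # hypothetical norm of (pi*t/k_n)*A for some thickness t
for scale in (1, 2, 4, 8):                     # doubling t doubles the norm ...
    j = int(np.ceil(np.log2(scale * norm_TA)))
    print(f"thickness x{scale}: j = {j}")      # ... and adds exactly one squaring each time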

In summary, the time-consuming aspect of the MS algorithm is the FFTs, and of the SBW algorithm the matrix exponentials. The matrix exponential can be evaluated directly using the scaling-and-squaring method, instead of via matrix diagonalization. In Section 3, we show the effectiveness of GPU acceleration on SBW matrix exponentials, and the advantages of direct matrix-exponential evaluation over matrix diagonalization.

3. Results and analysis


For our simulations, we wrote two computer programs: one in the Python programming language that implements the SBW algorithm, including the matrix exponential, for both CPUs and GPUs, and one written natively in the C and CUDA programming languages for an alternative GPU-only matrix-exponential calculation. Both use the scaling-and-squaring matrix-exponential algorithm, the former with a Padé approximation and the latter with a Taylor expansion. In this section, we discuss and compare the precision and runtime characteristics of these two programs, and we compare them against a conventional matrix-diagonalization approach written in Python and using the same libraries as the scaling-and-squaring algorithm with the Padé approximation.

Python was chosen for our SBW simulations due to its extensive modular scientific package library, ease of use, ability to incorporate other programming languages, and general flexibility. Packages used (with version numbers in parentheses) include SciPy (0.10.1)/NumPy (1.6.1) for CPU-based vector and matrix operations [29], provided as part of the Enthought Python distribution, the ATOM Fortran program for absorption [30] via F2PY [31], PyCUDA (2012.1) for GPU-based operations [32], and scikits.cuda (0.042, currently available from http://lebedov.github.io/scikits.cuda/) for advanced GPU-based operations, including an interface to the CULA Dense package (R17) [33]. For standard CPU computing, optimized libraries are widely used for both matrix operations (e.g. LAPACK [34]) and FFT operations (e.g. FFTW3 [35]). Similar libraries also exist for GPU computing: CULA [33] is a version of LAPACK for GPUs, and an FFT library for GPU processing, CUFFT, is provided as part of the CUDA toolkit (Nvidia Corp.). The computing hardware used was a single desktop computer containing a 4-core Intel Xeon W3550 CPU (Intel Corp.) and an Nvidia Tesla K20 general-purpose GPU (Nvidia Corp.), both commercially available. All calculations presented here are performed with complex-double-precision operations, to allow for the potentially high dynamic range of weak beams, and only the zero-order Laue zone is simulated.

Multiple algorithms exist to approximate the matrix exponential; our chosen approach in the Python program relies on the scaling-and-squaring method and the Padé approximation, which requires solving a linear system of equations (LAPACK Zgesv) and matrix multiplications (LAPACK Zgemm) [28,29,34,36,37]. The scaling-and-squaring method using the Padé approximation was already implemented in the CPU-based SciPy software package [29], aiding GPU algorithmic adaptation. The Python program, on both the CPU and the GPU, uses floor instead of ceil when calculating the integer scaling factor j. For comparison, we have also written a traditional CPU-based matrix diagonalization [25] in Python, using the same software libraries and hardware. Both CPU-based routines used all four cores of the CPU.
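As an illustration of how the GPU libraries listed above are driven from Python, the sketch below off-loads a complex-double matrix product to the GPU with PyCUDA and scikits.cuda. The matrix contents are random stand-ins, and the import path and initialization call reflect our recollection of the 2012-era scikits.cuda interface, so they should be checked against the installed version.

import numpy as np
import pycuda.autoinit                       # create a CUDA context on the default device
import pycuda.gpuarray as gpuarray
from scikits.cuda import linalg as culinalg

culinalg.init()                              # initialise the CUBLAS/CULA bindings

n = 1024
a = (np.random.rand(n, n) + 1j * np.random.rand(n, n)).astype(np.complex128)
b = (np.random.rand(n, n) + 1j * np.random.rand(n, n)).astype(np.complex128)

a_gpu = gpuarray.to_gpu(a)                   # host -> device transfer
b_gpu = gpuarray.to_gpu(b)
c_gpu = culinalg.dot(a_gpu, b_gpu)           # Zgemm-equivalent executed on the GPU
c = c_gpu.get()                              # device -> host transfer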

Quantitatively, the scattering matrices generated with the approximated matrix exponential and with matrix diagonalization are functionally identical. Table 1 shows the maximum discrepancies between a BW scattering matrix generated with matrix diagonalization and a BW scattering matrix generated with different orders q of the GPU-based Padé approximation, both for 1 nm silicon [110] with 200 beams at 80 kV accelerating voltage.

Table 1
Precision test showing the maximum single-element difference between S-matrices simulating a 1 nm specimen generated with matrix diagonalization (S_d,1) and using the scaling-and-squaring algorithm for the matrix exponential (S_e,1×1) with a Padé approximation of order q. "Real" and "Imaginary" are the maximum-difference elements from the real part and the imaginary part of the S-matrices, respectively. The Δ values are the order of magnitude of the maximum discrepancy when using excitation errors for an incident beam tilted 1° in the [001] direction, which changes the diagonal of the A-matrix.

|S_d,1 − S_e,1×1|
Order    Real                        Imaginary
q = 3    9.19 × 10^-7  (Δ10^-6)      3.63 × 10^-6  (Δ10^-4)
q = 5    4.99 × 10^-13 (Δ10^-12)     1.56 × 10^-12 (Δ10^-12)
q = 7    2.67 × 10^-14 (Δ10^-14)     2.53 × 10^-14 (Δ10^-14)
q = 9    2.80 × 10^-14 (Δ10^-14)     2.53 × 10^-14 (Δ10^-14)
q = 11   2.53 × 10^-14 (Δ10^-14)     2.52 × 10^-14 (Δ10^-14)

Table 2
As Table 1, but using matrix diagonalization for a 100 nm specimen (S_d,100) and SBW with 100 layers of 1 nm slices (S_e,100×1).

|S_d,100 − S_e,100×1|
Order    Real                        Imaginary
q = 3    5.90 × 10^-5  (Δ10^-4)      6.92 × 10^-5  (Δ10^-4)
q = 5    3.22 × 10^-11 (Δ10^-10)     3.25 × 10^-11 (Δ10^-10)
q = 7    3.03 × 10^-12 (Δ10^-12)     2.78 × 10^-12 (Δ10^-12)
q = 9    3.02 × 10^-12 (Δ10^-12)     2.77 × 10^-12 (Δ10^-12)
q = 11   3.02 × 10^-12 (Δ10^-12)     2.58 × 10^-12 (Δ10^-12)
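The entries in Tables 1 and 2 are element-wise maximum differences between the two routes to the S-matrix. A minimal CPU-only sketch of that comparison is given below, using SciPy's expm and NumPy's eigendecomposition as stand-ins for our GPU code, and with a random matrix in place of a physical T·A.

import numpy as np
from scipy.linalg import expm

def s_matrix_discrepancy(TA):
    """Max element-wise |real| and |imag| difference between expm(TA) and V exp(w) V^{-1}."""
    S_exp = expm(TA)                              # scaling and squaring with a Pade approximant
    w, V = np.linalg.eig(TA)                      # traditional diagonalization route (Zgeev)
    S_diag = V @ np.diag(np.exp(w)) @ np.linalg.inv(V)
    diff = S_diag - S_exp
    return np.max(np.abs(diff.real)), np.max(np.abs(diff.imag))

rng = np.random.default_rng(0)
TA = 0.05j * rng.standard_normal((200, 200))      # arbitrary stand-in for (pi*i*t/k_n)*A
print(s_matrix_discrepancy(TA))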

If the Padé approximation is used with order q ≥ 5, the discrepancies between the two methods are minimal, and little improvement is seen above q ≥ 7, with discrepancies on the order of O(10^-14). Table 2 shows the maximum discrepancies between a 100 nm matrix-diagonalization BW scattering matrix and 100-layer SBW using a 1 nm S-matrix, also as a function of Padé approximation order q. For q ≥ 5, the discrepancies are minimal, and q ≥ 7 is a safe choice, with discrepancies of O(10^-12). The change in error arising from a shift in excitation errors due to a 1° beam tilt, which will increase some of the diagonal elements of the A-matrix, is also shown in Tables 1 and 2. Changing the excitation error can lead to larger elements along the diagonal of the A-matrix and thus could be a challenge for our matrix-exponential methods. For the tilted-beam cases, the single-matrix data in Table 1 show a general increase in maximum discrepancy, but Table 2, from 100-layer SBW, shows that this only has a substantial effect for q = 3 or q = 5, and has little influence for q ≥ 7. According to [28], q = 13 is generally preferred for double-precision applications, but, for S-matrix generation, there appears to be little precision advantage for q > 7. While the scaling-and-squaring algorithm with the Padé approximation is still an approximation to the matrix exponential, these tests indicate that it is as precise as matrix diagonalization for S-matrix generation.

A speed comparison of the CPU-based matrix-diagonalization method and the CPU-based scaling-and-squaring method with the 7th-order Padé approximation, shown in Fig. 2 for simulations including between 5 and 200 beams, reveals that the matrix-diagonalization method is approximately 3× slower than the CPU-based matrix-exponential method. For a 200-beam A-matrix, the matrix-diagonalization method requires 0.080 s per iteration (off the graph in Fig. 2), while the CPU-based scaling-and-squaring method requires 0.024 s per iteration. The eigenvector/eigenvalue generation subroutine (LAPACK Zgeev [34]) is where the matrix-diagonalization routine spends the majority of its time. However, this timing relationship may depend on the specific implementation, including software libraries and hardware, and other implementations may take different amounts of time. In summary, we find that the traditional CPU-based matrix diagonalization is slower than our CPU-based matrix-exponential algorithm using the scaling-and-squaring method with the Padé approximation.

While our matrix-exponential approach is faster than matrix diagonalization, it can be accelerated further using GPUs. Fig. 3 presents a compute-time comparison of the 7th-order Padé approximation to the matrix exponential in Python, as calculated on the CPU and GPU, as a function of A-matrix size between 50 and 3000 beams. The GPU code is faster than the CPU code for all but the smallest matrices, providing from 5× speedup for 300-beam matrices to 20× speedup for 3000-beam A-matrices. The behavior for the smallest matrices is shown in Fig. 2, where the runtimes are broken down into LAPACK and non-LAPACK subroutines. As noted above, references to LAPACK on GPUs in this paper refer to the CULA library implementation of LAPACK functionality.
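A small timing harness in the same spirit as Fig. 2, comparing the eigendecomposition route against SciPy's scaling-and-squaring expm on the CPU, is sketched below; the per-iteration numbers it prints depend on hardware and libraries and will not reproduce the values quoted above.

import time
import numpy as np
from scipy.linalg import expm

def average_runtime(fn, arg, runs=100):
    t0 = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return (time.perf_counter() - t0) / runs

def s_via_diagonalization(TA):
    w, V = np.linalg.eig(TA)                      # Zgeev dominates this route
    return V @ np.diag(np.exp(w)) @ np.linalg.inv(V)

rng = np.random.default_rng(1)
TA = 0.05j * rng.standard_normal((200, 200))      # 200-beam-sized stand-in matrix
print("diagonalization:", average_runtime(s_via_diagonalization, TA))
print("scaling/squaring:", average_runtime(expm, TA))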

Table 3
Precision test like that in Table 1 between a 200-beam S-matrix generated using the scaling-and-squaring method with a pth-order Taylor expansion for the matrix exponential (S_Y,1) and an S-matrix generated using matrix diagonalization (S_d,1). p is the order of the Taylor expansion. Δ is the order-of-magnitude maximum-discrepancy change between the zone-axis (ZA) and 1° orientations described in Table 1. The p = 9† and p = 12† precisions used the optimization detailed in Appendix A. The simulated specimen thickness was 1 nm at 80 kV accelerating voltage.

|S_d,1 − S_Y,1|
Order      Real                        Imaginary
p = 5      2.08 × 10^-3  (Δ10^-2)      2.88 × 10^-3  (Δ10^-3)
p = 9      5.00 × 10^-8  (Δ10^-7)      3.30 × 10^-8  (Δ10^-7)
p = 11     1.95 × 10^-10 (Δ10^-10)     7.15 × 10^-11 (Δ10^-9)
p = 12     3.55 × 10^-12 (Δ10^-10)     4.20 × 10^-12 (Δ10^-11)
p = 15     1.06 × 10^-12 (Δ10^-14)     1.02 × 10^-12 (Δ10^-13)
p = 18     1.01 × 10^-12 (Δ10^-14)     9.51 × 10^-13 (Δ10^-14)
p = 9†     5.00 × 10^-8                3.36 × 10^-8
p = 12†    3.55 × 10^-12               4.28 × 10^-12

Fig. 2. Speed comparisons of (top) the traditional matrix-diagonalization method of calculating an S-matrix [25], (middle) the scaling-and-squaring method with the 7th-order Padé approximation on the CPU via SciPy [28,29,36], and (bottom) the scaling-and-squaring method with the 7th-order Padé approximation on the GPU, coded in-house. Each panel plots time (s) against the number of beams (5–200), broken down into Eigensolver (Zgeev), Matrix multiplication (Zgemm), Linear-equation solving (Zgesv), and Other contributions. These speed tests were run on the same machine with the same libraries. For a realistic performance test, we exponentiated zero-order-Laue-zone A-matrices corresponding to 1 nm zone-axis [110] silicon measured at 80 kV. Times are averaged over 1000 runs.

Fig. 3. CPU and GPU benchmarks for the 7th-order Padé-approximated matrix exponential (log scale) and the speedup factor provided by the GPU (linear scale) as a function of A-matrix size. Dotted lines are cubic-polynomial fits. Times are averaged from 100 runs.

The CPU routine is faster for matrices with fewer than 120 beams, because the GPU routine has a larger, near-constant, non-LAPACK overhead. For larger matrices, the GPU routine is faster because matrix multiplication (Zgemm) is much faster on the GPU than on the CPU, while linear-equation solving (Zgesv) is approximately the same speed for both routines. Thus, for larger matrices, the overall routine is faster because the faster GPU-based matrix multiplication outweighs the higher fixed GPU overhead. The large improvement in matrix multiplication seen with the GPU calculations would also be beneficial for SBW, because the major difference between SBW and BW is the additional matrix–vector multiplications.

Additionally, we investigated the effect of using a Taylor expansion (optimized and non-optimized) inside the scaling-and-squaring algorithm instead of the Padé approximation, in order to compare different exponential-function approximations for generating S-matrices from A-matrices. Appendix A contains details of the optimization; data generated using the optimization are indicated with a † after the expansion order p. This approach was implemented and optimized natively in the C programming language with CUDA libraries, without Python, to provide the greatest speed benefit. Table 3 shows the precision of the Taylor expansion for a single S-matrix when compared to matrix diagonalization, similar to Table 1 for the Padé approximation. Using the optimized Taylor expansion makes little difference compared to the conventional Taylor expansion. Minimal discrepancies are found for a Taylor expansion of order p ≥ 12, although the discrepancy is still higher than for the Padé approximation at q = 5. The Δ values in Table 3 show the order-of-magnitude effect on the maximum real and imaginary discrepancy of the tilted-beam case considered above; for p ≤ 15, this noticeably increases the discrepancy, but the increase is still only O(10^-10) for p = 12. Therefore, the Taylor expansion should be used in this case only when the excitation error is relatively small; for larger excitation errors, the Padé approximation detailed above is more suitable. However, 9 ≤ p ≤ 11 may still be a suitable choice for comparison with experimental data of limited dynamic range.

Fig. 4 shows the runtime difference between the native C and CUDA optimized-Taylor-expansion matrix exponential and the Python-based GPU Padé-approximation code, running on A-matrices between 500 and 5000 beams. Approximations were chosen with similar error order O; see Tables 1 and 3. The optimized Taylor expansion of Appendix A was used for Fig. 4. The less precise functions in Fig. 4 are the Padé approximation with q = 3 (O(10^-7)) and the optimized Taylor expansion with p = 9† (O(10^-8)). The q = 7 (O(10^-14)) and p = 12† (O(10^-12)) approximations, discussed above, are where the matrix-exponential precision remains constant with additional order.


Fig. 4. Benchmarks as a function of A-matrix size for different GPU-based implementations of the scaling-and-squaring algorithm for the matrix exponential, compared against the CPU-based matrix diagonalization. Graphed is the runtime for the CPU-based matrix diagonalization (blue) and two different GPU-based routines: a Python-based program using the Padé approximation (grey and black) and a native C and CUDA program using the optimized Taylor expansion (green). The Padé approximation shows the time required for q = 3 (dashed grey) and q = 7 (solid black). The Taylor expansion shows the time required for p = 9† (dashed green) and p = 12† (solid green). The 5000-beam p = 12† matrix exponential is omitted because it ran out of memory at an intermediate step. Times are averaged from 10 runs. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

For both precision cases, the Taylor expansion is faster than the Padé approximation. For the calculation of the scaling factor j, the optimized Taylor expansion used ceil instead of floor, which leads to one more matrix multiplication relative to the Padé-approximation implementation used here; even with this speed penalty, the Taylor expansion is faster. All four GPU-based methods shown in Fig. 4 run at similar speeds, and the more precise version of each algorithm is slower than the less precise version. Fig. 4 also shows that CPU-based matrix diagonalization is much slower than the GPU-based direct matrix exponential. The q = 7 GPU-based code has a maximum speed advantage of 39× over traditional matrix diagonalization; the p = 9† code has a maximum speed advantage of more than 69×. In summary, GPU-based programs can perform electron-scattering-matrix generation with the same accuracy as conventional matrix diagonalization, but much faster.
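For reference, the conventional (undaggered) Taylor variant benchmarked in Table 3 simply replaces the Padé step inside the same scaling-and-squaring loop with a truncated series, at the cost of p − 1 matrix products. A minimal CPU sketch, with an arbitrary test matrix and without the factorization of Appendix A, is given below.

import numpy as np

def expm_taylor_scaled(B, p=12):
    """exp(B) via scaling and squaring with a p-th order truncated Taylor series."""
    n = B.shape[0]
    norm = np.linalg.norm(B, np.inf)
    j = int(np.ceil(np.log2(norm))) if norm > 1.0 else 0   # ceil, as in the C/CUDA implementation
    C = B / (2.0 ** j)
    term = C.copy()                                        # C^1 / 1!
    S = np.eye(n, dtype=complex) + term
    for k in range(2, p + 1):
        term = (term @ C) / k                              # C^k / k!  (p - 1 matrix products in total)
        S = S + term
    for _ in range(j):                                     # undo the scaling
        S = S @ S
    return S

rng = np.random.default_rng(2)
S = expm_taylor_scaled(0.1j * rng.standard_normal((200, 200)))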

4. Conclusions and future work

The Bloch-wave method is more efficiently performed using a direct matrix-exponential approximation rather than the traditional matrix diagonalization. In addition, the matrix exponential, using the scaling-and-squaring method with the Padé approximation, is more efficiently performed using graphics processing units (GPUs) rather than conventional CPUs. Inside the scaling-and-squaring algorithm, a 7th-order Padé approximation is sufficient for practical purposes. In addition, the stacked-Bloch-wave algorithm (SBW) allows for layer-by-layer Bloch-wave calculations, and the matrix multiplications required for SBW are more efficiently carried out on the GPU than on the CPU. Direct evaluation of the matrix exponential with GPUs can provide a speedup of more than 20× relative to the CPU for larger matrices, and of approximately 70× relative to the CPU-based matrix-diagonalization algorithm. In the scaling-and-squaring method, both a Padé approximation and an optimized Taylor expansion were considered; the latter was found to be faster, but slightly less precise.

Future work could explore applying GPU-accelerated Bloch-wave code to the dynamical-diffraction applications mentioned in Section 1 (e.g. orientation determination). Other ways of approximating the matrix exponential could be explored, evaluating their accuracy and compute time relative to the methods presented here. Different computing hardware and different parallel-processing libraries could also change the computational efficiency of the direct matrix exponential relative to matrix diagonalization.

Acknowledgments

The authors acknowledge funding by the Carl Zeiss Foundation as well as the German Research Foundation (Grant no. KO 2911/7-1).

Appendix A. Optimized Taylor expansion

In this appendix, we discuss the optimized Taylor expansion for the matrix exponential used in the calculations in Section 3. This optimized Taylor expansion minimizes the number of matrix–matrix multiplications, so p = 9† requires only 4 instead of 8, and p = 12† only 5 instead of 11. The optimized Taylor expansion is used in Table 3 and Fig. 4 and is indicated by a daggered numeral (p = 9†); the conventional Taylor expansion, also used in Table 3, is indicated by an undaggered numeral (p = 9). The conventional Taylor expansion for exponentiating the matrix M is

    e^M = Σ_{r=0}^{∞} M^r / r! = 1 + M + (1/2) M^2 + (1/6) M^3 + …          (A.1)

This would lead to p − 1 matrix multiplications for an order-p Taylor expansion. The optimized Taylor-expansion calculations presented here as p = 9† and p = 12† use the pre-calculated complex factorization of the Taylor polynomial of order p:

    e^M ≈ 1 + M + (1/2) M^2 + (1/6) M^3 + … + (1/p!) M^p                    (A.2)

    e^M ≈ (1/p!) (M − c_1)(M − c_2)(M − c_3)…(M − c_p)                      (A.3)

p = 9† is now taken as an example. Clustering the terms in Eq. (A.3) leads to three polynomials in M, with different coefficients, shown in brackets:

    e^M|_{p=9†} = (1/p!) [(M − c_1)(M − c_2)(M − c_3)]
                         [(M − c_4)(M − c_5)(M − c_6)]
                         [(M − c_7)(M − c_8)(M − c_9)]                      (A.4)

    e^M|_{p=9†} = (1/p!) [M^3 − (c_1 + c_2 + c_3) M^2 + …]
                         [M^3 − (c_4 + c_5 + c_6) M^2 + …]
                         [M^3 − (c_7 + c_8 + c_9) M^2 + …]                  (A.5)

    e^M|_{p=9†} = (1/p!) N_123 · N_456 · N_789                              (A.6)

Generating all three bracketed polynomials in Eq. (A.5) requires generating M^3 and M^2 only once, for two matrix multiplications. Each polynomial in Eq. (A.5) is then accumulated into a single matrix in Eq. (A.6), yielding the three N_abc matrices. Performing the two matrix–matrix multiplications in Eq. (A.6) then yields e^M|_{p=9†} in only four matrix–matrix multiplications in total. Similarly, e^M|_{p=12†} requires only five matrix–matrix multiplications, and so on.
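A compact Python rendering of the p = 9† evaluation in Eqs. (A.4)–(A.6) is sketched below; it illustrates the clustering rather than reproducing our CUDA kernel, and it assumes the nine roots of Table A1 are supplied, in order, as the list c of Python complex numbers.

import numpy as np
from math import factorial

def taylor9_optimized(M, c):
    """9th-order Taylor polynomial of exp(M) in four matrix-matrix products (Eqs. (A.4)-(A.6))."""
    n = M.shape[0]
    I = np.eye(n, dtype=complex)
    M2 = M @ M                                   # 1st matrix-matrix product
    M3 = M2 @ M                                  # 2nd
    def cluster(ca, cb, cc):
        # expanded (M - ca)(M - cb)(M - cc), built from the precomputed powers with no extra products
        return (M3 - (ca + cb + cc) * M2
                + (ca * cb + ca * cc + cb * cc) * M
                - (ca * cb * cc) * I)
    N123 = cluster(c[0], c[1], c[2])
    N456 = cluster(c[3], c[4], c[5])
    N789 = cluster(c[6], c[7], c[8])
    return (N123 @ N456) @ N789 / factorial(9)   # 3rd and 4th products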


Table A1
Complex coefficients for the 9th-order optimized Taylor expansion.

c1   -3.3335514852690488032942739163345055 + 0i
c2   -3.0386480729366970892124687564926859 - 1.5868011957588383288038677051222921i
c3   -3.0386480729366970892124687564926859 + 1.5868011957588383288038677051222921i
c4   -2.1108398003026547374987047865183922 - 3.0899109287255009227777015426228801i
c5   -2.1108398003026547374987047865183922 + 3.0899109287255009227777015426228801i
c6   -0.38106984566311299903129424501333242 - 4.3846445331453979503692027283066828i
c7   -0.38106984566311299903129424501333242 + 4.3846445331453979503692027283066828i
c8    2.6973334615369892273896047461916633 - 5.1841620626494141778340870727109629i
c9    2.6973334615369892273896047461916633 + 5.1841620626494141778340870727109629i

Table A2
Complex coefficients for the 12th-order optimized Taylor expansion.

c1    -4.1356082392646311334114795511234522 - 0.77554204911016202877346518523227687i
c2    -4.1356082392646311334114795511234522 + 0.77554204911016202877346518523227687i
c3    -3.6888971024462053570220817044343043 - 2.3027571492470953276351045107744921i
c4    -3.6888971024462053570220817044343043 + 2.3027571492470953276351045107744921i
c5    -2.7579892311919211464358709785450804 - 3.7525483823537071551232429081271234i
c6    -2.7579892311919211464358709785450804 + 3.7525483823537071551232429081271234i
c7    -1.2491251433584184050195851818976650 - 5.0495551072841080894517136908904045i
c8    -1.2491251433584184050195851818976650 + 5.0495551072841080894517136908904045i
c9     1.0534236396562679920254003621447688 - 6.0594914338643727910205790817884843i
c10    1.0534236396562679920254003621447688 + 6.0594914338643727910205790817884843i
c11    4.7781960766049080498636170538557330 - 6.4511763374203890543976549462152707i
c12    4.7781960766049080498636170538557330 + 6.4511763374203890543976549462152707i

However, this optimization requires the pre-calculated complex coefficients C_p† = {c_1, c_2, …, c_p} to be accurate. Our set of C_p†, shown in Tables A1 and A2, was determined by root-finding the Taylor polynomial to the maximum possible precision using the fsolve function in the commercially available program Maple 16, but other approaches are possible. When multiplying the factorization in Eq. (A.3) back together, the residuals for e^M|_{p=9†} were of order 10^-29/9!, and for e^M|_{p=12†}, of order 10^-26/12!.

References

[1] D.B. Williams, C.B. Carter, Transmission Electron Microscopy, 2nd ed., Springer, New York, 2009. [2] P. Hirsch, A. Howie, R. Nicholson, D. Pashley, M. Whelan, Electron Microscopy of Thin Crystals, 2nd ed., Robert E. Krieger Publishing Co., Malabar, FL, 1977. [3] F. Fujimoto, Periodicity of crystal structure images in electron microscopy with crystal thickness, Phys. Status Solidi (a) 45 (1) (1978) 99–106. [4] J.C.H. Spence, J.M. Zuo, Electron Microdiffraction, Plenum Press, New York, NY, 1992. [5] E.J. Kirkland, Advanced Computing in Electron Microscopy, 2nd ed., Springer US, New York, NY, 2010.


[6] H. Bethe, Theorie der Beugung von Elektronen an Kristallen, Ann. Phys. 392 (17) (1928) 55–129. [7] C.H. Mac Gillavry, Zur prüfung der dynamischen theorie der elektronenbeugung am kristallgitter, Physica 7 (4) (1940) 329–343. [8] K. Kambe, Visualization of Bloch waves of high energy electrons in high resolution electron microscopy, Ultramicroscopy 10 (3) (1982) 223–227. [9] K. Ishizuka, N. Uyeda, A new theoretical and practical approach to the multislice method, Acta Crystallogr. Sect. A 33 (5) (1977) 740–749. [10] K. Ishizuka, Multislice formula for inclined illumination, Acta Crystallogr. Sect. A 38 (6) (1982) 773–779. [11] K. Ishizuka, A practical approach for STEM image simulation based on the FFT multislice method, Ultramicroscopy 90 (2–3) (2002) 71–83. [12] K. Ishizuka, FFT multislice method—the silver anniversary, Microsc. Microanal. 10 (01) (2004) 34–40. [13] P. Stadelmann, EMS-a software package for electron diffraction analysis and HREM image simulation in materials science, Ultramicroscopy 21 (2) (1987) 131–145. [14] L. Palatinus, D. Jacob, P. Cuvillier, M. Klementová, W. Sinkler, L.D. Marks, Structure refinement from precession electron diffraction data, Acta Crystallogr. Sect. A 69 (2) (2013) 171–188. [15] J.M. Zuo, J. Spence, Automated structure factor refinement from convergentbeam patterns, Ultramicroscopy 35 (3) (1991) 185–196. [16] A. Morawiec, E. Bouzy, H. Paul, J.J. Fundenberger, Orientation precision of TEM-based orientation mapping techniques, Ultramicroscopy 136 (2014) 107–118. [17] C. Dwyer, Simulation of scanning transmission electron microscope images on desktop computers, Ultramicroscopy 110 (3) (2010) 195–198. [18] M.A. Dyson, J. Sloan, GPU accelerated image simulation, image processing and exit wave reconstruction, in: Proceedings of the 15th European Microscopy Conference, 2012. [19] A.S. Eggeman, A. London, P.A. Midgley, Ultrafast electron diffraction pattern simulations using GPU technology. Applications to lattice vibrations, Ultramicroscopy 134 (2013) 44–47. [20] D.J. Eaglesham, C.J. Kiely, D. Cherns, M. Missous, Electron diffraction from epitaxial crystals—a convergent-beam electron diffraction of the interface structure for NiSi 2/Si and Al/GaAs, Philos. Mag. A 60 (2) (1989) 161–175. [21] T. Yamazaki, K. Watanabe, K. Kuramochi, I. Hashimoto, Extended dynamical HAADF STEM image simulation using the bloch-wave method, Acta Crystallogr. Sect. A 62 (4) (2006) 233–236. [22] M. Ohtsuka, T. Yamazaki, I. Hashimoto, K. Watanabe, Many-beam dynamical simulation for multilayer structures without a superlattice cell, Acta Crystallogr. Sect. A 65 (2) (2009) 135–140. [23] C.T. Koch, Aberration-compensated large-angle rocking-beam electron diffraction, Ultramicroscopy 111 (7) (2011) 828–840. [24] F. Wang, R.S. Pennington, C.T. Koch, GPU-accelerated Bloch wave calculation for quantitative structure factor determination, in: Proceedings MC 2013, Regensburg, Germany, 2013. [25] P. Rez, The absorption of bloch waves in high energy electron diffraction, Phys. Status Solidi (a) 55 (1) (1979) K79–K82. [26] F. Fujimoto, Dynamical theory of electron diffraction in Laue-case, I. General theory, J. Phys. Soc. Japan 14 (11) (1959) 1558–1568. [27] C. Humphreys, The scattering of fast electrons by crystals, Rep. Prog. Phys. 42 (1979) 1825–1887. [28] N.J. Higham, The scaling and squaring method for the matrix exponential revisited, SIAM J. Matrix Anal. Appl. 26 (4) (2005) 1179–1193. [29] T.E. Oliphant, Python for scientific computing, Comput. Sci. Eng. 9 (3) (2007) 10–20. [30] D.M. 
Bird, Q.A. King, Absorptive form factors for high-energy electron diffraction, Acta Crystallogr. A 46 (1990) 202–208. [31] P. Peterson, F2PY: a tool for connecting Fortran and Python programs, Int. J. Comput. Sci. Eng. 4 (4) (2009) 296. [32] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, A. Fasih, PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation, Parallel Comput. 38 (3) (2012) 157–174. [33] J.R. Humphrey, D.K. Price, K.E. Spagnoli, A.L. Paolini, E.J. Kelmelis, CULA: hybrid GPU accelerated linear algebra routines, in: Proceedings of SPIE, Modeling and Simulation for Defense Systems and Applications V, vol. 7705, 2010, pp. 770502-1–770502-7. [34] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK Users' Guide, 3rd ed., Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999. [35] M. Frigo, S.G. Johnson, The design and implementation of FFTW3, in: Proceedings of the IEEE, vol. 93, no. 2, 2005, pp. 216–231. [36] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, Cambridge, UK, 1992. [37] C. Moler, C. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later, SIAM Rev. 45 (1) (2003) 3–49.
