Assignment 3, PHYD57

Due date and time: 8 Nov (Fri) at 11am

We denote the first and second $x$-derivatives as $\partial_x \equiv \frac{\partial}{\partial x}$ and $\partial_{xx} \equiv \frac{\partial^2}{\partial x^2}$.

Problem 3.1 [25 pts] Research the history of GPGPU (i.e., GPU computing)

We are going to talk about GPU computing during the lectures, but will not cover the history of GPU computing in detail. Study it using the numerous resources available on:
(i) course web page, Links > Parallel computing - CUDA C & CUDA Fortran,
(ii) the auxiliary page /xx with books, which sometimes provide a historical introduction,
(iii) the internet.
In your 2-page writeup (max. 3 pages please) mention not just the history in the sense of who did what & when, but also how GPU computing differs from CPU computing. You may include some diagrams or short code snippets (in Word, LaTeX, or Jupyter notebook format). Try to capture the essential things; it's easy to get carried away citing various interesting but not essential information and suddenly run out of space (no more than 3 pages please). Cite more than two sources, mention the types of applications GPU computing has found, and describe any personal experience with graphics computing.

Problem 3.2 [25 pts] Stencils, stability and CFL numbers

Introduction and the definition of the CFL number

Much of numerical physics revolves around the solution of PDEs (partial differential equations). Everything that happens in n-dimensional space and involves time evolution will likely be described by a PDE, specifying how fast the quantities involved change in time at a given location on a stationary or moving grid (discretizing the space with local cell sizes $\Delta x$, $\Delta y$, etc.). Time is discretized too, into intervals of duration $\Delta t$.

The CFL (Courant-Friedrichs-Lewy) condition connects $\Delta x$, $\Delta y$, ... and $\Delta t$ in wave propagation and fluid flow problems, governed by the so-called hyperbolic equations such as the Navier-Stokes equations, the ideal gas dynamics equations called the Euler equations, and various water wave equations (deep water, shallow water equations). Essentially, the meaning of such conditions is to specify when a given computational stencil, i.e. a pattern of combining the data from the previous time step to produce the updated value at a given location in the next time step, works well, and when it produces instabilities. Instabilities are not just truncation errors, i.e. inaccuracies caused by numerical approximations to mathematical operations like derivatives and integrals. Instabilities introduce exponentially growing errors, quickly making the solution non-physical.
There are so-called implicit schemes, which use stencils transferring information about the future (next timestep) state of the spatial neighbors of a given cell in order to compute the future state of that cell, and many of them don't have a CFL-like condition because they are always stable, no matter how large a $dt= \Delta t$ one uses. Such schemes are said to be unconditionally stable. The problem is that they require information (about neighbors) which is not yet available. In practice, one can devise an iterative scheme which derives better and better approximations to all the future values of variables on the grid, but there is a price to pay for this: implicit schemes are computationally expensive. This forces the user to adopt larger timesteps than in explicit schemes, with an adverse effect on accuracy. Some of the best available CFD (Computational Fluid Dynamics) algorithms are therefore explicit.

You have learned much about the approximators of the 1st and 2nd derivatives on a grid. Now is the time to see whether they work in the context of some popular PDEs. One example that we considered in Lectures is the von Neumann analysis of error propagation in the heat (or diffusion) equation $\partial_t T = D\; \partial_{xx} T$, when first-order forward differencing is used for time and the 2nd-order symmetric formula is used for the 2nd derivative on the r.h.s. The stencil looks like this:

 
    *
    |
o___o___o

Time goes up by one timestep; the 3 spatial grid points are shown as o. In more dimensions we have 5 o's in 2-D and 7 o's in 3-D. The discretization of the 1-D heat equation goes as follows: $$\frac{T_j^{n+1} - T_j^n}{\Delta t} = D\; \frac{T_{j-1}^n - 2 T_j^n +T_{j+1}^n } {(\Delta x)^2}$$ or $$T_j^{n+1}=T_j^n +\frac{\Delta t \;D}{(\Delta x)^2} (T_{j-1}^n - 2 T_j^n +T_{j+1}^n)$$ We mark the timestep number as an upper index, and the spatial cell number as a lower index. Notice that we could have denoted the nondimensional constant $D\;\Delta t/(\Delta x)^2$ as a CFL number. We found that CFL $\le 1/(2n)$ is necessary and sufficient for the stability of the considered scheme, i.e. the timestep must satisfy the inequality $\Delta t\le (\Delta x)^2/(2nD)$. Check the lecture notes.
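
To make the update above concrete, here is a minimal C sketch of this explicit scheme (array names, sizes, and the initial and boundary conditions are illustrative assumptions, not part of the assignment):

    /* FTCS update for the 1-D heat equation; cfl = D*dt/dx^2 must be <= 1/2 in 1-D */
    #include <stdio.h>
    #define NX 100

    int main(void)
    {
        double T[NX], Tnew[NX];
        double D = 1.0, dx = 1.0/(NX-1);
        double cfl = 0.25;                      /* try cfl > 0.5 to see the instability */
        double dt = cfl*dx*dx/D;

        for (int j = 0; j < NX; j++) T[j] = (j == NX/2) ? 1.0 : 0.0;   /* initial spike */

        for (int n = 0; n < 1000; n++) {        /* time loop */
            for (int j = 1; j < NX-1; j++)      /* interior points only */
                Tnew[j] = T[j] + cfl*(T[j-1] - 2.0*T[j] + T[j+1]);
            Tnew[0] = 0.0;  Tnew[NX-1] = 0.0;   /* fixed-temperature ends (an assumption) */
            for (int j = 0; j < NX; j++) T[j] = Tnew[j];
        }
        printf("dt = %g, T at the centre after 1000 steps: %g\n", dt, T[NX/2]);
        return 0;
    }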

The physical meaning of $(\Delta x)^2/D$ is the diffusion time after which a local peak or trough of $T$ of width $\Delta x$ gets significantly leveled by diffusion. The timestep must be a couple of times smaller than this diffusion timescale across one cell for the algorithm to be able to adequately describe the diffusive temperature changes.

Stability of schemes for advection and Schrödinger equation

The 1-D advection equation describes the transport of a quantity described by a function $u(x,t)$ along the $x$-axis: $$ \partial_t u = c \,\partial_x u $$ where $c = const$ is the propagation speed (its sign determines the direction of propagation). In multidimensional space we'd write $ \partial_t u = \vec{c}\cdot\nabla u $. In 1-D $\nabla = \partial_x$, in 2-D it is $\nabla = (\partial_x, \partial_y)$, and so on.

The time-dependent Schrödinger equation describes the time evolution of $\psi (\vec{r},t)$, the quantum-mechanical wave function: $$i\hbar \;\partial_t \psi(\vec{r},t) = \left[-\frac{\hbar^2}{2m}\,\nabla^2 + V(\vec{r})\right] \,\psi(\vec{r},t),$$ where $m$ is the mass of the particle represented by the wave function, $\hbar = h/(2\pi)$ is the reduced Planck constant, and $\nabla^2$ is the Laplace operator. In 1-D, $\nabla^2 = \partial_{xx}$, in 2-D it is $\nabla^2 = \partial_{xx}+\partial_{yy}$, and so on. We are interested in the simple case of free-space propagation of the wave function, $V(\vec{r})= 0$.

Show that the discretization of either of the above equations, using forward time differencing and the symmetric (centered) formulas for the spatial derivatives as in the diffusion scheme above, is unconditionally unstable. That is, there is no CFL relation guaranteeing stability, because the numerical solution always blows up.
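
For concreteness (this is only the setup, not the stability analysis you are asked for): with forward time differencing, the centered 1st-derivative formula turns the 1-D advection equation into $$u_j^{n+1} = u_j^n + \frac{c\,\Delta t}{2\Delta x}\left(u_{j+1}^n - u_{j-1}^n\right),$$ and for $V=0$ the 1-D Schrödinger equation, rewritten as $\partial_t \psi = \frac{i\hbar}{2m}\,\partial_{xx}\psi$, becomes $$\psi_j^{n+1} = \psi_j^n + \frac{i\hbar\,\Delta t}{2m\,(\Delta x)^2}\left(\psi_{j-1}^n - 2\psi_j^n + \psi_{j+1}^n\right).$$ These are the two schemes whose von Neumann analysis you should carry out.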

So what do we do in such a situation? We have to employ more sophisticated explicit methods, or implicit methods.

The Klein-Gordon and the wave equation. CFL limit

There are a number of wave equations in physics that describe the propagation of a disturbance in some medium, or of quantum-mechanical probability wave functions in vacuum. Notice that the Schrödinger equation does not contain the speed of light $c$. This is because it applies to nonrelativistic particles only, a fact that initially saddened its discoverer so much that he seriously considered not publishing it.

One attempt was to formulate a wave equation patterned after the relativistic energy triangle equation $$E^2 = (mc^2)^2 + (pc)^2$$ where $m$ is the rest mass of a particle, $p$ its relativistic momentum (in general not equal to $mv$, since that product would be limited in value because of the $v\le c$ requirement), and $E$ the total energy. This gave the Klein-Gordon equation which, as we now know, describes spin-zero particles such as the Higgs boson.

Let's consider the application to massless particles first. If $m=0$, the Klein-Gordon equation looks exactly like the familiar wave equation of classical physics, although the function $\psi$ is the same quantum-mechanical wave function as in the Schrödinger equation, and is in general complex-valued. $$ \partial_{tt} \psi = c^2 \;\nabla^2 \psi $$

Use von Neumann's analysis, in which the spatial and time dependence of the error is decomposed into Fourier components that depend on position $\vec{r}$ and time like $e^{i(\vec{k}\cdot \vec{r} - \omega t)} $, $\vec{k}$ being the wave vector and $\omega$ the wave's frequency. Show that any wave equation (including the homogeneous, massless Klein-Gordon equation) can be successfully solved numerically with the 2nd-derivative approximator (2nd-order accuracy) used in the heat equation. Derive the critical CFL number. The explicit numerical stencil has the form

 
    *
    |
o___o___o
    |
    o

Time runs upwards (2 time steps) and space extends horizontally; we illustrate only one spatial dimension, but analogous multidimensional stencils can be written down. The CFL number will depend on the dimension $n$ of space! This is how we define the CFL number for this equation, denoting it for brevity as $q$ in 1-D: $$\frac{\psi_{j}^{n+1} - 2\psi_{j}^n +\psi_{j}^{n-1}}{(\Delta t)^2} = c^2\; \frac{\psi_{j+1}^n - 2 \psi_{j}^n +\psi_{j-1}^n }{(\Delta x)^2} $$ or $$\psi_j^{n+1}-2\psi_j^n +\psi_j^{n-1} = q^2(\psi_{j-1}^n -2\psi_j^n + \psi_{j+1}^n);\;\;\;\;\;\;\;\; q :=\mbox{CFL} =\frac{c\;\Delta t}{\Delta x}.$$ Let us illustrate the 2-D wave equation by labeling spatial locations with indices $j,l$: $$\frac{\psi_{j,l}^{n+1} - 2\psi_{j,l}^n +\psi_{j,l}^{n-1}}{(\Delta t)^2} = c^2\; \left[ \frac{\psi_{j+1,l}^n - 2 \psi_{j,l}^n +\psi_{j-1,l}^n }{(\Delta x)^2} + \frac{\psi_{j,l+1}^n - 2 \psi_{j,l}^n +\psi_{j,l-1}^n }{(\Delta y)^2} \right]$$ or $$\psi_{j,l}^{n+1} - 2\psi_{j,l}^n +\psi_{j,l}^{n-1} = q^2 (\psi_{j+1,l}^n - 2 \psi_{j,l}^n +\psi_{j-1,l}^n) + p^2 (\psi_{j,l+1}^n - 2 \psi_{j,l}^n +\psi_{j,l-1}^n) ;\;\;\;\;\;\;\;\; q =\frac{c\;\Delta t}{\Delta x}, \;\; p=\frac{c\;\Delta t}{\Delta y}.$$ Having different cell sizes in different directions on the grid should not make the task much more difficult, but if you prefer you can consider a spatial grid with $\Delta x = \Delta y = \ldots$. Interpret your result by comparing the numerical speed of propagation of information on the uniform space-time grid with the physical wave-crest velocity $c$.
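
For reference, here is a minimal C sketch of the 1-D three-level update written above (names, initial and boundary conditions are illustrative assumptions; a real-valued $\psi$ is used, but complex arithmetic works the same way):

    /* Explicit three-level (leapfrog-in-time) update for the 1-D wave equation.
       q = c*dt/dx is the CFL number defined in the text. */
    #include <math.h>
    #include <stdio.h>
    #define NX 200

    int main(void)
    {
        double psi_old[NX], psi[NX], psi_new[NX];
        double dx = 1.0/(NX-1);
        double q = 0.9;                          /* experiment with q near the critical value */

        for (int j = 0; j < NX; j++) {           /* Gaussian pulse initially at rest */
            double x = j*dx - 0.5;
            psi[j] = exp(-200.0*x*x);
            psi_old[j] = psi[j];                 /* zero initial time derivative */
        }
        for (int n = 0; n < 400; n++) {
            for (int j = 1; j < NX-1; j++)
                psi_new[j] = 2.0*psi[j] - psi_old[j]
                           + q*q*(psi[j-1] - 2.0*psi[j] + psi[j+1]);
            psi_new[0] = 0.0;  psi_new[NX-1] = 0.0;   /* fixed ends (an assumption) */
            for (int j = 0; j < NX; j++) { psi_old[j] = psi[j]; psi[j] = psi_new[j]; }
        }
        printf("psi at the centre after 400 steps: %g\n", psi[NX/2]);
        return 0;
    }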

HINT: Follow the philosophy that was used for the diffusion equation in L5, and use the trick $ e^{iz} + e^{-iz} -2 = -4 \,\sin^2 (z/2)$. Judge at which values of $q$ there are no real-valued $\omega$'s that satisfy the equation, assuming a real-valued $k$ (or $\vec{k}$ if the dimension $n>1$).
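
As a reminder of how such a substitution works, here is the diffusion-equation case from L5 (not the answer to this problem): writing the error as $T_j^n = g^n\, e^{ikj\Delta x}$ with amplification factor $g = e^{-i\omega\Delta t}$ and $\alpha = D\,\Delta t/(\Delta x)^2$, the heat scheme from the introduction gives $$ g = 1 + \alpha\left(e^{ik\Delta x} - 2 + e^{-ik\Delta x}\right) = 1 - 4\alpha\,\sin^2\frac{k\Delta x}{2}, $$ and requiring $|g|\le 1$ for every real $k$ yields $\alpha \le 1/2$, i.e. the CFL $\le 1/(2n)$ result quoted above with $n=1$.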

OPTIONAL PART: Finally, if you dare, derive the stability criterion for the Klein-Gordon equation with $m > 0$, discretized as above. This is not necessary to get full points for this problem! But if you do it, some mistakes in the earlier parts of this problem will be forgiven.

Problem 3.3 [40 pts] N-body problem: Gravitational collapse

Write a code in C or Fortran that solves the N-body gravitational problem by the 4th order symplectic method (cf. L7). The code should be modular, i.e. write separate procedures/subroutines to: initialize the positions and velocities, compute accelerations by direct summation (that part is the bottleneck, as it has $O(N^2)$ complexity, so it needs to be well parallelized with OpenMP multithreading), compute the instantaneous values of the total energy and momentum of the system for monitoring accuracy, and the integrator that does the kick and push substeps, as well as the main program that calls those procedures. Unless you have other ideas, keep all the variables for all particles: coordinates, velocity components, and accelerations, in one array dimensioned 9 by N or N by 9 (depending on the language, since C and Fortran store rows and columns of arrays differently, and it is usually better to keep all data belonging to one particle physically close in RAM).
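
To fix ideas, here is a minimal C sketch of the main-loop structure with the N-by-9 layout; the kick/push (drift) coefficients shown are one common choice of 4th-order symplectic coefficients (the Forest-Ruth/Yoshida scheme), which you should check against L7, and accel() is the direct-summation routine sketched after the force formula below:

    /* Skeleton of a 4th-order symplectic step: drift ("push") and kick substeps.
       p[i][0..2] = positions, p[i][3..5] = velocities, p[i][6..8] = accelerations. */
    #include <math.h>
    #define N 1000

    double p[N][9];                         /* one particle per row (N-by-9 layout) */
    void accel(void);                       /* fills p[i][6..8]; defined elsewhere */

    static void drift(double h)             /* advance positions by h using velocities */
    {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < 3; k++) p[i][k] += h*p[i][k+3];
    }
    static void kick(double h)              /* advance velocities by h using accelerations */
    {
        accel();
        for (int i = 0; i < N; i++)
            for (int k = 0; k < 3; k++) p[i][k+3] += h*p[i][k+6];
    }

    void step(double dt)                    /* one full timestep, Forest-Ruth coefficients */
    {
        const double th = 1.0/(2.0 - cbrt(2.0));
        drift(0.5*th*dt);        kick(th*dt);
        drift(0.5*(1.0-th)*dt);  kick((1.0-2.0*th)*dt);
        drift(0.5*(1.0-th)*dt);  kick(th*dt);
        drift(0.5*th*dt);
    }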

Consider an initially motionless, uniform-density, spherical cloud of randomly placed $N=1000$ points, extending to radius $R=1$. The total mass is equal to $M=1$, and each body's mass is $m = M/N$. The gravitational constant is $G= 1$. [You are free to use the actual physical constants and dimensional values. I will not omit $M$ ($m$) or $G$ in the formulae below. The task you're facing does not depend on what values of $R$ and $M$ you use; only $dt$ and the simulated times will have different values, but the number of steps taken and the wall clock time should not be much different.]

The acceleration acting on particle $i$ is equal to $$\vec{f}_i = G m \sum_{j \ne i}\,\frac{\vec{r}_{ij}} {(r_{ij}^2+\epsilon^2)^{3/2}}$$ where $\vec{r}_{ij} = \vec{r}_j - \vec{r}_i$, $\;\;r_{ij}= |\vec{r}_{ij}|$, and the sum runs over all particles other than $i$. It's not required, but you may use a non-zero softening parameter $\epsilon \approx 10^{-5}$, effectively limiting the strength of the gravitational force at distances $r \lt \epsilon$. It's rare that particles come that close to each other, but their trajectories will then be computed more reasonably (as if gravity were actually non-Newtonian at small distances, or as if gravity were Newtonian but the particles were really fuzzy balls of smaller constituents with radius $\sim \epsilon$. In astronomy, such objects, with pair potential energy $-Gm^2/\sqrt{r^2+\epsilon^2}$, are known as Plummer spheres.)
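
A minimal sketch of the direct-summation routine in C with OpenMP, following the formula above (the full N x N version; names, the N-by-9 layout, and the value of $\epsilon$ follow the skeleton sketched earlier and are assumptions):

    /* Direct O(N^2) acceleration with Plummer softening, outer loop parallelized
       with OpenMP. Gm = G*m with G = 1 and m = M/N = 1/N. */
    #include <math.h>
    #define N 1000

    extern double p[N][9];                  /* x,y,z, vx,vy,vz, ax,ay,az per particle */
    static const double Gm   = 1.0/N;
    static const double eps2 = 1e-10;       /* softening parameter squared */

    void accel(void)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++) {
            double ax = 0.0, ay = 0.0, az = 0.0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;
                double dx = p[j][0]-p[i][0], dy = p[j][1]-p[i][1], dz = p[j][2]-p[i][2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;
                double w  = Gm/(r2*sqrt(r2));    /* Gm / (r^2 + eps^2)^(3/2) */
                ax += w*dx;  ay += w*dy;  az += w*dz;
            }
            p[i][6] = ax;  p[i][7] = ay;  p[i][8] = az;
        }
    }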

Monitor the relative change of the total mechanical energy, $(E-E_0)/E_0$, and the instantaneous total momentum of the system divided by $N$, where $E_0 = E(t=0)$ and $$ E = m \sum_i \left[\frac{v_i^2}{2} -Gm \sum_{j \lt i}\,\frac{1}{ \sqrt{r_{ij}^2+\epsilon^2}} \right], \;\;\;\;\;\;\; \vec{p} = \frac{m}{N} \sum_{i}\,\vec{v}_{i}.$$
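
A possible C sketch of these diagnostics (the pair potential is counted once via the $j<i$ inner loop; names and layout as in the earlier sketches):

    /* Total energy and total momentum divided by N, for accuracy monitoring. */
    #include <math.h>
    #define N 1000

    extern double p[N][9];
    static const double G = 1.0, m = 1.0/N, eps2 = 1e-10;

    double energy(void)
    {
        double E = 0.0;
        #pragma omp parallel for reduction(+:E) schedule(dynamic,16)
        for (int i = 0; i < N; i++) {
            double v2 = p[i][3]*p[i][3] + p[i][4]*p[i][4] + p[i][5]*p[i][5];
            double U  = 0.0;
            for (int j = 0; j < i; j++) {          /* each pair counted once */
                double dx = p[j][0]-p[i][0], dy = p[j][1]-p[i][1], dz = p[j][2]-p[i][2];
                U += 1.0/sqrt(dx*dx + dy*dy + dz*dz + eps2);
            }
            E += 0.5*v2 - G*m*U;
        }
        return m*E;
    }

    void momentum(double pm[3])                    /* returns (m/N) * sum of velocities */
    {
        pm[0] = pm[1] = pm[2] = 0.0;
        for (int i = 0; i < N; i++)
            for (int k = 0; k < 3; k++) pm[k] += p[i][k+3];
        for (int k = 0; k < 3; k++) pm[k] *= m/N;
    }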

Choose a constant timestep $dt$ that assures good conservation of energy and nearly zero total momentum during the simulation. State the absolute value of the momentum divided by $M$. Try to keep the relative energy error below $10^{-5}$. Consider any results where the relative error is larger than 0.01 to be garbage, affected by a very close encounter of particles. If the running time is too long, kill the program with CTRL-C after 10 or 15 minutes and run it again with a much smaller $N$; it should then finish quickly. Establish the biggest $N$ that gives a reasonably short runtime by trial and error guided by the method of bisection. Hopefully you won't have to decrease $N$, but who knows...

You will finish the computation once the following happens. Write a subroutine or procedure which you will call once in a while, not after every timestep (since that would slow down the code unnecessarily) but maybe every 10 or more steps. It will return some information about the system, such as $E$ and $\vec{p}$, as well as the mean radius of the system, $R_m = (1/N) \sum_i |\vec{r_i}|$, and the mean velocity, $V_m = (1/N) \sum_i |\vec{v_i}|$. The ratio $t_c = R_m/V_m$ tells you the average crossing time, which is similar to the typical infall timescale (initially) and to the orbital period of a particle at the end. The code should finish when the time exceeds $20 t_c$, by which time we will consider that the system has settled into equilibrium.
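
A small sketch of such a monitoring routine (called every 10 or so steps; it assumes the layout of the earlier sketches and re-evaluates $t_c$ from the current state):

    /* Mean radius, mean speed, and the stop test t > 20*t_c. */
    #include <math.h>
    #define N 1000

    extern double p[N][9];

    int check_state(double t, double *Rm, double *Vm)
    {
        double R = 0.0, V = 0.0;
        for (int i = 0; i < N; i++) {
            R += sqrt(p[i][0]*p[i][0] + p[i][1]*p[i][1] + p[i][2]*p[i][2]);
            V += sqrt(p[i][3]*p[i][3] + p[i][4]*p[i][4] + p[i][5]*p[i][5]);
        }
        *Rm = R/N;  *Vm = V/N;
        if (*Vm == 0.0) return 0;           /* cloud still at rest: t_c not defined yet */
        return t > 20.0*(*Rm)/(*Vm);        /* nonzero means: stop the run */
    }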

The main goal of your code is to find by what factor the mean radius of the system in final equilibrium is smaller than the initial mean radius. (The virial theorem can be used to make a prediction for that ratio, and you can try to make the prediction, but that is not required.)

Some particles may escape. Don't include in the average radius those which escaped beyond 10 times the initial ball's radius, and don't plot them (you can choose to plot only the particles within 3 initial radii). But study whether they escaped because of code inaccuracies, by varying $dt$. If there is a noticeable change in results when making energy more exactly conserved (smaller $dt$), then it might mean that even $dE/E \sim 10^{-5}$ is not sufficient to converge to a reliable answer. Consider adding a small number called the softening parameter, $\epsilon^2 \sim 10^{-10}$, to the square of every particle-particle distance to limit the acceleration and decrease errors due to close encounters. Cf. the formula for $\vec{f}_i$ above for the proper way to modify the equation.
Cluster evaporation is a physical phenomenon, but its theoretical timescale (to lose a significant part of the cluster) much exceeds 20 crossing times.

Code details need to be described in comments, and in a separate text that will also include the results and their discussion.

SOLUTION

Your solution will be looked at individually by the TA.
During the lecture I commented on how I overdid my attempt to simplify the formulation of the N-body problem. Einstein famously said that things should be made as simple as possible, but not simpler. Asking you to derive $\langle r\rangle$ instead of the better physically motivated $r_{ave} := \langle 1/r\rangle^{-1}$ was already on the "too much simplification" side. Sorry. Numerically inaccurate, fast, escaping particles can easily influence the arithmetic average, whereas they are unimportant with the second definition. Anyway, this is mostly a numerical exercise, not a physical analysis, in which we would argue that $\langle 1/r\rangle$ is like the potential energy $U$ and needs to be evaluated, whereas $\langle r\rangle$ does not really have any physical meaning, maybe just some purely graphical meaning.
I also remarked on how the force summation may be sped up by only summing over ordered pairs of indices $\{ij\}$ such that $i\lt j$ or $i\gt j$ (it does not matter which condition you use). Two acceleration counters (memory locations in the particle properties table) must then be updated, using Newton's 3rd law.

One idea on how to speed up the direct summation part of the code is to gather particles in bunches of, say, 128 each (or 256), load those consecutively into the CPU/GPU and leave them there for a while. They're the "$i$" particles. While accumulating their accelerations, you fetch only the 'other' or $j$-particles in consecutive bunches to interact with the resident $i$-particles. It is a little harder to implement the ordered-pairs-only idea then, and doing so may create write collisions in GPU or CPU RAM anyway. (There are easy ways to see how many collisions happen, in tests in which every force is replaced by the number 1.0; in case of frequent collisions in a CPU code, switch back to the full NxN matrix of interactions.)
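
A CPU-side sketch in C of this tiling (blocking) idea, with the full NxN set of interactions and single precision; the tile size, array names, and layout are assumptions, and N is assumed to be a multiple of the tile size. With nonzero softening the $i=j$ term contributes exactly zero, so no branch is needed inside the innermost loop:

    /* Blocked ("tiled") direct summation: a bunch of TILE i-particles stays
       resident while the j-particles are streamed through in bunches. */
    #include <math.h>
    #define N    (128*1024)
    #define TILE 128

    extern float pos[N][4];                 /* x, y, z, padding */
    extern float acc[N][4];
    static const float Gm = 1.0f/N, eps2 = 1e-10f;

    void accel_tiled(void)
    {
        #pragma omp parallel for schedule(static)
        for (int ib = 0; ib < N; ib += TILE) {          /* resident i-tile */
            float a[TILE][4] = {{0.0f}};
            for (int jb = 0; jb < N; jb += TILE)        /* streamed j-tile */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++) {
                        float dx = pos[jb+j][0] - pos[ib+i][0];
                        float dy = pos[jb+j][1] - pos[ib+i][1];
                        float dz = pos[jb+j][2] - pos[ib+i][2];
                        float r2 = dx*dx + dy*dy + dz*dz + eps2;
                        float w  = Gm/(r2*sqrtf(r2));   /* i==j: dx=dy=dz=0, adds 0 */
                        a[i][0] += w*dx;  a[i][1] += w*dy;  a[i][2] += w*dz;
                    }
            for (int i = 0; i < TILE; i++)
                for (int k = 0; k < 3; k++) acc[ib+i][k] = a[i][k];
        }
    }
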
The whole 128x128 interaction sub-matrix (called a "tile") can be evaluated very quickly by 128 CUDA threads in one thread block, without fine-grained reads/writes to RAM (hence a great speedup; it's like using the L1 cache on a CPU), allowing N = 128K to be treated by 1M blocks, only dozens of them computing concurrently (given today's hardware). On art-1, an N = 128K problem of exactly interacting particles can be solved with an admittedly oversimplified time integrator fast enough to create a good-looking animation in real time (more than 20 frames per sec). Here are some details of the method & implementation in C and CUDA C. On art-1, run the Nvidia-written program called nbody like so
% ~/nbody -benchmark --numbodies=...
with your preferred number of bodies (in the benchmark mode the nbody code won't be trying to send the very pretty OpenGL graphics to the screen; that's how OpenGL works, it consistently refuses to be channeled via TCP/IP connections, as if deliberately preventing users from complaining about 'slow graphics' when run remotely; indeed, a good codec would ameliorate that problem, but nobody seems to offer any solution, and we have to use nbody on art-1 in text mode).

To dig into the program source code, you'll need to look at bodysystemcuda.cu, nbody.cpp, and the supporting header files bodysystemcuda.h and bodysystemcuda_impl.h. A lot of the stuff is about graphics and graphics buffers, of course, but try to locate the compute engine.

For a scientific calculation, one timestep should not last much longer than 1 second (we sometimes need hundreds of thousands of timesteps and our lives are of finite length). What number $N$ corresponds to 1 timestep per 3 seconds? What arithmetic speed in TFLOPS ($10^{12}$ floating-point operations per second) does that give, and how much faster is it than the kFLOPS typical of machines 70 years ago?

Problem 3.4 [10 pts] A variant of code in 3.3

Once the code in problem 3.3 is debugged and works well, create a separate version with a variable timestep, updated after each timestep to a value depending on the minimum distance $d_{min}$ of any two masses (this value should be established as part of the acceleration calculation, not by a separate procedure finding all pairs of particles again!). The dependence should be $dt = dt_0 (d_{min}/R)^{3/2}$, with $dt_0$ being a constant you change to obtain a satisfactory conservation of the integrals of motion $E$ and $\vec{p}$. Submit the file with the variant of the code. Compare the results and timing of the two kinds of time-integration.
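
A tiny sketch of the timestep update in C (it assumes that $d_{min}$, the smallest pair distance, is tracked inside the acceleration routine, e.g. with an OpenMP reduction(min:...) clause):

    /* Variable timestep: dt = dt0 * (d_min/R)^(3/2). */
    #include <math.h>

    extern double dmin;                     /* smallest r_ij found in the last accel() call */
    static const double R = 1.0;            /* the initial cloud radius */

    double new_dt(double dt0)
    {
        return dt0 * pow(dmin/R, 1.5);
    }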

What about the following idea for another optimization? The method, as described above, computes a given pair $ij$ twice: once while summing the acceleration on particle $i$, and another time while doing the summation for particle $j$. Meanwhile, Newton's 3rd law states that the second calculation of the force is redundant, because the force is equal and opposite to the one already computed (both accelerations can be updated simultaneously). One can consider only pairs that satisfy $i\lt j$ or $j\lt i$, to reduce the number of 3-D distances calculated. But will that work well with OMP? You can try and tell us what happened. (We are looking forward to your bright ideas on code optimization. E.g., you might check whether 9 by N or N by 9 arrays work faster, and whether putting them on the stack or the heap, by placing the declaration inside or outside a procedure or Fortran module, works better. Just remember to increase the OS stack size if you get error messages having to do with memory allocation, or segmentation faults, which are the typical symptom of those troubles. This is just a suggestion. Describe your efforts even if they didn't bring much benefit. We ask you to describe your solutions to 3.3 and 3.4 in a text file; points will be cut for submitting only the codes.)
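
One possible (not the only, and not necessarily the fastest) way to combine the ordered-pairs trick with OpenMP without write collisions is to give each thread its own private accumulation array and reduce the copies afterwards. A hedged sketch in C, with names, sizes, and the thread-count bound as assumptions; whether it actually beats the full NxN loop is exactly the experiment suggested above:

    /* Pairwise (i<j) summation with Newton's 3rd law; per-thread accumulators
       avoid two threads writing to the same particle's acceleration. */
    #include <math.h>
    #include <omp.h>
    #include <string.h>
    #define N          1000
    #define MAXTHREADS 64                   /* assumed upper bound on thread count */

    extern double p[N][9];
    static const double Gm = 1.0/N, eps2 = 1e-10;
    static double abuf[MAXTHREADS][N][3];   /* per-thread partial accelerations */

    void accel_pairs(void)
    {
        int nt = omp_get_max_threads();
        memset(abuf, 0, (size_t)nt*N*3*sizeof(double));
        #pragma omp parallel
        {
            double (*a)[3] = abuf[omp_get_thread_num()];
            #pragma omp for schedule(dynamic,8)     /* dynamic: the i<j loop is triangular */
            for (int i = 0; i < N; i++)
                for (int j = i+1; j < N; j++) {
                    double dx = p[j][0]-p[i][0], dy = p[j][1]-p[i][1], dz = p[j][2]-p[i][2];
                    double r2 = dx*dx + dy*dy + dz*dz + eps2;
                    double w  = Gm/(r2*sqrt(r2));
                    a[i][0] += w*dx;  a[i][1] += w*dy;  a[i][2] += w*dz;
                    a[j][0] -= w*dx;  a[j][1] -= w*dy;  a[j][2] -= w*dz;
                }
        }
        for (int i = 0; i < N; i++)                 /* reduce the per-thread copies */
            for (int k = 0; k < 3; k++) {
                double s = 0.0;
                for (int t = 0; t < nt; t++) s += abuf[t][i][k];
                p[i][k+6] = s;
            }
    }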


Pawel Artymowicz / pawel.artymowicz@utoronto.ca