Assignment 3, PHYD57

Due date and time: 8 Nov (Fri) at 11am

We denote first and second x-derivative as xx, and xx2x2.

Problem 3.1 [25 pts] Research the history of GPGPU (i.e., GPU computing)

We are going to talk about GPU computing during Lectures, but will not cover in detail the history of GPU computing. Study it using numerous resources on:
(i) course web page, Links > Parallel computing - CUDA C & CUDA Fortran,
(ii) auxiliary page /xx with books which sometimes provide historical introduction,
(iii) internet
In your 2-page writeup (max. 3 pages please) mention not just the history in the sense of who did what & when, but also how GPU computing differs from CPU computing. You may include some diagrams or short code snippets (in Word or Latex or jupyter notebook format). Try to capture the essential things; it's easy to get carried away citing various interesting but not essential information and suddenly run out space (no more than 3 pages please). Cite more than two sources, mention the types of applications GPU has found, and personal experience with graphics computing, if any.

Problem 3.2 [25 pts] Stencils, stability and CFL numbers

Introduction and the definition of the CFL number

Much of numerical Physics revolves around the solution of PDEs (partial differential eqs.) Everything that happens in n-dimensional space and involves time evolution will likely be described by a PDE, specifying how fast the quantities involved change in time at a given location on a stationary or moving grid (discretizing the space with local cell sizes Δx, Δy, etc.). Time is discretized too, into intervals of duration Δt.

The CFL (Courant-Fredrichs-Levy) condition connects Δx, Δy, ... and Δt in wave propagation and fluid flow problems, governed by the so-called hyperbolic equations like Navier-Stokes equation, ideal gasdynamics equations called Euler equations, various water wave equations (deep water, shallow water equations). Essentially, the meaning of such conditions is to specify when a given computational stencil, i.e. pattern of combining the data from previous time step to produce the updated value at a given location in the next time step, works ok, and when it produces instabilities. Instabilities are not just truncation errors, meaning - inaccuracies caused by numerical approximations to math oprations like the derivatives and integrals. Instabilities introduce expopnentially growing errors quickly making the solution non-physical.
There are so-called implicit schemes, which use stencils transferring information about the future (next timestep) state of the spatial neighbors of a given cell, in order to compute the future state of that cell, and many of them don't have a CFL-like condition, because they always are stable, no matter how large dt=Δt one uses. Such schemes are said to be unconditionally stable. The problem is that they require information (about neighbors) which is not yet available. In practice, one can device iterative scheme, which will derive better and better approximations to all the future values of variables on the grid, but there is a price to pay for this: implicit schemes are computationally expensive. This forces the user to adopt larger timesteps than in explicit schemes, with adverse effect on accuracy. Some of the best available CFD algorthms (Computational Fluid Dynamics) are therefore explicit.

You have learned much about the approximators of the 1st and 2nd derivatives on a grid. Now is the time to see whether they work in the context of some popular PDEs. One example that we consdered in Lectures is the von Neumann analysis of error propagation in the heat (or diffusion) equation tT=DxxT, when the first order, forward time differencing is used for time, and 2nd order symmetric formula for the r.h.s. 2nd derivative is used. The stencil looks like this

 
    *
    |
o___o___o

The time is going up by one timestep, 3 spatial gridpoints are shown as o. In more dimensions we have 5 o's in 2-D, and 7 o's in 3-D. The discretization of 1-D heat equation PDE is goes as follows: Tjn+1TjnΔt=DTj1n2Tjn+Tj+1n(Δx)2 or Tjn+1=Tjn+ΔtD(Δx)2(Tj1n2Tjn+Tj+1n) We mark the timestep number as upper index, and the spatial cell number as lower index. Notice that we could have denoted the nondimensional constant DΔt/(Δx)2 as a CFL number. We found that CFL 1/(2n) is sufficient and necessary for the stability of the considered scheme, i.e. the timestep must satisfy the inequality Δt(Δx)2/(2nD). Check the lecture notes.

The physical meaning of (Δx)2/D is the diffusion time after which a local peak or trough of T of width Δx gets significantly leveled by diffusion. Timestep is a couple of times smaller than diffusion timescale across one cell, for the algorithm to be able to adequately describe the diffusion or temperature changes.

Stability of schemes for advection and Schrödinger equation

The 1-D advection equation describes transport of quantity described by function u(x,t) across the x-axis: tu=cxu where c=const is the propagation speed (its sign determines direction of propagation). In multidimensional space we'd write tu=cu. In 1-D =x, in 2-D it is =(x,y) and so on.

The time-dependent Schrödinger equation describes the time evolution of ψ(r,t), the quantum-mechanical wave function: itψ(r,t)=[22m2+V(r)]ψ(r,t). where m is the mass of the particle represented by the wave function, =h/(2π) is reduced Planck's constant, and 2 is Laplace operator. In 1-D, 2=xx, in 2-D it is 2=xx+yy, and so on. We are interested in the simple case of free-space propagation of the wave function, V(r)=0.

Show that discretization of either of the above equations using the same numerical approximations to t and 2 as we discussed above in case of the diffusion equation, is unconditionally unstable. That is, there is no CFL relation guaranteeng stability, because the numerical solution always blows up.

So what do we do in such a situation? We have to employ more sophisticated explicit methods, or implicit methods.

The Klein-Gordon and the wave equation. CFL limit

There is a number of wave equations in physics that describe the propagation of disturbance in some medium, or quantum-mechanical probability wave functions in vacuum. Notice that Schrödinger equation does not contain the speed of light c. This is because it applies to nonrelativistic particles only, a fact that initially saddened its discoverer so much that he seriously considered not publishing it.

One attempt was to formulate a wave equation patterned after the relativistic energy triangle equation E2=(mc2)2+(pc)2 where m is the rest mass of a particle, p its relativistic momentum (in general not equal to mv since that product would be limited in value because of vc requirement), and E total energy. This gave the Klein-Gordon equation which, as we now know, describes the spin-zero particles such as the Higgs boson.

Let's consider the application to massles particles first. If m=0, the Klein-Gordon equation looks exactly like the familiar wave equation of clasical physics, although the function ψ is the same quantum-mechanical wave function in Schrödinger equation, and is in general complex-valued. ttψ=c22ψ

Use von Neuman's analysis, where spatial and time dependence of error is assumed to be split into Fourier components that depend on position r and time like ei(krωt), the k being the wave vector and ω the wave's frequency. Show that any wave equation (including the homogeneous, mass-less Klein-Gordon equation) can be successfully solved numerically with the 2nd derivative approximator (2nd order accuracy) used in heat equation. Derive the critical CFL number. The explicit numerical stencil has the form

 
    *
    |
o___o___o
    |
    o

Time is running upwards (2 time steps) and space is extending horizontally; we illustrate only one spatial dimension, but analogous multidimensional stencils can be written down. The CFL number will depend on the dimension n of space! This is how we define CFL number in this equation, for brevity denoting it as q in 1-D: ψjn+12ψjn+ψjn1(Δt)2=c2ψj+1n2ψjn+ψj1n(Δx)2 or ψjn+12ψjn+ψjn1=q2(ψj1n2ψjn+ψj+1n);q:=CFL=cΔtΔx. Let us illustrate the 2-D wave equation by labeling spatial locations with indices j,l ψj,ln+12ψj,ln+ψj,ln1(Δt)2=c2[ψj+1,ln2ψj,ln+ψj1,ln(Δx)2+ψj,l+1n2ψj,ln+ψj,l1n(Δy)2] or ψj,ln+12ψj,ln+ψj,ln1=q2(ψj+1,ln2ψj,ln+ψj1,ln)+p2(ψj,l+1n2ψj,ln+ψj,l1n);q=cΔtΔx,p=cΔtΔy. I don't think that having different cell sizes in different directions on the grid makes the task much more difficult. But if you prefer - you can consider a spatial grid with Δx=Δy=... Interpret your result by comparing the numerical speed of propagation of information on the uniform space-time grid, and the physical wave-crest velocity c.

HINT: Follow the philosophy that was used for diffusion equation in L5, use the trick eiz+eiz2=4sin2z/2. Judge at which values of q there are no real-valued ω's that satisfy the equation, assuming a real-valued k (or k if dimension n>1).

OPTIONAL PART: Finally, if you dare, derive the stability criterion for the Klein-Gordon equation with m>0, discretized as above. This is not necessary to get full points for this problem! But if you do it, some mistakes in the earlier parts of this problem will be forgiven.

Problem 3.3 [40 pts] N-body problem: Gravitational collapse

Write a code in C or Fortran that solves the N-body gravitational problem by 4th order symplectic method (cf. L7). The code should be modular, i.e. write separate procedures/subroutines to: initiate the positions and velocities, compute accelerations by direct summetion (that part is the bottleneck, as it is O(N2) complexity, so it needs to be well parallelized by OpenMP multithreading), compute instantaneous value of total energy and momentum of the system for monitoring accuracy, and the integrator that does kick and push substeps, as well as the main program that calls those procedures. Unless you have other ideas, keep all the variables for all particles: coordinates, velocity components, accelerations, in one array dimensioned 9 by N, or N by 9 (depending on the C/Fortran language, since C and F store rows and columns of arrays differently, and it is usually better to keep all data belonging to one particle physically close in RAM).

Consider an initially motionless, uniform-density, spherical cloud of randomly placed N=1000 points, extending to radius R=1. The total mass is equal M=1, and each body's mass is m=M/N. The gravitational constant is G=1. [You are free to use the actual physical constants and dimensional values. I'll not omit M (m) or G in the formulae below. The task you're facing will not depend on what values of R and M you use. Only the dt and the simulated times will have different values, but the number of steps taken or wall clock time should not be much different.]

The acceleration acting on particle i is equal fi=Gmjirij(rij2+ϵ2)3/2 where rij=rjri, rij=|rij| , and the sum is done over all particles other than i. It's not required, but you may use non-zero softening parameter ϵ105, effectively limiting the strength of gravity force at distances r<ϵ. It's rare that particles are that close to each other, but then their trajectories will be computed more reasonably (as if gravity was actually non-Newtonian at small distances, or if gravity was Newtonian but particles were really fuzzy balls of radius ϵ of much smaller components. In astronomy, such objects with potential Gm2/r2+ϵ2 are known as Plummer spheres.)

Monitor the relative change of the total mechanical energy (EE0)/E0 and the instantaneous total momentum of the system divided by N, where E0=E(t=0) and E=mi[vi22Gmj<i1rij2+ϵ2],p=mNivi.

Choose a constant timestep dt that assures good conservations of energy and nearly zero total momentum during the simulation. State the absolute value of momentum divided by M. Try to keep the relative energy error below 105. Consider any results where the relative error is larger than 0.01 garbage, affected by a very close encounter of particles. If the running time is too long, kill the program by CTRL-C after 10 or 15 minutes and run it again with much smaller N, it should then finish quickly. Establish the biggest N that gives reasonably short runtime by trial and error guided by the method of bisection. Hopefully you won't have to decrease N, but who knows..

You will finish the computation once the following happens. Write a subroutine or procedure, which you will call once in a while, not after every timestep, since that would slow down the working of the code unnecessarily, but maybe every 10 or more steps. It will return some information about the system, such as E and p, as well as the mean radius of the system Rm=(1/N)i|ri| and the mean velocity, Vm=(1/N)i|vi|. The ratio tc=Rm/Vm tells you the average crossing time, which is similar to the typical infall timescale (initially) and the orbital period of a particle at the end. The code should finish when the time exceeds 20tc, by which time we will consider that the system settled into equilibrium.

The main goal of your code is to find by what factor the mean radius of the system in final equilibrium is smaller than the initial mean radius. (The virial theorem can be used to make a prediction for that ratio, and you can try to make the prediction, but that is not required.)

Some particles may escape. Don't include in the average radius those, which escaped beyond 10 times the initial ball's radius, and don't plot them (you can choose to plot only the particles within 3 initial radii). But study wherether they escaped because of code inaccuracies- by varying dt. If there is a noticeble change in results when making energy more exactly conserved (smaller dt) then it might mean that even dE/E ~ 1e-5 is not sufficient to converge to a reliable answer. Consider adding a small number called softening parameter ε2 ~ 10-10 to the square of every particle-particle distance to limit acceleration and decrease errors due to close encounters. Cf. the fi for a proper way to modify the equation.
Cluster evaporation is a physical phenomenon, but the theoretical timescale of it (to lose significant part of the cluster) much exceeds 20 crossing times.

Code details need to be described in comments, and in a separate text that will also include the results and their discussion.

Problem 3.4 [10 pts] A variant of code in 3.3

Once the code in problem 3.3 is debugged and works well, create a separate version with a variable timestep, updated after each timestep to a value depending on the minimum distance dmin of any two masses (this value should be established as part of the acceleration calculation, not by a separate procedure finding all pairs of particles again!). The dependence should be dt=dt0(dmin/R)3/2, with dt0 being a constant you change to obtain a satisfactory conservation of the integrals of motion E and p. Submit the file with the variant of the code. Compare the results and timing of the two kinds of time-integration.

What about the following idea for another optimization. The method, as described above, computes a given pair ij twice, once while summing the acceleration on particle i, and another time doing summation for particle j. Meanwhile, Newton's 3rd law states that the second calculation of force is redundant, because the force is equal and opposite to the one already computed (both acceleration can be updated simultaneously). One can only consider pairs that satisfy i<j or j<i, to reduce the number of 3-D distances calculated. But will that work well with OMP? You can try and tell us what happened. (We are looking forward to your bright ideas on code optimization. E.g., you might check whether 9 by N or N by 9 arrays work faster, whether putting them on stack or heap, by palcing the declaration in or outside a procedure or fortran module, works better. Just remember to increase OS stack size if you get error messages having to do with memory allocations, or segmentation faults, which is the typical symptom of those troubles. This is just a suggestion. Describe your efforts even if they didn't bring much benefit. We ask you to describe your solution to 3.3 and 3.4 in a text file, points will be cut for submitting only the codes.)


Pawel Artymowicz / pawel.artymowicz@utoronto.ca