The CFL (Courant-Friedrichs-Lewy) condition connects $\Delta x$, $\Delta y$,
..., and $\Delta t$ in wave propagation and fluid flow problems, governed by
the so-called hyperbolic equations: the Navier-Stokes equations, the ideal
gas dynamics equations called the Euler equations, and various water wave equations
(deep-water, shallow-water equations). Essentially, the purpose of such
conditions is to specify when a given computational stencil, i.e. the
pattern for combining data from the previous time step to produce the updated
value at a given location in the next time step, works well,
and when it produces instabilities. Instabilities are not just truncation errors,
meaning inaccuracies caused by numerical approximations to mathematical operations such as
derivatives and integrals. Instabilities introduce exponentially growing
errors that quickly make the solution non-physical.
There are so-called implicit schemes,
whose stencils bring in information about the future (next-timestep) state
of the spatial neighbors of a given cell in order to compute the future
state of that cell, and many of them have no CFL-like condition, because
they are always stable, no matter how large a $dt= \Delta t$ one uses. Such schemes are
said to be unconditionally stable. The problem is that they require
information (about the neighbors) which is not yet available. In practice, one can devise an
iterative scheme that derives better and better approximations to all
the future values of the variables on the grid, but there is a price to pay:
implicit schemes are computationally expensive per step. This forces the user to
adopt larger timesteps than in explicit schemes, with an adverse effect on accuracy.
Some of the best available CFD (Computational Fluid Dynamics) algorithms are therefore
explicit.
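To make the explicit/implicit distinction concrete, here is a minimal sketch in a notation of our own choosing (not from the lectures): write the PDE, after discretizing space, as $\partial_t u = \mathcal{L}u$, where $\mathcal{L}$ stands for the matrix of spatial couplings between grid values. The two families of time update then look like
$$ u_j^{n+1} = u_j^{n} + \Delta t\,(\mathcal{L}u^{n})_j \;\;\;\;\text{(explicit, forward in time)}, \;\;\;\;\;\;\; u_j^{n+1} = u_j^{n} + \Delta t\,(\mathcal{L}u^{n+1})_j \;\;\;\;\text{(implicit, backward in time)}. $$
In the first form every new value follows directly from data already in hand; in the second the unknown future values of the neighbors appear on the right-hand side, so a coupled (often iterative) solve over the whole grid is needed at every step, which is the price of unconditional stability.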
You have learned much about the approximators of the 1st and 2nd
derivatives on a grid. Now is the time to see whether they work in the context of
some popular PDEs. One example that we considered in Lectures is the von Neumann
analysis of error propagation in the heat (or diffusion)
equation $\partial_t T = D\; \partial_{xx} T$,
when first-order, forward differencing is used for time, and the 2nd-order
symmetric formula is used for the 2nd derivative on the r.h.s. The stencil looks like this
* | o___o___o
The physical meaning of $(\Delta x)^2/D$ is the diffusion time after which a local peak or trough of $T$ of width $\Delta x$ gets significantly leveled by diffusion. The timestep must be a few times smaller than this diffusion timescale across one cell for the algorithm to be able to adequately describe the diffusion, i.e. the temperature changes.
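As a reminder of where that factor comes from (this is the standard lecture result, quoted here rather than re-derived): substituting an error mode $\propto e^{ikx}$ into the forward-time, centered-space scheme gives the per-step amplification factor
$$ G = 1 - \frac{4D\,\Delta t}{(\Delta x)^2}\,\sin^2\!\frac{k\Delta x}{2}, $$
and requiring $|G| \le 1$ for every real $k$ yields $D\,\Delta t/(\Delta x)^2 \le 1/2$, i.e. a timestep at least twice smaller than the cell diffusion time.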
The time-dependent Schrödinger equation describes the time evolution of $\psi (\vec{r},t)$, the quantum-mechanical wave function: $$i\hbar \;\partial_t \psi(\vec{r},t) = \left[-\frac{\hbar^2}{2m}\,\nabla^2 + V(\vec{r})\right] \,\psi(\vec{r},t),$$ where $m$ is the mass of the particle represented by the wave function, $\hbar = h/(2\pi)$ is the reduced Planck constant, and $\nabla^2$ is the Laplace operator. In 1-D, $\nabla^2 = \partial_{xx}$, in 2-D it is $\nabla^2 = \partial_{xx}+\partial_{yy}$, and so on. We are interested in the simple case of free-space propagation of the wave function, $V(\vec{r})= 0$.
Show that discretization of either of the above equations, using the same numerical approximations to $\partial_t$ and $\nabla^2$ as we discussed above in the case of the diffusion equation, is unconditionally unstable. That is, there is no CFL relation guaranteeing stability, because the numerical solution always blows up.
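If you want a quick empirical confirmation of what your analysis should predict, a throwaway experiment like the sketch below (plain C with complex arithmetic from complex.h; the grid size, timestep, and initial Gaussian are arbitrary choices of ours) evolves the free 1-D Schrödinger equation with the same forward-time, centered-space stencil and prints the norm of $\psi$, which eventually grows no matter how small $\Delta t$ is:
```c
/* FTCS experiment for i hbar d_t psi = -(hbar^2/2m) d_xx psi, in units hbar = m = 1.
   Forward-time, centered-space update; the norm should stay constant for a stable
   scheme, but here it blows up (making DT smaller only delays the blow-up). */
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define NX 200
#define DX 0.1
#define DT 0.005          /* try smaller values: the growth only slows down */
#define NSTEP 1000

int main(void) {
    static double complex psi[NX], psinew[NX];
    for (int j = 0; j < NX; j++) {            /* initial Gaussian wave packet */
        double x = (j - NX/2) * DX;
        psi[j] = cexp(-x*x);
    }
    double a = DT / (2.0 * DX * DX);          /* dt * (hbar/2m) / dx^2 */
    for (int n = 1; n <= NSTEP; n++) {
        for (int j = 1; j < NX-1; j++)        /* psi^{n+1}_j = psi^n_j + i a (psi_{j+1} - 2 psi_j + psi_{j-1}) */
            psinew[j] = psi[j] + I*a*(psi[j+1] - 2.0*psi[j] + psi[j-1]);
        psinew[0] = psinew[NX-1] = 0.0;       /* crude fixed boundaries */
        for (int j = 0; j < NX; j++) psi[j] = psinew[j];
        if (n % 100 == 0) {
            double norm = 0.0;
            for (int j = 0; j < NX; j++) norm += cabs(psi[j])*cabs(psi[j])*DX;
            printf("step %5d  norm = %g\n", n, norm);
        }
    }
    return 0;
}
```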
So what do we do in such a situation? We have to employ more sophisticated explicit methods, or implicit methods.
One attempt was to formulate a wave equation patterned after the
relativistic energy triangle equation $$E^2 = (mc^2)^2 + (pc)^2$$
where $m$ is the rest mass of a particle, $p$ its relativistic momentum
(in general not equal to $mv$, since that product would be limited in value
by the requirement $v\le c$), and $E$ the total energy. This gave the
Klein-Gordon equation which, as we now know, describes spin-zero
particles such as the Higgs boson.
Let's consider the application to massless particles first. If $m=0$, the Klein-Gordon equation looks exactly like the familiar wave equation of classical physics, although the function $\psi$ is still the quantum-mechanical wave function from the Schrödinger equation, and is in general complex-valued: $$ \partial_{tt} \psi = c^2 \;\nabla^2 \psi. $$
Use von Neumann's analysis, where the spatial and time dependence of the
error is assumed to be split into Fourier components that depend on position
$\vec{r}$ and time like $e^{i(\vec{k}\cdot \vec{r} - \omega t)} $,
$\vec{k}$ being the wave vector and $\omega$ the wave's frequency.
Show that any wave equation (including the homogeneous, massless
Klein-Gordon equation) can be successfully solved numerically with the 2nd-order accurate
2nd-derivative approximator used in the heat equation.
Derive the critical CFL number. The explicit numerical stencil has the form
* | o___o___o | o
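In formulas, with index $j$ for space and $n$ for time, this stencil reads
$$ \frac{\psi_j^{n+1} - 2\psi_j^{n} + \psi_j^{n-1}}{(\Delta t)^2} = c^2\,\frac{\psi_{j+1}^{n} - 2\psi_j^{n} + \psi_{j-1}^{n}}{(\Delta x)^2}, $$
i.e. the same symmetric 2nd-derivative approximator is used in both time and space, and $\psi_j^{n+1}$ is solved for explicitly.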
HINT: Follow the philosophy that was used for the diffusion equation in L5, and use the trick $ e^{iz} + e^{-iz} - 2 = -4 \,\sin^2 (z/2)$. Judge at which values of $q$ there are no real-valued $\omega$'s satisfying the equation, assuming a real-valued $k$ (or $\vec{k}$ if the dimension $n>1$).
OPTIONAL PART: Finally, if you dare, derive the stability criterion for the Klein-Gordon equation with $m > 0$, discretized as above. This is not necessary to get full points for this problem! But if you do it, some mistakes in the earlier parts of this problem will be forgiven.
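If you want an optional, purely empirical cross-check of whatever critical CFL number your analysis gives (it is not a substitute for the derivation), a throwaway 1-D experiment along the lines of the sketch below evolves the stencil above for a Courant number $q = c\,\Delta t/\Delta x$ supplied on the command line and reports the maximum $|\psi|$; the grid size, initial pulse, and boundary treatment are arbitrary choices of ours:
```c
/* Leapfrog (time) + centered (space) scheme for  psi_tt = c^2 psi_xx  in 1-D.
   Run as  ./wave 0.5   (the argument is the Courant number q = c*dt/dx)
   and watch whether max|psi| stays bounded or explodes. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NX 400
#define NSTEP 4000

int main(int argc, char **argv) {
    double q = (argc > 1) ? atof(argv[1]) : 0.5;   /* Courant number to test */
    double q2 = q * q;
    static double old[NX], cur[NX], nxt[NX];
    for (int j = 0; j < NX; j++) {                 /* initial Gaussian pulse, at rest */
        double x = (j - NX/2) / 20.0;
        cur[j] = exp(-x*x);
        old[j] = cur[j];
    }
    for (int n = 1; n <= NSTEP; n++) {
        for (int j = 1; j < NX-1; j++)             /* explicit three-level update */
            nxt[j] = 2.0*cur[j] - old[j] + q2*(cur[j+1] - 2.0*cur[j] + cur[j-1]);
        nxt[0] = nxt[NX-1] = 0.0;                  /* fixed (reflecting) ends */
        for (int j = 0; j < NX; j++) { old[j] = cur[j]; cur[j] = nxt[j]; }
        if (n % 500 == 0) {
            double amax = 0.0;
            for (int j = 0; j < NX; j++) if (fabs(cur[j]) > amax) amax = fabs(cur[j]);
            printf("step %5d  max|psi| = %g\n", n, amax);
        }
    }
    return 0;
}
```
Running it for several values of $q$ on either side of your predicted threshold should show bounded oscillation versus rapid blow-up.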
Consider an initially motionless, uniform-density, spherical cloud of randomly placed $N=1000$ points, extending to radius $R=1$. The total mass equals $M=1$, and each body's mass is $m = M/N$. The gravitational constant is $G= 1$. [You are free to use actual physical constants and dimensional values; that is why I will not omit $M$ ($m$) or $G$ in the formulae below. The task you're facing will not depend on what values of $R$ and $M$ you use. Only $dt$ and the simulated times will have different values, but the number of steps taken or the wall-clock time should not be much different.]
The acceleration acting on particle $i$ is equal to $$\vec{f}_i = G m \sum_{j \ne i}\,\frac{\vec{r}_{ij}} {(r_{ij}^2+\epsilon^2)^{3/2}}$$ where $\vec{r}_{ij} = \vec{r}_j - \vec{r}_i$, $\;\;r_{ij}= |\vec{r}_{ij}|$, and the sum is over all particles other than $i$. It's not required, but you may use a non-zero softening parameter $\epsilon \approx 10^{-5}$, effectively limiting the strength of the gravitational force at distances $r \lt \epsilon$. It's rare that particles get that close to each other, but then their trajectories will be computed more reasonably (as if gravity were actually non-Newtonian at small distances, or as if gravity were Newtonian but the particles were really fuzzy balls of smaller constituents with radius $\sim \epsilon$. In astronomy, such objects with pair potential energy $-Gm^2/\sqrt{r^2+\epsilon^2}$ are known as Plummer spheres.)
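A direct $O(N^2)$ summation of this formula might be sketched as follows (plain C, written for clarity rather than speed; the flat 3N array layout and the function name are our choices):
```c
#include <math.h>

/* Softened gravitational accelerations, direct sum over all pairs.
   r and a are arrays of length 3*N stored as x0,y0,z0, x1,y1,z1, ...
   Implements f_i = G*m * sum_j (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2). */
void accelerations(int N, const double *r, double *a,
                   double G, double m, double eps)
{
    double eps2 = eps * eps;
    for (int i = 0; i < N; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = r[3*j]   - r[3*i];
            double dy = r[3*j+1] - r[3*i+1];
            double dz = r[3*j+2] - r[3*i+2];
            double s  = dx*dx + dy*dy + dz*dz + eps2;
            double inv3 = 1.0 / (s * sqrt(s));        /* (r^2 + eps^2)^(-3/2) */
            ax += dx * inv3;
            ay += dy * inv3;
            az += dz * inv3;
        }
        a[3*i] = G*m*ax;  a[3*i+1] = G*m*ay;  a[3*i+2] = G*m*az;
    }
}
```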
Monitor the relative change of the total mechanical energy, $(E-E_0)/E_0$, and the instantaneous total momentum of the system divided by $N$, where $E_0 = E(t=0)$ and $$ E = m \sum_i \left[\frac{v_i^2}{2} -Gm \sum_{j \lt i}\,\frac{1}{ \sqrt{r_{ij}^2+\epsilon^2}} \right], \;\;\;\;\;\;\; \vec{p} = \frac{m}{N} \sum_{i}\,\vec{v}_{i}.$$
Choose a constant timestep $dt$ that assures good conservation of energy and a nearly zero total momentum during the simulation. State the absolute value of the momentum divided by $M$. Try to keep the relative energy error below $10^{-5}$. Consider any results where the relative error is larger than 0.01 garbage, affected by a very close encounter of particles. If the running time is too long, kill the program with CTRL-C after 10 or 15 minutes and run it again with a much smaller $N$; it should then finish quickly. Establish the biggest $N$ that gives a reasonably short runtime by trial and error guided by the method of bisection. Hopefully you won't have to decrease $N$, but who knows...
You will finish the computation based on the following criterion. Write a subroutine or procedure which you will call once in a while, not after every timestep (since that would slow down the code unnecessarily), but maybe every 10 or more steps. It will return some information about the system, such as $E$ and $\vec{p}$, as well as the mean radius of the system, $R_m = (1/N) \sum_i |\vec{r}_i|$, and the mean velocity, $V_m = (1/N) \sum_i |\vec{v}_i|$. The ratio $t_c = R_m/V_m$ tells you the average crossing time, which is similar to the typical infall timescale (initially) and to the orbital period of a particle at the end. The code should finish when the time exceeds $20 t_c$, by which time we will consider that the system has settled into equilibrium.
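Such a diagnostics procedure might look like the sketch below (C again; the struct, names, and array layout are ours; it simply evaluates the quantities defined above, and the escaper cut described further down can later be added to the $R_m$ loop):
```c
#include <math.h>

/* Occasional diagnostics: total energy E, |total momentum|/N, mean radius R_m,
   mean speed V_m, and crossing time t_c = R_m/V_m.  r and v are 3*N arrays. */
typedef struct { double E, pmag, Rm, Vm, tc; } Diag;

Diag diagnostics(int N, const double *r, const double *v,
                 double G, double m, double eps)
{
    Diag d = {0.0, 0.0, 0.0, 0.0, 0.0};
    double px = 0.0, py = 0.0, pz = 0.0, kin = 0.0, pot = 0.0;
    double eps2 = eps * eps;
    for (int i = 0; i < N; i++) {
        double v2 = v[3*i]*v[3*i] + v[3*i+1]*v[3*i+1] + v[3*i+2]*v[3*i+2];
        kin += 0.5 * m * v2;
        px += m * v[3*i];  py += m * v[3*i+1];  pz += m * v[3*i+2];
        d.Rm += sqrt(r[3*i]*r[3*i] + r[3*i+1]*r[3*i+1] + r[3*i+2]*r[3*i+2]);
        d.Vm += sqrt(v2);
        for (int j = 0; j < i; j++) {            /* each pair counted once (j < i) */
            double dx = r[3*j]-r[3*i], dy = r[3*j+1]-r[3*i+1], dz = r[3*j+2]-r[3*i+2];
            pot -= G * m * m / sqrt(dx*dx + dy*dy + dz*dz + eps2);
        }
    }
    d.E    = kin + pot;
    d.pmag = sqrt(px*px + py*py + pz*pz) / N;
    d.Rm  /= N;
    d.Vm  /= N;
    d.tc   = (d.Vm > 0.0) ? d.Rm / d.Vm : 0.0;
    return d;
}
```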
The main goal of your code is to find by what factor the mean radius of the system in final equilibrium is smaller than the initial mean radius. (The virial theorem can be used to make a prediction for that ratio, and you can try to make the prediction, but that is not required.)
Some particles may escape. Don't include in the average
radius those that escaped beyond 10 times the initial ball's radius, and don't
plot them (you can choose to plot only the particles within 3 initial radii).
But study whether they escaped because of code inaccuracies, by varying $dt$.
If there is a noticeable change in results when making energy more
exactly conserved (smaller $dt$), then it might mean that even $dE/E \sim 10^{-5}$ is not
sufficient to converge to a reliable answer. Consider adding a small number,
the squared softening parameter $\epsilon^2 \sim 10^{-10}$,
to the square of every particle-particle distance to limit the acceleration
and decrease errors due to close encounters. Cf. the formula for $\vec{f}_i$ above
for the proper way to modify the equations.
Cluster evaporation is a physical phenomenon, but its theoretical timescale
(to lose a significant part of the cluster) much exceeds 20 crossing times.
Code details need to be described in comments, and in a separate text that will also include the results and their discussion.
One idea on how to speed up the direct summation part of the code is to
gather particles in bunches of, say, 128 each (or 256), load those
consecutively into the CPU/GPU and leave them there for a while.
They're the "$i$" particles. While accumulating their accelerations,
you fetch only the 'other' or $j$-particles in consecutive bunches to
interact with the resident $i$-particles. It is a little hard to
implement the ordered-pairs-only idea then, but doing so might
create write collisions in GPU or CPU RAM anyway. (There are easy ways to
see how many collisions happen, in tests in which every force is replaced
by the number 1.0; in case of frequent collisions in a CPU code,
switch back to the full NxN matrix of interactions.)
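One possible CPU rendition of this bunching idea, parallelized with OpenMP over the resident $i$-bunches and keeping the full NxN sum (so each thread writes only its own $i$-entries and no collisions occur), is sketched below; the tile size, the names, and the assumption that N is a multiple of the tile size are ours:
```c
#include <math.h>
#include <omp.h>

#define TILE 128   /* size of a resident i-bunch and of a streamed j-bunch */

/* Tiled direct summation: each thread owns a bunch of i-particles and
   streams through the j-particles bunch by bunch.  Full N x N sum, so
   every thread writes only its own a[3*i..] entries. */
void accelerations_tiled(int N, const double *r, double *a,
                         double G, double m, double eps)
{
    double eps2 = eps * eps;
    #pragma omp parallel for schedule(static)
    for (int ib = 0; ib < N; ib += TILE) {           /* resident i-bunch */
        for (int i = ib; i < ib + TILE; i++)
            a[3*i] = a[3*i+1] = a[3*i+2] = 0.0;
        for (int jb = 0; jb < N; jb += TILE) {       /* streamed j-bunch */
            for (int i = ib; i < ib + TILE; i++) {
                double ax = 0.0, ay = 0.0, az = 0.0;
                for (int j = jb; j < jb + TILE; j++) {
                    double dx = r[3*j]-r[3*i], dy = r[3*j+1]-r[3*i+1], dz = r[3*j+2]-r[3*i+2];
                    /* the i==j term contributes zero (dx=dy=dz=0) and eps2 keeps it finite */
                    double s = dx*dx + dy*dy + dz*dz + eps2;
                    double inv3 = 1.0 / (s * sqrt(s));
                    ax += dx*inv3;  ay += dy*inv3;  az += dz*inv3;
                }
                a[3*i] += G*m*ax;  a[3*i+1] += G*m*ay;  a[3*i+2] += G*m*az;
            }
        }
    }
}
```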
The whole 128x128 interaction sub-matrix (called a "tile") can be very
quickly evaluated by 128 CUDA threads in one thread block, without
fine-grained reads/writes to RAM (hence a great speedup; it is like working out of
the L1 cache on a CPU), allowing N = 128K to be treated by 1M blocks,
only dozens of them computing concurrently (given today's hardware).
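Roughly, such a tile becomes a shared-memory buffer loaded cooperatively by the block's threads, as in the simplified, single-precision CUDA C sketch below. This is an illustration of the idea, not the actual bodysystemcuda.cu kernel; the names, the float4 layout with the mass in .w, absorbing $G$ into the masses, and the assumption that N is a multiple of the tile size are all our choices:
```cuda
#define TILE 128          /* threads per block = particles per resident bunch */

/* Each thread owns one i-particle; the block repeatedly loads a j-bunch of
   TILE particles into shared memory and all threads sum over it, so global
   memory is touched only once per particle per tile.  pos is (x,y,z,mass). */
__global__ void forces_tiled(int N, const float4 *pos, float3 *acc, float eps2)
{
    __shared__ float4 sh[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < N) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float ax = 0.f, ay = 0.f, az = 0.f;

    for (int jb = 0; jb < N; jb += TILE) {
        sh[threadIdx.x] = pos[jb + threadIdx.x];      /* cooperative load of the tile */
        __syncthreads();
        for (int k = 0; k < TILE; k++) {              /* all pairs within the tile */
            float dx = sh[k].x - pi.x, dy = sh[k].y - pi.y, dz = sh[k].z - pi.z;
            float s = dx*dx + dy*dy + dz*dz + eps2;
            float inv3 = rsqrtf(s) / s;               /* (r^2 + eps^2)^(-3/2) */
            ax += sh[k].w * dx * inv3;                /* G absorbed into the masses */
            ay += sh[k].w * dy * inv3;
            az += sh[k].w * dz * inv3;
        }
        __syncthreads();
    }
    if (i < N) { acc[i].x = ax; acc[i].y = ay; acc[i].z = az; }
}
```
It would be launched as forces_tiled<<<N/TILE, TILE>>>(N, pos, acc, eps*eps).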
On art-1, an N = 128K problem of exactly interacting particles can be
solved, with an admittedly oversimplified time integrator, fast enough
to create a good-looking animation in real time (more than 20 frames per
sec). Here are some details of the method
& implementation in C and CUDA C. On art-1, run the Nvidia-written
program called nbody like so
% ~/nbody -benchmark --numbodies=...
with your preferred number of bodies (in the benchmark mode the
nbody code won't try to send its very pretty OpenGL graphics to the
screen; that's how OpenGL works, it consistently refuses to be
channeled via TCP/IP connections, as if deliberately preventing users from
complaining about 'slow graphics' when running remotely; indeed a good codec
would ameliorate that problem, but nobody seems to offer any solution, and
we have to use nbody on art-1 in text mode).
To dig into the program source code, you'll need to look at bodysystemcuda.cu, nbody.cpp, and the supporting header files bodysystemcuda.h and bodysystemcuda_impl.h. A lot of the material is about graphics and graphics buffers, of course, but try to locate the compute engine.
For a scientific calculation, one timestep should not last much longer than 1 second (we sometimes need hundreds of thousands of timesteps and our lives are of finite length). What number $N$ corresponds to 1 timestep per 3 seconds? What arithmetic speed in TFLOPs ($10^{12}$ operations per sec) does that give, and how much faster is it than the kFLOPS typical of machines 70 years ago?
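One hedged way to set up the estimate (the factor of roughly 20 floating-point operations per pairwise interaction is the convention the Nvidia nbody sample uses when reporting its GFLOP/s; adjust it if you count differently):
$$ \mathrm{FLOPS} \;\approx\; \frac{\sim\!20\,N^2}{t_{\rm step}}, $$
so a hypothetical $N = 10^5$ at one step per 3 seconds would correspond to $\sim 2\times 10^{11}/3 \approx 0.07$ TFLOPs; plug in your measured numbers instead.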
What about the following idea for another optimization. The method, as described above, computes a given pair $ij$ twice: once while summing the acceleration on particle $i$, and again while doing the summation for particle $j$. Meanwhile, Newton's 3rd law states that the second calculation of the force is redundant, because that force is equal and opposite to the one already computed (both accelerations can be updated simultaneously). One can consider only pairs that satisfy $i\lt j$ (or $j\lt i$), to reduce the number of 3-D distances calculated. But will that work well with OMP? You can try and tell us what happened. (We are looking forward to your bright ideas on code optimization. E.g., you might check whether 9 by N or N by 9 arrays work faster, or whether putting them on the stack or the heap, by placing the declaration inside or outside a procedure or Fortran module, works better. Just remember to increase the OS stack size if you get error messages having to do with memory allocation, or segmentation faults, which are the typical symptom of such troubles. This is just a suggestion. Describe your efforts even if they didn't bring much benefit. We ask you to describe your solution to 3.3 and 3.4 in a text file; points will be cut for submitting only the codes.)
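As a starting point for that experiment, here is a hedged sketch (C with OpenMP; the names and layout are ours) of the ordered-pairs loop; the marked line is exactly where the write collisions mentioned earlier can appear, which is the issue you are asked to investigate:
```c
#include <math.h>
#include <omp.h>

/* Ordered-pairs (i < j) direct sum using Newton's 3rd law: each pair is
   evaluated once and both accelerations are updated.  WARNING: with the
   naive parallelization below, different threads may update a[3*j..] of
   the same j concurrently: a race to be solved (atomics, per-thread
   scratch arrays, ...) or avoided by going back to the full N x N sum. */
void accelerations_pairs(int N, const double *r, double *a,
                         double G, double m, double eps)
{
    double eps2 = eps * eps;
    for (int i = 0; i < 3*N; i++) a[i] = 0.0;
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {
            double dx = r[3*j]-r[3*i], dy = r[3*j+1]-r[3*i+1], dz = r[3*j+2]-r[3*i+2];
            double s = dx*dx + dy*dy + dz*dz + eps2;
            double w = G * m / (s * sqrt(s));
            a[3*i] += w*dx;  a[3*i+1] += w*dy;  a[3*i+2] += w*dz;
            a[3*j] -= w*dx;  a[3*j+1] -= w*dy;  a[3*j+2] -= w*dz;   /* racy write */
        }
    }
}
```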