Codes for PHYD57 Advanced Computing in Physics, UTSC

Advice for beginners in HPC (High Performance Computing)

While the course PHYD57 is mostly about advanced, efficient computation in physics, and presents the relevant techniques, some derivations and many algorithms, you must develop the actual programming skills in parallel with attending and following Lectures. The aim is to achieve a practical benefit from this course: the ability to implement one's algorithms as programs written in NumPy/Python and in the HPC languages C(++) and F90/95 (which we will call F95 or Fortran 95, as the differences from the Fortran 90 language definition are microscopic), using several features and tools that help us utilize the full processing power of the hardware: CPU and/or coprocessors. To help with this, programs used in Lectures, as well as some other codes, may be posted on this auxiliary page.

You may use and modify the codes below. As a minimum, analyze them line by line and run them! Knowledge of structures and methods used in these codes is a course requirement of PHYD57. You must have a basic, passive knowledge of (i.e., ability to understand) both C and Fortran. You will actively program your assignments in one of those languages, as you prefer, unless explicitly instructed to use Python.

Codes for Lectures 2 and 3

Example of speed gain in NumPy vs. pure Python
All 3 methods compared. The 3rd method, allocating and filling a 1D array with uniformly spaced floats from one given number to another, works OK; the np.linspace(a,b,N) version is much faster than the ones containing Python loops, in fact several dozen times faster. Image processing program. It uses the heat/diffusion Laplace operator stencil to thoroughly smear out an image, then subtracts the result as a background from the original image. This method of unsharp masking enhances the image details, removes some overexposure, and removes any overall gradient or large-scale, smooth pattern.
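For orientation, here is a minimal sketch (not the posted course code) of the linspace-vs-loop comparison:

import numpy as np
import time

N = 1_000_000
a, b = 0.0, 1.0

t0 = time.perf_counter()
x = [a + (b - a)*i/(N - 1) for i in range(N)]   # pure Python loop
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
y = np.linspace(a, b, N)                        # vectorized NumPy
t_np = time.perf_counter() - t0

print(f"loop: {t_loop:.4f} s,  linspace: {t_np:.4f} s,  ratio: {t_loop/t_np:.0f}x")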

The important part here is the speed of the code. Timing shows that on average the image (644 x 680 pixels) takes 0.00112 s to process once. The equivalent time for a 1M pixel image (1024 x 1024) is about 0.00370 s, i.e., it is processed 270 times per second. This is highly optimized Python code (NumPy arrays and their rolling/shifting are used).

By changing the flag to
you can see that the 1M pixel processing time with Python loops is t = 0.155 s (6.45 times/s). This is 42 times slower! It is likely that the NumPy speed is near the maximum achievable, a good example of the power of vectorization (array operations).
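A minimal sketch of the vectorized smoothing pass (the posted code presumably works along these lines; the parameter names here are illustrative):

import numpy as np

def smooth(img, q=0.2):
    # One heat/diffusion pass: img + q*Laplacian(img), via whole-array shifts.
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
           np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0*img)
    return img + q*lap

img = np.random.rand(1024, 1024)
bg = img.copy()
for _ in range(100):        # many passes thoroughly smear out the image
    bg = smooth(bg)
sharp = img - bg            # unsharp masking: subtract the smooth background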

Below we show some advanced coding in Fortran aimed at speeding up the Python code (though no dramatic speedup will be obtained: NumPy already works great on the Laplacian blurring problem).
This program is written in F95, as its extension .f95 shows. It uses several different methods to scan a square array (a 1K x 1K image) and apply smearing passes to it (q=0.2 means a nearly maximal rate of smearing). CUDA = execution on an Nvidia GPU (GTX 1080 Ti card). The results show that an F95 compiler that knows how to use the GPU (the Portland Group Fortran 95 compiler) can compute one Laplace sweep over a 1M pixel image (double precision values) about 9 times faster than NumPy, at a rate of 2.5-3 billion pixels per second, or 2.5-3 Gpixel/s.
Single precision CUDA is not much (if at all) faster than double precision, and neither is super fast. Maybe the problem we solve is too small, or is inefficiently coded in CUDA, e.g., the time to start all the kernels may be significant.

And here are the programs running on the host (CPU), compiled with two different Fortran compilers; they compare the performance of a variety of acceleration techniques. One is a version compiled by ifort (the Intel compiler), and the other is a version compiled by pgf95 (the PGI, or Portland Group, Fortran compiler). The programs are identical; the only differences are the comments inside on how to compile them, and the quoted results at the end of the ifort version. We can spot some mild differences between the two compilers, but the much more important difference is the timing of the different algorithms contained in separate subroutines.

We will discuss those results in class to show you the fundamental reason why knowing only Python is insufficient. We will return to these programs and try to rewrite them in C/C++ as we learn Linux, Fortran, and C(++). This order is dictated by the fact that we will use compilers working under the Linux operating system (though you should install Windows/Mac versions on your laptop if you find them), and that Fortran is easier to learn, so it should be learned before C++.

Codes for Lecture 4  

711 puzzle by brute force

In Lecture 4 we first saw different codes solving the 711 puzzle, and then its variants. First we had a brute-force attempt to crack the puzzle, i.e., a search through a comprehensive list of very many possibilities. The puzzle asks for four prices which give $7.11 both summed up and multiplied together. It turns out there is a unique exact solution to this problem:
x,y,z,r =     120    125   150     316  cents, P = 711

The only time-saving trick is to search only through combinations satisfying x <= y <= z <= r, so that permutations of the above solution are not computed and do not appear in the output in 4! = 24 copies. This is the Python code. This is the C code. This is the F95 code.
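A minimal brute-force sketch in the spirit of those codes (prices in cents; the triple loop takes on the order of a minute in pure Python, which is exactly why the C/F95 versions exist):

for x in range(1, 712):
    for y in range(x, 712 - x):
        for z in range(y, 712 - x - y):
            r = 711 - x - y - z
            if r >= z and x*y*z*r == 711_000_000:
                print(x, y, z, r, "cents")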

711 and similar problems - optimized algorithms

Next, if prices are in cents (to avoid roundoff error in float calculations), then, for instance, prices like three times 1.00 dollar and one time 7.11 multiply to give 711*100**3 = 711 million (an integer). One can use the prime factor decomposition of the product, 711000000 = 79 * 3*3 * (2*5)**6, to restrict the search to integers that contain only the factors 2, 3 and 5, plus, for exactly one number (call it x), the factor 79. In fact, only 6 possibilities can be shown to exist for x: 79, 79*2, 79*3, 79*4, 79*5 and 79*6. Numbers y and z have a combined 105 possible pair values, and r follows uniquely from x, y, z. So we check fewer than 6*105 possibilities. With the right ordering, we actually only calculate 4k to 5k possibilities, which takes only about a millisecond in Fortran. This is the Python code. This is the F90 code.
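A hedged sketch of this factorization idea in Python (the enumeration details may differ from the posted codes):

P, S = 711_000_000, 711
Q = P // 79                                       # 9_000_000 = 2**6 * 3**2 * 5**6
smooth = [n for n in range(1, S) if Q % n == 0]   # 2-3-5-smooth divisors

for k in range(1, 7):                             # the 6 candidates x = 79*k
    x = 79*k
    for i, y in enumerate(smooth):
        for z in smooth[i:]:                      # enforce y <= z
            r = S - x - y - z
            if r >= z and x*y*z*r == P:
                print(sorted((x, y, z, r)), "cents")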

Finally, there is a version of the brute-force search through a part of the cube of possible (x,y,z) triplets [the well-ordered triplets], which treats not only 7.11 but every number from 3.00 to 20.20 as a possible sum and product of four numbers with two decimal digits after the point. The code is in Fortran 95. This is the F90 code.

Dynkin-Kruskal count - convergent Markov chains

We looked at Markov chains, where a given random sequence of cards or numbers is traversed in a linked-list manner: when you see a value in a list of integers, you use it to determine the position of the next number in the sequence, and you repeat that until the list ends. This is the basis of a card trick in which a magician correctly guesses the final number in a seemingly hidden/random sequence computed by a spectator in his/her mind (but using a given, well-shuffled deck of 52 cards). See the comments in the programs. This is the Python implementation with a nice graphic representation of cards. In order to successfully run the graphics of this code, you need to start Python from a directory containing this program, after having copied
the whole contents of the subdirectory from art-2 onto your machine, as a "jpgs" subdirectory of your directory. You can use the % scp -r . command to do this. This subdir holds 53 jpegs: 52 for the cards of the deck, and one for a blank card. It's fun to watch the shuffled deck, and then a game slowly played with the starting key card number going from 1 to 10, displaying only the key cards in the sequence.

Random sequences that converge when the value of an element is used to create a jump to the next element are called Dynkin-Kruskal sequences, after Eugene Dynkin (1924-2014), a Russian-American mathematician who mentioned them in his work, and the American mathematician Martin David Kruskal (1925-2006). The nature of these Kruskal sequences is that they converge exponentially fast; for N=52 there is already a more than 90% chance that two randomly started sequences converge by the end of the deck, that is, the magician and the spectator independently arrive at the same last key card. I saw the trick demonstrated by 50% of Icelandic astrophysicists (1 of 2, he said) at one conference, but didn't know that these convergent, linked-list-like series are so common. Almost any book can be used to show that. Skip a number of words equal to the number of letters in a key word. By the end of the third line you normally converge to the same sequence forever after, no matter which word in the top line you start with. Try it! I think almost any book will be good, with a possible exception of a book in German (very long compound words :-). This is the math paper that explains why the sequences converge.
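A minimal simulation sketch of the Kruskal count (the conventions assumed here, e.g. face cards J, Q, K jumping 5 steps, may differ from the course code):

import random

def jump(v):                 # step size of a card; face cards jump 5
    return 5 if v > 10 else v

def final_key(deck, start):
    i = start
    while i + jump(deck[i]) < len(deck):
        i += jump(deck[i])
    return i                 # position of the last key card

deck = [r for r in range(1, 14) for _ in range(4)]   # 52 cards, values 1..13
trials, same = 10_000, 0
for _ in range(trials):
    random.shuffle(deck)
    s1, s2 = random.randrange(10), random.randrange(10)  # two starts among 1..10
    same += final_key(deck, s1) == final_key(deck, s2)
print("the two chains end on the same card in", same/trials, "of shuffles")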


Codes for Lectures 5 and 6

We continue learning the HPC languages C/Fortran, and begin to use a new directive-driven language called OpenMP (Open Multiprocessing), which is very popular and powerful on multicore CPUs and the CPU-like platform MIC (Intel Xeon Phi). OpenMP cannot be used on graphics cards, although maybe one day it will talk to GPUs via some CUDA or OpenCL interface.

We programmed a simple loop over a 2-D array in Fortran. This code does not use any explicit parallelization. The array is declared in the program, not in a separate module, which makes the compiler allocate it on the stack, not on the heap (two different data structures in RAM; google "difference stack heap fortran"). Therefore it is quite possible to get a segfault (segmentation fault) while running the executable, because of the limited stack size. To fix that problem, in the tcsh shell on art-2 we say something like: % limit stacksize 4000m, while in bash we would use the ulimit command (for the proper option syntax please see the man pages or google it).

The above code is poorly instrumented, and we can get only a rough idea of the speed by running it with the time command, like so: % time simpler-nb-0.x (if we have named the executable file simpler-nb-0.x). It is better to use the OMP timer omp_get_wtime(), an intrinsic function that returns a double precision (8 B!) result with the time in seconds. See the code.

The data are still on the stack, but the code prints more information. It multiplies the apparent number of loop passes by the estimated FLOP count per pass, and interprets the result in two ways: either as a FLOP rate or as a result of bandwidth limitation:

$ time simpler-nb-1.x
  done in    36.7159843444824      ms
  throughput   32.68332      GFLOPs << theor. limit
  if bandwidth limited:   130.7333      GB/s 
0.101u 0.056s 0:00.16 93.7%	0+0k 0+0io 0pf+0w

This example shows that this version of the code is certainly not computationally limited, but likely bandwidth-limited, because computations on the CPU could go an order of magnitude faster. However, look at these different optimization levels used while compiling the code, and the numbers printed by the program:
art-1[397]:~/avx$ ifort -O0 -openmp simpler-nb-1.f90 -o simpler-nb-1.x
  done in    1158.52308273315      ms
  throughput   1.035802      GFLOPs << theor. limit
  if bandwidth limited:   4.143206      GB/s 

art-1[401]:~/avx$ ifort -O1 -openmp simpler-nb-1.f90 -o simpler-nb-1.x
art-1[402]:~/avx$ time simpler-nb-1.x
  done in    356.431007385254      ms
  throughput   3.366711      GFLOPs << theor. limit
  if bandwidth limited:   13.46684      GB/s 
0.401u 0.068s 0:00.47 97.8%	0+0k 0+0io 0pf+0w

art-1[403]:~/avx$ ifort -O2 -openmp simpler-nb-1.f90 -o simpler-nb-1.x
art-1[404]:~/avx$ time simpler-nb-1.x
  done in    36.4382266998291      ms
  throughput   32.93245      GFLOPs << theor. limit
  if bandwidth limited:   131.7298      GB/s 
0.103u 0.052s 0:00.15 100.0%	0+0k 0+0io 0pf+0w

art-1[405]:~/avx$ ifort -O3 -openmp simpler-nb-1.f90 -o simpler-nb-1.x
art-1[406]:~/avx$ time simpler-nb-1.x
  done in    36.4911556243896      ms
  throughput   32.88468      GFLOPs << theor. limit
  if bandwidth limited:   131.5387      GB/s 
0.096u 0.059s 0:00.15 93.3%	0+0k 0+0io 0pf+0w
Yes, ~150 GB/s is the absolute limit of the RAM bandwidth; actually I think this is a fantastic result, owing partly to the effective use of its cache by the CPU. Level -O0 means no optimization, and that's bad for speed, but the standard level -O2 is good. -O3 doesn't give any more speedup.
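The arithmetic behind those printed numbers is simple: divide the estimated work by the measured wall-clock time. A minimal NumPy sketch of the same bookkeeping (an illustration, not the course code; time.perf_counter plays the role of omp_get_wtime):

import numpy as np, time

n = 4096
a = np.random.rand(n, n); b = np.random.rand(n, n)

t0 = time.perf_counter()
c = 2.0*a + b                    # 2 FLOPs per element (multiply + add)
dt = time.perf_counter() - t0

flops = 2.0*n*n                  # estimated FLOP count
bytes_moved = 3*8*n*n            # read a, read b, write c (8 B doubles)
print(f"done in {1e3*dt:.2f} ms")
print(f"throughput {flops/dt/1e9:.2f} GFLOPs")
print(f"if bandwidth limited: {bytes_moved/dt/1e9:.2f} GB/s")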


Codes for Lectures 7 and 8

Multilanguage programming: Calling C from Python

First, look at the following program, with a few functions that we will compile into a shared object library (.so: a binary library that can be used by more than one program; it is not incorporated into the calling program, but is a standalone binary file that can be linked to a calling program at runtime). This sample of functions will run on the CPU, and utilize OMP in the last function. It's like a simple benchmark code, doing a billion or so operations on floats.

Here is the calling C program for our library:

Study how these programs are compiled, try them yourself, and modify them to your needs (as with all the others below). The compilation flags I used are mentioned in their headers. In particular, the option -lfun links the library with the program use_libc.c ["lib" is an implied beginning of the .so library name].

Now let's compile a library (first, a simplified version without the OpenMP part and the big array) for use with Python:

Here is the calling Python program for library
It is, as the name indicates, a Python 2 version, since only that version is fully installed on art-2, where I want you to try compiling and running the programs yourself in user phyd57's sub-directory progD57.

As you see, there are certain tricks that need to be known or googled up about calling C or other libraries from Python, since it's not trivial. But you do have this template now, so explore and modify it to your needs.
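For reference, a minimal ctypes template (the function name and signature below are illustrative, not necessarily those exported by the course library libfun.so):

import ctypes
import numpy as np

lib = ctypes.CDLL("./libfun.so")    # load the shared object at runtime

# Suppose the library exports: double sum_sq(double *x, int n);
lib.sum_sq.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.sum_sq.restype = ctypes.c_double

x = np.arange(1e6)                  # contiguous float64 array
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print(lib.sum_sq(ptr, x.size))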

Finally, let's call the full version of the library with OMP.
Here is the calling Python program for

We see that the estimates of speed are virtually the same for C calling the C library and for Python calling the same library. That's good; it means that the C/Python interface does not steal any performance, in this example.

Laplace stencil again

Please go back to the links (Lectures 2 and 3) to the programs efficiently implementing the cross-shaped stencil of the Laplace operator. Now you are going to be able to understand the details, such as the OpenMP parallelization!


Take a look at this program, which numerically answers the following question:
What is the chance that 4 points placed at random on a sphere form a tetrahedron that includes the center of the sphere? (It's 1/8.)

First, a simple CPU version

And this is the CPU version with changes; it can be compiled with the GNU, Intel, or PGI Fortran compilers (I believe Intel produces the fastest code):
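If you want to check the 1/8 answer quickly, here is a hedged Monte Carlo sketch in Python (independent of the Fortran codes): the center is inside the tetrahedron of four points exactly when the origin is a strictly positive convex combination of them.

import numpy as np

rng = np.random.default_rng(1)

def contains_center(p):                   # p: 4x3 points on the unit sphere
    A = np.vstack([p.T, np.ones(4)])      # solve sum(lam*p)=0, sum(lam)=1
    b = np.array([0.0, 0.0, 0.0, 1.0])
    try:
        lam = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:         # degenerate tetrahedron
        return False
    return bool(np.all(lam > 0.0))

N, hits = 100_000, 0
for _ in range(N):
    v = rng.normal(size=(4, 3))
    p = v / np.linalg.norm(v, axis=1, keepdims=True)  # uniform on the sphere
    hits += contains_center(p)
print(hits/N, "vs the exact 1/8 =", 0.125)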

Plotting from any Fortran or C(++) - DISLIN

Or even from Python, Ruby, Java etc. Home page of DISLIN. Download. DISLIN Documentation, Contents.

Midterm problem 1 implemented in Fortran, in the program volcano-disl.f90. See the program below; compile it and run it.

The crucial computations are done like this (click the nicely colored view):

There is an opportunity for you to explore several issues by playing with the program and running it in art-2's phyd57 account, in the dislin subdirectory, where it resides:
•   How does the compiler (Intel ifort vs. PGI pgf90) affect performance? PGI produces much slower code, processing ~50 instead of ~140 mega-ejecta per second.
•   OpenMP scheduling options, play with it - no big differences.
•   Change Nthreads from 1 to 24; see that 12 is the optimum.
•   Insert $omp atomic or $omp critical around the Map(ix,iy) = Map(ix,iy)+1 update, and watch the speed drop by a factor of a few to 30 times, respectively. That's the price to pay for no contentions while updating Map pixels. See the statistics printed by the program. We discussed this in tutorial 6, and it turned out to be true: the probability of simultaneous writing to a pixel location (a race condition resulting in one or both updates being dropped) is only P < 0.001.
•   different ranges of V
v0 = co*(0.5 + 0.5*v0) -> v0 = co*(0.95 + 0.1*v0) (see the ring?)
•   Notice that I generate only one random number per impact. A pseudorandom float contains quite a few random digits, so one can derive three (as here) or even more, slightly coarser-grained, pseudorandom floats from it (see the sketch after this list). Even though I didn't see much speedup from this trick, it saves a substantial amount of memory.
•   Read and learn the usage of DISLIN from the preceding, as well as from these two codes (compilation instructions inside): ex11_1.c and ex11_1.f90
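Here is the promised sketch of the digit-splitting trick (the mechanics are assumed for illustration and may differ from the volcano code):

import numpy as np

u = np.random.rand(5)                 # one pseudorandom float64 per impact
d = (u * 1e9).astype(np.int64)        # keep 9 decimal digits
r1 = (d % 1000) / 1000.0              # three disjoint 3-digit blocks give
r2 = (d // 1000 % 1000) / 1000.0      # three coarser-grained uniforms
r3 = (d // 1_000_000 % 1000) / 1000.0
print(np.c_[u, r1, r2, r3])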


Codes for Lectures 10-12  

CUDA C and CUDA Fortran

CUDA Fortran blog by Greg Ruetsch (author of one of the recommended books in the book directory).

CUDA C blog by Mark Harris, NVIDIA   <== recommended

CUDA C presentation - different modes of usage, by Cyril Zeller, NVIDIA

CUDA C Programming Guide. Detailed. Definitive (By Nvidia!). Read first 30 pages.

CUFFT library description - useful when you need FFT. Some learning curve, but once you get it to work, the world of DSP is yours. Has examples of all FFTs (complex-to-complex, real-to-complex, etc.).

CUDA nvcc compiler manual

Paper comparing CUDA in Fortran and C with CPU performance, as a function of job size.

Dr. Dobbs blog CUDA, Supercomputing for the Masses. Links to all 12 pieces. A bit old but informative.

Cuda programming model - Refresher.


In the phyd57/progD57 account on art-2 you can find three simple tests of the basic saxpy (meaning: single-precision a*x plus y), i.e., the basic linear algebra operation. (Sometimes we give the extension .cu to CUDA codes written for nvcc compilation, but here it's a version for CPU, never mind!) This one does the honest CUDA kernels. The F95 version has more diagnostics, showing bandwidth, compute performance, and a way to call shell commands from within the program.
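A NumPy analogue of the saxpy test, for comparison with the CPU/GPU versions (a sketch with a crude bandwidth estimate, not one of the posted codes):

import numpy as np, time

n = 1 << 24
a = np.float32(2.0)
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

t0 = time.perf_counter()
y += a*x                             # saxpy: single-precision a*x plus y
dt = time.perf_counter() - t0

print(f"{2*n/dt/1e9:.2f} GFLOPs, ~{3*4*n/dt/1e9:.1f} GB/s")   # 4 B floats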

Diffusion stencil computation

This is a GPU version of the 5-point stencil Laplacian code you have seen before, written in part using CUDA Fortran, to be compiled with the PGI compiler pgf95. This version is dp (double precision floats, 8 B); there is also an sp version (single precision floats, 4 B) with a similar name on art-2.

Tetrahedra on graphics gear

This is a GPU version of the code you may have seen before if you were interested in tetrahedra; written in part using CUDA Fortran; to be compiled with the PGI compiler pgf95.

Tau program - IRI in disks

This program, tau-dis-cpu.f90, works, but it has some problems. Find and remove them.


Codes for Python skills refresher outside Lectures

Numerical approximation and Taylor expansions

If you forgot how to use PyPlot, refresh your memory here. This code shows the Taylor (Maclaurin) series of the exponential function exp(x). A single maximum is found in the considered interval of cos(θ).

Partial summation of infinite series over 1,2,3,...,k terms has the form:
exp(x) = e^x = 1 + x/1! + x^2/2! + x^3/3! + ... + x^k/k! + ...

The factorial function k! is defined as k! = 1*2*3*...*(k-1)*k. As you see, we do not need to start this multiplication from scratch, doing repeated loops over 1,2,3,...,k. Rather, since k! = k*(k-1)!, we simply start from the previous value (k-1)! and repeatedly update the factorial variable in every loop passage. See the code.

We keep the ever-improving partial sum of the series terms (functions) and call it y. We plot y vs. x for every k. We stop at k ~ 20.
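A minimal sketch of this incremental summation (assumed to mirror the posted code's idea):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4.0, 4.0, 401)
y = np.ones_like(x)                  # k = 0 partial sum
term = np.ones_like(x)
for k in range(1, 21):               # stop at k ~ 20
    term *= x/k                      # x**k/k! from x**(k-1)/(k-1)!
    y = y + term
    plt.plot(x, y)                   # overplot every partial sum
plt.plot(x, np.exp(x), "k--")        # the target function
plt.ylim(-5, 40)
plt.show()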

The figure constructed with MatPlotlib shows that the larger the absolute value of the argument x, the more terms are needed for a good approximation of the function exp(x), both for positive and negative x.
Remark: use the zoom button on the plotting widget that your Python installation provides. This way you can disentangle particularly dense parts of the figure!

Another example of the same type of simple overplotted lines with automatic coloring by PyPlot: it does the same as above for the trigonometric function cos(x), whose Taylor (Maclaurin) series is given by
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ... + (-1)^k * x^(2k)/(2k)! + ...

Epidemics modeling

The program is written in Python for easy graphical output (animation). Therefore it's not very efficient. But it allows you to explore many input parameters, hardcoded for simplicity in different versions.

I/O and stats in Python. One code shows how to read and write to/from ASCII files (comma- or space-separated values). Another shows how to compute (in 3 different ways) simple statistics, like average values and two kinds of sigmas (st. dev. of the data, st. dev. of the mean), for data read from files.

Extrasolar orbit, Kepler equation
A planetary system around a small M5 dwarf star not far from Earth (10 pc) was discovered and announced in late 2019. The media say it's a big discovery challenging the existing planet formation theories. Not necessarily.

The planet is on an eccentric orbit, and the program produces not only a visualization of the planet's uneven, eccentric motion, but also the correction to the mean flux received on the planet because of e = 0.4356. It even computes the temperature of the planet (a sketch of the underlying Kepler-equation solution follows below).
Exponential (Artymowicz) fractal calculator.
Exponential (Artymowicz) fractal calculator, different color map & region.
Generate a bunch of random numbers and show their histogram; compares uniform and normal distributions.
Calculation of Pi by Monte Carlo. A classic of Monte Carlo demonstrations. Simple version.
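As for the Kepler equation itself, here is a minimal Newton-iteration sketch (e = 0.4356 as quoted above; not the posted code):

import numpy as np

def kepler_E(M, e=0.4356, tol=1e-12):
    E = np.array(M, dtype=float)     # initial guess E = M works for e < 1
    for _ in range(50):
        dE = (E - e*np.sin(E) - M) / (1.0 - e*np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E                         # solves E - e*sin(E) = M

M = np.linspace(0.0, 2*np.pi, 9)     # mean anomaly over one orbit
print(kepler_E(M))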

Monte Carlo method. Calculation of Pi by Monte Carlo. This version compares the Monte Carlo accuracy with a regular integration on a uniform mesh of N points, the same number as MtC particles (so, similar computational effort). We begin to see the problem with MtC: the error converges to zero slowly, as 1/sqrt(N).
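A minimal sketch of the classic estimate: the fraction of random points in the unit square falling inside the quarter circle tends to π/4.

import numpy as np

rng = np.random.default_rng()
for N in (10**3, 10**5, 10**7):
    x, y = rng.random(N), rng.random(N)
    pi = 4.0 * np.mean(x*x + y*y < 1.0)
    print(N, pi, "error ~", abs(pi - np.pi))   # error shrinks like 1/sqrt(N)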

Radioactive decay Calculation of decay histories of radioactive isotope U-235 by Monte Carlo simulations. Compares different realizations of the experiment in one plot.

Radiation transfer. A dusty box, or rather a dusty duct, is simulated to see how we can solve a basic radiation transfer problem by randomly placing absorbers in the field of view. It turns out we can reproduce the analytical result I = I0 exp(-τ) very exactly.
On the left is a layer of clouds with τ >> 1, which we analysed in Lecture 5 (San Francisco Bay, 21 July 2019). To the right, I photographed an optically thinner cloud (τ ~ 3.6?), made not of water droplets but of smoke and ash particles from forest/grass fires, near Meteor Crater, AZ, 19 July 2019.
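A minimal sketch of the same experiment (the setup is assumed: absorbers placed at random along each line of sight, so the number met by a ray is Poisson-distributed, and a ray survives only if it meets none):

import numpy as np

rng = np.random.default_rng()
tau, N = 3.6, 1_000_000
hits = rng.poisson(tau, N)           # absorbers met by each of N rays
print("transmitted:", np.mean(hits == 0), " vs exp(-tau) =", np.exp(-tau))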

Random walks. Multiple 1D (i.e., 1-dimensional) random walks on one plot. This simulates diffusion in physics, the paths of photons in the Sun, some extremely large Galton board, the flight of an arrow through turbulent air, how perfume spreads in still air, how heat spreads through solids, and/or a number of other things. There are no restrictions on the range of the random walk in this version. A modified version of the 1D random walk adds an absorbing boundary at X=0 (a.k.a. the Gambler's Ruin Problem).
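A minimal sketch of the overplotted walks (not the course code):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
steps = rng.choice([-1, 1], size=(20, 1000))   # 20 walkers, 1000 steps each
paths = np.cumsum(steps, axis=1)
plt.plot(paths.T)                    # one line per walker
plt.xlabel("step"); plt.ylabel("X")
plt.show()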

Numerical calculus: differentiation and root finding
A thorough error analysis, covering both roundoff and truncation errors, analytical and numerical, of two stencils for the first derivative; it demonstrates the advantages of high-order stencils.
The bisection method of root search for a 1-D real function; illustrates the generally robust performance, and the possibility of missing some zeros.
The secant method of zero finding for a 1-D real function. There was a bug in the program; it now converges as fast as it should, and in fact impressively fast, 'superlinearly', almost like Newton's scheme, in fewer than 8 steps as a rule.
Newton's method of zero finding for a 1-D real function; illustrates the relatively fast convergence in the case of a 3rd-order polynomial function. When it works, it works really great. When it fails, it fails like no other algorithm. Can you say when?
A comparison of all three methods of finding zeros efficiently; one of them (Newton's) is run in two variants, with numerical and with analytical derivative calculation. One can see the huge advantage of the methods which converge quadratically, or at least faster than linearly, meaning that the next error is asymptotically proportional to a power of the previous error larger than 1 (sometimes 2).
and a variant
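A minimal sketch contrasting the two extremes, bisection and Newton, on an assumed example function f(x) = x**3 - x - 2 (not the function used in the course codes):

import numpy as np

f = lambda x: x**3 - x - 2.0
df = lambda x: 3.0*x**2 - 1.0

a, b = 1.0, 2.0                      # bisection: robust, linear convergence
for _ in range(40):
    m = 0.5*(a + b)
    if f(a)*f(m) <= 0.0:
        b = m
    else:
        a = m
print("bisection:", m)

x = 1.5                              # Newton: quadratic convergence near root
for _ in range(8):
    x -= f(x)/df(x)
print("Newton:   ", x)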
Archival data from Oxford, UK, weather analysis 1929-2010.
Data on the number of sunny hours per month, which are used in this exercise, are in this file:

Numerical calculus: Integration methods
•   Compares the convergence of several simple 1st- to 4th-order integration schemes on the function exp(-x) between x=0 and x=1 (see the sketch after this list).
•   Compares the same schemes on the function 1/(1+x*x) between x=0 and x=1; the name derives from Arctan(x), which gives the analytical value of the integral. BTW, all of the integration codes have known exact answers, mostly involving the number π; these codes therefore calculate that number numerically.
•   Compares the same schemes on the problem of the length of a curtain as a line integral, using the Pythagorean theorem: an integral of (1+(y')**2)**0.5, where y is the curtain's shape and y'=dy/dx. This turns out to be an easy problem for numerical integration schemes!
•   Compares the same schemes on the function sqrt(1-x*x) (1/4 of a circle between x=0 and x=1). This function poses a lot of problems for numerical integration schemes, because of its unbounded derivative as x --> 1.
•   Compares the same schemes on the length of the curve sqrt(1-x*x) (1/4 of a circle between x=0 and x=1). This integration also poses a lot of problems for most numerical integration schemes, because of the unbounded derivative as x --> 1.
•   Solves the linear least-squares fit problem. The function has 3 parameters (amplitudes) multiplying a parabola. The data have a moderate amount of noise added. The program restores the parabola, hence denoising the data.
•   Solves the linear least-squares fit problem with 4 parameters (amplitudes) multiplying a parabola plus a cosine function of known period. The data include noise. The program successfully restores the hidden truth (line), hence denoising the data.
•   Solves the same 4-parameter problem with a large amount of noise added, making it visually difficult to separate the smooth functional dependence from quasi-Gaussian noise. Despite this, the program fairly successfully restores the smooth underlying function and denoises the data, although the quality of the parameter restoration starts to suffer (esp. the curvature a[0], which comes out too low).
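Here is the sketch promised above: midpoint (2nd order) vs. Simpson (4th order) on the integral of exp(-x) over [0,1] (an illustration, not one of the posted codes):

import numpy as np

exact = 1.0 - np.exp(-1.0)
for N in (4, 16, 64, 256):           # N must be even for Simpson
    h = 1.0/N
    xm = (np.arange(N) + 0.5)*h
    mid = h*np.sum(np.exp(-xm))      # midpoint rule
    x = np.linspace(0.0, 1.0, N + 1)
    fx = np.exp(-x)
    simp = h/3*(fx[0] + fx[-1] + 4*fx[1:-1:2].sum() + 2*fx[2:-2:2].sum())
    print(N, abs(mid - exact), abs(simp - exact))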
This program simulates the throw of a ball (or an arbitrary test mass) in vacuum. Note that air drag will be studied in Lecture 11. Initial conditions: (x0,z0) = (0,0) & (vx0,vz0) = (70,20) m/s.
1-D ODE integration; solve dy/dx = f(x,y) = -sin(x)*y, with boundary condition y(0)=1, using various integration schemes.
Same as above, but over a 10 times longer period of time. Despite the initially high precision of some methods, a drift in the solution becomes evident. This is bad news: one only has to wait until the drift destroys the initial accuracy.
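A minimal sketch of such a test (the setup is assumed; the exact solution of dy/dx = -sin(x)*y with y(0)=1 is y = exp(cos(x)-1)):

import numpy as np

f = lambda x, y: -np.sin(x)*y
for N in (100, 1000):
    h = 10.0/N                       # integrate from x=0 to x=10
    ye = yr = 1.0
    for i in range(N):
        x = i*h
        ye += h*f(x, ye)                        # Euler, 1st order
        k1 = f(x, yr); k2 = f(x + h, yr + h*k1)
        yr += 0.5*h*(k1 + k2)                   # Heun/RK2, 2nd order
    exact = np.exp(np.cos(10.0) - 1.0)
    print(N, abs(ye - exact), abs(yr - exact))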
Motion of a physical pendulum (able to go to the top and over the top on a rigid massless arm, depending on initial conditions). The script implements a leapfrog algorithm, but looks deceptively simple, like the Euler method. It does not explicitly do the half-step with positions, then full-step velocities, then again a half-step with positions, nor does it start in a special way to separate speeds and positions in time by dt/2... yet it is accurate (2nd-order accurate).
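A minimal sketch of that deceptively simple form (kick-drift leapfrog; units assumed with g/L = 1):

import numpy as np

dt, nsteps = 0.01, 5000
th, w = 3.0, 0.0                     # start near the top: large swing
for _ in range(nsteps):
    w += -np.sin(th)*dt              # kick: full step in angular velocity
    th += w*dt                       # drift: full step in angle, NEW velocity
    # Interleaved this way, th and w are effectively offset by dt/2,
    # which is what makes the scheme 2nd-order accurate (leapfrog).
print(th, w)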

This is the diffraction-interference pattern from two slits on a screen 1 m away. The slits are 30 microns wide, spaced 240 microns apart.
Compare this pattern with the pattern of diffraction on one slit! And click on the middle illustration above: you'll see the red laser interference fringes of the actual experiment on the right-hand side.
This code is deliberately a little buggy and slow. But that's how one would very naively approach the solution of the 2-D linear wave equation:
u_tt = c**2 (u_xx + u_yy)
using Python loops (hence slowness!)
shows a wave pattern from an initial disturbance of a pond near the corner. The algorithm uses only Z arrays (different copies for the new, current, and previous time steps of array Z).
shows a wave pattern from a localized initial disturbance far from the boundary. The algorithm is only slightly different: it explicitly introduces a V array, the first time derivative of array Z (times dt). It runs at about the same speed(?) as the previous one.
shows that one can force boundary conditions not only on the walls, but also in the middle, creating a region impermeable to waves.
This is the famous double-slit experiment, simulated on a water surface or as sound waves (no quantum mechanics here, though the QM experiment is strictly analogous and creates diffraction-interference patterns for the same reason).
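A vectorized sketch of one such scheme (the Z-arrays version described above; leapfrog in time, walls forced to zero; not the posted loop-based code):

import numpy as np

n, c, dt, h = 200, 1.0, 0.35, 1.0    # c*dt/h < 1/sqrt(2) for stability
Z = np.zeros((n, n)); Zold = Z.copy()
Z[30:34, 30:34] = 1.0                # initial disturbance near a corner
for _ in range(500):
    lap = (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
           np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4.0*Z) / h**2
    Znew = 2.0*Z - Zold + (c*dt)**2 * lap
    Znew[0, :] = Znew[-1, :] = Znew[:, 0] = Znew[:, -1] = 0.0   # walls
    Zold, Z = Z, Znew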

Pawel Artymowicz /