PHYD57 Advanced Comp. Methods in Physics - Final Exam. SOLUTIONS & COMMENTS.

Preamble

Computations can be done on your own computer or on art-1 (please remember to do them in your own subdirectory!).
Submit results to Quercus Assignments, final exam section. Multiple files are OK, and so are multiple submissions of updated files, within the normal time of 3 hrs.
Analytical solutions can be submitted as scans or snapshots taken with your phone, but should be in a common format like jpg or pdf. Descriptions of algorithms and solutions should go in a doc, txt or pdf file (preferably, one file for the whole exam).
Numerical codes should be submitted to Quercus before the submission deadline, which normally is 17:10, and also (this can be done after 17:10) left in your subdir on phyd57@art-1, in case we want to run them.

Caltech (Calif. Inst. of Technology) students strictly & proudly adhere to an honor code. UofT students should do the same. Copying codes produced by "AI" or browsers (i) is plagiarism in this course, (ii) is easily detectable, and (iii) most of them don't actually solve the problems anyway. As an exception (not a recommendation), you may use internet tools to resolve syntax errors in your code if you get stuck on them (which we judge unlikely).

Problem 1 [50 p.] Elastic 3-D pendulum

1. Create a single-thread program in C/C++ or Fortran using double precision variables, calculating $T=10^3$ seconds of motion of the elastic pendulum with the following characteristics. The suspension point is fixed at the origin of Cartesian coordinates. Earth's gravity results in a constant z-axis acceleration $\vec{g} = -\hat{z}$ 9.81 m/s$^2$; the versor $\hat{z}$ points upward, while $\vec{g}$ points downward. A mass is attached to the suspension point by a massless elastic rod (always straight, but extendable) that, depending on the rod's length $l$, produces on the attached mass an acceleration following Hooke's ideal spring law, directed toward the suspension point. The full acceleration is $$ \vec{f} = - \frac{k}{m}(l-l_0) \,\frac{\vec{r}}{l} \;+ \vec{g}$$ where $l = |\vec{r}|=\sqrt{x^2+y^2+z^2}$, parameter $\frac{k}{m} = 33$ s$^{-2}$, and $l_0 = 0.1$ m is the zero-force length of the rod.

Initial conditions (position in meters and velocity of point mass in m/s) are: $$ \vec{r}_0 = (x,y,z)_0 = (-0.1, 0.2, 0.1)\; {\rm m}$$ $$ \vec{v}_0 = (v_x,v_y,v_z)_0 = (0, 0, -0.1)\; {\rm m/s}.$$

The choice of integration scheme is up to you. If you choose a leapfrog method then remember to offset the position and velocities by dt/2 (except when measuring energy $E$), take dt = $10^{-3}$ s. If you use the Euler method, compensate its smaller accuracy by a smaller dt = $10^{-4}$ s. If you use a more sophisticated algorithm (not necessary!), choose an appropriate, larger timestep dt, making sure that the relative energy error (ending vs. starting value) is $\ll 10^{-3}$.
Analytically show that the specific mechanical energy (per unit mass of pendulum) is given by the formula $E = \frac{1}{2} [v^2 + \frac{k}{m}(l-l_0)^2] +g\,z$. Evaluate its relative error, i.e. ${\Delta}E/E = [E(T)-E_0]/E_0$ at $t= T$, where $E_0=E(t=0)$.
As output, print: the number of equal timesteps your algorithm made, final position $(x,y,z)$ and length $l$, as well as $\Delta E/E$. [As a result of a misprint on energy formula in the text of exam, energy conservation will not be checked.]

2. Put an outer loop on the integrations, doing three different simulations: first, the standard case specified above; second, a version with 2 times larger dt, to verify the accuracy of the calculation; and third, a version with a slightly different value of $y_0$:   $\vec{r}_0 = (x,y,z)_0 = (-0.1, 0.21, 0.1)\; {\rm m}$. Compare results with the standard case and provide numerical and physical commentary. Is the elastic pendulum solution very sensitive to initial conditions? Such sensitivity (together with the non-periodicity of the solution, which you were not asked to explore) would make our elastic pendulum chaotic. Can the evolution be reliably computed in timeframe $T$?

3. Comment in one sentence on whether the calculation would benefit from OpenMP or CUDA, and why.

SOLUTION

Coding styles differ and there is no single valid solution. I wrote this program in two versions, Euler and leapfrog:
fin-prob1-eu-2024.f95 (Euler)   fin-prob1-lf-2024.f95 (Leapfrog).

Problem 2 [30 p.] Trapezoidal integration

One of the popular methods of computing 1-dimensional integrals is the trapezoidal rule, in which we divide the interval of integration $x\in [0,1]$ into N equal subintervals, then approximate the areas under a curve $f(x)$ by N trapezoids, which is equivalent to an (N+1)-point, piecewise-linear approximation of $f(x)$. By using Taylor expansions, derive analytically the error term of the approximation. Use discretization step $h = 1/N$ (some would call it $\Delta x$, but that's more difficult to type than $h$). Show that this error, called the truncation error because it stems from a truncated Taylor series expansion, is proportional to $h^3$ in every subinterval, and to $h^2$ on the whole interval of $x$. If you encounter second or higher derivatives of $f$ over $x$, you may assume that for the function you integrate they are approximately 1 or less in absolute value.

Next, write an approximate formula for the roundoff error, assuming it results from random accumulation (like a random walk) over N steps of calculation, each of which requires 6 additions and multiplications. Every arithmetic operation contributes $\varepsilon^2$ to the variance (sum of squares of errors), where $\varepsilon=10^{-7}$ is the machine epsilon, the possible error in one arithmetic operation. Show that this error drops rather than grows with $h$ (with what power and front coefficient?) because the number of steps $N$ decreases with growing $h$, making the roundoff error smaller. Useful analogy: roundoff error growth is a random walk. How does the root-mean-square distance covered in a random walk depend on N?

Compare the two types of error and establish the optimum discretization step $h$ that guarantees the least combined truncation and roundoff error in trapezoidal integration of the considered smooth function on the [0,1] interval.

SOLUTION

1. Consider a subinterval starting at $x_0$, ending at $x_1 = x_0 +h$. We want to find the truncation error of the trapezoid formula: $$ \tilde{I} =\frac{f(x_0) + f(x_1)}{2}\,h . $$ We have to get the series expansion for the exact integral $I$, and evaluate $\tilde{I}-I$. First,
$$f(x) = f(x_0) + f'(x_0) (x-x_0) + f''(x_0) (x-x_0)^2/2!\,+ ...$$ is the Taylor series (exact) for the integrand $f(x)$. In particular, $f(x_1)$ equals $$f(x_1) = f(x_0) + f'(x_0) h + f''(x_0)h^2/2! + f'''(x_0) h^3/3!\, + ...$$ and therefore $\tilde{I}$ can be written as $$\tilde{I} = f(x_0)h+\frac{1}{2}f'(x_0)h^2 +\frac{1}{4}f''(x_0)h^3+\frac{1}{12} f'''(x_0) h^4\, + ...$$ On the other hand, we can integrate the series to obtain $$ I = \int_{x_0}^{x_0+h} f(x) \,dx = f(x_0) h + \frac{1}{2} f'(x_0) h^2 + \frac{1}{6} f''(x_0) h^3 \,+ ... $$ Subtraction yields the truncation error formula for one subinterval of width $h$: $$\tilde{I} - I = \frac{1}{12} f''(\zeta) h^3, $$ where $\zeta$, by the mean value theorem, lies somewhere between $x_0$ and $x_1$. (The sign is positive: comparing the $h^3$ terms, $\frac{1}{4}-\frac{1}{6}=\frac{1}{12}$; the trapezoid rule overestimates the integral of a convex function.)

The total error on N subintervals of width $h$ (where the problem formulation says that $N= 1/h$) is the sum of such expressions. Considering that all second derivatives are $\pm O(1)$, the global truncation error (in the absolute value sense) equals $$ \Delta I_{tr} = N\, |\langle \tilde{I} - I \rangle| = \frac{O(1)}{12} h^2 + O(h^3).$$ We can drop the $O(h^3)$ terms since $h^3\ll h^2$. This proves that the trapezoid method has second order accuracy.

2. What about the roundoff error? We can arrange the computation such that in each subinterval, one new value of the function $f$ is evaluated. Assume that this takes some number $n$ of arithmetic operations. In a random walk of $n N$ steps, each of size $\varepsilon$ (machine epsilon) on average, the root-mean-square distance reached (or roundoff error) will have absolute magnitude $$\Delta I_{\varepsilon} = \sqrt{n N}\, \varepsilon = \varepsilon \sqrt{\frac{n}{h}}.$$ This proves that the roundoff error drops with $h$, as its square root, q.e.d.

3. The optimum discretization step $h$ is an intermediate $h$, for which neither the truncation nor the roundoff error is very large. The minimum absolute truncation+roundoff error requires $$\Delta I = \varepsilon \sqrt{\frac{n}{h}} +\frac{O(1)}{12} h^2 \to \min.$$ Setting the derivative over $h$ to zero (and multiplying through by $2h^{3/2}$) gives the condition $ -\varepsilon\sqrt{n} + \frac{O(1)}{3} h^{5/2} = 0$, or the optimum step $$ h \approx (3\varepsilon\sqrt{n})^{2/5}.$$ The minimum error is $\Delta I = 5\cdot 2^{-2}\cdot 3^{-1/5}\, \varepsilon^{4/5} n^{2/5}$. For example, taking $n = 6$ and $\varepsilon = 10^{-7}$, we get that $ h \approx (7.3\cdot 10^{-7})^{0.4} \approx 0.0035$, or $N\approx 280$, optimizes the error of trapezoid integration of smooth functions on the [0,1] interval, and results in the minimum error $\Delta I \approx 5\cdot 10^{-6}.$

4. These were not the questions asked in the exam, but two comments can be made.
Comment 1: OMP or CUDA could be very helpful in parallelizing trapezoidal integration in dp, but in single precision the optimum number of steps (fewer than $10^3$) favors only OMP, not CUDA (too little parallel work per kernel launch, essentially one block of threads only).
Comment 2: In dp (double prec., $\varepsilon \approx 10^{-15}$) we would use optimum step $h\approx 2.2\cdot 10^{-6}$ and get error $\Delta I \approx 2\cdot 10^{-12}$. So in dp we never get close to machine accuracy in the trapezoid method (the error of integration is 2000$\varepsilon$), but in sp we do relatively better: we got error $\sim 5\cdot 10^{-6}$, which is only $\sim$50 times the machine accuracy, although the error itself is of course larger than in dp.

Problem 3 [35 p.] Adding reverberation of Massey Hall to a home studio recording

You graduate from UofT and become the acoustics & DSP (digital signal processing) expert at Massey Hall, T.O. You measure that the concert hall has a reverberation time equal to $t_R = 2.25$ s. This means that the measured power of a very short-lasting sound (emitted at an intermediate acoustic frequency), subsequently bouncing off the walls and structures many times, decays exponentially in time, and after $t_R$ has dropped by 60 dB, i.e. to $10^{-6}$ of the initial power, or equivalently $10^{-3}$ of the initial amplitude, because power $\sim $ (amplitude)$^2$. Mathematically, the amplitude decays at $t>0$ as $$ K(t) = e^{-t/\tau}$$ with $\tau \approx 0.3257$ s, since $\exp(-2.25/0.3257) \simeq 10^{-3}$, as you can easily verify. Also notice that for a negative argument (here, $t<0$) K is not given by this formula; instead it equals zero.

Your task is to digitally add Massey Hall reverb to a CD-quality sound recorded in a non-reverb, padded-wall studio. Let's call that recording $f(t)$, where the acoustic pressure amplitude $f$ is sampled in dt = 0.02267 ms intervals (so there is a very large number N of sampling intervals, perhaps $10^8$!), at a standard 44.1 kHz frequency used on CDs. Mathematically, the operation of replacing each sharp sound with its dragged-out copy (having a shape described by K) is a convolution.

A. First, compute a discretized version of the convolution $$F(t) = \int f(t-x) K(x) \,dx = (f \star K)(t).$$ You may think of it as creating $N_K$ copies of $f(t)$, each shifted forward in time by $x$ (that is done by subtracting x from t in the argument of $f$), multiplying each copy by the decay function $K(x)$, and adding them together. $N_K$ is the number of samples in the reverb function; $K$ does not need to have as many as $N$ samples, because the reverberation decays to fairly small, barely perceptible levels in, say, one second ($N_K = 44$k samples). Shifted copies will stick out beyond the end time of the recording, i.e. beyond N samples. It would be OK to discard any output beyond N samples in our example below. But in general, you want to keep the reverb continuing after the original input stops. You'll have to reserve a little more space in memory for output than for input, by however many samples you think your reverb has non-negligible amplitude. 1 s, or 44100 more samples in the output, seems plenty enough to me. If you follow the above description, you can see that if there is a single signal of duration $dx=0.02267$ ms at position $t_1$ in $f(t)$, then it will correctly be replaced by a copy of itself, followed by an exponentially decaying tail of shape $K(t-t_1)$.

The issue with this direct convolution procedure is that its complexity scales as $O(2 N_K \,N)$. That's not a very nice scaling, because $N_K$ is very large. (We've encountered a similar but graver issue in the direct-summation gravity solver, where $N_K = N$ was even much larger, because the pattern of force extended throughout the physical space, while the reverb pattern is more compact in the time domain.) But write a direct convolution subroutine or procedure in Ftn or C(++) anyway, and remove any syntax/compile-time errors from it, if you manage. This will be a pseudocode that you won't actually execute, because there is no time for testing it, especially on real recordings. But take care of the complexity issue by trying to use OpenMP or CUDA, whichever you prefer. Multithreading/CUDA could alleviate the slow speed of direct convolution somewhat.

B. There are two much better methods than direct-summation convolution, thousands of times faster, in practice. One is FFT, which we will skip here (it allows for an arbitrary convolution kernel $K(x)$, of arbitrary length and shape). The other, almost miraculously, allows you to write a streamlined method for adding reverb, using only one pass through the input recording. That algorithm's arithmetic complexity is only about $O(2N)$. But it only works with exponential decay kernel $K(t)$ above.

Hint: Notice this special property of exponential decay. The ratio between its value and the value in the previous sample is a constant factor (smaller than one). We can write this factor as $K(t)/K(t-dt) = e^{-t/\tau} / e^{-(t-dt)/\tau} = e^{-dt/\tau}$. The reverb of all the previously encountered sound amplitude values (no matter how long ago the original sound first arrived at the microphone) is simply the previous sample of the output amplitude. It carries over to the given sound sample diminished by the constant factor. At the same time, new input arrives from the original recording, and is added without reduction to output.

Write a fast routine that converts any recording of N samples to N+44100 samples (in order not to lose the reverb of the last note or sound played by your favorite musician). You don't need to bother with actual reading of input or writing of output stream (we describe how to create it in a test below.) The N you will use for testing in your code will be N = 44100, so only 1 second of the original recording $f(t)$ will be processed. In fact, to be easily verifiable, create an empty array $f$ (N zero entries), and only at indices corresponding to times = 0.02267s, 0.25s, 0.30s, and 0.50s, enter values 1.0 (amplitude 1 lasting for one sample's duration). Compute on a calculator what output you expect at time 1s and compare with the printed output array value at that particular time/index.

SOLUTION

The non-optimal way of convolving $f$ with the asymmetric, half-exponential $K$ function requires two nested loops: say, the outer doing $N$ sideways shifts of the kernel, and the inner loop adding the $N_K$ kernel values to the output array (the amplitude-vs-time array of sound with reverb). The outer loop could be parallelized by OMP. Collapsing the 2 loops, there would be enough work for thousands of CUDA threads organized in blocks: 1 block would do the summations related to a fixed position of $K$, while different blocks would move the $K$ pattern around.
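A compilable C sketch of such a direct-convolution procedure might look as follows (my own illustration; the names f, K, F and the argument order are assumptions). Note that it parallelizes over output samples rather than over kernel shifts, so each thread writes its own F[i] and there are no write races:

```c
#include <string.h>

/* Direct O(N*NK) convolution of input f[0..N-1] with kernel K[0..NK-1],
   writing N+NK output samples to F. Each output sample is an independent
   sum, so the outer loop parallelizes trivially with OpenMP.
   Illustrative sketch, not one of the posted codes. */
void direct_reverb(const float *f, long N, const float *K, long NK, float *F) {
    memset(F, 0, (N + NK) * sizeof *F);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N + NK; i++) {        /* each output sample */
        double s = 0.0;
        long jlo = (i - N + 1 > 0) ? i - N + 1 : 0;   /* keep i-j in [0,N-1] */
        long jhi = (i < NK - 1) ? i : NK - 1;
        for (long j = jlo; j <= jhi; j++)      /* sum f[i-j] * K[j] */
            s += (double)f[i - j] * K[j];
        F[i] = (float)s;
    }
}
```

Had we parallelized the loop over kernel shifts instead, different threads would add into the same F entries and an atomic or reduction would be needed; summing per output sample avoids that.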

A much more clever way is the one that allows 'DSP on the run', where no large working array in memory is needed at all. There is a stream of numbers representing the studio recording, coming one-by-one to be processed (it may come from non-volatile mass memory, or directly from a sound source like an electronic instrument or microphone). Each new input sample gets added, undiminished, to a single variable keeping the current output amplitude, since that signal has just arrived at the microphone. The previous value of that variable, representing the sound still reverberating in the imaginary hall (you may keep it in a separate variable or just overwrite it), is first multiplied by the constant reduction factor $\exp (-dt/\tau)<1$ before the new input amplitude is added.
Then you iterate, after recording the current amplitude somewhere: in some array, or directly in an output file. The complexity is $2N$, because while visiting $N$ memory locations we do only 1 multiplication and 1 addition per input data point.

Your concrete example should give the following sum of reverbs of 4 spikes: exp((0.02267-1)/tau) + exp((0.25-1)/tau) + exp((0.3-1)/tau) + exp((0.5-1)/tau), where tau=0.3257. This evaluates to 0.481735.

Problem 4 [20 p.] Correct the algo

Since all of you can read a working, uncomplicated code in Fortran, I removed all the syntax and other compile and runtime errors originally introduced.

prob4_w_issue.f95

This is the program, and it resides also in the top directory of phyd57 user on art-1, for your convenience.

This program tests various ways, including fast, parallel ways, of computing a 2-d Laplacian operator that acts to smooth an array that may be an image. 4/5 of a pixel's value is taken away from the pixel and added to the neighboring 4 cells (north, south, east & west). The sum of coefficients is 1, preserving the total sum of values in the array.

Compile it on your machine or on art-1 in your subaccount like this
\$ pgf95d prob4_w_issue

It will compile. Run the program; it will run without errors:
art-1[187]:~\$ pgf95d prob4_w_issue
art-1[188]:~\$
art-1[188]:~\$ prob4_w_issue.x
result = 54.06276 CUDA kernel
result = 53.61340 host OMP slices
result = 52.78773 host OMP
result = 52.79328 host (CPU)

Look at the output (above). Read the text of the program. What does it do? Should the results of different subroutines/procedures executed by the main program be identical, almost identical, or different? Something is a bit wrong!

Describe in the solutions file what's wrong with the algorithm and what causes the differences. The issue is not in the programming details, like using different random numbers or different syntax. It is in the logic of what the algorithm does with the data, and it probably can be fixed in 15 minutes. For almost full credit, you can just find and describe the cause of the differences, but if you rename, fix, and recompile the program, that's even better! In that case, submit the new, fixed version with the last 4 digits of your student number in the filename.

SOLUTION

All the values obtained with different methods should be equal to machine accuracy, but are not. The trouble is the mistaken, premature overwriting of values in the grid array. We sometimes (especially in some parallel schemes) use values that we wrongly think are inputs, but they have already been updated. Threads (OMP or CUDA) live their own life; they don't pay attention to how we imagine things proceeding in a parallel code.

Introduce a second grid called grid2, and on odd steps (assume you start from an odd step) deposit the newly smoothed values from grid into grid2, while on even-numbered steps do the opposite: take numbers from grid2 and deposit the results in grid. That's all!
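In a C-flavored sketch (the actual fix belongs in the Fortran source; the function name, the pointer swap, and the frozen-boundary treatment here are my own assumptions), the ping-pong buffering looks like this:

```c
#include <string.h>

/* Laplacian smoothing with ping-pong buffers: read from src, write to dst,
   then swap roles, so no cell is ever updated from already-smoothed
   neighbors. Boundary cells are simply copied (left unsmoothed) here,
   for brevity. Illustrative sketch, not the posted Fortran program. */
void smooth(int n, int nsteps, float a[n][n], float b[n][n]) {
    float (*src)[n] = a, (*dst)[n] = b;
    for (int step = 0; step < nsteps; step++) {
        memcpy(dst, src, sizeof(float)*n*n);      /* carry over boundaries */
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                dst[i][j] = src[i][j]/5.0f        /* pixel keeps 1/5 ...   */
                          + (src[i-1][j] + src[i+1][j]
                           + src[i][j-1] + src[i][j+1])/5.0f; /* + donations */
        float (*t)[n] = src; src = dst; dst = t;  /* swap grid <-> grid2 */
    }
    if (src != a) memcpy(a, src, sizeof(float)*n*n);  /* result ends in a */
}
```

With this arrangement the serial, OMP, and CUDA variants all read the same pristine inputs in every step, so their results agree to machine accuracy.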

QUIZ [41 pt. worth 30% of final exam]

Here is the simple text file that you need to edit:   finalD57-Q-2024.txt .

It will also reside for your convenience in the top directory of phyd57 user on art-1.
Copy it to your subdirectory (or computer).
Change the name to include your four last digits of student number.
Edit & submit to Quercus > Assignments > Final 2024.

Now you can do Problem 4 in full, or relax. Thank you for your participation & effort. And happy holidays soon!

SOLUTION

Here is the final   quiz with solutions .