Caltech (California Institute of Technology) students strictly and proudly adhere to their Honor Code; UofT students should do the same. Copying code produced by "AI" tools or web browsing (i) is plagiarism in this course, (ii) is easily detectable, and (iii) usually does not actually solve the problems anyway. As an exception (not a recommendation), you may use internet tools to resolve syntax errors in your code if you get stuck on them (which we judge unlikely).
Initial conditions (position in meters and velocity of point mass in m/s) are: $$ \vec{r}_0 = (x,y,z)_0 = (-0.1, 0.2, 0.1)\; {\rm m}$$ $$ \vec{v}_0 = (v_x,v_y,v_z)_0 = (0, 0, -0.1)\; {\rm m/s}.$$
The choice of integration scheme is up to you. If you choose a leapfrog method,
remember to offset the positions and velocities by dt/2 (except when
measuring the energy $E$); take dt = $10^{-3}$ s.
If you use the Euler method, compensate for its lower accuracy
with a smaller dt = $10^{-4}$ s. If you use a more sophisticated algorithm
(not necessary!), choose an appropriate, larger timestep dt, making sure that
the relative energy error (ending vs. starting value) is $\ll 10^{-3}$.
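For concreteness, here is a minimal kick-drift-kick leapfrog sketch in Fortran; this form re-synchronizes $\vec r$ and $\vec v$ at the end of every step, so the energy $E$ can be measured without the dt/2 offset. The spring constant $k$, mass $m$, natural length $l_0$, and total time $T$ are not specified in this excerpt, so the values below are placeholders only:
\begin{verbatim}
! Hedged sketch of a kick-drift-kick leapfrog for the elastic pendulum.
! k, m, l0, g, T are NOT given in this excerpt -- placeholder values only.
program elastic_pendulum
  implicit none
  real(8), parameter :: k = 1.0d0, m = 0.1d0, l0 = 0.2d0  ! placeholders
  real(8), parameter :: g = 9.81d0, dt = 1.0d-3, T = 10.0d0
  real(8) :: r(3), v(3), l, E0, E
  integer :: i, nsteps
  r = (/ -0.1d0, 0.2d0, 0.1d0 /)     ! initial position (m)
  v = (/  0.0d0, 0.0d0, -0.1d0 /)    ! initial velocity (m/s)
  nsteps = nint(T/dt)
  E0 = energy(r, v)
  do i = 1, nsteps
     v = v + 0.5d0*dt*accel(r)       ! half kick
     r = r + dt*v                    ! drift
     v = v + 0.5d0*dt*accel(r)       ! half kick: r, v now synchronized
  end do
  l = sqrt(sum(r*r))
  E = energy(r, v)
  print *, nsteps, r, l, (E - E0)/E0
contains
  function accel(p) result(a)        ! spring force per unit mass + gravity
    real(8), intent(in) :: p(3)
    real(8) :: a(3), lp
    lp = sqrt(sum(p*p))
    a = -(k/m)*(lp - l0)*p/lp
    a(3) = a(3) - g
  end function accel
  function energy(p, w) result(Etot) ! specific mechanical energy
    real(8), intent(in) :: p(3), w(3)
    real(8) :: Etot, lp
    lp = sqrt(sum(p*p))
    Etot = 0.5d0*(sum(w*w) + (k/m)*(lp - l0)**2) + g*p(3)
  end function energy
end program elastic_pendulum
\end{verbatim}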
Analytically show that the specific mechanical energy (per unit mass of
the pendulum) is given by the formula $E = \frac{1}{2} [v^2 + \frac{k}{m}(l-l_0)^2]
+ g\,z$. Evaluate its relative error, i.e.
$\Delta E/E = [E(T)-E_0]/E_0$ at $t = T$, where $E_0 = E(t=0)$.
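For reference, the claimed formula is simply the total mechanical energy (kinetic + elastic + gravitational) divided by the mass $m$:
$$ E = \frac{1}{m}\left[\frac{1}{2} m v^2 + \frac{1}{2} k (l-l_0)^2 + m g z\right] = \frac{1}{2}\left[v^2 + \frac{k}{m}(l-l_0)^2\right] + g\,z.$$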
As output, print: the number of equal timesteps your algorithm
took, the final position $(x,y,z)$ and length $l$, as well as $\Delta E/E$.
[Owing to a misprint in the energy formula in the exam text,
energy conservation will not be checked.]
2. Put an outer loop around the integration, running three different simulations: first, the standard case specified above; second, a version with a 2 times larger dt, to verify the accuracy of the calculation; and third, a version with a slightly different value of $y_0$: $\vec{r}_0 = (x,y,z)_0 = (-0.1, 0.21, 0.1)\; {\rm m}$. Compare the results with the standard case and provide numerical and physical commentary. Is the elastic pendulum solution very sensitive to initial conditions? That (together with the non-periodicity of the solution, which you are not asked to explore) would make our elastic pendulum chaotic. Can the evolution be reliably computed over the time frame $T$?
3. Comment in one sentence on whether the calculation would benefit from OpenMP or CUDA, and why.
Next, write an approximate formula for the roundoff error, assuming it results from random accumulation (like a random walk) over N steps of calculation, each of which requires 6 additions and multiplications. Every arithmetic operation contributes $\varepsilon^2$ to the variance (the sum of squares of errors), where $\varepsilon=10^{-7}$ is the machine epsilon, the possible relative error in one arithmetic operation. Show that this error drops rather than grows with $h$ (with what power and front coefficient?) because the number of steps $N$ decreases with growing $h$, making the roundoff error smaller. Useful analogy: roundoff error growth is a random walk. How does the root-mean-square distance covered in a random walk depend on N?
Compare the two types of error and establish the optimum spatial discretization $h$ that guarantees the least combined truncation and roundoff error in trapezoidal integration of the considered smooth function on the [0,1] interval.
The total error on N subintervals of width $h$ (where, per the problem formulation, $N = 1/h$) is the sum of such expressions. Considering that all second derivatives are $\pm O(1)$, the global truncation error (in the absolute-value sense) equals $$ \Delta I_{\rm tr} = N\,|\langle \tilde{I} - I\rangle| = \frac{O(1)}{12}\, h^2 + O(h^3).$$ We can drop the $O(h^3)$ terms because $h^3 \ll h^2$. This proves that the trapezium method has second-order accuracy.
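For reference, the per-subinterval expression being summed here is the standard local truncation error of the trapezium rule on $[x, x+h]$,
$$ \tilde{I} - I = \frac{h^3}{12}\, f''(\xi), \qquad \xi \in (x, x+h),$$
so $N = 1/h$ such $O(h^3)$ terms add up to the $O(h^2)$ global error above.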
2. What about the roundoff error? We can arrange the computation so that in each subinterval one new value of the function $f$ is evaluated. Assume that this takes some number $n$ of arithmetic operations. In a random walk of $nN$ steps, each of size $\varepsilon$ (machine epsilon) on average, the root-mean-square distance reached (i.e., the roundoff error) has absolute magnitude $$\Delta I_{\varepsilon} = \sqrt{n N}\,\varepsilon = \varepsilon \sqrt{\frac{n}{h}}.$$ This proves that the roundoff error drops with growing $h$, in inverse proportion to its square root ($\propto 1/\sqrt{h}$), q.e.d.
3. The optimum discretization step $h$ is an intermediate value of $h$, for which neither the truncation nor the roundoff error is very large. The minimum absolute truncation+roundoff error requires $$\Delta I = \varepsilon \sqrt{\frac{n}{h}} + \frac{O(1)}{12}\, h^2 \to \min.$$ Calculus gives the condition for this minimum with respect to $h$: $-\varepsilon\sqrt{n} + \frac{O(1)}{3}\, h^{5/2} = 0$, or the optimum step $$ h \approx (3\varepsilon\sqrt{n})^{2/5}.$$ The minimum error is $\Delta I = 5\cdot 2^{-2}\cdot 3^{-1/5}\, \varepsilon^{4/5}\, n^{2/5}$. For example, taking $n = 6$ and $\varepsilon = 10^{-7}$, we get $h \approx (7.3\cdot 10^{-7})^{0.4} \approx 0.0035$, or $N \approx 280$, which optimizes the error of trapezoid integration of smooth functions on the [0,1] interval and results in the minimum error $\Delta I \approx 5\cdot 10^{-6}$.
4. These were not the questions asked in the exam, but two comments can be
made.
Comment 1: OpenMP or CUDA could
be very helpful in parallelizing the trapezoidal integration in dp,
but in single precision the optimum number of steps (fewer than $10^3$)
favors only OpenMP, not CUDA (there is too little work for the GPU:
essentially only one block of threads would be launched).
Comment 2: In dp (double precision, $\varepsilon \approx 10^{-15}$)
we would use the optimum step $h \approx 2.2\cdot
10^{-6}$ and get an error $\Delta I \approx 2\cdot 10^{-12}$.
So in dp we never get close to machine accuracy in the trapezoid method (the
integration error is $\sim 2000\,\varepsilon$), but relative to $\varepsilon$
sp does better: we got an error $\sim 5\cdot 10^{-6}$, which is only about
50 times the machine accuracy, although
the error itself is of course larger than in dp.
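As a quick numerical sanity check of both comments (a sketch, not part of the original exam), the two formulas derived above can be evaluated for the sp and dp machine epsilons:
\begin{verbatim}
! Sanity check (a sketch, not part of the original exam) of the optimum
! step h = (3*eps*sqrt(n))**(2/5) and the minimum combined error
! dI = (5/4) * 3**(-1/5) * (eps*sqrt(n))**(4/5), in sp and dp.
program opt_h_check
  implicit none
  real(8), parameter :: n = 6.0d0
  real(8) :: eps(2) = (/ 1.0d-7, 1.0d-15 /)
  real(8) :: h, dI
  integer :: i
  do i = 1, 2
     h  = (3.0d0*eps(i)*sqrt(n))**0.4d0
     dI = 1.25d0 * 3.0d0**(-0.2d0) * (eps(i)*sqrt(n))**0.8d0
     print *, 'eps =', eps(i), ' h_opt =', h, ' dI_min =', dI
  end do
end program opt_h_check
\end{verbatim}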
Your task is to digitally add Massey Hall reverb to a CD-quality sound recorded in a reverb-free, padded-wall studio. Let's call that recording $f(t)$, where the acoustic pressure amplitude $f$ is sampled at dt = 0.02267 ms intervals (so there is a very large number N of sampling intervals, perhaps $10^8$!), at the standard 44.1 kHz frequency used on CDs. Mathematically, the operation of replacing each sharp sound with a dragged-out copy of it (having a shape described by $K$) is a convolution.
A. First, compute a discretized version of the convolution $$F(t) = \int f(t-x)\, K(x) \,dx = f \star K.$$ You may think of it as creating $N_K$ copies of $f(t)$, each shifted forward in time by $x$ (which is done by subtracting $x$ from $t$ in the argument of $f$), multiplying each copy by the decay function $K(x)$, and adding them together. $N_K$ is the number of samples in the reverb function; $K$ does not need to have as many as $N$ samples, because the reverberation decays to fairly small, barely perceptible levels in, say, one second ($N_K = 44$k samples). Shifted copies will stick out beyond the end time of the recording, i.e. beyond $N$ samples. It would be OK to discard any output beyond $N$ samples in our example below, but in general you want to keep the reverb continuing after the original input stops. You will have to reserve a little more space in memory for the output than for the input, by however many samples you think your reverb has non-negligible amplitude; 1 s of time, or 44100 more samples in the output, seems plenty to me. If you follow the above description, you can see that if there is a single signal of duration $dx = 0.02267$ ms at position $t_1$ in $f(t)$, then it will correctly be replaced by a copy of itself, followed by an exponentially decaying tail of shape $K(t-t_1)$.
The issue with this direct convolution procedure is that its complexity scales as $O(2 N_K \,N)$. That is not a very nice scaling, because $N_K$ is very large. (We encountered a similar but graver issue in the direct-summation gravity solver, where the equivalent of $N_K$ was much larger still, $N_K = N$, because the pattern of the force extends throughout physical space, while the reverb pattern is compact in the time domain.) But write a direct convolution subroutine or procedure in Ftn or C(++) anyway, and remove any syntax/compile-time errors from it if you manage. This will be pseudocode that you won't actually execute, because there is no time for testing it, especially on real recordings. But take care of the complexity issue by trying to use OpenMP or CUDA, whichever you prefer; multithreading/CUDA can alleviate the slow speed of direct convolution somewhat.
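A minimal OpenMP sketch of such a direct convolution in Fortran follows; the array names, the assumption that $K$ is sampled at the same dt as $f$, and the output length of $N + N_K$ samples are illustrative choices, not part of the exam text:
\begin{verbatim}
! Hedged sketch of direct convolution with OpenMP (names are placeholders;
! f(1:n) is the dry input, K(1:nk) the sampled kernel, F_out holds the
! result including the nk-sample reverb tail past the end of the input).
subroutine direct_reverb(f, n, K, nk, F_out)
  implicit none
  integer, intent(in) :: n, nk
  real, intent(in)  :: f(n), K(nk)
  real, intent(out) :: F_out(n+nk)
  integer :: i, j
  F_out = 0.0
  !$omp parallel do private(j)
  do i = 1, n + nk                      ! each thread owns one output sample
     do j = max(1, i-n+1), min(nk, i)   ! only overlapping kernel samples
        F_out(i) = F_out(i) + f(i-j+1) * K(j)
     end do
  end do
  !$omp end parallel do
end subroutine direct_reverb
\end{verbatim}
Parallelizing over the output index $i$ means each thread writes to a distinct element of F_out, so no reduction or atomic update is needed.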
B. There are two much better methods than direct-summation convolution, in practice thousands of times faster. One is the FFT, which we will skip here (it allows for an arbitrary convolution kernel $K(x)$, of arbitrary length and shape). The other, almost miraculously, allows you to write a streamlined method for adding reverb using only one pass through the input recording. That algorithm's arithmetic complexity is only about $O(2N)$, but it only works with the exponential-decay kernel $K(t)$ above.
Hint: Notice this special property of exponential decay: the ratio between its value and the value at the previous sample is a constant factor (smaller than one). We can write this factor as $K(t)/K(t-dt) = e^{-t/\tau} / e^{-(t-dt)/\tau} = e^{-dt/\tau}$. The reverb of all the previously encountered sound amplitudes (no matter how long ago the original sound first arrived at the microphone) is simply the previous sample of the output amplitude; it carries over to the given sample diminished by this constant factor. At the same time, new input arrives from the original recording and is added without reduction to the output.
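In discrete form (a worked restatement of the hint, with $f_i$ the input and $F_i$ the output at sample $i$), this property collapses the whole convolution into a one-pole recurrence:
$$ F_i = f_i + q\,F_{i-1}, \qquad q = e^{-dt/\tau} < 1, \qquad F_0 = 0,$$
costing one multiplication and one addition per sample, hence the $O(2N)$ complexity quoted above.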
Write a fast routine that converts any recording of $N$ samples into $N + 44100$ samples (in order not to lose the reverb of the last note or sound played by your favorite musician). You don't need to bother with actual reading of an input or writing of an output stream (we describe how to create a test input below). The $N$ you will use for testing in your code will be $N = 44100$, so only 1 second of the original recording $f(t)$ will be processed. In fact, to be easily verifiable, create an empty array $f$ ($N$ zero entries), and only at the indices corresponding to times 0.02267 s, 0.25 s, 0.30 s, and 0.50 s enter the value 1.0 (amplitude 1 lasting for one sample's duration). Compute on a calculator what output you expect at time 1 s and compare with the printed output-array value at that particular time/index.
A much more clever way is the one that allows 'DSP on the run', where
no large working array in memory is needed at all.
There is a stream of numbers representing the studio recording, coming one-by-one
to be processed (it may come from non-volatile mass storage, or directly
from a sound source such as an electronic instrument or a microphone).
Each new input value is added undiminished to the single variable keeping
the current output amplitude, since that signal has just arrived at the microphone.
The previous output value, representing the sound still reverberating in the imaginary hall
(which you may keep in some separate variable or simply overwrite),
is first multiplied by the constant reduction factor $\exp (-dt/\tau) < 1$,
and only then is the new input amplitude added.
Then you iterate, after recording the current amplitude somewhere
in an array, or directly in an output file.
The complexity is $2N$, because, visiting $N$ memory locations, we
do only one multiplication and one addition per input data point.
Your concrete example should give the following sum of the reverbs of the 4 spikes: $$ e^{(0.02267-1)/\tau} + e^{(0.25-1)/\tau} + e^{(0.3-1)/\tau} + e^{(0.5-1)/\tau}, \qquad \tau = 0.3257\ {\rm s},$$ which evaluates to 0.481735.
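Here is a minimal, self-contained sketch of the fast recursive method together with the 4-spike test above. The time-to-index mapping (index = nint(t/dt) + 1) and the use of full arrays (rather than a true sample-by-sample stream) are simplifying assumptions:
\begin{verbatim}
! Sketch of the fast O(2N) recursive reverb with the 4-spike test.
! Assumption: sample index = nint(t/dt) + 1; arrays are used here for
! easy verification, though the method itself needs no big work array.
program fast_reverb_test
  implicit none
  integer, parameter :: n = 44100, ntail = 44100
  real, parameter :: dt = 1.0/44100.0, tau = 0.3257
  real :: f(n+ntail), F_out(n+ntail), q
  integer :: i
  f = 0.0                              ! empty (silent) recording ...
  f(nint(0.02267/dt) + 1) = 1.0        ! ... plus four unit spikes
  f(nint(0.25/dt) + 1) = 1.0
  f(nint(0.30/dt) + 1) = 1.0
  f(nint(0.50/dt) + 1) = 1.0
  q = exp(-dt/tau)                     ! constant per-sample decay factor
  F_out(1) = f(1)
  do i = 2, n + ntail                  ! one multiply + one add per sample
     F_out(i) = f(i) + q*F_out(i-1)
  end do
  print *, 'output at t = 1 s:', F_out(nint(1.0/dt) + 1)  ! expect ~0.4817
end program fast_reverb_test
\end{verbatim}
Run as-is, it should print a value close to the 0.481735 quoted above (single-precision roundoff and index rounding permitting).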
Since all of you can read working, uncomplicated Fortran code, I have removed all the syntax and other compile-time and runtime errors that were originally introduced.
This is the program; for your convenience, it also resides in the top directory of the phyd57 user on art-1.
This program tests various ways, including fast, parallel ways, of computing a 2-d Laplacian operator that acts to smooth an array, which may be an image. 4/5 of a pixel's value is taken away from the pixel and added to the neighboring 4 cells (north, south, east & west). The sum of the coefficients is 1, preserving the total sum of the values in the array.
Compile it on your machine or on art-1 in your subaccount like this
\$ pgf95d prob4_w_issue
It will compile. Run the program. It will run correctly:
art-1[187]:~\$ pgf95d prob4_w_issue
art-1[188]:~\$ prob4_w_issue.x
prob4_w_issue.x
result = 54.06276 CUDA kernel
result = 53.61340 host OMP slices
result = 52.78773 host OMP
result = 52.79328 host (CPU)
Look at the output (above). Read the text of the program. What does it do? Should the results of the different subroutines/procedures executed by the main program be identical, almost identical, or different? Something is a bit wrong!
Describe in the solutions file what's wrong with the algorithm and what causes the differences. The issue is not in programming details, such as using different random numbers or different syntax; it is in the logic of what the algorithm does with the data, and it can probably be fixed in 15 minutes. For almost full credit, you can just find and describe the cause of the differences, but if you rename, fix, and recompile the program, that's even better! In that case, submit the new, fixed version with the last 4 digits of your student number in the filename.
Introduce a second grid called grid2, and in odd steps (assume you start from an odd step) deposit the newly smoothed values from grid into grid2, while in even-numbered steps do the opposite: take the numbers from grid2 and deposit the results in grid. That's all! A sketch follows below.
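A minimal sketch of that two-grid (ping-pong) update; it is not the original program, and the grid dimensions and boundary handling (interior points only, with grid2 pre-copied so edge cells stay consistent) are assumptions:
\begin{verbatim}
! Minimal sketch (not the original program) of the two-grid fix: odd
! steps read grid and write grid2, even steps do the reverse, so no
! freshly smoothed value contaminates its neighbors within one step.
subroutine smooth_pingpong(grid, grid2, nx, ny, nsteps)
  implicit none
  integer, intent(in) :: nx, ny, nsteps
  real, intent(inout) :: grid(nx,ny), grid2(nx,ny)
  integer :: i, j, istep
  grid2 = grid                         ! keep untouched edge cells consistent
  do istep = 1, nsteps
     if (mod(istep, 2) == 1) then      ! odd step: grid -> grid2
        do j = 2, ny-1
           do i = 2, nx-1
              grid2(i,j) = 0.2*(grid(i,j) + grid(i-1,j) + grid(i+1,j) &
                               + grid(i,j-1) + grid(i,j+1))
           end do
        end do
     else                              ! even step: grid2 -> grid
        do j = 2, ny-1
           do i = 2, nx-1
              grid(i,j) = 0.2*(grid2(i,j) + grid2(i-1,j) + grid2(i+1,j) &
                              + grid2(i,j-1) + grid2(i,j+1))
           end do
        end do
     end if
  end do
end subroutine smooth_pingpong
\end{verbatim}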
The program will also reside, for your convenience, in the top directory of the phyd57 user on art-1.
Copy it to your subdirectory (or computer).
Change the name to include the last four digits of your student number.
Edit & submit to Quercus > Assignments > Final 2024.
Now you can do Problem 4 in full, or relax. Thank you for your participation & effort. And happy holidays soon!