This may seem a peculiar statement, because the samples showed very different pH values. And yet… let’s imagine that the water is simply not completely homogeneous, with only very local differences. Taking three samples in a row, we already noticed differences. The pH meter also has a random error of probably 0.03 between different measurements (in a calibration buffer the values are quite stable), and all this together could lead to (small) fluctuations in the observed values.
To test this idea (it’s not a real hypothesis, but more of a thought experiment), we could take all measurements on fresh samples (ignoring the secondary values for the days after) and put them all together. It doesn’t matter whether they came from different samples (in triplicate) or from repeated measurements of the same sample (in duplicate), and it’s not even relevant at which time of day we took the samples.
In total we got 72 primary values and we won’t use the correction for the drift of the meter (observed during the first measurements) right now.
We can calculate a 99% Confidence Interval (99% CI) for the population, that is, for all samples still to be taken. Using a Z-value of 2.576, the lower and upper boundaries of the 99% CI are the mean minus and plus 2.576 times the standard deviation of the sampled values. We can expect 99% of our future samples to fall within this bandwidth (but beware, this only applies if the distribution is Gaussian).
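As a rough illustration of that bandwidth calculation, here is a minimal Python sketch. The function name `bandwidth_99` and the list `example` are my own; the readings are made-up placeholders, not the 72 measured values, and the sketch assumes the sample standard deviation (as Excel's STDEV would give).

```python
import statistics

Z_99 = 2.576  # two-sided 99% Z-value, as used in the text


def bandwidth_99(values):
    """Return (lower, upper) = mean -/+ 2.576 * sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator (assumption)
    return mean - Z_99 * sd, mean + Z_99 * sd


# Made-up pH readings for illustration only, NOT the measured data
example = [8.05, 8.10, 8.16, 8.16, 8.20, 8.31, 8.45, 8.90]
lower, upper = bandwidth_99(example)
print(f"99% bandwidth: {lower:.2f} to {upper:.2f}")
```

The bandwidth is symmetric around the mean by construction, which is exactly why it only makes sense for a roughly Gaussian distribution.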
Dividing the previous standard deviation by the square root of the number of samples minus 1 (that would be sqrt(71) = 8.43) gives the standard error of the mean. Taking the mean minus and plus 2.576 times this value, we obtain a 99% Confidence Interval for the mean.
If we calculated the mean (taking the average in Excel) for a reasonable set of future samples, we would get a value between those boundaries, at least with a certainty of 99%. Once in a while it could be outside, but that should be very rare.
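The interval for the mean can be sketched in the same way. Which kind of standard deviation is divided by sqrt(n - 1) is not stated here, so the sketch below rests on an assumption: the population standard deviation divided by sqrt(n - 1) is mathematically the same as the sample standard deviation divided by sqrt(n), and the code computes both forms to show that. The function names and the `example` readings are hypothetical.

```python
import math
import statistics

Z_99 = 2.576  # two-sided 99% Z-value, as used in the text


def sem_sample_form(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def sem_population_form(values):
    """Same quantity written as population SD divided by sqrt(n - 1)."""
    return statistics.pstdev(values) / math.sqrt(len(values) - 1)


def mean_ci_99(values):
    """Return (lower, upper) of the 99% interval for the mean."""
    mean = statistics.mean(values)
    sem = sem_sample_form(values)
    return mean - Z_99 * sem, mean + Z_99 * sem


# Made-up pH readings for illustration only, NOT the measured data
example = [8.05, 8.10, 8.16, 8.16, 8.20, 8.31, 8.45, 8.90]
ci_lower, ci_upper = mean_ci_99(example)
print(f"99% CI for the mean: {ci_lower:.2f} to {ci_upper:.2f}")
```

With 72 values the difference between dividing by sqrt(71) and sqrt(72) is small either way, but it helps to know which standard deviation went in.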
Determining the median: after sorting all values, we take the value halfway between #36 and #37, but both are 8.16, so that’s easy. Obviously the median is lower than the mean, so the distribution is skewed, we have outliers, or both. The mean is very sensitive to outliers, but the median is not. The sorted values can also be visualised in a graph.
Then we can clearly see the outliers to the right. Those are the values we corrected, because the values measured for the calibration buffers were too high. Another way to visualise the distribution is the creation of classes.
Now we can see that the shape is not a nice Gaussian bell curve at all, not even when the outliers are taken out. But… the bias could of course be caused by my selection. The Weerwater has many more values than e.g. the Markermeer!
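For an even number of values, the median procedure described above comes down to averaging the two middle values of the sorted list. A small Python sketch, with a hypothetical helper `median_by_sorting` and made-up readings (not the measured data), could look like this:

```python
import statistics


def median_by_sorting(values):
    """Sort the values and take the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # For n = 72 these are the 36th and 37th values (indices 35 and 36)
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


# Made-up pH readings with one high value, for illustration only
example = [8.05, 8.10, 8.16, 8.16, 8.20, 8.31, 8.45, 8.90]
print("median:", median_by_sorting(example))
print("mean:  ", statistics.mean(example))
```

In this made-up set the single high reading pulls the mean above the median, which is the same pattern as described for the real measurements.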
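The creation of classes can also be done outside a spreadsheet. The sketch below counts values per class; the class width of 0.1 pH unit, the function name `count_classes` and the readings are all assumptions chosen for illustration, not the classes or data actually used here.

```python
import math
from collections import Counter


def count_classes(values, width=0.1):
    """Count values per class; each class is labelled by its lower edge."""
    counts = Counter()
    for v in values:
        # round() guards against floating-point noise such as 8.1 / 0.1 = 80.999...
        index = math.floor(round(v / width, 9))
        counts[round(index * width, 2)] += 1
    return dict(sorted(counts.items()))


# Made-up pH readings for illustration only, NOT the measured data
example = [8.05, 8.10, 8.16, 8.16, 8.20, 8.31, 8.45, 8.90]
classes = count_classes(example)
for edge, n in classes.items():
    print(f"{edge:.1f} {'#' * n}")  # crude text histogram
```

A long tail of sparsely filled classes on the right-hand side is what a right-skewed distribution looks like in such a listing.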
What we should do is sample those high pH lakes (especially Markermeer and IJmeer) again and see whether they are really more alkaline* than expected.
To be honest, taking them out won’t change a lot, and by now it seems that the pH is nearly always 8 or more. On the other hand, as already mentioned, there is a strong connection with the IJssel, and because the IJssel branches from the Rhine, we should consider the South or the East of the Netherlands as well. Let’s see where we get.
* Alkaline (and alkalinity) is now often defined as “resistance of the pH to acid”, effectively being buffering power. However, a long time ago when I was doing the research for my Master’s Degree in biochemistry, we used the word as the opposite of acidity. The word “basicity” to indicate a high pH was not used at all.