# What if … all lakes were actually at the same pH?

This may seem a peculiar statement, because the samples showed very different pH values. And yet… let’s imagine that the water is simply not completely homogeneous, with only very local differences. Taking three samples in a row, we already noticed such differences. The pH meter also has a random error of probably 0.03 pH units between different measurements (in a calibration buffer the values are quite stable), and all this together could lead to (small) fluctuations in the observed values.

To test this idea (it’s not a real hypothesis, but more of a thought experiment), we could take all measurements on fresh samples (ignoring the secondary values for the days after) and pool them. It doesn’t matter whether they came from different samples (triplicate) or from repeated measurements of the same sample (duplicate), and it is not even relevant at which time of day we took the samples.

In total we got 72 primary values and we won’t use the correction for the drift of the meter (observed during the first measurements) right now.

We can calculate a 99% Confidence Interval (99% CI) for the population (all samples that could be taken) using a Z-value of 2.576: the mean minus and plus 2.576 times the standard deviation of the sampled values gives the lower and upper boundary of the 99% CI. We can expect 99% of our future samples to be within this bandwidth (but beware, this only applies if the distribution is Gaussian).
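The calculation above can be sketched in a few lines of Python. This is only an illustration: the list `ph_values` holds made-up readings (not the 72 measured values), the function name `population_interval` is hypothetical, and the sample standard deviation (with n − 1 in the denominator) is an assumption about which standard deviation is meant.

```python
import statistics

# Hypothetical pH readings for illustration only,
# NOT the 72 primary values from the measurements.
ph_values = [8.10, 8.16, 8.21, 8.05, 8.30, 8.16, 8.12, 8.25]

Z_99 = 2.576  # two-sided 99% Z-value, valid for a Gaussian distribution


def population_interval(values, z=Z_99):
    """Return (lower, upper) as mean minus/plus z times the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return mean - z * sd, mean + z * sd


lower, upper = population_interval(ph_values)
print(f"99% of future samples expected between {lower:.2f} and {upper:.2f}")
```

With the real data, `ph_values` would simply be replaced by the 72 primary values.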

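Dividing the previous standard deviation by the square root of the number of samples minus 1 (that would be sqrt(71) = 8.43) gives the standard error of the mean, from which we can calculate a confidence interval for the mean in the same way. (The more common convention divides by sqrt(72) ≈ 8.49, which hardly changes the result here.)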

If we calculated the mean (taking the average in Excel) for a reasonably sized set of future samples, we would get a value between those boundaries, at least with a certainty of 99%. Once in a while it could fall outside, but that should be very rare.
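As a rough sketch, the confidence interval for the mean could be computed like this. The readings in `ph_values` are again made up and the name `mean_confidence_interval` is hypothetical. Note one deliberate difference from the text: the sketch uses the common convention of dividing the sample standard deviation by sqrt(n), whereas the text divides by sqrt(n − 1); for 72 values the two give nearly the same result.

```python
import math
import statistics

# Hypothetical pH readings for illustration only,
# NOT the 72 primary values from the measurements.
ph_values = [8.10, 8.16, 8.21, 8.05, 8.30, 8.16, 8.12, 8.25]

Z_99 = 2.576  # two-sided 99% Z-value


def mean_confidence_interval(values, z=Z_99):
    """Return (lower, upper) of the confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean; assumption: sample SD divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - z * sem, mean + z * sem


lower, upper = mean_confidence_interval(ph_values)
print(f"99% CI for the mean: {lower:.3f} to {upper:.3f}")
```

Because the standard deviation is divided by the square root of the sample count, this interval is much narrower than the one for individual samples.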

Determining the median: after sorting all values, we take the value between #36 and #37, but both are 8.16, so that’s easy. The median is lower than the mean, so the distribution is skewed, we have outliers, or both. The mean is very sensitive to outliers, but the median is not. The sorted values can also be visualised in a graph.
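The sorting approach to the median, and the effect of an outlier on the mean, can be illustrated with a small Python sketch. The values below are invented for the example (including one deliberately high reading) and `median_by_sorting` is a hypothetical helper name.

```python
import statistics

# Hypothetical pH readings with one deliberately high outlier,
# NOT the 72 primary values from the measurements.
ph_values = [8.16, 8.05, 8.25, 8.10, 9.40, 8.16, 8.21, 8.12]


def median_by_sorting(values):
    """Median via sorting: the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 0:
        # For n = 72 these are the values at positions #36 and #37 (1-based).
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return ordered[n // 2]


median = median_by_sorting(ph_values)
mean = statistics.mean(ph_values)
print(f"median = {median:.2f}, mean = {mean:.2f}")
```

In this made-up set the single outlier pulls the mean above the median, while the median stays at the middle of the sorted values.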