Astronomical Techniques - Noise and Signal

Sources of noise in data-taking: We may have to contend with purely instrumental noise sources - grain statistics in photography, amplifier noise in photomultipliers or CCDs, thermal noise in cooled detectors, etc. The most fundamental limit is that set by photon noise. Photons arrive randomly, with a certain mean rate set by the radiation intensity and wavelength. Always keep in mind that the photon noise may include a contribution from sources that are a nuisance, such as airglow or scattered sun- or moonlight, and that the numbers of photons in calculations must include this "background". Also note that a form of noise can be introduced in digitizing a signal (roundoff or truncation) or in further processing, for example in use of calibration data less accurate than the desired results.
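As a rough illustration of why the background must be counted (all numbers here are invented), a sky-limited detection might look like this:

```python
import numpy as np

# Illustrative count rates only -- not from any real instrument
source_rate = 50.0    # photons/s from the object in the measurement aperture
sky_rate = 200.0      # photons/s from airglow + scattered light in the same aperture
t = 300.0             # exposure time, seconds

N_src = source_rate * t
N_sky = sky_rate * t

# The signal is the sky-subtracted source count, but the photon noise comes
# from everything that was actually detected, source plus sky.
# (This sketch ignores the extra noise from estimating the sky level itself.)
snr = N_src / np.sqrt(N_src + N_sky)
print(f"S/N = {snr:.1f}")
```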

It can be important to watch the sources of error for a given measurement. In succeeding steps, errors may accumulate. Statistically independent errors add in quadrature. To track this, you may need to carry an error term through all steps of a calculation, or simply add all the relative errors in quadrature. Doing this before carrying out a series of measurements can be very instructive.
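A minimal sketch of tracking this, with invented relative errors for a few calibration steps:

```python
import numpy as np

# Hypothetical relative errors from successive steps (flat field, extinction,
# standard-star calibration, aperture correction) -- illustrative values only
relative_errors = np.array([0.02, 0.05, 0.01, 0.03])

# Statistically independent errors add in quadrature
total = np.sqrt(np.sum(relative_errors**2))
print(f"combined relative error = {total:.3f}")   # about 0.06, dominated by the 5% step
```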

Poisson processes

Consider a process (such as detection of a photon in all but the highest radiation densities) with constant probability per unit time; that is, the probability of detecting a photon in dt is λdt for dt infinitesimal, and the expected number detected in finite time interval ΔT is then λ ΔT. What is the expected distribution of the number actually detected in various statistically independent time intervals, or experimental trials?

This was worked out by Poisson in 1837, and later by Gosset (writing as "Student" because of employment restrictions at the Guinness Brewery). If we define W_n(t) as the probability of seeing exactly n events in a time interval t, we have

To end up with n+1 events at time t+dt, we can combine

Probability of n+1 events in time t = W_{n+1}(t)   with   probability of 0 events in dt = 1 - λ dt
Probability of n events in time t = W_n(t)         with   probability of 1 event in dt = λ dt
Probability of n-1 events in time t = W_{n-1}(t)   with   probability of 2 events in dt = (λ dt)²
which last is of negligible importance by construction of differentials. This gives a recursion relation W_{n+1}(t+dt) = (1 - λ dt) W_{n+1}(t) + W_n(t) λ dt and a resulting differential equation d W_{n+1}(t)/dt = λ [W_n(t) - W_{n+1}(t)] for each n. With the boundary conditions W_0(0) = 1 and W_n(0) = 0 for n ≥ 1, the family of equations has the general solution W_{n+1}(t) = (λt)^{n+1} e^{-λt}/(n+1)!. If we let M be the average number of events in time t (which is to say λt, usually the quantity we want to measure), then we have W_n(M) = M^n e^{-M}/n! as the distribution of probabilities for a Poisson process. Note that this is properly normalized (i.e. the probability of measuring one or another value of n is unity). The mean value is Σ n W_n(M) = M and the standard deviation is given by
σ² = Σ (n - M)² W_n(M) = Σ [n² M^n e^{-M}/n! - 2Mn M^n e^{-M}/n! + M² W_n(M)] = M
or σ = M^{1/2} = √M (so-called root-N statistics or errors). (I am indebted to George Blumenthal for the form of this derivation.)
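A quick numerical check of the root-N behavior, simulating many exposures that each expect M = 100 photons (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

M = 100.0                                  # expected counts per exposure
counts = rng.poisson(M, size=100_000)      # many independent "exposures"

print(counts.mean())   # close to M
print(counts.std())    # close to sqrt(M) = 10
```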

For large M, the probability distribution approaches the normal or Gaussian distribution

W_n(M) = [1/((2π)^{1/2} σ)] e^{-(M-n)²/2σ²}.
For this distribution, the standard deviation σ has immediately useful meaning:
P(n > M+ σ) = 0.16
P(n > M+2 σ) = 0.023
P(n > M+3 σ) = 0.0013
(likewise for the distribution below M.)
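These tail probabilities are easy to verify numerically; a sketch using scipy's normal distribution:

```python
from scipy.stats import norm

# Probability that a Gaussian variable falls more than k sigma above the mean
for k in (1, 2, 3):
    print(k, norm.sf(k))   # 0.159, 0.0228, 0.00135
```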

Gaussians have many useful properties. For example, the convolution of two Gaussians of different widths is another Gaussian with σ² = σ_1² + σ_2². The Fourier transform of a Gaussian is another Gaussian. Instrumental profiles are frequently Gaussian enough that this analytically useful form is used as a fitting function with or without explicit motivation.
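The quadrature rule for convolved widths can be checked numerically; a sketch with two arbitrary widths:

```python
import numpy as np

x = np.linspace(-50, 50, 1001)
sigma_1, sigma_2 = 2.0, 3.0

g1 = np.exp(-x**2 / (2 * sigma_1**2))
g2 = np.exp(-x**2 / (2 * sigma_2**2))

conv = np.convolve(g1, g2, mode="same")
conv /= conv.sum()

# Width of the result from its second moment about the mean
mean = np.sum(x * conv)
sigma_conv = np.sqrt(np.sum((x - mean)**2 * conv))

print(sigma_conv, np.hypot(sigma_1, sigma_2))   # both ~3.61
```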

The transition from small-number Poisson statistics to a Gaussian is one application of the central limit theorem. This states that for an arbitrary distribution f(x) (with finite variance), if one repeatedly measures the mean of f from sufficiently large numbers of trials, the measured means will themselves obey a normal distribution. This enters almost intuitively into experimental and observational design.
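A toy demonstration: start from a decidedly non-Gaussian distribution (uniform) and look at how the sample means behave.

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 "experiments", each measuring the mean of 50 uniform deviates
sample_means = rng.uniform(0, 1, size=(100_000, 50)).mean(axis=1)

# The means cluster around 0.5 with scatter (1/sqrt(12))/sqrt(50) ~ 0.041,
# and a histogram of sample_means looks very nearly Gaussian
print(sample_means.mean(), sample_means.std())
```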



Statistics and Data Assessment

There are several commonly used statistics for a distribution, expressing either some measure of the zero point or of the width. These include the mean, median, and mode for the zero point, and various moments for the width (and finer details of the shape). The mean (or average) is defined by ∫ x f(x) dx, where f(x) is the normalized distribution. The mean may be strongly affected by (possibly spurious) outlying values, so that for some applications the median or mode may be preferable. As an example, say we want to know the sky brightness in some image, where the mean may be contaminated by faint stars, cosmic-ray events, etc. Here, the mode will be a better estimate of what we're after, though still not perfect - to do it right, one would need to understand the distributions of the contaminating processes. The more you know a priori, the better you can do...
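A toy version of the sky-level problem (all numbers invented; the "mode" here is simply the peak of a coarse histogram):

```python
import numpy as np

rng = np.random.default_rng(7)

# Fake "sky" pixels: Gaussian sky noise around a true level of 1000 counts,
# with a few percent of pixels contaminated by faint stars and cosmic rays
sky = rng.normal(1000.0, 15.0, size=10_000)
hit = rng.integers(0, sky.size, size=300)
sky[hit] += rng.exponential(200.0, size=300)

print("mean  :", sky.mean())         # pulled upward by the contamination
print("median:", np.median(sky))     # much closer to the true 1000

# A crude mode estimate: the peak of a histogram around the sky level
hist, edges = np.histogram(sky, bins=100, range=(950.0, 1050.0))
print("mode  :", 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1]))
```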

We are often interested in what we can learn from data sets via statistical tests. As examples, what is the average luminosity of RR Lyrae stars, or the average X-ray-to-visual ratio for solar-type stars? Different kinds of tests are in order, depending on whether there is a physically motivated model to test or parameter to measure. Sometimes, we can only ask whether a data set is consistent with another one or not, or whether two kinds of objects are the same or different. These cases reflect the distinction between parametric and nonparametric statistics. Further broadening the field, one must sometimes deal with data sets (such as surveys) with a fixed and high detection threshold, where one wishes to derive a mean flux for a set of objects whose positions are known but none of which is detected individually.

Parametric statistics may include model fitting and parameter estimation. Linear regression is one example. Other commonly used techniques include least-squares function fitting and χ² minimization.

Least-squares fitting, in its linear variety, is used when there is reason to believe that the data represent points of the form y_i = Σ a_n f_n(x_i). A special case is fitting a line, y_i = a_0 + a_1 x_i. The usual derivation takes all the scatter in the data as being along the y axis; more complicated variants exist for the case of substantial errors in both variables. The goodness-of-fit criterion here is that the best fit is defined to be the one for which the sum Σ (y'_i - y_i)² is minimized, where y'_i is the value predicted for a given set of coefficients a_n. It can be shown that minimizing the squared residuals from a given functional fit yields a set of normal equations from the given data, of the following form:

a_1 Σ f_1(x_i)² + a_2 Σ f_1(x_i) f_2(x_i) + a_3 Σ f_1(x_i) f_3(x_i) + ... = Σ f_1(x_i) y_i
(and similarly with f_2, f_3, ... in place of f_1), forming a set of n equations in n unknowns which can be solved by appropriate methods from linear algebra. The functions f_n need not be analytic; I have used this to ask what kinds of stars are needed to reproduce a galaxy spectrum, where the f_n were simply the spectra of stars and stellar groupings tabulated at 4000 points. One book which I have found to spell this out in some detail is Scientific Data Analysis: an Introduction to Overdetermined Systems by R.L. Branham, Jr. (Springer-Verlag, 1990).
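A minimal sketch of this machinery for the straight-line case, with synthetic data; the same code works for any set of tabulated basis functions f_n:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: y = 2 + 0.5 x plus Gaussian scatter in y
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, x.size)

# Design matrix: one column per basis function f_n(x_i); here f_1 = 1, f_2 = x,
# but the columns could just as well be tabulated spectra or other templates
F = np.column_stack([np.ones_like(x), x])

# Normal equations: (F^T F) a = F^T y
a = np.linalg.solve(F.T @ F, F.T @ y)
print(a)                                     # close to [2.0, 0.5]

# Equivalent, and better conditioned for large problems:
a_lstsq, *_ = np.linalg.lstsq(F, y, rcond=None)
```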

Another widely useful fitting technique, especially important for data sets where the errors may not be normally distributed or may differ widely among data points, is χ² fitting. Here, one finds the values of the parameters of some fitting function that minimize the quantity χ² = Σ (y'_i - y_i)²/σ_i², for which it is important that one have an accurate assessment of each σ_i (be honest, now!). Not only does this give parameter estimates, it also tells how well they represent the data. For a good fit, the reduced χ² (which is the value above divided by n-p, the number of data values minus the number of parameters determined in the fit) should be close to 1. This approach can also be used to tell whether specific components of a proposed model are significant in a statistical sense. Mapping χ² as a function of the parameters allows study of the allowed parameter space and any correlation of errors in their estimates. There can be subtleties in application when the fitting function has most of its weight in a small part of the data space, or for small-number statistics (n < 5 or so gives biased results). These pitfalls, and limits on what reduced χ² means for complex fitting functions, are discussed in this paper.
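A sketch of χ² fitting with scipy, using synthetic data with known per-point σ; when the uncertainties are supplied, curve_fit minimizes exactly the quantity above:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)

def model(x, a, b):
    return a * np.exp(-x / b)

# Synthetic data with honest per-point uncertainties sigma_i
x = np.linspace(0.0, 5.0, 30)
sigma = 0.05 + 0.02 * rng.random(x.size)
y = model(x, 3.0, 1.5) + rng.normal(0.0, sigma)

popt, pcov = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True, p0=(1.0, 1.0))

chi2 = np.sum(((y - model(x, *popt)) / sigma) ** 2)
chi2_red = chi2 / (x.size - len(popt))          # reduced chi^2: should be near 1
print(popt, np.sqrt(np.diag(pcov)), chi2_red)
```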

Often, either the formulation of a problem or the nature of the data does not admit a parametric approach. This happens when we want to ask, "Are these two samples drawn from the same parent population?", or when there are upper or lower limits as well as genuine measurements. In the first case, one may use comparisons of ranked lists or the Kolmogorov-Smirnov test, which compares the cumulative distribution functions of two samples, or of a sample and a model. The maximum difference between the CDFs is the test statistic; its significance depends on the sample size. In the second case, there is a body of approaches known as survival analysis; Feigelson and collaborators have brought this work to astronomers' attention, and produced Fortran routines for its implementation. See, for example, Isobe et al. 1990, ApJ 364, 104; Babu and Feigelson 1992, Commun. Statist. Comput. Simul. 22, 533; and Feigelson and Babu 1992, ApJ 397, 55.
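A two-sample K-S comparison is one line with scipy (the samples here are artificial):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)

# Two artificial samples whose parent distributions differ slightly in the mean
sample_a = rng.normal(0.0, 1.0, 200)
sample_b = rng.normal(0.3, 1.0, 200)

stat, p_value = ks_2samp(sample_a, sample_b)
# stat is the maximum difference between the two empirical CDFs; p_value is the
# chance of a difference at least that large if the parents were identical
print(stat, p_value)
```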

Statistical sleaze - when do you reject points, and when do you "accept" a hypothesis?

In contrast to the more usually encountered frequentist approach to statistical analysis, which deals in essence with values that would be obtained from many runs of an experiment or increases in observed sample size, there is often a gain from considering the Bayesian approach, which centers on how evidence affects our degree of confidence in a hypothesis (often itself based on previous evidence). Here, the probability of a hypothesis (theory T) given a particular set of data (evidence E) is evaluated using Bayes' theorem:

P(T|E) = P(E|T) P(T) / P(E)
As noted in, for example, Statistics in Theory and Practice (R. Lupton, Princeton 1993), this is not at all controversial. This approach is explicit about prior probabilities (or priors) for the various outcomes. In those cases where we really do understand these priors well, this can be a powerful tool. The result can be inverted to yield maximum-likelihood estimates of parameters. As more and more data become available, Bayesian estimates often approach "classical" estimates, since the prior matters less when large amounts of data exist.
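A toy numerical example of how the prior enters (every number here is invented):

```python
# Bayes' theorem with made-up numbers: how much should a "detection" raise
# our confidence that a candidate is real, given the error rates and the prior?
p_T = 0.01              # prior: 1% of candidates are real (assumed)
p_E_given_T = 0.90      # chance of a detection if the candidate is real
p_E_given_notT = 0.05   # chance of a spurious detection otherwise

p_E = p_E_given_T * p_T + p_E_given_notT * (1.0 - p_T)
p_T_given_E = p_E_given_T * p_T / p_E

print(p_T_given_E)      # ~0.15: with such a small prior, even a fairly reliable
                        # detection leaves the posterior probability modest
```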

Some introductions to Bayesian reasoning:

  • An Intuitive Explanation of Bayes' Theorem, by Eliezer Yudkowsky
  • An Introduction to Bayes' Theorem, by Curtis Brown. (The equation form above is taken from this site)

In hypothesis testing, we distinguish two kinds of errors - Type I (false positive) and Type II (false negative). These may have very different impacts on one's conclusions.

This is a good place to caution about the cognitive biases that the human brain seems to be universally susceptible to - so strongly that in places where it is a matter of life or death, analysts still need to be retaught how to recognize them regularly. These include confirmation bias, collecting nondiscriminatory evidence, and inappropriate weighting of evidence based on irrelevant factors.

Some good introductions to cognitive biases:

  • The Psychology of Intelligence Analysis by Richards Heuer
  • Cognitive Biases Potentially Affecting Judgment of Global Risks by Eliezer Yudkowsky