05. Sample topics

The examples outlined below are only suggestive and are not intended to be exhaustive of the possible approaches to, and topics for, applied statistical research.

Existing data bases

Data on police killings in the US

The Washington Post maintains a regularly updated data base of police killings in the US. It contains variables (factors) such as the date and location (latitude and longitude) of each killing, and the age, race, and gender of the person killed. Also available is a data base of all local police stations in the US. These data bases can stimulate, among other things, questions about discrepancies in police killings by age, location, and race. Investigating, and reporting on, such discrepancies can be a valuable addition to the applied statistics literature, and can potentially have a significant impact on social policy.

Collecting data – experimental design

Surveys

Surveys are a common form of experimental design, especially in social research. Social researchers commonly want to survey people's attitudes on certain social issues, or their habits in relation to a particular topic, such as how much alcohol they drink per week.

To obtain useful information from a survey, considerable care must be taken in the survey design.

As well as the survey design, the purpose of the survey must be carefully thought through. For example, are you interested in ascertaining the views and opinions of a small, well-defined group of people by surveying all of them, or do you intend to randomly sample a much larger group?
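The contrast between surveying everyone in a small group and randomly sampling a larger one can be sketched in a few lines of Python. This is only an illustration: the population of 5,000 identifiers, the sample size of 200, and the helper name simple_random_sample are all assumptions made for the example, not part of any particular survey.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement.

    Every member of the population has the same chance of selection.
    """
    rng = random.Random(seed)
    return rng.sample(population, sample_size)


# Hypothetical sampling frame: identifiers for 5,000 people.
population = [f"person_{i:04d}" for i in range(1, 5001)]

# Option 1: a census, where every member of the group is surveyed.
census = list(population)

# Option 2: a simple random sample of 200 people from the larger group.
sample = simple_random_sample(population, 200, seed=42)

if __name__ == "__main__":
    print(len(census), "people in the census")
    print(len(sample), "people in the random sample")
```

A census avoids sampling error but is only practical for small groups. A random sample scales to large groups, at the cost of uncertainty that has to be quantified.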

Methodological and theoretical issues

An example – Randomization and baseline transmission in vaccine field trials

Struchiner, C. J., & Halloran, M. E. (2007). Randomization and baseline transmission in vaccine field trials. Epidemiology & Infection, 135(2), 181-194.

Summary: In randomized trials, the treatment assignment mechanism is independent of the outcome of interest and other covariates thought to be relevant in determining this outcome. It also allows, on average, for a balanced distribution of these covariates in the vaccine and placebo groups. Randomization, however, does not guarantee that the estimated effect is an unbiased estimate of the biological effect of interest. We show how exposure to infection can be a confounder even in randomized vaccine field trials. Based on a simple model of the biological efficacy of interest, we extend the arguments on comparability and collapsibility to examine the limits of randomization to control for unmeasured covariates. Estimates from randomized, placebo-controlled Phase III vaccine field trials that differ in baseline transmission are not comparable unless explicit control for baseline transmission is taken into account.
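A small numerical sketch may help to show why baseline transmission matters. It assumes a simple "leaky" vaccine model that is not necessarily the model used in the paper: unvaccinated people face a constant force of infection, and vaccination multiplies that hazard by a fixed ratio. The function names, the hazard ratio of 0.5, and the chosen forces of infection are illustrative assumptions.

```python
import math


def attack_rate(force_of_infection, duration, hazard_ratio=1.0):
    """Cumulative incidence under a constant hazard (assumed model)."""
    return 1.0 - math.exp(-hazard_ratio * force_of_infection * duration)


def ve_cumulative_incidence(force_of_infection, duration, hazard_ratio):
    """Vaccine efficacy estimated as 1 - (attack rate ratio)."""
    ar_placebo = attack_rate(force_of_infection, duration)
    ar_vaccine = attack_rate(force_of_infection, duration, hazard_ratio)
    return 1.0 - ar_vaccine / ar_placebo


if __name__ == "__main__":
    hazard_ratio = 0.5  # assumed biological effect: hazard is halved
    for force in (0.1, 0.5, 1.0, 2.0):  # hypothetical baseline transmission
        ve = ve_cumulative_incidence(force, duration=1.0, hazard_ratio=hazard_ratio)
        print(f"force of infection {force:4.1f}: estimated VE = {ve:.3f}")
```

Under these assumptions the same biological effect yields a lower estimated efficacy where baseline transmission is higher, so estimates from trials in different settings are not directly comparable.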

Modifying, extending, or adding to, an existing result or method

An example – a bound on the variance of a finite numerical data set

Bhatia, R., & Davis, C. (2000). A better bound on the variance. The American Mathematical Monthly, 107(4), 353-357.

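Bhatia and Davis prove that if S = {a₁, a₂, …, a_n} is a finite set of numerical data with mean

$$\mu = \frac{1}{n}\sum_{i=1}^{n} a_i,$$

maximum M, and minimum m, the variance

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (a_i - \mu)^2$$

satisfies

$$\sigma^2 \le (M - \mu)(\mu - m) \le \frac{(M - m)^2}{4},$$

where equality holds throughout if and only if n is even and half the data points equal M and half equal m.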

So, in general, the variance is strictly less than the upper bound of the result.

Questions:

For a finite set of real numbers chosen from specific types of distributions (e.g. normal, beta, Cauchy, …), what is the nature of the difference between the upper bound of the result and the variance?

How might we quantify this difference?

For example, if we simulate choosing 100 random points from a normal distribution with mean 0 and standard deviation 1, and repeat this 10,000 times, we get the following distribution of the differences between the upper bound for the data and the variance:

The distribution is clearly not normal – can we characterize this distribution of differences between the upper bound for the variances and the variances?
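A simulation of this kind can be sketched as follows. The sketch assumes the upper bound in question is the Bhatia–Davis bound (M − μ)(μ − m), and it also computes the looser bound (M − m)²/4 for comparison. The function names, the seed, and the use of sample skewness as a rough check on normality are choices made for illustration only.

```python
import random


def bound_gaps(n_points=100, n_reps=10_000, seed=0):
    """Simulate the gaps between two upper bounds and the variance.

    Each replication draws n_points values from N(0, 1) and records
    (M - mu)(mu - m) - variance  and  (M - m)^2 / 4 - variance,
    where the variance uses the 1/n (population) convention.
    """
    rng = random.Random(seed)
    tight_gaps, loose_gaps = [], []
    for _ in range(n_reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        mu = sum(xs) / n_points
        var = sum((x - mu) ** 2 for x in xs) / n_points
        hi, lo = max(xs), min(xs)
        tight_gaps.append((hi - mu) * (mu - lo) - var)
        loose_gaps.append((hi - lo) ** 2 / 4 - var)
    return tight_gaps, loose_gaps


def skewness(values):
    """Sample skewness: third central moment over (second moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    tight, loose = bound_gaps()
    for name, gaps in (("(M - mu)(mu - m)", tight), ("(M - m)^2 / 4", loose)):
        mean_gap = sum(gaps) / len(gaps)
        print(f"{name}: mean gap = {mean_gap:.3f}, skewness = {skewness(gaps):.3f}")
```

A skewness well away from zero would be consistent with the observation that the differences are not normally distributed. Replacing rng.gauss with another generator from the random module, such as rng.betavariate, gives the corresponding experiment for other distributions.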