Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees | FiveThirtyEight

There’s still a pretty strong relationship between education levels and polling errors…

But most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however. As education levels increasingly cleave voters from one another, more pollsters may need to consider weighting their samples accordingly.

Source: Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees | FiveThirtyEight

When it comes to surveys, polls and non-financial prediction, Nate Silver is probably one of the best. Yet the last paragraph of his post suggests that the state of the art in sociology, in subjects involving communication, human interaction and opinion polls, is not yet good enough for practical purposes.

***

Let’s consider a caricature poll. There are two products, a and b. In a population of N, let the whole population be surveyed on whether the respondent prefers one product or the other. Each person truly prefers either a or b, but may express that preference with an unknown degree of bias, error or hesitancy, so that the preference expressed in the survey does not necessarily coincide with that person’s true but hidden preference:

P(E_a) \ne P(a)

It’s a classical “communication channel with errors” type of problem.
What can I say about the true number of people who prefer the product a?

N_a = N P(a) = N (P(a \mid E_a) P(E_a) + P(a \mid E_b) P(E_b))

where

P(a \mid E_a)

is the probability that a person who expresses a preference for product a actually prefers it, and

P(a \mid E_b)

is the probability that a person who expresses a preference for b actually prefers a.

Or, taking the view from the pollster’s side:

P(E_a) = P(E_a \mid a) P(a) + P(E_a \mid b) P(b) = P(a) (P(E_a \mid a) - P(E_a \mid b)) + P(E_a \mid b)

Now what we do know is a reasonably good estimate of

P(E_a)

that is, the fraction of respondents who express a preference for a. Suppose we can also estimate the rates at which people express a preference other than their true one:

P(E_b|a) = \epsilon_a

and

P(E_a|b) = \epsilon_b

Then, substituting P(E_a \mid a) = 1 - \epsilon_a and P(b) = 1 - P(a) into the expression above and solving for P(a), the true incidence of those who prefer a is

P(a) = \frac{P(E_a)-\epsilon_b}{1 - \epsilon_a - \epsilon_b}
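This correction can be sketched in a few lines of Python; the numbers are made up purely for illustration:

```python
def true_share(p_expressed_a, eps_a, eps_b):
    """Recover the hidden share P(a) from the observed share P(E_a),
    given misreporting rates eps_a = P(E_b | a) and eps_b = P(E_a | b).
    Only meaningful when eps_a + eps_b < 1."""
    denom = 1.0 - eps_a - eps_b
    if denom <= 0:
        raise ValueError("misreporting rates too large: eps_a + eps_b >= 1")
    return (p_expressed_a - eps_b) / denom

# Hypothetical numbers: 52% express a preference for a, while 5% of true
# a-supporters report b and 8% of true b-supporters report a.
p_a = true_share(0.52, eps_a=0.05, eps_b=0.08)  # about 0.506, not 0.52
```

Note that the denominator blows the correction up as the two epsilons grow: the noisier the “channel”, the more the recovered P(a) depends on our assumptions about the noise.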

***

It’s easy to imagine other factors contributing to a less-than-perfect correspondence between the survey result P(E_a) and the hidden truth P(a). Unwinding the influence of these factors takes more than running this or that regression.

[Image: blogs_2016_soc_1]

In this picture the big circle shows how the population is divided between those who prefer a and those who prefer b. The smaller circle inside it shows those who are surveyed. Some people who privately prefer b and who are surveyed publicly express a preference for a.

[Image: blogs_2016_soc_2]

Here the big circle again shows how the population is divided between those who prefer a and those who prefer b, and the smaller circle inside it shows those who are surveyed. The smaller circle is shifted relative to the big one to illustrate the possibility of selection bias: among those who are surveyed there are more people who prefer a than in the general population. Every survey is biased in some such way, each to a different degree.
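A tiny simulation (with invented numbers) illustrates the shifted-circle picture: the same population surveyed twice, once uniformly and once with a-supporters over-represented among respondents:

```python
import random

random.seed(0)

# Hypothetical population: 45% truly prefer a, 55% prefer b.
population = ["a"] * 45_000 + ["b"] * 55_000

# Unbiased survey: every person equally likely to be sampled.
unbiased = random.sample(population, 2_000)

# Biased survey: a-supporters are twice as likely to end up in the
# sample, mimicking the smaller circle shifted toward a.
weights = [2 if p == "a" else 1 for p in population]
biased = random.choices(population, weights=weights, k=2_000)

def share_a(sample):
    return sample.count("a") / len(sample)

# share_a(unbiased) comes out near 0.45, while share_a(biased) comes out
# near 0.9 / 1.45, about 0.62, although nothing about the population changed.
```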

Therein lies the problem: conditional probabilities are not “known” with certainty. To estimate them we apply assumptions and models. These assumptions reflect our worldview and these models are based on previous research, surveys and polls and assumptions and, again, the worldview of those researchers whose work we use.

The models in turn depend on what factors we view as important and what factors we choose to ignore. There is no escape from subjectivity here, as the great Edward Leamer showed 30 years ago: sociology, econometrics and the science of economics all suffer from the problem of “specification searches”.

All this creates a potentially unlimited margin of “error” in our estimates of \epsilon_a and \epsilon_b, even in the simplest case, when no “statistical” error is involved at all.
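To put a rough number on that sensitivity, here is a sketch (assumed inputs, not real polling data) that sweeps the two epsilons over a modest 0–10% range for a fixed observed share:

```python
def true_share(p_obs, eps_a, eps_b):
    # The correction formula: P(a) = (P(E_a) - eps_b) / (1 - eps_a - eps_b)
    return (p_obs - eps_b) / (1.0 - eps_a - eps_b)

p_expressed_a = 0.52  # observed share expressing a preference for a

estimates = [
    true_share(p_expressed_a, ea / 100.0, eb / 100.0)
    for ea in range(0, 11)  # eps_a: 0% .. 10%
    for eb in range(0, 11)  # eps_b: 0% .. 10%
]
spread = max(estimates) - min(estimates)
# spread is about 0.11: uncertainty about the epsilons alone moves the
# estimate of P(a) by some eleven percentage points.
```

Eleven points of play from assumptions alone dwarfs the sampling margins of error that polls usually report.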

These are really elementary arguments against the view (or hope) that decisions involving millions of people can be informed by surveys, polls or, for that matter, democratic deliberation. There is a long way from polls and surveys to prediction. Period.

***

Forget that for many months the public was told that Trump voters were racist, bigoted, conservative, white and ignorant, whereas Clinton voters were cosmopolitan, modern, open-minded and well-educated. It turned out that sociologists neglected to include in their quantitative models some factors associated with white and less-educated voters. Whether it was an honest mistake by sociometricians, mood affiliation or overconfidence affecting their research is really unimportant now.

For those who are affected by sociological predictions, the important takeaway from this abject failure is that quantitative sociology is really an infant science, and we should be mindful of its limitations.