I promised more on the estimation quiz, and here it is -- better late than never, I hope. (This probably won't make much sense unless you read the first bit published on Saturday. Ignore the bit where I promised that this article would be published last Sunday.)
First a note on the scoring. Here is the distribution of scores achieved by the first five-thousand-odd respondents:
As a vague attempt to cash in on the traditional summer obsession with exam grades, I've made up some mark boundaries, too. Grading `on the curve', with 10% achieving an A, 20% a B, 40% a C, 20% a D, and the remainder failing, the grades are:
| Grade | Score (%) |
|---|---|
| fail | < 15 |
| D | < 21 |
| C | < 32 |
| B | < 42 |
| A | all others |
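For anyone wanting to do the same sort of thing with their own scores, here is a minimal sketch of how boundaries like these could be derived. It is not the code used for the quiz: the function names `grade_boundaries` and `grade` are made up for illustration, and it assumes the boundaries are simply read off the sorted list of scores at the cumulative fractions given above (10% fail, then 20% D, 40% C, 20% B, and the rest A). With lots of tied scores the proportions will only be approximate.

```python
def grade_boundaries(scores, fractions=(("fail", 0.10), ("D", 0.20),
                                        ("C", 0.40), ("B", 0.20))):
    """Return (grade, upper_bound) pairs, lowest grade first.

    A score strictly below upper_bound (and not below an earlier bound)
    gets that grade; anything at or above the last bound gets the top grade.
    The fractions are the assumed share of respondents in each grade.
    """
    ordered = sorted(scores)
    n = len(ordered)
    boundaries = []
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        # Position in the sorted list where this grade's share ends.
        index = min(int(round(cumulative * n)), n - 1)
        boundaries.append((name, ordered[index]))
    return boundaries


def grade(score, boundaries, top="A"):
    """Look up the grade for one score given the computed boundaries."""
    for name, upper in boundaries:
        if score < upper:
            return name
    return top


if __name__ == "__main__":
    example_scores = list(range(100))  # made-up data: one respondent per score
    bounds = grade_boundaries(example_scores)
    print(bounds)
    print(grade(5, bounds), grade(50, bounds), grade(95, bounds))
```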
In the tradition of GCSEs, I could also award an A* grade for those who scored ooh, let's say, 70% or more, but I suspect most who did were cheating, so why bother?
(This explains the fairly low scores achieved by various well-informed people who answered the thing....)
I should say something about how the scores are computed, too. As I've said before, the algorithm is very ad-hoc. I wanted the scoring algorithm to:
- give an individual score for each question;
- give the maximum possible score for each question when the respondent's answer was exactly correct (including any actual error on the answer);
- give the minimum possible score for each question when the respondent's answer was claimed to be exact (uncertainty of zero) but was not correct; and
- otherwise reward accurate answers (close to the true answer) and realistic uncertainties (close to the difference between true answer and answer given), trading the two off in a reasonable fashion.
I spent ages looking for a mathematically satisfying formula for this and didn't find one. In the end I used the following for each question, where x and dx are the candidate's answer and uncertainty, and X and dX are the true answer and uncertainty: (yet again we suffer from the `HTML not being any good for maths' problem)
s² = dx² + dX²
m = x unless x = 0, in which case m = X
p = max(1 - |dx-dX| / m, 0)
score = 10 p exp( -(x-X)² / (2s²) )
-- the idea here being that we give a score based on (something like) the probability of the true answer given the normal distribution implied by the given answer and uncertainty, with a penalty function which increases with the departure of the claimed uncertainty from the true error. The 10 is just there to make the score on each question a convenient integer.
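Since the formula is easier to follow as code than as pseudo-maths, here is a rough sketch of it. This is a transcription of the four lines above, not the quiz's actual source; the function name `question_score` is invented, and two edge cases which the formula leaves undefined are handled by assumption (marked in the comments). It also assumes answers are positive quantities and returns an unrounded value.

```python
import math


def question_score(x, dx, X, dX):
    """Ad-hoc score for one question.

    x, dx: the respondent's answer and claimed uncertainty.
    X, dX: the true answer and its uncertainty.
    """
    s2 = dx ** 2 + dX ** 2
    if s2 == 0:
        # ASSUMPTION: the formula divides by zero when both uncertainties
        # are zero. Treat an exact hit as full marks on this factor and
        # anything else as zero.
        gaussian = 1.0 if x == X else 0.0
    else:
        gaussian = math.exp(-((x - X) ** 2) / (2 * s2))

    m = x if x != 0 else X
    if m == 0:
        # ASSUMPTION: x and X both zero is not covered by the formula;
        # give no penalty only if the uncertainties agree exactly.
        p = 1.0 if dx == dX else 0.0
    else:
        # Penalty for claimed uncertainty departing from the true one.
        p = max(1 - abs(dx - dX) / m, 0)

    return 10 * p * gaussian


if __name__ == "__main__":
    print(question_score(100, 5, 100, 5))   # exactly right: 10.0
    print(question_score(90, 0, 100, 0))    # claimed exact but wrong: 0.0
    print(question_score(110, 10, 100, 0))  # somewhere in between
```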
There's one further hack, which is that if the question asks for a date, the answers are converted into years-before-the-present. This is so that the penalty function (which compares given uncertainties with absolute answers) is reasonable in that case, the argument being that our knowledge of the past is likely to become more uncertain the further back we go. Obviously this doesn't really work with the question on the date of birth of Christ.
On the whole, this scoring system is rubbish. Note in particular that if your uncertainty is as large as your answer, you will get zero, which hardly rewards accurate estimation of uncertainties!
Various commenters, including Andrew Cooke on Crooked Timber, and John Quiggin commenting on my last piece here, pointed out that a better approach might be to base scores on properties of the distribution of each respondent's relative errors, y = (x-X)/dx. The idea is that, if a person is good at estimating, their uncertainties should be equal to the true errors and there shouldn't be any overall bias in whether their answers are above or below the true answers (since the questions are all on different topics). In this case, y should be distributed with mean zero and variance one. This looks like a fairly reasonable way to assess scores, in fact; for instance, here are the distributions for someone who didn't do very well (score 18%):
and for someone who did rather better (42%):
One way to compare the distributions (by no means the best, but I'm lazy and it's easy to compute) is the Kolmogorov-Smirnov statistic (the largest absolute difference between the two cumulative distributions); this is pretty well correlated with the ad-hoc scores I've computed:
(Ignoring the tail that runs up to the top left, which consists mainly of people who did so well they were probably cheating, a linear model explains about 50% of the variance, which isn't too bad.)
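The relative-error comparison described above can be sketched roughly as follows. This is an illustration rather than the analysis code behind the plots: the function names are made up, the reference distribution is assumed to be a normal with mean zero and variance one, and answers with a claimed uncertainty of zero are simply skipped (an assumption, since y is undefined for them). A smaller statistic means the respondent's relative errors look more like the ideal distribution.

```python
import math


def normal_cdf(y):
    """Cumulative distribution of a normal with mean 0 and variance 1."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def relative_errors(answers):
    """answers: iterable of (x, dx, X) triples for one respondent.

    Returns y = (x - X) / dx for each answer with dx > 0.
    ASSUMPTION: answers claiming zero uncertainty are dropped.
    """
    return [(x - X) / dx for x, dx, X in answers if dx > 0]


def ks_statistic(ys):
    """Largest absolute difference between the empirical cumulative
    distribution of ys and the standard normal cumulative distribution."""
    ordered = sorted(ys)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one relative error")
    d = 0.0
    for i, y in enumerate(ordered, start=1):
        f = normal_cdf(y)
        # The empirical CDF jumps from (i-1)/n to i/n at y; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


if __name__ == "__main__":
    # Made-up answers for illustration: (answer, uncertainty, true value).
    answers = [(110, 10, 100), (90, 5, 100), (50, 0, 40), (1000, 300, 1200)]
    ys = relative_errors(answers)
    print(ys, ks_statistic(ys))
```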
So, next time anyone designs one of these quizzes, I'd suggest more work on the scoring algorithm along these lines. One problem with schemes based on the distribution of answers is that they can't really give a score to each answer, which could make the game a bit unsatisfying, but that's probably not a disastrous problem.
Anyway, this is a bit off-topic. The real ulterior motive of the quiz was to test the theory that people who are incompetent in a given field are also lousy at estimating their own competence. I wanted to see whether the respondents who gave the poorest answers to the estimation questions were also likely to give unreasonably narrow uncertainties.
It turns out that they weren't. The uncertainties given tended to be more-or-less reasonable estimates of the actual errors. The distribution of uncertainties relative to actual errors is peaked around 1:
and looking at individual questions typically gives something like this: (in this example, for the question about the distance from the earth to the moon)
-- that is, people who gave inaccurate answers typically gave large uncertainties too. (Actually the results for some of the questions look much more complicated than that, but that's mostly because of `round number' effects, as far as I can tell.)
This probably isn't a very good test of the `incompetent and unaware of it' theory, but still, it restored my faith in humanity a bit. (I mentioned this result to a friend of mine, and he replied, ``That's disappointing.'' And then, after a pause, ``I suppose it isn't, actually.'' Which more-or-less sums it up.)







Comments
Posted by Damian Counsell, Saturday, 4 September 2004 19:18 (link):
It would be nice to be able to write that kind of thing in the discussion of an academic paper: "It was disappointing that we found no correlation between the use of fluoride toothpaste and cancers of the mouth and throat [since a positive result would have got me a paper in the British Medical Journal instead of The Annals of the Royal Society of Bounced Submissions]. Then again, I suppose it isn't actually [because we're not all going to die horribly]."
Posted by Bill, Saturday, 11 September 2004 00:59 (link):
Might also look at the questions denominated in billions. Mine was off by a factor of 986, presumably because I guessed wrong about whether a billion was 10^9 or 10^12. I assume other people had the same problem.
Posted by Jonadab the Unsightly One, Sunday, 12 September 2004 02:02 (link):
Oh, sheesh, I completely forgot about that. I theoretically knew that a billion is different in the UK, but I blithely assumed that a billion meant a US billion (100 million). (Then again, I didn't actually know going in that the survey was made in the UK, though by the _end_ of it I had an inkling it may have been.)
Posted by TokyoB, Monday, 13 September 2004 03:13 (link):
Actually I think a billion in the US is 1000 million, not 100 million.
Posted by john, Friday, 10 September 2004 18:18 (link):
"Time it took" to take the test is also relevant, as at some point you realize the futility and start answering "anything" to get to the end. "Partially taken" tests are also interesting. At which question did the person bail out, and where did the IP address resolve? Did more UK takers make it to the end compared to the "please entertain me" people from the US?
Posted by Chris Lightfoot, Friday, 10 September 2004 18:44 (link):
Yeah. To try to even out the effect of respondents getting bored, the quiz randomises the order of the questions. I haven't looked at the rate of quizzes abandoned, though.
Posted by Danila Medvedev, Friday, 10 September 2004 18:30 (link):
For the findings from the "Unskilled..." paper to hold true, one needs a correlation between the ability to give an answer and the ability to understand its validity. This doesn't seem true for guessing various numbers. Some people (stupid people, who also don't find MeFi particularly interesting) don't understand the idea of "margin of error" and may enter them too small. But once a person passes a certain level of intelligence, they can understand what exactly is the meaning of it. :) Those that are capable of guessing the margin of error don't need any extra knowledge to evaluate their mistake. The size of the +/- value will probably inversely correlate with their confidence and one doesn't need to be an astronomer to say how confident he is in his estimate.
It would be interesting to see more detailed data on correlations between the error estimate and the real error, especially for all questions separately. The authors of "Unskilled..." mentioned that the results they found will probably depend a lot on the field in which the participants are tested. Maybe it's different in your survey as well.
P.S. BTW, there's been well-known research (sorry for violating the policy, but I don't have a reference handy, probably in some papers on financial theory, or maybe not) showing that the crowd makes good decisions on average. This isn't directly related to your test, but is still interesting.
Posted by Michael Kellen, Friday, 10 September 2004 18:42 (link):
As the cited article (Kruger and Dunning, 1999) states in its conclusions:
I think perhaps that you have reiterated his work in that paper, finding:
Which is not, in itself, a bad statement to have tested.
Posted by Jonas, Friday, 10 September 2004 19:43 (link):
-Some qs should be log'd. Example: distance to moon, plastic bag, britain gdp
Posted by Random Individual, Friday, 10 September 2004 21:11 (link):
I was disappointed that I was asked to estimate +/- error rather than x// (logarithmic) error. This combined particularly badly with the fact that estimating n+-n gives you a score of zero: how do I indicate that I think my estimate is, e.g., within a factor of 2 (which IMO is jolly well near enough for the number of petrol stations in the UK)?
Posted by Michael, Friday, 10 September 2004 22:45 (link):
If you thought that there were 10,000 service stations in the UK, within a factor of two, then why not estimate 12,500 +/- 7,500?
Posted by Dan Dreifort, Saturday, 11 September 2004 08:57 (link):
To scale the margin of error relative to the magnitude of the correct value seems fair.
Your fancy trick might be fine for gas stations in the UK, but when we talk about stars in the galaxy, a factor of two is an unbridgeable (or at least unreasonable vis a vis meaningful data) divide.
I guess I want a statistically relevant scaling estimation efficiency factor grading model, and a side of fries.
Posted by Jonadab the Unsightly One, Sunday, 12 September 2004 02:42 (link):
This explains a *lot* about my score. There must have been ten questions where I calculated my estimate based on going halfway to an estimated outer bound and then specifying a variation equal to my estimate. This landed the official answer within my range in a high percentage of cases, but scored me nothing. I used this for estimating most of the more preposterously hard questions:
Another common problem was terminology in the question. Only a business major would know what FY2000/1 means. Is that a month? A quarter? A year? The two-year period from 2000 to 2001? What does the FY part mean? I only had _one_ college semester of Economics, so a little more verbosity in the question might have helped me out a bit. Even worse was the English Civil War question. For an Ohioan, I know *WAY* more than average about English history, but when it comes to the "English Civil War", which war is that, exactly? The wars between the Anglo-Saxons and the Gaelic peoples? The throne dispute between Harold and William? The War of the Roses? The American Revolution? One of the disputes about whether Scotland/Northumberland/Wales was part of England or not? Something Else? England has had a lot of things that could be considered civil wars. I managed to score four points on the assumption that it was one of the middle ones, but help me out here: name a specific one. Also, in the House of Commons question: MP last I checked was Military Police, not an elected office in most countries. I guessed that it was talking about the total number of persons elected to the House of Commons (and still came out rather low; I guess they must elect them all every time, rather than the rotation we have in our House of Representatives), but it wouldn't have hurt for the question to spell out what the abbreviation stood for.
A final point, about the plastic shopping bags in Australia: as it turns out I was almost dead on on the _per-capita_ usage, but nobody has any idea what the population of Australia is, and not living there I have even less chance of getting anywhere near close. I suspect the rather poor accuracy of the responses on this question has way more to do with misestimating population than with anything to do directly with shopping bags.
Posted by James, Monday, 13 September 2004 11:58 (link):
For the scoring system, how about:
1) Fitting a normal distribution about the mean and s.d parameters given by the person's answer.
2) Calculating the pdf of this distribution at the true value.
This would give zero points if someone guesses wrongly and gives a zero s.d. Unfortunately if someone got it right and gave a zero s.d. then they would get infinite points, so perhaps a more complicated system of:
1) Fitting a normal distribution about the mean and s.d. parameters given by the person's answer.
2) Creating upper and lower bounds for what is acceptable as the truth. This width will be non-zero in most cases since even the true answer is only based on someone else's guess.
3) Integrate the normal distribution over this interval.
This should remove singularities.
PS I've never done statistics before so take what I say with a huge pinch of salt.
Posted by Chris Lightfoot, Monday, 13 September 2004 12:14 (link):
I had a play with this kind of thing, and didn't get very far. As you identify, there are problems with getting exactly the right answer. I also tried overlap and product integrals between the two distributions, but that has the same problem (for the exact case you wind up with the integral of the product of two delta functions, which is undefined). I had what I thought was a mathematically attractive approach (K-S test statistic between the two distributions) but this punishes estimates very hard when the answer is exact, even when the estimated answer is very close and the range includes the correct answer.
I've had a couple of other better suggestions which I'll try to talk about later.
Posted by James, Monday, 13 September 2004 12:47 (link):
errr, ok disregard the second way, it's nonsense.
Posted by Sven Geier, Tuesday, 14 September 2004 21:13 (link):
In Tom DeMarco's 'The trance state conjecture' there is an interesting observation about the contradiction between accuracy and precision. If you ask someone for an estimate, they can give you a very precise estimate that is unlikely to be accurate, or an accurate estimate that is not going to be precise.
For example I can say that there are 50000 +/- 10 gas stations in GB, which is very precise and totally inaccurate. Or I can say that there are 50000 +/- 50000 which is spot on as accuracy goes, but not exactly precise.
He notes that people tend to err on the side of precision and that this is wrong: the precise estimate is false, while the imprecise one is in fact correct. And thus people don't say "it'll take between six and twelve weeks to do this" (which would probably be right) but they say "I'll be 50 work days or thereabouts" and then they waste time in meetings trying to justify why their estimates failed.
What you are trying to measure here is whether people are balancing precision and accuracy accurately. Which they do if and only if 68.3% of their answers are within one error-estimate of their result-estimates, 95.5% within two error-estimates, and so forth.
All in all, I find people's comments on how they arrived at a certain given answer more interesting than the questions (or answers) themselves. Let me encourage you to try a version two of this quiz (taking into account all that was learned here) that includes an option "I have no friggin clue" and that has options to provide feedback on (and/or have discussions of) individual questions.
Comments copyright (c) contributors and available under a Creative Commons License. See also the comments policy.