Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead.

PropertyValue
p:abstract
  • Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead. Conventional statistical procedure is to formulate a research hypothesis, (such as 'people in higher social classes live longer') then collect relevant data, then carry out a statistical significance test to see whether the results could be due to the effects of chance. A key point is that every hypothesis must be tested with evidence that was not used in constructing the hypothesis. This is because every data set is liable to contain some chance patterns which are not necessarily present in the population under study, whose presence simply disappear in a sufficiently large sample size. If the hypothesis is not tested on a different data set from the same population, it is likely that the patterns found are chance patterns. See testing hypotheses suggested by the data. As a simplistic example, first throwing five coins, with a result of 2 heads and 3 tails, might lead one to ask why the coin favors tails by fifty percent, whereas first forming the hypothesis might lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the odds are 93.75% against this happening by chance. In the latter case, it becomes obvious that the data is not anomalous. As a more lyrical example, on a cloudy day, try the experiment of looking for figures in the clouds; if one looks long enough one may see castles, cattle, and all sort of fanciful images; but the images are not really in the clouds, as can be easily confirmed by looking at other clouds. It is important to realize that the alleged statistical significance here is completely spurious - significance tests do not protect against data dredging. When testing a data set on which the hypothesis is known to be true, the data set is by definition not a representative data set, and any resulting significance levels are meaningless. (en)
p:date
  • 2008-07-01 00:00:00.000000 (xsd:date)
p:hasPhotoCollection
p:reference
p:wikiPageUsesTemplate
rdf:type
rdfs:comment
  • Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead. (en)
rdfs:label
  • Data dredging (en)
skos:subject
foaf:page
p:redirect
owl:sameAs