Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. These relationships may be valid within the test set but have no statistical significance in the wider population.
| Property | Value |
| dbpprop:abstract
|
- Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. These relationships may be valid within the test set but have no statistical significance in the wider population. The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, then carry out a statistical significance test to see whether the results could be due to the effects of chance. (The last step is called testing against the null hypothesis). A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set will contain some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine if the patterns found are chance patterns. See testing hypotheses suggested by the data. As a simplistic example, first throwing five coins, with a result of 2 heads and 3 tails, might lead one to ask why the coin favors tails by fifty percent, whereas first forming the hypothesis might lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the odds are 93.75% against this happening by chance. In the latter case, it becomes obvious that the data is not anomalous. As a more lyrical example, on a cloudy day, try the experiment of looking for figures in the clouds; if one looks long enough one may see castles, cattle, and all sort of fanciful images; but the images are not really in the clouds, as can be easily confirmed by looking at other clouds. It is important to realize that the alleged statistical significance here is completely spurious – significance tests do not protect against data dredging. When testing a data set on which the hypothesis is known to be true, the data set is by definition not a representative data set, and any resulting significance levels are meaningless.
|
| dbpprop:date
| |
| dbpprop:hasPhotoCollection
| |
| dbpprop:reference
| |
| dbpprop:wikiPageUsesTemplate
| |
| rdf:type
| |
| rdfs:comment
|
- Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. These relationships may be valid within the test set but have no statistical significance in the wider population.
|
| rdfs:label
| |
| owl:sameAs
| |
| skos:subject
| |
| foaf:page
| |
| is dbpprop:redirect
of | |
| is owl:sameAs
of | |