The Duplication Problem
Duplication is a special case of error that can arise in the collection and recording of observational data.
Data Errors
Data errors divide into several classes.
Observational Errors
1. Under Counting. Under counting occurs (a) when not all the birds that are really present are counted and (b) when the counter miscounts the birds seen or heard, reporting a lower number than was really present.
2. Over Counting. Over counting occurs when the counter miscounts the birds seen or heard, reporting a larger number of the species than was really present.
3. Misidentifying. This occurs when the observer reports one species as another. It is in effect an under count of the correct species combined with an over count of the incorrect species.
All of these errors assume that there is a “truth” out there in the world and that it is known to a Guaranteed Omniscient Discriminator who can then identify that a particular error has occurred.
In a practical sense, we cannot know that these errors have been made. Under counting 1 (a) cannot be detected by human observers except under extremely favourable conditions. Under counting 1 (b) can be picked up if birders count in teams and correct each other’s estimates. Over counting (2) can likewise be corrected if birders count in groups. Misidentifying (3) can be caught by a better identifier, particularly if birders are in groups. Later attempts to disqualify observations based on likelihood, however, are statistical in nature and not certain.
In the field we do the best that we can, knowing that we will be making errors all the time to a greater or lesser degree, and knowing as well that we will never know how great or small our errors are, since we are not Guaranteed Omniscient Discriminators.
Other Errors
4. Transcription errors. These occur (a) when observers write down their data incorrectly or (b) when the data are later copied incorrectly into another format, such as electronic form.
5. Processing errors. These are errors in extraction of data or of calculation based on data.
We will not be discussing these errors further. But the possibility of this kind of error should prevent us from becoming too arrogantly happy with the accuracy of our data since even perfect data is subject to them.
Duplication Error
A duplication error is an instance of over counting (2) and will arise frequently in the eBird system. For example, even though the recommendation is that TOC trip leaders enter trip observations into eBird, many of the attendees of the trip will want to enter the data too as part of their day/month/year/life list. Each bird will therefore be reported not once but potentially several times in eBird.
We cannot avoid this. The real issue is whether it constitutes a problem, and that depends on how the data is going to be used.
We have recommended four reasons for collecting data (Migration Information, Population Trends over Time, Nesting, Rarities) and it is useful to examine each to assess the impact of over counting on findings.
Migration Information
If the TOC wants eventually to produce a “Birds of Toronto” publication then it will need information about (a) when the birds are here, and (b) how frequent they are when they are here.
If the data extraction from eBird is used to give presence/absence information on each day of the year to answer (a) then the duplication problem has no effect on the determination. Answering (b) is more complex. If numerical count information is used from eBird, then duplications matter and they have to be corrected for somehow. This may not be possible. Alternatively, presence/absence information tracked over time can answer (b) and in this case the duplication problem has no effect.
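The difference between the two kinds of extraction can be illustrated with a minimal sketch. The record layout, species, dates and counts below are made up for illustration and are not the eBird export format. A trip leader's checklist is copied by four attendees, and the presence/absence answer is compared with a naive sum of counts.

```python
from datetime import date

# Hypothetical checklist records (NOT the eBird export format).
trip_leader = {
    "date": date(2024, 5, 11),
    "counts": {"Yellow Warbler": 6, "Baltimore Oriole": 2},
}
other_day = {"date": date(2024, 5, 12), "counts": {"Yellow Warbler": 3}}

# Four attendees enter the same trip list again.
attendee_copies = [dict(trip_leader) for _ in range(4)]


def presence_days(checklists, species):
    """Return the set of dates on which the species was reported at all."""
    return {c["date"] for c in checklists if c["counts"].get(species, 0) > 0}


def summed_count(checklists, species):
    """Naively add up every reported count of the species."""
    return sum(c["counts"].get(species, 0) for c in checklists)


clean = [trip_leader, other_day]
duplicated = clean + attendee_copies

# Presence/absence is a set of dates, so repeated lists change nothing.
same_days = presence_days(clean, "Yellow Warbler") == presence_days(
    duplicated, "Yellow Warbler"
)

# A naive sum of counts is inflated by every duplicate list.
clean_total = summed_count(clean, "Yellow Warbler")
duplicated_total = summed_count(duplicated, "Yellow Warbler")
```

In this sketch the set of days on which the species was present is identical with and without the duplicates, while the summed count rises from 9 to 33.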
eBird creates its graphs by calculating the thickness as the number of checklists that report the bird divided by the total number of checklists. If a Condor is sighted on a TOC field trip of 10 people to High Park and everyone on the trip reports it to eBird, the bird will appear on 10 lists. However, the 10 lists are divided by the total number of lists for the day, which could be in the hundreds, and consequently the effect of the error is reduced. This is for a determination of a single day. For calculation of a graph covering all years for Toronto, the 10 lists are divided by thousands of lists for all years and the error becomes negligible.
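The arithmetic behind the Condor example can be sketched as follows. The function name and the checklist totals (200 for the day, 50,000 for all years) are assumptions chosen only to make the division concrete. They are not eBird figures, and the code is not eBird's own implementation.

```python
def checklist_frequency(reporting: int, total: int) -> float:
    """Fraction of checklists that report the species."""
    if total <= 0:
        raise ValueError("total must be positive")
    return reporting / total


# Hypothetical totals, for illustration only.
DUPLICATE_LISTS = 10        # all ten trip members report the Condor
DAY_TOTAL = 200             # assumed checklists submitted that day
ALL_YEARS_TOTAL = 50_000    # assumed checklists across all years

day_freq = checklist_frequency(DUPLICATE_LISTS, DAY_TOTAL)
all_years_freq = checklist_frequency(DUPLICATE_LISTS, ALL_YEARS_TOTAL)

print(f"single day: {day_freq:.2%}")
print(f"all years:  {all_years_freq:.4%}")
```

With these made-up totals, the ten lists amount to 5% of the day's checklists but only 0.02% of the all-years total, which is the dilution the paragraph describes.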
Population Trends over Time
Any numerical trending requires numeric counts, so duplication is a problem. A data collection protocol is essential, because it is the protocol that prevents duplication.
We recommend that numeric trending data only be used when it comes from an official TOC counting project such as the Warbler Survey, the Hawk Watch and the Whimbrel Count. Other projects of a similar nature can be set up whenever there is a desire for them.
Nesting
The TOC is not collecting nesting information yet. However, the nature of the beast means that duplication is rare in the first place and can be controlled completely if data extractions use geo-location information to recognize duplicate nests.
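One way a data extraction might recognize duplicate nests by location is sketched below. This is an illustration under stated assumptions, not a TOC or eBird procedure: the record layout, the coordinates and the 25-metre radius are all hypothetical, and a real radius would depend on GPS accuracy and on how closely the species nests.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    earth_radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def dedupe_nests(reports, radius_m=25.0):
    """Keep one report per species per location.

    A report is treated as a duplicate if a report of the same species
    has already been kept within radius_m metres (an assumed threshold).
    """
    unique = []
    for species, lat, lon in reports:
        is_duplicate = any(
            kept_species == species
            and haversine_m(lat, lon, kept_lat, kept_lon) <= radius_m
            for kept_species, kept_lat, kept_lon in unique
        )
        if not is_duplicate:
            unique.append((species, lat, lon))
    return unique


# Hypothetical reports as (species, latitude, longitude).
reports = [
    ("Red-tailed Hawk", 43.64650, -79.46370),
    ("Red-tailed Hawk", 43.64652, -79.46372),   # same nest, a few metres off
    ("Red-tailed Hawk", 43.66000, -79.40000),   # a different nest
    ("Baltimore Oriole", 43.64650, -79.46370),  # different species, same tree
]
unique_nests = dedupe_nests(reports)
```

Here the second hawk report falls within the radius of the first and is dropped, leaving three nests. The greedy pass keeps whichever report comes first, so the choice of representative depends on input order.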
Rarities
In reports of rare birds or birds completely new to the official Toronto list, the error of Misidentifying (3) is paramount. It is the job of the vetting committee to do the best it can to determine whether a sighting can be accepted.
Errors of Under or Over counting are not relevant to the acceptance of the record and therefore the duplication problem has no effect.
Summary and Conclusion
For Migration Information, the duplication problem does not apply. Presence/absence data can either be used directly or be tracked over time to generate estimates of abundance.
For Population Trends over Time, we must use official numeric information collected according to a protocol, which exempts the data from the duplication problem.
For Nesting, geo-location information used during data extraction eliminates duplicates completely.
For Rarities, duplications are not relevant to the acceptance or rejection of an observation.
It is the conclusion of this report that the duplication problem cannot be avoided but that its effects are negligible for all data uses that the TOC recognizes.