Monday, October 6, 2008

Name counts in table-style databases

In my article about Ancestry.com database records, we saw that Ancestry.com databases usually contain more than one name in each database record. The same is likely true of other vendors' databases as well. For this reason, Ancestry.com and other vendors like to communicate the size of a database by talking about name counts rather than record counts. Unfortunately, name counts are more open to interpretation than database record counts.

Table-style databases

In my article on records I referred to table-style databases. I've also heard them called fielded databases. These are databases stored like tables or spreadsheets. Census databases are table-style and are stored internally in tables not unlike the ones the enumerators filled out.

Table-style databases are stored internally in tables not unlike census forms.
The example above shows President Bush's ancestors in 1900.
Image courtesy FamilySearch. © 2008 by Intellectual Reserve, Inc. All rights reserved.

When Ancestry.com first started reporting database sizes by name counts, it had to rely on estimates of the number of names in each record because it had no mechanism to count the number of names actually present. For example, the "U.S. Phone and Address Directories, 1993-2002" database is a table-style database that has 313,282,124 records. Each record can contain two names, since a telephone listing can include a spouse's name. In the table there is one column for the primary name and another for a second name.


A simple estimate of the name count would be to multiply the number of records by two, which would give 626,564,248 names, the number reported by the new card catalog. But many, maybe most, telephone listings don't include a second name. If one were to assume that one-third of the listings contain a second name, then the actual name count would be about 418 million, smaller than what is reported by some 200 million names!
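The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the record count comes from the article, the one-third share is the assumption made above, and the function name `estimated_names` is made up for this example.

```python
from fractions import Fraction

RECORDS = 313_282_124  # record count quoted in the article


def estimated_names(records, second_name_share):
    """Every record has a primary name; a share of records also has a second name."""
    return records + round(records * second_name_share)


# Multiplying by two assumes every listing has a second name.
upper_bound = estimated_names(RECORDS, 1)

# Assumption from the text: one-third of listings include a second name.
one_third = estimated_names(RECORDS, Fraction(1, 3))

print(f"{upper_bound:,}")              # 626,564,248
print(f"{one_third:,}")                # 417,709,499
print(f"{upper_bound - one_third:,}")  # 208,854,749
```

Under that assumption the doubled figure overstates the count by roughly 209 million names, which is the "some 200 million" mentioned above.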

Name counts are open to abuse because the term is open to interpretation. Look up the U.S. Phone Directories in the old card catalog and you'll find the number of names estimated to be 862,075,337! That's more names than there are places in the database to store names! And it's probably about 400 million more names than are present in the database. That over-count is only about 100 million names short of a half-billion names for a single database!
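A quick check of those figures, again as a rough sketch: the record count, the two name columns, and the old catalog figure come from the article, while the one-third share of listings with a second name is only the assumption used earlier, not a measured value.

```python
RECORDS = 313_282_124            # record count quoted in the article
NAME_COLUMNS = 2                 # primary name and second name
OLD_CATALOG_CLAIM = 862_075_337  # name count shown in the old card catalog

# The most names the table could possibly hold.
capacity = RECORDS * NAME_COLUMNS

# How far the old claim exceeds what the table can store.
excess_over_capacity = OLD_CATALOG_CLAIM - capacity

# Assumed estimate: one name per record plus a second name on one-third of them.
assumed_actual = RECORDS + round(RECORDS / 3)
over_count = OLD_CATALOG_CLAIM - assumed_actual

print(f"{capacity:,}")              # 626,564,248
print(f"{excess_over_capacity:,}")  # 235,511,089
print(f"{over_count:,}")            # 444,365,838
```

The old figure exceeds the table's capacity by about 236 million names, and under the one-third assumption the over-count is roughly 444 million.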

I recommend that any vendor that publishes name counts for table-style databases report the actual number of names present. And when an estimated count is published, the vendor should plainly designate it as such. I wouldn't mind seeing the math symbol ≈ (meaning "about") before estimated numbers.
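For a table-style database, counting the names actually present amounts to counting the non-blank name cells. The following is a minimal sketch of that idea, not a description of how any vendor does it: the column names `primary_name` and `second_name`, the helper functions, and the sample rows are all invented for illustration.

```python
def count_names(rows, name_columns=("primary_name", "second_name")):
    """Count the non-blank name cells across all rows."""
    return sum(
        1
        for row in rows
        for column in name_columns
        if (row.get(column) or "").strip()
    )


def format_count(count, estimated=False):
    """Prefix estimated counts with the ≈ symbol so readers can tell them apart."""
    prefix = "≈" if estimated else ""
    return f"{prefix}{count:,}"


# Invented sample rows; a real table would have hundreds of millions.
sample_rows = [
    {"primary_name": "John Smith", "second_name": "Mary Smith"},
    {"primary_name": "Ann Jones", "second_name": ""},
    {"primary_name": "Lee Brown", "second_name": None},
]

actual = count_names(sample_rows)
print(format_count(actual))                       # 4
print(format_count(626_564_248, estimated=True))  # ≈626,564,248
```

Three records yield four names here, not six, and the estimated figure is visibly marked as an estimate.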

When I return, I'll talk about database types that require estimated name counts. (You may recall a previous article on that topic, Unbelievable Name Count Claims.)
