Tuesday, February 2, 2010

Ancestry.com Bloggers Day: Technology (Part 2)

Last year I intended to do stupendously rich articles about Ancestry.com Bloggers Day presentations. Since I never got around to it, this year you’re getting my stupidously poor notes.

Mike Wolfgramm and Jonathan Young gave us the last presentation prior to lunch. Yesterday we talked about Dexter, the flexible content digitization pipeline. Today we will talk about:

  • Named entity extraction
  • Vertical [unique to Ancestry.com] search engine
  • Record linking
  • Hint engine – technology behind the shaky leaf
  • PersonRank – Search engine that powers Mundia (pronounced, “Moon-dia”)

Named entity extraction

Named entity extraction derives facts from unstructured data using advanced algorithms to find names, dates, and places. As I mentioned yesterday, computers are very stupid. Ancestry.com uses machine learning to train the system to identify names, dates, and places.

Having these facts separate makes the records searchable.
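
To make "deriving facts from unstructured data" concrete, here is a toy sketch of my own. It is not Ancestry.com's method: they use machine learning, while this swaps in hand-written regular expressions. The patterns, the function name `extract_entities`, and the sample sentence are all made up for illustration.

```python
import re

# Hand-written patterns standing in for what a trained model would learn.
# Everything here is hypothetical and far cruder than a real extractor.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
AGE_RE = re.compile(r"\bage(?:d)?\s+(\d{1,3})\b", re.IGNORECASE)
# Two or more capitalized words in a row count as a candidate name.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_entities(text):
    """Pull candidate dates, ages, and names out of free text."""
    return {
        "dates": DATE_RE.findall(text),
        "ages": [int(a) for a in AGE_RE.findall(text)],
        "names": NAME_RE.findall(text),
    }


# Invented sentence, not taken from any real obituary.
sample = (
    "Mary Smith, age 82, died March 3, 2009. "
    "She is survived by her son John Smith."
)
facts = extract_entities(sample)
print(facts)
```

Once the facts sit in separate fields like this, each one can be indexed and searched on its own. Rules this simple break quickly on real obituaries (a place name like "San Bernardino" would be mistaken for a person), which is presumably why a trained system is used instead.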

Wolfgramm and Young showed us the example below. I’ve circled items in these colors:

  • Name of Deceased: Lime green
  • Age at Death: Yellow
  • Death Date: Orange
  • Obituary Date: Red
  • Locations Mentioned: Purple and pink (we’ll see why I used two colors in a moment)
  • Other Persons Mentioned: Green

[Image: Jean Hess obituary]

Below I’ve included the corresponding record from the Ancestry.com U.S. Obituary Collection. I’ve circled items with the same colors as above so you can easily compare the two. As you can see, the algorithm did pretty darn well, for a stupid computer. It got the name of the deceased wrong, but did pick it up in the list of others mentioned. It got the obituary publication date wrong. The algorithm missed three locations (circled in purple): California, San Bernardino County, and Deplaines, although that last one is probably a misspelling of Des Plaines. It got the seven locations circled in pink. Lastly, it picked up all six names of other people.

Jean Hess Obituary Record from Victorville Daily Press

Interestingly, this exact same obituary also appeared in another newspaper and was picked up by Ancestry.com a year earlier. Back then, their named entity extraction technology apparently didn’t work as well. Notice in the record, below, that no names were picked up.

Jean Hess Obituary Record from Barstow Desert Dispatch

I asked why the dates were displayed ambiguously, rather than spelling the month out. Wolfgramm explained that they received the data from a third party in that format. He told us that they could fix the problem. Sure enough, within a couple of days, Ancestry.com had the problem fixed. Wow! I wish I could get all bugs fixed that fast!

Vertical Search Engine

The problem:

  • Variations in names, dates and places
  • Need to apply name authority (name alternatives)
  • In the 1841 UK census, ages of those over 15 were usually rounded down to the nearest multiple of 5
  • A 1985 study by Rogers found that 15% of birthplaces differ between the 1851 and 1861 censuses
  • Significant number of recording and transcription errors
  • Searching 4+ billion records quickly is a challenge
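
The 1841 age rounding in the list above is easy to show in a few lines. This is my own sketch, not anything shown in the presentation. The function names are hypothetical, and the one extra year of slack on the birth year assumes only that the census fell partway through the year.

```python
def recorded_age_1841(true_age):
    """Age as an 1841 enumerator would usually have written it down."""
    if true_age > 15:
        return true_age - (true_age % 5)  # round down to a multiple of 5
    return true_age


def birth_year_range(recorded_age, census_year=1841):
    """Inclusive (earliest, latest) birth years consistent with a recorded age."""
    youngest = recorded_age
    # A recorded 15, 20, 25, ... may hide up to four extra years.
    if recorded_age >= 15 and recorded_age % 5 == 0:
        oldest = recorded_age + 4
    else:
        oldest = recorded_age
    # Someone who has not yet had a birthday in the census year
    # was born one year earlier than a plain subtraction suggests.
    return (census_year - oldest - 1, census_year - youngest)


print(recorded_age_1841(37))
print(birth_year_range(35))
```

A recorded age of 35 covers true ages 35 through 39, so a search that insists on an exact birth year will miss most of those people. That is the kind of variation the search engine has to tolerate.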

The solution is a vertical search engine that can measure closeness:

  • typographically
  • phonetically
  • date proximity
  • place proximity
  • fuzzy matching
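
Here is a rough sketch of what three of these closeness measures could look like. It uses textbook techniques that I picked myself: Levenshtein edit distance for typographic closeness, American Soundex for phonetic closeness, and a simple linear falloff for date proximity. I have no idea which algorithms Ancestry.com actually uses, and the function names and the `tolerance` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# American Soundex digit for each coded consonant.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Four-character American Soundex code, e.g. a letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat after it is coded again
    return (letters[0] + "".join(digits) + "000")[:4]


def date_closeness(year_a, year_b, tolerance=5):
    """1.0 for identical years, falling linearly to 0.0 at `tolerance` years apart."""
    return max(0.0, 1.0 - abs(year_a - year_b) / tolerance)


print(edit_distance("Smith", "Smyth"))
print(soundex("Smith"), soundex("Smyth"))
print(date_closeness(1850, 1852))
```

A real engine would presumably fold scores like these into a single ranking, and would have to do it fast enough to cover billions of records. The sketch only shows the individual measures.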

Record Linking

  • Example: How do 3 tree records relate to each other?
  • [I can’t remember how this differed from PersonRank, below.]

Hint Engine

  • Leverages search technology and record linking
  • Computationally expensive – built with a scalable architecture
  • Key collaborative networking technology – once users establish links between trees, there is no need for a brute-force comparison of all people in all trees
  • Acceptance vs. rejection of hints allows algorithmic improvements.
  • Slightly over 80% of hints are accepted.
  • Hint-originated searches are usually more effective because of the additional search information taken from the tree

PersonRank

  • PersonRank is the algorithm used to determine if two individuals in different trees are the same person
  • Q. Is PersonRank used only between tree individuals?
    A. It was initially, but it is now used for all tree hints.
  • Q. Is it used for regular searches?
    A. No. Perhaps in the future.

Finally! We made it to lunch time! Lunch was with Tim Sullivan and Andrew Wait.
