Written by Greg Michalec (firstname.lastname@example.org) - February, 2011
This is the documentation on how to use the name-matching utilities to help match/disambiguate person entities (assign person_ids) when importing new data into the database.
The name matching procedure automates the process of matching person entities as much as possible. For portions that require human hand-matching, web-based user interfaces are used, allowing the tasks to be distributed across members of the mimir group. Note that this process assumes that every unique name in the dataset to be matched refers to the same person, i.e. it assumes there are no two people who share the same name who do not already have a person_id. Here is a quick rundown of the steps involved (more details explanations will follow).
First off, you need to make sure you have personal CGI enabled in your stanford home dir: https://itservices.stanford.edu/service/cgi/personal. Then you can cd into you ~/cgi-bin/ dir and checkout the db_management code:
svn co file:///afs/ir/group/minerva/mimir_svn/util/db_management/ ./db_management/
You should then be able to browse to http://stanford.edu/~SUNETID/cgi-bin/db_managment/ and see these instructions. NOTE: There is an .htacess file in the db_management dir which will restrict web access to this directory to only members of the minerva AFS group
More details on update_meta_data.sh can be found on the db management utilities page, but it should be enough to know that this script polls the entire database and sets up the people and _name_source_lookup tables with all the relevant data. This should be run anytime new data is added to the database, or data is changed.
Set up local.pl configuration variables
There is a file called 'local.pl' in the directory that is sourced in all the programs. It contains the database connection information, as well as information about the data source you are matching. Do not commit local.pl with the database password into SVN. We need to keep the db admin password secure.
select $name_field as name, group_concat(distinct __student_emplid) as empid, group_concat(distinct department_name) as dept, group_concat(distinct graduation_academic_year) as years, group_concat(distinct dissertation_title) as title from $table a left join departments using(department_id) where person_id is null and $name_field = ?
Many of the data sources contain mangled names, names with odd characters in them, names in "FirstName LastName" format, and names that are not really names at all, such as "Staff" or "". This script will help put the names into the format expected by the scripts, and resolve the most common problems.
This script begins the initial round of matching and assigning person_ids.
Run match_duplicates_web.pl in a web browser
match_duplicates_web.pl allows us to assign a person_id to rows that exact match on name to more than one person. At this writing, there are about 250 names that are used by more than one person. I.E. there are two distinct people in the dataset who use the name 'Cohen, Jeremy'. This script will present a webpage that will display the person record being matched (as defined by the $id_query) variable, as well as summary data on all the records from _name_source_lookup that match that record either on name or normalized name. The summary data aggregates information across all the tables - i.e. all the departments a person is associated with. These are usually followed by the number of times each occurs in the data. The lastname odds field contains data that lists the number of people in the 2000 United States Census that have that last name, and the probability that the name will occur in 100,000 people. This helps gauge the likelihood that two different people have this same last name.
The interface allows you to select one of the optional matches as being the correct match. Additionally, you can select 'none match', or 'not enough' info.
Once the matching is completed, if any matches were marked 'not sure', you will get a link offering to recheck them (resetting their person_ids to null)
This is the meat of the name matching process. To prep, the script resets any person_ids that have been set to 888888888 or 999999999 from the previous script to null. It also drops the temporary ___normalized_name field.
It then creates a temporary table called _match_$table. It then compares every name from the source data that has not been matched to every unique name in _name_source_lookup and generates a match score based on the 'closeness' of the two names to one another, which is stored in the temporary table. The script uses a modified version of the perl module Lingue::EN::MatchNames (http://search.cpan.org/~brianl/Lingua-EN-MatchNames-1.12/MatchNames.pm) to generate the scores. This module checks for:
Note that depending on the number of remaining matches, this script can take a long time to run. You may want to open a screen session first.
Run match_verify_scores_web.pl in a web browser
This will present you with a similar interface as was seen in match_duplicates_web.pl. The summary data from $id_query is seen at the top, and all the person records that have a match score > 60 will be seen below. The person records are color-coded by their match score - low scores are dark red. scores of 100 are white. Again, options of 'not sure' and 'none' apply. As each match is served to a user for verification, it's match field is set to '5' - this is to make sure that if two or more people are matching simultaneously, they aren't duplicating work. A positive match is set to '1', not sure is '3', none is '2'. Note that the form is automatically focused, so you can quickly run through matches using your arrow keys to select the match and pressing enter to submit - no need to use that pesky mouse!
Once matching is complete, you will presented with an option to recheck the 'not sure' matches. When there are no 'not sure' matches left, you will be presented with an option to merge in the results of the name matching. When you click this option, all the records that have had verified positive matches will have their person_ids set. Then, any records that still have a person_id of null will be inserted into the people table, creating new person_ids. These new person_ids will then be merged back into the source table. Congratulations!
You changed the data, so you need to refresh the meta tables.