Written by Greg Michalec (firstname.lastname@example.org) - February, 2011
This is a brief outline of the steps involved in adding new data to the database, as well as some of the practices we've implemented and some things to watch out for.
Some Naming Conventions
Data Import Procedure
First off, don't do development on the live db. Use the mimir_dev schema, or create a local copy of the database. Whatever schema you use, update it from the live dataset before you begin, so that you are working with the most recent data and don't overwrite anything when you merge back into the live db.
Since most of the source data arrives in spreadsheet form, you will usually want to reorder/rename the columns in the spreadsheet to match what is in the existing database (assuming you are appending new data). You may also want to do some normalization of the data at this point, such as making sure dollar amounts are in integer format, getting dates into ISO format (e.g. 2010-02-27), or getting rid of strange data translation problems (extraneous tabs or other odd characters). It is also sometimes necessary to add columns to spreadsheets and to combine spreadsheets - e.g. if you have one spreadsheet for 2008 data and one for 2009, you need to add a year field and combine the two into one spreadsheet.
Once the spreadsheet is cleaned up, export it as a CSV file. You can then use the import_csv.pl script to load the data into a temporary table (see the data utilities page for details on this script).
Generally, I do most of the work of cleaning up/normalizing data in temporary working tables, and only merge the finished data into the real tables as a last step. This protects the existing data from being messed up mid-import.
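As a rough sketch of what this in-table cleanup might look like (the table and column names here are hypothetical - substitute your own temporary table):

    update tmp_import set amount = replace(replace(amount, '$', ''), ',', '');
    update tmp_import set award_date = str_to_date(award_date, '%m/%d/%Y');
    update tmp_import set title = replace(title, '\t', ' ');

The exact cleanup needed varies with each data source, but running it as a batch of UPDATEs on the temporary table keeps the raw import untouched in the spreadsheet and the real tables untouched in the db.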
Much of the data comes in a non-normalized format, with the same values duplicated across rows. It's often necessary to split a single spreadsheet source into several related tables. The general rule is that there should be only one place in the data that refers to a specific entity.
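For example, splitting a duplicated entity out into its own table might look something like this (the grants/tmp_import names and columns here are hypothetical, just to illustrate the pattern):

    insert into grants (grant_number, grant_title)
        select distinct grant_number, grant_title from tmp_import;
    update tmp_import a join grants b on a.grant_number = b.grant_number
        set a.grant_id = b.grant_id;

After that, the duplicated columns can be dropped from the working table, leaving only the foreign key.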
To normalize department names:

    update new_table a join departments_aliases b on a.department_name = b.alias
        set a.department_id = b.department_id;
You should then look at any departments that did not get matched and try to resolve them into existing departments. If they seem to be part of an existing department, add them to the departments_aliases table; otherwise, create a new department in the departments table, then add the alias to the departments_aliases table.
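To see which department names did not match, something like this works (assuming the same column names as the update above):

    select distinct department_name from new_table where department_id is null;
    -- for each unmatched name, either add it as an alias of an existing
    -- department in departments_aliases, or create a new row in departments
    -- first and then add the alias

Re-running the department_id update after adding aliases should leave no NULLs behind.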
Normalizing person names is usually the most tedious part of name matching (although maybe not anymore, now that I've automated much of it). For details on how to do this, refer to the Name Matching Procedure document. It's pretty crucial to the integrity of the overall data that this procedure is followed. The whole point of this project is to track collaboration, and you can't do that if the database thinks that 'Dan McFarland' is a different person than 'Daniel McFarland'.
Once the names and departments are normalized, and everything is in the proper format, you can merge your temporary table into the destination tables (if applicable). Finally, you can refresh the data on the live database with the data in the dev database. You can do this on a table-by-table basis, but I usually just dump the entire dev db (using mysqldump) and load it into the live db, like this:
mysqldump -u minerva_admin -p mimir_dev > mimir.sql;
mysql -u minerva_admin -p minerva < mimir.sql ;
If you do this, make sure to first delete any temporary tables you have created. If you are removing any tables, you'll need to manually delete them from the live db.
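The merge of a temporary table into its destination table (before the dump/load above) can be sketched like this - tmp_grants, grants_data, and the column list here are hypothetical:

    insert into grants_data (person_id, department_id, amount, year)
        select person_id, department_id, amount, year from tmp_grants;
    drop table tmp_grants;

Dropping the working table once it's merged also takes care of the cleanup step above, so temporary tables never make it into the dump.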
New Data Tables and Management Scripts
For any new data sources, there are some files that must be updated in order for the database management utilities to work correctly. (These files are in the /utils/db_management/ directory of the SVN checkout - see the data management utilities page for more details.) If you don't follow these steps, it's very likely the person and department ids in your new data will get out of sync with the rest of the database over time.
Note that it is okay to join in other tables if it is relevant to the particular data source. For example, it is useful to know the department of a dissertation that a faculty member read, so the dissertation_members query joins in the dissertations table.
Also, some tables may have more than one entry in _name_source_lookup - for example, if a table contains information about both a student and a faculty member (although we want to try to move away from this when designing the tables). There should be one entry for each reference to a person in a row of the table.
[ 'faculty_courses', 'person_id', '__faculty_name', 'faculty_course_id', 'year'],
[ 'courtesy_appointments', 'department_id', '__department_name', 'courtesy_appointment_id']