Email analysis scripts for mbox mailbox files
by Philip Guo (philip@pgbovine.net)
Many email programs (e.g., Mozilla Thunderbird, Eudora, Horde Webmail) store messages in files conforming to the mbox format. I have written some Python scripts to extract and organize email header information (senders, recipients, subjects, dates, and so on) from mbox files. I used these scripts to collect data for the graphs in my article, MIT: A Life in Emails. I have used them only for that one application, so they are not thoroughly tested. Use them at your own risk!
Because I analyzed large mbox files containing thousands of messages, I broke up my analysis into two stages to improve efficiency:
Stage 1: Creating an XML summary of an mbox file
create-mbox-summary.py -
Run this on an mbox file to create an XML summary (printed
to stdout).
mbox-summary.dtd -
The generated XML summary file should conform to this DTD.
For large mbox files, the XML summary file is much
smaller than the original because it contains only the
header information. Its smaller size and more
hierarchical organization make it easier and more
efficient to run analyses on the summary than on the
mbox file itself.
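As a rough illustration of what Stage 1 involves, here is a minimal sketch that reads an mbox file with Python's standard mailbox module and emits an XML string holding only a few headers per message. This is not the code in create-mbox-summary.py, and the element names (mbox-summary, message, from, to, subject, date) are assumptions made for the example; the real layout is whatever mbox-summary.dtd specifies.

```python
import mailbox
import xml.etree.ElementTree as ET

# Headers copied into the summary; message bodies are deliberately skipped.
HEADERS = ("From", "To", "Subject", "Date")


def summarize_mbox(mbox_path):
    """Return an XML string with selected headers of every message.

    Element names here are hypothetical, not taken from mbox-summary.dtd.
    """
    root = ET.Element("mbox-summary")
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            node = ET.SubElement(root, "message")
            for header in HEADERS:
                child = ET.SubElement(node, header.lower())
                # A missing header is stored as an empty string.
                child.text = str(msg.get(header, ""))
    finally:
        box.close()
    return ET.tostring(root, encoding="unicode")
```

Calling print(summarize_mbox("inbox.mbox")), where inbox.mbox stands in for your own mailbox file, would write the summary to stdout in the same spirit as the script described above.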
Stage 2: Analysis of the XML summary file
MboxSummary.py -
Classes that are instantiated from XML summary files and
expose fields suited to analyses involving dates, times,
senders, recipients, and subjects. Also contains some
auxiliary functions for filtering messages and building
histograms.
For an example of how to use MboxSummary.py, see MITEmailAnalysis.py,
which contains functions to generate data for my article, MIT: A Life in Emails.
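To give a flavor of the Stage 2 style of analysis, the sketch below loads a small XML summary, filters messages, and builds histograms by sender and by month. It does not use the MboxSummary.py classes; the function names (load_messages, histogram), the sample data, and the element names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

# Made-up summary data; element names are assumptions, not the real DTD.
SAMPLE_SUMMARY = """<mbox-summary>
  <message><from>alice@example.com</from><date>Sat, 16 Sep 2006 10:00:00 -0400</date></message>
  <message><from>bob@example.com</from><date>Sun, 17 Sep 2006 09:30:00 -0400</date></message>
  <message><from>alice@example.com</from><date>Mon, 02 Oct 2006 14:15:00 -0400</date></message>
</mbox-summary>"""


def load_messages(xml_text):
    """Parse the summary into a list of dicts with parsed dates."""
    root = ET.fromstring(xml_text)
    return [
        {
            "from": node.findtext("from", default=""),
            "date": parsedate_to_datetime(node.findtext("date")),
        }
        for node in root.iter("message")
    ]


def histogram(messages, key):
    """Count messages grouped by whatever the key function returns."""
    return Counter(key(m) for m in messages)


messages = load_messages(SAMPLE_SUMMARY)

# Histogram of all messages by sender.
by_sender = histogram(messages, lambda m: m["from"])

# Filter to one sender, then bucket that sender's messages by month.
from_alice = [m for m in messages if m["from"] == "alice@example.com"]
alice_by_month = histogram(from_alice, lambda m: m["date"].strftime("%Y-%m"))
```

Because only headers are parsed, this kind of pass stays fast even when the original mailbox is very large.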
I'm not enthusiastic enough right now to provide more detailed explanations and examples, because I figure that if you want to use this code, you probably have a pretty good idea of what you are doing. Best of luck in mbox hacking, and email me if you come up with any interesting analyses based on the framework provided by these scripts!
Last modified: 2006-09-16
