Discovering Patterns in Relational Data
Lise Getoor
Computer Science
Stanford University
June 2001
The goal of my research is to discover statistical patterns in relational
data. A relational domain consists of objects and the relationships, or
links, between them. Examples of such domains include the World Wide Web
(Web pages and the links between pages), epidemiological studies (patients
with disease strains, contacts between patients, and potential disease
transmissions), and business data (companies, company officers, and event
information such as mergers and acquisitions). The discovery of patterns
in these types of domains has many obvious benefits, from simply gaining
valuable new insights into the domain, to allowing us to make more accurate
future predictions.
Traditional machine learning and statistical methods can discover certain
types of structures, including correlations and independence among domain
attributes, but they are limited to working with flat files, e.g. data
consisting of fixed-length vectors of attribute values. In order to apply
existing approaches to relational domains, the data must first be flattened.
While there are a number of problems associated with doing this, a key
issue is the loss of the relational structure. This structure itself may
be very useful for prediction and should not be disregarded. For example,
a Web page that has many pointers to it is likely to be more useful than
a Web page that is not referenced; and a patient who has infected many
of his/her contacts is more likely to be infected with a highly contagious
disease strain than with one that is less contagious.
In our research, we are addressing this problem by developing methods
for learning statistical models directly from relational data. The algorithms
build on the work in learning Bayesian networks, and provide powerful
and flexible learning algorithms. A learned Probabilistic Relational Model
(PRM) provides a statistical model that can uncover many interesting probabilistic
dependencies held in a domain. Unlike a set of (probabilistic) rules for
classification, PRMs specify a joint distribution over a relational domain
and thus can be used for answering queries about any aspect of the domain,
given any set of observations. Furthermore, rather than trying to predict
one particular attribute, the PRM learning algorithm attempts to weed
out the most significant direct dependencies in the data. The resulting
model provides a high-level, qualitative picture of the structure of the
domain, in addition to the quantitative information provided by the probability
distribution. PRMs are ideally suited to exploratory analysis of a domain
and relational data mining.
We have applied these methods in a number of domains. One for which the
results have been particularly exciting and rewarding has been in the
study of tuberculosis epidemiology. In collaboration with Peter Small,
M.D. and his research group at the Stanford Medical Center, we analyzed
clinical and epidemiological data from the San Francisco tuberculosis
clinic for over 2,000 tuberculosis patients and their contacts. The database
contains patient demographic attributes as well as clinical attributes
such as age at diagnosis, ethnicity, homelessness, HIV status, x-ray results,
disease site (e.g. pulmonary or extra-pulmonary) and whether the patient
was treated privately or at the TB clinic. Additionally, sputum samples
were obtained for each patient and analyzed to determine the strain of
tuberculosis causing the disease in the patient. A contact investigation
was performed for each patient to identify persons with whom the patient
was in contact during his/her infectious period. Data for each contact
include the relationship of the contact to the case (e.g., family member,
co-worker, etc.), the contact's age, and whether the contact is a member
of the patient's household. Tuberculin skin test results and any contact
disease treatment were also available.
The structure of the PRM we learn in this domain has a rich dependency
structure both within the patient, contact and strain classes and between
attributes in different classes. We showed this model to our collaborators
and they found the model quite interesting. They thought many of the dependencies
to be quite reasonable, such as the dependence of age at diagnosis on
HIV status (typically, HIV-positive patients are younger and are infected
with TB as a result of AIDS), and the dependence of the contact's age
on the type of contact (contacts who are co-workers are likely to be younger
than contacts who are parents and older than those who are school friends).
There are also some correlations that are clearly relational and which
would have been difficult to detect using a non-relational learning algorithm.
For example, there is a dependency between the patient's HIV result and
whether he/she transmits the disease to a contact HIV-positive patients
are much more likely to transmit the disease. Another interesting relational
dependency is the correlation between the ethnicity of the patient and
the number of patients infected by the strain patients who are Asian are
more likely to be infected with a strain which is unique in the population,
whereas other ethnicities are more likely to have strains that recur in
several patients. The reason being that Asian patients are more often
immigrants who come to the U.S. with a new strain of TB, whereas other
ethnicities are often infected locally. There are also dependencies that
seem to indicate a bias in the contact investigation procedure or in the
treatment of TB. For example, contacts who were screened at the TB clinic
were much more likely to be diagnosed with TB and receive treatment than
contacts who were screened by their private medical doctor. Our domain
experts were quite interested in identifying these patterns and using
them to help develop better investigation guidelines.
This is just one of the domains that we have explored; others include
gene expression data, U.S. Census data, and Security and Exchange Commission
data on company mergers and acquisitions. There are a number of other
potentially interesting domains to explore, such as market segmentation
and e-commerce transactions. In conclusion, our learning algorithms for
PRMs provide a powerful technique for discovering the statistical structure
in relational domains. These methods are applicable in a wide variety
of settings, and unlike traditional methods, do not require the data to
be reformatted in a manner that destroys the information contained in
the relational structure among objects in the domain.
|