CREU Project
Knowledge Representation and Reasoning for the Biomedical Literature
Faculty: Debra Burhans
Students: Nicholas Lahens and Rebecca Robilotto
Weekly Meeting Time Fall 2006: Wednesday 9:00-10:15
Weekly Meeting Time Spring 2007: Wednesday 2:30-3:30
Project Description
This project explores problems of extracting and understanding information that is contained in the biomedical literature. The information sources are abstracts that are freely available from PubMed (http://www.ncbi.nih.gov).
The project will start with abstracts, extract information from the abstracts using text mining tools, then represent the extracted information in a knowledge representation system, and finally query the knowledge representation system to see what new information can be inferred from the representations.
Abstract
Text mining of biomedical literature is an important area of research due
to the large number of publications available and the importance of the
information contained therein. In particular, connecting information from
disparate sources may lead to new scientific insights in silico.
This project explores problems of extracting and understanding information
from scientific literature. Specifically, we are interested in how
automated reasoning can be used to connect hypotheses in different
abstracts. The information sources utilized are abstracts that are freely
available from PubMed (http://www.ncbi.nih.gov). An initial set of 50
abstracts will be selected, then analyzed both by hand and electronically. The
information extracted from the abstracts will be represented using the SNePS
(Semantic Network Processing System) knowledge representation system.
Once represented, the inference mechanisms in SNePS will be used to infer new knowledge from what has been represented. The goal is to develop effective knowledge structures that support inference across abstracts.
If you are interested, you can see a copy of our
proposal. Note that personal, academic,
and budget information has been removed to protect the privacy of the
participants.
Project Components
- Abstract Selection
- We are currently selecting a set of 50 papers plus
abstracts for the project. We have decided to consider full text as well as
abstracts in our study for a number of reasons. We are only using full
text articles that are freely available from PubMed.
- Selecting the abstracts is tedious and time-consuming, particularly
since we have decided, based on initial work with PDF documents, that not
only must the full text of each paper be freely available, but it must also be
available as text.
- We have completed abstract and paper selection based on
a methodology we devised to ensure that all our materials are freely
available in text format. Note that this limits the
inclusion of papers from a number of journals. Click here
to read about our selection process, and click here
to see our selected abstract list.
- Abstract Analysis
- We are proceeding with hand analysis of abstracts: Nick has abstracts
0-9, Becky 10-19, and Deb 20-29. Next we will convene to see how
our results compare. Becky is trying out new analysis tools because we
decided not to use MedScan, which is not freely available.
We made the decision to use only material and tools that are freely available
for this project.
- We have run all articles through the Genia tagger to gain insight into
their syntactic structure and to see how well the tagger works on the
papers we selected.
- Hypotheses
- Hypothesis characterization and recognition: Nick is focused on the
identification of hypotheses within the abstract and full text of seven
selected papers from the full set of 50. These hypotheses have been
examined for syntactic structure in order to develop templates that can
be used to automate hypothesis extraction from papers. This remains the
focus of Nick's work, with the end goal of producing a set of usable
templates.
- Notes on hypotheses in articles: as might be expected, the abstracts
contain numerous hypotheses. In addition, many
hypotheses span sentence boundaries, meaning that sentence-based parsing
techniques will fail to discover these multi-sentence hypotheses. Hypotheses
are clustered in certain sections of articles. Hypotheses found in
background and related work sections are not taken to be novel or newly
proposed.
- In order to clarify hypotheses, it is important to attach information
about the provenance of a hypothesis. This may be a direct or indirect
reference to a source in the article's bibliography. Further, the
novelty of a hypothesis may be inferred from the section of the article in
which it is found.
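The template idea described above can be sketched in a few lines. The following Python example is only an illustration and is not the project's Perl program: the cue-phrase patterns in TEMPLATES, the section names in NON_NOVEL_SECTIONS, and the function find_hypotheses are all hypothetical, chosen to show how a matched statement can carry its section and an inferred novelty flag.

```python
import re

# Hypothetical cue-phrase templates (assumptions for illustration only;
# the project's actual templates are not reproduced here).
TEMPLATES = [
    re.compile(r"\bwe (?:hypothesi[sz]e|propose|postulate) that\b", re.IGNORECASE),
    re.compile(r"\b(?:these|our) (?:results|data|findings) (?:suggest|indicate) that\b",
               re.IGNORECASE),
    re.compile(r"\bit has been (?:suggested|proposed|hypothesi[sz]ed) that\b",
               re.IGNORECASE),
]

# Assumed names of sections whose hypotheses are treated as previously proposed.
NON_NOVEL_SECTIONS = {"background", "related work"}


def find_hypotheses(section_name, text):
    """Return candidate hypothesis sentences found in one article section."""
    novel = section_name.strip().lower() not in NON_NOVEL_SECTIONS
    # Naive sentence split: terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        if any(template.search(sentence) for template in TEMPLATES):
            hits.append({"section": section_name, "novel": novel, "text": sentence})
    return hits


if __name__ == "__main__":
    sample = "It has been suggested that gene X regulates Y. Cells were cultured."
    for hit in find_hypotheses("Background", sample):
        print(hit)
```

Because this sketch matches one sentence at a time, it shares the limitation noted above: a hypothesis that spans a sentence boundary would be truncated or missed. Matching over a sliding window of adjacent sentences would be one way to relax that.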
- Word-level analysis
- One question of interest to us is whether there is a specific vocabulary
associated with hypothesis statements that will help to identify
hypothesis-rich areas of articles where we can then focus our template-based
extraction efforts.
- Becky has created concordances for all 50 abstracts and is in the process
of creating them for the 7 papers Nick selected for intensive analysis and
hypothesis abstraction. We are building concordances for each section and
will compare them by looking at word frequency counts as well as word
occurrences.
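As a rough illustration of the concordance comparison described above, the Python sketch below builds a keyword-in-context concordance and ranks words by how much more frequent they are in one section than in another. It is not the project's Perl concordance builder: the function names, the tokenization rule, and the context width are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def build_concordance(text, width=3):
    """Map each word to the contexts (width tokens per side) in which it occurs."""
    tokens = tokenize(text)
    concordance = defaultdict(list)
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        # The keyword is upper-cased so it stands out in its context line.
        concordance[word].append(" ".join(left + [word.upper()] + right))
    return concordance


def frequency_difference(section_a, section_b):
    """Rank words by relative frequency in section_a minus that in section_b."""
    counts_a = Counter(tokenize(section_a))
    counts_b = Counter(tokenize(section_b))
    total_a = sum(counts_a.values()) or 1  # avoid division by zero on empty text
    total_b = sum(counts_b.values()) or 1
    vocabulary = set(counts_a) | set(counts_b)
    diffs = {w: counts_a[w] / total_a - counts_b[w] / total_b for w in vocabulary}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    contexts = build_concordance("The data suggest that X may bind Y", width=1)
    print(contexts["suggest"])
    print(frequency_difference("may may may suggest", "method method may")[:2])
```

Words near the top of the ranking are candidates for a hypothesis-specific vocabulary when section_a is a hypothesis-rich section and section_b is not. Relative frequencies are used rather than raw counts so that sections of different lengths can be compared.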
- The project has produced a number of tools and resources:
- Genia-tagged corpus of articles
- Templates for hypothesis recognition, incorporated into
a Perl program for hypothesis extraction
- Concordance builder for text analysis written in Perl
- Excel macros for combining and graphing results of concordance builder
- List of 50 connected articles that are all open access and available in
formats from which plain text can be derived
- Set of 7 of the 50 articles in raw text to which basic text processing
tools can be applied