Building A Concept Space:
An Approach to Thesaurus Development for the Alternatives to Animal Testing Community
"jane, you seem to be on the right track. your problem is quite similar to ours --vocabulary switching. the concept space approach should help greatly. you are welcome to connect to our recent work on "cancer space": http://ai.bpa.arizona.edu/CancerLit/ we have a set of even more precise and refined techniques for building concept space and such. good luck with your work. -- hc" [H. Chen]
INTRODUCTION
In a recent article, Chen et al. (1997) describe the building of a "concept space" as an approach to addressing the vocabulary problem in scientific information retrieval. Chen and co-workers refer to their approach to automatic thesaurus generation as "a concept space approach because our goal is to create a meaningful and understandable concept space (a network of terms and weighted associations) which could represent the concepts (terms) and their associations for the underlying information space (i.e. documents in the database)."
Chen et al. studied "cross-domain" retrieval of information between two scientific communities, worm biology and Drosophila genetics. The concept of "cross-domain" retrieval also applies to the alternatives to animal testing community: a "concept space" approach would help "bridge the gap" between published research results and the information needs of novice researchers and the general public.
Described below are the steps needed to build a concept space as outlined in Chen et al. Also indicated is preliminary progress toward creation of a concept space for the "mini-domain" of alternatives to skin irritation testing in animals. Comments from others involved in keywording or thesaurus development in the area of alternatives to animal testing are welcome.
METHODS
Building a concept space as described by Chen et al. consists of specific steps which are delineated below in excerpts (" ") from the article. I have added the underlining for emphasis. Progress at this website toward these goals is indicated in italics.
1. Document and Object List Collection
Document Collection
"In any automatic thesaurus building effort, the first task is to identify complete and recent collections of documents in specific subject domains that can serve as the sources of vocabularies."
Document collection continues; see BIBLIOGRAPHY.
Object List Collection
"For most domain-specific databases, there appear always to be some existing lists of subject descriptors. (E.G. subject indexes at back of textbook, researchers' names, genes, experimental methods, organizational names)."
I have developed a collection of researchers' names (at least for alternatives to skin irritation testing) and am continuing to collect words describing experimental methods.
See DESCRIPTORS-METHODS and DESCRIPTORS-RESEARCHERS.
2. Object Filtering and Automatic Indexing
Object Filtering
"For each document, we first identified terms that matched with terms in our known vocabularies, a process referred to as object filtering."
Some object filters have been constructed (see DESCRIPTORS above).
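As an illustration of what object filtering involves, here is a minimal Python sketch. It assumes the known vocabulary is simply a list of descriptor strings and that matching is case-insensitive on whole words. The function name object_filter and the sample descriptors are hypothetical; they are not taken from Chen et al. or from the DESCRIPTORS lists at this website.

```python
import re


def object_filter(text, vocabulary):
    """Return the vocabulary terms that occur in text, in vocabulary order.

    Matching is case-insensitive and respects word boundaries, so the
    descriptor "in vitro" does not match inside a longer word.
    """
    lowered = text.lower()
    found = []
    for term in vocabulary:
        # (?<!\w) and (?!\w) keep the match from starting or ending mid-word.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(term)
    return found


# Illustrative descriptors only (assumed, not the site's actual list).
methods = ["Draize test", "patch test", "in vitro"]
abstract = "Results of the Draize test were compared with an in vitro assay."
matched = object_filter(abstract, methods)
```

Terms found this way would be set aside as known objects, and the remaining text would go on to automatic indexing.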
Automatic Indexing
"Because after object filtering the remaining texts may still contain many important concepts, an automatic indexing procedure then was followed. Salton (1989) presents a blueprint for automatic indexing, which typically includes dictionary look-up, stop-wording, word stemming, and term-phrase formation. The algorithm first identifies individual words. Then, a stop-word list is used to remove non-semantic bearing words such as the, a, on, in, etc. After removing the stop words, a stemming algorithm is used to identify the word stem for the remaining words. Finally, term-phrase formation that formulates phrases by combining only adjacent words is performed."
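The blueprint quoted above can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the procedure used by Salton or by Chen et al.: the stop-word list is a tiny illustrative sample, the stem argument is a hypothetical hook that defaults to doing nothing, punctuation is assumed to end a phrase, and phrases are limited to three adjacent words.

```python
import re

# Tiny illustrative stop-word list (assumed; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and", "to", "was", "were"}


def index_terms(text, stop_words=STOP_WORDS, stem=lambda w: w, max_phrase_len=3):
    """Return single words and adjacent-word phrases found in text."""
    terms = []
    # Punctuation ends a phrase, so each punctuation-free chunk is handled alone.
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        run = []  # consecutive non-stop words seen so far
        # None is a sentinel that flushes the final run of the chunk.
        for word in chunk.split() + [None]:
            if word is not None and word not in stop_words:
                run.append(stem(word))
                continue
            # A stop word (or the end of the chunk) closes the run:
            # emit every phrase of 1..max_phrase_len adjacent words.
            for n in range(1, max_phrase_len + 1):
                for i in range(len(run) - n + 1):
                    terms.append(" ".join(run[i:i + n]))
            run = []
    return terms


terms = index_terms("Human skin irritation was measured in the patch test.")
```

Because only adjacent words are combined, "skin irritation" becomes a phrase here, but "irritation measured" does not: the stop word "was" separates them.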
"...We have made several changes in the above automatic indexing process and have fine-tuned our algorithms according to subjects' suggestions. We removed the stemming procedure...in order to avoid creating noise and ungrammatical phrases, e.g. cloning will not be stemmed as clone. We created a separate domain-specific stop-word list for worm biology which contained about 600 very general molecular biology terms such as gene, process, mutation, etc. We standardized all researchers' names according to the format of last name, followed by first character of first name."
I have not yet tried any automatic indexing procedures. I have standardized all researchers' names according to the format suggested above (see DESCRIPTORS-RESEARCHERS above).
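The name format described in the excerpt (last name, followed by the first character of the first name) could be produced automatically along these lines. This is only a sketch: the function name standardize_name is hypothetical, the exact punctuation of the standardized form is an assumption (a single space, no period), and names are assumed to arrive either as "First Last" or as "Last, First".

```python
def standardize_name(name):
    """Return 'Last F' for a name given as 'First Last' or 'Last, First'."""
    name = name.strip()
    if "," in name:
        # "Last, First" form: the last name comes before the comma.
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        if len(parts) < 2:
            return name  # a single word: nothing to rearrange
        first, last = parts[0], parts[-1]
    if not first or not last:
        return name  # malformed input is returned unchanged
    return f"{last} {first[0].upper()}"


standardized = [standardize_name(n) for n in ["Gerard Salton", "Chen, Hsinchun"]]
```

Names with suffixes or multi-word last names would need extra rules that this sketch does not attempt.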
3. Co-occurrence Analysis
I have not yet used the co-occurrence analysis methods discussed in the paper.
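For orientation only, the general idea of co-occurrence analysis can be sketched in Python. The weighting below is a deliberately simple stand-in and is not the function used by Chen et al., whose details are not reproduced here: it assumes each document has already been reduced to a list of terms, and it weights the link from term A to term B as the fraction of documents containing A that also contain B. The function name cooccurrence_weights and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_weights(doc_terms):
    """Map each ordered term pair (a, b) to an asymmetric association weight.

    weight(a, b) = (documents containing both a and b) / (documents containing a)
    """
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms of a pair
    for terms in doc_terms:
        unique = set(terms)  # count each term once per document
        doc_freq.update(unique)
        pair_freq.update(permutations(unique, 2))
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}


# Illustrative term lists for three documents (assumed data).
docs = [
    ["skin", "irritation", "patch test"],
    ["skin", "irritation"],
    ["skin", "cell culture"],
]
weights = cooccurrence_weights(docs)
```

The result is a small network of terms and weighted associations. The weights are asymmetric: in this sample every document mentioning "irritation" also mentions "skin", so that link has weight 1.0, while the reverse link is weaker.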
4. Associative Retrieval
I have not yet incorporated any procedures for assessing associative retrieval.
REFERENCES
Chen, H., T. D. Ng, J. Martinez, and B. R. Schatz. A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System. Journal of the American Society for Information Science 48(1): 17-31, 1997.
Salton, G. Automatic Text Processing. Reading, MA: Addison-Wesley, 1989.
Nadis, S. Computation Cracks 'Semantic Barriers' Between Databases. Science 272: 1419, June 1996.