Natural Language Processing (NLP) Survey of Tools & Resources

General frameworks

  • Apache Unstructured Information Management Architecture (UIMA): Java framework for developing NLP pipelines, released under the Apache 2 license. UIMA provides Eclipse plug-ins for developing and testing UIMA-based applications. UIMA wrappers exist for a variety of other Java-based NLP component libraries.
  • General Architecture for Text Engineering (GATE): Java framework for developing NLP pipelines, developed at the University of Sheffield (UK). GATE includes a number of rule-based NLP components, and GATE wrappers exist for a variety of other Java-based NLP libraries.
  • Natural Language Toolkit (NLTK): A Python library for developing NLP applications. This framework is accompanied by a book, which is useful for pedagogical purposes.

NLP components, pipelines, and tools

  • clinical Text and Knowledge Extraction System (cTAKES): cTAKES is built on top of Apache UIMA, and is composed of sets of UIMA processors that are assembled together into pipelines. Some of the processors are wrappers for Apache OpenNLP components, and some are custom built. cTAKES was developed at the Mayo Clinic, and is distributed by the Open Health NLP Consortium.
  • Health Information Text Extraction (HITEx): HITEx was developed as part of the i2b2 project. It is a rule-based NLP pipeline built on the GATE framework.
  • Computational Language and Education Research toolkit (ClearTK): ClearTK was developed at the University of Colorado at Boulder, and provides a framework for developing statistical NLP components in Java. It is built on top of Apache UIMA.
  • NegEx (NegEx): NegEx is a tool developed at the University of Pittsburgh to detect negated terms in clinical text. It uses trigger terms to identify likely negation within a sentence.
  • ConText (ConText): ConText is an extension of NegEx, also developed at the University of Pittsburgh. In addition to detecting negated concepts, it determines the temporality of a concept (recent, historical, or hypothetical) and its experiencer (the patient or someone else).
  • National Library of Medicine's MetaMap (MetaMap): MetaMap is a comprehensive concept tagging system built on top of the Unified Medical Language System (UMLS). It requires an active UMLS Metathesaurus License Agreement for use. The program can run standalone, and some work has been done on a UIMA wrapper that allows MetaMap to act as a UIMA component.
  • MedEx - a tool for extracting medication information from clinical text (MedEx): MedEx processes free-text clinical records to recognize medication names and signature information, such as drug dose, frequency, route, and duration. Use is free with a UMLS license. It is a standalone application for Linux and Windows.
  • SecTag - section tagging hierarchy (SecTag): SecTag recognizes note section headers using NLP, Bayesian, spelling correction, and scoring techniques. The link here includes the SQL and CSV files for the section terminologies. Use is free with either a UMLS or LOINC license.
  • Stanford Named Entity Recognizer (NER): Stanford’s NER is a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German.
  • Stanford CoreNLP (CoreNLP): Stanford CoreNLP is an integrated suite of natural language processing tools for English in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference.
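
The trigger-term approach described for NegEx above can be illustrated with a minimal Python sketch. This is not the NegEx implementation: the trigger list, the scope terminators, and the five-token window are all assumptions chosen for illustration, and the real NegEx lexicon and scoping rules are considerably larger. The sketch only shows the general idea of marking a concept as negated when it falls inside the scope that follows a trigger term.

```python
import re

# Hypothetical, heavily abbreviated trigger list (assumption, not the NegEx lexicon).
PRE_NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]
# Words that close a negation scope in this sketch (assumption).
SCOPE_TERMINATORS = {"but", "however", "although"}
# Maximum number of tokens after a trigger that are treated as negated (assumption).
WINDOW = 5


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_negated(sentence, concept):
    """Return True if `concept` occurs inside the scope of a negation trigger."""
    tokens = tokenize(sentence)
    concept_tokens = tokenize(concept)
    n = len(concept_tokens)

    # Every position where the concept starts in the sentence.
    positions = [
        i for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == concept_tokens
    ]
    if not positions:
        return False

    for trigger in PRE_NEGATION_TRIGGERS:
        trigger_tokens = tokenize(trigger)
        m = len(trigger_tokens)
        for j in range(len(tokens) - m + 1):
            if tokens[j:j + m] != trigger_tokens:
                continue
            # The scope starts right after the trigger and ends at the
            # window limit or at the first terminator, whichever is first.
            start = j + m
            end = min(len(tokens), start + WINDOW)
            for k in range(start, end):
                if tokens[k] in SCOPE_TERMINATORS:
                    end = k
                    break
            if any(start <= p and p + n <= end for p in positions):
                return True
    return False


sentence = "Patient denies chest pain but reports nausea."
print(is_negated(sentence, "chest pain"))
print(is_negated(sentence, "nausea"))
```

In this sketch "chest pain" falls inside the scope opened by "denies", while "nausea" lies after the terminator "but" and is therefore left unnegated. Extending the same scope idea with additional trigger classes is one way to approximate the temporality and experiencer attributes that ConText adds.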

Software and Tools used by eMERGE network sites

  • Cincinnati Children's Hospital: a custom pipeline built around cTAKES
  • Mayo Clinic: the prevalent tool is the UIMA-based cTAKES. The latest open source tools being developed at Mayo Clinic in collaboration with the SHARP consortium can be found at:
  • Mount Sinai School of Medicine:
  • Northwestern University: Most of our NLP applications have been based on Apache UIMA, though we have also used HITEx (GATE). Our data processing workflows use the Konstanz Information Miner (KNIME) to extract data from our Enterprise Data Warehouse, including both structured data and text. Textual data are fed into a custom KNIME node that executes a UIMA-based NLP application to extract information for a particular phenotype. The output of processing the text is structured data, which are merged with other structured data and fed into a phenotype classification algorithm.
  • Vanderbilt University: We use a combination of SecTag, MedEx, and the KnowledgeMap Concept Identifier, which maps free text to UMLS concepts, including some statistical disambiguation. We have developed SOAP and REST web service versions of KnowledgeMap, SecTag, and NegEx that have been integrated with KNIME.
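
The general shape of the workflow described for Northwestern University above (text is turned into structured data, merged with other structured data, and passed to a phenotype classifier) can be sketched in plain Python. Nothing here reflects the actual KNIME nodes or UIMA application: the function names, the field names, the keyword match standing in for the NLP step, and the toy classification rule are all hypothetical and exist only to show how the stages connect.

```python
def extract_features_from_text(note):
    """Stand-in for the NLP step: turn free text into a structured flag.

    A real pipeline would run a full NLP application here; this
    hypothetical version only checks for one keyword.
    """
    return {"drug_mentioned": "metformin" in note.lower()}


def merge_records(structured, notes):
    """Merge NLP-derived flags into each patient's structured record."""
    merged = {}
    for patient_id, fields in structured.items():
        record = dict(fields)  # copy so the input is left unchanged
        record["drug_mentioned"] = any(
            extract_features_from_text(note)["drug_mentioned"]
            for note in notes.get(patient_id, [])
        )
        merged[patient_id] = record
    return merged


def classify(record):
    """Toy phenotype rule (assumption): coded diagnosis AND drug mention."""
    return record["has_diagnosis_code"] and record["drug_mentioned"]


# Hypothetical inputs standing in for data warehouse extracts.
structured = {
    "p1": {"has_diagnosis_code": True},
    "p2": {"has_diagnosis_code": True},
    "p3": {"has_diagnosis_code": False},
}
notes = {
    "p1": ["Started metformin 500 mg daily."],
    "p2": ["No medications."],
    "p3": ["Continues Metformin."],
}

merged = merge_records(structured, notes)
phenotypes = {pid: classify(rec) for pid, rec in merged.items()}
print(phenotypes)
```

Keeping the NLP step behind a single function mirrors the role of the custom node in the description above: the rest of the workflow only sees structured output, so the text-processing component can be swapped without changing the merge or classification stages.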