Phenotype silver standard corpora

This page provides the individual stand-off phenotype annotations created by running four concept recognition systems on four corpora. In addition, for each textual corpus, system-based silver standard corpora have been created using both exact matching and sentence-level matching.

The four systems are:

  • NCBO Annotator
  • MetaMap
  • cTAKES
  • BeCAS

The four corpora are:

  • A corpus of 2,163 publication abstracts from PubMed
  • A corpus of 906 clinical trials from
  • The I2B2 corpus
  • The ShARE/CLEF e-health 2013 Task 1 testing dataset

All annotations are stored in a tab-separated stand-off format, in files named after the corresponding files in the original corpus. In the case of the PubMed and CT_Phenotype corpora, the file names are PubMed or clinical trial IDs, so the source documents can be retrieved directly from their original publishers. The stand-off annotation format is: startOffset::endOffset [tab] original text span [tab] comma-separated list of CUIs. Silver standard corpora created from the system annotations omit the original text span and list only the offsets and the CUIs. Archives corresponding to the four corpora can be downloaded using the links below:

  • pubmed.tar.gz
  • ct_phenotype.tar.gz
  • i2b2.tar.gz
  • share.tar.gz
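
The stand-off format described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the names `Annotation` and `parse_line` are hypothetical, the sample CUIs are placeholder values, and it assumes one annotation per line, with three tab-separated fields for system annotations and two for silver standard files. It makes no assumption about whether the end offset is inclusive or exclusive.

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    """One stand-off annotation (hypothetical container, not part of the corpora)."""
    start: int
    end: int
    text: Optional[str]  # None for silver standard lines, which omit the text span
    cuis: List[str]


def parse_line(line: str) -> Annotation:
    """Parse 'start::end<TAB>span<TAB>CUI,CUI' or 'start::end<TAB>CUI,CUI'."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 3:
        # System annotation: offsets, original text span, CUIs
        offsets, text, cui_field = fields
    elif len(fields) == 2:
        # Silver standard annotation: offsets and CUIs only
        offsets, cui_field = fields
        text = None
    else:
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")

    # Offsets are written as startOffset::endOffset
    start_str, end_str = offsets.split("::")
    # CUIs are comma-separated; tolerate stray whitespace around each one
    cuis = [c.strip() for c in cui_field.split(",") if c.strip()]
    return Annotation(int(start_str), int(end_str), text, cuis)


# Illustrative lines with placeholder CUIs (not taken from the corpora)
system_ann = parse_line("10::20\tchest pain\tC0000001,C0000002")
silver_ann = parse_line("10::20\tC0000001")
```

Returning `None` for the text span lets the same function handle both the per-system files and the silver standard files without a separate code path.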