Text mining 88,000 papers gives Pfizer a drug safety database

Biopharma researchers have accrued a huge library of chemical safety data over the past 60 years, but, as is often the case, much of it is tied up in formats that make computational analysis impossible. Pfizer ($PFE) has tried to open up this data by text mining more than 88,000 scientific articles for interactions between drugs and diseases or phenotypes.

The project is a collaboration between Pfizer and North Carolina State University, which runs the Comparative Toxicogenomics Database. North Carolina State built the system to track how environmental chemicals affect human health, but the underlying approach applies to drug adverse events too. Recognizing this, Pfizer is working with the university to build a drug safety database, turning seven decades of free-text information into a mineable trove of drug-induced adverse events.

"Investigators can now test and validate which genes might be critical to the drug-induced event. This could be useful in gene-testing patients to tailor the correct medicine or it could help design future therapeutics by alerting safety researchers to avoid those pathways and potential toxic outcomes," said Dr. Allan Peter Davis, lead author of the paper. The database is skewed toward cardiovascular, neurological, renal and hepatic toxicity data, areas in which Pfizer has a particular interest.

Building the database required a mix of computational capabilities and manual human effort. Pfizer began by going through 3,017 scientific papers to hand-select safety findings relevant to 650 drugs. This manual work gave Pfizer a platform from which it could develop queries to automatically extract drug-adverse event relationships from free-text articles. Running two text mining approaches on the biomedical library Medline identified 88,629 articles, which Pfizer passed on to North Carolina State for curation.
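The article does not describe the two text mining approaches in detail. As a rough illustration only, one common way to flag candidate articles is to look for sentences that mention both a drug and an adverse event from seed vocabularies. The minimal Python sketch below assumes that approach; the vocabularies, the `find_cooccurrences` function and the sample abstracts are all made up for illustration and are not Pfizer's actual queries or data.

```python
import re
from collections import defaultdict

# Hypothetical seed vocabularies; a real project would derive these
# from the hand-curated papers rather than hard-code them.
DRUGS = {"bortezomib", "doxorubicin"}
EVENTS = {"peripheral neuropathy", "cardiotoxicity"}


def find_cooccurrences(abstracts):
    """Map (drug, event) pairs to the IDs of abstracts in which
    a single sentence mentions both terms."""
    hits = defaultdict(set)
    for doc_id, text in abstracts.items():
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
            drugs = {d for d in DRUGS
                     if re.search(rf"\b{re.escape(d)}\b", sentence)}
            events = {e for e in EVENTS
                      if re.search(rf"\b{re.escape(e)}\b", sentence)}
            for drug in drugs:
                for event in events:
                    hits[(drug, event)].add(doc_id)
    return dict(hits)


# Invented example abstracts, not real Medline records.
abstracts = {
    "doc1": "Bortezomib was associated with peripheral neuropathy "
            "in some patients. Dosing was adjusted.",
    "doc2": "Doxorubicin is widely used. Cardiotoxicity was not assessed.",
}

candidates = find_cooccurrences(abstracts)
for pair, doc_ids in candidates.items():
    print(pair, sorted(doc_ids))
```

In this sketch only `doc1` is flagged, because `doc2` mentions the drug and the event in separate sentences. Matches like these are only candidates, which is why a manual curation step of the kind described here would still be needed.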

This process, which was first designed for environmental chemicals, took 5 full-time curators one year to complete. At the end, Pfizer had a database of 250,000 interactions between drugs and diseases or phenotypes. "Coding the information in a structured format was key," Davis said. Researchers can mine the database for insights into which genes may be responsible for adverse events. The database has already suggested genes that may be involved in the nerve damage Millennium Pharmaceuticals' cancer drug Velcade causes in some patients.
