
Meeting Description:

Hosted by NIH.AI and the National Library of Medicine (NLM), this highly interactive workshop will offer opportunities to exchange expertise and collaborate with NIH researchers at all career levels who are using natural language processing technologies in their work. This four-hour workshop will include targeted presentations and will offer time for open discussion among peers and across disciplines. The workshop will be held at the NLM Visitors Center, NIH Building 38A, Room 127, on Thursday, May 9, 2019, from 1 to 5 PM. A recording of the workshop will be made available following the event. Can't attend in person? WebEx is available via https://cbiit.webex.com/cbiit/j.php?MTID=m9552e7d1d54c51a0710326402c4f355c

Meeting goals:

  • Educate the NIH community on available natural language processing resources and repositories 
  • Foster collaboration between meeting attendees

 

Meeting Recording:

View the recording at https://cbiit.webex.com/cbiit/ldr.php?RCID=4c707137da66fb8658db594791b06b07

 

 

Agenda and Presentations:

1:00 - 1:40

Text Mining and Deep Learning for Biology and Healthcare: An Introduction – Lana Yeganova/Qingyu Chen, NCBI/NLM

This session covers the fundamentals of NLP and deep learning. We will start from basic natural language processing components such as tokenization and stemming, discuss the tf-idf term weighting technique, and touch on the BM25 retrieval function. We will transition from traditional word representations to word embeddings and demonstrate how they advance our ability to analyze the relationships across words, sentences, and documents. We will discuss popular word embedding techniques, including Word2Vec, GloVe, and fastText, and extend computed word embeddings to create sentence embeddings. Finally, we will discuss deep neural networks such as CNNs and LSTMs. Running examples will be provided for each topic.
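As a rough illustration of the term weighting mentioned in this abstract, the following is a minimal sketch of one common tf-idf variant. It is not the presenters' material: the whitespace tokenizer, the function names `tokenize` and `tf_idf`, the sample documents, and the specific formula (term count divided by document length, times log of N over document frequency) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = raw count / document length,
    idf = log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    "gene expression in tumor cells",
    "tumor suppressor gene mutation",
    "protein folding in yeast",
]
weights = tf_idf(docs)
```

In this toy corpus, a term that occurs in only one document (such as "expression") receives a higher weight than a term shared across documents (such as "gene"), and a term that occurs in every document would receive a weight of zero under this variant.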

1:40 - 2:15

Automatic information extraction from free-text pathology reports using multi-task convolutional neural networks – Hong-Jun Yoon, Oak Ridge National Laboratory

We introduce two techniques for extracting information from free-text cancer pathology reports: the Multi-Task Convolutional Neural Network (MT-CNN) and the Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN). The models tackle document information extraction by learning to identify multiple characteristics simultaneously. We will demonstrate how the models are trained, how the latent representation captures the key phrases and concepts, and how inference is made.

2:15 - 2:30

Break

2:30 - 3:05

Biomedical Named Entity Recognition and Information Extraction – Robert Leaman/Shankai Yan, NCBI/NLM

Biomedical text mining applications require locating and identifying concepts of interest: the tasks of named entity recognition (NER) and normalization (NEN). Both tasks have a long history in biomedical text mining, using techniques that have evolved from primarily lexical and rule-based approaches to machine learning with rich feature sets and, currently, deep learning with learned feature representations. Our PubTator Central (PTC) system provides on-demand NER and NEN annotations for six biomedical concept types (genes/proteins, genetic variants, diseases, chemicals, species, and cell lines) in both biomedical abstracts and full-text articles. PTC processes input text through multiple NER/NEN systems, combining their output with a disambiguation module based on deep learning. The module uses a convolutional neural network (CNN) to determine the most likely concept type for overlapping annotations based on the syntax and semantics of both the span being classified and the surrounding context. The disambiguation model is trained using a weakly supervised approach and provides a significant accuracy improvement. Currently, we are benchmarking deep learning methods for NER and NEN. Deep learning methods for NER have matured significantly, primarily using variations of long short-term memory networks (LSTMs). Normalization methods with deep learning are still an area of active development, and we describe some recent progress.

3:05 - 3:40

Neural Approaches to Medical Question Understanding – Asma Ben Abacha/Yassine Mrabet, LHC/NLM

Online resources are increasingly used by consumers to meet their health information needs. According to surveys from the Pew Research Center, one in three U.S. adults (35%) looks for information on a medical condition online, and 15% of internet users have posted questions, comments, or information about health-related issues on the web. Consumer health questions are often very challenging for automated processing and answering due to their high proximity to open-domain language models, high rates of misspellings and ungrammatical sentences, and the frequent insertion of background information. In this talk, we will describe our approaches to automatically understanding and answering consumer health questions. We will present our efforts to summarize long consumer health questions into short questions that are more efficient for answer retrieval, and to infer entailment relations between new user questions and existing, already answered questions. We will then talk about our approaches to extracting key information from the user question, such as the main topic and question type, and how to use that information in answer retrieval. Finally, we will present a first prototype that combines several approaches to tackle the question understanding and answer retrieval tasks.

3:40 - 3:50

Break

3:50 - 4:25

Transfer Learning in Biomedical NLP: A Case Study with BERT – Yifan Peng, NCBI/NLM

BERT (Bidirectional Encoder Representations from Transformers) is a recent language representation model proposed by researchers at Google AI Language. It has achieved state-of-the-art results in a wide variety of NLP tasks. Here we introduce how to pre-train the BERT model on large-scale biomedical and clinical corpora (PubMed and MIMIC-III) and how to fine-tune the BERT model on specific tasks such as named entity recognition and relation extraction.

4:25 - 5:00

Guided Discussion

 

Speakers:

  • Dr. Asma Ben Abacha is a Research Scientist at the Lister Hill Center, National Library of Medicine (NLM), National Institutes of Health (NIH). Prior to joining the NLM, she was a researcher at the LIST institute in Luxembourg and part-time lecturer at the University of Lorraine in France. Dr. Ben Abacha received her Ph.D. in computer science from Paris-Sud University. She also received a research master's degree in NLP and a software engineering degree. She is currently working on consumer health question answering, visual question answering and NLP-related projects.
  • Dr. Qingyu Chen obtained his Ph.D. degree from the University of Melbourne. He is currently a postdoctoral fellow in the Text Mining Research Group directed by Dr. Zhiyong Lu at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). Dr. Qingyu Chen’s main research interests include biomedical and clinical natural language processing and image processing. His current research focuses on biomedical and clinical text retrieval, machine learning for health care, and biomedical text mining to facilitate biological database curation.
  • Dr. Robert Leaman is a Research Scientist at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH). He is the author of several widely used open source systems for biomedical named entity recognition and normalization using machine learning. He received his Ph.D. in computer science in 2012 from Arizona State University.
  • Dr. Yassine Mrabet is a Solution Architect at the Lister Hill Center, National Library of Medicine (NLM). Prior to joining the NLM, he was a Marie-Curie ERCIM fellow working on natural language generation in a joint project between CNRS (France), WAIS (University of Southampton), and LIST (Luxembourg). Dr. Mrabet received his Ph.D. in computer science from Paris Orsay University, where he worked on hybrid solutions to information retrieval from semi-structured data. His current work includes information retrieval and deep learning approaches to question answering in the scope of the CHiQA project, and big data solutions to image processing for the Open-I system.
  • Dr. Yifan Peng obtained his Ph.D. degree from the University of Delaware. He is currently a research fellow in the Text Mining Research Group directed by Dr. Zhiyong Lu at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). Dr. Yifan Peng’s main research interests include biomedical and clinical natural language processing and image processing. His current research focuses on biomedical relation extraction, clinical report generation, and automated classification of age-related macular degeneration from fundus images.
  • Dr. Shankai Yan obtained his Ph.D. from City University of Hong Kong. He is currently a postdoctoral fellow in the Text Mining Research Group directed by Dr. Zhiyong Lu at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM). Dr. Shankai Yan's main research interests include biomedical/clinical natural language processing and computational biology. His current research focuses on biomedical relation extraction, biomedical/clinical document classification, and biomedical text mining to facilitate gene signature extraction.
  • Dr. Lana Yeganova is a Scientist at the Computational Biology Branch of the National Center for Biotechnology Information (NCBI) at NIH. Dr. Yeganova holds a Doctorate in Mathematical Optimization from the George Washington University. Her work at NCBI has addressed a range of problems from information extraction and text mining to clustering and knowledge discovery, resulting in the design of novel and efficient algorithms. Her most recent research has focused on improving the PubMed user search experience. Along that line, she has developed the Field Sensor, a software tool for understanding the intent of user queries, which has been integrated into PubMed search. She also developed PubTermVariants, a corpus of statistically collected lexical synonyms, and co-developed PubMedPhrases, a corpus of statistically collected phrases, both used for query expansion and indexing in PubMed.
  • Dr. Hong-Jun Yoon obtained his Ph.D. from the University of Pittsburgh in 2011. He also served as research staff in the radiology department at the University of Pittsburgh Medical Center. He joined Oak Ridge National Laboratory in September 2012 and is a staff scientist in the Biomedical Sciences, Engineering, and Computing group.

 

Questions?

Please contact George Zaki (george.zaki@nih.gov), Miles Kimbrough (miles.kimbrough@nih.gov), or Yifan Peng (yifan.peng@nih.gov).

 

Hosted by NIH.AI and National Library of Medicine (NLM)

 

Created by Miles Kimbrough. Last modified Mon July 22, 2019 1:29 pm by Miles Kimbrough.