My Work - Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation

In Proceedings of the 10th International Workshop on Health Text Mining and Information Analysis at EMNLP (LOUHI 2019)

Posted by Jexus on November 17, 2019


Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation

In Proceedings of the 10th International Workshop on Health Text Mining and Information Analysis at EMNLP (LOUHI 2019)

Authors: Alexander Te-Wei Shieh, Yung-Sung Chuang, Shang-Yu Su, and Yun-Nung Chen

ArXiv: https://arxiv.org/abs/1910.01462
Code: https://github.com/MiuLab/RCT-Gen

Introduction

Randomized controlled trials (RCTs) represent the paramount evidence in clinical medicine. Using machines to interpret the massive number of RCTs has the potential to aid clinical decision-making. We propose an RCT conclusion generation task based on the PubMed 200k RCT sentence classification dataset to examine how effectively sequence-to-sequence models understand RCTs. We first build a pointer-generator baseline model for conclusion generation. We then fine-tune GPT-2, a state-of-the-art language model pre-trained on general-domain data, for this new medical-domain task. Both automatic and human evaluation show that our fine-tuned GPT-2 models generate conclusions of higher quality and correctness than the baseline pointer-generator model. Further inspection points out the limitations of the current approach and directions for future work.

Model

We modified the code from huggingface/pytorch-pretrained-bert and adjusted the attention mask to fine-tune on a seq2seq data format (from source to conclusion).
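A minimal sketch of the idea, assuming the HuggingFace convention that label positions set to -100 are ignored by the cross-entropy loss: the source, a separator token, and the conclusion are packed into one decoder-only sequence, and the language-modeling loss is computed over the conclusion tokens only. The function names here are illustrative, not the repository's API:

# Illustrative sketch, not the repository's actual code.
import torch
import torch.nn.functional as F

def build_example(src_ids, tgt_ids, sep_id):
    # Pack <source> <sep> <conclusion> into one sequence; supervise only the
    # conclusion part by masking source positions out of the loss with -100.
    input_ids = torch.tensor(src_ids + [sep_id] + tgt_ids)
    labels = torch.tensor([-100] * (len(src_ids) + 1) + tgt_ids)
    return input_ids, labels

def lm_loss(logits, labels):
    # Next-token prediction: logits at position t predict the token at t+1,
    # so shift logits left and labels right before computing cross-entropy.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )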

Requirements

  • python3
  • torch>=0.4.0
  • nltk
  • rouge

Install them with:

pip install -r requirements.txt

Usage

Fine-tuning from the official GPT-2 pretrained weights

usage: gpt2_train.py [-h] [--save_model_name SAVE_MODEL_NAME]
                     [--train_file TRAIN_FILE] [--dev_file DEV_FILE]
                     [--n_epochs N_EPOCHS] [--batch_size BATCH_SIZE]
                     [--pred_file PRED_FILE] [--example_num EXAMPLE_NUM]
                     [--mode MODE]

optional arguments:
  -h, --help            show this help message and exit
  --save_model_name SAVE_MODEL_NAME
                        file name for saving the fine-tuned model checkpoint
  --train_file TRAIN_FILE
                        training data file name
  --dev_file DEV_FILE   validation data file name
  --n_epochs N_EPOCHS   number of training epochs
  --batch_size BATCH_SIZE
                        training batch size
  --pred_file PRED_FILE
                        output prediction file name
  --example_num EXAMPLE_NUM
                        output example number, set to `-1` to run all examples
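A hypothetical invocation (the file names below are placeholders, not files guaranteed to ship with the repository):

python3 gpt2_train.py --train_file data/train.csv --dev_file data/dev.csv \
                      --n_epochs 3 --batch_size 4 \
                      --save_model_name rct_gpt2.pt --pred_file dev_pred.txt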

Testing a trained model

usage: gpt2_eval.py [-h] [--model_name MODEL_NAME] [--dev_file DEV_FILE]
                    [--pred_file PRED_FILE] [--example_num EXAMPLE_NUM]

optional arguments:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        pretrained model name or path to local checkpoint
  --dev_file DEV_FILE   validation data file name
  --pred_file PRED_FILE
                        output prediction file name
  --example_num EXAMPLE_NUM
                        output example number, set to `-1` to run all examples
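Again with placeholder file names, a run over the full validation set might look like:

python3 gpt2_eval.py --model_name rct_gpt2.pt --dev_file data/dev.csv \
                     --pred_file pred.txt --example_num -1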

Data

We used the PubMed 200k RCT dataset, which was originally constructed for sequential short-text classification; each sentence in an abstract is labeled as background, objective, methods, results, or conclusions.

We concatenated the background, objective, and results sections of each RCT paper abstract as the model input, and the goal of the model is to generate the conclusions. If hint words are needed, simply concatenate them right after the results section. A transformed sample csv file can be found in data/. A sketch of this preprocessing appears below.
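As a rough illustration of that transformation (the label strings follow the PubMed 200k RCT scheme, but this helper is an assumption, not the repository's code):

# Illustrative preprocessing sketch; not the repository's actual code.
def make_pair(sentences, hint_words=None):
    # sentences: list of (label, text) tuples from one labeled abstract.
    keep = ("BACKGROUND", "OBJECTIVE", "RESULTS")
    source = " ".join(text for label, text in sentences if label in keep)
    if hint_words:
        # Hint words are appended right after the results section.
        source += " " + " ".join(hint_words)
    target = " ".join(text for label, text in sentences if label == "CONCLUSIONS")
    return source, target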

Citation

Please use the following bibtex entry:

@inproceedings{alex2019understanding,
  title     = {Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation},
  author    = {Shieh, Alexander Te-Wei and Chuang, Yung-Sung and Su, Shang-Yu and Chen, Yun-Nung},
  booktitle = {Proceedings of the 10th International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)},
  eprint    = {1910.01462},
  year      = {2019}
}