SKIMLIT

  • Tech Stack: Python, TensorFlow, NLP, Deep Learning
  • GitHub URL: Project Link

Welcome to my individual project, where I design a model that combines token, character, and positional embeddings for sequential sentence classification in medical abstracts.

In this project, we're going to replicate the deep learning model behind the 2017 paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. Reading through that paper, we see that the model architecture used to achieve their best results is described in Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. Alongside the paper, the authors released a new dataset called PubMed 200k RCT, which consists of ~200,000 labelled Randomized Controlled Trial (RCT) abstracts. The goal of the dataset was to explore the ability of NLP models to classify sentences which appear in sequential order.
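For reference, the dataset is distributed as plain-text files where each abstract begins with a "###<id>" line followed by one "LABEL\tsentence" line per sentence, with blank lines between abstracts. Here is a minimal parsing sketch; the file path and helper name are my own assumptions, not something fixed by the paper:

```python
def preprocess_text_with_line_numbers(filename):
    """Returns a list of dicts with each sentence's label, text and position in its abstract."""
    with open(filename, "r") as f:
        input_lines = f.readlines()

    abstract_samples = []
    abstract_lines = ""

    for line in input_lines:
        if line.startswith("###"):   # new abstract ID -> reset the buffer
            abstract_lines = ""
        elif line.isspace():         # blank line -> end of abstract, emit its sentences
            abstract_line_split = abstract_lines.splitlines()
            for line_number, abstract_line in enumerate(abstract_line_split):
                target, text = abstract_line.split("\t", 1)
                abstract_samples.append({
                    "target": target,                           # e.g. "METHODS"
                    "text": text.lower(),
                    "line_number": line_number,                 # position of the sentence in its abstract
                    "total_lines": len(abstract_line_split) - 1,
                })
        else:
            abstract_lines += line

    return abstract_samples

# Example usage (path is an assumption):
# train_samples = preprocess_text_with_line_numbers("pubmed-rct/PubMed_20k_RCT/train.txt")
```

Keeping the line number and total line count for each sentence is what later lets the model use positional embeddings on top of the token and character features.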

Creating a series of model experiments

  • Model 1: Conv1D with token embeddings
  • Model 2: TensorFlow Hub Pretrained Feature Extractor
  • Model 3: Conv1D with character embeddings
  • Model 4: Combining pretrained token embeddings + character embeddings (hybrid embedding layer)
  • Model 5: Transfer learning with pretrained token embeddings + character embeddings + positional embeddings (see the architecture sketch below)
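Below is a minimal Keras Functional API sketch of Model 5's tribrid architecture. It assumes the Universal Sentence Encoder from TensorFlow Hub as the pretrained token feature extractor, a character-level bidirectional LSTM, and one-hot encoded line_number / total_lines positional features; the one-hot depths (15 and 20), layer sizes, and variable names are my assumptions rather than values fixed by the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_hub as hub

NUM_CLASSES = 5  # e.g. BACKGROUND, OBJECTIVE, METHODS, RESULTS, CONCLUSIONS

# 1. Token-level branch: pretrained Universal Sentence Encoder as a frozen feature extractor
token_inputs = layers.Input(shape=[], dtype=tf.string, name="token_inputs")
token_embeddings = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4", trainable=False
)(token_inputs)
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)

# 2. Character-level branch: char vectorizer + embedding + bidirectional LSTM
char_inputs = layers.Input(shape=(1,), dtype=tf.string, name="char_inputs")
char_vectorizer = layers.TextVectorization(max_tokens=70, output_sequence_length=290)
# In practice, adapt on the training characters: char_vectorizer.adapt(train_chars)
char_vectorizer.adapt(["e x a m p l e   c h a r s"])  # placeholder adapt so the sketch builds
char_vectors = char_vectorizer(char_inputs)
char_embeddings = layers.Embedding(input_dim=70, output_dim=25, mask_zero=True)(char_vectors)
char_outputs = layers.Bidirectional(layers.LSTM(24))(char_embeddings)

# 3. Positional branches: one-hot encoded line number and total lines of the abstract
line_number_inputs = layers.Input(shape=(15,), dtype=tf.float32, name="line_number_inputs")
line_number_outputs = layers.Dense(32, activation="relu")(line_number_inputs)

total_lines_inputs = layers.Input(shape=(20,), dtype=tf.float32, name="total_lines_inputs")
total_lines_outputs = layers.Dense(32, activation="relu")(total_lines_inputs)

# 4. Combine token + char embeddings, then add positional features before the classifier
combined_embeddings = layers.Concatenate(name="token_char_hybrid")([token_outputs, char_outputs])
z = layers.Dropout(0.5)(combined_embeddings)
z = layers.Concatenate(name="tribrid_embedding")([line_number_outputs, total_lines_outputs, z])
outputs = layers.Dense(NUM_CLASSES, activation="softmax", name="output_layer")(z)

model = tf.keras.Model(
    inputs=[line_number_inputs, total_lines_inputs, token_inputs, char_inputs],
    outputs=outputs,
)
model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])
```

Models 1 through 4 are progressively simpler variants of this sketch: Model 1 keeps only a token branch with a Conv1D over trainable token embeddings, Model 2 swaps in the TF Hub feature extractor, Model 3 keeps only the character branch, and Model 4 concatenates the token and character branches without the positional features.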