Our team has developed an automated model capable of identifying multiple end-systolic and end-diastolic frames in echocardiographic videos of arbitrary length with performance indistinguishable from that of human experts, but with significantly shorter processing time.
Datasets
We used three datasets in this study: one for training and testing, and the other two for testing only. We have made our patient dataset and models publicly available, thereby providing a benchmark for future studies and allowing for external validation of our approach.
Additionally, we used annotations (ground-truth) from several expert cardiologists, allowing for the examination of inter- and intra-observer variability.
A summary of the datasets is as follows:
| Name | PACS-dataset | MultiBeat-dataset | EchoNet-dataset |
|---|---|---|---|
| Source | NHS Trust PACS Archives, Imperial College Healthcare (made public for this study) | St Mary’s Hospital (private) | Stanford University Hospital (public; echonet.github.io/dynamic) |
| Ultrasound machine | Philips Healthcare (iE33 xMATRIX) | GE Healthcare (Vivid.i) and Philips Healthcare (iE33 xMATRIX) | Siemens Healthineers (Acuson SC2000) and Philips Healthcare (iE33, Epiq 5G, Epiq 7C) |
| Number of videos/patients | 1,000 | 40 | 10,030 |
| Length of videos | 1-3 heartbeats | ≥ 10 heartbeats | 1 heartbeat |
| Ground-truth | 2 annotations by 2 experts | 6 annotations by 5 experts (twice by one expert) | 1 annotation |
| Original size (pixels) | (300-768)×(400-1024) | 422×636 | 112×112 |
| Frame rate (fps) | 23-102 | 52-80 | 50 |
| Format | DICOM | DICOM | AVI |
| Use | Training/Testing | Testing | Testing |
Network Architecture
Considering the patient image sequences as visual time-series, we adopted Long-term Recurrent Convolutional Networks (CNN+LSTM) for analysing the echocardiographic videos.
Such architectures are a class of models that are both spatially and temporally deep, designed specifically for sequence prediction problems (e.g., ordered sequences of images) with spatial inputs (e.g., the 2D structure of pixels in an image).
The accompanying figure provides an overview of the network architecture.
The model comprises:
(i) a CNN unit: for the encoding of spatial information for each frame of an echocardiographic video input;
(ii) LSTM units: for the decoding of complex temporal information; and
(iii) a regression unit: for the prediction of the frames of interest.
Spatial feature extraction: First, a CNN unit is used to extract a spatial feature vector from every cardiac frame in the image sequence. A series of state-of-the-art architectures was employed for the CNN unit, including ResNet50, InceptionV3, DenseNet, and InceptionResNetV2.
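As a minimal illustration (not the authors' code), this per-frame encoding step can be sketched in TensorFlow 2 as follows; the input resolution, the choice of ResNet50 as the backbone, and the use of pretrained weights are assumptions for the sketch.

```python
import tensorflow as tf

def build_frame_encoder(frame_shape=(224, 224, 3)):
    # A pretrained CNN backbone (ResNet50, one of the architectures listed above)
    # maps a single frame to a fixed-length spatial feature vector.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=frame_shape, pooling="avg"
    )
    # TimeDistributed applies the backbone independently to every frame of a
    # (batch, time, height, width, channels) video tensor.
    return tf.keras.layers.TimeDistributed(backbone, name="frame_encoder")
```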
Temporal feature extraction: The CNN unit above only handles a single image, transforming it from input pixels into an internal matrix or vector representation. LSTM units are therefore used to process the image features extracted from the entire image sequence by the CNN, i.e. interpreting the features across time steps. Stacks of LSTM units (1 to 4 layers) were explored, where the output of each LSTM unit not in the final layer is treated as input to a unit in the next layer.
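A corresponding sketch of the stacked LSTM decoder is shown below; the layer count and hidden size here are illustrative choices, not values reported by the study.

```python
import tensorflow as tf

def build_temporal_decoder(num_layers=2, units=128):
    # Each LSTM layer returns the full sequence so that (a) deeper layers receive
    # one input per timestep and (b) the regression unit can emit one output per frame.
    return tf.keras.Sequential(
        [tf.keras.layers.LSTM(units, return_sequences=True, name=f"lstm_{i + 1}")
         for i in range(num_layers)],
        name="temporal_decoder",
    )
```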
Regression unit: Finally, the output of the LSTM unit is regressed to predict the location of ED and ES frames. The model returns a prediction for each frame in the cardiac sequence (timestep).
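Putting the three units together (reusing the helper sketches above), a hedged end-to-end assembly could look like the following; the sequence length, input resolution and single regression output per frame are assumptions for illustration only.

```python
import tensorflow as tf

def build_model(seq_len=30, frame_shape=(224, 224, 3)):
    inputs = tf.keras.Input(shape=(seq_len, *frame_shape), name="video")
    features = build_frame_encoder(frame_shape)(inputs)        # (batch, time, feature_dim)
    temporal = build_temporal_decoder(num_layers=2)(features)  # (batch, time, units)
    # One regressed value per frame (timestep), indicating the location of ED/ES events.
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1), name="regression"
    )(temporal)
    return tf.keras.Model(inputs, outputs, name="cnn_lstm_ed_es")
```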
Implementation
The models were implemented using the TensorFlow 2.0 deep learning framework and trained using an NVIDIA GeForce® GTX 1080 Ti GPU.
Random on-the-fly augmentation, such as rotation between -10 and 10 degrees and spatial cropping of between 0 and 10 pixels along each axis, was applied to prevent overfitting.
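A rough sketch of such on-the-fly augmentation, applied identically to every frame of a clip, is given below; the use of SciPy and the crop-from-one-edge convention are assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def augment_clip(clip):
    """clip: NumPy array of shape (time, height, width, channels)."""
    angle = np.random.uniform(-10.0, 10.0)   # rotation between -10 and 10 degrees
    crop_y = np.random.randint(0, 11)        # 0-10 pixels cropped along each spatial axis
    crop_x = np.random.randint(0, 11)
    # Rotate every frame in the spatial plane, keeping the frame size unchanged.
    rotated = ndimage.rotate(clip, angle, axes=(1, 2), reshape=False, order=1, mode="nearest")
    # Drop the chosen number of rows/columns (here from the top-left edge).
    return rotated[:, crop_y:, crop_x:, :]
```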
Throughout the study, training was conducted over 70 epochs with a batch size of 2 for all models. The PACS-dataset was used to train the models, with a data split of 60%, 20% and 20% for training, validation and testing, respectively.
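For concreteness, the stated training configuration might map onto Keras as follows; the optimiser, learning rate, loss and data pipeline (tf.data.Datasets of clip/target pairs) are assumptions, not reported details.

```python
import tensorflow as tf

def train(model, train_ds, val_ds):
    # train_ds / val_ds: unbatched tf.data.Datasets yielding (clip, per-frame target) pairs
    # drawn from the 60%/20% training/validation portions of the PACS-dataset.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    model.fit(train_ds.batch(2), validation_data=val_ds.batch(2), epochs=70)
```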
During testing, a sliding window of 30 frames in width with a stride of one was applied, allowing up to 30 predictions of differing temporal importance to be calculated for each timestep. Toward the end of each video, any segment shorter than 30 frames was zero-padded, with the padded frames removed after prediction. Experimentation showed that a stack of 2 LSTM layers was the optimal configuration across all models.
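The sliding-window inference can be sketched as follows; here the predictions covering a given frame are aggregated by a simple mean, whereas the weighting of their "differing temporal importance" used in the study may differ.

```python
import numpy as np

def sliding_window_predict(model, video, window=30):
    """video: array of shape (T, H, W, C); returns one aggregated prediction per frame."""
    num_frames = len(video)
    sums = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for start in range(num_frames):                      # stride of one
        segment = video[start:start + window]
        pad = window - len(segment)
        if pad > 0:
            # Zero-pad segments shorter than the window near the end of the video;
            # predictions for the padded frames are discarded below.
            segment = np.concatenate([segment, np.zeros((pad, *video.shape[1:]))])
        preds = model.predict(segment[np.newaxis])[0, :, 0]
        valid = window - pad
        sums[start:start + valid] += preds[:valid]
        counts[start:start + valid] += 1
    return sums / counts
```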
Evaluation metrics
As the primary endpoint for frame detection, the trained network predictions were evaluated by measuring the difference between each labelled target, either ED or ES, and the predicted timestep.
The Average Absolute Frame Difference (aaFD) was used, i.e. the mean, over the N events in the test dataset, of the absolute difference between the predicted and ground-truth frame locations.
The signed mean (μ) and standard deviation (σ) of the error (i.e. frame differences) were also calculated.
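As a reference implementation of these metrics (with event frame indices as inputs; the function and variable names are illustrative):

```python
import numpy as np

def frame_detection_metrics(predicted, ground_truth):
    """predicted / ground_truth: frame indices of detected ED (or ES) events, one per event."""
    error = np.asarray(predicted) - np.asarray(ground_truth)   # signed frame differences
    aafd = np.mean(np.abs(error))        # average absolute frame difference over N events
    return aafd, error.mean(), error.std()   # aaFD, signed mean (μ), standard deviation (σ)
```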
Results
PACS-dataset: