Comparison of manual and semi-automated delineation of regions of interest for radioligand PET imaging analysis

Background: As imaging centers produce higher resolution research scans, the number of man-hours required to process regional data has become a major concern. Comparison of automated vs. manual methodology has not been reported for functional imaging. We explored validation of using automation to delineate regions of interest on positron emission tomography (PET) scans. The purpose of this study was to ascertain improvements in image processing time and reproducibility of a semi-automated brain region extraction (SABRE) method over manual delineation of regions of interest (ROIs).
Methods: We compared 2 sets of partial volume corrected serotonin 1a receptor binding potentials (BPs) resulting from manual vs. semi-automated methods. BPs were obtained from subjects meeting consensus criteria for frontotemporal degeneration and from age- and gender-matched healthy controls. Two trained raters provided each set of data to conduct comparisons of inter-rater mean image processing time, rank order of BPs for 9 PET scans, intra- and inter-rater intraclass correlation coefficients (ICC), repeatability coefficients (RC), percentages of the average parameter value (RM%), and effect sizes of either method.
Results: SABRE saved approximately 3 hours of processing time per PET subject over manual delineation (p < .001). Quality of the SABRE BP results was preserved relative to the rank order of subjects by manual methods. Intra- and inter-rater ICC were high (>0.8) for both methods. RC and RM% were lower for the manual method across all ROIs, indicating less intra-rater variance across PET subjects' BPs.
Conclusion: SABRE demonstrated significant time savings and no significant difference in reproducibility over manual methods, justifying the use of SABRE in serotonin 1a receptor radioligand PET imaging analysis. This implies that semi-automated ROI delineation is a valid methodology for future PET imaging analysis.


Background
Advances in functional neuroimaging techniques have allowed the correlation of regions of interest (ROIs) with behavioral and cognitive tasks. Manual delineation of ROIs by trained operators is still considered the "gold standard," given its precision for the targets; however some drawbacks of manual analysis have recently been pointed out, such as its labor-intensive requirements (i.e., extensive time needed for ROI drawing) [1], limited reproducibility [2], and difficulties in measuring cortical ROIs [3]. In order to resolve these problems, some researchers have suggested other methods of analysis as represented by an automated program to label brain regions [4], automated evaluation of the whole brain [5], and automated voxel-based morphometry [6]. Unfortunately, these alternatives also are limited by ROIs available [4,5] and the potential inaccuracy introduced by spatial normalization of the brain [7]. The semiautomatic brain region extraction (SABRE) method was designed by Dade et al. to minimize the errors of both manual and automated analysis [1].
SABRE combines manual and automated analyses, which maximizes the advantages of both methods by manual definition of the most essential landmarks to create a customized atlas for the individual brain and automatic brain parcellation. SABRE has proven reliable in assessing regional tissue volume, and it provides time savings over purely manual methods.
The present study compares the benefits of the SABRE method to manual ROI delineation. We searched PubMed for similar studies using the search terms: "automated brain region extraction," "brain region extraction," "manual ROI AND automated," "region of interest delineation," "SABRE," "semiautomated brain region extraction," and "semiautomatic brain region extraction." This yielded 491 citations. Of these, 5 described research questions similar to ours [8][9][10][11]. Three studies reported the effects of semi-automated methods vs. manual delineation methods for structural or volumetric MRI results for limited regions of brain such as hippocampus [10,11] or ventricular cerebrospinal fluid volume [9]. One of the hippocampal studies required manual delineation on the subject's first MRI, then used automated algorithms to gauge longitudinal volumetric changes from the original, individualized template [11]; the other hippocampal study used a novel expanding seed voxel with constraint points to identify 3D volumes of interest from the inside out [10]. Mosconi et al. validated automated voxel-based FDG-PET analysis including spatial normalization of hippocampal probability ROIs [12]. Only Mega et al. described a parcellation of brain into cortical regions as SABRE does [8]. Their sample also included subjects with cortical atrophy due to neurodegenerative processes, but the imaging process requires warping to a standardized volumetric brain space. Studies comparing ROI extraction reported positive conclusions in favor of using automation to save time [8,10,11] or achieving similar accuracy to manual methods [8][9][10][11], but none of them have validated the use of semi-automated methods to process functional imaging data or to process multiple cortical regions without warping. During the revision of this manuscript for publication, a paper describing a fully automated ROI extraction for use with PET imaging was published by Rusjan et al. [35].
The authors devised a fully automated method which showed time savings over manual methods and very high intraclass correlation between the two methods for use with three different radioligands. This method does not allow for individualization of intracranial capacity as in SABRE, which will be discussed below.
This study is a first-time application of SABRE to a positron emission tomography (PET) study of patients with frontotemporal degeneration (FTD). As PET scanners evolve to yield larger numbers of image slices, the man-hours required to delineate ROIs for each subject become impractical. We wished to validate the use of SABRE in analyzing our PET data. Assuming that manual analysis is the gold standard, we compared the PET results generated by manual ROI drawing to those by SABRE, on the basis of analysis time, effect of analysis method on PET results, reproducibility, and ability to discriminate FTD patients from healthy control subjects. We hypothesized that SABRE would save image processing time without altering the basic quality of PET results but that SABRE would be less sensitive to detect the differences between an FTD patient group and an age-matched healthy comparison group. We also hypothesized that SABRE's test-retest reproducibility would be superior to the manual method, which might compensate for any loss of sensitivity. Balancing these characteristics might allow investigators to choose the more feasible and statistically useful procedure for future PET analyses of an FTD population.

Methods

Participants
We used data from 9 participants in a study comparing serotonin 1a receptor (5-HT1aR) density as estimated by radioligand binding potentials (BPs) from PET imaging data [13]. We studied 5 patients with FTD diagnosed by consensus criteria [14], with illness duration of 3-6 years. They were 1 man and 4 women, aged 59-79 years, with MMSE scores of 16-30 and CDR scores of 0.5. We also studied 4 age- and gender-matched healthy comparison subjects (1 man and 3 women, age range 63-80 years). The study procedures were reviewed and approved by Research Ethics Boards at all participating institutions. All 9 subjects or their substitute decision makers gave informed consent to participate in the study.

MRI data acquisition
Imaging procedures: We conducted structural MR imaging on a 1.5 T Signa research-dedicated scanner (GE Medical Systems, software v. 8.4M4, with CV 40 mT/m gradients) at Sunnybrook Health Sciences Centre. We acquired a high-resolution T1-weighted image (an axial 3D SPGR with 5 ms TE, 35 ms TR, 1 NEX, 35° flip angle, 22 × 16.5 cm FOV, 0.859 × 0.859 mm in-plane resolution, and 1.2 to 1.4 mm slice thickness depending on head size). This was followed by an interleaved proton density (PD) and T2-weighted image set (an interleaved axial spin echo with TEs of 30 and 80 ms, 3 s TR, 0.5 NEX, 20 × 20 cm FOV, 0.781 × 0.781 mm in-plane resolution, and 3 mm slice thickness). The T1-weighted and PD/T2-weighted imaging parameters have been selected to provide optimal intensity separation and are routinely used for tissue segmentation [15].

Serotonin 1a receptor (5-HT1aR) PET acquisition
PET scans with the radioligand [11C]WAY-100635, a 5-HT1aR antagonist, were performed within 3 months of the MRI scans. Specific activity at time of intravenous injection of the radioligand averaged 793 ± 373 mCi/μmol. PET images were acquired for 15 transaxial slices (slice thickness of 6.5 mm) over 90 minutes with a GE Medical System PC-2048-15B camera with 5.5 mm intrinsic resolution FWHM.
Co-registered MR and PET images were used for semiautomated and manual ROI delineation, as described below.
Cortical atrophy challenges the accurate interpretation of functional images from patients with dementia. A partial volume correction (PVC) method has been adapted to correct WAY-PET imaging resolution issues. The PVC algorithm corrects for atrophy, spill-in effects, and spill-out effects. This is a variation of the algorithm by Bencherif et al. [23], modified so that calculations are performed in higher-resolution MR space. The algorithm was used to create a map of gray matter (GM) vs. non-GM pixels for each subject. We then applied the hand-drawn ROIs to the map and used Alice™ to calculate average correction factors for each ROI. We applied the corresponding correction factors to the initially derived BPs and report here the corrected BPs.

SABRE ROI delineation
The SABRE method uses a robust tissue segmentation protocol, which accounts for radiofrequency (RF) inhomogeneities, noise, and partial volume effects [15]. When tested using the Montreal Neurological Imaging phantom, the coefficient of total agreement with increased noise and RF inhomogeneity levels was 0.97. When tested on young normal controls and elderly Alzheimer's disease patients, the maximal differences were less than 1% of total intracranial capacity in all tissue classes in a scan-rescan test.
The SABRE process begins with segmentation of the MRI data into GM, white matter, ventricular cerebrospinal fluid (CSF), and subdural/sulcal CSF (ssCSF) [1,15]. These segmentation data were also used for the PVC algorithm described above. First, the operator subtracted the nonbrain tissue (e.g., skull) from the T1 MR images to extract the T1 intracranial cavity (T1 eroded images). Identification of 15 landmarks on the 3D-rendered T1 images (e.g., anterior commissure (AC), central sulcus) with ANALYZE software (Biomedical Imaging Resource, Mayo Clinic, Rochester, Minnesota) yields a proportional Talairach grid of each individual's eroded T1 images [1]. Using the resulting proportional grid and defined landmarks, the SABRE program parcellates the eroded T1 images automatically into 26 zones (13 in each hemisphere).
To convert the SABRE zones into ROIs, we used AIR (version 5.5) to yield the optimal matrix for co-registration of the T1 masked images to the summed 0-90 minute PET frames (15 transaxial slices) [24]. We restricted the SABRE zones to GM-only portions, outlining them automatically with ANALYZE, because SABRE zones must be converted from opaque square fields to Alice-compatible outlines. See Figure 1 for an illustration of manual vs. SABRE-generated outlines of ROIs.
The same SABRE ROIs were applied to the GM vs. non-GM PET data map to derive SABRE-specific average correction factors for each ROI. These were then applied to the TAC data to derive the second set of partial volume corrected BPs for comparison to the manual ROI data. These procedures were performed by a neurosurgeon (ST, Rater S1) and a highly trained technician (JR, Rater S2).

Derivation of binding potential (BP)
Manual and SABRE ROIs became the overlays applied to the dynamic PET images to calculate BP values of each ROI with Alice and PKIN/PMOD software (PMOD group, Zürich, Switzerland) [25,26]. A simplified reference tissue method (SRTM) was performed to obtain BP values, using the cerebellum as the input function [27], given previous findings that the cerebellum is relatively devoid of 5-HT1aR [28] and that this method has been proven to be superior to kinetic modeling using arterial data [29].
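For orientation, the operational equation commonly cited for the simplified reference tissue model is sketched below. The equation is not stated in this paper, and the symbols are standard notation introduced here as assumptions: C_T(t) is the time activity curve of the target ROI, C_R(t) is that of the reference region (here, the cerebellum), R_1 is the ratio of tracer delivery to the target relative to the reference region, k_2 is the efflux rate constant of the target region, BP is the binding potential, and ⊗ denotes convolution.

```latex
C_T(t) = R_1\, C_R(t)
       + \left( k_2 - \frac{R_1 k_2}{1 + \mathrm{BP}} \right)
         C_R(t) \otimes \exp\!\left( -\frac{k_2\, t}{1 + \mathrm{BP}} \right)
```

Fitting R_1, k_2, and BP to the measured curves for each ROI yields the regional BP without arterial sampling, which is why the choice of a reference region nearly devoid of receptors matters.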

Statistics
Because the BP itself is a relative estimation, and the SABRE regions are inexact proxies of the hand-drawn ROIs, we did not seek direct correlations between BPs from manual vs. SABRE ROIs. Instead, we compared the methods with regard to: 1) image processing time, 2) basic quality of PET results, 3) reproducibility of BP results, and 4) sensitivity to differentiate FTD patients from comparison subjects based on BP results.
We averaged inter-rater processing times for each method (i.e., (M1A + M2)/2 and (S1A + S2)/2), then compared the mean time spent to process the data using either manual or SABRE methods with the unpaired Student's t-test, as we expected SABRE to save time.
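The averaging and comparison described above can be illustrated with a minimal Python sketch. The function name `unpaired_t` and all timing values are hypothetical and are not the study's data; the sketch computes only the pooled-variance t statistic and its degrees of freedom, not the p-value.

```python
import math
import statistics


def unpaired_t(a, b):
    """Return Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical processing times in hours per subject (NOT the study's data).
m1a = [4.0, 4.4, 4.2, 4.6]
m2 = [4.2, 4.6, 4.0, 4.4]
s1a = [1.2, 1.1, 1.3, 1.2]
s2 = [1.2, 1.3, 1.1, 1.2]

# Average the two raters per subject, i.e. (M1A + M2)/2 and (S1A + S2)/2.
manual_mean = [(x + y) / 2 for x, y in zip(m1a, m2)]
sabre_mean = [(x + y) / 2 for x, y in zip(s1a, s2)]

t_stat, df = unpaired_t(manual_mean, sabre_mean)
print(f"t = {t_stat:.2f}, df = {df}")
```

The p-value would then be read from the t distribution with `df` degrees of freedom, for example in a statistics package.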
We evaluated the basic quality of PET results by calculating the overall rank order of BPs among 9 subjects in each ROI with the Spearman rank correlation test. We calculated ratios of BPs in the left regions to those of the corresponding right regions (L/R ratios) for comparison of the two methods with the Wilcoxon signed rank test.
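As a rough illustration of these two quality checks, the following Python sketch computes a Spearman rank correlation (average ranks for ties, then a Pearson correlation of the ranks) and L/R ratios. The helper names and all BP values are hypothetical; the Wilcoxon signed rank test itself is not implemented here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical BPs for one ROI across five subjects (NOT the study's data).
manual_bp = [3.1, 2.4, 4.0, 2.9, 3.6]
sabre_bp = [2.8, 2.5, 3.7, 2.6, 3.9]
rho = spearman_rho(manual_bp, sabre_bp)

# Hypothetical left and right BPs for one ROI in two subjects.
left_bp = [3.0, 2.2]
right_bp = [2.5, 2.0]
lr_ratios = [l / r for l, r in zip(left_bp, right_bp)]
```

A rho near 1 means both methods order the subjects similarly, which is the sense of "basic quality" used here; an L/R ratio near 1.0 indicates symmetric binding.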
We assessed the reproducibility of the tested analysis methods based on intra- and inter-rater reliability for the BPs. We rated intra-rater reliability using intraclass correlation coefficients (ICC), repeatability coefficients (RC), and percentages of the average parameter value (RM%, a measure of coefficient of variation of the difference between the methods). ICC were based on the 1st and 2nd results generated by the same rater (M1A vs. M1B and S1A vs. S1B). To determine RM%, we first calculated the RC as twice the standard deviation (SD) of the difference between the average BP values for each of the ROIs from the 1st and 2nd analyses (e.g., M1A and M1B), expecting 95% of the differences to be less than the RC [30]. In addition, to facilitate comparisons across regions, the RC was calculated as percentage of the mean BP to obtain the RM% [27]:
$$\mathrm{RM\%} = \frac{2 \times \mathrm{SD}(\mathrm{BP}_{\mathrm{scan1}} - \mathrm{BP}_{\mathrm{scan2}})}{\mathrm{Mean}(\mathrm{BP}_{\mathrm{scan1}}, \mathrm{BP}_{\mathrm{scan2}})} \times 100$$
where BPscan1 and BPscan2 denote the BP values from the 1st and 2nd analyses.
Inter-rater reliability was assessed by calculating ICC between manual results M1A vs. M2 and SABRE results from S1A vs. S2. We used the SPSS menu path Analysis: Scale: Reliability: Statistics: ICC to make these calculations. We used the Wilcoxon signed rank test to compare the resulting average ICC for manual vs. SABRE results.
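The intra-rater measures defined above can be sketched in Python as follows. RC and RM% follow the verbal definitions in the text. The ICC variant is an assumption: the text does not state which ICC model SPSS was configured to use, so this sketch uses a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1). All function names and BP values are hypothetical.

```python
import statistics


def repeatability(first, second):
    """Return (RC, RM%) for paired BP values from two analyses."""
    diffs = [a - b for a, b in zip(first, second)]
    rc = 2 * statistics.stdev(diffs)           # twice the SD of the differences
    mean_bp = statistics.mean(list(first) + list(second))
    return rc, 100 * rc / mean_bp              # RC as a percentage of mean BP


def icc_2_1(first, second):
    """ICC(2,1) for two ratings per subject (ASSUMED model, see lead-in)."""
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = (sum(first) + sum(second)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(first) / n, sum(second) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                    # between-subject mean square
    msc = ss_cols / (k - 1)                    # between-analysis mean square
    mse = ss_err / ((n - 1) * (k - 1))         # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical BPs for one ROI from a rater's 1st and 2nd analyses.
bp_first = [3.0, 2.0, 4.0, 3.5, 2.5]
bp_second = [3.1, 2.1, 3.8, 3.6, 2.4]

rc, rm_percent = repeatability(bp_first, bp_second)
icc = icc_2_1(bp_first, bp_second)
```

Smaller RC and RM% indicate tighter agreement between a rater's two analyses, whereas a higher ICC indicates that between-subject differences dominate the measurement error.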

Sensitivity to differentiate FTDs from controls
We used two indicators to assess the ability of the methods to differentiate FTD patients from healthy comparison subjects. At autopsy, FTD patients have significant reductions in serotonin receptor densities [31,32]; we expected to find similar losses, reflected as lower BPs, during the course of illness. First, we compared the mean BPs of the two groups for each ROI with unpaired t-tests. We also defined Cohen's measures for the effect size (d) as the average BP value for the FTD group minus that for comparison subjects divided by the standard deviation for the pooled samples [33]. We compared the calculated d values between manual and SABRE methods with the Wilcoxon signed rank test.
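The effect size defined above can be sketched in Python as follows. The pooled standard deviation is computed here with the usual sample-size-weighted formula, which is an assumption about what "standard deviation for the pooled samples" means in the text; the function name and BP values are hypothetical.

```python
import math
import statistics


def cohens_d(patients, controls):
    """Cohen's d: (patient mean - control mean) / pooled SD (assumed formula)."""
    n1, n2 = len(patients), len(controls)
    pooled_var = ((n1 - 1) * statistics.variance(patients)
                  + (n2 - 1) * statistics.variance(controls)) / (n1 + n2 - 2)
    return (statistics.mean(patients) - statistics.mean(controls)) / math.sqrt(
        pooled_var)


# Hypothetical BPs for one ROI (NOT the study's data): 5 patients, 4 controls.
ftd_bp = [2.0, 2.5, 3.0, 3.5, 4.0]
control_bp = [3.0, 3.5, 4.0, 4.5]
d = cohens_d(ftd_bp, control_bp)
print(f"d = {d:.2f}")
```

With this sign convention, a negative d corresponds to lower BPs in the FTD group, which is the expected direction of the difference.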
We used SPSS (version 15.0, SPSS, Inc., Chicago, Illinois) for all statistical analyses.

Results
Table 1 shows the range of mean 5-HT1aR BP values. Mean BPs after partial volume correction were similar between manual and SABRE methods, without significant differences between FTD and control BPs.

Image processing time
The image processing time for SABRE had statistically and practically significant savings over the manual method (p < 0.0001): S1A/S2 1.2 ± .08 hours per subject vs. M1A/M2

Figure 1. Region of interest delineation: a) manual, b) using SABRE.

Basic quality of PET results
We found significant positive correlations between the rank order of BPs among the 9 tested subjects in the majority of rater × method comparisons, but measurements for anterior lateral and medial temporal regions were less similar (see Table 2).
In the comparison of L/R ratios of BPs, the orbitofrontal cortex (OFC) showed the highest average L/R ratio in both manual and SABRE results for all 6 sets of rater measurements of BPs (see Table 3). One subject had a very small right OFC, which led to higher variance in both manual and SABRE measurements. When this subject's data were excluded, the SABRE averages for raters S1A, S1B, and S2 were more similar at 1.16, 1.15, and 1.13, respectively. Most L/R ratios were very close to 1.0. After excluding the outlier, there were no statistically significant Wilcoxon results.

Reproducibility of BP results
SABRE methods achieved average intra-rater (S1A vs. S1B) ICC values similar to the manual methods (see Table 4), but Wilcoxon rank testing showed significant differences in average RC and RM%, supporting manual methods as more reliable when examining intra-rater performance. As shown in the table, the RM% had a wide range across ROIs.
SABRE results yielded high ICC values for inter-rater reliability in general (see Table 5). In comparison, ICC values for the manual method were slightly lower on average, ranging 0.79-0.87. As opposed to the DLPFC, the lowest ICC for the manual ratings were in

Sensitivity to differentiate FTD patients from healthy comparison subjects
Average BP values of FTD patients did not differ from those of healthy comparison subjects according to the unpaired Student's t-test, regardless of the method used to delineate ROIs. Cohen's d values (effect sizes) for the SABRE method were higher than for the manual method across all ROIs (see Table 6). SABRE-derived d's exceeded manually-derived d's with p < 0.05, except in the comparison against rater M2. This particular finding supports the lower inter-rater reliability of the manual method. Effect sizes for left and right OFC were larger than for other ROIs. As in Table 3, the right OFC BP from one subject was thought to be an outlier. Values for right OFC effect sizes when this subject was excluded still varied greatly, reflecting the difficulty of measuring BP when the ROI is very small: M1A 0.24, M1B -0.19, M2 -0.28, S1A -0.37, S1B -0.39, and S2 -0.55.

Discussion
Prior studies have shown that automated methods of ROI delineation can be accurate and time-saving for structural volumetric analyses [8][9][10][11]. Our present results indicate that the SABRE method also saves time for functional radioligand PET analysis without altering the basic quality of the results as compared to the gold standard, manual ROI analysis. Intra-rater ICC and reliability were greater for manual methods than SABRE, exceeding reliability criteria pegging acceptable ICC values at a range of 0.75-0.80 [34]. Inter-rater ICC also met acceptable ICC value criteria, with the exception of manual anterior lateral temporal ROIs. The inter-rater reproducibility of PET results using SABRE was at least as high as that using the manual method. Although SABRE failed to significantly discriminate FTD patients from healthy comparison subjects, which may be related to the small sample size, higher d values for SABRE imply that SABRE can detect the expected 5-HT1aR BP differences between FTD patients and comparison subjects more sensitively than manual analysis.
The image processing time savings are amplified for datasets where more than 15 slices are available: compared to more current scanners that would afford 124 slices, our approximately 3 hour difference between methods would translate to at least an 8-hour saving (2 hours for SABRE vs. at least a 10 hour manual task).
Our results were very similar to those from the fully automated ROI extraction method of Rusjan et al. [35], except that they were able to demonstrate higher intra- and inter-rater reliability for the automated method by virtue of full automation obviating normal human variation. We would have expected the SABRE method to grant advantages over the fully automated method due to use of individualized ROI extraction instead of subjecting MRI data to a normalization procedure prior to co-registration with the PET data, but we encountered problems with reliability of SABRE in the temporal lobe regions. Both of our studies recommend automated methods as a time-saving method for ROI extraction without significant cost to accuracy.
Limitations of this study include low sample size and difficulty pinpointing the differences between methods specific to the manual vs. SABRE aspect. Ideally, a validation study would include a larger group of imaging data, as well as more inter-rater comparisons. Using a small number may bias our search for similarity of data quality in favor of SABRE. A larger sample would make the analysis less vulnerable to outliers. Only Mega et al.'s study [8] included 20 subjects (more than twice our sample), consisting of both patients with neurodegenerative disease and controls.
Inclusion of both subjects with moderate to severe cortical atrophy due to FTD and healthy controls with little or no atrophy may have compensated for the small sample size by creating a varied landscape over which both methods had to perform, but the atrophic ROIs may have complicated reproducibility of anterior lateral temporal, DLPFC, and right OFC delineation. A further important limitation is our method of correcting for partial volume effects, in which we applied correction factors to the regional BPs and not to the individual data points along the time activity curve (TAC). Our partial volume effect method suffices for the purpose of our comparison, but most investigators will perform partial volume effects compensation at an earlier data modeling step.
Differences between the methods may be related to aspects of image processing other than the actual delineation of the ROIs. We used Rview for co-registration of the MRI to the PET images for the manually derived data and AIR for the SABRE data. Software also differed for tracing the ROIs: manual raters used Alice; SABRE raters used ANALYZE. These software variations are difficult to include as covariates in the analysis and cannot be ruled out as confounders. It would be difficult to conceive of a significant impact of the software upon the time saved in image processing.
The validation results reported here only apply to this specific experimental setup. It is not known how the accuracy of the procedure is affected by errors in MRI segmentation and/or MRI-PET coregistration, which differ when other radiotracers or segmentation and coregistration strategies are used.
Our findings that SABRE saved time over manual drawing of multiple ROIs are not surprising; the most similar studies in the literature are in agreement [8][9][10][11]. Because structural landmarks are the bases of ROIs processed in the interpretation of PET images, it seemed consistent to find that reproducibility for the SABRE method was equivalent to manual methods. The SABRE method requires identification of fewer anatomical landmarks (15), as opposed to boundaries for each of 10 ROIs in the manual process (approximately 60 localizations, see Appendix) and therefore should leave less room for variation between raters. Ashton et al.'s valid concern about error due to the tracking between slices required from 2D techniques [10] could not be evaluated in our comparison, as derivation of data for the BP measurements uses 2D techniques and would therefore be exposed to the same types of edge detection limitations.

Conclusion
This first account of semi-automated ROI delineation improving on manual methods in processing functional neuroimaging data validates the use of SABRE for future PET studies where the analysis relies upon hypothesis-based inquiry of ROIs. Investigators are cautioned about the potential for reduced reliability using either method when studying ROIs featuring marked atrophy in patient subjects.