Research Documentation
Comprehensive documentation of data cleaning, honest assessment, and publication strategy
This research utilizes two of the world's most respected neuroimaging repositories
OASIS-1
Open Access Series of Imaging Studies
Cross-sectional MRI data from Washington University. Freely available for research, enabling reproducible science and global collaboration.
ADNI-1
Alzheimer's Disease Neuroimaging Initiative
Why This Research Matters
Deep Learning for Neuroimaging
Advancing early dementia detection through multimodal MRI analysis and cross-dataset validation
Zero leakage verified
7 cleaning steps documented
Subject-wise splits enforced
Level-MAX (with biomarkers)
Level-1: 0.60 (Age/Sex only)
+16.5% with CSF, APOE4, Volumetrics
✅ Beats 0.83 Target
Longitudinal Fusion (Random Forest)
Confirmed via Integrity Audit
Complete enumeration of structural and semantic data cleaning steps
7 Major Cleaning Steps
- Subject-level de-duplication
- Baseline-only visit selection
- Removal of longitudinal leakage
- Subject-wise train/test splitting
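The listed steps can be illustrated with a minimal sketch. It is not the project's actual pipeline: the record fields `subject_id` and `visit_month`, the helper names, and the toy data are all hypothetical, and "baseline" is assumed here to mean the earliest visit per subject.

```python
import random

def baseline_per_subject(scans):
    """De-duplicate by subject, keeping only the earliest visit (assumed baseline)."""
    earliest = {}
    for scan in scans:
        sid = scan["subject_id"]
        if sid not in earliest or scan["visit_month"] < earliest[sid]["visit_month"]:
            earliest[sid] = scan
    return list(earliest.values())

def subject_wise_split(records, test_fraction=0.2, seed=0):
    """Assign whole subjects to train or test so no subject appears in both."""
    subjects = sorted({r["subject_id"] for r in records})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_ids = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_ids]
    test = [r for r in records if r["subject_id"] in test_ids]
    return train, test

def subject_overlap(train, test):
    """Return subject IDs present in both splits (empty set means no leakage)."""
    return {r["subject_id"] for r in train} & {r["subject_id"] for r in test}

# Hypothetical toy data: 8 scans from 5 subjects, some with follow-up visits.
scans = [
    {"subject_id": "S01", "visit_month": 0},
    {"subject_id": "S01", "visit_month": 12},
    {"subject_id": "S02", "visit_month": 6},
    {"subject_id": "S02", "visit_month": 0},
    {"subject_id": "S03", "visit_month": 0},
    {"subject_id": "S04", "visit_month": 0},
    {"subject_id": "S04", "visit_month": 24},
    {"subject_id": "S05", "visit_month": 0},
]

baseline = baseline_per_subject(scans)
train, test = subject_wise_split(baseline)
leaked = subject_overlap(train, test)
print(len(scans), "scans ->", len(baseline), "subjects; leaked subjects:", leaked)
```

Splitting on subject IDs rather than on individual scans is what keeps a follow-up scan of a training subject from ending up in the test set.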
Data Flow
- ADNI: 1,825 scans → 629 subjects (-65.5%)
- OASIS: 436 scans → 205 usable (-53.0%)
- Feature intersection: MRI (512) + Clinical (2)
- Level-1.5 target: + CSF (3) + APOE4 (1)
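The arithmetic behind these figures can be reproduced with a short sketch. The function and variable names are illustrative only, and the feature totals simply count the listed inputs; they make no claim about how the model combines them internally.

```python
def percent_reduction(before, after):
    """Percentage of items removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Counts taken from the data-flow summary.
adni_reduction = percent_reduction(1825, 629)
oasis_reduction = percent_reduction(436, 205)

# Input feature counts per level (simple tally of the listed groups).
mri_features, clinical_features = 512, 2
csf_features, apoe4_features = 3, 1
level1_inputs = mri_features + clinical_features
level15_inputs = level1_inputs + csf_features + apoe4_features

print(f"ADNI reduction:  {adni_reduction:.1f}%")
print(f"OASIS reduction: {oasis_reduction:.1f}%")
print(f"Level-1 inputs: {level1_inputs}, Level-1.5 inputs: {level15_inputs}")
```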
Key Highlights
Practical limitations that influenced data subset selection
Storage Requirements
- OASIS-1 raw: 50GB compressed → 70GB extracted
- ADNI-1 raw: Similar size (50GB+ compressed)
- Feature extraction: Intermediate files (preprocessed MRI)
- Model checkpoints: Training artifacts, logs
- Total pipeline: 200GB+
Impact on Research Design
- Used baseline-only scans (not full longitudinal)
- Extracted features once, stored as .npz (compressed)
- Limited to OASIS-1 and ADNI-1 (not OASIS-2/3, ADNI-2/3)
- Focused on structural MRI (excluded PET, DTI)
Justification & Context
Many researchers face infrastructure limitations. What matters is that (1) we documented this constraint transparently, (2) we rigorously cleaned the data we did use, and (3) we did not cherry-pick favorable subsets; we used standard baseline protocols.
Our sample sizes (N = 205 for OASIS-1, N = 629 for ADNI-1) are comparable to those of many published studies. Our contribution lies in honest methodology and cross-dataset validation, not maximal dataset size.
- ✓ Selected baseline scans (standard protocol)
- ✓ De-duplicated subjects rigorously
- ✓ Used all available baseline data
- ✓ Documented storage constraints
- ✗ Cherry-picking "easy" subjects
- ✗ Hiding infrastructure limitations
- ✗ Using only favorable scans
- ✗ Inflating results with circular features
Why fusion models underperform and what the results actually mean
The Evolution
- ADNI Level-1: 0.60 AUC (Age/Sex only)
- Level-MAX: 0.808 AUC (+16.5% with biomarkers!)
- Level-2 (with MMSE): 0.99 AUC (circular)
Root Causes
- Feature quality mismatch (512 strong vs 2 weak)
- Dimension imbalance (projecting 2 features to 32 dimensions adds 30 dimensions that carry no new information)
- Small dataset + high variance (N=205-629)
- Age as confounder, not biomarker
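The dimension-imbalance point can be checked numerically. The sketch below assumes the clinical branch is a plain linear projection from 2 inputs to 32 outputs with no bias or nonlinearity, which is a simplification and not necessarily the project's architecture. The weights and inputs are random placeholders. Under that assumption, the 32-dimensional embeddings can never span more than 2 independent directions.

```python
import random

def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def numerical_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

rng = random.Random(0)
# Hypothetical 32x2 projection weights (placeholder for a learned layer).
weights = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(32)]
# Hypothetical inputs: (age in years, sex coded 0/1) for 100 subjects.
inputs = [[rng.uniform(50, 90), float(rng.randint(0, 1))] for _ in range(100)]

embedded = [matvec(weights, x) for x in inputs]  # 100 vectors of length 32
embedding_rank = numerical_rank(embedded)
print("embedding width:", len(embedded[0]), "rank:", embedding_rank)
```

A bias term or nonlinearity changes the exact geometry, but the embedding still depends on only two input values, so the extra width cannot add information that Age and Sex do not already contain.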
The Breakthrough
Enriching the clinical input from a 2D demographic vector (Age, Sex) to a 14D biological profile (CSF, APOE4, volumetrics) raised performance to 0.808 AUC. This suggests that the fusion architecture was never broken; it needed complementary biological signals rather than weak demographics.
How we achieved competitive 0.808 AUC with biomarker-enhanced fusion
What We Implemented
- 14D Biological Profile (Level-MAX)
- CSF biomarkers (ABETA, TAU, PTAU)
- APOE4 genetic risk factor
- 7 Volumetric measures (Hippocampus, etc.)
- Still honest (no cognitive scores)
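One way to enforce the "no cognitive scores" rule is a guard that rejects circular features before training. This is a hedged sketch, not the project's code: the function name and the candidate column names are hypothetical, and beyond MMSE (named on this page as circular) the blocked entries `CDRSB` and `ADAS13` are assumed examples of cognitive scores.

```python
# MMSE is identified on this page as circular; the other entries are
# assumed examples of cognitive scores and may not match the real schema.
BLOCKED_COGNITIVE_SCORES = {"MMSE", "CDRSB", "ADAS13"}

def split_circular(feature_names, blocked=BLOCKED_COGNITIVE_SCORES):
    """Separate candidate features into allowed and rejected (circular) lists."""
    allowed = [f for f in feature_names if f.upper() not in blocked]
    rejected = [f for f in feature_names if f.upper() in blocked]
    return allowed, rejected

# Hypothetical candidate columns; only a subset of a full profile is shown.
candidates = ["Age", "Sex", "ABETA", "TAU", "PTAU", "APOE4", "Hippocampus", "MMSE"]
allowed, rejected = split_circular(candidates)
print("allowed:", allowed)
print("rejected:", rejected)
```

Running the guard on every feature list before training makes the exclusion auditable instead of relying on convention.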
Achieved Results
Publishable: Yes (competitive result)
Week-by-Week Plan
Download the full markdown files for thesis integration
Final integrity audit and methodological proofs