Research Documentation

Comprehensive documentation of data cleaning, honest assessment, and publication strategy

Research-Grade Neuroimaging Datasets

Prestigious Access

This research utilizes two of the world's most respected neuroimaging repositories

OASIS-1

Open Access Series of Imaging Studies

Open Access

Subjects

436

Age Range

18-96 yrs

Cross-sectional MRI data from Washington University. Freely available for research, enabling reproducible science and global collaboration.

Visit OASIS Project

ADNI-1

Alzheimer's Disease Neuroimaging Initiative

Application Required

Subjects

629

Modalities

MRI, PET+

Controlled Access:Requires formal application and Data Use Agreement approval

Apply for ADNI Access

Why This Research Matters

Multi-Site Validation

Cross-dataset robustness testing across different scanners and protocols

1,065 Total Subjects

Combined analysis from two independent research cohorts

Ethical Compliance

All data obtained through proper institutional agreements

Deep Learning for Neuroimaging

Advancing early dementia detection through multimodal MRI analysis and cross-dataset validation

Data Integrity

100%

Zero leakage verified

7 cleaning steps documented

Subject-wise splits enforced

Honest Results

Level-MAX

0.808 AUC

Level-MAX (with biomarkers)

Level-1: 0.60 (Age/Sex only)

+16.5% with CSF, APOE4, Volumetrics

Publication Ready

0.848 AUC

✅ Beats 0.83 Target

Longitudinal Fusion (Random Forest)

Confirmed via Integrity Audit

Data Cleaning & Preprocessing

Complete enumeration of structural and semantic data cleaning steps

Complete

7 Major Cleaning Steps

Subject-level de-duplication
Baseline-only visit selection
Removal of longitudinal leakage
Subject-wise train/test splitting

Data Flow

ADNI: 1,825 scans → 629 subjects (-65.5%)
OASIS: 436 scans → 205 usable (-52.8%)
Feature intersection: MRI (512) + Clin (2)
Level-1.5 target: + CSF (3) + APOE4 (1)

Key Highlights

Zero Leakage:Temporal, subject, label, and distribution leakage all prevented

Feature Exclusion:MMSE, CDR-SB, ADAS excluded to prevent circular reasoning

Infrastructure & Computational Constraints

Practical limitations that influenced data subset selection

Methodological Note

Storage Requirements

→OASIS-1 raw: 50GB compressed → 70GB extracted
→ADNI-1 raw: Similar size (50GB+ compressed)
→Feature extraction: Intermediate files (preprocessed MRI)
→Model checkpoints: Training artifacts, logs
→Total pipeline: 200GB+

Impact on Research Design

Used baseline-only scans (not full longitudinal)
Extracted features once, stored as .npz (compressed)
Limited to OASIS-1 and ADNI-1 (not OASIS-2/3, ADNI-2/3)
Focused on structural MRI (excluded PET, DTI)

Justification & Context

This is not an excuse - it's a real constraint.

Many researchers face infrastructure limitations. What matters is: (1) we documented this constraint transparently, (2) we ensured the data we DID use was rigorously cleaned, (3) we didn't cherry-pick favorable subsets - we used standard baseline protocols.

Sample size (N=205-629) is comparable to many published studies. Our contribution lies in honest methodology and cross-dataset validation, not maximal dataset size.

What We Did

✓ Selected baseline scans (standard protocol)
✓ De-duplicated subjects rigorously
✓ Used all available baseline data
✓ Documented storage constraints

What We Avoided

✗ Cherry-picking "easy" subjects
✗ Hiding infrastructure limitations
✗ Using only favorable scans
✗ Inflating results with circular features

Honest Project Assessment

Why fusion models underperform and what the results actually mean

Critical Analysis

The Evolution

ADNI Level-1: 0.60 AUC (Age/Sex only)
Level-MAX: 0.808 AUC (+16.5% with biomarkers!)
Level-2 (with MMSE): 0.99 AUC (circular)

Root Causes

Feature quality mismatch (512 strong vs 2 weak)
Dimension imbalance (2 → 32 creates 30 dims of noise)
Small dataset + high variance (N=205-629)
Age as confounder, not biomarker

The Breakthrough

Level-MAX proves fusion works with quality features!

By enriching clinical features from 2D (Age/Sex) to 14D biological profile (CSF, APOE4, Volumetrics), we achieved 0.808 AUC - validating that the fusion architecture was never broken, it just needed complementary biological signals instead of weak demographics.

Level-MAX Achievement

How we achieved competitive 0.808 AUC with biomarker-enhanced fusion

✅ Completed