A rigorous four-stream ensemble architecture combining remote photoplethysmography (rPPG) signal analysis with EfficientNet-B4, Xception, and Swin Transformer spatio-temporal models for state-of-the-art deepfake video detection.
MediaPipe FaceMesh → CHROM rPPG → 117-feature extraction → XGB+LGB+HGB Stacking Ensemble
MTCNN face detection → EfficientNet-B4 → BiLSTM (2L, 256H) → Multi-Head Attention → Binary
MTCNN + Alignment → Xception (2048d) + ECA + Freq Branch → BiLSTM → Fused Binary
MTCNN + Alignment → Swin-Tiny (768d) + ECA + DCT 128-dim → pack_padded BiLSTM → Binary
A single master_dataset_index.csv is compiled once by a unified data compiler.
This guarantees identical video-level alignment across all four streams, enabling leakage-free late fusion.
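A minimal sketch of what such a compiler produces, one row per video with a shared `video_id` key. The file paths and the `video_id` derivation are illustrative assumptions; only the identity patterns and the output filename come from this document.

```python
import pandas as pd

# Hypothetical sketch: compile one row per video so every stream
# trains and evaluates against the same video-level index.
rows = [
    # (path, label, identity) -- identities follow the patterns in the table below
    ("ffpp/original/000.mp4",         0, "FF_person_000"),
    ("ffpp/deepfakes/000_003.mp4",    1, "FF_person_000"),
    ("celebdf/Celeb-real/id0_01.mp4", 0, "Celeb_person_id0"),
]
master = pd.DataFrame(rows, columns=["path", "label", "identity"])
# Derive a unique video_id from the path (assumed scheme, not the project's exact one)
master["video_id"] = (master["path"]
                      .str.replace("/", "_", regex=False)
                      .str.replace(".mp4", "", regex=False))
master.to_csv("master_dataset_index.csv", index=False)
```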
| Dataset | Subset | Label | Max Samples | Identity Pattern |
|---|---|---|---|---|
| FaceForensics++ | Original | Real | 200 | FF_person_{id} |
| FaceForensics++ | Deepfakes, Face2Face, FaceSwap, NeuralTextures, FaceShifter, DeepFakeDetection | Fake | 200 each | FF_person_{id} |
| Celeb-DF v2 | Celeb-real + YouTube-real | Real | 150 + 50 | Celeb_person_{id} |
| Celeb-DF v2 | Celeb-synthesis | Fake | 200 | Celeb_person_{id} |
| Custom Dataset | real_videos / deepfake_videos | Real / Fake | 400 each | Custom_person_{id} |
| DFDC Sample | metadata.json driven | Real / Fake | Balanced | DFDC_{basename} |
Splits use StratifiedGroupKFold, where groups are person identities extracted from filenames.
This guarantees that no person's clips, real or fake, appear in both train and validation, preventing face memorisation
(a primary source of inflated metrics in the deepfake literature). Identity-disjoint splits are the evaluation protocol expected at IEEE T-IFS, CVPR, and other top security venues.
CHROM-based physiological feature extraction → 117-dimensional feature vector → Stacking Ensemble classifier → rppg_predictions.csv (col: P_rPPG)
MTCNN face detection → frame caching → EfficientNet-B4 spatial features → BiLSTM+Attention → binary classification → cnn_predictions.csv (col: P_CNN)
Eye-aligned MTCNN → Xception (2048d) + ECA → Frequency Branch → BiLSTM → Fused Classifier → cnn_predictions.csv (col: P_CNN)
Eye-aligned MTCNN → Swin-Tiny (768d) + ECA + on-the-fly DCT (128d) → pack_padded BiLSTM → 5-fold OOF → cnn_predictions_swin_oof_MASTER.csv (col: P_CNN)
The four prediction files are merged on the video_id key via inner join.
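The inner join on `video_id` can be sketched with pandas. Since the three CNN streams all emit a `P_CNN` column, each must be renamed to a distinct score column before merging; the names `P_EffNet`, `P_Xcep`, and `P_Swin` below are assumptions for illustration.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the four per-stream prediction tables (columns renamed
# from P_CNN to distinct, hypothetical names before merging).
rppg   = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "label": [0, 1, 1],
                       "P_rPPG": [0.2, 0.8, 0.7]})
effnet = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "P_EffNet": [0.1, 0.9, 0.6]})
xcep   = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "P_Xcep":   [0.3, 0.7, 0.8]})
swin   = pd.DataFrame({"video_id": ["v1", "v2"],       "P_Swin":   [0.2, 0.9]})

# Inner join keeps only videos scored by all four streams.
merged = reduce(lambda a, b: a.merge(b, on="video_id", how="inner"),
                [rppg, effnet, xcep, swin])
```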
Five complementary fusion strategies are evaluated; the best is selected by AUC. Bootstrap 95% confidence
intervals are reported for all final metrics in accordance with IEEE publication standards.
Probability alignment → 5 fusion strategies → optimal selection → bootstrap CI evaluation
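The bootstrap CI step can be sketched as a percentile bootstrap over videos. The function name, resample count, and toy data are illustrative assumptions; only "bootstrap 95% confidence intervals on the final metrics" comes from this document.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y, p, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUC: resample videos with replacement."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:       # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y[idx], p[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y, p), (lo, hi)

# Toy fused probabilities: every fake scores above every real clip.
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
p = np.array([0.1, 0.3, 0.4, 0.7, 0.8, 0.9, 0.2, 0.6])
auc, (lo, hi) = bootstrap_auc_ci(y, p)
```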
| Property | rPPG + ML | EfficientNet-B4 | Xception | Swin-Tiny |
|---|---|---|---|---|
| Paradigm | Physiological | CNN Temporal | CNN + Freq | Transformer |
| Face Detection | MediaPipe FaceMesh (468 lm) | MTCNN | MTCNN + Eye alignment | MTCNN + Eye alignment |
| Input resolution | Full video (60 frames) | 224×224 (16 frames) | 299×299 (16 frames) | 224×224 (16 frames) |
| Feature dim | 117 rPPG features | 1792-dim (EffNet-B4) | 2048-dim (Xception) | 768-dim (Swin-Tiny) |
| Temporal modelling | — | BiLSTM 2L×256H | BiLSTM 2L×256H | pack_padded BiLSTM 2L×256H |
| Attention | — | 4-head MHA | 4-head MHA + ECA | 4-head MHA + ECA |
| Frequency branch | FFT radial (32d) + DCT (32d) | — | 256-dim parallel branch | On-the-fly DCT 128-dim |
| Classifier output dim | Stacking LR logit | 512-dim → 1 | 768-dim → 1 | 704-dim → 1 |
| Loss function | — | Focal (α=0.6, γ=2.0, s=0.1) | Focal (α=dynamic, γ=2.0, s=0.05) | Focal (α=0.5, γ=2.0, s=0.08) |
| SWA | — | epoch 30 | epoch 15 | epoch 15 |
| Hard negative mining | — | — | ✓ epoch 10 | — |
| MixUp / CutMix | — | MixUp (Beta-sampled λ) | MixUp + CutMix (50/50) | MixUp (epoch ≥ 10) |
| TTA passes | — | 5 | 6 | 6 (OOF) |
| Cross-validation | GroupShuffleSplit 80/20 | 5-fold StratGroupKFold | 5-fold StratGroupKFold | 5-fold OOF (all folds) |
| Output file | rppg_predictions.csv | cnn_predictions.csv | cnn_predictions.csv | cnn_predictions_swin_oof_MASTER.csv |
| Score column | P_rPPG | P_CNN | P_CNN | P_CNN |
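The focal loss row above can be illustrated with a short sketch combining binary focal loss with label smoothing, using the EfficientNet stream's stated hyperparameters (α=0.6, γ=2.0, s=0.1). The function name and exact formulation are assumptions; the project's implementation may differ in detail.

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, alpha=0.6, gamma=2.0, smoothing=0.1):
    """Sketch: binary focal loss with label smoothing (assumed formulation)."""
    targets = targets * (1 - smoothing) + 0.5 * smoothing        # smooth hard labels
    p  = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t     = p * targets + (1 - p) * (1 - targets)              # prob. of target class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)      # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()            # down-weight easy samples

# One confident-correct and one confident-correct-negative example → small loss
loss = focal_bce(torch.tensor([2.0, -1.5]), torch.tensor([1.0, 0.0]))
```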
| Notebook | Output File | Contents |
|---|---|---|
| model_rppg | rppg_predictions.csv | video_id · label · P_rPPG |
| model_rppg | best_rppg_ml_model.joblib | Trained stacking ensemble |
| model_rppg | rppg_scaler.joblib | RobustScaler fitted on train |
| model_rppg | rppg_selector.joblib | ExtraTrees feature selector |
| model_efficientnet | cnn_predictions.csv | video_id · label · P_CNN |
| model_efficientnet | best_cnn_model_fold0.pth | Best EfficientNet checkpoint |
| model_xception | cnn_predictions.csv | video_id · label · P_CNN |
| model_xception | best_cnn_model_fold0.pth | Best Xception checkpoint |
| model_swin | cnn_predictions_swin_oof_MASTER.csv | video_id · label · P_CNN · fold |
| model_swin | swa_model_swin_fold{k}.pth | SWA model weights per fold |
| ensemble | ensemble_final_predictions.csv | All 4 scores + P_final + pred_final |
| ensemble | ensemble_metrics_with_ci.csv | AUC/Acc/F1/P/R with 95% CI |
| ensemble | ensemble_evaluation_plots.png | ROC · AUC bars · CM · score dist · corr |