A rigorous four-stream ensemble architecture combining remote photoplethysmography (rPPG) signal analysis with EfficientNet-B4, Xception, and Swin Transformer spatio-temporal models for state-of-the-art deepfake video detection.
MediaPipe FaceMesh → CHROM rPPG → 117-feature extraction → XGB+LGB+HGB Stacking Ensemble
MTCNN face detection → EfficientNet-B4 → BiLSTM (2L, 256H) → Multi-Head Attention → Binary
MTCNN + Alignment → Xception (2048d) + ECA + Freq Branch → BiLSTM → Fused Binary
MTCNN + Alignment → Swin-Tiny (768d) + ECA + DCT 128-dim → pack_padded BiLSTM → Binary
A single master_dataset_index.csv is compiled once by a unified data compiler.
This guarantees identical video-level alignment across all four streams, enabling leakage-free late fusion.
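A minimal sketch of what such a compiler produces, one row per video with a shared `video_id` key. The file paths and the `video_id` derivation are illustrative assumptions; only the identity patterns and the output filename come from this document.

```python
import pandas as pd

# Hypothetical sketch: compile one row per video so every stream
# trains and evaluates against the same video-level index.
rows = [
    # (path, label, identity) -- identities follow the patterns in the table below
    ("ffpp/original/000.mp4",         0, "FF_person_000"),
    ("ffpp/deepfakes/000_003.mp4",    1, "FF_person_000"),
    ("celebdf/Celeb-real/id0_01.mp4", 0, "Celeb_person_id0"),
]
master = pd.DataFrame(rows, columns=["path", "label", "identity"])
# Derive a unique video_id from the path (assumed scheme, not the project's exact one)
master["video_id"] = (master["path"]
                      .str.replace("/", "_", regex=False)
                      .str.replace(".mp4", "", regex=False))
master.to_csv("master_dataset_index.csv", index=False)
```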
| Dataset | Subset | Label | Max Samples | Identity Pattern |
|---|---|---|---|---|
| FaceForensics++ | Original | Real | 200 | FF_person_{id} |
| FaceForensics++ | Deepfakes, Face2Face, FaceSwap, NeuralTextures, FaceShifter, DeepFakeDetection | Fake | 200 each | FF_person_{id} |
| Celeb-DF v2 | Celeb-real + YouTube-real | Real | 150 + 50 | Celeb_person_{id} |
| Celeb-DF v2 | Celeb-synthesis | Fake | 200 | Celeb_person_{id} |
| Custom Dataset | real_videos / deepfake_videos | Real / Fake | 400 each | Custom_person_{id} |
| DFDC Sample | metadata.json driven | Real / Fake | Balanced | DFDC_{basename} |
Splits use StratifiedGroupKFold, where groups are person identities extracted from filenames.
This guarantees that no person's clips, real or fake, appear in both train and validation, preventing face memorisation
(a primary source of inflated metrics in the deepfake literature). Identity-disjoint splits are the evaluation protocol expected at IEEE T-IFS, CVPR, and other top security venues.
CHROM-based physiological feature extraction → 117-dimensional feature vector → Stacking Ensemble classifier → rppg_predictions.csv (col: P_rPPG)
MTCNN face detection → frame caching → EfficientNet-B4 spatial features → BiLSTM+Attention → binary classification → cnn_predictions.csv (col: P_CNN)
Eye-aligned MTCNN → Xception (2048d) + ECA → Frequency Branch → BiLSTM → Fused Classifier → cnn_predictions.csv (col: P_CNN)
Eye-aligned MTCNN → Swin-Tiny (768d) + ECA + on-the-fly DCT (128d) → pack_padded BiLSTM → 5-fold OOF → cnn_predictions_swin_oof_MASTER.csv (col: P_CNN)
The four prediction files are merged on the video_id key via inner join.
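The inner join on `video_id` can be sketched with pandas. Since the three CNN streams all emit a `P_CNN` column, each must be renamed to a distinct score column before merging; the names `P_EffNet`, `P_Xcep`, and `P_Swin` below are assumptions for illustration.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the four per-stream prediction tables (columns renamed
# from P_CNN to distinct, hypothetical names before merging).
rppg   = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "label": [0, 1, 1],
                       "P_rPPG": [0.2, 0.8, 0.7]})
effnet = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "P_EffNet": [0.1, 0.9, 0.6]})
xcep   = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "P_Xcep":   [0.3, 0.7, 0.8]})
swin   = pd.DataFrame({"video_id": ["v1", "v2"],       "P_Swin":   [0.2, 0.9]})

# Inner join keeps only videos scored by all four streams.
merged = reduce(lambda a, b: a.merge(b, on="video_id", how="inner"),
                [rppg, effnet, xcep, swin])
```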
Five complementary fusion strategies are evaluated; the best is selected by AUC. Bootstrap 95% confidence
intervals are reported for all final metrics in accordance with IEEE publication standards.
Probability alignment → 5 fusion strategies → optimal selection → bootstrap CI evaluation
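The bootstrap CI step can be sketched as a percentile bootstrap over videos. The function name, resample count, and toy data are illustrative assumptions; only "bootstrap 95% confidence intervals on the final metrics" comes from this document.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y, p, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUC: resample videos with replacement."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:       # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y[idx], p[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y, p), (lo, hi)

# Toy fused probabilities: every fake scores above every real clip.
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])
p = np.array([0.1, 0.3, 0.4, 0.7, 0.8, 0.9, 0.2, 0.6])
auc, (lo, hi) = bootstrap_auc_ci(y, p)
```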
| Property | rPPG + ML | EfficientNet-B4 | Xception | Swin-Tiny |
|---|---|---|---|---|
| Paradigm | Physiological | CNN Temporal | CNN + Freq | Transformer |
| Face Detection | MediaPipe FaceMesh (468 lm) | MTCNN | MTCNN + Eye alignment | MTCNN + Eye alignment |
| Input resolution | Full video (60 frames) | 224×224 (16 frames) | 299×299 (16 frames) | 224×224 (16 frames) |
| Feature dim | 117 rPPG features | 1792-dim (EffNet-B4) | 2048-dim (Xception) | 768-dim (Swin-Tiny) |
| Temporal modelling | — | BiLSTM 2L×256H | BiLSTM 2L×256H | pack_padded BiLSTM 2L×256H |
| Attention | — | 4-head MHA | 4-head MHA + ECA | 4-head MHA + ECA |
| Frequency branch | FFT radial (32d) + DCT (32d) | — | 256-dim parallel branch | On-the-fly DCT 128-dim |
| Classifier output dim | Stacking LR logit | 512-dim → 1 | 768-dim → 1 | 704-dim → 1 |
| Loss function | — | Focal (α=0.6, γ=2.0, s=0.1) | Focal (α=dynamic, γ=2.0, s=0.05) | Focal (α=0.5, γ=2.0, s=0.08) |
| SWA | — | epoch 30 | epoch 15 | epoch 15 |
| Hard negative mining | — | — | ✓ epoch 10 | — |
| MixUp / CutMix | — | MixUp (Beta-sampled λ) | MixUp + CutMix (50/50) | MixUp (epoch ≥ 10) |
| TTA passes | — | 5 | 6 | 6 (OOF) |
| Cross-validation | GroupShuffleSplit 80/20 | 5-fold StratGroupKFold | 5-fold StratGroupKFold | 5-fold OOF (all folds) |
| Output file | rppg_predictions.csv | cnn_predictions.csv | cnn_predictions.csv | cnn_predictions_swin_oof_MASTER.csv |
| Score column | P_rPPG | P_CNN | P_CNN | P_CNN |
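The focal loss row above can be illustrated with a short sketch combining binary focal loss with label smoothing, using the EfficientNet stream's stated hyperparameters (α=0.6, γ=2.0, s=0.1). The function name and exact formulation are assumptions; the project's implementation may differ in detail.

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, alpha=0.6, gamma=2.0, smoothing=0.1):
    """Sketch: binary focal loss with label smoothing (assumed formulation)."""
    targets = targets * (1 - smoothing) + 0.5 * smoothing        # smooth hard labels
    p  = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t     = p * targets + (1 - p) * (1 - targets)              # prob. of target class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)      # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()            # down-weight easy samples

# One confident-correct and one confident-correct-negative example → small loss
loss = focal_bce(torch.tensor([2.0, -1.5]), torch.tensor([1.0, 0.0]))
```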
| Notebook | Output File | Contents |
|---|---|---|
| model_rppg | rppg_predictions.csv | video_id · label · P_rPPG |
| model_rppg | best_rppg_ml_model.joblib | Trained stacking ensemble |
| model_rppg | rppg_scaler.joblib | RobustScaler fitted on train |
| model_rppg | rppg_selector.joblib | ExtraTrees feature selector |
| model_efficientnet | cnn_predictions.csv | video_id · label · P_CNN |
| model_efficientnet | best_cnn_model_fold0.pth | Best EfficientNet checkpoint |
| model_xception | cnn_predictions.csv | video_id · label · P_CNN |
| model_xception | best_cnn_model_fold0.pth | Best Xception checkpoint |
| model_swin | cnn_predictions_swin_oof_MASTER.csv | video_id · label · P_CNN · fold |
| model_swin | swa_model_swin_fold{k}.pth | SWA model weights per fold |
| ensemble | ensemble_final_predictions.csv | All 4 scores + P_final + pred_final |
| ensemble | ensemble_metrics_with_ci.csv | AUC/Acc/F1/P/R with 95% CI |
| ensemble | ensemble_evaluation_plots.png | ROC · AUC bars · CM · score dist · corr |