Research Paper · Deep Learning · Computer Vision

NeuroPulse
Orthogonal Four-Stream Deepfake Video Detection
via Haemodynamic Signal Analysis,
Depthwise-Separable Spatio-Temporal Dual-Branch Attention,
and Hierarchical Shifted-Window Transformer
Late-Fusion Ensemble

A rigorous four-stream ensemble architecture combining Remote Photoplethysmography signal analysis, EfficientNet-B4, Xception, and Swin Transformer spatio-temporal models for state-of-the-art deepfake video detection.

PyTorch 2.4 · MediaPipe FaceMesh · FaceForensics++ · Celeb-DF v2 · DFDC · Stochastic Weight Averaging · Identity-Aware Splitting · Focal Loss

4 Model Streams · 117+ rPPG Features · 5-Fold Identity-Aware CV · 6-Pass Test-Time Augmentation · 4 Datasets (FF++ · Celeb-DF · DFDC · Custom)
System Overview
NeuroPulse employs a four-stream paradigm: one physiological stream grounded in biological-signal analysis, plus three independent spatio-temporal deep-learning streams. All four models are trained on the same master_dataset_index.csv with identity-aware cross-validation to prevent data leakage, then fused via late-stage probability aggregation.
Physiological: rPPG + ML Stacking
MediaPipe FaceMesh → CHROM rPPG → 117-feature extraction → XGB+LGB+HGB Stacking Ensemble
  • Features: 117
  • ROI Regions: 9
  • Max Frames: 60
  • Output: P_rPPG

Spatio-Temporal: EfficientNet-B4
MTCNN face detection → EfficientNet-B4 → BiLSTM (2L, 256H) → Multi-Head Attention → Binary
  • Backbone: EffNet-B4
  • Frames: 16
  • Img Size: 224²
  • TTA Passes: 5

Spatio-Temporal: Xception + Freq Branch
MTCNN + Alignment → Xception (2048d) + ECA + Freq Branch → BiLSTM → Fused Binary
  • Backbone: Xception
  • Img Size: 299²
  • Fused Dim: 768
  • TTA Passes: 6

Transformer: Swin-Tiny + DCT
MTCNN + Alignment → Swin-Tiny (768d) + ECA + DCT 128-dim → pack_padded BiLSTM → Binary
  • Backbone: Swin-Tiny
  • Feat Dim: 768
  • K-Folds: 5
  • TTA Passes: 6
Datasets & Unified Pre-Processing
All four models consume a single master_dataset_index.csv compiled once by a unified data compiler. This guarantees identical video-level alignment across all streams for leakage-free late fusion.
| Dataset | Subset | Label | Max Samples | Identity Pattern |
|---|---|---|---|---|
| FaceForensics++ | Original | Real | 200 | FF_person_{id} |
| FaceForensics++ | Deepfakes, Face2Face, FaceSwap, NeuralTextures, FaceShifter, DeepFakeDetection | Fake | 200 each | FF_person_{id} |
| Celeb-DF v2 | Celeb-real + YouTube-real | Real | 150 + 50 | Celeb_person_{id} |
| Celeb-DF v2 | Celeb-synthesis | Fake | 200 | Celeb_person_{id} |
| Custom Dataset | real_videos / deepfake_videos | Real / Fake | 400 each | Custom_person_{id} |
| DFDC Sample | metadata.json driven | Real / Fake | Balanced | DFDC_{basename} |
🔒 Identity-Aware Splitting — Critical Anti-Leakage Design. All splits use StratifiedGroupKFold, where groups are person identities extracted from filenames. This guarantees that no person's real and fake clips appear in both train and validation, preventing face memorisation, the primary source of inflated metrics in the deepfake literature. The protocol matches the evaluation standards expected at venues such as IEEE T-IFS and CVPR.
rPPG-Based Physiological Detection + ML Stacking
Inspired by FakeCatcher (Ciftci et al., IEEE TPAMI 2020), this stream extracts remote photoplethysmography (rPPG) signals from 9 precisely defined facial regions using MediaPipe FaceMesh's 468 facial landmarks. Deepfakes lack coherent biological blood-flow patterns, making physiological inconsistencies a powerful discriminator.
1

rPPG Signal Extraction & ML Pipeline Flowchart

CHROM-based physiological feature extraction → 117-dimensional feature vector → Stacking Ensemble classifier

rPPG Physiological Feature Extraction Pipeline

  1. 🎬 Input Video: MP4 / AVI / MOV / MKV / WEBM · max 60 frames
  2. MediaPipe FaceMesh: 468 facial landmarks · static_image_mode=False · confidence threshold 0.5 · per-frame detection
  3. 9 Facial ROI Regions (ConvexHull masking):
     • Forehead: 36 landmarks, upper brow region (primary blood-flow zone)
     • Left Cheek: 15 landmarks, malar region (high capillary density)
     • Right Cheek: 15 landmarks, malar region (symmetry cross-check)
     • Chin: 12 landmarks, mentalis (temporal rPPG anchor)
     • Nose: 12 landmarks, nasal bridge (specular-reflection sensitive)
     • Left Jaw / Right Jaw: 9 + 10 landmarks (lateral signal cross-validation)
     • Left Forehead: 10 landmarks, temporal zone (unilateral signal isolation)
     • Right Forehead: 11 landmarks, temporal zone (phase-coherence check)
     Plus EfficientNet-B0: 1280-dim facial geometry features appended to the feature vector
  4. CHROM rPPG Signal Processing: per-window mean normalisation · overlap-add accumulation · bandpass filter Butterworth 3rd-order, 0.7–4.0 Hz (cardiac range) · methods: CHROM (primary), GREEN, POS · linear detrend · Welch PSD (nfft=1024)
  5. 117-Dimensional Feature Extraction:
     • Spectral (per ROI): SNR · purity · entropy · spectral centroid · dominant frequency · harmonic ratio
     • HRV: RMSSD · SDNN · pNN50 · pNN20 · LF/HF power · LF/HF ratio
     • Cross-ROI: Pearson correlation · spectral coherence · phase sync · BPM variance
     • Geometry (26) + quality: eye/nose/mouth ratios · jaw symmetry · crest factor · energy · entropy
  6. Stacking Ensemble ML Pipeline: RobustScaler → ExtraTrees selector (1.2× mean threshold) · base learners: XGBoost (n=300, d=3, lr=0.02), LightGBM (n=300, nl=8), HistGradBoost (n=300, d=5) · meta-learner: Logistic Regression · GroupShuffleSplit 80/20 · 5-fold CV
  7. OUTPUT P_rPPG: rppg_predictions.csv · P(Fake) ∈ [0, 1]

🧬 rPPG Signal Processing Details

  • CHROM algorithm: per-window mean normalisation, overlap-add accumulation (de Haan & Jeanne, 2013)
  • Bandpass filter: Butterworth 3rd-order, 0.7–4.0 Hz (cardiac frequency range)
  • Frame sampling: np.linspace across video duration, max 60 frames
  • Quality gate: Laplacian variance ≥ 10 and face area ≥ 1000 px²
  • NaN interpolation: linear interpolation for <30% missing ROI frames
  • Zero-variance coherence features dropped post-extraction (up to 6 features)
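The CHROM projection and cardiac bandpass listed above can be sketched in NumPy/SciPy; the windowed overlap-add accumulation and Welch PSD stages of the full pipeline are omitted here, and the synthetic 60-frame clip is illustrative only:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def chrom_signal(rgb: np.ndarray) -> np.ndarray:
    """rgb: (T, 3) per-frame ROI colour means. Returns the 1-D CHROM pulse.
    Mean-normalise each channel, project to the two chrominance axes, then
    combine with the alpha = std(X)/std(Y) tuning of de Haan & Jeanne."""
    norm = rgb / (rgb.mean(axis=0, keepdims=True) + 1e-8)
    r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
    x = 3.0 * r - 2.0 * g
    y = 1.5 * r + g - 1.5 * b
    alpha = x.std() / (y.std() + 1e-8)
    return x - alpha * y

def cardiac_bandpass(sig, fs, lo=0.7, hi=4.0, order=3):
    """Zero-phase Butterworth band-pass restricted to the cardiac band."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, sig)

# Synthetic clip at 30 fps: a 1.2 Hz (72 BPM) pulse riding on illumination drift
fs = 30.0
t = np.arange(60) / fs
pulse = 0.02 * np.sin(2 * np.pi * 1.2 * t)
rgb = 100.0 + np.stack([0.3 * pulse, pulse, 0.5 * pulse], axis=1) + 0.5 * t[:, None]
filtered = cardiac_bandpass(chrom_signal(rgb), fs)
```

The chrominance projection cancels illumination changes that hit all channels equally, and the bandpass discards everything outside 42–240 BPM, so the surviving periodicity is the candidate pulse.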

⚙️ ML Pipeline Configuration

  • N_RPPG_ACTUAL = 117 (exact feature count post-coherence removal)
  • Feature selection threshold: "1.2×mean" ExtraTrees importance
  • XGBoost: n_est=300, max_depth=3, lr=0.02, λ=10, α=1, scale_pos_weight
  • LightGBM: n_est=300, num_leaves=8, lr=0.02, λ=10, α=1
  • HistGradBoost: max_iter=300, max_depth=5, l2=5.0, max_leaves=15
  • Meta-learner: LogisticRegression(class_weight='balanced', max_iter=1000)
EfficientNet-B4 Spatio-Temporal CNN
An ImageNet-pretrained EfficientNet-B4 backbone is combined with a stacked BiLSTM temporal model and multi-head self-attention to capture the inter-frame dependencies that discriminate deepfakes. SWA and gradual backbone unfreezing ensure stable convergence on the P100 GPU.
2

EfficientNet-B4 Spatio-Temporal Architecture Flowchart

MTCNN face detection → frame caching → EfficientNet-B4 spatial features → BiLSTM+Attention → binary classification

EfficientNet-B4 Spatio-Temporal Pipeline

  1. 🎬 Input Video: loaded from master_dataset_index.csv
  2. MTCNN Face Detector (facenet-pytorch): min_face_size=60px · thresholds=[0.6, 0.7, 0.7] · factor=0.709 · post_process=False · center-crop fallback if no detection
  3. RAM-Safe Disk Cache (.npy per video): 16 frames per video · (T, H, W, 3) uint8 · skips already-cached videos
  4. Training Augmentation (albumentations, same transform for all T frames): HFlip · ShiftScaleRotate · BrightnessContrast · HueSat · RGBShift · JPEG Compression · GaussNoise · ISONoise · CoarseDropout · Posterize · ImageNet norm μ=[0.485, 0.456, 0.406], σ=[0.229, 0.224, 0.225] · 224×224 px
  5. EfficientNet-B4 Backbone (ImageNet-pretrained, timm): global_pool='avg' · drop_path_rate=0.2 · frozen until epoch 5, then gradual unfreeze over 3 epochs · spatial feature vector 1792-dim per frame · FP32 · reshape → (B, T, 1792)
  6. BiLSTM Temporal Model: 2 layers · hidden=256 · bidirectional → 512-dim per timestep · cuDNN disabled for P100 compatibility
  7. Multi-Head Self-Attention: 4 heads · embed_dim=512 · batch_first=True · mask-aware pooling → 512-dim · LayerNorm residual · dropout=0.5
  8. Classifier Head: Linear(512→256) → LayerNorm → GELU → Dropout(0.5) → Linear(256→128) → LayerNorm → GELU → Linear(128→1) · LayerNorm replaces BatchNorm (batch_size=2, gradient accumulation ×4 → effective batch=8)
  9. OUTPUT P_CNN: cnn_predictions.csv · P(Fake) ∈ [0, 1]

🏋️ Training Configuration

  • Epochs: 40 · LR: 5×10⁻⁵ · Weight Decay: 5×10⁻⁴
  • Optimiser: AdamW with param-group LRs (backbone LR/10)
  • Scheduler: Cosine warmup (10% steps) + cosine decay
  • SWA: starts epoch 30, SWA-LR=5×10⁻⁵, 5-batch BN update
  • Loss: Focal Loss (α=0.6, γ=2.0, label_smoothing=0.1)
  • Grad accumulation: 4 steps → effective batch = 8
  • Early stopping: patience=25 on validation AUC
  • TTA: 5-pass (original, H-flip, +15 bright, −15 bright, blur)
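The focal loss with label smoothing listed above (here with this stream's settings α=0.6, γ=2.0, smoothing=0.1) can be written out in NumPy; the notebooks implement the equivalent in PyTorch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss(logits, targets, alpha=0.6, gamma=2.0, smoothing=0.1):
    """Binary focal loss with label smoothing (NumPy sketch of the PyTorch
    version). Smoothing pulls hard 0/1 labels toward 0.5; the (1-p)^gamma
    modulator down-weights already-confident examples."""
    t = targets * (1.0 - smoothing) + 0.5 * smoothing
    p = np.clip(sigmoid(logits), 1e-7, 1.0 - 1e-7)
    pos = -alpha * t * (1.0 - p) ** gamma * np.log(p)
    neg = -(1.0 - alpha) * (1.0 - t) * p ** gamma * np.log(1.0 - p)
    return float(np.mean(pos + neg))

# A confidently correct fake (logit 4) costs far less than a misclassified one
easy = focal_loss(np.array([4.0]), np.array([1.0]))
hard = focal_loss(np.array([-1.0]), np.array([1.0]))
```

The γ=2 modulation is what makes the loss pair naturally with hard-negative mining: easy samples contribute almost nothing, so gradient signal concentrates on borderline clips.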

🔬 Model Architecture Details

  • Backbone: EfficientNet-B4 (1792-dim output) via timm
  • Temporal: 2-layer BiLSTM, hidden=256, output=512
  • Attention: 4-head MHA, batch_first=True, residual+LayerNorm
  • Mask handling: padding mask propagated through LSTM+MHA
  • P100 fix: cuDNN disabled for LSTM, no AMP/autocast
  • Splits: 5-fold StratifiedGroupKFold (identity-aware)
  • Backbone unfreeze: epoch 5, then LR ramped over 3 epochs
  • Output file: cnn_predictions.csv (col: P_CNN)
Xception + Frequency Branch + Hard Negative Mining
The Xception backbone (2048-dim output) is augmented with a parallel frequency branch that extracts DCT and FFT compression artifacts. ECA channel attention re-weights spatial features. CutMix and hard-negative mining curriculum force the model to detect local manipulation boundaries rather than global statistics.
3

Xception Spatio-Temporal + Frequency Architecture Flowchart

Eye-aligned MTCNN → Xception (2048d) + ECA → Frequency Branch → BiLSTM → Fused Classifier

Xception Dual-Branch Spatio-Temporal Pipeline

  1. 🎬 Input Video: 299×299 target resolution (Xception native)
  2. MTCNN + Eye-Landmark Alignment: detect eye landmarks → compute rotation angle → warpAffine alignment · Laplacian blur filter (threshold 20.0) · quality gate variance ≥ 20 · confidence ≥ 0.9 · center-crop fallback if detection fails
  3. Augmentation + MixUp/CutMix (epoch ≥ hard-mining epoch): HFlip · ShiftScaleRotate · CLAHE · JPEG Compression · GaussNoise · ElasticTransform · CoarseDropout · Posterize · Xception norm μ=[0.5, 0.5, 0.5], σ=[0.5, 0.5, 0.5] · MixUp lam ~ Beta(0.2) · CutMix α=1.0 (50% split)
  4. Parallel dual-branch processing:
     • Xception backbone (timm legacy_xception, pretrained): global_pool='avg' · 2048-dim · FP32 → ECA channel attention (1D conv, k=odd(log₂C)) → input_proj(2048→512)
     • Frequency branch: Linear(2048→256) → LayerNorm → GELU → Dropout → Linear(256→256) → LayerNorm · captures JPEG/DCT/FFT compression artifacts · masked avg-pool → freq_pooled
  5. BiLSTM Temporal (cuDNN disabled): 2 layers · hidden=256 · bidirectional → 512-dim · temporal dropout(0.3) before attention · Xavier input init, orthogonal hidden init
  6. Multi-Head Self-Attention: 4 heads · embed_dim=512 · key_padding_mask · residual + LayerNorm · masked pooling → temporal_pooled (512-dim)
  7. Fusion: concat([temporal_pooled (512d), freq_pooled (256d)]) = 768d
  8. Fused Classifier Head (768-dim input): Linear(768→256) → LayerNorm → GELU → Dropout(0.3) → Linear(256→128) → LayerNorm → GELU → Dropout(0.15) → Linear(128→1) · dynamic FOCAL_ALPHA per fold from class counts
  9. OUTPUT P_CNN: cnn_predictions.csv · P(Fake) ∈ [0, 1] · 6-pass TTA: original · HFlip · bright+ · bright− · blur · zoom (93%)

🎛️ Training Configuration

  • Epochs: 40 · LR: 1×10⁻⁴ · Weight Decay: 1×10⁻²
  • Loss: Focal Loss (α=dynamic, γ=2.0, smooth=0.05)
  • FOCAL_ALPHA is computed dynamically per fold from class counts
  • SWA: epoch 15 → full BN stats update (manual loop, not update_bn)
  • Hard Mining: epoch 10, class-balanced WeightedRandomSampler, refreshed every 5 epochs
  • Curriculum: Mixup (α=0.2) + CutMix (α=1.0) activated from epoch 0
  • Scheduler: CosineAnnealingLR(T_max=SWA_START, eta_min=LR×0.01)
  • TTA: 6-pass (original, H-flip, bright+15%, bright−15%, blur, 93% crop+resize)
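The CutMix step of the curriculum can be sketched for a (T, H, W, C) frame stack, assuming (as the per-video augmentation elsewhere in this stream suggests) that one shared box is pasted on every frame; the box-sampling details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def cutmix(frames_a, frames_b, alpha=1.0):
    """Paste one random box from clip B into clip A (same box on every
    frame) and return the area-corrected mixing coefficient lam.
    Labels mix as y = lam * y_a + (1 - lam) * y_b."""
    t, h, w, c = frames_a.shape
    lam = rng.beta(alpha, alpha)
    cut = np.sqrt(1.0 - lam)                  # side ratio so box area = 1 - lam
    ch, cw = int(h * cut), int(w * cut)
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(cy - ch // 2, 0), min(cy + ch // 2, h)
    x1, x2 = max(cx - cw // 2, 0), min(cx + cw // 2, w)
    mixed = frames_a.copy()
    mixed[:, y1:y2, x1:x2] = frames_b[:, y1:y2, x1:x2]
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)  # recompute after clipping
    return mixed, lam

a = np.zeros((16, 64, 64, 3), dtype=np.float32)   # stand-in "real" clip
b = np.ones_like(a)                               # stand-in "fake" clip
mixed, lam = cutmix(a, b)
```

Pasting a region from another clip creates an artificial manipulation boundary inside the face crop, which is exactly the local cue the text says the model should learn instead of global statistics.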

🔬 Architectural Innovations

  • ECA-Net: efficient channel attention via 1D conv (k=odd(⌈(log₂C + 1)/2⌉))
  • Frequency branch: parallel 256-dim stream from spatial features
  • Dual fusion: temporal (512d) || frequency (256d) → 768d classifier
  • LSTM init: Xavier-uniform for input weights, Orthogonal for hidden, forget-gate bias=1
  • Mask propagation: through LSTM → attention → masked avg-pool
  • No LSTM dropout, for cuDNN P100 compatibility (the single-layer case uses dropout=0)
  • Per-fold FOCAL_ALPHA prevents bias with identity-split class imbalance
  • Output file: cnn_predictions.csv (col: P_CNN)
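The adaptive ECA kernel size in the first bullet can be computed as in the ECA-Net paper; the truncate-then-round-up-to-odd step below follows the reference implementation, so the notebooks' exact rounding may differ slightly:

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """ECA-Net adaptive kernel size k = |(log2(C) + b) / gamma|_odd:
    the 1-D conv span over channels grows logarithmically with C."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

# Kernel sizes for the channel widths used by the three CNN streams
ks = {c: eca_kernel_size(c) for c in (768, 1792, 2048)}
```

The point of the formula is that channel-attention cost stays O(C·k) with a tiny odd k, instead of the O(C²/r) of an SE-style bottleneck.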
Swin Transformer Tiny + DCT Frequency Branch
The Swin Transformer's hierarchical shifted-window attention (768-dim output) is paired with a novel on-the-fly DCT frequency branch computed from raw frame pixels. Pack-padded-sequence LSTM eliminates padding corruption. A full 5-fold cross-validation loop runs in a single session, producing out-of-fold (OOF) predictions for bias-free ensemble calibration.
4

Swin Transformer + DCT Architecture Flowchart

Eye-aligned MTCNN → Swin-Tiny (768d) + ECA + on-the-fly DCT(128d) → pack_padded BiLSTM → 5-fold OOF

Swin Transformer Dual-Branch Pipeline (5-Fold OOF)

  1. 🎬 Input Video: 224×224 target · 16 frames per video
  2. MTCNN + Eye-Landmark Alignment: eye angle → warpAffine rotation · Laplacian variance ≥ 20 quality gate · confidence ≥ 0.9, blur-filtered · center-crop fallback · pre-extracted cache at /kaggle/input/swin-1data-cache/ (auto-detected)
  3. Augmentation + progressive frame curriculum: Resize(224) · HFlip · ShiftScaleRotate · BrightnessContrast · JPEG (75–100%) · GaussNoise · CoarseDropout · Posterize · progressive frames: epochs 0–4 → 5 frames, epochs 5–14 → 10 frames, epochs 15+ → 16 frames (full)
  4. Parallel dual-branch processing:
     • Swin-Tiny backbone (swin_tiny_patch4_window7_224): drop_path_rate=0.2 · global_pool='avg' · 768-dim · FP32 · ⚡ padded frames skipped (real_mask_flat), zero-filled → ECA channel attention (1D conv, k=odd(log₂768)) → input_proj(768→512)
     • On-the-fly DCT features: RGB → grayscale → resize(64×64) → 2D DCT → 8×8 block means+stds → 128-dim · log(|DCT|+ε) norm · DCT matrix pre-computed as non-trainable buffer → Linear(128→192) → LayerNorm → GELU → Linear(192→192) → LN · masked pool → freq_pooled (192d)
  5. pack_padded_sequence BiLSTM: 2 layers · hidden=256 · bidirectional → 512-dim · cuDNN disabled · lengths from mask.sum(1) · pad_packed_sequence(total_length=T) · Xavier + orthogonal init
  6. Multi-Head Self-Attention: 4 heads · embed_dim=512 · key_padding_mask=~mask · LayerNorm residual · masked avg-pool → temporal_pooled (512-dim)
  7. Fusion: concat([temporal (512d), freq (192d)]) = 704d
  8. Fused Classifier Head (704-dim input): Linear(704→192) → LayerNorm → GELU → Dropout(0.3) → Linear(192→96) → LayerNorm → GELU → Linear(96→1) · 4 param groups: backbone_decay · backbone_nodecay · other_decay · other_nodecay
  9. OUTPUT P_CNN (5-fold OOF): cnn_predictions_swin_oof_MASTER.csv · 6-pass TTA per fold · columns: video_id · label · P_CNN · fold

🔬 Swin Transformer Innovations

  • Window attention: swin_tiny_patch4_window7_224 (shifted windows, 7×7)
  • Padded frame skip: only real frames processed by backbone (not padding zeros)
  • DCT buffer: pre-computed 64×64 DCT matrix as non-trainable buffer
  • pack_padded_sequence: eliminates padding corruption in LSTM gradient flow
  • ECA attention: re-weights 768 Swin output channels before projection
  • LSTM init: forget-gate bias=1 (standard LSTM best-practice)
  • Full 5-fold OOF: single 11.5h session trains all folds sequentially
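The 128-dim DCT block features can be sketched with SciPy's dct following the description above (log-magnitude normalisation, then mean and std over each 8×8 block of the 64×64 coefficient map); the grayscale conversion and resize to 64×64 are assumed done upstream:

```python
import numpy as np
from scipy.fft import dct

def dct_block_features(gray64: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """gray64: (64, 64) grayscale frame. 2-D type-II DCT, log-magnitude
    normalisation, then per-8x8-block mean and std -> 64 + 64 = 128 dims."""
    coeffs = dct(dct(gray64, axis=0, norm="ortho"), axis=1, norm="ortho")
    logmag = np.log(np.abs(coeffs) + eps)
    # Regroup the 64x64 map into an 8x8 grid of 8x8 blocks -> (64, 64)
    blocks = logmag.reshape(8, 8, 8, 8).transpose(0, 2, 1, 3).reshape(64, 64)
    return np.concatenate([blocks.mean(axis=1), blocks.std(axis=1)])

frame = np.random.default_rng(1).random((64, 64))
feats = dct_block_features(frame)
```

Block statistics over DCT coefficients summarise how energy is distributed across spatial frequencies, which is where GAN upsampling and re-compression leave their characteristic fingerprints.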

⚙️ Training Configuration

  • Epochs: 40 per fold · LR: 1×10⁻⁴ · Weight Decay: 1×10⁻²
  • 4 param groups: backbone_decay, backbone_nodecay, other_decay, other_nodecay
  • Scheduler: LambdaLR (linear warmup + cosine decay to 10% min)
  • SWA: epoch 15, 4 per-group SWA-LRs, anneal_strategy='cos' over 5 epochs
  • Loss: Focal Loss (α=0.5, γ=2.0, label_smooth=0.08)
  • MixUp: activated at epoch ≥ HARD_MINING_EPOCH (10)
  • Early stopping: patience=10 on validation AUC
  • Output file: cnn_predictions_swin_oof_MASTER.csv
Late-Fusion Ensemble Strategy
All four model probability streams are merged on a shared video_id key via inner join. Five complementary fusion strategies are evaluated; the best is selected by AUC. Bootstrap 95% confidence intervals are reported for all final metrics in accordance with IEEE publication standards.
5

Late-Fusion Ensemble Architecture Flowchart

Probability alignment → 5 fusion strategies → optimal selection → bootstrap CI evaluation

Late-Fusion Ensemble Pipeline

  1. Input streams:
     • rPPG stream: rppg_predictions.csv · column P_rPPG (physiological signal)
     • EfficientNet-B4: cnn_predictions.csv · column P_CNN (spatio-temporal CNN)
     • Xception: cnn_predictions.csv · column P_CNN (Xception + frequency branch)
     • Swin Transformer: cnn_predictions_swin_oof_MASTER.csv · column P_CNN (5-fold OOF predictions)
  2. Video-ID alignment (inner join): merge on video_id · deduplicate (keep=last) · clamp P ∈ [0, 1] · fill NaN with 0.5 · label reconciliation: all models trained on the same master_dataset_index.csv → guaranteed agreement
  3. Five parallel fusion strategies:
     ① Simple average: P = mean(P₁, P₂, P₃, P₄) · uniform weight 0.25 · baseline robustness check
     ② AUC-weighted: wᵢ = AUCᵢ / ΣAUCⱼ · better models weighted proportionally to their AUC contribution
     ③ Rank-based: normalise ranks to [0, 1] · robust to probability-scale and calibration differences
     ④ Meta-learner LR: LogisticRegression on [P₁, P₂, P₃, P₄] features · 5-fold OOF · no leakage
     ⑤ Grid-search optimal: argmax AUC over Σwᵢ = 1 (step=0.1) · coarse simplex search
  4. Best ensemble selection (by AUC): F1-optimal threshold search on the full set · P_final = best_method(P₁, P₂, P₃, P₄)
  5. Bootstrap 95% confidence intervals (n=1000 iterations): AUC · accuracy · F1 · precision · recall, each reported with [CI_low, CI_high] · stratified bootstrap resampling · RandomState(42) · single-class bootstrap samples skipped
  6. Ensemble outputs: ensemble_final_predictions.csv · ensemble_metrics_with_ci.csv · ensemble_evaluation_plots.png · columns: video_id · label · P_rPPG · P_efficientnet · P_xception · P_swin · P_final · pred_final
  7. 🏆 FINAL PREDICTION: P(Fake) ∈ [0, 1] · threshold at optimal F1 · 0 = Real · 1 = Deepfake
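Fusion strategies ①, ②, ③ and ⑤ can be sketched on synthetic probabilities (the logistic-regression meta-learner ④ is omitted for brevity); the toy data stands in for the four aligned prediction columns:

```python
from itertools import product

import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, 200)
# Four synthetic streams standing in for P_rPPG / P_efficientnet / P_xception / P_swin
P = np.clip(y[:, None] * 0.6 + rng.normal(0.2, 0.25, (200, 4)), 0, 1)

p_avg = P.mean(axis=1)                                   # 1) simple average
aucs = np.array([roc_auc_score(y, P[:, i]) for i in range(4)])
p_auc = P @ (aucs / aucs.sum())                          # 2) AUC-weighted
p_rank = np.mean([rankdata(P[:, i]) / len(y)             # 3) rank-based
                  for i in range(4)], axis=0)

# 5) coarse grid search over the weight simplex (step 0.1, last weight implied)
best_auc, best_w = -1.0, None
steps = np.arange(0.0, 1.01, 0.1)
for w in product(steps, repeat=3):
    if sum(w) > 1.0 + 1e-9:
        continue
    ws = np.array([*w, 1.0 - sum(w)])
    a = roc_auc_score(y, P @ ws)
    if a > best_auc:
        best_auc, best_w = a, ws
```

Because the corner points of the simplex are in the grid, the searched ensemble can never score below the best single stream, which is why strategy ⑤ is a safe upper envelope over ① and ②.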
Model Architecture Comparison
| Property | rPPG + ML | EfficientNet-B4 | Xception | Swin-Tiny |
|---|---|---|---|---|
| Paradigm | Physiological | CNN Temporal | CNN + Freq | Transformer |
| Face detection | MediaPipe FaceMesh (468 lm) | MTCNN | MTCNN + eye alignment | MTCNN + eye alignment |
| Input resolution | Full video (60 frames) | 224×224 (16 frames) | 299×299 (16 frames) | 224×224 (16 frames) |
| Feature dim | 117 rPPG features | 1792-dim (EffNet-B4) | 2048-dim (Xception) | 768-dim (Swin-Tiny) |
| Temporal modelling | — | BiLSTM 2L×256H | BiLSTM 2L×256H | pack_padded BiLSTM 2L×256H |
| Attention | — | 4-head MHA | 4-head MHA + ECA | 4-head MHA + ECA |
| Frequency branch | FFT radial (32d) + DCT (32d) | — | 256-dim parallel branch | On-the-fly DCT 128-dim |
| Classifier output dim | Stacking LR logit | 512-dim → 1 | 768-dim → 1 | 704-dim → 1 |
| Loss function | — | Focal (α=0.6, γ=2.0, s=0.1) | Focal (α=dynamic, γ=2.0, s=0.05) | Focal (α=0.5, γ=2.0, s=0.08) |
| SWA | — | epoch 30 | epoch 15 | epoch 15 |
| Hard negative mining | — | — | ✓ epoch 10 | — |
| MixUp / CutMix | — | MixUp (β lam) | MixUp + CutMix (50/50) | MixUp (epoch ≥ 10) |
| TTA passes | — | 5 | 6 | 6 (OOF) |
| Cross-validation | GroupShuffleSplit 80/20 | 5-fold StratGroupKFold | 5-fold StratGroupKFold | 5-fold OOF (all folds) |
| Output file | rppg_predictions.csv | cnn_predictions.csv | cnn_predictions.csv | cnn_predictions_swin_oof_MASTER.csv |
| Score column | P_rPPG | P_CNN | P_CNN | P_CNN |
Output Files & Reproducibility
| Notebook | Output File | Contents |
|---|---|---|
| model_rppg | rppg_predictions.csv | video_id · label · P_rPPG |
| model_rppg | best_rppg_ml_model.joblib | Trained stacking ensemble |
| model_rppg | rppg_scaler.joblib | RobustScaler fitted on train |
| model_rppg | rppg_selector.joblib | ExtraTrees feature selector |
| model_efficientnet | cnn_predictions.csv | video_id · label · P_CNN |
| model_efficientnet | best_cnn_model_fold0.pth | Best EfficientNet checkpoint |
| model_xception | cnn_predictions.csv | video_id · label · P_CNN |
| model_xception | best_cnn_model_fold0.pth | Best Xception checkpoint |
| model_swin | cnn_predictions_swin_oof_MASTER.csv | video_id · label · P_CNN · fold |
| model_swin | swa_model_swin_fold{k}.pth | SWA model weights per fold |
| ensemble | ensemble_final_predictions.csv | All 4 scores + P_final + pred_final |
| ensemble | ensemble_metrics_with_ci.csv | AUC/Acc/F1/P/R with 95% CI |
| ensemble | ensemble_evaluation_plots.png | ROC · AUC bars · CM · score dist · corr |
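The bootstrap confidence intervals behind ensemble_metrics_with_ci.csv can be sketched as a percentile bootstrap that skips single-class resamples (where AUC is undefined); the real pipeline additionally stratifies the resampling, which this sketch omits:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y, p, metric=roc_auc_score, n_boot=1000, seed=42):
    """Percentile bootstrap 95% CI for a metric. Resamples with replacement,
    discarding draws that contain only one class."""
    rng = np.random.RandomState(seed)
    n, stats = len(y), []
    while len(stats) < n_boot:
        idx = rng.randint(0, n, n)
        if len(np.unique(y[idx])) < 2:
            continue  # metric undefined on a single-class resample
        stats.append(metric(y[idx], p[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(metric(y, p)), float(lo), float(hi)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
p = np.clip(y * 0.5 + rng.normal(0.25, 0.2, 300), 0, 1)  # toy P_final scores
point, lo, hi = bootstrap_ci(y, p)
```

Reporting [CI_low, CI_high] rather than a single point estimate is what makes a small AUC gap between fusion strategies interpretable: overlapping intervals mean the difference may be resampling noise.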

🔬 Reproducibility Guarantees

  • Global seed: SEED = 42 across all notebooks (numpy, torch, random, CUDA)
  • Identity-aware splits: StratifiedGroupKFold on person identities prevents face memorisation
  • Identical dataset: all models read from master_dataset_index.csv
  • Disk-based face cache: reproducible across Kaggle sessions
  • Checkpoint resume: every notebook supports session-safe auto-resume
  • cuDNN benchmark=False for deterministic LSTM operation
  • Label reconciliation: inner join guarantees identical ground truth
  • Gradient clipping: max_norm=1.0 across all CNN models
  • P100 compatibility: strict FP32, no AMP, cuDNN disabled for LSTM
  • 5-fold OOF: unbiased probability calibration for ensemble fusion
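The global seeding routine can be sketched as follows; the torch-specific calls are shown as comments so the snippet runs without the GPU stack, and the function name is illustrative rather than the notebooks':

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Seed every RNG the notebooks rely on. In the real notebooks this is
    followed by:
        torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

set_global_seed(42)
a = np.random.rand(3)
set_global_seed(42)
b = np.random.rand(3)  # identical draw after re-seeding
```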

📚 Key References

  • Ciftci et al., "FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals," IEEE TPAMI 2020
  • de Haan & Jeanne, "Robust Pulse Rate From Chrominance-Based rPPG," IEEE TBME 2013
  • Tan & Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ICML 2019
  • Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," CVPR 2017
  • Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," ICCV 2021
  • Wang et al., "ECA-Net: Efficient Channel Attention for Deep CNNs," CVPR 2020
  • Rössler et al., "FaceForensics++: Learning to Detect Manipulated Facial Images," ICCV 2019
  • Li et al., "Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics," CVPR 2020