🔍 At a Glance
The Problem
Offloading visual data (like from AR glasses) to the cloud introduces massive privacy risks. If a server is compromised, your full egocentric view—bedroom, screens, faces—is exposed.
Our Solution
We propose a distributed, hierarchical framework. An edge orchestrator (like a phone) partitions the image into independent "windows". Each window is processed by a different non-colluding cloud server. Result: No single server ever sees your full image, but you still get high-performance AI results.
Abstract
Visual intelligence tools are now ubiquitous, offering convenience across many everyday tasks, but their computational demands exceed the capabilities of resource-constrained mobile and wearable devices. Offloading visual data to the cloud is a common solution; however, it introduces significant privacy vulnerabilities during transmission and server-side computation. We propose a distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. A local trusted edge device, such as a mobile phone or an Nvidia Jetson, acts as the edge orchestrator: it partitions the user's visual data into smaller portions and distributes them across multiple independent, non-colluding cloud servers. Because no single external server ever possesses the complete image, comprehensive reconstruction of the user's data is prevented.
How It Works
1. Partitioning
The trusted edge device splits the image into non-overlapping w × w windows.
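As a concrete illustration, here is a minimal PyTorch sketch of the window split, modeled on the standard Swin/SAM-style `window_partition` routine; the `[B, H, W, C]` tensor layout and the divisibility assumption are ours, not details taken from the paper.

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping w x w windows.

    Returns [B * num_windows, w, w, C], one entry per window, so the edge
    orchestrator can ship each window to a different cloud server.
    Assumes H and W are divisible by w.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w, w, C)
```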
2. Distributed Attention
Windows are sent to separate servers. We leverage ViT Window Attention to process local features without needing the full image.
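For intuition, here is a minimal sketch of window-restricted self-attention as one server might run it; the `qkv`/`proj` layer names follow common ViT implementations, and the remaining pieces of a full transformer block (LayerNorm, MLP, relative position bias) are omitted.

```python
import torch
import torch.nn.functional as F

def window_self_attention(win_tokens: torch.Tensor, qkv: torch.nn.Linear,
                          proj: torch.nn.Linear, num_heads: int) -> torch.Tensor:
    """Self-attention restricted to the windows held by ONE server.

    win_tokens: [N, w*w, C] tokens of that server's windows. Queries, keys,
    and values all come from the same window, so this layer never needs any
    pixels outside the windows the server was sent.
    qkv is assumed to be nn.Linear(C, 3 * C) and proj nn.Linear(C, C).
    """
    N, L, C = win_tokens.shape
    q, k, v = (qkv(win_tokens)
               .reshape(N, L, 3, num_heads, C // num_heads)
               .permute(2, 0, 3, 1, 4))                 # 3 x [N, heads, L, head_dim]
    out = F.scaled_dot_product_attention(q, k, v)       # [N, heads, L, head_dim]
    return proj(out.transpose(1, 2).reshape(N, L, C))
```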
3. Edge Aggregation
Embeddings return to the edge. The device runs Global Attention layers to merge context and finalize the prediction.
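A minimal sketch of the edge-side aggregation, assuming the servers return their window embeddings in the original order; `global_blocks` and `head` are placeholder names for the locally held global-attention layers and prediction head, not identifiers from the paper.

```python
import torch

def window_unpartition(windows: torch.Tensor, w: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: stitch [B*num_windows, w, w, C] windows
    back into a [B, H, W, C] feature map on the trusted edge device."""
    B = windows.shape[0] // ((H // w) * (W // w))
    x = windows.view(B, H // w, W // w, w, w, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

def edge_finalize(window_embeddings: torch.Tensor, w: int, H: int, W: int,
                  global_blocks, head):
    """Run the remaining global-attention layers and the task head locally,
    so the full-image context only ever exists on the edge device."""
    x = window_unpartition(window_embeddings, w, H, W)
    for blk in global_blocks:      # full-image self-attention, edge-only
        x = blk(x)
    return head(x)
```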
Adapting SAM (Segment Anything)
We applied our framework to the Segment Anything Model (SAM). The original image encoder consists of 32 sequential layers.
- Standard SAM: Interleaves global and window attention layers throughout the model.
- Our PED-SAM: We reorder the layers so that the first 28 layers (Window Attention) are offloaded to the cloud.
- Edge Execution: The final 4 layers (Global Attention) run locally on the edge device, as sketched below.
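Below is a minimal sketch of this split, assuming access to SAM's list of 32 encoder blocks; in the reference `segment_anything` implementation, global-attention blocks are the ones constructed with `window_size == 0` (our reading of that codebase, not a detail stated above).

```python
def split_sam_encoder(blocks):
    """Group SAM ViT-H encoder blocks for PED-SAM: the 28 window-attention
    blocks run first on the (untrusted) cloud servers, and the 4
    global-attention blocks run last on the trusted edge device."""
    cloud_blocks = [b for b in blocks if b.window_size > 0]   # 28 window-attention layers
    edge_blocks = [b for b in blocks if b.window_size == 0]   # 4 global-attention layers
    assert len(cloud_blocks) == 28 and len(edge_blocks) == 4
    return cloud_blocks, edge_blocks
```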
Results
Privacy Protection
We tested against state-of-the-art reconstruction attacks (ViTMAE and Adobe Firefly). When the image is partitioned (e.g., 5x5 grid), the reconstruction fails completely.
Figure: Even with powerful generative AI, attackers cannot reconstruct the scene from isolated 5x5 partitions.
Utility (Accuracy)
Despite the partitioning, our method maintains high segmentation accuracy (mIoU) on the COCO dataset, comparable to the original model.
| Model | mIoU (ViT-Huge) |
|---|---|
| Original SAM (Baseline) | 0.584 |
| Gaussian Blur (Obfuscation) | 0.209 (Unusable) |
| Ours (Privacy-Enhanced) | 0.563 |
Latency & Performance
By offloading the heavy encoder to distributed GPUs, we achieve significant speedups for edge devices compared to running locally.
| Configuration | Latency |
|---|---|
| Jetson Orin Nano (local execution) | 17,700 ms |
| Ours (offload to 3x GPUs) | 9,624 ms (-46%) |
BibTeX
If you use this framework in your research, please cite:
@inproceedings{ding2025distributed,
  title={A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge},
  author={Ding, Zihao and Zhu, Mufeng and Tang, Zhongze and Wei, Sheng and Liu, Yao},
  booktitle={The Tenth ACM/IEEE Symposium on Edge Computing (SEC '25)},
  year={2025}
}