🔍 At a Glance
The Problem
Offloading visual data (like from AR glasses) to the cloud introduces massive privacy risks. If a server is compromised, your full egocentric view—bedroom, screens, faces—is exposed.
Our Solution
We propose a distributed, hierarchical framework. An edge orchestrator (like a phone) partitions the image into independent "windows". Each window is processed by a different non-colluding cloud server. Result: No single server ever sees your full image, but you still get high-performance AI results.
Abstract
Visual intelligence tools are now ubiquitous, offering convenience across many everyday tasks, but their computational demands exceed the capabilities of resource-constrained mobile and wearable devices. Offloading visual data to the cloud is a common solution; however, it introduces significant privacy vulnerabilities during transmission and server-side computation. We propose a distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. A local trusted edge device, such as a mobile phone or an Nvidia Jetson, acts as the edge orchestrator: it partitions the user's visual data into smaller portions and distributes them across multiple independent, non-colluding cloud servers. Because no single external server ever possesses the complete image, comprehensive reconstruction of the user's data is prevented.
How It Works
1. Partitioning
The trusted edge device splits the image into non-overlapping w × w windows.
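As a concrete illustration, here is a minimal PyTorch sketch of the window split, modeled on the standard Swin/SAM-style `window_partition` routine; the `[B, H, W, C]` tensor layout and the divisibility assumption are ours, not details taken from the paper.

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping w x w windows.

    Returns [B * num_windows, w, w, C], one entry per window, so the edge
    orchestrator can ship each window to a different cloud server.
    Assumes H and W are divisible by w.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w, w, C)
```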
2. Distributed Attention
Windows are sent to separate servers. We leverage ViT Window Attention to process local features without needing the full image.
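For intuition, here is a minimal sketch of window-restricted self-attention as one server might run it; the `qkv`/`proj` layer names follow common ViT implementations, and the remaining pieces of a full transformer block (LayerNorm, MLP, relative position bias) are omitted.

```python
import torch
import torch.nn.functional as F

def window_self_attention(win_tokens: torch.Tensor, qkv: torch.nn.Linear,
                          proj: torch.nn.Linear, num_heads: int) -> torch.Tensor:
    """Self-attention restricted to the windows held by ONE server.

    win_tokens: [N, w*w, C] tokens of that server's windows. Queries, keys,
    and values all come from the same window, so this layer never needs any
    pixels outside the windows the server was sent.
    qkv is assumed to be nn.Linear(C, 3 * C) and proj nn.Linear(C, C).
    """
    N, L, C = win_tokens.shape
    q, k, v = (qkv(win_tokens)
               .reshape(N, L, 3, num_heads, C // num_heads)
               .permute(2, 0, 3, 1, 4))                 # 3 x [N, heads, L, head_dim]
    out = F.scaled_dot_product_attention(q, k, v)       # [N, heads, L, head_dim]
    return proj(out.transpose(1, 2).reshape(N, L, C))
```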
3. Edge Aggregation
Embeddings return to the edge. The device runs Global Attention layers to merge context and finalize the prediction.
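A minimal sketch of the edge-side aggregation, assuming the servers return their window embeddings in the original order; `global_blocks` and `head` are placeholder names for the locally held global-attention layers and prediction head, not identifiers from the paper.

```python
import torch

def window_unpartition(windows: torch.Tensor, w: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: stitch [B*num_windows, w, w, C] windows
    back into a [B, H, W, C] feature map on the trusted edge device."""
    B = windows.shape[0] // ((H // w) * (W // w))
    x = windows.view(B, H // w, W // w, w, w, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

def edge_finalize(window_embeddings: torch.Tensor, w: int, H: int, W: int,
                  global_blocks, head):
    """Run the remaining global-attention layers and the task head locally,
    so the full-image context only ever exists on the edge device."""
    x = window_unpartition(window_embeddings, w, H, W)
    for blk in global_blocks:      # full-image self-attention, edge-only
        x = blk(x)
    return head(x)
```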
Adapting SAM (Segment Anything)
We applied our framework to the Segment Anything Model (SAM). The original image encoder consists of 32 sequential layers.
- Standard SAM: Interleaves global and window attention layers throughout the model.
- Our PED-SAM: We reorder the layers so that the first 28 layers (Window Attention) are offloaded to the cloud.
- Edge Execution: The final 4 layers (Global Attention) run locally on the edge device, as sketched below.
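Below is a minimal sketch of this split, assuming access to SAM's list of 32 encoder blocks; in the reference `segment_anything` implementation, global-attention blocks are the ones constructed with `window_size == 0` (our reading of that codebase, not a detail stated above).

```python
def split_sam_encoder(blocks):
    """Group SAM ViT-H encoder blocks for PED-SAM: the 28 window-attention
    blocks run first on the (untrusted) cloud servers, and the 4
    global-attention blocks run last on the trusted edge device."""
    cloud_blocks = [b for b in blocks if b.window_size > 0]   # 28 window-attention layers
    edge_blocks = [b for b in blocks if b.window_size == 0]   # 4 global-attention layers
    assert len(cloud_blocks) == 28 and len(edge_blocks) == 4
    return cloud_blocks, edge_blocks
```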
Results
Privacy Protection
We tested against state-of-the-art reconstruction attacks (ViTMAE and Adobe Firefly). When the image is partitioned (e.g., 5x5 grid), the reconstruction fails completely.
Figure: Even with powerful generative AI, attackers cannot reconstruct the scene from isolated 5x5 partitions.
Utility (Accuracy)
Despite the partitioning, our method maintains high segmentation accuracy (mIoU) on the COCO dataset, comparable to the original model.
| Model | mIoU (ViT-Huge) |
|---|---|
| Original SAM (Baseline) | 0.584 |
| Gaussian Blur (Obfuscation) | 0.209 (Unusable) |
| Ours (Privacy-Enhanced) | 0.563 |
Latency & Performance
By offloading the heavy encoder to distributed GPUs, we achieve significant speedups for edge devices compared to running locally.
| Configuration | Latency |
|---|---|
| Jetson Orin Nano (local execution) | 17,700 ms |
| Ours (offload to 3x GPUs) | 9,624 ms (-46%) |
BibTeX
If you use this framework in your research, please cite:
@inproceedings{ding2025distributed,
  title={A Distributed Framework for Privacy-Enhanced Vision Transformers on the Edge},
  author={Ding, Zihao and Zhu, Mufeng and Tang, Zhongze and Wei, Sheng and Liu, Yao},
  booktitle={The Tenth ACM/IEEE Symposium on Edge Computing (SEC '25)},
  year={2025}
}