Abstract: This article analyzes "Communication-Efficient Learning of Deep Networks from Decentralized Data", focusing on how federated learning reduces communication when training on decentralized data, and discusses methods for training deep networks efficiently while protecting data privacy. This work lays a foundation for AI and security applications and opens the door to future integration with blockchain in building decentralized security models.
Keywords: federated learning, communication efficiency, decentralized data, deep networks, AI, data privacy, security, blockchain
Introduction: Data Security and Decentralization Challenges
Blockchain security challenges
- Dexx security incident: on November 16, 2024, losses exceeded US$21 million. The cause was centralized private key management without encryption protection.
- Bybit security incident: on February 21, 2025, losses were approximately US$1.46 billion, the largest crypto theft in history. The cause was a spoofed multi-signature UI.
- Infini security incident: on February 24, 2025, losses were about US$50 million. The causes were weak permission control and unaudited contracts.
AI Security Challenge
The AI field is developing rapidly, and AI security is emerging as a potentially major problem. Anticipated types of security incident include:
- AI model poisoning: malicious samples are injected into the training data so that the model outputs incorrect predictions, which could, for example, cause a financial AI system to execute wrong trades.
- AI inference hijacking: as in traditional cyber attacks, an attacker hijacks an API to tamper with inference results, or even reverse-engineers the service to steal the model.
- Malicious AI agents: an AI agent is taken over and made to perform malicious operations. AI agents are proliferating in 2025, which raises the security stakes.
Beyond the traditional scope of network security, AI security also covers AI ethics, autonomy, data sovereignty, and social risk:
- Who bears responsibility for accidents involving autonomous driving? Who is responsible when AI-generated misinformation spreads and causes social unrest?
- AI agents with excessive permissions may act in unexpected ways and overstep their boundaries, causing economic loss or even endangering physical safety.
- Privacy failures in AI training data, such as leaks or decryption by an attacker, infringe on data sovereignty rights.
Federated Learning Requirements from a Security Perspective
In one sentence: federated learning addresses both the privacy of users' personal data and the inefficiency of centralized data processing by running computation locally on a massive number of personal devices that hold rich and varied data. Devices train locally, the server merges their models, and devices receive the updated model.
Decentralization
Decentralization addresses both personal data privacy and the inefficiency of centralized data processing. Each device is a client; users can choose to join this loose federation, which a central server coordinates and schedules.
Each client has a local dataset that is never uploaded. The interaction between thousands of clients and the global model consists of three actions: pulling the current model, computing locally, and pushing an update. Only the pull and push steps communicate with the central server. This decouples model training from the need for direct access to the raw data.
This follows the principle of data minimization: the data collected and stored should be limited to what is necessary. The principle was proposed in the 2012 White House report on consumer data privacy.
Data requirements
- Data generated from real user equipment has obvious advantages over proxy data from data centers.
- The data is privacy-sensitive or large in volume, so it is preferable not to log it to a data center purely for the purpose of model training.
- For supervised tasks, labels on the data can be naturally inferred from user interaction.
Task example
(1) Image classification: predict which photos are most likely to be viewed or shared many times in the future.
(2) Language modeling: improve speech recognition and text entry on touch-screen keyboards through next-word prediction, or even predicting whole replies.
The training data for both tasks is clearly privacy-sensitive: all the photos a user takes, and everything they type on their phone keyboard, including passwords, URLs, and text messages.
The distributions this data is drawn from are also likely to differ substantially from those of readily available proxy datasets, for example:
- Standard web corpora, such as Wikipedia and other web documents indexed by Baidu or Google.
- Flickr photos.
In addition, user interaction supplies the labels directly. These labels are natural and objective, and far more diverse than the labels of second-hand proxy data.
This also explains the difference between AI-powered search and traditional search (Baidu, Google). The former predicts from a large number of naturally labeled examples, while the latter indexes second-hand proxy data. The former behaves like a brain that can reason, the latter like a library index.
Both tasks are well suited to neural networks. For image classification, feedforward deep networks, especially convolutional networks, provide state-of-the-art results. For language modeling, recurrent neural networks, especially LSTM (Long Short-Term Memory), have achieved state-of-the-art results.
Privacy protection
Advantages over data center training
- In traditional data center training, data must be collected and stored, and even "anonymized" datasets can expose user privacy when joined with other data.
- The information transmitted in federated learning is a minimal update: only what is needed to improve a particular model. Because it is a change in model parameters rather than raw data, it can never contain more private information than the original training data, and generally contains much less.
- The update can and should be ephemeral to keep information to a minimum. The aggregation step also does not need to know the source of an update, so privacy can be strengthened further by transmitting updates through a mix network (such as Tor) or via a trusted third party.
Privacy impact of updates
- If the update is the total gradient over all local data and the features are a sparse bag-of-words, the non-zero gradient entries reveal exactly which words the user entered on the device.
- For denser models (such as convolutional neural networks, CNNs), a sum of gradients is a harder target for an attacker seeking information about individual training instances, but attacks are still possible.
The bag-of-words model is a text vectorization method that ignores word order and only counts word frequency.
A CNN extracts features with convolution operations, then classifies them using pooling and fully connected layers; it is especially good at processing images.
A gradient gives the direction in which the model parameters should change; the non-zero gradient entries mark the parameters that actually change.
In a bag-of-words model, each word corresponds to one feature position. If the user enters certain words, only the gradient entries for those words are non-zero when the model is updated, so an attacker can infer the specific words by observing which entries are non-zero.
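The leak described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the six-word vocabulary, the linear model with a squared-error loss, and the two "private" texts are hypothetical and do not come from the paper.

```python
import numpy as np

# Hypothetical vocabulary and a linear bag-of-words model (illustration only).
vocab = ["hello", "bank", "password", "meeting", "pizza", "tomorrow"]
w = np.linspace(0.1, 0.6, len(vocab))     # one weight per vocabulary word

def bow(text):
    """Bag-of-words vector: ignores word order, counts occurrences."""
    counts = np.zeros(len(vocab))
    for token in text.split():
        counts[vocab.index(token)] += 1
    return counts

# Text typed on the device (never uploaded) with made-up regression targets.
local_texts = ["bank password", "password tomorrow"]
targets = np.array([1.0, 0.0])

X = np.stack([bow(t) for t in local_texts])
residual = X @ w - targets
gradient = 2 * X.T @ residual / len(targets)   # gradient of the mean squared error

# A word that never appears has an all-zero feature column, so its gradient
# entry is exactly zero; the non-zero entries identify the words that were typed.
leaked = [vocab[i] for i in np.flatnonzero(gradient)]
print(leaked)  # ['bank', 'password', 'tomorrow']
```

An observer who sees only `gradient` recovers the set of words without ever seeing `local_texts`, which is why sparse-feature updates need extra protection.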
Federated Optimization: The Core of Communication Efficiency
1. Characteristics of federated optimization:
- Non-IID (not independent and identically distributed)
The training data on each client typically reflects how one particular user uses their device, so any single client's dataset is not representative of the population distribution.
- Unbalanced data
Some users use a service or app much more heavily than others, so the amount of local training data varies widely.
- Massively distributed
The number of clients participating in the optimization is expected to be much larger than the average number of data points per client.
- Limited communication
Mobile devices are frequently offline, or on slow or expensive connections.
2. Practical issues of federated optimization
- Client data changes: as data is added and deleted, a client's dataset changes.
- Client availability: availability correlates with the local data distribution. For example, phones of American English speakers and phones of British English speakers are likely to be connected at different times.
- Missing or corrupted updates: some clients may never respond or may send corrupted updates.
These issues are outside the scope of the experiments, which instead use a controlled environment that still captures the key issues of client availability and unbalanced, non-IID data.
3. Synchronous update mechanism description
We assume a synchronous update mechanism that is carried out in multiple rounds of communication:
- Assume a fixed set of K clients, each with its own local dataset.
- At the beginning of each communication round, a random fraction C of the clients is selected, and the server sends each of them the current global algorithm state (for example, the current model parameters).
- For efficiency only a fraction of the clients is selected, because the experiments show diminishing returns from adding more clients beyond a certain point.
- Each selected client will immediately perform local operations based on the global state and local data set, and then send updates to the server.
- The server will receive these updates and merge them into the global state, and then repeat this process.
4. Formulation for non-convex objective functions
① Finite-sum objective
Although we focus on non-convex neural network objectives (that is, the loss function being optimized is non-convex), the algorithm we consider applies to any finite-sum objective of the form given below.
"Non-convex" means the function is not convex: it need not have a single minimum and may have multiple local minima or saddle points. Simply put, the graph of the function is not a single "valley" and may contain several troughs.
The overall optimization problem is:
$$\min_{w \in \mathbb{R}^d} f(w)$$
That is, we minimize an objective function f(w), where w is the parameter vector being optimized and lies in $\mathbb{R}^d$, a d-dimensional parameter space. The goal is to find the w at which f(w) attains its minimum.
The objective function is defined as:
$$f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$$
That is, f(w) is the average of n sub-objective functions f_i(w). Here f_i(w) is the i-th sub-objective, which usually represents the loss on one data point or task. Because the total objective is the average of all f_i(w), the optimization problem is to minimize the average loss over these sub-objectives.
② Loss function in machine learning
In machine learning, the loss function f_i(w) usually denotes the prediction error of the model on input x_i with true label y_i, written as
$$f_i(w) = \ell(x_i, y_i; w)$$
where w is the vector of model parameters.
That is, for each training sample (x_i, y_i) we compute the loss f_i(w), which reflects the model's prediction error on that sample. The overall goal is to minimize the average of these losses, f(w), and thereby the overall error of the model.
③ Client data distribution
Assume the dataset is partitioned across K clients, where P_k is the set of indices of the data points on client k and n_k = |P_k| is the number of data points on client k. The objective function f(w) can then be rewritten as:
$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$$
That is, the global objective f(w) is the weighted average of the local objectives F_k(w), where n_k is the number of data points on client k and n is the total number of data points.
The local objective function F_k(w) on client k is defined as:
$$F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} f_i(w)$$
In other words, F_k(w) is the average of the losses f_i(w) over all data points i in the index set P_k of client k.
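The statement that f(w) equals the n_k/n-weighted average of the local objectives F_k(w) can be checked numerically. The sketch below is only an illustration under stated assumptions: the squared-error loss, the random data, and the particular unequal partition are all made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 12, 3, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def f_i(w, i):
    """Per-example loss; squared error is an illustrative choice."""
    return (X[i] @ w - y[i]) ** 2

# Global objective: plain average over all n examples.
f = np.mean([f_i(w, i) for i in range(n)])

# An unequal partition P_k of the indices {0, ..., n-1} over K clients.
P = [np.arange(0, 2), np.arange(2, 7), np.arange(7, 12)]

def F_k(w, k):
    """Local objective: average loss over the examples held by client k."""
    return np.mean([f_i(w, i) for i in P[k]])

# Weighted average of local objectives with weights n_k / n.
weighted = sum(len(P[k]) / n * F_k(w, k) for k in range(K))
print(np.isclose(f, weighted))  # True
```

The weights n_k/n matter: an unweighted mean of the three F_k values would generally differ from f(w) because the clients hold 2, 5, and 5 examples.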
④ IID assumption
IID (independent and identically distributed) means that the samples are mutually independent and all follow the same probability distribution.
Suppose the partition P_k is formed by distributing the training samples over the clients uniformly at random. Then the expected value of the local objective function F_k(w) on each client equals the global objective function f(w):
$$\mathbb{E}_{P_k}[F_k(w)] = f(w)$$
The expectation is taken over the set of examples assigned to a fixed client k. In other words, under the IID assumption every client's data follows the same distribution, so each client's expected loss matches the global loss.
⑤ Non-IID setting
In real federated learning deployments, the data often does not satisfy the IID assumption. The data distributions on different clients may differ, which opens a gap between each client's local objective function F_k(w) and the global objective function f(w). In the non-IID case, F_k(w) may be an arbitrarily poor approximation of f(w).
If the data is unevenly distributed or biased across clients (that is, no longer IID), each client's local objective becomes a less accurate stand-in for the global objective. This situation is called the non-IID setting, and it is an important factor that federated learning must take into account.
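The gap between local and global objectives can be made concrete with a toy comparison. This is a sketch under assumptions chosen for illustration: scalar data, the per-example loss (w - x_i)^2, and a non-IID split produced by sorting the data before dividing it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 10
x = rng.normal(size=n)        # toy scalar data, one value per example
w = x.mean()                  # evaluate at the minimizer of the global objective

def objective(indices):
    """Average of the per-example loss f_i(w) = (w - x_i)^2 over the given indices."""
    return np.mean((w - x[indices]) ** 2)

f = objective(np.arange(n))   # global objective f(w)

# IID split: shuffle, then cut into K equal parts.
iid_parts = np.array_split(rng.permutation(n), K)
# Non-IID split: sort by value first, so each client sees a narrow slice.
noniid_parts = np.array_split(np.argsort(x), K)

gap_iid = max(abs(objective(p) - f) for p in iid_parts)
gap_noniid = max(abs(objective(p) - f) for p in noniid_parts)
print(f"largest |F_k(w) - f(w)|: IID {gap_iid:.3f}, non-IID {gap_noniid:.3f}")
```

Under the random split every F_k(w) stays close to f(w), while under the sorted split the clients holding the extreme values see a very different local loss.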
5. The core challenge of federated optimization
- Communication costs dominate
In federated learning, communication is the bottleneck: per-client bandwidth is limited, and upload speeds are typically 1 MB/s or less. Clients also participate only under fairly strict conditions, usually when the battery is charged and the device is on an unmetered Wi-Fi connection. Frequent communication therefore raises cost and limits the efficiency of optimization.
- Computation is relatively cheap
In a data center, communication is cheap and computation dominates. On devices the situation is reversed: each client's dataset is small relative to the total, and modern smartphones have fast processors (including built-in GPUs), so computation is almost free compared with communication. This means that adding computation is preferable to adding communication rounds.
- How to reduce communication cost
To reduce the number of communication rounds and thereby improve training efficiency, the strategy is to spend more computation. There are two main ways:
• Increase parallelism: use more clients working independently between communication rounds.
• Increase computation per client: each client performs more work locally between rounds, rather than a single simple gradient computation, so fewer communication rounds are needed.
- The ultimate goal
By adding computation and parallelism where appropriate, reduce the number of communication rounds federated learning needs, which improves overall optimization efficiency and lowers communication cost.
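A rough back-of-the-envelope calculation shows why the number of rounds matters. The sketch uses figures quoted in this article (a 1,663,370-parameter CNN, 1 MB/s upload, 626 versus 20 rounds) and adds two assumptions that are not from the paper: each parameter is a 4-byte float32, and a client uploads a full-size update in every round it is counted for.

```python
PARAMS = 1_663_370                # MNIST CNN parameter count quoted in this article
BYTES_PER_PARAM = 4               # assumption: float32 parameters
UPLOAD_BYTES_PER_S = 1_000_000    # 1 MB/s upload

def upload_seconds(rounds):
    """Upload time for a client that sends one full-size update per round."""
    return rounds * PARAMS * BYTES_PER_PARAM / UPLOAD_BYTES_PER_S

print(f"one round:  {upload_seconds(1):.2f} s")     # 6.65 s
print(f"626 rounds: {upload_seconds(626):.1f} s")   # 4165.1 s
print(f"20 rounds:  {upload_seconds(20):.1f} s")    # 133.1 s
```

A gradient and a model have the same number of entries, so the cost per round is the same for FedSGD and FedAvg; the saving comes entirely from needing fewer rounds.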
Technical details: Federated Averaging Algorithm (FedAvg)
The FederatedAveraging (FedAvg) algorithm is an optimization method for federated learning (FL). It reduces communication cost by combining local training on multiple clients with parameter aggregation on the server.
1. Background
Traditional deep learning relies on stochastic gradient descent (SGD), which assumes access to the whole dataset. In federated learning the data is decentralized and stored on different clients, so global optimization cannot access all the data directly.
FedSGD (Federated SGD) applies SGD directly in the federated setting, but every round requires all selected clients to upload gradients, so communication overhead is high.
FedAvg has each client perform multiple local gradient update steps before aggregation, which trades extra local computation for fewer communication rounds.
SGD approaches a minimum of the loss function step by step: each step randomly selects a sample (or minibatch), computes the gradient of the loss on it, and updates the model parameters.
2. Algorithm ideas
FedAvg controls the training process with three key parameters:
- C: the fraction of clients selected in each round.
- E: the number of local epochs, that is, how many passes each client makes over its local dataset before reporting parameters to the server.
- B: the local minibatch size. B=∞ means the entire local dataset is treated as a single batch.
Core idea:
- Select a fraction C of the clients.
- Run E epochs of local SGD on each selected client (that is, perform multiple gradient descent steps).
- Each client returns its local model parameters, not gradients, to the server.
- The server merges the clients' updated models with a weighted average.
3. FedAvg code implementation
import random

import numpy as np


def compute_loss(model, data):
    """Mean squared error of the linear model on the given data."""
    # The last column of the data is the label
    X = data[:, :-1]
    y = data[:, -1]
    predictions = np.dot(X, model)
    return np.mean((predictions - y) ** 2)


def compute_gradient(model, data):
    """Gradient of the mean squared error with respect to the model."""
    X = data[:, :-1]
    y = data[:, -1]
    predictions = np.dot(X, model)
    return -2 * np.dot(X.T, y - predictions) / len(data)


def initialize_model():
    """Initialize global model parameters."""
    # 9 weights, one per feature (the last data column is the label).
    # Zero initialization is an assumption; any initializer works for this demo.
    return np.zeros(9)


def client_update(model, data, epochs, lr):
    """Client local training: full-batch gradient descent (equivalent to B = infinity)."""
    losses = []
    for _ in range(epochs):
        model -= lr * compute_gradient(model, data)  # in place; the caller passes a copy
        losses.append(compute_loss(model, data))
    return model, np.mean(losses)


def aggregate_models(client_models, num_samples):
    """Aggregate client model parameters (average weighted by sample count)."""
    total_samples = sum(num_samples)
    weights = [n / total_samples for n in num_samples]
    new_global_model = np.zeros_like(client_models[0])
    for model, weight in zip(client_models, weights):
        new_global_model += model * weight
    return new_global_model


def federated_training(num_rounds, num_clients, fraction, local_epochs, lr):
    """Federated training process."""
    global_model = initialize_model()
    # Simulated data: 100 rows per client, 9 features and 1 label per row
    # (uniform random values are an assumption made for this demo)
    client_data = {
        i: np.concatenate([
            np.random.rand(100, 9),  # Features
            np.random.rand(100, 1),  # Labels
        ], axis=1)
        for i in range(num_clients)
    }
    global_losses = []
    for round_idx in range(num_rounds):
        # Select a fraction of the clients, at least one
        num_selected = max(1, int(fraction * num_clients))
        selected_clients = random.sample(range(num_clients), num_selected)
        client_models = []
        client_losses = []
        num_samples = []
        for client in selected_clients:
            local_model = global_model.copy()
            updated_model, local_loss = client_update(
                local_model, client_data[client], local_epochs, lr)
            client_models.append(updated_model)
            client_losses.append(local_loss)
            num_samples.append(len(client_data[client]))
        # Average loss for this round
        avg_loss = np.mean(client_losses)
        global_losses.append(avg_loss)
        global_model = aggregate_models(client_models, num_samples)
        print(f"Round {round_idx + 1}: Average Loss = {avg_loss:.6f}")
    print("\nTraining is completed!")
    print(f"Initial loss value: {global_losses[0]:.6f}")
    print(f"Final loss value: {global_losses[-1]:.6f}")
    drop = (global_losses[0] - global_losses[-1]) / global_losses[0] * 100
    print(f"Loss drop rate: {drop:.2f}%")
    return global_model, global_losses


# Run federated learning
final_model, losses = federated_training(
    num_rounds=10, num_clients=5, fraction=0.6, local_epochs=5, lr=0.1)
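The demo above computes each client's gradient over its whole local dataset, so it never exercises the minibatch size B. A minimal sketch of a minibatch variant follows; the function name `client_update_minibatch`, the linear model, and the synthetic data are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def client_update_minibatch(w, X, y, epochs, batch_size, lr, rng):
    """E epochs of minibatch SGD on one client's data (linear model, MSE loss)."""
    w = w.copy()
    n = len(y)
    steps = 0
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ residual / len(idx)
            w -= lr * grad
            steps += 1
    return w, steps

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = X @ np.arange(1.0, 10.0)                         # noiseless synthetic labels
w_new, steps = client_update_minibatch(
    np.zeros(9), X, y, epochs=5, batch_size=10, lr=0.05, rng=rng)
print(steps)  # 50 local updates: E * n_k / B = 5 * 100 / 10
```

With n_k local examples, a client performs E * n_k / B updates per round, which is the quantity that E and B jointly control.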
4. FedAvg vs. FedSGD
| Characteristic | FedSGD (Federated SGD) | FedAvg (Federated Averaging) |
|---|---|---|
| Optimization method | Server aggregates a single gradient step per round | Server aggregates models produced by multiple local training steps |
| Computation per round | Each selected client computes the gradient once | Each selected client runs several local training passes |
| Communication overhead | High: communication is required for every gradient step | Low: communication happens only after multiple local steps |
| Client computation | Low: one gradient computation over the local data per round | High: multiple local epochs per round |
| Global model update | Directly aggregates the clients' gradients | Trains locally several times, then averages model parameters |
| Applicable scenarios | Devices with high communication bandwidth and limited computing resources | Devices with strong computing power and limited communication |
| Convergence speed | Needs more communication rounds to converge | Converges in fewer communication rounds |
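The two columns of the table coincide in one special case: if every client takes exactly one full-batch gradient step (E=1, B=∞), averaging the resulting models gives the same result as averaging the gradients and taking one step on the server. The sketch below checks this numerically; the linear model, the squared-error loss, and the three client sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 4, 0.1
w = rng.normal(size=d)
# Three hypothetical clients with different amounts of data
clients = [(rng.normal(size=(n_k, d)), rng.normal(size=n_k)) for n_k in (20, 50, 30)]

def grad(w, X, y):
    """Full-batch gradient of the mean squared error on one client."""
    return 2 * X.T @ (X @ w - y) / len(y)

n = sum(len(y) for _, y in clients)
weights = [len(y) / n for _, y in clients]

# FedSGD: the server averages the gradients, then takes one step.
g = sum(p * grad(w, X, y) for p, (X, y) in zip(weights, clients))
w_fedsgd = w - lr * g

# FedAvg with E=1, B=inf: each client takes one step, the server averages models.
w_fedavg = sum(p * (w - lr * grad(w, X, y)) for p, (X, y) in zip(weights, clients))

print(np.allclose(w_fedsgd, w_fedavg))  # True
```

The two procedures diverge as soon as clients take more than one local step, because later steps are computed at different points on each client.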
5. Key advantages
- Fewer communication rounds: this matters most for distributed user devices. FedAvg lets clients perform multiple local updates instead of sending gradients after every iteration, which reduces communication overhead.
- Model averaging: FedAvg averages local model parameters rather than summing gradients, which works well in practice when optimizing deep neural networks.
- Avoiding bad local optima: when all clients start training from the same random initialization, the averaged model can, in some scenarios, perform better than a model trained on a single client.
Training Analysis: Federated Learning Experimental Analysis
Objective: study the hyperparameters of FedAvg in depth on datasets of modest size, with the aim of improving the usability of models on mobile devices. The main tasks studied are image classification and language modeling.
Experimental design
1. Datasets and Models
- MNIST (handwritten digit recognition)
  - 2NN (multilayer perceptron with two hidden layers): 2 hidden layers of 200 units each, 199,210 parameters in total.
  - CNN (convolutional neural network): two 5x5 convolutional layers (32 and 64 channels respectively), pooling layers, a fully connected layer (512 units), and a softmax output layer, 1,663,370 parameters in total.
- CIFAR-10 (image classification)
  - The paper does not give experimental details at this point, but the later analysis covers this task.
- Shakespeare (language modeling)
  - Dataset construction: built from the lines spoken by characters in Shakespeare's plays; each speaking role with at least two lines is treated as a separate client.
  - Model: character-level LSTM with an 8-dimensional character embedding, 2 LSTM layers (256 units each), and a softmax output layer, 866,578 parameters in total.
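The two MNIST parameter counts quoted above can be reproduced by hand. The sketch assumes 28x28 single-channel inputs, 10 output classes, and, for the CNN, padded convolutions each followed by 2x2 pooling so that the feature map is 7x7 before the fully connected layer; the helper names `dense` and `conv` are made up for this example.

```python
def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv(kernel, c_in, c_out):
    """Weights plus biases of a square convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

# 2NN: 784 -> 200 -> 200 -> 10
two_nn = dense(784, 200) + dense(200, 200) + dense(200, 10)

# CNN: 5x5 conv (1 -> 32), 5x5 conv (32 -> 64), dense (7*7*64 -> 512), dense (512 -> 10)
cnn = conv(5, 1, 32) + conv(5, 32, 64) + dense(7 * 7 * 64, 512) + dense(512, 10)

print(two_nn, cnn)  # 199210 1663370
```

Almost all of the CNN's parameters (1,606,144 of them) sit in the 512-unit fully connected layer, which is worth keeping in mind when estimating per-round upload size.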
2. Data distribution method
- IID (independent and identically distributed): the data is shuffled and divided randomly among the clients. For example, MNIST is split across 100 clients with 600 samples each.
- Non-IID (not independent and identically distributed): the data is sorted by class and divided into 200 shards, and each client receives 2 shards, so a client holds samples from only about 2 classes. With so few classes per client, this setting is more challenging.
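The two partitioning schemes can be sketched as follows. The labels here are synthetic stand-ins (exactly 6,000 examples per digit, which real MNIST only approximates), and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, n = 100, 60_000
# Stand-in labels: 6,000 examples per digit, in random order
labels = np.repeat(np.arange(10), n // 10)
rng.shuffle(labels)

# IID: shuffle the indices and give each client 600 of them.
iid = np.array_split(rng.permutation(n), num_clients)

# Non-IID: sort by label, cut into 200 shards of 300, give each client 2 shards.
shards = np.array_split(np.argsort(labels, kind="stable"), 200)
shard_ids = rng.permutation(200).reshape(num_clients, 2)
noniid = [np.concatenate([shards[a], shards[b]]) for a, b in shard_ids]

classes_iid = [len(np.unique(labels[idx])) for idx in iid]
classes_noniid = [len(np.unique(labels[idx])) for idx in noniid]
print(min(classes_iid), max(classes_noniid))
```

With perfectly balanced labels every shard contains a single digit, so no client in the sorted split sees more than two classes, while clients in the random split see far more.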
3. Experimental variables
- E (number of local epochs): how many passes each client makes over its local data per round.
- B (local batch size): the number of samples used in each local update step.
- C (client participation fraction): the fraction of clients that take part in the update in each round.
Experimental results
1. MNIST experiment
- Effect of the client participation fraction (C)
  - As C increases from 0.0 (one client per round) to 1.0 (all 100 clients), the number of communication rounds needed decreases and training speeds up.
  - With non-IID data, raising C improves training efficiency more markedly. At C=1.0, the 2NN reaches the target 8.6 times faster than at C=0.0, and the CNN 9.9 times faster.
- FedAvg vs. FedSGD
  - FedSGD has every selected client compute a single gradient, which the server averages, while FedAvg lets each client run multiple local training passes (E>1) before the server aggregates.
- Experimental results
  - MNIST (CNN, target 99% accuracy): FedAvg (E=5, B=10) reaches the target in 20 rounds, while FedSGD needs 626 rounds (a 31.3x improvement).
  - Shakespeare (LSTM, target 54% accuracy): FedAvg (E=5, B=10) reaches the target in 41 rounds, while FedSGD needs 3906 rounds (a 95.3x improvement).
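The speedup factors quoted above are simply ratios of round counts, which a two-line check confirms. The helper name `speedup` is made up for this example.

```python
def speedup(baseline_rounds, rounds):
    """How many times fewer communication rounds are needed, to one decimal place."""
    return round(baseline_rounds / rounds, 1)

print(speedup(626, 20))    # 31.3  (MNIST CNN: FedSGD rounds / FedAvg rounds)
print(speedup(3906, 41))   # 95.3  (Shakespeare LSTM)
```

The factors measure communication rounds only; each FedAvg round involves more local computation than a FedSGD round.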
2. CIFAR-10 experiment
- Dataset
  - CIFAR-10 contains 50,000 training samples and 10,000 test samples; each image is 32x32 pixels with 3 RGB channels.
  - For the experiment the data was divided among 100 clients, each holding 500 training samples and 100 test samples.
- Model architecture
  - A standard convolutional neural network from the TensorFlow tutorial was used: two convolutional layers, two fully connected layers, and a linear transformation layer, about 10^6 parameters in total.
- Training
  - Preprocessing consists of cropping the images to 24x24, random left-right flips, and adjusting contrast, brightness, and whitening.
  - FedAvg was compared against standard SGD and FedSGD. FedAvg needed far fewer rounds than FedSGD, which demonstrates its communication efficiency.
- Experimental results
  - Standard SGD reached 86% test accuracy after 197,500 minibatch updates, while FedAvg reached 85% test accuracy after only 2,000 communication rounds.
  - Each minibatch update of the baseline would cost one communication round in the federated setting, so this result shows that FedAvg is far more communication-efficient than conventional SGD.
Further analysis: in experiments running both standard SGD and FedAvg with a minibatch size of B = 50, FedAvg made similar progress per minibatch computation. In addition, standard SGD and FedAvg with only one client per round both show noticeable oscillations in accuracy, and averaging over more clients smooths them out.
3. Large-scale LSTM experiment
- Task background
  - The large-scale task is LSTM-based next-word prediction. The dataset comes from a large social network and contains 10 million public posts. The posts are grouped by author, giving more than 500,000 clients.
  - Each client's dataset is limited to at most 5,000 words, and the test set contains 100,000 posts from different authors.
- Model
  - A 256-node LSTM was used, with a vocabulary of 10,000 words and input and output embeddings of dimension 192 per word, for a total of more than 4.95 million parameters.
  - The input sequence length is 10 words.
- Training and results
  - To verify the effectiveness of FedAvg, 200 clients were used in each round, and FedAvg was configured with B=8 and E=1.
  - FedAvg reached 10.5% accuracy in 35 communication rounds, while FedSGD needed 820 rounds to reach the same accuracy (the value 18 associated with FedSGD in this experiment is its best learning rate, not a round count). FedAvg therefore needed about 23 times fewer rounds.
  - FedAvg also showed lower variance in test accuracy. Averaging locally trained models from many clients smooths out the noise caused by differences between individual clients' data distributions, so stable accuracy comes in addition to the reduction in communication rounds.
Main conclusion
- FedAvg works on non-IID data, and increasing E (the number of local epochs per round) can greatly reduce the number of communication rounds and improve training efficiency.
- FedAvg converges in fewer rounds with smaller local batches (B=10).
- Appropriately increasing the client participation fraction (C >= 0.1) can significantly speed up training on non-IID data.
These experiments confirm the efficiency of FedAvg in decentralized data scenarios (such as mobile devices), and show that it is especially suitable when real-world data is unevenly distributed.
Federated Learning: Future Directions to Improve Communication Efficiency and Privacy Protection
Model Practicality
The experiments show that federated learning can train high-quality models in relatively few communication rounds, and the results hold across a variety of model architectures:
- Multilayer perceptron (MLP)
- Two different convolutional neural networks (CNNs)
- Two-layer character-level LSTM (long short-term memory network)
- Large-scale word-level LSTM model
Advantages of federated learning: these experimental results show that FedAvg (a widely used federated learning algorithm) can train high-quality models in few communication rounds, which indicates that federated learning is practical, particularly for distributed, privacy-sensitive, large-scale data.
Privacy protection and security
- Advantages of privacy protection: federated learning has an inherent privacy advantage because raw data never leaves the client, which avoids the leakage risk of centralized data storage.
- Differential privacy, secure multi-party computation (SMPC), and similar technologies: future research can combine these with federated learning to provide stronger guarantees for the privacy and security of data.
  - Differential privacy: a strong privacy protection method. By adding calibrated noise, it limits how much can be learned about any individual user's data, which suits data protection in federated learning.
  - Secure multi-party computation: keeps data protected (for example, encrypted) while several parties compute jointly, avoiding leakage of private information.
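One common way to add noise at aggregation time is to clip each client's update to a maximum norm and then add Gaussian noise to the average. The sketch below shows only that mechanical step and is an assumption-laden illustration: the function names, the clipping norm, and the noise multiplier are made up, and no privacy accounting is performed, so it does not by itself establish any formal differential privacy guarantee.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale the update down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / max(norm, 1e-12))

def noisy_average(updates, clip_norm, noise_multiplier, rng):
    """Average the clipped updates, then add Gaussian noise scaled to the clip norm."""
    clipped = [clip_update(u, clip_norm) for u in updates]
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(updates)
    return mean + rng.normal(scale=sigma, size=mean.shape)

rng = np.random.default_rng(0)
updates = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
print(clip_update(updates[0], 1.0))   # [0.6 0.8]
print(noisy_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds the influence of any single client on the average, which is what lets the noise scale be chosen independently of the data.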
Applicability of synchronous algorithms
The privacy protection technologies above (differential privacy, secure multi-party computation) apply most naturally to synchronous algorithms such as FedAvg. Training in a synchronous algorithm depends on aggregating the updates from all clients participating in a round, and these privacy methods typically introduce noise or encryption at exactly that global aggregation step, so they fit well within a synchronous framework.
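Why a synchronous round suits aggregation-time protection can be seen in a toy masking scheme: every pair of clients shares a random mask that one adds and the other subtracts, so the server sees only masked updates while their sum is unchanged. This is a simplified sketch, not a real protocol: practical secure aggregation derives the masks through key agreement and handles clients that drop out, both of which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]   # one private update per client
K = len(updates)

# For every pair i < j, client i adds the shared mask and client j subtracts it.
masks = {(i, j): rng.normal(size=4) for i in range(K) for j in range(i + 1, K)}

def masked_update(i):
    """What client i actually sends to the server."""
    out = updates[i].copy()
    for (a, b), m in masks.items():
        if a == i:
            out += m
        elif b == i:
            out -= m
    return out

server_view = [masked_update(i) for i in range(K)]
total = np.sum(server_view, axis=0)

# The masks cancel in the sum, so aggregation is exact.
print(np.allclose(total, np.sum(updates, axis=0)))  # True
```

The cancellation only works when all clients of the round contribute to the same sum, which is exactly the structure a synchronous algorithm provides.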
References
Communication-Efficient Learning of Deep Networks from Decentralized Data
FedAvg Github