The primary objective of this project is to compare and evaluate the potential impact of incorporating serverless computing into peer-to-peer (P2P) training for machine learning. We aim to demonstrate the benefits of utilizing serverless architectures in comparison to traditional P2P training approaches. Our proposed study comprises two core architectures:
- P2P Training without Serverless Computing: This represents the traditional P2P training process, where peers communicate and synchronize with each other directly, without the integration of serverless computing.
- P2P Training with Serverless Computing: This architecture integrates serverless computing into the P2P training process, leveraging its dynamic resource allocation capabilities to enhance training efficiency and address the challenges posed by varying peer capabilities.
Our work leverages the dynamic resource allocation capability of serverless computing, which allows it to adapt to real-time requirements and effectively address the challenges presented by the increasing number of peers with varying capabilities.
The following figure shows the architecture for running P2P training without serverless computing.
The steps below describe how to replicate our study:
- Prepare the EC2 instances according to the needed number of peers.
- Copy the script of P2P distributed training to the different EC2 instances.
- Configure RabbitMQ on Amazon Web Services (AWS), or set up a local instance with a public IP address.
- Prepare the dataset inside the S3 buckets.
- Start the different peers inside each EC2.
- Launch the required number of EC2 instances based on the desired number of peers. Choose an instance type that meets the computational requirements of your workload. For instance, the t2.medium instance type is suitable for MobileNetV3 small, while the t2.large instance type is recommended for VGG11.
- Configure the security group for the EC2 instances to allow the necessary inbound and outbound network traffic. This involves opening specific ports and enabling communication between instances within the same security group.
- Install the necessary dependencies and libraries on each EC2 instance. Ensure that all required software, frameworks, and libraries are installed to support your distributed computing workload.
- Set up SSH access to the EC2 instances for remote management.
- Generate SSH key pairs, associate them with the instances, and securely store the private key for authentication.
- Test the connectivity between the EC2 instances to verify effective communication. Ensure that the required ports are open and that the instances can discover and connect to each other.
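The connectivity test in the last step can be scripted. The following is a minimal sketch, not part of the repository, that uses only the Python standard library to check whether a TCP port on another instance accepts connections. The function names `port_is_open` and `check_peers` are hypothetical, and the ports to check depend on your own security group rules (22 is the usual SSH port and 5672 the default RabbitMQ AMQP port).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_peers(hosts, ports, timeout=3.0):
    """Map every (host, port) pair to True (reachable) or False."""
    return {
        (host, port): port_is_open(host, port, timeout)
        for host in hosts
        for port in ports
    }


# Example call (addresses are placeholders, replace them with your own):
# results = check_peers(["10.0.0.11", "10.0.0.12"], [22, 5672])
# for (host, port), ok in results.items():
#     print(f"{host}:{port} -> {'open' if ok else 'unreachable'}")
```

Running this from each instance against the others gives a quick picture of which security group rules still need to be opened.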
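scp -i myAmazonKey.pem script.py ubuntu@ip-xx-xxx-xx-xxx.compute-1.amazonaws.com:~/
Replace myAmazonKey.pem with the path to your Amazon EC2 key file and script.py with the name of the script you want to send. Additionally, replace ip-xx-xxx-xx-xxx.compute-1.amazonaws.com with the actual public IP or DNS address of the target EC2 instance. The trailing `:~/` is required: it tells scp to copy the file into the remote user's home directory, whereas a destination without a colon is treated as a local file name.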
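When there are many peers, the same copy has to be repeated for every instance. The sketch below is one possible way to build the scp command for each host. The helper name `build_scp_command`, the default user `ubuntu`, and the remote directory `~/` are assumptions for illustration, not something the repository provides.

```python
import shlex


def build_scp_command(key_path, script, host, user="ubuntu", remote_dir="~/"):
    """Return the scp argument list that copies `script` to one instance."""
    return ["scp", "-i", key_path, script, f"{user}@{host}:{remote_dir}"]


def build_all(key_path, script, hosts):
    """Return one scp command per host, in the order given."""
    return [build_scp_command(key_path, script, host) for host in hosts]


# Example (host names are placeholders):
# import subprocess
# for cmd in build_all("myAmazonKey.pem", "script.py", ["host-a", "host-b"]):
#     print(shlex.join(cmd))           # show the command first
#     subprocess.run(cmd, check=True)  # then execute it
```

Building the command as a list rather than a single string avoids shell quoting problems if the key path contains spaces.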
You can set up the environment locally by installing RabbitMQ on your own machine and exposing it through a public IP address, so that the peers can use its credentials to send and receive data. To install RabbitMQ on Ubuntu, follow these steps:
- Open a terminal on your Ubuntu machine.
- Update the package list by running the following command:
sudo apt update
- Install RabbitMQ by running the following command:
sudo apt install rabbitmq-server
- Once the installation is complete, start the RabbitMQ server by running the following command:
sudo systemctl start rabbitmq-server
- To enable the RabbitMQ server to start automatically at boot time, run the following command:
sudo systemctl enable rabbitmq-server
- Check the status of the RabbitMQ server by running the following command:
sudo systemctl status rabbitmq-server
Make sure you have the following credentials, and substitute them in the source code:
rabbitmq_host: xxx.xxx.xxx.xxx
rabbitmq_port: xxxxx
rabbitmq_username: xxxxxxxx
rabbitmq_password: xxxxxxxx
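Rather than hard-coding these four values, they can be read from the environment. The following is a minimal sketch under that assumption: the environment variable names (`RABBITMQ_HOST` and so on) and the helper `rabbitmq_url_from_env` are hypothetical and do not come from the repository. It assembles an AMQP URL for the default virtual host `/` (encoded as `%2F`), a form that RabbitMQ client libraries commonly accept.

```python
import os
from urllib.parse import quote


def rabbitmq_url_from_env(env=os.environ):
    """Build an AMQP URL from environment variables (names are assumptions)."""
    host = env["RABBITMQ_HOST"]
    # 5672 is RabbitMQ's default AMQP port; override it if yours differs.
    port = int(env.get("RABBITMQ_PORT", "5672"))
    # Percent-encode the credentials so characters like '@' or '/' are safe.
    user = quote(env["RABBITMQ_USERNAME"], safe="")
    password = quote(env["RABBITMQ_PASSWORD"], safe="")
    # '%2F' is the URL-encoded default virtual host '/'.
    return f"amqp://{user}:{password}@{host}:{port}/%2F"


# Example with an explicit mapping instead of the real environment:
# rabbitmq_url_from_env({
#     "RABBITMQ_HOST": "203.0.113.5",
#     "RABBITMQ_USERNAME": "peer",
#     "RABBITMQ_PASSWORD": "secret",
# })
```

Keeping the credentials outside the source code also makes it harder to commit them by accident.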
Another option is to set up the environment on Amazon Web Services (AWS) by creating a RabbitMQ instance. When creating it, specify the instance type, storage options, and any additional settings that suit your requirements.
- Create and configure an S3 (Simple Storage Service) bucket. This involves setting up the necessary permissions, access control, and region settings for the bucket.
- Create the required buckets to accommodate different data partitions based on your dataset and the model being used. Each bucket should be named and organized in a way that aligns with your data partitioning strategy.
- Partition and load the data into the respective S3 buckets. To do this, run the split_worker_send_to_s3.py script using the following command:
python3 split_worker_send_to_s3.py [--size SIZE] [--dataset DATASET] [--model_str MODEL]
Arguments:
--size SIZE Total number of workers in the deployment.
--dataset DATASET Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
Deployment requires running `EC2_without_serverless.py` on multiple machines.
Usage: EC2_without_serverless.py [--size SIZE] [--rank RANK] [--batch BATCH] [--dataset DATASET] [--model_str MODEL] [--optimizer OPTIMIZER] [--loss LOSS] [--evaluation EVALUATION] [--compression COMPRESSION]
Arguments:
--size SIZE Total number of workers in the deployment.
--rank RANK Unique ID of the worker node in the distributed setup.
--batch_size BATCH Size of the batch to be employed by each node.
--dataset DATASET Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
--loss LOSS Loss function to optimize.
--optimizer OPTIMIZER Optimizer to use.
--evaluation EVALUATION If True, metrics are recorded.
--compression COMPRESSION If True, gradients are compressed to minimize communication overhead.
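To make the argument list concrete, here is a sketch of how such a command line could be declared with `argparse`. It mirrors only the flags documented above; the argument types, the absence of defaults, and the `str2bool` helper are assumptions, and the actual parser in `EC2_without_serverless.py` may differ. One detail worth knowing: with argparse's default prefix matching, an option declared as `--batch_size` is also accepted in the abbreviated form `--batch`.

```python
import argparse


def str2bool(value):
    """Convert 'True'/'False'-style strings to bool.

    argparse's type=bool would treat any non-empty string, including
    'False', as True, so an explicit converter is used instead.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Declare the documented flags (types and defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="EC2_without_serverless.py")
    parser.add_argument("--size", type=int, help="total number of workers")
    parser.add_argument("--rank", type=int, help="unique ID of this worker")
    parser.add_argument("--batch_size", type=int, help="batch size per node")
    parser.add_argument("--dataset", type=str, help="e.g. mnist, cifar10")
    parser.add_argument("--model_str", type=str, help="model to train")
    parser.add_argument("--optimizer", type=str, help="optimizer to use")
    parser.add_argument("--loss", type=str, help="loss function to optimize")
    parser.add_argument("--evaluation", type=str2bool, default=False,
                        help="if True, metrics are recorded")
    parser.add_argument("--compression", type=str2bool, default=False,
                        help="if True, gradients are compressed")
    return parser


# Example:
# args = build_parser().parse_args(
#     ["--size", "4", "--rank", "0", "--batch", "64", "--dataset", "mnist"])
```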
The steps below describe how to replicate P2P training with serverless computing:
- Prepare the EC2 instances according to the needed number of peers.
- Copy the script of P2P distributed training to the different EC2 instances.
- Configure RabbitMQ on Amazon Web Services (AWS), or set up a local instance with a public IP address.
- Prepare the dataset inside the S3 buckets.
- Create the Lambda functions that perform batch training.
- Create an AWS Step Functions state machine to manage the workflow.
- Start the different peers inside each EC2.
Execute the split_worker_batches_send_to_s3.py script using the following command:
python3 split_worker_batches_send_to_s3.py [--size SIZE] [--dataset DATASET] [--model_str MODEL]
Arguments:
--size SIZE Total number of workers in the deployment.
--dataset DATASET Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
This script splits the data first by worker and then into batches. Each batch is processed by a Lambda function.
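The two-level split can be illustrated with plain index arithmetic. The sketch below is not the code of `split_worker_batches_send_to_s3.py`; it only shows one plausible scheme (contiguous, near-equal shards per worker, then fixed-size batches within each shard). The function name `split_indices` and the explicit `batch_size` parameter are assumptions made for the example.

```python
def split_indices(num_samples, size, batch_size):
    """Return {rank: [batch, ...]}, each batch being a list of sample indices.

    Samples are divided into `size` contiguous shards of near-equal length,
    and every shard is then cut into batches of at most `batch_size` items.
    """
    partitions = {}
    for rank in range(size):
        # Integer arithmetic spreads any remainder across the shards.
        start = rank * num_samples // size
        end = (rank + 1) * num_samples // size
        shard = list(range(start, end))
        partitions[rank] = [
            shard[i:i + batch_size] for i in range(0, len(shard), batch_size)
        ]
    return partitions


# Example: 10 samples, 2 workers, batches of 3
# split_indices(10, 2, 3)
# -> {0: [[0, 1, 2], [3, 4]], 1: [[5, 6, 7], [8, 9]]}
```

In a setup like the one described here, each inner list would correspond to one object uploaded to S3 and later handled by one Lambda invocation.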
- Step 1: Navigate to the AWS Lambda Console: https://console.aws.amazon.com/lambda/
- Step 2: Configure the function:
  - Name the function "compute_gradient".
  - Select Python 3 as the runtime, and adjust the allocated resources and the timeout of the Lambda function as needed (default: 3 seconds; maximum: 15 minutes).
  - Use the source code provided in the "package" folder, which contains all the packages the Lambda function needs to train a model on its assigned data batch.
You can modify the lambda_function.py code if you have specific requirements.
From inside the package folder, create a zip file containing lambda_function.py and all the requirements, and name it "train_batch.zip":
zip -r ../train_batch.zip .
Use the following command to deploy the function (run it from the directory that contains train_batch.zip):
aws lambda update-function-code --function-name compute_gradient --zip-file fileb://train_batch.zip
- Because the batches are processed in parallel, a second Lambda function is needed to trigger the parallel batch-processing functions.
The code for this function is in the file "ProcessInputFunction.py". Deploy it as a Lambda function named "ProcessInputFunction".
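To give an idea of what such a trigger function does, the sketch below shows a hypothetical fan-out handler that turns one request into a list of per-batch payloads, one for each parallel invocation. This is not the content of `ProcessInputFunction.py`: the event fields `bucket`, `rank`, and `nbr_batches`, and the shape of the returned payloads, are all assumptions chosen for illustration.

```python
def lambda_handler(event, context):
    """Expand one request into one payload per batch (hypothetical schema).

    Expected event fields (assumed, not taken from the repository):
      bucket      - name of the S3 bucket holding this worker's batches
      rank        - ID of the peer that issued the request
      nbr_batches - number of batches to process in parallel
    """
    nbr_batches = int(event["nbr_batches"])
    return {
        "batches": [
            {"bucket": event["bucket"], "rank": event["rank"], "batch_id": i}
            for i in range(nbr_batches)
        ]
    }


# Local example, no AWS account needed:
# lambda_handler({"bucket": "my-bucket", "rank": 0, "nbr_batches": 2}, None)
```

Because the handler is a plain function, it can be exercised locally with a dictionary before it is deployed.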
- Navigate to the AWS Step Functions Console: https://console.aws.amazon.com/states/
- Create a new state machine named "batch_processing".
- Use the "generate_state_machine.py" script to generate the state machine definition that you will deploy to AWS Step Functions. It takes the number of batches as input.
python3 generate_state_machine.py [--nbr_batches]
- A file similar to "state_machine_definition.json" is generated. Deploy this file to the state machine created earlier using the `aws stepfunctions update-state-machine` command.
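For orientation, a definition of this kind can be produced with a few lines of Python. The sketch below is not `generate_state_machine.py`; it is an assumed, simplified version that emits an Amazon States Language document with a single `Parallel` state containing one `Task` branch per batch. The state names and the Lambda ARN placeholder are hypothetical.

```python
import json


def build_state_machine(nbr_batches, lambda_arn):
    """Return a definition with one parallel Task branch per batch."""
    branches = []
    for i in range(nbr_batches):
        name = f"Batch{i}"  # state names must be unique in the machine
        branches.append({
            "StartAt": name,
            "States": {
                name: {"Type": "Task", "Resource": lambda_arn, "End": True}
            },
        })
    return {
        "Comment": "Process all batches of one peer in parallel",
        "StartAt": "ProcessBatches",
        "States": {
            "ProcessBatches": {
                "Type": "Parallel",
                "Branches": branches,
                "End": True,
            }
        },
    }


# Example: write a definition for 4 batches (the ARN is a placeholder).
# arn = "arn:aws:lambda:us-east-1:account_id:function:compute_gradient"
# with open("state_machine_definition.json", "w") as fh:
#     json.dump(build_state_machine(4, arn), fh, indent=2)
```

A file written this way is deployed with the AWS CLI command that follows.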
aws stepfunctions update-state-machine --state-machine-arn arn:aws:states:us-east-1:account_id:stateMachine:batch_processing --definition file://state_machine_definition.json
On each EC2 machine, run the source code as follows:
Usage: EC2_with_serverless.py [--size SIZE] [--rank RANK] [--batch BATCH] [--dataset DATASET] [--model_str MODEL] [--optimizer OPTIMIZER] [--loss LOSS] [--evaluation EVALUATION] [--compression COMPRESSION]
Arguments:
--size SIZE Total number of workers in the deployment.
--rank RANK Unique ID of the worker node in the distributed setup.
--batch_size BATCH Size of the batch to be employed by each node.
--dataset DATASET Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
--loss LOSS Loss function to optimize.
--optimizer OPTIMIZER Optimizer to use.
--evaluation EVALUATION If True, metrics are recorded.
--compression COMPRESSION If True, gradients are compressed to minimize communication overhead.
Please note that we removed all credentials from the source code; replace the placeholders with your own credentials.
The results of our experiments are available at the following link: Experimentations Results
