Exploring the Impact of Serverless Computing on Peer-To-Peer Training for Machine Learning

The primary objective of this project is to compare and evaluate the potential impact of incorporating serverless computing into peer-to-peer (P2P) training for machine learning. We aim to demonstrate the benefits of utilizing serverless architectures in comparison to traditional P2P training approaches. Our proposed study comprises two core architectures:

P2P Training without Serverless Computing: This represents the traditional P2P training process, where peers communicate and synchronize with each other directly, without the integration of serverless computing.
P2P Training with Serverless Computing: This architecture integrates serverless computing into the P2P training process, leveraging its dynamic resource allocation capabilities to enhance training efficiency and address the challenges posed by varying peer capabilities.

Our work leverages the dynamic resource allocation capability of serverless computing, which allows it to adapt to real-time requirements and effectively address the challenges presented by the increasing number of peers with varying capabilities.

P2P Training without Serverless Computing

The following figure shows the architecture for running P2P training without serverless computing:

The following steps reproduce our study:

Prepare the EC2 instances according to the needed number of peers.
Copy the script of P2P distributed training to the different EC2 instances.
Configure RabbitMQ, either on AWS or locally with a public IP address.
Prepare the dataset inside the S3 buckets.
Start the different peers inside each EC2.

1. Prepare the EC2 instances

To set up the EC2 instances for distributed computing, follow these steps:

Launch the required number of EC2 instances based on the desired number of peers. Choose an instance type that meets the computational requirements of your workload. For instance, the t2.medium instance type is suitable for MobileNetV3 small, while the t2.large instance type is recommended for VGG11.
Configure the security group for the EC2 instances to allow necessary inbound and outbound network traffic. This involves opening specific ports and enabling communication between instances within the same security group.
Install the necessary dependencies and libraries on each EC2 instance. Ensure that all required software, frameworks, and libraries are installed to support your distributed computing workload.
Set up SSH access to the EC2 instances for remote management.
Generate SSH key pairs, associate them with the instances, and securely store the private key for authentication.
Test the connectivity between the EC2 instances to verify effective communication. Ensure that the required ports are open and that the instances can discover and connect to each other.
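
The connectivity test in the last step can be scripted. The following is a minimal sketch, not part of the repository, that checks whether a TCP port on another machine accepts connections. The function name `port_open` and the default timeout are assumptions made for illustration.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_open(peer_address, 22)` checks SSH reachability of a peer, and `port_open(rabbitmq_host, 5672)` checks the default AMQP port of the RabbitMQ server. A `False` result usually points to a security group rule that still blocks the port.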

2. Copy the script of P2P to EC2

To distribute scripts to each machine, you can utilize the following command:

 scp -i myAmazonKey.pem script.py ubuntu@ip-xx-xxx-xx-xxx.compute-1.amazonaws.com:

Replace myAmazonKey.pem with the path to your Amazon EC2 key file and script.py with the name of the script you want to send. Additionally, replace ip-xx-xxx-xx-xxx.compute-1.amazonaws.com with the actual public IP or DNS address of the target EC2 instance. The trailing colon tells scp that the target is a remote host; with no path after it, the file is copied to the remote user's home directory.
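
With several peers, the copy can be repeated from a short script. The sketch below only builds the scp argument list for each host. The helper name `build_scp_command` and the default user `ubuntu` are assumptions, and the host name is the same placeholder used in this section.

```python
import shlex


def build_scp_command(key_path, script, host, user="ubuntu", remote_dir=""):
    """Return the scp argument list that copies `script` to `user@host:remote_dir`.

    An empty `remote_dir` means the remote user's home directory.
    """
    return ["scp", "-i", key_path, script, f"{user}@{host}:{remote_dir}"]


# Placeholder host list; replace with the public DNS names of your instances.
hosts = ["ip-xx-xxx-xx-xxx.compute-1.amazonaws.com"]
for host in hosts:
    cmd = build_scp_command("myAmazonKey.pem", "script.py", host)
    # Print the command; run it with subprocess.run(cmd, check=True) to copy the file.
    print(shlex.join(cmd))
```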

3. Configure RabbitMQ

You can set up the environment locally by installing RabbitMQ on your own machine and exposing it through a public IP address, so that the peers can use its credentials to send and receive data.

To install RabbitMQ on Ubuntu, follow these steps:

Open the terminal on your Ubuntu machine.
Update the package list by running the following command:

sudo apt update

Install RabbitMQ by running the following command:

sudo apt install rabbitmq-server

Once the installation is complete, start the RabbitMQ server by running the following command:
```
sudo systemctl start rabbitmq-server
```
To enable the RabbitMQ server to start automatically at boot time, run the following command:
```
sudo systemctl enable rabbitmq-server
```
Check the status of the RabbitMQ server by running the following command:

sudo systemctl status rabbitmq-server

Make sure you have the following credentials, and use them to replace the placeholders in the source code:

  rabbitmq_host: xxx.xxx.xxx.xxx
  rabbitmq_port: xxxxx
  rabbitmq_username: xxxxxxxx
  rabbitmq_password: xxxxxxxx
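
One way to keep these four values out of the source code is to read them from environment variables. The sketch below is an illustration, not the repository's mechanism. The variable names (`RABBITMQ_HOST` and so on) and the class `RabbitMQConfig` are assumptions. The fallback port 5672 is the standard AMQP port.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class RabbitMQConfig:
    host: str
    port: int
    username: str
    password: str

    def amqp_url(self) -> str:
        """Build an AMQP URL for the default virtual host, percent-encoding the credentials."""
        user = quote(self.username, safe="")
        pwd = quote(self.password, safe="")
        return f"amqp://{user}:{pwd}@{self.host}:{self.port}"


def load_config(env=os.environ) -> RabbitMQConfig:
    """Read the connection settings from environment variables (hypothetical names)."""
    return RabbitMQConfig(
        host=env["RABBITMQ_HOST"],
        port=int(env.get("RABBITMQ_PORT", "5672")),  # 5672 is the default AMQP port
        username=env["RABBITMQ_USERNAME"],
        password=env["RABBITMQ_PASSWORD"],
    )
```

The resulting URL can be passed to any AMQP client library that accepts connection URLs.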

Another option is to create a RabbitMQ instance on Amazon Web Services (AWS). When creating it, choose the instance type, storage options, and additional settings that match your requirements.

4. Prepare the data inside the S3

Create and configure an S3 (Simple Storage Service) bucket. This involves setting up the necessary permissions, access control, and region settings for the bucket.

Create the required buckets to accommodate different data partitions based on your dataset and the model being used. Each bucket should be named and organized in a way that aligns with your data partitioning strategy.
Partition and load the data into the respective S3 buckets. To achieve this, execute the split_worker_send_to_s3 script using the following command:

python3 split_worker_send_to_s3.py [--size SIZE] [--dataset DATASET] [--model_str MODEL]
  
Arguments: 
--size SIZE                  Total number of workers in the deployment 
--dataset DATASET            Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL            Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
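
To make the partitioning idea concrete, the sketch below splits the sample indices of a dataset into one contiguous range per worker. It is an assumption about how such a split can be done, not a copy of the logic in split_worker_send_to_s3.py, which may shuffle or partition differently. The function name `partition_indices` is hypothetical.

```python
def partition_indices(n_samples: int, size: int) -> list[range]:
    """Split indices 0..n_samples-1 into `size` contiguous, near-equal ranges."""
    if size <= 0:
        raise ValueError("size must be positive")
    base, extra = divmod(n_samples, size)
    parts, start = [], 0
    for rank in range(size):
        # The first `extra` workers receive one additional sample each.
        length = base + (1 if rank < extra else 0)
        parts.append(range(start, start + length))
        start += length
    return parts
```

Each range would then be uploaded to the bucket of the corresponding worker.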

5. Start the process

Deployment requires running `EC2_without_serverless.py` on multiple machines.

Usage:
EC2_without_serverless.py [--size SIZE] [--rank RANK] [--batch BATCH]
[--dataset DATASET] [--model_str MODEL] [--optimizer OPTIMIZER]
[--loss LOSS] [--evaluation EVALUATION] [--compression COMPRESSION]
Arguments:
--size SIZE                  Total number of workers in the deployment
--rank RANK                  Unique ID of the worker node in the distributed setup.
--batch_size BATCH           Size of the batch to be employed by each node.
--dataset DATASET            Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL            Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
--loss  LOSS                 Loss function to optimize.
--optimizer OPTIMIZER        Optimizer to use.
--evaluation  EVALUATION     If True, metrics are recorded.
--compression  COMPRESSION   If True, gradients are compressed to minimize communication overhead.
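
Since every machine runs the same script with a different `--rank`, the per-peer command lines can be generated in one place. The following sketch assumes that ranks run from 0 to `size - 1` and uses the `--batch_size` spelling from the argument list above. The helper name `build_launch_command` and the default values are illustrative only.

```python
def build_launch_command(rank, size, dataset="mnist", model_str="vgg11",
                         batch_size=64, script="EC2_without_serverless.py"):
    """Return the argument list that starts one peer of the deployment."""
    # Assumption: ranks are numbered 0..size-1.
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return [
        "python3", script,
        "--size", str(size),
        "--rank", str(rank),
        "--batch_size", str(batch_size),
        "--dataset", dataset,
        "--model_str", model_str,
    ]


# One command per peer for a deployment of four workers.
for rank in range(4):
    print(" ".join(build_launch_command(rank, size=4)))
```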

P2P Training with Serverless Computing

The following figure shows the architecture for running P2P training with serverless computing:

The following steps reproduce P2P training with serverless computing:

Prepare the EC2 instances according to the needed number of peers.
Copy the script of P2P distributed training to the different EC2 instances.
Configure RabbitMQ, either on AWS or locally with a public IP address.
Prepare the dataset inside the S3 buckets.
Create the Lambda functions that perform batch training.
Create an AWS Step Functions state machine to manage the flow.
Start the different peers inside each EC2.

1. Prepare the EC2 instances

--Similar to the previous section--

2. Copy the script of P2P to EC2

--Similar to the previous section--

3. Configure RabbitMQ

--Similar to the previous section--

4. Prepare the dataset inside the S3

--Similar to the previous section--

Execute the split_worker_batches_send_to_s3.py script using the following command:

python3 split_worker_batches_send_to_s3.py [--size SIZE] [--dataset DATASET] [--model_str MODEL]
  
Arguments: 
--size SIZE                  Total number of workers in the deployment 
--dataset DATASET            Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL            Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.

This script splits the data across workers and then into batches; each batch is processed by a Lambda function.
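
The second level of the split can be pictured as follows. This is a minimal sketch under the assumption that a worker's samples are cut into consecutive batches of a fixed size. The actual script may order or group the samples differently, and the name `split_into_batches` is hypothetical.

```python
def split_into_batches(indices, batch_size):
    """Cut a worker's sample indices into consecutive batches of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(indices)
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
```

The last batch is shorter when the number of samples is not a multiple of the batch size. The number of batches per worker is also the value that the state machine generator needs later.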

5. Create Lambda serverless

Step 1: Navigate to the AWS Lambda Console: https://console.aws.amazon.com/lambda/
Step 2: Configure the function:

Name the function "compute_gradient".
Select Python 3 as the runtime, and adjust the allocated resources and the timeout of the Lambda function as needed (default: 3 seconds, maximum: 15 minutes).

Use the source code we prepared in the package folder, which contains all the packages that the Lambda function needs to train a model on its assigned data batch.

You can modify lambda_function.py if you have specific requirements.
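
For orientation, a Python Lambda function is a module that exposes a handler taking `event` and `context`. The sketch below shows that shape with a deliberately tiny stand-in: it computes the mean-squared-error gradient of a one-variable linear model instead of a neural network, so that it runs without dependencies. The event fields (`weights`, `batch`, `batch_id`) are hypothetical and do not describe the payload used by the repository's lambda_function.py.

```python
def lambda_handler(event, context):
    """Compute the MSE gradient of the model y = w*x + b over one batch.

    Hypothetical event layout:
      {"batch_id": 0, "weights": {"w": 0.0, "b": 0.0}, "batch": [[x, y], ...]}
    """
    w = event["weights"]["w"]
    b = event["weights"]["b"]
    samples = event["batch"]
    n = len(samples)
    # d/dw mean((w*x + b - y)^2) = mean(2 * (w*x + b - y) * x)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
    # d/db mean((w*x + b - y)^2) = mean(2 * (w*x + b - y))
    grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
    return {"batch_id": event.get("batch_id"), "grad_w": grad_w, "grad_b": grad_b}
```

Because each invocation depends only on its event, many batches can be processed by independent invocations at the same time.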

Create a zip file containing lambda_function.py and all the dependencies in the package folder, and name it "train_batch.zip":

 zip -r ../train_batch.zip .

Use the following command to deploy the function:

aws lambda update-function-code --function-name compute_gradient --zip-file fileb://train_batch.zip
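
The packaging step can also be done from Python, which is convenient when the archive is rebuilt often. This sketch uses only the standard library and stores paths relative to the package folder, so that lambda_function.py ends up at the root of the archive. The function name `zip_package` is an assumption, and the archive should be written outside the package folder so that it does not include itself.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip every file under `package_dir`, with paths relative to that folder."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # Store e.g. "lambda_function.py", not "package/lambda_function.py".
                zf.write(path, path.relative_to(package_dir).as_posix())
    return zip_path
```

Calling `zip_package("package", "train_batch.zip")` from the parent of the package folder produces an archive that can be passed to the deploy command above.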

Because the batches are processed in parallel, a second Lambda function is needed to trigger the parallel batch-processing functions.

The code of this function is in the file "ProcessInputFunction.py". Deploy it as a Lambda function named "ProcessInputFunction".

6. Create AWS Step function

Navigate to the AWS Step Functions Console: https://console.aws.amazon.com/states/.

Create a new state machine named "batch_processing".
Use the "generate_state_machine.py" script to generate the state machine definition that you will deploy to AWS Step Functions. The script takes the number of batches as input.

python3 generate_state_machine.py [--nbr_batches]

A file similar to "state_machine_definition.json" will be generated. Deploy this file to the state machine created above:

 aws stepfunctions update-state-machine --state-machine-arn arn:aws:states:us-east-1:account_id:stateMachine:batch_processing --definition file://state_machine_definition.json
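
To show what such a definition can look like, the sketch below builds an Amazon States Language document with a single Parallel state containing one Task branch per batch. It is an assumption about the general structure, not the output of generate_state_machine.py. The state names, the `batch_id` parameter, and the function name `build_state_machine` are hypothetical.

```python
import json


def build_state_machine(nbr_batches: int, lambda_arn: str) -> dict:
    """Build a definition that invokes `lambda_arn` once per batch, in parallel."""
    if nbr_batches < 1:
        raise ValueError("nbr_batches must be at least 1")
    branches = []
    for i in range(nbr_batches):
        name = f"Batch{i}"
        branches.append({
            "StartAt": name,
            "States": {
                name: {
                    "Type": "Task",
                    "Resource": lambda_arn,
                    # Static input telling the function which batch to process.
                    "Parameters": {"batch_id": i},
                    "End": True,
                }
            },
        })
    return {
        "Comment": "Process all batches of one training step in parallel",
        "StartAt": "ProcessBatches",
        "States": {
            "ProcessBatches": {"Type": "Parallel", "Branches": branches, "End": True}
        },
    }
```

Serializing the result with `json.dumps(definition, indent=2)` gives a document in the same format as the JSON definition file deployed above. A Parallel state finishes only after all of its branches have finished, which matches a training step that waits for every batch gradient.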

7. Start the process

On each EC2 machine, run the source code as follows:

Usage:

EC2_with_serverless.py [--size SIZE] [--rank RANK] [--batch BATCH]
[--dataset DATASET] [--model_str MODEL] [--optimizer OPTIMIZER]
[--loss LOSS] [--evaluation EVALUATION] [--compression COMPRESSION] 


Arguments: 
  
--size SIZE                  Total number of workers in the deployment 
--rank RANK                  Unique ID of the worker node in the distributed setup.
--batch_size BATCH           Size of the batch to be employed by each node.
--dataset DATASET            Dataset to be used, e.g., mnist, cifar10.
--model_str MODEL            Model to be trained, e.g., squeeznet1.1, vgg11, mobilenet v3 small.
--loss  LOSS                 Loss function to optimize.
--optimizer OPTIMIZER        Optimizer to use.
--evaluation  EVALUATION     If True, metrics are recorded.
--compression  COMPRESSION   If True, gradients are compressed to minimize communication overhead.

Please note that we removed all credentials from the source code; replace the placeholders with your own credentials.

Experimentations

We provide the results of our experiments at the following link: Experimentations Results
