Migrate 2 r2 #26
Conversation
Walkthrough

The changes introduce a transition from using Hugging Face's model hub to Cloudflare R2 for data storage.

Changes
Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FullAutomation
    participant CloudStorage
    participant FlockAPI
    User->>FullAutomation: Start upload process
    FullAutomation->>FlockAPI: submit_task(task_id, base_model, gpu_type, bucket, folder_name)
    FlockAPI-->>FullAutomation: Response with status
    FullAutomation->>CloudStorage: upload_folder(local_folder, cloud_path)
    CloudStorage->>CloudStorage: Initialize upload
    CloudStorage->>CloudStorage: Upload files with ProgressPercentage
    CloudStorage-->>FullAutomation: Upload complete
    FullAutomation-->>User: Notify upload status
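For orientation, here is a small Python sketch of the same flow, pieced together from the function names and parameters that appear in this PR's diffs (get_address, CloudStorage, upload_folder, submit_task). Note that in the code itself the upload happens before submit_task, and the sketch follows the code's order; the wrapper function run_upload_flow is hypothetical:

```python
# Hypothetical wrapper; the calls and parameters mirror this PR's diffs.
from utils.cloudflare_utils import CloudStorage
from utils.flock_api import get_address, submit_task


def run_upload_flow(task_id, model_id, gpu_type, model2base_model):
    upload_data = get_address(task_id)  # fetch R2 storage credentials from the Flock API
    creds = upload_data["data"]
    cf_storage = CloudStorage(
        access_key=creds["access_key"],
        secret_key=creds["secret_key"],
        endpoint_url=creds["endpoint_url"],
        session_token=creds["session_token"],
        bucket=creds["bucket"],
    )
    cf_storage.initialize()
    # Push the trained artifacts to R2, then register the result with the API.
    cf_storage.upload_folder(local_folder="outputs", cloud_path=creds["folder_name"])
    submit_task(
        task_id,
        model2base_model[model_id],
        gpu_type,
        creds["bucket"],
        creds["folder_name"],
    )
```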
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (4)
utils/flock_api.py (1)
24-28: Enhance error message specificity.
Consider making the error message more descriptive by including the status code and specific operation that failed.
-raise Exception(f"Failed to submit task: {response.text}")
+raise Exception(f"Failed to submit task (status {response.status_code}): {response.text}")

Also applies to: 41-42
utils/cloudflare_utils.py (1)
53-53: Rename unused variable dirs to _dirs.
The variable dirs in the loop is not used within the loop body. Renaming it to _dirs signals that it is intentionally unused, which improves code readability.
Apply this change:
-for root, dirs, files in stream:
+for root, _dirs, files in stream:

🧰 Tools
🪛 Ruff
53-53: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
full_automation.py (2)
Line range hint 78-88: Avoid using continue inside a finally block
Placing a continue statement inside a finally block is a SyntaxError on Python versions before 3.8, and even where it is allowed it silently swallows any in-flight exception. The continue statement should be moved outside the finally block.
Apply this diff to correct the syntax error:

 except Exception as e:
     logger.error(f"Error: {e}")
     logger.info("Proceed to the next model...")
 finally:
     # Cleanup merged_model and outputs
     os.system("rm -rf merged_model")
     os.system("rm -rf outputs")
-    continue
+continue  # Proceed to the next iteration after cleanup

Line range hint 78-88: Use shutil instead of os.system for file operations
Using os.system("rm -rf <directory>") can be unsafe and platform-dependent. It's better to use the shutil module for removing directories.
Refactor the cleanup code to use shutil.rmtree:
+import shutil

 finally:
     # Cleanup merged_model and outputs
-    os.system("rm -rf merged_model")
-    os.system("rm -rf outputs")
+    shutil.rmtree("merged_model", ignore_errors=True)
+    shutil.rmtree("outputs", ignore_errors=True)
     continue  # Proceed to the next iteration after cleanup

📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- full_automation.py (2 hunks)
- requirements.txt (1 hunks)
- utils/cloudflare_utils.py (1 hunks)
- utils/flock_api.py (2 hunks)
🧰 Additional context used
🪛 Ruff
utils/cloudflare_utils.py
53-53: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
🔇 Additional comments (4)
utils/flock_api.py (2)
18-18: LGTM: Function signature updated correctly for R2 migration.
The parameter changes appropriately reflect the transition from Hugging Face to R2 storage while maintaining type hints.
18-28: Verify complete migration from Hugging Face to R2.
Let's ensure all callers have been updated to the new signature and no Hugging Face references remain.
✅ Verification successful
HF_TOKEN in demo.py is unrelated to the migration
The remaining Hugging Face reference is in demo.py and is only used for loading a pre-trained model using the transformers library, which is unrelated to the R2 migration of task submission. The submit_task usage in full_automation.py has been properly updated with the new signature, using the correct parameters from model2base_model mapping which contains R2-compatible base model names.
- The HF_TOKEN in demo.py is used for model loading, not for task submission
- The submit_task call in full_automation.py matches the new signature
- The model2base_model mapping is aligned with R2 base models (qwen1.5, gemma)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining Hugging Face references and verify submit_task usage
# Check for any remaining Hugging Face references
echo "Checking for remaining Hugging Face references..."
rg -i "hugg(ing)?face|hf_|hg_repo" --type py
# Find all calls to submit_task to verify they've been updated
echo "Checking submit_task usage..."
ast-grep --pattern 'submit_task($$$)'
Length of output: 582
Script:
#!/bin/bash
# Let's check the content of demo.py to understand the HF_TOKEN usage context
echo "Checking demo.py content around HF_TOKEN..."
rg -A 5 -B 5 "HF_TOKEN" demo.py
# Let's also check the model2base_model mapping to ensure it's aligned with R2
echo "Checking model2base_model definition..."
ast-grep --pattern 'model2base_model = $_'
Length of output: 873
full_automation.py (2)
62-68: Ensure secure handling of sensitive credentials
Sensitive information like access_key, secret_key, and session_token should be handled securely. Ensure that these credentials are not logged or exposed in any logs or error messages.
Confirm that no logging statements elsewhere in the codebase output these credentials.
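One way to enforce this is to scrub the credential fields before anything is logged. The following is a minimal sketch, assuming the response layout used in this PR (a "data" dict holding access_key, secret_key, and session_token); the helper name redact_credentials is hypothetical:

```python
# Hypothetical helper; field names follow the get_address response used in this PR.
SENSITIVE_KEYS = {"access_key", "secret_key", "session_token"}


def redact_credentials(payload: dict) -> dict:
    """Return a copy of payload with credential values masked for safe logging."""
    safe = dict(payload)
    data = safe.get("data")
    if isinstance(data, dict):
        safe["data"] = {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in data.items()}
    return safe
```

Logging redact_credentials(upload_data) instead of the raw response keeps the rest of the payload visible for debugging while masking the secrets.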
74-77: Verify that submit_task parameters match the updated function signature
Ensure that the parameters passed to submit_task align with its updated definition. The function should accept bucket and folder_name instead of hg_repo_id and revision.
Run the following script to confirm the function signature:
✅ Verification successful
Parameters match the function signature correctly
The verification shows that:
- The submit_task function in utils/flock_api.py is defined with parameters: task_id, base_model, gpu_type, bucket, and folder_name
- The function call in full_automation.py passes these parameters in the correct order:
  - task_id
  - model2base_model[model_id] (maps to base_model)
  - gpu_type
  - upload_data["data"]["bucket"] (maps to bucket)
  - upload_data["data"]["folder_name"] (maps to folder_name)
The parameters align perfectly with the function's signature, and no changes are needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that submit_task accepts the correct parameters.
# Test: Search for the submit_task function definition.
# Expect: Parameters should include 'task_id', 'model_name', 'gpu_type', 'bucket', and 'folder_name'.
rg -A 2 'def submit_task\(task_id, model_name, gpu_type, bucket, folder_name\):'
Length of output: 82
Script:
#!/bin/bash
# Let's try a broader search for submit_task function definition
rg -A 5 "def submit_task"
# Also search for the actual function call to see the context
rg -B 5 -A 2 "submit_task\("
Length of output: 1276
trl>=0.9.3,<=0.9.6
bitsandbytes
pyyaml
boto3
Add version constraints for boto3.
Other critical dependencies have version constraints to ensure compatibility. Since boto3 is essential for the R2 storage functionality, it should also have version constraints to prevent potential breaking changes in future updates.
Apply this diff to add version constraints:
-boto3
+boto3>=1.26.0,<2.0.0
utils/flock_api.py
Outdated
def get_address(
    task_id: int
):
    payload = json.dumps(
        {
            "task_id": task_id,
        }
    )
    headers = {
        "flock-api-key": FLOCK_API_KEY,
        "Content-Type": "application/json",
    }
    response = requests.request(
        "POST",
        f"{FED_LEDGER_BASE_URL}/tasks/get_storage_credentials",
        headers=headers,
        data=payload,
    )
    if response.status_code != 200:
        raise Exception(f"Failed to submit task: {response.text}")
    return response.json()
Multiple improvements needed for the new get_address function.
- The error message is incorrect (copied from submit_task)
- Missing function documentation
- Inconsistent spacing around function definition
Apply these improvements:
-
-
-def get_address(
-    task_id: int
-):
+def get_address(task_id: int):
+    """Retrieve storage credentials for R2 upload.
+
+    Args:
+        task_id (int): The task identifier
+
+    Returns:
+        dict: Storage credentials and endpoint information
+
+    Raises:
+        Exception: If the credentials cannot be retrieved
+    """
     payload = json.dumps(
         {
             "task_id": task_id,
         }
     )
     headers = {
         "flock-api-key": FLOCK_API_KEY,
         "Content-Type": "application/json",
     }
     response = requests.request(
         "POST",
         f"{FED_LEDGER_BASE_URL}/tasks/get_storage_credentials",
         headers=headers,
         data=payload,
     )
     if response.status_code != 200:
-        raise Exception(f"Failed to submit task: {response.text}")
+        raise Exception(f"Failed to retrieve storage credentials (status {response.status_code}): {response.text}")
     return response.json()
utils/cloudflare_utils.py
Outdated
if self.access_key is None or self.secret_key is None or self.endpoint_url is None:
    logger.error("Please provide access_key, secret_key, session_token and endpoint_url")
    raise
Specify the exception to raise in initialize method.
The raise statement on line 37 is incomplete. In Python, raise without specifying an exception can only be used to re-raise the last exception within an except block. You need to specify the exception you intend to raise.
You can fix this by raising a specific exception:
 if self.access_key is None or self.secret_key is None or self.endpoint_url is None:
     logger.error("Please provide access_key, secret_key, session_token and endpoint_url")
-    raise
+    raise ValueError("Missing required credentials: access_key, secret_key, and endpoint_url.")
utils/cloudflare_utils.py
Outdated
logger.info("\r%s %s / %s (%.2f%%)" % (
    self.filename, self._seen_so_far, self.size,
    percentage), end="")
Correct the logging statement in ProgressPercentage.__call__.
The logger.info method does not accept the end parameter. The end parameter is used with the print() function. To achieve the intended effect of updating the same line, consider using print() instead of logger.info, or adjust your logging strategy.
You can modify the code as follows:
-    logger.info("\r%s %s / %s (%.2f%%)" % (
+    print("\r%s %s / %s (%.2f%%)" % (
         self.filename, self._seen_so_far, self.size,
         percentage), end="")

Alternatively, you might consider using tqdm for progress tracking to provide a more robust and user-friendly progress bar.
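For reference, a minimal sketch of the tqdm alternative, assuming boto3's callback convention of passing the number of bytes transferred per chunk; the class name TqdmProgress is hypothetical:

```python
import os
import threading

from tqdm import tqdm


class TqdmProgress:
    """Hypothetical tqdm-based replacement for ProgressPercentage."""

    def __init__(self, filename: str):
        self._bar = tqdm(
            total=os.path.getsize(filename),
            unit="B",
            unit_scale=True,
            desc=filename,
        )
        self._lock = threading.Lock()

    def __call__(self, bytes_amount: int) -> None:
        # boto3 may invoke the callback from several worker threads, hence the lock.
        with self._lock:
            self._bar.update(bytes_amount)
```

An instance could then be passed as the Callback argument to the client's upload_file call, the same way ProgressPercentage is used today.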
utils/cloudflare_utils.py
Outdated
if self.access_key is None or self.secret_key is None or self.endpoint_url is None:
    logger.error("Please provide access_key, secret_key, session_token and endpoint_url")
Ensure consistency between the condition and the error message.
The error message mentions session_token as a required parameter, but the condition does not check if self.session_token is None. To maintain consistency, either update the condition to include session_token or remove it from the error message.
If session_token is required, update the condition:

-if self.access_key is None or self.secret_key is None or self.endpoint_url is None:
+if (
+    self.access_key is None
+    or self.secret_key is None
+    or self.endpoint_url is None
+    or self.session_token is None
+):
     logger.error("Please provide access_key, secret_key, session_token and endpoint_url")

If session_token is optional, modify the error message:

 if self.access_key is None or self.secret_key is None or self.endpoint_url is None:
-    logger.error("Please provide access_key, secret_key, session_token and endpoint_url")
+    logger.error("Please provide access_key, secret_key, and endpoint_url")

Committable suggestion was skipped due to low confidence.
full_automation.py
Outdated
cf_storage.upload_folder(local_folder="outputs",
                         cloud_path=upload_data["data"]["folder_name"])
submit_task(
Add error handling for the upload process
The cf_storage.upload_folder method may raise exceptions due to network issues, authentication errors, or other problems. Without handling these exceptions, the script may terminate prematurely.
Wrap the upload operation in a try-except block to handle potential errors.
 cf_storage.initialize()
+try:
-cf_storage.upload_folder(
-    local_folder="outputs",
-    cloud_path=upload_data["data"]["folder_name"]
-)
+    cf_storage.upload_folder(
+        local_folder="outputs",
+        cloud_path=upload_data["data"]["folder_name"]
+    )
+except Exception as e:
+    logger.error(f"Upload failed: {e}")
+    logger.info("Proceeding to the next model...")
+    continue  # Skip to the next iteration

Committable suggestion was skipped due to low confidence.
full_automation.py
Outdated
upload_data = get_address(task_id)
cf_storage = CloudStorage(
    access_key=upload_data["data"]["access_key"],
    secret_key=upload_data["data"]["secret_key"],
    endpoint_url=upload_data["data"]["endpoint_url"],
    session_token=upload_data["data"]["session_token"],
    bucket=upload_data["data"]["bucket"]
Handle potential exceptions when retrieving upload data
The call to get_address(task_id) may fail or return invalid data, which could result in a KeyError when accessing upload_data["data"]. It's important to handle exceptions to prevent the script from crashing unexpectedly.
Consider adding error handling to check if the response contains the expected data.
 try:
     logger.info("Start to push the lora weight to the cloudflare R2...")
     upload_data = get_address(task_id)
+    if "data" not in upload_data:
+        logger.error("Failed to retrieve upload data from get_address")
+        continue
     cf_storage = CloudStorage(
         access_key=upload_data["data"]["access_key"],
         secret_key=upload_data["data"]["secret_key"],
         endpoint_url=upload_data["data"]["endpoint_url"],
         session_token=upload_data["data"]["session_token"],
         bucket=upload_data["data"]["bucket"]
     )

Committable suggestion was skipped due to low confidence.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (9)
.pre-commit-config.yaml (1)
8-15: Consider adding Ruff configuration options.

The Ruff configuration could benefit from additional settings to enhance its effectiveness. Add a pyproject.toml file with the following configuration:

[tool.ruff]
# Enable pycodestyle (`E`), Pyflakes (`F`), and import sorting (`I`)
select = ["E", "F", "I"]
ignore = []

# Allow autofix for all enabled rules (when `--fix` is provided).
fixable = ["ALL"]
unfixable = []

# Exclude a variety of commonly ignored directories.
exclude = [
    ".bzr",
    ".direnv",
    ".eggs",
    ".git",
    ".git-rewrite",
    ".hg",
    ".mypy_cache",
    ".nox",
    ".pants.d",
    ".pytype",
    ".ruff_cache",
    ".svn",
    ".tox",
    ".venv",
    "__pypackages__",
    "_build",
    "buck-out",
    "build",
    "dist",
    "node_modules",
    "venv",
]

# Same as Black.
line-length = 88

# Allow unused variables when underscore-prefixed.
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

[tool.ruff.mccabe]
# Unlike Flake8, default to a complexity level of 10.
max-complexity = 10

utils/cloudflare_utils.py (3)
69-69: Rename unused loop variable.

The dirs variable in the loop is unused. Following Python conventions, rename it to _dirs to indicate it's intentionally unused.

-for root, dirs, files in stream:
+for root, _dirs, files in stream:

🧰 Tools
🪛 Ruff
69-69: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
71-74: Consider using os.path.normpath for consistent path handling.

While the current path handling works, it could be more robust by using os.path.normpath before the Windows path separator replacement.

 localFilePath = os.path.join(root, file)
 relativePath = os.path.relpath(localFilePath, local_folder)
-cloudPath = os.path.join(cloud_path, relativePath)
-cloudPath = cloudPath.replace("\\", "/")
+cloudPath = os.path.normpath(os.path.join(cloud_path, relativePath))
+cloudPath = cloudPath.replace(os.path.sep, "/")
82-83: Enhance error handling with specific exception details.

The current error logging could be more informative by including the file path and specific error details.

 except botocore.exceptions.ClientError as e:
-    logger.error(e)
+    logger.error(f"Failed to upload {localFilePath}: {str(e)}")

full_automation.py (2)
70-72: Add progress tracking for large uploads

The upload process could benefit from progress tracking, especially for large model files.

+logger.info(f"Uploading files from outputs to {upload_data['data']['folder_name']}...")
 cf_storage.upload_folder(
     local_folder="outputs",
-    cloud_path=upload_data["data"]["folder_name"]
+    cloud_path=upload_data["data"]["folder_name"],
+    progress_callback=lambda filename, bytes_uploaded, total_bytes:
+        logger.info(f"Uploading {filename}: {bytes_uploaded}/{total_bytes} bytes")
 )
+logger.info("Upload completed successfully")
Line range hint 84-86: Replace os.system with safer alternatives

Using os.system for cleanup is unsafe as it doesn't handle paths with spaces or special characters properly.

-os.system("rm -rf merged_model")
-os.system("rm -rf outputs")
+import shutil
+for path in ["merged_model", "outputs"]:
+    if os.path.exists(path):
+        shutil.rmtree(path)

README.md (1)
90-90: Fix the loose punctuation mark.

Remove the trailing period after "model" as it creates inconsistency with other bullet points in the list.

-`gpu_type`: The type of GPU used for training the model.
+`gpu_type`: The type of GPU used for training the model

🧰 Tools
🪛 LanguageTool
[uncategorized] ~90-~90: Loose punctuation mark.
Context: ...sed for training the model. - revision: The commit hash from the Hugging Face r...
(UNLIKELY_OPENING_PUNCTUATION)
demo_data.jsonl (2)
Line range hint 25-48: Consider reducing redundancy in tax-related responses.

The tax-related conversations contain some overlapping information. Consider consolidating similar advice and cross-referencing between related topics.
Line range hint 73-96: Consider enhancing time management tool recommendations.

While the time management tools section is informative, consider adding:
- Pricing information (free vs paid)
- Platform availability (web, mobile, desktop)
- Integration capabilities
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- .pre-commit-config.yaml (1 hunks)
- Dockerfile (1 hunks)
- README.md (1 hunks)
- demo.py (0 hunks)
- demo_data.jsonl (1 hunks)
- full_automation.py (2 hunks)
- merge.py (0 hunks)
- utils/cloudflare_utils.py (1 hunks)
- utils/flock_api.py (2 hunks)
💤 Files with no reviewable changes (2)
- demo.py
- merge.py
✅ Files skipped from review due to trivial changes (1)
- Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
- utils/flock_api.py
🧰 Additional context used
🪛 LanguageTool
README.md
[uncategorized] ~90-~90: Loose punctuation mark.
Context: ...sed for training the model. - revision: The commit hash from the Hugging Face r...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 Ruff
utils/cloudflare_utils.py
69-69: Loop control variable dirs not used within loop body
Rename unused dirs to _dirs
(B007)
🔇 Additional comments (10)
.pre-commit-config.yaml (1)
1-15: Verify the impact of Ruff formatting on the codebase.
Since this introduces automatic code formatting, we should verify the potential impact on the existing codebase.
Let's check the files that would be affected:
utils/cloudflare_utils.py (4)
1-7: LGTM! Well-structured imports.
All necessary dependencies are properly imported with clear purpose: boto3 for S3 operations, threading for thread-safety, tqdm for progress bars, and loguru for logging.
27-42: LGTM! Well-structured initialization.
The class initialization is clean with proper parameter handling and separation of concerns. The deferred client creation is a good practice.
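To make the deferred-creation point concrete, here is a rough sketch of the pattern under the assumption that the class wraps boto3's standard S3 client; CloudStorageSketch is illustrative, not the file's exact contents:

```python
import boto3


class CloudStorageSketch:
    """Illustrative: credentials are stored at construction time;
    the boto3 client is only built when initialize() is called."""

    def __init__(self, access_key, secret_key, endpoint_url, session_token=None, bucket=None):
        self.access_key = access_key
        self.secret_key = secret_key
        self.endpoint_url = endpoint_url
        self.session_token = session_token
        self.bucket = bucket
        self.client = None  # deferred: created in initialize()

    def initialize(self):
        if not all([self.access_key, self.secret_key, self.endpoint_url]):
            raise ValueError(
                "Missing required credentials: access_key, secret_key, and endpoint_url"
            )
        self.client = boto3.client(
            "s3",
            aws_access_key_id=self.access_key,
            aws_secret_access_key=self.secret_key,
            aws_session_token=self.session_token,
            endpoint_url=self.endpoint_url,
        )
```

Deferring the client build keeps construction cheap and lets credential validation fail loudly at a single, well-defined point.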
20-24: 🛠️ Refactor suggestion
Fix the logging statement to properly handle progress updates.
The logger.info method doesn't support the end parameter. Consider using print() or better yet, leverage the already imported tqdm library for a more robust progress tracking solution.
Here's a better implementation using tqdm:
 def __call__(self, bytes_amount):
     with self._lock:
         self._seen_so_far += bytes_amount
         percentage = (self._seen_so_far / self.size) * 100
-        logger.info(
-            "\r%s %s / %s (%.2f%%)"
-            % (self.filename, self._seen_so_far, self.size, percentage),
-            end="",
-        )
+        tqdm.write(f"{self.filename}: {percentage:.2f}% ({self._seen_so_far}/{self.size} bytes)")

43-61: Fix credential validation and error handling.
There are several issues in the initialization:
- The bare raise statement needs to specify an exception
- The error message mentions session_token but it's not checked in the condition
- The session_token is optional in boto3, so it shouldn't be in the error message
Here's the corrected implementation:

 def initialize(self):
     if (
         self.access_key is None
         or self.secret_key is None
         or self.endpoint_url is None
     ):
-        logger.error(
-            "Please provide access_key, secret_key, session_token and endpoint_url"
-        )
-        raise
+        logger.error("Please provide access_key, secret_key, and endpoint_url")
+        raise ValueError("Missing required credentials: access_key, secret_key, and endpoint_url")

full_automation.py (2)
10-12: LGTM! Import changes align with R2 migration.
The new imports correctly support the migration from Hugging Face to Cloudflare R2.
59-61: Security: Remove sensitive data from logs
The task info being logged earlier (line 23) may contain sensitive R2 credentials. Consider redacting sensitive fields before logging.
-logger.info(json.dumps(task, indent=4))
+# Create a copy and redact sensitive fields
+log_safe_task = task.copy()
+if "data" in log_safe_task:
+    log_safe_task["data"] = {
+        k: v for k, v in log_safe_task["data"].items()
+        if k not in ["access_key", "secret_key", "session_token"]
+    }
+logger.info(json.dumps(log_safe_task, indent=4))

Likely invalid or redundant comment.
demo_data.jsonl (3)
Line range hint 1-24: LGTM! Well-structured conversation about website design best practices.
The conversation follows a clear Q&A format with comprehensive responses covering essential aspects of website design.
Line range hint 49-72: LGTM! Well-structured study techniques conversation.
The conversation provides clear, actionable advice about study techniques with good examples and practical implementation steps.
Line range hint 1-96: Verify data format consistency across all entries.
The JSONL format appears consistent, but let's verify the structure across all entries.
✅ Verification successful
Data format is consistent and well-structured across all entries
The verification shows:
- All 8 entries follow the same JSONL format
- Each entry has consistent required fields: "system" and "conversations"
- System messages are consistent: "You are a helpful assistant."
- Conversation roles are properly structured with only valid values: "user" (34 instances) and "assistant" (34 instances)
- Each conversation entry contains required "role" and "content" fields
- The content is properly formatted with appropriate newlines and formatting
The data format meets all the necessary requirements for proper processing and usage.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify JSON structure consistency and required fields
# Test 1: Verify each line is valid JSON
jq -c '.' demo_data.jsonl > /dev/null
# Test 2: Verify required fields exist in each entry
jq -c 'select(.system == null or .conversations == null)' demo_data.jsonl
# Test 3: Verify conversation structure
jq -c '.conversations[] | select(.role == null or .content == null)' demo_data.jsonl
Length of output: 188
Script:
#!/bin/bash
# Let's verify the structure more thoroughly with specific checks
# Check 1: Print the count of entries to verify multiple records exist
echo "Total entries:"
wc -l demo_data.jsonl
# Check 2: Verify system field value consistency
echo -e "\nSystem field values:"
jq -r '.system' demo_data.jsonl | sort | uniq -c
# Check 3: Verify conversation roles are either user or assistant
echo -e "\nUnique conversation roles:"
jq -r '.conversations[].role' demo_data.jsonl | sort | uniq -c
# Check 4: Sample a few entries to inspect structure
echo -e "\nSample entry structure:"
head -n 1 demo_data.jsonl | jq '.'
Length of output: 7349
+upload_data = get_address(task_id)
+cf_storage = CloudStorage(
+    access_key=upload_data["data"]["access_key"],
+    secret_key=upload_data["data"]["secret_key"],
+    endpoint_url=upload_data["data"]["endpoint_url"],
+    session_token=upload_data["data"]["session_token"],
+    bucket=upload_data["data"]["bucket"],
+)
+cf_storage.initialize()
+cf_storage.upload_folder(
+    local_folder="outputs", cloud_path=upload_data["data"]["folder_name"]
+)
-# get commit hash
-commit_hash = commit_message.oid
-logger.info(f"Commit hash: {commit_hash}")
-logger.info(f"Repo name: {repo_name}")
 # submit
 submit_task(
-    task_id, repo_name, model2base_model[model_id], gpu_type, commit_hash
+    task_id,
+    model2base_model[model_id],
+    gpu_type,
+    upload_data["data"]["bucket"],
+    upload_data["data"]["folder_name"],
🛠️ Refactor suggestion
Enhance error handling for upload process
While basic error handling exists, it could be more specific to help with debugging and recovery.
 upload_data = get_address(task_id)
+if not upload_data or "data" not in upload_data:
+    raise ValueError("Invalid upload data received from API")
+
 cf_storage = CloudStorage(
     access_key=upload_data["data"]["access_key"],
     secret_key=upload_data["data"]["secret_key"],
     endpoint_url=upload_data["data"]["endpoint_url"],
     session_token=upload_data["data"]["session_token"],
     bucket=upload_data["data"]["bucket"],
 )
 cf_storage.initialize()
+except ValueError as e:
+    logger.error(f"API Error: {e}")
+    continue
+except ConnectionError as e:
+    logger.error(f"Network Error during upload: {e}")
+    continue
 except Exception as e:
-    logger.error(f"Error: {e}")
+    logger.exception("Unexpected error during upload process")
     logger.info("Proceed to the next model...")

Committable suggestion skipped: line range outside the PR's diff.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- .pre-commit-config.yaml (1 hunks)
🔇 Additional comments (1)
.pre-commit-config.yaml (1)
1-13: LGTM! Formatting issues have been resolved.
The YAML formatting is now consistent with proper 2-space indentation and dash formatting throughout the file.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.4
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
💡 Codebase verification
Update pre-commit hooks to latest stable versions and consider additional security checks
The current configuration is using outdated versions. Here's what needs to be updated:
- Update pre-commit-hooks from v4.6.0 to v5.0.0
- Update ruff from v0.6.4 to v0.8.4

Additionally, consider adding these security-focused hooks from pre-commit-hooks:

- check-added-large-files
- check-case-conflict
- check-merge-conflict
- detect-private-key
Here's the suggested updated configuration:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-added-large-files
      - id: check-case-conflict
      - id: check-merge-conflict
      - id: detect-private-key
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.8.4
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

🔗 Analysis chain
Consider additional hooks and verify versions.
The current configuration provides basic linting and formatting. Consider adding these useful hooks:
- check-added-large-files: Prevents large files from being committed
- check-case-conflict: Checks for files with names that would conflict on case-insensitive filesystems
- check-merge-conflict: Checks for files containing merge conflict strings
- detect-private-key: Detects private keys
Let's verify the latest stable versions of the hooks:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check latest versions of pre-commit hooks and ruff
# Check latest pre-commit-hooks version
gh api repos/pre-commit/pre-commit-hooks/releases/latest --jq .tag_name
# Check latest ruff version
gh api repos/astral-sh/ruff/releases/latest --jq .tag_name
Length of output: 147
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
- Added boto3 package for cloud storage integration.