Skip to content

Pod Log Retrieval API & Validation Failure Log Preservation#350

Open
paras-agarwal-meesho wants to merge 4 commits intomainfrom
feat/failed-validation-logs
Open

Pod Log Retrieval API & Validation Failure Log Preservation#350
paras-agarwal-meesho wants to merge 4 commits intomainfrom
feat/failed-validation-logs

Conversation

@paras-agarwal-meesho
Copy link

@paras-agarwal-meesho paras-agarwal-meesho commented Feb 27, 2026

🔁 Pull Request Template – BharatMLStack

Context:

Refer: Pod Log Retrieval API & Validation Failure Log Preservation

Model validation failures are currently difficult to debug because the Kubernetes pod logs of the test deployable are ephemeral and lost once the pod or deployable is recycled.

This solution introduces:

  • A standard Horizon API to fetch Kubernetes pod logs for Argo CD applications.
  • A failure log preservation mechanism that captures and uploads test deployable logs to GCS when validation fails.
  • Integration with TruffleBox to surface preserved logs via GCS links in rejection reasons.

The goal is to eliminate dependency on direct Argo CD or cluster access and provide durable, auditable failure diagnostics.

Describe your changes:

  • API exposed: horizon/pkg/argocd/logs.go
  • Tests for api: horizon/pkg/argocd/logs_test.go
  • Validation Failure logs captured and uploaded to GCS: horizon/internal/predator/handler/predator_validation_logs.go
  • Tests : horizon/internal/predator/handler/predator_validation_logs_test.go

Testing:

  • API testing was done
  • Extensive test files for exhaustive testcases for all functions are created
  • Flow is tested

Checklist before requesting a review

  • I have reviewed my own changes?
  • Relevant or critical functionality is covered by tests?
  • Monitoring needs have been evaluated?
  • Any necessary documentation updates have been considered?

📂 Modules Affected

  • horizon (Real-time systems / networking)
  • online-feature-store (Feature serving infra)
  • trufflebox-ui (Admin panel / UI)
  • infra (Docker, CI/CD, GCP/AWS setup)
  • docs (Documentation updates)
  • Other: ___________

✅ Type of Change

  • Feature addition
  • Bug fix
  • Infra / build system change
  • Performance improvement
  • Refactor
  • Documentation
  • Other: ___________

📊 Benchmark / Metrics (if applicable)

Summary by CodeRabbit

  • New Features

    • Added API endpoint to retrieve application logs with filtering (pod, container, previous, time range, tail, filter).
    • Validation failures now capture, format, and upload diagnostic logs to cloud storage with a public URL when available.
  • Tests

    • Added extensive test coverage for logs retrieval, parsing, URL construction, and validation-failure logging.

@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

Walkthrough

Adds non‑streaming ArgoCD application log retrieval with an API endpoint and handler; integrates failure log capture into Predator validation, formatting degraded pod logs and uploading them to GCS; and introduces comprehensive tests for both features.

Changes

Cohort / File(s) Summary
ArgoCD Logs SDK & tests
horizon/pkg/argocd/logs.go, horizon/pkg/argocd/logs_test.go
New non‑streaming NDJSON logs client: types (ApplicationLogsOptions, ApplicationLogEntry, errors), URL builder, NDJSON parser, GetApplicationLogs implementation, and extensive unit tests (parsing, URL building, API/error paths).
Infrastructure controller, handler & tests
horizon/internal/infrastructure/controller/controller.go, horizon/internal/infrastructure/controller/controller_test.go, horizon/internal/infrastructure/handler/handler.go, horizon/internal/infrastructure/router/router.go
New GET /applications/:appName/logs route, ApplicationLogsQuery type, InfrastructureController.GetApplicationLogs endpoint with validation and option parsing, handler interface and implementation method GetApplicationLogs, and table‑driven tests covering success, validation, and error mappings.
Predator validation logging & tests
horizon/internal/predator/handler/predator_validation.go, horizon/internal/predator/handler/predator_validation_logs.go, horizon/internal/predator/handler/predator_validation_logs_test.go
updateRequestValidationStatus now accepts gcsLogPath; new failure log capture module to find degraded pods, fetch previous container logs, convert timestamps to IST, format plaintext reports, build GCS paths, upload logs, and return public GCS URL; comprehensive unit tests and mocks.
Misc
.gitignore
Added horizon/pkg/argocd/manual_test.go to .gitignore.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Controller as InfrastructureController
    participant Handler as InfrastructureHandler
    participant ArgoCD as ArgoCD API

    Client->>Controller: GET /applications/:appName/logs
    activate Controller
    Controller->>Controller: validate appName, workingEnv, query params
    Controller->>Handler: GetApplicationLogs(appName, workingEnv, opts)
    activate Handler
    Handler->>Handler: derive ArgoCD app name, build request
    Handler->>ArgoCD: HTTP GET logs (NDJSON)
    activate ArgoCD
    ArgoCD-->>Handler: NDJSON log entries
    deactivate ArgoCD
    Handler-->>Controller: []ApplicationLogEntry
    deactivate Handler
    Controller-->>Client: 200 OK with entries
    deactivate Controller
Loading
sequenceDiagram
    participant Predator
    participant InfraHandler as InfrastructureHandler
    participant ArgoCD as ArgoCD API
    participant GCS as GCS Storage

    Predator->>InfraHandler: findDegradedPods(serviceName)
    activate InfraHandler
    InfraHandler->>InfraHandler: fetch resource details
    InfraHandler-->>Predator: list of degraded pods
    deactivate InfraHandler

    loop for each degraded pod
        Predator->>InfraHandler: fetchDegradedPodLogs(serviceName, podName) (Previous=true)
        activate InfraHandler
        InfraHandler->>ArgoCD: GetApplicationLogs(...)
        activate ArgoCD
        ArgoCD-->>InfraHandler: log entries
        deactivate ArgoCD
        InfraHandler->>InfraHandler: convert timestamps to IST, format pod logs
        InfraHandler-->>Predator: formatted pod logs
        deactivate InfraHandler
    end

    Predator->>Predator: formatValidationLogsPlainText
    Predator->>Predator: buildValidationLogsGCSPath
    Predator->>GCS: UploadFile(report, gcsPath)
    activate GCS
    GCS-->>Predator: public GCS URL
    deactivate GCS
    Predator->>Predator: updateRequestValidationStatus(groupID, success=false, gcsLogPath)
Loading
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Krd Checker ⚠️ Warning PR description lacks either a valid KRD link or KRD exemption with proper Pod Type and justification. Add KRD link (KRD: https://docs.google.com/document/d/) or KRD exemption with Pod Type and 20+ character justification.
✅ Passed checks (1 passed)
Check name Status Explanation
Dynamic Configuration Validation ✅ Passed No changes to application-dyn-*.yml files found in PR; custom check condition satisfied.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@horizon/internal/infrastructure/controller/controller.go`:
- Around line 177-187: The handler currently allows tailLines == 0 but
downstream URL builder coerces non-positive values to 1000; make the contract
consistent by rejecting 0 at the API layer: in the block that parses q.TailLines
(the tailLines variable and strconv.ParseInt usage around q.TailLines), change
the validation to require tailLines > 0 and return a 400 with a message like
"tailLines must be a positive integer" when it's 0 or negative; this preserves
caller intent and aligns with the URL builder behavior.

In `@horizon/internal/predator/handler/predator_validation_logs.go`:
- Around line 183-190: Sanitize and normalize job.ServiceName before composing
the GCS object key in buildValidationLogsGCSPath: create a sanitized variable
(e.g., sanitizedService := strings.ToLower(job.ServiceName)) and then strip or
replace unsafe characters (remove any occurrences of "..", replace path
separators like "/" and "\" with "-" or remove them, and restrict to
alphanumerics, dashes and underscores—e.g., via a regexp whitelist) so the
objectPath uses sanitizedService instead of job.ServiceName when calling
path.Join; update references in buildValidationLogsGCSPath accordingly to
prevent traversal and ensure consistent lowercase keys.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aec2c74 and f8afae4.

📒 Files selected for processing (10)
  • .gitignore
  • horizon/internal/infrastructure/controller/controller.go
  • horizon/internal/infrastructure/controller/controller_test.go
  • horizon/internal/infrastructure/handler/handler.go
  • horizon/internal/infrastructure/router/router.go
  • horizon/internal/predator/handler/predator_validation.go
  • horizon/internal/predator/handler/predator_validation_logs.go
  • horizon/internal/predator/handler/predator_validation_logs_test.go
  • horizon/pkg/argocd/logs.go
  • horizon/pkg/argocd/logs_test.go

Comment on lines +177 to +187
tailLines := int64(1000)
if q.TailLines != "" {
var err error
tailLines, err = strconv.ParseInt(q.TailLines, 10, 64)
if err != nil || tailLines < 0 {
ctx.JSON(http.StatusBadRequest, gin.H{"error": gin.H{
"message": "tailLines must be a non-negative integer",
}})
return
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

tailLines=0 is accepted here but not honored downstream.

Line 181 accepts 0, but the URL builder later coerces non-positive values to 1000, so caller intent is lost. Please make this contract consistent (either reject 0 here or preserve 0 end-to-end).

One consistent option (reject 0 at API layer)
 	if q.TailLines != "" {
 		var err error
 		tailLines, err = strconv.ParseInt(q.TailLines, 10, 64)
-		if err != nil || tailLines < 0 {
+		if err != nil || tailLines <= 0 {
 			ctx.JSON(http.StatusBadRequest, gin.H{"error": gin.H{
-				"message": "tailLines must be a non-negative integer",
+				"message": "tailLines must be a positive integer",
 			}})
 			return
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
tailLines := int64(1000)
if q.TailLines != "" {
var err error
tailLines, err = strconv.ParseInt(q.TailLines, 10, 64)
if err != nil || tailLines < 0 {
ctx.JSON(http.StatusBadRequest, gin.H{"error": gin.H{
"message": "tailLines must be a non-negative integer",
}})
return
}
}
tailLines := int64(1000)
if q.TailLines != "" {
var err error
tailLines, err = strconv.ParseInt(q.TailLines, 10, 64)
if err != nil || tailLines <= 0 {
ctx.JSON(http.StatusBadRequest, gin.H{"error": gin.H{
"message": "tailLines must be a positive integer",
}})
return
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/infrastructure/controller/controller.go` around lines 177 -
187, The handler currently allows tailLines == 0 but downstream URL builder
coerces non-positive values to 1000; make the contract consistent by rejecting 0
at the API layer: in the block that parses q.TailLines (the tailLines variable
and strconv.ParseInt usage around q.TailLines), change the validation to require
tailLines > 0 and return a 400 with a message like "tailLines must be a positive
integer" when it's 0 or negative; this preserves caller intent and aligns with
the URL builder behavior.

Comment on lines +183 to +190
func (p *Predator) buildValidationLogsGCSPath(job *validationjob.Table) (bucket, objectPath string) {
if p.isNonProductionEnvironment() {
bucket = pred.GcsModelBucket
objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
} else {
bucket = pred.GcsConfigBucket
objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Sanitize and normalize job.ServiceName before composing the GCS object key.

Line 186 and Line 189 build object paths from raw job.ServiceName. If it contains /, .., or mixed case, keys can become inconsistent or escape the intended prefix layout.

Proposed fix
 func (p *Predator) buildValidationLogsGCSPath(job *validationjob.Table) (bucket, objectPath string) {
+	serviceName := strings.ToLower(strings.TrimSpace(job.ServiceName))
+	if serviceName == "" || strings.Contains(serviceName, "/") || strings.Contains(serviceName, "..") {
+		return "", ""
+	}
+	objectPath = path.Join(validationLogsDir, serviceName, job.GroupID+".log")
+
 	if p.isNonProductionEnvironment() {
 		bucket = pred.GcsModelBucket
-		objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
 	} else {
 		bucket = pred.GcsConfigBucket
-		objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
 	}
 	return bucket, objectPath
 }

Based on learnings: service names (inferflow, predator, numerix) are expected to be consistently lowercase across Horizon code paths.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (p *Predator) buildValidationLogsGCSPath(job *validationjob.Table) (bucket, objectPath string) {
if p.isNonProductionEnvironment() {
bucket = pred.GcsModelBucket
objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
} else {
bucket = pred.GcsConfigBucket
objectPath = path.Join(validationLogsDir, job.ServiceName, job.GroupID+".log")
}
func (p *Predator) buildValidationLogsGCSPath(job *validationjob.Table) (bucket, objectPath string) {
serviceName := strings.ToLower(strings.TrimSpace(job.ServiceName))
if serviceName == "" || strings.Contains(serviceName, "/") || strings.Contains(serviceName, "..") {
return "", ""
}
objectPath = path.Join(validationLogsDir, serviceName, job.GroupID+".log")
if p.isNonProductionEnvironment() {
bucket = pred.GcsModelBucket
} else {
bucket = pred.GcsConfigBucket
}
return bucket, objectPath
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/predator/handler/predator_validation_logs.go` around lines
183 - 190, Sanitize and normalize job.ServiceName before composing the GCS
object key in buildValidationLogsGCSPath: create a sanitized variable (e.g.,
sanitizedService := strings.ToLower(job.ServiceName)) and then strip or replace
unsafe characters (remove any occurrences of "..", replace path separators like
"/" and "\" with "-" or remove them, and restrict to alphanumerics, dashes and
underscores—e.g., via a regexp whitelist) so the objectPath uses
sanitizedService instead of job.ServiceName when calling path.Join; update
references in buildValidationLogsGCSPath accordingly to prevent traversal and
ensure consistent lowercase keys.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@horizon/internal/predator/handler/predator_validation_logs_test.go`:
- Around line 62-64: The test package already defines a type named mockGCSClient
causing a redeclaration; rename this local mock type (mockGCSClient) to a unique
identifier such as mockGCSClientForValidationLogs and update all references in
predator_validation_logs_test.go (constructors, variable literals, and method
calls) accordingly, or alternatively move the mock definition into a shared test
helper file (e.g., mocks_test.go) and remove the duplicate type here so both
tests reuse the single mock definition.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8afae4 and d4e7a39.

📒 Files selected for processing (1)
  • horizon/internal/predator/handler/predator_validation_logs_test.go

Comment on lines +62 to +64
type mockGCSClient struct {
uploadFileFn func(bucket, objectPath string, data []byte) error
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix: mockGCSClient is already declared in another file in this package.

The pipeline failure indicates this type is redeclared. Since all _test.go files in the same package share a namespace, you need to either:

  1. Rename this mock to something unique like mockGCSClientForValidationLogs
  2. Consolidate mocks into a shared test helper file (e.g., mocks_test.go) and reuse across tests
Option 1: Rename the mock
-type mockGCSClient struct {
+type mockValidationLogsGCSClient struct {
 	uploadFileFn func(bucket, objectPath string, data []byte) error
 }

-func (m *mockGCSClient) ReadFile(_, _ string) ([]byte, error)                  { return nil, nil }
+func (m *mockValidationLogsGCSClient) ReadFile(_, _ string) ([]byte, error)                  { return nil, nil }
 // ... update all other method receivers similarly
🧰 Tools
🪛 GitHub Actions: Horizon CI

[error] 62-62: mockGCSClient redeclared in this block

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@horizon/internal/predator/handler/predator_validation_logs_test.go` around
lines 62 - 64, The test package already defines a type named mockGCSClient
causing a redeclaration; rename this local mock type (mockGCSClient) to a unique
identifier such as mockGCSClientForValidationLogs and update all references in
predator_validation_logs_test.go (constructors, variable literals, and method
calls) accordingly, or alternatively move the mock definition into a shared test
helper file (e.g., mocks_test.go) and remove the duplicate type here so both
tests reuse the single mock definition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant