Site Reliability Engineer • Observability • Production Support • Technical Support

🎯 Career Objective

3+ years of overall experience across Site Reliability Engineering, observability platforms, IT infrastructure and application production support, and Java support engineering.
Experienced Observability Monitoring Engineer with over 3 years in administrative roles, specializing in providing 24/7 support for global customers in production environments.
Proficient in APM and monitoring tools such as Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, and SiteScope. Skilled in managing SLOs, SLIs, and SLAs, and well-versed in ITIL practices including Incident, Change, Major Incident, and Problem Management. Proven ability in Datadog administration, dashboard creation, and monitoring services in production environments.
API Development: Engineered secure and robust API endpoints for CRUD operations, ensuring data integrity and reliable performance.
Debugging & Maintenance: Adept at bug fixing and debugging complex applications to maintain system health.
Frameworks: Good working knowledge of developing and troubleshooting applications with Spring Boot and Spring MVC.
Timely Resolution: Committed to diagnosing and resolving system issues to minimize downtime and impact.

Monitoring & Observability

Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:

Tools: Datadog | Grafana | Kibana | New Relic

Datadog Administration: Onboarding services, configuring agents, and tuning metrics collection.
Visualization: Designing and building insightful dashboards tailored to SLOs/SLIs and business KPIs.
Alerting: Implementing and managing alert policies to reduce noise and improve MTTR.

Process & Framework

Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.

🔑 Key Skills

ITIL: Incident, Change, Major Incident, Problem Management; SLOs, SLIs, SLAs (metrics, traces, logs).
Alerting: success/error/composite alerts, threshold tuning, refinement, noise/toil reduction.
App monitoring: triage in production, dev collaboration via JIRA, runbooks, dashboards, reporting.
Tooling: Grafana (error insights), Kibana (log analysis), Datadog admin (monitors, dashboards), PagerDuty (on-call).
Process: onboarding services to monitoring, gap analysis, RCA participation, weekly/monthly reporting.
Programming: Java, Python (custom metrics, light instrumentation).

Professional Experience

DXC Technology, Bangalore — Site Reliability Engineer (Dec 2022 – Present)

Client: Qatar Airways — Payments Monitoring Group

Provided 24/7 support to global customers for payments applications in production environments.
Managed and administered the full observability stack: Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, SiteScope.
Implemented SLOs, SLIs, SLAs to ensure performance and reliability goals were met and measured.
Applied ITIL practices for Incident, Change, Major Incident, and Problem Management.
Created and maintained comprehensive Datadog dashboards & monitors for real-time application performance tracking.
Onboarded new application services into production environments and performed gap analysis to ensure monitoring coverage.
Developed and refined alerts for KPIs such as success rate, error rate, and composite metrics to reduce noise and improve MTTR.
Collaborated with development teams via JIRA for ticket creation, escalation, and resolution tracking.
Configured and monitored alerts with PagerDuty to ensure timely incident response and on-call rotations.
Performed advanced observability tasks: custom dashboards, widgets, panels in Datadog; threshold tuning; noise reduction in alerts.
Analyzed and exported observability data from Datadog into Google Sheets, reporting key insights and trends to business stakeholders.
Monitored applications, services, and jobs across Datadog, Grafana, Kibana.
Prepared detailed incident checklists and shared structured, client-facing updates.
Worked extensively on SLA & SLI definitions for critical payments services in production systems.
Configured JIRA dashboards as per project requirements for enhanced visibility and reporting.

Wipro, Bangalore — Site Reliability Engineer (Apr 2022 – Nov 2022)

Client: HSBC — Payments Monitoring

Provided 24/7 L1/L2 support to global customers for critical payments applications in production environments.
Managed and administered the APM/Monitoring stack: Datadog, Grafana, Kibana, OMI, Tidal, SiteScope.
Configured and tuned alert thresholds, significantly reducing noise from ineffective alerts and improving signal clarity.
Monitored and supported applications, services, and batch jobs across multiple platforms to ensure system health.
Created and escalated JIRA tickets to development teams for faster incident resolution and tracking.
Prepared structured incident checklists and runbooks, sharing clear documentation with clients and business teams.
Defined and monitored SLA/SLI metrics for payment services using Datadog to uphold service quality agreements.
Built and customized JIRA dashboards based on project requirements to streamline workflow and visibility.
Configured PagerDuty alerting and implemented escalation workflows to ensure on-call responsiveness.
Performed detailed incident analysis and engaged with Root Cause Analysis (RCA) teams to drive long-term fixes.
Generated and shared daily, weekly, and monthly status reports with business stakeholders to communicate system health and incidents.
Conducted basic front-end troubleshooting of applications and engaged next-level support teams for complex issues.
Provided front-line and second-level IT operations support, ensuring outstanding client service delivery.
Supported weekend server patching activities, including comprehensive pre- and post-patching validation checks.