Merged
Binary file removed imgs/slurm_account_grafana.png
Binary file not shown.
Binary file removed imgs/slurm_partition_grafana.png
Binary file not shown.
Binary file removed imgs/slurm_rpc_grafana.png
Binary file not shown.
18 changes: 18 additions & 0 deletions reference/glossary.md
@@ -52,6 +52,12 @@ Graphics Processing Unit (GPU)
A specialized processor that is designed to accelerate image processing and graphics rendering
for output to a display device.

Grafana
An open-source platform for data visualization, monitoring, and observability. Used to create
dashboards and graphs from time series data stored in various data sources.

Resources: [Grafana website {octicon}`link-external`](https://grafana.com/), [Grafana documentation {octicon}`link-external`](https://grafana.com/docs/), [Grafana charm {octicon}`link-external`](https://charmhub.io/grafana-k8s)

High-Performance Computing (HPC)
The practice of aggregating computing power using clusters and parallel processing to complete
tasks faster than standard computing.
@@ -78,6 +84,12 @@ Juju

Resources: [Juju documentation {octicon}`link-external`](https://documentation.ubuntu.com/juju/latest/)

Loki
An open-source log aggregation system designed to store and query logs from applications and
infrastructure. Designed to be cost-effective and easy to operate.

Resources: [Grafana Loki website {octicon}`link-external`](https://grafana.com/oss/loki/), [Loki documentation {octicon}`link-external`](https://grafana.com/docs/loki/latest/), [Loki charm {octicon}`link-external`](https://charmhub.io/loki-k8s)

MicroCeph
A tool that simplifies deployment and management of Ceph storage both standalone and in a
charmed environment using Juju.
@@ -106,6 +118,12 @@ Proxy charm
An intermediary charm that enables charms to integrate with non-charmed applications. Also known
as an integrator charm.

Prometheus
An open-source monitoring and alerting system that collects and stores metrics as time series
data. Features a flexible query language (PromQL) and built-in alerting capabilities.

Resources: [Prometheus website {octicon}`link-external`](https://prometheus.io/), [Prometheus documentation {octicon}`link-external`](https://prometheus.io/docs/), [Prometheus charm {octicon}`link-external`](https://charmhub.io/prometheus-k8s)

`sackd`
Slurm Auth and Credential Kiosk daemon. Typically used to provide cluster login nodes.

77 changes: 52 additions & 25 deletions reference/monitoring/grafana.md
@@ -1,11 +1,15 @@
---
relatedlinks: "[Grafana dashboards documentation](https://grafana.com/docs/grafana/latest/dashboards/)"
---

(reference-monitoring-grafana)=
# Grafana dashboards

This is an overview of all the charms used in Charmed HPC that provide dashboards for
[Grafana](https://grafana.com/), which acts as a web interface to visualize data from aggregators such
as [Prometheus](https://prometheus.io/) or [Loki](https://grafana.com/oss/loki/). See
[Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/) for more general information on dashboards,
and {ref}`reference-monitoring-prometheus` for more information about the metrics displayed on the dashboards.
{term}`Grafana`, which acts as a web interface to visualize data from aggregators such
as {term}`Prometheus` or {term}`Loki`.

See {ref}`howto-manage-integrate-with-cos` for more information.

:::{admonition} Panel query
:class: note
@@ -16,39 +20,62 @@ to see the exact query used to provide the panel with data.

## Slurmctld

The dashboard from the `slurmctld` charm displays an overall view of the cluster, including the following information:
The dashboards from the {term}`slurmctld` charm provide a display of information from the
entire cluster, each partition, and each charm.

### Cluster Overview

The "Cluster Overview" dashboard provides a display of cluster-level metrics such as:

- Total resource utilization
- Job status distribution
- Node state distribution
- Scheduler metrics

![Grafana Cluster Overview dashboard showing total resource utilization, job state distribution, node state distribution, and scheduler metrics for the Charmed HPC cluster](/reuse/reference/monitoring/cluster-overview.png)

### Partition Overview

The "Partition Overview" dashboard provides a display of partition-level metrics such as:

- Total nodes and jobs in the partition
- Total resource utilization for the partition
- Job status distribution for jobs in the partition
- Node state distribution for all nodes in the partition

![Grafana Partition Overview dashboard showing total nodes and jobs, resource utilization, job status distribution, and node state distribution for a specific partition](/reuse/reference/monitoring/partition-overview.png)

### Node Overview

The "Node Overview" dashboard provides a display of node-level metrics such as:

- CPU and memory usage per partition.
- Node state count.
- CPU and memory usage per account.
- Statistics on Slurmctld RPC messages.
- Available resources that are allocatable for jobs
- Total resource utilization on the node

![Slurm partition dashboard](../../imgs/slurm_partition_grafana.png)
![Slurm account dashboard](../../imgs/slurm_account_grafana.png)
![Slurm rpc dashboard](../../imgs/slurm_rpc_grafana.png)
![Grafana Node Overview dashboard showing node state, resource utilization, running jobs, and hardware configuration for a specific compute node](/reuse/reference/monitoring/node-overview.png)

## MySQL

The dashboard from the `mysql` charm displays metrics for the storage database of Slurmdbd:

- Uptime.
- Queries per second.
- Current cache size.
- Maximum number of concurrent connections.
- Thread resource usage.
- Network traffic statistics.
- Uptime
- Queries per second
- Current cache size
- Maximum number of concurrent connections
- Thread resource usage
- Network traffic statistics

![MySQL dashboard](../../imgs/mysql_grafana.png)
![Grafana MySQL dashboard showing database metrics including uptime, queries per second, cache size, concurrent connections, and thread count](/reuse/reference/monitoring/mysql-exporter.png)

## Traefik K8s

The dashboard from the `traefik-k8s` charm displays metrics about the reverse proxy used when communicating
between the compute plane cluster and the monitoring/identity k8s clusters. This includes:

- Uptime.
- HTTP response code statistics.
- Response times.
- Open connections statistics.
- Raw logs for every proxied endpoint.
- Uptime
- Response times
- HTTP response code statistics
- Open connection statistics
- Raw logs for every proxied endpoint

![Traefik dashboard](../../imgs/traefik_grafana.png)
![Grafana Traefik dashboard showing reverse proxy metrics including uptime, response times, HTTP response code statistics, and open connection statistics](/reuse/reference/monitoring/traefik.png)
7 changes: 3 additions & 4 deletions reference/monitoring/index.md
@@ -1,9 +1,9 @@
(reference-monitoring)=
# Monitoring

Integrating COS with Charmed HPC enables monitoring of resources. This integration includes Prometheus, Grafana, and Loki.


Integrating COS with Charmed HPC enables you to monitor your Charmed HPC cluster's resources
with {term}`Prometheus`, {term}`Grafana`, and {term}`Loki`. The reference material in
this section lists the dashboards and metrics from Charmed HPC that you can interact with.

- {ref}`Prometheus: metrics aggregation and alerts <reference-monitoring-prometheus>`
- {ref}`Grafana: Dashboards and resource visualizations <reference-monitoring-grafana>`
@@ -17,5 +17,4 @@ Integrating COS with Charmed HPC enables monitoring of resources. This integrati
Prometheus <prometheus>
Grafana <grafana>
Loki <loki>

```
4 changes: 2 additions & 2 deletions reference/monitoring/loki.md
@@ -1,8 +1,8 @@
(reference-monitoring-loki)=
# Loki logs

The following table lists all the charms used as part of Charmed HPC that expose logs to Loki, and the
corresponding query to see the exported logs in the Grafana UI.
The following table lists all the charms used as part of Charmed HPC that expose logs to {term}`Loki`, and the
corresponding query to see the exported logs in {term}`Grafana`.
Follow the [Visualize log data](https://grafana.com/docs/loki/latest/visualize/grafana/#grafana-explore)
tutorial from the Grafana documentation for instructions on where and how to query for Loki logs.
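
As a rough illustration of how such a log query could also be issued outside the Grafana UI, the sketch below builds a request URL for Loki's `query_range` HTTP endpoint. The host name `loki.example` and the helper `loki_query_url` are hypothetical, and the `{juju_charm="..."}` selector is an assumption modelled on the label used for Prometheus queries elsewhere in this reference; check the table in this page for the exact query of each charm.

```python
from urllib.parse import urlencode


def loki_query_url(base_url: str, charm: str, limit: int = 100) -> str:
    """Build a Loki query_range URL for the logs of one charm.

    Assumption: logs carry a `juju_charm` label, as the Prometheus metrics do.
    """
    # LogQL stream selector, e.g. {juju_charm="slurmctld"}
    selector = f'{{juju_charm="{charm}"}}'
    params = {"query": selector, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical Loki address; replace with the address of your deployment.
url = loki_query_url("http://loki.example:3100", "slurmctld")
print(url)
```

The URL can then be fetched with any HTTP client. When no `start` or `end` parameter is given, Loki falls back to its default time window.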

18 changes: 9 additions & 9 deletions reference/monitoring/prometheus.md
@@ -2,20 +2,20 @@
# Prometheus metrics and alerts

This is an overview of all the charms used in Charmed HPC that provide monitoring metrics and alerts
for [Prometheus](https://prometheus.io), a metrics aggregator and alerts manager for applications.
for {term}`Prometheus`, a metrics aggregator and alerts manager for applications.

All metrics and alerts can be viewed from Prometheus or from the [Grafana](https://grafana.com) web interface.
All metrics and alerts can be viewed from Prometheus or from the {term}`Grafana` web interface.
See {ref}`howto-manage-integrate-with-cos` for more information.

The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus,
The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus
along with the upstream documentation that describes the exported metrics. The last
column shows the corresponding query to list the exported metrics in Prometheus or the Grafana UI.
column shows the corresponding query to list the exported metrics in Prometheus or Grafana.
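
The queries in the last column are plain label selectors, so they can also be sent to the Prometheus HTTP API. The following is a minimal sketch, assuming a Prometheus instance reachable at the hypothetical address `prometheus.example:9090`; the helper name `series_url` is illustrative and not part of Charmed HPC. It builds a URL for the `/api/v1/series` endpoint, which lists the series matching a selector.

```python
from urllib.parse import urlencode


def series_url(base_url: str, charm: str) -> str:
    """Build a Prometheus HTTP API URL listing the series exported by one charm."""
    # Same selector as in the table, e.g. {juju_charm="slurmctld"}
    selector = f'{{juju_charm="{charm}"}}'
    # The series endpoint takes the selector in a repeated `match[]` parameter.
    return f"{base_url.rstrip('/')}/api/v1/series?{urlencode({'match[]': selector})}"


# Hypothetical Prometheus address; replace with the address of your deployment.
url = series_url("http://prometheus.example:9090", "slurmctld")
print(url)
```

Fetching the resulting URL with any HTTP client should return a JSON document whose `data` field holds the matching series, provided the instance is reachable from where the request is made.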

:::{csv-table}
:header: >
: charm, upstream docs, query

slurmctld, [Documentation](https://github.com/rivosinc/prometheus-slurm-exporter), `{juju_charm="slurmctld"}`{l=javascript}
slurmctld, [Documentation](https://slurm.schedmd.com/metrics.html), `{juju_charm="slurmctld"}`{l=javascript}
mysql, [Documentation](https://charmhub.io/mysql), `{juju_charm="mysql"}`{l=javascript}
postgresql-k8s, [Documentation](https://charmhub.io/postgresql-k8s), `{juju_charm="postgresql-k8s"}`{l=javascript}
glauth-k8s, [Documentation](https://charmhub.io/glauth-k8s), `{juju_charm="glauth-k8s"}`{l=javascript}
@@ -26,7 +26,7 @@ traefik-k8s, [Documentation](https://charmhub.io/traefik-k8s), `{juju_charm="tra

The `slurmctld` charm exposes metrics related to:

- Resource usage per partition, account or user.
- Jobs statuses.
- RPC messages for `slurmctld`.
- Prometheus Slurm Exporter statistics.
- Job and node statuses.
- Resource usage for each partition, node, Slurm account or user.
- Cluster-wide information such as total CPU or memory utilization.
- Scheduler information such as scheduling cycle times and queue lengths.
Binary file added reuse/reference/monitoring/node-overview.png
File renamed without changes