diff --git a/imgs/slurm_account_grafana.png b/imgs/slurm_account_grafana.png deleted file mode 100644 index 205cf9c..0000000 Binary files a/imgs/slurm_account_grafana.png and /dev/null differ diff --git a/imgs/slurm_partition_grafana.png b/imgs/slurm_partition_grafana.png deleted file mode 100644 index c9b44bd..0000000 Binary files a/imgs/slurm_partition_grafana.png and /dev/null differ diff --git a/imgs/slurm_rpc_grafana.png b/imgs/slurm_rpc_grafana.png deleted file mode 100644 index 78f0e4f..0000000 Binary files a/imgs/slurm_rpc_grafana.png and /dev/null differ diff --git a/reference/glossary.md b/reference/glossary.md index 74fae8f..a8d5773 100644 --- a/reference/glossary.md +++ b/reference/glossary.md @@ -52,6 +52,12 @@ Graphics Processing Unit (GPU) A specialized processor that is designed to accelerate image processing and graphics rendering for output to a display device. +Grafana + An open-source platform for data visualization, monitoring, and observability. Used to create + dashboards and graphs from time series data stored in various data sources. + + Resources: [Grafana website {octicon}`link-external`](https://grafana.com/), [Grafana documentation {octicon}`link-external`](https://grafana.com/docs/), [Grafana charm {octicon}`link-external`](https://charmhub.io/grafana-k8s) + High-Performance Computing (HPC) The practice of aggregating computing power using clusters and parallel processing to complete tasks faster than standard computing. @@ -78,6 +84,12 @@ Juju Resources: [Juju documentation {octicon}`link-external`](https://documentation.ubuntu.com/juju/latest/) +Loki + An open-source log aggregation system designed to store and query logs from applications and + infrastructure. Designed to be cost-effective and easy to operate. 
+ + Resources: [Grafana Loki website {octicon}`link-external`](https://grafana.com/oss/loki/), [Loki documentation {octicon}`link-external`](https://grafana.com/docs/loki/latest/), [Loki charm {octicon}`link-external`](https://charmhub.io/loki-k8s) + MicroCeph A tool that simplifies deployment and management of Ceph storage both standalone and in a charmed environment using Juju. @@ -106,6 +118,12 @@ Proxy charm An intermediary charm that enables charms to integrate with non-charmed applications. Also known as an integrator charm. +Prometheus + An open-source monitoring and alerting system that collects and stores metrics as time series + data. Features a flexible query language (PromQL) and built-in alerting capabilities. + + Resources: [Prometheus website {octicon}`link-external`](https://prometheus.io/), [Prometheus documentation {octicon}`link-external`](https://prometheus.io/docs/), [Prometheus charm {octicon}`link-external`](https://charmhub.io/prometheus-k8s) + `sackd` Slurm Auth and Credential Kiosk daemon. Typically used to provide cluster login nodes. diff --git a/reference/monitoring/grafana.md b/reference/monitoring/grafana.md index 705840d..12d328e 100644 --- a/reference/monitoring/grafana.md +++ b/reference/monitoring/grafana.md @@ -1,11 +1,15 @@ +--- +relatedlinks: "[Grafana dashboards documentation](https://grafana.com/docs/grafana/latest/dashboards/)" +--- + (reference-monitoring-grafana)= # Grafana dashboards This is an overview of all the charms used in Charmed HPC that provide dashboards for -[Grafana](https://grafana.com/), which acts as a web interface to visualize data from aggregators such -as [Prometheus](https://prometheus.io/) or [Loki](https://grafana.com/oss/loki/). See -[Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/) for more general information on dashboards, -and {ref}`reference-monitoring-prometheus` for more information about the metrics displayed on the dashboards. 
+{term}`Grafana`, which acts as a web interface to visualize data from aggregators such +as {term}`Prometheus` or {term}`Loki`. + +See {ref}`howto-manage-integrate-with-cos` for more information. :::{admonition} Panel query :class: note @@ -16,39 +20,62 @@ to see the exact query used to provide the panel with data. ## Slurmctld -The dashboard from the `slurmctld` charm displays an overall view of the cluster, including the following information: +The dashboards from the {term}`slurmctld` charm provide a display of information from the +entire cluster, each partition, and each node. + +### Cluster Overview + +The "Cluster Overview" dashboard provides a display of cluster-level metrics such as: + +- Total resource utilization +- Job status distribution +- Node state distribution +- Scheduler metrics + +![Grafana Cluster Overview dashboard showing total resource utilization, job state distribution, node state distribution, and scheduler metrics for the Charmed HPC cluster](/reuse/reference/monitoring/cluster-overview.png) + +### Partition Overview + +The "Partition Overview" dashboard provides a display of partition-level metrics such as: + +- Total nodes and jobs in the partition +- Total resource utilization for the partition +- Job status distribution for jobs in the partition +- Node state distribution for all nodes in the partition + +![Grafana Partition Overview dashboard showing total nodes and jobs, resource utilization, job status distribution, and node state distribution for a specific partition](/reuse/reference/monitoring/partition-overview.png) + +### Node Overview + +The "Node Overview" dashboard provides a display of node-level metrics such as: -- CPU and memory usage per partition. -- Node state count. -- CPU and memory usage per account. -- Statistics on Slurmctld RPC messages. 
+- Available resources that are allocatable for jobs +- Total resource utilization on the node -![Slurm partition dashboard](../../imgs/slurm_partition_grafana.png) -![Slurm account dashboard](../../imgs/slurm_account_grafana.png) -![Slurm rpc dashboard](../../imgs/slurm_rpc_grafana.png) +![Grafana Node Overview dashboard showing node state, resource utilization, running jobs, and hardware configuration for a specific compute node](/reuse/reference/monitoring/node-overview.png) ## MySQL The dashboard from the `mysql` charm displays metrics for the storage database of Slurmdbd: -- Uptime. -- Queries per second. -- Current cache size. -- Maximum number of concurrent connections. -- Thread resource usage. -- Network traffic statistics. +- Uptime +- Queries per second +- Current cache size +- Maximum number of concurrent connections +- Thread resource usage +- Network traffic statistics -![MySQL dashboard](../../imgs/mysql_grafana.png) +![Grafana MySQL dashboard showing database metrics including uptime, queries per second, cache size, concurrent connections, and thread count](/reuse/reference/monitoring/mysql-exporter.png) ## Traefik K8s The dashboard from the `traefik-k8s` charm displays metrics about the reverse proxy used when communicating between the compute plane cluster and the monitoring/identity k8s clusters. This includes: -- Uptime. -- HTTP response code statistics. -- Response times. -- Open connections statistics. -- Raw logs for every proxied endpoint. 
+- Uptime +- Response times +- HTTP response code statistics +- Open connection statistics +- Raw logs for every proxied endpoint -![Traefik dashboard](../../imgs/traefik_grafana.png) +![Grafana Traefik dashboard showing reverse proxy metrics including uptime, response times, HTTP response code statistics, and open connection statistics](/reuse/reference/monitoring/traefik.png) diff --git a/reference/monitoring/index.md b/reference/monitoring/index.md index 577214b..3e1c7c8 100644 --- a/reference/monitoring/index.md +++ b/reference/monitoring/index.md @@ -1,9 +1,9 @@ (reference-monitoring)= # Monitoring -Integrating COS with Charmed HPC enables monitoring of resources. This integration includes Prometheus, Grafana, and Loki. - - +Integrating COS with Charmed HPC enables you to monitor your Charmed HPC cluster's resources +with {term}`Prometheus`, {term}`Grafana`, and {term}`Loki`. The reference material in +this section lists the dashboards and metrics from Charmed HPC that you can interact with. - {ref}`Prometheus: metrics aggregation and alerts <reference-monitoring-prometheus>` - {ref}`Grafana: Dashboards and resource visualizations <reference-monitoring-grafana>` @@ -17,5 +17,4 @@ Integrating COS with Charmed HPC enables monitoring of resources. This integrati Prometheus Grafana Loki - ``` diff --git a/reference/monitoring/loki.md b/reference/monitoring/loki.md index fda854f..5d408fb 100644 --- a/reference/monitoring/loki.md +++ b/reference/monitoring/loki.md @@ -1,8 +1,8 @@ (reference-monitoring-loki)= # Loki logs -The following table lists all the charms used as part of Charmed HPC that expose logs to Loki, and the -corresponding query to see the exported logs in the Grafana UI. 
Follow the [Visualize log data](https://grafana.com/docs/loki/latest/visualize/grafana/#grafana-explore) tutorial from the Grafana documentation for instructions on where and how to query for Loki logs. diff --git a/reference/monitoring/prometheus.md b/reference/monitoring/prometheus.md index 09e3428..74d0869 100644 --- a/reference/monitoring/prometheus.md +++ b/reference/monitoring/prometheus.md @@ -2,20 +2,20 @@ # Prometheus metrics and alerts This is an overview of all the charms used in Charmed HPC that provide monitoring metrics and alerts -for [Prometheus](https://prometheus.io), a metrics aggregator and alerts manager for applications. +for {term}`Prometheus`, a metrics aggregator and alerts manager for applications. -All metrics and alerts can be viewed from Prometheus or from the [Grafana](https://grafana.com) web interface. +All metrics and alerts can be viewed from Prometheus or from the {term}`Grafana` web interface. See {ref}`howto-manage-integrate-with-cos` for more information. -The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus, +The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus with their corresponding upstream documentation to know more about the metrics exported. The last -column shows the corresponding query to list the exported metrics in Prometheus or the Grafana UI. +column shows the corresponding query to list the exported metrics in Prometheus or Grafana. 
 :::{csv-table} :header: > : charm, upstream docs, query -slurmctld, [Documentation](https://github.com/rivosinc/prometheus-slurm-exporter), `{juju_charm="slurmctld"}`{l=javascript} +slurmctld, [Documentation](https://slurm.schedmd.com/metrics.html), `{juju_charm="slurmctld"}`{l=javascript} mysql, [Documentation](https://charmhub.io/mysql), `{juju_charm="mysql"}`{l=javascript} postgresql-k8s, [Documentation](https://charmhub.io/postgresql-k8s), `{juju_charm="postgresql-k8s"}`{l=javascript} glauth-k8s, [Documentation](https://charmhub.io/glauth-k8s), `{juju_charm="glauth-k8s"}`{l=javascript} @@ -26,7 +26,7 @@ traefik-k8s, [Documentation](https://charmhub.io/traefik-k8s), `{juju_charm="tra The `slurmctld` charm exposes metrics related to: -- Resource usage per partition, account or user. -- Jobs statuses. -- RPC messages for `slurmctld`. -- Prometheus Slurm Exporter statistics. +- Job and node statuses. +- Resource usage for each partition, node, Slurm account or user. +- Cluster-wide information such as total CPU or memory utilization. +- Scheduler information such as scheduling cycle times and queue lengths. 
diff --git a/reuse/reference/monitoring/cluster-overview.png b/reuse/reference/monitoring/cluster-overview.png new file mode 100644 index 0000000..30edab5 Binary files /dev/null and b/reuse/reference/monitoring/cluster-overview.png differ diff --git a/imgs/mysql_grafana.png b/reuse/reference/monitoring/mysql-exporter.png similarity index 100% rename from imgs/mysql_grafana.png rename to reuse/reference/monitoring/mysql-exporter.png diff --git a/reuse/reference/monitoring/node-overview.png b/reuse/reference/monitoring/node-overview.png new file mode 100644 index 0000000..0f3e0fb Binary files /dev/null and b/reuse/reference/monitoring/node-overview.png differ diff --git a/reuse/reference/monitoring/partition-overview.png b/reuse/reference/monitoring/partition-overview.png new file mode 100644 index 0000000..f38302a Binary files /dev/null and b/reuse/reference/monitoring/partition-overview.png differ diff --git a/imgs/traefik_grafana.png b/reuse/reference/monitoring/traefik.png similarity index 100% rename from imgs/traefik_grafana.png rename to reuse/reference/monitoring/traefik.png