Jeremy Main edited this page May 22, 2018 · 7 revisions

Background

For the past two years, I have been gradually improving GPUProfiler for Windows environments, and I often heard requests for a similar tool for Linux, or at least a monitor that gives users immediate insight into how the GPU is being used. I wanted such a tool myself, so I started building it.

Features of ngputop

Displays the utilization of GPU resources for all NVIDIA GPUs detected on a bare-metal machine, on some hypervisor hosts (XenServer, RHEL RHV, ESXi), or in virtual machines* (*see limitations).

UI Example - Graphics view mode

Inferencing on an Ubuntu vGPU VM

Keyboard commands

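| Key | Alternate | Command | Feature | State |
| --- | --- | --- | --- | --- |
| F1 | h | Help | Display | Enabled |
| | c | Compute view | Display | Enabled |
| | g | Graphics view | Display | Enabled |
| F2 | | Setup | Configuration | Future use |
| F3 | | Search | Display | Future use |
| F4 | | Filter | Display | Future use |
| F5 | | Start | Profile | Future use |
| F6 | | Stop | Profile | Future use |
| F10 | q | Exit | Operation | Enabled |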
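
As an illustration of the key table, the sketch below shows one way a key press could be resolved to a command, with keys marked "Future use" treated as no-ops. The names `KEY_BINDINGS` and `resolve_key` and the command strings are hypothetical and are not taken from the ngputop source.

```python
# Hypothetical key map mirroring the keyboard command table.
# Each entry maps a key name to (command, enabled).
KEY_BINDINGS = {
    "F1": ("help", True),
    "h": ("help", True),
    "c": ("compute_view", True),
    "g": ("graphics_view", True),
    "F2": ("setup", False),          # Future use
    "F3": ("search", False),         # Future use
    "F4": ("filter", False),         # Future use
    "F5": ("start_profile", False),  # Future use
    "F6": ("stop_profile", False),   # Future use
    "F10": ("exit", True),
    "q": ("exit", True),
}


def resolve_key(key):
    """Return the command bound to an enabled key, or None otherwise."""
    command, enabled = KEY_BINDINGS.get(key, (None, False))
    return command if enabled else None
```

With this layout, a key and its alternate (for example F1 and h) resolve to the same command, and unknown or not-yet-enabled keys are ignored.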

View Modes

ngputop currently has two view modes, "compute" and "graphics". Each view shows the utilization metrics that its use case typically cares about, so the set of gauges differs between the two.

Compute View (default)

For each detected NVIDIA GPU, compute mode displays the following gauges:

| Abbreviation | Meaning |
| --- | --- |
| SM | Streaming Multiprocessor (SM) utilization |
| FB | Frame buffer utilization |
| CL | SM clock |
| PW | Power consumption |
| TP | GPU temperature in Celsius |
| FN | Fan speed (% of maximum) |

Graphics View

For each detected NVIDIA GPU, graphics mode displays the following gauges:

| Abbreviation | Meaning |
| --- | --- |
| SM | Streaming Multiprocessor (SM) utilization |
| FB | Frame buffer utilization |
| MC | Memory controller utilization |
| EN | Encoder utilization |
| DE | Decoder utilization |
| PW | Power consumption |
| TP | GPU temperature in Celsius |
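
The two views can be thought of as two lists of gauge abbreviations. The sketch below models that idea and draws a simple text bar per gauge. It is only an illustration: `VIEW_GAUGES`, `render_gauge` and `render_view` are hypothetical names, the bar layout is not ngputop's actual UI, and every reading is assumed to have already been normalized to a percentage of its threshold.

```python
# Gauges per view mode, taken from the two tables in this section.
VIEW_GAUGES = {
    "compute": ["SM", "FB", "CL", "PW", "TP", "FN"],
    "graphics": ["SM", "FB", "MC", "EN", "DE", "PW", "TP"],
}


def render_gauge(label, percent, width=20):
    """Draw one text gauge; percent is clamped to the range 0..100."""
    percent = max(0.0, min(100.0, float(percent)))
    filled = int(round(width * percent / 100.0))
    bar = "#" * filled + " " * (width - filled)
    return f"{label} [{bar}] {percent:3.0f}%"


def render_view(mode, readings):
    """Render every gauge of one view; missing readings are drawn as 0%."""
    return [render_gauge(name, readings.get(name, 0.0))
            for name in VIEW_GAUGES[mode]]
```

Calling `render_view("graphics", readings)` with a dictionary keyed by abbreviation returns one line per gauge, in the order the table lists them.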

What and how it measures

| NVML Function | Purpose | nvidia-smi query equivalent |
| --- | --- | --- |
| nvmlDeviceGetUtilizationRates | Get GPU utilization | utilization.gpu, utilization.memory |
| nvmlDeviceGetTemperatureThreshold | Get the temperature thresholds | |
| nvmlDeviceGetTemperature | Get the temperature | temperature.gpu |
| nvmlDeviceGetProcessUtilization | Get GPU utilization per process | using the query "--query-compute-apps=pid,name,used_memory" |
| nvmlDeviceGetPowerUsage | Get the current power usage | power.draw |
| nvmlDeviceGetPowerManagementLimit | Get the power thresholds | power.limit, power.min_limit, power.max_limit |
| nvmlDeviceGetPciInfo_v2 | Get the PCI bus details | pci.bus_id |
| nvmlDeviceGetName | Get the GPU product name | name |
| nvmlDeviceGetMemoryInfo | Get the frame buffer thresholds | memory.total, memory.used, memory.free |
| nvmlDeviceGetMaxClockInfo | Get the clock thresholds | clocks.max.sm |
| nvmlDeviceGetHandleByIndex | Get a "handle" for each GPU | |
| nvmlDeviceGetFanSpeed | Get the fan speed | fan.speed |
| nvmlDeviceGetEncoderUtilization | Get the encoder utilization | |
| nvmlDeviceGetDecoderUtilization | Get the decoder utilization | |
| nvmlDeviceGetCount | Get number of GPUs | count |
| nvmlDeviceGetClockInfo | Get the current clock data | clocks.current.graphics, clocks.current.sm |
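
To cross-check the values ngputop shows, the nvidia-smi query fields from the table can be read from a script. The sketch below is a minimal example that assumes `nvidia-smi` is on the PATH; it goes through the command-line tool rather than the NVML calls ngputop itself uses, and the helper names (`QUERY_FIELDS`, `build_command`, `parse_output`, `query_gpus`) are hypothetical.

```python
import csv
import io
import subprocess

# A subset of the nvidia-smi query fields listed in the table.
QUERY_FIELDS = [
    "name", "pci.bus_id", "utilization.gpu", "utilization.memory",
    "temperature.gpu", "power.draw", "power.limit",
    "memory.total", "memory.used", "fan.speed",
]


def build_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line for the given query fields."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(fields),
            "--format=csv,noheader,nounits"]


def parse_output(text, fields=QUERY_FIELDS):
    """Turn nvidia-smi CSV output into one dict per GPU.

    Values stay as strings because unsupported fields are reported
    as text (for example "[N/A]") rather than as numbers.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(fields, (v.strip() for v in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs a GPU driver)."""
    result = subprocess.run(build_command(), capture_output=True,
                            text=True, check=True)
    return parse_output(result.stdout)
```

On a machine with the NVIDIA driver installed, `query_gpus()` returns a list with one dictionary per GPU, keyed by the field names above. Parsing is kept separate from running the command so that it can be exercised without a GPU.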