181 changes: 181 additions & 0 deletions diskcache/LOCK_CONTENTION_METRICS.md
@@ -0,0 +1,181 @@
# Lock Contention Metrics Implementation

This document explains how lock contention metrics are measured and used in the diskcache module.

## Overview

The enhanced diskcache now tracks lock contention across three critical lock types:
- **write lock** (`wlock`): excludes concurrent `Put` operations on the header (write) file
- **read lock** (`rlock`): excludes concurrent `Get` operations on the tail (read) file
- **rw lock** (`rwlock`): excludes structural operations (rotate, switch, drop, close) on the cache instance

## Instrumentation Strategy

### 1. Instrumented Locks

Instead of using `sync.Mutex` and `sync.RWMutex`, we now use:

```go
type InstrumentedMutex struct {
	mu           sync.Mutex
	lockType     LockType // "write", "read" or "rw"
	path         string   // cache path, used as a metric label
	lockWaitTime *prometheus.HistogramVec
	contention   *prometheus.CounterVec
}

type InstrumentedRWMutex struct {
	mu           sync.RWMutex
	path         string // cache path, used as a metric label
	lockWaitTime *prometheus.HistogramVec
	contention   *prometheus.CounterVec
}
```
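
A constructor can wire each lock to the shared metric vectors. The sketch below is illustrative: the constructor name and signature are assumptions, not part of the module's documented API.

```go
// NewInstrumentedMutex is a hypothetical constructor: it binds a lock
// type and cache path to the shared Prometheus vectors so that every
// acquisition is recorded under the right label pair.
func NewInstrumentedMutex(lt LockType, path string,
	wait *prometheus.HistogramVec, cont *prometheus.CounterVec) *InstrumentedMutex {
	return &InstrumentedMutex{
		lockType:     lt,
		path:         path,
		lockWaitTime: wait,
		contention:   cont,
	}
}
```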

### 2. Contention Detection

Contention is detected with a non-blocking `TryLock()` probe. If the probe succeeds the mutex was free; otherwise the acquisition is counted as contended and falls back to a blocking `Lock()`:

```go
func (im *InstrumentedMutex) Lock() {
	start := time.Now()

	// Fast path: TryLock succeeded, so the mutex was free (no contention).
	if im.mu.TryLock() {
		im.observeLockTime(start, false)
		return
	}

	// Slow path: the mutex is held elsewhere. Count the contention,
	// then block until the lock is acquired.
	im.observeContention()
	im.mu.Lock()
	im.observeLockTime(start, true)
}
```
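
The `observeLockTime` and `observeContention` helpers are not shown in this document. A minimal sketch of what they might look like, assuming `LockType` is a string-backed type whose values match the `lock_type` label:

```go
// observeLockTime records how long the acquisition took. The contended
// flag lets callers keep separate series if desired; this sketch feeds
// one histogram either way.
func (im *InstrumentedMutex) observeLockTime(start time.Time, contended bool) {
	im.lockWaitTime.
		WithLabelValues(string(im.lockType), im.path).
		Observe(time.Since(start).Seconds())
}

// observeContention counts one acquisition that had to wait.
func (im *InstrumentedMutex) observeContention() {
	im.contention.WithLabelValues(string(im.lockType), im.path).Inc()
}

// Unlock delegates to the wrapped mutex; only acquisition is instrumented.
func (im *InstrumentedMutex) Unlock() { im.mu.Unlock() }
```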

## Metrics Exposed

### 1. Lock Wait Time Histogram
```
diskcache_lock_wait_seconds{lock_type="write",path="/cache/path"} 0.001
diskcache_lock_wait_seconds{lock_type="read",path="/cache/path"} 0.0001
diskcache_lock_wait_seconds{lock_type="rw",path="/cache/path"} 0.05
```

(Being a histogram, the metric is actually exposed as `_bucket`, `_sum`, and `_count` series; the samples above are shorthand for typical wait times per lock type.)

**Buckets**: `[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` seconds

### 2. Lock Contention Counter
```
diskcache_lock_contention_total{lock_type="write",path="/cache/path"} 15
diskcache_lock_contention_total{lock_type="read",path="/cache/path"} 3
diskcache_lock_contention_total{lock_type="rw",path="/cache/path"} 1
```
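
Both vectors might be registered along these lines (a sketch: the variable names and the use of `promauto` are assumptions, not the module's actual wiring):

```go
var (
	lockWaitVec = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "diskcache_lock_wait_seconds",
		Help:    "Time spent waiting to acquire a diskcache lock.",
		Buckets: []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
	}, []string{"lock_type", "path"})

	lockContentionVec = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "diskcache_lock_contention_total",
		Help: "Number of lock acquisitions that had to wait.",
	}, []string{"lock_type", "path"})
)
```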

## Usage Examples

### 1. Monitoring Lock Contention Rate

```promql
# Contention rate per second by lock type
sum by (lock_type, path) (rate(diskcache_lock_contention_total[5m]))

# 95th percentile of lock wait times
histogram_quantile(0.95, sum by (lock_type, path, le) (rate(diskcache_lock_wait_seconds_bucket[5m])))
```

### 2. Alerting on High Contention

```promql
# Alert when p95 lock wait time exceeds threshold
histogram_quantile(0.95, rate(diskcache_lock_wait_seconds_bucket[5m])) > 0.1

# Alert when contention rate is high
rate(diskcache_lock_contention_total[5m]) > 10
```

### 3. Performance Analysis

```promql
# Lock wait time distribution
histogram_quantile(0.50, sum by (lock_type, le) (rate(diskcache_lock_wait_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (lock_type, le) (rate(diskcache_lock_wait_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (lock_type, le) (rate(diskcache_lock_wait_seconds_bucket[5m])))

# Correlate contention with operations
sum without (lock_type) (rate(diskcache_lock_contention_total[5m]))
  / (rate(diskcache_put_total[5m]) + rate(diskcache_get_total[5m]))
```

## Implementation Benefits

### 1. **Early Detection of Performance Issues**
- Identifies lock contention before it becomes a bottleneck
- Shows which lock types are under stress
- Helps optimize lock granularity and usage patterns

### 2. **Correlation with System Load**
- Can correlate contention spikes with:
  - High put/get rates
  - Disk I/O delays
  - Memory pressure
  - File system issues

### 3. **Capacity Planning**
- Understanding contention patterns helps with:
  - Sizing thread pools
  - Configuring batch sizes
  - Optimizing cache operations
  - Planning resource allocation

### 4. **Troubleshooting**
- Quickly identify if performance issues are due to:
  - Write contention (many puts)
  - Read contention (many gets)
  - Structural operations (rotations, switches)

## Integration with Existing Metrics

The lock contention metrics integrate seamlessly with existing diskcache metrics:

```
# Overall view of cache performance
diskcache_put_latency_seconds{path="/cache"} 0.001
diskcache_get_latency_seconds{path="/cache"} 0.0005
diskcache_lock_wait_seconds{lock_type="write",path="/cache"} 0.0002
diskcache_lock_wait_seconds{lock_type="read",path="/cache"} 0.0001

# Identify if write-lock waits dominate Put latency (as PromQL, using histogram sums)
rate(diskcache_lock_wait_seconds_sum{lock_type="write"}[5m])
  / ignoring (lock_type) rate(diskcache_put_latency_seconds_sum[5m])
```

## Performance Overhead

The instrumentation adds minimal overhead:
- **Contention detection**: one extra `TryLock()` probe before the blocking `Lock()`
- **Timing**: one `time.Now()`/`time.Since()` pair per lock acquisition
- **Metrics update**: one histogram observation per acquisition, plus one counter increment when contention occurs

The overhead is negligible compared to typical I/O operations and provides valuable observability.
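
To validate that claim in a specific environment, a parallel benchmark comparing lock/unlock throughput is straightforward. This sketch reuses the hypothetical `NewInstrumentedMutex` constructor and vector names from above:

```go
func BenchmarkInstrumentedMutex(b *testing.B) {
	// "write" stands in for whatever LockType value the module actually uses.
	m := NewInstrumentedMutex(LockType("write"), "/bench/path", lockWaitVec, lockContentionVec)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			m.Lock()
			m.Unlock()
		}
	})
}
```

Comparing the result against the same loop over a bare `sync.Mutex` isolates the instrumentation cost.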

## Production Deployment

### 1. **Monitoring Dashboard**
Create Grafana panels showing:
- Lock wait time percentiles by type
- Contention rates over time
- Lock wait time vs operation latency
- Contention heat maps by cache path

### 2. **Alerting Rules**
Set up alerts for:
- High p95/p99 lock wait times (>100ms)
- Elevated contention rates (>1/sec sustained)
- Sudden spikes in wait times

### 3. **Capacity Analysis**
Use metrics to:
- Identify optimal thread counts
- Tune batch sizes
- Plan for scale increases
- Optimize lock granularity
26 changes: 18 additions & 8 deletions diskcache/diskcache.go
@@ -24,6 +24,8 @@ import (
"sync"
"sync/atomic"
"time"

"github.com/GuanceCloud/cliutils/logger"
)

const (
@@ -57,6 +59,9 @@ var (

 	// Invalid file header.
 	ErrBadHeader = errors.New("bad header")
+
+	l    = logger.DefaultSLogger("diskcache")
+	once sync.Once
 )
 
 // DiskCache is the representation of a disk cache.
@@ -81,15 +86,16 @@ type DiskCache struct {
 	// how long to wakeup a sleeping write-file
 	wakeup time.Duration
 
-	wlock, // write-lock: used to exclude concurrent Put to the header file.
-	rlock *sync.Mutex // read-lock: used to exclude concurrent Get on the tail file.
-	rwlock *sync.Mutex // used to exclude switch/rotate/drop/Close on current disk cache instance.
+	wlock  *InstrumentedMutex // write-lock: used to exclude concurrent Put to the header file.
+	rlock  *InstrumentedMutex // read-lock: used to exclude concurrent Get on the tail file.
+	rwlock *InstrumentedMutex // used to exclude switch/rotate/drop/Close on current disk cache instance.
 
 	flock *flock // disabled multi-Open on same path
 	pos   *pos   // current read fd position info
 
 	// specs of current diskcache
-	size atomic.Int64 // current byte size
+	size atomic.Int64 // current byte size
+
 	curBatchSize, // current writing file's size
 	curReadSize, // current reading file's size
 	batchSize, // current batch size(static)
@@ -117,8 +123,10 @@ type DiskCache struct {
 }
 
 func (c *DiskCache) String() string {
-	c.rwlock.Lock()
-	defer c.rwlock.Unlock()
+	if c.rwlock != nil {
+		c.rwlock.Lock()
+		defer c.rwlock.Unlock()
+	}
 
 	// nolint: lll
 	// if there too many files(>10), only print file count
@@ -135,8 +143,10 @@ func (c *DiskCache) String() string {
 }
 
 func (c *DiskCache) Pretty() string {
-	c.rwlock.Lock()
-	defer c.rwlock.Unlock()
+	if c.rwlock != nil {
+		c.rwlock.Lock()
+		defer c.rwlock.Unlock()
+	}
 
 	arr := []string{}
 
9 changes: 7 additions & 2 deletions diskcache/drop.go
@@ -28,15 +28,17 @@ func (c *DiskCache) dropBatch() error {

 	if c.rfd != nil && c.curReadfile == fname {
 		if err := c.rfd.Close(); err != nil {
-			return err
+			return WrapFileOperationError(OpClose, err, c.path, fname).
+				WithDetails("failed_to_close_read_file_during_drop")
 		}
 
 		c.rfd = nil
 	}
 
 	if fi, err := os.Stat(fname); err == nil {
 		if err := os.Remove(fname); err != nil {
-			return err
+			return WrapFileOperationError(OpRemove, err, c.path, fname).
+				WithDetails("failed_to_remove_file_during_drop")
 		}
 
 		c.size.Add(-fi.Size())
@@ -45,6 +47,9 @@
 		droppedDataVec.WithLabelValues(c.path, reasonExceedCapacity).Observe(float64(fi.Size()))
 		datafilesVec.WithLabelValues(c.path).Set(float64(len(c.dataFiles)))
 		sizeVec.WithLabelValues(c.path).Sub(float64(fi.Size()))
+	} else {
+		return WrapFileOperationError(OpStat, err, c.path, fname).
+			WithDetails("failed_to_stat_file_during_drop")
 	}
 
 	return nil
13 changes: 13 additions & 0 deletions diskcache/envs.go
@@ -8,6 +8,7 @@ package diskcache
 import (
 	"os"
 	"strconv"
+	"time"
 )
 
 func (c *DiskCache) syncEnv() {
@@ -37,6 +38,18 @@
 		c.noPos = true
 	}
 
+	if v, ok := os.LookupEnv("ENV_DISKCACHE_POS_DUMP_INTERVAL"); ok && v != "" {
+		if du, err := time.ParseDuration(v); err == nil && du > 0 {
+			c.pos.dumpInterval = du
+		}
+	}
+
+	if v, ok := os.LookupEnv("ENV_DISKCACHE_POS_DUMP_AT"); ok && v != "" {
+		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
+			c.pos.dumpCount = int(n)
+		}
+	}
+
 	if v, ok := os.LookupEnv("ENV_DISKCACHE_NO_LOCK"); ok && v != "" {
 		c.noLock = true
 	}