Building a Powerful Monitoring and Visualization Platform with Prometheus + Grafana

2026-01-22 / 2026-01-25 / Prometheus Grafana Monitoring and Visualization Platform

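Monitoring Pain Points

In day-to-day work we keep running into situations like these:

The system suddenly slows down, and nobody knows where the problem is
CPU usage spikes, but it is unclear which process is responsible
Service response times grow, yet no alert fires
Business metrics cannot be shown at a glance, so questions from management go unanswered

Digging through logs by hand is slow, and it offers neither alerting nor visualization. This article walks through building a monitoring platform with Prometheus + Grafana.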

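Why Prometheus + Grafana

Compared with traditional monitoring setups, Prometheus + Grafana offers:

Open source and free: no license fees, and an active community
Time-series database: storage designed specifically for monitoring data
Expressive query language: PromQL can filter, aggregate, and do arithmetic across series
Flexible visualization: Grafana dashboards look good and are highly customizable
Rich ecosystem: a large collection of exporters and integrations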

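Solution Outline

The goal is a complete monitoring stack built on Prometheus + Grafana.

The core ideas are:

Data collection: Prometheus scrapes metrics from each target over HTTP
Data storage: the built-in time-series database stores the samples efficiently
Data visualization: Grafana renders the data as charts and dashboards
Alerting: rules detect abnormal conditions so they can be handled promptly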
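
To make the collection step more concrete: a scrape target answers an HTTP request with plain text, one line per sample. The sketch below renders a single counter sample in that text format. The class and method names are invented for illustration, and real applications would let a client library such as Micrometer produce this output.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of any Prometheus client library.
class ExpositionSketch {

    // Renders a single counter sample in the Prometheus text exposition format.
    // Label values are assumed to contain no quotes, backslashes, or newlines.
    static String renderCounter(String name, String help, Map<String, String> labels, double value) {
        // Sort labels by name so the output is deterministic
        String labelText = new TreeMap<>(labels).entrySet().stream()
            .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
            .collect(Collectors.joining(","));
        String series = labelText.isEmpty() ? name : name + "{" + labelText + "}";
        return "# HELP " + name + " " + help + "\n"
            + "# TYPE " + name + " counter\n"
            + series + " " + value + "\n";
    }
}
```

Calling renderCounter("orders_total", "Total orders", Map.of("type", "online"), 42) yields a HELP line, a TYPE line, and the sample line for the series orders_total{type="online"}.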

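Setting Up Prometheus

1. Install Prometheus

# Install with Docker (run from the directory that holds prometheus.yml and alert_rules.yml)
# --add-host makes host.docker.internal resolvable on Linux; Docker Desktop provides it already
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  --add-host=host.docker.internal:host-gateway \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$(pwd)/alert_rules.yml:/etc/prometheus/alert_rules.yml" \
  prom/prometheus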

2. Basic configuration file

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  # Monitor the Spring Boot application
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']
  
  # Monitor Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']

Spring Boot Integration

1. Add dependencies

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

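2. Configuration file

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    prometheus:
      enabled: true
  metrics:
    # Spring Boot 2.x key; on Spring Boot 3.x use management.prometheus.metrics.export.enabled
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}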
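3. Custom business metrics

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class BusinessMetricsCollector {

    private final MeterRegistry meterRegistry;
    // UserService is an application bean that this article does not show
    private final UserService userService;

    public BusinessMetricsCollector(MeterRegistry meterRegistry, UserService userService) {
        this.meterRegistry = meterRegistry;
        this.userService = userService;

        // A gauge reads its value from the supplier each time it is scraped
        Gauge.builder("active_users", this::getCurrentActiveUsers)
            .description("Number of active users")
            .register(meterRegistry);
    }

    public void recordOrder(String orderType) {
        // Tags are part of a meter's identity, so resolve the counter per order type;
        // register() returns the existing meter when it is already registered
        Counter.builder("orders_total")
            .description("Total number of orders")
            .tag("type", orderType)
            .register(meterRegistry)
            .increment();
    }

    public <T> T recordApiCall(String endpoint, Supplier<T> supplier) {
        // Times the supplier and returns its result; the Prometheus registry
        // appends the _seconds unit suffix to timer names on export
        return Timer.builder("api_response_time")
            .description("API response time")
            .tag("endpoint", endpoint)
            .register(meterRegistry)
            .record(supplier);
    }

    private double getCurrentActiveUsers() {
        // Look up the current number of active users
        return userService.getActiveUserCount();
    }
}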

Deploying Node Exporter

1. Install Node Exporter

# Install on Linux
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter &

# Or run it with Docker (host networking exposes port 9100 directly, so no -p mapping is needed)
docker run -d \
  --name node-exporter \
  --net="host" \
  quay.io/prometheus/node-exporter:latest

Configuring Grafana

1. Install Grafana

# Install with Docker
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana

2. Add a data source

Add Prometheus as a data source in Grafana:

Open http://localhost:3000
Log in and go to Data Sources
Select Prometheus
Set the URL to http://host.docker.internal:9090

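Common Dashboard Panels

1. JVM panel

# JVM heap usage (%), summed across heap memory pools
sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) * 100

# Number of GC pauses in the last 5 minutes (Micrometer exposes GC activity as jvm_gc_pause_seconds)
increase(jvm_gc_pause_seconds_count[5m])

# Live thread count
jvm_threads_live_threads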

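2. Business metrics panel

# Orders per minute, by type (rate() alone would give orders per second)
sum by (type) (increase(orders_total[1m]))

# 95th percentile API response time
# (needs histogram buckets: management.metrics.distribution.percentiles-histogram.http.server.requests=true)
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

# Error rate (share of requests answered with a 5xx status)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / 
sum(rate(http_server_requests_seconds_count[5m]))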
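
The percentile query is easier to reason about once you see how a quantile can be estimated from cumulative buckets. The sketch below uses linear interpolation inside the bucket that contains the target rank, which is the idea behind histogram_quantile for classic histograms. It is a simplification under stated assumptions (at least one finite bucket, a final +Inf bucket, q greater than 0), and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; a simplified take on bucket interpolation.
class QuantileSketch {

    // upperBounds: ascending "le" bounds, the last one being +Infinity
    // cumulativeCounts: number of observations less than or equal to each bound
    static double histogramQuantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations, no quantile
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == n - 1) {
            // The rank falls into the +Inf bucket: report the highest finite bound
            return upperBounds[n - 2];
        }

        // The lowest bucket is assumed to start at 0
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;

        // Assume observations are spread evenly inside the bucket
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }
}
```

With bounds 0.1, 0.5, 1.0, +Inf and cumulative counts 50, 90, 100, 100, the 0.95 quantile has rank 95, lands in the 0.5 to 1.0 bucket, and interpolates to 0.75 seconds. The estimate can never be more precise than the bucket layout allows, so bucket bounds should be chosen around the latencies you care about.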

3. System resources panel

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
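
The CPU query works because node_cpu_seconds_total{mode="idle"} is a counter of idle seconds per core, so its per-second rate is the fraction of time that core sat idle. The sketch below reproduces the arithmetic with a plain two-sample rate. It is an illustration with hypothetical names, and it skips what the real rate() and irate() functions add, such as counter-reset handling.

```java
// Hypothetical helper for illustration only.
class CpuUsageSketch {

    // idleStart and idleEnd hold one idle-seconds counter value per CPU core,
    // sampled windowSeconds apart (like node_cpu_seconds_total{mode="idle"}).
    static double cpuUsagePercent(double[] idleStart, double[] idleEnd, double windowSeconds) {
        double idleFractionSum = 0.0;
        for (int core = 0; core < idleStart.length; core++) {
            // Per-second rate of an idle-seconds counter = fraction of time the core was idle
            idleFractionSum += (idleEnd[core] - idleStart[core]) / windowSeconds;
        }
        // Average across cores, like avg by (instance)
        double avgIdleFraction = idleFractionSum / idleStart.length;
        return 100.0 - avgIdleFraction * 100.0;
    }
}
```

For two cores that accumulated 240 and 180 idle seconds over a 300 second window, the idle fractions are 0.8 and 0.6, the average is 0.7, and the reported usage is 30%.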

Alert Rule Configuration

1. Alert rules file

# alert_rules.yml
groups:
  - name: application_alerts
    rules:
      - alert: HighCPULoad
        # rate() is steadier than irate() for alerting, since it averages over the whole window
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host CPU usage is too high"
          description: "CPU usage on host {{ $labels.instance }} is above 80% (current value: {{ $value }})"
      
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host memory usage is too high"
          description: "Memory usage on host {{ $labels.instance }} is above 85% (current value: {{ $value }})"
      
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "Target {{ $labels.instance }} has stopped responding to scrapes"
      
      - alert: HighErrorRate
        expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / 
              sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "API error rate is too high"
          description: "The API error rate is above 5% (current value: {{ $value }})"
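
The for field in each rule delays the alert: the expression has to stay true for that long before the alert moves from pending to firing, which filters out short spikes. A minimal sketch of that state machine follows. The class is hypothetical and only models a single alert instance evaluated at the timestamps you pass in.

```java
// Hypothetical model of how a "for" duration gates an alert; not Prometheus code.
class AlertForSketch {

    enum State { INACTIVE, PENDING, FIRING }

    private final long forMillis;
    private long activeSince = -1; // -1 while the condition is false

    AlertForSketch(long forMillis) {
        this.forMillis = forMillis;
    }

    // Called once per rule evaluation with the result of the alert expression
    State evaluate(boolean conditionTrue, long nowMillis) {
        if (!conditionTrue) {
            // Any false evaluation resets the timer
            activeSince = -1;
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowMillis; // first evaluation where the condition holds
        }
        return (nowMillis - activeSince >= forMillis) ? State.FIRING : State.PENDING;
    }
}
```

With for: 2m, an alert whose condition becomes true at t=0 stays pending at t=60s and fires at t=120s. If the condition is false at any evaluation in between, the timer starts over.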

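2. DingTalk alert integration

# alertmanager.yml
# Alertmanager posts its own JSON payload, which the DingTalk robot API does not accept as is.
# Point the webhook at a relay (for example prometheus-webhook-dingtalk) that converts the payload
# and forwards it to https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN.
# The relay URL below is a placeholder; adjust host, port, and path to your deployment.
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'dingtalk-webhook'

receivers:
  - name: 'dingtalk-webhook'
    webhook_configs:
      - url: 'http://dingtalk-relay:8060/dingtalk/webhook1/send'
        send_resolved: true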
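
DingTalk robots expect their own message shape rather than the payload Alertmanager sends, so something has to translate between the two. The sketch below builds a DingTalk text message from a few alert fields. It assumes the commonly documented text format with msgtype and text.content, which you should verify against the DingTalk robot documentation. The class is hypothetical, and it ignores robot security settings such as keywords or signatures.

```java
// Hypothetical converter for illustration; verify the message format against DingTalk's docs.
class DingTalkMessageSketch {

    // Minimal JSON string escaping: backslash, double quote, and newline only
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }

    // Builds the JSON body for a plain text message
    static String toTextMessage(String alertName, String severity, String summary) {
        String content = "[" + severity + "] " + alertName + ": " + summary;
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + escapeJson(content) + "\"}}";
    }
}
```

A relay service would call toTextMessage for each alert in the Alertmanager payload and POST the result to the robot URL with a JSON content type.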

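Advanced Monitoring Techniques

1. Custom dashboard endpoint

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.Collection;
import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MetricsController {
    
    @Autowired
    private MeterRegistry meterRegistry;
    
    @GetMapping("/api/dashboard/metrics")
    public DashboardMetrics getDashboardMetrics() {
        // DashboardMetrics is a plain DTO that this article does not show
        DashboardMetrics metrics = new DashboardMetrics();
        
        // JVM metrics: inside the registry, meters keep Micrometer's dotted names
        metrics.setHeapUsage(getMetricValue("jvm.memory.used", "area", "heap"));
        metrics.setCpuUsage(getMetricValue("system.cpu.usage"));
        metrics.setActiveThreads(getMetricValue("jvm.threads.live"));
        
        // Business metrics: counters and timers are not gauges, so query them by type
        metrics.setTotalOrders(meterRegistry.find("orders_total").counters().stream()
            .mapToDouble(Counter::count).sum());
        Collection<Timer> timers = meterRegistry.find("api_response_time").timers();
        double totalSeconds = timers.stream().mapToDouble(t -> t.totalTime(TimeUnit.SECONDS)).sum();
        long calls = timers.stream().mapToLong(Timer::count).sum();
        metrics.setAvgResponseTime(calls == 0 ? 0.0 : totalSeconds / calls);
        
        return metrics;
    }
    
    // Sums every gauge matching the name and tags; returns 0.0 when none is registered
    private double getMetricValue(String metricName, String... tags) {
        return meterRegistry.find(metricName)
            .tags(tags)
            .gauges()
            .stream()
            .mapToDouble(Gauge::value)
            .sum();
    }
}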

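2. Exporting monitoring data

import io.micrometer.core.instrument.MeterRegistry;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class MetricsExporter {
    
    @Autowired
    private MeterRegistry meterRegistry;
    
    // Runs every 5 minutes; requires @EnableScheduling on a configuration class
    @Scheduled(fixedRate = 300000)
    public void exportMetrics() {
        // Push a snapshot of the metrics to an external system
        List<MetricData> metrics = collectMetrics();
        sendToExternalSystem(metrics);
    }
    
    private List<MetricData> collectMetrics() {
        // getMeters() already returns a List, so it can be streamed directly
        return meterRegistry.getMeters().stream()
            .map(this::convertToMetricData)
            .collect(Collectors.toList());
    }
    
    // MetricData, convertToMetricData, and sendToExternalSystem are application-specific and not shown here
}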

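Performance Tuning

1. Storage tuning

# Retention is configured with command-line flags when Prometheus starts
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB
# Keeps 30 days of data, capped at 50GB; whichever limit is reached first applies.
# Data blocks cover 2h by default and normally need no tuning.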
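
To pick sensible retention values it helps to estimate disk usage first. The Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with samples averaging around 1 to 2 bytes each. The sketch below applies that formula. The class name and the example inputs are assumptions for illustration.

```java
// Hypothetical capacity-planning helper based on the rule of thumb from the Prometheus docs.
class StorageEstimateSketch {

    // activeSeries: number of time series being scraped
    // scrapeIntervalSeconds: how often each series produces a sample
    // bytesPerSample: typically somewhere between 1 and 2 after compression
    static double estimateBytes(long activeSeries, double scrapeIntervalSeconds,
                                long retentionDays, double bytesPerSample) {
        double samplesPerSecond = activeSeries / scrapeIntervalSeconds;
        double retentionSeconds = retentionDays * 86_400.0;
        return retentionSeconds * samplesPerSecond * bytesPerSample;
    }
}
```

For example, 100,000 active series scraped every 15 seconds and kept for 30 days at 2 bytes per sample come to about 34.6 GB, which fits under a 50GB size cap with some headroom.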

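2. Scrape tuning

# Cut down on unnecessary scraping
global:
  scrape_interval: 30s  # a longer interval means fewer samples to store
  scrape_timeout: 10s

# Filter metrics per job
scrape_configs:
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    scrape_timeout: 5s
    static_configs:
      - targets: ['host.docker.internal:8080']
    # metric_relabel_configs runs after the scrape, when __name__ is available
    metric_relabel_configs:
      # Drop series that are not needed (example only: this also removes GC series a JVM panel may use)
      - source_labels: [__name__]
        regex: 'jvm_gc_.*'
        action: drop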
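
One detail of the drop rule is easy to miss: Prometheus anchors relabeling regexes at both ends, so 'jvm_gc_.*' only matches names that start with jvm_gc_. The sketch below mimics that behavior with Java's Matcher.matches(), which also requires a full match. It is an illustration with hypothetical names, and Java's regex dialect differs from the RE2 syntax Prometheus uses in some advanced features.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of a fully anchored "drop" rule on metric names.
class DropRuleSketch {

    // Returns the metric names that survive a drop rule with the given regex
    static List<String> applyDrop(List<String> metricNames, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return metricNames.stream()
            // matches() must consume the entire name, like an anchored ^(?:regex)$
            .filter(name -> !pattern.matcher(name).matches())
            .collect(Collectors.toList());
    }
}
```

Applied to jvm_gc_pause_seconds_count, jvm_memory_used_bytes, and my_jvm_gc_custom, the rule removes only the first name. The third one survives because the match is anchored at the start.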

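Results in Practice

With a Prometheus + Grafana monitoring stack in place, we get:

Real-time monitoring: every key metric is visible as it changes
Fast diagnosis: when something breaks, the root cause can be narrowed down quickly
Trend analysis: historical data shows how metrics evolve over time
Proactive alerting: abnormal conditions trigger alerts promptly
Business insight: business metrics are displayed at a glance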

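Best Practices

1. Layered monitoring

Infrastructure layer: CPU, memory, disk, network
Application layer: JVM, threads, GC, connection pools
Business layer: order volume, response time, error rate

2. Alert severity levels

Critical: serious problems that need immediate action
Warning: problems that need attention but are not urgent
Info: informational only

3. Metric naming conventions

Separate words with underscores
Use a prefix that identifies the application or service
Use a suffix that states the unit (such as _seconds or _bytes)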
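
Naming rules like these are easy to enforce with a small check in a build step or code review tool. The sketch below validates a name against the traditional Prometheus metric-name character set and against a stricter house convention of lowercase snake_case with a service prefix. The class, the method names, and the prefix "shop" in the example are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical lint helper for metric names.
class MetricNameSketch {

    // Traditional Prometheus metric-name character set
    private static final Pattern VALID = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");

    // Stricter house rule: lowercase words joined by single underscores
    private static final Pattern SNAKE_CASE = Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    static boolean isValid(String name) {
        return VALID.matcher(name).matches();
    }

    // Checks snake_case plus the "<service>_" prefix; unit suffixes are left to review
    static boolean followsConvention(String name, String servicePrefix) {
        return SNAKE_CASE.matcher(name).matches() && name.startsWith(servicePrefix + "_");
    }
}
```

Here shop_api_response_time_seconds passes both checks for the prefix shop, apiResponseTime is a valid name but fails the convention, and 2xx_count is rejected outright because it starts with a digit.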

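Caveats

Keep the following in mind when running Prometheus + Grafana:

Resource consumption: the monitoring system itself uses CPU, memory, and disk
Data security: sensitive metrics should be masked or anonymized
Network configuration: make sure Prometheus can reach every monitored service
Storage planning: plan disk space and retention policies ahead of time
Access control: Grafana needs sensible user permissions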

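Summary

Together, Prometheus and Grafana make a capable and flexible monitoring platform. It helps us spot and fix problems early, and it also provides data to back business decisions.

Monitoring is a key safeguard for stable systems, and every team is encouraged to build its own monitoring stack.

I hope this article was helpful! If you found it useful, feel free to follow the 【服务端技术精选】 WeChat official account for more backend engineering content.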

Title: Building a Powerful Monitoring and Visualization Platform with Prometheus + Grafana
Author: jiangyi
URL: http://www.jiangyi.space/articles/2026/01/22/1769072117368.html
