Infrastructure monitoring with Grafana Alloy

I've been playing with Grafana Alloy this weekend.

I started at "I want graphs", installed LibreNMS, couldn't get graphs out of a Proxmox server via SNMP, and somehow we ended up here.

Here are some notes to get us started.

I started with the alloy-scenarios GitHub repo and began working from its "snmp" example. My ultimate goal is to get SNMP working, but first I need to make graphs for things I actually have on my network.

Let's define the goals.

1) Graphs.

2) Logs.

3) Alerting.

4) Consistency.

Alloy ticks all of these, in a roundabout fashion.

Let's start with the docker-compose.yml I ended up with:



services:

  loki:
    image: grafana/loki:${GRAFANA_LOKI_VERSION:-3.6.10}
    container_name: loki
    hostname: loki
    ports:
      - 3100:3100/tcp
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - ./data/loki:/tmp/loki              # Persistence
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - observability


  prometheus:
    image: prom/prometheus:${PROMETHEUS_VERSION:-v3.11.3}
    container_name: prometheus
    hostname: prometheus
    command:
      - --web.enable-remote-write-receiver
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus   # Explicitly tell Prom where to store data
    ports:
      - 9090:9090/tcp
    volumes:
      - ./prom-config.yaml:/etc/prometheus/prometheus.yml
      - ./data/prometheus:/prometheus    # Persistence
    networks:
      - observability


  grafana:
    image: grafana/grafana:${GRAFANA_VERSION:-13.0.1}
    container_name: grafana
    hostname: grafana
    environment:
      # Anonymous access: visitors are treated as Admin without logging in
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_BASIC_ENABLED=false
    ports:
      - 3000:3000/tcp
    volumes:
      - ./data/grafana:/var/lib/grafana    # Persistence
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml
    networks:
      - observability

  alloy:
    hostname: alloy
    container_name: alloy
    image: grafana/alloy:${GRAFANA_ALLOY_VERSION:-v1.16.1}
    ports:
      - "12345:12345/tcp" # Alloy UI
      - "514:514/udp"     # Standard syslog (RFC3164)
      - "514:514/tcp"     # Standard syslog (RFC3164)
      - "515:515/udp"     # RAW logs
      - "515:515/tcp"     # RAW logs
      - "5424:5424/udp"   # RFC5424 Syslog
      - "5424:5424/tcp"   # RFC5424 Syslog
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy
      - ./snmp.yml:/etc/alloy/snmp.yml
      - ./data/alloy:/var/lib/alloy/data   # Persistence
    networks:
      - observability
    command: run --stability.level=experimental --server.http.listen-addr=0.0.0.0:12345 --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy

networks:
  observability:
    driver: bridge

And the referenced grafana-datasources.yaml:


apiVersion: 1
datasources:
- name: Loki
  type: loki
  access: proxy
  orgId: 1
  url: http://loki:3100
  basicAuth: false
  isDefault: false
  version: 1
  editable: false
- name: Prometheus
  type: prometheus
  orgId: 1
  url: http://prometheus:9090
  basicAuth: false
  isDefault: true
  version: 1
  editable: false
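A quick aside on the image tags in the compose file: they use the ${VAR:-default} form, which Compose expands the same way a shell does. Here is a tiny sketch of that behaviour. The function name and the 12.4.0 value are just placeholders I made up for illustration.

```shell
# Same ${VAR:-default} expansion that Compose applies to the image tags.
# resolved_image is a made-up helper name, only here for illustration.
resolved_image() {
  echo "grafana/grafana:${GRAFANA_VERSION:-13.0.1}"
}

# With nothing set, the default from the compose file wins:
#   unset GRAFANA_VERSION; resolved_image
# With a pin (placeholder value, e.g. from a .env file next to the compose file):
#   GRAFANA_VERSION=12.4.0 resolved_image
# To see what Compose itself resolves, run: docker compose config
```

So pinning a version should just be a matter of setting the variable, with no edit to the compose file.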

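You'll notice we have a lot of ports open for syslog ingress. We can't install the Alloy agent on everything, so syslog covers the devices that can't run it, and we should get good data out of this. Should. (The server config further down only defines UDP listeners so far, so the TCP ports are published but nothing answers on them yet.)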
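Once the stack is running, it helps to know what a syslog line on the wire looks like so you can poke each port by hand. This is a minimal sketch, not anything from the alloy-scenarios repo: the function names, the "testhost" and "demo" values are made up, and 10.1.1.20 is the server address that shows up in the agent config later on.

```shell
# PRI is the number inside <...> at the start of a syslog line:
# facility * 8 + severity (for example user=1, info=6 gives 14).
syslog_pri() {
  echo $(( $1 * 8 + $2 ))
}

# Build a minimal RFC 3164 line: <PRI>TIMESTAMP HOST TAG: MESSAGE
# Arguments: facility severity host tag message
rfc3164_line() {
  printf '<%s>%s %s %s: %s\n' \
    "$(syslog_pri "$1" "$2")" "$(LC_ALL=C date '+%b %e %H:%M:%S')" "$3" "$4" "$5"
}

# Example with made-up host and tag:
#   rfc3164_line 1 6 testhost demo "hello from the lab"
# To actually send something, util-linux logger can target a listener:
#   logger --udp --server 10.1.1.20 --port 514  --rfc3164 "hello 514"
#   logger --udp --server 10.1.1.20 --port 5424 --rfc5424 "hello 5424"
```

If a test message lands in Loki with the protocol and format labels you expect, that listener is wired up correctly.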

We're going to need some directories, and we're going to have to chown them:


mkdir -p ./data/alloy
mkdir -p ./data/grafana
mkdir -p ./data/loki
mkdir -p ./data/prometheus

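chown -R 472:472 ./data/grafana          # grafana user in the image
chown -R 10001:10001 ./data/loki         # loki user in the image
chown -R 65534:65534 ./data/prometheus   # "nobody" in the image; numeric because the group is called nogroup on Debian/Ubuntu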

Now, to be clear, I haven't actually done anything with SNMP yet. But it is my goal.

I have created two config.alloy files, one for the server, and one for the agents.

Here is the one for the server. Use it to overwrite the config.alloy in the snmp directory of our alloy-scenarios starting point.


livedebugging {
  enabled = true
}

// --- Remote Write to Prometheus ---
prometheus.remote_write "remote" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

// --- SNMP Exporter Configuration ---
prometheus.exporter.snmp "snmp_exporter" {
    config_file = "/etc/alloy/snmp.yml"

    target "tm" {
        address     = "snmpd"
        module      = "CISCO"
        walk_params = "cisco" // must match the walk_param block name below
        labels = {
            "ilo_node" = "switch",
        }
    }

    walk_param "cisco" {
        retries = "2"
        timeout = "30s"
    }
}

// --- SNMP Scrape Configuration ---
discovery.relabel "snmp_targets" {
  targets = prometheus.exporter.snmp.snmp_exporter.targets
  rule {
    target_label = "job"
    replacement  = "snmp"
  }
}

prometheus.scrape "snmp_targets" {
  scrape_interval = "30s"
  targets         = discovery.relabel.snmp_targets.output
  forward_to      = [prometheus.remote_write.remote.receiver]
}



// 1. Define the rules.
// Note that forward_to is empty! We are only using this block to hold our rules.
loki.relabel "syslog" {
  forward_to = []

  rule {
    source_labels = ["__syslog_connection_ip_address"]
    target_label  = "ip_address"
  }
  rule {
    source_labels = ["__syslog_message_hostname"]
    target_label  = "hostname"
  }
  rule {
    source_labels = ["__syslog_message_app_name"]
    target_label  = "app_name"
  }
  rule {
    source_labels = ["__syslog_message_severity"]
    target_label  = "severity"
  }
  rule {
    source_labels = ["__syslog_message_facility"]
    target_label  = "facility"
  }

  // Smart Hostname Fallback
  rule {
    action        = "replace"
    source_labels = ["hostname", "ip_address"]
    separator     = ";"
    regex         = "^(?:-|);(.+)$"
    replacement   = "$1"
    target_label  = "hostname"
  }
}

// 2. Syslog Ingestion
loki.source.syslog "local" {
  // -- RFC 3164 UDP --
  listener {
    address       = "0.0.0.0:514"
    protocol      = "udp"
    syslog_format = "rfc3164"
    labels        = { component = "loki.source.syslog", protocol = "udp", format = "rfc3164" }
  }
  // -- RAW UDP --
  listener {
    address       = "0.0.0.0:515"
    protocol      = "udp"
    syslog_format = "raw"
    labels        = { component = "loki.source.syslog", protocol = "udp", format = "raw" }
  }
  // -- RFC 5424 UDP --
  listener {
    address       = "0.0.0.0:5424"
    protocol      = "udp"
    syslog_format = "rfc5424"
    labels        = { component = "loki.source.syslog", protocol = "udp", format = "rfc5424" }
  }

  // THIS IS THE MAGIC LINE:
  // We inject the rules directly into the syslog component so they run
  // BEFORE the internal labels are stripped.
  relabel_rules = loki.relabel.syslog.rules

  // We bypass the relabel receiver entirely and send the finalized logs straight to Loki
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
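The "Smart Hostname Fallback" rule above is the least obvious part, so here is a rough shell emulation of what I understand it to do: join hostname and ip_address with ";", and if the hostname half is empty or "-", use the IP instead. This is only a sketch of the relabel semantics, not Alloy code. The function name and the sample values are made up, and "-?" is just the grep/sed spelling of the "(?:-|)" group in the rule.

```shell
# Emulates the replace rule: source_labels joined with ";",
# regex fully anchored, target only rewritten when the regex matches.
# Arguments: hostname ip_address
fallback_hostname() {
  joined="$1;$2"
  if printf '%s\n' "$joined" | grep -Eq '^-?;.+$'; then
    # Hostname was empty or "-": keep what follows the separator (the IP)
    printf '%s\n' "$joined" | sed -E 's/^-?;(.+)$/\1/'
  else
    # No match: the existing hostname label is left alone
    printf '%s\n' "$1"
  fi
}

# Examples with made-up values:
#   fallback_hostname "-"  10.1.1.5   (falls back to the IP)
#   fallback_hostname pve1 10.1.1.5   (keeps the hostname)
```

That is why devices that send "-" or nothing as their hostname should still end up with a usable hostname label.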

And with this, we should have enough to start our grafana-prometheus-loki-alloy stack!

So, do that.
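For the record, "that" is a docker compose up -d from the directory holding the compose file. Below is a small sketch for checking that all four services answer afterwards. The function names are mine; the ports come from the compose file, and the paths are the health or readiness endpoints each project exposes, as far as I know.

```shell
# Print the readiness URL for each service in the stack.
# Argument: host running the stack (defaults to localhost)
stack_ready_urls() {
  host=${1:-localhost}
  printf '%s\n' \
    "http://$host:3100/ready" \
    "http://$host:9090/-/ready" \
    "http://$host:3000/api/health" \
    "http://$host:12345/-/ready"
}

# Probe each URL with curl and print OK or FAIL per endpoint.
check_stack() {
  stack_ready_urls "$1" | while read -r url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

# Usage:
#   docker compose up -d
#   check_stack localhost
```

Loki can take a little while to report ready after startup, so a FAIL on port 3100 in the first few seconds is not necessarily a problem.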

The Linux Agent Config.

Copy this file to /etc/alloy/config.alloy on each Linux host you want to monitor:


logging {
    level = "warn"
}

// This block relabels metrics coming from node_exporter to add standard labels
discovery.relabel "integrations_node_exporter" {
    targets = prometheus.exporter.unix.integrations_node_exporter.targets

    rule {
        // Set the instance label to the hostname of the machine
        target_label = "instance"
        replacement  = constants.hostname
    }

    rule {
        // Set a standard job name for all node_exporter metrics
        target_label = "job"
        replacement = "integrations/node_exporter"
    }
}

// Configure the node_exporter integration to collect system metrics
prometheus.exporter.unix "integrations_node_exporter" {
    // Disable unnecessary collectors to reduce overhead
    disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
    enable_collectors = ["meminfo"]

    filesystem {
        // Exclude filesystem types that aren't relevant for monitoring
        fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
        // Exclude mount points that aren't relevant for monitoring
        mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
        // Timeout for filesystem operations
        mount_timeout        = "5s"
    }

    netclass {
        // Ignore virtual and container network interfaces
        ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
    }

    netdev {
        // Exclude virtual and container network interfaces from device metrics
        device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
    }
}

// Define how to scrape metrics from the node_exporter
prometheus.scrape "integrations_node_exporter" {
    scrape_interval = "15s"
    // Use the targets with labels from the discovery.relabel component
    targets    = discovery.relabel.integrations_node_exporter.output
    // Send the scraped metrics straight to the remote_write component below
    forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
    endpoint {
        // Send metrics to Prometheus on the monitoring server (10.1.1.20 here; change to suit)
        url = "http://10.1.1.20:9090/api/v1/write"
    }
}


// --- System Logs ---
// Translate the journal's underscore-prefixed metadata into clean
// Loki label names.
loki.relabel "journal" {
    forward_to = []

    // 1. Extract Hostname
    rule {
        source_labels = ["__journal__hostname"]
        target_label  = "hostname"
    }

    // 2. Extract Systemd Unit (We keep this so your process drop rules work)
    rule {
        source_labels = ["__journal__systemd_unit"]
        target_label  = "unit"
    }

    // 3. Extract the App Name (e.g., "sshd", "dhcpd")
    // Journald calls this SYSLOG_IDENTIFIER.
    rule {
        source_labels = ["__journal_syslog_identifier"]
        target_label  = "app_name"
    }

    // 4. Smart App Name Fallback
    // If a log entry doesn't have a SYSLOG_IDENTIFIER, fall back to using the unit name.
    rule {
        action        = "replace"
        source_labels = ["app_name", "unit"]
        separator     = ";"
        regex         = "^(?:|);(.+)$"
        replacement   = "$1"
        target_label  = "app_name"
    }

    // 5. Extract Priority
    // Journald native priorities are numbers. (0=emerg ... 6=info, 7=debug)
    rule {
        source_labels = ["__journal_priority"]
        target_label  = "priority"
    }

    rule {
        source_labels = ["priority"]
        regex         = "0"
        replacement   = "emerg"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "1"
        replacement   = "alert"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "2"
        replacement   = "crit"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "3"
        replacement   = "err"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "4"
        replacement   = "warning"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "5"
        replacement   = "notice"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "6"
        replacement   = "info"
        target_label  = "level"
    }
    rule {
        source_labels = ["priority"]
        regex         = "7"
        replacement   = "debug"
        target_label  = "level"
    }
}

loki.source.journal "host" {
    max_age       = "12h"
    relabel_rules = loki.relabel.journal.rules
    labels        = { job = "systemd-journal" }
    forward_to    = [loki.process.journal.receiver]
}

loki.process "journal" {
    // Optional (disabled): drop high-volume units that rarely carry actionable signal.
    // Uncomment to enable.
    //stage.match {
    //  selector = `{unit=~"systemd-logind.service|systemd-tmpfiles-clean.service|cron.service"}`
    //  action   = "drop"
    //}

    // Optional (disabled): drop low-priority entries.
    // Because journald uses syslog severity numbers, we check for 6 (info) and 7 (debug).
    //stage.match {
    //  selector = `{priority=~"6|7"}`
    //  action   = "drop"
    //}

    forward_to = [loki.write.local.receiver]
}

loki.write "local" {
    endpoint {
        url = "http://10.1.1.20:3100/loki/api/v1/push"
    }
}
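As a quick reference for the priority rules in the journal section above, this is the same number-to-name mapping as a shell function. It is only a sketch for checking what level label a given journald priority should get; the function name is made up.

```shell
# Mirror of the priority -> level relabel rules in loki.relabel "journal".
# Argument: journald PRIORITY value (0-7)
priority_to_level() {
  case "$1" in
    0) echo emerg ;;
    1) echo alert ;;
    2) echo crit ;;
    3) echo err ;;
    4) echo warning ;;
    5) echo notice ;;
    6) echo info ;;
    7) echo debug ;;
    *) return 1 ;;   # no rule matches, so no level label is set
  esac
}

# To compare against a host, list recent entries at "err" or worse:
#   journalctl -p 3 -n 5
# and check which level values Loki has seen so far:
#   curl -s http://10.1.1.20:3100/loki/api/v1/label/level/values
```

In Grafana, a selector like {job="systemd-journal", level="err"} should then return the same entries.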

There is more to know and do (e.g., a Windows agent and Grafana dashboards).

But this will get you started!
