I've been playing with Grafana Alloy this weekend.
I started at "I want graphs", installed LibreNMS, couldn't get graphs out of a Proxmox server via SNMP, and somehow we ended up here.
Here are some notes to get us started.
I started with the alloy-scenarios GitHub repo and worked from its "snmp" example. My ultimate goal is to get SNMP working, but first I need to make graphs for things I actually have on my network.
Let's define the goals.
1) Graphs
2) Logs
3) Alerting
4) Consistency
Alloy ticks all of these, in a roundabout fashion.
Let's start with the docker-compose.yml I ended up with:
services:
  loki:
    image: grafana/loki:${GRAFANA_LOKI_VERSION:-3.6.10}
    container_name: loki
    hostname: loki
    ports:
      - 3100:3100/tcp
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - ./data/loki:/tmp/loki # Persistence
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - observability
  prometheus:
    image: prom/prometheus:${PROMETHEUS_VERSION:-v3.11.3}
    container_name: prometheus
    hostname: prometheus
    command:
      - --web.enable-remote-write-receiver
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus # Explicitly tell Prom where to store data
    ports:
      - 9090:9090/tcp
    volumes:
      - ./prom-config.yaml:/etc/prometheus/prometheus.yml
      - ./data/prometheus:/prometheus # Persistence
    networks:
      - observability
  grafana:
    image: grafana/grafana:${GRAFANA_VERSION:-13.0.1}
    container_name: grafana
    hostname: grafana
    environment:
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_BASIC_ENABLED=false
    ports:
      - 3000:3000/tcp
    volumes:
      - ./data/grafana:/var/lib/grafana # Persistence
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml
    networks:
      - observability
  alloy:
    hostname: alloy
    container_name: alloy
    image: grafana/alloy:${GRAFANA_ALLOY_VERSION:-v1.16.1}
    ports:
      - "12345:12345/tcp" # Alloy UI
      - "514:514/udp" # Standard syslog (RFC3164)
      - "514:514/tcp" # Standard syslog (RFC3164)
      - "515:515/udp" # RAW logs
      - "515:515/tcp" # RAW logs
      - "5424:5424/udp" # RFC5424 Syslog
      - "5424:5424/tcp" # RFC5424 Syslog
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy
      - ./snmp.yml:/etc/alloy/snmp.yml
      - ./data/alloy:/var/lib/alloy/data # Persistence
    networks:
      - observability
    command: run --stability.level=experimental --server.http.listen-addr=0.0.0.0:12345 --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy
networks:
  observability:
    driver: bridge
And the referenced grafana-datasources.yaml:
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    orgId: 1
    url: http://loki:3100
    basicAuth: false
    isDefault: false
    version: 1
    editable: false
  - name: Prometheus
    type: prometheus
    orgId: 1
    url: http://prometheus:9090
    basicAuth: false
    isDefault: true
    version: 1
    editable: false
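You'll notice we have a lot of ports open for syslog ingress. This is because we can't install the Alloy agent on everything, but anything that can send syslog should still give us good data. Should. Note that the server config further down only defines UDP listeners, so the TCP port mappings are unused for now.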
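Once the stack is running, a quick way to confirm a listener works is to hand-craft a message and fire it at the port. The sketch below builds an RFC 3164 line and sends it over UDP. The function names, the alloy-test tag, and the use of nc are my own assumptions for illustration; none of this comes from the alloy-scenarios repo.

```shell
#!/bin/sh
# Hypothetical helpers for smoke-testing the syslog listeners.

# PRI value as defined by RFC 3164: facility * 8 + severity.
syslog_pri() {
  echo $(( $1 * 8 + $2 ))
}

# Build one RFC 3164 line: <PRI>TIMESTAMP HOSTNAME TAG: MESSAGE
# $1 = facility, $2 = severity, $3 = hostname, $4 = tag, $5 = message
rfc3164_line() {
  # LC_ALL=C keeps the month abbreviation in English, as the RFC expects.
  printf '<%s>%s %s %s: %s\n' "$(syslog_pri "$1" "$2")" \
    "$(LC_ALL=C date '+%b %e %H:%M:%S')" "$3" "$4" "$5"
}

# Send one line over UDP. Assumes a netcat that supports -u and -w.
# $1 = server, $2 = port, remaining arguments are passed to rfc3164_line.
send_test_syslog() {
  server="$1"; port="$2"; shift 2
  rfc3164_line "$@" | nc -u -w1 "$server" "$port"
}

# Example (facility 1 = user, severity 6 = info):
#   send_test_syslog 127.0.0.1 514 1 6 "$(hostname)" alloy-test "hello from nc"
```

If everything is wired up, the message should be findable in Grafana's Explore view by the listener label, for example {format="rfc3164"}.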
We're going to need some directories, and we're going to have to chown them:
mkdir -p ./data/alloy
mkdir -p ./data/grafana
mkdir -p ./data/loki
mkdir -p ./data/prometheus
chown -R 472:472 ./data/grafana
chown -R 10001:10001 ./data/loki
chown -R 65534:65534 ./data/prometheus # uid/gid of "nobody" in the Prometheus image; the group name differs between distros
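If a container later dies with permission errors, it is worth confirming that the ownership change actually took. A small sketch, assuming GNU stat (for the -c flag); check_owner is a hypothetical helper name:

```shell
# Print OK or MISMATCH for a directory's numeric owner.
# $1 = directory, $2 = expected uid:gid
check_owner() {
  actual="$(stat -c '%u:%g' "$1")" || return 1
  if [ "$actual" = "$2" ]; then
    echo "OK $1 $actual"
  else
    echo "MISMATCH $1 is $actual, expected $2"
    return 1
  fi
}

# Example:
#   check_owner ./data/grafana 472:472
#   check_owner ./data/loki 10001:10001
```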
Now, to be clear, I haven't actually done anything with SNMP yet. But it is my goal.
I have created two config.alloy files, one for the server, and one for the agents.
Here is the one for the server. Use it to overwrite the config.alloy in the snmp directory from our alloy-scenarios starting point.
livedebugging {
enabled = true
}
// --- Remote Write to Prometheus ---
prometheus.remote_write "remote" {
endpoint {
url = "http://prometheus:9090/api/v1/write"
}
}
// --- SNMP Exporter Configuration ---
prometheus.exporter.snmp "snmp_exporter" {
config_file = "/etc/alloy/snmp.yml"
target "tm" {
address = "snmpd"
module = "CISCO"
walk_params = "cisco"
labels = {
"ilo_node" = "switch",
}
}
walk_param "cisco" {
retries = "2"
timeout = "30s"
}
}
// --- SNMP Scrape Configuration ---
discovery.relabel "snmp_targets" {
targets = prometheus.exporter.snmp.snmp_exporter.targets
rule {
target_label = "job"
replacement = "snmp"
}
}
prometheus.scrape "snmp_targets" {
scrape_interval = "30s"
targets = discovery.relabel.snmp_targets.output
forward_to = [prometheus.remote_write.remote.receiver]
}
// 1. Define the rules.
// Note that forward_to is empty! We are only using this block to hold our rules.
loki.relabel "syslog" {
forward_to = []
rule {
source_labels = ["__syslog_connection_ip_address"]
target_label = "ip_address"
}
rule {
source_labels = ["__syslog_message_hostname"]
target_label = "hostname"
}
rule {
source_labels = ["__syslog_message_app_name"]
target_label = "app_name"
}
rule {
source_labels = ["__syslog_message_severity"]
target_label = "severity"
}
rule {
source_labels = ["__syslog_message_facility"]
target_label = "facility"
}
// Smart Hostname Fallback
rule {
action = "replace"
source_labels = ["hostname", "ip_address"]
separator = ";"
regex = "^(?:-|);(.+)$"
replacement = "$1"
target_label = "hostname"
}
}
// 2. Syslog Ingestion
loki.source.syslog "local" {
// -- RFC 3164 UDP --
listener {
address = "0.0.0.0:514"
protocol = "udp"
syslog_format = "rfc3164"
labels = { component = "loki.source.syslog", protocol = "udp", format = "rfc3164" }
}
// -- RAW UDP --
listener {
address = "0.0.0.0:515"
protocol = "udp"
syslog_format = "raw"
labels = { component = "loki.source.syslog", protocol = "udp", format = "raw" }
}
// -- RFC 5424 UDP --
listener {
address = "0.0.0.0:5424"
protocol = "udp"
syslog_format = "rfc5424"
labels = { component = "loki.source.syslog", protocol = "udp", format = "rfc5424" }
}
// THIS IS THE MAGIC LINE:
// We inject the rules directly into the syslog component so they run
// BEFORE the internal labels are stripped.
relabel_rules = loki.relabel.syslog.rules
// We bypass the relabel receiver entirely and send the finalized logs straight to Loki
forward_to = [loki.write.local.receiver]
}
loki.write "local" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
}
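The Smart Hostname Fallback rule in the server config is the least obvious part, so here is a worked example of the same logic in shell. The rule joins hostname and ip_address with ";", and only when the hostname half is empty or "-" does the regex match and the IP take its place. hostname_fallback is a hypothetical helper that mimics that behaviour, and the sample values are made up; it is not how Alloy evaluates rules internally.

```shell
# $1 = hostname label, $2 = ip_address label
hostname_fallback() {
  # Same pattern as the rule, written with a plain group because
  # grep -E does not support the (?:...) syntax.
  if printf '%s\n' "$1;$2" | grep -Eq '^(-|);.+$'; then
    printf '%s\n' "$2"   # hostname missing: use the IP instead
  else
    printf '%s\n' "$1"   # no match: hostname is left alone
  fi
}

hostname_fallback "-" "10.1.1.50"       # prints 10.1.1.50
hostname_fallback "" "10.1.1.50"        # prints 10.1.1.50
hostname_fallback "pve01" "10.1.1.50"   # prints pve01
```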
And with this config in place, we should have enough to start our grafana-prometheus-loki-alloy stack!
So, do that.
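Concretely, that means running docker compose up -d from the directory holding the compose file, then checking that each service answers. The sketch below polls the readiness endpoints on the ports published in the compose file. It assumes curl is installed and that you run it on the Docker host; check_ready and check_stack are names I made up.

```shell
# Report whether an HTTP endpoint answers with a success status.
# $1 = label, $2 = URL
check_ready() {
  if curl -fsS --max-time 5 -o /dev/null "$2" 2>/dev/null; then
    echo "$1: ready"
  else
    echo "$1: NOT ready"
    return 1
  fi
}

# Start the stack first:
#   docker compose up -d
# then call check_stack.
check_stack() {
  check_ready loki       http://localhost:3100/ready
  check_ready prometheus http://localhost:9090/-/ready
  check_ready grafana    http://localhost:3000/api/health
  check_ready alloy      http://localhost:12345/-/ready
}
```

Loki in particular can take a little while after startup before it reports ready, so a NOT ready immediately after bringing the stack up is not necessarily a problem. Try again after half a minute before digging into logs.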
The Linux Agent Config.
Copy this file to /etc/alloy/config.alloy on each Linux machine you want to monitor:
logging {
level = "warn"
}
// This block relabels metrics coming from node_exporter to add standard labels
discovery.relabel "integrations_node_exporter" {
targets = prometheus.exporter.unix.integrations_node_exporter.targets
rule {
// Set the instance label to the hostname of the machine
target_label = "instance"
replacement = constants.hostname
}
rule {
// Set a standard job name for all node_exporter metrics
target_label = "job"
replacement = "integrations/node_exporter"
}
}
// Configure the node_exporter integration to collect system metrics
prometheus.exporter.unix "integrations_node_exporter" {
// Disable unnecessary collectors to reduce overhead
disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
enable_collectors = ["meminfo"]
filesystem {
// Exclude filesystem types that aren't relevant for monitoring
fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
// Exclude mount points that aren't relevant for monitoring
mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
// Timeout for filesystem operations
mount_timeout = "5s"
}
netclass {
// Ignore virtual and container network interfaces
ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
}
netdev {
// Exclude virtual and container network interfaces from device metrics
device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
}
}
// Define how to scrape metrics from the node_exporter
prometheus.scrape "integrations_node_exporter" {
scrape_interval = "15s"
// Use the targets with labels from the discovery.relabel component
targets = discovery.relabel.integrations_node_exporter.output
// Send the scraped metrics straight to the remote_write component
forward_to = [prometheus.remote_write.local.receiver]
}
prometheus.remote_write "local" {
endpoint {
// Send metrics to the Prometheus instance on the server (10.1.1.20 on my network; change to suit)
url = "http://10.1.1.20:9090/api/v1/write"
}
}
// --- System Logs ---
// Translate the journal's underscore-prefixed metadata into clean
// Loki label names.
loki.relabel "journal" {
forward_to = []
// 1. Extract Hostname
rule {
source_labels = ["__journal__hostname"]
target_label = "hostname"
}
// 2. Extract Systemd Unit (We keep this so your process drop rules work)
rule {
source_labels = ["__journal__systemd_unit"]
target_label = "unit"
}
// 3. Extract the App Name (e.g., "sshd", "dhcpd")
// Journald calls this SYSLOG_IDENTIFIER.
rule {
source_labels = ["__journal_syslog_identifier"]
target_label = "app_name"
}
// 4. Smart App Name Fallback
// If a log entry doesn't have a SYSLOG_IDENTIFIER, fall back to using the unit name.
rule {
action = "replace"
source_labels = ["app_name", "unit"]
separator = ";"
regex = "^;(.+)$"
replacement = "$1"
target_label = "app_name"
}
// 5. Extract Priority
// Journald native priorities are numbers. (0=emerg ... 6=info, 7=debug)
rule {
source_labels = ["__journal_priority"]
target_label = "priority"
}
rule { source_labels = ["priority"]
regex = "0"
replacement = "emerg"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "1"
replacement = "alert"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "2"
replacement = "crit"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "3"
replacement = "err"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "4"
replacement = "warning"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "5"
replacement = "notice"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "6"
replacement = "info"
target_label = "level" }
rule { source_labels = ["priority"]
regex = "7"
replacement = "debug"
target_label = "level" }
}
loki.source.journal "host" {
max_age = "12h"
relabel_rules = loki.relabel.journal.rules
labels = { job = "systemd-journal" }
forward_to = [loki.process.journal.receiver]
}
loki.process "journal" {
// Drop high-volume units that rarely carry actionable signal
//stage.match {
// selector = `{unit=~"systemd-logind.service|systemd-tmpfiles-clean.service|cron.service"}`
// action = "drop"
//}
// Optional (disabled): drop low-priority entries.
// Because journald uses syslog severity numbers, we check for 6 (info) and 7 (debug).
//stage.match {
// selector = `{priority=~"6|7"}`
// action = "drop"
//}
forward_to = [loki.write.local.receiver]
}
loki.write "local" {
endpoint {
url = "http://10.1.1.20:3100/loki/api/v1/push"
}
}
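The eight priority rules near the end of the agent config are just a lookup table from journald's numeric priority to the usual syslog keyword. As a worked example, here is the same table as a shell function. priority_to_level is a hypothetical name; an unknown value prints nothing, in the same way that a priority matching none of the rules leaves the level label unset.

```shell
# Map a journald/syslog priority number to its keyword.
priority_to_level() {
  case "$1" in
    0) echo emerg ;;
    1) echo alert ;;
    2) echo crit ;;
    3) echo err ;;
    4) echo warning ;;
    5) echo notice ;;
    6) echo info ;;
    7) echo debug ;;
    *) return 1 ;;   # not a valid priority: no level
  esac
}

priority_to_level 3   # prints err
priority_to_level 6   # prints info
```

With those labels in place, a Loki selector such as {job="systemd-journal", level="err"} should narrow things down to error entries from the agents.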
There is more to know and do (e.g., a Windows agent, Grafana dashboards), but this will get you started!