Service Reliability Monitoring
On production servers, service stability is paramount. Frequent restarts indicate underlying issues that need investigation before they cause outages.
Checking Failed Services
systemctl --failed
systemctl list-units --state=failed
systemctl status failed-service.service
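For unattended checks (cron, a monitoring agent), the failed-unit listing can be turned into an exit status. The following is a minimal sketch, not a definitive implementation; the function name check_failed is hypothetical, and it assumes the input comes from `systemctl list-units --state=failed --no-legend --plain`, where the first field of each line is the unit name.

```shell
#!/bin/sh
# check_failed (hypothetical helper): reads failed-unit lines on stdin,
# prints one "FAILED: <unit>" line per unit, and returns 1 if any were seen.
check_failed() {
    awk '
        NF > 0 { print "FAILED: " $1; n++ }      # first field is the unit name
        END {
            if (n == 0) print "OK: no failed units"
            exit (n > 0)                         # non-zero status when units failed
        }
    '
}
```

Assumed usage: `systemctl list-units --state=failed --no-legend --plain | check_failed`, with the exit status driving an alert.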
Monitoring Restart Counts
systemctl show nginx.service -p NRestarts
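# Restart counts for all loaded services, highest first:
for u in $(systemctl list-units --type=service --no-legend --plain | awk '{print $1}'); do printf '%s %s\n' "$(systemctl show "$u" -p NRestarts --value)" "$u"; done | sort -rn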
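Restart counts can also be collected in a single systemctl call and ranked afterwards. The sketch below assumes the input is the output of `systemctl show '*.service' -p Id -p NRestarts`, which should print one block of Key=Value lines per loaded unit with blank lines between blocks. The function name rank_restarts is hypothetical.

```shell
#!/bin/sh
# rank_restarts (hypothetical helper): reads blank-line-separated blocks of
# Id= and NRestarts= lines on stdin and prints "<count> <unit>" for every
# unit restarted at least once, highest count first.
rank_restarts() {
    awk -F= '
        $1 == "Id"        { id = $2 }
        $1 == "NRestarts" { n = $2 }
        # A blank line closes one unit block; END flushes the last block.
        NF == 0 { if (id != "" && n > 0) print n, id; id = ""; n = 0 }
        END     { if (id != "" && n > 0) print n, id }
    ' | sort -k1,1nr -k2
}
```

Assumed usage: `systemctl show '*.service' -p Id -p NRestarts | rank_restarts`. Units that have never been restarted are omitted to keep the report short.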
Restart Policies
# In the unit file, [Service] section:
# Restart=no|on-success|on-failure|on-abnormal|on-watchdog|on-abort|always
# RestartSec=5
# In the unit file, [Unit] section (rate-limits start attempts):
# StartLimitBurst=3
# StartLimitIntervalSec=60
systemctl show nginx -p Restart,RestartUSec
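RestartSec and the start limit interact: if restarts come quickly enough, the unit exhausts its start limit, enters the failed state, and is no longer restarted. The sketch below is a rough back-of-the-envelope check, assuming the worst case of a service that crashes immediately after every start (real start-up time and the exact boundary behaviour are ignored). The function name hits_start_limit is hypothetical.

```shell
#!/bin/sh
# hits_start_limit RESTART_SEC BURST INTERVAL_SEC   (hypothetical helper)
# Returns 0 if an instantly crashing service would exhaust its start limit,
# 1 otherwise. All arguments are whole seconds or counts.
hits_start_limit() {
    restart_sec=$1 burst=$2 interval=$3
    # Starts happen at t = 0, R, 2R, ...; start number (burst + 1) is the
    # first one that can be refused, and it is attempted at t = burst * R.
    [ $((burst * restart_sec)) -le "$interval" ]
}
```

With the values shown above (RestartSec=5, StartLimitBurst=3, StartLimitIntervalSec=60), the fourth start would be attempted after roughly 15 seconds, well inside the 60-second window, so `hits_start_limit 5 3 60` succeeds. Raising RestartSec to 30 pushes the fourth start to about 90 seconds, and the limit would not be reached under this estimate.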
Watchdog Configuration
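# WatchdogSec=30 in the [Service] section of the unit file; the service must send
# WATCHDOG=1 keep-alive notifications (sd_notify) within that interval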
systemctl show nginx -p WatchdogUSec
# List ordering dependencies that involve units with "watchdog" in their name:
systemd-analyze dot --order | grep watchdog
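A watchdog only helps if the service actually sends keep-alive pings. For a service wrapped in a shell script, one possible approach is sketched below. It assumes systemd exports the timeout to the service as WATCHDOG_USEC (in microseconds), that pinging at half the timeout is acceptable, and that NotifyAccess= in the unit is set so that notifications from `systemd-notify` are accepted. The function names and the health-check command passed to watchdog_loop are hypothetical.

```shell
#!/bin/sh
# ping_interval_sec USEC (hypothetical helper): prints half of the watchdog
# timeout in whole seconds, with a minimum of 1.
ping_interval_sec() {
    half=$(( $1 / 2000000 ))
    [ "$half" -ge 1 ] || half=1
    echo "$half"
}

# watchdog_loop CMD [ARGS...] (hypothetical helper): pings systemd for as long
# as the caller-supplied health check CMD keeps succeeding. When the check
# fails, the pings stop and systemd's watchdog timeout takes over.
watchdog_loop() {
    interval=$(ping_interval_sec "$WATCHDOG_USEC")
    while "$@"; do
        systemd-notify WATCHDOG=1
        sleep "$interval"
    done
}
```

With WatchdogSec=30, WATCHDOG_USEC would be 30000000 and the loop would ping about every 15 seconds.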
Crash Log Analysis
journalctl -p err --since "24 hours ago"
journalctl -u nginx --since "1 week ago" | grep -Ei "crash|segfault|killed"
coredumpctl list
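To see at a glance which kind of failure dominates, the grep above can be turned into a per-keyword count. This is a minimal sketch that assumes journal text on stdin and reuses the same three keywords. The function name crash_summary is hypothetical.

```shell
#!/bin/sh
# crash_summary (hypothetical helper): reads log lines on stdin and prints,
# for each keyword, the number of lines containing it (case-insensitive).
crash_summary() {
    awk '
        { line = tolower($0) }
        index(line, "crash")    { c++ }
        index(line, "segfault") { s++ }
        index(line, "killed")   { k++ }
        END { printf "crash %d\nsegfault %d\nkilled %d\n", c, s, k }
    '
}
```

Assumed usage: `journalctl -u nginx --since "1 week ago" | crash_summary`. A line that contains several keywords is counted once for each of them.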
Automated Monitoring with dargslan-service-restart
pip install dargslan-service-restart
dargslan-service-restart
dargslan-service-restart --failed
dargslan-service-restart --restarts