Device Health Status Dump on Linux Distributions

Exported on 29-Oct-2021 11:41:08

Using Attune to dump vital health status to Attune's running log on popular Linux distributions

In this blueprint, we use common commands to check the health status of the system and print their output to stdout. Attune directs that output to its running log, which can be viewed in real time while a job is executing, or afterwards from the Jobs -> history interface.
The main purpose of this blueprint is to let other blueprints inspect its results and create health reports accordingly, so it serves as the first stage (data generation and gathering) of a data processing pipeline.
Users can also learn from this blueprint the commands used to check the health status of a Linux system.
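
As a rough illustration of how a downstream blueprint might inspect this output, the sketch below pulls one command's output out of a saved copy of the running log. It assumes the log has been saved to a plain text file without extra prefixes and that each section starts with a banner of = lines around a title, as the scripts in this blueprint print. The function name extract_section is hypothetical and not part of Attune.

```shell
# Print the body of the section whose banner title contains the given text.
# $1: text to look for in the banner title, e.g. '(free -m)'
# $2: path to a saved copy of the running log (assumed plain text)
extract_section() {
    awk -v title="$1" '
        # a line made only of "=" opens or closes a banner
        /^=+$/    { in_banner = !in_banner; if (in_banner) wanted = 0; next }
        # title lines sit between the two "=" lines
        in_banner { if (index($0, title)) wanted = 1; next }
        # everything after a matching banner is the section body
        wanted    { print }
    ' "$2"
}

# usage (path is a placeholder):
# extract_section '(free -m)' /path/to/saved.log
```

A command whose own output contains a line made only of = characters would confuse this parser, so it is a starting point rather than a robust solution.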

This has been tested on Ubuntu 20.04.2 LTS, Debian 11.0.0, and CentOS 8.

Pre-Blueprint Attune setup
  1. On the Inputs tab, create a Linux node for the host whose health status you wish to check.
  2. On the Inputs tab, create a Linux credential to connect to that host.
  3. On the Inputs tab, create a Linux credential with Sudo To root set to connect to that host. Some health check commands need root privileges to run successfully, and root is also needed to install packages when a command is not found.

Parameters

Name                                    Type                 Script Reference                   Default Value                    Comment
Linux Distro Checking Result Temp File  Text                 linuxDistroCheckingResultTempFile  /tmp/distro_check_result.attune
Linux Node                              Linux / Unix Server  linuxNode
Linux User                              Linux OS Credential  linuxUser
Linux User(sudo)                        Linux OS Credential  linuxUsersudo

1 - Check Distro and Version

Check the distro and version, print the result to Attune's running log, and store the result in a temp file.
Later steps use this information to determine how to install missing packages.

This step has the following parameters

Name                                    Script Reference                     Default Value
Linux Distro Checking Result Temp File  {linuxDistroCheckingResultTempFile}  /tmp/distro_check_result.attune
The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
if [ -f /etc/os-release ]
then
    # freedesktop.org and systemd
    # all the distros supported by this blueprint have /etc/os-release
    . /etc/os-release
    DISTRO=$ID
    VER=$VERSION_ID
elif type lsb_release >/dev/null 2>&1
then
    # linuxbase.org 
    # UNTESTED
    DISTRO=$(lsb_release -si)
    VER=$(lsb_release -sr)
elif [ -f /etc/lsb-release ]
then
    # For some versions of Debian/Ubuntu without lsb_release command
    # UNTESTED
    . /etc/lsb-release
    DISTRO=$DISTRIB_ID
    VER=$DISTRIB_RELEASE
elif [ -f /etc/debian_version ]
then
    # Older Debian/Ubuntu/etc.
    # UNTESTED
    DISTRO=Debian
    VER=$(cat /etc/debian_version)
elif [ -f /etc/SuSE-release ]
then
    # Older SuSE/etc. 
    # TODO currently unimplemented
    :
elif [ -f /etc/redhat-release ]
then
    # Older Red Hat, CentOS, etc. 
    # TODO currently unimplemented
    :
else
    # Fall back to uname, e.g. "Linux <version>", also works for BSD, etc.
    # UNTESTED
    DISTRO=$(uname -s)
    VER=$(uname -r)
fi

echo "DISTRO=$DISTRO"
echo "VER=$VER"

# write distro checking result to file
cat << EOF > {linuxDistroCheckingResultTempFile}
DISTRO='$DISTRO'
VER='$VER'
EOF
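
Later steps load this file with the . (source) command, which executes whatever the file contains. Since the file lives under /tmp and some later steps run with Sudo To root, a step could parse the file instead of executing it. The following is a minimal sketch, assuming the two-line KEY='value' format written above; read_distro_file is a hypothetical helper, not something Attune provides.

```shell
# Set DISTRO and VER from the result file without executing it.
# $1: path to the file written by the distro check step
read_distro_file() {
    local key value
    DISTRO=
    VER=
    while IFS='=' read -r key value
    do
        # strip the surrounding single quotes added when the file was written
        value=${value#\'}
        value=${value%\'}
        case $key in
            DISTRO) DISTRO=$value ;;
            VER)    VER=$value ;;
        esac   # any other line is ignored
    done < "$1"
}

# usage inside an Attune step:
# read_distro_file {linuxDistroCheckingResultTempFile}
```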

2 - Gather General Info

Show general info about the system, such as date and time, hostname, info about CPU, etc.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Display the current date and time of the host(date)"
echo "========================================================================="
date
echo

echo "========================================================================="
echo "Print system information(uname -a)"
echo "========================================================================="
uname -a
echo

echo "========================================================================="
echo "Query the system hostname and related settings(hostnamectl)"
echo "========================================================================="
hostnamectl
echo

echo "========================================================================="
echo "Show who is logged on and what they are doing"
echo "This also includes the output of 'uptime'"
echo "(w)"
echo "========================================================================="
w
echo

echo "========================================================================="
echo "Display information about the CPU architecture(lscpu)"
echo "========================================================================="
lscpu
echo

echo "========================================================================="
echo "Display information about the CPU architecture"
echo "in table view with one CPU per line"
echo "(lscpu -ae)"
echo "========================================================================="
lscpu -ae
echo

echo "========================================================================="
echo "Show content of kernel's info of CPU(cat /proc/cpuinfo)"
echo "========================================================================="
[ -f /proc/cpuinfo ] && cat /proc/cpuinfo
echo
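
The banner, command, blank line pattern repeated throughout this step could be factored into a small helper. The sketch below shows one way to do that; run_section and BANNER are hypothetical names, and the banner width of 73 characters is an arbitrary choice. The helper returns the command's own exit code so that a failure is not hidden by the trailing echo.

```shell
# a line of 73 "=" characters (the width is an arbitrary choice)
BANNER=$(printf '=%.0s' {1..73})

# Print a banner with the title and the command, run the command,
# then print a blank line.
# $1: human-readable title; remaining arguments: the command to run
run_section() {
    local title=$1 status
    shift
    printf '%s\n' "$BANNER" "$title($*)" "$BANNER"
    "$@"
    status=$?
    echo
    return "$status"
}

# usage:
# run_section "Print system information" uname -a
```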

3 - Gather System Logs

Show the content of the system logs. They are usually lengthy, so each step displays only one log.

3.1 - Kernel Ring Buffer - dmesg

Print the kernel ring buffer. This is a long listing, so it gets a step of its own.
Debian requires root privileges to run dmesg, so a credential with Sudo To root is needed.

The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Print the kernel ring buffer(dmesg)"
echo "========================================================================="
dmesg

3.2 - System Log File

Print the system log file. This is a long listing, so it gets a step of its own.
Debian requires root privileges to show the content of /var/log/syslog, so a credential with Sudo To root is needed.

This step has the following parameters

Name                                    Script Reference                     Default Value
Linux Distro Checking Result Temp File  {linuxDistroCheckingResultTempFile}  /tmp/distro_check_result.attune

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
# The system log file path differs between distros,
# so we handle each distro separately
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
    ubuntu | debian)
        echo "========================================================================="
        echo "Print the system log file(cat /var/log/syslog)"
        echo "========================================================================="
        [ -f /var/log/syslog ] && cat /var/log/syslog
        ;;
    centos)
        echo "========================================================================="
        echo "Print the system log file(cat /var/log/messages)"
        echo "========================================================================="
        [ -f /var/log/messages ] && cat /var/log/messages
        ;;
    *)
        echo "unsupported distro"
        false # exit code 1 makes Attune stop running the job
        ;;
esac
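
Because these log files can be very long, a variant of this step might print only the most recent lines. The following is a minimal sketch of that idea; print_log_tail and its default of 200 lines are assumptions made for illustration and are not part of the blueprint.

```shell
# Print the last lines of a log file, or fail if it cannot be read.
# $1: log file path
# $2: number of trailing lines to print (defaults to 200, an arbitrary choice)
print_log_tail() {
    local file=$1 lines=${2:-200}
    if [ -r "$file" ]
    then
        tail -n "$lines" "$file"
    else
        echo "cannot read $file" >&2
        return 1
    fi
}

# usage:
# print_log_tail /var/log/syslog 500
```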

4 - Gather Modules Status

Show information about loaded kernel modules and about PCI and USB hardware devices.
A package may need to be installed, so Sudo To root is required.

This step has the following parameters

Name                                    Script Reference                     Default Value
Linux Distro Checking Result Temp File  {linuxDistroCheckingResultTempFile}  /tmp/distro_check_result.attune

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
# 'lsusb' isn't installed on CentOS by default
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
    ubuntu | debian)
        # Nothing to be done
        ;;
    centos)
        dnf install -y usbutils
        ;;
    *)
        echo "unsupported distro"
        false # exit code 1 makes Attune stop running the job
        ;;
esac


echo "========================================================================="
echo "Show the status of modules in the Linux Kernel(lsmod)"
echo "========================================================================="
lsmod
echo

echo "========================================================================="
echo "List all PCI devices(lspci)"
echo "========================================================================="
lspci
echo

echo "========================================================================="
echo "List all PCI devices in verbose mode(lspci -v)"
echo "========================================================================="
lspci -v
echo

echo "========================================================================="
echo "List USB devices(lsusb)"
echo "========================================================================="
lsusb
echo

echo "========================================================================="
echo "List USB devices in verbose mode(lsusb -v)"
echo "========================================================================="
lsusb -v
echo
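
The case statement in this step runs the CentOS package install whether or not lsusb is already present. One alternative is to check first which commands are actually missing and install only when the list is not empty. The sketch below assumes nothing beyond the shell built-in command -v; missing_commands is a hypothetical helper name.

```shell
# Print the name of every given command that cannot be found on PATH,
# one per line. Prints nothing when all commands are available.
missing_commands() {
    local c
    for c in "$@"
    do
        command -v "$c" >/dev/null 2>&1 || printf '%s\n' "$c"
    done
}

# usage:
# if [ -n "$(missing_commands lsmod lspci lsusb)" ]
# then
#     echo "some commands need to be installed first"
# fi
```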

5 - Gather Memory Stats

Show memory related info.

The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Display amount of free and used memory in the system(free -m)"
echo "========================================================================="
free -m
echo

echo "========================================================================="
echo "Display kernel info of memory(cat /proc/meminfo)"
echo "========================================================================="
[ -f /proc/meminfo ] && cat /proc/meminfo
echo
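
A blueprint that consumes this output may want a single figure rather than the full listing. As a rough sketch, the function below computes the percentage of memory that is still available from the MemTotal and MemAvailable fields of /proc/meminfo. It assumes a kernel recent enough to report MemAvailable; mem_available_percent is a hypothetical name, and the optional file argument exists only so the function can be tried on a saved copy.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# $1: meminfo file to read (defaults to /proc/meminfo)
mem_available_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}

# usage:
# echo "Available memory: $(mem_available_percent)%"
```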

6 - Gather Network Info

Show network related info.
A package may need to be installed, so Sudo To root is required.

This step has the following parameters

Name                                    Script Reference                     Default Value
Linux Distro Checking Result Temp File  {linuxDistroCheckingResultTempFile}  /tmp/distro_check_result.attune
The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
# The commands used in this step need the package net-tools,
# which is not installed by default, so we install it first
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
    ubuntu | debian)
        apt update
        apt install -y net-tools
        ;;
    centos)
        dnf install -y net-tools
        ;;
    *)
        echo "unsupported distro"
        false # exit code 1 makes Attune stop running the job
        ;;
esac
        


echo "========================================================================="
echo "Show network interfaces(such as IP, subnet, MAC, etc.)"
echo "/usr/sbin/ifconfig"
echo "========================================================================="
# By default, normal users on Debian don't have /usr/sbin in $PATH
/usr/sbin/ifconfig
echo

echo "========================================================================="
echo "Show the routing tables(netstat -r)"
echo "========================================================================="
netstat -r
echo

echo "========================================================================="
echo "Show all sockets(netstat -apn)"
echo "========================================================================="
netstat -apn
echo

echo "========================================================================="
echo "Show content of the resolver configuration file(cat /etc/resolv.conf)"
echo "========================================================================="
[ -f /etc/resolv.conf ] && cat /etc/resolv.conf
echo

echo "========================================================================="
echo "Show statistics of network interfaces(cat /proc/net/dev)"
echo "========================================================================="
[ -f /proc/net/dev ] && cat /proc/net/dev
echo
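
The raw /proc/net/dev table is hard to read at a glance. As an illustration of how its columns can be picked apart, the sketch below prints the interface name with its received and transmitted byte counts. It assumes the usual layout of two header lines followed by one line per interface; net_dev_summary is a hypothetical name, and the optional file argument is there so the function can be run against a saved copy.

```shell
# Print "<interface> <rx bytes> <tx bytes>" for every interface.
# $1: file in /proc/net/dev format (defaults to /proc/net/dev)
net_dev_summary() {
    awk '
        NR > 2 {
            # the name may be glued to the first counter ("eth0:1234"),
            # so turn the colon into a space before reading fields
            sub(/:/, " ")
            # after the name come 8 receive counters, then the transmit ones
            print $1, $2, $10
        }
    ' "${1:-/proc/net/dev}"
}

# usage:
# net_dev_summary
```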

7 - Gather Storage Info

Show storage related info.
We append || true to each grep so that the line still exits with code 0 when grep finds nothing. Attune checks the exit code of every line of the script and stops the job if it sees a code other than the expected one (0 by default), and grep exits with code 1 when it selects no lines.

The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Report file system disk space usage(df -aTh | grep -v loop)"
echo "========================================================================="
df -aTh | grep -v loop || true
echo

echo "========================================================================="
echo "List block devices(lsblk -al | grep -v loop)"
echo "========================================================================="
lsblk -al | grep -v loop || true
echo

echo "========================================================================="
echo "Print block device attributes(blkid | grep -v loop)"
echo "========================================================================="
# By default, normal users on Debian don't have /usr/sbin in $PATH
/usr/sbin/blkid | grep -v loop || true
echo

echo "========================================================================="
echo "List active mount points(mount | grep -v loop)"
echo "========================================================================="
mount | grep -v loop || true
echo
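
The || true guard used in this step hides every grep failure, including real errors such as an unreadable input file (grep exit code 2). A slightly stricter variant, sketched below, tolerates only exit code 1, which is what grep returns when it selects no lines. The helper name grep_allow_empty is hypothetical.

```shell
# Run grep with the given arguments.
# Succeeds when grep selects lines (exit 0) or selects nothing (exit 1);
# any other grep exit code is reported as a failure.
grep_allow_empty() {
    grep "$@" || [ $? -eq 1 ]
}

# usage:
# mount | grep_allow_empty -v loop
```

As with || true, the exit code of the command on the left of the pipe is still ignored, because a pipeline reports the status of its last command unless the pipefail option is set.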

8 - Gather GPU Info

Show GPU related info.
A package may need to be installed, so Sudo To root is required.

This step has the following parameters

Name                                    Script Reference                     Default Value
Linux Distro Checking Result Temp File  {linuxDistroCheckingResultTempFile}  /tmp/distro_check_result.attune
The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Check status of NVidia GPU(nvidia-smi)"
echo "========================================================================="
# check whether an NVidia GPU is installed
if lspci -vnn | grep VGA | grep -qi NVIDIA
then
    # check if 'nvidia-smi' is installed
    if ! type nvidia-smi >/dev/null 2>&1
    then
        # nvidia-smi not found, install it
        . {linuxDistroCheckingResultTempFile} # load distro checking result
        case $DISTRO in
            ubuntu)
                apt update
                apt install -y nvidia-340
                ;;
            debian)
                # add 'non-free' archive area to sources.list
                # if there is already 'non-free', then sources.list is unmodified
                sed -i -e '/deb http/!b' -e '/non-free/b' -e 's/$/ non-free/' /etc/apt/sources.list
                apt update
                apt install -y nvidia-smi
                ;;
            centos)
                # consult https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Driver_Installation_Quickstart.pdf
                # for installation documentation
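                # the repo id is 'powertools' on CentOS 8.3 and later, 'PowerTools' before that
                dnf config-manager --set-enabled powertools || dnf config-manager --set-enabled PowerTools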
                dnf install -y epel-release
                dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
                dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
                dnf clean all
                dnf -y module install nvidia-driver:latest-dkms
                ;;
            *)
                echo "unsupported distro"
                false # exit code 1 makes Attune stop running the job
                ;;
        esac
    fi
    
    if type nvidia-smi >/dev/null 2>&1
    then
        nvidia-smi || true
    else
        echo "nvidia-smi command install failed"
    fi
else
    # 'nvidia-smi' comes with the GPU driver, which is useless without a GPU,
    # huge, and possibly harmful to system stability,
    # so we do not install the driver when no GPU is detected
    echo "No NVidia GPU found."
fi
echo

echo "========================================================================="
echo "Show OpenCL platforms and devices(clinfo)"
echo "========================================================================="
# check if 'clinfo' is installed
if ! type clinfo >/dev/null 2>&1
then
    # clinfo not found, install it
    . {linuxDistroCheckingResultTempFile} # load distro checking result
    case $DISTRO in
        ubuntu | debian)
            apt update
            apt install -y clinfo
            ;;
        centos)
            # no official package for CentOS 8
            # install the EPEL 7 rpm as a workaround
            dnf install -y ocl-icd # prerequisite
            wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
            rpm -ihv clinfo-2.1.17.02.09-1.el7.x86_64.rpm
            rm -f clinfo-2.1.17.02.09-1.el7.x86_64.rpm
            ;;
        *)
            echo "unsupported distro"
            false # exit code 1 makes Attune stop running the job
            ;;
    esac
fi
if type clinfo >/dev/null 2>&1
then
    clinfo || true
else
    echo "clinfo command install failed"
fi
echo
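
The detection in this step only looks at lspci lines that contain VGA. Some NVIDIA devices are listed by lspci under the class "3D controller" instead, so they would be missed. The sketch below widens the match to the three display-related class names; it reads lspci output from stdin so that it can be tried without the hardware. has_nvidia_gpu is a hypothetical name.

```shell
# Succeed when the lspci output on stdin lists an NVIDIA display device.
has_nvidia_gpu() {
    # keep only display-class devices, then look for the vendor name
    grep -Ei 'VGA compatible controller|3D controller|Display controller' \
        | grep -qi 'nvidia'
}

# usage:
# if lspci -nn | has_nvidia_gpu
# then
#     echo "NVIDIA GPU found"
# fi
```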

9 - Gather Running Processes and Resource Usage

Show running processes and resource usage of the system.

The connection details have changed from the last step.

Login as user on node

Connect via SSH
ssh user@hostname
This is a Bash script; make sure you run it with bash -l from a terminal session.
echo "========================================================================="
echo "Display running processes, plus memory and CPU usage info(top -b -n 1)"
echo "========================================================================="
top -b -n 1
echo

echo "========================================================================="
echo "Report a snapshot of the current processes(ps -e)"
echo "========================================================================="
ps -e
echo
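
A blueprint that builds a health report from this step may want a simple yes or no answer about system load. As a minimal sketch, the function below compares the 1-minute load average, the first field of /proc/loadavg, with a threshold. load_exceeds is a hypothetical name, the threshold is whatever the caller considers too high, and the optional file argument exists so the function can be run on a saved copy.

```shell
# Succeed when the 1-minute load average is above the threshold.
# $1: threshold, e.g. 4
# $2: file in /proc/loadavg format (defaults to /proc/loadavg)
load_exceeds() {
    awk -v limit="$1" '
        NR == 1 { above = ($1 > limit) }
        # an empty file leaves "above" unset, which counts as not exceeded
        END     { exit !above }
    ' "${2:-/proc/loadavg}"
}

# usage:
# if load_exceeds 4
# then
#     echo "1-minute load average is above 4"
# fi
```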