Buildup

Thanks to MaricxX (https://www.reddit.com/user/MaricxX/) for this photo – it demonstrates how small glitches can add up over time if they aren’t addressed quickly – or better yet, aren’t allowed to start in the first place.

Cross section of paint layers showing deformation as imperfections are magnified with each layer. (Photo credit: MaricxX on Reddit – https://www.reddit.com/user/MaricxX/)

At a previous job it was common to take our Windows virtual machine templates and power them on once a month to patch the OS and apply the latest security configurations. We had been doing the same with our Red Hat Linux images, but a couple of years ago I converted our process so that each month we built those VM templates fresh from an ISO using a HashiCorp Packer script and VMware Workstation.

This monthly fresh build ensured that we always knew how to build the VM templates in the event of a disaster, and that our build process contained exactly what we planned and advertised (through our team Git repository). As new requirements came in from the InfoSec team or other sources – concerns that could only be readily addressed during the initial build phase – we would add those steps to the Packer config file, then test and build a new template.
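For illustration, a minimal sketch of what that monthly cycle might look like from the command line – the template filename here is hypothetical, not the one we actually used:

# Pull the latest build steps from the team Git repository
git pull
# Sanity-check the Packer template before spending time on a build
packer validate rhel-template.json
# Build the template fresh from the ISO via VMware Workstation
packer build -force rhel-template.json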

With the prevalence of new worms and other highly effective infection vectors, my fear was that a piece of malware would land on the templates and then be automatically replicated each time a new system was built. And there were many times when we started the patching process only to find that a couple of the Windows templates had been left running since the previous month’s patch effort. There is no telling what might have crawled onto those unmanaged systems in the intervening time, just waiting for us to start cloning them.

While the paint analogy doesn’t map perfectly onto the IT world, there are enough correlations to make the possibility of replicating and amplifying a small defect all the more understandable. I prefer my freshly-built template with its minimal layers of paint, confident that it contains only the bits we wanted.

When is a disk space problem not a disk space problem?

A co-worker set up an Ansible playbook to update some packages but it kept erroring out. The error that Ansible reported from “yum” was “No space left on device”. He had jumped onto the system, saw that the partition had plenty of space left, and asked if I could look into it.

I got on and confirmed that when I ran a simple “yum update” it showed this:

[root@host ~]# echo n | yum update
Loaded plugins: product-id, rhnplugin, search-disabled-repos, security, subscription-manager
[Errno 28] No space left on device: '/var/run/rhsm/cert.pid'
This system is receiving updates from RHN Classic or RHN Satellite.
Could not create lock at /var/run/yum.pid: [Errno 28] No space left on device: '/var/run/yum.pid'

Hmm, still “no space left on device”. But the “df /var” output looks fine:

[root@host ~]# df /var
Filesystem           1K-blocks   Used Available Use% Mounted on
/dev/mapper/rootvg-varlv
                       2514736 914948   1468716  39% /var

Suspecting some other resource issue, I checked the inode availability using “df -i”:

[root@host ~]# df -i /var
Filesystem           Inodes  IUsed IFree IUse% Mounted on
/dev/mapper/rootvg-varlv
                     163840 163840     0  100% /var

Aha! No inodes left. I’ll let you use your favorite search engine for the details, but an easy way to think of “inodes” is as the first few pages of a book dedicated to being the “table of contents.” If a book has only a few chapters, a single page of table of contents (inodes) is enough. If it has lots of chapters and sub-chapters, you might need many pages (more inodes). By default, Unix filesystems use a formula to decide how much of the filesystem to dedicate to inodes and how much is left for actual data storage. For most systems this is fine.

To find the culprit, we look for the directories that have chewed up those ~163K inodes (files):

for i in /var/*; do echo "$i"; find "$i" | wc -l; done

This pointed to the “/var/spool/app01/” directory – it held over 160K small files. The owner of the system cleaned up some old files there and the “yum update” worked as expected.
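For reference, a rough sketch of how that kind of cleanup might be scripted – the 30-day retention window below is purely illustrative, since the actual cleanup criteria depended on the application:

# List, then delete, files older than 30 days in the offending directory
# NOTE: the -mtime +30 threshold is an assumed example, not the value actually used
find /var/spool/app01/ -type f -mtime +30 -print
find /var/spool/app01/ -type f -mtime +30 -delete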

It’s possible to override the inode settings when the filesystem is formatted, so if you know about this ahead of time you can plan for it. If you run into it after the fact, the usual resolution is to back up the data, reformat the filesystem with more inodes allocated, then restore from backup.
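As a rough illustration, here is what allocating extra inodes at format time could look like for an ext4 filesystem – the inode count below is an assumption, not a value taken from this system:

# DESTRUCTIVE - back up /var first, then unmount it
umount /var
# Reformat with an explicit inode count (-N sets the number of inodes)
mkfs.ext4 -N 500000 /dev/mapper/rootvg-varlv
# Remount (assuming an existing /etc/fstab entry) and restore the backed-up data
mount /var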

SELinux and NFS $HOME directories

Recently we re-installed a common server with RHEL-7 and that went well. But after a couple of days I noticed that I was unable to log in with my personal ssh key, even though it had worked before. It was a minor annoyance and I didn’t pursue it … until today.

It turns out that the /home/ directory on this system is an NFS mount, and in RHEL-7 we have set SELinux to default to enforcing. There is an SELinux boolean, “use_nfs_home_dirs”, that needed to be set to “1” (true). Running “setsebool -P use_nfs_home_dirs 1” on this system was the fix, and now we can resume logging in with SSH keys instead of typing in a password each time.
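If you want to check a system before changing anything, the current value of the boolean can be inspected first – a minimal sketch:

# Show the current setting (off means key-based logins into NFS home directories will fail)
getsebool use_nfs_home_dirs
# Persistently enable it (-P writes the change to policy so it survives reboots)
setsebool -P use_nfs_home_dirs 1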

Some were reluctant to fix this because they always typed in their password. While password entry over an SSH connection is encrypted, it does present the possibility that your password could be captured by a compromised endpoint, and since we are trying to use longer passwords, typing them in multiple times per day was frustrating and slowed workflow. Using SSH keys eliminates this risk and enables other features such as scheduled/scripted command execution and file transfers.
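For anyone setting up key-based logins for the first time, generating a key pair and copying the public key to the server is a short exercise – the key type and server name below are just examples:

# Generate a key pair (ed25519 is a reasonable modern choice; protect it with a passphrase)
ssh-keygen -t ed25519
# Copy the public key to your account on the remote server (hostname is hypothetical)
ssh-copy-id user@server.example.com
# Subsequent logins use the key instead of the account password
ssh user@server.example.com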