RHEL 6 and building large (>2TB) drives/partitions

Originally published 2020

Recently my team needed to rebuild a number of servers to increase capacity.  The original server cluster was physically at capacity (drive space), we were re-using x86_64 servers until the new proper installation with the upgraded software was available.  And we couldn’t wait for the new installation as it was months out but the current server farm was going to exceed capacity in a few weeks.

Since the new servers had to match the old systems, we were limited to installing RedHat Enterprise Linux 6 (RHEL-6) instead of moving to the more modern RHEL-7.  And to further complicate things, not all of these upgrade systems had hard drives of sufficient capacity.  BUT, we were also given an older SAN that had plenty of capacity and we had compatible cards to connect the servers to it.

With that mixture of equipment we set out to install.


Thankfully the systems did have a small (300GB) drive we could use to install the operating system to.  Our base RHEL-6 image installed easily, and we got the SAN connections configured and things were looking good.  Or so we thought.

The SAN team found that they were unable to present anything but a single 30 TiB (terabyte) drive to each system.  That met the storage requirements for raw capacity, but presenting it as other formats was eluding us for some reason.

The next issue we ran into is that the software required (or at least the software vendor only approved) the use of the “ext4” file system, not the newer “xfs” filesystem that Linux and RHEL-6 provide.

Not an issue…   Until you realize that ext4 on RHEL-6 on x86_64 is limited to 16TiB per partition. Additionally, RHEL 6 only supports these large drives in GPT disk format mode (not MBR which has a 2TB disk limit).

Knowing what to do.

The first step in automating a process is ensuring you understand what specifically is needed – after-all, you can’t automate what you don’t understand.  Even though the steps were extremely simple, there was still room for human error, but an even bigger worry was the chance that over time people executing these manual tasks would start to accidentally skip steps or take shortcuts or other time saving measures that started to make the newest systems look different from the original ones.

But more on that, first lets look at what we knew we had to do on each system.

Step 1 – Clean up old partitions and create two new that are <= 16TiB.

sgdisk -g --clear /dev/sdXX
sgdisk --new 1:2048:34359738368 --new 2:34359740416:64403034568 /dev/sXX

This will clean up the old mount points and create two partitions, one 16TB, the other 14TB.  We could have done two 15TiB partitions too. Or three 10TiB.  You get the picture.

Step 2 – Format the filesystems with the necessary filesystem options.

mkfs.ext4 /dev/sdXX -E lazy_itable_init
tune2fs -m0 /dev/sdXXX

Formatting a large partition can take quite a while so we added the “-E lazy_itable_init” flag.  This performs the basic formatting needed to make the partition available, but it defers the actual formatting of all the sectors until they are first used.  This flag speeds up the initial setup of the servers with only a minor delay upon the first write is spread out over time and is only an impact during the first write to that sector.

Step 3 – Mount them at the proper location

There were not a lot of special mount point requirements so to keep their usage easier by the developers, they mirrored the initial systems using the “/mounts/” root mount point.

mkdir -p /mounts/data##
mount -t ext4 /dev/sdXX /mounts/data##

Step 4 – Setup the persistent mount in /etc/fstab

Finally to make sure these partitions return after reboot, we added lines to the /etc/fstab for each partition.

/dev/sdXX  /mounts/data##  ext4  noatime,nodiratime,nobarrier,nofail,rw  0 2

The specific parameters such as “noatime” and “nodiratime” were specifically requested by the vendor to improve system performance by reducing filesystem updates to the metadata about the files which didn’t provide much value.  And since the data within these partitions is replicated across multiple machines in the cluster, the “filesystem dump” flag (the “0”) was set so the backup system knew to ignore this filesystem.  (We actually use a much more modern tool to backup systems instead of “dump” but this was added for consistency.)

Rinse and repeat…

Now that we knew what to do, it was easy to see that there were many possible points of failure if we had humans doing these across dozens of servers, especially when we were expecting this cluster to grow again if the new cluster deployment was further delayed.

To address all this, we build an Ansible playbook to automate all these steps.  We designed it from the start to be flexible so if the process proved useful in the new server deployments it would be minimally challenging to re-use the code.

We started by setting up an inventory file and defining the drive letters each system presented.


data_drives_01={ "drive": "/dev/sdb", "part": "/dev/sdb1", "mount": "/mounts/data01", "fstype": "ext4"}data_drives_02={ "drive": "/dev/sdb", "part": "/dev/sdb2", "mount": "/mounts/data02", "fstype": "ext4"}

There are better ways to define this inventor`y that would have been more flexible over time, but this is what we used and the need to rewrite the inventory file and associated pieces of the Ansible playbook weren’t judged as necessary at this time.

That inventory file defines the new data nodes, “srv01” through “srv25”, and for each of them they define the drives, the partitions, the filesystem format, and the mount points.

From that inventory file we setup this playbook.

- name: Create the data partitions
    label: gpt
    device: "{{ lookup('vars', item).drive }}"
    name: "{{ lookup('vars', item).mount }}"
    number: 1 # Probably need to make this dynamic later.
    state: present
    unit: GiB
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"

- name: Check for partition
    path: "{{ lookup('vars', item).part }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"

- name: Setup Hadoop data filesystems if necessary
    fstype: "{{ lookup('vars', item).fstype }}"
    dev: "{{ lookup('vars', item).part }}"
    opts: "{{ sys_data_drive_opts }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"

- name: Tune Hadoop data filesystem
  command: tune2fs -m0 {{ lookup('vars', item).part }}
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"
  changed_when: false

- name: Setup mount for data filesystems
    name: "{{ lookup('vars', item).mount }}"
    src: "{{ lookup('vars', item).part }}"
    fstype: "{{ lookup('vars', item).fstype }}"
    opts: "noatime,nodiratime,nobarrier,nofail,rw"
    state: mounted
    boot: yes
loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"

- name: "Build mount point with permissions"
    path: "{{ lookup('vars', item).mount }}"
    owner: root
    group: root
    state: directory
    mode: "{{ lookup('vars', item).perms | default() }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*
) | list | sort }}"


Ansible, Check_mode, and Async plays.

Have a task in your Ansible playbook that takes a long time to run, say a very large package installation or download across a slow network link? Depending on how long it takes, Ansible may think the command has failed and fail at that point in the playbook.

Async and Poll in a nutshell

The standard way to do this is to use the Ansible async: and poll: flags. The documentation isn’t really clear on this, so here’s how I think of their actions:

  • The async: B flag says “Run this command in the background for B seconds….”
  • The poll: P flag means “…and check the status every P seconds.”

Thus, a command like this:

- name: Download a big file
        "wget -O /tmp/my_big_file.iso" 
      async: 120
      poll: 5

(Yes, I know there are more Ansible-friendly ways to download a file from a remote URL, but go along with the example…)

So on a good day when the download speeds are high, it might download and Ansible will continue on. On days when the Internet connection is slow, this tool will kick off the wget command, and every 5 seconds it will check if the command is done. When it completes, the playbook goes on. If the wget fails (network error, disk write, etc), or the command takes longer than 120 seconds, Ansible will fail this step as expected.

That’s all well and good. What’s the catch?

Check mode

One feature I love about Ansible is the --check mode option. A well written Ansible module will run in --check mode and do everything it can to validate that it will execute on the managed systems without making any changes to the remote system. This is key when you’re working on a playbook to maintain production systems.

Say you know that a configuration file needs a correction applied. You take the playbook you used to build the system originally, check it out of your source control to a new branch and modify the playbook.

But a cautious developer will check that the playbook runs as expected and doesn’t do anything else unexpected (reboot the server, stop services, fail mid-way through, etc.). To do this, run your playbook with the --check flag. The output looks identical to when it is run normally, but this time the lines that are changed: are actually not changes, rather telling you that this play would make a change.

Some commands are inherently un-safe for Ansible to generically run them, tasks such as shell:, command:, and others more “raw” command have this limitation. Ansible tries to make sure that a command run in check mode will make no changes whatsoever.

The check mode execution is handy when combined with the --diff command line flag, but that’s a story for another day.

Async and Check mode

So, using these together makes sense. I want to download a large file over an occasionally slow link but I do not want the download to run when I’m in check mode. You’d think something like the example code from above would be the correct combination:

- name: Download a big file
        "wget -O /tmp/my_big_file.iso" 
      async: 120
      poll: 5

But when you run it with the --check flag, you get this error:

TASK [Download a big file] ***************************
task path: ./playbook.yml:71
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "check mode and async cannot be used on same task."

What to do?

I have to admit, I didn’t think up this workaround – a Mr. Alex Dzyoba documented it on his blog and I came across it here:

What he documents is using the ansible_check_mode variable, then set the async: value to 0 if we’re in check mode, or 120 if we are not. Using our play above we would do this:

- name: Download a big file
        "wget -O /tmp/my_big_file.iso" 
      async: "{{ ansible_check_mode | ternary(0,120) }}"
      poll: 5

What ends up happening is based on the ansible_check_mode variable:

  • If we are running in check mode (i.e. ansible_check_mode is true), then the value passed to async: is zero (the first value in the ternary() call, and Ansible doesn’t complain about the conflict.
  • When we are running in normal mode (i.e. ansible_check_mode is false), then the value passed to async: is the second value in the ternary() call, and the play will run for 120 seconds.

Why Ansible doesn’t automatically handle this is beyond me, but I’m glad to have come across Mr. Alex Dzyoba website and this method.