ZFS recipes

Software installation

Install the ZFS software as explained on the ZFS on Linux site. For my tests, I will use the platform created during the Redundant disks without MDRAID POC. Fedora 25 is installed there, so I followed these instructions to install ZFS.

After installation, you will have a number of disabled services:

[root@lvmraid ~]# systemctl list-unit-files | grep zfs
zfs-import-cache.service                    disabled 
zfs-import-scan.service                     disabled 
zfs-mount.service                           disabled 
zfs-share.service                           disabled 
zfs-zed.service                             disabled 
zfs.target                                  disabled

Enable a few of them. I will not use the ZFS sharing services, only mounting.

[root@lvmraid ~]# systemctl enable zfs.target
[root@lvmraid ~]# systemctl enable zfs-mount.service
[root@lvmraid ~]# systemctl enable zfs-import-cache.service
[root@lvmraid ~]# systemctl start zfs-mount.service 
Job for zfs-mount.service failed because the control process exited with error code.
See "systemctl status zfs-mount.service" and "journalctl -xe" for details.
[root@lvmraid ~]# zpool status
The ZFS modules are not loaded.
Try running '/sbin/modprobe zfs' as root to load them.
[root@lvmraid ~]# modprobe zfs
[root@lvmraid ~]# zpool status
no pools available

It looks like the services do not load the ZFS kernel module at startup. Having studied the sources, you can see that the module is loaded automatically as soon as a zpool is defined. Instead of relying on this, we will force the zfs module to load at every boot, following Fedora's recommendations:

[root@lvmraid ~]# cat > /etc/sysconfig/modules/zfs.modules << EOFcat
#!/bin/sh
exec /usr/sbin/modprobe zfs
EOFcat
[root@lvmraid ~]# chmod 755 /etc/sysconfig/modules/zfs.modules
[root@lvmraid ~]# reboot
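
NOTE: On systemd-based distributions, an equivalent alternative to the sysconfig script is a modules-load.d drop-in; a minimal sketch, not the method used here:

# echo zfs > /etc/modules-load.d/zfs.conf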

The zfs-auto-snapshot script installed below helps automate scheduled snapshot creation and retention. I highly recommend installing this tool on a production system.

[root@lvmraid ~]# wget https://github.com/zfsonlinux/zfs-auto-snapshot/archive/master.zip
[root@lvmraid ~]# unzip master.zip
[root@lvmraid ~]# cd zfs-auto-snapshot-master/
[root@lvmraid zfs-auto-snapshot-master]# make install
[root@lvmraid zfs-auto-snapshot-master]# cd
[root@lvmraid ~]# rm -f /etc/cron.d/zfs-auto-snapshot /etc/cron.hourly/zfs-auto-snapshot \
	/etc/cron.weekly/zfs-auto-snapshot /etc/cron.monthly/zfs-auto-snapshot

ZFS snapshots use redirect-on-write technology (often mistakenly called copy-on-write). The default schedule installed by these scripts creates a lot of snapshots, which in practice do not help but do consume a lot of resources: the oldest snapshot ends up holding a lot of disk space, and rotating frequent tiny snapshots takes considerable computing resources. So I keep only the daily snapshots and have deleted the rest of the schedule.
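
The zfs-auto-snapshot scripts honor the com.sun:auto-snapshot user property, so individual datasets can be excluded from (or added to) the schedule; a sketch, with example dataset names:

# zfs set com.sun:auto-snapshot=false export/scratch        # never snapshot this dataset
# zfs set com.sun:auto-snapshot:daily=true export/data      # per-label override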

Creating a pool

The ZFS pool is the place where file systems (or volumes) are created. The pool spreads data across the physical disks and takes care of redundancy. Although you can create a pool without any redundancy, this is not common. We will create a RAID5-like (raidz) pool from the third partition of each of our disks.

NOTE: If you have many disks and want to control the size of each raid group, simply repeat the keyword "raidz" after every group of the desired number of disks; each "raidz" starts a new group (see the raidz2 example below).

NOTE: Almost always use the option -o ashift=12, which sets a 4 KiB I/O block size for the pool. The default value of 9 corresponds to a 512-byte block size and can cause serious performance degradation.
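
Once a pool exists, you can verify which ashift is actually in effect. One way is to let zdb dump the cached pool configuration; a sketch, assuming the pool is present in /etc/zfs/zpool.cache:

# zdb -C export | grep ashift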

I used the -m none option so that the pool itself is not mounted. "export" is the name of the created pool; I plan to mount its filesystems under the /export hierarchy, hence the name.

[root@lvmraid ~]# zpool create -o ashift=12 -m none export raidz /dev/vd?3
[root@lvmraid ~]# zpool list export -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
export  11.9T   412K  11.9T         -     0%     0%  1.00x  ONLINE  -
  raidz1  11.9T   412K  11.9T         -     0%     0%
    vda3      -      -      -         -      -      -
    vdb3      -      -      -         -      -      -
    vdc3      -      -      -         -      -      -
    vdd3      -      -      -         -      -      -
[root@lvmraid ~]# zpool status export -v
  pool: export
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool history
History for 'export':
2017-06-25.17:11:49 zpool create -o ashift=12 -m none export raidz /dev/vda3 /dev/vdb3 /dev/vdc3 /dev/vdd3
[root@lvmraid ~]# 

The last command is very useful if you rarely deal with ZFS or share this duty with someone else. Another useful command is iostat for a zpool:

[root@lvmraid ~]# zpool iostat export -v 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
export       412K  11.9T      0      0      0    470
  raidz1     412K  11.9T      0      0      0    470
    vda3        -      -      0      0     17  1.20K
    vdb3        -      -      0      0     17  1.19K
    vdc3        -      -      0      0     17  1.20K
    vdd3        -      -      0      0     17  1.20K
----------  -----  -----  -----  -----  -----  -----

You can check the version of ZFS and the enabled features with the zpool upgrade -v command.
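
To check which kernel module version is actually loaded, you can also ask the module itself; a quick sketch, with paths as provided by ZFS on Linux:

# cat /sys/module/zfs/version
# modinfo zfs | grep -i ^version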

Here are more examples of pool creation taken from other sources. This is an example of a RAID10 pool with a hot spare:

# zpool create -o ashift=12 -m none pool10 mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf spare /dev/sdg
# zpool status pool10
  pool: pool10
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool10      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
        spares
          sdg       AVAIL   

errors: No known data errors

When you have many disks, combining them all into one raid group can hurt performance. It is wise to split them into smaller groups. This is an example of a ZFS pool built from three raid groups with double parity (6 data + 2 parity disks each) and no spare disks.

# zpool create -o ashift=12 -m none internal \
	raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj \
	raidz2 /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr \
	raidz2 /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz 
# zpool status internal
  pool: internal
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        internal    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdn     ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
          raidz2-2  ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0
            sdw     ONLINE       0     0     0
            sdx     ONLINE       0     0     0
            sdy     ONLINE       0     0     0
            sdz     ONLINE       0     0     0

errors: No known data errors

Growing a zpool online by disk resizing

When a ZFS pool is built from virtual disks, files or other storage LUNs, it is possible to resize the pool online by increasing the underlying disk size. Set the pool option autoexpand to "on" (it is off by default) and make zpool rescan the disks online:

# zpool set autoexpand=on zpoolname
# zpool get autoexpand zpoolname
NAME       PROPERTY    VALUE   SOURCE
zpoolname  autoexpand  on      local
# zpool online zpoolname /dev/datavg/zfs
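
For an LVM-backed vdev the whole sequence might look like the sketch below (the LV path and size are examples; the -e flag of zpool online forces the expansion even when autoexpand is off):

# lvresize -L+100g /dev/datavg/zfs            # grow the underlying logical volume
# zpool online -e zpoolname /dev/datavg/zfs   # rescan the device and expand the vdev
# zpool list zpoolname                        # verify the new SIZE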

Adding log and cache to ZFS

ZFS performance depends on the pool structure. A large raidz group usually performs poorly, because almost any I/O causes a full stripe to be read or rewritten. Adding a log and a cache on a fast dedicated device can improve performance.

# lvcreate -n zfslog -L128m rootvg
# lvcreate -n zfscache -L128g rootvg
# zpool add z1 log /dev/rootvg/zfslog
# zpool add z1 cache /dev/rootvg/zfscache

It is possible to add, remove or resize the cache device without interruption:

# zpool remove z1 /dev/rootvg/zfscache
# lvresize -L256g /dev/rootvg/zfscache
# zpool add z1 cache /dev/rootvg/zfscache
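
You can verify that the log and cache devices were picked up; both commands list them under separate "logs" and "cache" sections:

# zpool status z1
# zpool iostat -v z1 5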

Creating filesystems

[root@lvmraid ~]# zfs create -o mountpoint=/export/data export/data
[root@lvmraid ~]# df -hP
Filesystem                Size  Used Avail Use% Mounted on
 ..
export/data               8.7T  128K  8.7T   1% /export/data

This is an example of creating a simple file system. The file system is mounted automatically when the mountpoint option is specified. If you want to mount a ZFS filesystem manually using the mount command, you can set the so-called "legacy" mountpoint, like this:

# zfs set mountpoint=legacy export/data
# mount.zfs export/data /mnt
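
With a legacy mountpoint the filesystem can also be mounted from /etc/fstab like any other filesystem; a sketch of such an entry (the mount point is an example):

export/data  /mnt/data  zfs  defaults  0 0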

Another useful Linux-related filesystem option is -o acltype=posixacl -o xattr=sa, which adds support for POSIX ACLs and stores the extended attributes so that they are read and written within the same I/O.
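
These options can also be set on an existing fileset; a sketch:

# zfs set acltype=posixacl export/data
# zfs set xattr=sa export/data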

You can check the attributes using the zfs get command:

# zfs get all export/data
NAME                PROPERTY              VALUE                  SOURCE
 ..

Creating an encrypted filesystem

Check whether encryption is supported and enabled for your zpool:

# zpool get feature@encryption zpoolname
NAME       PROPERTY            VALUE               SOURCE
zpoolname  feature@encryption  active              local

If it is supported but disabled, enable it:

# zpool set feature@encryption=enabled zpoolname

Finally, create the encrypted fileset:

# zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt -o mountpoint=/mnt/home zpoolname/home
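
After a reboot the passphrase is not loaded automatically with keylocation=prompt; a sketch of bringing the encrypted fileset back, using the standard OpenZFS commands:

# zfs load-key zpoolname/home      # prompts for the passphrase
# zfs mount zpoolname/home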

Working with snapshots

Let's create the very first snapshot of this FS. It is totally useless, because there is no data in the FS yet, but it will help demonstrate how snapshots use disk space.

[root@lvmraid ~]# zfs snap export/data@initial

In the command above, "initial" is the desired snapshot name and "export/data" is the filesystem being snapshotted.

Now, let's copy some data into the FS, then take another snapshot:

[root@lvmraid ~]# rsync -a /etc /export/data/
[root@lvmraid ~]# df -hP /export/data/
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs snap export/data@etc_copied
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     22.4K   17.4M              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -      0         -       -              -          -

As you can see, the new data (18M according to "df") does not affect the disk usage of the snapshots. A snapshot only holds data that has since been deleted or overwritten. Let's delete something to demonstrate.

[root@lvmraid ~]# rm -rf /export/data/etc
[root@lvmraid ~]# df -hP /export/data
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T  128K  8.7T   1% /export/data
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     17.4M   25.4K              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -  17.4M         -       -              -          -

The amount of deleted data is subtracted from the data part of the FS (as shown by "df") and is added to the snapshot usage ("USEDSNAP" column). A more detailed listing shows that this space belongs to the snapshot "etc_copied". The initial snapshot still uses almost zero space, because the deleted data did not yet exist when that snapshot was created.

Reverting ZFS to a snapshot

You can roll back the whole FS only to the latest snapshot. If you want to revert to an older snapshot, you have to remove the newer snapshots before the rollback.

[root@lvmraid ~]# zfs rollback export/data@etc_copied
[root@lvmraid ~]# df -hP /export/data
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     23.9K   17.4M              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -  1.50K         -       -              -          -

The FS was reverted, and the snapshot's disk usage turned back into data usage.
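
If you wanted to revert past the latest snapshot, for example back to "initial", the -r flag of zfs rollback destroys the newer snapshots along the way; a sketch, not executed here:

# zfs rollback -r export/data@initial      # would also destroy export/data@etc_copied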

Another great command when working with ZFS snapshots:

[root@lvmraid ~]# rm /export/data/etc/passwd
rm: remove regular file '/export/data/etc/passwd'? y
[root@lvmraid ~]# zfs snap export/data@passwd_removed
[root@lvmraid ~]# zfs diff export/data@etc_copied export/data@passwd_removed
M       /export/data/etc
-       /export/data/etc/passwd
-       /export/data/etc/passwd/<xattrdir>
-       /export/data/etc/passwd/<xattrdir>/security.selinux

The output above speaks for itself. zfs diff is a very nice command!

Let's restore one file from the snapshot, copying it back:

[root@lvmraid ~]# cd /export/data/.zfs/snapshot
[root@lvmraid snapshot]# ll
total 0
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 etc_copied
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 initial
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 passwd_removed
[root@lvmraid snapshot]# rsync -av etc_copied/etc/passwd /export/data/etc/
sending incremental file list
passwd

sent 1,182 bytes  received 35 bytes  2,434.00 bytes/sec
total size is 1,090  speedup is 0.90
[root@lvmraid snapshot]# df -hP
Filesystem                Size  Used Avail Use% Mounted on
 ..
export/data               8.7T   18M  8.7T   1% /export/data
export/data@etc_copied    8.7T   18M  8.7T   1% /export/data/.zfs/snapshot/etc_copied
[root@lvmraid snapshot]# zfs diff export/data@etc_copied 
M       /export/data/etc
-       /export/data/etc/passwd
-       /export/data/etc/passwd/<xattrdir>
-       /export/data/etc/passwd/<xattrdir>/security.selinux
+       /export/data/etc/passwd
+       /export/data/etc/passwd/<xattrdir>
+       /export/data/etc/passwd/<xattrdir>/security.selinux

The hidden .zfs directory automatically mounts the required snapshot for you, and then you can copy a single file from there. The "zfs diff" command proves that this is not a real revert: the snapshot still contains the deleted data blocks, and the file (with exactly the same name and metadata) was newly created in new data blocks of the FS.
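
NOTE: By default the .zfs directory is hidden from directory listings. It can be made visible (or hidden again) with the snapdir property; a sketch:

# zfs get snapdir export/data
# zfs set snapdir=visible export/data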

Working with clones

First, we will find all the snapshots belonging to the FS that we need to clone.

TIP: The zfs list command can be very slow on a loaded system. A much faster way to check the names of snapshots is to list the .zfs/snapshot pseudo-directory.

[root@lvmraid ~]# zfs list -r export/data -t snapshot
NAME                         USED  AVAIL  REFER  MOUNTPOINT
export/data@initial         22.4K      -  25.4K  -
export/data@etc_copied      35.2K      -  17.4M  -
export/data@passwd_removed  28.4K      -  17.4M  -
[root@lvmraid ~]# zfs clone -o mountpoint=/clone/data export/data@etc_copied export/data_clone
[root@lvmraid ~]# df -hP
 ..
export/data               8.7T   18M  8.7T   1% /export/data
export/data_clone         8.7T   18M  8.7T   1% /clone/data

The clone was created using the snapshot as its basis. And, of course, you can mount it somewhere else.

It is not always easy to tell which dataset is a clone and which snapshot is its base. Here is one way to find out the truth:

[root@lvmraid ~]# zfs list -o name,origin,clones -r -t snapshot export/data
NAME                        ORIGIN  CLONES
export/data@initial         -       
export/data@etc_copied      -       export/data_clone
export/data@passwd_removed  -       
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone
NAME               ORIGIN                  CLONES
export/data_clone  export/data@etc_copied  -

ZFS has an interesting feature that I will demonstrate here:

[root@lvmraid ~]# zfs destroy -r export/data
cannot destroy 'export/data': filesystem has dependent clones
use '-R' to destroy the following datasets:
export/data_clone
[root@lvmraid ~]# zfs promote export/data_clone
[root@lvmraid ~]# zfs list -o name,origin,clones export/data
NAME         ORIGIN                        CLONES
export/data  export/data_clone@etc_copied  -
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone -r -t snapshot
NAME                          ORIGIN  CLONES
export/data_clone@initial     -       
export/data_clone@etc_copied  -       export/data

As a result of the promote command, the clone and its base switched roles, and the clone inherited all the previous snapshots. Now it is possible to remove the original FS:

[root@lvmraid ~]# zfs destroy -r export/data
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone -r -t snapshot
NAME                          ORIGIN  CLONES
export/data_clone@initial     -       
export/data_clone@etc_copied  - 

Remote replication

We will use the same ZFS system as both the origin and the target. Therefore, the sending process is simply piped into the receiving process. You can use another ZFS system for data replication over the network: SSH can serve as the channel if you want additional protection on the wire, or netcat if you want maximum copy efficiency.
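
For reference, replication between two hosts might look like the sketch below (the host "backuphost" and the pool "backup" are placeholders; for a netcat transport you would start a listener piping into zfs recv on the target and pipe zfs send into nc on the source):

# zfs send export/data_clone@etc_copied | ssh backuphost zfs recv -v backup/data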

First we need to select the desired snapshot to start with:

[root@lvmraid ~]# zfs list -r -t snapshot export/data_clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
export/data_clone@initial     22.4K      -  25.4K  -
export/data_clone@etc_copied  22.4K      -  17.4M  -
[root@lvmraid ~]# zfs send -R export/data_clone@etc_copied | zfs recv -v export/data
receiving full stream of export/data_clone@initial into export/data@initial
received 39.9KB stream in 1 seconds (39.9KB/sec)
receiving incremental stream of export/data_clone@etc_copied into export/data@etc_copied
received 20.0MB stream in 1 seconds (20.0MB/sec)
cannot mount '/clone/data': directory is not empty
[root@lvmraid ~]# zfs set mountpoint=/export/data export/data
[root@lvmraid ~]# zfs mount export/data
[root@lvmraid ~]# df -hP
 ..
export/data_clone         8.7T   18M  8.7T   1% /clone/data
export/data               8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs list -r -t snapshot export/data
NAME                     USED  AVAIL  REFER  MOUNTPOINT
export/data@initial     22.4K      -  25.4K  -
export/data@etc_copied      0      -  17.4M  -

The mount point was still occupied by the original FS, so I changed it to another one, and the mount succeeded.

As you can see, the FS is copied completely, including the contents of its snapshots. This is a good way to transfer data from one ZFS system to another. Incremental copies are also supported.
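
A minimal incremental transfer sketch, assuming the target already holds the older snapshot but not the newer one:

# zfs send -i export/data_clone@initial export/data_clone@etc_copied | zfs recv export/data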

Advanced replication

Replication to a remote server can be piped over SSH, which, besides encrypting the traffic, also solves the authentication and permission problems. For example, you could have a user called prodzfs on your DR storage that the root user can ssh to without a password, using an authorized key. The replication script can then include something similar to:

 ..
# Remote zfs command:
ZFS=" ssh prodzfs@dr-storage sudo zfs"

Give it sudo permissions, of course.
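
A minimal sudoers sketch for the DR side, assuming the zfs binary lives in /usr/sbin there (adjust the path for your distribution):

# cat > /etc/sudoers.d/prodzfs << 'EOF'
prodzfs ALL=(root) NOPASSWD: /usr/sbin/zfs
EOF
# chmod 440 /etc/sudoers.d/prodzfs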

When initializing replication of big filesystems, an interruption can happen: a network disconnect, power failure, etc. Always use the -s option when initializing big filesystems for replication. This option allows the replication to resume from the moment of failure instead of starting over. Let's see an example:

# zfs send z1/iso@backupScript_202008282355 | $ZFS receive -s -F -u dr/iso
^C
# $ZFS get receive_resume_token dr/iso
NAME          PROPERTY              VALUE                                 SOURCE
internal/iso  receive_resume_token  1-f69e398fb-d0-2KilometerLongLine     -
# zfs send -t 1-f69e398fb-d0-2KilometerLongLine | $ZFS receive -s dr/iso@backupScript_202008282355

Once an interruption occurs (simulated here by pressing Ctrl-C), you have to get the resume token from the remote site. Note that $ZFS here implements the tip shown above. The operation can then be resumed by referencing this token. However, the receive-side syntax should still refer to the originating snapshot, probably because this information is missing from the input stream.

Here is an example of an incremental update of the remote site:

prodstorage:~ # cat /root/bin/zfs-resync.sh
#!/bin/bash
# 3 3 * * * [ -x /root/bin/zfs-resync.sh ] && /root/bin/zfs-resync.sh |& logger -t zfs-resync

# Local pool:
LP=z1
# Remote pool:
RP="dr"
# Remote zfs command:
ZFS=" ssh prodzfs@drstorage sudo zfs"
# Filesystems to resync:
FS=" home iso public www "

for F in $FS ; do
        # get last (in list, not in time) remote snap name:
        FROMSNAP=$($ZFS list -H -t snapshot -o name -r ${RP}/${F} | awk -F"@" 'END {print $2}')
        [ -z "$FROMSNAP" ] && continue
        # get last (in list, not in time) local snap name:
        TOSNAP=$(/usr/sbin/zfs list -H -t snapshot -o name -r ${LP}/${F} | awk -F"@" 'END {print $2}')
        [ -z "$TOSNAP" ] && continue
        [ "$FROMSNAP" = "$TOSNAP" ] && continue
        # send incremental stream:
        echo "/usr/sbin/zfs send -R -I ${LP}/${F}@${FROMSNAP} ${LP}/${F}@${TOSNAP} | $ZFS receive -u ${RP}/${F}"
        /usr/sbin/zfs send -R -I ${LP}/${F}@${FROMSNAP} ${LP}/${F}@${TOSNAP} | $ZFS receive -u ${RP}/${F}
done

Used space analysis

Global overview

# zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
noname   299G   216G  82.7G        -         -    58%    72%  1.00x    ONLINE  -

The most interesting value is "CAP". When it exceeds 85%, overall ZFS performance drops. The other value, "FRAG", is for informational purposes only. It is not about data fragmentation, which is impossible due to the ZFS philosophy. It indicates the fragmentation of the free space and generally reflects the usage and maturity of the zpool. There is nothing you can do about this value; the only way is to replicate the data to another pool and recreate the original one. However, after a while the value will drift back, because it reflects the usage pattern.
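
To watch just these values, zpool list accepts a property list; a sketch, with property names as in zpool(8):

# zpool list -o name,capacity,fragmentation,health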

You can see a slightly more detailed view with the -v option:

# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
internal    26.2T  19.4T  6.79T        -         -    10%    74%  1.00x    ONLINE  -
  raidz2-0  8.72T  6.45T  2.27T        -         -    10%  74.0%      -    ONLINE
    sdb     1.09T      -      -        -         -      -      -      -    ONLINE
    sdc     1.09T      -      -        -         -      -      -      -    ONLINE
    sdd     1.09T      -      -        -         -      -      -      -    ONLINE
    sde     1.09T      -      -        -         -      -      -      -    ONLINE
    sdf     1.09T      -      -        -         -      -      -      -    ONLINE
    sdg     1.09T      -      -        -         -      -      -      -    ONLINE
    sdh     1.09T      -      -        -         -      -      -      -    ONLINE
    sdi     1.09T      -      -        -         -      -      -      -    ONLINE
  raidz2-1  8.72T  6.47T  2.25T        -         -    10%  74.2%      -    ONLINE
    sdj     1.09T      -      -        -         -      -      -      -    ONLINE
    sdk     1.09T      -      -        -         -      -      -      -    ONLINE
    sdl     1.09T      -      -        -         -      -      -      -    ONLINE
    sdm     1.09T      -      -        -         -      -      -      -    ONLINE
    sdn     1.09T      -      -        -         -      -      -      -    ONLINE
    sdo     1.09T      -      -        -         -      -      -      -    ONLINE
    sdp     1.09T      -      -        -         -      -      -      -    ONLINE
    sdq     1.09T      -      -        -         -      -      -      -    ONLINE
  raidz2-2  8.72T  6.45T  2.27T        -         -    11%  74.0%      -    ONLINE
    sdr     1.09T      -      -        -         -      -      -      -    ONLINE
    sds     1.09T      -      -        -         -      -      -      -    ONLINE
    sdt     1.09T      -      -        -         -      -      -      -    ONLINE
    sdu     1.09T      -      -        -         -      -      -      -    ONLINE
    sdv     1.09T      -      -        -         -      -      -      -    ONLINE
    sdw     1.09T      -      -        -         -      -      -      -    ONLINE
    sdx     1.09T      -      -        -         -      -      -      -    ONLINE
    sdy     1.09T      -      -        -         -      -      -      -    ONLINE

This output adds details about the RAIDZ groups. However, there is no information about how full each disk is.

Let's drill down into how the zpool is used by its filesets.

# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
noname               216G  73.3G    24K  none
noname/SuseAcademy  28.3G  73.3G   896M  /mnt/SuseAcademy
noname/home          179G  73.3G  87.4G  /home

This is a quick overview of the filesets and the space they use. You must read it correctly: the "REFER" value shows how much space the current data occupies, while "USED" is the total consumption, including snapshots and children on top of the data itself.

Let's add more details:

# zfs list noname/SuseAcademy  -o space
NAME                AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
noname/SuseAcademy  73.3G  28.3G     27.5G    896M             0B         0B

As mentioned above, most of the space is taken up by snapshots ("USEDSNAP").

Let's find the snapshots that take up this space.

# zfs list noname/SuseAcademy -r -t all
NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
noname/SuseAcademy                                         28.3G  73.3G   896M  /mnt/SuseAcademy
 ..
noname/SuseAcademy@zfs-auto-snap_daily-2023-09-03-1621        0B      -  28.3G  -
noname/SuseAcademy@zfs-auto-snap_daily-2023-09-04-1329        0B      -  28.3G  -
noname/SuseAcademy@zfs-auto-snap_daily-2023-09-05-0650        0B      -   896M  -
noname/SuseAcademy@zfs-auto-snap_daily-2023-09-06-0617        0B      -   896M  -
 ..

I removed several similar lines above and below the data change. The output shows that the deleted data does not appear in the snapshots' "USED" value; it shows up in "REFER". Around this time I deleted about 28 GB of outdated VM images, after which REFER decreased. If you want to see exactly what has changed, you can use the zfs diff command.

Testing redundancy

Let's remove one disk. ZFS does not detect a problem until it accesses the disks, and then shows:

[root@lvmraid ~]# zpool status
  pool: export
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    UNAVAIL      1   432     0  corrupted data
            vdd3    ONLINE       0     0     0

errors: No known data errors

Now reconnect the disconnected disk. ZFS does not see the reconnected disk; it probably has to be rescanned somehow. I was too lazy to read the manual, so I just rebooted the server. Everything returned to normal:

[root@lvmraid ~]# zpool status
  pool: export
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0    95
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool clear export
[root@lvmraid ~]# zpool status
  pool: export
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors
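
NOTE: Instead of rebooting, a reattached device can usually be brought back without downtime; a sketch of what would likely have worked here, not tested in this POC:

# zpool online export vdc3    # reattach the device to the pool
# zpool clear export          # reset the error counters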

Now, the hard part. I'm going to replace the disk with an empty one. First, we need to copy the partition table from one of the other disks, as described in Redundant disks without MDRAID.
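
One common way to clone a GPT layout is sgdisk; a sketch using this example's device names (not necessarily the exact procedure from that article): -R replicates vda's partition table onto the new vde, and -G then randomizes its GUIDs.

# sgdisk -R /dev/vde /dev/vda
# sgdisk -G /dev/vde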

Then fix ZFS by replacing the device:

[root@lvmraid ~]# zpool status 
  pool: export
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    UNAVAIL      1   333     0  corrupted data
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool replace export /dev/vdc3 /dev/vde3
[root@lvmraid ~]# zpool status 
  pool: export
 state: ONLINE
  scan: resilvered 21.9M in 0h0m with 0 errors on Mon Jun 26 17:35:56 2017
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vde3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors

The hard part turns out to be a piece of cake.


Updated on Sat Sep 5 17:57:12 IDT 2020