This post is just a quick reminder about what the state of a ZFS pool actually means. The status of a ZFS pool can be determined using the command zpool status, which produces output like the example shown below.
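For illustration, the output looks along these lines (the pool name "tank" here is hypothetical; the device names match the scenario described later in this post):

```
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0

errors: No known data errors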
You can see that the state of the pool is declared as "ONLINE". According to the zpool(8) FreeBSD Man Page, "an online pool has all devices operating normally". The lesson that I recently learned is that this is only a partial picture. The state of a ZFS pool describes its ability to provide the data on it right now - it says nothing about the pool's ability to be re-silvered successfully or its tolerance to a device failure. I recently had to re-silver one of my two NAS boxes and I suffered some data loss during the re-silver. I was able to recover the data by a sync from the other NAS, but it was still an important lesson for me. What follows is an explanation of how this happened.
This is the scenario that I experienced:
- A ZFS pool is operating as raidz1 with five devices making up the pool: ada1; ada2; ada3; ada4; and ada5.
- ada1 and ada2 are old devices that are slowly failing and they need to be replaced.
- There are two large files that are spread across all five devices: fileA; and fileB.
- There is an error in the part of fileA that is stored on ada1.
- There is an error in the part of fileB that is stored on ada2.
In this scenario, zpool status still describes the pool as "ONLINE" because there is enough redundancy in the pool to provide both fileA and fileB error-free to the user. However, if ada1 is unavailable (either because it is removed for a re-silver or because it failed completely) then the good part of fileB that is stored on it is lost, and there is no longer enough information on the remaining four devices for ZFS to provide fileB error-free to the user. Similarly, if ada2 is unavailable (either because it is removed for a re-silver or because it failed completely) then the good part of fileA that is stored on it is lost, and there is no longer enough information on the remaining four devices for ZFS to provide fileA error-free to the user.
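The failure mode above can be sketched with simple XOR parity. This is a deliberate simplification of raidz1 (real raidz1 uses variable-width stripes and Reed-Solomon-style parity), but single-device reconstruction works the same way: a missing piece is rebuilt from the survivors plus parity, so a latent error on a second device makes error-free reconstruction impossible. The stripe contents below are made up for illustration.

```python
from functools import reduce
from operator import xor

def parity(pieces):
    """XOR parity over a list of equal-length byte strings."""
    return bytes(reduce(xor, col) for col in zip(*pieces))

def reconstruct(survivors, par):
    """Rebuild the single missing piece from the surviving pieces and parity."""
    return parity(survivors + [par])

# Hypothetical stripe of fileB spread across ada1..ada4, with parity on ada5.
pieces = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # data on ada1..ada4
par = parity(pieces)                           # parity on ada5

# Healthy case: ada1 is removed for a re-silver; its piece is rebuilt
# correctly from ada2, ada3, ada4 and the parity on ada5.
rebuilt = reconstruct(pieces[1:], par)
assert rebuilt == b"AAAA"

# Failure case: ada2 carries a latent error (a single flipped bit) that
# zpool status has not yet noticed.  Remove ada1 again and the rebuild
# no longer yields ada1's data: two bad pieces, only one parity.
damaged = bytearray(pieces[1])
damaged[0] ^= 0x01                             # latent error on ada2
survivors = [bytes(damaged)] + pieces[2:]
bad_rebuild = reconstruct(survivors, par)
assert bad_rebuild != b"AAAA"                  # fileB cannot be recovered error-free
```

In real ZFS the block checksum catches the bad reconstruction, so the file is reported as having a permanent error rather than being silently returned corrupted; either way, the data is no longer recoverable from the pool alone.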
I admit that in this scenario I should've replaced both ada1 and ada2 as soon as they each individually started to fail. My excuse is that I was lulled into a false lack of urgency because zpool status kept on reporting the pool as "ONLINE". I have now learned my lesson.
Finally, as an aside, this episode completely justifies my policy of maintaining two NAS boxes.