体验Bcachefs

一个代码简洁、功能强大的CoW文件系统

Posted by wszqkzqk on January 16, 2024
本文字数:8632

前言

Bcachefs是一个功能强大的CoW文件系统,支持快照、压缩、加密、RAID、缓存、多设备、多磁盘等现代文件系统的特性,同时,作者Kent Overstreet声称也具有较好的稳定性,称其为:

“The COW filesystem for Linux that won’t eat your data”.

笔者早在Bcachefs合并到主线之前就已经开始尝试Bcachefs,在Linux 6.7中,Bcachefs已经合并。目前,Arch Linux已经升级到了Linux 6.7,Bcachefs在此内核中开箱即用。本文将介绍在Arch Linux中体验Bcachefs的情况。

安装

Bcachefs已经在Arch Linux官方内核的linuxlinux-zen中直接支持,对于用户空间程序,可以安装在官方源中的bcachefs-tools

sudo pacman -S bcachefs-tools

创建Bcachefs文件系统

由于Bcahefs脱胎自Bcache,因此,Bcachefs具有像Bcache一样强大的多设备支持。然而,笔者在此并没有使用多设备,而是使用基础的单设备模式。

与Btrfs不同,Bcachefs可以在创建文件系统时指定文件系统采用的压缩算法、Hash算法等内容,例如,如果要创建一个压缩算法为zstd、Hash算法为xxhash的Bcachefs文件系统,启用了usrquotagrpquotaprjquota,指定卷标为Data,并且使用了/dev/sdb3作为底层设备,可以使用如下命令:

sudo mkfs.bcachefs /dev/sdb3 -L Data --acl --compression=zstd --metadata_checksum=xxhash --data_checksum=xxhash --usrquota --grpquota --prjquota

其中,mkfs.bcachefsbcachefs format等效。

特性

Bcachefs的特性非常丰富,但目前用户空间程序还不够完善。这里主要介绍Btrfs所不具备的特性或者行为与Btrfs不同的特性。

透明压缩

与Btrfs等文件系统类似,Bcachefs也支持透明压缩。Bcachefs目前支持以下3种压缩算法:

  • gip
  • zstd
  • lz4

其中,笔者个人较推荐使用zstd算法;如果需要追求更快的读写速率,可以尝试lz4算法;gzip算法压缩速率慢于zstd,解压速率更是远远落后于zstd,压缩率也没有明显优势,笔者个人不推荐在日常环境下使用。

相比Btrfs使用的压缩算法必须在挂载时指定,Bcachefs可以在创建时或者使用bcachefs set-option来设定默认挂载时的压缩算法。

不过,Bcachefs的透明压缩支持相比Btrfs仍有很多不完善的地方:

  • Btrfs的文件系统压缩情况可以使用compsize工具查看,然而,目前Bcachefs没有类似的工具。
  • 与Btrfs不同,目前Bcachefs还尚未支持压缩等级设置,只能使用默认等级
  • Btrfs可以针对不同文件设置不同的压缩算法(btrfs property set [filename] compression zstd),而Bcachefs目前还不支持

透明加密

Bcachefs实现了Btrfs目前所不支持的透明加密功能,可以在创建文件系统时通过--encrypted启用:

sudo bcachefs format --encrypted /dev/sdb3

使用时,需要执行bcachefs unlock以解锁:

sudo bcachefs unlock /dev/sdb3

如果需要更改密码,可以使用bcachefs set-passphrase;如果要永久移除密码,可以使用bcachefs remove-passphrase

子卷与快照

Bcachefs支持子卷与快照,使用方法与Btrfs类似,但是目前用户空间程序还不够完善,例如:

  • bcachefs subvolume还不支持列出子卷
  • 没有与Btrfs类似的sendreceive等命令
  • 没有set-default等命令
  • 不支持subvol=等挂载参数

最后一条是最致命的,因此目前如果要将Bcachefs用作系统盘,不能习惯性采用Btrfs的常用子卷布局,而需要将内容直接放到顶级子卷下。如果确实需要挂载某个子卷,可能得在挂载顶级子卷以后,再使用mount --bind命令挂载子卷。

多设备管理

Bcachefs提供了一系列命令来管理运行中的文件系统中的设备:

  • device add:向现有文件系统添加新设备
  • device remove:从现有文件系统中移除设备
  • device online:将现有成员重新添加到文件系统
  • device offline:将设备离线,但不移除它
  • device evacuate:迁移特定设备上的数据
  • device set-state:将设备标记为失败
  • device resize:调整设备上的文件系统大小
  • device resize-journal:调整设备上的日志大小

filesystem

Bcachefs提供了类似btrfs filesystem的命令,但是命令有所区别,为bcachefs fs,而且目前并不完善,仅支持usage一个命令。

fsck

Btrfs的fsck相当鸡肋,对于无法挂载的分区,如果没有Btrfs RAID1或者Btrfs RAID10,将很难修复,一般的非致命性错误则直接在挂载时透明修复,在开机时运行的fsck实际也并没有发挥作用。而Bcachefs的fsck则会在开机时实际运行。

然而,可能是由于CoW文件系统的既有限制,目前Bcachefs的fsck还远远没有像XFS、ext4等日志文件系统的fsck那么强大。

笔者使用往常使用的暴力手段进行文件系统损坏测试,步骤如下:

  • 先在主机中挂载Bcachefs文件系统
  • 然后将该分区直接以块设备形式直通给虚拟机
  • 在虚拟机中写入数据
  • 关闭虚拟机并在主机中卸载文件系统
  • 重新挂载,此时文件系统已经损坏

经过这种过程的损坏后,在笔者目前的测试中,只有XFS、ext4等日志文件系统可以通过fsck修复而恢复文件系统的挂载(但可能仅变得可挂载功能而存在数据丢失),而CoW文件系统,包括Btrfs、Bcachefs,都无法修复:

wszqkzqk@wszqkzqk-pc ~ > sudo bcachefs fsck /dev/sda6
mounting version 1.4: member_seq opts=ro,usrquota,grpquota,prjquota,degraded,fsck,fix_errors=ask,read_only
recovering from clean shutdown, journal seq 36780
Version upgrade from 1.3: rebalance_work to 1.4: member_seq incomplete
Doing compatible version upgrade from 1.3: rebalance_work to 1.4: member_seq

/dev/sda6: journal checksum error: got 588c62fe should be 8854d3a9 type crc32c_nonzero
found duplicate but non identical journal entries (seq 36387): fix?
 (y,n, or Y,N for all errors of this type) Y
duplicate journal entry 36387 on same device
found duplicate but non identical journal entries (seq 36388), fixing
duplicate journal entry 36388 on same device
found duplicate but non identical journal entries (seq 36389), fixing
duplicate journal entry 36389 on same device
found duplicate but non identical journal entries (seq 36390), fixing
duplicate journal entry 36390 on same device
found duplicate but non identical journal entries (seq 36391), fixing
duplicate journal entry 36391 on same device
found duplicate but non identical journal entries (seq 36392), fixing
duplicate journal entry 36392 on same device
found duplicate but non identical journal entries (seq 36393), fixing
duplicate journal entry 36393 on same device
found duplicate but non identical journal entries (seq 36394), fixing
duplicate journal entry 36394 on same device
found duplicate but non identical journal entries (seq 36395), fixing
duplicate journal entry 36395 on same device
found duplicate but non identical journal entries (seq 36396), fixing
duplicate journal entry 36396 on same device
found duplicate but non identical journal entries (seq 36397), fixing
duplicate journal entry 36397 on same device
found duplicate but non identical journal entries (seq 36398), fixing
duplicate journal entry 36398 on same device
found duplicate but non identical journal entries (seq 36399), fixing
duplicate journal entry 36399 on same device
found duplicate but non identical journal entries (seq 36400), fixing
duplicate journal entry 36400 on same device
found duplicate but non identical journal entries (seq 36401), fixing
duplicate journal entry 36401 on same device
found duplicate but non identical journal entries (seq 36402), fixing
duplicate journal entry 36402 on same device
found duplicate but non identical journal entries (seq 36403), fixing
duplicate journal entry 36403 on same device
found duplicate but non identical journal entries (seq 36404), fixing
duplicate journal entry 36404 on same device
found duplicate but non identical journal entries (seq 36405), fixing
duplicate journal entry 36405 on same device
found duplicate but non identical journal entries (seq 36406), fixing
duplicate journal entry 36406 on same device
found duplicate but non identical journal entries (seq 36407), fixing
duplicate journal entry 36407 on same device
found duplicate but non identical journal entries (seq 36408), fixing
duplicate journal entry 36408 on same device
found duplicate but non identical journal entries (seq 36409), fixing
duplicate journal entry 36409 on same device
found duplicate but non identical journal entries (seq 36410), fixing
duplicate journal entry 36410 on same device
found duplicate but non identical journal entries (seq 36411), fixing
duplicate journal entry 36411 on same device
found duplicate but non identical journal entries (seq 36412), fixing
duplicate journal entry 36412 on same device
found duplicate but non identical journal entries (seq 36413), fixing
duplicate journal entry 36413 on same device
found duplicate but non identical journal entries (seq 36414), fixing
duplicate journal entry 36414 on same device
found duplicate but non identical journal entries (seq 36415), fixing
duplicate journal entry 36415 on same device
found duplicate but non identical journal entries (seq 36416), fixing
duplicate journal entry 36416 on same device
found duplicate but non identical journal entries (seq 36417), fixing
duplicate journal entry 36417 on same device
journal read done, replaying entries 36780-36780
alloc_read... done
stripes_read... done
snapshots_read... done
check_allocations...error validating btree node on /dev/sda6 at btree inodes level 0/1
  u64s 11 type btree_ptr_v2 0:268488755:U32_MAX len 0 ver 0: seq bea2314c7a5528b4 written 482 min_key 0:268487600:0 durability: 1 ptr: 0:3381:0 gen 6
  node offset 470/482: btree node data missing: expected 482 sectors, found 470: fix?
 (y,n, or Y,N for all errors of this type) Y
btree_node_read_work: rewriting btree node at btree=inodes level=0 0:268488755:U32_MAX due to error
error validating btree node on /dev/sda6 at btree inodes level 0/1
  u64s 11 type btree_ptr_v2 0:536904743:U32_MAX len 0 ver 0: seq aaab92240e28546f written 236 min_key 0:536903573:0 durability: 1 ptr: 0:4203:0 gen 5
  node offset 0/236: got wrong btree node (want aaab92240e28546f got e8ce98ab0690b421)
got btree inodes level 0 pos 0:536903573:0-0:536904743:U32_MAX
running explicit recovery pass check_topology (4), currently at check_allocations (5)
Unreadable btree node at btree inodes level 0:
  u64s 11 type btree_ptr_v2 0:536904743:U32_MAX len 0 ver 0: seq aaab92240e28546f written 236 min_key 0:536903573:0 durability: 1 ptr: 0:4203:0 gen 5: fix?
 (y,n, or Y,N for all errors of this type) Y
Halting mark and sweep to start topology repair pass
fs has wrong nr_inodes: got 256454, should be 120149: fix?
 (y,n, or Y,N for all errors of this type) Y
 done
check_allocations...btree_node_read_work: rewriting btree node at btree=inodes level=0 0:268488755:U32_MAX due to error
error validating btree node on /dev/sda6 at btree inodes level 0/1
  u64s 11 type btree_ptr_v2 0:536904743:U32_MAX len 0 ver 0: seq aaab92240e28546f written 236 min_key 0:536903573:0 durability: 1 ptr: 0:4203:0 gen 5
  node offset 0/236: got wrong btree node (want aaab92240e28546f got e8ce98ab0690b421)
got btree inodes level 0 pos 0:536903573:0-0:536904743:U32_MAX
error validating btree node on /dev/sda6 at btree inodes level 0/1
  u64s 11 type btree_ptr_v2 0:1342209756:U32_MAX len 0 ver 0: seq 5a7973305c53ac1 written 501 min_key 0:1342208608:0 durability: 1 ptr: 0:4061:0 gen 4
  node offset 497/501: btree node data missing: expected 501 sectors, found 497, fixing
btree_node_read_work: rewriting btree node at btree=inodes level=0 0:1342209756:U32_MAX due to error
error validating btree node on /dev/sda6 at btree inodes level 0/1
  u64s 11 type btree_ptr_v2 0:1342223982:4294967294 len 0 ver 0: seq 538fe4f76d91b3c6 written 361 min_key 0:1342222841:0 durability: 1 ptr: 0:12554:0 gen 3
  node offset 346/361: btree node data missing: expected 361 sectors, found 346, fixing
btree_node_read_work: rewriting btree node at btree=inodes level=0 0:1342223982:4294967294 due to error
fs has wrong nr_inodes: got 120149, should be 255789, fixing
 done
cannot go rw, unfixed btree errors
bch2_fs_recovery(): error erofs_unfixed_errors
bch2_fs_start(): error starting filesystem erofs_unfixed_errors

奇怪的是,fsck报错中有erofs_unfixed_errors这样的信息,不知道为什么会冒出erofs,还是说这是error filesystem的简称。