linux-ext4 - Re: created ext4 disk image differs depending on the underlying filesystem


Date: Sat, 4 May 2024 20:10:20 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Johannes Schauer Marin Rodrigues <josch@...ter-muffin.de>
Cc: linux-ext4@...r.kernel.org
Subject: Re: created ext4 disk image differs depending on the underlying
 filesystem

On Sat, May 04, 2024 at 07:53:29PM +0200, Johannes Schauer Marin Rodrigues wrote:
> > 
> > Any idea what is going on?

The fundamental issue has to do with how ext2fs_zero_blocks() in
lib/ext2fs/mkjournal.c is implemented.

> The "Lifetime writes" being much higher on fat32 suggests that despite
> "nodiscard", less zeroes were written out when ext4 or tmpfs are the underlying
> FS?

Yes, that's exactly right.

The ext2fs_zero_blocks() function will attempt to call the io
channel's zeroout function --- for Unix systems, that's
lib/ext2fs/unix_io.c's __unix_zeroout() function.  This will attempt
to use fallocate's FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE to
zero a range of blocks.  Now, exactly how ZERO_RANGE and PUNCH_HOLE are
implemented depends on whether the "storage device" being accessed via
unix_io is a block device or a file, and if it is a file, whether the
underlying file system supports ZERO_RANGE or PUNCH_HOLE.

Depending on how the underlying file system supports ZERO_RANGE and/or
PUNCH_HOLE, it may simply manipulate metadata blocks (e.g., ext4's
extent tree) so that the relevant file offsets will return zero --- or
if the file system doesn't support uninitialized extent ranges, and/or
doesn't support sparse files, the file system MAY write all zeros, or
the file system MAY simply return an EOPNOTSUPP error, or the file
system MAY issue a SCSI WRITE SAME or moral equivalent for UFS, NVMe,
etc., if the block device supports it (and this might turn into a
SSD-level discard, so long as it is a reliable discard).  And of
course, if unix_io is accessing a block device, depending on the
capabilities of the storage device and its connection bus, this might
also turn into a SCSI WRITE SAME, or some other zeroout command.

Now, the zeroout command doesn't actually increment the lifetime
writes counter.  Whether or not it should is an interesting
philosophical question, since it might actually result in writes to
the device, or it might just simply involve metadata updates, either
on the underlying file (if the file system supports it), or
potentially in the metadata for the SSD's Flash Translation Layer.  At
the userspace level, we simply don't know how FALLOC_FL_ZERO_RANGE and
FALLOC_FL_PUNCH_HOLE will be implemented.
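
For comparing images, the counter in question can be read straight
from the image file.  What follows is only a rough sketch: the field
offsets (s_magic at 0x38, s_kbytes_written at 0x178, primary
superblock at byte 1024) reflect my understanding of the ext4 on-disk
superblock layout, and the function name is made up for illustration.
"dumpe2fs -h" remains the authoritative way to see "Lifetime writes".

```python
import struct

# Assumed on-disk layout of the primary ext2/3/4 superblock.
SUPERBLOCK_OFFSET = 1024          # superblock starts 1 KiB into the image
S_MAGIC_OFFSET = 0x38             # __le16 s_magic
S_KBYTES_WRITTEN_OFFSET = 0x178   # __le64 s_kbytes_written
EXT2_MAGIC = 0xEF53


def lifetime_kbytes_written(image_path):
    """Return the raw s_kbytes_written value (in KiB) from an image file."""
    with open(image_path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(1024)
    if len(sb) < 1024:
        raise ValueError("image too small to hold a superblock")
    (magic,) = struct.unpack_from("<H", sb, S_MAGIC_OFFSET)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/3/4 image")
    (kbytes,) = struct.unpack_from("<Q", sb, S_KBYTES_WRITTEN_OFFSET)
    return kbytes
```

Two images built on different underlying file systems can then be
compared on this one field, separately from the rest of their contents.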

In the case of FAT32, the file system doesn't support sparse files,
and it also doesn't support uninitialized extents.  So
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE will fail on a fat32
file system.  As a result, ext2fs_zero_blocks() will fall back to
explicitly writing zeros using io_channel_write_blk64(), and this
*does* increment the lifetime writes counter.
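
The try-zeroout-then-fall-back pattern described above can be sketched
roughly as follows.  This is not the e2fsprogs code: the helper names
try_zeroout() and zero_range() are invented, the order in which the
two fallocate modes are tried is an assumption, and the ctypes binding
assumes 64-bit Linux (where off_t is 64 bits).  The return value plays
the role of the bytes that would be charged to the lifetime writes
counter.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

# Bind fallocate(2) from the already-loaded C library (assumes 64-bit Linux).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def try_zeroout(fd, offset, length):
    """Return True if the kernel zeroed the range for us."""
    for mode in (FALLOC_FL_ZERO_RANGE,
                 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE):
        if _libc.fallocate(fd, mode, offset, length) == 0:
            return True
    return False  # e.g. EOPNOTSUPP on a file system such as FAT32


def zero_range(fd, offset, length, chunk=1 << 20):
    """Zero [offset, offset+length); return bytes written explicitly."""
    if try_zeroout(fd, offset, length):
        return 0  # nothing counted as a write
    zeros = bytes(chunk)
    written = 0
    while written < length:
        n = min(chunk, length - written)
        written += os.pwrite(fd, zeros[:n], offset + written)
    return written
```

On a file system where fallocate succeeds, zero_range() returns 0; on
one where it fails, it returns the full length, which is the asymmetry
that shows up in the "Lifetime writes" field.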

If you enhance the script by adding 'ls -ls "$imgpath"' and 'filefrag
-v "$imgpath" || /bin/true', you can see that the disk space consumed
by the image file varies, and it varies even more if you use the
original version of the script that doesn't disable lazy_itable_init,
discard, et al.
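
The same comparison that "ls -ls" gives can be made from a script.  A
minimal sketch (the function name disk_usage() is invented here):

```python
import os


def disk_usage(path):
    """Return (apparent_size, allocated_bytes) for path, like `ls -ls`."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units, independent of the
    # file system's own block size.
    return st.st_size, st.st_blocks * 512
```

If the allocated size is well below the apparent size, the underlying
file system is storing part of the image as holes.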

Unfortunately tmpfs and fat don't support filefrag -v, but you could
see the difference if you write a debugging program which used lseek's
SEEK_HOLE and SEEK_DATA to see which parts of the file are sparse
(although it won't show which parts of the file are marked
uninitialized, assuming the file system supported it).
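
A debugging program along those lines might look like the sketch
below.  The function name data_extents() is invented, and the
behaviour on file systems without real SEEK_HOLE/SEEK_DATA support
(the whole file is reported as a single data extent) is the generic
Linux fallback as I understand it.

```python
import errno
import os


def data_extents(path):
    """Return a list of (start, end) byte ranges reported as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # only a hole remains
                    break
                raise
            # The end of the file counts as an implicit hole.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            pos = end
    finally:
        os.close(fd)
    return extents
```

Anything between two reported extents, or between the last extent and
the end of the file, is a hole.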

If your goal is to create completely reproducible image files, one
question is whether keeping the checksums identical is enough, or do
you care about whether the underlying file is being more efficiently
stored by using sparse files or extents marked uninitialized?

Depending on how much you care about reproducibility versus file
storage efficiency, I could imagine adding some kind of option which
disables the zeroout function, and forces e2fsprogs to always write
zeros, even if that increases the write wearout rate of the underlying
flash file system, and increases the size of the image file.  Or I
could imagine some kind of extended option which hacks mke2fs to zero
out the lifetime writes counter.

Cheers,

						- Ted