Date:   Wed, 12 Aug 2020 17:14:32 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     tytso@....edu
Cc:     Wang Shilong <wangshilong1991@...il.com>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Wang Shilong <wshilong@....com>, Shuichi Ihara <sihara@....com>
Subject: Re: [PATCH v3 1/2] ext4: introduce EXT4_BG_WAS_TRIMMED to optimize
 trim

On Aug 10, 2020, at 7:24 AM, tytso@....edu wrote:
> 
> On Sat, Aug 08, 2020 at 10:33:08PM -0600, Andreas Dilger wrote:
>> What about storing "s_min_freed_blocks_to_trim" persistently in the
>> superblock, and then the admin can adjust this as desired?  If it is
>> set =1, then the "lazy trim" optimization would be disabled (every
>> FITRIM request would honor the trim requests whenever there is a
>> freed block in a group).  I suppose we could allow =0 to mean "do not
>> store the WAS_TRIMMED flag persistently", so there would be no change
>> for current behavior, and it would require a tune2fs option to set the
>> new value into the superblock (though we might consider setting this
>> to a non-zero value in mke2fs by default).
> 
> Currently the minimum blocks to trim is passed in to FITRIM from
> userspace; so we would need to define how the passed-in value from the
> fstrim program interacts with the value stored in the superblock.
> Would we always ignore the value passed-in from userspace?  That
> doesn't seem right...

The s_min_freed_blocks_to_trim value would decide whether the persistent
WAS_TRIMMED flag is cleared from the group or not when blocks are freed.
If =1 then every block freed in a group will clear the WAS_TRIMMED flag
(this is essentially the current behavior).  If > 1 (the default with
the second patch in the series) then multiple blocks need to be freed
from a group before it is reconsidered for trim, which is reasonable,
since few underlying devices have 1-block TRIM granularity and issuing
a new TRIM each time is just overhead.  If =0 then the WAS_TRIMMED flag
is never saved, and every FITRIM call will result in every group being
trimmed again, which is what currently happens when the filesystem is
newly mounted.
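
To make those three cases concrete, here is a minimal userspace model of
the clearing logic (the struct, function names, and counter are invented
for illustration; this is not code from the patch):

#include <stdbool.h>

/* Per-group state: models EXT4_BG_WAS_TRIMMED plus a freed counter. */
struct group_state {
	unsigned int freed_since_trim;	/* blocks freed since last trim */
	bool was_trimmed;		/* models the WAS_TRIMMED flag */
};

/* Called when "count" blocks are freed in a group. */
static void on_blocks_freed(struct group_state *grp, unsigned int count,
			    unsigned int min_freed_to_trim)
{
	grp->freed_since_trim += count;

	/* =1: any freed block clears the flag (current behavior);
	 * >1: enough blocks must be freed before the group is
	 *     reconsidered for trim. */
	if (min_freed_to_trim >= 1 &&
	    grp->freed_since_trim >= min_freed_to_trim)
		grp->was_trimmed = false;
}

/* Called after FITRIM trims a group. */
static void on_group_trimmed(struct group_state *grp,
			     unsigned int min_freed_to_trim)
{
	grp->freed_since_trim = 0;
	/* =0: never save the flag, so every FITRIM retrims the group. */
	grp->was_trimmed = (min_freed_to_trim != 0);
}

/* FITRIM only considers groups whose flag is clear. */
static bool group_needs_trim(const struct group_state *grp)
{
	return !grp->was_trimmed;
}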

>> The other thing we were thinking about was changing the "-o discard" code
>> to leverage the WAS_TRIMMED flag, and just do bulk trim periodically
>> in the filesystem as blocks are freed from groups, rather than tracking
>> freed extents in memory and submitting trims actively during IO.  Instead,
>> it would track groups that exceed "s_min_freed_blocks_to_trim", and trim
>> the whole group in the background when the filesystem is not active.
> 
> Hmm, maybe.  That's an awful lot of complexity, which is my concern
> with that approach.
> 
> Part of the problem here is that discard is being used for different
> things for different use cases and devices with different discard
> speeds.  Right now, one of the primary uses of -o discard is for
> people who have fast discard implementations and/or people who really
> want to make sure every freed block is immediately discarded --- perhaps
> to meet security / privacy requirements (such as HIPAA compliance,
> etc.).   I don't want to break that.
> 
> We now have a requirement from people who have very slow discards --- I
> think at one point people mentioned something about devices using
> HDDs, probably in some kind of dm-thin use case?  One solution that we
> can use for those is simply to use fstrim -m 8M or some such.  But it
> appears that part of the problem is people do want more precision than
> that?

At least in the DDN case it isn't for HDDs but just for huge NVMe RAID
devices where the many millions of TRIM requests take a noticeable time.
Also, even if the TRIM is done by mke2fs, fstrim will TRIM the whole
filesystem again after each mount because this state is not persistently
stored in the group descriptors.

> Another solution might be to skip trimming block groups if there have
> been blocks that have been freshly freed that are pending a commit,
> and skip that block group until the commit has completed.  That might
> also help reduce contention on a busy file system.

I think even with fast TRIM devices, sending lots of them inline with
other requests causes IO serialization, so "-o discard" is bad for
performance and has complex in-memory tracking of extents to TRIM.  The
other drawback of "-o discard" is that even with "perfect" TRIM, the size
or alignment may not be suitable for the underlying NAND, so TRIMs that
eventually cover the entire group could in fact release none of the
underlying space because individually they weren't the right size/offset.

The proposal is to efficiently aggregate and track (i.e. 1 bit per group
on disk, one int per group in memory) which groups can benefit from a
TRIM request (e.g. after multiple MB of blocks are freed there), rather
than "-o discard" sending hundreds of TRIMs when individual
blocks/extents are freed.  The alternative of running "fstrim" from cron
is also not ideal, because it is hard to know what interval to use, and
there is no way to know if the filesystem is idle and it is a good time
to run, or if it *was* idle but suddenly becomes busy after fstrim starts.
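
Roughly, the per-group bookkeeping and the background pass could look
like this sketch (the field names and the 8 MiB threshold are assumptions
for illustration, not part of the patch):

#include <stdbool.h>
#include <stdint.h>

#define FREED_BYTES_THRESHOLD	(8ULL << 20)	/* assumed: 8 MiB freed */

struct group_trim_state {
	uint32_t freed_blocks;	/* the one int per group, in memory */
	bool was_trimmed;	/* the one bit per group, on disk */
};

/* Background pass: issue one large TRIM per group that has seen enough
 * churn, instead of one discard per freed extent as "-o discard" does. */
static void background_trim_pass(struct group_trim_state *groups,
				 unsigned long ngroups, uint32_t blocksize)
{
	for (unsigned long g = 0; g < ngroups; g++) {
		uint64_t freed = (uint64_t)groups[g].freed_blocks * blocksize;

		if (groups[g].was_trimmed && freed < FREED_BYTES_THRESHOLD)
			continue;	/* still trimmed, not enough churn */

		/* ... issue a single TRIM covering the group's free space,
		 * then persist the flag in the group descriptor ... */
		groups[g].freed_blocks = 0;
		groups[g].was_trimmed = true;
	}
}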


Adding filesystem load-detection for fstrim, so that it only trims groups
while idle and stops when there is IO, would solve the contention issue
with actual filesystem usage.  At that point, the "enhancement" would
essentially be to have fstrim restart periodically internally when
"-o discard" is used (or maybe "-o fstrim" instead?), with optimizations
to speed up finding groups that need to be trimmed as opposed to a full
scan each time.  Maybe the optimizations are premature, and a full scan
is easy enough even when there are millions of groups?  I guess if mballoc
can do full-group scans then fstrim can do the same easily enough.
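
For comparison, a crude userspace approximation of that loop is already
possible with the existing FITRIM ioctl; something like the sketch below,
where the busy check and the 1 GiB step size are assumptions on my part,
not part of any proposal:

#include <fcntl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define STEP_BYTES	(1ULL << 30)	/* assumed: trim 1 GiB per step */

/* Placeholder idleness check, e.g. by sampling /proc/diskstats over a
 * short interval; deliberately left unimplemented in this sketch. */
static bool fs_is_busy(void)
{
	return false;
}

/* Trim the filesystem in steps, backing off whenever it looks busy. */
static int trim_when_idle(const char *mountpoint, uint64_t fs_bytes)
{
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -1;

	for (uint64_t off = 0; off < fs_bytes; off += STEP_BYTES) {
		while (fs_is_busy())
			sleep(1);	/* wait for an idle period */

		struct fstrim_range r = {
			.start = off,
			.len = STEP_BYTES,	/* kernel clamps to fs end */
			.minlen = 0,
		};

		if (ioctl(fd, FITRIM, &r) < 0)
			perror("FITRIM");
	}
	close(fd);
	return 0;
}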


> Yet another solution might be to bias block allocations towards LBA
> ranges that have been deleted recently --- since another way to avoid
> trims is to simply overwrite those LBA's.  But then the question is
> how much memory are we willing to dedicate towards tracking recently
> released LBA's, and to what level of granularity?  Perhaps we just
> track the freed extents, and if they don't get used within a certain
> period, or if we start getting put under memory pressure, we then send
> the discards at that point.

I'd think that overwriting just-freed blocks is the opposite of what we
want?  Overwriting NAND blocks results in an erase cycle, and TRIM is
intended to pre-erase the blocks so that this doesn't need to be done
when there is a new write.

> Ultimately, though, this is a space full of trade offs, and I'm
> reminded of one of my father's favorite Chinese sayings: "You're
> demanding a horse which can run fast, but which doesn't eat much
> grass." (又要马儿跑,又要马儿不吃草).  Or translated more
> idiomatically, you can't have your cake and eat it too.  It seems
> this desire transcends all cultures.  :-)


Cheers, Andreas