linux-ext4 - Re: Selective Data Journaling in ext4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 13 Feb 2019 14:08:13 -0700
From:   Andreas Dilger <adilger@...ger.ca>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Vijay Chidambaram <vijayc@...xas.edu>, linux-ext4@...r.kernel.org,
        jesus.palos@...xas.edu
Subject: Re: Selective Data Journaling in ext4

On Feb 13, 2019, at 11:53 AM, Theodore Y. Ts'o <tytso@....edu> wrote:
> 
> On Wed, Feb 13, 2019 at 10:30:47AM -0600, Vijay Chidambaram wrote:
>> Agreed, but another way to view this feature is that it is dynamic
>> switching between ordered mode and data journaling mode. We switch to
>> data journaling mode exactly when it is required, so you are right
>> that most applications would never see a difference. But when it is
>> required, this scheme would ensure stronger semantics are provided.
>> Overall, it provides data-journaling guarantees all the time, and I
>> was thinking some applications would like that peace of mind.
> 
> Switching back and forth orderred and data journalling mode is a bit
> tricky.  (Insert "one does not simply walk into Morder" meme here).
> 
> See the comment in ext4_change_journal_flag() in fs/ext4/inode.c:
> 
> 	/*
> 	 * We have to be very careful here: changing a data block's
> 	 * journaling status dynamically is dangerous.  If we write a
> 	 * data block to the journal, change the status and then delete
> 	 * that block, we risk forgetting to revoke the old log record
> 	 * from the journal and so a subsequent replay can corrupt data.
> 	 * So, first we make sure that the journal is empty and that
> 	 * nobody is changing anything.
> 	 */
> 
> What this means is that you have to track a list of blocks that has
> ever been data journalled, because before we delete the file, we have
> to write revoke all blocks belonging to that file on the list.
> Similarly, if you switch from ordered to data journalling mode, all of
> those blocks must be revoked.

To avoid the issue of enabling data journaling on a file, and the more
difficult process of disabling data journaling, I think we can be lazy
when disabling data journaling on a file until after the last journal
tid that contains data blocks from the file has been checkpointed out
of the journal.  It isn't like the case where the user requests data
journal be enabled or disabled *now*, so we just need to e.g. put those
files into the orphan list with a journal commit (checkpoint?) callback
to track when the data journal can be removed.

Alternately, just leave the data-journal mode enabled on such files
since they are likely to be used in the same way in the future (or more
likely never modified again) and we never disable data journal.

> This should also be done in a way that avoids serializing parallel
> writes to the the inode.  That's not something we support today (yet),
> but thare are some plans to allow parallel direct I/O writes to the
> file.  Speaking of Direct I/O writes, as above, if a block that was
> previously written via data journalling, the revoke block must be
> submitted --- and committed --- before Direct I/O writes to that block
> can be allowed.
> 
>>> Since we already have delalloc to pre-stage the dirty pages before the
>>> write, we can make a good decision about whether the file data should
>>> be written to the journal or directly to the filesystem.
> 
> Note that delalloc and data journalling is not compatible.  That being
> said, if we are writing to not-yet-allocated block, recent discussions
> of changing ext4 so that we only insert the block into the extent tree
> in a workqueue triggered by the I/O callback for data block write, is
> probably the better way of removing the data=ordered overhead.
> 
> Finally, this optimization only makes sense for HDD's, right?  For
> SSD's, random writes are mostly free, and the cost of the double
> write, not to mention the write amplification effect, probably makes
> this not worthwhile.

Sure, HDDs or hybrid HDDs with SSDs for the journal.  Using the SMR ext4
patches to enable log-structured write mode for ext4 would allow using a
good-sized journal device (32-64GB Optane M.2 devices are cheap and very
fast, and the smallest possible devices that are available today, larger
SSDs are definitely practical to use).  That allows sinking all of the
IOPS into the journal automatically without overwhelming the SSD bandwidth
with large writes that can efficiently be made directly to HDDs, and
then the checkpoint can do a better job to order the writes to HDD later.
With a RAID system the aggregate HDD bandwidth for large read/write exceeds
the SSD bandwidth.

This is definitely a workload that is of real-life interest (mixed large
and small file writes), so being able to optimize this at the ext4 level
would be great.

>> We like this idea as well, and would be happy to work on it! To make
>> sure we are on the same page, the proposal is to:
>> - identify whether writes are sequential or random (1)
>> - Send random writes to journal if Selective Data Journaling is enabled (2)
>> 
>> How should we do (1)? Also, would it make sense to do this per-file
>> instead of as a mode for the entire file system? I am thinking of
>> opening a file with O_SDJ which will convert random writes to
>> sequential and increase performance.

There are really two things to (1) - small random/sync/unaligned writes
into a large file, and small writes to individual files.  The VM already
does similar random/sequential read request detection for large files,
so the same could be used easily for write requests, and the latter can
be done by checking the file size.

Cheers, Andreas






Download attachment "signature.asc" of type "application/pgp-signature" (874 bytes)