linux-ext4 - Re: [PATCH v2 12/12] docs: Add fast commit documentation

Date:   Wed, 21 Aug 2019 08:21:21 -0700
From:   "Darrick J. Wong" <darrick.wong@...cle.com>
To:     harshad shirwadkar <harshadshirwadkar@...il.com>
Cc:     Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2 12/12] docs: Add fast commit documentation

On Mon, Aug 19, 2019 at 11:38:42PM -0700, harshad shirwadkar wrote:
> On Thu, Aug 15, 2019 at 6:00 PM Darrick J. Wong <darrick.wong@...cle.com> wrote:
> >
> > On Thu, Aug 08, 2019 at 08:45:52PM -0700, Harshad Shirwadkar wrote:
> > > This patch adds necessary documentation to
> > > Documentation/filesystems/journalling.rst and
> > > Documentation/filesystems/ext4/journal.rst.
> > >
> > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@...il.com>
> > > ---
> > >  Documentation/filesystems/ext4/journal.rst | 96 ++++++++++++++++++++--
> > >  Documentation/filesystems/journalling.rst  | 15 ++++
> > >  2 files changed, 105 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > > index ea613ee701f5..d6e4a698e208 100644
> > > --- a/Documentation/filesystems/ext4/journal.rst
> > > +++ b/Documentation/filesystems/ext4/journal.rst
> > > @@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > >  disk before the metadata are written to disk through the journal.
> > >
> > >  The journal inode is typically inode 8. The first 68 bytes of the
> > > -journal inode are replicated in the ext4 superblock. The journal itself
> > > -is normal (but hidden) file within the filesystem. The file usually
> > > -consumes an entire block group, though mke2fs tries to put it in the
> > > -middle of the disk.
> > > +journal inode are replicated in the ext4 superblock. The journal
> > > > +itself is a normal (but hidden) file within the filesystem. The file
> > > > +usually consumes an entire block group, though mke2fs tries to put it
> > > > +in the middle of the disk. The last 128 blocks in the journal are reserved
> > > +for fast commits. Fast commits store metadata changes to inodes in an
> > > +incremental fashion. A fast commit is valid only if there is no full
> > > +commit after that particular fast commit. That makes fast commit space
> > > +reusable after every full commit.
> > >
> > >  All fields in jbd2 are written to disk in big-endian order. This is the
> > >  opposite of ext4.
> > > @@ -48,16 +52,18 @@ Layout
> > >  Generally speaking, the journal has this format:
> > >
> > >  .. list-table::
> > > -   :widths: 16 48 16
> > > +   :widths: 16 48 16 18
> > >     :header-rows: 1
> > >
> > >     * - Superblock
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Notice that a transaction begins with either a descriptor and some data,
> > >  or a block revocation list. A finished transaction always ends with a
> > > @@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
> > >  superblock.
> > >
> > >  .. list-table::
> > > -   :widths: 12 12 12 32 12
> > > +   :widths: 12 12 12 32 12 12
> > >     :header-rows: 1
> > >
> > >     * - 1024 bytes of padding
> > > @@ -85,11 +91,13 @@ superblock.
> > >       - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > >         revocations] commmit\_block
> > >       - [more transactions...]
> > > +     - [Fast commits...]
> > >     * -
> > >       -
> > >       -
> > >       - One transaction
> > >       -
> > > +     -
> > >
> > >  Block Header
> > >  ~~~~~~~~~~~~
> > > @@ -609,3 +617,79 @@ bytes long (but uses a full block):
> > >       - h\_commit\_nsec
> > >       - Nanoseconds component of the above timestamp.
> > >
> > > +Fast Commit Block
> > > +~~~~~~~~~~~~~~~~~
> > > +
> > > +The fast commit block indicates an append to the last commit block
> > > +that was written to the journal. One fast commit block records updates
> > > +to one inode. So, typically you would find as many fast commit blocks
> > > +as the number of inodes that got changed since the last commit. A fast
> > > +commit block is valid only if there is no commit block present with
> > > +transaction ID greater than that of the fast commit block. If such a
> > > > +block is present, then there is no need to replay the fast commit
> > > +block.
> > > +
> > > +Multiple fast commit blocks are a part of one sub-transaction. To
> > > +indicate the last block in a fast commit transaction, fc_flags field
> > > +in the last block in every subtransaction is marked with "LAST" (0x1)
> > > +flag. A subtransaction is valid only if all the following conditions
> > > +are met:
> > > +
> > > +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> > > +   the previous fast commit block.
> > > +2) For every sub-transaction, last block is marked with LAST flag.
> > > +3) There are no invalid blocks in between.
> > > +
> > > +.. list-table::
> > > +   :widths: 8 8 24 40
> > > +   :header-rows: 1
> > > +
> > > +   * - Offset
> > > +     - Type
> > > +     - Name
> > > +     - Descriptor
> > > +   * - 0x0
> > > +     - journal\_header\_s
> > > +     - (open coded)
> > > +     - Common block header.
> > > +   * - 0xC
> > > +     - \_\_le32
> > > +     - fc\_magic
> > > +     - Magic value which should be set to 0xE2540090. This identifies
> > > +       that this block is a fast commit block.
> > > +   * - 0x10
> > > +     - \_\_le32
> > > +     - fc\_subtid
> > > +     - Sub-transaction ID for this commit block
> > > +   * - 0x14
> > > +     - \_\_u8
> > > +     - fc\_features
> > > +     - Features used by this fast commit block.
> > > +   * - 0x15
> > > +     - \_\_u8
> > > +     - fc_flags
> > > +     - Flags. (0x1(Last) - Indicates that this is the last block in sub-transaction)
> > > +   * - 0x16
> > > +     - \_\_le16
> > > +     - fc_num_tlvs
> > > +     - Number of TLVs contained in this fast commit block
> > > +   * - 0x18
> > > +     - \_\_le32
> > > +     - \_\_fc\_len
> > > +     - Length of the fast commit block in terms of number of blocks
> > > +   * - 0x2c
> > > +     - \_\_le32
> > > +     - fc\_ino
> > > +     - Inode number of the inode that will be recovered using this fast commit
> > > +   * - 0x30
> > > +     - struct ext4\_inode
> > > +     - inode
> > > +     - On-disk copy of the inode at the commit time
> > > +   * - 0x34
> > > +     - struct ext4\_fc\_tl
> > > +     - Array of struct ext4\_fc\_tl
> > > +     - The actual delta with the last commit. Starting at this offset,
> > > > +       there is an array of TLVs that indicate which extents
> > > +       should be present in the corresponding inode. Currently, the
> > > +       only tag that is supported is EXT4\_FC\_TAG\_EXT. That tag
> > > +       indicates that the corresponding value is an extent.
> >
> > This is a good start, but what's the structure of struct ext4_fc_tl ?
> > It's written to disk, it should be here too.  Looks like it's mostly
> > just an array of ondisk extent structures?
> Thanks, I'll update this in next version. struct ext4_fc_tl is a
> generic tag-length-value container that currently holds only extents
> that were added to a file after last commit.

<nod> If they're the same format as the extent map records then I think
you can just reference that part of the documentation.
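
As a rough illustration of what a generic tag-length-value container
implies for a reader of the block, here is a minimal sketch of walking
such an array. The record layout (16-bit little-endian tag, 16-bit
little-endian length, then the value bytes) and the tag value are
assumptions made for the example; the patch under review does not spell
out the on-disk format of struct ext4_fc_tl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numeric value for EXT4_FC_TAG_EXT; not taken from the patch. */
#define DEMO_FC_TAG_EXT 1u

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t demo_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Count the records carrying tag `want` in buf[0..size).
 * Hypothetical record layout: le16 tag, le16 len, then `len` value bytes.
 * Returns -1 if a record would run past the end of the buffer.
 */
static int demo_fc_count_tlvs(const uint8_t *buf, size_t size, uint16_t want)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL || size == 0);

	while (off < size) {
		uint16_t tag, len;

		if (size - off < 4)
			return -1;	/* truncated record header */
		tag = demo_get_le16(buf + off);
		len = demo_get_le16(buf + off + 2);
		off += 4;
		if (len > size - off)
			return -1;	/* value overruns the buffer */
		if (tag == want)
			count++;
		off += len;		/* skip the value, known tag or not */
	}
	return count;
}
```

The point of the container is visible in the last step: a reader can
skip a record with an unknown tag by its length, so new tags can be
added later without breaking older replay code.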

> > So if I read this right, this first fastcommit tag type seems to be an
> > inode core and an array of extents which ... I guess are the extents
> > that were allocated and mapped into the file?  So therefore journal
> > replay of this metadata update becomes a simple matter of logging the
> > new inode core, adding the associated fc extent records to the extent
> > map, and marking the corresponding parts of the block bitmap in use?
> >
> Yes, that's precisely what is done here.
> > I'm wondering why these fast commits aren't written inline with the
> > regular jbd2 transaction block stream?  i.e.
> >
> > [descriptors][blocks][commit][fastcommit][fastcommit][descriptor...]
> >
> After a full commit all previous fast commits are invalid. So, if we

All of them, fs-wide?  Or just the ones for that particular inode?

> inline fast commits with corresponding transactions, we'll end up
> wasting a whole lot of journal space. So, fast commit area is kept
> separate from the normal journaling area and after every transaction
> commit, fast commit space is reused. But, if we could overwrite fast
> commit blocks with the final commit then it's possible to inline fast
> commit blocks with the transaction stream without losing journal
> space. So, fast commit could just write a fast commit block after
> previous transaction and when next transaction commits, it could
> simply overwrite previous fast commit blocks.

<nod>
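
For reference, the validity rules in the patch (no full commit with a
greater transaction ID, non-decreasing SUBTID, every sub-transaction
closed by a LAST block, no invalid blocks in between) could be checked
roughly as below. This is only a sketch: the in-memory struct, its field
names, and the helper are made up for the example, and TID wraparound
is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FC_FLAG_LAST 0x1	/* "LAST" flag value, as documented in the patch */

/* Hypothetical in-memory summary of one fast commit block. */
struct demo_fc_block {
	uint32_t tid;		/* transaction ID from the block header */
	uint32_t subtid;	/* fc_subtid */
	uint8_t flags;		/* fc_flags */
	bool magic_ok;		/* fc_magic matched the expected value */
};

/*
 * Return how many leading blocks belong to complete, replayable
 * sub-transactions, given the TID of the newest full commit.
 */
static size_t demo_fc_replayable(const struct demo_fc_block *b, size_t n,
				 uint32_t last_full_tid)
{
	size_t done = 0;
	size_t i;

	assert(b != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!b[i].magic_ok)
			break;		/* invalid block in between */
		if (b[i].tid < last_full_tid)
			break;		/* superseded by a later full commit */
		if (i > 0 && b[i].subtid < b[i - 1].subtid)
			break;		/* SUBTID went backwards */
		if (b[i].flags & DEMO_FC_FLAG_LAST)
			done = i + 1;	/* a sub-transaction just closed */
	}
	return done;
}
```

Blocks after the last LAST-flagged block are not counted, since their
sub-transaction never completed.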

> > That way jbd2 replay just adds a case for a journal block with h_magic
> > == JBD2_FC_MAGIC where it checkpoints whatever it had staged at that
> > point, throws the fast commit block up to ext4 to do whatever, and then
> > continues on replaying regular transactions?  I get this feeling like
> > fastcommit is a journal that runs inside of/alongside jbd2 and wonder
> > why not just integrate it better with jbd2?
> Hmmm, I agree, we want fast commits to be as close to jbd2 as possible.

:)

--D

> >
> > > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > > index 58ce6b395206..2e0d550b546c 100644
> > > --- a/Documentation/filesystems/journalling.rst
> > > +++ b/Documentation/filesystems/journalling.rst
> > > @@ -115,6 +115,21 @@ called after each transaction commit. You can also use
> > >  ``transaction->t_private_list`` for attaching entries to a transaction
> > >  that need processing when the transaction commits.
> > >
> > > +JBD2 also allows client file systems to implement file system specific
> > > > +commits, which are called ``fast commits``. File systems that wish
> > > > +to use this feature should first set
> > > > +``journal->j_fc_commit_callback``. That function is called before
> > > > +performing a commit. The file system can call :c:func:`jbd2_map_fc_buf()`
> > > > +to get buffers reserved for fast commits. If the file system returns 0,
> > > > +JBD2 assumes that the file system performed a fast commit and it backs off
> > > +from performing a commit. Otherwise, JBD2 falls back to normal full
> >
> > Huh.  Ok, so the caller I guess grabs fastcommit blocks, writes the
> > intent to the fc block, and pushes it to disk, after which we can return
> > to userspace.  Some time later jbd2 gets around to committing things so
> > it calls back with ->j_fc_commit_callback at which point we say "Oh! I
> > already wrote that to disk as a fastcommit, so return 0" and jbd2
> > shrugs and moves on to the next transaction?
> I am sorry for the confusing wording here, let me fix it in the next
> version. So, either when fsync() is called or when jbd2 wakes up, in
> both cases, journal->j_fc_commit_callback() is invoked by jbd2. In
> other words, journal->j_fc_commit_callback() is the main fastcommit
> "commit" routine. If j_fc_commit_callback() returns 0, jbd2 knows that
> the file system was able to perform a fast commit, and in that case a full
> commit is not needed. But there are scenarios where the file system would
> rather do a full commit, for a few reasons: the accumulated work is too
> much to fit in the fast commit region, the accumulated work is too much
> for a fast commit to give any performance benefit, or a complex operation
> (such as punch hole) was performed for which there's no fast commit
> support yet. In such cases,
> j_fc_commit_callback() can simply return a non-zero value to tell jbd2
> to perform a full traditional commit.
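
The return-value contract described above can be modelled in a few
lines. This is a sketch of the decision only, not the patch's code: the
state struct, the function names, and the callback typedef are invented
for the example, and the 128-block limit is the reserved fast commit
area mentioned in the documentation hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_FC_AREA_BLOCKS 128	/* last 128 journal blocks, per the patch */

/* Hypothetical file system state; not a structure from the patch. */
struct demo_fc_state {
	size_t pending_blocks;	/* fast commit blocks the pending work needs */
	bool unsupported_op;	/* e.g. a punch hole since the last commit */
};

/* Stand-in for journal->j_fc_commit_callback: 0 means fast commit done. */
typedef int (*demo_fc_commit_cb)(const struct demo_fc_state *);

static int demo_fc_commit(const struct demo_fc_state *s)
{
	assert(s != NULL);

	if (s->unsupported_op)
		return 1;	/* no fast commit support for this operation */
	if (s->pending_blocks > DEMO_FC_AREA_BLOCKS)
		return 1;	/* does not fit in the fast commit region */
	return 0;		/* fast commit performed */
}

/* What the journal does with the answer: skip or run the full commit. */
static const char *demo_journal_commit(demo_fc_commit_cb cb,
				       const struct demo_fc_state *s)
{
	if (cb && cb(s) == 0)
		return "fast";
	return "full";
}
```

A journal with no callback set falls through to the full commit, which
matches the behaviour of a file system that does not opt in.
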
> 
> Thanks,
> Harshad
> >
> > --D
> >
> > > +commit. After performing either a fast or a full commit, JBD2 calls
> > > +``journal->j_fc_cleanup_cb`` to allow file systems to perform cleanups
> > > +for their internal fast commit related data structures. At the replay
> > > +time, JBD2 passes each and every fast commit block to the file system
> > > +via ``journal->j_fc_replay_cb``. Ext4 effectively uses this fast
> > > +commit mechanism to improve journal commit performance.
> > > +
> > >  JBD2 also provides a way to block all transaction updates via
> > >  :c:func:`jbd2_journal_lock_updates()` /
> > >  :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > > --
> > > 2.23.0.rc1.153.gdeed80330f-goog
> > >