linux-ext4 - Re: [PATCH v2 5/6] ext4: introduce direct IO write path using iomap infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 11 Sep 2019 13:38:52 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Matthew Bobrowski <mbobrowski@...browski.org>, tytso@....edu,
        jack@...e.cz, adilger.kernel@...ger.ca
Cc:     linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        david@...morbit.com, hch@...radead.org, darrick.wong@...cle.com
Subject: Re: [PATCH v2 5/6] ext4: introduce direct IO write path using iomap
 infrastructure

Hello,

Few more small things noted.
Please check once.

On 9/9/19 2:56 PM, Ritesh Harjani wrote:
> 
> 
> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
>> This patch introduces a new direct IO write code path implementation
>> that makes use of the iomap infrastructure.
>>
>> All direct IO write operations are now passed from the ->write_iter()
>> callback to the new function ext4_dio_write_iter(). This function is
>> responsible for calling into iomap infrastructure via
>> iomap_dio_rw(). Snippets of the direct IO code from within
>> ext4_file_write_iter(), such as checking whether the IO request is
>> unaligned asynchronous IO, or whether it will ber overwriting
>> allocated and initialized blocks has been moved out and into
>> ext4_dio_write_iter().
>>
>> The block mapping flags that are passed to ext4_map_blocks() from
>> within ext4_dio_get_block() and friends have effectively been taken
>> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
>> happens to have instantiated blocks beyond the i_size, then we attempt
>> to place the inode onto the orphan list. Despite being able to perform
>> i_size extension checking earlier on in the direct IO code path, it
>> makes most sense to perform this bit post successful block allocation.
>>
>> The ->end_io() callback ext4_dio_write_end_io() is responsible for
>> removing the inode from the orphan list and determining if we should
>> truncate a failed write in the case of an error. We also convert a
>> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
>> and perform the necessary i_size/i_disksize extension if the
>> iocb->ki_pos + dio->size > i_size_read(inode).
>>
>> In the instance of a short write, we fallback to buffered IO and
>> complete whatever is left the 'iter'. Any blocks that may have been
>> allocated in preparation for direct IO will be reused by buffered IO,
>> so there's no issue with leaving allocated blocks beyond EOF.
>>
>> Signed-off-by: Matthew Bobrowski <mbobrowski@...browski.org>
> 
> Sorry some minor simplification comments. Forgot to respond in previous 
> email.
> 
> Otherwise looks good.
> 
> Reviewed-by: Ritesh Harjani <riteshh@...ux.ibm.com>
> 
> 
> 
>> ---
>>   fs/ext4/file.c  | 219 +++++++++++++++++++++++++++++++++---------------
>>   fs/ext4/inode.c |  57 ++++++++++---
>>   2 files changed, 198 insertions(+), 78 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 8e586198f6e6..bf22425a6a6f 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -29,6 +29,7 @@
>>   #include <linux/pagevec.h>
>>   #include <linux/uio.h>
>>   #include <linux/mman.h>
>> +#include <linux/backing-dev.h>
>>   #include "ext4.h"
>>   #include "ext4_jbd2.h"
>>   #include "xattr.h"
>> @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb 
>> *iocb, struct iov_iter *from)
>>       if (ret <= 0)
>>           return ret;
>>
>> +    ret = file_remove_privs(iocb->ki_filp);
>> +    if (ret)
>> +        return 0;
>> +
>> +    ret = file_update_time(iocb->ki_filp);
>> +    if (ret)
>> +        return 0;
>> +
>>       if (unlikely(IS_IMMUTABLE(inode)))
>>           return -EPERM;

Maybe we can move this up. If file is IMMUTABLE no point in
calling for above actions (file_remove_privs/file_updatetime).

also why not use file_modified() API which does the same.

>>
>> @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb 
>> *iocb, struct iov_iter *from)
>>       return iov_iter_count(from);
>>   }
>>
>> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>> +                    struct iov_iter *from)
>> +{
>> +    ssize_t ret;
>> +    struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +    if (iocb->ki_flags & IOCB_NOWAIT)
>> +        return -EOPNOTSUPP;
>> +
>> +    if (!inode_trylock(inode))
>> +        inode_lock(inode);

Is it really needed to check for trylock first?
we can directly call for inode_lock() here.


>> +
>> +    ret = ext4_write_checks(iocb, from);
>> +    if (ret <= 0)
>> +        goto out;
>> +
>> +    current->backing_dev_info = inode_to_bdi(inode);
>> +    ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
>> +    current->backing_dev_info = NULL;
>> +out:
>> +    inode_unlock(inode);
>> +    if (likely(ret > 0)) {
>> +        iocb->ki_pos += ret;
>> +        ret = generic_write_sync(iocb, ret);
>> +    }
>> +    return ret;
>> +}
>> +
>>   static int ext4_handle_inode_extension(struct inode *inode, loff_t 
>> offset,
>>                          ssize_t len, size_t count)
>>   {
>> @@ -311,6 +348,118 @@ static int 
>> ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>>       return ret;
>>   }
>>
>> +/*
>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>> + * wait for the write to complete. Consequently, operations performed
>> + * within this function are still covered by the inode_lock().
>> + */
> Maybe add a comment that on success this returns 0.
> 
>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, 
>> int error,
>> +                 unsigned int flags)
>> +{
>> +    int ret = 0;
> No need to initialize ret.
> 
> 
>> +    loff_t offset = iocb->ki_pos;
>> +    struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +    if (error) {
>> +        ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> +        return ret ? ret : error;
>> +    }
>> +
>> +    if (flags & IOMAP_DIO_UNWRITTEN) {
>> +        ret = ext4_convert_unwritten_extents(NULL, inode,
>> +                             offset, size);
>> +        if (ret)
>> +            return ret;
>> +    }
>> +
>> +    if (offset + size > i_size_read(inode)) {
>> +        ret = ext4_handle_inode_extension(inode, offset, size, 0);
>> +        if (ret)
>> +            return ret;
>> +    }
>> +    return ret;
> Directly return 0, since if it falls here it mans it is a success case.
> You are anyway returning error from above error paths.
> 
>> +}
>> +
>> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct 
>> iov_iter *from)
>> +{
>> +    ssize_t ret;
>> +    loff_t offset = iocb->ki_pos;
>> +    size_t count = iov_iter_count(from);
>> +    struct inode *inode = file_inode(iocb->ki_filp);
>> +    bool extend = false, overwrite = false, unaligned_aio = false;
>> +
>> +    if (!inode_trylock(inode)) {
>> +        if (iocb->ki_flags & IOCB_NOWAIT)
>> +            return -EAGAIN;
>> +        inode_lock(inode);
>> +    }
>> +
>> +    if (!ext4_dio_checks(inode)) {
>> +        inode_unlock(inode);
>> +        /*
>> +         * Fallback to buffered IO if the operation on the
>> +         * inode is not supported by direct IO.
>> +         */
>> +        return ext4_buffered_write_iter(iocb, from);
>> +    }
>> +
>> +    ret = ext4_write_checks(iocb, from);
This can modify the count in iov_iter *from.

>> +    if (ret <= 0) {
>> +        inode_unlock(inode);
>> +        return ret;
>> +    }
let's recalculate count = iov_iter_count(from);

>> +
>> +    /*
>> +     * Unaligned direct AIO must be serialized among each other as
>> +     * the zeroing of partial blocks of two competing unaligned
>> +     * AIOs can result in data corruption.
>> +     */
>> +    if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> +        !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, 
>> offset)) {
>> +        unaligned_aio = true;
>> +        inode_dio_wait(inode);
>> +    }
>> +
>> +    /*
>> +     * Determine whether the IO operation will overwrite allocated
>> +     * and initialized blocks. If so, check to see whether it is
>> +     * possible to take the dioread_nolock path.
>> +     */
>> +    if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&count here could be the old one.


>> +        ext4_should_dioread_nolock(inode)) {
>> +        overwrite = true;
>> +        downgrade_write(&inode->i_rwsem);
>> +    }
>> +
>> +    if (offset + count > i_size_read(inode) ||
>> +        offset + count > EXT4_I(inode)->i_disksize) {
ditto.

>> +        ext4_update_i_disksize(inode, inode->i_size);
>> +        extend = true;
>> +    }
>> +
>> +    ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, 
>> ext4_dio_write_end_io);
>> +
>> +    /*
>> +     * Unaligned direct AIO must be the only IO in flight or else
>> +     * any overlapping aligned IO after unaligned IO might result
>> +     * in data corruption. We also need to wait here in the case
>> +     * where the inode is being extended so that inode extension
>> +     * routines in ext4_dio_write_end_io() are covered by the
>> +     * inode_lock().
>> +     */
>> +    if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
>> +        inode_dio_wait(inode);
>> +
>> +    if (overwrite)
>> +        inode_unlock_shared(inode);
>> +    else
>> +        inode_unlock(inode);
>> +
>> +    if (ret >= 0 && iov_iter_count(from))
>> +        return ext4_buffered_write_iter(iocb, from);
>> +    return ret;
>> +}
>> +
>>   #ifdef CONFIG_FS_DAX
>>   static ssize_t
>>   ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> @@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct 
>> iov_iter *from)
>>               return -EAGAIN;
>>           inode_lock(inode);
>>       }
>> +
>>       ret = ext4_write_checks(iocb, from);
>>       if (ret <= 0)
>>           goto out;
>> -    ret = file_remove_privs(iocb->ki_filp);
>> -    if (ret)
>> -        goto out;
>> -    ret = file_update_time(iocb->ki_filp);
>> -    if (ret)
>> -        goto out;
>>
>>       offset = iocb->ki_pos;
>>       ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>> @@ -359,73 +503,16 @@ static ssize_t
>>   ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   {
>>       struct inode *inode = file_inode(iocb->ki_filp);
>> -    int o_direct = iocb->ki_flags & IOCB_DIRECT;
>> -    int unaligned_aio = 0;
>> -    int overwrite = 0;
>> -    ssize_t ret;
>>
>>       if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>>           return -EIO;
>>
>> -#ifdef CONFIG_FS_DAX
>>       if (IS_DAX(inode))
>>           return ext4_dax_write_iter(iocb, from);
>> -#endif
>> -    if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
>> -        return -EOPNOTSUPP;
>>
>> -    if (!inode_trylock(inode)) {
>> -        if (iocb->ki_flags & IOCB_NOWAIT)
>> -            return -EAGAIN;
>> -        inode_lock(inode);
>> -    }
>> -
>> -    ret = ext4_write_checks(iocb, from);
>> -    if (ret <= 0)
>> -        goto out;
>> -
>> -    /*
>> -     * Unaligned direct AIO must be serialized among each other as 
>> zeroing
>> -     * of partial blocks of two competing unaligned AIOs can result 
>> in data
>> -     * corruption.
>> -     */
>> -    if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> -        !is_sync_kiocb(iocb) &&
>> -        ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
>> -        unaligned_aio = 1;
>> -        ext4_unwritten_wait(inode);
>> -    }
>> -
>> -    iocb->private = &overwrite;
>> -    /* Check whether we do a DIO overwrite or not */
>> -    if (o_direct && !unaligned_aio) {
>> -        if (ext4_overwrite_io(inode, iocb->ki_pos, 
>> iov_iter_count(from))) {
>> -            if (ext4_should_dioread_nolock(inode))
>> -                overwrite = 1;
>> -        } else if (iocb->ki_flags & IOCB_NOWAIT) {
>> -            ret = -EAGAIN;
>> -            goto out;
>> -        }
>> -    }
>> -
>> -    ret = __generic_file_write_iter(iocb, from);
>> -    /*
>> -     * Unaligned direct AIO must be the only IO in flight. Otherwise
>> -     * overlapping aligned IO after unaligned might result in data
>> -     * corruption.
>> -     */
>> -    if (ret == -EIOCBQUEUED && unaligned_aio)
>> -        ext4_unwritten_wait(inode);
>> -    inode_unlock(inode);
>> -
>> -    if (ret > 0)
>> -        ret = generic_write_sync(iocb, ret);
>> -
>> -    return ret;
>> -
>> -out:
>> -    inode_unlock(inode);
>> -    return ret;
>> +    if (iocb->ki_flags & IOCB_DIRECT)
>> +        return ext4_dio_write_iter(iocb, from);
>> +    return ext4_buffered_write_iter(iocb, from);
>>   }
>>
>>   #ifdef CONFIG_FS_DAX
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index efb184928e51..f52ad3065236 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>               }
>>           }
>>       } else if (flags & IOMAP_WRITE) {
>> -        int dio_credits;
>>           handle_t *handle;
>> -        int retries = 0;
>> +        int dio_credits, retries = 0, m_flags = 0;
>>
>> -        /* Trim mapping request to maximum we can map at once for DIO */
>> +        /*
>> +         * Trim mapping request to maximum we can map at once
>> +         * for DIO.
>> +         */
>>           if (map.m_len > DIO_MAX_BLOCKS)
>>               map.m_len = DIO_MAX_BLOCKS;
>>           dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
>> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>           if (IS_ERR(handle))
>>               return PTR_ERR(handle);
>>
>> -        ret = ext4_map_blocks(handle, inode, &map,
>> -                      EXT4_GET_BLOCKS_CREATE_ZERO);
>> +        /*
>> +         * DAX and direct IO are the only two operations that
>> +         * are currently supported with IOMAP_WRITE.
>> +         */
>> +        WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
>> +        if (IS_DAX(inode))
>> +            m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> +        else if (round_down(offset, i_blocksize(inode)) >=
>> +             i_size_read(inode))
>> +            m_flags = EXT4_GET_BLOCKS_CREATE;
>> +        else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> +            m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>> +
>> +        ret = ext4_map_blocks(handle, inode, &map, m_flags);
>> +
>> +        /*
>> +         * We cannot fill holes in indirect tree based inodes
>> +         * as that could expose stale data in the case of a
>> +         * crash. Use the magic error code to fallback to
>> +         * buffered IO.
>> +         */
> 
> I like this comment ;)
> Help others to understand what is really going on here.
> 
>> +        if (!m_flags && !ret)
>> +            ret = -ENOTBLK;
>> +
>>           if (ret < 0) {
>>               ext4_journal_stop(handle);
>>               if (ret == -ENOSPC &&
>> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>           }
>>
>>           /*
>> -         * If we added blocks beyond i_size, we need to make sure they
>> -         * will get truncated if we crash before updating i_size in
>> -         * ext4_iomap_end(). For faults we don't need to do that (and
>> -         * even cannot because for orphan list operations inode_lock is
>> -         * required) - if we happen to instantiate block beyond i_size,
>> -         * it is because we race with truncate which has already added
>> -         * the inode to the orphan list.
>> +         * If we added blocks beyond i_size, we need to make
>> +         * sure they will get truncated if we crash before
>> +         * updating the i_size. For faults we don't need to do
>> +         * that (and even cannot because for orphan list
>> +         * operations inode_lock is required) - if we happen
>> +         * to instantiate block beyond i_size, it is because
>> +         * we race with truncate which has already added the
>> +         * inode to the orphan list.
>>            */
>>           if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
>>               (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
>> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode 
>> *inode, loff_t offset, loff_t length,
>>   static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t 
>> length,
>>                 ssize_t written, unsigned flags, struct iomap *iomap)
>>   {
>> +    /*
>> +     * Check to see whether an error occurred while writing data
>> +     * out to allocated blocks. If so, return the magic error code
>> +     * so that we fallback to buffered IO and reuse the blocks
>> +     * that were allocated in preparation for the direct IO write.
>> +     */
>> +    if (flags & IOMAP_DIRECT && written == 0)
>> +        return -ENOTBLK;
>>       return 0;
>>   }
>>
>