linux-ext4 - Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Wed, 20 Sep 2023 17:08:19 +0530
From:   Ritesh Harjani (IBM) <ritesh.list@...il.com>
To:     Jan Kara <jack@...e.cz>, Gao Xiang <hsiangkao@...ux.alibaba.com>
Cc:     linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>,
        Jan Kara <jack@...e.cz>,
        Matthew Bobrowski <mbobrowski@...browski.org>,
        Christoph Hellwig <hch@....de>,
        Joseph Qi <joseph.qi@...ux.alibaba.com>
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion

Jan Kara <jack@...e.cz> writes:

> Hello!
>
> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>> Our consumer reports a behavior change between pre-iomap and iomap
>> direct io conversion:
>> 
>> If the system crashes after an appending write to a file open with
>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>> O_SYNC was marked before.
>> 
>> It can be reproduced by a test program in the attachment with
>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>> 
>> After some analysis, we found that before iomap direct I/O conversion,
>> the timing was roughly (taking Linux 3.10 codebase as an example):
>> 
>> 	..
>> 	- ext4_file_dio_write
>> 	  - __generic_file_aio_write
>> 	      ..
>> 	    - ext4_direct_IO  # generic_file_direct_write
>> 	      - ext4_ext_direct_IO
>> 	        - ext4_ind_direct_IO  # final_size > inode->i_size
>> 	          - ..
>> 	          - ret = blockdev_direct_IO()
>> 	          - i_size_write(inode, end) # orphan && ret > 0 &&
>> 	                                   # end > inode->i_size
>> 	          - ext4_mark_inode_dirty()
>> 	          - ...
>> 	  - generic_write_sync  # handling O_SYNC
>> 
>> So the dirty inode meta will be committed into journal immediately
>> if O_SYNC is set.  However, After commit 569342dc2485 ("ext4: move
>> inode extension/truncate code out from ->iomap_end() callback"),
>> the new behavior seems as below:
>> 
>> 	..
>> 	- ext4_dio_write_iter
>> 	  - ext4_dio_write_checks  # extend = 1
>> 	  - iomap_dio_rw
>> 	      - __iomap_dio_rw
>> 	      - iomap_dio_complete
>> 	        - generic_write_sync
>> 	  - ext4_handle_inode_extension  # extend = 1

Yes, since ext4_handle_inode_extension() will handle inode i_disksize
update and mark the inode dirty, generic_write_sync() call should happen
after that.

That also means then we don't have any generic FS testcase which can validate
this behaviour. 

>> 
>> So that i_size will be recorded only after generic_write_sync() is
>> called.  So O_SYNC won't flush the update i_size to the disk.
>
> Indeed, that looks like a bug. Thanks for report!
>
>> On the other side, after a quick look of XFS side, it will record
>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>> have this problem.
>
> Yes, I'm a bit hazy on the details but I think we've decided to call
> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
> from ext4_dio_write_end_io() it was difficult to test in a race-free way
> whether extending i_size (and i_disksize) is needed or not (we don't
> necessarily hold i_rwsem there).

We do hold i_rwsem in exclusive write mode for file extend case.
(ext4_dio_write_checks()).

IIUC, ext4_handle_inode_extension() takes "written" and "count" as it's
argument. This means that "count" bytes were mapped, but only "written"
bytes were written. This information is used in
ext4_handle_inode_extension() case for truncating blocks beyond EOF. 

I also found this discussion here [1].

[1]:
https://lore.kernel.org/linux-ext4/20191008151238.GK5078@quack2.suse.cz/

>From this thread it looks like we decided to move
ext4_handle_inode_extension() case out of ->end_io callback after v4
series (in v5) to handle above case. Right?

Do let me know if you think it was due to something else.

> I'll think how we could fix the problem you've reported.
>

1. I was thinking why do we need to truncate those blocks which are beyond
EOF for DIO case? Wasn't there an argument, that for short DIO writes,
we can use the remaining blocks allocated to be written by buffered-io,
right? Do we risk exposing anything in doing so?
We do fallback to buffered-io for short writes in ext4_dio_write_iter().

2. Either ways let's say we still would like to call truncate. Then can we
move ext4_truncate() logic out of ext4_handle_inode_extension() and call
ext4_handle_inode_extension() from within ->end_io callback. 
ext4_truncate() can be then done in the main routine directly i.e. in
ext4_dio_write_iter() where we do have both "count" and "written" information.

Thoughts?

-ritesh

> 								Honza
> -- 
> Jan Kara <jack@...e.com>
> SUSE Labs, CR