linux-ext4 - Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read


Date: Wed, 6 Dec 2023 21:34:49 +1100
From: Dave Chinner <david@...morbit.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: Baokun Li <libaokun1@...wei.com>, Jan Kara <jack@...e.cz>,
	linux-mm@...ck.org, linux-ext4@...r.kernel.org, tytso@....edu,
	adilger.kernel@...ger.ca, willy@...radead.org,
	akpm@...ux-foundation.org, ritesh.list@...il.com,
	linux-kernel@...r.kernel.org, yi.zhang@...wei.com,
	yangerkun@...wei.com, yukuai3@...wei.com
Subject: Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending
 DIO write race with buffered read

On Wed, Dec 06, 2023 at 01:02:43AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote:
> > Mixing overlapping buffered read with direct writes - especially partial block
> > extending DIO writes - is a recipe for data corruption. It's not a
> > matter of if, it's a matter of when.
> > 
> > Fundamentally, when you have overlapping write IO involving DIO, the
> > result of the overlapping IOs is undefined. One cannot control
> > submission order, the order that the overlapping IO hit the
> > media, or completion ordering that might clear flags like unwritten
> > extents. The only guarantee that we give in this case is that we
> > won't expose stale data from the disk to the user read.
> 
> Btw, one thing we could do to kill these races forever is to track if
> there are any buffered openers for an inode and just fall back to
> buffered I/O for that case.  With that and an inode_dio_wait for
> when opening for buffered I/O we'd avoid the races and various crazy
> workarounds entirely.

That's basically what Solaris did 20-25 years ago. The inode held a
flag that indicated what IO was being done, and if the "buffered"
flag was set (either through mmap() based access or buffered
read/write syscalls) then direct IO would also do buffered IO
until the flag was cleared and the cache cleaned and invalidated.
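
As a rough illustration of that per-inode mode switch, here is a minimal
userspace sketch. It is a toy model only: every name in it (toy_inode,
toy_buffered_access, toy_dio_path, toy_clean_and_invalidate) is
hypothetical and is not taken from Solaris or from any Linux kernel API,
and it ignores locking and in-flight IO entirely.

```c
#include <stdbool.h>

/* Hypothetical toy model; none of these names are Solaris or Linux APIs. */
enum toy_io_path { TOY_PATH_DIRECT, TOY_PATH_BUFFERED };

struct toy_inode {
    bool buffered_mode;        /* set by mmap() or buffered read/write */
    unsigned int cached_pages; /* pages currently held in the cache */
};

/* Any buffered access populates the cache and flips the inode to buffered mode. */
static void toy_buffered_access(struct toy_inode *ino)
{
    ino->buffered_mode = true;
    ino->cached_pages++;
}

/* Direct IO is honoured only while the flag is clear; otherwise the
 * request is served through the cache like a buffered one. */
static enum toy_io_path toy_dio_path(const struct toy_inode *ino)
{
    return ino->buffered_mode ? TOY_PATH_BUFFERED : TOY_PATH_DIRECT;
}

/* The flag clears only once the cache has been cleaned and invalidated. */
static void toy_clean_and_invalidate(struct toy_inode *ino)
{
    ino->cached_pages = 0;
    ino->buffered_mode = false;
}
```

In this model a single toy_buffered_access() call changes the path taken
by every later direct IO request on that inode until
toy_clean_and_invalidate() runs, which is the sudden mode switch the
application has no control over.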

That had .... problems.

Largely they were performance problems - unpredictable IO latency
and CPU overhead for IO meant applications would randomly miss SLAs.
The application would see IO suddenly lose all concurrency, go real
slow and/or burn lots more CPU when the inode switched to buffered
mode.

I'm not sure that's a particularly viable model given the raw IO
throughput of even cheap modern SSDs largely exceeds the capability of
buffered IO through the page cache. The differences in concurrency,
latency and throughput between buffered and DIO modes will be even
more stark today than they were 20 years ago....

-Dave.
-- 
Dave Chinner
david@...morbit.com