linux-ext4 - Re: [PATCHv4 2/3] ext4: Start with shared i_rwsem in case of DIO instead of exclusive

Date:   Thu, 5 Dec 2019 19:10:06 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>
Cc:     tytso@....edu, linux-ext4@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, mbobrowski@...browski.org,
        joseph.qi@...ux.alibaba.com
Subject: Re: [PATCHv4 2/3] ext4: Start with shared i_rwsem in case of DIO
 instead of exclusive

Hello Jan,

Thanks a lot for your reviews.

On 12/5/19 5:33 PM, Jan Kara wrote:
> On Thu 05-12-19 12:16:23, Ritesh Harjani wrote:
>> Earlier there was no shared lock in DIO read path. But this patch
>> (16c54688592ce: ext4: Allow parallel DIO reads)
>> simplified some of the locking mechanism while still allowing for parallel DIO
>> reads by adding shared lock in inode DIO read path.
>>
>> But this created problem with mixed read/write workload. It is due to the fact
>> that in DIO path, we first start with exclusive lock and only when we determine
> >> that it is an overwrite IO, we downgrade the lock. This causes the problem, since
>> we still have shared locking in DIO reads.
>>
>> So, this patch tries to fix this issue by starting with shared lock and then
>> switching to exclusive lock only when required based on ext4_dio_write_checks().
>>
>> Other than that, it also simplifies below cases:-
>>
> >> 1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was
> >> abused in the sense that it was not really checking for AIO anywhere; it also
> >> used to check for extending writes. So this API was renamed and simplified to
> >> ext4_unaligned_io(), which actually only checks if the IO is really unaligned.
>>
>> Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial
>> block and that will require serialization against other direct IOs in the same
> >> block. So we take an exclusive inode lock for any unaligned DIO. In case of AIO
> >> we also need to wait for any outstanding IOs to complete so that conversion from
> >> unwritten to written is completed before anyone tries to map the overlapping block.
>> Hence we take exclusive inode lock and also wait for inode_dio_wait() for
>> unaligned DIO case. Please note since we are anyway taking an exclusive lock in
>> unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.
>>
>> 2. Added ext4_extending_io(). This checks if the IO is extending the file.
>>
>> 3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
>> only switch to exclusive lock if required. So in most cases with aligned,
>> non-extending, dioread_nolock & overwrites, it tries to write with a shared
>> lock. If not, then we restart the operation in ext4_dio_write_checks(), after
>> acquiring exclusive lock.
>>
>> Signed-off-by: Ritesh Harjani <riteshh@...ux.ibm.com>
> 
> Cool, the patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@...e.cz>

great!
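
For readers following the thread, the two predicates named in the patch
description (unaligned IO and extending IO) can be illustrated with a small
userspace sketch. The demo_* names and parameters are hypothetical and are
not the ext4 helpers; in particular, this sketch only looks at the file
position and length, and assumes block_size is a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical analogue of an "unaligned IO" test: true if the write
 * does not both start and end on a block boundary.
 * Assumption: block_size is a power of two.
 */
static bool demo_unaligned_io(uint64_t pos, size_t len, uint64_t block_size)
{
	uint64_t blockmask = block_size - 1;

	return ((pos | (uint64_t)len) & blockmask) != 0;
}

/*
 * Hypothetical analogue of an "extending IO" test: true if the write
 * ends beyond the current file size.
 */
static bool demo_extending_io(uint64_t pos, size_t len, uint64_t file_size)
{
	return pos + (uint64_t)len > file_size;
}
```

With a 4096-byte block, a 4096-byte write at offset 4096 is aligned, while a
512-byte write at the same offset is not and would fall into the case that
needs the exclusive lock.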

> 
> Two small nits below:
> 
>> -static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>> +static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
>> +					 struct iov_iter *from)
>>   {
>>   	struct inode *inode = file_inode(iocb->ki_filp);
>>   	ssize_t ret;
>> @@ -228,11 +235,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>   		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
>>   	}
>>   
>> +	return iov_iter_count(from);
>> +}
> 
> You return iov_iter_count() from ext4_generic_write_checks()...
> 
>> +static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>> +				     bool *ilock_shared, bool *extend)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct inode *inode = file_inode(file);
>> +	loff_t offset;
>> +	size_t count;
>> +	ssize_t ret;
>> +
>> +restart:
>> +	ret = ext4_generic_write_checks(iocb, from);
>> +	if (ret <= 0)
>> +		goto out;
>> +
>> +	offset = iocb->ki_pos;
>> +	count = iov_iter_count(from);
> 
> But you don't use the returned count here and just call iov_iter_count()
> again (which is cheap anyway but still it's strange).

Yes. iov_iter_count() (as you also said) is anyway an inline function
which only reads from->count, so it comes at no cost.
But re-assigning a ssize_t value to size_t is something I was
uncomfortable with. Although I agree that it should be completely fine
here, I was not convinced to use that instead of directly calling
iov_iter_count(), for readability reasons.

But unless you feel otherwise, I could make those changes at 2 places
which you mentioned.
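
The ssize_t to size_t point can be shown with a minimal userspace sketch.
The demo_* functions below are hypothetical stand-ins, not the ext4 code;
they only illustrate that once the signed return value has passed the
ret <= 0 check, it is strictly positive and storing it in a size_t cannot
wrap.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for a write-checks helper (not the ext4 code):
 * returns a negative errno on failure, 0 when there is nothing to write,
 * or the number of bytes that may be written, clamped to max_bytes.
 */
static ssize_t demo_write_checks(size_t requested, size_t max_bytes, int fail)
{
	if (fail)
		return -EINVAL;
	if (requested > max_bytes)
		requested = max_bytes;
	return (ssize_t)requested;
}

/*
 * Caller pattern: test the signed return value first, and only then
 * store it in an unsigned size_t. Returns 0 on success with *count set,
 * 0 with *count untouched when there is nothing to write, or a negative
 * errno on failure.
 */
static int demo_caller(size_t requested, size_t max_bytes, int fail,
		       size_t *count)
{
	ssize_t ret = demo_write_checks(requested, max_bytes, fail);

	if (ret <= 0)
		return (int)ret;
	*count = (size_t)ret;	/* ret > 0 here, so no wraparound */
	return 0;
}
```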

> 
>> +	if (ext4_extending_io(inode, offset, count))
>> +		*extend = true;
>> +	/*
>> +	 * Determine whether the IO operation will overwrite allocated
>> +	 * and initialized blocks. If so, check to see whether it is
>> +	 * possible to take the dioread_nolock path.
>> +	 *
>> +	 * We need exclusive i_rwsem for changing security info
>> +	 * in file_modified().
>> +	 */
>> +	if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
>> +	     !ext4_should_dioread_nolock(inode) ||
>> +	     !ext4_overwrite_io(inode, offset, count))) {
>> +		inode_unlock_shared(inode);
>> +		*ilock_shared = false;
>> +		inode_lock(inode);
>> +		goto restart;
>> +	}
>> +
>> +	ret = file_modified(file);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	return count;
> 
> And then you return count from ext4_dio_write_checks() here...

ditto
> 
>> -	ret = ext4_write_checks(iocb, from);
>> -	if (ret <= 0) {
>> -		inode_unlock(inode);
>> +	ret = ext4_dio_write_checks(iocb, from, &ilock_shared, &extend);
>> +	if (ret <= 0)
>>   		return ret;
>> -	}
>>   
>> -	/*
>> -	 * Unaligned asynchronous direct I/O must be serialized among each
>> -	 * other as the zeroing of partial blocks of two competing unaligned
>> -	 * asynchronous direct I/O writes can result in data corruption.
>> -	 */
>>   	offset = iocb->ki_pos;
>>   	count = iov_iter_count(from);
> 
> And then again just don't use the value here...

ditto
> 
> 								Honza
>