linux-ext4 - Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 27 Nov 2014 14:14:21 +0100
From:	Jan Kara <jack@...e.cz>
To:	Theodore Ts'o <tytso@....edu>
Cc:	Linux Filesystem Development List <linux-fsdevel@...r.kernel.org>,
	Ext4 Developers List <linux-ext4@...r.kernel.org>,
	Linux btrfs Developers List <linux-btrfs@...r.kernel.org>,
	XFS Developers <xfs@....sgi.com>
Subject: Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

On Wed 26-11-14 05:23:52, Ted Tso wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
> 
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests via a fsync(2) call.
> 
> For workloads which feature a large number of random write to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDD's, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
  So this looks better to me than previous versions but I'm still not 100%
happy :)

Looking into the code & your patch I'd prefer to do something like:
* add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
  call __mark_inode_dirty() with this flag if any of the times was updated.
  That way we can just remove your ->write_time() callback - filesystems
  can just handle this in their ->dirty_inode() methods if they wish.
  __mark_inode_dirty() will take care of moving inode into proper writeback
  list (i_dirty / i_dirty_time), dirtied_when will be set to current time.
* change queue_io() to also call
	moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 24hours)
  For this you need to tweak move_expired_inodes() to take pointer to
  timestamp instead of pointer to work but that's trivial. Also you want
  probably leave time ->older_than_this value (i.e. without +24 hours) if
  we are doing WB_SYNC_ALL writeback. With this you can remove
  flush_sb_dirty_time() completely.
* Changes for iput() & fsync stay as they are.

And this should be all that's necessary. I'm not 100% sure about your dirty
bits naming changes - let's see how that will look like when the above more
substantial changes are done.

One technical detail below:

> diff --git a/fs/inode.c b/fs/inode.c
> index 8f5c4b5..9e464cc 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1430,11 +1430,22 @@ static void iput_final(struct inode *inode)
>   */
>  void iput(struct inode *inode)
>  {
> -	if (inode) {
> -		BUG_ON(inode->i_state & I_CLEAR);
> -
> -		if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
> -			iput_final(inode);
> +	if (!inode)
> +		return;
> +	BUG_ON(inode->i_state & I_CLEAR);
  I think we can better handle this without retry at this place like:
	if (atomic_read(&inode->i_count) == 1 && inode->i_nlink &&
	    (inode->i_state & I_DIRTY_TIME)) {
		if (inode->i_op->write_time)
			inode->i_op->write_time(inode);
		else if (inode->i_sb->s_op->write_inode)
			mark_inode_dirty_sync(inode);
	}
  Sure it will be one more read of i_count in the fast path but that's IMO
negligible.

BTW: Is the test for ->write_inode really needed? We don't do it e.g. in
update_time().

> +retry:
> +	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> +		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> +			atomic_inc(&inode->i_count);
> +			inode->i_state &= ~I_DIRTY_TIME;
> +			spin_unlock(&inode->i_lock);
> +			if (inode->i_op->write_time)
> +				inode->i_op->write_time(inode);
> +			else if (inode->i_sb->s_op->write_inode)
> +				mark_inode_dirty_sync(inode);
> +			goto retry;
> +		}
> +		iput_final(inode);
>  	}
>  }
>  EXPORT_SYMBOL(iput);

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html