Date:   Tue, 7 May 2019 17:40:09 -0400
From:   "Theodore Ts'o" <tytso@....edu>
To:     Probir Roy <proy.cse@...il.com>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: Locality of extent status tree traversal

On Tue, May 07, 2019 at 01:56:27PM -0500, Probir Roy wrote:
> > block number (e.g., the location on disk).  It's a cache; lookups
> > are fast, and it's an in-memory lookup.  Well, it's a little more
> > than a cache; it also stores some information for delayed
> > allocation buffered writes.
> 
> For sequential access, it will traverse almost the same path through
> the tree.  How deep can the extent status tree be in general?  If
> the tree is much deeper, sequential accesses would repeatedly
> traverse many of the same nodes during the lookup.  Have you
> observed a significant bottleneck in "ext4_es_lookup_extent"?  Can
> it be removed by caching the parent node?

You do realize that the extent status tree is separate from the
on-disk ext4 extent tree, right?

The extent status tree is an in-memory cache, and it caches logical
extents.  On-disk physical extents, by contrast, are limited (for
historical reasons) to 32,767 blocks per on-disk extent entry.  So if
you have a contiguous range of 128,000 blocks, it will require 4
on-disk extents in the ext4 extent tree.
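
As a quick sanity check on that arithmetic, here's a minimal
userspace sketch (EXT4_MAX_EXTENT_LEN here is just the 32,767-block
limit quoted above, not the kernel's actual macro):

#include <stdio.h>

#define EXT4_MAX_EXTENT_LEN 32767ULL	/* assumed limit from the text */

/* How many on-disk extents a fully contiguous run of blocks needs,
 * given a 32,767-block cap per extent entry. */
static unsigned long long extents_needed(unsigned long long blocks)
{
	return (blocks + EXT4_MAX_EXTENT_LEN - 1) / EXT4_MAX_EXTENT_LEN;
}

int main(void)
{
	/* 128,000 contiguous blocks -> prints 4 */
	printf("%llu\n", extents_needed(128000ULL));
	return 0;
}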

The extent status tree, being an in-memory data structure, will
collapse those 4 on-disk extents (assuming they are physically and
logically contiguous) into a single in-memory entry.  So the depth of
the extent status tree very much depends on how fragmented the data
blocks are for the file in question.  If the file is 100% contiguous,
and pre-allocated in advance, it could be 12TB long and still take
only a single entry in the extent status tree.  If the file is very
highly fragmented, then of course both the in-memory extent status
tree and the on-disk extent tree can be quite large.
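
To make "physically and logically contiguous" concrete, here's a toy
sketch of the merge test (es_extent and es_can_merge are hypothetical
names, not the actual ext4 code):

#include <stdio.h>

/* Hypothetical cached extent: logical blocks [lblk, lblk+len)
 * map to physical blocks starting at pblk. */
struct es_extent {
	unsigned long long lblk;	/* first logical block */
	unsigned long long pblk;	/* first physical block */
	unsigned long long len;		/* length in blocks */
};

/* Two entries can collapse into one only if they are contiguous
 * both logically and physically. */
static int es_can_merge(const struct es_extent *a,
			const struct es_extent *b)
{
	return a->lblk + a->len == b->lblk &&
	       a->pblk + a->len == b->pblk;
}

int main(void)
{
	/* Blocks 0-99 -> 1000-1099, then 100-199 -> 1100-1199. */
	struct es_extent a = { 0, 1000, 100 };
	struct es_extent b = { 100, 1100, 100 };

	printf("%s\n", es_can_merge(&a, &b) ? "merge" : "keep separate");
	return 0;
}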

It's true that if the tree is very deep, then you might have to do
many traversals of the red-black tree.  But that's because the file
is super fragmented.  If we didn't have the extent status cache, then
we'd have to read in portions of the on-disk extent tree.  That tree
has a larger fanout, so it's wider, as well as being less deep.  But
if you are talking about the number of memory accesses needed to
traverse the extent tree, it's going to be roughly the same as for
the red-black tree.  In either case, it's going to be O(log N) in the
number of extents in the highly fragmented file.
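
For reference, the lookup cost in question is just a binary-tree
descent; here's a generic sketch (not the actual
ext4_es_lookup_extent() code, and with hypothetical node fields):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical node in a balanced (e.g., red-black) tree of cached
 * extents, keyed by the first logical block each extent covers. */
struct es_node {
	struct es_node *left, *right;
	unsigned long long lblk;	/* first logical block covered */
	unsigned long long len;		/* extent length in blocks */
};

/* O(log N) descent for a balanced tree of N cached extents. */
static struct es_node *es_lookup(struct es_node *root,
				 unsigned long long lblk)
{
	while (root) {
		if (lblk < root->lblk)
			root = root->left;
		else if (lblk >= root->lblk + root->len)
			root = root->right;
		else
			return root;	/* lblk falls inside this extent */
	}
	return NULL;			/* not in the cache */
}

int main(void)
{
	struct es_node leaf = { NULL, NULL, 200, 50 };	/* blocks 200-249 */
	struct es_node root = { NULL, &leaf, 0, 100 };	/* blocks 0-99 */

	printf("%s\n", es_lookup(&root, 225) ? "cache hit" : "cache miss");
	return 0;
}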

So let's back up.  Why are you so concerned about potential
bottlenecks in ext4_es_lookup_extent()?  What is your workload?  How
badly fragmented is your file?  And have you considered taking
measures (such as preallocating the file using fallocate, and
possibly pre-initializing the file) to improve things?  If you are
doing something nasty such as a random write workload using buffered
writes, without preallocating the file, then yeah, things can get
pretty nasty.  But the problem isn't going to be confined to the
extent status tree; it's going to show up in many other places as
well.
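
For what it's worth, the preallocation suggestion is basically a
one-liner with fallocate(2); a minimal sketch (the file name and the
1 GiB size are just placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("datafile", O_CREAT | O_WRONLY, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Reserve 1 GiB up front; mode 0 also extends i_size, so later
	 * random writes land in a few large, contiguous extents. */
	if (fallocate(fd, 0, 0, 1ULL << 30) != 0)
		perror("fallocate");
	close(fd);
	return 0;
}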

						- Ted
