Date:   Thu, 26 Jul 2018 19:47:10 +0200
From:   Jaco Kroon <jaco@....co.za>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     Jan Kara <jack@...e.cz>, linux-ext4 <linux-ext4@...r.kernel.org>,
        Theodore Ts'o <tytso@....edu>
Subject: Re: allowing ext4 file systems that wrapped inode count to continue
 working

Hi Andreas, Ted,

Ted, you mostly just expanded on Andreas's information regarding
reducing the filesystem to "sane" state.  Specifically useful
information on dropping the last group.  This may well come in useful. 
Whilst I'm working in a test environment at the moment, my real problem
is this:

# df -m /home
Filesystem            1M-blocks     Used  Available Use% Mounted on
/dev/mapper/lvm-home   66055848 65023779    1032053  99% /home

I really need to further expand that filesystem.  I can take it offline
for a few hours or so if there are no other options, but that's not
ideal (even getting to run umount when nothing is accessing it is a
scarce opportunity).  The VG on which it's contained does have 4.5TB
available for expansion; I just don't want to allocate that anywhere
until I have a known working strategy.

I'll respond to Andreas's information below.  Please do keep in mind
that whilst I'm a long-time Linux user (nearly 20 years) with a
sensible amount of development experience, I'm by no means a filesystem
(not to mention ext4) expert, and I may well misinterpret some of the
available information and bark up wrong trees here.

On 24/07/2018 18:33, Andreas Dilger wrote:
> On Jul 24, 2018, at 9:00 AM, Jaco Kroon <jaco@....co.za> wrote:
>>
>> Hi,
>>
>> Related to https://www.spinics.net/lists/linux-ext4/msg61075.html (and
>> possibly the cause of the work from Jan in that patch series).
>>
>> I have a 64TB (exactly) filesystem.
>>
>> Filesystem OS type:       Linux
>> Inode count:              4294967295
>> Block count:              17179869184
>> Reserved block count:     689862348
>> Free blocks:              16910075355
>> Free inodes:              4294966285
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         8192
>> Inode blocks per group:   512
>> RAID stride:              128
>> RAID stripe width:        128
>> First meta block group:   1152
>> Flex block group size:    16
>>
>> Note that in the above Inode count == 2^32-1 instead of the expected 2^32.
>>
>> This results in the correct inode count being exactly 2^32 (which
>> overflows to 0).  A kernel bug (fixed by Jan) allowed this overflow in
>> the first place.
>>
>> I'm busy trying to write a patch for e2fsck that would allow it to (on
>> top of the referenced series by Jan) enable fsck to at least clear the
>> filesystem of other errors, where currently if I hack the inode count
>> to ~0U fsck, tune2fs and friends fail.
>
> Probably the easiest way to move forward here would be to use debugfs
> to edit the superblock to reduce the blocks count by s_blocks_per_group
> and the inode count by (s_inodes_per_group - 1) so e2fsck doesn't think
> you have that last group at all.  This assumes that you do not have any
> inodes allocated in the last group, which is unlikely.  If you do, you
> could use "ncheck" to find the names of those files and copy them to
> some other part of the filesystem before editing the superblock.

This relates to Ted's information as well.  It's a working strategy for
the short term, i.e. the next few days.
>> With the attached patch (sorry, Thunderbird breaks my inlining of
>> patches) tune2fs operates (-l at least) as expected, and fsck gets to
>> pass5, where it segfaults with the following stack trace (compiled
>> with -O0):
>>
>> /dev/exp/exp contains a file system with errors, check forced.
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90,
>>     group=552320, bg_flag=1) at blknum.c:445
>> 445             return gdp->bg_flags & bg_flag;
>> (gdb) bt
>> #0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90,
>>     group=552320, bg_flag=1) at blknum.c:445
>> #1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
>> #2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
>> #3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
>> #4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
>> (gdb) print *gdp
>> $1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0, bg_inode_table = 528456,
>>   bg_free_blocks_count = 0, bg_free_inodes_count = 0, bg_used_dirs_count = 4000,
>>   bg_flags = 8, bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0,
>>   bg_inode_bitmap_csum_lo = 8, bg_itable_unused = 0, bg_checksum = 0,
>>   bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0, bg_inode_table_hi = 528512,
>>   bg_free_blocks_count_hi = 0, bg_free_inodes_count_hi = 0,
>>   bg_used_dirs_count_hi = 4280, bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
>>   bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0, bg_reserved = 0}
>>
>> ... so I'm not sure why it even segfaults.  gdb can retrieve a value
>> of 8 for bg_flags ... and yet, if the code does that, it segfaults.
>> So I'm not sure what the discrepancy is there - probably a
>> misunderstanding of what's going wrong, but the only thing I can see
>> that can segfault is the gdp dereference, and that seems to be a
>> valid pointer ...
>>
>> I am not sure if this is a separate issue, or due to me tampering with
>> the inode counter in the way that I am (I have to assume the latter).
>> For testing I created a thin volume (1TB) in a separate environment,
>> where I created a 16TB filesystem initially, and then expanded that to
>> 64TB, resulting in exactly the same symptoms we saw in the production
>> environment.  I created a thousand empty files in the root folder.
>> The filesystem is currently consuming 100GB on-disk in the thin
>> volume.  Note that group=552320 > 524288 (17179869184 / 32768).
>
> I was looking at the code in check_inode_bitmaps() and there are
> definitely some risky areas in the loop handling.  The bitmap end
> and bitmap start values are __u64, so they should be OK.  The loop
> counter "i" is only a 32-bit value, so it may overflow with 2^32-1
> inodes (or really 2^32 inodes in the table, even if the last one
> is not used).
>
> I think you need to figure out why the group counter has exceeded
> the actual number of groups.  It is likely that the segfault is
> justified by going beyond the end of the array, as there is no
> valid group data or inode table for the groups.
>
> One possibility is that the last group is marked EXT2_BG_INODE_UNINIT:
> then it will increment "i" by s_inodes_per_group and the loop
> condition will never be false.  Converting i to a __u64 variable
> may solve the problem in this code.
>
> While I'm not against fixing this, I also suspect that there are
> other parts of the code that may have similar problems, which may
> be dangerous if you are resizing the filesystem.

I confirmed the overflow occurs in the initial skip_group code.
Essentially it increments i by s_inodes_per_group-1, which takes it to
exactly 0.  The outer loop then increments that by 1, back to 1, and we
keep looping.  The reason for this is that the last group effectively
has one fewer inode than the other groups.

I've modified the skip_group code slightly:

--- a/e2fsck/pass5.c
+++ b/e2fsck/pass5.c
@@ -644,6 +644,8 @@ redo_counts:
                                group_free = inodes;
                                free_inodes += inodes;
                                i += inodes;
+                               if (i == 0 || i > fs->super->s_inodes_count)
+                                       i = fs->super->s_inodes_count;
                                skip_group = 0;
                                goto do_counts;
                        }

And now I'm getting:

Internal error: fudging end of bitmap (2)

If I understand correctly this comes from check_inode_end(),
specifically, the second loop.

end = EXT2_INODES_PER_GROUP(fs->super) * fs->group_desc_count;

This value should match s_inodes_count, right?  With:

@@ -841,7 +843,7 @@ static void check_inode_end(e2fsck_t ctx)
 
        clear_problem_context(&pctx);
 
-       end = EXT2_INODES_PER_GROUP(fs->super) * fs->group_desc_count;
+       end = fs->super->s_inodes_count;
        pctx.errcode = ext2fs_fudge_inode_bitmap_end(fs->inode_map, end,
                                                     &save_inodes_count);
        if (pctx.errcode) {

Then it just fails with "fudging end of bitmap (1)".

This leads me to believe that if s_inodes_count is reduced to 2^32 -
s_inodes_per_group instead of 2^32-1 (assuming those inodes are not in
use) then things should work - opinions?

How can I verify if any of those inodes are currently used?

>> Regarding further expansion, I would appreciate some advice; there
>> are two (three) possible options that I could come up with:
>>
>> 1. Find a way to reduce the number of inodes per group (say to 4096,
>> which would require re-allocating all inodes >= 2^31 to inodes < 2^31).
>
> Once you can properly access the filesystem (e.g. after editing the
> superblock to shrink it by one group), then I believe Darrick added
> support to resize2fs (or tune2fs) to reduce the number of inodes in
> the filesystem.  The time needed to run this depends on how many
> inodes are in use in the filesystem.

If he has, I cannot locate it.  I checked the e2fsprogs git logs for all
patches by him and could not locate anything; a quick check of
everything from 2017 and 2018 didn't reveal anything either (the latter
check was less comprehensive and I could have missed it).
> I'd strongly recommend making a backup of the filesystem before such
> an operation, since if there is a bug or interruption it could leave
> you with a broken filesystem.

Unfortunately that's not an option.  If I had that kind of space
available I'd back it up, create a new filesystem and copy everything
back.  We only have around 4.5TB of unallocated space currently on the
VG.

>> (3. Find some free space, create a new filesystem, and iteratively
>> move data from the one to the other, shrinking and growing the
>> filesystems as per progress - will never be able to move more data
>> than what is currently available on the system, around 4TB in my
>> case, so this will take a VERY long time.)
I am contemplating creating a new 4TB filesystem in that space, mounting
that in the "correct" location (would need to find a gap to umount the
old fs first) and symlinking the top-level folders over.  From there I'd
need to rsync (cp -a) a top-level folder at a time over, remove the
symlink (breaking access), revalidating (rsync) and then rename into the
correct location.  Once the "new" filesystem is depleted of space I'd
need to offline the old filesystem, shrink it, lvreduce the blockdevice
and online it again, allocate the released storage to the new
filesystem, and extend that.  I'll then need to iteratively do that
until all data has been moved over.  To shrink is a long and slow
operation in general.  And the filesystem is under heavy pressure (reads
are peaking around 600MB/s, average around 150MB/s, with very few "idle"
times).

A blockage here would be if one of the top-level folders (which is the
only level at which I can guarantee that there is no hard-linking
between folders) is larger than my total free space currently.  I've
started a du process for this, and currently the largest is 3.5TB, but
it's still calculating (has been going since I sent my first email).

Even with this strategy (which I'm starting to think is the way to go) I
first need to be able to get rid of that last group, and the ideas
presented by Ted only work if debugfs works (which, given my previous
patch, it would, but fsck still won't clear the filesystem - which may
be fine).  After dropping that last group, fsck should be fine again and
I can set out on the above strategy, which would need to be executed
over days, probably weeks.
>> 2. Allow adding additional blocks to the filesystem, without adding
>> additional inodes.
>>
>> I'm currently aiming for option 2 since that looks to be the simplest.
>> Simply allow the overflow to happen, but don't allocate additional
>> inodes if the number of inodes is already ~0U.
>
> This would be a useful feature to have, especially since we allow
> FLEX_BG to put all of the metadata at the start of the filesystem.
> This would essentially allow growing the inode bitmap independently
> of the block bitmap, which can definitely be convenient in some cases.
>
> It may touch a lot of places in the code, but at least it would be
> pretty easily tested, since it could be used on small filesystems.
I assume by creating the filesystem with -N and a LARGE number of
inodes, and then resizing from there?

Since we normally perform on-line resizes I figured I'd give that a try
first.  Looking at the code, there are a few things that I noticed:

Online resizing tries three approaches with the kernel:

1.  ioctl EXT4_IOC_RESIZE_FS (unless requested not to, or if it fails,
then:)
2.  ioctl EXT2_IOC_GROUP_EXTEND - if this succeeds (as per the test via
ext2fs_read_bitmaps()) we're done.
3.  Flex groups get cleared.  EXT2_IOC_GROUP_EXTEND is used to extend
the last group, then a sequence of EXT2_IOC_GROUP_ADD ioctls is used to
add more groups.

In my mind we only really need to care about the EXT4_IOC_RESIZE_FS
ioctl here?

# ext4_resize_fs - checks for inode overflow correctly now (fixed by
Jan).  This check will need to go away given that we want to be able to
add more blocks without adding more inodes.

# It looks like we will still need to allocate flex groups, but not
every flex group will have inodes.  So really I'm not sure how to
approach this.  Pass5 above depends on the inode count matching the
group counts, if I understand it correctly, in order to rebuild bitmaps
and to determine if the filesystem is consistent.

# On the other hand, it looks to me like when growing the filesystem the
new blocks are simply added to the last group (call to
ext4_group_extend_no_check).  Can we simply stop here?  Or am I
misunderstanding, and this only adds blocks to complete the group?  In
other words, if there is normally 128MB in a group and the last group
was 96MB, will this just use the first 32MB to complete that group
first?

# We then loop adding additional flex groups (which could be normal
groups if s_log_groups_per_flex == 0), outputting progress every 10
seconds or so.

# ext4_setup_next_flex is most likely where changes need to be made, or
perhaps in the functions called by it.  Specifically, it should not add
additional inodes if this would cause an overflow (or it should only add
up to ~0U - which, as per the above, causes other difficulties, at least
at my skill level).  It might be simpler to still allocate them in the
group (2MB worth of blocks with 8192 inodes/group) but just not make
them available for use.  Still having them allocated at least means that
all the normal clusters_per_group, inodes_per_group and other related
calculations remain intact, but making that 2MB of data available pretty
much means that any groups beyond the point where the inode count would
overflow would need to have those values specially treated.

I think I've realized this is somewhat above my skill level, and more
complicated than what I had hoped.  I think I better get cracking on
recreating a new filesystem and starting to move data over.

Kind Regards,
Jaco


