2020-12-04 14:32:11

by Paul Menzel

[permalink] [raw]
Subject: ext4: Funny characters appended to file names

Dear Linux folks,


Using Debian Sid/unstable with 5.9.11 (5.9.0-4-686-pae), it looks like
the last `sudo grub-update` installed modules with corrupted file names.
`/boot` is mounted.

> $ findmnt /boot
> TARGET SOURCE FSTYPE OPTIONS
> /boot /dev/md0 ext4 rw,relatime
> $ ls -l /boot/grub/i386-pc/
> insgesamt 2085
> -rw-r--r-- 1 root root 8004 13. Aug 23:00 '915resolution.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> -rw-r--r-- 1 root root 10596 13. Aug 23:00 'acpi.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> […]
> $ file /boot/grub/i386-pc/zstd.mod-��u^S�鍓\]�����f���ur��^DVt\$^Xj^B胒
> /boot/grub/i386-pc/zstd.mod-��u�鍓]������f�����ur��V�t$j胒: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped

Checking the file system returned no errors.

$ sudo umount /boot
$ sudo fsck.ext4 /dev/md0
e2fsck 1.45.6 (20-Mar-2020)
boot: sauber, 331/124928 Dateien, 145680/497856 Blöcke

This causes GRUB fail to load the module, and it falls back into rescue
mode.

Any idea, what might have happened. It’s a degraded RAID, and I only use
one drive since several years, but never deactivated it, and `/dev/md0`
still shows up.

```
$ more /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid1 sdb1[0]
497856 blocks [2/1] [U_]

md1 : active raid1 sdb2[0]
1953013952 blocks [2/1] [U_]

unused devices: <none>
```


Kind regards,

Paul


Attachments:
grub-ext4-funny-chars.txt (25.88 kB)

2020-12-04 15:30:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Fri, Dec 04, 2020 at 03:30:38PM +0100, Paul Menzel wrote:
> Dear Linux folks,
>
>
> Using Debian Sid/unstable with 5.9.11 (5.9.0-4-686-pae), it looks like the
> last `sudo grub-update` installed modules with corrupted file names. `/boot`
> is mounted.
>
> > $ findmnt /boot
> > TARGET SOURCE FSTYPE OPTIONS
> > /boot /dev/md0 ext4 rw,relatime
> > $ ls -l /boot/grub/i386-pc/
> > insgesamt 2085
> > -rw-r--r-- 1 root root 8004 13. Aug 23:00 '915resolution.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> > -rw-r--r-- 1 root root 10596 13. Aug 23:00 'acpi.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> > […]
> > $ file /boot/grub/i386-pc/zstd.mod-��u^S�鍓\]�����f���ur��^DVt\$^Xj^B胒
> > /boot/grub/i386-pc/zstd.mod-��u�鍓]������f�����ur��V�t$j胒: ELF 32-bit
> > LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
>
> Checking the file system returned no errors.
>
> $ sudo umount /boot
> $ sudo fsck.ext4 /dev/md0
> e2fsck 1.45.6 (20-Mar-2020)
> boot: sauber, 331/124928 Dateien, 145680/497856 Blöcke

Try forcing a full fsck:

sudo fsck.ext4 -f /dev/md0

You'll see that it takes rather longer to run....

- Ted

2020-12-04 15:40:50

by Paul Menzel

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

Dear Theodore,


Am 04.12.20 um 16:28 schrieb Theodore Y. Ts'o:
> On Fri, Dec 04, 2020 at 03:30:38PM +0100, Paul Menzel wrote:

>> Using Debian Sid/unstable with 5.9.11 (5.9.0-4-686-pae), it looks like the
>> last `sudo grub-update` installed modules with corrupted file names. `/boot`
>> is mounted.
>>
>>> $ findmnt /boot
>>> TARGET SOURCE FSTYPE OPTIONS
>>> /boot /dev/md0 ext4 rw,relatime
>>> $ ls -l /boot/grub/i386-pc/
>>> insgesamt 2085
>>> -rw-r--r-- 1 root root 8004 13. Aug 23:00 '915resolution.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
>>> -rw-r--r-- 1 root root 10596 13. Aug 23:00 'acpi.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
>>> […]
>>> $ file /boot/grub/i386-pc/zstd.mod-��u^S�鍓\]�����f���ur��^DVt\$^Xj^B胒
>>> /boot/grub/i386-pc/zstd.mod-��u�鍓]������f�����ur��V�t$j胒: ELF 32-bit
>>> LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
>>
>> Checking the file system returned no errors.
>>
>> $ sudo umount /boot
>> $ sudo fsck.ext4 /dev/md0
>> e2fsck 1.45.6 (20-Mar-2020)
>> boot: sauber, 331/124928 Dateien, 145680/497856 Blöcke
>
> Try forcing a full fsck:
>
> sudo fsck.ext4 -f /dev/md0
>
> You'll see that it takes rather longer to run....

Only two or three seconds on this system. (It’s only 486,2M.)

> $ sudo LANG=C fsck.ext4 -f /dev/md0
> e2fsck 1.45.6 (20-Mar-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> boot: 327/124928 files (17.7% non-contiguous), 126021/497856 blocks

I can’t remember if that was an Ext2 or Ext3 when created several years ago.


Kind regards,

Paul

2020-12-04 18:07:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Fri, Dec 04, 2020 at 04:39:12PM +0100, Paul Menzel wrote:
>
> > $ sudo LANG=C fsck.ext4 -f /dev/md0
> > e2fsck 1.45.6 (20-Mar-2020)
> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > boot: 327/124928 files (17.7% non-contiguous), 126021/497856 blocks
>
> I can’t remember if that was an Ext2 or Ext3 when created several years ago.

Well, the output dumpe2fs will tell us an awful lot about the file
system.

Whe was the last time the directory was OK? Do you know when it may
have gotten corrupted?

- Ted

2020-12-04 20:05:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Dec 4, 2020, at 7:30 AM, Paul Menzel <[email protected]> wrote:
>
> Using Debian Sid/unstable with 5.9.11 (5.9.0-4-686-pae), it looks like the last `sudo grub-update` installed modules with corrupted file names. `/boot` is mounted.
>
>> $ findmnt /boot
>> TARGET SOURCE FSTYPE OPTIONS
>> /boot /dev/md0 ext4 rw,relatime
>> $ ls -l /boot/grub/i386-pc/
>> insgesamt 2085
>> -rw-r--r-- 1 root root 8004 13. Aug 23:00 '915resolution.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
>> -rw-r--r-- 1 root root 10596 13. Aug 23:00 'acpi.mod-'$'\205\300''u'$'\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
>> […]
>> $ file /boot/grub/i386-pc/zstd.mod-��u^S�鍓\]�����f���ur��^DVt\$^Xj^B胒 /boot/grub/i386-pc/zstd.mod-��u�鍓]������f�����ur��V�t$j胒: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
>
> Checking the file system returned no errors.
>
> $ sudo umount /boot
> $ sudo fsck.ext4 /dev/md0
> e2fsck 1.45.6 (20-Mar-2020)
> boot: sauber, 331/124928 Dateien, 145680/497856 Blöcke
>
> This causes GRUB fail to load the module, and it falls back into rescue mode.
>
> Any idea, what might have happened. It’s a degraded RAID, and I only use one drive since several years, but never deactivated it, and `/dev/md0` still shows up.
>
> ```
> $ more /proc/mdstat
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md0 : active raid1 sdb1[0]
> 497856 blocks [2/1] [U_]
>
> md1 : active raid1 sdb2[0]
> 1953013952 blocks [2/1] [U_]
>
> unused devices: <none>
> ```

Did you try downgrading to the previous kernel to see if that fixes the problem?
Then, it would be useful to bisect between the old working kernel and the new
broken kernel to see what introduced this bug.

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-12-05 19:36:46

by Paul Menzel

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

[Cc: +Colin]

Dear Ted,


Am 04.12.20 um 19:05 schrieb Theodore Y. Ts'o:
> On Fri, Dec 04, 2020 at 04:39:12PM +0100, Paul Menzel wrote:

Colin, the modules in `/boot/grub/i386-pc` look funny, and can’t be
loaded by GRUB anymore.

```
$ ls -lt /boot/grub/i386-pc/
insgesamt 2085
-rw-r--r-- 1 root root 512 13. Aug 23:00 'boot.img-'$'\205\300''u'$'
\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205
\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
-rw-r--r-- 1 root root 30893 13. Aug 23:00 'core.img-'$'\205\300''u'$'
\023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205
\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
[…]
```

>>> $ sudo LANG=C fsck.ext4 -f /dev/md0
>>> e2fsck 1.45.6 (20-Mar-2020)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 2: Checking directory structure
>>> Pass 3: Checking directory connectivity
>>> Pass 4: Checking reference counts
>>> Pass 5: Checking group summary information
>>> boot: 327/124928 files (17.7% non-contiguous), 126021/497856 blocks
>>
>> I can’t remember if that was an Ext2 or Ext3 when created several years ago.
>
> Well, the output dumpe2fs will tell us an awful lot about the file
> system.

Interesting. The file system was created in 2008. ;-) Please find the
dumpe2fs output attached.

> When was the last time the directory was OK? Do you know when it may
> have gotten corrupted?

The last reboot before. But I am really confused now though.

$ ls -ld /boot/grub/i386-pc
drwxr-xr-x 2 root root 28672 29. Nov 12:13 /boot/grub/i386-pc

But the module files in there are all from August 2020.

-rw-r--r-- 1 root root 2400 Aug 13 23:00
'part_gpt.mod-'$'\205\300''u'$'\023\211\351\215\223'']'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002\350\203\222'

The characters in the file name look like some character encoding. Do
you know hat that is? UTF-8? The dumped output viewed in an editor shows
a “Asian” looking characters 胒.

$ last reboot
reboot system boot 5.9.0-4-686-pae Fri Dec 4 21:10 still running
reboot system boot 5.9.0-4-686-pae Sun Nov 29 17:56 - 17:02
(4+23:05)
reboot system boot 5.9.0-1-686-pae Sun Nov 29 09:37 - 12:50
(03:12)
reboot system boot 5.9.0-1-686-pae Sat Oct 31 20:49 - 23:40
(02:50)

So, I ran

Nov 29 10:58:55 mattotaupa sudo[2214]: paul : TTY=pts/0 ;
PWD=/home/paul ; USER=root ; COMMAND=/usr/bin/apt full-upgrade

and that also touched `/boot/grub/i386-pc`.

The GRUB packages and Linux package were updated according to
`/var/log/dpkg.log.1`.

2020-11-29 11:38:06 upgrade grub2-common:i386 2.04-9 2.04-10
[…]
2020-11-29 12:04:00 status installed
linux-image-5.9.0-4-686-pae:i386 5.9.11-1
[…]
2020-11-29 12:13:24 configure grub-pc:i386 2.04-10 <none>
2020-11-29 12:13:24 status unpacked grub-pc:i386 2.04-10
2020-11-29 12:13:24 status half-configured grub-pc:i386 2.04-10
[Dialog waited for my confirmation: Some GRUB warning regarding
block devices, which I always “ignore”, that means tell GRUB to be upgraded]
2020-11-29 12:43:21 status installed grub-pc:i386 2.04-10
[…]

So, afterward I was able to reboot without any issues.

But now, it gets really confusing. `last` claims, the system was running
for four days. But that was not the (power was unplugged in between). I
also can’t remember the shutdown failing in some strange way. In the
systemd journal, it’s also recorded as one boot, but messages are
missing, supporting it was powered off in between.

Nov 29 17:57:00 mattotaupa kernel: Linux version 5.9.0-4-686-pae
([email protected]) (gcc-10 (Debian 10.2.0-19) 10.2.0, GNU
ld (GNU Binutils for Debian) 2.35.1) #1 SMP Debian 5.9.11-1 (2020-11-27)
Nov 29 17:57:00 mattotaupa kernel: random: get_random_u32 called
from bsp_init_amd+0x12f/0x220 with crng_init=0
[…]
Nov 29 17:57:51 mattotaupa tracker-miner-f[1130]: Unable to get XDG
user directory path for special directory &MUSIC. Ignoring this location.
Nov 29 17:57:51 mattotaupa tracker-miner-f[1130]: Unable to get XDG
user directory path for special directory &PICTURES. Ignoring this location.
Nov 29 17:57:51 mattotaupa tracker-miner-f[1130]: Unable to get XDG
user directory path for special directory &VIDEOS. Ignoring this location.
Dez 04 14:48:12 mattotaupa systemd-timesyncd[706]: Initial
synchronization to time server [2a01:4f8:120:9224::2]:123
(2.debian.pool.ntp.org).
Dez 04 14:48:12 mattotaupa systemd[1]: Starting exim4-base
housekeeping...
Dez 04 14:48:12 mattotaupa systemd[1]: Starting Daily man-db
regeneration...

So, something really strange happened. But looking more into this, and
that I wasn’t definitely not home on November 29th, 2020 at 17:57, I
think the system time was just incorrect.

The (old) drive seems alright.

```
$ sudo smartctl -i /dev/sdb
smartctl 7.1 2019-12-30 r5022 [i686-linux-5.9.0-4-686-pae] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, http://www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF)
Device Model: WDC WD20EARS-60MVWB0
Serial Number: WD-WCAZA4234015
LU WWN Device Id: 5 0014ee 20585a39e
Firmware Version: 51.0AB51
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Sat Dec 5 20:08:13 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
$ sudo smartctl -H /dev/sdb
smartctl 7.1 2019-12-30 r5022 [i686-linux-5.9.0-4-686-pae] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, http://www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
```

Please find `sudo smartctl --all /dev/sdb` attached.

Do you want me to re-install grub to see if it’s a problem introduced in
Debian’s GRUB 2.04-10?

> grub2 (2.04-10) unstable; urgency=medium
>
> [ Ian Campbell ]
> * Remove myself from uploaders.
>
> [ Colin Watson ]
> * When upgrading grub-pc noninteractively, bail out if grub-install fails.
> It's better to fail the upgrade than to produce a possibly-unbootable
> system.
> * Explicitly check whether the target device exists before running
> grub-install, since grub-install copies modules to /boot/grub/ before
> installing the core image, and the new modules might be incompatible
> with the old core image (closes: #966575).
> * Cherry-pick from upstream:
> - tftp: Roll-over block counter to prevent data packets timeouts
> (LP: #1892290).
>
> [ Dimitri John Ledkov ]
> * grub-install: Add backup and restore.
> * Don't call grub-install on fresh install of grub-pc. It's the job of
> installers to do that after a fresh install.
>
> -- Colin Watson <[email protected]> Sun, 08 Nov 2020 16:26:08 +0000


Kind regards,

Paul


Attachments:
dumpe2fs.txt (20.01 kB)
smartctl-all-dev-sdb.txt (4.76 kB)
Download all attachments

2020-12-06 15:20:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Sun, Dec 06, 2020 at 02:44:16PM +0000, Colin Watson wrote:
> > Colin, the modules in `/boot/grub/i386-pc` look funny, and can’t be loaded
> > by GRUB anymore.
> >
> > ```
> > $ ls -lt /boot/grub/i386-pc/
> > insgesamt 2085
> > -rw-r--r-- 1 root root 512 13. Aug 23:00 'boot.img-'$'\205\300''u'$'
> > \023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205

I think Colin theory makes sense. Note the hypthen after "boot.img".
That corresponds with the 'i' in the code below:

> Now that I look at it more closely, some of the changes to
> clean_grub_dir_real look suspicious:
>
> + char *srcf = grub_util_path_concat (2, di, de->d_name);
> +
> + if (mode == CREATE_BACKUP)
> + {
> + char *dstf = grub_util_path_concat_ext (2, di, de->d_name, "-");
> + if (grub_util_rename (srcf, dstf) < 0)
> + grub_util_error (_("cannot backup `%s': %s"), srcf,
> + grub_util_fd_strerror ());
> + free (dstf);
> + }

... however, if I'm understanding the code correctly, this is the
codepath used to create the backup file (e.g., the previous version of
boot.img). So shouldn't there be a "boot.img" file in
/boot/grub/i386-pc which would be the newly installed version of that
file, and so the system would actually be booting correctly?

Or am I misunderstanding what is going on? Paul, I thought you said
your system wasn't able to boot because the needed files in
/boot/grub/i386-pc had apparently been corrupted?

Essentially, there are three possibilities:

1) A hardware corruption which corrupted the directory.

2) A kernel bug which corrupted the directory.

3) The file system isn't actually corrupted, but the filename with the
random garbage in the filename was created because a userspace
application so requested it.

The fact that all of the filenames have the a similar pattern of
corruption to them would tend to rule out #1. And the fact that
e2fsck didn't notice any other corruptions would tend to argue against
#1 and #2. So #3 does seem to be the most likely.

- Ted

2020-12-06 15:25:52

by Colin Watson

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Sat, Dec 05, 2020 at 08:34:34PM +0100, Paul Menzel wrote:
> [Cc: +Colin]

Also CCing Dimitri, whose GRUB patch this may be related to. Dimitri,
see https://marc.info/?l=linux-ext4&m=160719695914303&w=2 for the full
message I'm replying to.

> Am 04.12.20 um 19:05 schrieb Theodore Y. Ts'o:
> > On Fri, Dec 04, 2020 at 04:39:12PM +0100, Paul Menzel wrote:
>
> Colin, the modules in `/boot/grub/i386-pc` look funny, and can’t be loaded
> by GRUB anymore.
>
> ```
> $ ls -lt /boot/grub/i386-pc/
> insgesamt 2085
> -rw-r--r-- 1 root root 512 13. Aug 23:00 'boot.img-'$'\205\300''u'$'
> \023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205
> \300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> -rw-r--r-- 1 root root 30893 13. Aug 23:00 'core.img-'$'\205\300''u'$'
> \023\211''鍓]'$'\206\371\377\211\360\350''f'$'\376\377\377\205
> \300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002''胒'
> […]
> ```
[...]
> > When was the last time the directory was OK? Do you know when it may
> > have gotten corrupted?
>
> The last reboot before. But I am really confused now though.
>
> $ ls -ld /boot/grub/i386-pc
> drwxr-xr-x 2 root root 28672 29. Nov 12:13 /boot/grub/i386-pc
>
> But the module files in there are all from August 2020.
>
> -rw-r--r-- 1 root root 2400 Aug 13 23:00 'part_gpt.mod-'$'\205\300''u'$'\023\211\351\215\223'']'$'\206\371\377\211\360\350''f'$'\376\377\377\205\300''ur'$'\203\354\004''V'$'\377''t$'$'\030''j'$'\002\350\203\222'
>
> The characters in the file name look like some character encoding. Do you
> know hat that is? UTF-8? The dumped output viewed in an editor shows a
> “Asian” looking characters 胒.

It seems rather more likely to be junk from uninitialised memory.

> 2020-11-29 11:38:06 upgrade grub2-common:i386 2.04-9 2.04-10
> […]
> 2020-11-29 12:04:00 status installed linux-image-5.9.0-4-686-pae:i386
> 5.9.11-1
> […]
> 2020-11-29 12:13:24 configure grub-pc:i386 2.04-10 <none>
> 2020-11-29 12:13:24 status unpacked grub-pc:i386 2.04-10
> 2020-11-29 12:13:24 status half-configured grub-pc:i386 2.04-10
> [Dialog waited for my confirmation: Some GRUB warning regarding block
> devices, which I always “ignore”, that means tell GRUB to be upgraded]

You need to actually look into this and fix it properly rather than
ignoring it. It's probably related to this problem, since a successful
installation doesn't go down the RESTORE_BACKUP path which I think is
the suspicious one here.

> 2020-11-29 12:43:21 status installed grub-pc:i386 2.04-10
> […]
>
> So, afterward I was able to reboot without any issues.
[...]
> Do you want me to re-install grub to see if it’s a problem introduced in
> Debian’s GRUB 2.04-10?

Now that I look at it more closely, some of the changes to
clean_grub_dir_real look suspicious:

+ char *srcf = grub_util_path_concat (2, di, de->d_name);
+
+ if (mode == CREATE_BACKUP)
+ {
+ char *dstf = grub_util_path_concat_ext (2, di, de->d_name, "-");
+ if (grub_util_rename (srcf, dstf) < 0)
+ grub_util_error (_("cannot backup `%s': %s"), srcf,
+ grub_util_fd_strerror ());
+ free (dstf);
+ }
+ else if (mode == RESTORE_BACKUP)
+ {
+ char *dstf = grub_util_path_concat_ext (2, di, de->d_name);
+ dstf[strlen (dstf) - 1] = 0;
+ if (grub_util_rename (srcf, dstf) < 0)
+ grub_util_error (_("cannot restore `%s': %s"), dstf,
+ grub_util_fd_strerror ());
+ free (dstf);
+ }
+ else
+ {
+ if (grub_util_unlink (srcf) < 0)
+ grub_util_error (_("cannot delete `%s': %s"), srcf,
+ grub_util_fd_strerror ());
+ }
+ free (srcf);

grub_util_path_concat is a helper that joins its arguments with "/";
grub_util_path_concat_ext does likewise except the last argument is
appended as an extension without first appending "/". The first
argument to both of these functions is "n": grub_util_path_concat
expects n further argument, while grub_util_path_concat_ext expects n +
1 further arguments.

So, in the RESTORE_BACKUP case, shouldn't that be:

char *dstf = grub_util_path_concat (2, di, de->d_name);

... rather than grub_util_path_concat_ext? Otherwise it seems to me
that it's going to try to append an additional argument which doesn't
exist, and may well add random uninitialised stuff from memory. Running
grub-install under valgrind would probably show this up (I can't get it
to do it for me so far, but most likely I just haven't set up quite the
right initial conditions).

This looks more likely to be a userspace problem rather than filesystem
corruption. I think this should likely be refiled as a bug against
Debian's grub2 package.

--
Colin Watson (he/him) [[email protected]]

2020-12-06 17:40:38

by Colin Watson

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Sun, Dec 06, 2020 at 10:15:27AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 06, 2020 at 02:44:16PM +0000, Colin Watson wrote:
> > Now that I look at it more closely, some of the changes to
> > clean_grub_dir_real look suspicious:
> >
> > + char *srcf = grub_util_path_concat (2, di, de->d_name);
> > +
> > + if (mode == CREATE_BACKUP)
> > + {
> > + char *dstf = grub_util_path_concat_ext (2, di, de->d_name, "-");
> > + if (grub_util_rename (srcf, dstf) < 0)
> > + grub_util_error (_("cannot backup `%s': %s"), srcf,
> > + grub_util_fd_strerror ());
> > + free (dstf);
> > + }
>
> ... however, if I'm understanding the code correctly, this is the
> codepath used to create the backup file (e.g., the previous version of
> boot.img). So shouldn't there be a "boot.img" file in
> /boot/grub/i386-pc which would be the newly installed version of that
> file, and so the system would actually be booting correctly?

Not quite. What's described here as "backup/restore" thing is used as
follows:

* rename old modules aside as a backup
* do the rest of the installation (writing to the MBR or similar, as
well as copying in new modules)
* if installation succeeds, remove the backup files
* if installation fails, then:
* remove the newly-created modules
* move the backup files back into place

But if the restored file names are computed wrongly, then this leaves
the system in a bad state as Paul described.

I don't know why Dimitri chose to explicitly remove the new files first
rather than just renaming over the top and then removing any leftovers
at the end; that seems unnecessarily risky. Though this is code that's
apparently supposed to work on Windows as well, and the MoveFile
function that's used to implement grub_util_rename there requires the
destination file not to exist (sigh), so maybe it had something to do
with that.

> Essentially, there are three possibilities:
>
> 1) A hardware corruption which corrupted the directory.
>
> 2) A kernel bug which corrupted the directory.
>
> 3) The file system isn't actually corrupted, but the filename with the
> random garbage in the filename was created because a userspace
> application so requested it.
>
> The fact that all of the filenames have the a similar pattern of
> corruption to them would tend to rule out #1. And the fact that
> e2fsck didn't notice any other corruptions would tend to argue against
> #1 and #2. So #3 does seem to be the most likely.

Yep.

--
Colin Watson (he/him) [[email protected]]

2020-12-06 17:45:51

by Colin Watson

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Sun, Dec 06, 2020 at 05:37:46PM +0000, Colin Watson wrote:
> I don't know why Dimitri chose to explicitly remove the new files first
> rather than just renaming over the top and then removing any leftovers
> at the end; that seems unnecessarily risky. Though this is code that's
> apparently supposed to work on Windows as well, and the MoveFile
> function that's used to implement grub_util_rename there requires the
> destination file not to exist (sigh), so maybe it had something to do
> with that.

Incidentally, if this is actually the reason, then I think this would be
a viable replacement:

ret = !MoveFileEx (windows_from, windows_to, MOVEFILE_REPLACE_EXISTING);

Not that I really speak the Windows API ...

--
Colin Watson (he/him) [[email protected]]

2020-12-06 18:29:54

by Colin Watson

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

On Sun, Dec 06, 2020 at 02:44:16PM +0000, Colin Watson wrote:
> So, in the RESTORE_BACKUP case, shouldn't that be:
>
> char *dstf = grub_util_path_concat (2, di, de->d_name);
>
> ... rather than grub_util_path_concat_ext? Otherwise it seems to me
> that it's going to try to append an additional argument which doesn't
> exist, and may well add random uninitialised stuff from memory. Running
> grub-install under valgrind would probably show this up (I can't get it
> to do it for me so far, but most likely I just haven't set up quite the
> right initial conditions).

While I couldn't reproduce this on amd64 (and valgrind didn't show any
errors), I can reproduce it just fine on i386, which is what Paul is
using. I guess the va_list layout in memory is different enough between
the two ABIs to tickle this.

I'll apply this patch for Debian grub2 2.04-11, which I've confirmed
fixes it for me:

diff --git a/util/grub-install-common.c b/util/grub-install-common.c
index a883b6dae..61f9075bc 100644
--- a/util/grub-install-common.c
+++ b/util/grub-install-common.c
@@ -247,7 +247,7 @@ clean_grub_dir_real (const char *di, enum clean_grub_dir_mode mode)
}
else if (mode == RESTORE_BACKUP)
{
- char *dstf = grub_util_path_concat_ext (2, di, de->d_name);
+ char *dstf = grub_util_path_concat (2, di, de->d_name);
dstf[strlen (dstf) - 1] = 0;
if (grub_util_rename (srcf, dstf) < 0)
grub_util_error (_("cannot restore `%s': %s"), dstf,

Dimitri, I know Ubuntu isn't very interested in supporting i386, but IMO
you should apply this patch to Ubuntu in any case; it might affect other
architectures, and anyway leaving known undefined behaviour around isn't
a good idea.

Paul, note that you'll also need to run "sudo rm -f
/boot/grub/i386-pc/*.{img,lst,mo,mod,o,sh}-*" to clean up the stray
files created by this bug.

--
Colin Watson (he/him) [[email protected]]

2020-12-07 02:03:29

by Dimitri John Ledkov

[permalink] [raw]
Subject: Re: ext4: Funny characters appended to file names

Hi,

On Sun, 6 Dec 2020, 18:28 Colin Watson, <[email protected]> wrote:
>
> On Sun, Dec 06, 2020 at 02:44:16PM +0000, Colin Watson wrote:
> > So, in the RESTORE_BACKUP case, shouldn't that be:
> >
> > char *dstf = grub_util_path_concat (2, di, de->d_name);
> >
> > ... rather than grub_util_path_concat_ext? Otherwise it seems to me
> > that it's going to try to append an additional argument which doesn't
> > exist, and may well add random uninitialised stuff from memory. Running
> > grub-install under valgrind would probably show this up (I can't get it
> > to do it for me so far, but most likely I just haven't set up quite the
> > right initial conditions).
>
> While I couldn't reproduce this on amd64 (and valgrind didn't show any
> errors), I can reproduce it just fine on i386, which is what Paul is
> using. I guess the va_list layout in memory is different enough between
> the two ABIs to tickle this.
>

Indeed, my patch was not tested on i386 and has had limited reviews
before landing in Ubuntu. Still not reviewed on the upstream mailing
list.

I am sorry for causing this issue.

> I'll apply this patch for Debian grub2 2.04-11, which I've confirmed
> fixes it for me:
>
> diff --git a/util/grub-install-common.c b/util/grub-install-common.c
> index a883b6dae..61f9075bc 100644
> --- a/util/grub-install-common.c
> +++ b/util/grub-install-common.c
> @@ -247,7 +247,7 @@ clean_grub_dir_real (const char *di, enum clean_grub_dir_mode mode)
> }
> else if (mode == RESTORE_BACKUP)
> {
> - char *dstf = grub_util_path_concat_ext (2, di, de->d_name);
> + char *dstf = grub_util_path_concat (2, di, de->d_name);
> dstf[strlen (dstf) - 1] = 0;
> if (grub_util_rename (srcf, dstf) < 0)
> grub_util_error (_("cannot restore `%s': %s"), dstf,
>
> Dimitri, I know Ubuntu isn't very interested in supporting i386, but IMO
> you should apply this patch to Ubuntu in any case; it might affect other
> architectures, and anyway leaving known undefined behaviour around isn't
> a good idea.
>

Will do. And will resend it to the grub mailing list.

> Paul, note that you'll also need to run "sudo rm -f
> /boot/grub/i386-pc/*.{img,lst,mo,mod,o,sh}-*" to clean up the stray
> files created by this bug.

Thank you everyone for the code review.