2016-07-18 10:57:07

by Johannes Stezenbach

[permalink] [raw]
Subject: 4.7.0-rc7 ext4 error in dx_probe

Hi,

I'm running 4.7.0-rc7 with ext4 on lvm on dm-crypt on SSD
and out of the blue on idle machine the following error
message appeared:

[373851.683131] EXT4-fs (dm-3): error count since last fsck: 1
[373851.683151] EXT4-fs (dm-3): initial error at time 1468438194: dx_probe:740: inode 22288562
[373851.683158] EXT4-fs (dm-3): last error at time 1468438194: dx_probe:740: inode 22288562

inode 22288562 is a directory with ~800 small files in it,
but AFAICT nothing was accessing it, no cron job running etc.
No further error message was logged. Accessing the directory
and the files in it also gives no further errors.

Searching back in the log at date -d @1468438194 I found:

Jul 13 21:29:54 foo kernel: EXT4-fs error (device dm-3): dx_probe:740: inode #22288562: comm git: Directory index failed checksum


Time to run fsck? Is it the consequence of a previous crash
(I had many recently)?


Johannes


2016-07-18 13:38:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Jul 18, 2016 at 12:57:07PM +0200, Johannes Stezenbach wrote:
>
> I'm running 4.7.0-rc7 with ext4 on lvm on dm-crypt on SSD
> and out of the blue on idle machine the following error
> message appeared:
>
> [373851.683131] EXT4-fs (dm-3): error count since last fsck: 1
> [373851.683151] EXT4-fs (dm-3): initial error at time 1468438194: dx_probe:740: inode 22288562
> [373851.683158] EXT4-fs (dm-3): last error at time 1468438194: dx_probe:740: inode 22288562
>
> inode 22288562 is a directory with ~800 small files in it,
> but AFAICT nothing was accessing it, no cron job running etc.
> No further error message was logged. Accessing the directory
> and the files in it also gives no further errors.

Yes, thes messages gets printed once a day in case there was a file
system corruption detected earlier. The problem is people
unfortunately run with their file systems set to errors=continue,
which I sometimes refer to as the "don't worry, be happy" option. The
problem is sometimes this can cause data loss, and people aren't
checking their logs so it's possible for errors to lurk hidden for a
long time without noticing. Then they complain, and the people who
are trying to resolve the bug report might not realize that in fact
the file system has been corrupted for ages.

I strongly suggest that people either (a) run automatic log analaysis
software that informs you when a file system corruption was detected,
or (b) run with errors=panic which will at least force a file system
check when the system reboots, or (c) run with errors=remount-ro which
avoids the file system from getting further corrupted.

> Searching back in the log at date -d @1468438194 I found:
>
> Jul 13 21:29:54 foo kernel: EXT4-fs error (device dm-3): dx_probe:740: inode #22288562: comm git: Directory index failed checksum
>
>
> Time to run fsck? Is it the consequence of a previous crash
> (I had many recently)?

In general, whenever the an EXT4-fs error is registered, you probably
want to run fsck, right away. The default of errors=continue was more
appropriate back in the day when servers were pets, and many people
would complain if we just unceremoniously forced a reboot when we
noticed a file system corruption, or remounted the file system
read-only, which could lead to some surprising failures at the
application stack level that could be hard to debug. But in a
Cloud/Kuberneties world of "servers are cattle", forcing a reboot
right away probably is the best thing to do, so the fsck can be forced
right away.

As far as what caused it, I'm not sure. If you can since look at the
directory without any problems, it's possible that it was caused by a
transient hardware glitch. That's because when we issue the
"Directory index failed checksum", the directory block is discarded
(to avoid potential further file system corruption). If you can list
the directory now, that would appear that a subsequent attempt to
reread the directory block was successful.

But, in matters like this, I generally advise a "better safe than
sorry", and run fsck. If you are using LVM, then you might be able to
create a snapshot and then run fsck -n on the snapshot, to make sure
the file system is OK, without needing to shutdown your server. An
example of such a script can be found here:

http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/e2croncheck

Cheers,

- Ted

2016-07-18 14:17:23

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Jul 18, 2016 at 09:38:43AM -0400, Theodore Ts'o wrote:
> On Mon, Jul 18, 2016 at 12:57:07PM +0200, Johannes Stezenbach wrote:
> >
> > I'm running 4.7.0-rc7 with ext4 on lvm on dm-crypt on SSD
> > and out of the blue on idle machine the following error
> > message appeared:
> >
> > [373851.683131] EXT4-fs (dm-3): error count since last fsck: 1
> > [373851.683151] EXT4-fs (dm-3): initial error at time 1468438194: dx_probe:740: inode 22288562
> > [373851.683158] EXT4-fs (dm-3): last error at time 1468438194: dx_probe:740: inode 22288562
> >
> > inode 22288562 is a directory with ~800 small files in it,
> > but AFAICT nothing was accessing it, no cron job running etc.
> > No further error message was logged. Accessing the directory
> > and the files in it also gives no further errors.
>
> Yes, thes messages gets printed once a day in case there was a file
> system corruption detected earlier. The problem is people
> unfortunately run with their file systems set to errors=continue,
> which I sometimes refer to as the "don't worry, be happy" option. The
[snip]

I've not willingly done this, but I recently upgraded to a bigger
SSD and so created new file system, and the mount option for errors=
isn't specified so it uses the default from superblock, and
mkfs.ext4 has defaulted to "Errors behavior: Continue"
according to dumpe2fs -h. I'm using Debian sid FWIW, just checked
the source of e2fsprogs-1.43.1 and found:

#define EXT2_ERRORS_DEFAULT EXT2_ERRORS_CONTINUE


During reboot after crash I saw the usual "Clearing orphaned inode"
messages scroll by, however they did not make it into systemd journal.
So I suspect if there were any other fsck errors during boot
they were lost, too, thanks to systemd-fsck.

Thanks for your detailed reply.


Johannes

2016-07-26 05:05:07

by Christian Kujau

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, 18 Jul 2016, Theodore Ts'o wrote:
> system corruption detected earlier. The problem is people
> unfortunately run with their file systems set to errors=continue,

The first thing I do after creating an ext4 fs is to set this to
errors=remount-ro :-)

Isn't "remount-ro" was XFS is doing when it detects an error? Are
there plans to switch the default for ext4 to that? I never understood why
errors=continue was the default, thanks for explaining that.

Christian.
--
BOFH excuse #339:

manager in the cable duct

2016-07-27 11:58:43

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Jul 18, 2016 at 04:17:23PM +0200, Johannes Stezenbach wrote:
> On Mon, Jul 18, 2016 at 09:38:43AM -0400, Theodore Ts'o wrote:
> > On Mon, Jul 18, 2016 at 12:57:07PM +0200, Johannes Stezenbach wrote:
> > >
> > > I'm running 4.7.0-rc7 with ext4 on lvm on dm-crypt on SSD
> > > and out of the blue on idle machine the following error
> > > message appeared:
> > >
> > > [373851.683131] EXT4-fs (dm-3): error count since last fsck: 1
> > > [373851.683151] EXT4-fs (dm-3): initial error at time 1468438194: dx_probe:740: inode 22288562
> > > [373851.683158] EXT4-fs (dm-3): last error at time 1468438194: dx_probe:740: inode 22288562
> > >
> > > inode 22288562 is a directory with ~800 small files in it,
> > > but AFAICT nothing was accessing it, no cron job running etc.
> > > No further error message was logged. Accessing the directory
> > > and the files in it also gives no further errors.

FWIW, now with 4.7.0 and errors=remount-ro it just happened again
during git update (actually "repo sync -ld" of AOSP/cm
repository). Again a directory with 321 small files.
ls on ro fs after the error listed the directory without problems.
Fsck fixed wrong inode and wrong free block count.
ls after fsck still listed the directory and "git status"
reported it as clean.

[72173.126740] EXT4-fs error (device dm-3): dx_probe:740: inode #12327817: comm git: Directory index failed checksum
[72173.131346] Aborting journal on device dm-3-8.
[72173.135884] EXT4-fs (dm-3): Remounting filesystem read-only

Since I upgraded the RAM from 4G to 8G not long ago I
suspect it could be the root of the issue, although
this RAM was taken from another machine (which I had
upgraded from 4G to 12G and now downgraded to 8G) where it
worked for ~2 years, also with AOSP stuff. Sigh...


Johannes

2016-07-28 00:01:19

by Dave Chinner

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Jul 25, 2016 at 09:55:06PM -0700, Christian Kujau wrote:
> On Mon, 18 Jul 2016, Theodore Ts'o wrote:
> > system corruption detected earlier. The problem is people
> > unfortunately run with their file systems set to errors=continue,
>
> The first thing I do after creating an ext4 fs is to set this to
> errors=remount-ro :-)
>
> Isn't "remount-ro" was XFS is doing when it detects an error?

No, XFS shuts down and prevents all further access to the filesystem
when it hits a corruption that it can't continue onwards from. This
is not optional in XFS - to continue after such a corruption is
detected will only make the corruption worse....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-08-05 10:35:44

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Wed, Aug 03, 2016 at 05:50:26PM +0300, T?r?k Edwin wrote:
> I have just encountered a similar problem after I've recently upgraded to 4.7.0:
> [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): dx_probe:740: inode #13295: comm python: Directory index failed checksum
> [Wed Aug 3 11:08:57 2016] Aborting journal on device dm-1-8.
> [Wed Aug 3 11:08:57 2016] EXT4-fs (dm-1): Remounting filesystem read-only
> [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
>
> I've rebooted in single-user mode, fsck fixed the filesystem, and rebooted, filesystem is rw again now.
>
> inode #13295 seems to be this and I can list it now:
> stat /usr/lib64/python3.4/site-packages
> File: '/usr/lib64/python3.4/site-packages'
> Size: 12288 Blocks: 24 IO Block: 4096 directory
> Device: fd01h/64769d Inode: 13295 Links: 180
> Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2016-05-09 11:29:44.056661988 +0300
> Modify: 2016-08-01 00:34:24.029779875 +0300
> Change: 2016-08-01 00:34:24.029779875 +0300
> Birth: -
>
> The filesystem was /, I only noticed it was readonly after several hours when I tried to install something:
> /dev/mapper/vg--ssd-root on / type ext4 (rw,noatime,errors=remount-ro,data=ordered)
>
> $ uname -a
> Linux bolt 4.7.0-gentoo-rr #1 SMP Thu Jul 28 11:28:56 EEST 2016 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
>
> FWIW I've been using ext4 for years and this is the first time I see this message.
> Prior to 4.7 I was on 4.6.1 -> 4.6.2 -> 4.6.3 -> 4.6.4.
>
> The kernel is from gentoo-sources + a patch for enabling AMD LWP (I had that patch since 4.6.3 and its not related to I/O).
>
> If I see this message again what should I do to obtain more information to trace down the root cause?

It just happened again to me, this time hitting /usr/sbin/
on root fs. Meanwhile I ran memtest86 7.0 for two nights,
it didn't find anything. I'm using hibernate regularly
and I think so this only happened after a few hibernate/resume
cycles, but no idea if that means anything.
Now I'm back at 4.4.16 to see if it reproduces.

Johannes

2016-08-05 17:02:28

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Fri, Aug 05, 2016 at 12:35:44PM +0200, Johannes Stezenbach wrote:
> On Wed, Aug 03, 2016 at 05:50:26PM +0300, T?r?k Edwin wrote:
> > I have just encountered a similar problem after I've recently upgraded to 4.7.0:
> > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): dx_probe:740: inode #13295: comm python: Directory index failed checksum
> > [Wed Aug 3 11:08:57 2016] Aborting journal on device dm-1-8.
> > [Wed Aug 3 11:08:57 2016] EXT4-fs (dm-1): Remounting filesystem read-only
> > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
> >
> > I've rebooted in single-user mode, fsck fixed the filesystem, and rebooted, filesystem is rw again now.
> >
> > inode #13295 seems to be this and I can list it now:
> > stat /usr/lib64/python3.4/site-packages
> > File: '/usr/lib64/python3.4/site-packages'
> > Size: 12288 Blocks: 24 IO Block: 4096 directory
> > Device: fd01h/64769d Inode: 13295 Links: 180
> > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > Access: 2016-05-09 11:29:44.056661988 +0300
> > Modify: 2016-08-01 00:34:24.029779875 +0300
> > Change: 2016-08-01 00:34:24.029779875 +0300
> > Birth: -
> >
> > The filesystem was /, I only noticed it was readonly after several hours when I tried to install something:
> > /dev/mapper/vg--ssd-root on / type ext4 (rw,noatime,errors=remount-ro,data=ordered)
> >
> > $ uname -a
> > Linux bolt 4.7.0-gentoo-rr #1 SMP Thu Jul 28 11:28:56 EEST 2016 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
> >
> > FWIW I've been using ext4 for years and this is the first time I see this message.
> > Prior to 4.7 I was on 4.6.1 -> 4.6.2 -> 4.6.3 -> 4.6.4.
> >
> > The kernel is from gentoo-sources + a patch for enabling AMD LWP (I had that patch since 4.6.3 and its not related to I/O).
> >
> > If I see this message again what should I do to obtain more information to trace down the root cause?
>
> It just happened again to me, this time hitting /usr/sbin/
> on root fs. Meanwhile I ran memtest86 7.0 for two nights,
> it didn't find anything. I'm using hibernate regularly
> and I think so this only happened after a few hibernate/resume
> cycles, but no idea if that means anything.
> Now I'm back at 4.4.16 to see if it reproduces.

When you're back on 4.7, can you apply this patch[1] to see if it fixes
the problem? I speculate that the new parallel dir lookup code enables
multiple threads to be verifying the same directory block buffer at the
same time.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23

>
> Johannes
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2016-08-05 18:11:54

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Fri, Aug 05, 2016 at 10:02:28AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 05, 2016 at 12:35:44PM +0200, Johannes Stezenbach wrote:
> > On Wed, Aug 03, 2016 at 05:50:26PM +0300, T?r?k Edwin wrote:
> > > I have just encountered a similar problem after I've recently upgraded to 4.7.0:
> > > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): dx_probe:740: inode #13295: comm python: Directory index failed checksum
> > > [Wed Aug 3 11:08:57 2016] Aborting journal on device dm-1-8.
> > > [Wed Aug 3 11:08:57 2016] EXT4-fs (dm-1): Remounting filesystem read-only
> > > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
> >
> > It just happened again to me, this time hitting /usr/sbin/
> > on root fs. Meanwhile I ran memtest86 7.0 for two nights,
> > it didn't find anything. I'm using hibernate regularly
> > and I think so this only happened after a few hibernate/resume
> > cycles, but no idea if that means anything.
> > Now I'm back at 4.4.16 to see if it reproduces.
>
> When you're back on 4.7, can you apply this patch[1] to see if it fixes
> the problem? I speculate that the new parallel dir lookup code enables
> multiple threads to be verifying the same directory block buffer at the
> same time.
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23

I added the patch, rebuilt and rebooted. It will take some time
before I'll report back since the issue is so hard to reproduce.

Thanks,
Johannes

2016-08-05 19:15:48

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Fri, Aug 05, 2016 at 08:11:36PM +0200, Johannes Stezenbach wrote:
> On Fri, Aug 05, 2016 at 10:02:28AM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 05, 2016 at 12:35:44PM +0200, Johannes Stezenbach wrote:
> > > On Wed, Aug 03, 2016 at 05:50:26PM +0300, T?r?k Edwin wrote:
> > > > I have just encountered a similar problem after I've recently upgraded to 4.7.0:
> > > > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): dx_probe:740: inode #13295: comm python: Directory index failed checksum
> > > > [Wed Aug 3 11:08:57 2016] Aborting journal on device dm-1-8.
> > > > [Wed Aug 3 11:08:57 2016] EXT4-fs (dm-1): Remounting filesystem read-only
> > > > [Wed Aug 3 11:08:57 2016] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
> > >
> > > It just happened again to me, this time hitting /usr/sbin/
> > > on root fs. Meanwhile I ran memtest86 7.0 for two nights,
> > > it didn't find anything. I'm using hibernate regularly
> > > and I think so this only happened after a few hibernate/resume
> > > cycles, but no idea if that means anything.
> > > Now I'm back at 4.4.16 to see if it reproduces.
> >
> > When you're back on 4.7, can you apply this patch[1] to see if it fixes
> > the problem? I speculate that the new parallel dir lookup code enables
> > multiple threads to be verifying the same directory block buffer at the
> > same time.
> >
> > [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>
> I added the patch, rebuilt and rebooted. It will take some time
> before I'll report back since the issue is so hard to reproduce.

FWIW I could trigger it reliably by running a bunch of directory traversal
programs simultaneously on the same directory. I have a script that fires
up multiple mutts pointing to the Maildirs for the high traffic Linux lists.

--D

>
> Thanks,
> Johannes
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2016-08-08 03:56:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
> > > [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
> >
> > I added the patch, rebuilt and rebooted. It will take some time
> > before I'll report back since the issue is so hard to reproduce.
>
> FWIW I could trigger it reliably by running a bunch of directory traversal
> programs simultaneously on the same directory. I have a script that fires
> up multiple mutts pointing to the Maildirs for the high traffic Linux lists.

Hmm, I wonder if we should request that this patch be backported to
-stable. Darrick, what do you think?

- Ted

2016-08-08 06:28:10

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
> > > > [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
> > >
> > > I added the patch, rebuilt and rebooted. It will take some time
> > > before I'll report back since the issue is so hard to reproduce.
> >
> > FWIW I could trigger it reliably by running a bunch of directory traversal
> > programs simultaneously on the same directory. I have a script that fires
> > up multiple mutts pointing to the Maildirs for the high traffic Linux lists.
>
> Hmm, I wonder if we should request that this patch be backported to
> -stable. Darrick, what do you think?

Seems like an excellent idea.

I have one lingering concern -- is it a bug that two processes could be
computing the checksum of a buffer simultaneously? I would have thought ext4
would serialize that kind of buffer_head access...

--D
>
> - Ted

2016-08-08 16:08:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
>
> I have one lingering concern -- is it a bug that two processes could be
> computing the checksum of a buffer simultaneously? I would have thought ext4
> would serialize that kind of buffer_head access...

Do we know how this is happening? We've always depended on the VFS to
provide this exclusion. The only way we should be modifying the
buffer_head at the same time if two CPU's are trying to modify the
directory at the same time, and that should _never_ be happening, even
with the new directory parallism code, unless the file system has
given permission and intends to do its own fine-grained locking.

- Ted

2016-08-08 16:55:46

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
> >
> > I have one lingering concern -- is it a bug that two processes could be
> > computing the checksum of a buffer simultaneously? I would have thought ext4
> > would serialize that kind of buffer_head access...
>
> Do we know how this is happening? We've always depended on the VFS to
> provide this exclusion. The only way we should be modifying the
> buffer_head at the same time if two CPU's are trying to modify the
> directory at the same time, and that should _never_ be happening, even
> with the new directory parallism code, unless the file system has
> given permission and intends to do its own fine-grained locking.

It's a combination of two things, I think. The first is that the
checksum calculation routine (temporarily) set the checksum field to
zero during the computation, which of course is a no-no. The patch
fixes that problem and should go in.

The second problem is that we now can have multiple lookups at the same
time, which means that there can be more than one CPU calling into
dx_probe on the same directory blocks at the same time. There isn't any
locking on the buffer heads between readers, so we can end up with
ext4_read_dirblock racing with itself to verify the block. It's perhaps
a little inefficient for multiple threads to be checksumming the same
block, but only turns deadly if you combine it with the first problem.

--D

>
> - Ted

2016-08-08 21:13:01

by Edwin Török

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On 2016-08-08 19:55, Darrick J. Wong wrote:
> On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
>> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
>>>
>>> I have one lingering concern -- is it a bug that two processes could be
>>> computing the checksum of a buffer simultaneously? I would have thought ext4
>>> would serialize that kind of buffer_head access...
>>
>> Do we know how this is happening? We've always depended on the VFS to
>> provide this exclusion. The only way we should be modifying the
>> buffer_head at the same time if two CPU's are trying to modify the
>> directory at the same time, and that should _never_ be happening, even
>> with the new directory parallism code, unless the file system has
>> given permission and intends to do its own fine-grained locking.
>
> It's a combination of two things, I think. The first is that the
> checksum calculation routine (temporarily) set the checksum field to
> zero during the computation, which of course is a no-no. The patch
> fixes that problem and should go in.

Thanks a lot for the patch.
I wrote a small testcase (see below) that triggers the problem quite soon on my box with kernel 4.7.0, and seems to have survived so far with kernel 4.7.0+patch.
When it failed it printed something like "readdir: Bad message".

The drop caches part is quite important for triggering the bug, and might explain why this bug was hard to reproduce: IIUC this race condition can happen only
if 2+ threads/processes try to access the same directory, and the directory's inode is not in the cache (i.e. was never cached, or got kicked out of the cache).


/*
$ gcc trigger.c -o trigger -pthread
$ ./trigger
*/

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>

#define FILES 100000
#define THREADS 16
#define LOOPS 1000

static void die(const char *msg)
{
perror(msg);
exit(EXIT_FAILURE);
}

static void* list(void* arg)
{
for(int i=0;i<LOOPS;i++) {
DIR *d = opendir(".");
if (!d) {
die("opendir");
}
errno = 0;
while(readdir(d)) {}
if (errno) {
die("readdir");
}
closedir(d);
FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
if (f) {
fputs("3", f);
fclose(f);
}
}
return NULL;
}

int main()
{
pthread_t t[THREADS];

if(mkdir("ext4test", 0755) < 0 && errno != EEXIST)
die("mkdir");
if(chdir("ext4test") < 0)
die("chdir");
for (unsigned i=0;i < FILES;i++) {
char name[16];
snprintf(name, sizeof(name), "%d", i);
int fd = open(name, O_WRONLY|O_CREAT, 0600);
if (fd < 0)
die("open");
close(fd);
}
for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
pthread_create(&t[i], NULL,list, NULL);
}
for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
pthread_join(t[i], NULL);
}
return 0;
}



--
Edwin T?r?k | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com

2016-08-09 02:37:54

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Tue, Aug 09, 2016 at 12:13:01AM +0300, T?r?k Edwin wrote:
> On 2016-08-08 19:55, Darrick J. Wong wrote:
> > On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
> >> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
> >>>
> >>> I have one lingering concern -- is it a bug that two processes could be
> >>> computing the checksum of a buffer simultaneously? I would have thought ext4
> >>> would serialize that kind of buffer_head access...
> >>
> >> Do we know how this is happening? We've always depended on the VFS to
> >> provide this exclusion. The only way we should be modifying the
> >> buffer_head at the same time if two CPU's are trying to modify the
> >> directory at the same time, and that should _never_ be happening, even
> >> with the new directory parallism code, unless the file system has
> >> given permission and intends to do its own fine-grained locking.
> >
> > It's a combination of two things, I think. The first is that the
> > checksum calculation routine (temporarily) set the checksum field to
> > zero during the computation, which of course is a no-no. The patch
> > fixes that problem and should go in.
>
> Thanks a lot for the patch.
> I wrote a small testcase (see below) that triggers the problem quite soon on
> my box with kernel 4.7.0, and seems to have survived so far with kernel
> 4.7.0+patch.
> When it failed it printed something like "readdir: Bad message".
>
> The drop caches part is quite important for triggering the bug, and might
> explain why this bug was hard to reproduce: IIUC this race condition can
> happen only if 2+ threads/processes try to access the same directory, and the
> directory's inode is not in the cache (i.e. was never cached, or got kicked
> out of the cache).

Could you formulate this into an xfstest, please? It would be very useful to
have this as a regression test.

(Or attach a Signed-off-by and I'll take care of it eventually.)

--D
>
>
> /*
> $ gcc trigger.c -o trigger -pthread
> $ ./trigger
> */
>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <dirent.h>
> #include <string.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <fcntl.h>
>
> #define FILES 100000
> #define THREADS 16
> #define LOOPS 1000
>
> static void die(const char *msg)
> {
> perror(msg);
> exit(EXIT_FAILURE);
> }
>
> static void* list(void* arg)
> {
> for(int i=0;i<LOOPS;i++) {
> DIR *d = opendir(".");
> if (!d) {
> die("opendir");
> }
> errno = 0;
> while(readdir(d)) {}
> if (errno) {
> die("readdir");
> }
> closedir(d);
> FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
> if (f) {
> fputs("3", f);
> fclose(f);
> }
> }
> return NULL;
> }
>
> int main()
> {
> pthread_t t[THREADS];
>
> if(mkdir("ext4test", 0755) < 0 && errno != EEXIST)
> die("mkdir");
> if(chdir("ext4test") < 0)
> die("chdir");
> for (unsigned i=0;i < FILES;i++) {
> char name[16];
> snprintf(name, sizeof(name), "%d", i);
> int fd = open(name, O_WRONLY|O_CREAT, 0600);
> if (fd < 0)
> die("open");
> close(fd);
> }
> for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
> pthread_create(&t[i], NULL,list, NULL);
> }
> for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
> pthread_join(t[i], NULL);
> }
> return 0;
> }
>
>
>
> --
> Edwin T?r?k | Co-founder and Lead Developer
>
> Skylable open-source object storage: reliable, fast, secure
> http://www.skylable.com

2016-08-09 07:12:58

by Edwin Török

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On 2016-08-09 05:37, Darrick J. Wong wrote:
> On Tue, Aug 09, 2016 at 12:13:01AM +0300, T?r?k Edwin wrote:
>> On 2016-08-08 19:55, Darrick J. Wong wrote:
>>> On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
>>>> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
>>>>>
>>>>> I have one lingering concern -- is it a bug that two processes could be
>>>>> computing the checksum of a buffer simultaneously? I would have thought ext4
>>>>> would serialize that kind of buffer_head access...
>>>>
>>>> Do we know how this is happening? We've always depended on the VFS to
>>>> provide this exclusion. The only way we should be modifying the
>>>> buffer_head at the same time if two CPU's are trying to modify the
>>>> directory at the same time, and that should _never_ be happening, even
>>>> with the new directory parallism code, unless the file system has
>>>> given permission and intends to do its own fine-grained locking.
>>>
>>> It's a combination of two things, I think. The first is that the
>>> checksum calculation routine (temporarily) set the checksum field to
>>> zero during the computation, which of course is a no-no. The patch
>>> fixes that problem and should go in.
>>
>> Thanks a lot for the patch.
>> I wrote a small testcase (see below) that triggers the problem quite soon on
>> my box with kernel 4.7.0, and seems to have survived so far with kernel
>> 4.7.0+patch.
>> When it failed it printed something like "readdir: Bad message".
>>
>> The drop caches part is quite important for triggering the bug, and might
>> explain why this bug was hard to reproduce: IIUC this race condition can
>> happen only if 2+ threads/processes try to access the same directory, and the
>> directory's inode is not in the cache (i.e. was never cached, or got kicked
>> out of the cache).
>
> Could you formulate this into an xfstest, please? It would be very useful to
> have this as a regression test.
>
> (Or attach a Signed-off-by and I'll take care of it eventually.)

I've attached a signed-off-by line and a copyright header (feel free to add yourself in the copyright header too):

Signed-off-by: T?r?k Edwin <[email protected]>

>> /*
>> $ gcc trigger.c -o trigger -pthread
>> $ ./trigger
>> */

/*
* Test concurrent readdir on uncached inodes
*
* Copyright (C) 2016 Skylable Ltd.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/

>>
>> #include <stdio.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <dirent.h>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <pthread.h>
>> #include <unistd.h>
>> #include <fcntl.h>
>>
>> #define FILES 100000
>> #define THREADS 16
>> #define LOOPS 1000
>>
>> static void die(const char *msg)
>> {
>> perror(msg);
>> exit(EXIT_FAILURE);
>> }
>>
>> static void* list(void* arg)
>> {
>> for(int i=0;i<LOOPS;i++) {
>> DIR *d = opendir(".");
>> if (!d) {
>> die("opendir");
>> }
>> errno = 0;
>> while(readdir(d)) {}
>> if (errno) {
>> die("readdir");
>> }
>> closedir(d);
>> FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
>> if (f) {
>> fputs("3", f);
>> fclose(f);
>> }
>> }
>> return NULL;
>> }
>>
>> int main()
>> {
>> pthread_t t[THREADS];
>>
>> if(mkdir("ext4test", 0755) < 0 && errno != EEXIST)
>> die("mkdir");
>> if(chdir("ext4test") < 0)
>> die("chdir");
>> for (unsigned i=0;i < FILES;i++) {
>> char name[16];
>> snprintf(name, sizeof(name), "%d", i);
>> int fd = open(name, O_WRONLY|O_CREAT, 0600);
>> if (fd < 0)
>> die("open");
>> close(fd);
>> }
>> for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
>> pthread_create(&t[i], NULL,list, NULL);
>> }
>> for (unsigned i=0;i < sizeof(t)/sizeof(t[0]); i++) {
>> pthread_join(t[i], NULL);
>> }
>> return 0;
>> }
>>
>>
>>
>> --
>> Edwin T?r?k | Co-founder and Lead Developer
>>
>> Skylable open-source object storage: reliable, fast, secure
>> http://www.skylable.com
>


--
Edwin T?r?k | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com

2016-08-17 14:27:25

by Edwin Török

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On 2016-08-08 09:28, Darrick J. Wong wrote:
> On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
>> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
>>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>>>
>>>> I added the patch, rebuilt and rebooted. It will take some time
>>>> before I'll report back since the issue is so hard to reproduce.
>>>
>>> FWIW I could trigger it reliably by running a bunch of directory traversal
>>> programs simultaneously on the same directory. I have a script that fires
>>> up multiple mutts pointing to the Maildirs for the high traffic Linux lists.
>>
>> Hmm, I wonder if we should request that this patch be backported to
>> -stable. Darrick, what do you think?
>
> Seems like an excellent idea.

Hi,

I see 4.7.1. was released, however before testing it in the changelog I can't see the inode.c commit with the checksum fix.
Is there another commit that would provide an equivalent fix, or is the fix not part of 4.7.1?

Best regards,
--Edwin

2016-08-17 20:02:32

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Fri, Aug 05, 2016 at 08:11:36PM +0200, Johannes Stezenbach wrote:
> On Fri, Aug 05, 2016 at 10:02:28AM -0700, Darrick J. Wong wrote:
> >
> > When you're back on 4.7, can you apply this patch[1] to see if it fixes
> > the problem? I speculate that the new parallel dir lookup code enables
> > multiple threads to be verifying the same directory block buffer at the
> > same time.
> >
> > [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>
> I added the patch, rebuilt and rebooted. It will take some time
> before I'll report back since the issue is so hard to reproduce.

FWIW, so far the issue didn't appear again after I applied
the patch to 4.7.0, and I stressed it a bit with repo syncs,
AOSP builds, rsync backups etc.

Johannes

2016-08-17 23:28:33

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Wed, Aug 17, 2016 at 05:27:25PM +0300, T?r?k Edwin wrote:
> On 2016-08-08 09:28, Darrick J. Wong wrote:
> > On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
> >> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
> >>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
> >>>>
> >>>> I added the patch, rebuilt and rebooted. It will take some time
> >>>> before I'll report back since the issue is so hard to reproduce.
> >>>
> >>> FWIW I could trigger it reliably by running a bunch of directory traversal
> >>> programs simultaneously on the same directory. I have a script that fires
> >>> up multiple mutts pointing to the Maildirs for the high traffic Linux lists.
> >>
> >> Hmm, I wonder if we should request that this patch be backported to
> >> -stable. Darrick, what do you think?
> >
> > Seems like an excellent idea.
>
> Hi,
>
> I see 4.7.1. was released, however before testing it in the changelog I can't
> see the inode.c commit with the checksum fix. Is there another commit that
> would provide an equivalent fix, or is the fix not part of 4.7.1?

There's no 'equivalent' fix; the one we have simply didn't make it
to the stable maintainers before 4.7.1 came out.

(Hey Ted, can you send the patch to ... whomever's in charge of 4.7?)

--D

>
> Best regards,
> --Edwin

2016-08-18 07:44:43

by Edwin Török

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On 2016-08-17 23:02, Johannes Stezenbach wrote:
> On Fri, Aug 05, 2016 at 08:11:36PM +0200, Johannes Stezenbach wrote:
>> On Fri, Aug 05, 2016 at 10:02:28AM -0700, Darrick J. Wong wrote:
>>>
>>> When you're back on 4.7, can you apply this patch[1] to see if it fixes
>>> the problem? I speculate that the new parallel dir lookup code enables
>>> multiple threads to be verifying the same directory block buffer at the
>>> same time.
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>
>> I added the patch, rebuilt and rebooted. It will take some time
>> before I'll report back since the issue is so hard to reproduce.
>
> FWIW, so far the issue didn't appear again after I applied
> the patch to 4.7.0, and I stressed it a bit with repo syncs,
> AOSP builds, rsync backups etc.

Same here: I've been using 4.7.0+patch daily, and everything is working fine.

Best regards,
--Edwin

2016-08-22 07:33:08

by Edwin Török

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On 2016-08-18 02:28, Darrick J. Wong wrote:
> On Wed, Aug 17, 2016 at 05:27:25PM +0300, T?r?k Edwin wrote:
>> On 2016-08-08 09:28, Darrick J. Wong wrote:
>>> On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
>>>> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
>>>>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>>>>>
>>>>>> I added the patch, rebuilt and rebooted. It will take some time
>>>>>> before I'll report back since the issue is so hard to reproduce.
>>>>>
>>>>> FWIW I could trigger it reliably by running a bunch of directory traversal
>>>>> programs simultaneously on the same directory. I have a script that fires
>>>>> up multiple mutts pointing to the Maildirs for the high traffic Linux lists.
>>>>
>>>> Hmm, I wonder if we should request that this patch be backported to
>>>> -stable. Darrick, what do you think?
>>>
>>> Seems like an excellent idea.
>>
>> Hi,
>>
>> I see 4.7.1. was released, however before testing it in the changelog I can't
>> see the inode.c commit with the checksum fix. Is there another commit that
>> would provide an equivalent fix, or is the fix not part of 4.7.1?
>
> There's no 'equivalent' fix; the one we have simply didn't make it
> to the stable maintainers before 4.7.1 came out.
>
> (Hey Ted, can you send the patch to ... whomever's in charge of 4.7?)

Ping? 4.7.2 is out....

The process for submitting to stable seems to be described here [1],
it doesn't say who can submit, but I guess it'd be better if one of the people listed in the original commit,
or an ext4 maintainer submitted it.

FWIW 4.7.0+patch has been working reliably here, would be nice if I could tell people here that its safe to upgrade to 4.7.x (without manually patching it).

Best regards,
--Edwin

2016-09-01 16:40:17

by Darrick J. Wong

[permalink] [raw]
Subject: Re: 4.7.0-rc7 ext4 error in dx_probe

On Mon, Aug 22, 2016 at 10:33:08AM +0300, T?r?k Edwin wrote:
> On 2016-08-18 02:28, Darrick J. Wong wrote:
> > On Wed, Aug 17, 2016 at 05:27:25PM +0300, T?r?k Edwin wrote:
> >> On 2016-08-08 09:28, Darrick J. Wong wrote:
> >>> On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
> >>>> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
> >>>>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
> >>>>>>
> >>>>>> I added the patch, rebuilt and rebooted. It will take some time
> >>>>>> before I'll report back since the issue is so hard to reproduce.
> >>>>>
> >>>>> FWIW I could trigger it reliably by running a bunch of directory traversal
> >>>>> programs simultaneously on the same directory. I have a script that fires
> >>>>> up multiple mutts pointing to the Maildirs for the high traffic Linux lists.
> >>>>
> >>>> Hmm, I wonder if we should request that this patch be backported to
> >>>> -stable. Darrick, what do you think?
> >>>
> >>> Seems like an excellent idea.
> >>
> >> Hi,
> >>
> >> I see 4.7.1. was released, however before testing it in the changelog I can't
> >> see the inode.c commit with the checksum fix. Is there another commit that
> >> would provide an equivalent fix, or is the fix not part of 4.7.1?
> >
> > There's no 'equivalent' fix; the one we have simply didn't make it
> > to the stable maintainers before 4.7.1 came out.
> >
> > (Hey Ted, can you send the patch to ... whomever's in charge of 4.7?)
>
> Ping? 4.7.2 is out....
>
> The process for submitting to stable seems to be described here [1], it
> doesn't say who can submit, but I guess it'd be better if one of the people
> listed in the original commit, or an ext4 maintainer submitted it.
>
> FWIW 4.7.0+patch has been working reliably here, would be nice if I could
> tell people here that its safe to upgrade to 4.7.x (without manually patching
> it).

Ted, ping for stable inclusion please? :)

--D

>
> Best regards,
> --Edwin