2009-07-06 19:43:21

Subject: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

This isn't a huge deal, but using a big beefy box with more CPUs than what is
sane, you can get a nice flood of softlockup messages when running heavy
multi-threaded io tests on ext2/3. The processors compete for blocks from the
allocator, so they will loop quite a bit trying to get their allocation. This
patch simply makes sure that we reschedule if need be. This made the softlockup
messages disappear whereas before they happened almost immediately. Thanks,

Tested-by: Evan McNabb <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
---
fs/ext2/balloc.c | 1 +
fs/ext3/balloc.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 7f8d2e5..17dd55f 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
break; /* succeed */
}
num = *count;
+ cond_resched();
}
return ret;
}
diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 27967f9..cffc8cd 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
struct journal_head *jh = bh2jh(bh);

while (start < maxblocks) {
+ cond_resched();
next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
if (next >= maxblocks)
return -1;
@@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
break; /* succeed */
}
num = *count;
+ cond_resched();
}
out:
if (ret >= 0) {
--
1.6.2.2

2009-07-08 20:26:14

by Valerie Aurora

Subject: Re: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

On Mon, Jul 06, 2009 at 03:47:39PM -0400, Josef Bacik wrote:
> This isn't a huge deal, but using a big beefy box with more CPUs than what is
> sane, you can get a nice flood of softlockup messages when running heavy
> multi-threaded io tests on ext2/3. The processors compete for blocks from the
> allocator, so they will loop quite a bit trying to get their allocation. This
> patch simply makes sure that we reschedule if need be. This made the softlockup
> messages disappear whereas before they happened almost immediately. Thanks,
>
> Tested-by: Evan McNabb <[email protected]>
> Signed-off-by: Josef Bacik <[email protected]>
> ---
> fs/ext2/balloc.c | 1 +
> fs/ext3/balloc.c | 2 ++
> 2 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index 7f8d2e5..17dd55f 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
> break; /* succeed */
> }
> num = *count;
> + cond_resched();
> }
> return ret;
> }
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index 27967f9..cffc8cd 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> struct journal_head *jh = bh2jh(bh);
>
> while (start < maxblocks) {
> + cond_resched();
> next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
> if (next >= maxblocks)
> return -1;

I'm curious: Why schedule at the beginning of the while() loop rather
than at the end?

> @@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> break; /* succeed */
> }
> num = *count;
> + cond_resched();
> }
> out:
> if (ret >= 0) {
> --
> 1.6.2.2

I like this patch in general, but I worry about introducing new
performance problems in other cases. Have you guys tested on single
cpu systems? Maybe with a file system close to ENOSPC or badly
fragmented?

-VAL

2009-07-21 06:37:57

by Andrew Morton

Subject: Re: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

On Mon, 6 Jul 2009 15:47:39 -0400 Josef Bacik <[email protected]> wrote:

> This isn't a huge deal, but using a big beefy box with more CPUs than what is
> sane, you can get a nice flood of softlockup messages when running heavy
> multi-threaded io tests on ext2/3. The processors compete for blocks from the
> allocator, so they will loop quite a bit trying to get their allocation. This
> patch simply makes sure that we reschedule if need be. This made the softlockup
> messages disappear whereas before they happened almost immediately. Thanks,

The softlockup threshold is 60 seconds. For the kernel to spend 60
seconds continuous CPU time in the filesystem is very bad behaviour, and
adding a rescheduling point doesn't fix that!

> Tested-by: Evan McNabb <[email protected]>
> Signed-off-by: Josef Bacik <[email protected]>
> ---
> fs/ext2/balloc.c | 1 +
> fs/ext3/balloc.c | 2 ++
> 2 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index 7f8d2e5..17dd55f 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
> break; /* succeed */
> }
> num = *count;
> + cond_resched();
> }
> return ret;
> }
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index 27967f9..cffc8cd 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> struct journal_head *jh = bh2jh(bh);
>
> while (start < maxblocks) {
> + cond_resched();
> next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
> if (next >= maxblocks)
> return -1;
> @@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> break; /* succeed */
> }
> num = *count;
> + cond_resched();
> }
> out:
> if (ret >= 0) {

I worry that something has gone wrong with the reservations code. The
filesystem _should_ be able to find a free block without any contention
from other CPUs, because there's a range of blocks reserved for this
inode's allocation attempts.

Unless the workload has a lot of threads writing to the _same_ file.
If it does that then yes, we'll have lots of CPUs contending for blocks
within that inode's reservation window. Tell us about the workload please.

But that shouldn't be happening either because all those write()ing
threads will be serialised by i_mutex.

So I don't know what's happening here. Possibly a better fix would be
to add a lock rather than leaving the contention in place and hiding
it. Even better would be to understand why the contention is happening
and prevent that.

Thanks.

2009-07-21 15:16:08

by Josef Bacik

Subject: Re: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

On Mon, Jul 20, 2009 at 11:37:35PM -0700, Andrew Morton wrote:
> On Mon, 6 Jul 2009 15:47:39 -0400 Josef Bacik <[email protected]> wrote:
>
> > This isn't a huge deal, but using a big beefy box with more CPUs than what is
> > sane, you can get a nice flood of softlockup messages when running heavy
> > multi-threaded io tests on ext2/3. The processors compete for blocks from the
> > allocator, so they will loop quite a bit trying to get their allocation. This
> > patch simply makes sure that we reschedule if need be. This made the softlockup
> > messages disappear whereas before they happened almost immediately. Thanks,
>
> The softlockup threshold is 60 seconds. For the kernel to spend 60
> seconds continuous CPU time in the filesystem is very bad behaviour, and
> adding a rescheduling point doesn't fix that!
>

In RHEL it's set to 10 seconds, so it's not totally unreasonable.

> > Tested-by: Evan McNabb <[email protected]>
> > Signed-off-by: Josef Bacik <[email protected]>
> > ---
> > fs/ext2/balloc.c | 1 +
> > fs/ext3/balloc.c | 2 ++
> > 2 files changed, 3 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> > index 7f8d2e5..17dd55f 100644
> > --- a/fs/ext2/balloc.c
> > +++ b/fs/ext2/balloc.c
> > @@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
> > break; /* succeed */
> > }
> > num = *count;
> > + cond_resched();
> > }
> > return ret;
> > }
> > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > index 27967f9..cffc8cd 100644
> > --- a/fs/ext3/balloc.c
> > +++ b/fs/ext3/balloc.c
> > @@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> > struct journal_head *jh = bh2jh(bh);
> >
> > while (start < maxblocks) {
> > + cond_resched();
> > next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
> > if (next >= maxblocks)
> > return -1;
> > @@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> > break; /* succeed */
> > }
> > num = *count;
> > + cond_resched();
> > }
> > out:
> > if (ret >= 0) {
>
> I worry that something has gone wrong with the reservations code. The
> filesystem _should_ be able to find a free block without any contention
> from other CPUs, because there's a range of blocks reserved for this
> inode's allocation attempts.
>

Sure, the problem is if we run out of blocks in that reservation window, or
somebody else runs out of blocks in their reservation window, we start trying to
steal blocks from other inodes' reservation windows.

> Unless the workload has a lot of threads writing to the _same_ file.
> If it does that then yes, we'll have lots of CPUs contending for blocks
> within that inode's reservation window. Tell us about the workload please.
>

The workload is on a box with 32 CPUs and 32GB of RAM. It's running some sort of
kernel compiling stress test, which from what I understand is running a kernel
compile per CPU. Then on top of that there is a dd running at the same time.

> But that shouldn't be happening either because all those write()ing
> threads will be serialised by i_mutex.
>
> So I don't know what's happening here. Possibly a better fix would be
> to add a lock rather than leaving the contention in place and hiding
> it. Even better would be to understand why the contention is happening
> and prevent that.
>

I could probably add some locking in here to help the problem, but I'm worried
about the performance impact that would have. This is just a crap situation,
since we are quickly exhausting our reservation windows and devolving to just
schlepping through the block bitmaps for free space, and that's where we start to
suck hard. I can look into it some more and possibly come up with something
else; this just seemed to be the quickest way to fix the problem while affecting
as few people as possible, especially since it's only reproducing on a box
with 32 CPUs and 32GB of RAM. Thanks,

Josef

2009-07-21 15:50:20

by Jan Kara

Subject: Re: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

On Tue 21-07-09 11:15:52, Josef Bacik wrote:
> On Mon, Jul 20, 2009 at 11:37:35PM -0700, Andrew Morton wrote:
> > On Mon, 6 Jul 2009 15:47:39 -0400 Josef Bacik <[email protected]> wrote:
> >
> > > This isn't a huge deal, but using a big beefy box with more CPUs than what is
> > > sane, you can get a nice flood of softlockup messages when running heavy
> > > multi-threaded io tests on ext2/3. The processors compete for blocks from the
> > > allocator, so they will loop quite a bit trying to get their allocation. This
> > > patch simply makes sure that we reschedule if need be. This made the softlockup
> > > messages disappear whereas before they happened almost immediately. Thanks,
> >
> > The softlockup threshold is 60 seconds. For the kernel to spend 60
> > seconds continuous CPU time in the filesystem is very bad behaviour, and
> > adding a rescheduling point doesn't fix that!
> >
>
> In RHEL it's set to 10 seconds, so it's not totally unreasonable.
>
> > > Tested-by: Evan McNabb <[email protected]>
> > > Signed-off-by: Josef Bacik <[email protected]>
> > > ---
> > > fs/ext2/balloc.c | 1 +
> > > fs/ext3/balloc.c | 2 ++
> > > 2 files changed, 3 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> > > index 7f8d2e5..17dd55f 100644
> > > --- a/fs/ext2/balloc.c
> > > +++ b/fs/ext2/balloc.c
> > > @@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
> > > break; /* succeed */
> > > }
> > > num = *count;
> > > + cond_resched();
> > > }
> > > return ret;
> > > }
> > > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > > index 27967f9..cffc8cd 100644
> > > --- a/fs/ext3/balloc.c
> > > +++ b/fs/ext3/balloc.c
> > > @@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> > > struct journal_head *jh = bh2jh(bh);
> > >
> > > while (start < maxblocks) {
> > > + cond_resched();
> > > next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > if (next >= maxblocks)
> > > return -1;
> > > @@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> > > break; /* succeed */
> > > }
> > > num = *count;
> > > + cond_resched();
> > > }
> > > out:
> > > if (ret >= 0) {
> >
> > I worry that something has gone wrong with the reservations code. The
> > filesystem _should_ be able to find a free block without any contention
> > from other CPUs, because there's a range of blocks reserved for this
> > inode's allocation attempts.
> >
>
> Sure, the problem is if we run out of blocks in that reservation window, or
> somebody else runs out of blocks in their reservation window, we start trying to
> steal blocks from other inodes' reservation windows.
Yes, but that should happen only if we start running out of blocks (all the free
blocks are reserved). We scan all the groups and try to establish a
reservation window in each of them... Hmm, looking into the code, we also
skip groups with less than window_size/2 blocks free. But that should be at
most 2MB so it shouldn't be a big deal. How big is the filesystem and how full
does it get?
BTW: You write above you can see the problem on ext2/3. Can you really
observe it on ext2? I ask because on ext3, the pressure for free blocks is
much higher in stress tests which create & remove files since the space of
removed files can be used only after a transaction with delete is
committed.
Also have you verified that we indeed take the 'repeat' loop in
ext2_try_to_allocate() often (that's when we race with other threads
allocating blocks)?
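
For readers unfamiliar with the 'repeat' loop mentioned above, here is a minimal
userspace sketch of the general pattern: find a zero bit, try to set it
atomically, and search again if another thread got there first. It is an
analogue written with C11 atomics, not the ext2/ext3 code; next_zero_bit() and
claim_block() are hypothetical names, and the kernel uses its own bit
operations and locking.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: return the first zero bit at or after `start`,
 * or `nbits` if every remaining bit is set. */
static unsigned next_zero_bit(_Atomic uint64_t *map, unsigned nbits,
                              unsigned start)
{
    for (unsigned i = start; i < nbits; i++) {
        uint64_t word = atomic_load(&map[i / 64]);
        if (!(word & (UINT64_C(1) << (i % 64))))
            return i;
    }
    return nbits;
}

/* Hypothetical helper: claim one free bit at or after `goal`.
 * Returns the claimed bit index, or -1 if the bitmap is full. */
static int claim_block(_Atomic uint64_t *map, unsigned nbits, unsigned goal)
{
    unsigned bit = goal;

    while ((bit = next_zero_bit(map, nbits, bit)) < nbits) {
        uint64_t mask = UINT64_C(1) << (bit % 64);
        uint64_t old = atomic_fetch_or(&map[bit / 64], mask);

        if (!(old & mask))
            return (int)bit;    /* the bit was clear: we own it now */

        /* Another thread set the bit between our search and our
         * fetch_or. Move past it and search again; with many
         * writers hitting the same range, this is the path that
         * gets taken repeatedly. */
        bit++;
    }
    return -1;
}
```

Counting how often the "lost the race" branch is taken would be one way to
answer the question above for a given workload.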

> > Unless the workload has a lot of threads writing to the _same_ file.
> > If it does that then yes, we'll have lots of CPUs contending for blocks
> > within that inode's reservation window. Tell us about the workload please.
> >
>
> The workload is on a box with 32 CPUs and 32GB of RAM. It's running some sort of
> kernel compiling stress test, which from what I understand is running a kernel
> compile per CPU. Then on top of that there is a dd running at the same time.
And the kernel compile is single-threaded? My question should probably be
- roughly how many parallel writers are there?

> > But that shouldn't be happening either because all those write()ing
> > threads will be serialised by i_mutex.
> >
> > So I don't know what's happening here. Possibly a better fix would be
> > to add a lock rather than leaving the contention in place and hiding
> > it. Even better would be to understand why the contention is happening
> > and prevent that.
> >
>
> I could probably add some locking in here to help the problem, but I'm worried
> about the performance impact that would have. This is just a crap situation,
Yeah, I don't like the locking too much either. I'd first like to
understand what exactly happens on your box. One low-cost thing we could
try is that we won't scan groups for free blocks starting with group 0 but
starting with some random group and wrapping around, like we do it when
searching for free inodes. That should spread writers a bit.
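
The random-start idea above could look roughly like this. It is a userspace
sketch under stated assumptions, not kernel code: find_group_wrap() and the
free_blocks array are hypothetical, and the caller is assumed to choose
`start` itself (for example, at random per allocation).

```c
#include <stddef.h>

/*
 * Hypothetical helper: scan `ngroups` block groups for one with at least
 * `want` free blocks, beginning at group `start` and wrapping around to
 * group 0 after the last group. free_blocks[i] is the free-block count
 * of group i. Returns the group index, or -1 if no group qualifies.
 */
static long find_group_wrap(const unsigned *free_blocks, size_t ngroups,
                            size_t start, unsigned want)
{
    if (ngroups == 0)
        return -1;

    for (size_t i = 0; i < ngroups; i++) {
        /* Visit every group exactly once, in rotated order. */
        size_t group = (start + i) % ngroups;

        if (free_blocks[group] >= want)
            return (long)group;
    }
    return -1;
}
```

Because each writer begins its scan at a different group, concurrent
allocators are less likely to pile onto the same bitmap, while the wraparound
still guarantees that every group is considered before giving up.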

> since we are quickly exhausting our reservation windows and devolving to just
> schlepping through the block bitmaps for free space, and that's where we start to
> suck hard. I can look into it some more and possibly come up with something
> else; this just seemed to be the quickest way to fix the problem while affecting
> as few people as possible, especially since it's only reproducing on a box
> with 32 CPUs and 32GB of RAM. Thanks,
Well, that's not a small machine but not particularly huge either so I
think we should cope reasonably with it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-07-21 16:06:56

by Josef Bacik

Subject: Re: [PATCH] fix softlockups in ext2/3 when trying to allocate blocks

On Tue, Jul 21, 2009 at 05:50:20PM +0200, Jan Kara wrote:
> On Tue 21-07-09 11:15:52, Josef Bacik wrote:
> > On Mon, Jul 20, 2009 at 11:37:35PM -0700, Andrew Morton wrote:
> > > On Mon, 6 Jul 2009 15:47:39 -0400 Josef Bacik <[email protected]> wrote:
> > >
> > > > This isn't a huge deal, but using a big beefy box with more CPUs than what is
> > > > sane, you can get a nice flood of softlockup messages when running heavy
> > > > multi-threaded io tests on ext2/3. The processors compete for blocks from the
> > > > allocator, so they will loop quite a bit trying to get their allocation. This
> > > > patch simply makes sure that we reschedule if need be. This made the softlockup
> > > > messages disappear whereas before they happened almost immediately. Thanks,
> > >
> > > The softlockup threshold is 60 seconds. For the kernel to spend 60
> > > seconds continuous CPU time in the filesystem is very bad behaviour, and
> > > adding a rescheduling point doesn't fix that!
> > >
> >
> > In RHEL it's set to 10 seconds, so it's not totally unreasonable.
> >
> > > > Tested-by: Evan McNabb <[email protected]>
> > > > Signed-off-by: Josef Bacik <[email protected]>
> > > > ---
> > > > fs/ext2/balloc.c | 1 +
> > > > fs/ext3/balloc.c | 2 ++
> > > > 2 files changed, 3 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> > > > index 7f8d2e5..17dd55f 100644
> > > > --- a/fs/ext2/balloc.c
> > > > +++ b/fs/ext2/balloc.c
> > > > @@ -1176,6 +1176,7 @@ ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
> > > > break; /* succeed */
> > > > }
> > > > num = *count;
> > > > + cond_resched();
> > > > }
> > > > return ret;
> > > > }
> > > > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > > > index 27967f9..cffc8cd 100644
> > > > --- a/fs/ext3/balloc.c
> > > > +++ b/fs/ext3/balloc.c
> > > > @@ -735,6 +735,7 @@ bitmap_search_next_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> > > > struct journal_head *jh = bh2jh(bh);
> > > >
> > > > while (start < maxblocks) {
> > > > + cond_resched();
> > > > next = ext3_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > > if (next >= maxblocks)
> > > > return -1;
> > > > @@ -1391,6 +1392,7 @@ ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
> > > > break; /* succeed */
> > > > }
> > > > num = *count;
> > > > + cond_resched();
> > > > }
> > > > out:
> > > > if (ret >= 0) {
> > >
> > > I worry that something has gone wrong with the reservations code. The
> > > filesystem _should_ be able to find a free block without any contention
> > > from other CPUs, because there's a range of blocks reserved for this
> > > inode's allocation attempts.
> > >
> >
> > Sure, the problem is if we run out of blocks in that reservation window, or
> > somebody else runs out of blocks in their reservation window, we start trying to
> > steal blocks from other inodes' reservation windows.
> Yes, but that should happen only if we start running out of blocks (all the free
> blocks are reserved). We scan all the groups and try to establish a
> reservation window in each of them... Hmm, looking into the code, we also
> skip groups with less than window_size/2 blocks free. But that should be at
> most 2MB so it shouldn't be a big deal. How big is the filesystem and how full
> does it get?

Sorry, not entirely sure on the details here, it should just be a clean fs, no
idea how big. I can't get ahold of the original reporter.

> BTW: You write above you can see the problem on ext2/3. Can you really
> observe it on ext2? I ask because on ext3, the pressure for free blocks is
> much higher in stress tests which create & remove files since the space of
> removed files can be used only after a transaction with delete is
> committed.
> Also have you verified that we indeed take the 'repeat' loop in
> ext2_try_to_allocate() often (that's when we race with other threads
> allocating blocks)?
>

Hrm I thought it was reproduced on ext2, but looking back at the bz that wasn't
actually said, so I'm not sure if this happens on ext2.

> > > Unless the workload has a lot of threads writing to the _same_ file.
> > > If it does that then yes, we'll have lots of CPUs contending for blocks
> > > within that inode's reservation window. Tell us about the workload please.
> > >
> >
> > The workload is on a box with 32 CPUs and 32GB of RAM. It's running some sort of
> > kernel compiling stress test, which from what I understand is running a kernel
> > compile per CPU. Then on top of that there is a dd running at the same time.
> And the kernel compile is single-threaded? My question should probably be
> - roughly how many parallel writers are there?
>

Sorry I'm not sure, I'm waiting for the original reporter to pop back up so I
can get those details.

> > > But that shouldn't be happening either because all those write()ing
> > > threads will be serialised by i_mutex.
> > >
> > > So I don't know what's happening here. Possibly a better fix would be
> > > to add a lock rather than leaving the contention in place and hiding
> > > it. Even better would be to understand why the contention is happening
> > > and prevent that.
> > >
> >
> > I could probably add some locking in here to help the problem, but I'm worried
> > about the performance impact that would have. This is just a crap situation,
> Yeah, I don't like the locking too much either. I'd first like to
> understand what exactly happens on your box. One low-cost thing we could
> try is that we won't scan groups for free blocks starting with group 0 but
> starting with some random group and wrapping around, like we do it when
> searching for free inodes. That should spread writers a bit.
>
> > since we are quickly exhausting our reservation windows and devolving to just
> > schlepping through the block bitmaps for free space, and that's where we start to
> > suck hard. I can look into it some more and possibly come up with something
> > else; this just seemed to be the quickest way to fix the problem while affecting
> > as few people as possible, especially since it's only reproducing on a box
> > with 32 CPUs and 32GB of RAM. Thanks,
> Well, that's not a small machine but not particularly huge either so I
> think we should cope reasonably with it.
>

Agreed. As soon as the original reporter pops back up again I will get some
more details from him and see about getting a more complete picture of what
exactly is going on. Thanks,

Josef