On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
> Hi,
>
> It's reproducible, not every time though. Ext4 works fine.
On ext4 fsstress won't run bulkstat because it doesn't exist. Either
way this smells like a MM issue to me as there were not XFS changes
in that area recently.
On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
> > Hi,
> >
> > It's reproducible, not every time though. Ext4 works fine.
>
> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
> way this smells like a MM issue to me as there were not XFS changes
> in that area recently.
Yap.
First bad commit:
commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
Author: Michal Hocko <[email protected]>
Date: Fri Feb 24 14:58:53 2017 -0800
vmalloc: back off when the current task is killed
Reverting this commit on top of
e5d56ef Merge tag 'watchdog-for-linus-v4.11'
survives the tests.
On 03/02/2017 10:49 AM, Xiong Zhou wrote:
> On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
>> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
>>> Hi,
>>>
>>> It's reproducible, not every time though. Ext4 works fine.
>> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
>> way this smells like a MM issue to me as there were not XFS changes
>> in that area recently.
> Yap.
>
> First bad commit:
>
> commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> Author: Michal Hocko <[email protected]>
> Date: Fri Feb 24 14:58:53 2017 -0800
>
> vmalloc: back off when the current task is killed
>
> Reverting this commit on top of
> e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> survives the tests.
Does the fsstress test or the system hang? I am not familiar with this
code, but if it's the test which is getting hung and it's hitting this
new check introduced by the above commit, that means the requester is
currently being killed by the OOM killer for some other memory allocation
request. Then isn't this kind of memory alloc failure expected?
On 2017/3/2 13:19, Xiong Zhou wrote:
> On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
>> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
>>> Hi,
>>>
>>> It's reproducible, not every time though. Ext4 works fine.
>>
>> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
>> way this smells like a MM issue to me as there were not XFS changes
>> in that area recently.
>
> Yap.
>
> First bad commit:
>
It looks like this is not a bug.
From the commit below, the allocation failure message is due to the current
process having received a SIGKILL signal.
You may need to confirm whether that's the case.
Regards,
Bob
> commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> Author: Michal Hocko <[email protected]>
> Date: Fri Feb 24 14:58:53 2017 -0800
>
> vmalloc: back off when the current task is killed
>
> Reverting this commit on top of
> e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> survives the tests.
>
On Thu 02-03-17 12:17:47, Anshuman Khandual wrote:
> On 03/02/2017 10:49 AM, Xiong Zhou wrote:
> > On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
> >> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
> >>> Hi,
> >>>
> >>> It's reproducible, not every time though. Ext4 works fine.
> >> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
> >> way this smells like a MM issue to me as there were not XFS changes
> >> in that area recently.
> > Yap.
> >
> > First bad commit:
> >
> > commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> > Author: Michal Hocko <[email protected]>
> > Date: Fri Feb 24 14:58:53 2017 -0800
> >
> > vmalloc: back off when the current task is killed
> >
> > Reverting this commit on top of
> > e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> > survives the tests.
>
> Does fsstress test or the system hang ? I am not familiar with this
> code but If it's the test which is getting hung and its hitting this
> new check introduced by the above commit that means the requester is
> currently being killed by OOM killer for some other memory allocation
> request.
Well, not exactly. It is sufficient for it to be _killed_ by SIGKILL.
And for that it just needs to do a group_exit when one thread was still
in the kernel (see zap_process). While I can change this check to
actually do the oom specific check I believe a more generic
fatal_signal_pending is the right thing to do here. I am still not sure
what is the actual problem here, though. Could you be more specific
please?
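For context, fatal_signal_pending() boils down to a check for a pending
SIGKILL on the task, roughly like this (simplified sketch of the helpers
in the scheduler headers, not a verbatim copy):
----------------------------------------
static inline int __fatal_signal_pending(struct task_struct *p)
{
	/* has SIGKILL been queued for this task? */
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
----------------------------------------
So any thread which had SIGKILL queued by zap_process() during a group
exit will see it return true, whether or not the OOM killer was involved.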
--
Michal Hocko
SUSE Labs
On Thu, Mar 02, 2017 at 09:42:23AM +0100, Michal Hocko wrote:
> On Thu 02-03-17 12:17:47, Anshuman Khandual wrote:
> > On 03/02/2017 10:49 AM, Xiong Zhou wrote:
> > > On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
> > >> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
> > >>> Hi,
> > >>>
> > >>> It's reproducible, not every time though. Ext4 works fine.
> > >> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
> > >> way this smells like a MM issue to me as there were not XFS changes
> > >> in that area recently.
> > > Yap.
> > >
> > > First bad commit:
> > >
> > > commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> > > Author: Michal Hocko <[email protected]>
> > > Date: Fri Feb 24 14:58:53 2017 -0800
> > >
> > > vmalloc: back off when the current task is killed
> > >
> > > Reverting this commit on top of
> > > e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> > > survives the tests.
> >
> > Does fsstress test or the system hang ? I am not familiar with this
> > code but If it's the test which is getting hung and its hitting this
> > new check introduced by the above commit that means the requester is
> > currently being killed by OOM killer for some other memory allocation
> > request.
>
> Well, not exactly. It is sufficient for it to be _killed_ by SIGKILL.
> And for that it just needs to do a group_exit when one thread was still
> in the kernel (see zap_process). While I can change this check to
> actually do the oom specific check I believe a more generic
> fatal_signal_pending is the right thing to do here. I am still not sure
> what is the actual problem here, though. Could you be more specific
> please?
It's blocking the test and system shutdown. fsstress won't exit.
For anyone interested, a simple ugly reproducer:
cat > fst.sh <<EOFF
#! /bin/bash -x
[ \$# -ne 3 ] && { echo "./single FSTYP TEST XFSTESTS_DIR"; exit 1; }
FST=\$1
BLKSZ=4096
fallocate -l 10G /home/test.img
fallocate -l 15G /home/scratch.img
MNT1=/loopmnt
MNT2=/loopsch
mkdir -p \$MNT1
mkdir -p \$MNT2
DEV1=\$(losetup --find --show /home/test.img)
DEV2=\$(losetup --find --show /home/scratch.img)
cleanup()
{
	umount -d \$MNT1
	umount -d \$MNT2
	umount \$DEV1 \$DEV2
	losetup -D || losetup -a | awk -F: '{print \$1}' | xargs losetup -d
	rm -f /home/{test,scratch}.img
}
trap cleanup 0 1 2
if [[ \$FST =~ ext ]] ; then
	mkfs.\${FST} -Fq -b \${BLKSZ} \$DEV1
elif [[ \$FST =~ xfs ]] ; then
	mkfs.\${FST} -fq -b size=\${BLKSZ} \$DEV1
fi
if test \$? -ne 0 ; then
	echo "mkfs \$DEV1 failed"
	exit 1
fi
if [[ \$FST =~ ext ]] ; then
	mkfs.\${FST} -Fq -b \${BLKSZ} \$DEV2
elif [[ \$FST =~ xfs ]] ; then
	mkfs.\${FST} -fq -b size=\${BLKSZ} \$DEV2
fi
if test \$? -ne 0 ; then
	echo "mkfs \$DEV2 failed"
	exit 1
fi
mount \$DEV1 \$MNT1
if test \$? -ne 0 ; then
	echo "mount \$DEV1 failed"
	exit 1
fi
mount \$DEV2 \$MNT2
if test \$? -ne 0 ; then
	echo "mount \$DEV2 failed"
	exit 1
fi
pushd \$3 || exit 1
cat > local.config <<EOF
TEST_DEV=\$DEV1
TEST_DIR=\$MNT1
SCRATCH_MNT=\$MNT2
SCRATCH_DEV=\$DEV2
EOF
i=0
while [ \$i -lt 50 ] ; do
	./check \$2
	echo \$i
	((i=\$i+1))
done
popd
EOFF
git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev
make -j2 && make install || exit 1
cd -
sh -x ./fst.sh xfs generic/269 xfstests-dev
> --
> Michal Hocko
> SUSE Labs
Michal Hocko wrote:
> On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> [...]
> > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > flag. As a result, those who are not ready to fail upon SIGKILL are
> > confused. ;-)
>
> You are right! The function is documented it might fail but the code
> doesn't really allow that. This seems like a bug to me. What do you
> think about the following?
Yes, I think this patch is correct. Maybe apply
-May fail and may return vmalloced memory.
+May fail.
together with it, for the returned memory is always vmalloc()ed anyway.
> ---
> From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 2 Mar 2017 11:31:11 +0100
> Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
>
> Even though kmem_zalloc_greedy is documented it might fail the current
> code doesn't really implement this properly and loops on the smallest
> allowed size for ever. This is a problem because vzalloc might fail
> permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> task is killed") such a failure is much more probable than it used to
> be. Fix this by bailing out if the minimum size request failed.
>
> This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
>
> Reported-by: Xiong Zhou <[email protected]>
> Analyzed-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> + if (kmsize == minsize)
> + break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> }
> --
> 2.11.0
>
> --
> Michal Hocko
> SUSE Labs
>
On 2017/03/02 14:19, Xiong Zhou wrote:
> On Wed, Mar 01, 2017 at 04:37:31PM -0800, Christoph Hellwig wrote:
>> On Wed, Mar 01, 2017 at 12:46:34PM +0800, Xiong Zhou wrote:
>>> Hi,
>>>
>>> It's reproducible, not every time though. Ext4 works fine.
>>
>> On ext4 fsstress won't run bulkstat because it doesn't exist. Either
>> way this smells like a MM issue to me as there were not XFS changes
>> in that area recently.
>
> Yap.
>
> First bad commit:
>
> commit 5d17a73a2ebeb8d1c6924b91e53ab2650fe86ffb
> Author: Michal Hocko <[email protected]>
> Date: Fri Feb 24 14:58:53 2017 -0800
>
> vmalloc: back off when the current task is killed
>
> Reverting this commit on top of
> e5d56ef Merge tag 'watchdog-for-linus-v4.11'
> survives the tests.
>
Looks like kmem_zalloc_greedy() is broken.
It loops forever until vzalloc() succeeds.
If somebody (not limited to the OOM killer) sends SIGKILL and
vmalloc() backs off, kmem_zalloc_greedy() will loop forever.
----------------------------------------
void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
{
void *ptr;
size_t kmsize = maxsize;
while (!(ptr = vzalloc(kmsize))) {
if ((kmsize >>= 1) <= minsize)
kmsize = minsize;
}
if (ptr)
*size = kmsize;
return ptr;
}
----------------------------------------
So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
killed") implemented __GFP_KILLABLE flag and automatically applied that
flag. As a result, those who are not ready to fail upon SIGKILL are
confused. ;-)
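For reference, the back-off that commit adds sits in the page population
loop of __vmalloc_area_node(), roughly like this (paraphrased sketch, not
the exact upstream diff):
----------------------------------------
	for (i = 0; i < area->nr_pages; i++) {
		struct page *page;

		/* back off instead of depleting reserves for a killed task */
		if (fatal_signal_pending(current)) {
			area->nr_pages = i;
			goto fail;
		}
		...
	}
----------------------------------------
Once SIGKILL is pending for the task, every further vzalloc() attempt
takes this path, so the loop in kmem_zalloc_greedy() above can never
terminate.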
On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> [...]
> > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > flag. As a result, those who are not ready to fail upon SIGKILL are
> > confused. ;-)
>
> You are right! The function is documented it might fail but the code
> doesn't really allow that. This seems like a bug to me. What do you
> think about the following?
> ---
> From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 2 Mar 2017 11:31:11 +0100
> Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
>
> Even though kmem_zalloc_greedy is documented it might fail the current
> code doesn't really implement this properly and loops on the smallest
> allowed size for ever. This is a problem because vzalloc might fail
> permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> task is killed") such a failure is much more probable than it used to
> be. Fix this by bailing out if the minimum size request failed.
>
> This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
>
> Reported-by: Xiong Zhou <[email protected]>
> Analyzed-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> + if (kmsize == minsize)
> + break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> }
More consistent with the rest of the kmem code might be to accept a
flags argument and do something like this based on KM_MAYFAIL. The one
current caller looks like it would pass it, but I suppose we'd still
need a mechanism to break out should a new caller not pass that flag.
Would a fatal_signal_pending() check in the loop as well allow us to
break out in the scenario that is reproduced here?
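Something along these lines is what I have in mind (hypothetical sketch
only; the flags argument and the KM_MAYFAIL plumbing below are assumptions,
not existing code):
----------------------------------------
void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize,
		   xfs_km_flags_t flags)
{
	void	*ptr;
	size_t	kmsize = maxsize;

	while (!(ptr = vzalloc(kmsize))) {
		/* callers that can handle failure give up at the floor */
		if ((flags & KM_MAYFAIL) && kmsize == minsize)
			break;
		/* and nobody should keep spinning once we have been killed */
		if (fatal_signal_pending(current))
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;
	return ptr;
}
----------------------------------------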
Brian
> --
> 2.11.0
>
> --
> Michal Hocko
> SUSE Labs
>
On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
[...]
> So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> killed") implemented __GFP_KILLABLE flag and automatically applied that
> flag. As a result, those who are not ready to fail upon SIGKILL are
> confused. ;-)
You are right! The function is documented it might fail but the code
doesn't really allow that. This seems like a bug to me. What do you
think about the following?
---
From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Thu, 2 Mar 2017 11:31:11 +0100
Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
Even though kmem_zalloc_greedy is documented it might fail the current
code doesn't really implement this properly and loops on the smallest
allowed size for ever. This is a problem because vzalloc might fail
permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
task is killed") such a failure is much more probable than it used to
be. Fix this by bailing out if the minimum size request failed.
This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
Reported-by: Xiong Zhou <[email protected]>
Analyzed-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..ee95f5c6db45 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
size_t kmsize = maxsize;
while (!(ptr = vzalloc(kmsize))) {
+ if (kmsize == minsize)
+ break;
if ((kmsize >>= 1) <= minsize)
kmsize = minsize;
}
--
2.11.0
--
Michal Hocko
SUSE Labs
On Thu 02-03-17 07:24:27, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > [...]
> > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > confused. ;-)
> >
> > You are right! The function is documented it might fail but the code
> > doesn't really allow that. This seems like a bug to me. What do you
> > think about the following?
> > ---
> > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> >
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > task is killed") such a failure is much more probable than it used to
> > be. Fix this by bailing out if the minimum size request failed.
> >
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> >
> > Reported-by: Xiong Zhou <[email protected]>
> > Analyzed-by: Tetsuo Handa <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > fs/xfs/kmem.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > size_t kmsize = maxsize;
> >
> > while (!(ptr = vzalloc(kmsize))) {
> > + if (kmsize == minsize)
> > + break;
> > if ((kmsize >>= 1) <= minsize)
> > kmsize = minsize;
> > }
>
> More consistent with the rest of the kmem code might be to accept a
> flags argument and do something like this based on KM_MAYFAIL.
Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
the same reason it doesn't support GFP_NOFS. So I am not sure this is a
good idea.
> The one
> current caller looks like it would pass it, but I suppose we'd still
> need a mechanism to break out should a new caller not pass that flag.
> Would a fatal_signal_pending() check in the loop as well allow us to
> break out in the scenario that is reproduced here?
Yes that check would work as well I just thought the break out when the
minsize request fails to be more logical. There might be other reasons
to fail the request and looping here seems just wrong. But whatever you
or other xfs people prefer.
--
Michal Hocko
SUSE Labs
On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 08:00:09, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > > confused. ;-)
> > > > >
> > > > > You are right! The function is documented it might fail but the code
> > > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > > think about the following?
> > > > > ---
> > > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko <[email protected]>
> > > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > > >
> > > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > > code doesn't really implement this properly and loops on the smallest
> > > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > > task is killed") such a failure is much more probable than it used to
> > > > > be. Fix this by bailing out if the minimum size request failed.
> > > > >
> > > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > > >
> > > > > Reported-by: Xiong Zhou <[email protected]>
> > > > > Analyzed-by: Tetsuo Handa <[email protected]>
> > > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > > ---
> > > > > fs/xfs/kmem.c | 2 ++
> > > > > 1 file changed, 2 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > > --- a/fs/xfs/kmem.c
> > > > > +++ b/fs/xfs/kmem.c
> > > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > > > size_t kmsize = maxsize;
> > > > >
> > > > > while (!(ptr = vzalloc(kmsize))) {
> > > > > + if (kmsize == minsize)
> > > > > + break;
> > > > > if ((kmsize >>= 1) <= minsize)
> > > > > kmsize = minsize;
> > > > > }
> > > >
> > > > More consistent with the rest of the kmem code might be to accept a
> > > > flags argument and do something like this based on KM_MAYFAIL.
> > >
> > > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > > good idea.
> > >
> >
> > Not sure I follow..? I'm just suggesting to control the loop behavior
> > based on the KM_ flag, not to do or change anything wrt to GFP_ flags.
>
> As Tetsuo already pointed out, vmalloc cannot really support never-fail
> semantic with the current implementation so the semantic would have
> to be implemented in kmem_zalloc_greedy and the only way to do that
> would be to loop there and this is rather nasty as you can see from the
> reported issue because the vmalloc failure might be permanent so there
> won't be any way to make a forward progress. Breaking out of the loop
> on fatal_signal_pending would break the non-failing semantic.
>
Sure..
> Besides that, there doesn't really seem to be any demand for this
> semantic in the first place so why to make this more complicated than
> necessary?
>
That may very well be the case. I'm not necessarily against this...
> I see your argument about being in sync with other kmem helpers but
> those are bit different because regular page/slab allocators allow never
> fail semantic (even though this is mostly ignored by those helpers which
> implement their own retries but that is a different topic).
>
... but what I'm trying to understand here is whether this failure
scenario is specific to vmalloc() or whether the other kmem_*()
functions are susceptible to the same problem. For example, suppose we
replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
KM_SLEEP) call. Could we hit the same problem if the process is killed?
Brian
> --
> Michal Hocko
> SUSE Labs
>
On Thu 02-03-17 08:41:58, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
[...]
> > I see your argument about being in sync with other kmem helpers but
> > those are bit different because regular page/slab allocators allow never
> > fail semantic (even though this is mostly ignored by those helpers which
> > implement their own retries but that is a different topic).
> >
>
> ... but what I'm trying to understand here is whether this failure
> scenario is specific to vmalloc() or whether the other kmem_*()
> functions are susceptible to the same problem. For example, suppose we
> replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> KM_SLEEP) call. Could we hit the same problem if the process is killed?
Well, kmem_zalloc uses kmalloc which can also fail when we are out of
memory but in that case we can expect the OOM killer releasing some
memory which would allow us to make a forward progress on the next
retry. So essentially retrying around kmalloc is much more safe in this
regard. Failing vmalloc might be permanent because there is no vmalloc
space to allocate from or much more likely due to already mentioned
patch. So vmalloc is different, really.
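To illustrate what I mean by retrying around kmalloc, the usual pattern
is something like the following (hypothetical helper, loosely modelled on
the existing kmem code rather than copied from it):
----------------------------------------
static void *
zalloc_retry(size_t size, xfs_km_flags_t flags)
{
	void	*ptr;

	do {
		ptr = kzalloc(size, GFP_KERNEL);
		if (ptr || (flags & KM_MAYFAIL))
			return ptr;
		/*
		 * The OOM killer can free memory on our behalf, so
		 * sleeping and retrying has a reasonable chance of
		 * making forward progress eventually.
		 */
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}
----------------------------------------
There is no equivalent guarantee behind a failing vzalloc().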
--
Michal Hocko
SUSE Labs
On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > [...]
> > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > confused. ;-)
> > >
> > > You are right! The function is documented it might fail but the code
> > > doesn't really allow that. This seems like a bug to me. What do you
> > > think about the following?
> > > ---
> > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > >
> > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > code doesn't really implement this properly and loops on the smallest
> > > allowed size for ever. This is a problem because vzalloc might fail
> > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > task is killed") such a failure is much more probable than it used to
> > > be. Fix this by bailing out if the minimum size request failed.
> > >
> > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > >
> > > Reported-by: Xiong Zhou <[email protected]>
> > > Analyzed-by: Tetsuo Handa <[email protected]>
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > fs/xfs/kmem.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 339c696bbc01..ee95f5c6db45 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > size_t kmsize = maxsize;
> > >
> > > while (!(ptr = vzalloc(kmsize))) {
> > > + if (kmsize == minsize)
> > > + break;
> > > if ((kmsize >>= 1) <= minsize)
> > > kmsize = minsize;
> > > }
> >
> > More consistent with the rest of the kmem code might be to accept a
> > flags argument and do something like this based on KM_MAYFAIL.
>
> Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> good idea.
>
Not sure I follow..? I'm just suggesting to control the loop behavior
based on the KM_ flag, not to do or change anything wrt to GFP_ flags.
> > The one
> > current caller looks like it would pass it, but I suppose we'd still
> > need a mechanism to break out should a new caller not pass that flag.
> > Would a fatal_signal_pending() check in the loop as well allow us to
> > break out in the scenario that is reproduced here?
>
> Yes that check would work as well I just thought the break out when the
> minsize request fails to be more logical. There might be other reasons
> to fail the request and looping here seems just wrong. But whatever you
> or other xfs people prefer.
There may be higher level reasons for why this code should or should not
loop, that just seems like a separate issue to me. My thinking is more
that this appears to be how every kmem_*() function operates today and
it seems a bit out of place to change behavior of one to fix a bug.
Maybe I'm missing something though.. are we subject to the same general
problem in any of the other kmem_*() functions that can currently loop
indefinitely?
Brian
> --
> Michal Hocko
> SUSE Labs
On Thu 02-03-17 08:00:09, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > [...]
> > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > confused. ;-)
> > > >
> > > > You are right! The function is documented it might fail but the code
> > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > think about the following?
> > > > ---
> > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <[email protected]>
> > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > >
> > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > code doesn't really implement this properly and loops on the smallest
> > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > task is killed") such a failure is much more probable than it used to
> > > > be. Fix this by bailing out if the minimum size request failed.
> > > >
> > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > >
> > > > Reported-by: Xiong Zhou <[email protected]>
> > > > Analyzed-by: Tetsuo Handa <[email protected]>
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > ---
> > > > fs/xfs/kmem.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > --- a/fs/xfs/kmem.c
> > > > +++ b/fs/xfs/kmem.c
> > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > > size_t kmsize = maxsize;
> > > >
> > > > while (!(ptr = vzalloc(kmsize))) {
> > > > + if (kmsize == minsize)
> > > > + break;
> > > > if ((kmsize >>= 1) <= minsize)
> > > > kmsize = minsize;
> > > > }
> > >
> > > More consistent with the rest of the kmem code might be to accept a
> > > flags argument and do something like this based on KM_MAYFAIL.
> >
> > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > good idea.
> >
>
> Not sure I follow..? I'm just suggesting to control the loop behavior
> based on the KM_ flag, not to do or change anything wrt to GFP_ flags.
As Tetsuo already pointed out, vmalloc cannot really support never-fail
semantic with the current implementation so the semantic would have
to be implemented in kmem_zalloc_greedy and the only way to do that
would be to loop there and this is rather nasty as you can see from the
reported issue because the vmalloc failure might be permanent so there
won't be any way to make a forward progress. Breaking out of the loop
on fatal_signal_pending would break the non-failing semantic.
Besides that, there doesn't really seem to be any demand for this
semantic in the first place so why make this more complicated than
necessary?
I see your argument about being in sync with other kmem helpers but
those are a bit different because regular page/slab allocators allow never
fail semantic (even though this is mostly ignored by those helpers which
implement their own retries but that is a different topic).
--
Michal Hocko
SUSE Labs
On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > I see your argument about being in sync with other kmem helpers but
> > > > > those are bit different because regular page/slab allocators allow never
> > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > implement their own retries but that is a different topic).
> > > > >
> > > >
> > > > ... but what I'm trying to understand here is whether this failure
> > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > functions are susceptible to the same problem. For example, suppose we
> > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > >
> > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > memory but in that case we can expect the OOM killer releasing some
> > > memory which would allow us to make a forward progress on the next
> > > retry. So essentially retrying around kmalloc is much more safe in this
> > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > space to allocate from or much more likely due to already mentioned
> > > patch. So vmalloc is different, really.
> >
> > Right.. that's why I'm asking. So it's technically possible but highly
> > unlikely due to the different failure characteristics. That seems
> > reasonable to me, then.
> >
> > To be clear, do we understand what causes the vzalloc() failure to be
> > effectively permanent in this specific reproducer? I know you mention
> > above that we could be out of vmalloc space, but that doesn't clarify
> > whether there are other potential failure paths or then what this has to
> > do with the fact that the process was killed. Does the pending signal
> > cause the subsequent failures or are you saying that there is some other
> > root cause of the failure, this process would effectively be spinning
> > here anyways, and we're just noticing it because it's trying to exit?
>
> In this particular case it is fatal_signal_pending that causes the
> permanent failure. This check has been added to prevent from complete
> memory reserves depletion on OOM when a killed task has a free ticket to
> reserves and vmalloc requests can be really large. In this case there
> was no OOM killer going on but fsstress has SIGKILL pending for other
> reason. Most probably as a result of the group_exit when all threads
> are killed (see zap_process). I could have turned fatal_signal_pending
> into tsk_is_oom_victim which would be less likely to hit but in
> principle fatal_signal_pending should be better because we do want to
> bail out when the process is exiting as soon as possible.
>
> What I really wanted to say is that there are other possible permanent
> failure paths in vmalloc AFAICS. They are much less probable but they
> still exist.
>
> Does that make more sense now?
Yes, thanks. That explains why this crops up now where it hasn't in the
past. Please include that background in the commit log description.
Also, that kind of makes me think that a fatal_signal_pending() check is
still appropriate in the loop, even if we want to drop the infinite
retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
however many retries are left before we return and that's also more
explicit for the next person who goes to change this code in the future.
Otherwise, I'm fine with breaking the infinite retry loop at the same
time. It looks like Christoph added this function originally so this
should probably require his ack as well..
Brian
> --
> Michal Hocko
> SUSE Labs
>
On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> [...]
> > > I see your argument about being in sync with other kmem helpers but
> > > those are bit different because regular page/slab allocators allow never
> > > fail semantic (even though this is mostly ignored by those helpers which
> > > implement their own retries but that is a different topic).
> > >
> >
> > ... but what I'm trying to understand here is whether this failure
> > scenario is specific to vmalloc() or whether the other kmem_*()
> > functions are susceptible to the same problem. For example, suppose we
> > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > KM_SLEEP) call. Could we hit the same problem if the process is killed?
>
> Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> memory but in that case we can expect the OOM killer releasing some
> memory which would allow us to make a forward progress on the next
> retry. So essentially retrying around kmalloc is much more safe in this
> regard. Failing vmalloc might be permanent because there is no vmalloc
> space to allocate from or much more likely due to already mentioned
> patch. So vmalloc is different, really.
Right.. that's why I'm asking. So it's technically possible but highly
unlikely due to the different failure characteristics. That seems
reasonable to me, then.
To be clear, do we understand what causes the vzalloc() failure to be
effectively permanent in this specific reproducer? I know you mention
above that we could be out of vmalloc space, but that doesn't clarify
whether there are other potential failure paths or then what this has to
do with the fact that the process was killed. Does the pending signal
cause the subsequent failures or are you saying that there is some other
root cause of the failure, this process would effectively be spinning
here anyways, and we're just noticing it because it's trying to exit?
Brian
> --
> Michal Hocko
> SUSE Labs
>
On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
> On Thu 02-03-17 09:51:31, Brian Foster wrote:
> > On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > > implement their own retries but that is a different topic).
> > > > > > >
> > > > > >
> > > > > > ... but what I'm trying to understand here is whether this failure
> > > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > > >
> > > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > > memory but in that case we can expect the OOM killer releasing some
> > > > > memory which would allow us to make a forward progress on the next
> > > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > > space to allocate from or much more likely due to already mentioned
> > > > > patch. So vmalloc is different, really.
> > > >
> > > > Right.. that's why I'm asking. So it's technically possible but highly
> > > > unlikely due to the different failure characteristics. That seems
> > > > reasonable to me, then.
> > > >
> > > > To be clear, do we understand what causes the vzalloc() failure to be
> > > > effectively permanent in this specific reproducer? I know you mention
> > > > above that we could be out of vmalloc space, but that doesn't clarify
> > > > whether there are other potential failure paths or then what this has to
> > > > do with the fact that the process was killed. Does the pending signal
> > > > cause the subsequent failures or are you saying that there is some other
> > > > root cause of the failure, this process would effectively be spinning
> > > > here anyways, and we're just noticing it because it's trying to exit?
> > >
> > > In this particular case it is fatal_signal_pending that causes the
> > > permanent failure. This check has been added to prevent from complete
> > > memory reserves depletion on OOM when a killed task has a free ticket to
> > > reserves and vmalloc requests can be really large. In this case there
> > > was no OOM killer going on but fsstress has SIGKILL pending for other
> > > reason. Most probably as a result of the group_exit when all threads
> > > are killed (see zap_process). I could have turned fatal_signal_pending
> > > into tsk_is_oom_victim which would be less likely to hit but in
> > > principle fatal_signal_pending should be better because we do want to
> > > bail out when the process is exiting as soon as possible.
> > >
> > > What I really wanted to say is that there are other possible permanent
> > > failure paths in vmalloc AFAICS. They are much less probable but they
> > > still exist.
> > >
> > > Does that make more sense now?
> >
> > Yes, thanks. That explains why this crops up now where it hasn't in the
> > past. Please include that background in the commit log description.
>
> OK, does this sound better. I am open to any suggestions to improve this
> of course
>
Yeah..
> : xfs: allow kmem_zalloc_greedy to fail
> :
> : Even though kmem_zalloc_greedy is documented it might fail the current
> : code doesn't really implement this properly and loops on the smallest
> : allowed size for ever. This is a problem because vzalloc might fail
> : permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> : ("vmalloc: back off when the current task is killed") when the current
> : task is killed. The later one makes the failure scenario much more
> : probable than it used to be. Fix this by bailing out if the minimum size
^
" because it makes vmalloc() failures permanent for tasks with fatal
signals pending."
> : request failed.
> :
> : This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> :
> : fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> : fsstress cpuset=/ mems_allowed=0-1
> : CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> : Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> : Call Trace:
> : dump_stack+0x63/0x87
> : warn_alloc+0x114/0x1c0
> : ? alloc_pages_current+0x88/0x120
> : __vmalloc_node_range+0x250/0x2a0
> : ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> : ? free_hot_cold_page+0x21f/0x280
> : vzalloc+0x54/0x60
> : ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> : kmem_zalloc_greedy+0x2b/0x40 [xfs]
> : xfs_bulkstat+0x11b/0x730 [xfs]
> : ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> : ? selinux_capable+0x20/0x30
> : ? security_capable+0x48/0x60
> : xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> : xfs_file_ioctl+0x9dd/0xad0 [xfs]
> : ? do_filp_open+0xa5/0x100
> : do_vfs_ioctl+0xa7/0x5e0
> : SyS_ioctl+0x79/0x90
> : do_syscall_64+0x67/0x180
> : entry_SYSCALL64_slow_path+0x25/0x25
> :
> : fsstress keeps looping inside kmem_zalloc_greedy without any way out
> : because vmalloc keeps failing due to fatal_signal_pending.
> :
> : Reported-by: Xiong Zhou <[email protected]>
> : Analyzed-by: Tetsuo Handa <[email protected]>
> : Signed-off-by: Michal Hocko <[email protected]>
>
> > Also, that kind of makes me think that a fatal_signal_pending() check is
> > still appropriate in the loop, even if we want to drop the infinite
> > retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> > however many retries are left before we return and that's also more
> > explicit for the next person who goes to change this code in the future.
>
> I am not objecting to adding fatal_signal_pending as well I just thought
> that from the logic POV breaking after reaching the minimum size is just
> the right thing to do. We can optimize further by checking
> fatal_signal_pending and reducing retries when we know it doesn't make
> much sense but that should be done on top as an optimization IMHO.
>
I don't think of it as an optimization to not invoke calls (a
non-deterministic number of times) that we know are going to fail, but
ultimately I don't care too much how it's framed or if it's done in
separate patches or whatnot. As long as they are posted at the same
time. ;)
Brian
> > Otherwise, I'm fine with breaking the infinite retry loop at the same
> > time. It looks like Christoph added this function originally so this
> > should probably require his ack as well..
>
> What do you think Christoph?
> --
> Michal Hocko
> SUSE Labs
From: Michal Hocko <[email protected]>
It doesn't really make much sense to retry vmalloc request if the
current task is killed. We should rather bail out as soon as possible
and let it RIP as soon as possible. The current implementation of
vmalloc will fail anyway.
Suggested-by: Brian Foster <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index ee95f5c6db45..01c52567a4ff 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -34,7 +34,7 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
size_t kmsize = maxsize;
while (!(ptr = vzalloc(kmsize))) {
- if (kmsize == minsize)
+ if (kmsize == minsize || fatal_signal_pending(current))
break;
if ((kmsize >>= 1) <= minsize)
kmsize = minsize;
--
2.11.0
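With this change on top of the kmsize == minsize bail-out, the whole
function reads roughly as follows (reconstructed from the listing and
diffs earlier in the thread; whitespace may differ from the tree):
----------------------------------------
void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
{
	void	*ptr;
	size_t	kmsize = maxsize;

	while (!(ptr = vzalloc(kmsize))) {
		if (kmsize == minsize || fatal_signal_pending(current))
			break;
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;
	return ptr;
}
----------------------------------------
i.e. the request size is halved on each failure down to minsize, and the
loop now gives up once the minimum size has failed or the caller has been
killed.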
From: Michal Hocko <[email protected]>
Even though kmem_zalloc_greedy is documented it might fail the current
code doesn't really implement this properly and loops on the smallest
allowed size for ever. This is a problem because vzalloc might fail
permanently - we might run out of vmalloc space or since 5d17a73a2ebe
("vmalloc: back off when the current task is killed") when the current
task is killed. The latter one makes the failure scenario much more
probable than it used to be because it makes vmalloc() failures
permanent for tasks with fatal signals pending. Fix this by bailing out
if the minimum size request failed.
This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
fsstress cpuset=/ mems_allowed=0-1
CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
Call Trace:
dump_stack+0x63/0x87
warn_alloc+0x114/0x1c0
? alloc_pages_current+0x88/0x120
__vmalloc_node_range+0x250/0x2a0
? kmem_zalloc_greedy+0x2b/0x40 [xfs]
? free_hot_cold_page+0x21f/0x280
vzalloc+0x54/0x60
? kmem_zalloc_greedy+0x2b/0x40 [xfs]
kmem_zalloc_greedy+0x2b/0x40 [xfs]
xfs_bulkstat+0x11b/0x730 [xfs]
? xfs_bulkstat_one_int+0x340/0x340 [xfs]
? selinux_capable+0x20/0x30
? security_capable+0x48/0x60
xfs_ioc_bulkstat+0xe4/0x190 [xfs]
xfs_file_ioctl+0x9dd/0xad0 [xfs]
? do_filp_open+0xa5/0x100
do_vfs_ioctl+0xa7/0x5e0
SyS_ioctl+0x79/0x90
do_syscall_64+0x67/0x180
entry_SYSCALL64_slow_path+0x25/0x25
fsstress keeps looping inside kmem_zalloc_greedy without any way out
because vmalloc keeps failing due to fatal_signal_pending.
Reported-by: Xiong Zhou <[email protected]>
Analyzed-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..ee95f5c6db45 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
size_t kmsize = maxsize;
while (!(ptr = vzalloc(kmsize))) {
+ if (kmsize == minsize)
+ break;
if ((kmsize >>= 1) <= minsize)
kmsize = minsize;
}
--
2.11.0
Brian Foster wrote:
> On Thu, Mar 02, 2017 at 01:49:09PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 07:24:27, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 11:35:20AM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 19:04:48, Tetsuo Handa wrote:
> > > > [...]
> > > > > So, commit 5d17a73a2ebeb8d1("vmalloc: back off when the current task is
> > > > > killed") implemented __GFP_KILLABLE flag and automatically applied that
> > > > > flag. As a result, those who are not ready to fail upon SIGKILL are
> > > > > confused. ;-)
> > > >
> > > > You are right! The function is documented it might fail but the code
> > > > doesn't really allow that. This seems like a bug to me. What do you
> > > > think about the following?
> > > > ---
> > > > From d02cb0285d8ce3344fd64dc7e2912e9a04bef80d Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <[email protected]>
> > > > Date: Thu, 2 Mar 2017 11:31:11 +0100
> > > > Subject: [PATCH] xfs: allow kmem_zalloc_greedy to fail
> > > >
> > > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > > code doesn't really implement this properly and loops on the smallest
> > > > allowed size for ever. This is a problem because vzalloc might fail
> > > > permanently. Since 5d17a73a2ebe ("vmalloc: back off when the current
> > > > task is killed") such a failure is much more probable than it used to
> > > > be. Fix this by bailing out if the minimum size request failed.
> > > >
> > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > > >
> > > > Reported-by: Xiong Zhou <[email protected]>
> > > > Analyzed-by: Tetsuo Handa <[email protected]>
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > ---
> > > > fs/xfs/kmem.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > > index 339c696bbc01..ee95f5c6db45 100644
> > > > --- a/fs/xfs/kmem.c
> > > > +++ b/fs/xfs/kmem.c
> > > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > > size_t kmsize = maxsize;
> > > >
> > > > while (!(ptr = vzalloc(kmsize))) {
> > > > + if (kmsize == minsize)
> > > > + break;
> > > > if ((kmsize >>= 1) <= minsize)
> > > > kmsize = minsize;
> > > > }
> > >
> > > More consistent with the rest of the kmem code might be to accept a
> > > flags argument and do something like this based on KM_MAYFAIL.
> >
> > Well, vmalloc doesn't really support GFP_NOFAIL semantic right now for
> > the same reason it doesn't support GFP_NOFS. So I am not sure this is a
> > good idea.
> >
>
> Not sure I follow..? I'm just suggesting to control the loop behavior
> based on the KM_ flag, not to do or change anything wrt to GFP_ flags.
vmalloc() cannot support __GFP_NOFAIL. Therefore, kmem_zalloc_greedy()
cannot support !KM_MAYFAIL. We will change kmem_zalloc_greedy() not to
use vmalloc() when we need to support !KM_MAYFAIL. That won't be now.
>
> > > The one
> > > current caller looks like it would pass it, but I suppose we'd still
> > > need a mechanism to break out should a new caller not pass that flag.
> > > Would a fatal_signal_pending() check in the loop as well allow us to
> > > break out in the scenario that is reproduced here?
> >
> > Yes that check would work as well I just thought the break out when the
> > minsize request fails to be more logical. There might be other reasons
> > to fail the request and looping here seems just wrong. But whatever you
> > or other xfs people prefer.
>
> There may be higher level reasons for why this code should or should not
> loop, that just seems like a separate issue to me. My thinking is more
> that this appears to be how every kmem_*() function operates today and
> it seems a bit out of place to change behavior of one to fix a bug.
>
> Maybe I'm missing something though.. are we subject to the same general
> problem in any of the other kmem_*() functions that can currently loop
> indefinitely?
>
The kmem code looks to me like a source of problems. For example,
kmem_alloc() from RESCUER workqueue threads got stuck doing GFP_NOFS allocation.
Due to __GFP_NOWARN, warn_alloc() cannot warn about allocation stalls.
Due to order <= PAGE_ALLOC_COSTLY_ORDER, the
"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)"
message cannot be printed because __alloc_pages_nodemask() does not
return. And due to GFP_NOFS without __GFP_HIGH nor __GFP_NOFAIL,
memory cannot be allocated for the RESCUER threads which are trying to
allocate memory in order to reclaim memory; the result is a silent lockup.
(I must stop here, for this is a different thread.)
----------------------------------------
[ 1095.633625] MemAlloc: xfs-data/sda1(451) flags=0x4228060 switches=45509 seq=1 gfp=0x1604240(GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK) order=0 delay=652073
[ 1095.633626] xfs-data/sda1 R running task 12696 451 2 0x00000000
[ 1095.633663] Workqueue: xfs-data/sda1 xfs_end_io [xfs]
[ 1095.633665] Call Trace:
[ 1095.633668] __schedule+0x336/0xe00
[ 1095.633671] schedule+0x3d/0x90
[ 1095.633672] schedule_timeout+0x20d/0x510
[ 1095.633675] ? lock_timer_base+0xa0/0xa0
[ 1095.633678] schedule_timeout_uninterruptible+0x2a/0x30
[ 1095.633680] __alloc_pages_slowpath+0x2b5/0xd95
[ 1095.633687] __alloc_pages_nodemask+0x3e4/0x460
[ 1095.633699] alloc_pages_current+0x97/0x1b0
[ 1095.633702] new_slab+0x4cb/0x6b0
[ 1095.633706] ___slab_alloc+0x3a3/0x620
[ 1095.633728] ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633730] ? ___slab_alloc+0x5c6/0x620
[ 1095.633732] ? cpuacct_charge+0x38/0x1e0
[ 1095.633767] ? kmem_alloc+0x96/0x120 [xfs]
[ 1095.633770] __slab_alloc+0x46/0x7d
[ 1095.633773] __kmalloc+0x301/0x3b0
[ 1095.633802] kmem_alloc+0x96/0x120 [xfs]
[ 1095.633804] ? kfree+0x1fa/0x330
[ 1095.633842] xfs_log_commit_cil+0x489/0x710 [xfs]
[ 1095.633864] __xfs_trans_commit+0x83/0x260 [xfs]
[ 1095.633883] xfs_trans_commit+0x10/0x20 [xfs]
[ 1095.633901] __xfs_setfilesize+0xdb/0x240 [xfs]
[ 1095.633936] xfs_setfilesize_ioend+0x89/0xb0 [xfs]
[ 1095.633954] ? xfs_setfilesize_ioend+0x5/0xb0 [xfs]
[ 1095.633971] xfs_end_io+0x81/0x110 [xfs]
[ 1095.633973] process_one_work+0x22b/0x760
[ 1095.633975] ? process_one_work+0x194/0x760
[ 1095.633997] rescuer_thread+0x1f2/0x3d0
[ 1095.634002] kthread+0x10f/0x150
[ 1095.634003] ? worker_thread+0x4b0/0x4b0
[ 1095.634004] ? kthread_create_on_node+0x70/0x70
[ 1095.634007] ret_from_fork+0x31/0x40
[ 1095.634013] MemAlloc: xfs-eofblocks/s(456) flags=0x4228860 switches=15435 seq=1 gfp=0x1400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=293074
[ 1095.634014] xfs-eofblocks/s R running task 12032 456 2 0x00000000
[ 1095.634037] Workqueue: xfs-eofblocks/sda1 xfs_eofblocks_worker [xfs]
[ 1095.634038] Call Trace:
[ 1095.634040] ? _raw_spin_lock+0x3d/0x80
[ 1095.634042] ? vmpressure+0xd0/0x120
[ 1095.634044] ? vmpressure+0xd0/0x120
[ 1095.634047] ? vmpressure_prio+0x21/0x30
[ 1095.634049] ? do_try_to_free_pages+0x70/0x300
[ 1095.634052] ? try_to_free_pages+0x131/0x3f0
[ 1095.634058] ? __alloc_pages_slowpath+0x3ec/0xd95
[ 1095.634065] ? __alloc_pages_nodemask+0x3e4/0x460
[ 1095.634069] ? alloc_pages_current+0x97/0x1b0
[ 1095.634111] ? xfs_buf_allocate_memory+0x160/0x2a3 [xfs]
[ 1095.634133] ? xfs_buf_get_map+0x2be/0x480 [xfs]
[ 1095.634169] ? xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1095.634204] ? xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1095.634222] ? xfs_btree_read_buf_block.constprop.34+0x78/0xc0 [xfs]
[ 1095.634239] ? xfs_btree_lookup_get_block+0x8a/0x180 [xfs]
[ 1095.634257] ? xfs_btree_lookup+0xd0/0x3f0 [xfs]
[ 1095.634296] ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634299] ? _raw_spin_unlock+0x27/0x40
[ 1095.634315] ? xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
[ 1095.634348] ? xfs_bmap_del_extent+0x1b2/0x1610 [xfs]
[ 1095.634380] ? kmem_zone_alloc+0x96/0x120 [xfs]
[ 1095.634400] ? __xfs_bunmapi+0x4db/0xda0 [xfs]
[ 1095.634421] ? xfs_bunmapi+0x2b/0x40 [xfs]
[ 1095.634459] ? xfs_itruncate_extents+0x1df/0x780 [xfs]
[ 1095.634502] ? xfs_rename+0xc70/0x1080 [xfs]
[ 1095.634525] ? xfs_free_eofblocks+0x1c4/0x230 [xfs]
[ 1095.634546] ? xfs_inode_free_eofblocks+0x18d/0x280 [xfs]
[ 1095.634565] ? xfs_inode_ag_walk.isra.13+0x2b5/0x620 [xfs]
[ 1095.634582] ? xfs_inode_ag_walk.isra.13+0x91/0x620 [xfs]
[ 1095.634618] ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634630] ? radix_tree_next_chunk+0x10b/0x2d0
[ 1095.634635] ? radix_tree_gang_lookup_tag+0xd7/0x150
[ 1095.634672] ? xfs_perag_get_tag+0x11d/0x370 [xfs]
[ 1095.634690] ? xfs_perag_get_tag+0x5/0x370 [xfs]
[ 1095.634709] ? xfs_inode_ag_iterator_tag+0x71/0xa0 [xfs]
[ 1095.634726] ? xfs_inode_clear_eofblocks_tag+0x1a0/0x1a0 [xfs]
[ 1095.634744] ? __xfs_icache_free_eofblocks+0x3b/0x40 [xfs]
[ 1095.634759] ? xfs_eofblocks_worker+0x27/0x40 [xfs]
[ 1095.634762] ? process_one_work+0x22b/0x760
[ 1095.634763] ? process_one_work+0x194/0x760
[ 1095.634784] ? rescuer_thread+0x1f2/0x3d0
[ 1095.634788] ? kthread+0x10f/0x150
[ 1095.634789] ? worker_thread+0x4b0/0x4b0
[ 1095.634790] ? kthread_create_on_node+0x70/0x70
[ 1095.634793] ? ret_from_fork+0x31/0x40
----------------------------------------
> Brian
>
> > --
> > Michal Hocko
> > SUSE Labs
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
On Thu, Mar 02, 2017 at 04:45:41PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> It doesn't really make much sense to retry vmalloc request if the
> current task is killed. We should rather bail out as soon as possible
> and let it RIP as soon as possible. The current implementation of
> vmalloc will fail anyway.
>
> Suggested-by: Brian Foster <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Reviewed-by: Brian Foster <[email protected]>
> fs/xfs/kmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index ee95f5c6db45..01c52567a4ff 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,7 +34,7 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> - if (kmsize == minsize)
> + if (kmsize == minsize || fatal_signal_pending(current))
> break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> --
> 2.11.0
>
On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Even though kmem_zalloc_greedy is documented it might fail the current
> code doesn't really implement this properly and loops on the smallest
> allowed size for ever. This is a problem because vzalloc might fail
> permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> ("vmalloc: back off when the current task is killed") when the current
> task is killed. The later one makes the failure scenario much more
> probable than it used to be because it makes vmalloc() failures
> permanent for tasks with fatal signals pending.. Fix this by bailing out
> if the minimum size request failed.
>
> This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
>
> fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> fsstress cpuset=/ mems_allowed=0-1
> CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> Call Trace:
> dump_stack+0x63/0x87
> warn_alloc+0x114/0x1c0
> ? alloc_pages_current+0x88/0x120
> __vmalloc_node_range+0x250/0x2a0
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> ? free_hot_cold_page+0x21f/0x280
> vzalloc+0x54/0x60
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> kmem_zalloc_greedy+0x2b/0x40 [xfs]
> xfs_bulkstat+0x11b/0x730 [xfs]
> ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> ? selinux_capable+0x20/0x30
> ? security_capable+0x48/0x60
> xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> xfs_file_ioctl+0x9dd/0xad0 [xfs]
> ? do_filp_open+0xa5/0x100
> do_vfs_ioctl+0xa7/0x5e0
> SyS_ioctl+0x79/0x90
> do_syscall_64+0x67/0x180
> entry_SYSCALL64_slow_path+0x25/0x25
>
> fsstress keeps looping inside kmem_zalloc_greedy without any way out
> because vmalloc keeps failing due to fatal_signal_pending.
>
> Reported-by: Xiong Zhou <[email protected]>
> Analyzed-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Reviewed-by: Brian Foster <[email protected]>
> fs/xfs/kmem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> + if (kmsize == minsize)
> + break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> }
> --
> 2.11.0
>
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
I've just realized that Darrick was not on the CC list. Let's add him.
I believe this patch should go in in the current cycle because
5d17a73a2ebe was merged in this merge window and it can be abused...
The other patch [1] is not that urgent.
[1] http://lkml.kernel.org/r/[email protected]
On Thu 02-03-17 16:45:40, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Even though kmem_zalloc_greedy is documented it might fail the current
> code doesn't really implement this properly and loops on the smallest
> allowed size for ever. This is a problem because vzalloc might fail
> permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> ("vmalloc: back off when the current task is killed") when the current
> task is killed. The later one makes the failure scenario much more
> probable than it used to be because it makes vmalloc() failures
> permanent for tasks with fatal signals pending.. Fix this by bailing out
> if the minimum size request failed.
>
> This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
>
> fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> fsstress cpuset=/ mems_allowed=0-1
> CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> Call Trace:
> dump_stack+0x63/0x87
> warn_alloc+0x114/0x1c0
> ? alloc_pages_current+0x88/0x120
> __vmalloc_node_range+0x250/0x2a0
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> ? free_hot_cold_page+0x21f/0x280
> vzalloc+0x54/0x60
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> kmem_zalloc_greedy+0x2b/0x40 [xfs]
> xfs_bulkstat+0x11b/0x730 [xfs]
> ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> ? selinux_capable+0x20/0x30
> ? security_capable+0x48/0x60
> xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> xfs_file_ioctl+0x9dd/0xad0 [xfs]
> ? do_filp_open+0xa5/0x100
> do_vfs_ioctl+0xa7/0x5e0
> SyS_ioctl+0x79/0x90
> do_syscall_64+0x67/0x180
> entry_SYSCALL64_slow_path+0x25/0x25
>
> fsstress keeps looping inside kmem_zalloc_greedy without any way out
> because vmalloc keeps failing due to fatal_signal_pending.
>
> Reported-by: Xiong Zhou <[email protected]>
> Analyzed-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> + if (kmsize == minsize)
> + break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> }
> --
> 2.11.0
>
--
Michal Hocko
SUSE Labs
On Thu, Mar 02, 2017 at 05:16:06PM +0100, Michal Hocko wrote:
> I've just realized that Darrick was not on the CC list. Let's add him.
> I believe this patch should go in in the current cycle because
> 5d17a73a2ebe was merged in this merge window and it can be abused...
>
> The other patch [1] is not that urgent.
>
> [1] http://lkml.kernel.org/r/[email protected]
Both patches look ok to me. I'll take both patches for rc2.
Reviewed-by: Darrick J. Wong <[email protected]>
(Annoyingly I missed the whole thread yesterday due to vger slowness, in
case anyone was wondering why I didn't reply.)
--D
>
> On Thu 02-03-17 16:45:40, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> > ("vmalloc: back off when the current task is killed") when the current
> > task is killed. The later one makes the failure scenario much more
> > probable than it used to be because it makes vmalloc() failures
> > permanent for tasks with fatal signals pending.. Fix this by bailing out
> > if the minimum size request failed.
> >
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> >
> > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > fsstress cpuset=/ mems_allowed=0-1
> > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> > Call Trace:
> > dump_stack+0x63/0x87
> > warn_alloc+0x114/0x1c0
> > ? alloc_pages_current+0x88/0x120
> > __vmalloc_node_range+0x250/0x2a0
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > ? free_hot_cold_page+0x21f/0x280
> > vzalloc+0x54/0x60
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > xfs_bulkstat+0x11b/0x730 [xfs]
> > ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> > ? selinux_capable+0x20/0x30
> > ? security_capable+0x48/0x60
> > xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> > xfs_file_ioctl+0x9dd/0xad0 [xfs]
> > ? do_filp_open+0xa5/0x100
> > do_vfs_ioctl+0xa7/0x5e0
> > SyS_ioctl+0x79/0x90
> > do_syscall_64+0x67/0x180
> > entry_SYSCALL64_slow_path+0x25/0x25
> >
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> >
> > Reported-by: Xiong Zhou <[email protected]>
> > Analyzed-by: Tetsuo Handa <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > fs/xfs/kmem.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > size_t kmsize = maxsize;
> >
> > while (!(ptr = vzalloc(kmsize))) {
> > + if (kmsize == minsize)
> > + break;
> > if ((kmsize >>= 1) <= minsize)
> > kmsize = minsize;
> > }
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs
On Thu 02-03-17 09:23:15, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > [...]
> > > > I see your argument about being in sync with other kmem helpers but
> > > > those are bit different because regular page/slab allocators allow never
> > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > implement their own retries but that is a different topic).
> > > >
> > >
> > > ... but what I'm trying to understand here is whether this failure
> > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > functions are susceptible to the same problem. For example, suppose we
> > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> >
> > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > memory but in that case we can expect the OOM killer releasing some
> > memory which would allow us to make a forward progress on the next
> > retry. So essentially retrying around kmalloc is much more safe in this
> > regard. Failing vmalloc might be permanent because there is no vmalloc
> > space to allocate from or much more likely due to already mentioned
> > patch. So vmalloc is different, really.
>
> Right.. that's why I'm asking. So it's technically possible but highly
> unlikely due to the different failure characteristics. That seems
> reasonable to me, then.
>
> To be clear, do we understand what causes the vzalloc() failure to be
> effectively permanent in this specific reproducer? I know you mention
> above that we could be out of vmalloc space, but that doesn't clarify
> whether there are other potential failure paths or then what this has to
> do with the fact that the process was killed. Does the pending signal
> cause the subsequent failures or are you saying that there is some other
> root cause of the failure, this process would effectively be spinning
> here anyways, and we're just noticing it because it's trying to exit?
In this particular case it is fatal_signal_pending that causes the
permanent failure. This check was added to prevent complete depletion of
memory reserves on OOM, since a killed task has a free ticket to those
reserves and vmalloc requests can be really large. In this case there
was no OOM killer involved, but fsstress had SIGKILL pending for another
reason, most probably as a result of the group exit when all threads
are killed (see zap_process). I could have turned fatal_signal_pending
into tsk_is_oom_victim, which would be less likely to hit, but in
principle fatal_signal_pending should be better because we do want to
bail out as soon as possible when the process is exiting.
What I really wanted to say is that there are other possible permanent
failure paths in vmalloc AFAICS. They are much less probable but they
still exist.
Does that make more sense now?
--
Michal Hocko
SUSE Labs
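[Editorial note: for context on the vmalloc side of this discussion, here is a
minimal sketch of the kind of bail-out that 5d17a73a2ebe ("vmalloc: back off
when the current task is killed") adds to the page-population path. This is an
illustration only, not the upstream diff; the function name and surrounding
structure are assumptions.]

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only: populate the pages backing a vmalloc area,
 * but give up immediately if the caller has a fatal signal pending.
 * Once the task is killed, every subsequent vzalloc() attempt fails,
 * which is what turns the kmem_zalloc_greedy() retry loop into an
 * infinite loop in this report.
 */
static int sketch_populate_pages(struct page **pages, unsigned int nr_pages,
				 gfp_t gfp_mask)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		/* back off when the current task is killed */
		if (fatal_signal_pending(current))
			return -ENOMEM;

		pages[i] = alloc_page(gfp_mask);
		if (!pages[i])
			return -ENOMEM;
	}
	return 0;
}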
On Thu, Mar 02, 2017 at 09:51:31AM -0500, Brian Foster wrote:
> Otherwise, I'm fine with breaking the infinite retry loop at the same
> time. It looks like Christoph added this function originally so this
> should probably require his ack as well..
I just moved the code around, but I'll take a look as well.
On Thu 02-03-17 10:30:02, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 04:14:11PM +0100, Michal Hocko wrote:
[...]
> > I am not objecting to adding fatal_signal_pending as well I just thought
> > that from the logic POV breaking after reaching the minimum size is just
> > the right thing to do. We can optimize further by checking
> > fatal_signal_pending and reducing retries when we know it doesn't make
> > much sense but that should be done on top as an optimization IMHO.
> >
>
> I don't think of it as an optimization to not invoke calls (a
> non-deterministic number of times) that we know are going to fail, but
The point is that vmalloc failure modes are an implementation detail
which might change in the future. The fix should really be independent
of the current implementation, which is why I think the
fatal_signal_pending check is just an optimization.
> ultimately I don't care too much how it's framed or if it's done in
> separate patches or whatnot. As long as they are posted at the same
> time. ;)
Done
--
Michal Hocko
SUSE Labs
On Thu 02-03-17 09:51:31, Brian Foster wrote:
> On Thu, Mar 02, 2017 at 03:34:41PM +0100, Michal Hocko wrote:
> > On Thu 02-03-17 09:23:15, Brian Foster wrote:
> > > On Thu, Mar 02, 2017 at 02:50:01PM +0100, Michal Hocko wrote:
> > > > On Thu 02-03-17 08:41:58, Brian Foster wrote:
> > > > > On Thu, Mar 02, 2017 at 02:27:55PM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > I see your argument about being in sync with other kmem helpers but
> > > > > > those are bit different because regular page/slab allocators allow never
> > > > > > fail semantic (even though this is mostly ignored by those helpers which
> > > > > > implement their own retries but that is a different topic).
> > > > > >
> > > > >
> > > > > ... but what I'm trying to understand here is whether this failure
> > > > > scenario is specific to vmalloc() or whether the other kmem_*()
> > > > > functions are susceptible to the same problem. For example, suppose we
> > > > > replaced this kmem_zalloc_greedy() call with a kmem_zalloc(PAGE_SIZE,
> > > > > KM_SLEEP) call. Could we hit the same problem if the process is killed?
> > > >
> > > > Well, kmem_zalloc uses kmalloc which can also fail when we are out of
> > > > memory but in that case we can expect the OOM killer releasing some
> > > > memory which would allow us to make a forward progress on the next
> > > > retry. So essentially retrying around kmalloc is much more safe in this
> > > > regard. Failing vmalloc might be permanent because there is no vmalloc
> > > > space to allocate from or much more likely due to already mentioned
> > > > patch. So vmalloc is different, really.
> > >
> > > Right.. that's why I'm asking. So it's technically possible but highly
> > > unlikely due to the different failure characteristics. That seems
> > > reasonable to me, then.
> > >
> > > To be clear, do we understand what causes the vzalloc() failure to be
> > > effectively permanent in this specific reproducer? I know you mention
> > > above that we could be out of vmalloc space, but that doesn't clarify
> > > whether there are other potential failure paths or then what this has to
> > > do with the fact that the process was killed. Does the pending signal
> > > cause the subsequent failures or are you saying that there is some other
> > > root cause of the failure, this process would effectively be spinning
> > > here anyways, and we're just noticing it because it's trying to exit?
> >
> > In this particular case it is fatal_signal_pending that causes the
> > permanent failure. This check has been added to prevent from complete
> > memory reserves depletion on OOM when a killed task has a free ticket to
> > reserves and vmalloc requests can be really large. In this case there
> > was no OOM killer going on but fsstress has SIGKILL pending for other
> > reason. Most probably as a result of the group_exit when all threads
> > are killed (see zap_process). I could have turn fatal_signal_pending
> > into tsk_is_oom_victim which would be less likely to hit but in
> > principle fatal_signal_pending should be better because we do want to
> > bail out when the process is existing as soon as possible.
> >
> > What I really wanted to say is that there are other possible permanent
> > failure paths in vmalloc AFAICS. They are much less probable but they
> > still exist.
> >
> > Does that make more sense now?
>
> Yes, thanks. That explains why this crops up now where it hasn't in the
> past. Please include that background in the commit log description.
OK, does this sound better? I am open to any suggestions to improve it,
of course:
: xfs: allow kmem_zalloc_greedy to fail
:
: Even though kmem_zalloc_greedy is documented as being allowed to fail,
: the current code doesn't really implement this properly and loops on the
: smallest allowed size forever. This is a problem because vzalloc might
: fail permanently - we might run out of vmalloc space or, since
: 5d17a73a2ebe ("vmalloc: back off when the current task is killed"), the
: current task might have been killed. The latter makes the failure
: scenario much more probable than it used to be. Fix this by bailing out
: if the minimum size request fails.
:
: This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
:
: fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
: fsstress cpuset=/ mems_allowed=0-1
: CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
: Call Trace:
: dump_stack+0x63/0x87
: warn_alloc+0x114/0x1c0
: ? alloc_pages_current+0x88/0x120
: __vmalloc_node_range+0x250/0x2a0
: ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
: ? free_hot_cold_page+0x21f/0x280
: vzalloc+0x54/0x60
: ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
: kmem_zalloc_greedy+0x2b/0x40 [xfs]
: xfs_bulkstat+0x11b/0x730 [xfs]
: ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
: ? selinux_capable+0x20/0x30
: ? security_capable+0x48/0x60
: xfs_ioc_bulkstat+0xe4/0x190 [xfs]
: xfs_file_ioctl+0x9dd/0xad0 [xfs]
: ? do_filp_open+0xa5/0x100
: do_vfs_ioctl+0xa7/0x5e0
: SyS_ioctl+0x79/0x90
: do_syscall_64+0x67/0x180
: entry_SYSCALL64_slow_path+0x25/0x25
:
: fsstress keeps looping inside kmem_zalloc_greedy without any way out
: because vmalloc keeps failing due to fatal_signal_pending.
:
: Reported-by: Xiong Zhou <[email protected]>
: Analyzed-by: Tetsuo Handa <[email protected]>
: Signed-off-by: Michal Hocko <[email protected]>
> Also, that kind of makes me think that a fatal_signal_pending() check is
> still appropriate in the loop, even if we want to drop the infinite
> retry loop in kmem_zalloc_greedy() as well. There's no sense in doing
> however many retries are left before we return and that's also more
> explicit for the next person who goes to change this code in the future.
I am not objecting to adding fatal_signal_pending as well; I just thought
that, from the logic POV, breaking after reaching the minimum size is
the right thing to do. We can optimize further by checking
fatal_signal_pending and cutting the retries short when we know they
don't make much sense, but that should be done on top as an optimization IMHO.
> Otherwise, I'm fine with breaking the infinite retry loop at the same
> time. It looks like Christoph added this function originally so this
> should probably require his ack as well..
What do you think Christoph?
--
Michal Hocko
SUSE Labs
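[Editorial note: for reference, with both patches in this thread applied, the
loop in fs/xfs/kmem.c ends up looking roughly as follows. The loop body comes
from the quoted diffs; the rest of the function is paraphrased from the
existing kmem_zalloc_greedy and may differ in minor detail from the tree.]

void *
kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
{
	void	*ptr;
	size_t	kmsize = maxsize;

	while (!(ptr = vzalloc(kmsize))) {
		/* stop once the minimum size has failed or we are killed */
		if (kmsize == minsize || fatal_signal_pending(current))
			break;
		/* otherwise retry with half the size, clamped to minsize */
		if ((kmsize >>= 1) <= minsize)
			kmsize = minsize;
	}
	if (ptr)
		*size = kmsize;
	return ptr;
}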
On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Even though kmem_zalloc_greedy is documented it might fail the current
> code doesn't really implement this properly and loops on the smallest
> allowed size for ever. This is a problem because vzalloc might fail
> permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> ("vmalloc: back off when the current task is killed") when the current
> task is killed. The later one makes the failure scenario much more
> probable than it used to be because it makes vmalloc() failures
> permanent for tasks with fatal signals pending.. Fix this by bailing out
> if the minimum size request failed.
>
> This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
>
> fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> fsstress cpuset=/ mems_allowed=0-1
> CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> Call Trace:
> dump_stack+0x63/0x87
> warn_alloc+0x114/0x1c0
> ? alloc_pages_current+0x88/0x120
> __vmalloc_node_range+0x250/0x2a0
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> ? free_hot_cold_page+0x21f/0x280
> vzalloc+0x54/0x60
> ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> kmem_zalloc_greedy+0x2b/0x40 [xfs]
> xfs_bulkstat+0x11b/0x730 [xfs]
> ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> ? selinux_capable+0x20/0x30
> ? security_capable+0x48/0x60
> xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> xfs_file_ioctl+0x9dd/0xad0 [xfs]
> ? do_filp_open+0xa5/0x100
> do_vfs_ioctl+0xa7/0x5e0
> SyS_ioctl+0x79/0x90
> do_syscall_64+0x67/0x180
> entry_SYSCALL64_slow_path+0x25/0x25
>
> fsstress keeps looping inside kmem_zalloc_greedy without any way out
> because vmalloc keeps failing due to fatal_signal_pending.
>
> Reported-by: Xiong Zhou <[email protected]>
> Analyzed-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..ee95f5c6db45 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> size_t kmsize = maxsize;
>
> while (!(ptr = vzalloc(kmsize))) {
> + if (kmsize == minsize)
> + break;
> if ((kmsize >>= 1) <= minsize)
> kmsize = minsize;
> }
Seems wrong to me - this function used to have lots of callers and
over time we've slowly removed them or replaced them with something
else. I'd suggest removing it completely, replacing the call sites
with kmem_zalloc_large().
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, Mar 04, 2017 at 09:54:44AM +1100, Dave Chinner wrote:
> On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> > ("vmalloc: back off when the current task is killed") when the current
> > task is killed. The later one makes the failure scenario much more
> > probable than it used to be because it makes vmalloc() failures
> > permanent for tasks with fatal signals pending.. Fix this by bailing out
> > if the minimum size request failed.
> >
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> >
> > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > fsstress cpuset=/ mems_allowed=0-1
> > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> > Call Trace:
> > dump_stack+0x63/0x87
> > warn_alloc+0x114/0x1c0
> > ? alloc_pages_current+0x88/0x120
> > __vmalloc_node_range+0x250/0x2a0
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > ? free_hot_cold_page+0x21f/0x280
> > vzalloc+0x54/0x60
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > xfs_bulkstat+0x11b/0x730 [xfs]
> > ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> > ? selinux_capable+0x20/0x30
> > ? security_capable+0x48/0x60
> > xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> > xfs_file_ioctl+0x9dd/0xad0 [xfs]
> > ? do_filp_open+0xa5/0x100
> > do_vfs_ioctl+0xa7/0x5e0
> > SyS_ioctl+0x79/0x90
> > do_syscall_64+0x67/0x180
> > entry_SYSCALL64_slow_path+0x25/0x25
> >
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> >
> > Reported-by: Xiong Zhou <[email protected]>
> > Analyzed-by: Tetsuo Handa <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > fs/xfs/kmem.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > size_t kmsize = maxsize;
> >
> > while (!(ptr = vzalloc(kmsize))) {
> > + if (kmsize == minsize)
> > + break;
> > if ((kmsize >>= 1) <= minsize)
> > kmsize = minsize;
> > }
>
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().
Heh. I thought the reason why _greedy still exists (for its sole user
bulkstat) is that bulkstat had the flexibility to deal with receiving
0, 1, or 4 pages. So yeah, we could just kill it.
But thinking even more stingily about memory, are there applications
that care about being able to bulkstat 16384 inodes at once? How badly
does bulkstat need to be able to bulk-process more than a page's worth
of inobt records, anyway?
--D
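[Editorial note: a quick back-of-the-envelope check of those figures, assuming
4k pages, a 16-byte incore inobt record and 64 inodes per inode chunk - all
stated here as assumptions rather than facts from this thread:]

/*
 *   records per 4k page   = 4096 / 16   = 256
 *   inodes per 4k page    = 256 * 64    = 16384   (the figure above)
 *   inodes per 4 pages    = 4 * 16384   = 65536   (roughly the 64k
 *                                                  inodes per call
 *                                                  mentioned later)
 */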
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
On Fri, Mar 03, 2017 at 03:19:12PM -0800, Darrick J. Wong wrote:
> On Sat, Mar 04, 2017 at 09:54:44AM +1100, Dave Chinner wrote:
> > On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Even though kmem_zalloc_greedy is documented it might fail the current
> > > code doesn't really implement this properly and loops on the smallest
> > > allowed size for ever. This is a problem because vzalloc might fail
> > > permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> > > ("vmalloc: back off when the current task is killed") when the current
> > > task is killed. The later one makes the failure scenario much more
> > > probable than it used to be because it makes vmalloc() failures
> > > permanent for tasks with fatal signals pending.. Fix this by bailing out
> > > if the minimum size request failed.
> > >
> > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > >
> > > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > > fsstress cpuset=/ mems_allowed=0-1
> > > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> > > Call Trace:
> > > dump_stack+0x63/0x87
> > > warn_alloc+0x114/0x1c0
> > > ? alloc_pages_current+0x88/0x120
> > > __vmalloc_node_range+0x250/0x2a0
> > > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > > ? free_hot_cold_page+0x21f/0x280
> > > vzalloc+0x54/0x60
> > > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > > kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > > xfs_bulkstat+0x11b/0x730 [xfs]
> > > ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> > > ? selinux_capable+0x20/0x30
> > > ? security_capable+0x48/0x60
> > > xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> > > xfs_file_ioctl+0x9dd/0xad0 [xfs]
> > > ? do_filp_open+0xa5/0x100
> > > do_vfs_ioctl+0xa7/0x5e0
> > > SyS_ioctl+0x79/0x90
> > > do_syscall_64+0x67/0x180
> > > entry_SYSCALL64_slow_path+0x25/0x25
> > >
> > > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > > because vmalloc keeps failing due to fatal_signal_pending.
> > >
> > > Reported-by: Xiong Zhou <[email protected]>
> > > Analyzed-by: Tetsuo Handa <[email protected]>
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > fs/xfs/kmem.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 339c696bbc01..ee95f5c6db45 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > > size_t kmsize = maxsize;
> > >
> > > while (!(ptr = vzalloc(kmsize))) {
> > > + if (kmsize == minsize)
> > > + break;
> > > if ((kmsize >>= 1) <= minsize)
> > > kmsize = minsize;
> > > }
> >
> > Seems wrong to me - this function used to have lots of callers and
> > over time we've slowly removed them or replaced them with something
> > else. I'd suggest removing it completely, replacing the call sites
> > with kmem_zalloc_large().
>
> Heh. I thought the reason why _greedy still exists (for its sole user
> bulkstat) is that bulkstat had the flexibility to deal with receiving
> 0, 1, or 4 pages. So yeah, we could just kill it.
irbuf is sized to minimise AGI locking, but if memory is low
it just uses what it can get. Keep in mind the number of inodes we
need to process is determined by the userspace buffer size, which
can easily be sized to hold tens of thousands of struct
xfs_bulkstat.
> But thinking even more stingily about memory, are there applications
> that care about being able to bulkstat 16384 inodes at once?
IIRC, xfsdump can bulkstat up to 64k inodes per call....
> How badly
> does bulkstat need to be able to bulk-process more than a page's worth
> of inobt records, anyway?
Benchmark it on a busy system doing lots of other AGI work (e.g. a
busy NFS server workload with a working set of tens of millions of
inodes so it doesn't fit in cache) and find out. That's generally
how I answer those sorts of questions...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat 04-03-17 09:54:44, Dave Chinner wrote:
> On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Even though kmem_zalloc_greedy is documented it might fail the current
> > code doesn't really implement this properly and loops on the smallest
> > allowed size for ever. This is a problem because vzalloc might fail
> > permanently - we might run out of vmalloc space or since 5d17a73a2ebe
> > ("vmalloc: back off when the current task is killed") when the current
> > task is killed. The later one makes the failure scenario much more
> > probable than it used to be because it makes vmalloc() failures
> > permanent for tasks with fatal signals pending.. Fix this by bailing out
> > if the minimum size request failed.
> >
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> >
> > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > fsstress cpuset=/ mems_allowed=0-1
> > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
> > Call Trace:
> > dump_stack+0x63/0x87
> > warn_alloc+0x114/0x1c0
> > ? alloc_pages_current+0x88/0x120
> > __vmalloc_node_range+0x250/0x2a0
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > ? free_hot_cold_page+0x21f/0x280
> > vzalloc+0x54/0x60
> > ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > kmem_zalloc_greedy+0x2b/0x40 [xfs]
> > xfs_bulkstat+0x11b/0x730 [xfs]
> > ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> > ? selinux_capable+0x20/0x30
> > ? security_capable+0x48/0x60
> > xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> > xfs_file_ioctl+0x9dd/0xad0 [xfs]
> > ? do_filp_open+0xa5/0x100
> > do_vfs_ioctl+0xa7/0x5e0
> > SyS_ioctl+0x79/0x90
> > do_syscall_64+0x67/0x180
> > entry_SYSCALL64_slow_path+0x25/0x25
> >
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> >
> > Reported-by: Xiong Zhou <[email protected]>
> > Analyzed-by: Tetsuo Handa <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > fs/xfs/kmem.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> > size_t kmsize = maxsize;
> >
> > while (!(ptr = vzalloc(kmsize))) {
> > + if (kmsize == minsize)
> > + break;
> > if ((kmsize >>= 1) <= minsize)
> > kmsize = minsize;
> > }
>
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().
I do not really care how this gets fixed. Dropping kmem_zalloc_greedy
sounds like a way to go. I am not familiar enough with xfs_bulkstat to
make an educated guess about which allocation size to use, so I guess I
have to leave this to you guys if you prefer that route.
Thanks!
--
Michal Hocko
SUSE Labs