syzbot wrote:
> syzbot hit the following crash on net-next commit
> 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +0000)
> Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed'
>
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: [email protected]
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> [ 3685] 0 3685 17821 1 184320 0 0 sshd
> [ 3692] 0 3692 4376 0 32768 0 0
> syzkaller025682
> [ 3695] 0 3695 4376 0 36864 0 0
> syzkaller025682
> Kernel panic - not syncing: Out of memory and no killable processes...
>
This sounds like too huge vmalloc() request where size is controlled by userspace.
----------
[ 27.738855] syzkaller025682 invoked oom-killer: gfp_mask=0x14002c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), nodemask=(null), order=0, oom_score_adj=0
[ 27.754960] syzkaller025682 cpuset=/ mems_allowed=0
[ 27.760386] CPU: 0 PID: 3689 Comm: syzkaller025682 Not tainted 4.15.0-rc9+ #212
[ 27.767820] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 27.777166] Call Trace:
[ 27.779739] dump_stack+0x194/0x257
[ 27.793194] dump_header+0x28c/0xe1e
[ 27.888997] oom_kill_process+0x8b5/0x14a0
[ 27.981648] out_of_memory+0x86d/0x1220
[ 28.003684] __alloc_pages_slowpath+0x1d1b/0x2d00
[ 28.054140] __alloc_pages_nodemask+0x9fb/0xd80
[ 28.090590] alloc_pages_current+0xb6/0x1e0
[ 28.094927] __vmalloc_node_range+0x409/0x650
[ 28.103837] __vmalloc_node_flags_caller+0x50/0x60
[ 28.113166] kvmalloc_node+0x82/0xd0
[ 28.116869] xt_alloc_table_info+0x64/0xe0
[ 28.121097] do_ip6t_set_ctl+0x29b/0x5f0
[ 28.139158] nf_setsockopt+0x67/0xc0
[ 28.142862] ipv6_setsockopt+0x115/0x150
[ 28.146912] udpv6_setsockopt+0x45/0x80
[ 28.150867] sock_common_setsockopt+0x95/0xd0
[ 28.155359] SyS_setsockopt+0x189/0x360
[ 28.177379] entry_SYSCALL_64_fastpath+0x29/0xa0
----------
struct xt_table_info *xt_alloc_table_info(unsigned int size)
{
(...snipped...)
info = kvmalloc(sz, GFP_KERNEL);
(...snipped...)
}
static int
do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
{
int ret;
if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
return -EPERM;
switch (cmd) {
case IP6T_SO_SET_REPLACE:
ret = do_replace(sock_net(sk), user, len);
break;
case IP6T_SO_SET_ADD_COUNTERS:
ret = do_add_counters(sock_net(sk), user, len, 0);
break;
default:
ret = -EINVAL;
}
return ret;
}
vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
off when the current task is killed") but then became unkillable by commit
b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
killed""). Therefore, we can't handle this problem from MM side.
Please consider adding some limit from networking side.
Tetsuo Handa <[email protected]> wrote:
> syzbot wrote:
> > syzbot hit the following crash on net-next commit
> > 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +0000)
> > Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed'
> >
> > C reproducer is attached.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: [email protected]
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> >
> > [ 3685] 0 3685 17821 1 184320 0 0 sshd
> > [ 3692] 0 3692 4376 0 32768 0 0
> > syzkaller025682
> > [ 3695] 0 3695 4376 0 36864 0 0
> > syzkaller025682
> > Kernel panic - not syncing: Out of memory and no killable processes...
> >
>
> This sounds like too huge vmalloc() request where size is controlled by userspace.
Right.
Before eacd86ca3b036e55e172b7279f101cef4a6ff3a4
this used
info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
would it help to re-add that?
> vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> off when the current task is killed") but then became unkillable by commit
> b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> killed""). Therefore, we can't handle this problem from MM side.
> Please consider adding some limit from networking side.
I don't know what "some limit" would be. I would prefer if there was
a way to supress OOM Killer in first place so we can just -ENOMEM user.
AFAIU NOWARN|NORETRY does that, so I would propose to just read-add it?
On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > off when the current task is killed") but then became unkillable by commit
> > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > killed""). Therefore, we can't handle this problem from MM side.
> > Please consider adding some limit from networking side.
>
> I don't know what "some limit" would be. I would prefer if there was
> a way to supress OOM Killer in first place so we can just -ENOMEM user.
Just supressing OOM kill is a bad idea. We still leave a way to allocate
arbitrary large buffer in kernel.
--
Kirill A. Shutemov
Kirill A. Shutemov <[email protected]> wrote:
> On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > off when the current task is killed") but then became unkillable by commit
> > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > killed""). Therefore, we can't handle this problem from MM side.
> > > Please consider adding some limit from networking side.
> >
> > I don't know what "some limit" would be. I would prefer if there was
> > a way to supress OOM Killer in first place so we can just -ENOMEM user.
>
> Just supressing OOM kill is a bad idea. We still leave a way to allocate
> arbitrary large buffer in kernel.
Isn't that what we do everywhere in network stack?
I think we should try to allocate whatever amount of memory is needed
for the given xtables ruleset, given that is what admin requested us to do.
I also would not know what limit is sane -- I've seen setups with as much
as 100k iptables rules, and that was 5 years ago.
And even if we add a "Xk rules" limit, it might be too much for
low-memory systems, or not enough for whatever other use case there
might be.
On Mon 29-01-18 17:57:22, Florian Westphal wrote:
> Kirill A. Shutemov <[email protected]> wrote:
> > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > > off when the current task is killed") but then became unkillable by commit
> > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > Please consider adding some limit from networking side.
> > >
> > > I don't know what "some limit" would be. I would prefer if there was
> > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> >
> > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > arbitrary large buffer in kernel.
>
> Isn't that what we do everywhere in network stack?
>
> I think we should try to allocate whatever amount of memory is needed
> for the given xtables ruleset, given that is what admin requested us to do.
If this is a root only thing then __GFP_NORETRY sounds like the most
straightforward way to go.
--
Michal Hocko
SUSE Labs
On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote:
> Kirill A. Shutemov <[email protected]> wrote:
> > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > > off when the current task is killed") but then became unkillable by commit
> > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > Please consider adding some limit from networking side.
> > >
> > > I don't know what "some limit" would be. I would prefer if there was
> > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> >
> > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > arbitrary large buffer in kernel.
>
> Isn't that what we do everywhere in network stack?
>
> I think we should try to allocate whatever amount of memory is needed
> for the given xtables ruleset, given that is what admin requested us to do.
Is it correct that "admin" in this case is root in random container?
I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET?
This can be fun.
> I also would not know what limit is sane -- I've seen setups with as much
> as 100k iptables rules, and that was 5 years ago.
>
> And even if we add a "Xk rules" limit, it might be too much for
> low-memory systems, or not enough for whatever other use case there
> might be.
I hate what I'm saying, but I guess we need some tunable here.
Not sure what exactly.
--
Kirill A. Shutemov
On Monday 2018-01-29 17:57, Florian Westphal wrote:
>> > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
>> > > off when the current task is killed") but then became unkillable by commit
>> > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
>> > > killed""). Therefore, we can't handle this problem from MM side.
>> > > Please consider adding some limit from networking side.
>> >
>> > I don't know what "some limit" would be. I would prefer if there was
>> > a way to supress OOM Killer in first place so we can just -ENOMEM user.
>>
>> Just supressing OOM kill is a bad idea. We still leave a way to allocate
>> arbitrary large buffer in kernel.
At the very least, mm could limit that kind of "arbitrary". If the machine has
x GB (swap included) and the admin tries to make the kernel allocate space for
an x GB ruleset, no way is it going to be satisfiable _even with OOM_.
>I think we should try to allocate whatever amount of memory is needed
>for the given xtables ruleset, given that is what admin requested us to do.
Kirill A. Shutemov <[email protected]> wrote:
> On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote:
> > Kirill A. Shutemov <[email protected]> wrote:
> > > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > > > off when the current task is killed") but then became unkillable by commit
> > > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > > Please consider adding some limit from networking side.
> > > >
> > > > I don't know what "some limit" would be. I would prefer if there was
> > > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> > >
> > > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > > arbitrary large buffer in kernel.
> >
> > Isn't that what we do everywhere in network stack?
> >
> > I think we should try to allocate whatever amount of memory is needed
> > for the given xtables ruleset, given that is what admin requested us to do.
>
> Is it correct that "admin" in this case is root in random container?
Yes.
> I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET?
Yes.
> This can be fun.
Do we prevent "admin in random container" to insert 2m ipv6 routes
(alternatively: ipsec tunnels, interfaces etc etc)?
> > I also would not know what limit is sane -- I've seen setups with as much
> > as 100k iptables rules, and that was 5 years ago.
> >
> > And even if we add a "Xk rules" limit, it might be too much for
> > low-memory systems, or not enough for whatever other use case there
> > might be.
>
> I hate what I'm saying, but I guess we need some tunable here.
> Not sure what exactly.
Would memcg help?
(I don't buy the "run untrusted binaries on linux is safe" thing, so
I would not know).
On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> Kirill A. Shutemov <[email protected]> wrote:
[...]
> > I hate what I'm saying, but I guess we need some tunable here.
> > Not sure what exactly.
>
> Would memcg help?
That really depends. I would have to check whether vmalloc path obeys
__GFP_ACCOUNT (I suspect it does except for page tables allocations but
that shouldn't be a big deal). But then the other potential problem is
the life time of the xt_table_info (or other potentially large) data
structures. Are they bound to any process life time. Because if they are
not then the OOM killer will not help. The OOM panic earlier in this
thread suggests it doesn't because the test case managed to eat all the
available memory and killed all the eligible tasks which didn't help.
So in some sense the memcg would help to stop the excessive allocation,
but it wouldn't resolve it other than kill all tasks in the affected
memcg/container. Whether this is sufficient or not, I dunno. It sounds
quite suboptimal to me. But it is true this would be less tricky then
adding a global knob...
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> wrote:
> On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > Kirill A. Shutemov <[email protected]> wrote:
> [...]
> > > I hate what I'm saying, but I guess we need some tunable here.
> > > Not sure what exactly.
> >
> > Would memcg help?
>
> That really depends. I would have to check whether vmalloc path obeys
> __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> that shouldn't be a big deal). But then the other potential problem is
> the life time of the xt_table_info (or other potentially large) data
> structures. Are they bound to any process life time.
No.
> Because if they are
> not then the OOM killer will not help. The OOM panic earlier in this
> thread suggests it doesn't because the test case managed to eat all the
> available memory and killed all the eligible tasks which didn't help.
Yes, which is why we do not want any OOM killer invocation in first
place...
> So in some sense the memcg would help to stop the excessive allocation,
> but it wouldn't resolve it other than kill all tasks in the affected
> memcg/container. Whether this is sufficient or not, I dunno. It sounds
> quite suboptimal to me. But it is true this would be less tricky then
> adding a global knob...
Global knob doesn't really help at all, I can add multiple large
iptables rulesets (so we would have to account), and we have same issue
in virtually all of networking, so we need limits for interface count,
tunnel count, ipsec policies/SAs, nftables, tc, etc etc.
On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> Michal Hocko <[email protected]> wrote:
> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > > Kirill A. Shutemov <[email protected]> wrote:
> > [...]
> > > > I hate what I'm saying, but I guess we need some tunable here.
> > > > Not sure what exactly.
> > >
> > > Would memcg help?
> >
> > That really depends. I would have to check whether vmalloc path obeys
> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > that shouldn't be a big deal). But then the other potential problem is
> > the life time of the xt_table_info (or other potentially large) data
> > structures. Are they bound to any process life time.
>
> No.
Well, IIUC they bound to net namespace life time, so killing all
proccesses in the namespace would help to get memory back. :)
--
Kirill A. Shutemov
On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
<[email protected]> wrote:
> On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
>> Michal Hocko <[email protected]> wrote:
>> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
>> > > Kirill A. Shutemov <[email protected]> wrote:
>> > [...]
>> > > > I hate what I'm saying, but I guess we need some tunable here.
>> > > > Not sure what exactly.
>> > >
>> > > Would memcg help?
>> >
>> > That really depends. I would have to check whether vmalloc path obeys
>> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
>> > that shouldn't be a big deal). But then the other potential problem is
>> > the life time of the xt_table_info (or other potentially large) data
>> > structures. Are they bound to any process life time.
>>
>> No.
>
> Well, IIUC they bound to net namespace life time, so killing all
> proccesses in the namespace would help to get memory back. :)
... unless the namespace is mounted into file system.
Let's start with NOWARN as that's what kernel generally uses for
allocations with user-controllable size. ENOMEM is roughly as
informative as the WARNING message in this case.
I think we also need to consider setting up memory cgroup for
syzkaller test processes (we do RLIMIT_AS, but that's weak).
On Tue 30-01-18 09:11:27, Florian Westphal wrote:
> Michal Hocko <[email protected]> wrote:
> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > > Kirill A. Shutemov <[email protected]> wrote:
> > [...]
> > > > I hate what I'm saying, but I guess we need some tunable here.
> > > > Not sure what exactly.
> > >
> > > Would memcg help?
> >
> > That really depends. I would have to check whether vmalloc path obeys
> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > that shouldn't be a big deal). But then the other potential problem is
> > the life time of the xt_table_info (or other potentially large) data
> > structures. Are they bound to any process life time.
>
> No.
>
> > Because if they are
> > not then the OOM killer will not help. The OOM panic earlier in this
> > thread suggests it doesn't because the test case managed to eat all the
> > available memory and killed all the eligible tasks which didn't help.
>
> Yes, which is why we do not want any OOM killer invocation in first
> place...
The problem is that as soon as you eat that memory and ask for more
until you fail with ENOMEM then the OOM is simply unavoidable.
--
Michal Hocko
SUSE Labs
On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote:
> On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> >> Michal Hocko <[email protected]> wrote:
> >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> >> > > Kirill A. Shutemov <[email protected]> wrote:
> >> > [...]
> >> > > > I hate what I'm saying, but I guess we need some tunable here.
> >> > > > Not sure what exactly.
> >> > >
> >> > > Would memcg help?
> >> >
> >> > That really depends. I would have to check whether vmalloc path obeys
> >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> >> > that shouldn't be a big deal). But then the other potential problem is
> >> > the life time of the xt_table_info (or other potentially large) data
> >> > structures. Are they bound to any process life time.
> >>
> >> No.
> >
> > Well, IIUC they bound to net namespace life time, so killing all
> > proccesses in the namespace would help to get memory back. :)
>
> ... unless the namespace is mounted into file system.
>
> Let's start with NOWARN as that's what kernel generally uses for
> allocations with user-controllable size. ENOMEM is roughly as
> informative as the WARNING message in this case.
You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc
right now. More specifically kvmalloc doesn't guanratee that the request
will not trigger the OOM killer (like regular __GFP_NORETRY). This is
because of internal vmalloc restrictions. If you are however OK to
simply bail out in most cases then __GFP_NORETRY should work reasonably
fine.
> I think we also need to consider setting up memory cgroup for
> syzkaller test processes (we do RLIMIT_AS, but that's weak).
Well, this is not about syzkaller, it merely pointed out a potential
DoS... And that has to be addressed somehow.
--
Michal Hocko
SUSE Labs
> From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
Acked-by: Florian Westphal <[email protected]>
On Tue 30-01-18 10:57:39, Michal Hocko wrote:
> On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote:
> > On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
> > <[email protected]> wrote:
> > > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> > >> Michal Hocko <[email protected]> wrote:
> > >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > >> > > Kirill A. Shutemov <[email protected]> wrote:
> > >> > [...]
> > >> > > > I hate what I'm saying, but I guess we need some tunable here.
> > >> > > > Not sure what exactly.
> > >> > >
> > >> > > Would memcg help?
> > >> >
> > >> > That really depends. I would have to check whether vmalloc path obeys
> > >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > >> > that shouldn't be a big deal). But then the other potential problem is
> > >> > the life time of the xt_table_info (or other potentially large) data
> > >> > structures. Are they bound to any process life time.
> > >>
> > >> No.
> > >
> > > Well, IIUC they bound to net namespace life time, so killing all
> > > proccesses in the namespace would help to get memory back. :)
> >
> > ... unless the namespace is mounted into file system.
> >
> > Let's start with NOWARN as that's what kernel generally uses for
> > allocations with user-controllable size. ENOMEM is roughly as
> > informative as the WARNING message in this case.
>
> You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc
> right now. More specifically kvmalloc doesn't guanratee that the request
> will not trigger the OOM killer (like regular __GFP_NORETRY). This is
> because of internal vmalloc restrictions. If you are however OK to
> simply bail out in most cases then __GFP_NORETRY should work reasonably
> fine.
>
> > I think we also need to consider setting up memory cgroup for
> > syzkaller test processes (we do RLIMIT_AS, but that's weak).
>
> Well, this is not about syzkaller, it merely pointed out a potential
> DoS... And that has to be addressed somehow.
So how about this?
---
From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 30 Jan 2018 14:51:07 +0100
Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
syzbot has noticed that xt_alloc_table_info can allocate a lot of
memory. This is an admin only interface but an admin in a namespace
is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
kvmalloc() in xt_alloc_table_info()") has changed the opencoded
kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
the way because vmalloc has simply never fully supported __GFP_NORETRY
semantic. This is still the case because e.g. page tables backing the
vmalloc area are hardcoded GFP_KERNEL.
Revert back to __GFP_NORETRY as a poors man defence against excessively
large allocation request here. We will not rule out the OOM killer
completely but __GFP_NORETRY should at least stop the large request
in most cases.
Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()")
Signed-off-by: Michal Hocko <[email protected]>
---
net/netfilter/x_tables.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d8571f414208..a5f5c29bcbdc 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
return NULL;
- info = kvmalloc(sz, GFP_KERNEL);
+ /*
+ * __GFP_NORETRY is not fully supported by kvmalloc but it should
+ * work reasonably well if sz is too large and bail out rather
+ * than shoot all processes down before realizing there is nothing
+ * more to reclaim.
+ */
+ info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
if (!info)
return NULL;
--
2.15.1
--
Michal Hocko
SUSE Labs
On Tue 30-01-18 15:01:11, Florian Westphal wrote:
> > From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Tue, 30 Jan 2018 14:51:07 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
>
> Acked-by: Florian Westphal <[email protected]>
Thanks! How should we route this change? Andrew, David?
--
Michal Hocko
SUSE Labs
On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko <[email protected]> wrote:
> > Well, this is not about syzkaller, it merely pointed out a potential
> > DoS... And that has to be addressed somehow.
>
> So how about this?
> ---
argh ;)
> >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
>
> syzbot has noticed that xt_alloc_table_info can allocate a lot of
> memory. This is an admin only interface but an admin in a namespace
> is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
> kvmalloc() in xt_alloc_table_info()") has changed the opencoded
> kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
> the way because vmalloc has simply never fully supported __GFP_NORETRY
> semantic. This is still the case because e.g. page tables backing the
> vmalloc area are hardcoded GFP_KERNEL.
>
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here. We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request
> in most cases.
>
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()")
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> net/netfilter/x_tables.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d8571f414208..a5f5c29bcbdc 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> return NULL;
offtopic: preceding comment here is "prevent them from hitting BUG() in
vmalloc.c". I suspect this is ancient code and vmalloc sure as heck
shouldn't go BUG with this input. And it should be using `sz' ;)
So I suspect and hope that this code can be removed. If not, let's fix
vmalloc!
> - info = kvmalloc(sz, GFP_KERNEL);
> + /*
> + * __GFP_NORETRY is not fully supported by kvmalloc but it should
> + * work reasonably well if sz is too large and bail out rather
> + * than shoot all processes down before realizing there is nothing
> + * more to reclaim.
> + */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> if (!info)
> return NULL;
checkpatch sayeth
networking block comments don't use an empty /* line, use /* Comment...
So I'll do that and shall scoot the patch Davewards.
On Tue 30-01-18 11:27:45, Andrew Morton wrote:
> On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko <[email protected]> wrote:
>
> > > Well, this is not about syzkaller, it merely pointed out a potential
> > > DoS... And that has to be addressed somehow.
> >
> > So how about this?
> > ---
>
> argh ;)
doh, those hardwired moves...
> > >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Tue, 30 Jan 2018 14:51:07 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> >
> > syzbot has noticed that xt_alloc_table_info can allocate a lot of
> > memory. This is an admin only interface but an admin in a namespace
> > is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
> > kvmalloc() in xt_alloc_table_info()") has changed the opencoded
> > kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
> > the way because vmalloc has simply never fully supported __GFP_NORETRY
> > semantic. This is still the case because e.g. page tables backing the
> > vmalloc area are hardcoded GFP_KERNEL.
> >
> > Revert back to __GFP_NORETRY as a poors man defence against excessively
> > large allocation request here. We will not rule out the OOM killer
> > completely but __GFP_NORETRY should at least stop the large request
> > in most cases.
> >
> > Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()")
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > net/netfilter/x_tables.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index d8571f414208..a5f5c29bcbdc 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> > if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> > return NULL;
>
> offtopic: preceding comment here is "prevent them from hitting BUG() in
> vmalloc.c". I suspect this is ancient code and vmalloc sure as heck
> shouldn't go BUG with this input. And it should be using `sz' ;)
Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
My excavation tools pointed me to "VM: Rework vmalloc code to support mapping of arbitray pages"
by Christoph back in 2002. So yes, we can safely remove it finally. Se
below.
From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 31 Jan 2018 09:16:56 +0100
Subject: [PATCH] net/netfilter/x_tables.c: remove size check
Back in 2002 vmalloc used to BUG on too large sizes. We are much better
behaved these days and vmalloc simply returns NULL for those. Remove
the check as it simply not needed and the comment even misleading.
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
net/netfilter/x_tables.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index b55ec5aa51a6..48a6ff620493 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (sz < sizeof(*info))
return NULL;
- /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
- if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
- return NULL;
-
/* __GFP_NORETRY is not fully supported by kvmalloc but it should
* work reasonably well if sz is too large and bail out rather
* than shoot all processes down before realizing there is nothing
--
2.15.1
--
Michal Hocko
SUSE Labs
On Tue, Jan 30, 2018 at 03:39:58PM +0100, Michal Hocko wrote:
> On Tue 30-01-18 15:01:11, Florian Westphal wrote:
> > > From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <[email protected]>
> > > Date: Tue, 30 Jan 2018 14:51:07 +0100
> > > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> >
> > Acked-by: Florian Westphal <[email protected]>
>
> Thanks! How should we route this change? Andrew, David?
I'll place this in the nf.git tree if everyone is happy with it.
Hi,
On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote:
[...]
> Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
> My excavation tools pointed me to "VM: Rework vmalloc code to support mapping of arbitray pages"
> by Christoph back in 2002. So yes, we can safely remove it finally. Se
> below.
>
>
> From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 31 Jan 2018 09:16:56 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: remove size check
>
> Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> behaved these days and vmalloc simply returns NULL for those. Remove
> the check as it simply not needed and the comment even misleading.
>
> Suggested-by: Andrew Morton <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> net/netfilter/x_tables.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index b55ec5aa51a6..48a6ff620493 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> if (sz < sizeof(*info))
> return NULL;
>
> - /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
> - if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> - return NULL;
> -
> /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> * work reasonably well if sz is too large and bail out rather
> * than shoot all processes down before realizing there is nothing
Patchwork didn't catch this patch for some reason, would you mind to
resend?
Thanks!
On Wed, 7 Feb 2018 18:44:39 +0100 Pablo Neira Ayuso <[email protected]> wrote:
> Hi,
>
> On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote:
> [...]
> > Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
> > My excavation tools pointed me to "VM: Rework vmalloc code to support mapping of arbitray pages"
> > by Christoph back in 2002. So yes, we can safely remove it finally. Se
> > below.
> >
> >
> > From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Wed, 31 Jan 2018 09:16:56 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: remove size check
> >
> > Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> > behaved these days and vmalloc simply returns NULL for those. Remove
> > the check as it simply not needed and the comment even misleading.
> >
> > Suggested-by: Andrew Morton <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > net/netfilter/x_tables.c | 4 ----
> > 1 file changed, 4 deletions(-)
> >
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index b55ec5aa51a6..48a6ff620493 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
> > if (sz < sizeof(*info))
> > return NULL;
> >
> > - /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
> > - if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> > - return NULL;
> > -
> > /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> > * work reasonably well if sz is too large and bail out rather
> > * than shoot all processes down before realizing there is nothing
>
> Patchwork didn't catch this patch for some reason, would you mind to
> resend?
From: Michal Hocko <[email protected]>
Subject: net/netfilter/x_tables.c: remove size check
Back in 2002 vmalloc used to BUG on too large sizes. We are much better
behaved these days and vmalloc simply returns NULL for those. Remove the
check as it simply not needed and the comment is even misleading.
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Florian Westphal <[email protected]>
Cc: David S. Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
net/netfilter/x_tables.c | 4 ----
1 file changed, 4 deletions(-)
diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check net/netfilter/x_tables.c
--- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check
+++ a/net/netfilter/x_tables.c
@@ -1004,10 +1004,6 @@ struct xt_table_info *xt_alloc_table_inf
if (sz < sizeof(*info))
return NULL;
- /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
- if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
- return NULL;
-
/* __GFP_NORETRY is not fully supported by kvmalloc but it should
* work reasonably well if sz is too large and bail out rather
* than shoot all processes down before realizing there is nothing
_
On Wed, Feb 07, 2018 at 11:06:42AM -0800, Andrew Morton wrote:
> From: Michal Hocko <[email protected]>
> Subject: net/netfilter/x_tables.c: remove size check
>
> Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> behaved these days and vmalloc simply returns NULL for those. Remove the
> check as it simply not needed and the comment is even misleading.
Applied, thanks!