Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752961AbdL1BTI (ORCPT ); Wed, 27 Dec 2017 20:19:08 -0500 Received: from mail-qt0-f196.google.com ([209.85.216.196]:43731 "EHLO mail-qt0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752686AbdL1BTG (ORCPT ); Wed, 27 Dec 2017 20:19:06 -0500 X-Google-Smtp-Source: ACJfBosJz77Hs3JlG3eVXdmqnxZ7Q1btrLOfXYNLyu4eZbgVYM8Kf31urSnjCU1hO5YiaamGuv2Ik0g1XhsVUJBBUVo= MIME-Version: 1.0 In-Reply-To: <926421514406019@web43j.yandex.ru> References: <001a11421c7436e541055c4c7cf0@google.com> <1164631514405294@web52j.yandex.ru> <926421514406019@web43j.yandex.ru> From: Tom Herbert Date: Wed, 27 Dec 2017 17:19:05 -0800 Message-ID: Subject: Re: WARNING in strp_data_ready To: Ozgur Cc: Dmitry Vyukov , John Fastabend , syzbot , "David S. Miller" , Eric Biggers , LKML , Linux Kernel Network Developers , "syzkaller-bugs@googlegroups.com" , Tom Herbert , Cong Wang Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6225 Lines: 141 On Wed, Dec 27, 2017 at 12:20 PM, Ozgur wrote: > > > 27.12.2017, 23:14, "Dmitry Vyukov" : >> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur wrote: >>> 27.12.2017, 22:21, "Dmitry Vyukov" : >>>> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert wrote: >>>>> Did you try the patch I posted? >>>> >>>> Hi Tom, >>> >>> Hello Dmitry, >>> >>>> No. And I didn't know I need to. Why? >>>> If you think the patch needs additional testing, you can ask syzbot to >>>> test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot >>>> Otherwise proceed with committing it. Or what are we waiting for? >>>> >>>> Thanks >>> >>> I think we need to fixed patch for crash, in fact check to patch code and test solve the bug. >>> How do test it because there is no patch in the following bug? >> >> Hi Ozgur, >> >> I am not sure I completely understand what you mean. But the >> reproducer for this bug (which one can use for testing) is here: >> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY >> Tom also mentions there is some patch for this, but I don't know where >> it is, it doesn't seem to be referenced from this thread. > > Hello Dmitry, > > Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :) > I think Tom send patch to only you and are you tested? > > kcmsock.c will change and strp_data_ready I think locked. > > Tom, please send a patch for me? I can test and inform you. > Hi Ozgur, I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you can! Thanks, Tom > Regards > > Ozgur > >>> The fix patch should be for this net/kcm/kcmsock.c file and lock functions must be added calling sk_data_ready (). >>> Regards >>> >>> Ozgur >>> >>>>> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov wrote: >>>>>> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov wrote: >>>>>>>> wrote: >>>>>>>>> On 10/24/2017 08:20 AM, syzbot wrote: >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb >>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master >>>>>>>>>> compiler: gcc (GCC) 7.1.1 20170620 >>>>>>>>>> .config is attached >>>>>>>>>> Raw console output is attached. >>>>>>>>>> C reproducer is attached >>>>>>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ >>>>>>>>>> for information about syzkaller reproducers >>>>>>>>>> >>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline] >>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline] >>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >>>>>>>>>> Kernel panic - not syncing: panic_on_warn set ... >>>>>>>>>> >>>>>>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138 >>>>>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 >>>>>>>>>> Call Trace: >>>>>>>>>> >>>>>>>>>> __dump_stack lib/dump_stack.c:16 [inline] >>>>>>>>>> dump_stack+0x194/0x257 lib/dump_stack.c:52 >>>>>>>>>> panic+0x1e4/0x417 kernel/panic.c:181 >>>>>>>>>> __warn+0x1c4/0x1d9 kernel/panic.c:542 >>>>>>>>>> report_bug+0x211/0x2d0 lib/bug.c:183 >>>>>>>>>> fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 >>>>>>>>>> do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] >>>>>>>>>> do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 >>>>>>>>>> do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 >>>>>>>>>> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 >>>>>>>>>> invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 >>>>>>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline] >>>>>>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline] >>>>>>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404 >>>>>>>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206 >>>>>>>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000 >>>>>>>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0 >>>>>>>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd >>>>>>>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000 >>>>>>>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001 >>>>>>>>>> psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353 >>>>>>>>> >>>>>>>>> Looks like KCM is calling sk_data_ready() without first taking the >>>>>>>>> sock lock. >>>>>>>>> >>>>>>>>> /* Called with lower sock held */ >>>>>>>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb) >>>>>>>>> { >>>>>>>>> [...] >>>>>>>>> if (kcm_queue_rcv_skb(&kcm->sk, skb)) { >>>>>>>>> >>>>>>>>> In this case kcm->sk is not the same lock the comment is referring to. >>>>>>>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready(). >>>>>>>>> >>>>>>>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock? >>>>>>>>> I don't have anything better in mind immediately. >>>>>>>> The sock locks are taken in reverse order in the send path so so >>>>>>>> grabbing kcm sock lock with lower lock held to call sk_data_ready may >>>>>>>> lead to deadlock like I think. >>>>>>>> >>>>>>>> It might be possible to change the order in the send path to do this. >>>>>>>> Something like: >>>>>>>> >>>>>>>> trylock on lower socket lock >>>>>>>> -if trylock fails >>>>>>>> - release kcm sock lock >>>>>>>> - lock lower sock >>>>>>>> - lock kcm sock >>>>>>>> - call sendpage locked function >>>>>>>> >>>>>>>> I admit that dealing with two levels of socket locks in the data path >>>>>>>> is quite a pain :-) >>>>>>> >>>>>>> up >>>>>>> >>>>>>> still happening and we've lost 50K+ test VMs on this >>>>>> >>>>>> up >>>>>> >>>>>> Still happens and number of crashes crossed 60K, can we do something >>>>>> with this please?