MIME-Version: 1.0
In-Reply-To: <926421514406019@web43j.yandex.ru>
References: <001a11421c7436e541055c4c7cf0@google.com> <de320d00-dced-fd39-056b-6c67290254e0@gmail.com>
 <CALx6S35x+8HM-Lf7wuDGqnjoVUvPCNgFMvLNeLCNRoT_LZcxLA@mail.gmail.com>
 <CACT4Y+YXhkZsDdst+H1VQZ9Tc0fMJhF3niXEJoK9TqcTUaC4iA@mail.gmail.com>
 <CACT4Y+bFsT4ZVvZSZvQdcKXb8n-79gy2tLB02jBm8wwp32YoAA@mail.gmail.com>
 <CALx6S37RTEzd5pABpULPfrsoe-Huj0GZtQpOUfG=tT9dn5wL_A@mail.gmail.com>
 <CACT4Y+YbKPmYD8HyrX-ct5L2Rqj2sqqk3sS=7pZm+s9ar=qKXg@mail.gmail.com>
 <1164631514405294@web52j.yandex.ru> <CACT4Y+bOd17CLKm742_t06p9yweh1x2rZswc_ghtmnuE7+Zb5w@mail.gmail.com>
 <926421514406019@web43j.yandex.ru>
From: Tom Herbert <tom@herbertland.com>
Date: Wed, 27 Dec 2017 17:19:05 -0800
Message-ID: <CALx6S35ZPyD9aS3c52Z-w4wo4=iK41iJ-+nMpQSGeQ9nMKnPbQ@mail.gmail.com>
Subject: Re: WARNING in strp_data_ready
To: Ozgur <ozgur@goosey.org>
Cc: Dmitry Vyukov <dvyukov@google.com>,
        John Fastabend <john.fastabend@gmail.com>,
        syzbot 
        <bot+c91c53af67f9ebe599a337d2e70950366153b295@syzkaller.appspotmail.com>,
        "David S. Miller" <davem@davemloft.net>,
        Eric Biggers <ebiggers@google.com>,
        LKML <linux-kernel@vger.kernel.org>,
        Linux Kernel Network Developers <netdev@vger.kernel.org>,
        "syzkaller-bugs@googlegroups.com" <syzkaller-bugs@googlegroups.com>,
        Tom Herbert <tom@quantonium.net>, Cong Wang <xiyou.wangcong@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6225
Lines: 141

On Wed, Dec 27, 2017 at 12:20 PM, Ozgur <ozgur@goosey.org> wrote:
>
>
> 27.12.2017, 23:14, "Dmitry Vyukov" <dvyukov@google.com>:
>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur <ozgur@goosey.org> wrote:
>>>  27.12.2017, 22:21, "Dmitry Vyukov" <dvyukov@google.com>:
>>>>  On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>>   Did you try the patch I posted?
>>>>
>>>>  Hi Tom,
>>>
>>>  Hello Dmitry,
>>>
>>>>  No. And I didn't know I need to. Why?
>>>>  If you think the patch needs additional testing, you can ask syzbot to
>>>>  test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>>>>  Otherwise proceed with committing it. Or what are we waiting for?
>>>>
>>>>  Thanks
>>>
>>>  I think we need to fixed patch for crash, in fact check to patch code and test solve the bug.
>>>  How do test it because there is no patch in the following bug?
>>
>> Hi Ozgur,
>>
>> I am not sure I completely understand what you mean. But the
>> reproducer for this bug (which one can use for testing) is here:
>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>> Tom also mentions there is some patch for this, but I don't know where
>> it is, it doesn't seem to be referenced from this thread.
>
> Hello Dmitry,
>
> Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :)
> I think Tom send patch to only you and are you tested?
>
> kcmsock.c will change and strp_data_ready I think locked.
>
> Tom, please send a patch for me? I can test and inform you.
>
Hi Ozgur,

I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you can!

Thanks,
Tom

> Regards
>
> Ozgur
>
>>>  The fix patch should be for this net/kcm/kcmsock.c file and lock functions must be added calling sk_data_ready ().
>>>  Regards
>>>
>>>  Ozgur
>>>
>>>>>   On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>>>   On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>>>>>   <john.fastabend@gmail.com> wrote:
>>>>>>>>>   On 10/24/2017 08:20 AM, syzbot wrote:
>>>>>>>>>>   Hello,
>>>>>>>>>>
>>>>>>>>>>   syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>>>>>>>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>>>>>>>   compiler: gcc (GCC) 7.1.1 20170620
>>>>>>>>>>   .config is attached
>>>>>>>>>>   Raw console output is attached.
>>>>>>>>>>   C reproducer is attached
>>>>>>>>>>   syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>>>>>>>   for information about syzkaller reproducers
>>>>>>>>>>
>>>>>>>>>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>>>>>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>>>>>>   WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>>>>>>   Kernel panic - not syncing: panic_on_warn set ...
>>>>>>>>>>
>>>>>>>>>>   CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>>>>>>>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>>>>>>   Call Trace:
>>>>>>>>>>    <IRQ>
>>>>>>>>>>    __dump_stack lib/dump_stack.c:16 [inline]
>>>>>>>>>>    dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>>>>>>    panic+0x1e4/0x417 kernel/panic.c:181
>>>>>>>>>>    __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>>>>>>    report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>>>>>>    fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>>>>>>    do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>>>>>>    do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>>>>>>    do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>>>>>>    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>>>>>>    invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>>>>>>>   RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>>>>>>   RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>>>>>>   RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>>>>>>   RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>>>>>>>   RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>>>>>>>   RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>>>>>>>   RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>>>>>>>   R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>>>>>>>   R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>>>>>>    psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>>>>>>
>>>>>>>>>   Looks like KCM is calling sk_data_ready() without first taking the
>>>>>>>>>   sock lock.
>>>>>>>>>
>>>>>>>>>   /* Called with lower sock held */
>>>>>>>>>   static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>>>>>>>   {
>>>>>>>>>    [...]
>>>>>>>>>           if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>>>>>>
>>>>>>>>>   In this case kcm->sk is not the same lock the comment is referring to.
>>>>>>>>>   And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>>>>>>
>>>>>>>>>   @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>>>>>>>   I don't have anything better in mind immediately.
>>>>>>>>   The sock locks are taken in reverse order in the send path so so
>>>>>>>>   grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>>>>>>>   lead to deadlock like I think.
>>>>>>>>
>>>>>>>>   It might be possible to change the order in the send path to do this.
>>>>>>>>   Something like:
>>>>>>>>
>>>>>>>>   trylock on lower socket lock
>>>>>>>>   -if trylock fails
>>>>>>>>     - release kcm sock lock
>>>>>>>>     - lock lower sock
>>>>>>>>     - lock kcm sock
>>>>>>>>   - call sendpage locked function
>>>>>>>>
>>>>>>>>   I admit that dealing with two levels of socket locks in the data path
>>>>>>>>   is quite a pain :-)
>>>>>>>
>>>>>>>   up
>>>>>>>
>>>>>>>   still happening and we've lost 50K+ test VMs on this
>>>>>>
>>>>>>   up
>>>>>>
>>>>>>   Still happens and number of crashes crossed 60K, can we do something
>>>>>>   with this please?