DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 617955F742
Subject: Re: [PATCH] ipmi: use rcu lock around call to
 intf->handlers->sender()
To: minyard@acm.org, openipmi-developer@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
References: <1497892653-5236-1-git-send-email-tcamuso@redhat.com>
 <c91bc413-b516-0cad-9232-419b8cf14e65@acm.org>
From: Tony Camuso <tcamuso@redhat.com>
Message-ID: <c15876e8-1a44-5f93-4368-8ac89367452d@redhat.com>
Date: Mon, 19 Jun 2017 14:26:06 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.1.0
MIME-Version: 1.0
In-Reply-To: <c91bc413-b516-0cad-9232-419b8cf14e65@acm.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5999
Lines: 133

On 06/19/2017 01:53 PM, Corey Minyard wrote:
> Ok, I have this queued for the next kernel, and in linux-next for testing. I added
> some information about when this was introduced.
> 
> I also plan on sending this to stable, if that's ok with you.

Yes, I've tested it for regressions, and it's fine.

> 
> A note below....
> 
> On 06/19/2017 12:17 PM, Tony Camuso wrote:
>> A vendor with a system having more than 128 CPUs occasionally encounters
>> the following crash during shutdown. This is not an easily reproduceable
>> event, but the vendor was able to provide the following analysis of the
>> crash, which exhibits the same footprint each time.
>>
>> crash> bt
>> PID: 0      TASK: ffff88017c70ce70  CPU: 5   COMMAND: "swapper/5"
>>   #0 [ffff88085c143ac8] machine_kexec at ffffffff81059c8b
>>   #1 [ffff88085c143b28] __crash_kexec at ffffffff811052e2
>>   #2 [ffff88085c143bf8] crash_kexec at ffffffff811053d0
>>   #3 [ffff88085c143c10] oops_end at ffffffff8168ef88
>>   #4 [ffff88085c143c38] no_context at ffffffff8167ebb3
>>   #5 [ffff88085c143c88] __bad_area_nosemaphore at ffffffff8167ec49
>>   #6 [ffff88085c143cd0] bad_area_nosemaphore at ffffffff8167edb3
>>   #7 [ffff88085c143ce0] __do_page_fault at ffffffff81691d1e
>>   #8 [ffff88085c143d40] do_page_fault at ffffffff81691ec5
>>   #9 [ffff88085c143d70] page_fault at ffffffff8168e188
>>      [exception RIP: unknown or invalid address]
>>      RIP: ffffffffa053c800  RSP: ffff88085c143e28  RFLAGS: 00010206
>>      RAX: ffff88017c72bfd8  RBX: ffff88017a8dc000  RCX: ffff8810588b5ac8
>>      RDX: ffff8810588b5a00  RSI: ffffffffa053c800  RDI: ffff8810588b5a00
>>      RBP: ffff88085c143e58   R8: ffff88017c70d408   R9: ffff88017a8dc000
>>      R10: 0000000000000002  R11: ffff88085c143da0  R12: ffff8810588b5ac8
>>      R13: 0000000000000100  R14: ffffffffa053c800  R15: ffff8810588b5a00
>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <IRQ stack> ---
> 
> When I added this patch all the text below here disappeared, since --- is the marker
> for "end of text" :).  I removed the --- here, but just FYI for the future.

Oh, wow, thanks! I have scripts that look for "---", so I
should know better.

> Thanks again,
> 
> -corey
> 


>>      [exception RIP: cpuidle_enter_state+82]
>>      RIP: ffffffff81514192  RSP: ffff88017c72be50  RFLAGS: 00000202
>>      RAX: 0000001e4c3c6f16  RBX: 000000000000f8a0  RCX: 0000000000000018
>>      RDX: 0000000225c17d03  RSI: ffff88017c72bfd8  RDI: 0000001e4c3c6f16
>>      RBP: ffff88017c72be78   R8: 000000000000237e   R9: 0000000000000018
>>      R10: 0000000000002494  R11: 0000000000000001  R12: ffff88017c72be20
>>      R13: ffff88085c14f8e0  R14: 0000000000000082  R15: 0000001e4c3bb400
>>      ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
>>
>> This is the corresponding stack trace
>>
>> It has crashed because the area pointed with RIP extracted from timer
>> element is already removed during a shutdown process.
>>
>> The function is smi_timeout().
>>
>> And we think ffff8810588b5a00 in RDX is a parameter struct smi_info
>>
>> crash> rd ffff8810588b5a00 20
>> ffff8810588b5a00:  ffff8810588b6000 0000000000000000   .`.X............
>> ffff8810588b5a10:  ffff880853264400 ffffffffa05417e0   .D&S......T.....
>> ffff8810588b5a20:  24a024a000000000 0000000000000000   .....$.$........
>> ffff8810588b5a30:  0000000000000000 0000000000000000   ................
>> ffff8810588b5a40:  ffffffffa053a040 ffffffffa053a060   @.S.....`.S.....
>> ffff8810588b5a50:  0000000000000000 0000000100000001   ................
>> ffff8810588b5a60:  0000000000000000 0000000000000e00   ................
>> ffff8810588b5a70:  ffffffffa053a580 ffffffffa053a6e0   ..S.......S.....
>> ffff8810588b5a80:  ffffffffa053a4a0 ffffffffa053a250   ..S.....P.S.....
>> ffff8810588b5a90:  0000000500000002 0000000000000000   ................
>>
>> Unfortunately the top of this area is already detroyed by someone.
>> But because of two reasonns we think this is struct smi_info
>>   1) The address included in between  ffff8810588b5a70 and ffff8810588b5a80:
>>    are inside of ipmi_si_intf.c  see crash> module ffff88085779d2c0
>>
>>   2) We've found the area which point this.
>>    It is offset 0x68 of  ffff880859df4000
>>
>> crash> rd  ffff880859df4000 100
>> ffff880859df4000:  0000000000000000 0000000000000001   ................
>> ffff880859df4010:  ffffffffa0535290 dead000000000200   .RS.............
>> ffff880859df4020:  ffff880859df4020 ffff880859df4020    @.Y.... @.Y....
>> ffff880859df4030:  0000000000000002 0000000000100010   ................
>> ffff880859df4040:  ffff880859df4040 ffff880859df4040   @@.Y....@@.Y....
>> ffff880859df4050:  0000000000000000 0000000000000000   ................
>> ffff880859df4060:  0000000000000000 ffff8810588b5a00   .........Z.X....
>> ffff880859df4070:  0000000000000001 ffff880859df4078   ........x@.Y....
>>
>>   If we regards it as struct ipmi_smi in shutdown process
>>   it looks consistent.
>>
>> The remedy for this apparent race is affixed below.
>>
>> Signed-off-by: Tony Camuso <tcamuso@redhat.com>
>> ---
>>   drivers/char/ipmi/ipmi_msghandler.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
>> index 9f69995..49a7e96 100644
>> --- a/drivers/char/ipmi/ipmi_msghandler.c
>> +++ b/drivers/char/ipmi/ipmi_msghandler.c
>> @@ -3878,6 +3878,9 @@ static void smi_recv_tasklet(unsigned long val)
>>        * because the lower layer is allowed to hold locks while calling
>>        * message delivery.
>>        */
>> +
>> +    rcu_read_lock();
>> +
>>       if (!run_to_completion)
>>           spin_lock_irqsave(&intf->xmit_msgs_lock, flags);
>>       if (intf->curr_msg == NULL && !intf->in_shutdown) {
>> @@ -3900,6 +3903,8 @@ static void smi_recv_tasklet(unsigned long val)
>>       if (newmsg)
>>           intf->handlers->sender(intf->send_info, newmsg);
>> +    rcu_read_unlock();
>> +
>>       handle_new_recv_msgs(intf);
>>   }
> 
>