Subject: Re: [PREEMPT-RT] Oops in rapl_cpu_prepare()
To: Sebastian Andrzej Siewior
From: "Charles (Chas) Williams"
Date: Fri, 21 Oct 2016 17:03:56 -0400
Message-ID: <4e56a576-1a9f-e195-5ed9-2bc7169c4d94@brocade.com>
In-Reply-To: <20161021105630.y2iym7smtdpyo54z@linutronix.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/21/2016 06:56 AM, Sebastian Andrzej Siewior wrote:
> On 2016-10-20 16:27:55 [-0400], Charles (Chas) Williams wrote:
>> Recent 4.8 kernels have been oopsing when running under VMWare:
>
> can you reproduce this on bare metal?

I can't get dedicated access to the specific bare metal since it is
running as a dedicated hypervisor.  I haven't seen this issue anywhere
else though with the 4.8 kernel.

>> [ 2.270203] BUG: unable to handle kernel NULL pointer dereference at 0000000000000408
>> [ 2.270325] IP: [] rapl_cpu_online+0x59/0x70
>
> can you check if pmu is NULL?

It's not.  The dereference at 0x408, with pmu->cpu being fairly early in
the struct, seems to indicate that pmu wasn't pointing to 0 at the time
(but fairly close).  I should have noticed that earlier.

>> Is there a particular order guaranteed by the callbacks?  Will
>> rapl_cpu_prepare() always happen before online/offline?  Additionally,
>
> yes, see include/linux/cpuhotplug.h. On CPU-up the array ids are invoked
> from CPUHP_OFFLINE till CPUHP_ONLINE.

Yes, I see that now.  Thanks for the pointer!

> If a callback (such as CPUHP_PERF_X86_RAPL_PREP) fail then we rollback
> to the starting point (in case of CPU up it would be CPUHP_OFFLINE.

You'll like this.  I just did a little printk debugging because it was
easier than trying to get a debugger running:

[ 3.107126] init_rapl_pmus: maxpkg 4
[ 3.107263] rapl_cpu_prepare: pmu ffff880234faa540 cpu 0 pkgid 0
[ 3.107400] rapl_cpu_prepare: pmu ffff880234faa600 cpu 1 pkgid 2
[ 3.107537] rapl_cpu_prepare: pmu ffff880234faa6c0 cpu 2 pkgid 65535
[ 3.107662] rapl_cpu_online: pmu ffff880234faa540 cpu 0 pkgid 0
[ 3.107907] rapl_cpu_online: pmu ffff880234faa600 cpu 1 pkgid 2
[ 3.108133] rapl_cpu_online: pmu ffff880234faa6c0 cpu 2 pkgid 65535
[ 3.108333] rapl_cpu_online: pmu ffff880234faa6c0 cpu 3 pkgid 65535

where pkgid is topology_logical_package_id(cpu).  I can't understand
why I don't see a cpu 3 during cpu prepare when I see one later.
The 65535 is apparently a -1 from topology_phys_to_logical_pkg() getting
assigned to the logical_proc_id.

So this is pretty puzzling.  Since this is a guest running under VMWare,
I don't know that there is any particular CPU pinning or emulation of RAPL.
It looks like there was a proposal to not run in guests:

https://lkml.org/lkml/2015/12/3/559
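
To illustrate the 65535 in the log, here is a minimal userspace sketch (not kernel code) of how a -1 "not found" result turns into 65535 once it is stored in an unsigned 16-bit field, and why that value cannot be a valid index when maxpkg is 4. The 16-bit width is an assumption inferred from the value 65535, and store_pkgid(), pkgid_in_range() and MAXPKG are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Taken from the "init_rapl_pmus: maxpkg 4" log line. */
#define MAXPKG 4u

/*
 * Hypothetical stand-in for assigning a lookup result to the per-CPU
 * logical package id.  ASSUMPTION: the destination field is an unsigned
 * 16-bit integer, so a negative error value wraps around.
 */
static uint16_t store_pkgid(int lookup_result)
{
    return (uint16_t)lookup_result;   /* -1 becomes 65535 */
}

/*
 * Hypothetical bounds check: returns 1 if pkgid may be used to index an
 * array with maxpkg slots, 0 otherwise.
 */
static int pkgid_in_range(unsigned int pkgid, unsigned int maxpkg)
{
    return pkgid < maxpkg;
}
```

With these helpers, store_pkgid(-1) yields 65535, and pkgid_in_range(65535, MAXPKG) returns 0, so a check of this kind would reject the package id seen for cpu 2 and cpu 3 before it is used as an index.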