Subject: Re: [PREEMPT-RT] Oops in rapl_cpu_prepare()
To: Sebastian Andrzej Siewior
References: <20161021105630.y2iym7smtdpyo54z@linutronix.de> <4e56a576-1a9f-e195-5ed9-2bc7169c4d94@brocade.com> <20161025122205.cw5xyejcg7xegnmq@linutronix.de> <20161028080324.b6nnwaljmzxiyykx@linutronix.de>
From: "Charles (Chas) Williams"
Date: Wed, 2 Nov 2016 05:16:03 -0400
In-Reply-To: <20161028080324.b6nnwaljmzxiyykx@linutronix.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/28/2016 04:03 AM, Sebastian Andrzej Siewior wrote:
> On 2016-10-27 15:00:32 [-0400], Charles (Chas) Williams wrote:
>>> I assume "init_rapl_pmus: maxpkg 4" is from init_rapl_pmus() returning
>>> topology_max_packages().
>>> So it says 4 but then returns 65535 for CPU 2
>>> and 3. That -1 comes probably from topology_update_package_map(). Could
>>> you please send a complete boot log and try the following patch? This
>>> one should fix your boot problem and disable RAPL if the info is
>>> invalid.
>>
>> But sometimes the topology info is correct and if I get lucky, the
>> package id could be valid for all the CPU's. Given the behavior,
>> I have seen so far it makes me thing the RAPL isn't being emulated.
>> So even if I did boot onto a "valid" set of cores, would I always be
>> certain that I will be on those cores?
>
> I don't what vmware does here. Nor do they ship source to check. So if
> you have a big HW box with say two packages, it might make sense to give
> this information to the guest _if_ the CPUs are pinned and the guest
> never migrates.

Yes, I agree _if_. That's why it simply isn't clear to me that we should
attempt to do any RAPL at all for VMWare. The current behavior doesn't
seem to make sense and I don't expect it to suddenly start acting
reasonably. Since I don't understand why some package id's are valid and
others are not, I would prefer not to trust any of the information as far
as enabling/disabling the RAPL monitoring.

>
>> Per your request in your next email:
>>
>>> One thing I forgot to ask: Could you please check if you get the same
>>> pkgid reported for cpu 0-3 on a pre-v4.8 kernel? (before the hotplug
>>> rework).
>>
>> Our previous kernel was 4.4, and didn't use the logical package id:
> I see.
>
> Did the patch I sent fixed it for you and were you not able to test?

Yes, it does prevent RAPL from starting and loading. From the boot log:

[ 2.711481] RAPL PMU: rapl pmu error: max package: 4 but CPU2 belongs to 65535
[ 2.711639] rapl pmu error: max package: 4 but CPU2 belongs to 65535

This was consistent across several reboots.

I poked around in the VM settings. Apparently this guest is configured for
four virtual sockets with one core per socket.
Testing with two virtual sockets, one core per socket:

[ 2.163177] RAPL PMU: rapl pmu error: max package: 2 but CPU1 belongs to 65535
[ 2.163304] rapl pmu error: max package: 2 but CPU1 belongs to 65535

Booting with 1 virtual socket, 1 core per socket:

[ 1.750311] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[ 1.750312] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[ 1.750313] RAPL PMU: hw unit of domain package 2^-0 Joules
[ 1.750314] RAPL PMU: hw unit of domain dram 2^-0 Joules

Booting with 1 virtual socket, 4 cores per socket:

[ 3.527298] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[ 3.527302] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[ 3.527304] RAPL PMU: hw unit of domain package 2^-0 Joules
[ 3.527307] RAPL PMU: hw unit of domain dram 2^-0 Joules

So, it looks like VMWare tends to always get something wrong if you have
more than one virtual socket. The above behavior was consistent across
several reboots.
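
For anyone following along, the check behind the "max package: N but CPUx
belongs to 65535" messages can be sketched in userspace C roughly as below.
This is only an illustration, not the code from the patch: the function name
and the array of ids are made up, and it assumes the package id is stored in
an unsigned 16-bit field, which is what would turn a -1 into 65535.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: return the index of the first
 * CPU whose package id is outside [0, max_pkgs), or -1 if every id is valid.
 * An id of (uint16_t)-1, i.e. 65535, fails the check for any realistic
 * package count.
 */
static int first_invalid_cpu(const uint16_t *pkg_id, int ncpus,
                             unsigned int max_pkgs)
{
    assert(pkg_id != NULL && ncpus >= 0);

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (pkg_id[cpu] >= max_pkgs)
            return cpu;  /* e.g. "max package: 4 but CPU2 belongs to 65535" */
    }
    return -1;
}
```

With ids {0, 1, 65535, 65535} (the first two values are invented) and
max_pkgs = 4 this returns 2, which lines up with the four-socket log. With
{0, 65535} and max_pkgs = 2 it returns 1, as in the two-socket case.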