Subject: Re: [PATCH 1/6] perf x86: Infrastructure for exposing an Uncore unit to PMON mapping
To: Stephane Eranian
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML, Brendan Gregg, Andi Kleen, "Liang, Kan", alexander.antonov@intel.com
References:
 <20191126163630.17300-1-roman.sudarikov@linux.intel.com>
 <20191126163630.17300-2-roman.sudarikov@linux.intel.com>
From: "Sudarikov, Roman"
Date: Fri, 6 Dec 2019 19:08:46 +0300
X-Mailing-List: linux-kernel@vger.kernel.org

On 05.12.2019 21:02, Stephane Eranian wrote:
> On Wed, Dec 4, 2019 at 10:48 AM Sudarikov, Roman wrote:
>> On 02.12.2019 22:47, Stephane Eranian wrote:
>>> On Tue, Nov 26, 2019 at 8:36 AM wrote:
>>>> From: Roman Sudarikov
>>>>
>>>> Intel® Xeon® Scalable processor family (code name Skylake-SP) makes significant
>>>> changes in the integrated I/O (IIO) architecture. The new solution introduces
>>>> IIO stacks which are responsible for managing traffic between the PCIe domain
>>>> and the Mesh domain. Each IIO stack has its own PMON block and can handle either
>>>> DMI port, x16 PCIe root port, MCP-Link or various built-in accelerators.
>>>> IIO PMON blocks allow concurrent monitoring of I/O flows up to 4 x4 bifurcation
>>>> within each IIO stack.
>>>>
>>>> Software is supposed to program required perf counters within each IIO stack
>>>> and gather performance data. The tricky thing here is that IIO PMON reports data
>>>> per IIO stack but users have no idea what IIO stacks are - they only know devices
>>>> which are connected to the platform.
>>>>
>>>> Understanding IIO stack concept to find which IIO stack that particular IO device
>>>> is connected to, or to identify an IIO PMON block to program for monitoring
>>>> specific IIO stack assumes a lot of implicit knowledge about given Intel server
>>>> platform architecture.
>>>>
>>>> This patch set introduces:
>>>>     An infrastructure for exposing an Uncore unit to Uncore PMON mapping through sysfs-backend
>>>>     A new --iiostat mode in perf stat to provide I/O performance metrics per I/O device
>>>>
>>>> Current version supports a server line starting Intel® Xeon® Processor Scalable
>>>> Family and introduces mapping for IIO Uncore units only.
>>>> Other units can be added on demand.
>>>>
>>>> Usage example:
>>>>     /sys/devices/uncore__/platform_mapping
>>>>
>>>> Each Uncore unit type, by its nature, can be mapped to its own context, for example:
>>>>     CHA - each uncore_cha_ is assigned to manage a distinct slice of LLC capacity
>>>>     UPI - each uncore_upi_ is assigned to manage one link of Intel UPI Subsystem
>>>>     IIO - each uncore_iio_ is assigned to manage one stack of the IIO module
>>>>     IMC - each uncore_imc_ is assigned to manage one channel of Memory Controller
>>>>
>>>> Implementation details:
>>>> Two callbacks added to struct intel_uncore_type to discover and map Uncore units to PMONs:
>>>>     int (*get_topology)(struct intel_uncore_type *type)
>>>>     int (*set_mapping)(struct intel_uncore_type *type)
>>>>
>>>> IIO stack to PMON mapping is exposed through
>>>>     /sys/devices/uncore_iio_/platform_mapping
>>>> in the following format: domain:bus
>>>>
>>>> Details of IIO Uncore unit mapping to IIO PMON:
>>>> Each IIO stack is either a DMI port, x16 PCIe root port, MCP-Link or various
>>>> built-in accelerators. For Uncore IIO Unit type, the platform_mapping file
>>>> holds bus numbers of devices, which can be monitored by that IIO PMON block
>>>> on each die.
>>>>
>>>> For example, on a 4-die Intel Xeon® server platform:
>>>>     $ cat /sys/devices/uncore_iio_0/platform_mapping
>>>>     0000:00,0000:40,0000:80,0000:c0
>>>>
>>>> Which means:
>>>>     IIO PMON block 0 on die 0 belongs to IIO stack located on bus 0x00, domain 0x0000
>>>>     IIO PMON block 0 on die 1 belongs to IIO stack located on bus 0x40, domain 0x0000
>>>>     IIO PMON block 0 on die 2 belongs to IIO stack located on bus 0x80, domain 0x0000
>>>>     IIO PMON block 0 on die 3 belongs to IIO stack located on bus 0xc0, domain 0x0000
>>>>
>>> You are just looking at one die (package). How does your enumeration
>>> help figure out is the iio_0 is on socket0 of socket1 and then figure
>>> out which bus/domain in on which socket.
>>>
>>> And how does that help map actual devices (using the output of lspci)
>>> to the IIO?
>>> You need to show how you would do that, which is really what people
>>> want, with what you have in your patch right now.
>> No. I'm enumerating all IIO PMUs for all sockets on the platform.
>>
>> Let's take an 4 socket SKX as an example - sysfs exposes 6 instances of
>> IIO PMU and each socket has its own instance of each IIO PMUs,
>> meaning that socket 0 has its own IIO PMU0, socket 1 also has its own
>> IIO PMU0 and so on. Same apply for IIO PMUs 1 through 5.
> I know that.
>
>> Below is sample output:
>>
>> $:/sys/devices# cat uncore_iio_0/platform_mapping
>> 0000:00,0000:40,0000:80,0000:c0
>> $:/sys/devices# cat uncore_iio_1/platform_mapping
>> 0000:16,0000:44,0000:84,0000:c4
>> $:/sys/devices# cat uncore_iio_2/platform_mapping
>> 0000:24,0000:58,0000:98,0000:d8
>> $:/sys/devices# cat uncore_iio_3/platform_mapping
>> 0000:32,0000:6c,0000:ac,0000:ec
>>
>> Technically, the idea is as follows - kernel part of the feature is for
>> locating IIO stacks and creating IIO PMON to IIO stack mapping.
>> Userspace part of the feature is for locating IO devices connected to
>> each IIO stack on each socket and configure only required IIO counters to
>> provide 4 IO performance metrics - Inbound Read, Inbound Write, Outbound
>> Read, Outbound Write - attributed to each device.
>>
>>
>> Follow up patches show how users can benefit from the feature; see
>> https://lkml.org/lkml/2019/11/26/451
>>
> I know this is useful. I have done this for internal users a long time ago.
>
>> Below is sample output:
>>
>> 1. show mode
>>
>> ./perf stat --iiostat=show
>>
>> S0-RootPort0-uncore_iio_0<00:00.0 Sky Lake-E DMI3 Registers>
>> S1-RootPort0-uncore_iio_0<81:00.0 Ethernet Controller X710 for 10GbE SFP+>
>> S0-RootPort1-uncore_iio_1<18:00.0 Omni-Path HFI Silicon 100 Series [discrete]>
>> S1-RootPort1-uncore_iio_1<86:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>> S1-RootPort1-uncore_iio_1<88:00.0 Ethernet Controller XL710 for 40GbE QSFP+>
>> S0-RootPort2-uncore_iio_2<3d:00.0 Ethernet Connection X722 for 10GBASE-T>
>> S1-RootPort2-uncore_iio_2
>> S1-RootPort3-uncore_iio_3
>>
>> For example, NIC at 81:00.0 is local to S1, connected to its RootPort0 and is covered by IIO PMU0 (on socket 1)
>>
>> 1. collector mode
>>
>> ./perf stat --iiostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
>> 357708+0 records in
>> 357707+0 records out
>> 375083606016 bytes (375 GB, 349 GiB) copied, 215.381 s, 1.7 GB/s
>>
>> Performance counter stats for 'system wide':
>>
>> device    Inbound Read(MB)  Inbound Write(MB)  Outbound Read(MB)  Outbound Write(MB)
>> 00:00.0                  0                  0                  0                   0
>> 81:00.0                  0                  0                  0                   0
>> 18:00.0                  0                  0                  0                   0
>> 86:00.0                  0                  0                  0                   0
>> 88:00.0                  0                  0                  0                   0
>> 3b:00.0                  3                  0                  0                   0
>> 3c:03.0                  3                  0                  0                   0
>> 3d:00.0                  3                  0                  0                   0
>> af:00.0                  0                  0                  0                   0
>> da:00.0             358559                 44                  0                  22
>>
> I think this output would be more useful with the socket information.
> People care about NUMA locality. This output
> does not cover that (in a single cmdline).
> It would also benefit from having the actual Linux device names,
> e.g., sda, ssda, eth0, ....,

Hi Stephane,

I still think we should keep the b:d.f notation as part of the output and,
sure, we can add socket and device name information, so it will look
like this:

Before:
   Performance counter stats for 'system wide':

   device       Inbound Read(MB)    Inbound Write(MB)
   da:00.0

After:
   Performance counter stats for 'system wide':

   device       Inbound Read(MB)    Inbound Write(MB)
   S1da:00.0

Are you OK with that approach?

BTW, addressing this requires code changes in the userspace part only.
Can we proceed with the kernel patch review? Once it is finalized, I'll
send the userspace part and we can settle on the right output format.

Thanks,
Roman

>
>> 215.383783574 seconds time elapsed
>>
>>>> Signed-off-by: Roman Sudarikov
>>>> Co-developed-by: Alexander Antonov
>>>> Signed-off-by: Alexander Antonov
>>>> ---
>>>>  arch/x86/events/intel/uncore.c       |  61 +++++++++++-
>>>>  arch/x86/events/intel/uncore.h       |  13 ++-
>>>>  arch/x86/events/intel/uncore_snbep.c | 144 +++++++++++++++++++++++++++
>>>>  3 files changed, 214 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
>>>> index 86467f85c383..0f779c8fcc05 100644
>>>> --- a/arch/x86/events/intel/uncore.c
>>>> +++ b/arch/x86/events/intel/uncore.c
>>>> @@ -18,6 +18,11 @@ struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
>>>>  struct pci_extra_dev *uncore_extra_pci_dev;
>>>>  static int max_dies;
>>>>
>>>> +int get_max_dies(void)
>>>> +{
>>>> +	return max_dies;
>>>> +}
>>>> +
>>>>  /* mask of cpus that collect uncore events */
>>>>  static cpumask_t uncore_cpu_mask;
>>>>
>>>> @@ -816,6 +821,16 @@ static ssize_t uncore_get_attr_cpumask(struct device *dev,
>>>>
>>>>  static DEVICE_ATTR(cpumask, S_IRUGO, uncore_get_attr_cpumask, NULL);
>>>>
>>>> +static ssize_t platform_mapping_show(struct device *dev,
>>>> +				struct device_attribute *attr, char *buf)
>>>> +{
>>>> +	struct intel_uncore_pmu *pmu = dev_get_drvdata(dev);
>>>> +
>>>> +	return snprintf(buf, PAGE_SIZE - 1, "%s\n", pmu->platform_mapping ?
>>>> +		       (char *)pmu->platform_mapping : "0");
>>>> +}
>>>> +static DEVICE_ATTR_RO(platform_mapping);
>>>> +
>>>>  static struct attribute *uncore_pmu_attrs[] = {
>>>>  	&dev_attr_cpumask.attr,
>>>>  	NULL,
>>>> @@ -825,6 +840,15 @@ static const struct attribute_group uncore_pmu_attr_group = {
>>>>  	.attrs = uncore_pmu_attrs,
>>>>  };
>>>>
>>>> +static struct attribute *platform_attrs[] = {
>>>> +	&dev_attr_platform_mapping.attr,
>>>> +	NULL,
>>>> +};
>>>> +
>>>> +static const struct attribute_group uncore_platform_discovery_group = {
>>>> +	.attrs = platform_attrs,
>>>> +};
>>>> +
>>>>  static int uncore_pmu_register(struct intel_uncore_pmu *pmu)
>>>>  {
>>>>  	int ret;
>>>> @@ -905,11 +929,27 @@ static void uncore_types_exit(struct intel_uncore_type **types)
>>>>  		uncore_type_exit(*types);
>>>>  }
>>>>
>>>> +static void uncore_type_attrs_compaction(struct intel_uncore_type *type)
>>>> +{
>>>> +	int i, j;
>>>> +
>>>> +	for (i = 0, j = 0; i < UNCORE_MAX_NUM_ATTR_GROUP; i++) {
>>>> +		if (!type->attr_groups[i])
>>>> +			continue;
>>>> +		if (i > j) {
>>>> +			type->attr_groups[j] = type->attr_groups[i];
>>>> +			type->attr_groups[i] = NULL;
>>>> +		}
>>>> +		j++;
>>>> +	}
>>>> +}
>>>> +
>>>>  static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>  {
>>>>  	struct intel_uncore_pmu *pmus;
>>>>  	size_t size;
>>>>  	int i, j;
>>>> +	int ret;
>>>>
>>>>  	pmus = kcalloc(type->num_boxes, sizeof(*pmus), GFP_KERNEL);
>>>>  	if (!pmus)
>>>> @@ -922,8 +962,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>  		pmus[i].pmu_idx = i;
>>>>  		pmus[i].type = type;
>>>>  		pmus[i].boxes = kzalloc(size, GFP_KERNEL);
>>>> -		if (!pmus[i].boxes)
>>>> +		if (!pmus[i].boxes) {
>>>> +			ret = -ENOMEM;
>>>>  			goto err;
>>>> +		}
>>>>  	}
>>>>
>>>>  	type->pmus = pmus;
>>>> @@ -940,8 +982,10 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>
>>>>  		attr_group = kzalloc(struct_size(attr_group, attrs, i + 1),
>>>>  				     GFP_KERNEL);
>>>> -		if (!attr_group)
>>>> +		if (!attr_group) {
>>>> +			ret = -ENOMEM;
>>>>  			goto err;
>>>> +		}
>>>>
>>>>  		attr_group->group.name = "events";
>>>>  		attr_group->group.attrs = attr_group->attrs;
>>>> @@ -954,6 +998,17 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>
>>>>  	type->pmu_group = &uncore_pmu_attr_group;
>>>>
>>>> +	/*
>>>> +	 * Exposing mapping of Uncore units to corresponding Uncore PMUs
>>>> +	 * through /sys/devices/uncore__/platform_mapping
>>>> +	 */
>>>> +	if (type->get_topology && type->set_mapping)
>>>> +		if (!type->get_topology(type) && !type->set_mapping(type))
>>>> +			type->platform_discovery = &uncore_platform_discovery_group;
>>>> +
>>>> +	/* For optional attributes, we can safely remove embedded NULL attr_groups elements */
>>>> +	uncore_type_attrs_compaction(type);
>>>> +
>>>>  	return 0;
>>>>
>>>>  err:
>>>> @@ -961,7 +1016,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type, bool setid)
>>>>  		kfree(pmus[i].boxes);
>>>>  	kfree(pmus);
>>>>
>>>> -	return -ENOMEM;
>>>> +	return ret;
>>>>  }
>>>>
>>>>  static int __init
>>>> diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
>>>> index bbfdaa720b45..ce3727b9f7f8 100644
>>>> --- a/arch/x86/events/intel/uncore.h
>>>> +++ b/arch/x86/events/intel/uncore.h
>>>> @@ -43,6 +43,8 @@ struct intel_uncore_box;
>>>>  struct uncore_event_desc;
>>>>  struct freerunning_counters;
>>>>
>>>> +#define UNCORE_MAX_NUM_ATTR_GROUP 5
>>>> +
>>>>  struct intel_uncore_type {
>>>>  	const char *name;
>>>>  	int num_counters;
>>>> @@ -71,13 +73,19 @@ struct intel_uncore_type {
>>>>  	struct intel_uncore_ops *ops;
>>>>  	struct uncore_event_desc *event_descs;
>>>>  	struct freerunning_counters *freerunning;
>>>> -	const struct attribute_group *attr_groups[4];
>>>> +	const struct attribute_group *attr_groups[UNCORE_MAX_NUM_ATTR_GROUP];
>>>>  	struct pmu *pmu; /* for custom pmu ops */
>>>> +	void *platform_topology;
>>>> +	/* finding Uncore units */
>>>> +	int (*get_topology)(struct intel_uncore_type *type);
>>>> +	/* mapping Uncore units to PMON ranges */
>>>> +	int (*set_mapping)(struct intel_uncore_type *type);
>>>>  };
>>>>
>>>>  #define pmu_group attr_groups[0]
>>>>  #define format_group attr_groups[1]
>>>>  #define events_group attr_groups[2]
>>>> +#define platform_discovery attr_groups[3]
>>>>
>>>>  struct intel_uncore_ops {
>>>>  	void (*init_box)(struct intel_uncore_box *);
>>>> @@ -99,6 +107,7 @@ struct intel_uncore_pmu {
>>>>  	int pmu_idx;
>>>>  	int func_id;
>>>>  	bool registered;
>>>> +	void *platform_mapping;
>>>>  	atomic_t activeboxes;
>>>>  	struct intel_uncore_type *type;
>>>>  	struct intel_uncore_box **boxes;
>>>> @@ -490,6 +499,8 @@ static inline struct intel_uncore_box *uncore_event_to_box(struct perf_event *ev
>>>>  	return event->pmu_private;
>>>>  }
>>>>
>>>> +int get_max_dies(void);
>>>> +
>>>>  struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu);
>>>>  u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
>>>>  void uncore_mmio_exit_box(struct intel_uncore_box *box);
>>>> diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
>>>> index b10a5ec79e48..92ce9fbafde1 100644
>>>> --- a/arch/x86/events/intel/uncore_snbep.c
>>>> +++ b/arch/x86/events/intel/uncore_snbep.c
>>>> @@ -273,6 +273,28 @@
>>>>  #define SKX_CPUNODEID 0xc0
>>>>  #define SKX_GIDNIDMAP 0xd4
>>>>
>>>> +/*
>>>> + * The CPU_BUS_NUMBER MSR returns the values of the respective CPUBUSNO CSR
>>>> + * that BIOS programmed. MSR has package scope.
>>>> + * |  Bit  | Default | Description
>>>> + * | [63]  |   00h   | VALID - When set, indicates the CPU bus
>>>> + *                     numbers have been initialized. (RO)
>>>> + * |[62:48]|   ---   | Reserved
>>>> + * |[47:40]|   00h   | BUS_NUM_5 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(5). (RO)
>>>> + * |[39:32]|   00h   | BUS_NUM_4 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(4). (RO)
>>>> + * |[31:24]|   00h   | BUS_NUM_3 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(3). (RO)
>>>> + * |[23:16]|   00h   | BUS_NUM_2 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(2). (RO)
>>>> + * |[15:8] |   00h   | BUS_NUM_1 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(1). (RO)
>>>> + * | [7:0] |   00h   | BUS_NUM_0 - Return the bus number BIOS assigned
>>>> + *                     CPUBUSNO(0). (RO)
>>>> + */
>>>> +#define SKX_MSR_CPU_BUS_NUMBER 0x300
>>>> +
>>>>  /* SKX CHA */
>>>>  #define SKX_CHA_MSR_PMON_BOX_FILTER_TID (0x1ffULL << 0)
>>>>  #define SKX_CHA_MSR_PMON_BOX_FILTER_LINK (0xfULL << 9)
>>>> @@ -3580,6 +3602,9 @@ static struct intel_uncore_ops skx_uncore_iio_ops = {
>>>>  	.read_counter = uncore_msr_read_counter,
>>>>  };
>>>>
>>>> +static int skx_iio_get_topology(struct intel_uncore_type *type);
>>>> +static int skx_iio_set_mapping(struct intel_uncore_type *type);
>>>> +
>>>>  static struct intel_uncore_type skx_uncore_iio = {
>>>>  	.name = "iio",
>>>>  	.num_counters = 4,
>>>> @@ -3594,6 +3619,8 @@ static struct intel_uncore_type skx_uncore_iio = {
>>>>  	.constraints = skx_uncore_iio_constraints,
>>>>  	.ops = &skx_uncore_iio_ops,
>>>>  	.format_group = &skx_uncore_iio_format_group,
>>>> +	.get_topology = skx_iio_get_topology,
>>>> +	.set_mapping = skx_iio_set_mapping,
>>>>  };
>>>>
>>>>  enum perf_uncore_iio_freerunning_type_id {
>>>> @@ -3780,6 +3807,123 @@ static int skx_count_chabox(void)
>>>>  	return hweight32(val);
>>>>  }
>>>>
>>>> +static inline u8 skx_iio_topology_byte(void *platform_topology,
>>>> +					int die, int idx)
>>>> +{
>>>> +	return *((u8 *)(platform_topology) + die * sizeof(u64) + idx);
>>>> +}
>>>> +
>>>> +static inline bool skx_iio_topology_valid(u64 msr_value)
>>>> +{
>>>> +	return msr_value & ((u64)1 << 63);
>>>> +}
>>>> +
>>>> +static int skx_msr_cpu_bus_read(int cpu, int die)
>>>> +{
>>>> +	int ret = rdmsrl_on_cpu(cpu, SKX_MSR_CPU_BUS_NUMBER,
>>>> +			((u64 *)skx_uncore_iio.platform_topology) + die);
>>>> +
>>>> +	if (!ret) {
>>>> +		if (!skx_iio_topology_valid(*(((u64 *)skx_uncore_iio.platform_topology) + die)))
>>>> +			ret = -1;
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int skx_iio_get_topology(struct intel_uncore_type *type)
>>>> +{
>>>> +	int ret, cpu, die, current_die;
>>>> +	struct pci_bus *bus = NULL;
>>>> +
>>>> +	while ((bus = pci_find_next_bus(bus)) != NULL)
>>>> +		if (pci_domain_nr(bus)) {
>>>> +			pr_info("Mapping of I/O stack to PMON ranges is not supported for multi-segment topology\n");
>>>> +			return -1;
>>>> +		}
>>>> +
>>>> +	/* Size of SKX_MSR_CPU_BUS_NUMBER is 8 bytes, the MSR has package scope.*/
>>>> +	type->platform_topology =
>>>> +		kzalloc(get_max_dies() * sizeof(u64), GFP_KERNEL);
>>>> +	if (!type->platform_topology)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/*
>>>> +	 * Using cpus_read_lock() to ensure cpu is not going down between
>>>> +	 * looking at cpu_online_mask.
>>>> +	 */
>>>> +	cpus_read_lock();
>>>> +	/* Invalid value to start loop.*/
>>>> +	current_die = -1;
>>>> +	for_each_online_cpu(cpu) {
>>>> +		die = topology_logical_die_id(cpu);
>>>> +		if (current_die == die)
>>>> +			continue;
>>>> +		ret = skx_msr_cpu_bus_read(cpu, die);
>>>> +		if (ret)
>>>> +			break;
>>>> +		current_die = die;
>>>> +	}
>>>> +	cpus_read_unlock();
>>>> +
>>>> +	if (ret)
>>>> +		kfree(type->platform_topology);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int skx_iio_set_mapping(struct intel_uncore_type *type)
>>>> +{
>>>> +	/*
>>>> +	 * Each IIO stack (PCIe root port) has its own IIO PMON block, so each
>>>> +	 * platform_mapping holds bus number(s) of PCIe root port(s), which can
>>>> +	 * be monitored by that IIO PMON block.
>>>> +	 *
>>>> +	 * For example, on a 4-die Xeon platform with up to 6 IIO stacks per die
>>>> +	 * and, therefore, 6 IIO PMON blocks per die, the platform_mapping of IIO
>>>> +	 * PMON block 0 holds "0000:00,0000:40,0000:80,0000:c0":
>>>> +	 *
>>>> +	 * $ cat /sys/devices/uncore_iio_0/platform_mapping
>>>> +	 * 0000:00,0000:40,0000:80,0000:c0
>>>> +	 *
>>>> +	 * Which means:
>>>> +	 * IIO PMON block 0 on the die 0 belongs to PCIe root port located on bus 0x00, domain 0x0000
>>>> +	 * IIO PMON block 0 on the die 1 belongs to PCIe root port located on bus 0x40, domain 0x0000
>>>> +	 * IIO PMON block 0 on the die 2 belongs to PCIe root port located on bus 0x80, domain 0x0000
>>>> +	 * IIO PMON block 0 on the die 3 belongs to PCIe root port located on bus 0xc0, domain 0x0000
>>>> +	 */
>>>> +
>>>> +	int ret = 0;
>>>> +	int die, i;
>>>> +	char *buf;
>>>> +	struct intel_uncore_pmu *pmu;
>>>> +	const int template_len = 8;
>>>> +
>>>> +	for (i = 0; i < type->num_boxes; i++) {
>>>> +		pmu = type->pmus + i;
>>>> +		/* Root bus 0x00 is valid only for die 0 AND pmu_idx = 0. */
>>>> +		if (skx_iio_topology_byte(type->platform_topology, 0, pmu->pmu_idx) || (!pmu->pmu_idx)) {
>>>> +			pmu->platform_mapping =
>>>> +				kzalloc(get_max_dies() * template_len + 1, GFP_KERNEL);
>>>> +			if (pmu->platform_mapping) {
>>>> +				buf = (char *)pmu->platform_mapping;
>>>> +				for (die = 0; die < get_max_dies(); die++)
>>>> +					buf += snprintf(buf, template_len + 1, "%04x:%02x,", 0,
>>>> +						skx_iio_topology_byte(type->platform_topology,
>>>> +								      die, pmu->pmu_idx));
>>>> +
>>>> +				*(--buf) = '\0';
>>>> +			} else {
>>>> +				for (; i >= 0; i--)
>>>> +					kfree((type->pmus + i)->platform_mapping);
>>>> +				ret = -ENOMEM;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +	}
>>>> +
>>>> +	kfree(type->platform_topology);
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  void skx_uncore_cpu_init(void)
>>>>  {
>>>>  	skx_uncore_chabox.num_boxes = skx_count_chabox();
>>>> --
>>>> 2.19.1
>>>>
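
P.S. On the question of how platform_mapping helps map actual devices to an IIO PMU: below is a minimal userspace sketch, not taken from the patch set, of one way a tool might consume the "domain:bus" lists. The names iio_stack, add_mapping() and find_stack() are hypothetical. The lookup rule (a device belongs to the stack whose root bus is the largest one not exceeding the device's bus number) is an assumption about how bus ranges are laid out, and it may not hold on every platform.

```c
#include <stdio.h>

/* Hypothetical record: one root bus read from uncore_iio_<pmu>/platform_mapping. */
struct iio_stack {
	int pmu;		/* N in uncore_iio_N */
	int die;		/* position of the entry within the line */
	unsigned int domain;
	unsigned int bus;	/* root bus number of the stack */
};

/*
 * Parse one platform_mapping line ("0000:00,0000:40,...") for the given PMU
 * and store its entries in out[]. Returns the number of entries stored, or
 * -1 on malformed input or when out[] is too small.
 */
int add_mapping(int pmu, const char *line, struct iio_stack *out, int max)
{
	const char *p = line;
	int n = 0, die = 0;

	while (*p && *p != '\n') {
		unsigned int domain, bus;
		int used = 0;

		if (sscanf(p, "%x:%x%n", &domain, &bus, &used) != 2)
			return -1;
		p += used;
		if (*p == ',')
			p++;
		else if (*p && *p != '\n')
			return -1;

		/*
		 * Following the comment in skx_iio_set_mapping(): root bus 0x00
		 * is only meaningful for die 0 of PMU 0, so skip it elsewhere.
		 */
		if (bus != 0 || (pmu == 0 && die == 0)) {
			if (n == max)
				return -1;
			out[n].pmu = pmu;
			out[n].die = die;
			out[n].domain = domain;
			out[n].bus = bus;
			n++;
		}
		die++;
	}
	return n;
}

/*
 * ASSUMPTION (not stated in the patch): a device on `bus` sits below the
 * stack whose root bus is the largest one that does not exceed `bus`
 * within the same domain. Returns NULL if no stack qualifies.
 */
const struct iio_stack *find_stack(const struct iio_stack *s, int n,
				   unsigned int domain, unsigned int bus)
{
	const struct iio_stack *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (s[i].domain != domain || s[i].bus > bus)
			continue;
		if (!best || s[i].bus > best->bus)
			best = &s[i];
	}
	return best;
}
```

With the four sample lines from the 4-socket SKX quoted earlier, a hypothetical device on bus 0x45 would resolve to uncore_iio_1 on die 1 under this rule, since 0x44 is the closest root bus below it. A real tool would read the lines from sysfs and take the device's bus number from its b:d.f address.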