2020-10-03 01:17:17

by Ricardo Neri

Subject: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

Hybrid CPU topologies combine CPUs of different microarchitectures in the
same die. Thus, even though the instruction set is compatible among all
CPUs, there may still be differences in features (e.g., some CPUs may
have counters that other CPUs do not). There may be applications
interested in knowing the types of micro-architecture in the system's
topology to make decisions about process affinity.

While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
cpu_capacity) may be used to infer the types of micro-architecture of the
CPUs in the platform, it may not be entirely accurate. For instance, two
subsets of CPUs with different types of micro-architecture may have the
same capacity due to power or thermal constraints.

Create the new directory /sys/devices/system/cpu/types. Under this
directory, create individual subdirectories for each type of CPU micro-
architecture. Each subdirectory will have cpulist and cpumap files. This
makes it convenient for user space to read all the CPUs of the same type
at once without having to inspect each CPU individually.

Implement a generic interface using weak functions that architectures can
override to indicate a) support for CPU types, b) the CPU type number, and
c) a string to identify the CPU vendor and type.

For example, an x86 system with one Intel Core and four Intel Atom CPUs
would look like this (other architectures have the hooks to use whatever
directory naming convention below "types" that meets their needs):

user@host:~$ ls /sys/devices/system/cpu/types
intel_atom_0 intel_core_0

user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
cpulist cpumap

user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
cpulist cpumap

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
0f

user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
0-3

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
10

user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
4

On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
created. Add a hook for this purpose.
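
As an illustration only (a rough sketch; the real x86 hooks come in
patch 4/4 of this series and may differ), an architecture could override
the weak functions along these lines:

	/*
	 * Illustration only: the helper names and type values below are
	 * made up; each architecture defines its own.
	 */
	u32 arch_get_cpu_type(int cpu)
	{
		/* e.g., return a per-CPU type id cached at boot */
		return per_cpu_type_id(cpu);
	}

	bool arch_has_cpu_type(void)
	{
		/* True only when more than one CPU type is present */
		return platform_is_hybrid();
	}

	const char *arch_get_cpu_type_name(u32 cpu_type)
	{
		switch (cpu_type) {
		case 0x20:
			return "intel_atom_0";
		case 0x40:
			return "intel_core_0";
		default:
			return NULL;
		}
	}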

Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "Gautham R. Shenoy" <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: [email protected]
Reviewed-by: Tony Luck <[email protected]>
Suggested-by: Len Brown <[email protected]> # Necessity of the interface
Suggested-by: Dave Hansen <[email protected]> # Details of the interface
Signed-off-by: Ricardo Neri <[email protected]>
---
.../ABI/testing/sysfs-devices-system-cpu | 13 ++
drivers/base/cpu.c | 214 ++++++++++++++++++
include/linux/cpu.h | 12 +
include/linux/cpuhotplug.h | 1 +
4 files changed, 240 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index b555df825447..bcb3d7195102 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -614,3 +614,16 @@ Description: SPURR ticks for cpuX when it was idle.

This sysfs interface exposes the number of SPURR ticks
for cpuX when it was idle.
+
+What: /sys/devices/system/cpu/types
+ /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpumap
+ /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpulist
+Date: Oct 2020
+Contact: Linux kernel mailing list <[email protected]>
+Description: Types of CPUs in hybrid systems
+
+ If a platform has CPUs with more than one type of micro-
+ architecture, there is one directory for each type of CPU.
+ Inside each directory, the files cpumap and cpulist contain
+ a mask and a list representing CPUs of the same type of micro-
+ architecture. These files only contain CPUs currently online.
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index d2136ab9b14a..7e98b11b15a1 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -607,11 +607,225 @@ static void __init cpu_register_vulnerabilities(void)
static inline void cpu_register_vulnerabilities(void) { }
#endif

+static ssize_t cpulist_show(struct device *device,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct cpumask *mask = dev_get_drvdata(device);
+
+ if (!mask)
+ return -EINVAL;
+
+ return cpumap_print_to_pagebuf(true, buf, mask);
+}
+
+static ssize_t cpumap_show(struct device *device,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct cpumask *mask = dev_get_drvdata(device);
+
+ if (!mask)
+ return -EINVAL;
+
+ return cpumap_print_to_pagebuf(false, buf, mask);
+}
+
+static DEVICE_ATTR_RO(cpumap);
+static DEVICE_ATTR_RO(cpulist);
+
+static struct attribute *cpu_type_attrs[] = {
+ &dev_attr_cpumap.attr,
+ &dev_attr_cpulist.attr,
+ NULL
+};
+
+static const struct attribute_group cpu_type_attr_group = {
+ .attrs = cpu_type_attrs,
+};
+
+static const struct attribute_group *cpu_type_attr_groups[] = {
+ &cpu_type_attr_group,
+ NULL,
+};
+
+/* Root device for the cpu_type sysfs entries */
+static struct device *cpu_type_device;
+/*
+ * An array of cpu_type devices. One for each type of supported CPU micro-
+ * architecture.
+ */
+static struct device *cpu_type_devices[CPUTYPES_MAX_NR];
+
+/**
+ * arch_get_cpu_type() - Get the CPU type number
+ * @cpu: Index of the CPU of which the type is needed
+ *
+ * Get the CPU type number of @cpu. The type number is a non-zero unsigned
+ * 32-bit value that uniquely identifies a type of CPU micro-architecture.
+ * All CPUs of the same type have the same type number. Type numbers are
+ * defined by each CPU architecture.
+ */
+u32 __weak arch_get_cpu_type(int cpu)
+{
+ return 0;
+}
+
+/**
+ * arch_has_cpu_type() - Check if CPU types are supported
+ *
+ * Returns true if the running platform supports CPU types. This is true if the
+ * platform has CPUs of more than one type of micro-architecture. Otherwise,
+ * returns false.
+ */
+bool __weak arch_has_cpu_type(void)
+{
+ return false;
+}
+
+/**
+ * arch_get_cpu_type_name() - Get the CPU type name
+ * @cpu_type: Type of CPU micro-architecture.
+ *
+ * Returns a string name associated with the CPU micro-architecture type as
+ * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
+ * NULL if the CPU type is not known.
+ */
+const char __weak *arch_get_cpu_type_name(u32 cpu_type)
+{
+ return NULL;
+}
+
+static int cpu_type_create_device(int cpu, int idx)
+{
+ u32 cpu_type = arch_get_cpu_type(cpu);
+ struct cpumask *mask;
+ const char *name;
+ char buf[64];
+
+ mask = kzalloc(cpumask_size(), GFP_KERNEL);
+ if (!mask)
+ return -ENOMEM;
+
+ name = arch_get_cpu_type_name(cpu_type);
+ if (!name) {
+ snprintf(buf, sizeof(buf), "unknown_%u", cpu_type);
+ name = buf;
+ }
+
+ cpu_type_devices[idx] = cpu_device_create(cpu_type_device, mask,
+ cpu_type_attr_groups, name);
+ if (IS_ERR(cpu_type_devices[idx]))
+ goto free_cpumask;
+
+ cpumask_set_cpu(cpu, mask);
+ return 0;
+
+free_cpumask:
+ kfree(mask);
+ return -ENOMEM;
+}
+
+static int cpu_type_sysfs_online(unsigned int cpu)
+{
+ u32 cpu_type = arch_get_cpu_type(cpu);
+ struct cpumask *mask;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
+ u32 this_cpu_type;
+
+ /*
+ * The first empty slot in the array means that no device exists yet
+ * for this type. Thus, create a new device and sysfs directory for
+ * the type of @cpu.
+ */
+ if (!cpu_type_devices[i])
+ return cpu_type_create_device(cpu, i);
+
+ mask = dev_get_drvdata(cpu_type_devices[i]);
+ if (!mask || !cpumask_weight(mask)) {
+ /*
+ * We should not be here. Be paranoid about
+ * NULL pointers.
+ */
+ dev_err(cpu_type_devices[i], "CPU mask is invalid");
+ return -EINVAL;
+ }
+
+ this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
+ if (cpu_type == this_cpu_type) {
+ cpumask_set_cpu(cpu, mask);
+ return 0;
+ }
+ }
+
+ return -ENODEV;
+}
+
+static int cpu_type_sysfs_offline(unsigned int cpu)
+{
+ u32 cpu_type = arch_get_cpu_type(cpu);
+ struct cpumask *mask;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
+ u32 this_cpu_type;
+
+ if (!cpu_type_devices[i])
+ continue;
+
+ mask = dev_get_drvdata(cpu_type_devices[i]);
+ if (!mask || !cpumask_weight(mask)) {
+ /*
+ * We should not be here. Be paranoid about
+ * NULL pointers.
+ */
+ dev_err(cpu_type_devices[i], "CPU mask is invalid");
+ continue;
+ }
+
+ this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
+ if (cpu_type == this_cpu_type) {
+ cpumask_clear_cpu(cpu, mask);
+ return 0;
+ }
+ }
+
+ /*
+ * If we are here, no matching cpu_type was found. This CPU was not
+ * accounted for at hotplug online.
+ */
+ pr_warn("Unexpected CPU offline!\n");
+
+ return -ENODEV;
+}
+
+static void __init cpu_type_sysfs_register(void)
+{
+ struct device *dev = cpu_subsys.dev_root;
+ int ret;
+
+ if (!arch_has_cpu_type())
+ return;
+
+ cpu_type_device = cpu_device_create(dev, NULL, NULL, "types");
+ if (IS_ERR(cpu_type_device))
+ return;
+
+ ret = cpuhp_setup_state(CPUHP_AP_BASE_CPUTYPE_ONLINE,
+ "base/cpu_type_sysfs:online",
+ cpu_type_sysfs_online, cpu_type_sysfs_offline);
+ if (ret)
+ dev_warn(dev, "failed to register CPU type sysfs\n");
+}
+
void __init cpu_dev_init(void)
{
if (subsys_system_register(&cpu_subsys, cpu_root_attr_groups))
panic("Failed to register CPU subsystem");

cpu_dev_register_generic();
+ cpu_type_sysfs_register();
cpu_register_vulnerabilities();
}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 8aa84c052fdf..76fba70c7801 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -228,4 +228,16 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0;
extern bool cpu_mitigations_off(void);
extern bool cpu_mitigations_auto_nosmt(void);

+#ifndef CPUTYPES_MAX_NR
+/*
+ * The maximum number of types of cpus. Architecture-specific implementations
+ * shall override this value.
+ */
+#define CPUTYPES_MAX_NR 0
+#endif
+
+extern u32 arch_get_cpu_type(int cpu);
+extern bool arch_has_cpu_type(void);
+extern const char *arch_get_cpu_type_name(u32 cpu_type);
+
#endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6f524bbf71a2..fd90ccb6652c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -186,6 +186,7 @@ enum cpuhp_state {
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
+ CPUHP_AP_BASE_CPUTYPE_ONLINE,
CPUHP_AP_BASE_CACHEINFO_ONLINE,
CPUHP_AP_ONLINE_DYN,
CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30,
--
2.17.1


2020-10-03 03:29:41

by Randy Dunlap

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On 10/2/20 6:17 PM, Ricardo Neri wrote:
> +/**
> + * arch_get_cpu_type() - Get the CPU type number
> + * @cpu: Index of the CPU of which the index is needed
> + *
> + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that

Are you sure that @cpu is non-zero?


> + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> + * type have the same type number. Type numbers are defined by each CPU
> + * architecture.
> + */
> +u32 __weak arch_get_cpu_type(int cpu)
> +{
> + return 0;
> +}

arch_get_cpu_type() in patch 4/4 allows @cpu to be 0.


--
~Randy

2020-10-03 08:55:23

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> Hybrid CPU topologies combine CPUs of different microarchitectures in the
> same die. Thus, even though the instruction set is compatible among all
> CPUs, there may still be differences in features (e.g., some CPUs may
> have counters that others CPU do not). There may be applications
> interested in knowing the type of micro-architecture topology of the
> system to make decisions about process affinity.
>
> While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> cpu_capacity) may be used to infer the types of micro-architecture of the
> CPUs in the platform, it may not be entirely accurate. For instance, two
> subsets of CPUs with different types of micro-architecture may have the
> same capacity due to power or thermal constraints.
>
> Create the new directory /sys/devices/system/cpu/types. Under such
> directory, create individual subdirectories for each type of CPU micro-
> architecture. Each subdirectory will have cpulist and cpumap files. This
> makes it convenient for user space to read all the CPUs of the same type
> at once without having to inspect each CPU individually.
>
> Implement a generic interface using weak functions that architectures can
> override to indicate a) support for CPU types, b) the CPU type number, and
> c) a string to identify the CPU vendor and type.
>
> For example, an x86 system with one Intel Core and four Intel Atom CPUs
> would look like this (other architectures have the hooks to use whatever
> directory naming convention below "types" that meets their needs):
>
> user@host:~$: ls /sys/devices/system/cpu/types
> intel_atom_0 intel_core_0
>
> user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> cpulist cpumap
>
> user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> cpulist cpumap
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> 0f
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> 0-3
>
> user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> 10
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> 4

The output of 'tree' sometimes makes it easier to see here, or:
grep -R . *
also works well.

> On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> created. Add a hook for this purpose.

Why should these not show up if the system is not "hybrid"?

>
> Cc: Andi Kleen <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: "Gautham R. Shenoy" <[email protected]>
> Cc: Kan Liang <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: "Ravi V. Shankar" <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: [email protected]
> Reviewed-by: Tony Luck <[email protected]>
> Suggested-by: Len Brown <[email protected]> # Necessity of the interface
> Suggested-by: Dave Hansen <[email protected]> # Details of the interface
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> .../ABI/testing/sysfs-devices-system-cpu | 13 ++
> drivers/base/cpu.c | 214 ++++++++++++++++++
> include/linux/cpu.h | 12 +
> include/linux/cpuhotplug.h | 1 +
> 4 files changed, 240 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index b555df825447..bcb3d7195102 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -614,3 +614,16 @@ Description: SPURR ticks for cpuX when it was idle.
>
> This sysfs interface exposes the number of SPURR ticks
> for cpuX when it was idle.
> +
> +What: /sys/devices/system/cpu/types
> + /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpumap
> + /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpulist

Please split these up into different entries, because:

> +Date: Oct 2020
> +Contact: Linux kernel mailing list <[email protected]>
> +Description: Types of CPUs in hybrid systems
> +
> + If a platform has CPUs with more than one type of micro-
> + architecture, there is one directory for each type of CPU.
> + Inside each directory, the files cpumap and cpulist contain
> + a mask and a list representing CPUs of the same type of micro-
> + architecture. These files only contain CPUs currently online.

You don't describe what mask and list actually represent. I have no clue
what they mean, so odds are, others will not either.

> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index d2136ab9b14a..7e98b11b15a1 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -607,11 +607,225 @@ static void __init cpu_register_vulnerabilities(void)
> static inline void cpu_register_vulnerabilities(void) { }
> #endif
>
> +static ssize_t cpulist_show(struct device *device,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct cpumask *mask = dev_get_drvdata(device);
> +
> + if (!mask)
> + return -EINVAL;

How can mask be NULL?

> +
> + return cpumap_print_to_pagebuf(true, buf, mask);
> +}
> +
> +static ssize_t cpumap_show(struct device *device,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct cpumask *mask = dev_get_drvdata(device);
> +
> + if (!mask)
> + return -EINVAL;

Again, how can that be true?

> +
> + return cpumap_print_to_pagebuf(false, buf, mask);
> +}
> +
> +static DEVICE_ATTR_RO(cpumap);
> +static DEVICE_ATTR_RO(cpulist);
> +
> +static struct attribute *cpu_type_attrs[] = {
> + &dev_attr_cpumap.attr,
> + &dev_attr_cpulist.attr,
> + NULL
> +};
> +
> +static const struct attribute_group cpu_type_attr_group = {
> + .attrs = cpu_type_attrs,
> +};
> +
> +static const struct attribute_group *cpu_type_attr_groups[] = {
> + &cpu_type_attr_group,
> + NULL,
> +};

ATTRIBUTE_GROUPS()?

> +
> +/* Root device for the cpu_type sysfs entries */
> +static struct device *cpu_type_device;
> +/*

No extra blank line?

> + * An array of cpu_type devices. One per each type of supported CPU micro-
> + * architecture.
> + */
> +static struct device *cpu_type_devices[CPUTYPES_MAX_NR];
> +
> +/**
> + * arch_get_cpu_type() - Get the CPU type number
> + * @cpu: Index of the CPU of which the index is needed
> + *
> + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that
> + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> + * type have the same type number. Type numbers are defined by each CPU
> + * architecture.
> + */
> +u32 __weak arch_get_cpu_type(int cpu)

Can you just have this "cpu type" be an enumerated type so we have a
clue as to what they really are? Each arch can then define it, right?

> +{
> + return 0;
> +}
> +
> +/**
> + * arch_has_cpu_type() - Check if CPU types are supported
> + *
> + * Returns true if the running platform supports CPU types. This is true if the
> + * platform has CPUs of more than one type of micro-architecture. Otherwise,
> + * returns false.
> + */
> +bool __weak arch_has_cpu_type(void)
> +{
> + return false;
> +}
> +
> +/**
> + * arch_get_cpu_type_name() - Get the CPU type name
> + * @cpu_type: Type of CPU micro-architecture.
> + *
> + * Returns a string name associated with the CPU micro-architecture type as
> + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> + * NULL if the CPU type is not known.
> + */
> +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> +{
> + return NULL;
> +}

Why is vendor part of this? Shouldn't it just be arch?

I say this as "vendor" is kind of "interesting" when it comes to other
arches...

Speaking of other arches, we all know that other arches have this
feature as well, have you worked with any other groups to verify that
this interface will also work with them?

> +
> +static int cpu_type_create_device(int cpu, int idx)
> +{
> + u32 cpu_type = arch_get_cpu_type(cpu);
> + struct cpumask *mask;
> + const char *name;
> + char buf[64];
> +
> + mask = kzalloc(cpumask_size(), GFP_KERNEL);
> + if (!mask)
> + return -ENOMEM;
> +
> + name = arch_get_cpu_type_name(cpu_type);
> + if (!name) {
> + snprintf(buf, sizeof(buf), "unknown_%u", cpu_type);

If there is no name, why are you still making one up?

Shouldn't you just not create it at all?

> + name = buf;
> + }
> +
> + cpu_type_devices[idx] = cpu_device_create(cpu_type_device, mask,
> + cpu_type_attr_groups, name);
> + if (IS_ERR(cpu_type_devices[idx]))
> + goto free_cpumask;
> +
> + cpumask_set_cpu(cpu, mask);
> + return 0;
> +
> +free_cpumask:
> + kfree(mask);
> + return -ENOMEM;
> +}
> +
> +static int cpu_type_sysfs_online(unsigned int cpu)
> +{
> + u32 cpu_type = arch_get_cpu_type(cpu);
> + struct cpumask *mask;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
> + u32 this_cpu_type;
> +
> + /*
> + * The first devices in the array are used first. Thus, create
> + * a new device as well as sysfs directory and for the type of
> + * @cpu.
> + */
> + if (!cpu_type_devices[i])
> + return cpu_type_create_device(cpu, i);
> +
> + mask = dev_get_drvdata(cpu_type_devices[i]);
> + if (!mask && !cpumask_weight(mask)) {
> + /*
> + * We should not be here. Be paranoid about
> + * NULL pointers.
> + */
> + dev_err(cpu_type_devices[i], "CPU mask is invalid");
> + return -EINVAL;
> + }
> +
> + this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
> + if (cpu_type == this_cpu_type) {
> + cpumask_set_cpu(cpu, mask);
> + return 0;
> + }
> + }
> +
> + return -ENODEV;
> +}
> +
> +static int cpu_type_sysfs_offline(unsigned int cpu)
> +{
> + u32 cpu_type = arch_get_cpu_type(cpu);
> + struct cpumask *mask;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
> + u32 this_cpu_type;
> +
> + if (!cpu_type_devices[i])
> + continue;
> +
> + mask = dev_get_drvdata(cpu_type_devices[i]);
> + if (!mask && !cpumask_weight(mask)) {
> + /*
> + * We should not be here. Be paranoid about
> + * NULL pointers.
> + */
> + dev_err(cpu_type_devices[i], "CPU mask is invalid");
> + continue;
> + }
> +
> + this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
> + if (cpu_type == this_cpu_type) {
> + cpumask_clear_cpu(cpu, mask);
> + return 0;
> + }
> + }
> +
> + /*
> + * If we are here, no matching cpu_type was found. This CPU was not
> + * accounted for at hotplug online.
> + */
> + pr_warn("Unexpected CPU offline!\n");
> +
> + return -ENODEV;
> +}
> +
> +static void __init cpu_type_sysfs_register(void)
> +{
> + struct device *dev = cpu_subsys.dev_root;
> + int ret;
> +
> + if (!arch_has_cpu_type())
> + return;
> +
> + cpu_type_device = cpu_device_create(dev, NULL, NULL, "types");

This feels like an abuse of cpu_device_create(), doesn't it?

> + if (IS_ERR(cpu_type_device))
> + return;
> +
> + ret = cpuhp_setup_state(CPUHP_AP_BASE_CPUTYPE_ONLINE,
> + "base/cpu_type_sysfs:online",
> + cpu_type_sysfs_online, cpu_type_sysfs_offline);
> + if (ret)
> + dev_warn(dev, "failed to CPU type sysfs\n");
> +}
> +
> void __init cpu_dev_init(void)
> {
> if (subsys_system_register(&cpu_subsys, cpu_root_attr_groups))
> panic("Failed to register CPU subsystem");
>
> cpu_dev_register_generic();
> + cpu_type_sysfs_register();
> cpu_register_vulnerabilities();
> }
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 8aa84c052fdf..76fba70c7801 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -228,4 +228,16 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0;
> extern bool cpu_mitigations_off(void);
> extern bool cpu_mitigations_auto_nosmt(void);
>
> +#ifndef CPUTYPES_MAX_NR
> +/*
> + * The maximum number of types of cpus. Architecture-specific implementations
> + * shall override this value.
> + */
> +#define CPUTYPES_MAX_NR 0
> +#endif
> +
> +extern u32 arch_get_cpu_type(int cpu);
> +extern bool arch_has_cpu_type(void);
> +extern const char *arch_get_cpu_type_name(u32 cpu_type);
> +
> #endif /* _LINUX_CPU_H_ */
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 6f524bbf71a2..fd90ccb6652c 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -186,6 +186,7 @@ enum cpuhp_state {
> CPUHP_AP_WATCHDOG_ONLINE,
> CPUHP_AP_WORKQUEUE_ONLINE,
> CPUHP_AP_RCUTREE_ONLINE,
> + CPUHP_AP_BASE_CPUTYPE_ONLINE,
> CPUHP_AP_BASE_CACHEINFO_ONLINE,

I'm all for messing with enumerated type values, but why put it in this
part of the list and not right before the next one?

thanks,

greg k-h

2020-10-03 11:07:29

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > +/**
> > + * arch_get_cpu_type_name() - Get the CPU type name
> > + * @cpu_type: Type of CPU micro-architecture.
> > + *
> > + * Returns a string name associated with the CPU micro-architecture type as
> > + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> > + * NULL if the CPU type is not known.
> > + */
> > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > +{
> > + return NULL;
> > +}
>
> Why is vendor part of this? Shouldn't it just be arch?
>
> I say this as "vendor" is kind of "interesting" when it comes to other
> arches...
>
> Speaking of other arches, we all know that other arches have this
> feature as well, have you worked with any other groups to verify that
> this interface will also work with them?

Here's one set of patches for ARM64 for much the same type of cpu
design:
https://android-review.googlesource.com/c/kernel/common/+/1437098/3
Yes, it's not been posted to any kernel lists, but this is public so you
need to work with the ARM developers to come up with an interface that
works for everyone please.

thanks,

greg k-h

2020-10-06 00:59:44

by Ricardo Neri

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > Hybrid CPU topologies combine CPUs of different microarchitectures in the
> > same die. Thus, even though the instruction set is compatible among all
> > CPUs, there may still be differences in features (e.g., some CPUs may
> > have counters that others CPU do not). There may be applications
> > interested in knowing the type of micro-architecture topology of the
> > system to make decisions about process affinity.
> >
> > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > cpu_capacity) may be used to infer the types of micro-architecture of the
> > CPUs in the platform, it may not be entirely accurate. For instance, two
> > subsets of CPUs with different types of micro-architecture may have the
> > same capacity due to power or thermal constraints.
> >
> > Create the new directory /sys/devices/system/cpu/types. Under such
> > directory, create individual subdirectories for each type of CPU micro-
> > architecture. Each subdirectory will have cpulist and cpumap files. This
> > makes it convenient for user space to read all the CPUs of the same type
> > at once without having to inspect each CPU individually.
> >
> > Implement a generic interface using weak functions that architectures can
> > override to indicate a) support for CPU types, b) the CPU type number, and
> > c) a string to identify the CPU vendor and type.
> >
> > For example, an x86 system with one Intel Core and four Intel Atom CPUs
> > would look like this (other architectures have the hooks to use whatever
> > directory naming convention below "types" that meets their needs):
> >
> > user@host:~$: ls /sys/devices/system/cpu/types
> > intel_atom_0 intel_core_0
> >
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > cpulist cpumap
> >
> > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > cpulist cpumap
> >
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > 0f
> >
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > 0-3
> >
> > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > 10
> >
> > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > 4

Thank you for the quick and detailed feedback, Greg!

>
> The output of 'tree' sometimes makes it easier to see here, or:
> grep -R . *
> also works well.

Indeed, this would definitely make it more readable.
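
E.g., with the example above, that would collapse to:

user@host:/sys/devices/system/cpu/types$ grep -R . *
intel_atom_0/cpulist:0-3
intel_atom_0/cpumap:0f
intel_core_0/cpulist:4
intel_core_0/cpumap:10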

>
> > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > created. Add a hook for this purpose.
>
> Why should these not show up if the system is not "hybrid"?

My thinking was that on a non-hybrid system, it does not make sense to
create this interface, as all the CPUs will be of the same type.

>
> >
> > Cc: Andi Kleen <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: "Gautham R. Shenoy" <[email protected]>
> > Cc: Kan Liang <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: "Rafael J. Wysocki" <[email protected]>
> > Cc: "Ravi V. Shankar" <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: [email protected]
> > Reviewed-by: Tony Luck <[email protected]>
> > Suggested-by: Len Brown <[email protected]> # Necessity of the interface
> > Suggested-by: Dave Hansen <[email protected]> # Details of the interface
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > .../ABI/testing/sysfs-devices-system-cpu | 13 ++
> > drivers/base/cpu.c | 214 ++++++++++++++++++
> > include/linux/cpu.h | 12 +
> > include/linux/cpuhotplug.h | 1 +
> > 4 files changed, 240 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > index b555df825447..bcb3d7195102 100644
> > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -614,3 +614,16 @@ Description: SPURR ticks for cpuX when it was idle.
> >
> > This sysfs interface exposes the number of SPURR ticks
> > for cpuX when it was idle.
> > +
> > +What: /sys/devices/system/cpu/types
> > + /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpumap
> > + /sys/devices/system/cpu/types/<vendor>_<cpu_type>/cpulist
>
> Please split these up into different entries, because:

Sure, I can split these into two entries.

>
> > +Date: Oct 2020
> > +Contact: Linux kernel mailing list <[email protected]>
> > +Description: Types of CPUs in hybrid systems
> > +
> > + If a platform has CPUs with more than one type of micro-
> > + architecture, there is one directory for each type of CPU.
> > + Inside each directory, the files cpumap and cpulist contain
> > + a mask and a list representing CPUs of the same type of micro-
> > + architecture. These files only contain CPUs currently online.
>
> You don't describe what mask and list actualy represent. I have no clue
> what they mean, so odds are, others will not either.

Sure, I will explain better what mask and list represent.

>
> > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > index d2136ab9b14a..7e98b11b15a1 100644
> > --- a/drivers/base/cpu.c
> > +++ b/drivers/base/cpu.c
> > @@ -607,11 +607,225 @@ static void __init cpu_register_vulnerabilities(void)
> > static inline void cpu_register_vulnerabilities(void) { }
> > #endif
> >
> > +static ssize_t cpulist_show(struct device *device,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct cpumask *mask = dev_get_drvdata(device);
> > +
> > + if (!mask)
> > + return -EINVAL;
>
> How can mask be NULL?

It cannot. I will remove this check.

>
> > +
> > + return cpumap_print_to_pagebuf(true, buf, mask);
> > +}
> > +
> > +static ssize_t cpumap_show(struct device *device,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct cpumask *mask = dev_get_drvdata(device);
> > +
> > + if (!mask)
> > + return -EINVAL;
>
> Again, how can that be true?

It cannot, I will remove this check.

>
> > +
> > + return cpumap_print_to_pagebuf(false, buf, mask);
> > +}
> > +
> > +static DEVICE_ATTR_RO(cpumap);
> > +static DEVICE_ATTR_RO(cpulist);
> > +
> > +static struct attribute *cpu_type_attrs[] = {
> > + &dev_attr_cpumap.attr,
> > + &dev_attr_cpulist.attr,
> > + NULL
> > +};
> > +
> > +static const struct attribute_group cpu_type_attr_group = {
> > + .attrs = cpu_type_attrs,
> > +};
> > +
> > +static const struct attribute_group *cpu_type_attr_groups[] = {
> > + &cpu_type_attr_group,
> > + NULL,
> > +};
>
> ATTRIBUTE_GROUPS()?

I will use this macro.
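
Something along these lines, I think (untested):

	static struct attribute *cpu_type_attrs[] = {
		&dev_attr_cpumap.attr,
		&dev_attr_cpulist.attr,
		NULL
	};
	/* Generates cpu_type_group and cpu_type_groups */
	ATTRIBUTE_GROUPS(cpu_type);

and then cpu_type_groups gets passed to cpu_device_create().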

>
> > +
> > +/* Root device for the cpu_type sysfs entries */
> > +static struct device *cpu_type_device;
> > +/*
>
> No extra blank line?

I will add a blank line here.
>
> > + * An array of cpu_type devices. One per each type of supported CPU micro-
> > + * architecture.
> > + */
> > +static struct device *cpu_type_devices[CPUTYPES_MAX_NR];
> > +
> > +/**
> > + * arch_get_cpu_type() - Get the CPU type number
> > + * @cpu: Index of the CPU of which the index is needed
> > + *
> > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that
> > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> > + * type have the same type number. Type numbers are defined by each CPU
> > + * architecture.
> > + */
> > +u32 __weak arch_get_cpu_type(int cpu)
>
> Can you just have this "cpu type" be an enumerated type so we have a
> clue as to what they really are? Each arch can then define it, right?

Do you mean that each arch can have its own definition of the same
enumeration or that all archs share a common enumeration?
>
> > +{
> > + return 0;
> > +}
> > +
> > +/**
> > + * arch_has_cpu_type() - Check if CPU types are supported
> > + *
> > + * Returns true if the running platform supports CPU types. This is true if the
> > + * platform has CPUs of more than one type of micro-architecture. Otherwise,
> > + * returns false.
> > + */
> > +bool __weak arch_has_cpu_type(void)
> > +{
> > + return false;
> > +}
> > +
> > +/**
> > + * arch_get_cpu_type_name() - Get the CPU type name
> > + * @cpu_type: Type of CPU micro-architecture.
> > + *
> > + * Returns a string name associated with the CPU micro-architecture type as
> > + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> > + * NULL if the CPU type is not known.
> > + */
> > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > +{
> > + return NULL;
> > +}
>
> Why is vendor part of this? Shouldn't it just be arch?
>
> I say this as "vendor" is kind of "interesting" when it comes to other
> arches...

I thought of this as maybe intel_atom vs a potential amd_minizen. I
guess arch is sufficient, as there will never be an amd_atom.

>
> Speaking of other arches, we all know that other arches have this
> feature as well, have you worked with any other groups to verify that
> this interface will also work with them?

I was not aware of the effort that you shared in a subsequent email. I
will contact the ARM engineers and discuss with them a potential
interface.

>
> > +
> > +static int cpu_type_create_device(int cpu, int idx)
> > +{
> > + u32 cpu_type = arch_get_cpu_type(cpu);
> > + struct cpumask *mask;
> > + const char *name;
> > + char buf[64];
> > +
> > + mask = kzalloc(cpumask_size(), GFP_KERNEL);
> > + if (!mask)
> > + return -ENOMEM;
> > +
> > + name = arch_get_cpu_type_name(cpu_type);
> > + if (!name) {
> > + snprintf(buf, sizeof(buf), "unknown_%u", cpu_type);
>
> If there is no name, why are you still making one up?
>
> Shouldn't you just not create it at all?

I think all CPUs in the system must be enumerated. It may happen that an
old kernel runs on a new processor and does not know about a new type in
such a processor. In this case, it will simply report that there is a
CPU type it does not know about.
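
E.g., user space would still see the new type, just without a pretty
name (the "66" below is a made-up type number):

user@host:~$ ls /sys/devices/system/cpu/types
intel_core_0 unknown_66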
>
> > + name = buf;
> > + }
> > +
> > + cpu_type_devices[idx] = cpu_device_create(cpu_type_device, mask,
> > + cpu_type_attr_groups, name);
> > + if (IS_ERR(cpu_type_devices[idx]))
> > + goto free_cpumask;
> > +
> > + cpumask_set_cpu(cpu, mask);
> > + return 0;
> > +
> > +free_cpumask:
> > + kfree(mask);
> > + return -ENOMEM;
> > +}
> > +
> > +static int cpu_type_sysfs_online(unsigned int cpu)
> > +{
> > + u32 cpu_type = arch_get_cpu_type(cpu);
> > + struct cpumask *mask;
> > + int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
> > + u32 this_cpu_type;
> > +
> > + /*
> > + * The first devices in the array are used first. Thus, create
> > + * a new device as well as sysfs directory and for the type of
> > + * @cpu.
> > + */
> > + if (!cpu_type_devices[i])
> > + return cpu_type_create_device(cpu, i);
> > +
> > + mask = dev_get_drvdata(cpu_type_devices[i]);
> > + if (!mask && !cpumask_weight(mask)) {
> > + /*
> > + * We should not be here. Be paranoid about
> > + * NULL pointers.
> > + */
> > + dev_err(cpu_type_devices[i], "CPU mask is invalid");
> > + return -EINVAL;
> > + }
> > +
> > + this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
> > + if (cpu_type == this_cpu_type) {
> > + cpumask_set_cpu(cpu, mask);
> > + return 0;
> > + }
> > + }
> > +
> > + return -ENODEV;
> > +}
> > +
> > +static int cpu_type_sysfs_offline(unsigned int cpu)
> > +{
> > + u32 cpu_type = arch_get_cpu_type(cpu);
> > + struct cpumask *mask;
> > + int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(cpu_type_devices); i++) {
> > + u32 this_cpu_type;
> > +
> > + if (!cpu_type_devices[i])
> > + continue;
> > +
> > + mask = dev_get_drvdata(cpu_type_devices[i]);
> > + if (!mask && !cpumask_weight(mask)) {
> > + /*
> > + * We should not be here. Be paranoid about
> > + * NULL pointers.
> > + */
> > + dev_err(cpu_type_devices[i], "CPU mask is invalid");
> > + continue;
> > + }
> > +
> > + this_cpu_type = arch_get_cpu_type(cpumask_first(mask));
> > + if (cpu_type == this_cpu_type) {
> > + cpumask_clear_cpu(cpu, mask);
> > + return 0;
> > + }
> > + }
> > +
> > + /*
> > + * If we are here, no matching cpu_type was found. This CPU was not
> > + * accounted for at hotplug online.
> > + */
> > + pr_warn("Unexpected CPU offline!\n");
> > +
> > + return -ENODEV;
> > +}
> > +
> > +static void __init cpu_type_sysfs_register(void)
> > +{
> > + struct device *dev = cpu_subsys.dev_root;
> > + int ret;
> > +
> > + if (!arch_has_cpu_type())
> > + return;
> > +
> > + cpu_type_device = cpu_device_create(dev, NULL, NULL, "types");
>
> This feels like an abuse of cpu_device_create(), doesn't it?

Would it be better if I use a struct device?

>
> > + if (IS_ERR(cpu_type_device))
> > + return;
> > +
> > + ret = cpuhp_setup_state(CPUHP_AP_BASE_CPUTYPE_ONLINE,
> > + "base/cpu_type_sysfs:online",
> > + cpu_type_sysfs_online, cpu_type_sysfs_offline);
> > + if (ret)
> > + dev_warn(dev, "failed to CPU type sysfs\n");
> > +}
> > +
> > void __init cpu_dev_init(void)
> > {
> > if (subsys_system_register(&cpu_subsys, cpu_root_attr_groups))
> > panic("Failed to register CPU subsystem");
> >
> > cpu_dev_register_generic();
> > + cpu_type_sysfs_register();
> > cpu_register_vulnerabilities();
> > }
> > diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> > index 8aa84c052fdf..76fba70c7801 100644
> > --- a/include/linux/cpu.h
> > +++ b/include/linux/cpu.h
> > @@ -228,4 +228,16 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0;
> > extern bool cpu_mitigations_off(void);
> > extern bool cpu_mitigations_auto_nosmt(void);
> >
> > +#ifndef CPUTYPES_MAX_NR
> > +/*
> > + * The maximum number of types of cpus. Architecture-specific implementations
> > + * shall override this value.
> > + */
> > +#define CPUTYPES_MAX_NR 0
> > +#endif
> > +
> > +extern u32 arch_get_cpu_type(int cpu);
> > +extern bool arch_has_cpu_type(void);
> > +extern const char *arch_get_cpu_type_name(u32 cpu_type);
> > +
> > #endif /* _LINUX_CPU_H_ */
> > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > index 6f524bbf71a2..fd90ccb6652c 100644
> > --- a/include/linux/cpuhotplug.h
> > +++ b/include/linux/cpuhotplug.h
> > @@ -186,6 +186,7 @@ enum cpuhp_state {
> > CPUHP_AP_WATCHDOG_ONLINE,
> > CPUHP_AP_WORKQUEUE_ONLINE,
> > CPUHP_AP_RCUTREE_ONLINE,
> > + CPUHP_AP_BASE_CPUTYPE_ONLINE,
> > CPUHP_AP_BASE_CACHEINFO_ONLINE,
>
> I'm all for messing with enumerated type values, but why put it in this
> part of the list and not right before the next one?

The reason I put CPUHP_AP_BASE_CPUTYPE_ONLINE before
CPUHP_AP_BASE_CACHEINFO_ONLINE is that both deal with enumerating and
exposing the topology of the system. I could also invert the order.

Thanks and BR,
Ricardo

2020-10-06 01:08:27

by Ricardo Neri

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Sat, Oct 03, 2020 at 01:05:48PM +0200, Greg Kroah-Hartman wrote:
> On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > +/**
> > > + * arch_get_cpu_type_name() - Get the CPU type name
> > > + * @cpu_type: Type of CPU micro-architecture.
> > > + *
> > > + * Returns a string name associated with the CPU micro-architecture type as
> > > + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> > > + * NULL if the CPU type is not known.
> > > + */
> > > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > > +{
> > > + return NULL;
> > > +}
> >
> > Why is vendor part of this? Shouldn't it just be arch?
> >
> > I say this as "vendor" is kind of "interesting" when it comes to other
> > arches...
> >
> > Speaking of other arches, we all know that other arches have this
> > feature as well, have you worked with any other groups to verify that
> > this interface will also work with them?
>
> Here's one set of patches for ARM64 for much the same type of cpu
> design:
> https://android-review.googlesource.com/c/kernel/common/+/1437098/3
> Yes, it's not been posted to any kernel lists, but this is public so you
> need to work with the ARM developers to come up with an interface that
> works for everyone please.

Thanks for the pointer, Greg! I will study this proposal and work with
the ARM engineers.

BR,
Ricardo

2020-10-06 01:48:57

by Ricardo Neri

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Fri, Oct 02, 2020 at 08:27:41PM -0700, Randy Dunlap wrote:
> On 10/2/20 6:17 PM, Ricardo Neri wrote:
> > +/**
> > + * arch_get_cpu_type() - Get the CPU type number
> > + * @cpu: Index of the CPU of which the index is needed
> > + *
> > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that

Thank you for your feedback Randy!

> Are you sure that @cpu is non-zero?

This is the intent. Maybe it can be reworked to return -EINVAL instead?
I guess it is plausible, but less likely, that an arch defines a type as
0xffffffea.

>
> > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> > + * type have the same type number. Type numbers are defined by each CPU
> > + * architecture.
> > + */
> > +u32 __weak arch_get_cpu_type(int cpu)
> > +{
> > + return 0;
> > +}
>
> arch_get_cpu_type() in patch 4/4 allows @cpu to be 0.

It should only return 0 if the system does not have
X86_FEATURE_HYBRID_CPU. The currently existing CPU types are all
non-zero as per the Intel SDM. Am I missing anything?

Thanks and BR,
Ricardo

2020-10-06 07:39:16

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > Hybrid CPU topologies combine CPUs of different microarchitectures in the
> > > same die. Thus, even though the instruction set is compatible among all
> > > CPUs, there may still be differences in features (e.g., some CPUs may
> > > have counters that others CPU do not). There may be applications
> > > interested in knowing the type of micro-architecture topology of the
> > > system to make decisions about process affinity.
> > >
> > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > > cpu_capacity) may be used to infer the types of micro-architecture of the
> > > CPUs in the platform, it may not be entirely accurate. For instance, two
> > > subsets of CPUs with different types of micro-architecture may have the
> > > same capacity due to power or thermal constraints.
> > >
> > > Create the new directory /sys/devices/system/cpu/types. Under such
> > > directory, create individual subdirectories for each type of CPU micro-
> > > architecture. Each subdirectory will have cpulist and cpumap files. This
> > > makes it convenient for user space to read all the CPUs of the same type
> > > at once without having to inspect each CPU individually.
> > >
> > > Implement a generic interface using weak functions that architectures can
> > > override to indicate a) support for CPU types, b) the CPU type number, and
> > > c) a string to identify the CPU vendor and type.
> > >
> > > For example, an x86 system with one Intel Core and four Intel Atom CPUs
> > > would look like this (other architectures have the hooks to use whatever
> > > directory naming convention below "types" that meets their needs):
> > >
> > > user@host:~$: ls /sys/devices/system/cpu/types
> > > intel_atom_0 intel_core_0
> > >
> > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > > cpulist cpumap
> > >
> > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > > cpulist cpumap
> > >
> > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > > 0f
> > >
> > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > > 0-3
> > >
> > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > > 10
> > >
> > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > > 4
>
> Thank you for the quick and detailed Greg!
>
> >
> > The output of 'tree' sometimes makes it easier to see here, or:
> > grep -R . *
> > also works well.
>
> Indeed, this would definitely make it more readable.
>
> >
> > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > > created. Add a hook for this purpose.
> >
> > Why should these not show up if the system is not "hybrid"?
>
> My thinking was that on a non-hybrid system, it does not make sense to
> create this interface, as all the CPUs will be of the same type.

Why not just have this as a "type" attribute in the existing cpuX
directory? Why does this have to be a totally separate directory, where
userspace has to figure out to look in two different spots for the same
cpu to determine what it is?

That feels wasteful, it should be much simpler to use the existing
object, right?

That way, you also show the "type" of all cpus, no matter if they are
"hybrid" or not, again, making userspace deal with things in a much
simpler manner.

> > > +/**
> > > + * arch_get_cpu_type() - Get the CPU type number
> > > + * @cpu: Index of the CPU of which the index is needed
> > > + *
> > > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that
> > > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> > > + * type have the same type number. Type numbers are defined by each CPU
> > > + * architecture.
> > > + */
> > > +u32 __weak arch_get_cpu_type(int cpu)
> >
> > Can you just have this "cpu type" be an enumerated type so we have a
> > clue as to what they really are? Each arch can then define it, right?
>
> Do you mean that each arch can have its own definition of the same
> enumeration or that all archs share a common enumeration?

Each arch should have its own definition of this enumerated type.

> > > +/**
> > > + * arch_get_cpu_type_name() - Get the CPU type name
> > > + * @cpu_type: Type of CPU micro-architecture.
> > > + *
> > > + * Returns a string name associated with the CPU micro-architecture type as
> > > + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> > > + * NULL if the CPU type is not known.
> > > + */
> > > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > > +{
> > > + return NULL;
> > > +}
> >
> > Why is vendor part of this? Shouldn't it just be arch?
> >
> > I say this as "vendor" is kind of "interesting" when it comes to other
> > arches...
>
> I thought of this as maybe intel_atcm vs a potential amd_minizen. I
> guess arch is sufficient, as there will never be an amd_atom.

Even if there is, that can be part of the cpu "type" that you have for
your enumerated type, right?

thanks,

greg k-h

2020-10-07 03:19:31

by Ricardo Neri

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > > Hybrid CPU topologies combine CPUs of different microarchitectures in the
> > > > same die. Thus, even though the instruction set is compatible among all
> > > > CPUs, there may still be differences in features (e.g., some CPUs may
> > > > have counters that others CPU do not). There may be applications
> > > > interested in knowing the type of micro-architecture topology of the
> > > > system to make decisions about process affinity.
> > > >
> > > > While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> > > > cpu_capacity) may be used to infer the types of micro-architecture of the
> > > > CPUs in the platform, it may not be entirely accurate. For instance, two
> > > > subsets of CPUs with different types of micro-architecture may have the
> > > > same capacity due to power or thermal constraints.
> > > >
> > > > Create the new directory /sys/devices/system/cpu/types. Under such
> > > > directory, create individual subdirectories for each type of CPU micro-
> > > > architecture. Each subdirectory will have cpulist and cpumap files. This
> > > > makes it convenient for user space to read all the CPUs of the same type
> > > > at once without having to inspect each CPU individually.
> > > >
> > > > Implement a generic interface using weak functions that architectures can
> > > > override to indicate a) support for CPU types, b) the CPU type number, and
> > > > c) a string to identify the CPU vendor and type.
> > > >
> > > > For example, an x86 system with one Intel Core and four Intel Atom CPUs
> > > > would look like this (other architectures have the hooks to use whatever
> > > > directory naming convention below "types" that meets their needs):
> > > >
> > > > user@host:~$: ls /sys/devices/system/cpu/types
> > > > intel_atom_0 intel_core_0
> > > >
> > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> > > > cpulist cpumap
> > > >
> > > > user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> > > > cpulist cpumap
> > > >
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> > > > 0f
> > > >
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> > > > 0-3
> > > >
> > > > user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> > > > 10
> > > >
> > > > user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> > > > 4
> >
> > Thank you for the quick and detailed Greg!
> >
> > >
> > > The output of 'tree' sometimes makes it easier to see here, or:
> > > grep -R . *
> > > also works well.
> >
> > Indeed, this would definitely make it more readable.
> >
> > >
> > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > > > created. Add a hook for this purpose.
> > >
> > > Why should these not show up if the system is not "hybrid"?
> >
> > My thinking was that on a non-hybrid system, it does not make sense to
> > create this interface, as all the CPUs will be of the same type.
>
> Why not just have this an attribute type in the existing cpuX directory?
> Why do this have to be a totally separate directory and userspace has to
> figure out to look in two different spots for the same cpu to determine
> what it is?

But if the type is located under cpuX, userspace would need to traverse
all the CPUs and create its own cpu masks. Under the types directory it
would only need to look once for each type of CPU, IMHO.
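
For instance, a task that wants to run only on Atom CPUs could read one
cpulist file and pin itself to it. A rough userspace sketch (illustrative
only, most error handling omitted):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	/* Parse a cpulist file (ranges such as "0-3,5") into an affinity mask */
	static int pin_to_cpulist(const char *path)
	{
		FILE *f = fopen(path, "r");
		cpu_set_t set;
		char sep;
		int lo, hi, n;

		if (!f)
			return -1;

		CPU_ZERO(&set);
		while ((n = fscanf(f, "%d%c", &lo, &sep)) >= 1) {
			hi = lo;
			if (n == 2 && sep == '-' &&
			    fscanf(f, "%d%c", &hi, &sep) < 1)
				break;
			for (; lo <= hi; lo++)
				CPU_SET(lo, &set);
		}
		fclose(f);

		/* Pin the calling thread to the parsed mask */
		return sched_setaffinity(0, sizeof(set), &set);
	}

called with, e.g., "/sys/devices/system/cpu/types/intel_atom_0/cpulist".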
>
> That feels wasteful, it should be much simpler to use the existing
> object, right?
>
> That way, you also show the "type" of all cpus, no matter if they are
> "hybrid" or not, again, making userspace deal with things in a much
> simpler manner.

Indeed, that would be simpler to implement, and perhaps a natural extension
of the existing interface.

Lastly, legacy and non-hybrid parts will not have a type defined. Thus,
it may not make sense for them to expose a type in sysfs.
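
If we go that route, I guess the per-CPU attribute could be as simple as
this (untested sketch, assuming the cpu device id equals the CPU number):

	static ssize_t type_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
	{
		const char *name = arch_get_cpu_type_name(arch_get_cpu_type(dev->id));

		/* No type known for this CPU */
		if (!name)
			return -ENODEV;

		return sprintf(buf, "%s\n", name);
	}
	static DEVICE_ATTR_RO(type);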

> > > > +/**
> > > > + * arch_get_cpu_type() - Get the CPU type number
> > > > + * @cpu: Index of the CPU of which the index is needed
> > > > + *
> > > > + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that
> > > > + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
> > > > + * type have the same type number. Type numbers are defined by each CPU
> > > > + * architecture.
> > > > + */
> > > > +u32 __weak arch_get_cpu_type(int cpu)
> > >
> > > Can you just have this "cpu type" be an enumerated type so we have a
> > > clue as to what they really are? Each arch can then define it, right?
> >
> > Do you mean that each arch can have its own definition of the same
> > enumeration or that all archs share a common enumeration?
>
> Each arch should have its own definition of this enumerated type.

Sure, understood.
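
For x86 that could look something like this (illustrative only; the
actual values would need to match what the hardware enumerates):

	/* Hypothetical arch-specific definition, e.g., in an x86 header */
	enum x86_cpu_type {
		X86_CPU_TYPE_UNKNOWN	= 0,
		X86_CPU_TYPE_INTEL_ATOM	= 0x20,
		X86_CPU_TYPE_INTEL_CORE	= 0x40,
	};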

>
> > > > +/**
> > > > + * arch_get_cpu_type_name() - Get the CPU type name
> > > > + * @cpu_type: Type of CPU micro-architecture.
> > > > + *
> > > > + * Returns a string name associated with the CPU micro-architecture type as
> > > > + * indicated in @cpu_type. The format shall be <vendor>_<cpu_type>. Returns
> > > > + * NULL if the CPU type is not known.
> > > > + */
> > > > +const char __weak *arch_get_cpu_type_name(u32 cpu_type)
> > > > +{
> > > > + return NULL;
> > > > +}
> > >
> > > Why is vendor part of this? Shouldn't it just be arch?
> > >
> > > I say this as "vendor" is kind of "interesting" when it comes to other
> > > arches...
> >
> > I thought of this as maybe intel_atcm vs a potential amd_minizen. I
> > guess arch is sufficient, as there will never be an amd_atom.
>
> Even if there is, that can be part of the cpu "type" that you have for
> your enumerated type, right?

Indeed. The interface I propose lets archs define their own string
names.

Thanks and BR,
Ricardo

2020-10-07 05:31:04

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote:
> On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > > > [...]
> > >
> > > Thank you for the quick and detailed feedback, Greg!
> > >
> > > >
> > > > The output of 'tree' sometimes makes it easier to see here, or:
> > > > grep -R . *
> > > > also works well.
> > >
> > > Indeed, this would definitely make it more readable.
> > >
> > > >
> > > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > > > > created. Add a hook for this purpose.
> > > >
> > > > Why should these not show up if the system is not "hybrid"?
> > >
> > > My thinking was that on a non-hybrid system, it does not make sense to
> > > create this interface, as all the CPUs will be of the same type.
> >
> > Why not just make this an attribute in the existing cpuX directory?
> > Why does this have to be a totally separate directory, with userspace
> > having to figure out that it must look in two different spots for the
> > same cpu to determine what it is?
>
> But if the type is located under cpuX, userspace would need to traverse
> all the CPUs and create its own cpu masks. Under the types directory it
> would only need to look once for each type of CPU, IMHO.

What does a "mask" do? What does userspace care about this? You would
have to create it by traversing the directories you are creating anyway,
so it's not much different, right?

> > That feels wasteful, it should be much simpler to use the existing
> > object, right?
> >
> > That way, you also show the "type" of all cpus, no matter if they are
> > "hybrid" or not, again, making userspace deal with things in a much
> > simpler manner.
>
> Indeed, that would be simpler to implement, and perhaps a natural extension
> of the existing interface.
>
> Lastly, legacy and non-hybrid parts will not have a type defined. Thus,
> it may not make sense for them to expose a type in sysfs.

That's fine, not having a sysfs file if you don't know the type is fine.
Or you can fix that up and show the type of those as well, why wouldn't
you want to?

> > > I thought of this as maybe intel_atom vs a potential amd_minizen. I
> > > guess arch is sufficient, as there will never be an amd_atom.
> >
> > Even if there is, that can be part of the cpu "type" that you have for
> > your enumerated type, right?
>
> Indeed. The interface I propose lets archs define their own string
> names.

Arches _should_ define their own names, otherwise who would?

thanks,

greg k-h

2020-10-08 03:40:47

by Ricardo Neri

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Wed, Oct 07, 2020 at 07:15:46AM +0200, Greg Kroah-Hartman wrote:
> On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote:
> > On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
> > > > On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
> > > > > On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
> > > > > > [...]
> > > >
> > > > Thank you for the quick and detailed feedback, Greg!
> > > >
> > > > >
> > > > > The output of 'tree' sometimes makes it easier to see here, or:
> > > > > grep -R . *
> > > > > also works well.
> > > >
> > > > Indeed, this would definitely make it more readable.
> > > >
> > > > >
> > > > > > On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> > > > > > created. Add a hook for this purpose.
> > > > >
> > > > > Why should these not show up if the system is not "hybrid"?
> > > >
> > > > My thinking was that on a non-hybrid system, it does not make sense to
> > > > create this interface, as all the CPUs will be of the same type.
> > >
> > > Why not just make this an attribute in the existing cpuX directory?
> > > Why does this have to be a totally separate directory, with userspace
> > > having to figure out that it must look in two different spots for the
> > > same cpu to determine what it is?
> >
> > But if the type is located under cpuX, userspace would need to traverse
> > all the CPUs and create its own cpu masks. Under the types directory it
> > would only need to look once for each type of CPU, IMHO.
>
> What does a "mask" do? What does userspace care about this?

Userspace would only need to consume the cpu masks that sysfs provides,
but it would have to implement the extra step of looking at another
directory.

> You would have to create it by traversing the directories you are creating anyway,
> so it's not much different, right?

True, that is a good point: userspace would still end up traversing
directories anyway.
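
To make the comparison concrete, a sketch of the per-type traversal under
the proposed layout (which was never merged as-is; error handling elided):

#include <glob.h>
#include <stdio.h>

int main(void)
{
        glob_t g;
        size_t i;
        char buf[256];

        /* One cpulist read per CPU *type* directory. */
        if (glob("/sys/devices/system/cpu/types/*/cpulist", 0, NULL, &g))
                return 1;
        for (i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (f && fgets(buf, sizeof(buf), f))
                        printf("%s: %s", g.gl_pathv[i], buf);
                if (f)
                        fclose(f);
        }
        globfree(&g);
        return 0;
}

With a per-cpuX attribute instead, the same loop would glob one file per
CPU (e.g. a hypothetical /sys/devices/system/cpu/cpu*/type), which is the
traversal conceded above.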

>
> > > That feels wasteful, it should be much simpler to use the existing
> > > object, right?
> > >
> > > That way, you also show the "type" of all cpus, no matter if they are
> > > "hybrid" or not, again, making userspace deal with things in a much
> > > simpler manner.
> >
> > Indeed, that would be simpler to implement, and perhaps a natural extension
> > of the existing interface.
> >
> > Lastly, legacy and non-hybrid parts will not have a type defined. Thus,
> > it may not make sense for them to expose a type in sysfs.
>
> That's fine, not having a sysfs file if you don't know the type is fine.
> Or you can fix that up and show the type of those as well, why wouldn't
> you want to?

What information should be used for the fixup? On legacy x86 parts,
CPUID would not have the leaf that provides this information. Maybe an
"unknown" type?

>
> > > > I thought of this as maybe intel_atom vs a potential amd_minizen. I
> > > > guess arch is sufficient, as there will never be an amd_atom.
> > >
> > > Even if there is, that can be part of the cpu "type" that you have for
> > > your enumerated type, right?
> >
> > Indeed. The interface I propose lets archs define their own string
> > names.
>
> Arches _should_ define their own names, otherwise who would?

Yes, indeed.

Thanks and BR,
Ricardo

2020-10-10 08:27:36

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On 10/5/20 6:15 PM, Ricardo Neri wrote:
> On Fri, Oct 02, 2020 at 08:27:41PM -0700, Randy Dunlap wrote:
>> On 10/2/20 6:17 PM, Ricardo Neri wrote:
>>> +/**
>>> + * arch_get_cpu_type() - Get the CPU type number
>>> + * @cpu: Index of the CPU of which the type is needed
>>> + *
>>> + * Get the CPU type number of @cpu, a non-zero unsigned 32-bit number that
>
> Thank you for your feedback Randy!
>
>> Are you sure that @cpu is non-zero?
>
> This is the intent. Maybe it can be reworked to return -EINVAL instead?
> I guess it is plausible, but less likely, that an arch defines a type as
> 0xffffffea.

You lost me here.

>>
>>> + * uniquely identifies a type of CPU micro-architecture. All CPUs of the same
>>> + * type have the same type number. Type numbers are defined by each CPU
>>> + * architecture.
>>> + */
>>> +u32 __weak arch_get_cpu_type(int cpu)
>>> +{
>>> + return 0;
>>> +}
>>
>> arch_get_cpu_type() in patch 4/4 allows @cpu to be 0.
>
> It should only return 0 if the system does not have
> X86_FEATURE_HYBRID_CPU. The currently existing CPU types are all
> non-zero as per the Intel SDM. Am I missing anything?

@cpu is a cpu index, not a cpu type.
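
To make the distinction concrete, a sketch of what an x86 override could
look like (the CPUID leaf and field are from the Intel SDM; the code below
is illustrative only, not the actual code from patch 4/4):

u32 arch_get_cpu_type(int cpu)
{
        u32 eax;

        /* @cpu is just an index; the *returned* value is the type. */
        if (!boot_cpu_has(X86_FEATURE_HYBRID_CPU))
                return 0;               /* type unknown, expose nothing */

        /*
         * CPUID leaf 0x1A reports the core type of the CPU executing it
         * (EAX[31:24]: 0x20 = Atom, 0x40 = Core), so a real version would
         * have to run this on @cpu, e.g. via smp_call_function_single().
         */
        eax = cpuid_eax(0x1a);
        return eax >> 24;
}

So a @cpu of 0 is a perfectly valid index; the "non-zero" wording in the
kernel-doc refers to the returned type number.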

--
~Randy

2020-11-12 06:28:46

by Brice Goglin

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

Le 07/10/2020 à 07:15, Greg Kroah-Hartman a écrit :
> On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote:
>> On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
>>> [...]
>>> Why not just make this an attribute in the existing cpuX directory?
>>> Why does this have to be a totally separate directory, with userspace
>>> having to figure out that it must look in two different spots for the
>>> same cpu to determine what it is?
>> But if the type is located under cpuX, userspace would need to traverse
>> all the CPUs and create its own cpu masks. Under the types directory it
>> would only need to look once for each type of CPU, IMHO.
> What does a "mask" do? What does userspace care about this? You would
> have to create it by traversing the directories you are creating anyway,
> so it's not much different, right?


Hello

Sorry for the late reply. As the first userspace consumer of this
interface [1], I can confirm that reading a single file to get the mask
would be better, at least for performance reasons. On large platforms, we
already have to read thousands of sysfs files to get CPU topology and
cache information, so I'd be happy not to read one more file per cpu.

Reading these sysfs files is slow, and it does not scale well when
multiple processes read them in parallel. There are ways to avoid this
repeated discovery by sharing hwloc info through XML or shmem, but it
will take years before the developers of the different runtimes all
implement this :)
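
For reference, the sharing mentioned here is plain hwloc API usage, along
these lines (hwloc 2.x; error handling and the file path are illustrative):

#include <hwloc.h>

/* One process pays for the sysfs discovery and exports the result... */
static void export_topology(void)
{
        hwloc_topology_t topo;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);      /* the slow sysfs traversal */
        hwloc_topology_export_xml(topo, "/tmp/topology.xml", 0);
        hwloc_topology_destroy(topo);
}

/* ...and the other runtimes load the XML instead of re-reading sysfs. */
static void import_topology(void)
{
        hwloc_topology_t topo;

        hwloc_topology_init(&topo);
        hwloc_topology_set_xml(topo, "/tmp/topology.xml");
        hwloc_topology_load(topo);
        hwloc_topology_destroy(topo);
}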

Thanks

Brice

[1] hwloc and lstopo are going to expose this hybrid info to HPC
runtimes (https://www.open-mpi.org/projects/hwloc/)


2020-11-12 06:44:45

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
> Le 07/10/2020 à 07:15, Greg Kroah-Hartman a écrit :
> > [...]
> > What does a "mask" do? What does userspace care about this? You would
> > have to create it by traversing the directories you are creating anyway,
> > so it's not much different, right?
>
>
> Hello
>
> Sorry for the late reply. As the first userspace consumer of this
> interface [1], I can confirm that reading a single file to get the mask
> would be better, at least for performance reasons. On large platforms, we
> already have to read thousands of sysfs files to get CPU topology and
> cache information, so I'd be happy not to read one more file per cpu.
>
> Reading these sysfs files is slow, and it does not scale well when
> multiple processes read them in parallel.

Really? Where is the slowdown? Would something like readfile() work
better for you for that?
https://lore.kernel.org/linux-api/[email protected]/

How do multiple processes slow anything down? There shouldn't be any
shared locks here.

> There are ways to avoid this repeated discovery by sharing hwloc info
> through XML or shmem, but it will take years before the developers of
> the different runtimes all implement this :)

I don't understand, what exactly are you suggesting we do here instead?

thanks,

greg k-h

2020-11-12 09:12:54

by Brice Goglin

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface


Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit :
> On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
>> [...]
>> Hello
>>
>> Sorry for the late reply. As the first userspace consumer of this
>> interface [1], I can confirm that reading a single file to get the mask
>> would be better, at least for performance reasons. On large platforms, we
>> already have to read thousands of sysfs files to get CPU topology and
>> cache information, so I'd be happy not to read one more file per cpu.
>>
>> Reading these sysfs files is slow, and it does not scale well when
>> multiple processes read them in parallel.
> Really? Where is the slowdown? Would something like readfile() work
> better for you for that?
> https://lore.kernel.org/linux-api/[email protected]/


I guess readfile would improve the sequential case by avoiding syscalls
but it would not improve the parallel case since syscalls shouldn't have
any parallel issue?

We've been watching the status of readfile() since it was posted on LKML
6 months ago, but we were actually wondering if it would end up being
included at some point.
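
For readers who have not seen the proposal, the point of readfile() is to
collapse the open+read+close triple into a single syscall. A sketch of the
intended usage (the signature follows the posted patchset; the syscall
number below is hypothetical, since one was never assigned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_readfile 443               /* hypothetical, never assigned */

static ssize_t readfile(int dfd, const char *path, char *buf, size_t len)
{
        return syscall(__NR_readfile, dfd, path, buf, len, 0);
}

int main(void)
{
        char buf[64];
        ssize_t n;

        /* one syscall instead of three per small sysfs file */
        n = readfile(AT_FDCWD, "/sys/devices/system/cpu/cpu0/topology/die_id",
                     buf, sizeof(buf));
        if (n > 0)
                fwrite(buf, 1, n, stdout);
        return n > 0 ? 0 : 1;
}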


> How do multiple processes slow anything down? There shouldn't be any
> shared locks here.


When I benchmarked this in 2016, reading a single (small) sysfs file was
41x slower when running 64 processes simultaneously on a 64-core Knights
Landing than reading from a single process. On an SGI Altix UV with 12x
8-core CPUs, reading from one process per CPU (12 total) was 60x slower
(which could mean NUMA affinity matters), and reading from one process
per core (96 total) was 491x slower.

I will try to find some time to dig further on recent kernels with perf
and readfile (both machines were running RHEL7).
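
The measurement itself is nothing fancy; roughly this kind of loop, run
either alone or pinned one instance per core (a minimal sketch, timing
1000 open+read+close cycles on one small sysfs file):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/sys/devices/system/cpu/cpu0/topology/die_id";
        struct timespec a, b;
        char buf[64];
        long ns;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < 1000; i++) {
                int fd = open(path, O_RDONLY);

                if (fd < 0)
                        return 1;
                read(fd, buf, sizeof(buf));
                close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
        printf("%ld ns per open+read+close\n", ns / 1000);
        return 0;
}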


> >> There are ways to avoid this repeated discovery by sharing hwloc info
> >> through XML or shmem, but it will take years before the developers of
> >> the different runtimes all implement this :)
> I don't understand, what exactly are you suggesting we do here instead?


I was just saying userspace has ways to mitigate the issue but it will
take time because many different projects are involved.

Brice


2020-11-12 10:52:07

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Thu, Nov 12, 2020 at 10:10:57AM +0100, Brice Goglin wrote:
>
> Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit :
> > On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
> >> [...]
> >> Reading these sysfs files is slow, and it does not scale well when
> >> multiple processes read them in parallel.
> > Really? Where is the slowdown? Would something like readfile() work
> > better for you for that?
> > https://lore.kernel.org/linux-api/[email protected]/
>
>
> I guess readfile would improve the sequential case by avoiding syscalls
> but it would not improve the parallel case since syscalls shouldn't have
> any parallel issue?

syscalls should not have parallel issues at all.

> We've been watching the status of readfile() since it was posted on LKML
> 6 months ago, but we were actually wondering if it would end up being
> included at some point.

It needs a solid reason to be merged. My "test" benchmarks are fun to
run, but I have yet to find a real need for it anywhere as the
open/read/close syscall overhead seems to be lost in the noise on any
real application workload that I can find.

If you have a real need, and it reduces overhead and cpu usage, I'm more
than willing to update the patchset and resubmit it.

> > How do multiple processes slow anything down? There shouldn't be any
> > shared locks here.
>
>
> When I benchmarked this in 2016, reading a single (small) sysfs file was
> 41x slower when running 64 processes simultaneously on a 64-core Knights
> Landing than reading from a single process. On an SGI Altix UV with 12x
> 8-core CPUs, reading from one process per CPU (12 total) was 60x slower
> (which could mean NUMA affinity matters), and reading from one process
> per core (96 total) was 491x slower.
>
> I will try to find some time to dig further on recent kernels with perf
> and readfile (both machines were running RHEL7).

2016 was a long time ago in kernel-land, please retest on a kernel.org
release, not a RHEL monstrosity.

> >> There are ways to avoid this repeated discovery by sharing hwloc info
> >> through XML or shmem, but it will take years before the developers of
> >> the different runtimes all implement this :)
> > I don't understand, what exactly are you suggesting we do here instead?
>
>
> I was just saying userspace has ways to mitigate the issue but it will
> take time because many different projects are involved.

I still don't understand, what issue are you referring to?

thanks,

greg k-h

2020-11-12 11:38:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Thu, Nov 12, 2020 at 12:21:43PM +0100, Brice Goglin wrote:
>
> Le 12/11/2020 à 11:49, Greg Kroah-Hartman a écrit :
> > On Thu, Nov 12, 2020 at 10:10:57AM +0100, Brice Goglin wrote:
> >> Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit :
> >>> On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
> >>>> [...]
> >> I guess readfile would improve the sequential case by avoiding syscalls
> >> but it would not improve the parallel case since syscalls shouldn't have
> >> any parallel issue?
> > syscalls should not have parallel issues at all.
> >
> >> We've been watching the status of readfile() since it was posted on LKML
> >> 6 months ago, but we were actually wondering if it would end up being
> >> included at some point.
> > It needs a solid reason to be merged. My "test" benchmarks are fun to
> > run, but I have yet to find a real need for it anywhere as the
> > open/read/close syscall overhead seems to be lost in the noise on any
> > real application workload that I can find.
> >
> > If you have a real need, and it reduces overhead and cpu usage, I'm more
> > than willing to update the patchset and resubmit it.
>
>
> Good, I'll give it a try.
>
>
> >>> How do multiple processes slow anything down? There shouldn't be any
> >>> shared locks here.
> >>
> >> When I benchmarked this in 2016, reading a single (small) sysfs file was
> >> 41x slower when running 64 processes simultaneously on a 64-core Knights
> >> Landing than reading from a single process. On an SGI Altix UV with 12x
> >> 8-core CPUs, reading from one process per CPU (12 total) was 60x slower
> >> (which could mean NUMA affinity matters), and reading from one process
> >> per core (96 total) was 491x slower.
> >>
> >> I will try to find some time to dig further on recent kernels with perf
> >> and readfile (both machines were running RHEL7).
> > 2016 was a long time ago in kernel-land, please retest on a kernel.org
> > release, not a RHEL monstrosity.
>
>
> Quick test on 5.8.14 from Debian (fairly close to mainline) on a server
> with 2x20 cores.
>
> I am measuring the time to do open+read+close of
> /sys/devices/system/cpu/cpu15/topology/die_id 1000 times
>
> With a single process, it takes 2ms (2us per open+read+close, looks OK).
>
> With one process per core (with careful binding, etc), it jumps from 2ms
> to 190ms (without much variation).
>
> It looks like locks in kernfs_iop_permission and kernfs_dop_revalidate
> are causing the issue.
>
> I am attaching the perf report callgraph output below.

Ouch, yes, we are hitting the single kernfs mutex for all of this, not
nice. I'll add this to my list of things to look at in the near future,
thanks for the report!

greg k-h

2020-11-17 15:59:35

by Brice Goglin

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface


Le 12/11/2020 à 11:49, Greg Kroah-Hartman a écrit :
> On Thu, Nov 12, 2020 at 10:10:57AM +0100, Brice Goglin wrote:
>> Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit :
>>> On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
>>>>
>>>> Hello
>>>>
>>>> Sorry for the late reply. As the first userspace consumer of this
>>>> interface [1], I can confirm that reading a single file to get the mask
>>>> would be better, at least for performance reasons. On large platforms, we
>>>> already have to read thousands of sysfs files to get CPU topology and
>>>> cache information, so I'd be happy not to read one more file per cpu.
>>>>
>>>> Reading these sysfs files is slow, and it does not scale well when
>>>> multiple processes read them in parallel.
>>> Really? Where is the slowdown? Would something like readfile() work
>>> better for you for that?
>>> https://lore.kernel.org/linux-api/[email protected]/
>>
>> I guess readfile would improve the sequential case by avoiding syscalls
>> but it would not improve the parallel case since syscalls shouldn't have
>> any parallel issue?
> syscalls should not have parallel issues at all.
>
>> We've been watching the status of readfile() since it was posted on LKML
>> 6 months ago, but we were actually wondering if it would end up being
>> included at some point.
> It needs a solid reason to be merged. My "test" benchmarks are fun to
> run, but I have yet to find a real need for it anywhere as the
> open/read/close syscall overhead seems to be lost in the noise on any
> real application workload that I can find.
>
> If you have a real need, and it reduces overhead and cpu usage, I'm more
> than willing to update the patchset and resubmit it.
>
>

Hello

I updated hwloc to use readfile instead of open+read+close on all those
small sysfs/procfs files. Unfortunately the improvement is very small,
only a couple of percent. On a 40-core server, our library starts in 38ms
instead of 39ms. I can't deploy your patches on larger machines, but I
tested our code on a copy of their sysfs files saved on a local disk:
For a 256-thread KNL, we go from 15ms to 14ms; for an 896-core SGI
machine, from 73ms to 71ms.

I see a 200ns improvement for readfile (2300ns) vs open+read+close (2500ns)
on my server when reading a single cpu topology file. With several
thousand sysfs files to read in the large hwloc tests above, that
confirms an overall improvement on the order of 1ms.

So, just like you said, the overhead seems to be pretty much lost in the
noise of hwloc doing its own stuff after reading hundreds of sysfs files :/

Brice


2020-11-18 10:48:17

by Brice Goglin

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface


Le 17/11/2020 à 16:55, Brice Goglin a écrit :
> [...]
> Hello
>
> I updated hwloc to use readfile instead of open+read+close on all those
> small sysfs/procfs files. Unfortunately the improvement is very small,
> only a couple of percent. On a 40-core server, our library starts in 38ms
> instead of 39ms. I can't deploy your patches on larger machines, but I
> tested our code on a copy of their sysfs files saved on a local disk:
> For a 256-thread KNL, we go from 15ms to 14ms; for an 896-core SGI
> machine, from 73ms to 71ms.


Sorry, I forgot to update some codepaths to properly use readfile yesterday :/
Here are updated and more precise numbers that show a non-negligible improvement.
Again, we're measuring the entire hwloc topology discovery, which includes reading
many sysfs files (improved thanks to readfile) and then building a hierarchy of
objects describing the machine (not modified).

Server sysfs files (dual-socket x 20 cores x SMT-2)
default 43.48ms +/-4.48
readfile 42.15ms +/-4.58 => 3.1% better
1971 readfile calls => 674ns improvement per call

Knights Landing sysfs stored on local hard drive (64 cores x SMT-4)
default 14.60ms +/-0.91
readfile 13.63ms +/-1.05 => 6.6% better
2940 readfile calls => 329ns improvement per call

SGI Altix UV sysfs stored on local hard drive (56 sockets x 8 cores x SMT-2)
default 69.12ms +/-1.40
readfile 66.03ms +/-1.35 => 4.5% better
14525 readfile calls => 212ns improvement per call

I don't know why the first case (real sysfs files) gets a much
higher standard deviation and a higher improvement per readfile call.
The other two cases match what the microbenchmarks say
(about 200ns improvement per readfile call).

Brice





2020-11-18 10:58:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Wed, Nov 18, 2020 at 11:45:46AM +0100, Brice Goglin wrote:
>
> Le 17/11/2020 à 16:55, Brice Goglin a écrit :
> > [...]
>
>
> Sorry, I forgot to update some codepaths to properly use readfile yesterday :/
> Here are updated and more precise numbers that show a non-negligible improvement.
> Again, we're measuring the entire hwloc topology discovery, which includes reading
> many sysfs files (improved thanks to readfile) and then building a hierarchy of
> objects describing the machine (not modified).
>
> Server sysfs files (dual-socket x 20 cores x SMT-2)
> default 43.48ms +/-4.48
> readfile 42.15ms +/-4.58 => 3.1% better
> 1971 readfile calls => 674ns improvement per call
>
> Knights Landing sysfs stored on local hard drive (64 cores x SMT-4)
> default 14.60ms +/-0.91
> readfile 13.63ms +/-1.05 => 6.6% better
> 2940 readfile calls => 329ns improvement per call
>
> SGI Altix UV sysfs stored on local hard drive (56 sockets x 8 cores x SMT-2)
> default 69.12ms +/-1.40
> readfile 66.03ms +/-1.35 => 4.5% better
> 14525 readfile calls => 212ns improvement per call
>
> I don't know why the first case (real sysfs files) gets a much
> higher standard deviation and a higher improvement per readfile call.
> The other two cases match what the microbenchmarks say
> (about 200ns improvement per readfile call).

Thanks a lot for the detailed numbers. I think 6.6% is the best that I
saw in my microbenchmarks as well.

While that would be great for a userspace task that does real work,
doing this type of thing is, arguably, not "real work" :)

I'll keep the patch updated over time just for fun, but odds are it's
not going to get merged unless we have some userspace workload that
really needs it.

But the global kernfs mutex you are pointing out is an issue that we
should work on resolving. I've pointed someone at it already, and maybe
we will have some results in a few weeks.

thanks,

greg k-h

2020-11-19 08:30:08

by Fox Chen

[permalink] [raw]
Subject: Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface

On Fri, Nov 13, 2020 at 2:15 PM Brice Goglin <[email protected]> wrote:
>
>
> Le 12/11/2020 à 11:49, Greg Kroah-Hartman a écrit :
>
> On Thu, Nov 12, 2020 at 10:10:57AM +0100, Brice Goglin wrote:
>
> Le 12/11/2020 à 07:42, Greg Kroah-Hartman a écrit :
>
> On Thu, Nov 12, 2020 at 07:19:48AM +0100, Brice Goglin wrote:
>
> Le 07/10/2020 à 07:15, Greg Kroah-Hartman a écrit :
>
> On Tue, Oct 06, 2020 at 08:14:47PM -0700, Ricardo Neri wrote:
>
> On Tue, Oct 06, 2020 at 09:37:44AM +0200, Greg Kroah-Hartman wrote:
>
> On Mon, Oct 05, 2020 at 05:57:36PM -0700, Ricardo Neri wrote:
>
> On Sat, Oct 03, 2020 at 10:53:45AM +0200, Greg Kroah-Hartman wrote:
>
> On Fri, Oct 02, 2020 at 06:17:42PM -0700, Ricardo Neri wrote:
>
> Hybrid CPU topologies combine CPUs of different microarchitectures in the
> same die. Thus, even though the instruction set is compatible among all
> CPUs, there may still be differences in features (e.g., some CPUs may
> have counters that others CPU do not). There may be applications
> interested in knowing the type of micro-architecture topology of the
> system to make decisions about process affinity.
>
> While the existing sysfs for capacity (/sys/devices/system/cpu/cpuX/
> cpu_capacity) may be used to infer the types of micro-architecture of the
> CPUs in the platform, it may not be entirely accurate. For instance, two
> subsets of CPUs with different types of micro-architecture may have the
> same capacity due to power or thermal constraints.
>
> Create the new directory /sys/devices/system/cpu/types. Under such
> directory, create individual subdirectories for each type of CPU micro-
> architecture. Each subdirectory will have cpulist and cpumap files. This
> makes it convenient for user space to read all the CPUs of the same type
> at once without having to inspect each CPU individually.
>
> Implement a generic interface using weak functions that architectures can
> override to indicate a) support for CPU types, b) the CPU type number, and
> c) a string to identify the CPU vendor and type.
>
> For example, an x86 system with one Intel Core and four Intel Atom CPUs
> would look like this (other architectures have the hooks to use whatever
> directory naming convention below "types" that meets their needs):
>
> user@host:~$: ls /sys/devices/system/cpu/types
> intel_atom_0 intel_core_0
>
> user@host:~$ ls /sys/devices/system/cpu/types/intel_atom_0
> cpulist cpumap
>
> user@host:~$ ls /sys/devices/system/cpu/types/intel_core_0
> cpulist cpumap
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpumap
> 0f
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_atom_0/cpulist
> 0-3
>
> user@ihost:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpumap
> 10
>
> user@host:~$ cat /sys/devices/system/cpu/types/intel_core_0/cpulist
> 4
>
> Thank you for the quick and detailed Greg!
>
> The output of 'tree' sometimes makes it easier to see here, or:
> grep -R . *
> also works well.
>
> Indeed, this would definitely make it more readable.
>
> On non-hybrid systems, the /sys/devices/system/cpu/types directory is not
> created. Add a hook for this purpose.
>
> Why should these not show up if the system is not "hybrid"?
>
> My thinking was that on a non-hybrid system, it does not make sense to
> create this interface, as all the CPUs will be of the same type.
>
> Why not just have this an attribute type in the existing cpuX directory?
> Why do this have to be a totally separate directory and userspace has to
> figure out to look in two different spots for the same cpu to determine
> what it is?
>
> But if the type is located under cpuX, usespace would need to traverse
> all the CPUs and create its own cpu masks. Under the types directory it
> would only need to look once for each type of CPU, IMHO.
>
> What does a "mask" do? What does userspace care about this? You would
> have to create it by traversing the directories you are creating anyway,
> so it's not much different, right?
>
> Hello
>
> Sorry for the late reply. As the first userspace consumer of this
> interface [1], I can confirm that reading a single file to get the mask
> would be better, at least for performance reason. On large platforms, we
> already have to read thousands of sysfs files to get CPU topology and
> cache information, I'd be happy not to read one more file per cpu.
>
> Reading these sysfs files is slow, and it does not scale well when
> multiple processes read them in parallel.
>
> Really? Where is the slowdown? Would something like readfile() work
> better for you for that?
> https://lore.kernel.org/linux-api/[email protected]/
>
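
For reference, a rough sketch of how the proposed readfile() would be
used (the prototype follows the patchset linked above; the syscall
number is hypothetical, since readfile() was never merged):

#include <fcntl.h>	/* AT_FDCWD */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_readfile
#define __NR_readfile 440	/* hypothetical: never allocated upstream */
#endif

int main(void)
{
	char buf[64];
	/* One syscall replaces the open+read+close triple. */
	long n = syscall(__NR_readfile, AT_FDCWD,
			 "/sys/devices/system/cpu/cpu0/topology/die_id",
			 buf, sizeof(buf), 0);

	if (n < 0) {
		perror("readfile");
		return 1;
	}
	printf("%.*s", (int)n, buf);
	return 0;
}
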
> I guess readfile would improve the sequential case by avoiding syscalls,
> but it would not improve the parallel case, since syscalls shouldn't have
> any parallelism issues?
>
> syscalls should not have parallelism issues at all.
>
> We've been watching the status of readfile() since it was posted on LKML
> 6 months ago, but we were actually wondering if it would end up being
> included at some point.
>
> It needs a solid reason to be merged. My "test" benchmarks are fun to
> run, but I have yet to find a real need for it anywhere as the
> open/read/close syscall overhead seems to be lost in the noise on any
> real application workload that I can find.
>
> If you have a real need, and it reduces overhead and cpu usage, I'm more
> than willing to update the patchset and resubmit it.
>
>
> Good, I'll give it a try.
>
>
> How do multiple processes slow anything down? There shouldn't be any
> shared locks here.
>
> When I benchmarked this in 2016, reading a single (small) sysfs file was
> 41x slower when running 64 processes simultaneously on a 64-core Knights
> Landing than reading from a single process. On an SGI Altix UV with 12x
> 8-core CPUs, reading from one process per CPU (12 total) was 60x slower
> (which could mean NUMA affinity matters), and reading from one process
> per core (96 total) was 491x slower.
>
> I will try to find some time to dig further on recent kernels with perf
> and readfile (both machines were running RHEL7).
>
> 2016 was a long time ago in kernel-land, please retest on a kernel.org
> release, not a RHEL monstrosity.
>
>
> Quick test on 5.8.14 from Debian (fairly close to mainline) on a server with 2x20 cores.
>
> I am measuring the time to do open+read+close of /sys/devices/system/cpu/cpu15/topology/die_id 1000 times
>
> With a single process, it takes 2ms (2us per open+read+close, looks OK).
>
> With one process per core (with careful binding, etc), it jumps from 2ms to 190ms (without much variation).
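
For concreteness, the measurement loop looks roughly like this (a
sketch of the methodology described above; the per-core fork and CPU
binding are omitted):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < 1000; i++) {
		/* The open dominates: path lookup hits kernfs locks. */
		int fd = open("/sys/devices/system/cpu/cpu15/topology/die_id",
			      O_RDONLY);
		if (fd < 0)
			return 1;
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.3f ms\n", (t1.tv_sec - t0.tv_sec) * 1e3 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}
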
>
> It looks like locks in kernfs_iop_permission and kernfs_dop_revalidate are causing the issue.
>
> I am attaching the perf report callgraph output below.
>
>
>
> There are ways to avoid this repeated discovery by sharing hwloc info
> through XML or shmem, but it will take years before the developers of
> all the different runtimes implement this :)
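
The XML sharing mentioned here might look like the following with the
hwloc API (a sketch assuming hwloc 2.x; error handling omitted):

#include <hwloc.h>

int main(void)
{
	hwloc_topology_t topo;

	/* One process pays the sysfs discovery cost once... */
	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);
	hwloc_topology_export_xml(topo, "/tmp/topology.xml", 0);
	hwloc_topology_destroy(topo);

	/* ...and later processes load the XML instead of re-reading
	 * thousands of sysfs files. */
	hwloc_topology_init(&topo);
	hwloc_topology_set_xml(topo, "/tmp/topology.xml");
	hwloc_topology_load(topo);
	hwloc_topology_destroy(topo);
	return 0;
}
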
>
> I don't understand, what exactly are you suggesting we do here instead?
>
> I was just saying userspace has ways to mitigate the issue but it will
> take time because many different projects are involved.
>
> I still don't understand, what issue are you referring to?
>
>
> Reading many sysfs files causes application startup to take many seconds when launching multiple processes at the same time.
>
> Brice
>
>
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 7K of event 'cycles'
> # Event count (approx.): 5291578622
> #
> # Children Self Command Shared Object Symbol
> # ........ ........ ............. ................. .......................................
> #
> 99.91% 0.00% fops_overhead [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> |
> ---entry_SYSCALL_64_after_hwframe
> do_syscall_64
> |
> |--98.69%--__x64_sys_openat
> | |
> | --98.67%--do_sys_openat2
> | |
> | --98.57%--do_filp_open
> | path_openat
> | |
> | |--81.83%--link_path_walk.part.0
> | | |
> | | |--52.19%--inode_permission.part.0
> | | | |
> | | | --51.86%--kernfs_iop_permission
> | | | |
> | | | |--50.92%--__mutex_lock.constprop.0
> | | | | |
> | | | | --49.58%--osq_lock
> | | | |
> | | | --0.59%--mutex_unlock
> | | |
> | | --29.47%--walk_component
> | | |
> | | --29.10%--lookup_fast
> | | |
> | | --28.76%--kernfs_dop_revalidate
> | | |
> | | --28.29%--__mutex_lock.constprop.0
> | | |
> | | --27.65%--osq_lock
> | |
> | |--9.60%--lookup_fast
> | | |
> | | --9.50%--kernfs_dop_revalidate
> | | |
> | | --9.35%--__mutex_lock.constprop.0
> | | |
> | | --9.18%--osq_lock
> | |
> | |--6.17%--may_open
> | | |
> | | --6.13%--inode_permission.part.0
> | | |
> | | --6.10%--kernfs_iop_permission
> | | |
> | | --5.90%--__mutex_lock.constprop.0
> | | |
> | | --5.80%--osq_lock
> | |
> | --0.52%--do_dentry_open
> |
> --0.63%--__prepare_exit_to_usermode
> |
> --0.58%--task_work_run
>
> 99.91% 0.01% fops_overhead [kernel.kallsyms] [k] do_syscall_64
> |
> --99.89%--do_syscall_64
> |
> |--98.69%--__x64_sys_openat
> | |
> | --98.67%--do_sys_openat2
> | |
> | --98.57%--do_filp_open
> | path_openat
> | |
> | |--81.83%--link_path_walk.part.0
> | | |
> | | |--52.19%--inode_permission.part.0
> | | | |
> | | | --51.86%--kernfs_iop_permission
> | | | |
> | | | |--50.92%--__mutex_lock.constprop.0
> | | | | |
> | | | | --49.58%--osq_lock
> | | | |
> | | | --0.59%--mutex_unlock
> | | |
> | | --29.47%--walk_component
> | | |
> | | --29.10%--lookup_fast
> | | |
> | | --28.76%--kernfs_dop_revalidate
> | | |
> | | --28.29%--__mutex_lock.constprop.0
> | | |
> | | --27.65%--osq_lock
> | |
> | |--9.60%--lookup_fast
> | | |
> | | --9.50%--kernfs_dop_revalidate
> | | |
> | | --9.35%--__mutex_lock.constprop.0
> | | |
> | | --9.18%--osq_lock
> | |
> | |--6.17%--may_open
> | | |
> | | --6.13%--inode_permission.part.0
> | | |
> | | --6.10%--kernfs_iop_permission
> | | |
> | | --5.90%--__mutex_lock.constprop.0
> | | |
> | | --5.80%--osq_lock
> | |
> | --0.52%--do_dentry_open
> |
> --0.63%--__prepare_exit_to_usermode
> |
> --0.58%--task_work_run
>
> 98.72% 0.00% fops_overhead [unknown] [k] 0x7379732f73656369
> |
> ---0x7379732f73656369
> __GI___libc_open
> |
> --98.70%--entry_SYSCALL_64_after_hwframe
> do_syscall_64
> |
> --98.66%--__x64_sys_openat
> |
> --98.65%--do_sys_openat2
> |
> --98.55%--do_filp_open
> path_openat
> |
> |--81.80%--link_path_walk.part.0
> | |
> | |--52.16%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 98.72% 0.00% fops_overhead libc-2.31.so [.] __GI___libc_open
> |
> ---__GI___libc_open
> |
> --98.70%--entry_SYSCALL_64_after_hwframe
> do_syscall_64
> |
> --98.66%--__x64_sys_openat
> |
> --98.65%--do_sys_openat2
> |
> --98.55%--do_filp_open
> path_openat
> |
> |--81.80%--link_path_walk.part.0
> | |
> | |--52.16%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 98.69% 0.01% fops_overhead [kernel.kallsyms] [k] __x64_sys_openat
> |
> --98.67%--__x64_sys_openat
> do_sys_openat2
> |
> --98.57%--do_filp_open
> path_openat
> |
> |--81.83%--link_path_walk.part.0
> | |
> | |--52.19%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 98.67% 0.03% fops_overhead [kernel.kallsyms] [k] do_sys_openat2
> |
> --98.65%--do_sys_openat2
> |
> --98.57%--do_filp_open
> path_openat
> |
> |--81.83%--link_path_walk.part.0
> | |
> | |--52.19%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 98.57% 0.00% fops_overhead [kernel.kallsyms] [k] do_filp_open
> |
> ---do_filp_open
> path_openat
> |
> |--81.83%--link_path_walk.part.0
> | |
> | |--52.19%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 98.57% 0.01% fops_overhead [kernel.kallsyms] [k] path_openat
> |
> --98.56%--path_openat
> |
> |--81.83%--link_path_walk.part.0
> | |
> | |--52.19%--inode_permission.part.0
> | | |
> | | --51.86%--kernfs_iop_permission
> | | |
> | | |--50.92%--__mutex_lock.constprop.0
> | | | |
> | | | --49.58%--osq_lock
> | | |
> | | --0.59%--mutex_unlock
> | |
> | --29.47%--walk_component
> | |
> | --29.10%--lookup_fast
> | |
> | --28.76%--kernfs_dop_revalidate
> | |
> | --28.29%--__mutex_lock.constprop.0
> | |
> | --27.65%--osq_lock
> |
> |--9.60%--lookup_fast
> | |
> | --9.50%--kernfs_dop_revalidate
> | |
> | --9.35%--__mutex_lock.constprop.0
> | |
> | --9.18%--osq_lock
> |
> |--6.17%--may_open
> | |
> | --6.13%--inode_permission.part.0
> | |
> | --6.10%--kernfs_iop_permission
> | |
> | --5.90%--__mutex_lock.constprop.0
> | |
> | --5.80%--osq_lock
> |
> --0.52%--do_dentry_open
>
> 94.52% 1.30% fops_overhead [kernel.kallsyms] [k] __mutex_lock.constprop.0
> |
> |--93.23%--__mutex_lock.constprop.0
> | |
> | |--92.23%--osq_lock
> | |
> | --0.55%--mutex_spin_on_owner
> |
> --1.30%--0x7379732f73656369
> __GI___libc_open
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_openat
> do_sys_openat2
> do_filp_open
> path_openat
> |
> --1.09%--link_path_walk.part.0
> |
> --0.75%--inode_permission.part.0
> kernfs_iop_permission
> __mutex_lock.constprop.0
>
> 92.24% 92.22% fops_overhead [kernel.kallsyms] [k] osq_lock
> |
> --92.22%--0x7379732f73656369
> __GI___libc_open
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_openat
> do_sys_openat2
> do_filp_open
> path_openat
> |
> |--77.21%--link_path_walk.part.0
> | |
> | |--49.57%--inode_permission.part.0
> | | kernfs_iop_permission
> | | __mutex_lock.constprop.0
> | | osq_lock
> | |
> | --27.64%--walk_component
> | lookup_fast
> | kernfs_dop_revalidate
> | __mutex_lock.constprop.0
> | osq_lock
> |
> |--9.18%--lookup_fast
> | kernfs_dop_revalidate
> | __mutex_lock.constprop.0
> | osq_lock
> |
> --5.80%--may_open
> inode_permission.part.0
> kernfs_iop_permission
> __mutex_lock.constprop.0
> osq_lock
>
> 81.83% 0.03% fops_overhead [kernel.kallsyms] [k] link_path_walk.part.0
> |
> --81.80%--link_path_walk.part.0
> |
> |--52.19%--inode_permission.part.0
> | |
> | --51.86%--kernfs_iop_permission
> | |
> | |--50.92%--__mutex_lock.constprop.0
> | | |
> | | --49.58%--osq_lock
> | |
> | --0.59%--mutex_unlock
> |
> --29.47%--walk_component
> |
> --29.10%--lookup_fast
> |
> --28.76%--kernfs_dop_revalidate
> |
> --28.29%--__mutex_lock.constprop.0
> |
> --27.65%--osq_lock
>
> 58.32% 0.24% fops_overhead [kernel.kallsyms] [k] inode_permission.part.0
> |
> --58.08%--inode_permission.part.0
> |
> --57.97%--kernfs_iop_permission
> |
> |--56.81%--__mutex_lock.constprop.0
> | |
> | --55.39%--osq_lock
> |
> --0.73%--mutex_unlock
>
> 57.97% 0.00% fops_overhead [kernel.kallsyms] [k] kernfs_iop_permission
> |
> ---kernfs_iop_permission
> |
> |--56.81%--__mutex_lock.constprop.0
> | |
> | --55.39%--osq_lock
> |
> --0.73%--mutex_unlock
>
> 38.71% 0.03% fops_overhead [kernel.kallsyms] [k] lookup_fast
> |
> --38.68%--lookup_fast
> |
> --38.26%--kernfs_dop_revalidate
> |
> --37.64%--__mutex_lock.constprop.0
> |
> --36.83%--osq_lock
>
> 38.26% 0.04% fops_overhead [kernel.kallsyms] [k] kernfs_dop_revalidate
> |
> --38.22%--kernfs_dop_revalidate
> |
> --37.64%--__mutex_lock.constprop.0
> |
> --36.83%--osq_lock
>
> 29.47% 0.03% fops_overhead [kernel.kallsyms] [k] walk_component
> |
> --29.44%--walk_component
> |
> --29.10%--lookup_fast
> |
> --28.76%--kernfs_dop_revalidate
> |
> --28.29%--__mutex_lock.constprop.0
> |
> --27.65%--osq_lock
>
> 6.17% 0.03% fops_overhead [kernel.kallsyms] [k] may_open
> |
> --6.14%--may_open
> |
> --6.13%--inode_permission.part.0
> |
> --6.10%--kernfs_iop_permission
> |
> --5.90%--__mutex_lock.constprop.0
> |
> --5.80%--osq_lock
>
> 1.22% 0.00% fops_overhead [unknown] [k] 0x5541d68949564100
> |
> ---0x5541d68949564100
> __libc_start_main
> |
> |--0.68%--__close
> | |
> | --0.66%--entry_SYSCALL_64_after_hwframe
> | do_syscall_64
> | |
> | --0.61%--__prepare_exit_to_usermode
> | |
> | --0.58%--task_work_run
> |
> --0.54%--read
>
> 1.22% 0.00% fops_overhead libc-2.31.so [.] __libc_start_main
> |
> ---__libc_start_main
> |
> |--0.68%--__close
> | |
> | --0.66%--entry_SYSCALL_64_after_hwframe
> | do_syscall_64
> | |
> | --0.61%--__prepare_exit_to_usermode
> | |
> | --0.58%--task_work_run
> |
> --0.54%--read
>
> 1.06% 1.05% fops_overhead [kernel.kallsyms] [k] mutex_unlock
> |
> --1.02%--0x7379732f73656369
> __GI___libc_open
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_openat
> do_sys_openat2
> do_filp_open
> path_openat
> |
> --0.80%--link_path_walk.part.0
> |
> --0.60%--inode_permission.part.0
> kernfs_iop_permission
> |
> --0.59%--mutex_unlock
>
> 0.88% 0.79% fops_overhead [kernel.kallsyms] [k] mutex_lock
> |
> --0.68%--0x7379732f73656369
> __GI___libc_open
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_openat
> do_sys_openat2
> do_filp_open
> path_openat
>
> 0.68% 0.01% fops_overhead libc-2.31.so [.] __close
> |
> --0.67%--__close
> |
> --0.66%--entry_SYSCALL_64_after_hwframe
> do_syscall_64
> |
> --0.61%--__prepare_exit_to_usermode
> |
> --0.58%--task_work_run
>
> 0.63% 0.05% fops_overhead [kernel.kallsyms] [k] __prepare_exit_to_usermode
> |
> --0.58%--__prepare_exit_to_usermode
> task_work_run
>
> 0.58% 0.00% fops_overhead [kernel.kallsyms] [k] task_work_run
> |
> ---task_work_run
>
> 0.58% 0.10% fops_overhead [kernel.kallsyms] [k] dput
> 0.56% 0.55% fops_overhead [kernel.kallsyms] [k] mutex_spin_on_owner
> |
> --0.55%--0x7379732f73656369
> __GI___libc_open
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> __x64_sys_openat
> do_sys_openat2
> do_filp_open
> path_openat
>
> 0.54% 0.00% fops_overhead libc-2.31.so [.] read
> |
> ---read
>
> 0.52% 0.12% fops_overhead [kernel.kallsyms] [k] do_dentry_open
> 0.50% 0.00% fops_overhead [kernel.kallsyms] [k] ksys_read
> 0.50% 0.03% fops_overhead [kernel.kallsyms] [k] vfs_read
> 0.46% 0.05% fops_overhead [kernel.kallsyms] [k] __fput
> 0.45% 0.45% fops_overhead [kernel.kallsyms] [k] lockref_put_return
> 0.43% 0.43% fops_overhead [kernel.kallsyms] [k] osq_unlock
> 0.41% 0.08% fops_overhead [kernel.kallsyms] [k] step_into
> 0.41% 0.08% fops_overhead [kernel.kallsyms] [k] __d_lookup
> 0.37% 0.35% fops_overhead [kernel.kallsyms] [k] _raw_spin_lock
> 0.35% 0.03% fops_overhead [kernel.kallsyms] [k] seq_read
> 0.28% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_fop_open
> 0.27% 0.03% fops_overhead [kernel.kallsyms] [k] kernfs_fop_release
> 0.16% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_put_open_node
> 0.16% 0.00% fops_overhead [kernel.kallsyms] [k] terminate_walk
> 0.12% 0.01% fops_overhead [kernel.kallsyms] [k] __alloc_file
> 0.12% 0.00% fops_overhead [kernel.kallsyms] [k] alloc_empty_file
> 0.12% 0.01% fops_overhead [kernel.kallsyms] [k] unlazy_walk
> 0.12% 0.05% fops_overhead [kernel.kallsyms] [k] _cond_resched
> 0.12% 0.07% fops_overhead [kernel.kallsyms] [k] call_rcu
> 0.10% 0.00% fops_overhead [kernel.kallsyms] [k] __legitimize_path
> 0.09% 0.05% fops_overhead [kernel.kallsyms] [k] sysfs_kf_seq_show
> 0.09% 0.09% fops_overhead [kernel.kallsyms] [k] generic_permission
> 0.09% 0.07% fops_overhead [kernel.kallsyms] [k] rcu_all_qs
> 0.09% 0.01% fops_overhead [kernel.kallsyms] [k] security_file_open
> 0.08% 0.00% fops_overhead [kernel.kallsyms] [k] security_file_alloc
> 0.08% 0.08% fops_overhead [kernel.kallsyms] [k] lockref_get_not_dead
> 0.08% 0.03% fops_overhead [kernel.kallsyms] [k] kmem_cache_alloc
> 0.08% 0.08% fops_overhead [kernel.kallsyms] [k] apparmor_file_open
> 0.07% 0.05% fops_overhead [kernel.kallsyms] [k] kfree
> 0.05% 0.05% fops_overhead [kernel.kallsyms] [k] kernfs_fop_read
> 0.05% 0.05% fops_overhead [kernel.kallsyms] [k] set_nlink
> 0.05% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_seq_start
> 0.05% 0.03% fops_overhead [kernel.kallsyms] [k] path_init
> 0.05% 0.00% fops_overhead [kernel.kallsyms] [k] __x64_sys_close
> 0.05% 0.00% fops_overhead [kernel.kallsyms] [k] filp_close
> 0.05% 0.05% fops_overhead [kernel.kallsyms] [k] syscall_return_via_sysret
> 0.05% 0.03% fops_overhead [kernel.kallsyms] [k] __kmalloc_node
> 0.05% 0.05% fops_overhead [kernel.kallsyms] [k] rcu_segcblist_enqueue
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] vfs_open
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 0.04% 0.03% fops_overhead [kernel.kallsyms] [k] sprintf
> 0.04% 0.00% fops_overhead [kernel.kallsyms] [k] dev_attr_show
> 0.04% 0.00% fops_overhead [kernel.kallsyms] [k] die_id_show
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] kmem_cache_free
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] fsnotify_parent
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] security_inode_permission
> 0.04% 0.01% fops_overhead [kernel.kallsyms] [k] __check_object_size
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] apparmor_file_alloc_security
> 0.04% 0.00% fops_overhead [kernel.kallsyms] [k] seq_release
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] memset_erms
> 0.04% 0.04% fops_overhead [kernel.kallsyms] [k] kernfs_get_active
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] try_to_wake_up
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] vsnprintf
> 0.03% 0.01% fops_overhead [kernel.kallsyms] [k] mntput_no_expire
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] lockref_get
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] kernfs_put_active
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] fsnotify
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] locks_remove_posix
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] security_file_permission
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] rw_verify_area
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] set_root
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] nd_jump_root
> 0.03% 0.01% fops_overhead [kernel.kallsyms] [k] wake_up_q
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] __mutex_unlock_slowpath.constprop.0
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] getname_flags.part.0
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] task_work_add
> 0.03% 0.00% fops_overhead [kernel.kallsyms] [k] fput_many
> 0.03% 0.03% fops_overhead [kernel.kallsyms] [k] __legitimize_mnt
> 0.03% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_seq_stop
> 0.02% 0.02% fops_overhead [nfs] [k] nfs_do_access
> 0.02% 0.00% fops_overhead ld-2.31.so [.] _dl_map_object
> 0.02% 0.00% fops_overhead ld-2.31.so [.] open_path
> 0.02% 0.00% fops_overhead ld-2.31.so [.] __GI___open64_nocancel
> 0.02% 0.00% fops_overhead [nfs] [k] nfs_permission
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_seq_next
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] available_idle_cpu
> 0.01% 0.00% fops_overhead [unknown] [k] 0x3931206e69207364
> 0.01% 0.00% fops_overhead libc-2.31.so [.] __GI___libc_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] ksys_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] vfs_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] tty_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] n_tty_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] pty_write
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] queue_work_on
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __queue_work
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] select_task_rq_fair
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] slab_free_freelist_hook
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] __list_del_entry_valid
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] memcg_kmem_put_cache
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __syscall_return_slowpath
> 0.01% 0.01% fops_overhead libc-2.31.so [.] _dl_addr
> 0.01% 0.00% fops_overhead [unknown] [.] 0x756e696c2d34365f
> 0.01% 0.00% fops_overhead [unknown] [.] 0x00007f4b1ca1e000
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __x86_indirect_thunk_rax
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] __virt_addr_valid
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] locks_remove_file
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] memcpy_erms
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] update_rq_clock
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] entry_SYSCALL_64
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] __check_heap_object
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] apparmor_file_free_security
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] security_file_free
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] __d_lookup_rcu
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] mntput
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] get_unused_fd_flags
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] alloc_slab_page
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __slab_alloc
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] ___slab_alloc
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] allocate_slab
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __alloc_fd
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] legitimize_root
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] strncpy_from_user
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] kernfs_refresh_inode
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] build_open_flags
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] strcmp
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] memcg_kmem_get_cache
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] sysvec_apic_timer_interrupt
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] asm_call_sysvec_on_stack
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __sysvec_apic_timer_interrupt
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] hrtimer_interrupt
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __hrtimer_run_queues
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] tick_sched_timer
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] tick_sched_handle
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] update_process_times
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] scheduler_tick
> 0.01% 0.01% fops_overhead [kernel.kallsyms] [k] perf_iterate_ctx
> 0.01% 0.00% fops_overhead [unknown] [k] 0x00007fd34e3a0627
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __x64_sys_execve
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] do_execve
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] __do_execve_file
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] load_elf_binary
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] elf_map
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] vm_mmap_pgoff
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] do_mmap
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] mmap_region
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] perf_event_mmap
> 0.01% 0.00% fops_overhead [kernel.kallsyms] [k] perf_iterate_sb
> 0.00% 0.00% perf_5.8 [unknown] [k] 0x00007fd34e3a0627
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] perf_event_exec
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] do_syscall_64
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] __x64_sys_execve
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] do_execve
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] __do_execve_file
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] load_elf_binary
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] begin_new_exec
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] native_write_msr
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0
> 0.00% 0.00% perf_5.8 [kernel.kallsyms] [k] acpi_os_read_memory
>
>
> #
> # (Tip: To count events in every 1000 msec: perf stat -I 1000)
> #
>

Hi Brice,

I wrote a benchmark to do open+read+close on
/sys/devices/system/cpu/cpu0/topology/die_id
https://github.com/foxhlchen/sysfs_benchmark/blob/main/main.c


+ 3.39% 3.37% a.out [kernel.kallsyms] [k] mutex_unlock
+ 2.76% 2.74% a.out [kernel.kallsyms] [k] mutex_lock
+ 0.92% 0.42% a.out [kernel.kallsyms] [k] __mutex_lock.constprop.0
  0.38% 0.37% a.out [kernel.kallsyms] [k] mutex_spin_on_owner
  0.05% 0.05% a.out [kernel.kallsyms] [k] __mutex_init
  0.01% 0.01% a.out [kernel.kallsyms] [k] __mutex_lock_slowpath
  0.01% 0.00% a.out [kernel.kallsyms] [k] __mutex_unlock_slowpath.constprop.0

But I failed to reproduce your result.

If possible, would you mind providing your benchmark code? :)


thanks,
fox