Subject: [PATCH v9 0/5] use bin_attribute to break the size limitation of cpumap ABI

v9:
- split the bitmask and list APIs and removed the bool parameter, per
  Greg's comment;
- removed duplication in the code doc.

v8:
- added Jonathan Cameron's Reviewed-by tags, which I missed in v7;
- added Andy Shevchenko's Reviewed-by tags;
- added Yury's patch extending the comment to this series.

v7:
- updated the in-code doc for the new APIs according to Andy
  Shevchenko's comments;
- other minor cleanups and commit log fixes according to Andy
  Shevchenko's comments.

v6:
- minor cleanup according to Andy Shevchenko's comments;
- took bitmap_print_to_buf back according to Yury Norov's comment and
  fixed the documents; also took the bitmap test cases back.
- Sorry, Yury, I don't think it is doable to move the memory allocation
  into drivers.
  Consider a driver like topology.c: we have M CPUs, and each CPU has N
  nodes such as core_siblings, package_cpus, die_cpus, etc. We can't
  know the size of each node of each CPU in advance; the best time to
  get the size of a node is really when users read the sysfs node.
  Otherwise, we would have to scan M*N nodes in drivers up front to
  figure out the exact size of the buffers we need.
  On the other hand, it is extremely tricky to ask a bundle of drivers
  to find a proper place to save the pointers to allocated buffers so
  that they can be reused on a second read of the same bin_attribute
  node. And I doubt it is really useful to save the buffer addresses
  somewhere; freeing a buffer immediately seems to be the correct
  option to avoid wasting memory at runtime, as we can't predict when
  users will read the topology ABI or which node they will read.
  Finally, reading topology information is hardly CPU-bound in user
  applications, so this kind of ABI can hardly become a performance
  bottleneck. Users run numactl and lstopo to read the ABIs, but nobody
  does that again and again, and a normal application won't query
  topology repeatedly. So the overhead caused by malloc/free in the new
  bitmap API doesn't really matter.
  If we really want a place to reuse the buffer and avoid malloc/free,
  it should be done in some common place rather than in each driver;
  still, it is hard to find the best place.

Thanks to Yury and Andy for their comments on v5.

v5:
- removed the bitmap API bitmap_print_to_buf; alternatively, only
  provide the cpumap_print_to_buf API, as cpumask is what we really
  care about at this moment. We can freely take bitmap_print_to_buf
  back once we find a second user. Hopefully this alleviates Yury's
  worries about possible abuse of a new bitmap API;
- corrected the document of cpumap_print_to_buf;
- in patch 1, clearly explained in the commit log why we need this new
  API;
- also refined the commit logs of patches 2 and 3;
- as the modification is narrowed to the scope of cpumask, the kunit
  test of bitmap_print_to_buf doesn't apply to the new patchset, so
  test case patch 4/4 is removed.

Thanks to Greg, Yury and Andy for the comments, and to Jonathan for the
review.

v4:
-add test cases for bitmap_print_to_buf API;
-add Reviewed-by of Jonathan Cameron for patches 1-3, thanks!

v3:
- fixed the strlen issue and minor formatting issues in patches #1, #2
  and #3, thanks to Andy Shevchenko and Jonathan Cameron.

v2:
- split the original patch #1 into two patches and used kasprintf() in
  patch #1 to simplify the code; made some minor formatting adjustments.

Background:

The whole story began with this thread, in which Jonathan and I tried
to add a new topology level, cluster, which exists on Kunpeng 920 and
X86 Jacobsville:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

In that discussion, Greg had some concerns about the potential one-page
size limitation of the sysfs ABI for topology. Greg's concern is
reasonable, and I think we should address the problem.

At the moment, NUMA node, cpu topology and some other drivers use cpu
bitmaps and lists to expose hardware topology. When the cpu count is
large, the page-sized sysfs buffer won't be able to hold the whole
bitmask or list. This doesn't actually happen today for bitmasks, as
the maximum NR_CPUS is 8192 for X86_64 and 4096 for ARM64, and
8192 / 32 * 9 = 2304
is still smaller than the 4KB page size.

So the existing BUILD_BUG_ON() in drivers/base/node.c mainly guards
against future problems when hardware gets more and more CPUs:
static ssize_t node_read_cpumap(struct device *dev, bool list, char *buf)
{
	cpumask_var_t mask;
	struct node *node_dev = to_node(dev);

	/* 2008/04/07: buf currently PAGE_SIZE, need 9 chars per 32 bits. */
	BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
	...
}

But the ABIs exposing cpu lists are much more tricky, as a list could
look like "0,3,5,7,9,11,...", so nobody knows its size until the last
moment. Compared with a bitmask, a list exceeds one page much more
easily.

In the previous discussion, Greg and Dave Hansen preferred to remove
this kind of limitation entirely, and to remove the BUILD_BUG_ON() in
drivers/base/node.c along with it:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

Todo:

Right now, only topology and node are addressed. Many other drivers
call cpumap_print_to_pagebuf() and have similar problems; we are going
to address them one by one after this patchset settles down.

Barry Song (1):
lib: test_bitmap: add bitmap_print_bitmask/list_to_buf test cases

Tian Tao (3):
cpumask: introduce cpumap_print_list/bitmask_to_buf to support large
bitmask and list
topology: use bin_attribute to break the size limitation of cpumap ABI
drivers/base/node.c: use bin_attribute to break the size limitation of
cpumap ABI

Yury Norov (1):
bitmap: extend comment to bitmap_print_bitmask/list_to_buf

drivers/base/node.c | 63 +++++++++++------
drivers/base/topology.c | 115 ++++++++++++++++--------------
include/linux/bitmap.h | 6 ++
include/linux/cpumask.h | 38 ++++++++++
lib/bitmap.c | 121 ++++++++++++++++++++++++++++++++
lib/test_bitmap.c | 150 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 418 insertions(+), 75 deletions(-)

--
2.25.1


Subject: [PATCH v9 5/5] bitmap: extend comment to bitmap_print_bitmask/list_to_buf

From: Yury Norov <[email protected]>

Extend the comment on the new functions to warn potential users about
their caveats.

Signed-off-by: Yury Norov <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
lib/bitmap.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index 73746d96af81..663dd81967d4 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -568,6 +568,24 @@ static int bitmap_print_to_buf(bool list, char *buf, const unsigned long *maskp,
* and it can break the size limit of converted decimal list and hexadecimal
* bitmask.
*
+ * WARNING!
+ *
+ * This function is not a replacement for sprintf() or bitmap_print_to_pagebuf().
+ * It is intended to workaround sysfs limitations discussed above and should be
+ * used carefully in general case for the following reasons:
+ * - Time complexity is O(nbits^2/count), comparing to O(nbits) for snprintf().
+ * - Memory complexity is O(nbits), comparing to O(1) for snprintf().
+ * - @off and @count are NOT offset and number of bits to print.
+ * - If printing part of bitmap as list, the resulting string is not a correct
+ * list representation of bitmap. Particularly, some bits within or out of
+ * related interval may be erroneously set or unset. The format of the string
+ * may be broken, so bitmap_parselist-like parser may fail parsing it.
+ * - If printing the whole bitmap as list by parts, user must ensure the order
+ * of calls of the function such that the offset is incremented linearly.
+ * - If printing the whole bitmap as list by parts, user must keep bitmap
+ * unchanged between the very first and very last call. Otherwise concatenated
+ * result may be incorrect, and format may be broken.
+ *
* Returns the number of characters actually printed to @buf
*/
int bitmap_print_bitmask_to_buf(char *buf, const unsigned long *maskp,
--
2.25.1

Subject: [PATCH v9 4/5] drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI

From: Tian Tao <[email protected]>

Reading /sys/devices/system/cpu/cpuX/nodeX/ returns the cpumap and
cpulist. However, the size of these files is limited to PAGE_SIZE
because of the limitation of sysfs attributes.

Move to bin_attribute to extend the ABI beyond one page so that the
cpumap bitmask and list won't be potentially trimmed.

Signed-off-by: Tian Tao <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
drivers/base/node.c | 63 ++++++++++++++++++++++++++++-----------------
1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4a4ae868ad9f..70a84e85ad3f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -26,43 +26,47 @@ static struct bus_type node_subsys = {
.dev_name = "node",
};

-
-static ssize_t node_read_cpumap(struct device *dev, bool list, char *buf)
+static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t off, size_t count)
{
- ssize_t n;
- cpumask_var_t mask;
+ struct device *dev = kobj_to_dev(kobj);
struct node *node_dev = to_node(dev);
-
- /* 2008/04/07: buf currently PAGE_SIZE, need 9 chars per 32 bits. */
- BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
+ cpumask_var_t mask;
+ ssize_t n;

if (!alloc_cpumask_var(&mask, GFP_KERNEL))
return 0;

cpumask_and(mask, cpumask_of_node(node_dev->dev.id), cpu_online_mask);
- n = cpumap_print_to_pagebuf(list, buf, mask);
+ n = cpumap_print_bitmask_to_buf(buf, mask, off, count);
free_cpumask_var(mask);

return n;
}

-static inline ssize_t cpumap_show(struct device *dev,
- struct device_attribute *attr,
- char *buf)
+static BIN_ATTR_RO(cpumap, 0);
+
+static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t off, size_t count)
{
- return node_read_cpumap(dev, false, buf);
-}
+ struct device *dev = kobj_to_dev(kobj);
+ struct node *node_dev = to_node(dev);
+ cpumask_var_t mask;
+ ssize_t n;
+
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+ return 0;

-static DEVICE_ATTR_RO(cpumap);
+ cpumask_and(mask, cpumask_of_node(node_dev->dev.id), cpu_online_mask);
+ n = cpumap_print_list_to_buf(buf, mask, off, count);
+ free_cpumask_var(mask);

-static inline ssize_t cpulist_show(struct device *dev,
- struct device_attribute *attr,
- char *buf)
-{
- return node_read_cpumap(dev, true, buf);
+ return n;
}

-static DEVICE_ATTR_RO(cpulist);
+static BIN_ATTR_RO(cpulist, 0);

/**
* struct node_access_nodes - Access class device to hold user visible
@@ -557,15 +561,28 @@ static ssize_t node_read_distance(struct device *dev,
static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);

static struct attribute *node_dev_attrs[] = {
- &dev_attr_cpumap.attr,
- &dev_attr_cpulist.attr,
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
NULL
};
-ATTRIBUTE_GROUPS(node_dev);
+
+static struct bin_attribute *node_dev_bin_attrs[] = {
+ &bin_attr_cpumap,
+ &bin_attr_cpulist,
+ NULL
+};
+
+static const struct attribute_group node_dev_group = {
+ .attrs = node_dev_attrs,
+ .bin_attrs = node_dev_bin_attrs
+};
+
+static const struct attribute_group *node_dev_groups[] = {
+ &node_dev_group,
+ NULL
+};

#ifdef CONFIG_HUGETLBFS
/*
--
2.25.1

Subject: [PATCH v9 3/5] topology: use bin_attribute to break the size limitation of cpumap ABI

From: Tian Tao <[email protected]>

Reading /sys/devices/system/cpu/cpuX/topology/ returns the cpu
topology. However, the size of these files is limited to PAGE_SIZE
because of the limitation of sysfs attributes.

Move to bin_attribute to extend the ABI beyond one page so that the
cpumap bitmask and list won't be potentially trimmed.

Signed-off-by: Tian Tao <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
drivers/base/topology.c | 115 ++++++++++++++++++++++------------------
1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 4d254fcc93d1..43c0940643f5 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -21,25 +21,27 @@ static ssize_t name##_show(struct device *dev, \
return sysfs_emit(buf, "%d\n", topology_##name(dev->id)); \
}

-#define define_siblings_show_map(name, mask) \
-static ssize_t name##_show(struct device *dev, \
- struct device_attribute *attr, char *buf) \
-{ \
- return cpumap_print_to_pagebuf(false, buf, topology_##mask(dev->id));\
+#define define_siblings_read_func(name, mask) \
+static ssize_t name##_read(struct file *file, struct kobject *kobj, \
+ struct bin_attribute *attr, char *buf, \
+ loff_t off, size_t count) \
+{ \
+ struct device *dev = kobj_to_dev(kobj); \
+ \
+ return cpumap_print_bitmask_to_buf(buf, topology_##mask(dev->id), \
+ off, count); \
+} \
+ \
+static ssize_t name##_list_read(struct file *file, struct kobject *kobj, \
+ struct bin_attribute *attr, char *buf, \
+ loff_t off, size_t count) \
+{ \
+ struct device *dev = kobj_to_dev(kobj); \
+ \
+ return cpumap_print_list_to_buf(buf, topology_##mask(dev->id), \
+ off, count); \
}

-#define define_siblings_show_list(name, mask) \
-static ssize_t name##_list_show(struct device *dev, \
- struct device_attribute *attr, \
- char *buf) \
-{ \
- return cpumap_print_to_pagebuf(true, buf, topology_##mask(dev->id));\
-}
-
-#define define_siblings_show_func(name, mask) \
- define_siblings_show_map(name, mask); \
- define_siblings_show_list(name, mask)
-
define_id_show_func(physical_package_id);
static DEVICE_ATTR_RO(physical_package_id);

@@ -49,71 +51,80 @@ static DEVICE_ATTR_RO(die_id);
define_id_show_func(core_id);
static DEVICE_ATTR_RO(core_id);

-define_siblings_show_func(thread_siblings, sibling_cpumask);
-static DEVICE_ATTR_RO(thread_siblings);
-static DEVICE_ATTR_RO(thread_siblings_list);
+define_siblings_read_func(thread_siblings, sibling_cpumask);
+static BIN_ATTR_RO(thread_siblings, 0);
+static BIN_ATTR_RO(thread_siblings_list, 0);

-define_siblings_show_func(core_cpus, sibling_cpumask);
-static DEVICE_ATTR_RO(core_cpus);
-static DEVICE_ATTR_RO(core_cpus_list);
+define_siblings_read_func(core_cpus, sibling_cpumask);
+static BIN_ATTR_RO(core_cpus, 0);
+static BIN_ATTR_RO(core_cpus_list, 0);

-define_siblings_show_func(core_siblings, core_cpumask);
-static DEVICE_ATTR_RO(core_siblings);
-static DEVICE_ATTR_RO(core_siblings_list);
+define_siblings_read_func(core_siblings, core_cpumask);
+static BIN_ATTR_RO(core_siblings, 0);
+static BIN_ATTR_RO(core_siblings_list, 0);

-define_siblings_show_func(die_cpus, die_cpumask);
-static DEVICE_ATTR_RO(die_cpus);
-static DEVICE_ATTR_RO(die_cpus_list);
+define_siblings_read_func(die_cpus, die_cpumask);
+static BIN_ATTR_RO(die_cpus, 0);
+static BIN_ATTR_RO(die_cpus_list, 0);

-define_siblings_show_func(package_cpus, core_cpumask);
-static DEVICE_ATTR_RO(package_cpus);
-static DEVICE_ATTR_RO(package_cpus_list);
+define_siblings_read_func(package_cpus, core_cpumask);
+static BIN_ATTR_RO(package_cpus, 0);
+static BIN_ATTR_RO(package_cpus_list, 0);

#ifdef CONFIG_SCHED_BOOK
define_id_show_func(book_id);
static DEVICE_ATTR_RO(book_id);
-define_siblings_show_func(book_siblings, book_cpumask);
-static DEVICE_ATTR_RO(book_siblings);
-static DEVICE_ATTR_RO(book_siblings_list);
+define_siblings_read_func(book_siblings, book_cpumask);
+static BIN_ATTR_RO(book_siblings, 0);
+static BIN_ATTR_RO(book_siblings_list, 0);
#endif

#ifdef CONFIG_SCHED_DRAWER
define_id_show_func(drawer_id);
static DEVICE_ATTR_RO(drawer_id);
-define_siblings_show_func(drawer_siblings, drawer_cpumask);
-static DEVICE_ATTR_RO(drawer_siblings);
-static DEVICE_ATTR_RO(drawer_siblings_list);
+define_siblings_read_func(drawer_siblings, drawer_cpumask);
+static BIN_ATTR_RO(drawer_siblings, 0);
+static BIN_ATTR_RO(drawer_siblings_list, 0);
#endif

+static struct bin_attribute *bin_attrs[] = {
+ &bin_attr_core_cpus,
+ &bin_attr_core_cpus_list,
+ &bin_attr_thread_siblings,
+ &bin_attr_thread_siblings_list,
+ &bin_attr_core_siblings,
+ &bin_attr_core_siblings_list,
+ &bin_attr_die_cpus,
+ &bin_attr_die_cpus_list,
+ &bin_attr_package_cpus,
+ &bin_attr_package_cpus_list,
+#ifdef CONFIG_SCHED_BOOK
+ &bin_attr_book_siblings,
+ &bin_attr_book_siblings_list,
+#endif
+#ifdef CONFIG_SCHED_DRAWER
+ &bin_attr_drawer_siblings,
+ &bin_attr_drawer_siblings_list,
+#endif
+ NULL
+};
+
static struct attribute *default_attrs[] = {
&dev_attr_physical_package_id.attr,
&dev_attr_die_id.attr,
&dev_attr_core_id.attr,
- &dev_attr_thread_siblings.attr,
- &dev_attr_thread_siblings_list.attr,
- &dev_attr_core_cpus.attr,
- &dev_attr_core_cpus_list.attr,
- &dev_attr_core_siblings.attr,
- &dev_attr_core_siblings_list.attr,
- &dev_attr_die_cpus.attr,
- &dev_attr_die_cpus_list.attr,
- &dev_attr_package_cpus.attr,
- &dev_attr_package_cpus_list.attr,
#ifdef CONFIG_SCHED_BOOK
&dev_attr_book_id.attr,
- &dev_attr_book_siblings.attr,
- &dev_attr_book_siblings_list.attr,
#endif
#ifdef CONFIG_SCHED_DRAWER
&dev_attr_drawer_id.attr,
- &dev_attr_drawer_siblings.attr,
- &dev_attr_drawer_siblings_list.attr,
#endif
NULL
};

static const struct attribute_group topology_attr_group = {
.attrs = default_attrs,
+ .bin_attrs = bin_attrs,
.name = "topology"
};

--
2.25.1

2021-08-12 04:47:47

by Barry Song

Subject: Re: [PATCH v9 0/5] use bin_attribute to break the size limitation of cpumap ABI

> V9:
> - Split bitmask and list APIs and removed bool parameter with respect to
> Greg's comment
> - Removed duplication in code doc
>
...
>
> Background:
>
> the whole story began from this thread when Jonatah and me tried to add a
> new topology level-cluster which exists on kunpeng920 and X86 Jacobsville:
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>

Hi Greg,
Will you take this series so that I can rebase the cluster-scheduler
series[1] on top of it? That cluster series is where this ABI series
really got started. I am looking forward to sending a formal patchset
for the cluster series after this ABI series settles down.

[1] scheduler: expose the topology of clusters and add cluster scheduler
https://lore.kernel.org/lkml/[email protected]/

>
> Barry Song (1):
> lib: test_bitmap: add bitmap_print_bitmask/list_to_buf test cases
>
> Tian Tao (3):
> cpumask: introduce cpumap_print_list/bitmask_to_buf to support large
> bitmask and list
> topology: use bin_attribute to break the size limitation of cpumap ABI
> drivers/base/node.c: use bin_attribute to break the size limitation of
> cpumap ABI
>
> Yury Norov (1):
> bitmap: extend comment to bitmap_print_bitmask/list_to_buf
>
> drivers/base/node.c | 63 +++++++++++------
> drivers/base/topology.c | 115 ++++++++++++++++--------------
> include/linux/bitmap.h | 6 ++
> include/linux/cpumask.h | 38 ++++++++++
> lib/bitmap.c | 121 ++++++++++++++++++++++++++++++++
> lib/test_bitmap.c | 150 ++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 418 insertions(+), 75 deletions(-)
>
> --
> 2.25.1

Thanks
Barry

2021-08-13 09:04:48

by Greg KH

Subject: Re: [PATCH v9 0/5] use bin_attribute to break the size limitation of cpumap ABI

On Thu, Aug 12, 2021 at 04:44:26PM +1200, Barry Song wrote:
> > V9:
> > - Split bitmask and list APIs and removed bool parameter with respect to
> > Greg's comment
> > - Removed duplication in code doc
> >
> ...
> >
> > Background:
> >
> > the whole story began from this thread when Jonatah and me tried to add a
> > new topology level-cluster which exists on kunpeng920 and X86 Jacobsville:
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/lkml/[email protected]/
> >
>
> Hi Greg,
> Will you take this series so that I can rebase the cluster-scheduler series[1] on top of
> this? that cluster series is where this ABI series really get started. I am looking forward
> to sending a normal patchset for cluster series after this ABI series settles down.
>
> [1] scheduler: expose the topology of clusters and add cluster scheduler
> https://lore.kernel.org/lkml/[email protected]/

Now applied to my testing tree.

thanks,

greg k-h