2021-04-29 07:05:56

by Tian Tao

[permalink] [raw]
Subject: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

Both numa node and cpu use cpu bitmap like 3,ffffffff to expose hardware
topology. When cpu number is large, the page buffer of sysfs will over-
flow. This doesn't really happen nowadays as the maximum NR_CPUS is 8196
for X86_64 and 4096 for ARM64 since 8196 * 9 / 32 = 2305 is still smaller
than 4KB page size.

So the existing BUILD_BUG_ON() in drivers/base/node.c is pretty much
preventing future problems similar with Y2K when hardware gets more
and more CPUs.

On the other hand, it should be more sensible to move the guard to common
code which can protect both cpu and numa:
/sys/devices/system/cpu/cpu0/topology/die_cpus etc.
/sys/devices/system/node/node0/cpumap etc.

Topology bitmap mask strings shouldn't be larger than PAGE_SIZE as
lstopo and numactl depend on them. But other ABIs exposing cpu lists
are not really used by common applications, so this patch also marks
those lists could be trimmed as there is no any guarantee those lists
are always less than PAGE_SIZE especially a list could be like this:
0, 3, 5, 7, 9, 11... etc.

Signed-off-by: Tian Tao <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 5 ++++-
Documentation/admin-guide/cputopology.rst | 15 +++++++++++++++
drivers/base/node.c | 3 ---
include/linux/cpumask.h | 6 ++++++
4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 484fc04..82dfe64 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -47,7 +47,10 @@ What: /sys/devices/system/node/nodeX/cpulist
Date: October 2002
Contact: Linux Memory Management list <[email protected]>
Description:
- The CPUs associated to the node.
+ The CPUs associated to the node. The format is like 0-3,
+ 8-11, 14,17. maximum size is PAGE_SIZE, so the tail
+ of the string will be trimmed while its size is larger
+ than PAGE_SIZE.

What: /sys/devices/system/node/nodeX/meminfo
Date: October 2002
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafc..4538d78 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -44,6 +44,9 @@ core_cpus:
core_cpus_list:

human-readable list of CPUs within the same core.
+ The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+ so the tail of the string will be trimmed while its size is larger
+ than PAGE_SIZE.
(deprecated name: "thread_siblings_list");

package_cpus:
@@ -54,6 +57,9 @@ package_cpus:
package_cpus_list:

human-readable list of CPUs sharing the same physical_package_id.
+ The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+ so the tail of the string will be trimmed while its size is larger
+ than PAGE_SIZE.
(deprecated name: "core_siblings_list")

die_cpus:
@@ -63,6 +69,9 @@ die_cpus:
die_cpus_list:

human-readable list of CPUs within the same die.
+ The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+ so the tail of the string will be trimmed while its size is larger
+ than PAGE_SIZE.

book_siblings:

@@ -73,6 +82,9 @@ book_siblings_list:

human-readable list of cpuX's hardware threads within the same
book_id.
+ The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+ so the tail of the string will be trimmed while its size is larger
+ than PAGE_SIZE.

drawer_siblings:

@@ -83,6 +95,9 @@ drawer_siblings_list:

human-readable list of cpuX's hardware threads within the same
drawer_id.
+ The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+ so the tail of the string will be trimmed while its size is larger
+ than PAGE_SIZE.

Architecture-neutral, drivers/base/topology.c, exports these attributes.
However, the book and drawer related sysfs files will only be created if
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2c36f61d..e24530c3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -33,9 +33,6 @@ static ssize_t node_read_cpumap(struct device *dev, bool list, char *buf)
cpumask_var_t mask;
struct node *node_dev = to_node(dev);

- /* 2008/04/07: buf currently PAGE_SIZE, need 9 chars per 32 bits. */
- BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
-
if (!alloc_cpumask_var(&mask, GFP_KERNEL))
return 0;

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bfc4690..1882477 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -12,6 +12,7 @@
#include <linux/bitmap.h>
#include <linux/atomic.h>
#include <linux/bug.h>
+#include <asm/page.h>

/* Don't assign or return these: may not be this big! */
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
@@ -979,6 +980,11 @@ static inline bool cpu_dying(unsigned int cpu)
static inline ssize_t
cpumap_print_to_pagebuf(bool list, char *buf, const struct cpumask *mask)
{
+ /*
+ * 32bits requires 9bytes: "ff,ffffffff", thus, too many CPUs will
+ * cause the overflow of sysfs pagebuf
+ */
+ BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
return bitmap_print_to_pagebuf(list, buf, cpumask_bits(mask),
nr_cpu_ids);
}
--
2.7.4


2021-04-29 14:24:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On 4/29/21 12:03 AM, Tian Tao wrote:
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 484fc04..82dfe64 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -47,7 +47,10 @@ What: /sys/devices/system/node/nodeX/cpulist
> Date: October 2002
> Contact: Linux Memory Management list <[email protected]>
> Description:
> - The CPUs associated to the node.
> + The CPUs associated to the node. The format is like 0-3,
> + 8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> + of the string will be trimmed while its size is larger
> + than PAGE_SIZE.

I think it's pretty arguable that truncating output on a real system is
an ABI break. Doing this would make the interface rather useless.

Don't we need a real solution rather than throwing up our hands?

Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
something?

2021-04-29 15:47:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On Thu, Apr 29, 2021 at 07:21:13AM -0700, Dave Hansen wrote:
> On 4/29/21 12:03 AM, Tian Tao wrote:
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 484fc04..82dfe64 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -47,7 +47,10 @@ What: /sys/devices/system/node/nodeX/cpulist
> > Date: October 2002
> > Contact: Linux Memory Management list <[email protected]>
> > Description:
> > - The CPUs associated to the node.
> > + The CPUs associated to the node. The format is like 0-3,
> > + 8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> > + of the string will be trimmed while its size is larger
> > + than PAGE_SIZE.
>
> I think it's pretty arguable that truncating output on a real system is
> an ABI break. Doing this would make the interface rather useless.
>
> Don't we need a real solution rather than throwing up our hands?
>
> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> something?

There is a real way to get > PAGE_SIZE out of a sysfs file. The LED
developers had to do this when they ran into this same exact problem.
Make it a binary sysfs file and promise to NEVER create such a file
again in the future :)

thanks,

gre gk-h

Subject: RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Friday, April 30, 2021 2:21 AM
> To: tiantao (H) <[email protected]>; [email protected];
> [email protected]; Song Bao Hua (Barry Song)
> <[email protected]>
> Cc: [email protected]; [email protected]; Rafael J.
> Wysocki <[email protected]>; Peter Zijlstra <[email protected]>; Valentin
> Schneider <[email protected]>; Dave Hansen
> <[email protected]>; Daniel Bristot de Oliveira <[email protected]>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On 4/29/21 12:03 AM, Tian Tao wrote:
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node
> b/Documentation/ABI/stable/sysfs-devices-node
> > index 484fc04..82dfe64 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -47,7 +47,10 @@ What: /sys/devices/system/node/nodeX/cpulist
> > Date: October 2002
> > Contact: Linux Memory Management list <[email protected]>
> > Description:
> > - The CPUs associated to the node.
> > + The CPUs associated to the node. The format is like 0-3,
> > + 8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> > + of the string will be trimmed while its size is larger
> > + than PAGE_SIZE.
>
> I think it's pretty arguable that truncating output on a real system is
> an ABI break. Doing this would make the interface rather useless.
>
> Don't we need a real solution rather than throwing up our hands?
>
> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> something?

This kind of cpu list ABIs have been there for many years but have
never been documented well.

We have two ABIs:
xxx_cpus - in format like 3333333333
xxx_cpus_list - in format like 0,3,5,7,9,11,13....

xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
include any more useful information than xxx_cpus.

xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
in kconfig nowadays.

if people all agree the trimmed list is a break of ABI, I think we may
totally remove this list. For these days, this list probably has never
overflowed but literally this could happen.

thoughts?

Thanks
Barry

2021-04-29 21:40:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote:
>> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
>> something?
> This kind of cpu list ABIs have been there for many years but have
> never been documented well.
>
> We have two ABIs:
> xxx_cpus - in format like 3333333333
> xxx_cpus_list - in format like 0,3,5,7,9,11,13....
>
> xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
> include any more useful information than xxx_cpus.
>
> xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
> in kconfig nowadays.
>
> if people all agree the trimmed list is a break of ABI, I think we may
> totally remove this list. For these days, this list probably has never
> overflowed but literally this could happen.
>
> thoughts?

From what Greg said, it sounds like removing the BUILD_BUG_ON(), making
it a binary sysfs file, and making it support arbitrary lengths is the
way to go.

Subject: RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Friday, April 30, 2021 9:39 AM
> To: Song Bao Hua (Barry Song) <[email protected]>; tiantao (H)
> <[email protected]>; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; Rafael J.
> Wysocki <[email protected]>; Peter Zijlstra <[email protected]>; Valentin
> Schneider <[email protected]>; Dave Hansen
> <[email protected]>; Daniel Bristot de Oliveira <[email protected]>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote:
> >> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> >> something?
> > This kind of cpu list ABIs have been there for many years but have
> > never been documented well.
> >
> > We have two ABIs:
> > xxx_cpus - in format like 3333333333
> > xxx_cpus_list - in format like 0,3,5,7,9,11,13....
> >
> > xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
> > include any more useful information than xxx_cpus.
> >
> > xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
> > in kconfig nowadays.
> >
> > if people all agree the trimmed list is a break of ABI, I think we may
> > totally remove this list. For these days, this list probably has never
> > overflowed but literally this could happen.
> >
> > thoughts?
>
> From what Greg said, it sounds like removing the BUILD_BUG_ON(), making
> it a binary sysfs file, and making it support arbitrary lengths is the
> way to go.

I am actually more concerned on xxx_cpus_list than xxx_cpus.

xxx_cpus has never overflowed. Though there is a BUILD_BUG_ON(), but the
maximum NR_CPUS is 8096, it is still far from overflow.
8096 /32 * 9 = 2277
as 2277 < 4096

While NR_CPUS gets to ~14500, for a 4KB page, xxx_cpus will overflow.
But I don't know when hardware will reach there. If we reach there,
the existing code to describe topology and schedule tasks might also
need rework.

On the other hand, lscpu, hwloc, numactl etc are using the existing
hex bitmap ABI:

$ strace lscpu 2>&1 | grep topology
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings",
O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/core_siblings",
O_RDONLY|O_CLOEXEC) = 3
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/book_siblings", F_OK) = -1
ENOENT (No such file or directory)
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu1/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
...

$ strace numactl --hardware 2>&1 | grep cpu
openat(AT_FDCWD, "/sys/devices/system/cpu",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3

If we move to binary, it means we have to change those applications.

For this moment, I'd argue we keep cpu bitmap ABI as is and defer this
issue to when we really get so many cores.

Thanks
Barry

2021-04-29 22:39:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> $ strace numactl --hardware 2>&1 | grep cpu
> openat(AT_FDCWD, "/sys/devices/system/cpu",
> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
>
> If we move to binary, it means we have to change those applications.

I thought Greg was saying to using a sysfs binary attribute using
something like like sysfs_create_bin_file(). Those don't have the
PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
nice human-readable text via the "binary" file.

We don't need to change the file format, just the internal kernel API
that we produce the files with.

Subject: RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Friday, April 30, 2021 10:39 AM
> To: Song Bao Hua (Barry Song) <[email protected]>; tiantao (H)
> <[email protected]>; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; Rafael J.
> Wysocki <[email protected]>; Peter Zijlstra <[email protected]>; Valentin
> Schneider <[email protected]>; Dave Hansen
> <[email protected]>; Daniel Bristot de Oliveira <[email protected]>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware 2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> >
> > If we move to binary, it means we have to change those applications.
>
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file(). Those don't have the
> PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.
>
> We don't need to change the file format, just the internal kernel API
> that we produce the files with.

Dave, thanks for clarification. Sounds a way to go.

Barry

2021-04-30 06:08:13

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On Thu, Apr 29, 2021 at 03:38:39PM -0700, Dave Hansen wrote:
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware 2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> >
> > If we move to binary, it means we have to change those applications.
>
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file(). Those don't have the
> PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.

That is correct.

2021-05-07 03:46:52

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

Hi Tian,

I love your patch! Yet something to improve:

[auto build test ERROR on lwn/docs-next]
[also build test ERROR on driver-core/driver-core-testing linus/master v5.12 next-20210506]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Tian-Tao/clarify-and-cleanup-CPU-and-NUMA-topology-ABIs/20210429-150451
base: git://git.lwn.net/linux-2.6 docs-next
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/357a0babc7ad904e3099bf86034af88cf5ce2a70
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Tian-Tao/clarify-and-cleanup-CPU-and-NUMA-topology-ABIs/20210429-150451
git checkout 357a0babc7ad904e3099bf86034af88cf5ce2a70
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 ARCH=mips

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from arch/mips/include/asm/io.h:31,
from arch/mips/include/asm/page.h:200,
from include/linux/cpumask.h:15,
from include/linux/smp.h:13,
from arch/mips/include/asm/cpu-type.h:12,
from arch/mips/include/asm/timex.h:19,
from include/linux/timex.h:65,
from include/linux/time32.h:13,
from include/linux/time.h:60,
from include/linux/compat.h:10,
from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
269 | cpumask_t user_cpus_allowed;
| ^~~~~~~~~
arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
26 | void output_ptreg_defines(void)
| ^~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
78 | void output_task_defines(void)
| ^~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
93 | void output_thread_info_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
110 | void output_thread_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
138 | void output_thread_fpu_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
181 | void output_mm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
220 | void output_sc_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
255 | void output_signal_defined(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
322 | void output_pbe_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
334 | void output_pm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
348 | void output_kvm_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
392 | void output_cps_defines(void)
| ^~~~~~~~~~~~~~~~~~
--
In file included from arch/mips/include/asm/io.h:31,
from arch/mips/include/asm/page.h:200,
from include/linux/cpumask.h:15,
from include/linux/smp.h:13,
from arch/mips/include/asm/cpu-type.h:12,
from arch/mips/include/asm/timex.h:19,
from include/linux/timex.h:65,
from include/linux/time32.h:13,
from include/linux/time.h:60,
from include/linux/compat.h:10,
from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
269 | cpumask_t user_cpus_allowed;
| ^~~~~~~~~
arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
26 | void output_ptreg_defines(void)
| ^~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
78 | void output_task_defines(void)
| ^~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
93 | void output_thread_info_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
110 | void output_thread_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
138 | void output_thread_fpu_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
181 | void output_mm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
220 | void output_sc_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
255 | void output_signal_defined(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
322 | void output_pbe_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
334 | void output_pm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
348 | void output_kvm_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
392 | void output_cps_defines(void)
| ^~~~~~~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:116: arch/mips/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1233: prepare0] Error 2
make[1]: Target 'modules_prepare' not remade because of errors.
make: *** [Makefile:215: __sub-make] Error 2
make: Target 'modules_prepare' not remade because of errors.
--
In file included from arch/mips/include/asm/io.h:31,
from arch/mips/include/asm/page.h:200,
from include/linux/cpumask.h:15,
from include/linux/smp.h:13,
from arch/mips/include/asm/cpu-type.h:12,
from arch/mips/include/asm/timex.h:19,
from include/linux/timex.h:65,
from include/linux/time32.h:13,
from include/linux/time.h:60,
from include/linux/compat.h:10,
from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
269 | cpumask_t user_cpus_allowed;
| ^~~~~~~~~
arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
26 | void output_ptreg_defines(void)
| ^~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
78 | void output_task_defines(void)
| ^~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
93 | void output_thread_info_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
110 | void output_thread_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
138 | void output_thread_fpu_defines(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
181 | void output_mm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
220 | void output_sc_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
255 | void output_signal_defined(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
322 | void output_pbe_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
334 | void output_pm_defines(void)
| ^~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
348 | void output_kvm_defines(void)
| ^~~~~~~~~~~~~~~~~~
arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
392 | void output_cps_defines(void)
| ^~~~~~~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:116: arch/mips/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1233: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:215: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +/cpumask_t +269 arch/mips/include/asm/processor.h

e50c0a8fa60da9a include/asm-mips/processor.h Ralf Baechle 2005-05-31 242
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 243 /*
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 244 * If you change thread_struct remember to change the #defines below too!
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 245 */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 246 struct thread_struct {
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 247 /* Saved main processor registers. */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 248 unsigned long reg16;
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 249 unsigned long reg17, reg18, reg19, reg20, reg21, reg22, reg23;
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 250 unsigned long reg29, reg30, reg31;
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 251
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 252 /* Saved cp0 stuff. */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 253 unsigned long cp0_status;
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 254
2725f3778fddb70 arch/mips/include/asm/processor.h Paul Burton 2018-11-07 255 #ifdef CONFIG_MIPS_FP_SUPPORT
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 256 /* Saved fpu/fpu emulator stuff. */
37cddff8e330a87 arch/mips/include/asm/processor.h Paul Burton 2014-07-11 257 struct mips_fpu_struct fpu FPU_ALIGN;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 258 /* Assigned branch delay slot 'emulation' frame */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 259 atomic_t bd_emu_frame;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 260 /* PC of the branch from a branch delay slot 'emulation' */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 261 unsigned long bd_emu_branch_pc;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 262 /* PC to continue from following a branch delay slot 'emulation' */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton 2016-07-08 263 unsigned long bd_emu_cont_pc;
aebdc6ff3b2e793 arch/mips/include/asm/processor.h Yousong Zhou 2020-03-24 264 #endif
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 265 #ifdef CONFIG_MIPS_MT_FPAFF
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 266 /* Emulated instruction count */
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 267 unsigned long emulated_fp;
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 268 /* Saved per-thread scheduler affinity mask */
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 @269 cpumask_t user_cpus_allowed;
f088fc84f94c1a3 include/asm-mips/processor.h Ralf Baechle 2006-04-05 270 #endif /* CONFIG_MIPS_MT_FPAFF */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 271
e50c0a8fa60da9a include/asm-mips/processor.h Ralf Baechle 2005-05-31 272 /* Saved state of the DSP ASE, if available. */
e50c0a8fa60da9a include/asm-mips/processor.h Ralf Baechle 2005-05-31 273 struct mips_dsp_state dsp;
e50c0a8fa60da9a include/asm-mips/processor.h Ralf Baechle 2005-05-31 274
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney 2008-09-23 275 /* Saved watch register state, if available. */
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney 2008-09-23 276 union mips_watch_reg_state watch;
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney 2008-09-23 277
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 278 /* Other stuff associated with the thread. */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 279 unsigned long cp0_badvaddr; /* Last user fault */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 280 unsigned long cp0_baduaddr; /* Last kernel fault accessing USEG */
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 281 unsigned long error_code;
e3b28831c18c6c9 arch/mips/include/asm/processor.h Ralf Baechle 2015-07-28 282 unsigned long trap_nr;
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney 2008-12-11 283 #ifdef CONFIG_CPU_CAVIUM_OCTEON
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney 2008-12-11 284 struct octeon_cop2_state cp2 __attribute__ ((__aligned__(128)));
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney 2008-12-11 285 struct octeon_cvmseg_state cvmseg __attribute__ ((__aligned__(128)));
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10 286 #endif
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10 287 #ifdef CONFIG_CPU_XLP
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10 288 struct nlm_cop2_state cp2;
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney 2008-12-11 289 #endif
e50c0a8fa60da9a include/asm-mips/processor.h Ralf Baechle 2005-05-31 290 struct mips_abi *abi;
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 291 };
^1da177e4c3f415 include/asm-mips/processor.h Linus Torvalds 2005-04-16 292

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (17.38 kB)
.config.gz (68.25 kB)
Download all attachments

2021-07-23 11:29:52

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf

On Fri, Jul 23, 2021 at 11:20:19AM +0000, Song Bao Hua (Barry Song) wrote:
>
>
> > -----Original Message-----
> > From: Dave Hansen [mailto:[email protected]]
> > Sent: Friday, April 30, 2021 10:39 AM
> > To: Song Bao Hua (Barry Song) <[email protected]>; tiantao (H)
> > <[email protected]>; [email protected]; [email protected]
> > Cc: [email protected]; [email protected]; Rafael J.
> > Wysocki <[email protected]>; Peter Zijlstra <[email protected]>; Valentin
> > Schneider <[email protected]>; Dave Hansen
> > <[email protected]>; Daniel Bristot de Oliveira <[email protected]>
> > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> > of sysfs pagebuf
> >
> > On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > > $ strace numactl --hardware 2>&1 | grep cpu
> > > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> > >
> > > If we move to binary, it means we have to change those applications.
> >
> > I thought Greg was saying to using a sysfs binary attribute using
> > something like like sysfs_create_bin_file(). Those don't have the
> > PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
> > nice human-readable text via the "binary" file.
> >
> > We don't need to change the file format, just the internal kernel API
> > that we produce the files with.
>
> Sorry for waking-up the old thread.
>
> I am not sure how common a regular device_attribute will be larger than
> 4KB and have to work around by bin_attribute. But I wrote a prototype
> which can initially support large regular sysfs entry and be able to
> fill the entire buffer by only one show() entry. The other words to say,
> we don't need to call read() of bin_attribute multiple times for a large
> regular file. Right now, only read-only attribute is supported.
>
> Subject: [RFC PATCH] sysfs: support regular device attr which can be larger than
> PAGE_SIZE
>
> A regular sysfs ABI could be more than 4KB, right now, we are using
> bin_attribute to workaround and break this limit. A better solution
> would be supporting long device attribute. In this case, we will
> still be able to enjoy the advantages of buffering and seeking of
> seq file and only need to fill the entire buffer of sysfs entry
> once.

No, please no. I WANT people to run into this problem and realize that
it went totally wrong because they should not be having more than one
"value" in a sysfs file like this.

Let's not make it easy on people please, moving to a bin attribute is a
big deal, let's keep it that way.

thanks,

greg k-h

Subject: RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Friday, April 30, 2021 10:39 AM
> To: Song Bao Hua (Barry Song) <[email protected]>; tiantao (H)
> <[email protected]>; [email protected]; [email protected]
> Cc: [email protected]; [email protected]; Rafael J.
> Wysocki <[email protected]>; Peter Zijlstra <[email protected]>; Valentin
> Schneider <[email protected]>; Dave Hansen
> <[email protected]>; Daniel Bristot de Oliveira <[email protected]>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware 2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> >
> > If we move to binary, it means we have to change those applications.
>
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file(). Those don't have the
> PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.
>
> We don't need to change the file format, just the internal kernel API
> that we produce the files with.

Sorry for waking-up the old thread.

I am not sure how common a regular device_attribute will be larger than
4KB and have to work around by bin_attribute. But I wrote a prototype
which can initially support large regular sysfs entry and be able to
fill the entire buffer by only one show() entry. The other words to say,
we don't need to call read() of bin_attribute multiple times for a large
regular file. Right now, only read-only attribute is supported.

Subject: [RFC PATCH] sysfs: support regular device attr which can be larger than
PAGE_SIZE

A regular sysfs ABI could be more than 4KB, right now, we are using
bin_attribute to workaround and break this limit. A better solution
would be supporting long device attribute. In this case, we will
still be able to enjoy the advantages of buffering and seeking of
seq file and only need to fill the entire buffer of sysfs entry
once.

Signed-off-by: Barry Song <[email protected]>
---
drivers/base/core.c | 2 +-
fs/seq_file.c | 22 ++++++++++++++++++++++
fs/sysfs/file.c | 40 +++++++++++++++++++++++++++++++++++++++-
fs/sysfs/group.c | 4 ++--
include/linux/device.h | 20 ++++++++++++++++++++
include/linux/seq_file.h | 1 +
include/linux/sysfs.h | 1 +
7 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index cadcade65825..0cd4ed165154 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2053,7 +2053,7 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,

if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
- if (ret >= (ssize_t)PAGE_SIZE) {
+ if (ret >= (ssize_t)PAGE_SIZE && !(attr->mode & SYSFS_LONGATTR)) {
printk("dev_attr_show: %pS returned bad count\n",
dev_attr->show);
}
diff --git a/fs/seq_file.c b/fs/seq_file.c
index b117b212ef28..9054615f8f19 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -84,6 +84,28 @@ int seq_open(struct file *file, const struct seq_operations *op)
}
EXPORT_SYMBOL(seq_open);

+/**
+ * seq_open_prealloc_buf - allow seq_file users to preallocate buf
+ * @file: file we initialize
+ * @size: the maximum size of the file
+ *
+ * apply to those scenerios single_open_size() doesn't apply. for
+ * example, while a regular sysfs entry is more than PAGE_SIZE;
+ * this will permit users to fill the entire buffer by calling
+ * device_attr show() only once
+ */
+int seq_open_prealloc_buf(struct file *file, unsigned long size)
+{
+ void *buf = seq_buf_alloc(size);
+ if (buf)
+ return -ENOMEM;
+
+ ((struct seq_file *)file->private_data)->buf = buf;
+ ((struct seq_file *)file->private_data)->size = size;
+
+ return 0;
+}
+
static int traverse(struct seq_file *m, loff_t offset)
{
loff_t pos = 0;
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..09ae12c2326c 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -16,6 +16,7 @@
#include <linux/mutex.h>
#include <linux/seq_file.h>
#include <linux/mm.h>
+#include <linux/device.h> /* for device's longattr support */

#include "sysfs.h"

@@ -202,6 +203,32 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr)
}
EXPORT_SYMBOL_GPL(sysfs_notify);

+static int sysfs_kf_longattr_seq_show(struct seq_file *sf, void *v)
+{
+ struct kernfs_open_file *of = sf->private;
+ struct kobject *kobj = of->kn->parent->priv;
+ const struct sysfs_ops *ops = sysfs_file_ops(of->kn);
+ ssize_t count;
+ char *buf;
+
+ count = seq_get_buf(sf, &buf);
+
+ if (ops->show) {
+ count = ops->show(kobj, of->kn->priv, buf);
+ if (count < 0)
+ return count;
+ }
+
+ seq_commit(sf, count);
+ return 0;
+}
+
+static int sysfs_longattr_open(struct kernfs_open_file *of)
+{
+ struct device_long_attribute *lattr = (struct device_long_attribute *)of->kn->priv;
+ return seq_open_prealloc_buf(of->file, lattr->size);
+}
+
static const struct kernfs_ops sysfs_file_kfops_empty = {
};

@@ -223,6 +250,11 @@ static const struct kernfs_ops sysfs_prealloc_kfops_ro = {
.prealloc = true,
};

+static struct kernfs_ops sysfs_longattr_kfops_ro = {
+ .open = sysfs_longattr_open,
+ .seq_show = sysfs_kf_longattr_seq_show,
+};
+
static const struct kernfs_ops sysfs_prealloc_kfops_wo = {
.write = sysfs_kf_write,
.prealloc = true,
@@ -276,6 +308,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
if (sysfs_ops->show && sysfs_ops->store) {
if (mode & SYSFS_PREALLOC)
ops = &sysfs_prealloc_kfops_rw;
+ else if (mode & SYSFS_LONGATTR)
+ ops = &sysfs_longattr_kfops_ro;
else
ops = &sysfs_file_kfops_rw;
} else if (sysfs_ops->show) {
@@ -291,7 +325,11 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
} else
ops = &sysfs_file_kfops_empty;

- size = PAGE_SIZE;
+ if (mode & SYSFS_LONGATTR) {
+ size = ((struct device_long_attribute *)attr)->size;
+ } else {
+ size = PAGE_SIZE;
+ }
} else {
struct bin_attribute *battr = (void *)attr;

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 64e6a6698935..2d80b05550d1 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -56,11 +56,11 @@ static int create_files(struct kernfs_node *parent, struct kobject *kobj,
continue;
}

- WARN(mode & ~(SYSFS_PREALLOC | 0664),
+ WARN(mode & ~(SYSFS_LONGATTR | SYSFS_PREALLOC | 0664),
"Attribute %s: Invalid permissions 0%o\n",
(*attr)->name, mode);

- mode &= SYSFS_PREALLOC | 0664;
+ mode &= SYSFS_LONGATTR | SYSFS_PREALLOC | 0664;
error = sysfs_add_file_mode_ns(parent, *attr, false,
mode, uid, gid, NULL);
if (unlikely(error))
diff --git a/include/linux/device.h b/include/linux/device.h
index 59940f1744c1..791a3d25f0bb 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -150,6 +150,26 @@ ssize_t device_store_bool(struct device *dev, struct device_attribute *attr,
struct device_attribute dev_attr_##_name = \
__ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store)

+/*
+ * for those devices whose attributes are larger than 4KB but still want
+ * to take the advantages of regular sysfs, like show() method is able to
+ * fill the entire buffer by one read operation
+ */
+struct device_long_attribute {
+ struct device_attribute attr;
+ size_t size;
+};
+
+#define __LONG_ATTR_RO(_name, _size) { \
+ .attr.attr = {.name = __stringify(_name), \
+ .mode = SYSFS_LONGATTR | 0444 }, \
+ .attr.show = _name##_show, \
+ .size = _size, \
+}
+
+#define LONG_ATTR_RO(_name, _size) \
+struct device_long_attribute dev_attr_##_name = __LONG_ATTR_RO(_name, _size)
+
int device_create_file(struct device *device,
const struct device_attribute *entry);
void device_remove_file(struct device *dev,
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index dd99569595fd..e7357fc91c1c 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -106,6 +106,7 @@ void seq_pad(struct seq_file *m, char c);

char *mangle_path(char *s, const char *p, const char *esc);
int seq_open(struct file *, const struct seq_operations *);
+int seq_open_prealloc_buf(struct file *, unsigned long);
ssize_t seq_read(struct file *, char __user *, size_t, loff_t *);
ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter);
loff_t seq_lseek(struct file *, loff_t, int);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index a12556a4b93a..9198afd46fb0 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -97,6 +97,7 @@ struct attribute_group {
*/

#define SYSFS_PREALLOC 010000
+#define SYSFS_LONGATTR 020000

#define __ATTR(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
--
2.25.1

very simple way to use it:
Add some long attribute by:

LONG_ATTR_RO(xxx, 2 * PAGE_SIZE);
LONG_ATTR_RO(yyy, 2 * PAGE_SIZE);
....
Then add xxx and yyy to attribute list as we usually do
for normal device_attribute:
struct attribute *attrs[] = {
&dev_attr_xxx.attr.attr,
&dev_attr_yyy.attr.attr,
}

Not quite sure if it is valuable. If it is yes, I will
split the code and send a RFC patchset.

Thanks
Barry

Subject: RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Friday, July 23, 2021 11:29 PM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Dave Hansen <[email protected]>; tiantao (H) <[email protected]>;
> [email protected]; [email protected]; [email protected];
> Rafael J. Wysocki <[email protected]>; Peter Zijlstra <[email protected]>;
> Valentin Schneider <[email protected]>; Dave Hansen
> <[email protected]>; Daniel Bristot de Oliveira
> <[email protected]>; Linuxarm <[email protected]>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
>
> On Fri, Jul 23, 2021 at 11:20:19AM +0000, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -----Original Message-----
> > > From: Dave Hansen [mailto:[email protected]]
> > > Sent: Friday, April 30, 2021 10:39 AM
> > > To: Song Bao Hua (Barry Song) <[email protected]>; tiantao (H)
> > > <[email protected]>; [email protected]; [email protected]
> > > Cc: [email protected]; [email protected]; Rafael J.
> > > Wysocki <[email protected]>; Peter Zijlstra <[email protected]>;
> Valentin
> > > Schneider <[email protected]>; Dave Hansen
> > > <[email protected]>; Daniel Bristot de Oliveira
> <[email protected]>
> > > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> > > of sysfs pagebuf
> > >
> > > On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > > > $ strace numactl --hardware 2>&1 | grep cpu
> > > > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > > > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) =
> 3
> > > >
> > > > If we move to binary, it means we have to change those applications.
> > >
> > > I thought Greg was saying to using a sysfs binary attribute using
> > > something like like sysfs_create_bin_file(). Those don't have the
> > > PAGE_SIZE limitation. But, there's also nothing to keep us from spewing
> > > nice human-readable text via the "binary" file.
> > >
> > > We don't need to change the file format, just the internal kernel API
> > > that we produce the files with.
> >
> > Sorry for waking-up the old thread.
> >
> > I am not sure how common a regular device_attribute will be larger than
> > 4KB and have to work around by bin_attribute. But I wrote a prototype
> > which can initially support large regular sysfs entry and be able to
> > fill the entire buffer by only one show() entry. The other words to say,
> > we don't need to call read() of bin_attribute multiple times for a large
> > regular file. Right now, only read-only attribute is supported.
> >
> > Subject: [RFC PATCH] sysfs: support regular device attr which can be larger
> than
> > PAGE_SIZE
> >
> > A regular sysfs ABI could be more than 4KB, right now, we are using
> > bin_attribute to workaround and break this limit. A better solution
> > would be supporting long device attribute. In this case, we will
> > still be able to enjoy the advantages of buffering and seeking of
> > seq file and only need to fill the entire buffer of sysfs entry
> > once.
>
> No, please no. I WANT people to run into this problem and realize that
> it went totally wrong because they should not be having more than one
> "value" in a sysfs file like this.
>
> Let's not make it easy on people please, moving to a bin attribute is a
> big deal, let's keep it that way.

Ok. Got it. Thanks for clarification. That was the experiment I made
recently when I almost got totally stuck.

>
> thanks,
>
> greg k-h

Thanks
Barry