Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
This breaks userspace code that retrieves the size before reading the file. Rather
than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
limitation of cpumap ABI") let's put in a size value at compile time. Use direct
comparison and a worst-case maximum to ensure compile time constants. For cpulist the
max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
In order to get near that you'd need a system with every other CPU on one node or
something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
On an 80 cpu 4-node sytem (NR_CPUS == 8192)
before:
-r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
-r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
after:
-r--r--r--. 1 root root 40960 Jul 12 16:48 /sys/devices/system/node/node0/cpulist
-r--r--r--. 1 root root 4096 Jul 12 15:50 /sys/devices/system/node/node0/cpumap
Fixes: 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Phil Auld <[email protected]>
---
drivers/base/node.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ac6376ef7a1..291c69671f23 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
return n;
}
-static BIN_ATTR_RO(cpumap, 0);
+static BIN_ATTR_RO(cpumap, PAGE_SIZE);
static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
@@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
return n;
}
-static BIN_ATTR_RO(cpulist, 0);
+static BIN_ATTR_RO(cpulist, (((NR_CPUS * 5) > PAGE_SIZE) ? NR_CPUS *5 : PAGE_SIZE));
/**
* struct node_access_nodes - Access class device to hold user visible
--
2.31.1
On Wed, Jul 13, 2022 at 9:58 AM Phil Auld <[email protected]> wrote:
>
> Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> This breaks userspace code that retrieves the size before reading the file. Rather
> than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> In order to get near that you'd need a system with every other CPU on one node or
> something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
>
> On an 80 cpu 4-node system (NR_CPUS == 8192)
>
> before:
>
> -r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
> -r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
it is a fundamental problem of bin_attr, isn't it? when we don't know the
exact size of an attribute, and this size might become more than one
PAGE_SIZE, we use bin_attr to break the limitation. but the fact is that
we really don't know or it is really hard to know the actual size of the
attribute.
>
> after:
>
> -r--r--r--. 1 root root 40960 Jul 12 16:48 /sys/devices/system/node/node0/cpulist
> -r--r--r--. 1 root root 4096 Jul 12 15:50 /sys/devices/system/node/node0/cpumap
if we finally set a size which might be improper, it seems we defeat the
purpose we start to move to bin_attr?
>
> Fixes: 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Signed-off-by: Phil Auld <[email protected]>
> ---
> drivers/base/node.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ac6376ef7a1..291c69671f23 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
> return n;
> }
>
> -static BIN_ATTR_RO(cpumap, 0);
> +static BIN_ATTR_RO(cpumap, PAGE_SIZE);
PAGE_SIZE is probably big enough, will we still calculate to get it rather than
hard coding?
>
> static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> struct bin_attribute *attr, char *buf,
> @@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> return n;
> }
>
> -static BIN_ATTR_RO(cpulist, 0);
> +static BIN_ATTR_RO(cpulist, (((NR_CPUS * 5) > PAGE_SIZE) ? NR_CPUS *5 : PAGE_SIZE));
I am still not sure why it is NR_CPUS * 5. Is 5 bytes big enough to
describe the number
of cpu id? technically it seems not, for example, for cpuid=100000,
we need at least 6
bytes.
BTW, my silly question is that what if we set the size to MAXIMUM int?
Will it fix
the userspace fsstat?
>
> /**
> * struct node_access_nodes - Access class device to hold user visible
> --
> 2.31.1
>
Thanks
Barry
On Tue, Jul 12, 2022 at 05:43:01PM -0400, Phil Auld wrote:
> Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> This breaks userspace code that retrieves the size before reading the file. Rather
> than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> In order to get near that you'd need a system with every other CPU on one node or
> something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
Does userspace care about that size, or can we just put any value in
there and it will be ok? How about just returning to the original
PAGE_SIZE value to keep things looking identical, will userspace not
read more than that size from the file then?
> On an 80 cpu 4-node sytem (NR_CPUS == 8192)
We have systems running Linux with many more cpus than that, and your
company knows this :)
thanks,
greg k-h
On Wed, Jul 13, 2022 at 11:18:59AM +1200 Barry Song wrote:
> On Wed, Jul 13, 2022 at 9:58 AM Phil Auld <[email protected]> wrote:
> >
> > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > This breaks userspace code that retrieves the size before reading the file. Rather
> > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > In order to get near that you'd need a system with every other CPU on one node or
> > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> >
> > On an 80 cpu 4-node system (NR_CPUS == 8192)
> >
> > before:
> >
> > -r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
> > -r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
>
> it is a fundamental problem of bin_attr, isn't it? when we don't know the
> exact size of an attribute, and this size might become more than one
> PAGE_SIZE, we use bin_attr to break the limitation. but the fact is that
> we really don't know or it is really hard to know the actual size of the
> attribute.
>
But it broke userspace applications. I figured rather than revert it maybe
we can find a max size to put in there and make it continue to work.
> >
> > after:
> >
> > -r--r--r--. 1 root root 40960 Jul 12 16:48 /sys/devices/system/node/node0/cpulist
> > -r--r--r--. 1 root root 4096 Jul 12 15:50 /sys/devices/system/node/node0/cpumap
>
> if we finally set a size which might be improper, it seems we defeat the
> purpose we start to move to bin_attr?
>
> >
> > Fixes: 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
> > Cc: Greg Kroah-Hartman <[email protected]>
> > Cc: "Rafael J. Wysocki" <[email protected]>
> > Signed-off-by: Phil Auld <[email protected]>
> > ---
> > drivers/base/node.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 0ac6376ef7a1..291c69671f23 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
> > return n;
> > }
> >
> > -static BIN_ATTR_RO(cpumap, 0);
> > +static BIN_ATTR_RO(cpumap, PAGE_SIZE);
>
> PAGE_SIZE is probably big enough, will we still calculate to get it rather than
> hard coding?
This one is actually wrong. I did not realize how big a NR_CPUS people were actually using.
It should be something like (NR_CPUS/4 + NR_CPUS/32).
>
> >
> > static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > struct bin_attribute *attr, char *buf,
> > @@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > return n;
> > }
> >
> > -static BIN_ATTR_RO(cpulist, 0);
> > +static BIN_ATTR_RO(cpulist, (((NR_CPUS * 5) > PAGE_SIZE) ? NR_CPUS *5 : PAGE_SIZE));
>
> I am still not sure why it is NR_CPUS * 5. Is 5 bytes big enough to
> describe the number
> of cpu id? technically it seems not, for example, for cpuid=100000,
> we need at least 6
> bytes.
Sure. As I said in the comment I wanted to do NR_CPUS * (ceil(log10(NR_CPUS)) + 1) but doing
that math in the kernel was messy. So I used 5. Even that is probably way bigger than needed.
Are there really 100000 cpus on one node with discontiguous cpuids? "0-99999" is only, what,
9 characters?
We can put whatever number you want that is >= the size the read will return.
Thanks,
Phil
>
> BTW, my silly question is that what if we set the size to MAXIMUM int?
> Will it fix
> the userspace fsstat?
>
> >
> > /**
> > * struct node_access_nodes - Access class device to hold user visible
> > --
> > 2.31.1
> >
>
> Thanks
> Barry
>
--
Hi Greg,
On Wed, Jul 13, 2022 at 08:06:02AM +0200 Greg Kroah-Hartman wrote:
> On Tue, Jul 12, 2022 at 05:43:01PM -0400, Phil Auld wrote:
> > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > This breaks userspace code that retrieves the size before reading the file. Rather
> > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > In order to get near that you'd need a system with every other CPU on one node or
> > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
>
> Does userspace care about that size, or can we just put any value in
> there and it will be ok? How about just returning to the original
> PAGE_SIZE value to keep things looking identical, will userspace not
> read more than that size from the file then?
>
I'll go look. But I think the point of pre-reading the size with fstat is to allocate
a buffer to read into. So that may be a problem.
That said, I believe in this case it's the cpulist file which given the use of ranges
is very unlikely to actually get that big.
> > On an 80 cpu 4-node sytem (NR_CPUS == 8192)
>
> We have systems running Linux with many more cpus than that, and your
> company knows this :)
The 80 cpus here don't matter and we only build with NR_CPUS = 8192 :)
But yes, I realize now that the cpumap part I posted is broken for larger
NR_CPUS. I originally had it as NR_CPUS, but as I said in my reply to Barry,
it wants to be ~= NR_CPUS/4 + NR_CPUS/32. I'll change that.
I think we should decide on a max for each and use that.
Cheers,
Phil
>
> thanks,
>
> greg k-h
>
--
On Thu, Jul 14, 2022 at 12:00:34AM +1200 Barry Song wrote:
> Got it.
>
> On Wed, Jul 13, 2022 at 11:37 PM Phil Auld <[email protected]> wrote:
> >
> > On Wed, Jul 13, 2022 at 11:18:59AM +1200 Barry Song wrote:
> > > On Wed, Jul 13, 2022 at 9:58 AM Phil Auld <[email protected]> wrote:
> > > >
> > > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > > In order to get near that you'd need a system with every other CPU on one node or
> > > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> > > >
> > > > On an 80 cpu 4-node system (NR_CPUS == 8192)
> > > >
> > > > before:
> > > >
> > > > -r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
> > > > -r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
> > >
> > > it is a fundamental problem of bin_attr, isn't it? when we don't know the
> > > exact size of an attribute, and this size might become more than one
> > > PAGE_SIZE, we use bin_attr to break the limitation. but the fact is that
> > > we really don't know or it is really hard to know the actual size of the
> > > attribute.
> > >
> >
> > But it broke userspace applications. I figured rather than revert it maybe
> > we can find a max size to put in there and make it continue to work.
> >
> > > >
> > > > after:
> > > >
> > > > -r--r--r--. 1 root root 40960 Jul 12 16:48 /sys/devices/system/node/node0/cpulist
> > > > -r--r--r--. 1 root root 4096 Jul 12 15:50 /sys/devices/system/node/node0/cpumap
> > >
> > > if we finally set a size which might be improper, it seems we defeat the
> > > purpose we start to move to bin_attr?
> > >
> > > >
> > > > Fixes: 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
> > > > Cc: Greg Kroah-Hartman <[email protected]>
> > > > Cc: "Rafael J. Wysocki" <[email protected]>
> > > > Signed-off-by: Phil Auld <[email protected]>
> > > > ---
> > > > drivers/base/node.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > > index 0ac6376ef7a1..291c69671f23 100644
> > > > --- a/drivers/base/node.c
> > > > +++ b/drivers/base/node.c
> > > > @@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
> > > > return n;
> > > > }
> > > >
> > > > -static BIN_ATTR_RO(cpumap, 0);
> > > > +static BIN_ATTR_RO(cpumap, PAGE_SIZE);
> > >
> > > PAGE_SIZE is probably big enough, will we still calculate to get it rather than
> > > hard coding?
> >
> > This one is actually wrong. I did not realize how big a NR_CPUS people were actually using.
> > It should be something like (NR_CPUS/4 + NR_CPUS/32).
> >
> > >
> > > >
> > > > static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > > > struct bin_attribute *attr, char *buf,
> > > > @@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > > > return n;
> > > > }
> > > >
> > > > -static BIN_ATTR_RO(cpulist, 0);
> > > > +static BIN_ATTR_RO(cpulist, (((NR_CPUS * 5) > PAGE_SIZE) ? NR_CPUS *5 : PAGE_SIZE));
> > >
> > > I am still not sure why it is NR_CPUS * 5. Is 5 bytes big enough to
> > > describe the number
> > > of cpu id? technically it seems not, for example, for cpuid=100000,
> > > we need at least 6
> > > bytes.
> >
> > Sure. As I said in the comment I wanted to do NR_CPUS * (ceil(log10(NR_CPUS)) + 1) but doing
> > that math in the kernel was messy. So I used 5. Even that is probably way bigger than needed.
> > Are there really 100000 cpus on one node with discontiguous cpuids? "0-99999" is only, what,
> > 9 characters?
> >
> > We can put whatever number you want that is >= the size the read will return.
>
> Thanks,
> does it mean we can use something like -1UL?
>
I suppose we could be that seems like it would be overkill, no? My understanding is the app
in question uses the reported size to allocate a buffer to read the file into. It needs to
be at least equal to the amount we'll actually read but 4GB seems like it might be a bit much.
Cheers,
Phil
> >
> > Thanks,
> > Phil
> >
> > >
> > > BTW, my silly question is that what if we set the size to MAXIMUM int?
> > > Will it fix
> > > the userspace fsstat?
> > >
> > > >
> > > > /**
> > > > * struct node_access_nodes - Access class device to hold user visible
> > > > --
> > > > 2.31.1
> > > >
> > >
> > > Thanks
> > > Barry
> > >
> >
> > --
> >
>
--
Got it.
On Wed, Jul 13, 2022 at 11:37 PM Phil Auld <[email protected]> wrote:
>
> On Wed, Jul 13, 2022 at 11:18:59AM +1200 Barry Song wrote:
> > On Wed, Jul 13, 2022 at 9:58 AM Phil Auld <[email protected]> wrote:
> > >
> > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > In order to get near that you'd need a system with every other CPU on one node or
> > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> > >
> > > On an 80 cpu 4-node system (NR_CPUS == 8192)
> > >
> > > before:
> > >
> > > -r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
> > > -r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
> >
> > it is a fundamental problem of bin_attr, isn't it? when we don't know the
> > exact size of an attribute, and this size might become more than one
> > PAGE_SIZE, we use bin_attr to break the limitation. but the fact is that
> > we really don't know or it is really hard to know the actual size of the
> > attribute.
> >
>
> But it broke userspace applications. I figured rather than revert it maybe
> we can find a max size to put in there and make it continue to work.
>
> > >
> > > after:
> > >
> > > -r--r--r--. 1 root root 40960 Jul 12 16:48 /sys/devices/system/node/node0/cpulist
> > > -r--r--r--. 1 root root 4096 Jul 12 15:50 /sys/devices/system/node/node0/cpumap
> >
> > if we finally set a size which might be improper, it seems we defeat the
> > purpose we start to move to bin_attr?
> >
> > >
> > > Fixes: 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI")
> > > Cc: Greg Kroah-Hartman <[email protected]>
> > > Cc: "Rafael J. Wysocki" <[email protected]>
> > > Signed-off-by: Phil Auld <[email protected]>
> > > ---
> > > drivers/base/node.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index 0ac6376ef7a1..291c69671f23 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
> > > return n;
> > > }
> > >
> > > -static BIN_ATTR_RO(cpumap, 0);
> > > +static BIN_ATTR_RO(cpumap, PAGE_SIZE);
> >
> > PAGE_SIZE is probably big enough, will we still calculate to get it rather than
> > hard coding?
>
> This one is actually wrong. I did not realize how big a NR_CPUS people were actually using.
> It should be something like (NR_CPUS/4 + NR_CPUS/32).
>
> >
> > >
> > > static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > > struct bin_attribute *attr, char *buf,
> > > @@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> > > return n;
> > > }
> > >
> > > -static BIN_ATTR_RO(cpulist, 0);
> > > +static BIN_ATTR_RO(cpulist, (((NR_CPUS * 5) > PAGE_SIZE) ? NR_CPUS *5 : PAGE_SIZE));
> >
> > I am still not sure why it is NR_CPUS * 5. Is 5 bytes big enough to
> > describe the number
> > of cpu id? technically it seems not, for example, for cpuid=100000,
> > we need at least 6
> > bytes.
>
> Sure. As I said in the comment I wanted to do NR_CPUS * (ceil(log10(NR_CPUS)) + 1) but doing
> that math in the kernel was messy. So I used 5. Even that is probably way bigger than needed.
> Are there really 100000 cpus on one node with discontiguous cpuids? "0-99999" is only, what,
> 9 characters?
>
> We can put whatever number you want that is >= the size the read will return.
Thanks,
does it mean we can use something like -1UL?
>
> Thanks,
> Phil
>
> >
> > BTW, my silly question is that what if we set the size to MAXIMUM int?
> > Will it fix
> > the userspace fsstat?
> >
> > >
> > > /**
> > > * struct node_access_nodes - Access class device to hold user visible
> > > --
> > > 2.31.1
> > >
> >
> > Thanks
> > Barry
> >
>
> --
>
On Wed, Jul 13, 2022 at 07:48:00AM -0400 Phil Auld wrote:
> Hi Greg,
>
> On Wed, Jul 13, 2022 at 08:06:02AM +0200 Greg Kroah-Hartman wrote:
> > On Tue, Jul 12, 2022 at 05:43:01PM -0400, Phil Auld wrote:
> > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > In order to get near that you'd need a system with every other CPU on one node or
> > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> >
> > Does userspace care about that size, or can we just put any value in
> > there and it will be ok? How about just returning to the original
> > PAGE_SIZE value to keep things looking identical, will userspace not
> > read more than that size from the file then?
> >
>
> I'll go look. But I think the point of pre-reading the size with fstat is to allocate
> a buffer to read into. So that may be a problem.
>
From what I understand the app does use the size from fstat to allocate a buffer.
It also does a lseek to the end and back. This is actaully where it breaks when it gets a 0.
This is before:
8747 10:20:32.584460 fstat(6</sys/devices/system/node/node0/cpulist>, {..., st_size=4096, ...}) = 0 <0.000006>
8747 10:20:32.584502 fstat(6</sys/devices/system/node/node0/cpulist>, {..., st_size=4096, ...}) = 0 <0.000005>
8747 10:20:32.584537 lseek(6</sys/devices/system/node/node0/cpulist>, 4096, SEEK_SET) = 4096 <0.000005>
8747 10:20:32.584560 lseek(6</sys/devices/system/node/node0/cpulist>, 0, SEEK_SET) = 0 <0.000005>
8747 10:20:32.584585 read(6</sys/devices/system/node/node0/cpulist>, "0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46\n", 4096) = 67 <0.000008>
8747 10:20:32.584617 read(6</sys/devices/system/node/node0/cpulist>, "", 4096) = 0 <0.000005>
(I'll note here their system does seem to have a worst case cpu to node list too)
And with the bin_attr change:
8871 09:27:41.034279 fstat(6</sys/devices/system/node/node0/cpulist>, {..., st_size=0, ...}) = 0 <0.000009>
8871 09:27:41.034330 fstat(6</sys/devices/system/node/node0/cpulist>, {..., st_size=0, ...}) = 0 <0.000010>
8871 09:27:41.034377 lseek(6</sys/devices/system/node/node0/cpulist>, 0, SEEK_SET) = 0 <0.000008>
8871 09:27:41.034410 lseek(6</sys/devices/system/node/node0/cpulist>, 0, SEEK_SET) = 0 <0.000008>
And here it spins in a loop.
> That said, I believe in this case it's the cpulist file which given the use of ranges
> is very unlikely to actually get that big.
>
> > > On an 80 cpu 4-node sytem (NR_CPUS == 8192)
> >
> > We have systems running Linux with many more cpus than that, and your
> > company knows this :)
>
> The 80 cpus here don't matter and we only build with NR_CPUS = 8192 :)
>
> But yes, I realize now that the cpumap part I posted is broken for larger
> NR_CPUS. I originally had it as NR_CPUS, but as I said in my reply to Barry,
> it wants to be ~= NR_CPUS/4 + NR_CPUS/32. I'll change that.
>
> I think we should decide on a max for each and use that.
>
What values of NR_CPUS are people actually using when they build kernels? Barry mentioned cpuid 100000 for
example. I'm not sure if that was real or just illustrating the need for more characters.
I've modified my patch to use NR_CPUS/2 for the cpumap which should be plenty. And to use NR_CPUS*6 for
the cpulist, which covers up to 99999 cpus safely.
Cheers,
Phil
> Cheers,
> Phil
>
> >
> > thanks,
> >
> > greg k-h
> >
>
> --
--
On Wed, Jul 13, 2022 at 07:47:58AM -0400, Phil Auld wrote:
> Hi Greg,
>
> On Wed, Jul 13, 2022 at 08:06:02AM +0200 Greg Kroah-Hartman wrote:
> > On Tue, Jul 12, 2022 at 05:43:01PM -0400, Phil Auld wrote:
> > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > In order to get near that you'd need a system with every other CPU on one node or
> > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> >
> > Does userspace care about that size, or can we just put any value in
> > there and it will be ok? How about just returning to the original
> > PAGE_SIZE value to keep things looking identical, will userspace not
> > read more than that size from the file then?
> >
>
> I'll go look. But I think the point of pre-reading the size with fstat is to allocate
> a buffer to read into. So that may be a problem.
>
> That said, I believe in this case it's the cpulist file which given the use of ranges
> is very unlikely to actually get that big.
That is why we had to change this to a binary file. Think about
every-other CPU being there, that's a huge list. This already was
broken on some systems which is why it had to be changed (i.e. we didn't
change it for no reason at all.)
> > > On an 80 cpu 4-node sytem (NR_CPUS == 8192)
> >
> > We have systems running Linux with many more cpus than that, and your
> > company knows this :)
>
> The 80 cpus here don't matter and we only build with NR_CPUS = 8192 :)
>
> But yes, I realize now that the cpumap part I posted is broken for larger
> NR_CPUS. I originally had it as NR_CPUS, but as I said in my reply to Barry,
> it wants to be ~= NR_CPUS/4 + NR_CPUS/32. I'll change that.
>
> I think we should decide on a max for each and use that.
Sure, pick a max size please, that's fine with me.
greg k-h
On Wed, Jul 13, 2022 at 03:05:52PM +0200 Greg Kroah-Hartman wrote:
> On Wed, Jul 13, 2022 at 07:47:58AM -0400, Phil Auld wrote:
> > Hi Greg,
> >
> > On Wed, Jul 13, 2022 at 08:06:02AM +0200 Greg Kroah-Hartman wrote:
> > > On Tue, Jul 12, 2022 at 05:43:01PM -0400, Phil Auld wrote:
> > > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > > In order to get near that you'd need a system with every other CPU on one node or
> > > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> > >
> > > Does userspace care about that size, or can we just put any value in
> > > there and it will be ok? How about just returning to the original
> > > PAGE_SIZE value to keep things looking identical, will userspace not
> > > read more than that size from the file then?
> > >
> >
> > I'll go look. But I think the point of pre-reading the size with fstat is to allocate
> > a buffer to read into. So that may be a problem.
> >
> > That said, I believe in this case it's the cpulist file which given the use of ranges
> > is very unlikely to actually get that big.
>
> That is why we had to change this to a binary file. Think about
> every-other CPU being there, that's a huge list. This already was
> broken on some systems which is why it had to be changed (i.e. we didn't
> change it for no reason at all.)
>
I didn't think you did and the change made sense. I did not expect this to
cause problems either when I backported it... :)
> > > > On an 80 cpu 4-node sytem (NR_CPUS == 8192)
> > >
> > > We have systems running Linux with many more cpus than that, and your
> > > company knows this :)
> >
> > The 80 cpus here don't matter and we only build with NR_CPUS = 8192 :)
> >
> > But yes, I realize now that the cpumap part I posted is broken for larger
> > NR_CPUS. I originally had it as NR_CPUS, but as I said in my reply to Barry,
> > it wants to be ~= NR_CPUS/4 + NR_CPUS/32. I'll change that.
> >
> > I think we should decide on a max for each and use that.
>
> Sure, pick a max size please, that's fine with me.
Right. I had another reply that crossed in the ether.
I can repost with the new version shortly.
It's using cpumap at NR_CPUS/2 and cpulist at NR_CPUS*6.
Cheers,
Phil
>
> greg k-h
>
--
On Wed, Jul 13, 2022 at 07:37:27AM -0400, Phil Auld wrote:
> On Wed, Jul 13, 2022 at 11:18:59AM +1200 Barry Song wrote:
> > On Wed, Jul 13, 2022 at 9:58 AM Phil Auld <[email protected]> wrote:
> > >
> > > Using bin_attributes with a 0 size causes fstat and friends to return that 0 size.
> > > This breaks userspace code that retrieves the size before reading the file. Rather
> > > than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size
> > > limitation of cpumap ABI") let's put in a size value at compile time. Use direct
> > > comparison and a worst-case maximum to ensure compile time constants. For cpulist the
> > > max is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1) which for 8192 is 40960.
> > > In order to get near that you'd need a system with every other CPU on one node or
> > > something similar. e.g. (0,2,4,... 1024,1026...). We set it to a min of PAGE_SIZE
> > > to retain the older behavior. For cpumap, PAGE_SIZE is plenty big.
> > >
> > > On an 80 cpu 4-node system (NR_CPUS == 8192)
> > >
> > > before:
> > >
> > > -r--r--r--. 1 root root 0 Jul 12 14:08 /sys/devices/system/node/node0/cpulist
> > > -r--r--r--. 1 root root 0 Jul 11 17:25 /sys/devices/system/node/node0/cpumap
> >
> > it is a fundamental problem of bin_attr, isn't it? when we don't know the
> > exact size of an attribute, and this size might become more than one
> > PAGE_SIZE, we use bin_attr to break the limitation. but the fact is that
> > we really don't know or it is really hard to know the actual size of the
> > attribute.
> >
>
> But it broke userspace applications. I figured rather than revert it maybe
> we can find a max size to put in there and make it continue to work.
Yes, we need to do this, we can't break userspace.
thanks,
greg k-h