2006-09-27 07:55:58

by Martin Devera

Subject: stat of /proc fails after CPU hot-unplug with EOVERFLOW in 2.6.18

Hello,

I have a 2-way Opteron machine. I did this:
echo 0 > /sys/devices/system/cpu/cpu1/online

and then strace stat /proc:

[snip]
personality(PER_LINUX) = 4194304
getpid() = 14926
brk(0) = 0x804b000
brk(0x804b1a0) = 0x804b1a0
brk(0x804c000) = 0x804c000
stat("/proc", 0xbf8e7490) = -1 EOVERFLOW

When I do echo 1 > ... to bring the CPU back online, stat starts
working again. Weird.

Martin
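
Editorial sketch, not part of the original message: the failing call can also be exercised without strace from a small C helper. The function name try_stat is hypothetical. The low brk() addresses in the trace suggest a 32-bit binary, so this would probably need to be built as 32-bit code without large-file support to go through the same stat path; that is an assumption, not something the report states.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not from the thread): stat() a path, print what
 * happened, and return 0 on success or the errno value on failure.
 */
int try_stat(const char *path)
{
	struct stat st;
	int err;

	if (stat(path, &st) == 0) {
		printf("%s: ok, st_nlink=%lu\n", path,
		       (unsigned long)st.st_nlink);
		return 0;
	}

	err = errno;	/* save errno before any later call can change it */
	if (err == EOVERFLOW)
		printf("%s: EOVERFLOW (a value did not fit in struct stat)\n",
		       path);
	else
		printf("%s: %s\n", path, strerror(err));
	return err;
}
```

Calling try_stat("/proc") before and after taking cpu1 offline would show whether the return value changes from 0 to EOVERFLOW on an affected kernel.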


2006-09-27 08:13:56

by Andrew Morton

Subject: Re: stat of /proc fails after CPU hot-unplug with EOVERFLOW in 2.6.18

On Wed, 27 Sep 2006 09:55:47 +0200
Martin Devera <[email protected]> wrote:

> Hello,
>
> I have 2way Opteron machine. I've done this:
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
> and then strace stat /proc:
>
> [snip]
> personality(PER_LINUX) = 4194304
> getpid() = 14926
> brk(0) = 0x804b000
> brk(0x804b1a0) = 0x804b1a0
> brk(0x804c000) = 0x804c000
> stat("/proc", 0xbf8e7490) = -1 EOVERFLOW
>
> When I do echo 1 > ... to start cpu again then the stat starts
> to work again ... Weird.
>

boggle.

Can you add this patch, see where it's going bad?


fs/stat.c | 30 +++++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)

diff -puN fs/stat.c~a fs/stat.c
--- a/fs/stat.c~a
+++ a/fs/stat.c
@@ -18,6 +18,8 @@
 #include <asm/uaccess.h>
 #include <asm/unistd.h>

+#define D() printk("%s:%d\n", __FILE__, __LINE__)
+
 void generic_fillattr(struct inode *inode, struct kstat *stat)
 {
 	stat->dev = inode->i_sb->s_dev;
@@ -141,14 +143,18 @@ static int cp_old_stat(struct kstat *sta
 	tmp.st_ino = stat->ino;
 	tmp.st_mode = stat->mode;
 	tmp.st_nlink = stat->nlink;
-	if (tmp.st_nlink != stat->nlink)
+	if (tmp.st_nlink != stat->nlink) {
+		D();
 		return -EOVERFLOW;
+	}
 	SET_UID(tmp.st_uid, stat->uid);
 	SET_GID(tmp.st_gid, stat->gid);
 	tmp.st_rdev = old_encode_dev(stat->rdev);
 #if BITS_PER_LONG == 32
-	if (stat->size > MAX_NON_LFS)
+	if (stat->size > MAX_NON_LFS) {
+		D();
 		return -EOVERFLOW;
+	}
 #endif
 	tmp.st_size = stat->size;
 	tmp.st_atime = stat->atime.tv_sec;
@@ -195,11 +201,15 @@ static int cp_new_stat(struct kstat *sta
 	struct stat tmp;

 #if BITS_PER_LONG == 32
-	if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev))
+	if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev)) {
+		D();
 		return -EOVERFLOW;
+	}
 #else
-	if (!new_valid_dev(stat->dev) || !new_valid_dev(stat->rdev))
+	if (!new_valid_dev(stat->dev) || !new_valid_dev(stat->rdev)) {
+		D();
 		return -EOVERFLOW;
+	}
 #endif

 	memset(&tmp, 0, sizeof(tmp));
@@ -211,8 +221,10 @@ static int cp_new_stat(struct kstat *sta
 	tmp.st_ino = stat->ino;
 	tmp.st_mode = stat->mode;
 	tmp.st_nlink = stat->nlink;
-	if (tmp.st_nlink != stat->nlink)
+	if (tmp.st_nlink != stat->nlink) {
+		D();
 		return -EOVERFLOW;
+	}
 	SET_UID(tmp.st_uid, stat->uid);
 	SET_GID(tmp.st_gid, stat->gid);
 #if BITS_PER_LONG == 32
@@ -221,8 +233,10 @@ static int cp_new_stat(struct kstat *sta
 	tmp.st_rdev = new_encode_dev(stat->rdev);
 #endif
 #if BITS_PER_LONG == 32
-	if (stat->size > MAX_NON_LFS)
+	if (stat->size > MAX_NON_LFS) {
+		D();
 		return -EOVERFLOW;
+	}
 #endif
 	tmp.st_size = stat->size;
 	tmp.st_atime = stat->atime.tv_sec;
@@ -337,8 +351,10 @@ static long cp_new_stat64(struct kstat *
 	memset(&tmp, 0, sizeof(struct stat64));
 #ifdef CONFIG_MIPS
 	/* mips has weird padding, so we don't get 64 bits there */
-	if (!new_valid_dev(stat->dev) || !new_valid_dev(stat->rdev))
+	if (!new_valid_dev(stat->dev) || !new_valid_dev(stat->rdev)) {
+		D();
 		return -EOVERFLOW;
+	}
 	tmp.st_dev = new_encode_dev(stat->dev);
 	tmp.st_rdev = new_encode_dev(stat->rdev);
 #else
_
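
Editorial sketch, not part of the original message: the checks instrumented by the patch all follow one pattern. A wide kernel-side value is copied into a narrower user-visible field, and the copy fails with -EOVERFLOW if the value does not survive. A rough userspace model of the st_nlink and size checks might look like the following. The struct names, the field widths, SKETCH_MAX_NON_LFS and copy_stat are all illustrative assumptions, not the kernel's definitions, and the size check is applied unconditionally here rather than only when BITS_PER_LONG == 32.

```c
#include <errno.h>
#include <stdio.h>

/* Illustrative stand-ins, NOT the kernel's struct kstat / struct stat. */
struct wide_stat   { unsigned int nlink; long long size; };
struct narrow_stat { unsigned short nlink; long size; };

/* Assumed limit of 2^31 - 1, modelled on the MAX_NON_LFS check. */
#define SKETCH_MAX_NON_LFS 0x7fffffffLL

/* Userspace analogue of the patch's D() macro: report where we bailed out. */
#define D() fprintf(stderr, "%s:%d\n", __FILE__, __LINE__)

/* Copy in to out; return 0 on success or -EOVERFLOW if a value is lost. */
int copy_stat(const struct wide_stat *in, struct narrow_stat *out)
{
	/* Narrow first, then compare: a mismatch means truncation occurred. */
	out->nlink = (unsigned short)in->nlink;
	if (out->nlink != in->nlink) {
		D();
		return -EOVERFLOW;
	}
	if (in->size > SKETCH_MAX_NON_LFS) {
		D();
		return -EOVERFLOW;
	}
	out->size = (long)in->size;
	return 0;
}
```

In this model a link count of 70000 cannot round-trip through a 16-bit field, so the first check fires and D() prints the file and line of the failing branch. That is the kind of output the patch is meant to produce in the kernel log.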

2006-09-27 14:44:18

by Martin Devera

Subject: Re: stat of /proc fails after CPU hot-unplug with EOVERFLOW in 2.6.18

Andrew Morton wrote:
> On Wed, 27 Sep 2006 09:55:47 +0200
> Martin Devera <[email protected]> wrote:
>
>> Hello,
>>
>> I have 2way Opteron machine. I've done this:
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>
>> and then strace stat /proc:
>>
>> [snip]
>> personality(PER_LINUX) = 4194304
>> getpid() = 14926
>> brk(0) = 0x804b000
>> brk(0x804b1a0) = 0x804b1a0
>> brk(0x804c000) = 0x804c000
>> stat("/proc", 0xbf8e7490) = -1 EOVERFLOW
>>
>> When I do echo 1 > ... to start cpu again then the stat starts
>> to work again ... Weird.
>>
>
> boggle.
>
> Can you add this patch, see where it's going bad?

Ehh .. I finally learned how to write a jprobe (I can't reboot the machine now),
tested it, installed it and ... guess what? The overflow bug is gone :-(
It simply works now.
I will reboot it next week and try again.

Thanks for the help, and sorry for wasting your time,
Martin

2006-10-03 08:41:22

by Martin Devera

Subject: Re: stat of /proc fails after CPU hot-unplug with EOVERFLOW in 2.6.18

Andrew Morton wrote:
> On Wed, 27 Sep 2006 09:55:47 +0200
> Martin Devera <[email protected]> wrote:
>
>> Hello,
>>
>> I have 2way Opteron machine. I've done this:
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>
>> and then strace stat /proc:
>>
>> [snip]
>> personality(PER_LINUX) = 4194304
>> getpid() = 14926
>> brk(0) = 0x804b000
>> brk(0x804b1a0) = 0x804b1a0
>> brk(0x804c000) = 0x804c000
>> stat("/proc", 0xbf8e7490) = -1 EOVERFLOW
>>
>> When I do echo 1 > ... to start cpu again then the stat starts
>> to work again ... Weird.

Hello,
I just want to make more information public. The problem seems to go deeper.
The 2.6.18 kernel has crashed the machine 4 times so far. Symptoms: networking
still worked and ssh was functional, but I could not run a single binary except
"cat"; the others gave me "permission denied" or "Bus error".
I was not experimenting with CPU hotplug this time. The machine had been up
with 2.6.17.1 for six months with no problems.
I also found strange errors on the console, such as tg3 watchdog timeouts and
SATA read errors (on all sectors). It looks like memory corruption to me. It is
worth noting that the lockup always occurred after high load.
We use an MSI Far2 dual Opteron motherboard.

All related info is at http://luxik.cdi.cz/~devik/files/2618-corrupt/ along
with the 2.6.17.1 config (for comparison).
The main problem is that I have no similar server on which to reproduce the
problem off-site. So please take this report as mainly informative; I hope to
replace the server in a few weeks so that I can investigate further. For now we
are back on 2.6.17.1.

Martin