2002-09-10 01:35:57

by Andrea Arcangeli

Subject: 2.4.20pre5aa2

2.4.20pre5aa1 had a deadlock in the sched_yield changes (missing _irq
while taking the spinlock). This new one should be rock solid ;).

URL:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20pre5aa2.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20pre5aa2/

Changelog:

Only in 2.4.20pre5aa2: 00_ext3-o_direct-1

O_DIRECT support for ext3, from Andrew and Stephen.

Only in 2.4.20pre5aa2: 00_find_or_create_page-1

Cleanup patch from Christoph to start the xfs merging.

Only in 2.4.20pre5aa1: 00_net-softirq-2
Only in 2.4.20pre5aa2: 00_net-softirq-3

This time I think I fixed the AF_UNIX latency in lmbench so it goes
as fast as with irqrate applied (if so, as I expect, it was totally
unrelated to the proper irq part of irqrate). Please benchmark
(totally untested).

Only in 2.4.20pre5aa1: 00_prepare-write-fixes-3-1
Only in 2.4.20pre5aa2: 98_prepare-write-fixes-3-1

Moved to the end so everything compiles even if you stop applying
patches in the middle. From Christoph.

Only in 2.4.20pre5aa2: 00_reiserfs-o_direct-1

Fixes for O_DIRECT with reiserfs from Chris.

Only in 2.4.20pre5aa1: 00_sched-O1-aa-2.4.19rc3-2.gz
Only in 2.4.20pre5aa2: 00_sched-O1-aa-2.4.19rc3-3.gz

Fix deadlock in sched_yield (rq->lock must be acquired after
disabling irqs). From Andi.
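
The failure mode in a nutshell (an illustrative sketch, not the actual
patch): the runqueue lock is also taken from the timer interrupt, so
taking it with irqs enabled can deadlock a CPU against itself:

	spin_lock(&rq->lock);		/* buggy: irqs still enabled; a
					 * timer tick here would spin on
					 * rq->lock held by its own CPU */

	spin_lock_irq(&rq->lock);	/* fixed: irqs off, then lock */
	/* requeue the yielding task, pick the next one, ... */
	spin_unlock_irq(&rq->lock);	/* unlock and re-enable irqs */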

Only in 2.4.20pre5aa2: 00_slabinfo-shared-address-space-1

Fix from Arnd Bergmann to avoid archs with a shared/overlapped address
space across kernel and userspace showing up broken (literally ;) in
/proc/slabinfo.

Only in 2.4.20pre5aa1: 10_rawio-vary-io-12
Only in 2.4.20pre5aa2: 10_rawio-vary-io-13

Cleaned-up version from Christoph.

Only in 2.4.20pre5aa2: 50_uml-patch-2.4.19-2.gz
Only in 2.4.20pre5aa2: 51_uml-aa-11
Only in 2.4.20pre5aa1: 51_uml-ac-to-aa-10
Only in 2.4.20pre5aa2: 53_uml-cache-shift-1
Only in 2.4.20pre5aa1: 56_uml-pte-highmem-3
Only in 2.4.20pre5aa2: 56_uml-pte-highmem-4
Only in 2.4.20pre5aa1: 60_tux-flush_icache_range-1

Make UML compile and work again (I didn't much like the hardcoded
/usr/lib/uml path just for this proggy:

andrea@dualathlon:~> ls /usr/lib/uml/
port-helper
andrea@dualathlon:~>

needed to make the debugger work). I'd prefer to install it locally
in my home dir.

Only in 2.4.20pre5aa2: 70_PF_FSTRANS-1
Only in 2.4.20pre5aa2: 70_alloc_inode-1
Only in 2.4.20pre5aa2: 70_delalloc-1
Only in 2.4.20pre5aa2: 70_dmapi-stuff-1
Only in 2.4.20pre5aa2: 70_iget-1
Only in 2.4.20pre5aa2: 70_intermezzo-junk-1
Only in 2.4.20pre5aa2: 70_quota-backport-1
Only in 2.4.20pre5aa2: 70_vmap-1
Only in 2.4.20pre5aa2: 70_xattr-1
Only in 2.4.20pre5aa1: 70_xfs-1.1-6.gz
Only in 2.4.20pre5aa2: 70_xfs-config-stuff-1
Only in 2.4.20pre5aa2: 70_xfs-cvs-020905-1
Only in 2.4.20pre5aa2: 70_xfs-exports-1
Only in 2.4.20pre5aa2: 70_xfs-sysctl-1
Only in 2.4.20pre5aa2: 71_posix_acl-1
Only in 2.4.20pre5aa2: 71_xfs-aa-1
Only in 2.4.20pre5aa1: 71_xfs-kiobuf-slab-1
Only in 2.4.20pre5aa2: 71_xfs-zalloc-fix-1
Only in 2.4.20pre5aa1: 72_xfs-O_DIRECT-1
Only in 2.4.20pre5aa1: 73_xfs-blksize-PAGE_SIZE-1
Only in 2.4.20pre5aa1: 74_super_quotaops-1
Only in 2.4.20pre5aa1: 75_compile-dmapi-1
Only in 2.4.20pre5aa1: 76_xfs-64bit-1

XFS SGI updates from Christoph.

Only in 2.4.20pre5aa1: 82_x86_64-suse-3
Only in 2.4.20pre5aa2: 82_x86_64-suse-4
Only in 2.4.20pre5aa1: 87_x86_64-o1sched-2

Make x86-64 compile (modulo aio; I didn't merge the wtd framework yet).

Only in 2.4.20pre5aa1: 90_ext3-commit-interval-2
Only in 2.4.20pre5aa2: 90_ext3-commit-interval-3
Only in 2.4.20pre5aa1: 96_inode_read_write-atomic-4
Only in 2.4.20pre5aa2: 96_inode_read_write-atomic-5
Only in 2.4.20pre5aa1: 9940_ocfs-1.gz
Only in 2.4.20pre5aa2: 9940_ocfs-2.gz

Rediffed.

Only in 2.4.20pre5aa1: 9900_aio-4.gz
Only in 2.4.20pre5aa2: 9900_aio-5.gz
Only in 2.4.20pre5aa1: 9910_shm-largepage-2.gz
Only in 2.4.20pre5aa2: 9910_shm-largepage-3.gz
Only in 2.4.20pre5aa1: 9920_kgdb-1.gz
Only in 2.4.20pre5aa2: 9920_kgdb-2.gz

Rediffed after fixing some compilation issues (wtd is still missing
for most archs though).

Only in 2.4.20pre5aa1: 9950_futex-1.gz
Only in 2.4.20pre5aa2: 9950_futex-2.gz

New fixed version.

Andrea


2002-09-10 18:46:53

by Joe Kellner

Subject: Re: 2.4.20pre5aa2

Quoting Andrea Arcangeli <[email protected]>:

> 2.4.20pre5aa1 had a deadlock in the sched_yield changes (missing _irq
> while taking the spinlock). This new one should be rock solid ;).
>
> URL:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20pre5aa2.gz
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20pre5aa2/
>


Andrea,
I've tried using this kernel on a dual Athlon MP system using a Tyan Thunder
K7 board and two Athlon MP 1900s. When it goes to load the kernel image the
system just reboots. I'm using the exact same .config as I used with
2.4.20pre5aa1, which worked fine. If you need any more information I'll be
glad to provide it.

-------------------------------------------------
sent via KingsMeade secure webmail http://www.kingsmeadefarm.com

2002-09-11 18:11:59

by Christian Guggenberger

Subject: Re: 2.4.20pre5aa2

Hi!

just tried out 2.4.20-pre5aa2 with xfs enabled as a module. But I can't
load the XFS module... modprobe xfs just won't work. Via top on another
console I see two modprobe processes, each consuming 99.9% CPU time.
Then, after a minute or so, the machine reboots...

System is a Dell Precision with 2 Intel [email protected] and 2GB RDRAM and
hyper-threading enabled; OS is Debian GNU/Linux 3.0 with:

gcc-2.95.4 20011002 (Debian prerelease)
ld-2.12.90.0.1 20020307 Debian/GNU Linux


I tried to disable HT, but then it was even worse. Then my machine crashed
hard after starting "modprobe xfs".


thanks in advance
Christian

P.S. if needed, I could post my .config, or other relevant things...

2002-09-11 18:19:33

by Austin Gonyou

Subject: Re: 2.4.20pre5aa2

Did you try just using insmod instead of modprobe? To use modprobe,
your module must have something defined in /etc/modules.conf.


On Wed, 2002-09-11 at 13:16, Christian Guggenberger wrote:
> Hi!
>
> just tried out 2.4.20-pre5aa2 with xfs enabled as a module. But I
> can't load the XFS module... modprobe xfs just won't work. Via top
> on another console I see two modprobe processes, each consuming
> 99.9% CPU time. Then, after a minute or so, the machine reboots...
>
> System is a Dell Precision with 2 Intel [email protected] and 2GB RDRAM and
> hyper-threading enabled; OS is Debian GNU/Linux 3.0 with:
>
> gcc-2.95.4 20011002 (Debian prerelease)
> ld-2.12.90.0.1 20020307 Debian/GNU Linux
>
>
> I tried to disable HT, but then it was even worse. Then my machine
> crashed hard after starting "modprobe xfs".
>
>
> thanks in advance
> Christian
>
> P.S. if needed, I could post my .config, or other relevant things...
>
--
Austin Gonyou <[email protected]>
Coremetrics, Inc.

2002-09-11 18:24:00

by Christian Guggenberger

Subject: Re: 2.4.20pre5aa2

On 11 Sep 2002 20:24:15, Austin Gonyou wrote:
> Did you try just using insmod, instead of modprobe. To use modprobe,
> your module must have something defined in /etc/modules.conf
>
>

yep, I tried this before, but it causes the same bad result...

Christian

2002-09-11 18:36:00

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

It was a collision between the new xfs and the new scheduler; you can
use this fix in the meantime:

--- 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c.~1~ Wed Sep 11 05:17:46 2002
+++ 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c Wed Sep 11 06:00:35 2002
@@ -2055,9 +2055,9 @@ pagebuf_iodone_daemon(
spin_unlock_irq(&current->sigmask_lock);

/* Migrate to the right CPU */
- current->cpus_allowed = 1UL << cpu;
- while (smp_processor_id() != cpu)
- schedule();
+ set_cpus_allowed(current, 1UL << cpu);
+ if (cpu() != cpu)
+ BUG();

sprintf(current->comm, "pagebuf_io_CPU%d", bind_cpu);
INIT_LIST_HEAD(&pagebuf_iodone_tq[cpu]);
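
To see why the old code hangs rather than failing cleanly (my
understanding; a sketch only, the patch above is what matters): with the
O(1) scheduler, writing cpus_allowed directly no longer migrates a
running task, so the busy-wait can spin forever on the same CPU, which
matches the modprobe processes spinning at 99.9% CPU reported above:

	current->cpus_allowed = 1UL << cpu;	/* no migration happens */
	while (smp_processor_id() != cpu)
		schedule();	/* O(1) sched keeps picking us from this
				 * CPU's runqueue: infinite loop */

set_cpus_allowed() instead migrates the caller synchronously before
returning, so the loop can be replaced with a plain sanity check.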

Also remember to apply the O_DIRECT fixes for reiserfs and ext3 (they
were left over after merging the new nfs stuff). All will be fixed in
the next -aa of course.

--- 2.4.19pre3aa1/fs/reiserfs/inode.c.~1~ Tue Mar 12 00:07:18 2002
+++ 2.4.19pre3aa1/fs/reiserfs/inode.c Tue Mar 12 01:24:21 2002
@@ -2161,10 +2161,11 @@
}
}

-static int reiserfs_direct_io(int rw, struct inode *inode,
+static int reiserfs_direct_io(int rw, struct file * filp,
struct kiobuf *iobuf, unsigned long blocknr,
int blocksize)
{
+ struct inode * inode = filp->f_dentry->d_inode->i_mapping->host;
return generic_direct_IO(rw, inode, iobuf, blocknr, blocksize,
reiserfs_get_block_direct_io) ;
}
--- 2.4.20pre5aa2/fs/ext3/inode.c.~1~ Mon Sep 9 02:38:08 2002
+++ 2.4.20pre5aa2/fs/ext3/inode.c Tue Sep 10 05:22:18 2002
@@ -1385,9 +1385,10 @@ static int ext3_releasepage(struct page
}

static int
-ext3_direct_IO(int rw, struct inode *inode, struct kiobuf *iobuf,
+ext3_direct_IO(int rw, struct file * filp, struct kiobuf *iobuf,
unsigned long blocknr, int blocksize)
{
+ struct inode * inode = filp->f_dentry->d_inode->i_mapping->host;
struct ext3_inode_info *ei = EXT3_I(inode);
handle_t *handle = NULL;
int ret;
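
(Both hunks adapt the filesystems to the new ->direct_IO prototype from
the nfs merge: the hook now takes the struct file instead of the inode,
and the filesystem derives the host inode from the mapping. For
reference, a sketch of the resulting address_space_operations hook, as I
read it from the patches above:)

	int (*direct_IO)(int rw, struct file *filp, struct kiobuf *iobuf,
			 unsigned long blocknr, int blocksize);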

Andrea

2002-09-11 18:40:59

by Christian Guggenberger

Subject: Re: 2.4.20pre5aa2

On 11 Sep 2002 20:35:18, Austin Gonyou wrote:
> Ahh..I see now. So, before the machine reboots, what do you get in
> dmesg?
>
> Anything, an oops maybe? maybe do dmesg > ~/somefile so you can keep
> them around after the reboot?
>
>

Have to correct my first statement. The machine crashes in both cases, HT
enabled or not...
But I didn't find any output in dmesg.

I'll now try Andrea's patches...

thank you :)
Christian

2002-09-11 18:46:52

by Austin Gonyou

Subject: Re: 2.4.20pre5aa2

NP. Glad he was listening! As always!! :-D

On Wed, 2002-09-11 at 13:45, Christian Guggenberger wrote:
> On 11 Sep 2002 20:35:18, Austin Gonyou wrote:
> > Ahh..I see now. So, before the machine reboots, what do you get in
> > dmesg?
> >
> > Anything, an oops maybe? maybe do dmesg > ~/somefile so you can keep
> > them around after the reboot?
> >
> >
>
> Have to correct my first statement. The machine crashes in both
> cases, HT enabled or not...
> But I didn't find any output in dmesg.
>
> I'll now try Andrea's patches...
>
> thank you :)
> Christian
--
Austin Gonyou <[email protected]>
Coremetrics, Inc.

2002-09-11 18:40:24

by Christoph Hellwig

Subject: Re: 2.4.20pre5aa2

On Wed, Sep 11, 2002 at 08:16:02PM +0200, Christian Guggenberger wrote:
> Hi!
>
> just tried out 2.4.20-pre5aa2 with xfs enabled as a module. But I can't
> load the XFS module... modprobe xfs just won't work. Via top on another
> console I see two modprobe processes, each consuming 99.9% CPU time.
> Then, after a minute or so, the machine reboots...
>
> System is a Dell Precision with 2 Intel [email protected] and 2GB RDRAM and
> hyper-threading enabled; OS is Debian GNU/Linux 3.0 with:
>
> gcc-2.95.4 20011002 (Debian prerelease)
> ld-2.12.90.0.1 20020307 Debian/GNU Linux
>
>
> I tried to disable HT, but then it was even worse. Then my machine crashed
> hard after starting "modprobe xfs".

Could you please try the following patch from Andrea?

--- 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c.~1~ Wed Sep 11 05:17:46 2002
+++ 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c Wed Sep 11 06:00:35 2002
@@ -2055,9 +2055,9 @@ pagebuf_iodone_daemon(
spin_unlock_irq(&current->sigmask_lock);

/* Migrate to the right CPU */
- current->cpus_allowed = 1UL << cpu;
- while (smp_processor_id() != cpu)
- schedule();
+ set_cpus_allowed(current, 1UL << cpu);
+ if (cpu() != cpu)
+ BUG();

sprintf(current->comm, "pagebuf_io_CPU%d", bind_cpu);
INIT_LIST_HEAD(&pagebuf_iodone_tq[cpu]);

2002-09-11 19:07:04

by Christian Guggenberger

Subject: Re: 2.4.20pre5aa2

On 11 Sep 2002 20:44:47, Christoph Hellwig wrote:
> Could you please try the following patch from Andrea?
>
> --- 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c.~1~ Wed Sep 11 05:17:46 2002
> +++ 2.4.20pre5aa3/fs/xfs/pagebuf/page_buf.c Wed Sep 11 06:00:35 2002
> @@ -2055,9 +2055,9 @@ pagebuf_iodone_daemon(
> spin_unlock_irq(&current->sigmask_lock);
>
> /* Migrate to the right CPU */
> - current->cpus_allowed = 1UL << cpu;
> - while (smp_processor_id() != cpu)
> - schedule();
> + set_cpus_allowed(current, 1UL << cpu);
> + if (cpu() != cpu)
> + BUG();
>
> sprintf(current->comm, "pagebuf_io_CPU%d", bind_cpu);
> INIT_LIST_HEAD(&pagebuf_iodone_tq[cpu]);
>
>

Andrea,

I applied your patch to page_buf.c (but not the ext3/reiserfs stuff,
because there's no need for it in my case) and now everything seems to
work fine!

thank you!
Christian

2002-09-12 23:22:52

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

ksymoops 2.4.5 on i686 2.4.20-pre5aa2-fixed-xfs. Options used
-V (specified)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre5aa2-fixed-xfs/ (default)
-m /boot/System.map-2.4.20-pre5aa2-fixed-xfs (default)

kernel BUG at page_buf.c:578!
invalid operand: 0000 2.4.20-pre5aa2-fixed-xfs #4 SMP Thu Sep 12 11:51:40 PDT 2002
CPU: 1
EIP: 0010:[<c0208592>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: 00000001 ecx: 00000001 edx: 00000000
esi: c5e15a04 edi: c5e15980 ebp: 00000002 esp: c5c19a58
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 805, stackpage=c5c19000)
Stack: c5c18000 00001000 00000000 00000001 00000000 00000005 000001f0 00000002
c5e15980 c5e15980 00002205 c5e20540 c02089c8 c5e15980 c5c117b4 00002205
13005f90 00000000 c5912d40 13005f90 00000000 c01f4d2d c5912d40 13005f90
Call Trace: [<c02089c8>] [<c01f4d2d>] [<c01f45e1>] [<c01f463a>] [<c01f58b2>]
[<c01f59f6>] [<c01f5b5a>] [<c01f6744>] [<c01f6c74>] [<c01f6cc4>] [<c01f6e45>]
[<c01efde7>] [<c01f861c>] [<c01f776b>] [<c01f77b0>] [<c01ff894>] [<c01ff97b>]
[<c0212ce6>] [<c0148c81>] [<c0148e9c>] [<c014dfb8>] [<c015c145>] [<c015c43b>]
[<c015c29c>] [<c015c904>] [<c0108d9b>]
Code: 0f 0b 42 02 95 ab 35 c0 0f b7 47 7c 81 4f 08 04 00 00 01 8d


>>EIP; c0208592 <_pagebuf_lookup_pages+2a2/2f0> <=====

>>esi; c5e15a04 <END_OF_CODE+7abc1/????>
>>edi; c5e15980 <END_OF_CODE+7ab3d/????>
>>esp; c5c19a58 <[ip_tables].data.end+116159/294761>

Trace; c02089c8 <pagebuf_get+98/120>
Trace; c01f4d2d <xlog_recover_do_buffer_trans+fd/230>
Trace; c01f45e1 <xlog_recover_insert_item_frontq+11/20>
Trace; c01f463a <xlog_recover_reorder_trans+4a/90>
Trace; c01f58b2 <xlog_recover_do_trans+52/100>
Trace; c01f59f6 <xlog_recover_commit_trans+26/40>
Trace; c01f5b5a <xlog_recover_process_data+12a/1d0>
Trace; c01f6744 <xlog_do_recovery_pass+354/800>
Trace; c01f6c74 <xlog_do_log_recovery+84/b0>
Trace; c01f6cc4 <xlog_do_recover+24/110>
Trace; c01f6e45 <xlog_recover+95/c0>
Trace; c01efde7 <xfs_log_mount+77/b0>
Trace; c01f861c <xfs_mountfs+a7c/fe0>
Trace; c01f776b <xfs_readsb+3b/c0>
Trace; c01f77b0 <xfs_readsb+80/c0>
Trace; c01ff894 <xfs_cmountfs+574/610>
Trace; c01ff97b <xfs_mount+4b/60>
Trace; c0212ce6 <linvfs_read_super+f6/240>
Trace; c0148c81 <get_sb_bdev+1b1/230>
Trace; c0148e9c <do_kern_mount+5c/120>
Trace; c014dfb8 <link_path_walk+918/a20>
Trace; c015c145 <do_add_mount+75/180>
Trace; c015c43b <do_mount+14b/170>
Trace; c015c29c <copy_mount_options+4c/a0>
Trace; c015c904 <sys_mount+a4/100>
Trace; c0108d9b <system_call+33/38>

Code; c0208592 <_pagebuf_lookup_pages+2a2/2f0>
00000000 <_EIP>:
Code; c0208592 <_pagebuf_lookup_pages+2a2/2f0> <=====
0: 0f 0b ud2a <=====
Code; c0208594 <_pagebuf_lookup_pages+2a4/2f0>
2: 42 inc %edx
Code; c0208595 <_pagebuf_lookup_pages+2a5/2f0>
3: 02 95 ab 35 c0 0f add 0xfc035ab(%ebp),%dl
Code; c020859b <_pagebuf_lookup_pages+2ab/2f0>
9: b7 47 mov $0x47,%bh
Code; c020859d <_pagebuf_lookup_pages+2ad/2f0>
b: 7c 81 jl ffffff8e <_EIP+0xffffff8e>
Code; c020859f <_pagebuf_lookup_pages+2af/2f0>
d: 4f dec %edi
Code; c02085a0 <_pagebuf_lookup_pages+2b0/2f0>
e: 08 04 00 or %al,(%eax,%eax,1)
Code; c02085a3 <_pagebuf_lookup_pages+2b3/2f0>
11: 00 01 add %al,(%ecx)
Code; c02085a5 <_pagebuf_lookup_pages+2b5/2f0>
13: 8d 00 lea (%eax),%eax



2002-09-12 23:41:06

by Steve Lord

Subject: Re: 2.4.20pre5aa2

On Thu, 2002-09-12 at 18:29, Samuel Flory wrote:
> Your patch seems to solve only some of the xfs issues for me. Before
> the patch my system hung when booting. This only occurred when I had
> xfs compiled into the kernel. After patching things seemed fine, but
> during "dbench 32" the system locked. Upon rebooting and attempting to
> mount the filesystem I got this:
> XFS mounting filesystem md(9,2)
> Starting XFS recovery on filesystem: md(9,2) (dev: 9/2)
> kernel BUG at page_buf.c:578!
> <and so on>
>

The line numbers in no way line up with the code I have in front of me;
however, this appears to equate to a failure in the address space
remapping code. This is not a failure I have ever seen in our code
base.

Steve


--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-09-12 23:59:24

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

Line 578 is the BUG() below:
mapit:
	pb->pb_flags |= _PBF_MEM_ALLOCATED;
	if (all_mapped) {
		pb->pb_flags |= _PBF_ALL_PAGES_MAPPED;

		/* A single page buffer is always mappable */
		if (page_count == 1) {
			pb->pb_addr = (caddr_t)
				page_address(pb->pb_pages[0]) +
				pb->pb_offset;
			pb->pb_flags |= PBF_MAPPED;
		} else if (flags & PBF_MAPPED) {
			if (as_list_len > 64)
				purge_addresses();
			pb->pb_addr = vmap(pb->pb_pages, page_count);
			if (!pb->pb_addr)
				BUG();
			pb->pb_addr += pb->pb_offset;
			pb->pb_flags |= PBF_MAPPED | _PBF_ADDR_ALLOCATED;
		}
	}
	/* If some pages were found with data in them
	 * we are not in PBF_NONE state.
	 */
	if (good_pages != 0) {
		pb->pb_flags &= ~(PBF_NONE);
		if (good_pages != page_count) {
			pb->pb_flags |= PBF_PARTIAL;
		}
	}

	PB_TRACE(pb, PB_TRACE_REC(look_pg), good_pages);

	return rval;
}




2002-09-13 00:18:03

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

On Thu, Sep 12, 2002 at 04:29:31PM -0700, Samuel Flory wrote:
> Your patch seems to solve only some of the xfs issues for me. Before
> the patch my system hung when booting. This only occurred when I had
> xfs compiled into the kernel. After patching things seemed fine, but
> during "dbench 32" the system locked. Upon rebooting and attempting to
> mount the filesystem I got this:
> XFS mounting filesystem md(9,2)
> Starting XFS recovery on filesystem: md(9,2) (dev: 9/2)
> kernel BUG at page_buf.c:578!
> <and so on>
>
> PS- The results of ksymoops are attached.

That seems to be a bug in xfs: it calls BUG() if vmap fails, but it must
not BUG(); it must return -ENOMEM to userspace instead, or it can try to
recollect and release some of the other vmalloced entries. Most probably
you ran into an address space shortage, not a real ram shortage, so to
work around it you can recompile with CONFIG_2G and it'll probably work;
also, dropping the gap page in vmalloc may help work around it (there's
no config option for it though). It could also be a vmap leak, maybe a
missing vfree, just some ideas.
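
I mean something like this in _pagebuf_lookup_pages() instead of the
BUG() (an untested sketch: the out_unwind label is hypothetical and the
real unwinding is up to the xfs folks):

	pb->pb_addr = vmap(pb->pb_pages, page_count);
	if (pb->pb_addr == NULL) {
		/* kernel address space exhausted: return -ENOMEM to
		 * the caller (ultimately mount(2)) instead of BUG() */
		rval = -ENOMEM;
		goto out_unwind;	/* hypothetical error path */
	}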





Andrea

2002-09-13 00:43:20

by Steve Lord

Subject: Re: 2.4.20pre5aa2

On Thu, 2002-09-12 at 19:23, Andrea Arcangeli wrote:
>
> That seems to be a bug in xfs: it calls BUG() if vmap fails, but it must
> not BUG(); it must return -ENOMEM to userspace instead, or it can try to
> recollect and release some of the other vmalloced entries. Most probably
> you ran into an address space shortage, not a real ram shortage, so to
> work around it you can recompile with CONFIG_2G and it'll probably work;
> also, dropping the gap page in vmalloc may help work around it (there's
> no config option for it though). It could also be a vmap leak, maybe a
> missing vfree, just some ideas.
>

We hold vmalloced space for very short periods of time; in fact,
filesystem recovery and large extended attributes are the only
cases. In this case we should be attempting to remap 2 pages
together. The only way out of this would be to fail the whole
mount at this point. I suspect a leak elsewhere.

Samuel, when you mounted xfs and it oopsed, was it shortly after bootup?
Also, how far did your dbench run get before it hung? I tried the
kernel, but it panicked during startup - then I realized I did not
apply the patch to fix the xfs/scheduler interactions first.

How much memory is in the machine by the way? And Andrea, is the
vmalloc space size reduced in the 3G user space configuration?

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-09-13 00:49:24

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

On Thu, Sep 12, 2002 at 07:47:48PM -0500, Stephen Lord wrote:
> How much memory is in the machine by the way? And Andrea, is the
> vmalloc space size reduced in the 3G user space configuration?

It's not reduced; it's the usual 128m.

BTW, I forgot to say that to really take advantage of CONFIG_2G one
should increase __VMALLOC_RESERVE too; it's not automatically scaled
along with CONFIG_2G.

Andrea

2002-09-13 01:12:06

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

Andrea Arcangeli wrote:

>That seems to be a bug in xfs: it calls BUG() if vmap fails, but it must
>not BUG(); it must return -ENOMEM to userspace instead, or it can try to
>recollect and release some of the other vmalloced entries. Most probably
>you ran into an address space shortage, not a real ram shortage, so to
>work around it you can recompile with CONFIG_2G and it'll probably work;
>also, dropping the gap page in vmalloc may help work around it (there's
>no config option for it though). It could also be a vmap leak, maybe a
>missing vfree, just some ideas.
>
>
>

The system has 4G of ram, and 4G of swap. So real memory is not an
issue. The system is intended to be an nfs server. As a result, nfs
performance is my only real concern. I should really use CONFIG_3GB as
I'm not doing much in user space other than a tftp and dhcp server.

In any case the system isn't in production, so I can leave it as is
till Monday.

2002-09-13 01:20:21

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

Stephen Lord wrote:

>We hold vmalloced space for very short periods of time; in fact,
>filesystem recovery and large extended attributes are the only
>cases. In this case we should be attempting to remap 2 pages
>together. The only way out of this would be to fail the whole
>mount at this point. I suspect a leak elsewhere.
>
>Samuel, when you mounted xfs and it oopsed, was it shortly after bootup?
>

Yes, I'd just logged in and manually mounted it.

>Also, how far did your dbench run get before it hung? I tried the
>kernel, but it panicked during startup - then I realized I did not
>apply the patch to fix the xfs/scheduler interactions first.
>
>
It looked to be around 1/4 to 1/2 done with dbench 32. I'm not sure if
it was the first or second run. I run dbench from a script:
sync
sync
./dbench 2
sync
sync
./dbench 4
sync
sync
./dbench 8
sync
sync
./dbench 16
sync
sync
./dbench 32
sync
sync
./dbench 64
sync
sync
<repeats >

I generally use this script to narrow down which configurations seem to
be most promising.

>How much memory is in the machine by the way?
>
4G ram, and 4G swap.




2002-09-13 02:07:35

by Samuel Flory

Subject: Re: 2.4.20pre5aa2



Andrea Arcangeli wrote:

>On Thu, Sep 12, 2002 at 07:47:48PM -0500, Stephen Lord wrote:
>
>
>>How much memory is in the machine by the way? And Andrea, is the
>>vmalloc space size reduced in the 3G user space configuration?
>>
>>
>
>It's not reduced; it's the usual 128m.
>
>BTW, I forgot to say that to really take advantage of CONFIG_2G one
>should increase __VMALLOC_RESERVE too; it's not automatically scaled
>along with CONFIG_2G.
>

So how much do you recommend increasing it? Currently it's:
include/asm-i386/page.h:#define __VMALLOC_RESERVE (128 << 20)
include/asm/page.h:#define __VMALLOC_RESERVE (128 << 20)

2002-09-13 02:07:32

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

Stephen, is there any reason to leave the system in its current state?
(I.e., do you guys want the output of some tool?) Or shall I give a
kernel with CONFIG_3GB a go, and maybe play with the vmalloc settings?



2002-09-13 12:50:22

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

On Thu, Sep 12, 2002 at 07:14:14PM -0700, Samuel Flory wrote:
> So how much do you recommend increasing it? Currently it's:
> include/asm-i386/page.h:#define __VMALLOC_RESERVE (128 << 20)
> include/asm/page.h:#define __VMALLOC_RESERVE (128 << 20)

you can try to compile with CONFIG_3G and to set __VMALLOC_RESERVE to
(512 << 20) and see if it helps. If it only happens a bit later, then
it's most probably an address space leak, which should be easy to track
down with some debugging instrumentation.
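
Concretely that means bumping the define you quoted, i.e. (the value is
an example only):

	/* include/asm-i386/page.h */
	#define __VMALLOC_RESERVE	(512 << 20)	/* was (128 << 20) */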

Andrea

2002-09-13 19:13:53

by Steve Lord

Subject: Re: 2.4.20pre5aa2

On Thu, 2002-09-12 at 20:18, Samuel Flory wrote:
>
> The system has 4G of ram, and 4G of swap. So real memory is not an
> issue. The system is intended to be an nfs server. As a result, nfs
> performance is my only real concern. I should really use CONFIG_3GB as
> I'm not doing much in user space other than a tftp and dhcp server.
>
> In any case the system isn't in production, so I can leave it as is
> till Monday.
>

So, after backing out 00_net-softirq (this was killing my networking
and NFS setup for some reason) and applying the new scheduler-related
fix in xfs, I have had this kernel up a few hours running dbench and
a bunch of other things; it has not hung once or exhibited any other
problems.

Having said that, my environment is different: I do not have 4G of
memory, I have 128M, and I also do not have md - not enough disks
right now to do that.

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-09-13 21:03:01

by Samuel Flory

Subject: Re: 2.4.20pre5aa2

Andrea Arcangeli wrote:

>
>
>you can try to compile with CONFIG_3G and to set __VMALLOC_RESERVE to
>(512 << 20) and see if it helps. If it only happens a bit later then
>it's most probably an address space leak, should be easy to track down
>some debugging instrumentation.
>
>


It seems to be working for me now. I'm getting about 200 on dbench 4,
and 90 on dbench 64. (Note you need to increase your log size to get
these kinds of numbers.) Now I get to see how fast I can read files via
nfs.

2002-09-13 21:13:34

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

On Fri, Sep 13, 2002 at 02:09:54PM -0700, Samuel Flory wrote:
> Andrea Arcangeli wrote:
>
> >
> >
> >you can try to compile with CONFIG_3G and to set __VMALLOC_RESERVE to
> >(512 << 20) and see if it helps. If it only happens a bit later then
> >it's most probably an address space leak, should be easy to track down
> >some debugging instrumentation.
> >
> >
>
>
> It seems to be working for me now. I'm getting about 200 on dbench 4,
> and 90 on dbench 64. (Note you need to increase your log size to get
> these kinds of numbers.) Now I get to see how fast I can read files via
> nfs.

BTW, if you run into trouble with networking with aa2, try backing out
the last net-softirq patch; not sure why yet, but the last modification
I did malfunctions with some nics. I couldn't reproduce it here, but
I'll look into that next week and I'll fix it too for the next -aa.

So, returning to xfs: it is possible dbench really generates lots of
simultaneous vmaps because of its concurrency, so I would suggest adding
an atomic counter increased at every vmap/vmalloc and decreased at every
vfree, checking it after every increase and storing the max value in a
sysctl, to see what's the max concurrency you reach with the vmaps. (You
can also export the counter via the sysctl, to verify there are no
memleaks after unmounting xfs.)
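
A sketch of the instrumentation I have in mind (names invented,
untested; the max update is racy, which is fine for debugging):

	static atomic_t vmap_in_use = ATOMIC_INIT(0);
	int vmap_in_use_max;		/* export both via a sysctl */

	/* call on every successful vmap()/vmalloc() */
	static inline void vmap_account_alloc(void)
	{
		int now;

		atomic_inc(&vmap_in_use);
		now = atomic_read(&vmap_in_use);
		if (now > vmap_in_use_max)	/* racy max, ok here */
			vmap_in_use_max = now;
	}

	/* call on every vfree() */
	static inline void vmap_account_free(void)
	{
		atomic_dec(&vmap_in_use);
	}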

Andrea

2002-09-14 14:37:01

by Steve Lord

Subject: Re: 2.4.20pre5aa2

On Fri, 2002-09-13 at 16:18, Andrea Arcangeli wrote:

> So, returning to xfs: it is possible dbench really generates lots of
> simultaneous vmaps because of its concurrency, so I would suggest adding
> an atomic counter increased at every vmap/vmalloc and decreased at every
> vfree, checking it after every increase and storing the max value in a
> sysctl, to see what's the max concurrency you reach with the vmaps. (You
> can also export the counter via the sysctl, to verify there are no
> memleaks after unmounting xfs.)
>
> Andrea

There are no vmaps during normal operation on xfs unless you are
setting extended attributes of more than 4K in size, or you
used some more obscure mkfs options. Only filesystem recovery will
use it otherwise.

Steve


2002-09-15 11:08:32

by Andi Kleen

Subject: Re: 2.4.20pre5aa2

On Sat, Sep 14, 2002 at 09:39:24AM -0500, Steve Lord wrote:
> There are no vmaps during normal operation on xfs unless you are
> setting extended attributes of more than 4K in size, or you
> used some more obscure mkfs options. Only filesystem recovery will
> use it otherwise.

Perhaps the original poster used those obscure mkfs options? What option
will trigger huge allocations?


-Andi

2002-09-15 19:31:59

by Samuel Flory

Subject: Re: 2.4.20pre5aa2



Andi Kleen wrote:

>Perhaps the original poster used those obscure mkfs options? What option
>will trigger huge allocations?
>

I did not use any special options on the filesystem that had the issue.



2002-09-16 16:00:08

by Dave Hansen

Subject: Re: 2.4.20pre5aa2

diff -ur linux-2.5.34-mm4/fs/proc/proc_misc.c linux-2.5.34-mm4-vmalloc-stats/fs/proc/proc_misc.c
--- linux-2.5.34-mm4/fs/proc/proc_misc.c Sat Sep 14 21:23:54 2002
+++ linux-2.5.34-mm4-vmalloc-stats/fs/proc/proc_misc.c Sat Sep 14 22:38:12 2002
@@ -38,6 +38,7 @@
#include <linux/smp_lock.h>
#include <linux/seq_file.h>
#include <linux/times.h>
+#include <linux/vmalloc.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -128,6 +129,40 @@
return proc_calc_metrics(page, start, off, count, eof, len);
}

+struct vmalloc_info {
+ unsigned long used;
+ unsigned long largest_chunk;
+};
+
+static struct vmalloc_info get_vmalloc_info(void)
+{
+ unsigned long prev_end = VMALLOC_START;
+ struct vm_struct* vma;
+ struct vmalloc_info vmi;
+ vmi.used = 0;
+
+ read_lock(&vmlist_lock);
+
+ if(!vmlist)
+ vmi.largest_chunk = (VMALLOC_END-VMALLOC_START);
+ else
+ vmi.largest_chunk = 0;
+
+ for (vma = vmlist; vma; vma = vma->next) {
+ unsigned long free_area_size =
+ (unsigned long)vma->addr - prev_end;
+ vmi.used += vma->size;
+ if (vmi.largest_chunk < free_area_size )
+ vmi.largest_chunk = free_area_size;
+ prev_end = vma->size + (unsigned long)vma->addr;
+ }
+ if(VMALLOC_END-prev_end > vmi.largest_chunk)
+ vmi.largest_chunk = VMALLOC_END-prev_end;
+
+ read_unlock(&vmlist_lock);
+ return vmi;
+}
+
extern atomic_t vm_committed_space;

static int meminfo_read_proc(char *page, char **start, off_t off,
@@ -138,6 +173,8 @@
struct page_state ps;
unsigned long inactive;
unsigned long active;
+ unsigned long vmtot;
+ struct vmalloc_info vmi;

get_page_state(&ps);
get_zone_counts(&active, &inactive);
@@ -150,6 +187,11 @@
si_swapinfo(&i);
committed = atomic_read(&vm_committed_space);

+ vmtot = (VMALLOC_END-VMALLOC_START)>>10;
+ vmi = get_vmalloc_info();
+ vmi.used >>= 10;
+ vmi.largest_chunk >>= 10;
+
/*
* Tagged format, for easy grepping and expansion.
*/
@@ -174,7 +216,10 @@
"Slab: %8lu kB\n"
"Committed_AS: %8u kB\n"
"PageTables: %8lu kB\n"
- "ReverseMaps: %8lu\n",
+ "ReverseMaps: %8lu\n"
+ "VmalTotal: %8lu kB\n"
+ "VmalUsed: %8lu kB\n"
+ "VmalChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
K(i.sharedram),
@@ -195,7 +240,10 @@
K(ps.nr_slab),
K(committed),
K(ps.nr_page_table_pages),
- ps.nr_reverse_maps
+ ps.nr_reverse_maps,
+ vmtot,
+ vmi.used,
+ vmi.largest_chunk
);

#ifdef CONFIG_HUGETLB_PAGE
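
With the patch applied the new fields show up tagged like the rest of
/proc/meminfo; for example (values made up for illustration):

$ grep Vmal /proc/meminfo
VmalTotal:     131072 kB
VmalUsed:        2456 kB
VmalChunk:     126976 kB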



2002-09-16 16:14:48

by Andrea Arcangeli

Subject: Re: 2.4.20pre5aa2

On Mon, Sep 16, 2002 at 09:03:41AM -0700, Dave Hansen wrote:
> + vmi = get_vmalloc_info();

Hmm, I'm not sure if it's better to slow down vmalloc instead of
/proc/meminfo and to keep meminfo O(1). In theory vmalloc should be used
only for persistent, infrequent allocations, so meminfo has a chance to
be called more frequently with monitors like xosview during workloads.
Admittedly, in final production with no monitoring, meminfo is never
going to be called; however, I like the idea of keeping meminfo very
quick.

Andrea

2002-09-16 16:35:37

by Dave Hansen

Subject: Re: 2.4.20pre5aa2

Andrea Arcangeli wrote:
> On Mon, Sep 16, 2002 at 09:03:41AM -0700, Dave Hansen wrote:
>
>>+ vmi = get_vmalloc_info();
>
> Hmm, I'm not sure if it's better to slow down vmalloc instead of
> /proc/meminfo and to keep meminfo O(1). In theory vmalloc should be used
> only for persistent, infrequent allocations, so meminfo has a chance to
> be called more frequently with monitors like xosview during workloads.
> Admittedly, in final production with no monitoring, meminfo is never
> going to be called; however, I like the idea of keeping meminfo very
> quick.

When I first set out to do it, I modified vmalloc. But I decided that it would
probably be easier to get a patch in that didn't modify vmalloc itself. The
"used" calculation is much easier (used += requested_size), but the largest
free chunk gets harder to do. I think this would have required vfree to get
into the game as well. It seemed much easier to just make meminfo do a little
more work.

--
Dave Hansen
[email protected]