2003-07-17 10:14:02

by Andrea Arcangeli

Subject: 2.4.22pre6aa1

URL:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.22pre6aa1.gz

changelog diff between 2.4.21rc8aa1 and 2.4.22pre6aa1:

Only in 2.4.21rc8aa1: 00_01_cciss-1
Only in 2.4.21rc8aa1: 00_02_cciss-1
Only in 2.4.21rc8aa1: 00_03_cciss-1

Updates are in mainline.

Only in 2.4.21rc8aa1: 00_backout-irda-trivial-1

Somebody acknowledged and fixed the breakage properly.
I haven't had a chance to test it myself yet on my cellphone,
but I will shortly.

Only in 2.4.21rc8aa1: 00_binfmt-elf-checks-1
Only in 2.4.22pre6aa1: 00_binfmt-elf-checks-2
Only in 2.4.21rc8aa1: 00_dirty-inode-1
Only in 2.4.22pre6aa1: 00_dirty-inode-3
Only in 2.4.21rc8aa1: 00_drop-inetpeer-cache-4.gz
Only in 2.4.22pre6aa1: 00_drop-inetpeer-cache-5.gz
Only in 2.4.21rc8aa1: 00_ext3-register-filesystem-lifo-1
Only in 2.4.22pre6aa1: 00_ext3-register-filesystem-lifo-2
Only in 2.4.21rc8aa1: 00_extraversion-24
Only in 2.4.22pre6aa1: 00_extraversion-26
Only in 2.4.21rc8aa1: 00_generic_file_write_nolock-1
Only in 2.4.22pre6aa1: 00_generic_file_write_nolock-3
Only in 2.4.21rc8aa1: 00_module-locking-fix-2
Only in 2.4.22pre6aa1: 00_module-locking-fix-3
Only in 2.4.21rc8aa1: 00_netconsole-2.4.10-C2-3.gz
Only in 2.4.22pre6aa1: 00_netconsole-2.4.10-C2-4.gz
Only in 2.4.21rc8aa1: 00_rwsem-fair-36
Only in 2.4.21rc8aa1: 00_rwsem-fair-36-recursive-8
Only in 2.4.22pre6aa1: 00_rwsem-fair-38
Only in 2.4.22pre6aa1: 00_rwsem-fair-38-recursive-8
Only in 2.4.21rc8aa1: 00_setfl-race-fix-2
Only in 2.4.22pre6aa1: 00_setfl-race-fix-3
Only in 2.4.21rc8aa1: 00_vm-cleanups-2
Only in 2.4.22pre6aa1: 00_vm-cleanups-3
Only in 2.4.21rc8aa1: 05_vm_20_cleanups-2
Only in 2.4.22pre6aa1: 05_vm_20_cleanups-3
Only in 2.4.21rc8aa1: 07_qlogicfc-4.gz
Only in 2.4.22pre6aa1: 07_qlogicfc-5.gz
Only in 2.4.21rc8aa1: 10_rawio-vary-io-18
Only in 2.4.22pre6aa1: 10_rawio-vary-io-21
Only in 2.4.21rc8aa1: 20_rcu-poll-8
Only in 2.4.22pre6aa1: 20_rcu-poll-9
Only in 2.4.21rc8aa1: 20_sched-o1-fixes-8
Only in 2.4.22pre6aa1: 20_sched-o1-fixes-9
Only in 2.4.21rc8aa1: 50_uml-patch-2.4.20-5-1.gz
Only in 2.4.22pre6aa1: 50_uml-patch-2.4.20-5-2.gz
Only in 2.4.21rc8aa1: 60_atomic-lookup-5
Only in 2.4.22pre6aa1: 60_atomic-lookup-6
Only in 2.4.21rc8aa1: 60_tux-exports-6
Only in 2.4.22pre6aa1: 60_tux-exports-7
Only in 2.4.21rc8aa1: 70_delalloc-2
Only in 2.4.22pre6aa1: 70_delalloc-3
Only in 2.4.21rc8aa1: 96_inode_read_write-atomic-6
Only in 2.4.22pre6aa1: 96_inode_read_write-atomic-8
Only in 2.4.21rc8aa1: 97_i_size-corruption-fixes-2
Only in 2.4.22pre6aa1: 97_i_size-corruption-fixes-4
Only in 2.4.21rc8aa1: 9900_aio-20.gz
Only in 2.4.22pre6aa1: 9900_aio-21.gz
Only in 2.4.22pre6aa1: 9920_kgdb-10.gz
Only in 2.4.21rc8aa1: 9920_kgdb-8.gz
Only in 2.4.21rc8aa1: 9925_kmsgdump-0.4.4-2.gz
Only in 2.4.22pre6aa1: 9925_kmsgdump-0.4.4-3.gz
Only in 2.4.21rc8aa1: 9930_io_request_scale-5
Only in 2.4.22pre6aa1: 9930_io_request_scale-6
Only in 2.4.22pre6aa1: 9985_blk-atomic-12
Only in 2.4.21rc8aa1: 9985_blk-atomic-9
Only in 2.4.21rc8aa1: 9996_kiobuf-slab-1
Only in 2.4.22pre6aa1: 9996_kiobuf-slab-2
Only in 2.4.21rc8aa1: 9998_lowlatency-fixes-12
Only in 2.4.22pre6aa1: 9998_lowlatency-fixes-13
Only in 2.4.21rc8aa1: 9999_dm-1
Only in 2.4.22pre6aa1: 9999_dm-2
Only in 2.4.21rc8aa1: 9999_gcc-3.3-6
Only in 2.4.22pre6aa1: 9999_gcc-3.3-7
Only in 2.4.21rc8aa1: 9999_sched_yield_scale-2
Only in 2.4.22pre6aa1: 9999_sched_yield_scale-5

Rediffed.

Only in 2.4.22pre6aa1: 00_copy-namespace-1

Fix copy-namespace.

Only in 2.4.21rc8aa1: 00_cpufreq-1

Dropped (it would better go in mainline than in -aa; I already
tried it and it doesn't do what I need, and now it's rejecting
in multiple ways).

Only in 2.4.22pre6aa1: 00_crc-makefile-clean-1

Remember to delete the autogenerated files to generate
clean diffs.

Only in 2.4.21rc8aa1: 00_cs46xx-u32-1
Only in 2.4.21rc8aa1: 00_floppy-smp-race-and-queuesize-1
Only in 2.4.21rc8aa1: 00_ipv6-route-fix-1
Only in 2.4.21rc8aa1: 00_o_direct-b_page-null-1
Only in 2.4.21rc8aa1: 00_ppp-ioctl-memleak-1
Only in 2.4.21rc8aa1: 00_tcp-tw-death-2
Only in 2.4.21rc8aa1: 00_usbnet-zaurus-c700-1
Only in 2.4.21rc8aa1: 00_wait_kio-cleanup-1
Only in 2.4.21rc8aa1: 10_tlb-state-3
Only in 2.4.21rc8aa1: 30_02_call-reserve1-1
Only in 2.4.21rc8aa1: 30_03_call-reserve2-2
Only in 2.4.21rc8aa1: 30_04_noac-1
Only in 2.4.21rc8aa1: 30_09_o_direct-3
Only in 2.4.21rc8aa1: 30_10-lockd1-1
Only in 2.4.21rc8aa1: 30_11-lockd2-1
Only in 2.4.21rc8aa1: 30_13-lockd4-1
Only in 2.4.21rc8aa1: 30_15-xprt_fixes-1
Only in 2.4.21rc8aa1: 70_quota-backport-3
Only in 2.4.21rc8aa1: 9999901_O_DIRECT-1

Merged in mainline.

Only in 2.4.22pre6aa1: 00_elevator-lowlatency-1

Reduced the number of requests during seeks (the latency times
increased slightly during seeks with pre5/pre6).

Only in 2.4.22pre6aa1: 00_elevator-read-reservation-axboe-2l-1

Incremental patch from Jens that reserves some spare
requests for reads. This has been measured to avoid some
waiting on reads, and it's beneficial in the common case.

Only in 2.4.22pre6aa1: 00_fdatasync-cleanup-1

Avoid a compile time warning.

Only in 2.4.21rc8aa1: 00_ksoftirqd-max-loop-networking-1
Only in 2.4.22pre6aa1: 00_ksoftirqd-max-loop-networking-2

Merged a fix from Philip Craig to make sure the anti-DoS
logic is effective. He wrote and verified
the code. It makes perfect sense, so it's applied.
Normal usage shouldn't notice the difference, especially with the
max-loop logic.

Only in 2.4.22pre6aa1: 00_parport-multi-io-pci-1

Multi-io cards depend on config-pci; from
Matthew Bell.

Only in 2.4.21rc8aa1: 00_radeon-3

This started to reject and the mainline code seems
slightly different now. Should be rechecked later.

Only in 2.4.21rc8aa1: 00_sched-O1-aa-2.4.19rc3-12.gz
Only in 2.4.22pre6aa1: 00_sched-O1-aa-2.4.19rc3-14.gz

Avoid losing half a timeslice of signal delivery delay
if the signal was sent while the task was being woken up. Fix
from Ingo Molnar.

Only in 2.4.21rc8aa1: 00_semop-timeout-2
Only in 2.4.22pre6aa1: 00_semop-timeout-3

Most of it merged in mainline, except the ia64 entry
in the syscall table. Interestingly, the syscall
now allocated for ia64 is different from the one in 21rc8aa1.

Only in 2.4.21rc8aa1: 00_smp-timers-not-deadlocking-3
Only in 2.4.22pre6aa1: 00_smp-timers-not-deadlocking-5

Merged an anti-deadlock fix from lcm; 2.5 probably needs it too. In
short, the theory that mod_timer is the only thing that can run in
parallel was wrong: add_timer and del_timer/del_timer_sync can too.
Having already fixed mod_timer in a backwards-compatible way before
merging the smp-timers in -aa made it easy to fix those further
windows too.
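
To illustrate the race class (a sketch of ours, not the lcm fix itself):
on SMP any of these entry points can run concurrently with the same
calls made from another CPU, so all of them -- not just mod_timer --
must serialize on the timer's lock:

/* Illustrative sketch only, 2.4-style timer API. */
#include <linux/timer.h>
#include <linux/sched.h>

static struct timer_list my_timer;

static void my_handler(unsigned long data)
{
	/* timer callback */
}

void timer_example(void)
{
	init_timer(&my_timer);
	my_timer.function = my_handler;
	my_timer.data = 0;
	my_timer.expires = jiffies + HZ;
	/* any of these may race with the same calls from another CPU: */
	add_timer(&my_timer);
	mod_timer(&my_timer, jiffies + 2 * HZ);
	del_timer_sync(&my_timer);
}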

Only in 2.4.21rc8aa1: 00_usb_get_string-len-1

Dropped, was the wrong fix and it could break stuff.

Only in 2.4.22pre6aa1: 05_vm_25_try_to_free_buffers-invariant-1

Minor cleanup from Daniele Bellucci.

Only in 2.4.21rc8aa1: 10_o_direct-open-check-3
Only in 2.4.22pre6aa1: 10_o_direct-open-check-4

Updated to handle the double API.

Only in 2.4.21rc8aa1: 10_try-cciss-only-4G-1

Dropped, new code in mainline.

Only in 2.4.22pre6aa1: 21_ppc64-aa-2

Was used to fix ppc64 around pre3, but pre3-pre6 may have
broken stuff again; I didn't check.

Only in 2.4.21rc8aa1: 70_xfs-1.2-3.gz
Only in 2.4.22pre6aa1: 70_xfs-1.3-2.gz
Only in 2.4.21rc8aa1: 70_xfs-config-stuff-3
Only in 2.4.22pre6aa1: 70_xfs-config-stuff-4
Only in 2.4.21rc8aa1: 70_xfs-exports-1
Only in 2.4.22pre6aa1: 70_xfs-exports-2
Only in 2.4.21rc8aa1: 70_xfs-sysctl-2
Only in 2.4.22pre6aa1: 70_xfs-sysctl-3
Only in 2.4.21rc8aa1: 71_posix_acl-2
Only in 2.4.22pre6aa1: 71_posix_acl-3
Only in 2.4.22pre6aa1: 71_xfs-VM_IO-1
Only in 2.4.21rc8aa1: 71_xfs-aa-2
Only in 2.4.22pre6aa1: 71_xfs-aa-4
Only in 2.4.22pre6aa1: 71_xfs-fixup-1
Only in 2.4.22pre6aa1: 71_xfs-infrastructure-1
Only in 2.4.22pre6aa1: 71_xfs-tuning-1

Upgraded XFS from 1.2 to 1.3.

Only in 2.4.21rc8aa1: 80_x86_64-common-code-6
Only in 2.4.21rc8aa1: 82_x86_64-suse-12
Only in 2.4.21rc8aa1: 84_x86-64-arch-3
Only in 2.4.21rc8aa1: 85_x86-64-includes-2

Dropped, mainline is more up to date, though it
won't compile, like ia64.

Only in 2.4.21rc8aa1: 93_NUMAQ-10
Only in 2.4.22pre6aa1: 93_NUMAQ-13

Merged the latest NUMA code for the x440.

Only in 2.4.22pre6aa1: 9900_aio-21-ppc-1

ppc aio code.

Only in 2.4.22pre6aa1: 9901_aio-blkdev-1

Allow aio on blkdevices too (dunno who wrote this).

Only in 2.4.21rc8aa1: 9910_shm-largepage-13.gz
Only in 2.4.22pre6aa1: 9910_shm-largepage-16.gz

Thanks to Hugh for the help in porting the bigpages
to the rewritten shmfs layer in 22pre. No idea at the moment if it
works or if it only compiles.

Only in 2.4.21rc8aa1: 9940_ocfs-2.gz
Only in 2.4.22pre6aa1: 9940_ocfs-3.gz
Only in 2.4.21rc8aa1: 9941_ocfs-20021012.gz
Only in 2.4.22pre6aa1: 9941_ocfs-direct-1
Only in 2.4.22pre6aa1: 9941_ocfs-warnings-1
Only in 2.4.21rc8aa1: 9942_ocfs-compile-2
Only in 2.4.22pre6aa1: 9942_ocfs-o_direct-API-1

Upgraded to a more recent ocfs version (merged by Andi Kleen).

Only in 2.4.21rc8aa1: 9980_fix-pausing-5
Only in 2.4.22pre6aa1: 9980_fix-pausing-6
Only in 2.4.21rc8aa1: 9981_elevator-lowlatency-5

Fix pausing and elevator lowlatency are now in 2.4.22pre.

Unplugging the queue may avoid a reschedule.

Only in 2.4.21rc8aa1: 9986_elevator-merge-fast-path-1
Only in 2.4.22pre6aa1: 9986_elevator-merge-fast-path-2

Enabled for headactive devices (i.e. IDE) too. Idea
and original patch from Daniele Bellucci, final
patch from Jens Axboe.

Only in 2.4.22pre6aa1: 9998_lowlatency-reiserfs-1

Added an appealing reschedule hook (should be double checked).

Only in 2.4.22pre6aa1: 9999900_desktop-2

Added a desktop mode that guarantees a higher degree
of fairness in the scheduler.

Only in 2.4.22pre6aa1: 9999900_drm-4.3-1.gz

drm updates. Merged by Chip Salzenberg.

Only in 2.4.22pre6aa1: 9999900_ecc-20020904-1.gz

Latest ecc timer poller code. Merged by Chip Salzenberg.

Only in 2.4.21rc8aa1: 9999900_ikd-1
Only in 2.4.22pre6aa1: 9999900_ikd-2.gz

Initialize it at boot so it will have a chance to work.

Only in 2.4.22pre6aa1: 9999900_x86-movsl-copy-user-1

Boost the copy-user asm.

Only in 2.4.21rc8aa1: 9999_truncate-nopage-race-1
Only in 2.4.22pre6aa1: 9999_truncate-nopage-race-3

Take advantage of the i_alloc_sem in read mode to serialize only
against truncates, to avoid possible spurious reschedules.
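
A sketch of the idea (ours, not the literal patch; it assumes an
i_alloc_sem rw_semaphore in struct inode, as the patch name suggests):
readers exclude only truncate, not each other.

#include <linux/fs.h>
#include <linux/rwsem.h>

/* fault/nopage side: shared, so concurrent faults don't serialize */
void fault_side(struct inode *inode)
{
	down_read(&inode->i_alloc_sem);
	/* ... look up the page; truncate cannot shrink i_size here ... */
	up_read(&inode->i_alloc_sem);
}

/* truncate side: exclusive, waits for in-flight faults to drain */
void truncate_side(struct inode *inode)
{
	down_write(&inode->i_alloc_sem);
	/* ... shrink i_size and drop the pages ... */
	up_write(&inode->i_alloc_sem);
}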

Only in 2.4.21rc8aa1: 10_ext3-o_direct-2
Only in 2.4.22pre6aa1: 10_ext3-o_direct-3
Only in 2.4.21rc8aa1: 40_o_direct-reiserfs-2

Update to new API.

Andrea


2003-07-17 10:28:02

by ooyama eiichi

Subject: Re: 2.4.22pre6aa1

Hi Andrea.

I am sorry, I couldn't find this file.
Maybe I have to wait?

From: Andrea Arcangeli <[email protected]>
Subject: 2.4.22pre6aa1
Date: Thu, 17 Jul 2003 12:28:57 +0200

> URL:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.22pre6aa1.gz
>
> changelog diff between 2.4.21rc8aa1 and 2.4.22pre6aa1:
>
> Only in 2.4.21rc8aa1: 00_01_cciss-1
> Only in 2.4.21rc8aa1: 00_02_cciss-1
> Only in 2.4.21rc8aa1: 00_03_cciss-1
>
> Updates are in mainline.
>
> Only in 2.4.21rc8aa1: 00_backout-irda-trivial-1
>
> Somebody acknowledged and fixed the breakage properly.
> I haven't had a chance to test it myself yet on my cellphone,
> but I will shortly.

2003-07-17 10:38:19

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Thursday 17 July 2003 12:42, ooyama eiichi wrote:

Hi Ooyama,

> I am sorry, I couldn't find this file.
> Maybe I have to wait?
use another mirror, e.g.:
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.22pre6aa1.gz

ciao, Marc

2003-07-17 10:38:34

by ooyama eiichi

Subject: Re: 2.4.22pre6aa1

Hi, Andrea.

I can get the file from this URL (without "us"):
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.22pre6aa1.gz


> Hi Andrea.
>
> I am sorry, I couldn't find this file.
> Maybe I have to wait?
>
> From: Andrea Arcangeli <[email protected]>
> Subject: 2.4.22pre6aa1
> Date: Thu, 17 Jul 2003 12:28:57 +0200
>
> > URL:
> >
> > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.22pre6aa1.gz
> >
> > changelog diff between 2.4.21rc8aa1 and 2.4.22pre6aa1:
> >
> > Only in 2.4.21rc8aa1: 00_01_cciss-1
> > Only in 2.4.21rc8aa1: 00_02_cciss-1
> > Only in 2.4.21rc8aa1: 00_03_cciss-1
> >
> > Updates are in mainline.
> >
> > Only in 2.4.21rc8aa1: 00_backout-irda-trivial-1
> >
> > Somebody acknowledged and fixed the breakage properly.
> > I haven't had a chance to test it myself yet on my cellphone,
> > but I will shortly.

2003-07-17 15:27:54

by Dave Jones

Subject: Re: 2.4.22pre6aa1

On Thu, Jul 17, 2003 at 12:28:57PM +0200, Andrea Arcangeli wrote:

> Only in 2.4.21rc8aa1: 00_cpufreq-1
>
> Dropped (it would better go in mainline than in -aa

Proposed for 2.4.23. Marcelo doesn't seem to have any objections.

> I already tried it and it doesn't do what I need

You know where to report bugs...

Dave

2003-07-17 20:15:39

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Thu, Jul 17, 2003 at 04:42:12PM +0100, Dave Jones wrote:
> On Thu, Jul 17, 2003 at 12:28:57PM +0200, Andrea Arcangeli wrote:
> > I already tried it and it doesn't do what I need
>
> You know where to report bugs...

Hmm, I thought it was a feature, not a bug, or I would have already
reported something ;)

What I need is to set the frequency to around 400MHz when on battery,
but that's not one of the speedstep frequencies; the speedstep
frequencies are either too fast (750/1200MHz) or too slow (250MHz). Is it
supposed to work that way?

thanks,

Andrea

2003-07-17 22:01:36

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Thursday 17 July 2003 12:28, Andrea Arcangeli wrote:

Hi Andrea,

> Only in 2.4.22pre6aa1: 00_elevator-lowlatency-1
> Only in 2.4.22pre6aa1: 00_elevator-read-reservation-axboe-2l-1

Hmm, this is now my first day testing out .22-pre6 and .22-pre6aa1 with the
new I/O stall fixes. At a first look & feel it's very good, but I've noticed
a side effect (if it can be called that):

VMware4 Workstation
-------------------

2.4.22-pre[6|6aa1]: ~ 1 minute 02 seconds from: Start this virtual machine ...
2.4.22-pre2 : ~ 30 seconds from: Start this virtual machine ...

... to start up Windows 2000 Professional completely.

Well, personally I don't care about the slowdown of vmware startup with a VM
but there may be many other slowdows?!

ciao, Marc

2003-07-17 22:11:21

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 12:13:38AM +0200, Marc-Christian Petersen wrote:
> 2.4.22-pre[6|6aa1]: ~ 1 minute 02 seconds from: Start this virtual machine ...
> 2.4.22-pre2 : ~ 30 seconds from: Start this virtual machine ...
>
> ... to start up Windows 2000 Professional completely.

can you check what it's doing? Reading or writing? I guess it's a kind of
workload that would seek all over the place. However throughput should
be better with seeks now, since I could grow the queue (if anything, only
latency would be worse, but the above is a throughput thing only; latency
doesn't matter).

Can you retry one more time with pre2 vs pre6 to be 100% sure it's
reproducible?

thanks,

Andrea

2003-07-17 22:13:01

by Mike Fedyk

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 12:13:38AM +0200, Marc-Christian Petersen wrote:
> VMware4 Workstation
> -------------------
>
> 2.4.22-pre[6|6aa1]: ~ 1 minute 02 seconds from: Start this virtual machine ...
> 2.4.22-pre2 : ~ 30 seconds from: Start this virtual machine ...
>
> ... to start up Windows 2000 Professional completely.
>
> Well, personally I don't care about the slowdown of vmware startup with a VM,
> but there may be many other slowdowns?!

Can you try a stock -pre kernel, say pre[256], and see where the additional
time starts?

2003-07-17 22:20:54

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Friday 18 July 2003 00:27, Mike Fedyk wrote:

Hi Mike,

> Can you try a stock -pre kernel, say pre[256], and see where the additional
> time starts?
Sure. Though I expect the behaviour starts with -pre3.

Anyway, I'll test.

ciao, Marc

2003-07-17 22:19:35

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Friday 18 July 2003 00:13, Marc-Christian Petersen wrote:

> On Thursday 17 July 2003 12:28, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Only in 2.4.22pre6aa1: 00_elevator-lowlatency-1
> > Only in 2.4.22pre6aa1: 00_elevator-read-reservation-axboe-2l-1
>
> Hmm, this is now my first day testing out .22-pre6 and .22-pre6aa1 with the
> new I/O stall fixes. At a first look & feel it's very good, but I've
> noticed a side effect (if it can be called that):
>
> VMware4 Workstation
> -------------------
>
> 2.4.22-pre[6|6aa1]: ~ 1 minute 02 seconds from: Start this virtual machine ...
> 2.4.22-pre2 : ~ 30 seconds from: Start this virtual machine ...
>
> ... to start up Windows 2000 Professional completely.
>
> Well, personally I don't care about the slowdown of vmware startup with a
> VM, but there may be many other slowdowns?!
hmmm:

2.4.22-pre[6|6aa1]:
-------------------
root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
131072+0 records in
131072+0 records out
2147483648 bytes transferred in 128.765686 seconds (16677453 bytes/sec)

2.4.22-pre2:
------------
root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
131072+0 records in
131072+0 records out
2147483648 bytes transferred in 98.489331 seconds (21804226 bytes/sec)

both kernels freshly rebooted.


Machine:
--------
Celeron 1.3GHz
512MB RAM
2x IDE (UDMA100) 60/40 GB
1GB SWAP, 512MB on each disk (same priority)
ext3fs (data=ordered)
XFree 4.3
WindowMaker 0.82-CVS


ciao, Marc

2003-07-17 22:36:44

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 12:30:45AM +0200, Marc-Christian Petersen wrote:
> On Friday 18 July 2003 00:13, Marc-Christian Petersen wrote:
>
> > On Thursday 17 July 2003 12:28, Andrea Arcangeli wrote:
> >
> > Hi Andrea,
> >
> > > Only in 2.4.22pre6aa1: 00_elevator-lowlatency-1
> > > Only in 2.4.22pre6aa1: 00_elevator-read-reservation-axboe-2l-1
> >
> > Hmm, this is now my first day testing out .22-pre6 and .22-pre6aa1 with the
> > new I/O stall fixes. At a first look & feel it's very good, but I've
> > noticed a side effect (if it can be called that):
> >
> > VMware4 Workstation
> > -------------------
> >
> > 2.4.22-pre[6|6aa1]: ~ 1 minute 02 seconds from: Start this virtual machine ...
> > 2.4.22-pre2 : ~ 30 seconds from: Start this virtual machine ...
> >
> > ... to start up Windows 2000 Professional completely.
> >
> > Well, personally I don't care about the slowdown of vmware startup with a
> > VM, but there may be many other slowdowns?!
> hmmm:
>
> 2.4.22-pre[6|6aa1]:
> -------------------
> root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> 131072+0 records in
> 131072+0 records out
> 2147483648 bytes transferred in 128.765686 seconds (16677453 bytes/sec)
>
> 2.4.22-pre2:
> ------------
> root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> 131072+0 records in
> 131072+0 records out
> 2147483648 bytes transferred in 98.489331 seconds (21804226 bytes/sec)
>
> both kernels freshly rebooted.

this explains it.

Can you try to change include/linux/blkdev.h like this:

-#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (16 << (20 - 9)) /* 16 mbytes when full sized */

This will raise the queue from 4 to 16M. That is the first (/only) thing
that can explain a drop in performance while doing contiguous I/O.
However I didn't expect it to make a difference, or at least not such
a relevant one.
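
(For reference, the arithmetic behind that macro, as a minimal
userspace sketch of ours: a sector is 2^9 = 512 bytes and a megabyte
is 2^20 bytes, so N << (20 - 9) sectors is N megabytes.)

#include <stdio.h>

int main(void)
{
	long sectors = 4L << (20 - 9);	/* 8192 sectors */

	printf("%ld sectors = %ld MB\n",
	       sectors, sectors * 512 / (1L << 20));	/* prints 4 MB */
	return 0;
}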

If this doesn't help at all, it might not be an elevator/blkdev thing.
At least on my machines the contiguous I/O is still at the same speed.

You also were the only one reporting a loss of performance with
elevator-lowlatency; it could be still the same problem that you've
seen at that time.

Last but not least, if it's an elevator/blkdev thing, you must be
able to measure it with reads too, not only with writes. Can you try to
read that file back? (Careful about the cache effects if you read it
multiple times and interrupt it; best is to benchmark reads after a
fresh mount to be sure.)
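
(A minimal sketch of ours of such a read-back test, not Andrea's: it
reads /home/largefile from the earlier dd run in the same 16k chunks
and reports the throughput.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
	char buf[16384];
	long long total = 0;
	ssize_t n;
	struct timeval t0, t1;
	double secs;
	int fd = open("/home/largefile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	gettimeofday(&t0, NULL);
	while ((n = read(fd, buf, sizeof buf)) > 0)
		total += n;	/* sequential read, like the dd above */
	gettimeofday(&t1, NULL);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%lld bytes in %.2f s (%.2f MB/s)\n",
	       total, secs, total / secs / (1 << 20));
	close(fd);
	return 0;
}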

> ext3fs (data=ordered)

can you try with data=writeback (or ext2) or hdparm -W1 and see if you
can still see the same delta between the two kernels? (careful with -W1
as it invalidates journaling)

thanks,

Andrea

2003-07-18 00:17:11

by Chris Mason

Subject: Re: 2.4.22pre6aa1

On Thu, 2003-07-17 at 18:50, Andrea Arcangeli wrote:
> On Fri, Jul 18, 2003 at 12:30:45AM +0200, Marc-Christian Petersen wrote:
> > 2.4.22-pre[6|6aa1]:
> > -------------------
> > root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > 131072+0 records in
> > 131072+0 records out
> > 2147483648 bytes transferred in 128.765686 seconds (16677453 bytes/sec)
> >
> > 2.4.22-pre2:
> > ------------
> > root@codeman:[/] # dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > 131072+0 records in
> > 131072+0 records out
> > 2147483648 bytes transferred in 98.489331 seconds (21804226 bytes/sec)
> >
> > both kernels freshly rebooted.
>
> this explains it.
>
> Can you try to change include/linux/blkdev.h like this:
>
> -#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
> +#define MAX_QUEUE_SECTORS (16 << (20 - 9)) /* 16 mbytes when full sized */
>
> This will raise the queue from 4 to 16M. That is the first (/only) thing
> that can explain a drop in performance while doing contiguous I/O.
> However I didn't expect it to make a difference, or at least not such
> a relevant one.
>
> If this doesn't help at all, it might not be an elevator/blkdev thing.
> At least on my machines the contiguous I/O is still at the same speed.
>
Especially with just one writer, you really shouldn't be able to see a
difference in pre6. Did you measure this change on both pre6 and
pre6aa1? Your message indicated that, but I wanted to double-check to
make sure.

-chris


2003-07-18 05:32:17

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 12:50:02AM +0200, Andrea Arcangeli wrote:
> At least on my machines the contiguous I/O is still at the same speed.

Just to be 100% sure I ran an accurate benchmark myself too. I had the
numbers for the pre-2.4.22pre levels, but I hadn't yet benchmarked
the final code in mainline, which has some cosmetic differences.

These are the results for the contiguous I/O with vanilla 2.4.21 against
vanilla 2.4.22-pre6 against 2.4.22pre6aa2 (and aa2 is completely equal
to aa1 in terms of blkdev/IO). BTW, pre6aa2 is configured as desktop, so
it has some additional overhead (not significant in pure I/O bound
computations).

aic7xxx booted with mem=128m ext3 data=ordered

-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Kernel MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
2.4.21 100 11052 77.3 21683 15.5 16401 8.8 20347 82.1 32765 6.1 865.6 2.8
2.4.21 100 13533 94.5 21236 13.9 15904 7.9 21182 82.3 35019 5.1 1254.3 2.2
2.4.21 100 12402 86.5 22453 14.9 16165 5.8 20270 82.1 34754 6.4 1398.8 3.8

22pre6 100 13070 91.4 23314 15.3 15402 6.5 21202 81.8 33167 8.4 959.9 2.2
22pre6 100 13181 92.2 18556 12.5 16506 6.9 20562 78.1 33394 4.9 1271.9 1.9
22pre6 100 14082 98.5 23170 16.1 16199 5.7 21045 81.2 34124 7.3 1450.6 4.0

22pre6aa2 100 12703 90.5 23245 16.3 15533 6.8 19730 79.6 37072 8.0 775.9 1.4
22pre6aa2 100 13241 94.0 20602 14.4 15562 7.0 19675 79.6 37102 7.7 843.5 1.8
22pre6aa2 100 12948 93.0 21566 15.0 15970 7.6 19460 81.7 36599 7.2 740.6 1.7

As you can see, for contiguous I/O I can't measure any regression at all;
the minor variations across the three runs are likely for the largest
part influenced by the ext3 block allocation, which can change for every
run.

Andrea

2003-07-18 18:03:58

by Christoph Hellwig

Subject: Re: 2.4.22pre6aa1

On Thu, Jul 17, 2003 at 12:28:57PM +0200, Andrea Arcangeli wrote:
> Only in 2.4.21rc8aa1: 9910_shm-largepage-13.gz
> Only in 2.4.22pre6aa1: 9910_shm-largepage-16.gz
>
> Thanks to Hugh for the help in porting the bigpages
> to the rewritten shmfs layer in 22pre. No idea at the moment if it
> works or if it only compiles.

Any reason you don't use a backport of hugetlbfs like the IA64 or
the RH AS3 tree?

2003-07-18 22:19:00

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 07:18:53PM +0100, Christoph Hellwig wrote:
> On Thu, Jul 17, 2003 at 12:28:57PM +0200, Andrea Arcangeli wrote:
> > Only in 2.4.21rc8aa1: 9910_shm-largepage-13.gz
> > Only in 2.4.22pre6aa1: 9910_shm-largepage-16.gz
> >
> > Thanks to Hugh for the help in porting the bigpages
> > to the rewritten shmfs layer in 22pre. No idea at the moment if it
> > works or if it only compiles.
>
> Any reason you don't use a backport of hugetlbfs like the IA64 or
> the RH AS3 tree?

bigpages= is a documented API that has to be used in production, so I
can easily add the hugetlbfs API but I guess I've to keep this one
anyways. I also would need to verify the performance of hugetlbfs before
suggesting migrating to it, for example I don't want
preallocation/prefaulting (IIRC hugetlbfs preallocates everything). I
also like the single huge array of page pointers, that is very hardwired
but optimal for those workloads.

Andrea

2003-07-18 22:33:20

by William Lee Irwin III

Subject: Re: 2.4.22pre6aa1

On Sat, Jul 19, 2003 at 12:27:50AM +0200, Andrea Arcangeli wrote:
> bigpages= is a documented API that has to be used in production, so I
> can easily add the hugetlbfs API but I guess I've to keep this one
> anyways. I also would need to verify the performance of hugetlbfs before
> suggesting migrating to it, for example I don't want
> preallocation/prefaulting (IIRC hugetlbfs preallocates everything). I
> also like the single huge array of page pointers, that is very hardwired
> but optimal for those workloads.

Most of the complaints I've gotten are about lack of support for mixed
PSE and non-PSE mappings, not preallocation or performance (generally
its usage doesn't involve creation/destruction cycle performance
requirements, and most of the time they intend to use 100% of the memory).

It's basically too stupid and operating on too small a data set to
screw up performance-wise apart from creation/destruction, which is not
intended to be performant (and will never be; it blits oversized areas).

I wouldn't mind hearing of what you believe is missing, so long as it's
within the constraints of what's mergeable. =(


-- wli

2003-07-18 22:42:43

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 03:48:24PM -0700, William Lee Irwin III wrote:
> On Sat, Jul 19, 2003 at 12:27:50AM +0200, Andrea Arcangeli wrote:
> > bigpages= is a documented API that has to be used in production, so I
> > can easily add the hugetlbfs API but I guess I've to keep this one
> > anyways. I also would need to verify the performance of hugetlbfs before
> > suggesting migrating to it, for example I don't want
> > preallocation/prefaulting (IIRC hugetlbfs preallocates everything). I
> > also like the single huge array of page pointers, that is very hardwired
> > but optimal for those workloads.
>
> Most of the complaints I've gotten are about lack of support for mixed
> PSE and non-PSE mappings, not preallocation or performance (generally
> its usage doesn't involve creation/destruction cycle performance
> requirements, and most of the time they intend to use 100% of the memory).
>
> It's basically too stupid and operating on too small a data set to
> screw up performance-wise apart from creation/destruction, which is not
> intended to be performant (and will never be; it blits oversized areas).
>
> I wouldn't mind hearing of what you believe is missing, so long as it's
> within the constraints of what's mergeable. =(

I tend to think the creation/destruction will be the most noticeable
performance difference in practice. Allocating 42G in a single block
will take a bit of time ;). It's not necessarily worse or unacceptable,
but it's different. And I feel I've to retain the bigpages= API (as an
API, not as an implementation) anyways. Furthermore I'm unsure if hugetlbfs
is relaxed like the shm-largepage patch is, I mean, it should be
possible to mmap the stuff with 4k granularity too, or stuff could break
due to that change of API too.

Andrea

2003-07-18 22:48:20

by William Lee Irwin III

Subject: Re: 2.4.22pre6aa1

On Sat, Jul 19, 2003 at 12:53:28AM +0200, Andrea Arcangeli wrote:
> I tend to think the creation/destruction will be the most noticeable
> performance difference in practice. Allocating 42G in a single block
> will take a bit of time ;). It's not necessarily worse or unacceptable,
> but it's different. And I feel I've to retain the bigpages= API (as an
> API, not as an implementation) anyways. Furthermore I'm unsure if hugetlbfs
> is relaxed like the shm-largepage patch is, I mean, it should be
> possible to mmap the stuff with 4k granularity too, or stuff could break
> due to that change of API too.

I've just not gotten feedback about creation and destruction; I get the
impression it's an uncommon operation.

The alignment etc. considerations are bits I probably can't get merged. =(

Most of the work I did was trying to get the preexisting semantics into
more standard-looking API's, e.g. vfs ops and standard-ish sysv shm.


-- wli

2003-07-18 22:57:31

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 04:04:31PM -0700, William Lee Irwin III wrote:
> On Sat, Jul 19, 2003 at 12:53:28AM +0200, Andrea Arcangeli wrote:
> > I tend to think the creation/destruction will be the most noticeable
> > performance difference in practice. Allocating 42G in a single block
> > will take a bit of time ;). It's not necessarily worse or unacceptable,
> > but it's different. And I feel I've to retain the bigpages= API (as an
> > API, not as an implementation) anyways. Furthermore I'm unsure if hugetlbfs
> > is relaxed like the shm-largepage patch is, I mean, it should be
> > possible to mmap the stuff with 4k granularity too, or stuff could break
> > due to that change of API too.
>
> I've just not gotten feedback about creation and destruction; I get the
> impression it's an uncommon operation.

It's uncommon of course. A 42G allocation all at once may take a while,
and 48G works flawlessly at peak performance w/o 4:4. I support as much
as 64G all in a single shmfs file backed by bigpages (and it won't run
out of memory on a 64G box either, even with the 3:1 mapping).

> The alignment etc. considerations are bits I probably can't get merged. =(

so the apps will need changes and a kernel API way to know the hardware
page size provided by hugetlbfs (though they could probe for it with
many tries).

> Most of the work I did was trying to get the preexisting semantics into
> more standard-looking API's, e.g. vfs ops and standard-ish sysv shm.

yes.

Andrea

2003-07-18 23:37:37

by William Lee Irwin III

Subject: Re: 2.4.22pre6aa1

On Sat, Jul 19, 2003 at 01:12:30AM +0200, Andrea Arcangeli wrote:
> so the apps will need changes and a kernel API way to know the hardware
> page size provided by hugetlbfs (though they could probe for it with
> many tries).

The hugepage size is exported in /proc/meminfo for the time being.
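
(A sketch of ours of reading it back, assuming the "Hugepagesize: ...
kB" line format /proc/meminfo uses for this:)

#include <stdio.h>

/* return the huge page size in kB, or -1 if not found */
long hugepage_kb(void)
{
	char line[128];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof line, f))
		if (sscanf(line, "Hugepagesize: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}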

I think 2.7 will see something we both like better.


-- wli

2003-07-18 23:49:59

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Fri, Jul 18, 2003 at 04:53:09PM -0700, William Lee Irwin III wrote:
> On Sat, Jul 19, 2003 at 01:12:30AM +0200, Andrea Arcangeli wrote:
> > so the apps will need changes and a kernel API way to know the hardware
> > page size provided by hugetlbfs (though they could probe for it with
> > many tries).
>
> The hugepage size is exported in /proc/meminfo for the time being.

ok.

> I think 2.7 will see something we both like better.

the transparency feature in the shm-largepage patch is quite nice since
you could trivially put an app on the fs w/o any breakage that way (not
everything has to be strictly mapped with bigpages, so it would make the
code more relaxed by just changing the mountpoint). Of course a way
to know for sure if a mapping is marked VM_LARGEPAGE would be needed
then to be sure the app has the right pieces of vm backed with the right
page size.

Andrea

2003-07-22 13:19:23

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Friday 18 July 2003 00:50, Andrea Arcangeli wrote:

Hi Andrea,

> Can you try to change include/linux/blkdev.h like this:
> -#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
> +#define MAX_QUEUE_SECTORS (16 << (20 - 9)) /* 16 mbytes when full sized */
> This will raise the queue from 4 to 16M. That is the first (/only) thing
> that can explain a drop in performance while doing contiguous I/O.
> However I didn't expect it to make a difference, or at least not such
> a relevant one.
> If this doesn't help at all, it might not be an elevator/blkdev thing.
> At least on my machines the contiguous I/O is still at the same speed.
well, it doesn't help at all. I/O gets even worse with that change (8MB/s
less). How can this happen? *wondering*

> You also were the only one reporting a loss of performance with
> elevator-lowlatency; it could be still the same problem that you've
> seen at that time.
The only one? Surely not. Also Con tested your elevator-lowlatency and we both
saw performance degradation :)

> can you try with data=writeback (or ext2) or hdparm -W1 and see if you
> can still see the same delta between the two kernels? (careful with -W1
> as it invalidates journaling)
Yes, I'll do it later today.

Sorry for my late reply. I've been very busy.

ciao, Marc


2003-07-22 13:19:34

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Friday 18 July 2003 02:30, Chris Mason wrote:

Hi Chris,

> > If this doesn't help at all, it might not be an elevator/blkdev thing.
> > At least on my machines the contiguous I/O is still at the same speed.
> Especially with just one writer, you really shouldn't be able to see a
> difference in pre6. Did you measure this change on both pre6 and
> pre6aa1? Your message indicated that, but I wanted to double-check to
> make sure.
Yes, I measured it with pre6 and pre6aa1. There is no noticeable difference.

ciao, Marc


2003-07-22 13:49:28

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Tue, Jul 22, 2003 at 03:34:16PM +0200, Marc-Christian Petersen wrote:
> On Friday 18 July 2003 00:50, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Can you try to change include/linux/blkdev.h like this:
> > -#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
> > +#define MAX_QUEUE_SECTORS (16 << (20 - 9)) /* 16 mbytes when full sized */
> > This will raise the queue from 4 to 16M. That is the first (/only) thing
> > that can explain a drop in performance while doing contiguous I/O.
> > However I didn't expect it to make a difference, or at least not such
> > a relevant one.
> > If this doesn't help at all, it might not be an elevator/blkdev thing.
> > At least on my machines the contiguous I/O is still at the same speed.
> well, it doesn't help at all. I/O gets even worse with that change (8MB/s
> less). How can this happen? *wondering*
>
> > You also were the only one reporting a loss of performance with
> > elevator-lowlatency; it could be still the same problem that you've
> > seen at that time.
> The only one? Surely not. Also Con tested your elevator-lowlatency and we both
> saw performance degradation :)

Performance degradation when? Note that we're only talking about
contiguous I/O here, not contest. I can't measure any performance
degradation during contiguous I/O, and if anything it could be explained
by the now shorter queue, but you tried enlarging it and it went even
slower (this was good btw, confirming a larger queue was completely
worthless and only hurts the VM without providing any I/O bandwidth
pipelining benefit). The elevator-lowlatency should have no effect
other than a shorter queue during pure contiguous I/O.

> > can you try with data=writeback (or ext2) or hdparm -W1 and see if you
> > can still see the same delta between the two kernels? (careful with -W1
> > as it invalidates journaling)
> Yes, I'll do it later today.

please try plain ext2; this sounds like some fs effect of some sort. The
fs must throttle on the shorter queue or seek differently somehow.

> Sorry for my late reply. I've been very busy.

No problem ;)

Andrea

2003-07-22 13:53:02

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Tue, Jul 22, 2003 at 02:28:03PM +0200, Marc-Christian Petersen wrote:
> Yes, I measured it with pre6 and pre6aa1. There is no noticeable difference.

this makes sense, thanks for double checking.

Andrea

2003-07-23 11:06:24

by Sergey S. Kostyliov

Subject: Re: 2.4.22pre6aa1

Hello Andrea,

This is during `swapoff -a`, on a heavily loaded box:

ksymoops 2.4.9 on i686 2.4.22-pre6aa1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-pre6aa1/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Error (regular_file): read_system_map stat /usr/src/linux/System.map failed
ksymoops: No such file or directory
kernel BUG at shmem.c:490!
invalid operand: 0000 2.4.22-pre6aa1 #1 SMP Thu Jul 17 20:24:29 MSD 2003
CPU: 0
EIP: 0010:[<801424cb>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000508 ebx: 8d846a00 ecx: c7919800 edx: c79198fc
esi: c79198fc edi: 8d846b40 ebp: 97b999a0 esp: af853e34
ds: 0018 es: 0018 ss: 0018
Process oracle (pid: 23274, stackpage=af853000)
Stack: 8d846a00 8d846a00 80142460 c7919800 80161abf 8d846a00 00000000 00000000
97b999a0 8d846a00 8d846a00 8015e98e 8d846a00 97b999a0 97b999a0 97b999b8
8015ea5a 97b999a0 803349e0 8d141860 c78f6fa0 80148acb 97b999a0 9eb9f774
Call Trace: [<80142460>] [<80161abf>] [<8015e98e>] [<8015ea5a>]
[<80148acb>]
[<80132695>] [<8011c35a>] [<80121c66>] [<80128069>] [<80127f13>]
[<80128115>]
[<801073a8>] [<801236c4>] [<80123562>] [<80127546>] [<80115b50>]
[<80117b80>]
[<80107614>]
Code: 0f 0b ea 01 24 c1 27 80 eb cd 0f 0b e9 01 24 c1 27 80 eb bc


>>EIP; 801424cb <alloc_pages_node+75b/2c70> <=====

Trace; 80142460 <alloc_pages_node+6f0/2c70>
Trace; 80161abf <iput+14f/2b0>
Trace; 8015e98e <lock_may_write+21e/260>
Trace; 8015ea5a <dput+8a/150>
Trace; 80148acb <fput+db/110>
Trace; 80132695 <do_brk+3d5/700>
Trace; 8011c35a <remove_wait_queue+4fa/d10>
Trace; 80121c66 <exit_mm+4f6/770>
Trace; 80128069 <unblock_all_signals+109/150>
Trace; 80127f13 <flush_signal_handlers+103/110>
Trace; 80128115 <dequeue_signal+65/4f0>
Trace; 801073a8 <__read_lock_failed+11a8/17c0>
Trace; 801236c4 <tasklet_kill+f4/120>
Trace; 80123562 <__tasklet_hi_schedule+162/1a0>
Trace; 80127546 <del_timer_sync+a16/ca0>
Trace; 80115b50 <smp_call_function+ce0/19f0>
Trace; 80117b80 <__verify_write+230/ab0>
Trace; 80107614 <__read_lock_failed+1414/17c0>

Code; 801424cb <alloc_pages_node+75b/2c70>
00000000 <_EIP>:
Code; 801424cb <alloc_pages_node+75b/2c70> <=====
0: 0f 0b ud2a <=====
Code; 801424cd <alloc_pages_node+75d/2c70>
2: ea 01 24 c1 27 80 eb ljmp $0xeb80,$0x27c12401
Code; 801424d4 <alloc_pages_node+764/2c70>
9: cd 0f int $0xf
Code; 801424d6 <alloc_pages_node+766/2c70>
b: 0b e9 or %ecx,%ebp
Code; 801424d8 <alloc_pages_node+768/2c70>
d: 01 24 c1 add %esp,(%ecx,%eax,8)
Code; 801424db <alloc_pages_node+76b/2c70>
10: 27 daa
Code; 801424dc <alloc_pages_node+76c/2c70>
11: 80 eb bc sub $0xbc,%bl


1 warning and 1 error issued. Results may not be reliable.
--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc

2003-07-24 12:13:21

by Marc-Christian Petersen

Subject: Re: 2.4.22pre6aa1

On Tuesday 22 July 2003 15:59, Andrea Arcangeli wrote:

Hi Andrea,

> Performance degradation when? Note that we're only talking about
> contiguous I/O here, not contest. I can't measure any performance
> degradation during contiguous I/O, and if anything it could be explained
> by the now shorter queue, but you tried enlarging it and it went even
> slower (this was good btw, confirming a larger queue was completely
> worthless and only hurts the VM without providing any I/O bandwidth
> pipelining benefit). The elevator-lowlatency should have no effect
> other than a shorter queue during pure contiguous I/O.
Well, contiguous I/O isn't a big problem, though I saw performance degradation
in contiguous I/O. The problem is that I still see mouse stops during heavy
I/O, that I still see keyboard stops during heavy I/O, and X is dog slow during
heavy I/O (renicing X to -20 doesn't really help). I really miss the 2.4.18
time when this wasn't a problem at all!
Contest was not the reason. An easily reproducible scenario is:

dd if=/dev/zero of=/home/largefile bs=16384 count=131072

This will kill your mouse, keyboard and X. The only "workaround" not to see
mouse stops, keyboard stops and a dog-slow X was decreasing nr_requests from
128 to 4. Anything higher resulted in pauses (e.g. 8 for nr_requests).
Maybe SCSI behaves totally differently, dunno. ATM I don't have SCSI around to
test it, only IDE (ATA100/ATA133).

I've tested this too for .22-pre7, changing "MAX_NR_REQUESTS 1024" to "4". And
now the big surprise: still mouse stops and keyboard stops during, e.g., the
above dd command, but with, for sure, very low throughput. So throughput dropping
is not the problem here at all. I have very very low throughput but still
pauses/stops. How is this possible? I am very confused about the code :-(

> > > can you try with data=writeback (or ext2) or hdparm -W1 and see if you
> > > can still see the same delta between the two kernels? (careful with -W1
> > > as it invalidates journaling)
> > Yes, I'll do it later today.
> please try plain ext2; this sounds like some fs effect of some sort. The
> fs must throttle on the shorter queue or seek differently somehow.
well, ext2 does not make any difference :-(

I thought trying out q->full from Chris would make a difference. I am quite
sure it must be a merge error by me, otherwise I cannot explain why
q->full kills my X windows for tons of seconds during a "make -j16 bzImage
modules"; I get stops of the whole system too for some seconds every 30
seconds or so. Ripping out q->full (not just disabling it via elvtune -b 0)
fixed at least that behaviour.

Another funny thing, not dependent on q->full, is that VMware needs over 1
minute to start up with a Windows 2000 in it, where w/o the lowlat elevator it
needs ~30 seconds or less to start up completely. VMware has reads/writes
during the startup at a _max_ of 500KB/s. Before, it went up to 10MB/s.
Now we should decide whether it's a bug in the kernel or a bug in VMware ;))

> > Sorry for my late reply. I've been very busy.
> No problem ;)
ok :) thnx. Sorry again for the delay, but I wanted to be sure about the
reports so I had to test many things out first.

Hmm, I am a bit afraid that no one else has noticed this yet. This reminds me
of over a year ago, when I reported I/O stalls/pauses/stops with the
2.4.19-pre's and no one but you noticed it, after some time. A 'real' fix for
that came up over one year later, and some days before that we had a big
discussion about it with many people involved noticing it too.

Don't get me wrong, Andrea and Chris :) .. but I am quite disappointed about
current Linux for the desktop. 2.4 has I/O problems, 2.6 has scheduler
problems; 2 things I cannot live with for my desktop. Maybe Linus is right
when he said Linux may be desktop-ready in 2006.

Any suggestions as to what I can do to help fix that silly behaviour? I really
really want a usable 2.4 tree again (read: 2.4.22 final) :)

P.S.: I've CC'ed Nick.

ciao, Marc

2003-07-24 14:28:11

by Chris Mason

Subject: Re: 2.4.22pre6aa1

On Thu, 2003-07-24 at 08:27, Marc-Christian Petersen wrote:
> On Tuesday 22 July 2003 15:59, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Performance degradation when? Note that we're only talking about
> > contiguous I/O here, not contest. I can't measure any performance
> > degradation during contiguous I/O, and if anything it could be explained
> > by the now shorter queue, but you tried enlarging it and it went even
> > slower (this was good btw, confirming a larger queue was completely
> > worthless and only hurts the VM without providing any I/O bandwidth
> > pipelining benefit). The elevator-lowlatency should have no effect
> > other than a shorter queue during pure contiguous I/O.
> Well, contiguous I/O isn't a big problem, though I saw performance degradation
> in contiguous I/O. The problem is that I still see mouse stops during heavy
> I/O, that I still see keyboard stops during heavy I/O, and X is dog slow during
> heavy I/O (renicing X to -20 doesn't really help). I really miss the 2.4.18
> time when this wasn't a problem at all!
> Contest was not the reason. An easily reproducible scenario is:
>
> dd if=/dev/zero of=/home/largefile bs=16384 count=131072
>
> This will kill your mouse, keyboard and X. The only "workaround" not to see
> mouse stops, keyboard stops and a dog-slow X was decreasing nr_requests from
> 128 to 4. Anything higher resulted in pauses (e.g. 8 for nr_requests).
> Maybe SCSI behaves totally differently, dunno. ATM I don't have SCSI around to
> test it, only IDE (ATA100/ATA133).

Ok, there's something fundamental we're missing here; the IDE boxes I
test on don't show this ;-) Can you set up a serial console and capture
sysrq-t during the pause? Or better yet, set up kgdb.

What kind of keyboard/mouse do you have?

I'll give you an updated q->full patch on Monday, including the
__get_request_wait latency stats.

-chris


2003-07-25 05:13:20

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Wed, Jul 23, 2003 at 03:21:15PM +0400, Sergey S. Kostyliov wrote:
> Hello Andrea,
>
> This is during `swapoff -a`, on a heavily loaded box:
>
> ksymoops 2.4.9 on i686 2.4.22-pre6aa1. Options used
> -V (default)
> -k /proc/ksyms (default)
> -l /proc/modules (default)
> -o /lib/modules/2.4.22-pre6aa1/ (default)
> -m /usr/src/linux/System.map (default)
>
> Warning: You did not tell me where to find symbol information. I will
> assume that the log matches the kernel and modules that are running
> right now and I'll use the default options above for symbol resolution.
> If the current kernel and/or modules do not match the log, you can get
> more accurate output by telling me the kernel version and where to find
> map, modules, ksyms etc. ksymoops -h explains the options.
>
> Error (regular_file): read_system_map stat /usr/src/linux/System.map failed
> ksymoops: No such file or directory
> kernel BUG at shmem.c:490!

hmm, 2.4.22pre6aa1 was the first 2.4 largepages port to the >=22pre
shmfs backport from 2.5. It could be a bug in 2.5, or a bug present only
in the backport of the 2.5 code to 22pre, or even a bug only present in
-aa due to the largepage patch ported on top of the backport included in
22pre. I'll have a closer look at it tomorrow. The place where it
crashed is:

BUG_ON(inode->i_blocks);

it might be only a minor accounting issue. It needs some auditing.

I'm afraid you're the first one testing the shmfs backport in 22pre +
the largepage support patch in my tree with a big app doing swapoff at
the same time.

Are you using bigpages btw?

thank you very much for the feedback,

Andrea

PS. Should this give us relevant problems in the debugging/auditing, I'll
just give you a patch to back out the backport and go back to the shmfs
code in 2.4.21rc8aa1, which is running rock solid in production with
largepages (I doubt you need the loop device on top of shmfs anyways). I
prefer not to spend much time on new 2.4 features.

2003-07-25 10:55:53

by Sergey S. Kostyliov

Subject: Re: 2.4.22pre6aa1

Hello Andrea,

On Friday 25 July 2003 09:28, you wrote:
> On Wed, Jul 23, 2003 at 03:21:15PM +0400, Sergey S. Kostyliov wrote:
> > Hello Andrea,

<cut>

> hmm, 2.4.22pre6aa1 was the first 2.4 largepages port to the >=22pre
> shmfs backport from 2.5. It could be a bug in 2.5, or a bug present only
> in the backport of the 2.5 code to 22pre, or even a bug only present in
> -aa due to the largepage patch ported on top of the backport included in
> 22pre. I'll have a closer look at it tomorrow. The place where it
> crashed is:
>
> BUG_ON(inode->i_blocks);
>
> it might be only a minor accounting issue. It needs some auditing.

Thanks for your response!
Yes, it seems possible. At least it continued to run just fine after the
oops on 2.4.22pre6aa1.
Btw I've managed to get a hard lockup with 2.4.22pre7aa1 in the same
scenario. It just stops responding, even to Alt+SysRq+* from the keyboard.

>
> I'm afraid you're the first one testing the shmfs backport in 22pre +
> the largepage support patch in my tree with a big app doing swapoff at
> the same time.
>
> Are you using bigpages btw?

No, I'm not using bigpages.

>
> thank you very much for the feedback,
>
> Andrea
>
> PS. Should this give us relevant problems in the debugging/auditing, I'll
> just give you a patch to back out the backport and go back to the shmfs
> code in 2.4.21rc8aa1, which is running rock solid in production with
> largepages (I doubt you need the loop device on top of shmfs anyways). I
> prefer not to spend much time on new 2.4 features.

I doubt it depends on bigpages, because they
are not used in my setup. But I can live with that. The rule "do not run
`swapoff -a` under load" doesn't sound impossible in my case (if this
is the only way to trigger this problem).

--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc

2003-07-25 18:47:29

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

Hi Sergey,

On Fri, Jul 25, 2003 at 03:10:59PM +0400, Sergey S. Kostyliov wrote:
> I doubt it depends on bigpages, because they
> are not used in my setup. But I can live with that. The rule "do not run
> `swapoff -a` under load" doesn't sound impossible in my case (if this
> is the only way to trigger this problem).

can you reproduce it with 2.4.21rc8aa1? If not, then likely it's a
2.5/2.6 bug that went in 2.4 during the backport. I spoke with Hugh an
hour ago about this, he will soon look into this too.

Andrea

2003-08-03 17:11:55

by Sergey S. Kostyliov

Subject: Re: 2.4.22pre6aa1

Hello Andrea,

On Friday 25 July 2003 23:02, Andrea Arcangeli wrote:
> Hi Sergey,
>
> On Fri, Jul 25, 2003 at 03:10:59PM +0400, Sergey S. Kostyliov wrote:
> > I doubt it depends on bigpages, because they
> > are not used in my setup. But I can live with that. The rule "do not run
> > `swapoff -a` under load" doesn't sound impossible in my case (if this
> > is the only way to trigger this problem).
>
> can you reproduce it with 2.4.21rc8aa1? If not, then likely it's a
> 2.5/2.6 bug that went in 2.4 during the backport. I spoke with Hugh an
> hour ago about this, he will soon look into this too.

Sorry for the late response. I wasn't able to reproduce either the oops or
the lockup with 2.4.21rc8aa1.

>
> Andrea

--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc

2003-08-16 11:56:22

by Andrea Arcangeli

Subject: Re: 2.4.22pre6aa1

On Sun, Aug 03, 2003 at 09:12:00PM +0400, Sergey S. Kostyliov wrote:
> Hello Andrea,
>
> On Friday 25 July 2003 23:02, Andrea Arcangeli wrote:
> > Hi Sergey,
> >
> > On Fri, Jul 25, 2003 at 03:10:59PM +0400, Sergey S. Kostyliov wrote:
> > > I doubt it depends on bigpages, because they
> > > are not used in my setup. But I can live with that. The rule "do not run
> > > `swapoff -a` under load" doesn't sound impossible in my case (if this
> > > is the only way to trigger this problem).
> >
> > can you reproduce it with 2.4.21rc8aa1? If not, then likely it's a
> > 2.5/2.6 bug that went in 2.4 during the backport. I spoke with Hugh an
> > hour ago about this, he will soon look into this too.
>
> Sorry for the late response. I wasn't able to reproduce either the oops or
> the lockup with 2.4.21rc8aa1.

ok good. I'm betting it's the shm backport that destabilized something.
I had no time to look further into it during vacations ;), but the first
suspect thing I mentioned to Hugh during OLS was this:

static void shmem_removepage(struct page *page)
{
if (!PageLaunder(page))
shmem_free_blocks(page->mapping->host, 1);
}

It's not exactly obvious how the accounting should change as a function of
the Launder bit. I mean, a writepage can happen even w/o the launder
bitflag set (if it's not invoked by the vm) and I don't see how a msync
or a vm pressure writepage trigger should be different in terms of
accounting of the blocks in an inode.

Overall I need a bit more time on Monday to digest the whole backport
to be sure of what's going on and if the above is right after all.

Andrea

2003-08-16 13:53:11

by Hugh Dickins

Subject: Re: 2.4.22pre6aa1

Hi Andrea,

Welcome back. Sergey and I have been in contact over this while you
were away, I kept it private so as not to inflate your mailbox further.

Brief summary (subject to confirmation by Sergey's testing) would be:
don't worry about it, it's not an -aa problem, it's a long-standing and
rare bug, in fact much less likely to occur in current 2.6 and 2.4.22-rc
than in 2.4.21 and earlier: seems Sergey's just been doing good testing.
So I've not bothered Marcelo with fixing it for 2.4.22, will submit fix
to 2.6.0-test and 2.4.23-pre later on.

You'll immediately counter what I've said there, by pointing out that
BUG_ON(inode->i_blocks) couldn't have triggered in 2.4.21 and earlier,
since I only added it in 2.4.22-pre. True, but instead it would have
gone on to hit clear_inode's "if (inode->i_data.nrpages) BUG();"
(assuming I've identified the issue correctly).

On Sat, 16 Aug 2003, Andrea Arcangeli wrote:
> On Sun, Aug 03, 2003 at 09:12:00PM +0400, Sergey S. Kostyliov wrote:
> > On Friday 25 July 2003 23:02, Andrea Arcangeli wrote:
> > > On Fri, Jul 25, 2003 at 03:10:59PM +0400, Sergey S. Kostyliov wrote:
> > > > I doubt it depends on bigpages, because they
> > > > are not used in my setup. But I can live with that. The rule "do not run
> > > > `swapoff -a` under load" doesn't sound impossible in my case (if this
> > > > is the only way to trigger this problem).

I believe the issue is that shmem_unuse_inode can swizzle a page
from swap cache back into page cache after deletion's or truncation's
truncate_inode_pages has cleaned out the page cache for that inode.

Not a great big deal in the truncation case (though it could depart
from spec: I can imagine fsx detecting inconsistency, seen before
2.4.22-pre, but not since), but dangerous in the deletion case - if
there were neither i_blocks nor nrpages BUG, then you'd end up with
a page in the cache with page->mapping pointing into freed inode.

There used to be nothing to prevent this (the info->sem I eliminated
was of no use in the swapoff case), but in 2.5 and 2.4.22-pre I added
those I_FREEING and i_size checks to shmem_unuse_inode to prevent it.

Or so I thought. But faced with explaining Sergey's BUG_ON,
I eventually realized it's not good enough (when SMP) just to check
before adding into page cache, it needs to be checked after.

Or, as in the patch Sergey is currently testing below, shmem_truncate
must be prepared to truncate_inode_pages again. That's the approach
I originally implemented in 2.5, but I grew disgusted with it every
time I thought of partial truncation trundling twice through
truncate_inode_pages (it can easily be avoided when nrpages == 0,
but that's unlikely in partial truncation).
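
(A sketch of ours of that recheck, not Hugh's actual patch: after the
swap entries are freed, sweep the page cache again in case
shmem_unuse_inode slipped pages back in meanwhile.)

static void shmem_truncate(struct inode *inode)
{
	/* ... free the inode's swap entries as before ... */

	/* pages may have been swizzled back from swap cache while
	 * we ran; truncate them again rather than leave them behind */
	if (inode->i_mapping->nrpages)
		truncate_inode_pages(inode->i_mapping, inode->i_size);
}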

So I added the VM_PAGEIN flag stuff to restrict it to when it might be
necessary, and extended it to cover other races when reading the page at
the same time as truncating (though I think generic_file_read has a
window of this kind that we've never worried about). I expect to split
the patch into several before sending them to Marcelo and Andrew.
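
In outline, condensing the patch below rather than adding anything new
to it:

	/* every path that swizzles a page from swap cache back into
	 * page cache marks the inode: */
	info->flags |= VM_PAGEIN;

	/* shmem_truncate then repeats truncate_inode_pages only when
	 * it might matter: */
	if (inode->i_mapping->nrpages && (info->flags & VM_PAGEIN))
		truncate_inode_pages(inode->i_mapping, inode->i_size);

	/* and shmem_notify_change clears VM_PAGEIN, under info->lock,
	 * before a partial truncate, so stale pageins don't force
	 * needless repeat passes. */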

There may be another piece needed, for an even rarer race: what if the
truncated page arrives at shmem_writepage after shmem_truncate has
cleaned the swap pages, but before it recalls truncate_inode_pages?
But I'll return to this later; I'm attending to other stuff right
now, and this is all exceedingly rare (unless Sergey shows otherwise).
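
Again as a hypothetical timeline, using just the functions named above:

	/*
	 * shmem_truncate                   shmem_writepage (late arrival)
	 * clears the inode's swap entries
	 *                                  moves the truncated page from
	 *                                  page cache into swap cache
	 * recalls truncate_inode_pages
	 * (which scans the page cache only,
	 *  so the fresh swap page escapes it)
	 */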

If Andrew happens to be reading this, yes, these subtle races and
oft-revisited solutions do shed further doubt on the whole business
of tmpfs swapcache swizzling: perhaps 2.7 can find a safer way.

> > > can you reproduce it with 2.4.21rc8aa1? If not, then likely it's a
> > > 2.5/2.6 bug that went into 2.4 during the backport. I spoke with Hugh
> > > an hour ago about this; he will soon look into this too.
> >
> > Sorry for the late response. I wasn't able to reproduce either the oops
> > or the lockup with 2.4.21rc8aa1.

It (or rather, clear_inode's nrpages BUG) should be much easier to hit
with 2.4.21rc8aa1. I wonder whether Sergey was just (un)lucky to hit
it in his 2.4.22pre6aa1 testing: he's not mentioned whether or not he
can reproduce it at will. I've not been able to reproduce it at all.

There might be some kind of timing difference, which somehow makes it
easier to hit the narrower window in 2.4.22pre6aa1, but I don't see
what that is.

> ok good. I'm betting it's the shm backport that destabilized something.
> I had no time to look further into it during vacations ;), but the first
> suspect thing I mentioned to Hugh during OLS was this:
>
> static void shmem_removepage(struct page *page)
> {
> 	if (!PageLaunder(page))
> 		shmem_free_blocks(page->mapping->host, 1);
> }
>
> It's not exactly obvious how the accounting should change as a function
> of the Launder bit. I mean, a writepage can happen even without the
> Launder bitflag set (if it's not invoked by the VM), and I don't see why
> a writepage triggered by msync should differ from one triggered by VM
> pressure in terms of accounting of the blocks in an inode.

I thought we'd settled this one then. I understand you're suspicious
of using a PageLaunder test in that way, but it has worked correctly
in the -ac tree for a year or so. The point is, shmem_removepage gets
called whenever a shmem page is removed from the page cache, so it also
gets called when shmem_writepage moves a page from page cache to swap
cache: but in that case the page must still be counted as occupying
filesystem space, so we must not call shmem_free_blocks. PageLaunder,
set only during writepage, identifies that case. I guess I should add
a comment there.
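
A sketch of what such a comment might say (keeping strictly to the
semantics just described):

	static void shmem_removepage(struct page *page)
	{
		/*
		 * Called whenever a shmem page leaves the page cache.
		 * If PG_launder is set, we are under shmem_writepage:
		 * the data lives on in the swap cache and still occupies
		 * filesystem space, so the inode's block count must stay
		 * untouched. Otherwise the page is really going away
		 * (truncate or delete), and its block can be given back.
		 */
		if (!PageLaunder(page))
			shmem_free_blocks(page->mapping->host, 1);
	}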

> Overall I need a bit more time on Monday to digest the whole backport,
> to be sure of what's going on and whether the above is right after all.

If you have time to do so, that would be great: but I don't think it
need be your priority. Certainly nobody else has reported a problem.

Hugh

2003-08-16 13:58:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: 2.4.22pre6aa1

On Sat, 16 Aug 2003, Hugh Dickins wrote:
>
> Or, as in the patch Sergey is currently testing below, shmem_truncate
> must be prepared to truncate_inode_pages again. That's the approach
> I originally implemented in 2.5, but I grew disgusted with it every
> time I thought of partial truncation trundling twice through
> truncate_inode_pages (it can easily be avoided when nrpages == 0,
> but that's unlikely in partial truncation).
>
> So I added the VM_PAGEIN flag stuff to restrict it to when it might be
> necessary, and extended it to cover other races when reading the page at
> the same time as truncating (though I think generic_file_read has a
> window of this kind that we've never worried about). I expect to split
> the patch into several before sending them to Marcelo and Andrew.

And here is the patch I claimed to be below.
If you apply it to anything other than 2.4.22-pre6aa1,
please be careful to check that it has applied correctly.
Originally I made a patch against 2.4.22-pre6, and then applied it to
2.4.22-pre6aa1: but I have never seen patch make such a mess of it!

Hugh

--- 2.4.22-pre6aa1/mm/shmem.c	Thu Jul 31 15:23:58 2003
+++ linux/mm/shmem.c	Mon Aug 11 21:00:55 2003
@@ -92,6 +92,9 @@
 
 #define VM_ACCT(size)	(PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
+/* info->flags needs a VM_flag to handle pagein/truncate race efficiently */
+#define VM_PAGEIN	VM_READ
+
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
@@ -435,6 +438,18 @@
 
 	BUG_ON(info->swapped > info->next_index);
 	spin_unlock(&info->lock);
+
+	if (inode->i_mapping->nrpages && (info->flags & VM_PAGEIN)) {
+		/*
+		 * Call truncate_inode_pages again: racing shmem_unuse_inode
+		 * may have swizzled a page in from swap since vmtruncate or
+		 * generic_delete_inode did it, before we lowered next_index.
+		 * Also, though shmem_getpage checks i_size before adding to
+		 * cache, no recheck after: so fix the narrow window there too.
+		 */
+		truncate_inode_pages(inode->i_mapping, inode->i_size);
+	}
+
 	if (freed)
 		shmem_free_blocks(inode, freed);
 }
@@ -459,6 +474,19 @@
 					attr->ia_size>>PAGE_CACHE_SHIFT,
 					&page, SGP_READ);
 			}
+			/*
+			 * Reset VM_PAGEIN flag so that shmem_truncate can
+			 * detect if any pages might have been added to cache
+			 * after truncate_inode_pages. But we needn't bother
+			 * if it's being fully truncated to zero-length: the
+			 * nrpages check is efficient enough in that case.
+			 */
+			if (attr->ia_size) {
+				struct shmem_inode_info *info = SHMEM_I(inode);
+				spin_lock(&info->lock);
+				info->flags &= ~VM_PAGEIN;
+				spin_unlock(&info->lock);
+			}
 		}
 	}
 
@@ -511,7 +539,6 @@
 	struct address_space *mapping;
 	swp_entry_t *ptr;
 	unsigned long idx;
-	unsigned long limit;
 	int offset;
 
 	idx = 0;
@@ -543,13 +570,9 @@
 	inode = info->inode;
 	mapping = inode->i_mapping;
 	delete_from_swap_cache(page);
-
-	/* Racing against delete or truncate? Must leave out of page cache */
-	limit = (inode->i_state & I_FREEING)? 0:
-		(inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-
-	if (idx >= limit || add_to_page_cache_unique(page,
-			mapping, idx, page_hash(mapping, idx)) == 0) {
+	if (add_to_page_cache_unique(page,
+			mapping, idx, page_hash(mapping, idx)) == 0) {
+		info->flags |= VM_PAGEIN;
 		ptr[offset].val = 0;
 		info->swapped--;
 	} else if (add_to_swap_cache(page, entry) != 0)
@@ -634,6 +657,7 @@
 			 * Add page back to page cache, unref swap, try again.
 			 */
 			add_to_page_cache_locked(page, mapping, index);
+			info->flags |= VM_PAGEIN;
 			spin_unlock(&info->lock);
 			swap_free(swap);
 			goto getswap;
@@ -809,6 +833,7 @@
 			swap_free(swap);
 		} else if (add_to_page_cache_unique(swappage,
 			mapping, idx, page_hash(mapping, idx)) == 0) {
+			info->flags |= VM_PAGEIN;
 			entry->val = 0;
 			info->swapped--;
 			spin_unlock(&info->lock);
@@ -868,6 +893,7 @@
 				goto failed;
 			goto repeat;
 		}
+		info->flags |= VM_PAGEIN;
 	}
 
 	spin_unlock(&info->lock);

2003-08-16 14:50:46

by Sergey S. Kostyliov

[permalink] [raw]
Subject: Re: 2.4.22pre6aa1

Hi Hugh and Andrew,

On Saturday 16 August 2003 18:00, Hugh Dickins wrote:
> On Sat, 16 Aug 2003, Hugh Dickins wrote:
> > Or, as in the patch Sergey is currently testing below, shmem_truncate
> > must be prepared to truncate_inode_pages again. That's the approach
> > I originally implemented in 2.5, but I grew disgusted with it every
> > time I thought of partial truncation trundling twice through
> > truncate_inode_pages (it can easily be avoided when nrpages == 0,
> > but that's unlikely in partial truncation).
> >
> > So I added the VM_PAGEIN flag stuff to restrict it to when it might be
> > necessary, and extended it to cover other races when reading the page at
> > the same time as truncating (though I think generic_file_read has a
> > window of this kind that we've never worried about). I expect to split
> > the patch into several before sending them to Marcelo and Andrew.
>
> And here is the patch I claimed to be below.
> If you apply it to anything other than 2.4.22-pre6aa1,
> please be careful to check that it has applied correctly.
> Originally I made a patch against 2.4.22-pre6, and then applied to
> 2.4.22-pre6aa1: but I have never seen patch make such a mess of it!

I just want to confirm that I haven't been able to repeat this problem
with the patch below (applied to 2.4.22-pre7aa1) after more than
3 days of testing. I'll inform you if any issues arise.
Thank you!

>
> Hugh
>
> [Hugh's patch quoted in full - snipped here, as it appears unaltered
> in the previous message.]

--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc