2002-09-04 23:30:54

by Andrea Arcangeli

[permalink] [raw]
Subject: 2.4.20pre5aa1

AIO now seems to work fine and it's in complete sync with the 2.5 kernel
API:

rm -f testdir/rofile
echo "test" >testdir/rofile
chmod 400 testdir/rofile
rm -f testdir/rwfile
echo "test" >testdir/rwfile
chmod 600 testdir/rwfile
rm -f testdir/wofile
echo "test" >testdir/wofile
chmod 200 testdir/wofile
rm -f testdir.ext2/rwfile
echo "test" >testdir.ext2/rwfile
chmod 600 testdir.ext2/rwfile
./runtests.sh cases/2.p cases/3.p cases/4.p cases/5.p cases/6.p cases/7.p cases/8.p cases/10.p cases/11.p cases/12.p cases/13.p
Test run starting at Wed Sep 4 05:02:02 GMT 2002
Starting cases/2.p
expect -14: io_setup(-1000, 0xc0010000) = -14 [Bad address]
expect -14: io_setup( 1000, 0xc0010000) = -14 [Bad address]
expect -14: io_setup( 0, 0xc0010000) = -14 [Bad address]
expect -22: io_setup(-1000, 0xbfffde28) = -22 [Invalid argument]
expect -22: io_setup( -1, 0xbfffde28) = -22 [Invalid argument]
expect -22: io_setup( 0, 0xbfffde28) = -22 [Invalid argument]
expect 0: io_setup( 1, 0xbfffde28) = 0 [Success]
expect -22: io_setup( 1, 0xbfffde28) = -22 [Invalid argument]
test cases/2.t completed PASSED.
Completed cases/2.p with 0.
Starting cases/3.p
expect -22: io_submit(0xffffffff, 1, 0xbfffdd14) = -22 [Invalid argument]
expect 0: io_submit(0x40017000, 0, 0xbfffdd14) = 0 [Success]
expect -14: io_submit(0x40017000, 1, (nil)) = -14 [Bad address]
expect -14: io_submit(0x40017000, 1, 0xffffffff) = -14 [Bad address]
expect -14: io_submit(0x40017000, 2, 0xbfffdd0c) = -14 [Bad address]
expect -14: io_submit(0x40017000, 2, 0xbfffdd04) = -14 [Bad address]
expect -22: io_submit(0x40017000, -1, 0xbfffdd14) = -22 [Invalid argument]
test cases/3.t completed PASSED.
Completed cases/3.p with 0.
Starting cases/4.p
expect -9: (w), res = sync_submit: io_submit res=-9 [Bad file descriptor]
-9 [Bad file descriptor]
expect -9: (r), res = sync_submit: io_submit res=-9 [Bad file descriptor]
-9 [Bad file descriptor]
expect 512: (w), res = 512 [Success]
expect 512: (r), res = 512 [Success]
expect -22: (r), res = -22 [Invalid argument]
expect -22: (w), res = -22 [Invalid argument]
expect 0: (r), res = 0 [Success]
expect 4: (w), res = 4 [Success]
expect 4: (w), res = 4 [Success]
expect 8: (r), res = 8 [Success]
read after append: [12345678]
expect -14: (r), res = sync_submit: io_submit res=-14 [Bad address]
-14 [Bad address]
expect -14: (w), res = sync_submit: io_submit res=-14 [Bad address]
-14 [Bad address]
expect -14: (w), res = -14 [Bad address]
test cases/4.t completed PASSED.
Completed cases/4.p with 0.
Starting cases/5.p
expect 512: (w), res = 512 [Success]
expect 512: (r), res = 512 [Success]
expect 512: (r), res = 512 [Success]
expect 512: (w), res = 512 [Success]
expect 512: (w), res = 512 [Success]
expect -14: (r), res = -14 [Bad address]
expect 512: (r), res = 512 [Success]
expect -14: (w), res = -14 [Bad address]
test cases/5.t completed PASSED.
Completed cases/5.p with 0.
Starting cases/6.p
size = 1031628
expect 524288000: (w), res = 524288000 [Success]
expect 524288000: (r), res = 524288000 [Success]
test cases/6.t completed PASSED.
Completed cases/6.p with 0.
Starting cases/7.p
expect 512: (w), res = 512 [Success]
expect 512: (r), res = 512 [Success]
expect 511: (w), res = 511 [Success]
expect 511: (r), res = 511 [Success]
expect -27: (w), res = -27 [File too large]
expect 0: (r), res = 0 [Success]
expect 0: (w), res = 0 [Success]
test cases/7.t completed PASSED.
Completed cases/7.p with 0.
Starting cases/8.p
expect 512: (w), res = 512 [Success]
expect 512: (r), res = 512 [Success]
expect 511: (w), res = 511 [Success]
expect 511: (r), res = 511 [Success]
expect -27: (w), res = -27 [File too large]
expect 0: (r), res = 0 [Success]
expect 0: (w), res = 0 [Success]
test cases/8.t completed PASSED.
Completed cases/8.p with 0.
Starting cases/10.p
expect 103407616: (w), res = 103407616 [Success]
expect 103407616: (r), res = 103407616 [Success]
expect -28: (w), res = -28 [No space left on device]
expect 0: (r), res = 0 [Success]
expect 103407615: (w), res = 103407615 [Success]
expect 103407615: (r), res = 103407615 [Success]
expect 0: (r), res = 0 [Success]
expect -28: (w), res = -28 [No space left on device]
expect 0: (r), res = 0 [Success]
expect 0: (w), res = 0 [Success]
test cases/10.t completed PASSED.
Completed cases/10.p with 0.
Starting cases/11.p
completed 1000000 out of 1000000 writes
completed 1000000 out of 1000000 reads
test cases/11.t completed PASSED.
Completed cases/11.p with 0.
Starting cases/12.p
expect 0: io_submit(0x40017000, 0, (nil)) = 0 [Success]
expect -22: io_submit(0x40017000, 0, (nil)) = -22 [Invalid argument]
child exited with status 0
test cases/12.t completed PASSED.
Completed cases/12.p with 0.
Starting cases/13.p
expect 8: io_submit(0x40017000, 8, 0xbfffd67c) = 8 []
event[0]: write[0] okay, returned: 1048576 [Unknown error 4293918720]
event[1]: write[1] okay, returned: 1048576 [Unknown error 4293918720]
event[2]: write[2] okay, returned: 1048576 [Unknown error 4293918720]
event[3]: write[3] okay, returned: 1048576 [Unknown error 4293918720]
event[4]: write[4] okay, returned: 1048576 [Unknown error 4293918720]
event[5]: write[5] okay, returned: 1048576 [Unknown error 4293918720]
event[6]: write[6] okay, returned: 1048576 [Unknown error 4293918720]
event[7]: write[7] okay, returned: 1048576 [Unknown error 4293918720]
test cases/13.t completed PASSED.
Completed cases/13.p with 0.
Pass: 11 Fail: 0
^^^^^^^^^^^^^^^^^
Test run complete at Wed Sep 4 05:03:20 GMT 2002

The 2.5 libaio that works with this kernel and 2.5 is here:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/libaio/libaio-0.3.15-2.5.tar.gz

This libaio should work on all archs 32/64bit big/little endian (only
the syscall numbers are missing but a number of them aren't registered
yet, they can be easily added in the file syscall.h). The resulting
libaio binary will work with 2.5 kernels too. The API provided by the
above libaio is compatible with the libaio in the RHAS so Oracle for
RHAS should work fine on top of this kernel after installing the above
libaio.so.1 in /usr/lib/libaio.so.1 (it's still untested though).

Since nobody apparently knows the semantics of the io_queue_wait, such
call keeps doing nothing at the moment (and Oracle isn't using it so
it's not a problem). I reccomend nobody to use the io_queue_wait, and
I'd like to remove it in the long run unless somebody claims its
semantics.

Currently the kernel compiles only for x86 I guess, I will shortly
release a new -aa that hopefully will compile on some more
architectures, the changes needed should be very small.

One more thing, the make modules_install for some unknwon reasons
generates huge complains from depmod -ae at the end, just ignore them, I
don't know why these errors are showing up but the modules loads just
fine, so it has to be some kind of depmod of makefile mistake, but I'll
think about it after all archs compiles.

URL:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20pre5aa1.gz

Changelog diff between 2.4.19rc5aa1 and 2.4.20pre5aa1:

Only in 2.4.19rc5aa1: 000_e100-2.1.6.gz
Only in 2.4.19rc5aa1: 000_e1000-4.3.2.gz
Only in 2.4.19rc5aa1: 00_TCDRAIN-1
Only in 2.4.19rc5aa1: 00_blkdev-pagecache-alias-read_full_page-1
Only in 2.4.19rc5aa1: 00_block-highmem-all-19-1.gz
Only in 2.4.19rc5aa1: 00_dnotify-cleanup-2
Only in 2.4.19rc5aa1: 00_elevator-fixes-1
Only in 2.4.19rc5aa1: 00_ext3-0.9.18.gz
Only in 2.4.19rc5aa1: 00_ffs_compile_failure-1
Only in 2.4.19rc5aa1: 00_pci-dma_mask-1
Only in 2.4.19rc5aa1: 00_pre7ac3-p4-xeon-1
Only in 2.4.19rc5aa1: 00_set_64bit-atomic-1
Only in 2.4.19rc5aa1: 00_sock_fasync-memleak-1
Only in 2.4.19rc5aa1: 00_sr-scatter-pad-1
Only in 2.4.19rc5aa1: 00_stack-overflow-detection-1
Only in 2.4.19rc5aa1: 05_vm_00_touch-buffer-1
Only in 2.4.19rc5aa1: 05_vm_04_dump_stack-1
Only in 2.4.19rc5aa1: 05_vm_12_drain_cpu_caches-2
Only in 2.4.19rc5aa1: 05_vm_14_block_flushpage_check-2
Only in 2.4.19rc5aa1: 05_vm_19_nodev-cleanup-1
Only in 2.4.19rc5aa1: 10_pte-highmem-f00f-2
Only in 2.4.19rc5aa1: 20_kiobuf-slab-3
Only in 2.4.19rc5aa1: 30_x86_memfree-boot-cleanup-1
Only in 2.4.19rc5aa1: 30_x86_setup-boot-cleanup-5
Only in 2.4.19rc5aa1: 90_init-survive-threaded-race-5
Only in 2.4.19rc5aa1: 94_numaq-tsc-4

Merged in mainline.

Only in 2.4.19rc5aa1: 00_3.5G-address-space-4
Only in 2.4.20pre5aa1: 00_3.5G-address-space-5
Only in 2.4.20pre5aa1: 00_cpu-affinity-syscall-rml-2
Only in 2.4.19rc5aa1: 00_cpu-affinity-syscall-rml-2.4.19-pre9-1
Only in 2.4.19rc5aa1: 00_extraversion-5
Only in 2.4.20pre5aa1: 00_extraversion-6
Only in 2.4.19rc5aa1: 00_net-softirq-1
Only in 2.4.20pre5aa1: 00_net-softirq-2
Only in 2.4.19rc5aa1: 00_ordered-freeing-1
Only in 2.4.20pre5aa1: 00_ordered-freeing-2
Only in 2.4.19rc5aa1: 00_spinlock-no-egcs-1
Only in 2.4.20pre5aa1: 00_spinlock-no-egcs-2
Only in 2.4.19rc5aa1: 00_vm-cleanups-1
Only in 2.4.20pre5aa1: 00_vm-cleanups-2
Only in 2.4.19rc5aa1: 00_x86-optimize-apic-irq-and-cacheline-1
Only in 2.4.20pre5aa1: 00_x86-optimize-apic-irq-and-cacheline-2
Only in 2.4.19rc5aa1: 05_vm_03_vm_tunables-2
Only in 2.4.20pre5aa1: 05_vm_03_vm_tunables-3
Only in 2.4.19rc5aa1: 05_vm_08_try_to_free_pages_nozone-2
Only in 2.4.20pre5aa1: 05_vm_08_try_to_free_pages_nozone-3
Only in 2.4.19rc5aa1: 05_vm_09_misc_junk-2
Only in 2.4.20pre5aa1: 05_vm_09_misc_junk-3
Only in 2.4.19rc5aa1: 05_vm_17_rest-8
Only in 2.4.20pre5aa1: 05_vm_17_rest-9
Only in 2.4.19rc5aa1: 10_lvm-snapshot-check-2
Only in 2.4.20pre5aa1: 10_lvm-snapshot-check-3
Only in 2.4.19rc5aa1: 10_rawio-vary-io-11
Only in 2.4.20pre5aa1: 10_rawio-vary-io-12
Only in 2.4.19rc5aa1: 10_sched-o1-hyperthreading-2
Only in 2.4.20pre5aa1: 10_sched-o1-hyperthreading-3
Only in 2.4.19rc5aa1: 20_pte-highmem-26
Only in 2.4.20pre5aa1: 20_pte-highmem-27.gz
Only in 2.4.19rc5aa1: 30_irq-balance-12
Only in 2.4.20pre5aa1: 30_irq-balance-13
Only in 2.4.19rc5aa1: 60_tux-syscall-3
Only in 2.4.20pre5aa1: 60_tux-syscall-4
Only in 2.4.19rc5aa1: 70_xfs-1.1-5.gz
Only in 2.4.20pre5aa1: 70_xfs-1.1-6.gz
Only in 2.4.19rc5aa1: 80_x86_64-common-code-5
Only in 2.4.20pre5aa1: 80_x86_64-common-code-6
Only in 2.4.19rc5aa1: 90_buddyinfo-3
Only in 2.4.20pre5aa1: 90_buddyinfo-4
Only in 2.4.19rc5aa1: 90_proc-mapped-base-2
Only in 2.4.20pre5aa1: 90_proc-mapped-base-3
Only in 2.4.19rc5aa1: 90_s390-aa-2
Only in 2.4.20pre5aa1: 90_s390-aa-3
Only in 2.4.19rc5aa1: 90_s390x-aa-1
Only in 2.4.20pre5aa1: 90_s390x-aa-2
Only in 2.4.19rc5aa1: 93_NUMAQ-4
Only in 2.4.20pre5aa1: 93_NUMAQ-5
Only in 2.4.19rc5aa1: 94_discontigmem-meminfo-2
Only in 2.4.20pre5aa1: 94_discontigmem-meminfo-3
Only in 2.4.19rc5aa1: 96_inode_read_write-atomic-3
Only in 2.4.20pre5aa1: 96_inode_read_write-atomic-4
Only in 2.4.19rc5aa1: 9900_aio-2.gz
Only in 2.4.20pre5aa1: 9900_aio-4.gz
Only in 2.4.19rc5aa1: 9900_aio-API-x86-2
Only in 2.4.20pre5aa1: 9901_aio-API-x86-3

Rediffed.

Only in 2.4.20pre5aa1: 00_bdflush-docs-1

Updated the docs, from Christian Zoz.

Only in 2.4.20pre5aa1: 00_conntrack-hash-1

Fix the hashfn to avoid wasting cpu, from Martin Wilck.

Only in 2.4.20pre5aa1: 00_e100-compile-1

Fix compilation problem in e100 mainline.

Only in 2.4.20pre5aa1: 00_invlpg-386-1

Be in function of 386 rather than in function of global
page feature (as suggested by Linus).

Only in 2.4.19rc5aa1: 00_lowlatency-fixes-6
Only in 2.4.20pre5aa1: 00_lowlatency-fixes-8

Converted to cond_resched in mainline.

Only in 2.4.20pre5aa1: 00_max-mp-busses-1

Enlarge the number of busses to avoid random mem corruption
on some new machines (most important: add check to avoid
random mem corruption very hard to debug in the future).

Only in 2.4.19rc5aa1: 00_nfs-bkl-3
Only in 2.4.19rc5aa1: 00_nfs-rpc-ping-2
Only in 2.4.19rc5aa1: 00_nfs-seekdir-1
Only in 2.4.19rc5aa1: 00_nfs-svc_tcp-2
Only in 2.4.19rc5aa1: 10_nfs-o_direct-5
Only in 2.4.19rc5aa1: 10_o1-sched-nfs-1
Only in 2.4.20pre5aa1: 30_01_nfs-call-start-1
Only in 2.4.20pre5aa1: 30_02_call-reserve1-1
Only in 2.4.20pre5aa1: 30_03_call-reserve2-1
Only in 2.4.20pre5aa1: 30_04_noac-1
Only in 2.4.20pre5aa1: 30_05_seekdir-1
Only in 2.4.20pre5aa1: 30_06_cto2-1
Only in 2.4.20pre5aa1: 30_07_access-1
Only in 2.4.20pre5aa1: 30_08_rdplus-1
Only in 2.4.20pre5aa1: 30_09_o_direct-1
Only in 2.4.20pre5aa1: 30_10-lockd1-1
Only in 2.4.20pre5aa1: 30_11-lockd2-1
Only in 2.4.20pre5aa1: 30_12-lockd3-1
Only in 2.4.20pre5aa1: 30_13-lockd4-1

Merge latest nfs client code from Trond.

Only in 2.4.20pre5aa1: 00_o_direct-b_page-null-1

Let get_block understand it was a O_DIRECT asking for the
block using bh.b_page = NULL (reiserfs will use it).

Only in 2.4.19rc5aa1: 00_o_direct-blkdev-1
Only in 2.4.20pre5aa1: 00_o_direct-blkdev-2

Fix mem corruption bug that can trigger with initrd because
initrd has no softblock blocksize entry in the array of
the ramdisk major (the array has only 16 entries in function
of the compile time number of ramdisks, initrd is at the end).

Only in 2.4.20pre5aa1: 00_prepare-write-fixes-3-1

Also check the i_size is in sync with the last block we allocated in
the metadata, it won't be updated in the commit_write if prepare_write
is failing.

Only in 2.4.19rc5aa1: 00_rwsem-fair-30
Only in 2.4.19rc5aa1: 00_rwsem-fair-30-recursive-8
Only in 2.4.20pre5aa1: 00_rwsem-fair-31
Only in 2.4.20pre5aa1: 00_rwsem-fair-31-recursive-8

Add the trylock operations with the same semantics of mainline.

Only in 2.4.19rc5aa1: 00_sched-O1-aa-2.4.19rc3-1.gz
Only in 2.4.20pre5aa1: 00_sched-O1-aa-2.4.19rc3-2.gz

Merge sched_yield from Ingo with the same semantics of 2.4,
to fix pthreads slowdowns.

Only in 2.4.20pre5aa1: 00_usagi-ipv6-1

Avoid to bind to ipv4 too.

Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-30.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-31.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-34.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-36.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-40
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-41.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-43.gz
Only in 2.4.19rc5aa1: 50_uml-patch-2.4.18-47.gz
Only in 2.4.20pre5aa1: 50_uml-patch-2.4.19-1.gz
Only in 2.4.20pre5aa1: 51_uml-ac-to-aa-10
Only in 2.4.19rc5aa1: 51_uml-ac-to-aa-9
Only in 2.4.19rc5aa1: 52_uml-blockdev-1
Only in 2.4.19rc5aa1: 53_uml-page-struct-updates-1
Only in 2.4.19rc5aa1: 55_uml-page_address-3
Only in 2.4.19rc5aa1: 56_uml-pte-highmem-2
Only in 2.4.20pre5aa1: 56_uml-pte-highmem-3
Only in 2.4.19rc5aa1: 58_uml-hostfs-compile-1
Only in 2.4.19rc5aa1: 59_uml-compat-2.5-2
Only in 2.4.20pre5aa1: 59_uml-compat-2.5-3
Only in 2.4.19rc5aa1: 59_uml-yield-2

UML updates from Jeff.

Only in 2.4.19rc5aa1: 81_x86_64-arch-2.4.19rc1-1.gz
Only in 2.4.19rc5aa1: 82_x86_64-suse-2
Only in 2.4.20pre5aa1: 82_x86_64-suse-3
Only in 2.4.19rc5aa1: 83_x86_64-cvs-020716-1
Only in 2.4.19rc5aa1: 84_x86-64-mtrr-compile-1
Only in 2.4.19rc5aa1: 87_x86_64-o1sched-1
Only in 2.4.20pre5aa1: 87_x86_64-o1sched-2

Go in sync with mainline.

Only in 2.4.20pre5aa1: 9920_kgdb-1.gz

kgdb from akpm.

Only in 2.4.20pre5aa1: 9930_io_request_scale-3
Only in 2.4.20pre5aa1: 9931_io_request_scale-drivers-1

Avoid global io_request_lock scalability for a number
of scsi drivers.

Only in 2.4.20pre5aa1: 9940_ocfs-1.gz

Oracle Cluster Filesystem, developement release.

Only in 2.4.20pre5aa1: 9950_futex-1.gz

Futex locking available to userspace.

Andrea


2002-09-04 23:54:53

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

Andrea Arcangeli wrote:
>
> ...
>
> The 2.5 libaio that works with this kernel and 2.5 is here:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/libaio/libaio-0.3.15-2.5.tar.gz

Thanks for doing this. The lack of AIO-aware applications and of
testing tools for AIO is a bit of a worry at present.

> Only in 2.4.20pre5aa1: 00_prepare-write-fixes-3-1
>
> Also check the i_size is in sync with the last block we allocated in
> the metadata, it won't be updated in the commit_write if prepare_write
> is failing.

That's the right thing to do I guess. Is this a problem in practice?
If prepare_write() fails, generic_file_write() will truncate back to
the current i_size?

> Only in 2.4.20pre5aa1: 9920_kgdb-1.gz
>
> kgdb from akpm.

Life in the fast lane ;) Kudos to Amit Kale.

2002-09-05 00:21:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Wed, Sep 04, 2002 at 04:56:09PM -0700, Andrew Morton wrote:
> That's the right thing to do I guess. Is this a problem in practice?

I hit this issue with one of the testcases of the libaio, the 10.p, it
still fails on a 1024 blocksize filesystem, while it works fine on a 4k
fs of course (as you could see from the previous email). While
investigating I noticed we didn't update i_size there, but it doesn't
fix the problem. Anyways it's a minor issue, and there's no fs
corruption and it's not specific to aio.

The problem with the 10.p is that after the file fills the disk and the
write returns -ENOSPC, an ftruncate(0) won't be enough to allow further
writes (no matter if it's an async write or not) . The next write on the
same inode will return -ENOSPC too regardless if the i_size is zero and
the df returns 100% disk free. So such place is involved with the
problem, I verified the reason the async write post-trunate fails is
because the prepare_write returns -ENOSPEC, the async-io layer is
completely fine. After killing the task the problem goes away
automatically and the next run of 10.p works fine again until the
-ENOSPC + ftruncate(0). I was able to reproduce it also w/o async-io.
As said this isn't completely solved but it looks very low prio. In any
case no problem can happen on 4k softblocksize.

> If prepare_write() fails, generic_file_write() will truncate back to
> the current i_size?

Good point, so yes it shouldn't be needed and infact it doesn't make
differences, as said above the problem of the 10.p can still happen on a
1024 softblocksize despite my updates to the i_size. Now I wonder if the
vmtruncate there is involved. I feel like dropping the vmtruncate there
and lefting my i_size_write in the prepare_write could fix it, but I'm
unsure. I still don't see exactly where the post-truncate problem cames
from.

> > Only in 2.4.20pre5aa1: 9920_kgdb-1.gz
> >
> > kgdb from akpm.
>
> Life in the fast lane ;) Kudos to Amit Kale.

btw, I'm not enabling the frame pointer, I use gcc 3.2 so I want to use
the dwarf2 information in the vmlinux binary, and the dwarf2 resolver
inside gdb on the client side to provide reliable stack traces. I'm not
sure if that works yet, the dwarf2 part is still untested but that's the
long term plan. Next would be to make it working via netconsole so I
don't need to wire cables to many boxes, the only problem with that is
that netconsole isn't a two way communication link but like it polls for
transmits it could poll for receives too.

BTW, while merging aio from 2.5 to 2.4 and fixing and porting the libaio
(in particular thanks to one of Ben's testcases that was checkin for
this specific case) I found this bug in 2.5, I'd suggest to apply it to
mainline (CC'ed Linus):

--- aio-2.5/fs/aio.c.~1~ Tue Aug 27 22:09:49 2002
+++ aio-2.5/fs/aio.c Thu Sep 5 02:20:30 2002
@@ -992,7 +992,7 @@ asmlinkage long sys_io_submit(aio_contex
break;
}

- if (unlikely(__copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
+ if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
ret = -EFAULT;
break;
}

Andrea

2002-09-05 12:39:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

> Only in 2.4.20pre5aa1: 00_prepare-write-fixes-3-1
>
> Also check the i_size is in sync with the last block we allocated in
> the metadata, it won't be updated in the commit_write if prepare_write
> is failing.

When testing -aa + my xfs update without the 9* series. this gave an compile
error. Any chance you could move it after 96_inode_read_write-atomic-4?

2002-09-05 16:48:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 01:44:14PM +0100, Christoph Hellwig wrote:
> > Only in 2.4.20pre5aa1: 00_prepare-write-fixes-3-1
> >
> > Also check the i_size is in sync with the last block we allocated in
> > the metadata, it won't be updated in the commit_write if prepare_write
> > is failing.
>
> When testing -aa + my xfs update without the 9* series. this gave an compile
> error. Any chance you could move it after 96_inode_read_write-atomic-4?

correct, thanks.

btw, even if xfs is applied before the inode_read_write-atomic, please
make sure xfs will learn using the i_size_read when out of the semaphore
and i_size_write too. I know the locking is different there but I doubt
you're just managing the i_size without races. So it's up to you, either
move xfs patches after inode_read_Write-atomic, or make a separate
race-fix against the xfs codebase at the 70 level. It's not urgent btw,
the inode_read_write doesn't break anything yet (or I couldn't deal with
the flood of errors in every i_size read/write in all the fs out there),
so you won't notice it, but it gives you a chance to fix the races in
the "important" production filesystems. The pagecache/vfs layer should
be safe now, so most fs should be just safe with it. If the fixes will
came later as an incremental patch to apply at the end after a first xfs
update, that's fine with me.

Andrea

2002-09-05 17:05:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 06:53:07PM +0200, Andrea Arcangeli wrote:
> btw, even if xfs is applied before the inode_read_write-atomic, please
> make sure xfs will learn using the i_size_read when out of the semaphore
> and i_size_write too. I know the locking is different there but I doubt
> you're just managing the i_size without races.

XFS always has the XFS i_lock around accessing it. Either in read mode
or in write mode for updates (the lock is a so-called mrlock which
basically as a rwsem with a few subtile differences).

Anyway most acceses i_size in the new code are done by the generic
code now as XFS calls it internally. Take a look at the update I sent
you a few seconds ago :)

2002-09-05 18:36:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 06:09:04PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 05, 2002 at 06:53:07PM +0200, Andrea Arcangeli wrote:
> > btw, even if xfs is applied before the inode_read_write-atomic, please
> > make sure xfs will learn using the i_size_read when out of the semaphore
> > and i_size_write too. I know the locking is different there but I doubt
> > you're just managing the i_size without races.
>
> XFS always has the XFS i_lock around accessing it. Either in read mode
> or in write mode for updates (the lock is a so-called mrlock which
> basically as a rwsem with a few subtile differences).
>
> Anyway most acceses i_size in the new code are done by the generic
> code now as XFS calls it internally. Take a look at the update I sent
> you a few seconds ago :)

maybe I'm overlooking something but after a short read it seems you have
no lock in do_truncate->setattr like for all the other fs, so the
vmtruncate can run under reads and the i_size can change under you and
in turn you must always read it with i_size_read using asm, like all the
other fs, if you're not holding the i_sem (and you certainly aren't
holding the i_sem that frequently, you don't even for writes). this
because i_sem is the only lock/sem hold by truncate. Infact I'm unsure
how you serialize the i_size writes of truncate against the ones from
writes, that seems problematic too, the i_size could get a value past
the last block allocated (in turn corrupting the fs). Please double
check that I'm wrong, thanks.

Andrea

2002-09-05 18:43:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 08:41:25PM +0200, Andrea Arcangeli wrote:
> other fs, if you're not holding the i_sem (and you certainly aren't
> holding the i_sem that frequently, you don't even for writes).

Except of O_DIRECT writes we _do_ hold i_sem, btw.

2002-09-05 18:42:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 08:41:25PM +0200, Andrea Arcangeli wrote:
> maybe I'm overlooking something but after a short read it seems you have
> no lock in do_truncate->setattr like for all the other fs, so the
> vmtruncate can run under reads and the i_size can change under you and
> in turn you must always read it with i_size_read using asm, like all the
> other fs, if you're not holding the i_sem (and you certainly aren't
> holding the i_sem that frequently, you don't even for writes). this
> because i_sem is the only lock/sem hold by truncate. Infact I'm unsure
> how you serialize the i_size writes of truncate against the ones from
> writes, that seems problematic too, the i_size could get a value past
> the last block allocated (in turn corrupting the fs). Please double
> check that I'm wrong, thanks.

I think we should take the XFS-specific inode lock around vmtruncate.
Need to double-check with Steve.

2002-09-05 18:58:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 08:59:17PM +0200, Andrea Arcangeli wrote:
> > I think we should take the XFS-specific inode lock around vmtruncate.
> > Need to double-check with Steve.
>
> this is the function I'm looking at and it's called with xfs specific
> inode lock, and I don't see it grabbed either before calling vmtruncate:

should != do

we either need to use your accessors for i_size or take the XFS inode
lock around vmtruncate.

2002-09-05 18:54:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 07:46:49PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 05, 2002 at 08:41:25PM +0200, Andrea Arcangeli wrote:
> > maybe I'm overlooking something but after a short read it seems you have
> > no lock in do_truncate->setattr like for all the other fs, so the
> > vmtruncate can run under reads and the i_size can change under you and
> > in turn you must always read it with i_size_read using asm, like all the
> > other fs, if you're not holding the i_sem (and you certainly aren't
> > holding the i_sem that frequently, you don't even for writes). this
> > because i_sem is the only lock/sem hold by truncate. Infact I'm unsure
> > how you serialize the i_size writes of truncate against the ones from
> > writes, that seems problematic too, the i_size could get a value past
> > the last block allocated (in turn corrupting the fs). Please double
> > check that I'm wrong, thanks.
>
> I think we should take the XFS-specific inode lock around vmtruncate.
> Need to double-check with Steve.

this is the function I'm looking at and it's called with xfs specific
inode lock, and I don't see it grabbed either before calling vmtruncate:

+STATIC int
+linvfs_setattr(
+ struct dentry *dentry,
+ struct iattr *attr)
+{
+ struct inode *inode = dentry->d_inode;
+ vnode_t *vp = LINVFS_GET_VP(inode);
+ vattr_t vattr;
+ unsigned int ia_valid = attr->ia_valid;
+ int error;
+ int flags = 0;
+
+ memset(&vattr, 0, sizeof(vattr_t));
+ if (ia_valid & ATTR_UID) {
+ vattr.va_mask |= AT_UID;
+ vattr.va_uid = attr->ia_uid;
+ }
+ if (ia_valid & ATTR_GID) {
+ vattr.va_mask |= AT_GID;
+ vattr.va_gid = attr->ia_gid;
+ }
+ if (ia_valid & ATTR_SIZE) {
+ vattr.va_mask |= AT_SIZE;
+ vattr.va_size = attr->ia_size;
+ }
+ if (ia_valid & ATTR_ATIME) {
+ vattr.va_mask |= AT_ATIME;
+ vattr.va_atime.tv_sec = attr->ia_atime;
+ vattr.va_atime.tv_nsec = 0;
+ }
+ if (ia_valid & ATTR_MTIME) {
+ vattr.va_mask |= AT_MTIME;
+ vattr.va_mtime.tv_sec = attr->ia_mtime;
+ vattr.va_mtime.tv_nsec = 0;
+ }
+ if (ia_valid & ATTR_CTIME) {
+ vattr.va_mask |= AT_CTIME;
+ vattr.va_ctime.tv_sec = attr->ia_ctime;
+ vattr.va_ctime.tv_nsec = 0;
+ }
+ if (ia_valid & ATTR_MODE) {
+ vattr.va_mask |= AT_MODE;
+ vattr.va_mode = attr->ia_mode;
+ if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+ inode->i_mode &= ~S_ISGID;
+ }
+
+ if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET))
+ flags = ATTR_UTIME;
+
+ VOP_SETATTR(vp, &vattr, flags, NULL, error);
+ if (error)
+ return(-error); /* Positive error up from XFS */
+ if (ia_valid & ATTR_SIZE) {
+ error = vmtruncate(inode, attr->ia_size);
+ }
+
+ if (!error) {
+ vn_revalidate(vp, 0);
+ mark_inode_dirty_sync(inode);
+ }
+ return error;
+}

Andrea

2002-09-05 19:10:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 09:06:00PM +0200, Andrea Arcangeli wrote:
> On Thu, Sep 05, 2002 at 07:48:24PM +0100, Christoph Hellwig wrote:
> > On Thu, Sep 05, 2002 at 08:41:25PM +0200, Andrea Arcangeli wrote:
> > > other fs, if you're not holding the i_sem (and you certainly aren't
> > > holding the i_sem that frequently, you don't even for writes).
> >
> > Except of O_DIRECT writes we _do_ hold i_sem, btw.
>
> can't you end with this?
>
> O_DIRECT write
> write finishes
> truncate drops the write
> truncate set i_sem to 0
> write set i_sem to something

s/i_sem/i_size/g ?

> and the fs is then corrupt? (very minor corruption of course and
> extremely hard to trigger, trivially solvable by an fsck, ext[23] had
> similar issues too with the get_block failures with < PAGE_SIZE
> softblocksize, fixed around 2.4.19, that was certainly easier to
> reproduce btw)

Won't happen:


O_DIRECT write starts
+ takes XFS iolock exclusive
- invalidates pagecache
+ downgrades iolock to shared
- perform write

xfs_setattr for truncate called
+ takes XFS iolock shared
-> blocks

- write i_size to something
+ releases iolock
-> gets woken

2002-09-05 19:01:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 07:48:24PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 05, 2002 at 08:41:25PM +0200, Andrea Arcangeli wrote:
> > other fs, if you're not holding the i_sem (and you certainly aren't
> > holding the i_sem that frequently, you don't even for writes).
>
> Except of O_DIRECT writes we _do_ hold i_sem, btw.

can't you end with this?

O_DIRECT write
write finishes
truncate drops the write
truncate set i_sem to 0
write set i_sem to something

and the fs is then corrupt? (very minor corruption of course and
extremely hard to trigger, trivially solvable by an fsck, ext[23] had
similar issues too with the get_block failures with < PAGE_SIZE
softblocksize, fixed around 2.4.19, that was certainly easier to
reproduce btw)

Andrea

2002-09-05 19:08:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 08:02:40PM +0100, Christoph Hellwig wrote:
> we either need to use your accessors for i_size or take the XFS inode
> lock around vmtruncate.

the latter would take care of O_DIRECT too I think. Of course it's
mostly a theorical issue, I mentioned it just so you would check it,
keep it in mind and eventually fix it, we had this kind of races in the
32bit architectures in on all the fs for ages, infact you know 2.4-aa is
the only tree out there with these race fixed for most important fs, 2.4
and 2.5 mainline are still racy too (2.4 because it was a recent
discovery, 2.5 because it's my mistake that I didn't yet had time to
submit fixes, btw, if anybody is interested to port to 2.5 that's
welcome). For the normal fs I didn't want to add locks around the read
and truncate paths, and that's why I implemented the lockless accessors,
also consiering the accessors are zerocost noops on all the 64bit archs
(long [or should now say "short" :) ] term thinking).

Andrea

2002-09-05 19:13:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 09:13:25PM +0200, Andrea Arcangeli wrote:
> the latter would take care of O_DIRECT too I think. Of course it's
> mostly a theorical issue, I mentioned it just so you would check it,
> keep it in mind and eventually fix it, we had this kind of races in the
> 32bit architectures in on all the fs for ages, infact you know 2.4-aa is
> the only tree out there with these race fixed for most important fs, 2.4
> and 2.5 mainline are still racy too (2.4 because it was a recent
> discovery, 2.5 because it's my mistake that I didn't yet had time to
> submit fixes, btw, if anybody is interested to port to 2.5 that's
> welcome). For the normal fs I didn't want to add locks around the read
> and truncate paths, and that's why I implemented the lockless accessors,
> also consiering the accessors are zerocost noops on all the 64bit archs
> (long [or should now say "short" :) ] term thinking).

For 2.5 I'd prefer to make i_sem a r/semaphore and take it in read mode
instead of the lockless games we play with 64bit sizes currently.

I think this should also give a nice speedup as e.g. readdir or lookup
could happen in parallel then.

2002-09-05 19:22:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.20pre5aa1

On Thu, Sep 05, 2002 at 08:15:30PM +0100, Christoph Hellwig wrote:
> Won't happen:
>
>
> O_DIRECT write starts
> + takes XFS iolock exclusive
> - invalidates pagecache
> + downgrades iolock to shared
> - perform write
>
> xfs_setattr for truncate called
> + takes XFS iolock shared

^^^ exclusive ^^^

> -> blocks
>
> - write i_size to something
> + releases iolock
> -> gets woken
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
---end quoted text---