2002-07-19 00:06:23

by Daniel McNeil

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Thu, 2002-07-18 at 03:35, Andrea Arcangeli wrote:
> On Wed, Jul 17, 2002 at 05:10:41PM -0700, Daniel McNeil wrote:
> > We coded up something very similar creating a get_64bit() using the
> > cmpxchg8 instruction and using the set_64bit() already there.
> >
> > I created a patch to have cp_new_stat64() use get_64bit() and
> > the changes to i_size to use set_64bit(). I tested this patch
> > on an 8 proc machine running 2.4.18. This did fix the race
> > with extending the i_size past 4GB when another process is
> > reading i_size (by doing a stat()). Here are some process
>
> Actually stat can confuse userspace but it's the least of all problems,
> the real issue is expanding truncate or lseek+write beyond the end of the
> file that can generate metadata fs corruption with a parallel writepage.

fstat() was an easy way to test for incorrect reading of i_size,
so I could test the cmpxchg8 code to read a 64-bit value.
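
A minimal reconstruction of that kind of test harness, for illustration
only (not the original program; the file name, the argument handling and
running it under time(1) to get the numbers above are all assumptions):

#define _FILE_OFFSET_BITS 64	/* 64bit off_t even on 32bit userland */
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int nprocs = argc > 1 ? atoi(argv[1]) : 1;
	const char *path = argc > 2 ? argv[2] : "testfile";
	int i;

	for (i = 0; i < nprocs; i++) {
		if (fork() == 0) {
			/* each child opens its own fd to the same file */
			struct stat st;
			int fd = open(path, O_RDONLY);
			int j;

			if (fd < 0) {
				perror("open");
				_exit(1);
			}
			for (j = 0; j < 500000; j++)
				fstat(fd, &st);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;	/* reap all children; run the whole thing under time(1) */
	return 0;
}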
>
> > times for running parallel fstat()s to the same file on an SMP
> > 8-proc box:
> >
> > Test program doing fstat() 500,000 times to a single file.
> >
> > # of processes    times
> > 1                 0.17user   0.34system  0:00.51elapsed
> > 2                 0.40user   0.85system  0:00.62elapsed
> > 3                 0.45user   2.18system  0:00.88elapsed
> > 4                 0.74user   3.54system  0:01.08elapsed
> > 5                 0.68user   5.58system  0:01.33elapsed
> > 6                 0.97user   8.57system  0:01.82elapsed
> > 7                 1.14user  12.93system  0:02.10elapsed
> > 8                 1.20user  22.52system  0:02.98elapsed
> >
> > As you can see the fstat()s do not scale well. I think this is due
> > to the cmpxchg8 instruction actually writing to the cache line
> > even if the compare fails. The software developers manual says:
>
> fstat or any similar operation run in parallel on the same file
> descriptor isn't going to scale very well anyways because of the fget
> and atomic-inc/dec cacheline thrashing.
>
> If instead you were using different file descriptors then if it doesn't
> scale well that's the cmpxchg8. And of course we expect it not to scale
> well with the fix applied, so it's all right in all cases.
>

This is with different file descriptors, so the bad scaling is from the
cmpxchg8. The same test on 2.4.18 without the patch was:

8 processes 1.51user 2.53system 0:00.51elapsed

> >
> > "The destination operand is written back if the comparison fails;
> > otherwise the source operand is written into the destination. (The
> > processor never produces a locked read without also producing a locked
> > write.)"
> >
> > We were not using a lock prefix, but it appears it is doing the write
>
> that's a bug as far as I can see, without the lock prefix it's not atomic,
> so it should generate the same corruption even using cmpxchg, see also
> the fix for set_64bit that I did separately even for the pagetable
> walking. I wrote a fat comment with a pointer to the page of the pdf that
> states that without the lock it's anything but atomic.
>
> > anyway. That would explain the bad scaling.
>
> of course, with the fix applied, every time we read the inode without the
> i_sem we will have to make a write that will make it impossible to share
> the cacheline in read-only mode across the cpus, and it will generate the
> bouncing if the same inode is accessed at the same time by multiple
> cpus. that's expected :)

This cacheline bouncing is the thing I don't like. It was not clear
from the cmpxchg8 documentation whether any data was written when the
comparison failed without the lock prefix. By using an unexpected
negative value for the compare value, the compare would fail and the
instruction would just read the 64bit value. Unfortunately, it looks
like it does write (the old value, if the comparison fails). I think
you are right that this would be a bug then, since a racing update
to i_size could slip in without the lock prefix.
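
For illustration, the trick being described would look roughly like this
(hypothetical sketch, never part of any patch; as the manual text quoted
above says, cmpxchg8b writes the destination back even when the compare
fails, so this still dirties the cacheline, and without the lock prefix
it is not atomic either -- which is exactly the bug discussed here):

/* Prime edx:eax with -1, a value i_size is assumed never to hold, so
 * the compare always fails and cmpxchg8b "just" loads the current
 * 64bit value into edx:eax. */
static inline unsigned long long get_64bit_failing_cmp(unsigned long long *ptr)
{
	unsigned long low = 0xffffffff;
	unsigned long high = 0xffffffff;

	__asm__ __volatile__(
		"cmpxchg8b (%2)"
		: "+a" (low), "+d" (high)	/* old value lands here */
		: "r" (ptr), "b" (low), "c" (high)
		: "memory", "cc");

	return low | ((unsigned long long) high << 32);
}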

>
> But it will scale if the different cpus access different inodes
> simultaneously. It's like the per-inode radix tree lock of 2.5: if all
> cpus work on the same inode, a per-inode lock or a global lock won't make
> a difference for that workload.
>
> The only other possible fix is to change the highlevel locking and to
> require all i_size readers to hold the i_sem in read mode (of course the
> i_sem should become a rw semaphore), this will also avoid the silly
> checks for page->mapping in generic_file_read, and secondly we should
> make those readers recursive, so there's no fs reentrancy problem when
> the vm does writepage, that is trivially doable with my rwsem, they
> provide the recursive functionality in an abstracted manner, so it would
> be truly easy to implement that more robust and simpler locking
> mechanism, OTOH that is a radical locking change in the core VM/vfs
> while what I implemented right now is the strict bugfix that doesn't
> change the locking design.
>

If a read lock is used to read i_size, then the rw semaphore would cause
the same cache line bouncing as the cmpxchg8.

> BTW, while fixing 486/386 support, I will also add another optimization
> that is to avoid the cmpxchg not only for 64bit archs, but also for UP
> compilations, as said that's only an SMP race that cannot trigger on UP
> (all assuming there is no -preempt, but there **cannot** be any -preempt
> in 2.4 anyways, but while forward porting the fix to 2.5 if -preempt is
> enabled the kernel will have to use cmpxchg for reading i_size, really
> with -preempt and UP compilation for reading and writing the i_size the
> cmpxchg doesn't need to put the lock on the bus while it always needs
> the lock on the bus for SMP compiles.
>
> I will upload an update after I fixed 486/386 and I optimized away the
> cmpxchg for UP compiles.
>
> If you have suggestions that's welcome, thanks.

I'll let you know if I think of any.

Daniel



2002-07-19 09:20:05

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Thu, Jul 18, 2002 at 05:09:21PM -0700, Daniel McNeil wrote:
> On Thu, 2002-07-18 at 03:35, Andrea Arcangeli wrote:
> > On Wed, Jul 17, 2002 at 05:10:41PM -0700, Daniel McNeil wrote:
> > > We coded up something very similar creating a get_64bit() using the
> > > cmpxchg8 instruction and using the set_64bit() already there.
> > >
> > > I created a patch to have cp_new_stat64() use get_64bit() and
> > > the changes to i_size to use set_64bit(). I tested this patch
> > > on an 8 proc machine running 2.4.18. This did fix the race
> > > with extending the i_size past 4GB when another process is
> > > reading i_size (by doing a stat()). Here are some process
> >
> > Actually stat can confuse userspace but it's the least of all problems,
> > the real issue is expanding truncate or lseek+write beyond the end of the
> > file that can generate metadata fs corruption with a parallel writepage.
>
> fstat() was an easy way to test for incorrect reading of i_size,
> so I could test the cmpxchg8 code to read a 64-bit value.

yes that's probably the simpler way to reproduce it.

> >
> > > times for running parallel fstat()s to the same file on an SMP
> > > 8-proc box:
> > >
> > > Test program doing fstat() 500,000 times to a single file.
> > >
> > > # of processes    times
> > > 1                 0.17user   0.34system  0:00.51elapsed
> > > 2                 0.40user   0.85system  0:00.62elapsed
> > > 3                 0.45user   2.18system  0:00.88elapsed
> > > 4                 0.74user   3.54system  0:01.08elapsed
> > > 5                 0.68user   5.58system  0:01.33elapsed
> > > 6                 0.97user   8.57system  0:01.82elapsed
> > > 7                 1.14user  12.93system  0:02.10elapsed
> > > 8                 1.20user  22.52system  0:02.98elapsed
> > >
> > > As you can see the fstat()s do not scale well. I think this is due
> > > to the cmpxchg8 instruction actually writing to the cache line
> > > even if the compare fails. The software developers manual says:
> >
> > fstat or any similar operation run in parallel on the same file
> > descriptor isn't going to scale very well anyways because of the fget
> > and atomic-inc/dec cacheline thrashing.
> >
> > If instead you were using different file descriptors then if it doesn't
> > scale well that's the cmpxchg8. And of course we expect it not to scale
> > well with the fix applied, so it's all right in all cases.
> >
>
> This is with different file descriptors, so the bad scaling is from the
> cmpxchg8. The same test on 2.4.18 without the patch was:
>
> 8 processes 1.51user 2.53system 0:00.51elapsed
>
> > >
> > > "The destination operand is written back if the comparison fails;
> > > otherwise the source operand is written into the destination. (The
> > > processor never produces a locked read without also producing a locked
> > > write.)"
> > >
> > > We were not using a lock prefix, but it appears it is doing the write
> >
> > that's a bug as far as I can see, without the lock prefix it's not atomic,
> > so it should generate the same corruption even using cmpxchg, see also
> > the fix for set_64bit that I did separately even for the pagetable
> > walking. I wrote a fat comment with a pointer to the page of the pdf that
> > states that without the lock it's anything but atomic.
> >
> > > anyway. That would explain the bad scaling.
> >
> > of course, with the fix applied, every time we read the inode without the
> > i_sem we will have to make a write that will make it impossible to share
> > the cacheline in read-only mode across the cpus, and it will generate the
> > bouncing if the same inode is accessed at the same time by multiple
> > cpus. that's expected :)
>
> This cacheline bouncing is the thing I don't like. It was not clear

I don't like it either of course, but I couldn't figure out a way to read 64bit
atomically without generating it.

> If a read lock is used to read i_size, then the rw semaphore would cause
> the same cache line bouncing as the cmpxchg8.

Of course, and certainly we cannot use an extremely write-slow lock like the
brlock that isn't even schedule aware (without even counting the unfairness
it introduces for writes), because it would be too slow and RCU as well isn't
feasible, unless we make the i_size a pointer to a long long (rather than a
long long), which would break again quite a lot of code.
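
For illustration, the pointer-indirection variant being dismissed here
would look roughly like this (hypothetical sketch: i_size_ptr is an
invented field, error handling is omitted, and the impossible kfree is
the point):

/* readers need only one atomic 32bit pointer load... */
static inline loff_t i_size_read_indirect(struct inode *inode)
{
	return *inode->i_size_ptr;	/* hypothetical field */
}

/* ...but the writer must publish a fresh allocation every time */
static void i_size_write_indirect(struct inode *inode, loff_t size)
{
	loff_t *new = kmalloc(sizeof(*new), GFP_KERNEL);
	loff_t *old;

	*new = size;
	wmb();	/* publish the contents before the pointer */
	old = xchg(&inode->i_size_ptr, new);
	/*
	 * Cannot kfree(old) here: another cpu may still be
	 * dereferencing it. Waiting until it cannot be is
	 * exactly the grace period 2.4 does not provide, and
	 * every direct i_size user would break anyway.
	 */
}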

I was suggesting the read-write recursive lock just as a viable solution that
would make the whole thing much cleaner and simpler. But it has the downside
of changing the locking design and breaking all i_sem users. I wouldn't
worry too much about the cacheline bouncing that the read-recursive rw sem
introduces; the most important paths are read/write, and those are going to beat on
the global (or per-inode) pagecache lock anyways. So a per-inode rw lock
shouldn't really be too bad compared to the per-inode radix tree lock or global
pagecache lock (just like the cmpxchg8 shouldn't be too bad either, for the same
reason).

>
> > BTW, while fixing 486/386 support, I will also add another optimization
> > that is to avoid the cmpxchg not only for 64bit archs, but also for UP
> > compilations, as said that's only an SMP race that cannot trigger on UP
> > (all assuming there is no -preempt, but there **cannot** be any -preempt
> > in 2.4 anyways, but while forward porting the fix to 2.5 if -preempt is
> > enabled the kernel will have to use cmpxchg for reading i_size, really
> > with -preempt and UP compilation for reading and writing the i_size the
> > cmpxchg doesn't need to put the lock on the bus while it always needs
> > the lock on the bus for SMP compiles.
> >
> > I will upload an update after I fixed 486/386 and I optimized away the
> > cmpxchg for UP compiles.
> >
> > If you have suggestions that's welcome, thanks.
>
> I'll let you know if I think of any.

I guess cmpxchg8b is the best way at least for 2.4.

Andrea

2002-07-19 22:53:40

by Daniel McNeil

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Fri, 2002-07-19 at 02:23, Andrea Arcangeli wrote:
> On Thu, Jul 18, 2002 at 05:09:21PM -0700, Daniel McNeil wrote:
> > fstat() was an easy way to test for incorrect reading of i_size,
> > so I could test the cmpxchg8 code to read a 64-bit value.
>
> yes that's probably the simpler way to reproduce it.
>
> > This cacheline bouncing is the thing I don't like. It was not clear
>
> I don't like it either of course, but I couldn't figure out a way to read 64bit
> atomically without generating it.
>
> > >
> > > If you have suggestions that's welcome, thanks.
> >
> > I'll let you know if I think of any.
>
> I guess cmpxchg8b is the best way at least for 2.4.
>
> Andrea

Andrea,

Here is another approach. I added two version fields to the inode
structure. The first one is updated before i_size and the 2nd is
updated after, with memory barriers in between. The i_size_read()
samples the version fields and i_size and loops until it can read
i_size without an i_size update happening at the same time. It is
not pretty, but it does fix the problem, the cache line is not
written by i_size_read(), and it should work on all architectures.
I've tested this on a two proc system.

Let me know what you think,

Daniel

--- linux-2.4.19-rc2aa1/include/linux/fs.h	Fri Jul 19 15:22:08 2002
+++ linux-2.4.19-rc2aa1.new/include/linux/fs.h	Fri Jul 19 15:22:25 2002
@@ -501,6 +501,8 @@
 	atomic_t		i_writecount;
 	unsigned int		i_attr_flags;
 	__u32			i_generation;
+	volatile short		i_attr_version1;
+	volatile short		i_attr_version2;
 	union {
 		struct minix_inode_info		minix_i;
 		struct ext2_inode_info		ext2_i;
@@ -539,7 +541,23 @@
 static inline loff_t i_size_read(struct inode * inode)
 {
 #if BITS_PER_LONG==32
-	return (loff_t) read_64bit((unsigned long long *) &inode->i_size);
+	short v1;
+	short v2;
+	loff_t i_size;
+
+	/*
+	 * retry if i_size was possibly modified while sampling.
+	 */
+	do {
+		v1 = inode->i_attr_version1;
+		v2 = inode->i_attr_version2;
+		rmb();
+		i_size = inode->i_size;
+		rmb();
+	} while (v1 != v2 ||
+		 v1 != inode->i_attr_version1 ||
+		 v1 != inode->i_attr_version2);
+	return i_size;
 #elif BITS_PER_LONG==64
 	return inode->i_size;
 #endif
@@ -548,8 +566,12 @@
 static inline void i_size_write(struct inode * inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32
-	set_64bit((unsigned long long *) &inode->i_size,
-		  (unsigned long long) i_size);
+	inode->i_attr_version1++;	/* changing i_size */
+	wmb();
+	inode->i_size = i_size;
+	wmb();
+	inode->i_attr_version2++;	/* done with change */
+	wmb();
 #elif BITS_PER_LONG==64
 	inode->i_size = i_size;
 #endif

2002-07-23 16:52:14

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Fri, Jul 19, 2002 at 03:56:36PM -0700, Daniel McNeil wrote:
> Here is another approach. I added two version fields to the inode
> structure. The first one is updated before i_size and the 2nd is
> updated after, with memory barriers in between. The i_size_read()
> samples the version fields and i_size and loops until it can read
> i_size without an i_size update happening at the same time. It is
> not pretty, but it does fix the problem, the cache line is not
> written by i_size_read(), and it should work on all architectures.
> I've tested this on a two proc system.

I also considered this possibility before taking the other approach. I
thought it was inferior because it adds branches and it increases the
dcache pressure, so I thought just marking our cacheline dirty and
reading it in one go, with no additional overhead, would have been a win
(the least possible number of cycles and no branch prediction issues). Of
course the below will allow parallel i_size readers to scale, but again,
I think the fstat benchmark doesn't matter much and true parallel
readers on the same inode (not only i_size readers) will have to collide
on the pagecache_lock anyways (even in 2.5). So I still think the
cmpxchg8b is a win despite it marking the i_size cacheline dirty, but
somebody should probably benchmark it to verify that the major
bottleneck remains the pagecache_lock.

I actually applied the below, but I enabled it only for the non-x86 32bit
archs (like s390, ppc) where I have no idea how to code a get_64bit
in asm. It should be definitely better than a separate spinlock
protecting the i_size.

comments?

Andrea

2002-07-23 17:04:17

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Fri, Jul 19, 2002 at 03:56:36PM -0700, Daniel McNeil wrote:
> + short v1;
> + short v2;
> + loff_t i_size;
> +
> + /*
> + * retry if i_size was possibly modified while sampling.
> + */
> + do {
> + v1 = inode->i_attr_version1;
> + v2 = inode->i_attr_version2;
> + rmb();
> + i_size = inode->i_size;
> + rmb();
> + } while (v1 != v2 ||
> + v1 != inode->i_attr_version1 ||
> + v1 != inode->i_attr_version2);
> + return i_size;
[..]
> #elif BITS_PER_LONG==64
> return inode->i_size;
> #endif
> @@ -548,8 +566,12 @@
> static inline void i_size_write(struct inode * inode, loff_t i_size)
> {
> #if BITS_PER_LONG==32
> - set_64bit((unsigned long long *) &inode->i_size,
> - (unsigned long long) i_size);
> + inode->i_attr_version1++; /* changing i_size */
> + wmb();
> + inode->i_size = i_size;
> + wmb();
> + inode->i_attr_version2++; /* done with change */
> + wmb();
> #elif BITS_PER_LONG==64
> inode->i_size = i_size;
> #endif

btw, looking at the implementation, the read side was buggy. First, it's
pointless to read both version1 and version2 before reading the i_size;
secondly, if you increase version1 first (in the writer), the reader has
to read version2 before i_size and version1 after i_size, while you're
doing the opposite: you compare v1 (version1), read before i_size, with
version2 read after i_size.

So while merging it I rewrote it this way (I also changed the type of the
sequence number to int: 2^16 can overflow quite fast on a multi-GHz cpu,
mostly out of paranoia of course, and to stay safe with the atomic granularity
provided by archs -- an int certainly has to be atomic-granular).

	loff_t i_size;
	int v1, v2;

	/* Retry if i_size was possibly modified while sampling. */
	do {
		v1 = inode->i_size_version1;
		rmb();
		i_size = inode->i_size;
		rmb();
		v2 = inode->i_size_version2;
	} while (v1 != v2);

	return i_size;


and the new writer side:

	inode->i_size_version2++;
	wmb();
	inode->i_size = i_size;
	wmb();
	inode->i_size_version1++;
	wmb(); /* make it visible ASAP */
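
As a self-contained illustration of the corrected protocol, here is a
userspace analogue (not kernel code; __sync_synchronize() stands in for
both rmb() and wmb(), which is stronger than strictly required):

#include <stdio.h>

#define rmb()	__sync_synchronize()
#define wmb()	__sync_synchronize()

static volatile unsigned int version1, version2;
static volatile unsigned long long size;	/* stands in for i_size */

static void size_write(unsigned long long new_size)
{
	version2++;
	wmb();
	size = new_size;	/* may be torn into two 32bit stores */
	wmb();
	version1++;
	wmb();			/* make it visible ASAP */
}

static unsigned long long size_read(void)
{
	unsigned long long snap;
	unsigned int v1, v2;

	/* retry whenever a write overlapped the sample */
	do {
		v1 = version1;
		rmb();
		snap = size;
		rmb();
		v2 = version2;
	} while (v1 != v2);

	return snap;
}

int main(void)
{
	size_write(5ULL << 32);	/* a value that needs both 32bit halves */
	printf("%llu\n", size_read());
	return 0;
}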

Andrea

2002-07-23 18:02:29

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Tue, Jul 23, 2002 at 07:08:07PM +0200, Andrea Arcangeli wrote:
> So while merging it I rewrote it this way (I also changed the type of the

here is the final full patch:

diff -urNp race/fs/inode.c race-fix/fs/inode.c
--- race/fs/inode.c	Tue Jul 23 18:46:47 2002
+++ race-fix/fs/inode.c	Tue Jul 23 19:13:02 2002
@@ -111,6 +111,7 @@ static void init_once(void * foo, kmem_c
 		sema_init(&inode->i_sem, 1);
 		sema_init(&inode->i_zombie, 1);
 		spin_lock_init(&inode->i_data.i_shared_lock);
+		i_size_ordered_init(inode);
 	}
 }

diff -urNp race/include/asm-i386/system.h race-fix/include/asm-i386/system.h
--- race/include/asm-i386/system.h	Tue Jul 23 18:46:44 2002
+++ race-fix/include/asm-i386/system.h	Tue Jul 23 18:47:10 2002
@@ -143,6 +143,8 @@ struct __xchg_dummy { unsigned long a[10
 #define __xg(x) ((struct __xchg_dummy *)(x))


+#ifdef CONFIG_X86_CMPXCHG
+#define __ARCH_HAS_GET_SET_64BIT 1
 /*
  * The semantics of XCHGCMP8B are a bit strange, this is why
  * there is a loop and the loading of %%eax and %%edx has to
@@ -167,7 +169,7 @@ static inline void __set_64bit (unsigned
 		"lock cmpxchg8b (%0)\n\t"
 		"jnz 1b"
 		: /* no outputs */
-		: "D"(ptr),
+		: "r"(ptr),
 		  "b"(low),
 		  "c"(high)
 		: "ax","dx","memory");
@@ -197,6 +199,32 @@ static inline void __set_64bit_var (unsi
  __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
  __set_64bit(ptr, ll_low(value), ll_high(value)) )

+
+/*
+ * The memory clobber is needed in the read side only if
+ * there is an unsafe writer before the get_64bit, which should
+ * never be the case, but just to be safe.
+ */
+static inline unsigned long long get_64bit(unsigned long long * ptr)
+{
+	unsigned long low, high;
+
+	__asm__ __volatile__ (
+		"\n1:\t"
+		"movl (%2), %%eax\n\t"
+		"movl 4(%2), %%edx\n\t"
+		"movl %%eax, %%ebx\n\t"
+		"movl %%edx, %%ecx\n\t"
+		"lock cmpxchg8b (%2)\n\t"
+		"jnz 1b"
+		: "=&b" (low), "=&c" (high)
+		: "r" (ptr)
+		: "ax","dx","memory");
+
+	return low | ((unsigned long long) high << 32);
+}
+#endif /* CONFIG_X86_CMPXCHG */
+
 /*
  * Note: no "lock" prefix even on SMP: xchg always implies lock anyway
  * Note 2: xchg has side effect, so that attribute volatile is necessary,
diff -urNp race/include/linux/fs.h race-fix/include/linux/fs.h
--- race/include/linux/fs.h	Tue Jul 23 18:46:48 2002
+++ race-fix/include/linux/fs.h	Tue Jul 23 19:13:51 2002
@@ -449,6 +449,13 @@ struct block_device {
 	struct list_head	bd_inodes;
 };

+#if BITS_PER_LONG==32 && defined(CONFIG_SMP) && !defined(__ARCH_HAS_GET_SET_64BIT)
+#define __NEED_I_SIZE_ORDERED
+#define i_size_ordered_init(inode) do { (inode)->i_size_version1 = (inode)->i_size_version2 = 0; } while (0)
+#else
+#define i_size_ordered_init(inode) do { } while (0)
+#endif
+
 struct inode {
 	struct list_head	i_hash;
 	struct list_head	i_list;
@@ -534,8 +541,65 @@ struct inode {
 		struct xfs_inode_info		xfs_i;
 		void				*generic_ip;
 	} u;
+#ifdef __NEED_I_SIZE_ORDERED
+	volatile int i_size_version1;
+	volatile int i_size_version2;
+#endif
 };

+/*
+ * NOTE: in a 32bit arch with a preemptable kernel and
+ * an UP compile the i_size_read/write must be atomic
+ * with respect to the local cpu (unlike with preempt disabled),
+ * but they don't need to be atomic with respect to other cpus like in
+ * true SMP (so they either need to locally disable irqs around
+ * the read or, for example on x86, they can still be implemented as a
+ * cmpxchg8b without the need of the lock prefix). For SMP compiles
+ * and 64bit archs it makes no difference if preempt is enabled or not.
+ */
+static inline loff_t i_size_read(struct inode * inode)
+{
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#ifdef __ARCH_HAS_GET_SET_64BIT
+	return (loff_t) get_64bit((unsigned long long *) &inode->i_size);
+#else
+	loff_t i_size;
+	int v1, v2;
+
+	/* Retry if i_size was possibly modified while sampling. */
+	do {
+		v1 = inode->i_size_version1;
+		rmb();
+		i_size = inode->i_size;
+		rmb();
+		v2 = inode->i_size_version2;
+	} while (v1 != v2);
+
+	return i_size;
+#endif
+#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
+	return inode->i_size;
+#endif
+}
+
+static inline void i_size_write(struct inode * inode, loff_t i_size)
+{
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#ifdef __ARCH_HAS_GET_SET_64BIT
+	set_64bit((unsigned long long *) &inode->i_size, (unsigned long long) i_size);
+#else
+	inode->i_size_version2++;
+	wmb();
+	inode->i_size = i_size;
+	wmb();
+	inode->i_size_version1++;
+	wmb(); /* make it visible ASAP */
+#endif
+#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
+	inode->i_size = i_size;
+#endif
+}
+
 static inline void inode_add_bytes(struct inode *inode, loff_t bytes)
 {
 	inode->i_blocks += bytes >> 9;
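
For illustration, call sites would then convert like this (hypothetical
example, not a hunk from the patch):

/* reader side: no i_sem needed just to sample the size */
static loff_t example_get_size(struct inode *inode)
{
	return i_size_read(inode);	/* was: return inode->i_size; */
}

/* writer side: still serialized against other writers by i_sem */
static void example_extend(struct inode *inode, loff_t new_size)
{
	i_size_write(inode, new_size);	/* was: inode->i_size = new_size; */
}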

Andrea

2002-07-23 18:12:17

by Maciej W. Rozycki

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Tue, 23 Jul 2002, Andrea Arcangeli wrote:

> diff -urNp race/include/asm-i386/system.h race-fix/include/asm-i386/system.h
> --- race/include/asm-i386/system.h Tue Jul 23 18:46:44 2002
> +++ race-fix/include/asm-i386/system.h Tue Jul 23 18:47:10 2002
> @@ -143,6 +143,8 @@ struct __xchg_dummy { unsigned long a[10
> #define __xg(x) ((struct __xchg_dummy *)(x))
>
>
> +#ifdef CONFIG_X86_CMPXCHG
> +#define __ARCH_HAS_GET_SET_64BIT 1
> /*
> * The semantics of XCHGCMP8B are a bit strange, this is why
> * there is a loop and the loading of %%eax and %%edx has to
> @@ -167,7 +169,7 @@ static inline void __set_64bit (unsigned
> "lock cmpxchg8b (%0)\n\t"
> "jnz 1b"
> : /* no outputs */
> - : "D"(ptr),
> + : "r"(ptr),
> "b"(low),
> "c"(high)
> : "ax","dx","memory");

The condition is invalid, "cmpxchg" != "cmpxchg8b" and CONFIG_X86_CMPXCHG
is (correctly) set for the i486 which doesn't support the latter. You
probably need a separate option, e.g. CONFIG_X86_CMPXCHG8B, and verify the
presence of the instruction with the feature flags if the option is set
(check_config() seems the right place).

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-07-23 19:16:58

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Tue, Jul 23, 2002 at 08:15:48PM +0200, Maciej W. Rozycki wrote:
> On Tue, 23 Jul 2002, Andrea Arcangeli wrote:
>
> > diff -urNp race/include/asm-i386/system.h race-fix/include/asm-i386/system.h
> > --- race/include/asm-i386/system.h Tue Jul 23 18:46:44 2002
> > +++ race-fix/include/asm-i386/system.h Tue Jul 23 18:47:10 2002
> > @@ -143,6 +143,8 @@ struct __xchg_dummy { unsigned long a[10
> > #define __xg(x) ((struct __xchg_dummy *)(x))
> >
> >
> > +#ifdef CONFIG_X86_CMPXCHG
> > +#define __ARCH_HAS_GET_SET_64BIT 1
> > /*
> > * The semantics of XCHGCMP8B are a bit strange, this is why
> > * there is a loop and the loading of %%eax and %%edx has to
> > @@ -167,7 +169,7 @@ static inline void __set_64bit (unsigned
> > "lock cmpxchg8b (%0)\n\t"
> > "jnz 1b"
> > : /* no outputs */
> > - : "D"(ptr),
> > + : "r"(ptr),
> > "b"(low),
> > "c"(high)
> > : "ax","dx","memory");
>
> The condition is invalid, "cmpxchg" != "cmpxchg8b" and CONFIG_X86_CMPXCHG
> is (correctly) set for the i486 which doesn't support the latter. You
> probably need a separate option, e.g. CONFIG_X86_CMPXCHG8B, and verify the
> presence of the instruction with the feature flags if the option is set
> (check_config() seems the right place).

the problem with the feature flag check is that I don't want to make it
conditional at runtime; if I start adding branches and checks on the
feature flag (or pointers to functions) I can as well use the ordered
read/writes C version without reading/writing the 64bit atomically. So
the check_config() will be the oops or the not-oops at the first i_size
read/write :).

As for the CONFIG_X86_CMPXCHG8B, you're right, it's needed: setting
CONFIG_M486=y and CONFIG_SMP=y would generate a kernel that would oops
on a 486, and I don't see any other way to get the 486+SMP case right without
checking for the X86_FEATURE_CX8 capability at runtime. Not that I think
486+SMP is high prio, but yes, theoretically it's a bug.

Andrea

2002-07-24 14:16:24

by Maciej W. Rozycki

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Tue, 23 Jul 2002, Andrea Arcangeli wrote:

> the problem with the feature flag check is that I don't want to make it
> conditional at runtime; if I start adding branches and checks on the

No, no, no, that would be insane, no doubt.

> feature flag (or pointers to functions) I can as well use the ordered
> read/writes C version without reading/writing the 64bit atomically. So
> the check_config() will be the oops or the not-oops at the first i_size
> read/write :).

I meant something explicit like this:

#ifdef CONFIG_X86_CMPXCHG8B
	if (!cpu_has_cx8)
		panic("Kernel compiled for Pentium+, requires CMPXCHG8B feature!");
#endif

An oops would be quite an obscure response for a configuration error. As
I stated, just look into check_config() for how it's done in similar
cases.

> As for the CONFIG_X86_CMPXCHG8B, you're right, it's needed: setting
> CONFIG_M486=y and CONFIG_SMP=y would generate a kernel that would oops
> on a 486, and I don't see any other way to get the 486+SMP case right without
> checking for the X86_FEATURE_CX8 capability at runtime. Not that I think
> 486+SMP is high prio, but yes, theoretically it's a bug.

No need to break what works, either.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-07-24 14:22:36

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Wed, Jul 24, 2002 at 04:19:41PM +0200, Maciej W. Rozycki wrote:
> I meant something explicit like that:
>
> #ifdef CONFIG_X86_CMPXCHG8B
> 	if (!cpu_has_cx8)
> 		panic("Kernel compiled for Pentium+, requires CMPXCHG8B feature!");
> #endif
>
> An oops would be quite an obscure response for a configuration error. As

I'm not sure if we catch all the config errors at runtime anyways, and
the oops is simple enough to decode (the eip will point to the
cmpxchg8b), but ok, the panic is trivial enough that it's worth it.

as for the config option, I just added it to close the 486+SMP case.

Andrea

2002-07-29 18:34:12

by Bob Miller

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Tue, Jul 23, 2002 at 07:47:12PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 23, 2002 at 07:08:07PM +0200, Andrea Arcangeli wrote:
> > So while merging it I rewrote it this way (I also changed the type of the
>
> here it is the final full patch:
>

Stuff deleted...

> diff -urNp race/include/asm-i386/system.h race-fix/include/asm-i386/system.h
> --- race/include/asm-i386/system.h Tue Jul 23 18:46:44 2002
> +++ race-fix/include/asm-i386/system.h Tue Jul 23 18:47:10 2002
> @@ -143,6 +143,8 @@ struct __xchg_dummy { unsigned long a[10
> #define __xg(x) ((struct __xchg_dummy *)(x))
>
>
> +#ifdef CONFIG_X86_CMPXCHG
> +#define __ARCH_HAS_GET_SET_64BIT 1
> /*
> * The semantics of XCHGCMP8B are a bit strange, this is why
> * there is a loop and the loading of %%eax and %%edx has to
> @@ -167,7 +169,7 @@ static inline void __set_64bit (unsigned
> "lock cmpxchg8b (%0)\n\t"
> "jnz 1b"
> : /* no outputs */
> - : "D"(ptr),
> + : "r"(ptr),
> "b"(low),
> "c"(high)
> : "ax","dx","memory");
> @@ -197,6 +199,32 @@ static inline void __set_64bit_var (unsi
> __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
> __set_64bit(ptr, ll_low(value), ll_high(value)) )
>

Stuff deleted...
>
> +/*
> + * NOTE: in a 32bit arch with a preemptable kernel and
> + * an UP compile the i_size_read/write must be atomic
> + * with respect to the local cpu (unlike with preempt disabled),
> + * but they don't need to be atomic with respect to other cpus like in
> > + * true SMP (so they either need to locally disable irqs around
> > + * the read or, for example on x86, they can still be implemented as a
> + * cmpxchg8b without the need of the lock prefix). For SMP compiles
> + * and 64bit archs it makes no difference if preempt is enabled or not.
> + */
> +static inline loff_t i_size_read(struct inode * inode)
> +{
> +#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> +#ifdef __ARCH_HAS_GET_SET_64BIT
> + return (loff_t) get_64bit((unsigned long long *) &inode->i_size);
> +#else
> + loff_t i_size;
> + int v1, v2;
> +
> + /* Retry if i_size was possibly modified while sampling. */
> + do {
> + v1 = inode->i_size_version1;
> + rmb();
> + i_size = inode->i_size;
> + rmb();
> + v2 = inode->i_size_version2;
> + } while (v1 != v2);
> +
> + return i_size;
> +#endif
> +#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
> + return inode->i_size;
> +#endif
> +}
> +

Andrea,

Sorry for responding to this thread so late (I have been on holiday)...
I don't like creating __ARCH_HAS_GET_SET_64BIT and then doing conditional
code based on it. I believe that get_64bit() and set_64bit() should
always be defined and used. On x86 with cmpxchg8b and SMP or PREEMPT
get_64bit() and set_64bit() use cmpxchg8b. On 386 and 486 with SMP
or PREEMPT create "safe" versions i.e.:

static inline void set_64bit(unsigned long long * ptr, unsigned long long value)
{
	lock_kernel();
	*ptr = value;
	unlock_kernel();
}

static inline unsigned long long get_64bit(unsigned long long * ptr)
{
	unsigned long long retval;

	lock_kernel();
	retval = *ptr;
	unlock_kernel();

	return retval;
}

I know BKL sucks but how many SMP/PREEMPT 386/486 boxes are really out there?

And for all non SMP or PREEMPT do:

static inline void set_64bit(unsigned long long * ptr, unsigned long long value)
{
	*ptr = value;
}

static inline unsigned long long get_64bit(unsigned long long * ptr)
{
	return *ptr;
}

Other arches are free to do the "right thing" for themselves.

--
Bob Miller Email: [email protected]
Open Source Development Lab Phone: 503.626.2455 Ext. 17

2002-07-29 18:43:34

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Mon, Jul 29, 2002 at 11:37:30AM -0700, Bob Miller wrote:
> On Tue, Jul 23, 2002 at 07:47:12PM +0200, Andrea Arcangeli wrote:
> > On Tue, Jul 23, 2002 at 07:08:07PM +0200, Andrea Arcangeli wrote:
> > > So while merging it I rewrote it this way (I also changed the type of the
> >
> > here it is the final full patch:
> >
>
> Stuff deleted...
>
> > diff -urNp race/include/asm-i386/system.h race-fix/include/asm-i386/system.h
> > --- race/include/asm-i386/system.h Tue Jul 23 18:46:44 2002
> > +++ race-fix/include/asm-i386/system.h Tue Jul 23 18:47:10 2002
> > @@ -143,6 +143,8 @@ struct __xchg_dummy { unsigned long a[10
> > #define __xg(x) ((struct __xchg_dummy *)(x))
> >
> >
> > +#ifdef CONFIG_X86_CMPXCHG
> > +#define __ARCH_HAS_GET_SET_64BIT 1
> > /*
> > * The semantics of XCHGCMP8B are a bit strange, this is why
> > * there is a loop and the loading of %%eax and %%edx has to
> > @@ -167,7 +169,7 @@ static inline void __set_64bit (unsigned
> > "lock cmpxchg8b (%0)\n\t"
> > "jnz 1b"
> > : /* no outputs */
> > - : "D"(ptr),
> > + : "r"(ptr),
> > "b"(low),
> > "c"(high)
> > : "ax","dx","memory");
> > @@ -197,6 +199,32 @@ static inline void __set_64bit_var (unsi
> > __set_64bit(ptr, (unsigned int)(value), (unsigned int)((value)>>32ULL) ) : \
> > __set_64bit(ptr, ll_low(value), ll_high(value)) )
> >
>
> Stuff deleted...
> >
> > +/*
> > + * NOTE: in a 32bit arch with a preemptable kernel and
> > + * an UP compile the i_size_read/write must be atomic
> > + * with respect to the local cpu (unlike with preempt disabled),
> > + * but they don't need to be atomic with respect to other cpus like in
> > > + * true SMP (so they either need to locally disable irqs around
> > > + * the read or, for example on x86, they can still be implemented as a
> > + * cmpxchg8b without the need of the lock prefix). For SMP compiles
> > + * and 64bit archs it makes no difference if preempt is enabled or not.
> > + */
> > +static inline loff_t i_size_read(struct inode * inode)
> > +{
> > +#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> > +#ifdef __ARCH_HAS_GET_SET_64BIT
> > + return (loff_t) get_64bit((unsigned long long *) &inode->i_size);
> > +#else
> > + loff_t i_size;
> > + int v1, v2;
> > +
> > + /* Retry if i_size was possibly modified while sampling. */
> > + do {
> > + v1 = inode->i_size_version1;
> > + rmb();
> > + i_size = inode->i_size;
> > + rmb();
> > + v2 = inode->i_size_version2;
> > + } while (v1 != v2);
> > +
> > + return i_size;
> > +#endif
> > +#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
> > + return inode->i_size;
> > +#endif
> > +}
> > +
>
> Andrea,
>
> Sorry for responding to this thread so late (I have been on holiday)...
> I don't like creating __ARCH_HAS_GET_SET_64BIT and then doing conditional
> code based on it. I believe that get_64bit() and set_64bit() should
> always be defined and used. On x86 with cmpxchg8b and SMP or PREEMPT
> get_64bit() and set_64bit() use cmpxchg8b. On 386 and 486 with SMP
> or PREEMPT create "safe" versions i.e.:
>
> static inline void set_64bit(unsigned long long * ptr, unsigned long long value)
> {
> lock_kernel();
> *ptr = value;
> unlock_kernel();
> }
>
> static inline unsigned long long get_64bit(unsigned long long * ptr)
> {
> unsigned long long retval;
>
> lock_kernel();
> retval = *ptr;
> unlock_kernel();
>
> return retval;
> }
>
> I know BKL sucks but how many SMP/PREEMPT 386/486 boxes are really out there?
>
> And for all non SMP or PREEMPT do:
>
> static inline void set_64bit(unsigned long long * ptr, unsigned long long value)
> {
> *ptr = value;
> }
>
> static inline unsigned long long get_64bit(unsigned long long * ptr)
> {
> return *ptr;
> }
>
> Other arches are free to do the "right thing" for themselves.

well, the point is exactly the other archs, i.e. s390. Why should s390
lose total scalability during read/writes with a big kernel lock or
similar, when it can use the ordered version? (maybe it's even
possible to read/write 64bit atomically on s390 like we do on 586+, but
still, some 32bit archs may not provide that functionality and they
will definitely prefer the ordered read/write using sequence numbers
rather than a big kernel lock or global lock) That's the whole point of
the __ARCH_HAS_GET_SET_64BIT information.

Andrea

2002-07-30 00:31:09

by Daniel McNeil

Subject: Re: 2.4.19rc2aa1 i_size atomic access

Andrea,

Sorry I haven't responded, but I was on vacation all last week and
was not near a computer.

I like your code change. Incrementing the v2 before the v1 in the
i_size_write() is much better. My code was definitely uglier -- but
it was correct, since version1 and version2 were sampled before
i_size was read and version1 and version2 were checked again after.
It was excessive, but correct.

On your patch, shouldn't non-smp preempt still use the 64-bit stuff?
The comment says it should, but the #ifdef's are not checking for
PREEMPT or did I miss something?

I would still be curious about the performance difference between the
version approach and the cmpxchg8 approach. With SMP I'm a bit worried
about the cacheline bouncing around and the memory bandwidth wasted.

Any ideas on what kind of test would be appropriate?
I've got access to 2-proc to 8-proc systems I could run some tests on,
just not sure what test would be useful. The fstat() test isn't
realistic.

Increasing the versions to 32-bit is ok with me -- I was just trying
not to waste too much space.

Daniel

On Tue, 2002-07-23 at 10:08, Andrea Arcangeli wrote:
> On Fri, Jul 19, 2002 at 03:56:36PM -0700, Daniel McNeil wrote:
> > + short v1;
> > + short v2;
> > + loff_t i_size;
> > +
> > + /*
> > + * retry if i_size was possibly modified while sampling.
> > + */
> > + do {
> > + v1 = inode->i_attr_version1;
> > + v2 = inode->i_attr_version2;
> > + rmb();
> > + i_size = inode->i_size;
> > + rmb();
> > + } while (v1 != v2 ||
> > + v1 != inode->i_attr_version1 ||
> > + v1 != inode->i_attr_version2);
> > + return i_size;
> [..]
> > #elif BITS_PER_LONG==64
> > return inode->i_size;
> > #endif
> > @@ -548,8 +566,12 @@
> > static inline void i_size_write(struct inode * inode, loff_t i_size)
> > {
> > #if BITS_PER_LONG==32
> > - set_64bit((unsigned long long *) &inode->i_size,
> > - (unsigned long long) i_size);
> > + inode->i_attr_version1++; /* changing i_size */
> > + wmb();
> > + inode->i_size = i_size;
> > + wmb();
> > + inode->i_attr_version2++; /* done with change */
> > + wmb();
> > #elif BITS_PER_LONG==64
> > inode->i_size = i_size;
> > #endif
>
> btw, looking at the implementation, the read side was buggy. First, it's
> pointless to read both version1 and version2 before reading the i_size;
> secondly, if you increase version1 first (in the writer), the reader has
> to read version2 before i_size and version1 after i_size, while you're
> doing the opposite: you compare v1 (version1), read before i_size, with
> version2 read after i_size.
>
> So while merging it I rewrote it this way (I also changed the type of the
> sequence number to int: 2^16 can overflow quite fast on a multi-GHz cpu,
> mostly out of paranoia of course, and to stay safe with the atomic granularity
> provided by archs -- an int certainly has to be atomic-granular).
>
> loff_t i_size;
> int v1, v2;
>
> /* Retry if i_size was possibly modified while sampling. */
> do {
> v1 = inode->i_size_version1;
> rmb();
> i_size = inode->i_size;
> rmb();
> v2 = inode->i_size_version2;
> } while (v1 != v2);
>
> return i_size;
>
>
> and the new writer side:
>
> inode->i_size_version2++;
> wmb();
> inode->i_size = i_size;
> wmb();
> inode->i_size_version1++;
> wmb(); /* make it visible ASAP */
>
> Andrea


2002-07-30 01:07:56

by Andrea Arcangeli

Subject: Re: 2.4.19rc2aa1 i_size atomic access

On Mon, Jul 29, 2002 at 05:34:16PM -0700, Daniel McNeil wrote:
> Andrea,
>
> Sorry I haven't responded, but I was on vacation all last week and
> was not near a computer.

No problem :)

> I like your code change. Incrementing the v2 before the v1 in the
> i_size_write() is much better. My code was definitely uglier -- but
> it was correct, since version1 and version2 were sampled before
> i_size was read and version1 and version2 were checked again after.
> It was excessive, but correct.

ok, so you had the dependency v1 == v2, so you were also implicitly
comparing v2 with the new version1, ok.

>
> On your patch, shouldn't non-smp preempt still use the 64-bit stuff?
> The comment says it should, but the #ifdef's are not checking for
> PREEMPT or did I miss something?

there's no preempt in 2.4; the comment was meant for anybody forward
porting it to 2.5.

> I would still be curious about the performance difference between the
> version approach and the cmpxchg8 approach. With SMP I'm a bit worried
> about the cacheline bouncing around and the memory bandwith wasted.

Randy didn't report any decrease in performance, so in normal loads it
shouldn't be noticeable.

> Any ideas on what kind of test would be appropriate?
> I've got access to 2-proc to 8-proc systems I could run some tests on,
> just not sure what test would be useful. The fstat() test isn't
> realistic.

I would say dbench is a good candidate for this kind of change to verify
it's not noticeable.

then you could test two parallel reads on the same inode, for example
two parallel dd if=file of=/dev/null reading from cache, and see if
there's a difference in bandwidth between cmpxchg8b and the ordered
read/writes (on a 4p you could try with 4 parallel dd).

> Increasing the versions to 32-bit is ok with me -- I was just trying
> not to waste too much space.

ok, as said the int granularity is going to be atomic for all archs.

Andrea