2006-01-05 23:58:43

by Chris Wright

Subject: [PATCH 0/6] -stable review

This is the start of the stable review cycle for the 2.6.14.6 release.
There are 6 patches in this series, all will be posted as a response to
this one. If anyone has any issues with these being applied, please let
us know. If anyone is a maintainer of the proper subsystem, and wants
to add a Signed-off-by: line to the patch, please respond with it.

These patches are sent out with a number of different people on the
Cc: line. If you wish to be a reviewer, please email [email protected]
to add your name to the list. If you want to be off the reviewer list,
also email us.

Responses should be made by Sat, Jan 7, 23:00 UTC. Anything received after
that time might be too late.

thanks,

the -stable release team
--


2006-01-06 00:43:37

by Chris Wright

Subject: [PATCH 3/6] Insanity avoidance in /proc (CVE-2005-4605)

-stable review patch. If anyone has any objections, please let us know.
------------------

Insanity avoidance in /proc

The old /proc interfaces were never updated to use loff_t, and are just
generally broken. Now, we should be using the seq_file interface for
all of the proc files, but converting the legacy functions is more work
than most people care for and has little upside..

But at least we can make the non-LFS rules explicit, rather than just
insanely wrapping the offset or something.

Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/proc/generic.c | 47 +++++++++++++++++++++++------------------------
1 file changed, 23 insertions(+), 24 deletions(-)

--- linux-2.6.14.5.orig/fs/proc/generic.c
+++ linux-2.6.14.5/fs/proc/generic.c
@@ -54,6 +54,18 @@ proc_file_read(struct file *file, char _
ssize_t n, count;
char *start;
struct proc_dir_entry * dp;
+ unsigned long long pos;
+
+ /*
+ * Gaah, please just use "seq_file" instead. The legacy /proc
+ * interfaces cut loff_t down to off_t for reads, and ignore
+ * the offset entirely for writes..
+ */
+ pos = *ppos;
+ if (pos > MAX_NON_LFS)
+ return 0;
+ if (nbytes > MAX_NON_LFS - pos)
+ nbytes = MAX_NON_LFS - pos;

dp = PDE(inode);
if (!(page = (char*) __get_free_page(GFP_KERNEL)))
@@ -202,30 +214,17 @@ proc_file_write(struct file *file, const
static loff_t
proc_file_lseek(struct file *file, loff_t offset, int orig)
{
- lock_kernel();
-
- switch (orig) {
- case 0:
- if (offset < 0)
- goto out;
- file->f_pos = offset;
- unlock_kernel();
- return(file->f_pos);
- case 1:
- if (offset + file->f_pos < 0)
- goto out;
- file->f_pos += offset;
- unlock_kernel();
- return(file->f_pos);
- case 2:
- goto out;
- default:
- goto out;
- }
-
-out:
- unlock_kernel();
- return -EINVAL;
+ loff_t retval = -EINVAL;
+ switch (orig) {
+ case 1:
+ offset += file->f_pos;
+ /* fallthrough */
+ case 0:
+ if (offset < 0 || offset > MAX_NON_LFS)
+ break;
+ file->f_pos = retval = offset;
+ }
+ return retval;
}

static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)

--

2006-01-06 00:43:52

by Chris Wright

Subject: [PATCH 4/6] sysctl: don't overflow the user-supplied buffer with 0

-stable review patch. If anyone has any objections, please let us know.
------------------

If the string was too long to fit in the user-supplied buffer,
the sysctl layer would zero-terminate it by writing past the
end of the buffer. Don't do that.

Noticed by Yi Yang <[email protected]>

Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
kernel/sysctl.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

--- linux-2.6.14.5.orig/kernel/sysctl.c
+++ linux-2.6.14.5/kernel/sysctl.c
@@ -2200,14 +2200,12 @@ int sysctl_string(ctl_table *table, int
if (get_user(len, oldlenp))
return -EFAULT;
if (len) {
- l = strlen(table->data);
+ l = strlen(table->data)+1;
if (len > l) len = l;
if (len >= table->maxlen)
len = table->maxlen;
if(copy_to_user(oldval, table->data, len))
return -EFAULT;
- if(put_user(0, ((char __user *) oldval) + len))
- return -EFAULT;
if(put_user(len, oldlenp))
return -EFAULT;
}

--

2006-01-06 00:44:51

by Chris Wright

Subject: [PATCH 6/6] [ATYFB]: Fix onboard video on SPARC Blade 100 for 2.6.{13,14,15}

-stable review patch. If anyone has any objections, please let us know.
------------------

I have recently been switching from using 2.4.32 on my trusty
old Sparc Blade 100 to using 2.6.15. Some of the problems I ran into
were distorted video when the console was active (missing first
character, skipped dots) and when running X windows (colored snow,
stripes, missing pixels). A quick examination of the 2.6 versus 2.4
source for the ATY driver revealed a lot of changes.

A closer look at the code/data for the 64GR/XL chip revealed
two minor "typos" that the rewriter(s) of the code made. The first is
an incorrect clock value (230 vs. 235) and the second is a missing
flag (M64F_SDRAM_MAGIC_PLL). Making both these changes seems to have
fixed my problem. I tend to think the 235 value is the correct one,
as there is a 29.4 MHz clock crystal close to the video chip and 235.2
(29.4*8) is too close to 235 to be a coincidence.

The flag for M64F_SDRAM_MAGIC_PLL was dropped during the
changes made by adaplas in file revision 1.72 in the old BitKeeper
repository.

The change relating to the clock rate has been there forever,
at least in the 2.6 tree. I'm not sure where to look for the old 2.5
tree or if anyone cares when it happened.

On SPARC Blade 100's, which use the ATY MACH64GR video chipset, the
clock crystal frequency is 235.2 MHz, not 230 MHz. The chipset also
requires the use of M64F_SDRAM_MAGIC_PLL in order to set up the PLL
properly for the DRAM.

Signed-off-by: "Luis F. Ortiz" <[email protected]>
Signed-off-by: "David S. Miller" <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
drivers/video/aty/atyfb_base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.14.5.orig/drivers/video/aty/atyfb_base.c
+++ linux-2.6.14.5/drivers/video/aty/atyfb_base.c
@@ -404,7 +404,7 @@ static struct {
{ PCI_CHIP_MACH64GM, "3D RAGE XL (Mach64 GM, AGP)", 230, 83, 63, ATI_CHIP_264XL },
{ PCI_CHIP_MACH64GN, "3D RAGE XL (Mach64 GN, AGP)", 230, 83, 63, ATI_CHIP_264XL },
{ PCI_CHIP_MACH64GO, "3D RAGE XL (Mach64 GO, PCI-66/BGA)", 230, 83, 63, ATI_CHIP_264XL },
- { PCI_CHIP_MACH64GR, "3D RAGE XL (Mach64 GR, PCI-33MHz)", 230, 83, 63, ATI_CHIP_264XL },
+ { PCI_CHIP_MACH64GR, "3D RAGE XL (Mach64 GR, PCI-33MHz)", 235, 83, 63, ATI_CHIP_264XL | M64F_SDRAM_MAGIC_PLL },
{ PCI_CHIP_MACH64GL, "3D RAGE XL (Mach64 GL, PCI)", 230, 83, 63, ATI_CHIP_264XL },
{ PCI_CHIP_MACH64GS, "3D RAGE XL (Mach64 GS, PCI)", 230, 83, 63, ATI_CHIP_264XL },


--

2006-01-06 00:46:03

by Chris Wright

Subject: [PATCH 1/6] drivers/net/sungem.c: gem_remove_one mustn't be __devexit

-stable review patch. If anyone has any objections, please let us know.
------------------

gem_remove_one() is called from the __devinit gem_init_one().

Therefore, gem_remove_one() mustn't be __devexit.

This patch was already included in 2.6.15-rc7.


Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/sungem.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.14.5.orig/drivers/net/sungem.c
+++ linux-2.6.14.5/drivers/net/sungem.c
@@ -2905,7 +2905,7 @@ static int __devinit gem_get_device_addr
return 0;
}

-static void __devexit gem_remove_one(struct pci_dev *pdev)
+static void gem_remove_one(struct pci_dev *pdev)
{
struct net_device *dev = pci_get_drvdata(pdev);

@@ -3179,7 +3179,7 @@ static struct pci_driver gem_driver = {
.name = GEM_MODULE_NAME,
.id_table = gem_pci_tbl,
.probe = gem_init_one,
- .remove = __devexit_p(gem_remove_one),
+ .remove = gem_remove_one,
#ifdef CONFIG_PM
.suspend = gem_suspend,
.resume = gem_resume,

--

2006-01-06 00:45:53

by Chris Wright

Subject: [PATCH 5/6] UFS: inode->i_sem is not released in error path

-stable review patch. If anyone has any objections, please let us know.
------------------

---

fs/ufs/super.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.14.5/fs/ufs/super.c
===================================================================
--- linux-2.6.14.5.orig/fs/ufs/super.c
+++ linux-2.6.14.5/fs/ufs/super.c
@@ -1294,8 +1294,10 @@ static ssize_t ufs_quota_write(struct su
blk++;
}
out:
- if (len == towrite)
+ if (len == towrite) {
+ up(&inode->i_sem);
return err;
+ }
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
inode->i_version++;

--

2006-01-06 00:46:32

by Chris Wright

Subject: [PATCH 2/6] ieee80211_crypt_tkip depends on NET_RADIO

-stable review patch. If anyone has any objections, please let us know.
------------------

*** Warning: ".wireless_send_event" [net/ieee80211/ieee80211_crypt_tkip.ko]

This bug was also reported as kernel Bugzilla #5551.

Signed-off-by: Olaf Hering <[email protected]>
Signed-off-by: Adrian Bunk <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ieee80211/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.14.5.orig/net/ieee80211/Kconfig
+++ linux-2.6.14.5/net/ieee80211/Kconfig
@@ -55,7 +55,7 @@ config IEEE80211_CRYPT_CCMP

config IEEE80211_CRYPT_TKIP
tristate "IEEE 802.11i TKIP encryption"
- depends on IEEE80211
+ depends on IEEE80211 && NET_RADIO
select CRYPTO
select CRYPTO_MICHAEL_MIC
---help---

--

2006-01-06 00:51:12

by Chris Wright

Subject: Re: [PATCH 0/6] -stable review

* Chris Wright ([email protected]) wrote:
> There are 6 patches in this series, all will be posted as a response to
> this one.

Sorry about the misfire. I reposted the actual patches.

thanks,
-chris

2006-01-06 01:31:37

by Linus Torvalds

Subject: Re: [PATCH 4/6] sysctl: don't overflow the user-supplied buffer with 0



On Thu, 5 Jan 2006, Chris Wright wrote:
>
> If the string was too long to fit in the user-supplied buffer,
> the sysctl layer would zero-terminate it by writing past the
> end of the buffer. Don't do that.

Don't use this one. It doesn't zero-pad the string if the result buffer is
too small (which _could_ cause problems) and it actually returns the wrong
length too.

There's a better one in the final 2.6.15.

Linus

2006-01-06 03:38:37

by Chris Wright

Subject: Re: [PATCH 4/6] sysctl: don't overflow the user-supplied buffer with 0

* Linus Torvalds ([email protected]) wrote:
> Don't use this one. It doesn't zero-pad the string if the result buffer is
> too small (which _could_ cause problems) and it actually returns the wrong
> length too.
>
> There's a better one in the final 2.6.15.

Thanks, adding this (it's got both rolled together so it's the same as
upstream).

thanks,
-chris
--

Subject: sysctl: make sure to terminate strings with a NUL

From: Linus Torvalds <[email protected]>

This is a slightly more complete fix for the previous minimal sysctl
string fix. It always terminates the returned string with a NUL, even
if the full result wouldn't fit in the user-supplied buffer.

The returned length is the full untruncated length, so that you can
tell when truncation has occurred.
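
In user-space terms, the intended contract can be sketched like this (an
illustration only; copy_sysctl_string() is a made-up model of the kernel
logic, not code from the patch):

#include <stdio.h>
#include <string.h>

/*
 * Toy model of the new sysctl_string() behaviour: "bufsize" plays the
 * role of *oldlenp on entry, "data" plays the role of table->data.
 * At most bufsize-1 characters plus a terminating NUL are written to
 * buf, and the full (untruncated) length is returned, the way the
 * kernel now reports it.  Assumes bufsize > 0, which the kernel checks
 * before getting this far.
 */
static size_t copy_sysctl_string(char *buf, size_t bufsize, const char *data)
{
        size_t len = strlen(data);
        size_t copied = (len >= bufsize) ? bufsize - 1 : len;

        memcpy(buf, data, copied);
        buf[copied] = '\0';
        return len;
}

int main(void)
{
        char small[8];
        size_t len = copy_sysctl_string(small, sizeof(small), "2.6.14.6-example");

        /* Truncation is detectable: the returned length reaches past the buffer. */
        printf("got \"%s\", full length %zu%s\n", small, len,
               len >= sizeof(small) ? " (truncated)" : "");
        return 0;
}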

Signed-off-by: Linus Torvalds <[email protected]>
[chrisw: inclusive of minimal fix so it's same as upstream]
Signed-off-by: Chris Wright <[email protected]>
---
kernel/sysctl.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

--- linux-2.6.14.5.orig/kernel/sysctl.c
+++ linux-2.6.14.5/kernel/sysctl.c
@@ -2191,29 +2191,32 @@ int sysctl_string(ctl_table *table, int
void __user *oldval, size_t __user *oldlenp,
void __user *newval, size_t newlen, void **context)
{
- size_t l, len;
-
if (!table->data || !table->maxlen)
return -ENOTDIR;

if (oldval && oldlenp) {
- if (get_user(len, oldlenp))
+ size_t bufsize;
+ if (get_user(bufsize, oldlenp))
return -EFAULT;
- if (len) {
- l = strlen(table->data);
- if (len > l) len = l;
- if (len >= table->maxlen)
+ if (bufsize) {
+ size_t len = strlen(table->data), copied;
+
+ /* This shouldn't trigger for a well-formed sysctl */
+ if (len > table->maxlen)
len = table->maxlen;
- if(copy_to_user(oldval, table->data, len))
- return -EFAULT;
- if(put_user(0, ((char __user *) oldval) + len))
+
+ /* Copy up to a max of bufsize-1 bytes of the string */
+ copied = (len >= bufsize) ? bufsize - 1 : len;
+
+ if (copy_to_user(oldval, table->data, copied) ||
+ put_user(0, (char __user *)(oldval + copied)))
return -EFAULT;
- if(put_user(len, oldlenp))
+ if (put_user(len, oldlenp))
return -EFAULT;
}
}
if (newval && newlen) {
- len = newlen;
+ size_t len = newlen;
if (len > table->maxlen)
len = table->maxlen;
if(copy_from_user(table->data, newval, len))
@@ -2222,7 +2225,7 @@ int sysctl_string(ctl_table *table, int
len--;
((char *) table->data)[len] = 0;
}
- return 0;
+ return 1;
}

/*

2006-01-06 10:18:44

by Eric Dumazet

Subject: [PATCH, RFC] RCU : OOM avoidance and lower latency

--- linux-2.6.15/kernel/rcupdate.c 2006-01-03 04:21:10.000000000 +0100
+++ linux-2.6.15-edum/kernel/rcupdate.c 2006-01-06 11:10:45.000000000 +0100
@@ -71,14 +71,14 @@

/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int maxbatch = 100;

#ifndef __HAVE_ARCH_CMPXCHG
/*
* We use an array of spinlocks for the rcurefs -- similar to ones in sparc
* 32 bit atomic_t implementations, and a hash function similar to that
* for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
+ * Can't help multiprocessors which dont have cmpxchg :(
*/

spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
@@ -110,9 +110,17 @@
*rdp->nxttail = head;
rdp->nxttail = &head->next;

- if (unlikely(++rdp->count > 10000))
- set_need_resched();
-
+/*
+ * OOM avoidance : If we queued too many items in this queue,
+ * free the oldest entry
+ */
+ if (unlikely(++rdp->count > 10000)) {
+ rdp->count--;
+ head = rdp->donelist;
+ rdp->donelist = head->next;
+ local_irq_restore(flags);
+ return head->func(head);
+ }
local_irq_restore(flags);
}

@@ -148,12 +156,17 @@
rdp = &__get_cpu_var(rcu_bh_data);
*rdp->nxttail = head;
rdp->nxttail = &head->next;
- rdp->count++;
/*
- * Should we directly call rcu_do_batch() here ?
- * if (unlikely(rdp->count > 10000))
- * rcu_do_batch(rdp);
+ * OOM avoidance : If we queued too many items in this queue,
+ * free the oldest entry
*/
+ if (unlikely(++rdp->count > 10000)) {
+ rdp->count--;
+ head = rdp->donelist;
+ rdp->donelist = head->next;
+ local_irq_restore(flags);
+ return head->func(head);
+ }
local_irq_restore(flags);
}

@@ -209,7 +222,7 @@
static void rcu_do_batch(struct rcu_data *rdp)
{
struct rcu_head *next, *list;
- int count = 0;
+ int count = maxbatch;

list = rdp->donelist;
while (list) {
@@ -217,7 +230,7 @@
list->func(list);
list = next;
rdp->count--;
- if (++count >= maxbatch)
+ if (--count <= 0)
break;
}
if (!rdp->donelist)


Attachments:
RCU_OOM.patch (2.03 kB)

2006-01-06 12:53:42

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency (Version 2), HOTPLUG_CPU fix

--- linux-2.6.15/kernel/rcupdate.c 2006-01-03 04:21:10.000000000 +0100
+++ linux-2.6.15-edum/kernel/rcupdate.c 2006-01-06 13:32:02.000000000 +0100
@@ -71,14 +71,14 @@

/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int maxbatch = 100;

#ifndef __HAVE_ARCH_CMPXCHG
/*
* We use an array of spinlocks for the rcurefs -- similar to ones in sparc
* 32 bit atomic_t implementations, and a hash function similar to that
* for our refcounting needs.
- * Can't help multiprocessors which donot have cmpxchg :(
+ * Can't help multiprocessors which dont have cmpxchg :(
*/

spinlock_t __rcuref_hash[RCUREF_HASH_SIZE] = {
@@ -110,9 +110,19 @@
*rdp->nxttail = head;
rdp->nxttail = &head->next;

- if (unlikely(++rdp->count > 10000))
- set_need_resched();
-
+/*
+ * OOM avoidance : If we queued too many items in this queue,
+ * free the oldest entry (from the donelist only to respect
+ * RCU constraints)
+ */
+ if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+ rdp->count--;
+ rdp->donelist = head->next;
+ if (!rdp->donelist)
+ rdp->donetail = &rdp->donelist;
+ local_irq_restore(flags);
+ return head->func(head);
+ }
local_irq_restore(flags);
}

@@ -148,12 +158,19 @@
rdp = &__get_cpu_var(rcu_bh_data);
*rdp->nxttail = head;
rdp->nxttail = &head->next;
- rdp->count++;
/*
- * Should we directly call rcu_do_batch() here ?
- * if (unlikely(rdp->count > 10000))
- * rcu_do_batch(rdp);
+ * OOM avoidance : If we queued too many items in this queue,
+ * free the oldest entry (from the donelist only to respect
+ * RCU constraints)
*/
+ if (unlikely(++rdp->count > 10000 && (head = rdp->donelist))) {
+ rdp->count--;
+ rdp->donelist = head->next;
+ if (!rdp->donelist)
+ rdp->donetail = &rdp->donelist;
+ local_irq_restore(flags);
+ return head->func(head);
+ }
local_irq_restore(flags);
}

@@ -208,19 +225,20 @@
*/
static void rcu_do_batch(struct rcu_data *rdp)
{
- struct rcu_head *next, *list;
- int count = 0;
+ struct rcu_head *next = NULL, *list;
+ int count = maxbatch;

list = rdp->donelist;
while (list) {
- next = rdp->donelist = list->next;
+ next = list->next;
list->func(list);
list = next;
rdp->count--;
- if (++count >= maxbatch)
+ if (--count <= 0)
break;
}
- if (!rdp->donelist)
+ rdp->donelist = next;
+ if (!next)
rdp->donetail = &rdp->donelist;
else
tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
@@ -344,11 +362,9 @@
static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
struct rcu_head **tail)
{
- local_irq_disable();
*this_rdp->nxttail = list;
if (list)
this_rdp->nxttail = tail;
- local_irq_enable();
}

static void __rcu_offline_cpu(struct rcu_data *this_rdp,
@@ -362,9 +378,12 @@
if (rcp->cur != rcp->completed)
cpu_quiet(rdp->cpu, rcp, rsp);
spin_unlock_bh(&rsp->lock);
+ local_irq_disable();
rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
-
+ this_rdp->count += rdp->count;
+ rdp->count = 0;
+ local_irq_enable();
}
static void rcu_offline_cpu(int cpu)
{


Attachments:
RCU_OOM.patch (3.14 kB)

2006-01-06 13:00:08

by Andi Kleen

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Friday 06 January 2006 11:17, Eric Dumazet wrote:

>
> I assume that if a CPU queued 10.000 items in its RCU queue, then the
> oldest entry cannot still be in use by another CPU. This might sounds as a
> violation of RCU rules, (I'm not an RCU expert) but seems quite reasonable.

I don't think it's a good assumption. Another CPU might be stuck in a long
running interrupt, and still have a reference in the code running below
the interrupt handler.

And in general letting correctness depend on magic numbers like this is
very nasty.

-Andi

2006-01-06 13:10:16

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

Andi Kleen wrote:
> On Friday 06 January 2006 11:17, Eric Dumazet wrote:
>
>> I assume that if a CPU queued 10.000 items in its RCU queue, then the
>> oldest entry cannot still be in use by another CPU. This might sounds as a
>> violation of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>
> I don't think it's a good assumption. Another CPU might be stuck in a long
> running interrupt, and still have a reference in the code running below
> the interrupt handler.
>
> And in general letting correctness depend on magic numbers like this is
> very nasty.
>

I agree, Andi. I posted a 2nd version of the patch that makes no such assumption.

Eric


2006-01-06 13:36:54

by Alan

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest
> entry cannot still be in use by another CPU. This might sounds as a violation
> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.

Fixing the real problem in the routing code would be the real fix.

The underlying problem of RCU and memory usage could be solved more
safely by making sure that the sleeping memory allocator path always
waits until at least one RCU cleanup has occurred after it fails an
allocation before it starts trying harder. That ought to also naturally
throttle memory consumers more in the situation which is the right
behaviour.
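
A very rough sketch of that idea, with an invented helper name and no claim
about where exactly it would hook into the page allocator:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical slow-path helper: before the allocator starts "trying
 * harder" (deep reclaim, OOM killing), let one RCU grace period elapse
 * so that memory parked in pending RCU callbacks has a chance to come
 * back.  Only legal for sleeping (__GFP_WAIT) allocations.
 */
static struct page *alloc_pages_after_rcu(gfp_t gfp_mask, unsigned int order)
{
        struct page *page = alloc_pages(gfp_mask | __GFP_NORETRY, order);

        if (!page && (gfp_mask & __GFP_WAIT)) {
                synchronize_rcu();      /* queued callbacks can now run and free */
                page = alloc_pages(gfp_mask, order);
        }
        return page;
}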

Alan

2006-01-06 14:01:04

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

Alan Cox wrote:
> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
>> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest
>> entry cannot still be in use by another CPU. This might sounds as a violation
>> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>
> Fixing the real problem in the routing code would be the real fix.
>

So far nobody has succeeded in 'fixing the routing code'; few people can even
read the code from the first line to the last one...

I think this code is not buggy; it only makes general RCU assumptions about
delayed freeing of dst entries. In some cases, the general assumptions are
just wrong. We can fix it at RCU level, and future users of call_rcu_bh() won't
have to think *hard* about 'general assumptions'.

Of course, we can ignore the RCU problem and mark somewhere on a sticker:
***DONT USE OR RISK CRASHES***
***USE IT ONLY FOR FUN***

> The underlying problem of RCU and memory usage could be solved more
> safely by making sure that the sleeping memory allocator path always
> waits until at least one RCU cleanup has occurred after it fails an
> allocation before it starts trying harder. That ought to also naturally
> throttle memory consumers more in the situation which is the right
> behaviour.
>

In the case of call_rcu_bh(), you can be sure that the caller cannot afford
'sleeping memory allocations'. Better to drop a frame than to block the stack, no?

Eric

2006-01-06 14:46:30

by Alan

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Gwe, 2006-01-06 at 15:00 +0100, Eric Dumazet wrote:
> In the case of call_rcu_bh(), you can be sure that the caller cannot afford
> 'sleeping memory allocations'. Better drop a frame than block the stack, no ?

Atomic allocations can't sleep and will fail, which is fine. If memory
allocation pressure exists for sleeping allocations because of a large
RCU backlog, we want to be sure that the RCU backlog from the networking
stack or other sources does not cause us to OOM kill or take incorrect
action.

So if for example we want to grow a process stack and the memory is
there just stuck in the RCU lists pending recovery we want to let the
RCU recovery happen before making drastic decisions.


2006-01-06 16:47:07

by Paul E. McKenney

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> > I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest
> > entry cannot still be in use by another CPU. This might sounds as a violation
> > of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>
> Fixing the real problem in the routing code would be the real fix.
>
> The underlying problem of RCU and memory usage could be solved more
> safely by making sure that the sleeping memory allocator path always
> waits until at least one RCU cleanup has occurred after it fails an
> allocation before it starts trying harder. That ought to also naturally
> throttle memory consumers more in the situation which is the right
> behaviour.

A quick look at rt_garbage_collect() leads me to believe that although
the IP route cache does try to limit its use of memory, it does not
fully account for memory that it has released to RCU, but that RCU has
not yet freed due to a grace period not having elapsed.

The following appears to be possible:

1. rt_garbage_collect() sees that there are too many entries,
and sets "goal" to the number to free up, based on a
computed "equilibrium" value.

2. The number of entries is (correctly) decremented only when
the corresponding RCU callback is invoked, which actually
frees the entry.

3. Between the time that rt_garbage_collect() is invoked the
first time and when the RCU grace period ends, rt_garbage_collect()
is invoked again. It still sees too many entries (since
RCU has not yet freed the ones released by the earlier
invocation in step (1) above), so frees a bunch more.

4. Packets routed now miss the route cache, because the corresponding
entries are waiting for a grace period, slowing the system down.
Therefore, even more entries are freed to make room for new
entries corresponding to the new packets.

If my (likely quite naive) reading of the IP route cache code is correct,
it would be possible to end up in a steady state with most of the entries
always being in RCU rather than in the route cache.

Eric, could this be what is happening to your system?

If it is, one straightforward fix would be to keep a count of the number
of route-cache entries waiting on RCU, and for rt_garbage_collect()
to subtract this number of entries from its goal. Does this make sense?
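
For illustration, the bookkeeping could be as small as something like the
sketch below (the names are made up; this is not the actual route.c code):

#include <asm/atomic.h>

/*
 * Route-cache entries that have been handed to call_rcu_bh() but whose
 * callback has not yet run.  Incremented where rt_free()/rt_drop() queue
 * an entry, decremented at the top of the RCU callback that actually
 * frees it.
 */
static atomic_t rt_entries_in_rcu = ATOMIC_INIT(0);

/*
 * rt_garbage_collect() would then work against an adjusted goal: entries
 * already waiting on a grace period will be freed on their own as soon
 * as that grace period ends, so there is no point evicting more live
 * entries on their behalf.
 */
static int rt_adjusted_goal(int goal)
{
        return goal - atomic_read(&rt_entries_in_rcu);
}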

Thanx, Paul

2006-01-06 17:23:41

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

Paul E. McKenney wrote:
> On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
>> On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
>>> I assume that if a CPU queued 10.000 items in its RCU queue, then the oldest
>>> entry cannot still be in use by another CPU. This might sounds as a violation
>>> of RCU rules, (I'm not an RCU expert) but seems quite reasonable.
>> Fixing the real problem in the routing code would be the real fix.
>>
>> The underlying problem of RCU and memory usage could be solved more
>> safely by making sure that the sleeping memory allocator path always
>> waits until at least one RCU cleanup has occurred after it fails an
>> allocation before it starts trying harder. That ought to also naturally
>> throttle memory consumers more in the situation which is the right
>> behaviour.
>
> A quick look at rt_garbage_collect() leads me to believe that although
> the IP route cache does try to limit its use of memory, it does not
> fully account for memory that it has released to RCU, but that RCU has
> not yet freed due to a grace period not having elapsed.
>
> The following appears to be possible:
>
> 1. rt_garbage_collect() sees that there are too many entries,
> and sets "goal" to the number to free up, based on a
> computed "equilibrium" value.
>
> 2. The number of entries is (correctly) decremented only when
> the corresponding RCU callback is invoked, which actually
> frees the entry.
>
> 3. Between the time that rt_garbage_collect() is invoked the
> first time and when the RCU grace period ends, rt_garbage_collect()
> is invoked again. It still sees too many entries (since
> RCU has not yet freed the ones released by the earlier
> invocation in step (1) above), so frees a bunch more.
>
> 4. Packets routed now miss the route cache, because the corresponding
> entries are waiting for a grace period, slowing the system down.
> Therefore, even more entries are freed to make room for new
> entries corresponding to the new packets.
>
> If my (likely quite naive) reading of the IP route cache code is correct,
> it would be possible to end up in a steady state with most of the entries
> always being in RCU rather than in the route cache.
>
> Eric, could this be what is happening to your system?
>
> If it is, one straightforward fix would be to keep a count of the number
> of route-cache entries waiting on RCU, and for rt_garbage_collect()
> to subtract this number of entries from its goal. Does this make sense?
>

Hi Paul

Thanks for reviewing route code :)

As I said, the problem comes from the 'route cache flush' that is periodically
done by rt_run_flush(), triggered by rt_flush_timer.

The 10% of LOWMEM RAM that was used by route-cache entries is pushed into RCU
queues (with call_rcu_bh()), and the network continues to receive
packets from *many* sources that all want their route-cache entry.


Eric

2006-01-06 19:24:14

by Lee Revell

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Fri, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> I have some servers that once in a while crashes when the ip route
> cache is flushed. After
> raising /proc/sys/net/ipv4/route/secret_interval (so that *no*
> flush is done), I got better uptime for these servers.

Argh, where is that documented? I have been banging my head against
this for weeks - how do I keep the kernel from flushing 4096 routes at
once in softirq context causing huge (~8-20ms) latency problems?

I tried all the route related sysctls I could find and nothing worked...

Lee

2006-01-06 19:26:44

by Lee Revell

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Fri, 2006-01-06 at 13:58 +0100, Andi Kleen wrote:
> Another CPU might be stuck in a long
> running interrupt

Shouldn't a long running interrupt be considered a bug?

Lee

2006-01-06 20:25:58

by Paul E. McKenney

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Fri, Jan 06, 2006 at 06:19:15PM +0100, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> >On Fri, Jan 06, 2006 at 01:37:12PM +0000, Alan Cox wrote:
> >>On Gwe, 2006-01-06 at 11:17 +0100, Eric Dumazet wrote:
> >>>I assume that if a CPU queued 10.000 items in its RCU queue, then the
> >>>oldest entry cannot still be in use by another CPU. This might sounds as
> >>>a violation of RCU rules, (I'm not an RCU expert) but seems quite
> >>>reasonable.
> >>Fixing the real problem in the routing code would be the real fix.
> >>
> >>The underlying problem of RCU and memory usage could be solved more
> >>safely by making sure that the sleeping memory allocator path always
> >>waits until at least one RCU cleanup has occurred after it fails an
> >>allocation before it starts trying harder. That ought to also naturally
> >>throttle memory consumers more in the situation which is the right
> >>behaviour.
> >
> >A quick look at rt_garbage_collect() leads me to believe that although
> >the IP route cache does try to limit its use of memory, it does not
> >fully account for memory that it has released to RCU, but that RCU has
> >not yet freed due to a grace period not having elapsed.
> >
> >The following appears to be possible:
> >
> >1. rt_garbage_collect() sees that there are too many entries,
> > and sets "goal" to the number to free up, based on a
> > computed "equilibrium" value.
> >
> >2. The number of entries is (correctly) decremented only when
> > the corresponding RCU callback is invoked, which actually
> > frees the entry.
> >
> >3. Between the time that rt_garbage_collect() is invoked the
> > first time and when the RCU grace period ends, rt_garbage_collect()
> > is invoked again. It still sees too many entries (since
> > RCU has not yet freed the ones released by the earlier
> > invocation in step (1) above), so frees a bunch more.
> >
> >4. Packets routed now miss the route cache, because the corresponding
> > entries are waiting for a grace period, slowing the system down.
> > Therefore, even more entries are freed to make room for new
> > entries corresponding to the new packets.
> >
> >If my (likely quite naive) reading of the IP route cache code is correct,
> >it would be possible to end up in a steady state with most of the entries
> >always being in RCU rather than in the route cache.
> >
> >Eric, could this be what is happening to your system?
> >
> >If it is, one straightforward fix would be to keep a count of the number
> >of route-cache entries waiting on RCU, and for rt_garbage_collect()
> >to subtract this number of entries from its goal. Does this make sense?
> >
>
> Hi Paul
>
> Thanks for reviewing route code :)
>
> As I said, the problem comes from 'route flush cache', that is periodically
> done by rt_run_flush(), triggered by rt_flush_timer.
>
> The 10% of LOWMEM ram that was used by route-cache entries are pushed into
> rcu queues (with call_rcu_bh()) and network continue to receive
> packets from *many* sources that want their route-cache entry.

Hello, Eric,

The rt_run_flush() function could indeed be suffering from the same
problem. Dipankar's recent patch should help RCU grace periods proceed
more quickly; does that help?

If not, it may be worthwhile to limit the number of times that
rt_run_flush() runs per RCU grace period.

Thanx, Paul

2006-01-06 20:34:03

by David Miller

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

From: "Paul E. McKenney" <[email protected]>
Date: Fri, 6 Jan 2006 12:26:26 -0800

> If not, it may be worthwhile to limit the number of times that
> rt_run_flush() runs per RCU grace period.

This is mixing two sets of requirements.

rt_run_flush() runs periodically in order to regenerate the hash
function secret key. Now, for that specific case it might actually be
possible to rehash instead of flush, but the locking is a little bit
tricky :-) And also, I think we're regenerating the secret key
just a little bit too often; I think we'd get enough security
with a less frequent regeneration.

I'll look into this and your other ideas later today hopefully.

2006-01-06 22:34:11

by Andi Kleen

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Friday 06 January 2006 21:26, Paul E. McKenney wrote:

> If not, it may be worthwhile to limit the number of times that
> rt_run_flush() runs per RCU grace period.

The problem is that without rt_run_flush() new routes and route attribute
changes don't get used by the stack. If RCU takes long and routes
keep changing, this might be a big issue.

As an admin I would certainly be annoyed if the network stack
ignored my new route for some unbounded time.

Perhaps a better way would be to just exclude dst entries in RCU state
from the normal accounting and assume that if the system
really runs short of memory because of this the results would
trigger quiescent states more quickly, freeing the memory again.

-Andi

2006-01-06 22:33:59

by Andi Kleen

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Friday 06 January 2006 20:26, Lee Revell wrote:
> On Fri, 2006-01-06 at 13:58 +0100, Andi Kleen wrote:
> > Another CPU might be stuck in a long
> > running interrupt
>
> Shouldn't a long running interrupt be considered a bug?

In normal operation yes, but there can always be exceptional
circumstances where it's unavoidable (e.g. during error handling),
and in the name of defensive programming the rest of the system ought
to tolerate it.

-Andi

2006-01-07 00:20:22

by David Miller

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

From: Andi Kleen <[email protected]>
Date: Fri, 6 Jan 2006 21:57:41 +0100

> Perhaps a better way would be to just exclude dst entries in RCU state
> from the normal accounting and assume that if the system
> really runs short of memory because of this the results would
> trigger quiescent states more quickly, freeing the memory again.

That's one idea...

Eric, how important do you honestly think the per-hashchain spinlocks
are? That's the big barrier from making rt_secret_rebuild() a simple
rehash instead of flushing the whole table as it does now.

The lock is only grabbed for updates, and the access to these locks is
random and as such probably non-local when taken anyways. Back before
we used RCU for reads, this array-of-spinlock thing made a lot more
sense.

I mean something like this patch:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f701a13..f9436c7 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -204,36 +204,8 @@ __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-/*
- * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks
- * The size of this table is a power of two and depends on the number of CPUS.
- */
-#if NR_CPUS >= 32
-#define RT_HASH_LOCK_SZ 4096
-#elif NR_CPUS >= 16
-#define RT_HASH_LOCK_SZ 2048
-#elif NR_CPUS >= 8
-#define RT_HASH_LOCK_SZ 1024
-#elif NR_CPUS >= 4
-#define RT_HASH_LOCK_SZ 512
-#else
-#define RT_HASH_LOCK_SZ 256
-#endif

-static spinlock_t *rt_hash_locks;
-# define rt_hash_lock_addr(slot) &rt_hash_locks[(slot) & (RT_HASH_LOCK_SZ - 1)]
-# define rt_hash_lock_init() { \
- int i; \
- rt_hash_locks = kmalloc(sizeof(spinlock_t) * RT_HASH_LOCK_SZ, GFP_KERNEL); \
- if (!rt_hash_locks) panic("IP: failed to allocate rt_hash_locks\n"); \
- for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
- spin_lock_init(&rt_hash_locks[i]); \
- }
-#else
-# define rt_hash_lock_addr(slot) NULL
-# define rt_hash_lock_init()
-#endif
+static DEFINE_SPINLOCK(rt_hash_lock);

static struct rt_hash_bucket *rt_hash_table;
static unsigned rt_hash_mask;
@@ -627,7 +599,7 @@ static void rt_check_expire(unsigned lon

if (*rthp == 0)
continue;
- spin_lock(rt_hash_lock_addr(i));
+ spin_lock(&rt_hash_lock);
while ((rth = *rthp) != NULL) {
if (rth->u.dst.expires) {
/* Entry is expired even if it is in use */
@@ -660,7 +632,7 @@ static void rt_check_expire(unsigned lon
rt_free(rth);
#endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
}
- spin_unlock(rt_hash_lock_addr(i));
+ spin_unlock(&rt_hash_lock);

/* Fallback loop breaker. */
if (time_after(jiffies, now))
@@ -683,11 +655,11 @@ static void rt_run_flush(unsigned long d
get_random_bytes(&rt_hash_rnd, 4);

for (i = rt_hash_mask; i >= 0; i--) {
- spin_lock_bh(rt_hash_lock_addr(i));
+ spin_lock_bh(&rt_hash_lock);
rth = rt_hash_table[i].chain;
if (rth)
rt_hash_table[i].chain = NULL;
- spin_unlock_bh(rt_hash_lock_addr(i));
+ spin_unlock_bh(&rt_hash_lock);

for (; rth; rth = next) {
next = rth->u.rt_next;
@@ -820,7 +792,7 @@ static int rt_garbage_collect(void)

k = (k + 1) & rt_hash_mask;
rthp = &rt_hash_table[k].chain;
- spin_lock_bh(rt_hash_lock_addr(k));
+ spin_lock_bh(&rt_hash_lock);
while ((rth = *rthp) != NULL) {
if (!rt_may_expire(rth, tmo, expire)) {
tmo >>= 1;
@@ -852,7 +824,7 @@ static int rt_garbage_collect(void)
goal--;
#endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
}
- spin_unlock_bh(rt_hash_lock_addr(k));
+ spin_unlock_bh(&rt_hash_lock);
if (goal <= 0)
break;
}
@@ -922,7 +894,7 @@ restart:

rthp = &rt_hash_table[hash].chain;

- spin_lock_bh(rt_hash_lock_addr(hash));
+ spin_lock_bh(&rt_hash_lock);
while ((rth = *rthp) != NULL) {
#ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
if (!(rth->u.dst.flags & DST_BALANCED) &&
@@ -948,7 +920,7 @@ restart:
rth->u.dst.__use++;
dst_hold(&rth->u.dst);
rth->u.dst.lastuse = now;
- spin_unlock_bh(rt_hash_lock_addr(hash));
+ spin_unlock_bh(&rt_hash_lock);

rt_drop(rt);
*rp = rth;
@@ -989,7 +961,7 @@ restart:
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
if (err) {
- spin_unlock_bh(rt_hash_lock_addr(hash));
+ spin_unlock_bh(&rt_hash_lock);

if (err != -ENOBUFS) {
rt_drop(rt);
@@ -1030,7 +1002,7 @@ restart:
}
#endif
rt_hash_table[hash].chain = rt;
- spin_unlock_bh(rt_hash_lock_addr(hash));
+ spin_unlock_bh(&rt_hash_lock);
*rp = rt;
return 0;
}
@@ -1098,7 +1070,7 @@ static void rt_del(unsigned hash, struct
{
struct rtable **rthp;

- spin_lock_bh(rt_hash_lock_addr(hash));
+ spin_lock_bh(&rt_hash_lock);
ip_rt_put(rt);
for (rthp = &rt_hash_table[hash].chain; *rthp;
rthp = &(*rthp)->u.rt_next)
@@ -1107,7 +1079,7 @@ static void rt_del(unsigned hash, struct
rt_free(rt);
break;
}
- spin_unlock_bh(rt_hash_lock_addr(hash));
+ spin_unlock_bh(&rt_hash_lock);
}

void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
@@ -3155,7 +3127,6 @@ int __init ip_rt_init(void)
&rt_hash_mask,
0);
memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
- rt_hash_lock_init();

ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;

2006-01-07 02:18:06

by Andi Kleen

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Saturday 07 January 2006 01:17, David S. Miller wrote:

>
> I mean something like this patch:

Looks like a good idea to me.

I always disliked the per chain spinlocks even for other hash tables like
TCP/UDP multiplex - it would be much nicer to use a much smaller separately
hashed lock table and save cache. In this case the special case of using
a one entry only lock hash table makes sense.

-Andi

2006-01-07 07:11:21

by David Miller

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

From: Andi Kleen <[email protected]>
Date: Sat, 7 Jan 2006 02:09:01 +0100

> I always disliked the per chain spinlocks even for other hash tables like
> TCP/UDP multiplex - it would be much nicer to use a much smaller separately
> hashed lock table and save cache. In this case the special case of using
> a one entry only lock hash table makes sense.

I used to think they were a great technique. But in each case where I
thought they could be applied, better schemes have come along.
In the case of the page cache we went to a per-address-space tree,
and here in the routing cache we went to RCU.

There are RCU patches around for the TCP hashes and I'd like to
put those in at some point as well. In fact, they'd be even
more far reaching since Arnaldo abstracted away the socket
hashing stuff into an inet_hashtables subsystem.

2006-01-07 07:34:55

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

Andi Kleen wrote:
>
> I always disliked the per chain spinlocks even for other hash tables like
> TCP/UDP multiplex - it would be much nicer to use a much smaller separately
> hashed lock table and save cache. In this case the special case of using
> a one entry only lock hash table makes sense.
>

I agree, I do use a hashed spinlock array on my local tree for TCP, mainly to
reduce the hash table size by a 2 factor.

Eric

2006-01-07 07:47:03

by David Miller

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

From: Eric Dumazet <[email protected]>
Date: Sat, 07 Jan 2006 08:34:35 +0100

> I agree, I do use a hashed spinlock array on my local tree for TCP,
> mainly to reduce the hash table size by a 2 factor.

So what do you think about going to a single spinlock for the
routing cache?

2006-01-07 07:54:01

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

David S. Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Sat, 07 Jan 2006 08:34:35 +0100
>
>> I agree, I do use a hashed spinlock array on my local tree for TCP,
>> mainly to reduce the hash table size by a 2 factor.
>
> So what do you think about going to a single spinlock for the
> routing cache?

I have no problem with this, since the biggest server I have is 4-way, but are
you sure big machines won't suffer from this single spinlock?

Also I don't understand what you want to do after this single spinlock patch.
How is it supposed to help the 'ip route flush cache' problem?

In my case, I have about 600,000 dst entries:

# grep ip_dst /proc/slabinfo
ip_dst_cache 616250 622440 320 12 1 : tunables 54 27 8 : slabdata 51870 51870 0


Eric

2006-01-07 08:30:33

by Eric Dumazet

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

David S. Miller wrote:
>
> Eric, how important do you honestly think the per-hashchain spinlocks
> are? That's the big barrier from making rt_secret_rebuild() a simple
> rehash instead of flushing the whole table as it does now.
>

No problem for me in going to a single spinlock.
I did the hashed spinlock patch in order to reduce the size of the route hash
table without hurting big NUMA machines. If you think a single spinlock is OK,
that's even better!

> The lock is only grabbed for updates, and the access to these locks is
> random and as such probably non-local when taken anyways. Back before
> we used RCU for reads, this array-of-spinlock thing made a lot more
> sense.
>
> I mean something like this patch:
>

> +static DEFINE_SPINLOCK(rt_hash_lock);


Just one point: this should be cacheline-aligned, and should use one full cache
line, to avoid false sharing at least. (If a CPU takes the lock, there is no
need to invalidate *rt_hash_table for all the other CPUs.)
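
Something along these lines, as a sketch (using the alignment annotation
from linux/cache.h):

/* keep the lock on its own cache line, away from rt_hash_table and friends */
static spinlock_t rt_hash_lock ____cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;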

Eric

2006-01-07 08:36:46

by David Miller

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

From: Eric Dumazet <[email protected]>
Date: Sat, 07 Jan 2006 08:53:52 +0100

> I have no problem with this, since the biggest server I have is 4
> way, but are you sure big machines wont suffer from this single
> spinlock ?

It is the main question.

> Also I dont understand what you want to do after this single
> spinlock patch. How is it supposed to help the 'ip route flush
> cache' problem ? In my case, I have about 600.000 dst-entries :

I don't claim to have a solution to this problem currently.

Doing RCU and going through the whole DST GC machinery is overkill for
an active system. So, perhaps a very simple solution will do:

1) On rt_run_flush(), do not rt_free(), instead collect all active
routing cache entries onto a global list, begin a timer to
fire in 10 seconds (or some sysctl configurable amount).

2) When a new routing cache entry is needed, check the global
list appended to in #1 above first, failing that do dst_alloc()
as is done currently.

3) If timer expires, rt_free() any entries in the global list.

The missing trick is how to ensure RCU semantics when reallocating
from the global list.
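
Ignoring that missing trick entirely, steps 1-3 might look roughly like the
sketch below (every rt_flush_pool* / rt_park* / rt_pool_timer* name is
invented for illustration; this is not a proposed patch):

/* net/ipv4/route.c context assumed */
static struct rtable *rt_flush_pool;            /* entries parked by the flush */
static DEFINE_SPINLOCK(rt_flush_pool_lock);

static void rt_pool_timer_fn(unsigned long dummy);
static struct timer_list rt_pool_timer =
                TIMER_INITIALIZER(rt_pool_timer_fn, 0, 0);

/* 1) rt_run_flush() would call this instead of rt_free() */
static void rt_park(struct rtable *rth)
{
        spin_lock_bh(&rt_flush_pool_lock);
        rth->u.rt_next = rt_flush_pool;
        rt_flush_pool = rth;
        spin_unlock_bh(&rt_flush_pool_lock);
        mod_timer(&rt_pool_timer, jiffies + 10 * HZ);
}

/* 2) the insert path would try this before falling back to dst_alloc() */
static struct rtable *rt_unpark(void)
{
        struct rtable *rth;

        spin_lock_bh(&rt_flush_pool_lock);
        rth = rt_flush_pool;
        if (rth)
                rt_flush_pool = rth->u.rt_next;
        spin_unlock_bh(&rt_flush_pool_lock);
        /* NULL means use dst_alloc(); reuse would also have to re-init the
         * entry and, as noted above, somehow keep RCU readers safe. */
        return rth;
}

/* 3) whatever is still parked when the timer fires goes to RCU as before */
static void rt_pool_timer_fn(unsigned long dummy)
{
        struct rtable *rth, *next;

        spin_lock_bh(&rt_flush_pool_lock);
        rth = rt_flush_pool;
        rt_flush_pool = NULL;
        spin_unlock_bh(&rt_flush_pool_lock);

        for (; rth; rth = next) {
                next = rth->u.rt_next;
                rt_free(rth);
        }
}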

The idea is that an active system will immediately repopulate itself
with all of these entries just flushed from the table.

RCU really doesn't handle this kind of problem very well. It truly
excels when work is generated by process context work, not interrupt
work.

2006-01-07 20:29:37

by Paul E. McKenney

Subject: Re: [PATCH, RFC] RCU : OOM avoidance and lower latency

On Sat, Jan 07, 2006 at 12:36:25AM -0800, David S. Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Sat, 07 Jan 2006 08:53:52 +0100
>
> > I have no problem with this, since the biggest server I have is 4
> > way, but are you sure big machines wont suffer from this single
> > spinlock ?
>
> It is the main question.
>
> > Also I dont understand what you want to do after this single
> > spinlock patch. How is it supposed to help the 'ip route flush
> > cache' problem ? In my case, I have about 600.000 dst-entries :
>
> I don't claim to have a solution to this problem currently.
>
> Doing RCU and going through the whole DST GC machinery is overkill for
> an active system. So, perhaps a very simple solution will do:
>
> 1) On rt_run_flush(), do not rt_free(), instead collect all active
> routing cache entries onto a global list, begin a timer to
> fire in 10 seconds (or some sysctl configurable amount).
>
> 2) When a new routing cache entry is needed, check the global
> list appended to in #1 above first, failing that do dst_alloc()
> as is done currently.
>
> 3) If timer expires, rt_free() any entries in the global list.
>
> The missing trick is how to ensure RCU semantics when reallocating
> from the global list.

The straightforward ways of doing this require a per-entry lock in
addition to the dst_entry reference count -- lots of read-side overhead.

More complex approaches use a generation number that is incremented
when adding to or removing from the global list. When the generation
number overflows, unconditionally rt_free() it rather than adding
to the global list again. Then there needs to be some clever code
on the read side to detect the case when the generation number
changes while acquiring a reference. And memory barriers. Also
lots of read-side overhead. Also, it is now -always- necessary to
acquire a reference on the read-side.

> The idea is that an active system will immediately repopulate itself
> with all of these entries just flushed from the table.
>
> RCU really doesn't handle this kind of problem very well. It truly
> excels when work is generated by process context work, not interrupt
> work.

Sounds like a challenge to me. ;-)

Well, one possible way to attack Eric's workload might be the following:

o Size the hash table to strike the appropriate balance between
read-side search overhead and memory consumption. Call the
number of hash-chain headers N.

o Create a hashed array of locks sized to allow the update to
proceed sufficiently quickly. Call the number of locks M,
probably a power of two. This means that M CPUs can be doing
the update in parallel.

o Create an array of M^2 list headers (call it xfer[][]), but
since this is only needed during an update, it can be allocated
and deallocated if need be. (Me, with my big-server experience,
would probably just create the array, since M is not likely to
be too large. But your mileage may vary. And you really only
need M*(M-1) list headers, but that makes the index calculation
a bit more annoying.)

o Use a two-phase update. In the first phase, each updating
CPU acquires the corresponding lock and removes entries from
the corresponding partition of the hash table. If the new
location of a given entry falls into the same partition, it
is added back to the appropriate hash chain of that partition.
Otherwise, add the entry to xfer[dst][src], where "src" and
"dst" are indexes of the corresponding partitions.

o When all CPUs finish removing entries from their partition,
they check into a barrier. Once all have checked in, they
can start the second phase of the update.

o In the second phase, each CPU removes the entries from the
xfer array that are destined for its partition and adds them
to the hash chain that they are destined for.
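
In very rough code, with every name invented, the two phases could look like
the sketch below. It leaves out the barrier between the phases, uses a
hypothetical rt_new_hash() that recomputes an entry's chain index with the
new secret, and (unlike the description above) routes same-partition entries
through xfer[][] as well:

#define NPART 8                                 /* "M" above, a power of two */

static spinlock_t part_lock[NPART];             /* spin_lock_init()ed at boot */
static struct rtable *xfer[NPART][NPART];       /* xfer[dst][src] chains */

unsigned rt_new_hash(const struct rtable *rth); /* new chain index for rth */

static inline int rt_part(unsigned hash)
{
        return hash & (NPART - 1);
}

/* Phase 1: the thread owning partition "src" drains its chains and sorts
 * every entry into xfer[][] according to where it will live afterwards. */
static void rehash_phase1(int src)
{
        unsigned int i;

        spin_lock_bh(&part_lock[src]);
        for (i = src; i <= rt_hash_mask; i += NPART) {
                struct rtable *rth = rt_hash_table[i].chain;

                rt_hash_table[i].chain = NULL;
                while (rth) {
                        struct rtable *next = rth->u.rt_next;
                        int dst = rt_part(rt_new_hash(rth));

                        rth->u.rt_next = xfer[dst][src];
                        xfer[dst][src] = rth;
                        rth = next;
                }
        }
        spin_unlock_bh(&part_lock[src]);
}

/* ...all threads meet at a barrier here... */

/* Phase 2: the thread owning partition "dst" pulls everything destined
 * for it out of xfer[][] and links it onto the new chains. */
static void rehash_phase2(int dst)
{
        int src;

        spin_lock_bh(&part_lock[dst]);
        for (src = 0; src < NPART; src++) {
                struct rtable *rth = xfer[dst][src];

                xfer[dst][src] = NULL;
                while (rth) {
                        struct rtable *next = rth->u.rt_next;
                        unsigned h = rt_new_hash(rth);

                        rth->u.rt_next = rt_hash_table[h].chain;
                        rt_hash_table[h].chain = rth;
                        rth = next;
                }
        }
        spin_unlock_bh(&part_lock[dst]);
}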

Some commentary and variations, in the hope that this inspires someone
to come up with an even better idea:

o Unless M is at least three, there is no performance gain
over a single global lock with a single CPU doing the update,
since each element must now undergo four list operations rather
than just two.

o The xfer[][] array must have each entry cache-aligned, or
you lose big on cacheline effects. Note that it is -not-
sufficient to simply align the rows or the columns, since
each CPU has its own column when inserting and its own
row when removing from xfer[][].

o And the data-skew effects are less severe if this procedure
runs from process context. A spinning barrier must be used
otherwise. But note that the per-partition locks could remain
spinlocks, only the barrier need involve sleeping (in case
that helps, am getting a bit ahead of my understanding of
this part of the kernel).

So I half-agree with Dave -- this works better if the bulk
update is done in process context, however, updates involving
single entries being individually added and removed from the
hash table can be done from interrupt context.

o One (more complex!) way to reduce the effect of data skew is to
partition the array based on the number of elements in each
partition rather than by the number of chains, but this would
require a second barrier that is completed before dropping locks
in order to avoid messing up concurrent single-element updates.

And this only handles the phase-1 data-skew effects. It is
possible to handle phase-2 data skew as well, but it takes an
up-front full-table scan and so it is not clear how it could
possibly be a performance win.

o Note that routing slows down significantly while an update is
in progress, because on average only half of the entries are
in the hash table during the update. I have no idea whether
or not this is acceptable. I am assuming that the same
mechanism that prevents two concurrent route-cache misses
from inserting two entries from the same route would also
prevent adding a new entry when one was already in the
xfer array.

Thoughts?

Thanx, Paul