Hi,
maybe you could help me out with a really weird problem we're having
with an NFS fileserver for a couple of webservers:
- Dual Xeon 2.2 GHz
- 6 GB RAM
- QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
- Debian "woody" w/Kernel 2.4.19
Running just "find /" (or ls -R or tar on a large directory) locally
slows the box down to absolute unresponsiveness - it takes minutes
to just run ps and kill the find process. During that time, kupdated
and kswapd gobble up all available CPU time.
The system performs great otherwise, so I've ruled out a hardware
problem. It can't be a load problem because during normal operation,
the system is more or less bored out of its mind (70-90% idle time).
I'm really at the end of my wits here :-(
Any help would be greatly appreciated!
TIA,
Thomas
On 4 February 2003 11:29, Thomas Bätzler wrote:
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
> The system performs great otherwise, so I've ruled out a hardware
> problem. It can't be a load problem because during normal operation,
> the system is more or less bored out of its mind (70-90% idle time).
>
> I'm really at the end of my wits here :-(
>
> Any help would be greatly appreciated!
Canned response:
* does non-highmem kernel make any difference?
* does UP kernel make any difference?
* can you profile kernel while "time ls -R" is running?
* try 2.4.20 and/or .21-pre4
* tell us what you found out
--
vda
>
> Hi,
>
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
Could be that your "low memory" is filled up with inodes. This would
only happen in these tests if you're using ext2, and there are a *lot*
of directories.
I've prepared a lineup of Andrea's VM patches at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
It would be useful if you could apply 10_inode-highmem-2.patch and
report back. It applies to 2.4.19 as well, and should work OK there.
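One way to check the "low memory filled up with inodes" theory before patching is to watch the inode slab while the find runs. The following is a minimal sketch, not part of the original thread: it assumes each /proc/slabinfo line starts with the cache name, the active object count, the total object count and the object size, and that ext2 inodes on this kernel are accounted under a cache called "inode_cache". The helper name slab_cache_bytes is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/slabinfo and return the number of bytes held by
 * the named cache (total objects * object size). Returns -1 if the line
 * belongs to another cache or cannot be parsed.
 *
 * Assumed field layout (not taken from the thread):
 *   name active_objs total_objs object_size ...
 */
long long slab_cache_bytes(const char *line, const char *cache)
{
	char name[64];
	unsigned long active, total, objsize;

	assert(line != NULL && cache != NULL);

	if (sscanf(line, "%63s %lu %lu %lu", name, &active, &total, &objsize) != 4)
		return -1;
	if (strcmp(name, cache) != 0)
		return -1;
	return (long long)total * (long long)objsize;
}
```

Feeding it every line of /proc/slabinfo a few seconds apart during the "find /" run would show whether the inode cache grows towards the size of low memory (roughly 896 MB on a typical i386 highmem configuration).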
On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
Hi Andrew,
> > Running just "find /" (or ls -R or tar on a large directory) locally
> > slows the box down to absolute unresponsiveness - it takes minutes
> > to just run ps and kill the find process. During that time, kupdated
> > and kswapd gobble up all available CPU time.
> Could be that your "low memory" is filled up with inodes. This would
> only happen in these tests if you're using ext2, and there are a *lot*
> of directories.
> I've prepared a lineup of Andrea's VM patches at
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> It would be useful if you could apply 10_inode-highmem-2.patch and
> report back. It applies to 2.4.19 as well, and should work OK there.
Is there any reason why this (inode-highmem-2) has not yet been submitted
for inclusion into mainline?
ciao, Marc
On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
>
> Hi Andrew,
>
> > > Running just "find /" (or ls -R or tar on a large directory) locally
> > > slows the box down to absolute unresponsiveness - it takes minutes
> > > to just run ps and kill the find process. During that time, kupdated
> > > and kswapd gobble up all available CPU time.
> > Could be that your "low memory" is filled up with inodes. This would
> > only happen in these tests if you're using ext2, and there are a *lot*
> > of directories.
> > I've prepared a lineup of Andrea's VM patches at
> > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> > It would be useful if you could apply 10_inode-highmem-2.patch and
> > report back. It applies to 2.4.19 as well, and should work OK there.
> is there any reason why this (inode-highmem-2) has never been submitted for
> inclusion into mainline yet?
Marcelo please include this:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
Other fixes should be included too, but unfortunately they don't apply
cleanly yet; I (or somebody else) should rediff them against mainline.
Andrea
On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
Hi Andrea,
> Marcelo please include this:
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
Great, thanks. Now let's hope Marcelo uses this :)
> other fixes should be included too but they don't apply cleanly yet
> unfortunately, I (or somebody else) should rediff them against mainline.
Can you tell me which ones in particular you mean? I'd do this.
ciao, Marc
Marc-Christian Petersen <[email protected]> wrote:
>
> On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Marcelo please include this:
> > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> great. Thanks. Now let's hope Marcelo use this :)
>
> > other fixes should be included too but they don't apply cleanly yet
> > unfortunately, I (or somebody else) should rediff them against mainline.
> Can you tell me what in special you mean? I'd do this.
>
Andrea's VM patches against 2.4.21-pre4 are at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
The order in which to apply them is given in the series file.
These have been rediffed and apply cleanly. They have not been
tested much, though.
Good morning, Thomas,
On Tue, 4 Feb 2003, Thomas Bätzler wrote:
> maybe you could help me out with a really weird problem we're having
> with an NFS fileserver for a couple of webservers:
>
> - Dual Xeon 2.2 GHz
> - 6 GB RAM
> - QLogic FCAL Host adapter with about 5.5 TB on several RAIDs
> - Debian "woody" w/Kernel 2.4.19
>
> Running just "find /" (or ls -R or tar on a large directory) locally
> slows the box down to absolute unresponsiveness - it takes minutes
> to just run ps and kill the find process. During that time, kupdated
> and kswapd gobble up all available CPU time.
>
> The system performs great otherwise, so I've ruled out a hardware
> problem. It can't be a load problem because during normal operation,
> the system is more or less bored out of its mind (70-90% idle time).
>
> I'm really at the end of my wits here :-(
>
> Any help would be greatly appreciated!
I'm sure the inode problem Andrew and Andrea have pointed out is
more likely.
However, just out of interest, does the problem go away or become
less severe if you use "noatime" on that filesystem?
mount -o remount,noatime /my_raid_mount_point
?
Cheers,
- Bill
---------------------------------------------------------------------------
Lavish spending can be disastrous. Don't buy any lavishes for a
while.
(Courtesy of Paul Jakma <[email protected]>)
--------------------------------------------------------------------------
William Stearns ([email protected]). Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
Linux articles at: http://www.opensourcedigest.com
--------------------------------------------------------------------------
On Thursday 20 February 2003 19:35, Andrew Morton wrote:
Hi Andrew,
> Andrea's VM patches, against 2.4.21-pre4 are at
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> The applying order is in the series file.
I am afraid Marcelo will never accept these, or at least some of them.
Or am I wrong?
ciao, Marc
Marc-Christian Petersen <[email protected]> wrote:
>
> On Thursday 20 February 2003 19:35, Andrew Morton wrote:
>
> Hi Andrew,
>
> > Andrea's VM patches, against 2.4.21-pre4 are at
> > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> > The applying order is in the series file.
> I am afraid Marcelo will never accept these or some of them.
>
The most important one is inode-highmem. It's a safe patch, and the risk of
it causing problems due to not having other surrounding -aa stuff is low.
It's a matter of someone getting down, testing it and sending it.
Ho hum. It'll take an hour. I shall try.
On Thu, Feb 20, 2003 at 10:35:43AM -0800, Andrew Morton wrote:
> Marc-Christian Petersen <[email protected]> wrote:
> >
> > On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
> >
> > Hi Andrea,
> >
> > > Marcelo please include this:
> > > > http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> > great. Thanks. Now let's hope Marcelo use this :)
> >
> > > other fixes should be included too but they don't apply cleanly yet
> > > unfortunately, I (or somebody else) should rediff them against mainline.
> > Can you tell me what in special you mean? I'd do this.
> >
>
> Andrea's VM patches, against 2.4.21-pre4 are at
>
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
>
> The applying order is in the series file.
Cool!
>
> These have been rediffed, and apply cleanly. They have not been
> tested much though.
If they didn't reject in a non-obvious way, they should work fine too ;)
If Marcelo merges them, I'll verify everything when I update to his tree,
like I do regularly with everything else that rejects.
Btw, today I finished fixing a deadlock condition in the xdr layer
triggered by nfs on highmem machines. Here's the fix against 2.4.21pre4;
please apply it to pre4 now, or it will have to live in my tree with the
other hundreds of patches, as happened to some of the patches we're
discussing in this thread.
The explanation is very simple: you _can't_ kmap two times in the context
of a single task (especially if more than one task can run the same code
at the same time), because a task that sleeps in its second kmap while
holding the first can wait forever on other tasks doing the same. I don't
yet have confirmation that this fixes the deadlock (it takes days to
reproduce, so it will take weeks to confirm), but I can't see anything
else wrong at the moment, and this remains a genuine highmem deadlock that
has to be fixed. The fix is optimal: nothing changes unless you run out of
kmaps, which is exactly when you could deadlock, so light workloads won't
be affected at all.
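The deadlock described here can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: the pool is just a counter of free slots, tasks are scheduled round-robin and take one mapping per turn, and the name kmap_sim_finished is hypothetical. With nonblocking set to 0, a task that holds mappings and finds the pool empty simply waits. With nonblocking set to 1, it processes the pages it already holds and releases them, which mirrors the idea behind kmap_nonblock.

```c
#include <assert.h>

#define MAX_TASKS 64

/*
 * Toy model of a kmap pool shared by several tasks that each need
 * pages_per_task mappings. Returns how many tasks manage to finish;
 * a result smaller than ntasks means the remaining tasks are deadlocked.
 */
int kmap_sim_finished(int pool_size, int ntasks, int pages_per_task,
		      int nonblocking)
{
	int held[MAX_TASKS] = {0};	/* mappings currently held per task */
	int done[MAX_TASKS] = {0};	/* pages already processed per task */
	int free_slots = pool_size;
	int finished = 0;
	int progress = 1;

	assert(pool_size > 0 && pages_per_task > 0);
	assert(ntasks > 0 && ntasks <= MAX_TASKS);

	while (finished < ntasks && progress) {
		progress = 0;
		for (int t = 0; t < ntasks; t++) {
			if (done[t] == pages_per_task)
				continue;	/* this task finished earlier */
			if (free_slots > 0) {
				/* map one more page */
				held[t]++;
				free_slots--;
				progress = 1;
				if (held[t] + done[t] == pages_per_task) {
					/* everything mapped: process, unmap all */
					free_slots += held[t];
					held[t] = 0;
					done[t] = pages_per_task;
					finished++;
				}
			} else if (nonblocking && held[t] > 0) {
				/* pool empty: process what we hold, then unmap */
				done[t] += held[t];
				free_slots += held[t];
				held[t] = 0;
				progress = 1;
			}
			/* otherwise the task sleeps, still holding its mappings */
		}
	}
	return finished;
}
```

With a pool of 2 slots and two tasks that each need 2 pages, the blocking variant finishes no task at all, because each task holds one slot and waits for the other. The nonblocking variant finishes both, and a pool with one spare slot lets even the blocking variant complete.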
Note that this was developed on top of 2.4.21pre4aa3, so I had to rework
it to make it apply cleanly to mainline. The version I tested and included
in -aa is different, so this one is untested, but if it compiles it will
work like a charm ;).
2.5.62 has the very same deadlock condition in xdr triggered by nfs too.
Andrew, if you're forward porting it yourself like with the filebacked
vma merging feature just let me know so we make sure not to duplicate
effort.
diff -urNp nfs-ref/include/asm-i386/highmem.h nfs/include/asm-i386/highmem.h
--- nfs-ref/include/asm-i386/highmem.h 2003-02-14 07:01:58.000000000 +0100
+++ nfs/include/asm-i386/highmem.h 2003-02-20 21:42:17.000000000 +0100
@@ -56,16 +56,19 @@ extern void kmap_init(void) __init;
 #define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
 
-extern void * FASTCALL(kmap_high(struct page *page));
+extern void * FASTCALL(kmap_high(struct page *page, int nonblocking));
 extern void FASTCALL(kunmap_high(struct page *page));
 
-static inline void *kmap(struct page *page)
+#define kmap(page) __kmap(page, 0)
+#define kmap_nonblock(page) __kmap(page, 1)
+
+static inline void *__kmap(struct page *page, int nonblocking)
 {
 	if (in_interrupt())
 		out_of_line_bug();
 	if (page < highmem_start_page)
 		return page_address(page);
-	return kmap_high(page);
+	return kmap_high(page, nonblocking);
 }
 
 static inline void kunmap(struct page *page)
diff -urNp nfs-ref/include/linux/sunrpc/xdr.h nfs/include/linux/sunrpc/xdr.h
--- nfs-ref/include/linux/sunrpc/xdr.h 2003-02-19 01:12:41.000000000 +0100
+++ nfs/include/linux/sunrpc/xdr.h 2003-02-20 21:39:51.000000000 +0100
@@ -137,7 +137,7 @@ void xdr_zero_iovec(struct iovec *, int,
  * XDR buffer helper functions
  */
 extern int xdr_kmap(struct iovec *, struct xdr_buf *, unsigned int);
-extern void xdr_kunmap(struct xdr_buf *, unsigned int);
+extern void xdr_kunmap(struct xdr_buf *, unsigned int, int);
 extern void xdr_shift_buf(struct xdr_buf *, size_t);
 
 /*
diff -urNp nfs-ref/mm/highmem.c nfs/mm/highmem.c
--- nfs-ref/mm/highmem.c 2002-11-29 02:23:18.000000000 +0100
+++ nfs/mm/highmem.c 2003-02-20 21:45:27.000000000 +0100
@@ -77,7 +77,7 @@ static void flush_all_zero_pkmaps(void)
 	flush_tlb_all();
 }
 
-static inline unsigned long map_new_virtual(struct page *page)
+static inline unsigned long map_new_virtual(struct page *page, int nonblocking)
 {
 	unsigned long vaddr;
 	int count;
@@ -96,6 +96,9 @@ start:
 		if (--count)
 			continue;
 
+		if (nonblocking)
+			return 0;
+
 		/*
 		 * Sleep for somebody else to unmap their entries
 		 */
@@ -126,7 +129,7 @@ start:
 	return vaddr;
 }
 
-void *kmap_high(struct page *page)
+void *kmap_high(struct page *page, int nonblocking)
 {
 	unsigned long vaddr;
 
@@ -138,11 +141,15 @@ void *kmap_high(struct page *page)
 	 */
 	spin_lock(&kmap_lock);
 	vaddr = (unsigned long) page->virtual;
-	if (!vaddr)
-		vaddr = map_new_virtual(page);
+	if (!vaddr) {
+		vaddr = map_new_virtual(page, nonblocking);
+		if (!vaddr)
+			goto out;
+	}
 	pkmap_count[PKMAP_NR(vaddr)]++;
 	if (pkmap_count[PKMAP_NR(vaddr)] < 2)
 		BUG();
+ out:
 	spin_unlock(&kmap_lock);
 	return (void*) vaddr;
 }
diff -urNp nfs-ref/net/sunrpc/xdr.c nfs/net/sunrpc/xdr.c
--- nfs-ref/net/sunrpc/xdr.c 2002-11-29 02:23:23.000000000 +0100
+++ nfs/net/sunrpc/xdr.c 2003-02-20 21:39:51.000000000 +0100
@@ -180,7 +180,7 @@ int xdr_kmap(struct iovec *iov_base, str
 {
 	struct iovec *iov = iov_base;
 	struct page **ppage = xdr->pages;
-	unsigned int len, pglen = xdr->page_len;
+	unsigned int len, pglen = xdr->page_len, first_kmap;
 
 	len = xdr->head[0].iov_len;
 	if (base < len) {
@@ -203,9 +203,17 @@ int xdr_kmap(struct iovec *iov_base, str
 		ppage += base >> PAGE_CACHE_SHIFT;
 		base &= ~PAGE_CACHE_MASK;
 	}
+	first_kmap = 1;
 	do {
 		len = PAGE_CACHE_SIZE;
-		iov->iov_base = kmap(*ppage);
+		if (first_kmap) {
+			first_kmap = 0;
+			iov->iov_base = kmap(*ppage);
+		} else {
+			iov->iov_base = kmap_nonblock(*ppage);
+			if (!iov->iov_base)
+				goto out;
+		}
 		if (base) {
 			iov->iov_base += base;
 			len -= base;
@@ -223,20 +231,23 @@ map_tail:
 		iov->iov_base = (char *)xdr->tail[0].iov_base + base;
 		iov++;
 	}
+ out:
 	return (iov - iov_base);
 }
 
-void xdr_kunmap(struct xdr_buf *xdr, unsigned int base)
+void xdr_kunmap(struct xdr_buf *xdr, unsigned int base, int niov)
 {
 	struct page **ppage = xdr->pages;
 	unsigned int pglen = xdr->page_len;
 
 	if (!pglen)
 		return;
-	if (base > xdr->head[0].iov_len)
+	if (base >= xdr->head[0].iov_len)
 		base -= xdr->head[0].iov_len;
-	else
+	else {
+		niov--;
 		base = 0;
+	}
 
 	if (base >= pglen)
 		return;
@@ -250,7 +261,11 @@ void xdr_kunmap(struct xdr_buf *xdr, uns
 		 * we bump pglen here, and just subtract PAGE_CACHE_SIZE... */
 		pglen += base & ~PAGE_CACHE_MASK;
 	}
-	for (;;) {
+	/*
+	 * In case we could only do a partial xdr_kmap, all remaining iovecs
+	 * refer to pages. Otherwise we detect the end through pglen.
+	 */
+	for (; niov; niov--) {
 		flush_dcache_page(*ppage);
 		kunmap(*ppage);
 		if (pglen <= PAGE_CACHE_SIZE)
@@ -322,9 +337,22 @@ void
 xdr_shift_buf(struct xdr_buf *xdr, size_t len)
 {
 	struct iovec iov[MAX_IOVEC];
-	unsigned int nr;
+	unsigned int nr, len_part, n, skip;
+
+	skip = 0;
+	do {
+
+		nr = xdr_kmap(iov, xdr, skip);
+
+		len_part = 0;
+		for (n = 0; n < nr; n++)
+			len_part += iov[n].iov_len;
+
+		xdr_shift_iovec(iov, nr, len_part);
+
+		xdr_kunmap(xdr, skip, nr);
 
-	nr = xdr_kmap(iov, xdr, 0);
-	xdr_shift_iovec(iov, nr, len);
-	xdr_kunmap(xdr, 0);
+		skip += len_part;
+		len -= len_part;
+	} while (len);
 }
diff -urNp nfs-ref/net/sunrpc/xprt.c nfs/net/sunrpc/xprt.c
--- nfs-ref/net/sunrpc/xprt.c 2003-01-29 06:14:32.000000000 +0100
+++ nfs/net/sunrpc/xprt.c 2003-02-20 21:39:51.000000000 +0100
@@ -226,23 +226,34 @@ xprt_sendmsg(struct rpc_xprt *xprt, stru
 	/* Dont repeat bytes */
 	skip = req->rq_bytes_sent;
 	slen = xdr->len - skip;
-	niov = xdr_kmap(niv, xdr, skip);
+	oldfs = get_fs(); set_fs(get_ds());
+	do {
+		unsigned int slen_part, n;
 
-	msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
-	msg.msg_iov = niv;
-	msg.msg_iovlen = niov;
-	msg.msg_name = (struct sockaddr *) &xprt->addr;
-	msg.msg_namelen = sizeof(xprt->addr);
-	msg.msg_control = NULL;
-	msg.msg_controllen = 0;
+		niov = xdr_kmap(niv, xdr, skip);
 
-	oldfs = get_fs(); set_fs(get_ds());
-	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-	result = sock_sendmsg(sock, &msg, slen);
+		msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
+		msg.msg_iov = niv;
+		msg.msg_iovlen = niov;
+		msg.msg_name = (struct sockaddr *) &xprt->addr;
+		msg.msg_namelen = sizeof(xprt->addr);
+		msg.msg_control = NULL;
+		msg.msg_controllen = 0;
+
+		slen_part = 0;
+		for (n = 0; n < niov; n++)
+			slen_part += niv[n].iov_len;
+
+		clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+		result = sock_sendmsg(sock, &msg, slen_part);
+
+		xdr_kunmap(xdr, skip, niov);
+
+		skip += slen_part;
+		slen -= slen_part;
+	} while (result >= 0 && slen);
 	set_fs(oldfs);
 
-	xdr_kunmap(xdr, skip);
-
 	dprintk("RPC: xprt_sendmsg(%d) = %d\n", slen, result);
 
 	if (result >= 0)
Andrea
On Thu, Feb 20, 2003 at 01:41:04PM -0800, Andrew Morton wrote:
> Marc-Christian Petersen <[email protected]> wrote:
> >
> > On Thursday 20 February 2003 19:35, Andrew Morton wrote:
> >
> > Hi Andrew,
> >
> > > Andrea's VM patches, against 2.4.21-pre4 are at
> > > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/
> > > The applying order is in the series file.
> > I am afraid Marcelo will never accept these or some of them.
> >
>
> The most important one is inode-highmem. It's a safe patch, and the risk of
> it causing problems due to not having other surrounding -aa stuff is low.
>
> It's a matter of someone getting down, testing it and sending it.
>
> Ho hum. It'll take an hour. I shall try.
This is a pre kernel; it's meant to *test* stuff. If anything goes
wrong, we're here, ready to fix it immediately. Sure, applying a
last-minute patch to an -rc just before releasing the new official kernel
w/o any kind of testing was a bad idea, but we must not be too
conservative either, especially in cases like these where we are fixing
bugs. We can't delay bugfixes with the argument that they could
introduce new bugs; otherwise we might as well stop fixing bugs.
Also note that this stuff has been tested aggressively for a very long
time by lots of people; it's not a last-minute patch like the xdr
highmem deadlock fix ;).
Don't get me wrong, I'm not saying that Marcelo is too conservative.
Quite the opposite: I'm simply not so pessimistic as to think the stuff
won't go in ;).
Andrea
On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> >>>>> " " == Andrea Arcangeli <[email protected]> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
>
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
One should also consider kmap_atomic... (bcrl's suggestion)
Jeff
>>>>> " " == Jeff Garzik <[email protected]> writes:
> One should also consider kmap_atomic... (bcrl suggest)
The problem is that sendmsg() can sleep. kmap_atomic() isn't really
appropriate here.
Cheers,
Trond
On Feb 20, 2003 22:54 +0100, Andrea Arcangeli wrote:
> Explanation is very simple: you _can't_ kmap two times in the context of
> a single task (especially if more than one task can run the same code at
> the same time). I don't yet have the confirmation that this fixes the
> deadlock though (it takes days to reproduce so it will take weeks to
> confirm), but I can't see anything else wrong at the moment, and this
> remains a genuine highmem deadlock that has to be fixed. The fix is
> optimal, no change unless you run out of kmaps and in turn you can
> deadlock, i.e. all the light workloads won't be affected at all.
We had a similar problem in Lustre, where we have to kmap multiple pages
at once and hold them over a network RPC (which is doing zero-copy DMA
into multiple pages at once), and there is possibly a very heavy load
of kmaps because the client and the server can be on the same system.
What we did was set up a "kmap reservation", which used an atomic_dec()
+ wait_event() to reschedule the task until it could get enough kmaps
to satisfy the request without deadlocking (i.e. exceeding the kmap cap
which we conservatively set at 3/4 of all kmap space).
A single "server" task could exceed the kmap cap by enough to satisfy the
maximum possible request size, so that a single system with both clients
and servers can always make forward progress even in the face of clients
trying to kmap more than the total amount of kmap space.
This works for us because we are the only consumer of huge amounts of kmaps
on our systems, but it would be nice to have a generic interface to do that
so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> >>>>> " " == Jeff Garzik <[email protected]> writes:
>
> > One should also consider kmap_atomic... (bcrl suggest)
>
> The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> appropriate here.
100% correct.
Andrea
On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> >>>>> " " == Andrea Arcangeli <[email protected]> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
>
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
You can't do it this way: the number of kmaps available can be just 1,
and this way you can ask for 10000 in a row. Furthermore, you want to be
able to use all the kmaps available; think of the case where you have 11
kmaps and 10 are constantly in use. I much prefer my approach, which is
the most fine-grained and scalable, and which doesn't risk deadlocking as
a function of the number of kmaps in the pool and the max reservation you
make. I had already considered the approach implemented in the patch you
quoted, and I discarded it for the reasons explained above.
Andrea
On Thu, Feb 20, 2003 at 06:04:30PM -0500, Jeff Garzik wrote:
> On Thu, Feb 20, 2003 at 11:56:14PM +0100, Trond Myklebust wrote:
> > >>>>> " " == Andrea Arcangeli <[email protected]> writes:
> >
> > > 2.5.62 has the very same deadlock condition in xdr triggered by
> > > nfs too.
> > > Andrew, if you're forward porting it yourself like with the
> > > filebacked vma merging feature just let me know so we make sure
> > > not to duplicate effort.
> >
> > For 2.5.x we should rather fix MSG_MORE so that it actually works
> > instead of messing with hacks to kmap().
> >
> > For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> > kmap of > 1 page in one call. Appended here as an attachment FYI
> > (Marcelo do *not* apply!).
>
>
> One should also consider kmap_atomic... (bcrl suggest)
Impossible: either you submit page structures to the IP layer, or you
*must* have persistence. Depending on a sock_sendmsg that can't schedule
would be totally broken (or the preemptive thing is a joke). nfs client
O_DIRECT zerocopy would be a nice feature, but this is 2.4.
The only other option would be kmaps in the process address space that
are atomic and persistent at the same time, which don't work well with
threads... but again, this is 2.4, and we lack them even in 2.5 because
of the troubles they generate.
Andrea
On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> On Feb 20, 2003 22:54 +0100, Andrea Arcangeli wrote:
> > Explanation is very simple: you _can't_ kmap two times in the context of
> > a single task (especially if more than one task can run the same code at
> > the same time). I don't yet have the confirmation that this fixes the
> > deadlock though (it takes days to reproduce so it will take weeks to
> > confirm), but I can't see anything else wrong at the moment, and this
> > remains a genuine highmem deadlock that has to be fixed. The fix is
> > optimal, no change unless you run out of kmaps and in turn you can
> > deadlock, i.e. all the light workloads won't be affected at all.
>
> We had a similar problem in Lustre, where we have to kmap multiple pages
> at once and hold them over a network RPC (which is doing zero-copy DMA
> into multiple pages at once), and there is possibly a very heavy load
> of kmaps because the client and the server can be on the same system.
>
> What we did was set up a "kmap reservation", which used an atomic_dec()
> + wait_event() to reschedule the task until it could get enough kmaps
> to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> which we conservatively set at 3/4 of all kmap space).
Your approach was fragile (every arch is free to give you just 1 kmap in
the pool and you still must not deadlock) and it's not capable of using
the whole kmap pool at the same time. The only robust and efficient way
to fix it is kmap_nonblock, IMHO.
> A single "server" task could exceed the kmap cap by enough to satisfy the
> maximum possible request size, so that a single system with both clients
> and servers can always make forward progress even in the face of clients
> trying to kmap more than the total amount of kmap space.
>
> This works for us because we are the only consumer of huge amounts of kmaps
> on our systems, but it would be nice to have a generic interface to do that
> so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
This isn't the problem: if NFS weren't broken, it couldn't deadlock
against Lustre even with your design (assuming you don't fall into the two
problems mentioned above). But your design is still more fragile and
less scalable, especially for a generic implementation where you don't
know how many pages you'll reserve on average, and you don't know how many
kmap entries the architecture can provide. Of course, with
kmap_nonblock you'll have to fall back to submitting single pages if it
fails; it's a bit more difficult, but it's more robust and optimized IMHO.
Andrea
On Feb 21, 2003 10:46 +0100, Andrea Arcangeli wrote:
> On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> > What we did was set up a "kmap reservation", which used an atomic_dec()
> > + wait_event() to reschedule the task until it could get enough kmaps
> > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > which we conservatively set at 3/4 of all kmap space).
>
> Your approach was fragile (every arch is free to give you just 1 kmap in
> the pool and you still must not deadlock) and it's not capable of using
> the whole kmap pool at the same time. the only robust and efficient way
> to fix it is the kmap_nonblock IMHO
So (says the person who only ever uses i386 and ia64), does an arch exist
which needs highmem/kmap, but only ever gives 1 kmap in the pool?
> > This works for us because we are the only consumer of huge amounts of kmaps
> > on our systems, but it would be nice to have a generic interface to do that
> > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
>
> This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> against Lustre even with your design (assuming you don't fall in the two
> problems mentioned above). But still your design is more fragile and
> less scalable, especially for a generic implementation where you don't
> know how many pages you'll reserve in mean, and you don't know how many
> kmaps entries the architecture can provide to you. But of course with
> kmap_nonblock you'll have to fallback submitting single pages if it
> fails, it's a bit more difficult but it's more robust and optimized IMHO.
In our case, Lustre (well Portals really, the underlying network protocol)
always knows in advance the number of pages that it will need to kmap
because the client needs to tell the server in advance how much bulk data
it is going to send. This is required for being able to do RDMA. It might
be possible to have the server do the transfer in multiple parts if
kmap_nonblock() failed, but that is not how things are currently set up,
which is why we block in advance until we know we can get enough pages.
This is very similar to ext3 journaling, which requests in advance the
maximum number of journal blocks it might need, and blocks until it can
get them all.
The only problem happens when other parts of the kernel start acquiring
multiple kmaps without using the same reservation/accounting system as us.
Each works fine in isolation, but in combination it fails.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Fri, Feb 21, 2003 at 12:41:09PM -0700, Andreas Dilger wrote:
> On Feb 21, 2003 10:46 +0100, Andrea Arcangeli wrote:
> > On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> > > What we did was set up a "kmap reservation", which used an atomic_dec()
> > > + wait_event() to reschedule the task until it could get enough kmaps
> > > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > > which we conservatively set at 3/4 of all kmap space).
> >
> > Your approach was fragile (every arch is free to give you just 1 kmap in
> > the pool and you still must not deadlock) and it's not capable of using
> > the whole kmap pool at the same time. the only robust and efficient way
> > to fix it is the kmap_nonblock IMHO
>
> So (says the person who only ever uses i386 and ia64), does an arch exist
> which needs highmem/kmap, but only ever gives 1 kmap in the pool?
>
> > > This works for us because we are the only consumer of huge amounts of kmaps
> > > on our systems, but it would be nice to have a generic interface to do that
> > > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
> >
> > This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> > against Lustre even with your design (assuming you don't fall in the two
> > problems mentioned above). But still your design is more fragile and
> > less scalable, especially for a generic implementation where you don't
> > know how many pages you'll reserve in mean, and you don't know how many
> > kmaps entries the architecture can provide to you. But of course with
> > kmap_nonblock you'll have to fallback submitting single pages if it
> > fails, it's a bit more difficult but it's more robust and optimized IMHO.
>
> In our case, Lustre (well Portals really, the underlying network protocol)
> always knows in advance the number of pages that it will need to kmap
> because the client needs to tell the server in advance how much bulk data
> is going to send. This is required for being able to do RDMA. It might
> be possible to have the server do the transfer in multiple parts if
> kmap_nonblock() failed, but that is not how things are currently set up,
> which is why we block in advance until we know we can get enough pages.
>
> This is very similar to ext3 journaling, which requests in advance the
> maximum number of journal blocks it might need, and blocks until it can
> get them all.
>
> The only problem happens when other parts of the kernel start acquiring
> multiple kmaps without using the same reservation/accounting system as us.
> Each works fine in isolation, but in combination it fails.
No, if the other places are not buggy, it won't fail, regardless of whether
they use your mechanism or kmap_nonblock. You don't have to use your
mechanism everywhere to make your mechanism work. For instance, you will
be fine with the kmap_nonblock fix in combination with your current
code. Not sure why you think otherwise.
I understand it may be simpler to do the full reservation; in ext3 you
don't even risk anything because you know how large the pool is. But I
think for these cases kmap_nonblock is superior, because your approach has
an obvious dependency on the architecture and can't make full use of
the kmap pool (and here there's no transaction that has to be
committed all at once, so it's doable). Still, in practice it will work
fine in combination with the other safe usages (like kmap_nonblock) if you
reserve few enough pages at a time.
Andrea
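
For readers following the two designs being compared, here is a minimal
user-space sketch, not kernel code and not taken from either patch. The
struct and all pool_* / send_* names are hypothetical. pool_map_nonblock()
models the "fail instead of sleeping" behaviour of kmap_nonblock, and
pool_reserve() models the up-front reservation with the 3/4 cap, returning
0 where the real code would sleep in wait_event().

```c
#include <assert.h>

/* Hypothetical user-space model of a fixed-size kmap pool. */
struct kmap_pool {
    int total;  /* slots the architecture provides */
    int used;   /* slots currently mapped */
};

/* Model of a nonblocking map: returns 1 on success, 0 instead of sleeping. */
static int pool_map_nonblock(struct kmap_pool *p)
{
    if (p->used == p->total)
        return 0;
    p->used++;
    return 1;
}

static void pool_unmap(struct kmap_pool *p)
{
    assert(p->used > 0);
    p->used--;
}

/*
 * Model of the reservation scheme: the whole request must fit under a cap
 * of 3/4 of the pool. Returns 0 where a real implementation would sleep.
 */
static int pool_reserve(struct kmap_pool *p, int npages)
{
    int cap = p->total * 3 / 4;

    if (p->used + npages > cap)
        return 0;
    p->used += npages;
    return 1;
}

/*
 * Send npages using nonblocking maps: map as many pages as are free,
 * "send" that batch, unmap, repeat. Returns the number of batches, or -1
 * if not even one page could be mapped (a caller would then have to fall
 * back to a blocking single-page map).
 */
static int send_pages_nonblock(struct kmap_pool *p, int npages)
{
    int batches = 0;

    while (npages > 0) {
        int mapped = 0;

        while (mapped < npages && pool_map_nonblock(p))
            mapped++;
        if (mapped == 0)
            return -1;
        /* ... the batch would be handed to the network layer here ... */
        for (int i = 0; i < mapped; i++)
            pool_unmap(p);
        npages -= mapped;
        batches++;
    }
    return batches;
}
```

In this model a pool with a single slot can never satisfy a reservation
(3/4 of 1 rounds down to 0), while the nonblocking path still makes
progress one page at a time. That is only an illustration of the "1 kmap
in the pool" argument above, under the assumptions stated.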
Trond Myklebust <[email protected]> wrote:
>
> >>>>> " " == Andrea Arcangeli <[email protected]> writes:
>
> > 2.5.62 has the very same deadlock condition in xdr triggered by
> > nfs too.
> > Andrew, if you're forward porting it yourself like with the
> > filebacked vma merging feature just let me know so we make sure
> > not to duplicate effort.
>
> For 2.5.x we should rather fix MSG_MORE so that it actually works
> instead of messing with hacks to kmap().
Is the fixing of MSG_MORE likely to actually happen?
> For 2.4.x, Hirokazu Takahashi had a patch which allowed for a safe
> kmap of > 1 page in one call. Appended here as an attachment FYI
> (Marcelo do *not* apply!).
Andrea's patch is quite simple. Although I wonder if this, in
xdr_kmap():
+ } else {
+ iov->iov_base = kmap_nonblock(*ppage);
+ if (!iov->iov_base)
+ goto out;
+ }
should be skipping the map_tail thing?
>>>>> " " == Andrew Morton <[email protected]> writes:
>> For 2.5.x we should rather fix MSG_MORE so that it actually
>> works instead of messing with hacks to kmap().
> Is the fixing of MSG_MORE likely to actually happen?
We had better try. The server/knfsd has already converted to sendpage
+ MSG_MORE 8-)
That won't work for 2.4.x though, since that doesn't have support for
sendpage over UDP.
Cheers,
Trond
On Fri, 2003-02-21 at 01:41, Andrea Arcangeli wrote:
> On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> > >>>>> " " == Jeff Garzik <[email protected]> writes:
> >
> > > One should also consider kmap_atomic... (bcrl suggest)
> >
> > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > appropriate here.
>
> 100% correct.
It actually depends upon whether you have sk->priority set
to GFP_ATOMIC or GFP_KERNEL.
On Fri, Feb 21, 2003 at 04:40:41PM -0800, David S. Miller wrote:
> On Fri, 2003-02-21 at 01:41, Andrea Arcangeli wrote:
> > On Fri, Feb 21, 2003 at 12:12:19AM +0100, Trond Myklebust wrote:
> > > >>>>> " " == Jeff Garzik <[email protected]> writes:
> > >
> > > > One should also consider kmap_atomic... (bcrl suggest)
> > >
> > > The problem is that sendmsg() can sleep. kmap_atomic() isn't really
> > > appropriate here.
> >
> > 100% correct.
>
> It actually depends upon whether you have sk->priority set
> to GFP_ATOMIC or GFP_KERNEL.
You must not disable preemption when entering sock_sendmsg, no matter
what sk->priority is. Disabling preemption inside sock_sendmsg is way too
late, so even if you have such a preemption bug in sock_sendmsg, it won't
help. You would need to disable preemption in the caller before doing the
kmap_atomic, if anything. And again, that is a preemption bug.
Not to mention you'd need to allocate a big pool of atomic kmaps to do
that, and this would eat a hundred megs of virtual address space since
it's replicated per-cpu. This makes even less sense: the machines
where the highmem deadlock triggers eat up the normal zone big time.
Really, the claim that it can be solved with atomic kmaps doesn't make
any sense to me, nor does the claim that sock_sendmsg will not schedule if
called with GFP_ATOMIC. Of course it must not schedule if it can be
called from an irq with priority=GFP_ATOMIC, but this isn't the case
we're discussing here. An irq implicitly has preemption disabled by
design, and calling sock_sendmsg from irq isn't really desirable (even
if technically possible, maybe with priority=GFP_ATOMIC according to you)
because it will take some time.
Andrea
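
The per-cpu cost mentioned above is plain multiplication: slots per CPU
times page size times number of CPUs. A small sketch of that arithmetic
follows. The function name and the example figures (4096 slots per CPU,
8 CPUs) are purely hypothetical and are not taken from any kernel
configuration. Only the 4 KiB page size is the usual i386 value.

```c
#include <assert.h>

/*
 * Virtual address space consumed by a per-CPU pool of atomic kmap slots.
 * All parameters are illustrative assumptions, not kernel constants.
 */
static unsigned long atomic_kmap_vspace(unsigned long slots_per_cpu,
                                        unsigned long page_size,
                                        unsigned long ncpus)
{
    assert(page_size > 0);
    return slots_per_cpu * page_size * ncpus;
}
```

With the hypothetical figures above, atomic_kmap_vspace(4096, 4096, 8)
comes to 128 MiB of address space, which is the order of magnitude the
objection refers to.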
On Wednesday 19 February 2003 18:49, Andrea Arcangeli wrote:
Hi Marcelo,
apply this, please!
> On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> > On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
> >
> > Hi Andrew,
> >
> > > > Running just "find /" (or ls -R or tar on a large directory) locally
> > > > slows the box down to absolute unresponsiveness - it takes minutes
> > > > to just run ps and kill the find process. During that time, kupdated
> > > > and kswapd gobble up all available CPU time.
> > >
> > > Could be that your "low memory" is filled up with inodes. This would
> > > only happen in these tests if you're using ext2, and there are a *lot*
> > > of directories.
> > > I've prepared a lineup of Andrea's VM patches at
> > > It would be useful if you could apply 10_inode-highmem-2.patch and
> > > report back. It applies to 2.4.19 as well, and should work OK there.
> >
> > is there any reason why this (inode-highmem-2) has never been submitted
> > for inclusion into mainline yet?
Marcelo please include this:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
other fixes should be included too but they don't apply cleanly yet
unfortunately, I (or somebody else) should rediff them against mainline.
> Andrea
ciao, Marc
On Thursday 27 February 2003 00:17, Marc-Christian Petersen wrote:
Hi again,
> Hi Marcelo,
> apply this, please!
Patch is by Andrea. I will send this once every day until I see the merge in
-BK or a mail from you here on LKML explaining why you don't take it!
P.S.: I see some bogus patches in -BK (now -pre5) which got merged. This patch
has existed for ages (inode-highmem-2), survived tons of testing and it is
a must!
I can only _repeat_ Andrea (I agree 100% with his statement):
------------------------------------------------------------------------
this is a pre kernel, it's meant to *test* stuff, if anything goes
wrong we're here ready to fix it immediately. Sure, applying a last-minute
patch to an -rc just before releasing the new official kernel
w/o any kind of testing was a bad idea, but we must not be too
conservative either, especially in cases like these where we are fixing
bugs. I mean, we can't delay bugfixes with the argument that they could
introduce new bugs, otherwise we may as well stop fixing bugs.
Also note that this stuff is being tested aggressively for a very long
time by lots of people, it's not a last minute patch like the xdr
highmem deadlock ;).
------------------------------------------------------------------------
regards!
>
> > On Wed, Feb 19, 2003 at 05:42:34PM +0100, Marc-Christian Petersen wrote:
> > > On Wednesday 05 February 2003 10:39, Andrew Morton wrote:
> > >
> > > Hi Andrew,
> > >
> > > > > Running just "find /" (or ls -R or tar on a large directory)
> > > > > locally slows the box down to absolute unresponsiveness - it takes
> > > > > minutes to just run ps and kill the find process. During that time,
> > > > > kupdated and kswapd gobble up all available CPU time.
> > > >
> > > > Could be that your "low memory" is filled up with inodes. This would
> > > > only happen in these tests if you're using ext2, and there are a
> > > > *lot* of directories.
> > > > I've prepared a lineup of Andrea's VM patches at
> > > > It would be useful if you could apply 10_inode-highmem-2.patch and
> > > > report back. It applies to 2.4.19 as well, and should work OK there.
> > >
> > > is there any reason why this (inode-highmem-2) has never been submitted
> > > for inclusion into mainline yet?
>
> Marcelo please include this:
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21pre4aa3/10_inode-highmem-2
> other fixes should be included too but they don't apply cleanly yet
> unfortunately, I (or somebody else) should rediff them against mainline.