2002-01-29 02:25:39

by Xeno

Subject: 2.4: NFS client kmapping across I/O

Trond, thanks for the excellent fattr race fix. I'm sorry I haven't
been able to give feedback until now; things got busy for a while. I
have not yet had a chance to run your fixes, but after studying them I
believe that they will resolve the race nicely, especially with the use
of nfs_inode_lock in the recent NFS_ALL experimental patches. FWIW.

Now I also have time to mention the other NFS client issue we ran into
recently; I have not found any mention of it on the mailing lists. The
NFS client is kmapping pages for the duration of reads from and writes
to the server. This creates a scaling limitation, especially under
CONFIG_HIGHMEM64G on i386, where there are only 512 entries in the
highmem kmap table. Under I/O load, it is easy to fill up the table,
hanging all processes that need to map highmem pages a substantial
fraction of the time.
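
To illustrate the shape of the problem, a simplified sketch (this is
not the actual fs/nfs code, and send_rpc_and_wait() is a made-up name
for the RPC round trip):

static void nfs_io_pattern(struct page *page)
{
        char *kaddr = kmap(page);  /* pins one of the LAST_PKMAP entries */
        send_rpc_and_wait(kaddr);  /* entry stays pinned across the whole
                                    * network round trip to the server  */
        kunmap(page);              /* only now is the entry released    */
}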

Before 2.4.15, it is particularly bad: nfs_flushd locks up the kernel
under I/O load. My testcase was to copy 12 100M files from one NFS
server to another; it was very reliable at producing the lockup right
away. nfs_flushd fills up the kmap table as it sends out requests, then
blocks waiting for another kmap entry to free up. But it is running in
rpciod, so rpciod is blocked and cannot process any responses to the
requests. No kmap entries are ever freed once the table fills up. In
this state, the machine pings and responds to SysRq on the serial port,
but just about everything else hangs.
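
As pseudo-C, the cycle looks roughly like this (simplified from my
reading of the code; next_dirty_page() and send_write_request() are
invented names):

/* nfs_flushd, running in rpciod: */
struct page *page;

while ((page = next_dirty_page()) != NULL) {
        char *kaddr = kmap(page);  /* sleeps in map_new_virtual() once
                                    * all LAST_PKMAP entries are taken */
        send_write_request(kaddr); /* the matching kunmap() happens in
                                    * the reply path, which also runs
                                    * in rpciod, and rpciod is now
                                    * asleep in kmap() */
}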

It looks like nfs_flushd was turned off in 2.4.15; that is the
workaround I have applied to our machines. I have also limited the
number of requests across all NFS servers to LAST_PKMAP-64, to leave
some kmap entries available for non-NFS use. It is not an ideal
workaround, though: it artificially limits I/O to multiple servers.
I've thought about bumping up LAST_PKMAP to increase the size of the
highmem kmap table, but the table looks like it was designed to be
small.
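
For reference, the clamp amounts to this (the macro name is mine, just
to show the arithmetic):

/* Cap total in-flight NFS requests below the kmap pool size,
 * leaving 64 entries for non-NFS kmap users: */
#define NFS_MAX_OUTSTANDING_REQS (LAST_PKMAP - 64)

Under HIGHMEM64G, with LAST_PKMAP at 512, that caps us at 448
outstanding pages across all servers.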

I've also thought about pushing the kmaps and kunmaps down into the RPC
layer, so the pages are only mapped while data is copied to or from
them, not while waiting for the network. That would be more work, but
it looks doable, so I wanted to run the problem and the approach by you
knowledgeable folks while I'm waiting for hardware to free up for kernel
hacking.
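
Concretely, I mean replacing the long-held mapping with a short one
around the copy itself, something like this (a sketch only;
copy_page_to_request() is an invented name for wherever the RPC layer
actually touches the data):

static void copy_page_to_request(char *dest, struct page *page,
                                 unsigned int offset, unsigned int len)
{
        char *kaddr = kmap(page);  /* held only for the memcpy */
        memcpy(dest, kaddr + offset, len);
        kunmap(page);              /* released before we wait on the net */
}

Once the mapping is held only across the copy, kmap_atomic() might even
do, which would bypass the global pkmap pool entirely.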

Thanks,
Xeno


2002-01-29 09:06:15

by Trond Myklebust

Subject: Re: 2.4: NFS client kmapping across I/O

>>>>> " " == xeno <[email protected]> writes:

> Trond, thanks for the excellent fattr race fix. I'm sorry I
> haven't been able to give feedback until now; things got busy
> for a while. I have not yet had a chance to run your fixes,
> but after studying them I believe that they will resolve the
> race nicely, especially with the use of nfs_inode_lock in the
> recent NFS_ALL experimental patches. FWIW.

Note: the nfs_inode_lock is not in itself part of any 'fix'. It is
merely a substitute for the current BKL protection of attribute
updates on NFS. The BKL gets dropped from NFS + rpciod in the patch
linux-2.4.x-rpc_bkl.dif, which is, of course, included in NFS_ALL.

> I've also thought about pushing the kmaps and kunmaps down into
> the RPC layer, so the pages are only mapped while data is
> copied to or from them, not while waiting for the network.
> That would be more work, but it looks doable, so I wanted to
> run the problem and the approach by you knowledgeable folks
> while I'm waiting for hardware to free up for kernel hacking.

This is the long-term solution. We will in any case want to make the
RPC layer 'page aware' in order to be able to make use of the
zero-copy socket stuff (a.k.a. sendpage()).
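
The rough shape I have in mind is for the RPC/xdr layers to carry
(page, offset, length) triples instead of kmapped pointers, so that
the transmit path can hand pages straight to the socket. Just to show
the shape (rpc_page_vec and xmit_one_page are invented names, not a
design):

struct rpc_page_vec {
        struct page     *page;
        unsigned int    offset;
        unsigned int    len;
};

static ssize_t xmit_one_page(struct socket *sock, struct rpc_page_vec *vec)
{
        /* no kmap is held anywhere on this path */
        return sock->ops->sendpage(sock, vec->page, vec->offset,
                                   vec->len, 0);
}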
I'm still not ready to detail exactly how it should be done, though.
I'll have to get back to you on this one...

Cheers,
Trond

2002-01-29 18:15:19

by Hugh Dickins

Subject: Re: 2.4: NFS client kmapping across I/O

On Mon, 28 Jan 2002, Xeno wrote:
>
> Now I also have time to mention the other NFS client issue we ran into
> recently; I have not found any mention of it on the mailing lists. The NFS
> client is kmapping pages for the duration of reads from and writes to
> the server. This creates a scaling limitation, especially under
> CONFIG_HIGHMEM64G on i386, where there are only 512 entries in the
> highmem kmap table. Under I/O load, it is easy to fill up the table,
> hanging all processes that need to map highmem pages a substantial
> fraction of the time.
> ...
> I've thought about bumping up LAST_PKMAP to increase the size of the
> highmem kmap table, but the table looks like it was designed to be
> small.

You're right that kmap users should avoid holding them for very long;
I'd certainly not discourage anyone from pursuing that effort.

But in case it helps in the short term, below is a 2.4.18-pre7 version
of a patch I prepared earlier for 2.4.18-pre4aa1; it now also defaults
to 1800 kmaps for both HIGHMEM4G and HIGHMEM64G. As a side-effect, it
raises the overly strict limit on the i386 vmalloc area, like -aa.
But I've not touched PPC or Sparc; their limits are unchanged.

I don't think the kmap area was really designed to be so small; it's
just that nobody ever got around to allowing more than one page table
for it. It was surely absurd that HIGHMEM64G (doubly-sized ptes, so half
as many per page table) should be limited to fewer kmaps than HIGHMEM4G.

Hugh

diff -urN 2.4.18-pre7/arch/i386/mm/init.c linux/arch/i386/mm/init.c
--- 2.4.18-pre7/arch/i386/mm/init.c Mon Jan 7 13:02:56 2002
+++ linux/arch/i386/mm/init.c Tue Jan 29 16:02:23 2002
@@ -167,44 +167,6 @@
         set_pte_phys(address, phys, flags);
 }
 
-static void __init fixrange_init (unsigned long start, unsigned long end, pgd_t *pgd_base)
-{
-        pgd_t *pgd;
-        pmd_t *pmd;
-        pte_t *pte;
-        int i, j;
-        unsigned long vaddr;
-
-        vaddr = start;
-        i = __pgd_offset(vaddr);
-        j = __pmd_offset(vaddr);
-        pgd = pgd_base + i;
-
-        for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-#if CONFIG_X86_PAE
-                if (pgd_none(*pgd)) {
-                        pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-                        set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-                        if (pmd != pmd_offset(pgd, 0))
-                                printk("PAE BUG #02!\n");
-                }
-                pmd = pmd_offset(pgd, vaddr);
-#else
-                pmd = (pmd_t *)pgd;
-#endif
-                for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
-                        if (pmd_none(*pmd)) {
-                                pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-                                set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
-                                if (pte != pte_offset(pmd, 0))
-                                        BUG();
-                        }
-                        vaddr += PMD_SIZE;
-                }
-                j = 0;
-        }
-}
-
 static void __init pagetable_init (void)
 {
         unsigned long vaddr, end;
@@ -221,8 +183,24 @@
 
         pgd_base = swapper_pg_dir;
 #if CONFIG_X86_PAE
-        for (i = 0; i < PTRS_PER_PGD; i++)
+        /*
+         * First set all four entries of the pgd.
+         * Usually only one page is needed here: if PAGE_OFFSET lowered,
+         * maybe three pages: need not be contiguous, but might as well.
+         */
+        pmd = (pmd_t *)alloc_bootmem_low_pages(KERNEL_PGD_PTRS*PAGE_SIZE);
+        for (i = 1; i < USER_PGD_PTRS; i++)
                 set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
+        for (; i < PTRS_PER_PGD; i++, pmd += PTRS_PER_PMD)
+                set_pgd(pgd_base + i, __pgd(1 + __pa(pmd)));
+        /*
+         * Add low memory identity-mappings - SMP needs it when
+         * starting up on an AP from real-mode. In the non-PAE
+         * case we already have these mappings through head.S.
+         * All user-space mappings are explicitly cleared after
+         * SMP startup.
+         */
+        pgd_base[0] = pgd_base[USER_PGD_PTRS];
 #endif
         i = __pgd_offset(PAGE_OFFSET);
         pgd = pgd_base + i;
@@ -231,14 +209,7 @@
                 vaddr = i*PGDIR_SIZE;
                 if (end && (vaddr >= end))
                         break;
-#if CONFIG_X86_PAE
-                pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
-                set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
-#else
-                pmd = (pmd_t *)pgd;
-#endif
-                if (pmd != pmd_offset(pgd, 0))
-                        BUG();
+                pmd = pmd_offset(pgd, 0);
                 for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
                         vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
                         if (end && (vaddr >= end))
@@ -274,34 +245,27 @@
         }
 
         /*
-         * Fixed mappings, only the page table structure has to be
-         * created - mappings will be set by set_fixmap():
+         * Leave vmalloc() to create its own page tables as needed,
+         * but create the page tables at top of virtual memory, to be
+         * populated by kmap_high(), kmap_atomic(), and set_fixmap().
+         * kmap_high() assumes pkmap_page_table contiguous throughout.
          */
-        vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
-        fixrange_init(vaddr, 0, pgd_base);
-
 #if CONFIG_HIGHMEM
-        /*
-         * Permanent kmaps:
-         */
         vaddr = PKMAP_BASE;
-        fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
+#else
+        vaddr = FIXADDR_START;
+#endif
+        pmd = pmd_offset(pgd_offset_k(vaddr), vaddr);
+        i = (0UL - (vaddr & PMD_MASK)) >> PMD_SHIFT;
+        pte = (pte_t *)alloc_bootmem_low_pages(i*PAGE_SIZE);
+        for (; --i >= 0; pmd++, pte += PTRS_PER_PTE)
+                set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
 
+#if CONFIG_HIGHMEM
         pgd = swapper_pg_dir + __pgd_offset(vaddr);
         pmd = pmd_offset(pgd, vaddr);
         pte = pte_offset(pmd, vaddr);
         pkmap_page_table = pte;
-#endif
-
-#if CONFIG_X86_PAE
-        /*
-         * Add low memory identity-mappings - SMP needs it when
-         * starting up on an AP from real-mode. In the non-PAE
-         * case we already have these mappings through head.S.
-         * All user-space mappings are explicitly cleared after
-         * SMP startup.
-         */
-        pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
 #endif
 }
 
diff -urN 2.4.18-pre7/include/asm-i386/highmem.h linux/include/asm-i386/highmem.h
--- 2.4.18-pre7/include/asm-i386/highmem.h Mon Jan 28 18:42:17 2002
+++ linux/include/asm-i386/highmem.h Tue Jan 29 16:02:23 2002
@@ -42,17 +42,13 @@
 extern void kmap_init(void) __init;
 
 /*
- * Right now we initialize only a single pte table. It can be extended
- * easily, subsequent pte tables have to be allocated in one physical
- * chunk of RAM.
+ * LAST_PKMAP is the maximum number of persistent (non-atomic) kmappings
+ * outstanding at one time. 1800 was chosen as more than the previous
+ * 1024, not necessarily a power of 2, and 2000 would need one more
+ * page table (since last shared with fixmaps): redefine it if you need.
  */
-#define PKMAP_BASE (0xfe000000UL)
-#ifdef CONFIG_X86_PAE
-#define LAST_PKMAP 512
-#else
-#define LAST_PKMAP 1024
-#endif
-#define LAST_PKMAP_MASK (LAST_PKMAP-1)
+#define LAST_PKMAP 1800
+#define PKMAP_BASE (FIXADDR_START - (LAST_PKMAP << PAGE_SHIFT))
 #define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT)
 #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
 
diff -urN 2.4.18-pre7/mm/highmem.c linux/mm/highmem.c
--- 2.4.18-pre7/mm/highmem.c Mon Jan 7 13:03:04 2002
+++ linux/mm/highmem.c Tue Jan 29 16:02:23 2002
@@ -85,8 +85,8 @@
         count = LAST_PKMAP;
         /* Find an empty entry */
         for (;;) {
-                last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
-                if (!last_pkmap_nr) {
+                if (++last_pkmap_nr >= LAST_PKMAP) {
+                        last_pkmap_nr = 0;
                         flush_all_zero_pkmaps();
                         count = LAST_PKMAP;
                 }

2002-01-29 18:35:32

by Rik van Riel

Subject: Re: 2.4: NFS client kmapping across I/O

On Tue, 29 Jan 2002, Hugh Dickins wrote:
> On Mon, 28 Jan 2002, Xeno wrote:
> >
> > Now I also have time to mention the other NFS client issue we ran into
> > recently; I have not found any mention of it on the mailing lists. The NFS
> > client is kmapping pages for the duration of reads from and writes to
> > the server. This creates a scaling limitation, especially under
>
> You're right that kmap users should avoid holding them for very long;
> I'd certainly not discourage anyone from pursuing that effort.

Things like this would be fixed by having a kmap variant
which maps the pages into process-private space, kind of
like kmap_atomic(), but into a (4 MB sized?) part of
address space which is only visible within this process.

This would mean that each process can have a few MB of
kmap()d pages ... should be nice for things like having
copy_{to,from}_user "cache" the hot pages while keeping
longer term mappings from filling the global pool.
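
Concretely, something like this (names invented, just to show the
shape):

/* A per-process pkmap window: long-held mappings burn only this
 * process's slots, not the global pool. */
#define LAST_UKMAP      1024    /* 4 MB window per process */

void *kmap_user(struct page *page);     /* maps into current's window */
void kunmap_user(struct page *page);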

regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-30 00:56:33

by Xeno

Subject: Re: 2.4: NFS client kmapping across I/O

Trond Myklebust wrote:
>
> We will in any case want to make the RPC layer 'page aware' in
> order to be able to make use of the zero-copy socket stuff
> (a.k.a. sendpage()). I'm still not ready to detail exactly how
> it should be done, though. I'll have to get back to you on this one...

Oh, yeah, zero-copy is definitely the way to go. No hurry; the
workaround is good enough for now. Thanks, Trond!

Hugh Dickins wrote:
>
> I don't think the kmap area was really designed to be so small, it's
> just that nobody ever got around to allowing more than one page table
> for it. It was surely absurd that HIGHMEM64G (doubly-sized ptes so half
> as many per page table) should be limited to fewer kmaps than HIGHMEM4G.

Thanks for the patch, Hugh; I may try it out when I get some free
hardware next week. It looks like there is already a limit of 256 pages
kmapped per mount built into the NFS client, so a LAST_PKMAP of 1800
would allow heavy I/O to 7 mounts before filling up the kmap table, up
from 2 mounts.

I wasn't sure about the kmap table size because of the loops in
flush_all_zero_pkmaps and map_new_virtual. In the worst case, when the
table is nearly full (as when NFS is maxing it out), each of those
operations tends to scan through the whole table without finding much
that can be freed or reused. I don't have a machine where I can tinker
with it at the moment, though, so I'm not sure how much it matters.
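
For reference, this is the loop I mean, condensed from 2.4
mm/highmem.c (elisions and comments are mine):

static inline unsigned long map_new_virtual(struct page *page)
{
        int count;
start:
        count = LAST_PKMAP;
        /* Find an empty entry */
        for (;;) {
                last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
                if (!last_pkmap_nr) {
                        flush_all_zero_pkmaps();  /* O(LAST_PKMAP) sweep */
                        count = LAST_PKMAP;
                }
                if (!pkmap_count[last_pkmap_nr])
                        break;            /* found a free entry */
                if (--count)
                        continue;         /* keep scanning: up to
                                           * LAST_PKMAP probes when the
                                           * table is nearly full */
                /* no luck: sleep on pkmap_map_wait, then goto start */
        }
        /* ... install the pte and return PKMAP_ADDR(last_pkmap_nr) ... */
}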

Xeno