2001-03-19 18:44:30

by Rik van Riel

[permalink] [raw]
Subject: 3rd version of R/W mmap_sem patch available

On Mon, 19 Mar 2001, Mike Galbraith wrote:

> @@ -1135,6 +1170,7 @@
[large patch]

I've been finding small bugs in both my late-night code and in
Mike's code and have redone the changes in do_anonymous_page(),
do_no_page() and do_swap_page() much more carefully...

Now the code is beautiful and it might even be bugfree ;)

If you feel particularly adventurous, please help me test the
patch; it is available from:

http://www.surriel.com/patches/2.4/2.4.2-ac20-rwmmap_sem3

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/


2001-03-19 22:48:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



> Now the code is beautiful and it might even be bugfree ;)

I'm applying this to my tree - I'm not exactly comfortable with this
during the 2.4.x timeframe, but at the same time I'm even less comfortable
with the current alternative, which is to make the regular semaphores
fairer (we tried it once, and the implementation had problems, I'm not
going to try that again during 2.4.x).

Besides, the fair semaphores would potentially slow things down, while
this potentially speeds things up. So.. It looks obvious enough.

Linus

2001-03-19 23:14:29

by Manfred Spraul

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available

>
> Besides, the fair semaphores would potentially slow things down, while
> this potentially speeds things up. So.. It looks obvious enough.
>

Rik, did you check that {pte,pmd}_alloc are thread safe? At least in
2.4.2 they aren't (include/asm-i386/pgalloc.h), and your patch doesn't
touch pgalloc.

{pte,pmd}_alloc are called from handle_mm_fault, and that function is
now running with down_read().
--

Manfred



2001-03-19 23:28:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Tue, 20 Mar 2001, Manfred Spraul wrote:
>
> Rik, did you check that {pte,pmd}_alloc are thread safe? At least in
> 2.4.2 they aren't (include/asm-i386/pgalloc.h), and your patch doesn't
> touch pgalloc.

Excellent point. We used to do all the looping and re-trying, but it got
ripped out a long time ago (and in any case, it historically didn't do
SMP, so the old code doesn't really work).

Linus

2001-03-19 23:37:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Mon, 19 Mar 2001, Linus Torvalds wrote:
>
> Excellent point. We used to do all the looping and re-trying, but it got
> ripped out a long time ago (and in any case, it historically didn't do
> SMP, so the old code doesn't really work).

Actually, funnily enough, I see that the old thread-safe stuff is still
there in get_pte_kernel_slow(). The only thing that breaks it is that we
don't hold any locks, so it's only UP-safe, not SMP-safe.

However, it definitely looks like we should just un-inline that thing
completely, and make a lot of it architecture-independent anyway.

Linus

2001-03-20 02:47:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available


There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.

The complete changelog is appended, but the biggest recent change is the
mmap_sem change, which I updated with new locking rules for pte/pmd_alloc
to avoid the race on the actual page table build.

This has only been tested on i386 without PAE, and is known to break other
architectures. Ingo, mind checking what PAE needs? Generally, the changes
are simple, and really only implies changing the pte/pmd allocation
functions to _only_ allocate (ie removing the stuff that actually modifies
the page tables, as that is now handled by generic code), and to make sure
that the "pgd/pmd_populate()" functions do the right thing.

I have also removed the xxx_kernel() functions - for architectures that
need them, I suspect that the right approach is to just make the
"populate" funtions notice when "mm" is "init_mm", the kernel context.
That removed a lot of duplicate code that had little good reason.

This pre-release is meant mainly as a synchronization point for mm
developers, not for generic use.

Thanks,

Linus


-----
-pre5:
- Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
- IDE: 256 sectors at a time is legal, but apparently confuses some
drives. Max out at 255 sectors instead.
- Petko Manolov: USB pegasus driver update
- make the boottime memory map printout at least almost readable.
- USB driver updates
- pte_alloc()/pmd_alloc() need page_table_lock.

-pre4:
- Petr Vandrovec, Al Viro: dentry revalidation fixes
- Stephen Tweedie / Manfred Spraul: kswapd and ptrace race
- Neil Brown: nfsd/rpc/raid cleanups and fixes

-pre3:
- Alan Cox: continued merging
- Urban Widmark: smbfs fix (d_add on already hashed dentry - no-no).
- Andrew Morton: 3c59x update
- Jeff Garzik: network driver cleanups and fixes
- G?rard Roudier: sym-ncr drivers update
- Jens Axboe: more loop cleanups and fixes
- David Miller: sparc update, some networking fixes

-pre2:
- Jens Axboe: fix loop device deadlocks
- Greg KH: USB updates
- Alan Cox: continued merging
- Tim Waugh: parport and documentation updates
- Cort Dougan: PowerPC merge
- Jeff Garzik: network driver updates
- Justin Gibbs: new and much improved aic7xxx driver 6.1.5

-pre1:
- Chris Mason: reiserfs, another null bytes bug
- Andrea Arkangeli: make SMP Athlon build
- Alexander Zarochentcev: reiserfs directory fsync SMP locking fix
- Jeff Garzik: PCI network driver updates
- Alan Cox: continue merging
- Ingo Molnar: fix RAID AUTORUN ioctl, scheduling improvements

2001-03-20 03:36:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Tue, 20 Mar 2001, Andrew Morton wrote:
> Linus Torvalds wrote:
> >
> > There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.
>
> I can't see it. Where did you hide it?

Ahh. The mirroring is apparently broken. I put my stuff on a faster local
connection to "master.kernel.org", and depend on it being mirrored
automatically. And apparently the mirroring has been down for a few days
now. Ugh. I'll ping Peter.

Linus

2001-03-20 04:36:17

by Rik van Riel

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available

On Tue, 20 Mar 2001, Manfred Spraul wrote:
> >
> > Besides, the fair semaphores would potentially slow things down, while
> > this potentially speeds things up. So.. It looks obvious enough.
>
> Rik, did you check that {pte,pmd}_alloc are thread safe? At
> least in 2.4.2 they aren't (include/asm-i386/pgalloc.h), and
> your patch doesn't touch pgalloc.

I checked and they're not. This still needs to be fixed...

(ie the patch really isn't ready yet to be included in the
main kernel ... OTOH, the changes needed to make it ready
are all trivial and tedious ;))

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-20 04:38:27

by Rik van Riel

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available

On Mon, 19 Mar 2001, Linus Torvalds wrote:
> On Mon, 19 Mar 2001, Linus Torvalds wrote:
> >
> > Excellent point. We used to do all the looping and re-trying, but it got
> > ripped out a long time ago (and in any case, it historically didn't do
> > SMP, so the old code doesn't really work).
>
> Actually, funnily enough, I see that the old thread-safe stuff is still
> there in get_pte_kernel_slow(). The only thing that breaks it is that we
> don't hold any locks, so it's only UP-safe, not SMP-safe.
>
> However, it definitely looks like we should just un-inline that thing
> completely, and make a lot of it architecture-independent anyway.

Also, because lots of architectures seem to have exactly
the same code, we might as well remove the duplicates and
put them in the same place...

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-03-20 04:42:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Tue, 20 Mar 2001, Rik van Riel wrote:
>
> (ie the patch really isn't ready yet to be included in the
> main kernel ... OTOH, the changes needed to make it ready
> are all trivial and tedious ;))

They are trivial and tedious only if done wrong - which will also add tons
of new places where we lock and unlock only to lock again.

My -pre5 has the non-trivial "fix the calling convention and require that
pmd/pgd_alloc() be called with the lock held" version of the patch.

Linus

2001-03-20 06:09:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Tue, 20 Mar 2001, Marcelo Tosatti wrote:
>
> Could the IDE one cause corruption ?

Only with broken disks, as far as we know right now. There's been so far
just one report of this problem, and nobody has heard back about which
disk this was.. And it should be noisy about it when it happens -
complaining about lost interrupts and resetting the IDE controller.

So unlikely.

Linus

2001-03-20 06:15:01

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Mon, 19 Mar 2001, Linus Torvalds wrote:

>
>
> On Tue, 20 Mar 2001, Marcelo Tosatti wrote:
> >
> > Could the IDE one cause corruption ?
>
> Only with broken disks, as far as we know right now. There's been so far
> just one report of this problem, and nobody has heard back about which
> disk this was.. And it should be noisy about it when it happens -
> complaining about lost interrupts and resetting the IDE controller.
>
> So unlikely.

Ok, so I think we have a problem. The disk is OK -- no lost interrupts or
resets. Just this message on syslog and pgbench complaining about
corruption of the database.

I'll put pre5 in and try to reproduce the problem (I hitted it while
running pgbench + shmtest).

Damn.

2001-03-20 06:00:39

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Mon, 19 Mar 2001, Linus Torvalds wrote:

>
> There is a 2.4.3-pre5 in the test-directory on ftp.kernel.org.
>
> The complete changelog is appended, but the biggest recent change is the
> mmap_sem change, which I updated with new locking rules for pte/pmd_alloc
> to avoid the race on the actual page table build.
>
> This has only been tested on i386 without PAE, and is known to break other
> architectures. Ingo, mind checking what PAE needs? Generally, the changes
> are simple, and really only implies changing the pte/pmd allocation
> functions to _only_ allocate (ie removing the stuff that actually modifies
> the page tables, as that is now handled by generic code), and to make sure
> that the "pgd/pmd_populate()" functions do the right thing.
>
> I have also removed the xxx_kernel() functions - for architectures that
> need them, I suspect that the right approach is to just make the
> "populate" funtions notice when "mm" is "init_mm", the kernel context.
> That removed a lot of duplicate code that had little good reason.
>
> This pre-release is meant mainly as a synchronization point for mm
> developers, not for generic use.
>
> Thanks,
>
> Linus
>
>
> -----
> -pre5:
> - Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
> - IDE: 256 sectors at a time is legal, but apparently confuses some
> drives. Max out at 255 sectors instead.

Could the IDE one cause corruption ?

EXT2-fs error (device ide0(3,1)): ext2_free_blocks: bit already cleared
for block 6211

Just hitted this now with pre3.

2001-03-20 06:38:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3rd version of R/W mmap_sem patch available



On Tue, 20 Mar 2001, Marcelo Tosatti wrote:
>
> I'll put pre5 in and try to reproduce the problem (I hitted it while
> running pgbench + shmtest).

I found a case where pre5 will forget to unlock the page_table_lock (in
copy_page_range()), and one place where I had missed the lock altogether
(in ioremap()), so I'll make a pre6 (neither is a problem on UP, though,
so pre5 is not unusable - even on SMP it works really well until you hit
the case where it forgets to unlock ;).

Although I'd prefer to see somebody check out the other architectures, to
do the (pretty trivial) changes to make them support properly threaded
page faults. I'd hate to have two pre-patches without any input from other
architectures..

Linus

2001-03-21 18:15:11

by Ivan Kokshaysky

[permalink] [raw]
Subject: 2.4.3-pre6 alpha pte/pmd_alloc update

Surprisingly, it works on lx164. :-)

Ivan.

On Mon, Mar 19, 2001 at 06:46:17PM -0800, Linus Torvalds wrote:
> This has only been tested on i386 without PAE, and is known to break other
> architectures. Ingo, mind checking what PAE needs? Generally, the changes
> are simple, and really only implies changing the pte/pmd allocation
> functions to _only_ allocate (ie removing the stuff that actually modifies
> the page tables, as that is now handled by generic code), and to make sure
> that the "pgd/pmd_populate()" functions do the right thing.
>
> I have also removed the xxx_kernel() functions - for architectures that
> need them, I suspect that the right approach is to just make the
> "populate" funtions notice when "mm" is "init_mm", the kernel context.
> That removed a lot of duplicate code that had little good reason.

diff -urp 2.4.3p6/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c
--- 2.4.3p6/arch/alpha/mm/init.c Wed Nov 29 09:43:39 2000
+++ linux/arch/alpha/mm/init.c Wed Mar 21 20:04:21 2001
@@ -43,20 +43,6 @@ struct thread_struct original_pcb;
struct pgtable_cache_struct quicklists;
#endif

-void
-__bad_pmd(pgd_t *pgd)
-{
- printk("Bad pgd in pmd_alloc: %08lx\n", pgd_val(*pgd));
- pgd_set(pgd, BAD_PAGETABLE);
-}
-
-void
-__bad_pte(pmd_t *pmd)
-{
- printk("Bad pmd in pte_alloc: %08lx\n", pmd_val(*pmd));
- pmd_set(pmd, (pte_t *) BAD_PAGETABLE);
-}
-
pgd_t *
get_pgd_slow(void)
{
@@ -80,66 +66,26 @@ get_pgd_slow(void)
return ret;
}

-pmd_t *
-get_pmd_slow(pgd_t *pgd, unsigned long offset)
-{
- pmd_t *pmd;
-
- pmd = (pmd_t *) __get_free_page(GFP_KERNEL);
- if (pgd_none(*pgd)) {
- if (pmd) {
- clear_page((void *)pmd);
- pgd_set(pgd, pmd);
- return pmd + offset;
- }
- pgd_set(pgd, BAD_PAGETABLE);
- return NULL;
- }
- free_page((unsigned long)pmd);
- if (pgd_bad(*pgd)) {
- __bad_pmd(pgd);
- return NULL;
- }
- return (pmd_t *) pgd_page(*pgd) + offset;
-}
-
-pte_t *
-get_pte_slow(pmd_t *pmd, unsigned long offset)
-{
- pte_t *pte;
-
- pte = (pte_t *) __get_free_page(GFP_KERNEL);
- if (pmd_none(*pmd)) {
- if (pte) {
- clear_page((void *)pte);
- pmd_set(pmd, pte);
- return pte + offset;
- }
- pmd_set(pmd, (pte_t *) BAD_PAGETABLE);
- return NULL;
- }
- free_page((unsigned long)pte);
- if (pmd_bad(*pmd)) {
- __bad_pte(pmd);
- return NULL;
- }
- return (pte_t *) pmd_page(*pmd) + offset;
-}
-
int do_check_pgt_cache(int low, int high)
{
int freed = 0;
- if(pgtable_cache_size > high) {
- do {
- if(pgd_quicklist)
- free_pgd_slow(get_pgd_fast()), freed++;
- if(pmd_quicklist)
- free_pmd_slow(get_pmd_fast()), freed++;
- if(pte_quicklist)
- free_pte_slow(get_pte_fast()), freed++;
- } while(pgtable_cache_size > low);
- }
- return freed;
+ if(pgtable_cache_size > high) {
+ do {
+ if(pgd_quicklist) {
+ free_pgd_slow(get_pgd_fast());
+ freed++;
+ }
+ if(pmd_quicklist) {
+ pmd_free_slow(pmd_alloc_one_fast());
+ freed++;
+ }
+ if(pte_quicklist) {
+ pte_free_slow(pte_alloc_one_fast());
+ freed++;
+ }
+ } while(pgtable_cache_size > low);
+ }
+ return freed;
}

/*
diff -urp 2.4.3p6/include/asm-alpha/pgalloc.h linux/include/asm-alpha/pgalloc.h
--- 2.4.3p6/include/asm-alpha/pgalloc.h Sat Dec 30 01:07:23 2000
+++ linux/include/asm-alpha/pgalloc.h Wed Mar 21 19:48:27 2001
@@ -241,6 +241,9 @@ extern struct pgtable_cache_struct {
#define pte_quicklist (quicklists.pte_cache)
#define pgtable_cache_size (quicklists.pgtable_cache_sz)

+#define pmd_populate(pmd, pte) pmd_set(pmd, pte)
+#define pgd_populate(pgd, pmd) pgd_set(pgd, pmd)
+
extern pgd_t *get_pgd_slow(void);

static inline pgd_t *get_pgd_fast(void)
@@ -268,9 +271,15 @@ static inline void free_pgd_slow(pgd_t *
free_page((unsigned long)pgd);
}

-extern pmd_t *get_pmd_slow(pgd_t *pgd, unsigned long address_premasked);
+static inline pmd_t *pmd_alloc_one(void)
+{
+ pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL);
+ if (ret)
+ clear_page(ret);
+ return ret;
+}

-static inline pmd_t *get_pmd_fast(void)
+static inline pmd_t *pmd_alloc_one_fast(void)
{
unsigned long *ret;

@@ -282,21 +291,27 @@ static inline pmd_t *get_pmd_fast(void)
return (pmd_t *)ret;
}

-static inline void free_pmd_fast(pmd_t *pmd)
+static inline void pmd_free_fast(pmd_t *pmd)
{
*(unsigned long *)pmd = (unsigned long) pte_quicklist;
pte_quicklist = (unsigned long *) pmd;
pgtable_cache_size++;
}

-static inline void free_pmd_slow(pmd_t *pmd)
+static inline void pmd_free_slow(pmd_t *pmd)
{
free_page((unsigned long)pmd);
}

-extern pte_t *get_pte_slow(pmd_t *pmd, unsigned long address_preadjusted);
+static inline pte_t *pte_alloc_one(void)
+{
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
+ if (pte)
+ clear_page(pte);
+ return pte;
+}

-static inline pte_t *get_pte_fast(void)
+static inline pte_t *pte_alloc_one_fast(void)
{
unsigned long *ret;

@@ -308,66 +323,22 @@ static inline pte_t *get_pte_fast(void)
return (pte_t *)ret;
}

-static inline void free_pte_fast(pte_t *pte)
+static inline void pte_free_fast(pte_t *pte)
{
*(unsigned long *)pte = (unsigned long) pte_quicklist;
pte_quicklist = (unsigned long *) pte;
pgtable_cache_size++;
}

-static inline void free_pte_slow(pte_t *pte)
+static inline void pte_free_slow(pte_t *pte)
{
free_page((unsigned long)pte);
}

-extern void __bad_pte(pmd_t *pmd);
-extern void __bad_pmd(pgd_t *pgd);
-
-#define pte_free_kernel(pte) free_pte_fast(pte)
-#define pte_free(pte) free_pte_fast(pte)
-#define pmd_free_kernel(pmd) free_pmd_fast(pmd)
-#define pmd_free(pmd) free_pmd_fast(pmd)
+#define pte_free(pte) pte_free_fast(pte)
+#define pmd_free(pmd) pmd_free_fast(pmd)
#define pgd_free(pgd) free_pgd_fast(pgd)
#define pgd_alloc() get_pgd_fast()
-
-static inline pte_t * pte_alloc(pmd_t *pmd, unsigned long address)
-{
- address = (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
- if (pmd_none(*pmd)) {
- pte_t *page = get_pte_fast();
-
- if (!page)
- return get_pte_slow(pmd, address);
- pmd_set(pmd, page);
- return page + address;
- }
- if (pmd_bad(*pmd)) {
- __bad_pte(pmd);
- return NULL;
- }
- return (pte_t *) pmd_page(*pmd) + address;
-}
-
-static inline pmd_t * pmd_alloc(pgd_t *pgd, unsigned long address)
-{
- address = (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
- if (pgd_none(*pgd)) {
- pmd_t *page = get_pmd_fast();
-
- if (!page)
- return get_pmd_slow(pgd, address);
- pgd_set(pgd, page);
- return page + address;
- }
- if (pgd_bad(*pgd)) {
- __bad_pmd(pgd);
- return NULL;
- }
- return (pmd_t *) pgd_page(*pgd) + address;
-}
-
-#define pte_alloc_kernel pte_alloc
-#define pmd_alloc_kernel pmd_alloc

extern int do_check_pgt_cache(int, int);

2001-03-25 15:55:49

by Ingo Molnar

[permalink] [raw]
Subject: [patch] pae-2.4.3-A4


On Mon, 19 Mar 2001, Linus Torvalds wrote:

> The complete changelog is appended, but the biggest recent change is
> the mmap_sem change, which I updated with new locking rules for
> pte/pmd_alloc to avoid the race on the actual page table build.
>
> This has only been tested on i386 without PAE, and is known to break
> other architectures. Ingo, mind checking what PAE needs? [...]

one nontrivial issue was that on PAE the pgd has to be installed with
'present' pgd entries, due to a CPU erratum. This means that the
pgd_present() code in mm/memory.c, while correct theoretically, doesnt
work with PAE. An equivalent solution is to use !pgd_none(), which also
works with the PAE workaround.

PAE mode could re-define pgd_present() to filter out the workaround - do
you prefer this to the !pgd_none() solution?

the rest was pretty straightforward.

in any case, with the attached pae-2.4.3-A4 patch (against 2.4.3-pre7,
applies to 2.4.2-ac24 cleanly as well) applied, 2.4.3-pre7 boots & works
just fine on PAE 64GB-HIGHMEM and non-PAE kernels.

- the patch also does another cleanup: removes various bad_pagetable code
snippets all around the x86 tree, it's not needed anymore. This saves 8
KB RAM on x86 systems.

- removed the last remaining *_kernel() macro.

- fixed a minor clear_page() bug in pgalloc.h, gfp() could fail in the
future.

Ingo


Attachments:
pae-2.4.3-A4 (5.74 kB)

2001-03-25 16:35:30

by Russell King

[permalink] [raw]
Subject: Re: [patch] pae-2.4.3-A4

On Sun, Mar 25, 2001 at 04:53:37PM +0200, Ingo Molnar wrote:
> one nontrivial issue was that on PAE the pgd has to be installed with
> 'present' pgd entries, due to a CPU erratum. This means that the
> pgd_present() code in mm/memory.c, while correct theoretically, doesnt
> work with PAE. An equivalent solution is to use !pgd_none(), which also
> works with the PAE workaround.

Certainly that's the way the original *_alloc routines used to work.
In fact, ARM never had need to implement the pmd_present() macros, since
they were never referenced - only the pmd_none() macros were.

However, I'm currently struggling with this change on ARM - so far after
a number of hours trying to kick something into shape, I've not managed
to even get to the stange where I get a kernel image to link, let alone
the compilation to finish.

One of my many dilemas at the moment is how to allocate the page 0 PMD
in pgd_alloc(), where we don't have a mm_struct to do the locking against.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-03-25 18:09:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] pae-2.4.3-A4



On Sun, 25 Mar 2001, Ingo Molnar wrote:
>
> one nontrivial issue was that on PAE the pgd has to be installed with
> 'present' pgd entries, due to a CPU erratum. This means that the
> pgd_present() code in mm/memory.c, while correct theoretically, doesnt
> work with PAE. An equivalent solution is to use !pgd_none(), which also
> works with the PAE workaround.

Note that due to the very same erratum, we really should populate the PGD
from the very beginning. See the other thread about how we currently
fail to properly invalidate the TLB on other CPU's when we add a new PGD
entry, exactly because the other CPU's are caching the "nonexistent" PGD
entry that we just replaced.

So my suggestion for PAE is:

- populate in gdb_alloc() (ie just do the three "alloc_page()" calls to
allocate the PMD's immediately)

NOTE: This makes the race go away, and will actually speed things up as
we will pretty much in practice always populate the PGD _anyway_, the
way the VM areas are laid out.

- make "pgd_present()" always return 1.

NOTE: This will speed up the page table walkers anyway. It will also
avoid the problem above.

- make "free_pmd()" a no-op.

All of the above will (a) simplify things (b) remove special cases and (c)
remove actual and existing bugs.

(In fact, the reason why the PGD populate missing TLB invalidate probably
never happens in practice is exactly the fact that the PGD is always
populated so fast that it's hard to make a test-case that shows this. But
it's still a bug - probably fairly easily triggered by a threaded program
that is statically linked (so that the code loads at 0x40000000 and
doesn't have the loader linked low - so the lowest PGD entry is not
allocated until later).

Does anybody see any problems with the above? It looks like the obvious
fix.

Linus

2001-03-25 19:53:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] pae-2.4.3-A4


On Sun, 25 Mar 2001, Linus Torvalds wrote:

> So my suggestion for PAE is:
>
> - populate in gdb_alloc() (ie just do the three "alloc_page()" calls to
> allocate the PMD's immediately)
>
> NOTE: This makes the race go away, and will actually speed things up as
> we will pretty much in practice always populate the PGD _anyway_, the
> way the VM areas are laid out.
>
> - make "pgd_present()" always return 1.
>
> NOTE: This will speed up the page table walkers anyway. It will also
> avoid the problem above.
>
> - make "free_pmd()" a no-op.
>
> All of the above will (a) simplify things (b) remove special cases and
> (c) remove actual and existing bugs.

yep, this truly makes things so much easier and cleaner! There was only
one thing missing to make it work:

- make "pgd_clear()" a no-op.

[the reason for the slightly more complex pgd_alloc_slow() code is to
support non-default virtual memory splits as well, where the number of
user pgds is not necesserily 3.]

plus i took to opportunity to reduce the allocation size of PAE-pgds.
Their size is only 32 bytes, and we allocated a full page. Now the code
kmalloc()s a 32 byte cacheline for the pgd. (there is a hardware
constraint on alignment: this cacheline must be at least 16-byte aligned,
which is true for the current kmalloc() code.) So the per-process cost is
reduced by almost 4 KB.

and i got rid of pgalloc-[2|3]level.h - with the pmds merged into the pgd
logic the algorithmic difference between 2-level and this pseudo-3-level
PAE paging is not that big anymore. The pgtable-[2|3]level.h files are
still separate.

the attached pae-2.4.3-B2 patch (against 2.4.3-pre7) compiles & boots just
fine both in PAE and non-PAE mode. The patch removes 217 lines, and adds
only 78 lines.

Ingo


Attachments:
pae-2.4.3-B2 (11.10 kB)
Subject: Re: [patch] pae-2.4.3-A4

> plus i took to opportunity to reduce the allocation size of PAE-pgds.
> Their size is only 32 bytes, and we allocated a full page. Now the code
> kmalloc()s a 32 byte cacheline for the pgd. (there is a hardware
> constraint on alignment: this cacheline must be at least 16-byte aligned,
> which is true for the current kmalloc() code.)

No, it isn't ;-)

Just enable slab debugging (it's a kernel option in alan's ac kernels),
and then the kmalloc result is not aligned anymore.

--
Manfred

2001-03-30 14:01:23

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] PAE zap_low_mappings no-op

i386 pgd_clear() is now a no-op with PAE as without:
so zap_low_mappings() isn't zapping in the PAE case.
Patch below against 2.4.3, or 2.4.2-ac28 offset 1 line.

Hugh

--- 2.4.3/arch/i386/mm/init.c Mon Mar 26 20:01:56 2001
+++ linux/arch/i386/mm/init.c Fri Mar 30 14:46:34 2001
@@ -309,14 +309,11 @@
* Zap initial low-memory mappings.
*
* Note that "pgd_clear()" doesn't do it for
- * us in this case, because pgd_clear() is a
- * no-op in the 2-level case (pmd_clear() is
- * the thing that clears the page-tables in
- * that case).
+ * us, because pgd_clear() is a no-op on i386.
*/
for (i = 0; i < USER_PTRS_PER_PGD; i++)
#if CONFIG_X86_PAE
- pgd_clear(swapper_pg_dir+i);
+ set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(empty_zero_page)));
#else
set_pgd(swapper_pg_dir+i, __pgd(0));
#endif

2001-04-09 16:00:57

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] 2.4.4-pre1 sparc/mm typo

--- 2.4.4-pre1/arch/sparc/mm/generic.c Sat Apr 7 08:15:16 2001
+++ linux/arch/sparc/mm/generic.c Mon Apr 9 16:48:42 2001
@@ -21,7 +21,7 @@
struct page *ptpage = pte_page(page);
if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
return;
- page_cache_release(page);
+ page_cache_release(ptpage);
return;
}
swap_free(pte_to_swp_entry(page));