2006-08-29 14:15:32

by Dong Feng

Subject: The 3G (or nG) Kernel Memory Space Offset

The Linux kernel permanently maps the 3-4G linear memory space to the 0-4G
physical memory space. My question is: what is the rationale
behind this counterintuitive mapping? Is it just a personal
choice made by the early kernel developers?


2006-08-29 14:33:11

by Andi Kleen

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

On Tuesday 29 August 2006 16:15, Dong Feng wrote:
> The Linux kernel permanently maps the 3-4G linear memory space to the 0-4G

The mainline i386 kernel doesn't, no.

-Andi

2006-08-29 14:44:16

by Jan Engelhardt

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

>
> The Linux kernel permanently maps the 3-4G linear memory space to the 0-4G
> physical memory space.

"3-4G linear memory space" is usually the "kernel space", i.e. 0xc0000000
upwards. On x86, the kernel is normally mapped there.

"0-4G physical memory space" denotes RAM. Since kernelspace is resident, it
only seems logical to map it to 0G (that is, the start of RAM), because the
end of RAM can be flexible.

IOW, you cannot map kernelspace to the physical location 0xc0000000 because
there might not be that much RAM.

(Also note the PCI memory hole which is near the end of the 4G range.)
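
For illustration, the fixed offset shows up as a compile-time constant;
a simplified sketch, modeled on the i386 headers of that era (not
verbatim):

/* Simplified sketch of the i386 linear-to-physical offset. */
#define __PAGE_OFFSET   0xC0000000UL    /* the 3G user/kernel split */
#define PAGE_OFFSET     ((unsigned long)__PAGE_OFFSET)

/* Converting between kernel linear and physical addresses is a
 * constant subtraction, because low RAM is mapped 1:1 at PAGE_OFFSET. */
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))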

> My question is: what is the rationale
> behind this counterintuitive mapping? Is it just a personal
> choice made by the early kernel developers?


Jan Engelhardt
--

2006-08-29 16:01:55

by Dong Feng

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

2006/8/29, Jan Engelhardt <[email protected]>:
>
> "0-4G physical memory space" denotes RAM. Since kernelspace is resident, it
> only seems logical to map it to 0G (that is, the start of RAM), because the
> end of RAM can be flexible.
>
> IOW, you cannot map kernelspace to the physical location 0xc0000000 because
> there might not be that much RAM.
>
> (Also note the PCI memory hole which is near the end of the 4G range.)
>
>
> Jan Engelhardt
> --
>


Sorry for my typo. I actually meant "0-1G physical memory space." My
question is really why there is a 3G offset from the linear kernel to
the physical kernel. Why not simply have the kernel's linear memory
space located at the 0-1G linear addresses, so that the physical kernel
and the linear kernel simply coincide?

Or perhaps this offset is just a personal preference. Say the first
kernel designer had decided to locate the kernel at the 2-3G linear
addresses; then a 2G offset would have appeared in the code. Is this the case?

2006-08-29 16:05:44

by Arjan van de Ven

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

On Wed, 2006-08-30 at 00:01 +0800, Dong Feng wrote:
> 2006/8/29, Jan Engelhardt <[email protected]>:
> >
> > "0-4G physical memory space" denotes RAM. Since kernelspace is resident, it
> > only seems logical to map it to 0G (that is, the start of RAM), because the
> > end of RAM can be flexible.
> >
> > IOW, you cannot map kernelspace to the physical location 0xc0000000 because
> > there might not be that much RAM.
> >
> > (Also note the PCI memory hole which is near the end of the 4G range.)
> >
> >
> > Jan Engelhardt
> > --
> >
>
>
> Sorry for my typo. I actually meant "0-1G physical memory space." My
> question is really why there is a 3G offset from the linear kernel to
> the physical kernel. Why not simply have the kernel's linear memory
> space located at the 0-1G linear addresses, so that the physical kernel
> and the linear kernel simply coincide?


the price for that would be that you would have to flush all the TLBs
on each syscall. That's seen as quite a hefty price by many kernel
developers.
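
(For the curious: an address-space switch on x86 means reloading %cr3,
which discards all non-global TLB entries as a side effect. A minimal
sketch, assuming i386 inline assembly:)

/* Minimal sketch: reloading CR3 is how x86 switches page tables,
 * and it invalidates every non-global TLB entry. */
static inline void load_cr3(unsigned long pgdir_phys)
{
        asm volatile("movl %0, %%cr3" : : "r" (pgdir_phys) : "memory");
}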


2006-08-29 16:13:33

by Christoph Lameter

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

On Wed, 30 Aug 2006, Dong Feng wrote:

> Or perhaps this offset is just a personal preference. Say the first
> kernel designer had decided to locate the kernel at the 2-3G linear
> addresses; then a 2G offset would have appeared in the code. Is this the case?

Well, this is the second time you have suggested that the reasons for
technical decisions come down to personal preference. Are you trying to
provoke us into answering your question?

2006-08-29 16:16:34

by Dong Feng

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

No, please do not get me wrong. And please tolerate my poor English.

My intention is just to understand whether there is a definite
rationale behind the design choice, or whether some choice simply had
to be made and could have been made arbitrarily.

Sorry again if my English causes any misunderstanding.


2006/8/30, Christoph Lameter <[email protected]>:
> On Wed, 30 Aug 2006, Dong Feng wrote:
>
> > Or perhaps this offset is just a personal preference. Say the first
> > kernel designer had decided to locate the kernel at the 2-3G linear
> > addresses; then a 2G offset would have appeared in the code. Is this the case?
>
> Well, this is the second time you have suggested that the reasons for
> technical decisions come down to personal preference. Are you trying to
> provoke us into answering your question?
>
>

2006-08-29 16:41:41

by Jan Engelhardt

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

>>
>> Sorry for my typo. I actually meant "0-1G physical memory space." My
>> question is really why there is a 3G offset from the linear kernel to
>> the physical kernel. Why not simply have the kernel's linear memory
>> space located at the 0-1G linear addresses, so that the physical kernel
>> and the linear kernel simply coincide?
>
>the price for that would be that you would have to flush all the TLBs
>on each syscall. That's seen as quite a hefty price by many kernel
>developers.

Since it's all just virtual addresses, is the TLB flush really that much
different when kernelspace runs from (virtual) 0x00000000-0x3FFFFFFF rather
than (virtual) 0xC0000000-0xFFFFFFFF?


Jan Engelhardt
--

2006-08-29 16:43:08

by Jeremy Fitzhardinge

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

Dong Feng wrote:
> Sorry for my typo. I actually meant "0-1G physical memory space." My
> question is really why there is a 3G offset from the linear kernel to
> the physical kernel. Why not simply have the kernel's linear memory
> space located at the 0-1G linear addresses, so that the physical kernel
> and the linear kernel simply coincide?

If kernel virtual addresses were low, you would either need to do an
address-space switch (=TLB flush) on every user-kernel switch, or
require userspace to be at some high address. The former would be very
expensive, and the latter very strange (the standard x86 ABI requires
low addresses). The clean solution is to map the kernel to the high
part of the address space, but it is easier to load the kernel into low
physical memory at boot, thus leading to a physical-virtual offset. The
selection of 3G is a reasonable tradeoff of physical memory size vs user
virtual address space size, but of course it can be adjusted, or you can
use highmem.
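
(To illustrate the "adjusted" part: the split is just a compile-time
constant. A rough sketch, assuming the CONFIG_VMSPLIT_* option names;
not a verbatim copy of the headers:)

#if defined(CONFIG_VMSPLIT_2G)
# define __PAGE_OFFSET  0x80000000UL    /* 2G user / 2G kernel */
#elif defined(CONFIG_VMSPLIT_1G)
# define __PAGE_OFFSET  0x40000000UL    /* 1G user / 3G kernel */
#else
# define __PAGE_OFFSET  0xC0000000UL    /* default: 3G user / 1G kernel */
#endif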

J

2006-08-29 16:44:49

by Jeremy Fitzhardinge

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

Jan Engelhardt wrote:
> Since it's all just virtual addresses, is the TLB flush really that much
> different when kernelspace runs from (virtual) 0x00000000-0x3FFFFFFF rather
> than (virtual) 0xC0000000-0xFFFFFFFF?
>

If kernel and userspace are disjoint, they can be in the same address
space, so there's no need for a TLB flush at all.

J

2006-08-29 18:37:18

by pg_lkm

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

[ ... ]

df> My question is really why there is a 3G offset from the linear
df> kernel to the physical kernel. Why not simply have the kernel's
df> linear memory space located at the 0-1G linear addresses, so that
df> the physical kernel and the linear kernel simply coincide?

First of all, there are _three_ mapping regions:

* the per-process address space (x86 default: 3GiB at address 0);
* the kernel address space (x86 default: 128MiB at address 3GiB);
* the real memory address space (x86 default: the last 896MiB).

The kernel address space is small and does not matter much in this
discussion, except for stealing 128MiB. What really matters are the
other two. Note also that the memory-resident pages of a process
are necessarily mapped twice, once in the per-process address
space and once in the real memory space.

There are actually three possible cases:

1) per-process mapped low, real memory mapped high (e.g. 3GiB+128MiB+896MiB).
2) real memory mapped low, per-process mapped high (e.g. 896MiB+128MiB+3GiB).
3) both per-process and real memory mapped low (e.g. 3.9GiB+128MiB), with
real memory/per-process flipping or something else.
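
(Where the 896MiB default comes from, as a rough sketch modeled on the
i386 headers, not verbatim: 4GiB minus the 3GiB per-process space minus
the 128MiB kernel space.)

#define __PAGE_OFFSET      0xC0000000UL              /* 3GiB per-process  */
#define __VMALLOC_RESERVE  (128UL << 20)             /* 128MiB kernel     */
#define MAXMEM (-__PAGE_OFFSET - __VMALLOC_RESERVE)  /* = 896MiB of RAM   */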

jeremy> If kernel virtual addresses were low, you would either
jeremy> need to do an address-space switch (=TLB flush)

This is case #3, which was the norm on many platforms, e.g. UNIX
on the PDP-11. To be practical, it requires special instructions to
load/store from unmapped address spaces. Linus prefers to map
both kernel and physical memory in every address space.

jeremy> on every user-kernel switch,

Not on every user-kernel switch, because there are two (or
three) possibilities:

* Only the real-memory address space has the 128MiB kernel
address space map, which seems to be what this phrase assumes.

* Each address space, including both per-process ones and the
real memory one, has a 128MiB mapping for the kernel address
space.

If the 128MiB kernel address space were still to be mapped in
every process address space *and* the real memory space, one would
need a switch only when the kernel wants to access per-process
space and real memory within a short time. Unless there is some
special way that allows the kernel to address real memory directly,
but then that gets a bit cumbersome.

jeremy> or require userspace to be at some high address.

This is case #2.

jeremy> and the latter very strange (the standard x86 ABI
jeremy> requires low addresses).

Strange does not matter a lot; but it is somewhat surprising
that the x86 ABI, which includes shared libs all over the place,
does require low addresses. But if that is the case it must have
been an important point in the past, when layout compatibility
might have mattered for iBCS (anybody remember that? :->).

What I suspect is more likely as a reason to avoid mapping
the per-process address space at high addresses is that it would
have broken many incorrect programs... :-)

jeremy> The clean solution is to map the kernel to the high part
jeremy> of the address space,

Not necessarily clean, but perhaps required by ABI compatibility.

jeremy> but it is easier to load the kernel into low physical
jeremy> memory at boot, thus leading to a physical-virtual
jeremy> offset.

Odd, because this is an argument for case #2 or #3: because
then one loads the kernel code at low physical addresses, and
then maps it 1:1 onto virtual addresses.

jeremy> The selection of 3G is a reasonable tradeoff of physical
jeremy> memory size vs user virtual address space size, but of
jeremy> course it can be adjusted, or you can use highmem.

Probably this was a bit dumb, because of the missing-128MiB
syndrome. I would set the default for per-process space to
2GiB-128MiB, leaving those users who want more per-process
address space, and who don't want to move to 64-bit, to move the
boundary back. Some reflections on this are in:

http://WWW.sabi.co.UK/Notes/#060821c

2006-08-29 21:15:48

by Jeremy Fitzhardinge

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

Peter Grandi wrote:
> [ ... ]
>
> df> My question is really why there is a 3G offset from the linear
> df> kernel to the physical kernel. Why not simply have the kernel's
> df> linear memory space located at the 0-1G linear addresses, so that
> df> the physical kernel and the linear kernel simply coincide?
>
> First of all, there are _three_ mapping regions:
>
> * the per-process address space (x86 default: 3GiB at address 0);
> * the kernel address space (x86 default: 128MiB at address 3GiB);
> * the real memory address space (x86 default: the last 896MiB).
>
> The kernel address space is small and does not matter much in this
> discussion, except for stealing 128MiB. What really matters are the
> other two. Note also that the memory-resident pages of a process
> are necessarily mapped twice, once in the per-process address
> space and once in the real memory space.
>
> There are actually three possible cases:
>
> 1) per-process mapped low, real memory mapped high (e.g. 3GiB+128MiB+896MiB).
> 2) real memory mapped low, per-process mapped high (e.g. 896MiB+128MiB+3GiB).
> 3) both per-process and real memory mapped low (e.g. 3.9GiB+128MiB), with
> real memory/per-process flipping or something else.
>
> jeremy> If kernel virtual addresses were low, you would either
> jeremy> need to do an address-space switch (=TLB flush)
>
> This is case #3, which was the norm on many platforms, e.g. UNIX
> on the PDP-11. To be practical, it requires special instructions to
> load/store from unmapped address spaces. Linus prefers to map
> both kernel and physical memory in every address space.
>

Linux used to use segmentation to do something like this; %fs was set up
to point to the user address space in the kernel, and accesses to
userspace used an %fs segment override. This history is still visible
in the naming of get/set_fs (which has nothing to do with filesystems).
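
(For the curious, the surviving idiom looks roughly like this; an
illustrative sketch, not verbatim kernel code:)

#include <linux/fs.h>
#include <asm/uaccess.h>

/* Temporarily widen the "user" address limit so that a kernel
 * buffer passes the user-pointer checks in vfs_read(). */
static ssize_t read_into_kernel_buf(struct file *file, char *buf,
                                    size_t count, loff_t *pos)
{
        mm_segment_t old_fs = get_fs();
        ssize_t ret;

        set_fs(KERNEL_DS);                      /* widen the limit */
        ret = vfs_read(file, (char __user *)buf, count, pos);
        set_fs(old_fs);                         /* always restore  */
        return ret;
}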

> * Only the real-memory address space has the 128MiB kernel
> address space map, which seems to be what this phrase assumes.
>
> * Each address space, including both per-process ones and the
> real memory one, has a 128MiB mapping for the kernel address
> space.
>

By "real" I assume you mean "physical". What you're suggesting is
something akin to highmem, but applied to all memory. With highmem, the
kernel can't assume it has direct access to all physical memory, and you
must explicitly map it in with kmap() to use it. You could do this with
all memory all the time, but with an obvious performance (and
complexity) overhead.
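
(Usage is roughly as follows; a minimal sketch:)

#include <linux/highmem.h>
#include <linux/string.h>

/* A highmem page is not in the kernel's direct map, so it has to
 * be mapped temporarily before the kernel can touch its contents. */
static void zero_page_contents(struct page *page)
{
        void *vaddr = kmap(page);       /* map it (may sleep) */
        memset(vaddr, 0, PAGE_SIZE);
        kunmap(page);                   /* drop the mapping   */
}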

> Strange does not matter a lot; but it is somewhat surprising
> that the x86 ABI, which includes shared libs all over the place,
> does require low addresses. But if that is the case it must have
> been an important point in the past, when layout compatibility
> might have mattered for iBCS (anybody remember that? :->).
>

Well, if you're mapping kernel+physical memory at low addresses, it
means that userspace moves about depending on where you want to put the
user/kernel split. That's a lot harder to deal with than just moving
around the limit.

The load address for ET_EXEC executables is defined as 0x08048000; you
can use ET_DYN if you want to load them elsewhere. Using lower
addresses allows the use of instructions with smaller pointers and
offsets (though this might be less important on x86). x86-64's normal
compilation model requires non-relocatable code to be in the lower
2Gbytes, for example.
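
(This is easy to see from the ELF header itself; a small userspace
sketch using <elf.h>:)

/* Print an executable's ELF type and entry point; i386 ET_EXEC
 * binaries conventionally land near 0x08048000. */
#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        Elf32_Ehdr eh;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
                return 1;
        if (fread(&eh, sizeof(eh), 1, f) != 1)
                return 1;
        printf("type=%d (2=ET_EXEC, 3=ET_DYN), entry=%#lx\n",
               (int)eh.e_type, (unsigned long)eh.e_entry);
        fclose(f);
        return 0;
}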

> Odd, because this is an argument for case #2 or #3: because
> then one loads the kernel code at low physical addresses, and
> then maps it 1:1 onto virtual addresses.
>

I'm pointing out that the existing design has a reasonable technical
justification, and is not somebody's arbitrary personal choice. There
are certainly other possible designs, with their own pros and cons.

J

2006-08-30 04:32:12

by George Spelvin

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

Just to answer the question in elementary terms:

This is because:
- On x86, the user and kernel share the available 4G virtual address space.
- User space gets first choice, and so takes the low 3G.
- The kernel thus has to use the high 1G, and if it wants a copy
of physical memory, that's the only place it can go.

In somewhat more detail:

1) In standard x86 Linux, the user and kernel address spaces share the 4
GB virtual address space of the x86 processor. There are other ways
to do it (see the 4G+4G patch for an example), but they're slower.

x86 processors only support one set of page tables at a time, and
changing them is a slow operation. Other processors let you have separate
user and kernel page tables active simultaneously, but x86 does not.

So for speed, you don't want to change page tables to make a system
call. Also, many system calls are passed pointers to buffers in user
memory, and so need to access user memory. It's fastest and easiest to do
this if user memory is in the address space when executing kernel code.

Fortunately, x86 page tables have a "user" bit in each page-table
entry that can make pages accessible only from the kernel. They are
still in the user's virtual address space, but can't be accessed from
user mode.
Thus, it is possible for the user and kernel to share the address space.

So, given all of this, Linux (as well as most other operating systems)
on x86 has decided to divide the 4 GB virtual address space into "user"
and "kernel" parts. As far as the user is concerned, the kernel part
is just "missing", so it's made as small as reasonably possible.
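
(Concretely, that "user" bit is one of the architectural page-table-entry
bits; roughly as the i386 headers name them:)

#define _PAGE_PRESENT  0x001    /* mapping is valid */
#define _PAGE_RW       0x002    /* writable */
#define _PAGE_USER     0x004    /* clear => supervisor (kernel) only */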

2) The division chosen is that the user gets the low 3G of the address
space, and the kernel gets the high 1G. x86 ABI standards require
that user space gets low addresses, and in any case, the kernel exists
to make user-space programs happy.

3) The kernel finds it convenient to have a copy of physical memory in its
address space, so it maps one. If there's more RAM than will fit in the
kernel address space, the HIGHMEM patches provide an alternative.
Since this is an elementary explanation, I won't describe how that works.

Thus, the physical memory map used in the kernel ends up offset by 3G.

2006-08-30 06:12:42

by Jan Engelhardt

Subject: Re: The 3G (or nG) Kernel Memory Space Offset

>
> The load address for ET_EXEC executables is defined as 0x08048000;
> you can use ET_DYN if you want to load them elsewhere. Using lower
> addresses allows the use of instructions with smaller pointers and
> offsets (though this might be less important on x86).

Less so on x86. HTE tells me there are only two ways (EB and E9):

EB ??             jmp OFFSET8    (16/32/64-bit modes)
E9 ?? ??          jmp OFFSET16   (16-bit mode)
E9 ?? ?? ?? ??    jmp OFFSET32   (32/64-bit modes)



Jan Engelhardt
--