2008-07-09 16:51:43

by Mike Travis

Subject: [RFC 00/15] x86_64: Optimize percpu accesses


This patchset provides the following:

* Cleanup: Fix early references to cpumask_of_cpu(0)

Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map
is allocated and initialized.

* Generic: Percpu infrastructure to rebase the per cpu area to zero

This provides the capability of accessing percpu variables
using a local register instead of having to go through a table
on node 0 to find the cpu-specific offsets. It would also allow
atomic operations on percpu variables to reduce the required locking.
Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
generic code that the arch has this new basing.

(Note: split into two patches, one to rebase percpu variables at 0,
and the second to actually use %gs as the base for percpu variables.)

* x86_64: Fold pda into per cpu area

Declare the pda as a per cpu variable. This will move the pda
area to an address accessible by the x86_64 per cpu macros.
Subtracting __per_cpu_start makes the offset relative to the
beginning of the per cpu area. Since %gs is pointing to the
pda, it will then also point to the per cpu variables and can be
accessed thusly:

%gs:[&per_cpu_xxxx - __per_cpu_start]

* x86_64: Rebase per cpu variables to zero

Take advantage of the zero-based per cpu area provided above.
Then we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
to the pda and the per cpu area thereby allowing access to the
pda with the x86_64 pda operations and access to the per cpu
variables using x86_32 percpu operations.
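
To make the intended end state concrete, here is a minimal sketch of the
resulting access pattern (illustrative only; the macro name and exact
constraints are assumptions, not the patchset's code, though it mirrors
the x86_32 percpu_from_op style):

/* Sketch: read this CPU's copy of a percpu variable in one
 * instruction. With the per cpu area based at zero and %gs
 * pointing at this CPU's copy, the variable's link address is
 * already its offset from %gs, so no offset table is consulted. */
#define percpu_read(var)                \
({                                      \
        typeof(var) __ret;              \
        asm("mov %%gs:%1, %0"           \
            : "=r" (__ret)              \
            : "m" (var));               \
        __ret;                          \
})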


Based on linux-2.6.tip/master with the following patches applied:

[PATCH 1/1] x86: Add check for node passed to node_to_cpumask V3
[PATCH 1/1] x86: Change _node_to_cpumask_ptr to return const ptr
[PATCH 1/1] sched: Reduce stack size in isolated_cpu_setup()
[PATCH 1/1] kthread: Reduce stack pressure in create_kthread and kthreadd


Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mike Travis <[email protected]>
---

--


2008-07-09 17:23:01

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Hi Mike,

Did the suspected linker bug issue ever get resolved?

-hpa

2008-07-09 17:27:40

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> This patchset provides the following:
>
> * Cleanup: Fix early references to cpumask_of_cpu(0)
>
> Provides an early cpumask_of_cpu(0) usable before the cpumask_of_cpu_map
> is allocated and initialized.
>
> * Generic: Percpu infrastructure to rebase the per cpu area to zero
>
> This provides for the capability of accessing the percpu variables
> using a local register instead of having to go through a table
> on node 0 to find the cpu-specific offsets. It also would allow
> atomic operations on percpu variables to reduce required locking.
> Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
> generic code that the arch has this new basing.
>
> (Note: split into two patches, one to rebase percpu variables at 0,
> and the second to actually use %gs as the base for percpu variables.)
>
> * x86_64: Fold pda into per cpu area
>
> Declare the pda as a per cpu variable. This will move the pda
> area to an address accessible by the x86_64 per cpu macros.
> Subtraction of __per_cpu_start will make the offset based from
> the beginning of the per cpu area. Since %gs is pointing to the
> pda, it will then also point to the per cpu variables and can be
> accessed thusly:
>
> %gs:[&per_cpu_xxxx - __per_cpu_start]
>
> * x86_64: Rebase per cpu variables to zero
>
> Take advantage of the zero-based per cpu area provided above.
> Then we can directly use the x86_32 percpu operations. x86_32
> offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
> to the pda and the per cpu area thereby allowing access to the
> pda with the x86_64 pda operations and access to the per cpu
> variables using x86_32 percpu operations.

The bulk of this series is pda_X to x86_X_percpu conversion. This seems
like pointless churn to me; there's nothing inherently wrong with the
pda_X interfaces, and doing this transformation doesn't get us any
closer to unifying 32 and 64 bit.

I think we should start devolving things out of the pda in the other
direction: make a series where each patch takes a member of struct
x8664_pda, converts it to a per-cpu variable (where possible, the same
one that 32-bit uses), and updates all the references accordingly. When
the pda is as empty as it can be, we can look at removing the
pda-specific interfaces.

J

2008-07-09 17:40:18

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Hi Mike,
>
> Did the suspected linker bug issue ever get resolved?
>
> -hpa

Hi Peter,

I was not able to figure out how the two versions of the same
kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
I'm sticking with gcc-4.2.4 as it boots much farther.

There still is a problem where if I bump THREAD_ORDER, the
problem goes away and everything so far that I've tested boots
up fine.

We tried to install a later gcc (4.3.1) that might have the
"GCC_HAS_SP" flag but our sys admin reported:

The 4.3.1 version gives me errors on the make. I had to
pre-install gmp and mpfr, but, I still get errors on the make.

I think that was the latest he found on the GNU/GCC site.

Thanks,
Mike

2008-07-09 17:40:37

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
> like pointless churn to me; there's nothing inherently wrong with the
> pda_X interfaces, and doing this transformation doesn't get us any
> closer to unifying 32 and 64 bit.

What is the point of the pda_X interface? It does not exist on 32 bit.
The pda wastes the GS segment register on a small memory area. This patchset
makes the GS segment usable to reach all of the per cpu area by placing
the pda into the per cpu area. Thus the pda_X interface becomes obsolete
and the 32 bit per cpu stuff becomes usable under 64 bit, unifying both
architectures.


> I think we should start devolving things out of the pda in the other
> direction: make a series where each patch takes a member of struct
> x8664_pda, converts it to a per-cpu variable (where possible, the same
> one that 32-bit uses), and updates all the references accordingly. When
> the pda is as empty as it can be, we can look at removing the
> pda-specific interfaces.

This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.

2008-07-09 17:45:10

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Did the suspected linker bug issue ever get resolved?

I don't believe so. I think Mike is getting very early crashes
depending on some combination of gcc, linker and kernel config. Or
something.

This fragility makes me very nervous. It seems hard enough to get this
stuff working with current tools; making it work over the whole range of
supported tools looks like it's going to be hard.

J

2008-07-09 17:46:25

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> H. Peter Anvin wrote:
>> Hi Mike,
>>
>> Did the suspected linker bug issue ever get resolved?
>>
>> -hpa
>
> Hi Peter,
>
> I was not able to figure out how the two versions of the same
> kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
> I'm sticking with gcc-4.2.4 as it boots much farther.
>
> There still is a problem where if I bump THREAD_ORDER, the
> problem goes away and everything so far that I've tested boots
> up fine.
>
> We tried to install a later gcc (4.3.1) that might have the
> "GCC_HAS_SP" flag but our sys admin reported:
>
> The 4.3.1 version gives me errors on the make. I had to
> pre-install gmp and mpfr, but, I still get errors on the make.
>
> I think that was the latest he found on the GNU/GCC site.
>

We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1.

-hpa

2008-07-09 17:52:18

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> What is the point of the pda_X interface? It does not exist on 32 bit.
> The pda wastes the GS segment register on a small memory area. This patchset
> makes the GS segment usable to reach all of the per cpu area by placing
> the pda into the per cpu area. Thus the pda_X interface becomes obsolete
> and the 32 bit per cpu stuff becomes usable under 64 bit unifying both
> architectures.
>

I think we agree on the desired outcome. I just disagree with the path
to getting there.

>> I think we should start devolving things out of the pda in the other
>> direction: make a series where each patch takes a member of struct
>> x8664_pda, converts it to a per-cpu variable (where possible, the same
>> one that 32-bit uses), and updates all the references accordingly. When
>> the pda is as empty as it can be, we can look at removing the
>> pda-specific interfaces.
>>
>
> This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.
>

No, it's not churn doing it one object at a time. If you convert
pda.pcurrent into a percpu current_task variable, then at one stroke
you've 1) shrunk the pda, 2) unified with i386. If you go through the
process of converting all the read_pda(pcurrent) references into
x86_read_percpu(pda.pcurrent) then that's a pure churn patch. It
doesn't get rid of the pda variable, it doesn't unify with i386. All it
does is remove a reference to a macro which was fairly inoffensive in
the first place.
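
A minimal sketch of that one-stroke conversion (using the 2008-era macros;
current_task is the percpu variable the 32-bit side already defines, and the
surrounding declarations are assumed for illustration):

/* before, 64-bit only: */
struct task_struct *task = read_pda(pcurrent);

/* after: the pda member becomes an ordinary percpu variable,
 * sharing its name with the existing 32-bit definition */
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
struct task_struct *task2 = x86_read_percpu(current_task);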

Once the pda has shrunk as much as it can (which removes everything
except stack_canary, I think), then remove all the X_pda macros, since
there won't be any users anyway.

J

2008-07-09 18:00:23

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> This patchset provides the following:
>>
>> * Cleanup: Fix early references to cpumask_of_cpu(0)
>>
>> Provides an early cpumask_of_cpu(0) usable before the
>> cpumask_of_cpu_map
>> is allocated and initialized.
>>
>> * Generic: Percpu infrastructure to rebase the per cpu area to zero
>>
>> This provides for the capability of accessing the percpu variables
>> using a local register instead of having to go through a table
>> on node 0 to find the cpu-specific offsets. It also would allow
>> atomic operations on percpu variables to reduce required locking.
>> Uses a new config var HAVE_ZERO_BASED_PER_CPU to indicate to the
>> generic code that the arch has this new basing.
>>
>> (Note: split into two patches, one to rebase percpu variables at 0,
>> and the second to actually use %gs as the base for percpu variables.)
>>
>> * x86_64: Fold pda into per cpu area
>>
>> Declare the pda as a per cpu variable. This will move the pda
>> area to an address accessible by the x86_64 per cpu macros.
>> Subtraction of __per_cpu_start will make the offset based from
>> the beginning of the per cpu area. Since %gs is pointing to the
>> pda, it will then also point to the per cpu variables and can be
>> accessed thusly:
>>
>> %gs:[&per_cpu_xxxx - __per_cpu_start]
>>
>> * x86_64: Rebase per cpu variables to zero
>>
>> Take advantage of the zero-based per cpu area provided above.
>> Then we can directly use the x86_32 percpu operations. x86_32
>> offsets %fs by __per_cpu_start. x86_64 has %gs pointing directly
>> to the pda and the per cpu area thereby allowing access to the
>> pda with the x86_64 pda operations and access to the per cpu
>> variables using x86_32 percpu operations.
>
> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
> like pointless churn to me; there's nothing inherently wrong with the
> pda_X interfaces, and doing this transformation doesn't get us any
> closer to unifying 32 and 64 bit.
>
> I think we should start devolving things out of the pda in the other
> direction: make a series where each patch takes a member of struct
> x8664_pda, converts it to a per-cpu variable (where possible, the same
> one that 32-bit uses), and updates all the references accordingly. When
> the pda is as empty as it can be, we can look at removing the
> pda-specific interfaces.
>
> J

I did compartmentalize the changes so they were in separate patches,
and in particular, by separating the changes to the include files, I
was able to zero in on some problems much more easily.

But I have no objections to leaving the cpu_pda ops in place and then,
as you're suggesting, extract and modify the fields as appropriate.

Another approach would be to leave the changes from XXX_pda() to
x86_percpu_XXX in place, and do the patches with simply changing
pda.VAR to VAR.

In any case I would like to get this version working first, before
attempting that rewrite, as that won't change the generated code.

Btw, while I've got your attention... ;-), there's some code in
arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked
at closer for zero-based per_cpu__gdt_page:

make_lowmem_page_readwrite(&per_cpu__gdt_page);

(I wasn't sure how to deal with this but I suspect the __percpu_offset[]
or __per_cpu_load should be added to it.)

Thanks,
Mike

2008-07-09 18:03:15

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> The bulk of this series is pda_X to x86_X_percpu conversion. This seems
>> like pointless churn to me; there's nothing inherently wrong with the
>> pda_X interfaces, and doing this transformation doesn't get us any
>> closer to unifying 32 and 64 bit.
>
> What is the point of the pda_X interface? It does not exist on 32 bit.
> The pda wastes the GS segment register on a small memory area. This patchset
> makes the GS segment usable to reach all of the per cpu area by placing
> the pda into the per cpu area. Thus the pda_X interface becomes obsolete
> and the 32 bit per cpu stuff becomes usable under 64 bit unifying both
> architectures.
>
>
>> I think we should start devolving things out of the pda in the other
>> direction: make a series where each patch takes a member of struct
>> x8664_pda, converts it to a per-cpu variable (where possible, the same
>> one that 32-bit uses), and updates all the references accordingly. When
>> the pda is as empty as it can be, we can look at removing the
>> pda-specific interfaces.
>
> This patchset places the whole x8664_pda structure into the per cpu area and makes the pda macros operate on the x8664_pda structure in the per cpu area. Not sure why you want to go through the churn of doing it for each object separately.

I think Jeremy's point is that by removing the pda struct entirely, the
references to the fields can be the same for both x86_32 and x86_64.

Thanks,
Mike

2008-07-09 18:05:24

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Mike Travis wrote:
>> H. Peter Anvin wrote:
>>> Hi Mike,
>>>
>>> Did the suspected linker bug issue ever get resolved?
>>>
>>> -hpa
>>
>> Hi Peter,
>>
>> I was not able to figure out how the two versions of the same
>> kernel compiled by gcc-4.2.0 and gcc-4.2.4 differed. Currently,
>> I'm sticking with gcc-4.2.4 as it boots much farther.
>>
>> There still is a problem where if I bump THREAD_ORDER, the
>> problem goes away and everything so far that I've tested boots
>> up fine.
>>
>> We tried to install a later gcc (4.3.1) that might have the
>> "GCC_HAS_SP" flag but our sys admin reported:
>>
>> The 4.3.1 version gives me errors on the make. I had to
>> pre-install gmp and mpfr, but, I still get errors on the make.
>>
>> I think that was the latest he found on the GNU/GCC site.
>>
>
> We have seen miscompilations with gcc 4.3.0 at least; not sure about 4.3.1.
>
> -hpa

Hmm, I wonder how the CONFIG_CC_STACKPROTECTOR was tested? Could it
be a config option for building GCC itself?

Thanks,
Mike

2008-07-09 18:09:45

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so. I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config. Or
> something.

Yes, and unfortunately since SGI does not do its own compilers any
more (they were MIPS compilers), there's no one here that really
understands the internals of the compile tools.

>
> This fragility makes me very nervous. It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like it's going to be hard.

(me too ;-)

Once I get a solid version working with (at least) gcc-4.2.4, then
regression testing with older tools will be easier, or at least a
table of results can be produced.

Thanks,
Mike

2008-07-09 18:14:32

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:

> I think Jeremy's point is that by removing the pda struct entirely, the
> references to the fields can be the same for both x86_32 and x86_64.

That is going to be difficult. The GS register is tied up for the pda area
as long as you have it. And you cannot get rid of the pda because of the library
compatibility issues. We would break binary compatibility if we got rid of the pda.

If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.

It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.

2008-07-09 18:14:46

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
...
>
> Once the pda has shrunk as much as it can (which removes everything
> except stack_canary, I think), then remove all the X_pda macros, since
> there won't be any users anyway.
>
> J

You bring up a good point here. Since the stack_canary has to be 20
(or is that 0x20?) bytes from %gs, then it sounds like we'll still
need a pda struct of some sort. And zero padding before that seems
counter-productive, so perhaps taking a poll of the most used pda
(or percpu) variables and keeping them in the same cache line would
be more useful?

Thanks,
Mike

2008-07-09 18:23:25

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
> ...
>
>> Once the pda has shrunk as much as it can (which removes everything
>> except stack_canary, I think), then remove all the X_pda macros, since
>> there won't be any users anyway.
>>
>> J
>>
>
> You bring up a good point here. Since the stack_canary has to be 20
> (or is that 0x20?) bytes from %gs, then it sounds like we'll still
> need a pda struct of some sort. And zero padding before that seems
> counter-productive, so perhaps taking a poll of the most used pda
> (or percpu) variables and keeping them in the same cache line would
> be more useful?

The offset is 40 (decimal). I don't see any particular problem with
putting zero padding in there; we can get rid of it if
CONFIG_CC_STACKPROTECTOR is off.
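
For reference, a sketch of where the 40 comes from (field offsets as in the
2008-era include/asm-x86/pda.h; abbreviated, so treat the exact layout as an
assumption):

struct x8664_pda {
        struct task_struct *pcurrent;   /* offset  0: current process */
        unsigned long data_offset;      /* offset  8: percpu area offset */
        unsigned long kernelstack;      /* offset 16: top of kernel stack */
        unsigned long oldrsp;           /* offset 24: user rsp for syscalls */
        int irqcount;                   /* offset 32: irq nesting counter */
        unsigned int cpunumber;         /* offset 36: logical CPU number */
#ifdef CONFIG_CC_STACKPROTECTOR
        unsigned long stack_canary;     /* offset 40: gcc emits %gs:40 */
#endif
        /* ... */
};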

The trouble with reusing the space is that it's going to be things like
"current" which are the best candidates for going there - but if you do
that you lose i386 unification (unless you play some tricks where you
arrange to make the percpu variables land there while still appearing to
be normal percpu vars).

J

2008-07-09 18:26:33

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>
> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>

Eh? Yes they will. That's the whole point of making the pda a percpu
variable itself. You can use %gs:<small> to get to the pda, and
%gs:<larger> to get to percpu variables. Converting pda->percpu will
just have the effect of increasing the %gs offset into the percpu space.
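
In other words (a sketch; the variable names are illustrative, and placing
the pda at offset 0 of the per cpu area is the linker script's job):

/* the pda is just the first percpu object, so a single %gs
 * base serves both kinds of access */
DEFINE_PER_CPU(struct x8664_pda, pda);        /* at offset 0        */
DEFINE_PER_CPU(unsigned long, some_counter);  /* at a larger offset */

#define read_pda(field) x86_read_percpu(pda.field) /* %gs:<small>   */
/* x86_read_percpu(some_counter) compiles to a %gs:<larger> access */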

This project isn't interesting to me unless per-cpu variables are
directly accessible off %gs.

J

2008-07-09 18:27:18

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Mike Travis wrote:
>
>> I think Jeremy's point is that by removing the pda struct entirely, the
>> references to the fields can be the same for both x86_32 and x86_64.
>
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>
> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>
> It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.

Is there a comprehensive list of these library accesses to variables
offset from %gs, or is it only the "stack_canary"?

Thanks,
Mike

2008-07-09 18:32:09

by Mike Travis

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
...
> The trouble with reusing the space is that it's going to be things like
> "current" which are the best candidates for going there - but if you do
> that you lose i386 unification (unless you play some tricks where you
> arrange to make the percpu variables land there while still appearing to
> be normal percpu vars).
>
> J


One more approach... ;-) Once the pda and percpu vars are in the same
area, then could the pda be used for both 32 and 64...?

2008-07-09 18:34:43

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> Did the suspected linker bug issue ever get resolved?
>> I don't believe so. I think Mike is getting very early crashes
>> depending on some combination of gcc, linker and kernel config. Or
>> something.
>
> Yes, and unfortunately since SGI does not do its own compilers any
> more (they were MIPS compilers), there's no one here that really
> understands the internals of the compile tools.
>

A bummer, too, since that compiler lives on as the Pathscale compiler...

-hpa

2008-07-09 18:35:26

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> Eh? Yes they will. That's the whole point of making the pda a percpu
> variable itself. You can use %gs:<small> to get to the pda, and
> %gs:<larger> to get to percpu variables. Converting pda->percpu will
> just have the effect of increasing the %gs offset into the percpu space.

Right that is what this patchset does.

> This project isn't interesting to me unless per-cpu variables are
> directly accessible off %gs.

Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).

2008-07-09 18:35:44

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Mike Travis wrote:
>
>> I think Jeremy's point is that by removing the pda struct entirely, the
>> references to the fields can be the same for both x86_32 and x86_64.
>
> That is going to be difficult. The GS register is tied up for the pda area
> as long as you have it. And you cannot get rid of the pda because of the library
> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>

We're talking about the kernel here... who gives a hoot about binary
compatibility?

-hpa

2008-07-09 18:41:41

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Eh? Yes they will. That's the whole point of making the pda a percpu
>> variable itself. You can use %gs:<small> to get to the pda, and
>> %gs:<larger> to get to percpu variables. Converting pda->percpu will
>> just have the effect of increasing the %gs offset into the percpu space.
>
> Right that is what this patchset does.
>
>> This project isn't interesting to me unless per-cpu variables are
>> directly accessible off %gs.
>
> Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).
>

I don't understand the "individual members" requirement here.

-hpa

2008-07-09 18:47:23

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Christoph Lameter wrote:
>
>> Mike Travis wrote:
>>
>>
>>> I think Jeremy's point is that by removing the pda struct entirely, the
>>> references to the fields can be the same for both x86_32 and x86_64.
>>>
>> That is going to be difficult. The GS register is tied up for the pda area
>> as long as you have it. And you cannot get rid of the pda because of the library
>> compatibility issues. We would break binary compatibility if we would get rid of the pda.
>>
>> If one attempts to remove one field after another then the converted accesses will not be able to use GS relative accesses anymore. This can lead to all sorts of complications.
>>
>> It will be possible to shrink the pda (as long as we maintain the fields that glibc needs) after this patchset because the pda and the per cpu area can both be reached with the GS register. So (apart from undiscovered surprises) the generated code is the same.
>>
>
> Is there a comprehensive list of these library accesses to variables
> offset from %gs, or is it only the "stack_canary"?

It's just the stack canary. It isn't library accesses; it's the code
gcc generates:

foo: subq $152, %rsp
movq %gs:40, %rax
movq %rax, 136(%rsp)
...
movq 136(%rsp), %rdx
xorq %gs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
addq $152, %rsp
.p2align 4,,4
ret


There are two irritating things here:

One is that the kernel supports -fstack-protector for x86-64, which
forces us into all these contortions in the first place. We don't
support stack-protector for 32-bit (gcc does), and things are much easier.

The other somewhat orthogonal irritation is the fixed "40". If they'd
generated %gs:__gcc_stack_canary, then we could alias that to a per-cpu
variable like anything else and the whole problem would go away - and we
could support stack-protector on 32-bit with no problems (and normal
usermode could define __gcc_stack_canary to be a weak symbol with value
"40" (20 on 32-bit) for backwards compatibility).

I'm close to proposing that we run a post-processor over the generated
assembly to perform the %gs:40 -> %gs:__gcc_stack_canary transformation
and deal with it that way.

J

2008-07-09 18:48:40

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Maybe I misunderstood but it seems that you proposed to convert individual members of the pda structure (which uses GS) to per cpu variables (which without this patchset cannot use GS).
>

I have no objections to parts 1-3 of the patchset. It's just 4-15,
which does the mechanical conversion.

J

2008-07-09 18:55:06

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Christoph Lameter wrote:
>> Maybe I misunderstood but it seems that you proposed to convert
>> individual members of the pda structure (which uses GS) to per cpu
>> variables (which without this patchset cannot use GS).
>>
>
> I have no objections to parts 1-3 of the patchset. It's just 4-15,
> which does the mechanical conversion.

Well, yes, I agree these could be better if the fields were moved out of the pda
structure itself, but then it won't be mechanical anymore and will require more
review. But these are an important step because they allow us to get rid of the
pda operations that do not exist for 32 bit.

2008-07-09 19:05:51

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> I did compartmentalize the changes so they were in separate patches,
> and in particular, by separating the changes to the include files, I
> was able to zero in on some problems much easier.
>
> But I have no objections to leaving the cpu_pda ops in place and then,
> as you're suggesting, extract and modify the fields as appropriate.
>
> Another approach would be to leave the changes from XXX_pda() to
> x86_percpu_XXX in place, and do the patches with simply changing
> pda.VAR to VAR.
>

Yes, but that's still two patches where one would do. If I'm going to
go through the effort of reconciling your percpu patches with my code,
I'd like to be able to remove some #ifdef CONFIG_X86_64s in the process.

> In any case I would like to get this version working first, before
> attempting that rewrite, as that won't change the generated code.
>

Well, as far as I can tell the real meat of the series is in 1-3 and the
rest is fluff. If just applying 1-3 works, then everything else should too.

> Btw, while I've got your attention... ;-), there's some code in
> arch/x86/xen/smp.c:xen_smp_prepare_boot_cpu() that should be looked
> at closer for zero-based per_cpu__gdt_page:
>
> make_lowmem_page_readwrite(&per_cpu__gdt_page);
>
> (I wasn't sure how to deal with this but I suspect the __percpu_offset[]
> or __per_cpu_load should be added to it.)

Already fixed in the mass of patches I posted yesterday. I turned it
into &per_cpu_var(gdt_page).

J

2008-07-09 19:07:50

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Well, yes, I agree these could be better if the fields were moved out of the pda
> structure itself, but then it won't be mechanical anymore and will require more
> review.

Yes, but they'll have more value. And if you do it as one variable per
patch, then it should be easy to bisect should any problems arise.

> But these are an important step because they allow us to get rid of the
> pda operations that do not exist for 32 bit.
>

No, they don't help at all, because they convert X_pda(Y) (which doesn't
exist on 32-bit) into x86_X_percpu(pda.Y) (which also doesn't exist there). They don't
remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over
the source, you may as well do the most useful thing in the first step.

J

2008-07-09 19:08:29

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> One more approach... ;-) Once the pda and percpu vars are in the same
> area, then could the pda be used for both 32 and 64...?
>

The i386 code works quite reliably thanks ;)

J

2008-07-09 19:13:37

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> No, they don't help at all, because they convert X_pda(Y) (which doesn't
> exist on 32-bit) into x86_X_percpu(pda.Y) (which also doesn't exist there). They don't
> remove any #ifdef CONFIG_X86_64's. If you're going to tromp all over
> the source, you may as well do the most useful thing in the first step.

Well they help in the sense that the patches get rid of the special X_pda(Y) operations.
x86_X_percpu will then exist under 32 bit and 64 bit.

What is remaining is the task to rename

pda.Y -> Z

in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation
is trivial if we just identify the corresponding 32 bit equivalent.

2008-07-09 19:28:51

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Mike Travis <[email protected]> wrote:

> * x86_64: Rebase per cpu variables to zero
>
> Take advantage of the zero-based per cpu area provided above. Then
> we can directly use the x86_32 percpu operations. x86_32 offsets
> %fs by __per_cpu_start. x86_64 has %gs pointing directly to the
> pda and the per cpu area thereby allowing access to the pda with
> the x86_64 pda operations and access to the per cpu variables
> using x86_32 percpu operations.

hm, have the binutils (or gcc) problems with this been resolved?

If common binutils versions miscompile the kernel with this feature then
i guess we cannot just unconditionally enable it. (My hope is that it's
not necessarily a binutils bug but some broken assumption of the kernel
somewhere.)

Ingo

2008-07-09 19:32:42

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Well they help in the sense that the patches get rid of the special X_pda(Y) operations.
> x86_X_percpu will then exist under 32 bit and 64 bit.
>
> What is remaining is the task to rename
>
> pda.Y -> Z
>
> in order to make variable references the same under both arches. Presumably the Z is the corresponding 32 bit variable. There are likely a number of cases where the transformation
> is trivial if we just identify the corresponding 32 bit equivalent.
>

Yes, I understand that, but it's still pointless churn. The
intermediate step is no improvement over what was there before, and
isn't any closer to the desired final result.

Once you've made the pda a percpu variable, and redefined all the X_pda
macros in terms of x86_X_percpu, then there's no need to touch all the
usage sites until you're *actually* going to unify something. Touching
them all just because you find "X_pda" unsightly doesn't help anyone.
Ideally every site you touch will remove a #ifdef CONFIG_X86_64, or make
two as-yet unified pieces of code closer to unification.

J

2008-07-09 19:34:45

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Mike Travis <[email protected]> wrote:

> > This fragility makes me very nervous. It seems hard enough to get
> > this stuff working with current tools; making it work over the whole
> > range of supported tools looks like it's going to be hard.
>
> (me too ;-)
>
> Once I get a solid version working with (at least) gcc-4.2.4, then
> regression testing with older tools will be easier, or at least a
> table of results can be produced.

the problem is, we cannot just put it even into tip/master if there's no
short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
me otherwise, for a series of thousands of randomly built kernels.

can we just leave out the zero-based percpu stuff safely and could i
test the rest of your series - or are there dependencies? I think
zero-based percpu, while nice in theory, is probably just a very small
positive effect so it's not a life or death issue. (or is there any
deeper, semantic reason why we'd want it?)

Ingo

2008-07-09 19:42:09

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Jeremy Fitzhardinge <[email protected]> wrote:

>> What is remaining is the task to rename
>>
>> pda.Y -> Z
>>
>> in order to make variable references the same under both arches.
>> Presumably the Z is the corresponding 32 bit variable. There are
>> likely a number of cases where the transformation is trivial if we
>> just identify the corresponding 32 bit equivalent.
>
> Yes, I understand that, but it's still pointless churn. The
> intermediate step is no improvement over what was there before, and
> isn't any closer to the desired final result.
>
> Once you've made the pda a percpu variable, and redefined all the
> X_pda macros in terms of x86_X_percpu, then there's no need to touch
> all the usage sites until you're *actually* going to unify something.
> Touching them all just because you find "X_pda" unsightly doesn't help
> anyone. Ideally every site you touch will remove a #ifdef
> CONFIG_X86_64, or make two as-yet unified pieces of code closer to
> unification.

that makes sense. Does everyone agree on #1-#2-#3 and then gradual
elimination of most pda members (without going through an intermediate
renaming of pda members) being the way to go?

Ingo

2008-07-09 19:45:05

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> Yes, I understand that, but it's still pointless churn. The
> intermediate step is no improvement over what was there before, and
> isn't any closer to the desired final result.

No, it's not pointless churn. We actually eliminate all the pda operations and use the per_cpu operations on both 32 and 64 bit. That is unification.

We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy.



2008-07-09 19:48:56

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for a series of thousands of randomly built kernels.
>

4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0
and 4.2.1 are known broken for the kernel.

-hpa

2008-07-09 19:49:20

by H. Peter Anvin

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
>
> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?
>

Works for me.

-hpa

2008-07-09 19:49:36

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> We would be glad if you could contribute the patches to get rid of the pda.xxx references. I do not think that either Mike or I have the 32 bit expertise needed to do that step. We went as far as we could. The patches are touching all the points of interest so locating the lines to fix should be easy.
>

Yes, I'd be happy to contribute.

J

2008-07-09 19:53:34

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:

> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?

I think we all agree on 1-2-3.

The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X replaced by the proper percpu names for 32 bit.

With Jeremy's approach we would be doing two steps at once (getting rid of pda ops plus unifying the variable names between 32 and 64 bit). Maybe more difficult to verify correctness. The removal of the pda ops is a pretty straightforward conversion.

2008-07-09 20:01:27

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Christoph Lameter <[email protected]> wrote:

> Ingo Molnar wrote:
>
> > that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> > elimination of most pda members (without going through an
> > intermediate renaming of pda members) being the way to go?
>
> I think we all agree on 1-2-3.
>
> The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X
> replaced by the proper percpu names for 32 bit.
>
> With Jeremy's approach we would be doing two steps at once (getting
> rid of pda ops plus unifying the variable names between 32 and 64
> bit). Maybe more difficult to verify correctness. The removal of the
> pda ops is a pretty straightforward conversion.

Yes, but there's nothing magic about pda variables versus percpu
variables. We should be able to do the pda -> unified step just as much
as we can do a percpu -> unified step. We can think of pda as a funky,
pre-percpu-era relic.

The only thing that percpu really offers over pda is its familiarity.
read_pda() has the per-cpu-ness embedded in it, which is nasty with
regard to tracking preemption properties, etc.

So converting to percpu would bring us CONFIG_DEBUG_PREEMPT=y checking
to those ex-pda variables. Today if a read_pda() (or anything but
pcurrent) is done in a non-preempt region that's likely a bug - but
nothing warns about it.
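
A sketch of the difference being described (illustrative variable names;
per_cpu() and smp_processor_id() are the stock kernel interfaces):

/* pda-style: per-cpu-ness is embedded in the %gs access, so
 * nothing checks whether preemption is disabled */
struct task_struct *t = read_pda(pcurrent);

/* generic percpu-style: with CONFIG_DEBUG_PREEMPT=y, the
 * smp_processor_id() underneath warns if called preemptible */
struct task_struct *t2 = per_cpu(current_task, smp_processor_id());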

So in that light 4-15 might make some sense in standardizing all these
accesses and making sure it all fits into an existing, familiar API
world, with no register level assumptions and assembly (and ABI) ties,
which is instrumented as well, with explicit smp_processor_id()
dependencies, etc.

Ingo

2008-07-09 20:02:16

by Eric W. Biederman

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


I just took a quick look at how stack_protector works on x86_64. Unless there is
some deep kernel magic that changes the segment register to %gs from the ABI-specified
%fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.

Further, -fstack-protector-all only seems to detect buffer overflows, and thus
corruption of the stack, not stack overflows. So it doesn't appear especially
useful.

So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure out
how to use a zero-based percpu area?

That should allow us to make the current pda a per cpu variable, and use %gs with
a large offset to access the per cpu area. And since it is only the per cpu accesses
and the pda accesses that will change, we should not need to fight toolchain issues
and other weirdness. The linked binary can remain the same.

Eric

2008-07-09 20:06:02

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> I just took a quick look at how stack_protector works on x86_64. Unless there is
> some deep kernel magic that changes the segment register to %gs from the ABI-specified
> %fs, CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>

-mcmodel=kernel switches it to using %gs.

> Further, -fstack-protector-all only seems to detect buffer overflows, and
> thus corruption of the stack, not stack overflows. So it doesn't appear especially
> useful.
>

It's a bit useful. But at the cost of preventing a pile of more useful
unification work, not to mention making all access to per-cpu variables
more expensive.

> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying to figure out
> how to use a zero-based percpu area?
>

Yes, please.

> That should allow us to make the current pda a per cpu variable, and use %gs with
> a large offset to access the per cpu area. And since it is only the per cpu accesses
> and the pda accesses that will change we should not need to fight toolchain issues
> and other weirdness. The linked binary can remain the same.
>

Yes, and it would be functionally identical to the 32-bit code.

J

2008-07-09 20:08:26

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Eric W. Biederman <[email protected]> wrote:

> I just took a quick look at how stack_protector works on x86_64.
> Unless there is some deep kernel magic that changes the segment
> register to %gs from the ABI-specified %fs, CC_STACKPROTECTOR is
> totally broken on x86_64. We access our pda through %gs.
>
> Further, -fstack-protector-all only seems to detect buffer
> overflows, and thus corruption of the stack, not stack overflows. So
> it doesn't appear especially useful.

CC_STACKPROTECTOR, as fixed in -tip, can catch the splice exploit, and
there's no known breakage in it.

Deep stack recursion itself is not really interesting. (as that cannot
arbitrarily be triggered by attackers in most cases) Random overflows of
buffers on the stackframe is a lot more common, and that's what
stackprotector protects against.

( Note that deep stack recursion can be caught via another debug
mechanism, ftrace's mcount approach. )

> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying
> to figure out how to use a zero-based percpu area?

Note that the zero-based percpu problems are completely unrelated to
stackprotector. I was able to hit them with a stackprotector-disabled
gcc-4.2.3 environment.

Ingo

2008-07-09 20:10:07

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> * Christoph Lameter <[email protected]> wrote:
>
>
>> Ingo Molnar wrote:
>>
>>
>>> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
>>> elimination of most pda members (without going through an
>>> intermediate renaming of pda members) being the way to go?
>>>
>> I think we all agree on 1-2-3.
>>
>> The rest is TBD. Hope Jeremy can add his wisdom there to get the pda.X
>> replaced by the proper percpu names for 32 bit.
>>
>> With Jeremy's approach we would be doing two steps at once (getting
>> rid of pda ops plus unifying the variable names between 32 and 64
>> bit). Maybe more difficult to verify correctness. The removal of the
>> pda ops is a pretty straighforward conversion.
>>
>
> Yes, but there's nothing magic about pda variables versus percpu
> variables. We should be able to do the pda -> unified step just as much
> as we can do a percpu -> unified step. We can think of pda as a funky,
> pre-percpu-era relic.
>
> The only thing that percpu really offers over pda is its familiarity.
> read_pda() has the per-cpu-ness embedded in it, which is nasty with
> regard to tracking preemption properties, etc.
>
> So converting to percpu would bring us CONFIG_DEBUG_PREEMPT=y checking
> to those ex-pda variables. Today if a read_pda() (or anything but
> pcurrent) is done in a non-preempt region that's likely a bug - but
> nothing warns about it.
>
> So in that light 4-15 might make some sense in standardizing all these
> accesses and making sure it all fits into an existing, familiar API
> world, with no register level assumptions and assembly (and ABI) ties,
> which is instrumented as well, with explicit smp_processor_id()
> dependencies, etc.
>

Yeah, but doing

#define read_pda(x) x86_read_percpu(x)

gives you all that anyway. Though because x86_X_percpu and X_pda are
guaranteed to be atomic with respect to preemption, it's actually
reasonable to use them with preemption enabled.

J

2008-07-09 20:11:31

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> Note that the zero-based percpu problems are completely unrelated to
> stackprotector. I was able to hit them with a stackprotector-disabled
> gcc-4.2.3 environment.

The only reason we need to keep a zero-based pda is to support
stack-protector. If we drop it, we can drop the pda - and its
special zero-based properties - entirely.

J

2008-07-09 20:14:58

by Arjan van de Ven

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

On Wed, 09 Jul 2008 13:00:19 -0700
[email protected] (Eric W. Biederman) wrote:

>
> I just took a quick look at how stack_protector works on x86_64.
> Unless there is some deep kernel magic that changes the segment
> register to %gs from the ABI-specified %fs, CC_STACKPROTECTOR is
> totally broken on x86_64. We access our pda through %gs.

and so does gcc in kernel mode.

>
> Further, -fstack-protector-all only seems to detect buffer
> overflows, and thus corruption of the stack, not stack overflows. So
> it doesn't appear especially useful.

stopping buffer overflows and other return address corruption is not
useful? Excuse me?

>
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR and stop trying
> to figure out how to use a zero-based percpu area?

So why don't we NOT do that and fix instead what you're trying to do?

>
> That should allow us to make the current pda a per cpu variable, and
> use %gs with a large offset to access the per cpu area.

and what does that gain us?


--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 20:15:30

by Ingo Molnar

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Jeremy Fitzhardinge <[email protected]> wrote:

>> Further -fstack-protector-all only seems to detect against buffer
>> overflows and thus corruption of the stack. Not stack overflows. So
>> it doesn't appear especially useful.
>
> It's a bit useful. But at the cost of preventing a pile of more
> useful unification work, not to mention making all access to per-cpu
> variables more expensive.

well, stackprotector is near zero maintenance trouble. It mostly binds
in places that are fundamentally non-unifiable anyway. (nobody is going
to unify the assembly code in switch_to())

i had zero-based percpu problems (early crashes) with a 4.2.3 gcc that
had -fstack-protector support compiled out, so there's no connection there.

In its fixed form in tip/core/stackprotector it can catch the splice
exploit which makes it quite a bit useful. It would be rather silly to
not offer that feature.

Ingo

2008-07-09 20:18:46

by Christoph Lameter

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
>> Note that the zero-based percpu problems are completely unrelated to
>> stackprotector. I was able to hit them with a stackprotector-disabled
>> gcc-4.2.3 environment.
>
> The only reason we need to keep a zero-based pda is to support
> stack-protector. If we drop it, we can drop the pda - and its
> special zero-based properties - entirely.


Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range in turn allows limiting the size of the generated instructions, because the offset is encoded in the instruction. It is also easier to handle, since __per_cpu_start does not figure
in the calculation of the offsets.

2008-07-09 20:27:46

by Adrian Bunk

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

On Wed, Jul 09, 2008 at 03:44:51PM -0400, H. Peter Anvin wrote:
> Ingo Molnar wrote:
>>
>> the problem is, we cannot just put it even into tip/master if there's
>> no short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid
>> for me otherwise, for a series of thousands of randomly built kernels.
>>
>
> 4.2.3 is fine; he was using 4.2.0 before, and as far as I know, 4.2.0
> and 4.2.1 are known broken for the kernel.

Not sure where your knowledge comes from, but the ones I try to get
blacklisted due to known gcc bugs are 4.1.0 and 4.1.1.

On a larger picture, we officially support gcc >= 3.2, and if any kernel
change triggers a bug with e.g. gcc 3.2.3 that's technically a
regression in the kernel...

> -hpa

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2008-07-09 20:32:13

by Eric W. Biederman

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

> It's just the stack canary. It isn't library accesses; it's the code gcc
> generates:
>
> foo: subq $152, %rsp
> movq %gs:40, %rax
> movq %rax, 136(%rsp)
> ...
> movq 136(%rsp), %rdx
> xorq %gs:40, %rdx
> je .L3
> call __stack_chk_fail
> .L3:
> addq $152, %rsp
> .p2align 4,,4
> ret
>
>
> There are two irritating things here:
>
> One is that the kernel supports -fstack-protector for x86-64, which forces us
> into all these contortions in the first place. We don't support stack-protector
> for 32-bit (gcc does), and things are much easier.

How does gcc know to use %gs instead of the usual %fs for accessing
the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs.

> The other somewhat orthogonal irritation is the fixed "40". If they'd generated
> %gs:__gcc_stack_canary, then we could alias that to a per-cpu variable like
> anything else and the whole problem would go away - and we could support
> stack-protector on 32-bit with no problems (and normal usermode could define
> __gcc_stack_canary to be a weak symbol with value "40" (20 on 32-bit) for
> backwards compatibility).
>
> I'm close to proposing that we run a post-processor over the generated assembly
> to perform the %gs:40 -> %gs:__gcc_stack_canary transformation and deal with it
> that way.

Or we could do something completely evil. And use the other segment
register for the stack canary.

I think the unification is valid and useful, and that trying to keep
that stupid stack canary working is currently more trouble then it is
worth.

Eric

2008-07-09 20:33:59

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Ingo Molnar wrote:
>>
>>> Note that the zero-based percpu problems are completely unrelated to
>>> stackprotector. I was able to hit them with a stackprotector-disabled
>>> gcc-4.2.3 environment.
>>>
>> The only reason we need to keep a zero-based pda is to support
>> stack-protector. If we drop it, we can drop the pda - and its
>> special zero-based properties - entirely.
>>
>
>
> Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range in turn allows limiting the size of the generated instructions, because the offset is encoded in the instruction.

No, it makes no difference. %gs:X always has a 32-bit offset in the
instruction, regardless of how big X is:

mov %eax, %gs:0
mov %eax, %gs:0x1234567
->
0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567


> It is also easier to handle, since __per_cpu_start does not figure
> in the calculation of the offsets.
>

No, you do it the same as i386. You set the segment base to be
percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
directly. You can use rip-relative addressing to make it a smaller
addressing mode too:

0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
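
(For completeness, a sketch of the segment-base setup that implies, using
the stock wrmsrl() accessor; percpu_area here stands in for this CPU's
allocated percpu block and is an assumed name:)

/* each CPU's %gs base is biased so that a percpu symbol's
 * link-time address resolves to that CPU's copy */
wrmsrl(MSR_GS_BASE,
       (unsigned long)percpu_area - (unsigned long)__per_cpu_start);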


J

2008-07-09 20:36:24

by Jeremy Fitzhardinge

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> How does gcc know to use %gs instead of the usual %fs for accessing
> the stack protector variable? My older gcc-4.1.x on ubuntu always uses %fs.
>

-mcmodel=kernel

> Or we could do something completely evil. And use the other segment
> register for the stack canary.
>

That would still require gcc changes, so it doesn't help much.

J

2008-07-09 20:40:56

by Arjan van de Ven

Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

On Wed, 09 Jul 2008 13:11:03 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
> > Note that the zero-based percpu problems are completely unrelated
> > to stackprotector. I was able to hit them with a
> > stackprotector-disabled gcc-4.2.3 environment.
>
> The only reason we need to keep a zero-based pda is to support
> stack-protector. If we drop it, we can drop the pda - and its
> special zero-based properties - entirely.

what's wrong with zero based btw?

do they stop us from using gcc's __thread keyword for per cpu variables
or something? (*that* would be a nice feature)

or does it stop us from putting the per cpu variables starting from
offset 4096 onwards?

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 20:42:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Arjan van de Ven <[email protected]> writes:

> On Wed, 09 Jul 2008 13:00:19 -0700
> [email protected] (Eric W. Biederman) wrote:
>
>>
>> I just took a quick look at how stack_protector works on x86_64.
>> Unless there is some deep kernel magic that changes the segment
>> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
>> totally broken on x86_64. We access our pda through %gs.
>
> and so does gcc in kernel mode.

Some gcc's in kernel mode. The one I tested with doesn't.

>> Further -fstack-protector-all only seems to detect against buffer
>> overflows and thus corruption of the stack. Not stack overflows. So
>> it doesn't appear especially useful.
>
> stopping buffer overflows and other return address corruption is not
> useful? Excuse me?

Stopping buffer overflows and return address corruption is useful. Simply
catching and panicking the machine when they occur is less useful. We aren't
perfect but we have a pretty good track record of handling this with
old-fashioned methods.

>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying
>> to figure out how to use a zero based percpu area.
>
> So why don't we NOT do that and fix instead what you're trying to do?

So our choices are:
fix -fstack-protector to not use a hard-coded offset.
fix gcc/ld to not miscompile the kernel at random times in a way that prevents us
from booting when we add a segment with an address at 0.

-fstack-protector does not use the TLS ABI and instead uses nasty hard-coded magic,
and that is why it is a problem. Otherwise we could easily support it.

>> That should allow us to make the current pda a per cpu variable, and
>> use %gs with a large offset to access the per cpu area.
>
> and what does that gain us?

A faster more maintainable kernel.

Eric

2008-07-09 20:43:31

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>
> Another reason to use a zero based per cpu area is to limit the offset range. Limiting the offset range in turn allows limiting the size of the generated instructions, because the offset is part of the instruction. It also is easier to handle since __per_cpu_start does not figure
> in the calculation of the offsets.
>

No, that makes no difference. There is no short-offset form that
doesn't involve a register (ignoring the 16-bit 67h form on 32 bits.)

For 64 bits, you want to keep the offsets within %rip±2 GB, or you will
have relocation overflows for %rip-based forms; for absolute forms you
have to be in the range 0-4 GB. The %rip-based forms are shorter, and
I'm pretty sure they're the ones we currently generate. Since we base
the kernel at 0xffffffff80000000 (-2 GB) this means a zero-based offset
is actively wrong, and only works by accident (since the first
CONFIG_PHYSICAL_START of that space is unused.)

-hpa

2008-07-09 20:47:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Arjan van de Ven wrote:
> what's wrong with zero based btw?
>

Nothing in principle. In practice it's triggering an amazing variety of
toolchain bugs.

> do they stop us from using gcc's __thread keyword for per cpu variables
> or something? (*that* would be a nice feature)
>

The powerpc guys tried it, and it doesn't work. per-cpu is not
semantically equivalent to per-thread. If you have a function in which
you refer to a percpu variable and then have a preemptable section in
the middle followed by another reference to the same percpu variable,
it's hard to stop gcc from caching a reference to the old tls variable,
even though we may have switched cpus in the meantime.
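
To make the hazard concrete, here is a hypothetical sketch (mine, not from
any posted patch) of the pattern that bites:

	/* Sketch only: suppose a percpu counter were a plain TLS variable. */
	__thread int my_counter;

	void example(void)
	{
		my_counter++;	/* gcc computes this cpu's &my_counter
				 * and may cache it in a register	*/
		cond_resched();	/* preemption point - we may migrate	*/
		my_counter++;	/* the cached address can legally be
				 * reused, but it is the old cpu's copy	*/
	}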

Also, we explicitly use the other segment register in kernel mode, to
avoid segment register switches where possible. Even with
-mcmodel=kernel, gcc generates %fs references to tls variables.

J

2008-07-09 20:48:34

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Thinking about this some more, I don't know if it would make sense to
> put the x86-64 stack canary at the *end* of the percpu area, and
> otherwise use negative offsets. That would make sure they were
> readily reachable from %rip-based references from within the kernel
> text area.

If we can move the canary then a whole pile of options open up. But the
problem is that we can't.

J

2008-07-09 20:49:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
>
> No, you do it the same as i386. You set the segment base to be
> percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
> directly. You can use rip-relative addressing to make it a smaller
> addressing mode too:
>
> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>

Thinking about this some more, I don't know if it would make sense to
put the x86-64 stack canary at the *end* of the percpu area, and
otherwise use negative offsets. That would make sure they were readily
reachable from %rip-based references from within the kernel text area.

-hpa

2008-07-09 20:51:09

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> 1. it means pda references are invalid if their offsets are ever more
> than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)

Why?

As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE
0 - putting the percpu variables as the first thing in the kernel - and
relocating on load? That would avoid having to make a special PT_LOAD
segment at 0. Hm, would that result in the pda and the boot params
getting mushed together?

J

2008-07-09 20:51:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Arjan van de Ven wrote:
> On Wed, 09 Jul 2008 13:11:03 -0700
> Jeremy Fitzhardinge <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> Note that the zero-based percpu problems are completely unrelated
>>> to stackprotector. I was able to hit them with a
>>> stackprotector-disabled gcc-4.2.3 environment.
>> The only reason we need to keep a zero-based pda is to support
>> stack-protector. If we drop drop it, we can drop the pda - and its
>> special zero-based properties - entirely.
>
> what's wrong with zero based btw?
>

Two problems:

1. it means pda references are invalid if their offsets are ever more
than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)

2. some vague hints of a linker bug.

-hpa

2008-07-09 20:55:26

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>> * x86_64: Rebase per cpu variables to zero
>>
>> Take advantage of the zero-based per cpu area provided above. Then
>> we can directly use the x86_32 percpu operations. x86_32 offsets
>> %fs by __per_cpu_start. x86_64 has %gs pointing directly to the
>> pda and the per cpu area thereby allowing access to the pda with
>> the x86_64 pda operations and access to the per cpu variables
>> using x86_32 percpu operations.
>
> hm, have the binutils (or gcc) problems with this been resolved?
>
> If common binutils versions miscompile the kernel with this feature then
> i guess we cannot just unconditionally enable it. (My hope is that it's
> not necessarily a binutils bug but some broken assumption of the kernel
> somewhere.)
>
> Ingo

Currently I'm using gcc-4.2.4. Which are you using?

I labeled it "RFC" as it does not quite work without THREAD_ORDER bumped to 2.
And I believe the stack overflow is happening because of some interrupt
routine as it does not happen on our simulator.

After that is taken care of, I'll start regression testing earlier compilers.
I think someone mentioned that gcc-2.something was the minimum required...?

Thanks,
Mike

2008-07-09 21:01:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Eric W. Biederman <[email protected]> wrote:

> Arjan van de Ven <[email protected]> writes:
>
> > On Wed, 09 Jul 2008 13:00:19 -0700
> > [email protected] (Eric W. Biederman) wrote:
> >
> >>
> >> I just took a quick look at how stack_protector works on x86_64.
> >> Unless there is some deep kernel magic that changes the segment
> >> register to %gs from the ABI specified %fs CC_STACKPROTECTOR is
> >> totally broken on x86_64. We access our pda through %gs.
> >
> > and so does gcc in kernel mode.
>
> Some gcc's in kernel mode. The one I tested with doesn't.

yes - stackprotector-enabled distros build with a kernel-mode enabled gcc.

> >> Further -fstack-protector-all only seems to detect against buffer
> >> overflows and thus corruption of the stack. Not stack overflows.
> >> So it doesn't appear especially useful.
> >
> > stopping buffer overflows and other return address corruption is not
> > useful? Excuse me?
>
> Stopping buffer overflows and return address corruption is useful.
> Simply catching and panicking the machine when they occur is less
> useful. [...]

why? I personally prefer an informative panic in an overflow-suspect
piece of code instead of a guest root on my machine.

I think you miss one of the fundamental security aspects here. The panic
is not there just to inform the administrator - although it certainly
has such a role.

It is mainly there to _deter_ attackers from experimenting with certain
exploits.

For the more sophisticated attackers (not the script kiddies - the ones
who can do serious economic harm) their exploits and their attack
vectors are their main assets. They want their exploits to work on the
next target as well, and they want to be as stealthy as possible.

For a script kiddie crashing a box is not a big issue - they work with
public exploits.

This means that the serious attempts will only use an attack if its
effects are 100% deterministic - they won't risk something like a 50%/50%
chance of a crash (or even a 10% chance of a crash). Some of the most
sophisticated kernel exploits i've seen had like 80% of their code
complexity in making sure that they don't crash the target box. Their code
was more resilient than a lot of kernel code we have.

> [...] We aren't perfect but we have a pretty good track record of
> handling this with old fashioned methods.

That's your opinion. A valid counter point is that more layers of
defense, in a fundamentally fragile area (buffers on the stack, return
addresses), cannot hurt. If you've got a firewall that is only 10% busy
even under peak load it's a valid option to spend some CPU cycles on
preventive measures.

A firewall _itself_ is huge overhead already - so there's absolutely no
valid technical reason to forbid a firewall from having something like
stackprotector built into its kernel. (and into most of its userspace)

We could have caught the vsplice exploit as well with stackprotector -
but our security QA was not strong enough to keep it from slowly
regressing (without anyone noticing). That's fixed now in
tip/core/stackprotector.

Ingo

2008-07-09 21:02:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

>> Or we could do something completely evil. And use the other segment
>> register for the stack canary.
>>
>
> That would still require gcc changes, so it doesn't help much.

We could use %fs for the per cpu variables. Then we could set %gs to whatever
we wanted to sync up with gcc.

Eric

2008-07-09 21:03:57

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>>> This fragility makes me very nervous. It seems hard enough to get
>>> this stuff working with current tools; making it work over the whole
>>> range of supported tools looks like its going to be hard.
>> (me too ;-)
>>
>> Once I get a solid version working with (at least) gcc-4.2.4, then
>> regression testing with older tools will be easier, or at least a
>> table of results can be produced.
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for a series of thousands of randomly built kernels.

Great, I'll request we load gcc-4.2.3 on our devel server.

>
> can we just leave out the zero-based percpu stuff safely and could i
> test the rest of your series - or are there dependencies? I think
> zero-based percpu, while nice in theory, is probably just a very small
> positive effect so it's not a life or death issue. (or is there any
> deeper, semantic reason why we'd want it?)

I sort of assumed that zero-based would not make it into 2.6.26-rcX,
and no, reaching 4096 cpus does not require it. The other patches
I've been submitting are more general and will fix possible panics
(like stack overflows, etc.)

Thanks,
Mike

2008-07-09 21:04:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Eric W. Biederman <[email protected]> wrote:

> Jeremy Fitzhardinge <[email protected]> writes:
>
> >> Or we could do something completely evil. And use the other segment
> >> register for the stack canary.
> >>
> >
> > That would still require gcc changes, so it doesn't help much.
>
> We could use %fs for the per cpu variables. Then we could set %gs to
> whatever we wanted to sync up with gcc.

one problem is, there's no SWAPFS instruction.
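
(For context, a rough sketch of the entry-path idiom that depends on it,
simplified from the system_call path in entry_64.S:)

	swapgs				/* user %gs base <-> kernel %gs base */
	movq	%rsp, %gs:pda_oldrsp	/* pda/percpu accesses now work */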

Ingo

2008-07-09 21:05:44

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>>> What is remaining is the task to rename
>>>
>>> pda.Y -> Z
>>>
>>> in order to make variable references the same under both arches.
>>> Presumably the Z is the corresponding 32 bit variable. There are
>>> likely a number of cases where the transformation is trivial if we
>>> just identify the corresponding 32 bit equivalent.
>> Yes, I understand that, but it's still pointless churn. The
>> intermediate step is no improvement over what was there before, and
>> isn't any closer to the desired final result.
>>
>> Once you've made the pda a percpu variable, and redefined all the
>> X_pda macros in terms of x86_X_percpu, then there's no need to touch
>> all the usage sites until you're *actually* going to unify something.
>> Touching them all just because you find "X_pda" unsightly doesn't help
>> anyone. Ideally every site you touch will remove a #ifdef
>> CONFIG_X86_64, or make two as-yet unified pieces of code closer to
>> unification.
>
> that makes sense. Does everyone agree on #1-#2-#3 and then gradual
> elimination of most pda members (without going through an intermediate
> renaming of pda members) being the way to go?
>
> Ingo


This is fine with me... not much more work required to go "all the way"... ;-)

2008-07-09 21:11:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

On Wed, 09 Jul 2008 13:22:06 -0700
[email protected] (Eric W. Biederman) wrote:

> Jeremy Fitzhardinge <[email protected]> writes:
>
> > It's just the stack canary. It isn't library accesses; it's the
> > code gcc generates:
> >
> > foo:	subq	$152, %rsp
> > 	movq	%gs:40, %rax
> > 	movq	%rax, 136(%rsp)
> > 	...
> > 	movq	136(%rsp), %rdx
> > 	xorq	%gs:40, %rdx
> > 	je	.L3
> > 	call	__stack_chk_fail
> > .L3:
> > 	addq	$152, %rsp
> > 	.p2align 4,,4
> > 	ret
> >
> >
> > There are two irritating things here:
> >
> > One is that the kernel supports -fstack-protector for x86-64, which
> > forces us into all these contortions in the first place. We don't
> > support stack-protector for 32-bit (gcc does), and things are much
> > easier.
>
> How does gcc know to use %gs instead of the usual %fs for accessing
> the stack protector variable? My older gcc-4.1.x on ubuntu always
> uses %fs.

ubuntu broke gcc (they don't want to have compiler flags per package, so
they patch stuff into gcc instead).


> I think the unification is valid and useful, and that trying to keep
> that stupid stack canary working is currently more trouble than it is
> worth.

I think that "unification over everything" is stupid, especially if it
removes useful features.



--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-09 21:12:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

> H. Peter Anvin wrote:
>> Thinking about this some more, I don't know if it would make sense to put the
>> x86-64 stack canary at the *end* of the percpu area, and otherwise use
>> negative offsets. That would make sure they were readily reachable from
>> %rip-based references from within the kernel text area.
>
> If we can move the canary then a whole pile of options open up. But the problem
> is that we can't.

But we can pick an arbitrary point for %gs to point at.

Hmm. This whole thing is even sillier than I thought.
Why can't we access per cpu vars as:
%gs:(per_cpu__var - __per_cpu_start) ?

If we can subtract constants and allow the linker to perform that resolution
at link time, a zero based per cpu segment becomes a moot issue.

We may need to change the definition of PERCPU in vmlinux.lds.h to:

 #define PERCPU(align)						\
 	. = ALIGN(align);					\
-	__per_cpu_start = .;					\
 	.data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) {	\
+		__per_cpu_start = .;				\
 		*(.data.percpu)					\
 		*(.data.percpu.shared_aligned)			\
+		__per_cpu_end = .;				\
+	}
-	}							\
-	__per_cpu_end = .;


That way the linker knows __per_cpu_start and __per_cpu_end are in the same
section. Otherwise it sounds entirely reasonable - just slightly trickier
math at link time.

Eric

2008-07-09 21:13:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Mike Travis <[email protected]> wrote:

> After that is taken care of, I'll start regression testing earlier
> compilers. I think someone mentioned that gcc-2.something was the
> minimum required...?

i think the current official minimum is around gcc-3.2 [2.x is out of the
question because we have a few feature dependencies on gcc-3.x] - but i
stopped using it because it miscompiles the kernel so often. 4.0 was
really bad due to large stack footprint. The 4.3.x series miscompiles
the kernel too in certain situations - there was a high-rising
kerneloops.org crash recently in ext3.

So in general, 'too new' is bad because it has new regressions, 'too
old' is bad because it has unfixed old regressions. Somewhere in the
middle, 4.2.x-ish, seems to be pretty robust in practice.

Ingo

2008-07-09 21:19:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> 1. it means pda references are invalid if their offsets are ever more
>> than CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
>
> Why?
>
> As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE
> 0 - putting the percpu variables as the first thing in the kernel - and
> relocating on load? That would avoid having to make a special PT_LOAD
> segment at 0. Hm, would that result in the pda and the boot params
> getting mushed together?
>

CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically
we should make it 16 MB by default (currently 2 MB), to keep the DMA
zone clear.

Either way, I really suspect that the right thing to do is to use
negative offsets, with the possible exception of a handful of things (40
bytes or less, perhaps like current) which can get small positive
offsets and end up in the "super hot" cacheline.

The sucky part is that I don't believe GNU ld has native support for a
"hanging down" section (one which has a fixed endpoint rather than a
starting point), so it requires extra magic around the link (or finding
some way to do it with linker script functions.) Let me see if I can
cook up something in linker script that would actually work.

-hpa

2008-07-09 21:20:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>>> Or we could do something completely evil. And use the other segment
>>> register for the stack canary.
>>>
>> That would still require gcc changes, so it doesn't help much.
>
> We could use %fs for the per cpu variables. Then we could set %gs to whatever
> we wanted to sync up with gcc.

No swapfs instruction, and extra performance penalty because %fs is used
in userspace.

-hpa

2008-07-09 21:21:21

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> But we can pick an arbitrary point for %gs to point at.
>
> Hmm. This whole thing is even sillier than I thought.
> Why can't we access per cpu vars as:
> %gs:(per_cpu__var - __per_cpu_start) ?
>

Because there's no linker reloc for doing subtraction (or addition) of
two symbols.

> If we can subtract constants and allow the linker to perform that resolution
> at link time, a zero based per cpu segment becomes a moot issue.
>

They're not constants; they're symbols.

J

2008-07-09 21:23:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
>
> But we can pick an arbitrary point for %gs to point at.
>
> Hmm. This whole thing is even sillier than I thought.
> Why can't we access per cpu vars as:
> %gs:(per_cpu__var - __per_cpu_start) ?
>
> If we can subtract constants and allow the linker to perform that resolution
> at link time, a zero based per cpu segment becomes a moot issue.
>

And then we're back here again!

Supposedly the linker buggers up, although we don't have conclusive
evidence...

-hpa

2008-07-09 21:24:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>
>>> This fragility makes me very nervous. It seems hard enough to get
>>> this stuff working with current tools; making it work over the whole
>>> range of supported tools looks like its going to be hard.
>>>
>> (me too ;-)
>>
>> Once I get a solid version working with (at least) gcc-4.2.4, then
>> regression testing with older tools will be easier, or at least a
>> table of results can be produced.
>>
>
> the problem is, we cannot just put it even into tip/master if there's no
> short-term hope of fixing a problem it triggers. gcc-4.2.3 is solid for
> me otherwise, for a series of thousands of randomly built kernels.
>
> can we just leave out the zero-based percpu stuff safely and could i
> test the rest of your series - or are there dependencies? I think
> zero-based percpu, while nice in theory, is probably just a very small
> positive effect so it's not a life or death issue. (or is there any
> deeper, semantic reason why we'd want it?)
>

I'm looking forward to using it, because I can make the Xen vcpu
structure a percpu variable shared with the hypervisor. This means
something like a disable interrupt becomes a simple "movb
$1,%gs:per_cpu__xen_vcpu_event_mask". If access to percpu variables is
indirect (ie, two instructions) I need to disable preemption which makes
the whole thing much more complex, and too big to inline. There are
other cases where preemption-safe access to percpu variables is useful
as well.

My view, which is admittedly very one-sided, is that all this brokenness
is forced on us by gcc's stack-protector brokenness. My preferred
approach would be to fix -fstack-protector by eliminating the
requirement for small offsets from %gs. With that in place we could
support it without needing a pda. In the meantime, we could either
support stack-protector or direct access to percpu variables. Either
way, we don't need to worry about making zero-based percpu work.

J

2008-07-09 21:26:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> No, it makes no difference. %gs:X always has a 32-bit offset in the
> instruction, regardless of how big X is:
>
> mov %eax, %gs:0
> mov %eax, %gs:0x1234567
> ->
> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567

The processor itself supports smaller offsets.

Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.


>> It also is easier to handle since __per_cpu_start does not figure
>> in the calculation of the offsets.
>>
>
> No, you do it the same as i386. You set the segment base to be
> percpu_area-__per_cpu_start, and then just refer to %gs:per_cpu__foo
> directly. You can use rip-relative addressing to make it a smaller
> addressing mode too:
>
> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7

RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.

2008-07-09 21:26:47

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Either way, I really suspect that the right thing to do is to use
> negative offsets, with the possible exception of a handful of things
> (40 bytes or less, perhaps like current) which can get small positive
> offsets and end up in the "super hot" cacheline.
>
> The sucky part is that I don't believe GNU ld has native support for a
> "hanging down" section (one which has a fixed endpoint rather than a
> starting point), so it requires extra magic around the link (or
> finding some way to do it with linker script functions.)

If you're going to do another linker pass, you could have a script to
extract all the percpu symbols and generate a set of derived zero-based
ones and then link against that.

Or generate a vmlinux with relocations and "relocate" all the percpu
symbols down to 0.

J

2008-07-09 21:39:51

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> I just took a quick look at how stack_protector works on x86_64. Unless there is
> some deep kernel magic that changes the segment register to %gs from the ABI specified
> %fs CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>
> Further -fstack-protector-all only seems to detect against buffer overflows and
> thus corruption of the stack. Not stack overflows. So it doesn't appear especially
> useful.
>
> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying to figure out
> how to use a zero based percpu area.
>
> That should allow us to make the current pda a per cpu variable, and use %gs with
> a large offset to access the per cpu area. And since it is only the per cpu accesses
> and the pda accesses that will change we should not need to fight toolchain issues
> and other weirdness. The linked binary can remain the same.
>
> Eric

Hi Eric,

There is one pda op that I was not able to remove. Most likely it can be recoded
but it was a bit over my expertise. Most likely the "pda_offset(field)" can be
replaced with "per_cpu_var(field)" [per_cpu__##field], but I wasn't sure about
"_proxy_pda.field".

include/asm-x86/pda.h:

/*
 * This is not atomic against other CPUs -- CPU preemption needs to be off
 * NOTE: This relies on the fact that the cpu_pda is the *first* field in
 * the per cpu area.  Move it and you'll need to change this.
 */
#define test_and_clear_bit_pda(bit, field)				\
({									\
	int old__;							\
	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
		     : "=r" (old__), "+m" (_proxy_pda.field)		\
		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
	old__;								\
})

And there is only one reference to it.

arch/x86/kernel/process_64.c:

static void __exit_idle(void)
{
	if (test_and_clear_bit_pda(0, isidle) == 0)
		return;
	atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
}

Thanks,
Mike

2008-07-09 21:41:35

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> No, it makes no difference. %gs:X always has a 32-bit offset in the
>> instruction, regardless of how big X is:
>>
>> mov %eax, %gs:0
>> mov %eax, %gs:0x1234567
>> ->
>> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
>> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
>>
>
> The processor itself supports smaller offsets.
>

Not in 64-bit mode. In 32-bit mode you can use the addr16 prefix, but
that would only save a byte per use (and I doubt it's a fast-path in the
processor).

> Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.
>

No. The %gs base is a full 64-bit value you can put anywhere in the
address space. So long as your percpu data is within 2G of that point
you can get to it directly.
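
(For illustration - wrmsrl() and MSR_GS_BASE are the real interfaces, though
this exact call is a sketch of mine rather than a quote from the patchset:)

	/* The %gs base is set by a full 64-bit MSR write; only the
	 * instruction-encoded offsets are limited to 32 bits. */
	wrmsrl(MSR_GS_BASE, (unsigned long)cpu_pda(cpu));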


>> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>>
>
> RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.
>

It means the percpu symbols must be within 2G of your code. We can't
compile the kernel any other way (there's no -mcmodel=large-kernel).

J

2008-07-09 21:43:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> No, it makes no difference. %gs:X always has a 32-bit offset in the
>> instruction, regardless of how big X is:
>>
>> mov %eax, %gs:0
>> mov %eax, %gs:0x1234567
>> ->
>> 0: 65 89 04 25 00 00 00 00 mov %eax,%gs:0x0
>> 8: 65 89 04 25 67 45 23 01 mov %eax,%gs:0x1234567
>
> The processor itself supports smaller offsets.

No, it doesn't, unless you have a base register. There is no naked
disp8 form, and disp16 is only available in 16- or 32-bit mode (and in
32-bit form it requires a 67h prefix.)

> Note also that the 32 bit offset size limits the offset that can be added to the segment register. You need to place the per cpu area either in the last 2G of the address space or in the first 2G. The zero based approach removes that limitation.

The offset is either ±2 GB from the segment register, or ±2 GB from the
segment register plus %rip. The latter is more efficient.

The processor *does* permit a 64-bit absolute form, which can be used
with a segment register, but that one is hideously restricted (only moves
to/from %rax) and bloated (10 bytes!)

>> 0: 65 89 05 00 00 00 00 mov %eax,%gs:0(%rip) # 0x7
>
> RIP relative also implies a 32 bit offset meaning that the code cannot be more than 2G away from the per cpu area.

Not from the per cpu area, but from the linked address of the per cpu
area (the segment register base can point anywhere.)

In our case that means between -2 GB and a smallish positive value (I believe
it is guaranteed to be 2 MB or more.)

Being able to use %rip-relative forms would save a byte per reference,
which is valuable.

-hpa

2008-07-09 21:44:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
>
> If you're going to do another linker pass, you could have a script to
> extract all the percpu symbols and generate a set of derived zero-based
> ones and then link against that.
>
> Or generate a vmlinux with relocations and "relocate" all the percpu
> symbols down to 0.
>

Yeah, I'd hate to have to go to either of those lengths though.

-hpa

2008-07-09 21:48:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Eric W. Biederman wrote:
>
>> I just took a quick look at how stack_protector works on x86_64. Unless there is
>> some deep kernel magic that changes the segment register to %gs from the ABI specified
>> %fs CC_STACKPROTECTOR is totally broken on x86_64. We access our pda through %gs.
>>
>> Further -fstack-protector-all only seems to detect against buffer overflows and
>> thus corruption of the stack. Not stack overflows. So it doesn't appear especially
>> useful.
>>
>> So why don't we kill the broken CONFIG_CC_STACKPROTECTOR. Stop trying to figure out
>> how to use a zero based percpu area.
>>
>> That should allow us to make the current pda a per cpu variable, and use %gs with
>> a large offset to access the per cpu area. And since it is only the per cpu accesses
>> and the pda accesses that will change we should not need to fight toolchain issues
>> and other weirdness. The linked binary can remain the same.
>>
>> Eric
>>
>
> Hi Eric,
>
> There is one pda op that I was not able to remove. Most likely it can be recoded
> but it was a bit over my expertise. Most likely the "pda_offset(field)" can be
> replaced with "per_cpu_var(field)" [per_cpu__##field], but I wasn't sure about
> "_proxy_pda.field".
>
> include/asm-x86/pda.h:
>
> /*
>  * This is not atomic against other CPUs -- CPU preemption needs to be off
>  * NOTE: This relies on the fact that the cpu_pda is the *first* field in
>  * the per cpu area.  Move it and you'll need to change this.
>  */
> #define test_and_clear_bit_pda(bit, field)				\
> ({									\
> 	int old__;							\
> 	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
> 		     : "=r" (old__), "+m" (_proxy_pda.field)		\
> 		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
>
asm volatile("btr %2,%%gs:%1\n\tsbbl %0,%0" \
: "=r" (old__), "+m" (per_cpu_var(var)) \
: "dIr" (bit) : "memory");\


but it barely seems worthwhile if we really can't use test_and_clear_bit.
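
For comparison, a minimal sketch (mine, assuming isidle has become an
ordinary DEFINE_PER_CPU variable and preemption is already off at the call
site) of just using the generic bitop:

	if (test_and_clear_bit(0, &__get_cpu_var(isidle)) == 0)
		return;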

J

2008-07-09 22:02:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis <[email protected]> writes:

> Hi Eric,
>
> There is one pda op that I was not able to remove. Most likely it can be recoded
> but it was a bit over my expertise. Most likely the "pda_offset(field)" can be
> replaced with "per_cpu_var(field)" [per_cpu__##field], but I wasn't sure about
> "_proxy_pda.field".

If you notice, we never use %1. My reading would be that we just have the +m
there to tell the compiler we may be changing the field. So just
a reference to the per_cpu_var directly should be sufficient. Although
"memory" may actually be enough.

Eric

2008-07-09 22:12:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

"H. Peter Anvin" <[email protected]> writes:

> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> 1. it means pda references are invalid if their offsets are ever more than
>>> CONFIG_PHYSICAL_BASE (which I do not think is likely, but still...)
>>
>> Why?
>>
>> As an aside, could we solve the problems by making CONFIG_PHYSICAL_BASE 0 -
>> putting the percpu variables as the first thing in the kernel - and relocating
>> on load? That would avoid having to make a special PT_LOAD segment at 0. Hm,
>> would that result in the pda and the boot params getting mushed together?
>>
>
> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.

Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment
is linked at a fixed address -2G and the option only determines the virtual
to physical address mapping.

That said the idea may not be too far off.

Potentially we could put the percpu area at our fixed -2G address and then
we have a constant (instead of an address) we could subtract from this address.

Eric

2008-07-09 22:30:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
>>>
>> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
>> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.
>
> Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment
> is linked at a fixed address -2G and the option only determines the virtual
> to physical address mapping.
>

No, it's not irrelevant; we currently base the kernel at virtual address
-2 GB (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the
proper alignment for large pages.

Now, it probably wouldn't hurt to move KERNEL_IMAGE_START up a bit to
make low positive values safer to use.

> That said the idea may not be too far off.
>
> Potentially we could put the percpu area at our fixed -2G address and then
> we have a constant (instead of an address) we could subtract from this address.

We can't put it at -2 GB since the offset +40 for the stack sentinel is
hard-coded into gcc. This leaves growing upward from +48 (or another
small positive number), or growing down from zero (or +40) as realistic
options.

Unfortunately, GNU ld handles grow-down not at all.

-hpa

2008-07-09 22:32:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter <[email protected]> writes:

> Note also that the 32 bit offset size limits the offset that can be added to the
> segment register. You need to place the per cpu area either in the last
> 2G of the address space or in the first 2G. The zero based approach removes that
> limitation.

Good point. Which means that fundamentally we need to come up with a special
linker segment or some other way to guarantee that the offsets we use for per
cpu variables are within 2G of the segment register.

Which means that my idea of using the technique we use on x86_32 will not work.

Eric

2008-07-09 22:33:22

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Christoph Lameter <[email protected]> writes:
>
>
>> Note also that the 32 bit offset size limits the offset that can be added to the
>> segment register. You need to place the per cpu area either in the last
>> 2G of the address space or in the first 2G. The zero based approach removes that
>> limitation.
>>
>
> Good point. Which means that fundamentally we need to come up with a special
> linker segment or some other way to guarantee that the offsets we use for per
> cpu variables are within 2G of the segment register.
>
> Which means that my idea of using the technique we use on x86_32 will not work.

No, the compiler memory model we use guarantees that everything will be
within 2G of each other. The linker will spew loudly if that's not the
case.

J

2008-07-09 23:22:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Arjan van de Ven <[email protected]> writes:

>> I think the unification is valid and useful, and that trying to keep
>> that stupid stack canary working is currently more trouble than it is
>> worth.
>
> I think that "unification over everything" is stupid, especially if it
> removes useful features.

After looking at this some more, any solution that actually works will
enable us to make the stack canary work, as we have a 32bit offset to
deal with. So there is no point in killing the feature.

That said, I have no sympathy for a thread local variable that is
compiled as an absolute symbol instead of using the proper thread
local markup. The implementation of -fstack-protector, however useful,
still appears to be a nasty hack, ignoring decades of best practice in
how to implement things.

Do you have a clue who we need to bug on the gcc team to get the
compiler to implement a proper TLS version of -fstack-protector?

- Unification over everything is stupid.
- Interesting features that disregard decades of implementation experience
are also stupid.

Since we know that the stack_canary is always a part of the
executable (being a fundamental part of glibc and libpthreads etc.),
we can use the local exec model for tls storage. The local exec model
means the compiler should be able to output code such as
"movq %fs:stack_canary@tpoff, %rax" to read the stack canary in user space.
Instead it emits the much more stupid "movq %fs:40, %rax", not even
letting the linker have a say in the placement of the variable.

So we either need to update the gcc code to do something proper, or
someone needs to update the sysv tls abi spec so %fs:40 joins %fs:0 in
the ranks of magic addresses in thread local storage, so that other
compilers can reliably use offset 40, and no one will have an excuse
for changing it in the future. Frankly I think updating the ABI is
the wrong solution, but at least it would document this stupidity.

Does -fstack-protector compiled code even fail to run with a gcc that
does not implement a thread local variable at %fs:40? Or does it
just silently break?

Eric

2008-07-09 23:42:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

>> Which means that my idea of using the technique we use on x86_32 will not
> work.
>
> No, the compiler memory model we use guarantees that everything will be within
> 2G of each other. The linker will spew loudly if that's not the case.

The per cpu area is at least theoretically dynamically allocated. And we
really want to put it in cpu local memory. Which means on any reasonable
NUMA machine the per cpu areas should be all over the box.

So with an arbitrary 64bit address in %gs there is no guarantee of anything.

Grr. Except you are correct. We have to guarantee that the offsets we have
chosen at compile time still work. And we know all of the compile time offsets
will be in the -2G range. So they are all 32bit numbers. Negative 32bit
numbers to be sure. That trivially leaves us with everything working except
the nasty hard coded decimal 40.

Eric

2008-07-10 00:02:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

"H. Peter Anvin" <[email protected]> writes:

> Eric W. Biederman wrote:
>>>>
>>> CONFIG_PHYSICAL_START rather. And no, it can't be zero! Realistically we
>>> should make it 16 MB by default (currently 2 MB), to keep the DMA zone clear.
>>
>> Also on x86_64 CONFIG_PHYSICAL_START is irrelevant as the kernel text segment
>> is linked at a fixed address -2G and the option only determines the virtual
>> to physical address mapping.
>>
>
> No, it's not irrelevant; we currently base the kernel at virtual address -2 GB
> (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper
> alignment for large pages.

Ugh. That is silly. We need to restrict CONFIG_PHYSICAL_START to the aligned
choices obviously. But -2G is better aligned than anything else we can do virtually.

For the 32bit code we need to play some of those games because it doesn't have
its own magic chunk of the address space to live in.

>> That said the idea may not be too far off.
>>
>> Potentially we could put the percpu area at our fixed -2G address and then
>> we have a constant (instead of an address) we could subtract from this
> address.
>
> We can't put it at -2 GB since the offset +40 for the stack sentinel is
> hard-coded into gcc. This leaves growing upward from +48 (or another small
> positive number), or growing down from zero (or +40) as realistic options.

I was thinking everything except that access would be done as:
%gs:(var - -2G), aka
%gs:(var - START_KERNEL),
so that everything is a small 32bit number that the linker and the compiler
can resolve. The trick is to put the stack canary at 40 decimal.

I was just trying to find a compile time known location for the start of the
percpu area so we could subtract it off.

Unless the linker just winds up overflowing in the subtraction and doing hideous
things to us. Although that should be pretty easy to spot and to test for at
build time.

-2G has the interesting distinction that we might get away with just dropping the
high bits.

> Unfortunately, GNU ld handles grow-down not at all.

Another alternative that almost fares better than a segment with
a base of zero is a base of -32K or so. The only trouble is that it would get
us back to manually managing the per cpu area size again.

Eric

2008-07-10 00:24:08

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>>> Which means that my idea of using the technique we use on x86_32 will not
>>>
>> work.
>>
>> No, the compiler memory model we use guarantees that everything will be within
>> 2G of each other. The linker will spew loudly if that's not the case.
>>
>
> The per cpu area is at least theoretically dynamically allocated. And we
> really want to put it in cpu local memory. Which means on any reasonable
> NUMA machine the per cpu areas should be all over the box.
>

Yes, but that doesn't matter in the slightest. The effective address
will be within 2G of the base; the base can be anywhere.

> So with an arbitrary 64bit address in %gs there is no guarantee of anything.
>
> Grr. Except you are correct. We have to guarantee that the offsets we have
> chosen at compile time still work. And we know all of the compile time offsets
> will be in the -2G range. So they are all 32bit numbers. Negative 32bit
> numbers to be sure. That trivially leaves us with everything working except
> the nasty hard coded decimal 40.
>

Right.

J

2008-07-10 00:24:40

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Eric W. Biederman wrote:
>> Jeremy Fitzhardinge <[email protected]> writes:
>>
>>>> Which means that my idea of using the technique we use on x86_32
>>>> will not
>>> work.
>>>
>>> No, the compiler memory model we use guarantees that everything will
>>> be within
>>> 2G of each other. The linker will spew loudly if that's not the case.
>>
>> The per cpu area is at least theoretically dynamically allocated.
>> And we
>> really want to put it in cpu local memory. Which means on any
>> reasonable
>> NUMA machine the per cpu areas should be all over the box.
>>
>> So with an arbitrary 64bit address in %gs there is no guarantee of
>> anything.
>>
>
> That doesn't matter in the slightest.

Creepy, get out of my brain.

J

2008-07-10 00:27:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>>> Which means that my idea of using the technique we use on x86_32 will not
>> work.
>>
>> No, the compiler memory model we use guarantees that everything will be within
>> 2G of each other. The linker will spew loudly if that's not the case.
>
> The per cpu area is at least theoretically dynamically allocated. And we
> really want to put it in cpu local memory. Which means on any reasonable
> NUMA machine the per cpu areas should be all over the box.
>
> So with an arbitrary 64bit address in %gs there is no guarantee of anything.
>

That doesn't matter in the slightest.

> Grr. Except you are correct. We have to guarantee that the offsets we have
> chosen at compile time still work. And we know all of the compile time offsets
> will be in the -2G range. So they are all 32bit numbers. Negative 32bit
> numbers to be sure. That trivially leaves us with everything working except
> the nasty hard coded decimal 40.

The *offsets* have to be in the proper range, but the %gs_base is an
arbitrary 64-bit number.

-hpa

2008-07-10 14:15:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

With the zero based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero.

This means that we are fully independent from the placement of the percpu segment.

The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables
via GS or by adding the per cpu offset to the __per_cpu_offset[] (which is now badly named because it points directly to the start of the percpu segment for each processor).

So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment. If we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap.
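
As a concrete illustration, a minimal sketch (mine, modeled on the existing
i386 percpu ops rather than taken from the patchset) of what an access looks
like once the linked address of a percpu variable *is* its offset from GS:

	/* Sketch only: assumes per_cpu__##var links at its zero-based
	 * offset, so %gs:per_cpu__##var resolves to this cpu's copy. */
	#define x86_read_percpu(var)					\
	({								\
		typeof(per_cpu__##var) ret__;				\
		asm("mov %%gs:%1, %0"					\
		    : "=r" (ret__)					\
		    : "m" (per_cpu__##var));				\
		ret__;							\
	})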

2008-07-10 14:34:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> With the zero based approach you do not have a relative address anymore. We are basically creating a new absolute address space where we place variables starting at zero.
>
> This means that we are fully independent from the placement of the percpu segment.
>
> The loader may place the per cpu segment with the initialized variables anywhere. We just need to set GS correctly for the boot cpu. We always need to refer to the per cpu variables
> via GS or by adding the per cpu offset to the __per_cpu_offset[] (which is now badly named because it points directly to the start of the percpu segment for each processor).
>
> So there is no 2G limitation on the distance between the code and the percpu segment anymore. The 2G limitation still exists for the *size* of the per cpu segment. If we go beyond 2G in defined per cpu variables then the per cpu addresses will wrap.

Okay, this is getting somewhat annoying. Several people now have missed
the point.

No one has talked about the actual placement of the percpu segment data.

Using RIP-based references, however, is *cheaper* than using absolute
references. For RIP-based references to be valid, the *offsets*
need to be in the range [-2 GB + CONFIG_PHYSICAL_START ...
CONFIG_PHYSICAL_START). This is similar to the constraint on absolute
references, where the *offsets* have to be in the range [-2 GB, 2 GB).

None of this affects the absolute positioning of the data. The final
address is determined by:

fs_base + rip + offset
or
fs_base + offset

... respectively. fs_base is an arbitrary 64-bit number; rip (in the
kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset
is in the range [-2 GB, 2 GB).

(The high end of the rip range above is slightly too wide.)

-hpa

2008-07-10 15:30:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> Noone has talked about the actual placement of the percpu segment data.

But the placement of the percpu segment data is a problem because of the way we
currently have the linker calculate offsets. I have had kernel configurations where I changed the placement of the percpu segment, leading to linker failures because the percpu segment was not within 2G of the code segment!

This is a particular problem if we have a large number of processors (like 4096) that each require a sizable segment of virtual address space up there for the per cpu allocator.

> None of this affects the absolute positioning of the data. The final
> address are determined by:
>
> fs_base + rip + offset
> or
> fs_base + offset
>
> ... respectively. fs_base is an arbitrary 64-bit number; rip (in the
> kernel) is in the range [-2 GB + CONFIG_PHYSICAL_START, 0), and offset
> is in the range [-2 GB, 2 GB).

Well, the zero based approach results in this always becoming

gs_base + absolute address in per cpu segment

Why are RIP based references cheaper? The offset to the per cpu segment is certainly more than what can fit into 16 bits.

2008-07-10 15:49:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>
> Well the zero based results in this becoming always
>
> gs_base + absolute address in per cpu segment

You can do it either way. For RIP-based, you have to worry about the
possible range for the RIP register when referencing. Currently, even
for "make allyesconfig" the per cpu segment is a lot smaller than the
minimum value for CONFIG_PHYSICAL_START (2 MB), so there is no issue,
but there is a distinct lack of wiggle room, which can be resolved
either by using negative offsets, or by moving the kernel text area up a
bit from -2 GB.

> Why are RIP based references cheaper? The offset to the per cpu segment is certainly more than what can be fit into 16 bits.

Where are you getting 16 bits from?!?! *There are no 16-bit offsets in
64-bit mode, period, full stop.*

RIP-based references are cheaper because the x86-64 architects chose to
optimize RIP-based references over absolute references. Therefore
RIP-based references are encodable with only a MODR/M byte, whereas
absolute references require a SIB byte as well -- longer instruction,
possibly a less optimized path through the CPU, and *definitely*
something that gets exercised less in the linker.

-hpa

2008-07-10 16:22:24

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
...
> Another alternative that almost fares better than a segment with
> a base of zero is a base of -32K or so. The only trouble is that it would get
> us back to manually managing the per cpu area size again.

One thing to remember is the eventual goal is implementing the cpu_alloc
functions which I think we've agreed has to be "growable". This means that
the addresses will need to be virtual to allow the same offsets for all cpus.
The patchset I have uses 2MB pages. This "little" twist might figure into the
implementation issues that are being discussed.

Thanks,
Mike

2008-07-10 16:25:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> but there is a distinct lack of wiggle room, which can be resolved
> either by using negative offsets, or by moving the kernel text area up a
> bit from -2 GB.

Let's say we reserve 256MB of cpu alloc space per processor.

On a system with 4k processors this will result in the need for 1TB virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by

PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()

instead of looking it up in a table because that will save a memory access on per_cpu().
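
In code, the lookup-free calculation would be something like this (a sketch
with made-up constants, just to show the shape):

	/* Hypothetical layout constants, for illustration only. */
	#define PERCPU_START_ADDRESS	0xffffe20000000000UL	/* assumed base */
	#define PERCPU_SIZE		(256UL << 20)		/* 256MB per cpu */

	#define per_cpu_area(cpu) \
		(PERCPU_START_ADDRESS + PERCPU_SIZE * (unsigned long)(cpu))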

The first percpu area would ideally be the per cpu segment generated by the linker.

How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach.

Maybe there is another way of arranging things that would allow for this?

2008-07-10 16:32:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Eric W. Biederman wrote:
> ...
>> Another alternative that almost fares better than a segment with
>> a base of zero is a base of -32K or so. The only trouble is that it would get
>> us back to manually managing the per cpu area size again.
>
> One thing to remember is the eventual goal is implementing the cpu_alloc
> functions which I think we've agreed has to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2MB pages. This "little" twist might figure into the
> implementation issues that are being discussed.
>

No, since the *addresses* can be arbitrary. The current issue is about
*offsets.*

-hpa

2008-07-10 16:36:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> No, since the *addresses* can be arbitrary. The current issue is about
> *offsets.*

Well those are intimately connected.

2008-07-10 16:41:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> but there is a distinct lack of wiggle room, which can be resolved
>> either by using negative offsets, or by moving the kernel text area up a
>> bit from -2 GB.
>
> Let's say we reserve 256MB of cpu alloc space per processor.
>
> On a system with 4k processors this will result in the need for 1TB virtual address space for per cpu areas (note that there may be more processors in the future). Preferably we would calculate the address of the per cpu area by
>
> PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
>
> instead of looking it up in a table because that will save a memory access on per_cpu().

It will, but it might still be a net loss due to higher load on the TLB
(you're effectively using the TLB to do the table lookup for you.) On
the other hand, Mike points out that once we move away from fixed-sized
segments we pretty much have to use virtual addresses anyway(*).

> The first percpu area would ideally be the per cpu segment generated by the linker.
>
> How would that fit into the address map? In particular the 2G distance between code and the first per cpu area must not be violated unless we go to a zero based approach.

If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
yes, you're right.

Note again that that is completely orthogonal to RIP-based versus absolute.

-hpa

2008-07-10 16:46:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> It will, but it might still be a net loss due to higher load on the TLB
> (you're effectively using the TLB to do the table lookup for you.) On
> the other hand, Mike points out that once we move away from fixed-sized
> segments we pretty much have to use virtual addresses anyway(*).

There will be no additional overhead since the memory is already mapped 1-1 using 2MB TLB entries and we want to use the same for the percpu areas. This is similar to the vmemmap solution.

>> The first percpu area would ideally be the per cpu segment generated
>> by the linker.
>>
>> How would that fit into the address map? In particular the 2G distance
>> between code and the first per cpu area must not be violated unless we
>> go to a zero based approach.
>
> If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
> yes, you're right.
>
> Note again that that is completely orthogonal to RIP-based versus absolute.

?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP based addresses. You would have to place the per cpu areas 1TB before the kernel text.



2008-07-10 16:46:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> No, since the *addresses* can be arbitrary. The current issue is about
>> *offsets.*
>
> Well those are intimately connected.

Not really, since gs_base is an arbitrary 64-bit pointer.

-hpa

2008-07-10 16:48:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Christoph Lameter wrote:
>> H. Peter Anvin wrote:
>>
>>> No, since the *addresses* can be arbitrary. The current issue is about
>>> *offsets.*
>>
>> Well those are intimately connected.
>
> Not really, since gs_base is an arbitrary 64-bit pointer.

The current scheme ties the offsets to kernel code addresses.

2008-07-10 17:08:07

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> One thing to remember is the eventual goal is implementing the cpu_alloc
> functions which I think we've agreed has to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2Mb pages. This "little" twist might figure into the
> implementation issues that are being discussed.

You want to virtually map the percpu area? How and when would it get
extended?

J

2008-07-10 17:13:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> You want to virtually map the percpu area? How and when would it get
> extended?

It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available.

2008-07-10 17:21:05

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Mike Travis wrote:
>> Eric W. Biederman wrote:
>> ...
>>> Another alternative that almost fares better than a segment with
>>> a base of zero is a base of -32K or so. Only trouble is that would get us
>>> manually managing the per cpu area size again.
>>
>> One thing to remember is that the eventual goal is implementing the cpu_alloc
>> functions, which I think we've agreed have to be "growable". This means that
>> the addresses will need to be virtual to allow the same offsets for all cpus.
>> The patchset I have uses 2MB pages. This "little" twist might figure into the
>> implementation issues that are being discussed.
>>
>
> No, since the *addresses* can be arbitrary. The current issue is about
> *offsets.*
>
> -hpa

Ok, thanks for clearing that up. I just didn't want us to drop the ball
trying to make that double play... ;-)

Cheers,
Mike

2008-07-10 17:22:30

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> Christoph Lameter wrote:
>>
>>> H. Peter Anvin wrote:
>>>
>>>
>>>> No, since the *addresses* can be arbitrary. The current issue is about
>>>> *offsets.*
>>>>
>>> Well those are intimately connected.
>>>
>> Not really, since gs_base is an arbitrary 64-bit pointer.
>>
>
> The current scheme ties the offsets to kernel code addresses.
>

This is getting very frustrating. We've been going around and around on
this point, what, 5 or 6 times at least.

The base address of the percpu area and the offsets from that base are
completely independent values.

The offset is limited to 2G. The 2G limit applies regardless of how you
compute your effective address. It doesn't matter if it's absolute. It
doesn't matter if it's rip-relative. It doesn't matter if it's
zero-based. Small absolute addresses generate exactly the same form as
large absolute addresses. There is no 8-bit or 16-bit address mode.

The base is arbitrary. It can be any canonical address at all. It has
no effect on how you compute your offset.

The addressing modes:

* ABS
* off(%rip)

Are exactly equivalent in what offsets they can generate, so long as *at
link time* the percpu *symbols* are within 2G of the code addressing
them. *After* the addressing mode has generated an effective address
(by whatever means it likes), the %gs: override applies the segment
base, which can therefore offset the effective address to anywhere at all.

J

2008-07-10 17:26:18

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> You want to virtually map the percpu area? How and when would it get
>> extended?
>>
>
> It would get extended when cpu_alloc() is called and the allocator finds that there is no per cpu memory available.
>

Which, I take it, allocates percpu memory. It would have the same
caveats as vmalloc memory with respect to accessing it during fault
handlers and nmi handlers.

How would cpu_alloc() actually get used? It doesn't make much sense for
general code, since we don't have the notion of a percpu pointer to
memory (vs a pointer to percpu memory). Is the intended use for
allocating percpu memory in modules? What other uses?

J

2008-07-10 17:32:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter <[email protected]> writes:

> H. Peter Anvin wrote:
>
>> but there is a distinct lack of wiggle room, which can be resolved
>> either by using negative offsets, or by moving the kernel text area up a
>> bit from -2 GB.
>
> Lets say we reserve 256MB of cpu alloc space per processor.

First off right now reserving more than about 64KB is ridiculous. We rightly
don't have that many per cpu variables.

> On a system with 4k processors this will result in the need for 1TB virtual
> address space for per cpu areas (note that there may be more processors in the
> future). Preferably we would calculate the address of the per cpu area by
>
> PERCPU_START_ADDRESS + PERCPU_SIZE * smp_processor_id()
>
> instead of looking it up in a table because that will save a memory access on
> per_cpu().

???? Optimizing per_cpu seems to be the wrong path. If you want to go fast you
access the data on the cpu you start out on.

> The first percpu area would ideally be the per cpu segment generated by the
> linker.
>
> How would that fit into the address map? In particular the 2G distance between
> code and the first per cpu area must not be violated unless we go to a zero
> based approach.
>
> Maybe there is another way of arranging things that would allow for this?

Yes. Start with a patch that doesn't have freaky failures that can't be understood
or bisected because the patch is too big. The only reason we are having a conversation
about alternative implementations is because the current implementation has weird
random incomprehensible failures. The most likely culprit is playing with
the linker. It could be something else.

So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.

Eric

2008-07-10 17:32:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
>
> The base address of the percpu area and the offsets from that base are
> completely independent values.

Definitely.


> The addressing modes:
>
> * ABS
> * off(%rip)
>
> Are exactly equivalent in what offsets they can generate, so long as *at
> link time* the percpu *symbols* are within 2G of the code addressing
> them. *After* the addressing mode has generated an effective address
> (by whatever means it likes), the %gs: override applies the segment
> base, which can therefore offset the effective address to anywhere at all.

Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu, and later we set up additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.

However, that does not work if one calculates the virtual address instead of looking up a physical address.




2008-07-10 17:33:59

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>
>> It will, but it might still be a net loss due to higher load on the TLB
>> (you're effectively using the TLB to do the table lookup for you.) On
>> the other hand, Mike points out that once we move away from fixed-sized
>> segments we pretty much have to use virtual addresses anyway(*).
>>
>
> There will be no additional overhead since the memory already mapped 1-1 using 2MB TLBs and we want to use the same for the percpu areas. This is similar to the vmemmap solution.
>
>
>>> The first percpu area would ideally be the per cpu segment generated
>>> by the linker.
>>>
>>> How would that fit into the address map? In particular the 2G distance
>>> between code and the first per cpu area must not be violated unless we
>>> go to a zero based approach.
>>>
>> If with "zero-based" you mean "nonzero gs_base for the boot CPU" then
>> yes, you're right.
>>
>> Note again that that is completely orthogonal to RIP-based versus absolute.
>>
>
> ?? The distance to the per cpu area for cpu 0 is larger than 2G. The kernel won't link with RIP based addresses. You would have to place the per cpu areas 1TB before the kernel text.

If %gs:0 points to the start of your percpu area, then all the offsets off
%gs are going to be no larger than the amount of percpu memory you
have. The gs base itself can be any 64-bit address, so it doesn't
matter where it is within overall kernel memory. Using a zero-based
percpu area means that you must set a non-zero %gs base before you can
access the percpu area.

If the layout of the percpu area is done by the linker by packing all
the percpu variables into one section, then any address computation
using a percpu variable symbol will generate an offset which is
appropriate to apply to a %gs: addressing mode.

The nice thing about the non-zero-based scheme i386 uses is that setting
the segment base to zero means that percpu variable accesses go directly to
the prototype percpu data area, which simplifies boot time setup (which is
doubly awkward on 32-bit because you need to generate a GDT entry rather
than just load an MSR as you do in 64-bit).
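
Concretely, the i386-style arithmetic is just (a sketch; __per_cpu_start
is the real linker symbol, the helper itself is illustrative):

/* The segment base is the delta between this cpu's area and the
 * prototype section, so a base of 0 (the boot-time default) makes
 * a %seg:per_cpu__foo access resolve to the prototype copy itself. */
static unsigned long percpu_segment_base(char *cpu_area)
{
	extern char __per_cpu_start[];

	return (unsigned long)(cpu_area - __per_cpu_start);
}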

J

2008-07-10 17:36:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Christoph Lameter wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>
>>> You want to virtually map the percpu area? How and when would it get
>>> extended?
>>>
>>
>> It would get extended when cpu_alloc() is called and the allocator
>> finds that there is no per cpu memory available.
>>
>
> Which, I take it, allocates percpu memory. It would have the same
> caveats as vmalloc memory, with respect to accessing it during fault
> handlers and nmi handlers, I take it.

Right. One would not want to allocate per cpu memory in those contexts. The current allocpercpu() functions already have those restrictions.

> How would cpu_alloc() actually get used? It doesn't make much sense for
> general code, since we don't have the notion of a percpu pointer to
> memory (vs a pointer to percpu memory). Is the intended use for
> allocating percpu memory in modules? What other uses?

Argh. Do I have to reexplain all of this all over again? Please look at the latest cpu_alloc patchset and the related discussions.

The cpu_alloc patchset introduces the concept of a pointer to percpu memory. Or you could call it an offset into the percpu segment that can be treated (in a restricted way) like a pointer....


2008-07-10 17:40:25

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:

> First off right now reserving more than about 64KB is ridiculous. We rightly
> don't have that many per cpu variables.

We do. The case has been made numerous times that we need at least several megabytes of per cpu memory in case someone creates gazillions of ip tunnels etc etc.


>> instead of looking it up in a table because that will save a memory access on
>> per_cpu().
>
> ???? Optimizing per_cpu seems to be the wrong path. If you want to go fast you
> access the data on the cpu you start out on.

Yes, most arches provide specialized registers for local per cpu variable access. There are cases, though, in which you have to access another processor's per cpu space.

>> Maybe there is another way of arranging things that would allow for this?
>
> Yes. Start with a patch that doesn't have freaky failures that can't be understood
> or bisected because the patch is too big. The only reason we are having a conversation

The patches are reasonably small. The problem that Mike seems to have is early boot debugging.

2008-07-10 17:41:19

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> One thing to remember is that the eventual goal is implementing the cpu_alloc
>> functions, which I think we've agreed have to be "growable". This means that
>> the addresses will need to be virtual to allow the same offsets for all cpus.
>> The patchset I have uses 2MB pages. This "little" twist might figure into the
>> implementation issues that are being discussed.
>
> You want to virtually map the percpu area? How and when would it get
> extended?
>
> J


CPU_ALLOC(), or some such means. This is to replace the percpu allocator
in modules.c:

Subject: cpu alloc: The allocator

The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignments.

The size of the cpu_alloc area can be changed via make menuconfig.

Signed-off-by: Christoph Lameter <[email protected]>

and:

Subject: cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator

Remove the builtin per cpu allocator from modules.c and use cpu_alloc instead.

The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
determined by CONFIG_CPU_AREA_SIZE. PERCPU_ENOUGH_ROOM's default was 8k.
CONFIG_CPU_AREA_SIZE defaults to 32k. Thus we have more space to load modules.

Signed-off-by: Christoph Lameter <[email protected]>

The discussion that followed was very emphatic that the size of the space should
not be fixed, but instead be dynamically growable. Since the offset needs to be
fixed for each cpu, virtual addresses (I think) are the only way to go. The use of a
2MB page just conserves map entries. (Of course, if we just reserved 2MB in the
first place it might not need to be virtual...? But the concern was for systems
with hundreds of (say) network interfaces using even more than 2MB.)
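
As a rough illustration of the bitmap scheme described in that first patch
(a sketch only, with illustrative sizes and names, and locking omitted --
not the actual cpu_alloc code):

#include <linux/bitops.h>

#define UNIT_SIZE	8			/* illustrative allocation unit */
#define NR_UNITS	(32768 / UNIT_SIZE)	/* illustrative 32k area */
static DECLARE_BITMAP(cpu_alloc_map, NR_UNITS);

/* First-fit search for a free run of units; the returned byte offset
 * is valid in every cpu's copy of the area. */
static int cpu_alloc_units(int units)
{
	int start, i;

	for (start = 0; start <= NR_UNITS - units; start++) {
		for (i = 0; i < units; i++)
			if (test_bit(start + i, cpu_alloc_map))
				break;
		if (i == units) {
			for (i = 0; i < units; i++)
				set_bit(start + i, cpu_alloc_map);
			return start * UNIT_SIZE;
		}
	}
	return -1;	/* caller would extend the area here */
}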

Thanks,
Mike

2008-07-10 17:43:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> If %gs:0 points to start of your percpu area, then all the offsets off
> %gs are going to be no larger than the amount of percpu memory you
> have. The gs base itself can be any 64-bit address, so it doesn't
> matter where it is within overall kernel memory. Using zero-based
> percpu area means that you must set a non-zero %gs base before you can
> access the percpu area.

Correct.

> If the layout of the percpu area is done by the linker by packing all
> the percpu variables into one section, then any address computation
> using a percpu variable symbol will generate an offset which is
> appropriate to apply to a %gs: addressing mode.

Of course.

> The nice thing about the non-zero-based scheme i386 uses is that setting
> the segment base to zero means that percpu variable accesses go directly to
> the prototype percpu data area, which simplifies boot time setup (which is
> doubly awkward on 32-bit because you need to generate a GDT entry rather
> than just load an MSR as you do in 64-bit).

Great, but it causes trouble in other ways, as discussed. It's best to consistently access
per cpu variables using the segment registers.

2008-07-10 17:46:45

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
...
>
> So please REFACTOR the patch that changes things to DO ONE THING PER PATCH.
>
> Eric

Working feverishly on this exact thing... ;-)

2008-07-10 17:48:30

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> The base address of the percpu area and the offsets from that base are
>> completely independent values.
>>
>
> Definitely.
>
>
>
>> The addressing modes:
>>
>> * ABS
>> * off(%rip)
>>
>> Are exactly equivalent in what offsets they can generate, so long as *at
>> link time* the percpu *symbols* are within 2G of the code addressing
>> them. *After* the addressing mode has generated an effective address
>> (by whatever means it likes), the %gs: override applies the segment
>> base, which can therefore offset the effective address to anywhere at all.
>>
>
> Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu and later we setup other additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.
>

Yes, but the offset is the same either way. When you want a cpu to
refer to its own percpu memory, regardless of where it is in memory, you
just reload the gs base. The offsets are the same everywhere, and are
computed by the linker without knowledge or reference to where the
final address will end up.

In other words, at source level:

a = x86_read_percpu(foo)

will generate

mov %gs:percpu__foo, %rax

where the linker decides the value of percpu__foo, which can be up to
4G. Or if we use rip-relative:

mov %gs:percpu__foo(%rip), %rax

we end up with the same result, except that the generated instruction is
a bit more compact.

In the final generated assembly, it ends up being a hardcoded constant
address. Say, 0x7838.

Now if we allocate cpu 43 percpu data at 0xfffffffff7198000, we load %gs
base with that value, and then the instruction is still

mov %gs:0x7838, %rax

and the computed address will be 0xfffffffff7198000 + 0x7838 =
0xfffffffff719f838.

And cpu 62 has its percpu data at 0xffffffffe3819000, and the
instruction is still

mov %gs:0x7838, %rax

and the computed address for its version of percpu__foo is
0xffffffffe3819000 + 0x7838 = 0xffffffffe3820838.

Note that it doesn't matter how you decide to place the percpu data, so
long as you can load the address into the %gs base.

> However, that does not work if one calculates the virtual address instead of looking up a physical address.
>

Calculate a virtual address for what? Physical address for what? If
you have a large virtual region allocating 256M of percpu space, er, per
cpu, then you just load %gs base with percpu_region_base + cpuid *
256M. It has no effect on the instructions accessing that percpu space.
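
A sketch of the per cpu setup that implies (wrmsrl/MSR_GS_BASE are the
real mechanism; percpu_region_base is hypothetical):

extern unsigned long percpu_region_base;	/* hypothetical region start */

void setup_percpu_gs(int cpu)
{
	unsigned long base = percpu_region_base +
				(unsigned long)cpu * (256UL << 20);

	/* every %gs:offset access now lands in this cpu's 256M slice */
	wrmsrl(MSR_GS_BASE, base);
}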

J

2008-07-10 17:53:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>> The nice thing about the non-zero-based scheme i386 uses is that setting
>> the segment base to zero means that percpu variable accesses go directly to
>> the prototype percpu data area, which simplifies boot time setup (which is
>> doubly awkward on 32-bit because you need to generate a GDT entry rather
>> than just load an MSR as you do in 64-bit).
>>
>
> Great, but it causes trouble in other ways, as discussed.

What other trouble? It works fine.

> It's best to consistently access
> per cpu variables using the segment registers.
>


It is, but initially the segment base is 0, so just using a percpu
variable does something sensible from the start with no special setup.

J

2008-07-10 17:58:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Christoph Lameter <[email protected]> writes:
>
>> H. Peter Anvin wrote:
>>
>>> but there is a distinct lack of wiggle room, which can be resolved
>>> either by using negative offsets, or by moving the kernel text area up a
>>> bit from -2 GB.
>> Lets say we reserve 256MB of cpu alloc space per processor.
>
> First off right now reserving more than about 64KB is ridiculous. We rightly
> don't have that many per cpu variables.

Almost half a megabyte in current allyesconfig, and that is not
including dynamic allocations at all.

-hpa

2008-07-10 18:00:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>
> There will be no additional overhead since the memory is already mapped 1-1 using 2MB TLB entries and we want to use the same for the percpu areas. This is similar to the vmemmap solution.
>

THAT sounds strange. If you're using dedicated virtual maps (which is
what you're responding to here) then you will *always* have additional
TLB pressure. Furthermore, if you use 2 MB pages, you:

a) can only allocate full 2 MB pages, which is expensive for the static
users and difficult for the dynamic users;

b) increase pressure in the relatively small 2 MB TLB.

-hpa

2008-07-10 18:02:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
>
>> It's best to consistently access
>> per cpu variables using the segment registers.
>
> It is, but initially the segment base is 0, so just using a percpu
> variable does something sensible from the start with no special setup.
>

That's easy enough to fix -- synthesizing a GDT entry is trivial enough
-- but since it currently works, and since the offsets on 32 bits can
reach the full address space anyway, there is no reason to change.

-hpa

2008-07-10 18:04:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
>>>
>> No, it's not irrelevant; we currently base the kernel at virtual address -2 GB
>> (KERNEL_IMAGE_START) + CONFIG_PHYSICAL_START, in order to have the proper
>> alignment for large pages.
>
> Ugh. That is silly. We need to restrict CONFIG_PHYSICAL_START to the aligned
> choices obviously. But -2G is better aligned than anything else we can do virtually.
>

You may think it's silly, but it's actually an advantage in this case.

-hpa

2008-07-10 18:07:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>
> Right. The problem is with the percpu area handled by the linker. That percpu area is used by the boot cpu, and later we set up additional per cpu areas. Those can be placed in an arbitrary way if one goes through a table of pointers to these areas.
>
> However, that does not work if one calculates the virtual address instead of looking up a physical address.
>

As far as the linker is concerned, there are two address spaces: VMA,
which is the offset, and LMA, which is the physical address at which to
load. The linker doesn't give a flying hoot about the virtual address,
since it's completely irrelevant as far as it is concerned; it's nothing
but a kernel-internal abstraction.

-hpa

2008-07-10 18:08:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
>
> The discussion that followed was very emphatic that the size of the space should
> not be fixed, but instead be dynamically growable. Since the offset needs to be
> fixed for each cpu, virtual addresses (I think) are the only way to go. The use of a
> 2MB page just conserves map entries. (Of course, if we just reserved 2MB in the
> first place it might not need to be virtual...? But the concern was for systems
> with hundreds of (say) network interfaces using even more than 2MB.)
>

I'm much more concerned about wasting an average of 1 MB of memory per CPU.

-hpa

2008-07-10 18:15:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
>
> Another alternative that almost fares better than a segment with
> a base of zero is a base of -32K or so. Only trouble is that would get us
> manually managing the per cpu area size again.
>

Yes, an extra link pass would be better than that. I have tried, and
none of the clever things I tried actually works, since GNU ld has
pretty much no way to get it to reveal its information ahead of time.

However, I want to explore the details of the supposed toolchain issue;
it might be just a simple tweak to the way things are done now to fix it.

Clearly, given the stack protector ABI, we want %gs:40 to be usable and
free.
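
(For reference, that constraint is satisfied today because of where the
canary sits in the pda -- roughly, abridged from the current x8664_pda:)

struct x8664_pda {
	struct task_struct *pcurrent;	/*  0  current process */
	unsigned long data_offset;	/*  8  per cpu data offset */
	unsigned long kernelstack;	/* 16  top of kernel stack */
	unsigned long oldrsp;		/* 24  user rsp for syscall */
	int irqcount;			/* 32  irq nesting counter */
	unsigned int cpunumber;		/* 36  logical cpu number */
	unsigned long stack_canary;	/* 40  fixed by the gcc ABI */
	/* ... */
};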

-hpa

2008-07-10 18:52:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis <[email protected]> writes:

> Eric W. Biederman wrote:
> ...
>> Another alternative that almost fares better than a segment with
>> a base of zero is a base of -32K or so. Only trouble is that would get us
>> manually managing the per cpu area size again.
>
> One thing to remember is that the eventual goal is implementing the cpu_alloc
> functions, which I think we've agreed have to be "growable". This means that
> the addresses will need to be virtual to allow the same offsets for all cpus.
> The patchset I have uses 2MB pages. This "little" twist might figure into the
> implementation issues that are being discussed.

I had not heard that.

However if you are going to use 2MB pages you might as well just use a
physical address at the start of a node. 2MB is so much larger than
the size of the per cpu memory we need today it isn't even funny.

To get 32K I had to round up on my current system, and honestly it is
important that per cpu data stay relatively small as otherwise the system
won't have memory to use for anything interesting.

I just took a quick look at our alloc_percpu calls. At first glance
they all appear to be for relatively small data structures. So we can
just about get away with doing what we do today for modules for everything.
The question is what to do when we fill up our preallocated size for percpu
data.

I think we can get away with just realloc'ing the percpu area
on each cpu. No fancy table manipulations required. Just update
the base pointer in %gs and someplace global.

If you do use virtual addresses, that really requires using 4K pages, so we
can benefit from non-contiguous allocations. I just can't imagine
the per cpu area getting up to 2MB in size, where you would need
multiple 2MB pages. That is a huge jump from the 32KB I see today.


For the rest, mostly I have been making a list of things that we can do
that could work. A zero based percpu area is great if you can
eliminate it from suspicion of your weird random failures.

Eric

2008-07-10 18:55:35

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> I think we can get away with just simply realloc'ing the percpu area
> on each cpu. No fancy table manipulations required. Just update
> the base pointer in %gs and in someplace global.
>

It's perfectly legitimate to take the address of a percpu variable and
store it somewhere. We can't move them around.

J

2008-07-10 19:11:48

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
...
> The patches are reasonably small. The problem that Mike seems to have is early boot debugging.

Note that the early boot debugging problems go away with gcc-4.2.4. My
problem now (what I [maybe incorrectly] believe) is stack overflow with
NR_CPUS=4096 and a specific random config file.

Btw, I've completed the first half of splitting the zero_based_fold into
zero_based_only and fold_pda_into_percpu and am testing that now.

2008-07-10 19:12:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

"H. Peter Anvin" <[email protected]> writes:

> Almost half a megabyte in current allyesconfig, and that is not including
> dynamic allocations at all.

Ouch! This starts to make me sad that I removed the arbitrary cap on static
percpu memory.

Eric

2008-07-10 19:18:46

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> "H. Peter Anvin" <[email protected]> writes:
>
>> Almost half a megabyte in current allyesconfig, and that is not including
>> dynamic allocations at all.
>
> Ouch! This starts to make me sad that I removed the arbitrary cap on static
> percpu memory.
>
> Eric

The biggest growth came from moving all the xxx[NR_CPUS] arrays into
the per cpu area. So you free up a huge amount of unused memory when
the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
real soon now, ??? future?

Mike

2008-07-10 19:22:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter <[email protected]> writes:

> The patches are reasonably small. The problem that Mike seems to have is early
> boot debugging.

I didn't say small. I said do one thing at a time.

Mike is addressing this.

Fundamentally the problem is that we are seeing weird failures and there is not
enough granularity in the patches to test which part of the patch the failure
comes from.

The fact that the failures happen early in boot just makes it worse to debug.


Eric

2008-07-10 19:22:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

> Eric W. Biederman wrote:
>> I think we can get away with just simply realloc'ing the percpu area
>> on each cpu. No fancy table manipulations required. Just update
>> the base pointer in %gs and in someplace global.
>>
>
> It's perfectly legitimate to take the address of a percpu variable and store it
> somewhere. We can't move them around.

Really? I guess there are cases where that makes sense. It is a pretty
rare case though, especially when you are not talking about doing it
temporarily with preemption disabled. There are few enough users of the API
that I think we can certainly explore the cost of forbidding storing the
address of a percpu variable in the general case.

Eric

2008-07-10 19:39:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
>
> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?
>

Even (or perhaps especially) so, allocating the percpu area in 2 MB
increments is a total nonstarter. It hurts the small, common
configurations way too much. For SGI, it's probably fine.

-hpa

2008-07-10 19:56:46

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>> Eric W. Biederman wrote:
>>
>>> I think we can get away with just simply realloc'ing the percpu area
>>> on each cpu. No fancy table manipulations required. Just update
>>> the base pointer in %gs and in someplace global.
>>>
>>>
>> It's perfectly legitimate to take the address of a percpu variable and store it
>> somewhere. We can't move them around.
>>
>
> Really? I guess there are cases where that makes sense. It is a pretty
> rare case though, especially when you are not talking about doing it
> temporarily with preemption disabled. There are few enough users of the API
> that I think we can certainly explore the cost of forbidding storing the
> address of a percpu variable in the general case.
>

No, that sounds like a bad idea. For one, how would you enforce it?
How would you check for it? It's one of those things that would mostly
work and then fail very rarely.

Secondly, I depend on it. I register a percpu structure with Xen to
share per-vcpu specific information (interrupt mask, time info, runstate
stats, etc).

J

2008-07-10 20:22:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis <[email protected]> writes:


> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> the per cpu area. So you free up a huge amount of unused memory when
> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
> real soon now, ??? future?

Hmm. Do you know how big a role kernel_stat plays?

It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS is sized by NR_CPUS.
So ultimately the amount of memory taken up is NR_CPUS*NR_CPUS*32 or so.

I have a patch I wrote long ago that addresses that specific nasty configuration
by moving the per cpu irq counters into a pointer available from struct irq_desc.

The next step which I did not get to (but is interesting from a scaling perspective)
was to start dynamically allocating the irq structures.

Eric

2008-07-10 20:25:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


* Eric W. Biederman <[email protected]> wrote:

> Mike Travis <[email protected]> writes:
>
>
> > The biggest growth came from moving all the xxx[NR_CPUS] arrays into
> > the per cpu area. So you free up a huge amount of unused memory
> > when the NR_CPUS count starts getting into the ozone layer. 4k now,
> > 16k real soon now, ??? future?
>
> Hmm. Do you know how big a role kernel_stat plays?
>
> It is a per cpu structure that is sized via NR_IRQS, and NR_IRQS is
> sized by NR_CPUS. So ultimately the amount of memory taken up is
> NR_CPUS*NR_CPUS*32 or so.
>
> I have a patch I wrote long ago that addresses that specific nasty
> configuration by moving the per cpu irq counters into a pointer
> available from struct irq_desc.
>
> The next step which I did not get to (but is interesting from a
> scaling perspective) was to start dynamically allocating the irq
> structures.

/me willing to test & babysit any test-patch in that area ...

this is a big problem and it's getting worse quadratically ;-)

Ingo

2008-07-10 20:32:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

> Secondly, I depend on it. I register a percpu structure with Xen to share
> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).

Note that I expect at least something like this is interesting in the context
of per cpu device queues. However, except possibly for Xen, that implies allocating
DMA-addressable memory and going through that API, which will keep device drivers
from using per cpu memory that way even if they allocate something for each cpu.

Eric

2008-07-10 20:32:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge <[email protected]> writes:

> No, that sounds like a bad idea. For one, how would you enforce it? How would
> you check for it? It's one of those things that would mostly work and then fail
> very rarely.

Well the easiest way would be to avoid letting people take the address of
per cpu memory, and just provide macros to read/write it. We are 90% of the
way there already so it isn't a big jump.
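
A sketch of what such an accessor could look like (modeled loosely on the
existing x86_read_percpu; the real macros dispatch on operand size, this
one does not):

#define percpu_read(var)						\
({									\
	typeof(per_cpu__##var) ret__;					\
	asm("mov %%gs:%1, %0"						\
	    : "=r" (ret__)						\
	    : "m" (per_cpu__##var));					\
	ret__;								\
})

Since the macro never yields an lvalue, the address of the variable
cannot leak out through it.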

> Secondly, I depend on it. I register a percpu structure with Xen to share
> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).

Well even virtual allocation is likely to break the Xen sharing case as you
would at least need to compute the physical address and pass it to Xen.

Eric

2008-07-10 20:53:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> I'm much more concerned about wasting an average of 1 MB of memory per CPU.

Well all the memory that is now allocated via allocpercpu()s will be allocated from that 2MB segment. And cpu_alloc packs variables in a dense way. The current slab allocations try to avoid sharing cachelines which wastes lots of memory for every allocation for each and every processor.

2008-07-10 20:54:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> What other trouble? It works fine.

Somehow you performed a mind wipe to get rid of all memory of the earlier messages?

2008-07-10 20:55:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>> No, that sounds like a bad idea. For one, how would you enforce it? How would
>> you check for it? It's one of those things that would mostly work and then fail
>> very rarely.
>>
>
> Well the easiest way would be to avoid letting people take the address of
> per cpu memory, and just provide macros to read/write it. We are 90% of the
> way there already so it isn't a big jump.
>

Well, the x86_X_percpu API is there. But per_cpu() and get_cpu_var()
both explicitly return lvalues, which can have their addresses taken.

>> Secondly, I depend on it. I register a percpu structure with Xen to share
>> per-vcpu specific information (interrupt mask, time info, runstate stats, etc).
>>
>
> Well even virtual allocation is likely to break the Xen sharing case as you
> would at least need to compute the physical address and pass it to Xen.
>

Right. At the moment it assumes that the percpu variable is in the
linear mapping, but it could easily do a pagetable walk if necessary.

J

2008-07-10 20:58:46

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>
>> What other trouble? It works fine.
>>
>
> Somehow you performed a mind wipe to get rid of all memory of the earlier messages?
>

Percpu on i386 hasn't been a point of discussion. It works fine, and
has been working fine for a long time. The same mechanism would work
fine on x86-64. Its only "issue" is that it doesn't support the broken
gcc abi for stack-protector.

The problem is all zero-based percpu on x86-64.

J

2008-07-10 21:05:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> I'm much more concerned about wasting an average of 1 MB of memory per CPU.
>
> Well all the memory that is now allocated via allocpercpu()s will be allocated from that 2MB segment. And cpu_alloc packs variables in a dense way. The current slab allocations try to avoid sharing cachelines which wastes lots of memory for every allocation for each and every processor.
>

And how much is that, especially on *small* systems?

-hpa

2008-07-10 21:06:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:

> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc abi for stack-protector.

Well, that is one thing, and then there are the scaling issues, the support of the new cpu allocator, new arch-independent cpu operations, etc.

> The problem is all zero-based percpu on x86-64.

The zero based stuff will enable a lot of things. Please have a look at the cpu_alloc patchsets.


2008-07-10 21:09:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:

> And how much is that, especially on *small* systems?

i386?

i386 uses 4K mappings. There are just a few cpus supported, and there is scarcity of ZONE_NORMAL memory, so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386.


2008-07-10 21:10:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
>
> Percpu on i386 hasn't been a point of discussion. It works fine, and
> has been working fine for a long time. The same mechanism would work
> fine on x86-64. Its only "issue" is that it doesn't support the broken
> gcc abi for stack-protector.
>
> The problem is all zero-based percpu on x86-64.
>

Well, x86-64 has *two* issues: limited range of offsets (regardless of
whether we do RIP-relative or not), and the stack-protector ABI.

I'm still trying to reproduce Mike's setup, but I suspect it can be
switched to RIP-relative for the fixed-offset (static) stuff; for the
dynamic stuff it's all via pointers anyway so the offsets don't matter.

-hpa

2008-07-10 21:18:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> H. Peter Anvin wrote:
>
>> And how much is that, especially on *small* systems?
>
> i386?
>
> i386 uses 4K mappings. There are just a few cpus supported, and there is scarcity of ZONE_NORMAL memory, so the per cpu areas really cannot get that big. See the cpu_alloc patchsets for i386.
>

No, not i386. x86-64.

-hpa

2008-07-10 21:32:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter <[email protected]> writes:

> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc abi for stack-protector.
>
> Well, that is one thing, and then there are the scaling issues, the support
> of the new cpu allocator, new arch-independent cpu operations, etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the
> cpu_alloc patchsets.

Christoph, again: the reason we are balking at the zero based percpu
area is NOT because it is zero based. It is because systems with it
patched in don't work reliably.

The bottom line is if the tools don't support a clever idea we can't use it.

Hopefully the problem can be root caused and we can use a zero based percpu area.
There are several ways we can achieve that.

Further, any design that depends on a zero based percpu area can work with a contiguous
percpu area at an offset, so we should not be breaking whatever design you have for the
percpu allocator.

Eric

2008-07-10 21:32:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter <[email protected]> writes:

> H. Peter Anvin wrote:
>
>> And how much is that, especially on *small* systems?
>
> i386?
>
> i386 uses 4K mappings. There are just a few cpus supported, and there is scarcity of
> ZONE_NORMAL memory, so the per cpu areas really cannot get that big. See the
> cpu_alloc patchsets for i386.

i386 is fundamentally resource constrained. However x86_32 should support a
strict superset of the machines the x86_64 kernel supports.

Because it is resource constrained in the lowmem zone you should not
be able to bring up all of the cpus on a huge cpu box. But you should still
be able to boot and run the kernel. So for percpu data we have effectively
the same size constraints.

Eric

2008-07-10 21:36:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
> Jeremy Fitzhardinge wrote:
>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the broken
>> gcc abi for stack-protector.
>
> Well, that is one thing, and then there are the scaling issues, the support of the new cpu allocator, new arch-independent cpu operations, etc.
>
>> The problem is all zero-based percpu on x86-64.
>
> The zero based stuff will enable a lot of things. Please have a look at the cpu_alloc patchsets.
>

No argument this work is worthwhile. The main issues on the table are
the particular choice of offsets, and the handling of the virtual space
-- I believe 2 MB mappings are too large, except perhaps as an option.

-hpa

2008-07-10 21:42:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Ingo Molnar <[email protected]> writes:

> /me willing to test & babysit any test-patch in that area ...
>
> this is a big problem and it's getting worse quadratically ;-)
>

Well here is a copy of my old patch to get things started.
It isn't where I'm working right now so I don't have time to rebase
the patch, but the same logic should still apply.

----

From e02f708c0eca6708c8f79824717705379e982fe3 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <[email protected]>
Date: Tue, 13 Feb 2007 02:42:50 -0700
Subject: [PATCH] genirq: Kill the percpu NR_IRQS sized array in kstat.

In struct kernel_stat, which has one instance per cpu, we keep a
count of how many times each irq has occurred on that cpu. Given
that we don't usually use all of our irqs, this is very wasteful
of space, and in particular of percpu space.

This patch replaces that array on all architectures that use
GENERIC_HARD_IRQS with a pointer to an array of per cpu counters
in struct irq_desc. The array is allocated at boot time, after we
have generated the cpu_possible_map, and is only large enough to
hold the largest possible cpu index.

Assuming the common case of dense cpu numbers this consumes roughly the
same amount of space as the current mechanism and removes the NR_IRQS
sized array.

The only immediate win is to get these counts out of the limited size
percpu areas.

Shortly I will make the need for NR_IRQS sized arrays obsolete, allowing
a single kernel to support huge numbers of irqs and still be efficient
on small machines.

With the removal of the NR_IRQS sized arrays this patch will be a clear
size win in space consumption for small machines.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/alpha/kernel/irq.c | 2 +-
arch/alpha/kernel/irq_alpha.c | 2 +-
arch/arm/kernel/irq.c | 2 +-
arch/avr32/kernel/irq.c | 2 +-
arch/cris/kernel/irq.c | 2 +-
arch/frv/kernel/irq.c | 2 +-
arch/i386/kernel/io_apic.c | 2 +-
arch/i386/kernel/irq.c | 2 +-
arch/i386/mach-visws/visws_apic.c | 2 +-
arch/ia64/kernel/irq.c | 2 +-
arch/ia64/kernel/irq_ia64.c | 4 ++--
arch/m32r/kernel/irq.c | 2 +-
arch/mips/au1000/common/time.c | 4 ++--
arch/mips/kernel/irq.c | 2 +-
arch/mips/kernel/time.c | 4 ++--
arch/mips/sgi-ip22/ip22-int.c | 2 +-
arch/mips/sgi-ip22/ip22-time.c | 4 ++--
arch/mips/sgi-ip27/ip27-timer.c | 2 +-
arch/mips/sibyte/bcm1480/smp.c | 2 +-
arch/mips/sibyte/sb1250/irq.c | 2 +-
arch/mips/sibyte/sb1250/smp.c | 2 +-
arch/parisc/kernel/irq.c | 2 +-
arch/powerpc/kernel/irq.c | 2 +-
arch/ppc/amiga/amiints.c | 4 ++--
arch/ppc/amiga/cia.c | 2 +-
arch/ppc/amiga/ints.c | 4 ++--
arch/sh/kernel/irq.c | 2 +-
arch/sparc64/kernel/irq.c | 4 ++--
arch/sparc64/kernel/smp.c | 2 +-
arch/um/kernel/irq.c | 2 +-
arch/x86_64/kernel/irq.c | 6 +-----
arch/xtensa/kernel/irq.c | 2 +-
fs/proc/proc_misc.c | 2 +-
include/linux/irq.h | 4 ++++
include/linux/kernel_stat.h | 20 +++++++++++++++++---
init/main.c | 1 +
kernel/irq/chip.c | 15 +++++----------
kernel/irq/handle.c | 29 +++++++++++++++++++++++++++--
38 files changed, 94 insertions(+), 59 deletions(-)

diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index 3659af8..8e0af05 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -88,7 +88,7 @@ show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(irq));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[irq]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(irq, j));
#endif
seq_printf(p, " %14s", irq_desc[irq].chip->typename);
seq_printf(p, " %c%s",
diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c
index e16aeb6..2c0852c 100644
--- a/arch/alpha/kernel/irq_alpha.c
+++ b/arch/alpha/kernel/irq_alpha.c
@@ -64,7 +64,7 @@ do_entInt(unsigned long type, unsigned long vector,
smp_percpu_timer_interrupt(regs);
cpu = smp_processor_id();
if (cpu != boot_cpuid) {
- kstat_cpu(cpu).irqs[RTC_IRQ]++;
+ irq_desc[RTC_IRQ].kstat_irqs[cpu]++;
} else {
handle_irq(RTC_IRQ);
}
diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index e101846..db79c4c 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -76,7 +76,7 @@ int show_interrupts(struct seq_file *p, void *v)

seq_printf(p, "%3d: ", i);
for_each_present_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next; action; action = action->next)
diff --git a/arch/avr32/kernel/irq.c b/arch/avr32/kernel/irq.c
index fd31124..7cddf0a 100644
--- a/arch/avr32/kernel/irq.c
+++ b/arch/avr32/kernel/irq.c
@@ -56,7 +56,7 @@ int show_interrupts(struct seq_file *p, void *v)

seq_printf(p, "%3d: ", i);
for_each_online_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %8s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next; action; action = action->next)
diff --git a/arch/cris/kernel/irq.c b/arch/cris/kernel/irq.c
index 903ea62..9d7c1d7 100644
--- a/arch/cris/kernel/irq.c
+++ b/arch/cris/kernel/irq.c
@@ -66,7 +66,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/frv/kernel/irq.c b/arch/frv/kernel/irq.c
index 87f360a..ff6579f 100644
--- a/arch/frv/kernel/irq.c
+++ b/arch/frv/kernel/irq.c
@@ -75,7 +75,7 @@ int show_interrupts(struct seq_file *p, void *v)
if (action) {
seq_printf(p, "%3d: ", i);
for_each_present_cpu(cpu)
- seq_printf(p, "%10u ", kstat_cpu(cpu).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, cpu));
seq_printf(p, " %10s", irq_desc[i].chip->name ? : "-");
seq_printf(p, " %s", action->name);
for (action = action->next;
diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index edcc849..c660c8b 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -488,7 +488,7 @@ static void do_irq_balance(void)
if ( package_index == i )
IRQ_DELTA(package_index,j) = 0;
/* Determine the total count per processor per IRQ */
- value_now = (unsigned long) kstat_cpu(i).irqs[j];
+ value_now = (unsigned long) kstat_irqs_cpu(j, i);

/* Determine the activity per processor per IRQ */
delta = value_now - LAST_CPU_IRQ(i,j);
diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
index eeb29af..0a30abc 100644
--- a/arch/i386/kernel/irq.c
+++ b/arch/i386/kernel/irq.c
@@ -279,7 +279,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %8s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);
diff --git a/arch/i386/mach-visws/visws_apic.c b/arch/i386/mach-visws/visws_apic.c
index 38c2b13..0d153eb 100644
--- a/arch/i386/mach-visws/visws_apic.c
+++ b/arch/i386/mach-visws/visws_apic.c
@@ -240,7 +240,7 @@ static irqreturn_t piix4_master_intr(int irq, void *dev_id)
/*
* handle this 'virtual interrupt' as a Cobalt one now.
*/
- kstat_cpu(smp_processor_id()).irqs[realirq]++;
+ desc->kstat_irqs[smp_processor_id()]++;

if (likely(desc->action != NULL))
handle_IRQ_event(realirq, desc->action);
diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c
index ce49c85..06edbb8 100644
--- a/arch/ia64/kernel/irq.c
+++ b/arch/ia64/kernel/irq.c
@@ -73,7 +73,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j) {
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
}
#endif
seq_printf(p, " %14s", irq_desc[i].chip->name);
diff --git a/arch/ia64/kernel/irq_ia64.c b/arch/ia64/kernel/irq_ia64.c
index 456f57b..be1dd6e 100644
--- a/arch/ia64/kernel/irq_ia64.c
+++ b/arch/ia64/kernel/irq_ia64.c
@@ -181,7 +181,7 @@ ia64_handle_irq (ia64_vector vector, struct pt_regs *regs)
ia64_srlz_d();
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_irqs_this_cpu(&irq_desc[vector])++;
else {
ia64_setreg(_IA64_REG_CR_TPR, vector);
ia64_srlz_d();
@@ -228,7 +228,7 @@ void ia64_process_pending_intr(void)
*/
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_irqs_this_cpu(&irq_desc[vector])++;
else {
struct pt_regs *old_regs = set_irq_regs(NULL);

diff --git a/arch/m32r/kernel/irq.c b/arch/m32r/kernel/irq.c
index f8d8650..4fb85b2 100644
--- a/arch/m32r/kernel/irq.c
+++ b/arch/m32r/kernel/irq.c
@@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/mips/au1000/common/time.c b/arch/mips/au1000/common/time.c
index fa1c62f..c2a084e 100644
--- a/arch/mips/au1000/common/time.c
+++ b/arch/mips/au1000/common/time.c
@@ -81,13 +81,13 @@ void mips_timer_interrupt(void)
int irq = 63;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;

if (r4k_offset == 0)
goto null;

do {
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
do_timer(1);
#ifndef CONFIG_SMP
update_process_times(user_mode(get_irq_regs()));
diff --git a/arch/mips/kernel/irq.c b/arch/mips/kernel/irq.c
index 2fe4c86..c2cae91 100644
--- a/arch/mips/kernel/irq.c
+++ b/arch/mips/kernel/irq.c
@@ -115,7 +115,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->name);
seq_printf(p, " %s", action->name);
diff --git a/arch/mips/kernel/time.c b/arch/mips/kernel/time.c
index e5e56bd..0a829e2 100644
--- a/arch/mips/kernel/time.c
+++ b/arch/mips/kernel/time.c
@@ -204,7 +204,7 @@ asmlinkage void ll_timer_interrupt(int irq)
int r2 = cpu_has_mips_r2;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;

/*
* Suckage alert:
@@ -228,7 +228,7 @@ asmlinkage void ll_local_timer_interrupt(int irq)
{
irq_enter();
if (smp_processor_id() != 0)
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;

/* we keep interrupt disabled all the time */
local_timer_interrupt(irq, NULL);
diff --git a/arch/mips/sgi-ip22/ip22-int.c b/arch/mips/sgi-ip22/ip22-int.c
index b454924..382a8a5 100644
--- a/arch/mips/sgi-ip22/ip22-int.c
+++ b/arch/mips/sgi-ip22/ip22-int.c
@@ -164,7 +164,7 @@ static void indy_buserror_irq(void)
int irq = SGI_BUSERR_IRQ;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
ip22_be_interrupt(irq);
irq_exit();
}
diff --git a/arch/mips/sgi-ip22/ip22-time.c b/arch/mips/sgi-ip22/ip22-time.c
index 2055547..0cd6887 100644
--- a/arch/mips/sgi-ip22/ip22-time.c
+++ b/arch/mips/sgi-ip22/ip22-time.c
@@ -182,7 +182,7 @@ void indy_8254timer_irq(void)
char c;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
printk(KERN_ALERT "Oops, got 8254 interrupt.\n");
ArcRead(0, &c, 1, &cnt);
ArcEnterInteractiveMode();
@@ -194,7 +194,7 @@ void indy_r4k_timer_interrupt(void)
int irq = SGI_TIMER_IRQ;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(&irq_desc[irq])++;
timer_interrupt(irq, NULL);
irq_exit();
}
diff --git a/arch/mips/sgi-ip27/ip27-timer.c b/arch/mips/sgi-ip27/ip27-timer.c
index 8c3c78c..592449c 100644
--- a/arch/mips/sgi-ip27/ip27-timer.c
+++ b/arch/mips/sgi-ip27/ip27-timer.c
@@ -106,7 +106,7 @@ again:
if (LOCAL_HUB_L(PI_RT_COUNT) >= ct_cur[cpu])
goto again;

- kstat_this_cpu.irqs[irq]++; /* kstat only for bootcpu? */
+ irq_desc[irq].kstat_irqs[cpu]++; /* kstat only for bootcpu? */

if (cpu == 0)
do_timer(1);
diff --git a/arch/mips/sibyte/bcm1480/smp.c b/arch/mips/sibyte/bcm1480/smp.c
index bf32827..a070238 100644
--- a/arch/mips/sibyte/bcm1480/smp.c
+++ b/arch/mips/sibyte/bcm1480/smp.c
@@ -93,7 +93,7 @@ void bcm1480_mailbox_interrupt(void)
int cpu = smp_processor_id();
unsigned int action;

- kstat_this_cpu.irqs[K_BCM1480_INT_MBOX_0_0]++;
+ irq_desc[K_BCM1480_INT_MBOX_0_0].kstat_irqs[cpu]++;
/* Load the mailbox register to figure out what we're supposed to do */
action = (__raw_readq(mailbox_0_regs[cpu]) >> 48) & 0xffff;

diff --git a/arch/mips/sibyte/sb1250/irq.c b/arch/mips/sibyte/sb1250/irq.c
index 1482394..fb7d77f 100644
--- a/arch/mips/sibyte/sb1250/irq.c
+++ b/arch/mips/sibyte/sb1250/irq.c
@@ -390,7 +390,7 @@ static void sb1250_kgdb_interrupt(void)
* host to stop the break, since we would see another
* interrupt on the end-of-break too)
*/
- kstat_this_cpu.irqs[kgdb_irq]++;
+ kstat_irqs_this_cpu(&irq_desc[kgdb_irq])++;
mdelay(500);
duart_out(R_DUART_CMD, V_DUART_MISC_CMD_RESET_BREAK_INT |
M_DUART_RX_EN | M_DUART_TX_EN);
diff --git a/arch/mips/sibyte/sb1250/smp.c b/arch/mips/sibyte/sb1250/smp.c
index c38e1f3..54c6164 100644
--- a/arch/mips/sibyte/sb1250/smp.c
+++ b/arch/mips/sibyte/sb1250/smp.c
@@ -81,7 +81,7 @@ void sb1250_mailbox_interrupt(void)
int cpu = smp_processor_id();
unsigned int action;

- kstat_this_cpu.irqs[K_INT_MBOX_0]++;
+ irq_desc[K_INT_MBOX_0].kstat_irqs[cpu]++;
/* Load the mailbox register to figure out what we're supposed to do */
action = (____raw_readq(mailbox_regs[cpu]) >> 48) & 0xffff;

diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index b39c5b9..c222bbd 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -192,7 +192,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
#ifdef CONFIG_SMP
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#else
seq_printf(p, "%10u ", kstat_irqs(i));
#endif
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 919fbf5..0be818a 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -189,7 +189,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%3d: ", i);
#ifdef CONFIG_SMP
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#else
seq_printf(p, "%10u ", kstat_irqs(i));
#endif /* CONFIG_SMP */
diff --git a/arch/ppc/amiga/amiints.c b/arch/ppc/amiga/amiints.c
index 265fcd3..3dc8651 100644
--- a/arch/ppc/amiga/amiints.c
+++ b/arch/ppc/amiga/amiints.c
@@ -184,7 +184,7 @@ inline void amiga_do_irq(int irq, struct pt_regs *fp)
irq_desc_t *desc = irq_desc + irq;
struct irqaction *action = desc->action;

- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;
action->handler(irq, action->dev_id, fp);
}

@@ -193,7 +193,7 @@ void amiga_do_irq_list(int irq, struct pt_regs *fp)
irq_desc_t *desc = irq_desc + irq;
struct irqaction *action;

- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;

amiga_custom.intreq = ami_intena_vals[irq];

diff --git a/arch/ppc/amiga/cia.c b/arch/ppc/amiga/cia.c
index 9558f2f..33faf2d 100644
--- a/arch/ppc/amiga/cia.c
+++ b/arch/ppc/amiga/cia.c
@@ -146,7 +146,7 @@ static void cia_handler(int irq, void *dev_id, struct pt_regs *fp)
amiga_custom.intreq = base->int_mask;
for (i = 0; i < CIA_IRQS; i++, irq++) {
if (ints & 1) {
- kstat_cpu(0).irqs[irq]++;
+ desc->kstat_irqs[0]++;
action = desc->action;
action->handler(irq, action->dev_id, fp);
}
diff --git a/arch/ppc/amiga/ints.c b/arch/ppc/amiga/ints.c
index 083a174..84ec6cb 100644
--- a/arch/ppc/amiga/ints.c
+++ b/arch/ppc/amiga/ints.c
@@ -128,7 +128,7 @@ asmlinkage void process_int(unsigned long vec, struct pt_regs *fp)
{
if (vec >= VEC_INT1 && vec <= VEC_INT7 && !MACH_IS_BVME6000) {
vec -= VEC_SPUR;
- kstat_cpu(0).irqs[vec]++;
+ irq_desc[vec].kstat_irqs[0]++;
irq_list[vec].handler(vec, irq_list[vec].dev_id, fp);
} else {
if (mach_process_int)
@@ -147,7 +147,7 @@ int m68k_get_irq_list(struct seq_file *p, void *v)
if (mach_default_handler) {
for (i = 0; i < SYS_IRQS; i++) {
seq_printf(p, "auto %2d: %10u ", i,
- i ? kstat_cpu(0).irqs[i] : num_spurious);
+ i ? kstat_irqs_cpu(i, 0) : num_spurious);
seq_puts(p, " ");
seq_printf(p, "%s\n", irq_list[i].devname);
}
diff --git a/arch/sh/kernel/irq.c b/arch/sh/kernel/irq.c
index 67be2b6..e9c739a 100644
--- a/arch/sh/kernel/irq.c
+++ b/arch/sh/kernel/irq.c
@@ -52,7 +52,7 @@ int show_interrupts(struct seq_file *p, void *v)
goto unlock;
seq_printf(p, "%3d: ",i);
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_cpu(i, j));
seq_printf(p, " %14s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);
seq_printf(p, " %s", action->name);
diff --git a/arch/sparc64/kernel/irq.c b/arch/sparc64/kernel/irq.c
index b5ff3ee..4a436a7 100644
--- a/arch/sparc64/kernel/irq.c
+++ b/arch/sparc64/kernel/irq.c
@@ -154,7 +154,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %9s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
@@ -605,7 +605,7 @@ void timer_irq(int irq, struct pt_regs *regs)
old_regs = set_irq_regs(regs);
irq_enter();

- kstat_this_cpu.irqs[0]++;
+ irq_desc[0].kstat_irqs[0]++;
timer_interrupt(irq, NULL);

irq_exit();
diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index fc99f7b..155703b 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -1212,7 +1212,7 @@ void smp_percpu_timer_interrupt(struct pt_regs *regs)
irq_enter();

if (cpu == boot_cpu_id) {
- kstat_this_cpu.irqs[0]++;
+ irq_desc[0].kstat_irqs[cpu]++;
timer_tick_interrupt(regs);
}

diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c
index 50a288b..fa16410 100644
--- a/arch/um/kernel/irq.c
+++ b/arch/um/kernel/irq.c
@@ -61,7 +61,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 9fe2e28..beefb89 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -69,12 +69,8 @@ int show_interrupts(struct seq_file *p, void *v)
if (!action)
goto skip;
seq_printf(p, "%3d: ",i);
-#ifndef CONFIG_SMP
- seq_printf(p, "%10u ", kstat_irqs(i));
-#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
-#endif
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
seq_printf(p, " %8s", irq_desc[i].chip->name);
seq_printf(p, "-%-8s", irq_desc[i].name);

diff --git a/arch/xtensa/kernel/irq.c b/arch/xtensa/kernel/irq.c
index c9ea73b..c35e271 100644
--- a/arch/xtensa/kernel/irq.c
+++ b/arch/xtensa/kernel/irq.c
@@ -99,7 +99,7 @@ int show_interrupts(struct seq_file *p, void *v)
seq_printf(p, "%10u ", kstat_irqs(i));
#else
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
+ seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
#endif
seq_printf(p, " %14s", irq_desc[i].chip->typename);
seq_printf(p, " %s", action->name);
diff --git a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c
index e2c4c0a..21be453 100644
--- a/fs/proc/proc_misc.c
+++ b/fs/proc/proc_misc.c
@@ -472,7 +472,7 @@ static int show_stat(struct seq_file *p, void *v)
softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
for (j = 0 ; j < NR_IRQS ; j++)
- sum += kstat_cpu(i).irqs[j];
+ sum += kstat_irqs_cpu(j, i);
}

seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n",
diff --git a/include/linux/irq.h b/include/linux/irq.h
index bb78ab9..9c61fd7 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -156,6 +156,7 @@ struct irq_desc {
void *handler_data;
void *chip_data;
struct irqaction *action; /* IRQ action list */
+ unsigned int *kstat_irqs;
unsigned int status; /* IRQ status */

unsigned int depth; /* nested irq disables */
@@ -178,6 +179,9 @@ struct irq_desc {

extern struct irq_desc irq_desc[NR_IRQS];

+#define kstat_irqs_this_cpu(DESC) \
+ ((DESC)->kstat_irqs[smp_processor_id()])
+
/*
* Migration helpers for obsolete names, they will go away:
*/
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 43e895f..0c8f650 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,7 +27,9 @@ struct cpu_usage_stat {

struct kernel_stat {
struct cpu_usage_stat cpustat;
+#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
+#endif
};

DECLARE_PER_CPU(struct kernel_stat, kstat);
@@ -38,15 +40,27 @@ DECLARE_PER_CPU(struct kernel_stat, kstat);

extern unsigned long long nr_context_switches(void);

+#ifndef CONFIG_GENERIC_HARDIRQS
+static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
+{
+ return kstat_cpu(cpu).irqs[irq];
+}
+static inline void init_kstat_irqs(void) {}
+#else
+extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
+extern void init_kstat_irqs(void);
+#endif /* CONFIG_GENERIC_HARDIRQS */
+
/*
* Number of interrupts per specific IRQ source, since bootup
*/
-static inline int kstat_irqs(int irq)
+static inline unsigned int kstat_irqs(unsigned int irq)
{
- int cpu, sum = 0;
+ unsigned int sum = 0;
+ int cpu;

for_each_possible_cpu(cpu)
- sum += kstat_cpu(cpu).irqs[irq];
+ sum += kstat_irqs_cpu(irq, cpu);

return sum;
}
diff --git a/init/main.c b/init/main.c
index a92989e..23f1c64 100644
--- a/init/main.c
+++ b/init/main.c
@@ -559,6 +559,7 @@ asmlinkage void __init start_kernel(void)
sort_main_extable();
trap_init();
rcu_init();
+ init_kstat_irqs();
init_IRQ();
pidhash_init();
init_timers();
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index f83d691..7896286 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -288,13 +288,12 @@ handle_simple_irq(unsigned int irq, struct irq_desc *desc)
{
struct irqaction *action;
irqreturn_t action_ret;
- const unsigned int cpu = smp_processor_id();

spin_lock(&desc->lock);

if (unlikely(desc->status & IRQ_INPROGRESS))
goto out_unlock;
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;

action = desc->action;
if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
@@ -332,7 +331,6 @@ out_unlock:
void fastcall
handle_level_irq(unsigned int irq, struct irq_desc *desc)
{
- unsigned int cpu = smp_processor_id();
struct irqaction *action;
irqreturn_t action_ret;

@@ -342,7 +340,7 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc)
if (unlikely(desc->status & IRQ_INPROGRESS))
goto out_unlock;
desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;

/*
* If its disabled or no action available
@@ -383,7 +381,6 @@ out_unlock:
void fastcall
handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc)
{
- unsigned int cpu = smp_processor_id();
struct irqaction *action;
irqreturn_t action_ret;

@@ -393,7 +390,7 @@ handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc)
goto out;

desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;

/*
* If its disabled or no action available
@@ -442,8 +439,6 @@ out:
void fastcall
handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
- const unsigned int cpu = smp_processor_id();
-
spin_lock(&desc->lock);

desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -460,7 +455,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)
goto out_unlock;
}

- kstat_cpu(cpu).irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;

/* Start handling the irq */
desc->chip->ack(irq);
@@ -516,7 +511,7 @@ handle_percpu_irq(unsigned int irq, struct irq_desc *desc)
{
irqreturn_t action_ret;

- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;

if (desc->chip->ack)
desc->chip->ack(irq);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index aff1f0f..27cf665 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/bootmem.h>

#include "internals.h"

@@ -30,7 +31,7 @@ void fastcall
handle_bad_irq(unsigned int irq, struct irq_desc *desc)
{
print_irq_desc(irq, desc);
- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
ack_bad_irq(irq);
}

@@ -170,7 +171,7 @@ fastcall unsigned int __do_IRQ(unsigned int irq)
struct irqaction *action;
unsigned int status;

- kstat_this_cpu.irqs[irq]++;
+ kstat_irqs_this_cpu(desc)++;
if (CHECK_IRQ_PER_CPU(desc->status)) {
irqreturn_t action_ret;

@@ -269,3 +270,27 @@ void early_init_irq_lock_class(void)
}

#endif
+
+__init void init_kstat_irqs(void)
+{
+ unsigned entries = 0, cpu;
+ unsigned int irq;
+ unsigned bytes;
+
+ /* Compute the worst case size of a per cpu array */
+ for_each_possible_cpu(cpu)
+ if (cpu >= entries)
+ entries = cpu + 1;
+
+ /* Compute how many bytes we need per irq and allocate them */
+ bytes = entries*sizeof(unsigned int);
+ for (irq = 0; irq < NR_IRQS; irq++)
+ irq_desc[irq].kstat_irqs = alloc_bootmem(bytes);
+}
+
+unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
+{
+ struct irq_desc *desc = irq_desc + irq;
+ return desc->kstat_irqs[cpu];
+}
+EXPORT_SYMBOL(kstat_irqs_cpu);
--
1.5.0.g53756

2008-07-10 23:37:17

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>>
>
> Even (or perhaps especially) so, allocating the percpu area in 2 MB
> increments is a total nonstarter. It hurts the small, common
> configurations way too much. For SGI, it's probably fine.
>
> -hpa

Yes, "right-sizing" the kernel for systems from 512M laptops to 4k cpu
systems with memory totally maxed out, using only a binary distribution
has proven tricky (at best... ;-)

One alternative was to only allocate a chunk similar in size to
PERCPU_ENOUGH_ROOM and allow for startup options to create a bigger
space if needed. Though the realloc idea has some merit if we can
validate the non-use of pointers to specific cpu's percpu vars.
(CPU_ALLOC contains a function to dereference a percpu offset.)
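
A minimal sketch of what such a startup option might look like (the
option name "percpu_alloc", the variable, and the default are made up
here for illustration; they are not part of the patchset):

/*
 * Sketch only: let the kernel command line override the boot-time
 * percpu area size, e.g. "percpu_alloc=8M".
 */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/percpu.h>

static unsigned long percpu_area_size __initdata = PERCPU_ENOUGH_ROOM;

static int __init percpu_alloc_setup(char *str)
{
        if (!str)
                return -EINVAL;
        percpu_area_size = memparse(str, &str); /* accepts K/M/G suffixes */
        return 0;
}
early_param("percpu_alloc", percpu_alloc_setup);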

The discussion started at:

http://marc.info/?l=linux-kernel&m=121212026716085&w=4

Thanks,
Mike

2008-07-11 00:12:39

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
...
> -- I believe 2 MB mappings are too large, except perhaps as an option.
>
> -hpa

Hmm, that might be the way to go.... At boot up time determine the
size of the system in terms of cpu count and memory available and
attempt to do the right thing, with startup options to override the
internal choices... ?

(Surely a system that has a "gazillion ip tunnels" could modify its
kernel start options... ;-)

Unfortunately, we can't use a MODULE to support different options unless
we change how the kernel starts up (would need to mount the root fs
before starting secondary cpus.)

Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
clears up the error. So except for the "stack overflow" message I got
yesterday, the result is the same. As soon as I get a chance, I'll try
it out with gcc-4.2.0 to see if it changed the boot up problem.

Thanks,
Mike

[ 0.096000] ACPI: Core revision 20080321
[ 0.108889] Parsing all Control Methods:
[ 0.116198] Table [DSDT](id 0001) - 364 Objects with 40 Devices 109 Methods 20 Regions
[ 0.124000] Parsing all Control Methods:
[ 0.128000] Table [SSDT](id 0002) - 43 Objects with 0 Devices 16 Methods 0 Regions
[ 0.132000] tbxface-0598 [02] tb_load_namespace : ACPI Tables successfully acquired
[ 0.148000] evxfevnt-0091 [02] enable : Transition to ACPI mode successful
[ 0.200000] CPU0: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz stepping 07
[ 0.211685] Using local APIC timer interrupts.
[ 0.220000] APIC timer calibration result 20781901
[ 0.224000] Detected 20.781 MHz APIC timer.
[ 0.228000] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 0.228000] IP: [<0000000000000000>]
[ 0.228000] PGD 0
[ 0.228000] Oops: 0010 [1] SMP
[ 0.228000] CPU 0
[ 0.228000] Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip-ingo-test-0701-00208-g79a4d68-dirty #7
[ 0.228000] RIP: 0010:[<0000000000000000>] [<0000000000000000>]
[ 0.228000] RSP: 0000:ffff81022ed1fe18 EFLAGS: 00010286
[ 0.228000] RAX: 0000000000000000 RBX: ffff81022ed1fe84 RCX: ffffffff80d0de80
[ 0.228000] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffffff80d0de80
[ 0.228000] RBP: ffff81022ed1fe50 R08: ffff81022ed1fe84 R09: ffffffff80e28ae0
[ 0.228000] R10: ffff81022ed1fe80 R11: ffff81022ed39188 R12: 00000000ffffffff
[ 0.228000] R13: ffffffff80d0de40 R14: 0000000000000001 R15: 0000000000000003
[ 0.228000] FS: 0000000000000000(0000) GS:ffffffff80de69c0(0000) knlGS:0000000000000000
[ 0.228000] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 0.228000] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[ 0.228000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.228000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 0.228000] Process swapper (pid: 1, threadinfo ffff81022ed10000, task ffff81022ec93100)
[ 0.228000] Stack: ffffffff8024f34c 0000000000000000 0000000000000001 0000000000000001
[ 0.228000] 00000000fffffff0 ffffffff80e8e4e0 0000000000092fd0 ffff81022ed1fe60
[ 0.228000] ffffffff8024f3cc ffff81022ed1fea0 ffffffff808ef5b8 0000000000000008
[ 0.228000] Call Trace:
[ 0.228000] [<ffffffff8024f34c>] ? notifier_call_chain+0x38/0x60
[ 0.228000] [<ffffffff8024f3cc>] __raw_notifier_call_chain+0xe/0x10
[ 0.228000] [<ffffffff808ef5b8>] cpu_up+0xa8/0x138
[ 0.228000] [<ffffffff80e4d9b9>] kernel_init+0xdf/0x327
[ 0.228000] [<ffffffff8020d4b8>] child_rip+0xa/0x12
[ 0.228000] [<ffffffff8020c955>] ? restore_args+0x0/0x30
[ 0.228000] [<ffffffff80e4d8da>] ? kernel_init+0x0/0x327
[ 0.228000] [<ffffffff8020d4ae>] ? child_rip+0x0/0x12
[ 0.228000]
[ 0.228000]
[ 0.228000] Code: Bad RIP value.
[ 0.228000] RIP [<0000000000000000>]
[ 0.228000] RSP <ffff81022ed1fe18>
[ 0.228000] CR2: 0000000000000000
[ 0.232000] ---[ end trace a7919e7f17c0a725 ]---
[ 0.236000] Kernel panic - not syncing: Attempted to kill init!

2008-07-11 00:21:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
>
> Hmm, that might be the way to go.... At boot up time determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?
>
> (Surely a system that has a "gazillion ip tunnels" could modify its
> kernel start options... ;-)
>
> Unfortunately, we can't use a MODULE to support different options unless
> we change how the kernel starts up (would need to mount the root fs
> before starting secondary cpus.)
>

Using a module doesn't make any sense anyway. This is more what the
kernel command line is for.

-hpa

2008-07-11 00:52:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis <[email protected]> writes:

> Btw, the "zero_based_only" patch (w/o the pda folded into the percpu
> area) gets to the point shown below. Dropping NR_CPUS from 4096 to 256
> clears up the error. So except for the "stack overflow" message I got
> yesterday, the result is the same. As soon as I get a chance, I'll try
> it out with gcc-4.2.0 to see if it changed the boot up problem.

Thanks, that seems to confirm the suspicion that it is the zero-based
percpu segment that is causing problems.

In this case it appears that the notifier block for one of the cpu notifiers
got stomped, or never got initialized.

Eric

2008-07-11 00:55:45

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> Percpu on i386 hasn't been a point of discussion. It works fine, and
>> has been working fine for a long time. The same mechanism would work
>> fine on x86-64. Its only "issue" is that it doesn't support the
>> broken gcc abi for stack-protector.
>>
>> The problem is all zero-based percpu on x86-64.
>>
>
> Well, x86-64 has *two* issues: limited range of offsets (regardless of
> if we do RIP-relative or not), and the stack-protector ABI.
>
> I'm still trying to reproduce Mike's setup, but I suspect it can be
> switched to RIP-relative for the fixed-offset (static) stuff; for the
> dynamic stuff it's all via pointers anyway so the offsets don't matter.
>
> -hpa

I'm rebuilding my tip tree now, that should bring it up to date. I'll
repost patches #1 to (currently) #4 shortly.

I'm looking at some code that does not include patches I sent in some
4 to 5 months ago (acpi/NR_CPUS-related changes).

Thanks,
Mike

2008-07-11 00:58:53

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Mike Travis wrote:
>>
>> Hmm, that might be the way to go.... At boot up time determine the
>> size of the system in terms of cpu count and memory available and
>> attempt to do the right thing, with startup options to override the
>> internal choices... ?
>>
>> (Surely a system that has a "gazillion ip tunnels" could modify its
>> kernel start options... ;-)
>>
>> Unfortunately, we can't use a MODULE to support different options unless
>> we change how the kernel starts up (would need to mount the root fs
>> before starting secondary cpus.)
>>
>
> Using a module doesn't make any sense anyway. This is more what the
> kernel command line is for.
>
> -hpa


I was thinking that supporting virtual percpu addresses would take a fair
amount of code that, if living in a MODULE, wouldn't impact small systems.
But it seems to be not worth the effort... ;-)

Thanks,
Mike

2008-07-11 01:39:35

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
> Mike Travis <[email protected]> writes:
>
>
>> The biggest growth came from moving all the xxx[NR_CPUS] arrays into
>> the per cpu area. So you free up a huge amount of unused memory when
>> the NR_CPUS count starts getting into the ozone layer. 4k now, 16k
>> real soon now, ??? future?
>
> Hmm. Do you know how big a role kernel_stat plays.
>
> It is a per cpu structure that is sized via NR_IRQS. NR_IRQS is by NR_CPUS.
> So ultimately the amount of memory take up is NR_CPUS*NR_CPUS*32 or so.
>
> I have a patch I wrote long ago, that addresses that specific nasty configuration
> by moving the per cpu irq counters into pointer available from struct irq_desc.
>
> The next step which I did not get to (but is interesting from a scaling perspective)
> was to start dynamically allocating the irq structures.
>
> Eric

If you could dig that up, that would be great. Another engineer here at
SGI took that task off my hands, and he's been able to do a few things to
reduce the "# irqs", but irq_desc is still one of the bigger static arrays (>256k).

(There was some discussion a while back on this very subject.)

The top data users are:

====== Data (-l 500)
1 - ingo-test-0701-256
2 - 4k-defconfig
3 - ingo-test-0701

.1. .2. .3. ..final..
1048576 -917504 +917504 1048576 . __log_buf(.bss)
262144 -262144 +262144 262144 . gl_hash_table(.bss)
122360 -122360 +122360 122360 . g_bitstream(.data)
119756 -119756 +119756 119756 . init_data(.rodata)
89760 -89760 +89760 89760 . o2net_nodes(.bss)
76800 -76800 +614400 614400 +700% early_node_map(.data)
44548 -44548 +44548 44548 . typhoon_firmware_image(.rodata)
43008 +215040 . 258048 +500% irq_desc(.data.cacheline_aligned)
42768 -42768 +42768 42768 . s_firmLoad(.data)
41184 -41184 +41184 41184 . saa7134_boards(.data)
38912 -38912 +38912 38912 . dabusb(.bss)
34804 -34804 +34804 34804 . g_Firmware(.data)
32768 -32768 +32768 32768 . read_buffers(.bss)
19968 -19968 +159744 159744 +700% initkmem_list3(.init.data)
18041 -18041 +18041 18041 . OperationalCodeImage_GEN1(.data)
16507 -16507 +16507 16507 . OperationalCodeImage_GEN2(.data)
16464 -16464 +16464 16464 . ipw_geos(.rodata)
16388 +114688 -114688 16388 . map_pid_to_cmdline(.bss)
16384 -16384 +16384 16384 . gl_hash_locks(.bss)
16384 +245760 . 262144 +1500% boot_pageset(.bss)
16128 +215040 . 231168 +1333% irq_cfg(.data.read_mostly)

2008-07-11 01:48:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
>
> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)
>

No, and it seems pretty toxic. If we're doing virtual, they should
almost certainly always be virtual (except perhaps on UP.) Page size is
a separate issue.

-hpa

2008-07-11 03:02:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis <[email protected]> writes:

> If you could dig that up, that would be great. Another engr here at SGI
> took that task off my hands and he's been able to do a few things to reduce
> the "# irqs" but irq_desc is still one of the bigger static arrays (>256k).

I posted the part I had completed, which takes the NR_IRQS array out of
kernel_stat. Here are my mental notes on how to handle the rest.

Also, if you will notice, on x86_64 everything that is per-irq is in
irq_cfg, which explains why irq_cfg grows. We have those crazy, almost
useless bitmaps of which cpus we want to direct irqs to in the irq
configuration, so that doesn't help.

The array sized by NR_IRQS irqs are in:

drivers/char/random.c:static struct timer_rand_state *irq_timer_state[NR_IRQS];
looks like it should go in irq_desc (it's a generic feature); see the sketch after this list.

drivers/pcmcia/pcmcia_resource.c:static u8 pcmcia_used_irq[NR_IRQS];
That number should be 16, possibly 32 for sanity, not NR_IRQS.

drivers/net/hamradio/scc.c:static struct irqflags { unsigned char used : 1; } Ivec[NR_IRQS];
drivers/serial/68328serial.c:struct m68k_serial *IRQ_ports[NR_IRQS];
drivers/serial/8250.c:static struct irq_info irq_lists[NR_IRQS];
drivers/serial/m32r_sio.c:static struct irq_info irq_lists[NR_IRQS];
These are all drivers and should allocate a proper per-irq structure like every other driver.

drivers/xen/events.c:static struct packed_irq irq_info[NR_IRQS];
drivers/xen/events.c:static int irq_bindcount[NR_IRQS];
For all intents and purposes this is another architecture; it should be
fixed up at some point.
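
For the random.c case the move is mechanical once a field lives in
irq_desc; a hypothetical sketch (neither the field nor the helper
exists yet):

/* Hypothetical: with a timer_rand_state pointer in irq_desc, the
 * static NR_IRQS array in drivers/char/random.c goes away. */
static inline struct timer_rand_state *get_timer_rand_state(unsigned int irq)
{
        return irq_desc[irq].timer_rand_state;  /* assumed new member */
}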


The interfaces from include/linux/interrupt.h that take an irq number
are slow path.

So it is just a matter of writing an irq_descp(irq) that takes an irq
number and returns an irq_desc pointer. The definition would go something
like:

#ifndef CONFIG_DYNAMIC_NR_IRQ
#define irq_descp(irq) \
        (((irq) >= 0 && (irq) < NR_IRQS) ? (irq_desc + (irq)) : NULL)
#else
struct irq_desc *irq_descp(int irq)
{
        struct irq_desc *desc, *found = NULL;

        rcu_read_lock();
        list_for_each_entry_rcu(desc, &irq_list, list) {
                if (desc->irq == irq) {
                        found = desc;
                        break;
                }
        }
        rcu_read_unlock();      /* caller must otherwise pin desc's lifetime */
        return found;
}
#endif

Then the generic irq code just needs to use irq_descp throughout.
And the arch code needs to allocate/free irq_descs and add them to the
list. With say:
int add_irq_desc(int irq, struct irq_desc *desc)
{
        struct irq_desc *old;
        int error = -EINVAL;

        spin_lock(&irq_list_lock);
        old = irq_descp(irq);
        if (old)
                goto out;
        list_add_rcu(&desc->list, &irq_list);
        error = 0;
out:
        spin_unlock(&irq_list_lock);
        return error;
}
With the architecture picking the irq number, it can be stable and have
meaning to users.
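
A hypothetical arch-side sketch of that allocate-and-register step,
reusing the names above (desc->irq, add_irq_desc are from the snippets,
not existing kernel API; error handling minimal):

#include <linux/slab.h>

static struct irq_desc *create_irq_desc(int irq)
{
        struct irq_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);

        if (!desc)
                return NULL;
        desc->irq = irq;
        spin_lock_init(&desc->lock);
        if (add_irq_desc(irq, desc)) {
                kfree(desc);            /* irq number already registered */
                return NULL;
        }
        return desc;
}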

Starting from that direction it isn't too hard and it should yield timely results.

Eric

2008-07-11 06:59:47

by Rusty Russell

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

On Friday 11 July 2008 06:22:52 Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
> > No, that sounds like a bad idea. For one, how would you enforce it? How
> > would you check for it? It's one of those things that would mostly work
> > and then fail very rarely.
>
> Well the easiest way would be to avoid the letting people take the address
> of per cpu memory, and just provide macros to read/write it. We are 90% of
> the way there already so it isn't a big jump.

Hi Eric,

I decided against that originally, but we can revisit that decision. But
it would *not* be easy. Try it on kernel/sched.c which uses per-cpu "struct
rq".

Perhaps we could limit dynamically allocated per-cpu mem this way though...
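
For illustration, accessors along those lines could be built on the
existing __get_cpu_var/preempt helpers without ever exposing an address;
a sketch, with made-up names:

#define cpu_var_read(var)                               \
({                                                      \
        typeof(__get_cpu_var(var)) __val;               \
        preempt_disable();                              \
        __val = __get_cpu_var(var);                     \
        preempt_enable();                               \
        __val;                                          \
})

#define cpu_var_write(var, val)                         \
do {                                                    \
        preempt_disable();                              \
        __get_cpu_var(var) = (val);                     \
        preempt_enable();                               \
} while (0)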

Rusty.

2008-07-11 15:33:54

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

H. Peter Anvin wrote:
> Christoph Lameter wrote:
>> H. Peter Anvin wrote:
>>
>>> And how much is that, especially on *small* systems?
>>
>> i386?
>>
>> i386 uses 4K mappings. There are just a few cpus supported, there is
>> scarcity of ZONE_NORMAL memory so the per cpu areas really cannot get
>> that big. See the cpu_alloc patchsets for i386.
>>
>
> No, not i386. x86-64.

x86_64 are small systems? For 64 bit use one would expect 4GB of memory.

2008-07-11 15:37:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> H. Peter Anvin wrote:
> ...
>> -- I believe 2 MB mappings are too large, except perhaps as an option.
>>
>> -hpa
>
> Hmm, that might be the way to go.... At boot up time determine the
> size of the system in terms of cpu count and memory available and
> attempt to do the right thing, with startup options to override the
> internal choices... ?

Ok. That is an extension of the static per cpu area scenario already supported by cpu_alloc. Should not be too difficult to implement.

2008-07-11 15:38:05

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:

> I was thinking that supporting virtual percpu addresses would take a fair
> amount of code that, if living in a MODULE, wouldn't impact small systems.
> But it seems to be not worth the effort... ;-)

For base page mappings that logic is already provided by the vmalloc subsystem. For 2M mappings the vmemmap logic can be used.
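
For illustration, the base-page case is essentially the existing
page-array mapping primitive; a sketch, not the actual cpu_alloc code:

/* Back a percpu chunk with discontiguous pages through the vmalloc
 * subsystem's page-array mapping (error handling elided). */
#include <linux/vmalloc.h>
#include <linux/mm.h>

static void *map_percpu_chunk(struct page **pages, unsigned int nr_pages)
{
        return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}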

2008-07-11 16:14:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Christoph Lameter wrote:
>>>
>> No, not i386. x86-64.
>
> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.

Hardly so. i386 has too many other limitations; for one thing, it
starts having performance problems (needing HIGHMEM) at less than 1 GB.
And then there are the additional registers, etc.

-hpa

2008-07-11 17:02:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses


> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.

Anything over 1G where 32bit runs out of lowmem starts to be a win.
Plus sometimes it just doesn't make sense to use a 32bit kernel.

Expecting 4GB of real memory seems silly.

Eric

2008-07-11 17:17:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Eric W. Biederman wrote:
>> x86_64 are small systems? For 64 bit use one would expect 4GB of memory.
>
> Anything over 1G where 32bit runs out of lowmem starts to be a win.
> Plus sometimes it just doesn't make sense to use a 32bit kernel.
>
> Expecting 4GB of real memory seems silly.

Especially since people set up VM instances small and then grow them.
They can have *weird* CPU-to-RAM ratios; I have heard of 8 VCPUs and
24 MB of RAM in production.

-hpa

2008-07-25 15:49:25

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Did the suspected linker bug issue ever get resolved?
>
> I don't believe so. I think Mike is getting very early crashes
> depending on some combination of gcc, linker and kernel config. Or
> something.
>
> This fragility makes me very nervous. It seems hard enough to get this
> stuff working with current tools; making it work over the whole range of
> supported tools looks like its going to be hard.
>
> J

FYI, I think it was a combination of errors that was causing my problems.
In any case, I've successfully compiled and booted Ingo's pesky config
config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc 4.2.0, 4.2.3 and 4.2.4
on both Intel and AMD boxes. (As well as a variety of other configs.)

I think Hugh's change adding "text" to the BUILD_IRQ macro might also have
helped, since interrupts seemed to have always been involved in the panics.
That might explain the variable-ness of the gcc versions.

Thanks,
Mike

2008-07-25 16:09:11

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> FYI, I think it was a combination of errors that was causing my problems.
> In any case, I've successfully compiled and booted Ingo's pesky config
> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc 4.2.0, 4.2.3 and 4.2.4
> on both Intel and AMD boxes. (As well as a variety of other configs.)
>

Good. What compilers have you tested? Will it work over the complete
supported range?

> I think Hugh's change adding "text" to the BUILD_IRQ macro might also have
> helped, since interrupts seemed to have always been involved in the panics.
> That might explain the variable-ness of the gcc versions.

Yes, indeed. That was a nasty one, and will be very sensitive to the
exact order gcc decided to emit things.

J

2008-07-25 16:47:01

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> FYI, I think it was a combination of errors that was causing my problems.
>> In any case, I've successfully compiled and booted Ingo's pesky config
>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc 4.2.0, 4.2.3 and
>> 4.2.4
>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>
>
> Good. What compilers have you tested? Will it work over the complete
> supported range?

The oldest gcc I have available is 3.4.3 (and I just tried that one and
it worked.) So that one and the ones listed above I've verified using
Ingo's test config. (For all other testing I'm using 4.2.3.)

>
>> I think Hugh's change adding "text" to the BUILD_IRQ macro might also
>> have
>> helped, since interrupts seemed to have always been involved in the
>> panics.
>> That might explain the variable-ness of the gcc versions.
>
> Yes, indeed. That was a nasty one, and will be very sensitive to the
> exact order gcc decided to emit things.
>
> J

2008-07-25 16:59:15

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
>
>> Mike Travis wrote:
>>
>>> FYI, I think it was a combination of errors that was causing my problems.
>>> In any case, I've successfully compiled and booted Ingo's pesky config
>>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc 4.2.0, 4.2.3 and
>>> 4.2.4
>>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>>
>>>
>> Good. What compilers have you tested? Will it work over the complete
>> supported range?
>>
>
> The oldest gcc I have available is 3.4.3 (and I just tried that one and
> it worked.) So that one and the ones listed above I've verified using
> Ingo's test config. (For all other testing I'm using 4.2.3.)
>

OK. I've got a range of toolchains on various test machines around
here, so I'll give it a spin. Are your most recent changes in tip.git yet?

J

2008-07-25 18:13:15

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC 00/15] x86_64: Optimize percpu accesses

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> Jeremy Fitzhardinge wrote:
>>
>>> Mike Travis wrote:
>>>
>>>> FYI, I think it was a combination of errors that was causing my
>>>> problems.
>>>> In any case, I've successfully compiled and booted Ingo's pesky config
>>>> config-Tue_Jul__1_16_48_45_CEST_2008.bad with gcc 4.2.0, 4.2.3 and
>>>> 4.2.4
>>>> on both Intel and AMD boxes. (As well as a variety of other configs.)
>>>>
>>> Good. What compilers have you tested? Will it work over the complete
>>> supported range?
>>>
>>
>> The oldest gcc I have available is 3.4.3 (and I just tried that one and
>> it worked.) So that one and the ones listed above I've verified using
>> Ingo's test config. (All other testing I'm using 4.2.3.)
>>
>
> OK. I've got a range of toolchains on various test machines around
> here, so I'll give it a spin. Are your most recent changes in tip.git yet?
>
> J


Almost there...