This patch allows the user to disable write-combined mapping
of the efifb framebuffer console using a nowc option.

A customer noticed major slowdowns in other tasks running on the
same CPU while logging to the console with write combining enabled
(a 10x or greater slowdown on all other cores of the CPU that is
doing the logging).

I reproduced this on a machine with dual CPUs:
Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)

I wrote a test that just mmaps the PCI BAR and writes to it in
a loop. While this was running in the background on a single
core (taskset -c 1), building a kernel up to init/version.o
(taskset -c 8) went from 13s to 133s or so. I've yet to explain
why this occurs or what is going wrong; I haven't managed to find
a perf command that in any way gives insight into this.

11,885,070,715 instructions # 1.39 insns per cycle
vs
12,082,592,342 instructions # 0.13 insns per cycle

is the only thing I've spotted of interest. I've tried at least:
dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses

For now it seems at least a good idea to allow a user to disable write
combining if they see this, until we can figure it out.

Note also that most users get a real framebuffer driver loaded when KMS
kicks in; it just happens that on these machines the kernel didn't
support the GPU-specific driver.
Signed-off-by: Dave Airlie <[email protected]>
---
Documentation/fb/efifb.txt | 6 ++++++
drivers/video/fbdev/efifb.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/Documentation/fb/efifb.txt b/Documentation/fb/efifb.txt
index a59916c..1a85c1b 100644
--- a/Documentation/fb/efifb.txt
+++ b/Documentation/fb/efifb.txt
@@ -27,5 +27,11 @@ You have to add the following kernel parameters in your elilo.conf:
Macbook Pro 17", iMac 20" :
video=efifb:i20
+Accepted options:
+
+nowc Don't map the framebuffer write combined. This can be used
+ to work around side-effects and slowdowns on other CPU cores
+ when large amounts of console data are written.
+
--
Edgar Hucek <[email protected]>
diff --git a/drivers/video/fbdev/efifb.c b/drivers/video/fbdev/efifb.c
index b827a81..a568fe0 100644
--- a/drivers/video/fbdev/efifb.c
+++ b/drivers/video/fbdev/efifb.c
@@ -17,6 +17,7 @@
#include <asm/efi.h>
static bool request_mem_succeeded = false;
+static bool nowc = false;
static struct fb_var_screeninfo efifb_defined = {
.activate = FB_ACTIVATE_NOW,
@@ -99,6 +100,8 @@ static int efifb_setup(char *options)
screen_info.lfb_height = simple_strtoul(this_opt+7, NULL, 0);
else if (!strncmp(this_opt, "width:", 6))
screen_info.lfb_width = simple_strtoul(this_opt+6, NULL, 0);
+ else if (!strcmp(this_opt, "nowc"))
+ nowc = true;
}
}
@@ -255,7 +258,10 @@ static int efifb_probe(struct platform_device *dev)
info->apertures->ranges[0].base = efifb_fix.smem_start;
info->apertures->ranges[0].size = size_remap;
- info->screen_base = ioremap_wc(efifb_fix.smem_start, efifb_fix.smem_len);
+ if (nowc)
+ info->screen_base = ioremap(efifb_fix.smem_start, efifb_fix.smem_len);
+ else
+ info->screen_base = ioremap_wc(efifb_fix.smem_start, efifb_fix.smem_len);
if (!info->screen_base) {
pr_err("efifb: abort, cannot ioremap video memory 0x%x @ 0x%lx\n",
efifb_fix.smem_len, efifb_fix.smem_start);
--
2.9.4
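
The test program described in the commit message is not shown in this
thread. A minimal sketch of such a writer, assuming the BAR is reached
through the PCI sysfs resource files (for example
/sys/bus/pci/devices/0000:01:00.1/resource0, or resource0_wc for a
write-combined mapping of a prefetchable BAR), might look like the
following. The names map_bar and write_loop, the 32-bit store width and
the device path are assumptions made for illustration, not taken from the
original test.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill `len` bytes at `base` with 32-bit stores, `passes` times over.
 * The volatile qualifier keeps the compiler from dropping or merging
 * the stores. (Hypothetical helper, not from the original test.)
 */
static void write_loop(volatile uint32_t *base, size_t len,
                       unsigned long passes)
{
    size_t words = len / sizeof(uint32_t);

    for (unsigned long p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i++)
            base[i] = (uint32_t)(p + i);
}

/*
 * Map `len` bytes of the file at `path` read/write and shared.
 * Returns NULL on failure. For a PCI BAR, `path` would be a sysfs
 * resource file (an assumption; adjust to the device under test).
 */
static volatile uint32_t *map_bar(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}
```

A main() would call map_bar() on the resource file, then write_loop() with
a large pass count, and the binary would be pinned with taskset as in the
commit message. Mapping a sysfs resource file normally requires root.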
On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote:
> This patch allows the user to disable write combined mapping
> of the efifb framebuffer console using an nowc option.
>
> A customer noticed major slowdowns while logging to the console
> with write combining enabled, on other tasks running on the same
> CPU. (10x or greater slow down on all other cores on the same CPU
> as is doing the logging).
>
> I reproduced this on a machine with dual CPUs.
> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>
> I wrote a test that just mmaps the pci bar and writes to it in
> a loop, while this was running in the background on a single
> core with (taskset -c 1), building a kernel up to init/version.o
> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> why this occurs or what is going wrong I haven't managed to find
> a perf command that in any way gives insight into this.
>
> 11,885,070,715 instructions # 1.39 insns per cycle
> vs
> 12,082,592,342 instructions # 0.13 insns per cycle
>
> is the only thing I've spotted of interest, I've tried at least:
> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses
>
> For now it seems at least a good idea to allow a user to disable write
> combining if they see this until we can figure it out.
Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
using ioremap_wc() for the exact same reason. I'm not against letting
the user force one way or the other if it helps, though it sure would be
nice to know why.
Anyway,
Acked-by: Peter Jones <[email protected]>
Bartlomiej, do you want to handle this in your devel tree?
--
Peter
On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <[email protected]> wrote:
>
> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
> using ioremap_wc() for the exact same reason. I'm not against letting
> the user force one way or the other if it helps, though it sure would be
> nice to know why.
It's kind of amazing for another reason too: how is ioremap_wc()
_possibly_ slower than ioremap_nocache() (which is what plain
ioremap() is)?
The difference is literally _PAGE_CACHE_MODE_WC vs _PAGE_CACHE_MODE_UC_MINUS.
Both of them should be uncached, but WC should allow much better write
behavior. It should also allow much better system behavior.
This really sounds like a band-aid patch that just hides some other
issue entirely. Maybe we screw up the cache modes for some PAT mode
setup?
Or maybe it really is something where there is one global write queue
per die (not per CPU), and having that write queue "active" doing
combining will slow down every core due to some crazy synchronization
issue?
x86 people, look at what Dave Airlie did, I'll just repeat it because
it sounds so crazy:
> A customer noticed major slowdowns while logging to the console
> with write combining enabled, on other tasks running on the same
> CPU. (10x or greater slow down on all other cores on the same CPU
> as is doing the logging).
>
> I reproduced this on a machine with dual CPUs.
> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>
> I wrote a test that just mmaps the pci bar and writes to it in
> a loop, while this was running in the background on a single
> core with (taskset -c 1), building a kernel up to init/version.o
> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> why this occurs or what is going wrong I haven't managed to find
> a perf command that in any way gives insight into this.
So basically the UC vs WC thing seems to slow down somebody *else* (in
this case a kernel compile) on another core entirely, by a factor of
10x. Maybe the WC writer itself is much faster, but _others_ are
slowed down enormously.
Whaa? That just seems incredible.
Dave - while your test sounds very simple, can you package it up some
way so that somebody inside of Intel can just run it on one of their
machines?
The patch itself (to allow people to *not* do WC that is supposed to
be so much better but clearly doesn't seem to be) looks fine to me,
but it would be really good to get intel to look at this.
Linus
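
As a rough cross-check of the "factor of 10x" figure, the two perf lines
in the original report can be turned into cycle counts, since cycles are
instructions divided by instructions per cycle. The sketch below does only
that arithmetic; the function names are made up for illustration.

```c
/* cycles = instructions / (instructions per cycle) */
static double cycles_from_ipc(double instructions, double ipc)
{
    return instructions / ipc;
}

/* Ratio of cycles spent in the slow run to cycles spent in the fast run. */
static double slowdown(double insns_fast, double ipc_fast,
                       double insns_slow, double ipc_slow)
{
    return cycles_from_ipc(insns_slow, ipc_slow) /
           cycles_from_ipc(insns_fast, ipc_fast);
}
```

With the reported values (11,885,070,715 instructions at 1.39 IPC versus
12,082,592,342 at 0.13 IPC) the ratio comes out a little under 11, which
is in line with the wall-clock change from 13s to 133s. The instruction
count barely moves, so nearly all of the slowdown is stalled cycles.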
On 19 July 2017 at 05:57, Linus Torvalds <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <[email protected]> wrote:
>>
>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>> using ioremap_wc() for the exact same reason. I'm not against letting
>> the user force one way or the other if it helps, though it sure would be
>> nice to know why.
>
> It's kind of amazing for another reason too: how is ioremap_wc()
> _possibly_ slower than ioremap_nocache() (which is what plain
> ioremap() is)?
In normal operation the console is faster with _wc. It's the side effects
on other cores that are the problem.
> Or maybe it really is something where there is one global write queue
> per die (not per CPU), and having that write queue "active" doing
> combining will slow down every core due to some crazy synchronization
> issue?
>
> x86 people, look at what Dave Airlie did, I'll just repeat it because
> it sounds so crazy:
>
>> A customer noticed major slowdowns while logging to the console
>> with write combining enabled, on other tasks running on the same
>> CPU. (10x or greater slow down on all other cores on the same CPU
>> as is doing the logging).
>>
>> I reproduced this on a machine with dual CPUs.
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the pci bar and writes to it in
>> a loop, while this was running in the background on a single
>> core with (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong I haven't managed to find
>> a perf command that in any way gives insight into this.
>
> So basically the UC vs WC thing seems to slow down somebody *else* (in
> this case a kernel compile) on another core entirely, by a factor of
> 10x. Maybe the WC writer itself is much faster, but _others_ are
> slowed down enormously.
>
> Whaa? That just seems incredible.
Yes, I've been staring at this for a while now trying to narrow it down. I've
been a bit slow on testing it on a wider range of Intel CPUs; I've only really
managed to play on that particular machine.

I've attached two test files. Compile both of them (I just used make
write_resource burn-cycles).

On my test CPU, cores 1 and 8 are on the same die.

time taskset -c 1 ./burn-cycles
takes about 6 seconds

taskset -c 8 ./write_resource wc
taskset -c 1 ./burn-cycles
takes about 1 minute.

Now I've noticed that running write_resource with or without wc doesn't seem
to make a difference, so I think it matters that efifb has already used _wc
for the memory area and set PAT on it for WC, and we always get WC on that BAR.

From the other person seeing it:
"I done a similar test some time ago, the result was the same.
I ran some benchmarks, and it seems that when data set fits in L1
cache there is no significant performance degradation."
Dave.
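
The attached burn-cycles program is not reproduced in this thread. Going
by the remark that a data set fitting in L1 shows no significant
degradation, a victim workload would need a working set well beyond L1.
A hedged sketch of such a loop follows; the function name, the 64-byte
line size and the checksum are assumptions for illustration, not the
original program.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Walk a zeroed buffer of `bytes` bytes `passes` times, doing one
 * read-modify-write per assumed 64-byte cache line. Returns a checksum
 * so the compiler cannot discard the loop; returns 0 if the allocation
 * fails. (Hypothetical stand-in for the attached burn-cycles test.)
 */
static uint64_t burn_cycles(size_t bytes, unsigned passes)
{
    size_t words = bytes / sizeof(uint64_t);
    size_t stride = 64 / sizeof(uint64_t); /* one access per cache line */
    uint64_t *buf = calloc(words ? words : 1, sizeof(uint64_t));
    uint64_t sum = 0;

    if (!buf)
        return 0;
    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < words; i += stride) {
            buf[i] += i + 1;
            sum += buf[i];
        }
    }
    free(buf);
    return sum;
}
```

Calling it with a buffer of tens of megabytes and timing it under
taskset -c 1, with and without the writer running on another core, would
mirror the comparison described above. A buffer of a few kilobytes should
stay within L1 and, per the report, show little difference.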
On 19 July 2017 at 06:44, Dave Airlie <[email protected]> wrote:
> On 19 July 2017 at 05:57, Linus Torvalds <[email protected]> wrote:
>> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <[email protected]> wrote:
>>>
>>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>>> using ioremap_wc() for the exact same reason. I'm not against letting
>>> the user force one way or the other if it helps, though it sure would be
>>> nice to know why.
>>
>> It's kind of amazing for another reason too: how is ioremap_wc()
>> _possibly_ slower than ioremap_nocache() (which is what plain
>> ioremap() is)?
>
> In normal operation the console is faster with _wc. It's the side effects
> on other cores that is the problem.
>
>> Or maybe it really is something where there is one global write queue
>> per die (not per CPU), and having that write queue "active" doing
>> combining will slow down every core due to some crazy synchronization
>> issue?
>>
>> x86 people, look at what Dave Airlie did, I'll just repeat it because
>> it sounds so crazy:
>>
>>> A customer noticed major slowdowns while logging to the console
>>> with write combining enabled, on other tasks running on the same
>>> CPU. (10x or greater slow down on all other cores on the same CPU
>>> as is doing the logging).
>>>
>>> I reproduced this on a machine with dual CPUs.
>>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>>
>>> I wrote a test that just mmaps the pci bar and writes to it in
>>> a loop, while this was running in the background on a single
>>> core with (taskset -c 1), building a kernel up to init/version.o
>>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>>> why this occurs or what is going wrong I haven't managed to find
>>> a perf command that in any way gives insight into this.
>>
>> So basically the UC vs WC thing seems to slow down somebody *else* (in
>> this case a kernel compile) on another core entirely, by a factor of
>> 10x. Maybe the WC writer itself is much faster, but _others_ are
>> slowed down enormously.
>>
>> Whaa? That just seems incredible.
>
> Yes I've been staring at this for a while now trying to narrow it down, I've
> been a bit slow on testing it on a wider range of Intel CPUs, I've
> only really managed
> to play on that particular machine,
>
> I've attached two test files. compile both of them (I just used make
> write_resource burn-cycles).
>
> On my test CPU core 1/8 are on same die.
>
> time taskset -c 1 ./burn-cycles
> takes about 6 seconds
>
> taskset -c 8 ./write_resource wc
> taskset -c 1 ./burn-cycles
> takes about 1 minute.
>
> Now I've noticed write_resource wc or not wc doesn't seem to make a
> difference, so
> I think it matters that efifb has used _wc for the memory area already
> and set PAT on it for wc,
> and we always get wc on that BAR.
>
> From the other person seeing it:
> "I done a similar test some time ago, the result was the same.
> I ran some benchmarks, and it seems that when data set fits in L1
> cache there is no significant performance degradation."
Oh and just FYI, the machine I've tested this on has an mgag200 server
graphics card backing the framebuffer, but with just efifb loaded.
Dave.
On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <[email protected]> wrote:
>
> Oh and just FYI, the machine I've tested this on has an mgag200 server
> graphics card backing the framebuffer, but with just efifb loaded.
Yeah, it looks like it needs special hardware - and particularly the
kind of garbage hardware that people only have on servers.
Why do server people continually do absolute sh*t hardware? It's crap,
crap, crap across the board outside the CPU. Nasty and bad hacky stuff
that nobody else would touch with a ten-foot pole, and the "serious
enterprise" people lap it up like it was ambrosia.
It's not just "graphics is bad anyway since we don't care". It's all
the things they ostensibly _do_ care about too, like the disk and the
fabric infrastructure. Buggy nasty crud.
Anyway, rant over. I wonder if we could show this without special
hardware by just mapping some region that doesn't even have hardware
in it as WC. Do we even expose the PAT settings to user space, though,
or do we always have to have some fake module to create the PAT stuff?
Linus
On 19 July 2017 at 08:22, Linus Torvalds <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <[email protected]> wrote:
>>
>> Oh and just FYI, the machine I've tested this on has an mgag200 server
>> graphics card backing the framebuffer, but with just efifb loaded.
>
> Yeah, it looks like it needs special hardware - and particularly the
> kind of garbage hardware that people only have on servers.
>
> Why do server people continually do absolute sh*t hardware? It's crap,
> crap, crap across the board outside the CPU. Nasty and bad hacky stuff
> that nobody else would touch with a ten-foot pole, and the "serious
> enterprise" people lap it up like it was ambrosia.
>
> It's not just "graphics is bad anyway since we don't care". It's all
> the things they ostensibly _do_ care about too, like the disk and the
> fabric infrastructure. Buggy nasty crud.
I've tried to reproduce now on:
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
using some address space from
02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
And I don't see the issue.
I'll try to track down some more EFI-compatible MGA or other weird server
chips if I can.
> Anyway, rant over. I wonder if we could show this without special
> hardware by just mapping some region that doesn't even have hardware
> in it as WC. Do we even expose the PAT settings to user space, though,
> or do we always have to have some fake module to create the PAT stuff?
I do wonder wtf the hw could be doing that would cause this, but I've no idea
how to tell what difference a write combined PCI transaction would have on the
bus side of things, and what the device could generate that would cause such
a horrible slowdown.
Dave.
On 19 July 2017 at 09:16, Dave Airlie <[email protected]> wrote:
> On 19 July 2017 at 08:22, Linus Torvalds <[email protected]> wrote:
>> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <[email protected]> wrote:
>>>
>>> Oh and just FYI, the machine I've tested this on has an mgag200 server
>>> graphics card backing the framebuffer, but with just efifb loaded.
>>
>> Yeah, it looks like it needs special hardware - and particularly the
>> kind of garbage hardware that people only have on servers.
>>
>> Why do server people continually do absolute sh*t hardware? It's crap,
>> crap, crap across the board outside the CPU. Nasty and bad hacky stuff
>> that nobody else would touch with a ten-foot pole, and the "serious
>> enterprise" people lap it up like it was ambrosia.
>>
>> It's not just "graphics is bad anyway since we don't care". It's all
>> the things they ostensibly _do_ care about too, like the disk and the
>> fabric infrastructure. Buggy nasty crud.
>
> I've tried to reproduce now on:
> Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
> using some address space from
> 02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
>
> And I don't see the issue.
>
> I'll try and track down some more efi compatible mga or other weird server chips
> stuff if I can.
>
>> Anyway, rant over. I wonder if we could show this without special
>> hardware by just mapping some region that doesn't even have hardware
>> in it as WC. Do we even expose the PAT settings to user space, though,
>> or do we always have to have some fake module to create the PAT stuff?
>
> I do wonder wtf the hw could be doing that would cause this, but I've no idea
> how to tell what difference a write combined PCI transaction would have on the
> bus side of things, and what the device could generate that would cause such
> a horrible slowdown.
>
> Dave.
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
G200EH (rev 01) (prog-if 00 [VGA controller])
Subsystem: Hewlett-Packard Company iLO4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 255
Region 0: Memory at 91000000 (32-bit, prefetchable) [size=16M]
Region 1: Memory at 92a88000 (32-bit, non-prefetchable) [size=16K]
Region 2: Memory at 92000000 (32-bit, non-prefetchable) [size=8M]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [a8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+
Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s,
Latency L0 <4us, L1 <4us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train-
SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported,
TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms,
TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-,
LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
The above is the full lspci -vvv output for the VGA device in question.
Dave.
On 19 July 2017 at 09:16, Dave Airlie <[email protected]> wrote:
> On 19 July 2017 at 09:16, Dave Airlie <[email protected]> wrote:
>> On 19 July 2017 at 08:22, Linus Torvalds <[email protected]> wrote:
>>> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <[email protected]> wrote:
>>>>
>>>> Oh and just FYI, the machine I've tested this on has an mgag200 server
>>>> graphics card backing the framebuffer, but with just efifb loaded.
>>>
>>> Yeah, it looks like it needs special hardware - and particularly the
>>> kind of garbage hardware that people only have on servers.
>>>
>>> Why do server people continually do absolute sh*t hardware? It's crap,
>>> crap, crap across the board outside the CPU. Nasty and bad hacky stuff
>>> that nobody else would touch with a ten-foot pole, and the "serious
>>> enterprise" people lap it up like it was ambrosia.
>>>
>>> It's not just "graphics is bad anyway since we don't care". It's all
>>> the things they ostensibly _do_ care about too, like the disk and the
>>> fabric infrastructure. Buggy nasty crud.
>>
>> I've tried to reproduce now on:
>> Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
>> using some address space from
>> 02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
>>
>> And I don't see the issue.
>>
>> I'll try and track down some more efi compatible mga or other weird server chips
>> stuff if I can.
>>
>>> Anyway, rant over. I wonder if we could show this without special
>>> hardware by just mapping some region that doesn't even have hardware
>>> in it as WC. Do we even expose the PAT settings to user space, though,
>>> or do we always have to have some fake module to create the PAT stuff?
>>
>> I do wonder wtf the hw could be doing that would cause this, but I've no idea
>> how to tell what difference a write combined PCI transaction would have on the
>> bus side of things, and what the device could generate that would cause such
>> a horrible slowdown.
>>
>> Dave.
>
More digging:
Single CPU system:
Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
Now I can't get efifb to load on this (it's a remote machine and I've no
idea how to get an EFI install onto it), but booting with no framebuffer
and running the tests on the MGA shows no slowdown.
Now I'm starting to wonder if it's something that only happens on
multi-socket systems.
Dave.
On Tue, Jul 18, 2017 at 5:00 PM, Dave Airlie <[email protected]> wrote:
>
> More digging:
> Single CPU system:
> Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
>
> Now I can't get efifb to load on this (due to it being remote and I've
> no idea how to make
> my install efi onto it), but booting with no framebuffer, and running
> the tests on the mga,
> show no slowdown on this.
Is it actually using write-combining memory without a frame buffer,
though? I don't think it is. So the lack of slowdown might be just
from that.
> Now I'm starting to wonder if it's something that only happens on
> multi-socket systems.
Hmm. I guess that's possible, of course.
[ Wild and crazy handwaving... ]
Without write combining, all the uncached writes will be fully
serialized and there is no buffering in the chip write buffers. There
will be at most one outstanding PCI transaction in the uncore write
buffer.
In contrast, _with_ write combining, the write buffers in the uncore
can fill up.
But why should that matter? Maybe memory ordering. When one of the
cores (doesn't matter *which* core) wants to get a cacheline for
exclusive use (ie it did a write to it), it will need to invalidate
cachelines in other cores. However, the uncore now has all those PCI
writes buffered, and the write ordering says that they should happen
before the memory writes. So before it can give the core exclusive
ownership of the new cacheline, it needs to wait for all those
buffered writes to be pushed out, so that no other CPU can see the new
write *before* the device saw the old writes.
But I'm not convinced this is any different in a multi-socket
situation than it is in a single-socket one. The other cores on the
same socket should not be able to see the writes out of order
_either_.
And honestly, I think PCI write posting rules makes the above crazy
handwaving completely bogus anyway. Writes _can_ be posted, so the
memory ordering isn't actually that tight.
I dunno. I really think it would be good if somebody inside Intel
would look at it..
Linus
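
One way to approach the question of whether a range really is mapped
write-combining is the PAT debugfs listing, /sys/kernel/debug/x86/pat_memtype_list
(available with debugfs mounted on PAT-enabled x86 kernels). The sketch
below checks one line of that file against an address. It assumes the
line format "<type> @ 0x<start>-0x<end>" with an exclusive end; the
format has differed between kernel versions, so treat both the format
and the helper name as assumptions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the assumed form
 *     "write-combining @ 0x91000000-0x92000000"
 * and, if `addr` lies in [start, end), copy the memory type name into
 * `type_out` and return 1. Returns 0 for non-matching or unparsable lines.
 */
static int memtype_for_addr(const char *line, unsigned long long addr,
                            char *type_out, size_t type_len)
{
    char type[32];
    unsigned long long start, end;

    if (sscanf(line, "%31s @ 0x%llx-0x%llx", type, &start, &end) != 3)
        return 0;
    if (addr < start || addr >= end)
        return 0;
    snprintf(type_out, type_len, "%s", type);
    return 1;
}
```

A caller would read the file line by line with fgets() and pass the BAR
base reported by lspci (0x91000000 for Region 0 in the listing earlier
in the thread) as addr.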
On 19 July 2017 at 11:15, Linus Torvalds <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 5:00 PM, Dave Airlie <[email protected]> wrote:
>>
>> More digging:
>> Single CPU system:
>> Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
>>
>> Now I can't get efifb to load on this (due to it being remote and I've
>> no idea how to make
>> my install efi onto it), but booting with no framebuffer, and running
>> the tests on the mga,
>> show no slowdown on this.
>
> Is it actually using write-combining memory without a frame buffer,
> though? I don't think it is. So the lack of slowdown might be just
> from that.
>
>> Now I'm starting to wonder if it's something that only happens on
>> multi-socket systems.
>
> Hmm. I guess that's possible, of course.
>
> [ Wild and crazy handwaving... ]
>
> Without write combining, all the uncached writes will be fully
> serialized and there is no buffering in the chip write buffers. There
> will be at most one outstanding PCI transaction in the uncore write
> buffer.
>
> In contrast, _with_ write combining, the write buffers in the uncore
> can fill up.
>
> But why should that matter? Maybe memory ordering. When one of the
> cores (doesn't matter *which* core) wants to get a cacheline for
> exclusive use (ie it did a write to it), it will need to invalidate
> cachelines in other cores. However, the uncore now has all those PCI
> writes buffered, and the write ordering says that they should happen
> before the memory writes. So before it can give the core exclusive
> ownership of the new cacheline, it needs to wait for all those
> buffered writes to be pushed out, so that no other CPU can see the new
> write *before* the device saw the old writes.
>
> But I'm not convinced this is any different in a multi-socket
> situation than it is in a single-socket one. The other cores on the
> same socket should not be able to see the writes out of order
> _either_.
>
> And honestly, I think PCI write posting rules makes the above crazy
> handwaving completely bogus anyway. Writes _can_ be posted, so the
> memory ordering isn't actually that tight.
>
> I dunno. I really think it would be good if somebody inside Intel
> would look at it..
Yes, hoping someone can give some insight.
Scrap the multi-socket theory: it's been seen on a single-socket system
too, though not as drastic (2x rather than 10x slowdowns).
It's starting to seem like the commonality might be the Matrox G200EH,
which is part of the HP remote management iLO hardware. It might be that
the RAM on the other side of the PCIe connection is causing some sort of
weird stalls or slowdowns. I'm not sure how best to validate that either.
Dave.
On Wed, Jul 19, 2017 at 9:07 PM, Dave Airlie <[email protected]> wrote:
>
> Yes hoping someone can give some insight.
>
> Scrap the multi-socket it's been seen on a single-socket, but not as
> drastic, 2x rather than 10x slowdowns.
>
> It's starting to seem like the commonality might be the Matrox G200EH
> which is part of the HP remote management iLO hardware, it might be that
> the RAM on the other side of the PCIE connection is causing some sort of
> weird stalls or slowdowns. I'm not sure how best to validate that either.
It shouldn't be that hard to hack up efifb to allocate some actual RAM
as "framebuffer", unmap it from the direct map, and ioremap_wc() it as
usual. Then you could see if PCIe is important for it.
WC streaming writes over PCIe end up doing 64 byte writes, right?
Maybe the Matrox chip is just extremely slow handling 64b writes.
On Wed, Jul 19, 2017 at 9:28 PM, Andy Lutomirski <[email protected]> wrote:
>
> It shouldn't be that hard to hack up efifb to allocate some actual RAM
> as "framebuffer", unmap it from the direct map, and ioremap_wc() it as
> usual. Then you could see if PCIe is important for it.
The thing is, the "actual RAM" case is unlikely to show this issue.
RAM is special, even when you try to mark it WC or whatever. Yes, it
might be slowed down by lack of caching, but the uncore still *knows*
it is RAM. The accesses go to the memory controller, not the PCI side.
> WC streaming writes over PCIe end up doing 64 byte writes, right?
> Maybe the Matrox chip is just extremely slow handling 64b writes.
.. or maybe there is some unholy "management logic" thing that catches
those writes, because this is server hardware, and server vendors
invariably add "value add" (read: shit) to their hardware to justify
the high price.
Like the Intel "management console" that was such a "security feature".
I think one of the points of those magic graphics cards is that you
can export the frame buffer over the management network, so that you
can still run the graphical Windows GUI management stuff. Because you
wouldn't want to just ssh into it and run command line stuff.
So I wouldn't be surprised at all if the thing has a special back
channel to the network chip with a queue of changes going over
ethernet or something, and then when you stream things at high speeds
to the GPU DRAM, you fill up the management bandwidth.
If it was actual framebuffer DRAM, I would expect it to be *happy*
with streaming 64-bit writes. But some special "management interface
ASIC" that tries to keep track of GPU framebuffer "damage" might be
something else altogether.
Linus
* Linus Torvalds <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 2:21 PM, Dave Airlie <[email protected]> wrote:
> >
> > Oh and just FYI, the machine I've tested this on has an mgag200 server
> > graphics card backing the framebuffer, but with just efifb loaded.
>
> Yeah, it looks like it needs special hardware - and particularly the
> kind of garbage hardware that people only have on servers.
>
> Why do server people continually do absolute sh*t hardware? It's crap,
> crap, crap across the board outside the CPU. Nasty and bad hacky stuff
> that nobody else would touch with a ten-foot pole, and the "serious
> enterprise" people lap it up like it was ambrosia.
>
> It's not just "graphics is bad anyway since we don't care". It's all
> the things they ostensibly _do_ care about too, like the disk and the
> fabric infrastructure. Buggy nasty crud.
I believe it's crappy for similar reasons why almost all other large scale pieces
of human technological infrastructure are crappy if you look deep under the hood:
transportation and communication networks, banking systems, manufacturing, you
name it.
The main reasons are:
- Cost of a clean redesign is an order of magnitude higher than the next delta
revision, once you have accumulated a few decades of legacy.
- The path dependent evolutionary legacies become so ugly after time that most
good people will run away from key elements - so there's not enough internal
energy to redesign and implement a clean methodology from the ground up.
- Even if there are enough good people, the benefits of a clean design are long
term, and constantly hindered by short term pricing.
- For non-experts it's hard to tell a good, clean redesign from a flashy but
fundamentally flawed redesign. Both are expensive and the latter can have
disastrous outcomes.
- These are high margin businesses, with customers captured by legacies, where
you can pass down the costs to customers, which hides the true costs of crap.
i.e. typical free market failure due to high complexity combined with (very) long
price propagation latencies and opaqueness of pricing.
I believe the only place where you'll find overall beautiful server hardware as a
rule and not as an exception is in satellite technology: when the unit price is in
excess of $100m, expected life span is 10-20 years with no on-site maintenance,
and it's all running in a fundamentally hostile environment, then clean and robust
hardware design is forced at every step by physics.
Humanity is certainly able to design beautiful hardware, once all other options
are exhausted.
Thanks,
Ingo
On 20 July 2017 at 14:44, Linus Torvalds <[email protected]> wrote:
> On Wed, Jul 19, 2017 at 9:28 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> It shouldn't be that hard to hack up efifb to allocate some actual RAM
>> as "framebuffer", unmap it from the direct map, and ioremap_wc() it as
>> usual. Then you could see if PCIe is important for it.
>
> The thing is, the "actual RAM" case is unlikely to show this issue.
>
> RAM is special, even when you try to mark it WC or whatever. Yes, it
> might be slowed down by lack of caching, but the uncore still *knows*
> it is RAM. The accesses go to the memory controller, not the PCI side.
>
>> WC streaming writes over PCIe end up doing 64 byte writes, right?
>> Maybe the Matrox chip is just extremely slow handling 64b writes.
>
> .. or maybe there is some unholy "management logic" thing that catches
> those writes, because this is server hardware, and server vendors
> invariably add "value add" (read; shit) to their hardware to justify
> the high price.
>
> Like the Intel "management console" that was such a "security feature".
>
> I think one of the points of those magic graphics cards is that you
> can export the frame buffer over the management network, so that you
> can still run the graphical Windows GUI management stuff. Because you
> wouldn't want to just ssh into it and run command line stuff.
>
> So I wouldn't be surprised at all if the thing has a special back
> channel to the network chip with a queue of changes going over
> ethernet or something, and then when you stream things at high speeds
> to the GPU DRAM, you fill up the management bandwidth.
>
> If it was actual framebuffer DRAM, I would expect it to be *happy*
> with streaming 64-bit writes. But some special "management interface
> ASIC" that tries to keep track of GPU framebuffer "damage" might be
> something else altogether.
>
I think it's just some RAM on the management console device that is
partitioned and exposed via the PCI BAR on the mga vga device.
I expect it possibly can't handle lots of writes very well and sends something
back that causes the stalls. I'm not even sure how to prove it.
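One way to probe this, in the spirit of the test described in the patch
(mmap the PCI BAR and write to it in a loop), is sketched below. This is a
minimal illustration and not the actual test program: the function names
map_bar() and stream_writes() are made up here, and the path passed in is
whatever sysfs resource file belongs to the device backing the framebuffer.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of a file (intended: a PCI BAR exposed through sysfs).
 * Returns NULL on failure. The path is supplied by the caller; nothing
 * here knows about a specific device.
 */
volatile uint64_t *map_bar(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	return p == MAP_FAILED ? NULL : (volatile uint64_t *)p;
}

/*
 * Stream 64-bit stores over the whole region `passes` times.
 * Returns the total number of bytes written.
 */
uint64_t stream_writes(volatile uint64_t *base, size_t len, unsigned passes)
{
	size_t words = len / sizeof(uint64_t);
	uint64_t written = 0;

	for (unsigned p = 0; p < passes; p++) {
		for (size_t i = 0; i < words; i++)
			base[i] = (uint64_t)p;
		written += words * sizeof(uint64_t);
	}
	return written;
}
```

Pointing map_bar() at the device's sysfs resource0 file, or at resource0_wc
where the kernel exposes one, and running the loop pinned with taskset -c 1
while timing a build on another core would compare the two mapping types.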
So I expect we should at least land this patch for now, so people who do suffer
from this can disable write combining, and if we can narrow it down to a PCI ID
or subsystem ID for certain HP iLO devices, then we can add a blacklist.
I wonder if anyone knows anyone from the HPE iLO team.
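A rough sketch of what such a blacklist lookup might look like, assuming a
match on all four PCI IDs. Everything here is hypothetical: the struct, the
function name efifb_wants_nowc(), and above all the ID values, which are
placeholders and not the IDs of any real iLO device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One blacklist entry; all four IDs must match. */
struct nowc_quirk {
	uint16_t vendor;
	uint16_t device;
	uint16_t subvendor;
	uint16_t subdevice;
};

/* PLACEHOLDER values for illustration only, not verified hardware IDs. */
static const struct nowc_quirk nowc_quirks[] = {
	{ 0x1234, 0x5678, 0x9abc, 0xdef0 },
};

/* Return true if the device is listed and should be mapped without WC. */
bool efifb_wants_nowc(uint16_t vendor, uint16_t device,
		      uint16_t subvendor, uint16_t subdevice)
{
	size_t n = sizeof(nowc_quirks) / sizeof(nowc_quirks[0]);

	for (size_t i = 0; i < n; i++) {
		const struct nowc_quirk *q = &nowc_quirks[i];

		if (q->vendor == vendor && q->device == device &&
		    q->subvendor == subvendor && q->subdevice == subdevice)
			return true;
	}
	return false;
}
```

Matching on the subsystem IDs as well as the device ID would keep the quirk
from hitting boards that use the same graphics chip without the management
hardware, if that turns out to be the distinguishing factor.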
Dave.
On 19 July 2017 at 00:34, Peter Jones <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote:
>> This patch allows the user to disable write combined mapping
>> of the efifb framebuffer console using an nowc option.
>>
>> A customer noticed major slowdowns while logging to the console
>> with write combining enabled, on other tasks running on the same
>> CPU. (10x or greater slow down on all other cores on the same CPU
>> as is doing the logging).
>>
>> I reproduced this on a machine with dual CPUs.
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the pci bar and writes to it in
>> a loop, while this was running in the background one a single
>> core with (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong I haven't managed to find
>> a perf command that in any way gives insight into this.
>>
>> 11,885,070,715 instructions # 1.39 insns per cycle
>> vs
>> 12,082,592,342 instructions # 0.13 insns per cycle
>>
>> is the only thing I've spotted of interest, I've tried at least:
>> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses
>>
>> For now it seems at least a good idea to allow a user to disable write
>> combining if they see this until we can figure it out.
>
> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
> using ioremap_wc() for the exact same reason. I'm not against letting
> the user force one way or the other if it helps, though it sure would be
> nice to know why.
>
> Anyway,
>
> Acked-By: Peter Jones <[email protected]>
>
> Bartlomiej, do you want to handle this in your devel tree?
I'm happy to stick this in a drm-fixes pull with this ack.
Dave.
On Tuesday, July 25, 2017 02:00:00 PM Dave Airlie wrote:
> On 19 July 2017 at 00:34, Peter Jones <[email protected]> wrote:
> > On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote:
> >> This patch allows the user to disable write combined mapping
> >> of the efifb framebuffer console using an nowc option.
> >>
> >> A customer noticed major slowdowns while logging to the console
> >> with write combining enabled, on other tasks running on the same
> >> CPU. (10x or greater slow down on all other cores on the same CPU
> >> as is doing the logging).
> >>
> >> I reproduced this on a machine with dual CPUs.
> >> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
> >>
> >> I wrote a test that just mmaps the pci bar and writes to it in
> >> a loop, while this was running in the background one a single
> >> core with (taskset -c 1), building a kernel up to init/version.o
> >> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> >> why this occurs or what is going wrong I haven't managed to find
> >> a perf command that in any way gives insight into this.
> >>
> >> 11,885,070,715 instructions # 1.39 insns per cycle
> >> vs
> >> 12,082,592,342 instructions # 0.13 insns per cycle
> >>
> >> is the only thing I've spotted of interest, I've tried at least:
> >> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses
> >>
> >> For now it seems at least a good idea to allow a user to disable write
> >> combining if they see this until we can figure it out.
> >
> > Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
> > using ioremap_wc() for the exact same reason. I'm not against letting
> > the user force one way or the other if it helps, though it sure would be
> > nice to know why.
> >
> > Anyway,
> >
> > Acked-By: Peter Jones <[email protected]>
> >
> > Bartlomiej, do you want to handle this in your devel tree?
>
> I'm happy to stick this in a drm-fixes pull with this ack.
I'll put it into fbdev fixes for 4.13 with other fbdev patches.
Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
On Tuesday, July 25, 2017 10:56:15 AM Bartlomiej Zolnierkiewicz wrote:
> On Tuesday, July 25, 2017 02:00:00 PM Dave Airlie wrote:
> > On 19 July 2017 at 00:34, Peter Jones <[email protected]> wrote:
> > > On Tue, Jul 18, 2017 at 04:09:09PM +1000, Dave Airlie wrote:
> > >> This patch allows the user to disable write combined mapping
> > >> of the efifb framebuffer console using an nowc option.
> > >>
> > >> A customer noticed major slowdowns while logging to the console
> > >> with write combining enabled, on other tasks running on the same
> > >> CPU. (10x or greater slow down on all other cores on the same CPU
> > >> as is doing the logging).
> > >>
> > >> I reproduced this on a machine with dual CPUs.
> > >> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
> > >>
> > >> I wrote a test that just mmaps the pci bar and writes to it in
> > >> a loop, while this was running in the background one a single
> > >> core with (taskset -c 1), building a kernel up to init/version.o
> > >> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
> > >> why this occurs or what is going wrong I haven't managed to find
> > >> a perf command that in any way gives insight into this.
> > >>
> > >> 11,885,070,715 instructions # 1.39 insns per cycle
> > >> vs
> > >> 12,082,592,342 instructions # 0.13 insns per cycle
> > >>
> > >> is the only thing I've spotted of interest, I've tried at least:
> > >> dTLB-stores,dTLB-store-misses,L1-dcache-stores,LLC-store,LLC-store-misses,LLC-load-misses,LLC-loads,\mem-loads,mem-stores,iTLB-loads,iTLB-load-misses,cache-references,cache-misses
> > >>
> > >> For now it seems at least a good idea to allow a user to disable write
> > >> combining if they see this until we can figure it out.
> > >
> > > Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
> > > using ioremap_wc() for the exact same reason. I'm not against letting
> > > the user force one way or the other if it helps, though it sure would be
> > > nice to know why.
> > >
> > > Anyway,
> > >
> > > Acked-By: Peter Jones <[email protected]>
> > >
> > > Bartlomiej, do you want to handle this in your devel tree?
> >
> > I'm happy to stick this in a drm-fixes pull with this ack.
>
> I'll put it into fbdev fixes for 4.13 with other fbdev patches.
Patch queued for 4.13, thanks.
Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
On 07/18/17 13:44, Dave Airlie wrote:
>
> In normal operation the console is faster with _wc. It's the side effects
> on other cores that is the problem.
>
I'm guessing leaving these as UC- rate-limits them so it doesn't
interfere with the I/O operations on the other cores...
-hpa
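For reference, a minimal userspace sketch of the choice the nowc option
controls: on x86, a plain ioremap() gives a UC- mapping while ioremap_wc()
gives a write-combined one. The names has_option(), choose_mapping() and
enum fb_map_type are invented for this illustration and are not taken from
efifb.c; the sketch only models the option parsing, not the mapping itself.

```c
#include <stdbool.h>
#include <string.h>

/* Which kind of mapping the framebuffer would get. */
enum fb_map_type {
	FB_MAP_WC,		/* write-combined (the default) */
	FB_MAP_UC_MINUS,	/* what a plain ioremap() yields on x86 */
};

/* Return true if the comma-separated option string contains exactly `name`. */
bool has_option(const char *opts, const char *name)
{
	size_t name_len = strlen(name);

	while (opts && *opts) {
		size_t tok_len = strcspn(opts, ",");

		if (tok_len == name_len && strncmp(opts, name, name_len) == 0)
			return true;
		opts += tok_len;
		if (*opts == ',')
			opts++;		/* skip the separator */
	}
	return false;
}

/* "nowc" anywhere in the option list disables write combining. */
enum fb_map_type choose_mapping(const char *opts)
{
	return has_option(opts, "nowc") ? FB_MAP_UC_MINUS : FB_MAP_WC;
}
```

With this model, booting with video=efifb:nowc corresponds to
choose_mapping("nowc") and selects the UC- path, while any option string
without nowc keeps the write-combined default.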