2019-08-13 09:37:46

by Feng Tang

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> Hi Thomas,
>
> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> > Hi,
> >
> > >>Actually we run the benchmark as a background process, do we need to
> > >>disable the cursor and test again?
> > >There's a worker thread that updates the display from the shadow buffer.
> > >The blinking cursor periodically triggers the worker thread, but the
> > >actual update is just the size of one character.
> > >
> > >The point of the test without output is to see if the regression comes
> > >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> > >from the worker thread. If the regression goes away after disabling the
> > >blinking cursor, then the worker thread is the problem. If it already
> > >goes away if there's simply no output from the test, the screen update
> > >is the problem. On my machine I have to disable the blinking cursor, so
> > >I think the worker causes the performance drop.
> >
> > We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> > gone.
> >
> > commit:
> >   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> > emulation
> >
> > f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> > ----------------  -------------------------- ---------------------------
> >          %stddev      change         %stddev
> >              \          |                \
> >      43785                       44481
> > vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >      43785                       44481        GEO-MEAN vm-scalability.median
>
> Till now, from Rong's tests:
> 1. Disabling cursor blinking doesn't cure the regression.
> 2. Disabling printing test results to the console can work around the
> regression.
>
> Also, if we set prefer_shadow to 0, the regression is also
> gone.

We also did some further breakdown of the time consumed by the
new code.

The drm_fb_helper_dirty_work() function sequentially calls:
1. drm_client_buffer_vmap (290 us)
2. drm_fb_helper_dirty_blit_real (19240 us)
3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
4. drm_client_buffer_vunmap (215 us)

The average run time is listed after the function names.

From this, we can see that drm_fb_helper_dirty_blit_real() takes too
long (about 20 ms for each run). I guess this is the root cause
of the regression, as the original code doesn't use this dirty worker.
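
To make this concrete, here is a simplified sketch of the worker's flow,
abbreviated from drm_fb_helper.c in linux-next (locking and some error
handling are trimmed, so this is not the exact upstream code):

static void drm_fb_helper_dirty_work(struct work_struct *work)
{
	struct drm_fb_helper *helper = container_of(work, struct drm_fb_helper,
						    dirty_work);
	/* snapshot and reset the pending clip; done under helper->dirty_lock */
	struct drm_clip_rect clip = helper->dirty_clip;
	void *vaddr;

	if (helper->buffer) {
		/* 1. map the client buffer (~290 us in our runs) */
		vaddr = drm_client_buffer_vmap(helper->buffer);
		if (IS_ERR(vaddr))
			return;
		/* 2. copy the dirty lines from the shadow buffer (~19240 us) */
		drm_fb_helper_dirty_blit_real(helper, &clip);
	}
	/* 3. the driver's dirty callback -- NULL for the mgag200 driver */
	if (helper->fb->funcs->dirty)
		helper->fb->funcs->dirty(helper->fb, NULL, 0, 0, &clip, 1);
	/* 4. unmap again (~215 us) */
	if (helper->buffer)
		drm_client_buffer_vunmap(helper->buffer);
}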

As said in the last email, setting prefer_shadow to 0 can avoid
the regression. Could it be an option?

Thanks,
Feng

>
> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> dev->mode_config.preferred_depth = 16;
> else
> dev->mode_config.preferred_depth = 32;
> - dev->mode_config.prefer_shadow = 1;
> + dev->mode_config.prefer_shadow = 0;
>
> And from the perf data, one obvious difference is that the good case doesn't
> call drm_fb_helper_dirty_work(), while the bad case does.
>
> Thanks,
> Feng
>
> > Best Regards,
> > Rong Chen


2019-08-16 06:56:49

by Feng Tang

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On Tue, Aug 13, 2019 at 05:36:16PM +0800, Feng Tang wrote:
> Hi Thomas,
>
> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> > Hi Thomas,
> >
> > On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> > > Hi,
> > >
> > > >>Actually we run the benchmark as a background process, do we need to
> > > >>disable the cursor and test again?
> > > >There's a worker thread that updates the display from the shadow buffer.
> > > >The blinking cursor periodically triggers the worker thread, but the
> > > >actual update is just the size of one character.
> > > >
> > > >The point of the test without output is to see if the regression comes
> > > >from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> > > >from the worker thread. If the regression goes away after disabling the
> > > >blinking cursor, then the worker thread is the problem. If it already
> > > >goes away if there's simply no output from the test, the screen update
> > > >is the problem. On my machine I have to disable the blinking cursor, so
> > > >I think the worker causes the performance drop.
> > >
> > > We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> > > gone.
> > >
> > > commit:
> > >   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> > >   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> > > emulation
> > >
> > > f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> > > ----------------  -------------------------- ---------------------------
> > >          %stddev      change         %stddev
> > >              \          |                \
> > >      43785                       44481
> > > vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> > >      43785                       44481        GEO-MEAN vm-scalability.median
> >
> > Till now, from Rong's tests:
> > 1. Disabling cursor blinking doesn't cure the regression.
> > 2. Disabling printing test results to the console can work around the
> > regression.
> >
> > Also, if we set prefer_shadow to 0, the regression is also
> > gone.
>
> We also did some further breakdown of the time consumed by the
> new code.
>
> The drm_fb_helper_dirty_work() function sequentially calls:
> 1. drm_client_buffer_vmap (290 us)
> 2. drm_fb_helper_dirty_blit_real (19240 us)
> 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
> 4. drm_client_buffer_vunmap (215 us)
>
> The average run time is listed after the function names.
>
> From this, we can see that drm_fb_helper_dirty_blit_real() takes too
> long (about 20 ms for each run). I guess this is the root cause
> of the regression, as the original code doesn't use this dirty worker.
>
> As said in the last email, setting prefer_shadow to 0 can avoid
> the regression. Could it be an option?

Any comments on this? Thanks.

- Feng

>
> Thanks,
> Feng
>
> >
> > --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> > +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> > @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> > dev->mode_config.preferred_depth = 16;
> > else
> > dev->mode_config.preferred_depth = 32;
> > - dev->mode_config.prefer_shadow = 1;
> > + dev->mode_config.prefer_shadow = 0;
> >
> > And from the perf data, one obvious difference is that the good case doesn't
> > call drm_fb_helper_dirty_work(), while the bad case does.
> >
> > Thanks,
> > Feng
> >
> > > Best Regards,
> > > Rong Chen

2019-08-22 17:39:31

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

I was traveling and could not reply earlier. Sorry for taking so long.

On 13.08.19 at 11:36, Feng Tang wrote:
> Hi Thomas,
>
> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>> Hi Thomas,
>>
>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>> Hi,
>>>
>>>>> Actually we run the benchmark as a background process, do we need to
>>>>> disable the cursor and test again?
>>>> There's a worker thread that updates the display from the shadow buffer.
>>>> The blinking cursor periodically triggers the worker thread, but the
>>>> actual update is just the size of one character.
>>>>
>>>> The point of the test without output is to see if the regression comes
>>>> from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
>>>> from the worker thread. If the regression goes away after disabling the
>>>> blinking cursor, then the worker thread is the problem. If it already
>>>> goes away if there's simply no output from the test, the screen update
>>>> is the problem. On my machine I have to disable the blinking cursor, so
>>>> I think the worker causes the performance drop.
>>>
>>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
>>> gone.
>>>
>>> commit:
>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
>>> emulation
>>>
>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>> ----------------  -------------------------- ---------------------------
>>>          %stddev      change         %stddev
>>>              \          |                \
>>>      43785                       44481
>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>
>> Till now, from Rong's tests:
>> 1. Disabling cursor blinking doesn't cure the regression.
>> 2. Disabling printing test results to the console can work around the
>> regression.
>>
>> Also, if we set prefer_shadow to 0, the regression is also
>> gone.
>
> We also did some further breakdown of the time consumed by the
> new code.
>
> The drm_fb_helper_dirty_work() function sequentially calls:
> 1. drm_client_buffer_vmap (290 us)
> 2. drm_fb_helper_dirty_blit_real (19240 us)
> 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
> 4. drm_client_buffer_vunmap (215 us)
>

It's somewhat different from what I observed, but maybe I just couldn't
reproduce the problem correctly.

> The average run time is listed after the function names.
>
> From this, we can see that drm_fb_helper_dirty_blit_real() takes too
> long (about 20 ms for each run). I guess this is the root cause
> of the regression, as the original code doesn't use this dirty worker.

True, the original code uses a temporary buffer, but updates the display
immediately.

My guess is that this could be a caching problem. The worker runs on a
different CPU, which doesn't have the shadow buffer in cache.

> As said in the last email, setting prefer_shadow to 0 can avoid
> the regression. Could it be an option?

Unfortunately not. Without the shadow buffer, the console's display
buffer permanently resides in video memory. It consumes a significant
amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
enough room for anything else.
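
(Rough arithmetic, assuming a full-HD console at 32 bpp: 1920 * 1080 * 4
bytes is about 7.9 MiB, i.e. half of a 16 MiB VRAM.)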

The best option is to not print to the console.

Best regards
Thomas

> Thanks,
> Feng
>
>>
>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>> dev->mode_config.preferred_depth = 16;
>> else
>> dev->mode_config.preferred_depth = 32;
>> - dev->mode_config.prefer_shadow = 1;
>> + dev->mode_config.prefer_shadow = 0;
>>
>> And from the perf data, one obvious difference is that the good case doesn't
>> call drm_fb_helper_dirty_work(), while the bad case does.
>>
>> Thanks,
>> Feng
>>
>>> Best Regards,
>>> Rong Chen

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-08-23 08:00:01

by Dave Airlie

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann <[email protected]> wrote:
>
> Hi
>
> I was traveling and could not reply earlier. Sorry for taking so long.
>
> On 13.08.19 at 11:36, Feng Tang wrote:
> > Hi Thomas,
> >
> > On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> >> Hi Thomas,
> >>
> >> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> >>> Hi,
> >>>
> >>>>> Actually we run the benchmark as a background process, do we need to
> >>>>> disable the cursor and test again?
> >>>> There's a worker thread that updates the display from the shadow buffer.
> >>>> The blinking cursor periodically triggers the worker thread, but the
> >>>> actual update is just the size of one character.
> >>>>
> >>>> The point of the test without output is to see if the regression comes
> >>>> from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> >>>> from the worker thread. If the regression goes away after disabling the
> >>>> blinking cursor, then the worker thread is the problem. If it already
> >>>> goes away if there's simply no output from the test, the screen update
> >>>> is the problem. On my machine I have to disable the blinking cursor, so
> >>>> I think the worker causes the performance drop.
> >>>
> >>> We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is
> >>> gone.
> >>>
> >>> commit:
> >>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> >>> emulation
> >>>
> >>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> >>> ----------------  -------------------------- ---------------------------
> >>>          %stddev      change         %stddev
> >>>              \          |                \
> >>>      43785                       44481
> >>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >>>      43785                       44481        GEO-MEAN vm-scalability.median
> >>
> >> Till now, from Rong's tests:
> >> 1. Disabling cursor blinking doesn't cure the regression.
> >> 2. Disabling printing test results to the console can work around the
> >> regression.
> >>
> >> Also, if we set prefer_shadow to 0, the regression is also
> >> gone.
> >
> > We also did some further breakdown of the time consumed by the
> > new code.
> >
> > The drm_fb_helper_dirty_work() function sequentially calls:
> > 1. drm_client_buffer_vmap (290 us)
> > 2. drm_fb_helper_dirty_blit_real (19240 us)
> > 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
> > 4. drm_client_buffer_vunmap (215 us)
> >
>
> It's somewhat different from what I observed, but maybe I just couldn't
> reproduce the problem correctly.
>
> > The average run time is listed after the function names.
> >
> > From this, we can see that drm_fb_helper_dirty_blit_real() takes too
> > long (about 20 ms for each run). I guess this is the root cause
> > of the regression, as the original code doesn't use this dirty worker.
>
> True, the original code uses a temporary buffer, but updates the display
> immediately.
>
> My guess is that this could be a caching problem. The worker runs on a
> different CPU, which doesn't have the shadow buffer in cache.
>
> > As said in the last email, setting prefer_shadow to 0 can avoid
> > the regression. Could it be an option?
>
> Unfortunately not. Without the shadow buffer, the console's display
> buffer permanently resides in video memory. It consumes a significant
> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
> enough room for anything else.
>
> The best option is to not print to the console.

Wait a second, I thought the driver did an eviction on modeset of the
scanned-out object; this was a deliberate design decision made when
writing those drivers. Has this been removed in favour of gem and
generic code paths?

Dave.

2019-08-23 22:48:21

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 22.08.19 at 22:02, Dave Airlie wrote:
> On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann <[email protected]> wrote:
>>
>> Hi
>>
>> I was traveling and could not reply earlier. Sorry for taking so long.
>>
>> On 13.08.19 at 11:36, Feng Tang wrote:
>>> Hi Thomas,
>>>
>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>> Hi Thomas,
>>>>
>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>> Hi,
>>>>>
>>>>>>> Actually we run the benchmark as a background process, do we need to
>>>>>>> disable the cursor and test again?
>>>>>> There's a worker thread that updates the display from the shadow buffer.
>>>>>> The blinking cursor periodically triggers the worker thread, but the
>>>>>> actual update is just the size of one character.
>>>>>>
>>>>>> The point of the test without output is to see if the regression comes
>>>>>> from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
>>>>>> from the worker thread. If the regression goes away after disabling the
>>>>>> blinking cursor, then the worker thread is the problem. If it already
>>>>>> goes away if there's simply no output from the test, the screen update
>>>>>> is the problem. On my machine I have to disable the blinking cursor, so
>>>>>> I think the worker causes the performance drop.
>>>>>
>>>>> We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is
>>>>> gone.
>>>>>
>>>>> commit:
>>>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
>>>>> emulation
>>>>>
>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>>>> ----------------  -------------------------- ---------------------------
>>>>>          %stddev      change         %stddev
>>>>>              \          |                \
>>>>>      43785                       44481
>>>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>>>
>>>> Till now, from Rong's tests:
>>>> 1. Disabling cursor blinking doesn't cure the regression.
>>>> 2. Disabling printing test results to the console can work around the
>>>> regression.
>>>>
>>>> Also, if we set prefer_shadow to 0, the regression is also
>>>> gone.
>>>
>>> We also did some further breakdown of the time consumed by the
>>> new code.
>>>
>>> The drm_fb_helper_dirty_work() function sequentially calls:
>>> 1. drm_client_buffer_vmap (290 us)
>>> 2. drm_fb_helper_dirty_blit_real (19240 us)
>>> 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
>>> 4. drm_client_buffer_vunmap (215 us)
>>>
>>
>> It's somewhat different from what I observed, but maybe I just couldn't
>> reproduce the problem correctly.
>>
>>> The average run time is listed after the function names.
>>>
>>> From this, we can see that drm_fb_helper_dirty_blit_real() takes too
>>> long (about 20 ms for each run). I guess this is the root cause
>>> of the regression, as the original code doesn't use this dirty worker.
>>
>> True, the original code uses a temporary buffer, but updates the display
>> immediately.
>>
>> My guess is that this could be a caching problem. The worker runs on a
>> different CPU, which doesn't have the shadow buffer in cache.
>>
>>> As said in the last email, setting prefer_shadow to 0 can avoid
>>> the regression. Could it be an option?
>>
>> Unfortunately not. Without the shadow buffer, the console's display
>> buffer permanently resides in video memory. It consumes a significant
>> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
>> enough room for anything else.
>>
>> The best option is to not print to the console.
>
> Wait a second, I thought the driver did an eviction on modeset of the
> scanned-out object; this was a deliberate design decision made when
> writing those drivers. Has this been removed in favour of gem and
> generic code paths?

Yes. We added this feature back for testing in [1]. It gave only an
improvement of ~1% compared to the original report. I wouldn't mind
landing this patch set, but it probably doesn't make a difference either way.

Best regards
Thomas

[1] https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html

>
> Dave.
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-08-24 05:16:55

by Feng Tang

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
> Hi
>
> I was traveling and could not reply earlier. Sorry for taking so long.

No problem! I guessed so :)

>
> On 13.08.19 at 11:36, Feng Tang wrote:
> > Hi Thomas,
> >
> > On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
> >> Hi Thomas,
> >>
> >> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
> >>> Hi,
> >>>
> >>>>> Actually we run the benchmark as a background process, do we need to
> >>>>> disable the cursor and test again?
> >>>> There's a worker thread that updates the display from the shadow buffer.
> >>>> The blinking cursor periodically triggers the worker thread, but the
> >>>> actual update is just the size of one character.
> >>>>
> >>>> The point of the test without output is to see if the regression comes
> >>>> from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or
> >>>> from the worker thread. If the regression goes away after disabling the
> >>>> blinking cursor, then the worker thread is the problem. If it already
> >>>> goes away if there's simply no output from the test, the screen update
> >>>> is the problem. On my machine I have to disable the blinking cursor, so
> >>>> I think the worker causes the performance drop.
> >>>
> >>> We disabled redirecting stdout/stderr to /dev/kmsg,  and the regression is
> >>> gone.
> >>>
> >>> commit:
> >>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
> >>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer
> >>> emulation
> >>>
> >>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
> >>> ----------------  -------------------------- ---------------------------
> >>>          %stddev      change         %stddev
> >>>              \          |                \
> >>>      43785                       44481
> >>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
> >>>      43785                       44481        GEO-MEAN vm-scalability.median
> >>
> >> Till now, from Rong's tests:
> >> 1. Disabling cursor blinking doesn't cure the regression.
> >> 2. Disabling printing test results to the console can work around the
> >> regression.
> >>
> >> Also, if we set prefer_shadow to 0, the regression is also
> >> gone.
> >
> > We also did some further breakdown of the time consumed by the
> > new code.
> >
> > The drm_fb_helper_dirty_work() function sequentially calls:
> > 1. drm_client_buffer_vmap (290 us)
> > 2. drm_fb_helper_dirty_blit_real (19240 us)
> > 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
> > 4. drm_client_buffer_vunmap (215 us)
> >
>
> It's somewhat different from what I observed, but maybe I just couldn't
> reproduce the problem correctly.
>
> > The average run time is listed after the function names.
> >
> > From this, we can see that drm_fb_helper_dirty_blit_real() takes too
> > long (about 20 ms for each run). I guess this is the root cause
> > of the regression, as the original code doesn't use this dirty worker.
>
> True, the original code uses a temporary buffer, but updates the display
> immediately.
>
> My guess is that this could be a caching problem. The worker runs on a
> different CPU, which doesn't have the shadow buffer in cache.

Yes, that's my thought too. I profiled the working set size: for most of
the drm_fb_helper_dirty_blit_real() calls, it updates a 4096x768 buffer (3 MB),
and as it is called 30~40 times per second, it surely will affect the cache.
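
(As rough arithmetic: ~3 MB per blit at 30~40 blits per second is on the
order of 90~120 MB/s of copy traffic for the console alone.)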


> > As said in the last email, setting prefer_shadow to 0 can avoid
> > the regression. Could it be an option?
>
> Unfortunately not. Without the shadow buffer, the console's display
> buffer permanently resides in video memory. It consumes a significant
> amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave
> enough room for anything else.
>
> The best option is to not print to the console.

Do we have other options here?

My thought is that this is clearly a regression: the old driver works
fine, while the new version in linux-next doesn't. Also, for a frame
buffer console, writing dozens of lines of messages to it is not a rare
use case. We have many test platforms (servers/desktops/laptops)
with different kinds of GFX hardware, and this model has worked fine for
many years :)

Thanks,
Feng



> Best regards
> Thomas
>
> > Thanks,
> > Feng
> >
> >>
> >> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
> >> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
> >> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
> >> dev->mode_config.preferred_depth = 16;
> >> else
> >> dev->mode_config.preferred_depth = 32;
> >> - dev->mode_config.prefer_shadow = 1;
> >> + dev->mode_config.prefer_shadow = 0;
> >>
> >> And from the perf data, one obvious difference is that the good case doesn't
> >> call drm_fb_helper_dirty_work(), while the bad case does.
> >>
> >> Thanks,
> >> Feng
> >>
> >>> Best Regards,
> >>> Rong Chen
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>



2019-08-26 10:52:11

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Feng

On 24.08.19 at 07:16, Feng Tang wrote:
> Hi Thomas,
>
> On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> I was traveling and could not reply earlier. Sorry for taking so long.
>
> No problem! I guessed so :)
>
>>
>> On 13.08.19 at 11:36, Feng Tang wrote:
>>> Hi Thomas,
>>>
>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>> Hi Thomas,
>>>>
>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>> Hi,
>>>>>
>>>>>>> Actually we run the benchmark as a background process, do
>>>>>>> we need to disable the cursor and test again?
>>>>>> There's a worker thread that updates the display from the
>>>>>> shadow buffer. The blinking cursor periodically triggers
>>>>>> the worker thread, but the actual update is just the size
>>>>>> of one character.
>>>>>>
>>>>>> The point of the test without output is to see if the
>>>>>> regression comes from the buffer update (i.e., the memcpy
>>>>>> from shadow buffer to VRAM), or from the worker thread. If
>>>>>> the regression goes away after disabling the blinking
>>>>>> cursor, then the worker thread is the problem. If it
>>>>>> already goes away if there's simply no output from the
>>>>>> test, the screen update is the problem. On my machine I
>>>>>> have to disable the blinking cursor, so I think the worker
>>>>>> causes the performance drop.
>>>>>
>>>>> We disabled redirecting stdout/stderr to /dev/kmsg, and the
>>>>> regression is gone.
>>>>>
>>>>> commit:
>>>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>>>> framebuffer emulation
>>>>>
>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>>>> ----------------  -------------------------- ---------------------------
>>>>>          %stddev      change         %stddev
>>>>>              \          |                \
>>>>>      43785                       44481
>>>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>>>
>>>> Till now, from Rong's tests:
>>>> 1. Disabling cursor blinking doesn't cure the regression.
>>>> 2. Disabling printing test results to the console can work around the
>>>> regression.
>>>>
>>>> Also, if we set prefer_shadow to 0, the regression is also
>>>> gone.
>>>
>>> We also did some further breakdown of the time consumed by the
>>> new code.
>>>
>>> The drm_fb_helper_dirty_work() function sequentially calls:
>>> 1. drm_client_buffer_vmap (290 us)
>>> 2. drm_fb_helper_dirty_blit_real (19240 us)
>>> 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
>>> 4. drm_client_buffer_vunmap (215 us)
>>>
>>
>> It's somewhat different from what I observed, but maybe I just
>> couldn't reproduce the problem correctly.
>>
>>> The average run time is listed after the function names.
>>>
>>> From this, we can see that drm_fb_helper_dirty_blit_real() takes too
>>> long (about 20 ms for each run). I guess this is the root
>>> cause of the regression, as the original code doesn't use this
>>> dirty worker.
>>
>> True, the original code uses a temporary buffer, but updates the
>> display immediately.
>>
>> My guess is that this could be a caching problem. The worker runs
>> on a different CPU, which doesn't have the shadow buffer in cache.
>
> Yes, that's my thought too. I profiled the working set size, for most
> of the drm_fb_helper_dirty_blit_real(), it will update a buffer
> 4096x768(3 MB), and as it is called 30~40 times per second, it surely
> will affect the cache.
>
>
>>> As said in the last email, setting prefer_shadow to 0 can avoid
>>> the regression. Could it be an option?
>>
>> Unfortunately not. Without the shadow buffer, the console's
>> display buffer permanently resides in video memory. It consumes a
>> significant amount of that memory (say 8 MiB out of 16 MiB). That
>> doesn't leave enough room for anything else.
>>
>> The best option is to not print to the console.
>
> Do we have other options here?

I attached two patches. Both show an improvement, at least in my setup.
Could you please test them independently of each other and report back?

prefetch.patch prefetches the shadow buffer two scanlines ahead during
the blit function. The idea is to have the scanlines in cache when they
are supposed to go to hardware.
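
The patch itself is attached; as a sketch of the idea only (assuming the
blit loop in drm_fb_helper_dirty_blit_real(), with variable names
simplified -- not the literal patch):

#include <linux/prefetch.h>

	/* copy the dirty scanlines, hinting the CPU to pull the line
	 * after next into cache before it is needed */
	for (y = clip->y1; y < clip->y2; y++) {
		prefetch(src + 2 * fb->pitches[0]);
		memcpy(dst, src, len);
		src += fb->pitches[0];
		dst += fb->pitches[0];
	}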

schedule.patch schedules the dirty worker on the current CPU core (i.e.,
the one that did the drawing to the shadow buffer). Hopefully the shadow
buffer remains in cache meanwhile.
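
Again only as a sketch (assuming the scheduling site in
drm_fb_helper_dirty(); the attached patch is authoritative):

	/* instead of schedule_work(&helper->dirty_work), queue the worker
	 * on the CPU that just wrote to the shadow buffer */
	queue_work_on(raw_smp_processor_id(), system_wq, &helper->dirty_work);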

Best regards
Thomas

> My thought is that this is clearly a regression: the old driver
> works fine, while the new version in linux-next doesn't. Also, for a
> frame buffer console, writing dozens of lines of messages to it is not a
> rare use case. We have many test platforms
> (servers/desktops/laptops) with different kinds of GFX hardware, and
> this model has worked fine for many years :)
>
> Thanks, Feng
>
>
>
>> Best regards Thomas
>>
>>> Thanks, Feng
>>>
>>>>
>>>> --- a/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> +++ b/drivers/gpu/drm/mgag200/mgag200_main.c
>>>> @@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
>>>>  		dev->mode_config.preferred_depth = 16;
>>>>  	else
>>>>  		dev->mode_config.preferred_depth = 32;
>>>> -	dev->mode_config.prefer_shadow = 1;
>>>> +	dev->mode_config.prefer_shadow = 0;
>>>>
>>>> And from the perf data, one obvious difference is that the good case
>>>> doesn't call drm_fb_helper_dirty_work(), while the bad case does.
>>>>
>>>> Thanks, Feng
>>>>
>>>>> Best Regards, Rong Chen

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


Attachments:
prefetch.patch (1.03 kB)
schedule.patch (0.99 kB)

2019-08-27 12:35:01

by Chen, Rong A

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On 8/26/2019 6:50 PM, Thomas Zimmermann wrote:
> Hi Feng
>
> On 24.08.19 at 07:16, Feng Tang wrote:
>> Hi Thomas,
>>
>> On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> I was traveling and could not reply earlier. Sorry for taking so long.
>> No problem! I guessed so :)
>>
>>> On 13.08.19 at 11:36, Feng Tang wrote:
>>>> Hi Thomas,
>>>>
>>>> On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
>>>>>> Hi,
>>>>>>
>>>>>>>> Actually we run the benchmark as a background process, do
>>>>>>>> we need to disable the cursor and test again?
>>>>>>> There's a worker thread that updates the display from the
>>>>>>> shadow buffer. The blinking cursor periodically triggers
>>>>>>> the worker thread, but the actual update is just the size
>>>>>>> of one character.
>>>>>>>
>>>>>>> The point of the test without output is to see if the
>>>>>>> regression comes from the buffer update (i.e., the memcpy
>>>>>>> from shadow buffer to VRAM), or from the worker thread. If
>>>>>>> the regression goes away after disabling the blinking
>>>>>>> cursor, then the worker thread is the problem. If it
>>>>>>> already goes away if there's simply no output from the
>>>>>>> test, the screen update is the problem. On my machine I
>>>>>>> have to disable the blinking cursor, so I think the worker
>>>>>>> causes the performance drop.
>>>>>> We disabled redirecting stdout/stderr to /dev/kmsg, and the
>>>>>> regression is gone.
>>>>>>
>>>>>> commit:
>>>>>>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>>>>>>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
>>>>>> framebuffer emulation
>>>>>>
>>>>>> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde testcase/testparams/testbox
>>>>>> ----------------  -------------------------- ---------------------------
>>>>>>          %stddev      change         %stddev
>>>>>>              \          |                \
>>>>>>      43785                       44481
>>>>>> vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>>>>>>      43785                       44481        GEO-MEAN vm-scalability.median
>>>>> Till now, from Rong's tests:
>>>>> 1. Disabling cursor blinking doesn't cure the regression.
>>>>> 2. Disabling printing test results to the console can work around the
>>>>> regression.
>>>>>
>>>>> Also, if we set prefer_shadow to 0, the regression is also
>>>>> gone.
>>>> We also did some further breakdown of the time consumed by the
>>>> new code.
>>>>
>>>> The drm_fb_helper_dirty_work() function sequentially calls:
>>>> 1. drm_client_buffer_vmap (290 us)
>>>> 2. drm_fb_helper_dirty_blit_real (19240 us)
>>>> 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
>>>> 4. drm_client_buffer_vunmap (215 us)
>>>>
>>> It's somewhat different from what I observed, but maybe I just
>>> couldn't reproduce the problem correctly.
>>>
>>>> The average run time is listed after the function names.
>>>>
>>>> From this, we can see that drm_fb_helper_dirty_blit_real() takes too
>>>> long (about 20 ms for each run). I guess this is the root
>>>> cause of the regression, as the original code doesn't use this
>>>> dirty worker.
>>> True, the original code uses a temporary buffer, but updates the
>>> display immediately.
>>>
>>> My guess is that this could be a caching problem. The worker runs
>>> on a different CPU, which doesn't have the shadow buffer in cache.
>> Yes, that's my thought too. I profiled the working set size: for most
>> of the drm_fb_helper_dirty_blit_real() calls, it updates a 4096x768
>> buffer (3 MB), and as it is called 30~40 times per second, it surely
>> will affect the cache.
>>
>>
>>>> As said in the last email, setting prefer_shadow to 0 can avoid
>>>> the regression. Could it be an option?
>>> Unfortunately not. Without the shadow buffer, the console's
>>> display buffer permanently resides in video memory. It consumes a
>>> significant amount of that memory (say 8 MiB out of 16 MiB). That
>>> doesn't leave enough room for anything else.
>>>
>>> The best option is to not print to the console.
>> Do we have other options here?
> I attached two patches. Both show an improvement, at least in my setup.
> Could you please test them independently of each other and report back?
>
> prefetch.patch prefetches the shadow buffer two scanlines ahead during
> the blit function. The idea is to have the scanlines in cache when they
> are supposed to go to hardware.
>
> schedule.patch schedules the dirty worker on the current CPU core (i.e.,
> the one that did the drawing to the shadow buffer). Hopefully the shadow
> buffer remains in cache meanwhile.
>
> Best regards
> Thomas

Both patches have little impact on the performance from our side.

prefetch.patch:
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
framebuffer emulation
  77459f56994 prefetch shadow buffer two lines ahead of blit offset

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde 77459f56994ab87ee5459920b3 testcase/testparams/testbox
----------------  -------------------------- -------------------------- ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517             -17%      35515
vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517             -17%      35515        GEO-MEAN vm-scalability.median

schedule.patch:
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
framebuffer emulation
  ccc5f095c61 schedule dirty worker on local core

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde ccc5f095c61ff6eded0f0ab1b7 testcase/testparams/testbox
----------------  -------------------------- -------------------------- ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517             -15%      36556 ±  4%
vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517             -15%      36556        GEO-MEAN vm-scalability.median

Best Regards,
Rong Chen

2019-08-27 17:18:11

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 27.08.19 at 14:33, Chen, Rong A wrote:
>
> Both patches have little impact on the performance from our side.

Thanks for testing. Too bad they don't solve the issue.

There's another patch attached. Could you please test this as well?
Thanks a lot!

The patch comes from Daniel Vetter after discussing the problem on IRC.
The idea of the patch is that the old mgag200 code might display far
fewer frames than the generic code, because mgag200 only prints from
non-atomic context. If we simulate this with the generic code, we should
see roughly the original performance.
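
As a sketch of the idea (assuming the update path in drm_fb_helper; the
attached patch is authoritative):

	/* only schedule the dirty worker from non-atomic context, which
	 * mimics the old mgag200 behaviour of skipping such updates */
	if (drm_can_sleep())
		schedule_work(&helper->dirty_work);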

Best regards
Thomas

>
> prefetch.patch:
> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   77459f56994 prefetch shadow buffer two lines ahead of blit offset
>
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde 77459f56994ab87ee5459920b3 testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- ---------------------------
>          %stddev      change         %stddev      change         %stddev
>              \          |                \          |                \
>      42912             -15%      36517             -17%      35515
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517             -17%      35515        GEO-MEAN vm-scalability.median
>
> schedule.patch:
> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   ccc5f095c61 schedule dirty worker on local core
>
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde ccc5f095c61ff6eded0f0ab1b7 testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- ---------------------------
>          %stddev      change         %stddev      change         %stddev
>              \          |                \          |                \
>      42912             -15%      36517             -15%      36556 ±  4%
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517             -15%      36556        GEO-MEAN vm-scalability.median
>
> Best Regards,
> Rong Chen

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


Attachments:
usecansleep.patch (838.00 B)

2019-08-28 09:38:41

by Chen, Rong A

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> Hi
>
> On 27.08.19 at 14:33, Chen, Rong A wrote:
>> Both patches have little impact on the performance from our side.
> Thanks for testing. Too bad they don't solve the issue.
>
> There's another patch attached. Could you please test this as well?
> Thanks a lot!
>
> The patch comes from Daniel Vetter after discussing the problem on IRC.
> The idea of the patch is that the old mgag200 code might display far
> fewer frames than the generic code, because mgag200 only prints from
> non-atomic context. If we simulate this with the generic code, we should
> see roughly the original performance.
>
>

It's cool: the patch "usecansleep.patch" fixes the issue.

commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
framebuffer emulation
  b976b04c2bc only schedule worker from non-atomic context

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde b976b04c2bcf33148d6c7bc1a2 testcase/testparams/testbox
----------------  -------------------------- -------------------------- ---------------------------
         %stddev      change         %stddev      change         %stddev
             \          |                \          |                \
     42912             -15%      36517                       44093
vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%      36517                       44093        GEO-MEAN vm-scalability.median

Best Regards,
Rong Chen

2019-08-28 10:52:58

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 28.08.19 at 11:37, Rong Chen wrote:
> Hi Thomas,
>
> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>> Hi
>>
>> On 27.08.19 at 14:33, Chen, Rong A wrote:
>>> Both patches have little impact on the performance from our side.
>> Thanks for testing. Too bad they don't solve the issue.
>>
>> There's another patch attached. Could you please test this as well?
>> Thanks a lot!
>>
>> The patch comes from Daniel Vetter after discussing the problem on IRC.
>> The idea of the patch is that the old mgag200 code might display far
>> fewer frames than the generic code, because mgag200 only prints from
>> non-atomic context. If we simulate this with the generic code, we should
>> see roughly the original performance.
>>
>>
>
> It's cool: the patch "usecansleep.patch" fixes the issue.

Thank you for testing. But don't get too excited, because the patch
simulates a bug that was present in the original mgag200 code. A
significant number of frames are simply skipped. That is apparently the
reason why it's faster.

Best regards
Thomas

> commit:
>   f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
>   90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic
> framebuffer emulation
>   b976b04c2bc only schedule worker from non-atomic context
>
> f1f8555dfb9a70a2  90f479ae51afa45efab97afdde b976b04c2bcf33148d6c7bc1a2 testcase/testparams/testbox
> ----------------  -------------------------- -------------------------- ---------------------------
>          %stddev      change         %stddev      change         %stddev
>              \          |                \          |                \
>      42912             -15%      36517                       44093
> vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
>      42912             -15%      36517                       44093        GEO-MEAN vm-scalability.median
>
> Best Regards,
> Rong Chen

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-09-04 06:28:15

by Feng Tang

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
> Hi
>
> On 28.08.19 at 11:37, Rong Chen wrote:
> > Hi Thomas,
> >
> > On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> >> Hi
> >>
> >> On 27.08.19 at 14:33, Chen, Rong A wrote:
> >>> Both patches have little impact on the performance from our side.
> >> Thanks for testing. Too bad they don't solve the issue.
> >>
> >> There's another patch attached. Could you please test this as well?
> >> Thanks a lot!
> >>
> >> The patch comes from Daniel Vetter after discussing the problem on IRC.
> >> The idea of the patch is that the old mgag200 code might display far
> >> fewer frames than the generic code, because mgag200 only prints from
> >> non-atomic context. If we simulate this with the generic code, we should
> >> see roughly the original performance.
> >>
> >>
> >
> > It's cool: the patch "usecansleep.patch" fixes the issue.
>
> Thank you for testing. But don't get too excited, because the patch
> simulates a bug that was present in the original mgag200 code. A
> significant number of frames are simply skipped. That is apparently the
> reason why it's faster.

Thanks for the detailed info. So the original code skips the time-consuming
work inside atomic context on purpose. Is there any room to optimise it?
If 2 scheduled update workers are handled at almost the same time, can one be
skipped?

Thanks,
Feng

>
> Best regards
> Thomas

2019-09-04 06:55:01

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 04.09.19 at 08:27, Feng Tang wrote:
>> Thank you for testing. But don't get too excited, because the patch
>> simulates a bug that was present in the original mgag200 code. A
>> significant number of frames are simply skipped. That is apparently the
>> reason why it's faster.
>
> Thanks for the detailed info. So the original code skips the time-consuming
> work inside atomic context on purpose. Is there any room to optimise it?
> If 2 scheduled update workers are handled at almost the same time, can one be
> skipped?

To my knowledge, there's only one instance of the worker. Re-scheduling
the worker before a previous instance has started will not create a second
instance. The worker's instance will complete all pending updates. So in
some way, skipping workers already happens.
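
Roughly, the damage handling looks like this (simplified from
drm_fb_helper_dirty(); not the exact code):

	/* merge the new damage into the single pending clip rectangle */
	spin_lock_irqsave(&helper->dirty_lock, flags);
	clip->x1 = min_t(u32, clip->x1, x);
	clip->y1 = min_t(u32, clip->y1, y);
	clip->x2 = max_t(u32, clip->x2, x + width);
	clip->y2 = max_t(u32, clip->y2, y + height);
	spin_unlock_irqrestore(&helper->dirty_lock, flags);

	/* a no-op if the work item is already queued, so a burst of
	 * updates collapses into a single worker run */
	schedule_work(&helper->dirty_work);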

Best regards
Thomas

>
> Thanks,
> Feng
>
>>
>> Best regards
>> Thomas

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-09-04 08:13:52

by Daniel Vetter

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
>
> Hi
>
> On 04.09.19 at 08:27, Feng Tang wrote:
> >> Thank you for testing. But don't get too excited, because the patch
> >> simulates a bug that was present in the original mgag200 code. A
> >> significant number of frames are simply skipped. That is apparently the
> >> reason why it's faster.
> >
> > Thanks for the detailed info. So the original code skips the time-consuming
> > work inside atomic context on purpose. Is there any room to optimise it?
> > If 2 scheduled update workers are handled at almost the same time, can one be
> > skipped?
>
> To my knowledge, there's only one instance of the worker. Re-scheduling
> the worker before a previous instance has started will not create a second
> instance. The worker's instance will complete all pending updates. So in
> some way, skipping workers already happens.

So I think that the most frequent fbcon update from atomic context is the
blinking cursor. If you disable that one, you should be back to the old
performance level, I think, since just writing to dmesg happens from process
context, so that shouldn't change.

https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking

Bunch of tricks, but tbh I haven't tested them.

In any case, I still strongly advise that you don't print anything to dmesg
or fbcon while benchmarking, because dmesg/printk are anything but
fast, especially if a gpu driver is involved. There are some efforts to
make the dmesg/printk side less painful (untangling the console_lock
from printk), but fundamentally printing to the gpu from the kernel
through dmesg/fbcon won't be cheap. It's just not something we
optimize beyond "make sure it works for emergencies".
-Daniel

>
> Best regards
> Thomas
>
> >
> > Thanks,
> > Feng
> >
> >>
> >> Best regards
> >> Thomas
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>



--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-04 08:36:43

by Feng Tang

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Daniel,

On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> >
> > Hi
> >
> > On 04.09.19 at 08:27, Feng Tang wrote:
> > >> Thank you for testing. But don't get too excited, because the patch
> > >> simulates a bug that was present in the original mgag200 code. A
> > >> significant number of frames are simply skipped. That is apparently the
> > >> reason why it's faster.
> > >
> > > Thanks for the detailed info. So the original code skips the time-consuming
> > > work inside atomic context on purpose. Is there any room to optimise it?
> > > If 2 scheduled update workers are handled at almost the same time, can one be
> > > skipped?
> >
> > To my knowledge, there's only one instance of the worker. Re-scheduling
> > the worker before a previous instance has started will not create a second
> > instance. The worker's instance will complete all pending updates. So in
> > some way, skipping workers already happens.
>
> So I think that the most frequent fbcon update from atomic context is the
> blinking cursor. If you disable that one, you should be back to the old
> performance level, I think, since just writing to dmesg happens from process
> context, so that shouldn't change.

Hmm, then the old driver should also do most of its updates in
non-atomic context?

One other thing: I profiled that updating a 3 MB shadow buffer needs
20 ms, which translates to 150 MB/s bandwidth. Could it be related to
the cache setting of the DRM shadow buffer? Say, does the original code
use a cacheable buffer?


>
> https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
>
> Bunch of tricks, but tbh I haven't tested them.

Thomas has suggested disabling the cursor with
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink

We tried it that way, and there was no change in the performance data.

Thanks,
Feng

>
> In any case, I still strongly advise that you don't print anything to dmesg
> or fbcon while benchmarking, because dmesg/printk are anything but
> fast, especially if a gpu driver is involved. There are some efforts to
> make the dmesg/printk side less painful (untangling the console_lock
> from printk), but fundamentally printing to the gpu from the kernel
> through dmesg/fbcon won't be cheap. It's just not something we
> optimize beyond "make sure it works for emergencies".
> -Daniel
>
> >
> > Best regards
> > Thomas
> >
> > >
> > > Thanks,
> > > Feng
> > >
> > >>
> > >> Best regards
> > >> Thomas
> >
> > --
> > Thomas Zimmermann
> > Graphics Driver Developer
> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> > HRB 21284 (AG Nürnberg)
> >
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-04 08:45:26

by Thomas Zimmermann

Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 04.09.19 at 10:35, Feng Tang wrote:
> Hi Daniel,
>
> On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
>> On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
>>>
>>> Hi
>>>
>>> On 04.09.19 at 08:27, Feng Tang wrote:
>>>>> Thank you for testing. But don't get too excited, because the patch
>>>>> simulates a bug that was present in the original mgag200 code. A
>>>>> significant number of frames are simply skipped. That is apparently the
>>>>> reason why it's faster.
>>>>
>>>> Thanks for the detailed info. So the original code skips the time-consuming
>>>> work inside atomic context on purpose. Is there any room to optimise it?
>>>> If 2 scheduled update workers are handled at almost the same time, can one be
>>>> skipped?
>>>
>>> To my knowledge, there's only one instance of the worker. Re-scheduling
>>> the worker before a previous instance has started will not create a second
>>> instance. The worker's instance will complete all pending updates. So in
>>> some way, skipping workers already happens.
>>
>> So I think that the most frequent fbcon update from atomic context is the
>> blinking cursor. If you disable that one, you should be back to the old
>> performance level, I think, since just writing to dmesg happens from process
>> context, so that shouldn't change.
>
> Hmm, then the old driver should also do most of its updates in
> non-atomic context?
>
> One other thing: I profiled that updating a 3 MB shadow buffer needs
> 20 ms, which translates to 150 MB/s bandwidth. Could it be related to
> the cache setting of the DRM shadow buffer? Say, does the original code
> use a cacheable buffer?
>
>
>>
>> https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
>>
>> Bunch of tricks, but tbh I haven't tested them.
>
> Thomas has suggested disabling the cursor with
> echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>
> We tried it that way, and there was no change in the performance data.

There are several ways of disabling the cursor. On my test system, I entered

tput civis

before the test and got better performance. Did you try this as well?

Best regards
Thomas

>
> Thanks,
> Feng
>
>>
>> In any case, I still strongly advise that you don't print anything to dmesg
>> or fbcon while benchmarking, because dmesg/printk are anything but
>> fast, especially if a gpu driver is involved. There are some efforts to
>> make the dmesg/printk side less painful (untangling the console_lock
>> from printk), but fundamentally printing to the gpu from the kernel
>> through dmesg/fbcon won't be cheap. It's just not something we
>> optimize beyond "make sure it works for emergencies".
>> -Daniel
>>
>>>
>>> Best regards
>>> Thomas
>>>
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>>>
>>>>> Best regards
>>>>> Thomas
>>>
>>> --
>>> Thomas Zimmermann
>>> Graphics Driver Developer
>>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>>> HRB 21284 (AG Nürnberg)
>>>
>>
>>
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-09-04 09:18:47

by Daniel Vetter

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
>
> Hi Daniel,
>
> On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > >
> > > Hi
> > >
> > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > >> Thank you for testing. But don't get too excited, because the patch
> > > >> simulates a bug that was present in the original mgag200 code. A
> > > >> significant number of frames are simply skipped. That is apparently the
> > > >> reason why it's faster.
> > > >
> > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > skipped?
> > >
> > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > the worker before a previous instance started, will not create a second
> > > instance. The worker's instance will complete all pending updates. So in
> > > some way, skipping workers already happens.
> >
> > So I think that the most often fbcon update from atomic context is the
> > blinking cursor. If you disable that one you should be back to the old
> > performance level I think, since just writing to dmesg is from process
> > context, so shouldn't change.
>
> Hmm, then for the old driver, it should also do the most update in
> non-atomic context?
>
> One other thing is, I profiled that updating a 3MB shadow buffer needs
> 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> the cache setting of DRM shadow buffer? say the orginal code use a
> cachable buffer?

Hm, that would indicate the write-combining got broken somewhere. This
should definitely be faster. Also, we shouldn't be transferring the whole
thing, except when scrolling ...


> > https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinking
> >
> > Bunch of tricks, but tbh I haven't tested them.
>
> Thomas has suggested to disable curson by
> echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
>
> We tried that way, and no change for the performance data.

Huh, if there are other atomic contexts for fbcon updates then I'm not
aware of them ... and if it's all the updates, then you wouldn't see a
whole lot on your screen, with either the old or the new fbdev support in
mgag200. I'm a bit confused ...
-Daniel

>
> Thanks,
> Feng
>
> >
> > In any case, I still strongly advice you don't print anything to dmesg
> > or fbcon while benchmarking, because dmesg/printf are anything but
> > fast, especially if a gpu driver is involved. There's some efforts to
> > make the dmesg/printk side less painful (untangling the console_lock
> > from printk), but fundamentally printing to the gpu from the kernel
> > through dmesg/fbcon won't be cheap. It's just not something we
> > optimize beyond "make sure it works for emergencies".
> > -Daniel
> >
> > >
> > > Best regards
> > > Thomas
> > >
> > > >
> > > > Thanks,
> > > > Feng
> > > >
> > > >>
> > > >> Best regards
> > > >> Thomas
> > >
> > > --
> > > Thomas Zimmermann
> > > Graphics Driver Developer
> > > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> > > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> > > HRB 21284 (AG Nürnberg)
> > >
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-04 11:17:57

by Dave Airlie

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <[email protected]> wrote:
>
> On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
> >
> > Hi Daniel,
> >
> > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > > >
> > > > Hi
> > > >
> > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > >> significant number of frames are simply skipped. That is apparently the
> > > > >> reason why it's faster.
> > > > >
> > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > skipped?
> > > >
> > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > the worker before a previous instance started, will not create a second
> > > > instance. The worker's instance will complete all pending updates. So in
> > > > some way, skipping workers already happens.
> > >
> > > So I think that the most often fbcon update from atomic context is the
> > > blinking cursor. If you disable that one you should be back to the old
> > > performance level I think, since just writing to dmesg is from process
> > > context, so shouldn't change.
> >
> > Hmm, then for the old driver, it should also do the most update in
> > non-atomic context?
> >
> > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > the cache setting of DRM shadow buffer? say the orginal code use a
> > cachable buffer?
>
> Hm, that would indicate the write-combining got broken somewhere. This
> should definitely be faster. Also we shouldn't transfer the hole
> thing, except when scrolling ...

First rule of fbcon usage: you are always effectively scrolling.

Also, these devices might be on a PCIe x1 piece of wet string; not sure
if the numbers reflect that.

Dave.

2019-09-04 11:21:57

by Daniel Vetter

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <[email protected]> wrote:
>
> On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <[email protected]> wrote:
> >
> > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
> > >
> > > Hi Daniel,
> > >
> > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > > > >
> > > > > Hi
> > > > >
> > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > >> reason why it's faster.
> > > > > >
> > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > skipped?
> > > > >
> > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > the worker before a previous instance started, will not create a second
> > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > some way, skipping workers already happens.
> > > >
> > > > So I think that the most often fbcon update from atomic context is the
> > > > blinking cursor. If you disable that one you should be back to the old
> > > > performance level I think, since just writing to dmesg is from process
> > > > context, so shouldn't change.
> > >
> > > Hmm, then for the old driver, it should also do the most update in
> > > non-atomic context?
> > >
> > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > cachable buffer?
> >
> > Hm, that would indicate the write-combining got broken somewhere. This
> > should definitely be faster. Also we shouldn't transfer the hole
> > thing, except when scrolling ...
>
> First rule of fbcon usage, you are always effectively scrolling.
>
> Also these devices might be on a PCIE 1x piece of wet string, not sure
> if the numbers reflect that.

pcie x1 1.0 is 250MB/s, so yeah, with a bit of inefficiency and
overhead it's not entirely out of the question that 150MB/s is actually
the hw limit. If it's really pcie x1 1.0, I have no idea where to check
that. Also, it might be worth double-checking that the gpu pci bar is
listed as wc in debugfs/x86/pat_memtype_list.
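
Spelled out, assuming it really is a single gen1 lane: 2.5 GT/s with
8b/10b encoding gives 250 MB/s of raw bandwidth per lane, and copying
3 MB in 20 ms works out to 3 / 0.020 = 150 MB/s, i.e. roughly 60% of the
theoretical lane limit before any protocol overhead.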
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-05 07:09:21

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Vetter,

On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <[email protected]> wrote:
> >
> > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <[email protected]> wrote:
> > >
> > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > > > > >
> > > > > > Hi
> > > > > >
> > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > >> reason why it's faster.
> > > > > > >
> > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > skipped?
> > > > > >
> > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > the worker before a previous instance started, will not create a second
> > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > some way, skipping workers already happens.
> > > > >
> > > > > So I think that the most often fbcon update from atomic context is the
> > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > performance level I think, since just writing to dmesg is from process
> > > > > context, so shouldn't change.
> > > >
> > > > Hmm, then for the old driver, it should also do the most update in
> > > > non-atomic context?
> > > >
> > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > cachable buffer?
> > >
> > > Hm, that would indicate the write-combining got broken somewhere. This
> > > should definitely be faster. Also we shouldn't transfer the hole
> > > thing, except when scrolling ...
> >
> > First rule of fbcon usage, you are always effectively scrolling.
> >
> > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > if the numbers reflect that.
>
> pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> overhead not entirely out of the question that 150MB/s is actually the
> hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> Also might be worth to double-check that the gpu pci bar is listed as
> wc in debugfs/x86/pat_memtype_list.

Here is a dump of the device info and the pat_memtype_list, taken while
the machine is running another 0day task:

controller info
=================
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 16
NUMA node: 0
Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
Address: 00000000 Data: 0000
Kernel driver in use: mgag200
Kernel modules: mgag200


Related pat setting
===================
uncached-minus @ 0xc0000000-0xc0001000
uncached-minus @ 0xc0000000-0xd0000000
uncached-minus @ 0xc0008000-0xc0009000
uncached-minus @ 0xc0009000-0xc000a000
uncached-minus @ 0xc0010000-0xc0011000
uncached-minus @ 0xc0011000-0xc0012000
uncached-minus @ 0xc0012000-0xc0013000
uncached-minus @ 0xc0013000-0xc0014000
uncached-minus @ 0xc0018000-0xc0019000
uncached-minus @ 0xc0019000-0xc001a000
uncached-minus @ 0xc001a000-0xc001b000
write-combining @ 0xd0000000-0xd0300000
write-combining @ 0xd0000000-0xd1000000
uncached-minus @ 0xd1800000-0xd1804000
uncached-minus @ 0xd1900000-0xd1980000
uncached-minus @ 0xd1980000-0xd1981000
uncached-minus @ 0xd1a00000-0xd1a80000
uncached-minus @ 0xd1a80000-0xd1a81000
uncached-minus @ 0xd1f10000-0xd1f11000
uncached-minus @ 0xd1f11000-0xd1f12000
uncached-minus @ 0xd1f12000-0xd1f13000

Host bridge info
================
00:00.0 Host bridge: Intel Corporation Device 7853
Subsystem: Intel Corporation Device 0000
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 0
NUMA node: 0
Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [e0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
Capabilities: [250 v1] #19
Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>


Thanks,
Feng


>
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-05 10:47:37

by Daniel Vetter

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Thu, Sep 5, 2019 at 8:58 AM Feng Tang <[email protected]> wrote:
>
> Hi Vetter,
>
> On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <[email protected]> wrote:
> > >
> > > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > > >> reason why it's faster.
> > > > > > > >
> > > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > > skipped?
> > > > > > >
> > > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > > the worker before a previous instance started, will not create a second
> > > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > > some way, skipping workers already happens.
> > > > > >
> > > > > > So I think that the most often fbcon update from atomic context is the
> > > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > > performance level I think, since just writing to dmesg is from process
> > > > > > context, so shouldn't change.
> > > > >
> > > > > Hmm, then for the old driver, it should also do the most update in
> > > > > non-atomic context?
> > > > >
> > > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > > cachable buffer?
> > > >
> > > > Hm, that would indicate the write-combining got broken somewhere. This
> > > > should definitely be faster. Also we shouldn't transfer the hole
> > > > thing, except when scrolling ...
> > >
> > > First rule of fbcon usage, you are always effectively scrolling.
> > >
> > > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > > if the numbers reflect that.
> >
> > pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> > overhead not entirely out of the question that 150MB/s is actually the
> > hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> > Also might be worth to double-check that the gpu pci bar is listed as
> > wc in debugfs/x86/pat_memtype_list.
>
> Here is some dump of the device info and the pat_memtype_list, while it is
> running other 0day task:

Looks all good. I guess Dave is right that this is probably just a
really slow, really old pcie link, plus maybe some inefficiencies in the
mapping. Your 150MB/s, was that just the copy, or did your trace
measurement also include all the setup/map/unmap/teardown?
-Daniel

>
> controller info
> =================
> 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
> Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
> Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 16
> NUMA node: 0
> Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
> Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
> Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
> Expansion ROM at 000c0000 [disabled] [size=128K]
> Capabilities: [dc] Power Management version 2
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 128 bytes
> DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
> Address: 00000000 Data: 0000
> Kernel driver in use: mgag200
> Kernel modules: mgag200
>
>
> Related pat setting
> ===================
> uncached-minus @ 0xc0000000-0xc0001000
> uncached-minus @ 0xc0000000-0xd0000000
> uncached-minus @ 0xc0008000-0xc0009000
> uncached-minus @ 0xc0009000-0xc000a000
> uncached-minus @ 0xc0010000-0xc0011000
> uncached-minus @ 0xc0011000-0xc0012000
> uncached-minus @ 0xc0012000-0xc0013000
> uncached-minus @ 0xc0013000-0xc0014000
> uncached-minus @ 0xc0018000-0xc0019000
> uncached-minus @ 0xc0019000-0xc001a000
> uncached-minus @ 0xc001a000-0xc001b000
> write-combining @ 0xd0000000-0xd0300000
> write-combining @ 0xd0000000-0xd1000000
> uncached-minus @ 0xd1800000-0xd1804000
> uncached-minus @ 0xd1900000-0xd1980000
> uncached-minus @ 0xd1980000-0xd1981000
> uncached-minus @ 0xd1a00000-0xd1a80000
> uncached-minus @ 0xd1a80000-0xd1a81000
> uncached-minus @ 0xd1f10000-0xd1f11000
> uncached-minus @ 0xd1f11000-0xd1f12000
> uncached-minus @ 0xd1f12000-0xd1f13000
>
> Host bridge info
> ================
> 00:00.0 Host bridge: Intel Corporation Device 7853
> Subsystem: Intel Corporation Device 0000
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 0
> NUMA node: 0
> Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0
> ExtTag- RBE+
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 128 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
> ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
> RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
> RootCap: CRSVisible-
> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> Capabilities: [e0] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
> Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
> Capabilities: [250 v1] #19
> Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
> Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
>
>
> Thanks,
> Feng
>
>
> >
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-05 12:47:22

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

On Thu, Sep 05, 2019 at 06:37:47PM +0800, Daniel Vetter wrote:
> On Thu, Sep 5, 2019 at 8:58 AM Feng Tang <[email protected]> wrote:
> >
> > Hi Vetter,
> >
> > On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie <[email protected]> wrote:
> > > >
> > > > On Wed, 4 Sep 2019 at 19:17, Daniel Vetter <[email protected]> wrote:
> > > > >
> > > > > On Wed, Sep 4, 2019 at 10:35 AM Feng Tang <[email protected]> wrote:
> > > > > >
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Am 04.09.19 um 08:27 schrieb Feng Tang:
> > > > > > > > >> Thank you for testing. But don't get too excited, because the patch
> > > > > > > > >> simulates a bug that was present in the original mgag200 code. A
> > > > > > > > >> significant number of frames are simply skipped. That is apparently the
> > > > > > > > >> reason why it's faster.
> > > > > > > > >
> > > > > > > > > Thanks for the detailed info, so the original code skips time-consuming
> > > > > > > > > work inside atomic context on purpose. Is there any space to optmise it?
> > > > > > > > > If 2 scheduled update worker are handled at almost same time, can one be
> > > > > > > > > skipped?
> > > > > > > >
> > > > > > > > To my knowledge, there's only one instance of the worker. Re-scheduling
> > > > > > > > the worker before a previous instance started, will not create a second
> > > > > > > > instance. The worker's instance will complete all pending updates. So in
> > > > > > > > some way, skipping workers already happens.
> > > > > > >
> > > > > > > So I think that the most often fbcon update from atomic context is the
> > > > > > > blinking cursor. If you disable that one you should be back to the old
> > > > > > > performance level I think, since just writing to dmesg is from process
> > > > > > > context, so shouldn't change.
> > > > > >
> > > > > > Hmm, then for the old driver, it should also do the most update in
> > > > > > non-atomic context?
> > > > > >
> > > > > > One other thing is, I profiled that updating a 3MB shadow buffer needs
> > > > > > 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with
> > > > > > the cache setting of DRM shadow buffer? say the orginal code use a
> > > > > > cachable buffer?
> > > > >
> > > > > Hm, that would indicate the write-combining got broken somewhere. This
> > > > > should definitely be faster. Also we shouldn't transfer the hole
> > > > > thing, except when scrolling ...
> > > >
> > > > First rule of fbcon usage, you are always effectively scrolling.
> > > >
> > > > Also these devices might be on a PCIE 1x piece of wet string, not sure
> > > > if the numbers reflect that.
> > >
> > > pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and
> > > overhead not entirely out of the question that 150MB/s is actually the
> > > hw limit. If it's really pcie 1x 1.0, no idea where to check that.
> > > Also might be worth to double-check that the gpu pci bar is listed as
> > > wc in debugfs/x86/pat_memtype_list.
> >
> > Here is some dump of the device info and the pat_memtype_list, while it is
> > running other 0day task:
>
> Looks all good, I guess Dave is right with this probably only being a
> real slow, real old pcie link, plus maybe some inefficiencies in the
> mapping. Your 150MB/s, was that just the copy, or did you include all
> the setup/map/unmap/teardown too in your measurement in the trace?


The following is the breakdown; the 19240 us is the memory copy time.

drm_fb_helper_dirty_work() calls, in sequence:
1. drm_client_buffer_vmap (290 us)
2. drm_fb_helper_dirty_blit_real (19240 us)
3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
4. drm_client_buffer_vunmap (215 us)
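
So the blit alone runs at roughly 3 MB / 0.01924 s = 156 MB/s, and vmap
plus vunmap only add about 0.5 ms combined; essentially the whole 20 ms
is the memcpy into VRAM.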

Thanks,
Feng


> -Daniel
>
> >
> > controller info
> > =================
> > 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller])
> > Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1)
> > Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Interrupt: pin A routed to IRQ 16
> > NUMA node: 0
> > Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M]
> > Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K]
> > Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M]
> > Expansion ROM at 000c0000 [disabled] [size=128K]
> > Capabilities: [dc] Power Management version 2
> > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 128 bytes, MaxReadReq 128 bytes
> > DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit-
> > Address: 00000000 Data: 0000
> > Kernel driver in use: mgag200
> > Kernel modules: mgag200
> >
> >
> > Related pat setting
> > ===================
> > uncached-minus @ 0xc0000000-0xc0001000
> > uncached-minus @ 0xc0000000-0xd0000000
> > uncached-minus @ 0xc0008000-0xc0009000
> > uncached-minus @ 0xc0009000-0xc000a000
> > uncached-minus @ 0xc0010000-0xc0011000
> > uncached-minus @ 0xc0011000-0xc0012000
> > uncached-minus @ 0xc0012000-0xc0013000
> > uncached-minus @ 0xc0013000-0xc0014000
> > uncached-minus @ 0xc0018000-0xc0019000
> > uncached-minus @ 0xc0019000-0xc001a000
> > uncached-minus @ 0xc001a000-0xc001b000
> > write-combining @ 0xd0000000-0xd0300000
> > write-combining @ 0xd0000000-0xd1000000
> > uncached-minus @ 0xd1800000-0xd1804000
> > uncached-minus @ 0xd1900000-0xd1980000
> > uncached-minus @ 0xd1980000-0xd1981000
> > uncached-minus @ 0xd1a00000-0xd1a80000
> > uncached-minus @ 0xd1a80000-0xd1a81000
> > uncached-minus @ 0xd1f10000-0xd1f11000
> > uncached-minus @ 0xd1f11000-0xd1f12000
> > uncached-minus @ 0xd1f12000-0xd1f13000
> >
> > Host bridge info
> > ================
> > 00:00.0 Host bridge: Intel Corporation Device 7853
> > Subsystem: Intel Corporation Device 0000
> > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Interrupt: pin A routed to IRQ 0
> > NUMA node: 0
> > Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 128 bytes, MaxReadReq 128 bytes
> > DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us
> > ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
> > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
> > RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible-
> > RootCap: CRSVisible-
> > RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> > DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
> > LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> > Compliance De-emphasis: -6dB
> > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> > EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > Capabilities: [e0] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?>
> > Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?>
> > Capabilities: [250 v1] #19
> > Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?>
> > Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
> >
> >
> > Thanks,
> > Feng
> >
> >
> > >
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

2019-09-10 18:22:54

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 04.09.19 at 08:27, Feng Tang wrote:
> Hi Thomas,
>
> On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 28.08.19 um 11:37 schrieb Rong Chen:
>>> Hi Thomas,
>>>
>>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>>>>> Both patches have little impact on the performance from our side.
>>>> Thanks for testing. Too bad they doesn't solve the issue.
>>>>
>>>> There's another patch attached. Could you please tests this as well?
>>>> Thanks a lot!
>>>>
>>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
>>>> The idea of the patch is that the old mgag200 code might display much
>>>> less frames that the generic code, because mgag200 only prints from
>>>> non-atomic context. If we simulate this with the generic code, we should
>>>> see roughly the original performance.
>>>>
>>>>
>>>
>>> It's cool, the patch "usecansleep.patch" can fix the issue.
>>
>> Thank you for testing. But don't get too excited, because the patch
>> simulates a bug that was present in the original mgag200 code. A
>> significant number of frames are simply skipped. That is apparently the
>> reason why it's faster.
>
> Thanks for the detailed info, so the original code skips time-consuming
> work inside atomic context on purpose. Is there any space to optmise it?
> If 2 scheduled update worker are handled at almost same time, can one be
> skipped?

We discussed ideas on IRC and decided that screen updates could be
synchronized with vblank intervals. This may provide some rate limiting
for the output.
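
As a rough sketch of the idea (illustrative only; the actual patch set
is structured differently, and the names below are made up):

/* Illustrative sketch, not the real patches: coalesce dirty updates
 * and let the worker flush to VRAM at most once per refresh interval.
 */
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

#define REFRESH_INTERVAL (HZ / 60)	/* ~16 ms at 60 Hz */

struct throttled_dirty {
	struct delayed_work work;
	unsigned long last_flush;	/* jiffies of the last VRAM update */
};

static void throttled_dirty_worker(struct work_struct *work)
{
	struct throttled_dirty *td = container_of(to_delayed_work(work),
						  struct throttled_dirty, work);

	/* vmap, blit all accumulated dirty clips to VRAM, vunmap ... */

	td->last_flush = jiffies;
}

static void throttled_dirty_schedule(struct throttled_dirty *td)
{
	unsigned long next = td->last_flush + REFRESH_INTERVAL;

	/*
	 * If a flush ran recently, delay the worker until a full refresh
	 * interval has passed. Everything marked dirty in the meantime is
	 * then handled by a single blit pass; scheduling an already-pending
	 * delayed work is a no-op, which is what coalesces the updates.
	 */
	if (time_is_after_jiffies(next))
		schedule_delayed_work(&td->work, next - jiffies);
	else
		schedule_delayed_work(&td->work, 0);
}

The real patches tie the update rate to the vblank interval rather than
a hard-coded 60 Hz; the sketch only shows the rate-limiting principle.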

If you like, you could try the patch set at [1]. It adds the respective
code to console and mgag200.

Best regards
Thomas

[1]
https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html

>
> Thanks,
> Feng
>
>>
>> Best regards
>> Thomas

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



2019-09-16 10:40:18

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi Thomas,

On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
> Hi
>
> Am 04.09.19 um 08:27 schrieb Feng Tang:
> > Hi Thomas,
> >
> > On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
> >> Hi
> >>
> >> Am 28.08.19 um 11:37 schrieb Rong Chen:
> >>> Hi Thomas,
> >>>
> >>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
> >>>> Hi
> >>>>
> >>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
> >>>>> Both patches have little impact on the performance from our side.
> >>>> Thanks for testing. Too bad they doesn't solve the issue.
> >>>>
> >>>> There's another patch attached. Could you please tests this as well?
> >>>> Thanks a lot!
> >>>>
> >>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
> >>>> The idea of the patch is that the old mgag200 code might display much
> >>>> less frames that the generic code, because mgag200 only prints from
> >>>> non-atomic context. If we simulate this with the generic code, we should
> >>>> see roughly the original performance.
> >>>>
> >>>>
> >>>
> >>> It's cool, the patch "usecansleep.patch" can fix the issue.
> >>
> >> Thank you for testing. But don't get too excited, because the patch
> >> simulates a bug that was present in the original mgag200 code. A
> >> significant number of frames are simply skipped. That is apparently the
> >> reason why it's faster.
> >
> > Thanks for the detailed info, so the original code skips time-consuming
> > work inside atomic context on purpose. Is there any space to optmise it?
> > If 2 scheduled update worker are handled at almost same time, can one be
> > skipped?
>
> We discussed ideas on IRC and decided that screen updates could be
> synchronized with vblank intervals. This may give some rate limiting to
> the output.
>
> If you like, you could try the patch set at [1]. It adds the respective
> code to console and mgag200.

I just tried the 2 patches: no obvious change (compared to the 18.8%
regression), either in the overall benchmark or in the micro-profiling.

90f479ae51afa45e            04a0983095feaee022cdd65e3e4
----------------            ---------------------------
       37236 ± 3%     +2.5%        38167 ± 3%   vm-scalability.median
        0.15 ± 24%   -25.1%         0.11 ± 23%  vm-scalability.median_stddev
        0.15 ± 23%   -25.1%         0.11 ± 22%  vm-scalability.stddev
    12767318 ± 4%     +2.5%     13089177 ± 3%   vm-scalability.throughput

Thanks,
Feng

>
> Best regards
> Thomas
>
> [1]
> https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
>
> >
> > Thanks,
> > Feng
> >
> >>
> >> Best regards
> >> Thomas
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
>



2019-09-17 09:54:24

by Thomas Zimmermann

[permalink] [raw]
Subject: Re: [LKP] [drm/mgag200] 90f479ae51: vm-scalability.median -18.8% regression

Hi

On 16.09.19 at 11:06, Feng Tang wrote:
> Hi Thomas,
>
> On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 04.09.19 um 08:27 schrieb Feng Tang:
>>> Hi Thomas,
>>>
>>> On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
>>>> Hi
>>>>
>>>> Am 28.08.19 um 11:37 schrieb Rong Chen:
>>>>> Hi Thomas,
>>>>>
>>>>> On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
>>>>>> Hi
>>>>>>
>>>>>> Am 27.08.19 um 14:33 schrieb Chen, Rong A:
>>>>>>> Both patches have little impact on the performance from our side.
>>>>>> Thanks for testing. Too bad they doesn't solve the issue.
>>>>>>
>>>>>> There's another patch attached. Could you please tests this as well?
>>>>>> Thanks a lot!
>>>>>>
>>>>>> The patch comes from Daniel Vetter after discussing the problem on IRC.
>>>>>> The idea of the patch is that the old mgag200 code might display much
>>>>>> less frames that the generic code, because mgag200 only prints from
>>>>>> non-atomic context. If we simulate this with the generic code, we should
>>>>>> see roughly the original performance.
>>>>>>
>>>>>>
>>>>>
>>>>> It's cool, the patch "usecansleep.patch" can fix the issue.
>>>>
>>>> Thank you for testing. But don't get too excited, because the patch
>>>> simulates a bug that was present in the original mgag200 code. A
>>>> significant number of frames are simply skipped. That is apparently the
>>>> reason why it's faster.
>>>
>>> Thanks for the detailed info, so the original code skips time-consuming
>>> work inside atomic context on purpose. Is there any space to optmise it?
>>> If 2 scheduled update worker are handled at almost same time, can one be
>>> skipped?
>>
>> We discussed ideas on IRC and decided that screen updates could be
>> synchronized with vblank intervals. This may give some rate limiting to
>> the output.
>>
>> If you like, you could try the patch set at [1]. It adds the respective
>> code to console and mgag200.
>
> I just tried the 2 patches, no obvious change (comparing to the
> 18.8% regression), both in overall benchmark and micro-profiling.
>
> 90f479ae51afa45e            04a0983095feaee022cdd65e3e4
> ----------------            ---------------------------
>        37236 ± 3%     +2.5%        38167 ± 3%   vm-scalability.median
>         0.15 ± 24%   -25.1%         0.11 ± 23%  vm-scalability.median_stddev
>         0.15 ± 23%   -25.1%         0.11 ± 22%  vm-scalability.stddev
>     12767318 ± 4%     +2.5%     13089177 ± 3%   vm-scalability.throughput

Thank you for testing. I wish we'd seen at least some improvement.

Best regards
Thomas

> Thanks,
> Feng
>
>>
>> Best regards
>> Thomas
>>
>> [1]
>> https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
>>
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>> Best regards
>>>> Thomas
>>
>> --
>> Thomas Zimmermann
>> Graphics Driver Developer
>> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
>> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
>> HRB 21284 (AG Nürnberg)
>>
>
>
>

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

