Depending on the size of the devcoredump and the speed at which
userspace reads the dump, the read can take longer than the current
5 minute timeout.

This can lead to incomplete dumps, as the device is deleted once the
timeout expires.

One example is below, where it took over 6 minutes for the devcoredump
to be completely read:

04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node

Increase the timeout to 10 minutes to accommodate system delays and
large coredump sizes.
Signed-off-by: Abhinav Kumar <[email protected]>
---
drivers/base/devcoredump.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f4d794d..6b83ae5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -18,8 +18,8 @@ static struct class devcd_class;
/* global disable flag, for security purposes */
static bool devcd_disabled;
-/* if data isn't read by userspace after 5 minutes then delete it */
-#define DEVCD_TIMEOUT (HZ * 60 * 5)
+/* if data isn't read by userspace after 10 minutes then delete it */
+#define DEVCD_TIMEOUT (HZ * 60 * 10)
struct devcd_entry {
struct device devcd_dev;
--
2.7.4
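For context, a condensed sketch of how DEVCD_TIMEOUT drives the teardown
(simplified from drivers/base/devcoredump.c; error handling and unrelated
fields omitted):

/* The dump device schedules delayed work that tears the device down
 * once DEVCD_TIMEOUT expires, whether or not userspace has finished
 * reading /sys/class/devcoredump/devcdN/data.
 */
static void devcd_del(struct work_struct *wk)
{
        struct devcd_entry *devcd;

        devcd = container_of(wk, struct devcd_entry, del_wk.work);

        device_del(&devcd->devcd_dev);
        put_device(&devcd->devcd_dev);
}

/* ...and in dev_coredumpm(), once the device has been registered: */
        INIT_DELAYED_WORK(&devcd->del_wk, devcd_del);
        schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT);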
Hi Greg
Thanks for the response.
On 2/11/2022 3:09 AM, Greg KH wrote:
> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>> There are cases where depending on the size of the devcoredump and the speed
>> at which the usermode reads the dump, it can take longer than the current 5 mins
>> timeout.
>>
>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>
>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>
>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>
> What makes this so slow? Reading from the kernel shouldn't be the
> limit, is it where the data is being sent to?
We are still checking this. We are seeing better read times when we bump
up the priority of the thread doing the read.

We are also checking whether bumping up the CPU speed helps. But the
results have not been consistently good enough, so we thought we should
also increase the timeout to be safe.
>
>> Increase the timeout to 10 mins to accommodate system delays and large coredump
>> sizes.
>
> Nit, please wrap your changelog texts at 72 columns.
>
Yes, I will fix this when I re-post.
> And what is "large"?
We are seeing devcoredumps in the range of 2.5MB-3MB. I can also mention
this in the commit text in the next post.
Thanks
Abhinav
>
> thanks,
>
> greg k-h
On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>
> The thread is writing the data to a file in local storage. From our
> profiling, the read is the one taking the time not the write.
>
That seems kind of hard to believe, let's say it's a 4/3 split (4
minutes reading, 3 minutes writing, to make read > write as you say),
and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
low, unless you're reading with really tiny buffers?
Can you strace this somehow? (with timestamp info)
> Just doubling what we have currently. I am not sure how the current 5
> mins timeout came from.
>
To be honest it came out of thin air, and wasn't really meant as a limit
on how fast you can read (feels like even if it's tens of MiB you should
read it in milliseconds into userspace), but more of a maximum time that
we're willing to waste kernel memory if nobody is around to read the
data.
I thought it'd be better if we could somehow pin it while the userspace
is reading it, but OTOH maybe that's actually bad, since that means
userspace (though suitably privileged) could pin this kernel memory
indefinitely.
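For illustration, a hypothetical sketch of that "pin while userspace is
reading" idea: re-arm the existing deletion work from the sysfs read
handler. This is a sketch of the suggestion, not what devcoredump
currently does:

static ssize_t devcd_data_read(struct file *filp, struct kobject *kobj,
                               struct bin_attribute *bin_attr,
                               char *buffer, loff_t offset, size_t count)
{
        struct device *dev = kobj_to_dev(kobj);
        struct devcd_entry *devcd = dev_to_devcd(dev);

        /* Hypothetical: push the deletion deadline out while reads
         * keep arriving, so the data stays pinned for an active
         * reader (and, as noted above, potentially indefinitely).
         */
        mod_delayed_work(system_wq, &devcd->del_wk, DEVCD_TIMEOUT);

        return devcd->read(buffer, offset, count, devcd->data,
                           devcd->datalen);
}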
johannes
Hi Johannes
On 2/12/2022 12:24 AM, Johannes Berg wrote:
> On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>>
>> The thread is writing the data to a file in local storage. From our
>> profiling, the read is the one taking the time not the write.
>>
>
> That seems kind of hard to believe, let's say it's a 4/3 split (4
> minutes reading, 3 minutes writing, to make read > write as you say),
> and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
> low, unless you're reading with really tiny buffers?
>
> Can you strace this somehow? (with timestamp info)
>
Yes, as I have already mentioned in earlier comments, we are still
checking what is taking so long.

Once we find something from our analysis and also have the trace, we
will update the thread.
>> Just doubling what we have currently. I am not sure how the current 5
>> mins timeout came from.
>>
>
> To be honest it came out of thin air, and wasn't really meant as a limit
> on how fast you can read (feels like even if it's tens of MiB you should
> read it in milliseconds into userspace), but more of a maximum time that
> we're willing to waste kernel memory if nobody is around to read the
> data.
>
> I thought it'd be better if we could somehow pin it while the userspace
> is reading it, but OTOH maybe that's actually bad, since that means
> userspace (though suitably privileged) could pin this kernel memory
> indefinitely.
>
> johannes
Hi Greg
On 2/11/2022 11:04 PM, Greg KH wrote:
> On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
>> Hi Greg
>>
>> Thanks for the response.
>>
>> On 2/11/2022 3:09 AM, Greg KH wrote:
>>> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>>>> There are cases where depending on the size of the devcoredump and the speed
>>>> at which the usermode reads the dump, it can take longer than the current 5 mins
>>>> timeout.
>>>>
>>>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>>>
>>>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>>>
>>>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>>>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>>
>>> What makes this so slow? Reading from the kernel shouldn't be the
>>> limit, is it where the data is being sent to?
>>
>> We are still checking this. We are seeing better read times when we bump up
>> the thread priority of the thread which was reading this.
>
> Where is the thread sending the data to?
The thread is writing the data to a file in local storage. From our
profiling, the read is the one taking the time, not the write.
>
>> We are also trying to check if bumping up CPU speed is helping.
>> But, results have not been consistently good enough. So we thought we should
>> also increase the timeout to be safe.
>
> Why would 10 minutes be better than 30? What should the limit be? :)
Again, this is from our profiling. We are seeing a worst-case time of
7 minutes to finish the read for our data. That's where the 10 minutes
came from: just doubling what we have currently. I am not sure where
the current 5 minute timeout came from.
>
> thanks,
>
> greg k-h
On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
> Hi Greg
>
> Thanks for the response.
>
> On 2/11/2022 3:09 AM, Greg KH wrote:
> > On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> > > There are cases where depending on the size of the devcoredump and the speed
> > > at which the usermode reads the dump, it can take longer than the current 5 mins
> > > timeout.
> > >
> > > This can lead to incomplete dumps as the device is deleted once the timeout expires.
> > >
> > > One example is below where it took 6 mins for the devcoredump to be completely read.
> > >
> > > 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> > > 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
> >
> > What makes this so slow? Reading from the kernel shouldn't be the
> > limit, is it where the data is being sent to?
>
> We are still checking this. We are seeing better read times when we bump up
> the thread priority of the thread which was reading this.
Where is the thread sending the data to?
> We are also trying to check if bumping up CPU speed is helping.
> But, results have not been consistently good enough. So we thought we should
> also increase the timeout to be safe.
Why would 10 minutes be better than 30? What should the limit be? :)
thanks,
greg k-h
Hi Greg
On 2/12/2022 12:29 AM, Greg KH wrote:
> On Fri, Feb 11, 2022 at 11:52:41PM -0800, Abhinav Kumar wrote:
>> Hi Greg
>>
>> On 2/11/2022 11:04 PM, Greg KH wrote:
>>> On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
>>>> Hi Greg
>>>>
>>>> Thanks for the response.
>>>>
>>>> On 2/11/2022 3:09 AM, Greg KH wrote:
>>>>> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>>>>>> There are cases where depending on the size of the devcoredump and the speed
>>>>>> at which the usermode reads the dump, it can take longer than the current 5 mins
>>>>>> timeout.
>>>>>>
>>>>>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>>>>>
>>>>>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>>>>>
>>>>>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>>>>>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>>>>
>>>>> What makes this so slow? Reading from the kernel shouldn't be the
>>>>> limit, is it where the data is being sent to?
>>>>
>>>> We are still checking this. We are seeing better read times when we bump up
>>>> the thread priority of the thread which was reading this.
>>>
>>> Where is the thread sending the data to?
>>
>> The thread is writing the data to a file in local storage. From our
>> profiling, the read is the one taking the time not the write.
>
> The read is coming directly from memory, there should not be any
> slowdown at all here. How can that be the delay? Have a trace
> somewhere?
>
> thanks,
>
> greg k-h
Yes, like I mentioned in my previous comment, we are still checking why
it's taking so long. We will update with our findings if we have any.

Alright, we will try to capture a trace to share and will update this
thread if we find something as well.
On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> There are cases where depending on the size of the devcoredump and the speed
> at which the usermode reads the dump, it can take longer than the current 5 mins
> timeout.
>
> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>
> One example is below where it took 6 mins for the devcoredump to be completely read.
>
> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
What makes this so slow? Reading from the kernel shouldn't be the
limit, is it where the data is being sent to?
> Increase the timeout to 10 mins to accommodate system delays and large coredump
> sizes.
Nit, please wrap your changelog texts at 72 columns.
And what is "large"?
thanks,
greg k-h
On Fri, Feb 11, 2022 at 11:52:41PM -0800, Abhinav Kumar wrote:
> Hi Greg
>
> On 2/11/2022 11:04 PM, Greg KH wrote:
> > On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
> > > Hi Greg
> > >
> > > Thanks for the response.
> > >
> > > On 2/11/2022 3:09 AM, Greg KH wrote:
> > > > On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> > > > > There are cases where depending on the size of the devcoredump and the speed
> > > > > at which the usermode reads the dump, it can take longer than the current 5 mins
> > > > > timeout.
> > > > >
> > > > > This can lead to incomplete dumps as the device is deleted once the timeout expires.
> > > > >
> > > > > One example is below where it took 6 mins for the devcoredump to be completely read.
> > > > >
> > > > > 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> > > > > 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
> > > >
> > > > What makes this so slow? Reading from the kernel shouldn't be the
> > > > limit, is it where the data is being sent to?
> > >
> > > We are still checking this. We are seeing better read times when we bump up
> > > the thread priority of the thread which was reading this.
> >
> > Where is the thread sending the data to?
>
> The thread is writing the data to a file in local storage. From our
> profiling, the read is the one taking the time not the write.
The read is coming directly from memory, there should not be any
slowdown at all here. How can that be the delay? Have a trace
somewhere?
thanks,
greg k-h
Hi Johannes and Greg
On 2/12/2022 12:35 AM, Abhinav Kumar wrote:
> Hi Johannes
>
> On 2/12/2022 12:24 AM, Johannes Berg wrote:
>> On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>>>
>>> The thread is writing the data to a file in local storage. From our
>>> profiling, the read is the one taking the time not the write.
>>>
>>
>> That seems kind of hard to believe, let's say it's a 4/3 split (4
>> minutes reading, 3 minutes writing, to make read > write as you say),
>> and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
>> low, unless you're reading with really tiny buffers?
>>
>> Can you strace this somehow? (with timestamp info)
>>
>
> Yes, like I have already mentioned in earlier comments, we continue to
> check whats taking that long.
>
> Once we find something from our analysis and also have the trace, will
> update the thread.
>
>>> Just doubling what we have currently. I am not sure how the current 5
>>> mins timeout came from.
>>>
>>
>> To be honest it came out of thin air, and wasn't really meant as a limit
>> on how fast you can read (feels like even if it's tens of MiB you should
>> read it in milliseconds into userspace), but more of a maximum time that
>> we're willing to waste kernel memory if nobody is around to read the
>> data.
>>
>> I thought it'd be better if we could somehow pin it while the userspace
>> is reading it, but OTOH maybe that's actually bad, since that means
>> userspace (though suitably privileged) could pin this kernel memory
>> indefinitely.
>>
>> johannes
So, we were able to narrow down the bottleneck further. The tiny
buffers Johannes was referring to come from the sysfs read path below:
it defaults to a PAGE_SIZE worth of data per read, which takes a long
time because of the sheer number of reads.

If we increase the length to match the size of our data, like below, we
are able to finish the read in almost no time.
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -184,10 +184,11 @@ static const struct seq_operations kernfs_seq_ops = {
 static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
-	ssize_t len = min_t(size_t, iov_iter_count(iter), PAGE_SIZE);
+	ssize_t len = min_t(size_t, iov_iter_count(iter), (PAGE_SIZE * 768));
 	const struct kernfs_ops *ops;
 	char *buf;
 
+	pr_err("[hbc debug] %s, len:%zd\n", __func__, len);
 	buf = of->prealloc_buf;
 	if (buf)
 		mutex_lock(&of->prealloc_mutex);
(768 because the size of our dump was ~3 MB, which is ~768 * 4 KB
blocks)
We also did some profiling on how much increasing the block size helps;
here is the data:

Block size    Read time
4 KB          229 s
8 KB          86 s
3 MB          2 s

So it looks like even 2x or 4x the current block size would help quite
a bit.
Hence, while we are exploring other options like reducing the size of
the dump, I also wanted to check whether increasing the block size to
something like 4 * 4 KB could be a solution, since based on our
profiling it would bring the read times down drastically.
Thanks
Abhinav
From: Abhinav Kumar
> Sent: 28 February 2022 21:38
...
> We also did some profiling around how much increasing the block size
> helps and here is the data:
>
> Block size cost
>
> 4KB 229s
> 8KB 86s
You must have an O(n^2) operation in there - find it.
David
On Mon, Feb 28, 2022 at 10:49 PM David Laight <[email protected]> wrote:
>
> From: Abhinav Kumar
> > Sent: 28 February 2022 21:38
> ...
> > We also did some profiling around how much increasing the block size
> > helps and here is the data:
> >
> > Block size cost
> >
> > 4KB 229s
> > 8KB 86s
>
> You must have an O(n^2) operation in there - find it.
The problem is how the devcoredump/sysfs interface works, which
results in "re-rendering" the output for each block. It's fine for
moderate-size sysfs files, but scales quite badly once you get into
sysfs files of a couple of MB.
It could be fixed by having some way to keep state across successive
read callbacks.
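As a rough back-of-the-envelope illustration of that quadratic
behaviour (an approximate model only, not a measurement from this
thread): if each read of a P-byte block re-renders the dump from the
start up to that block, the total bytes rendered for an N-byte dump is
about N^2 / (2 * P), so doubling the block size roughly halves the work.

/* Rough userspace model of the re-render cost; illustrative only. */
#include <stdio.h>

static unsigned long long total_rendered(unsigned long long n,
                                         unsigned long long p)
{
        unsigned long long reads = (n + p - 1) / p;
        unsigned long long sum = 0, i, end;

        for (i = 0; i < reads; i++) {
                /* read i re-renders everything up to its block's end */
                end = (i + 1) * p;
                sum += end > n ? n : end;
        }
        return sum;
}

int main(void)
{
        unsigned long long n = 3ULL << 20;      /* ~3 MiB dump */

        /* 4 KiB blocks render roughly twice the bytes of 8 KiB blocks */
        printf("4K blocks: %llu bytes rendered\n", total_rendered(n, 4096));
        printf("8K blocks: %llu bytes rendered\n", total_rendered(n, 8192));
        return 0;
}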
BR,
-R
> David
>
On Tue, 2022-03-01 at 09:45 -0800, Rob Clark wrote:
> On Mon, Feb 28, 2022 at 10:49 PM David Laight <[email protected]> wrote:
> >
> > From: Abhinav Kumar
> > > Sent: 28 February 2022 21:38
> > ...
> > > We also did some profiling around how much increasing the block size
> > > helps and here is the data:
> > >
> > > Block size cost
> > >
> > > 4KB 229s
> > > 8KB 86s
> >
> > You must have an O(n^2) operation in there - find it.
>
> The problem is how the devcoredump/sysfs interface works, which
> results in "re-rendering" the output for each block.. it's fine for
> moderate size sysfs files, but scales quite badly once you get into
> couple MB size sysfs files.
>
> It could be fixed by having some way to keep state across successive
> read callbacks.
I'm not sure that's true? Perhaps for dev_coredumpm(), but only if you
implemented read() badly.
If you have e.g. dev_coredumpsg() or dev_coredumpv() that's just a
simple read from the existing buffer.
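A minimal sketch of the distinction, assuming a driver that can render
its whole dump up front (driver-side names below are made up, for
illustration only): with dev_coredumpv() the core serves reads straight
out of the buffer it was handed, so nothing is re-rendered per read.

/* Hypothetical driver-side usage, for illustration only. */
#include <linux/devcoredump.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static void my_driver_crash_handler(struct device *dev,
                                    const void *snapshot, size_t len)
{
        /* Render the whole dump once, up front. */
        void *buf = vmalloc(len);

        if (!buf)
                return;
        memcpy(buf, snapshot, len);

        /*
         * dev_coredumpv() takes ownership of buf (the core vfree()s it)
         * and serves reads straight from it -- no per-read rendering.
         * dev_coredumpm(), by contrast, invokes a driver read()
         * callback for every chunk, which is where quadratic behaviour
         * can creep in if that callback regenerates the dump each time.
         */
        dev_coredumpv(dev, buf, len, GFP_KERNEL);
}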
johannes