Re: Runaway intr, not flash related

From
Andriy Gapon <avg@icyb.net.ua>
Date
19 Aug 2010 17:42:31
Subject
Re: Runaway intr, not flash related
Message-ID
4C6D6D03.4000101@icyb.net.ua

on 19/08/2010 20:30 Doug Barton said the following:
> On 08/19/2010 08:24, Andriy Gapon wrote:
>> I am sorry, but I don't see anything dramatically wrong here. So
>> "swi4: clock" uses 5.76% of WCPU, is that such a big deal to be
>> called "runaway intr"?
>
> That's the symptom.

OK, I see.

Perhaps you will find this message (and its ancestor thread) interesting:
http://lists.freebsd.org/pipermail/freebsd-hackers/2008-February/023447.html
I believe that your issue is different, but perhaps that stuff will inspire you to
use ktr(4) and schedgraph to properly debug this issue. I strongly believe that
you have some sort of a scheduling issue and ktr seems to be the way to
investigate it.

Perhaps you can first try the following dtrace script.
It should give a better view of what statclock sees, but I am not sure if that
information will be sufficient.
/********************************************************/
/* Count kernel stacks seen at statclock entry while running on CPU 0. */
fbt::statclock:entry
/curthread->td_oncpu == 0/
{
    @stacks0[stack()] = count();
    counts0++;
}

/* Same for CPU 1. */
fbt::statclock:entry
/curthread->td_oncpu == 1/
{
    @stacks1[stack()] = count();
    counts1++;
}

/* All CPUs, keyed by process, thread and stack. */
fbt::statclock:entry
{
    @stacks[pid, tid, stack()] = count();
    counts++;
}

END
{
    /* Per-CPU results: scale to roughly percent of that CPU's samples, keep the top 5. */
    printf("\n");
    printf("***** CPU 0:\n");
    normalize(@stacks0, counts0 / 100);
    trunc(@stacks0, 5);
    printa("%k%@u\n\n", @stacks0);

    printf("\n\n");
    printf("***** CPU 1:\n");
    normalize(@stacks1, counts1 / 100);
    trunc(@stacks1, 5);
    printa("%k%@u\n\n", @stacks1);

    /* Overall results: roughly percent of one CPU (two CPUs assumed, hence 200), top 20. */
    printf("\n\n");
    printf("***** Top Processes:\n");
    normalize(@stacks, counts / 200);
    trunc(@stacks, 20);
    printa(@stacks);
}
/********************************************************/
You would run this script when the problem hits; a few seconds should be sufficient.
You may want to play with the values in the trunc() calls. You may also want to
filter the gathered statistics by pid/tid (using conditions in /.../) if you spot
anything interesting or unusual.
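
As an illustration of such filtering, here is a minimal sketch of a variant that
only counts stacks for one thread. It assumes the pid and tid of interest are
passed on the command line as the dtrace macro arguments $1 and $2; that
convention, and the limit of 10 stacks, are choices made for this example and
are not part of the script above.
/********************************************************/
/* Hypothetical filter: $1 is the pid and $2 the tid picked from the earlier output. */
fbt::statclock:entry
/pid == $1 && tid == $2/
{
    @stacks[stack()] = count();
}

END
{
    /* Keep the 10 most frequent stacks for the selected thread. */
    trunc(@stacks, 10);
    printa("%k%@u\n\n", @stacks);
}
/********************************************************/
Saved under a name such as filter.d (hypothetical), it could be run as
"dtrace -s filter.d 1234 100123", with the two numbers replaced by the pid and
tid you want to examine, and stopped with Ctrl-C after a few seconds.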

--
Andriy Gapon

