slightly scary syslog reports cpu problem

Postby nordle » Mon May 08, 2006 12:14 am

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 1: 9400000000010011

Just noticed that in the syslog, no idea what it was doing at the time. A quick google and I can see people saying things like "replace the cpu" :shock: :shock:

This 3700+ was only purchased about 6 months ago and has NOT been overclocked in any way. Temperature-wise it's normally 32°C; when gaming it does go up to 52°C, but that should be fine.

Anyone know anything about this or...

RE: slightly scary syslog reports cpu problem

Postby nordle » Mon May 08, 2006 1:01 am

http://www.codemonkey.org.uk/cruft/parsemce.c/ can decode these problems once the codes have been passed to it, although I seem to be missing some codes: either the log is not verbose enough, or the error was not important, or I need to enable more options in the kernel. Will look into it.
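As a rough illustration of what a decoder does with the Bank 1 value from the first post, here is a minimal Python sketch. It assumes the generic x86 machine-check status layout (flag bits 63 down to 57, error code in the low 16 bits); the function and field names are made up for this example, this is not how parsemce itself is written, and the vendor's documentation for the specific CPU remains the authority on what each field means.

```python
# Sketch only: bit positions follow the generic x86 machine-check status
# layout (an assumption here). Function and field names are hypothetical.

def classify_error_code(code):
    """Give a rough category for the low 16-bit error code."""
    if (code & 0xFFF0) == 0x0010:
        # TLB errors are encoded as 0000 0000 0001 TTLL
        tt = ("instruction", "data", "generic", "reserved")[(code >> 2) & 3]
        ll = ("level 0", "level 1", "level 2", "generic level")[code & 3]
        return f"TLB error ({tt}, {ll})"
    if (code & 0xFF00) == 0x0100:
        return "memory hierarchy (cache) error"
    if (code & 0xF800) == 0x0800:
        return "bus or interconnect error"
    return "not classified by this sketch"


def decode_mce_status(hex_status):
    """Split a 64-bit bank status value (hex string) into named fields."""
    status = int(hex_status, 16)

    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # low 16 bits: general error code
    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "error_code": code,
        "kind": classify_error_code(code),
    }


if __name__ == "__main__":
    for name, value in decode_mce_status("9400000000010011").items():
        print(f"{name}: {value}")
```

Under those assumptions, the "uncorrected" flag comes out as False for this value, which is at least consistent with the "correctable" wording in the log line. It says nothing by itself about whether the CPU is healthy.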

Found cpuburn at http://pages.sbcglobal.net/redelm/, might use this to stress test the system

Just hope I don't end up like this dude: http://tinyurl.com/fwzts

I knew a guy whose AMD Athlon 1GHz CPU (OEM) failed after four years. So A. it was OEM and B. even if it were retail it was out of warranty, but AMD still replaced it.
Hoping it's nothing that serious of course.

I knew hours of playing BF2 would have an effect :)

Looking at the time of the error, I think I was running a qemu host, debian, and kaffeine TV recording, so the CPU was at 99% for 2 hours, but maxed out at 52°C......

I'll post back any findings.......hopefully ;)

RE: slightly scary syslog reports cpu problem

Postby jjmac » Mon May 08, 2006 2:07 am

Sounds like it's talking about memory. I couldn't follow the tinyurl link. The only Google response that I could get (that wasn't about a bank :roll: ) was below. And that doesn't reveal much at all.

>>
http://www.linuxjournal.com/article/8211
>>


Have you tried grepping through the kernel source for where the message is? It doesn't necessarily tell you what called it, though. It can sometimes.


>>
I was running a qemu host, debian, and kaffeine TV recording, so the CPU was at 99% for 2 hours, but maxed out at 52°C.
>>

Well, at least it wasn't 99% idle (grin)



jm

Humpty Dumpty Was Pushed !
http://counter.li.org
#313537

The FVWM wm -=- www.fvwm.org -=-

Somebody stole my air guitar, It happened just the other day,
But it's ok, 'cause i've got a spare ...

Re: RE: slightly scary syslog reports cpu problem

Postby nordle » Tue May 09, 2006 9:33 pm

jjmac wrote: Sounds like it's talking about memory. I couldn't follow the tinyurl link. The only Google response that I could get (that wasn't about a bank :roll: ) was below. And that doesn't reveal much at all.

>>
http://www.linuxjournal.com/article/8211
>>


Have you tried grepping through the kernel source for where the message is? It doesn't necessarily tell you what called it, though. It can sometimes.


Thanks for the link jjmac, I'm going to need fresh eyes and brain to absorb it though, I only got a C in Computer Science :)

I've tried
grep -R -B2 -A2 mcheck /usr/src/linux >> /tmp/grep_mcheck.txt
which has thrown up
/usr/src/linux/arch/i386/kernel/cpu/common.c
/usr/src/linux/arch/i386/kernel/cpu/mcheck/k7.c
/usr/src/linux/arch/i386/kernel/cpu/mcheck/mce.c
/usr/src/linux/arch/i386/kernel/cpu/mcheck/mce.h
I briefly looked at these, but not being a programmer (just some ADO, VBA and SQL), I didn't see any codes mentioned, or any explanations or clues.
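Searching for a fragment of the logged message text, rather than the directory name, is another way to follow the suggestion about finding where the message comes from. Below is a small Python sketch of that idea. The function name is invented for this example, and the search string is simply copied from the log line in the first post; kernel source may split long strings across lines, so a short fragment is more likely to match than the whole sentence.

```python
import os


def find_message(root, needle, suffixes=(".c", ".h")):
    """Yield (path, line_number, line) for each source line containing needle.

    Hypothetical helper for illustration; it does a plain substring search,
    much like `grep -rn needle root` restricted to C sources and headers.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" avoids crashing on odd bytes in old sources
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")
```

Calling find_message("/usr/src/linux/arch/i386/kernel/cpu/mcheck", "non fatal") and printing each result would list the file and line where the text appears, assuming the kernel source is installed under that path.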


I think I've just overreacted. I'm going to concentrate on the "non fatal, correctable incident occurred" part, especially as I ran cpuburn for 1 hour. For that hour the CPU was at 99-100% and the temps went up to 55.5°C within 35 mins and then remained constant. That's the hottest it's ever been (gaming doesn't get it that hot), and the kernel reported no problems at all. The desktop remained perfectly usable, and OOo and Firefox ran fine. I wouldn't dare try that on another O/S.

The posts I saw were saying that if you get inaccurate data in the caches at high temps, it would show up here, or if the power supply was poor with fluctuating 12V and 5V rails then again, it would show up here. But everything remained solid, so I guess I'll try memtest to put the RAM through some grief, just in case.

EDIT: Ran memtest twice, no problems..... Just going to keep an eye out for MCE messages, but I guess it was a one-off, not important, maybe even a false positive...
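Keeping an eye out for further MCE messages could be scripted along these lines. This is only a sketch: the two patterns are modelled on the two log lines quoted in the first post, the function name is invented, and it assumes each "Bank" line belongs to the most recent "MCE:" line seen before it.

```python
import re

# Patterns modelled on the two log lines quoted in this thread (an assumption;
# other kernels may word these messages differently).
MCE_LINE = re.compile(r"MCE: .*\bCPU (\d+)")
BANK_LINE = re.compile(r"Bank (\d+): ([0-9a-fA-F]{16})")


def collect_mce_events(lines):
    """Return a list of {'cpu': int, 'banks': [(bank, status_hex), ...]}."""
    events = []
    for line in lines:
        mce = MCE_LINE.search(line)
        if mce:
            events.append({"cpu": int(mce.group(1)), "banks": []})
            continue
        bank = BANK_LINE.search(line)
        if bank and events:
            # Attach the bank value to the most recent MCE report
            events[-1]["banks"].append((int(bank.group(1)), bank.group(2)))
    return events
```

Feeding it the lines of the system log (for example an open file object for /var/log/syslog, if that is where the distro keeps it) gives one entry per report, so an empty list means nothing new has been logged.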

RE: Re: RE: slightly scary syslog reports cpu problem

Postby jjmac » Thu May 11, 2006 10:12 am

Howdy,


Aha, the Machine Check Exception facility. I didn't notice that before. You must have that configured. And it looks like it works too (grin). Cool the way it reports that the error was correctable, and good that it logs. I really am impressed, and must read up more on that ... so much to do (grin).

>>
The posts I saw were saying that if you get inaccurate data in the caches at high temps, it would show up here, or if the power supply was poor with fluctuating 12V and 5V rails then again, it would show up here. But everything remained solid, so I guess I'll try memtest to put the RAM through some grief, just in case.

EDIT: Ran memtest twice, no problems..... Just going to keep an eye out for MCE messages, but I guess it was a one-off, not important, maybe even a false positive...
>>

Yes, it will inevitably involve a number of contributing factors, all variable, so it's hard to single any particular one out as the prime cause. A good power supply will still be at the mercy of mains fluctuations to a degree, unless it's plugged into one of those UPS units. Another thing I wish I could afford.

I've been lucky in that regard. That tiny url worked this time too :).

I think if it was a definite hw issue (RAM), then it would likely occur whenever similar conditions existed, and then start to occur regularly without necessarily the extreme conditions being present. A consistent thing that seems to come across with RAM and temperature, from what I can gather, involves what they call a "bit flip". Possibly that was it.

As you have been stressing the system and it hasn't recurred, you may well be right that it was a one-off. Probably the temp was the straw that set it off, in the context of the high CPU usage.

It can be a real pain too, so much time more or less wasted trying to unravel error messages/codes. It's OK for the manufacturers; I would expect they have them archived for their own use, and they don't really have a lot of ROM space to devote. But as Linux does represent a big change in the computing mindset ... that is, computing has been brought to the people, so to speak ... it would be nice to see human-readable translations for these things become part of the kernel. Though I do know that it is easier said than done, as it would be a mammoth task. But it's great that we have the MCE facility available in any case! I wonder just how often these things have occurred in the past but gone unnoticed due to the absence of any facility to provide the log output. Being forewarned at least is a help.

As long as there aren't any further log entries, then it probably was just a random one-off event, but I wouldn't think it was false. I would have thought that 52°C would have been OK though. It is well within spec. But 52°C is damn hot too, and AMD traces are really thin ... it seems at least now you have a danger mark for temp that you can check for.

I've got MCE configured as a module that's not loaded, so I'll have to change it to a built-in now :)

A long compile in the background, something like an X or libc6 compile, works as a good de facto system/RAM test too.

NB: Had to re-login to post this, on a page opened in a tab from an existing login ...
strange how that happens occasionally?


jm