Code Comments

Interpreting program core dump in mdb
At $DAY_JOB, we've got a customer who has installed our product on a
Solaris 10 Sparc system and is getting a mysterious segment violation in
one of our background processes.  Of course, this problem does not occur
on any of our inhouse systems.

We did get the customer to send us a core file, but aren't very handy
with the debug tools on Solaris.


# mdb prog core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::stack
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
_start+0x108(0, 0, 0, 0, 0, 0)


I've googled up countless articles telling me that ::stack gets a
stack dump, but have yet to find one which tells me what the
values in the display **ARE**.


Some specifics on this one:  It's a daemon process which accepts
a connection and forks off a worker process to handle the connection.
Early on, it calls secure() which is linked from a different .o file:


char user_name[USER_LENGTH + 1];   /* global in .c containing secure */


secure(host)
char *host;
{
...
struct passwd *pw;
...

    pw = getpwuid(getuid());
    if (pw != NULL)
        strncpy(user_name, pw->pw_name, sizeof(user_name)-1);


We seem to blow up on trying to move the user name from pw->pw_name,
which is very strange given that pw is supposed to point to static
space allocated by getpwuid().

struct passwd {
        char    *pw_name;
        char    *pw_passwd;
        uid_t   pw_uid;
        gid_t   pw_gid;
        char    *pw_age;
        char    *pw_comment;
        char    *pw_gecos;
        char    *pw_dir;
        char    *pw_shell;
};


Understanding the context around the stack frame seems really
crucial.   One thing that is really strange is that
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
contains                       r o o t  which should be in
memory at the address pointed to by pw_name...
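
One way to double-check that reading: SPARC is big-endian, so the most
significant byte of a 32-bit register value is the first character in
memory. A minimal sketch in C, where word_to_ascii is a made-up helper
name (it is not part of our program or of mdb):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Decode a 32-bit value as four big-endian bytes, most significant
 * byte first, the way a SPARC load would see them in memory.
 * Non-printable bytes are shown as '.'; out receives 4 chars plus NUL.
 */
void word_to_ascii(uint32_t word, char out[5])
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(word >> (24 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Feeding it 0x726f6f74 yields the string "root", which is why that
fourth value looks like user name data rather than an address.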


We're pretty sure we're doing Something Stupid(tm), but don't see
how we could muck up the static space returned by getpwuid between
the time the program starts and getting to this point.  This is
code that has been running for quite a while on various Unix flavors
including Solaris 7 and upward.   We now see that we have two
Solaris 10 customers with this problem.   The code was compiled
under a Solaris 8 system.

So anyway, some pointers to interpreting the context around a crash
using mdb would be appreciated.

TIA

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition

Mr. Uh Clem
03-28-08 12:28 AM


Re: Interpreting program core dump in mdb
Mr. Uh Clem wrote:

> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes.  Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> > ::stack
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.
>
>
> Some specifics on this one:  It's a daemon process which accepts
> a connection and forks off a worker process to handle the connection.
> Early on, it calls secure() which is linked from a different .o file:
>
>
> char user_name[USER_LENGTH + 1];   /* global in .c containing secure */
>
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
>     pw = getpwuid(getuid());
>     if (pw != NULL)
>         strncpy(user_name, pw->pw_name, sizeof(user_name)-1);

Is it possible for pw->pw_name to be NULL?
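
If it can be, a guard costs little. This is only a sketch, assuming a
hypothetical helper called copy_user_name (not in the posted code) that
treats a NULL pw or a NULL pw->pw_name as "no name" and always
terminates the result:

```c
#include <assert.h>
#include <pwd.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the original program.
 * Copies pw->pw_name into dst (capacity dstsize bytes) and always
 * NUL-terminates. Returns 0 on success, -1 if there is no name to copy.
 */
int copy_user_name(char *dst, size_t dstsize, const struct passwd *pw)
{
    assert(dst != NULL && dstsize > 0);

    dst[0] = '\0';
    if (pw == NULL || pw->pw_name == NULL)
        return -1;

    /* Copy at most dstsize-1 bytes, then terminate explicitly. */
    strncpy(dst, pw->pw_name, dstsize - 1);
    dst[dstsize - 1] = '\0';
    return 0;
}
```

In secure() the call would look something like
copy_user_name(user_name, sizeof(user_name), getpwuid(getuid())).
That does not explain the crash, but it rules one cause out.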

Noob
03-28-08 12:28 AM


Re: Interpreting program core dump in mdb
Noob wrote:
> Mr. Uh Clem wrote:
>
> ..
>
> Is it possible for pw->pw_name to be NULL?

I suspect I could determine that if I could interpret the stack frame.
That 726f6f74 ('root') occurs in the stack dump is real suspicious.
USER_LENGTH, btw, is 31.

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition

Mr. Uh Clem
03-28-08 12:28 AM


Re: Interpreting program core dump in mdb
In article <13un7sf5fhq685a@news.supernews.com>, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid> 
wrote:
>[...]
It would help if you built with debug enabled, which is a -g parameter.

Eric

EricF
03-28-08 09:48 AM


Re: Interpreting program core dump in mdb
EricF wrote:
 
> It would help if you built with debug enabled, which is a -g parameter.
>
> Eric

We did build a -g version here and purposely bombed it by feeding NULL
as the strncpy source argument.   For some reason, the ::stack display
was no more symbolic, remaining cryptic without a secret decoder ring.

We put the customer through a lot trying various things prior to
obtaining the core dump.   So we are kind of laying off them for
a little bit.  We'd like to understand what we are seeing before
bothering them again with another binary.  (And honestly, the nature
of this thing makes me suspect that -g will make the problem go
away.   Sticking debug statements near the strncpy seemed to heal
things...)

Nobody can tell what the values being displayed by ::stack are?

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition

Mr. Uh Clem
03-29-08 12:22 AM


Re: Interpreting program core dump in mdb
"Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote in message
news:13upv4r74f5hif3@news.supernews.com...
> EricF wrote:
>
> We did build a -g version here and purposely bombed it by feeding NULL
> as the strncpy source argument.   For some reason, the ::stack display
> was no more symbolic, remaining cryptic without a secret decoder ring.
>
> We put the customer through a lot trying various things prior to
> obtaining the core dump.   So we are kind of laying off them for
> a little bit.  We'd like to understand what we are seeing before
> bothering them again with another binary.  (And honestly, the nature
> of this thing makes me suspect that -g will make the problem go
> away.   Sticking debug statements near the strncpy seemed to heal
> things...)
>
> Nobody can tell what the values being displayed by ::stack are?

According to this (short) post, the numbers are the contents of the six
registers that Solaris on SPARC machines typically uses to pass
arguments:  http://blogs.sun.com/ace/date/20050104   Obviously, for
functions that take fewer than six arguments (e.g. strncpy, which takes
three), the remaining registers will hold other values.

With regard to compiling debug versions, we found that unless you also
link with the -g option, some debug information is unavailable
(possibly it is discarded?), so I would advise checking there. Also,
maybe you could try using dbx instead of mdb? mdb will only give you
assembly-level debugging, so you might find dbx easier to understand.

--
Mark



Mark Holland
03-29-08 12:22 AM


Re: Interpreting program core dump in mdb
On Thu, 27 Mar 2008 09:22:22 -0400, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote:
> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes.  Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> > ::stack
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.

It looks like the daemon is overrunning a buffer inside strncpy().
Tracking down this sort of memory corruption can be tricky if it happens
in a child process (forking daemon), but you can use the libumem library
and mdb to debug this.
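
One detail worth keeping in mind while hunting the overrun: strncpy()
always writes exactly n bytes, padding with NULs when the source is
shorter, so a 4-character name copied with n = 31 still touches 31
bytes of the destination. A small sketch to make that visible;
strncpy_footprint and SCRATCH_SIZE are names invented for this
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE 64   /* arbitrary size chosen for the demo */

/*
 * Illustration only: report how many leading bytes of a scratch
 * buffer strncpy(dst, src, n) overwrote. The buffer is pre-filled
 * with 0xAA, so this assumes src holds plain ASCII (no 0xAA bytes).
 */
size_t strncpy_footprint(const char *src, size_t n)
{
    unsigned char scratch[SCRATCH_SIZE];
    size_t i;

    assert(src != NULL && n < SCRATCH_SIZE);

    memset(scratch, 0xAA, sizeof scratch);
    strncpy((char *)scratch, src, n);

    /* Scan backwards for the last byte that is no longer the sentinel. */
    for (i = SCRATCH_SIZE; i > 0; i--) {
        if (scratch[i - 1] != 0xAA)
            return i;
    }
    return 0;
}
```

strncpy_footprint("root", 31) reports 31, not 4. So if the destination
were somehow smaller or differently placed than the code assumes, the
padding alone could run past it.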

> Early on, it calls secure() which is linked from a different .o file:
>
> char user_name[USER_LENGTH + 1];   /* global in .c containing secure */
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
>     pw = getpwuid(getuid());
>     if (pw != NULL)
>         strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
>
> We seem to blow up on trying to move the user name from pw->pw_name,
> which is very strange given that pw is supposed to point to static
> space allocated by getpwuid().

Is it possible that you have corrupted the stack elsewhere?

You can try enabling the debugging and auditing features of libumem.so
by running your program inside an mdb session, after setting up the
environment like this:

$ UMEM_DEBUG=default ; export UMEM_DEBUG
$ UMEM_LOGGING=transaction ; export UMEM_LOGGING
$ LD_PRELOAD=libumem.so.1 ; export LD_PRELOAD
$ mdb a.out

Then when inside mdb, set up a breakpoint at _exit and run the program:

> ::sysbp _exit
> ::run

After it crashes, load libumem.so and try the memory allocation tricks
described at:

http://developers.sun.com/solaris/a...em_library.html


Giorgos Keramidas
03-30-08 12:24 AM

