
Interpreting program core dump in mdb
Mr. Uh Clem

2008-03-27, 7:28 pm

At $DAY_JOB, we've got a customer who has installed our product on a
Solaris 10 SPARC system and is getting a mysterious segment violation in
one of our background processes. Of course, this problem does not occur
on any of our in-house systems.

We did get the customer to send us a core file, but aren't very handy
with the debug tools on Solaris.


# mdb prog core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::stack

strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
_start+0x108(0, 0, 0, 0, 0, 0)


I've googled up countless articles telling me that ::stack gets a
stack dump, but have yet to find one which tells me what the
values in the display **ARE**.


Some specifics on this one: It's a daemon process which accepts
a connection and forks off a worker process to handle the connection.
Early on, it calls secure() which is linked from a different .o file:


char user_name[USER_LENGTH + 1]; /* global in .c containing secure */


secure(host)
char *host;
{
    ....
    struct passwd *pw;
    ....

    pw = getpwuid(getuid());
    if (pw != NULL)
        strncpy(user_name, pw->pw_name, sizeof(user_name)-1);


We seem to blow up on trying to move the user name from pw->pw_name,
which is very strange given that pw is supposed to point to static
space allocated by getpwuid().

struct passwd {
    char *pw_name;
    char *pw_passwd;
    uid_t pw_uid;
    gid_t pw_gid;
    char *pw_age;
    char *pw_comment;
    char *pw_gecos;
    char *pw_dir;
    char *pw_shell;
};


Understanding the context around the stack frame seems really
crucial. One thing that is really strange is that
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
contains r o o t which should be in
memory at the address pointed to by pw_name...


We're pretty sure we're doing Something Stupid(tm), but don't see
how we could muck up the static space returned by getpwuid between
the time the program starts and getting to this point. This is
code that has been running for quite a while on various Unix flavors
including Solaris 7 and upward. We now see that we have two
Solaris 10 customers with this problem. The code was compiled
under a Solaris 8 system.

So anyway, some pointers to interpreting the context around a crash
using mdb would be appreciated.

TIA

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
Noob

2008-03-27, 7:28 pm

Mr. Uh Clem wrote:

> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes. Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.
>
>
> Some specifics on this one: It's a daemon process which accepts
> a connection and forks off a worker process to handle the connection.
> Early on, it calls secure() which is linked from a different .o file:
>
>
> char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);


Is it possible for pw->pw_name to be NULL?
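
If it can be, the copy could be guarded. A minimal sketch, assuming a
hypothetical helper named copy_user_name() that is not part of the
original program; it refuses NULL pointers and always terminates the
destination:

```c
#include <pwd.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original program): copy pw->pw_name
 * into dst, refusing NULL pointers and always NUL-terminating the result.
 * Returns 0 on success, -1 if there is nothing safe to copy.
 */
int copy_user_name(char *dst, size_t dstsize, const struct passwd *pw)
{
    if (dst == NULL || dstsize == 0)
        return -1;

    dst[0] = '\0';                      /* leave dst empty on failure */

    if (pw == NULL || pw->pw_name == NULL)
        return -1;

    strncpy(dst, pw->pw_name, dstsize - 1);
    dst[dstsize - 1] = '\0';            /* strncpy does not terminate on truncation */
    return 0;
}
```

In secure() this would be called as
copy_user_name(user_name, sizeof(user_name), getpwuid(getuid())).
It would not explain the crash, only turn a NULL pw_name into an error
return instead of a fault.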
Mr. Uh Clem

2008-03-27, 7:28 pm

Noob wrote:
> Mr. Uh Clem wrote:
>
> ...
>
> Is it possible for pw->pw_name to be NULL?


I suspect I could determine that if I could interpret the stack frame.
That 726f6f74 ('root') occurs in the stack dump is really suspicious.
USER_LENGTH, btw, is 31.
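
For reference, a small sketch of how the word 726f6f74 from the strncpy
line of the ::stack output decodes to text, and what two of the other
values are in decimal. The helper names are made up for illustration,
and the byte order assumes a big-endian machine such as SPARC:

```c
#include <stdio.h>

/*
 * Hypothetical helper: unpack a 32-bit value into its four bytes,
 * most significant byte first (big-endian order), and NUL-terminate.
 * out must have room for 5 bytes.
 */
void word_to_ascii(unsigned long word, char out[5])
{
    int i;

    for (i = 0; i < 4; i++)
        out[i] = (char)((word >> (24 - 8 * i)) & 0xff);
    out[4] = '\0';
}

/* Print the decoded word and two of the counts seen in the dump. */
void show_values(void)
{
    char text[5];

    word_to_ascii(0x726f6f74UL, text);
    printf("0x726f6f74 -> \"%s\"\n", text);
    printf("0x1f = %d, 0x1b = %d\n", 0x1f, 0x1b);
}
```

0x1f is 31, which matches USER_LENGTH, and 0x1b is 27, four less.
Whether that means four bytes had already been handled when the fault
hit is only a guess; the dump alone does not confirm it.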

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
EricF

2008-03-28, 4:48 am

In article <13un7sf5fhq685a@news.supernews.com>, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote:
>At $DAY_JOB, we've got a customer who has installed our product on a
>Solaris 10 Sparc system and is getting a mysterious segment violation in
>one of our background processes. Of course, this problem does not occur
>on any of our inhouse systems.
>
>We did get the customer to send us a core file, but aren't very handy
>with the debug tools on Solaris.
>
>
># mdb prog core
>Loading modules: [ libc.so.1 ld.so.1 ]
>strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
>secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
>process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
>open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
>main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
>_start+0x108(0, 0, 0, 0, 0, 0)
>
>
>I've googled up countless articles telling me that ::stack gets a
>stack dump, but have yet to find one which tells me what the
>values in the display **ARE**.
>
>
>Some specifics on this one: It's a daemon process which accepts
>a connection and forks off a worker process to handle the connection.
>Early on, it calls secure() which is linked from a different .o file:
>
>
>char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
>
>secure(host)
>char *host;
>{
>....
>struct passwd *pw;
>....
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
>
>
>We seem to blow up on trying to move the user name from pw->pw_name,
>which is very strange given that pw is supposed to point to static
>space allocated by getpwuid().
>
>struct passwd {
> char *pw_name;
> char *pw_passwd;
> uid_t pw_uid;
> gid_t pw_gid;
> char *pw_age;
> char *pw_comment;
> char *pw_gecos;
> char *pw_dir;
> char *pw_shell;
>};
>
>
>Understanding the context around the stack frame seems really
>crucial. One thing that is really strange is that
>strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
>contains r o o t which should be in
>memory at the address pointed to by pw_name...
>
>
>We're pretty sure we're doing Something Stupid(tm), but don't see
>how we could muck up the static space returned by getpwuid between
>the time the program starts and getting to this point. This is
>code that has been running for quite a while on various Unix flavors
>including Solaris 7 and upward. We now see that we have two
>Solaris 10 customers with this problem. The code was compiled
>under a Solaris 8 system.
>
>So anyway, some pointers to interpreting the context around a crash
>using mdb would be appreciated.
>
>TIA
>

It would help if you built with debugging enabled (the compiler's -g option).

Eric
Mr. Uh Clem

2008-03-28, 7:22 pm

EricF wrote:

> It would help if you built with debug enabled, which is a -g parameter.
>
> Eric


We did build a -g version here and purposely bombed it by feeding NULL
as the strncpy source argument. For some reason, the ::stack display
was no more symbolic, remaining cryptic without a secret decoder ring.

We put the customer through a lot trying various things prior to
obtaining the core dump. So we are kind of laying off them for
a little bit. We'd like to understand what we are seeing before
bothering them again with another binary. (And honestly, the nature
of this thing makes me suspect that -g will make the problem go
away. Sticking debug statements near the strncpy seemed to heal
things...)

Nobody can tell what the values being displayed by ::stack are?

--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
Mark Holland

2008-03-28, 7:22 pm


"Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote in message
news:13upv4r74f5hif3@news.supernews.com...
> EricF wrote:
>
>
> We did build a -g version here and purposely bombed it by feeding
> NULL
> as the strncpy source argument. For some reason, the ::stack
> display
> was no more symbolic, remaining cryptic without a secret decoder
> ring.
>
> We put the customer through a lot trying various things prior to
> obtaining the core dump. So we are kind of laying off them for
> a little bit. We'd like to understand what we are seeing before
> bothering them again with another binary. (And honestly, the nature
> of this thing makes me suspect that -g will make the problem go
> away. Sticking debug statements near the strncpy seemed to heal
> things...)
>
> Nobody can tell what the values being displayed by ::stack are?


According to this (short) post, the numbers are the contents of the 6
registers which Solaris on SPARC machines typically uses to pass
arguments:

http://blogs.sun.com/ace/date/20050104

Obviously, for functions which take fewer than 6 arguments (e.g.
strncpy), some of these registers will hold other values.
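
To make that layout concrete, here is a small sketch that splits one
::stack line into its parts. The struct and function names are invented
for illustration, and reading the six values as argument registers
follows the post linked above rather than anything verified here:

```c
#include <stdio.h>

/* Hypothetical record for one line of ::stack output. */
struct stack_line {
    char          func[64];   /* symbol the address falls in */
    unsigned long offset;     /* byte offset from the start of that symbol */
    unsigned long reg[6];     /* the six values shown in parentheses */
};

/*
 * Parse a line such as
 *   "secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)".
 * All numbers are hexadecimal. Returns 0 on success, -1 if the line
 * does not have that shape.
 */
int parse_stack_line(const char *line, struct stack_line *out)
{
    int n = sscanf(line, " %63[^+]+0x%lx(%lx, %lx, %lx, %lx, %lx, %lx)",
                   out->func, &out->offset,
                   &out->reg[0], &out->reg[1], &out->reg[2],
                   &out->reg[3], &out->reg[4], &out->reg[5]);

    return n == 8 ? 0 : -1;
}
```

For the secure line, reg[3] comes out as 0x1f. secure() takes a single
argument, so only reg[0] would be its parameter and the rest are
whatever those registers held at the time.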

With regard to compiling debug versions, we found that unless you
link with the -g option, some debug information is unavailable
(possibly it is discarded?), so I would advise checking there. Also,
maybe you could try using dbx instead of mdb? mdb will only give you
assembly-level debugging, so you might find dbx easier to understand.

--
Mark


Giorgos Keramidas

2008-03-29, 7:24 pm

On Thu, 27 Mar 2008 09:22:22 -0400, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote:
> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes. Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.


It looks like the daemon is overrunning a buffer inside strncpy().
Tracking down this sort of memory corruption can be tricky if it happens
in a child process (forking daemon), but you can use the libumem library
and mdb to debug this.

> Early on, it calls secure() which is linked from a different .o file:
>
> char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
>
> We seem to blow up on trying to move the user name from pw->pw_name,
> which is very strange given that pw is supposed to point to static
> space allocated by getpwuid().


Is it possible that you have corrupted the stack elsewhere?

You can try enabling the debugging and auditing features of libumem.so
by running your program inside an mdb session, after setting up the
environment like this:

$ UMEM_DEBUG=default ; export UMEM_DEBUG
$ UMEM_LOGGING=transaction ; export UMEM_LOGGING
$ LD_PRELOAD=libumem.so.1 ; export LD_PRELOAD
$ mdb a.out

Then when inside mdb, set up a breakpoint at _exit and run the program:

> ::sysbp _exit
> ::run


After it crashes, load libumem.so and try the memory allocation tricks
described at:

http://developers.sun.com/solaris/a...em_library.html
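
The general idea behind guard-zone checking, which a debugging
allocator can do for heap buffers, can be sketched in plain C. This is
a hand-rolled illustration and not libumem's implementation; the names
guarded_alloc(), guard_intact(), GUARD_LEN and GUARD_BYTE are invented
here:

```c
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8
#define GUARD_BYTE 0xfe   /* arbitrary pattern chosen for this sketch */

/* Allocate n usable bytes followed by GUARD_LEN bytes of a known pattern. */
char *guarded_alloc(size_t n)
{
    char *p = malloc(n + GUARD_LEN);

    if (p != NULL)
        memset(p + n, GUARD_BYTE, GUARD_LEN);
    return p;
}

/* Return 1 if the guard bytes after the n usable bytes are intact, else 0. */
int guard_intact(const char *p, size_t n)
{
    size_t i;

    for (i = 0; i < GUARD_LEN; i++)
        if ((unsigned char)p[n + i] != GUARD_BYTE)
            return 0;
    return 1;
}
```

A write that runs even one byte past the usable area changes the
pattern, so a later guard_intact() call reports it. Note that this only
covers heap buffers; a global array such as user_name would not be
protected this way.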
