Code Comments
At $DAY_JOB, we've got a customer who has installed our product on a
Solaris 10 Sparc system and is getting a mysterious segment violation in
one of our background processes. Of course, this problem does not occur
on any of our inhouse systems.
We did get the customer to send us a core file, but aren't very handy
with the debug tools on Solaris.
# mdb prog core
Loading modules: [ libc.so.1 ld.so.1 ]
> ::stack
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
_start+0x108(0, 0, 0, 0, 0, 0)
I've googled up countless articles telling me that ::stack gets a
stack dump, but have yet to find one which tells me what the
values in the display **ARE**.
Some specifics on this one: It's a daemon process which accepts
a connection and forks off a worker process to handle the connection.
Early on, it calls secure() which is linked from a different .o file:
char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
secure(host)
char *host;
{
...
struct passwd *pw;
...
pw = getpwuid(getuid());
if (pw != NULL)
strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
We seem to blow up on trying to move the user name from pw->pw_name,
which is very strange given that pw is supposed to point to static
space allocated by getpwuid().
struct passwd {
char *pw_name;
char *pw_passwd;
uid_t pw_uid;
gid_t pw_gid;
char *pw_age;
char *pw_comment;
char *pw_gecos;
char *pw_dir;
char *pw_shell;
};
Understanding the context around the stack frame seems really
crucial. One thing that is really strange is that
strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
contains r o o t which should be in
memory at the address pointed to by pw_name...
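The observation that 726f6f74 spells "root" can be checked mechanically. The following is a minimal sketch, assuming the value is a 32-bit word read on a big-endian machine such as SPARC, so the most significant byte comes first in memory. The helper name decode_word is made up for this illustration and is not an mdb facility. It only shows what the bytes are; it does not say why that word is sitting in a register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Split a 32-bit word into its four bytes, most significant first
 * (the order they occupy in memory on big-endian SPARC), and store
 * them in out as a NUL-terminated string. Bytes outside the printable
 * ASCII range are shown as '.'.
 * decode_word is a hypothetical helper name used for illustration.
 */
void decode_word(uint32_t word, char *out)
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(word >> (24 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}

#ifdef DEMO_MAIN
int main(void)
{
    char text[5];
    decode_word(0x726f6f74u, text);  /* fourth value in the strncpy frame */
    printf("0x726f6f74 -> \"%s\"\n", text);
    return 0;
}
#endif
```

Building with -DDEMO_MAIN gives a small standalone program; the same helper can be pointed at any other word from the dump that looks like it might be text.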
We're pretty sure we're doing Something Stupid(tm), but don't see
how we could muck up the static space returned by getpwuid between
the time the program starts and getting to this point. This is
code that has been running for quite a while on various Unix flavors
including Solaris 7 and upward. We now see that we have two
Solaris 10 customers with this problem. The code was compiled
under a Solaris 8 system.
So anyway, some pointers to interpreting the context around a crash
using mdb would be appreciated.
TIA
--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
Mr. Uh Clem wrote:
> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes. Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.
>
>
> Some specifics on this one: It's a daemon process which accepts
> a connection and forks off a worker process to handle the connection.
> Early on, it calls secure() which is linked from a different .o file:
>
>
> char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
Is it possible for pw->pw_name to be NULL?
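One way to rule that question out in code is to check both pointers before copying. The sketch below is only an illustration: copy_user_name is a hypothetical helper, not part of the daemon, and it assumes the destination is an ordinary char array such as the global user_name from the original post. It also terminates the string explicitly, since strncpy leaves the destination unterminated when the source is too long.

```c
#include <pwd.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy pw->pw_name into dst, refusing NULL pointers and always
 * NUL-terminating the result.
 * Returns 0 on success, -1 if there was nothing that could be copied.
 * copy_user_name is a hypothetical helper, not code from the daemon.
 */
int copy_user_name(char *dst, size_t dstsize, const struct passwd *pw)
{
    if (dst == NULL || dstsize == 0)
        return -1;
    dst[0] = '\0';                      /* leave a valid empty string on failure */
    if (pw == NULL || pw->pw_name == NULL)
        return -1;
    strncpy(dst, pw->pw_name, dstsize - 1);
    dst[dstsize - 1] = '\0';            /* strncpy does not terminate on truncation */
    return 0;
}
```

Inside secure() the call would look something like copy_user_name(user_name, sizeof(user_name), getpwuid(getuid())), with a log message on the -1 path so that a NULL pw_name shows up as a diagnostic rather than a core dump.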
Noob wrote:
> Mr. Uh Clem wrote:
>
..
>
> Is it possible for pw->pw_name to be NULL?
I suspect I could determine that if I could interpret the stack frame.
That 726f6f74 ('root') occurs in the stack dump is real suspicious.
USER_LENGTH, btw, is 31.
--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
In article <13un7sf5fhq685a@news.supernews.com>, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid>
wrote:
>At $DAY_JOB, we've got a customer who has installed our product on a
>Solaris 10 Sparc system and is getting a mysterious segment violation in
>one of our background processes. Of course, this problem does not occur
>on any of our inhouse systems.
>
>We did get the customer to send us a core file, but aren't very handy
>with the debug tools on Solaris.
>
>
># mdb prog core
>Loading modules: [ libc.so.1 ld.so.1 ]
>strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
>secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
>process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
>open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
>main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
>_start+0x108(0, 0, 0, 0, 0, 0)
>
>
>I've googled up countless articles telling me that ::stack gets a
>stack dump, but have yet to find one which tells me what the
>values in the display **ARE**.
>
>
>Some specifics on this one: It's a daemon process which accepts
>a connection and forks off a worker process to handle the connection.
>Early on, it calls secure() which is linked from a different .o file:
>
>
>char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
>
>secure(host)
>char *host;
>{
>....
>struct passwd *pw;
>....
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
>
>
>We seem to blow up on trying to move the user name from pw->pw_name,
>which is very strange given that pw is supposed to point to static
>space allocated by getpwuid().
>
>struct passwd {
> char *pw_name;
> char *pw_passwd;
> uid_t pw_uid;
> gid_t pw_gid;
> char *pw_age;
> char *pw_comment;
> char *pw_gecos;
> char *pw_dir;
> char *pw_shell;
>};
>
>
>Understanding the context around the stack frame seems really
>crucial. One thing that is really strange is that
>strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
>contains r o o t which should be in
>memory at the address pointed to by pw_name...
>
>
>We're pretty sure we're doing Something Stupid(tm), but don't see
>how we could muck up the static space returned by getpwuid between
>the time the program starts and getting to this point. This is
>code that has been running for quite a while on various Unix flavors
>including Solaris 7 and upward. We now see that we have two
>Solaris 10 customers with this problem. The code was compiled
>under a Solaris 8 system.
>
>So anyway, some pointers to interpreting the context around a crash
>using mdb would be appreciated.
>
>TIA
>
It would help if you built with debug enabled, which is a -g parameter.
Eric
EricF wrote:
> It would help if you built with debug enabled, which is a -g parameter.
>
> Eric
We did build a -g version here and purposely bombed it by feeding
NULL as the strncpy source argument. For some reason, the ::stack
display was no more symbolic, remaining cryptic without a secret
decoder ring.
We put the customer through a lot trying various things prior to
obtaining the core dump. So we are kind of laying off them for
a little bit. We'd like to understand what we are seeing before
bothering them again with another binary. (And honestly, the nature
of this thing makes me suspect that -g will make the problem go
away. Sticking debug statements near the strncpy seemed to heal
things...)
Nobody can tell what the values being displayed by ::stack are?
--
Clem
"If you push something hard enough, it will fall over."
- Fudd's first law of opposition
"Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote in message
news:13upv4r74f5hif3@news.supernews.com...
> EricF wrote:
> >
> We did build a -g version here and purposely bombed it by feeding
> NULL as the strncpy source argument. For some reason, the ::stack
> display was no more symbolic, remaining cryptic without a secret
> decoder ring.
>
> We put the customer through a lot trying various things prior to
> obtaining the core dump. So we are kind of laying off them for
> a little bit. We'd like to understand what we are seeing before
> bothering them again with another binary. (And honestly, the nature
> of this thing makes me suspect that -g will make the problem go
> away. Sticking debug statements near the strncpy seemed to heal
> things...)
>
> Nobody can tell what the values being displayed by ::stack are?
According to this (short) post, the numbers are the contents of 6
registers which are typically used by Solaris on SPARC machines to
pass arguments:
http://blogs.sun.com/ace/date/20050104
Obviously for functions which take less than 6 arguments (e.g. strncpy)
some of these registers will have other values in.
With regards to compiling debug versions, we found that unless you link
with the -g option some debug information is unavailable (possibly it is
discarded?) so I would advise checking there.
Also, maybe you could try using dbx instead of mdb? mdb will only give
you assembly-level debugging, so you might find dbx easier to understand.
--
Mark
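To make the "six registers" reading concrete, here is a rough sketch that pulls one line of the ::stack output apart into a symbol, an offset and six hexadecimal words. The struct and the function parse_frame are invented for this illustration and are not part of mdb. Under the SPARC calling convention the first six integer or pointer arguments go out in %o0 to %o5 and are seen by the callee as %i0 to %i5, but treating word N as source-level argument N is an assumption: a function with fewer arguments, or one that has already modified its registers by the time it faults, will show unrelated values.

```c
#include <assert.h>
#include <stdio.h>

#define FRAME_ARGS 6

/* One line of ::stack output: symbol+offset(six hex words). */
struct stack_frame {
    char symbol[64];
    unsigned long offset;
    unsigned long arg[FRAME_ARGS];
};

/*
 * Parse a line such as "strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)".
 * Returns 0 on success, -1 if the line does not have that shape.
 * parse_frame is a hypothetical helper, not an mdb facility.
 */
int parse_frame(const char *line, struct stack_frame *f)
{
    assert(line != NULL && f != NULL);
    int n = sscanf(line, " %63[^+]+%lx(%lx, %lx, %lx, %lx, %lx, %lx)",
                   f->symbol, &f->offset,
                   &f->arg[0], &f->arg[1], &f->arg[2],
                   &f->arg[3], &f->arg[4], &f->arg[5]);
    return n == 8 ? 0 : -1;   /* symbol + offset + six words */
}

#ifdef DEMO_MAIN
int main(void)
{
    struct stack_frame f;
    if (parse_frame("strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)", &f) != 0)
        return 1;
    printf("%s+0x%lx\n", f.symbol, f.offset);
    for (int i = 0; i < FRAME_ARGS; i++)
        printf("  word %d = 0x%lx (decimal %lu)\n", i, f.arg[i], f.arg[i]);
    return 0;
}
#endif
```

Printing the words in decimal as well as hex makes small values easier to recognise. For example, 1f in the secure frame is decimal 31, which happens to equal USER_LENGTH; whether that register still held the length argument at that point is not something the dump alone confirms.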
On Thu, 27 Mar 2008 09:22:22 -0400, "Mr. Uh Clem" <uhclem@DutchElmSt.invalid> wrote:
> At $DAY_JOB, we've got a customer who has installed our product on a
> Solaris 10 Sparc system and is getting a mysterious segment violation in
> one of our background processes. Of course, this problem does not occur
> on any of our inhouse systems.
>
> We did get the customer to send us a core file, but aren't very handy
> with the debug tools on Solaris.
>
> # mdb prog core
> Loading modules: [ libc.so.1 ld.so.1 ]
> strncpy+0x5d0(20, 7182f4, 1b, 726f6f74, 0, 20)
> secure+0x1b8(2e4088, b1978, c6068, 1f, 717298, 0)
> process_request+0x41c(2e7d8, 1, c60e4, 1, 5750bc, 0)
> open_socket+0x310(0, c8bf0, 5, 7efefeff, 81010100, ffbff9bc)
> main+0x664(1, ffbffc1c, ffbffc24, c6000, c80fc, 3)
> _start+0x108(0, 0, 0, 0, 0, 0)
>
> I've googled up countless articles telling me that ::stack gets a
> stack dump, but have yet to find one which tells me what the
> values in the display **ARE**.
It looks like the daemon is overrunning a buffer inside strncpy().
Tracking down this sort of memory corruption can be tricky if it happens
in a child process (forking daemon), but you can use the libumem library
and mdb to debug this.
> Early on, it calls secure() which is linked from a different .o file:
>
> char user_name[USER_LENGTH + 1]; /* global in .c containing secure */
>
> secure(host)
> char *host;
> {
> ...
> struct passwd *pw;
> ...
>
> pw = getpwuid(getuid());
> if (pw != NULL)
> strncpy(user_name, pw->pw_name, sizeof(user_name)-1);
>
> We seem to blow up on trying to move the user name from pw->pw_name,
> which is very strange given that pw is supposed to point to static
> space allocated by getpwuid().
Is it possible that you have corrupted the stack elsewhere?
You can try enabling the debugging and auditing features of libumem.so
by running your program inside an mdb session, after setting up the
environment like this:
$ UMEM_DEBUG=default ; export UMEM_DEBUG
$ UMEM_LOGGING=transaction ; export UMEM_LOGGING
$ LD_PRELOAD=libumem.so.1 ; export LD_PRELOAD
$ mdb a.out
Then when inside mdb, set up a breakpoint at _exit and run the program:
> ::sysbp _exit
> ::run
After it crashes, load libumem.so and try the memory allocation tricks
described at:
http://developers.sun.com/solaris/a...em_library.html