Home > Archive > Unix Programming > March 2008 > 64-bit c++ application crashing on solaris
64-bit c++ application crashing on solaris
| Hi,
We are trying to migrate an existing C++ application on Solaris from a
32-bit build to a 64-bit build.
We have successfully compiled the application in 64-bit mode, but we are
facing problems when running it.
The application crashes intermittently for no apparent reason.
We suspect that the 64-bit application does not have enough memory
available.
We investigated various Solaris tunable parameters for this purpose.
A brief description of them is listed below --
lwp_default_stksize
Specifies the default value of the stack size to be used when a kernel
thread is created, and when the calling routine does not provide an
explicit size to be used.
Data Type
Integer
Range
Minimum is the default value:
3 x PAGESIZE on SPARC systems (3 * 8192 = 24576)
Maximum is 32 times the default value.
Units
Bytes in multiples of the value returned by the getpagesize
parameter. For more information, see getpagesize(3C).
Dynamic?
Yes. Affects threads created after the variable is changed.
Validation
Must be greater than or equal to 8192 and less than or equal to
262,144 (256 x 1024). Also must be a multiple of the system page size.
If these conditions are not met, the following message is displayed:
Illegal stack size, Using N
The value of N is the default value of lwp_default_stksize.
When to Change
When the system panics because it has run out of stack space. The
best solution for this problem is to determine why the system is
running out of space and then make a correction.
Increasing the default stack size means that almost every kernel
thread will have a larger stack, resulting in increased kernel memory
consumption for no good reason. Generally, that space will be unused.
The increased consumption means other resources that are competing for
the same pool of memory will have the amount of space available to
them reduced, possibly decreasing the system's ability to perform
work. Among the side effects is a reduction in the number of threads
that the kernel can create. This solution should be treated as no more
than an interim workaround until the root cause is remedied.
segkpsize
Specifies the amount of kernel pageable memory available. This
memory is used primarily for kernel thread stacks. Increasing this
number allows either larger stacks for the same number of threads or
more threads. This parameter can only be set on systems running 64-bit
kernels. Systems running 64-bit kernels use a default stack size of 24
Kbytes.
Data Type
Unsigned long
Default
64-bit kernels, 2 Gbytes
32-bit kernels, 512 Mbytes
Range
64-bit kernels, 512 Mbytes - 24 Gbytes
32-bit kernels, 512 Mbytes
Units
Mbytes
Dynamic?
No
Validation
Value is compared to minimum and maximum sizes (512 Mbytes and 24
Gbytes for 64-bit systems) and if smaller than the minimum or larger
than the maximum, it is reset to 2 Gbytes and a message to that effect
is displayed.
The actual size used in creation of the cache is the lesser of the
value specified in segkpsize after the constraints checking and 50% of
physical memory.
When to Change
This is one of the steps necessary to support large numbers of
processes on a system. The default size of 2 Gbytes, assuming at least
1 Gbyte of physical memory is present, allows creation of 24-Kbyte
stacks for more than 87,000 kernel threads. The size of a stack in a
64-bit kernel is the same whether the process is a 32-bit process or a
64-bit process. If more than this number is needed, segkpsize can be
increased assuming sufficient physical memory exists.
Does anyone have experience with these parameters (or any other related
parameters), and could they help resolve the issue at hand?
Or is there any other area we should look at that could help in this
case?
| |
| Casper H.S. Dik 2008-03-28, 7:22 pm |
| Sumir <sumirmehta@gmail.com> writes:
>we are trying to migrate an existing C++ application on solaris
>compiled in 32-bit env to 64-bit env.
>We have successfully compiled the application in 64-bit mode. We are
>facing problems while running the compiled application.
>The application seems to be crashing intermittently for no specific
>reason.
Surely it crashes somewhere? "Intermittently, unspecific"
feels like memory corruption.
>We think this could probably be due to sufficient memory not available
>to the 64-bit application.
That would result in NULL being returned from allocation functions
or in exceptions being thrown.
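A minimal sketch of what genuine memory exhaustion looks like from inside a C++ program, as opposed to a SIGSEGV at an arbitrary point. The names probe_malloc, probe_new and impossible_request are made up for illustration and are not from the poster's application; the oversized request is only a convenient way to force the failure path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <new>

enum AllocOutcome { ALLOC_OK, ALLOC_RETURNED_NULL, ALLOC_THREW };

// C-style allocation: failure is reported by a NULL return value.
AllocOutcome probe_malloc(std::size_t bytes) {
    // volatile keeps the optimizer from removing the malloc/free pair.
    void* volatile p = std::malloc(bytes);
    if (p == NULL) {
        return ALLOC_RETURNED_NULL;
    }
    std::free(p);
    return ALLOC_OK;
}

// C++ allocation: plain operator new reports failure by throwing std::bad_alloc.
AllocOutcome probe_new(std::size_t bytes) {
    try {
        void* volatile p = ::operator new(bytes);
        ::operator delete(p);
        return ALLOC_OK;
    } catch (const std::bad_alloc&) {
        return ALLOC_THREW;
    }
}

// A request far beyond what any real system can satisfy (hypothetical helper).
inline std::size_t impossible_request() {
    return std::numeric_limits<std::size_t>::max() / 2;
}
```

Calling probe_malloc(impossible_request()) and probe_new(impossible_request()) exercises both failure paths. Code that ignores a NULL return can still crash later, but the fault would normally sit close to the failed allocation.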
>we investigated various tunable parameters in solaris for the same
>purpose.
>A brief description of them is listed below --
>lwp_default_stksize
Not relevant to user processes; this is a *kernel* stack limit.
>segkpsize
Something to do with the kernel, not user processes.
>Does anyone have an idea about these parameters (or any other related
>params), and if they could be helpful in resolving the issue at hand ?
>Or is there any other area, we should look at which could help in this
>case ?
If the crash is intermittent then it's not likely a resource issue
unless the crash happens when resources are allocated or shortly
afterwards; it is unlikely that a 64-bit app uses many more
resources than its 32-bit counterpart, and you would notice it running
out of swap space (memory resources) and likely out of stack space
(which you can change using ulimit).
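For reference, the stack limit that ulimit adjusts can also be read from inside the process with the POSIX getrlimit() call. This is only a sketch; read_stack_limit and print_limit are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/resource.h>

// Reads the soft and hard stack limits of the calling process.
// Returns false if getrlimit() fails.
bool read_stack_limit(rlim_t& soft, rlim_t& hard) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        return false;
    }
    soft = rl.rlim_cur;
    hard = rl.rlim_max;
    return true;
}

// Prints one limit, spelling out RLIM_INFINITY instead of a huge number.
void print_limit(const char* label, rlim_t value) {
    assert(label != NULL);
    if (value == RLIM_INFINITY) {
        std::printf("%s: unlimited\n", label);
    } else {
        std::printf("%s: %llu bytes\n", label,
                    static_cast<unsigned long long>(value));
    }
}
```

Note that this limit governs the stack of the main thread. Threads started with pthread_create take their stack size from the thread attributes (pthread_attr_setstacksize), so raising the ulimit may not change anything for worker threads.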
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
| |
| Paul Pluzhnikov 2008-03-29, 4:41 am |
| Sumir <sumirmehta@gmail.com> writes:
> The application seems to be crashing intermittently for no specific
> reason.
Oh, there *is* a reason (or a few).
The very first question you should ask is where exactly is it
crashing? Use a debugger to find out.
> We think this could probably be due to sufficient memory not available
> to the 64-bit application.
You have not presented any reason why you'd think that.
Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
| |
|
| On Mar 28, 7:29 pm, Casper H.S. Dik <Casper....@Sun.COM> wrote:
> Sumir <sumirme...@gmail.com> writes:
>
> Surely it crashes somewhere? "Intermittently, unspecific"
> feels like memory corruption.
>
>
> That would result in NULL being returned from allocation functions
> and exceptions to be thrown.
>
>
> Not relevant to user processes; this is a *kernel* stack limit.
>
>
> Something to do with the kernel, not user processes.
>
>
> If the crash is intermittent then it's not likely a resource issue
> unless the crash happens when resources are allocated or shortly
> afterwards; it is unlikely that a 64 bit app uses many more
> resources than its 32 bit counterpart and you would notice it running
> out of swap space (memory resources) and likely out of stack space.
> (which you can change using ulimit)
>
> Casper
> --
> Expressed in this posting are my opinions. They are in no way related
> to opinions held by my employer, Sun Microsystems.
> Statements on Sun products included here are not gospel and may
> be fiction rather than truth.
Hi Casper,
Thanks for your reply.
The application does not crash upon resource allocation. In fact, it
runs for some time before it dies. This behaviour is also visible
when multiple clients (say, a multithreaded process) are
connected to this application and hitting it continuously with
requests.
Can you briefly elaborate on the usage of ulimit and how it could
affect the application in our case?
| |
|
| On Mar 29, 10:14 am, Paul Pluzhnikov <ppluzhnikov-...@gmail.com>
wrote:
> Sumir <sumirme...@gmail.com> writes:
>
> Oh, there *is* a reason (or a few).
>
> The very first question you should ask is where exactly is it
> crashing? Use a debugger to find out.
>
>
> You have not presented any reason why you'd think that.
>
> Cheers,
> --
> In order to understand recursion you must first understand recursion.
> Remove /-nsp/ for email.
Hi Paul,
I ran truss on the application to get a trace. It resulted in
this ...
lwp_sema_post(0xFFFFFFFF7D003D60) = 0
lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
time() = 1205757131
lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
Received signal #11, SIGSEGV [default]
siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
*** process killed ***
Seems there is some memory access violation. But the thing is, this
same application compiled in 32-bit mode, run on the same environment
(same machine) fares well. So if it were something to do within the
code, in way of accessing memory wrongly, it should have surfaced in
the 32-bit version as well.
| |
| Ian Collins 2008-03-31, 5:46 am |
| Sumir wrote:
> On Mar 29, 10:14 am, Paul Pluzhnikov <ppluzhnikov-...@gmail.com>
> wrote:
>
> Hi Paul,
>
> I ran a truss on the application to have a trace. It resulted into
> this ...
>
What do you see in your debugger? If there's a core file, load that.
--
Ian Collins.
| |
|
| On Mar 31, 11:22 am, Ian Collins <ian-n...@hotmail.com> wrote:
> Sumir wrote:
>
> What do you see in your debugger? If there's a core file, load that.
>
> --
> Ian Collins.
I did have a couple of core files. Loading them gives the following
stack trace --
CORE 1 -->
=>[1] __rwstd::time_reader<char,std::istreambuf_iterator<char,std::char_traits<char> ...
      0xffffffff7ac06978, 0x0, 0x10fe72590, 0x1005c20d8), at 0x100386884
  [2] std::time_get<char,std::istreambuf_iterator<char,std::char_traits<char> ...
      0xffffffff7ac0688c, 0xffffffff7ac06800), at 0x100384078
  [3] Date::Date(0x10bc4f910, 0xffffffff7ac06a18, 0x6800, 0x6070,
      0x1005c20d8, 0x68a0), at 0x1001b17f0
  [4] ImagineToDbml::getCalendar(0x80000006c, 0x1, 0x1005d0748,
      0xffffffff7ac069b0, 0x1000000f4, 0x1005c20d8), at 0x1001184e0
  [5] IMDFn::getCalendarXML(0xffffffff7ac07238, 0xffffffff7ac06dd8,
      0xffffffff7ac07250, 0xffffffff7ac06cf0, 0xffffffff7ac06d60,
      0xffffffff7ac06fd0), at 0x1000b6dc4
  [6] SOAPServer::doFnRequest(0xffffffff7fffc810, 0x10064fb78,
      0xffffffff7ac07728, 0xffffffff7ac07250, 0xffffffff7ac07250,
      0x1005cbe30), at 0x1000e9ccc
  [7] SOAPServer::callFn(0xffffffff7fffc810, 0xffffffff7ac07210,
      0x10047b7aa, 0x10064fb78, 0xffffffff7ac07728, 0xffffffff7ac07690), at
      0x1000e9b20
  [8] SOAPServer::onBody(0xffffffff7fffc810, 0xffffffff7ac07478,
      0xffffffff7ac073f8, 0xffffffff7ac07728, 0xffffffff7ac07690,
      0xffffffff), at 0x1000e97b8
  [9] SOAPServer::onMessage(0xffffffff7fffc810, 0xffffffff7ac07608,
      0x1005c20d8, 0xffffffff7ac07728, 0xffffffff7ac07690, 0x0), at
      0x1000e9408
  [10] SOAPServerTCPIP::onReceive(0xffffffff7fffc810, 0x1074ded50,
      0x0, 0xbd, 0x105b99060, 0x0), at 0x1000ed170
  [11] IOServer::check(0x1005d3988, 0x0, 0x1074ded50, 0x1079cfd00,
      0x32, 0x1079cfd38), at 0x1001009f4
  [12] IOServerMT::ChildServer::loop(0x1079cfd00, 0x32, 0x64,
      0xffffffff7e3912ac, 0x0, 0x0), at 0x1000ff640
  [13] IOServerMT::ChildServer::run(0x1079cfd00, 0x15, 0x100639e38,
      0x0, 0x0, 0x0), at 0x1001018cc
  [14] Thread::entryFun(0x1079cfdc8, 0xffffffff7e722bb0, 0x0, 0x1,
      0xffffffff7e720000, 0x0), at 0x1001afe9c
CORE 2 -->
=>[1] __rwstd::timepunct_data<char>::__initpat(0xb0, 0x10060d3d8,
      0x10fe3d490, 0x1005c20d8, 0x0, 0x0), at 0x1003956ac
  [2] __rwstd::timepunct<char>::__initfacet(0x10fe32730, 0x1005d5e50,
      0x1, 0x2, 0x0, 0x10fe32760), at 0x10039466c
  [3] std::locale::__install(0x1005d5e50, 0x10fe32730, 0x10060ded0,
      0x10036aae4, 0x1005c20d8, 0x1), at 0x10036ab78
  [4] std::locale::__make_explicit(0x1005d5e50, 0x10060ded0, 0x1,
      0x100, 0x100382db8, 0x5800), at 0x10036aaf4
  [5] std::time_get<char,std::istreambuf_iterator<char,std::char_traits<char> ...
      0x1005d5ea0), at 0x100382d1c
  [6] std::locale::__install(0x1005d5e50, 0x10fe37b20, 0x10060ddb0,
      0x10036aae4, 0x1005c20d8, 0x0), at 0x10036ab78
  [7] std::locale::__make_explicit(0x1005d5e50, 0x10060ddb0, 0x1,
      0x100, 0x1001b1900, 0x5800), at 0x10036aaf4
  [8] Date::Date(0x10fe42c50, 0x100481fdc, 0x6800, 0x6070,
      0x1005c20d8, 0x68a0), at 0x1001b1770
  [9] ImagineToDbml::getCalendar(0xffffffff7be04dd8,
      0xffffffff7be04f28, 0x1005d0748, 0xffffffff7be049b0,
      0xffffffff7be04fb8, 0xffffffff7be04e27), at 0x1001184c4
  [10] IMDFn::getCalendarXML(0xffffffff7be05238, 0xffffffff7be04dd8,
      0xffffffff7be05250, 0xffffffff7be04cf0, 0xffffffff7be04d60,
      0xffffffff7be04fd0), at 0x1000b6dc4
  [11] SOAPServer::doFnRequest(0xffffffff7fffc800, 0x10064fb78,
      0xffffffff7be05728, 0xffffffff7be05250, 0xffffffff7be05250,
      0x1005cbe30), at 0x1000e9ccc
  [12] SOAPServer::callFn(0xffffffff7fffc800, 0xffffffff7be05210,
      0x10047b7aa, 0x10064fb78, 0xffffffff7be05728, 0xffffffff7be05690), at
      0x1000e9b20
  [13] SOAPServer::onBody(0xffffffff7fffc800, 0xffffffff7be05478,
      0xffffffff7be053f8, 0xffffffff7be05728, 0xffffffff7be05690,
      0xffffffff), at 0x1000e97b8
  [14] SOAPServer::onMessage(0xffffffff7fffc800, 0xffffffff7be05608,
      0x1005c20d8, 0xffffffff7be05728, 0xffffffff7be05690, 0x0), at
      0x1000e9408
  [15] SOAPServerTCPIP::onReceive(0xffffffff7fffc800, 0x102fb2b80,
      0x0, 0xbd, 0x1006aad30, 0x0), at 0x1000ed170
  [16] IOServer::check(0x1005d3988, 0x0, 0x102fb2b80, 0x103a03270,
      0x32, 0x103a032a8), at 0x1001009f4
  [17] IOServerMT::ChildServer::loop(0x103a03270, 0x32, 0x64,
      0xffffffff7e3912ac, 0x0, 0x0), at 0x1000ff640
  [18] IOServerMT::ChildServer::run(0x103a03270, 0x15, 0x100639e38,
      0x0, 0x0, 0x0), at 0x1001018cc
  [19] Thread::entryFun(0x103a03338, 0xffffffff7e722bb0, 0x0, 0x1,
      0xffffffff7e720000, 0x0), at 0x1001afe9c
| |
| Ian Collins 2008-03-31, 5:47 am |
| Sumir wrote:
> On Mar 31, 11:22 am, Ian Collins <ian-n...@hotmail.com> wrote:
*Please* don't quote signatures or that google nonsense.
> I did have a couple of core files. Loading them gives the following
> stack trace --
>
Your Date constructor looks to be a prime contender. Run the application
under dbx until it crashes and see what it is passing to do_get_date.
I'm guessing you are building with Sun Studio, using the default STL.
If so, try stlport4.
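The thread never shows the source of Date::Date, so the following is only a guess at the kind of call involved. It is a sketch of a hypothetical parse_date helper that reaches std::time_get through the public get_date member, which forwards to the virtual do_get_date. The arguments worth inspecting in the debugger are the same ones passed here: the iterator pair, the stream, the error state and the tm pointer.

```cpp
#include <cassert>
#include <ctime>
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical stand-in for a Date constructor that parses text with the
// standard time_get facet. Returns true if the facet did not report failbit.
bool parse_date(const std::string& text, std::tm& out) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());  // assumption: the "C" locale

    // The facet is looked up in the stream's locale.
    const std::time_get<char>& facet =
        std::use_facet<std::time_get<char> >(in.getloc());

    std::ios_base::iostate err = std::ios_base::goodbit;
    out = std::tm();  // zero-initialise so unparsed fields are well-defined

    std::istreambuf_iterator<char> begin(in);
    std::istreambuf_iterator<char> end;
    facet.get_date(begin, end, in, err, &out);  // forwards to do_get_date

    return (err & std::ios_base::failbit) == 0;
}
```

In the classic locale the expected input has the form month/day/year, for example "03/31/08". Both core dumps show Date::Date reaching library locale code on worker threads started from Thread::entryFun, so a small standalone test built around a helper like this could help separate a library problem from an application one.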
--
Ian Collins.
| |
| Giorgos Keramidas 2008-03-31, 5:47 am |
| On Sun, 30 Mar 2008 23:11:53 -0700 (PDT), Sumir <sumirmehta@gmail.com> wrote:
>
> I ran a truss on the application to have a trace. It resulted into
> this ...
>
> lwp_sema_post(0xFFFFFFFF7D003D60) = 0
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
> lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
> time() = 1205757131
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> Received signal #11, SIGSEGV [default]
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> *** process killed ***
>
> Seems there is some memory access violation. But the thing is, this
> same application compiled in 32-bit mode, run on the same environment
> (same machine) fares well. So if it were something to do within the
> code, in way of accessing memory wrongly, it should have surfaced in
> the 32-bit version as well.
Not necessarily. There are many programs which implicitly assume that
`int' is large enough to hold a memory address. They tend to work fine
in 32-bit mode (because their assumption happens to be true), but fail
randomly in 64-bit mode.
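A small sketch of that assumption failing. It uses plain 64-bit integers in place of real pointers so the behaviour is the same on any host; address_via_int and survives_int are illustrative names, and a 32-bit int is assumed.

```cpp
#include <cassert>
#include <stdint.h>

// Simulates code that stores an address in an `int` and later turns it
// back into a pointer-sized value.
uint64_t address_via_int(uint64_t address) {
    // Narrowing keeps only the low 32 bits (the usual behaviour; strictly
    // implementation-defined before C++20 when the value does not fit).
    int32_t stored = static_cast<int32_t>(address);
    // Widening back sign-extends, so the upper 32 bits become all 0 or all 1.
    return static_cast<uint64_t>(static_cast<int64_t>(stored));
}

// True if the address survives the round trip unchanged.
bool survives_int(uint64_t address) {
    return address_via_int(address) == address;
}
```

Any address below 2 GB survives, which is why such code can run for years as a 32-bit build. A heap address like 0x10fe72590 from the core dumps comes back as 0x0fe72590. The faulting address 0x0FAE0350 in the truss output fits in 32 bits, which is at least consistent with a truncated pointer, although nothing in the thread confirms it.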
| |
| Nikos Chantziaras 2008-03-31, 5:47 am |
| Sumir wrote:
> I ran a truss on the application to have a trace. It resulted into
> this ...
>
> lwp_sema_post(0xFFFFFFFF7D003D60) = 0
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
> lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
> time() = 1205757131
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> Received signal #11, SIGSEGV [default]
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> *** process killed ***
>
>
> Seems there is some memory access violation. But the thing is, this
> same application compiled in 32-bit mode, run on the same environment
> (same machine) fares well. So if it were something to do within the
> code, in way of accessing memory wrongly, it should have surfaced in
> the 32-bit version as well.
It looks like a typical sign extension problem. It's the most common
bug when porting to 64-bit and only shows itself when 'long' is 64 bits
wide. On 32-bit, long and int are the same width, so it doesn't happen;
whatever arithmetic you can do with int applies to long too, since
they're the same on 32-bit.
You have to inspect the code for constants like 0xFFFF, 0xFFFF0000 (and
similar patterns) and make them unsigned long if needed (UL), search for
things like shift operations with longs, longs that ought to be unsigned
longs, stuff like that. Travel up the stack trace of the crash until
you hit code that looks like what I just described.
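Two of the patterns described above, sketched with fixed-width types so the result does not depend on the host. The function names are made up for illustration, a 32-bit int is assumed, and the ULL suffix stands in for the UL suffix, which has the same effect on a 64-bit Solaris build.

```cpp
#include <cassert>
#include <stdint.h>

// 0xFFFF0000 does not fit in a 32-bit int, so the literal has type
// unsigned int. ~0xFFFF0000 is therefore the 32-bit value 0x0000FFFF,
// which zero-extends and wipes the upper 32 bits of a 64-bit operand too.
uint64_t clear_field_narrow_mask(uint64_t value) {
    return value & ~0xFFFF0000;
}

// With a 64-bit literal the complement is 0xFFFFFFFF0000FFFF, as intended.
uint64_t clear_field_wide_mask(uint64_t value) {
    return value & ~0xFFFF0000ULL;
}

// A bit pattern held in a signed 32-bit variable sign-extends when widened
// (the narrowing cast is implementation-defined before C++20).
uint64_t widen_from_signed(uint32_t bits) {
    int32_t held = static_cast<int32_t>(bits);
    return static_cast<uint64_t>(static_cast<int64_t>(held));
}

// The same pattern held in an unsigned variable zero-extends.
uint64_t widen_from_unsigned(uint32_t bits) {
    return bits;
}
```

With a 32-bit long there are no upper bits to lose or to fill, so each pair of functions gives the same answer and the bug stays hidden. On a 64-bit build, clear_field_narrow_mask turns 0x10FE72590 into 0x2590 where the wide mask gives 0x100002590, and widen_from_signed turns 0xFFFF0000 into 0xFFFFFFFFFFFF0000.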