Code Comments
Hi,

We are trying to migrate an existing C++ application on Solaris from a 32-bit build to a 64-bit build. We have successfully compiled the application in 64-bit mode, but we are having problems running it: the application crashes intermittently for no apparent reason.

We think this could be because not enough memory is available to the 64-bit application, so we investigated various Solaris tunable parameters. A brief description of them is listed below.

lwp_default_stksize

  Description: Specifies the default value of the stack size to be used when a kernel thread is created, and when the calling routine does not provide an explicit size to be used.
  Data Type: Integer
  Range: Minimum is the default value: 3 x PAGESIZE on SPARC systems (3 * 8192 = 24576). Maximum is 32 times the default value.
  Units: Bytes, in multiples of the value returned by the getpagesize parameter. For more information, see getpagesize(3C).
  Dynamic?: Yes. Affects threads created after the variable is changed.
  Validation: Must be greater than or equal to 8192 and less than or equal to 262,144 (256 x 1024). Also must be a multiple of the system page size. If these conditions are not met, the following message is displayed:
      Illegal stack size, Using N
    The value of N is the default value of lwp_default_stksize.
  When to Change: When the system panics because it has run out of stack space. The best solution for this problem is to determine why the system is running out of space and then make a correction. Increasing the default stack size means that almost every kernel thread will have a larger stack, resulting in increased kernel memory consumption for no good reason. Generally, that space will be unused. The increased consumption means other resources that are competing for the same pool of memory will have the amount of space available to them reduced, possibly decreasing the system's ability to perform work. Among the side effects is a reduction in the number of threads that the kernel can create. This solution should be treated as no more than an interim workaround until the root cause is remedied.

segkpsize

  Description: Specifies the amount of kernel pageable memory available. This memory is used primarily for kernel thread stacks. Increasing this number allows either larger stacks for the same number of threads or more threads. This parameter can only be set on systems running 64-bit kernels. Systems running 64-bit kernels use a default stack size of 24 Kbytes.
  Data Type: Unsigned long
  Default: 64-bit kernels, 2 Gbytes; 32-bit kernels, 512 Mbytes
  Range: 64-bit kernels, 512 Mbytes - 24 Gbytes; 32-bit kernels, 512 Mbytes
  Units: Mbytes
  Dynamic?: No
  Validation: Value is compared to minimum and maximum sizes (512 Mbytes and 24 Gbytes for 64-bit systems) and if smaller than the minimum or larger than the maximum, it is reset to 2 Gbytes and a message to that effect is displayed. The actual size used in creation of the cache is the lesser of the value specified in segkpsize after the constraints checking and 50% of physical memory.
  When to Change: This is one of the steps necessary to support large numbers of processes on a system. The default size of 2 Gbytes, assuming at least 1 Gbyte of physical memory is present, allows creation of 24-Kbyte stacks for more than 87,000 kernel threads. The size of a stack in a 64-bit kernel is the same whether the process is a 32-bit process or a 64-bit process. If more than this number is needed, segkpsize can be increased assuming sufficient physical memory exists.

Does anyone have an idea about these parameters (or any other related parameters), and whether they could help resolve the issue at hand? Or is there any other area we should look at that could help in this case?
Sumir <sumirmehta@gmail.com> writes:

>we are trying to migrate an existing C++ application on solaris
>compiled in 32-bit env to 64-bit env.
>We have successfully compiled the application in 64-bit mode. We are
>facing problems while running the compiled application.
>The application seems to be crashing intermittently for no specific
>reason.

Surely it crashes somewhere? "Intermittently, unspecific" feels like memory corruption.

>We think this could probably be due to sufficient memory not available
>to the 64-bit application.

That would result in NULL being returned from allocation functions and exceptions being thrown.

>we investigated various tunable parameters in solaris for the same
>purpose.
>A brief description of them is listed below --
>lwp_default_stksize

Not relevant to user processes; this is a *kernel* stack limit.

>segkpsize

Something to do with the kernel, not user processes.

>Does anyone have an idea about these parameters (or any other related
>params), and if they could be helpful in resolving the issue at hand ?
>Or is there any other area, we should look at which could help in this
>case ?

If the crash is intermittent then it's not likely a resource issue, unless the crash happens when resources are allocated or shortly afterwards. It is unlikely that a 64-bit app uses many more resources than its 32-bit counterpart, and you would notice it running out of swap space (memory resources) and likely out of stack space (which you can change using ulimit).

Casper
--
Expressed in this posting are my opinions. They are in no way related to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may be fiction rather than truth.
Sumir <sumirmehta@gmail.com> writes:

> The application seems to be crashing intermittently for no specific
> reason.

Oh, there *is* a reason (or a few).

The very first question you should ask is: where exactly is it crashing? Use a debugger to find out.

> We think this could probably be due to sufficient memory not available
> to the 64-bit application.

You have not presented any reason why you'd think that.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
On Mar 28, 7:29 pm, Casper H.S. Dik <Casper....@Sun.COM> wrote:
> Sumir <sumirme...@gmail.com> writes:
>
> Surely it crashes somewhere?  "Intermittently, unspecific"
> feels like memory corruption.
>
> That would result in NULL being returned from allocation functions
> and exceptions to be thrown.
>
> Not relevant to user processes; this is a *kernel* stack limit.
>
> Something to do with the kernel, not user processes.
>
> If the crash is intermittent then it's not likely a resource issue
> unless the crash happens when resources are allocated or shortly
> afterwards; it is unlikely that a 64 bit app uses many more
> resources than its 32 bit counterpart and you would notice it running
> out of swap space (memory resources) and likely out of stack space.
> (which you can change using ulimit)
>
> Casper
> --
> Expressed in this posting are my opinions.  They are in no way related
> to opinions held by my employer, Sun Microsystems.
> Statements on Sun products included here are not gospel and may
> be fiction rather than truth.

Hi Casper,

Thanks for your reply. The application does not crash on resource allocation. In fact, it runs for some time before it dies. This behaviour is also visible when there are multiple clients (say, a multithreaded process) connected to the application and hitting it continuously with requests.

Can you briefly elaborate on the usage of ulimit, and how it could affect the application in our case?
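On the ulimit question: the limits that the shell's ulimit command reports are per-process resource limits, and a process can read its own values through the POSIX getrlimit() call. The following is only a minimal sketch of that idea, not a diagnosis of this crash. The helper names print_value, print_limit and print_process_limits are made up for illustration, and note that getrlimit() reports bytes while "ulimit -s" typically prints kilobytes.

```cpp
#include <sys/resource.h>  // getrlimit, struct rlimit, RLIMIT_* (POSIX)
#include <cstdio>

// Hypothetical helper: prints one limit value, or "unlimited".
void print_value(const char* label, rlim_t value) {
    if (value == RLIM_INFINITY) {
        std::printf(" %s=unlimited", label);
    } else {
        std::printf(" %s=%llu", label, static_cast<unsigned long long>(value));
    }
}

// Hypothetical helper: prints the soft and hard value of one resource limit.
// Returns false if getrlimit() fails.
bool print_limit(const char* name, int resource) {
    struct rlimit lim;
    if (getrlimit(resource, &lim) != 0) {
        std::perror(name);
        return false;
    }
    std::printf("%s:", name);
    print_value("soft", lim.rlim_cur);
    print_value("hard", lim.rlim_max);
    std::printf("\n");
    return true;
}

// Prints the limits most often adjusted with ulimit: stack, data segment
// and core file size. Returns true only if every query succeeded.
bool print_process_limits() {
    bool ok = print_limit("stack bytes (ulimit -s)", RLIMIT_STACK);
    ok = print_limit("data bytes  (ulimit -d)", RLIMIT_DATA) && ok;
    ok = print_limit("core bytes  (ulimit -c)", RLIMIT_CORE) && ok;
    return ok;
}
```

Calling print_process_limits() early in the server's startup would show which limits the 64-bit build actually runs under. If the core limit is 0, raising it with "ulimit -c" before starting the process is what allows a core file to be written for the debugger.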
On Mar 29, 10:14 am, Paul Pluzhnikov <ppluzhnikov-...@gmail.com> wrote:
> Sumir <sumirme...@gmail.com> writes:
>
> Oh, there *is* a reson (or a few).
>
> The very first question you should ask is where exactly is it
> crashing? Use debugger to find out.
>
> You have not presented any reason why you'd think that.
>
> Cheers,
> --
> In order to understand recursion you must first understand recursion.
> Remove /-nsp/ for email.

Hi Paul,

I ran truss on the application to get a trace. It resulted in this ...

    lwp_sema_post(0xFFFFFFFF7D003D60) = 0
    lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
    read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
    lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
    time() = 1205757131
    lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
    Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
    siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
    Received signal #11, SIGSEGV [default]
    siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
    *** process killed ***

It seems there is some memory access violation. But the same application compiled in 32-bit mode, run in the same environment (same machine), fares well. So if the code itself were accessing memory wrongly, it should have surfaced in the 32-bit version as well.
Sumir wrote:
> On Mar 29, 10:14 am, Paul Pluzhnikov <ppluzhnikov-...@gmail.com>
> wrote:
>
> Hi Paul,
>
> I ran a truss on the application to have a trace. It resulted into
> this ...
>

What do you see in your debugger? If there's a core file, load that.

--
Ian Collins.
On Mar 31, 11:22 am, Ian Collins <ian-n...@hotmail.com> wrote:
> Sumir wrote:
>
> What do you see in your debugger?  If there's a core file, load that.
>
> --
> Ian Collins.- Hide quoted text -
>
> - Show quoted text -

I did have a couple of core files. Loading them gives the following stack trace --

CORE 1 -->

=>[1] __rwstd::time_reader<char,std::istreambuf_iterator<char,std::char_traits<char> 0xffffffff7ac06978, 0x0, 0x10fe72590, 0x1005c20d8), at 0x100386884
  [2] std::time_get<char,std::istreambuf_iterator<char,std::char_traits<char> 0xffffffff7ac0688c, 0xffffffff7ac06800), at 0x100384078
  [3] Date::Date(0x10bc4f910, 0xffffffff7ac06a18, 0x6800, 0x6070, 0x1005c20d8, 0x68a0), at 0x1001b17f0
  [4] ImagineToDbml::getCalendar(0x80000006c, 0x1, 0x1005d0748, 0xffffffff7ac069b0, 0x1000000f4, 0x1005c20d8), at 0x1001184e0
  [5] IMDFn::getCalendarXML(0xffffffff7ac07238, 0xffffffff7ac06dd8, 0xffffffff7ac07250, 0xffffffff7ac06cf0, 0xffffffff7ac06d60, 0xffffffff7ac06fd0), at 0x1000b6dc4
  [6] SOAPServer::doFnRequest(0xffffffff7fffc810, 0x10064fb78, 0xffffffff7ac07728, 0xffffffff7ac07250, 0xffffffff7ac07250, 0x1005cbe30), at 0x1000e9ccc
  [7] SOAPServer::callFn(0xffffffff7fffc810, 0xffffffff7ac07210, 0x10047b7aa, 0x10064fb78, 0xffffffff7ac07728, 0xffffffff7ac07690), at 0x1000e9b20
  [8] SOAPServer::onBody(0xffffffff7fffc810, 0xffffffff7ac07478, 0xffffffff7ac073f8, 0xffffffff7ac07728, 0xffffffff7ac07690, 0xffffffff), at 0x1000e97b8
  [9] SOAPServer::onMessage(0xffffffff7fffc810, 0xffffffff7ac07608, 0x1005c20d8, 0xffffffff7ac07728, 0xffffffff7ac07690, 0x0), at 0x1000e9408
  [10] SOAPServerTCPIP::onReceive(0xffffffff7fffc810, 0x1074ded50, 0x0, 0xbd, 0x105b99060, 0x0), at 0x1000ed170
  [11] IOServer::check(0x1005d3988, 0x0, 0x1074ded50, 0x1079cfd00, 0x32, 0x1079cfd38), at 0x1001009f4
  [12] IOServerMT::ChildServer::loop(0x1079cfd00, 0x32, 0x64, 0xffffffff7e3912ac, 0x0, 0x0), at 0x1000ff640
  [13] IOServerMT::ChildServer::run(0x1079cfd00, 0x15, 0x100639e38, 0x0, 0x0, 0x0), at 0x1001018cc
  [14] Thread::entryFun(0x1079cfdc8, 0xffffffff7e722bb0, 0x0, 0x1, 0xffffffff7e720000, 0x0), at 0x1001afe9c

CORE 2 -->

=>[1] __rwstd::timepunct_data<char>::__initpat(0xb0, 0x10060d3d8, 0x10fe3d490, 0x1005c20d8, 0x0, 0x0), at 0x1003956ac
  [2] __rwstd::timepunct<char>::__initfacet(0x10fe32730, 0x1005d5e50, 0x1, 0x2, 0x0, 0x10fe32760), at 0x10039466c
  [3] std::locale::__install(0x1005d5e50, 0x10fe32730, 0x10060ded0, 0x10036aae4, 0x1005c20d8, 0x1), at 0x10036ab78
  [4] std::locale::__make_explicit(0x1005d5e50, 0x10060ded0, 0x1, 0x100, 0x100382db8, 0x5800), at 0x10036aaf4
  [5] std::time_get<char,std::istreambuf_iterator<char,std::char_traits<char> 0x1005d5ea0), at 0x100382d1c
  [6] std::locale::__install(0x1005d5e50, 0x10fe37b20, 0x10060ddb0, 0x10036aae4, 0x1005c20d8, 0x0), at 0x10036ab78
  [7] std::locale::__make_explicit(0x1005d5e50, 0x10060ddb0, 0x1, 0x100, 0x1001b1900, 0x5800), at 0x10036aaf4
  [8] Date::Date(0x10fe42c50, 0x100481fdc, 0x6800, 0x6070, 0x1005c20d8, 0x68a0), at 0x1001b1770
  [9] ImagineToDbml::getCalendar(0xffffffff7be04dd8, 0xffffffff7be04f28, 0x1005d0748, 0xffffffff7be049b0, 0xffffffff7be04fb8, 0xffffffff7be04e27), at 0x1001184c4
  [10] IMDFn::getCalendarXML(0xffffffff7be05238, 0xffffffff7be04dd8, 0xffffffff7be05250, 0xffffffff7be04cf0, 0xffffffff7be04d60, 0xffffffff7be04fd0), at 0x1000b6dc4
  [11] SOAPServer::doFnRequest(0xffffffff7fffc800, 0x10064fb78, 0xffffffff7be05728, 0xffffffff7be05250, 0xffffffff7be05250, 0x1005cbe30), at 0x1000e9ccc
  [12] SOAPServer::callFn(0xffffffff7fffc800, 0xffffffff7be05210, 0x10047b7aa, 0x10064fb78, 0xffffffff7be05728, 0xffffffff7be05690), at 0x1000e9b20
  [13] SOAPServer::onBody(0xffffffff7fffc800, 0xffffffff7be05478, 0xffffffff7be053f8, 0xffffffff7be05728, 0xffffffff7be05690, 0xffffffff), at 0x1000e97b8
  [14] SOAPServer::onMessage(0xffffffff7fffc800, 0xffffffff7be05608, 0x1005c20d8, 0xffffffff7be05728, 0xffffffff7be05690, 0x0), at 0x1000e9408
  [15] SOAPServerTCPIP::onReceive(0xffffffff7fffc800, 0x102fb2b80, 0x0, 0xbd, 0x1006aad30, 0x0), at 0x1000ed170
  [16] IOServer::check(0x1005d3988, 0x0, 0x102fb2b80, 0x103a03270, 0x32, 0x103a032a8), at 0x1001009f4
  [17] IOServerMT::ChildServer::loop(0x103a03270, 0x32, 0x64, 0xffffffff7e3912ac, 0x0, 0x0), at 0x1000ff640
  [18] IOServerMT::ChildServer::run(0x103a03270, 0x15, 0x100639e38, 0x0, 0x0, 0x0), at 0x1001018cc
  [19] Thread::entryFun(0x103a03338, 0xffffffff7e722bb0, 0x0, 0x1, 0xffffffff7e720000, 0x0), at 0x1001afe9c
Sumir wrote:
> On Mar 31, 11:22 am, Ian Collins <ian-n...@hotmail.com> wrote:

*Please* don't quote signatures or that google nonsense.

> I did have a couple of core files. Loading them gives the following
> stack trace --
>

Your Date constructor looks to be a prime contender. Run the application under dbx until it crashes and see what it is passing to do_get_date.

I'm guessing you are building with Sun Studio, using the default STL. If so, try stlport4.

--
Ian Collins.
On Sun, 30 Mar 2008 23:11:53 -0700 (PDT), Sumir <sumirmehta@gmail.com> wrote:
>
> I ran a truss on the application to have a trace. It resulted into
> this ...
>
> lwp_sema_post(0xFFFFFFFF7D003D60) = 0
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
> lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
> time() = 1205757131
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> Received signal #11, SIGSEGV [default]
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> *** process killed ***
>
> Seems there is some memory access violation. But the thing is, this
> same application compiled in 32-bit mode, run on the same environment
> (same machine) fares well. So if it were something to do within the
> code, in way of accessing memory wrongly, it should have surfaced in
> the 32-bit version as well.

Not necessarily. There are many programs which implicitly assume that `int' is large enough to hold a memory address. They tend to work fine in 32-bit mode (because their assumption happens to be true), but fail randomly in 64-bit mode.
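To make the `int' assumption concrete, here is a small sketch of what happens to a 64-bit address that passes through a 32-bit integer. It models addresses as plain 64-bit numbers so the result does not depend on the machine it runs on. The function names are made up, and the value 0x10FAE0350 is purely hypothetical: it was picked so that its low half equals the faulting address 0x0FAE0350 from the truss output. That address fits in 32 bits while the heap addresses in the core dumps look like 0x10fe72590, which is at least consistent with a truncated pointer, but it proves nothing by itself.

```cpp
#include <cstdint>

// Models a 64-bit address being stored in a 32-bit integer (for example an
// `int` or `unsigned` member) and widened back: the upper 32 bits are lost.
constexpr std::uint64_t through_32_bits(std::uint64_t address) {
    return static_cast<std::uint32_t>(address);
}

// True if the address is unchanged by the round trip, which is always the
// case in a 32-bit process and only sometimes the case in a 64-bit one.
constexpr bool survives_32_bits(std::uint64_t address) {
    return through_32_bits(address) == address;
}

// Hypothetical heap address, NOT taken from the application.
constexpr std::uint64_t kHypotheticalHeapAddress = 0x10FAE0350ULL;

static_assert(through_32_bits(kHypotheticalHeapAddress) == 0x0FAE0350ULL,
              "upper bits are silently dropped");
static_assert(!survives_32_bits(0xFFFFFFFF7E449BBCULL),
              "the %pc value from the truss output would not survive either");
static_assert(survives_32_bits(0x0FAE0350ULL),
              "addresses below 4 GB survive, which is why 32-bit builds work");
```

In real code the fix is to keep addresses in pointer types, or in std::uintptr_t / std::intptr_t when an integer is genuinely needed, and to treat compiler warnings about pointer-to-integer casts as errors in the 64-bit build.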
Sumir wrote:
> I ran a truss on the application to have a trace. It resulted into
> this ...
>
> lwp_sema_post(0xFFFFFFFF7D003D60) = 0
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> read(39, " 3 e c 0 a 8 f 1 c 7 d 0".., 512) = 512
> lwp_mutex_wakeup(0xFFFFFFFF7E72B068) = 0
> time() = 1205757131
> lwp_mutex_lock(0xFFFFFFFF7E72B068) = 0
> Incurred fault #6, FLTBOUNDS %pc = 0xFFFFFFFF7E449BBC
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> Received signal #11, SIGSEGV [default]
> siginfo: SIGSEGV SEGV_MAPERR addr=0x0FAE0350
> *** process killed ***
>
>
> Seems there is some memory access violation. But the thing is, this
> same application compiled in 32-bit mode, run on the same environment
> (same machine) fares well. So if it were something to do within the
> code, in way of accessing memory wrongly, it should have surfaced in
> the 32-bit version as well.

It looks like a typical sign extension problem. It's the most common bug when porting to 64-bit, and it only shows itself when 'long' is 64 bits wide. On 32-bit, long and int are the same width, so it doesn't happen: whatever arithmetic you can do with int applies to long too, since they're the same on 32-bit.

You have to inspect the code for constants like 0xFFFF, 0xFFFF0000 (and similar patterns) and make them unsigned long (UL) if needed. Search for things like shift operations with longs, longs that ought to be unsigned longs, stuff like that.

Travel up the stack trace of the crash until you hit code that looks like what I just described.
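A rough illustration of the two patterns described above, using fixed-width types so the behaviour is the same on any platform (on 64-bit Solaris, 'long' corresponds to the 64-bit types here). The function names are invented for this sketch, and the only value borrowed from the thread is the %pc address from the truss output, used as sample input.

```cpp
#include <cstdint>

// Pattern 1: widening a negative 32-bit signed value copies the sign bit
// into the upper 32 bits of the 64-bit result.
constexpr std::uint64_t widen_signed(std::int32_t value) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(value));
}

// Going through an unsigned 32-bit type zero-extends instead.
constexpr std::uint64_t widen_unsigned(std::uint32_t value) {
    return value;
}

// Pattern 2: a 32-bit mask constant applied to a 64-bit value. The mask is
// zero-extended before the AND, so the upper 32 bits are cleared as well.
constexpr std::uint32_t kMask32 = 0xFFFF0000u;
constexpr std::uint64_t mask_with_32_bit_constant(std::uint64_t value) {
    return value & kMask32;
}

// The 64-bit mask keeps the upper half and clears only the low 16 bits.
constexpr std::uint64_t kMask64 = ~std::uint64_t{0xFFFF};
constexpr std::uint64_t mask_with_64_bit_constant(std::uint64_t value) {
    return value & kMask64;
}

static_assert(widen_signed(INT32_MIN) == 0xFFFFFFFF80000000ULL,
              "sign bit is replicated into the upper half");
static_assert(widen_unsigned(0x80000000u) == 0x0000000080000000ULL,
              "unsigned widening leaves the upper half zero");
static_assert(mask_with_32_bit_constant(0xFFFFFFFF7E449BBCULL) == 0x7E440000ULL,
              "32-bit mask destroys the upper half");
static_assert(mask_with_64_bit_constant(0xFFFFFFFF7E449BBCULL) == 0xFFFFFFFF7E440000ULL,
              "64-bit mask preserves it");
```

Both variants of each pattern give identical results as long as every value fits in 32 bits, which is why a 32-bit build can run for years without exposing them.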