Home > Archive > VC STL > March 2005 > std::vector performance with a custom allocator
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
std::vector performance with a custom allocator
|
|
|
| Recently I ran into a performance problem when using std::vector with a
custom allocator AND a built-in type.
Please note that the problem is NOT related to the dynamic memory
management.
In fact, I configured the test case such that the allocator doesn't
do memory allocation or deallocation during the main iteration loop.
The performance hit is simply triggered by the fact that the allocator
is a custom type (as opposed to std::allocator<> ).
The std::vector class treats the standard allocator differently in its
implementation (more details below).
The performance hit does not exist when a vector is used with a
user-defined type (non-built-in type).
I have found a workaround for the problem, but I'd like to know if
there is a more elegant way to address this issue.
I tracked it down to the implementation of the _Uninit_fill_n template.
Here is the scoop:
Say, we have a custom allocator class myAlloc whose implementation is
a replica of the std::allocator class.
Again, the allocator is not really involved per se during the main
iteration loop.
Here is the performance test:
template<typename V>
void testVector()
{
//a sufficiently large number
const int bufSize = 1000;
const int numIterations = 100; //mostly for timing
V vec; //the vector object
//reserve the memory in order to
//avoid dynamic memory allocations during the test
vec.reserve(bufSize);
//the performance test itself
for(int i = 0; i < numIterations; ++i)
{
for(int cnt = 0; cnt < bufSize; ++cnt)
vec.push_back(V::value_type(cnt));
//don't use vec.clear() because it deallocates
//memory under VC++ 7.1
vec.erase(vec.begin(),vec.end());
}
}
where the template parameter V is the vector type.
If you time the following four experiments (cases 1-4), you'll see
that the performance is dramatically (~10x in my case) different for a
built in type (cases 1 and 2). The performance is about the same for a
user-defined type (cases 3 and 4)
struct myInt {
myInt(int v): m_int(v){}
~myInt(){}
int m_int;
};
//case 1
testVector<std::vector<int> >();
//case 2
testVector<std::vector<int,myAlloc<int> > >();
//case 3
testVector<std::vector<myInt> >();
//case 4
testVector<std::vector<myInt,myAlloc<myInt> > >();
I believe that the problem is in the std::_Uninit_fill_n template, that
is an internal implementation detail of std::vector.
Built-in types are filled differently (which should be the case!),
where the flow is controlled by the _Scalar_ptr_iterator_tag argument.
Simply follow the flow with the Debugger:
Case 1: a built-in type + std::allocator<> combination is handled by:
template<class _Ty,
class _Diff,
class _Tval> inline
void _Uninit_fill_n(_Ty *_First, _Diff _Count,
const _Tval& _Val, allocator<_Ty>&, _Scalar_ptr_iterator_tag)
{ // copy _Count *_Val to raw _First, using _Al, scalar type
fill_n(_First, _Count, _Val);
}
That takes us to a "fast fill":
template<class _OutIt,
class _Diff,
class _Ty> inline
void fill_n(_OutIt _First, _Diff _Count, const _Ty& _Val)
{ // copy _Val _Count times through [_First, ...)
for (; 0 < _Count; --_Count, ++_First)
*_First = _Val;
}
While in case 2 we end up in:
template<class _FwdIt,
class _Diff,
class _Tval,
class _Alloc> inline
void _Uninit_fill_n(_FwdIt _First, _Diff _Count,
const _Tval& _Val, _Alloc& _Al, _Scalar_ptr_iterator_tag)
{ // copy _Count *_Val to raw _First, using _Al, arbitrary type
_Uninit_fill_n(_First, _Count,
_Val, _Al, _Nonscalar_ptr_iterator_tag());
}
That explicitly takes me to the "slow" non-built-in fill function
(refer to the last argument in the above line!):
template<class _FwdIt,
class _Diff,
class _Tval,
class _Alloc> inline
void _Uninit_fill_n(_FwdIt _First, _Diff _Count,
const _Tval& _Val, _Alloc& _Al, _Nonscalar_ptr_iterator_tag)
{ // copy _Count *_Val to raw _First, using _Al, arbitrary type
_FwdIt _Next = _First;
_TRY_BEGIN
for (; 0 < _Count; --_Count, ++_First)
_Al.construct(_First, _Val);
_CATCH_ALL
for (; _Next != _First; ++_Next)
_Al.destroy(_Next);
_RERAISE;
_CATCH_END
}
Hence the peformance hit for built-in types when using a custom
allocator!!!!
I think the root of the problem is that there is an overload of the
_Uninit_fill_n<> template specifically for the std::allocator class
(refer to the _Uninit_fill_n template in case 1). Take a look a the
signarture in case 1. Also, notice that the allocator is only used as a
type for template metaprogramming.
One fix that I found is to do exactly the same template overload for
myAlloc<> as well:
namespace std {
template<class _Ty,
class _Diff,
class _Tval> inline
void _Uninit_fill_n(_Ty *_First, _Diff _Count,
const _Tval& _Val, myAlloc<_Ty>&, _Scalar_ptr_iterator_tag)
{ // copy _Count *_Val to raw _First, using _Al, scalar type
fill_n(_First, _Count, _Val);
}
template<class _Ty,
class _Diff,
class _Tval> inline
void _Uninit_fill_n(_Ty *_First, _Diff _Count,
const _Tval& _Val, myAlloc <const _Ty>&,
_Scalar_ptr_iterator_tag)
{ // copy _Count *_Val to raw _First, using _Al, scalar type
fill_n(_First, _Count, _Val);
}
}
This does the trick - the performance in cases 1 and 2 becomes the
same.
However, it is really a lot of jumping through the hoops, so I consider
this a bug in the implementation of std::vector.
Am I missing something?
Also, are there any other similar places in the STL implementation
where the aforementioned performance hit is present?
Thanks in advance.
| |
| Pete Becker 2005-03-03, 4:05 am |
| anton wrote:
>
> Am I missing something?
>
The key difference between the two fill algorithms is that the "slow"
one uses the allocator's 'construct' member function. For builtin types
with the standard allocator we can bypass that call because we know that
it just copies its argument. For a user-supplied allocator we can't.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
| |
|
| Ok, I think I see your implementation dilemma.
By the way, I tried to specialize the construct method of myAlloc to do
assignment (*ptr = val) instead of placement new for built-in types -
it didn't improve the performance. It looks like the extra dispatch to
Al.construct() in the "slow" _Uninit_fill_n prohibits the compiler
from some optimizations.
Maybe the Stadard is not flexible/generic enough to give the extra
degree of control to a custom allocator writer - the vector class
doesn't get any clues from a custom allocator with respect to possible
optimizations for built-in types. Some trait/tag technique (similar to
what you do) could have worked here.
The final question: I assume my workaround (to overload _Uninit_fill_n
for myAlloc a la yours) is safe then? My allocator doesn't really do
anything fancy - only memory pooling and tracking.
TIA
Anton
| |
| Pete Becker 2005-03-03, 4:02 pm |
| anton wrote:
>
> The final question: I assume my workaround (to overload _Uninit_fill_n
> for myAlloc a la yours) is safe then?
>
It should work fine. Just remember that it's not portable, except to
other compilers that use our library.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
|
|
|
|
|