Home > Archive > Unix Programming > January 2007 > memcpy() and CPU usage
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
memcpy() and CPU usage
|
|
|
| I have a very CPU intensive application that should meet a specified
response time. One of my algorithms requires me to do a memcpy() of N
rows (one row = 2048 elements) very frequently.
I was wondering how memcpy() is implemented internally, i.e. does it
involve the CPU or the bus can handle transfer between different
locations?
Thanks.
| |
| Lew Pitcher 2007-01-10, 7:08 pm |
|
psp wrote:
> I have a very CPU intensive application that should meet a specified
> response time. One of my algorithms requires me to do a memcpy() of N
> rows (one row = 2048 elements) very frequently.
>
> I was wondering how memcpy() is implemented internally, i.e. does it
> involve the CPU or the bus can handle transfer between different
> locations?
memcpy() certainly would involve the bus, and likely the CPU as well
memcpy() copies values from one location to another within a single
process, so both the source memory and target memory locations exist
within a single address space. A copy from one place to another will
require at least data moving over the bus; the kernel wouldn't fiddle
with page mapping to make memcpy() happen. The movement over the bus
may be regulated by an off-stream processor (such as a DMA handler) or
by the CPU. If source or target memory has been paged out, it likely
would be paged back in, but even if memcpy() performed a copy from
memory to pagespace, it would involve CPU and bus activity (more so
than memory to memory).
memcpy() may be implemented by a short sequence of assembly code, but
the canonical memcpy() is a small (circa 3 line) C function.
HTH
--
Lew
| |
| Al Balmer 2007-01-10, 7:08 pm |
| On 10 Jan 2007 10:16:54 -0800, "psp" <pradhan.pushkar@gmail.com>
wrote:
>I have a very CPU intensive application that should meet a specified
>response time. One of my algorithms requires me to do a memcpy() of N
>rows (one row = 2048 elements) very frequently.
>
>I was wondering how memcpy() is implemented internally, i.e. does it
>involve the CPU or the bus can handle transfer between different
>locations?
>
It's implemented in whatever way the library programmer thought best
for the particular implementation.
Your implementation probably provides a way to look at the generated
code, but don't have high hopes of improving it.
>Thanks.
--
Al Balmer
Sun City, AZ
| |
| Eric Sosman 2007-01-10, 7:08 pm |
| psp wrote On 01/10/07 13:16,:
> I have a very CPU intensive application that should meet a specified
> response time. One of my algorithms requires me to do a memcpy() of N
> rows (one row = 2048 elements) very frequently.
Why?
Making copy after copy after copy of the same thing,
over and over and over, seldom advances the state of the
computation. It just gives you the same information you
already had, in N places instead of in just one. Can you
find a way to just point at the single authoritative
version instead of cloning it all over the place?
Yes, there are circumstances where this can't be done.
And there are circumstances where it may be simpler to make
a few copies than to keep track of an Ur-instance. But if
you've found that the copying has grown to the point where
it's a significant performance hit for your program, you
should start thinking about ways to avoid the copying before
thinking about ways to copy faster. (Especially since, as
others have pointed out, memcpy() is probably pretty close
to the fastest achievable anyhow.)
So: What is this algorithm that requires so many copies?
--
Eric.Sosman@sun.com
| |
| David T. Ashley 2007-01-11, 4:12 am |
| "psp" <pradhan.pushkar@gmail.com> wrote in message
news:1168453014.507334.125920@77g2000hsv.googlegroups.com...
>I have a very CPU intensive application that should meet a specified
> response time. One of my algorithms requires me to do a memcpy() of N
> rows (one row = 2048 elements) very frequently.
>
> I was wondering how memcpy() is implemented internally, i.e. does it
> involve the CPU or the bus can handle transfer between different
> locations?
If the rows are in a regular data arrangement, instead of copying the data,
you can change some control information that describes where a given row is.
For example, one might have an array of pointers, each pointing to a row
with 2048 elements.
Instead of moving the elements, just move the pointers around.
I suggest algorithmic improvement, if possible.
| |
| Barry Margolin 2007-01-11, 4:12 am |
| In article < dIOdnUOBX5K1IDjYnZ2dnUVZ_ruknZ2d@giganew
s.com>,
"David T. Ashley" <dta@e3ft.com> wrote:
> "psp" <pradhan.pushkar@gmail.com> wrote in message
> news:1168453014.507334.125920@77g2000hsv.googlegroups.com...
>
> If the rows are in a regular data arrangement, instead of copying the data,
> you can change some control information that describes where a given row is.
>
> For example, one might have an array of pointers, each pointing to a row
> with 2048 elements.
>
> Instead of moving the elements, just move the pointers around.
>
> I suggest algorithmic improvement, if possible.
If you're going to be modifying the data, and you need to retain the
original contents, you need to make a copy.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
| |
| Logan Shaw 2007-01-11, 4:12 am |
| psp wrote:
> I have a very CPU intensive application that should meet a specified
> response time. One of my algorithms requires me to do a memcpy() of N
> rows (one row = 2048 elements) very frequently.
>
> I was wondering how memcpy() is implemented internally, i.e. does it
> involve the CPU or the bus can handle transfer between different
> locations?
Of course, how memcpy() is implemented depends on which implementation
you are talking about!
But if there is one good general rule about how memcpy(), it is that
memcpy() is usually implemented *very* *efficiently*. What's the reason?
The reason is that memcpy() is called ALL OVER the place, left and right,
from just about every program of every type. If the operating system can
provide a faster memcpy(), they can speed up almost every program that
runs on the system, so there is a huge incentive to make improvements.
Case in point: years ago, when I used to read the Amiga newsgroups,
someone was bragging about how they never used the OS's memory copying
function, and instead always used their own highly-optimized assembly
routine. At this point, someone (possibly someone from the Amiga team
at Commodore?) piped up and said, "Yeah, your routine is pretty OK, but
it would be faster if you used this other register to store more temporary
data like we implemented it in the operating system." Slam. ;-)
Also, on typical modern PC hardware, memory bandwidth is the real
constraint here, even if the routine is not carefully hand-optimized
assembly. Even if the bus could directly transfer, the rate at which
it could transfer is probably not faster than what the CPU can do
with any kind of decent routine.
Finally, I agree with another poster that avoiding the copy is probably
the best strategy. Can you, for example, maintain a pointer for each
row, then copy only the pointer when a row is copied, and use a "copy
on write" strategy (to "fork" its contents from the rest of the rows
it shares its copy of the data with) when a row actually needs to be
updated? For example:
conceptual state:
row[1]: abcd
row[2]: efgh
row[3]: ijkl
row[4]: mnop
actual state, with rows replaced by references to a pool of row values:
pool[1]: reference count 1; contents "abcd"
pool[2]: reference count 1; contents "efgh"
pool[3]: reference count 1; contents "ijkl"
pool[4]: reference count 1; contents "mnop"
row[1]: 1
row[2]: 2
row[3]: 3
row[4]: 4
operation 1: copy row 1 to row 3
new state:
pool[1]: reference count now 2; contents "abcd"
pool[2]: reference count 1; contents "efgh"
pool[3]: free since reference count just dropped to 0.
pool[4]: reference count 1; contents "mnop"
row[1]: 1
row[2]: 2
row[3]: 1
row[4]: 4
operation 2: change first byte of row 1 to "X"
operation 2, subtask 1: make private copy of shared row for row 1
new state:
pool[1]: reference count 1; contents "abcd"
pool[2]: reference count 1; contents "efgh"
pool[3]: reference count 1; contents "abcd"
pool[4]: reference count 1; contents "mnop"
row[1]: 3
row[2]: 2
row[3]: 1
row[4]: 4
operation 2, subtask 2: now you can freely modify private copy
new state:
pool[1]: reference count 1; contents "abcd"
pool[2]: reference count 1; contents "efgh"
pool[3]: reference count 1; contents "Xbcd"
pool[4]: reference count 1; contents "mnop"
row[1]: 3
row[2]: 2
row[3]: 1
row[4]: 4
This will require maintaining a free list of pool items, but that
is not too hard. The big advantage is that it will save copies.
The di vantage is that after enough operations, your conceptually
sequential data will no longer be actually sequential in real memory.
However, with 2048 elements in a row, that may not matter much, if
you tend to read through entire rows at a time.
- Logan
series
| |
| David Schwartz 2007-01-11, 7:08 pm |
|
Barry Margolin wrote:
> If you're going to be modifying the data, and you need to retain the
> original contents, you need to make a copy.
Not always. It depends upon the specifics of his algorithm. For
example, you could simply keep a table of changes and access the data
"through the table" applying the changes if needed.
For example, suppose you have 2,048 rows and your changes always
involve replacing exactly one entire row. You could keep a table of
pointers to the each row of the original data. When you need to replace
a row, just allocate a new row and switch the pointer to the new row
instead of the old row. When you are done, restore the original pointer
and free the newly allocated row (or reuse it the next time you need to
replace a row). This eliminates the copying entirely. (At the cost of
an extra level of indirection.)
Algorithmic optimization may be the way for the OP to go. It's hard to
know without knowing what his algorithm is.
DS
| |
| Stephen M. Dunn 2007-01-11, 7:08 pm |
| In article <45a5c82a$0$5155$4c368faf@roadrunner.com> Logan Shaw <lshaw-usenet@austin.rr.com> writes:
$Of course, how memcpy() is implemented depends on which implementation
$you are talking about!
And not just on how efficiently the library code implements it, but
on what support the CPU itself offers. The Intel x86 instruction
set, for instance, has had support for this sort of thing since day
one. The single instruction (or two, if you count the prefix as
a separate instruction)
REP MOVSB
will move whatever number of bytes you specify in one register
from one location to another. That's two bytes of machine code,
not including setting up various registers to tell it what to do
(which you'd have to do no matter how you implemented it), and
everything else is implemented directly in hardware. You'll have
a hard time beating that by writing a loop of code.
There's also a word (16-bit) version, and I believe later members
of the x86 family expand on this.
On some systems, there may be a DMA controller which can be
programmed to do memory-to-memory moves without further CPU
intervention; that might be another option.
As mentioned in previous responses, the folks who write these
implementations generally know a number of ways of accomplishing such
a task and will use the most efficient one.
--
Stephen M. Dunn <stephen@stevedunn.ca>[color=darkred]
------------------------------------------------------------------
Say hi to my cat -- http://www.stevedunn.ca/photos/toby/
| |
|
| Stephen M. Dunn wrote:
> The Intel x86 instruction set, for instance, has had support for this
> sort of thing since day one. The single instruction (or two, if you
> count the prefix as a separate instruction)
>
> REP MOVSB
>
> will move whatever number of bytes you specify in one register
> from one location to another. That's two bytes of machine code,
> not including setting up various registers to tell it what to do
> (which you'd have to do no matter how you implemented it), and
> everything else is implemented directly in hardware. You'll have
> a hard time beating that by writing a loop of code.
Quite the opposite.
cf. AMD's x86 Code Optimization Guide
http://www.amd.com/us-en/assets/con..._docs/22007.pdf
Optimizing Main Memory Performance for Large Arrays
where they start with a naive rep movsb and improve step by step.
On modern x86 implementations, an explicit loop is slightly faster than
rep movsd. The best performance comes from loop unrolling and explicit
prefetching.
Regards.
| |
| Pierre L. 2007-01-12, 7:05 pm |
| On Jan 10, 7:16 pm, "psp" <pradhan.push...@gmail.com> wrote:
> I have a very CPU intensive application that should meet a specified
> response time. One of my algorithms requires me to do a memcpy() of N
> rows (one row = 2048 elements) very frequently.
Unless you're using a real time OS, you probably can't make any
assumptions about the maximal response time you will have
(think of scheduling issues & what if your process/memory got swapped
to disk, for instance)
Now, are you sure this very memcpy() is the bottleneck to be fixed ?
Maybe some code profiling will tell you where to optimize first.
> I was wondering how memcpy() is implemented internally, i.e. does it
> involve the CPU or the bus can handle transfer between different
> locations?
It depends on your C library but you can most likely rely on it
being optimized since, as someone pointed it out, it's
used everywhere.
--
Pierre
| |
| Michel Bardiaux 2007-01-16, 8:07 am |
| On 12 Jan 2007 07:50:04 -0800, "Pierre L." <lombard.pierre@gmail.com>
wrote:
>On Jan 10, 7:16 pm, "psp" <pradhan.push...@gmail.com> wrote:
>
>Unless you're using a real time OS, you probably can't make any
>assumptions about the maximal response time you will have
>(think of scheduling issues & what if your process/memory got swapped
>to disk, for instance)
>
>Now, are you sure this very memcpy() is the bottleneck to be fixed ?
>Maybe some code profiling will tell you where to optimize first.
>
>It depends on your C library but you can most likely rely on it
>being optimized since, as someone pointed it out, it's
>used everywhere.
Not entirely true. Consider Linux debian sarge. By default, it
installs a libc containing a memcpy that is indeed highly optimized
for x86 - but *not* for the SIMD instructions that almost all recent
PCs have, and that have the potential of speeding up memcpy a lot.
There are packages for SSE-optimized userland memcpy, and kernel
options to have an SSE kernelspace memcpy, but you have to install all
that yourself.
|
|
|
|
|