R@M3$H.NBlog

Memcpy

05 November, 2013 - 13 min read

  1. Introduction

This article describes a fast and portable memcpy implementation that can replace the standard library version of memcpy when higher performance is needed. This implementation has been used successfully in several project where performance needed a boost, including the iPod Linux port, the xHarbour Compiler, the pymat python-Matlab interface, the Inspire IRCd client, and various PSP games.
The story began when a co-worker of mine made an implementation of memcpy that he was very proud of. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation, which was quite a lot faster than my co-vorker's and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike and finally we had an implementation that was very fast and that beats both the native library routines in Windows and Linux, especially when the memory to be copied is not aligned on a 32 bit boundary.
The following paragraphs contain descriptions to some of the techniques used in the final implementation.

  1. Mimic the CPU's Architecture

One of the biggest advantages in the original implementation was that the C code was made to imitate the instruction sets on the target processor. My co-work discovered that different processors had different instructions for handling memory pointers. On a Motorola 68K processor the code

*dst8++ = *src8++;

that copies one byte from the address src8 to the address dst8 and increases both pointers, compiled into a single instruction:

MOV.B (A0)+, (A2)+

This piece of code can be put into a while loop and will copy memory from the address src to the address dest:

   1: void* memcpy(void* dest, const void* src, size_t count) {

   2: * dst8 = (char*)dest;

   3: * src8 = (char*)src;

   4: e (count--) {

   5:         *dst8++ = *src8++;

   6:     }

   7: rn dest;

   8: }

While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC that does not have any instructions for post incrementing pointers. The PowerPC uses four instructions for the same task that only required one instruction on the Motorola processor. However, the PowerPC has a set of instructions to load and store data that utilize pre increment of pointers which means that the following code only results in two instructions when compiled on the PowerPC:

   1: *++dst8++ = *++src8;

In order to use this construction, the pointers have to be decreased before the loop begins and the final code becomes:

   1: void* memcpy(void* dest, const void* src, size_t count) {

   2: * dst8 = (char*)dest;

   3: * src8 = (char*)src;

   4:     --src8;

   5:     --dst8;

   6: e (count--) {

   7:         *++dst8 = *++src8;

   8:     }

   9: rn dest;

  10: }

Unfortunately some processors have no instructions for either pre increment or post increment of pointers. This means that the pointer needs to be incremented manually. If the example above is compiled to an processor without pre and post increment instructions, the while loop would actually look something like:

   1: while (count--) {

   2:     dst8[0] = src8[0];

   3:     ++dst8;

   4:     ++src8;

   5: }

There are luckily other ways of accessing memory. Some processors can read and write to memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:

   1: void *memcpy(void* dest, const void* src, size_t count) {

   2: * dst8 = (char*)dest;

   3: * src8 = (char*)src;

   4: count & 1) {

   5:         dst8[0] = src8[0];

   6:         dst8 += 1;

   7:         src8 += 1;

   8:     }

   9:     count /= 2;

  10: e (count--) {

  11:         dst8[0] = src8[0];

  12:         dst8[1] = src8[1];

  13:         dst8 += 2;

  14:         src8 += 2;

  15:     }

  16: rn dest;

  17: }

Here the number of turns the loop has to be executed is half of what it was in the earlier examples and the pointers are only updated half as often.

  1. Optimizing memory accesses

In most systems, the CPU clock runs at much higher frequency than the speed of the memory bus. My first improvement to my co-worker's original was to read 32 bits at the time from the memory. It is of course possible to read larger chunks of data on some targets with wider data bus and wider data registers. The goal with the C implementation of memcpy() was to get portable code mainly for embedded systems. On such systems it is often expensive to use data types like double and some systems doesn't have a FPU (Floating Point Unit).
By trying to read and write memory in 32 bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.
It is however quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporary while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment:

   1: srcWord = *src32++;

   2: e (len--) {

   3:     dstWord = srcWord << 8;

   4:     srcWord = *src32++;

   5:     dstWord |= srcWord >> 24;

   6:     *dst32++ = dstWord;

   7: }

  1. Optimizing branches

Another improvement is to make it easier for the compiler to generate code that utilizes the processors compare instructions in an efficient way. This means creating loops that terminates when a compare value is zero. The loop

   1: while (++i > count)

often generates more complicated and inefficient code than

   1: while (count--)

Another thing that makes the code more efficient is when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression

   1: while (count -= 2)

 

  1. Conclusion

The techniques described here makes the C implementation of memcpy() a lot faster and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler supports 64-bit arithmetic operations such as the shift operator, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC but I didn't find one. If 64-bit operations can be made in one instruction, the implementation will be faster than the native Solaris memcpy() which is probably written in assembly.
The version available for download in the end of the article, extends the algorithm to work on 64-bit architectures. For these architectures, a 64-bit data type is used instead of a 32-bit type and all the methods described in the article is applied to the bigger data type.

  1. The complete source code

The complete source code for a memcpy() implementation using all the tricks described in the article can be downloaded from

http://www.vik.cc/daniel/memcpy.zip

The code is configured for an intel x86 target but it is easy to change configuration as desired.

However the interviewers interrupted noting that the man page for memcpy says it "copies n bytes from src to dest"  and then wanted me to iterate instead by size/4 and then pick up the remaining with another loop of index < size%4 (I guess assuming it was a 32 bit system?)

They were asking you to optimize your implementation and have it copy a 32-bit word at a time inside the loop vs. a byte at a time. This would necessitate some careful checking to handle the boun

As others have said, copying 4 bytes at a time is faster than copying 1 byte at a time. The interviewer wanted you to do something like this:

   1: void memcpy(void* dest, void* src, int size)

   2: {

   3:     uint8_t *pdest = (uint8_t*) dest;

   4:     uint8_t *psrc = (uint8_t*) src;

   5:

   6:     int loops = (size / sizeof(uint32_t));

   7:     for(int index = 0; index < loops; ++index)

   8:     {

   9:         *((uint32_t*)pdest) = *((uint32_t*)psrc);

  10:         pdest += sizeof(uint32_t);

  11:         psrc += sizeof(uint32_t);

  12:     }

  13:

  14:     loops = (size % sizeof(uint32_t));

  15:     for (int index = 0; index < loops; ++index)

  16:     {

  17:         *pdest = *psrc;

  18:         ++pdest;

  19:         ++psrc;

  20:     }

  21: }

Thank you for the demonstration. I understand now. I have come to terms with the fact that I misunderstood how memory is laid out. I thought that each individual address was capable of storing an entire word, not just a byte (for instance address 0x12345678 could hold a value of 0xffffffff and then 0x12345679 could hold a completely different value 0x00000000). Now I realize that 0x12345678 holds only one byte of a word and the next word will start at 0x12345678 + 4. This can easily be seen in gdb with the x command

The logic for your memcpy is correct and your interviewer didn't ask you to change it or add a restriction. Copying 4 bytes at a time is faster, but becomes a problem if your size is not a multiple of 4. Hence your interviewer told you to use two loops: the first copies 4 bytes at a time, and the 2nd loop one byte at a time (it will iterate at most 3 times).

So the bulk of the copy is done with the fast 4-byte copy, but you're not restricted to size being a multiple of 4 because the 2nd "cleanup" loop will copy anything that's not a multiple of 4.

1st loop: copy uint32_t and increment by 4
2nd loop: copy uint8_t and increment by 1

The interviewer was testing your knowledge of computer architecture, and wanted you to optimize your algorithm. Memory operates on words rather than bytes. In a 32-bit system, a word is typically 4 bytes, it takes the same amount of time to read/write 1 byte as it does to read/write 1 word. The second loop is to handle the case where the number of bytes you want to copy is not exactly divisible by 4 bytes.

What you actually want is 3 loops. 2 loops for the bytes after dest and before dest+size that when either is not word aligned. Then another loop for all the words in between.

You can actually optimize even more by taking advantage of architecture specific instructions. Check out this article if you are interested: http://www.eetimes.com/design/embedded/4024961/Optimizing-Memcpy-improves-speed