
Note: Updated Lurker's Guide available (but not this page!)

This page belongs to the old 1990s SGI Lurker's Guide. As of 2008, several of the Lurker's Guide pages have been updated for HDTV and for modern OS platforms like Windows and Mac. This particular page is not one of those, but you can see what new stuff is available here. Thanks!

Copying Data Quickly

By Chris Pirazzi and Micheal Minakami

Word/Doubleword Alignment Helps

Whenever you need to copy large buffers of data, keep in mind that SGI systems can copy data more efficiently if the source and destination buffers are properly aligned, since the code can use lw/ld and sw/sd instructions to copy multiple bytes at a time. For a copy from address src to address dst, the relative alignment of the pointers determines the efficiency:

   (src % 2) == (dst % 2) - probably won't help
   (src % 4) == (dst % 4) - definitely will help
   (src % 8) == (dst % 8) - often helps even more
   stricter alignment     - usually same as 8
Referring to the 4-byte alignment case on some Indigo2-class machines, Micheal Minakami tells us that "when copying between buffers with the same alignment, throughput is approximately 400% greater than when copying between buffers with mismatched alignments." He also notes that "the memory allocation routines such as malloc return double-word (64-bit aligned) buffers." Entries in VLBuffers and DMbufferpools are also always at least 8-byte aligned.
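The relative-alignment test above can be sketched in portable C. This is only an illustration: the helper names common_alignment and dst_shift_for_match are made up for this example and are not part of any SGI library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: largest width w in {1, 2, 4, 8} for which
   (src % w) == (dst % w).  Two addresses agree modulo a power of two
   exactly when their low bits are equal, so XOR the addresses and look
   at the low three bits. */
static size_t common_alignment(const void *src, const void *dst)
{
    uintptr_t diff = (uintptr_t)src ^ (uintptr_t)dst;

    if ((diff & 7) == 0) return 8;   /* ld/sd usable after a short prologue */
    if ((diff & 3) == 0) return 4;   /* lw/sw usable */
    if ((diff & 1) == 0) return 2;   /* halfwords only */
    return 1;                        /* bytes only */
}

/* Hypothetical helper: how many bytes to advance dst (0..7) so that
   (src % 8) == (dst % 8).  Only useful when you control where the
   destination data starts. */
static size_t dst_shift_for_match(const void *src, const void *dst)
{
    return (size_t)(((uintptr_t)src - (uintptr_t)dst) & 7);
}
```

If common_alignment() reports less than 4 and you own the destination buffer, over-allocating it by 8 bytes and starting the data dst_shift_for_match() bytes in is one way to get back to the fast case.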

Avoid Cache Thrashing

Another performance gotcha, cache thrashing, occurs if you copy between buffers that are too strictly aligned in virtual or physical address. To check for and fix this problem, you will need to know about the primary and secondary CPU caches on your target platforms. The MIPS processor manuals, available in bookstores and at www.mips.com, list the details (cache type, cache size, cache line type, cache line size).

SGI primary caches are virtually indexed: for a cache of cache_size bytes whose lines are each cache_line_size bytes long, a reference to virtual address V maps to cache line number (V%cache_size)/cache_line_size. Say you're copying from a buffer A to a buffer B, and both buffers are 32k-aligned. Say your computer has a cache_size==32k primary data cache, and that the cache is direct-mapped, meaning each address can live in only one cache line and each line holds data for only one address range at a time. If the inner loop of your copy looks like "B[i]=A[i]", then the math above tells us that &A[i] and &B[i] always map to the same cache line, so each access evicts the line the other buffer just loaded. Thus each reference to A[i] and B[i] in the inner loop will miss in the primary cache, severely hurting performance.

To solve this, you need to reduce the number of cache misses by reducing the number of times the same cache line gets reused for a different address. If you can control the addresses of buffers A and B, then allocating A and B such that:

  (A%cache_size)/cache_line_size != (B%cache_size)/cache_line_size
will minimize cache thrashing. If you can't control the addresses of A and B (for example, if you're copying between VLBuffers or DMbuffers), then you should try to read a whole cache line at a time from buffer A into registers, and then write a whole cache line at a time from registers into buffer B. For cache_line_size == 32 bytes, this might look something like:
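{
  __uint64_t *A, *B;   /* source and destination, 8 bytes per element */
  int i;
  ...
  /* max is the number of 64-bit elements to copy; it must be a multiple
     of 4 so that each pass moves exactly 32 bytes (4 x 8 bytes).  Each
     pass covers exactly one cache line only if A and B start on a
     32-byte boundary. */
  assert((max%4) == 0);
  for(i=0; i < max; i+=4)
    {
      __uint64_t tmp0, tmp1, tmp2, tmp3;
      /* read a whole cache line from A into registers */
      tmp0 = A[i+0];
      tmp1 = A[i+1];
      tmp2 = A[i+2];
      tmp3 = A[i+3];
      /* then write a whole cache line from registers into B */
      B[i+0] = tmp0;
      B[i+1] = tmp1;
      B[i+2] = tmp2;
      B[i+3] = tmp3;
    }
}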
You should disassemble any C code written like this to make sure the compiler did the right thing, or write assembly yourself.

If your primary cache is 2-way set-associative instead of direct-mapped, then a copy operation ("B[i]=A[i]") will not thrash in the cache, but an operation involving three pointers ("C[i]=A[i]+B[i]") will, and the solution is the same.
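If you do control where A and B live, the cache line formula above is easy to check in code. The sketch below assumes the 32k direct-mapped cache with 32-byte lines used in the example; real values come from your CPU manual. The function names are hypothetical, and staggering by half the cache size is just one arbitrary choice of offset that makes the two line numbers differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry, matching the example in the text: a 32k
   direct-mapped primary data cache with 32-byte lines. */
#define CACHE_SIZE      (32u * 1024u)
#define CACHE_LINE_SIZE 32u

/* Cache line number for virtual address v:
   (V % cache_size) / cache_line_size */
static unsigned cache_line_index(uintptr_t v)
{
    return (unsigned)((v % CACHE_SIZE) / CACHE_LINE_SIZE);
}

/* Nonzero if buffers starting at a and b begin in the same cache line,
   which means a "B[i]=A[i]" loop keeps hitting the same line for both. */
static int buffers_collide(uintptr_t a, uintptr_t b)
{
    return cache_line_index(a) == cache_line_index(b);
}

/* Hypothetical helper: number of bytes to skip at the front of an
   over-allocated buffer b so that it no longer collides with a.
   Returns 0 when the buffers are already fine. */
static size_t stagger_offset(uintptr_t a, uintptr_t b)
{
    return buffers_collide(a, b) ? CACHE_SIZE / 2 : 0;
}
```

A caller would allocate B with CACHE_SIZE/2 spare bytes and start using it at B + stagger_offset((uintptr_t)A, (uintptr_t)B).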

SGI secondary caches are indexed by physical address, and it is possible for them to thrash in the same way. Because secondary caches are so much larger, and because a given range of virtual memory (say, our buffers A and B) sometimes maps to discontiguous physical pages, secondary cache thrashing is not as common. On O2 (mvp), it happens that video buffers are made up of 64k-aligned physical memory chunks of 64k in size, and so O2 customers notice this problem more often. Since a normal user-mode app cannot control the physical address of any memory it allocates, the only workaround is the second one above: try to read a whole secondary cache line into registers, and then write from registers into a whole secondary cache line.

Kernel Data Copying

The kernel can bcopy() faster than you can from user-mode on some SGI systems, using some protected instructions. The speedup is typically around 150%. So don't be surprised if copies in the kernel (for example, buffered I/O calls as described in Software Methods for Disk I/O) exceed the performance of the best user-mode bcopy() benchmark which you can cook up.
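A minimal user-mode copy benchmark might look like the sketch below. It uses the standard C memcpy() and clock() rather than bcopy() or any SGI-specific timer, and the function names time_copy and copy_mb_per_sec are made up for this example. Note that clock() reports CPU time, not wall-clock time.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Returns the CPU seconds spent copying n bytes from src to dst,
   repeated reps times. */
static double time_copy(void *dst, const void *src, size_t n, int reps)
{
    /* Calling through a volatile function pointer keeps the compiler
       from optimizing the repeated copies away. */
    void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;
    clock_t start, end;
    int r;

    start = clock();
    for (r = 0; r < reps; r++)
        copy_fn(dst, src, n);
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Throughput in megabytes (10^6 bytes) per second, or 0 if the run was
   too short for clock() to measure. */
static double copy_mb_per_sec(size_t n, int reps, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return ((double)n * reps) / (seconds * 1e6);
}
```

Use buffers much larger than your secondary cache and enough repetitions to get a measurable time; otherwise you are benchmarking the cache rather than memory.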

Copyright: All text and images copyright 1999-2008 Chris Pirazzi unless otherwise indicated.