

|
End-user products => VAST/AltiVec => AltiVec FAQ
| ||||||||||||||||||||||||||||||||||||||||||||
|
VAST/AltiVec FAQ: Vectorization for AltiVec | ||||||||||||||||||||||||||||||||||||||||||||
|
Frequently Asked Questions about using VAST/AltiVec to improve program performance.
Last updated March 2003.
What is AltiVec? What is Vector SIMD? What is MIMD parallelism? What is the Velocity Engine? What are AltiVec’s limitations? What are AltiVec’s strengths? How can I get my code running on AltiVec? What are the AltiVec C Extensions? Why use automatic vectorization? What is VAST? How does VAST use AltiVec Extensions? What else can VAST do? What coding choices impact performance? What performance can I expect? How do I get started? Where do I get a vectorizer? What if I want Fortran? How do I use VAST? What other tools are useful? SIMG4 How is a loop vectorized? How are scalars dealt with in vectorization? What are reduction operations? What are data dependencies? What is indirect addressing? What about pointers in C? What about conditional statements in loops? What data element size should I use? How should I set up my arrays? How should I structure my data? How should I structure my loops? What if I don’t have any loops? What about function calls? How can I use cache more efficiently? What about data dependencies? How can I get vectorization messages? What about pointers in C? How can I control what loops are vectorized? How can I tell VAST that my arrays are aligned? What about potential dependencies that are OK? What other tools can help me? What documents should I look at? AltiVec Unit | ||||||||||||||||||||||||||||||||||||||||||||
| Why AltiVec? | ||||||||||||||||||||||||||||||||||||||||||||
|
To provide you with an understanding of the capabilities of the AltiVec unit and how automatic vectorization with VAST can take advantage of them, and to provide tips and techniques to help you get programs running at peak speed on the AltiVec. The AltiVec unit is a Motorola extension to the PowerPC architecture that allows parallel operations on data elements. It provides 32 vector registers that hold each hold 16 bytes of information. These 16 byte registers can then be operated on in parallel in the AltiVec unit. So for example, you can add eight 16-bit integers to eight other 16 bit integers to produce eight 16-bit results, all with one vector add instruction. The AltiVec unit supports 8 bit, 16 bit, and 32 bit integer operations, as well as 32 bit floating point operations. The AltiVec unit is classified as a vector SIMD unit, since it operates on multiple elements (in a vector register) with one instruction (SIMD = single instruction, multiple data). For example, array A is a 16 bit integer with 8 elements:
And array B has these 8 elements:
If we want to add these arrays together, then if they are each fetched into a vector register, one vector SIMD instruction can provide the result:
Many modern microprocessors are including vector SIMD capabilities, so restructuring a program to make it more vectorizable may be useful for lots of target systems. Having multiple independent processors working together on a problem is called MIMD parallelism (multiple instructions, multiple data) as each processor would be executing its own instruction stream. If you have multiple AltiVec-enhanced processors in a shared memory system, you can have MIMD parallelism at a higher level and vector SIMD parallelism at a lower level. What is the "Velocity Engine"? Velocity Engine is Apple’s name for the AltiVec unit. There is no difference other than the name change. What are AltiVec’s limitations? These are the main limitations in what the AltiVec unit can support:
These are some of the AltiVec unit’s main strengths:
How can I get my code running on AltiVec? Here are four ways to get AltiVec code:
While this tutorial was designed primarily to help people use the automatic vectorization features of the VAST vectorizers efficiently, it may also be useful for people who are starting to write explicit vector code by hand. The concepts and techniques discussed here a natural first step to creating AltiVec code by whatever means. What are the AltiVec C Extensions? The AltiVec C Extensions are a set of new data types and functions defined by Motorola to allow access to the AltiVec unit from C. Most of the vector functions map directly to vector instructions available in the AltiVec unit and are compiled inline by the AltiVec-enhanced C compiler. With these extensions, you can get performance at the C level that is comparable to assembly language. Why use automatic vectorization? By coding in vectorizable loops instead of writing in explicit AltiVec instructions, you preserve portability of your code to other processors. Also, your time can be concentrated on higher-level aspects of the application rather than on learning low-level details of the AltiVec unit. Automatic vectorization can provide performance very close to hand-coded vectorization, with much less effort. VAST is a high-level optimizer that automatically optimizes C/c++ and Fortran programs. For AltiVec-enhanced processors, it can automatically transform loops into AltiVec instructions. VAST is normally used with a compiler driver that combines vectorization with the rest of the compilation process. For more information on VAST, see www.crescentbaysoftware.com. How does VAST use AltiVec Extensions? VAST vectorizes C code by replacing loops with AltiVec extensions. There are no comparable extensions for Fortran, so VAST vectorizes Fortran code by cutting out vectorizable sections into C routines that contain AltiVec extensions. Versions of VAST can also optimize scalar loops and automatically parallelize loops for shared memory systems (giving a portion of a loop’s iterations to each parallel processor). For example, VAST can vectorize an inner loop for AltiVec while optimizing an outer loop What coding choices impact performance? Performance is affected by:
What performance can I expect? Performance of vectorized code for the AltiVec can vary greatly depending on the factors above. Most applications will require some tuning on the part of the programmer to get best AltiVec results. For whole vectorizable applications, a good performance target is to aim for half of the theoretical maximum performance for the predominant data type. For example, the AltiVec unit can process four floats at one time, so typically the maximum speedup one would expect would be a factor of four. A reasonable goal for a whole floating point application would then be to make it run twice as fast. On very large applications that are not completely vectorizable, a speedup of something like 25-30% would be a reasonable goal. To get your program using the AltiVec unit, you need to:
Now you can execute the program and compare timings with the non-vectorized version. If you are not getting the speedup you need, then you should look at techniques for improving performance (see later section). | ||||||||||||||||||||||||||||||||||||||||||||
| What AltiVec Tools are Available? | ||||||||||||||||||||||||||||||||||||||||||||
|
Where do I get an AltiVec-enhanced C compiler? To use the AltiVec unit, you need an AltiVec-enhanced compiler. These compilers have been modified to accept the Motorola AltiVec C Extensions: GCC (open source) MetaWare Metrowerks Wind River Diab Automatic vectorization is available with the VAST tools: www.crescentbaysoftware.com/vast_altivec.html VAST provides state-of-the-art automatic vectorization and works with all of the compilers listed above. VAST-F/AltiVec automatically vectorizes Fortran by extracting the vectorizable loops and using C with AltiVec vector extensions as an intermediate representation. VAST-F/AltiVec will work with any of the C compilers listed, in combination with g77 (open source), Absoft Fortran (www.absoft.com) or NAG Fortran. VAST-F will vectorize array syntax as well as do loops. VAST is normally invoked by the VAST driver, which looks just like the scalar compiler driver. For example, if you are compiling your C program with this command: gcc myfile.c To get automatic vectorization you would use this command instead: vcc myfile.c Now, the automatic vectorization is done along with all the other normal compilation functions. If you have vectorizable loops, the "a.out" that results will use the AltiVec unit when you execute it. Compiler flags are passed as you normally would. Flags to VAST itself are passed with the meta-flag –Wv, as in: vcc –Wv,-Valigned myfile.c. For Fortran automatic vectorization, you similarly use the v77 driver instead of g77/f77 or v90 driver instead of f90. DEEP is a program analysis and debugging environment that can quickly pinpoint performance bottlenecks to the loop level in your source code. DEEP works with VAST to provide both compile-time automatic vectorization information and run-time measured performance in graphical displays that map to the source code. More information on DEEP can be found at www.crescentbaysoftware.com/deep.html. SIMG4 is the simulator for the G4 chip, which does a great job of visualizing the complex tradeoffs in the highly parallel execution of instructions in the G4. If you really want to see what is happening to an important kernel of you program, SIMG4 will show you where every clock cycle is going. Caution: you do need a good understanding of the G4 architecture to comprehend all the information that SIMG4 provides. SIMG4 is available through Motorola or Apple, but requires a license. | ||||||||||||||||||||||||||||||||||||||||||||
| How Does Vectorization Work? | ||||||||||||||||||||||||||||||||||||||||||||
|
What kind of loops will vectorize? Loops should have a fixed iteration count (i.e. determinable before the start of the loop), for example:
In this loop, the iteration count is "n" and this is not changed through the course of the loop.
Below is a loop that has no fixed iteration count -- it will not be vectorized automatically:
Here is a loop that has an exit from the loop -- it will not be vectorized automatically:
The first step in vectorization is categorizing all references in a loop as one of three things: scalar, index, or vector:
Here is an example of each of these:
Classification of variables in the loop: i,j : index variables a,b : vectors x : scalar Once variables have been classified, we need to make sure that vectorization of the loop will still provide the same answers as the scalar version. If there are data dependencies in the loop, where values for one iteration are required to calculate a future iteration, then the loop cannot be vectorized directly. How are scalars dealt with in vectorization? If a scalar variable is only used (not set) in a loop, then for AltiVec the scalar will be "splatted" into a vector register (replicated in each postion in the vector register) and the result used in the vector calculation. Scalars that are set and then used in a loop are "promoted" to vectors. These scalars are generally holding temporary values in a loop that we now need to temporary vector values. In the example, below, x is a "used" scalar and y is a "promoted" scalar:
Scalar that are used before they are set in a loop are called "carry-around" scalars. They are a problem for vectorization since the value computed in one pass of the loop is used in the next pass of the loop (the value is "carried" from one pass to the next). In the example below, x is a "carry-around" scalar:
What are reduction operations? A special (and very important) case of scalar usage in a loop are reduction operations. These involve the reduction of some aspect of a vector of values down to a scalar result. The most common reduction is the summation of all elements of a vector. Other reductions include: dot product of two vectors, maximum value in a vector, minimum value in a vector, product of all vector elements, and location of maximum or minimum element in a vector. Here is an example of a dot product reduction (x is a "reduction scalar"):
Reduction operations are very important to vectorize since they occur so frequently. They are generally vectorized for you by creating a vector of "partial reductions" that is then reduced into the final resulting scalar. Vectorization must be done in a way that ensures that the optimized code gives the same answers as the original. For certain loops, parallel or vector execution would result (or could result) in incorrect answers. A loop that has results from one loop pass feeding back into a future pass of the same loop is said to have a data dependency conflict and may not be completely optimized. (Such a loop is also said to be recursive or to contain recurrences.) In these cases, VAST detects the problem, reports it to the user, and leaves the loop in its original form. To detect data dependencies involving arrays and/or pointers requires extensive analysis of the arrays used in each loop nest, and examination the offset and stride of accesses to elements along each dimension of arrays that are both used and stored in a loop. If the uses and stores overlap on different iterations of a loop, or if they could possibly overlap, then there is a data dependency problem. In this case the order of execution of the iterations could change the results, and the order of execution of iterations where a loop is vectorized is different from the original, so the loop cannot be safely vectorized. Avoiding dependencies is important in getting the highest performance. VAST provides directives, pragmas, and switches to help you do this. In the loop shown below, the reference to a[i-2] at the top of the loop conflicts with the store into a[i] at the bottom. The loop would give different answered if vectorized, so it is left alone.
Information from other array subscripts can be used as part of the analysis of dependencies. The loop in the following example is vectorized because the non-vector subscripts of the references to array a can never be equal (since n is not equal to n+1) and thus there is no recursion. (The references to a are to two totally different pieces of the array - they do not share data).
Indirect addressing occurs when an array is accessed by a vector of values. If the array is being fetched from memory, the operation is called a "gather", if the array is being stored into memory, the operation is called a "scatter". In the example below, a is being scattered, and b is being gathered:
Indirect addressing is not acceptable for the AltiVec unit, since it can only deal with vectors that have unit stride (vectors that are stored in consecutive order in memory). If you have some indirect addressing in your loops, and you have significant calculations in the loops, you may want to consider cutting out the indirect addressing into a separate (non-vector) loop so that the rest of the loop can be vectorized. For vectorization purposes, it is generally better to use arrays rather than pointers. The use of pointers makes it especially difficult to determine whether or not data dependencies exist, as pointers can be unrestricted in what piece of memory they are accessing. For the most part, pointers are equivalent to array indexing and, viewed as such, the existence of overlap can be determined. VAST is able to vectorize loops containing pointers if it can determine that the loop is safe. Both array references and pointer references in loops are analyzed to see if they represent a vector access (loop-invariant offset and stride) to memory. In some cases, VAST will create a run-time test, and execute a vector version or scalar version of the loop depending on the result of the test. Often function arguments are passed as pointers. These pointers could point at overlapping sections of memory. While this is rarely the case (different pointers usually point at disjoint regions of memory), lacking any further information VAST-C must take the safe road and avoid optimizing loops which involve these pointers appearing on both the left and right sides of an a ssignment operator. For example, consider this function:
func( int *pa, int *pb, int x )
{
for ( i = 0; i < 100; i++ )
*(pa+i) = *(pb+i) + x;
}
If, in the above example, pa and pb overlap in memory in a way that causes results from one loop pass to feedback to a subsequent loop pass, then vectorization of the loop could result in incorrect results. For example, if the function was called with these arguments, vectorization would be problematic: int *a; func( a, a-1 ); In actual practice, it is very rare for data dependence to exist due to function arguments; programs that pass overlapping pointers are very hard to understand and debug, apart from any vectorization concerns. For this reason, the assertion level switch is provided to inform VAST that such arguments never overlap in your program. If you specify an assertion level of 1 or higher (-L1), the loop in func could be optimized. What about conditional statements in loops? For efficient vectorization, loops should contain mostly assignment statements and be limited on their use of "if" statements. Conditional operations are vectorized by computing all pathways in vector mode and merging the results. If there is lots of computation being performed conditionally, then significant time can be wasted.
| ||||||||||||||||||||||||||||||||||||||||||||
| How Can I Improve Performance? | ||||||||||||||||||||||||||||||||||||||||||||
|
In attempting to improve the performance of a program, you should always start with a measurement of where the time is spent in the program. You need to find the performance bottlenecks – you don’t want to waste your time optimizing sections of code that don’t take much time. There are various tools that can help you do this. The "gprof" profiling tools available on most Unix-based systems give you function-level data on where the time is spent. Crescent Bay Software’s DEEP tool quickly points out performance bottlenecks in the source code down to the loop levels. What data element size should I use? If you have a choice, use the smallest data types you can. For example, long integer operations (32 bit) take twice as long in vector mode as short integer operations (16 bit). Also, single precision floating point (32 bit) operations can be vectorized efficiently but double precision (64 bit) floating point operations cannot. How should I set up my arrays? To vectorize efficiently, your program should be operating on arrays of data in reasonably straightforward loops. Programs which chase pointers in complicated data structures at the lowest level are generally difficult to vectorize. Most importantly, only array uses where consecutive array elements are accessed (unit stride) can make efficient use of the AltiVec unit. This means, for example, that on multiple-dimension arrays you want to have the rightmost subscript (in C, leftmost in Fortran) vectorized. In the example below, the arrays are accessed with unit stride.
How should I structure my data? Organize your data as a structure of arrays rather than an array of structures where possible. In the original version below, there is an array of structures s of which we are using two items (a and e). In the restructured version, there is one structure s2 where we are using two stride-one arrays.
How should I structure my loops? Here are some of the most valuable tips:
What if I don’t have any loops? If you don’t have any loops, then make them! If your application uses any time, then there are repeated operations going on somewhere, and this generally means that a loop can be used. Sometimes, programmers have written the calculations in an unrolled form, and you can "reroll" the statements back into a loop that can then be vectorized. For example:
If vectorizing the operations on a single dataset is difficult, consider processing multiple data sets simultaneously, to allow vectorization across the data sets. If you have a function with no loops that uses a lot of time, then you can either inline the function into the loop it is called from or make a "vector version" of the function that operates on several elements at one time. (Push the loop into the function, or pull the function into the loop.) (In some caes, VAST can pull the function into the loop automatically.) For example:
How can I use cache more efficiently? Efficient use of the levels of cache is extremely important to performance. Accessing data that is in main memory is over ten times slower than accessing data that is in the level 1 cache. To use cache well, you need to use all of the data that is fetched in a cache line (as opposed to using only a few data items from each cache line), and reuse the data in the cache line several times before it is kicked out of cache (as opposed to using it once). Of course, to get good performance from the AltiVec unit, you need to have stride one arrays anyway, so the needs of the cache and of the AltiVec coincide. Make sure that multiply-dimensioned arrays are accessed along the dimension that is contiguous in memory. (Rightmost subscript for C, leftmost subscript for Fortran). If you have nested loops, the innermost loop should be traversing the most contiguous subscript. Accessing structure components in arrays of structures will inhibit vectorization since this results in a stride greater than one. (An exception is an array with two elements of equal size, e.g. "complex" data). Better is to organize the data as a structure of arrays. If you have simple loops traversing arrays with small data types and these loops are still not being vectorized, then you are probably encountering data dependency problems, where the vectorizer cannot vectorize because there may be overlap among the arrays. Usually, there is no actual overlap, and there are many ways you can inform the vectorizer that your loops are safe to vectorize (assertion levels, disjoint and nodepchk pragmas). See the next section for information on these switches and pragmas. | ||||||||||||||||||||||||||||||||||||||||||||
| How Can I Use VAST Effectively? | ||||||||||||||||||||||||||||||||||||||||||||
|
How can I tell what is being vectorized? By default, VAST generates a message detailing how many loops have been vectorized for each routine VAST processes. The best way to get more detail is to have VAST generate a listing file, using -Wv,-lmylisting.lst. You can also request to see vectorization messages. How can I get vectorization messages? Vectorization messages are available with the "-Wv,-Vmessages" switch to the compiler driver. These messages are sent to the standard output, and tell you about vectorization inhibitors that may be preventing a loop from vectorizing. For example, if a loop contains a function call, there will be a message pointing out the call and noting that this prevents vectorization. If there is a data dependency between two references to an array in a loop, there will be a vectorization message generated that points out the problem. The easiest way to get code that depends on pointers to vectorize is to specify an assertion level. These are the levels that VAST provides: Level 0: Assume nothing. This is the default. At this level VAST-C must assume that any pointer may overlap with any other variable, and that arguments to a function may also overlap. This assumption hinders the ability of VAST-C to determine whether or not a data dependency exists in the presence of pointer variables and function arguments. If a data dependency may exist, VAST-C cannot vectorize the loop.
func ( a, b ) int *a, *b; { for ( i = 0; i < n; i++ ) *a++ = *b++; } Above, VAST-C must assume that pointers a and b may overlap in memory. Level 1.-- Assume that any pointer may overlap with another pointer or nonpointer, except that arguments to a function will not overlap with each other. VAST-C will assume, in the above example, that a and b do not overlap in memory. Level 2. -- In addition to level 0 and 1 assumptions, also assume that pointers will not overlap with non-pointers. for ( i = 0; i < n; i++ ) *p++ = a[i]; Above, VAST-C can assume that pointer p does not overlap with array a. Level 3. -- In addition to level 0, 1, and 2 assumptions, assume that pointers with different names do not overlap. for ( i = 0; i < n; i++ ) *p++ = *q++; Above, VAST-C may assume that pointer p does not overlap with pointer q. Level 4. -- In addition to level 0, 1, 2 and 3 assumptions, assume that no overlap exists unless it is explicit. For example, *p may not overlap with *(p+n). In addition to the assertion levels, you can use the disjoint pragma to inform VAST-C that certain pointers are disjoint, that is, the memory ranges they point to do not overlap. Syntax: #pragma vd_ disjoint [(pa,pb,...,pn)] disjoint states that the pointers listed do not overlap with each other or any other variables. If no list of pointers is given, all pointers are assumed to be disjoint.
#pragma vd_ disjoint ( pa, pb ) for ( i = 0; i < n; i++ ) (Vectorized.)*pa++ = *pb++; Since pa and pb are pointers and may overlap, it would be unsafe to optimize the above loop without the disjoint pragma. The nodepchk pragma may also be used in this case. How can I control what loops are vectorized? By default, VAST tries to vectorize all the loops it can. You may already know what loop you want vectorized, and want to control which loops VAST will optimize. This can be done with the –Vnovector switch and the novector pragma/directives. NOVECTOR disables conversion of loops to vector form. VECTOR serves only to toggle back from NOVECTOR; it does not force conversion. The -Vnovector switch is equivalent to NOVECTOR with file scope.If you want to explicitly pick your vector loops, you can set –Vnovector on the command line and then have a few loops for which you selectively use the vector pragma/directive in front of the loop in the code to turn on vectorization. How can I tell VAST that my arrays are aligned? You can use the -Valigned switch or the aligned directive/pragma. These assert that the start of the arrays is aligned on 16-byte boundary. This allows VAST/AltiVec to generate code that directly fetches them into vector registers, rather than generating the default code to deal with non-aligned arrays by shifting the values. Performance can improve by up to a factor of two, if you have arranged to have all of your arrays aligned. The average performance gain is about 20% with this option, though it can vary from 0-40%. Only use ALIGNED if you know that the arrays are aligned correctly, otherwise you can get wrong answers. What about potential dependencies that are OK? The NODEPCHK (no dependency check) pragma and directive asserts that no vector overlap exists in the loop, and thus no synchronization is needed. This is the pragma to use for potentially vectorizable loops that have hidden relationships between variables. This capability should be used only when you know that no real overlap exists. When it detects potential feedback, VAST-C issues a message (which you can get with the –Vmessages switch) asking you to apply this pragma if the loop is in fact not recursive. Let's look at an example of a situation where you may want to disable data dependency checking. If you know that the variable k has a value greater than 999, then there is no overlap between the references to array a in the loop below, and you can use nodepchk to tell VAST that the loop can be optimized.
DEEP provides all of VAST’s vectorization messages, and clicking on the message takes you right to the affected source code line in your program in the DEEP source code browser. DEEP also has a display of the whole program that shows at a glance what loops VAST was able to vectorize. | ||||||||||||||||||||||||||||||||||||||||||||
| Where can I get more information? | ||||||||||||||||||||||||||||||||||||||||||||
|
What web sites should I visit? For more information on AltiVec, visit these resources on the web: General AltiVec information source: Crescent Bay Software site for VAST/AltiVec automatic vectorization and DEEP: General Motorola site for AltiVec: Apple’s Velocity Engine site: What documents should I look at? Crescent Bay Software documents on VAST: (These documents are available on Crescent Bay Software's web site: www.crescentbaysoftware.com/docs) VAST-C/AltiVec User’s Guide VAST-F/AltiVec User’s Guide Apple Velocity Engine FAQ ( www.apple.com/scitech/physicalscience/VE041101.pdf) Motorola technical documents on AltiVec: (These documents are available online at the web site: www.motorola.com) AltiVec Technology Programming Interface Manual AltiVec Technology Programming Environments Manual |
||||||||||||||||||||||||||||||||||||||||||||
Contact
Legal