SIMD vectorize atan2 using ARM NEON assembly

SIMD vectorize atan2 using ARM NEON assembly

Content Index :

SIMD vectorize atan2 using ARM NEON assembly
Tag : assembly , By : Sharad
Date : November 25 2020, 07:27 PM

No Comments Right Now !

Boards Message :
You Must Login Or Sign Up to Add Your Comments .

Share : facebook icon twitter icon

How to vectorize inner loops with omp simd

Tag : fortran , By : anon
Date : March 29 2020, 07:55 AM
To fix the issue you can do There is no more place for this in the comments:
I get this when I compile it at an Ivy Bridge CPU. The loop on line 15 is not profitable to be vectorized on the CPU, but notice it IS VECTORIZED for the Intel MIC architecture. The loop 16 is vectorized on the CPU also with the target directives removed.
ifort -openmp simd.f90 -warn -O3 -c -vec-report=3 -xHOST -fpp 
ifort: command line remark #10382: option '-xHOST' setting '-xCORE-AVX-I'
simd.f90(17): (col. 33) remark: loop was not vectorized: subscript too complex
simd.f90(15): (col. 5) warning #13379: loop was not vectorized with "simd"
simd.f90(16): (col. 8) remark: LOOP WAS VECTORIZED
simd.f90(13): (col. 3) remark: loop was not vectorized: not inner loop
simd.f90(13): (col. 3) remark: loop was not vectorized: not inner loop
simd.f90(31): (col. 4) remark: LOOP WAS VECTORIZED
simd.f90(30): (col. 3) remark: loop was not vectorized: not inner loop
simd.f90(29): (col. 7) remark: loop was not vectorized: not inner loop
simd.f90(29): (col. 7) remark: BLOCK WAS VECTORIZED
ifort: warning #10362: Environment configuration problem encountered.  Please check for proper MPSS installation and environment setup.
simd.f90(15): (col. 5) remark: *MIC* OpenMP SIMD LOOP WAS VECTORIZED
simd.f90(13): (col. 3) remark: *MIC* loop was not vectorized: not inner loop
simd.f90(13): (col. 3) remark: *MIC* loop was not vectorized: not inner loop
simd.f90(31): (col. 4) remark: *MIC* LOOP WAS VECTORIZED
simd.f90(31): (col. 4) remark: *MIC* PEEL LOOP WAS VECTORIZED
simd.f90(31): (col. 4) remark: *MIC* REMAINDER LOOP WAS VECTORIZED
simd.f90(30): (col. 3) remark: *MIC* loop was not vectorized: not inner loop
simd.f90(29): (col. 7) remark: *MIC* loop was not vectorized: not inner loop

Aggregate sum for set bits in NEON SIMD

Tag : development , By : Ari
Date : March 29 2020, 07:55 AM
I wish did fix the issue. You can do table lookups in NEON using the VTBL and VTBX instructions, but they are only useful for look up tables with few entries. When optimising for NEON it is often best to look for a way to calculate values at run time instead of using a table.
In this example it is straightforward to calculate the lookup at run time. The function is essentially
int lookup(int val, int bit) { return (val & (1<<bit) >> bit); }
#include <arm_neon.h>

void f(uint32_t *output, const uint8_t *input, int length)

    static const uint8_t mask_vals[] = {  0x1,  0x2,  0x4,  0x8,
                                         0x10, 0x20, 0x40, 0x80 };
    /* NEON shifts are left shifts, and we want a right shift,
       so use negative numbers here */
    static const int8_t shift_vals[] = { 0, -1, -2, -3, -4, -5, -6, -7 };

    /* constants we need in the main loop */
    uint8x8_t mask    = vld1_u8(mask_vals);
    int8x8_t shift    = vld1_s8(shift_vals);

    /* accumulators for results, bits 0-3 in cumul1, bits 4-7 in cumul2 */
    uint32x4_t cumul1 = vdupq_n_u32(0);
    uint32x4_t cumul2 = vdupq_n_u32(0);

    for (int i = 0; i < length; i++)
        uint8x8_t v = vld1_dup_u8(input+i);
        /* this gives 0 or 1 in each lane, depending on whether the
           appropriate bit is set */
        uint8x8_t incr = vshl_u8(vand_u8(v, mask), shift);

        /* widen to 16 bits */
        uint16x8_t incr_w = vmovl_u8(incr);

        /* increment the accumulators */
        cumul1 = vaddw_u16(cumul1, vget_low_u16(incr_w));
        cumul2 = vaddw_u16(cumul2, vget_high_u16(incr_w));
        /* store the accumulator values */
        vst1q_u32(output + i*8, cumul1);
        vst1q_u32(output + i*8 + 4, cumul2);

atan2 approximation with 11bits in mantissa on x86(with SSE2) and ARM(with vfpv4 NEON)

Tag : development , By : kbrust
Date : March 29 2020, 07:55 AM
To fix this issue The first thing you would want to check is whether your compiler is able to vectorize atan2f (y,x) when applied to an array of floats. This usually requires at least a high optimization level such as -O3 and possibly specifying a relaxed "fast math" mode, in which errno handling, denormals and special inputs such as infinities and NaNs are largely ignored. With this approach, accuracy may be well in excess of what is required, but it may be hard to beat a carefully tuned library implementation with respect to performance.
The next thing to try is to write a simple scalar implementation with sufficient accuracy, and have the compiler vectorize it. Typically this means avoiding anything but very simple branches which can be converted to branchless code through if-conversion. An example of such code is the fast_atan2f() shown below. With the Intel compiler, invoked as icl /O3 /fp:precise /Qvec_report=2 my_atan2f.c, this is vectorized successfully: my_atan2f.c(67): (col. 9) remark: LOOP WAS VECTORIZED. Double checking the generated code through disassembly shows that fast_atan2f() has been inlined and vectorized using SSE instructions of the *ps flavor.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* maximum relative error about 3.6e-5 */
float fast_atan2f (float y, float x)
    float a, r, s, t, c, q, ax, ay, mx, mn;
    ax = fabsf (x);
    ay = fabsf (y);
    mx = fmaxf (ay, ax);
    mn = fminf (ay, ax);
    a = mn / mx;
    /* Minimax polynomial approximation to atan(a) on [0,1] */
    s = a * a;
    c = s * a;
    q = s * s;
    r =  0.024840285f * q + 0.18681418f;
    t = -0.094097948f * q - 0.33213072f;
    r = r * s + t;
    r = r * c + a;
    /* Map to full circle */
    if (ay > ax) r = 1.57079637f - r;
    if (x <   0) r = 3.14159274f - r;
    if (y <   0) r = -r;
    return r;

/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

float rand_float(void)
    volatile union {
        float f;
        unsigned int i;
    } cvt;
    do {
        cvt.i = KISS;
    } while (isnan(cvt.f) || isinf (cvt.f) || (fabsf (cvt.f) < powf (2.0f, -126)));
    return cvt.f;

int main (void)
    const int N = 10000;
    const int M = 100000;
    float ref, relerr, maxrelerr = 0.0f;
    float arga[N], argb[N], res[N];
    int i, j;

    printf ("testing atan2() with %d test vectors\n", N*M);

    for (j = 0; j < M; j++) {
        for (i = 0; i < N; i++) {
            arga[i] = rand_float();
            argb[i] = rand_float();

        // This loop should be vectorized
        for (i = 0; i < N; i++) {
            res[i] = fast_atan2f (argb[i], arga[i]);

        for (i = 0; i < N; i++) {
            ref = (float) atan2 ((double)argb[i], (double)arga[i]);
            relerr = (res[i] - ref) / ref;
            if ((fabsf (relerr) > maxrelerr) && 
                (fabsf (ref) >= powf (2.0f, -126))) { // result not denormal
                maxrelerr = fabsf (relerr);

    printf ("max rel err = % 15.8e\n\n", maxrelerr);

    printf ("atan2(1,0)  = % 15.8e\n", fast_atan2f(1,0));
    printf ("atan2(-1,0) = % 15.8e\n", fast_atan2f(-1,0));
    printf ("atan2(0,1)  = % 15.8e\n", fast_atan2f(0,1));
    printf ("atan2(0,-1) = % 15.8e\n", fast_atan2f(0,-1));
    return EXIT_SUCCESS;
testing atan2() with 1000000000 test vectors
max rel err =  3.53486939e-005

atan2(1,0)  =  1.57079637e+000
atan2(-1,0) = -1.57079637e+000
atan2(0,1)  =  0.00000000e+000
atan2(0,-1) =  3.14159274e+000

ARM NEON SIMD version 2

Tag : development , By : Tonix
Date : March 29 2020, 07:55 AM

A64 Neon SIMD - 256-bit comparison

Tag : development , By : user107021
Date : March 29 2020, 07:55 AM
Related Posts Related QUESTIONS :
  • How many byes is each instruction compiled to in x86 assembly?
  • What does X mean in EAX,EBX,ECX ... in assembly?
  • How is return address specified in stack?
  • How to write to the console in fasm?
  • reading a BYTE as a DWORD in Masm
  • Print double-word number to string
  • 8086 assembly right mouse click interrupts
  • Assembly program crashes when reading second BMP file
  • how to find the implementation of s_init() which called by lowlevel_init() in uboot
  • How to write two bytes to a chunk of RAM repeatedly in Z80 asm
  • Re-use string at known address to save bytes and reduce size of shellcode payload
  • How to access PCIe configuration space? (ECAM)
  • Is the stack frame of a function cleared, or is it left as such on the stack, when we return from the function
  • Why does SEG not give an error message with this code fragment?
  • x86_64 Opcode encoding formats in the intel manual
  • Should using MOV instruction to set SS to 0x0000 cause fault #GP(0) in 64-bit mode?
  • How to change the foreground color of a string (32 Bit Assembly kernel)?
  • Is this AND+CMP between the same two operands just checking for one of them being zero?
  • segmentation fault while calling functions in nasm assembly
  • String reverse with x64 SSE / AVX registers
  • How Can Assembly Program Multitask Two Or More Tasks In The Same Time?
  • Cortex M7 floating arithmetic instruction duration with zero operand
  • How does mulw, prodl and prodh work together in assembler programming?
  • Questions about APIC interrupt
  • assembly: position of static/ global variables
  • When is the zero flag set?
  • how TEST instruction check if number is EVEN or ODD in assembly language
  • What does "Undefined" mean in Intel's asm documentation? FST effect on C0, C2, C3
  • How to decompile 64-bit binary to retrieve content?
  • BIOS Always Fails to Perform Disk Operations
  • My TSR program freezes when executed a second time
  • ARM assembly syntax in VST/VLD commands
  • Do assembly instructions map 1-1 to machine language?
  • What are the technical mechanics and operation of declaring variables in 32-bit MASM?
  • How are the SCAS and MOVS instructions affected by the value of the direction EFLAG?
  • Pointer to string in stand-alone binary code without .data section
  • How to enable MMU in QEMU (virt machine a57 cpu)
  • How to count character of a string in assembly not use 0Ah/int21h
  • mips nested procedure with jr $ra
  • NASM-64bits-segmentation fault calling procedure
  • Is it possible to map a process into memory without mapping the kernel?
  • Index an array with constants with Turbo C++ inline asm
  • Inline Assembly Procedure Crash When Accessing Arguments
  • Assembly TASM: Multiply 4 digit numbers by 4 digit numbers
  • How do assembly instruction differentiate between register, memory address, immediate value or offset parameter?
  • How can segmentation be used to expand memory?
  • Print spaces between every numbers that's printed in 8086 Assembly
  • the why of MOV command restrictions
  • Why a Word is 2 bytes and not 4 bytes in x86 Assembly?
  • Assembly memory operands clarification
  • Syntax error in assembly when using letter "C"
  • 6502 XASM defini data
  • How do I identify the instruction stored LC-3
  • Int 21h/ah=39h returns with AX=3 upon directory creation
  • Filling up Delayed Branch slots
  • Syntax to read IA32 register with offset in GDB?
  • What is the purpose of positive EBP referencing?
  • What does the following byte specifier for adding to a memory reference does in NASM assembler?
  • Can I read label value in assembly?
  • How to create a void function in Assembly?
  • shadow
    Privacy Policy - Terms - Contact Us © scrbit.com