Author Topic: {Challenge} connect each two points whose distance is shorter than given value  (Read 100621 times)

0 Members and 1 Guest are viewing this topic.

highflyingbird

  • Bull Frog
  • Posts: 415
  • Later equals never.
I found a bug In this forum?
when I use the word
Code: [Select]
"pts[i]" ---the "[ i ]",it 's dispeared. it can't be displayed?  :-)
« Last Edit: June 23, 2010, 09:40:02 PM by highflybird »
I am a bilingualist,Chinese and Chinglish.

pkohut

  • Guest
Quote
         
            if(fabs(pt2.x - pt1.x) <= distance && fabs(pt2.y - pt1.y) <= distance &&
                pt1.distanceTo(pt2) <= distance &&
                (pts3.R ^ pts3[j].R)) {
                    // yep, build a line
                    AddLine(pRec, pt1, pt2);
            }

I think these codes can be improved.
           if(fabs(pt2.x - pt1.x) <= distance &&
               //fabs(pt2.y - pt1.y) <= distance &&         //This line can be cut. because
                                                                         // we have "if(fabs(pt2.y - pt1.y) > distance) break; "    
                (pts3.R ^ pts3[j].R)  &&                                  
                pt1.distanceTo(pt2) <= distance ){         //The distance calculation is more expensive.
                    // yep, build a line
                    AddLine(pRec, pt1, pt2);
            }

Based on the code snippet you show, I don't think so. The original is better as it does a quick Manhattan Distance check before doing the more expensive Pythagorean calc.

highflyingbird

  • Bull Frog
  • Posts: 415
  • Later equals never.
Based on the code snippet you show, I don't think so. The original is better as it does a quick Manhattan Distance check before doing the more expensive Pythagorean calc.
in my test, yep,the distance check is faster,but just a little,nearly the same speed.I don't know the why.
in my old opinion,it's slower.
I am a bilingualist,Chinese and Chinglish.

pkohut

  • Guest
Based on the code snippet you show, I don't think so. The original is better as it does a quick Manhattan Distance check before doing the more expensive Pythagorean calc.
in my test, yep,the distance check is faster,but just a little,nearly the same speed.I don't know the why.
in my old opinion,it's slower.

Floating point subtraction and addition are faster operations than multiplication and division. fabs, abs, etc take a single machine clock cycle to perform.

highflyingbird

  • Bull Frog
  • Posts: 415
  • Later equals never.
Floating point subtraction and addition are faster operations than multiplication and division. fabs, abs, etc take a single machine clock cycle to perform.

but
Code: [Select]
(Pts3[ i ].R ^ Pts3[ j ].R) is a logistic calculation, distanceto includes substracts,additions,and multiplication and squrare root calculation.Ok,this is another topic.pkohut, thanks your replay.
« Last Edit: June 23, 2010, 11:16:50 PM by highflybird »
I am a bilingualist,Chinese and Chinglish.

pkohut

  • Guest
As promised, the source code to the new KDTree version, there are also 2 2008 arx builds for doing the full test and the raw speed test, each one has commands "doit" and "doit2".  The first runs with an optimized (balanced KDTree), the second with an unbalanced tree.

The 1 million point test runs are:
5.77 seconds raw speed.
15.92 seconds with drawing the AcDbLine's (full test)

The full test is taking a performance hit because the KDNODE position data must be converted to AcGePoint3d for the line to be built. I have been so concentrated on RS that I overlooked this important issue.  The fix is easy, probably take an hour to implement and another hour to test, but I promised to get the code out today, so that's where it stands (for now). Never mind, this is in line with the performance with DandC.

Notes about this code (almost worth its own thread) -
The source code is heavily documented, and I ran Doxygen on it to produce developer documentation in the Doc directory. Just load the "index.html" file to browse the developer docs.

The code uses C++ templates to help the compiler optimize the code.

Total this is about 4 days effort to put together into this form. Was it worth it?  I think so and I hope others do too. Those that have been following this thread, and seeing how everyone's code has progressed, can probably learn something from our work.

As I already mentioned, the full test can be sped up by having KDNodes store AcGePoint3d data, that will take care of the hit when creating AcDbLines.   The code is not thread safe so utilizing multiple processors is out of the question (actually there is one method I can think of to queue up a bunch of point data and have another thread draw the AcDbLines, this would take 4 to 6 seconds off the full test run time.  And I'm thinking with a major rewrite I can get this to between 2 to 4 seconds raw speed (dawned on me yesterday while I was away).

Thanks  :-)
« Last Edit: June 24, 2010, 01:35:40 AM by pkohut »

pkohut

  • Guest
Floating point subtraction and addition are faster operations than multiplication and division. fabs, abs, etc take a single machine clock cycle to perform.

but
Code: [Select]
(Pts3[ i ].R ^ Pts3[ j ].R) is a logistic calculation, distanceto includes substracts,additions,and multiplication and squrare root calculation.Ok,this is another topic.pkohut, thanks your replay.

This is better, check if point 3 i and j are on the right side first, then do the calculations.

Code: [Select]
if((pts3[i].R ^ pts3[j].R) &&
    fabs(pt2.x - pt1.x) <= distance && fabs(pt2.y - pt1.y) <= distance &&
    pt1.distanceTo(pt2) <= distance) {
        // yep, build a line
        AddLine(pRec, pt1, pt2);
}

pkohut

  • Bull Frog
  • Posts: 483
Floating point subtraction and addition are faster operations than multiplication and division. fabs, abs, etc take a single machine clock cycle to perform.

but
Code: [Select]
(Pts3[ i ].R ^ Pts3[ j ].R) is a logistic calculation, distanceto includes substracts,additions,and multiplication and squrare root calculation.Ok,this is another topic.pkohut, thanks your replay.

This is better, check if point 3 i and j are on the right side first, then do the calculations.

Code: [Select]

if((pts3[i].R ^ pts3[j].R) &&
    fabs(pt2.x - pt1.x) <= distance && fabs(pt2.y - pt1.y) <= distance &&
    pt1.distanceTo(pt2) <= distance) {
        // yep, build a line
        AddLine(pRec, pt1, pt2);
}

How to get an easy 5% speed boost from DandC
Been playing around with some code optimizations today and revisited the DandC and KdTree code bases (originals can be found in this thread). For the DandC version, there are 2 if statements similar to the one above. That's a lot of if processing, especially in the inner loop of the main work function. Here is what the code looks like without the && operators -
Code: [Select]
if(pts3[i].R ^ pts3[j].R)
  if(fabs(pt2.x - pt1.x) <= distance)
    if(fabs(pt2.y - pt1.y) <= distance)
      if(pt1.distanceTo(pt2) <= distance) {
        // yep, build a line
        AddLine(pRec, pt1, pt2);
       }
The first 3 compares can all be combined into a single logical & statement which will get rid of 2 if's, and we come out ahead in clock cycles. So the above if statement can be rewritten to -
Code: [Select]
if( (fabs(pt2.x - pt1.x) <= distance) & (fabs(pt2.y - pt1.y) <= distance) &
    (pts3[i].R ^ pts3[j].R) &&
    (pt1.distanceTo(pt2) <= distance))

When run with the core algorithm only (no line work drawing) you get about a 3.5% speed boost with the 1,000,000 dot drawing. Not a whole bunch, especially when the program does the work in under 2 seconds, but we'll take the reduced runtime all the same.

For an addition 1.5% speed boost, add void Set(int n, bool r) { N = n; R = r; } to the struct IndexAndSide, then call Set instead of accessing the members indirectly. See source code.

Code: [Select]

struct IndexAndSide
{
    int N; // index of point
    bool R; // is right side
    void Set(int n, bool r) { N = n; R = r; }
};

void DandC( AcDbBlockTableRecord * pRec, const AcGePoint3d * pts, int low, int high, const double & distance )
{
    int index = high - low + 1;

    if(index < 3) {
        if(index == 2) {
            // only 2 points in the array, are they less than distance appart?
            //if( fabs(pts[high].x - pts[low].x) <= distance &&
            // fabs(pts[high].y - pts[low].y) <= distance &&
            // pts[low].distanceTo(pts[high]) <= distance) {
            if( (fabs(pts[high].x - pts[low].x) <= distance) &
                (fabs(pts[high].y - pts[low].y) <= distance) &&
                pts[low].distanceTo(pts[high]) <= distance) {

                    // yep, build a line
                    AddLine(pRec, pts[low], pts[high]);
            }
        }
        return;
    }

    int u = low + index / 2;

    DandC(pRec, pts, low, u - 1, distance);
    DandC(pRec, pts, u, high, distance);

    int k1 = 0, k2 = 0;

    // note pts[u].x is middle x
    for(int i = u - 1; i >= low && pts[i].x >= (pts[u].x - distance); --i, ++k1);

    if(k1 == 0)
        return;

    for(int i = u; i <= high && pts[i].x <= (pts[u].x + distance); ++i, ++k2);

    std::vector<IndexAndSide> pts3(k1 + k2);

    for(int i = 0; i < k1; ++i) {
        pts3[i].Set(u - 1 - i, false);
        //pts3[i].N = u - 1 - i;
        //pts3[i].R = false;
    }

    for(int i = 0; i < k2; ++i) {
        pts3[i + k1].Set(u + i, true);
        //pts3[i + k1].N = u + i;
        //pts3[i + k1].R = true;
    }

    QuickSortY(pts, &pts3.front(), 0, k1 + k2 - 1);

    // look at points in array from 0 to k1 + k2
    for(int i = 0; i < k1 + k2 - 1; ++i) {
        AcGePoint3d pt1 = pts[pts3[i].N];
        // look at points in array from i + 1 to k1 + k2
        for(int j = i + 1; j < k1 + k2; ++j) {
            // compare the points to see if they are within distance of each other
            AcGePoint3d pt2 = pts[pts3[j].N];
            if(fabs(pt2.y - pt1.y) > distance)
                break;
            //if(fabs(pt2.x - pt1.x) <= distance && fabs(pt2.y - pt1.y) <= distance &&
            //    pt1.distanceTo(pt2) <= distance &&
            //    (pts3[i].R ^ pts3[j].R))
            if( (fabs(pt2.x - pt1.x) <= distance) & (fabs(pt2.y - pt1.y) <= distance) &
                (pts3[i].R ^ pts3[j].R) &&
                (pt1.distanceTo(pt2) <= distance))
            {
                // yep, build a line
                AddLine(pRec, pt1, pt2);
            }

        }
    }
}


For the KdTree version applying the same principles reduced the runtime from about 6.4 seconds to 5.4 seconds, about 18% (earlier reports of 5.7 seconds was based on different OS and system configuration). That will be a different report.
« Last Edit: September 14, 2010, 09:11:36 AM by pkohut »
New tread (not retired) - public repo at https://github.com/pkohut

pkohut

  • Bull Frog
  • Posts: 483
How 18% speed increase was achieved in KdTree version.

Most of the code changes were made in the inner most FindNearest function. Here's the original code for the function (Changed area is red, noted areas purple)
Code: [Select]
int __fastcall FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList)
{
if(!pNode)
return 0;

int added_res = 0; // track number of nodes are added to the pList
// get the Manhattan distance from pNode to the queried position. If Manhattan distance is <= query range
// the go ahead and do more expensive KDTree distance check.
[color=purple]if( fabs(pNode->pos[0] - m_dQueryPos[0]) <=m_dQueryRange && fabs(pNode->pos[1] - m_dQueryPos[1]) <= m_dQueryRange) {[/color]
// get the KDTree distance from pNode to the queried position
double dist_sq = 0.0;
for(int i = 0; i < DIM; i++) {
dist_sq += SQ(pNode->pos[i] - m_dQueryPos[i]);
}

// Only add nodes that are within the query range
if(dist_sq <= SQ(m_dQueryRange)) {
if(RlistInsert(pList, pNode, dist_sq) < 0) {
// error
return -1;
}
added_res = 1;
}
}

// Get the KDTree direction distance from the queryPosition to pNode,
// which determines which leaf node to traverse down next.
double dx = m_dQueryPos[pNode->dir] - pNode->pos[pNode->dir];
[color=red]int ret = FindNearest(dx <= 0.0 ? pNode->left : pNode->right, pList);
if(ret >= 0 && fabs(dx) < m_dQueryRange) {[/color]
added_res += ret;
[color=red]ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);[/color]
}
if(ret < 0) {
return -1;
}

added_res += ret;
return added_res;
}

The FindNearest function is the most used function in the KdTree algorithm and makes lots of recursive calls, so speeding it up is an obvious choice. In such a tight loop function, simplifying if statements is a good start, we've got 4 candidates. For know lets concentrate on the second one highlighted in red.
Code: [Select]
[color=red]if(ret >= 0 && fabs(dx) < m_dQueryRange)[/color]
This basically says
Code: [Select]
if(ret >= 0)
    (if(fabs(dx) < m_dQueryRange) {
        ...
    }       
Changing the && operator to & makes the statement a boolean check from a single if -
Code: [Select]
[color=green]if(ret >= 0 & fabs(dx) < m_dQueryRange)[/color]
Run benchmarks to determine overall effect. In this case we got a positive speed boost. Notice the purple highlighted line in the original code. Making to same change to it cause a negative impact on speed. After lots of testing I'm guessing that it is a memory cache miss that caused the slow down, so the line won't be changed.

Let's turn our attention to the first red highlighted line -
Code: [Select]
int ret = FindNearest([color=red]dx <= 0.0 ? pNode->left : pNode->right[/color], pList);For those not familiar with C or Java syntax this may look strange, it's called a Ternary operator, and basically we put a big ol if/then/else statement right in the function call parameter. It reads "if dx is less than or equal to 0.0 then use pNode->left else use pNode->right as the first parameter to FindNearest".

The goal of this exercise is to speed up the program by removing if/then/else statements or ternay operators from this heavily called function. If speed is not an issue then by all means use if/then/else and ternay, they help make the code maintainable and readable.

For us, the problem with if/then/else or ternay is that they force a CPU jump instruction no matter what and those cost CPU clock cycles. Depending on the CPU there is also branch prediction and pipelines to contend with, look it up on wiki if your interested. From a simplified CPU point of view here's how if/then/else and if/then may look.
Code: [Select]
[color=navy]if/then/else type statement[/color]
    Compare dx <=  0.0
    Jump to Label1 if false
    Use pNode->left
    Jump to Label2
Label1:
    Use pNode->right
Label2:

[color=navy]if/then type statement[/color]
    Compate dx <= 0.0
    Jump to Label3 if false
    Use pNode->left;
Label3:
See the 2 jumps in the if/then/else statement, for each pass through the if statement one of them must be executed, and those add up. Depending on the condition we will use either left or right. It would be nice if we could use the simpler if/then style statement and still have the proper left or right value set. By preloading a variable we can. The idea being, preload a variable with left or right (we'll use right), then do the check and change the preloaded variable if necessary. Here it is in code -
Code: [Select]
[color=green]        KDNODE<T, DIM> * pNode1 = pNode->right; // preload variable
        if(dx <= 0.0) {
            pNode1 = pNode->left; // oops, needed left
        }
        int ret = FindNearest(pNode1, pList);[/color]
Test the changes. Nice boost in speed. So when you have 2 conditions and the variables are simple, then preloading may be beneficial (watch out for memory cache misses).

Ok, let's do the same thing to the third red highlighted line from the original code -
Code: [Select]
[color=red]ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);[/color]
... becomes ...
Code: [Select]
[color=purple]            pNode1 = pNode->left;
            if(dx <= 0.0)
                pNode1 = pNode->right;
            ret = FindNearest(pNode1, pList);[/color]
Test the changes, oh ya, very nice speed boost. However, -
Code: [Select]
[color=green]            pNode1 = pNode->right;
            if(dx > 0.0)
                pNode1 = pNode->left;
            ret = FindNearest(pNode1, pList);[/color]
Gives a massive speed boost. And for a final speed boost to 18% remove the nCount++ from the ConnectDotsWithin function as it causes memory cache misses.

Getting these results only took an hour or so of messing with the code. After every little change benchmarks were checked. I also explored hand written SSE2 instructions, but could not get it correct. I especially wanted that purple highlighted line in the original code to use SSE2 as that would have been nice to parallel compute those values.

Final source code for the FindNearest function -
Code: [Select]
[color=green]    int __fastcall FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList)
    {
        if(!pNode)
            return 0;

        int added_res = 0; // track number of nodes are added to the pList
        // get the Manhattan distance from pNode to the queried position. If Manhattan distance is <= query range
        // the go ahead and do more expensive KDTree distance check.
        if( fabs(pNode->pos[0] - m_dQueryPos[0]) <= m_dQueryRange &&
            fabs(pNode->pos[1] - m_dQueryPos[1]) <= m_dQueryRange) {
                // get the KDTree distance from pNode to the queried position
                double dist_sq = 0.0;
                for(int i = 0; i < DIM; i++) {
                    dist_sq += SQ(pNode->pos[i] - m_dQueryPos[i]);
                }

                // Only add nodes that are within the query range
                if(dist_sq <= SQ(m_dQueryRange)) {
                    if(RlistInsert(pList, pNode, dist_sq) < 0) {
                        // error
                        return -1;
                    }
                    added_res = 1;
                }
        }

        // Get the KDTree direction distance from the queryPosition to pNode,
        // which determines which leaf node to traverse down next.
        double dx = m_dQueryPos[pNode->dir] - pNode->pos[pNode->dir];


        KDNODE<T, DIM> * pNode1 = pNode->right;
        if(dx <= 0.0) {
            pNode1 = pNode->left;
        }
        int ret = FindNearest(pNode1, pList);

        if(ret >= 0 & fabs(dx) < m_dQueryRange) {
            added_res += ret;
            // ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);
            pNode1 = pNode->right;
            if(dx > 0.0)
                pNode1 = pNode->left;
            ret = FindNearest(pNode1, pList);
        }
        if(ret < 0) {
            return -1;
        }

        added_res += ret;
        return added_res;
    }[/color]
New tread (not retired) - public repo at https://github.com/pkohut

highflyingbird

  • Bull Frog
  • Posts: 415
  • Later equals never.
Thanks,pkohut.
I'll read your code carefully.KDTree is a very good algorithm,I wish I can understand it completely.
I am a bilingualist,Chinese and Chinglish.

pkohut

  • Bull Frog
  • Posts: 483
Re: {Challenge} connect each two points whose distance is shorter than given value
« Reply #100 on: September 17, 2010, 07:51:35 AM »
KdTree breaks 5 second barrier for 1,000,000 point drawing  ^-^ (without SSE2, or special compiler settings)

From the previous message, it was mentioned that KdTree was running about 6.4 seconds and after some code changes down to 5.4 seconds. It's now down to 4.977 seconds (4.742 seconds with SSE2 enabled).

For this version the first change dropped about 3 tenths of a second from the running time. The busy FindNearest routine was restructured to eliminate recursive calls when a child node is NULL. This saves 43,000,000 recursive calls that would have immediately returned (see code below). Each recursive call causes some stack space to be reserved and initialized, then unreserved upon return. So that's 43 million times that doesn't have to be done. For each of the recursive functions, except ClearRecord, the same refactoring was done. After the changes the program was running just under 5.1 seconds.

Those changes were easy, but the 5 second barrier was just right there. Mocking me. The next couple hours resulted in obtaining 5.022 seconds, and for the next 4 hours that was the best I could achieve. While outside having a smoke, I was thinking if only I could find 1 more if statement to refactor. Came back in and 5 minutes later found one, right there in the FindNearest function. Perfect. 4.977 seconds, mission accomplished.

In summary, after the tree optimization routine, the next biggest impact was inner loop and if statement refactoring. The tiny details of chasing 2 hundredths of a second here and there are not important, and probably don't apply from application to application.

The attached binaries are for 2008 32bit. One is just the raw speed test, no lines are drawn.  The other draws the lines in Autocad, which takes an additional 8 seconds or so. Source code will be posted after some code clean up.

Refactored FindNearest function.
Code: [Select]
    int __declspec(nothrow) __fastcall FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList)
    {
        // This function should never be called if pNode is NULL,
        // otherwise program will fault

        int added_res = 0; // track number of nodes are added to the pList
        // get the Manhattan distance from pNode to the queried position. If Manhattan distance is <= query range
        // the go ahead and do more expensive KDTree distance check.
        if( fabs(pNode->pos[0] - m_dQueryPos[0]) <= m_dQueryRange &&
            fabs(pNode->pos[1] - m_dQueryPos[1]) <= m_dQueryRange) {

                // get the KDTree distance from pNode to the queried position
                double dist_sq = 0.0;
                for(int i = 0; i < DIM; i++) {
                    dist_sq += SQ(pNode->pos[i] - m_dQueryPos[i]);
                }

                // Only add nodes that are within the query range
                if(dist_sq <= SQ(m_dQueryRange)) {
                    if(RlistInsert(pList, pNode, dist_sq) < 0) {
                        // error
                        return -1;
                    }
                    added_res = 1;
                }
        }

        // Get the KDTree direction distance from the queryPosition to pNode,
        // which determines which leaf node to traverse down next.
        double dx = m_dQueryPos[pNode->dir] - pNode->pos[pNode->dir];

        // preload pNode1
        KDNODE<T, DIM> * pNode1 = pNode->right;
        if(dx <= 0.0) {
            pNode1 = pNode->left;
        }
       
        // preload ret
        int ret = 0;
        if(pNode1)
            ret = FindNearest(pNode1, pList);

        if(ret < 0)
            return -1;

        if(fabs(dx) < m_dQueryRange) {
            added_res += ret;
            ret = 0; // preload ret

            pNode1 = pNode->right; // preload pNode1
            if(dx > 0.0)
                pNode1 = pNode->left;

            if(pNode1)
                if( (ret = FindNearest(pNode1, pList)) < 0)
                    return -1;
        }

        added_res += ret;
        return added_res;
    }


New tread (not retired) - public repo at https://github.com/pkohut

Kerry

  • Mesozoic relic
  • Seagull
  • Posts: 11654
  • class keyThumper<T>:ILazy<T>
Re: {Challenge} connect each two points whose distance is shorter than given value
« Reply #101 on: September 17, 2010, 08:02:55 AM »

Your attitude and persistance should be a shining example to us all Paul.
kdub, kdub_nz in other timelines.
Perfection is not optional.
Everything will work just as you expect it to, unless your expectations are incorrect.
Discipline: None at all.

pkohut

  • Bull Frog
  • Posts: 483
Re: {Challenge} connect each two points whose distance is shorter than given value
« Reply #102 on: September 17, 2010, 08:15:11 AM »

Your attitude and persistance should be a shining example to us all Paul.


Thank you.

The back story might change your view though. I spent about 12 hours writing assembler code to replace the FindNearest function, but could not beat VC's generated code.  It ain't like the old Z80 and 6502 days.

edit: what resurrected my interest is the SSE2 intrinsics and running computations in parallel. Come to find out MSVC 2008 doesn't do a very good job with these, which lead me to the back story. Read a MS blog that MSVC 2010 generates much better code from SSE2 intrinsics.
« Last Edit: September 17, 2010, 08:30:27 AM by pkohut »
New tread (not retired) - public repo at https://github.com/pkohut

highflyingbird

  • Bull Frog
  • Posts: 415
  • Later equals never.
Re: {Challenge} connect each two points whose distance is shorter than given value
« Reply #103 on: September 17, 2010, 09:22:08 AM »
Your attitude and persistance should be a shining example to us all Paul.
Thank you.
The back story might change your view though. I spent about 12 hours writing assembler code to replace the FindNearest function, but could not beat VC's generated code.  It ain't like the old Z80 and 6502 days.

edit: what resurrected my interest is the SSE2 intrinsics and running computations in parallel. Come to find out MSVC 2008 doesn't do a very good job with these, which lead me to the back story. Read a MS blog that MSVC 2010 generates much better code from SSE2 intrinsics.
I admire you very much. 12 hours,assembler code, you work very hard!
I am a bilingualist,Chinese and Chinglish.

pkohut

  • Bull Frog
  • Posts: 483
Re: {Challenge} connect each two points whose distance is shorter than given value
« Reply #104 on: September 17, 2010, 07:03:51 PM »
One simple change since the last post got me to 4.82 seconds. Simply make the ManhattenDistance calc it's own function allowed the compiler to do a little better optimization. The function is still inline, however a couple instructions in the disassembly were different. Most noticable the original had a fstp (0) instruction instead of the faster fabs.  Here's the code -

Code: [Select]
template<class T>
static int __fastcall ManhattenDistance2dArray( T * pos1,  T * pos2,
                                               const T & range)
{
    return (fabs(pos1[0] - pos2[0]) <= range) &&
        (fabs(pos1[1] - pos2[1]) <= range);
}

// Replace the if statement for the Manhattan distance calc in FindNearest with this -
if(ManhattenDistance2dArray<T>(pNode->pos, m_dQueryPos, m_dQueryRange)) {

Seeing that improvement, I wanted to get an assembly SSE2 function implemented. Since the calculation is data bound and not computation bound, the overhead is pretty big, pushing 64 bit data to 128 registers in the tightest loop of the whole program.

Code: [Select]
   // add these to the CKdTreeT class

    // define some typedefs for use withing the __asm sections. MSVC complains
    // about the brackets in the templates.
    typedef KDNODE<T, DIM> tKdNode;
    typedef CKdTreeT<T, DIM, ORDERED> tKtree;

    FORCEINLINE int __fastcall ManhattenDistance2dArray_2(const double * v1) const
    {
        __asm {
            mov eax, offset absMask    
            movapd xmm4, [eax]          // load mask data to r0 and r1 of xmm4

            mov eax, dword ptr [ecx]    // ecx is pointer to "this" object
            lea eax, [ecx + tKtree.m_dQueryPos] // this + CKdTreeT::m_dQueryPos
            movapd xmm1, [eax]          // load queryPos to r0 and r1 of xmm1

            lea eax, [ecx + tKtree.m_dQueryRange]   // m_dQueryRange is a single
            movsd xmm2, [eax]                       // double, so have to first
                                                    // stuff it into r0 of xmm1.
                                                    // Need to stuff it into r1
                                                    // too, but do something else
                                                    // ...

            mov eax, [edx]tKdNode.pos   // edx points to pNode
            movapd xmm0, [eax]          // load to r0 and r1 of xmm0

            shufpd xmm2, xmm2, 0        // ... we're back with xmm2. copy r0 to r1

            subpd xmm0, xmm1            // subtrack xmm0{r0:r1} - xmm1{r0:r1}
                                        // results go to xmm0

            andpd xmm0, xmm4            // make 'em positive with absMask
            cmpltpd xmm0, xmm2          // compare the results of the subtraction
                                        // m_dQueryRange. r0 and r1 will be -1 for true
                                        // and 0 for false

            movmskpd eax, xmm0          // pack the compare results to an int. They
                                        // will be in 2's complements, so later bitwise
                                        // checks can be applied to see which was true
                                        // or false.

                                        // eax is the return value.
        };
    }


///////////////////////////////////

// Substitute this if statement for the Manhattan distance calc in FindNearest

if(ManhattenDistance2dArray_2(pNode->pos)) {

edit: 2 good SSE2 sources (easy reads), out of about 20 I've been looking at
http://developer.apple.com/hardwaredrivers/ve/sse.html#Translation_Cmp
http://www.gamedev.net/reference/articles/article1987.asp


Anyways, it's been fun, but the compiler is way to good to be mucking around in this neck of the woods. Time to come up for air  :-)
« Last Edit: September 17, 2010, 07:07:27 PM by pkohut »
New tread (not retired) - public repo at https://github.com/pkohut