How 18% speed increase was achieved in KdTree version.
Most of the code changes were made in the inner most FindNearest function. Here's the original code for the function (Changed area is red, noted areas purple)
int __fastcall FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList)
{
if(!pNode)
return 0;
int added_res = 0; // track number of nodes are added to the pList
// get the Manhattan distance from pNode to the queried position. If Manhattan distance is <= query range
// the go ahead and do more expensive KDTree distance check.
[color=purple]if( fabs(pNode->pos[0] - m_dQueryPos[0]) <=m_dQueryRange && fabs(pNode->pos[1] - m_dQueryPos[1]) <= m_dQueryRange) {[/color]
// get the KDTree distance from pNode to the queried position
double dist_sq = 0.0;
for(int i = 0; i < DIM; i++) {
dist_sq += SQ(pNode->pos[i] - m_dQueryPos[i]);
}
// Only add nodes that are within the query range
if(dist_sq <= SQ(m_dQueryRange)) {
if(RlistInsert(pList, pNode, dist_sq) < 0) {
// error
return -1;
}
added_res = 1;
}
}
// Get the KDTree direction distance from the queryPosition to pNode,
// which determines which leaf node to traverse down next.
double dx = m_dQueryPos[pNode->dir] - pNode->pos[pNode->dir];
[color=red]int ret = FindNearest(dx <= 0.0 ? pNode->left : pNode->right, pList);
if(ret >= 0 && fabs(dx) < m_dQueryRange) {[/color]
added_res += ret;
[color=red]ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);[/color]
}
if(ret < 0) {
return -1;
}
added_res += ret;
return added_res;
}
The FindNearest function is the most used function in the KdTree algorithm and makes lots of recursive calls, so speeding it up is an obvious choice. In such a tight loop function, simplifying if statements is a good start, we've got 4 candidates. For know lets concentrate on the second one highlighted in red.
[color=red]if(ret >= 0 && fabs(dx) < m_dQueryRange)[/color]
This basically says
if(ret >= 0)
(if(fabs(dx) < m_dQueryRange) {
...
}
Changing the && operator to & makes the statement a boolean check from a single if -
[color=green]if(ret >= 0 & fabs(dx) < m_dQueryRange)[/color]
Run benchmarks to determine overall effect. In this case we got a positive speed boost. Notice the purple highlighted line in the original code. Making to same change to it cause a negative impact on speed. After lots of testing I'm guessing that it is a memory cache miss that caused the slow down, so the line won't be changed.
Let's turn our attention to the first red highlighted line -
int ret = FindNearest([color=red]dx <= 0.0 ? pNode->left : pNode->right[/color], pList);
For those not familiar with C or Java syntax this may look strange, it's called a Ternary operator, and basically we put a big ol if/then/else statement right in the function call parameter. It reads "if dx is less than or equal to 0.0 then use pNode->left else use pNode->right as the first parameter to FindNearest".
The goal of this exercise is to speed up the program by removing if/then/else statements or ternay operators from this heavily called function. If speed is not an issue then by all means use if/then/else and ternay, they help make the code maintainable and readable.
For us, the problem with if/then/else or ternay is that they force a CPU jump instruction no matter what and those cost CPU clock cycles. Depending on the CPU there is also branch prediction and pipelines to contend with, look it up on wiki if your interested. From a simplified CPU point of view here's how if/then/else and if/then may look.
[color=navy]if/then/else type statement[/color]
Compare dx <= 0.0
Jump to Label1 if false
Use pNode->left
Jump to Label2
Label1:
Use pNode->right
Label2:
[color=navy]if/then type statement[/color]
Compate dx <= 0.0
Jump to Label3 if false
Use pNode->left;
Label3:
See the 2 jumps in the if/then/else statement, for each pass through the if statement one of them must be executed, and those add up. Depending on the condition we will use either left or right. It would be nice if we could use the simpler if/then style statement and still have the proper left or right value set. By preloading a variable we can. The idea being, preload a variable with left or right (we'll use right), then do the check and change the preloaded variable if necessary. Here it is in code -
[color=green] KDNODE<T, DIM> * pNode1 = pNode->right; // preload variable
if(dx <= 0.0) {
pNode1 = pNode->left; // oops, needed left
}
int ret = FindNearest(pNode1, pList);[/color]
Test the changes. Nice boost in speed. So when you have 2 conditions and the variables are simple, then preloading may be beneficial (watch out for memory cache misses).
Ok, let's do the same thing to the third red highlighted line from the original code -
[color=red]ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);[/color]
... becomes ...
[color=purple] pNode1 = pNode->left;
if(dx <= 0.0)
pNode1 = pNode->right;
ret = FindNearest(pNode1, pList);[/color]
Test the changes, oh ya, very nice speed boost. However, -
[color=green] pNode1 = pNode->right;
if(dx > 0.0)
pNode1 = pNode->left;
ret = FindNearest(pNode1, pList);[/color]
Gives a massive speed boost. And for a final speed boost to 18% remove the nCount++ from the ConnectDotsWithin function as it causes memory cache misses.
Getting these results only took an hour or so of messing with the code. After every little change benchmarks were checked. I also explored hand written SSE2 instructions, but could not get it correct. I especially wanted that purple highlighted line in the original code to use SSE2 as that would have been nice to parallel compute those values.
Final source code for the FindNearest function -
[color=green] int __fastcall FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList)
{
if(!pNode)
return 0;
int added_res = 0; // track number of nodes are added to the pList
// get the Manhattan distance from pNode to the queried position. If Manhattan distance is <= query range
// the go ahead and do more expensive KDTree distance check.
if( fabs(pNode->pos[0] - m_dQueryPos[0]) <= m_dQueryRange &&
fabs(pNode->pos[1] - m_dQueryPos[1]) <= m_dQueryRange) {
// get the KDTree distance from pNode to the queried position
double dist_sq = 0.0;
for(int i = 0; i < DIM; i++) {
dist_sq += SQ(pNode->pos[i] - m_dQueryPos[i]);
}
// Only add nodes that are within the query range
if(dist_sq <= SQ(m_dQueryRange)) {
if(RlistInsert(pList, pNode, dist_sq) < 0) {
// error
return -1;
}
added_res = 1;
}
}
// Get the KDTree direction distance from the queryPosition to pNode,
// which determines which leaf node to traverse down next.
double dx = m_dQueryPos[pNode->dir] - pNode->pos[pNode->dir];
KDNODE<T, DIM> * pNode1 = pNode->right;
if(dx <= 0.0) {
pNode1 = pNode->left;
}
int ret = FindNearest(pNode1, pList);
if(ret >= 0 & fabs(dx) < m_dQueryRange) {
added_res += ret;
// ret = FindNearest(dx <= 0.0 ? pNode->right : pNode->left, pList);
pNode1 = pNode->right;
if(dx > 0.0)
pNode1 = pNode->left;
ret = FindNearest(pNode1, pList);
}
if(ret < 0) {
return -1;
}
added_res += ret;
return added_res;
}[/color]