C++ source code and binary are attached to this post.
I had a chance to clean up the code and update the comments. I've also regenerated the programmer's docs and added those as a separate attachment.
For those that want to compare changes made since the prior project posting, look at these areas in the code. These changed areas had the biggest impact.
Functions -
int __fastcall CKdTreeT::FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList);
KDNODE<T, DIM> * __fastcall CKdTreeT::FindMinimum(const KDNODE<T, DIM> * pNode, int axis, int dir) const;
KDNODE<T, DIM> * __fastcall CKdTreeT::EraseNode(KDNODE<T, DIM> * pNode, const T * pos, int dir) const;
Struct -
struct KDNODE
Header file -
MemoryPool.h (and a few functions in KdTreeT that use the memory pool)
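For readers who haven't opened the attachment: the block below is not the MemoryPool.h from the download. It is only a minimal sketch of the general idea behind a node pool, assuming fixed-size slots threaded onto a free list. The names SimplePool, Create, Destroy and Live are made up for illustration and do not come from the attached source.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical fixed-size object pool (illustration only, not the attached MemoryPool.h).
// Memory is grabbed in blocks of BlockSize slots; freed slots go on a free list
// and are handed out again before any new block is allocated.
template <typename T, std::size_t BlockSize = 1024>
class SimplePool {
    // A slot either stores a live T or links to the next free slot.
    union Slot {
        Slot* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };

    std::vector<std::unique_ptr<Slot[]>> blocks_;  // owns every block
    Slot* free_ = nullptr;                         // head of the free list
    std::size_t live_ = 0;                         // objects currently handed out

    // Allocate one more block and push all of its slots onto the free list.
    void Grow() {
        blocks_.push_back(std::make_unique<Slot[]>(BlockSize));
        Slot* block = blocks_.back().get();
        for (std::size_t i = 0; i < BlockSize; ++i) {
            block[i].next = free_;
            free_ = &block[i];
        }
    }

public:
    // Construct a T in the next free slot.
    template <typename... Args>
    T* Create(Args&&... args) {
        if (!free_) Grow();
        Slot* s = free_;
        free_ = s->next;
        try {
            T* obj = ::new (static_cast<void*>(s->storage)) T(std::forward<Args>(args)...);
            ++live_;
            return obj;
        } catch (...) {
            // Constructor threw: put the slot back so it is not lost.
            s->next = free_;
            free_ = s;
            throw;
        }
    }

    // Destroy an object obtained from Create() and recycle its slot.
    void Destroy(T* p) {
        assert(p != nullptr && live_ > 0);
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
        --live_;
    }

    std::size_t Live() const { return live_; }

    // Note: objects still live when the pool dies are not destructed;
    // their memory is simply released with the blocks.
};
```

The appeal for a tree is that inserting or erasing a node becomes a couple of pointer moves instead of a trip through the general-purpose allocator, and nodes allocated together tend to sit close together in memory. Whether the attached pool works this way is something to check in the source itself.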
Most of the other changes, whether to code structure or data layout, contributed to the low running times, but only by a small amount. I did lots of benchmarking, testing, and looking at the assembly output. If a change reduced the running time, no matter how small, it was a keeper. BTW, during code cleanup I made another change (included in the source code), and now the 10,000 query range sample runs 26 seconds faster.
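The post doesn't say how the timings were taken, so treat the following as one common approach rather than the method actually used here. It is a small sketch of a "best of N runs" timer built on std::chrono::steady_clock; the helper name BestOfMs is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (illustration only): call fn() `runs` times and
// return the fastest wall-clock time in milliseconds. Taking the minimum
// filters out runs that were disturbed by other activity on the machine.
template <typename Fn>
double BestOfMs(int runs, Fn&& fn) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;  // monotonic, safe for intervals
    double best = -1.0;
    for (int i = 0; i < runs; ++i) {
        auto t0 = clock::now();
        fn();
        auto t1 = clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

Wrapping the query loop in a lambda and comparing BestOfMs before and after a change gives a quick keep-or-revert signal, as long as the work inside the lambda has a visible side effect so the compiler cannot optimize it away.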
Thanks everyone for putting up with my posts, especially all the non-C++ types. This was more an educational exercise for myself, and what I've documented here is only a small fraction of what I got out of it.