Author Topic: {Challenge} connect each two points whose distance is shorter than given value (Read 101226 times)

pkohut · « **Reply #105 on:** September 18, 2010, 07:54:03 AM »

An hour to implement a simple memory pool and its down to 4.695 seconds. No SSE2, nasty assembly, or crazy tweaks.

Not as much of speed improvement as I'd hoped. Probably because the memory pool is biased towards working with arrays,
and KdTree is a binary tree. So there is probably a lot of memory cache misses. Need to find a tool that can report that type
stuff of the CPU.

edit: Dawned on me a few minutes ago, that the dynamic memory used to hold the leaf node X, Y, Z position data isn't necessary since any more, since the code now uses templates. Changed it from dynamic to a standard array...and booYa - 4.228 seconds.

edit: For comparison sake, KdTree w/ SSE2 enabled 3.994 seconds. Divide and Concur (DandC) w/ SSE2 enabled 1.839 seconds.
3.994

gskelly · « **Reply #106 on:** September 18, 2010, 09:12:17 AM »

Wow... a lot of info in this thread, most of it way over my head

Cool to follow the progression and helps to open my mind a bit to how to look at code differently.

Thanks

pkohut · « **Reply #107 on:** September 18, 2010, 12:17:51 PM »

Quote from: gskelly on September 18, 2010, 09:12:17 AM

Wow... a lot of info in this thread, most of it way over my head

Cool to follow the progression and helps to open my mind a bit to how to look at code differently.

Thanks

Thanks for the comments. I enjoy doing these write ups. Oh ya, chasing the 4 second barrier know. Down to 4.165.

edit:
In my code I tend to write a function like so

Code: [Select]

double DoHeavyCalc(const double & d1, const double & d2, const double & d3, const double & d4) { ... } Thinking being that on a 32 bit machine it is more efficient to pass a 4 byte reference and make it const to let the compiler do any optimizations it can. For giggles I changed the function signatures to

Code: [Select]

double DoHeavyCalc(double d1, double d2, double d3, double d4) { ... } which turns out to be slightly more efficient. Running with the 1 million point shaved about 0.025 seconds, but in my new quest to 4, I'll take it.

pkohut · « **Reply #108 on:** September 18, 2010, 02:06:08 PM »

Found a piece in the code that has the potential to shave off 0.578 seconds. The structure ResultNode is being dynamically allocated and released about 4,500,000 times. At most probably only a thousand or so need to be allocated. My thought is to revamp the simple memory pool introduced earlier and use it for both the ResultNode's and tree leaf nodes.

What's nice about all these optimizations (so far) is they don't rely on tricks or slight of hand. So far it's just been simple refactoring and I'm learning a great deal.

Lee Mac · « **Reply #109 on:** September 18, 2010, 02:10:47 PM »

Quote from: pkohut on September 18, 2010, 02:06:08 PM

So far it's just been simple refactoring and I'm learning a great deal.

That's good to hear, I really admire your persistence Paul, but I'm afraid with every post it is going further and further above my head

pkohut · « **Reply #110 on:** September 18, 2010, 02:39:47 PM »

Quote from: Lee Mac on September 18, 2010, 02:10:47 PM

Quote from: pkohut on September 18, 2010, 02:06:08 PM
So far it's just been simple refactoring and I'm learning a great deal.

That's good to hear, I really admire your persistence

Making up for lost seat time at the computer. Summer is over and the kids are back in school, so I'm left with way to much time on my hands. Happens every year

pkohut · « **Reply #111 on:** September 20, 2010, 03:55:14 PM »

The final optimization from the last post was a simple memory pool and the code was running 4.165 seconds for the test, and there might be a potential to shave off another 0.578 seconds. Implementing the changes dropped the running time to 3.5233 seconds, shaving off 0.6417 seconds (3.2719 seconds with SSE2 enabled). Very happy with the results since it only took an hour or so to implement. (All tests run on 1,000,000 point drawings, benchmarks attached in Excel spread sheet below)

At this point I figure I'm done trying to optimize this thing to the hilt and decide additional benchmarks are in order using different query ranges. The first benchmarks were to compare the KdTree version with the SSE2 compiler setting off and on. The results were a bit of a surprise in that the SSE2 enabled version only ever so slightly edged out the none SSE2 version, consistently running less than a second faster. The timings for 4000, 5000, and 10,000 show the non-SSE2 winning, but I suppose that could have been some OS activity taking place in the background when the test was running. Given that the KdTree is more data driven than computation driven explains some of why SSE2 didn't provide more of an edge.

The next task was to run benchmarks on the DandC (Divide and Conquer algorithm) version (SSE2 enabled). For query ranges less than 200 the DandC beats out the KdTree version. DandC has very little overhead in the way of data structures so it can start cranking on the problem right away. As the query ranges increase the speed of DandC becomes very bad. For example, querying a range of 10,000 with DandC takes about 2 hrs 15 minutes to complete. The same query range with KdTree took about 19 minutes (15.8 minutes after another optimization was done).

After the benchmarks were run I found another piece in the code to optimize. So I ran another set of benchmarks for comparison. The optimization was to simplify 2 if statements and precompute a value that had the potential to be used (computed) twice.

A sampling of the query ranges performed

pkohut · « **Reply #112 on:** September 20, 2010, 10:53:14 PM »

C++ Source code and binary attached to this post.

Had a chance to clean up the code and update the comments. I've also regenerated the programmers docs and added those as a separate attachment.

For those that want to compare changes made since the prior project posting, look at these areas in the code. These changed areas had the biggest impact.

Code: [Select]

Functions -
int __fastcall CKdTreeT::FindNearest(KDNODE<T, DIM> * pNode, RESULT_NODE<T, DIM> *pList);
KDNODE<T, DIM> * __fastcall CKdTreeT::FindMinimum(const KDNODE<T, DIM> * pNode, int axis, int dir) const;
KDNODE<T, DIM> * __fastcall CKdTreeT::EraseNode(KDNODE<T, DIM> * pNode, const T * pos, int dir) const;
struct KDNODE

Header file
MemoryPool.h (and few functions in KdTreeT that use the memory pool)

Most of the other changes whether be code structure, or data layout. contributed to the low running times, but only a small amount. I did lots of benchmarking, testing, and looking at the assembly output. If the change reduced the running time, no matter how small, it was a keeper. BTW, during code cleanup, made another change (included in source code) and know the 10,000 query range sample runs 26 seconds faster.

Thanks everyone for putting up with my posts, especially all the non-C++ types. This was more an educational exercise for myself, and what I've documented here is only a small fraction of what I got out of it.

pkohut · « **Reply #113 on:** September 20, 2010, 11:09:14 PM »

Figured I'd better split this info from the previous post -

...And just to restate, almost every major improvement came from simplifying the code, mostly if statements. For instance take the if statement below, where pNode->right and pNode->left are pointers, and we want to see if the are both NULL.

Code: [Select]

if(!pNode->right && !pNode->left) { ... }As explained in an earlier post. This is doing 2 if compares; if pNode->right is NULL AND if pNode->left is NULL. This can be simplified (in terms of the computer) to

Code: [Select]

if(!pNode->right & !pNode->left) { ... }
Now it's a single if statement doing a "logical and" of right and left, which provides a nice little speed increase. Though the assembly output is better with this change, it didn't look very efficent

Code: [Select]

; 1257 :             if( !pNode->right & !pNode->left) {
  000ba	33 c9		 xor	 ecx, ecx
  000bc	85 c0		 test	 eax, eax
  000be	0f 94 c1	 sete	 cl
  000c1	33 db		 xor	 ebx, ebx
  000c3	85 d2		 test	 edx, edx
  000c5	0f 94 c3	 sete	 bl
  000c8	85 cb		 test	 ecx, ebx
  000ca	74 0a		 je	 SHORT $LN5@EraseNode

At some point a forgotten memory was ebbing it's way to the surface. Something about doing a "not not" operation on a pointer being fast. Ok, reform the if statement to use !! on the pointers ...

Code: [Select]

if(!( !!pNode->right & !!pNode->left)) { ... }
The !!SomePointer tells the compiler to first negate the pointer to true/false based on the value of the pointer, and then negate the true/false result. The results for both pointers are "logically and" together then that result is inverted for the final answer.

Running the benchmark showed a very nice improvement. The "KdTree w/SSE2 & mod*" section in the timings chart reflects this change.

So what's the assembly look like for this statement?

Code: [Select]

; 1257 :             if(!( !!pNode->right & !!pNode->left)) {
  000aa	8b 5c 24 1c	 mov	 ebx, DWORD PTR tv271[esp+16]
  000ae	85 c3		 test	 eax, ebx
  000b0	75 0a		 jne	 SHORT $LN5@EraseNode

That's rockin'!

Again, thanks everyone.

It's Alive! · « **Reply #114 on:** September 21, 2010, 03:06:00 AM »

Now you need to make a version for X64. doubles are stuck into the XMM registers by default, all you need to do is pack them

pkohut · « **Reply #115 on:** September 21, 2010, 03:55:21 AM »

Quote from: __declspec on September 21, 2010, 03:06:00 AM

Now you need to make a version for X64. doubles are stuck into the XMM registers by default, all you need to do is pack them

The seeds for that thought are already planted. Probably on the next rain of boredom will they germinate.

CAB · « **Reply #116 on:** September 21, 2010, 09:27:47 AM »

Programmer and a poet!

pkohut · « **Reply #117 on:** September 22, 2010, 03:11:18 PM »

Quote from: __declspec on September 21, 2010, 03:06:00 AM

Now you need to make a version for X64. doubles are stuck into the XMM registers by default, all you need to do is pack them

The program know runs from the command prompt.

Timings for a couple different ranges and compilers
MSVC 2008 32 - (70) 2.94975, (100) 3.49471
MSVC 2008 x64 - (70) 3.03533, (100) 3.60556
X64 version is running slower than 32 bit. For the ranges 70 = -0.15, 100 = -0.18, 500 = 0.85. So it's actually getting worse the larger the range.

MSVC 2010 32 - (70) 3.09298, (100) 3.72199
MSVC 2010 x64 - (70) 3.04821, (100) 3.63488

32 bit VS 2008 version blows the doors off any version. Though that should be taken with a grain of salt, as I just compiled and run the 2010 moments ago and haven't dig into the compiler options.

In the meantime, going through the code to make sure the correct width variables are being passed around. Also, have gone through different for compiler settings VC 2008 x64, so far nothing jumps out.

After I pack the data to a manageable size I'll release it and the code.

Found this on MSDN

Quote

When you compile applications as 64-bit, the calculations get more complicated. A 64-bit program uses 64-bit pointers, and its instructions are slightly larger, so the memory requirement is slightly increased. This can cause a slight drop in performance. On the other hand, having twice as many registers and having the ability to do 64-bit integer calculations in a single instruction will often more than compensate. The net result is that a 64-bit application might run slightly slower than the same application compiled as 32-bit, but it will often run slightly faster.

highflyingbird · « **Reply #118 on:** September 23, 2010, 11:32:36 AM »

It looks like Divide and Conquer has a good performance with a small distance.
and it can be converted to a LISP code.but It's very difficult for KDtree algorithm.

pkohut · « **Reply #119 on:** September 23, 2010, 01:03:39 PM »

Quote from: highflybird on September 23, 2010, 11:32:36 AM

It looks like Divide and Conquer has a good performance with a small distance.
and it can be converted to a LISP code.but It's very difficult for KDtree algorithm.

Each has their place. The divide and conquer routine is tightly coupled to the problem statement (non-reusable), and the Kd-tree version is loosely coupled (reusable).

The challenge for this thread offered an opportunity to present the Kd-Tree algorithm, something I've reused on a number of occasions and where it ran fast enough. Except for the major refactoring (to templates) and optimizations of the code, it has been plug and play technology for several years.

For this challenge, "fast enough" in the past wasn't good at all. Based on the criteria of the challenge, the routine initially performed really bad, being pushed past it's limits. All of the performance issues were implementation details that were easily fixed.

As far as implementing in lisp, I don't see why not. I wouldn't want to do it, but those scary looking data structures with pointers, those are just nicely wrapped C structures. Lay 'em out flat and index, like we use to do before OOP.

News:

Author Topic: {Challenge} connect each two points whose distance is shorter than given value (Read 101226 times)

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

gskelly

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

Lee Mac

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

It's Alive!

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

CAB

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value

highflyingbird

Re: {Challenge} connect each two points whose distance is shorter than given value

pkohut

Re: {Challenge} connect each two points whose distance is shorter than given value