Author Topic: optimization for a grid/coords duplicate combiner (Read 4223 times)

CADDOG · « **on:** December 12, 2012, 09:14:53 PM »

That's a mouthful.

My brain is just not working today. Normally I write VB style and then work backwards, because that's how lisp is to me... backwards. hehe.
But today, I just can't see the "hook". I run it through in my mind, and each chess move fails before I can even begin typing.
Sample data will average a couple hundred entries, no more than a few thousand.

Code: [Select]

(setq a '(((2 3) "2-3")
	  ((1 0) "1-0")
	  ((2 0) "2-0")
	  ((1 3) "1-3")
	  ((2 2) "2-2")
	  ((2 3) "2-3")
	  ((1 2) "1-2")
	  ((0 0) "0-0")
	  ((0 1) "0-1")
	  ((1 1) "1-1")
	  ((0 2) "0-2")
	  ((2 1) "2-1")
	  ))

(merge-grids a)
;_1_$(((2 1) "2-1") ((0 2) "0-2") ((2 0) "2-0") ((1 0) "1-0\n1-0") ((2 3) "2-3\n2-3"))

The text/string value is irrelevant, and may include much more data in the future (e.g. enames, tmatrix, vla refs). However, the text value will always be last.

The code's intent is to find duplicate coordinates and merge what text it finds with a new line. Duplicate coords will be rare (1 in 50~100).

Code: [Select]

(defun merge-grids (indexed_lst
		    /
		    return item duplicates
		    )
  (while indexed_lst
   (setq item (car indexed_lst)
	 indexed_lst (cdr indexed_lst))
    (if (setq duplicates (vl-remove-if-not (function (lambda (x) (equal (car x) (car item)))) indexed_lst))
      (progn
	(setq indexed_lst (vl-remove-if (function (lambda (x) (equal (car x) (car item)))) indexed_lst))
	(setq item (list (car item) (vl:list->string "\n" (cons (last item) (mapcar 'last duplicates)))))
	);progn
      );if duplicates
    (setq return (cons item return))
    );while list
  );defun merge-grids



  (defun vl:list->string (delim ls / out)
    (setq out (apply 'strcat (mapcar (function (lambda (x) (strcat x delim))) ls)))
    (if out (vl-string-right-trim delim out))
    );defun list->string

CAB · « **Reply #1 on:** December 12, 2012, 09:55:03 PM »

Maybe this? Not much testing.

Code: [Select]

  (defun unique (lst / result itm doup)
    (while (setq itm (car lst))
     (setq lst (cdr lst))
     (while (setq doup (assoc (car itm) lst))
        (setq lst (vl-remove doup lst)
              itm (list (car itm) (strcat (cadr itm) "\n" (cadr doup)))
        )
      )
      (setq result (if result (cons itm result) (list itm)))
    )
    result
  )

(setq a '(((2 3) "2-3") ; doup
	  ((1 0) "1-0")
	  ((2 0) "2-0")
	  ((1 1) "1-1") ; doup
	  ((1 3) "1-3")
	  ((2 2) "2-2")
	  ((2 3) "2-3") ; <<
	  ((1 2) "1-2")
	  ((0 0) "0-0")
	  ((0 1) "0-1")
	  ((1 1) "1-1") ; <<
	  ((0 2) "0-2")
	  ((2 1) "2-1")
	  ))

(mapcar 'print (unique a))

CADDOG · « **Reply #2 on:** December 12, 2012, 10:56:36 PM »

Super... 3x better.
Larger data sets hit 12x better.

I get isolated into a recently learned style, and forget the basics.

Thanks CAB.

CAB · « **Reply #3 on:** December 12, 2012, 11:35:45 PM »

Glad to help.

irneb · « **Reply #4 on:** December 13, 2012, 01:24:16 AM »

CAB: Just a slight thing I'm a bit worried about:

Quote

(defun unique (lst / result itm lst ...

Anyhow, here's my take on this:

Code - Auto/Visual Lisp: [Select]

(defun merge-grids  (input / result found)
  (foreach item  input
    (setq result (if (setq found (assoc (car item) result))
               (subst (list (car item) (strcat (cadr found) "\n" (cadr item))) found result)
               (cons item result))))
  (reverse result))

Lee Mac · « **Reply #5 on:** December 13, 2012, 08:35:33 AM »

Minor optimisation of Irneb's code:

Code - Auto/Visual Lisp: [Select]

(defun merge-grids2 ( lst / ass rtn )
    (foreach itm lst
        (if (setq ass (assoc (car itm) rtn))
            (if (wcmatch (cadr ass) "~*\n*")
                (setq rtn (subst (list (car itm) (strcat (cadr ass) "\n" (cadr itm))) ass rtn))
            )
            (setq rtn (cons itm rtn))
        )
    )
    (reverse rtn)
)

CAB · « **Reply #6 on:** December 13, 2012, 10:07:01 AM »

Thanks irneb, I fixed the error in my code. I was in too big of a hurry.

CADDOG · « **Reply #7 on:** December 13, 2012, 01:47:04 PM »

CAB's still holds the cup for my data set (500~1500 entries w/ 5% duplicate coords) by a significant amount, although more verbose.
I removed the benchmark's restriction on data set depth. I did this two different ways, tested separately.
One was to simply remove the list length restriction.
The second (a separate method) was to supply the testing function as (mapcar '<mergefunction> <nesteddatalist>).
Nested data was 1 list of 60 duplicate lists. Each nested list was a lot of 25 with two of the items being duplicates.

Lee's tweak improved irneb's decently, but it introduces a compatibility issue...
My output goes to one of two optional destinations: Excel or tab-delimited text. When output goes to plain text, the delimiter is a space rather than a tab, so testing for a space would present problems with text items that contain spaces.

Side note: Compiling actually degraded the performance overall. Didn't expect that one.
When running a compiled test, should the benchmark program be compiled as well?

Lee Mac · « **Reply #8 on:** December 13, 2012, 02:15:38 PM »

CAB's method could also be optimised slightly:

Code - Auto/Visual Lisp: [Select]

(defun unique ( lst / dup itm rtn )
    (while (setq itm (car lst))
        (setq lst (cdr lst))
        (if (setq dup (assoc (car itm) lst))
            (setq lst (vl-remove dup lst)
                  itm (list (car itm) (strcat (cadr itm) "\n" (cadr dup)))
            )
        )
        (setq rtn (cons itm rtn))
    )
    rtn
)

(Note that results will be in reverse)

CAB · « **Reply #9 on:** December 13, 2012, 03:05:17 PM »

Noticed a problem: if there were more than two matches in the list, the vl-remove would remove ALL matching items.
My fix:

Code: [Select]

  (defun unique (lst / result itm found)
    (while (setq itm (car lst))
     (setq lst (cdr lst))
     (cond
       ((null result)(setq result (list itm)))
       ((setq found (assoc (car itm) result))
        (setq result (subst (list (car itm) (strcat (cadr found) "\n" (cadr itm))) found result)))
       ((setq result (cons itm result)))
     )
    )
    result
  )

Code: [Select]

(setq a '(((2 3) "2-3") ; doup
	  ((1 0) "1-0")
	  ((2 0) "2-0")
	  ((1 1) "1-1") ; doup
	  ((1 3) "1-3")
	  ((2 2) "2-2")
	  ((2 3) "2-3") ; <<
	  ((1 2) "1-2")
	  ((0 0) "0-0")
	  ((0 1) "0-1")
	  ((1 1) "1-1") ; <<
	  ((0 2) "0-2")
	  ((2 1) "2-1")
	  ((2 3) "2-3") ; <---<<<
	  ))

(defun c:test()
  (mapcar 'print (unique a))
  (princ)
  )
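
The vl-remove behaviour described above can be reproduced in isolation. This is only a minimal sketch (the list name `lst` is chosen for illustration): vl-remove deletes every element equal to its first argument, so one call consumes all remaining exact copies of a record even though only one copy's text has been merged.

```autolisp
;; Three identical (2 3) records and one (1 0) record.
(setq lst '(((2 3) "2-3") ((1 0) "1-0") ((2 3) "2-3") ((2 3) "2-3")))

;; assoc returns the first (2 3) record; vl-remove then drops
;; every element equal to it, not just that first one.
(vl-remove (assoc '(2 3) lst) lst)
;; => (((1 0) "1-0"))
```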

CADDOG · « **Reply #10 on:** December 13, 2012, 04:37:18 PM »

Quote from: CAB on December 13, 2012, 03:05:17 PM

Noticed a problem: if there were more than two matches in the list, the vl-remove would remove ALL matching items.

Good catch -> as everybody's does as well (except for original post).
Downside is that now CAB's is only as efficient as the original.

< edit >
OOPS. Correction. irneb's works with 2+ duplicate coords.

Lee Mac · « **Reply #11 on:** December 13, 2012, 06:05:00 PM »

Quote from: CADDOG on December 13, 2012, 04:37:18 PM

Quote from: CAB on December 13, 2012, 03:05:17 PM
Noticed a problem: if there were more than two matches in the list, the vl-remove would remove ALL matching items.

Good catch -> as everybody's does as well (except for original post).

As far as I can see, Irneb's code should handle multiple duplicate items correctly.

Evidently I hadn't completely understood your intentions for this function, as I purposely modified Irneb's function to skip additional duplicates to improve efficiency...

CADDOG · « **Reply #12 on:** December 13, 2012, 06:30:33 PM »

Quote from: Lee Mac on December 13, 2012, 06:05:00 PM

As far as I can see, Irneb's code should handle multiple duplicate items correctly.

You're right! I'm now up to 15 variations, and I mislabelled his (most of mine don't work and are plays on what you guys submitted, which is why I didn't post them as well).

So those that work...
CAB:unique:v2
CD:merge-grids
IN:merge-grids

I didn't expect this...

Code: [Select]

Elapsed milliseconds / relative speed for 128 iteration(s):
    (CD:MERGE-GRIDS A).....17940 / 1.57 <fastest>
    (IN:MERGE-GRIDS A).....27347 / 1.03
    (CAB:UNIQUE:V2 A)......28158 / 1.00 <slowest>

I was really looking forward to the 12x increase from CAB's original.

Nesting the vl:list->string function in the cd:merge-grids function is what made the, almost insignificant, difference.

Lee Mac · « **Reply #13 on:** December 13, 2012, 06:57:53 PM »

Another variation:

Code - Auto/Visual Lisp: [Select]

(defun LM:group ( l / a r x )
    (while (setq x (car l) a (car x))
        (setq l
            (vl-remove-if
                (function
                    (lambda ( b )
                        (if (equal a (car b))
                            (setq x (list a (strcat (cadr x) "\n" (cadr b))))
                        )
                    )
                )
                (cdr l)
            )
            r (cons x r)
        )
    )
    (reverse r)
)

CADDOG · « **Reply #14 on:** December 13, 2012, 08:37:39 PM »

Quote from: Lee Mac on December 13, 2012, 06:57:53 PM

Another variation: LM:group

It looked very promising, and much simpler at first, then I was scratching my head for a moment tracking the assignments - very smart.

It's better than irneb's, but only slightly (0.09 better). I can only guess that it's the 62 hits to the IF in the lambda, using CAB's last data sample. Using live data, the number of IF hits is much greater.

And to throw a monkey wrench in the whole thing, not even I honored my original requirement that other data may be present and text will always be last. Should have supplied a better data set. The item used for the "primary" data will not matter, as it will most likely be format information for a grid, so the extra "other data" will be lost from all but one duplicate.

It just throws me that this is still working best, even with the extra test for middle data.

Code: [Select]

  (defun CD:merge-grids2 (indexed_lst / return item duplicates middledata vl:list->string)
    (defun vl:list->string (delim ls / out)
      (setq out (apply 'strcat (mapcar (function (lambda (x) (strcat x delim))) ls)))
      (if out (vl-string-right-trim delim out))
      );defun list->string
    
    (while indexed_lst
      (setq item (car indexed_lst)
	    indexed_lst (cdr indexed_lst))
      (if (setq duplicates (vl-remove-if-not (function (lambda (x) (equal (car x) (car item)))) indexed_lst))
	(progn
	  (setq indexed_lst (vl-remove-if (function (lambda (x) (equal (car x) (car item)))) indexed_lst))
	  (if (setq middledata (reverse (cdr (reverse (cdr item)))))
	    (setq item (append (list (car item)) middledata (list (vl:list->string "\n" (cons (last item) (mapcar 'last duplicates))))))
	    (setq item (list (car item) (vl:list->string "\n" (cons (last item) (mapcar 'last duplicates)))))
	    );if
	  );progn
	);if duplicates
      (setq return (cons item return))
      );while list
    );defun merge-grids

Future data expansion:

Code: [Select]

(setq a '(((2 3) "other" "data" "2-3-0") ; doup
	  ((1 0) "other" "data" "1-0")
	  ((2 0) "other" "data" "2-0")
	  ((1 1) "other" "data" "1-1") ; doup
	  ((1 3) "other" "data" "1-3")
	  ((2 2) "other" "data" "2-2")
	  ((2 3) "other" "data" "2-3-1") ; <<
	  ((1 2) "other" "data" "1-2")
	  ((0 0) "other" "data" "0-0")
	  ((0 1) "other" "data" "0-1")
	  ((1 1) "other" "data" "1-1") ; <<
	  ((0 2) "other" "data" "0-2")
	  ((2 1) "other" "data" "2-1")
	  ((2 3) "other" "data" "2-3-2") ; <---<<<
	  ((2 3) "other" "data" "2-3-3") ; <---<<<<
	  ))

Thanks for the help guys.

CAB · « **Reply #15 on:** December 13, 2012, 11:07:53 PM »

I believe that if the records do not match exactly then this code will work:

Code: [Select]

  (defun unique (lst / result itm doup)
    (while (setq itm (car lst))
     (setq lst (cdr lst))
     (while (setq doup (assoc (car itm) lst))
        (setq lst (vl-remove doup lst)
              itm (reverse (cons (strcat (last itm) "\n" (last doup)) (cdr (reverse itm))))
        )
      )
      (setq result (if result (cons itm result) (list itm)))
    )
    result
  )

Code: [Select]

(setq a '(((2 3) "other" "data" "2-3-0") ; doup
	  ((1 0) "other" "data" "1-0")
	  ((2 0) "other" "data" "2-0")
	  ((1 1) "other" "data" "1-1") ; doup
	  ((1 3) "other" "data" "1-3")
	  ((2 2) "other" "data" "2-2")
	  ((2 3) "other" "data" "2-3-1") ; <<
	  ((1 2) "other" "data" "1-2")
	  ((0 0) "other" "data" "0-0")
	  ((0 1) "other" "data" "0-1")
	  ((1 1) "other" "data" "1-1") ; <<  if there are more "1-1"s then they will be lost unless the "other" OR "data" are different 
	  ((0 2) "other" "data" "0-2")
	  ((2 1) "other" "data" "2-1")
	  ((2 3) "other" "data" "2-3-2") ; <---<<<
	  ((2 3) "other" "data" "2-3-3") ; <---<<<<
	  ))

(defun c:test()
  (mapcar 'print (unique a))
  (princ)
  )

Note that the results save the "other" & "data" from the first record in the series.
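
To make the rebuild step in the code above concrete, here is the same expression evaluated on two hand-picked records. The values bound to `itm` and `doup` are sample data chosen for illustration, not output from a real run.

```autolisp
;; itm  = the record being kept, doup = a later record with the same coords.
(setq itm  '((2 3) "other" "data"  "2-3-0")
      doup '((2 3) "other" "data2" "2-3-1"))

;; Drop the last element of itm (its text), then put the merged text back
;; in last place. Everything before the text comes from itm only.
(reverse (cons (strcat (last itm) "\n" (last doup)) (cdr (reverse itm))))
;; => ((2 3) "other" "data" "2-3-0\n2-3-1")
```

The "data2" field from `doup` is dropped, which is the behaviour the note above describes.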

Lee Mac · « **Reply #16 on:** December 14, 2012, 07:37:15 AM »

Quote from: CADDOG on December 13, 2012, 08:37:39 PM

It's better than irneb's, but only slightly (0.09 better). I can only guess that it's the 62 hits to the IF in the lambda, using CAB's last data sample. Using live data, the number of IF hits is much greater.

It just throws me that this is still working best, even with the extra test for middle data.

In my tests:

Code - Auto/Visual Lisp: [Select]

_$ (setq a
   '(
        ((2 3) "2-3") ; +
        ((1 0) "1-0") 
        ((2 0) "2-0")
        ((1 1) "1-1") ; *
        ((1 3) "1-3")
        ((2 2) "2-2")
        ((2 3) "2-3") ; ++
        ((1 2) "1-2")
        ((0 0) "0-0")
        ((0 1) "0-1")
        ((1 1) "1-1") ; **
        ((0 2) "0-2")
        ((2 1) "2-1")
        ((2 3) "2-3") ; +++
    )
)

Testing function validity:

Code - Auto/Visual Lisp: [Select]

_$ (equal (LM:Group a) (reverse (CD:merge-grids a)))
T
_$ (equal (LM:Group a) (reverse (CAB:unique a)))
T
_$ (equal (LM:Group a) (IB:merge-grids a))
T

(Note that CD:merge-grids & CAB:unique both return reversed data).

Benchmark with small data set:

Code - Auto/Visual Lisp: [Select]

_$ (length a)
14
_$ (benchmark '((LM:Group a) (CD:merge-grids a) (CAB:unique a) (IB:merge-grids a)))
Benchmarking .................Elapsed milliseconds / relative speed for 16384 iteration(s):
 
    (IB:MERGE-GRIDS A).....1201 / 3.00 <fastest>
    (CAB:UNIQUE A).........1216 / 2.96
    (LM:GROUP A)...........2387 / 1.51
    (CD:MERGE-GRIDS A).....3604 / 1.00 <slowest>

Benchmark with large data set with many duplicates:

Code - Auto/Visual Lisp: [Select]

_$ (repeat 5 (setq a (append a a)))
_$ (length a)
448
 
_$ (benchmark '((LM:Group a) (CD:merge-grids a) (CAB:unique a) (IB:merge-grids a)))
Benchmarking ............Elapsed milliseconds / relative speed for 512 iteration(s):
 
    (LM:GROUP A)...........1669 / 1.76 <fastest>
    (IB:MERGE-GRIDS A).....1825 / 1.61
    (CAB:UNIQUE A).........1934 / 1.52
    (CD:MERGE-GRIDS A).....2933 / 1.00 <slowest>

CADDOG · « **Reply #17 on:** December 14, 2012, 01:01:18 PM »

Yeah... Once again. You're right. Rebooted, and now my results are completely different (matches yours). Frustrating.

So a few things needed to be changed. CAB's last structure rebuild captures the inner goodies of a potential data set expansion.

Code: [Select]

(setq itm (reverse (cons (strcat (last itm) "\n" (last doup)) (cdr (reverse itm)))))

This can be replicated on each of the others (except for mine). And irneb's and Lee's catch the really rare duplicate text scenario.
See each modification (final reverse calls have been removed - not needed).

Code: [Select]

(defun IN:merge-grids (input / result found)
  (foreach item  input
    (setq result (if (setq found (assoc (car item) result))
		   ;(subst (list (car item) (strcat (cadr found) "\n" (cadr item))) found result)
		   ; keep the first record's middle data and append the new text after the text already merged
		   (subst (reverse (cons (strcat (last found) "\n" (last item)) (cdr (reverse found)))) found result)
		   (cons item result)
		   );if
	  );setq
    );foreach
   result
  )

  (defun LM:group ( l / a r x )
    (while (setq x (car l) a (car x))
      (setq l (vl-remove-if (function (lambda ( b )
			(if (equal a (car b))
			  ;(setq x (list a (strcat (cadr x) "\n" (cadr b))))
			  (setq x (reverse (cons (strcat (last x) "\n" (last b)) (cdr (reverse x))))))))
		(cdr l))
	    r (cons x r));setq
      );while
    r
    );defun LM:group

  (defun CAB:unique:v3 (lst / result itm doup)
    (while (setq itm (car lst))
     (setq lst (cdr lst))
     (while (setq doup (assoc (car itm) lst))
        (setq lst (vl-remove doup lst)
              itm (reverse (cons (strcat (last itm) "\n" (last doup)) (cdr (reverse itm))))
        )
      )
      (setq result (if result (cons itm result) (list itm)))
    )
    result
  )

Data set is CAB's last post, and data building resembles Lee's repeat.
Was going to post my results, but wanted to see Lee's results first (cuz I tire of being wrong) hahaha.
But really, this isn't the first time my results waned compared to others. Something I've been trying to figure out.

< edit >
Added attachment - maybe someone can spot why my results go one way then another. Benchmark is the benchmark program found on this site.

CAB · « **Reply #18 on:** December 14, 2012, 01:34:13 PM »

Quote

And irneb's and Lee's catch the really rare duplicate text scenario.

Note that if the data & other are not exactly the same then duplicate text in the last position will not be lost.
Example, these four are combined and none are lost:
((2 3) "other" "data" "2-3-0")
((2 3) "other" "data2" "2-3-0")
((2 3) "other" "data3" "2-3-0")
((2 3) "other" "data4" "2-3-0")
yield
((2 3) "other" "data" "2-3-0\n2-3-0\n2-3-0\n2-3-0")

But data2, data3 & data4 are lost.

CADDOG · « **Reply #19 on:** December 14, 2012, 01:50:45 PM »

Quote from: CAB on December 14, 2012, 01:34:13 PM

Note that if the data & other are not exactly the same then duplicate text in the last position will not be lost.

Gotcha. The potential is there, but more likely because a user messed up, than done on purpose. Since there's no way for the program to know, it can't be lost, forcing the blame on the user and not the program eliminating data.

I'm starting to see why my tests vary. When going for those large, probably never to be seen datasets, cd:merge-grids started climbing the ranks.

OK.. I'll show it.

Code: [Select]

Testing with a list length of 480
Elapsed milliseconds / relative speed for 16384 iteration(s):
    (CAB:UNIQUE:V3 A)........18906 / 7.76 <fastest>
    (LM:GROUP A).............75734 / 1.94
    (IN:MERGE-GRIDS A).......99844 / 1.47
    (CD:MERGE-GRIDS2 A).....146704 / 1.00 <slowest>
Testing with a list length of 1920
Elapsed milliseconds / relative speed for 2048 iteration(s):
    (CAB:UNIQUE:V3 A).......20265 / 4.29 <fastest>
    (LM:GROUP A)............62203 / 1.40
    (IN:MERGE-GRIDS A)......81797 / 1.06
    (CD:MERGE-GRIDS2 A).....87016 / 1.00 <slowest>
Testing with an unreasonable amount of data
 List length of 7680
Elapsed milliseconds / relative speed for 512 iteration(s):
    (CAB:UNIQUE:V3 A).......15203 / 5.43 <fastest>
    (CD:MERGE-GRIDS2 A).....57875 / 1.43
    (LM:GROUP A)............60313 / 1.37
    (IN:MERGE-GRIDS A)......82578 / 1.00 <slowest>

Lee Mac · « **Reply #20 on:** December 14, 2012, 02:07:00 PM »

But surely we are comparing apples with oranges here, since CAB's code will remove multiple duplicates, e.g. consider:

Code - Auto/Visual Lisp: [Select]

_$ (CAB:unique:v3 '(((1 0) "a" "1-0") ((1 0) "a" "1-0") ((1 0) "a" "1-0") ((2 0) "b" "2-0")))
(
    ((2 0) "b" "2-0")
    ((1 0) "a" "1-0\n1-0")
)
 
_$ (LM:group '(((1 0) "a" "1-0") ((1 0) "a" "1-0") ((1 0) "a" "1-0") ((2 0) "b" "2-0")))
(
    ((2 0) "b" "2-0")
    ((1 0) "a" "1-0\n1-0\n1-0")
)

Without the overhead required to allow for such cases, the function will inevitably be much faster.
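
One possible way to keep the assoc/while loop and still merge every exact duplicate is to remove only the first matching record on each pass instead of calling vl-remove. The following is a rough sketch, not benchmarked, so whether any of the speed advantage survives is unknown. The names `remove-first` and `unique-keep-all` are hypothetical and do not come from the thread.

```autolisp
;; Hypothetical helper: remove only the FIRST element of lst equal to x.
(defun remove-first (x lst / out done)
  (foreach e lst
    (if (and (not done) (equal e x))
      (setq done T)               ; skip this one element only
      (setq out (cons e out))     ; keep everything else
    )
  )
  (reverse out)
)

;; Same loop as CAB:unique:v3, with vl-remove swapped for remove-first.
(defun unique-keep-all (lst / result itm doup)
  (while (setq itm (car lst))
    (setq lst (cdr lst))
    (while (setq doup (assoc (car itm) lst))
      (setq lst (remove-first doup lst)
            itm (reverse (cons (strcat (last itm) "\n" (last doup)) (cdr (reverse itm))))
      )
    )
    (setq result (cons itm result))
  )
  result
)

(unique-keep-all '(((1 0) "a" "1-0") ((1 0) "a" "1-0") ((1 0) "a" "1-0") ((2 0) "b" "2-0")))
;; => (((2 0) "b" "2-0") ((1 0) "a" "1-0\n1-0\n1-0"))
```

Like the other variants without a final reverse, the result comes back in reverse order.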

CADDOG · « **Reply #21 on:** December 14, 2012, 04:48:30 PM »

Quote from: Lee Mac on December 14, 2012, 02:07:00 PM

But surely we are comparing apples with oranges here, since CAB's code will remove multiple duplicates, e.g. consider:

Yes. Didn't mean to imply CAB's as my final use solution. CAB's is not currently a viable solution, because pure duplicates will be lost completely, and I want the text, just not the "extra" data on the duplicate coords.
I was keeping it as an option in case it can be tweaked to not lose pure duplicates' text, because that 7-fold speedup is tempting.
And because I subconsciously probably wanted to give some credit for his piece that fit nicely into the other's to bring them up to par. :kewl:

As for the "extra" data loss on duplicates, that is a desirable effect. The order of the supplied live list is already determined by another function, and the first one in a set will always hold the defining "extra" data for other duplicate coords. I intend to use that field with formatting options eventually.

In the run-off, though, Lee's looks like the one I'll be using (and am already using). I don't foresee ever having data exceeding 1500 entries, and even then the breaking point for mine to become more efficient is extremely large lists (5000+) that I will never use.

In the grand scheme of things, we're talking a 12 millisecond spread for live data between the worst (mine) and the best (LM). And it was worth the ride to get here.
