Author Topic: WCMatch Search Challenge  (Read 7129 times)

0 Members and 1 Guest are viewing this topic.

CAB

  • Global Moderator
  • Seagull
  • Posts: 10401
WCMatch Search Challenge
« on: November 06, 2012, 07:33:42 PM »
The challenge is to return one or a group of matches from a user query.

The code I have here reads a data file, text file that is TAB delimited, with the first item in the list is the key.
My search routine only gets exact matches.
The file I am using contains the SIMPSON part number followed by the Florida Code Approval number.
In some cases only the first few letters may be known so the object is to get a group of potential matches
for the user to pick from.
As stated the challenge is only to return that group of matches and nothing more.

An example would be "LSTA9*" which would only return one item because there are no other near matches.
Using "HUC210*" would return the following:
Code: [Select]
(("HUC210-2" "10531.28")
 ("HUC210-2" "10655.57")
 ("HUC210-3" "10655.58"))
With 3 potential matches.


Code - Auto/Visual Lisp: [Select]
  1. (defun c:Simpson(/ DataList)
  2.   ;;  CAB 11/06/2012
  3.   ;;  Binary Search for string match, ("target str" more data)
  4.   (defun search(data key / result imin imax imid itm)
  5.     (setq imin 0 imax (length data))
  6.     ;; continue searching while [imin,imax] is not empty
  7.     (while (and (not result)(>= imax imin))
  8.         (setq imid (/ (+ imin imax) 2)) ; calculate the midpoint
  9.         ;; determine which subarray to search
  10.         (setq itm (car (nth imid data)))
  11.         (cond
  12.           ((> key itm) (setq imin (1+ imid))) ; change min index to search upper subarray
  13.           ((> itm key) (setq imax (1- imid))) ; change max index to search lower subarray
  14.           (t (setq result (nth imid data)))   ; key found
  15.         )
  16.     )
  17.     result
  18.   )      
  19.  
  20.   (defun ReadFileData(filename / Data FileData handle stream result itm idx)
  21.     ;;  Tab delimited file
  22.     (if
  23.       (and
  24.         (eq 'str (type filename))
  25.         (setq filename (findfile filename))
  26.         (setq handle (open filename "r"))
  27.       )
  28.        (progn
  29.          (while (setq stream (read-line handle))
  30.                   (setq  FileData (cons stream FileData))
  31.          ) ; while
  32.          (close handle)
  33.        ) ; progn
  34.     ) ; endif
  35.     ;;  make a list of pairs
  36.     (foreach itm FileData
  37.       (setq Data (cons (list (substr itm 1 (setq idx (vl-string-position 9 itm))) (substr itm (+ idx 2))) Data))
  38.     )
  39.     Data
  40.   )
  41.  
  42.  
  43.   (setq DataList (ReadFileData "C:/Users/Alan/Documents/Simpson FL Numbers2012.txt"))
  44.  
  45.   ;;  test search
  46.   (print (search datalist "CS14-R"))
  47.   (print (search datalist "HGUS414"))
  48.   (print (search datalist "HUC210*"))     ; wildcard search
  49.   (print (search datalist "WPI49.25"))
  50.   (princ)
  51. ) ; oef
I've reached the age where the happy hour is a nap. ()
Windows 10 core i7 4790k 4Ghz 32GB GTX 970
Please support this web site.

pBe

  • Bull Frog
  • Posts: 402
Re: WCMatch Search Challenge
« Reply #1 on: November 06, 2012, 11:56:04 PM »
A quick one using vl-remove-if-not & wcmatch

Code - Auto/Visual Lisp: [Select]
  1. (defun c:Homer ( / _delFinder)
  2. (defun _delFinder  (str md / d l str)
  3.       (while (setq d (vl-string-position md str nil T))
  4.             (setq l (cons (substr str (+ 2 d)) l)
  5.                   str (substr str 1 d)))
  6.       (cons str l)
  7.       )  
  8. (if  (setq data nil
  9.            fl   (findfile
  10.                   "C:/Users/Alan/Documents/Simpson FL Numbers2012.txt"
  11.                 )
  12.      )
  13.   (progn
  14.         (setq openfile (open fl "r"))
  15.         (while (setq lines (read-line openfile))
  16.         (setq data (cons (_delfinder lines 9) data))
  17.         )
  18.         (close openfile)
  19.         (foreach itm '("CS14-R" "HGUS414" "HUC210*" "WPI49.25" "LSTA9*")
  20.             (if (setq fc (vl-remove-if-not
  21.                            '(lambda (x)
  22.                               (wcmatch (Car x) itm)
  23.                             )
  24.                            data
  25.                          )
  26.                 )
  27.               (print fc)
  28.             )
  29.           )
  30.     )
  31.   )
  32.   (princ)
  33.   )

« Last Edit: November 07, 2012, 12:37:06 AM by pBe »

irneb

  • Water Moccasin
  • Posts: 1794
  • ACad R9-2016, Revit Arch 6-2016
Re: WCMatch Search Challenge
« Reply #2 on: November 07, 2012, 04:56:44 AM »
You could of course try regular expressions. E.g. using my wrapper function RX:Search. And if you combine it with the idea from the RX:CSV-Token function ... you can actually have it extract only the lines which comply to the match and only tokenize them into the data list.
Common sense - the curse in disguise. Because if you have it, you have to live with those that don't.

Lee Mac

  • Seagull
  • Posts: 12892
  • London, England
Re: WCMatch Search Challenge
« Reply #3 on: November 07, 2012, 06:37:36 AM »
Similar to pBe's code:

Code - Auto/Visual Lisp: [Select]
  1. (defun c:simpson ( / f i l r s )
  2.     (if (and
  3.             (setq f (findfile "Simpson FL Numbers2012.txt"))
  4.             (setq f (open f "r"))
  5.         )
  6.         (progn
  7.             (while (setq r (read-line f))
  8.                 (setq l (cons (list (substr r 1 (setq i (vl-string-position 9 r))) (substr r (+ i 2))) l))
  9.             )
  10.             (setq l (reverse l))
  11.             (close f)
  12.             (while (/= "" (setq s (getstring t "\nSpecify Pattern <Exit>: ")))
  13.                 (if (not (mapcar 'print (vl-remove-if-not '(lambda ( x ) (wcmatch (car x) s)) l)))
  14.                     (princ "\nPattern not found.")
  15.                 )
  16.             )
  17.         )
  18.     )
  19.     (princ)
  20. )

CAB

  • Global Moderator
  • Seagull
  • Posts: 10401
Re: WCMatch Search Challenge
« Reply #4 on: November 07, 2012, 09:07:33 AM »
Good job all.
Very concise code Lee.

It appears that iterating the data set for each item to be found is not going to be too slow.
That was my first thought as the data set is at 2000+ and could grow to 10,000 with other vendors added.
That is why I did the binary search.

Thanks


PS iteration does allow for a non sorted data set.
I've reached the age where the happy hour is a nap. ()
Windows 10 core i7 4790k 4Ghz 32GB GTX 970
Please support this web site.

CAB

  • Global Moderator
  • Seagull
  • Posts: 10401
Re: WCMatch Search Challenge
« Reply #5 on: November 07, 2012, 09:09:26 AM »
irneb that too is a good option, Thanks
I've reached the age where the happy hour is a nap. ()
Windows 10 core i7 4790k 4Ghz 32GB GTX 970
Please support this web site.

CAB

  • Global Moderator
  • Seagull
  • Posts: 10401
Re: WCMatch Search Challenge
« Reply #6 on: November 07, 2012, 10:34:07 AM »
This is what I settled on:
Code - Auto/Visual Lisp: [Select]
  1.   (setq DataList (ReadFileData "C:/Users/Alan/Documents/Simpson FL Numbers2012.txt"))
  2.  
  3.   (setq targets '("CS14-R" "HGUS414" "HUC210*" "WPI49.25" "LSTA*"))
  4.  
  5.   ;;  test search
  6.   (setq results
  7.          (vl-remove-if-not
  8.             (function (lambda (x)
  9.                 (vl-some (function (lambda (y) (wcmatch (car x) y))) targets)))
  10.                        DataList))
  11.   (mapcar 'print results)

Code: [Select]
("CS14-R" "10852.1")
("HGUS414" "10531.3")
("HGUS414" "11468.1")
("HUC210-2" "10531.28")
("HUC210-2" "10655.57")
("HUC210-3" "10655.58")
("LSTA12" "10852.4")
("LSTA15" "10852.4")
("LSTA18" "10852.4")
("LSTA21" "10852.4")
("LSTA24" "10852.4")
("LSTA30" "10852.4")
("LSTA36" "10852.4")
("LSTA9" "10852.5")
("WPI49.25" "10667.108")

Thanks again for playing along.  8-)
I've reached the age where the happy hour is a nap. ()
Windows 10 core i7 4790k 4Ghz 32GB GTX 970
Please support this web site.

VovKa

  • Water Moccasin
  • Posts: 1616
  • Ukraine
Re: WCMatch Search Challenge
« Reply #7 on: November 07, 2012, 11:10:45 AM »
Code: [Select]
(setq targets "CS14-R,HGUS414,HUC210*,WPI49.25,LSTA*"))
(wcmatch (car x) targets)

Lee Mac

  • Seagull
  • Posts: 12892
  • London, England
Re: WCMatch Search Challenge
« Reply #8 on: November 07, 2012, 02:07:20 PM »
Alan,

Since you are looking for performance, the 'ReadFileData' function could be optimised to iterate the data set only once, rather than twice as per your current code:

Code - Auto/Visual Lisp: [Select]
  1. (defun ReadFileData ( f / i l x )
  2.     (if (and
  3.             (setq f (findfile f))
  4.             (setq f (open f "r"))
  5.         )
  6.         (progn
  7.             (while (setq x (read-line f))
  8.                 (setq l (cons (list (substr x 1 (setq i (vl-string-position 9 x))) (substr x (+ i 2))) l))
  9.             )
  10.             (close f)
  11.             (reverse l)
  12.         )
  13.     )
  14. )



Also, if the search pattern is fixed, the function could be optimised further still to incorporate the wcmatch test:

Code - Auto/Visual Lisp: [Select]
  1. (defun SearchFile ( f / l x )
  2.     (if (and
  3.             (setq f (findfile f))
  4.             (setq f (open f "r"))
  5.         )
  6.         (progn
  7.             (while (setq x (read-line f))
  8.                 (if (wcmatch x "CS14-R\t*,HGUS414\t*,HUC210*,WPI49.25\t*,LSTA*")
  9.                     (setq l (cons x l))
  10.                 )
  11.             )
  12.             (close f)
  13.             (reverse l)
  14.         )
  15.     )
  16. )

Note that I removed the check for valid argument data type as, in my opinion, if the calling function passes the subfunction an incorrect data type, an error should result.


CAB

  • Global Moderator
  • Seagull
  • Posts: 10401
Re: WCMatch Search Challenge
« Reply #9 on: November 07, 2012, 02:54:00 PM »
Ah the wcmatch trick is a good one. I've used it before but forgot due to lack of use.

I've never tested it but thought reading the file without processing would be faster than reading & then processing.
I guess it would depend on the processing taking long enough to delay the next read.

Thanks for your input.  8-)
I've reached the age where the happy hour is a nap. ()
Windows 10 core i7 4790k 4Ghz 32GB GTX 970
Please support this web site.

irneb

  • Water Moccasin
  • Posts: 1794
  • ACad R9-2016, Revit Arch 6-2016
Re: WCMatch Search Challenge
« Reply #10 on: November 07, 2012, 04:34:45 PM »
Using my idea, reading the file in one go (not line by line). Then using RegEx to extract only the lines matching the pattern, then using RegEx on those lines to parse the tab delimiters:
Code - Auto/Visual Lisp: [Select]
  1.  
  2. (defun RX:Start  (/)
  3.   (if (or (not *RX-Link*)
  4.           (not (setq *RX-Link* (vl-bb-ref '*RX-Link*)))
  5.           (vlax-object-released-p *RX-Link*))
  6.     (progn (vl-bb-set '*RX-Link* (setq *RX-Link* (vlax-get-or-create-object "VBScript.RegExp")))
  7.            (RX:Set t t t)))
  8.   *RX-Link*)
  9.  
  10. (defun RX:Stop  (/)
  11.   (if (and (or *RX-Link* (setq *RX-Link* (vl-bb-ref '*RX-Link*)))
  12.            (not (vlax-object-released-p *RX-Link*)))
  13.     (vlax-release-object *RX-Link*))
  14.   (setq *RX-Link* nil)
  15.   (vl-bb-set '*RX-Link* nil)
  16.   *RX-Link*)
  17.  
  18. (defun RX:Set  (multiline global ignorecase)
  19.   (if *RX-Link*
  20.     (progn (vlax-put *RX-Link* 'Multiline (if multiline -1 0))
  21.            (vlax-put *RX-Link* 'Global (if global -1 0))
  22.            (vlax-put *RX-Link* 'IgnoreCase (if ignorecase -1 0)))))
  23.  
  24. (defun RX:Search  (string pattern / lst col)
  25.   (if (RX:Start)
  26.     (progn (vlax-Put *RX-Link* 'Pattern pattern)
  27.            (setq col (vl-catch-all-apply 'vlax-Invoke (list *RX-Link* 'Execute string)))
  28.            (if (not (vl-catch-all-error-p col))
  29.              (progn (vlax-for item  col
  30.                       (setq lst (cons (cons (vlax-Get item 'FirstIndex) (vlax-Get item 'Value)) lst))
  31.                       (vlax-release-object item))
  32.                     (vlax-release-object col)))
  33.            (reverse lst))))
  34.  
  35. (defun TabToken-Match  (source match /)
  36.   (mapcar '(lambda (row) (mapcar 'cdr (RX:Search (cdr row) "[^\t]+|\t(?=\t)|\t$")))
  37.           (RX:Search source (strcat "[^\n\r]*(" match ")[^\n\r]*"))))
  38.  
  39. (defun ReadTabFile-Match  (filePath match / f fs str)
  40.   (if (setq filePath (findfile filePath))
  41.     (progn (setq fs  (vlax-get-or-create-object "Scripting.FileSystemObject")
  42.                  f   (vlax-invoke fs "OpenTextFile" filePath 1 2)
  43.                  str (vlax-invoke f "ReadAll"))
  44.            (vlax-invoke f "Close")
  45.            (vlax-release-object f)
  46.            (vlax-release-object fs)
  47.            (TabToken-Match str match))))
The pattern is a bit different from the normal wcmatch though. Instead of , you use |. And to match any character 0 or more times you use .* in combination so it's something like this:
Code: [Select]
(ReadTabFile-Match "Simpson FL Numbers2012.txt" "CS14-R|HGUS414|HUC210|WPI49\\.25|LSTA")
Edit: Allowed option of testing for partial not at start of line and also testing for anything without getting the "\r" in the result.
« Last Edit: November 07, 2012, 05:18:51 PM by irneb »
Common sense - the curse in disguise. Because if you have it, you have to live with those that don't.

Lee Mac

  • Seagull
  • Posts: 12892
  • London, England
Re: WCMatch Search Challenge
« Reply #11 on: November 07, 2012, 05:47:17 PM »
I've never tested it but thought reading the file without processing would be faster than reading & then processing.
I guess it would depend on the processing taking long enough to delay the next read.

This is how I [rightly or wrongly] think about the process:

Since AutoLISP doesn't permit multiple processor threads, everything is evaluated in sequence, one function after another, hence, each read-line expression will not be evaluated until the previous expression has completed evaluation.

One cannot avoid the string processing time - in both examples the string must be separated into two items and hence this processing time is incurred whether performed within the first while loop, or within a secondary loop; the only difference being that in your code the program must first construct the list before iterating over the same data again to process each string. Whereas in my example, the string processing is performed whilst the list is constructed, and so the data set is only being iterated once.

Though, I've also never tested this theory.  :-)

Thanks for your input.  8-)

Anytime my friend  8-)

pBe

  • Bull Frog
  • Posts: 402
Re: WCMatch Search Challenge
« Reply #12 on: November 08, 2012, 12:31:32 AM »
... I've never tested it but thought reading the file without processing would be faster than reading & then processing....

Process at read time

Code - Auto/Visual Lisp: [Select]
  1. (defun readata (f lst / conme tar y data result)
  2. (setq conme (lambda (r d)
  3.               (cons (list (substr r 1 (setq i (vl-string-position 9 r)))
  4.                           (substr r (+ i 2))) d)))
  5.   (if (and  (setq f (findfile f))
  6.             (setq f (open f "r"))
  7.         )
  8.     (progn
  9.         (while (setq y nil tar (car lst))
  10.                 (while (and (setq x (read-line f)) (not y))
  11.                         (if (wcmatch x tar)
  12.                                 (progn
  13.                                   (setq data (conme x data))
  14.                                   (while (wcmatch (setq x (read-line f)) tar)
  15.                                     (setq data (conme x data))
  16.                                   )
  17.                                   ;;;   For Testing purposes    ;;;
  18.                                   (foreach itm  (reverse data)
  19.                                         (print itm))
  20.                                   ;;;                           ;;;
  21.                                   (setq y t result (cons data result) data nil)
  22.                                  
  23.                                 )
  24.                           ))
  25.           (setq lst (cdr lst))
  26.           )
  27.   (close f)
  28.     )
  29.     )
  30. ;;;     result
  31.   (princ)
  32.   )


Code: [Select]
(setq lst '("CS14-R\t*" "HGUS414\t*" "HUC210*" "LSTA*" "WPI49.25\t*"))
(readata f lst)
Code: [Select]
("CS14-R" "10852.1")
("HGUS414" "10531.3")
("HGUS414" "11468.1")
("HUC210-2" "10531.28")
("HUC210-2" "10655.57")
("HUC210-3" "10655.58")
("LSTA12" "10852.4")
("LSTA15" "10852.4")
("LSTA18" "10852.4")
("LSTA21" "10852.4")
("LSTA24" "10852.4")
("LSTA30" "10852.4")
("LSTA36" "10852.4")
("LSTA9" "10852.5")
("WPI49.25" "10667.108")

Last X value: "WPI49.5\t10667.108"

Code: [Select]
(setq lst '("CS14-R\t*" "HGUS414\t*" "HUC210*" ))(readata f lst)("CS14-R" "10852.1")
("HGUS414" "10531.3")
("HGUS414" "11468.1")
("HUC210-2" "10531.28")
("HUC210-2" "10655.57")
("HUC210-3" "10655.58")

last X value: "HUC212-2\t10655.59"

Totally untested , it should stop reading the file  after the last element on the lst list. (assuming text source is sorted)

pBe

  • Bull Frog
  • Posts: 402
Re: WCMatch Search Challenge
« Reply #13 on: November 08, 2012, 01:03:44 AM »

.... (progn (setq fs  (vlax-get-or-create-object "Scripting.FileSystemObject")
                 f   (vlax-invoke fs "OpenTextFile" filePath 1 2)
                 str (vlax-invoke f "ReadAll"))


I didn't know that  :-)

irneb

  • Water Moccasin
  • Posts: 1794
  • ACad R9-2016, Revit Arch 6-2016
Re: WCMatch Search Challenge
« Reply #14 on: November 08, 2012, 02:30:56 AM »
I didn't know that  :)
It's nice, but keep in mind it uses more RAM than working line-by-line. It "shouldn't" be a problem though. I think we've found somewhere else that ALisp has a max string size of around 500MB: so unless your text file is larger than this, I don't see a reason to not use this idea.
Common sense - the curse in disguise. Because if you have it, you have to live with those that don't.