Author Topic: Find duplicates (and do it quickly)  (Read 13023 times)

0 Members and 1 Guest are viewing this topic.

kaefer

  • Guest
Find duplicates (and do it quickly)
« on: November 24, 2010, 02:41:38 AM »
Hi all,

this is in reference to http://www.theswamp.org/index.php?topic=31859.msg410457#msg410457, where Gile proposes to find pairs which satisfy a predicate by enumerating each pair and test it subsequently. For small numbers of candidates that would be acceptable, but consider, say, 10000 objects, then we're talking about 50 million tests, i.e. C(n,2). Even while triangular, it runs still in quadratic time.

A more linear approach would involve sorting or grouping by a metric involving those object properties that have to be different for their objects to be considered different. The predicate would have to return a value comparable to other values, instead of doing the comparison itself, and then we need higher order functions to do the actual sorting/grouping.

This is where F# comes into play: it supports structural comparison, and it has the necessary higher order functions in its Seq module. Left as an exercise for the reader is the equivalent C# code, most certainly involving LINQ.

Code: [Select]
let duplicate (b: BlockReference) =
    b.Name,
    b.OwnerId,
    b.Position,
    b.Layer,
    Math.Round(b.Rotation, 6),
    b.ScaleFactors

let deleteDuplicatedBlockRef() =
    let db = HostApplicationServices.WorkingDatabase
    use tr = db.TransactionManager.StartTransaction()
    
    let getRefs btrId =
        let btr = tr.GetObject(btrId, OpenMode.ForRead) :?> BlockTableRecord
        btr.GetBlockReferenceIds(true, false)
        |> Seq.cast<_>
    let openRefs brefId =
        tr.GetObject(brefId, OpenMode.ForRead) :?> BlockReference
    let skipFirst = function
        | LazyList.Nil -> Seq.empty
        | LazyList.Cons(_, xs) -> LazyList.toSeq xs
    
    let groupRefsAndDelete =
        Seq.cast<_>
        >> Seq.collect getRefs
        >> Seq.map openRefs
        >> Seq.groupBy duplicate
        >> Seq.map (snd >> LazyList.ofSeq)
        >> Seq.collect skipFirst
        >> Seq.fold
            (fun cnt bref ->
                bref.UpgradeOpen()
                bref.Erase()
                cnt + 1 ) 0
                
    let cnt =
        tr.GetObject(db.BlockTableId, OpenMode.ForRead) :?> BlockTable
        |> groupRefsAndDelete
    tr.Commit()
    cnt

I do not claim that this is an shining example of performance, there's certainly avoidable overhead in the flattening of collections and the conversions to and fro the LazyList type (which lives in FSharp.PowerPack.dll, don't forget to reference it).

Not implemented yet is also the possibility to supply a tolerance for comparison of properties like Position or ScaleFactors.

All the best, Thorsten
« Last Edit: November 24, 2010, 02:50:41 AM by kaefer »

gile

  • Gator
  • Posts: 2507
  • Marseille, France
Re: Find duplicates (and do it quickly)
« Reply #1 on: November 24, 2010, 09:29:37 AM »
Thank you very much, kaefer, my F# teacher...
Speaking English as a French Frog

gile

  • Gator
  • Posts: 2507
  • Marseille, France
Re: Find duplicates (and do it quickly)
« Reply #2 on: November 24, 2010, 02:09:22 PM »
keafer,

Studying your code, I learned some of the Seq module (higher order) functions today, I'll have a look around the LazyList later.

Here's what I'm able to write now:

Code: [Select]
let duplicate (b: BlockReference) =
    b.Name,
    b.OwnerId,
    Math.Round(b.Position.X, 6),
    Math.Round(b.Position.Y, 6),
    Math.Round(b.Position.Z, 6),
    b.Layer,
    Math.Round(b.Rotation, 6),
    Math.Round(b.ScaleFactors.X, 6),
    Math.Round(b.ScaleFactors.Y, 6),
    Math.Round(b.ScaleFactors.Z, 6) 

let deleteDuplicatedBlockRef() =
    let db = HostApplicationServices.WorkingDatabase
    use tr = db.TransactionManager.StartTransaction()

    let getRefs btrId =
        let btr = tr.GetObject(btrId, OpenMode.ForRead) :?> BlockTableRecord
        btr.GetBlockReferenceIds(true, false)
        |> Seq.cast<_>
       
    let openRefs brefId =
        tr.GetObject(brefId, OpenMode.ForRead) :?> BlockReference
       
    let cnt = tr.GetObject(db.BlockTableId, OpenMode.ForRead) :?> BlockTable
              |> Seq.cast<_>
              |> Seq.collect getRefs
              |> Seq.map openRefs
              |> Seq.groupBy duplicate
              |> Seq.collect (fun x -> snd x |> Seq.skip 1)
              |> Seq.fold (fun cnt bref ->
                            bref.UpgradeOpen()
                            bref.Erase()
                            cnt + 1 ) 0
    tr.Commit()
    cnt

More and more, I think I like this language...
« Last Edit: November 24, 2010, 04:00:54 PM by gile »
Speaking English as a French Frog

It's Alive!

  • Retired
  • Needs a day job
  • Posts: 8658
  • AKA Daniel
Re: Find duplicates (and do it quickly)
« Reply #3 on: November 24, 2010, 06:41:18 PM »
So how fast is it? do you have a sample DWG to test on?

LE3

  • Guest
Re: Find duplicates (and do it quickly)
« Reply #4 on: November 24, 2010, 07:17:08 PM »
Not related with the function in specific, but found this:

C++ vs C# vs F# vs Haskell
http://www.trelford.com/blog/post/C2b2b-vs-C-vs-F-vs-Haskell.aspx

It's Alive!

  • Retired
  • Needs a day job
  • Posts: 8658
  • AKA Daniel
Re: Find duplicates (and do it quickly)
« Reply #5 on: November 24, 2010, 08:09:45 PM »
Might be a nice Challenge. I wonder how ARX would stack up?

F# Man, please post yor test data  :laugh:

Jeff H

  • Needs a day job
  • Posts: 6144
Re: Find duplicates (and do it quickly)
« Reply #6 on: November 24, 2010, 08:12:28 PM »
This is first thing to come to mind real quick, but this is a incomplete thought.

Try to set up a algorithm to only have loop through each set of blockreferences once

Something like create a dictionary<point3d , point3d>
Let key be the insertion point
The Values  x = LayerObjectID
The Values  Y = rotation
.........
Something like that

 and check if it exists and if so erase it.
I am already thinking of things that would need to be checked are make a false duplicate but might spark a idea.

Kerry

  • Mesozoic relic
  • Seagull
  • Posts: 11654
  • class keyThumper<T>:ILazy<T>
Re: Find duplicates (and do it quickly)
« Reply #7 on: November 24, 2010, 08:18:03 PM »

Mirrored blocks will need to be accounted for also ... perhaps
kdub, kdub_nz in other timelines.
Perfection is not optional.
Everything will work just as you expect it to, unless your expectations are incorrect.
Discipline: None at all.

rom1_am

  • Guest
Re: Find duplicates (and do it quickly)
« Reply #8 on: November 25, 2010, 01:52:59 AM »
Try to set up a algorithm to only have loop through each set of blockreferences once

Something like create a dictionary<point3d , point3d>

Hi Jeff F,

I tried to implement that kind of solution, you can have a look at the code in this post http://www.acadnetwork.com/boardseen,/topic-51.msg79.html#new.

I used a structure for the key of my dictionary and i've implemented the 'equals' function to make the test as i wanted to do. But i don't really understand how to use the GetHashCode function and if i need tu use it.

Thanks.

(Edit: I think the post of Gile in the topic of AcadNetWork answers to my question)
« Last Edit: November 25, 2010, 02:14:11 AM by rom1_am »

Jeff H

  • Needs a day job
  • Posts: 6144
Re: Find duplicates (and do it quickly)
« Reply #9 on: November 25, 2010, 02:58:36 AM »
A structure is a value type and derives from Sytem.ValueType
And the Sytem.ValueType will override the System.Object Equals to use value-base instead of reference-based(same item in memory) to check.


Here is a simple Console Application and the struct Equals will return true and the class will return false.

It contains a struct and a class both with the same two properties
Then creates two struct and two class instances and sets all the properties the same to show how a value type overrides Equals
 
Code: [Select]

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication4
{
    class Program
    {
        static void Main(string[] args)
        {
            EqualsWillUseReferece classVariable1 = new EqualsWillUseReferece();           
            classVariable1.int1 = 15;
            classVariable1.str1 = "string";

            EqualsWillUseReferece classVariable2 = new EqualsWillUseReferece();
            classVariable2.int1 = 15;
            classVariable2.str1 = "string";

            EqulsWillUseValues stuctVariable1 = new EqulsWillUseValues();
            stuctVariable1.int1 = 15;
            stuctVariable1.str1 = "string";

            EqulsWillUseValues stuctVariable2 = new EqulsWillUseValues();
            stuctVariable2.int1 = 15;
            stuctVariable2.str1 = "string";


            if (classVariable1.Equals(classVariable2))
            {
                Console.WriteLine("Class variables are Equal");
            }
            else
            {
                Console.WriteLine("Class variables are Not Equal");
            }

            if (stuctVariable1.Equals(stuctVariable2))
            {
                Console.WriteLine("Struc variables are Equal");
            }
            else
            {
                Console.WriteLine("Struc variables are Not Equal");
            }

            Console.ReadLine();
        }
    }

    public class EqualsWillUseReferece
    {
        public int int1;
        public string str1;
    }

    public struct EqulsWillUseValues
    {
        public int int1;
        public string str1;

    }
}


Jeff H

  • Needs a day job
  • Posts: 6144
Re: Find duplicates (and do it quickly)
« Reply #10 on: November 25, 2010, 03:21:54 AM »
Try to set up a algorithm to only have loop through each set of blockreferences once

Something like create a dictionary<point3d , point3d>

Hi Jeff F,

I tried to implement that kind of solution, you can have a look at the code in this post http://www.acadnetwork.com/boardseen,/topic-51.msg79.html#new.

I used a structure for the key of my dictionary and i've implemented the 'equals' function to make the test as i wanted to do. But i don't really understand how to use the GetHashCode function and if i need tu use it.

Thanks.

(Edit: I think the post of Gile in the topic of AcadNetWork answers to my question)

When you override Equls you override GetHashCode because when you store the objects in a HashTable it calls Equals() and GetHashCode() to get the object.
By default it uses the objects memory location to produce the Hashcode

rom1_am

  • Guest
Re: Find duplicates (and do it quickly)
« Reply #11 on: November 25, 2010, 03:59:13 AM »
Ok, Thanks.

So I think that the code given in the answer #6 in this topic http://www.acadnetwork.com/topic-51.0.html corresponds exactly to what i wanted to do.

Thanks a lot.

gile

  • Gator
  • Posts: 2507
  • Marseille, France
Re: Find duplicates (and do it quickly)
« Reply #12 on: November 25, 2010, 05:17:25 AM »
Hi,

If someones want to play, I attached a test dwg file.
The comparison criteria are those given by rom1_am:
- Name
- Layer
- Position (tolerance 1e-6 5e-7)
- Rotation (tolerance 1e-6 5e-7)
- ScaleFactors (tolerance 1e-6 5e-7)

The function returns the number of erased blocks.

A C# purpose
Code: [Select]
using System;
using System.Collections.Generic;
using Autodesk.AutoCAD.ApplicationServices;
using Autodesk.AutoCAD.DatabaseServices;
using Autodesk.AutoCAD.EditorInput;
using Autodesk.AutoCAD.Geometry;
using Autodesk.AutoCAD.Runtime;

namespace DuplicateTest
{
    public class Commands
    {
        struct DuplicatePattern
        {
            public string Name;
            public string Layer;
            public ObjectId OwnerId;
            public double PositionX;
            public double PositionY;
            public double PositionZ;
            public double Rotation;
            public double ScaleX;
            public double ScaleY;
            public double ScaleZ;
        }

        private int DeleteDuplicatedBlockRef()
        {
            Document doc = Application.DocumentManager.MdiActiveDocument;
            Editor ed = doc.Editor;
            Database db = doc.Database;
            int result = 0;
            SelectionFilter filter = new SelectionFilter(new TypedValue[1] { new TypedValue(0, "INSERT") });
            PromptSelectionResult psr = ed.SelectAll(filter);
            if (psr.Status != PromptStatus.OK)
                return 0;
            using (Transaction tr = db.TransactionManager.StartTransaction())
            {
                BlockTable bt = (BlockTable)tr.GetObject(db.BlockTableId, OpenMode.ForRead);
                List<DuplicatePattern> Patterns = new List<DuplicatePattern>();
                foreach (ObjectId id in psr.Value.GetObjectIds())
                {
                    BlockReference br = (BlockReference)tr.GetObject(id, OpenMode.ForRead, false);
                    DuplicatePattern dp = new DuplicatePattern();
                    dp.OwnerId = br.OwnerId;
                    dp.Name = br.Name;
                    dp.Layer = br.Layer;
                    dp.PositionX = Math.Round(br.Position.X, 6);
                    dp.PositionY = Math.Round(br.Position.Y, 6);
                    dp.PositionZ = Math.Round(br.Position.Z, 6);
                    dp.Rotation = Math.Round(br.Rotation, 6);
                    dp.ScaleX = Math.Round(br.ScaleFactors.X, 6);
                    dp.ScaleY = Math.Round(br.ScaleFactors.Y, 6);
                    dp.ScaleZ = Math.Round(br.ScaleFactors.Z, 6);
                    if (Patterns.Contains(dp))
                    {
                        br.UpgradeOpen();
                        br.Erase();
                        result++;
                    }
                    else
                        Patterns.Add(dp);
                }
                tr.Commit();
            }
            return result;
        }

        [CommandMethod("Cs_gile")]
        public void DelDupBlkRef()
        {
            DateTime t0 = DateTime.Now;
            int del = DeleteDuplicatedBlockRef();
            DateTime t1 = DateTime.Now;
            TimeSpan ts = t1 - t0;
            Application.DocumentManager.MdiActiveDocument.Editor.WriteMessage(
                "{0} duplicated block(s) have been erased in {1} milliseconds",
                del, ts.TotalMilliseconds);
        }
    }
}

The F# solution given upper
Code: [Select]
module DupBlks

open System
open System.Collections.Generic
open Autodesk.AutoCAD.ApplicationServices
open Autodesk.AutoCAD.DatabaseServices
open Autodesk.AutoCAD.EditorInput
open Autodesk.AutoCAD.Geometry
open Autodesk.AutoCAD.Runtime

let duplicate (b: BlockReference) =
    b.Name,
    b.OwnerId,
    Math.Round(b.Position.X, 6),
    Math.Round(b.Position.Y, 6),
    Math.Round(b.Position.Z, 6),
    b.Layer,
    Math.Round(b.Rotation, 6),
    Math.Round(b.ScaleFactors.X, 6),
    Math.Round(b.ScaleFactors.Y, 6),
    Math.Round(b.ScaleFactors.Z, 6) 

let deleteDuplicatedBlockRef() =
    let db = HostApplicationServices.WorkingDatabase
    use tr = db.TransactionManager.StartTransaction()

    let getRefs btrId =
        let btr = tr.GetObject(btrId, OpenMode.ForRead) :?> BlockTableRecord
        btr.GetBlockReferenceIds(true, false)
        |> Seq.cast<_>
       
    let openRefs brefId =
        tr.GetObject(brefId, OpenMode.ForRead) :?> BlockReference
       
    let cnt = tr.GetObject(db.BlockTableId, OpenMode.ForRead) :?> BlockTable
              |> Seq.cast<_>
              |> Seq.collect getRefs
              |> Seq.map openRefs
              |> Seq.groupBy duplicate
              |> Seq.collect (fun x -> snd x |> Seq.skip 1)
              |> Seq.fold (fun cnt bref ->
                            bref.UpgradeOpen()
                            bref.Erase()
                            cnt + 1 ) 0
    tr.Commit()
    cnt
   
[<CommandMethod("Fs_gile")>]
let Test() =
    let t0 = DateTime.Now
    let del = deleteDuplicatedBlockRef()
    let t1 = DateTime.Now
    let ts = t1 - t0
    Application.DocumentManager.MdiActiveDocument.Editor.WriteMessage(
        "{0} duplicated block(s) have been erased in {1} milliseconds", 
        del, ts.TotalMilliseconds)

Results:
Quote
Commande: FS_GILE
10134 duplicated block(s) have been erased in 221.0126 milliseconds

Quote
Commande: CS_GILE2
10134 duplicated block(s) have been erased in 1372.8024 milliseconds
« Last Edit: November 25, 2010, 07:49:07 AM by gile »
Speaking English as a French Frog

gile

  • Gator
  • Posts: 2507
  • Marseille, France
Re: Find duplicates (and do it quickly)
« Reply #13 on: November 25, 2010, 06:22:36 AM »
The upper C# code is certainly not very good: it seems to run a little slower than the old Autodesk del-blk.lsp

Here's the code modified to select all blocks and compare them with the tolerance.

Code: [Select]
;;;---------------------------------------------------------------------------;
;;;
;;;    DEL_BLK.LSP
;;;
;;;    (C) Copyright 1999 by Autodesk, Inc.
;;;
;;;    Permission to use, copy, modify, and distribute this software
;;;    for any purpose and without fee is hereby granted, provided
;;;    that the above copyright notice appears in all copies and
;;;    that both that copyright notice and the limited warranty and
;;;    restricted rights notice below appear in all supporting
;;;    documentation.
;;;
;;;    AUTODESK PROVIDES THIS PROGRAM "AS IS" AND WITH ALL FAULTS.
;;;    AUTODESK SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTY OF
;;;    MERCHANTABILITY OR FITNESS FOR A PARTICULAR USE.  AUTODESK, INC.
;;;    DOES NOT WARRANT THAT THE OPERATION OF THE PROGRAM WILL BE
;;;    UNINTERRUPTED OR ERROR FREE.
;;;
;;;    Use, duplication, or disclosure by the U.S. Government is subject to
;;;    restrictions set forth in FAR 52.227-19 (Commercial Computer
;;;    Software - Restricted Rights) and DFAR 252.227-7013(c)(1)(ii)
;;;    (Rights in Technical Data and Computer Software), as applicable.
;;;
;;;    July 1996
;;;
;;;---------------------------------------------------------------------------;
;;;
;;;    DESCRIPTION
;;;
;;;    This function deletes duplicate block objects.
;;;    A duplicate object is defined as
;;;    -the same location
;;;    -the same layer
;;;    -the same block name
;;;    -the same rotation angle
;;;    -the same x scale
;;;    -the same y scale
;;;    -the same z scale
;;;---------------------------------------------------------------------------;
;;;****************************************************************************
;;;
;;; Code modified:
;;; - add a tolerance with position, rotation and scales
;;; - add a chrono
;;; - all blocks selected

(defun c:deldup_blk( / ss ssdup ct len e eb
                       pt lay bname ang sclx scly sclz obj obj_list)

   ;;(princ "\nSelect block objects.")            ;Select objects and filter all but block insert objects.
   (setq ss (ssget "_X" (list (cons 0 "INSERT"))))
   (if ss                                       ;If any valid objects were selected.
   (progn
      (princ "\nBuilding list of objects.")
      (setq t0 (get-utime))
      (setq ssdup (ssadd))                      ;Initialize new selection set to hold objects to delete
      (setq len (sslength ss))                  ;Find out how many objects were selected.     
      (setq ct 0)
      (while (< ct len)                         ;Loop through selected objects
         (setq e (ssname ss ct))                ;Get an object name
         (setq eb (entget e))                   ;Get the entity list from the object name
         (setq ct (+ ct 1))                     ;Increment index into selection set
         (setq pt (mapcar         ;Access object's coordinate
    '(lambda (x) (round x 0.000001))
    (cdr (assoc 10 eb))
  )
)
         (setq lay (cdr (assoc 8 eb)))          ;Access object's layer
         (setq bname (cdr (assoc 2 eb)))        ;Access object's block name
         (setq ang (round (cdr (assoc 50 eb))0.000001))         ;Access object's rotation angle
         (setq sclx (round (cdr (assoc 41 eb))0.000001))        ;Access object's x scale
         (setq scly (round (cdr (assoc 42 eb))0.000001))        ;Access object's y scale
         (setq sclz (round (cdr (assoc 43 eb))0.000001))        ;Access object's z scale
                                                ;Make list of object properties
         (setq obj (list pt lay bname ang sclx scly sclz))

         (if (not (member obj obj_list))        ;If these properties are not already in list
            (setq obj_list (cons obj obj_list)) ;Add them to the list
            (ssadd e ssdup)                     ;Else add object to selection set to delete
         )                                      ;End if
      )                                         ;End of while loop

      (if (> (sslength ssdup) 0)                ;If there are any objects in the selection set to delete
      (progn
         (princ "\nDeleting duplicate objects.")
         (setq len (sslength ssdup))            ;Find out how many many objects to delete.     
         (setq ct 0)
         (while (< ct len)                      ;Loop through objects and delete.
            (setq e (ssname ssdup ct))          ;Get object name
            (setq ct (+ ct 1))                  ;Increment index into selection set
            (entdel e)                          ;Delete duplicate object
         )                                      ;End of while loop

(setq t1 (get-utime))
         (princ                                 ;Print the number of objects deleted to command line
            (strcat "\nDeleted "
                    (itoa len)
                    " duplicated block(s) have been erased in "
    (rtos (- t1 t0))
    " millseconds"
         ))
      )                                         ;End progn
         (princ "\nNo duplicates found.")       ;Else no there were no duplicates to delete.
      )                                         ;End if

   )                                            ;End progn
      (princ "\nNo block objects selected.")    ;Else there were no valid objects selected
   )                                            ;End if
   (princ)
)

(defun round (num prec)
  (if (zerop (setq prec (abs prec)))
    num
    (if (minusp num)
      (* prec (fix (- (/ num prec) 0.5)))
      (* prec (fix (+ (/ num prec) 0.5)))
    )
  )
)

(defun get-utime ()
  (* 86400000. (getvar "tdusrtimer"))
)

Quote
Commande: DELDUP_BLK

Building list of objects.
Deleting duplicate objects.
Deleted 10134 duplicated blocks have been erased in 1359 millseconds
« Last Edit: November 25, 2010, 06:32:20 AM by gile »
Speaking English as a French Frog

xiaxiang

  • Guest
Re: Find duplicates (and do it quickly)
« Reply #14 on: November 25, 2010, 07:11:14 AM »
Thank you ,gile
I don't like c# code ,it's too difficult for me :oops: