Author Topic: Challenge: Optimize function MatchCharInString  (Read 2522 times)

0 Members and 1 Guest are viewing this topic.

pkohut

  • Guest
Challenge: Optimize function MatchCharInString
« on: February 26, 2010, 07:48:01 AM »
modified - added source project file

The function MatchCharInString listed below (project file with test harness attached also) works but is not as efficent as I'd like it to be.  If you except, your task is to make it faster.  Note that the strings are UTF-16 multibyte. This means that "for and while loops" will probably be the wrong optimizations to make, as it needs to work with different Unicode locals.  An obvious optimization is to not copy the string in pcszSource and pcszMatchChars.  Here's some ground rules, the MatchCharsInString parameters cannot be changed, CString and/or AString are not viable options.  TIA, and have fun.

Sample output from included project file. String comparisions are done in lower case and outputs are in their normal case.
Here the first test compares the characters 'B' 'B' 'O' against the string "This is a test", and finds no match because 'B' was not found.
Second line test 'B' 'B' 'O' against "Bob said, "Hello World!" and found matches at position 0 - 'B', 2 - 'B', 15 - 'O'.
Each character of 'B' 'B' 'O' is only matched once in the string, and if found the search continues with the next character at the next string position.
Code: [Select]
Testing -- BBO
   Does not match: This is a Test
          Matches: Bob said, "Hello World!"   0 2 15
   Does not match: Lizzard Lips
   Does not match: Zebra tails                2
   Does not match: Luke warm milk
Testing -- RZ
   Does not match: This is a Test
   Does not match: Bob said, "Hello World!"   19
   Does not match: Lizzard Lips               5
   Does not match: Zebra tails                3
   Does not match: Luke warm milk             7
Testing -- ZR
   Does not match: This is a Test
   Does not match: Bob said, "Hello World!"
          Matches: Lizzard Lips               2 5
          Matches: Zebra tails                0 3
   Does not match: Luke warm milk
Testing -- KML
   Does not match: This is a Test
   Does not match: Bob said, "Hello World!"
   Does not match: Lizzard Lips
   Does not match: Zebra tails
          Matches: Luke warm milk             2 8 12

Listing for MatchCharInString
Code: [Select]
// Does a lowercase match each character in sMatchChars to a character in the string sSource
// (also lower case).
// Each match is done in sequence. When the first character is found from sMatchChars the
// next character from sMatchChars is searched for the current position of sSource to
// the end of sSource string.
// The position of each match is stored in the vector index that is passed by reference.
// Returns true if all sMatchChars are found before reaching the end of sSource, false otherwise.
bool MatchCharsInString(const wchar_t * pcszSource, const wchar_t * pcszMatchChars, std::vector<size_t> & index)
{
    std::wstring sSource(pcszSource);
    std::wstring sMatchChars(pcszMatchChars);
    // change input strings to lower case
    std::transform(sSource.begin(), sSource.end(), sSource.begin(), std::tolower);
    std::transform(sMatchChars.begin(), sMatchChars.end(), sMatchChars.begin(), std::tolower);

    bool bFound = true;
    size_t pos = 0;

    std::wstring::iterator itChar = sMatchChars.begin();
    for(itChar; itChar != sMatchChars.end(); itChar++) {
        pos = sSource.find_first_of(*itChar, pos);
        if(pos == std::wstring::npos) {
            // if the position of sSource is the end of the string then
            // there was not a complete match of each character in sMatchChars
            bFound = false;
            break;
        }
        index.push_back(pos);  //push valid position to index
        pos++;
    }
    return bFound;
}
« Last Edit: February 26, 2010, 08:19:56 AM by pkohut »

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #1 on: February 26, 2010, 07:53:53 AM »
HEADS UP.
This version is very fast and wrong.  Why? Because it does not take into account the local Unicode setting, and will bomb on any multibyte Unicode values.

Code: [Select]
////////////////////////////////////////////////////////////////////////
// Will not work for some unicode locales.
////////////////////////////////////////////////////////////////////////
bool MatchString(const wchar_t * pcszSource, const wchar_t * pcszMatchChars)
{
while(*pcszMatchChars) {
wchar_t c = std::tolower(*pcszMatchChars);
while(*pcszSource) {
if(c == std::tolower(*pcszSource)) {
break;
}
pcszSource++;
}
if(!*pcszSource)
break;
pcszMatchChars++;
}
return (*pcszSource && !*pcszMatchChars);
}

It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #2 on: February 26, 2010, 10:13:39 PM »
I have no idea  :lol:

Code: [Select]
#include "stdafx.h"
#include <locale>
#include <vector>
#include <windows.h>

void MatchString(const wchar_t *pcszSource, const wchar_t *pcszMatchChars,
                 std::vector<size_t> & index)
{
  std::locale loc;
  bool bMatchIsHigh;
  bool bSrcIsHigh;

  for(size_t idx = 0; *pcszSource != '\0'; idx++)
  {
    if(*pcszMatchChars != '\0')
    {
      bMatchIsHigh = IS_HIGH_SURROGATE(*pcszMatchChars);
      bSrcIsHigh = IS_HIGH_SURROGATE(*pcszSource);
      if(!bMatchIsHigh && !bSrcIsHigh)
      {
        if(std::tolower(*pcszSource,loc) == std::tolower(*pcszMatchChars,loc))
        {
          index.push_back(idx);
          pcszMatchChars++;
        }
        pcszSource++;
      }
      else if(bMatchIsHigh && bSrcIsHigh)
      {
        if( *pcszSource+1 != '\0' && *pcszMatchChars+1 != '\0')
        {

          UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
             (*pcszSource+1 - 0xDC00) + 0x10000;
          UINT32 b = ((*pcszMatchChars - 0xD800) * 0x400) +
            (*pcszMatchChars+1 - 0xDC00) + 0x10000;

          if(std::tolower(a,loc) == std::tolower(b,loc))
          {
            index.push_back(idx);
            pcszMatchChars++;
            pcszMatchChars++;
          }
          pcszSource++;
          pcszSource++;
        }
        else
        {
          return;// bad format
        }
      }
      else if(bMatchIsHigh  && !bSrcIsHigh)
      {
        pcszSource++;
      }
      else if(!bMatchIsHigh && bSrcIsHigh)
      {
        pcszSource++;
        pcszSource++;
      }
    }
    else
    {
      break;
    }
  }
}

int _tmain(int argc, _TCHAR* argv[])
{
  std::vector<size_t> index;
  wchar_t a[] = {'d', 'a', 0xD834, 0xDD1E,'\0'};
  wchar_t b[] = {'a', 0xD834, 0xDD1E,'\0'};

  //wchar_t a[] = {'L','u','k','e',' ',0xD834,0xDD1E,'w','a','r','m',' ','m','i','l','k','\0'};
  //wchar_t b[] = {'K',0xD834,0xDD1E,'L','\0'}; // 2 5 13

  //wchar_t a[] = {'L','u','k','e',' ','w','a','r','m',' ','m','i','l','k','\0'};
  //wchar_t b[] = {'K','M','L','\0'};

  MatchString(a,b,index);

  for(size_t i = 0; i < index.size(); i++)
  {
    wprintf(_T("%ld "),index[i]);
  }
  system("pause");
  return 0;
}
« Last Edit: February 27, 2010, 12:11:21 AM by Daniel »

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #3 on: February 27, 2010, 05:47:53 AM »
Nice solution Dan.  I had not seen the surrogate macros and they just might work.
http://msdn.microsoft.com/en-us/library/dd374069(VS.85).aspx

I think that
Code: [Select]
          UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
             (*pcszSource+1 - 0xDC00) + 0x10000;
          UINT32 b = ((*pcszMatchChars - 0xD800) * 0x400) +
            (*pcszMatchChars+1 - 0xDC00) + 0x10000;

          if(std::tolower(a,loc) == std::tolower(b,loc))

can be reduced to just
if(std::tolower(pcszSource, loc) == std::tolower(pcszMathChars, loc))
since it is locale aware (untested).  If that is the case then the original problem becomes simple record keeping of the pointers before sending them off  to tolower.
The only other thing to do is initialize std::locale early on in the application and have a function that returns its reference, then this thing will be right DAN fast.

In my original version I used std::tolower from <cctype> because I miss interpreted what "locale independent" meant, good catch.

Thanks again.

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #4 on: February 27, 2010, 06:43:56 AM »
BTW, can you clarify the intent of
UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
             (*pcszSource+1 - 0xDC00) + 0x10000;




It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #5 on: February 27, 2010, 06:49:03 AM »
BTW, can you clarify the intent of
UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
             (*pcszSource+1 - 0xDC00) + 0x10000;





I think its a convert to UTF-32, I'm probably wrong in using it though.

It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #6 on: February 27, 2010, 06:50:09 AM »
I didn't know if any characters above 0xd800 had lower case, originally I just had it like

Code: [Select]
if((*pcszSource + *pcszSource+1) ==
             (*pcszMatchChars + *pcszMatchChars+1))


 :|

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #7 on: February 27, 2010, 06:58:18 AM »
Quote
I think its a convert to UTF-32, I'm probably wrong in using it though
Oh I see.

Quote
I didn't know if any characters above 0xd800 had lower case, originally I just had it like
I think with the surrogate macros you would just need to move the pointer to the correct location and let std::tolower(ptr, loc) take care of the rest.  I don't know for sure, cause I had my head buried inside the STL and Boost libraries trying to figure it out, but all the generics was getting in the way of comprehension.


It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #8 on: February 27, 2010, 07:13:58 AM »
I think with the surrogate macros you would just need to move the pointer to the correct location and let std::tolower(ptr, loc) take care of the rest. 

your probably right  :-)

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #9 on: February 27, 2010, 09:42:53 PM »
The memory locations for the different strings looks like this in OS X.  Note szStr and sStr is handled with UTF-8 encoding, while wchar_t and wstring are UTF-32 (on Windows they are UTF-16).

Code: [Select]
int main (int argc, const char * argv[]) {
char szStr[] = "olé";
wchar_t szWStr[] = L"olé";
std::string sStr(szStr);
std::wstring sWStr(szWStr);    [pool drain];
    return 0;
}


It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #10 on: February 27, 2010, 10:01:32 PM »
Bummer, I guess that makes it harder to write portable code.

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #11 on: February 28, 2010, 12:12:40 AM »
Bummer, I guess that makes it harder to write portable code.

Ya, it does.  A while ago I was writing some low level code for the Mac and string issues was causing all kinds of pain.  Finally it was suggested to use std::string instead of wstring, as it already handlings UTF-8 encoding.  Then whenever anything needed to be displayed to the user, send the string thru CFString and let the OS handle the localization issue.

Good blog post from an ex Macromedia programmer
http://losingfight.com/blog/2006/07/28/wchar_t-unsafe-at-any-size/

and another perspective UTF-8/16/32 issue.
http://gcc.gnu.org/ml/gcc-help/2008-08/msg00292.html


It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #12 on: March 01, 2010, 06:52:49 AM »

pkohut

  • Guest
Re: Challenge: Optimize function MatchCharInString
« Reply #13 on: March 07, 2010, 05:59:22 AM »
As usual got to watch out for platform enianness.  Frak'in big endian just shot me in the ars.  :-D

It's Alive!

  • BricsCAD
  • Needs a day job
  • Posts: 6940
  • AKA Daniel
Re: Challenge: Optimize function MatchCharInString
« Reply #14 on: March 08, 2010, 12:13:51 AM »
Little End In is more better  :lol: