Topic: Byte Shift when Reading Binary Stream

Lee Mac

  • Seagull
  • Posts: 12906
  • London, England
Byte Shift when Reading Binary Stream
« on: October 21, 2011, 01:04:41 PM »
I have been attempting to read a UTF-16 (Big/Little Endian) Unicode Encoded Text file and return the content as an ASCII representation (using the Unicode code-points, e.g. U+0414).

However, I have noticed that when reading the Binary Stream, the returned byte values are sometimes shifted by an arbitrary number of bytes.

For example, the UTF-16 byte order mark at the start of the file would usually be 255 254 (Little Endian), but it is sometimes returned as 1279 1278 (i.e. 255 + 1024, 254 + 1024); the size of the shift varies between test runs.

I can't see the reason for the shift or how to prevent it, so I wonder if anyone here can offer some insight?

Here is my code:

Code: [Select]
(defun test ( file / _padzeros lst out n )

    (defun _padzeros ( s )
        (if (< (strlen s) 4) (_padzeros (strcat "0" s)) s)
    )
    (if (setq lst (LM:ReadBinaryStream file))
        (progn
            (cond
                (   (and (= 255 (car lst)) (= 254 (cadr lst))) ;; UTF-16 Little Endian
                    (setq lst (cddr lst))
                    (while (cadr lst)
                        (if (< (setq n (+ (car lst) (* 256 (cadr lst)))) 128)
                            (setq out (cons (chr n) out))
                            (setq out (cons (strcat "\\U+" (_padzeros (LM:Dec->Base n 16))) out))
                        )       
                        (setq lst (cddr lst))
                    )
                    (setq out (reverse out))
                )
                (   (and (= 254 (car lst)) (= 255 (cadr lst))) ;; UTF-16 Big Endian
                    (setq lst (cddr lst))
                    (while (cadr lst)
                        (if (< (setq n (+ (cadr lst) (* 256 (car lst)))) 128)
                            (setq out (cons (chr n) out))
                            (setq out (cons (strcat "\\U+" (_padzeros (LM:Dec->Base n 16))) out))
                        )
                        (setq lst (cddr lst))
                    )
                    (setq out (reverse out))
                )
                (   (setq out (mapcar 'chr lst))   ) ;; Assume ASCII
            )
           
            (LM:str->lst (apply 'strcat out) "\r\n")
        )
    )
)

(defun LM:Dec->Base ( n b )
    (if (< n b)
        (chr (+ n (if (< n 10) 48 55)))
        (strcat (LM:Dec->Base (/ n b) b) (LM:Dec->Base (rem n b) b))
    )
)

(defun LM:str->lst ( str del / pos )
  (if (setq pos (vl-string-search del str))
    (cons (substr str 1 pos) (LM:str->lst (substr str (+ pos 1 (strlen del))) del))
    (list str)
  )
)

(defun LM:ReadBinaryStream ( file / adostream result )
    (if
        (and
            (setq file (findfile file))
            (setq adostream (vlax-create-object "ADODB.Stream"))
        )
        (progn
            (setq result
                (vl-catch-all-apply
                    (function
                        (lambda nil
                            (vlax-put-property  adostream 'type 1)
                            (vlax-invoke-method adostream 'open nil nil nil nil nil)
                            (vlax-invoke-method adostream 'loadfromfile file)
                            (vlax-put-property  adostream 'position 0)
                            (setq result (vlax-invoke-method adostream 'read -1))
                            (vlax-invoke-method adostream 'close)
                            result
                        )
                    )
                )
            )
            (vlax-release-object adostream)
            (if (not (vl-catch-all-error-p result))
                (vlax-safearray->list (vlax-variant-value result))
            )
        )
    )
)

JohnK

  • Administrator
  • Seagull
  • Posts: 10605
Re: Byte Shift when Reading Binary Stream
« Reply #1 on: October 21, 2011, 05:33:12 PM »
`Endian'? Where did this file come from/go to? Are you sure it's big endian (most PCs use little endian; I thought only the big stuff, i.e. mainframes, used big endian)? What's the application for this "thing" you are writing?

Past that, I will try to read up on the topic again to chime in further. "Big endian: byte order ='s flopped" would be about the extent of my endian knowledge at the moment.

Generally speaking though:
Do you have any idea what the format of the file is you are trying to read? When you get to a location you kinda have to know (or have an idea) as to what you expect there ...a letter, number, start of a string, image, credit card numbers, etc., etc.

Sorry I can't be of more help right now, but I don't remember much about this topic and I only read up on interpreting big/little endian data once, so I wasn't all that keen on it to begin with. However, I will try to do some reading tonight.
TheSwamp.org (serving the CAD community since 2003)
Member location map - Add yourself

Donate to TheSwamp.org

Lee Mac

Re: Byte Shift when Reading Binary Stream
« Reply #2 on: October 22, 2011, 07:58:52 AM »
Quote from: JohnK
`Endian'?

Endian

Quote from: JohnK
Where did this file come from/go to?

It's a Unicode Text file exported from Excel.

Quote from: JohnK
Are you sure it's big endian (most PCs use little endian, I thought only the big stuff -i.e. mainframes - use big endian)?

I never said it was Big Endian; it can be either. The above function should be able to read both. You can save the file as UTF-16 Big or Little Endian; however, it defaults to Little Endian.

Quote from: JohnK
What's the application for this "thing" you are writing?

Quote from: Lee Mac
to read a UTF-16 (Big/Little Endian) Unicode Encoded Text file and return the content as an ASCII representation

Quote from: JohnK
Do you have any idea what the format of the file is you are trying to read?

Quote from: Lee Mac
I have been attempting to read a UTF-16 (Big/Little Endian) Unicode Encoded Text file

The format can be ascertained from the Byte Order Mark (usually) at the start of the file: for example, FF FE (255 254) indicates UTF-16 Little Endian, or, if I re-save as UTF-16 Big Endian, FE FF (254 255).

Lee Mac

Re: Byte Shift when Reading Binary Stream
« Reply #3 on: October 22, 2011, 08:31:06 AM »
Found a workaround solution, although it isn't pretty...

Code: [Select]
      (if (setq lst (LM:ReadBinaryStream file))
          (progn
              (if
                  (or
                      (zerop (rem (setq d (- (car lst) 255)) 8))
                      (zerop (rem (setq d (- (car lst) 254)) 8))
                  )
                  (setq lst (mapcar '(lambda ( x ) (- x d)) lst))
              )
...
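
A stricter variant of this workaround can be sketched as follows. As far as I can tell, the multiple-of-8 test also matches unshifted files whose first byte happens to differ from 255 or 254 by a multiple of 8 (e.g. a first byte of 71, since 71 - 255 = -184), so this version only accepts an offset when the first two values look like a byte order mark that has been shifted as a pair. The names bom-offset and unshift-bytes are hypothetical, and the sketch assumes that the shift is non-negative and identical for every value.

```lisp
;; Hypothetical helper (not from the original post): returns the uniform
;; offset applied to a byte list, judged from a UTF-16 byte order mark,
;; or nil if the first two values do not look like a (shifted) BOM.
;; Assumption: the offset is zero or positive.
(defun bom-offset ( lst / a b )
    (setq a (car lst) b (cadr lst))
    (cond
        (   (not (and a b)) nil)                        ;; fewer than two values
        (   (and (= 1 (- a b)) (<= 255 a)) (- a 255))   ;; FF FE + offset (Little Endian)
        (   (and (= 1 (- b a)) (<= 254 a)) (- a 254))   ;; FE FF + offset (Big Endian)
    )
)

;; Hypothetical helper: subtract the detected offset from every value;
;; the list is returned unchanged if no offset is detected.
(defun unshift-bytes ( lst / d )
    (if (and (setq d (bom-offset lst)) (/= 0 d))
        (mapcar (function (lambda ( x ) (- x d))) lst)
        lst
    )
)
```

For example, (unshift-bytes '(1279 1278 1089 1024)) returns (255 254 65 0), while a list with no recognisable byte order mark such as (71 72 73) is returned as-is.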

JohnK

Re: Byte Shift when Reading Binary Stream
« Reply #4 on: October 22, 2011, 12:18:56 PM »
Ummm!? Sorry if I made you angry with my question.

I think we are having two different conversations. I was questioning the necessity.

...you have fun with that.

Lee Mac

Re: Byte Shift when Reading Binary Stream
« Reply #5 on: October 22, 2011, 01:20:00 PM »
Quote from: JohnK
Ummm!? Sorry if I made you angry with my question.

Angry?  :?

JohnK

Re: Byte Shift when Reading Binary Stream
« Reply #6 on: October 25, 2011, 08:39:47 AM »
Quote from: Lee Mac
Angry?

Forget it.

BTW, what was your decision on this topic?

Lee Mac

Re: Byte Shift when Reading Binary Stream
« Reply #7 on: October 25, 2011, 08:46:20 AM »
Quote from: JohnK
BTW, what was your decision on this topic?

I ended up shifting the values back by the required amount. It's not the best solution, since it still doesn't explain why the values were shifted in the first place, but it has worked in all my tests since.

Thanks for your response.

irneb

  • Water Moccasin
  • Posts: 1794
  • ACad R9-2016, Revit Arch 6-2016
Re: Byte Shift when Reading Binary Stream
« Reply #8 on: October 25, 2011, 10:29:22 AM »
Lee,
Perhaps something is wrong with the ActiveX ADODB.Stream object's implementation. What about implementing a binary reader using read-char? It's probably not as efficient, but might actually "work". The only trouble I can see with this approach is that it collapses a carriage-return + line-feed pair, returning only a 10 instead of a 13 followed by a 10. But in your case this ought not to be a problem.
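
A minimal sketch of such a reader, assuming the standard open / read-char / close functions; the name read-codes is hypothetical. It does nothing about the line-ending caveat, and how read-char treats null bytes, end-of-file markers or Unicode-encoded files may differ between AutoCAD releases, so treat it as a starting point for testing rather than a verified binary reader.

```lisp
;; Hypothetical helper (not from the thread): collect the character codes
;; returned by read-char into a list, in file order.
;; Returns nil if the file cannot be opened.
(defun read-codes ( file / f c lst )
    (if (setq f (open file "r"))
        (progn
            (while (setq c (read-char f))   ;; read-char returns nil at end of file
                (setq lst (cons c lst))
            )
            (close f)
            (reverse lst)
        )
    )
)
```

The resulting list could then be compared against the list returned by LM:ReadBinaryStream for the same file to see whether the shift is specific to the ADODB.Stream route.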
Common sense - the curse in disguise. Because if you have it, you have to live with those that don't.

Jeff H

  • Needs a day job
  • Posts: 6144
Re: Byte Shift when Reading Binary Stream
« Reply #9 on: October 25, 2011, 02:35:02 PM »
It seems that before, when reading a binary stream for something, Windows returned 2 bytes for a new line (13 & 10), while others said Unix returned only 10.

irneb

Re: Byte Shift when Reading Binary Stream
« Reply #10 on: October 26, 2011, 12:40:30 AM »
Jeff: that's correct, text files usually have only the newline (10) byte to mark EOL when created in a Unix environment (incl. Linux, BSD, OSX, etc.). On DOS-based systems a carriage return (13) precedes it, and this is perpetuated in Windows.

The problem is that most programming environments automatically add the code(s) according to the OS's default when writing, and then omit the 13 when reading. This is fine for text files, but not for binary. Unfortunately AutoLisp only has this type of read & write function, with no binary functions. It might be a good idea to implement a true binary read / write through dotnet / arx.

Lee: Seeing as you're getting this intermittent curiosity (error?), can you be sure it doesn't happen further into the file as well? I'd be rather concerned if I got such inconsistency even just on the first bytes in a file. Perhaps you'd need to run the returned list through a check to see if each byte is less than 256. What happens if there's an EOF marker somewhere midway through the file? Or even a FF?
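
The range check could be sketched along these lines, assuming the Visual LISP list functions vl-every and vl-remove-if are available; the names bytes-valid-p and bytes-out-of-range are hypothetical.

```lisp
;; Hypothetical helpers (not from the thread).

;; T if every value in lst is a valid byte (0-255), otherwise nil.
(defun bytes-valid-p ( lst )
    (vl-every (function (lambda ( x ) (<= 0 x 255))) lst)
)

;; The values in lst that fall outside 0-255, in their original order.
(defun bytes-out-of-range ( lst )
    (vl-remove-if (function (lambda ( x ) (<= 0 x 255))) lst)
)
```

For example, (bytes-valid-p '(255 254 65 0)) returns T, and (bytes-out-of-range '(1279 1278 65)) returns (1279 1278).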

Lee Mac

Re: Byte Shift when Reading Binary Stream
« Reply #11 on: October 26, 2011, 07:06:40 AM »
Quote from: irneb
Lee: Seeing as you're getting this intermittent curiosity (error?), can you be sure it doesn't happen further into the file as well?

It happens for every byte value in the binary stream. I may have been misleading in my earlier post by using only the first two bytes as an example, but every value in the binary stream is shifted by the same amount. This is confirmed by the workaround that I have implemented: the output is correct when all byte values are shifted back by the same amount.