More DigitalMars D - finding a string in a stream : RooJSolutions

Published 2006-03-15 10:14:53

The three good ways to learn a language:

Hack on some existing code.
Write a simple program from scratch.
Port some code from another language to the one you want to learn.

Well, this week I thought I'd have a go at the third. Picking something that was well written to start with, I decided to use binc, an extremely well written IMAP server written in C++, and see how it converts to D.

Rather than attacking the core IMAP bit, I decided to start with the MimeDocument decoding part, something relatively self-contained and conceptually quite simple. Most of the porting involved bringing together classes that had methods defined in multiple files (as seems common with C++), and merging them into nice classes in D.

While most of it will probably remain untested until it's all ported, one single method stood out as a good, simple test of working with D: searching for a string (or delimiter) in a stream.

Obviously, one of the things an IMAP server has to do is scan an email message and find out what makes it up (e.g. attachments, different MIME types and how they are nested). A brute-force approach would be to load the whole message into memory and just scan through it looking for the sections. However, since email messages can frequently be over 5MB, that is obviously horribly inefficient. So the existing code used a simple C++ method to search for a delimiter while reading the stream one character at a time.


This is what the original function looked like in C++:

static bool skipUntilBoundary(const string &delimiter,
                  unsigned int *nlines, bool *eof)
{
  int endpos = delimiter.length();
  char *delimiterqueue = 0;
  int delimiterpos = 0;
  const char *delimiterStr = delimiter.c_str();
  if (delimiter != "") {
    delimiterqueue = new char[endpos];
    memset(delimiterqueue, 0, endpos);
  }

  // first, skip to the first delimiter string. Anything between the
  // header and the first delimiter string is simply ignored (it's
  // usually a text message intended for non-mime clients)
  char c;

  bool foundBoundary = false;
  for (;;) {    
    if (!mimeSource->getChar(&c)) {
      *eof = true;
      break;
    }

    if (c == '\n')
      ++*nlines;

    // if there is no delimiter, we just read until the end of the
    // file.
    if (!delimiterqueue)
      continue;

    delimiterqueue[delimiterpos++ % endpos] = c;

    if (compareStringToQueue(delimiterStr, delimiterqueue,
                 delimiterpos, endpos)) {
      foundBoundary = true;
      break;
    }
  }

  delete [] delimiterqueue;
  delimiterqueue = 0;

  return foundBoundary;
}

Quite nicely written, although it uses a rather obtuse test to see whether the string stored in the current buffer matches the one being looked for. (I didn't have a look at compareStringToQueue, but my guess is that it just goes through the string in the buffer, starting at the delimiter position, and checks whether it matches what is being looked for.)
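
As a rough guess at what compareStringToQueue might do (binc's real implementation was not inspected, so the body below is an assumption), the queue can be read as a ring buffer whose oldest character sits at delimiterpos % endpos. The containsDelimiter helper is hypothetical and only exists to exercise the comparison:

```cpp
#include <string>
#include <vector>

// Assumed behaviour, not binc's actual code: compare `s` against the ring
// buffer `q` of `size` characters, whose oldest entry is at `pos % size`.
static bool compareStringToQueue(const char *s, const char *q,
                                 int pos, int size)
{
    for (int i = 0; i < size; ++i) {
        if (s[i] != q[(pos + i) % size])
            return false;
    }
    return true;
}

// Hypothetical driver: feed `input` one character at a time, the way
// skipUntilBoundary does, and report whether `delimiter` was seen.
static bool containsDelimiter(const std::string &input,
                              const std::string &delimiter)
{
    if (delimiter.empty())
        return false;
    const int size = static_cast<int>(delimiter.length());
    std::vector<char> queue(size, '\0');   // zero-filled, like the memset
    int pos = 0;
    for (char c : input) {
        queue[pos++ % size] = c;           // overwrite the oldest character
        if (compareStringToQueue(delimiter.c_str(), queue.data(), pos, size))
            return true;
    }
    return false;
}
```

Under this reading, every incoming character costs a full comparison of the queue against the delimiter, which is the work the D version below tries to avoid.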

In looking at re-writing this in D, I started to consider a few things.

I can pretty much ignore all the input until it matches the first character of the delimiter (and so I don't need to copy the data into the delimiterqueue).
Once we have started matching the delimiter, I can just test the incoming character against the expected one in the delimiter.
When we get to a character that does not match the delimiter, we should tidy up the test string (and avoid re-allocating it): just copy the bit at the end that still matches to the beginning of the string.

So I started writing a slightly more verbose method, to do the same thing, in a more "D" manner.

bool skipUntilBoundary(MimeStream mimeSource, char[] delimiter, 
		inout uint nlines, inout bool eof)
{

Our function signature is slightly different here:

inout is used rather than the * (pointer) arguments.
uint is used rather than unsigned int.
char[] is used rather than string.
The stream being read is an argument, rather than using a global.
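
For readers coming from C++, D's inout parameters behave much like passing by reference. A small sketch for comparison; countNewlinesPtr and countNewlinesRef are invented names, not part of binc:

```cpp
#include <string>

// Pointer style, as in the original C++: the caller passes an address
// and the callee dereferences it.
static void countNewlinesPtr(const std::string &text, unsigned int *nlines)
{
    for (char c : text)
        if (c == '\n')
            ++*nlines;
}

// Reference style, closer to D's inout: the callee updates the caller's
// variable directly, with no explicit dereference.
static void countNewlinesRef(const std::string &text, unsigned int &nlines)
{
    for (char c : text)
        if (c == '\n')
            ++nlines;
}
```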

	char[] teststring = "";
        teststring.length = delimiter.length * 2;

Next up is creating a test string to hold the characters we are testing against. The second line fixes its size at twice that of the delimiter (which should be more than enough).

  	char c;
        int endpos = delimiter.length;
        bool foundBoundary = false;
        int lookup_offset = 0;
        int teststring_offset = 0;

Next we set up our variables:

teststring_offset is the position we write to in teststring.
lookup_offset is the position in the delimiter we are trying to match against.
endpos is just a shortcut to the delimiter's length.
c is the character being read.
foundBoundary is our result.

 	while (true) {    
            if (!mimeSource.getChar(c)) {
                eof = true;
                break;
            }

Now we start reading the incoming stream, checking whether we have reached the end of it.

	    if (c == '\n') {
                nlines++;
            }

We keep an eye on how many lines we have read.

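            // with no delimiter, just read until the end of the stream
            // (as the C++ version does) rather than indexing delimiter[0]
            if (endpos == 0) {
                continue;
            }

            if ((teststring_offset == 0) && (delimiter[0] != c)) {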
                writefln("first character does not match: %s != %s", 
			delimiter[0]  , c );
                continue;
            }

If we are looking for the first character and it doesn't match, just keep reading!

 	    teststring[teststring_offset] = c;
            teststring_offset++;

We now add the character to our test string (even if it doesn't match).

	   if (delimiter[lookup_offset] == c) {    
                writefln("got a matching character (%d/%d): %s == %s", 
			lookup_offset , endpos, delimiter[lookup_offset]  , c );

                if ((lookup_offset + 1) == endpos) {
                    writefln("GOT FULL MATCHING STRING ");
                    foundBoundary = true;
                    break;
                }
                lookup_offset++;
                continue; // go and find next character..
            }

Now we test whether the character we got matches the expected one. If we have reached the end of the delimiter, we stop processing; otherwise we increase the lookup offset and read the next character.

We only arrive at this point if we matched at least the first character and the latest character does not match. So we need to alter teststring.

            int trim_offset = 1;
            
            while(true) {
                writefln("testing teststring_offset=%d teststring[%d] 
			(%s) against first character %s",
			 teststring_offset, trim_offset, 
			teststring[trim_offset] , delimiter[0]);
                if (trim_offset >= teststring_offset) { // reached the end..
                    writefln("Gone to end of string");
                    teststring_offset = 0;
                    lookup_offset = 0;
                    break;
                }

We start going through teststring from the second character. First we check whether we have examined all of teststring, and reset the offsets if we have (i.e. nothing in this bit matches).

		if (teststring[trim_offset] == delimiter[0]) {
                    
                    // found the start...
                    //check if string matches now..
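                    int test_len = teststring_offset - trim_offset;
                    writefln("MATCH testing available remaining string [%d..%d]%s == [%d]%s", 
			trim_offset, 
			teststring_offset, 
			teststring[trim_offset..teststring_offset] ,
			test_len, delimiter[0..test_len]
		    );
                    // the tail runs from trim_offset to teststring_offset
                    if (teststring[trim_offset..teststring_offset] == delimiter[0..test_len]) {
                        // the slices can overlap, so copy one character at a time
                        for (int i = 0; i < test_len; i++) {
                            teststring[i] = teststring[trim_offset + i];
                        }
                        // test_len characters of the delimiter are now matched
                        teststring_offset = test_len;
                        lookup_offset = test_len;
                        break;
                    }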
                    
                }
                trim_offset++;
 	   }
       }

Now we compare the tail of teststring against the start of the delimiter. If they match, we copy the matching part to the beginning of teststring and reset our offsets.
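
The trimming step can be read as: find the longest tail of the test buffer that is also a head of the delimiter, and keep only that. A minimal C++ sketch of the idea, for comparison with the original language; the function names are made up for illustration, and std::string stands in for the fixed-size teststring:

```cpp
#include <string>

// Hypothetical helper: length of the longest suffix of `buf` (starting at
// index 1 or later) that is also a prefix of `delim`.
static std::size_t longestSuffixThatIsPrefix(const std::string &buf,
                                             const std::string &delim)
{
    for (std::size_t start = 1; start < buf.size(); ++start) {
        std::size_t len = buf.size() - start;
        if (len <= delim.size() &&
            buf.compare(start, len, delim, 0, len) == 0)
            return len;            // earliest start gives the longest tail
    }
    return 0;
}

// Hypothetical streaming search built on the helper: `buf` always holds
// the part of the delimiter matched so far.
static bool streamContains(const std::string &input, const std::string &delim)
{
    if (delim.empty())
        return false;
    std::string buf;
    for (char c : input) {
        if (c == delim[buf.size()]) {
            buf.push_back(c);
            if (buf.size() == delim.size())
                return true;       // full delimiter seen
            continue;
        }
        // Mismatch: keep only the tail that could still start a match.
        buf.push_back(c);
        std::size_t keep = longestSuffixThatIsPrefix(buf, delim);
        buf = buf.substr(buf.size() - keep);
    }
    return false;
}
```

With the delimiter "- hello world -", the buffer "- hello -" trims down to "-", so a delimiter that starts on the mismatching character is not missed.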

Finally, once the loops have broken out, we return the foundBoundary variable.

   return foundBoundary;
}

To make this little test work, we need to create a simple stream reader.

class MimeStream 
{

    char[] thestring = "";
    int pos = 0;
    
    this(char[] string) {
        this.thestring = string;
    }
    
    bool getChar(inout char c)
    {
        if (pos >= thestring.length) {
            return false;
        }
        
        c = this.thestring[pos];
        pos++;
        return  true;
    }
    

}

Then create a main() function so we can test it.

import std.stdio;
void main () {

    MimeStream x = new MimeStream("This is a test  - hello  with XXX - hello world - in the middle".dup);
    uint lines = 0;
    bool eof = false;
    bool ret = skipUntilBoundary(x,  "- hello world -".dup, lines, eof);
    if (ret) {
        writefln("GOT STRING!");
    } else {
        writefln("NO MATCH");
    }
}

and with a simple line, build a binary to test:

 #/dmd/bin/dmd test_string.d
 #./test_string

And out comes our result: GOT STRING! (with a few more debugging messages preceding it).

While not as clever as the C or C++ version, the resulting code is, I think, slightly more readable, and is probably just as memory efficient (along with being just as fast).

Got any ideas to improve it?
