Writing declarative XML parsing code

Introduction

Writing XML parsing code in C/C++ can get quite complicated, especially if the parser's design forces the developer to juggle “context” objects and hides the instruction flow behind a “context stack”. This makes development and debugging a nightmare. A good example of a design gone wild is the Fast Parser of OpenOffice.org. Despite what the name suggests, it is actually slow, in execution as well as in development, and it is impossible to debug.

Recursive Descent Parser

The answer to the problem is well known in computer science: the recursive descent parser. Recursive descent parsers are fast and easy to maintain, especially if they are written declaratively with the help of some macros.
Suppose you want to parse XML documents which conform to the following grammar:
<rng:element name="book">
    <rng:attribute name="name">
        <rng:text/>
    </rng:attribute>
    <rng:zeroOrMore>
        <rng:element name="page">
            <rng:text/>
        </rng:element>
    </rng:zeroOrMore>
</rng:element>

These would be XML documents like
<book name="libopc">
<page>Page 1</page>
<page>Page 2</page>
</book>

An MCE-aware parser based on libopc’s macros will look like this:
void parse(mceTextReader_t *reader) {
  int page_no=0;
  mce_start_document(reader) {
    mce_start_element(reader, "","book") {
      mce_start_attributes(reader) {
        mce_start_attribute(reader, "","name") {
          // handle @name attribute
          printf("== %s ==\n", xmlTextReaderConstValue(reader->reader));
        } mce_end_attribute(reader);
      } mce_end_attributes(reader);
      mce_start_children(reader) {
        mce_start_element(reader, "","page") {
          mce_skip_attributes(reader);
          printf("Content page %i:\n", ++page_no);
          mce_start_children(reader) {
            mce_start_text(reader) {
              // print text
              printf("%s\n", xmlTextReaderConstValue(reader->reader));
            } mce_end_text(reader);
          } mce_end_children(reader);
        } mce_end_element(reader);
      } mce_end_children(reader);
    } mce_end_element(reader);
  } mce_end_document(reader);
}
The above parser will output
== libopc ==
Content page 1:
Page 1
Content page 2:
Page 2
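
To see why the nested macro blocks read so directly, it can help to look at the same book/page structure written as a hand-rolled recursive descent parser with no library support. The following is only a minimal sketch of the general technique, not how libopc or its macros are implemented. It assumes the document has exactly the shape shown above (no namespaces, entities, comments or self-closing tags), and every name in it (cursor_t, accept_lit, expect_lit, scan_until, parse_page, parse_book) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cursor over an in-memory document (not a libopc type). */
typedef struct {
    const char *p; /* current read position */
    int ok;        /* cleared on the first syntax error */
} cursor_t;

/* Skip whitespace, then consume the literal `lit` if it comes next. */
static int accept_lit(cursor_t *c, const char *lit) {
    size_t n = strlen(lit);
    while (isspace((unsigned char)*c->p)) c->p++;
    if (0 != strncmp(c->p, lit, n)) return 0;
    c->p += n;
    return 1;
}

/* Like accept_lit(), but a missing literal marks the document as invalid. */
static void expect_lit(cursor_t *c, const char *lit) {
    if (c->ok && !accept_lit(c, lit)) c->ok = 0;
}

/* Advance to `stop` (or end of input) and return the number of bytes passed. */
static int scan_until(cursor_t *c, char stop) {
    const char *start = c->p;
    while ('\0' != *c->p && stop != *c->p) c->p++;
    return (int)(c->p - start);
}

/* page := "<page>" text "</page>" ; returns 1 if a page was parsed. */
static int parse_page(cursor_t *c, int page_no) {
    if (!c->ok || !accept_lit(c, "<page>")) return 0;
    const char *text = c->p;
    int len = scan_until(c, '<');
    printf("Content page %d:\n%.*s\n", page_no, len, text);
    expect_lit(c, "</page>");
    return c->ok;
}

/* book := "<book" "name=\"" chars "\"" ">" page* "</book>" ;
   returns the number of pages, or -1 on a syntax error. */
static int parse_book(const char *xml) {
    cursor_t c = { xml, 1 };
    int pages = 0;
    assert(NULL != xml);
    expect_lit(&c, "<book");
    expect_lit(&c, "name=\"");
    if (c.ok) {
        const char *name = c.p;
        int len = scan_until(&c, '"');
        printf("== %.*s ==\n", len, name);
    }
    expect_lit(&c, "\"");
    expect_lit(&c, ">");
    while (parse_page(&c, pages + 1)) pages++; /* zeroOrMore */
    expect_lit(&c, "</book>");
    return c.ok ? pages : -1;
}
```

Each grammar rule becomes one function, and nesting in the grammar becomes nesting of calls. Called on the sample document above, parse_book prints the five output lines listed above and returns 2.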

A real world sample

The following complete real world example will dump all text of the OOXMLI1.docx document as HTML:
#include <opc/opc.h>

static void dumpText(mceTextReader_t *reader) {
    mce_skip_attributes(reader);
    mce_start_children(reader) {
        mce_start_element(reader, _X("http://schemas.openxmlformats.org/wordprocessingml/2006/main"), _X("t")) {
            mce_skip_attributes(reader);
            mce_start_children(reader) {
                mce_start_text(reader) {
                    for(const xmlChar *txt=xmlTextReaderConstValue(reader->reader);0!=*txt;txt++) {
                        switch(*txt) {
                        case '<': 
                            printf("&lt;");
                            break;
                        case '>': 
                            printf("&gt;");
                            break;
                        case '&': 
                            printf("&amp;");
                            break;
                        default:
                            putc(*txt, stdout);
                            break;
                        }
                    }
                } mce_end_text(reader);
            } mce_end_children(reader);
        } mce_end_element(reader);
        mce_start_element(reader, _X("http://schemas.openxmlformats.org/wordprocessingml/2006/main"), _X("p")) {
            printf("<p>");
            dumpText(reader);
            printf("</p>\n");
        } mce_end_element(reader);
        mce_start_element(reader, NULL, NULL) { // match any other element
            dumpText(reader);
        } mce_end_element(reader);
    } mce_end_children(reader);
}

int main( int argc, const char* argv[] )
{
    opcInitLibrary();
    opcContainer *c=opcContainerOpen(_X("OOXMLI1.docx"), OPC_OPEN_READ_ONLY, NULL, NULL);
    if (NULL!=c) {
        mceTextReader_t reader;
        if (OPC_ERROR_NONE==opcXmlReaderOpen(c, &reader, _X("/word/document.xml"), NULL, 0, 0)) {
            mce_start_document(&reader) {
                mce_start_element(&reader, NULL, NULL) {
                    printf("<html>\n");
                    printf("<head>\n");
                    printf("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n");
                    printf("</head>\n");
                    printf("<body>\n");
                    dumpText(&reader);
                    printf("</body>\n");
                    printf("</html>\n");
                } mce_end_element(&reader);
            } mce_end_document(&reader);
            mceTextReaderCleanup(&reader);
        }
        opcContainerClose(c, OPC_CLOSE_NOW);
    }
    opcFreeLibrary();
    return 0;
}
The most recent version of the sample program can be found in sample/opc_text.c.
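
The character escaping inside dumpText can also be factored into a small helper that is testable without a .docx file. The sketch below is one possible way to do that and is not part of libopc: escape_html is a hypothetical name, and writing into a caller-supplied buffer instead of stdout is an assumption made to keep the example easy to check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libopc): copy `src` into `dst`, replacing
   '<', '>' and '&' with HTML entities. Returns the length of the escaped
   string, or -1 if it does not fit into `cap` bytes including the final NUL. */
static int escape_html(const char *src, char *dst, size_t cap) {
    size_t used = 0;
    assert(NULL != src && NULL != dst);
    if (0 == cap) return -1;
    for (; '\0' != *src; src++) {
        char one[2] = { *src, '\0' }; /* the unescaped character as a string */
        const char *rep;
        switch (*src) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        default:  rep = one;     break;
        }
        size_t n = strlen(rep);
        if (used + n + 1 > cap) return -1; /* keep room for the final NUL */
        memcpy(dst + used, rep, n);
        used += n;
    }
    dst[used] = '\0';
    return (int)used;
}
```

With such a helper, the body of the text block would reduce to escaping the reader's current value into a buffer and printing the buffer, at the cost of choosing a buffer size up front.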

Error handling

Another important issue is error handling. MCE has the
mce_error_guard_start(reader) {
} mce_error_guard_end(reader)
block for this.
Inside the error guard block you can signal an error by using one of the following functions. Each takes a guard expression and signals the error only if that expression is true:
mce_error(reader, guard, error_code, msg) // signals an error
mce_errorf(reader, guard, error_code, msg, ...) // signals an error in a printf like fashion
mce_error_strict(reader, guard, error_code, msg) // signals an error only in strict processing mode
mce_error_strictf(reader, guard, error_code, msg, ...) // signals an error in a printf like fashion only in strict processing mode
The following code is taken from libopc itself and it parses XML fragments like this:
<Default Extension="xml" ContentType="application/xml" />
Here is the code to parse the above XML fragments:
   mce_start_element(&reader, NULL, _X("Default")) {
       const xmlChar *ext=NULL;
       const xmlChar *type=NULL;
       mce_start_attributes(&reader) {
           mce_start_attribute(&reader, NULL, _X("Extension")) {
               ext=xmlTextReaderConstValue(reader.reader);
           } mce_end_attribute(&reader);
           mce_start_attribute(&reader, NULL, _X("ContentType")) {
               type=xmlTextReaderConstValue(reader.reader);
           } mce_end_attribute(&reader);
       } mce_end_attributes(&reader);
       mce_error_guard_start(&reader) {
           mce_error(&reader, NULL==ext || ext[0]==0, MCE_ERROR_VALIDATION, "Missing @Extension attribute!");
           mce_error(&reader, NULL==type || type[0]==0, MCE_ERROR_VALIDATION, "Missing @ContentType attribute!");
           opcContainerType *ct=insertType(c, type, OPC_TRUE);
           mce_error(&reader, NULL==ct, MCE_ERROR_MEMORY, NULL);
           opcContainerExtension *ce=opcContainerInsertExtension(c, ext, OPC_TRUE);
           mce_error(&reader, NULL==ce, MCE_ERROR_MEMORY, NULL);
           mce_errorf(&reader, NULL!=ce->type && 0!=xmlStrcmp(ce->type, type), MCE_ERROR_VALIDATION, "Extension \"%s\" is mapped to type \"%s\" as well as \"%s\"", ext, type, ce->type);
           ce->type=ct->type;
       } mce_error_guard_end(&reader); 
       mce_skip_children(&reader);
   } mce_end_element(&reader);
You can see that an error is issued if, for example, the “Extension” or “ContentType” attribute is missing or empty.
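
The guard idiom itself can be illustrated in plain C. The following is a sketch under stated assumptions and not libopc's actual implementation: the macros guard_start, guard_end and guard_error, the ctx_t type and the ERR_* codes are all hypothetical. It assumes that a failed check should record the error and skip the rest of the guarded block, which is one plausible reading of the behaviour described above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical error codes and parser context (not libopc types). */
typedef enum { ERR_NONE = 0, ERR_VALIDATION, ERR_MEMORY } err_code_t;

typedef struct {
    err_code_t error; /* first error recorded, ERR_NONE if none */
    const char *msg;  /* message belonging to that error */
} ctx_t;

/* The guarded block runs at most once, and only while no error is recorded. */
#define guard_start(ctx) if (ERR_NONE == (ctx)->error) do
#define guard_end(ctx)   while (0)

/* If `cond` holds, record the error and leave the enclosing guard block.
   The `break` targets the do/while(0), so this must not be used inside a
   loop or switch nested in the guard. */
#define guard_error(ctx, cond, code, text) \
    if (cond) { (ctx)->error = (code); (ctx)->msg = (text); break; } else (void)0

/* Validate the two attributes of a <Default> element. */
static err_code_t check_default(ctx_t *ctx, const char *ext, const char *type) {
    assert(NULL != ctx);
    guard_start(ctx) {
        guard_error(ctx, NULL == ext || '\0' == ext[0],
                    ERR_VALIDATION, "Missing @Extension attribute!");
        guard_error(ctx, NULL == type || '\0' == type[0],
                    ERR_VALIDATION, "Missing @ContentType attribute!");
        /* Only reached when both checks passed. */
        printf("Extension \"%s\" maps to \"%s\"\n", ext, type);
    } guard_end(ctx);
    return ctx->error;
}
```

Because the first failing check leaves the block, later statements can rely on the earlier checks having passed. Once an error is recorded in the context, any further guard block on the same context is skipped entirely.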

Last edited Jun 16, 2011 at 3:37 PM by flr, version 6

Comments

gemmell Jun 6, 2013 at 5:18 AM 
The code snippet "An MCE-aware parser based on libopc’s macros will look like this:" is missing a brace on mce_start_document(reader)