Out of order inline elements

Preamble

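This is the case of OUT-OF-ORDER inline elements, where the end tags do not close the elements in the reverse of the order in which they were opened. This has been put forward as a bug - see http://tidy.sf.net/issue/1942407 - with HTML like :-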

<p>
  <b>
    <i>
      <u>
        <strike>
        Bold, italics, strike-through, underline.
        </b>
      </i>
    </u>
  </strike>
  back to normal
</p>

Most browsers have no problem rendering this correctly, like -

~~Bold, italics, strike-through, underline.~~ back to normal

But HTML Tidy gets into a real pickle trying to 'tidy' this badly formed HTML. In the past, some effort has been made to patch Tidy to do better. See http://tidy.sf.net/issue/1426419, and see the patches http://tidy.sf.net/issue/1426424 for some more discussion on this issue.


Current Code Trace

Using a special debug development version of the Tidy code, the following is a trace of what happens. It also reveals some of the inner workings of the Tidy code.

Essentially the problem starts in ParseInline( ... , node *element, ... ), where the 'element' is the last <strike> tag ... ParseInline() does a GetToken(), where the 'plain text' is absorbed. On finding the '<' of the </b>, the lexer state changes to LEX_GT, and the next char is read. On finding the '/' of the </b>, in case LEX_GT, the lexer reads the next char ...

If it is a letter, in this case a 'b', then the lexer is backed up 3 chars, the character is 'un-got', and the lexer state is set to LEX_ENDTAG. Since there is some 'text' before this, a lexer token is created, and returned to ParseInline(), which appends this text node to the tree, and continues ...

The next GetToken() call gets the 'b' off the un-got stack, and it is parsed in LEX_ENDTAG state ... the tag text is added to the lexer until a non-tag-name character is encountered, a new EndTag node is created, with the tag added, the lexer state is set to LEX_CONTENT, and the node returned to ParseInline() ...
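The lexer steps above can be sketched with a toy state machine. This is only an illustration, not the Tidy lexer: the state names LEX_CONTENT, LEX_GT and LEX_ENDTAG come from the trace, but ToyLexer, Token and get_token() are hypothetical, and the sketch peeks at the next character instead of backing up and 'un-getting' it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* State names are from the trace; all other names here are hypothetical. */
typedef enum { LEX_CONTENT, LEX_GT, LEX_ENDTAG } LexState;
typedef enum { TOK_TEXT, TOK_ENDTAG, TOK_EOF } TokType;

typedef struct {
    const char *src;   /* input buffer */
    size_t pos;        /* index of the next char to read */
    LexState state;    /* state carried over between calls */
} ToyLexer;

typedef struct {
    TokType type;
    char text[64];     /* text content, or the end tag name */
} Token;

static Token get_token(ToyLexer *lx)
{
    Token tok = { TOK_EOF, "" };
    size_t n = 0;
    size_t len;

    assert(lx != NULL && lx->src != NULL);
    len = strlen(lx->src);

    while (lx->pos < len) {
        char c = lx->src[lx->pos++];
        switch (lx->state) {
        case LEX_CONTENT:
            if (c == '<') { lx->state = LEX_GT; break; }
            if (n + 1 < sizeof tok.text) tok.text[n++] = c;
            break;
        case LEX_GT:
            if (c == '/' && lx->pos < len &&
                isalpha((unsigned char)lx->src[lx->pos])) {
                /* "</" followed by a letter: an end tag starts here */
                lx->state = LEX_ENDTAG;
                if (n > 0) {            /* pending text is returned first */
                    tok.type = TOK_TEXT;
                    tok.text[n] = '\0';
                    return tok;
                }
                break;
            }
            /* anything else: keep '<' and c as literal text */
            if (n + 2 < sizeof tok.text) { tok.text[n++] = '<'; tok.text[n++] = c; }
            lx->state = LEX_CONTENT;
            break;
        case LEX_ENDTAG:
            if (isalnum((unsigned char)c)) {
                if (n + 1 < sizeof tok.text) tok.text[n++] = c;
                break;
            }
            /* a non-tag-name char (here '>') ends the tag name */
            tok.type = TOK_ENDTAG;
            tok.text[n] = '\0';
            lx->state = LEX_CONTENT;
            return tok;
        }
    }
    if (n > 0 && lx->state == LEX_CONTENT) {
        tok.type = TOK_TEXT;            /* trailing text at end of input */
        tok.text[n] = '\0';
    }
    return tok;
}
```

Fed the input "Bold.</b> back", successive calls give a text token, then an end tag token named 'b', then a second text token, which is the same split that ParseInline() sees in the trace.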

This node is the </b>, so it is marked as type = EndTag. The current 'element', <strike>, is INLINE, and this next 'node', </b>, is INLINE, and as the comment states -

   /* allow any inline end tag to end current element */

If this was </strike> then there would be NO PROBLEM, but it is in fact the close of a stacked element, <b>, ... and some effort has been made to allow Tidy to tolerate, to a degree, such out-of-order inline closing ... but presently only _VERY_ minimally ...

So, the stacked inline tags are exchanged, the <strike> for the <b>, which gets past the first problem. But the very misleading warning message of -
1942407.htm:7:13: Warning: replacing unexpected b by </b>
must also be addressed. The token is put back, and a close of 'strike', this current element, is done - that is, a return from this ParseInline() ...

This returns to ParseTag(), which returns to ParseInline() at the previous level, where the 'element' is the previous <u>, and we cycle to GetToken() ... which gets the put-back token ... so again the stacked elements are checked to see if a 'switch' can be made, but this time this FAILS - we are checking switch 'u' with 'b' ...
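One way to picture why the first switch works and the second fails is a toy stack of open inline tags. This is a guess at the mechanism, not a copy of the Tidy code: InlineStack, push_inline(), find_from() and switch_inline_strict() are hypothetical names, and the sketch assumes the node's tag is only searched for beneath the current element in the stack.

```c
#include <assert.h>
#include <string.h>

#define MAX_INLINE 16

/* Hypothetical stand-in for the lexer's stack of open inline elements. */
typedef struct {
    const char *tag[MAX_INLINE];   /* index 0 is the outermost element */
    int size;
} InlineStack;

static void push_inline(InlineStack *s, const char *tag)
{
    assert(s->size < MAX_INLINE);
    s->tag[s->size++] = tag;
}

/* Search from index 'top' down to 0; return the index found, or -1. */
static int find_from(const InlineStack *s, const char *tag, int top)
{
    int i;
    for (i = top; i >= 0; --i)
        if (strcmp(s->tag[i], tag) == 0)
            return i;
    return -1;
}

/* Assumed strict behaviour: find 'element' from the top of the stack,
 * then look for 'node' strictly below it, and swap the two entries.
 * Returns 1 only when the swap was actually made. */
static int switch_inline_strict(InlineStack *s, const char *element,
                                const char *node)
{
    int e = find_from(s, element, s->size - 1);
    int n;
    const char *tmp;

    if (e < 0)
        return 0;
    n = find_from(s, node, e - 1);
    if (n < 0)
        return 0;
    tmp = s->tag[e];
    s->tag[e] = s->tag[n];
    s->tag[n] = tmp;
    return 1;
}
```

With b, i, u, strike pushed in that order, switching 'strike' with 'b' succeeds and leaves the stack as strike, i, u, b. The 'b' entry is then above 'u', so under this assumption the next check, 'u' with 'b', finds nothing below 'u' and returns 0.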

Thereafter the current code gets into a mess. The full list of warning messages is -

line 7 column 13 - Warning: replacing unexpected b by </b>
line 7 column 10 - Warning: replacing unexpected b by </b>
line 7 column 66 - Warning: inserting implicit <strike>
line 7 column 66 - Warning: inserting implicit <i>
line 7 column 66 - Warning: inserting implicit <u>
line 7 column 66 - Warning: replacing unexpected i by </i>
line 7 column 70 - Warning: inserting implicit <u>
line 7 column 7 - Warning: missing </i> before </p>
line 7 column 4 - Warning: missing </b> before </p>
line 7 column 66 - Warning: trimming empty <u>
line 7 column 66 - Warning: trimming empty <i>
line 7 column 70 - Warning: trimming empty <u>
line 7 column 66 - Warning: trimming empty <strike>

But worse than just these strange warnings, the actual HTML output is -

<p>
 <b>
  <i>
   <u>
    <strike>
    Bold, italics, strike-through, underline.
    </strike>
   </u>
   back to normal
   </i>
 </b>
</p>

Which does not render anything like it should!

~~Bold, italics, strike-through, underline.~~ back to normal

Now to try to carefully extend this patch ...


Code Patches

As sometimes happens, the initial patch is relatively easy. Just return 'yes' from the SwitchInline() service, whether it actually does the stack switch or not, and adjust the comments to match, and the output becomes -

<p><b><i><u><strike>Bold, italics, strike-through,
underline.</strike></u></i></b> back to normal</p>

Which is PERFECT ;=)) Now to address the warning message output. For this I create a new define -

#define REPLACE_UNEXPECTED           88

I add the new text message and a special case in localize.c, and am rewarded by a better set of warnings, namely -
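As a rough illustration of what the new message has to carry, here is a small formatter keyed on the new code. Only the REPLACE_UNEXPECTED value and the message wording come from this page; format_warning() and its parameters are hypothetical and do not reflect how localize.c is actually organised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define REPLACE_UNEXPECTED 88   /* value taken from the text above */

/* Hypothetical helper: builds the warning for an end tag 'found' that
 * was treated as a close of 'used'. Returns the snprintf() result, or
 * -1 for any code this sketch does not know about. */
static int format_warning(char *buf, size_t size, int line, int col,
                          int code, const char *found, const char *used)
{
    assert(buf != NULL && size > 0);
    assert(found != NULL && used != NULL);

    if (code != REPLACE_UNEXPECTED) {
        buf[0] = '\0';
        return -1;
    }
    /* Both names are shown as end tags, unlike the old "unexpected b" */
    return snprintf(buf, size,
                    "line %d column %d - Warning: "
                    "replacing unexpected </%s> by </%s>",
                    line, col, found, used);
}
```

The point of the sketch is that both tag names are passed in and both are printed as end tags, which is what makes the new wording less misleading than the old 'replacing unexpected b by </b>'.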

line 7 column 13 - Warning: replacing unexpected </b> by </strike>
line 7 column 10 - Warning: replacing unexpected </b> by </u>
line 7 column 7 - Warning: replacing unexpected </b> by </i>
line 7 column 66 - Warning: inserting implicit <i>
line 7 column 66 - Warning: inserting implicit <u>
line 7 column 66 - Warning: replacing unexpected </i> by </u>
line 7 column 70 - Warning: inserting implicit <u>
line 7 column 74 - Warning: discarding unexpected </strike>
line 7 column 66 - Warning: trimming empty <u>
line 7 column 66 - Warning: trimming empty <i>
line 7 column 70 - Warning: trimming empty <u>

While not yet perfect, they do represent more or less what happened ...

So I have changed the code to return 'yes', whether it actually does a stack switch or not, provided it at least found the two items present in the stack. Now in all the cases the SwitchInline() service returns 'yes', and we return from ParseInline(), level by level, back to the <b> element. Now the test (node->tag == element->tag && node->type == EndTag) is true, so the node is freed, and we again return from ParseInline().
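The unwinding just described can be modelled with a short loop. This is a simplified sketch under stated assumptions, not the patched Tidy code: ToyNode, both_open() and unwind_for_end_tag() are hypothetical, the 'yes' answer is reduced to 'both tags are open', and each loop step stands for one return from a nested ParseInline() call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { StartTag, EndTag } NodeType;

/* Hypothetical, much reduced token node. */
typedef struct {
    const char *tag;
    NodeType type;
} ToyNode;

/* Lenient check in the spirit of the patch: 'yes' (1) when both tags
 * are present among the open elements, swapped or not. */
static int both_open(const char **stack, int size,
                     const char *a, const char *b)
{
    int i, seen_a = 0, seen_b = 0;
    for (i = 0; i < size; ++i) {
        if (strcmp(stack[i], a) == 0) seen_a = 1;
        if (strcmp(stack[i], b) == 0) seen_b = 1;
    }
    return seen_a && seen_b;
}

/* Walk outwards from the innermost open element until 'node' closes
 * its own element. Returns how many other elements were closed on the
 * way, or -1 if the end tag matches nothing that is open. */
static int unwind_for_end_tag(const char **stack, int size,
                              const ToyNode *node)
{
    int depth, replaced = 0;

    assert(node != NULL && node->type == EndTag);
    for (depth = size - 1; depth >= 0; --depth) {
        const char *element = stack[depth];
        if (strcmp(node->tag, element) == 0)
            return replaced;        /* tags match: the token is consumed */
        if (!both_open(stack, depth + 1, element, node->tag))
            return -1;
        /* token is put back; this level closes and returns */
        printf("replacing unexpected </%s> by </%s>\n", node->tag, element);
        ++replaced;
    }
    return -1;
}
```

For the open elements b, i, u, strike and the end tag </b>, the loop reports three replacements, by </strike>, </u> and </i>, before the tags match at b, which lines up with the first three warnings in the new list.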

This of course is back to the enclosing element ... but the problem now is that, as part of the above, some nodes were duplicated so they could be propagated, yet in this case they have all been closed. So I guess the 'duplication' process should take into account nodes that have been explicitly closed ... That is, remove an implicit node when an explicit close is encountered ... hmmm ... have to think about that ... and of course as the lexer moves on, it now emits warnings about discarding unexpected end tags such as </strike> ... another hmmm ...

But in no time at all we fall back to ParseBody() ... and back through ParseHTML() ...

*** AND IT IS DONE ;=)) ***

You will note the 'implicit' <i> and <u> tags that I was concerned about have been pruned as empty elements ... more hmmms ...

Anyway, time to test this patch further, but since it actually only extended the stack 'searching' minimally, it is really the same as before, but now handles this particular HTML mess ;=))


Downloads

As usual, take care with downloading and running executable files from the web!

Description	Download	Date	Size	MD5
WIN32 EXE using MSVC8	tidycvs04e04.zip	16/04/2008	136,666	59be2875d983939e1243c22510f3c5da
diff patch file, against CVS	tidycvs04.patch	16/04/2008	4,396	98efdf347c03eafa701b46e2b52147e2


Have FUN ;=))

Regards,

Geoff.

EOF - based on 1942407-01.doc
