HtmlParse() with XmlSearch() namespace issue

Description

I parse a simple HTML string using HtmlParse(). The result is an XML document, so I would expect XmlSearch() on an HtmlParse() result to return the same results as on an XmlParse() result.

savecontent variable="html" {
echo("<body>
<div>foo</div>
<div>bar</div>
</body>
")
}
packet = HtmlParse(html);
// packet = XmlParse(packet); // uncomment this for a semi workaround
dump(IsXML(packet));
dump(XmlSearch(packet, "div")); // this should return 2 div elements
dump(XmlSearch(packet, "//:div")); // this should return 2 div elements, though I should be able to use the expression "div"

Sorry about the vague description. I'm not quite sure how to describe the issue accurately, only the symptoms.
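
The symptoms are consistent with HtmlParse() placing every element in the XHTML namespace (http://www.w3.org/1999/xhtml): in XPath 1.0 an unprefixed name such as div only matches elements that are in no namespace. A minimal sketch of a namespace-agnostic search, assuming that is the cause and that element names come back in lowercase; the variable names are illustrative:

```cfml
// Same markup as above, built as a plain string.
html = "<body><div>foo</div><div>bar</div></body>";
packet = HtmlParse(html);

// local-name() compares only the element name and ignores its namespace,
// so this expression does not depend on the xmlns that HtmlParse() adds.
divs = XmlSearch(packet, "//*[local-name()='div']");
dump(divs);
```

This leaves the parsed document untouched, at the cost of more verbose XPath expressions.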

Environment

OSX

Attachments

3

Activity

Brad Wood 
16 March 2022 at 17:18

Ran into this myself. Using HtmlParse() on an HTML document makes it very difficult to run XPath searches against the result. Ben Nadel has a nice blog post here with an XSLT transformation that will remove the namespaces.

https://www.bennadel.com/blog/3650-parsing-html-natively-with-htmlparse-in-lucee-5-3-2-77.htm
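
A minimal sketch of that kind of transformation, not the code from the linked post: a generic XSLT 1.0 stylesheet that rebuilds every element and attribute from its local name, applied with XmlTransform(). The variable names and sample markup are illustrative, and the sketch assumes XmlTransform() accepts the document returned by HtmlParse().

```cfml
// Generic "strip namespaces" stylesheet: each element and attribute is
// recreated from its local name, so the output carries no namespace.
stripNamespaces = '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>
    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}"><xsl:value-of select="."/></xsl:attribute>
    </xsl:template>
    <xsl:template match="text()|comment()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>';

packet = HtmlParse("<body><div>foo</div><div>bar</div></body>");

// XmlTransform() returns a string, so parse it again to get an XML object.
cleaned = XmlParse(XmlTransform(packet, stripNamespaces));

// With the namespace gone, an unprefixed XPath name can match.
divs = XmlSearch(cleaned, "//div");
dump(divs);
```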

David Raschper 
16 October 2018 at 11:43

Here is another example/case.
I use htmlParse because XmlParse would fail (validation).
When I try XmlSearch, it doesn't work.
If I convert the XML object returned by htmlParse to a string, remove the xmlns attribute, and then use XmlParse to get a new XML object from the new (and now valid) XML, I can use XmlSearch.

Code.

Hugh Rainey 
17 November 2017 at 19:19

I'm having similar issues, but I think I've figured out why it's happening. When you run HTML through the htmlparse() function, it adds an <html> tag with xmlns attributes. So if you used something like

and then ran that through htmlparse(), you'd get this

Notice the extraneous stuff that is added to the html tag. From my experimentation it's the added attributes that are causing the problem.

Now that we have some html to parse.

This is the result of the htmlparse() function.

Notice that the function adds an html and a body tag. That's not a big deal. If the code you have already has a body tag, then htmlparse doesn't add another one. If you already have an html tag, it doesn't add another one, but it does add the xmlns attributes to it.

So if you dump the XML_HTML variable it looks like something we could search.

But can we search it? The XPath expression I've used in the following snippet should put all the div elements into the array returned by the xmlSearch function.

But all you get is an empty array.

What I found works is to remove all the extraneous attributes from the html tag using the REreplace function. Notice that I'm using the REreplace function on the same htmlparsed variable, XML_HTML.

The result looks like the following:

Now let's search this with the same XPath expression and dump the result.

This is the result:

If the htmlparse function could be modified so that it doesn't include the extra attributes, I believe no workaround would be necessary.
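
A minimal sketch of this kind of workaround, assuming the serialized document declares its namespaces as double-quoted xmlns attributes and uses no prefixed element names. XML_HTML follows the name used above; the other variable names, the sample markup, and the regular expression are illustrative.

```cfml
XML_HTML = HtmlParse("<body><div>foo</div><div>bar</div></body>");

// Serialize the parsed document and drop every xmlns / xmlns:prefix
// declaration, then parse the cleaned string as ordinary XML.
xmlString = ToString(XML_HTML);
cleaned = REReplace(xmlString, '\s+xmlns(:\w+)?="[^"]*"', "", "all");
XML_CLEAN = XmlParse(cleaned);

// With no default namespace left, a plain XPath name can match.
divs = XmlSearch(XML_CLEAN, "//div");
dump(divs);
```

String-level stripping is fragile: it would also remove any matching text that happens to appear inside attribute values or content, so it is best limited to markup you control.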

Details

Created 12 May 2016 at 03:45
Updated 16 March 2022 at 17:18