SWI-Prolog SGML/XML parser

2 Bluffer's Guide

This package parses SGML, XML and HTML data into a Prolog data structure. The high-level interface defined in library(sgml) provides access at the file level, while the low-level interface defined in the foreign module works with Prolog streams. Use the source of sgml.pl as a starting point for handling data from sources other than files, such as SWI-Prolog resources, network sockets and character strings. The first example below loads an HTML file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>
<head>
<title>Demo</title>
</head>
<body>

<h1 align=center>This is a demo</h1>

<p>Paragraphs in HTML need not be closed.

<p>This is called `omitted-tag' handling.
</body>
</html>

?- load_html('test.html', Term, []),
   pretty_print(Term).

[ element(html,
          [],
          [ element(head,
                    [],
                    [ element(title,
                              [],
                              [ 'Demo'
                              ])
                    ]),
            element(body,
                    [],
                    [ '\n',
                      element(h1,
                              [ align = center
                              ],
                              [ 'This is a demo'
                              ]),
                      '\n\n',
                      element(p,
                              [],
                              [ 'Paragraphs in HTML need not be closed.\n'
                              ]),
                      element(p,
                              [],
                              [ 'This is called `omitted-tag\' handling.'
                              ])
                    ])
          ])
].

The document is represented as a list in which each element is either an atom representing CDATA or a term element(Name, Attributes, Content). Entities (e.g. &lt;) are expanded and included in the atom representing the element content or attribute value.

Note: up to SWI-Prolog 5.4.x, Prolog could not represent wide characters, and entities that did not fit in the Prolog character set were emitted as a term number(+Code). With the introduction of wide characters in the 5.5 branch this is no longer needed.
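Because the result is an ordinary Prolog list of atoms and element/3 terms, it can be processed with plain recursion. The following is a minimal sketch, not part of library(sgml): the predicate names dom_text/2 and dom_element/3 are hypothetical helpers introduced here for illustration. It assumes the document term has the shape described above.

```prolog
:- use_module(library(lists)).

%  dom_text(+DOM, -Text)
%  Hypothetical helper: concatenate all CDATA atoms of a parsed
%  document, in document order, into a single atom.
dom_text(DOM, Text) :-
    phrase(dom_atoms(DOM), Atoms),
    atomic_list_concat(Atoms, Text).

dom_atoms([]) --> [].
dom_atoms([Node|Nodes]) --> node_atoms(Node), dom_atoms(Nodes).

%  An element contributes the text of its content; an atomic node
%  (CDATA) contributes itself; any other term is skipped.
node_atoms(element(_Name, _Attributes, Content)) --> !, dom_atoms(Content).
node_atoms(CDATA) --> { atomic(CDATA) }, !, [CDATA].
node_atoms(_) --> [].

%  dom_element(+DOM, ?Name, -Element)
%  Hypothetical helper: enumerate, on backtracking, every element/3
%  term named Name at any depth, outer elements before inner ones.
dom_element(DOM, Name, Element) :-
    member(Node, DOM),
    Node = element(NodeName, _Attributes, Content),
    (   NodeName = Name,
        Element = Node
    ;   dom_element(Content, Name, Element)
    ).
```

For example, after loading the file above into Term, the goal dom_element(Term, p, element(p, _, Content)) would enumerate the content of each paragraph on backtracking, and dom_text(Term, Text) would collect all character data. For anything beyond such simple walks, library(xpath) is likely the better tool.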

2.1 ‘Goodies’ Predicates

These predicates cover basic use of the library: they convert entire, self-contained files in SGML, HTML, or XML into a structured term. They are based on load_structure/3.

load_sgml(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(sgml).
load_xml(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(xml).
load_html(+Source, -ListOfContent, :Options)
Calls load_structure/3 with the given Options, using the default option dialect(HTMLDialect), where HTMLDialect is html4 or html5 (default), depending on the Prolog flag html_dialect. Both dialects imply the option shorttag(false). The option dtd(DTD) is also passed, where DTD is the HTML DTD as obtained using dtd(html, DTD). See dtd/2.
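Since these predicates pass Source on to load_structure/3, which also accepts a term stream(Stream), they can be used on text that is already in memory. The sketch below assumes a SWI-Prolog version that provides open_string/2; the names parse_xml_text/2 and parse_html_text/2 are hypothetical wrappers, not part of the library.

```prolog
:- use_module(library(sgml)).

%  parse_xml_text(+Text, -DOM)
%  Hypothetical wrapper: parse XML held in a string or atom by
%  reading it through a temporary input stream.
parse_xml_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_xml(stream(In), DOM, []),
        close(In)).

%  parse_html_text(+Text, -DOM)
%  Same idea for HTML; the dialect follows the html_dialect flag.
parse_html_text(Text, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_html(stream(In), DOM, []),
        close(In)).
```

A possible interaction:

```prolog
?- parse_xml_text("<a x=\"1\">hi<b/></a>", DOM).
DOM = [element(a, [x='1'], [hi, element(b, [], [])])].
```

Using setup_call_cleanup/3 ensures the temporary stream is closed even if parsing raises an exception.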