In some cases applications need to process only small portions of large SGML, XML or RDF files. For example, the OpenDirectory project by Netscape has produced a 90MB RDF file representing the main index. The parser described here can process this document as a unit, but loading takes 85 seconds on a Pentium-II 450 and the resulting term requires about 70MB of global stack. One option is to process the entire document and output it as a Prolog fact-base of RDF triples, but in many cases this is undesirable. Another example is a large SGML file containing online documentation, where the application normally wants to present only small portions at a time to the user. Loading the entire document into memory is then undesirable.
Using the parse(element) option, we can open a file, seek (using seek/4) to the position of an element and read just that element. The index of element positions can be built using the call-back interface of sgml_parse/2. For example, the following code makes an index of the structure.rdf file of the OpenDirectory project:

:- use_module(library(sgml)).

:- dynamic
    location/3.                     % Id, File, Offset

%   rdf_index(+File): record the offset of every element with an r:id.
rdf_index(File) :-
    retractall(location(_,_,_)),    % clear the previous index (location/3)
    open(File, read, In, [type(binary)]),
    new_sgml_parser(Parser, []),
    set_sgml_parser(Parser, file(File)),
    set_sgml_parser(Parser, dialect(xml)),
    sgml_parse(Parser,
               [ source(In),
                 call(begin, index_on_begin)
               ]),
    free_sgml_parser(Parser),
    close(In).

%   Called for each begin tag; records elements that carry an r:id attribute.
index_on_begin(_Element, Attributes, Parser) :-
    memberchk('r:id'=Id, Attributes),
    get_sgml_parser(Parser, charpos(Offset)),
    get_sgml_parser(Parser, file(File)),
    assertz(location(Id, File, Offset)).

The following code extracts the RDF element with the required id:

%   rdf_element(+Id, -Term): parse only the element stored at the indexed offset.
rdf_element(Id, Term) :-
    location(Id, File, Offset),
    load_structure(File, Term,
                   [ dialect(xml),
                     offset(Offset),
                     parse(element)
                   ]).
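
The open, seek and read steps described above can also be written out explicitly. The following is a minimal sketch rather than part of the package: the predicates build_index/1, on_begin/3, fetch_element/3 and demo/1, the table elem_offset/2 and the tiny test document are all hypothetical names introduced for illustration. It assumes that the offset reported by charpos/1 in a begin call-back is the start of the element's open tag, as the indexing code above relies on.

```prolog
:- use_module(library(sgml)).

:- dynamic elem_offset/2.           % Id, Offset (hypothetical index table)

%   build_index(+File): record the offset of each element with an id attribute.
build_index(File) :-
    retractall(elem_offset(_, _)),
    new_sgml_parser(Parser, []),
    set_sgml_parser(Parser, dialect(xml)),
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        sgml_parse(Parser, [source(In), call(begin, on_begin)]),
        close(In)),
    free_sgml_parser(Parser).

%   Begin-tag call-back; elements without an id attribute are skipped.
on_begin(_Element, Attributes, Parser) :-
    (   memberchk(id=Id, Attributes)
    ->  get_sgml_parser(Parser, charpos(Offset)),
        assertz(elem_offset(Id, Offset))
    ;   true
    ).

%   fetch_element(+File, +Id, -Term): seek to the indexed offset and
%   parse exactly one element from there.
fetch_element(File, Id, Term) :-
    once(elem_offset(Id, Offset)),
    new_sgml_parser(Parser, []),
    set_sgml_parser(Parser, dialect(xml)),
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        (   seek(In, Offset, bof, _),
            sgml_parse(Parser,
                       [ source(In),
                         parse(element),
                         document(Term)
                       ])
        ),
        close(In)),
    free_sgml_parser(Parser).

%   demo(-Term): index a small made-up document and fetch the element
%   whose id is b.
demo(Term) :-
    tmp_file_stream(text, File, Out),
    write(Out, '<?xml version="1.0"?>\n<doc><item id="a">first</item><item id="b">second</item></doc>\n'),
    close(Out),
    build_index(File),
    fetch_element(File, b, Term),
    delete_file(File).
```

Calling demo(Term) should bind Term to the parsed content for the second item only, without building a term for the rest of the document. Both passes open the file in binary mode so that the recorded offsets can be used directly with seek/4.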