SAM Parser is an application to process SAM files into an equivalent XML representation which you can then further process in any desired form of output.
SAM Parser allows to specify an XSLT 1.0 stylesheet to post-process the output XML into a desired output format. In particular, it only allows for processing individual SAM files to individual output files. It does not support any kind of assembly or linking of multiple source files. This facility is not intended to meet all output needs, but it provides a simple way to create basic outputs.
SAM Parser is a Python 3 program that requires the regex and libxml libraries that are not distributed with the default Python distribution. If you don't already have a Python 3 install, the easiest way to get one with the required libraries installed is to install Anaconda.
SAM Parser is invoked as follows:
samparser <output_mode> <options>
Three output modes are available:
xml outputs a version of the document in XML. You can use an XSD schema to validate the output file and/or an XSLT 1.0 stylesheet to transform it into another format.
html outputs a version of the document in HTML with embedded semantic markup. You can supply one of more CSS stylesheets and/or one or more JavaScrip files to include.
regurgitate outputs a version of the document in SAM with various references resolved and normalized.
All output modes take the following options:
<infile> The path the SAM file or files to be processed. (required)
[-outfile <output file>]|[-outdir <output directory> [-outputextension <output extension>]] Specifies either an output file or a directory to place output files and the file extension to apply those files. (optional, defaults to the console)
-smartquotes <smartquote_rules> The path to a file containing smartquote rules.
-expandrelativepaths Causes the parser to expand relative paths in SAM insert statements when serializing output. Use this if you want paths relative to the source file to be made portable
in the output file.
XML output mode takes the following options:
[-xslt <xslt-file> [-transformedoutputfile <transformed file>]\[-transformedoutputdir <tansformed ouput dir> [-transformedextension <transformed files extension]] Specifies an XSL 1.0 stylesheet to use to transform the XML output into a final format, along with the file name or directory and extension to use for the transformed output.
[-xsd <XSD schema file>] Specifies an XSD schema file to use to validate the XML output file.
HTML output mode takes the follow options:
-css <css file location> Specifies the path to a CSS file to include in the HTML output file. (optional, repeatable)
Regurgitate mode does not take any additional options.
Short forms of the options are available as follows
Eventually, SAM is going to have its own schema language, but until that is available (and probably afterward) you can validate your document against an XML schema. Schema validation is done on the XML output format, not the input (because it is an XML schema, not a SAM schema). To invoke schema validation, use the
-xsd option on the command line:
The SAM parser can expand relative paths of insert statements in the source document while serializing the output. This can be useful if the location of the output file is not the same relative to the included resources as the location of the source file. To tell the parser to expand relative paths into absolute URLs, use the
-expandrelativepaths option. The short form is
Note that this applies to paths in SAM insert statements only. If you include paths in custom structures in your markup, they will not be expanded as the parser has no way of knowing that the value of a custom structure is a path.
The parser can also regurgitate the SAM document (that is, create a SAM serialization of the structure defined by the original SAM document). The regurgitated version may be different in small ways from the input document but will create the same structures internally and will serialize the same way as the original. Some of the difference are:
Character entities will be replaced by Unicode characters.
Paragraphs will be all on one line
Bold and italic decorations will be replaced with equivalent annotations.
Some non-essential character escapes may be included.
Annotation lookups will be performed and any
!annotation-lookup declaration will be removed.
Smart quote processing will be performed and any
!smart-quotes declaration will be removed.
To regurgitate, use the
regurgitate output mode.
The parser incorporates a smart quotes feature. The writer can specify that they want smartquotes processing for their document by including the smartquotes declaration at the start of their document.
By default, the parser supports two values for the smart quotes declaration,
off (the default). The built-in
on setting supports the following translations:
single quotes to curly quotes
double quotes to curly double quotes
single quotes as apostrophe to curly quotes
<space>--<space> to en-dash
--- to em-dash
Note that no smart quote algorithm is perfect. This option will miss some instances and may get some wrong. To ensure you always get the characters you want, enter the unicode characters directly or use a character entity.
Smart quote processing is not applied to code phrases or to codeblocks or embedded markup.
Because different writers may want different smart quote rules, or different rules may be appropriate to different kinds of materials. the parser lets you specify your own sets of smart quote rules. Essentially this lets you detect any pattern in the text and define a substitution for it. You can use it for any characters substitutions that you find useful, even those having nothing to do with quotes.
To define a set of smart quote substitutions, create a XML file like the
sq.xml file included with the parser. This file includes two alternate sets of smart quote rules,
justdashes, which contains rulesets which process just quotes and just dashes respectively. The dashes and quotes rules in this file are the same as those built in to the parser. Note, however, that the parser does not use these files by default.
To invoke the
justquotes rule set:
Add the declaration
!smart-quotes: justquotes to the document.
Use the command line parameter
To add a custom rule set, create your own rule set file and invoke it in the same way.
Note that the rules in each rule set are represented by regular expressions. The rules detect characters based on their surroundings. They do not detect quotations by finding the opening and closing quotes as a pair. They find them separately. This means that the order of rules in the rule file may be important. In the default rules, close quote rules are listed first. Reversing the order might result in some close quotes being detected as open quotes.
Normally SAM is serialized to XML which you can then process to produce HTML or any other output you want. However, the parser also supports outputting HTML directly. The attraction of this is that it allows you to have a semantically constrained input format that can be validated with a schema but which can still output to HTML5 directly.
SAM structures are output to HTML as follows:
Named blocks are output as HTML
<div> elements. The SAM block name is output as the
class attribute of the DIV elements, allowing you to attach specific CSS styling to each type of block.
Codeblocks are output as
pre elements with the language attribute output as a
data-language and the
codeblock. Code is wrapped in
code tags, also with
Embedded data is ignored and a warning is issued.
Paragraphs, ordered lists, and unordered lists are output as their HTML equivalents.
Labelled lists are output as definition lists.
Grids are output as tables.
Record sets are output as tables with the field names as table headers.
Inserts by URL become
object elements. String inserts are resolved if the named string is available. Inserts by ID are resolved by inserting the object with the specified ID. A warning will be raised and the insert will be ignored if you try to insert a block element with an inline insert or and inline element with a block insert. All other inserts are ignored and a warning is raised.
Phrases are output as spans with the
Annotations are output as spans nested within the phrase spans they annotate. The specifically and namespace attributes of an annotation are output as
Attributes are output as HTML attributes. ID attributes are output as HTML
id attributes. Language-code attributes are ouput as HTML
lang attributes. Other attributes are output as HTML
head element is generated which includes the
title elements if the root block of the SAM document has a title. It also includes
<meta charset = "UTF-8">.
To generate HTML output, use the
html output mode the command line.
To specify a stylesheet to be imported by the resulting HTML file, use the
-css option with the URL of the css file to be included (relative to the location of the HTML file). You can specify the
-css option more than once.
To run SAM Parser on Windows, use the
samparser batch file:
samparser xml foo.sam -o foo.xml -x foo2html.xslt -to foo.html
To run SAM Parser on Xnix or Mac, invoke Python 3 as appropriate on your system. For example:
python3 samparser.py xml foo.sam -o foo.xml -x foo2html.xslt -to foo.html