Semantic Authoring Markdown (SAM)

This document is a formal description of the SAM language. It describes the formalisms behind the syntax and it may make SAM sound a lot more complex to work with than it really is. If you just want to learn how to write in SAM, see the SAM Quick Start.

Semantic Authoring Markdown (SAM) is a simplified markup for semantic authoring. Semantic authoring means writing texts that identify some part of their semantics through embedded markup. Semantic authoring also places constraints on the author, making sure they create the content that is required. Semantic authoring allows for sophisticated validation and manipulation of content, supporting increased quality and improving the efficiency of the publishing process.

SAM is a tale that grew in the telling. It was designed from the bottom up, not the top down, and so formalizing the language description comes after the fact. This document is a step along the way to a fully formal definition. There is not yet a full formal grammar for SAM. It is possible that the development of such a grammar could reveal ambiguities in the language definition that might require design changes in the future.

Why SAM?

Why is SAM needed? What are the limits of current semantic authoring solutions? The core of a semantic authoring solution is a markup language. There are, broadly, two kinds of markup languages: general and lightweight. Lightweight markup languages are easier to use, but less capable. XML is the only general language in common use today (though SGML is also still in use). There are many lightweight markup languages. The best known is Markdown.

The key issue for real semantic authoring is extensibility: the ability to specify your own semantics. (To create new tags that are specific to what you are writing about.)

Current lightweight markup languages have fixed semantics with little or no extensibility. Where extensibility exists, it generally requires coding extensions to the parser.

XML supports extensibility through schemas. But the price it pays for this is an abstract data model and a verbose syntax. Writing and editing in XML is cumbersome, even with a dedicated XML editor (and dedicated XML editors can be expensive). And while XML is intended for doing structured writing, in practice it tends to hide the structure of the content either behind a WYSIWYG editing interface or verbose markup.

SAM attempts to bridge these two worlds, providing a lightweight syntax that is easy to read, write, and edit, and which makes the structure of the content clear and explicit. SAM is also designed to be extensible and to support the specification of new tags through a schema. (Currently the SAM Parser uses an XML schema language to validate the output XML, but a true SAM schema language is planned.)

Most lightweight markup languages provide a fixed set of document structures with little or no extensibility. XML provides no document structures by default and makes you specify everything in a schema. SAM defines a limited set of core text structures such as paragraphs and lists and then leaves it to you to specify the semantic structures that contain the text.

It is important to note that SAM is not designed as a language for text representation. That is, it is not a language designed for taking an existing text, and, without violence to that text, adding metadata to explicate its meaning or structure. For that you need a form of markup with the ability to clearly delineate in-band and out-of-band data. XML provides this by consigning in-band data (the text of the document) to the textual content of elements, and out-of-band data (metadata describing the document) to attributes. SAM defines a few preset management attributes but provides no facility for creating new ones. Any additional metadata you want to add to a block, for instance, has to take the form of fields. This muddies the distinction between in-band and out-of-band information. For content creation, this is entirely appropriate, and the whole element/attribute distinction is tedious and confusing. If you are looking for a markup language that is designed so that you can simply remove the markup and be left with the original document, SAM is not the right choice.

Processing model

SAM is a cross between an abstract or meta language like XML and a concrete language like Markdown. XML does not describe the semantics of any of the tagging languages you base on it. The XML parser does not implement any markup semantics. Implementing semantics is left to the application layer -- another program or set of programs that process the output of the parser. In many XML-based tool chains, the application layer is written in XSLT. Markdown, by contrast, describes the entire semantics of the language. The Markdown processor executes the entire semantics of the language, producing HTML output directly from Markdown. It does not require a separate application layer.

SAM follows the XML model. The SAM parser simply reads the SAM markup and extracts its structures. It does not interpret the semantics of the document, but leaves that to the application layer.

A SAM parser may make the structure of a SAM document available to the application layer in different ways. It can expose the structure via an API and/or it may output an XML representation of the structure of the SAM document.

Semantics

While the SAM parser does not act on the semantics of the document, deferring that to the application layer, it contains concrete markup which the writer is entitled to expect will be processed in a particular way. These are the intended semantics of SAM structures. The application layer is expected to honor and implement the semantics of SAM markup.

A lot of SAM's concrete markup is simply shortcuts for common writing structures like lists. The exact same structures could be created using blocks, fields, and paragraphs, and you can create variant structures (other types of lists, for example) using these standard components. However, some of SAM's concrete markup expresses a particular view about markup language design. For example, its use of annotations and citations expresses a particular view about how text inside paragraphs should be marked up. Since SAM does not provide other facilities to create markup in paragraphs, it may not be suitable for expressing all types of markup language design. SAM is not, and is not intended to be, a full replacement for XML.

SAM Structure

As in all other markup languages, a SAM document is essentially a set of nested blocks of various types. In concrete markup languages like Markdown there is a fixed set of block types, each with a specific meaning, and each with a corresponding piece of syntax that allows the parser to identify them. In an abstract markup language like XML, there are a few abstract block types that have no inherent meaning, but can be named so that a processing application can attach meaning to the name in specific markup languages. Thus a paragraph is represented with a <p> element in HTML but by a <para> element in DocBook.

As a hybrid language, SAM has a set of concrete block types, each with its own semantics and syntax, and a set of abstract nameable block types that can be used to define additional semantics for specific SAM-based markup languages. Thus in SAM a paragraph is always created as a block of text, but you can create named blocks to capture things like author-name:.

Blocks may contain flows. A flow is a data value, the contents of a block, but a flow can also contain other structures. This is also a common feature of document markup languages. In HTML, for instance, you can have an <a> element in the middle of a paragraph to create a link. XML-based languages, including HTML, use elements for structures within a flow (creating something called mixed content). In SAM, structures within a flow are distinct from blocks and have a different syntax.

Several kinds of structures can exist in a flow, including a phrase, an insert, and a citation. A phrase can be annotated to indicate its semantics, which makes phrases an extensible feature of SAM.

The following example shows a paragraph with two annotated phrases.

In {Rio Bravo}(movie), {the Duke}(actor "John Wayne" (SAG)) plays a union colonel.

SAM Syntax

Syntactically, SAM takes its cues from YAML and Python (for displaying overall structure), from Markdown (for displaying text structures), and from common English usage (for displaying annotations). In contrast to some of the richer lightweight markup languages, it tries to minimize the use of exotic forms of punctuation to delineate structures.

The intention is that an author writing in SAM should feel no need to use a visual editor, but should work directly in raw SAM (albeit with editor-supplied syntax coloring if desired). A SAM document should be a very natural read.

SAM is not designed to be a full replacement for XML. It is far less general, because of its use of predefined syntax for common text structures such as lists and paragraphs, and it has very limited support for attributes. Basically, each structure has a fixed set of attributes available. There is no extensibility for attributes. This is essential to maintain the readability of SAM sources, but it limits certain styles of markup design.

In particular, SAM is designed to support a declarative approach to semantic authoring, meaning that it is designed to express declarations about the structure of the document and the meaning of the text, rather than to embed management and processing information into the source, a practice common in many structured authoring languages.

Document-oriented markup languages such as reStructuredText and Markdown tend to be very flat in structure. One thing follows another. There is rarely a case where one thing is inside another. For example, these languages use heading levels, but don't delineate sections of the document. More semantically rich languages tend to be more hierarchical, with one structure inside another. For example, a document might be divided into sections with title and paragraph elements inside a section. This means that you need to show which structures are inside others and where the containing structure ends.

XML does this by having an explicit end tag for every element. A large part of the verbosity of XML comes from its requirement to close every element tag. Many programming languages, which similarly have to express nested structures, use some form of brackets to delineate each structure. Python is an exception to this. It uses indentation to express the relationship between structures. This is a far more natural way for writers to express the subordination of elements -- one that is sometimes also used in published works. SAM uses this approach and avoids the use of end tags entirely.

SAM syntax defines the following structures:

named block

Named blocks are structures that contain other structures. For instance, a section is a block. Blocks must have names. Named blocks (and named fields) are how you introduce new structures into a SAM document. Named blocks may be referred to as "blocks" in this document, but all of SAM's concrete block types are also blocks.

field

Fields are named blocks that contain individual values. For instance, the date of a document is usually a field.

flow

A flow is a string of text. All text content in a SAM document constitutes a flow. A paragraph is a flow. The value of a field is a flow. A title is a flow. Etc. Flows can contain phrases, citations, and inserts.

paragraph

A paragraph is a continuous block of text across one or more contiguous lines. The content of a paragraph is a flow.

phrase

A phrase is a string of characters in a flow. You use a phrase when you want to attach an annotation or a citation to a string of text.

string

Strings are named pieces of text that can be included by reference in other parts of a document or in other documents.

annotation

Annotations are metadata attached to a phrase.

attributes

Attributes are metadata attached to a block. There is a very limited set of attributes, with specific rules for where they can appear. For most of the purposes for which you would use attributes in XML, you should use fields in SAM.

block insertion

Block insertions are an instruction to the application layer to insert content from an internal or external source at the block level.

flow insertion

Flow insertions are an instruction to the application layer to insert content from an internal or external source at the flow level (that is, inside a flow).

citations

Citations are references to other works or structures.

include

You can include one SAM file in another.

ordered list

An ordered list is a list in which the order of the items is meaningful. An ordered list is any sequence of one or more ordered list items.

unordered list

An unordered list is a list in which the order of the items is not meaningful (at least as far as markup semantics is concerned). An unordered list is any sequence of one or more unordered list items.

labeled list

A labeled list is a list in which each item has a label (rather than a bullet or a number). A labeled list is a sequence of one or more labeled list items. A labeled list item consists of a label and a single paragraph.

line

A line is a single line of text, as in a poem, with a fixed line end. The content of a line is a flow.

record set

A record set is a simple database table with named columns. It allows you to capture tabular data at a semantic level rather than at the presentation level of a table.

grid

A grid is a simple table-like layout structure.

fragment

Fragments are sections of a document (potentially larger than a single paragraph) that can be included by reference in other parts of a document or in other documents. Strings used in a fragment can be redefined when a fragment is reused, allowing you to vary the text of the fragment when it is used in different places. (Note that this describes the presumed semantics of fragments. Implementation of these semantics is up to the application layer.)

comment

A comment is a comment on the SAM markup file itself, as opposed to a comment on the document. For comments on the document, use a remark.

remark

A remark is a comment on the text of the document, intended for editorial purposes. It must be attributed to a specific author.

Name tokens

Several things in SAM have names. These names must be valid name tokens. A name token in a SAM document must be a valid XML local name. This is to allow ease of integration with XML tools and languages and to allow a straightforward serialization of SAM documents to XML.

Name-like string

A name-like-string is a string that looks like a name. Name-like-strings are used in parsing SAM files to detect the intention of the author to create a name. Not all name-like-strings are valid names. If the parser sees a name-like-string that is not a valid name, it reports an error.

The reason for recognizing name-like-strings rather than only recognizing valid names is that it enables the parser to catch cases where the writer probably intended to create a name but failed to create a valid name.

By recognizing name-like-strings the parser is able to report errors on these dubious cases, which reduces the likelihood of misinterpreting the author's intent. In the case where the name-like-string is indeed intended to be interpreted as plain text, the writer can ensure it is interpreted as such by escaping the concluding colon with a backslash.

However, the definition of a name-like-string can't interfere with the recognition of other SAM structures. Therefore name-like-strings don't include characters that might be mistaken for SAM markup.

The current definition of a name-like-string is as follows, but this is subject to change:

(?P<name>\w[^\s`]*?) 
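As a rough illustration of how a parser might use this pattern, here is a minimal Python sketch. It is not the SAM parser's actual code. The anchoring at the start of the line, the trailing colon, the backslash check, the ASCII-only approximation of an XML local name, and the function name classify are all assumptions made for the example.

```python
import re

# The name-like-string pattern quoted above. The "^" anchor, the trailing
# colon, and the "(?<!\\)" check for a backslash-escaped colon are
# assumptions added for this sketch.
NAME_LIKE = re.compile(r"^(?P<name>\w[^\s`]*?)(?<!\\):")

# ASCII-only approximation of an XML local name. The real XML rules
# accept many more Unicode characters than this.
XML_LOCAL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*$")


def classify(line):
    """Return 'name', 'error', or 'text' for the start of a line."""
    match = NAME_LIKE.match(line.lstrip())
    if match is None:
        return "text"    # nothing that looks like a name
    if XML_LOCAL_NAME.match(match.group("name")):
        return "name"    # name-like and also a valid name
    return "error"       # name-like, but not a valid name
```

Under these assumptions, classify("10:30 is the meeting time") returns "error", because "10" looks like a name but is not a valid XML local name, while a line starting with "Note\:" is treated as plain text.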

Syntax

A block in SAM is formed by starting a line with a name followed by a colon:

description:
    This is the content of a block named "description".

Structures in SAM can contain other structures. The hierarchy of structures in SAM is defined by indentation. In the example above, the paragraph structure is inside the block named description because it is indented under it.
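The way indentation alone can express nesting may be easier to see in code. The following Python sketch is only an illustration and assumes space-only indentation. The function name build_tree and the dictionary layout are hypothetical and are not part of SAM or its parser.

```python
def build_tree(text):
    """Nest non-blank lines under the nearest preceding line that is less indented."""
    root = {"text": None, "indent": -1, "children": []}
    stack = [root]  # chain of currently open containers, outermost first
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines carry no structure in this sketch
        indent = len(raw) - len(raw.lstrip(" "))
        node = {"text": raw.strip(), "indent": indent, "children": []}
        # Close every container that is indented as much as, or more than,
        # the new line. No end tags are needed to do this.
        while stack[-1]["indent"] >= indent:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


sample = (
    "description:\n"
    '    This is the content of a block named "description".\n'
)
tree = build_tree(sample)
```

Here tree has one child, the description: line, and that node in turn has the indented paragraph line as its only child.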

SAM Object Model

The SAM Object Model (SOM) is a way of representing the structure of a SAM document as a data structure to a computer program. The SAM Object Model is a way for programs to access a SAM document in a standard way. Unless you are concerned with writing programs to access the SAM Object Model you can ignore all references to it in this document. The SAM Object Model is the model implemented by the current SAM parser, which is written in Python. However, SAM parsers are not required to implement the SAM Object Model.

The other way for applications to access the structure of a SAM document is by serializing it to XML. This is the only required access method for parsers to implement.

Like the language itself, the SOM is a tale that grew in the telling. In the absence of a full formal grammar for SAM, the object model must be considered provisional. It should be considered a less stable interface to a SAM document than the canonical XML serialization.

Parsing

A SAM document may be interpreted by a SAM parser to form an object tree in memory, which can include the SAM object model or any standardized XML representation such as SAX or DOM. In these cases, the DOM or SAX interface should represent the standard XML serialization, or an optional variant of that serialization, not the SOM. A SAM parser must offer the option of serializing the document as XML. A SAM parser may offer various methods of serializing XML, but it must default to the serialization described in this document.

The following is a list of the structures of a SAM document.

Declaration

A SAM file may start with one or more declarations. A declaration gives the SAM parser information it needs to correctly parse the file. Declarations do not become part of the output.

Declarations must come before all other content. A declaration is created by a line beginning with an exclamation mark followed by a declaration type followed by a colon.

!namespace: http://spfeopentoolkit.org/sam/ns/tests

Declarations cannot take attributes.

A declaration is an instruction to the parser on how to interpret the SAM document. Declarations are not part of the content of the document described by the SAM markup. Rather, they describe the markup itself.

The following declarations are supported:

namespace

Sets the namespace for the entire SAM document. You cannot assign namespaces to individual structures in the text of a SAM document like you can in XML. When schema support is available, you will be able to assign individual elements to namespaces in the schema.

annotation-lookup

Sets the annotation lookup mode for the document. The options are:

  • on: Annotation lookup is performed in a case insensitive manner.

  • off: Annotation lookup is not performed.

  • case sensitive: Annotation lookup is performed in a case sensitive manner.

  • case insensitive (default): Annotation lookup is performed in a case insensitive manner.

  • <custom>: Annotation lookup is performed using a custom algorithm supported by a user-supplied extension to the parser.

smart-quotes

Sets the smart quotes processing mode for the parser. The options are:

  • on: Smart quotes processing is performed using the default rules.

  • off (default): Smart quotes processing is not performed.

  • <custom>: Smart quotes processing is performed using a custom rule set provided by the parser. The default parser supports this via an XML file which is specified on the command line using the -sq option. Other parsers are free to support smart quotes or not, and to support custom smart quotes rules or not using a mechanism of their choice.

The namespace declaration causes the namespace of a document to be set to the specified namespace. This will result in the appropriate namespace declaration in the serialized XML file. A parser may handle namespace prefixes in any way it sees fit.

The namespace declaration has no effect on the HTML serialization. If a namespace declaration is used, it is not included in the HTML output.

The namespace declaration is used to update the namespace property of all blocks. In future, the namespace property of blocks may be set by the schema.

The schema declaration specifies the location of a schema file. The location will be processed through an XML catalog if one is specified to the parser. If a schema is specified, the namespace declaration is ignored and namespaces are assigned per the schema. A schema may assign different elements to different namespaces.

!schema: http://spfeopentoolkit.org/schemas/think-plan-do-topic.sams
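The declaration syntax is simple enough to sketch in a few lines of Python. This is a minimal sketch rather than the real parser's behavior. The function name read_declarations and the decision to skip blank lines between declarations are assumptions. The two defaults are the ones listed above.

```python
# Defaults taken from the option lists above.
DEFAULTS = {"annotation-lookup": "case insensitive", "smart-quotes": "off"}


def read_declarations(lines):
    """Collect leading '!type: value' lines and return (declarations, remaining lines)."""
    declarations = dict(DEFAULTS)
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # assumption: blank lines between declarations are ignored
        if not line.startswith("!"):
            # Declarations must come before all other content, so the first
            # other line ends the declaration section.
            return declarations, lines[index:]
        # Split on the first colon only, so a URL value keeps its own colons.
        decl_type, colon, value = line[1:].partition(":")
        if not colon:
            raise ValueError(f"declaration without a colon: {line!r}")
        declarations[decl_type.strip()] = value.strip()
    return declarations, []
```

Because declarations describe the markup rather than the content, the sketch returns them separately from the lines that make up the document itself.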

Root

Every SAM document has a root which contains the entire content of the document. The root can contain a single block which is the parent of all other blocks in the document. This block is called the document block.

The root has no syntax of its own. It is an implicit container for the contents of the file.

The root cannot take attributes.

The root has no semantics of its own. It is just a container.

The root can contain the following, in order:

  • One or more declarations.

  • One or more comments, remarks, or string definitions.

  • The document block.

  • One or more comments or remarks.

The XML declaration is output as follows:

<?xml version="1.0" encoding="UTF-8"?>

The HTML header is output as follows:

<!DOCTYPE html>
<html lang="en">
<head>
<title>SAM Parser Tests</title>
<meta charset = "UTF-8">
</head>
<body>
...
</body>
</html>

The root is represented by a Root object that is created automatically by the parser when parsing starts.

All declarations, comments, remarks, and string definitions prior to the document block, and the document block itself, are children of the Root object.

Note that the SOM Root object is not the same thing as the document block. The Root object is a container for all the structures declared at column zero in the SAM document. The document block is the only block structure permitted as a direct child of the Root object.

Field

A field is a named container with a single value.

A field is indicated by a string consisting of:

  • the appropriate indent for its position in the document structure

  • the field name

  • a colon

  • attributes, if any

  • at least one space

  • the field value

  • a line end

name: Fred Flintstone
address:(?foo) Bedrock
era: BC

The name of a field must be a valid SAM name.

The value of a field is a flow, meaning it can contain phrases, decorations, annotations, and citations.

A field can take the standard attributes.

A field is essentially a way of creating a key-value pair in a document.

A field is serialized as an XML element. The field content is the data content of the XML element. The name of the XML element matches the name of the field. Thus the example above would be serialized as:

<name>Fred Flintstone</name>
<address>Bedrock</address>
<era>BC</era>

A field is serialized as an HTML div element with the field name given as the value of the class attribute:

<div class="name">Fred Flintstone</div>
<div class="address">Bedrock</div>
<div class="era">BC</div>

A field is represented by a Block object. The only difference between a block and a field is whether it has children. If it has children, it is a block. If not, it is a field.

This design is mostly to make parsing easier. When we see a field definition in a file we can't tell without looking ahead whether it is going to have children or not. Therefore we treat it as a block. The difference between a field and a block really only occurs on serialization, where the content of a field is simply its text value, whereas the content of a block is treated as a title and output as a <title> element.

Programs accessing the SOM directly should note that the value of a block should be treated as the title of that block.

The value of a field is contained in a Flow object.

Block

A block is a named container for other structures. Blocks and fields are the extensible structures of a SAM document that allow you to create a document with specific semantics in any of the structured writing domains.

Like a field, a block can have a value. However, a block can also have children. A block is simply a field with children. (Or you could equally say that a field is a block without children.)

Blocks are used to model large document structures such as sections. The value of the block is treated as the title of the section.

A block is syntactically identical to a field. It is distinguished from a field based on whether or not it has children. The children of a block are any structures following the block header that are indented more than the block header.

The following example defines a block named movie-review. It is a block because it has children. Its children are the field movie and the block stars:

movie-review: Wayne shines in Rio Bravo
    movie: Rio Bravo
    stars:
        star: John Wayne

The movie-review block has a value, "Wayne shines in Rio Bravo", which will be treated as the title of the block on serialization. The stars block does not have a value. Its function is to act as a container for the list of stars. (As part of the regular structure of a movie review, the list of stars would likely be assigned a title during output processing, but it does not need one in the source.)

A block can take the standard attributes.

The value of a block is treated as the title of the block, and will be serialized as such.

Structurally, the title of a block is its value, in just the same way that the content of a field is its value.

The SAM markup above would be serialized to XML as follows:

<movie-review>
<title>Wayne shines in Rio Bravo</title>
<movie>Rio Bravo</movie>
<stars>
<star>John Wayne</star>
</stars>
</movie-review>

A block is serialized to HTML as a div element with the class attribute set to the name of the block. The block title is output as an HTML heading element with the heading number calculated based on its place in the hierarchy of the document structure. Thus the SAM markup above would be serialized to HTML as follows:

<div class="movie-review">
<h1 class="title">Wayne shines in Rio Bravo</h1>
<div class="movie">Rio Bravo</div>
<div class="stars">
<div class="star">John Wayne</div>
</div>
</div>

The representation of a block in the SOM is identical to the representation of a field. A block is distinguished from a field in the SOM by whether or not it has children.
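The rule that a field and a block share one representation, and differ only in whether they have children, can be sketched in Python as follows. The Block class here is a hypothetical stand-in written for this example. It is not the SOM class of the real parser, and it ignores attributes and flows.

```python
import xml.etree.ElementTree as ET


class Block:
    """Hypothetical stand-in for a SOM block or field (same class for both)."""

    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value
        self.children = children or []

    def is_field(self):
        # A field is simply a block without children.
        return not self.children

    def to_xml(self):
        element = ET.Element(self.name)
        if self.is_field():
            # The value of a field becomes the content of its element.
            element.text = self.value
        else:
            # The value of a block becomes its <title>.
            if self.value is not None:
                ET.SubElement(element, "title").text = self.value
            for child in self.children:
                element.append(child.to_xml())
        return element


review = Block("movie-review", "Wayne shines in Rio Bravo", [
    Block("movie", "Rio Bravo"),
    Block("stars", children=[Block("star", "John Wayne")]),
])
xml_text = ET.tostring(review.to_xml(), encoding="unicode")
```

In this sketch xml_text holds the same elements as the XML serialization shown above, written on a single line without the line breaks.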

Document block

The document block is the first structure declared in the document after any declarations, comments, remarks or string definitions. It is the container for all other blocks in the document.

Any type of block can be the document block, including concrete structures such as paragraphs and lists. However, choosing a structure other than a named block as the document block will be extremely limiting, as most concrete structure types allow for little or no nesting of other blocks. For instance, nothing can be nested under a paragraph, so while a paragraph can be the document block of a SAM document, it would also be the only structure in the document.

The document block is not indented. All other structures in the document (excluding comments, declarations, and string definitions) must be children of the document block, and therefore must be indented under it. In all other respects, the syntax of the document block is the syntax of its type.

The document block determines the type of the SAM document (just as the root element name in XML forms the type of an XML document). In all other respects, the semantics of the document block are no different from the semantics of that structure type.

The document block is serialized according to its block type. In addition, however, if a namespace is declared for the document, the appropriate namespace declaration will be added to the root element generated.

In HTML, the document block is serialized according to its block type.

The document block is the only block child of the Root object. In every other respect, it is represented as its normal block type.

Paragraph

A paragraph is a block of text.

A paragraph is indicated by a sequence of lines ending with a blank line.

Now is the time
for all good men
to come to the aid
of the party.

It was the best of 
times, it was the 
worst of times.

Any line that is not identified as a different structure, or an attempt to create a different structure, is identified as a paragraph.

In some cases a paragraph may begin with a sequence that looks like an attempt to create a different structure. For instance, if there is a colon before the first space in a paragraph, and if the string before the colon is a name-like-value, then the writer can prevent the sequence being recognized as a block or field by escaping the colon with a backslash or a character escape.

Once the first line of a paragraph has been identified, the parser simply accumulates subsequent lines, without looking for other structures, until one of the following is encountered:

  • A blank line.

  • A line whose indent is less than the indent of the first line of the paragraph.

  • If the paragraph is the child of a list item, the start of a new list item.
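The accumulation rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser's implementation. The function name take_paragraph is hypothetical, and because list item syntax is not described at this point, the test for the start of a list item is passed in as an optional caller-supplied function.

```python
def take_paragraph(lines, start, starts_list_item=None):
    """Return (paragraph text, index of the first line after the paragraph)."""

    def indent_of(line):
        return len(line) - len(line.lstrip(" "))

    first_indent = indent_of(lines[start])
    collected = [lines[start].strip()]
    index = start + 1
    while index < len(lines):
        line = lines[index]
        if not line.strip():
            break  # a blank line ends the paragraph
        if indent_of(line) < first_indent:
            break  # a line indented less than the first line ends it
        if starts_list_item is not None and starts_list_item(line):
            break  # inside a list item, a new list item ends it
        collected.append(line.strip())
        index += 1
    # Joining the trimmed lines with single spaces gives the flow text.
    return " ".join(collected), index


lines = [
    "Now is the time",
    "for all good men",
    "to come to the aid",
    "of the party.",
    "",
    "It was the best of ",
    "times, it was the ",
    "worst of times.",
]
first, after_first = take_paragraph(lines, 0)
second, after_second = take_paragraph(lines, after_first + 1)
```

Note that the loop does not look for other structures while it accumulates lines, which matches the description above: only the three stopping conditions are checked.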

A paragraph cannot have block children. It is an error if any structure is indented under a paragraph.

The content of a paragraph is a flow. A flow is distinct from a simple string value in that it can contain phrases, decorations, code, annotations, and citations.

Some writers may instinctively indent a list under the paragraph it follows. This is an error in SAM. A list is always the peer of the preceding paragraph in SAM and as such the list items must start at the same indent level as the preceding paragraph. The output formatter may, of course, choose to indent lists relative to their sibling paragraphs.

A paragraph is a block of text. SAM has no opinion on whether the block of text would be considered a paragraph in grammatical or literary terms. As far as SAM is concerned, it is simply a block of text set off from other blocks of text and other structures. (This is identical to the meaning of paragraph in most markup languages.)

A paragraph is serialized as an XML p element. Thus the example above would be serialized as the following XML:

<p>Now is the time for all good men to come to the aid of the party.</p>

<p>It was the best of times, it was the worst of times.</p>

A paragraph cannot have block children. It is a syntax error to have anything indented under a paragraph.

A paragraph is serialized as an HTML p element.

<p class="p">Now is the time for all good men to come to the aid of the party.</p>

<p class="p">It was the best of times, it was the worst of times.</p>

A paragraph is represented by a Paragraph object with a single child Flow object.

Flow

A flow is a section of document text in a SAM document. A flow can contain intra-textual structures like annotations, decorations, and inline inserts. Paragraphs, field values, block values, and variable values are all flows.

There is no special syntax to define a flow. Flows are defined by their relationship to other structures. All flows are trimmed of leading and trailing whitespace. In composing a flow from multiple text lines, all spaces and returns are collapsed to a single space.

The text in a flow is simply text, along with any intra-textual structures it contains.

The flow object is not represented as an object in XML. Its content becomes the content of whatever element is created to represent the structure that contains the flow.

The flow object is not represented as an object in HTML. Its content becomes the content of whatever element is created to represent the structure that contains the flow.

A flow is contained in a Flow object.

Phrase

A phrase structure is a delineated string of text within a flow. You create a phrase in order to add an attribute, annotation, or citation to a word or phrase in the text.

A phrase is indicated by surrounding the phrase with curly braces:

In {Rio Bravo}, {the Duke} plays a union colonel.

It is not an error to have a phrase with no attributes, annotations, or citations attached. When such a phrase is found, the parser must:

  • First, look backward through the document to find out if the same phrase has been annotated before. If it finds the same phrase, with an annotation, it copies the entire phrase and all attached annotations to the current phrase. The search proceeds in reverse document order and stops at the first instance found.

  • Second, if no matching phrase is found, the parser must issue a warning for the unannotated phrase.

The search is conducted based on the annotation lookup mode in effect for the document, as determined by the annotation-lookup declaration.

Note that attributes, citations, and local annotations are not copied.
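The lookup described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the parser's implementation: the Phrase class and lookup_annotations function are hypothetical, and the lookup mode is modeled as a caller-supplied comparison function because the available modes are not defined in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Phrase:
    text: str
    annotations: list = field(default_factory=list)        # copied by lookup
    local_annotations: list = field(default_factory=list)  # never copied
    attributes: list = field(default_factory=list)         # never copied
    citations: list = field(default_factory=list)          # never copied


def lookup_annotations(earlier_phrases, phrase, same=lambda a, b: a == b):
    """Search earlier phrases in reverse document order.

    Returns a copy of the annotations of the first annotated match,
    or None, in which case the caller should issue a warning.
    """
    for candidate in reversed(earlier_phrases):
        if candidate.annotations and same(candidate.text, phrase.text):
            return list(candidate.annotations)
    return None
```

Only the annotations list is returned, which reflects the rule that attributes, citations, and local annotations are not copied.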

A phrase has no semantics of its own. Its semantics are determined by any annotations attached to it.

A phrase is serialized as a phrase element in XML.

<p>In <phrase>Rio Bravo</phrase>, <phrase>the Duke</phrase> plays a union colonel.</p>

A phrase is serialized as a span element in HTML with the class attribute set to phrase.

<p>In <span class="phrase">Rio Bravo</span>, <span class="phrase">the Duke</span> plays a union colonel.</p>

A phrase is represented by a Phrase object.

Attribute

Attributes are metadata on a block or phrase. Unlike XML, which allows an unlimited number and variety of attributes, SAM supports a limited set of attributes, with the attribute type being indicated by its position or by a prefixed symbol. If you need attributes that are not in the set supplied by SAM, you will have to model them as fields instead.

Attributes on blocks are contained in parentheses that follow the colon of a block header with no space:

name:(?BC) Fred Flintstone

A block can have multiple attributes, one after another, with no spaces between:

name:(?BC)(#fred) Fred Flintstone 

Attributes on phrases are contained in parentheses that follow the closing curly brace of the phrase or the closing parenthesis or bracket of another annotation, attribute, or citation.

An attribute on a phrase looks like this:

In Quebec, a stop sign says {Arrêt}(!fr).

A phrase can have an unlimited mix of attributes, annotations, and citations attached to it.

There are four general attributes. The type of an attribute is determined by the flag character that precedes it. The four general types are:

  • Condition, which starts with ?

  • Name, which starts with #

  • ID, which starts with *

  • Language code, which starts with !

A block or phrase can have only one name, ID, or language code attribute, but it may have an unlimited number of condition attributes.

Condition expressions must not include commas.

Names must be SAM name tokens.

IDs must be SAM name tokens.

Languages should be BCP 47 language codes, but the parser is not required to enforce this.
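The flag characters and the constraints above can be sketched in Python. This is an illustration only: parse_attributes is a hypothetical name, the sketch does not check that names and IDs are valid SAM name tokens, and it does not handle the special attributes of particular block types.

```python
import re

# Flag character -> attribute type, as listed above.
FLAGS = {"?": "condition", "#": "name", "*": "id", "!": "language"}


def parse_attributes(text):
    """Parse a run of attributes such as '(?BC)(#fred)' into a dict."""
    result = {"conditions": []}
    for body in re.findall(r"\(([^()]*)\)", text):
        kind = FLAGS.get(body[:1])
        if kind is None:
            raise ValueError(f"unknown attribute flag in ({body})")
        value = body[1:]
        if kind == "condition":
            if "," in value:
                raise ValueError("condition expressions must not include commas")
            result["conditions"].append(value)  # any number of conditions
        elif kind in result:
            raise ValueError(f"only one {kind} attribute is allowed")
        else:
            result[kind] = value
    return result
```

For the block header example shown earlier, parse_attributes("(?BC)(#fred)") yields one condition, BC, and the name fred.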

Certain block types take additional attributes with meanings specific to those block types.

Attributes provide support for basic management domain metadata in a SAM file.

Unlike XML, SAM does not support arbitrarily named attributes. Only a fixed set of management attributes is supported. The semantics of each attribute type are as follows:

Condition

A condition attribute is used to determine if a block should appear in a particular output or not. If the condition expression evaluates to true, then the block is included. If it evaluates to false, it is not included.

Conditions are evaluated from the outside in, so if a block is conditionally excluded, all its contents are excluded, regardless of any conditional expressions that may occur on any of its children.

The syntax, interpretation, and truth values of condition expressions are left entirely to the application layer. Similarly, including or excluding blocks is left entirely to the application layer.

Multiple conditions may be applied to the same block.
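Outside-in evaluation can be sketched in Python. Because SAM leaves condition evaluation to the application layer, everything here is an assumption made for illustration: blocks are modeled as plain dictionaries, filter_blocks and is_true are hypothetical names, and a block with multiple conditions is included only if all of them are true.

```python
def filter_blocks(block, is_true):
    """Return a copy of the block tree with excluded blocks removed.

    Returns None when the block itself is excluded. The children of
    an excluded block are never examined.
    """
    if not all(is_true(c) for c in block.get("conditions", [])):
        return None
    kept = []
    for child in block.get("children", []):
        filtered = filter_blocks(child, is_true)
        if filtered is not None:
            kept.append(filtered)
    return {**block, "children": kept}
```

A child whose own condition is true is still dropped when its parent is excluded, because the recursion never reaches it.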

ID

An ID attribute assigns an ID to a block. That ID can be referenced by an idref. IDs must be unique in a file. Idrefs must refer to an ID in the current file. The action to be taken by the application layer on encountering an idref is up to the tagging language designer.

Multiple ids are not permitted on the same block.

Name

A name attribute assigns a name to a block. A name performs the same function as an ID, allowing the block to be referenced from elsewhere. However, there are no limits on the scope of names. Uniqueness of names is not enforced. Name references are not required to match a name in the current file. This allows you to use names to reference blocks in other files. (Note that namerefs do not support naming the file in which a name is located. Rather, it is left to the application layer to locate names by whatever mechanism seems appropriate.)

Language code

A language code attribute specifies the language in which the content of a block or phrase is written. It is expected, but not required, that standard BCP 47 language codes be used to identify languages.

Certain block types have additional attributes that are specific to those block types. See the relevant entries for details.

This is the list of the standard attributes allowed on each structure. Block types that take special attributes are indicated with <special>:

block !language #name ?condition *id
field !language #name ?condition *id
fragment !language #name ?condition *id
grid !language #name ?condition *id
recordset !language #name ?condition *id
ordered-list-item !language #name ?condition *id
unordered-list-item !language #name ?condition *id
line !language #name ?condition *id
blockquote !language #name ?condition *id
phrase !language #name ?condition *id
codeblock #name ?condition *id <special>
string-def
paragraph
record
block insert #name ?condition *id <special>
inline insert #name ?condition *id <special>
include <special>

Note that the lack of arbitrary named attributes does not reduce the semantic flexibility of SAM as compared to XML. Any metadata that can be expressed by an attribute on a block can also be expressed by a field that is a child of that block. The reason for using attributes in XML is to preserve a pure distinction between text and metadata, but that is not a design consideration in SAM, which is meant for the creation of new text through structured writing, not for the annotation of existing texts for study.

Attributes are represented in XML by XML attributes on the block element.

  • language is represented by an xml:lang attribute.

  • ID is represented by an id attribute.

  • name is represented by a name attribute.

  • conditions are represented by a conditions attribute. If multiple condition attributes are applied in SAM, they will be concatenated with commas in the conditions attribute value.
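The mapping in the list above can be sketched in Python. This is an illustration, not the serializer's code: xml_attributes and its parameter names are hypothetical, and the order in which the attributes are written is an assumption.

```python
from xml.sax.saxutils import quoteattr


def xml_attributes(language=None, ident=None, name=None, conditions=()):
    """Build the XML attribute string for a block's SAM attributes."""
    parts = []
    if conditions:
        # Multiple SAM conditions are concatenated with commas.
        parts.append(f"conditions={quoteattr(','.join(conditions))}")
    if ident is not None:
        parts.append(f"id={quoteattr(ident)}")
    if name is not None:
        parts.append(f"name={quoteattr(name)}")
    if language is not None:
        parts.append(f"xml:lang={quoteattr(language)}")
    return " ".join(parts)
```

Joining conditions with commas is unambiguous only because condition expressions must not themselves contain commas.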

Attributes on phrases are serialized wrapped around the phrase to which they apply just as described for annotations.

Attributes are represented in HTML by HTML attributes on the block or phrase element.

  • language is represented by a lang attribute.

  • ID is represented by an id attribute.

  • name is represented by a data-name attribute.

  • conditions are represented by a data-conditions attribute. If multiple condition attributes are applied in SAM, they will be concatenated with commas in the data-conditions attribute value.

Attributes on phrases are serialized wrapped around the phrase to which they apply just as described for annotations.

Attributes on blocks are represented by equivalent attributes of the Block object.

Attributes on phrases are represented by equivalent attributes of the Phrase object.

Annotation

Annotations clarify what a piece of text is about. Whereas the role of an attribute is to attach management metadata, the role of an annotation is to attach metadata that clarifies exactly what the phrase is describing so that it can be reliably processed by algorithms.

An annotation is attached to a phrase and is expressed as an expression in parentheses following the closing } of the phrase with no space in between.

An annotation specifies the type of a string. Here the type expression says that the string "Rio Bravo" is a reference to a movie:

In {Rio Bravo}(movie), {the Duke}(actor "John Wayne" (SAG)) plays a union colonel.

There are three parts to an annotation:

type

The first word immediately following the opening parenthesis is the type of the subject being annotated. In the sample above, "Rio Bravo" is a movie and "the Duke" is an actor. An annotation type must be a valid SAM name. The annotation type is required.

specifically

In some cases, the annotated text may not specify its subject clearly. In this case, the specifically attribute is used to clarify what is meant. The specifically attribute is a string in double or single straight quotes. In the sample above, "the Duke" means, specifically, "John Wayne". The specifically attribute is optional.

namespace

In some cases, it is necessary to specify the namespace to which the annotated term belongs. The namespace attribute follows the specifically attribute (if specified) and is contained in parentheses. In the sample above, the name of the actor "John Wayne" is part of a set of names managed by the Screen Actors Guild, which makes sure two actors don't use the same stage name. (In most cases, the namespace is implied by the type, so you will not usually need to specify it.) The namespace attribute is optional. (Note that the namespace attribute of an annotation is not related to the concept of the namespace of a block or field name. It is the namespace of the value, not the namespace of the container.)
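The three parts of an annotation can be picked apart with a regular expression, sketched here in Python. This is an illustration rather than the SAM grammar: ANNOTATION_BODY and parse_annotation are hypothetical names, the type is matched loosely as a run of characters that excludes whitespace, quotes, and parentheses (not a full check for a valid SAM name), and escaped quotes inside the specifically string are not handled.

```python
import re

ANNOTATION_BODY = re.compile(
    r"""
    ^\s*(?P<type>[^\s"'()]+)                                # required type
    (?:\s+(?P<quote>["'])(?P<specifically>.*?)(?P=quote))?  # optional quoted string
    (?:\s*\((?P<namespace>[^()]*)\))?                       # optional namespace
    \s*$
    """,
    re.VERBOSE,
)


def parse_annotation(body):
    """Parse the text between an annotation's outer parentheses."""
    match = ANNOTATION_BODY.match(body)
    if match is None:
        raise ValueError(f"not a valid annotation: {body!r}")
    return {
        "type": match.group("type"),
        "specifically": match.group("specifically"),
        "namespace": match.group("namespace"),
    }
```

For the sample above, the body of the second annotation yields the type actor, the specifically value John Wayne, and the namespace SAG.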

If you are annotating the same phrase more than once in the same file, you can skip the annotation metadata and just use curly braces around the phrase. The parser will copy the metadata from the most recent preceding instance of that same phrase.

You can create a local annotation, one that applies to this instance of a phrase and will not be copied by the annotation lookup mechanism. To create a local annotation, precede the opening parenthesis with a +.

You can cancel a looked-up annotation (one that would normally be added by the annotation lookup feature). To cancel an annotation, create a cancel annotation, which consists of an annotation with a minus sign (-) before the opening parenthesis. The type name of the cancel annotation specifies the annotation type to be canceled. Cancel annotations cannot have specifically or namespace attributes. They cancel all instances of the canceled type.

The specific semantics of individual annotations depend on the language designer. However, in order to maintain consistency across languages, it is important to use the specifically and namespace attributes for their intended purposes and not to hijack them to express some other semantics.

Annotations may be from any of the structured writing domains (media, document, subject, management). The relationship between the annotation type and the meaning of the specifically attribute changes in each domain.

In the media domain, the annotation type refers to some media property to be applied to the text, and the specifically attribute supplies more specific information. For example it could be {the Duke}(highlight "yellow") or {the Duke}(link "http://JohnWayne.com").

In the document domain, the annotation type refers to some document property, and the specifically attribute supplies more specific information. For example, it could be {the Duke}(index "Wayne, John").

In the subject domain, the annotation type refers to the type of the subject being referred to and the specifically attribute clarifies the intended subject if required. For example, it could be {the Duke}(actor "John Wayne").

The key to using the annotation attributes correctly is simply to make sure that the attribute names apply in a reasonable fashion to the values they contain. If they don't, a different markup strategy should be used.

An annotation is serialized to XML as an <annotation> element. The type, specifically, and namespace attributes become attributes with those names in XML.

The <annotation> element is wrapped inside the <phrase> element that it annotates, with the content of the phrase on the inside.

<p>In <phrase><annotation type="movie">Rio Bravo</annotation></phrase>, <phrase><annotation type="actor" specifically="John Wayne" namespace="SAG">the Duke</annotation></phrase> plays a union colonel.</p>

When a phrase has one or more annotations, those annotations are wrapped inside the <phrase> element in the order in which they occur in the source document. Thus the following:

The film starred {John Wayne}(actor)(director), who also directed.

would be serialized as:

<p>The film starred <phrase><annotation type="actor"><annotation type="director">John Wayne</annotation></annotation></phrase>, who also directed.</p>
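The nesting order can be sketched in Python. This is an illustration, not the serializer's code: serialize_phrase is a hypothetical name, and annotations are modeled as dictionaries with a type key and optional specifically and namespace keys.

```python
from xml.sax.saxutils import escape, quoteattr


def serialize_phrase(text, annotations):
    """Serialize a phrase with its annotations nested in source order."""
    inner = escape(text)
    # Wrap from the last annotation outward so that the first
    # annotation in the source ends up as the outermost element.
    for annotation in reversed(annotations):
        attrs = "".join(
            f" {key}={quoteattr(annotation[key])}"
            for key in ("type", "specifically", "namespace")
            if annotation.get(key) is not None
        )
        inner = f"<annotation{attrs}>{inner}</annotation>"
    return f"<phrase>{inner}</phrase>"
```

Called with the text John Wayne and the annotation types actor and director, in that order, the sketch produces the phrase element shown in the example above.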

An annotation is serialized to HTML as a <span> element with the class attribute set to annotation. The type attribute is serialized as a data-annotation-type attribute, the specifically attribute as a data-specifically attribute, and the namespace attribute as a data-namespace attribute.

<span class="annotation" data-annotation-type="foo">

The annotation <span> element is wrapped inside the phrase <span> element that it annotates, with the content of the phrase on the inside.

<p class="p">In <span class="phrase"><span class="annotation" data-annotation-type="movie">Rio Bravo</span></span>, <span class="phrase"><span class="annotation" data-annotation-type="actor" data-specifically="John Wayne" data-namespace="SAG">the Duke</span></span> plays a union colonel.</p>

When a phrase has one or more annotations, the annotation <span> elements are wrapped inside the phrase <span> element in the order in which they occur in the source document. Thus the following:

The film starred {John Wayne}(actor)(director), who also directed.

would be serialized as:

<p class="p">The film starred <span class="phrase"><span class="annotation" data-annotation-type="actor"><span class="annotation" data-annotation-type="director">John Wayne</span></span></span>, who also directed.</p>

An annotation is represented in the SOM by an Annotation object.

Citation

A citation is a reference to another resource. This includes internal references to other structures within the document, such as to a graphic, procedure, or footnote, as well as references to external works. Citations are supported wherever attributes or annotations are supported.

Citations come in two forms. Textual citations cite a resource using whatever citation format the language designer or writer wishes to use. Reference citations cite a resource by its id, name, or key.

A citation is created using square brackets:

Moby Dick[Melville, 1851] is about a big fish.

Citations can also be applied to a phrase, rather than floating in the text.

{Moby Dick}[Melville, 1851] is about a big fish.

Citations can be chained with annotations:

{Moby Dick}(novel)[Melville, 1851] is about a big fish.

Citations can also be added to blocks. The most obvious application of this is blockquotes:

"""[Melville, 1851] 
    Call me Ishmael.

SAM has no opinion about the format of a citation text. That is entirely up to the application layer to decipher.

You can use reference citations to reference other parts of the current document or other documents. To cite a resource that has an id within the current SAM document, reference the id like this:

Moby Dick is about a big fish[*fn.moby]. See [*fig.whale].

fig:(*whale)
    >>>(image whale.png)

footnote:(*moby)
    Actually, Moby Dick is a whale, not a fish.

Citation by id, name, or key can include additional information. For example, if you have a bibliography where each bibliography entry has a name assigned to it, you could cite the work by reference to the name, and then add additional information for the page number.

"""[#Moby page 1] 
    Call me Ishmael.

The additional information is anything between the end of the name and the closing square bracket.

Citations can also use compound identifiers. For instance, if you want to cite a figure in a chapter, you can cite it by an identifier that is composed of the chapter name and the figure id joined by a /:

See [#chapter.moby/*fig.whale]
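The structure of a reference citation can be sketched in Python. This is an illustration under stated assumptions: parse_reference_citation is a hypothetical name, only the # (name) and * (id) prefixes used in the examples above are handled, and the additional information is assumed to begin at the first space after the identifier.

```python
# Prefix character -> reference method. Other reference kinds, such
# as keys, are not covered by this sketch.
METHODS = {"*": "idref", "#": "nameref"}


def parse_reference_citation(body):
    """Split the text inside [...] into reference elements and extra text."""
    identifier, _, extra = body.strip().partition(" ")
    elements = []
    for piece in identifier.split("/"):
        method = METHODS.get(piece[:1])
        if method is None:
            raise ValueError(f"unsupported reference: {piece!r}")
        elements.append((method, piece[1:]))
    return elements, extra.strip()
```

A compound identifier produces one (method, value) pair per part, in order, and a simple identifier produces a single pair.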

The function of a citation is to generate a piece of content in the output that is a reference to the specified resource. Some text should always be generated in the place where the citation is specified. What type of reference that is depends on the type of object that is being referenced. Generating such references is the job of the application layer.

The purpose of a citation is not to create a link. Links should be created using annotations. However, the reference generated by a citation may include a link to the resource in question if appropriate.

The semantics of a reference by id, name, or key, depend on the type of the structure referred to. This is different from the usage in some specific markup languages in XML where there may be a specific element used to reference a footnote, such as <footnote idref="fn.moby"/>. In SAM, all references are generic citations, which would serialize as <citation idref="fn.moby">. This means that in order for the application layer to determine what kind of reference to generate for a citation, it needs to look up the type of the resource being referenced. Thus to determine that it should create a footnote reference, the application layer needs to look up the structure with the id fn.moby and determine its type. Once it determines that the type is footnote, it knows to create a footnote reference.
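The type lookup that the application layer has to perform can be sketched in Python. Everything in this sketch is hypothetical and belongs to the application layer, not to SAM: the cite_by_id function, the index of structures keyed by id, and the renderer functions that decide what text a footnote or figure reference produces.

```python
def cite_by_id(idref, structures_by_id, renderers):
    """Generate reference text for a citation by looking up its target."""
    target = structures_by_id.get(idref)
    if target is None:
        raise LookupError(f"no structure with id {idref!r}")
    # The citation itself is generic. The type of the referenced
    # structure decides which kind of reference is generated.
    render = renderers[target["type"]]
    return render(target)
```

An application might call it with an index such as {"fn.moby": {"type": "footnote", "number": 1}} and one renderer function per structure type.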

While it is not required, it is a good idea for markup designers and writers to adopt and follow a convention for naming the IDs and names of structures to reflect their type, such as in the use of fn. and fig. prefixes in these examples.

The semantics of a citation of an external work is that it will generate a reference to that work in whatever citation format is appropriate for publication. This means that a citation that is inline in the SAM document does not necessarily imply that an inline citation format will be used on output. The SAM citation is just capturing the citation information. It is up to the application layer to decide how to present it, including removing the citation to a footnote or endnote if appropriate. The language designer is entitled to specify any citation format they like, and to interpret it in any way they like.

Additional information in a citation is simply passed through to the application layer. SAM does not specify any semantics for it.

A citation is output as a <citation> element in XML.

For example,

<p>Moby Dick<citation>Melville, 1851</citation> is about a big fish.</p>

If the citation is attached to a phrase, then the citation will be inserted at the end of the <phrase> element:

<p><phrase>Moby Dick<citation>Melville, 1851</citation></phrase> is about a big fish.</p>

If citations and annotations are attached to the same phrase, then the annotations will be wrapped around the text of the phrase and then the citations will be serialized.

<phrase><annotation type="novel">Moby Dick</annotation><citation>Melville, 1851</citation></phrase> is about a big fish.

If the citation is attached to a blockquote, the citation element will be nested inside the <blockquote> element.

<blockquote>
    <citation>Melville, 1851</citation> 
    <p>Call me Ishmael.</p>
</blockquote>

If a citation by reference has extra information, it is output as the body of the citation.

<blockquote>
    <citation nameref="Moby">page 1</citation> 
    <p>Call me Ishmael.</p>
</blockquote>

If a citation uses a compound identifier, the identifier is broken into pieces and serialized like this:

<phrase>call me Ishmael<citation><reference-elements><reference-element method="nameref" value="Melville"/><reference-element method="idref" value="MobyDick"/></reference-elements>page 1</citation></phrase>

Unlike other blocks, the serialization of a citation does not include a line feed after the citation tag. This is because a citation can also occur inside a flow, where the addition of a line feed would introduce extra whitespace into the flow. The lack of a linefeed in a block context does not change the semantics of the resulting document, but can make a difference in tests that compare literal XML output to expected output.

A textual citation is output as a <cite> element in HTML.

A reference citation by ID that is attached to a phrase is output as a link to the referenced structure.

Other types of reference citations, including those that use compound identifiers, are not supported in HTML output mode. If you need to resolve reference citations, use XML output mode and process the XML output to produce the citation style you need for your output.

For example,

<p>Moby Dick<cite>Melville, 1851</cite> is about a big fish.</p>

If the textual citation is attached to a phrase, then the citation will be nested inside the phrase <span> element:

<p><span class="phrase"><cite>Melville, 1851</cite>Moby Dick</span> is about a big fish.</p>

If a textual citation and an annotation are attached to the same phrase, then the citation will be nested within the phrase <span> element and relative to any annotation <span> elements, just as with annotation nesting.

<span class="phrase"><span class="annotation" data-annotation-type="novel"><cite type="citation" value="Melville, 1851">Moby Dick</cite></span></span> is about a big fish.

If the citation is attached to a blockquote, the citation element will be nested inside the <div> element.

<div class="blockquote">
    <cite>Melville, 1851</cite> 
    <p>Call me Ishmael.</p>
</div>

A citation is represented in the SOM by a Citation object.

When applied to a block, the Citation object is added as the first child of the block.

When applied to a phrase, the Citation object is chained in sequence with any other citations or annotations.

When a citation occurs standalone in a flow, the Citation object becomes an item in the Flow object list.

Insert

An insert is an instruction to the application layer to insert a resource into the document output.

Inserts may be created at the block level or inside a flow.

At the block level, an insert is placed on a line by itself and is indicated by three greater-than signs:

>>>(image foo.png)

Inside a flow, an insert is indicated by a single greater-than sign followed by the identification of the resource in parentheses:

My favorite flavor of ice cream is >($favorite-flavor).

The resource to be inserted may be identified either by type and URL as in the block example above or by reference to an id, name, variable, or key.

You can also assign names, conditions, or ids to an insert. For example, to insert a fragment containing the introduction to the deluxe version of a product, you could do this:

>>>(~deluxe-intro)(?model=deluxe)

Remember that it is up to the application layer to implement such inserts.

You can use any name you like for the insert type. However, the following types are reserved so that editors can act on them to provide enhanced views of the document while editing:

  • image

  • video

  • audio

  • feed

  • app

  • object

They should be used only with their natural semantics. Note, however, that the spec does not prescribe a mechanism for identifying the object to insert. The item will, of course, be identified by a URL, but how that URL is to be interpreted is not specified. For instance, it could point to an intermediate resource such as an XML file that contains information used to choose the appropriate object for a particular medium. The SAM language spec does not specify the format or interpretation for such a file.

Inserts may also be specified by idref, nameref, or keyref:

>>>(#foo)

Inserts by reference can use compound identifiers:

>(#foo/*bar)

Insert syntax is also used to insert the value of variables using a variableref:

>($boz)

The insert is an instruction to the application layer to insert the specified resource. While implementation is entirely up to the application layer, SAM specifies the semantics of the reserved insert types as follows:

image

Insert the graphic image into the output.

video

Insert the video into the output.

audio

Insert a playable audio clip into the output.

feed

Insert the result of a feed into the output.

app

Insert an embedded app into the output.

object

Insert a generic object (per the HTML5 object tag).

The semantics of inserting by reference to an ID, name, variable, or key are as follows:

a SAM ID

Insert the resource pointed to by the ID in a manner appropriate to its type.

a SAM name

Insert the resource pointed to by the name in a manner appropriate to its type.

a SAM variable

Insert the value of the variable into the document, and process any structures it contains.

a SAM key

Insert the resource pointed to by the key in a manner appropriate to its type.

Note that you are not obliged to use any of these types in order to do their assigned functions. You can devise any insert types you like, since responsibility for implementing them rests entirely with the application layer. The constraint is that you should not assign other semantics to these names. Having a standard set simply allows for editors to presume the semantics of the basic types and act on them if they wish.

In XML output mode the insert instruction is simply passed through to the application layer like this:

<insert item="foo.gif" type="image"/>

Inserts by reference are serialized like this:

<insert idref="foo"/>

Inserts by reference with compound identifiers serialize like this:

<insert><reference-elements><reference-element method="nameref" value="foo"/><reference-element method="idref" value="bar"/></reference-elements></insert>

In HTML output mode, the parser will attempt to insert the inserted object in the output.

For direct inserts, it will check if the insert type is in its list of known insert types and the file extension is in its list of known file types. If both conditions are true, it will insert it in the document using an HTML object element. For a block insert, the object element will be wrapped in a div element with the class attribute set to insert as well as any of the standard attributes that are present:

<div class="insert" data-conditions="a,b" id="id15" data-name="name"><object data="foo.gif"></object></div>

Treatment of an inline direct insert is identical except that the wrapper element is a span:

<span class="insert" data-conditions="a,b" id="id15" data-name="name"><object data="foo.gif"></object></span>

The known insert types for purposes of HTML output are: "image", "video", "audio", "feed", "app", and "object".

The known file types for purposes of HTML output are: ".gif", ".jpeg", ".jpg", ".png", ".apng", ".bmp", ".svg", ".ico", ".ogv", ".ogg", ".mp4", ".m4a", ".m4p", ".m4b", ".m4r", ".m4v", ".webm", ".oga", ".l16", ".wav", ".aiff", ".au", ".pcm", ".mp3", ".3gp", ".aac", ".spx", ".opus", ".atom", ".rss", and ".jar".

If either the insert type or the file type is not in its list, a warning is issued and the insert is ignored.
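The check for a direct block insert can be sketched in Python. This is an illustration, not the parser's code: html_block_insert is a hypothetical name, the file type list is abbreviated to a few entries, the warning is simply printed to standard error, and the standard attributes (conditions, id, name) are left out for brevity.

```python
import os
import sys
from html import escape

known_insert_types = ["image", "video", "audio", "feed", "app", "object"]
# Abbreviated for the sketch; the full list is given above.
known_file_types = [".gif", ".jpg", ".png", ".svg", ".mp4", ".mp3", ".rss"]


def html_block_insert(insert_type, item):
    """Return the HTML for a direct block insert, or None if it is ignored."""
    extension = os.path.splitext(item)[1]
    if insert_type not in known_insert_types or extension not in known_file_types:
        print(f"Warning: insert ({insert_type} {item}) ignored.", file=sys.stderr)
        return None
    data = escape(item, quote=True)
    return f'<div class="insert"><object data="{data}"></object></div>'
```

An inline direct insert would differ only in using a span element as the wrapper.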

For inserts by reference, HTML output mode attempts to find and resolve certain types by looking up the referenced structure and inserting it at the point of the insert. HTML output mode does not guarantee that the results of inserting a structure from elsewhere in the document will produce valid HTML in all cases. The following insert by reference cases are supported:

  • Inline inserts of variables.

  • Inserts by ID.

  • Inserts by name if the named structure is found in the current document.

Block inserts of variables, inserts by key, and inserts with compound identifiers are not supported. If these insert types are found, or if the referenced structure cannot be found, a warning is issued and the insert is ignored.

A block insert is represented by a BlockInsert object. An inline insert is represented by an InlineInsert object.

The list of known insert types used for HTML serialization is stored in the global variable known_insert_types.

The list of known file types used for HTML serialization is stored in the global variable known_file_types.

Record set

A record set is a set of records (like a database table) inserted into the document.

A recordset consists of a set of records, one per line, with named fields. The recordset declaration consists of the record set name followed by two colons and then the names of the fields, separated by commas.

history:: date, author, status, notes

Each line indented below the record set declaration is a record, which contains field values separated by commas.

history:: date, author, status, notes
    2018-01-23, mbaker, draft, First draft
    2018-01-29, jsmith, draft, Added info on haggis
    2018-02-04, ajones, approved, Reviewed and approved with minor changes
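Reading a record set can be sketched in Python. This is an illustration only: parse_recordset is a hypothetical name, attributes on the declaration are not handled, and the sketch assumes that field values contain no commas, since escaping is not covered in this section.

```python
def parse_recordset(lines):
    """Parse a record set declaration and the records indented below it."""
    header, *rows = [line for line in lines if line.strip()]
    name, _, field_part = header.partition("::")
    fields = [f.strip() for f in field_part.split(",")]
    records = []
    for row in rows:
        values = [v.strip() for v in row.split(",")]
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values: {row.strip()!r}")
        # Pair each value with the field name in the same position.
        records.append(dict(zip(fields, values)))
    return name.strip(), records
```

Applied to the history example above, the sketch returns the name history and three records, each mapping date, author, status, and notes to the values on that line.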

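A record set is a simple database table. Its purpose is to provide a subject-domain container for some of the classes of information that are commonly presented as tables in the document domain, allowing greater freedom to query the information in the document or to present the information in various different ways.

A record set is serialized to XML as an element named after the record set, containing one record element per record, with each field value in an element named after its field: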

<history>
<record>
<date>2018-01-23</date>
<author>mbaker</author>
<status>draft</status>
<notes>First draft</notes>
</record>
<record>
<date>2018-01-29</date>
<author>jsmith</author>
<status>draft</status>
<notes>Added info on haggis</notes>
</record>
<record>
<date>2018-02-04</date>
<author>ajones</author>
<status>approved</status>
<notes>Reviewed and approved with minor changes</notes>
</record>
</history>

HTML serialization includes an empty header so you can add appropriate header labels through CSS.

<table class="recordset">
<thead class="recordset-header">
<tr class="recordset-header-row">
<th class ="record-set-field" data-field-name="date"></th >
<th class ="record-set-field" data-field-name="author"></th >
<th class ="record-set-field" data-field-name="status"></th >
<th class ="record-set-field" data-field-name="notes"></th >
</tr>
</thead>
<tbody class="recordset-body">
<tr class="record">
<td class="record-field" data-field-name="date">2018-01-23</td>
<td class="record-field" data-field-name="author">mbaker</td>
<td class="record-field" data-field-name="status">draft</td>
<td class="record-field" data-field-name="notes">First draft</td>
</tr>
<tr class="record">
<td class="record-field" data-field-name="date">2018-01-29</td>
<td class="record-field" data-field-name="author">jsmith</td>
<td class="record-field" data-field-name="status">draft</td>
<td class="record-field" data-field-name="notes">Added info on haggis</td>
</tr>
<tr class="record">
<td class="record-field" data-field-name="date">2018-02-04</td>
<td class="record-field" data-field-name="author">ajones</td>
<td class="record-field" data-field-name="status">approved</td>
<td class="record-field" data-field-name="notes">Reviewed and approved with minor changes</td>
</tr>
</tbody>
</table>

A recordset is represented by a RecordSet object. Each record is represented by a Record object, which is a child of the RecordSet object.

Codeblock

A codeblock is a block of computer code or data, or any other material with its own syntax that is to be presented as part of the document content. The contents of a codeblock are not interpreted by the SAM parser but are presented as entered in the source.

Codeblocks begin with three back ticks followed by optional attributes.

```(python)
    print("Hello World.")

Code must be indented under the codeblock marker. The codeblock ends at the first line that is less indented than the codeblock marker, or at the end of the file.

Codeblocks take the standard attribute types.

Because it is delineated by indentation, the contents of a codeblock do not require any character escaping, and no SAM character escapes are recognized in a codeblock. The text is taken literally without modification of any kind.

Thus, for instance, you can show blocks of SAM in a SAM codeblock (as this document does) without any need for escaping any part of the markup.

The text of the codeblock is not processed as SAM markup. It is recorded as is with no interpretation or escape processing. Line breaks are preserved as in the original.

Because indentation may be significant in a codeblock, and because the code will be indented under the codeblock header in the SAM source, the SAM parser uses the following algorithm to distinguish SAM indenting from indenting in the code itself: The parser looks at all the lines that are indented under the codeblock header, finds the one with the smallest indent, and treats it as occurring at the first column in the code. This allows for cases where the first line of a code sample may not be the least indented line. (That is, the first line of the code sample may be more indented than some of the lines that follow.) It also means that all codeblocks will be formatted with zero initial indent.
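The indent-stripping rule described above can be sketched in a few lines of Python. This is only an illustration, not the parser's actual code: the function name dedent_codeblock is hypothetical, and ignoring blank lines when measuring the indent is an assumption of the sketch.

```python
def dedent_codeblock(lines):
    """Strip the smallest common indent from the lines of a codeblock.

    Assumption of this sketch: blank lines are ignored when measuring
    the indent and are emitted as empty lines.
    """
    # Measure the leading spaces of every non-blank line.
    indents = [len(line) - len(line.lstrip(" ")) for line in lines if line.strip()]
    smallest = min(indents, default=0)
    # The least indented line ends up in the first column.
    return [line[smallest:] if line.strip() else "" for line in lines]


# The first line is more indented than the line that follows it.
source = [
    "            return x",
    "        def f(x):",
    "            return x + 1",
]
print("\n".join(dedent_codeblock(source)))
```

Python's standard textwrap.dedent function removes a common leading prefix in a similar way, so an application written in Python may not need a hand-written helper at all.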

The semantics of the language attribute are slightly different for codeblocks, or at least, the implication of the semantics is. For other blocks, the language attribute denotes the human language of the text. For a codeblock, it denotes the computer or data language of the codeblock. In the unlikely event that you need to apply both languages, you should use a wrapper around the codeblock to capture the human language. For example, you might do this:

code-sample:(fr)
    ```(sam)
        La Plume de ma Tante.

A codeblock is serialized like this:

<codeblock conditions="a,b" id="id12" language="python" name="name">
def escape_for_xml(s):
    t = dict(zip([ord('&lt;'), ord('&gt;'), ord('&amp;')], ['&amp;lt;', '&amp;gt;', '&amp;amp;']))
    try:
        return s.translate(t)
    except AttributeError:
        return s
</codeblock>

Note that nothing is done to indicate that the contents of the codeblock are to be treated as preformatted. That is up to the application layer.

For HTML output, the content of a codeblock is wrapped in an HTML pre tag with the class attribute set to codeblock. If the codeblock has a code_language attribute, a data-language attribute is added to the pre tag containing the language value, and the content is also wrapped in code tags with the class attribute set to the code language.

<pre class="codeblock" data-language="python" lang="en"><code class="python">def escape_for_xml(s):
    t = dict(zip([ord('&lt;'), ord('&gt;'), ord('&amp;')], ['&amp;lt;', '&amp;gt;', '&amp;amp;']))
    try:
        return s.translate(t)
    except AttributeError:
        return s
</code></pre>

A codeblock is represented in the SOM with a Codeblock object. The contents of the codeblock are represented as a Pre object.

Embedded markup

You can embed markup written in other markup languages. Unlike a codeblock, the intended semantics of an embed are that the markup will be processed in the application layer. For example, a piece of LaTeX math markup might be processed in the application layer to display an equation.

An embed works just like a codeblock, except that there is an encoding attribute on the block.

```(=latexmathml)(*id13)(#name)(?a)(?b)
    n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k

Embedded markup is intended to be processed by the application layer and the results of that processing to be displayed.

Note that SAM knows nothing about the embedded markup and does not process it in any way. It merely captures it and sends it on to the application layer.

An embed is serialized as follows:

<embed conditions="a,b" id="id13" encoding="latexmathml" name="name">
n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k
</embed>

In HTML output mode, the output of an embed is hidden by default. You can supply a JavaScript program to display the embedded data if needed.

A block embed is serialized as a div:

<div class="embed" hidden  data-conditions="a,b" data-encoding="latexmathml" id="id13" data-name="name">
n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k
</div>

A block embed is represented in the SOM as an Embedblock object. The contents are represented as a Pre object.

An inline embed is represented in the SOM as an Embed object.

Blockquote

A block quote is a quotation presented as a separate block rather than inline in a paragraph.

A blockquote is created by an appropriate indent followed by three " or ' characters followed by optional attributes and/or a citation.

"""(?foo)[Mother Goose]
    The quick brown fox jumps over the lazy dog.

The contents of a blockquote are regular SAM markup, and thus can contain phrases and annotations.

A blockquote is just like any other block with the exception that it is allowed to have a citation. The material inside a blockquote is treated as regular SAM markup and is processed as such.

A blockquote is serialized to XML as follows:

<blockquote conditions="foo">
<citation>Mother Goose</citation>
<p>The quick brown fox jumps over the lazy dog.</p>
</blockquote>

In HTML output mode, a blockquote is serialized as follows:

<blockquote data-conditions="foo" class="blockquote">
<cite>Mother Goose</cite>
<p>The quick brown fox jumps over the lazy dog.</p>
</blockquote>

A blockquote is represented in the SOM by a Blockquote object. The contents of a blockquote are regular SAM objects, which are children of the Blockquote object as normal for the children of any object.

Line

A line is a piece of text with a fixed line ending. Lines are generally used in groups for text in which the line endings matter, such as a poem.

To create a set of lines, precede each one with a pipe character followed by a space.

| You gotta walk that lonesome valley,
| You gotta walk it by yourself,
| Nobody here can walk it for you,
| You gotta walk it by yourself.

You can also add ids or conditions to lines. In this case, the opening parenthesis of the annotation must follow the leading pipe character immediately, and the closing parenthesis must be followed by a space:

|(#foo2) You gotta walk that lonesome valley,[*bing page 22]
| You gotta walk it by yourself,
|(?bar) Nobody here can walk it for you,
|    You gotta walk it by yourself.

The entire line must be on one line in the source.

All spaces after the first space following the pipe are considered significant and are retained on output.
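As a rough illustration of these rules, the following Python sketch splits a source line into its parenthesized attributes and its text, keeping every space after the single separator space. The names LINE_PATTERN and parse_line are hypothetical, the sketch assumes the leading indent has already been removed, and the text is returned as an unparsed string rather than as a SAM flow.

```python
import re

# Pipe, optional parenthesized attributes, one separator space, then the text.
LINE_PATTERN = re.compile(r"^\|(?P<attrs>(?:\([^)]*\))*) (?P<text>.*)$")


def parse_line(source_line):
    """Return (attributes, text) for a SAM line, or None if it is not a line."""
    match = LINE_PATTERN.match(source_line)
    if match is None:
        return None
    # Collect the content of each "(...)" group, e.g. "#foo2" or "?bar".
    attributes = re.findall(r"\(([^)]*)\)", match.group("attrs"))
    # Only the first space after the pipe (or attributes) is dropped.
    return attributes, match.group("text")


print(parse_line("|(?bar) Nobody here can walk it for you,"))
print(parse_line("|    You gotta walk it by yourself."))
```

The second call keeps three of the four spaces that follow the pipe, which matches the serialized output shown in this section.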

A line cannot have children.

The text of a line is a normal SAM flow and is processed as such.

A line is a line of text with a fixed line break. Unlike a codeblock, the contents of a line are a regular SAM flow.

A line is serialized to XML as follows.

<line name="foo2">You gotta walk that lonesome valley,<citation idref="bing">page 22</citation></line>
<line>You gotta walk it by yourself,</line>
<line conditions="bar">Nobody here can walk it for you,</line>
<line>   You gotta walk it by yourself.</line>

In HTML output mode, a line is serialized as follows.

<pre class="line" data-name="foo2">You gotta walk that lonesome valley,</pre>
<pre class="line">You gotta walk it by yourself,</pre>
<pre class="line" data-conditions="bar">Nobody here can walk it for you,</pre>
<pre class="line">   You gotta walk it by yourself.</pre>

A line is represented in the SOM as a Line object.

Unordered List

An unordered list is a piece of shorthand for a fully explicit unordered list structure.

An unordered list item is created by starting a paragraph with a * followed by a space. Note that the space is required. Without it, the * will be recognized as the beginning of a bold decoration, or as plain text.

* Dog
* Cat
* Monkey

The text following the * is treated as a paragraph. This means that the unordered list above is equivalent to this:

ul:
    li:
        Dog
    li: 
        Cat
    li:
        Monkey

and not this:

ul: 
    li: Dog 
    li: Cat 
    li: Monkey

The only way to get the latter structure is to enter it as shown above.

You can have more than one paragraph in a list item. To add a second paragraph, insert a blank line followed by a paragraph that is indented to the same level as the first letter following the * of the list item:

* How much is that doggy in the window.

  I hope that that fleabag's for sale. 

* Hey diddle diddle, the cat and the fiddle

  The cow jumped over the moon. 

Lists can be nested. A nested list means that one list is the child of a list item in the parent list. To create this, indent the second list under an item of the first.

* Dogs
    * Spot
    * Fang
* Cats
    * Mittens
    * Skimbleshanks

Note that when an unordered list follows a paragraph, it must be at the same indent level as the preceding paragraph, not indented under it.

Unordered lists using this explicit concrete syntax are more limited than the lists you could create with explicit blocks and fields. The only children a list item can have with this syntax are paragraphs and nested unordered lists or ordered lists.

Note that when an unordered list follows a paragraph, it is a sibling of that paragraph. A list cannot be the child of a paragraph since paragraphs cannot have children.

An unordered list is serialized to XML as follows. Note that the content of a list item is always wrapped in a <p> tag.

<ul>
    <li>
        <p>Dogs</p>
        <ul>
            <li>
                <p>Spot</p>
            </li>
            <li>
                <p>Fang</p>
            </li>
        </ul>
    </li>
    <li>
        <p>Cats</p>
        <ul>
            <li>
                <p>Mittens</p>
            </li>
            <li>
                <p>Skimbleshanks</p>
            </li>
        </ul>
    </li>
</ul>

In HTML output mode, an unordered list is serialized as follows. Note that the content of a list item is always wrapped in a <p> tag.

<ul class="ul">
    <li class="li">
        <p class="p">Dogs</p>
        <ul class="ul">
            <li class="li">
                <p class="p">Spot</p>
            </li>
            <li class="li">
                <p class="p">Fang</p>
            </li>
        </ul>
    </li>
    <li class="li">
        <p class="p">Cats</p>
        <ul class="ul">
            <li class="li">
                <p class="p">Mittens</p>
            </li>
            <li class="li">
                <p class="p">Skimbleshanks</p>
            </li>
        </ul>
    </li>
</ul>

An unordered list is represented by an UnorderedList object. Each list item is represented by an UnorderedListItem object.

Ordered list

An ordered list is a piece of shorthand for a fully explicit ordered list structure.

The syntax of an ordered list is the same as for an unordered list except that it uses one or more numerical characters followed by a period followed by a space:

0. Robot
12. Spaceship
7. Ray gun

The numbers are not retained, nor is their style. The processing application will decide the style and numbering of lists.
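A minimal Python sketch of this behavior follows. It recognizes ordered list items by the pattern described above and keeps only the item text, in source order, discarding the numbers. The names ORDERED_ITEM and ordered_items are hypothetical, and the sketch ignores indentation, nesting, and multi-paragraph items.

```python
import re

# One or more digits, a period, a space, then the item text.
ORDERED_ITEM = re.compile(r"^(\d+)\. (.*)$")


def ordered_items(lines):
    """Return the item texts in source order; the numbers are discarded."""
    items = []
    for line in lines:
        match = ORDERED_ITEM.match(line)
        if match:
            items.append(match.group(2))
    return items


print(ordered_items(["0. Robot", "12. Spaceship", "7. Ray gun"]))
```

Because only the order of the items survives, the processing application is free to number the items 1, 2, 3 or to use any other style.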

Ordered list items can have multiple paragraphs and nested lists just like unordered lists, and ordered and unordered lists can be nested inside each other.

The semantics of ordered lists are the same as those of unordered lists, except, of course, that the list is ordered.

The XML and HTML serializations are identical to those of an unordered list, except with an ol wrapper element.

An ordered list is represented by an OrderedList object. Each list item is represented by an OrderedListItem object.

Labeled list

A labeled list is a list in which each item has a label rather than a bullet or a number. Labeled lists are a close analogue of HTML definition lists. The main difference is in the name: definition lists are not always used for definitions, so their name misstates their semantics. SAM labeled lists are named more accurately.

Labeled lists begin with a label between pipe characters. There must be no space between the opening pipe character and the start of the label text. (Otherwise it will be parsed as a line, not a labeled list.)

|fa| A long long way to run.
|so| A needle pulling thread.
|la| A note to follow so.
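The distinction between a line and a labeled list item rests entirely on what follows the opening pipe. The Python sketch below shows one way to make that decision. It is an illustration only: the names LABELED_ITEM and classify_pipe_line are hypothetical, the leading indent is assumed to be already removed, and attributes on lines such as |(#foo2) are deliberately not handled.

```python
import re

# "|label| text": no space is allowed between the opening pipe and the label.
LABELED_ITEM = re.compile(r"^\|(?P<label>\S[^|]*)\|\s+(?P<text>.*)$")


def classify_pipe_line(source_line):
    """Classify a source line that starts with a pipe character."""
    match = LABELED_ITEM.match(source_line)
    if match:
        return ("labeled-list-item", match.group("label"), match.group("text"))
    if source_line.startswith("| "):
        # A pipe followed by a space starts a line, not a labeled list item.
        return ("line", None, source_line[2:])
    return ("unknown", None, source_line)


print(classify_pipe_line("|fa| A long long way to run."))
print(classify_pipe_line("| You gotta walk it by yourself,"))
```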

As with other lists, the content of a list item is a paragraph and you can add other structures after the initial paragraph by adding them at the same indent as the first paragraph. (The start of the paragraph, not the label.)

|Do| a deer,
     a female deer
|Re| a drop of
     golden sun

     Is it a particle or a wave?


|Me| a name I call myself

A labeled list is simply a list with a label instead of a bullet. There is no restriction built into the language as to what structures you can nest under a labeled list item, but prudence would suggest that it would be appropriate to restrict things in the schema.

A labeled list is serialized as follows:

<ll>
<li>
<label>Do</label>
<p>a deer, a female deer</p>
</li>
<li>
<label>Re</label>
<p>a drop of golden sun</p>
<p>Is it a particle or a wave?</p>
</li>
</ll>

A labeled list is represented in the SOM by a LabeledList object. Each item is represented by a LabeledListItem object.

Grid

Grids are a very simple table-like construct. They are not full tables, which are complex document domain beasts with potentially elaborate syntax. There are many ways to do tables or semantically represent data that may be displayed in tables in SAM. Grids are meant for the simplest cases that have no document domain or subject domain semantics. They are just a tic-tac-toe board.

A grid is created by starting a line with three + signs. These may be followed by block annotations.

Each row of the grid is on a separate line. Cells are separated by pipes. There is no pipe before the first cell or after the last. All rows must have the same number of cells.

There is no concept of a header, and no row or column spanning. A grid consists of rows and cells. Each cell contains a single flow. Each row must be on one line of the source file. Cell contents are trimmed of leading and trailing whitespace. This allows you to align the pipes if you wish (though this is not required).

+++
    *Type*  | *Term*    | *Notes*

    feature | fragment  | bing

    feature | fragments | bang

Management attributes are permitted on the grid, but not the rows.

+++(*foo #bar ?baz)
    *Type*  | *Term*    | *Notes*

    feature | fragment  | bing

    feature | fragments | bang

A grid is nothing more than a two dimensional grid of cells. If you want any more complex representation of a table, you need to find a different way to mark it up.
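The row and cell rules for grids are simple enough to sketch in Python. The function below splits each row on pipes, trims each cell, and rejects grids whose rows differ in cell count. The name parse_grid_rows is hypothetical, skipping blank lines between rows is an assumption of the sketch, and cell contents are left as plain strings rather than being parsed as flows (so escaped pipes are not handled).

```python
def parse_grid_rows(row_lines):
    """Split grid rows into trimmed cells and check that the grid is rectangular."""
    rows = [
        [cell.strip() for cell in line.split("|")]
        for line in row_lines
        if line.strip()  # assumption: blank lines between rows are ignored
    ]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"rows have differing cell counts: {sorted(widths)}")
    return rows


grid = [
    "    *Type*  | *Term*    | *Notes*",
    "",
    "    feature | fragment  | bing",
    "",
    "    feature | fragments | bang",
]
for row in parse_grid_rows(grid):
    print(row)
```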

Note that SAM is primarily intended for creating subject domain languages and as such it does not provide native support for complex tables, which are document domain objects. Record sets are the SAM-like way to represent real record data. To create document domain tables in SAM, you would need to create a table markup using blocks and fields. Another alternative would be to use embedded XML.

Grids are serialized to XML as follows:

<grid>
<row>
<cell><phrase><annotation type="bold">Type</annotation></phrase></cell>
<cell><phrase><annotation type="bold">Term</annotation></phrase></cell>
<cell><phrase><annotation type="bold">Notes</annotation></phrase></cell>
</row>
<row>
<cell>feature</cell>
<cell>fragment</cell>
<cell>bing</cell>
</row>
<row>
<cell>feature</cell>
<cell>fragments</cell>
<cell>bang</cell>
</row>
</grid>

In HTML output mode, grids are serialized as tables:

<table class="grid">
<tr class="row">
<td class="cell">foo</td>
<td class="cell">bar</td>
</tr>
<tr class="row">
<td class="cell">baz</td>
<td class="cell">bat</td>
</tr>
</table>

Grids are represented in the SOM by a Grid object and rows by Row objects.

Comment

A comment is an extra-textual comment on the markup of the document, similar to a comment in code or in XML.

A comment is created by starting a line with a #. There is no support for multi-line comments or for comments within running text.

# This is a comment.

A comment is intended as a comment on the source file. To comment on the document as a document, use remarks.

A comment is represented in XML as an XML comment.

<!-- This is a comment -->

A comment is represented as an HTML comment:

<!-- This is a comment -->

A comment is represented in the SOM with a Comment object.

Remark

A remark is an editorial comment on the document text.

A remark looks like a blockquote except that it is introduced with three ! characters and takes an attribution attribute:

!!!(Mark Baker) 
    This is a remark.

The body of a remark is normal SAM so it can contain other SAM structures.

Remarks can take an id or language attribute, but not a name or condition attribute.

Remarks are a block structure. There is no provision for including a remark in the middle of a paragraph. Generally speaking, placing a remark after the paragraph it remarks on should be sufficient. However, if you want to anchor a remark to a specific spot in a paragraph, you can do so by giving the remark an ID attribute and referring to it using a citation:

This requires a remark[*cdg1], but the paragraph continues.

!!!(Charles André Joseph Marie de Gaulle)(*cdg1)(!fr)
    La Plume de ma tante.

Remarks are intended for communication between authors, editors, and reviewers during the preparation of the document. They may be presented in editors or in draft versions of output. They do not form part of the finished document.

The XML representation of a remark is as follows:

<remark attribution="mbaker">
<p>This is a remark.</p>
</remark>

Remarks anchored with a citation are serialized like this:

<p>This requires a remark<citation idref="cdg1"/>, but the paragraph continues.</p>
<remark attribution="Charles André Joseph Marie de Gaulle" id="cdg1" xml:lang="fr">
<p>La Plume de ma tante.</p>
</remark>

The HTML representation of a remark uses a div with the class attribute set to remark. The attribution is represented as a data-attribution attribute:

<div class="remark" data-attribution="mbaker">
<p class="p">This is a remark.</p>
</div>

HTML output mode does not render unanchored citations, so the remark will not be connected to the anchor point in the paragraph in HTML output. A warning will be issued for the unanchored citation.

Bold and italic decorations

A decoration is a shortcut for creating an annotated phrase. Decorations are used for the common text formatting features bold and italic.

This text is in *important* type.

is a shortcut for

This text is {important}+(bold).

Decorations are always local (they create local annotations). Thus the equivalent explicit annotation in the example above is local.

Decorations are indicated by characters in the text that bracket phrases. The decorations are as follows:

  • * is a decoration for bold.

  • _ is a decoration for italic.

Examples:

This text is in *bold* type.
This text is _italic_ type.

Decorations cannot be nested, so

This text is *_important_*.

is not a shortcut for

This text is {important}(bold)(italic)

but would be interpreted as

This text is {_important_}(bold).

Decorations are not recognized inside of phrases, since decorations are shortcuts for phrases and phrases cannot be nested. Decoration characters inside annotations are read as plain text.

Annotation chaining is supported for decorations. Therefore, if you want to make text both bold and italic, you could use:

This text is *important*(italic)

Although it may be clearer to do this:

This text is {important}(bold)(italic)

In SAM, the bold and italic decorations are media domain structures calling for the use of bold and italic typefaces respectively. This is in contrast to most lightweight markup languages, which use them (or similar markup) to mean an abstraction like "emphasis" or "strong". The problem with these abstractions is that their meaning is not clear. Bold and italic type are conventionally used for specific purposes and should not be abstracted out without some definite indication of the purpose. Thus in SAM, you would normally expect that a tagging language in which the titles of books occur regularly in content would provide a suitable title annotation to mark them up.

{Moby Dick}(book) is about a whale.

But if the tagging language provides no such support, because books are not a significant subject for that language, then the writer who needs to mention a book should not be forced to use a vague and unreliable abstraction like emphasis in order to ensure proper formatting of the title. They should be able to specify the correct media domain formatting choice:

_Moby Dick_ is about a whale.

Which is equivalent to:

{Moby Dick}+(italic) is about a whale.

And we should note in this case that the abstract emphasis is not only vague, it is actually incorrect. This is not a case of emphasizing a word, but of formatting a book title according to the commonly accepted convention. If you actually want an emphasis annotation in your tagging language, you should do it like this:

{Moby Dick}+(italic) is a {very}+(emphasis) good book.

The bold and italic decorations are serialized to XML as bold and italic elements respectively.

<p>This text is in <bold>bold</bold> type.</p>

<p>This text is <italic>italic</italic> type.</p>

In HTML output mode, the bold and italic decorations are serialized as b and i elements respectively, with the class attribute set to bold or italic.

<p>This text is in <b class="bold">bold</b> type.</p>

<p>This text is <i class="italic">italic</i> type.</p>

The bold and italic decorations are represented in the SOM as bold and italic annotations using Annotation objects. No distinction is made between what were decorations in the source and what were annotations of the same name.

Inline code

Inline code is used to include code or data samples in the body of a paragraph.

Inline code is indicated by surrounding it in backticks:

In Python 3, use `print("Hello World")`. 

The body of an inline code structure is not parsed for character escapes or other markup, so you can do this:

In SAM, a phrase looks like this: `{phrase}`.

The only character that can be escaped in inline code is the back tick, which is escaped using a double back tick: ``.
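The following Python sketch shows one way the back tick rule could be applied when extracting inline code from running text. It is an illustration under stated assumptions, not the parser's implementation: the names INLINE_CODE and inline_code_spans are hypothetical, and attributes following the closing back tick are ignored.

```python
import re

# Text between back ticks, where a doubled back tick stands for a literal one.
INLINE_CODE = re.compile(r"`((?:``|[^`])*)`")


def inline_code_spans(text):
    """Return the content of each inline code span with escapes resolved."""
    # Nothing but the doubled back tick is unescaped; all other characters,
    # including SAM markup such as {phrase}, are kept literally.
    return [match.group(1).replace("``", "`") for match in INLINE_CODE.finditer(text)]


print(inline_code_spans('In Python 3, use `print("Hello World")`.'))
print(inline_code_spans("In SAM, a phrase looks like this: `{phrase}`."))
```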

In Python 3, use `print("Hello World")`(python). 

Inline code can also take all the regular attributes.

The correct form is `print("Hello World")`(python)(?p3)`print "Hello World"`(python)(?p2).

Inline code cannot take chained annotations, however.

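The content of an inline code structure is presented as code, without interpretation. It may be formatted in accordance with its language.

Inline code is serialized to XML as a code element:

<code language="sam">markup:</code>

In HTML output mode, it is serialized as a code element with the class attribute set to code:

<code class="code" data-language="sam">markup:</code>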

Inline code is represented in the SOM as a Phrase object with an annotation of "code".

Inline embed

An inline embed is a piece of embedded code that is intended to be interpreted by the application layer. For example, an inline embed could be used to embed an equation in latexmath.

An inline embed looks exactly like inline code except that it has an annotation containing an encoding attribute, which is the name of the encoding language preceded by a =:

This is an equation `\frac{ \sum_{t=0}^{N}f(t,k) }{N}`(=latexmath) in the middle of a paragraph. 

The application layer should interpret the inline embed according to its language type and display the result.

An inline embed is serialized to XML as follows:

<p>This is an equation <embed encoding="latexmath">\frac{ \sum_{t=0}^{N}f(t,k) }{N}</embed> in the middle of a paragraph.</p>

In HTML output mode, an inline embed is serialized as a span with the hidden attribute:

<span class="embed" hidden data-encoding="latexmath">\frac{ \sum_{t=0}^{N}f(t,k) }{N}</span>

An inline embed is represented in the SOM as a Code object with an Encoding attribute.

Variable

A variable is a way to store a value (specifically, a Flow) that can be inserted into any part of a document. Variable definitions are scoped, meaning that you can locally override the value of a variable. This can be used to replace values in a block of text that is reused by reference.

A variable definition is created by the appropriate indent followed by a $ and a name-like-string followed by an = and the value of the variable. The variable value ends at the end of the current line:

$foo = bar

The variable can then be inserted using a block insert or inline insert:

The sentence includes a string insert >($foo).

>>>($foo)

Variable definitions do not take any attributes. The variable name is a name token.

In XML output mode, the resolution of variables is left to the application layer. This allows the application layer to consider variable definitions from outside the local file.

In HTML output mode, variables are resolved using definitions in the local file. A warning is issued if the variable cannot be resolved and the insertion is skipped.

Within a local file, variables are local to the scope in which they are defined.

When resolving a variable, the application layer should first look at any variable definitions that are children of the insert statement. If no matching variable definition is found, it should then work its way up the tree, looking in the parent structure and then the preceding siblings of the parent structure and so on to the document root. Note that this is not the same as searching in reverse document order, since it does not look in children of preceding siblings.
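The search order described above can be sketched in Python as follows. This is one reading of the rule, offered as an illustration: the Node class is a hypothetical stand-in for SOM structures (it is not the real SOM API), variable definitions are modeled as nodes of kind "variable", and at each level only the preceding siblings of the current node are inspected, nearest first.

```python
class Node:
    """Hypothetical stand-in for a SOM structure (not the real SOM API)."""

    def __init__(self, kind, name=None, value=None):
        self.kind = kind      # e.g. "block", "insert", "variable"
        self.name = name
        self.value = value
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def resolve_variable(insert, name):
    """Return the value of the most local definition of `name`, or None."""
    # 1. Definitions that are children of the insert itself win.
    for child in insert.children:
        if child.kind == "variable" and child.name == name:
            return child.value
    # 2. Walk up the tree. At each level, inspect only the preceding
    #    siblings of the current node; never descend into their children.
    node = insert
    while node.parent is not None:
        siblings = node.parent.children
        position = siblings.index(node)
        for sibling in reversed(siblings[:position]):
            if sibling.kind == "variable" and sibling.name == name:
                return sibling.value
        node = node.parent
    return None


root = Node("block")
root.add(Node("variable", "color", "black"))
section = root.add(Node("block"))
nested = section.add(Node("block"))
nested.add(Node("variable", "color", "green"))  # child of a preceding sibling: not visible
plain = section.add(Node("insert"))
override = section.add(Node("insert"))
override.add(Node("variable", "color", "white"))

print(resolve_variable(plain, "color"))
print(resolve_variable(override, "color"))
```

In this example the definition nested inside a preceding sibling is never consulted, which is the difference from a search in reverse document order. An application that also accepts definitions from outside the local file would consult them only when this function returns None.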

The application layer is entitled to introduce variable definitions from outside the local file. Such variables should always be last in the search order for variable definitions. In other words, the most local definition always wins. System designers may want to consider imposing variable naming conventions to avoid accidental variable name conflicts.

In XML output mode, variables are serialized as variable elements:

<variable name="foo">bar</variable>

In HTML output mode, variables are serialized with the class attribute set to variable and the hidden attribute:

<div class="variable" data-name="foo" hidden>bar</div>

Variables are represented as Variable objects.

Fragment

A fragment is a chunk of content that is set off so that it can be managed separately from the rest of the document. Reasons for creating a fragment could include making a chunk of the document conditional or available for reuse. A fragment is essentially a block of content that has no semantic identity of its own, but exists only to be managed arbitrarily, such as for ad hoc reuse of text that is not based on reusing a complete semantic block. Examples include passages that are conditionalized for use in different contexts.

A fragment is created by beginning a line with three ~ characters:

This paragraph is outside the fragment.

~~~(?foo)

    This paragraph is inside the fragment and should only appear if the {condition} `foo` is true.

This paragraph is outside the fragment again and will always be included.

You can allow for the redefinition of parts of the text of a fragment by defining a variable inside the fragment and using that variable in the fragment text:

~~~(#my-fragment)
    $color=black

    Bar bar >($color) sheep Have you any wool?

To insert a fragment:

>>>(#my-fragment)

As per the variable scoping rules, any redefinition of those variables as children of the insert will override the values in the fragment. Thus:

>>>(#my-fragment)
    $color=white

should be resolved to:

Bar bar white sheep Have you any wool?

(But remember that variable resolution is performed by the application layer in XML output mode but by the SAM parser in HTML output mode.)

The handling of fragments and variables belongs to the application layer.

Semantically, a fragment is just an arbitrary chunk of the document hierarchy. If it is inserted somewhere using an insert with an idref, nameref, or keyref, then the fragment is to be inserted at that location.

If a fragment is reused in this way, and if there are variable inserts in the body of the fragment, any variables defined in the fragment insert override those defined in the fragment and any other variable definitions of the same name.

Fragments can be nested inside other fragments. Fragments can contain inserts of other fragments. Circular fragment references are illegal but must be caught at the application layer.
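Since circular references must be caught by the application layer, an application might run a check like the following Python sketch before resolving inserts. It is a plain depth-first search for a cycle. The function name find_fragment_cycle and the input mapping (fragment name to the names of the fragments it inserts) are hypothetical; how an application builds that mapping from the parsed document is not specified here.

```python
def find_fragment_cycle(inserts_by_fragment):
    """Return one cycle of fragment names as a list, or None if there is none.

    `inserts_by_fragment` maps each fragment name to the names of the
    fragments it inserts (a hypothetical structure built by the application).
    """
    visiting = []   # fragments on the current insertion path
    done = set()    # fragments already proven free of cycles

    def visit(name):
        if name in done:
            return None
        if name in visiting:
            # The path from the first occurrence back to `name` is the cycle.
            return visiting[visiting.index(name):] + [name]
        visiting.append(name)
        for target in inserts_by_fragment.get(name, ()):
            cycle = visit(target)
            if cycle:
                return cycle
        visiting.pop()
        done.add(name)
        return None

    for name in inserts_by_fragment:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


print(find_fragment_cycle({"a": ["b"], "b": ["c"], "c": []}))
print(find_fragment_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
```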

A fragment is serialized to XML as follows:

<fragment conditions="gruz-natz" name="foo3">
<variable name="a">apple</variable>
<variable name="b">banana</variable>
<p>This is a sentence inside a fragment.</p>
</fragment>

A fragment is serialized to HTML by wrapping it in a div element with the class attribute set to fragment:

<span class="fragment" data-conditions="gruz-natz" data-name="foo3">
<span class="variable" data-name="a">apple</span>
<span class="variable" data-name="b">banana</span>
<p>This is a sentence inside a fragment.</p>
</span>

A fragment is represented in the SOM by a Fragment object.

Include

You can include one SAM file in another. The included file must be a complete SAM file. Its structure is included in the structure of the including file at the indent level of the include statement. That is to say, the structures of the included file are treated as if they had their own indent plus the indent of the include statement.

An include is indicated by the appropriate indent, followed by three less-than signs, followed by the path to a SAM file in parentheses.

<<<(foo.sam)
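The effect of the include statement's indent can be modeled with a short Python sketch. This is a simplification for illustration only: it treats inclusion as textual substitution, whereas the text describes the parser parsing the included file. The names INCLUDE and expand_includes, the load_file callable, and the sample file contents are all hypothetical, and no check for circular includes is made.

```python
import re

# Optional indent, three less-than signs, then a path in parentheses.
INCLUDE = re.compile(r"^(?P<indent> *)<<<\((?P<path>[^)]+)\)\s*$")


def expand_includes(lines, load_file):
    """Return `lines` with every include replaced by the re-indented file content."""
    expanded = []
    for line in lines:
        match = INCLUDE.match(line)
        if match is None:
            expanded.append(line)
            continue
        indent = match.group("indent")
        for included in expand_includes(load_file(match.group("path")), load_file):
            # Own indent plus the indent of the include statement.
            expanded.append(indent + included if included.strip() else included)
    return expanded


# Hypothetical in-memory files standing in for the file system.
files = {"foo.sam": ["section: Included", "    A paragraph in the included file."]}
document = ["chapter: Host", "    <<<(foo.sam)"]
print("\n".join(expand_includes(document, files.__getitem__)))
```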

Unlike inserts, which are simply passed on to the application layer for processing, the file pointed to by an include is parsed by the parser. The result of the include is presented to the application layer as part of the parsed document.

The ID uniqueness constraint that applies to an individual SAM document also applies to included files. The IDs must be unique across the entire document parsed from the source file and any included files.

Includes cannot be made conditional, since conditions are parsed by the application layer and includes are processed by the parser.

If you need to apply a condition to an include, place the include in a fragment and make the fragment conditional. This will result in the file being parsed and included in the output, but its content will be inside the fragment, making the whole thing conditional.

For similar reasons, an include cannot have a name, id, or language attribute, since the include instruction itself is not present in the output. If you need to apply any of these things to the included content, you can wrap the include statement in another structure such as a block or a fragment and apply the attributes to that. This will result in the included content being wrapped in that block or fragment in the output where the application layer can deal with it appropriately.

Include instructions are not serialized on output, in either XML or HTML output mode. The content resulting from processing the include is serialized as it would have been if it were entered inline.

The include is represented in the SOM by an Include object. Thus it is possible when accessing the SOM to tell whether or not the content was part of an included file. It is not possible to tell this in the serialized XML output.

There is a proposal to decorate the IDs occurring in an included file to ensure they do not overlap those of the including file. See issue 104.