Terminology and Information Model

This is a DRAFT information model that was developed as part of the GA4GH GKS SA subgroup’s exercise to model sequence features and transcripts. It is a work in progress.

This information model was derived from the draft conceptual and logical models that were developed as foundational work for this effort. Please consult those models alongside this documentation, which is incomplete.

Note: Some elements are sourced from the Variation Representation Specification (VRS) and others from the HL7 Clinical Genomics FHIR Molecular Sequence Resource. This draft model was aligned to those specifications as much as possible while still achieving the goals of the SA modeling exercise.

Figure: Draft SA Model (_images/Generalized_SA_Model.png)

**Legend** DRAFT model of Annotated Sequence, Transcript, and building blocks (including Sequence and SequenceFeature). See the conceptual and logical models for rationale, design choices, and examples.

Core Classes

Sequence

The definition of Sequence is conceptually identical to VRS Sequence, but the information model is more detailed.

A sequence is a contiguous, linear polymer of nucleic acid or amino acid Residues. Sequences are the prevalent representation of these polymers, particularly in the domain of variant representation.

Computational Definition

A character string of Residues that represents a biological sequence using the conventional sequence order (5’-to-3’ for nucleic acid sequences, and amino-to-carboxyl for amino acid sequences). IUPAC ambiguity codes are permitted in Sequences.
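The computational definition above can be sketched as a simple alphabet check. This is a minimal illustration, not part of the draft model: the function name `is_valid_sequence` is hypothetical, residues are assumed to be uppercase, and the amino acid alphabet is assumed to be the 20 standard residues plus the ambiguity codes B, Z, and X.

```python
# IUPAC nucleotide codes: A, C, G, T, U plus the ambiguity codes.
NUCLEOTIDE_CODES = frozenset("ACGTURYSWKMBDHVN")

# Assumption: 20 standard amino acids plus the ambiguity codes B, Z, X.
# Other commonly seen symbols (e.g., U, O, J, *) are left out of this sketch.
AMINO_ACID_CODES = frozenset("ACDEFGHIKLMNPQRSTVWYBZX")


def is_valid_sequence(residues: str, alphabet: frozenset) -> bool:
    """Return True if every character of the string is in the alphabet.

    The string is read in conventional order (5'-to-3' or
    amino-to-carboxyl); this check does not and cannot verify orientation.
    """
    return set(residues) <= alphabet


dna_ok = is_valid_sequence("ACGTN", NUCLEOTIDE_CODES)
dna_gap = is_valid_sequence("ACGT-", NUCLEOTIDE_CODES)
```

Whether an empty string should count as a valid Sequence is not stated in this draft; the sketch accepts it.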

Information Model

The Sequence class represents the concept of a particular sequence rather than its instantiation in a particular database or serialization. The Sequence class provides convenience attributes to capture human-readable names and identifiers that are associated with a sequence, but the representation of the sequence itself (the linear string of residues) is captured using the SequenceRepresentation abstract data type. This type is defined by the FHIR Molecular Sequence Resource and is similar to the VRS SequenceExpression classes.

Field	Type	Limits	Description
type	CodeableConcept (FHIR)	[1..1]	The type of sequence (e.g., DNA, RNA, protein). The value of this attribute should represent a term in an ontology (e.g., Sequence Ontology).
identifier	Identifier (FHIR)	[0..*]	Identifier(s) for the sequence. Identifiers are used for cross-referencing sequence concepts only, not for capturing the sequence itself (which is done with the representation attribute). Note that the complex datatype includes attributes to capture both the identifier and the system or namespace that assigned the identifier. Identifiers MUST be unique within a system or namespace.
name	string	[0..*]	Name(s) for the sequence. Names are intended to be for human-readable purposes only and are not guaranteed to uniquely specify a sequence.
representation	SequenceRepresentation	[0..*]	Representation(s) of the sequence. All representations for a given sequence MUST resolve to the same literal string.
features	LocatedFeature	[0..*]	A list of annotated features on the sequence. This attribute should be used only by Sequence-based structures.
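The rule that all representations of a sequence MUST resolve to the same literal string can be sketched as follows. The classes here are simplified stand-ins, not the FHIR or VRS definitions: `resolve()`, `LiteralRepresentation`, and `RepeatedRepresentation` are hypothetical names introduced for illustration, and `type` is a plain string rather than a CodeableConcept.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


class SequenceRepresentation(Protocol):
    """Hypothetical interface: anything that can yield the literal residues."""

    def resolve(self) -> str:
        ...


@dataclass
class LiteralRepresentation:
    value: str

    def resolve(self) -> str:
        return self.value


@dataclass
class RepeatedRepresentation:
    unit: str
    count: int

    def resolve(self) -> str:
        return self.unit * self.count


@dataclass
class Sequence:
    type: str  # stand-in for a FHIR CodeableConcept
    name: List[str] = field(default_factory=list)
    representation: List[SequenceRepresentation] = field(default_factory=list)

    def literal(self) -> Optional[str]:
        """Return the single literal string shared by all representations."""
        literals = {r.resolve() for r in self.representation}
        if len(literals) > 1:
            raise ValueError("representations resolve to different strings")
        return literals.pop() if literals else None


seq = Sequence(
    type="DNA",
    name=["example repeat"],
    representation=[
        LiteralRepresentation("CAGCAG"),
        RepeatedRepresentation("CAG", 2),
    ],
)
```

Because representation is [0..*], the sketch returns None when no representation is present rather than treating that case as an error.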

LocatedFeature

A located feature is a mapping of a sequence feature to a location on a sequence.

Computational Definition

TBD

Information Model

A mapping between a given sequence feature and its location on a given sequence.

Field	Type	Limits	Description
feature	SequenceFeature	[1..1]	The feature at the given location
location	Location (VRS)	[1..1]	The location of the feature
strand	CodeableConcept (FHIR)	[0..1]	An indicator specifying whether the feature is on the forward or reverse strand of a double-stranded sequence. If not set, the feature is assumed to be on the forward strand (by convention). If the sequence is single-stranded and this attribute is set, it must not be set to “reverse”.
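The strand rules in the table above (default to forward when unset, and no "reverse" on a single-stranded sequence) can be sketched as a small helper. The function name, the string values, and the `double_stranded` flag are assumptions made for illustration; the draft model encodes strand as a CodeableConcept and does not define how strandedness of the sequence is recorded.

```python
from typing import Optional

FORWARD = "forward"
REVERSE = "reverse"


def effective_strand(strand: Optional[str], double_stranded: bool) -> str:
    """Apply the strand conventions and return the strand to use."""
    if strand is None:
        # By convention, an unset strand means the forward strand.
        return FORWARD
    if strand not in (FORWARD, REVERSE):
        raise ValueError(f"unknown strand value: {strand!r}")
    if strand == REVERSE and not double_stranded:
        # A single-stranded sequence has no reverse strand.
        raise ValueError("reverse strand is not allowed on a single-stranded sequence")
    return strand


default_strand = effective_strand(None, double_stranded=True)
```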

SequenceFeature

A sequence feature is a structural or functional feature that can be annotated on a Sequence.

Computational Definition

TBD

Information Model

A structural or functional feature that can be annotated on a Sequence at a defined location (interval). A given feature is not unique to a single Sequence and can be mapped to different locations on different Sequences through contextualization.

Examples of Sequence Features include:

Gene locus
Exon
Intron
5’ or 3’ UTR
Transcription start site
Translation start or stop sites
Coding sequence (CDS)
Codon
Transcription factor binding site
Post-translational modification site
Splice donor/acceptor site
PolyA site

Sequence Features are first-class entities that can stand on their own as independent entries in knowledge bases and be used as the subject of VA statements.

Field	Type	Limits	Description
identifier	Identifier (FHIR)	[0..*]	Identifier(s) for the sequence feature. Note that the complex datatype includes attributes to capture both the identifier and the system or namespace that assigned the identifier. Identifiers MUST be unique within a system or namespace.
name	string	[0..*]	Name(s) for the sequence feature. Names are intended to be for human-readable purposes only and are not guaranteed to uniquely specify a sequence feature.
type	CodeableConcept (FHIR)	[1..1]	The type of feature. The value of this attribute should refer to a term in an ontology (e.g., Sequence Ontology).
contexts	SequenceContext	[0..*]	A list of sequence contexts for the feature. This attribute should be used only by Feature-based structures.

SequenceContext

A sequence context is the definition of a location on a sequence.

Computational Definition

TBD

Information Model

The definition of a location on a given sequence. This mapping provides sequence context (contextualization) for a parent entity (e.g., a feature).

Field	Type	Limits	Description
sequence	Sequence	[1..1]	The Sequence that provides the context for the mapped location.
location	Location (VRS)	[1..1]	A location that defines a region (interval) on the given Sequence.
strand	CodeableConcept (FHIR)	[0..1]	An indicator specifying whether the feature is on the forward or reverse strand of a double-stranded sequence. If not set, the feature is assumed to be on the forward strand (by convention). If the sequence is single-stranded and this attribute is set, it must not be set to “reverse”.
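Contextualization, in which one feature is mapped to different locations on different sequences, might look like the sketch below. The classes are simplified stand-ins: `sequence_id` replaces a reference to a full Sequence object, `start` and `end` replace a VRS Location, and the identifiers and coordinates are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SequenceContext:
    sequence_id: str  # stand-in for a reference to a Sequence
    start: int        # simplified stand-in for a VRS Location
    end: int


@dataclass
class SequenceFeature:
    type: str  # stand-in for a FHIR CodeableConcept
    name: List[str] = field(default_factory=list)
    contexts: List[SequenceContext] = field(default_factory=list)


def spans_by_sequence(feature: SequenceFeature) -> Dict[str, List[Tuple[int, int]]]:
    """Group the feature's locations by the sequence they are defined on."""
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for ctx in feature.contexts:
        spans.setdefault(ctx.sequence_id, []).append((ctx.start, ctx.end))
    return spans


# One exon, contextualized on a genomic sequence and on a transcript sequence.
exon = SequenceFeature(
    type="exon",
    name=["example exon"],
    contexts=[
        SequenceContext("example:genomic", 1200, 1350),
        SequenceContext("example:transcript", 0, 150),
    ],
)
```

Here the feature is the parent entity and each SequenceContext supplies one sequence-specific location for it.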

Annotated Sequence

An annotated sequence is a sequence that contains annotations (e.g., features).

Computational Definition

TBD

Information Model

Conceptually, an annotated sequence is a sequence with associated annotations (e.g., features). Structurally, the class is a container for a Sequence and located features (and because of this an Annotated Sequence is a complex object that is neither a Sequence nor a Feature). This class may serve as a generalized parent class that can be specialized to support complex types of annotated sequences (e.g., transcripts).

Field	Type	Limits	Description
sequence	Sequence	[1..1]	The Sequence that is annotated with features.
features	LocatedFeature	[0..*]	A list of annotated features on the Sequence.
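The container structure described above could be sketched as follows. This is an illustration under stated assumptions: `residues` stands in for a full Sequence object, each located feature carries a flattened `start`/`end` pair in place of a VRS Location, and coordinates are assumed to be interbase (0-based, end-exclusive). The helper methods are hypothetical and not part of the draft model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LocatedFeature:
    feature_type: str  # stand-in for a SequenceFeature
    start: int         # assumed interbase coordinates
    end: int


@dataclass
class AnnotatedSequence:
    residues: str  # stand-in for a Sequence
    features: List[LocatedFeature] = field(default_factory=list)

    def out_of_bounds(self) -> List[LocatedFeature]:
        """Return features whose location does not fit on the sequence."""
        n = len(self.residues)
        return [f for f in self.features if not (0 <= f.start <= f.end <= n)]

    def extract(self, feature: LocatedFeature) -> str:
        """Return the residues covered by a located feature."""
        return self.residues[feature.start:feature.end]


annotated = AnnotatedSequence(
    residues="ATGGCCTAA",
    features=[
        LocatedFeature("CDS", 0, 9),
        LocatedFeature("start codon", 0, 3),
    ],
)
start_codon = annotated.extract(annotated.features[1])
```

Keeping the Sequence and its located features as separate members reflects the point made above that an Annotated Sequence is neither a Sequence nor a Feature.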

Transcript

A transcript is a type of annotated sequence.

Computational Definition

TBD

Information Model

A transcript is a type of annotated sequence. It represents a single-stranded RNA sequence that may contain structural features (e.g., exons) and functional features (e.g., coding region). Because the concept of transcript is so widely used, and because the concepts of sequence, exon, and gene are so closely intertwined with it, an explicit model for Transcript was developed.

Field	Type	Limits	Description
exons	LocatedFeature	[0..*]	A list of sequence features representing the exons of the transcript.
codingRegion	LocatedFeature	[0..*]	A list of sequence features representing the coding region of the transcript.
associatedGenes	identifier	[0..*]	A list of gene(s) associated with the transcript. Note: This structure provides support for fusions, prokaryotic operons, etc., but it does not specify which region(s) or feature(s) within the transcript are associated with a particular gene.
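As a rough sketch of how the Transcript fields might be used together, the example below computes how many residues of each exon fall inside the coding region. The flattened `(start, end)` spans, the interbase coordinate convention, the `coding_part` helper, and the gene identifier are all assumptions for illustration; the draft model represents exons and coding regions as LocatedFeature objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (start, end) on the transcript sequence, assumed interbase coordinates.
Span = Tuple[int, int]


@dataclass
class Transcript:
    residues: str  # stand-in for the transcript's Sequence
    exons: List[Span] = field(default_factory=list)
    coding_region: List[Span] = field(default_factory=list)
    associated_genes: List[str] = field(default_factory=list)


def coding_part(exon: Span, cds: Span) -> int:
    """Number of residues in the exon that fall inside the coding span."""
    return max(0, min(exon[1], cds[1]) - max(exon[0], cds[0]))


tx = Transcript(
    residues="GG" + "ATGGCCTAA" + "TT",  # 2 + 9 + 2 = 13 residues
    exons=[(0, 6), (6, 13)],
    coding_region=[(2, 11)],
    associated_genes=["example:gene-1"],
)
coding_per_exon = [coding_part(e, tx.coding_region[0]) for e in tx.exons]
```

In this toy transcript the two exons contribute 4 and 5 coding residues, which together cover the 9-residue coding region.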