LongSpine Cache

All of the models and data for LongSpine listed on the Models & Data pages are best loaded into a cache for easy data access. With a 'triplestore' cache, you can run either the example queries listed on the Examples page or, in fact, any SPARQL queries you wish!

Cache sections

  • What is a Semantic Web graph cache?
  • Named Graphs
  • Inferencing
  • Data Loading
  • Extending the cache
  • System technical details

§ What is a Semantic Web graph cache?

A cache of RDF data is usually called a triplestore, that is, a node-and-edge graph database that stores RDF data. A cache's purpose is to allow queries across all the elements loaded into it, so you can use one to query from vocab to vocab (via Linksets), from vocab to Dataset and across Datasets if all the component items are loaded.

Unlike other databases, e.g. relational or NoSQL, RDF triplestores require essentially no configuration to index and link the disparate data items loaded into them. This is because RDF data is already fully defined - the data schema is present within the data itself - and triplestores index RDF data in predictable ways, regardless of the particular data.

Most modern RDF triplestores support the SPARQL query language. As with SQL for relational databases, you can expect to use the same queries regardless of the specific product used. Unlike SQL though, SPARQL also has standardised input methods and output formats, so its use is somewhat more standardised across implementations than SQL's.
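As a rough illustration of that standardisation, the sketch below builds a SPARQL 1.1 Protocol query request and reads the standard SPARQL JSON results layout using only Python's standard library. The endpoint URL is a placeholder rather than this cache's address, and the sample response is hand-written for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, for illustration only
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?c WHERE { ?c a skos:Concept . } LIMIT 10
"""


def build_query_request(endpoint, query):
    """Build (but do not send) a SPARQL 1.1 Protocol 'query via POST' request."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


def binding_values(results_json, var):
    """Extract the values bound to one variable from SPARQL JSON results."""
    doc = json.loads(results_json)
    return [row[var]["value"] for row in doc["results"]["bindings"] if var in row]


# Hand-written sample in the standard SPARQL JSON results layout
SAMPLE = json.dumps({
    "head": {"vars": ["c"]},
    "results": {"bindings": [
        {"c": {"type": "uri", "value": "http://example.org/concept/1"}},
    ]},
})

request = build_query_request(ENDPOINT, QUERY)
concepts = binding_values(SAMPLE, "c")
```

To run the query against a real endpoint, the request object would be passed to urllib.request.urlopen and the response body handed to binding_values.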

What a cache is not

A cache itself is NOT the LongSpine's spine! As noted in the Principles Page's Spine section, the spine is "...a collection of datasets and methods for data presentation that act as an anchor point for other data..." thus the spine is the LongSpine outputs and the fact that they have common purpose. A cache is only a convenient way to access all of the spine's content.

Having said that, a related project, the Location Spine (LocI), has a main, supported cache and, while that system's spine, like this one, is not the cache, that project's cache may easily be conflated with the spine. In time, a well-known, supported cache of LongSpine data may emerge, but multiple caches of all or part of this spine's content may always be created.

To repeat: this spine/cache distinction must be made since it's possible to create multiple or partial caches of the LongSpine's spine for multiple purposes, and the convenience of a cache shouldn't distract from the spine's points-of-truth, which are the home (namespace) locations of the models and datasets which the Models & Data pages list.

§ Named Graphs

Each ontology, vocabulary, Dataset & Linkset loaded into this cache is partitioned off from other loaded elements by placing it into its own Named Graph. These are somewhat like a relational database's schemas - namespace-separated collections of tables within a single database - in that they allow all the data in a Named Graph to be managed as a single unit (removed, updated, reloaded) without affecting other Named Graphs' content, and they also allow queries both within and across Named Graphs.

Querying data in Named Graphs

As with SQL queries across or within particular schemas, SPARQL queries can be written that query all the data in the cache as one or just the data in one or more identified Named Graphs. This query gets all skos:Concept instances from the cache, regardless of the Named Graph they are in:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?c WHERE {
    ?c a skos:Concept .
}

This query only gets skos:Concept instances from the COFOG-A vocabulary:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?c
FROM <http://linked.data.gov.au/def/cofog-a>
WHERE {
    ?c a skos:Concept .
}

Note that since the models' data and the datasets' data are separated out into their own Named Graphs, you will need to combine appropriate models with datasets if you write queries that rely on model rules. By far the easiest thing is to just query the whole cache.

See the SPARQL query language section on Named Graphs for more.
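The two query shapes above differ only in their FROM clauses, so they can be generated from a list of Named Graph URIs. The helper below is a minimal sketch of that idea; the function name is hypothetical and the graph URIs are taken from Table C1.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def concepts_query(named_graphs=()):
    """Return a SPARQL SELECT for skos:Concept instances.

    With no graphs given, the query runs over whatever the store treats as
    its default dataset; each graph URI given adds one FROM clause.
    """
    lines = [f"PREFIX skos: <{SKOS}>", "SELECT ?c"]
    lines += [f"FROM <{g}>" for g in named_graphs]
    lines += ["WHERE {", "    ?c a skos:Concept .", "}"]
    return "\n".join(lines)


whole_cache = concepts_query()
two_vocabs = concepts_query([
    "http://linked.data.gov.au/def/cofog",
    "http://linked.data.gov.au/def/cofog-a",
])
```

When several FROM clauses are given, SPARQL merges the named graphs into the query's default graph, so two_vocabs matches concepts from either vocabulary.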

Named Graph details

All items in the cache are given Named Graph IDs which are their namespace URIs. Table C1 lists all the items in the cache and gives their Named Graph IDs, which can be used in FROM clauses in SPARQL queries.

Table C1: Named Graphs in the LongSpine Cache

Item | Named Graph URI | Access URI

Background Ontologies
Resource Description Framework (RDF) | http://www.w3.org/1999/02/22-rdf-syntax-ns | same
RDF Schema (RDFS) | http://www.w3.org/2000/01/rdf-schema | same
Web Ontology Language (OWL) | http://www.w3.org/2002/07/owl | same
Time Ontology in OWL (TIME) | http://www.w3.org/2006/time | same
The Dataset Catalogue Vocabulary (DCAT) | http://www.w3.org/ns/dcat | same
Vocabulary of Interlinked Datasets (VoID) | http://rdfs.org/ns/void | same
LocI Ontology | http://linked.data.gov.au/def/loci | same
Simple Knowledge Organization System (SKOS) | http://www.w3.org/2004/02/skos/core | same
The Organization Ontology (ORG) | http://www.w3.org/ns/org | same

LongSpine Ontologies
LongSpine Ontology | http://linked.data.gov.au/def/longspine | http://test.linked.data.gov.au/def/longspine
Administrative Arrangement Orders (AAO) Ontology | http://linked.data.gov.au/def/longspine | http://test.linked.data.gov.au/def/aao
AGOR Ontology | http://linked.data.gov.au/def/longspine | http://test.linked.data.gov.au/def/agor
Commonwealth Record Series (CRS) Ontology | http://linked.data.gov.au/def/longspine | http://linked.data.gov.au/def/crs
Portfolio Budget Statements (PBS) Ontology | http://linked.data.gov.au/def/longspine | http://test.linked.data.gov.au/def/pbs
Records Disposal Authority (RDA) Ontology | http://linked.data.gov.au/def/longspine | http://test.linked.data.gov.au/def/rda

LongSpine Vocabularies
Australian Government Interactive Functions Thesaurus (AGIFT) | http://data.naa.gov.au/def/agift | same
Classifications of Functions of Government (COFOG) | http://linked.data.gov.au/def/cofog | http://test.linked.data.gov.au/def/cofog
Classifications of Functions of Government - Australia (COFOG-A) | http://linked.data.gov.au/def/cofog-a | http://test.linked.data.gov.au/def/cofog-a
Commonwealth Record Series Thesaurus (CRS-Th) | http://linked.data.gov.au/def/crs-th | http://test.linked.data.gov.au/def/crs-th
Government Purpose Classification (GPC) | http://linked.data.gov.au/def/gpc | http://test.linked.data.gov.au/def/gpc
Local Government Purpose Classification (LGPC) | http://linked.data.gov.au/def/lgpc | http://test.linked.data.gov.au/def/lgpc
Records Disposal Authority (RDA) Vocabulary | http://linked.data.gov.au/def/rda-voc | http://test.linked.data.gov.au/def/rda-voc

LongSpine Datasets
Administrative Arrangements Orders (AAO) Dataset | not loaded yet | http://test.linked.data.gov.au/dataset/aao
Australian Government Organisations Register (AGOR) | not loaded yet | http://test.linked.data.gov.au/dataset/agor
Commonwealth Record Series (CRS) Dataset | http://linked.data.gov.au/dataset/crs | http://test.linked.data.gov.au/dataset/crs
Portfolio Budget Statements (PBS) Dataset | not loaded yet | http://test.linked.data.gov.au/dataset/pbs

LongSpine Linksets
AGIFT → CRS Thesaurus Linkset | http://linked.data.gov.au/dataset/agiftcrs | http://test.linked.data.gov.au/dataset/agiftcrs
AGIFT → COFOG-A Linkset | http://linked.data.gov.au/dataset/agiftcofoga | http://test.linked.data.gov.au/dataset/agiftcofoga
COFOG → COFOG-A Linkset | http://linked.data.gov.au/dataset/cofogcofoga | http://test.linked.data.gov.au/dataset/cofogcofoga
LGPC → COFOG Linkset | http://linked.data.gov.au/dataset/lgpccofog | http://test.linked.data.gov.au/dataset/lgpccofog
LGPC → GPC Linkset | http://linked.data.gov.au/dataset/lgpcgpc | http://test.linked.data.gov.au/dataset/lgpcgpc

The Dublin Core Terms http://purl.org/dc/terms and schema.org background ontologies, listed on the Models page, aren't loaded into the cache as neither provides any ontological rules useful for building new data (see the Inferencing section next). Both essentially provide vocabularies of properties used for annotations only.

§ Inferencing

Triplestores, such as the one used for this LongSpine cache, are able to take reasoning rules added to them, usually in the form of ontology statements, and apply them to data. This has the effect of "building out" all the data that the rules indicate. This form of reasoning with RDF data is called inference.

For example, the ontological rule in the LongSpine ontology agor:Entity rdfs:subClassOf long:GovernmentStructuralUnit . will cause a new triple to be created for every instance of agor:Entity, typing it also as a long:GovernmentStructuralUnit since, as per the rule, everything that is an agor:Entity is also a long:GovernmentStructuralUnit.

When reasoning is performed at load time, i.e. ontological rules in the triplestore are applied to data as it's loaded or applied to stored data when an ontology is loaded, then it is called forward-chain inference.
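The idea of forward chaining can be sketched in a few lines of plain Python. This toy version applies only two RDFS subclass rules to triples held as tuples; it is not OWL-RL and not how the triplestore implements inference, and the ex:someEntity instance is hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def forward_chain(triples):
    """Apply two RDFS rules repeatedly until no new triples appear.

    Rule 1 (subclass transitivity): A subClassOf B, B subClassOf C => A subClassOf C
    Rule 2 (type propagation):      x type A, A subClassOf B       => x type B
    """
    inferred = set(triples)
    while True:
        subclass = {(s, o) for s, p, o in inferred if p == SUBCLASS_OF}
        new = set()
        for a, b in subclass:
            for s, p, o in inferred:
                if p == SUBCLASS_OF and s == b:
                    new.add((a, SUBCLASS_OF, o))
                if p == RDF_TYPE and o == a:
                    new.add((s, RDF_TYPE, b))
        if new <= inferred:
            # Fixed point reached: every derivable triple is already stored
            return inferred
        inferred |= new


data = {
    ("agor:Entity", SUBCLASS_OF, "long:GovernmentStructuralUnit"),
    # Hypothetical instance, not taken from any LongSpine dataset
    ("ex:someEntity", RDF_TYPE, "agor:Entity"),
}
closure = forward_chain(data)
```

After the call, closure holds the two original triples plus the inferred statement that ex:someEntity is a long:GovernmentStructuralUnit, so a later query for that type needs no reasoning at query time.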

The triplestore used for this cache applies all the ontological rules it finds in all loaded ontologies to all data, within the family of rules known as OWL-RL which is, by design, amenable to implementation with rule-based technologies. This means that while calculating all the ontology rule possibilities when forward-chaining all loaded data takes some time, it takes less time than it would with some other, more expressive, OWL profiles, which are not needed for any LongSpine ontological rules.

§ Data Loading

All of the items listed in Table C1 are loaded into this cache in a 2-step manner using two scripts:

  1. files/get_rdf.py - a Python script which downloads a copy of each item's RDF data from the Internet and stores it in a single directory
  2. files/load_rdf.py - a Python script which uses the API that comes with this triplestore to both load all of the RDF files and perform forward-chain reasoning as per OWL-RL and the ontological rules in the ontologies (see section above)

The Python scripts can be used to assemble all of the cache's data in one place, and many triplestores have commands/scripts to load RDF files and apply inference to them.
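For stores that expose the W3C SPARQL 1.1 Graph Store HTTP Protocol, loading one RDF file into one Named Graph can be sketched as below. This is not the content of files/load_rdf.py, which uses the triplestore's own API; the endpoint URL and function name are hypothetical, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical Graph Store endpoint; a real triplestore documents its own URL
GRAPH_STORE = "http://example.org/rdf-graphs/service"


def build_load_request(graph_uri, rdf_bytes, content_type="text/turtle"):
    """Build (but do not send) a request that replaces one Named Graph's content.

    Per the SPARQL 1.1 Graph Store HTTP Protocol, PUT replaces the graph
    named by the 'graph' query parameter; POST would merge into it instead.
    """
    url = GRAPH_STORE + "?" + urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        url,
        data=rdf_bytes,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# Tiny hand-written Turtle payload standing in for a downloaded RDF file
turtle = b"<http://example.org/a> a <http://www.w3.org/2004/02/skos/core#Concept> ."
load_request = build_load_request("http://linked.data.gov.au/def/cofog-a", turtle)
```

Using the item's namespace URI as the graph parameter keeps the loaded data aligned with the Named Graph IDs in Table C1.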

§ Extending the cache

To extend this particular cache of the LongSpine information, that information - perhaps new vocabularies of government functions or longitudinal data of government structural units - should be accepted as being part of LongSpine and listed on the Data Page. Then it's just a matter of, either manually or via a script, loading the new data into the database that this cache uses while ensuring that the new information is placed in an appropriately named Named Graph.

Currently, the governance of what is in and what is not in LongSpine hasn't been codified.

If non-LongSpine data is to be loaded into this cache, perhaps to demonstrate how others may use the LongSpine by joining their own (external) data to it, then that data can also be loaded into additional Named Graphs within this database. Individual Named Graphs can be cleared from this database by a database admin, so graphs can be added and later removed without compromising the rest of the cache.
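Clearing a single Named Graph is a standard SPARQL 1.1 Update operation, so it can be sketched independently of any particular triplestore. The update endpoint URL, the graph URI and the function name below are hypothetical, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical SPARQL Update endpoint, for illustration only
UPDATE_ENDPOINT = "http://example.org/sparql/update"


def build_clear_request(graph_uri, silent=True):
    """Build (but do not send) a SPARQL 1.1 Update request that empties one graph."""
    keyword = "CLEAR SILENT GRAPH" if silent else "CLEAR GRAPH"
    update = f"{keyword} <{graph_uri}>"
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    return urllib.request.Request(
        UPDATE_ENDPOINT,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical Named Graph holding external, non-LongSpine demonstration data
clear_request = build_clear_request("http://example.org/graph/external-demo")
```

Because the update names exactly one graph, the other Named Graphs in the repository are left untouched; SILENT simply suppresses the error if the graph does not exist.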

The scripts given above can be used to create another cache of LongSpine plus any other information, so should someone want to greatly extend the cache, or perhaps just load a portion of it, they should use or extend those scripts.

§ System technical details

The specific systems powering this cache of LongSpine are:

  • Database - Ontotext's GraphDB
    • free edition
    • version 8.8
    • a single Repository is used for all LongSpine Named Graphs with OWL-RL inferencing turned on in the Repository config
  • Website - Python Flask
    • a very simple web framework with URLs calling python functions that generate HTML by using Python's Jinja2 templates
    • the website only implements 8 endpoints, one for each website page, and a couple of helper functions for authentication
  • Query page - HTML, JavaScript + YASGUI
    • the query page is loaded like all others by the Flask application
    • the fancy query box is presented by the YASGUI tool embedded in the query page's HTML
    • queries from this page are posted to the GraphDB database's query endpoint by YASGUI and results from it collected and displayed also by YASGUI
  • Server - CSIRO-owned Amazon Web Services Linux Virtual Machine
    • all of the items in this system - database, website etc. - are hosted on a single virtual machine in AWS
    • alternate hosting may be appropriate if the site starts to experience much traffic
  • Persistent URIs - Australian Government Linked Data Working Group
    • all URIs used by LongSpine data within the linked.data.gov.au domain or subdomains of it are managed by the Linked Data WG
    • required changes to LD WG URIs, or new URIs, need to be managed through the group