Cache tables
============

The Indicia data model is normalised, which means that data are organised in such a way as
to reduce redundancy and improve integrity. The data required for a biological  record are
split across multiple tables with relationships between the records, as opposed to a more
flat "spreadsheet" approach where there a large number of columns in a single table. This
is generally  desirable in that each item of information needs to only be stored once
rather than repeated in several places, therefore ensuring consistency. However it can make
queries more complex with multiple joins required to bring in all the tables required for
the output of a query and in some cases the additional joins required can reduce the
performance of queries. For  example, to provide the basic "what, where, when and who" of a
set of biological records from the last week's input you need  something akin to the
following SQL:

.. code-block:: sql

  select
    t.taxon, tg.title as taxon_group,
    s.entered_sref, l.name as location_name,
    s.date_start, s.date_end, s.date_type,
    who.text_value as recorder
  from occurrences o
  join samples s on s.id=o.sample_id and s.deleted=false
  left join locations l on l.id = s.location_id and l.deleted=false
  join taxa_taxon_lists ttl on ttl.id=o.taxa_taxon_list_id
    and ttl.deleted=false
  join taxa t on t.id=ttl.taxon_id and t.deleted=false
  join taxon_groups tg on tg.id=t.taxon_group_id and tg.deleted=false
  left join (sample_attribute_values who
  join sample_attributes whoa on whoa.id=who.sample_attribute_id
    and whoa.deleted=false and whoa.system_function='full name'
  ) on who.sample_id=s.id and who.deleted=false
  where o.created_on>now() - '1 week'::interval
  and t.taxon_group_id=<taxon_group_id>

In order to make queries easier to write and also performant, Indicia includes a
set of tables which "flatten" the multiple tables of key parts of the data model into
a few tables which are easy to query and, more importantly, perform well when used to
generate report outputs. Here's an alternative version of the above query:

.. code-block:: sql

  select
    cttl.taxon, cttl.taxon_group,
    snf.public_entered_sref, o.location_name,
    o.date_start, o.date_end, o.date_type,
    snf.recorders
  from cache_occurrences_functional o
  join cache_samples_nonfunctional snf on snf.id=o.sample_id
  join cache_taxa_taxon_lists cttl on cttl.id=o.taxa_taxon_list_id
  where o.created_on>now() - '1 week'::interval
  and o.taxon_group_id=<taxon_group_id>

Not only are there less joins, but an important point is that the vast majority of fields
you might want to filter on are in the `cache_occurrences_functional` table. Filtering
in a single table then joining extra tables for addition of the information required for
output fields is much faster than filtering in different tables in PostgreSQL.

The cache_* tables available in the database are described below.

cache_occurrences_functional
----------------------------

This table contains all the fields which have a common functional use in building reports
that output occurrence data. This means it includes fields that are used for filtering,
sorting and grouping the report output rather than the labels which are typically output
in the report columns displayed to the user.

Hierarchical taxonomic queries are supported by the ``taxon_path`` field which contains an array
of ``taxon_meaning_ids`` for the taxon and all its taxonomic parents. E.g. to query for all mammal
data, find the ``taxon_meaning_id`` for Mammalia on your warehouse's master list, then use an SQL
where clause as follows:

.. code-block:: sql

  WHERE taxon_path && ARRAY[*taxon_meaning_id*]


.. note::

  By keeping all the commonly filtered, sorted and grouped columns in a single table, the
  PostgreSQL query optimiser is able to effectively perform all the processing on a single
  table then join in other columns to obtain output values for display as a last step. This
  is much more efficient than filtering 2 separate tables then joining to obtain the
  intersection. For example, a query that shows all taxa in the 'insects - beetles' group
  for the Dorset vice county should first obtain the IDs matching the 'insects - beetles'
  taxon group and the Dorset location, then select data from cache_occurrences_functional
  filtering on taxon_group_id and location_id_vice_county within the same table. This is
  MUCH more efficient than joining to the taxon_groups and locations tables to filter
  within those tables.

Ref. :ref:`developing/data-model/tables:cache_occurrences_functional`

cache_occurrences_nonfunctional
-------------------------------

Contains additional values for each record which are frequently useful to construct the
display output of a record but rarely used in filtering, grouping or sorting of the report
output.

Custom occurrence attribute values can be obtained from the `attrs_json` field which is an object
keyed by attribute ID. This saves joins to ``occurrence_attribute_values`` and
``cache_termlists_terms`` in order to get attribute values in query results.

Ref. :ref:`developing/data-model/tables:cache_occurrences_nonfunctional`

cache_samples_functional
------------------------

Similar to the cache_occurrences_functional, contains the commonly filtered, sorted and
grouped values for a sample. Note that when querying occurrences this table is unnecessary
since all the values are duplicated in cache_occurrences_functional (for the performance
reasons described above). It is only necessary to use this table when querying a list
of samples.

Ref. :ref:`developing/data-model/tables:cache_samples_functional`

cache_samples_nonfunctional
---------------------------

Contains additional values for each sample which are frequently useful to construct the
display output of a sample or the sample elements of a record but rarely used in filtering,
grouping or sorting of the report output.

Custom sample attribute values can be obtained from the ``attrs_json`` field which is an object
keyed by attribute ID. This saves joins to ``sample_attribute_values`` and
``cache_termlists_terms`` in order to get attribute values in query results.

Ref. :ref:`developing/data-model/tables:cache_samples_nonfunctional`

cache_taxa_taxon_lists
----------------------

Contains values pertaining to a single taxon name, for example you can find the used name,
the preferred name for the taxon as well as the default common name, kingdom, order and
family.

Ref. :ref:`developing/data-model/tables:cache_taxa_taxon_lists`

cache_taxon_searchterms
-----------------------

A table containing all variants and codes that can be used to lookup a taxon in a single
indexed list.

The following example shows how the cache_* tables can be joined to include all the cached
data relevant to a record. Note that in most cases you won't need to include all the
tables here, just the cache_occurrences_functional table plus any others required in the
output:

.. code-block:: sql

  select
  o.id,
  onf.licence_code,
  snf.public_entered_sref,
  vague_date_to_string(o.date_start, o.date_end, o.date_type),
  cttl.taxon,
  cttl.preferred_taxon as accepted_name,
  cttl.default_common_name as common_name,
  cttl.family_taxon,
  cttl.order_taxon
  from cache_occurrences_functional o
  join cache_occurrences_nonfunctional onf on onf.id=o.id
  join cache_samples_nonfunctional snf on snf.id=o.sample_id
  join cache_taxa_taxon_lists cttl on cttl.id=o.taxa_taxon_list_id
  where o.taxon_group_id=1
  and o.website_id=2
  and o.survey_id=3

.. tip::

  Because of the way the indexing works on cache_occurrences_functional, if you want to
  filter on a survey_id to restrict the output to a single dataset, also include a filter
  on the website_id as shown in the query above. This allows a compound index to work so
  is much more efficient.

Ref. :ref:`developing/data-model/tables:cache_taxon_searchterms`