*******************************
Importing a Darwin Core Archive
*******************************

This tutorial guides you through the process of importing from an example Darwin Core
Archive file. As the steps involve the manual preparation of the survey dataset to import
into as well as the data file for upload, the principles can be applied to data from a
variety of sources.

Before proceeding, ensure you have created a new Survey dataset on the warehouse ready
to configure. You should also have a taxon list created ready.

The DwC file contains a meta.xml document which for our example lists the
following columns for 2 different classes. Each DwC file may contain different columns so
you'll need to use the steps below as a guide rather than to exactly copy it:

Class: http://rs.tdwg.org/dwc/terms/Event
-----------------------------------------

In Indicia, DwC events map to the samples entity. Here's an example text file containing
a single record from our example DwC archive

.. code-block:: none
  
  id	eventID	samplingProtocol	eventDate	countryCode	decimalLatitude	decimalLongitude	geodeticDatum
  2010-3-27:858	2010-3-27:858	Point/Punkt	2010-03-27 11:00:00+00	SE	59.0329	17.31623	WGS84

Event Columns
^^^^^^^^^^^^^

* **id** - as this is duplicated by eventID we can ignore this column and remove it from
  the import file so there is no confusion when the data are merged with the occurrence
  data.
* **http://rs.tdwg.org/dwc/terms/eventID** - a unique, externally generated identifier
  for the event which we can map to samples.external_key.
* **http://rs.tdwg.org/dwc/terms/samplingProtocol** - the method or protocol used. We
  can either create a custom attribute with the system function set to "Sample method",
  or use the ``samples.sample_method_id`` field which points to a predefined termlist
  called "Sample Methods". Using the ``sample_method_id`` rather than a custom attribute
  allows us to do things like dynamically configure recording forms according to the
  chosen protocol, but has the disadvantage that there is a single global list of
  sample method terms. If a specialist local list of terms is required for the
  protocols in a survey dataset then a custom sample attribute with its own termlist is
  a better approach. For our example import, we add the distinct terms in this column
  to the "Sample Methods" termlist. To do this:

  * Go to **Lookup lists > Termlists** in the menu.
  * Set the filter to "Filter for **Sample methods** in **Title**" and apply the
    filter.
  * Edit the termlist.
  * On the **Terms** tab, click **New term**, fill in the **Term** field then save it
    and repeat for each term.

  Note you could also import terms from CSV if you had lots to add.
* **http://rs.tdwg.org/dwc/terms/eventDate** - the date of the sampling event which can
  be mapped to the built in sample date fields. Either remove the time from the values
  in the import file, or split it into a 2nd column called Time and import this into
  a new sample custom attribute.
* **http://rs.tdwg.org/dwc/terms/countryCode** - we're going to create a sample custom
  attribute to capture this value, using the list of ISO 3166-1-alpha-2 country codes as
  terms for a lookup. Use the option to create a new termlist when creating the
  attribute and paste in the list of codes. Also set the **Term name** to
  "countryCode" and the **Term identifier** to "http://rs.tdwg.org/dwc/terms/countryCode"
  to maintain the association between the Indicia custom attribute and the DwC field.
  Also note that the custom attribute must be linked to the survey dataset you plan to
  import into (using the checkboxes at the bottom of the form).
* **http://rs.tdwg.org/dwc/terms/decimalLatitude** and **http://rs.tdwg.org/dwc/terms/decimalLongitude**
  can be mapped to ``samples.entered_sref``. We need to combine the 2 values into a single
  field using a simple spreadsheet formula so that decimalLatitude 59.17822 and
  decimalLongitude 17.41426 would be formatted as a single value "59.17822N 17.41426E".
* **http://rs.tdwg.org/dwc/terms/geodeticDatum** - our file contains the value WGS84 for
  all fields. For latitude and longitude data, Indicia requires the SRID of the
  projection which is imported into ``samples.entered_sref_system``, in this case the
  value should be 4326. So a new column is added to the spreadsheet for this value which
  is replicated for all columns.

Class: http://rs.tdwg.org/dwc/terms/Occurrence
----------------------------------------------

Here's an example text file containing a single record from our example DwC archive

.. code-block:: none
  
  id	basisOfRecord	individualCount	eventID	eventDate	scientificNameID	scientificName	kingdom	taxonRank
  2010-3-27:858	HumanObservation	1	2010-3-27:858	2010-03-27 11:00:00+00	201058	Colias hyale	Animalia	species

Occurrence Columns
^^^^^^^^^^^^^^^^^^

* **id** - in order to be able to cross reference back to the source, we store this
  value in ``occurrences.external_key`` since Indicia always generates its own unique
  record IDs.
* **http://rs.tdwg.org/dwc/terms/basisOfRecord** - we can create an occurrence custom
  attribute called "Basis of record" with the data type set to "Lookup", using the
  option to create a new termlist when the attribute is saved. The terms in the termlist
  can be populated from the suggested Darwin Core class names, i.e.:

  * LivingSpecimen
  * PreservedSpecimen
  * FossilSpecimen
  * HumanObservation
  * MachineObservation

  The **Term name** field can be set to "basisOfRecord" and the **Term identifier**
  field can be set to "http://rs.tdwg.org/dwc/terms/basisOfRecord" in order to preserve
  the association between the custom attribute and the associated field in DwC. Also
  note that the custom attribute must be linked to the survey dataset you plan to import
  into (using the checkboxes at the bottom of the form).
* **http://rs.tdwg.org/dwc/terms/individualCount** - we can create an occurrence custom
  attribute to capture the individualCount values. Assuming all the values are integers
  then the data type should be set to "Integer" but in some cases it may be necessary to
  use the "text" data type to capture more messy data. Set the **Term name** to
  "individualCount", the **Term identifier** value to
  "http://rs.tdwg.org/dwc/terms/individualCount" and the **System function** to "Count
  or abundance of a sex or life stage" to link it to the DwC field and identify it as
  holding abundance values. Also note that the custom attribute must be linked to the
  survey dataset you plan to import into (using the checkboxes at the bottom of the
  form). Finally, you can add a validation rule so that the minimum value allowed is
  0, to prevent negative counts being imported.
* **http://rs.tdwg.org/dwc/terms/eventID** - does not need to be imported into the
  occurrences data as it will be identified by a link to the sample. This can be removed
  from the import file *after merging the occurrences and event data as described
  below*.
* **http://rs.tdwg.org/dwc/terms/eventDate** - as **eventID**.
* **http://rs.tdwg.org/dwc/terms/scientificName** - will be mapped to the record's
  identification during the import.
* **http://rs.tdwg.org/dwc/terms/kingdom**, **http://rs.tdwg.org/dwc/terms/taxonRank**
  and **http://rs.tdwg.org/dwc/terms/scientificNameID** should be stored within the
  species list data in Indicia so are not required as import columns. They can be left
  in the file or deleted to keep things a bit simpler.

We now need to merge our events.txt file and occurrences.txt file into a single CSV file,
using the eventID to join the 2 files. Here's an explnation of one possible approach:
https://www.ablebits.com/office-addins-blog/2018/11/14/excel-join-tables-power-query/.

Combined with the above suggested column mappings, the resulting file should look like the
following

.. code-block:: none
  
  eventID	samplingProtocol	eventDate	countryCode	decimalLatitude	decimalLongitude	lat/lon	geodeticDatum	id	basisOfRecord	individualCount	scientificName
  2010-3-27:858	Point/Punkt	2010-03-27	SE	59.0329	17.31623	59.0329N 17.31623E	WGS84	2010-3-27:858	HumanObservation	1	Nymphalis antiopa

This can be saved as a CSV file then imported into the **Observations data > Occurrences**
page on the warehouse.