Project for Cave and Karst Data Exchange Standards

Peter Matthews, UIS Informatics Commission

Updated: 2010-07-19


This page will be progressively updated as discussion proceeds via the Commission's web forum, expected to start shortly.
[ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]

Contents

Introduction
Proposal
Requirements to allow data exchange
Record identifiers
Field definitions
Transfer format
Entity list
Discussion
General
Record identifiers
Unique IDs
Merging records
Converting keys
Field definitions
Procedure
Data entry forms
Transfer format
Background
Technique
Summary
Summary
Status

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


INTRODUCTION

One of the basic aims of the UIS Informatics Commission has been to facilitate local and international exchange of data related to caves and karst. This project proposes standards which would achieve this. Most of the work which has been going on to date has been in preparation for opening the formal discussion. The project will be carried out in collaboration with the Karst Information Portal (KIP), and with the online Karst Index Database (KID) run by the Australian Speleological Federation (ASF). The International Geographical Union may also be interested.

In various places through this document and associated web pages you will see reference to this Australian open-source web-based KID database. This is because it is a real-world example of a national cave/karst database which for some years has already been using the standards proposed here, and so gives a degree of confidence that the proposed standards work in practice.

KID background: This national database was originally designed and implemented by myself for ASF, but always in mind was the idea to keep it suitable also for international usage. In 1985 it was used to produce ASF's 500-page book, the Australian Karst Index 1985, and in the late 90's I converted it to a much expanded relational PC-based database, including a data dictionary for the fields. Then in 2001 under the management of Mike Lake, ASF converted it to an open-source web-based database using a paid professional programmer. Mike continues to manage the KID and has been collaborating closely with UISIC in the design of new fields and tables. For more information see: Overview | Website.

The plan now is to make use of this preliminary repository of data fields and their definitions as a practical starting point for our discussions, and where necessary, modify them and add to them to ensure they are fully satisfactory for international use.

In this proposal the same database software or structure is not needed at each end of an exchange, so the standard would facilitate, for example:

Below is UISIC's current proposal. Each aspect in turn will be presented for discussion via our coming Internet forum, prior to formal comment and voting by the delegates from each interested country. In the meantime, your comments and suggestions to improve this proposal are very welcome.

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


PROPOSAL

Design and implement a Cave and Karst Data Exchange Standard as follows:

Requirements to allow data exchange

The following three requirements are needed to allow the valid transfer, comparison, analysis and/or consolidation of cave/karst data between independent databases:

  1. Record Identifier Use of a record identifier which is internationally unique and permanent for each cave or karst feature or other entity being transferred.
  2. Field Definitions Use of internationally agreed definitions for the data fields and field values to be transferred.
  3. Transfer Format Export and import of the exchange data from/to the database via an intermediate standard UIS transfer format.

It is not required that the same software or database structure be used at each end of the transfer.

We now look at these three requirements in more detail:

1. Record Identifiers

The record identifiers (database table keys) should be constructed as follows to conveniently achieve uniqueness:

aabbbnnnnn           For example: AUVSA00035

where:

aa
2-letter ISO country code indicating the country where the record was originally created (example: AU = Australia).
bbb
3-letter organisation code issued within that country and indicating the organisation which originally created the record. (example: VSA = Victorian Speleological Association).
nnnnn
a numeric serial number, being an agreed fixed length for a given entity, and padded left with zeros. Unique within a given aa and bbb. See Entity List below for length.

Once created for a record, the identifier should never be changed, regardless of where the record travels, or what has happened to the original organisation, or which organisation is currently looking after the master copy of the record.

2. Field Definitions

When the field names and field values of international definitions are actually being used, they will need to be expressed in various human languages. Language-independent numeric codes are therefore used wherever possible as a common reference to the field name or field value regardless of the language currently being used.

Field names: Each field name is represented by a simple numeric integer such that a given field with a particular meaning has the same numeric code regardless of the language in which its name and definition are expressed. For example, a Field ID of "7" has the name "Rock type" when expressed in English. The field names themselves have been kept to a length of 25 characters.

Field values: Where field values can be chosen from a fixed vocabulary, they are represented by a simple numeric integer code such that a given field value with a particular meaning has the same numeric code regardless of the language being used. For example, a Field Value of "26" in Field 7 (Rock Type) translates to "sandstone" when expressed in English, or "Sandstein" when expressed in German.

Where commonly accepted local field values or codes already exist for a field which has only local significance, for example, "Geological Bed Names" or "Parish", then these local codes should be used, but the meanings will then need to be transferred, along with the data, in any data exchange.

Field definitions: The formal definition for a field should normally include where appropriate: the definition itself, any comments about the definition, an example, any usage notes, and a Help text for use during data entry. There should also be scope for the addition of locally based comments, examples and usage notes.

Each definition should be set first in English, then after its approval, translated into as many other languages as possible.

To help find whether a particular field exists in the repository, they need to be classified according to their subject. A preliminary Speleological Subject Classification (SCC) scheme has been created, based on the scheme used by the UIS Bibliography Commission.

3. Transfer format

When transferring data between different databases, UIS's standard transfer format should be used. This format will use only standard ISO text characters, and will be independent of any database software or structure. Therefore any database system needs only to be able to translate to or from this one common intermediate format in order to exchange data with any other co-operating database system. This format or markup language is likely to use XML (Extensible Markup Language) and its associated standards.

To translate to or from the transfer format, the project will include an open-source, multi-platform program to translate from csv or fixed-width text files to the transfer format at the source end, and a similar program to translate from the transfer format to csv or fixed-width text files at the target end. It will also include text file templates for describing the table structures at each end of the transfer for use by the respective program.

Entity List

The lengths in the following list should be used for the fixed-length serial number component in the record IDs of the respective entities. Note that the serial number need only be large enough to allow for the maximum number of records for that entity generated by the one organisation, not for the quantity of records stored at any one site; this is because any duplicate serial numbers will be distinguished by the originating country+organisation code.

The list is a draft initial list only. Further entities can be added as needed. The original two-letter codes were chosen to reflect the entity in more than one language where possible; however in keeping with our policy of language-independence, these are expected to be changed to numeric codes, allowing users to substitute letter codes related to their own language if desired.

                                       Max Records
                          Length of    created
ID    Entity (English)    Serial No.   by one Org.
----  ------------------  ----------   -----------
AR    article, paper      6            1M
AT    attribute, field    n/a
AV    attribute value     n/a
CA    cave/karst feature  5            100K
EN    entity              n/a
OR    organisation        4            10K
PA    land parcel         5            100K
PB    publication         4            10K
PE    person              5            100K
PH    photograph          5            100K
PL    plan, map           5            100K
PM    permanent mark      5            100K
PS    map series          3            1K
RE    region, area        4            10K
RL    role                3            1K
RP    report              5            100K
SM    specimen            5            100K
SP    species             5            100K
ST    site, place         3            1K
SV    survey              5            100K
SU    subject             n/a
SY    system              n/a
XK    key-in batch        5            100K
XL    upload batch        5            100K
XU    update batch        5            100K

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


DISCUSSION

General

The first two requirements above (identifier and definitions) should be used from the beginning of a new database if possible. It does not matter which database software you use, nor the structure of your database, nor which subset of the available fields you have chosen, provided that you have adhered to the field and field value definitions. For example, multi-valued fields have to stay as multi-valued fields.

The fact that many of us already have cave databases in existence, and are already using various independent field definitions, should not be a reason to prevent us from establishing a standard which can be used by new systems, or by later evolution of our existing systems if/when we feel that the time is right. Further, as we go through the field definitions, it is expected that we can come up with definitions which will allow many of our existing fields to comply with them anyway. In fact, one of the fields in the proposed list allows classifying the level of compliance of each existing field. Any existing fields which are found to already comply with the standard definitions could then be flagged as such, and validly transferred to other databases.

1. Record identifiers

Unique IDs

The use of an internal identifier (key) is normally routine for identifying and linking database records. Ideally these should normally be a simple integer containing no actual data, however in this case they need to be globally unique so that there is no risk of duplicating an existing key when loaded into someone else's database. We do not want to have to change the incoming key in such a situation, because then any linkages ('foreign keys') between entities in the original incoming tables would be lost.

The method described (a country code + an org code) allows each organisation which produces data records to issue internationally unique keys without needing to refer to any central authority. The 3-letter org codes would be set at the national level by the speleo community in that country.

The serial number part is fixed-length, left-zero-filled, so that the alphanumeric record IDs will sort correctly when required. The serial number component for the ID of a particular entity needs only to be large enough to cover the maximum quantity of records for the entity which could be generated by the one organisation, as opposed to the maximum quantity of entity records stored at any site. The proposed entity IDs and key lengths are shown in the table above.

Public "cave numbers", while needed for normal public usage, are not ideal for this identifier because they do sometimes get changed, they vary in their structure, and they can be unnecessarily long.

The record identifier does not need a component to identify the entity type, because this can readily be handled externally.

The scheme described above is currently already in use in the Australian national cave/karst database.

Merging records

Regardless of record ID design, if data will be consolidated between databases for the same cave or other entity, then arrangements will need to be made for merging the data under a single record ID, usually that of the destination database.

Where local clubs or individuals are contributing to a national or other higher level database, alternatives to complex merging software could be (1) the higher level database is online, so that all contributors can log on in order to add or upload their data, or (2) distributed local databases, but where only one club is responsible for adding and uploading data, both its own and contributed, for any particular cave area. In both cases, this permits the use of a single record ID for each cave or other entity.

In all cases, a mechanism is needed to allow data contributors to get proper attribution for the data they have provided. The UIS field repository already includes many fields to allow this attribution, and sample table structures for both attribution and data quality are available.

Converting keys

Where a database already exists, and it proves to be not feasible to convert its keys to the above scheme, then a mechanism needs to be added so that the international keys are produced whenever data is exported. The mechanism must ensure that the same internal record always produces the same external key. For example, if the existing internal record keys were a simple integer, then the external key could be produced easily by left-padding the number with zeros and adding the appropriate five letters to the front.

Note however that if the key was changed in this way, any instances of its use as a "foreign key" in tables linked to other entities must still also be changed. (A "foreign key" is a non-key field (usually) in a table whose value is the same as the key field(s) of a different table.). For example, a map entity record which describes a map might have a field containing the ID of a cave entity which was shown on that map; when the map and cave records are exported with their new IDs, the cave ID value in the foreign key field of the map record must be altered in the same way as the cave records were. Obviously it's much simpler if a once-only change can be made to the whole database to align its keys and foreign keys to the international standard; from then on, no more key conversions need to be made.

2. Field definitions

Procedure

Field definitions will be systematically discussed in English via the Internet before being circulated to UISIC delegates for further comment and eventual voting. The initial batch of fields will be first-pass general caving fields; after some of these are out of the way and the repository mechanisms are bedded down, we can invite contributions of fields which are more scientific or specialised. The suggested procedure is (improvements invited!):

Data entry forms

A technique for producing paper data-entry forms containing the standard fields has been devised using HTML coding, which gives platform independence and requires no software. To print the form, each A4 page of empty form is displayed on the screen, then printed out to create a stock of blank forms for later off-line filling in. Each field will have its own HTML module in the Fields Form Library. Custom data-entry forms containing only the locally desired standard fields can then be produced by assembling only these field modules, and in the order wanted, into an empty page template. The wording in the field module can also easily be edited to suit local conditions. The use of paper forms facilitates the off-line consolidation and checking of data from disparate sources prior to data entry into a database. These HTML modules will be developed in parallel with the definitions and added to the Fields Form Library.

3. Transfer format

Background

An early version of a transfer format was successfully devised and used by the Australian cave and karst database in 1985 when ASF used it to produce their national cave, map and reference list, the 500-page book, Australian Karst Index 1985. UISIC subsequently developed this format further and issued a draft standard to delegates for comment at the UISIC meeting during the 1989 UIS Congress in Budapest. In 1991 ASF produced a standard formalising their Karst Data Interchange (KDI) format as used in 1985. Since then, a programme has been produced by Glenn Baddeley which translates from this early KDI transfer format into a series of plaintext csv files, each being a table for importing into a multi-table relational database. This was demonstrated at the 1993 UIS Congress in Beijing, and was used in 1999 by ASF to convert all its old 1985 data into its new PC-based relational database. From there it was exported to today's online web-based MySQL database.

Based on the foregoing successful experience, UISIC had planned to issue an updated version of the 1989 UISIC draft for further discussion and comment, however in the meantime, XML and its associated standards had become available, so the whole exchange format issue was revisited by a special UISIC Working Group. This WG aimed to use cave survey fields initially, because of significant interest in being able to transfer cave survey data. However, dealing with cave survey data proved to be too complicated and difficult at that time, and the WG eventually went into hibernation.

The plan is to start a separate WG which will pick up the threads of the old group, but this time use general cave data while developing the format (a markup language), which we have called CaveXML. Hopefully the old group can in due course recommence work on cave survey data transfer, working within the CaveXML format.

This use of XML also implies expressing the previously discussed field definitions by using XML and its associated standards.

Technique

Regardless of the actual exchange format, the basic technique for exchanging data would be as follows:

Voilá!

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


SUMMARY

Summary

Overview of what needs to be done:

  1. Formally start the project and invite delegates from interested countries.
  2. Decide on a name for the standard and the transfer format (markup language). The current name is CaveXML.
  3. Decide on the format for record indentifiers.
  4. Implement a list of 3-letter organisation codes as decided in each country.
  5. Successively choose groups of related fields for discussion, adding new fields to those existing if necessary.
  6. Establish the definitions, and value codes where necessary, for these fields.
  7. After acceptance by vote of the delegates, translate these definitions into other languages.
  8. Create the data entry form module for each field and add it to the Fields Form Library. Appoint a volunteer Forms Manager.
  9. During the project, as fields are decided, keep the web pages updated with progress.
  10. Establish the transfer format.
  11. Write the two transfer programs and design the associated descriptor files.
  12. Formalise the above decisions into an overall standards document for voting.
  13. Convert the existing PC-based fields repository to a web-based repository.
  14. Set up a mechanism for the review and addition of new fields into the repository. Appoint a volunteer Fields Manager.
  15. Invite karst scientists to define and directly contribute peer-reviewed data fields online, related to their particular discipline.
  16. Upgrade the Speleological Subject Classification scheme.

Status

Preliminary work which has already been done:
  1. A record identifier has already been designed (see above), and has proved operational in the Australian national database. It has been designed for general international use, but this needs review and decision.
  2. The 3-letter organisation codes recommended as part of the record ID are already implemented in the Australian database, with most caving organisations in the country identified by their own code.
  3. A preliminary list of relevant entities has been set up as shown above, but their letter identifiers need to be swapped to numeric identifiers.
  4. Our data dictionary or repository is operational as a PC-based relational database which generates the current web pages for field definitions and listings, but it needs conversion to a web-based repository.
  5. As of this date, 680 data fields have already been defined and categorised in our data dictionary, along with a numerically coded fixed vocabulary for those fields needing one, some of which have been translated into German. These fields are displayed on our website. Many of the fields and codes are already in use in the Australian database.
  6. Sample database table structures using these fields are on the website for caves and karst features, maps, data quality and security, and data history and attribution. Sample tables for other entities have been drafted but are not yet on the website.
  7. A preliminary Speleological Subject Classification scheme, extended from the UIS Speleo Abstracts scheme, has been set up to classify the data fields.
  8. Some work has been done successfully on HTML modules for a range of field types for the paper data entry forms to prove the method's feasibility.
  9. The website for the working group for the survey data transfer format is operational, but we need a similar site for the working group tackling general cave data.

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


Updates

2010-07-19

2009-01-24

2002-12-09

2000-11-29

1997-10-26

1997-07-20 Original version

[ Top ] [ Contents ] [ Updates ] [ Fields List ] [ Database tables ] [ Subjects ] [ UISIC home ] [ UIS ]


Copyright © 1997-2010 International Union of Speleology. May be freely reproduced provided this copyright notice is retained.
Page address: http://www.uisic.uis-speleo.org/exchange/exchprop.html
Site: P. Matthews