Proposal for Cave and Karst Data Exchange Standards

Peter Matthews, UIS Informatics Commission

Updated: 9 Dec 2002


As discussion proceeds via the CaveData mailing list, this version will be progressively updated from the original version published in the Informatics Bulletin (No. 5, July 1997). Note: The detailed material is not yet complete, so discussion has not yet been started.

Contents

Introduction
Proposal
  Requirements to allow data exchange
    Record identifiers
    Field definitions
    Transfer format
  Entity list
Discussion
  Record identifiers
  Field definitions
  Transfer format

INTRODUCTION

One of the basic aims of the UIS Informatics Commission has been to facilitate local and international exchange of data related to caves and karst. Most of the work which has been going on to date has been in preparation for opening this discussion. The International Geographical Union is also interested in this issue, and so it will be a joint UIS/IGU project.

Because the same database software or structure is not needed at each end, an exchange standard will facilitate:

Below is UISIC's opening proposal. Each aspect in turn will be presented for discussion via the CaveData Internet mailing list, prior to emailing (or posting) the results to delegates for comment and later for voting.

If you do not have an Internet connection you will not really be able to take part fully in this discussion phase. However, if you have views on this matter, you can still supply input by sending it to me (if possible in plain text on a diskette) for incorporation into the discussion. Official UISIC delegates who do not have Internet will still receive the final drafts by post for comment, and later will receive the final material for voting.

PROPOSAL

Requirements to allow data exchange

The following three requirements are needed to allow the valid transfer, comparison and/or consolidation of cave/karst data between independent databases:

  1. Record Identifier: Use of a record identifier which is internationally unique and permanent for each cave or karst feature or other entity being transferred.
  2. Field Definitions: Use of internationally agreed definitions for the data fields and field values to be transferred.
  3. Transfer Format: Export and import of the exchange data from/to the database via an intermediate standard UIS transfer format.

It is not required that the same software or database structure be used at each end of the transfer.

We now look at these three requirements in more detail:

1. Record Identifiers

The record identifiers (database keys) should be constructed as follows to conveniently achieve uniqueness:

aabbbnnnnn

For example: AUVSA00035

where:

aa
  2-letter ISO country code indicating the country where the record was originally created (for example, AU = Australia).
bbb
  3-letter organisation code issued within that country and indicating the organisation which originally created the record (for example, VSA = Victorian Speleological Association).
nnnnn
  A numeric serial number of an agreed fixed length for a given entity, padded on the left with zeros. It is unique within a given aa and bbb. See the Entity List below for the length.

Once created for a record, the identifier should never be changed, regardless of where the record travels, or what has happened to the original organisation, or which organisation is currently looking after the master copy of the record.
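The construction described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the helper names make_record_id and split_record_id are hypothetical, and the sketch assumes upper-case letters (as in the AUVSA00035 example) and that a serial of zero is permitted.

```python
import re

# Layout from the proposal: 2-letter country code, 3-letter organisation
# code, then a zero-padded serial number whose length depends on the entity.
# Assumption: letters are upper case, as in the example AUVSA00035.
RECORD_ID = re.compile(r"([A-Z]{2})([A-Z]{3})([0-9]+)")


def make_record_id(country: str, org: str, serial: int, serial_len: int) -> str:
    """Build an identifier such as AUVSA00035 (hypothetical helper)."""
    if not re.fullmatch(r"[A-Z]{2}", country):
        raise ValueError("country must be a 2-letter upper-case code")
    if not re.fullmatch(r"[A-Z]{3}", org):
        raise ValueError("organisation must be a 3-letter upper-case code")
    # Assumption: serial 0 is allowed; the proposal does not say either way.
    if not 0 <= serial < 10 ** serial_len:
        raise ValueError("serial does not fit in the agreed length")
    # Left-pad the serial with zeros to the agreed fixed length.
    return f"{country}{org}{serial:0{serial_len}d}"


def split_record_id(record_id: str) -> tuple[str, str, int]:
    """Split an identifier into (country, organisation, serial)."""
    match = RECORD_ID.fullmatch(record_id)
    if match is None:
        raise ValueError(f"not a valid record identifier: {record_id!r}")
    return match.group(1), match.group(2), int(match.group(3))


print(make_record_id("AU", "VSA", 35, 5))
print(split_record_id("AUVSA00035"))
```

Because the serial is validated against the agreed length before padding, an organisation that exhausts its range gets an error rather than a longer, non-conforming identifier.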

2. Field Definitions

When the field names and field values of international definitions are actually being used, they will need to be expressed in various human languages. Language-independent numeric codes are therefore used wherever possible as a common reference to the field name or field value regardless of the language currently being used.

Field names: Each field name is represented by a simple numeric integer such that a given field with a particular meaning has the same numeric code regardless of the language in which its name and definition are expressed. For example, a Field ID of "7" could have the name "Rock type" when expressed in English.

The field names themselves are recorded in two fields - one for normal usage and having a length of 25 characters, and another to suit some early database systems and having a length of 10 characters.

Field values: Each field value is, wherever possible, represented by a simple numeric integer code such that a given field value with a particular meaning has the same numeric code regardless of the language being used. For example, a Field Value of "26" in Field 7 (Rock Type) could translate to "sandstone" when expressed in English, or "Sandstein" when expressed in German.

Where commonly accepted local field values or codes already exist for a field which has only local significance, for example, "Geological Bed Names" or "Parish", then these local codes should be used, but the meanings will then need to be transferred, along with the data, in any data exchange.
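As a rough illustration of the language-independent codes, the sketch below stores only the numeric codes and looks up names at display time. The table layout and the describe function are assumptions made for this example; only the Field 7 / Value 26 entries come from the text above, and a missing translation simply falls back to the numeric code.

```python
# Names are looked up by numeric code plus language; the stored data
# itself would contain only the numeric codes (here field 7, value 26).
FIELD_NAMES = {
    (7, "en"): "Rock type",
}
VALUE_NAMES = {
    (7, 26, "en"): "sandstone",
    (7, 26, "de"): "Sandstein",
}


def describe(field_id: int, value_code: int, language: str) -> str:
    """Render a coded field value in the requested language.

    Falls back to the numeric code when no translation is recorded.
    """
    field = FIELD_NAMES.get((field_id, language), f"field {field_id}")
    value = VALUE_NAMES.get(
        (field_id, value_code, language), f"value {value_code}"
    )
    return f"{field}: {value}"


print(describe(7, 26, "en"))
print(describe(7, 26, "de"))
```

The same stored pair (7, 26) is rendered differently for each reader, which is what allows records to move between databases that use different human languages.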

3. Transfer format

When transferring data between different databases, UIS's standard transfer format should be used (Name: Karstcom? InterKarst? ...). This format will use only standard ISO text characters, and will be independent of any database software or structure. Therefore any database system needs only to be able to translate to or from this one common intermediate format in order to exchange data with any other co-operating database system.
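The benefit of a single intermediate format can be shown with simple counting. The comparison with direct pairwise translation is our own illustration and is not stated in the proposal: if every system had to translate directly to every other system, each ordered pair would need its own converter, whereas with a common format each system needs only one exporter and one importer.

```python
def direct_converters(systems: int) -> int:
    """Converters needed if each system translates directly to every other."""
    return systems * (systems - 1)


def hub_converters(systems: int) -> int:
    """Converters needed if each system only exports to and imports from
    one common intermediate format."""
    return 2 * systems


for n in (2, 5, 10):
    print(n, direct_converters(n), hub_converters(n))
```

For ten co-operating systems this is 90 direct converters against 20 with the intermediate format, and adding an eleventh system requires writing only two more.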

Entity List

The lengths in the following list should be used for the fixed-length serial number component in the record IDs of the respective entities. Note that the serial number need only be large enough to allow for the maximum number of records for that entity generated by the one organisation, not for the quantity of records stored at any one site; this is because any duplicate serial numbers will be distinguished by the originating country+organisation code.

The list is a draft initial list only. Further entities can be added as needed. The two-letter codes have been chosen to reflect the entity in more than one language where possible.

                                       Max Records
                          Length of    created by
ID    Entity              Serial No.   one Org.
----  ------------------  ----------   -----------
AR    article, paper      6            1M
AT    attribute, field    n/a
AV    attribute value     n/a
CA    cave/karst feature  5            100K
EN    entity              n/a
JN    journal             4            10K
OR    organisation        4            10K
PA    land parcel         5            100K
PE    person              5            100K
PH    photograph          5            100K
PL    plan, map           5            100K
PM    permanent mark      5            100K
PS    map series          3            1K
RE    region, area        4            10K
RP    report              5            100K
SM    specimen            5            100K
SP    species             5            100K
ST    site, place         3            1K
SV    survey              5            100K
SU    subject             n/a
SY    system              n/a
XK    key-in batch        5            100K
XL    upload batch        5            100K
XU    update batch        5            100K
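The relationship between the serial length and the maximum number of records in the table above is simply a power of ten. The following sketch encodes the draft lengths and checks an identifier against them. The function names are hypothetical, entities marked n/a are omitted, and the count assumes that the all-zeros serial is usable.

```python
import re

# Serial-number lengths copied from the draft entity list above
# (entities marked n/a are omitted).
SERIAL_LENGTH = {
    "AR": 6, "CA": 5, "JN": 4, "OR": 4, "PA": 5, "PE": 5, "PH": 5,
    "PL": 5, "PM": 5, "PS": 3, "RE": 4, "RP": 5, "SM": 5, "SP": 5,
    "ST": 3, "SV": 5, "XK": 5, "XL": 5, "XU": 5,
}


def max_records(entity: str) -> int:
    """Distinct serials one organisation can issue for an entity.

    Assumption: the all-zeros serial counts as usable.
    """
    return 10 ** SERIAL_LENGTH[entity]


def has_valid_length(entity: str, record_id: str) -> bool:
    """Check for 5 upper-case letters followed by the agreed number of digits."""
    pattern = "[A-Z]{5}[0-9]{%d}" % SERIAL_LENGTH[entity]
    return re.fullmatch(pattern, record_id) is not None


print(max_records("AR"), max_records("CA"), max_records("PS"))
print(has_valid_length("CA", "AUVSA00035"))
```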


DISCUSSION

The first two requirements above (identifier and definitions) should be used from the beginning if possible. It does not matter which database software you use, nor the structure of your database, nor which subset of the available fields you have chosen, provided that you have adhered to the field and field value definitions. For example, multi-valued fields have to stay as multi-valued fields.

The fact that many of us already have cave databases in existence, and are already using various independent field definitions, should not be a reason to prevent us from establishing a standard which can be used by new systems, or by later evolution of our existing systems if/when we feel that the time is right. Further, as we go through the field definitions, it is expected that we can come up with definitions which will allow many of our existing fields to comply with them anyway. In fact, one of the fields in the proposed list allows classifying the level of compliance of each existing field. Any existing fields which are found to already comply with the standard definitions could then be validly transferred to other databases.

Record identifiers

The use of an internal identifier (key) is normally routine for identifying and linking database records. However, it needs to be globally unique so that there is no risk of it duplicating an existing key when loaded into someone else's database. We do not want to have to change the incoming key in such a situation, because then any linkages between entities in the original incoming tables would be lost.

Public "cave numbers", while needed for normal public usage, are not ideal for this identifier because they do sometimes get changed, they vary in their structure, and they can be unnecessarily long.

The record identifier does not need a component to identify the entity type, because this can readily be handled externally.

The scheme described above is currently in use as a test in the Australian national database.

The method described (a country code + an org code) allows each organisation which produces data records to issue internationally unique keys without needing to refer to any central authority. The 3-letter org codes would be set at the national level by the speleo community in that country.

The serial number part is fixed-length, left-zero-filled, so that the alphanumeric record IDs will sort correctly when required. The serial number component for the ID of a particular entity needs only to be large enough to cover the maximum quantity of records for the entity which could be generated by the one organisation, as opposed to the maximum quantity of entity records stored at any site. The proposed entity codes and key lengths are shown in the table above.
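The sorting point can be checked with a short example. The unpadded identifiers are shown only for contrast and are not part of the proposal.

```python
# Fixed-length, zero-padded serials: plain text sorting matches numeric order.
padded = ["AUVSA00100", "AUVSA00035", "AUVSA00002"]

# The same serials without padding (for contrast only).
unpadded = ["AUVSA100", "AUVSA35", "AUVSA2"]

print(sorted(padded))    # ['AUVSA00002', 'AUVSA00035', 'AUVSA00100']
print(sorted(unpadded))  # ['AUVSA100', 'AUVSA2', 'AUVSA35']
```

Without padding, a character-by-character comparison puts serial 100 ahead of serial 2, which is why the fixed length matters for any system that sorts keys as text.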

Regardless of ID design, organisational arrangements need to be made to allow separate clubs to contribute their data to the total database information for a given cave, i.e. merging of records. In the Australian pilot, this is done by allowing only one club to update the national database for any given cave area, but of course with a mechanism to allow other clubs to contribute, and to get proper attribution for the data they have provided.

Where a database already exists, and it proves to be not feasible to convert its keys to the above scheme, then a mechanism needs to be added so that the international keys are produced whenever data is exported. The mechanism must ensure that the same internal record always produces the same external key. For example, if the existing internal record keys were a simple integer, then the external key could be produced by left-padding the number with zeros and adding the appropriate five letters to the front.

Note, however, that if the key is changed in this way, any instances of its use as a "foreign key" in linked tables of other entities must also be changed. (A "foreign key" is a field, usually a non-key field, in one table whose value is the same as the key field(s) of a different table.) For example, a map entity record describing a map might have a field containing the ID of a cave entity which was shown on that map; when the map and cave records are exported, the cave ID value in the foreign key field of the map record must be altered in the same way as the cave records were. Obviously it's much simpler if a once-only change can be made to the whole database to align its keys and foreign keys to the international standard; from then on, no more key conversions need to be made.
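A minimal sketch of such an export step follows, assuming a database whose internal keys are simple integers. The record layouts, the field names cave_id and map_id, and the function names are all hypothetical; the AUVSA prefix is taken from the earlier example, and the serial length of 5 is the draft value for both caves (CA) and maps (PL).

```python
PREFIX = "AUVSA"   # country + organisation code from the earlier example
SERIAL_LEN = 5     # draft serial length for both CA and PL entities


def external_key(internal_key: int) -> str:
    """Map an internal integer key to its external key.

    The mapping is a pure function, so the same internal record
    always produces the same external key.
    """
    return f"{PREFIX}{internal_key:0{SERIAL_LEN}d}"


def export_records(records: list[dict], key_fields: list[str]) -> list[dict]:
    """Return copies of the records with every listed key field converted.

    key_fields must name the record's own key and any foreign keys,
    so that links between exported tables stay intact.
    """
    exported = []
    for record in records:
        converted = dict(record)  # leave the internal record untouched
        for field in key_fields:
            if converted.get(field) is not None:
                converted[field] = external_key(converted[field])
        exported.append(converted)
    return exported


# Hypothetical internal tables: the map record points at the cave record.
caves = [{"cave_id": 35, "name": "Example Cave"}]
maps = [{"map_id": 7, "cave_id": 35}]

exported_caves = export_records(caves, ["cave_id"])
exported_maps = export_records(maps, ["map_id", "cave_id"])
print(exported_caves)
print(exported_maps)
```

After export, the cave_id in the map record and the cave_id of the cave record are both AUVSA00035, so the link between the two tables survives the conversion.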

Field definitions

Field definitions will be systematically discussed in English via the Internet before being circulated to UISIC delegates for further comment and eventual voting. This initial batch consists of first-pass general caving fields; once some of these are out of the way we can also start to look at fields which are more scientific or specialised. The suggested procedure is (improvements invited!):

A technique for producing paper data-entry forms containing the standard fields has been devised using HTML for platform independence. Custom data-entry forms containing the desired sub-set of standard fields can therefore be produced where wanted, facilitating the off-line consolidation of data from disparate sources prior to data entry into a database. These will be developed in parallel with the definitions and added to a fields Form Library.

Transfer format

Background: An early version of a transfer format was successfully used by the Australian database in 1985 when ASF used it to produce their national cave, map and reference list, the 500-page Australian Karst Index 1985 book. UISIC subsequently issued a draft standard to delegates for comment at the UISIC meeting during the 1989 UIS Congress in Budapest. In 1991 ASF produced a standard formalising their Karst Data Interchange (KDI) format as used in 1985. Since then, a programme has been produced by Glenn Baddeley which translates from this early KDI transfer format into a series of plaintext tables for importing into a multi-table database. This was demonstrated at the 1993 UIS Congress in Beijing, and was used as recently as 1999 by ASF to convert all its old 1985 data into its new PC-based relational database.

Based on the foregoing successful experience, UISIC had planned to issue an updated version of the 1989 UISIC draft for further discussion and comment. However, XML and its associated standards are now available, so the whole exchange format is being revisited by a special Working Group. This also implies expressing the previously discussed field definitions using XML and associated formats.

Nevertheless, regardless of the actual exchange format, the basic technique of exchange would be as follows:

Voilà!



Updates

9-Dec-02

29-Nov-00

26-Oct-97

20-Jul-97 Original version


Copyright © 1997-2000 UIS Informatics Commission.
May be freely reproduced provided this copyright notice is retained.
Page address: http://www.uisic.uis-speleo.org/exchange/exchprop.html
Site: P. Matthews