The Calimera Project is funded under the European Commission IST Programme

 

 
Calimera Guidelines

 

 

Cultural Applications: Local Institutions Mediating Electronic Resources

 

 

 

Digital preservation

 

SCOPE

 

This guideline deals with: 

Selection

Technology preservation

Technology emulation

Data migration

Authenticity

Storage

Conservation

Disaster recovery procedures

Formats

Media

Standards

Web archiving and domain archiving

Staffing implications

Administrative and legal implications

 

POLICY ISSUES

 

‘The volume of information is growing at an unprecedented pace. We already produce more information per year than we did in the whole period since we descended from the trees. A lot of this information is digital only, meaning it has no physical representation. That makes it much more volatile. An XML-document for instance is created while you view it. So how do you keep it?’ (Ulrich Kampffmeyer, director of PROJECT CONSULT, Germany) [1].

 

The EU recognises that long-term access to digital as well as analogue resources is crucial to delivering the objective of making Europe “the most competitive and dynamic knowledge-based economy in the world” [2]. ERPANET (Electronic Resource Preservation and Access Network) was set up to address digital preservation issues [3].

 

Digital materials, whether “born digital” or converted to digital form, are at risk from technology obsolescence and physical deterioration. The objective of preserving resources is to ensure that they remain accessible for current and future generations. In the case of digital resources additional considerations (as compared with the preservation of traditional analogue materials) include:

·        technological obsolescence, generally regarded as the greatest technical threat to ensuring continued access. The speed of changes in technology means that the timeframe during which action must be taken is measured in a few years, perhaps only 2 to 5, as opposed to decades or even centuries for traditional materials;

·        the fragility of some storage media used for digital resources. These can deteriorate quickly although externally no damage may be visible;

·        the ease with which changes can be made. This means that there can be challenges associated with ensuring continued authenticity;

·        the dynamic nature of some “born digital” materials. This means that they are intended to be continually updated. This use of technology is very effective for providing up-to-date reference information, maps, etc., but poses challenges in terms of the ability to compare data at different points in time;

·        the lifecycle of a website. The average lifespan of a website is estimated to be about 44 days. This is similar to the problem of ephemera in the analogue world. Ways need to be found to collect and preserve important websites and selected examples of all other websites;

·        the question of originals. With digitised materials, care must be taken to preserve analogue originals. However, with “born digital” materials there is no analogue equivalent to fall back on – once they are lost they are gone forever. For example, the first telegram ever sent, in 1844, has been preserved in analogue format and has been digitised; the first e-mail, sent in 1971, has been lost.   

 

Preservation issues must be considered an integral part of the digital creation process, whether making a digital copy of an analogue item or creating a “born digital” item. It is essential to document and record all the technological procedures that led to the creation of the digital object, and much critical information can be captured only at the point of creation. The costs and implications of not having a preservation strategy can be high. Retrospective preservation, if possible at all, is likely to be expensive. Although techniques such as digital archaeology (rescuing digital resources which have become inaccessible) exist, they are not always successful. Loss of access to the growing body of material only available in digital form could have serious implications for future generations. Precautions that reduce the danger of loss include:

·        storage in a stable controlled environment;

·        implementing regular refreshment cycles;

·        making preservation copies (subject to licensing/copyright permission);

·        establishing appropriate handling procedures;

·        using standard storage formats and media.

 

As well as a technical strategy, an organisational strategy is useful in order to ensure budgets, staff and time are available for what should be an ongoing procedure. ERPANET has published a useful policy tool [4].

 

Some institutions, especially smaller ones, might consider contracting out either the whole preservation process, or the storage of digital materials, to a third party. This too will require careful planning. For guidance on the issues which need to be considered see Simpson, Duncan: Contracting out for digital preservation services: information leaflet and checklist. Digital Preservation Coalition, October 2004. [5]

 

GOOD PRACTICE GUIDELINES

 

How to provide for long-term access should be considered from the planning stage when resources are being digitised (see the guideline on Digitisation), or from the creation stage in the case of “born digital” resources. It is useful to have a “life-cycle” strategy which takes into account data creation, access policies and preservation procedures, and which is in place and ready to be applied before any images are captured. 

 

Selection

A key initial decision which needs to be made concerns selection, i.e. which resources justify preservation. Not all resources can be or need to be preserved for ever; some will not need to be preserved at all, some for a defined period of time, and some indefinitely. With traditional collections, lack of selection for preservation may not necessarily mean that the item will be lost, but in the digital environment it almost certainly will be, so such decisions are crucial. In the case of “born digital” materials it is advisable wherever possible to involve the creators in the selection decision. In cases where there are multiple versions, a decision must be made as to which version is the best one for preservation, or whether more than one should be selected. Sampling dynamic resources, as opposed to attempting to save every change, may be the most practical option. Making such decisions as early as possible helps to target resources towards preserving the most valuable assets.

 

Once the selection of material has been made, an appropriate technical strategy must be chosen, e.g. technology preservation, technology emulation, or data migration.

 

Technology preservation

This is a very high-risk strategy. It involves preserving the original software (and possibly hardware) that was used to create and access the information, maintaining the original operating system and hardware on which to run the resource, and continuing to train staff so that they have the skills needed to keep the systems running. Long-term costs are impossible to estimate. The strategy is likely to be too expensive and impractical for individual institutions (except very large ones with very important collections), although co-operation with other institutions to keep a “pool” of such equipment could be considered. The disadvantages of this strategy include obsolescence, software and hardware eventually wearing out, technical support disappearing over time, and the “pool” equipment being located at some distance from the digital material, making access difficult for local users. Technology preservation is not really an option for small local institutions.

 

Technology emulation

This involves developing techniques for imitating obsolete systems on future generations of computers. At the present time this tends to be expensive and technically complex. It will also have to be redone each time a new technological platform appears. It can thus only be regarded as a solution for the long-term preservation of material, perhaps of global importance, held in large national institutions where emulation can take place on a more-or-less continual basis. An additional consideration is that software copyright issues may need to be addressed (see the guideline on Legal and rights issues).

 

Data migration

This involves copying the data from one hardware/software generation to a newer one, thus keeping it stored in an up-to-date form that continually keeps pace with changes in technology. This is perhaps the simplest and most commonly used method, despite the possibility of data being lost or changed in the migration process. It preserves the intellectual content of the original data but may lose original features and appearance. If these are important then technology preservation or emulation may have to be used. The capture of metadata is a critical part of a migration strategy in order to ensure continued use of the resource if any change in, or loss of, functionality occurs, as it probably will over successive migrations. In this case preservation metadata - describing the software, hardware and management requirements of the digital material - will provide essential information.
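
As an illustration only, preservation metadata of the kind described above might be kept as a small structured record that is updated at each migration. The sketch below is not based on any particular metadata standard; the class names, field names and sample values are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MigrationEvent:
    """One migration step (all field names are illustrative assumptions)."""
    date: str            # ISO 8601 date of the migration
    source_format: str   # format before the migration
    target_format: str   # format after the migration
    note: str = ""       # any observed change in, or loss of, functionality


@dataclass
class PreservationRecord:
    """Hypothetical preservation metadata for one digital object."""
    identifier: str
    current_format: str
    original_software: str   # software used to create the object
    original_hardware: str   # hardware on which it was created
    migrations: list = field(default_factory=list)

    def record_migration(self, event: MigrationEvent) -> None:
        # Keep the full history and update the current format.
        self.migrations.append(event)
        self.current_format = event.target_format


record = PreservationRecord(
    identifier="local-0001",
    current_format="WordPerfect 5.1",
    original_software="WordPerfect 5.1",
    original_hardware="IBM-compatible PC",
)
record.record_migration(
    MigrationEvent("2004-10-01", "WordPerfect 5.1", "PDF",
                   note="footnote layout changed")
)
print(json.dumps(asdict(record), indent=2))
```

Keeping the note with each migration event means that a later user can see where functionality or appearance may have been lost.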

 

Data refreshing is associated with migration. It is the process of copying data from one set or copy of the digital media to another of the same kind, and helps to keep the data in good condition until it is migrated to a new medium.
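
A minimal sketch of a refresh step, assuming the old and new media are both mounted as ordinary file systems: the file is copied and then compared byte for byte with the source. The function name `refresh_copy` is an assumption made for this example.

```python
import filecmp
import shutil
from pathlib import Path


def refresh_copy(source: Path, target: Path) -> None:
    """Copy source to target and confirm the two files are byte-identical."""
    shutil.copyfile(source, target)
    # shallow=False forces a comparison of file contents, not just file metadata.
    if not filecmp.cmp(source, target, shallow=False):
        raise IOError(f"refreshed copy differs from source: {target}")
```

If the comparison fails, the old copy should be retained until the cause has been investigated.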

 

Consideration might also be given to copying data to an older generation of media, namely analogue format. With resources which are digitised from analogue originals it is of course sensible to preserve the originals. It is also possible to preserve “born digital” resources in an analogue format such as permanent paper, preservation microfilm or nickel disk, but this is only suitable in a limited number of cases such as a print-out of a digital document. It is inappropriate for increasingly complex websites etc., where loss of functionality would diminish the usefulness of the resource. 

 

Another possibility which might be considered is preserving screen shots of systems, virtual exhibitions or creative works (particularly those which are “born digital”). This provides a record of the system in the form of a digital image file which is likely to be suitable for long-term preservation.

 

Authenticity (see also the guideline on Security)

The choice of preservation strategy will be influenced by how authentic the preserved item needs to be. There is no universally accepted definition of authenticity, but it broadly means that the preserved copy should be as much like the original as possible, and the connections between documents and objects should be preserved to assist with interpretation. With analogue records, it is possible for example to trace how decisions were reached by examining the relationships between documents; historians are concerned that, with the proliferation of records only available in digital format, this ability might be lost to future generations.

 

In the analogue world, the preserved item usually is the original, although copies may be made for use in order to prevent damage from handling etc. In the digital world the preserved item will be a copy of some sort since there is no physical artefact. As it is dependent on technology for access, over time this copy will be subject to many changes in order to ensure that it is still accessible on new technologies. It is therefore crucial that metadata is preserved with it to define its authenticity, and ideally this should be created simultaneously with the information. For discussions on the challenges posed by authenticity and preservation see Integrity and authenticity of digital cultural heritage objects. Digicult Thematic Issue 1, August 2002. [6]

 

 

 

Storage (see also the guideline on Digitisation)

Strategies for both online and offline storage will be needed. Delivery files in continual use will need to be stored online, on servers. Master files are best stored offline since they are less frequently accessed. Storage of the original analogue objects or source texts is also important and links need to be made to these.

·        Online storage – it is easy to allow storage space to become cluttered with several versions of documents and other unnecessary resources. It would be useful to have a plan which:

°        clarifies which resources need to be accessible online, nearline and offline;

°        sets times for removing certain categories of material from online storage;

°        sets times for reviewing online storage.

·        Offline storage must take into account the problem of media degradation. However, despite their fragility as compared for example with paper, most storage media will outlive the hardware and software needed to use them. Over the last 30 years storage media have moved from punch cards to DVDs via cassettes, floppy disks and CD-ROMs, but the technology to retrieve data stored on the early media is difficult to find. Storage cannot therefore be a once-and-for-all task but must be part of an ongoing regime. Points to consider include:

°        environmental conditions – good environmental conditions will increase the longevity of  digital storage media and help prevent damage to data. Large fluctuations in temperature and humidity are generally thought to be more damaging than constant levels, even when these are slightly less than ideal. BS 4783 Storage, transportation and maintenance of media for use in data processing and information storage [7] contains guidance;

°        archival media – it is advisable to select the best quality archival media affordable. A variety of digital storage media is available, including CD-R, DVD-R, DAT (Digital Audio Tape) and DLT (Digital Linear Tape). Digital images should be preserved on Write Once Read Many (WORM) drives which enable the files to be viewed frequently without being overwritten. TASI (Technical Advisory Service for Images) gives advice on CD-R and DVD-R [8]. The Bundesamt für Sicherheit in der Informationstechnik (BSI) gives advice on standards for archival media [9];

°        archival copies – at least two archival copies should always be made and stored in different locations. If possible multiple copies should be made on different storage media. Copies made using different software, and/or comparable software purchased from different suppliers, will help to protect data against corruption from malfunctioning software or viruses;

°        media refreshing – it is useful to have a plan for refreshing or transferring archive copies to new media at specified times, e.g. 

-       within the minimum time specified by the supplier for the media's viability;

-       when new storage devices are installed;

-       when a quality control check discloses significant temporary or read "errors" in a data resource;

°        quality control – consider having a quality control procedure involving:

-       checking all media periodically for readability;

-       using bit/byte or other checksum comparisons with originals to ensure the authenticity and integrity of items after media refreshing;

-       recording all actions taken. 
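
The quality control steps above could be automated along the following lines. This is a sketch under stated assumptions rather than a recommended tool: SHA-256 is chosen here as the checksum, the manifest is assumed to be a simple mapping from file path to the checksum recorded for the original, and the function names are hypothetical. A matching checksum shows that the bits are unchanged; it does not by itself establish authenticity in the wider sense discussed earlier.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict, log: list) -> bool:
    """Check every file in the manifest and record each result in the log.

    manifest maps a file path to the checksum recorded for the original.
    Returns True only if every file is readable and unchanged.
    """
    all_ok = True
    for name, expected in manifest.items():
        try:
            matches = sha256_of(Path(name)) == expected
            status = "ok" if matches else "checksum mismatch"
        except OSError:
            # The file is missing or the medium could not be read.
            status = "unreadable"
        if status != "ok":
            all_ok = False
        # Record the action taken: date, file and outcome.
        log.append((date.today().isoformat(), name, status))
    return all_ok
```

Running such a check periodically, and again after each media refresh, covers readability, integrity and record keeping in one pass.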

 

Conservation

In the analogue world, preservation is the term used for activities which generally ensure the safekeeping and survival of resources (careful handling, secure packaging during transport, controlled environment etc.), and conservation is that particular aspect of those activities which involves some kind of active intervention with the object, i.e. repair or restoration. In the field of digital resources, the term preservation is more often used, although active conservation measures might be needed at times; some damaged media can be repaired, for example.

 

Disaster recovery procedures and risk management

It is always sensible to have a plan in case of disasters. It should include such considerations as:

·        creating archive copies of all important digital resources as soon as they are acquired or created;

·        using standard storage media and formats;

·        storing archive copies on and off site - in areas in danger of natural disasters such as flooding, off site copies should be at a safe distance away;

·        ensuring that all staff are trained in what to do in the event of a disaster;

·        having a risk management policy. ERPANET has published a risk communication toolkit. [10]
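
As a sketch of how part of such a plan might be checked, the example below tests whether a resource has at least two archive copies, held in more than one location and on more than one type of medium, in line with the advice on archival copies under Storage. The class, field names and warning texts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveCopy:
    location: str   # e.g. a building or site name (illustrative)
    medium: str     # e.g. "CD-R" or "DLT"


def check_copies(copies: list) -> list:
    """Return a list of warnings; an empty list means all checks passed."""
    warnings = []
    if len(copies) < 2:
        warnings.append("fewer than two archive copies")
    if len({c.location for c in copies}) < 2:
        warnings.append("copies are not spread across at least two locations")
    if len({c.medium for c in copies}) < 2:
        warnings.append("all copies use the same storage medium")
    return warnings


copies = [ArchiveCopy("main library", "CD-R"),
          ArchiveCopy("county record office", "DLT")]
print(check_copies(copies))
```

Whether an off-site location is at a safe distance still has to be judged locally; the check only confirms that the locations differ.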

 

Formats

The requirements of a file format for archiving are broadly the same as for creation (see the guideline on Digitisation). It is preferable:

·        to use an open standard file format rather than a proprietary format to guard against obsolescence;

·        to use a file format that can support the embedding of metadata;

·        not to use any compression, to guard against losing data - a format that can store data uncompressed, such as TIFF (Tagged Image File Format) [11], is preferable, but if there is real pressure on space, the PNG (Portable Network Graphics) file format [12], which uses lossless compression, can provide an alternative.
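
When checking that files really are in the expected format, the first few bytes of a file are often more reliable than its extension. The sketch below recognises only the TIFF and PNG signatures defined in their specifications; the function name and the returned labels are assumptions made for the example, and a file format registry or dedicated tool would cover far more formats.

```python
# Leading bytes ("magic numbers") defined by the TIFF and PNG specifications.
SIGNATURES = {
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def sniff_format(data: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"
```

In use, the first eight bytes of a file opened in binary mode would be passed to `sniff_format`.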

 

The same general rules apply to the preservation of audiovisual formats (see also the guideline on Multimedia services), but digital broadcasting media present enormous challenges. Most television is now produced in digital format. It is imperative to involve the programme makers and journalists who create the programmes in the preservation strategy. Institutions such as the Netherlands Institute for Sound and Vision [13] are working on strategies to preserve hundreds of thousands of hours of broadcast material in authentic ways.

 

ERPANET suggests the following file formats for preservation [14]:

·        Text documents: plain ASCII [15], PDF [16], XML [17], TIFF [11];

·        Image documents: TIFF [11], JPEG2000 [18];

·        Audiovisual documents: WAV [19], BWF [20], MPEG [21].

 

Because of the number and variety of formats available, information about them is being collected in file format registries such as the UK National Archives’ PRONOM [22]. Local institutions may need to seek advice from regional or national professional associations as to which format is suitable for their requirements.

 

Media (see also Storage)

As already mentioned, it is advisable to select the best quality archival media affordable. A variety of digital storage media is available, including CD-R, DVD-R, DAT (Digital Audio Tape) and DLT (Digital Linear Tape). Digital images should be preserved on Write Once Read Many (WORM) drives which enable the files to be viewed frequently without being overwritten. TASI (Technical Advisory Service for Images) gives advice [8].

 

Standards (see also the guideline on Digitisation)

It is advisable to adhere to open standards when archiving digital resources. As these are not tied to specific hardware/software they help to guard against the dangers of technological obsolescence. There are standards for the different aspects of storing digital information. Some examples include:

·        interoperability standards - these allow communication between different systems. Examples include ISO 23950:1998 Information and documentation -- Information retrieval (Z39.50) - Application service definition and protocol specification [23] and the CIMI (Consortium for the Computer Interchange of Museum Information) Profile: Z39.50 Application Profile for Cultural Heritage Information [24];

·        resource encoding standards (see also the guidelines on Resource description and Discovery and retrieval) - these define formats for different types of digital information. Adherence to this type of standard allows data compatibility across a wide range of systems.  Examples include standards for:

·       page description formats e.g. PostScript [25], Portable Document Format (PDF) [16];

·       graphics formats e.g. Tagged Image File Format (TIFF) [11], Graphics Interchange Format (GIF) [26];

·       structured information formats e.g. Standard Generalized Markup Language (SGML) [27], Extensible Markup Language (XML) [17];

·       moving images and audio formats e.g. WAV [19], Broadcast Wave Format (BWF) [20], MPEG [21];

·        resource identification standards (see also the guidelines on Resource description and Discovery and retrieval) – these provide a way of uniquely identifying digital resources in order to ensure long-term and reliable access to resources while they are available over the Internet, even when their location changes. URLs (Uniform Resource Locators) can change. Examples of permanent identifiers include URNs (Uniform Resource Names) [28], DOIs (Digital Object Identifiers) [29], PURLs (Persistent Uniform Resource Locators) [30], Handles [31] and ARKs (Archival Resource Keys) [32];

·        resource description standards (see also the guideline on Discovery and retrieval) – these can facilitate effective resource discovery. Examples include AACR2 [33] and Dublin Core [34]. There is also a group of standards which relate to metadata syntax, such as MARC (Machine-Readable Cataloguing) [35] and the EAD (Encoded Archival Description) [36]. The World Wide Web Consortium (W3C) [37] is currently involved in developing the Resource Description Framework (RDF) [38] which will provide the infrastructure to support the coexistence of many different metadata sets, or "schemas", of which the Dublin Core will be one example;

·        data archiving standards – these provide for the long-term preservation of and access to digital information. The Open Archival Information System (OAIS) Reference Model (ISO 14721:2003 Space data and information transfer systems -- Open archival information system -- Reference model) is an example [39];

·        records management standards – these provide guidance on how to implement records management strategies, procedures and practices. The main examples are ISO 15489 Information and documentation -- Records management [40] and ISO/TS 23081-1:2004 Information and documentation - Records management processes - Metadata for records -- Part 1: Principles [41].
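
The idea behind the permanent identifiers mentioned above is indirection: users cite a stable identifier, and a resolver maps it to wherever the resource currently lives. The toy sketch below illustrates the principle only; it does not reproduce how URN, DOI, PURL, Handle or ARK resolution actually works, and the class name, identifier and URLs are invented for the example.

```python
class Resolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self) -> None:
        self._locations = {}

    def register(self, identifier: str, url: str) -> None:
        # Also used to update the location when a resource moves.
        self._locations[identifier] = url

    def resolve(self, identifier: str) -> str:
        # Raises KeyError if the identifier has never been registered.
        return self._locations[identifier]


resolver = Resolver()
resolver.register("example:item-42", "http://old.example.org/item42.html")
# The resource moves; only the resolver's table changes, not the identifier.
resolver.register("example:item-42", "http://new.example.org/collections/42")
print(resolver.resolve("example:item-42"))
```

Citations that use the identifier continue to work after the move, whereas citations of the old URL would break.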

 

It is also advisable to adhere to standard formats and media. Simply using standard file formats and standard media will go a long way towards ensuring the safety of a digital collection. The Technical Advisory Service for Images (TASI) [42] and the Digital Preservation Coalition [43] give valuable advice on both topics; see also the guideline on Digitisation. The Bundesamt für Sicherheit in der Informationstechnik (BSI) gives advice on standards for archival media [9].

 

Various projects are working on standards for digital archiving, including InterPARES [44], Project Prism [45], DAVID [46] and VERS [47].

 

In 2003 the International Internet Preservation Consortium [48] was set up by the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, the UK and the USA, with the aim of preserving Internet content for future generations. In order to do this it aims to develop common tools, techniques and standards, and working groups have been set up to work on these.

 

Web archiving and domain archiving

The ever-expanding size of the World Wide Web and its dynamic and ephemeral nature pose special challenges for projects aiming to capture and store it and to make it accessible for the long term. Several countries, including Canada, Denmark, New Zealand, Norway, South Africa and the UK, have enacted legislation to extend legal deposit to digital publications. Some national libraries, including those of Austria, the Czech Republic, Denmark, Finland, France, Germany, Lithuania, the Netherlands, Norway, Sweden and the UK, are beginning to build national web archives using a variety of approaches, e.g.

·        selective archiving of static web resources, i.e. resources that do not change or contain inter-active or dynamic elements are archived on a selective basis. Denmark, Canada and Japan are the principal exponents of this approach;</