The Calimera Project is funded
under the European Commission,
IST Programme
Calimera Guidelines
Cultural Applications:
Local Institutions Mediating Electronic Resources
Digital preservation
This guideline deals with:
Web archiving and domain archiving
Administrative and legal implications
POLICY ISSUES
‘The volume of
information is growing at an unprecedented pace. We already produce more
information per year than we did in the whole period since we descended from
the trees. A lot of this information is digital only, meaning it has no
physical representation. That makes it much more volatile. An XML-document for
instance is created while you view it. So how do you keep it?’ (Ulrich Kampffmeyer, director of
PROJECT CONSULT,
The EU recognises that long-term access to
digital as well as analogue resources is crucial to delivering the objective of
making
Digital materials, whether “born digital” or
converted to digital form, are at risk from technology obsolescence and
physical deterioration. The objective of preserving resources is to ensure that
they remain accessible for current and future generations. In the case of
digital resources additional considerations (as compared with the preservation
of traditional analogue materials) include:
·
technological
obsolescence, generally regarded as the greatest technical
threat to ensuring continued access. The speed of changes in technology means
that the timeframe during which action must be taken is measured in a few
years, perhaps only 2 to 5, as opposed to decades or even centuries for
traditional materials;
·
the fragility of some storage media used for
digital resources. These can deteriorate quickly although externally no damage
may be visible;
·
the ease with which changes can be made. This
means that there can be challenges associated with ensuring continued
authenticity;
·
the dynamic nature of some “born digital”
materials. This means that they are intended to be continually updated. This
use of technology is very effective for providing up-to-date reference information,
maps, etc., but poses challenges in terms of the ability to compare data at
different points in time;
·
the lifecycle of a website. The average
lifespan of a website is estimated to be about 44 days. This is similar to the
problem of ephemera in the analogue world. Ways need to be found to collect and
preserve important websites and selected examples of all other websites;
·
the question of originals. With digitised
materials, care must be taken to preserve analogue originals. However, with
“born digital” materials there is no analogue equivalent to fall back on – once
they are lost they are gone forever. For example, the first telegram ever sent,
in 1844, has been preserved in analogue format and has been digitised; the
first e-mail, sent in 1971, has been lost.
Preservation
issues must be considered an integral part of the digital creation process,
whether making a digital copy of an analogue item or creating a “born digital”
item. It is essential to document and record all the technological procedures
that led to the creation of the digital object, and much critical information
can be captured only at the point of creation. The costs and implications of
not having a preservation strategy can be high. Retrospective preservation, if
possible at all, is likely to be expensive. Although techniques such as digital
archaeology (rescuing digital resources which have become inaccessible) exist,
they are not always successful. Loss of access to the growing body of material only
available in digital form could have serious implications for future
generations. Precautions can be taken which will reduce the danger of loss such
as:
·
storage in a stable controlled environment;
·
implementing regular refreshment cycles;
·
making preservation copies (subject to
licensing/copyright permission);
·
establishing appropriate handling procedures;
·
using standard storage formats and media.
As well
as a technical strategy, an organisational strategy is useful in order to
ensure budgets, staff and time are available for what should be an ongoing
procedure. ERPANET has published a useful policy tool [4].
Some institutions, especially smaller ones, might
consider contracting out either the whole preservation process, or the storage
of digital materials, to a third party. This too will require careful planning.
For guidance on the issues which need to be considered see Simpson,
GOOD PRACTICE GUIDELINES
How to provide for long-term access should be
considered from the planning stage when resources are being digitised (see the
guideline on Digitisation),
or from the creation stage in the case of “born digital” resources. It is
useful to have a “life-cycle” strategy which takes into account data creation,
access policies and preservation procedures, and which is in place and ready to
be applied before any images are captured.
Selection
A key initial decision which needs to be made concerns
selection, i.e. which resources justify
preservation. Not all resources can be or need to be preserved for ever; some
will not need to be preserved at all, some for a defined period of time, and
some indefinitely. With traditional
collections, lack of selection for preservation may not necessarily mean that
the item will be lost, but in the digital environment it almost certainly will
be, so such decisions are crucial. In the case of “born digital” materials it
is advisable wherever possible to involve the creators in the selection
decision. In cases where there are multiple versions, a decision must be made
as to which version is the best one for preservation, or whether more than one
should be selected. Sampling dynamic resources, as opposed to attempting to save
every change, may be the most practical option. Making such decisions as early
as possible helps to target resources towards preserving the most valuable
assets.
Once the selection of material has been made,
an appropriate technical strategy must be chosen, e.g. technology
preservation, technology emulation, or data migration.
Technology preservation
This is a very high-risk strategy. It involves
preserving the original software (and possibly hardware) that was used to
create and access the information. It also involves preserving and maintaining
both the original operating system and hardware on which to run the resource,
and continuing to train staff so that they have the skills needed to keep the
systems running. Long-term costs are impossible to estimate. It is likely to be
too expensive and impractical for individual institutions (except very large
ones with very important collections), although co-operation with other
institutions to keep a “pool” of such equipment could be considered. The
disadvantages to this strategy include obsolescence, software and hardware
eventually wearing out, technical support disappearing over time, and the
“pool” equipment being in a location at some distance from the digital material
making access for local users difficult. Technology preservation is not really
an option for small local institutions.
Technology emulation
This involves developing techniques for imitating
obsolete systems on future generations of computers. At the present time this
tends to be expensive and technically complex. Also it will have to be re-done
each time a new technological platform appears. It can thus only be regarded as
a solution for long-term preservation of perhaps globally important material
held in large national institutions where emulation can take place on a
more-or-less continual basis. An additional consideration is that software
copyright issues may need to be addressed (see the guideline on Legal
and rights issues).
Data migration
This involves copying the data from one hardware/software generation to a newer one,
thus keeping it stored in an up-to-date form that continually keeps pace with
changes in technology. This is perhaps the simplest and most commonly used
method, despite the possibility of data being lost or changed in the migration
process. It preserves the intellectual content of the original data but may
lose original features and appearance. If these are important then technology
preservation or emulation may have to be used. The capture of metadata is a
critical part of a migration strategy in order to ensure continued use of the
resource if any change in, or loss of, functionality occurs, as it probably
will over successive migrations. In this case preservation metadata -
describing the software, hardware and management requirements of the digital material
- will provide essential information.
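The kind of preservation metadata described above can be sketched as a small record that travels with the digital object and is updated at each migration. This is only an illustration: the class name, field names and sample values are hypothetical and are not taken from any metadata standard.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PreservationRecord:
    """Hypothetical preservation metadata for one digital object."""
    identifier: str
    file_format: str
    created_with_software: str
    hardware_requirements: str
    management_notes: str
    migrations: list = field(default_factory=list)

    def record_migration(self, date: str, from_format: str,
                         to_format: str, note: str = "") -> None:
        # Keep a history so that any change in, or loss of,
        # functionality is documented alongside the object.
        self.migrations.append(
            {"date": date, "from": from_format, "to": to_format, "note": note}
        )
        self.file_format = to_format


# Sample values are invented for illustration only.
record = PreservationRecord(
    identifier="example-0001",
    file_format="WordPerfect 5.1",
    created_with_software="WordPerfect 5.1 for DOS",
    hardware_requirements="IBM-compatible PC",
    management_notes="Review format every 3 years",
)
record.record_migration("2004-06-01", "WordPerfect 5.1", "PDF",
                        note="Macros not carried over")
print(json.dumps(asdict(record), indent=2))
```

Storing the record as plain JSON text is itself an assumption; an institution would normally follow whichever metadata schema its professional association recommends.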
Data
refreshing is associated with migration. It is the process of copying data from
one set or copy of the digital media to another of the same kind and helps to
keep the data in good condition until it is migrated to a new medium.
Consideration
might also be given to copying data to an older generation of media, namely
analogue format. With resources which are digitised from analogue originals it
is of course sensible to preserve the originals. It is also possible to
preserve “born digital” resources in an analogue format such as permanent
paper, preservation microfilm or nickel disk, but this is only suitable in a
limited number of cases such as a print-out of a digital document. It is
inappropriate for increasingly complex websites etc., where loss of
functionality would diminish the usefulness of the resource.
Another
possibility which might be considered is preserving screen shots of systems,
virtual exhibitions or creative works (particularly those which are “born
digital”). This provides a record of the system in the form of a digital image
file which is likely to be suitable for long-term preservation.
Authenticity (see also the guideline on Security)
The
choice of preservation strategy will be influenced by how authentic the
preserved item needs to be. There is no universally accepted definition of
authenticity, but it broadly means that the preserved copy should be as much
like the original as possible, and the connections between documents and
objects should be preserved to assist with interpretation. With analogue
records, it is possible for example to trace how decisions were reached by
examining the relationships between documents; historians are concerned that,
with the proliferation of records only available in digital format, this
ability might be lost to future
generations.
In the
analogue world, the preserved item usually is the original, although copies may
be made for use in order to prevent damage from handling etc. In the digital
world the preserved item will be a copy of some sort since there is no physical
artefact. As it is dependent on technology for access, over time this copy will
be subject to many changes in order to ensure that it is still accessible on
new technologies. It is therefore crucial that metadata is preserved with it to
define its authenticity, and ideally this should be created simultaneously with
the information. For discussions on the challenges posed by authenticity and
preservation see Integrity and
authenticity of digital cultural heritage objects. Digicult Thematic Issue
1, August 2002. [6]
Storage (see also the guideline on Digitisation)
Strategies
for both online and offline storage will be needed. Delivery files in continual
use will need to be stored online, on servers. Master files are best stored
offline since they are less frequently accessed. Storage of the original
analogue objects or source texts is also important and links need to be made to
these.
·
Online storage – it is easy to allow storage
space to become cluttered with several versions of documents and other
unnecessary resources. It would be useful to have a plan which:
°
clarifies which resources need to be
accessible online, nearline and offline;
°
sets times for removing certain categories of
material from online storage;
°
sets times for reviewing online storage.
·
Offline storage must take into account the
problem of media degradation. However, despite their fragility as compared, for
example, with paper, most storage media will outlive the hardware and software
needed to use them. Over the last 30 years storage media have moved from punch
cards to DVDs via cassettes, floppy disks and CD-ROMs, but the technology to
retrieve data stored on the early media is difficult to find. Storage cannot
therefore be a once-for-all task but must be part of an ongoing regime. Points
to consider include:
°
environmental conditions – good environmental
conditions will increase the longevity of
digital storage media and help prevent damage to data. Large
fluctuations in temperature and humidity are generally thought to be more
damaging than constant levels, even when these are slightly less than ideal. BS 4783 Storage, transportation and maintenance of media for use in data
processing and information storage [7] contains guidance;
°
archival media – it is advisable to
select the best quality archival media affordable. A variety of digital storage
media is available, including CD-R, DVD-R, DAT (Digital Audio Tape) and DLT
(Digital Linear Tape). Digital images should be preserved on Write Once Read
Many (WORM) drives which enable the files to be viewed
frequently without being overwritten. TASI (Technical
Advisory Service for Images) gives advice on CD-R and DVD-R
[8]. The Bundesamt für Sicherheit in
der Informationstechnik (BSI)
gives advice on standards for archival media [9];
°
archival copies
– at least two archival copies should always be made and stored in different
locations. If possible multiple copies should be made on different storage
media. Copies made using different software, and/or comparable software
purchased from different suppliers, will help to protect data against
corruption from malfunctioning software or viruses;
°
media
refreshing – it is useful to have a plan for refreshing or transferring archive
copies to new media at specified times, e.g.
-
within the
minimum time specified by the supplier for the media's viability;
-
when new
storage devices are installed;
-
when a quality
control check discloses significant temporary or read "errors" in a data
resource;
°
quality control
– consider having a quality control procedure involving:
-
checking all
media periodically for readability;
-
using bit/byte
or other checksum comparisons with originals to ensure the authenticity and
integrity of items after media refreshing;
-
recording all
actions taken.
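The archival copy, media refreshing and quality control points above can be illustrated with a short sketch that copies a master file, compares checksums and records each action. SHA-256 is chosen here as one possible checksum; the guideline itself does not prescribe an algorithm. The function names, file names and log structure are hypothetical, and temporary files stand in for real archive media.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, destination: Path, log: list) -> bool:
    """Copy source to destination, compare checksums, and record the action."""
    shutil.copyfile(source, destination)
    matches = sha256_of(source) == sha256_of(destination)
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "source": str(source),
        "destination": str(destination),
        "checksum_ok": matches,
    })
    return matches


# Demonstration: two copies of one master file, each verified and logged.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master.tif"
    master.write_bytes(b"example image data")
    actions = []
    ok_a = refresh_copy(master, Path(tmp) / "copy_a.tif", actions)
    ok_b = refresh_copy(master, Path(tmp) / "copy_b.tif", actions)
    print(ok_a, ok_b, len(actions))
```

A matching checksum shows only that the copy is bit-for-bit identical to the file it was made from; it says nothing about whether that file was itself already damaged, which is why periodic readability checks remain necessary.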
In the analogue world, preservation is the term
used for activities which generally ensure the safekeeping and survival of
resources (careful handling, secure packaging during transport, controlled
environment etc.), and conservation is that particular aspect of those
activities which involve some kind of active intervention with the object, i.e.
repair or restoration. In the field of digital resources, the term preservation
is more often used, although active conservation measures might be needed at
times; some damaged media can be
repaired for example.
Disaster recovery procedures and risk management
It is
always sensible to have a plan in case of disasters. It should include such
considerations as:
·
creating archive copies of all important
digital resources as soon as they are acquired or created;
·
using standard storage media and formats;
·
storing archive copies on and off site - in
areas in danger of natural disasters such as flooding, off site copies should
be at a safe distance away;
·
ensuring that all staff are trained in what
to do in the event of a disaster;
·
having a risk management policy. ERPANET has
published a risk communication toolkit. [10]
The requirements of a file format for archiving are
broadly the same as for creation (see the guideline on Digitisation).
It is preferable:
·
to use an open
standard file format rather than a proprietary format to guard against
obsolescence;
·
to use a file
format that can support the embedding of metadata;
·
not to use any
compression that discards data - a lossless format such as TIFF (Tagged Image File Format) [11]
is preferable, but if there is real pressure on space, the PNG
(Portable Network Graphics) file format [12] can provide an alternative
lossless format with built-in compression.
The same general rules apply to the preservation of
audiovisual formats (see also the guideline on Multimedia
services), but digital broadcasting media present enormous challenges. Most
television is now produced in digital format. It is imperative to involve the
programme makers and journalists who
create the programmes in the preservation strategy. Audiovisual archives such as the
Netherlands Institute for Sound and Vision [13] are working on strategies
to preserve hundreds of thousands of hours of broadcast material in authentic
ways.
ERPANET suggests the following file formats for
preservation [14]:
·
Text documents:
plain ASCII [15], PDF [16], XML
[17],
TIFF [11];
·
Image
documents: TIFF [11],
JPEG2000
[18];
·
Audiovisual
documents: WAV [19], BWF [20], MPEG [21].
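As a rough illustration, the list above can be turned into a simple check that flags files whose format is not among those suggested. The mapping from formats to file extensions is an assumption made for this sketch (for example, BWF files often carry a .wav extension), and an extension is only a hint at the real format; a registry such as PRONOM would be the more reliable source.

```python
from pathlib import Path

# Formats come from the ERPANET list above; the extension mapping is
# an assumption for illustration, not part of the ERPANET guidance.
SUGGESTED_EXTENSIONS = {
    "text": {".txt", ".pdf", ".xml", ".tif", ".tiff"},
    "image": {".tif", ".tiff", ".jp2"},
    "audiovisual": {".wav", ".bwf", ".mpg", ".mpeg"},
}


def suggested_categories(filename: str) -> list:
    """Return the categories whose suggested formats match the extension."""
    extension = Path(filename).suffix.lower()
    return sorted(
        category
        for category, extensions in SUGGESTED_EXTENSIONS.items()
        if extension in extensions
    )


# Hypothetical file names used purely as sample input.
for name in ["minutes.PDF", "map_scan.tif", "interview.wav", "report.doc"]:
    matches = suggested_categories(name)
    print(name, "->", matches if matches else "not on the suggested list")
```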
Because of the number and variety of formats
available, information about them is being collected in file format registries
such as the UK National Archives’ PRONOM [22]. Local institutions may
need to seek advice from regional or national professional associations as to
which one is suitable for their requirements.
Media (see also storage)
As already mentioned, it is advisable to select
the best quality archival media affordable. A variety of digital storage media
is available, including CD-R, DVD-R, DAT (Digital Audio Tape) and DLT (Digital
Linear Tape). Digital images should be preserved on Write Once Read Many (WORM)
drives which enable the files to be viewed frequently without being
overwritten. TASI
(Technical Advisory Service for Images) gives advice [8].
Standards (see also the guideline on Digitisation)
It is advisable to adhere to open standards when
archiving digital resources. As these are not tied to specific
hardware/software they help to guard against the dangers of technological
obsolescence. There are standards for the different aspects of storing digital
information. Some examples include:
·
interoperability
standards - these allow communication between different systems. Examples
include ISO 23950:1998 Information and documentation -- Information
retrieval (Z39.50) - Application service definition and protocol specification
[23]
and the CIMI (Consortium for the
Computer Interchange of Museum Information) Profile: Z39.50 Application Profile
for Cultural Heritage Information [24];
·
resource
encoding standards (see also the guidelines on Resource
description and Discovery
and retrieval) - these define formats for different types of digital
information. Adherence to this type of standard allows data compatibility
across a wide range of systems.
Examples include standards for:
·
page
description formats e.g. PostScript [25],
Portable Document Format (PDF) [16];
·
graphics
formats e.g. Tagged Image File Format (TIFF) [11], Graphics Interchange Format (GIF) [26];
·
structured
information formats e.g. Standard Generalized Markup Language (SGML) [27],
Extensible Markup Language (XML) [17];
·
moving images
and audio formats e.g. WAV [19], Broadcast Wave Format (BWF) [20], MPEG [21];
·
resource
identification standards (see also the guidelines on Resource
description and Discovery
and retrieval) – these provide a way of uniquely identifying digital
resources in order to ensure long-term and reliable access to resources while
they are available over the Internet even when their location changes. URLs
(Uniform Resource Locators) can change. Examples of permanent identifiers
include URNs (Uniform Resource Names)
[28],
DOIs
(Digital Object Identifiers) [29], PURLs
(Persistent Uniform Resource Locators) [30], Handles [31] and ARKs (Archival Resource Keys) [32];
·
resource
description standards (see also the guideline on Discovery
and retrieval) – these can facilitate effective resource discovery.
Examples include AACR2 [33] and Dublin Core [34].
There is also a group of standards which relate to metadata syntax, such as MARC (Machine-Readable Cataloguing) [35]
and the EAD (Encoded Archival Description) [36].
The World Wide Web Consortium (W3C) [37]
is currently involved in developing the Resource Description Framework (RDF)
[38]
which will provide the infrastructure to support the coexistence of many
different metadata sets, or "schemas", of which the Dublin Core will
be one example;
·
data archiving
standards – these provide for the long-term preservation of and access to
digital information. The Open Archival Information System (OAIS) Reference Model (ISO 14721:2003 Space data and information transfer systems
-- Open archival information system --
Reference model) is an example [39];
·
records
management standards – these provide guidance on how to implement records
management strategies, procedures and practices. The main examples are ISO 15489 Information and documentation -- Records management [40]
and ISO/TS 23081-1:2004 Information and documentation - Records
management processes - Metadata for records -- Part 1: Principles [41].
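To illustrate the resource description and identification standards listed above, the following sketch builds a small XML record using element names from the Dublin Core element set together with a URN-style identifier. The wrapper element name, the function name and all sample values are hypothetical; only the Dublin Core namespace and element names come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> ET.Element:
    """Build a simple XML record from Dublin Core element names and values."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Sample values are invented for illustration only.
record = dublin_core_record({
    "title": "Town council minutes, March 2004",
    "creator": "Example Town Council",
    "date": "2004-03-15",
    "format": "application/pdf",
    "identifier": "urn:example:minutes-2004-03",
})
print(ET.tostring(record, encoding="unicode"))
```

Because the identifier is a name rather than a location, the record would not need to change if the file were later moved to a different server.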
It is also advisable
to adhere to standard formats and media. Simply using standard file formats and
standard media will go a long way towards ensuring the safety of a digital
collection. The Technical Advisory Service for Images (TASI)
[42]
and the Digital Preservation Coalition [43]
both give valuable advice on these topics; see also the guideline on Digitisation.
The Bundesamt für Sicherheit in der Informationstechnik (BSI) gives advice on standards for archival media [9].
Various projects are working on standards for
digital archiving including:
InterPARES [44],
Project Prism [45], DAVID [46]
and VERS [47].
In 2003 the International Internet Preservation Consortium [48]
was set up by the national libraries of
Web archiving and domain archiving
The ever-expanding size of the World Wide Web
and its dynamic and ephemeral nature pose special challenges for projects
aiming to capture, store and make it accessible for the long term. Several
countries, including
·
selective
archiving of static web resources, i.e. resources that neither change nor contain
interactive or dynamic elements are archived on a selective basis.