CEDARS: A multi-site UK project to
create exemplars in Digital Archiving
David Holdsworth / Leeds University
Link to DADs info in Leeds
Title
A sub-title: Some Tales from the Backwoods
This is both a pun and an indication that we have been quietly operating archival
activities for many years.
Preamble
CEDARS is a very new project, and has not yet formed
truly collective views.
This presentation focusses on experiences from building in-house
systems for digital preservation, and relates these both to the OAIS model
and to the needs of research libraries.
My personal perspective comes from decades of dealing with
data storage at the University of Leeds, where we have a history
of in-house systems which have been taken up elsewhere.
Participants
CURL, the Consortium of University Research Libraries, is the project owner.
There are three lead sites: Oxford, Cambridge and Leeds.
These are the sites with existing preservation systems in service.
- Oxford
use IBM's ADSM to provide an archive for the whole University.
began operation in 199x.
Strong interest in meta-data.
About 3 Tbytes of data.
- Cambridge
use EPOCH to provide an archive with an in-house user interface
began operation in 199x.
Holds data from earliest computation in Cambridge.
- Leeds
use an in-house system to provide an archive for the whole University.
began operation in 1992.
Current system is designed for indefinitely long life.
Holds data from previous systems back to 1980 (maybe 1970s).
About 1 Tbyte of data, in approx 2.7 million files
belonging to several thousand users.
Issues
The following are some issues which I perceive as important.
- Distributed stores v central
I personally strongly favour a distributed archive,
hence my preference for a global namespace (below).
- Keep the format - or convert
i.e. emulation v conversion
I like the OAIS's notion of being able to recover the original
data stream. An easy way to do that is to keep the original.
I think that we shall keep something very close to the original.
- Keep the media
examples for the technology,
but not for storing the information
I regard copying of the data as inevitable, due more
to the march of technology than to drop-out.
- Underpinning abstract form (e.g. byte-stream, bitfile, Virtual Storage Object)
personal preference for byte-stream (bit-stream?)
A distributed archive needs such an underlying form in order that
interchange takes place easily.
- How much meta-data is bound to the data?
enough to guarantee subsequent human readability (Representation Information)
plus something to understand the data's origins
I'd love some guidance on this.
It is probably a good idea to attach enough information such that
someone finding a storage volume in the distant future could decode it --
or do we explicitly want to prevent such discoveries?
- Is there any separate meta-data?
surely a good idea
My present personal model has the prospect of
multiple meta-data engines (special-purpose
search facilities) driven by meta-data of all sorts.
- Global namespace
We need a global namespace for digitally archived objects (cf. ISBN)
At present we have several to choose from, of which
digital object ID (DOI)
seems to be a front runner.
An early agreement on the format of such a name seems a good idea.
How about using Internet domain names in the same way as Java package names?
protects against relocation
We can distinguish the storage location from the information.
enables separation of data and meta-data.
enables multiple meta-data implementations
empowers federations of archival stores.
i.e. allows distribution in general.
- Indirection
Each system that we have built has had
more indirection than its predecessors.
We now have all indexing of files and owners by internal names
which are largely invisible.
This allows easy renaming of files and owners in indexing systems,
without implying the need to modify the stored data which is
probably in near-line storage at best and may be off-line.
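As a rough illustration of the global namespace idea above, here is a minimal sketch of names built from Internet domain names in the Java package style, with a separate table recording where each object currently lives. The function and class names, the "/" separator and the location strings are my own assumptions for illustration; they are not part of DOI or of any agreed scheme.

```python
def make_global_name(domain, local_id):
    """Build a location-independent name by reversing the domain labels,
    in the same way that Java package names do (leeds.ac.uk -> uk.ac.leeds).

    The '/' separator between the reversed domain and the local
    identifier is an assumption made for this sketch."""
    labels = [label.lower() for label in domain.strip(".").split(".") if label]
    if not labels:
        raise ValueError("domain must contain at least one label")
    return ".".join(reversed(labels)) + "/" + local_id


class Resolver:
    """Hypothetical lookup table from global names to storage locations.

    Keeping this mapping outside the name itself is what lets the data
    be relocated without changing the name that others refer to."""

    def __init__(self):
        self._locations = {}

    def register(self, name, location):
        self._locations[name] = location

    def relocate(self, name, new_location):
        if name not in self._locations:
            raise KeyError(name)
        self._locations[name] = new_location

    def resolve(self, name):
        return self._locations[name]
```

A name such as `make_global_name("leeds.ac.uk", "archive/000123")` stays the same when the object moves from one store to another; only the resolver entry changes.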
UK perspective
- Experience of George3 makes hierarchical systems the norm
The George3 operating system brought the concept of automatically
managed hierarchical filestore to the UK mainstream in the 1970s.
Personal perspective
- First implemented and used hierarchical storage in 1968
Our in-house Eldon2 system used automatic migration between tape and
disk to the extent that the total filestore was about 10 times the on-line
disk space (48 Mbytes).
- Migrated data:
| 1960s | Eldon2(in-house) | KDF9 (approx IBM7090) |
| 1970s | George3 | ICL1906A |
| 1980s | VM/CMS | Amdahl 470 |
| 1990s | LEEDS | EXB-120 / EXB-480 / ATL4/52; UNIX, NetWare3&4, NT? |
| 2000s | LEEDS | absorbing new technology |
With the LEEDS system we have an architecture designed for indefinite preservation.
- Indirection is vital in a hierarchical system
Our systems have used increasing indirection with each new system
- Versioning of data formats is important
We need to attach version numbers to the volumes and to the stored data objects.
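The kind of indirection meant here can be sketched as an index in which visible names point to internal identifiers, and only the internal identifiers point to stored data. This is a simplified illustration with hypothetical class and method names, not the structure of any of the systems listed above.

```python
import itertools


class ArchiveIndex:
    """Two-level index: visible name -> internal id -> storage volume.

    Renaming touches only the first level, so data that is near-line
    or off-line never has to be rewritten."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._name_to_id = {}    # visible name -> internal id
        self._id_to_volume = {}  # internal id -> volume label

    def add(self, visible_name, volume_label):
        internal_id = next(self._ids)
        self._name_to_id[visible_name] = internal_id
        self._id_to_volume[internal_id] = volume_label
        return internal_id

    def rename(self, old_name, new_name):
        if new_name in self._name_to_id:
            raise ValueError("name already in use: " + new_name)
        # Only the index entry moves; the internal id is unchanged.
        self._name_to_id[new_name] = self._name_to_id.pop(old_name)

    def internal_id(self, visible_name):
        return self._name_to_id[visible_name]

    def locate(self, visible_name):
        return self._id_to_volume[self._name_to_id[visible_name]]
```

After a rename, `locate` still returns the original volume label, which is the property that matters when the volume is a tape on a shelf.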
A Librarian's Perspective
From Peter Fox, Librarian at Cambridge University
What I as a librarian want is:
- Somewhere that I can be sure that a database can be stored securely and
retrieved at any time in the future by anyone who has approved access.
- In our discussions in Cambridge it has emerged that the file stores that
Leeds, Cambridge and Oxford have are just that - stores where the people that
have deposited the data can retrieve it.
- I fear that, in earlier
discussions, 'access' has meant different things to the librarians and
the computer people.
- What I as a librarian want is for there to be an
archive for academia which operates in perpetuity,
like the commercial ones do now.
- Anyone (subject to whatever security/financial arrangements were in place)
could have access to archived information.
- This might mean two
separate archives - one like the existing file stores as an 'archival
archive' and one behaving like the commercial suppliers' machines as an
'archive for access'.
I see Peter's last comment as indicating an archival store which would have
simple access control, but in front of that an access facility that knew all
the complex rights of access, and was allowed access by the archival store.
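My reading of the two-archive idea can be sketched as two layers: a back-end store that only checks whether the caller is a trusted front end, and a front end that holds the detailed rights. All names here are hypothetical, and the rights model (groups per object) is an assumption chosen only to keep the sketch short.

```python
class ArchivalStore:
    """Back-end store with deliberately simple access control:
    it serves any request from a client on its trusted list."""

    def __init__(self, trusted_clients):
        self._trusted = set(trusted_clients)
        self._objects = {}

    def put(self, object_id, data):
        self._objects[object_id] = bytes(data)

    def get(self, client_id, object_id):
        if client_id not in self._trusted:
            raise PermissionError("client not trusted by the archival store")
        return self._objects[object_id]


class AccessFacility:
    """Front end that knows the complex rights of access and is itself
    a trusted client of the archival store."""

    def __init__(self, client_id, store):
        self._client_id = client_id
        self._store = store
        self._rights = {}  # object id -> set of groups allowed to read

    def grant(self, object_id, group):
        self._rights.setdefault(object_id, set()).add(group)

    def fetch(self, user_groups, object_id):
        allowed = self._rights.get(object_id, set())
        if not allowed & set(user_groups):
            raise PermissionError("user has no right of access")
        return self._store.get(self._client_id, object_id)
```

The archival store never needs to learn about subscriptions or licences; those live entirely in the access facility.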
The Oxford System
- Uses IBM’s ADSM.
- back-up -- mirrors user’s file space
- archive -- users can copy files into the archive
- HSM -- a file system is coupled to ADSM with automatic migration
- Long-term future may be locked in to IBM, as ADSM is necessary for moving data
The Cambridge System
- An EPOCH HSM filestore
- is accessed via an in-house user interface
- Data transfer uses FTP
The LEEDS File Archive
LEEDS: Low-cost Ever-lasting Extensible Data Store
A network file archiver written at Leeds University
http://www.leeds.ac.uk/ucs/systems/archive/
- Operational in 1992
but it also contains data migrated from the VM/CMS system,
going back into 1980, and raw images of the filestore tapes
of the previous George3 system, which shut down in 1980.
- Based on IEEE-MSS v4
This standard (well-known to the audience) introduced the "bitfile"
concept, and this abstraction is central.
My understanding is that the "virtual storage object" of version 5 of the model
and the "Information Package" of OAIS are almost equivalent.
- diagram
This diagram (93K JPEG)
shows part of the network of systems
linked into the LEEDS archive.
The purple part is the actual archive system itself.
The rest of the diagram represents the client systems, with UNIX systems on
the left and NetWare systems on the right.
Not shown is the FTP link used to transfer the contents of PC hard disks to
the archive.
- We own the data format
i.e. data is always accessible
This guarantees us against loss of data owing to supplier bankruptcy
(we don't have Chapter 11 in the UK) or system obsolescence.
- Minimalist assumptions about the medium
i.e. write from start, fwd skip, read
no overwriting, no appending
This gives us easy portability, as proven by the move from helical scan
on Solaris to DLT on AIX.
- Copying is inevitable
no virtue in indestructible media
Even with an indestructible medium, the capability for
reading the medium will become obsolete.
Perhaps when we are storing at the quantum-mechanical level this will no
longer be true.
- User interface allows submission to and
recovery from the archive
There is also a facility for getting directory information
and also a (rarely used) facility for the deletion of unwanted files.
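The minimalist assumptions about the medium listed above (write from start, forward skip, read; no overwriting, no appending) can be modelled in a few lines. This is a sketch under my own assumptions: the class names are hypothetical, records are held in memory, and it is not the LEEDS volume format. It shows why copying to a new medium needs nothing beyond those operations.

```python
class VolumeReader:
    """Read-side view of a volume: forward skip and read only."""

    def __init__(self, records):
        self._records = records
        self._pos = 0

    def skip(self, count=1):
        if count < 0:
            raise ValueError("only forward skips are assumed")
        self._pos = min(self._pos + count, len(self._records))

    def read(self):
        """Return the next record, or None at the end of the volume."""
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        self._pos += 1
        return record


class SequentialVolume:
    """A medium that is written once, from the start, in a single pass."""

    def __init__(self):
        self._records = []
        self._written = False

    def write_from_start(self, records):
        if self._written:
            raise OSError("no overwriting or appending on a written volume")
        self._records = [bytes(r) for r in records]
        self._written = True

    def open_reader(self):
        return VolumeReader(self._records)


def copy_volume(source, target):
    """Copy one volume to another using only read and write-from-start."""
    reader = source.open_reader()
    records = []
    record = reader.read()
    while record is not None:
        records.append(record)
        record = reader.read()
    target.write_from_start(records)
```

Because `copy_volume` relies on nothing else, the same routine would serve for a move between any two media that meet these assumptions.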
Requirements
formulated by a nationally based working group of 47 UK universities in November 1990.
A university has a large floating population of computer users who own many
files.
It is in the nature of academia that there is much data whose value is not known.
The major points were:
- Files can be sent to the archive from any of the participating file systems on
the campus.
- Recovery of a file can be onto any system, not necessarily the originating
system.
- The retention of indexing information is done by the system.
- It should be easy for an end-user to rename files.
- The overheads per file must be very small, as many of the files are themselves
small.
- The system should be able to exploit new storage technology seamlessly.
- The system should cope with data for a shifting population of many thousands
of users.
- Data should be safe.
- There should be no reliance on operating system modifications.
The actual list of requirements and their rationale are available on the Web.
Overview of concepts in LEEDS system
I suspect that this is much too much for a slide, but the
concepts would be covered by talking.
Many of the properties which I plan to use in the CEDARS demonstrator
map well onto OAIS.
- Unit of storage is the BITFILE
No attempt is made to alter the contents of bitfiles.
They are treated as indivisible units, except that they contain
a header which gives some (but not much) meta-data information.
- A BITFILE has a name (cf. IP number) and an owner.
An owner concept is vital if access is to be permitted to
non-trusted systems.
This tends to be vital where several systems have access.
- Only the bitfile-ID is required to request information
from the archive. Name space of bitfile-IDs is flat.
- Owners are allocated numbers (another flat namespace).
- Bitfile names are not secret, but only the owner
or a permitted reader can actually receive the data.
- A bitfile is a sequence of bits (bytes actually).
The archive undertakes to deliver the byte-stream
intact on request.
- Against each owner is kept a list of bitfiles with
the full UNC pathname that the file had when it was
archived.
- The owner is also given a stub file, which
holds the bitfile-ID. Stub is often kept where the
original file lived, but can be kept anywhere, and can
also be regenerated from the directory in the archive.
- The owner of the file (or another authorised requester) has the
responsibility of remembering how to make sense of the data.
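The concepts in this list can be drawn together in a small sketch: a flat space of bitfile-IDs, numbered owners, a per-owner directory of original pathnames, stubs that hold only the bitfile-ID, and delivery of the byte-stream intact to the owner or a permitted reader. The class, the method names and the stub text format are hypothetical and chosen for illustration; the real LEEDS interfaces are not shown here.

```python
class BitfileStore:
    """Toy model of the bitfile concepts: flat ids, owners, stubs."""

    def __init__(self):
        self._next_id = 1
        self._bitfiles = {}   # bitfile id -> {"owner", "readers", "data"}
        self._directory = {}  # owner number -> {bitfile id: original path}

    @staticmethod
    def make_stub(bitfile_id):
        # Hypothetical stub format: just enough to name the bitfile.
        return "bitfile-id:%d" % bitfile_id

    def deposit(self, owner, path, data):
        bitfile_id = self._next_id
        self._next_id += 1
        self._bitfiles[bitfile_id] = {
            "owner": owner, "readers": set(), "data": bytes(data)}
        self._directory.setdefault(owner, {})[bitfile_id] = path
        return self.make_stub(bitfile_id)

    def permit(self, owner, bitfile_id, reader):
        if self._bitfiles[bitfile_id]["owner"] != owner:
            raise PermissionError("only the owner can permit readers")
        self._bitfiles[bitfile_id]["readers"].add(reader)

    def retrieve(self, requester, stub):
        bitfile_id = int(stub.split(":", 1)[1])
        entry = self._bitfiles[bitfile_id]
        if requester != entry["owner"] and requester not in entry["readers"]:
            raise PermissionError("not the owner or a permitted reader")
        return entry["data"]  # the byte-stream, delivered unaltered

    def regenerate_stub(self, owner, path):
        """Rebuild a lost stub from the owner's directory in the archive."""
        for bitfile_id, stored_path in self._directory.get(owner, {}).items():
            if stored_path == path:
                return self.make_stub(bitfile_id)
        raise KeyError(path)
```

Knowing a bitfile-ID is not enough to read the data: `retrieve` still checks the requester, which matches the point that names are not secret but access is controlled.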
Is LEEDS an OAIS?
Well, only partly. One can identify traces of
OAIS concepts such as
- Content Information (CI) -- clearly the body of the bitfile,
- Representation Information (RI) -- only identity of originating system, and
- Preservation Description Information (PDI) -- details of ownership and origin.
- The goal of indefinite preservation
What can we feed into the OAIS debate?
Theory
- We need indirection for digital objects -- we need a digital ISDON now
(International Standard Digital Object Number)
This needs to be the definitive way of referring to objects,
not an optional feature of the PDI.
For CEDARS we will probably plump for one of the current candidates,
but on a national/global scale we could use agreement.
An early agreement on the format of such a name seems a good idea.
How about using Internet domain names in the same way as Java does?
-- perhaps as a fall-back when no other ID is available.
- We need indirection for user identities (maybe use certificates)
If you do not have some such mechanism, only trusted systems
can be allowed to access the store. They must be trusted
by each other as well as by the archive.
- Is a "collector" a special kind of "producer" or a new element in
the OAIS environment (section 2.1)?
In the world of libraries the concept of a collector is
probably different from a producer.
The collector is not the same as the management.
- Include enough PDI with the objects on media to allow last-ditch recovery.
Things sometimes go wrong, and to have enough identity information
physically next to the data helps in recovery operations.
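To make the last-ditch recovery point concrete, here is a minimal sketch in which each stored object carries a small versioned header physically next to its data, so that an index can be rebuilt by reading a volume from the start. The marker bytes, the JSON header and its field names are assumptions made for this sketch; they do not describe the LEEDS header or any OAIS encoding.

```python
import json
import struct

MAGIC = b"BITF"       # hypothetical marker at the start of each object
FORMAT_VERSION = 1    # version number attached to every stored object


def pack_object(bitfile_id, owner, origin, payload):
    """Prefix the payload with a self-describing header."""
    header = json.dumps(
        {"version": FORMAT_VERSION, "bitfile_id": bitfile_id,
         "owner": owner, "origin": origin, "length": len(payload)},
        sort_keys=True,
    ).encode("ascii")
    # Layout: marker, 4-byte big-endian header length, header, payload.
    return MAGIC + struct.pack(">I", len(header)) + header + payload


def scan_volume(image):
    """Rebuild (header, payload) pairs from a raw volume image,
    using only the information stored alongside the data."""
    found = []
    offset = 0
    while offset < len(image):
        if image[offset:offset + 4] != MAGIC:
            raise ValueError("unrecognised data at offset %d" % offset)
        (header_len,) = struct.unpack(">I", image[offset + 4:offset + 8])
        start = offset + 8 + header_len
        header = json.loads(image[offset + 8:start].decode("ascii"))
        payload = image[start:start + header["length"]]
        found.append((header, payload))
        offset = start + header["length"]
    return found
```

If every index were lost, `scan_volume` would still recover who owned each object, where it came from and which format version it was written in.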
Practice
- CEDARS will demonstrate OAIS-style operation in a limited service environment
- It is addressing issues of both data storage and collection management
- If we cannot make the ideas work on a small scale, they certainly
won't work on a national/international scale.