ISO Archiving Standards - Fourth US Workshop - Minutes
The Johns Hopkins University Applied Physics Laboratory
Laurel, MD 20707-6009, USA
July 10-11, 1996
(NOTE: We invite all participants to critique these minutes and
to offer updates on any significant points they feel are missing or
inadequately reflected.)
Results of French National Data Archiving Workshop
Don reviewed the results of the French National Data Archiving workshop.
- It was well attended, and considerable interest was shown in
the IEEE Reference Model for Open Storage Systems Interconnection
- They saw a common storage approach - i.e.,
- problem of changing media with time
- general interest to manage the physical medium independent
of data
- It was noted that the boundary between data and metadata is not
the same for everyone
- POSC has produced a model called EPICENTRE using EXPRESS
- There was some concern that the archiving Reference Model work
might be too ambitious
- No commitment was made regarding supporting the ISO workshops,
but there was interest in tracking the effort and having
another French workshop in the future
Comments on Huc paper "Toward a metadata model"
- Don reviewed this paper briefly explaining Claude's concept of
collections and describing his search process. He felt it
is an interesting paper.
- Mike asked where this paper is going. It is interesting but are
we going to integrate it as part of our work? If so, how?
- JohnR sees the Collection of objects as an object that one
could reference
- A generic way of looking at metadata, albeit not yet complete
- Needs to reference physical items which then could be used for
mapping.
- Concrete examples are needed
Comments on Reference Model paper, version 5
- General Issues/Comments
- Don noted that the CCSDS and ISO document styles are compatible.
He asked for views about the need for a companion green book (a la the
CCSDS convention), or report, to accompany the Reference Model and
provide more context. An alternative may be to provide this context
in one or more annexes.
- Does the current ToC provide a smooth flow for the reader and are
any major topics missed?
- Paul noted that when one tries to model something, there is no
"one answer"
- Don noted that the version 4 paper included more material on
three key information preservation concerns
- media degradation
- handling the variety of representation forms
- security
- This was cut back to make the version 5 paper more readable.
Perhaps some of this needs to be put back somewhere.
- Don asked for comments on the statements that 'full information
preservation is not a practical goal. The objective must be to
minimize information loss by formulating and executing effective
policies and procedures.'
- Highlighted issues for the paper include:
- Find common terminology
- Make document more readable; this is of primary importance
for a large customer community
- definitions too abstract, not reader friendly
- Top-down breakout should provide a more readable document
- Clearer Purpose and Scope
- Need a clear description of what is meant by an archive and
include figures
- Are we excluding a DBMS from Storage; can it support both
Storage and Data Management?
- Is storage just long term?
- First draft of the RM for wide public review is scheduled in about
four months. We hope to solidify issues at this meeting
- Mike asked about integrating some of the papers, such as
"Toward a metadata model"
- There was concern about how much discussion regarding metadata is
relevant to include.
- Section 1
- Revised Purpose and Scope
- Avoid the acronym AIS (Archival Information System)
- Paul felt something about "standards" should be added.
He is looking at this document as a guideline for users and
the document should be made more positive if this is to be a standard
- there should be some minimum basis of what it is
- If this is to be a standard, it should have a greater aura of
"mandatory-ness" about it.
- Need to add words to Purpose and Scope to do this.
However, issue of whether this should be a standard or guide is not
settled. Most reference models are not standards.
- There was discussion about digital versus non-digital data.
- The non-digital information only differs with regard to the storage
mechanism. It still has to be described. This should be made clear.
- Is everything in an archive related eventually to a physical item?
- Mike feels we should focus on the digital aspect of this
(like migration) and later address the physical aspect of archiving.
- Make clear where digital and non-digital archiving is separated
- Need to add a statement regarding minimum requirements for a data
archive under purpose and scope
- Add justification for Section 2.2 in Purpose and Scope
- 1.1 to be more clearly described and compared
- 1.1 text regarding non-digital - to be clearer
- Definitions
- It was agreed that under Definitions terms are to be added,
including definitions of Permanent Data and Persistent data
- JohnR provided several definition markups in his copy of the paper
- Need to define Media
- Add examples to definitions to make them clearer
- Action Item: Elise to write the definition section
- References
- Include more references as we address them
- Section 2
- Intent is to define "What is an archive."
- The original title included "Services" but that was seen as too
restrictive
- Sect 2.1 now addresses the traditional view, preserving government
records
- Sect 2.2 shows six aspects of the Definition
- We should include non-digital items, such as moon rocks
- Digital information cannot be opaque to the archive if full
information is to be preserved
- Is Information Object a useful concept? How can we define it
more clearly?
- Under section 2.2.2, how about maintaining a link to a data access
authority, with a possible need to transfer authority if the old
authority is dissolved? This involves access controls and access
guides.
- Add provenance under definitions
- Should have policy to detect and recover (when found) all IO
processing errors
- Significant issue to address in the model: What constitutes a
reasonable scheme to do this
- The model should not go into any level of detail, just alert the
data archiver of the need for such a policy
- May need to describe minimalist and maximalist levels of
conformance, if this is to be a standard
- Administration should decide the degree of implementation in a given
instance. If you are running an archive, you decide
- Randy was not sure whether some of the material in Section 2 could be
consolidated and reordered. Should 2.2.3 come first?
- He feels the Negotiation aspect should be brought out earlier
- It was noted that 'what information' in the archive needs to be
maintained is a periodically reviewable undertaking
- Randy suggested 1) Negotiate, 2) Accept and assume control,
3) Preserve and 4) Disseminate
- When talking about Preserve, need to keep policy and action together
- Randy feels that Functions map to the characteristics - there is a
parallelism
- Don feels there are (or should be) key characteristics that
identify an archive
- Don proposes to make it normative because he wants to move
toward criteria that distinguish an archive from other things.
We should think about talking about the minimum criteria,
some mandatory, some optional
- Lou disagreed; he feels a reference model is usually intended to be
informative. This needs to be resolved.
- Randy: 2.2.4 is a good statement and should be stated upfront
- Lou feels it is OK to define the territory, and asked if Sec 2.2.4
does this
- Lou feels we have this concept of designated consumers, which Don
showed is still in there.
- Lou, regarding 2.2.6: add terms regarding making material available
to the designated community. We should be careful not to make
all this applicable to a broader community.
- Back to section 2.2, Don's intent was to lay out territory, an ISO
archive. What are the key characteristics that define an
archive, the essential things?
- Lou: We are missing the Environment.
- To Don, the environment view is not the essentials of an archive
- Should we start with Section 2.2.4 and make this the "Archival Task"
- We start talking about IOs in too technical a way, too soon. Bring the IO
concept in at a different point
- Attack the key responsibilities
- Randy: Another aspect, most people have come at it from the
traditional point of view where they pick and choose what they
want to archive
- There are some differences between an archive system and the
traditional archive which should be/is described in the document
- Lou stated he got lost in this version; what Don calls
documentation and data objects, he would call something else
- What is the difference between archives and libraries? We're not
defining libraries, only 'archives'
- Re Section 2, Don did not just want to describe functions in several
places in the document
- IOs should be deleted here and placed at the end of this section
or in section 2.2.4
- Mike: key element is that there are no consistent standards which
explain what the problem is. It is critical to address this,
"Observational data versus Traditional data."
- It was noted that there was a bad reaction at international meeting
because of the words used
- Randy: we need to call out our specific problems without necessarily
defining them here. We don't yet have context diagrams and
other materials to help the reader. All this should be
straightforward at this point.
- "How to read this document" is Section 1.4 and perhaps should
indicate that it may be necessary to read the document more
than once.
- Regarding the Context Diagrams, it might benefit the reader, as
they tried to follow the model, if there were text which
talked about some type of services.
- Lou feels context diagrams are the best approach we have since they
identify interfaces
- Lou feels we should add his interface diagrams
- JohnG wanted to see more of these diagrams included
- Problem is how to try out all of this
- Mike: Environment view should move into Section 2; all of which
would make a nice high level and lead to a readable
presentation. He suggested merging Section 3.1 (Environment
View) into section 2.
- Lou uncomfortable with Assume Control. Randy had some changes
- 2.2.2: Need to keep older versions of documents; need statement
as to what is preserved
- Is copyright the only issue impacting archive ownership?
- How much control is needed to be able to effect long term
preservation?
This is seen as only extending to the archive copy
- Don is concerned about migration issues relative to ownership
- What is copyrighted, the medium or the bit stream?
- Copyright issues are beyond the scope of this document but this
consideration should be addressed at the Ingest negotiation time
- Should make statement that Control assumes certain things about
preservation and distribution
- We may not need full control but sufficient control to do certain
things, i.e. the taskings; we do this per responsibilities -
and things will shake out
- Lou would like to see a bunch of annexes for various procedures
- Environmental model should go into section 2
- Section 3
- Environmental View: there may be instances of archives exchanging
data with other archives; is this any different than acting
as a producer or consumer?
- JohnR provided following inputs:
- Section 3.2.2.1 Ingest
- Expected data products list
- Predict volume of expected data items
- Interface to Storage
- Validate that what is received is what was sent by
producer on data item level
- Log all input from producers
- Section 3.2.2.2 Storage
- Inventory control
- Interface to Ingest
- Store physical objects
- Interface to Access
- Section 3.2.2.3 Data Management
- Data management consists of, or is, Administration of
Ingest-Storage-Access-Dissemination; Data Management
sounds
- Section 3.2.2.5 Access
- Implements access controls; customer access must come
from administration management
- Handle subscription request schedules
- JohnR provided a slightly different view of the functional entities
- Maybe the term Data Management is not adequate or is mis-leading?
- Section 4
- JohnR also gave Lou some proposed changes in written form
- JohnR had some proposed changes to the Object model for Accession
Interface:
- Add "created by" line between Information Object and
Producer
- Add "Verification Object" box below Object Description
Record
- In Producer View, must define to the data item level
- On Ingest Interface diagram, change metadata object to something
that is more descriptive
- Internal relation is not "be contained in"; rather it is "described by"
- On Dissemination view, change Reporting to Repeating
- Paul wondered about readability of document. He felt Lou
should distill what he has learned by going through these
OMT diagrams.
- Action Item: Lou to add a page of commentary with each diagram
which shows his thinking process
- Is this level of detail appropriate for the document or does it
approach an archive design?
- In the processes themselves, we never talk about a request
session or the request environment. These need to be added.
- Section 4.3
- Need better examples in Conceptual view of data; Item
representation
- Try to tackle the complete description of the data's meaning (Paul)
- Media information set maps to the bottom of Don's model
- John: information set is a group of objects all of which have
been put on a media which forced it into a set which is
described by that media information set
- Driver is what is described on that media
- Regarding the model
- What are reasonable levels to which to document?
- Still wrestling with how we talk to people as to what they are
submitting
- Submitter must see that all these levels of description exist and
determine what/how he/she will submit things
- Need to be connected to information objects; this is still not done
- Candidate Data model
- Randy has two problems with the representation data model: it is too
abstract, and there are a lot of words in section 2 that talk about
specific things like degradation. Can we put them in some
formal way?
- Randy noted that when you lay out the entire problem, it looks
like a previous paper authored by Lou, Don, and Randy
- Start with the issue associated with the media's turning the
signal into a bit stream and aggregating them into a larger
data structure
- There are two different ways to look at data:
- As user sees it, and
- As computer sees it - a collection of files and records.
- Two things go in parallel
- We need to isolate the issues that we have:
________
|________|
MARS MAP |________|
+90 ___________ |________|
|__|__|__|__| _______|_______
|__|__|__|__| |____| |_____|
0 |__|__|__|__| /_____\ |____| |_____| FILE/RECORD
OBJECT |__|__|__|__| \ / |____| |_____|
|__|__|__|__| |____| |_____|
-90|__|__|__|__|
-180 +180 /\
||
||
PRIMITIVE Floating point, Integers, Characters
STRUCTURE
/\
||
||
BIT STREAM ...0 1 0 0 1 0 1 ...
/\
||
||
_ _ _
MEDIA ..._| |__| |_| |_ ...
/\
||
e.g., pits
- MEDIA
- archiving issue
- damage
- degradation
- obsolescence
- Guidelines
- data can be copied to the same or different medium
as long as the "original bit stream" is preserved
- Monitor/test media for degradation
- develop handling procedures to avoid media damage
- develop migrating policies for media upgrades to avoid obsolescence
- STREAM
- sublevels (only first required)
- original bit stream
- compressed (must be lossless)
- coded
- blocked
- Guidelines
- sufficient metadata should be available to
identify any non-media specific
compression, coding, blocking
(Lou asked whether it is the media's
responsibility to do coding. The goal is:
given anything, to be able to reconstitute the data)
- PRIMITIVE STRUCTURE
- archiving issues
- machine dependencies
- range and precision of numbers
- character sets
- guidelines
- provide metadata to identify machine dependencies,
the range and precision of numbers, and the character sets
used
- can copy/convert to another representation if
numeric precision and range are preserved and the
character set is not more restrictive
- FILE/RECORD
- sublevels
- volume set
- volume
- directory
- file
- segment
- record
- archiving issues
- operating system dependencies
- naming conventions/restrictions
- hierarchical organization
- guidelines
- can copy if organization and names are preserved
- provide metadata on file structure and naming
conventions
- OBJECT
- Archiving issues
- applications software dependencies
- Human interpretation
- One-to-one, Many-to-One, One-to-Many and
Many-to-Many mapping of objects
to file/record entities
- History
- Guidelines
- provide metadata to allow a human familiar with
the type of object to understand each instance
- provide metadata on history of the object
- provide metadata needed by applications software
- provide metadata on mapping of objects to
file/record entities
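Editorial illustration (not part of the workshop discussion): the MEDIA, STREAM and PRIMITIVE STRUCTURE guidelines above can each be read as a round-trip check. The following is a minimal Python sketch under stated assumptions. The function names are hypothetical, SHA-256 is assumed as the means of comparing bit streams, zlib stands in for any lossless compressor, and the struct format codes ">f" and ">d" stand in for two target number representations. None of these choices come from the minutes.

```python
import hashlib
import struct
import zlib


def bit_stream_digest(data: bytes) -> str:
    """Return a SHA-256 digest that identifies a bit stream (assumed method)."""
    return hashlib.sha256(data).hexdigest()


def copy_preserves_bit_stream(original: bytes, copy: bytes) -> bool:
    """MEDIA guideline: a copy to the same or another medium is acceptable
    only if the original bit stream is unchanged."""
    return bit_stream_digest(original) == bit_stream_digest(copy)


def compression_is_lossless(original: bytes) -> bool:
    """STREAM guideline: compression must round-trip to the original bits."""
    return zlib.decompress(zlib.compress(original)) == original


def conversion_preserves_value(value: float, target_format: str) -> bool:
    """PRIMITIVE STRUCTURE guideline: converting a number to another
    representation is acceptable only if range and precision survive."""
    try:
        packed = struct.pack(target_format, value)
    except (OverflowError, struct.error):
        return False  # value lies outside the range of the target format
    return struct.unpack(target_format, packed)[0] == value
```

For example, conversion_preserves_value(0.1, ">f") is False because 0.1 loses precision as a 32-bit float, while the same call with ">d" is True. An archive would record the digest and the representation metadata at ingest so that these checks can be repeated after each migration.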
- Annex C Archival Information Migration Strategies
- Mike feels we need some metric representations to back up what it
might cost to make some conversions
- Mike suggested one might demand that everything that comes in be
in ASCII form and then compress it with something
- There is a real dearth of archiving cost information
- Don noted that in the Commission's report (Item 8), it was stated
that one effort had found it cheaper to store data in hard copy
- Mike noted that Access time relates to usage; hard to evaluate
- Mike reported he is having a difficult time interpreting old
Magellan data and is having problems getting permission to
destroy old data.
- We need rules on Migration strategy as technology advances
- Inventory control: need a law to inventory archives periodically,
i.e., to read to the data item level and verify the data. Now we
only verify the headers and ship the data
- This would have to be done statistically and probably at night
- Mike: when you design a system, make back-ups even though
crashing is problematical
- Need to have a design to recover from crashes, like how to restart
- Also need a good classification system to avoid duplicating effort
- Scenarios
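Editorial illustration (not part of the workshop discussion): the inventory-control point raised under Annex C, reading to the data item level and verifying on a statistical basis, could be sketched as a random-sample check. This is a minimal Python sketch under stated assumptions. The function name sample_inventory, the use of SHA-256 digests recorded at ingest, and the read_item callback are all hypothetical and do not come from the minutes.

```python
import hashlib
import random


def sample_inventory(recorded, read_item, sample_size, seed=None):
    """Verify a random sample of data items against recorded digests.

    recorded:    dict mapping item id -> SHA-256 hex digest taken at ingest
    read_item:   callable that returns the current bytes for an item id
    sample_size: number of items to read in this pass
    Returns the sorted ids of sampled items whose content no longer matches.
    """
    rng = random.Random(seed)
    ids = sorted(recorded)
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    failures = []
    for item_id in chosen:
        digest = hashlib.sha256(read_item(item_id)).hexdigest()
        if digest != recorded[item_id]:
            failures.append(item_id)
    return sorted(failures)
```

A nightly job could call this with a modest sample_size, so that over time the whole holding is read at the data item level rather than only the headers.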
Organization of Reference Model document version 6
- Perhaps we should combine sections 3 and 4. This would lead to:
- Here is the functional model
- here is the data model
- here is the interface model
- Randy feels a critical issue is that the document must be written so that
readers can easily and thoroughly understand it or it will cripple our
efforts
- Lou suggested the following organization for Section 3 which was accepted
by the group.
REFERENCE MODEL OUTLINE AND WRITING ASSIGNMENTS:
Note: This organization assumes that the environment has
been discussed in Section 2.
1.0 Introduction Don
1.1 Purpose and Scope Don
1.7 Definitions Elise
2.0 Reference Model Concepts Don
2.1 Environmental Model Lou
2.6 Archive Responsibilities Don
3.0 Technical Model Lou
3.1 Functional Decomposition Lou
3.1.n Expand bullets Mike
3.2 Data Model Lou
3.2.1 Logical Model Lou
3.2.2 Fold in Claude's work at this point Lou/Don
3.2.3 Representation of Information Don/Randy
3.3 Interfaces among Functional Entities Lou
3.3.1 Service Calls (Context diagrams) Lou
3.3.2 Object Models (OMT diagrams) Lou
4.0 Scenarios
PDS Mike
Life Sciences Elise
BMDO John Rainey
NDADS Stephen
NARA Bruce Ambacher
5.0 Classification and metrics Lou
Annex C should address lessons learned
Section 3 Discussion Continued
- Action Item: Mike to expand the Archive Services Entities. He will make them
statements in lieu of bullets
Section 4 - Information Model (discussion continued)
- Randy feels this is too hard to understand
- Need to explain permanent and persistent
- Need to explain that permanent data objects have to go through all this
representation
- Question is how to introduce transient data
- Figure 4-1 to be dropped
- We are doing the model for different agencies
- Both Lou and Mike said, "Everyone outside of scientists is used to
starting with information objects and description records."
- A user-guide function is a repository for information about the system; a
data dictionary. (Some of this is now in 4.2 and 4.3)
- Randy had some comments, such as the document being too prescriptive in showing these
diagrams; the reorganization of the document seemed to ameliorate some of
Randy's concerns
- Lou to do a more detailed discussion to look like catalog guides
- This is an object view while Mike's is seen as a tool view
- What are we saying about fig 4.2 in the document?
- An object model for each of three things
- We expect you can model your data in this framework so that we can talk
about them in this context
- Don was concerned about representation methods and the relationship between
Object Descriptions and Representation Methods
- Don feels Z39.50 is coming out of libraries with a focus on data
access, not on preservation, which is the focus of this activity.
- Lou indicated there is more work required in this area
- Eliminate where list is stored; Data Store
- Replace description Record boxes
- Lou plans another cut on this
- Critical link: When we go to access, we must use their terminology
- Their model for associated description is helpful and useful
- Need to push the object models with some examples
- Scenarios may do the mapping for Figure 4-3
- Re the Ingest interface, Lou is concerned. He is not sure how to show the
Process interface at the borders
- At Producer interface, he liked having production agreements
- Take out pieces to demonstrate key concepts
- Action Item: Members were asked to send in comments regarding diagrams
- Section 4.3
- Use Randy's text as to what this means in the real world, keeping
Don's introduction
- Don: What he was driving at, but which never came out clearly, is
that a lot of representations exist and have numbers/names
assigned to them
- Don was asked to take his bit model and complete it for more
complex case
- Randy to take the cleaned up vugraph stuff and extend that to
incorporate more on complex information objects
- Section 5
- Lou feels Section 5 is incredibly important and will take a shot
at writing it.
- We have talked so much about how archives are different. We need
to tell people about statistical facts. (Need input from Mike)
- What are the classes of archive? All need to help
- Don feels Claude's work should be in Section 4
- Mike: We do not have a model for data store or data management
- Mike: It is important to have some of these Real Things in there.
- A data set is a collection of similar values, so one can determine
what are appropriate things to go with the data set, i.e. the
corrections, the processing algorithm.
- Have sets of data sets and extract some of each; have a collection
of granules with pointers to original objects
- There is a question if there is a special level of collections
that are to be processed with a special tool
- Data sets are homogeneous
- Want to find a bunch of axes and have people identify where they
are on the axes. They can then discover the type of archive
they are from their position on the axes.
- A lot of archive services are poorly funded activities
- EOS invented a classification of active archives where the ingest
process takes priority
- The scenarios will be box oriented
- Action Item: Lou is to take a shot at writing Section 5.
- We should talk to people about their storage facilities and see
if they qualify as a data archive
- Lou feels NARA probably has already done this
COMMISSION ON PRESERVATION AND ACCESS
- A separate issue on Special Digital Libraries Preservation Issue
was discussed. Don is to send the DA members information on it.
- Active participants are invited to participate in a paper, probably
summarizing the Reference Model. It is important to
get the "Word" out.
- A serious primary author must be named and Don expressed
willingness to take the lead
- Mike is willing to participate.
- It is necessary to find out the schedule for the abstract and the
paper.
Lou feels it is soon.
- In this general regard, Bob is to draft a letter for Don's
signature to Commission (address on Abstract) introducing
ourselves and indicating cognizance through comments that
we have read the report and wish to indicate our areas of
commonality and differences. We should send them a
copy of the reference model. We need to get ourselves
listed. We may want to ask the preservation commission
to do some things, now that the task force is gone.
All were asked to provide Bob with suggestions/comments for
this letter.
OTHER ITEMS
- After the meeting proper, Don posed the following definition:
"Submission Information Object - An aggregation of physical objects or an
aggregation of bits with associated documentation meeting archive
requirements and giving the aggregation's source, purpose, history and
sufficient representation information to understand the meaning of the
aggregation."
- Don felt that documentation and accessibility are the important things
- Lou did not agree with this definition
URL: http://ssdoo.gsfc.nasa.gov/nost/isoas/us04/minutes_long.html
A service of NOST at NSSDC.
Editor: Robert Stephens (stephens@us.net) +1.301.949.0965
and Don Sawyer (sawyer@ncf.gsfc.nasa.gov) +1.301.286.2748
Curator: John Garrett (garrett@ncf.gsfc.nasa.gov) +1.301.441.4169
Responsible Official: Code 633.2 / Don Sawyer (sawyer@ncf.gsfc.nasa.gov) +1.301.286.2748
Last Revised: September 17, 1996, Don Sawyer (January 30, 1997, John Garrett)