Friday, April 18, 2008

MAC 2008 Cleaning Cobwebs: Studies in Archiving Web-Based Records

Presenters:
Rosemary Pleva Flynn, Energy and Environmental Research Center, University of North Dakota
Philip C. Bantin, Indiana University Bloomington
Mark J. Myers, Kentucky Department for Libraries and Archives

Internet Archiving Tools include:

Archive-It
Web-at-Risk: Collaboration between LOC, California Digital Library, New York University and University of North Carolina
Echo Depository Project (Web Archives Workbench, WAW): Collaboration between UIUC and LOC
Web Curator Tool (WCT): Collaboration between the National Library of New Zealand and the British Library

Some tools let you select specific content; others select whole domains or directories. You can set quotas and block hosts or domains. The best strategy is to identify series that have long-term value and harvest those: a bulk harvest of everything is too large and costly, and item-by-item capture is too slow.

How your institution uses its website will determine what to capture and how often. Static pages can be captured less frequently. Most crawlers can be set to capture on a specific schedule.
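
As an aside (this example is mine, not the presenters'), a capture schedule is conceptually just a set of URLs with different refresh intervals. Here is a minimal Python sketch, assuming a hypothetical harvest(url) function along the lines of the crawl sketch further down; the URLs and intervals are made up.

```python
# A minimal sketch of scheduled captures: static sections get a longer
# interval than frequently updated ones. All values are illustrative.
import time

SCHEDULE = {
    "https://www.example.gov/": 7,           # home page: weekly
    "https://www.example.gov/reports/": 90,  # mostly static reports: quarterly
}

def run_schedule(harvest):
    """Call harvest(url) whenever a URL's capture interval has elapsed."""
    last_run = {url: 0.0 for url in SCHEDULE}
    while True:
        now = time.time()
        for url, interval_days in SCHEDULE.items():
            if now - last_run[url] >= interval_days * 86400:
                harvest(url)
                last_run[url] = now
        time.sleep(3600)  # re-check hourly
```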

It is generally good to capture one link off the domain name, but this can get costly in terms of size, and crawls can be blocked if they grow too large.
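
To make "one link off the domain name" concrete, here is a minimal sketch of a depth-limited harvest in Python (not something shown in the session), using the requests and BeautifulSoup packages; the seed URL and depth limit are placeholder assumptions.

```python
# A minimal sketch of a depth-limited harvest: start from a seed page,
# follow links one level out, and keep each captured page in memory.
# Assumes the requests and beautifulsoup4 packages are installed.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

SEED = "https://www.example.gov/"  # hypothetical agency site
MAX_DEPTH = 1                      # capture the seed plus one link off it

def harvest(seed, max_depth=MAX_DEPTH):
    seen, pages = set(), {}
    queue = [(seed, 0)]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                  # skip unreachable pages
        pages[url] = resp.content     # store the captured page
        if depth < max_depth and "html" in resp.headers.get("Content-Type", ""):
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                queue.append((urljoin(url, a["href"]), depth + 1))
    return pages

if __name__ == "__main__":
    captured = harvest(SEED)
    print(f"Captured {len(captured)} pages")
```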

Items that aren't captured include:
  • JavaScript
  • Streaming video
  • Dynamic database content
  • Protected sites
  • Form-driven sites
The tool must document authenticity by capturing header information and other metadata. Most use a Dublin Core basis, which was designed more for describing items like photographs than for records. Archive-It does not capture metadata as well as WAW and WAS.
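
For a sense of what that metadata might look like (again, this example is mine rather than the presenters'), the sketch below records a few Dublin Core elements plus captured HTTP headers for a single page; every value is hypothetical.

```python
# A rough sketch of recording Dublin Core elements plus HTTP header
# information for a single captured page. All values are hypothetical.
import json
from datetime import datetime, timezone

capture_record = {
    # Dublin Core elements commonly used for describing captured pages
    "dc:title": "Agency Annual Report page",
    "dc:creator": "Example State Agency",
    "dc:date": datetime.now(timezone.utc).isoformat(),
    "dc:identifier": "https://www.example.gov/reports/2008/",
    "dc:format": "text/html",
    "dc:type": "Text",
    # HTTP response headers captured to help document authenticity
    "http:headers": {
        "Server": "Apache",
        "Last-Modified": "Fri, 18 Apr 2008 12:00:00 GMT",
        "Content-Type": "text/html; charset=UTF-8",
    },
}

print(json.dumps(capture_record, indent=2))
```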

The tool must be able to restrict access to copyrighted information. Again, Archive-It doesn't do this well; WCT and WAS have better authorization tools.

To be effective, the search engine must be able to search both metadata and full text.

For preservation purposes the final format must be non-proprietary. The Web ARChive (WARC) file format has been proposed as a standard for combining multiple digital resources and their related metadata into a single aggregate archival file. The Wayback Machine uses this type of file format.
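
As an illustration of the format (using the open-source warcio Python library, which was not part of the session), each captured resource lives in a WARC file as a record with its own headers; here is a minimal sketch of listing the responses in an assumed capture.warc.gz.

```python
# A minimal sketch of reading a WARC file with the warcio package
# (pip install warcio). The filename is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the harvested HTTP responses themselves
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, uri)
```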

Whatever service is used must be able to migrate the information it preserves.

Institutions have the option to build a service using existing tool sets or to join an existing service. Building your own is harder and takes more resources in terms of expertise and server space, but it is more flexible in meeting your needs.

Mark Myers of KDLA uses PDF as the primary file format; other formats are converted to PDF. Pages are captured based on the state retention schedule.

Harvesters available are:
  • Wayback Machine
  • Grab-a-Site
They did manual harvesting and consulted with other agency offices. Most harvesters have problems with things like drop-down menus. Just guessing, but a Flash menu would probably also cause major problems with harvesting those layers.

Other problems that arise include defining the boundaries of the harvested site, loss of functionality due to lost structure, and content management systems.

Rosemary and Mark are co-chairs of the ARMA Task Force on Website Management which will be developing guidelines and best practices for identifying and archiving websites.

The questions remain: "Are websites records?" and "To what extent?" In my experience there are a lot of records on websites that are permanent, but most of these are also printed. This will continue to change as more and more documents are born digital and never appear in print. And even when a print version exists, the electronic version is much more versatile because it is searchable.
