Archives of the future

Texas Advanced Computing Center helps the National Archives find solutions to the nation’s digital records deluge

April 11, 2011

How does an archivist understand the relationship among billions of documents or search for a single record in a sea of data? For the National Archives and Records Administration (NARA) — the government agency responsible for managing and preserving the nation’s historical records — the proliferation of all kinds of digital data, especially digital records, has made the task of the archivist exponentially more complex.

At the end of President George W. Bush’s administration in 2009, NARA received about 35 times as much data as it had received from the administration of President William Clinton, which in turn was many times that of the preceding administration. By 2014, NARA expects to accumulate more than 35 petabytes (quadrillions of bytes) of electronic records, ranging from Web sites to spreadsheets to AutoCAD-generated plans.

TACC graduate student Suyog Dutt Jain works at a computer in the Vislab.

“The National Archives is a unique national institution that responds to requirements for preservation, access and the continued use of government records,” said Robert Chadduck, acting director for the National Archives Center for Advanced Systems and Technologies.

To find innovative and scalable solutions for NARA’s large-scale electronic records collections, Chadduck turned to the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, drawing on the expertise of TACC’s digital archivist, Maria Esteva, and data analysis expert, Weijia Xu.

“For the government and the nation to effectively respond to all of the requirements that are associated with very large digital record collections, some candidate approaches and tools are needed, which are embodied in the class of cyberinfrastructure that is currently under development at TACC,” Chadduck said.

Just as a physical infrastructure made up of roads, power grids, and water systems supports modern society, a “cyberinfrastructure” of distributed computers, scientists, and information and communication technologies enables modern research.

After consulting with NARA about their research needs, members of TACC’s Data and Information Analysis group developed a multi-pronged approach to the problem that combines different data analysis methods into a visualization framework.

“Visualization is the process of transforming data into images that can be readily understood. We’re all familiar with desktop icons, representing folders and files,” said Maria Esteva, a digital archivist at TACC. “But imagine a screen clogged with millions of such icons with little clue as to what they mean. It takes a new visual representation to show patterns emerging from millions of files at a time.”

An archival visualization tool is applied to a collection of Web images from the U.S. Geological Survey. Each of the nested boxes represents a different national park. The researchers created an algorithm that parsed the HTML files of each image to extract 10 terms relating to topography features (“rivers,” “hills,” “valley”). A rainbow of colors represents a diverse topography. Visualizations courtesy of Suyog Jain (UT), Varun Jain (UT), Maria Esteva (TACC) and Weijia Xu (TACC)

Archivists spend a significant amount of time determining the organization, contents and characteristics of collections.

“This process involves a set of standard practices and years of experience from the archivist side,” said Weijia Xu of TACC. “To accomplish this task in large-scale digital collections, we’re developing technologies that combine computing power with the archivists’ domain expertise.”

The visualizations act as a bridge between the archivist and the data, interactively rendering information as shapes and colors to foster an understanding of the archive’s organization, contents, and preservation.

Human visual perception is a powerful information processing system. TACC researchers decided to take advantage of this innate skill by adapting the well-known Treemap visualization, traditionally used to represent distributions of information, so that it incorporates additional dimensions of information. File format types and their correlations, organizational criteria, and preservation risk levels are determined by data-driven analysis methods and then visualized using the Treemap format.
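To make the Treemap idea concrete, here is a minimal Python sketch of a single treemap level. It is not TACC's implementation: the function name `slice_layout`, the `Rect` type, and the sample format groups and risk labels are all hypothetical. Each group of files gets a rectangle whose area is proportional to its file count, and a second attribute (here a preservation risk label) rides along so a renderer could map it to color.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    label: str   # what the box represents, e.g. a file format
    x: float
    y: float
    w: float
    h: float
    risk: str    # extra dimension a renderer could show as color


def slice_layout(groups, x=0.0, y=0.0, w=1.0, h=1.0):
    """Split one rectangle into side-by-side boxes, one per group.

    groups: list of (label, file_count, risk) tuples (illustrative shape,
    not taken from the article). Box width is proportional to file_count.
    """
    total = sum(count for _, count, _ in groups)
    if total <= 0:
        raise ValueError("groups must contain at least one file")
    rects = []
    offset = x
    for label, count, risk in groups:
        width = w * count / total
        rects.append(Rect(label, offset, y, width, h, risk))
        offset += width
    return rects


# Made-up sample data: three file formats with assumed risk labels.
groups = [("pdf", 600, "low"), ("doc", 300, "medium"), ("wpd", 100, "high")]
rects = slice_layout(groups)
for r in rects:
    print(f"{r.label}: width={r.w:.2f}, risk={r.risk}")
```

Nesting follows by calling `slice_layout` again inside each box, typically alternating between splitting by width and by height, which is how a treemap shows a directory hierarchy.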

The renderings are tailored to suit the archivist’s need to compare and contrast different groups of electronic records on the fly. Once the results are presented visually, archivists can assess, validate or question the results and run other analyses.

One of the data analysis methods developed by the team combines string alignment and Natural Language Processing. This method helps archivists predict whether a group of records is organized by similar names, by date, by geographical location, in sequential order, or by a combination of any of those categories. Another analysis method computes paragraph-to-paragraph similarities to automatically discover “stories” from large collections of email messages. These stories may then become the points of entry to large collections that cannot be explored manually.
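The article does not say how the team measures paragraph-to-paragraph similarity, so the following Python sketch substitutes a common stand-in: cosine similarity over word counts, with paragraphs merged into the same "story" when their similarity crosses a threshold. The function names, the threshold of 0.5, and the sample messages are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts for one paragraph."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_stories(paragraphs, threshold=0.5):
    """Return one story label per paragraph.

    Paragraphs whose similarity is at least `threshold` (an assumed
    cutoff) end up with the same label, directly or through a chain.
    """
    vectors = [bag_of_words(p) for p in paragraphs]
    story_of = list(range(len(paragraphs)))  # each starts as its own story
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = story_of[j], story_of[i]
                story_of = [new if s == old else s for s in story_of]
    return story_of


# Made-up email paragraphs, not drawn from any real collection.
paragraphs = [
    "Budget meeting moved to Friday at noon",
    "The budget meeting is now Friday at noon",
    "Lunch menu: tacos and salad",
]
labels = group_stories(paragraphs)
print(labels)
```

The pairwise loop compares every paragraph with every other, so the cost grows quadratically with collection size. That is one reason a method like this benefits from being spread across many processors.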

The researchers distribute their data and computational tasks across many computing processors on TACC’s data analysis and visualization cluster, Longhorn, which was funded by the National Science Foundation. This accelerates computing tasks that would otherwise take much longer on standard workstations.

“TACC’s nationally recognized high-performance supercomputers constitute wonderful national investments,” said Chadduck. “The understanding of how such systems can be used effectively is at the core of our collaboration with TACC.”

Whether archivists will connect with the abstract data representations proposed by TACC remains a question. Like the Dewey Decimal System or domain naming on the Internet, the system must be intuitive and useful to stick.

“A fundamental aspect of our research involves determining if the representation and the data abstractions are meaningful to archivists,” said Esteva, “if they allow them to have a thorough understanding of the collection.”

Throughout the research process, the TACC team worked with archivists and librarians on The University of Texas at Austin campus and in the Austin community to better understand the end-user experience.

According to Jennifer Lee, a librarian at The University of Texas at Austin, “The research addresses many of the problems associated with comprehending the preservation complexities of large and varied digital collections. The ability to assess varied characteristics and to compare selected file attributes across a vast collection is a breakthrough.”

The White House highlighted the NARA/TACC project in its report to Congress as a national priority for the federal 2011 technology budget. The researchers presented their findings at the 6th International Digital Curation Conference and at the 2010 Joint Conference on Digital Libraries.

As data collections grow, new ways to display and interact with the data are becoming necessary. TACC is building a transformable multi-touch display system to enhance the interactive and collaborative aspects of archival analysis tools. The new system will allow multiple users to explore data concurrently while discussing its meaning.

“What constitutes research today at TACC will eventually be integrated into the cyberinfrastructure of the country, at which point it will become commonplace,” said Chadduck. “In that way, TACC is providing what I believe is a window on the archives of the future.”

For more information, contact: Aaron Dubrow, Texas Advanced Computing Center, 512 475 9439.
On the banner: TACC's Maria Esteva and Weijia Xu in the Vislab.

Banner photo and photo of Suyog Dutt Jain: Marsha Miller