BUILDING A DIGITAL LIBRARY "FOR THE AGES"

by Leslie Mertz

"If we do this right, we'll build for the ages, because we have a chance to get the data that will be gone in a century or two. We can preserve this for posterity."

With those words, WSU linguist Anthony Aristar summarized a new computational project that will help to preserve endangered languages through a digital library to be developed at the university. Farshad Fotouhi of Wayne State's Department of Computer Science is the principal investigator of the project, which has co-PIs (including Aristar) from five university units: the Department of Computer Science, the Linguistics Program, Computing & Information Technology, the School of Business Administration, and the Library and Information Science Program.

The project is particularly significant when viewed in the light of the tremendous loss of cultural and other information that accompanies the demise of even a single language, explained Aristar. "When you lose a language, you lose the knowledge about how that language was formed. That's not a small thing, because one of the best avenues we have into how the human brain works is by looking at how languages work," he said. "We also lose history in the sense that all these languages are the end result of many, many thousands of years of change and cultural experience. All the stories that were told through the language usually get lost as well. So we really lose the culture."

It has become an especially urgent problem, Aristar asserted. "At the moment, there are about 7,200 languages spoken on Earth, which seems like a substantial number, but we are losing languages at an extraordinary rate. We estimate that about 400 languages will be gone within 25 years alone."

They are disappearing because dominant languages like English, Spanish, Arabic, and others offer educational and economic advantages that minority languages don't, according to linguist and project member Geoff Nathan, who is also faculty liaison to Computing & Information Technology. "In many countries, people ... go to school, and, in increasing numbers, to university, where the language of instruction and term papers and such is either an international language such as English or the national language. They tend to forget their home languages, and, as they marry and have children, raise them with the major languages in which they function," he said. "While this may well improve everyone's well-being, the small languages that are being forgotten risk being lost forever."

Aristar agreed, noting, "There are so many languages dying at this point in history that our work is essentially an attempt to use modern technologies in a desperate salvage operation."

Spurred by the dire outlook, linguists, historians, and other scientists have begun scrambling to collect written, audio, visual, and other records of endangered languages. The data collection is welcome, but it has also created a problem: the data are in so many different formats that much of the material cannot be easily disseminated, interpreted, or compared, according to Fotouhi. One person may put the information in a database, another may place it in word-processing documents, and a field linguist may have a stack of cassette tapes or video on CDs filled with taped interviews, he said. "Our objective in building this digital library is not to bring all the data here to Wayne State, but to serve as a 'virtual warehouse.' Then, when a user submits a request for certain data, we can identify the appropriate sources, grab the data, and answer the user's query with information he or she can use."
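The "virtual warehouse" idea can be illustrated with a minimal sketch in Python. It is not the project's actual design: the catalog entries, source names, and placeholder phrases are invented for illustration. The assumption is that the library keeps only a description of each source and fetches the data itself when a request arrives.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    """Catalog entry: describes a remote collection without copying it."""
    name: str
    language: str
    kind: str                         # e.g. "database", "document", "audio"
    fetch: Callable[[], list[str]]    # called only when a query needs the data


# Hypothetical catalog; every name and phrase here is a placeholder.
CATALOG = [
    Source("db_a", "Language A", "database", lambda: ["phrase 1", "phrase 2"]),
    Source("docs_b", "Language B", "document", lambda: ["phrase 3"]),
    Source("tapes_a", "Language A", "audio", lambda: ["transcribed phrase 4"]),
]


def query(language: str) -> dict[str, list[str]]:
    """Identify the matching sources, fetch from each, and group results by source."""
    matches = [s for s in CATALOG if s.language == language]
    return {s.name: s.fetch() for s in matches}


results = query("Language A")
```

In this sketch the data stay wherever they already live; only the sources that match a request are contacted.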

While it is a massive undertaking, the digital library has a good head start through Aristar's previous E-MELD project and FIELD tool. The FIELD (Field Input Environment for Linguistic Data) tool is "a computational knowledge system that knows how languages work," Aristar said. "It allows you to salvage data and then find generalizations across languages that otherwise you would not find." The E-MELD project, which stands for Electronic Metastructure for Endangered Languages Data, is a database infrastructure that assists in saving endangered-languages data and sorting it into a common format. WSU worked with Eastern Michigan University to develop the FIELD tool and was part of a five-university consortium to build E-MELD.

With the digital-library project, Fotouhi hopes to build on E-MELD and FIELD to create a system that doesn't demand that data conform to set formats, but instead can translate the diversity of data and present it to a user as if it all came from one source. "Our goal is to do all the mapping needed to be able to match a user request to the data that's out there," he said.
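One way to picture that mapping is the short Python sketch below. It is an assumption-laden illustration, not the team's method: the three source layouts, the field names, and the sample words are all invented. Each source keeps its own format, and a small adapter per source translates its records into one common shape.

```python
import csv
import io

# Hypothetical common record layout: language, phrase, gloss.


def from_database_row(row: dict) -> dict:
    """Adapter for a source that stores records as database rows."""
    return {"language": row["lang_name"], "phrase": row["form"], "gloss": row["meaning"]}


def from_csv_text(text: str) -> list[dict]:
    """Adapter for a source that publishes a comma-separated word list."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"language": r["Language"], "phrase": r["Phrase"], "gloss": r["English"]}
        for r in reader
    ]


def from_notes_line(line: str) -> dict:
    """Adapter for typed field notes written as 'language | phrase | gloss'."""
    language, phrase, gloss = (part.strip() for part in line.split("|"))
    return {"language": language, "phrase": phrase, "gloss": gloss}


# Invented sample data in three different formats.
db_rows = [{"lang_name": "Language A", "form": "aba", "meaning": "water"}]
csv_text = "Language,Phrase,English\nLanguage B,oko,water\n"
notes = ["Language C | mi | water"]

# After mapping, every record looks as if it came from one source.
unified = (
    [from_database_row(r) for r in db_rows]
    + from_csv_text(csv_text)
    + [from_notes_line(n) for n in notes]
)


def lookup(gloss: str) -> list[tuple[str, str]]:
    """Return (language, phrase) pairs for one meaning across all sources."""
    return [(r["language"], r["phrase"]) for r in unified if r["gloss"] == gloss]
```

Calling lookup("water") on the sample data returns one entry per language, which is the kind of side-by-side view a comparison needs.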

For instance, a linguist might want to compare similar phrases in two languages. At the user's request, the digital library would locate Web pages or other sources that contain the information, interpret them well enough to pluck out the specific, relevant data, and deliver the information to the user in a format that permits comparisons. "That means that we need to be able to collect metainformation, which is data about the data, and also understand the taxonomies/ontologies that are being used to store these languages," said Fotouhi, noting that the latter is necessary to learn how each database, Web page, or other source has organized its information. For much of the work, the team relies on C&IT's gigabit network backbone to provide a fast and flawless connection to other universities and data resources. "We depend on it when we are working with people at Eastern Michigan University, particularly, because we need to bring their data here and process it quite fast," he noted.

One of the greatest challenges arises from audio and visual data, Fotouhi said. "We are dealing with a lot of data captured when linguists interview somebody in the field. We need to be able to take that audiovisual data, digitize it, parse it (or break it down into usable language data), collect it, and then store the metainformation in our database."

Another challenge will be to identify the sources of endangered-language information by manually tracking down Web-based data. "We might hear that Dr. XYZ in Arizona has some data. We'll go see what it is, identify its significance, try to parse it, and determine what we need to extract and put in our database," Fotouhi said. The researchers eventually hope to speed the process by creating intelligent agents. "These agents are software pieces that sniff around on the Web and find out where these sources are. So, some time in the future, this process is going to be automatic."
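The article gives no detail on how such agents would be built. As a rough illustration only, the Python sketch below follows links outward from a starting page and notes every page that mentions a keyword. The "Web" here is a small invented dictionary, so nothing is fetched over a network, and the page names and text are placeholders.

```python
from collections import deque

# Hypothetical stand-in for the Web: page address -> (text, outgoing links).
PAGES = {
    "start": ("Linguistics department home page", ["fieldwork", "sports"]),
    "fieldwork": ("Word list for an endangered language", ["archive"]),
    "sports": ("Campus sports schedule", []),
    "archive": ("Recordings of an endangered language", ["start"]),
}


def find_sources(start: str, keyword: str) -> list[str]:
    """Breadth-first walk over the links, collecting pages that mention the keyword."""
    seen = {start}            # pages already queued, so loops are not revisited
    queue = deque([start])
    found = []
    while queue:
        page = queue.popleft()
        text, links = PAGES[page]
        if keyword in text.lower():
            found.append(page)
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return found


sources = find_sources("start", "endangered language")
```

A real agent would also have to judge the significance of what it finds, which is the part the researchers currently do by hand.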

Although the project is just beginning, Aristar said the work on the digital library will play a central part in a planned certificate program in language engineering, one of the few such programs in the nation.

Work on the digital library officially kicked off on September 1, 2003, in conjunction with a two-year, $200,000 grant from the WSU Research Enhancement Program. To reach the team's goal, the project will ultimately require a few more years and considerable external funding, but the team of researchers is ready. Commented Nathan, "(I)t's exciting to be able to contribute at least to some extent in preserving and making available this valuable and potentially irreplaceable knowledge."

WSU's Digital Library Investigating Team

Department of Computer Science, College of Science
Farshad Fotouhi
Ming Dong
Robert Reynolds
Shiyong Lu

Linguistics Program, Colleges of Liberal Arts and Science
Anthony Aristar
Martha Ratliff
Geoff Nathan, who is also faculty liaison to Computing & Information Technology

Information Systems and Manufacturing, School of Business Administration
Joseph Tan

Library and Information Science Program, University Libraries
Ronald Powell

Published by Wayne State University Computing & Information Technology
in the INFORMATION TECHNOLOGY @ wayne.edu Newsletter, Fall Term 2003
© 2004 copyright