AUSTIN, Texas — A National Endowment for the Humanities (NEH) grant will make many of the first books printed in the Americas available for the first time in digital full-text format, thanks to innovations in optical character recognition (OCR) technology.
The University of Texas at Austin is one of six recipients of a Digital Humanities Implementation Grant award from the NEH. The grant of $215,000 will fund “Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros,” a project to extend the capabilities of current open-source OCR technology for use in the transcription of 16th-century texts. LLILAS Benson Latin American Studies and Collections will administer the grant as part of its new Digital Scholarship program.
The tool developed under the project will be used to produce transcriptions of the digitized books in the Primeros Libros de las Américas collection, which currently includes over 330 copies of books printed in the Americas before 1601. Books in the collection include text in Spanish, Latin and several indigenous Latin American languages, including Nahuatl, once spoken by the Aztecs and still spoken by some 1.5 million people. UT Libraries and the Benson Latin American Collection are founding members of the international Primeros Libros consortium, which currently has over 20 member libraries from throughout the Americas and Europe.
The ability of scholars and students to work with historical texts in digital form has been limited by the challenges of transcribing early-modern books. These books contain variable typefaces, inconsistent typesetting and spelling, and multilingual text that conventional OCR software does not recognize. The goal of this project is to develop and implement groundbreaking methods for the automatic transcription of such books. This will help scholars shine a light on a period of historical transition from oral culture to the rise of literacy and the birth of the scientific method.
The two-year project, which begins Sept. 1, 2015, will be overseen by Sergio Romero, assistant professor at the Teresa Lozano Long Institute of Latin American Studies (LLILAS) and the Department of Spanish and Portuguese; and by Kent Norsworthy, LLILAS Benson digital scholarship coordinator. The project further develops a prototype of Ocular, a new OCR tool developed by Taylor Berg-Kirkpatrick at the University of California, Berkeley, and adapted for the Primeros Libros by UT Austin comparative literature doctoral student Hannah Alpert-Abrams and computer scientist Dan Garrette. The tool will be integrated into the Early Modern OCR Project by a team at Texas A&M University, which is a partner in the grant. UT Libraries will incorporate the transcriptions produced under the project into the existing Primeros Libros website.
“The NEH grant is exciting because it gives us an opportunity to conduct research and build tools with scholars from multiple disciplines and universities,” said Alpert-Abrams. “The ultimate goal is to produce a tool that will be useful for anyone interested in producing digital collections of historical documents, across regions and languages.”
Nahuatl scholar Kelly McDonough, assistant professor in the university’s Department of Spanish and Portuguese, sees great promise in this technology for the classroom and beyond. She says that as a result of the successful extension of OCR technology, “scholars and students will be able to rapidly search multiple corpora of multilingual texts — a task that is extraordinarily, often prohibitively, time-consuming without this technology.”
In her own work, which includes the study of female indigenous leaders in colonial Mexico, McDonough will be able to search for rarely used terms and “variants of terminology utilized by indigenous scribes over a long period of time and a large geographic area.” “In short, we will be able to ask questions of massive amounts of data that we simply couldn’t ask before,” she said.