My Experience at the HathiTrust UnCamp

A word cloud showing the most frequently occurring words in a selection of dime novels from the HathiTrust corpus.

On September 10th and 11th, I attended the HathiTrust Research Center UnCamp held in Bloomington, Indiana. The UnCamp was a joint venture organized by Indiana University, my institution, and the University of Illinois Urbana-Champaign. All in all, the UnCamp spanned a day and a half of demonstrations and hands-on examples geared to orienting attendees toward new uses of HTRC data. As a graduate student, I was lucky enough to have my registration paid for by IU’s Data to Insight Center in exchange for volunteering throughout the UnCamp. For my post today I wanted to briefly share my experience.

As library students, many of you have probably heard of HathiTrust, which has been in the news practically since its inception. In case you’re not familiar, HathiTrust is an organization with the goal of creating a universal digital library. They are going about this by partnering with academic institutions around the United States (and increasingly the world), then digitizing and pooling the materials from these institutions to create a unified corpus that is accessible online. Last fall, HathiTrust and its partner institutions made the news when the Authors Guild and other parties sued them, alleging infringement of the copyright in their works. You can find a great account of the pending lawsuit, involved parties, and a review of HathiTrust in Emily Ford’s article on In the Library with the Lead Pipe.

The UnCamp consisted of a mix of speakers, demonstrations, and hands-on activities interspersed with generous snack breaks and socializing. Throughout the event, I was most interested in learning about “non-consumptive research,” a concept I had been hearing a lot about. This refers to a researcher’s ability to run algorithms against the HathiTrust corpus without having full access to the materials, which allows in-copyright materials to be used safely by researchers. Basically, you get the results you need without ever having to download the complete text. It truly is the best of both worlds: the researcher craving large datasets gets access, and the copyright holder keeps his or her rights.

The suite of computational text analysis tools that is part of the forthcoming HTRC, or HathiTrust Research Center, was demoed at the UnCamp but hasn’t yet been released; I’m not sure what that timeline looks like. However, you can see a brief overview of what types of analyses to expect here. Imagine doing a faceted search of the HathiTrust corpus, curating a collection of resources (perhaps works within a certain genre or by a particular author), then running an algorithm against it and getting a visualization back. That is exactly what I did with a collection of dime novels; the result is the word cloud at the top of this post. If word clouds aren’t your thing, there is a range of other visualization options. I thought it was thrilling, and I’m not even a digital humanist or researcher!
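For the curious, here is a rough sketch of the kind of word counting that sits behind a word cloud. To be clear, this is not the HTRC's code and it doesn't touch the HathiTrust corpus: the function name `top_words`, the tiny stop-word list, and the two sample sentences are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "his", "it"}


def top_words(texts, n=5):
    """Return the n most frequent non-stop-words across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and treat runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sentences standing in for a curated collection of dime novels.
sample_collection = [
    "The detective drew his revolver and the outlaw fled.",
    "An outlaw rode to the canyon; the detective followed.",
]

print(top_words(sample_collection, n=2))
```

A word cloud simply draws each word larger the higher its count. Notice that the function hands back only the counts, never the text itself, which is roughly the spirit of the non-consumptive idea described above.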

The UnCamp was an excellent way to get acquainted not only with the powerful tools being built, but also with the individuals pouring energy into this project. It is undoubtedly a complicated, massive undertaking without easy answers, but those involved are tackling the preservation, access, and copyright issues with vigor, and I have the utmost respect for them. It was a pleasant surprise to feel so welcome within the diverse group of attendees, who ranged from library technology administrators to developers, digital humanists, and graduate students. I am occasionally at conferences where a divide emerges between lowly grad students and everyone else, so experiencing the UnCamp and feeling like a legitimate participant showed me that HathiTrust respects the next generation that will ultimately take up the cause. Hey guys, that’s us! And I’ll tell you what: when I was at the UnCamp I could tell I was being exposed to an organization that is shaping the way we will handle digital information well into the future. It’s important stuff, and as library students we should pay attention.

Have you been keeping up with HathiTrust? Do you think it’s an important subject for librarians to be knowledgeable about?

4 replies

  1. I have been keeping up with HathiTrust. Knowing about HathiTrust and similar projects, like Google Books, is very important for any librarian or library student. Scanning titles that are not available everywhere opens up a new way to research, because there is instant access to material and no need to travel. Plus, OCR allows in-text searching, which saves time, and makes it possible to create and run programs that analyze the contents, such as in this example: http://mininghumanities.com/2010/04/28/tools-for-getting-a-sense-of-stuff-part-1-visualization/

