31 Digital Humanities and New Literature: Archiving and Digitisation

Prof. Kunal Chattopadhay

Conceptualizing humanities research in the context of new digital technology

Digital humanities, or humanities computing as it was previously called, has been around for a few decades now. However, definitions are still hard to pin down. Computers have been part of our lives for decades, and even in India they have been in widespread use since the late 1980s. During this time digital humanities has accumulated a strong professional apparatus that is probably more rooted in English than in any other departmental home. The obvious reason for that is the international dominance of the English language for general exchange, reinforced by the use of English as the standard language of the internet. The outlines of this professional apparatus are easily found. An annual international conference named Digital Humanities has been hosted for many years by the Alliance of Digital Humanities Organizations. Wiley-Blackwell has brought out A New Companion to Digital Humanities. The University of Illinois Press has a book series titled Topics in the Digital Humanities, and there are refereed journals for the field. In India, Jadavpur University launched a course on Digital Humanities under its Project Equal. Across the world there are over a hundred Digital Humanities centres and institutes.

Put most simply, digital humanities is a field of study, research, teaching, and invention concerned with the intersection of computing and the disciplines of the humanities. It is methodological by nature and interdisciplinary in scope. It has emerged not so much out of conscious decisions to create a new interdisciplinary field as out of the emergence of the digital condition. In our everyday tasks, we constantly find ourselves negotiating digital technology. Our present-day experience of the humanities, whether we are talking of literature, films or music, is profoundly shaped by digital technologies. It is not simply a matter of personal choice, not a question of whether we are more comfortable with printed books and magazines or with online texts, or whether we prefer going to a physical library or accessing JSTOR and Project Muse. To take just the example of the book: a book printed in 1960 in India was first handwritten or typed by the author, then composed by hand or on a linotype machine, and then printed and bound. Now, before the first hard copy is placed on the shelves, the book has in all likelihood been composed on a computer by the author or her paid compositor and converted from Word into the required software format by the publisher or printer, and in the overwhelming majority of cases the images, diagrams or charts have been digitally prepared or enhanced.

At its core, digital humanities is closer to a common methodological outlook than to a focus on one particular set of texts or one set of technologies. This preliminary definition could be refined quantitatively, using some of the tools developed by digital humanities itself: mining the proceedings of the annual Digital Humanities conference to develop lists of topic frequencies, collocate key terms, or visualize the papers’ citation networks. Alternatively, we can use qualitative methods, examining sets of projects from different digital humanities institutions or centres. But regardless of its precise location, its connection with literature has now been asserted and institutionalised for some time. At the 2009 MLA Convention in Philadelphia, Digital Humanities featured as a major subfield.
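The quantitative refinement mentioned above can be pictured with a minimal sketch in Python. It counts term frequencies and collocates (words that occur near a key term) in a handful of abstracts. The abstracts, the function names and the window size are all assumptions made for illustration; they are not taken from any actual conference proceedings or from any particular digital humanities tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(abstracts):
    """Count how often each token occurs across all abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    return counts


def collocates(abstracts, key_term, window=2):
    """Count tokens appearing within `window` words of `key_term`."""
    counts = Counter()
    for abstract in abstracts:
        tokens = tokenize(abstract)
        for i, token in enumerate(tokens):
            if token == key_term:
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts


# Hypothetical abstracts, invented purely for illustration.
abstracts = [
    "Text encoding and digital editions of early printed books",
    "Digital archives and text mining for literary history",
    "Visualising citation networks in digital humanities",
]

freqs = term_frequencies(abstracts)
near_digital = collocates(abstracts, "digital")
```

On a real corpus one would normally also remove very common function words before counting, since otherwise words like "and" dominate both lists.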

At this point, we need to advance a key proposition: Digital Humanities is not a field where the relationship between the humanities and technologies is defined once and for all and fixed absolutely.

Origins of the Name

According to John Unsworth, the name Digital Humanities came about through two events occurring close together in time. One was the plan for the publication of Blackwell’s original “Companion to Digital Humanities”, initially proposed as “A Companion to Humanities Computing”. By using “digital” rather than “digitized”, the editors sought to move the stress away from mere digitization. The other was the attempt to form an Alliance of Digital Humanities Organizations by merging the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing. [Svensson:Chapter 2] Finally, in 2008, US Federal funding was earmarked for “Digital Humanities”, putting a seal of official governmental approval on the term. At the 2009 MLA Convention, mentioned earlier, Digital Humanities was present in two ways: in a number of panels on different aspects of Digital Humanities, and in the extensive presence of Twitter. Twitter is often seen as a dumbing down, since one has a total of 140 characters. The limit has less to do with attention spans than with Twitter’s origins in the messaging protocols of mobile devices, but the format encourages brief, conversational posts (“tweets”) that also tend to contain a fair measure of flair and wit. Unlike Facebook, Twitter allows for asymmetrical relationships: you can “follow” someone (or they can follow you) without the relationship’s being reciprocated. Tweeting has rapidly become an integral part of the conference scene, with a subset of attendees on Twitter providing real-time running commentary through a common “tag” (#mla09, for example), which allows everyone who follows it to tune in to the conversation. This phenomenon has some very specific ramifications. Amanda French ran the numbers and concluded that nearly half (48%) of attendees at the Digital Humanities 2009 conference were tweeting the sessions. By contrast, only 3% of MLA convention attendees tweeted: according to French’s data, out of about 7,800 attendees at the MLA convention only 256 tweeted. Of these, the vast majority were people already associated with digital humanities through their existing networks of followers. Jennifer Howard, a veteran Chronicle of Higher Education journalist, noted the centrality of Twitter to the DH crowd and its impact on scholarly communication, going so far as to include people’s Twitter identities in her roundup of major stories from the convention.

Globalisation and Digital Humanities

Digital Humanities has also seen a rise due to the neoliberal economic turn and its impact on the public universities. As state funding declines, as tuition and other fees and educational costs rise, as endowments and private financial aid to public educational institutions also come down, and as full-time faculty are increasingly replaced by differentially waged staff (often part-timers, guest faculty, etc.), students, research scholars, and younger (and sometimes older) faculty have come together to use Digital Humanities as a conscious tool. Recording talks, presenting papers in absentia via digital instruments, using blogs for self-publication, and sharing work on sites like Academia.edu, where scholars can publish, exchange opinions and download others’ writings, have all taken great strides forward in the last decade.

So what is digital humanities and what is it doing in literature departments? The second part is easier to answer. First, after numerals, text is the easiest thing to manipulate. While film and photography are now extensively digitized, there is a long tradition of text-based data processing that was within the capabilities of even some of the earliest computer systems and that has for decades fed research in fields like stylistics, linguistics, and author attribution studies, all heavily associated with literature departments. Second, there is a long connection between computers and composition. Third, there was a convergence in the 1980s between a discussion on editorial methods and another on implementing electronic archives and editions. Fourth, around the same time, there began a discussion on hypertext and other forms of electronic literature. Fifth, the shift of literature departments to cultural studies made them open to computers as cultural artefacts. Thus, Stuart Hall and his collaborators created a Reader around the Sony Walkman. Finally, in recent times, there has been on the one hand a growth of electronic reading via Kindle, iPad and other devices, and on the other a vast expansion in text digitization, through both Project Gutenberg and Google Books. A scholar like Franco Moretti has taken up data mining and visualization for the distant reading of multiple books at the same time.

Digital Humanities and its archiving and digitisation of the new literature have also been connected to the new property regimes that came in the wake of the foundation of the WTO, and to the impact that the notion of intellectual property has had on scholars and their labour. On the one hand, the forms of academic publication (peer review, over-formalized citation styles) connected to a cycle of publication, permanent job, API score and promotion are seen by many scholars as bureaucratic institutionalization. On the other hand, these in turn help a set of privileged periodicals and publishing houses to control digitization in ways that many academics dislike. Almost every scholar has had the experience of being stopped from reading a paper she urgently needs or would simply like to read, because a paywall blocks the aspiring reader until she has paid a large sum in dollars, a barrier that becomes insurmountable for scholars of the Global South in particular. Digital Humanities is at this point a cry for an open scholarship and a pedagogy that are collaborative, depend on human networks, and are accessible online round the clock across the world.

Inclusions and Exclusions

It is necessary to ask, before we move on to archives themselves, what we are including in DH and what we are excluding. There is a problem with definitions that are too inclusive. But excessive rigidity also poses problems. When we talk about DH and the creation of archives, are we suggesting that this is little more than old words using new tools? Are we then talking about set-ups like the Folger Shakespeare Library and its digital resources, or the Bichitra project, the online Rabindranath Tagore variorum? To start with, even as we consider these, we should recognise that the tools actually transform the way we handle the texts. These are very different from the physical archives that scholars have been accustomed to in the past. As Derrida explains: “the technical structure of the archiving archive also determines the structure of the archivable content even in its very coming into existence and in its relationship to the future”.

But beyond that, others argue, digital tools enable the generation of literary forms that are not just the old words with new tools. For example, there is the rise of interactive fiction. It could be argued that nineteenth-century novelists writing serial novels in journals also had an interactive mode. Interactive Fiction, or IF, is somewhat different. It is software simulating environments where players use text commands to control characters and influence the environment. An interaction between this and traditional literature can be seen, for example, in Samit Basu’s GameWorld trilogy, a work which takes IF for granted and develops from there. The term can also be used to refer to digital versions of literary works that are not read in a linear fashion, known as gamebooks, where instead of one definite ending the reader is given choices at different points in the text; these decisions determine the flow and outcome of the story.

Stephen Ramsay, in a presentation at a 2011 MLA panel, raised the question of inclusions and exclusions in a different way. Does being in DH mean knowing how to code or not? Does it have to be about text? He answers, based on actual situations, that the response varies from university to university. No discipline can survive without actively engaging with disciplinary questions. Stressing that the discipline is about building things, he wanted to include people who theorize about building, people who design so that others might build, and those who supervise building. He set aside the issue of coding, since many people build without knowing programming, and included people seeking to rebuild the broken-down system of scholarly publishing. But he wanted to exclude all who were not engaged in building, saying that they could be game theorists, classicists with a blog, etc., but not DH practitioners.

Archiving

In the physical world, we make a distinction between libraries and archives. The library is a repository of books and periodicals, whereas the archives keep documents, photographs and ephemera. A more refined explanation is possible, but this will serve our purpose. When we think of digital archives, there can be something of a shift. Late in the twentieth century, the National Library, Calcutta, decided that its early books were becoming too damaged and that their digitisation was desirable. Thus, in the process of digitisation, the line between archives and libraries can become somewhat blurred. Some of the huge and well-known digital archives are indeed libraries, if we imagine that keeping books is what makes a library. One thinks, for example, of Project Gutenberg, which has a vast store of books and stories whose copyright has expired or which are not under copyright.

Digital archives are much favoured, as they seem to hold out a series of advantages. First, the physical paper might be damaged, and digitising appears to give the material greater longevity. Second, copying physical documents by whatever means (photocopy, photograph) eventually leads to a certain loss of quality. Digitisation makes it possible to copy, and re-copy, without any loss of quality.

Digitising a mass of material does not complete the process of creating the archive, however. How does an archivist look at a massive archive consisting of thousands of digital objects and create an attractive digital library or archive that is useful for non-specialists? There are different answers. Archives of pre-twentieth-century documents are more open, since copyright issues seldom bog them down. Niche archives serve one type of readership, while mass archives serve another. But very often, the term archive means something very different here. In the world of the archivist, the archives preserve original documents, often the single copy. Digital archives are often selected material taken from different physical repositories and arranged in order to support some definite scholarly goal. One could mention here the Rossetti Archive (http://www.rossettiarchive.org/).

It is possible that the domination of literature is a major factor in this shift. The studia humanitatis included the study of grammar, poetics, rhetoric, history, and moral philosophy. Turned into a modern discipline, however, Digital Humanities and its archives all too often focus on literary studies rather than philosophy, history or political science. Political Science and History in particular are disciplines that have made extensive use of archives proper, at least for the last two centuries. And archivists are definitely not expected to take part in selection activities, retaining some material and rejecting other material as part of the creation of the archives. In classical archives, preserving context is a vital part of retaining authenticity. As a result, the DH archives may make sense to literary scholars, but often not to historians, political scientists or other humanists.

The DH response would be to argue, first, that the creation of archives predates DH. Orature (sometimes misleadingly called Oral Literature) has to be preserved, and this involves selection, based on field work, rather than the archivists’ favourite idea that archives are “Materials created or received by a person, family, or organization, public or private, in the conduct of their affairs and preserved because of the enduring value contained in the information they contain or as evidence of the functions and responsibilities of their creator, especially those materials maintained using the principles of provenance, original order, and collective control” (Richard Pearce-Moses, “Archives” in A Glossary of Archives and Records Terminology (Chicago: Society of American Archivists, 2005), available at http://www.archivists.org/glossary/term_details.asp?DefinitionKey=156). Second, it might be argued that the “archives” created by digital humanists are themselves archives in that they represent the records of those people’s own professional activities. Finally, it could simply be stated that in the field of DH “archives” has a different meaning, just as realism has different meanings in literature and political science.

Beyond the definitional problem, which is not insurmountable, since each discipline uses terms in its own way, digital archives have other problems. The digital technology itself poses problems. Today, the average student uses computers more powerful than anything available even to cutting-edge workers twenty-five years ago. This means the stored material also requires upgradation. Secondly, while digitisation is easy, and virtual space is relatively cheaper to buy and expand than real space, it has its own problems. An archive is meant to last for a long time. Material in the West Bengal state archives goes back to the early days of the East India Company, for example. People working on early Indian communists can find material on them in the police archives from 1919 onwards. Keeping online archives for twenty years, to say nothing of a hundred or two hundred, is a difficult matter. Typically, it is hard to keep a website maintained for more than about five years at a stretch. Funding and other issues come up. Even the space can change.

Core challenges in terms of technology can be summed up under three heads:

Redundancy – computer data is stored on magnetic disks, and these eventually wear out. So repeated copying, and locating the copies at different sites, is essential. This is often not clear to the many practitioners of DH who come from a mainly literary background with relatively little computer knowledge.
Accessibility – The whole point of creating the archive is to ensure that scholars at a later date get access to the data. Typically, a scholar will search for certain types of information. Two aspects come in here. One is retrieval, which is dependent on how the data is coded and stored. The other is forward and backward compatibility. This is the issue mentioned earlier in talking about the long duration of archives.
Renewal and depreciation – hardware wears out. Domain names are not purchased, as we conceive of property in the physical world, but merely leased. Software also changes. Thus, records need to be made and stored in forms that permit easier renewal and the tackling of depreciation. For example, saving data in a form like MS Word, rather than in HTML or PDF, could make retrieval later on much more difficult. Open technology, rather than technology over which one concern holds complete control, is preferable. This is something that can be seen in Project Gutenberg, though that in turn has problems, since its metadata is too limited.
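The point about redundancy can be made concrete with a small sketch. A common way to check that copies held at different sites are still identical is to record a checksum when an item is first stored and to recompute it on every copy later. The Python below uses the standard library's SHA-256 hash; the file names and the helper names `sha256_of` and `damaged_copies` are hypothetical, and a real archive would keep its copies on separate machines rather than in one temporary folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def damaged_copies(expected_digest, copies):
    """Return the copies that are missing or no longer match the checksum."""
    bad = []
    for copy in copies:
        path = Path(copy)
        if not path.is_file() or sha256_of(path) != expected_digest:
            bad.append(path)
    return bad


# Demonstration with two hypothetical copies in a temporary folder.
with tempfile.TemporaryDirectory() as site:
    master = Path(site) / "master.txt"
    mirror = Path(site) / "mirror.txt"
    master.write_bytes(b"digitised page")
    mirror.write_bytes(b"digitised page")

    recorded = sha256_of(master)  # checksum noted when the item was stored
    intact_report = damaged_copies(recorded, [master, mirror])

    mirror.write_bytes(b"digitised pagf")  # simulate silent corruption
    corrupt_report = damaged_copies(recorded, [master, mirror])
```

Run periodically, a check of this kind tells the archivist which copy needs to be replaced from a good one before the damage spreads to every site.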

Metadata and Visualisation

To make retrieval easier, digital archives rely on metadata. Metadata is information that provides information about other data. There are three types of metadata. Descriptive metadata describes a resource for the purpose of identification: title, author, keywords, etc.

Structural metadata indicates how compound objects are put together, for example how pages come together to form chapters.

Administrative metadata gives data about when and how a resource was created, who can access it, and other technical information.
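The three kinds of metadata described above can be pictured as three parts of a single catalogue record. The following Python sketch is only an illustration: the class names, field names and sample values are assumptions invented here and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DescriptiveMetadata:
    """Identifies the item: what it is and who made it."""
    title: str
    author: str
    keywords: list = field(default_factory=list)


@dataclass
class StructuralMetadata:
    """Maps each chapter label to the ordered page images that form it."""
    chapters: dict = field(default_factory=dict)


@dataclass
class AdministrativeMetadata:
    """Records when and how the file was made and who may use it."""
    created: str
    file_format: str
    access: str


@dataclass
class ArchiveRecord:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata


def find_by_keyword(records, keyword):
    """Retrieve records whose descriptive keywords include `keyword`."""
    return [r for r in records if keyword in r.descriptive.keywords]


# A hypothetical record, invented for illustration.
record = ArchiveRecord(
    descriptive=DescriptiveMetadata(
        title="Sample Poems", author="Unknown", keywords=["poetry", "manuscript"]
    ),
    structural=StructuralMetadata(
        chapters={"Chapter 1": ["page001.tif", "page002.tif"]}
    ),
    administrative=AdministrativeMetadata(
        created="2015-01-01", file_format="TIFF", access="public"
    ),
)

as_json = json.dumps(asdict(record), indent=2)
matches = find_by_keyword([record], "poetry")
```

Writing the record out as JSON, an open text format, rather than in a proprietary one is in keeping with the earlier point about preferring open technology for long-term storage.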

People listen to music, post photos, locate videos, manage finances and connect with others, all by means of a variety of applications and online services such as YouTube, Instagram, email, MMS and Twitter. This content comes with metadata: information about the item’s creation, name, topic, features, and the like. Metadata is key to the functionality of the systems holding the content, enabling users to find items of interest, record essential information about them, and share that information with others.

We are all aware of Wikipedia, the last-minute saviour of the student. It is a crowd sourced free online encyclopedia. It uses and generates metadata. The Wikidata project is an open and collaboratively edited knowledge base that stores information about topics in structured forms that can be pulled into Wikipedia articles or other information systems. The DBpedia project by contrast mines metadata from Wikipedia infoboxes, categories, images, geospatial information, and links to generate an open resource of structured metadata that can be reused in countless ways.

Visual data exploration often allows faster exploration of data and can give better results than automatic data mining algorithms alone. Visual data mining (VDM) techniques are classified along three dimensions: the type of data to be visualized, the visualization technique, and the interaction and distortion technique. A large number of different visualization techniques exist, and their suitability depends on the type of data to be visualized. Visual data exploration is usually done in three steps, called “the Visual Exploration Paradigm”. The steps are: overview first, zoom and filter, and then details-on-demand. The user first needs an overview of the data. Secondly, the user may want to focus on interesting patterns. Finally, the user wants to examine and analyze the patterns, and therefore needs to drill down to look at details of the data. All this can be done visually using different techniques.
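The three steps of the paradigm can be sketched in a few lines of Python. The example is text-only and deliberately simple: the catalogue entries are invented, the function names are hypothetical, and a row of `#` signs stands in for a real chart.

```python
from collections import Counter

# Hypothetical catalogue entries, invented for illustration.
items = [
    {"id": 1, "title": "Letter A", "year": 1885, "genre": "letter"},
    {"id": 2, "title": "Poem B", "year": 1892, "genre": "poem"},
    {"id": 3, "title": "Poem C", "year": 1897, "genre": "poem"},
    {"id": 4, "title": "Essay D", "year": 1903, "genre": "essay"},
]


def overview(records):
    """Step 1, overview: count the items in each decade."""
    return Counter(r["year"] // 10 * 10 for r in records)


def zoom_and_filter(records, start, end, genre=None):
    """Step 2, zoom and filter: keep a year range and, optionally, one genre."""
    return [
        r for r in records
        if start <= r["year"] <= end and (genre is None or r["genre"] == genre)
    ]


def details(records, item_id):
    """Step 3, details-on-demand: return one full record, or None."""
    return next((r for r in records if r["id"] == item_id), None)


def text_bar_chart(counts):
    """Draw the overview as one row of '#' marks per decade."""
    rows = [f"{decade}s {'#' * n}" for decade, n in sorted(counts.items())]
    return "\n".join(rows)


chart = text_bar_chart(overview(items))
nineties_poems = zoom_and_filter(items, 1890, 1899, genre="poem")
one_item = details(items, 3)
```

In an actual visual interface the same three operations would be driven by the mouse: the overview is the initial chart, zooming and filtering are done by selecting a region, and details appear when an item is clicked.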

The role of data mining is to extract information from a database that the user did not already know about. The result is a set of models and patterns that describe useful relationships. There are many ways to represent a model graphically, so the visualizations used should be chosen to maximize their value for the viewer. To be able to do this, we need to understand the user’s needs and design the visualization accordingly.
