Why Researchers Shouldn’t Share All Their Data

April 08, 2018

In 1926, the year after Gertrude Stein published her 1,000-page book The Making of Americans, she felt the need to explain, in lectures at the Universities of Cambridge and Oxford, why it was so long. Her aim, she said, was to depict a tense she described as the "continuous present." Her method for doing so, and the culprit behind her verbosity, was "using everything."

The thought of "using everything" should send a chill down the spine of any researcher. Producing publishable results from an investigation typically requires managing far more material than can fit into the publication format. A secret realm of dark data resides in the notebooks and hard drives of the data-gatherers; they judge that data to be excess, but who knows? What if it is not? In this excess, in this "everything," surely there are the ingredients of unrealized cures and upheavals. And in this excess, every researcher knows, are parts of our process we would rather not share. There, we are vulnerable.

During the years when I worked primarily as a reporter, this excess haunted me. I have hours-long interview transcripts from which only a few words, if that, appeared in an article — not because those words were the only ones of value, but because of the needs and constraints of that particular article.

Eventually I started to collect my reporting notes into public notebooks, including one that is the basis of my next book and of this article. Doing so has become an easy way to share what I gather with people who want more than what the published work can hold. It has also inclined me to take better notes, and to notice more threads of connection among disparate projects. But I have also found myself holding back. I hide my detailed reading notes behind a password. To protect my sources, interview recordings and transcripts remain offline altogether. Field notes stay in paper notebooks.

Predictably, the hard sciences have charged ahead on this curve. Far-flung research teams frequently collaborate in examining common data sets. Some government grants come with the requirement of publishing open data as well. The resulting demand has spurred open-notebook software like Jupyter and Observable, along with data repositories like Zenodo. Researchers frequently post their own code on platforms like GitHub or GitLab. These are based on Git, a tool designed for large groups collaborating on open-source software. Among other features, Git keeps a meticulous record of a given project’s version history. It remembers every change and every bug. Likewise, open-source software communities tend to regard maximal transparency as an intrinsic good.

Some humanists have followed suit. The Rice University historian W. Caleb McDaniel, for instance, has developed a system that feeds his research notes into a public wiki, thanks to a mix of open-source tools and scripts he had to code for himself. Scholars across many fields share their bibliographies online using tools like Zotero, which was developed through an academic collaboration. Hypothesis, a nonprofit platform, enables users to make, collect, and share annotations on nearly any website. Requiring my students to use it, I’ve found, is a handy way of checking that they’re doing their reading assignments and getting them to debate their interpretations.

Among journalists, there has been talk at times of "open journalism" as a new paradigm for reportage that extends beyond just the polished report. In 2011, as an executive in residence at the University of Southern California, the former Sacramento Bee editor Melanie Sill published a report called "The Case for Open Journalism Now: A New Framework for Informing Communities." Yet her call has not been widely answered, and it remains the definitive work on the subject. As editor in chief of the British daily The Guardian, Alan Rusbridger adopted open journalism as his strategy for the newspaper, but he left the job in 2015. Organizations such as BuzzFeed and ProPublica, at least, publish code and data sets on GitHub.

The emerging opportunities for self-exposure extend from research to the writing process. Kathleen Fitzpatrick, now a professor of English and director of digital humanities at Michigan State University, undertook a widely publicized "open review" process for her 2011 book, Planned Obsolescence: Publishing, Technology, and the Future of the Academy. She waited until she had finished a full draft, but one need not do so. I version-tracked the entire drafting of my latest book in Git, which means nearly my whole process of writing and revision could become immediately public if I simply pushed it to GitHub.

I don’t think I will do that.

The "blockchain" technology underlying Bitcoin, which makes possible secure databases with no centralized authority, could open the doors of transparency still further. Every Bitcoin transaction is recorded in the open, and the same mechanisms could record acts of scholarly research, writing, and certification. Natalie Smolenski, an anthropologist who works for the blockchain start-up Learning Machine, wants to use such tools to transform how we register academic achievements. Yet in her paper "Academic Credentials in an Era of Digital Decentralization," Smolenski reserves some of her most arresting words for transparency.

"Transparency," she writes, "is socially pornographic and facilitates violence." It can mean revealing data about ourselves without the context we might otherwise provide. It can objectify the researcher and the process, inviting viewers to feel a false sense of intimacy, of inside knowledge.

This is a sentiment I’ve sometimes come across as a minority opinion in hacker communities I’ve studied. It’s expressed most often by participants representing vulnerable identity groups, people for whom more self-exposure can mean more vulnerability. In the academy, I’ve heard it from those on "watch lists," whose every move is scrutinized for political reasons, in search of what might be construed as a misstep. Graduate students are often taught to be careful what they publish, for fear of being pigeonholed too early. Too much self-exposure might compromise a career. It might also muddle one’s message.

"Meaning is not transparent," Smolenski told me in an email; rather, she stresses that meaningful communication happens through context and time. She contrasts the exposure of radical transparency to what the more careful, intentional cultivation of intimacy allows: "provisionality." Without the requirement of transparency, one can try on ideas, see how they look and work, then take them off.

Feminist techies, while sympathetic to calls for open-sourcing everything, have also recoiled at the most extreme demands to be transparent. As Ellen Marie Dash, a software developer, wrote in the magazine Model View Culture, for those accustomed to harassment online, the call for openness feels like a call to invite more harassment. "The only way to handle this sort of problem properly," Dash contends, "is by explicitly placing consent and safety over openness and transparency."

Dash also questions whether dumping vast amounts of information online counts as transparency in the first place: "What you wind up with is a company that produces so much unorganized, uninteresting and irrelevant data that you can’t find meaningful information."

It’s the old paradox of Jorge Luis Borges’s Library of Babel, which contains such multitudes that little of use can be found. And this is the trouble with reading Gertrude Stein, as soon as you’re ready to leave her bewildering "continuous present." The tools that afford us new opportunities for openness and collaboration also come at the risk of obfuscation and danger.

Nathan Schneider is an assistant professor of media studies at the University of Colorado at Boulder. His forthcoming book, Everything for Everyone: The Radical Tradition That Is Shaping the Next Economy, will be published by Nation Books.