Deep-freeze data that will last 1,000 years
Beneath the Arctic permafrost lies a unique effort to preserve open source software for a millennium
Longyearbyen, located on the west coast of the Norwegian island of Spitsbergen in the Svalbard archipelago, is cold. It always is: The annual average temperature there is -5˚C, but between January and March that average drops to -13˚C. In July, by contrast, you can expect an almost tropical 7˚C. It has a population of under 3,000 people, a single-screen cinema that plays films on Wednesdays and Sundays, and a school whose 270 students show up for their first day in August in mittens and hats. When school trips head into the mountains, teachers carry rifles to account for the risk of polar bears.
It’s also a place that, in a thousand years, might provide archaeologists with the most accurate clues as to what 21st century society was like. Inside an abandoned mine sits an underground vault housing a unique, long-term backup effort that so far includes artefacts as diverse as the United Nations Convention on the Rights of the Child, sample data acquired by the European Space Agency’s first Earth remote sensing satellite, a digitised copy of Edvard Munch’s The Scream and more than 20TB of data from open source repository GitHub. As humanity grows curious about its ancestors, or, as seems more likely in 2020, emerges blinking from its own preservative bunker to restart civilisation, the Arctic World Archive may be the best place to begin building an authoritative history of humans – or the perfect place to figure out how to start again.
Caught on film
On a Zoom call to Oslo, Katrine Loen Thomsen holds her business card up to her webcam. Thomsen is the deputy managing director of Norwegian company Piql (you get the joke), whose technology is behind the Arctic World Archive. Her business card is made of the same stuff that could one day offer insights into our ancient civilisation: piqlFilm is the same size as the 35mm film used in motion picture cameras and comes in reels almost a kilometre long.
Thomsen’s piqlFilm business card has all the usual information you’d expect, but it’s the light-grey frame at the bottom that’s the company’s prize. “This frame is a very high density QR code which holds a fair amount of information,” says Patricia Alfheim, Piql’s communication manager. “It’s called nanodensity.”
Don’t scoff: That single 35mm frame of film packs in 8.8 million data points, translating to 2MB of data per frame. “To have that much data stored, [the film] has to have a very, very clear base,” says Alfheim. That means using an ultra-clear emulsion, as noise accidentally recorded on the film could damage the data it holds. Each reel of film is 950m long and holds 120GB of data. Once a film is written, it’s packed safely into canisters and transported to Longyearbyen, where it’s stored in Mine Number 3, an abandoned coal mine repurposed for ultra-long-term data archival.
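To see how film can serve as a digital medium, here’s a minimal Python sketch that packs bytes into a 2D grid of dark and clear “pixels” and reads them back. This is only the general idea behind Piql’s high-density frames – the real format’s layout, density and error correction are Piql’s own and aren’t modelled here.

```python
# Sketch: bytes as a grid of black/white "pixels" (1 = dark, 0 = clear film).
# Illustrative only; this is not Piql's actual frame format.

def encode_frame(data: bytes, width: int) -> list[list[int]]:
    """Turn bytes into rows of bits, padding the last row with zeros."""
    bits = []
    for byte in data:
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
    while len(bits) % width:
        bits.append(0)
    return [bits[i:i + width] for i in range(0, len(bits), width)]

def decode_frame(grid: list[list[int]]) -> bytes:
    """Read the pixels back into bytes; trailing padding becomes null bytes."""
    bits = [b for row in grid for b in row]
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

message = b"Arctic World Archive"
grid = encode_frame(message, width=16)
assert decode_frame(grid) == message
```

At the stated density of 8.8 million data points per frame, the same principle scales from this toy grid to 2MB per 35mm frame; robust real-world formats also add synchronisation marks and error-correction codes, which this sketch omits.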
Once the door of the mine slams shut, the question becomes one of longevity. “Our film will last 750 years at least,” claims Alfheim. She’s quoting tests done by Norwegian research firm Norner, whose accelerated testing of piqlFilm found a longevity of 750 years at 21˚C and humidity of 50%, which means a reel of film stored today would be reaching its theoretical expiry date in autumn 2770.
However, the chemical reactions responsible for the decay of film stock slow as the temperature drops. “So, as it gets colder, that life expectancy increases dramatically,” says Alfheim. “The mine up in the Arctic World Archive is ideal for the film. It’s always cold.”
Or -4˚C year-round, to be precise, allowing Piql to make the eye-catching projection that a reel of film will last over a thousand years. This means future explorers could crack the vault open in 3020, although Piql is being cautious for now. “We say 750 years because it takes a really long time to test,” says Alfheim, “so 750 years is the mark we’ve reached. We’re still going.”
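The reasoning behind “colder lasts longer” is the Arrhenius relationship: the rate of a decay reaction falls exponentially as temperature drops. The sketch below shows the shape of that argument only – the activation energy is a placeholder I’ve assumed, not Norner’s measured figure, so treat the resulting multiplier as illustrative rather than a prediction.

```python
# Arrhenius sketch of why film decays more slowly in a cold mine.
# EA is an assumed, illustrative activation energy, not Norner's data.
import math

R = 8.314        # gas constant, J/(mol*K)
EA = 80_000      # assumed activation energy for film degradation, J/mol

def lifetime_multiplier(t_warm_c: float, t_cold_c: float) -> float:
    """How many times slower decay runs at t_cold_c than at t_warm_c."""
    t_warm = t_warm_c + 273.15
    t_cold = t_cold_c + 273.15
    return math.exp((EA / R) * (1 / t_cold - 1 / t_warm))

# Norner's 750-year figure was measured at 21°C; the vault sits at -4°C.
factor = lifetime_multiplier(21, -4)
print(f"Decay runs roughly {factor:.0f}x slower at -4°C than at 21°C")
```

The multiplier is extremely sensitive to the assumed activation energy, which is why Piql quotes only the tested 750-year floor rather than a precise cold-storage lifetime.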
The climate isn’t the only thing in Longyearbyen’s favour. Tactically unappealing in the event of a war, seismologically stable and, at around 115m above sea level, safe from climate change and rising seawater, there may not be anywhere on Earth more suitable for long-term storage. It’s no surprise that the Arctic World Archive is just a few hundred metres from the equally ambitious Svalbard Global Seed Vault, which holds more than a million seed samples in an effort to safeguard against the extinction of plants in the wild. It’s the perfect place to keep something safe for a long time. The question is: What do we put there?
Keeping code cool
“Our lives depend on open source software,” says Thomas Dohmke, vice president for strategic programs at GitHub, an unsurprising evangelist for open source. “Open source has won,” he says. “No human invention will happen without open source software, that’s our belief, and that’s why we believe it’s worth preserving open source software for a thousand years, in the same way mankind has preserved the Roman Forum, the Taj Mahal, the Bodleian Library. All those artefacts of human history tell us something about who we are and how we have developed.”
In July 2020, GitHub sent 180 reels of piqlFilm to Svalbard, which equates to more than 20TB of data. The reels comprise a snapshot of every active public repository on GitHub, from Bitcoin to Linux. The software contained in the archive can be found everywhere in modern life from smartphones to smart thermostats.
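The figures quoted in the article hang together on a quick back-of-the-envelope check, sketched here in Python using only numbers given above (2MB per frame, 120GB per reel, 180 reels):

```python
# Back-of-the-envelope check of the article's figures.
frame_capacity_mb = 2      # MB per 35mm frame
reel_capacity_gb = 120     # GB per 950m reel
reels_shipped = 180        # reels GitHub sent to Svalbard in July 2020

frames_per_reel = reel_capacity_gb * 1000 // frame_capacity_mb
total_tb = reels_shipped * reel_capacity_gb / 1000

print(f"{frames_per_reel} frames per reel, {total_tb} TB shipped")
# 60,000 frames per reel and 21.6 TB in total, consistent with "more than 20TB"
```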
That GitHub would want a backup is unsurprising, but piqlFilm is forever – there’s no updating a frame if a mistake is found or an improvement made. So why wouldn’t GitHub stick with a traditional backup? “I have two answers,” says Dohmke. “Our whole archive programme follows a pace layer approach, a concept from the Long Now Foundation.” That means backing up in layers, described by GitHub as hot, warm and cold, with each layer updated with decreasing frequency, from near real-time for the hot layer to every five years or more at the cold layer. No prizes for guessing which layer the Arctic World Archive is on.
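The pace-layer idea can be sketched as a simple data structure. Layer names and the five-year cold cadence come from the article; the hot and warm cadences and media below are my own illustrative assumptions, not GitHub’s actual configuration.

```python
# Sketch of a pace-layer backup schedule (Long Now-inspired, per GitHub).
# Hot/warm cadences and media are assumed for illustration.
from dataclasses import dataclass

@dataclass
class BackupLayer:
    name: str
    medium: str
    update_interval_days: float  # how often the layer is refreshed
    mutable: bool                # can a stored backup be amended later?

LAYERS = [
    BackupLayer("hot",  "live replicas (assumed)",        0.001,   True),
    BackupLayer("warm", "periodic snapshots (assumed)",   30,      True),
    # "every five years or more"; written film can never be amended.
    BackupLayer("cold", "piqlFilm, Arctic World Archive", 365 * 5, False),
]

def layer_for_recovery(age_days: float) -> BackupLayer:
    """Pick the fastest layer whose refresh cadence covers data of this age."""
    for layer in LAYERS:
        if age_days <= layer.update_interval_days:
            return layer
    return LAYERS[-1]  # anything older falls through to the cold layer
```

The key property the sketch captures is that each successive layer updates less often, so the cold layer is write-once by design rather than by accident.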
GitHub’s backup strategy allows it to survive catastrophes of varying magnitudes. A file is accidentally deleted? Access a live backup from the hot layer. Interested in what a project looked like last year? Software Heritage provides access to GitHub’s public repositories via a public API. Humanity all but wiped out in an enormous but increasingly plausible disaster? Simply dust off your trusty hoverboard and head to the Arctic World Archive.
That’s answer one: The Arctic World Archive is only a layer of GitHub’s backup strategy, rather than the go-to option when a data centre crashes. And answer two? “All of us have multiple backups of all our stuff,” says Dohmke. “We all went through losing a CD-ROM with wedding photos, or having this old hard drive that still had some MP3s on, and now you can no longer boot it because some cluster is damaged or something like that.” In other words, the problem with iterative backups is that the technology used to store them iterates – and deteriorates – as well.
“We believe that a thousand years is a significant enough time period [that] life will have changed … radically,” says Dohmke. “If I go back into my childhood – or even longer than that – software development went from stamp cards to my first C64 cassette. Probably you [will] still find some museum that has a C64 that can load that cassette, but it’s evolving really fast, so the media that we have today – it’s safe to assume that in a thousand years all this is gone.”
Migration-based backup – where backups are migrated to new storage media as the old ones decay or fall out of use – is not an approach Piql endorses. “Archives generally use a migration-based archival system,” says Piql’s Alfheim. “New formats come out all the time as old ones become obsolete – think of a floppy disk, for example.
“[However], when you migrate your data, something gets lost each time you migrate. If you look at stats on migration of information, it’s crazy risky, it’s very time consuming [and] very, very expensive. After a hundred times of migration you actually don’t know what you have.”
That makes Piql’s approach a double-edged sword – it’s hard to update, but the media the backup is stored on and the backup itself are totally stable.
Junjie Cao, head of IT at the National Museum of Norway, sounds a similar note. “One of the problems [is that] digital technology changes,” he says. The digital age was an unimaginable fantasy when Edvard Munch painted The Scream in 1893. The Scream currently lives at the museum, where like every other unique physical artefact in the world, it’s vulnerable to all manner of threats. The utility of backing up the painting is obvious, but the practicalities required a little more imagination. “It’s very important for us to have the right format,” says Cao. “The lifetime [of] a hard drive is quite short – five to ten years, [so] we are not always sure about the quality about the things we store there.”
The benefits of digitising and storing priceless physical artefacts long-term are clear. “If you have an object and it disappears, that’s the only thing you’ve got,” he says. “When we have all this data, and we have high [resolution] pictures, that information is as important as the painting, both for telling the world how it is and [to avoid] bringing everyone physically to Oslo to see it.”
The benefits of spending enormous time and energy preserving GitHub’s public repositories are arguably more opaque, but Dohmke disagrees. “We don’t expect somebody to go there in 50 years and restore some project, because we have those other layers,” he says. Instead, GitHub’s archives might provide our descendants with fascinating insights into how we lived and worked.
“Technology is moving so fast, so we think in a thousand years that people will rediscover how we lived, how we collaborated all around the world, how do we bridge cultures and time zones and politics in software development,” he says.
“If you go into any open source project on GitHub, it’s not just a couple of folks from Germany and the UK,” he adds. “It’s people all around the world speaking different languages, using their spare time, their weekends, sometimes using their professional time… to collaborate together in what we call the largest team sport on Earth.” That means that the benefit of a long-term archive isn’t limited to “just rebooting Linux in a thousand years… the value is more in how did we work together, how did we write the software, how did we collaborate?”
Just about the only thing Piql’s clients – and the company itself – are unwilling to divulge is the question of cost. Dohmke tells us the cost of archiving with Piql is “not significant in the business model of GitHub and Microsoft”. I politely suggest this could mean almost anything, but Piql insists straightforward pricing is unavailable because each project is tailored precisely to the needs of each client. “We do try to be very competitive,” says Piql’s Alfheim. And while she accepts that backing up with Piql isn’t cheap, “we think there is value in longevity, and that’s what we offer that no one else can”.
Reading the past
It is 3020. The newly opened door to the mysterious vault swings open. Our intrepid explorers shine their torches along the abandoned mine, walking a few hundred metres to the large fireproof container where, a thousand years ago, Norwegian photo archivist Vidar Ibenfeldt, on behalf of the National Museum of Norway, carefully placed a reel of film on a shelf and made a short speech about the materials to be preserved.
Carefully, the explorers remove their gloves and open the lid of the canister. A question presents itself…
Back to the future
If you have data stored on a reel of piqlFilm and you want it back today, the process is simple. “At the moment, you would just request it through Piql, and we restore it and make it available online for you,” says Alfheim. “Word documents, PDFs, video files, image files, whatever you’ve got.”
Banking on the continued existence of Piql – or even Western civilisation – in a thousand years seems decidedly optimistic when it comes to getting data off the reels. But piqlFilm is unique in the world of long-term data storage: unlike CDs, hard disks, reel-to-reel tape or solid-state storage, it’s human-readable.
If whoever opens the vault doesn’t have a QR reader to hand, each reel begins with a user guide that merely needs holding up to the light. “We start with a user guide, kind of like a manual to the archive, that explains what it is [and] how to use it,” says Dohmke. “We wrote this together with a panel of advisors, linguists, historians, archivists and people from libraries, for example the Library of Alexandria in Egypt… we had them advise us of how to write this so people can actually understand.”
The human-readable user guide is written in English, Hindi, Spanish, Arabic and Chinese, and is designed to give uninformed explorers of the Arctic World Archive a fighting chance at decoding their discovery.
Also available: The piqlReader. Looking like a prop from 2001: A Space Odyssey, the reader allows a reel of film to be loaded and read by the end user. Piql has a plan if no piqlReaders survive too. “Instructions on how to build a reader are on the film,” says Alfheim. “In the distant, distant future, if there are no readers available, you can manually extract [data]. It’s a slower, manual process but you can extract all the information with just a magnifying glass, a camera of some description and a computer that can read code.”
So future explorers might get a jump start on rebuilding their civilisation by restarting 21st century software? “Assuming a natural disaster happens that cuts off part of the world and all of a sudden those pieces of open source technology are no longer available… it would definitely be possible to restore the software to that version that we deposited,” says Dohmke. However, he adds: “We are optimistic. We think our future will be bright.”