Protecting Science: TIB Builds Dark Archive for ArXiv


Research and science are international, which is why we speak of international scientific communities. A service such as arXiv may be operated by a US-based institution, Cornell University, but it is used by researchers worldwide, as the submission statistics impressively show. Moreover, since the introduction of arXiv Membership in 2010, the funding of arXiv has been partially internationalised. TIB funds the German contribution, together with the Helmholtz Association of German Research Centres (HGF) and the Max Planck Society (MPG).

What is arXiv?

The platform arXiv.org is a freely accessible online archive for scientific preprints, i.e. publications of scientific works that have not yet (fully) been peer-reviewed. The arXiv preprint service is of great importance as a source of information for physics, mathematics, computer science, and neighbouring subjects. Via arXiv, researchers can access the latest research results even before their actual publication in a quality-assured scientific journal. Since its founding in 1991 as the first online preprint service, arXiv has served as a model for the development of preprint services in other subjects (cf. Rzayeva et al. 2025, https://doi.org/10.31235/osf.io/xdwc4_v2).

So when the Trump administration makes decisions that have disastrous consequences for science and research in the US, the repercussions reach far beyond the Gulf of Mexico. In recent days, reports have been mounting in German media that researchers fear not only the loss of data, but also the loss of established information portals such as PubMed.

Research data under threat

Initiatives such as “Safeguarding Research and Culture” are scrambling to save threatened research data and websites for scientific communities and for posterity. Contents under threat range from the social sciences (e.g. research on LGBTQIA+ topics) and medicine (e.g. vaccines) to the natural sciences (e.g. climate research).

While it is research linked to political debates that is subject to the most blatant and egregious reprisals, in principle all research can be threatened by “cost cutting” and restructuring measures. One example is the planned shutdown of the renowned 120-year-old atomic spectroscopy group at the National Institute of Standards and Technology (NIST).

Decentralised scientific infrastructures

Unfortunately, a further escalation of the already dismal curtailing of academic freedom in the US appears likely. Not least because of the great importance of US institutions in the international academic system, these developments affect research infrastructures worldwide. As “Safeguarding Research and Culture” write in their mission statement, this calls for a change of mindset, among other things towards more decentralised and thus more resilient infrastructures.

For arXiv, a system that could have helped for at least some time was in place until last year. In the early days of the internet, which were also the early days of arXiv, the main server arxiv.org was complemented by a network of arXiv mirror sites distributed around the globe. These gave users access to a copy of the arXiv contents that was geographically closer to their location. A legendary example was the Augsburg arXiv mirror, which often stood out for its shorter access and response latencies.

With years of technical progress, the differences in performance between the local mirrors (among others at the European Organization for Nuclear Research (CERN), at Los Alamos National Laboratory (LANL), in France, and in Japan) and the main server arxiv.org levelled out. As a result, more than 90 % of the traffic went via the main server, and the mirror sites saw little use. Thus, in the view of the arXiv team, the expense of maintaining and updating the mirrors was no longer matched by their “utility and utilization”, as can be read in the arXiv blog entry “Attention arXiv users: arXiv mirrors to shut down September 15th, 2024”.

After arXiv had spent the last couple of years migrating its services to a completely cloud-centric architecture, those responsible for arXiv came to the conclusion that

“The arXiv mirror network served a role – acting as a backup for the corpus, allowing some degree of load distribution, and providing improved access for users who were geographically closer to a mirror – that is no longer necessary. arXiv now has multiple backups for the arXiv corpus in place, and the Fastly CDN (Content Delivery Network) that we use to deliver content provides excellent service throughout the world.”

As a European institution, we have always taken a somewhat different view, and the recent developments unfortunately appear to confirm our reservations. We have always advocated preserving the mirrors, while also looking for alternatives. Some processes turned out to be cumbersome and complicated, partly because of legal constraints regarding licensing. (Open Access is not fully Open Access if authors have granted arXiv an exclusive right for provision.) Others we might be able to explore further.

Why TIB is archiving arXiv data

What we have implemented over the last few weeks is a Dark Archive of arXiv contents.

As a first step in building a Dark Archive, the rights clearance needs to be addressed, of course. Here, TIB had already commissioned a legal opinion back in 2016, in the context of a possible cooperation with arXiv.org. This included studying the licences used by arXiv, which broadly fall into the categories “arXiv.org licence”, “Creative Commons”, and “Public Domain”.

While nothing stands in the way of archiving the data and metadata as such, the status of these rights would have to be explored in detail if the data were to be made accessible through a public-facing service. This is especially relevant for resources under arXiv licences, since this licence type has gone through several versions over the years. Between 1991 and 2003, users were even able to upload data objects without explicitly stating a licence.

But before a user service can even be set up, the data need to be ingested into the TIB infrastructure. Here, arXiv itself offers several methods for obtaining full texts. Since both PDF and (La)TeX sources ought to be part of the TIB Dark Archive, we opted for the download via Amazon S3. arXiv offers this option as a “requester pays” bucket, meaning that TIB as the fetching entity covers the expenses arising with Amazon Web Services (AWS) (see https://info.arxiv.org/help/bulk_data_s3.html). For 2,686,172 fetched datasets with a data volume of just under 10 terabytes, the S3 transfer came to about 900 euros.
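As a rough illustration of what a requester-pays download involves, the following minimal sketch fetches a single object with a boto3-style S3 client. It is not TIB's actual ingest code. The bucket name `arxiv` and the example key are assumptions based on arXiv's bulk-data help page and should be checked there; the helper name `download_requester_pays` is hypothetical.

```python
from typing import Any

# Assumption: bucket name as described on arXiv's bulk-data help page.
# Verify it there before running anything that incurs AWS charges.
ARXIV_BUCKET = "arxiv"


def download_requester_pays(client: Any, key: str, destination: str) -> None:
    """Download one object from the requester-pays bucket to a local file.

    `client` is expected to behave like a boto3 S3 client, for example
    boto3.client("s3") set up with the credentials of the paying AWS account.
    """
    client.download_file(
        Bucket=ARXIV_BUCKET,
        Key=key,
        Filename=destination,
        # The requester must explicitly accept the transfer charges;
        # without this flag S3 denies access to a requester-pays bucket.
        ExtraArgs={"RequestPayer": "requester"},
    )


# Hypothetical usage (the key is an assumption, not taken from the article):
#   import boto3
#   download_requester_pays(boto3.client("s3"),
#                           "src/arXiv_src_manifest.xml",
#                           "arXiv_src_manifest.xml")
```

At the figures reported above, 900 euros for just under 10 terabytes works out to roughly 0.09 euros per gigabyte transferred.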

Because metadata from arXiv have long been used as a data source for the TIB portal, there was no need to establish a new workflow. This will also make it easier to offer the datasets via the TIB portal later on. One option, for example, is to give the arXiv datasets in the TIB portal a second download link in the background. If the first download link pointing to the arXiv source is no longer accessible, the second link comes into play and points to the copy now held at TIB. Users of the TIB portal could thus seamlessly access arXiv records, even in case of an outage of the main platform at Cornell. As mentioned earlier, however, this accessibility is contingent on the specific licences.
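The two-link idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the TIB portal's implementation: the function names, the HEAD-request probe, and the `licence_allows_fallback` flag are all hypothetical.

```python
import urllib.request
from typing import Callable, Optional


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` succeeds (assumed probe)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # Covers network errors, HTTP error statuses, timeouts, bad URLs.
        return False


def pick_download_link(
    primary: str,
    fallback: str,
    probe: Callable[[str], bool] = is_reachable,
    licence_allows_fallback: bool = True,
) -> Optional[str]:
    """Prefer the arXiv link; use the local copy only if arXiv is down
    and the record's licence permits serving it from the archive."""
    if probe(primary):
        return primary
    if licence_allows_fallback:
        return fallback
    return None
```

Passing the probe in as a parameter keeps the decision logic testable without network access, and the licence flag reflects that the fallback is only an option where the rights situation allows it.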

Moreover, after the first complete transfer of the arXiv holdings, a process needs to be implemented that fetches new, additional arXiv records at regular intervals, as well as versioning information for already existing records.
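One way such an update run could decide what to fetch is sketched below. It assumes, purely for illustration, that both the archive and the remote listing can be summarised as a mapping from arXiv identifier to latest version number; the function name `plan_update` and the sample identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_update(
    held: Dict[str, int], remote: Dict[str, int]
) -> List[Tuple[str, int]]:
    """List the (identifier, version) pairs missing from the archive.

    `held` maps each archived identifier to the highest version stored;
    `remote` maps each identifier to the latest version available.
    New records start from version 1; known records get only newer versions.
    """
    to_fetch: List[Tuple[str, int]] = []
    for arxiv_id in sorted(remote):
        latest = remote[arxiv_id]
        stored = held.get(arxiv_id, 0)  # 0 means the record is not held yet
        to_fetch.extend(
            (arxiv_id, version) for version in range(stored + 1, latest + 1)
        )
    return to_fetch


# Illustrative data only.
held = {"2401.00001": 1, "2401.00002": 2}
remote = {"2401.00001": 3, "2401.00002": 2, "2401.00003": 1}
plan = plan_update(held, remote)
```

In this example the plan contains versions 2 and 3 of the first record and version 1 of the new third record, while the unchanged second record is skipped.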

“Building a Dark Archive is an expression of our longstanding commitment to reliable, international academic information provision, and of our role as a partner of arXiv. Even though the Dark Archive today only works in the background, it is a key element in safeguarding digital research contents in the long term, because in case of a crisis, we could open the archive.”

Dr Irina Sens, Deputy Director of TIB

Dark Archive: Data stored, but not openly accessible

The data are being stored, but if push comes to shove, it would take some further steps to make them publicly available. This is because a database service is much more than a mere backup copy of the data: operating a production, user-facing service requires not only technical resources, but first and foremost a committed team that takes care of diverse aspects in the background, such as quality assurance, content curation, or (technical) development.

In the case of arXiv, it is not only about the accessibility of the papers, the search functionality, the upload services for authors, and further technical services. Rather, it is the integration within the scientific communities that is the heart of arXiv: numerous researchers volunteer to take on roles on various boards, in content moderation, or as volunteer developers.
