Data From "A Biocodicological Analysis of the Medieval Library and Archive From Orval Abbey, Belgium"

ABSTRACT

The dataset contains the first-ever comprehensive biocodicological analysis of medieval library books and charters using Zooarchaeological Mass Spectrometry (ZooMS). Here, we analyze 68 codices and 59 charters (1490 + 59 samples in total) from one single monastic institution, namely the Cistercian abbey of Orval in present-day Belgium. The data comprise both peptide mass fingerprinting (using MALDI-ToF) and peptide sequencing (using LC-MS/MS) analyses of almost the entire library and all the preserved single-leaf charters from the monastery. MALDI-ToF data are stored in Zenodo, a multidisciplinary open access repository, while the LC-MS/MS data are deposited in the ProteomeXchange Consortium via PRIDE, a publicly available repository for MS-based proteomics data. Mass spectrometric data generated from an entire monastic library and archive are of immense value for integration with multiple case studies aiming at understanding parchment production and use in medieval Europe.

(1) OVERVIEW

CONTEXT
Zooarchaeological Mass Spectrometry (ZooMS) has been widely applied to archaeological contexts and materials of cultural interest [1,2,4,5]. In particular, the application of ZooMS to parchment, the most common writing support in medieval times, has been successfully used to investigate questions about its production, local economies, the use of the manuscripts and their conservation [3,5,6,11,12]. The study of parchment from a biological perspective has become so widespread that the name "Biocodicology" has recently been proposed to refer to the discipline [7]. Moreover, the method complies with conservation standards in terms of non-invasiveness, and the rapidity of sampling (thanks to triboelectric extraction [5]), sample preparation and analysis often allows researchers to collect a large number of samples. This has in turn led to the development of new bioinformatic methods to speed up the data analysis [8].
The article "A biocodicological analysis of the medieval library and archive from Orval Abbey, Belgium" presents the results obtained from the analysis of an almost complete corpus of books and charters from the library of the Cistercian abbey of Orval, Belgium, dating mostly from the 12th to 13th centuries. In total, 1490 folia were analysed by ZooMS and 86 by peptide sequencing. The dataset we provide here is the largest proteomic dataset generated from a single scriptorium to date (Figure 1). For completeness we include all the analyses, so a number of samples have replicates. These replicates arose either from instrument performance (repeat of a whole MALDI-ToF plate) or from sample failure (replicate analysis of a specific sample).
This large dataset allowed us to explore differences in the selection of animal skins inside and outside the Orval scriptorium and the possible reasons behind the choice of sheep/goat or calf skin in the production [10].
This dataset, with consistent metadata defined by a single codicological team and analysed using MS1 (folia) and MS2 (charters), can be used to explore alternative tools for discriminating closely related species and for refining tools for high-throughput analysis of MALDI-ToF (ZooMS) data.

Spatial coverage
Description: The dataset in this study was generated from book and charter samples from one single monastic institution, namely the Cistercian abbey of Orval in present-day Belgium. Sixty-eight codices, representing 118 codicological units in total, were classified in the Orval manuscripts catalogues and belong to the collection of the National Library of Luxembourg, and 59 charters belong to the Belgian State Archives in Arlon.

Temporal coverage
The data were generated from samples of manuscripts and charters dating from the ninth to the seventeenth century, though mostly from the twelfth and thirteenth centuries.

SAMPLING STRATEGY
Sampling of parchment was conducted by non-invasive triboelectric extraction of collagen following a previously described method [5]. Briefly, the sampling procedure entailed gentle rubbing of non-written areas of the parchment surface with an eraser, followed by collection of the eraser crumbs (typically 10 to 50 mg PVC) in a 1.5 mL Eppendorf tube. We advise using nitrile gloves and a freshly cut piece of PVC eraser for each sample. The eraser crumbs are transferred to clean Eppendorf tubes using acid-free paper prior to analysis. Samples were stored at 4 °C or at room temperature. For the codices, samples were taken systematically on the recto of the first folium of each quire composing the manuscripts. For the charters, one sample was taken from the single leaf composing each charter. Subsequently, the samples were subjected to peptide mass fingerprinting using MALDI-ToF and peptide sequencing using LC-MS/MS (Figure 1).

Steps
Peptide mass fingerprinting: All samples except those from manuscript 22 (1463 samples in total) were analysed by MALDI-ToF at BioArch laboratory at the University of York using a Bruker Ultraflex III mass spectrometer at the Center of Excellence in mass spectrometry. Scripts for processing the data were written in R. These scripts used the bacollite package for R, which resolves species ID on the basis of alignment with theoretical MALDI peaks that are generated from peptide data [8]. Briefly, each peptide is aligned to each replicate spectrum using cross-correlation, yielding a correlation value, and the number of matches to each of the test species is counted at each of a number of threshold levels. To obtain a score for each taxon, the hits are summed across correlation thresholds such that, at a given threshold, only the highest-scoring taxon adds its hits, while the rest add 0. A taxon that consistently has more hits than the rest therefore receives a higher score, whereas the score is lower in more ambiguous cases. The resulting automated classification has an associated confidence value that can be used to compare the automated and expert-based classifications. The comparison showed that the automated classification is identical to the expert-based classification where the confidence score is above 10 (for this peptide set). The benefit of this approach is that expert analysis can be directed towards the interpretation of less clear spectra in a principled manner. The scripts used for the analysis are provided in appendix I below.
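
The threshold-wise scoring described above can be illustrated with a minimal Python sketch. This is not the bacollite implementation (which is written in R); the function name, the input layout (one hit count per taxon per correlation threshold) and the handling of ties (a tied threshold adds nothing to any taxon) are assumptions made for illustration only.

```python
from typing import Dict, List


def winner_takes_all_scores(hits: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum hits per taxon, counting a threshold only for its single best taxon.

    `hits` maps a taxon name to its number of peptide matches at each
    correlation threshold (same number of thresholds for every taxon).
    Assumption: if two or more taxa tie at a threshold, nobody scores there.
    """
    taxa = list(hits)
    n_thresholds = len(hits[taxa[0]])
    if any(len(counts) != n_thresholds for counts in hits.values()):
        raise ValueError("every taxon needs one hit count per threshold")

    scores = {taxon: 0 for taxon in taxa}
    for i in range(n_thresholds):
        best = max(hits[taxon][i] for taxon in taxa)
        leaders = [taxon for taxon in taxa if hits[taxon][i] == best]
        # Only a unique leader with at least one hit adds to its score.
        if best > 0 and len(leaders) == 1:
            scores[leaders[0]] += best
    return scores


# Hypothetical hit counts at four correlation thresholds.
example_hits = {
    "sheep": [8, 7, 5, 3],
    "goat": [6, 7, 2, 1],
    "calf": [2, 1, 0, 0],
}
example_scores = winner_takes_all_scores(example_hits)
```

With these made-up counts, sheep leads alone at three of the four thresholds and ties with goat at one, so only sheep accumulates a score. A spectrum in which the leading taxon changes from threshold to threshold would spread the hits and yield lower scores overall, which is the behaviour the text describes for ambiguous cases.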
Peptide sequencing by LC-MS/MS: Samples from all charters (59 samples) and manuscript 22 (27 samples) were analysed by liquid chromatography (UltiMate 3000, ThermoFisher) coupled to electrospray ionisation tandem mass spectrometry (MaXis Impact UHR-, Bruker) at the MaSUN mass spectrometry platform at the University of Namur, Belgium. Compass HyStar 3.2 (Bruker) was used to control the instruments. Peak lists (.mgf format) were generated using the software DataAnalysis 4.0 (Bruker) for downstream data analysis using Mascot 2.4 (Matrix Science). Scaffold (version Scaffold_4.8, Proteome Software Inc., Portland) was used to validate MS/MS-based peptide and protein identifications. Peptide identifications were accepted if they could be established at greater than 90.0% probability by the Peptide Prophet algorithm. Protein identifications were accepted if they could be established at greater than 95.0% probability and contained at least 2 identified peptides. Protein probabilities were assigned by the Protein Prophet algorithm [9].
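
The acceptance criteria above amount to two probability thresholds and a minimum peptide count. The following Python sketch shows only that threshold logic; it does not reproduce Scaffold or the Peptide Prophet and Protein Prophet algorithms. The `ProteinHit` record, its field names and the example values are hypothetical, and it is an assumption that the "at least 2 identified peptides" are counted among the peptides that pass the peptide threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinHit:
    """Hypothetical record; probabilities are fractions between 0 and 1."""
    name: str
    protein_probability: float
    peptide_probabilities: List[float]


def accepted_peptides(hit: ProteinHit, min_peptide_prob: float = 0.90) -> List[float]:
    """Keep peptides established at greater than the peptide threshold."""
    return [p for p in hit.peptide_probabilities if p > min_peptide_prob]


def is_accepted(hit: ProteinHit,
                min_protein_prob: float = 0.95,
                min_peptides: int = 2,
                min_peptide_prob: float = 0.90) -> bool:
    """Protein passes if its probability exceeds the protein threshold and
    it has at least `min_peptides` accepted peptides (assumed reading)."""
    enough_peptides = len(accepted_peptides(hit, min_peptide_prob)) >= min_peptides
    return hit.protein_probability > min_protein_prob and enough_peptides


# Made-up examples: only protein_A satisfies both criteria.
hit_a = ProteinHit("protein_A", 0.99, [0.97, 0.93, 0.50])
hit_b = ProteinHit("protein_B", 0.99, [0.97, 0.60])
hit_c = ProteinHit("protein_C", 0.95, [0.97, 0.95])
```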

Constraints
Not all bifolia in each codicological unit (CU) were analysed.

(3) DATASET DESCRIPTION

OBJECT NAME
In the present work the dataset is available in the following formats:
1. metadata.csv - Metadata of each codicological unit.
2. metadata.xlsx - Excel version of the metadata.
3. species_id.csv - Species identification of each quire within the codicological unit.
4. species_id.xlsx - Excel version of the species identification.
5. mzML files, named so that ms stands for the manuscript number, cu is the codicological unit within the manuscript as a roman numeral, and q is the quire number within the codicological unit.

Some quires were analyzed several times, either because the sample of eraser crumbs was too small or because of an issue with the instrumental analysis (MALDI-ToF). These replicate analyses are indicated by a -bis, -ter, … suffix, or by a number preceded by two underscores, as in "__1", "__2", … When the sample was too small, only that sample would fail. If the issue was with the instrument, typically a whole plate of samples would fail. Each sample is analyzed in triplicate within the same MALDI plate. Each replicate is identified by a final number preceded by an underscore: "_1", "_2" and "_3".
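
The suffix conventions described above can be separated from a file name with two regular expressions. The Python sketch below is a minimal illustration: the function and class names are hypothetical, the stem "sampleX" in the examples is a placeholder and not a real file name from the dataset, and only the -bis and -ter suffixes named in the text are recognised (further ordinals would need to be added).

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    sample: str             # name with all replicate suffixes removed
    rerun: Optional[str]    # "bis", "ter", or the number after "__"
    spot: Optional[int]     # triplicate number after the final single "_"


# Final "_1", "_2" or "_3" that is not part of a double underscore.
_SPOT = re.compile(r"(?<!_)_([123])$")
# Repeat-analysis marker: "-bis", "-ter", or "__<number>".
_RERUN = re.compile(r"(?:-(bis|ter)|__(\d+))$")


def parse_name(stem: str) -> ParsedName:
    """Split a file stem (without extension) into sample, rerun and spot."""
    spot = None
    match = _SPOT.search(stem)
    if match:
        spot = int(match.group(1))
        stem = stem[:match.start()]

    rerun = None
    match = _RERUN.search(stem)
    if match:
        rerun = match.group(1) or match.group(2)
        stem = stem[:match.start()]

    return ParsedName(stem, rerun, spot)


# Placeholder stems, for illustration only.
first = parse_name("sampleX-bis_2")
second = parse_name("sampleX__1_3")
third = parse_name("sampleX_1")
```

Stripping the triplicate suffix before the repeat-analysis suffix matters because both can appear on the same name, and the look-behind keeps a trailing "__1" from being mistaken for triplicate "_1".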

LC-MS/MS data were uploaded to the PRIDE server as .baf for raw files, .mgf format for peak lists, and .dat files for Mascot results. Scaffold files were also included (.sf3).
Below is a detailed list of the fields of the metadata file, in which we have included the codicological information analyzed for each of the manuscripts. The column name and the units or range, where relevant, are given in parentheses.
- CUs with fewer than 10 folia were considered 'very thin' (index 1), those containing between 11 and 100 folia 'thin' (index 2), those with between 101 and 200 folia 'medium' (index 3), and those with more than 200 folia 'thick' (index 4).
- Quality index (quality_idx, 0-10): This field contains information based on the layout, the scribe's skill (calligraphy) and the decoration of the texts. Based on these features, we have set up a score ranging from 0 to 10: 'very low quality' (score lower than 3), 'low quality' (score higher than or equal to 3 and lower than 5), 'medium quality' (score higher than or equal to 5 and lower than 6.5), 'good quality' (score higher than or equal to 6.5 and up to 8) and 'superior quality' (score higher than 8).
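
The quality bands can be applied to the quality_idx column with a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes that 'good quality' covers scores from 6.5 up to and including 8 and that 'superior quality' covers anything above 8, which is an assumption about how the upper boundaries are handled.

```python
def quality_label(score: float) -> str:
    """Map a quality index (0-10) to its descriptive band.

    Assumption: 'good' spans 6.5 to 8 inclusive, 'superior' is above 8.
    """
    if not 0 <= score <= 10:
        raise ValueError("quality index must lie between 0 and 10")
    if score < 3:
        return "very low quality"
    if score < 5:
        return "low quality"
    if score < 6.5:
        return "medium quality"
    if score <= 8:
        return "good quality"
    return "superior quality"


# Example scores chosen to sit on or near each boundary.
labels = [quality_label(s) for s in (2.9, 3, 5, 6.5, 8, 8.5)]
```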

Data type
Primary and secondary data.

Creation dates
Dataset created between November 2017 and August 2019.
Olivier Deparis (olivier.deparis@unamur.be, Université de Namur) was in charge of the project and coordinated data collection.

License
This dataset was deposited and has been released under a Creative Commons Attribution 4.0 international license, which permits unrestricted use, provided the original author and source are credited.

(4) REUSE POTENTIAL
The data generated and described here are the first of their kind: a complete analysis of a monastic library and archive described by a single codicological team, which can be used as a reference dataset for multiple coherent case studies aimed at understanding parchment production and use in medieval Europe. Consequently, aggregating datasets from the analysis of new manuscripts and charters from all around Europe is necessary for a large-scale study that can provide insights into medieval literacy and intellectual life. The dataset can also be used for validating downstream bioinformatic analyses, e.g., those aiming to assess the quality of parchment production from the MALDI-ToF spectra by relating it to the metadata, potentially answering novel questions.

Figure 1 Workflow describing the steps followed for the generation of this dataset.

1. Each sample was taken with a new eraser and nitrile gloves.
2. To avoid cross-contamination, the table was cleaned with isopropanol.
3. Proteomic analyses were repeated if a sample could not be attributed to a taxon.