(1) Overview


Zooarchaeological Mass Spectrometry (ZooMS) has been widely applied to archaeological contexts and materials of cultural interest [1, 2, 4, 5]. In particular, the application of ZooMS to parchment, the most common writing support in medieval times, has been used successfully to investigate questions about its production, local economies, the use of manuscripts and their conservation [3, 5, 6, 11, 12]. The study of parchment from a biological perspective has become so widespread that the name “Biocodicology” has recently been proposed for the discipline [7]. Moreover, sampling by triboelectric extraction [5] is rapid and complies with conservation standards of non-invasiveness. Together with fast sample preparation and analysis, this often allows researchers to collect a large number of samples, which in turn has prompted the development of new bioinformatic methods to speed up data analysis [8].

The article “A biocodicological analysis of the medieval library and archive from Orval Abbey, Belgium” presents the results obtained from the analysis of an almost complete corpus of books and charters from the library of the Cistercian abbey of Orval, Belgium, dating mostly from the twelfth and thirteenth centuries. In total, 1490 folia were analysed by ZooMS and 86 by peptide sequencing. The dataset we provide here is the largest proteomic dataset generated from a single scriptorium to date (Figure 1). For completeness we include all analyses, so a number of samples have replicates. These replicates were prompted either by poor instrument performance (repeat of a whole MALDI-ToF plate) or by sample failure (replicate analysis of a specific sample).

Figure 1. Workflow describing the steps followed for the generation of this dataset.

This large dataset allowed us to explore differences in the selection of animal skins inside and outside the Orval scriptorium, and the possible reasons behind the choice of sheep/goat or calf skin in parchment production [10].

This dataset, with consistent metadata defined by a single codicological team and analysed using MS1 (folia) and MS2 (charters), can be used to explore alternative tools for discriminating closely related species and to refine tools for high-throughput analysis of MALDI-ToF (ZooMS) data.

Spatial coverage

Description: The dataset in this study was generated from book and charter samples from a single monastic institution, namely the Cistercian abbey of Orval in present-day Belgium. Sixty-eight codices, representing 118 codicological units in total, are classified in the Orval manuscripts catalogues and belong to the collection of the National Library of Luxembourg, and 59 charters belong to the Belgian State Archives in Arlon.

Temporal coverage

The majority of the data were generated from samples of manuscripts and charters dating from the ninth to the seventeenth century, though mostly from the twelfth and thirteenth centuries.

(2) Methods

Sampling strategy

Sampling of parchment was conducted by non-invasive triboelectric extraction of collagen following a previously described method [5]. Briefly, the sampling procedure entailed gentle rubbing of non-written areas of the parchment surface with a PVC eraser, followed by collection of the eraser crumbs (typically 10 to 50 mg of PVC) in a 1.5 mL Eppendorf tube. We advise that nitrile gloves be worn and that a freshly cut piece of PVC eraser be used for each sample. The eraser crumbs are transferred to clean Eppendorf tubes using acid-free paper prior to analysis. Samples were stored at 4 °C or at room temperature. For the codices, samples were taken systematically from the recto of the first folium of each quire composing the manuscript. For the charters, one sample was taken from the single leaf composing each charter. The samples were subsequently subjected to peptide mass fingerprinting using MALDI-ToF and peptide sequencing using LC-MS/MS (Figure 1).


Peptide mass fingerprinting: All samples except those from manuscript 22 (1463 samples in total) were analysed by MALDI-ToF at the BioArch laboratory at the University of York using a Bruker Ultraflex III mass spectrometer at the Centre of Excellence in Mass Spectrometry. Scripts for processing the data were written in R. These scripts used the bacollite package for R, which resolves species ID by aligning spectra with theoretical MALDI peaks generated from peptide data [8]. Briefly, each peptide is aligned to each replicate spectrum using cross-correlation, yielding a correlation value; the number of matches to each test species is then counted at each of a number of correlation threshold levels. To obtain a score for each taxon, the numbers of hits are summed across the correlation thresholds in such a way that, at a given threshold, only the highest-scoring taxon adds its hits, while the rest add 0. Therefore, a taxon that has consistently more hits than the rest gets a higher score, whereas the score is lower in more ambiguous cases. The resulting automated identification has an associated confidence value that can be used to compare the automated and expert-based classifications. The comparison showed that the automated classification is identical to the expert-based classification where the confidence score is above 10 (for this peptide set). The benefit of this approach is that expert analysis can be directed towards the interpretation of less clear spectra in a principled manner. The scripts used for the analysis are provided in Appendix I below.
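The threshold-based scoring described above can be illustrated with a minimal Python sketch. This is not the bacollite implementation (which is written in R and provided in Appendix I); the function name `taxon_scores`, the input layout, the example values and the handling of ties are all assumptions made for illustration only.

```python
from typing import Dict, Sequence


def taxon_scores(correlations: Dict[str, Sequence[float]],
                 thresholds: Sequence[float]) -> Dict[str, int]:
    """Sum hits per taxon over correlation thresholds (illustrative only).

    `correlations` maps each taxon to the correlation values obtained by
    aligning its theoretical peptide peaks to a spectrum (hypothetical layout).
    """
    scores = {taxon: 0 for taxon in correlations}
    for threshold in thresholds:
        # A "hit" is a peptide whose correlation reaches the threshold.
        hits = {taxon: sum(1 for c in values if c >= threshold)
                for taxon, values in correlations.items()}
        best = max(hits.values())
        leaders = [taxon for taxon, n in hits.items() if n == best]
        # Only the highest-scoring taxon adds its hits; the rest add 0.
        # Assumption: a tie (or no hits at all) is treated as ambiguous,
        # so no taxon scores at that threshold.
        if best > 0 and len(leaders) == 1:
            scores[leaders[0]] += best
    return scores


# Hypothetical correlation values for three candidate taxa.
example = {
    "sheep": [0.9, 0.8, 0.7, 0.2],
    "goat": [0.9, 0.5, 0.3, 0.1],
    "cattle": [0.4, 0.2, 0.1, 0.0],
}
example_scores = taxon_scores(example, thresholds=[0.3, 0.5, 0.7])
```

In this toy case sheep and goat tie at the lowest threshold, so neither scores there, while sheep leads at the two higher thresholds and accumulates the whole score. A taxon that leads consistently therefore ends with a high score, and an ambiguous spectrum ends with low scores throughout.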

Peptide sequencing by LC-MS/MS: Samples from all charters (59 samples) and manuscript 22 (27 samples) were analysed by liquid chromatography (UltiMate 3000, ThermoFisher) coupled to electrospray ionisation tandem mass spectrometry (MaXis Impact UHR-, Bruker) at the MaSUN mass spectrometry platform at the University of Namur, Belgium. Compass HyStar 3.2 (Bruker) was used to control the instruments. Peak lists (.mgf format) were generated from the raw files using the software DataAnalysis 4.0 (Bruker) for downstream data analysis using Mascot 2.4 (Matrix Science). Scaffold (version Scaffold_4.8, Proteome Software Inc., Portland) was used to validate MS/MS-based peptide and protein identifications. Peptide identifications were accepted if they could be established at greater than 90.0% probability by the Peptide Prophet algorithm.

Protein identifications were accepted if they could be established at greater than 95.0% probability and contained at least 2 identified peptides. Protein probabilities were assigned by the Protein Prophet algorithm [9].
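The acceptance criteria above can be restated as a short Python sketch. It only mirrors the stated thresholds and is not Scaffold's interface; the `ProteinHit` record, its field names and the example values are hypothetical, and the sketch assumes that "identified peptides" means peptides that themselves pass the 90.0% threshold.

```python
from dataclasses import dataclass
from typing import List

PEPTIDE_MIN_PROB = 0.90   # peptides accepted above 90.0% probability
PROTEIN_MIN_PROB = 0.95   # proteins accepted above 95.0% probability
MIN_PEPTIDES = 2          # at least 2 identified peptides per protein


@dataclass
class ProteinHit:
    """Hypothetical record for one protein identification."""
    name: str
    protein_probability: float
    peptide_probabilities: List[float]


def accepted_peptides(hit: ProteinHit) -> List[float]:
    """Keep peptide probabilities strictly greater than the peptide threshold."""
    return [p for p in hit.peptide_probabilities if p > PEPTIDE_MIN_PROB]


def is_accepted(hit: ProteinHit) -> bool:
    """Apply the protein threshold and the minimum peptide count."""
    return (hit.protein_probability > PROTEIN_MIN_PROB
            and len(accepted_peptides(hit)) >= MIN_PEPTIDES)


# Hypothetical example values.
strong = ProteinHit("COL1A1", 0.99, [0.95, 0.91, 0.50])
weak = ProteinHit("COL1A2", 0.99, [0.95, 0.50])
```

Here `strong` passes both criteria, whereas `weak` is rejected because only one of its peptides exceeds the peptide threshold.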

Quality Control

  1. Each sample was taken with a new piece of eraser and fresh nitrile gloves.
  2. To avoid cross-contamination, the table was cleaned with isopropanol.
  3. Proteomic analyses were repeated if a sample could not be attributed to a taxon.


Not all bifolia in each codicological unit (CU) were analysed.

(3) Dataset description

Object name

In the present work the dataset is available in the following formats:

  1. metadata.csv – Metadata of each codicological unit.
  2. metadata.xlsx – Excel version of the metadata.
  3. species_id.csv – Species identification of each quire within the codicological unit.
  4. species_id.xlsx – Excel version of the species identification.
  5. Orval_Bacollite_example_for_datapaper.Rmd – R Markdown notebook with the script to generate the automatic species identification with bacollite.
  6. Orval_Bacollite_example_for_datapaper.html – Knitted HTML version of the notebook.
  7. *.mzML files of MALDI-ToF data containing the spectrum from each sample as mass and intensity vectors. The files are named ms_cu_q_r.mzML, where ms stands for the manuscript number, cu is the codicological unit within the manuscript as a Roman numeral, q is the quire number within the codicological unit, and r is the replicate number. Each sample was analysed in triplicate within the same MALDI plate, and each replicate is identified by a final number preceded by an underscore: “_1”, “_2” and “_3”. Some quires were analysed several times, either because the sample of eraser crumbs was too small, in which case that single sample failed, or because of an issue with the instrumental analysis (MALDI-ToF), in which case a whole plate of samples typically failed. These repeat analyses are indicated by a -bis, -ter, … suffix, or by a number preceded by two underscores, as in “__1”, “__2”, …
  8. LC-MS/MS data were uploaded to the PRIDE server as .baf files (raw data), .mgf files (peak lists) and .dat files (Mascot results). Scaffold files (.sf3) were also included.
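The naming convention of the mzML files lends itself to automated parsing. The following Python sketch shows one possible way to split a file name into its fields. The function name, the example file names and, in particular, the assumption that the repeat-analysis suffix directly follows the quire number are hypothetical and should be checked against the actual files in the repository.

```python
import re
from typing import Dict, Optional

# Assumed layout: ms_cu_q[repeat suffix]_r.mzML
_NAME_PATTERN = re.compile(
    r"^(?P<ms>[^_]+)"            # manuscript number
    r"_(?P<cu>[IVXLCDM]+)"       # codicological unit, Roman numeral
    r"_(?P<q>\d+)"               # quire number
    r"(?P<rerun>-[a-z]+|__\d+)?" # optional repeat marker: -bis, -ter, __1, ...
    r"_(?P<r>[123])"             # replicate within the MALDI plate
    r"\.mzML$"
)


def parse_spectrum_name(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields encoded in a file name, or None if it does not match."""
    match = _NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


# Hypothetical file names, for illustration only.
first_run = parse_spectrum_name("12_II_3_1.mzML")
repeat_run = parse_spectrum_name("12_II_3-bis_2.mzML")
```

Grouping the parsed records by (ms, cu, q) would then collect the three replicates of each quire, together with any repeat analyses, for joint processing.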

Below is a detailed list of the fields of the metadata file, in which we have included the codicological information analysed for each of the manuscripts. The column name and the units or range, where relevant, are given in parentheses.

  • Manuscript number (ms): Manuscript books from a single medieval Cistercian monastery (Orval Abbey, Belgium). The manuscripts are classified in the Orval manuscript catalogue and belong to the collection of the National Library of Luxembourg. The manuscript number refers to its number in the catalogue: Falmagne T. 2017 Die Orvaler Handschriften bis zum Jahr 1628 in den Beständen der Bibliothèque Nationale de Luxembourg und des Grand Séminaire de Luxembourg. Wiesbaden, Germany: Harrassowitz.
  • CU label (cu): For this field we define a codicological unit (CU) as a volume, a part of a volume, or a set of volumes whose production may be considered a single operation, prepared in one place, at one time, using the same available resources.
  • Number of quires (n_quires): Number of quires in the codicological unit. A quire is a collection of leaves of parchment or paper, folded one within the other, in a manuscript.
  • Orval local production: This field indicates whether the codicological unit was produced locally. According to an analysis of codicological elements performed by T. Falmagne, 26 of the 118 CUs were produced locally by the Orval scriptorium.
  • Production period (prod_period): This field gives the historical period in which the library books and charters were produced. They were produced between the ninth and the seventeenth century, though mostly in the twelfth and thirteenth centuries.
  • Origin (origin): This field gives the geographical place where the manuscripts and charters were produced. For this study we distinguish codicological units produced by the Orval scriptorium from those produced outside it.
  • Height (mm): height of the book as reported in the catalogue.
  • Width (mm): width of the book as reported in the catalogue.
  • Number of folia (n_folia): Number of folia contained in the codicological unit.
  • Typology (typology): This field is based on the topic of the manuscript. For the present study, eight types of text were defined: bible, liturgy, grammar and rhetoric, sciences, narrative texts, law, preaching and theology. We have also included a category ‘Other’, composed of normative texts and letter collections.
  • Thickness index (thinckness_idx, 1–4): Thickness is based on the number of folia a given CU contains. CUs with fewer than 10 folia were considered ‘very thin’ (index 1), those with between 11 and 100 folia ‘thin’ (index 2), those with between 101 and 200 folia ‘medium’ (index 3), and those with more than 200 folia ‘thick’ (index 4).
  • Quality index (quality_idx, 0–10): This field contains information based on the layout, the scribe’s skill (calligraphy) and the decoration of the texts. Based on these features, we set up a score ranging from 0 to 10: ‘very low quality’ (score lower than 3), ‘low quality’ (score higher than or equal to 3 and lower than 5), ‘medium quality’ (score higher than or equal to 5 and lower than 6.5), ‘good quality’ (score higher than or equal to 6.5 and up to 8) and ‘superior quality’ (score higher than 8).
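The two index fields above follow simple range rules, which the Python sketch below restates for anyone wishing to recompute or check them from the raw values. The function names are hypothetical. Two points are assumptions rather than statements of the source: the ‘good quality’ band is taken here as 6.5 up to and including 8, and a CU with exactly 10 folia is left unclassified because the stated ranges do not cover that value.

```python
from typing import Optional


def thickness_index(n_folia: int) -> Optional[int]:
    """Map a folia count to the thickness index (1 to 4)."""
    if n_folia < 10:
        return 1  # 'very thin'
    if 11 <= n_folia <= 100:
        return 2  # 'thin'
    if 101 <= n_folia <= 200:
        return 3  # 'medium'
    if n_folia > 200:
        return 4  # 'thick'
    # Exactly 10 folia is not covered by the stated ranges.
    return None


def quality_label(score: float) -> str:
    """Map a quality score (0 to 10) to its descriptive label."""
    if not 0 <= score <= 10:
        raise ValueError("quality score must lie between 0 and 10")
    if score < 3:
        return "very low quality"
    if score < 5:
        return "low quality"
    if score < 6.5:
        return "medium quality"
    if score <= 8:
        # Assumption: 'good quality' spans 6.5 up to and including 8.
        return "good quality"
    return "superior quality"
```

Comparing the output of such functions with the values in metadata.csv is a quick consistency check before using the indices in downstream analyses.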

Data type

Primary and secondary data.

Format names and versions

*.XLSX, *.CSV, *.mzML, *.baf, *.mgf, *.dat, *.sf3

Creation dates

Dataset created between November 2017 and August 2019.

Dataset Creators

The codicological assessment, identification of codicological units and measurement of bifolia were conducted by Thomas Falmagne (Universitätsbibliothek Frankfurt am Main, t.falmagne@ub.uni-frankfurt.de). Nicolas Ruffini-Ronzani (nicolas.ruffini@unamur.be), with the support of Chiara Ruzzier (chiara.ruzzier@unamur.be, Université de Namur), created the database based on Falmagne’s work. Catherine Charles (catherine.charles@unamur.be, Université de Namur) defined the categories based on the measurements from the original catalogue and measured the thickness of the charters.

Catherine Charles with Julie Bouhy (julie.bouhy@unamur.be, Université de Namur) and Olivier Deparis (olivier.deparis@unamur.be, Université de Namur) were responsible for sampling.

Silvia Soncin (silvia.soncin@york.ac.uk, University of York) conducted the ZooMS analysis (MALDI-ToF) and, together with Simon Hickinbotham (simon.hickinbotham@york.ac.uk, York), Matthew Collins, Ismael Rodríguez Palomo and Bharath Nair (Copenhagen University), was responsible for data generation and for checking the data conversion. Marc Dieu (marc.dieu@unamur.be, Université de Namur) generated the LC-MS/MS data with the support of Julie Bouhy.

Olivier Deparis (olivier.deparis@unamur.be, Université de Namur) was in charge of the project and coordinated data collection.




This dataset was deposited and has been released under a Creative Commons Attribution 4.0 international license, which permits unrestricted use, provided the original author and source are credited.

Repository location

Zenodo (latest version)

DOI: https://doi.org/10.5281/zenodo.5583377

PRIDE: PXD029196

Publication date

02 June 2021

(4) Reuse potential

The dataset generated and described here is the first of its kind: a complete analysis of a monastic library and archive described by a single codicological team. It can be used as a reference dataset for multiple coherent case studies that aim to understand parchment production and use in medieval Europe. Aggregating datasets from the analysis of further manuscripts and charters from across Europe will be necessary for a large-scale study that can provide insights into medieval literacy and intellectual life. The dataset can also be used to validate downstream bioinformatic analyses, for example those that aim to assess the quality of parchment production from the MALDI-ToF spectra and relate it to the metadata, potentially answering novel questions.

Additional File

The additional file for this article can be found as follows:

Appendix I

Orval data analysis using Bacollite. DOI: https://doi.org/10.5334/joad.89.s1