(1) Overview


Spatial coverage

Description: State Game Lands within the State of Pennsylvania, United States.


North_Bounding_Coordinate: 42.269479

South_Bounding_Coordinate: 39.719860

East_Bounding_Coordinate: -74.689583

West_Bounding_Coordinate: -80.519349

Temporal coverage


(2) Methods


These data were used to build an object detection model to locate Relict Charcoal Hearths (RCH) as described in the paper “When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and around State Game Lands in Pennsylvania” [1].


A Python program was used to split the Slope TIFF files into smaller 1024x768 pixel JPEG tiles. If a tile contained at least one location point of a known RCH, it was also retained to train the model. For each training image, the program generated a corresponding XML file with the pixel coordinates of the boundaries of known RCHs in the image. Mask R-CNN was used to train the model from the images and XML files [7].
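The tiling step can be illustrated with a minimal sketch. It assumes a simple non-overlapping grid anchored at the top-left corner of the slope image and RCH locations already converted to pixel coordinates; the function names and the handling of edge tiles are hypothetical and are not taken from the authors' notebook.

```python
from typing import Dict, Iterator, List, Tuple

TILE_W, TILE_H = 1024, 768  # tile size stated in the text


def tile_origins(img_w: int, img_h: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left pixel (x, y) of each tile, row by row."""
    for y in range(0, img_h, TILE_H):
        for x in range(0, img_w, TILE_W):
            yield x, y


def tiles_for_training(
    img_w: int, img_h: int, rch_pixels: List[Tuple[int, int]]
) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
    """Map each tile origin to the known RCH points inside it.

    Points are returned in tile-local pixel coordinates. Only tiles
    holding at least one known RCH appear in the result (assumption:
    these are the tiles kept for training).
    """
    selected: Dict[Tuple[int, int], List[Tuple[int, int]]] = {}
    for px, py in rch_pixels:
        if not (0 <= px < img_w and 0 <= py < img_h):
            continue  # point falls outside this image
        origin = (px // TILE_W * TILE_W, py // TILE_H * TILE_H)
        selected.setdefault(origin, []).append((px - origin[0], py - origin[1]))
    return selected
```

The tile-local coordinates returned here are the kind of values an annotation file would record for each training image; the annotation schema itself is not reproduced.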

The trained model H5 file was used to detect RCHs in all of the JPEG tiles. This produced a Shapefile of predicted RCHs. The initial predictions from the model were processed using cluster analysis to produce a final list of detected RCHs, stored in GeoJSON. Figure 1 lists the detailed steps.

Figure 1. Process to train and run the object detection model using Mask R-CNN.

Sampling strategy

When training the model, 20% of the images from across all State Game Lands (SGLs) were set aside for automated testing.
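A hold-out split of this kind can be sketched as follows. The text does not say how the 20% were chosen, so the random shuffle, the fixed seed and the function name are assumptions made only for illustration.

```python
import random
from typing import Iterable, List, Tuple


def split_holdout(
    image_names: Iterable[str], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Return (train, test) lists with test_fraction of images held out.

    Names are sorted before shuffling so the split is reproducible
    regardless of the order in which files were listed.
    """
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_fraction)
    return names[n_test:], names[:n_test]
```

Fixing the seed keeps the test set stable between training runs, so scores from different models remain comparable.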

Quality Control

The quality control for model training and selection had these steps:

  • Each model was assessed for its Average Precision. A score above 50% meant the model was a possible candidate for object detection.
  • Candidate models were used to detect RCHs in 20 images. The results were manually inspected for accuracy.
  • Models that passed visual inspection were formally scored using a set of 100 images selected at random.
  • Models passing formal scoring were reviewed in depth using a large sample spanning multiple SGLs.
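Average Precision in object detection is typically computed by counting a predicted box as correct when its intersection over union (IoU) with a known box exceeds a threshold. The threshold used for these models is not stated here, so the sketch below shows only the IoU calculation, with boxes assumed to be (xmin, ymin, xmax, ymax) tuples in pixels.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```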


Predictions for State Game Lands 264 and 258 could not be processed due to problems in the source files.

(3) Dataset description

Object name

Since the dataset contains multiple objects, filenames are displayed in Zenodo.

Data type

Primary data: Locations of known RCHs. Secondary data: JPEG tiles, XML files, Object Recognition File, Shapefile and GeoJSON of results.

Format names and versions

TIFF, Shapefile, GeoJSON, JPEG, XML, H5, Jupyter Notebook

Creation dates

01/01/2020 – 31/08/2020

Dataset Creators

Jeff Blackadar, object recognition programmer, author, Carleton University. https://orcid.org/0000-0002-8160-0942

Ben Carter, archaeologist, author, data collection and verification, GIS expert, Muhlenberg College. https://orcid.org/0000-0002-7464-0989

Weston Conner, archaeologist, author, data collection and verification, GIS expert, Lehigh University. https://orcid.org/0000-0001-9906-3762





Repository location

Documentation: https://zenodo.org/record/4766351

Data for

  1. Vector files resulting from manual identification of relict charcoal hearths (RCHs). These were used to train the Mask R-CNN model.
    1. Shapefile - https://zenodo.org/record/4593605
    2. GeoJSON - https://zenodo.org/record/4593622
  2. Program for splitting slope analysis (in TIFF format) into smaller tiles (in JPEG format).
    1. https://github.com/jeffblackadar/charcoalhearths/blob/master/0_split_tifs_refactored.ipynb
  3. Data files for Training:
    1. RCH Detection with Mask R-CNN Image Annotations. https://zenodo.org/record/4575582
      This file contains a collection of XML files holding the coordinates of known RCHs on the images (from above). These files are known as annotations and are used by Mask R-CNN to identify the objects to detect when training a model.
    2. RCH Detection with Mask R-CNN Training Images. https://zenodo.org/record/4579935
      This file contains all of the images used for training the Mask R-CNN model. Each image contains at least one known RCH.
    3. Polygons for tiles of LiDAR data.
  4. Program for Mask R-CNN training and prediction. Known as “data_5000_3_rcnn_charcoal_hearths.ipynb” [7]
  5. Data files of the model and predictions
    1. Resultant trained Mask R-CNN model.
    2. Predictions from the model, in x, y coordinates (not geolocated) in XML format. https://zenodo.org/record/4581281
      The format of these files is similar to the training annotations.
    3. RCH Detection with Mask R-CNN Images.
      This file contains all of the images representing tiles of LiDAR images of State Game Lands. These images are used for predictions to locate RCHs. (A subset was used for training. See 3b RCH Detection with Mask R-CNN Training Images above.)
  6. Program that produces confidence scores for the predictions above. Known as “data_5000_4_rcnn_charcoal_hearths_count_results.ipynb”
  7. Program that removes duplicates (because some tiles included multiple SGLs and some tiles overlapped both the north and south LiDAR tile indices of the state), converts squares to their centroids for comparison and saves the unique squares in geolocated vector files. Known as “2_read_predictions_from_xml_put_into_shp.ipynb”
  8. Vector file of prediction results.
    1. Shapefile - https://zenodo.org/record/4593734
    2. GeoJSON - https://zenodo.org/record/4593747
  9. Prediction results with additional variables (bins for assessment, ID of training data, cluster analysis and visual confirmation)
    1. GeoJSON (no Shapefile) - https://zenodo.org/record/4593767
    2. Variables:
      1. id = unique identifier starting with the 3-digit SGL number, followed by PAN or PAS (projections) and, within those, a unique four-digit identifier
      2. score = confidence score
      3. SGL = State Game Land number
      4. SGLImage = name of TIFF file of merged LiDAR tiles
      5. Confirm = Whether the predicted hearth was determined, through visual inspection, to be a likely true positive (Y) or a false positive (N)
      6. Bin# = bin number; in assessing these predictions we “binned” the results based upon the confidence score.
      7. Bin_select = 1 if this record (predicted RCH) was selected for assessment within that bin
      8. TrainID = Original ID of the training data (only training data that matched with a prediction are included).
      9. Clusters5_300 = resultant clusters from DBSCAN where minimum cluster size = 5 and maximum distance = 300 meters
      10. Clusters10_500 = resultant clusters from DBSCAN where minimum cluster size = 10 and maximum distance = 500 meters
      11. Clusters20_1000 = resultant clusters from DBSCAN where minimum cluster size = 20 and maximum distance = 1000 meters
      12. CLUSTERCT = How many of the above clusters included the predicted RCH (0–3). Derived from the previous three variables.
      13. 3Cluster = whether or not this predicted RCH was included in all three clusters.
  10. False Negatives for the tiles around SGL 43 after close visual inspection (at 1:1000 scale).
    1. GeoJSON - https://zenodo.org/record/4758647
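The three cluster variables and CLUSTERCT listed above can be illustrated with a minimal pure-Python DBSCAN sketch. It assumes point coordinates in a projected system measured in meters, treats “minimum cluster size” as the DBSCAN minimum-points parameter, and marks unclustered points with -1; these choices and all function names are assumptions, not the authors' implementation.

```python
from math import hypot
from typing import List, Optional, Sequence, Tuple

NOISE = -1  # label for points that belong to no cluster (assumed convention)


def dbscan(points: Sequence[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Label each (x, y) point with a cluster number, or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        # Indices of all points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return [NOISE if label is None else label for label in labels]


def cluster_count(label_sets: Sequence[Sequence[int]]) -> List[int]:
    """For each point, count the runs in which it was placed in a cluster."""
    return [sum(label != NOISE for label in labels) for labels in zip(*label_sets)]
```

Running `dbscan` three times with (eps, min_pts) of (300, 5), (500, 10) and (1000, 20) and passing the three label lists to `cluster_count` gives a per-point value from 0 to 3, analogous to CLUSTERCT; a value of 3 corresponds to the 3Cluster flag.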

Publication date


(4) Reuse potential

The model may be reused to detect similar looking objects in other landscapes. The shapefile and images may be used to train an improved prediction model. Also, the images and programs may be re-used to train a different object detection model to locate other types of objects in SGLs.