(1) Overview
Context
Spatial coverage
Description: State Game Lands within the State of Pennsylvania, United States.
(WGS84):
North_Bounding_Coordinate: 42.269479
South_Bounding_Coordinate: 39.719860
East_Bounding_Coordinate: -74.689583
West_Bounding_Coordinate: -80.519349
Temporal coverage
AD1700-AD1945
(2) Methods
Introduction
These data were used to build an object detection model to locate Relict Charcoal Hearths (RCH) as described in the paper “When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and around State Game Lands in Pennsylvania” [1].
Steps
A Python program split the slope TIFF files into smaller 1024x768 pixel JPEG tiles. A tile that contained at least one location point of a known RCH was also retained to train the model. For each training image, the program generated a corresponding XML file with the pixel coordinates of the boundaries of known RCHs in the image. Mask R-CNN was used to train the model from the images and XML files [7].
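The tiling and annotation step can be sketched as follows. This is a minimal illustration, not the authors' program: it works on pixel coordinates only (reading the TIFF and writing the JPEG are omitted), the box half-width `half`, the label `charcoal_hearth`, the file-name pattern and the Pascal VOC-like XML layout are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TILE_W, TILE_H = 1024, 768  # tile size stated in the text


def tile_windows(width, height, tile_w=TILE_W, tile_h=TILE_H):
    """Yield (col, row, x0, y0) for each full tile; partial edge tiles are skipped."""
    for row in range(height // tile_h):
        for col in range(width // tile_w):
            yield col, row, col * tile_w, row * tile_h


def boxes_in_tile(points, x0, y0, half=20, tile_w=TILE_W, tile_h=TILE_H):
    """Return tile-relative boxes around the known RCH points inside this tile.

    `half` (box half-width in pixels) is a hypothetical value for illustration.
    """
    boxes = []
    for px, py in points:
        if x0 <= px < x0 + tile_w and y0 <= py < y0 + tile_h:
            cx, cy = px - x0, py - y0
            boxes.append((max(cx - half, 0), max(cy - half, 0),
                          min(cx + half, tile_w - 1), min(cy + half, tile_h - 1)))
    return boxes


def annotation_xml(filename, boxes, label="charcoal_hearth"):
    """Build a Pascal VOC-like annotation (assumed layout) as an XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for box in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bnd, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical 2048x1536 raster with two known RCH points (pixel coordinates).
known_points = [(100, 100), (1500, 900)]
training = {}
for col, row, x0, y0 in tile_windows(2048, 1536):
    boxes = boxes_in_tile(known_points, x0, y0)
    if boxes:  # only tiles with at least one known RCH join the training set
        name = f"tile_{col}_{row}.jpg"
        training[name] = annotation_xml(name, boxes)
```

In this example only two of the four tiles contain a known point, so only those two receive an annotation and would be copied into the training set.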
The trained model (an H5 file) was used to detect RCHs in all of the JPEG tiles, producing a Shapefile of predicted RCHs. The initial predictions from the model were processed using cluster analysis to produce a final list of detected RCHs, stored in GeoJSON. Figure 1 lists the detailed steps.
Figure 1: Process to train and run the object detection model using Mask R-CNN.
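The post-processing of predictions (box to centroid, geolocation, duplicate removal, GeoJSON output) might look like the sketch below. It is an illustration under stated assumptions rather than the authors' notebook: tiles are assumed to be north-up with a known upper-left origin and pixel size, the 5-unit duplicate tolerance is arbitrary, and the map coordinates are hypothetical values that a real workflow would need to tie to a defined coordinate reference system.

```python
import json
import math


def box_centroid(xmin, ymin, xmax, ymax):
    """Centre of a predicted bounding box, in tile pixel coordinates."""
    return (xmin + xmax) / 2.0, (ymin + ymax) / 2.0


def pixel_to_world(px, py, origin_x, origin_y, pixel_size):
    """Map tile pixels to map coordinates for a north-up tile.

    (origin_x, origin_y) is the tile's upper-left corner; y decreases downwards.
    """
    return origin_x + px * pixel_size, origin_y - py * pixel_size


def drop_duplicates(points, tolerance):
    """Keep the highest-scoring point among any that lie within `tolerance`."""
    kept = []
    for x, y, score in sorted(points, key=lambda p: -p[2]):
        if all(math.hypot(x - kx, y - ky) > tolerance for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


def to_geojson(points):
    """Serialise (x, y, score) tuples as a GeoJSON FeatureCollection string."""
    features = [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [x, y]},
         "properties": {"score": score}}
        for x, y, score in points
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


# Hypothetical predictions: (tile origin x, tile origin y, box, score).
predictions = [
    (500000.0, 4500000.0, (100, 100, 120, 120), 0.95),
    (500000.0, 4500000.0, (101, 101, 121, 121), 0.80),  # near-duplicate of the first
    (500000.0, 4500000.0, (600, 400, 620, 420), 0.70),
]
located = []
for ox, oy, box, score in predictions:
    cx, cy = box_centroid(*box)
    wx, wy = pixel_to_world(cx, cy, ox, oy, pixel_size=1.0)
    located.append((wx, wy, score))

unique = drop_duplicates(located, tolerance=5.0)
geojson_text = to_geojson(unique)
```

Comparing centroids rather than full squares keeps the duplicate check to a single distance test per pair, which is enough when the same hearth is detected on overlapping tiles.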
Sampling strategy
When training the model, 20% of the images across all State Game Lands (SGLs) were set aside for automated testing.
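A hold-out split of this kind can be sketched in a few lines. The example assumes a simple random split over the pooled images; whether the original split was stratified by SGL is not stated, and the seed and file names are hypothetical.

```python
import random


def split_train_test(image_names, test_fraction=0.2, seed=42):
    """Shuffle a copy of the names and set aside `test_fraction` for testing."""
    names = sorted(image_names)         # sort so the result does not depend on input order
    random.Random(seed).shuffle(names)  # the seed is an arbitrary illustrative choice
    n_test = round(len(names) * test_fraction)
    return names[n_test:], names[:n_test]


images = [f"tile_{i:04d}.jpg" for i in range(50)]  # hypothetical file names
train_images, test_images = split_train_test(images)
```

Fixing the seed makes the split reproducible, so the same test images are used each time a model is retrained and compared.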
Quality Control
The quality control for model training and selection had these steps:
- Each model was assessed for its Average Precision. A score above 50% meant the model was a possible candidate for object detection.
- Candidate models were used to detect RCHs in 20 images. The results were manually inspected for accuracy.
- Models that passed visual inspection were formally scored using a set of 100 images selected at random.
- Models that passed formal scoring were reviewed in depth using a large sample spanning multiple SGLs.
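The Average Precision check in the first step can be illustrated with the sketch below. It is a generic single-image calculation (greedy matching at an IoU threshold of 0.5, all-point interpolation of the precision-recall curve) and is not necessarily the computation performed by the Mask R-CNN library the authors used; the boxes and scores are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def average_precision(predictions, truths, iou_threshold=0.5):
    """AP for one image: predictions are (box, score) pairs, truths are boxes."""
    if not truths:
        return 0.0
    matched, hits = set(), []
    for box, _ in sorted(predictions, key=lambda p: -p[1]):
        best, best_iou = None, iou_threshold
        for i, truth in enumerate(truths):
            overlap = iou(box, truth)
            if i not in matched and overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)  # each known RCH can be matched only once
        hits.append(best is not None)

    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(truths))

    # Make precision non-increasing as recall grows (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical image with two known RCHs and three predictions.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8), ((21, 21, 31, 31), 0.7)]
ap = average_precision(predictions, truths)
is_candidate = ap > 0.5  # the 50% threshold from the first quality-control step
```

Here two of the three predictions match a known hearth, giving an AP of about 0.83, which would clear the 50% threshold.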
Constraints
Predictions for State Game Lands 264 and 258 could not be processed due to problems in the source files.
(3) Dataset description
Object name
Since the dataset contains multiple objects, filenames are displayed in Zenodo.
Data type
Primary data: Locations of known RCHs. Secondary data: JPEG tiles, XML files, Object Recognition File, Shapefile and GeoJSON of results.
Format names and versions
TIFF, Shapefile, GeoJSON, JPEG, XML, H5, Jupyter Notebook
Creation dates
01/01/2020 – 31/08/2020
Dataset Creators
Jeff Blackadar, object recognition programmer, author, Carleton University. https://orcid.org/0000-0002-8160-0942
Ben Carter, archaeologist, author, data collection and verification, GIS expert, Muhlenberg College. https://orcid.org/0000-0002-7464-0989
Weston Conner, archaeologist, author, data collection and verification, GIS expert, Lehigh University. https://orcid.org/0000-0001-9906-3762
Language
English
License
CC-BY.
Repository location
Documentation: https://zenodo.org/record/4766351
Data for
- Vector files resulting from manual identification of relict charcoal hearths (RCHs). These were used to train the Mask R-CNN model.
- Shapefile - https://zenodo.org/record/4593605
- GeoJSON - https://zenodo.org/record/4593622
- Program for splitting slope analysis (in TIFF format) into smaller tiles (in JPEG format).
- Data files for Training:
- RCH Detection with Mask R-CNN Image Annotations. https://zenodo.org/record/4575582
This file contains a collection of XML files that hold the coordinates of known RCHs on images (from above). These files are known as annotations and are used by Mask R-CNN to identify objects to detect during training of a model.
- RCH Detection with Mask R-CNN Training Images. https://zenodo.org/record/4579935
This file contains all of the images used for training the Mask R-CNN model. Each image contains at least one known RCH.
- Polygons for tiles of LiDAR data. https://zenodo.org/record/4580726
- Program for Mask R-CNN training and prediction. Known as “data_5000_3_rcnn_charcoal_hearths.ipynb” [7]
https://github.com/jeffblackadar/charcoalhearths/blob/master/data_5000_3_rcnn_charcoal_hearths.ipynb
- Data files of the model and predictions
- Resultant trained Mask R-CNN model. https://zenodo.org/record/4579946
- Predictions from the model, in x, y coordinates (not geolocated) in XML format. https://zenodo.org/record/4581281
The format of these files is similar to the training annotations.
- RCH Detection with Mask R-CNN Images. https://zenodo.org/record/4583945
This file contains all of the images representing tiles of LiDAR images of State Game Lands. These images are used for predictions to locate RCHs. (A subset was used for training. See “RCH Detection with Mask R-CNN Training Images” above.)
- Program that produces confidence scores for the predictions above. Known as “data_5000_4_rcnn_charcoal_hearths_count_results.ipynb”
https://github.com/jeffblackadar/charcoalhearths/blob/master/data_5000_4_rcnn_charcoal_hearths_count_results.ipynb
- Program that removes duplicates (because some tiles included multiple SGLs and some tiles overlapped both the north and south LiDAR tile indices of the state), converts squares to their centroids for comparison, and saves unique squares in geolocated vector files. Known as “2_read_predictions_from_xml_put_into_shp.ipynb”
https://github.com/jeffblackadar/charcoalhearths/blob/master/2_read_predictions_from_xml_put_into_shp.ipynb
- Vector file of prediction results.
- Shapefile - https://zenodo.org/record/4593734
- GeoJSON - https://zenodo.org/record/4593747
- Prediction results with additional variables (bins for assessment, ID of training data, cluster analysis and visual confirmation)
- GeoJSON (no Shapefile) - https://zenodo.org/record/4593767
- Variables:
- id = unique identifier made up of the 3-digit SGL number, PAN or PAS (projections) and, within those, a unique four-digit identifier
- score = confidence score
- SGL = State Game Land number
- SGLImage = name of TIFF file of merged LiDAR tiles
- Confirm = Whether the predicted hearth was determined, through visual inspection, to be a likely true positive (Y) or a false positive (N)
- Bin# = bin to which the prediction was assigned; in assessing these predictions we “binned” the results based upon the confidence score.
- Bin_select = 1 if this record (predicted RCH) was selected for assessment within that bin
- TrainID = Original ID of the training data (only training data that matched with a prediction are included).
- Clusters5_300 = resultant clusters from DBSCAN where minimum cluster size = 5 and maximum distance = 300 meters
- Clusters10_500 = resultant clusters from DBSCAN where minimum cluster size = 10 and maximum distance = 500 meters
- Clusters20_1000 = resultant clusters from DBSCAN where minimum cluster size = 20 and maximum distance = 1000 meters
- CLUSTERCT = How many of the above clusters included the predicted RCH (0–3). Derived from the previous three variables.
- 3Cluster = whether or not this predicted RCH was included in all three clusters.
- False Negatives for the tiles around SGL 43 after close visual inspection (at 1:1000 scale).
- GeoJSON - https://zenodo.org/record/4758647
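The three DBSCAN variables and the derived CLUSTERCT and 3Cluster values listed above can be reproduced in outline with the sketch below. It is a minimal pure-Python DBSCAN written for illustration, not the tool the authors used. It assumes that "minimum cluster size" maps to the usual `min_samples` parameter (counting the point itself), that "maximum distance" maps to `eps`, and that coordinates are in metres in a projected system; the sample points are invented.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN over (x, y) points in metres; returns one label per point.

    A point is a core point when at least `min_samples` points (itself included)
    lie within `eps`. Label -1 marks noise.
    """
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_samples:
                queue.extend(more)
    return labels


# (minimum cluster size, maximum distance in metres) for each variable in the text.
SETTINGS = {
    "Clusters5_300": (5, 300.0),
    "Clusters10_500": (10, 500.0),
    "Clusters20_1000": (20, 1000.0),
}


def cluster_counts(points):
    """Run the three DBSCAN settings and count memberships per point (0 to 3)."""
    runs = {name: dbscan(points, eps, min_samples)
            for name, (min_samples, eps) in SETTINGS.items()}
    counts = [sum(runs[name][i] != NOISE for name in SETTINGS)
              for i in range(len(points))]
    return runs, counts


# Hypothetical data: a 5x4 grid of predictions 50 m apart plus one isolated point.
grid = [(50.0 * c, 50.0 * r) for r in range(4) for c in range(5)]
points = grid + [(100000.0, 100000.0)]
runs, clusterct = cluster_counts(points)
three_cluster = [count == 3 for count in clusterct]
```

In this example the twenty grid points fall into a cluster under all three settings, while the isolated point is noise in each, so it would have CLUSTERCT = 0.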
Publication date
11/03/2021
(4) Reuse potential
The model may be reused to detect similar-looking objects in other landscapes. The Shapefile and images may be used to train an improved prediction model. The images and programs may also be reused to train a different object detection model to locate other types of objects in SGLs.