Geospatial and Image Data from the “When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and Around State Game Lands in Pennsylvania” Paper

These data were used to build an object detection model to locate Relict Charcoal Hearths (RCH) as described in the paper “When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and around State Game Lands in Pennsylvania” [1]. This is the first grouping of data for the paper above. The second grouping is also available in this journal; see “Object detection model, image data and results from the ‘When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and around State Game Lands in Pennsylvania’ paper”. These files consist of georeferenced Digital Elevation Model, Hillshade and Slope files derived from LiDAR for the State Game Lands (SGL). Also included are a Shapefile and GeoJSON of State Game Land borders, as well as the program used for downloading the LiDAR files. These data are stored on Zenodo.org.


INTRODUCTION
This dataset consists of LiDAR derivatives covering the State Game Lands (SGLs) of Pennsylvania. It was created with the objective of automatically detecting all relict charcoal hearths (RCHs) in these SGLs. For details on these results, please see "When Computers Dream of Charcoal: Using Deep Learning, Open Tools and Open Data to Identify Relict Charcoal Hearths in and around State Game Lands in Pennsylvania" [1].
Herein we detail the steps taken to assemble the raw data and process it into a usable format. The unprocessed LiDAR data used for this project was obtained through Pennsylvania Spatial Data Access (PASDA) [2]. The process included identifying the data needed to encompass the SGLs and a 1 kilometer buffer around them, downloading the raw LiDAR files in .las format, organizing these files, and processing them into the following LiDAR derivatives: digital elevation model (DEM), hillshade, and slope analysis. The text below and Figure 1 describe the detailed steps.

INITIAL STEPS
Pennsylvania's LiDAR data is organized in two grids for the north and the south of the state, which overlap slightly [3,4]. These grids were merged into a single shapefile via the "Merge Vector Layers" tool in QGIS. Shapefiles for these grids contain attributes such as direct links to download data for each grid tile. Next, we downloaded a shapefile containing SGL boundaries [5] provided by Pennsylvania and added a one-kilometer buffer around each SGL, which was given the same identifying number as the SGL it surrounded. By overlaying this buffer with the unified grid and using the "Join Attributes by Location" function, any grid tile that touched an SGL or its buffer was given that SGL's number as an attribute. This provided a single shapefile of the grid with two central attributes for each tile, one with the SGL/buffer number and one with a download link for the LiDAR data for that tile.
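The buffer-and-join logic can be illustrated with a small Python sketch. This is not the QGIS workflow itself: it is a simplified stand-in that assumes every tile and SGL is reduced to an axis-aligned bounding box in meters, whereas QGIS works on the true polygon geometry. The tile names, SGL numbers and coordinates are hypothetical.

```python
from collections import defaultdict

BUFFER_M = 1000.0  # one-kilometer buffer, assuming coordinates are in meters


def expand(box, distance):
    """Grow an (xmin, ymin, xmax, ymax) box outward by distance on every side."""
    xmin, ymin, xmax, ymax = box
    return (xmin - distance, ymin - distance, xmax + distance, ymax + distance)


def touches(a, b):
    """True when two boxes overlap or share an edge or corner."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def join_tiles_to_sgls(tiles, sgls, distance=BUFFER_M):
    """Map each tile name to the sorted SGL numbers whose buffered box it touches."""
    joined = defaultdict(list)
    for sgl_number, sgl_box in sgls.items():
        buffered = expand(sgl_box, distance)
        for tile_name, tile_box in tiles.items():
            if touches(tile_box, buffered):
                joined[tile_name].append(sgl_number)
    return {name: sorted(numbers) for name, numbers in joined.items()}


# Hypothetical data: three 3 km tiles in a row and two SGL extents.
tiles = {
    "tile_a": (0, 0, 3000, 3000),
    "tile_b": (3000, 0, 6000, 3000),
    "tile_c": (6000, 0, 9000, 3000),
}
sgls = {
    "033": (500, 500, 2500, 1500),
    "044": (4500, 500, 5500, 1500),
}
result = join_tiles_to_sgls(tiles, sgls)
```

In this made-up example, SGL 033 lies entirely inside tile_a, but its buffer reaches into tile_b, so tile_b is associated with both SGLs. Tiles that end up with more than one SGL number are the reason a per-SGL download list can be longer than the list of unique tiles.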
Exporting the attribute table for the state's tile grid into Microsoft Excel allowed us to manipulate the data more easily. We deleted all grid tiles that were not associated with an SGL/buffer. This resulted in a selection of 3,925 unique tiles to download out of the 13,797 total tiles in the grid (28.4%). One field given for each grid tile was a direct link to download the .las file, the raw LiDAR point cloud. Copying this list of URLs into a text document allowed us to use simple bash scripting and the wget command to automate the downloading of these files. Note that the number of tiles downloaded was actually over 4,000, not exactly 3,925. This is because tiles were downloaded consecutively for each individual SGL, and some tiles overlapped more than one SGL.
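The filtering and counting described above can be sketched in Python as follows. This is a minimal illustration rather than the Excel procedure we followed; it assumes the attribute table is available as rows of (SGL number, tile name, URL), with one row per tile and SGL pair. The function name, the rows and the example.com URLs are hypothetical.

```python
def build_url_list(rows):
    """Drop rows without an SGL number and build a per-SGL URL list.

    rows: iterable of (sgl_number, tile_name, url) tuples.
    Returns (text, total, unique): the list text with one URL per line and a
    blank line between SGLs, the number of URLs listed, and the number of
    distinct URLs among them.
    """
    by_sgl = {}
    for sgl_number, tile_name, url in rows:
        if not sgl_number:
            continue  # tile is not associated with any SGL or buffer
        by_sgl.setdefault(sgl_number, []).append(url)
    clusters = ["\n".join(urls) for _, urls in sorted(by_sgl.items())]
    text = "\n\n".join(clusters) + "\n"
    total = sum(len(urls) for urls in by_sgl.values())
    unique = len({url for urls in by_sgl.values() for url in urls})
    return text, total, unique


# Hypothetical rows: tile_b touches two SGLs, tile_d touches none.
rows = [
    ("033", "tile_a", "https://example.com/tile_a.zip"),
    ("033", "tile_b", "https://example.com/tile_b.zip"),
    ("044", "tile_b", "https://example.com/tile_b.zip"),
    ("044", "tile_c", "https://example.com/tile_c.zip"),
    ("", "tile_d", "https://example.com/tile_d.zip"),
]
text, total, unique = build_url_list(rows)

# Share of the statewide grid selected, using the counts reported in the text.
share_percent = round(100 * 3925 / 13797, 1)
```

Here `total` exceeds `unique` because tile_b is listed once for each SGL it touches, which mirrors the gap between the 3,925 unique tiles and the more than 4,000 downloads.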
In order to use bash scripting on a Windows computer, we enabled the Linux Subsystem and downloaded a Linux distribution, as described below. These instructions are current for Microsoft Windows 10 Home, Version 10.0.19041 Build 19041.

Enabling a Linux Subsystem on Windows
First, enable the Linux Subsystem by following these steps:
• On a computer running Windows, open "Settings" from the Start Menu
• Click "Apps" | "Apps & features" | "Optional features"
• Click "More Windows features" under the "Related settings" submenu
• Scroll down to "Windows Subsystem for Linux" and check the box
• Click "OK"
• When prompted, restart the computer
Second, download a Linux distribution through the Microsoft Store app. We used Ubuntu, published by Canonical Group Limited, and our instructions assume you are using the 20.04 LTS version [6]. Once you have downloaded Ubuntu, open the app and follow the prompts to create a username and password.
It is important to note that files saved through this system will be saved to the Linux Subsystem. It can be somewhat difficult to locate these files if you are not familiar with the program. To find where your files will be saved, open the File Explorer and follow these steps:
• Navigate to %localappdata% in the File Explorer search bar
• Open the "Packages" folder
• Housed within should be the folder "CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc" or the equivalent for your Linux distro
• Within this particular folder, navigate down through these subfolders:
  - "LocalState"
  - "rootfs"
  - "home"
  - Username
This Username folder is where you must store and place files for your Linux subsystem to read and edit. It is also where files you download using Linux will be stored.

Figure 1 Steps to download and process LiDAR data files for State Game Lands.

Downloading .las Files with Wget
To download the .las files using this Linux subsystem, we used the wget command. This command tells the computer to download a file from a specific URL or to download multiple files from a list of URLs in a text file. For this dataset, we copied the column from Excel containing the URLs of .las files for each SGL and pasted it into a text document. There was a single URL per line; this is critical. Blank lines separated each SGL into clusters within the text document due to Excel formatting, but this did not affect the functioning of wget.
The wget command was formatted with several modifiers. To run wget as an administrator, which is necessary to execute the command, you must use sudo, which means "superuser do." Using the modifier -i allowed us to specify a text file to be read, and using the modifier -P allowed us to specify where the downloads should be saved. The final command executed in Ubuntu is:

    sudo wget -i SampleList.txt -P sample_file_folder

In this example, SampleList.txt is the name of the text file containing the URLs and sample_file_folder is the name of the folder where downloads should be saved. Both this file and this folder should be in the 'Username' folder discussed earlier prior to running the command.
After pressing enter to run the command, wget will download the files from the list and save them in the specified folder. The Linux app will display progress and the files will appear in the specified folder.
This process may take some time, and we left wget running overnight. Using this process, over 4,000 LAS files were downloaded, for a total of about 350 gigabytes of data in zip files. The files contained the LiDAR data for every tile touching a State Game Land, or its buffer, in Pennsylvania.
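For readers who prefer not to use wget, the same idea (read a list with one URL per line, ignore blank lines, save everything to one folder) can be sketched with the Python standard library. This is an illustrative alternative, not the method we used. The function names are hypothetical, and skipping files that already exist is a design choice of this sketch so that tiles shared by several SGLs are fetched only once.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_url_list(text):
    """Return the non-blank lines of a URL list, one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_all(urls, folder, fetch=urlretrieve):
    """Save each URL into folder, skipping files that are already present.

    fetch(url, target_path) performs the transfer; it defaults to
    urllib.request.urlretrieve and can be replaced for testing.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Name the local file after the last component of the URL path.
        target = folder / Path(urlparse(url).path).name
        if not target.exists():
            fetch(url, str(target))
        saved.append(target)
    return saved


# Example usage, with the file and folder names from the wget example:
#   urls = read_url_list(Path("SampleList.txt").read_text())
#   download_all(urls, "sample_file_folder")
```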

ORGANIZING FILES
Although we had downloaded all of the data, it was all stored within a single folder. Separating these files into separate folders by SGL was accomplished with a program (ProjectLambda) created by Moritz "Moe" Schiesser. Moe Schiesser's program read an Excel sheet containing two columns (Column A: the current file location; Column B: the desired file location) and copied each file to the appropriate folder (see Table 1). Note that the file structure in Column B had been created prior to running this program.
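The copy step can be sketched in a few lines of Python. This is not ProjectLambda itself, only an illustration of the same two-column idea. It assumes the sheet has been saved as a CSV file with no header row, and, unlike the workflow described above, it creates any missing destination folders. The function name is hypothetical.

```python
import csv
import shutil
from pathlib import Path


def copy_from_mapping(mapping_csv):
    """Copy each file named in column A to the location named in column B."""
    copied = []
    with open(mapping_csv, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2 or not row[0].strip():
                continue  # skip blank or incomplete rows
            source = Path(row[0].strip())
            destination = Path(row[1].strip())
            # Assumption of this sketch: create the SGL folder if it is missing.
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, destination)
            copied.append(destination)
    return copied
```

Because a tile can appear in several rows with different destinations, a file that touches more than one SGL is simply copied into each of those SGL folders.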

PROCESSING LIDAR DATA
The .las files, which were now separated into folders by SGL, were processed through LAStools into three LiDAR derivatives: a digital elevation model (DEM), hillshade, and slope analysis [7]. This process was automated using another program (ProjectKappa) written by Schiesser. This program read the default file names for Pennsylvania's .las files to determine which coordinate reference system (CRS) to use (State Plane 83 PA South or State Plane 83 PA North) and saved the three LiDAR derivative outputs to the same SGL folder. All tiles for each SGL were combined into one image (for each derivative) in TIF format. For those SGLs that contained portions in both the North and South CRS, it processed the files for each CRS separately, which means some SGLs had two DEMs, hillshades, and slope analyses. In the end, the total dataset, including all .las files and LiDAR derivatives, was 1.3 terabytes.
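The CRS selection and per-CRS grouping can be sketched as follows. This is not ProjectKappa; it only illustrates the grouping logic. The naming convention is an assumption made for this example: file stems are taken to end in "PAN" for the north zone and "PAS" for the south zone, and the sample file names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical convention: the file stem ends in "PAN" (north) or "PAS" (south).
CRS_BY_SUFFIX = {
    "PAN": "State Plane 83 PA North",
    "PAS": "State Plane 83 PA South",
}


def crs_for(filename):
    """Return the CRS name implied by the assumed file name suffix."""
    stem = Path(filename).stem.upper()
    for suffix, crs in CRS_BY_SUFFIX.items():
        if stem.endswith(suffix):
            return crs
    raise ValueError(f"cannot determine CRS from file name: {filename}")


def group_by_crs(filenames):
    """Group the .las files of one SGL by CRS so each group is processed separately."""
    groups = defaultdict(list)
    for name in filenames:
        groups[crs_for(name)].append(name)
    return dict(groups)


# Hypothetical SGL folder that straddles both zones.
groups = group_by_crs(["10001000PAN.las", "10001010PAN.las", "10000990PAS.las"])
```

An SGL whose tiles fall into both groups would yield two sets of derivatives, one per CRS, as described above.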

SAMPLING STRATEGY
The project studied all SGLs in Pennsylvania. We added a 1 km buffer outside of the boundary of each SGL in order to ensure objects on the edges of SGLs were fully visible. All LAS tiles of which any portion fell within the SGL or buffer were included.

QUALITY CONTROL
We wrote a program to load each file into QGIS according to the list of SGLs in order to validate that each file was present and opened correctly. An example of this program, which loads the layers for Northeast Pennsylvania, is available at https://github.com/jeffblackadar/charcoalhearths/blob/master/qgis_3_load_rasters_north_east.py.
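A lighter presence check can also be run outside QGIS. The sketch below is not the QGIS script linked above: it only confirms that a non-empty .tif exists for each derivative of each SGL and does not test whether the raster opens correctly. The folder layout (one folder per SGL number) and the file name fragments are assumptions made for this example.

```python
from pathlib import Path

# Hypothetical file name fragments for the three derivatives.
DERIVATIVES = ("dem", "hillshade", "slope")


def missing_derivatives(root, sgl_numbers, derivatives=DERIVATIVES):
    """Return (sgl_number, derivative) pairs lacking a non-empty .tif file.

    Files are looked up as root/<sgl_number>/*<derivative>*.tif.
    """
    root = Path(root)
    missing = []
    for sgl in sgl_numbers:
        for derivative in derivatives:
            matches = [
                path
                for path in (root / sgl).glob(f"*{derivative}*.tif")
                if path.stat().st_size > 0
            ]
            if not matches:
                missing.append((sgl, derivative))
    return missing
```

Calling `missing_derivatives("derivatives_root", ["033", "044"])`, where the folder name is a placeholder, returns an empty list when every expected file is present.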

CONSTRAINTS
The files for SGLs 264 and 258 could not be processed due to problems in the source files.

(3) DATASET DESCRIPTION

OBJECT NAME
Since the dataset contains multiple objects, filenames are displayed in Zenodo.

DATA TYPE
Primary data: Digital Elevation Model, Slope, Hillshade files. Secondary data: SGL boundaries.

(4) REUSE POTENTIAL
This dataset is useful for further study of Pennsylvania's SGLs. Generally, SGLs are relatively undeveloped today, preserving archaeological evidence such as RCHs, building foundations and roadbeds. The Shapefile and GeoTIFFs may be used for further study of the SGLs for RCHs or other types of objects. The method and programs are useful for collecting and processing LiDAR files to study areas of Pennsylvania outside of the SGLs. The process may also be adapted for use with LiDAR datasets for geographic areas outside of Pennsylvania.