AstroMaterials PDS Archive Description
Stephen M. Richard
2021-06-28

This document provides a quick overview of the layout of the AstroMaterials geochemistry database archive. The archive is a dump of data from the Elasticsearch instance under the stewardship of the Lamont-Doherty Earth Observatory of Columbia University, accessible at https://www.astromat.org/. The AstroMaterials Data System (AstroMat) is a project started in 2018 with NASA funding to build and operate a data infrastructure for laboratory data acquired from samples curated in the Astromaterials Collection of the Johnson Space Center. AstroMat will contain data from past, present, and future studies. The archive will be updated periodically as new data are added.

The archive contains comma-delimited text (.csv) files that serialize tables best used in a relational database. There are data tables, with names ending in the suffix '_t.csv', and lookup tables, with names prefixed 'astro_lkup_'. Each data table has a primary key, which is the first column in each row. Relationships (joins) between tables are described in the text below and in the File_Area_Observational/File/comment element of the XML label files. The recommended usage is to import the tables into a relational database system such as PostgreSQL, MySQL, Microsoft Access, or SQL Server and construct views that combine joined data. The tables are not normalized, so in many cases the .csv files are also useful when opened with spreadsheet software such as Microsoft Excel or OpenOffice Calc, or when imported into a pandas DataFrame in a Python environment.

Archive layout:
This is the logical structure for the archive.

Bundle: astromat_chem
Description: contains all data from the Lamont AstroMaterials database stored in an Elasticsearch instance
• This readme.txt file
• Bundle_Label: bundle_astromat_chem.xml

Collection: astromat_data
• One collection of products that contain data.
• Collection inventory: collection_astromat_data.csv, contains URN identifiers for the products in the collection.

-Products
• All product files have the prefix 'astro_' in the file name. Products derived directly from an AstroMaterials System Elasticsearch (ES) index have the ES index name following that prefix. Several of the fields included in the index allow multiple values, e.g. identifier, authors, geoFeature, children. For these, the archive includes a set of product files with names prefixed 'astro_lkup_'. For most of these, the cross-reference is denormalized: many index records might reference the same lookup item. To streamline the archive process, these values are repeated in the lookup table. In the case of geoFeature, there is a separate index for those features, and thus an archive file (astro_geofeature_t) that contains external identifiers for the feature if they are available. Details about the semantics and implementation of individual fields in each table are documented in the XML label files.

---Tables from Elasticsearch indexes:

astro_analysis_action_t.csv
Table contains descriptions of analysis actions from which the chemical data were determined. analysisActionNum is the primary key. Use the primary key to link to astro_sample_data_t.csv to bind the analysis method information with analytical result data.

astro_citation_t.csv
Table contains citation information for sources of analytical data in this database. The astro_dataset_t, astro_analysis_action_t, and astro_sample_data_t tables all contain a citation.id field that is a foreign key to citation_id in this table. Citations are linked to authors listed in astro_lkup_authors.csv via a join from citation_id to foreignKey in astro_lkup_authors.

astro_dataset_t.csv
Table contains information about datasets compiled in this database. Datasets generally correspond to individual data tables in a publication.
Analyzed material and analysis type are typically, but not necessarily, consistent within a dataset. Datasets are joined to individual analysis results in the astro_sample_data_t.csv table via dataset.id. The first reported material and analysis type are included in this table, but a full listing would require queries on the astro_sample_data_t table.

astro_expedition_t.csv
Table contains brief information describing the expeditions whose returned samples were analyzed to generate the data included in this compilation. Includes meteorite search projects on Earth, as well as NASA missions to extraterrestrial bodies (like the Apollo missions to the Moon). Samples are linked to expeditions via a join from astro_expedition_t.expedition_id to astro_sample_t.expedition.id.

astro_geofeature_t.csv
Table of named locations from which samples were acquired. The AstroMaterials data model allows more than one location to be associated with a sample; one location is specified in astro_sample_t (chosen arbitrarily as the first one encountered in archive processing) and is linked by a join from astro_sample_t.geoFeatures.id to astro_geofeature_t.geoFeature_id. The correlation table astro_lkup_geoFeatures.csv links samples with locations with a join on astro_sample_t.sample_id = astro_lkup_geoFeatures.foreignKey.

astro_sample_data_t.csv
Table contains a record for each analyte result from each analysis for each sample. It is massively denormalized because it is designed for a search index, not data management. Each analysis result is joined with an analysis action (based on analysisActionNum), a citation (based on citation.id), a dataset (based on dataset.id), and a sample (based on sample.id).
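
The four joins described for astro_sample_data_t.csv can be sketched as a database view. This is a minimal illustration, not the archive's own tooling: it uses Python's built-in SQLite in place of the database systems named above, and the rows are placeholders, not archive data. The join columns follow this readme, except that dataset_id as the key of astro_dataset_t is an assumption; row_id, method, title, dataset_title, and sample_name are hypothetical column names chosen for the example. Check the XML label files for the actual field names.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text into a new SQLite table; every column is stored as TEXT."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    marks = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), rows)


# Placeholder rows for illustration only; they are NOT taken from the archive.
# Non-key column names (row_id, method, title, ...) are hypothetical.
TABLES = {
    "astro_sample_data_t": (
        "row_id,analysisActionNum,citation.id,dataset.id,sample.id\n"
        "r1,a1,c1,d1,s1\n"
        "r2,a1,c1,d1,s9\n"  # s9 has no matching sample row
    ),
    "astro_analysis_action_t": "analysisActionNum,method\na1,method-x\n",
    "astro_citation_t": "citation_id,title\nc1,title-x\n",
    "astro_dataset_t": "dataset_id,dataset_title\nd1,dataset-x\n",
    "astro_sample_t": "sample_id,sample_name\ns1,sample-x\n",
}

conn = sqlite3.connect(":memory:")
for name, text in TABLES.items():
    load_csv(conn, name, text)

# Column names containing a dot must be double-quoted in SQL.
# LEFT JOIN keeps a result row even when a referenced record is missing.
conn.execute(
    """
    CREATE VIEW sample_data_v AS
    SELECT d."row_id", a."method", c."title", ds."dataset_title", s."sample_name"
    FROM astro_sample_data_t AS d
    LEFT JOIN astro_analysis_action_t AS a
        ON a."analysisActionNum" = d."analysisActionNum"
    LEFT JOIN astro_citation_t AS c ON c."citation_id" = d."citation.id"
    LEFT JOIN astro_dataset_t AS ds ON ds."dataset_id" = d."dataset.id"
    LEFT JOIN astro_sample_t AS s ON s."sample_id" = d."sample.id"
    """
)

rows = conn.execute('SELECT * FROM sample_data_v ORDER BY "row_id"').fetchall()
for row in rows:
    print(row)
```

With the real files, load_csv could be given the text of each .csv file instead of the placeholder strings. Loading every column as TEXT keeps the sketch short; numeric columns would need explicit types for real analysis.
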
astro_sample_t.csv
Table contains descriptions of all samples and subsamples. Child samples are linked via sample_id to the astro_lkup_children.csv table; sample identifiers are linked via sample_id to the astro_lkup_identifiers.csv correlation table; geoFeatures are linked via sample_id to the astro_lkup_geoFeatures.csv map; taxons are linked via sample_id to the astro_lkup_taxons.csv map.

---Lookup/correlation tables:

astro_lkup_authors.csv
Table correlates authors with datasets. Author names have not been normalized, so the same author might have multiple unique identifiers in the author_id column. author_id is generated by concatenating a prefix 'auth#' and citation_id, where # is the sequence number of that author in the particular citation. The table fields include first (given) and last (family) name, a sequence number to establish author order, and identifier(s) if available. Zero to many identifiers might be associated with a person. The first listed identifier is included in this table, specified by the identifiers.identifier, identifiers.identifierType, and identifiers.link columns; see astro_lkup_identifiers.csv for an explanation of these columns. If there are other identifiers, they are linked to the author via a join from author_id to the astro_lkup_identifiers.csv foreignKey. All provided identifiers are ORCIDs. Join the foreignKey field in this table to citation_id in the astro_citation_t table.

astro_lkup_children.csv
Correlation table that maps a parent sample_id in the foreignKey field to the sample_id of a child sample in the id field.

astro_lkup_collections.csv
Table that maps citation_id from astro_citation_t to a collection name and type. Table is denormalized; only 7 unique collections are used.

astro_lkup_geoFeatures.csv
Table mapping samples to geoFeature named locations.
Join foreignKey to astro_sample_t.sample_id, and astro_lkup_geoFeatures.id to astro_geofeature_t.geoFeature_id.

astro_lkup_identifiers.csv
Correlation table. The foreignKey field joins to the xxx_id key of the containing index; it is used by astro_expedition_t, astro_geofeature_t, astro_citation_t, astro_sample_t, and astro_lkup_authors. Most of these records have only one identifier, but multiple identifiers are allowed; this correlation table implements that relationship.

astro_lkup_taxons.csv
Table mapping taxon classes to samples; join foreignKey to sample_id.

-PDS label files
The PDS label files are XML metadata files included in the archive. Each data file has a corresponding label file with the same name, but with the suffix '.xml'. These files were generated manually using an XML editor, following guidance in The PDS4 Data Provider’s Handbook, v1.14.0.1.
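
Because the join descriptions are also recorded in the File_Area_Observational/File/comment element of each label, a short script can pull them out. The sketch below makes two assumptions: that "same name, but suffix '.xml'" means the '.csv' extension is replaced by '.xml', and that matching elements by local name is acceptable, so the code does not depend on the namespace declared in a given label. STUB_LABEL is a hypothetical stand-in written for this example, not an actual label from the archive.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def label_path_for(data_file):
    """Return the assumed label path: the data file name with suffix '.xml'."""
    return Path(data_file).with_suffix(".xml")


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def file_comments(label_xml):
    """Return the text of each File_Area_Observational/File/comment element."""
    root = ET.fromstring(label_xml)
    comments = []
    for area in root.iter():
        if local_name(area.tag) != "File_Area_Observational":
            continue
        for file_el in area:
            if local_name(file_el.tag) != "File":
                continue
            for child in file_el:
                if local_name(child.tag) == "comment" and child.text:
                    comments.append(child.text.strip())
    return comments


# Hypothetical stub label for illustration; real labels contain far more.
STUB_LABEL = """
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <File_Area_Observational>
    <File>
      <file_name>astro_sample_t.csv</file_name>
      <comment>placeholder join description</comment>
    </File>
  </File_Area_Observational>
</Product_Observational>
"""

print(label_path_for("astro_sample_t.csv"))
print(file_comments(STUB_LABEL))
```

For a label on disk, the file contents could be passed in with file_comments(label_path_for(data_file).read_text()).
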