Gen3 - Data Dictionary
We recommend data contributors and analysts familiarize themselves with the Gen3 data dictionary of their data commons by clicking “Dictionary” in the top navigation bar or by entering the URL https://gen3.datacommons.io/dd. Here one can find a graphical model that illustrates how a particular data commons organizes and describes its data.
The following sections describe:
In a Gen3 data commons, a semantic distinction is made between two types of data: “data files” and “metadata”.
A “data file” could be information like tabulated data values in a spreadsheet or a fastq/bam file containing DNA sequences. The contents of the file are not exposed to the API as queryable properties, so the file must be downloaded to view its content.
“Metadata” are variables that help to organize or convey additional information about corresponding data files so that they can be queried via the Gen3 data commons’ API or viewed in the Gen3 data commons’ data exploration tool. In a Gen3 data dictionary, variable names are termed “properties”, and data contributors provide the values for these pre-defined properties in their data submissions.
Examples of Metadata are:
This format allows analysts to search for files of interest using the metadata for querying or filtering.
Metadata can be browsed or queried by clicking on the “Exploration” or “Query” link in the top navigation bar of the data commons submission portal, respectively. For more information on accessing data in the submission portal, see the documentation on accessing data.
Metadata queries can also be sent to the API programmatically, using, for example, the python requests package. For more information on sending requests to the API, see the API documentation.
If you would like to review how to upload these different types of data to the gen3 commons, see the documentation on data submission.
Every Gen3 data commons employs a data model, which serves to describe, organize, and harmonize data sets submitted by different users. Data harmonization facilitates cross-project analyses and is thus one of the pillars of the data commons paradigm.
The data model organizes experimental metadata variables, “properties”, into linked categories, “nodes”, through the use of a data dictionary. The data dictionary lists and describes all nodes in the data model, as well as defines and describes the properties in each node.
Each node in the data dictionary is linked in a logical manner to other nodes, which facilitates generating a visual overview, or graphical model, of a project.
Properties can be assigned different types, depending on the value of the data that will be submitted to them. The following types are available: string, boolean, number, integer, enum, null, array, and regex patterns. More information on the property value types can be found here.
The following image displays the data dictionary in Table View, the ‘Medical History’ node entry in the dictionary with the list of properties, and an example graphical model of a project:
Having all participating members use the same data model:
If a desired submission element is not currently described in the model, users will need to work with the commons to extend the data model. Provide the commons with a description of the requested data elements, and they will work with the sponsor or data modeling working group to review the request and find an appropriate home for the data elements.
In the case of developing a personal data dictionary, such as for use with Docker Compose, please note that due to the graph nature of the data model, some nodes are dependent on others. In addition, the Windmill service specifies nodes that are required for it run properly through preset parameters. For example, if Windmill is set to use the default data dictionary, it will require the
The Data Dictionary Viewer is designed to make it easier to understand the data model, the field types associated with each node, and the potential values associated with each field. It displays available fields in a node and the dependencies a given node has to the existence of a prior node. This is an invaluable tool for both the submission of data and later analysis of the entire commons.
The Data Dictionary Viewer allows toggling views and browsing the nodes as a graph and as tables.
Gen3 members can use it through the ‘Dictionary’ icon at https://gen3.datacommons.io/dd.
NOTE: For these user guides, https://gen3.datacommons.io is an example URL and can be replaced with the URL of other data commons powered by Gen3.