Loading, displaying, and plotting structured metadata stored under the NHGRI project on the Canine Data Commons

1. Introduction to the commons

The Canine Data Commons supports the management, analysis and sharing of genomics data for the canine research community and aims to accelerate opportunities for discovery and development for the treatment and prevention of canine cancer.

2. Install dependencies and import Python libraries

# Uncomment the lines to install libraries if needed.
# !pip install --force --upgrade gen3 --ignore-installed certifi
# !pip install numpy
# !pip install matplotlib
# !pip install pandas
# !pip install seaborn

# Import libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
import seaborn as sns
import re
from pandas import DataFrame
import warnings
warnings.filterwarnings("ignore")
!pip install gen3

3. Start interacting with the data commons by authenticating with the SDK

The Gen3 Python SDK is a Python library containing classes and functions for sending common requests to the Gen3 APIs. The SDK is open source, and its full documentation is available in the uc-cdis/gen3sdk-python repository on GitHub.

# Import Gen3 SDK tools
import gen3
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission
from gen3.index import Gen3Index

# Define the Gen3 API (URL of the Gen3 commons)
endpoint = "https://caninedc.org/"

# Download the credentials JSON from https://caninedc.org/identity and set the path to the JSON file.
creds = "/user/path/canine_creds.json"

# Authentication using the class "Gen3Auth", which generates access tokens.
auth = Gen3Auth(endpoint, creds)
sub = Gen3Submission(endpoint, auth)

home_directory = os.getcwd() # replace with a path if needed.
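
Authentication fails with a fairly opaque error when the credentials path is wrong, so it can help to check the file before passing it to "Gen3Auth". The sketch below is not part of the SDK: "check_credentials_file" is a hypothetical helper, and it assumes that the downloaded credentials JSON contains the keys "api_key" and "key_id". The demonstration writes placeholder values to a temporary file and never contacts the commons.

```python
import json
import os
import tempfile


def check_credentials_file(path):
    """Return True if `path` is a readable JSON file with the expected keys."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path) as handle:
            content = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return False
    # Assumption: a Gen3 credentials file holds "api_key" and "key_id".
    return isinstance(content, dict) and {"api_key", "key_id"}.issubset(content)


# Demonstration with a throwaway file holding placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "canine_creds.json")
    with open(demo_path, "w") as handle:
        json.dump({"api_key": "placeholder", "key_id": "placeholder"}, handle)
    ok_result = check_credentials_file(demo_path)
    missing_result = check_credentials_file(os.path.join(tmp, "absent.json"))

print(ok_result, missing_result)
```

In the notebook, the same check could be run on "creds" before the "Gen3Auth" call.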

4. Download structured metadata from the dataset "NHGRI" using the SDK

# First, we need to know which program and project we want to download the structured data from.
# Programs and projects of interest are listed on the submission page of a commons; for this notebook, that is https://caninedc.org/submission.
# In this notebook, we select program "Canine" and project "NHGRI"
program = "Canine"
project = "NHGRI"

# Now we can search for structured data that is stored under nodes in the NHGRI project.
# All nodes in the NHGRI project can be found on the graph model on https://caninedc.org/Canine-NHGRI.
# For this notebook, we want to take a look at the structured data that is stored under the nodes "subject" and "sample".

# Export the structured data that is stored under the two nodes using the SDK function "export_node":
# Syntax: subject_data = sub.export_node(program, project, node_type, fileformat, filename)
subject_data = sub.export_node(program, project, "subject", "tsv", os.path.join(home_directory, "subject.tsv"))
sample_data = sub.export_node(program, project, "sample", "tsv", os.path.join(home_directory, "sample.tsv"))

Output written to file: /Users/xeniaritter/Documents/CDIS_Tasks/Goals20-21/my_notebooks/subject.tsv

Output written to file: /Users/xeniaritter/Documents/CDIS_Tasks/Goals20-21/my_notebooks/sample.tsv

5. Load the NHGRI dataset into pandas, show the dataframe, group, and plot

# Load the downloaded subject TSV file into a pandas dataframe, using the tab character '\t' as the delimiter.
subject = pd.read_csv(os.path.join(home_directory, "subject.tsv"), sep='\t', header=0)

# As "subject" is now a dataframe, we can call pandas methods on it with the syntax "subject.method()".

# Return the first 5 rows of the dataframe "subject"
subject.head()

# Commands to show dataframe shape and info:
# 1. Return a summary of the dataframe (columns, non-null counts, dtypes) using: subject.info()
# 2. Return the dimensions of the dataframe as (rows, columns) using: subject.shape
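
The loading and inspection steps can be tried without access to the commons. The following is a minimal sketch on a made-up three-row table held in memory; the rows and the name "demo" are illustrative and are not taken from the NHGRI export.

```python
import io

import pandas as pd

# A tiny tab-separated table standing in for the exported subject.tsv.
tsv_text = (
    "submitter_id\tbreed\tdisease_type\n"
    "s1\tSaluki\thealthy\n"
    "s2\tAkita\thealthy\n"
    "s3\tSaluki\thealthy\n"
)

# sep="\t" splits on tabs; header=0 takes the column names from the first line.
demo = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0)

print(demo.shape)          # (rows, columns)
print(list(demo.columns))  # column names
demo.info()                # per-column non-null counts and dtypes
```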

# Dropping every column in which all values are NaN and storing the result as a new dataframe
subject_clean = subject.dropna(axis = 1, how = 'all')
subject_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1090 entries, 0 to 1089
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   type                     1090 non-null   object
 1   id                       1090 non-null   object
 2   project_id               1090 non-null   object
 3   submitter_id             1090 non-null   object
 4   breed                    1090 non-null   object
 5   disease_type             1090 non-null   object
 6   primary_site             1090 non-null   object
 7   species                  1090 non-null   object
 8   tissue_source_site_code  1090 non-null   object
 9   studies.id               1090 non-null   object
 10  studies.submitter_id     1090 non-null   object
dtypes: object(11)
memory usage: 93.8+ KB
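
The choice of how='all' matters: it removes only columns that are empty in every row, whereas how='any' would also remove columns with a single missing value. A small sketch on made-up data (the frame "demo" is illustrative only):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "breed": ["Saluki", "Akita", "Dingo"],
    "index_date": [np.nan, np.nan, np.nan],                  # empty in every row
    "primary_site": ["whole blood", np.nan, "whole blood"],  # empty in one row
})

# how="all": drop a column only if every value in it is NaN.
all_dropped = demo.dropna(axis=1, how="all")
# how="any": drop a column as soon as it contains one NaN.
any_dropped = demo.dropna(axis=1, how="any")

print(list(all_dropped.columns))
print(list(any_dropped.columns))
```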

# Return only one column from the dataframe "subject_clean". Here we show two options to do this.
# Option 1: call the name of the column in the dataframe
subject_clean['species']

0       Canis lupus familiaris
1       Canis lupus familiaris
2       Canis lupus familiaris
3       Canis lupus familiaris
4       Canis lupus familiaris
                 ...
1085    Canis lupus familiaris
1086    Canis lupus familiaris
1087    Canis lupus familiaris
1088    Canis lupus familiaris
1089    Canis lupus familiaris
Name: species, Length: 1090, dtype: object

# Option 2: use the indexer "iloc" with the column position (column 7 is "species")
subject_clean.iloc[:, 7]

0       Canis lupus familiaris
1       Canis lupus familiaris
2       Canis lupus familiaris
3       Canis lupus familiaris
4       Canis lupus familiaris
                 ...
1085    Canis lupus familiaris
1086    Canis lupus familiaris
1087    Canis lupus familiaris
1088    Canis lupus familiaris
1089    Canis lupus familiaris
Name: species, Length: 1090, dtype: object
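
Selecting by position depends on the column order, which changes whenever columns are dropped. One way to avoid a hard-coded position is to look it up from the column name. A minimal sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({
    "submitter_id": ["s1", "s2"],
    "breed": ["Saluki", "Akita"],
    "species": ["Canis lupus familiaris", "Canis lupus familiaris"],
})

by_label = demo["species"]

# Look up the integer position of the column instead of hard-coding it.
position = demo.columns.get_loc("species")
by_position = demo.iloc[:, position]

print(position)
print(by_label.equals(by_position))
```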

# Removing columns not necessary for data analysis with the function "drop"
subject_clean_slim = subject_clean.drop(columns=['type', 'id', 'project_id', 'studies.id', 'studies.submitter_id'])
subject_clean_slim.head()

# We can count the occurrences of different breeds using three options.
# Option 1: Use the function "value_counts"
subject_clean_slim_breeds = subject_clean_slim.breed.value_counts()
print(subject_clean_slim_breeds)

Saluki                                 29
Standard Schnauzer                     17
Italian Greyhound                      17
Saint Bernard                          17
Great Pyrenees                         13
                                       ..
Maltese                                 2
American Staffordshire Bull Terrier     2
Dingo                                   1
Unknown                                 1
Norwegian Elkhound                      1
Name: breed, Length: 132, dtype: int64

# Option 2: Use the function "groupby" and let pandas show the counts in descending order using "sort_values":
subject_clean_slim.groupby('breed').size().sort_values(ascending=False)

breed
Saluki                29
Saint Bernard         17
Italian Greyhound     17
Standard Schnauzer    17
Great Pyrenees        13
                      ..
Parson Russell         2
Puli                   2
Unknown                1
Dingo                  1
Norwegian Elkhound     1
Length: 132, dtype: int64
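
The two outputs above contain the same counts but list tied breeds (for example, the three breeds with 17 subjects) in a different order, because neither option defines how ties are ordered. The sketch below, on made-up data, checks that both options agree and shows one way to get a reproducible order by sorting on the count and then on the breed name. The column name "n" is an arbitrary choice for this example.

```python
import pandas as pd

demo = pd.DataFrame(
    {"breed": ["Saluki", "Akita", "Saluki", "Puli", "Akita", "Saluki", "Dingo"]}
)

via_value_counts = demo["breed"].value_counts()
via_groupby = demo.groupby("breed").size().sort_values(ascending=False)

# Same counts per breed, regardless of the row order of each result.
same_counts = via_value_counts.to_dict() == via_groupby.to_dict()

# Deterministic order: count descending, then breed name ascending for ties.
ordered = (
    demo.groupby("breed").size()
    .reset_index(name="n")
    .sort_values(["n", "breed"], ascending=[False, True])
)

print(same_counts)
print(ordered["breed"].tolist())
```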

# We can directly plot the top entries using matplotlib.pyplot as plt (imported at the beginning)
subject_clean_slim.groupby('breed').size().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('n')
plt.title('Breeds')
plt.xlim(-1, 17.5) # setting the limits to the first 18 entries (instead of the full 132)
plt.show()

# Option 3: We can also count the entries per breed with the function "pivot_table" and save the result as a new file
countsbreed = subject_clean_slim.pivot_table(index=['breed'], aggfunc='size')
print(countsbreed)
countsbreed.shape

# Save the file to csv
countsbreed.to_csv('countsbreed.csv')

# Loading the saved file
counts_breed = pd.read_csv("countsbreed.csv", header=0)

# Renaming the columns by assigning a list of names to the attribute "columns"
counts_breed.columns = ['breed', 'counts']
counts_breed.head(10) # shows the first 10 rows

breed
Afghan Hound                   10
Airedale Terrier                3
Akita                          10
Alaskan Malamute               10
American Cocker Spaniel        10
                               ..
Whippet                        10
Wirehaired Pointing Griffon     6
Wolf                            7
Xigou                           5
Yorkshire Terrier              10
Length: 132, dtype: int64
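
If the CSV file itself is not needed, the same two-column table of breeds and counts can be built in memory. This is an alternative sketch on made-up data, not the approach used in this notebook; it relies on "reset_index" with the "name" argument to turn the counted Series into a dataframe.

```python
import pandas as pd

demo = pd.DataFrame({"breed": ["Saluki", "Akita", "Saluki", "Puli"]})

# size() returns a Series indexed by breed; reset_index(name=...) converts it
# to a dataframe with the columns "breed" and "counts".
counts_demo = demo.groupby("breed").size().reset_index(name="counts")

print(counts_demo)
```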

# Create pie chart of breeds showing only top 13 entries
top13 = counts_breed[counts_breed.counts > 9].nlargest(13, 'counts') # top 13 entries with counts > 9
data = top13['counts']
categories = top13["breed"]

fig1, ax1 = plt.subplots()
ax1.pie(data, labels=categories, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
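
Note that the percentages printed by "autopct" are relative to the sum of the plotted slices, not to all subjects in the project. The following sketch, on made-up counts, computes both kinds of share to make the difference visible; the names "share_of_slices" and "share_of_all" are illustrative.

```python
import pandas as pd

counts_demo = pd.DataFrame({
    "breed": ["Saluki", "Akita", "Puli", "Dingo"],
    "counts": [29, 17, 13, 1],
})

# Keep the three largest groups, as the pie chart does with its top entries.
top = counts_demo.nlargest(3, "counts")

# Share within the plotted slices (what the pie chart labels show).
share_of_slices = top["counts"] / top["counts"].sum() * 100
# Share within the whole table (includes the rows left out of the chart).
share_of_all = top["counts"] / counts_demo["counts"].sum() * 100

print(share_of_slices.round(1).tolist())
print(share_of_all.round(1).tolist())
```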

# We want to show the breeds that have more than 9 counts and group all other breeds as "Other".
# First, we split the breeds into two bins: those with more than 9 counts and those with 9 or fewer.
# The line below keeps the original "breed" where the count is above 9 and sets the rest to "Other".
counts_breed["new_breed"] = np.where(counts_breed["counts"] > 9, counts_breed['breed'], 'Other')

# Using the groupby function from before, we can sum the counts for each entry in "new_breed"
count_table = counts_breed.groupby('new_breed')['counts'].sum() # select "counts" so that only this column is summed
count_table = count_table.reset_index() # turn the index "new_breed" back into a regular column
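
The bucketing step can be checked on a small made-up table. This sketch uses the same threshold of 9; the variable "threshold" and the example counts are assumptions for illustration.

```python
import numpy as np
import pandas as pd

counts_demo = pd.DataFrame({
    "breed": ["Saluki", "Akita", "Puli", "Dingo"],
    "counts": [29, 10, 2, 1],
})

threshold = 9
# Keep the breed name where the count exceeds the threshold, else use "Other".
counts_demo["new_breed"] = np.where(
    counts_demo["counts"] > threshold, counts_demo["breed"], "Other"
)

# Sum the counts per bucket; the two small breeds are merged into "Other".
bucketed = counts_demo.groupby("new_breed")["counts"].sum().reset_index()

print(bucketed)
```

The total number of subjects is unchanged by the bucketing, which is a quick way to confirm that no rows were lost.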

# Return a pie chart of the results
mpl.rcParams['font.size'] = 11.5 # Set the font size before drawing so that it applies to the labels
top10 = count_table[count_table.counts > 9].nlargest(10, 'counts') # keep the 10 largest groups, which may include "Other"
data = top10["counts"]
categories = top10["new_breed"]
fig1, ax1 = plt.subplots(figsize=(10, 10)) # Set the figure size when creating the figure
ax1.pie(data, labels=categories, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

# Save the figure above
fig1.savefig('plot.png')

<Figure size 432x288 with 0 Axes>

End of notebook.

Output of subject.head() (first 5 rows of the dataframe "subject"):
	type	id	project_id	submitter_id	breed	days_to_lost_to_followup	disease_type	index_date	lost_to_followup	primary_site	species	tissue_source_site_code	studies.id	studies.submitter_id
0	subject	05f90a4e-5fbd-11e9-8647-d663bd873d93	Canine-NHGRI	1a608300-5fc1-11e9-8647-d663bd873d93	Norwich Terrier	NaN	healthy	NaN	NaN	whole blood	Canis lupus familiaris	whole blood	4a175584-5fbb-11e9-8647-d663bd873d93	GSE90441
1	subject	05f9102a-5fbd-11e9-8647-d663bd873d93	Canine-NHGRI	1a60844a-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	NaN	healthy	NaN	NaN	whole blood	Canis lupus familiaris	whole blood	4a175584-5fbb-11e9-8647-d663bd873d93	GSE90441
2	subject	05f911b0-5fbd-11e9-8647-d663bd873d93	Canine-NHGRI	1a608580-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	NaN	healthy	NaN	NaN	whole blood	Canis lupus familiaris	whole blood	4a175584-5fbb-11e9-8647-d663bd873d93	GSE90441
3	subject	05f912fa-5fbd-11e9-8647-d663bd873d93	Canine-NHGRI	1a6086b6-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	NaN	healthy	NaN	NaN	whole blood	Canis lupus familiaris	whole blood	4a175584-5fbb-11e9-8647-d663bd873d93	GSE90441
4	subject	05f91426-5fbd-11e9-8647-d663bd873d93	Canine-NHGRI	1a6087f6-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	NaN	healthy	NaN	NaN	whole blood	Canis lupus familiaris	whole blood	4a175584-5fbb-11e9-8647-d663bd873d93	GSE90441

Output of subject_clean_slim.head() (first 5 rows after dropping unneeded columns):
	submitter_id	breed	disease_type	primary_site	species	tissue_source_site_code
0	1a608300-5fc1-11e9-8647-d663bd873d93	Norwich Terrier	healthy	whole blood	Canis lupus familiaris	whole blood
1	1a60844a-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	healthy	whole blood	Canis lupus familiaris	whole blood
2	1a608580-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	healthy	whole blood	Canis lupus familiaris	whole blood
3	1a6086b6-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	healthy	whole blood	Canis lupus familiaris	whole blood
4	1a6087f6-5fc1-11e9-8647-d663bd873d93	Old English Sheepdog	healthy	whole blood	Canis lupus familiaris	whole blood

Output of counts_breed.head(10) (first 10 rows of the breed counts):
	breed	counts
0	Afghan Hound	10
1	Airedale Terrier	3
2	Akita	10
3	Alaskan Malamute	10
4	American Cocker Spaniel	10
5	American Hairless Terrier	10
6	American Staffordshire Bull Terrier	2
7	Anatolian Shepherd	6
8	Australian Cattle Dog	2
9	Australian Kelpie	2