Build a Cohort

Example use case:

Julia is an oncologist who specializes in female reproductive health. As part of her research, she wants to use existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes and test their potential as early diagnostics. Julia has heard that the Cancer Data Aggregator (CDA) makes it easy to search across multiple datasets created by NCI, so she decides to start her search there.

Before Julia does any work, she needs to import several functions from cdapython:

Q and query, which power the search
columns, which lets us view entity field names
unique_terms, which lets us view entity field contents

She also asks cdapython to report its version so she can be sure she's using the one she intends.

In [1]:

            
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd 
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.22

CDA data comes from three sources:

The Proteomic Data Commons (PDC)
The Genomic Data Commons (GDC)
The Imaging Data Commons (IDC)

The CDA makes this data searchable in four main endpoints:

subject: A specific, unique individual, e.g., a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets
researchsubject: A person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs
specimen: A tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single researchsubject ID
file: A unit of data about subjects, researchsubjects, specimens, or their associated information

and two endpoints that offer deeper information about data in the researchsubject endpoint:

diagnosis: Information about what medical diagnosis a researchsubject has
treatment: Information about what medical treatment(s) were performed for a given diagnosis

Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as one enormous spreadsheet full of data. To search this spreadsheet, you'd want to select columns and then filter rows.
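The spreadsheet analogy can be sketched in plain Python. The rows and the `search` helper below are invented for illustration and are not part of cdapython; only the field names are borrowed from the CDA.

```python
# Toy "spreadsheet": each dict is one row, each key is a column.
# The rows are made up for illustration; only the field names mirror the CDA's.
rows = [
    {"id": "S1", "sex": "female", "primary_diagnosis_site": "Uterus"},
    {"id": "S2", "sex": "male", "primary_diagnosis_site": "Lung"},
    {"id": "S3", "sex": "female", "primary_diagnosis_site": "Cervix"},
]


def search(rows, columns, keep):
    """Filter rows with the predicate `keep`, then select `columns`."""
    return [{c: r[c] for c in columns} for r in rows if keep(r)]


result = search(
    rows,
    columns=["id", "primary_diagnosis_site"],
    keep=lambda r: r["sex"] == "female",
)
print(result)
```

In the real workflow, a Q statement plays the role of the row filter and the chosen endpoint decides which columns come back by default.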

Accordingly, to see what search fields are available, Julia starts by using the command columns:

In [2]:

            
columns().to_list()

Out[2]:

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'age_at_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'ResearchSubject.Diagnosis.Treatment.treatment_effect',
 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',
 'ResearchSubject.Diagnosis.Treatment.number_of_cycles',
 'ResearchSubject.Specimen.id',
 'ResearchSubject.Specimen.identifier.system',
 'ResearchSubject.Specimen.identifier.value',
 'ResearchSubject.Specimen.associated_project',
 'ResearchSubject.Specimen.age_at_collection',
 'ResearchSubject.Specimen.primary_disease_type',
 'ResearchSubject.Specimen.anatomical_site',
 'ResearchSubject.Specimen.source_material_type',
 'ResearchSubject.Specimen.specimen_type',
 'ResearchSubject.Specimen.derived_from_specimen',
 'ResearchSubject.Specimen.derived_from_subject']

There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:

In [3]:

            
columns().to_list(filters="diagnosis")

Out[3]:

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'ResearchSubject.Diagnosis.Treatment.treatment_effect',
 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',
 'ResearchSubject.Diagnosis.Treatment.number_of_cycles']
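Judging from the output above, the filter appears to keep any name that contains the given text regardless of case (the lowercase "diagnosis" matches 'ResearchSubject.Diagnosis.id'). A minimal sketch of that assumed behavior follows; `filter_names` is a hypothetical helper, not a cdapython function.

```python
def filter_names(names, text):
    """Keep names containing `text`, ignoring case (assumed behavior)."""
    needle = text.casefold()
    return [n for n in names if needle in n.casefold()]


# A few names copied from the full column list.
names = [
    "File.id",
    "sex",
    "ResearchSubject.primary_diagnosis_site",
    "ResearchSubject.Diagnosis.stage",
    "ResearchSubject.Specimen.id",
]
diagnosis_columns = filter_names(names, "diagnosis")
print(diagnosis_columns)
```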

To search the CDA, a user also needs to know what search terms are available. Each column can contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that returns all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.

Since Julia is interested specifically in uterine cancers, she uses the unique_terms function on 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site' to see whether 'uterine' appears:

In [4]:

            
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()

Out[4]:

['Brain',
 'Cervix',
 'Head - Face Or Neck, Nos',
 'Lymph Node(s) Paraaortic',
 'Other',
 'Pelvis',
 'Spine',
 'Unknown']

In [5]:

            
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

Out[5]:

['Abdomen',
 'Abdomen, Mediastinum',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other and ill-defined digestive organs',
 'Other and ill-defined sites',
 'Other and ill-defined sites in lip, oral cavity and pharynx',
 'Other and ill-defined sites within respiratory system and intrathoracic organs',
 'Other and unspecified female genital organs',
 'Other and unspecified major salivary glands',
 'Other and unspecified male genital organs',
 'Other and unspecified parts of biliary tract',
 'Other and unspecified parts of mouth',
 'Other and unspecified parts of tongue',
 'Other and unspecified urinary organs',
 'Other endocrine glands and related structures',
 'Ovary',
 'Palate',
 'Pancreas',
 'Pancreas ',
 'Pelvis, Prostate, Anus',
 'Penis',
 'Peripheral nerves and autonomic nervous system',
 'Phantom',
 'Prostate',
 'Prostate gland',
 'Rectosigmoid junction',
 'Rectum',
 'Renal pelvis',
 'Retroperitoneum and peritoneum',
 'Skin',
 'Small intestine',
 'Spinal cord, cranial nerves, and other parts of central nervous system',
 'Stomach',
 'Testicles',
 'Testis',
 'Thymus',
 'Thyroid',
 'Thyroid gland',
 'Tonsil',
 'Trachea',
 'Unknown',
 'Ureter',
 'Uterus',
 'Uterus, NOS',
 'Vagina',
 'Various',
 'Various (11 locations)',
 'Vulva']

CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.
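Some near-duplicates in the list above differ only by case or stray whitespace ('Pancreas' and 'Pancreas '). The sketch below shows one way such trivial variants could be flagged; `near_duplicates` is a hypothetical helper, and it will not catch variants such as 'Thyroid' and 'Thyroid gland', which still need to be read by eye.

```python
from collections import defaultdict

# Terms copied from the primary_diagnosis_site list above.
terms = ["Pancreas", "Pancreas ", "Thyroid", "Thyroid gland", "Uterus", "Uterus, NOS"]


def near_duplicates(terms):
    """Group terms that differ only by case or surrounding whitespace."""
    groups = defaultdict(list)
    for term in terms:
        groups[term.strip().casefold()].append(term)
    return [group for group in groups.values() if len(group) > 1]


duplicates = near_duplicates(terms)
print(duplicates)
```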

Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So she runs a fuzzy match filter on "ResearchSubject.primary_diagnosis_site" for 'uter', as that should cover all variants:

In [6]:

            
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

Out[6]:

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

Just to be sure, Julia also searches for any other instances of "cervix":

In [7]:

            
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")

Out[7]:

['Cervix', 'Cervix uteri']

With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of Q statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the = operator to get only exact matches:

In [8]:

            
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')

However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, Q can also run fuzzy searches: wrapping a term in % signs, as in "%uter%", matches values that contain it. Q can also search more than one term at a time, so Julia writes one big Q statement to grab everything that contains either 'uter' or 'cerv':

In [9]:

            
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')
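The fuzzy matching in the Q statement above can be pictured as SQL-LIKE-style pattern matching. The sketch below assumes that % stands for any run of characters and that matching ignores case (the count summary later in this walkthrough lists 'Uterus' among the matched sites). `like_to_regex` and `matches` are hypothetical helpers, not cdapython functions.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern, where % matches any run of characters."""
    parts = [re.escape(p) for p in pattern.split("%")]
    return re.compile(".*".join(parts), re.IGNORECASE | re.DOTALL)


def matches(value, pattern):
    """Return True when the whole value fits the pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Site names copied from the primary_diagnosis_site list.
sites = ["Cervix", "Cervix uteri", "Corpus uteri", "Ovary", "Uterus", "Uterus, NOS", "Ureter"]
selected = [s for s in sites if matches(s, "%uter%") or matches(s, "%cerv%")]
print(selected)
```

Note that 'Ureter' is not selected: it does not contain the letters 'uter' in sequence.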

Finally, Julia adds her two queries together into one large one:

In [10]:

            
ALLDATA = Tsite.OR(Dsite)
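Combining queries with OR and AND behaves like combining row tests. The toy `Condition` class below only imitates the shape of that idea; it is a hypothetical illustration, not how cdapython implements Q, and the rows are invented.

```python
class Condition:
    """A row test that can be combined with OR/AND, loosely imitating Q."""

    def __init__(self, test):
        self.test = test

    def OR(self, other):
        return Condition(lambda row: self.test(row) or other.test(row))

    def AND(self, other):
        return Condition(lambda row: self.test(row) and other.test(row))


# Invented rows; keys shortened for readability.
rows = [
    {"treatment_site": "Cervix", "diagnosis_site": "Ovary"},
    {"treatment_site": "Brain", "diagnosis_site": "Uterus, NOS"},
    {"treatment_site": "Spine", "diagnosis_site": "Lung"},
]

tsite = Condition(lambda r: r["treatment_site"] == "Cervix")
dsite = Condition(lambda r: "uter" in r["diagnosis_site"].lower())

either = tsite.OR(dsite)   # keeps a row if at least one test passes
both = tsite.AND(dsite)    # keeps a row only if both tests pass

kept = [r for r in rows if either.test(r)]
print(len(kept))
```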

Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using count:

In [11]:

            
ALLDATA.count.run()

Getting results from database

Total execution time: 3475 ms

specimen_count : 40766

treatment_count : 3045

diagnosis_count : 3683

researchsubject_count : 4867

subject_count : 3740

It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She runs her query against the researchsubject endpoint to fetch the first page of results:

In [12]:

            
ALLDATA.researchsubject.run()

Getting results from database

Total execution time: 3523 ms

Out[12]:

            QueryID: 065affcf-84fb-4cc4-9fe7-73535c7bce0a
            
            Offset: 0
            Count: 100
            Total Row Count: 4867
            More pages: True

Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects have the condition "Adenomas and Adenocarcinomas". Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as they might have different mechanisms. So she adds a new filter to her query and summarizes the result with count:

In [13]:

            
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()

Getting results from database

Total execution time: 3415 ms

    total : 3196

   files : 297923

system	count
PDC	104
GDC	1918
IDC	1174

primary_diagnosis_condition	count
Uterine Corpus Endometrial Carcinoma	104
Cystic, Mucinous and Serous Neoplasms	487
Squamous Cell Neoplasms	609
Complex Mixed and Stromal Neoplasms	320
None	1175
Myomatous Neoplasms	187
Not Reported	12
Epithelial Neoplasms, NOS	230
Complex Epithelial Neoplasms	27
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Mesonephromas	5
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Uterus, NOS	961
Corpus uteri	373
Cervix uteri	688
Uterus	867
Cervix	307

She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work:

In [14]:

            
NoAdenoData.researchsubject.run().to_dataframe()

Getting results from database

Total execution time: 3482 ms

Out[14]:

	id	identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id
0	146bd9db-1645-4950-bd18-de30d0db2487	[{'system': 'GDC', 'value': '146bd9db-1645-495...	CGCI-HTMCP-CC	Squamous Cell Neoplasms	Cervix uteri	HTMCP-03-06-02138
1	32e83039-7663-422b-a541-6d9149851560	[{'system': 'GDC', 'value': '32e83039-7663-422...	GENIE-GRCC	Complex Mixed and Stromal Neoplasms	Uterus, NOS	GENIE-GRCC-4f168dad
2	37063f74-ccc7-426e-ac1c-ad733f2f7e95	[{'system': 'GDC', 'value': '37063f74-ccc7-426...	GENIE-UHN	Epithelial Neoplasms, NOS	Corpus uteri	GENIE-UHN-247706
3	3878f58e-76ba-4480-a784-88505bd464d0	[{'system': 'GDC', 'value': '3878f58e-76ba-448...	TCGA-UCEC	Cystic, Mucinous and Serous Neoplasms	Corpus uteri	TCGA-FI-A2EX
4	3df6abe2-2123-4bfa-a4e4-88df5f940c04	[{'system': 'GDC', 'value': '3df6abe2-2123-4bf...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	TCGA-JX-A3PZ
...	...	...	...	...	...	...
95	fa219ae6-def1-4200-972a-3fd17d688d34	[{'system': 'GDC', 'value': 'fa219ae6-def1-420...	FM-AD	Squamous Cell Neoplasms	Cervix uteri	AD7747
96	fb6f2e38-9281-4085-923c-ef99955fd5ea	[{'system': 'GDC', 'value': 'fb6f2e38-9281-408...	CGCI-HTMCP-CC	Squamous Cell Neoplasms	Cervix uteri	HTMCP-03-06-02062
97	13d72130-604c-4d79-95cc-53c2e25d91b0	[{'system': 'GDC', 'value': '13d72130-604c-4d7...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	TCGA-ZJ-AAX4
98	15d1d0ad-4196-49d1-8eb3-38c75b7db58c	[{'system': 'GDC', 'value': '15d1d0ad-4196-49d...	GENIE-MSK	Myomatous Neoplasms	Uterus, NOS	GENIE-MSK-P-0005582
99	1d6f367d-a00d-4bd0-9a8b-0a25e37fc1cd	[{'system': 'GDC', 'value': '1d6f367d-a00d-4bd...	GENIE-DFCI	Cystic, Mucinous and Serous Neoplasms	Uterus, NOS	GENIE-DFCI-001660

100 rows × 6 columns

ResearchSubject Field Definitions

A research subject is the entity of interest in a research study, typically a human being or an animal, but it can also be a device, a group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person, to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.

id: The unique identifier for this researchsubject
identifier: An embedded array of information that includes the originating data center and the ID the researchsubject had there
member_of_research_project: The name of the study/project that the subject participated in
primary_diagnosis_condition: The cancer, disease or other condition under study
primary_diagnosis_site: The primary_disease_site that qualifies the researchsubject for the research_project
subject_id: An identifier for the subject. Can be joined to the `id` field from subject results
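The `identifier` field is a list of small records, each with a 'system' and a 'value' key, as the previews show. A minimal sketch of pulling the data center names out of such a list follows; `systems` is a hypothetical helper and the example values are invented.

```python
def systems(identifier):
    """Return the sorted data center names found in an `identifier` array."""
    return sorted({entry["system"] for entry in identifier})


# Hypothetical identifier arrays in the same shape as the previews.
single = [{"system": "GDC", "value": "EXAMPLE-001"}]
multi = [
    {"system": "GDC", "value": "EXAMPLE-002"},
    {"system": "IDC", "value": "EXAMPLE-002"},
]
print(systems(single), systems(multi))
```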

In [15]:

            
NoAdenoData.subject.run().to_dataframe()

Getting results from database

Total execution time: 3460 ms

Out[15]:

	id	identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	age_at_death	cause_of_death
0	AD2728	[{'system': 'GDC', 'value': 'AD2728'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
1	C3N-01876	[{'system': 'IDC', 'value': 'C3N-01876'}]	Homo sapiens	None	None	None	NaN	[cptac_ucec]	None	NaN	None
2	GENIE-DFCI-007276	[{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]	Homo sapiens	female	white	not hispanic or latino	-18627.0	[GENIE-DFCI]	Not Reported	NaN	None
3	GENIE-DFCI-009140	[{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]	Homo sapiens	female	white	not hispanic or latino	-24837.0	[GENIE-DFCI]	Not Reported	NaN	None
4	GENIE-DFCI-009144	[{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]	Homo sapiens	female	white	not hispanic or latino	-19723.0	[GENIE-DFCI]	Not Reported	NaN	None
...	...	...	...	...	...	...	...	...	...	...	...
95	AD14317	[{'system': 'GDC', 'value': 'AD14317'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
96	AD3008	[{'system': 'GDC', 'value': 'AD3008'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
97	AD6414	[{'system': 'GDC', 'value': 'AD6414'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
98	AD7975	[{'system': 'GDC', 'value': 'AD7975'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
99	C3L-00157	[{'system': 'GDC', 'value': 'C3L-00157'}, {'sy...	Homo sapiens	female	white	hispanic or latino	-22118.0	[CPTAC3-Discovery, CPTAC-3, cptac_ucec]	Dead	1396.0	Cancer Related

100 rows × 11 columns

Subject Field Definitions

A subject is a specific, unique individual, e.g., a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.

id: The unique identifier for this subject
identifier: An embedded array of information that includes the originating data center and the ID the subject had there
species: The species of the subject
sex: A reference to the biological sex of the donor organism.
race: The race of the subject
ethnicity: The ethnicity of the subject
days_to_birth: The number of days between the index date and the subject's date of birth, expressed as a negative number of days
subject_associated_project: An embedded array of the names of projects (studies) the subject was part of
vital_status: Whether the subject is alive
age_at_death: The number of days after first enrollment that the subject died
cause_of_death: The cause of death, if known
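Because days_to_birth is a negative day count, it can be turned into an approximate age at the index date. The sketch below assumes an average year of 365.25 days; `age_in_years` is a hypothetical helper.

```python
import math


def age_in_years(days_to_birth):
    """Approximate age at the index date from a negative day count."""
    if days_to_birth is None or math.isnan(days_to_birth):
        return None  # the value is missing for many subjects
    return round(-days_to_birth / 365.25, 1)


# -18627.0 is the days_to_birth of GENIE-DFCI-007276 in the preview above.
age = age_in_years(-18627.0)
print(age)
```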

In [16]:

            
NoAdenoData.file.run().to_dataframe()

Getting results from database

Total execution time: 3742 ms

Out[16]:

	id	identifier	label	data_category	data_type	file_format	associated_project	drs_uri	byte_size	checksum	data_modality	imaging_modality	dbgap_accession_number	researchsubject_specimen_id	researchsubject_id	subject_id
0	d3151fb9-9dd5-470e-b181-4d920f686068	[{'system': 'GDC', 'value': 'd3151fb9-9dd5-470...	TCGA-B5-A11E-01A-21-A163-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCEC	drs://dg.4DFC:d3151fb9-9dd5-470e-b181-4d920f68...	22341	f44fc349969dda464ddf37f5e1f149f1	Genomic	None	None			TCGA-B5-A11E
1	2200d48f-d10d-4e0c-aff6-a71958fc2b1b	[{'system': 'GDC', 'value': '2200d48f-d10d-4e0...	TCGA-A5-A0G9-01A-21-A162-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCEC	drs://dg.4DFC:2200d48f-d10d-4e0c-aff6-a71958fc...	24285	8edb8c63f398d0d6dab0655d62b1cd93	Genomic	None	None			TCGA-A5-A0G9
2	e6ee1e9e-9c28-4db8-9f7f-3916f5351717	[{'system': 'GDC', 'value': 'e6ee1e9e-9c28-4db...	TCGA-N7-A4Y5-01A-21-A41P-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCS	drs://dg.4DFC:e6ee1e9e-9c28-4db8-9f7f-3916f535...	22026	73159e8898216b617ac3e135af51d87e	Genomic	None	None			TCGA-N7-A4Y5
3	81674772-fd6d-48b6-93b1-fa585d1ed568	[{'system': 'GDC', 'value': '81674772-fd6d-48b...	49b02eb4-8e31-42cd-a3e7-065611836434.wgs.BRASS...	Somatic Structural Variation	Structural Rearrangement	BEDPE	CPTAC-3	drs://dg.4DFC:81674772-fd6d-48b6-93b1-fa585d1e...	9977	64560c17caa67fa25411218ef57101a6	Genomic	None	phs001287			C3L-01307
4	c3392a1e-1241-4068-9bca-31fd836148de	[{'system': 'GDC', 'value': 'c3392a1e-1241-406...	TCGA-BG-A0MA-01A-21-A18Q-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCEC	drs://dg.4DFC:c3392a1e-1241-4068-9bca-31fd8361...	22324	3da3113805454ac4fca6482fbaf4b4b1	Genomic	None	None			TCGA-BG-A0MA
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	b42cdaba-46c8-4a02-b7cf-86ceb5d1f712	[{'system': 'GDC', 'value': 'b42cdaba-46c8-4a0...	TCGA-HG-A2PA-01A-21-A40H-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-CESC	drs://dg.4DFC:b42cdaba-46c8-4a02-b7cf-86ceb5d1...	22070	398e6cca19ff30d932a1a78669254710	Genomic	None	None			TCGA-HG-A2PA
96	1731c20f-1f10-4f80-8793-99593f81515f	[{'system': 'GDC', 'value': '1731c20f-1f10-4f8...	TCGA-B5-A11P-01B-21-A18Q-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCEC	drs://dg.4DFC:1731c20f-1f10-4f80-8793-99593f81...	22344	8bcbded9fbd5a48a58f77ae1e3ea829f	Genomic	None	None			TCGA-B5-A11P
97	36f66d66-f71f-49be-9e51-ac640b826d3f	[{'system': 'GDC', 'value': '36f66d66-f71f-49b...	TCGA-EY-A1GH-01A-21-A18Q-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-UCEC	drs://dg.4DFC:36f66d66-f71f-49be-9e51-ac640b82...	22338	52a07dbf0ebfcafeb41958d4a1e2b489	Genomic	None	None			TCGA-EY-A1GH
98	59a4c826-87d1-43ab-9b2e-3c6088275fd7	[{'system': 'GDC', 'value': '59a4c826-87d1-43a...	de3cbd77-822b-4c86-80e8-9be54ca8b324.wgs.BRASS...	Somatic Structural Variation	Structural Rearrangement	BEDPE	CGCI-HTMCP-CC	drs://dg.4DFC:59a4c826-87d1-43ab-9b2e-3c608827...	121947	c9bcda0d917caf81773efd8e2f827ebb	Genomic	None	phs000528			HTMCP-03-06-02040
99	a7c7ba3e-7d9d-4735-afff-22442a0e9a84	[{'system': 'GDC', 'value': 'a7c7ba3e-7d9d-473...	44c6a116-146f-43fd-aa1e-7e8b1636a722.wgs.BRASS...	Somatic Structural Variation	Structural Rearrangement	VCF	CGCI-HTMCP-CC	drs://dg.4DFC:a7c7ba3e-7d9d-4735-afff-22442a0e...	76341	5e513d0fae6e32b5425c647c9d8d3ba3	Genomic	None	phs000528			HTMCP-03-06-02144

100 rows × 16 columns

File Field Definitions

A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.

id: The unique identifier for this file
identifier: An embedded array of information that includes the originating data center and the ID the file had there
label: The full name of the file
data_category: A description of the general kind of data the file holds
data_type: A more specific description of the data type
file_format: String to identify the full file extension, including compression extensions
associated_project: The name the data center uses for the study this file was generated for
drs_uri: A unique identifier that can be used to retrieve this specific file from a server
byte_size: Size of the file in bytes
checksum: The md5 value for the file
data_modality: Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging"
imaging_modality: For files with the `data_modality` of "Imaging", a descriptor for the image type
dbgap_accession_number: The project id number for this data on dbGaP
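Since the checksum field holds an MD5 value, a downloaded file could be verified against it with Python's standard hashlib module. The sketch below uses an in-memory stream so it runs anywhere; `md5_of` is a hypothetical helper, and in practice the stream would come from opening the downloaded file in binary mode.

```python
import hashlib
import io


def md5_of(stream, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in data; `expected` would be the checksum value from the file table.
stream = io.BytesIO(b"hello")
expected = "5d41402abc4b2a76b9719d911017c592"
checksum_ok = md5_of(stream) == expected
print(checksum_ok)
```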

Finally, Julia wants to save these results to use for the future. Since the preview dataframes only show the first 100 results of each search, she uses the paginator function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [17]:

            
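researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
# Each iteration yields the next page of results as a dataframe
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])  # append the page to the running dataframe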

Getting results from database

Total execution time: 3318 ms
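The paging pattern itself can be sketched without a server. The generator below is a hypothetical stand-in for the paginator, assuming pages of 100 rows as in the previews, and uses 3196 placeholder rows to match the researchsubject total reported earlier.

```python
def paginate(rows, page_size=100):
    """Yield successive pages of at most `page_size` rows."""
    for offset in range(0, len(rows), page_size):
        yield rows[offset:offset + page_size]


# 3196 placeholder rows, matching the researchsubject total reported earlier.
rows = list(range(3196))

collected = []
page_lengths = []
for page in paginate(rows):
    page_lengths.append(len(page))
    collected.extend(page)

print(len(page_lengths), page_lengths[-1], len(collected))
```

Under these assumptions the final page holds 96 rows. When pages are combined with `pd.concat`, each page keeps its own 0-based row labels unless `ignore_index=True` is passed.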

In [18]:

            
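subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
# Each iteration yields the next page of results as a dataframe
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])  # append the page to the running dataframe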

Getting results from database

Total execution time: 3283 ms

In [19]:

            
rsdf # view the researchsubject dataframe

Out[19]:

	id	identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id
0	146bd9db-1645-4950-bd18-de30d0db2487	[{'system': 'GDC', 'value': '146bd9db-1645-495...	CGCI-HTMCP-CC	Squamous Cell Neoplasms	Cervix uteri	HTMCP-03-06-02138
1	32e83039-7663-422b-a541-6d9149851560	[{'system': 'GDC', 'value': '32e83039-7663-422...	GENIE-GRCC	Complex Mixed and Stromal Neoplasms	Uterus, NOS	GENIE-GRCC-4f168dad
2	37063f74-ccc7-426e-ac1c-ad733f2f7e95	[{'system': 'GDC', 'value': '37063f74-ccc7-426...	GENIE-UHN	Epithelial Neoplasms, NOS	Corpus uteri	GENIE-UHN-247706
3	3878f58e-76ba-4480-a784-88505bd464d0	[{'system': 'GDC', 'value': '3878f58e-76ba-448...	TCGA-UCEC	Cystic, Mucinous and Serous Neoplasms	Corpus uteri	TCGA-FI-A2EX
4	3df6abe2-2123-4bfa-a4e4-88df5f940c04	[{'system': 'GDC', 'value': '3df6abe2-2123-4bf...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	TCGA-JX-A3PZ
...	...	...	...	...	...	...
91	TCGA-N9-A4Q7__tcga_ucs	[{'system': 'IDC', 'value': 'TCGA-N9-A4Q7'}]	tcga_ucs	None	Uterus	TCGA-N9-A4Q7
92	TCGA-QS-A744__tcga_ucec	[{'system': 'IDC', 'value': 'TCGA-QS-A744'}]	tcga_ucec	None	Uterus	TCGA-QS-A744
93	c64d5576-df00-4772-a3d1-1f8863000750	[{'system': 'GDC', 'value': 'c64d5576-df00-477...	CGCI-HTMCP-CC	Squamous Cell Neoplasms	Cervix uteri	HTMCP-03-06-02099
94	cc500ada-7440-412f-b54c-4966c8098dcb	[{'system': 'GDC', 'value': 'cc500ada-7440-412...	GENIE-DFCI	Cystic, Mucinous and Serous Neoplasms	Uterus, NOS	GENIE-DFCI-000331
95	d7a75bf5-5189-4978-99d9-fcef91c9fbd2	[{'system': 'GDC', 'value': 'd7a75bf5-5189-497...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	TCGA-EK-A2R7

3196 rows × 6 columns

In [20]:

            
subsdf # view the subject dataframe

Out[20]:

	id	identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	age_at_death	cause_of_death
0	AD2728	[{'system': 'GDC', 'value': 'AD2728'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
1	C3N-01876	[{'system': 'IDC', 'value': 'C3N-01876'}]	Homo sapiens	None	None	None	NaN	[cptac_ucec]	None	NaN	None
2	GENIE-DFCI-007276	[{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]	Homo sapiens	female	white	not hispanic or latino	-18627.0	[GENIE-DFCI]	Not Reported	NaN	None
3	GENIE-DFCI-009140	[{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]	Homo sapiens	female	white	not hispanic or latino	-24837.0	[GENIE-DFCI]	Not Reported	NaN	None
4	GENIE-DFCI-009144	[{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]	Homo sapiens	female	white	not hispanic or latino	-19723.0	[GENIE-DFCI]	Not Reported	NaN	None
...	...	...	...	...	...	...	...	...	...	...	...
3	TCGA-EY-A72D	[{'system': 'GDC', 'value': 'TCGA-EY-A72D'}, {...	Homo sapiens	female	black or african american	not hispanic or latino	-31818.0	[TCGA-UCEC, tcga_ucec]	Alive	NaN	None
4	TCGA-IE-A4EH	[{'system': 'GDC', 'value': 'TCGA-IE-A4EH'}, {...	Homo sapiens	female	white	not hispanic or latino	-12871.0	[tcga_sarc, TCGA-SARC]	Alive	NaN	None
5	TCGA-IS-A3KA	[{'system': 'GDC', 'value': 'TCGA-IS-A3KA'}, {...	Homo sapiens	female	white	not hispanic or latino	-26775.0	[tcga_sarc, TCGA-SARC]	Dead	413.0	None
6	TCGA-NA-A4QY	[{'system': 'GDC', 'value': 'TCGA-NA-A4QY'}, {...	Homo sapiens	female	white	not hispanic or latino	-22756.0	[tcga_ucs, TCGA-UCS]	Dead	114.0	None
7	TCGA-VS-A9V3	[{'system': 'GDC', 'value': 'TCGA-VS-A9V3'}, {...	Homo sapiens	female	white	not reported	-22990.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None

2608 rows × 11 columns

Then Julia joins the two results into one big dataset, matching the subject_id field of the researchsubject dataframe to the id field of the subject dataframe:

In [21]:

            
allmetadata = rsdf.set_index("subject_id").join(subsdf.set_index("id"), lsuffix='resub', rsuffix="subject")

In [22]:

            
allmetadata

Out[22]:

	id	identifierresub	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	identifiersubject	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	age_at_death	cause_of_death
AD100	0f08e2e9-9983-4204-972f-a630b7ab2c25	[{'system': 'GDC', 'value': '0f08e2e9-9983-420...	FM-AD	Squamous Cell Neoplasms	Cervix uteri	[{'system': 'GDC', 'value': 'AD100'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
AD1026	6d9d6cb9-652f-4749-b4c5-aa9e6b80de69	[{'system': 'GDC', 'value': '6d9d6cb9-652f-474...	FM-AD	Complex Mixed and Stromal Neoplasms	Uterus, NOS	[{'system': 'GDC', 'value': 'AD1026'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
AD10328	514fc104-1ee5-4701-8f45-9a011143f1e2	[{'system': 'GDC', 'value': '514fc104-1ee5-470...	FM-AD	Squamous Cell Neoplasms	Cervix uteri	[{'system': 'GDC', 'value': 'AD10328'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
AD10460	8c36611d-be2f-432a-afde-e684ab4333ea	[{'system': 'GDC', 'value': '8c36611d-be2f-432...	FM-AD	Cystic, Mucinous and Serous Neoplasms	Uterus, NOS	[{'system': 'GDC', 'value': 'AD10460'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
AD10485	0ad0fdda-dd96-48df-8edd-e5e471e9f680	[{'system': 'GDC', 'value': '0ad0fdda-dd96-48d...	FM-AD	Cystic, Mucinous and Serous Neoplasms	Uterus, NOS	[{'system': 'GDC', 'value': 'AD10485'}]	Homo sapiens	female	not reported	not reported	NaN	[FM-AD]	Not Reported	NaN	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TCGA-ZJ-AB0H	TCGA-ZJ-AB0H__tcga_cesc	[{'system': 'IDC', 'value': 'TCGA-ZJ-AB0H'}]	tcga_cesc	None	Cervix	[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0H'}, {...	Homo sapiens	female	not reported	not reported	-17869.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None
TCGA-ZJ-AB0I	a4f13656-a941-498a-9ac9-f020ed559b35	[{'system': 'GDC', 'value': 'a4f13656-a941-498...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...	Homo sapiens	female	white	not hispanic or latino	-9486.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None
TCGA-ZJ-AB0I	TCGA-ZJ-AB0I__tcga_cesc	[{'system': 'IDC', 'value': 'TCGA-ZJ-AB0I'}]	tcga_cesc	None	Cervix	[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...	Homo sapiens	female	white	not hispanic or latino	-9486.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None
TCGA-ZX-AA5X	4756acc0-4e96-44d4-b359-04d64dc7eb84	[{'system': 'GDC', 'value': '4756acc0-4e96-44d...	TCGA-CESC	Squamous Cell Neoplasms	Cervix uteri	[{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...	Homo sapiens	female	white	not hispanic or latino	-23440.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None
TCGA-ZX-AA5X	TCGA-ZX-AA5X__tcga_cesc	[{'system': 'IDC', 'value': 'TCGA-ZX-AA5X'}]	tcga_cesc	None	Cervix	[{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...	Homo sapiens	female	white	not hispanic or latino	-23440.0	[TCGA-CESC, tcga_cesc]	Alive	NaN	None

3196 rows × 15 columns
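The joined table has one row per researchsubject (3196), not one per subject (2608), because a subject who appears in several studies is repeated. The plain-Python sketch below shows the same kind of left join on invented, trimmed-down rows; it illustrates the idea and is not a replacement for the pandas call.

```python
# Hypothetical, trimmed-down rows; real results carry many more fields.
researchsubjects = [
    {"id": "rs-1", "subject_id": "SUBJ-A", "primary_diagnosis_site": "Cervix uteri"},
    {"id": "rs-2", "subject_id": "SUBJ-A", "primary_diagnosis_site": "Cervix"},
    {"id": "rs-3", "subject_id": "SUBJ-B", "primary_diagnosis_site": "Uterus"},
]
subjects = [
    {"id": "SUBJ-A", "sex": "female"},
    {"id": "SUBJ-B", "sex": "female"},
]

# Index subjects by id, then attach the matching subject to each researchsubject.
subject_by_id = {s["id"]: s for s in subjects}
joined = []
for rs in researchsubjects:
    subject = subject_by_id.get(rs["subject_id"], {})
    extra = {k: v for k, v in subject.items() if k != "id"}
    joined.append({**rs, **extra})

print(len(joined))
```

Two subjects produce three joined rows here, which mirrors the paired GDC and IDC rows for the same TCGA subject at the end of the table above.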

She then saves it to a CSV file so she can browse it in Excel:

In [23]:

            
allmetadata.to_csv("allmetadata.csv")

Julia knows from her researchsubject count summary that there are 297923 files associated with her search, which is likely far more than she needs. To help her decide which files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:

In [24]:

            
NoAdenoData.researchsubject.file.count.run()

Getting results from database

Total execution time: 3399 ms

    total : 3196

   files : 297923

system	count
PDC	104
GDC	1918
IDC	1174

primary_diagnosis_condition	count
Uterine Corpus Endometrial Carcinoma	104
Cystic, Mucinous and Serous Neoplasms	487
Squamous Cell Neoplasms	609
Complex Mixed and Stromal Neoplasms	320
None	1175
Myomatous Neoplasms	187
Not Reported	12
Epithelial Neoplasms, NOS	230
Complex Epithelial Neoplasms	27
Soft Tissue Tumors and Sarcomas, NOS	14
Neoplasms, NOS	12
Trophoblastic neoplasms	13
Mesonephromas	5
Neuroepitheliomatous Neoplasms	1

primary_diagnosis_site	count
Uterus, NOS	961
Corpus uteri	373
Cervix uteri	688
Uterus	867
Cervix	307

Last update: 2022-06-23