Build a Cohort
Example use case:
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes and test their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, so she has decided to start her search there.
Before Julia does any work, she needs to import several functions from cdapython:
- `Q` and `query`, which power the search
- `columns`, which lets us view entity field names
- `unique_terms`, which lets us view entity field contents
She also asks cdapython to report its version so she can be sure she's using the one she means to.
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")
2022.6.22
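As a small illustration of the version check, the sketch below compares a reported version string against the one this walkthrough was run with. The helper `parse_version` and the hard-coded `installed` value are assumptions made so the example runs without cdapython installed; in a live session the value would come from `cdapython.__version__`.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2022.6.22' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


expected = "2022.6.22"   # the version printed in this walkthrough
installed = "2022.6.22"  # hypothetical stand-in for cdapython.__version__

if parse_version(installed) != parse_version(expected):
    print(f"Warning: walkthrough was run with {expected}, found {installed}")
else:
    print("Version matches the walkthrough")
```

Comparing tuples of integers rather than raw strings avoids ordering surprises such as "2022.10.1" sorting before "2022.6.22".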
CDA data comes from three NCI data commons:
- The Proteomic Data Commons (PDC)
- The Genomic Data Commons (GDC)
- The Imaging Data Commons (IDC)

The CDA organizes this data into several entity types:
- subject: A specific, unique individual, e.g., a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets
- researchsubject: A person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs
- specimen: A tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single researchsubject ID
- file: A unit of data about subjects, researchsubjects, specimens, or their associated information
- diagnosis: Information about what medical diagnosis a researchsubject has
- treatment: Information about what medical treatment(s) were performed for a given diagnosis
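The relationship between subject and researchsubject can be sketched with a few made-up records. All IDs and study names below are hypothetical and exist only to show that one individual enrolled in three studies carries three researchsubject IDs but a single subject ID.

```python
from collections import defaultdict

# Hypothetical records: (researchsubject_id, subject_id, project)
researchsubjects = [
    ("rs-001", "subject-A", "study-1"),
    ("rs-002", "subject-A", "study-2"),
    ("rs-003", "subject-A", "study-3"),
    ("rs-004", "subject-B", "study-1"),
]

# Group researchsubject IDs under the subject they belong to.
by_subject = defaultdict(list)
for rs_id, subject_id, project in researchsubjects:
    by_subject[subject_id].append(rs_id)

for subject_id, rs_ids in by_subject.items():
    print(subject_id, "->", len(rs_ids), "researchsubject IDs")
```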
Accordingly, to see what search fields are available, Julia starts by using the `columns` command:
columns().to_list()
['File.id', 'File.identifier.system', 'File.identifier.value', 'File.label', 'File.data_category', 'File.data_type', 'File.file_format', 'File.associated_project', 'File.drs_uri', 'File.byte_size', 'File.checksum', 'File.data_modality', 'File.imaging_modality', 'File.dbgap_accession_number', 'id', 'identifier.system', 'identifier.value', 'species', 'sex', 'race', 'ethnicity', 'days_to_birth', 'subject_associated_project', 'vital_status', 'age_at_death', 'cause_of_death', 'ResearchSubject.id', 'ResearchSubject.identifier.system', 'ResearchSubject.identifier.value', 'ResearchSubject.member_of_research_project', 'ResearchSubject.primary_diagnosis_condition', 'ResearchSubject.primary_diagnosis_site', 'ResearchSubject.Diagnosis.id', 'ResearchSubject.Diagnosis.identifier.system', 'ResearchSubject.Diagnosis.identifier.value', 'ResearchSubject.Diagnosis.primary_diagnosis', 'ResearchSubject.Diagnosis.age_at_diagnosis', 'ResearchSubject.Diagnosis.morphology', 'ResearchSubject.Diagnosis.stage', 'ResearchSubject.Diagnosis.grade', 'ResearchSubject.Diagnosis.method_of_diagnosis', 'ResearchSubject.Diagnosis.Treatment.id', 'ResearchSubject.Diagnosis.Treatment.identifier.system', 'ResearchSubject.Diagnosis.Treatment.identifier.value', 'ResearchSubject.Diagnosis.Treatment.treatment_type', 'ResearchSubject.Diagnosis.Treatment.treatment_outcome', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end', 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent', 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site', 'ResearchSubject.Diagnosis.Treatment.treatment_effect', 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason', 'ResearchSubject.Diagnosis.Treatment.number_of_cycles', 'ResearchSubject.Specimen.id', 'ResearchSubject.Specimen.identifier.system', 'ResearchSubject.Specimen.identifier.value', 'ResearchSubject.Specimen.associated_project', 'ResearchSubject.Specimen.age_at_collection', 
'ResearchSubject.Specimen.primary_disease_type', 'ResearchSubject.Specimen.anatomical_site', 'ResearchSubject.Specimen.source_material_type', 'ResearchSubject.Specimen.specimen_type', 'ResearchSubject.Specimen.derived_from_specimen', 'ResearchSubject.Specimen.derived_from_subject']
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:
columns().to_list(filters="diagnosis")
['ResearchSubject.primary_diagnosis_condition', 'ResearchSubject.primary_diagnosis_site', 'ResearchSubject.Diagnosis.id', 'ResearchSubject.Diagnosis.identifier.system', 'ResearchSubject.Diagnosis.identifier.value', 'ResearchSubject.Diagnosis.primary_diagnosis', 'ResearchSubject.Diagnosis.age_at_diagnosis', 'ResearchSubject.Diagnosis.morphology', 'ResearchSubject.Diagnosis.stage', 'ResearchSubject.Diagnosis.grade', 'ResearchSubject.Diagnosis.method_of_diagnosis', 'ResearchSubject.Diagnosis.Treatment.id', 'ResearchSubject.Diagnosis.Treatment.identifier.system', 'ResearchSubject.Diagnosis.Treatment.identifier.value', 'ResearchSubject.Diagnosis.Treatment.treatment_type', 'ResearchSubject.Diagnosis.Treatment.treatment_outcome', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end', 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent', 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site', 'ResearchSubject.Diagnosis.Treatment.treatment_effect', 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason', 'ResearchSubject.Diagnosis.Treatment.number_of_cycles']
Since Julia is interested specifically in uterine cancers, she uses the `unique_terms` function to see what values are available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site', and to check whether 'uterine' appears:
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()
['Brain', 'Cervix', 'Head - Face Or Neck, Nos', 'Lymph Node(s) Paraaortic', 'Other', 'Pelvis', 'Spine', 'Unknown']
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()
['Abdomen', 'Abdomen, Mediastinum', 'Adrenal Glands', 'Adrenal gland', 'Anus and anal canal', 'Base of tongue', 'Bile Duct', 'Bladder', 'Bones, joints and articular cartilage of limbs', 'Bones, joints and articular cartilage of other and unspecified sites', 'Brain', 'Breast', 'Bronchus and lung', 'Cervix', 'Cervix uteri', 'Chest', 'Chest-Abdomen-Pelvis, Leg, TSpine', 'Colon', 'Connective, subcutaneous and other soft tissues', 'Corpus uteri', 'Ear', 'Esophagus', 'Extremities', 'Eye and adnexa', 'Floor of mouth', 'Gallbladder', 'Gum', 'Head', 'Head and Neck', 'Head-Neck', 'Heart, mediastinum, and pleura', 'Hematopoietic and reticuloendothelial systems', 'Hypopharynx', 'Intraocular', 'Kidney', 'Larynx', 'Lip', 'Liver', 'Liver and intrahepatic bile ducts', 'Lung', 'Lung Phantom', 'Lymph nodes', 'Marrow, Blood', 'Meninges', 'Mesothelium', 'Nasal cavity and middle ear', 'Nasopharynx', 'Not Reported', 'Oropharynx', 'Other and ill-defined digestive organs', 'Other and ill-defined sites', 'Other and ill-defined sites in lip, oral cavity and pharynx', 'Other and ill-defined sites within respiratory system and intrathoracic organs', 'Other and unspecified female genital organs', 'Other and unspecified major salivary glands', 'Other and unspecified male genital organs', 'Other and unspecified parts of biliary tract', 'Other and unspecified parts of mouth', 'Other and unspecified parts of tongue', 'Other and unspecified urinary organs', 'Other endocrine glands and related structures', 'Ovary', 'Palate', 'Pancreas', 'Pancreas ', 'Pelvis, Prostate, Anus', 'Penis', 'Peripheral nerves and autonomic nervous system', 'Phantom', 'Prostate', 'Prostate gland', 'Rectosigmoid junction', 'Rectum', 'Renal pelvis', 'Retroperitoneum and peritoneum', 'Skin', 'Small intestine', 'Spinal cord, cranial nerves, and other parts of central nervous system', 'Stomach', 'Testicles', 'Testis', 'Thymus', 'Thyroid', 'Thyroid gland', 'Tonsil', 'Trachea', 'Unknown', 'Ureter', 'Uterus', 'Uterus, NOS', 
'Vagina', 'Various', 'Various (11 locations)', 'Vulva']
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")
['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
Just to be sure, Julia also searches for any other instances of "cervix":
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")
['Cervix', 'Cervix uteri']
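The results above are consistent with a case-insensitive substring match ('uter' matches both 'Uterus' and 'Cervix uteri'). The sketch below reproduces that behavior locally on a subset of the site names; `filter_terms` is a hypothetical helper and an assumption about the matching rule, not cdapython's actual implementation.

```python
def filter_terms(terms, pattern):
    """Keep the terms that contain `pattern`, ignoring case."""
    needle = pattern.lower()
    return [term for term in terms if needle in term.lower()]


# A subset of the primary_diagnosis_site values listed earlier.
sites = ["Cervix", "Cervix uteri", "Corpus uteri", "Ovary",
         "Uterus", "Uterus, NOS", "Vagina"]

print(filter_terms(sites, "uter"))
print(filter_terms(sites, "cerv"))
```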
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the `=` operator to get only exact matches:
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')
However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, `Q` can also run fuzzy searches, using `%` as a wildcard. It can also search more than one term at a time, so Julia writes one big `Q` statement to grab everything that contains either 'uter' or 'cerv':
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')
Finally, Julia adds her two queries together into one large one:
ALLDATA = Tsite.OR(Dsite)
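To make the logic of the combined query concrete, here is a local analogue that applies the same three conditions to a handful of made-up rows. The `like` helper, the row data, and the shortened key names are all hypothetical; the sketch assumes `%` behaves like a SQL-style wildcard and that matching ignores case, which is what the results in this walkthrough suggest.

```python
import re


def like(value, pattern):
    """Case-insensitive SQL-LIKE-style match; % stands for any run of characters."""
    if value is None:
        return False
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None


# Hypothetical rows standing in for CDA records.
rows = [
    {"id": 1, "treatment_site": "Cervix", "diagnosis_site": "Cervix uteri"},
    {"id": 2, "treatment_site": None, "diagnosis_site": "Uterus, NOS"},
    {"id": 3, "treatment_site": "Brain", "diagnosis_site": "Brain"},
    {"id": 4, "treatment_site": "Cervix", "diagnosis_site": "Ovary"},
]


def tsite(row):
    # Exact match, like the Tsite query.
    return row["treatment_site"] == "Cervix"


def dsite(row):
    # Either wildcard pattern may match, like the Dsite query.
    site = row["diagnosis_site"]
    return like(site, "%uter%") or like(site, "%cerv%")


# Tsite.OR(Dsite): keep a row if either condition holds.
selected = [row["id"] for row in rows if tsite(row) or dsite(row)]
print(selected)
```

Row 4 is kept even though its diagnosis site is 'Ovary', because combining with OR only requires one of the two conditions to hold.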
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:
ALLDATA.count.run()
Getting results from database
Total execution time: 3475 ms
specimen_count : 40766
treatment_count : 3045
diagnosis_count : 3683
researchsubject_count : 4867
subject_count : 3740
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She runs her query against the researchsubject endpoint:
ALLDATA.researchsubject.run()
Getting results from database
Total execution time: 3523 ms
QueryID: 065affcf-84fb-4cc4-9fe7-73535c7bce0a Offset: 0 Count: 100 Total Row Count: 4867 More pages: True
Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects fall under "Adenomas and Adenocarcinomas". Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as they might have different mechanisms. So she adds a new filter to her query and gets a count summary of the result:
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')
NoAdenoData = ALLDATA.AND(Noadeno)
NoAdenoData.researchsubject.count.run()
Getting results from database
Total execution time: 3415 ms
total : 3196
files : 297923
system | count |
---|---|
PDC | 104 |
GDC | 1918 |
IDC | 1174 |
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Squamous Cell Neoplasms | 609 |
Complex Mixed and Stromal Neoplasms | 320 |
None | 1175 |
Myomatous Neoplasms | 187 |
Not Reported | 12 |
Epithelial Neoplasms, NOS | 230 |
Complex Epithelial Neoplasms | 27 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 961 |
Corpus uteri | 373 |
Cervix uteri | 688 |
Uterus | 867 |
Cervix | 307 |
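A quick arithmetic check on the summary above: if each research subject is counted once per breakdown (an assumption, though the numbers bear it out), every table should add up to the reported total of 3196. The sketch below copies the figures from the output and verifies this.

```python
total = 3196

by_system = {"PDC": 104, "GDC": 1918, "IDC": 1174}

by_condition = {
    "Uterine Corpus Endometrial Carcinoma": 104,
    "Cystic, Mucinous and Serous Neoplasms": 487,
    "Squamous Cell Neoplasms": 609,
    "Complex Mixed and Stromal Neoplasms": 320,
    "None": 1175,
    "Myomatous Neoplasms": 187,
    "Not Reported": 12,
    "Epithelial Neoplasms, NOS": 230,
    "Complex Epithelial Neoplasms": 27,
    "Soft Tissue Tumors and Sarcomas, NOS": 14,
    "Neoplasms, NOS": 12,
    "Trophoblastic neoplasms": 13,
    "Mesonephromas": 5,
    "Neuroepitheliomatous Neoplasms": 1,
}

by_site = {"Uterus, NOS": 961, "Corpus uteri": 373, "Cervix uteri": 688,
           "Uterus": 867, "Cervix": 307}

# Each breakdown should partition the same set of research subjects.
for name, breakdown in [("system", by_system),
                        ("condition", by_condition),
                        ("site", by_site)]:
    print(name, sum(breakdown.values()), sum(breakdown.values()) == total)
```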
She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work:
NoAdenoData.researchsubject.run().to_dataframe()
Getting results from database
Total execution time: 3482 ms
| | id | identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
---|---|---|---|---|---|---|
0 | 146bd9db-1645-4950-bd18-de30d0db2487 | [{'system': 'GDC', 'value': '146bd9db-1645-495... | CGCI-HTMCP-CC | Squamous Cell Neoplasms | Cervix uteri | HTMCP-03-06-02138 |
1 | 32e83039-7663-422b-a541-6d9149851560 | [{'system': 'GDC', 'value': '32e83039-7663-422... | GENIE-GRCC | Complex Mixed and Stromal Neoplasms | Uterus, NOS | GENIE-GRCC-4f168dad |
2 | 37063f74-ccc7-426e-ac1c-ad733f2f7e95 | [{'system': 'GDC', 'value': '37063f74-ccc7-426... | GENIE-UHN | Epithelial Neoplasms, NOS | Corpus uteri | GENIE-UHN-247706 |
3 | 3878f58e-76ba-4480-a784-88505bd464d0 | [{'system': 'GDC', 'value': '3878f58e-76ba-448... | TCGA-UCEC | Cystic, Mucinous and Serous Neoplasms | Corpus uteri | TCGA-FI-A2EX |
4 | 3df6abe2-2123-4bfa-a4e4-88df5f940c04 | [{'system': 'GDC', 'value': '3df6abe2-2123-4bf... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-JX-A3PZ |
... | ... | ... | ... | ... | ... | ... |
95 | fa219ae6-def1-4200-972a-3fd17d688d34 | [{'system': 'GDC', 'value': 'fa219ae6-def1-420... | FM-AD | Squamous Cell Neoplasms | Cervix uteri | AD7747 |
96 | fb6f2e38-9281-4085-923c-ef99955fd5ea | [{'system': 'GDC', 'value': 'fb6f2e38-9281-408... | CGCI-HTMCP-CC | Squamous Cell Neoplasms | Cervix uteri | HTMCP-03-06-02062 |
97 | 13d72130-604c-4d79-95cc-53c2e25d91b0 | [{'system': 'GDC', 'value': '13d72130-604c-4d7... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-ZJ-AAX4 |
98 | 15d1d0ad-4196-49d1-8eb3-38c75b7db58c | [{'system': 'GDC', 'value': '15d1d0ad-4196-49d... | GENIE-MSK | Myomatous Neoplasms | Uterus, NOS | GENIE-MSK-P-0005582 |
99 | 1d6f367d-a00d-4bd0-9a8b-0a25e37fc1cd | [{'system': 'GDC', 'value': '1d6f367d-a00d-4bd... | GENIE-DFCI | Cystic, Mucinous and Serous Neoplasms | Uterus, NOS | GENIE-DFCI-001660 |
100 rows × 6 columns
ResearchSubject Field Definitions
A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.
- id: The unique identifier for this researchsubject
- identifier: An embedded array of information that includes the originating data center and the ID the researchsubject had there
- member_of_research_project: The name of the study/project that the subject participated in
- primary_diagnosis_condition: The cancer, disease or other condition under study
- primary_diagnosis_site: The primary_disease_site that qualifies the researchsubject for the research_project
- subject_id: An identifier for the subject. Can be joined to the `id` field from subject results
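Because `identifier` is an embedded array of system/value pairs, it usually needs unpacking before it can be tallied or filtered. The sketch below shows one way to do that in plain Python. The two IDC records are taken from the full researchsubject results shown later in this walkthrough; the third record is hypothetical, as is the helper `systems_of`.

```python
from collections import Counter

records = [
    {"id": "TCGA-N9-A4Q7__tcga_ucs",
     "identifier": [{"system": "IDC", "value": "TCGA-N9-A4Q7"}]},
    {"id": "TCGA-QS-A744__tcga_ucec",
     "identifier": [{"system": "IDC", "value": "TCGA-QS-A744"}]},
    # Hypothetical record, included only to show a second data center.
    {"id": "example-rs-1",
     "identifier": [{"system": "GDC", "value": "example-rs-1"}]},
]


def systems_of(record):
    """Return the sorted, de-duplicated data centers named in a record's identifier array."""
    return sorted({entry["system"] for entry in record["identifier"]})


system_counts = Counter(s for record in records for s in systems_of(record))
print(system_counts)
```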
NoAdenoData.subject.run().to_dataframe()
Getting results from database
Total execution time: 3460 ms
| | id | identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | age_at_death | cause_of_death |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD2728 | [{'system': 'GDC', 'value': 'AD2728'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
1 | C3N-01876 | [{'system': 'IDC', 'value': 'C3N-01876'}] | Homo sapiens | None | None | None | NaN | [cptac_ucec] | None | NaN | None |
2 | GENIE-DFCI-007276 | [{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}] | Homo sapiens | female | white | not hispanic or latino | -18627.0 | [GENIE-DFCI] | Not Reported | NaN | None |
3 | GENIE-DFCI-009140 | [{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}] | Homo sapiens | female | white | not hispanic or latino | -24837.0 | [GENIE-DFCI] | Not Reported | NaN | None |
4 | GENIE-DFCI-009144 | [{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}] | Homo sapiens | female | white | not hispanic or latino | -19723.0 | [GENIE-DFCI] | Not Reported | NaN | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | AD14317 | [{'system': 'GDC', 'value': 'AD14317'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
96 | AD3008 | [{'system': 'GDC', 'value': 'AD3008'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
97 | AD6414 | [{'system': 'GDC', 'value': 'AD6414'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
98 | AD7975 | [{'system': 'GDC', 'value': 'AD7975'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
99 | C3L-00157 | [{'system': 'GDC', 'value': 'C3L-00157'}, {'sy... | Homo sapiens | female | white | hispanic or latino | -22118.0 | [CPTAC3-Discovery, CPTAC-3, cptac_ucec] | Dead | 1396.0 | Cancer Related |
100 rows × 11 columns
Subject Field Definitions
A subject is a specific, unique individual, e.g., a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.
- id: The unique identifier for this subject
- identifier: An embedded array of information that includes the originating data center and the ID the subject had there
- species: The species of the subject
- sex: A reference to the biological sex of the donor organism.
- race: The race of the subject
- ethnicity: The ethnicity of the subject
- days_to_birth: Number of days between the index date and the person's date of birth, expressed as a negative number of days
- subject_associated_project: An embedded array of the names of projects (studies) the subject was part of
- vital_status: Whether the subject is alive
- age_at_death: The number of days after first enrollment that the subject died
- cause_of_death: The cause of death, if known
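Since `days_to_birth` is a negative day count relative to the index date, an approximate age at index can be derived by negating it and dividing by the length of a year. The sketch below does this for two values from the preview table; the helper `age_in_years` and the 365.25-day year are assumptions chosen for a rough estimate, not a CDA convention.

```python
import math

DAYS_PER_YEAR = 365.25  # assumption: average year length, fine for a rough age


def age_in_years(days_to_birth):
    """Convert a negative days_to_birth value into an approximate age at index."""
    if days_to_birth is None or math.isnan(days_to_birth):
        return None  # missing values stay missing
    return -days_to_birth / DAYS_PER_YEAR


for days in (-18627.0, -24837.0, float("nan")):
    age = age_in_years(days)
    print(days, "->", "unknown" if age is None else f"{age:.1f} years")
```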
NoAdenoData.file.run().to_dataframe()
Getting results from database
Total execution time: 3742 ms
| | id | identifier | label | data_category | data_type | file_format | associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | researchsubject_specimen_id | researchsubject_id | subject_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | d3151fb9-9dd5-470e-b181-4d920f686068 | [{'system': 'GDC', 'value': 'd3151fb9-9dd5-470... | TCGA-B5-A11E-01A-21-A163-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCEC | drs://dg.4DFC:d3151fb9-9dd5-470e-b181-4d920f68... | 22341 | f44fc349969dda464ddf37f5e1f149f1 | Genomic | None | None | TCGA-B5-A11E | ||
1 | 2200d48f-d10d-4e0c-aff6-a71958fc2b1b | [{'system': 'GDC', 'value': '2200d48f-d10d-4e0... | TCGA-A5-A0G9-01A-21-A162-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCEC | drs://dg.4DFC:2200d48f-d10d-4e0c-aff6-a71958fc... | 24285 | 8edb8c63f398d0d6dab0655d62b1cd93 | Genomic | None | None | TCGA-A5-A0G9 | ||
2 | e6ee1e9e-9c28-4db8-9f7f-3916f5351717 | [{'system': 'GDC', 'value': 'e6ee1e9e-9c28-4db... | TCGA-N7-A4Y5-01A-21-A41P-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCS | drs://dg.4DFC:e6ee1e9e-9c28-4db8-9f7f-3916f535... | 22026 | 73159e8898216b617ac3e135af51d87e | Genomic | None | None | TCGA-N7-A4Y5 | ||
3 | 81674772-fd6d-48b6-93b1-fa585d1ed568 | [{'system': 'GDC', 'value': '81674772-fd6d-48b... | 49b02eb4-8e31-42cd-a3e7-065611836434.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | BEDPE | CPTAC-3 | drs://dg.4DFC:81674772-fd6d-48b6-93b1-fa585d1e... | 9977 | 64560c17caa67fa25411218ef57101a6 | Genomic | None | phs001287 | C3L-01307 | ||
4 | c3392a1e-1241-4068-9bca-31fd836148de | [{'system': 'GDC', 'value': 'c3392a1e-1241-406... | TCGA-BG-A0MA-01A-21-A18Q-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCEC | drs://dg.4DFC:c3392a1e-1241-4068-9bca-31fd8361... | 22324 | 3da3113805454ac4fca6482fbaf4b4b1 | Genomic | None | None | TCGA-BG-A0MA | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | b42cdaba-46c8-4a02-b7cf-86ceb5d1f712 | [{'system': 'GDC', 'value': 'b42cdaba-46c8-4a0... | TCGA-HG-A2PA-01A-21-A40H-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-CESC | drs://dg.4DFC:b42cdaba-46c8-4a02-b7cf-86ceb5d1... | 22070 | 398e6cca19ff30d932a1a78669254710 | Genomic | None | None | TCGA-HG-A2PA | ||
96 | 1731c20f-1f10-4f80-8793-99593f81515f | [{'system': 'GDC', 'value': '1731c20f-1f10-4f8... | TCGA-B5-A11P-01B-21-A18Q-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCEC | drs://dg.4DFC:1731c20f-1f10-4f80-8793-99593f81... | 22344 | 8bcbded9fbd5a48a58f77ae1e3ea829f | Genomic | None | None | TCGA-B5-A11P | ||
97 | 36f66d66-f71f-49be-9e51-ac640b826d3f | [{'system': 'GDC', 'value': '36f66d66-f71f-49b... | TCGA-EY-A1GH-01A-21-A18Q-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-UCEC | drs://dg.4DFC:36f66d66-f71f-49be-9e51-ac640b82... | 22338 | 52a07dbf0ebfcafeb41958d4a1e2b489 | Genomic | None | None | TCGA-EY-A1GH | ||
98 | 59a4c826-87d1-43ab-9b2e-3c6088275fd7 | [{'system': 'GDC', 'value': '59a4c826-87d1-43a... | de3cbd77-822b-4c86-80e8-9be54ca8b324.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | BEDPE | CGCI-HTMCP-CC | drs://dg.4DFC:59a4c826-87d1-43ab-9b2e-3c608827... | 121947 | c9bcda0d917caf81773efd8e2f827ebb | Genomic | None | phs000528 | HTMCP-03-06-02040 | ||
99 | a7c7ba3e-7d9d-4735-afff-22442a0e9a84 | [{'system': 'GDC', 'value': 'a7c7ba3e-7d9d-473... | 44c6a116-146f-43fd-aa1e-7e8b1636a722.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:a7c7ba3e-7d9d-4735-afff-22442a0e... | 76341 | 5e513d0fae6e32b5425c647c9d8d3ba3 | Genomic | None | phs000528 | HTMCP-03-06-02144 |
100 rows × 16 columns
File Field Definitions
A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.
- id: The unique identifier for this file
- identifier: An embedded array of information that includes the originating data center and the ID the file had there
- label: The full name of the file
- data_category: A description of the general kind of data the file holds
- data_type: A more specific description of the data type
- file_format: String to identify the full file extension including compression extensions
- associated_project: The name the data center uses for the study this file was generated for
- drs_uri: A unique identifier that can be used to retrieve this specific file from a server
- byte_size: Size of the file in bytes
- checksum: The md5 value for the file
- data_modality: Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging"
- imaging_modality: For files with the `data_modality` of "Imaging", a descriptor for the image type
- dbgap_accession_number: The project id number for this data on dbGaP
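The `byte_size` and `data_type` fields make it possible to estimate how much data a selection represents before downloading anything. The sketch below totals the sizes of the first five files from the preview above, grouped by data type. It is only an illustration on copied values, not part of the cdapython API.

```python
from collections import defaultdict

# (data_type, byte_size) pairs copied from rows 0-4 of the file preview.
files = [
    ("Protein Expression Quantification", 22341),
    ("Protein Expression Quantification", 24285),
    ("Protein Expression Quantification", 22026),
    ("Structural Rearrangement", 9977),
    ("Protein Expression Quantification", 22324),
]

bytes_by_type = defaultdict(int)
for data_type, byte_size in files:
    bytes_by_type[data_type] += byte_size

total_bytes = sum(bytes_by_type.values())

for data_type, size in bytes_by_type.items():
    print(f"{data_type}: {size} bytes")
print(f"Total: {total_bytes} bytes")
```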
Finally, Julia wants to save these results to use for the future. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])  # append each page of results
Getting results from database
Total execution time: 3318 ms
subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])  # append each page of results
Getting results from database
Total execution time: 3283 ms
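The page-by-page loops above keep each page's own row labels, which is presumably why the combined tables further down end at a small row label rather than one near the row count. A variation, sketched here with a hypothetical `fake_pages` generator standing in for the paginator, collects the pages first and concatenates once with `ignore_index=True` so that rows are numbered consecutively.

```python
import pandas as pd


def fake_pages():
    """Hypothetical stand-in for a paginator: yields one DataFrame per page."""
    yield pd.DataFrame({"id": ["a", "b"], "subject_id": ["s1", "s2"]})
    yield pd.DataFrame({"id": ["c", "d"], "subject_id": ["s3", "s4"]})
    yield pd.DataFrame({"id": ["e"], "subject_id": ["s5"]})


# Concatenate all pages in one call; ignore_index=True renumbers rows 0..n-1.
all_rows = pd.concat(list(fake_pages()), ignore_index=True)
print(all_rows)
```

A single `pd.concat` over a list also avoids re-copying the growing frame on every iteration, which can matter when there are many pages.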
rsdf # view the researchsubject dataframe
| | id | identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
---|---|---|---|---|---|---|
0 | 146bd9db-1645-4950-bd18-de30d0db2487 | [{'system': 'GDC', 'value': '146bd9db-1645-495... | CGCI-HTMCP-CC | Squamous Cell Neoplasms | Cervix uteri | HTMCP-03-06-02138 |
1 | 32e83039-7663-422b-a541-6d9149851560 | [{'system': 'GDC', 'value': '32e83039-7663-422... | GENIE-GRCC | Complex Mixed and Stromal Neoplasms | Uterus, NOS | GENIE-GRCC-4f168dad |
2 | 37063f74-ccc7-426e-ac1c-ad733f2f7e95 | [{'system': 'GDC', 'value': '37063f74-ccc7-426... | GENIE-UHN | Epithelial Neoplasms, NOS | Corpus uteri | GENIE-UHN-247706 |
3 | 3878f58e-76ba-4480-a784-88505bd464d0 | [{'system': 'GDC', 'value': '3878f58e-76ba-448... | TCGA-UCEC | Cystic, Mucinous and Serous Neoplasms | Corpus uteri | TCGA-FI-A2EX |
4 | 3df6abe2-2123-4bfa-a4e4-88df5f940c04 | [{'system': 'GDC', 'value': '3df6abe2-2123-4bf... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-JX-A3PZ |
... | ... | ... | ... | ... | ... | ... |
91 | TCGA-N9-A4Q7__tcga_ucs | [{'system': 'IDC', 'value': 'TCGA-N9-A4Q7'}] | tcga_ucs | None | Uterus | TCGA-N9-A4Q7 |
92 | TCGA-QS-A744__tcga_ucec | [{'system': 'IDC', 'value': 'TCGA-QS-A744'}] | tcga_ucec | None | Uterus | TCGA-QS-A744 |
93 | c64d5576-df00-4772-a3d1-1f8863000750 | [{'system': 'GDC', 'value': 'c64d5576-df00-477... | CGCI-HTMCP-CC | Squamous Cell Neoplasms | Cervix uteri | HTMCP-03-06-02099 |
94 | cc500ada-7440-412f-b54c-4966c8098dcb | [{'system': 'GDC', 'value': 'cc500ada-7440-412... | GENIE-DFCI | Cystic, Mucinous and Serous Neoplasms | Uterus, NOS | GENIE-DFCI-000331 |
95 | d7a75bf5-5189-4978-99d9-fcef91c9fbd2 | [{'system': 'GDC', 'value': 'd7a75bf5-5189-497... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-EK-A2R7 |
3196 rows × 6 columns
subsdf # view the subject dataframe
| | id | identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | age_at_death | cause_of_death |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD2728 | [{'system': 'GDC', 'value': 'AD2728'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
1 | C3N-01876 | [{'system': 'IDC', 'value': 'C3N-01876'}] | Homo sapiens | None | None | None | NaN | [cptac_ucec] | None | NaN | None |
2 | GENIE-DFCI-007276 | [{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}] | Homo sapiens | female | white | not hispanic or latino | -18627.0 | [GENIE-DFCI] | Not Reported | NaN | None |
3 | GENIE-DFCI-009140 | [{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}] | Homo sapiens | female | white | not hispanic or latino | -24837.0 | [GENIE-DFCI] | Not Reported | NaN | None |
4 | GENIE-DFCI-009144 | [{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}] | Homo sapiens | female | white | not hispanic or latino | -19723.0 | [GENIE-DFCI] | Not Reported | NaN | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3 | TCGA-EY-A72D | [{'system': 'GDC', 'value': 'TCGA-EY-A72D'}, {... | Homo sapiens | female | black or african american | not hispanic or latino | -31818.0 | [TCGA-UCEC, tcga_ucec] | Alive | NaN | None |
4 | TCGA-IE-A4EH | [{'system': 'GDC', 'value': 'TCGA-IE-A4EH'}, {... | Homo sapiens | female | white | not hispanic or latino | -12871.0 | [tcga_sarc, TCGA-SARC] | Alive | NaN | None |
5 | TCGA-IS-A3KA | [{'system': 'GDC', 'value': 'TCGA-IS-A3KA'}, {... | Homo sapiens | female | white | not hispanic or latino | -26775.0 | [tcga_sarc, TCGA-SARC] | Dead | 413.0 | None |
6 | TCGA-NA-A4QY | [{'system': 'GDC', 'value': 'TCGA-NA-A4QY'}, {... | Homo sapiens | female | white | not hispanic or latino | -22756.0 | [tcga_ucs, TCGA-UCS] | Dead | 114.0 | None |
7 | TCGA-VS-A9V3 | [{'system': 'GDC', 'value': 'TCGA-VS-A9V3'}, {... | Homo sapiens | female | white | not reported | -22990.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
2608 rows × 11 columns
Then Julia uses the ID fields in each result (`subject_id` in the researchsubject results and `id` in the subject results) to join them together into one big dataset:
allmetadata = rsdf.set_index("subject_id").join(subsdf.set_index("id"), lsuffix='resub', rsuffix="subject")
allmetadata
subject_id | id | identifierresub | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | identifiersubject | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | age_at_death | cause_of_death |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AD100 | 0f08e2e9-9983-4204-972f-a630b7ab2c25 | [{'system': 'GDC', 'value': '0f08e2e9-9983-420... | FM-AD | Squamous Cell Neoplasms | Cervix uteri | [{'system': 'GDC', 'value': 'AD100'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
AD1026 | 6d9d6cb9-652f-4749-b4c5-aa9e6b80de69 | [{'system': 'GDC', 'value': '6d9d6cb9-652f-474... | FM-AD | Complex Mixed and Stromal Neoplasms | Uterus, NOS | [{'system': 'GDC', 'value': 'AD1026'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
AD10328 | 514fc104-1ee5-4701-8f45-9a011143f1e2 | [{'system': 'GDC', 'value': '514fc104-1ee5-470... | FM-AD | Squamous Cell Neoplasms | Cervix uteri | [{'system': 'GDC', 'value': 'AD10328'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
AD10460 | 8c36611d-be2f-432a-afde-e684ab4333ea | [{'system': 'GDC', 'value': '8c36611d-be2f-432... | FM-AD | Cystic, Mucinous and Serous Neoplasms | Uterus, NOS | [{'system': 'GDC', 'value': 'AD10460'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
AD10485 | 0ad0fdda-dd96-48df-8edd-e5e471e9f680 | [{'system': 'GDC', 'value': '0ad0fdda-dd96-48d... | FM-AD | Cystic, Mucinous and Serous Neoplasms | Uterus, NOS | [{'system': 'GDC', 'value': 'AD10485'}] | Homo sapiens | female | not reported | not reported | NaN | [FM-AD] | Not Reported | NaN | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
TCGA-ZJ-AB0H | TCGA-ZJ-AB0H__tcga_cesc | [{'system': 'IDC', 'value': 'TCGA-ZJ-AB0H'}] | tcga_cesc | None | Cervix | [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0H'}, {... | Homo sapiens | female | not reported | not reported | -17869.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
TCGA-ZJ-AB0I | a4f13656-a941-498a-9ac9-f020ed559b35 | [{'system': 'GDC', 'value': 'a4f13656-a941-498... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {... | Homo sapiens | female | white | not hispanic or latino | -9486.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
TCGA-ZJ-AB0I | TCGA-ZJ-AB0I__tcga_cesc | [{'system': 'IDC', 'value': 'TCGA-ZJ-AB0I'}] | tcga_cesc | None | Cervix | [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {... | Homo sapiens | female | white | not hispanic or latino | -9486.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
TCGA-ZX-AA5X | 4756acc0-4e96-44d4-b359-04d64dc7eb84 | [{'system': 'GDC', 'value': '4756acc0-4e96-44d... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | [{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {... | Homo sapiens | female | white | not hispanic or latino | -23440.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
TCGA-ZX-AA5X | TCGA-ZX-AA5X__tcga_cesc | [{'system': 'IDC', 'value': 'TCGA-ZX-AA5X'}] | tcga_cesc | None | Cervix | [{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {... | Homo sapiens | female | white | not hispanic or latino | -23440.0 | [TCGA-CESC, tcga_cesc] | Alive | NaN | None |
3196 rows × 15 columns
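The join is many-to-one: 3196 researchsubject rows map onto 2608 subjects, so some subjects appear more than once in the result. A way to make that expectation explicit, sketched here on hypothetical toy frames, is `DataFrame.merge` with `validate` and `indicator`, which raises an error if subject IDs are duplicated on the subject side and flags any researchsubject without a matching subject.

```python
import pandas as pd

# Hypothetical toy data: two researchsubjects share subject s1.
rs = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3"],
    "subject_id": ["s1", "s1", "s2"],
})
subjects = pd.DataFrame({"id": ["s1", "s2"], "sex": ["female", "female"]})

merged = rs.merge(
    subjects,
    left_on="subject_id",
    right_on="id",
    how="left",
    suffixes=("_researchsubject", "_subject"),
    validate="many_to_one",  # each subject id must be unique on the right
    indicator=True,          # adds a _merge column describing each row's match
)

# Rows whose subject_id found no partner in the subject table.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("Unmatched researchsubjects:", len(unmatched))
```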
She then saves it to a CSV file so she can browse it in Excel:
allmetadata.to_csv("allmetadata.csv")
Julia knows from her subject count summary that there are 33480 files associated with her subjects, which is likely far more than she needs. To help her decide which files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects matching her search criteria:
NoAdenoData.researchsubject.file.count.run()
Getting results from database
Total execution time: 3399 ms
total : 3196
files : 297923
system | count |
---|---|
PDC | 104 |
GDC | 1918 |
IDC | 1174 |
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Squamous Cell Neoplasms | 609 |
Complex Mixed and Stromal Neoplasms | 320 |
None | 1175 |
Myomatous Neoplasms | 187 |
Not Reported | 12 |
Epithelial Neoplasms, NOS | 230 |
Complex Epithelial Neoplasms | 27 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 961 |
Corpus uteri | 373 |
Cervix uteri | 688 |
Uterus | 867 |
Cervix | 307 |