Summarize Search Results¶
Before we do any work, we need to import several functions from cdapython:
- `Q` and `query`, which power the search
- `columns`, which lets us view entity field names
- `unique_terms`, which lets us view entity field contents
We're also asking cdapython to report its version so we can be sure we're using the one we mean to.
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")
2022.6.22
CDA data comes from three sources:
- The Proteomic Data Commons (PDC)
- The Genomic Data Commons (GDC)
- The Imaging Data Commons (IDC)
The CDA makes this data searchable in four endpoints:
- `subject`: Specific, unique individuals
- `researchsubject`: Study-individual aggregate entities. A `subject` who was part of three studies will appear as three `researchsubjects`
- `specimen`: Samples taken from an individual
- `file`: Data about `subject`, `researchsubject`, `specimen`, and their associated information
If you are looking to build a cohort of distinct individuals who meet some criteria, search by `subject`. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by `researchsubject`. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with `specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `file`.
In CDA search, these concepts can also be strung together, so you can look specifically for `specimen file`, or `researchsubject specimen`. In all cases, any search can use any metadata field; the only difference between search types is what type of data you return by default.
Getting simple summary data¶
Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we first construct a query in `Q` and save it to a variable `myquery`. This is the same query we ran in the Basic Search notebook:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')
Where did those terms come from?
If you aren't sure how we knew what terms to put in our search, please refer back to the What search terms are available? notebook.
Overall summary¶
You can get a quick summary of how many unique specimens, treatments, diagnoses, researchsubjects and subjects meet your search criteria by chaining a `count` command into the basic `run` call.
myquery.count.run()
Getting results from database
Total execution time: 3330 ms
specimen_count : 39150
treatment_count : 2379
diagnosis_count : 1751
researchsubject_count : 2923
subject_count : 2314
Each number is the total number of rows that will come back when you query the corresponding endpoint.
subject summary¶
We can also add `count` to the other run calls we did in the Basic Search notebook to get more detailed summaries:
subjectresults = myquery.subject.count.run()
Getting results from database
Total execution time: 3295 ms
Since we save the output as a variable, we need to look at the variable to see the results:
subjectresults
total : 2314
files : 4081065
system | count |
---|---|
IDC | 1167 |
PDC | 309 |
GDC | 1449 |
sex | count |
---|---|
None | 683 |
male | 979 |
female | 649 |
not reported | 3 |
race | count |
---|---|
None | 683 |
white | 1308 |
not reported | 135 |
asian | 33 |
black or african american | 96 |
Unknown | 20 |
other | 9 |
not allowed to collect | 25 |
american indian or alaska native | 4 |
native hawaiian or other pacific islander | 1 |
ethnicity | count |
---|---|
None | 683 |
not hispanic or latino | 1282 |
not reported | 219 |
Unknown | 21 |
hispanic or latino | 84 |
not allowed to collect | 25 |
cause_of_death | count |
---|---|
None | 2028 |
Not Reported | 200 |
Cancer Related | 63 |
Infection | 3 |
Not Cancer Related | 9 |
Surgical Complications | 2 |
Unknown | 9 |
By default, the results are displayed as a table for easy previewing of the data. Since we queried the `subject` endpoint, our default results give us `subject`-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using counts gives us a pivot-table-style summary of the countable fields for subjects. Note that above the tables it also tells you the total subject count, as well as how many files are associated with those subjects.
This gives you a quick way to assess whether the full search results will have the data fields you require. But if you want to get the underlying data for your own downstream applications, you can also get the raw numbers by calling the zeroth value of the variable:
subjectresults[0]
{'total': 2314, 'files': 4081065, 'system': [{'system': 'IDC', 'count': 1167}, {'system': 'PDC', 'count': 309}, {'system': 'GDC', 'count': 1449}], 'sex': [{'sex': 'null', 'count': 683}, {'sex': 'male', 'count': 979}, {'sex': 'female', 'count': 649}, {'sex': 'not reported', 'count': 3}], 'race': [{'race': 'null', 'count': 683}, {'race': 'white', 'count': 1308}, {'race': 'not reported', 'count': 135}, {'race': 'asian', 'count': 33}, {'race': 'black or african american', 'count': 96}, {'race': 'Unknown', 'count': 20}, {'race': 'other', 'count': 9}, {'race': 'not allowed to collect', 'count': 25}, {'race': 'american indian or alaska native', 'count': 4}, {'race': 'native hawaiian or other pacific islander', 'count': 1}], 'ethnicity': [{'ethnicity': 'null', 'count': 683}, {'ethnicity': 'not hispanic or latino', 'count': 1282}, {'ethnicity': 'not reported', 'count': 219}, {'ethnicity': 'Unknown', 'count': 21}, {'ethnicity': 'hispanic or latino', 'count': 84}, {'ethnicity': 'not allowed to collect', 'count': 25}], 'cause_of_death': [{'cause_of_death': 'null', 'count': 2028}, {'cause_of_death': 'Not Reported', 'count': 200}, {'cause_of_death': 'Cancer Related', 'count': 63}, {'cause_of_death': 'Infection', 'count': 3}, {'cause_of_death': 'Not Cancer Related', 'count': 9}, {'cause_of_death': 'Surgical Complications', 'count': 2}, {'cause_of_death': 'Unknown', 'count': 9}]}
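Because the raw result is a plain dictionary, it can be checked with ordinary Python. The sketch below is illustrative and not part of cdapython: it copies three entries from the output above into a literal called `raw` (a hypothetical name) and uses a hypothetical helper, `summarize_field`, to sum the counts for a field and report how many records have no value. It assumes every counted field is a list of dictionaries shaped like the output shown.

```python
# Excerpt of the dictionary printed above (only three entries kept for brevity).
raw = {
    "total": 2314,
    "system": [
        {"system": "IDC", "count": 1167},
        {"system": "PDC", "count": 309},
        {"system": "GDC", "count": 1449},
    ],
    "sex": [
        {"sex": "null", "count": 683},
        {"sex": "male", "count": 979},
        {"sex": "female", "count": 649},
        {"sex": "not reported", "count": 3},
    ],
}


def summarize_field(raw_counts, field):
    """Return (sum of counts, count of 'null' entries) for one counted field."""
    rows = raw_counts[field]
    total = sum(row["count"] for row in rows)
    missing = sum(row["count"] for row in rows if row[field] == "null")
    return total, missing


for field in ("system", "sex"):
    total, missing = summarize_field(raw, field)
    print(f"{field}: counts sum to {total}, {missing} with no value")
```

With these numbers the `sex` counts sum to the 2314 total, while the `system` counts sum to 2925, which would be consistent with some subjects having data at more than one data center.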
Subject Field Definitions
A subject is a specific, unique individual, e.g. a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.
Count status | Field | Definition |
---|---|---|
'total' | id | The unique identifier for this subject |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the subject had there |
Counted | species | The species of the subject |
Counted | sex | The sex of the subject |
Counted | race | The race of the subject |
Counted | ethnicity | The ethnicity of the subject |
Not Counted | days_to_birth | Number of days between the index date and the person's date of birth, expressed as a negative number of days |
Not Counted | subject_associated_project | An embedded array of the names of projects (studies) the subject was part of |
Not Counted | vital_status | Whether the subject is alive |
Not Counted | age_at_death | The number of days after first enrollment that the subject died |
Counted | cause_of_death | The cause of death, if known |
researchsubject¶
If we're interested in which researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint. Let's run it without saving to a variable this time to make it a bit quicker:
myquery.researchsubject.count.run()
Getting results from database
Total execution time: 3204 ms
total : 2923
files : 4081045
system | count |
---|---|
GDC | 1449 |
PDC | 309 |
IDC | 1165 |
primary_diagnosis_condition | count |
---|---|
Gliomas | 1244 |
Glioblastoma | 100 |
Germ Cell Neoplasms | 104 |
None | 1165 |
Pediatric/AYA Brain Tumors | 199 |
Other | 10 |
Neoplasms, NOS | 63 |
Not Reported | 11 |
Malignant Lymphomas, NOS or Diffuse | 14 |
Not Applicable | 9 |
Mature B-Cell Lymphomas | 2 |
Neuroepitheliomatous Neoplasms | 2 |
primary_diagnosis_site | count |
---|---|
Brain | 2923 |
ResearchSubject Field Definitions
A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.
Count status | Field | Definition |
---|---|---|
'total' | id | The unique identifier for this researchsubject |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the researchsubject had there |
Not Counted | member_of_research_project | The name of the study/project that the subject participated in |
Counted | primary_diagnosis_condition | The cancer, disease or other condition under study |
Counted | primary_diagnosis_site | The primary_disease_site that qualifies the researchsubject for the research_project |
Not Counted | subject_id | An identifier for the subject |
diagnosis¶
The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:
myquery.diagnosis.count.run()
Getting results from database
Total execution time: 3273 ms
total : 1751
system | count |
---|---|
GDC | 1422 |
PDC | 329 |
primary_diagnosis | count |
---|---|
Glioblastoma | 821 |
Mixed glioma | 131 |
Ganglioglioma, NOS | 18 |
Mixed germ cell tumor | 79 |
Neoplasm, malignant | 50 |
Oligodendroglioma, NOS | 112 |
Astrocytoma, NOS | 64 |
Glioma, NOS | 93 |
Oligodendroglioma, anaplastic | 78 |
Astrocytoma, anaplastic | 130 |
Glioma, malignant | 26 |
Medulloblastoma, NOS | 22 |
Ependymoma, NOS | 32 |
Atypical teratoid/rhabdoid tumor | 12 |
Teratoma, malignant, NOS | 2 |
Not Reported | 10 |
Teratoma, benign | 3 |
Craniopharyngioma | 16 |
Papillary glioneuronal tumor | 2 |
Malignant lymphoma, NOS | 14 |
Yolk sac tumor | 8 |
Embryonal carcinoma, NOS | 8 |
Neoplasm, uncertain whether benign or malignant | 13 |
Germinoma | 4 |
Malignant lymphoma, large B-cell, diffuse, NOS | 2 |
Gliosarcoma | 1 |
stage | count |
---|---|
None | 1422 |
Not Reported | 110 |
Unknown | 219 |
grade | count |
---|---|
not reported | 1116 |
G1 | 98 |
G2 | 52 |
Not Reported | 392 |
G4 | 36 |
None | 22 |
High Grade | 26 |
Low Grade | 9 |
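The grade table lists "not reported" and "Not Reported" as separate categories. If you want a single tally per category, one option is to fold labels that differ only by letter case. The sketch below is a minimal illustration, not a cdapython feature: the counts are copied from the table above, and `merge_case_variants` is a hypothetical helper name.

```python
from collections import Counter

# Grade counts copied from the table above; None is the empty (null) category.
grade_counts = [
    ("not reported", 1116),
    ("G1", 98),
    ("G2", 52),
    ("Not Reported", 392),
    ("G4", 36),
    (None, 22),
    ("High Grade", 26),
    ("Low Grade", 9),
]


def merge_case_variants(pairs):
    """Combine categories whose labels differ only by letter case."""
    merged = Counter()
    for label, count in pairs:
        # Lowercase text labels; leave the null category untouched.
        key = label.lower() if label is not None else None
        merged[key] += count
    return dict(merged)


merged = merge_case_variants(grade_counts)
for label, count in merged.items():
    print(label, count)
```

After merging, the two "not reported" rows become one entry of 1116 + 392 = 1508, and the counts still add up to the 1751 total. Note that the keys are lowercased, so "G1" appears as "g1".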
Diagnosis Field Definitions
A diagnosis is a medical classification of a disease for a given research subject in a given study. A single research subject may have different diagnoses across different studies.
Count status | Field | Definition |
---|---|---|
'total' | id | The unique identifier for this diagnosis in this research subject |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the diagnosed researchsubject had there |
Counted | primary_diagnosis | The main medical diagnosis for this subject in this study |
Not Counted | age_at_diagnosis | The subject's age in days on the day they were first diagnosed |
Not Counted | morphology | The International Classification of Diseases for Oncology diagnostic code for this diagnosis |
Counted | stage | A measure of disease spread. Different diseases may use different staging criteria |
Counted | grade | A measure of cell abnormality. Different diseases may use different grading criteria |
Not Counted | method_of_diagnosis | The test or system used for determining the diagnosis |
Not Counted | subject_id | An identifier for the subject. Can be joined to the `id` field from subject results |
Not Counted | researchsubject_id | An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results |
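The `subject_id` field is what lets diagnosis rows be matched back to subject rows. The sketch below shows one way such a join could look in plain Python. The rows, the ids, and the helper `join_on_subject` are all invented for illustration; they are not real CDA records and not part of cdapython.

```python
# Hypothetical rows standing in for subject and diagnosis results.
# The ids and values are made up for illustration only.
subjects = [
    {"id": "subject-A", "sex": "female"},
    {"id": "subject-B", "sex": "male"},
]
diagnoses = [
    {"id": "dx-1", "subject_id": "subject-A", "primary_diagnosis": "Glioblastoma"},
    {"id": "dx-2", "subject_id": "subject-B", "primary_diagnosis": "Astrocytoma, NOS"},
    {"id": "dx-3", "subject_id": "subject-A", "primary_diagnosis": "Glioma, NOS"},
]


def join_on_subject(subject_rows, diagnosis_rows):
    """Attach each diagnosis row's subject sex by matching subject_id to id."""
    subject_by_id = {row["id"]: row for row in subject_rows}
    joined = []
    for dx in diagnosis_rows:
        subject = subject_by_id.get(dx["subject_id"])
        if subject is None:
            continue  # skip diagnoses with no matching subject row
        joined.append({**dx, "sex": subject["sex"]})
    return joined


joined = join_on_subject(subjects, diagnoses)
for row in joined:
    print(row["id"], row["subject_id"], row["sex"], row["primary_diagnosis"])
```

Building a lookup dictionary keyed on `id` first keeps the join to a single pass over the diagnosis rows. The same pattern would apply to `researchsubject_id` and the `id` field from researchsubject results.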
treatment¶
The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:
myquery.treatment.count.run()
Getting results from database
Total execution time: 3178 ms
total : 2379
system | count |
---|---|
GDC | 2379 |
treatment_type | count |
---|---|
Radiation Therapy, NOS | 1139 |
Targeted Molecular Therapy | 23 |
Pharmaceutical Therapy, NOS | 1117 |
Immunotherapy (Including Vaccines) | 23 |
Chemotherapy | 30 |
Radiation, Proton Beam | 1 |
Surgery | 23 |
None | 23 |
treatment_effect | count |
---|---|
None | 2379 |
Treatment Field Definitions
A treatment is a medical intervention for a diagnosed disease in a given subject in a given study. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.
Count status | Field | Definition |
---|---|---|
'total' | id | The unique identifier for this treatment of this diagnosis in this research subject |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the treated researchsubject had there |
Counted | treatment_type | The medical intervention undertaken |
Not Counted | treatment_outcome | The result of the medical intervention |
Not Counted | days_to_treatment_start | |
Not Counted | days_to_treatment_end | |
Not Counted | therapeutic_agent | What treatment or drug was used for this researchsubject |
Not Counted | treatment_anatomic_site | The specific body location of the treatment |
Counted | treatment_effect | |
Not Counted | treatment_end_reason | |
Not Counted | number_of_cycles | |
Not Counted | subject_id | An identifier for the subject. Can be joined to the `id` field from subject results |
Not Counted | researchsubject_id | An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results |
Not Counted | researchsubject_diagnosis_id | An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results |
specimens¶
We can use this same query to see what specimens are available for brain tissue at the CDA:
myquery.specimen.count.run()
Getting results from database
Total execution time: 3234 ms
total : 39150
files : 50494
system | count |
---|---|
GDC | 38492 |
PDC | 658 |
primary_disease_type | count |
---|---|
Gliomas | 37549 |
Glioblastoma | 200 |
Other | 20 |
Pediatric/AYA Brain Tumors | 438 |
Mature B-Cell Lymphomas | 54 |
Germ Cell Neoplasms | 416 |
Neoplasms, NOS | 252 |
Not Reported | 121 |
Not Applicable | 36 |
Malignant Lymphomas, NOS or Diffuse | 56 |
Neuroepitheliomatous Neoplasms | 8 |
source_material_type | count |
---|---|
Primary Tumor | 27519 |
Solid Tissue Normal | 538 |
Blood Derived Normal | 10074 |
Recurrent Tumor | 513 |
Not Reported | 36 |
Next Generation Cancer Model | 169 |
Expanded Next Generation Cancer Model | 35 |
Metastatic | 252 |
Buccal Cell Normal | 14 |
specimen_type | count |
---|---|
portion | 5986 |
sample | 4085 |
analyte | 6659 |
aliquot | 18673 |
slide | 3747 |
Nearly 40,000 specimens with over 50,000 files meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects, for two reasons: studies will often take more than one sample per subject, and any given specimen might be aliquoted out to be used in multiple tests.
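To put a number on that difference, the totals reported by the count summaries above can be turned into simple ratios. This is only a rough sketch: `counts` and `per_entity` are hypothetical names, and the results are averages over all matching records, including any subjects that have no specimens at all.

```python
# Totals copied from the count summaries above.
counts = {"specimen": 39150, "researchsubject": 2923, "subject": 2314}


def per_entity(numerator, denominator):
    """Average number of `numerator` rows per `denominator` row, to one decimal."""
    return round(counts[numerator] / counts[denominator], 1)


print("specimens per subject:", per_entity("specimen", "subject"))
print("specimens per researchsubject:", per_entity("specimen", "researchsubject"))
print("researchsubjects per subject:", per_entity("researchsubject", "subject"))
```

On these totals that works out to roughly 17 specimens per subject and about 1.3 researchsubjects per subject.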
Specimen Field Definitions
A specimen is a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID.
Count status | Field | Definition |
---|---|---|
'total' | id | The unique identifier for this specimen |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the specimen had there |
Not Counted | associated_project | The name of the study/project that the subject participated in |
Not Counted | age_at_collection | The subject's age at collection of the proximate specimen |
Counted | primary_disease_type | The disease that qualifies the researchsubject for the associated_project |
Not Counted | anatomical_site | The body part from which the proximate specimen was taken |
Counted | source_material_type | The general kind of material from which the specimen was derived, indicating the physical nature of the source material |
Counted | specimen_type | The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide |
Not Counted | derived_from_specimen | For derived samples, the `id` for the original sample |
Not Counted | subject_id | An identifier for the subject. Can be joined to the `id` field from subject results |
Not Counted | research_subject_id | An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results |
file¶
The files endpoint returns all files that match our query:
myquery.file.count.run()
Getting results from database
Total execution time: 3212 ms
specimen_count : 39150
treatment_count : 2379
diagnosis_count : 1751
researchsubject_count : 2923
subject_count : 2314
There are a huge number of files (4081065) that match our search. We would likely want to additionally filter the results by file format or data type to get only files we can use. See all the ways you can filter and refine searches with more search terms in the Advanced search notebook.
Files from a single endpoint (endpoint chaining)¶
If you want all file formats and data types, but only from a specific endpoint, you can also filter the file results by chaining endpoints together. This will return all the files that match our search AND that are specifically from specimens:
myquery.specimen.file.count.run()
Getting results from database
Total execution time: 3162 ms
total : 39150
files : 50494
system | count |
---|---|
GDC | 38492 |
PDC | 658 |
primary_disease_type | count |
---|---|
Gliomas | 37549 |
Glioblastoma | 200 |
Other | 20 |
Pediatric/AYA Brain Tumors | 438 |
Mature B-Cell Lymphomas | 54 |
Germ Cell Neoplasms | 416 |
Neoplasms, NOS | 252 |
Not Reported | 121 |
Not Applicable | 36 |
Malignant Lymphomas, NOS or Diffuse | 56 |
Neuroepitheliomatous Neoplasms | 8 |
source_material_type | count |
---|---|
Primary Tumor | 27519 |
Solid Tissue Normal | 538 |
Blood Derived Normal | 10074 |
Recurrent Tumor | 513 |
Not Reported | 36 |
Next Generation Cancer Model | 169 |
Expanded Next Generation Cancer Model | 35 |
Metastatic | 252 |
Buccal Cell Normal | 14 |
specimen_type | count |
---|---|
portion | 5986 |
sample | 4085 |
analyte | 6659 |
aliquot | 18673 |
slide | 3747 |
Learn more about chaining endpoints in the Chaining endpoints notebook.
File Field Definitions
A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.
Count status | Field | Definition |
---|---|---|
`total` | id | The unique identifier for this file |
Data Center (System) Counted | identifier | An embedded array of information that includes the originating data center and the ID the file had there |
Not Counted | label | The full name of the file |
Counted | data_category | A description of the general kind of data the file holds |
Counted | data_type | A more specific description of the data type |
Counted | file_format | String to identify the full file extension including compression extensions |
Not Counted | associated_project | The name the data center uses for the study this file was generated for |
Not Counted | drs_uri | A unique identifier that can be used to retrieve this specific file from a server |
Not Counted | byte_size | Size of the file in bytes |
Not Counted | checksum | The md5 value for the file |
Not Counted | data_modality | Describes the biological nature of the information gathered as the result of an activity |
Not Counted | imaging_modality | For files with the `data_modality` of "Imaging" |
Not Counted | dbgap_accession_number | The project id number for this data on dbGaP |