Summarize Search Results¶

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:

from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.22

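CDA data comes from three sources: the Genomic Data Commons (GDC), the Proteomic Data Commons (PDC), and the Imaging Data Commons (IDC).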

The CDA makes this data searchable in four endpoints:

subject: Specific, unique individuals
researchsubject: Study-individual aggregate entities. A subject who was part of three studies will appear as three researchsubjects
specimen: Samples taken from an individual
file: Data about subject, researchsubject, specimen, and their associated information

If you are looking to build a cohort of distinct individuals who meet some criteria, search by subject. If you want to build a cohort but are more interested in studies than in the participants per se, search by researchsubject. If you are looking for biosamples that can be ordered, or for a specific format of information (e.g. histological slides), start with specimen. If you are primarily looking for files you can reuse for your own analysis, start with file.

In CDA search, these concepts can also be strung together, so you can look specifically for specimen file or researchsubject specimen. In all cases, any search can use any metadata field; the only difference between search types is what type of data is returned by default.

Getting simple summary data¶

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in Q and save it to a variable myquery. This is the same query we ran in the Basic Search notebook:

In [2]:

myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')

Where did those terms come from?

If you aren't sure how we knew what terms to put in our search, please refer back to the What search terms are available? notebook.

Overall summary¶

You can get a quick summary of how many unique specimens, treatments, diagnoses, researchsubjects and subjects meet your search criteria by chaining a count command into the basic run call.

In [3]:

myquery.count.run()

Getting results from database

Total execution time: 3330 ms

specimen_count : 39150

treatment_count : 2379

diagnosis_count : 1751

researchsubject_count : 2923

subject_count : 2314

Out[3]:

These numbers are how many total rows of data will come back when querying the various endpoints.
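
Because these are plain counts, simple ratios between them give a rough sense of how the data is structured. The sketch below copies the numbers from the output above into an ordinary dictionary (it does not call cdapython, and the variable and function names are illustrative) and works out how many researchsubjects and specimens there are per subject on average.

```python
# Counts copied by hand from the summary output above; not fetched from the CDA.
overall_counts = {
    "specimen_count": 39150,
    "treatment_count": 2379,
    "diagnosis_count": 1751,
    "researchsubject_count": 2923,
    "subject_count": 2314,
}


def per_subject(counts, key):
    """Average number of `key` entities per unique subject."""
    return counts[key] / counts["subject_count"]


researchsubjects_per_subject = per_subject(overall_counts, "researchsubject_count")
specimens_per_subject = per_subject(overall_counts, "specimen_count")

print(f"researchsubjects per subject: {researchsubjects_per_subject:.2f}")
print(f"specimens per subject: {specimens_per_subject:.2f}")
```

A researchsubject ratio above 1 is expected, since a subject who took part in several studies is counted once per study.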

subject summary¶

We can also add count to the other run calls we did in the Basic Search notebook to get more detailed summaries:

In [4]:

subjectresults = myquery.subject.count.run()

Getting results from database

Total execution time: 3295 ms

Since we save the output as a variable, we need to look at the variable to see the results:

In [5]:

subjectresults

    total : 2314

  files : 4081065

system	count
IDC	1167
PDC	309
GDC	1449

sex	count
None	683
male	979
female	649
not reported	3

race	count
None	683
white	1308
not reported	135
asian	33
black or african american	96
Unknown	20
other	9
not allowed to collect	25
american indian or alaska native	4
native hawaiian or other pacific islander	1

ethnicity	count
None	683
not hispanic or latino	1282
not reported	219
Unknown	21
hispanic or latino	84
not allowed to collect	25

cause_of_death	count
None	2028
Not Reported	200
Cancer Related	63
Infection	3
Not Cancer Related	9
Surgical Complications	2
Unknown	9

Out[5]:

By default, the results are displayed as a table for easy previewing of the data. Since we queried the subject endpoint, our default results tell us subject-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using count gives us back a pivot-table-style summary of the countable fields for subjects. Note that above the table it also tells you the total subject count, as well as how many files are associated with those subjects.

This gives you a quick way to assess whether the full search results will have the data fields you require. But if you want to get the underlying data for your own downstream applications, you can also get the raw numbers by calling the zeroth value of the variable:

In [6]:

subjectresults[0]

Out[6]:

{'total': 2314,
 'files': 4081065,
 'system': [{'system': 'IDC', 'count': 1167},
  {'system': 'PDC', 'count': 309},
  {'system': 'GDC', 'count': 1449}],
 'sex': [{'sex': 'null', 'count': 683},
  {'sex': 'male', 'count': 979},
  {'sex': 'female', 'count': 649},
  {'sex': 'not reported', 'count': 3}],
 'race': [{'race': 'null', 'count': 683},
  {'race': 'white', 'count': 1308},
  {'race': 'not reported', 'count': 135},
  {'race': 'asian', 'count': 33},
  {'race': 'black or african american', 'count': 96},
  {'race': 'Unknown', 'count': 20},
  {'race': 'other', 'count': 9},
  {'race': 'not allowed to collect', 'count': 25},
  {'race': 'american indian or alaska native', 'count': 4},
  {'race': 'native hawaiian or other pacific islander', 'count': 1}],
 'ethnicity': [{'ethnicity': 'null', 'count': 683},
  {'ethnicity': 'not hispanic or latino', 'count': 1282},
  {'ethnicity': 'not reported', 'count': 219},
  {'ethnicity': 'Unknown', 'count': 21},
  {'ethnicity': 'hispanic or latino', 'count': 84},
  {'ethnicity': 'not allowed to collect', 'count': 25}],
 'cause_of_death': [{'cause_of_death': 'null', 'count': 2028},
  {'cause_of_death': 'Not Reported', 'count': 200},
  {'cause_of_death': 'Cancer Related', 'count': 63},
  {'cause_of_death': 'Infection', 'count': 3},
  {'cause_of_death': 'Not Cancer Related', 'count': 9},
  {'cause_of_death': 'Surgical Complications', 'count': 2},
  {'cause_of_death': 'Unknown', 'count': 9}]}
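
The raw structure is a dictionary in which each countable field holds a list of records pairing a value with its count. A minimal sketch of flattening one of those lists into a value-to-count mapping follows. Here `summary` is a hand-copied excerpt of the output above, and `field_counts` is an illustrative helper, not part of cdapython.

```python
# Hand-copied excerpt of the raw output above (only two fields kept for brevity).
summary = {
    "total": 2314,
    "system": [
        {"system": "IDC", "count": 1167},
        {"system": "PDC", "count": 309},
        {"system": "GDC", "count": 1449},
    ],
    "sex": [
        {"sex": "null", "count": 683},
        {"sex": "male", "count": 979},
        {"sex": "female", "count": 649},
        {"sex": "not reported", "count": 3},
    ],
}


def field_counts(summary, field):
    """Flatten summary[field] into a {value: count} dictionary."""
    return {record[field]: record["count"] for record in summary[field]}


sex_counts = field_counts(summary, "sex")
system_counts = field_counts(summary, "system")

# Share of subjects that fall in the 'null' bucket for sex.
missing_sex_fraction = sex_counts["null"] / summary["total"]
print(f"subjects with no sex recorded: {missing_sex_fraction:.1%}")

# The sex buckets add up to the subject total...
print(sum(sex_counts.values()) == summary["total"])
# ...while the per-system counts add up to more than the total.
print(sum(system_counts.values()), "vs", summary["total"])
```

The per-system counts sum to more than the subject total. One plausible reading is that some subjects have records at more than one data center, but the output itself does not say so, so treat that as an assumption.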

Subject Field Definitions

A subject is a specific, unique individual, e.g. a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.

In count output	Field	Description
'total'	id	The unique identifier for this subject
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the subject had there
Counted	species	The species of the subject
Counted	sex	The sex of the subject
Counted	race	The race of the subject
Counted	ethnicity	The ethnicity of the subject
Not Counted	days_to_birth	Number of days between the index date and the subject's date of birth, expressed as a negative number of days
Not Counted	subject_associated_project	An embedded array of the names of projects (studies) the subject was part of
Not Counted	vital_status	Whether the subject is alive
Not Counted	age_at_death	The number of days after first enrollment that the subject died
Counted	cause_of_death	The cause of death, if known

researchsubject¶

If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint. Let's run it without saving to a variable this time to make it a bit quicker:

In [7]:

myquery.researchsubject.count.run()

Getting results from database

Total execution time: 3204 ms

    total : 2923

  files : 4081045

system	count
GDC	1449
PDC	309
IDC	1165

primary_diagnosis_condition	count
Gliomas	1244
Glioblastoma	100
Germ Cell Neoplasms	104
None	1165
Pediatric/AYA Brain Tumors	199
Other	10
Neoplasms, NOS	63
Not Reported	11
Malignant Lymphomas, NOS or Diffuse	14
Not Applicable	9
Mature B-Cell Lymphomas	2
Neuroepitheliomatous Neoplasms	2

primary_diagnosis_site	count
Brain	2923

Out[7]:

ResearchSubject Field Definitions

A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.

In count output	Field	Description
'total'	id	The unique identifier for this researchsubject
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the researchsubject had there
Not Counted	member_of_research_project	The name of the study/project that the subject participated in
Counted	primary_diagnosis_condition	The cancer, disease or other condition under study
Counted	primary_diagnosis_site	The primary_disease_site that qualifies the researchsubject for the research_project
Not Counted	subject_id	An identifier for the subject

diagnosis¶

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [8]:

myquery.diagnosis.count.run()

Getting results from database

Total execution time: 3273 ms

    total : 1751

system	count
GDC	1422
PDC	329

primary_diagnosis	count
Glioblastoma	821
Mixed glioma	131
Ganglioglioma, NOS	18
Mixed germ cell tumor	79
Neoplasm, malignant	50
Oligodendroglioma, NOS	112
Astrocytoma, NOS	64
Glioma, NOS	93
Oligodendroglioma, anaplastic	78
Astrocytoma, anaplastic	130
Glioma, malignant	26
Medulloblastoma, NOS	22
Ependymoma, NOS	32
Atypical teratoid/rhabdoid tumor	12
Teratoma, malignant, NOS	2
Not Reported	10
Teratoma, benign	3
Craniopharyngioma	16
Papillary glioneuronal tumor	2
Malignant lymphoma, NOS	14
Yolk sac tumor	8
Embryonal carcinoma, NOS	8
Neoplasm, uncertain whether benign or malignant	13
Germinoma	4
Malignant lymphoma, large B-cell, diffuse, NOS	2
Gliosarcoma	1

stage	count
None	1422
Not Reported	110
Unknown	219

grade	count
not reported	1116
G1	98
G2	52
Not Reported	392
G4	36
None	22
High Grade	26
Low Grade	9

Out[8]:

Diagnosis Field Definitions

A diagnosis is a medical classification of a disease for a given research subject in a given study. A single research subject may have different diagnoses across different studies.

In count output	Field	Description
'total'	id	The unique identifier for this diagnosis in this research subject
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the diagnosed researchsubject had there
Counted	primary_diagnosis	The main medical diagnosis for this subject in this study
Not Counted	age_at_diagnosis	The subject's age in days after birth on the day they were first diagnosed
Not Counted	morphology	The International Classification of Diseases for Oncology diagnostic code for this diagnosis
Counted	stage	A measure of disease spread. Different diseases may use different staging criteria
Counted	grade	A measure of cell abnormality. Different diseases may use different grading criteria
Not Counted	method_of_diagnosis	The test or system used for determining the diagnosis
Not Counted	subject_id	An identifier for the subject. Can be joined to the `id` field from subject results
Not Counted	researchsubject_id	An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results
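
Note that the grade summary above lists `not reported` and `Not Reported` as separate values, so labels that differ only in capitalization are counted separately. If that distinction does not matter for your analysis, the counts can be merged after the fact. The sketch below does this on a hand-copied version of the grade counts; `merge_case_variants` is an illustrative helper, not a cdapython function.

```python
from collections import Counter

# Grade counts copied by hand from the diagnosis summary above.
grade_counts = [
    ("not reported", 1116),
    ("G1", 98),
    ("G2", 52),
    ("Not Reported", 392),
    ("G4", 36),
    ("None", 22),
    ("High Grade", 26),
    ("Low Grade", 9),
]


def merge_case_variants(pairs):
    """Sum counts for labels that are identical apart from capitalization."""
    merged = Counter()
    for label, count in pairs:
        # casefold() gives a case-insensitive key, so all keys end up lower-case.
        merged[label.casefold()] += count
    return dict(merged)


merged_grades = merge_case_variants(grade_counts)
print(merged_grades["not reported"])
print(sum(merged_grades.values()))
```

The merged keys are lower-cased (for example `g1`), so map them back to a preferred spelling if the original capitalization matters for display.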

treatment¶

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [9]:

myquery.treatment.count.run()

Getting results from database

Total execution time: 3178 ms

    total : 2379

system	count
GDC	2379

treatment_type	count
Radiation Therapy, NOS	1139
Targeted Molecular Therapy	23
Pharmaceutical Therapy, NOS	1117
Immunotherapy (Including Vaccines)	23
Chemotherapy	30
Radiation, Proton Beam	1
Surgery	23
None	23

treatment_effect	count
None	2379

Out[9]:

Treatment Field Definitions

A treatment is a medical intervention for a diagnosed disease in a given subject in a given study. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.

In count output	Field	Description
'total'	id	The unique identifier for this treatment of this diagnosis in this research subject
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the treated researchsubject had there
Counted	treatment_type	The medical intervention undertaken
Not Counted	treatment_outcome	The result of the medical intervention
Not Counted	days_to_treatment_start
Not Counted	days_to_treatment_end
Not Counted	therapeutic_agent	What treatment or drug was used for this researchsubject
Not Counted	treatment_anatomic_site	The specific body location of the treatment
Counted	treatment_effect
Not Counted	treatment_end_reason
Not Counted	number_of_cycles
Not Counted	subject_id	An identifier for the subject. Can be joined to the `id` field from subject results
Not Counted	researchsubject_id	An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results
Not Counted	researchsubject_diagnosis_id	An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results
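
The identifier fields at the end of this table are what make it possible to link a treatment back to its diagnosis, researchsubject, and subject. As a rough illustration, the sketch below joins treatment rows to diagnosis rows on `researchsubject_diagnosis_id`. The rows are invented for the example and were not returned by the CDA; only field names listed in the definitions above are used.

```python
# Hypothetical rows for illustration only; the ids and values are made up.
diagnoses = [
    {"id": "diag-1", "primary_diagnosis": "Glioblastoma"},
    {"id": "diag-2", "primary_diagnosis": "Astrocytoma, NOS"},
]
treatments = [
    {"id": "treat-1", "treatment_type": "Chemotherapy",
     "researchsubject_diagnosis_id": "diag-1"},
    {"id": "treat-2", "treatment_type": "Radiation Therapy, NOS",
     "researchsubject_diagnosis_id": "diag-1"},
    {"id": "treat-3", "treatment_type": "Surgery",
     "researchsubject_diagnosis_id": "diag-2"},
]

# Index diagnoses by id so each treatment can look up its diagnosis directly.
diagnosis_by_id = {d["id"]: d for d in diagnoses}

joined = [
    {
        "treatment_id": t["id"],
        "treatment_type": t["treatment_type"],
        "primary_diagnosis": diagnosis_by_id[
            t["researchsubject_diagnosis_id"]
        ]["primary_diagnosis"],
    }
    for t in treatments
]

for row in joined:
    print(row["treatment_id"], row["treatment_type"], "->", row["primary_diagnosis"])
```

The same lookup pattern should apply to `subject_id` and `researchsubject_id`, which point at the `id` field of subject and researchsubject results respectively.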

specimens¶

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [10]:

myquery.specimen.count.run()

Getting results from database

Total execution time: 3234 ms

   total : 39150

   files : 50494

system	count
GDC	38492
PDC	658

primary_disease_type	count
Gliomas	37549
Glioblastoma	200
Other	20
Pediatric/AYA Brain Tumors	438
Mature B-Cell Lymphomas	54
Germ Cell Neoplasms	416
Neoplasms, NOS	252
Not Reported	121
Not Applicable	36
Malignant Lymphomas, NOS or Diffuse	56
Neuroepitheliomatous Neoplasms	8

source_material_type	count
Primary Tumor	27519
Solid Tissue Normal	538
Blood Derived Normal	10074
Recurrent Tumor	513
Not Reported	36
Next Generation Cancer Model	169
Expanded Next Generation Cancer Model	35
Metastatic	252
Buccal Cell Normal	14

specimen_type	count
portion	5986
sample	4085
analyte	6659
aliquot	18673
slide	3747

Out[10]:

Nearly 40,000 specimens with over 50,000 files meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects: first, because studies will often take more than one sample per subject, and second, because any given specimen might be aliquoted out to be used in multiple tests.
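
To see how the specimen total breaks down by type, the specimen_type counts can be turned into shares of the total. This sketch uses the numbers from the output above, copied by hand; the variable names are illustrative.

```python
# specimen_type counts copied by hand from the specimen summary above.
specimen_type_counts = {
    "portion": 5986,
    "sample": 4085,
    "analyte": 6659,
    "aliquot": 18673,
    "slide": 3747,
}

total_specimens = sum(specimen_type_counts.values())

# Fraction of all specimens that fall into each type, largest first.
shares = {
    specimen_type: count / total_specimens
    for specimen_type, count in sorted(
        specimen_type_counts.items(), key=lambda item: item[1], reverse=True
    )
}

for specimen_type, share in shares.items():
    print(f"{specimen_type:<8} {share:.1%}")
```

Aliquots alone make up nearly half of the specimens, while entries of type sample account for only about a tenth.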

Specimen Field Definitions

A specimen is a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID.

In count output	Field	Description
'total'	id	The unique identifier for this specimen
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the specimen had there
Not Counted	associated_project	The name of the study/project that the subject participated in
Not Counted	age_at_collection	The subject's age at collection of the proximate specimen
Counted	primary_disease_type	The disease that qualifies the researchsubject for the associated_project
Not Counted	anatomical_site	The body part from which the proximate specimen was taken
Counted	source_material_type	The general kind of material from which the specimen was derived, indicating the physical nature of the source material
Counted	specimen_type	The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide
Not Counted	derived_from_specimen	For derived samples, the `id` for the original sample
Not Counted	subject_id	An identifier for the subject. Can be joined to the `id` field from subject results
Not Counted	research_subject_id	An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results

file¶

The file endpoint returns all files that match our query:

In [11]:

myquery.file.count.run()

Getting results from database

Total execution time: 3212 ms

specimen_count : 39150

treatment_count : 2379

diagnosis_count : 1751

researchsubject_count : 2923

subject_count : 2314

Out[11]:

There are a huge number of files (4081065, as reported in the subject summary above) that match our search. We would likely want to additionally filter the results by file format or data type to get only files we can use. See all the ways you can filter and refine searches with more search terms in the Advanced search notebook.

Files from a single endpoint (endpoint chaining)¶

If you want all file formats and data types, but only from a specific endpoint, you can also filter the file results by chaining endpoints together. This will return all the files that match our search AND that are specifically from specimens:

In [12]:

myquery.specimen.file.count.run()

Getting results from database

Total execution time: 3162 ms

   total : 39150

   files : 50494

system	count
GDC	38492
PDC	658

primary_disease_type	count
Gliomas	37549
Glioblastoma	200
Other	20
Pediatric/AYA Brain Tumors	438
Mature B-Cell Lymphomas	54
Germ Cell Neoplasms	416
Neoplasms, NOS	252
Not Reported	121
Not Applicable	36
Malignant Lymphomas, NOS or Diffuse	56
Neuroepitheliomatous Neoplasms	8

source_material_type	count
Primary Tumor	27519
Solid Tissue Normal	538
Blood Derived Normal	10074
Recurrent Tumor	513
Not Reported	36
Next Generation Cancer Model	169
Expanded Next Generation Cancer Model	35
Metastatic	252
Buccal Cell Normal	14

specimen_type	count
portion	5986
sample	4085
analyte	6659
aliquot	18673
slide	3747

Out[12]:

Learn more about chaining endpoints in the Chaining endpoints notebook.

File Field Definitions

A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.

In count output	Field	Description
'total'	id	The unique identifier for this file
Data Center (System) Counted	identifier	An embedded array of information that includes the originating data center and the ID the file had there
Not Counted	label	The full name of the file
Counted	data_category	A description of the general kind of data the file holds
Counted	data_type	A more specific description of the data type
Counted	file_format	String to identify the full file extension including compression extensions
Not Counted	associated_project	The name the data center uses for the study this file was generated for
Not Counted	drs_uri	A unique identifier that can be used to retrieve this specific file from a server
Not Counted	byte_size	Size of the file in bytes
Not Counted	checksum	The md5 value for the file
Not Counted	data_modality	Describes the biological nature of the information gathered as the result of an activity
Not Counted	imaging_modality	For files with the `data_modality` of "Imaging"
Not Counted	dbgap_accession_number	The project id number for this data on dbGaP

Last update: 2022-06-22