Basic (single term) Search¶

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:

from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.22

CDA data comes from three sources:

The Proteomic Data Commons (PDC)
The Genomic Data Commons (GDC)
The Imaging Data Commons (IDC)

The CDA makes this data searchable in four main endpoints:

subject: Specific, unique, individuals
researchsubject: Study-individual aggregate entities. A Subject who was part of three studies will appear as three ResearchSubjects
specimen: Samples taken from an individual
file: Data about Subjects, ResearchSubjects, Specimens, and their associated information

and two endpoints that offer deeper information about data in the researchsubject endpoint:

diagnosis: Information about what medical diagnosis a researchsubject has
treatment: Information about what medical treatment(s) were performed for a given diagnosis

If you are looking to build a cohort of distinct individuals who meet some criteria, search by subject. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by researchsubject. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with specimen. If you are primarily looking for files you can reuse for your own analysis, start with file.

In CDA search, these concepts can also be chained together, so you can look specifically for specimen subjects, or researchsubject diagnoses. In the four 'main' tables, all of the rows will have one or more files associated with them that can be directly found by chaining, as in specimen files. Diagnosis and treatment do not have files directly associated with them and so can only be used to find files in conjunction with the other searches.

In all cases, any search can use any metadata field; the only difference between search types is the type of data returned by default.

Basic search with endpoints¶

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in Q and save it to a variable myquery:

In [2]:

myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')

Where did those terms come from?

If you aren't sure how we knew what terms to put in our search, please refer back to the What search terms are available? notebook.

subject¶

Now we can use that query to search any of the information types. Let's start by looking at which Subjects meet our criteria. To do that, we will send our query to the subject endpoint, then ask for it to run:

In [3]:

subjectresults = myquery.subject.run()

Getting results from database

Total execution time: 3513 ms

We saved the output in a variable subjectresults, so we don't get much visible output. To see what our results are, we need to look into the variable. The simplest way is to call subjectresults directly:

In [4]:

subjectresults

Out[4]:

            QueryID: f7c3b547-599f-4c91-b5a8-fb7eec603683
            
            Offset: 0
            Count: 100
            Total Row Count: 2314
            More pages: True

This output tells us our QueryID, which we don't really need, but the computer does to track our questions. Then it tells us four parameters that describe our results:

Offset: This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero.
Count: This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
Total Row Count: This is how many rows are in the full results table
More pages: This is always a True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.
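
The relationship between these four values can be sketched in plain Python. This is only an illustration of the arithmetic, not how cdapython computes them: the function name `page_summary` and its `page_size` parameter are hypothetical, and the numbers are the ones from the output above.

```python
import math


def page_summary(offset, count, total_row_count, page_size=100):
    """Describe where one page sits within the full results table.

    Hypothetical helper for illustration; not part of cdapython.
    """
    return {
        # Index of the last row on this page, counting from zero
        "last_row": offset + count - 1,
        # True when rows remain beyond the current page
        "more_pages": offset + count < total_row_count,
        # Pages needed to see every row at this page size
        "total_pages": math.ceil(total_row_count / page_size),
    }


first_page = page_summary(offset=0, count=100, total_row_count=2314)
last_page = page_summary(offset=2300, count=14, total_row_count=2314)

print(first_page)  # {'last_row': 99, 'more_pages': True, 'total_pages': 24}
print(last_page)   # {'last_row': 2313, 'more_pages': False, 'total_pages': 24}
```

So with 2314 rows and 100 rows per page, we would need to look at 24 pages to see everything, and only the last of those pages would report More pages: False.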

Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by calling the .to_dataframe() method on our subjectresults variable:

In [5]:

subjectresults.to_dataframe()

Out[5]:

	id	identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	age_at_death	cause_of_death
0	900-00-5299	[{'system': 'IDC', 'value': '900-00-5299'}]	Homo sapiens	None	None	None	NaN	[rembrandt]	None	NaN	None
1	900-00-5308	[{'system': 'IDC', 'value': '900-00-5308'}]	Homo sapiens	None	None	None	NaN	[rembrandt]	None	NaN	None
2	ACRIN-DSC-MR-Brain-027	[{'system': 'IDC', 'value': 'ACRIN-DSC-MR-Brai...	Homo sapiens	None	None	None	NaN	[acrin_dsc_mr_brain]	None	NaN	None
3	ACRIN-DSC-MR-Brain-051	[{'system': 'IDC', 'value': 'ACRIN-DSC-MR-Brai...	Homo sapiens	None	None	None	NaN	[acrin_dsc_mr_brain]	None	NaN	None
4	ACRIN-DSC-MR-Brain-123	[{'system': 'IDC', 'value': 'ACRIN-DSC-MR-Brai...	Homo sapiens	None	None	None	NaN	[acrin_dsc_mr_brain]	None	NaN	None
...	...	...	...	...	...	...	...	...	...	...	...
95	TCGA-76-4934	[{'system': 'GDC', 'value': 'TCGA-76-4934'}, {...	Homo sapiens	female	white	not reported	-24319.0	[tcga_gbm, TCGA-GBM]	Dead	77.0	None
96	TCGA-CS-6667	[{'system': 'GDC', 'value': 'TCGA-CS-6667'}, {...	Homo sapiens	female	white	not hispanic or latino	-14375.0	[TCGA-LGG, tcga_lgg]	Alive	NaN	None
97	TCGA-DU-A5TW	[{'system': 'GDC', 'value': 'TCGA-DU-A5TW'}, {...	Homo sapiens	female	black or african american	not hispanic or latino	-12107.0	[TCGA-LGG, tcga_lgg]	Alive	NaN	None
98	TCGA-HT-7603	[{'system': 'GDC', 'value': 'TCGA-HT-7603'}, {...	Homo sapiens	male	white	not hispanic or latino	-10883.0	[TCGA-LGG, tcga_lgg]	Alive	NaN	None
99	TCGA-P5-A736	[{'system': 'GDC', 'value': 'TCGA-P5-A736'}]	Homo sapiens	female	white	not hispanic or latino	-16300.0	[TCGA-LGG]	Alive	NaN	None

100 rows × 11 columns

By default, the dataframe returned by to_dataframe() is displayed as only the first and last five rows of the first page of our results, so we can easily preview our data.

Since we queried the Subject endpoint, our default results tell us Subject level information, that is, information about unique individuals: their sex, race, age, species, etc. The id column tells us the unique identifier for each individual. The identifier column has nested information about what study or studies a Subject participated in, and will list all of their researchsubject identifiers.
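
Each cell in the identifier column is shown as a list of small records with a 'system' and a 'value'. A minimal sketch of unpacking one such cell is below. It assumes the cell is a Python list of dictionaries (as it is displayed) and that each system appears at most once per cell; `ids_by_system` is a hypothetical helper, not a cdapython function, and the second example cell is made up for illustration.

```python
def ids_by_system(identifier):
    """Map each data center named in one `identifier` cell to the ID used there."""
    return {entry["system"]: entry["value"] for entry in identifier}


# Copied from the last row of the table above: known to one data center
single = [{"system": "GDC", "value": "TCGA-P5-A736"}]

# Hypothetical cell for a subject known to two data centers
double = [
    {"system": "GDC", "value": "EXAMPLE-0001"},
    {"system": "IDC", "value": "EXAMPLE-0001"},
]

print(ids_by_system(single))          # {'GDC': 'TCGA-P5-A736'}
print(sorted(ids_by_system(double)))  # ['GDC', 'IDC']
```

If the cells really are lists of dictionaries, the same helper could be applied to a whole column with the pandas `apply` method once the results are saved to a dataframe.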

The to_dataframe() function converts the results to a pandas dataframe. So if we save the dataframe to a variable, we can use any pandas functions to explore it. For example, let's see whether any of the Subjects in our first 100 results are black or african american. First we'll save the results to a dataframe, then subset that dataframe to only show rows where the word "black" appears in the "race" column. Missing values (NAs) are shown as "None" in these tables, so for our filter to work, we need to tell it explicitly to treat NAs as non-matches with na=False. We're also telling it we want the word "black" regardless of capitalization with case=False:

In [6]:

subjectdata = subjectresults.to_dataframe()
subjectdata[subjectdata['race'].str.contains("black", case=False, na=False)]

Out[6]:

	id	identifier	species	sex	race	ethnicity	days_to_birth	subject_associated_project	vital_status	age_at_death	cause_of_death
8	C374289	[{'system': 'PDC', 'value': 'C374289'}]	Homo sapiens	male	black or african american	not hispanic or latino	NaN	[Proteogenomic Analysis of Pediatric Brain Can...	Alive	NaN	Not Reported
41	GENIE-MSK-P-0016639	[{'system': 'GDC', 'value': 'GENIE-MSK-P-00166...	Homo sapiens	female	black or african american	not hispanic or latino	-22645.0	[GENIE-MSK]	Not Reported	NaN	None
47	TCGA-02-0330	[{'system': 'GDC', 'value': 'TCGA-02-0330'}]	Homo sapiens	female	black or african american	not hispanic or latino	-18654.0	[TCGA-GBM]	Dead	484.0	None
52	TCGA-76-6656	[{'system': 'GDC', 'value': 'TCGA-76-6656'}, {...	Homo sapiens	male	black or african american	not hispanic or latino	-24265.0	[tcga_gbm, TCGA-GBM]	Dead	147.0	None
97	TCGA-DU-A5TW	[{'system': 'GDC', 'value': 'TCGA-DU-A5TW'}, {...	Homo sapiens	female	black or african american	not hispanic or latino	-12107.0	[TCGA-LGG, tcga_lgg]	Alive	NaN	None

There are five subjects in our first hundred results that meet the criteria. If we just want to be sure that the data contains some value, this might be good enough. But often we want to search the entire set of results and not just the first page.

We'll cover how to work with large results dataframes in the Pagination notebook. Or, learn how to get summary information from search results in the Data Summaries notebook.

Subject Field Definitions

A subject is a specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.

id: The unique identifier for this subject
identifier: An embedded array of information that includes the originating data center and the ID the subject had there
species: The species of the subject
sex: A reference to the biological sex of the donor organism.
race: The race of the subject
ethnicity: The ethnicity of the subject
days_to_birth: The number of days between the index date and the subject's date of birth, expressed as a negative number of days
subject_associated_project: An embedded array of the names of projects (studies) the subject was part of
vital_status: Whether the subject is alive
age_at_death: The number of days after first enrollment that the subject died
cause_of_death: The cause of death, if known
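
Because days_to_birth is a negative number of days, it usually needs converting before it reads as an age. The sketch below shows one way to do that, using a value from the table above. The helper name `age_in_years_at_index` is hypothetical, and dividing by 365.25 is an approximation we are assuming here, not something the CDA prescribes.

```python
import math

DAYS_PER_YEAR = 365.25  # average year length; an approximation


def age_in_years_at_index(days_to_birth):
    """Convert a negative days_to_birth value into an approximate age in years.

    Returns None when the value is missing (None or NaN).
    """
    if days_to_birth is None or math.isnan(days_to_birth):
        return None
    return round(-days_to_birth / DAYS_PER_YEAR, 1)


print(age_in_years_at_index(-24319.0))      # 66.6
print(age_in_years_at_index(float("nan")))  # None
```

Missing values are handled explicitly because many rows in the table above have NaN in this column.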

researchsubject¶

If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint:

In [7]:

researchsubjectresults = myquery.researchsubject.run()
researchsubjectresults

Getting results from database

Total execution time: 3495 ms

Out[7]:

            QueryID: b305ba49-2c35-49c5-9313-156500e3e01e
            
            Offset: 0
            Count: 100
            Total Row Count: 2923
            More pages: True

Now we see that our 2314 subjects have 2923 researchsubjects between them. That means that some, but not all, of our subjects were participants in more than one study. Let's peek at the data:

In [8]:

researchsubjectresults.to_dataframe()

Out[8]:

	id	identifier	member_of_research_project	primary_diagnosis_condition	primary_diagnosis_site	subject_id
0	0073a136-d5f4-4fd6-88f9-711768f2abc6	[{'system': 'GDC', 'value': '0073a136-d5f4-4fd...	TCGA-LGG	Gliomas	Brain	TCGA-VM-A8CF
1	0611f5bc-89b5-44bd-b301-751faaadb561	[{'system': 'GDC', 'value': '0611f5bc-89b5-44b...	TCGA-LGG	Gliomas	Brain	TCGA-P5-A5F1
2	142a1357-f1e8-40b9-82cf-f29577058598	[{'system': 'GDC', 'value': '142a1357-f1e8-40b...	CPTAC-3	Gliomas	Brain	C3N-01852
3	22e0c3ea-9f6d-4d73-9282-17ee4553f436	[{'system': 'GDC', 'value': '22e0c3ea-9f6d-4d7...	TCGA-GBM	Gliomas	Brain	TCGA-32-4213
4	2533b299-2b18-4b92-907e-ff39a6427298	[{'system': 'GDC', 'value': '2533b299-2b18-4b9...	GENIE-MSK	Germ Cell Neoplasms	Brain	GENIE-MSK-P-0013028
...	...	...	...	...	...	...
95	eae44c1c-1628-4b58-8b90-d3372e3577d5	[{'system': 'GDC', 'value': 'eae44c1c-1628-4b5...	TCGA-GBM	Gliomas	Brain	TCGA-12-0821
96	0987f48e-9d58-47b5-a1a6-de704caf4ed5	[{'system': 'GDC', 'value': '0987f48e-9d58-47b...	TCGA-GBM	Gliomas	Brain	TCGA-16-1062
97	0f9f8f46-5e6c-4bae-938d-218c192b199b	[{'system': 'GDC', 'value': '0f9f8f46-5e6c-4ba...	CPTAC-3	Gliomas	Brain	C3L-01157
98	104c3d6f-2139-11ea-aee1-0e1aae319e49	[{'system': 'PDC', 'value': '104c3d6f-2139-11e...	CPTAC3-Discovery	Glioblastoma	Brain	C3L-03727
99	104c73ca-2139-11ea-aee1-0e1aae319e49	[{'system': 'PDC', 'value': '104c73ca-2139-11e...	CPTAC3-Discovery	Glioblastoma	Brain	C3N-02770

100 rows × 6 columns

Each row from the researchsubject endpoint results tells us about a subject in a given study. Using this endpoint we can find out information like what studies fit our search criteria, and also get data that we can filter to have only subjects from multiple studies, or only subjects from single studies.

Any given subject will have one row per study they participated in. The subject_id in the last column of this view is the same as the id in the first column of the Subjects endpoint results. You can use this to combine information across endpoints, which is covered in the Merging Results notebook.
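
Since each subject has one row per study, counting how often each subject_id appears tells us who took part in more than one study. Here is a minimal sketch of that idea using only the standard library. The rows and IDs are hypothetical stand-ins, reduced to the two columns that matter.

```python
from collections import Counter

# Hypothetical researchsubject rows; real results have more columns
rows = [
    {"id": "rs-1", "subject_id": "SUBJ-A"},
    {"id": "rs-2", "subject_id": "SUBJ-A"},
    {"id": "rs-3", "subject_id": "SUBJ-B"},
    {"id": "rs-4", "subject_id": "SUBJ-C"},
    {"id": "rs-5", "subject_id": "SUBJ-C"},
    {"id": "rs-6", "subject_id": "SUBJ-C"},
]

# One count per subject: how many researchsubject rows point at them
studies_per_subject = Counter(row["subject_id"] for row in rows)

multi_study = sorted(s for s, n in studies_per_subject.items() if n > 1)
single_study = sorted(s for s, n in studies_per_subject.items() if n == 1)

print(len(rows), "researchsubjects for", len(studies_per_subject), "subjects")
print(multi_study)   # ['SUBJ-A', 'SUBJ-C']
print(single_study)  # ['SUBJ-B']
```

This is the same pattern as our real counts, where 2923 researchsubject rows belong to 2314 subjects. Keep in mind that counting within a single page of 100 rows would only describe that page, not the full results.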

ResearchSubject Field Definitions

A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.

id: The unique identifier for this researchsubject
identifier: An embedded array of information that includes the originating data center and the ID the researchsubject had there
member_of_research_project: The name of the study/project that the subject participated in
primary_diagnosis_condition: The cancer, disease or other condition under study
primary_diagnosis_site: The primary disease site that qualifies the researchsubject for the research project
subject_id: An identifier for the subject. Can be joined to the `id` field from subject results

diagnosis¶

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [9]:

diagnosisresults = myquery.diagnosis.run()
diagnosisresults.to_dataframe()

Getting results from database

Total execution time: 3577 ms

Out[9]:

	id	identifier	primary_diagnosis	age_at_diagnosis	morphology	stage	grade	method_of_diagnosis	subject_id	researchsubject_id
0	082d0ca2-8a8b-4a76-8231-50cb41cf201c	[{'system': 'GDC', 'value': '082d0ca2-8a8b-4a7...	Neoplasm, malignant	NaN	8000/3	None	Not Reported	None	GENIE-DFCI-022969	bf5bc810-6cb6-4996-b9b9-222802a208c8
1	21382b4d-7002-5c10-bda6-0a0baec492a2	[{'system': 'GDC', 'value': '21382b4d-7002-5c1...	Astrocytoma, anaplastic	13621.0	9401/3	None	not reported	None	TCGA-FG-8185	fb3884df-3680-4c5b-8092-981007aeba03
2	262a76c3-4ca9-5741-a683-d32c7e6ea241	[{'system': 'GDC', 'value': '262a76c3-4ca9-574...	Astrocytoma, anaplastic	15230.0	9401/3	None	not reported	None	TCGA-E1-5302	837b51d0-e6b7-431b-a5f3-7615bdf12b67
3	3d2bdce9-2848-11ec-b712-0a4e2186f121	[{'system': 'PDC', 'value': '3d2bdce9-2848-11e...	Glioblastoma	18509.0	None	Not Reported	Not Reported	None	C3L-03392	104c3685-2139-11ea-aee1-0e1aae319e49
4	75862d6b-ed13-4f01-a662-149cc27a6fe3	[{'system': 'GDC', 'value': '75862d6b-ed13-4f0...	Mixed germ cell tumor	NaN	9085/3	None	Not Reported	None	GENIE-MDA-6021	ecbf4826-01ca-4308-86a7-1ac0ae3e3dc7
...	...	...	...	...	...	...	...	...	...	...
95	d092ab4f-ff5e-11e9-9a07-0a80fada099c	[{'system': 'PDC', 'value': 'd092ab4f-ff5e-11e...	Glioma, NOS	4932.0	None	Unknown	G1	None	C829512	d08df067-ff5e-11e9-9a07-0a80fada099c
96	fa6c58ff-97a7-4c41-98a5-e0bf9a641515	[{'system': 'GDC', 'value': 'fa6c58ff-97a7-4c4...	Malignant lymphoma, NOS	NaN	9590/3	None	Not Reported	None	GENIE-MSK-P-0005258	e37c75ef-8e2f-446a-be95-24a10d05a17b
97	ff61fa80-46fa-5f55-875b-74cd8e6629fa	[{'system': 'GDC', 'value': 'ff61fa80-46fa-5f5...	Oligodendroglioma, anaplastic	23482.0	9451/3	None	not reported	None	TCGA-P5-A72Z	141f0546-f6f2-408f-ac86-07ca4aadf3d0
98	04f5ef09-6809-4b8d-9c9a-161c86826e4f	[{'system': 'GDC', 'value': '04f5ef09-6809-4b8...	Neoplasm, malignant	NaN	8000/3	None	Not Reported	None	GENIE-DFCI-008524	9838df48-80d1-459c-8bd3-27b89bbc07a2
99	1448f993-0561-4c69-9af6-7ee845c92ae0	[{'system': 'GDC', 'value': '1448f993-0561-4c6...	Glioblastoma	16133.0	9440/3	None	Not Reported	None	C3L-03390	6b0f4d36-78bb-4afc-a440-8c193bd4e8ce

100 rows × 10 columns

Diagnosis Field Definitions

A diagnosis is a medical classification of a disease for a given research subject in a given study. A single research subject may have different diagnoses across different studies.

id: The unique identifier for this diagnosis in this research subject
identifier: An embedded array of information that includes the originating data center and the ID the diagnosed researchsubject had there
primary_diagnosis: The main medical diagnosis for this subject in this study
age_at_diagnosis: The subject's age in days after birth on the day they were first diagnosed
morphology: The International Classification of Diseases for Oncology diagnostic code for this diagnosis
stage: A measure of disease spread. Different diseases may use different staging criteria; please refer to the originating data source to see what staging system is reported
grade: A measure of cell abnormality. Different diseases may use different grading criteria; please refer to the originating data source to see what grading system is reported
method_of_diagnosis: The test or system used for determining the diagnosis
subject_id: An identifier for the subject. Can be joined to the `id` field from subject results
researchsubject_id: An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results

treatment¶

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [10]:

treatmentresults = myquery.treatment.run()
treatmentresults.to_dataframe()

Getting results from database

Total execution time: 3583 ms

Out[10]:

	id	identifier	treatment_type	treatment_outcome	days_to_treatment_start	days_to_treatment_end	therapeutic_agent	treatment_anatomic_site	treatment_effect	treatment_end_reason	number_of_cycles	subject_id	researchsubject_id	researchsubject_diagnosis_id
0	0aa4ed01-6ae2-5dc5-8317-9cc4b8d46d19	[{'system': 'GDC', 'value': '0aa4ed01-6ae2-5dc...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-06-0939	cb9f842c-9bf5-48c3-8d3e-b344b1b6c190	2fc83e04-0ec5-5c2c-8656-3b634ee3f302
1	0ca4b739-8381-5120-9c1d-6b39d4d9dea4	[{'system': 'GDC', 'value': '0ca4b739-8381-512...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-06-0157	ea3b5da2-6a12-400c-bf0f-e442f5ec1132	d36f5d8e-1050-5880-8cf9-4060aa0e0622
2	0e3b660f-80de-5cc6-af5b-e74365f52a5b	[{'system': 'GDC', 'value': '0e3b660f-80de-5cc...	Radiation Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-06-0126	d88bbd87-e876-4a44-96f5-c28ceac661b8	d1ba0665-895d-5cdf-a8e7-be112e4b6fc1
3	1081cec8-03e9-5bca-ad9f-5a83c47ff3be	[{'system': 'GDC', 'value': '1081cec8-03e9-5bc...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-28-1760	b7b86c1f-9688-4129-891c-843e3a37b3e5	aba32eca-f512-5a56-8f44-bdecb2972452
4	243582a6-1f09-5fb9-9ce6-da5c73226bd7	[{'system': 'GDC', 'value': '243582a6-1f09-5fb...	Radiation Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-14-0787	a2338b30-f511-4163-af3b-1e4a40ff00e1	e84ef8da-1b77-5281-9b85-a2c5be51435c
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	0f3cf9c4-a794-5f65-9916-c94263d0ab21	[{'system': 'GDC', 'value': '0f3cf9c4-a794-5f6...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-DH-A66D	f005fa99-7e4e-4308-b11c-83b25683b1fd	c5f71302-7864-5f96-87ed-6af8141d217e
96	281a005b-933b-54e3-b928-5bf69fe96e61	[{'system': 'GDC', 'value': '281a005b-933b-54e...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-DU-7018	f18dab99-26b0-4727-89a9-7f16bd382356	2f2c48c0-f497-51fb-b8eb-e1b4f3c86f4d
97	2f6fac8c-16ad-44ea-9537-bbe0d47f49a0	[{'system': 'GDC', 'value': '2f6fac8c-16ad-44e...	Surgery	None	None	NaN	None	None	None	None	None	HCM-BROD-0002-C71	3092d72b-75b1-4ae2-ac38-d4c1cd377e4c	f68b1f17-da54-4595-b017-37fd6d6d1e3e
98	5e600cdb-18da-59f6-8822-dcbb646a0b21	[{'system': 'GDC', 'value': '5e600cdb-18da-59f...	Radiation Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-08-0349	0645038d-8abe-4a79-8695-bf8824d33f67	2c1f394d-d827-5811-be98-2914d2db081e
99	81841517-a3e6-5de5-8e47-01c637f9e836	[{'system': 'GDC', 'value': '81841517-a3e6-5de...	Pharmaceutical Therapy, NOS	None	None	NaN	None	None	None	None	None	TCGA-DB-A64V	176d89d7-f8f1-4a72-b45a-31cbe1632a30	c6700507-19d0-552b-b87c-6dc3370021c8

100 rows × 14 columns

Treatment Field Definitions

A treatment is a medical intervention for a diagnosed disease in a given subject in a given study. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.

id: The unique identifier for this treatment of this diagnosis in this research subject
identifier: An embedded array of information that includes the originating data center and the ID the treated researchsubject had there
treatment_type: The medical intervention undertaken
treatment_outcome: The result of the medical intervention
days_to_treatment_start:
days_to_treatment_end:
therapeutic_agent: What treatment or drug was used for this researchsubject
treatment_anatomic_site: The specific body location of the treatment
treatment_effect:
treatment_end_reason:
number_of_cycles:
subject_id: An identifier for the subject. Can be joined to the `id` field from subject results
researchsubject_id: An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results
researchsubject_diagnosis_id: An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results
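
The three `_id` fields above are what make it possible to line treatments up with the diagnosis, researchsubject, and subject they belong to. As a rough sketch of what "joined to the `id` field" means, the example below attaches a diagnosis to each treatment using plain dictionaries. All rows and IDs are hypothetical and trimmed to the columns needed for the join.

```python
# Hypothetical diagnosis rows
diagnoses = [
    {"id": "dx-1", "primary_diagnosis": "Glioblastoma"},
    {"id": "dx-2", "primary_diagnosis": "Astrocytoma, anaplastic"},
]

# Hypothetical treatment rows pointing back at those diagnoses
treatments = [
    {"id": "tx-1", "treatment_type": "Radiation Therapy, NOS",
     "researchsubject_diagnosis_id": "dx-1"},
    {"id": "tx-2", "treatment_type": "Pharmaceutical Therapy, NOS",
     "researchsubject_diagnosis_id": "dx-1"},
    {"id": "tx-3", "treatment_type": "Surgery",
     "researchsubject_diagnosis_id": "dx-2"},
]

# Index the diagnoses by their id so each lookup is a single step
diagnosis_by_id = {d["id"]: d for d in diagnoses}

# For every treatment, follow researchsubject_diagnosis_id to its diagnosis
joined = [
    {
        "treatment_type": t["treatment_type"],
        "primary_diagnosis": diagnosis_by_id[
            t["researchsubject_diagnosis_id"]
        ]["primary_diagnosis"],
    }
    for t in treatments
]

for row in joined:
    print(row["primary_diagnosis"], "->", row["treatment_type"])
```

Note that one diagnosis can pick up several treatments, so the joined table has one row per treatment. With real dataframes, the pandas `merge` method with its `left_on` and `right_on` parameters does the same job.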

specimen¶

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [11]:

specimenresults = myquery.specimen.run()
print(specimenresults)

Getting results from database

Total execution time: 3827 ms

            QueryID: edfd86f3-a40b-48ab-94a1-5e29d46d5438
            
            Offset: 0
            Count: 100
            Total Row Count: 39150
            More pages: True

Nearly 40,000 specimens meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects, first because studies will often take more than one sample per subject, and second because any given specimen might be aliquoted out to be used in multiple tests. Since we didn't specify any further filters, our results will return all of these as separate specimens. Let's look at a few:

In [12]:

specimenresults.to_dataframe()

Out[12]:

	id	identifier	associated_project	age_at_collection	primary_disease_type	anatomical_site	source_material_type	specimen_type	derived_from_specimen	subject_id	researchsubject_id
0	02397b40-fda8-4fa1-bdbc-89a06d96365c	[{'system': 'GDC', 'value': '02397b40-fda8-4fa...	TCGA-LGG	-9867	Gliomas	None	Primary Tumor	sample	initial specimen	TCGA-E1-A7YY	722c172f-46e6-47a1-82e5-3207278df89b
1	02edf275-6a20-44a4-a3b4-982627a77ba5	[{'system': 'GDC', 'value': '02edf275-6a20-44a...	TCGA-GBM	-24844	Gliomas	None	Primary Tumor	sample	initial specimen	TCGA-06-0152	f5bc5d97-e054-4e53-992d-71b896bd97d5
2	033c9e3a-1498-4e9d-bbb1-2ab70851c8b8	[{'system': 'GDC', 'value': '033c9e3a-1498-4e9...	TCGA-LGG	-15431	Gliomas	None	Blood Derived Normal	aliquot	0d507dbd-c668-48f0-bcb6-22c62404c5eb	TCGA-FG-8186	dcd45077-f068-490b-bdcc-4d4a62285116
3	034e2508-0b9c-4b56-93d5-49f7b1f35c59	[{'system': 'GDC', 'value': '034e2508-0b9c-4b5...	TCGA-GBM	-13489	Gliomas	None	Primary Tumor	analyte	c6f62fdd-6647-492b-bf2c-f74f155b383a	TCGA-26-1438	17dffffc-65d6-4209-9075-18a441001f0f
4	03da9810-f78d-4eec-bd34-20fc79a57fd6	[{'system': 'GDC', 'value': '03da9810-f78d-4ee...	TCGA-GBM	-27172	Gliomas	None	Blood Derived Normal	aliquot	17cb34a4-a136-4d0c-af75-acf776f32859	TCGA-14-3476	872511fb-98cd-4fba-82f6-4c3689c75ae2
...	...	...	...	...	...	...	...	...	...	...	...
95	3c0e211c-c9ea-4fcc-a2a0-0403478b9fee	[{'system': 'GDC', 'value': '3c0e211c-c9ea-4fc...	TCGA-GBM	-25409	Gliomas	None	Primary Tumor	aliquot	8827f48c-2066-436d-90a4-9bf23009f83d	TCGA-08-0346	5536201c-e739-46e6-8200-0bdcc28ae9ef
96	3c10c410-5bb9-4e08-8687-7c0278d337d3	[{'system': 'GDC', 'value': '3c10c410-5bb9-4e0...	TCGA-LGG	-13371	Gliomas	None	Primary Tumor	analyte	5a178f84-37c6-419a-b9f3-9e71550a3f8e	TCGA-DB-5275	bbfb5399-8d43-4b75-bf90-23ec142697d7
97	3c38eeab-66c7-5dcf-927b-867098c4f797	[{'system': 'GDC', 'value': '3c38eeab-66c7-5dc...	GENIE-MSK	-11688	Germ Cell Neoplasms	None	Primary Tumor	portion	8d5f6667-06db-47d0-8c70-76516435da12	GENIE-MSK-P-0005099	926870ab-c74a-4500-8d1e-104f956895a0
98	3d733bea-6537-4df5-9537-5fc989f00ba1	[{'system': 'GDC', 'value': '3d733bea-6537-4df...	TCGA-GBM	-20538	Gliomas	None	Primary Tumor	slide	fcb101a6-510f-4097-a83a-334a26e01aa4	TCGA-06-1801	bdc75722-1076-49f3-8dc7-f2b91e5a15eb
99	3e10d52d-92df-4eae-ad22-a9c08b72e16b	[{'system': 'GDC', 'value': '3e10d52d-92df-4ea...	TCGA-GBM	-20974	Gliomas	None	Primary Tumor	aliquot	e82ddb98-1066-42ec-9faa-78853bbb44c9	TCGA-08-0246	66d2e309-eaa1-4225-a34d-4565b4ef8019

100 rows × 11 columns

Specimen Field Definitions

Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.

id: The unique identifier for this specimen
identifier: An embedded array of information that includes the originating data center and the ID the specimen had there
associated_project: The name of the study/project that the subject participated in
age_at_collection: The subject's age in days (counting backwards to birth) on the day of the collection of the proximate specimen
primary_disease_type: The disease that qualifies the researchsubject for the associated_project
anatomical_site: The body part from which the proximate specimen was taken
source_material_type: The general kind of material from which the specimen was derived, indicating the physical nature of the source material
specimen_type: The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide
derived_from_specimen: For derived samples, the `id` for the original sample
subject_id: An identifier for the subject. Can be joined to the `id` field from subject results
researchsubject_id: An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results
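
The derived_from_specimen field lets us trace a derived specimen back towards the sample it came from. The sketch below walks that link step by step. It rests on assumptions: that derived_from_specimen holds the `id` of the specimen one step up (if it already points at the original sample, the walk simply ends after one step), and that original samples carry the text "initial specimen", as seen in the table above. The specimen IDs and the `lineage` helper are hypothetical.

```python
INITIAL = "initial specimen"  # marker seen in the derived_from_specimen column

# Hypothetical specimens keyed by id: (specimen_type, derived_from_specimen)
specimens = {
    "sample-1": ("sample", INITIAL),
    "portion-1": ("portion", "sample-1"),
    "analyte-1": ("analyte", "portion-1"),
    "aliquot-1": ("aliquot", "analyte-1"),
    "aliquot-2": ("aliquot", "analyte-1"),
}


def lineage(specimen_id, specimens):
    """Return the chain of ids from a specimen back to its original sample."""
    chain = [specimen_id]
    while chain[-1] in specimens:
        _, parent = specimens[chain[-1]]
        if parent == INITIAL:
            break  # reached an original sample
        if parent in chain:
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
    return chain


print(lineage("aliquot-1", specimens))
# ['aliquot-1', 'analyte-1', 'portion-1', 'sample-1']
```

Here two aliquots lead back to the same sample, which is one reason specimen counts run so much higher than subject counts. If a parent is not in the rows we have (for example, because it is on another page), the walk stops at that parent's id.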

file¶

The file endpoint returns information about files that meet our search criteria, regardless of whether they are attached to subjects, research-subjects or specimens:

In [13]:

myquery.file.run()

Getting results from database

Total execution time: 3702 ms

Out[13]:

            QueryID: ce7f7dff-d650-40d1-aca2-554ececfd80d
            
            Offset: 0
            Count: 100
            Total Row Count: 4530800
            More pages: True

In [14]:

fileresults = myquery.file.run()
fileresults.to_dataframe()

Getting results from database

Total execution time: 3564 ms

Out[14]:

	id	identifier	label	data_category	data_type	file_format	associated_project	drs_uri	byte_size	checksum	data_modality	imaging_modality	dbgap_accession_number	researchsubject_specimen_id	researchsubject_id	subject_id
0	0bff62f0-aadc-4797-a9b1-0c18709211f3	[{'system': 'IDC', 'value': '0bff62f0-aadc-479...	idc/0bff62f0-aadc-4797-a9b1-0c18709211f3.dcm	Imaging	None	DICOM	tcga_gbm	drs://dg.4DFC:0bff62f0-aadc-4797-a9b1-0c187092...	NaN	None	Imaging	MR	None		TCGA-06-0179__tcga_gbm	TCGA-06-0179
1	10f1e60f-3bb1-455e-b686-b4f457b872c7	[{'system': 'IDC', 'value': '10f1e60f-3bb1-455...	idc/10f1e60f-3bb1-455e-b686-b4f457b872c7.dcm	Imaging	None	DICOM	tcga_gbm	drs://dg.4DFC:10f1e60f-3bb1-455e-b686-b4f457b8...	NaN	None	Imaging	MR	None		TCGA-06-0119__tcga_gbm	TCGA-06-0119
2	15f2fbfb-c92c-44e5-ad2c-823d9db916a0	[{'system': 'IDC', 'value': '15f2fbfb-c92c-44e...	idc/15f2fbfb-c92c-44e5-ad2c-823d9db916a0.dcm	Imaging	None	DICOM	acrin_dsc_mr_brain	drs://dg.4DFC:15f2fbfb-c92c-44e5-ad2c-823d9db9...	NaN	None	Imaging	MR	None		ACRIN-DSC-MR-Brain-121__acrin_dsc_mr_brain	ACRIN-DSC-MR-Brain-121
3	17b44b7b-6275-4b43-8e21-ad430959d98b	[{'system': 'IDC', 'value': '17b44b7b-6275-4b4...	idc/17b44b7b-6275-4b43-8e21-ad430959d98b.dcm	Imaging	None	DICOM	acrin_dsc_mr_brain	drs://dg.4DFC:17b44b7b-6275-4b43-8e21-ad430959...	NaN	None	Imaging	MR	None		ACRIN-DSC-MR-Brain-118__acrin_dsc_mr_brain	ACRIN-DSC-MR-Brain-118
4	1afe1496-0cd6-463f-9bee-5c4facee2b41	[{'system': 'IDC', 'value': '1afe1496-0cd6-463...	idc/1afe1496-0cd6-463f-9bee-5c4facee2b41.dcm	Imaging	None	DICOM	tcga_lgg	drs://dg.4DFC:1afe1496-0cd6-463f-9bee-5c4facee...	NaN	None	Imaging	MR	None		TCGA-DU-7018__tcga_lgg	TCGA-DU-7018
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	0c452594-6952-40cb-8c55-f4fcc30bdc50	[{'system': 'IDC', 'value': '0c452594-6952-40c...	idc/0c452594-6952-40cb-8c55-f4fcc30bdc50.dcm	Imaging	None	DICOM	qin_gbm_treatment_response	drs://dg.4DFC:0c452594-6952-40cb-8c55-f4fcc30b...	NaN	None	Imaging	MR	None		QIN-GBM-TR-55__qin_gbm_treatment_response	QIN-GBM-TR-55
96	0e8b0619-090d-4ae5-bfb4-830c5f792909	[{'system': 'IDC', 'value': '0e8b0619-090d-4ae...	idc/0e8b0619-090d-4ae5-bfb4-830c5f792909.dcm	Imaging	None	DICOM	acrin_dsc_mr_brain	drs://dg.4DFC:0e8b0619-090d-4ae5-bfb4-830c5f79...	NaN	None	Imaging	MR	None		ACRIN-DSC-MR-Brain-049__acrin_dsc_mr_brain	ACRIN-DSC-MR-Brain-049
97	1b0cc01f-c97e-4bca-a1d2-027367555b37	[{'system': 'IDC', 'value': '1b0cc01f-c97e-4bc...	idc/1b0cc01f-c97e-4bca-a1d2-027367555b37.dcm	Imaging	None	DICOM	lgg_1p19qdeletion	drs://dg.4DFC:1b0cc01f-c97e-4bca-a1d2-02736755...	NaN	None	Imaging	MR	None		LGG-566__lgg_1p19qdeletion	LGG-566
98	1d6c6213-1a3d-4c10-8c14-61f1a39754b9	[{'system': 'IDC', 'value': '1d6c6213-1a3d-4c1...	idc/1d6c6213-1a3d-4c10-8c14-61f1a39754b9.dcm	Imaging	None	DICOM	tcga_gbm	drs://dg.4DFC:1d6c6213-1a3d-4c10-8c14-61f1a397...	NaN	None	Imaging	MR	None		TCGA-06-0185__tcga_gbm	TCGA-06-0185
99	20fc1e43-aa9c-47cb-8800-72fceccf6097	[{'system': 'IDC', 'value': '20fc1e43-aa9c-47c...	idc/20fc1e43-aa9c-47cb-8800-72fceccf6097.dcm	Imaging	None	DICOM	acrin_dsc_mr_brain	drs://dg.4DFC:20fc1e43-aa9c-47cb-8800-72fceccf...	NaN	None	Imaging	MR	None		ACRIN-DSC-MR-Brain-013__acrin_dsc_mr_brain	ACRIN-DSC-MR-Brain-013

100 rows × 16 columns

As you might expect, searching file gives us a huge number of results. This is great if you are surveying what kind of data is available, but is less useful for getting a coherent cohort.

A better way to get files for a specific cohort is to chain your queries together, which we cover in the next tutorial, Chaining Queries: Combine information from multiple endpoints, and build And/Or/Like and other advanced query strings.

Another useful way to look at high level information is to use our counts feature which returns summary information rather than the full search results. Check out the Data Summaries tutorial to try it.

File Field Definitions

A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.

id: The unique identifier for this file
identifier: An embedded array of information that includes the originating data center and the ID the file had there
label: The full name of the file
data_category: A description of the general kind of data the file holds
data_type: A more specific description of the data type
file_format: String to identify the full file extension including compression extensions
associated_project: The name the data center uses for the study this file was generated for
drs_uri: A unique identifier that can be used to retrieve this specific file from a server
byte_size: Size of the file in bytes
checksum: The md5 value for the file
data_modality: Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging"
imaging_modality: For files with the `data_modality` of "Imaging", a descriptor for the image type
dbgap_accession_number: The project id number for this data on dbGaP
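
The byte_size and checksum fields are useful after a file has been downloaded, because they let us check that the copy we have is complete and unaltered. A minimal sketch of that check is below. It assumes the checksum is stored as a hexadecimal md5 string; `matches_metadata` is a hypothetical helper, and the payload is a stand-in for real file contents.

```python
import hashlib


def matches_metadata(data, byte_size, checksum):
    """Check downloaded bytes against the byte_size and md5 checksum fields."""
    if len(data) != byte_size:
        return False  # wrong length, no need to hash
    return hashlib.md5(data).hexdigest() == checksum.lower()


payload = b"example file contents"  # stand-in for a downloaded file
expected_md5 = hashlib.md5(payload).hexdigest()

print(matches_metadata(payload, len(payload), expected_md5))         # True
print(matches_metadata(payload + b"x", len(payload), expected_md5))  # False
```

In the table above, byte_size is NaN and checksum is None for the imaging files shown, so a check like this is only possible for rows where both fields are filled in.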

Last update: 2022-06-15