# Basic (single term) Search

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.9


CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in four main endpoints:

- `subject`: Specific, unique, individuals
- `researchsubject`: Study-individual aggregate entities. A Subject who was part of three studies will appear as three ResearchSubjects
- `specimen`: Samples taken from an individual
- `file`: Data about Subjects, ResearchSubjects, Specimens, and their associated information

and two endpoints that offer deeper information about data in the researchsubject endpoint:

- `diagnosis`: Information about what medical diagnosis a researchsubject has
- `treatment`: Information about what medical treatment(s) were performed for a given diagnosis

If you are looking to build a cohort of distinct individuals who meet some criteria, search by `subject`. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by `researchsubject`. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with `specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `file`.

In CDA search, these concepts can also be chained together, so you can look specifically for specimen subjects, or researchsubject diagnoses. In the four 'main' tables, all of the rows will have one or more files associated with them that can be directly found by chaining, as in specimen files. Diagnosis and treatment do not have files directly associated with them and so can only be used to find files in conjunction with the other searches.

In all cases, any search can use any metadata field; the only difference between search types is what type of data is returned by default.



## Basic search with endpoints

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [2]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')


<div style="background-color:#6ce6b9;color:black;padding:20px;">
<h3>Where did those terms come from?</h3>
    
If you aren't sure how we knew what terms to put in our search, please refer back to the <a href="../SearchTerms">What search terms are available?</a> notebook. 
</div>

### subject
Now we can use that query to search any of the information types. Let's start by looking at what Subjects meet our criteria. To do that, we will send our query to the subject endpoint, then ask for it to run:

In [3]:
subjectresults = myquery.subject.run()

Total execution time: 3458 ms


We saved the output in a variable `subjectresults`, so we don't get much visible output. To see what our results are, we need to look into the variable. The simplest way is to call `subjectresults` directly:

In [4]:
subjectresults


            QueryID: de8d59ac-6c64-4319-beef-25da7a9ca64a
            
            Offset: 0
            Count: 100
            Total Row Count: 2314
            More pages: True
            

This output tells us our QueryID, which we don't really need, but the computer does to track our questions. Then it tells us four parameters that describe our results:

---

- **Offset:** This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero
- **Count:** This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- **Total Row Count:** This is how many rows are in the full results table
- **More pages:** This is always a True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.

---
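
To make the relationship between these four values concrete, here is a minimal sketch in plain Python. It does not call cdapython: the helper names `has_more_pages` and `pages_needed` are hypothetical, and it assumes that "More pages" simply means rows remain beyond `Offset + Count`. The numbers are the ones reported for our subject search.

```python
import math

def has_more_pages(offset: int, count: int, total_row_count: int) -> bool:
    """True when rows remain beyond the current page (assumed meaning of 'More pages')."""
    return offset + count < total_row_count

def pages_needed(total_row_count: int, page_size: int) -> int:
    """Number of pages required to see every row at a given page size."""
    return math.ceil(total_row_count / page_size)

# Values reported for the subject search above
offset, count, total_row_count = 0, 100, 2314

more_pages = has_more_pages(offset, count, total_row_count)
n_pages = pages_needed(total_row_count, count)
print(more_pages, n_pages)
```

At the default page size, the subject results would take 24 pages to read in full, with the last page only partly filled.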
    
Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by using the python function `.to_dataframe()` on our `subjectresults` variable:

In [5]:
subjectresults.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,ACRIN-FMISO-Brain-046,"[{'system': 'IDC', 'value': 'ACRIN-FMISO-Brain...",Homo sapiens,,,,,[acrin_fmiso_brain],,,
1,C136161,"[{'system': 'PDC', 'value': 'C136161'}]",Homo sapiens,male,white,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Alive,,Not Reported
2,LGG-338,"[{'system': 'IDC', 'value': 'LGG-338'}]",Homo sapiens,,,,,[lgg_1p19qdeletion],,,
3,LGG-563,"[{'system': 'IDC', 'value': 'LGG-563'}]",Homo sapiens,,,,,[lgg_1p19qdeletion],,,
4,RIDER Neuro MRI-3183286461,"[{'system': 'IDC', 'value': 'RIDER Neuro MRI-3...",Homo sapiens,,,,,[rider_neuro_mri],,,
...,...,...,...,...,...,...,...,...,...,...,...
95,TCGA-DB-A4XG,"[{'system': 'GDC', 'value': 'TCGA-DB-A4XG'}]",Homo sapiens,male,white,not hispanic or latino,-12550.0,[TCGA-LGG],Alive,,
96,TCGA-HT-7688,"[{'system': 'GDC', 'value': 'TCGA-HT-7688'}, {...",Homo sapiens,male,white,not hispanic or latino,-21844.0,"[TCGA-LGG, tcga_lgg]",Alive,,
97,TCGA-IK-7675,"[{'system': 'GDC', 'value': 'TCGA-IK-7675'}]",Homo sapiens,male,white,hispanic or latino,-15900.0,[TCGA-LGG],Dead,578.0,
98,TCGA-P5-A5EV,"[{'system': 'GDC', 'value': 'TCGA-P5-A5EV'}]",Homo sapiens,male,white,not hispanic or latino,-14310.0,[TCGA-LGG],Alive,,


By default `to_dataframe()` shows us the first and last five rows for the first page of our results, so we can easily preview our data.

Since we queried the Subject endpoint, our default results tell us Subject level information, that is, information about unique individuals: their sex, race, age, species, etc. The `id` column tells us the unique identifier for each individual. The identifier column has nested information about what study or studies a Subject participated in, and will list all of their researchsubject identifiers. 
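
As the preview shows, each `identifier` value is a list of dictionaries with `system` and `value` keys. Here is a minimal sketch of pulling the source systems out of that nested structure. It uses hand-typed stand-in rows rather than live results, and `EXAMPLE-0001` is a made-up subject included only to show an identifier with more than one entry.

```python
# Stand-in rows shaped like the subject results above.
# "EXAMPLE-0001" is a made-up subject used to show a multi-source identifier.
rows = [
    {"id": "C136161", "identifier": [{"system": "PDC", "value": "C136161"}]},
    {"id": "LGG-338", "identifier": [{"system": "IDC", "value": "LGG-338"}]},
    {"id": "EXAMPLE-0001", "identifier": [
        {"system": "GDC", "value": "EXAMPLE-0001"},
        {"system": "IDC", "value": "EXAMPLE-0001"},
    ]},
]

# Map each subject id to the list of data centers named in its identifier
systems_by_subject = {
    row["id"]: [entry["system"] for entry in row["identifier"]]
    for row in rows
}

for subject_id, systems in systems_by_subject.items():
    print(subject_id, systems)
```

The same list comprehension can be applied to a dataframe column with the pandas `apply` method once the results have been saved to a variable.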

The `to_dataframe()` function converts the results to a pandas dataframe, so if we save the dataframe to a variable, we can use any pandas function to explore it. For example, let's see whether any of the Subjects in our first 100 results are black or african american. First we'll save the results to a dataframe, then subset that dataframe to show only rows where the word "black" appears in the "race" column. Missing values (NAs) are shown as "None" in these tables, so for our filter to work we need to tell it explicitly to ignore them with `na=False`. We also tell it to match the word "black" regardless of capitalization with `case=False`:


In [6]:
subjectdata = subjectresults.to_dataframe()
subjectdata[subjectdata['race'].str.contains("black", case=False, na=False)]

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
59,TCGA-TQ-A7RO,"[{'system': 'GDC', 'value': 'TCGA-TQ-A7RO'}]",Homo sapiens,male,black or african american,hispanic or latino,-10890.0,[TCGA-LGG],Alive,,
74,TCGA-06-0394,"[{'system': 'GDC', 'value': 'TCGA-06-0394'}]",Homo sapiens,male,black or african american,not hispanic or latino,-18913.0,[TCGA-GBM],Dead,329.0,
83,GENIE-JHU-03295,"[{'system': 'GDC', 'value': 'GENIE-JHU-03295'}]",Homo sapiens,female,black or african american,Unknown,,[GENIE-JHU],Not Reported,,


There are three subjects in our first hundred results that meet the criteria. If we just want to be sure that the data contains some value, this might be good enough. But often we want to search the entire set of results and not just the first page. 
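
To see what `case=False` and `na=False` are doing, here is a small self-contained sketch with a toy dataframe. The rows are made up for illustration and are not CDA data. Without `na=False`, `str.contains` returns a missing value for rows where `race` is empty, and pandas will typically refuse to use a mask with missing values to subset a dataframe.

```python
import pandas as pd

# Toy data: made-up rows that mimic the `race` column, including a missing value
toy = pd.DataFrame({
    "id": ["A", "B", "C", "D"],
    "race": ["white", None, "Black or African American", "black or african american"],
})

# na=False treats missing values as "no match"; case=False ignores capitalization
mask = toy["race"].str.contains("black", case=False, na=False)
matches = toy[mask]
print(matches)
```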

We'll cover how to work with large results dataframes in the [Pagination](../Pagination.ipynb) notebook. Or, learn how to get summary information from search results in the [Data Summaries](../DataSummaries.ipynb) notebook.


---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A subject is a specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this subject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the subject had there</li>
  <li><b>species:</b> The species of the subject</li>
  <li><b>sex:</b> A reference to the biological sex of the donor organism. </li>
  <li><b>race:</b> The race of the subject</li>
  <li><b>ethnicity:</b> The ethnicity of the subject</li>
  <li><b>days_to_birth:</b> Number of days between the index date and the person's date of birth, represented as a calculated negative number of days</li>
  <li><b>subject_associated_project:</b> An embedded array of the names of projects (studies) the subject was part of</li>
  <li><b>vital_status:</b> Whether the subject is alive</li>
  <li><b>age_at_death:</b> The number of days after first enrollment that the subject died</li>
  <li><b>cause_of_death:</b> The cause of death, if known</li>
</ul>  

</div>
    
---
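
Several of these fields are reported in days rather than years. As a rough sketch, a day count can be converted to years like this. The helper `days_to_years` is hypothetical, and 365.25 days per year is an approximation.

```python
DAYS_PER_YEAR = 365.25  # approximation that averages in leap years

def days_to_years(days: float) -> float:
    """Convert a day count to years, ignoring the sign."""
    return abs(days) / DAYS_PER_YEAR

# days_to_birth for subject TCGA-DB-A4XG in the preview above
age_years = days_to_years(-12550.0)
print(round(age_years, 1))
```

For this subject the result is roughly 34.4 years at the index date.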

### researchsubject

If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint:

In [7]:
researchsubjectresults = myquery.researchsubject.run()
researchsubjectresults

Total execution time: 3973 ms



            QueryID: 96575399-d9ef-46d0-a36b-b8d34f0064ca
            
            Offset: 0
            Count: 100
            Total Row Count: 2923
            More pages: True
            

Now we see that our 2314 subjects have 2923 researchsubjects between them. That means that some, but not all, of our subjects were participants in more than one study. Let's peek at the data:

In [8]:
researchsubjectresults.to_dataframe()

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0a85c7b4-f07d-4727-b9d2-7b14c52edabb,"[{'system': 'GDC', 'value': '0a85c7b4-f07d-472...",TCGA-GBM,Gliomas,Brain,TCGA-12-0703
1,104c852f-2139-11ea-aee1-0e1aae319e49,"[{'system': 'PDC', 'value': '104c852f-2139-11e...",CPTAC3-Discovery,Glioblastoma,Brain,C3N-03473
2,19dd2c8f-05ca-44e8-8c19-bac0327f2ea9,"[{'system': 'GDC', 'value': '19dd2c8f-05ca-44e...",TCGA-GBM,Gliomas,Brain,TCGA-76-6661
3,374117f3-a351-43e8-9848-a1d724c71a46,"[{'system': 'GDC', 'value': '374117f3-a351-43e...",GENIE-DFCI,"Neoplasms, NOS",Brain,GENIE-DFCI-089524
4,3f70c3e3-0131-466f-92aa-0a63ab3d4258,"[{'system': 'GDC', 'value': '3f70c3e3-0131-466...",TCGA-LGG,Gliomas,Brain,TCGA-CS-6188
...,...,...,...,...,...,...
95,0389b35b-651b-4776-b12a-d315a100f47c,"[{'system': 'GDC', 'value': '0389b35b-651b-477...",TCGA-GBM,Gliomas,Brain,TCGA-12-0619
96,104c0d21-2139-11ea-aee1-0e1aae319e49,"[{'system': 'PDC', 'value': '104c0d21-2139-11e...",CPTAC3-Discovery,Glioblastoma,Brain,C3L-01043
97,13d12179-3182-4f41-85a2-90fd50e51480,"[{'system': 'GDC', 'value': '13d12179-3182-4f4...",TCGA-GBM,Gliomas,Brain,TCGA-06-1802
98,1b6c184a-5868-4a51-8a82-aa16a7e65126,"[{'system': 'GDC', 'value': '1b6c184a-5868-4a5...",TCGA-GBM,Gliomas,Brain,TCGA-06-0881


Each row from the researchsubject endpoint results tells us about a subject in a given study. Using this endpoint we can find out information like what studies fit our search criteria, and also get data that we can filter to have only subjects from multiple studies, or only subjects from single studies.
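
As a sketch of that kind of filtering, the example below splits researchsubject rows into subjects seen in several studies and subjects seen in only one. The rows are made up for illustration. On real data you would want the full result set, not just the first page, before drawing conclusions.

```python
import pandas as pd

# Made-up researchsubject rows; subject S1 appears in two studies
rs = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3", "rs-4"],
    "member_of_research_project": ["STUDY-A", "STUDY-B", "STUDY-A", "STUDY-C"],
    "subject_id": ["S1", "S1", "S2", "S3"],
})

# Count how many distinct studies each subject appears in
studies_per_subject = rs.groupby("subject_id")["member_of_research_project"].nunique()

# Split the rows by whether their subject is in more than one study
multi_study_ids = studies_per_subject[studies_per_subject > 1].index
multi_study = rs[rs["subject_id"].isin(multi_study_ids)]
single_study = rs[~rs["subject_id"].isin(multi_study_ids)]
print(multi_study)
```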

Any given subject will have one row per study they participated in. The subject_id in the last column of this view is the same as the `id` in the first column of the Subjects endpoint results. You can use this to combine information across endpoints, which is covered in the [Merging Results](../MergingResults.ipynb) notebook.
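
Here is a minimal sketch of that join using pandas `merge`, with made-up stand-in dataframes rather than live results. Both tables have an `id` column, so the `suffixes` argument is used to keep them apart after the merge.

```python
import pandas as pd

# Made-up stand-ins for subject and researchsubject results
subjects = pd.DataFrame({
    "id": ["S1", "S2"],
    "sex": ["male", "female"],
})
researchsubjects = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3"],
    "member_of_research_project": ["STUDY-A", "STUDY-B", "STUDY-A"],
    "subject_id": ["S1", "S1", "S2"],
})

# Join researchsubject.subject_id to subject.id, one output row per researchsubject
merged = researchsubjects.merge(
    subjects,
    left_on="subject_id",
    right_on="id",
    how="left",
    suffixes=("_researchsubject", "_subject"),
)
print(merged[["id_researchsubject", "subject_id", "member_of_research_project", "sex"]])
```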


---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs</i>
    
<ul>
  <li><b>id:</b> The unique identifier for this researchsubject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the researchsubject had there</li>
  <li><b>member_of_research_project:</b> The name of the study/project that the subject participated in</li>
  <li><b>primary_diagnosis_condition:</b> The cancer, disease or other condition under study</li>
  <li><b>primary_diagnosis_site:</b> The primary_disease_site that qualifies the researchsubject for the research_project</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
</ul>  

</div>
    
---

### diagnosis

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [9]:
diagnosisresults = myquery.diagnosis.run()
diagnosisresults.to_dataframe()

Total execution time: 3609 ms


Unnamed: 0,id,identifier,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id,researchsubject_id
0,0cd95232-62ce-4cda-8bc1-57874f84088a,"[{'system': 'GDC', 'value': '0cd95232-62ce-4cd...",Glioblastoma,27855.0,9440/3,,Not Reported,,C3L-02955,42c00ea2-17e5-4f68-af9f-8c1f1bebed25
1,1476b363-2767-4847-a571-650c9c32993b,"[{'system': 'GDC', 'value': '1476b363-2767-484...",Glioblastoma,18595.0,9440/3,,Not Reported,,C3L-01046,5838a973-c162-47cd-a93d-e28fd92f1ca5
2,2513548c-257f-593f-9c72-41ccb2f81bf3,"[{'system': 'GDC', 'value': '2513548c-257f-593...","Astrocytoma, anaplastic",12836.0,9401/3,,not reported,,TCGA-DU-6392,fcd9e1c4-bddb-4856-844c-03df48fba499
3,37de5395-e8a5-5684-9c73-59b3637c77dc,"[{'system': 'GDC', 'value': '37de5395-e8a5-568...",Glioblastoma,19460.0,9440/3,,not reported,,TCGA-12-0778,db5ea0e6-4cac-4f18-8643-32afb1e0287d
4,3f6fa8c7-2848-11ec-b712-0a4e2186f121,"[{'system': 'PDC', 'value': '3f6fa8c7-2848-11e...",Glioblastoma,12827.0,,Not Reported,Not Reported,,C3N-03184,104c81a1-2139-11ea-aee1-0e1aae319e49
...,...,...,...,...,...,...,...,...,...,...
95,3b2b5be2-e3b6-5f26-8d3a-0cdccd5c578a,"[{'system': 'GDC', 'value': '3b2b5be2-e3b6-5f2...","Astrocytoma, anaplastic",13453.0,9401/3,,not reported,,TCGA-VV-A86M,284680b2-f961-402d-8740-e5f9e4fb4a98
96,3d780e07-2848-11ec-b712-0a4e2186f121,"[{'system': 'PDC', 'value': '3d780e07-2848-11e...",Glioblastoma,26097.0,,Not Reported,Not Reported,,C3L-03727,104c3d6f-2139-11ea-aee1-0e1aae319e49
97,468daf73-2692-5ea9-b4f0-3975186eac9b,"[{'system': 'GDC', 'value': '468daf73-2692-5ea...",Glioblastoma,28011.0,9440/3,,not reported,,TCGA-76-4925,872abc8a-6c1f-4114-b993-7d0327fb38bd
98,578d1331-1b12-59fa-a584-5796c6ffbe83,"[{'system': 'GDC', 'value': '578d1331-1b12-59f...","Oligodendroglioma, anaplastic",13424.0,9451/3,,not reported,,TCGA-TM-A84S,cd50aea8-a5db-48d3-ad28-cbe9f9e75ae3



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Diagnosis Field Definitions</h3>

<i>A diagnosis is a medical classification of a disease for a given research subject in a given study. A single research subject may have different diagnoses across different studies</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this diagnosis in this research subject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the diagnosed researchsubject had there</li>
  <li><b>primary_diagnosis:</b> The main medical diagnosis for this subject in this study</li>
  <li><b>age_at_diagnosis:</b> The subject's age in days after birth on the day they were first diagnosed</li>
  <li><b>morphology:</b> The <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> diagnostic code for this diagnosis</li>
  <li><b>stage:</b> A measure of disease spread. Different diseases may use different staging criteria, please refer to the originating data source to see what staging system is reported</li>
  <li><b>grade:</b> A measure of cell abnormality. Different diseases may use different grading criteria, please refer to the originating data source to see what grading system is reported</li>
  <li><b>method_of_diagnosis:</b> The test or system used for determining the diagnosis</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
  <li><b>researchsubject_id:</b> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</li>
</ul>  

</div>
    
---
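
The preview above shows `grade` reported as both "Not Reported" and "not reported", depending on the source. If you tally values like these, it can help to normalize capitalization first. A minimal sketch in plain Python follows. The list of values is typed in by hand, the `None` entry is added for illustration, and `normalize` is a hypothetical helper.

```python
from collections import Counter

# Grade values as they appear in the preview, plus one made-up missing value
grades = ["Not Reported", "Not Reported", "not reported", "not reported", None]

def normalize(value):
    """Lower-case and strip text values; leave missing values as None."""
    return value.strip().lower() if isinstance(value, str) else None

raw_counts = Counter(grades)
clean_counts = Counter(normalize(g) for g in grades)
print(raw_counts)
print(clean_counts)
```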




### treatment

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [10]:
treatmentresults = myquery.treatment.run()
treatmentresults.to_dataframe()

Total execution time: 3428 ms


Unnamed: 0,id,identifier,treatment_type,treatment_outcome,days_to_treatment_start,days_to_treatment_end,therapeutic_agent,treatment_anatomic_site,treatment_effect,treatment_end_reason,number_of_cycles,subject_id,researchsubject_id,researchsubject_diagnosis_id
0,0d167298-0b5b-4039-be6b-12b1af520a76,"[{'system': 'GDC', 'value': '0d167298-0b5b-403...",Immunotherapy (Including Vaccines),,,,,,,,,HCM-BROD-0416-C71,aa17554b-2291-4947-9858-eadb82704f9b,de976004-e763-4759-9a7a-5b24a7f4aafe
1,0d22a5ff-7d93-5b79-9002-7c689a4a4719,"[{'system': 'GDC', 'value': '0d22a5ff-7d93-5b7...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-02-0033,fc5ad666-d67a-4a5c-8e4e-1c8d099e9f85,f7895b28-7936-53a6-bbd8-98dff83bbf38
2,0edd1ce2-54b8-5709-a357-0b513ce4f573,"[{'system': 'GDC', 'value': '0edd1ce2-54b8-570...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-06-2569,620282f9-f932-4335-9c7d-ece53dcaf7a1,fc7c38d3-8066-52c0-8a64-76ba9cbbb4ef
3,1855993a-951d-43f1-a794-ed89db21debd,"[{'system': 'GDC', 'value': '1855993a-951d-43f...",Surgery,,,,,,,,,HCM-BROD-0415-C71,a42da11c-f1c5-4641-98f4-535c675d43d5,20ebb213-ae8d-4b68-ac38-9bf33c452af9
4,1994c6f9-9fef-509f-97bf-d85dca25cf4c,"[{'system': 'GDC', 'value': '1994c6f9-9fef-509...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-12-0615,141a1ef1-9be6-46d1-b445-305b222727d2,78cacf8c-d55b-585d-b592-d28b64223411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,747e2ded-43d1-5c5c-9a5e-0c5838ca3deb,"[{'system': 'GDC', 'value': '747e2ded-43d1-5c5...","Radiation Therapy, NOS",,,,,,,,,TCGA-12-3651,d540d6d2-266c-48f9-8e73-304389b2060b,2071ad07-eb5e-5df4-8bdb-adc6c0234037
96,7706b706-5403-545e-ae89-ba512ab018d2,"[{'system': 'GDC', 'value': '7706b706-5403-545...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-02-0440,502f47db-7d3d-4081-ab15-1c38e7e78652,213f94e3-8baa-518a-8f24-8e8a60010e7d
97,785d64a0-f656-5a14-b1c5-d60de1803dca,"[{'system': 'GDC', 'value': '785d64a0-f656-5a1...","Radiation Therapy, NOS",,,,,,,,,TCGA-19-1790,aeeaf19a-3b0d-4c79-a029-44642581f4d8,f807242b-2b87-51d1-b00a-f0e760e9e0ce
98,78ffae9e-8ebd-568d-bcab-22e5a7f89f3d,"[{'system': 'GDC', 'value': '78ffae9e-8ebd-568...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-08-0245,e6dee2d7-ca05-44e2-bf25-1068a416bd14,d8fec5f1-bcf9-5a9f-92c0-761c42dd3cf1



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Treatment Field Definitions</h3>

<i>A treatment is a medical intervention for a diagnosed disease in a given subject in a given study. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this treatment of this diagnosis in this research subject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the treated researchsubject had there</li>
  <li><b>treatment_type:</b> The medical intervention undertaken</li>
  <li><b>treatment_outcome:</b> The result of the medical intervention</li>
  <li><b>days_to_treatment_start:</b> </li>
  <li><b>days_to_treatment_end:</b> </li>
  <li><b>therapeutic_agent:</b> What treatment or drug was used for this researchsubject</li>
  <li><b>treatment_anatomic_site:</b> The specific body location of the treatment</li>
  <li><b>treatment_effect:</b> </li>
  <li><b>treatment_end_reason:</b> </li>
  <li><b>number_of_cycles:</b> </li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
  <li><b>researchsubject_id:</b> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</li>
  <li><b>researchsubject_diagnosis_id:</b> An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results</li>
</ul>  

</div>
    
---
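
Because each treatment row carries a `researchsubject_diagnosis_id`, treatments can be linked back to the diagnosis they belong to. Here is a minimal sketch of that lookup in plain Python. All ids and rows are made up, and only the linking columns plus one descriptive column are shown.

```python
# Made-up rows; only the linking columns and one descriptive column are shown
diagnoses = [
    {"id": "dx-1", "primary_diagnosis": "Glioblastoma", "researchsubject_id": "rs-1"},
    {"id": "dx-2", "primary_diagnosis": "Astrocytoma, anaplastic", "researchsubject_id": "rs-2"},
]
treatments = [
    {"id": "tx-1", "treatment_type": "Surgery", "researchsubject_diagnosis_id": "dx-1"},
    {"id": "tx-2", "treatment_type": "Radiation Therapy, NOS", "researchsubject_diagnosis_id": "dx-1"},
    {"id": "tx-3", "treatment_type": "Pharmaceutical Therapy, NOS", "researchsubject_diagnosis_id": "dx-2"},
]

# Index diagnoses by id so each treatment can look up its diagnosis directly
diagnosis_by_id = {dx["id"]: dx for dx in diagnoses}

# Group treatment types under the diagnosis they were given for
treatments_by_diagnosis = {}
for tx in treatments:
    dx = diagnosis_by_id[tx["researchsubject_diagnosis_id"]]
    treatments_by_diagnosis.setdefault(dx["primary_diagnosis"], []).append(tx["treatment_type"])

print(treatments_by_diagnosis)
```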




### specimen

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [11]:
specimenresults =  myquery.specimen.run()
print(specimenresults)

Total execution time: 3594 ms

            QueryID: 5c30d8e8-e5d2-49f0-bbf2-78c47f5fe27f
            
            Offset: 0
            Count: 100
            Total Row Count: 39150
            More pages: True
            


Nearly 40,000 specimens meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects: first because studies will often take more than one sample per subject, and second because any given specimen might be aliquoted out to be used in multiple tests. Since we didn't specify any further filters, our results will return all of these as separate specimens. Let's look at a few:

In [12]:
specimenresults.to_dataframe()

Unnamed: 0,id,identifier,associated_project,age_at_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id,researchsubject_id
0,007d8606-60c0-4f0c-ab7c-50cdef117b32,"[{'system': 'GDC', 'value': '007d8606-60c0-4f0...",TCGA-LGG,-10027,Gliomas,,Primary Tumor,aliquot,facc315d-fd9c-4d2c-9b15-578b91e1aa91,TCGA-P5-A5ET,d87742fd-ff94-4ef0-bbf4-5dd5a185a0e7
1,0291f03c-1348-40e1-b4f0-1d9c9683de41,"[{'system': 'GDC', 'value': '0291f03c-1348-40e...",CPTAC-3,-12449,Gliomas,,Primary Tumor,sample,initial specimen,C3L-02984,5718a50d-332c-4882-8736-2ca8989946dd
2,02db7613-55ba-4672-bd13-5fbddb7ad174,"[{'system': 'GDC', 'value': '02db7613-55ba-467...",TCGA-LGG,-14679,Gliomas,,Primary Tumor,aliquot,8ff9fc97-1b01-46d1-b525-a801d80c2ab0,TCGA-S9-A6TW,c45e02e5-7e3c-4f36-8b8d-f54617e8a436
3,031d6f5e-e91e-4abe-8b45-b749e044a6f4,"[{'system': 'GDC', 'value': '031d6f5e-e91e-4ab...",TCGA-GBM,-25684,Gliomas,,Blood Derived Normal,aliquot,d797ba11-67c5-47c6-ae91-e5a9343fda95,TCGA-14-1825,e0e1b5b3-6e3c-4b79-aa2a-32b320c3e45a
4,036f3828-857e-52e8-9023-0f21d8668f7d,"[{'system': 'GDC', 'value': '036f3828-857e-52e...",TCGA-GBM,-12777,Gliomas,,Primary Tumor,portion,445fab23-bde2-4468-bb81-d46ef0d1dd00,TCGA-06-0176,7391f99d-9528-46a1-9799-c5f0a7bfc63e
...,...,...,...,...,...,...,...,...,...,...,...
95,3d272667-6a91-4fae-b3c1-41bc8dc2ddf7,"[{'system': 'GDC', 'value': '3d272667-6a91-4fa...",TCGA-GBM,-23317,Gliomas,,Primary Tumor,analyte,af1a25b6-3766-4d1d-af94-a5a02fbb4024,TCGA-06-0184,1f48f010-98fe-4b5a-b96a-14fb25eff23f
96,3e2dba25-2fc2-4f3b-90f0-663e1576a6bf,"[{'system': 'GDC', 'value': '3e2dba25-2fc2-4f3...",TCGA-LGG,-14659,Gliomas,,Primary Tumor,aliquot,c92c7a46-e026-48a7-9057-e5853b41d22a,TCGA-HT-7620,0d61bdbd-24b1-4885-a099-9e42ca7eedcd
97,3eff6639-06e7-404b-9214-a376a1a61b63,"[{'system': 'GDC', 'value': '3eff6639-06e7-404...",TCGA-LGG,-9322,Gliomas,,Primary Tumor,analyte,abdbe23f-34b2-4e72-ad1c-26f39f1a132d,TCGA-DU-7011,cb598780-9e42-4167-b487-eec90ad4f36f
98,3f9d2c59-09f2-4904-8351-58e5b6ba1ce6,"[{'system': 'GDC', 'value': '3f9d2c59-09f2-490...",TCGA-GBM,-15950,Gliomas,,Primary Tumor,portion,46c31ffb-160d-4705-a7ca-6b3393054d8c,TCGA-26-1442,c29d73c0-c885-4105-bf74-38e9178e71c9



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Specimen Field Definitions</h3>

<i>Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this specimen</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the specimen had there</li>
  <li><b>associated_project:</b> The name of the study/project that the subject participated in</li>
  <li><b>age_at_collection:</b> The subject's age in days (counting backwards to birth) on the day of the collection of the proximate specimen</li>
  <li><b>primary_disease_type:</b> The disease that qualifies the researchsubject for the associated_project</li>
  <li><b>anatomical_site:</b> The body part from which the proximate specimen was taken</li>
  <li><b>source_material_type:</b> The general kind of material from which the specimen was derived, indicating the physical nature of the source material</li>
  <li><b>specimen_type:</b> The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide</li>
  <li><b>derived_from_specimen:</b> For derived samples, the `id` for the original sample</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
  <li><b>researchsubject_id:</b> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</li>
</ul>  

</div>
    
---
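
Since `derived_from_specimen` points at the specimen a row was derived from, the links can be followed back to the original sample. Below is a minimal sketch of that walk in plain Python. The specimen ids are made up and `trace_to_origin` is a hypothetical helper. The preview shows original samples marked with the value "initial specimen", and the sketch relies on that.

```python
# Made-up specimens keyed by id; "initial specimen" marks an original sample,
# matching the value shown in the derived_from_specimen column above
derived_from = {
    "aliquot-1": "analyte-1",
    "analyte-1": "portion-1",
    "portion-1": "sample-1",
    "sample-1": "initial specimen",
}

def trace_to_origin(specimen_id, derived_from):
    """Follow derived_from_specimen links until the original sample is reached.

    Ids missing from the mapping (for example, rows on another page) are
    treated as the end of the chain.
    """
    chain = [specimen_id]
    seen = {specimen_id}
    while derived_from.get(chain[-1], "initial specimen") != "initial specimen":
        parent = derived_from[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain

lineage = trace_to_origin("aliquot-1", derived_from)
print(" -> ".join(lineage))
```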


### file

The file endpoint returns information about files that meet our search criteria, regardless of whether they are attached to subjects, research-subjects or specimens: 

In [13]:
myquery.file.run()

Total execution time: 3625 ms



            QueryID: 01b1daed-96cd-43ca-b95f-2863337c041d
            
            Offset: 0
            Count: 100
            Total Row Count: 4530800
            More pages: True
            

In [14]:
fileresults = myquery.file.run()
fileresults.to_dataframe()

Total execution time: 3668 ms


Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,researchsubject_specimen_id,researchsubject_id,subject_id
0,b0af44ea-827b-429a-abe4-9be51efdcb01,"[{'system': 'IDC', 'value': 'b0af44ea-827b-429...",idc/b0af44ea-827b-429a-abe4-9be51efdcb01.dcm,Imaging,,DICOM,tcga_gbm,drs://dg.4DFC:b0af44ea-827b-429a-abe4-9be51efd...,,,Imaging,MR,,,TCGA-06-1801__tcga_gbm,TCGA-06-1801
1,b1bdea96-a6b9-426d-9f08-62851c688fb0,"[{'system': 'IDC', 'value': 'b1bdea96-a6b9-426...",idc/b1bdea96-a6b9-426d-9f08-62851c688fb0.dcm,Imaging,,DICOM,acrin_dsc_mr_brain,drs://dg.4DFC:b1bdea96-a6b9-426d-9f08-62851c68...,,,Imaging,MR,,,ACRIN-DSC-MR-Brain-006__acrin_dsc_mr_brain,ACRIN-DSC-MR-Brain-006
2,b42a8b89-73e2-4f06-91c2-e372f43d4bca,"[{'system': 'IDC', 'value': 'b42a8b89-73e2-4f0...",idc/b42a8b89-73e2-4f06-91c2-e372f43d4bca.dcm,Imaging,,DICOM,ivygap,drs://dg.4DFC:b42a8b89-73e2-4f06-91c2-e372f43d...,,,Imaging,MR,,,W22__ivygap,W22
3,b80074a4-3682-450c-93f5-e92c8d5f72c5,"[{'system': 'IDC', 'value': 'b80074a4-3682-450...",idc/b80074a4-3682-450c-93f5-e92c8d5f72c5.dcm,Imaging,,DICOM,acrin_dsc_mr_brain,drs://dg.4DFC:b80074a4-3682-450c-93f5-e92c8d5f...,,,Imaging,MR,,,ACRIN-DSC-MR-Brain-008__acrin_dsc_mr_brain,ACRIN-DSC-MR-Brain-008
4,b952da7f-9c9b-4157-a21c-60a96973dee6,"[{'system': 'IDC', 'value': 'b952da7f-9c9b-415...",idc/b952da7f-9c9b-4157-a21c-60a96973dee6.dcm,Imaging,,DICOM,acrin_dsc_mr_brain,drs://dg.4DFC:b952da7f-9c9b-4157-a21c-60a96973...,,,Imaging,MR,,,ACRIN-DSC-MR-Brain-114__acrin_dsc_mr_brain,ACRIN-DSC-MR-Brain-114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1b10dc16-42f6-4269-b2a2-1fdd8e0bc3f4,"[{'system': 'IDC', 'value': '1b10dc16-42f6-426...",idc/1b10dc16-42f6-4269-b2a2-1fdd8e0bc3f4.dcm,Imaging,,DICOM,acrin_fmiso_brain,drs://dg.4DFC:1b10dc16-42f6-4269-b2a2-1fdd8e0b...,,,Imaging,MR,,,ACRIN-FMISO-Brain-025__acrin_fmiso_brain,ACRIN-FMISO-Brain-025
96,20af19dd-b9e7-4771-9e02-0fdfd53e4dd4,"[{'system': 'IDC', 'value': '20af19dd-b9e7-477...",idc/20af19dd-b9e7-4771-9e02-0fdfd53e4dd4.dcm,Imaging,,DICOM,acrin_dsc_mr_brain,drs://dg.4DFC:20af19dd-b9e7-4771-9e02-0fdfd53e...,,,Imaging,MR,,,ACRIN-DSC-MR-Brain-055__acrin_dsc_mr_brain,ACRIN-DSC-MR-Brain-055
97,25125b60-8acb-449a-b7f4-2c8f8bbbe535,"[{'system': 'IDC', 'value': '25125b60-8acb-449...",idc/25125b60-8acb-449a-b7f4-2c8f8bbbe535.dcm,Imaging,,DICOM,ivygap,drs://dg.4DFC:25125b60-8acb-449a-b7f4-2c8f8bbb...,,,Imaging,MR,,,W35__ivygap,W35
98,25a95305-fd9f-4a6a-a9dd-44ae978fc803,"[{'system': 'IDC', 'value': '25a95305-fd9f-4a6...",idc/25a95305-fd9f-4a6a-a9dd-44ae978fc803.dcm,Imaging,,DICOM,rembrandt,drs://dg.4DFC:25a95305-fd9f-4a6a-a9dd-44ae978f...,,,Imaging,MR,,,HF1113__rembrandt,HF1113


As you might expect, searching file gives us a huge number of results. This is great if you are surveying what kind of data is available, but is less useful for getting a coherent cohort. 

A better way to get files for a specific cohort is to chain your queries together, which we cover in the next tutorial, [Chaining Queries](../ChainingQueries). It shows how to combine information from multiple endpoints and build And/Or/Like and other advanced query strings.

Another useful way to look at high level information is to use our counts feature which returns summary information rather than the full search results. Check out the [Data Summaries tutorial](../DataSummaries) to try it.



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this file</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the file had there</li>
  <li><b>label:</b> The full name of the file</li>
  <li><b>data_category:</b> A description of the general kind of data the file holds</li>
  <li><b>data_type:</b> A more specific description of the data type</li>
  <li><b>file_format:</b> String to identify the full file extension including compression extensions</li>
  <li><b>associated_project:</b> The name the data center uses for the study this file was generated for</li>
  <li><b>drs_uri:</b> A unique identifier that can be used to retrieve this specific file from a server</li>
  <li><b>byte_size:</b> Size of the file in bytes</li>
  <li><b>checksum:</b> The md5 value for the file</li>
  <li><b>data_modality:</b> Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging"</li>
  <li><b>imaging_modality:</b> For files with the `data_modality` of "Imaging", a descriptor for the image type</li>
  <li><b>dbgap_accession_number:</b> The project id number for this data on dbGaP</li>
</ul>  

</div>
    
---
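
When a file does have a `checksum` value, the md5 can be used to confirm that a downloaded copy is intact. Here is a minimal sketch using Python's standard `hashlib` module. The helper names are hypothetical, and the content and checksum are stand-ins rather than values from CDA. Note that the checksum column is empty for the imaging files in the preview above.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def matches_checksum(data: bytes, expected_md5: str) -> bool:
    """Compare a computed digest against a reported checksum, ignoring case."""
    return md5_hex(data) == expected_md5.strip().lower()

# Stand-in content; in practice this would be the downloaded file's bytes
content = b"hello"
reported = "5d41402abc4b2a76b9719d911017c592"  # md5 of b"hello"
ok = matches_checksum(content, reported)
print(ok)
```

For large files, reading the file in chunks and calling `update()` on the hash object avoids loading everything into memory at once.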
