# Summarize Search Results

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.21


CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in four endpoints:

- `subject`: Specific, unique individuals
- `researchsubject`: Study-individual aggregate entities. A `subject` who was part of three studies will appear as three `researchsubject` entries
- `specimen`: Samples taken from an individual
- `file`: Data about `subject`, `researchsubject`, `specimen`, and their associated information


If you are looking to build a cohort of distinct individuals who meet some criteria, search by `subject`. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by `researchsubject`. If you are looking for biosamples that can be ordered or a specific format of information (e.g., histological slides), start with `specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `file`.

In CDA search, these concepts can also be strung together, so you can look specifically for `specimen file` or `researchsubject specimen`. In all cases, any search can use any metadata field; the only difference between search types is the type of data returned by default.


## Getting simple summary data

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`. This is the same query we ran in the [Basic Search](../BasicSearch.ipynb) notebook:

In [2]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')


<div style="background-color:#6ce6b9;color:black;padding:20px;">
<h3>Where did those terms come from?</h3>
    
If you aren't sure how we knew what terms to put in our search, please refer back to the <a href="../SearchTerms">What search terms are available?</a> notebook. 
</div>

### Overall summary

You can get a quick summary of how many unique specimens, treatments, diagnoses, researchsubjects and subjects meet your search criteria by chaining a `count` command into the basic `run` call. 

In [3]:
myquery.count.run()

Total execution time: 3470 ms



            QueryID: 61d0445f-313a-41a0-8bc8-5d8fc1329ace
            
            Offset: 0
            Count: 1
            Total Row Count: 1
            More pages: False
            

These numbers are how many total rows of data will come back when querying the various endpoints.



### subject summary

We can also add `count` to the other run calls we did in the [Basic Search](../BasicSearch.ipynb) notebook to get more detailed summaries:

In [4]:
subjectresults = myquery.subject.count.run()

Total execution time: 3413 ms


Since we save the output as a variable, we need to look at the variable to see the results:

In [5]:
subjectresults

system,count
IDC,1167
PDC,309
GDC,1449

sex,count
,683
male,979
female,649
not reported,3

race,count
,683
white,1308
not reported,135
asian,33
black or african american,96
Unknown,20
other,9
not allowed to collect,25
american indian or alaska native,4
native hawaiian or other pacific islander,1

ethnicity,count
,683
not hispanic or latino,1282
not reported,219
Unknown,21
hispanic or latino,84
not allowed to collect,25

cause_of_death,count
,2028
Not Reported,200
Cancer Related,63
Infection,3
Not Cancer Related,9
Surgical Complications,2
Unknown,9




By default, the results are displayed as a table for easy previewing of the data. Since we queried the `subject` endpoint, our default results tell us `subject`-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using `count` gives us a pivot-table-style summary of the countable fields for subjects. Note that above the table it also tells you the total subject count, as well as how many files are associated with those subjects.

This gives you a quick way to assess whether the full search results will have the data fields you require. But if you want to get the underlying data for your own downstream applications, you can also get the raw numbers by calling the zeroth value of the variable:

In [6]:
subjectresults[0]

{'total': 2314,
 'files': 4081065,
 'system': [{'system': 'IDC', 'count': 1167},
  {'system': 'PDC', 'count': 309},
  {'system': 'GDC', 'count': 1449}],
 'sex': [{'sex': 'null', 'count': 683},
  {'sex': 'male', 'count': 979},
  {'sex': 'female', 'count': 649},
  {'sex': 'not reported', 'count': 3}],
 'race': [{'race': 'null', 'count': 683},
  {'race': 'white', 'count': 1308},
  {'race': 'not reported', 'count': 135},
  {'race': 'asian', 'count': 33},
  {'race': 'black or african american', 'count': 96},
  {'race': 'Unknown', 'count': 20},
  {'race': 'other', 'count': 9},
  {'race': 'not allowed to collect', 'count': 25},
  {'race': 'american indian or alaska native', 'count': 4},
  {'race': 'native hawaiian or other pacific islander', 'count': 1}],
 'ethnicity': [{'ethnicity': 'null', 'count': 683},
  {'ethnicity': 'not hispanic or latino', 'count': 1282},
  {'ethnicity': 'not reported', 'count': 219},
  {'ethnicity': 'Unknown', 'count': 21},
  {'ethnicity': 'hispanic or latino', 'coun
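
As one way to use these raw numbers downstream, the sketch below measures how much of a field is missing. It does not call cdapython: `raw_counts` is a hand-copied stand-in for `subjectresults[0]` that holds only the fields printed in full above, and `missing_fraction` is a hypothetical helper name, not part of the library.

```python
# Hand-copied stand-in for subjectresults[0]; in a live session you would
# use the real object instead. Only fields printed in full above are included.
raw_counts = {
    "total": 2314,
    "files": 4081065,
    "system": [
        {"system": "IDC", "count": 1167},
        {"system": "PDC", "count": 309},
        {"system": "GDC", "count": 1449},
    ],
    "sex": [
        {"sex": "null", "count": 683},
        {"sex": "male", "count": 979},
        {"sex": "female", "count": 649},
        {"sex": "not reported", "count": 3},
    ],
}


def missing_fraction(raw, field):
    """Share of counted rows whose value for `field` is the string 'null'."""
    rows = raw[field]
    counted = sum(row["count"] for row in rows)
    missing = sum(row["count"] for row in rows if row[field] == "null")
    return missing / counted


sex_missing = missing_fraction(raw_counts, "sex")
print(f"sex is missing for {sex_missing:.1%} of subjects")

# The per-system counts add up to more than 'total'. Presumably a subject can
# be present in more than one data center (an inference, not stated here).
system_sum = sum(row["count"] for row in raw_counts["system"])
print(system_sum, "system rows vs", raw_counts["total"], "subjects")
```

The same helper should work on any counted field, provided the other endpoints return the same list-of-dictionaries shape, which is an assumption here.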


---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A subject is a specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.</i>
    
    
    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-7zrl">'total'</td>
    <td class="tg-0lax"> id</td>
    <td class="tg-0lax"> The unique identifier for this subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Data Center (System) Counted</td>
    <td class="tg-7zrl">identifier</td>
    <td class="tg-0lax"> An embedded array of information that includes the originating data center and the ID the subject had there</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">species</td>
    <td class="tg-0lax"> The species of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">sex</td>
    <td class="tg-0lax"> The sex of the subject </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">race</td>
    <td class="tg-0lax"> The race of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">ethnicity</td>
    <td class="tg-0lax"> The ethnicity of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">days_to_birth</td>
    <td class="tg-0lax"> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_associated_project</td>
    <td class="tg-0lax"> An embedded array of the names of projects (studies) the subject was part of</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">vital_status</td>
    <td class="tg-0lax"> Whether the subject is alive</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">age_at_death</td>
    <td class="tg-0lax"> The number of days after first enrollment that the subject died</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-0lax"> cause_of_death</td>
    <td class="tg-0lax"> The cause of death, if known</td>
  </tr>
</tbody>
</table>

</div>
    
---

### researchsubject

If we're interested in which researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint. Let's run it without saving to a variable this time to make it a bit quicker:

In [7]:
myquery.researchsubject.count.run()

Total execution time: 3448 ms


system,count
GDC,1449
PDC,309
IDC,1165

primary_diagnosis_condition,count
Gliomas,1244
Glioblastoma,100
Germ Cell Neoplasms,104
,1165
Pediatric/AYA Brain Tumors,199
Other,10
"Neoplasms, NOS",63
Not Reported,11
"Malignant Lymphomas, NOS or Diffuse",14
Not Applicable,9

primary_diagnosis_site,count
Brain,2923







---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. An individual who participates in 3 studies will have 3 researchsubject IDs.</i>
    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-7zrl">'total'</td>
    <td class="tg-7zrl">id</td>
    <td class="tg-0lax"> The unique identifier for this researchsubject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Data Center (System) Counted</td>
    <td class="tg-7zrl">identifier</td>
    <td class="tg-0lax"> An embedded array of information that includes the originating data center and the ID the researchsubject had there</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">member_of_research_project</td>
    <td class="tg-0lax"> The name of the study/project that the subject participated in</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">primary_diagnosis_condition</td>
    <td class="tg-0lax"> The cancer, disease or other condition under study</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">primary_diagnosis_site</td>
    <td class="tg-0lax"> The primary_disease_site that qualifies the researchsubject for the research_project</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_id</td>
    <td class="tg-0lax"> An identifier for the subject</td>
  </tr>
</tbody>
</table>
</div>
    
---

### diagnosis

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [8]:
myquery.diagnosis.count.run()

Total execution time: 3593 ms


system,count
GDC,1422
PDC,329

primary_diagnosis,count
Glioblastoma,821
Mixed glioma,131
"Ganglioglioma, NOS",18
Mixed germ cell tumor,79
"Neoplasm, malignant",50
"Oligodendroglioma, NOS",112
"Astrocytoma, NOS",64
"Glioma, NOS",93
"Oligodendroglioma, anaplastic",78
"Astrocytoma, anaplastic",130

stage,count
,1422
Not Reported,110
Unknown,219

grade,count
not reported,1116
G1,98
G2,52
Not Reported,392
G4,36
,22
High Grade,26
Low Grade,9
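
The grade summary above lists both `not reported` and `Not Reported`, so a downstream tally may want to merge labels that differ only by letter case. The sketch below shows one way to do that on hand-copied rows. It assumes the diagnosis counts have the same list-of-dictionaries shape shown for `subjectresults[0]`, with the empty label written as `'null'`; `merge_case_variants` is a hypothetical helper, not a cdapython function.

```python
# Rows copied by hand from the grade table above.
grade_rows = [
    {"grade": "not reported", "count": 1116},
    {"grade": "G1", "count": 98},
    {"grade": "G2", "count": 52},
    {"grade": "Not Reported", "count": 392},
    {"grade": "G4", "count": 36},
    {"grade": "null", "count": 22},
    {"grade": "High Grade", "count": 26},
    {"grade": "Low Grade", "count": 9},
]


def merge_case_variants(rows, field):
    """Sum counts for labels that differ only by letter case or outer spaces."""
    labels = {}  # normalized key -> first spelling seen
    totals = {}  # first spelling seen -> summed count
    for row in rows:
        cleaned = row[field].strip()
        label = labels.setdefault(cleaned.casefold(), cleaned)
        totals[label] = totals.get(label, 0) + row["count"]
    return totals


grade_counts = merge_case_variants(grade_rows, "grade")
for label, count in grade_counts.items():
    print(f"{label}: {count}")
```

Keeping the first spelling seen as the display label is an arbitrary choice; any consistent rule would do.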




---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Diagnosis Field Definitions</h3>

<i>A diagnosis is a medical classification of a disease for a given research subject in a given study. A single research subject may have different diagnoses across different studies</i>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-za14">'total'</td>
    <td class="tg-za14">id</td>
    <td class="tg-0pky"> The unique identifier for this diagnosis in this research subject</td>
  </tr>
  <tr>
    <td class="tg-za14">Data Center (System) Counted</td>
    <td class="tg-za14">identifier</td>
    <td class="tg-0pky"> An embedded array of information that includes the originating data center and the ID the diagnosed researchsubject had there</td>
  </tr>
  <tr>
    <td class="tg-za14">Counted</td>
    <td class="tg-za14">primary_diagnosis</td>
    <td class="tg-0pky"> The main medical diagnosis for this subject in this study</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">age_at_diagnosis</td>
    <td class="tg-0pky"> The subject's age in days after birth on the day they were first diagnosed</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">morphology</td>
    <td class="tg-0pky"> The <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> diagnostic code for this diagnosis</td>
  </tr>
  <tr>
    <td class="tg-za14">Counted</td>
    <td class="tg-za14">stage</td>
    <td class="tg-0pky"> A measure of disease spread. Different diseases may use different staging criteria</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">grade</td>
    <td class="tg-0lax"> A measure of cell abnormality. Different diseases may use different grading criteria</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">method_of_diagnosis</td>
    <td class="tg-0lax"> The test or system used for determining the diagnosis</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_id</td>
    <td class="tg-0lax"> An identifier for the subject. Can be joined to the `id` field from subject results</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">researchsubject_id</td>
    <td class="tg-0lax"> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</td>
  </tr>
</tbody>
</table>


</div>
    
---


### treatment

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [9]:
myquery.treatment.count.run()

Total execution time: 3475 ms


system,count
GDC,2379

treatment_type,count
"Radiation Therapy, NOS",1139
Targeted Molecular Therapy,23
"Pharmaceutical Therapy, NOS",1117
Immunotherapy (Including Vaccines),23
Chemotherapy,30
"Radiation, Proton Beam",1
Surgery,23
,23

treatment_effect,count
,2379
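
A summary like this is a quick way to spot fields that carry no data for your search: here every `treatment_effect` row is empty. The sketch below flags such fields automatically. It assumes the treatment counts have the same shape as the raw `subjectresults[0]` dictionary, with empty values written as `'null'`; the data is copied by hand from the table above, and `empty_fields` is a hypothetical helper name.

```python
# Hand-copied stand-in for the raw treatment counts.
treatment_counts = {
    "treatment_type": [
        {"treatment_type": "Radiation Therapy, NOS", "count": 1139},
        {"treatment_type": "Targeted Molecular Therapy", "count": 23},
        {"treatment_type": "Pharmaceutical Therapy, NOS", "count": 1117},
        {"treatment_type": "Immunotherapy (Including Vaccines)", "count": 23},
        {"treatment_type": "Chemotherapy", "count": 30},
        {"treatment_type": "Radiation, Proton Beam", "count": 1},
        {"treatment_type": "Surgery", "count": 23},
        {"treatment_type": "null", "count": 23},
    ],
    "treatment_effect": [{"treatment_effect": "null", "count": 2379}],
}


def empty_fields(raw):
    """Names of counted fields whose only reported value is 'null'."""
    return [
        field
        for field, rows in raw.items()
        # Skip scalar entries such as 'total' or 'files' if they are present.
        if isinstance(rows, list) and all(row[field] == "null" for row in rows)
    ]


unusable = empty_fields(treatment_counts)
print("fields with no data:", unusable)
```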





---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Treatment Field Definitions</h3>

<i>A treatment is a medical intervention for a diagnosed disease in a given subject in a given study. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.</i>
    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-za14">'total'</td>
    <td class="tg-za14">id</td>
    <td class="tg-0pky"> The unique identifier for this treatment of this diagnosis in this research subject</td>
  </tr>
  <tr>
    <td class="tg-za14">Data Center (System) Counted</td>
    <td class="tg-za14">identifier</td>
    <td class="tg-0pky"> An embedded array of information that includes the originating data center and the ID the treated researchsubject had there</td>
  </tr>
  <tr>
    <td class="tg-za14">Counted</td>
    <td class="tg-za14">treatment_type</td>
    <td class="tg-0pky"> The medical intervention undertaken</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">treatment_outcome</td>
    <td class="tg-0pky"> The result of the medical intervention</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">days_to_treatment_start</td>
    <td class="tg-0pky"> </td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">days_to_treatment_end</td>
    <td class="tg-0pky"> </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">therapeutic_agent</td>
    <td class="tg-0lax"> What treatment or drug was used for this researchsubject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">treatment_anatomic_site</td>
    <td class="tg-0lax"> The specific body location of the treatment</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">treatment_effect</td>
    <td class="tg-0lax"> </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">treatment_end_reason</td>
    <td class="tg-0lax"> </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">number_of_cycles</td>
    <td class="tg-0lax"> </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_id</td>
    <td class="tg-0lax"> An identifier for the subject. Can be joined to the `id` field from subject results</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">researchsubject_id</td>
    <td class="tg-0lax"> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">researchsubject_diagnosis_id</td>
    <td class="tg-0lax"> An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results</td>
  </tr>
</tbody>
</table>
 

</div>
    
---




### specimen

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [10]:
myquery.specimen.count.run()

Total execution time: 3544 ms


system,count
GDC,38492
PDC,658

primary_disease_type,count
Gliomas,37549
Glioblastoma,200
Other,20
Pediatric/AYA Brain Tumors,438
Mature B-Cell Lymphomas,54
Germ Cell Neoplasms,416
"Neoplasms, NOS",252
Not Reported,121
Not Applicable,36
"Malignant Lymphomas, NOS or Diffuse",56

source_material_type,count
Primary Tumor,27519
Solid Tissue Normal,538
Blood Derived Normal,10074
Recurrent Tumor,513
Not Reported,36
Next Generation Cancer Model,169
Expanded Next Generation Cancer Model,35
Metastatic,252
Buccal Cell Normal,14

specimen_type,count
portion,5986
sample,4085
analyte,6659
aliquot,18673
slide,3747




Nearly 40,000 specimens with over 50,000 files meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects: first, because studies will often take more than one sample per subject, and second, because any given specimen might be aliquoted out to be used in multiple tests.
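
To put rough numbers on that expectation, the sketch below works only from counts already shown on this page. The `specimen_type` counts come from the table above and the researchsubject total is the `Brain` row of the researchsubject summary. The ratio is only a rough figure, since that total includes IDC researchsubjects while the specimen summary lists only GDC and PDC.

```python
# Counts copied from the specimen_type table above.
specimen_type_counts = {
    "portion": 5986,
    "sample": 4085,
    "analyte": 6659,
    "aliquot": 18673,
    "slide": 3747,
}
# Row count from the researchsubject summary earlier on this page ('Brain').
researchsubject_total = 2923

specimen_total = sum(specimen_type_counts.values())
type_shares = {
    kind: count / specimen_total for kind, count in specimen_type_counts.items()
}
specimens_per_researchsubject = specimen_total / researchsubject_total

# Print the specimen types from most to least common.
for kind, share in sorted(type_shares.items(), key=lambda item: -item[1]):
    print(f"{kind:8s} {share:.1%}")
print(f"about {specimens_per_researchsubject:.1f} specimens per researchsubject")
```

Aliquots are the largest group, which fits the point that a single specimen is often split for several tests.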

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Specimen Field Definitions</h3>

<i>A specimen is a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID</i>
    
    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
    
<table class="tg">
<tbody>
  <tr>
    <td class="tg-za14">'total'</td>
    <td class="tg-za14">id</td>
    <td class="tg-0pky"> The unique identifier for this specimen</td>
  </tr>
  <tr>
    <td class="tg-za14">Data Center (System) Counted</td>
    <td class="tg-za14">identifier</td>
    <td class="tg-0pky"> An embedded array of information that includes the originating data center and the ID the specimen had there</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">associated_project</td>
    <td class="tg-0pky"> The name of the study/project that the subject participated in</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">age_at_collection</td>
    <td class="tg-0pky"> The subject's age at collection of the proximate specimen</td>
  </tr>
  <tr>
    <td class="tg-za14">Counted</td>
    <td class="tg-za14">primary_disease_type</td>
    <td class="tg-0pky"> The disease that qualifies the researchsubject for the associated_project</td>
  </tr>
  <tr>
    <td class="tg-za14">Not Counted</td>
    <td class="tg-za14">anatomical_site</td>
    <td class="tg-0pky"> The body part from which the proximate specimen was taken</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">source_material_type</td>
    <td class="tg-0lax"> The general kind of material from which the specimen was derived, indicating the physical nature of the source material</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">specimen_type</td>
    <td class="tg-0lax"> The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">derived_from_specimen</td>
    <td class="tg-0lax"> For derived samples, the `id` for the original sample</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_id</td>
    <td class="tg-0lax"> An identifier for the subject. Can be joined to the `id` field from subject results</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">research_subject_id</td>
    <td class="tg-0lax"> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results</td>
  </tr>
</tbody>
</table>
</div>

### file

The `file` endpoint returns all files that match our query:

In [13]:
myquery.file.count.run()

Total execution time: 3525 ms


system,count
IDC,4027448
PDC,3048
GDC,50569

data_category,count
Imaging,4027448
DNA Methylation,3339
Copy Number Variation,6909
Simple Nucleotide Variation,20481
Biospecimen,5575
Raw Mass Spectra,762
Sequencing Reads,5862
Processed Mass Spectra,762
Peptide Spectral Matches,1524
Proteome Profiling,679

file_format,count
DICOM,4027448
mzIdentML,762
BCR XML,2274
MAF,8375
vendor-specific,762
TXT,8848
BAM,5862
TSV,4319
VCF,12255
BEDPE,1878

data_type,count
,4027448
Annotated Somatic Mutation,11808
Methylation Beta Value,1113
Biospecimen Supplement,1946
Aligned Reads,5862
Gene Expression Quantification,900
Slide Image,3629
Masked Copy Number Segment,2185
Allele-specific Copy Number Segment,1071
Open Standard,1524




There are a huge number of files (4,081,065) that match our search. We would likely want to additionally filter the results by file format or data type to get only the files we can use. See all the ways you can filter and refine searches with more search terms in the [Advanced search](../AdvancedSearch-Operators) notebook.
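
Before writing a narrower query, the summary itself can give a rough idea of how many files a format filter would leave. The sketch below uses the `file_format` counts copied from the table above; the choice of `VCF` and `MAF` as the wanted formats is purely illustrative. It also shows that the listed counts add up to less than the file total, which suggests the summary lists only some of the values per field (an inference from the numbers, not something stated here).

```python
# Counts copied from the file_format table above.
file_format_counts = {
    "DICOM": 4027448,
    "mzIdentML": 762,
    "BCR XML": 2274,
    "MAF": 8375,
    "vendor-specific": 762,
    "TXT": 8848,
    "BAM": 5862,
    "TSV": 4319,
    "VCF": 12255,
    "BEDPE": 1878,
}
total_files = 4081065  # sum of the per-system file counts above

# Hypothetical selection of formats we know how to process.
wanted_formats = {"VCF", "MAF"}

estimated_matches = sum(
    count for fmt, count in file_format_counts.items() if fmt in wanted_formats
)
listed_total = sum(file_format_counts.values())
unlisted = total_files - listed_total

print(f"{estimated_matches} files in the wanted formats")
print(f"{unlisted} files are in formats not listed in the summary")
```

Because some formats are not listed, treat an estimate like this as a lower bound and confirm it with a filtered query.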

## Files from a single endpoint (endpoint chaining)

If you want all file formats and data types, but only from a specific endpoint, you can also filter the file results by chaining endpoints together. This will return all the files that match our search AND that are specifically from specimens:

In [12]:
myquery.specimen.file.count.run()

Total execution time: 3534 ms


system,count
PDC,3048
GDC,47446

data_category,count
Sequencing Reads,5862
Structural Variation,3144
Processed Mass Spectra,762
Transcriptome Profiling,3104
Peptide Spectral Matches,1524
DNA Methylation,3339
Biospecimen,3629
Simple Nucleotide Variation,20481
Raw Mass Spectra,762
Copy Number Variation,6909

file_format,count
MAF,8375
VCF,12255
BAM,5862
TXT,8848
vendor-specific,762
TSV,4319
mzIdentML,762
BEDPE,1878
IDAT,2226
SVS,3629

data_type,count
Masked Intensities,2226
Masked Copy Number Segment,2185
Splice Junction Quantification,864
Copy Number Segment,2334
Aggregated Somatic Mutation,1144
Slide Image,3629
Protein Expression Quantification,679
Text,762
Aligned Reads,5862
Raw Simple Somatic Mutation,6202




Learn more about chaining endpoints in the [Chaining endpoints](../AdvancedSearch-Chaining) notebook.
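
To see what the chain removed, you can compare the per-system counts from `myquery.file.count.run()` and `myquery.specimen.file.count.run()`. The sketch below does this with numbers copied by hand from the two outputs on this page; it assumes both runs count the same thing (distinct files), which is not confirmed here.

```python
# Per-system file counts copied from the two summaries on this page.
all_files_by_system = {"IDC": 4027448, "PDC": 3048, "GDC": 50569}
specimen_files_by_system = {"PDC": 3048, "GDC": 47446}

# Files that matched the search but were dropped by the specimen chain.
dropped_by_chain = {
    system: count - specimen_files_by_system.get(system, 0)
    for system, count in all_files_by_system.items()
}

for system, dropped in dropped_by_chain.items():
    print(f"{system}: {dropped} files not linked to a specimen")
```

On these numbers, all IDC files and a few thousand GDC files drop out, while every PDC file stays.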

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-7zrl">'total'</td>
    <td class="tg-7zrl">id</td>
    <td class="tg-7zrl">The unique identifier for this file</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Data Center (System) Counted</td>
    <td class="tg-7zrl">identifier</td>
    <td class="tg-7zrl">An embedded array of information that includes the originating data center and the ID the file had there</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">label</td>
    <td class="tg-7zrl">The full name of the file</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">data_category</td>
    <td class="tg-7zrl">A description of the general kind of data the file holds</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">data_type</td>
    <td class="tg-7zrl">A more specific description of the data type</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">file_format</td>
    <td class="tg-7zrl">String to identify the full file extension including compression extensions</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">associated_project</td>
    <td class="tg-7zrl">The name the data center uses for the study this file was generated for</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">drs_uri</td>
    <td class="tg-7zrl">A unique identifier that can be used to retrieve this specific file from a server</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">byte_size</td>
    <td class="tg-7zrl">Size of the file in bytes</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">checksum</td>
    <td class="tg-7zrl">The md5 value for the file</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">data_modality</td>
    <td class="tg-7zrl">Describes the biological nature of the information gathered as the result of an activity</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">imaging_modality</td>
    <td class="tg-7zrl">For files with the `data_modality` of "Imaging"</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">dbgap_accession_number</td>
    <td class="tg-7zrl">The project id number for this data on dbGaP</td>
  </tr>
</tbody>
</table>

</div>