# Available Search Terms
---

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.9




CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in four main endpoints:

- `subject`: A specific, unique individual, e.g., a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets
- `researchsubject`: A person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs
- `specimen`: A tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single researchsubject ID
- `file`: A unit of data about subjects, researchsubjects, specimens, or their associated information

and two endpoints that offer deeper information about data in the researchsubject endpoint:

- `diagnosis`: Information about what medical diagnosis a researchsubject has
- `treatment`: Information about what medical treatment(s) were performed for a given diagnosis

You can search any metadata field from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns and then filter rows.
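To make the spreadsheet analogy concrete, here is a minimal plain-Python sketch of "filter rows, then select columns". It does not call cdapython; the field names are taken from the `columns()` listing on this page, while the row values and the helper `select_and_filter` are invented purely for illustration.

```python
# Toy "spreadsheet": field names match the CDA listing, values are made up.
rows = [
    {"id": "subject-1", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Lung"},
    {"id": "subject-2", "sex": "male", "ResearchSubject.primary_diagnosis_site": "Kidney"},
    {"id": "subject-3", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Lung"},
]


def select_and_filter(rows, keep, predicate):
    """Keep rows where predicate(row) is true, then keep only the named fields."""
    return [{field: row[field] for field in keep} for row in rows if predicate(row)]


# Filter rows on one field, but return a different set of columns.
lung_rows = select_and_filter(
    rows,
    keep=["id", "sex"],
    predicate=lambda row: row["ResearchSubject.primary_diagnosis_site"] == "Lung",
)
print(lung_rows)
```

The point of the sketch is that the field you filter on does not have to be one of the fields you get back, which mirrors how any metadata field can be searched from any endpoint.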

Accordingly, to see what search fields are available, we use the command `columns`:

In [2]:
columns()


            QueryID: 5a3c3935-4a7e-42d0-a7c6-f3300b1ee149
            
            Offset: 0
            Count: 74
            Total Row Count: 74
            More pages: False
            

This output tells us that there are 74 searchable fields, but it doesn't output them directly. Running CDA commands like this first gives you an overall summary of the data you're going to get, which is useful as a gut check. However, if we want to see the data on our screen, we can have `columns()` print out its contents as a list instead:

In [3]:
columns().to_list()

['File.id',
 'File.identifier',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'age_at_death',
 'cause_of_death',
 'ResearchSubject',
 'ResearchSubject.id',
 'ResearchSubject.identifier',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnos

By default, `columns()` returns the first 100 items. If that is too many, you can limit your search to only a specified number: 

In [4]:
columns(limit=10).to_list()

['File.id',
 'File.identifier',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri']

Or you can filter the list for terms that match your interests:

In [5]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosi

<div style="background-color:#ed6161;color:#f5f5f5;padding:20px;">
<strong>Check your search criteria!</strong>
While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in <a href="../Documentation/Schema.md">CDA Schema Field Mapping.</a>
</div>


We can directly get information about what data populates any of these fields using the `unique_terms()` function. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and we view them the same way:

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other and ill-defined digest

We can use the same trick here to search for only diagnosis sites that we're interested in:

In [7]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="lung")


['Bronchus and lung', 'Lung', 'Lung Phantom']

We can use this same logic to look for partial matches. For instance, if I'm not sure whether the data I'm interested in would be labeled as "uterine" or "uterus", I might search for just "uter":

In [8]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

Success! Not only are there multiple ways that "Uterus" is specified in the CDA data, but I now also know that there are data for specific uterine tissues.
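The results above are consistent with a case-insensitive substring match ("lung" matched both "Lung" and "Bronchus and lung"). The sketch below reproduces that behavior in plain Python on a few values copied from the output on this page. It is an assumption about how the `filters` argument behaves, not cdapython's actual implementation, and `filter_terms` is a hypothetical helper.

```python
def filter_terms(terms, substring):
    """Return the terms that contain substring, ignoring case."""
    needle = substring.lower()
    return [term for term in terms if needle in term.lower()]


# Values copied from the unique_terms output shown above.
sites = [
    "Bronchus and lung",
    "Cervix uteri",
    "Corpus uteri",
    "Lung",
    "Lung Phantom",
    "Uterus",
    "Uterus, NOS",
]

print(filter_terms(sites, "uter"))
print(filter_terms(sites, "lung"))
```

Searching on the shortest stem that the spellings share ("uter" rather than "uterus") is what lets one filter catch several variants at once.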

---

<div style="background-color:#ed6161;color:#f5f5f5;padding:20px;">
<strong>Check your search terms!</strong>
If you run into unexpected results when running a search, be sure that you're searching all the terms you want. CDA data is not yet harmonized across centers, so there are many cases where a single-term search will not return all the information you need. However, the CDA provides tools that make it easy to search all forms of a term to enable cross-dataset search.
</div>
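One way to check that you have all forms of a term before searching is to group the unique values by a shared stem. The following is a plain-Python sketch under that assumption: the values are copied from the `unique_terms` output on this page, and `variants_for` is a hypothetical helper, not part of cdapython.

```python
def variants_for(stems, terms):
    """Map each stem to every term that contains it, ignoring case."""
    return {
        stem: [term for term in terms if stem.lower() in term.lower()]
        for stem in stems
    }


# Values copied from the unique_terms output shown above.
sites = [
    "Adrenal Glands",
    "Adrenal gland",
    "Head",
    "Head and Neck",
    "Head-Neck",
    "Heart, mediastinum, and pleura",
    "Kidney",
]

variants = variants_for(["adrenal", "head"], sites)
for stem, matches in variants.items():
    print(stem, "->", matches)
```

Each list holds the spellings you would likely want to include together, since a search on only one of them would miss records labeled with the others.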

---


Explore the available terms by changing which table, how many results, and which unique terms you request. Once you have found terms you're interested in, head to [Basic Search](../BasicSearch) to build simple queries.