{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "614a478f",
   "metadata": {},
   "source": [
    "# Build a Cohort"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8004e87",
   "metadata": {},
   "source": [
    "**Example use case:** \n",
    "\n",
    "<img src=\"./images/julia.png\" alt=\"alt_text\" align=\"left\"\n",
    "\twidth=\"150\" height=\"150\" />\n",
    "Julia is an oncologist that specializes in female reproductive health. As part of her research, she is interested in  using existing data on uterine cancers. If possible, she would like to see multiple datatypes (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes to test for their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0cadd7d",
   "metadata": {},
   "source": [
    "Before Julia does any work, she needs to import several functions from cdapython:\n",
    "\n",
    "- `Q` and `query` which power the search\n",
    "- `columns` which lets us view entity field names\n",
    "- `unique_terms` which lets view entity field contents\n",
    "\n",
    "She also asks cdapython to report it's version so she can be sure she's using the one she means to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a5265d4d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022.6.21\n"
     ]
    }
   ],
   "source": [
    "from cdapython import Q, columns, unique_terms, query\n",
    "import cdapython\n",
    "import pandas as pd \n",
    "print(cdapython.__version__)\n",
    "Q.set_host_url(\"http://35.192.60.10:8080/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75eef23e",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "    \n",
    "CDA data comes from three sources:\n",
    "<ul>\n",
    "<li><b>The <a href=\"https://proteomic.datacommons.cancer.gov/pdc/\"> Proteomic Data Commons</a> (PDC)</b></li>\n",
    "<li><b>The <a href=\"https://gdc.cancer.gov/\">Genomic Data Commons</a> (GDC)</b></li>\n",
    "<li><b>The <a href=\"https://datacommons.cancer.gov/repository/imaging-data-commons\">Imaging Data Commons</a> (IDC)</b></li>\n",
    "</ul> \n",
    "    \n",
    "The CDA makes this data searchable in four main endpoints:\n",
    "\n",
    "<ul>\n",
    "<li><b>subject:</b> A specific, unique individual: for e.g. a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets</li>\n",
    "<li><b>researchsubject:</b> a person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs</li>\n",
    "<li><b>specimen:</b> a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID</li>\n",
    "<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>\n",
    "</ul>\n",
    "and two endpoints that offer deeper information about data in the researchsubject endpoint:\n",
    "<ul>\n",
    "<li><b>diagnosis:</b> Information about what medical diagnosis a researchsubject has</li>\n",
    "<li><b>treatment:</b> Information about what medical treatment(s) were performed for a given diagnosis</li>\n",
    "</ul>\n",
    "Any metadata field can be searched from any endpoint, the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want select columns, and then filter rows.\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "391bc9a7",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "   \n",
    "   Accordingly, to see what search fields are available, Julia starts by using the command `columns`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ef0dd8e5",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['File.id',\n",
       " 'File.identifier.system',\n",
       " 'File.identifier.value',\n",
       " 'File.label',\n",
       " 'File.data_category',\n",
       " 'File.data_type',\n",
       " 'File.file_format',\n",
       " 'File.associated_project',\n",
       " 'File.drs_uri',\n",
       " 'File.byte_size',\n",
       " 'File.checksum',\n",
       " 'File.data_modality',\n",
       " 'File.imaging_modality',\n",
       " 'File.dbgap_accession_number',\n",
       " 'id',\n",
       " 'identifier.system',\n",
       " 'identifier.value',\n",
       " 'species',\n",
       " 'sex',\n",
       " 'race',\n",
       " 'ethnicity',\n",
       " 'days_to_birth',\n",
       " 'subject_associated_project',\n",
       " 'vital_status',\n",
       " 'age_at_death',\n",
       " 'cause_of_death',\n",
       " 'ResearchSubject.id',\n",
       " 'ResearchSubject.identifier.system',\n",
       " 'ResearchSubject.identifier.value',\n",
       " 'ResearchSubject.member_of_research_project',\n",
       " 'ResearchSubject.primary_diagnosis_condition',\n",
       " 'ResearchSubject.primary_diagnosis_site',\n",
       " 'ResearchSubject.Diagnosis.id',\n",
       " 'ResearchSubject.Diagnosis.identifier.system',\n",
       " 'ResearchSubject.Diagnosis.identifier.value',\n",
       " 'ResearchSubject.Diagnosis.primary_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.age_at_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.morphology',\n",
       " 'ResearchSubject.Diagnosis.stage',\n",
       " 'ResearchSubject.Diagnosis.grade',\n",
       " 'ResearchSubject.Diagnosis.method_of_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.Treatment.id',\n",
       " 'ResearchSubject.Diagnosis.Treatment.identifier.system',\n",
       " 'ResearchSubject.Diagnosis.Treatment.identifier.value',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_type',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',\n",
       " 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',\n",
       " 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',\n",
       " 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_effect',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',\n",
       " 'ResearchSubject.Diagnosis.Treatment.number_of_cycles',\n",
       " 'ResearchSubject.Specimen.id',\n",
       " 'ResearchSubject.Specimen.identifier.system',\n",
       " 'ResearchSubject.Specimen.identifier.value',\n",
       " 'ResearchSubject.Specimen.associated_project',\n",
       " 'ResearchSubject.Specimen.age_at_collection',\n",
       " 'ResearchSubject.Specimen.primary_disease_type',\n",
       " 'ResearchSubject.Specimen.anatomical_site',\n",
       " 'ResearchSubject.Specimen.source_material_type',\n",
       " 'ResearchSubject.Specimen.specimen_type',\n",
       " 'ResearchSubject.Specimen.derived_from_specimen',\n",
       " 'ResearchSubject.Specimen.derived_from_subject']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns().to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd05eba2",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "   \n",
    "There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "536970c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ResearchSubject.primary_diagnosis_condition',\n",
       " 'ResearchSubject.primary_diagnosis_site',\n",
       " 'ResearchSubject.Diagnosis.id',\n",
       " 'ResearchSubject.Diagnosis.identifier.system',\n",
       " 'ResearchSubject.Diagnosis.identifier.value',\n",
       " 'ResearchSubject.Diagnosis.primary_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.age_at_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.morphology',\n",
       " 'ResearchSubject.Diagnosis.stage',\n",
       " 'ResearchSubject.Diagnosis.grade',\n",
       " 'ResearchSubject.Diagnosis.method_of_diagnosis',\n",
       " 'ResearchSubject.Diagnosis.Treatment.id',\n",
       " 'ResearchSubject.Diagnosis.Treatment.identifier.system',\n",
       " 'ResearchSubject.Diagnosis.Treatment.identifier.value',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_type',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',\n",
       " 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',\n",
       " 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',\n",
       " 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_effect',\n",
       " 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',\n",
       " 'ResearchSubject.Diagnosis.Treatment.number_of_cycles']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns().to_list(filters=\"diagnosis\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a63b4cf0",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "\n",
    "To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retreiving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.\n",
    "    \n",
    "</div>\n",
    "\n",
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Since Julia is interested specificially in uterine cancers, she uses the `unique_terms` function to see what data is available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site' to see if 'uterine' appears:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4527dde5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Brain',\n",
       " 'Cervix',\n",
       " 'Head - Face Or Neck, Nos',\n",
       " 'Lymph Node(s) Paraaortic',\n",
       " 'Other',\n",
       " 'Pelvis',\n",
       " 'Spine',\n",
       " 'Unknown']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_terms(\"ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site\").to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "740e5955",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Abdomen',\n",
       " 'Abdomen, Mediastinum',\n",
       " 'Adrenal Glands',\n",
       " 'Adrenal gland',\n",
       " 'Anus and anal canal',\n",
       " 'Base of tongue',\n",
       " 'Bile Duct',\n",
       " 'Bladder',\n",
       " 'Bones, joints and articular cartilage of limbs',\n",
       " 'Bones, joints and articular cartilage of other and unspecified sites',\n",
       " 'Brain',\n",
       " 'Breast',\n",
       " 'Bronchus and lung',\n",
       " 'Cervix',\n",
       " 'Cervix uteri',\n",
       " 'Chest',\n",
       " 'Chest-Abdomen-Pelvis, Leg, TSpine',\n",
       " 'Colon',\n",
       " 'Connective, subcutaneous and other soft tissues',\n",
       " 'Corpus uteri',\n",
       " 'Ear',\n",
       " 'Esophagus',\n",
       " 'Extremities',\n",
       " 'Eye and adnexa',\n",
       " 'Floor of mouth',\n",
       " 'Gallbladder',\n",
       " 'Gum',\n",
       " 'Head',\n",
       " 'Head and Neck',\n",
       " 'Head-Neck',\n",
       " 'Heart, mediastinum, and pleura',\n",
       " 'Hematopoietic and reticuloendothelial systems',\n",
       " 'Hypopharynx',\n",
       " 'Intraocular',\n",
       " 'Kidney',\n",
       " 'Larynx',\n",
       " 'Lip',\n",
       " 'Liver',\n",
       " 'Liver and intrahepatic bile ducts',\n",
       " 'Lung',\n",
       " 'Lung Phantom',\n",
       " 'Lymph nodes',\n",
       " 'Marrow, Blood',\n",
       " 'Meninges',\n",
       " 'Mesothelium',\n",
       " 'Nasal cavity and middle ear',\n",
       " 'Nasopharynx',\n",
       " 'Not Reported',\n",
       " 'Oropharynx',\n",
       " 'Other and ill-defined digestive organs',\n",
       " 'Other and ill-defined sites',\n",
       " 'Other and ill-defined sites in lip, oral cavity and pharynx',\n",
       " 'Other and ill-defined sites within respiratory system and intrathoracic organs',\n",
       " 'Other and unspecified female genital organs',\n",
       " 'Other and unspecified major salivary glands',\n",
       " 'Other and unspecified male genital organs',\n",
       " 'Other and unspecified parts of biliary tract',\n",
       " 'Other and unspecified parts of mouth',\n",
       " 'Other and unspecified parts of tongue',\n",
       " 'Other and unspecified urinary organs',\n",
       " 'Other endocrine glands and related structures',\n",
       " 'Ovary',\n",
       " 'Palate',\n",
       " 'Pancreas',\n",
       " 'Pancreas ',\n",
       " 'Pelvis, Prostate, Anus',\n",
       " 'Penis',\n",
       " 'Peripheral nerves and autonomic nervous system',\n",
       " 'Phantom',\n",
       " 'Prostate',\n",
       " 'Prostate gland',\n",
       " 'Rectosigmoid junction',\n",
       " 'Rectum',\n",
       " 'Renal pelvis',\n",
       " 'Retroperitoneum and peritoneum',\n",
       " 'Skin',\n",
       " 'Small intestine',\n",
       " 'Spinal cord, cranial nerves, and other parts of central nervous system',\n",
       " 'Stomach',\n",
       " 'Testicles',\n",
       " 'Testis',\n",
       " 'Thymus',\n",
       " 'Thyroid',\n",
       " 'Thyroid gland',\n",
       " 'Tonsil',\n",
       " 'Trachea',\n",
       " 'Unknown',\n",
       " 'Ureter',\n",
       " 'Uterus',\n",
       " 'Uterus, NOS',\n",
       " 'Vagina',\n",
       " 'Various',\n",
       " 'Various (11 locations)',\n",
       " 'Vulva']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_terms(\"ResearchSubject.primary_diagnosis_site\").to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b005036b",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "    \n",
    "CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73e6b8dc",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Julia sees that \"treatment_anatomic_site\" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the \"primary_diagnosis_site\" results. As she was initially looking for \"uterine\", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the \"ResearchSubject.primary_diagnosis_site\" for 'uter' as that should cover all variants:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "31064125",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_terms(\"ResearchSubject.primary_diagnosis_site\").to_list(filters=\"uter\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9311a49e",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Just to be sure, Julia also searches for any other instances of \"cervix\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2038a8cf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Cervix', 'Cervix uteri']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_terms(\"ResearchSubject.primary_diagnosis_site\").to_list(filters=\"cerv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29c4de58",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the \"treatment_anatomic_site\", only one term is of interest, so she uses the `=` operator to get only exact matches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "951fcc8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = \"Cervix\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12cb5f72",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "However, for \"primary_diagnosis_site\", Julia has several terms she wants to search with. Luckily, `Q` also can run fuzzy searches. It can also search more than one term at a time, so Julia writes one big `Q` statement to grab everything that is either 'uter' or 'cerv':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "36cfd8a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "Dsite = Q('ResearchSubject.primary_diagnosis_site = \"%uter%\" OR ResearchSubject.primary_diagnosis_site = \"%cerv%\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "349af6f2",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Finally, Julia adds her two queries together into one large one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9f5f9e4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ALLDATA = Tsite.OR(Dsite)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1f5cb55",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "355b1706",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3346 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "            QueryID: cd87701d-7844-4410-a04c-3a363eab6ae5\n",
       "            \n",
       "            Offset: 0\n",
       "            Count: 1\n",
       "            Total Row Count: 1\n",
       "            More pages: False\n",
       "            "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ALLDATA.count.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7ce25fc",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "It seems there's a lot of data that might work for Julias study! Since she is interested in the beginings of cancer, she decides to start by looking at  the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using `count`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "55b0cdeb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3611 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "            QueryID: 7db53011-5f7a-4dad-80a5-5cc2cb332e69\n",
       "            \n",
       "            Offset: 0\n",
       "            Count: 100\n",
       "            Total Row Count: 4867\n",
       "            More pages: True\n",
       "            "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ALLDATA.researchsubject.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86a323e2",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Browsing the primary_diagnosis_condition data, Julia notices that there are a large number of research subjects that are Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine related data, as they might have different mechanisms. So she adds a new filter to her query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "0d526198",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3449 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">    total : <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3196</span>    \n",
       "</pre>\n"
      ],
      "text/plain": [
       "    total : \u001b[1;36m3196\u001b[0m    \n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">   files : <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">297923</span>   \n",
       "</pre>\n"
      ],
      "text/plain": [
       "   files : \u001b[1;36m297923\u001b[0m   \n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_82040_ th {\n",
       "  background-color: #000066;\n",
       "  color: white;\n",
       "  text-align: left;\n",
       "}\n",
       "#T_82040_ td {\n",
       "  text-align: left;\n",
       "  border-bottom: 1px solid black;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_82040_\" style='display:inline'>\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"col_heading level0 col0\" >system</th>\n",
       "      <th class=\"col_heading level0 col1\" >count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td id=\"T_82040_row0_col0\" class=\"data row0 col0\" >PDC</td>\n",
       "      <td id=\"T_82040_row0_col1\" class=\"data row0 col1\" >104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_82040_row1_col0\" class=\"data row1 col0\" >GDC</td>\n",
       "      <td id=\"T_82040_row1_col1\" class=\"data row1 col1\" >1918</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_82040_row2_col0\" class=\"data row2 col0\" >IDC</td>\n",
       "      <td id=\"T_82040_row2_col1\" class=\"data row2 col1\" >1174</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<style type=\"text/css\">\n",
       "#T_e22b2_ th {\n",
       "  background-color: #000066;\n",
       "  color: white;\n",
       "  text-align: left;\n",
       "}\n",
       "#T_e22b2_ td {\n",
       "  text-align: left;\n",
       "  border-bottom: 1px solid black;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_e22b2_\" style='display:inline'>\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"col_heading level0 col0\" >primary_diagnosis_condition</th>\n",
       "      <th class=\"col_heading level0 col1\" >count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row0_col0\" class=\"data row0 col0\" >Uterine Corpus Endometrial Carcinoma</td>\n",
       "      <td id=\"T_e22b2_row0_col1\" class=\"data row0 col1\" >104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row1_col0\" class=\"data row1 col0\" >Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row1_col1\" class=\"data row1 col1\" >487</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row2_col0\" class=\"data row2 col0\" >Squamous Cell Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row2_col1\" class=\"data row2 col1\" >609</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row3_col0\" class=\"data row3 col0\" >Complex Mixed and Stromal Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row3_col1\" class=\"data row3 col1\" >320</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row4_col0\" class=\"data row4 col0\" >None</td>\n",
       "      <td id=\"T_e22b2_row4_col1\" class=\"data row4 col1\" >1175</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row5_col0\" class=\"data row5 col0\" >Myomatous Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row5_col1\" class=\"data row5 col1\" >187</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row6_col0\" class=\"data row6 col0\" >Not Reported</td>\n",
       "      <td id=\"T_e22b2_row6_col1\" class=\"data row6 col1\" >12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row7_col0\" class=\"data row7 col0\" >Epithelial Neoplasms, NOS</td>\n",
       "      <td id=\"T_e22b2_row7_col1\" class=\"data row7 col1\" >230</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row8_col0\" class=\"data row8 col0\" >Complex Epithelial Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row8_col1\" class=\"data row8 col1\" >27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row9_col0\" class=\"data row9 col0\" >Soft Tissue Tumors and Sarcomas, NOS</td>\n",
       "      <td id=\"T_e22b2_row9_col1\" class=\"data row9 col1\" >14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row10_col0\" class=\"data row10 col0\" >Neoplasms, NOS</td>\n",
       "      <td id=\"T_e22b2_row10_col1\" class=\"data row10 col1\" >12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row11_col0\" class=\"data row11 col0\" >Trophoblastic neoplasms</td>\n",
       "      <td id=\"T_e22b2_row11_col1\" class=\"data row11 col1\" >13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row12_col0\" class=\"data row12 col0\" >Mesonephromas</td>\n",
       "      <td id=\"T_e22b2_row12_col1\" class=\"data row12 col1\" >5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_e22b2_row13_col0\" class=\"data row13 col0\" >Neuroepitheliomatous Neoplasms</td>\n",
       "      <td id=\"T_e22b2_row13_col1\" class=\"data row13 col1\" >1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<style type=\"text/css\">\n",
       "#T_726c5_ th {\n",
       "  background-color: #000066;\n",
       "  color: white;\n",
       "  text-align: left;\n",
       "}\n",
       "#T_726c5_ td {\n",
       "  text-align: left;\n",
       "  border-bottom: 1px solid black;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_726c5_\" style='display:inline'>\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"col_heading level0 col0\" >primary_diagnosis_site</th>\n",
       "      <th class=\"col_heading level0 col1\" >count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td id=\"T_726c5_row0_col0\" class=\"data row0 col0\" >Uterus, NOS</td>\n",
       "      <td id=\"T_726c5_row0_col1\" class=\"data row0 col1\" >961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_726c5_row1_col0\" class=\"data row1 col0\" >Corpus uteri</td>\n",
       "      <td id=\"T_726c5_row1_col1\" class=\"data row1 col1\" >373</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_726c5_row2_col0\" class=\"data row2 col0\" >Cervix uteri</td>\n",
       "      <td id=\"T_726c5_row2_col1\" class=\"data row2 col1\" >688</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_726c5_row3_col0\" class=\"data row3 col0\" >Uterus</td>\n",
       "      <td id=\"T_726c5_row3_col1\" class=\"data row3 col1\" >867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_726c5_row4_col0\" class=\"data row4 col0\" >Cervix</td>\n",
       "      <td id=\"T_726c5_row4_col1\" class=\"data row4 col1\" >307</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Noadeno = Q('ResearchSubject.primary_diagnosis_condition != \"Adenomas and Adenocarcinomas\"')\n",
    "\n",
    "NoAdenoData = ALLDATA.AND(Noadeno)\n",
    "\n",
    "NoAdenoData.researchsubject.count.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40a0191d",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "d186b837",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3379 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifier</th>\n",
       "      <th>member_of_research_project</th>\n",
       "      <th>primary_diagnosis_condition</th>\n",
       "      <th>primary_diagnosis_site</th>\n",
       "      <th>subject_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>146bd9db-1645-4950-bd18-de30d0db2487</td>\n",
       "      <td>[{'system': 'GDC', 'value': '146bd9db-1645-495...</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>HTMCP-03-06-02138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>32e83039-7663-422b-a541-6d9149851560</td>\n",
       "      <td>[{'system': 'GDC', 'value': '32e83039-7663-422...</td>\n",
       "      <td>GENIE-GRCC</td>\n",
       "      <td>Complex Mixed and Stromal Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>GENIE-GRCC-4f168dad</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37063f74-ccc7-426e-ac1c-ad733f2f7e95</td>\n",
       "      <td>[{'system': 'GDC', 'value': '37063f74-ccc7-426...</td>\n",
       "      <td>GENIE-UHN</td>\n",
       "      <td>Epithelial Neoplasms, NOS</td>\n",
       "      <td>Corpus uteri</td>\n",
       "      <td>GENIE-UHN-247706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3878f58e-76ba-4480-a784-88505bd464d0</td>\n",
       "      <td>[{'system': 'GDC', 'value': '3878f58e-76ba-448...</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Corpus uteri</td>\n",
       "      <td>TCGA-FI-A2EX</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3df6abe2-2123-4bfa-a4e4-88df5f940c04</td>\n",
       "      <td>[{'system': 'GDC', 'value': '3df6abe2-2123-4bf...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>TCGA-JX-A3PZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>fa219ae6-def1-4200-972a-3fd17d688d34</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'fa219ae6-def1-420...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>AD7747</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>fb6f2e38-9281-4085-923c-ef99955fd5ea</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'fb6f2e38-9281-408...</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>HTMCP-03-06-02062</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>13d72130-604c-4d79-95cc-53c2e25d91b0</td>\n",
       "      <td>[{'system': 'GDC', 'value': '13d72130-604c-4d7...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>TCGA-ZJ-AAX4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>15d1d0ad-4196-49d1-8eb3-38c75b7db58c</td>\n",
       "      <td>[{'system': 'GDC', 'value': '15d1d0ad-4196-49d...</td>\n",
       "      <td>GENIE-MSK</td>\n",
       "      <td>Myomatous Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>GENIE-MSK-P-0005582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>1d6f367d-a00d-4bd0-9a8b-0a25e37fc1cd</td>\n",
       "      <td>[{'system': 'GDC', 'value': '1d6f367d-a00d-4bd...</td>\n",
       "      <td>GENIE-DFCI</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>GENIE-DFCI-001660</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      id  \\\n",
       "0   146bd9db-1645-4950-bd18-de30d0db2487   \n",
       "1   32e83039-7663-422b-a541-6d9149851560   \n",
       "2   37063f74-ccc7-426e-ac1c-ad733f2f7e95   \n",
       "3   3878f58e-76ba-4480-a784-88505bd464d0   \n",
       "4   3df6abe2-2123-4bfa-a4e4-88df5f940c04   \n",
       "..                                   ...   \n",
       "95  fa219ae6-def1-4200-972a-3fd17d688d34   \n",
       "96  fb6f2e38-9281-4085-923c-ef99955fd5ea   \n",
       "97  13d72130-604c-4d79-95cc-53c2e25d91b0   \n",
       "98  15d1d0ad-4196-49d1-8eb3-38c75b7db58c   \n",
       "99  1d6f367d-a00d-4bd0-9a8b-0a25e37fc1cd   \n",
       "\n",
       "                                           identifier  \\\n",
       "0   [{'system': 'GDC', 'value': '146bd9db-1645-495...   \n",
       "1   [{'system': 'GDC', 'value': '32e83039-7663-422...   \n",
       "2   [{'system': 'GDC', 'value': '37063f74-ccc7-426...   \n",
       "3   [{'system': 'GDC', 'value': '3878f58e-76ba-448...   \n",
       "4   [{'system': 'GDC', 'value': '3df6abe2-2123-4bf...   \n",
       "..                                                ...   \n",
       "95  [{'system': 'GDC', 'value': 'fa219ae6-def1-420...   \n",
       "96  [{'system': 'GDC', 'value': 'fb6f2e38-9281-408...   \n",
       "97  [{'system': 'GDC', 'value': '13d72130-604c-4d7...   \n",
       "98  [{'system': 'GDC', 'value': '15d1d0ad-4196-49d...   \n",
       "99  [{'system': 'GDC', 'value': '1d6f367d-a00d-4bd...   \n",
       "\n",
       "   member_of_research_project            primary_diagnosis_condition  \\\n",
       "0               CGCI-HTMCP-CC                Squamous Cell Neoplasms   \n",
       "1                  GENIE-GRCC    Complex Mixed and Stromal Neoplasms   \n",
       "2                   GENIE-UHN              Epithelial Neoplasms, NOS   \n",
       "3                   TCGA-UCEC  Cystic, Mucinous and Serous Neoplasms   \n",
       "4                   TCGA-CESC                Squamous Cell Neoplasms   \n",
       "..                        ...                                    ...   \n",
       "95                      FM-AD                Squamous Cell Neoplasms   \n",
       "96              CGCI-HTMCP-CC                Squamous Cell Neoplasms   \n",
       "97                  TCGA-CESC                Squamous Cell Neoplasms   \n",
       "98                  GENIE-MSK                    Myomatous Neoplasms   \n",
       "99                 GENIE-DFCI  Cystic, Mucinous and Serous Neoplasms   \n",
       "\n",
       "   primary_diagnosis_site           subject_id  \n",
       "0            Cervix uteri    HTMCP-03-06-02138  \n",
       "1             Uterus, NOS  GENIE-GRCC-4f168dad  \n",
       "2            Corpus uteri     GENIE-UHN-247706  \n",
       "3            Corpus uteri         TCGA-FI-A2EX  \n",
       "4            Cervix uteri         TCGA-JX-A3PZ  \n",
       "..                    ...                  ...  \n",
       "95           Cervix uteri               AD7747  \n",
       "96           Cervix uteri    HTMCP-03-06-02062  \n",
       "97           Cervix uteri         TCGA-ZJ-AAX4  \n",
       "98            Uterus, NOS  GENIE-MSK-P-0005582  \n",
       "99            Uterus, NOS    GENIE-DFCI-001660  \n",
       "\n",
       "[100 rows x 6 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NoAdenoData.researchsubject.run().to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "086697b3",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "\n",
    "<h3>ResearchSubject Field Definitions</h3>\n",
    "\n",
    "<i>A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs</i>\n",
    "    \n",
    "<ul>\n",
    "  <li><b>id:</b> The unique identifier for this researchsubject</li>\n",
    "  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the researchsubject had there</li>\n",
    "  <li><b>member_of_research_project:</b> The name of the study/project that the subject particpated in</li>\n",
    "  <li><b>primary_diagnosis_condition:</b> The cancer, disease or other condition under study</li>\n",
    "  <li><b>primary_diagnosis_site:</b> The primary_disease_site that qualifies the researchsubject for the research_project</li>\n",
    "  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>\n",
    "</ul>  \n",
    "\n",
    "</div>\n",
    "    \n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8d0f5e2f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3404 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifier</th>\n",
       "      <th>species</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>days_to_birth</th>\n",
       "      <th>subject_associated_project</th>\n",
       "      <th>vital_status</th>\n",
       "      <th>age_at_death</th>\n",
       "      <th>cause_of_death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AD2728</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD2728'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C3N-01876</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'C3N-01876'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[cptac_ucec]</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GENIE-DFCI-007276</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-18627.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>GENIE-DFCI-009140</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-24837.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>GENIE-DFCI-009144</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-19723.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>AD14317</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD14317'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>AD3008</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD3008'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>AD6414</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD6414'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>AD7975</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD7975'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>C3L-00157</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'C3L-00157'}, {'sy...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>hispanic or latino</td>\n",
       "      <td>-22118.0</td>\n",
       "      <td>[CPTAC3-Discovery, CPTAC-3, cptac_ucec]</td>\n",
       "      <td>Dead</td>\n",
       "      <td>1396.0</td>\n",
       "      <td>Cancer Related</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                   id                                         identifier  \\\n",
       "0              AD2728             [{'system': 'GDC', 'value': 'AD2728'}]   \n",
       "1           C3N-01876          [{'system': 'IDC', 'value': 'C3N-01876'}]   \n",
       "2   GENIE-DFCI-007276  [{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]   \n",
       "3   GENIE-DFCI-009140  [{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]   \n",
       "4   GENIE-DFCI-009144  [{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]   \n",
       "..                ...                                                ...   \n",
       "95            AD14317            [{'system': 'GDC', 'value': 'AD14317'}]   \n",
       "96             AD3008             [{'system': 'GDC', 'value': 'AD3008'}]   \n",
       "97             AD6414             [{'system': 'GDC', 'value': 'AD6414'}]   \n",
       "98             AD7975             [{'system': 'GDC', 'value': 'AD7975'}]   \n",
       "99          C3L-00157  [{'system': 'GDC', 'value': 'C3L-00157'}, {'sy...   \n",
       "\n",
       "         species     sex          race               ethnicity  days_to_birth  \\\n",
       "0   Homo sapiens  female  not reported            not reported            NaN   \n",
       "1   Homo sapiens    None          None                    None            NaN   \n",
       "2   Homo sapiens  female         white  not hispanic or latino       -18627.0   \n",
       "3   Homo sapiens  female         white  not hispanic or latino       -24837.0   \n",
       "4   Homo sapiens  female         white  not hispanic or latino       -19723.0   \n",
       "..           ...     ...           ...                     ...            ...   \n",
       "95  Homo sapiens  female  not reported            not reported            NaN   \n",
       "96  Homo sapiens  female  not reported            not reported            NaN   \n",
       "97  Homo sapiens  female  not reported            not reported            NaN   \n",
       "98  Homo sapiens  female  not reported            not reported            NaN   \n",
       "99  Homo sapiens  female         white      hispanic or latino       -22118.0   \n",
       "\n",
       "                 subject_associated_project  vital_status  age_at_death  \\\n",
       "0                                   [FM-AD]  Not Reported           NaN   \n",
       "1                              [cptac_ucec]          None           NaN   \n",
       "2                              [GENIE-DFCI]  Not Reported           NaN   \n",
       "3                              [GENIE-DFCI]  Not Reported           NaN   \n",
       "4                              [GENIE-DFCI]  Not Reported           NaN   \n",
       "..                                      ...           ...           ...   \n",
       "95                                  [FM-AD]  Not Reported           NaN   \n",
       "96                                  [FM-AD]  Not Reported           NaN   \n",
       "97                                  [FM-AD]  Not Reported           NaN   \n",
       "98                                  [FM-AD]  Not Reported           NaN   \n",
       "99  [CPTAC3-Discovery, CPTAC-3, cptac_ucec]          Dead        1396.0   \n",
       "\n",
       "    cause_of_death  \n",
       "0             None  \n",
       "1             None  \n",
       "2             None  \n",
       "3             None  \n",
       "4             None  \n",
       "..             ...  \n",
       "95            None  \n",
       "96            None  \n",
       "97            None  \n",
       "98            None  \n",
       "99  Cancer Related  \n",
       "\n",
       "[100 rows x 11 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NoAdenoData.subject.run().to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dec76132",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "\n",
    "<h3>Subject Field Definitions</h3>\n",
    "\n",
    "<i>A subject is a specific, unique individual: for e.g. a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets</i>\n",
    "\n",
    "    \n",
    "<ul>\n",
    "  <li><b>id:</b> The unique identifier for this subject</li>\n",
    "  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the subject had there</li>\n",
    "  <li><b>species:</b> The species of the subject</li>\n",
    "  <li><b>sex:</b> A reference to the biological sex of the donor organism. </li>\n",
    "  <li><b>race:</b> The race of the subject</li>\n",
    "  <li><b>ethnicity:</b> The ethnicity of the subject</li>\n",
    "  <li><b>days_to_birth:</b> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days</li>\n",
    "  <li><b>subject_associated_project:</b> An embedded array of the names of projects (studies) the subject was part of</li>\n",
    "  <li><b>vital_status:</b> Whether the subject is alive</li>\n",
    "  <li><b>age_at_death:</b> The number of days after first enrollment that the subject died</li>\n",
    "  <li><b>cause_of_death:</b> The cause of death, if known</li>\n",
    "</ul>  \n",
    "\n",
    "</div>\n",
    "    \n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "04e04136",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3746 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifier</th>\n",
       "      <th>label</th>\n",
       "      <th>data_category</th>\n",
       "      <th>data_type</th>\n",
       "      <th>file_format</th>\n",
       "      <th>associated_project</th>\n",
       "      <th>drs_uri</th>\n",
       "      <th>byte_size</th>\n",
       "      <th>checksum</th>\n",
       "      <th>data_modality</th>\n",
       "      <th>imaging_modality</th>\n",
       "      <th>dbgap_accession_number</th>\n",
       "      <th>researchsubject_specimen_id</th>\n",
       "      <th>researchsubject_id</th>\n",
       "      <th>subject_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>d3151fb9-9dd5-470e-b181-4d920f686068</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'd3151fb9-9dd5-470...</td>\n",
       "      <td>TCGA-B5-A11E-01A-21-A163-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>drs://dg.4DFC:d3151fb9-9dd5-470e-b181-4d920f68...</td>\n",
       "      <td>22341</td>\n",
       "      <td>f44fc349969dda464ddf37f5e1f149f1</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-B5-A11E</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2200d48f-d10d-4e0c-aff6-a71958fc2b1b</td>\n",
       "      <td>[{'system': 'GDC', 'value': '2200d48f-d10d-4e0...</td>\n",
       "      <td>TCGA-A5-A0G9-01A-21-A162-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>drs://dg.4DFC:2200d48f-d10d-4e0c-aff6-a71958fc...</td>\n",
       "      <td>24285</td>\n",
       "      <td>8edb8c63f398d0d6dab0655d62b1cd93</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-A5-A0G9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>e6ee1e9e-9c28-4db8-9f7f-3916f5351717</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'e6ee1e9e-9c28-4db...</td>\n",
       "      <td>TCGA-N7-A4Y5-01A-21-A41P-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCS</td>\n",
       "      <td>drs://dg.4DFC:e6ee1e9e-9c28-4db8-9f7f-3916f535...</td>\n",
       "      <td>22026</td>\n",
       "      <td>73159e8898216b617ac3e135af51d87e</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-N7-A4Y5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>81674772-fd6d-48b6-93b1-fa585d1ed568</td>\n",
       "      <td>[{'system': 'GDC', 'value': '81674772-fd6d-48b...</td>\n",
       "      <td>49b02eb4-8e31-42cd-a3e7-065611836434.wgs.BRASS...</td>\n",
       "      <td>Somatic Structural Variation</td>\n",
       "      <td>Structural Rearrangement</td>\n",
       "      <td>BEDPE</td>\n",
       "      <td>CPTAC-3</td>\n",
       "      <td>drs://dg.4DFC:81674772-fd6d-48b6-93b1-fa585d1e...</td>\n",
       "      <td>9977</td>\n",
       "      <td>64560c17caa67fa25411218ef57101a6</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>phs001287</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>C3L-01307</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>c3392a1e-1241-4068-9bca-31fd836148de</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'c3392a1e-1241-406...</td>\n",
       "      <td>TCGA-BG-A0MA-01A-21-A18Q-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>drs://dg.4DFC:c3392a1e-1241-4068-9bca-31fd8361...</td>\n",
       "      <td>22324</td>\n",
       "      <td>3da3113805454ac4fca6482fbaf4b4b1</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-BG-A0MA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>b42cdaba-46c8-4a02-b7cf-86ceb5d1f712</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'b42cdaba-46c8-4a0...</td>\n",
       "      <td>TCGA-HG-A2PA-01A-21-A40H-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>drs://dg.4DFC:b42cdaba-46c8-4a02-b7cf-86ceb5d1...</td>\n",
       "      <td>22070</td>\n",
       "      <td>398e6cca19ff30d932a1a78669254710</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-HG-A2PA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>1731c20f-1f10-4f80-8793-99593f81515f</td>\n",
       "      <td>[{'system': 'GDC', 'value': '1731c20f-1f10-4f8...</td>\n",
       "      <td>TCGA-B5-A11P-01B-21-A18Q-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>drs://dg.4DFC:1731c20f-1f10-4f80-8793-99593f81...</td>\n",
       "      <td>22344</td>\n",
       "      <td>8bcbded9fbd5a48a58f77ae1e3ea829f</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-B5-A11P</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>36f66d66-f71f-49be-9e51-ac640b826d3f</td>\n",
       "      <td>[{'system': 'GDC', 'value': '36f66d66-f71f-49b...</td>\n",
       "      <td>TCGA-EY-A1GH-01A-21-A18Q-20_RPPA_data.tsv</td>\n",
       "      <td>Proteome Profiling</td>\n",
       "      <td>Protein Expression Quantification</td>\n",
       "      <td>TSV</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>drs://dg.4DFC:36f66d66-f71f-49be-9e51-ac640b82...</td>\n",
       "      <td>22338</td>\n",
       "      <td>52a07dbf0ebfcafeb41958d4a1e2b489</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>TCGA-EY-A1GH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98</th>\n",
       "      <td>59a4c826-87d1-43ab-9b2e-3c6088275fd7</td>\n",
       "      <td>[{'system': 'GDC', 'value': '59a4c826-87d1-43a...</td>\n",
       "      <td>de3cbd77-822b-4c86-80e8-9be54ca8b324.wgs.BRASS...</td>\n",
       "      <td>Somatic Structural Variation</td>\n",
       "      <td>Structural Rearrangement</td>\n",
       "      <td>BEDPE</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>drs://dg.4DFC:59a4c826-87d1-43ab-9b2e-3c608827...</td>\n",
       "      <td>121947</td>\n",
       "      <td>c9bcda0d917caf81773efd8e2f827ebb</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>phs000528</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>HTMCP-03-06-02040</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>a7c7ba3e-7d9d-4735-afff-22442a0e9a84</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'a7c7ba3e-7d9d-473...</td>\n",
       "      <td>44c6a116-146f-43fd-aa1e-7e8b1636a722.wgs.BRASS...</td>\n",
       "      <td>Somatic Structural Variation</td>\n",
       "      <td>Structural Rearrangement</td>\n",
       "      <td>VCF</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>drs://dg.4DFC:a7c7ba3e-7d9d-4735-afff-22442a0e...</td>\n",
       "      <td>76341</td>\n",
       "      <td>5e513d0fae6e32b5425c647c9d8d3ba3</td>\n",
       "      <td>Genomic</td>\n",
       "      <td>None</td>\n",
       "      <td>phs000528</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>HTMCP-03-06-02144</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>100 rows × 16 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      id  \\\n",
       "0   d3151fb9-9dd5-470e-b181-4d920f686068   \n",
       "1   2200d48f-d10d-4e0c-aff6-a71958fc2b1b   \n",
       "2   e6ee1e9e-9c28-4db8-9f7f-3916f5351717   \n",
       "3   81674772-fd6d-48b6-93b1-fa585d1ed568   \n",
       "4   c3392a1e-1241-4068-9bca-31fd836148de   \n",
       "..                                   ...   \n",
       "95  b42cdaba-46c8-4a02-b7cf-86ceb5d1f712   \n",
       "96  1731c20f-1f10-4f80-8793-99593f81515f   \n",
       "97  36f66d66-f71f-49be-9e51-ac640b826d3f   \n",
       "98  59a4c826-87d1-43ab-9b2e-3c6088275fd7   \n",
       "99  a7c7ba3e-7d9d-4735-afff-22442a0e9a84   \n",
       "\n",
       "                                           identifier  \\\n",
       "0   [{'system': 'GDC', 'value': 'd3151fb9-9dd5-470...   \n",
       "1   [{'system': 'GDC', 'value': '2200d48f-d10d-4e0...   \n",
       "2   [{'system': 'GDC', 'value': 'e6ee1e9e-9c28-4db...   \n",
       "3   [{'system': 'GDC', 'value': '81674772-fd6d-48b...   \n",
       "4   [{'system': 'GDC', 'value': 'c3392a1e-1241-406...   \n",
       "..                                                ...   \n",
       "95  [{'system': 'GDC', 'value': 'b42cdaba-46c8-4a0...   \n",
       "96  [{'system': 'GDC', 'value': '1731c20f-1f10-4f8...   \n",
       "97  [{'system': 'GDC', 'value': '36f66d66-f71f-49b...   \n",
       "98  [{'system': 'GDC', 'value': '59a4c826-87d1-43a...   \n",
       "99  [{'system': 'GDC', 'value': 'a7c7ba3e-7d9d-473...   \n",
       "\n",
       "                                                label  \\\n",
       "0           TCGA-B5-A11E-01A-21-A163-20_RPPA_data.tsv   \n",
       "1           TCGA-A5-A0G9-01A-21-A162-20_RPPA_data.tsv   \n",
       "2           TCGA-N7-A4Y5-01A-21-A41P-20_RPPA_data.tsv   \n",
       "3   49b02eb4-8e31-42cd-a3e7-065611836434.wgs.BRASS...   \n",
       "4           TCGA-BG-A0MA-01A-21-A18Q-20_RPPA_data.tsv   \n",
       "..                                                ...   \n",
       "95          TCGA-HG-A2PA-01A-21-A40H-20_RPPA_data.tsv   \n",
       "96          TCGA-B5-A11P-01B-21-A18Q-20_RPPA_data.tsv   \n",
       "97          TCGA-EY-A1GH-01A-21-A18Q-20_RPPA_data.tsv   \n",
       "98  de3cbd77-822b-4c86-80e8-9be54ca8b324.wgs.BRASS...   \n",
       "99  44c6a116-146f-43fd-aa1e-7e8b1636a722.wgs.BRASS...   \n",
       "\n",
       "                   data_category                          data_type  \\\n",
       "0             Proteome Profiling  Protein Expression Quantification   \n",
       "1             Proteome Profiling  Protein Expression Quantification   \n",
       "2             Proteome Profiling  Protein Expression Quantification   \n",
       "3   Somatic Structural Variation           Structural Rearrangement   \n",
       "4             Proteome Profiling  Protein Expression Quantification   \n",
       "..                           ...                                ...   \n",
       "95            Proteome Profiling  Protein Expression Quantification   \n",
       "96            Proteome Profiling  Protein Expression Quantification   \n",
       "97            Proteome Profiling  Protein Expression Quantification   \n",
       "98  Somatic Structural Variation           Structural Rearrangement   \n",
       "99  Somatic Structural Variation           Structural Rearrangement   \n",
       "\n",
       "   file_format associated_project  \\\n",
       "0          TSV          TCGA-UCEC   \n",
       "1          TSV          TCGA-UCEC   \n",
       "2          TSV           TCGA-UCS   \n",
       "3        BEDPE            CPTAC-3   \n",
       "4          TSV          TCGA-UCEC   \n",
       "..         ...                ...   \n",
       "95         TSV          TCGA-CESC   \n",
       "96         TSV          TCGA-UCEC   \n",
       "97         TSV          TCGA-UCEC   \n",
       "98       BEDPE      CGCI-HTMCP-CC   \n",
       "99         VCF      CGCI-HTMCP-CC   \n",
       "\n",
       "                                              drs_uri  byte_size  \\\n",
       "0   drs://dg.4DFC:d3151fb9-9dd5-470e-b181-4d920f68...      22341   \n",
       "1   drs://dg.4DFC:2200d48f-d10d-4e0c-aff6-a71958fc...      24285   \n",
       "2   drs://dg.4DFC:e6ee1e9e-9c28-4db8-9f7f-3916f535...      22026   \n",
       "3   drs://dg.4DFC:81674772-fd6d-48b6-93b1-fa585d1e...       9977   \n",
       "4   drs://dg.4DFC:c3392a1e-1241-4068-9bca-31fd8361...      22324   \n",
       "..                                                ...        ...   \n",
       "95  drs://dg.4DFC:b42cdaba-46c8-4a02-b7cf-86ceb5d1...      22070   \n",
       "96  drs://dg.4DFC:1731c20f-1f10-4f80-8793-99593f81...      22344   \n",
       "97  drs://dg.4DFC:36f66d66-f71f-49be-9e51-ac640b82...      22338   \n",
       "98  drs://dg.4DFC:59a4c826-87d1-43ab-9b2e-3c608827...     121947   \n",
       "99  drs://dg.4DFC:a7c7ba3e-7d9d-4735-afff-22442a0e...      76341   \n",
       "\n",
       "                            checksum data_modality imaging_modality  \\\n",
       "0   f44fc349969dda464ddf37f5e1f149f1       Genomic             None   \n",
       "1   8edb8c63f398d0d6dab0655d62b1cd93       Genomic             None   \n",
       "2   73159e8898216b617ac3e135af51d87e       Genomic             None   \n",
       "3   64560c17caa67fa25411218ef57101a6       Genomic             None   \n",
       "4   3da3113805454ac4fca6482fbaf4b4b1       Genomic             None   \n",
       "..                               ...           ...              ...   \n",
       "95  398e6cca19ff30d932a1a78669254710       Genomic             None   \n",
       "96  8bcbded9fbd5a48a58f77ae1e3ea829f       Genomic             None   \n",
       "97  52a07dbf0ebfcafeb41958d4a1e2b489       Genomic             None   \n",
       "98  c9bcda0d917caf81773efd8e2f827ebb       Genomic             None   \n",
       "99  5e513d0fae6e32b5425c647c9d8d3ba3       Genomic             None   \n",
       "\n",
       "   dbgap_accession_number researchsubject_specimen_id researchsubject_id  \\\n",
       "0                    None                                                  \n",
       "1                    None                                                  \n",
       "2                    None                                                  \n",
       "3               phs001287                                                  \n",
       "4                    None                                                  \n",
       "..                    ...                         ...                ...   \n",
       "95                   None                                                  \n",
       "96                   None                                                  \n",
       "97                   None                                                  \n",
       "98              phs000528                                                  \n",
       "99              phs000528                                                  \n",
       "\n",
       "           subject_id  \n",
       "0        TCGA-B5-A11E  \n",
       "1        TCGA-A5-A0G9  \n",
       "2        TCGA-N7-A4Y5  \n",
       "3           C3L-01307  \n",
       "4        TCGA-BG-A0MA  \n",
       "..                ...  \n",
       "95       TCGA-HG-A2PA  \n",
       "96       TCGA-B5-A11P  \n",
       "97       TCGA-EY-A1GH  \n",
       "98  HTMCP-03-06-02040  \n",
       "99  HTMCP-03-06-02144  \n",
       "\n",
       "[100 rows x 16 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "NoAdenoData.file.run().to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cf9f2d3",
   "metadata": {},
   "source": [
    "\n",
    "---\n",
    "\n",
    "<div style=\"background-color:#c1f5ed;color:black;padding:20px;\">\n",
    "\n",
    "<h3>File Field Definitions</h3>\n",
    "\n",
    "<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>\n",
    "\n",
    "    \n",
    "<ul>\n",
    "  <li><b>id:</b> The unique identifier for this file</li>\n",
    "  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the file had there</li>\n",
    "  <li><b>label:</b> The full name of the file</li>\n",
    "  <li><b>data_catagory:</b> A desecription of the kind of general kind data the file holds</li>\n",
    "  <li><b>data_type:</b> A more specific descripton of the data type</li>\n",
    "  <li><b>file_format:</b> String to identify the full file extension including compression extensions</li>\n",
    "  <li><b>associated_project:</b> The name the data center uses for the study this file was generated for</li>\n",
    "  <li><b>drs_uri:</b> A unique identifier that can be used to retreive this specific file from a server</li>\n",
    "  <li><b>byte_size:</b> Size of the file in bytes</li>\n",
    "  <li><b>checksum:</b> The md5 value for the file</li>\n",
    "  <li><b>data_modality:</b> Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of \"Genomic\", \"Proteomic\", or \"Imaging\"</li>\n",
    "  <li><b>imaging_modality:</b> For files with the `data_modality` of \"Imaging\", a descriptor for the image type</li>\n",
    "  <li><b>dbgap_accession_number:</b> The project id number for this data on dbGaP</li>\n",
    "</ul>  \n",
    "\n",
    "</div>\n",
    "    \n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba6aadbe",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Finally, Julia wants to save these results to use for the future. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c2cec2bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3286 ms\n"
     ]
    }
   ],
   "source": [
    "researchsubs = NoAdenoData.researchsubject.run()\n",
    "rsdf = pd.DataFrame()\n",
    "for i in researchsubs.paginator(to_df=True):\n",
    "    rsdf = pd.concat([rsdf, i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "a1258057",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 3374 ms\n"
     ]
    }
   ],
   "source": [
    "subs = NoAdenoData.subject.run()\n",
    "subsdf = pd.DataFrame()\n",
    "for i in subs.paginator(to_df=True):\n",
    "    subsdf = pd.concat([subsdf, i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "04cd73df",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifier</th>\n",
       "      <th>member_of_research_project</th>\n",
       "      <th>primary_diagnosis_condition</th>\n",
       "      <th>primary_diagnosis_site</th>\n",
       "      <th>subject_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>146bd9db-1645-4950-bd18-de30d0db2487</td>\n",
       "      <td>[{'system': 'GDC', 'value': '146bd9db-1645-495...</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>HTMCP-03-06-02138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>32e83039-7663-422b-a541-6d9149851560</td>\n",
       "      <td>[{'system': 'GDC', 'value': '32e83039-7663-422...</td>\n",
       "      <td>GENIE-GRCC</td>\n",
       "      <td>Complex Mixed and Stromal Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>GENIE-GRCC-4f168dad</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37063f74-ccc7-426e-ac1c-ad733f2f7e95</td>\n",
       "      <td>[{'system': 'GDC', 'value': '37063f74-ccc7-426...</td>\n",
       "      <td>GENIE-UHN</td>\n",
       "      <td>Epithelial Neoplasms, NOS</td>\n",
       "      <td>Corpus uteri</td>\n",
       "      <td>GENIE-UHN-247706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3878f58e-76ba-4480-a784-88505bd464d0</td>\n",
       "      <td>[{'system': 'GDC', 'value': '3878f58e-76ba-448...</td>\n",
       "      <td>TCGA-UCEC</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Corpus uteri</td>\n",
       "      <td>TCGA-FI-A2EX</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3df6abe2-2123-4bfa-a4e4-88df5f940c04</td>\n",
       "      <td>[{'system': 'GDC', 'value': '3df6abe2-2123-4bf...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>TCGA-JX-A3PZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91</th>\n",
       "      <td>TCGA-N9-A4Q7__tcga_ucs</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'TCGA-N9-A4Q7'}]</td>\n",
       "      <td>tcga_ucs</td>\n",
       "      <td>None</td>\n",
       "      <td>Uterus</td>\n",
       "      <td>TCGA-N9-A4Q7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92</th>\n",
       "      <td>TCGA-QS-A744__tcga_ucec</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'TCGA-QS-A744'}]</td>\n",
       "      <td>tcga_ucec</td>\n",
       "      <td>None</td>\n",
       "      <td>Uterus</td>\n",
       "      <td>TCGA-QS-A744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>93</th>\n",
       "      <td>c64d5576-df00-4772-a3d1-1f8863000750</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'c64d5576-df00-477...</td>\n",
       "      <td>CGCI-HTMCP-CC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>HTMCP-03-06-02099</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94</th>\n",
       "      <td>cc500ada-7440-412f-b54c-4966c8098dcb</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'cc500ada-7440-412...</td>\n",
       "      <td>GENIE-DFCI</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>GENIE-DFCI-000331</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>d7a75bf5-5189-4978-99d9-fcef91c9fbd2</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'd7a75bf5-5189-497...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>TCGA-EK-A2R7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3196 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      id  \\\n",
       "0   146bd9db-1645-4950-bd18-de30d0db2487   \n",
       "1   32e83039-7663-422b-a541-6d9149851560   \n",
       "2   37063f74-ccc7-426e-ac1c-ad733f2f7e95   \n",
       "3   3878f58e-76ba-4480-a784-88505bd464d0   \n",
       "4   3df6abe2-2123-4bfa-a4e4-88df5f940c04   \n",
       "..                                   ...   \n",
       "91                TCGA-N9-A4Q7__tcga_ucs   \n",
       "92               TCGA-QS-A744__tcga_ucec   \n",
       "93  c64d5576-df00-4772-a3d1-1f8863000750   \n",
       "94  cc500ada-7440-412f-b54c-4966c8098dcb   \n",
       "95  d7a75bf5-5189-4978-99d9-fcef91c9fbd2   \n",
       "\n",
       "                                           identifier  \\\n",
       "0   [{'system': 'GDC', 'value': '146bd9db-1645-495...   \n",
       "1   [{'system': 'GDC', 'value': '32e83039-7663-422...   \n",
       "2   [{'system': 'GDC', 'value': '37063f74-ccc7-426...   \n",
       "3   [{'system': 'GDC', 'value': '3878f58e-76ba-448...   \n",
       "4   [{'system': 'GDC', 'value': '3df6abe2-2123-4bf...   \n",
       "..                                                ...   \n",
       "91       [{'system': 'IDC', 'value': 'TCGA-N9-A4Q7'}]   \n",
       "92       [{'system': 'IDC', 'value': 'TCGA-QS-A744'}]   \n",
       "93  [{'system': 'GDC', 'value': 'c64d5576-df00-477...   \n",
       "94  [{'system': 'GDC', 'value': 'cc500ada-7440-412...   \n",
       "95  [{'system': 'GDC', 'value': 'd7a75bf5-5189-497...   \n",
       "\n",
       "   member_of_research_project            primary_diagnosis_condition  \\\n",
       "0               CGCI-HTMCP-CC                Squamous Cell Neoplasms   \n",
       "1                  GENIE-GRCC    Complex Mixed and Stromal Neoplasms   \n",
       "2                   GENIE-UHN              Epithelial Neoplasms, NOS   \n",
       "3                   TCGA-UCEC  Cystic, Mucinous and Serous Neoplasms   \n",
       "4                   TCGA-CESC                Squamous Cell Neoplasms   \n",
       "..                        ...                                    ...   \n",
       "91                   tcga_ucs                                   None   \n",
       "92                  tcga_ucec                                   None   \n",
       "93              CGCI-HTMCP-CC                Squamous Cell Neoplasms   \n",
       "94                 GENIE-DFCI  Cystic, Mucinous and Serous Neoplasms   \n",
       "95                  TCGA-CESC                Squamous Cell Neoplasms   \n",
       "\n",
       "   primary_diagnosis_site           subject_id  \n",
       "0            Cervix uteri    HTMCP-03-06-02138  \n",
       "1             Uterus, NOS  GENIE-GRCC-4f168dad  \n",
       "2            Corpus uteri     GENIE-UHN-247706  \n",
       "3            Corpus uteri         TCGA-FI-A2EX  \n",
       "4            Cervix uteri         TCGA-JX-A3PZ  \n",
       "..                    ...                  ...  \n",
       "91                 Uterus         TCGA-N9-A4Q7  \n",
       "92                 Uterus         TCGA-QS-A744  \n",
       "93           Cervix uteri    HTMCP-03-06-02099  \n",
       "94            Uterus, NOS    GENIE-DFCI-000331  \n",
       "95           Cervix uteri         TCGA-EK-A2R7  \n",
       "\n",
       "[3196 rows x 6 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rsdf # view the researchsubject dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "92a6f811",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifier</th>\n",
       "      <th>species</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>days_to_birth</th>\n",
       "      <th>subject_associated_project</th>\n",
       "      <th>vital_status</th>\n",
       "      <th>age_at_death</th>\n",
       "      <th>cause_of_death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AD2728</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD2728'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C3N-01876</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'C3N-01876'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[cptac_ucec]</td>\n",
       "      <td>None</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GENIE-DFCI-007276</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-18627.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>GENIE-DFCI-009140</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-24837.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>GENIE-DFCI-009144</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-19723.0</td>\n",
       "      <td>[GENIE-DFCI]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>TCGA-EY-A72D</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-EY-A72D'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>black or african american</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-31818.0</td>\n",
       "      <td>[TCGA-UCEC, tcga_ucec]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>TCGA-IE-A4EH</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-IE-A4EH'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-12871.0</td>\n",
       "      <td>[tcga_sarc, TCGA-SARC]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>TCGA-IS-A3KA</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-IS-A3KA'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-26775.0</td>\n",
       "      <td>[tcga_sarc, TCGA-SARC]</td>\n",
       "      <td>Dead</td>\n",
       "      <td>413.0</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>TCGA-NA-A4QY</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-NA-A4QY'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-22756.0</td>\n",
       "      <td>[tcga_ucs, TCGA-UCS]</td>\n",
       "      <td>Dead</td>\n",
       "      <td>114.0</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>TCGA-VS-A9V3</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-VS-A9V3'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not reported</td>\n",
       "      <td>-22990.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2608 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                   id                                         identifier  \\\n",
       "0              AD2728             [{'system': 'GDC', 'value': 'AD2728'}]   \n",
       "1           C3N-01876          [{'system': 'IDC', 'value': 'C3N-01876'}]   \n",
       "2   GENIE-DFCI-007276  [{'system': 'GDC', 'value': 'GENIE-DFCI-007276'}]   \n",
       "3   GENIE-DFCI-009140  [{'system': 'GDC', 'value': 'GENIE-DFCI-009140'}]   \n",
       "4   GENIE-DFCI-009144  [{'system': 'GDC', 'value': 'GENIE-DFCI-009144'}]   \n",
       "..                ...                                                ...   \n",
       "3        TCGA-EY-A72D  [{'system': 'GDC', 'value': 'TCGA-EY-A72D'}, {...   \n",
       "4        TCGA-IE-A4EH  [{'system': 'GDC', 'value': 'TCGA-IE-A4EH'}, {...   \n",
       "5        TCGA-IS-A3KA  [{'system': 'GDC', 'value': 'TCGA-IS-A3KA'}, {...   \n",
       "6        TCGA-NA-A4QY  [{'system': 'GDC', 'value': 'TCGA-NA-A4QY'}, {...   \n",
       "7        TCGA-VS-A9V3  [{'system': 'GDC', 'value': 'TCGA-VS-A9V3'}, {...   \n",
       "\n",
       "         species     sex                       race               ethnicity  \\\n",
       "0   Homo sapiens  female               not reported            not reported   \n",
       "1   Homo sapiens    None                       None                    None   \n",
       "2   Homo sapiens  female                      white  not hispanic or latino   \n",
       "3   Homo sapiens  female                      white  not hispanic or latino   \n",
       "4   Homo sapiens  female                      white  not hispanic or latino   \n",
       "..           ...     ...                        ...                     ...   \n",
       "3   Homo sapiens  female  black or african american  not hispanic or latino   \n",
       "4   Homo sapiens  female                      white  not hispanic or latino   \n",
       "5   Homo sapiens  female                      white  not hispanic or latino   \n",
       "6   Homo sapiens  female                      white  not hispanic or latino   \n",
       "7   Homo sapiens  female                      white            not reported   \n",
       "\n",
       "    days_to_birth subject_associated_project  vital_status  age_at_death  \\\n",
       "0             NaN                    [FM-AD]  Not Reported           NaN   \n",
       "1             NaN               [cptac_ucec]          None           NaN   \n",
       "2        -18627.0               [GENIE-DFCI]  Not Reported           NaN   \n",
       "3        -24837.0               [GENIE-DFCI]  Not Reported           NaN   \n",
       "4        -19723.0               [GENIE-DFCI]  Not Reported           NaN   \n",
       "..            ...                        ...           ...           ...   \n",
       "3        -31818.0     [TCGA-UCEC, tcga_ucec]         Alive           NaN   \n",
       "4        -12871.0     [tcga_sarc, TCGA-SARC]         Alive           NaN   \n",
       "5        -26775.0     [tcga_sarc, TCGA-SARC]          Dead         413.0   \n",
       "6        -22756.0       [tcga_ucs, TCGA-UCS]          Dead         114.0   \n",
       "7        -22990.0     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "\n",
       "   cause_of_death  \n",
       "0            None  \n",
       "1            None  \n",
       "2            None  \n",
       "3            None  \n",
       "4            None  \n",
       "..            ...  \n",
       "3            None  \n",
       "4            None  \n",
       "5            None  \n",
       "6            None  \n",
       "7            None  \n",
       "\n",
       "[2608 rows x 11 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subsdf # view the subject dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75bcbe86",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Then Julia uses the `id` fields in each result to join them together into one big dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "9b7a3383",
   "metadata": {},
   "outputs": [],
   "source": [
    "allmetadata = rsdf.set_index(\"subject_id\").join(subsdf.set_index(\"id\"), lsuffix='resub', rsuffix=\"subject\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a01f8c5c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>identifierresub</th>\n",
       "      <th>member_of_research_project</th>\n",
       "      <th>primary_diagnosis_condition</th>\n",
       "      <th>primary_diagnosis_site</th>\n",
       "      <th>identifiersubject</th>\n",
       "      <th>species</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>ethnicity</th>\n",
       "      <th>days_to_birth</th>\n",
       "      <th>subject_associated_project</th>\n",
       "      <th>vital_status</th>\n",
       "      <th>age_at_death</th>\n",
       "      <th>cause_of_death</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AD100</th>\n",
       "      <td>0f08e2e9-9983-4204-972f-a630b7ab2c25</td>\n",
       "      <td>[{'system': 'GDC', 'value': '0f08e2e9-9983-420...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD100'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AD1026</th>\n",
       "      <td>6d9d6cb9-652f-4749-b4c5-aa9e6b80de69</td>\n",
       "      <td>[{'system': 'GDC', 'value': '6d9d6cb9-652f-474...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Complex Mixed and Stromal Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD1026'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AD10328</th>\n",
       "      <td>514fc104-1ee5-4701-8f45-9a011143f1e2</td>\n",
       "      <td>[{'system': 'GDC', 'value': '514fc104-1ee5-470...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD10328'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AD10460</th>\n",
       "      <td>8c36611d-be2f-432a-afde-e684ab4333ea</td>\n",
       "      <td>[{'system': 'GDC', 'value': '8c36611d-be2f-432...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD10460'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AD10485</th>\n",
       "      <td>0ad0fdda-dd96-48df-8edd-e5e471e9f680</td>\n",
       "      <td>[{'system': 'GDC', 'value': '0ad0fdda-dd96-48d...</td>\n",
       "      <td>FM-AD</td>\n",
       "      <td>Cystic, Mucinous and Serous Neoplasms</td>\n",
       "      <td>Uterus, NOS</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'AD10485'}]</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[FM-AD]</td>\n",
       "      <td>Not Reported</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-ZJ-AB0H</th>\n",
       "      <td>TCGA-ZJ-AB0H__tcga_cesc</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'TCGA-ZJ-AB0H'}]</td>\n",
       "      <td>tcga_cesc</td>\n",
       "      <td>None</td>\n",
       "      <td>Cervix</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0H'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>not reported</td>\n",
       "      <td>not reported</td>\n",
       "      <td>-17869.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-ZJ-AB0I</th>\n",
       "      <td>a4f13656-a941-498a-9ac9-f020ed559b35</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'a4f13656-a941-498...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-9486.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-ZJ-AB0I</th>\n",
       "      <td>TCGA-ZJ-AB0I__tcga_cesc</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'TCGA-ZJ-AB0I'}]</td>\n",
       "      <td>tcga_cesc</td>\n",
       "      <td>None</td>\n",
       "      <td>Cervix</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-9486.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-ZX-AA5X</th>\n",
       "      <td>4756acc0-4e96-44d4-b359-04d64dc7eb84</td>\n",
       "      <td>[{'system': 'GDC', 'value': '4756acc0-4e96-44d...</td>\n",
       "      <td>TCGA-CESC</td>\n",
       "      <td>Squamous Cell Neoplasms</td>\n",
       "      <td>Cervix uteri</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-23440.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TCGA-ZX-AA5X</th>\n",
       "      <td>TCGA-ZX-AA5X__tcga_cesc</td>\n",
       "      <td>[{'system': 'IDC', 'value': 'TCGA-ZX-AA5X'}]</td>\n",
       "      <td>tcga_cesc</td>\n",
       "      <td>None</td>\n",
       "      <td>Cervix</td>\n",
       "      <td>[{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>female</td>\n",
       "      <td>white</td>\n",
       "      <td>not hispanic or latino</td>\n",
       "      <td>-23440.0</td>\n",
       "      <td>[TCGA-CESC, tcga_cesc]</td>\n",
       "      <td>Alive</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3196 rows × 15 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                id  \\\n",
       "AD100         0f08e2e9-9983-4204-972f-a630b7ab2c25   \n",
       "AD1026        6d9d6cb9-652f-4749-b4c5-aa9e6b80de69   \n",
       "AD10328       514fc104-1ee5-4701-8f45-9a011143f1e2   \n",
       "AD10460       8c36611d-be2f-432a-afde-e684ab4333ea   \n",
       "AD10485       0ad0fdda-dd96-48df-8edd-e5e471e9f680   \n",
       "...                                            ...   \n",
       "TCGA-ZJ-AB0H               TCGA-ZJ-AB0H__tcga_cesc   \n",
       "TCGA-ZJ-AB0I  a4f13656-a941-498a-9ac9-f020ed559b35   \n",
       "TCGA-ZJ-AB0I               TCGA-ZJ-AB0I__tcga_cesc   \n",
       "TCGA-ZX-AA5X  4756acc0-4e96-44d4-b359-04d64dc7eb84   \n",
       "TCGA-ZX-AA5X               TCGA-ZX-AA5X__tcga_cesc   \n",
       "\n",
       "                                                identifierresub  \\\n",
       "AD100         [{'system': 'GDC', 'value': '0f08e2e9-9983-420...   \n",
       "AD1026        [{'system': 'GDC', 'value': '6d9d6cb9-652f-474...   \n",
       "AD10328       [{'system': 'GDC', 'value': '514fc104-1ee5-470...   \n",
       "AD10460       [{'system': 'GDC', 'value': '8c36611d-be2f-432...   \n",
       "AD10485       [{'system': 'GDC', 'value': '0ad0fdda-dd96-48d...   \n",
       "...                                                         ...   \n",
       "TCGA-ZJ-AB0H       [{'system': 'IDC', 'value': 'TCGA-ZJ-AB0H'}]   \n",
       "TCGA-ZJ-AB0I  [{'system': 'GDC', 'value': 'a4f13656-a941-498...   \n",
       "TCGA-ZJ-AB0I       [{'system': 'IDC', 'value': 'TCGA-ZJ-AB0I'}]   \n",
       "TCGA-ZX-AA5X  [{'system': 'GDC', 'value': '4756acc0-4e96-44d...   \n",
       "TCGA-ZX-AA5X       [{'system': 'IDC', 'value': 'TCGA-ZX-AA5X'}]   \n",
       "\n",
       "             member_of_research_project  \\\n",
       "AD100                             FM-AD   \n",
       "AD1026                            FM-AD   \n",
       "AD10328                           FM-AD   \n",
       "AD10460                           FM-AD   \n",
       "AD10485                           FM-AD   \n",
       "...                                 ...   \n",
       "TCGA-ZJ-AB0H                  tcga_cesc   \n",
       "TCGA-ZJ-AB0I                  TCGA-CESC   \n",
       "TCGA-ZJ-AB0I                  tcga_cesc   \n",
       "TCGA-ZX-AA5X                  TCGA-CESC   \n",
       "TCGA-ZX-AA5X                  tcga_cesc   \n",
       "\n",
       "                        primary_diagnosis_condition primary_diagnosis_site  \\\n",
       "AD100                       Squamous Cell Neoplasms           Cervix uteri   \n",
       "AD1026          Complex Mixed and Stromal Neoplasms            Uterus, NOS   \n",
       "AD10328                     Squamous Cell Neoplasms           Cervix uteri   \n",
       "AD10460       Cystic, Mucinous and Serous Neoplasms            Uterus, NOS   \n",
       "AD10485       Cystic, Mucinous and Serous Neoplasms            Uterus, NOS   \n",
       "...                                             ...                    ...   \n",
       "TCGA-ZJ-AB0H                                   None                 Cervix   \n",
       "TCGA-ZJ-AB0I                Squamous Cell Neoplasms           Cervix uteri   \n",
       "TCGA-ZJ-AB0I                                   None                 Cervix   \n",
       "TCGA-ZX-AA5X                Squamous Cell Neoplasms           Cervix uteri   \n",
       "TCGA-ZX-AA5X                                   None                 Cervix   \n",
       "\n",
       "                                              identifiersubject       species  \\\n",
       "AD100                     [{'system': 'GDC', 'value': 'AD100'}]  Homo sapiens   \n",
       "AD1026                   [{'system': 'GDC', 'value': 'AD1026'}]  Homo sapiens   \n",
       "AD10328                 [{'system': 'GDC', 'value': 'AD10328'}]  Homo sapiens   \n",
       "AD10460                 [{'system': 'GDC', 'value': 'AD10460'}]  Homo sapiens   \n",
       "AD10485                 [{'system': 'GDC', 'value': 'AD10485'}]  Homo sapiens   \n",
       "...                                                         ...           ...   \n",
       "TCGA-ZJ-AB0H  [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0H'}, {...  Homo sapiens   \n",
       "TCGA-ZJ-AB0I  [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...  Homo sapiens   \n",
       "TCGA-ZJ-AB0I  [{'system': 'GDC', 'value': 'TCGA-ZJ-AB0I'}, {...  Homo sapiens   \n",
       "TCGA-ZX-AA5X  [{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...  Homo sapiens   \n",
       "TCGA-ZX-AA5X  [{'system': 'GDC', 'value': 'TCGA-ZX-AA5X'}, {...  Homo sapiens   \n",
       "\n",
       "                 sex          race               ethnicity  days_to_birth  \\\n",
       "AD100         female  not reported            not reported            NaN   \n",
       "AD1026        female  not reported            not reported            NaN   \n",
       "AD10328       female  not reported            not reported            NaN   \n",
       "AD10460       female  not reported            not reported            NaN   \n",
       "AD10485       female  not reported            not reported            NaN   \n",
       "...              ...           ...                     ...            ...   \n",
       "TCGA-ZJ-AB0H  female  not reported            not reported       -17869.0   \n",
       "TCGA-ZJ-AB0I  female         white  not hispanic or latino        -9486.0   \n",
       "TCGA-ZJ-AB0I  female         white  not hispanic or latino        -9486.0   \n",
       "TCGA-ZX-AA5X  female         white  not hispanic or latino       -23440.0   \n",
       "TCGA-ZX-AA5X  female         white  not hispanic or latino       -23440.0   \n",
       "\n",
       "             subject_associated_project  vital_status  age_at_death  \\\n",
       "AD100                           [FM-AD]  Not Reported           NaN   \n",
       "AD1026                          [FM-AD]  Not Reported           NaN   \n",
       "AD10328                         [FM-AD]  Not Reported           NaN   \n",
       "AD10460                         [FM-AD]  Not Reported           NaN   \n",
       "AD10485                         [FM-AD]  Not Reported           NaN   \n",
       "...                                 ...           ...           ...   \n",
       "TCGA-ZJ-AB0H     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "TCGA-ZJ-AB0I     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "TCGA-ZJ-AB0I     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "TCGA-ZX-AA5X     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "TCGA-ZX-AA5X     [TCGA-CESC, tcga_cesc]         Alive           NaN   \n",
       "\n",
       "             cause_of_death  \n",
       "AD100                  None  \n",
       "AD1026                 None  \n",
       "AD10328                None  \n",
       "AD10460                None  \n",
       "AD10485                None  \n",
       "...                     ...  \n",
       "TCGA-ZJ-AB0H           None  \n",
       "TCGA-ZJ-AB0I           None  \n",
       "TCGA-ZJ-AB0I           None  \n",
       "TCGA-ZX-AA5X           None  \n",
       "TCGA-ZX-AA5X           None  \n",
       "\n",
       "[3196 rows x 15 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allmetadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "024da831",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "And saves it out to a csv so she can browse it with Excel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b6628de4",
   "metadata": {},
   "outputs": [],
   "source": [
    "allmetadata.to_csv(\"allmetadata.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "246644d3",
   "metadata": {},
   "source": [
    "<img src=\"./images/julia.png\" align=\"left\"\n",
    "\twidth=\"50\" height=\"50\" />\n",
    "   \n",
    "Julia knows from her subject count summary that there are 33480 files associated with her subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "ae1ae079",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Getting results from database\n",
       "\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Getting results from database\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">\"timestamp\"</span>:<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1656003440040</span>,<span style=\"color: #008000; text-decoration-color: #008000\">\"status\"</span>:<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">404</span>,<span style=\"color: #008000; text-decoration-color: #008000\">\"error\"</span>:<span style=\"color: #008000; text-decoration-color: #008000\">\"Not </span>\n",
       "<span style=\"color: #008000; text-decoration-color: #008000\">Found\"</span>,<span style=\"color: #008000; text-decoration-color: #008000\">\"path\"</span>:<span style=\"color: #008000; text-decoration-color: #008000\">\"//api/v1/researchsubjects/files/counts/all_Subjects_v3_0_w_RS\"</span><span style=\"font-weight: bold\">}</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[1m{\u001b[0m\u001b[32m\"timestamp\"\u001b[0m:\u001b[1;36m1656003440040\u001b[0m,\u001b[32m\"status\"\u001b[0m:\u001b[1;36m404\u001b[0m,\u001b[32m\"error\"\u001b[0m:\u001b[32m\"Not \u001b[0m\n",
       "\u001b[32mFound\"\u001b[0m,\u001b[32m\"path\"\u001b[0m:\u001b[32m\"//api/v1/researchsubjects/files/counts/all_Subjects_v3_0_w_RS\"\u001b[0m\u001b[1m}\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 57 ms\n"
     ]
    }
   ],
   "source": [
    "NoAdenoData.researchsubject.file.count.run()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}