An Overview of CDISC Genomic Standards

The terms pharmacogenomics and pharmacogenetics refer to the study of how a person’s specific genetic sequences influence how they respond to medications. Pharmacogenomics aims to develop reasonable methods for improving therapies in relation to the patient’s genotype in order to achieve maximum effectiveness with the fewest side effects.

Pharmacogenomics (PGx) enables us to identify inter-individual differences in genetic elements that influence patients’ responses to disease and treatment. The field has been recognized since the early 1960s, and PGx studies are now widely used across drug development and labeling, including in clinical trials for patient stratification, improving medication safety, and optimizing doses. The benefits of pharmacogenomics include reduced toxicity and side effects, as well as faster attainment of the optimal therapeutic dose.

The FDA formally permitted Pharmacogenomics Data Submissions in 2005, and CDISC introduced the corresponding clinical data standard (SDTMIG-PGx) in May 2015. This paved the way for implementing SDTM data on biospecimens and genetics, such as specimen collection and management, quality data, genetic mutation, genotyping, gene expression, cytogenetics, viral genetics, and proteomics. CDISC has since retired SDTMIG-PGx v1.0 because much of its content fits into SDTMIG v3.4: the provisional PF domain has been discontinued and replaced by the GF domain, and the PG, PB, and SB domains are deprecated.

Why is it so important?

Pharmacogenomics and pharmacogenetics are both critical to the success of drug development and clinical trials. The study of pharmacogenomics assists researchers in predicting the efficacy and toxicity of a new medicine during clinical development.

Pharmacogenetic testing can be used in clinical investigations to stratify patients depending on their genotype, which correlates to their metabolizing capacity. This reduces the occurrence of serious adverse medication responses and contributes to the success of clinical studies.

Pharmacogenomics information falls into two categories: information about biospecimens and information about genetic observations.

BE – Biospecimen Events 

The BE domain holds the first of these categories: data about the specimen itself. It captures the actions taken that affect the status of the specimen, such as transportation, freezing, thawing, and aliquoting, together with when they occurred and which party became accountable for the specimen. The BE dataset is structured as one record per biospecimen event per specimen collected per subject, and STUDYID, USUBJID, BEREFID, BETERM, and BESTDTC are the key variables.

Mapping of BE dataset:

The BE domain contains information on the specimens gathered for pharmacogenomics evaluations. Consider a dataset that captures the activities that affect a blood sample. A specimen is collected, flash frozen, thawed, and shipped to a lab in another location. In the lab, specimen characteristics such as volume and weight are collected. Some tests are very sensitive to processes such as flash freezing or time spent in transit, so it is important to record when each process started and when it was completed.

Table 2 shows how the information from sample collection to transportation is mapped to BE domain. The variable BETERM stores the activities conducted with the biospecimen sample and might contain values such as “collection,” “freezing,” “transporting,” and so on. BEDECOD is the corresponding dictionary-derived term.

Row 1: The blood sample was collected.

Rows 2-4: The sample was flash frozen to avoid DNA degradation, then stored at -82˚C before shipping to the lab.

Row 5: The sample was thawed for DNA extraction.

BECAT describes the category of the event that the biospecimen goes through; if there is a further subcategory, BESCAT records it. BEPARTY is the party who becomes responsible for the biospecimen as a result of the activity in the linked BETERM variable. The party might be a person (e.g., a subject), an organization (e.g., a sponsor), or a place that serves as a proxy for an individual or organization (e.g., a site). The date variable BEDTC holds the date of specimen collection, whereas the start date variable BESTDTC and the end date variable BEENDTC hold the start and end date/times for the event given in BETERM.

Table 2: Example of BE domain
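
The structure described above can be sketched in a few lines of Python. This is only an illustration, not a reproduction of Table 2: the study, subject, and specimen identifiers, the BETERM values, and all date/times are made-up placeholders, and the two helper functions are hypothetical checks rather than part of any CDISC tooling. The sketch checks that the key variables identify each record uniquely and that no event ends before it starts.

```python
from datetime import datetime

# Hypothetical BE records for one blood specimen; all values are illustrative.
be_records = [
    {"STUDYID": "STUDY01", "USUBJID": "STUDY01-001", "BEREFID": "SPC-001",
     "BETERM": "Collection", "BESTDTC": "2023-03-01T09:00", "BEENDTC": "2023-03-01T09:05"},
    {"STUDYID": "STUDY01", "USUBJID": "STUDY01-001", "BEREFID": "SPC-001",
     "BETERM": "Freezing", "BESTDTC": "2023-03-01T09:10", "BEENDTC": "2023-03-01T09:12"},
    {"STUDYID": "STUDY01", "USUBJID": "STUDY01-001", "BEREFID": "SPC-001",
     "BETERM": "Storing", "BESTDTC": "2023-03-01T09:15", "BEENDTC": "2023-03-05T08:00"},
    {"STUDYID": "STUDY01", "USUBJID": "STUDY01-001", "BEREFID": "SPC-001",
     "BETERM": "Transporting", "BESTDTC": "2023-03-05T08:00", "BEENDTC": "2023-03-06T14:30"},
    {"STUDYID": "STUDY01", "USUBJID": "STUDY01-001", "BEREFID": "SPC-001",
     "BETERM": "Thawing", "BESTDTC": "2023-03-07T10:00", "BEENDTC": "2023-03-07T10:40"},
]

# Key variables of the BE dataset, as listed in the text.
KEY_VARS = ("STUDYID", "USUBJID", "BEREFID", "BETERM", "BESTDTC")


def duplicate_keys(records):
    """Return the set of key tuples that occur more than once."""
    seen, dupes = set(), set()
    for rec in records:
        key = tuple(rec[var] for var in KEY_VARS)
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


def events_ending_before_start(records):
    """Return the records whose BEENDTC is earlier than their BESTDTC."""
    bad = []
    for rec in records:
        start = datetime.fromisoformat(rec["BESTDTC"])
        end = datetime.fromisoformat(rec["BEENDTC"])
        if end < start:
            bad.append(rec)
    return bad


print(duplicate_keys(be_records))
print(events_ending_before_start(be_records))
```

Both checks return empty collections for the sample records. The date parsing assumes complete ISO 8601 date/times; partial dates, which SDTM allows, would need separate handling.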

BS – Biospecimen Findings

BS is a findings class domain which collects the characteristics of the biospecimen and extracted samples such as specimen volume and quality of the sample. The BS data is structured as one record per biospecimen finding per specimen collected per subject and STUDYID, USUBJID, BSREFID, BSTESTCD, and BSDTC are the key variables. 

Mapping of BS dataset:

The BS domain describes the properties of the biospecimen. Consider the blood sample from the BE example above, for which we also collect measurements such as the volume and weight of the biospecimen. Table 3 shows this information mapped to the BS domain.

Row 1: The blood sample is collected and its volume is measured.

Row 2: The sample was flash frozen to avoid DNA degradation then stored at -82˚C before shipping to lab.

Row 3: Specifies the material used for flash freeze.

Row 4: Specifies the mass of the biospecimen.

Row 5: Specifies the concentration of the DNA extracted from the blood sample.

BSTEST and BSTESTCD indicate the characteristic of the biospecimen sample being measured. BSCAT defines the category of the value given in BSTESTCD; if there is a further subcategory, BSSCAT records it. BSORRES stores the measurement or finding as it was originally received or collected. The value of BSORRES in a standard format or standard units is placed in BSSTRESC; if the result is numeric, it should also be stored in numeric format in BSSTRESN. BSORRESU and BSSTRESU are the unit variables for the original and standard results. The variable BSSPEC stores the type of specimen collected. The variable BSDTC gives the date on which the test in BSTEST was performed, in ISO 8601 format.

Table 3: Example of BS domain.
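
The relationship between the original and standardized result variables can be sketched as follows. This is a simplified illustration under stated assumptions: the BSTESTCD codes and result values are placeholders rather than controlled terminology, the function name `standardize_result` is hypothetical, and no unit conversion is performed (the original result and unit are assumed to be in standard form already).

```python
def standardize_result(record):
    """Copy BSORRES/BSORRESU into BSSTRESC/BSSTRESU and fill BSSTRESN
    only when the result can be read as a number."""
    out = dict(record)
    out["BSSTRESC"] = record["BSORRES"]
    out["BSSTRESU"] = record.get("BSORRESU", "")
    try:
        out["BSSTRESN"] = float(record["BSORRES"])
    except ValueError:
        # Character results have no numeric counterpart.
        out["BSSTRESN"] = None
    return out


# Hypothetical BS records; test codes and values are illustrative only.
volume = {"BSREFID": "SPC-001", "BSTESTCD": "VOLUME", "BSTEST": "Volume",
          "BSORRES": "5", "BSORRESU": "mL", "BSSPEC": "BLOOD",
          "BSDTC": "2023-03-01"}
medium = {"BSREFID": "SPC-001", "BSTESTCD": "FRZMED", "BSTEST": "Freezing Medium",
          "BSORRES": "LIQUID NITROGEN", "BSSPEC": "BLOOD",
          "BSDTC": "2023-03-01"}

for rec in (volume, medium):
    std = standardize_result(rec)
    print(std["BSTESTCD"], std["BSSTRESC"], std["BSSTRESN"], std["BSSTRESU"])
```

In a real mapping, BSSTRESC and BSSTRESU would usually come from a unit-conversion step, so the straight copy here stands in for that logic.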

GF – Genomics Findings

The findings domain GF contains data regarding the structure, function, evolution, mapping, and editing of subject and non-host organism genomic material of interest. The expected structure for the GF domain is one record per finding per observation per biospecimen per subject. GF has 57 variables: 11 identifiers, 1 topic, 35 qualifiers, and 10 timing.

The Genomics Findings domain simplifies the earlier model by consolidating variables that had overlapping concepts or unclear definitions. New concepts and newly established SDTM variables have been added, and outdated concepts have been removed. The GF domain is well suited to managing genetic testing and assessment results. Nevertheless, a basic understanding of genetics is required to map the data into the correct variables, and because the domain is new, it can be difficult to understand and implement with the limited guidance available.

Renaming PF to Genomics Findings (GF) 

The Genomics Findings (GF) domain replaces the PF domain from SDTMIG v3.3 for Pharmacogenomics and Pharmacogenetics to handle modern clinical research genomic data based on previous standard improvements.

This domain covers the structure, function, evolution, mapping, and editing of subject and non-host organism genomic material of interest, i.e., genetic variation, transcription, and summary measures derived from these assessments. Such genomic findings help us to infer or anticipate the amino acids or proteins formed. However, the direct assessment of proteins is out of scope for this domain.

For non-host organisms including bacteria, viruses, and parasites, genetic findings from assessments of non-host organisms in subject samples are in scope for GF. However, identifying a viable, non-host organism or infectious agent (Microbiology Specimen (MB) domain) and determining the resistance and susceptibility of a non-host organism to a drug (Microbiology Susceptibility (MS) domain) are out of scope for this domain.  

Mapping of GF variables and GF Model Dataset

Consider an example in which we have findings from an assessment of a known single nucleotide variant (rs699947) in the protein-coding gene VEGFA (Vascular Endothelial Growth Factor A), performed on extracted DNA using polymerase chain reaction, a wet-laboratory method.

Findings from this assessment show the genotypes from DNA extracted from the blood of 3 individuals, each with a different genotype at the genetic locus of interest (VEGFA gene). 

Row 1: Shows a subject genotype which is homozygous (A/A) for the variant nucleotide in the reference sequence. 

Row 2: Shows a subject genotype which is heterozygous (A/C) for the nucleotide in the reference sequence. 

Row 3: Shows a subject genotype which is homozygous (C/C) for the nucleotide in the reference sequence.

VEGF is a potent stimulator of angiogenesis. rs699947 is a single nucleotide polymorphism (SNP) in close proximity to VEGFA. The A allele of rs699947 has been associated with an increased risk of thyroid cancer development and regional lymph node metastasis in men. Findings recorded in the GF domain can therefore help to anticipate an individual's chance of developing a disease, and they reflect the fact that individuals may respond differently to treatment depending on their genomic variations.

  1. A unique identifier for the assayed genetic specimen can be mapped to GFREFID.
  2. GFTEST/CD, GFTSTDTL: The variables GFTEST/CD and GFTSTDTL are closely coupled. While GFTEST/CD gives the name of the measurement, GFTSTDTL provides the specific detail associated with the value in GFTEST/CD. A GFTEST/CD value can have more than one associated GFTSTDTL value, and a single GFTSTDTL value may be mapped to one or more GFTEST/CD values. Both GFTEST/CD and GFTSTDTL are needed to describe the findings in the GF domain explicitly.
  3. Findings or results as collected can be mapped to GFORRES, and the reference value for the result or finding as originally received or collected is mapped to GFORREF. The values from GFORRES in a standard format or in standard units (in character format) are mapped to GFSTRESC, and the values from GFORREF in a standard format or in standard units (in character format) are mapped to GFSTREFC.
  4. Inheritability is mapped to GFINHERT.
  5. An identifier for the genome reference used to generate the reported result can be mapped to GFGENREF.
  6. The designation (name or number) of the chromosome or contig on which the variant or other feature appears can be mapped to GFCHROM.
  7. Gene names can be mapped to GFSYM, and a description of the type of genomic entity represented by GFSYM is mapped to GFSYMTYP. The gene names populated in GFSYM have to be obtained from the genomic symbol list maintained in the HUGO Gene Nomenclature Committee (HGNC) database. GFTESTCD and GFTEST are not populated with gene names.
  8. The genetic location within a sequence for the observed value in GFORRES is mapped to GFGENLOC.

Table 4: Example of GF domain
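
As a rough illustration of the three-subject example, the sketch below holds one GF-style record per subject and classifies each genotype in GFORRES as homozygous or heterozygous. It is not a reproduction of Table 4: the subject and specimen identifiers are placeholders, only a handful of GF variables are shown, and `zygosity` is a hypothetical helper whose output is a convenience check rather than a GF variable.

```python
def zygosity(genotype):
    """Classify a two-allele genotype string such as 'A/C'."""
    alleles = genotype.split("/")
    if len(alleles) != 2 or not all(alleles):
        raise ValueError(f"expected a genotype like 'A/C', got {genotype!r}")
    return "HOMOZYGOUS" if alleles[0] == alleles[1] else "HETEROZYGOUS"


# Hypothetical records: identifiers are illustrative; the gene symbol and
# genotypes follow the rs699947 / VEGFA example in the text.
gf_records = [
    {"USUBJID": "STUDY01-001", "GFREFID": "SPC-001-1", "GFSYM": "VEGFA", "GFORRES": "A/A"},
    {"USUBJID": "STUDY01-002", "GFREFID": "SPC-002-1", "GFSYM": "VEGFA", "GFORRES": "A/C"},
    {"USUBJID": "STUDY01-003", "GFREFID": "SPC-003-1", "GFSYM": "VEGFA", "GFORRES": "C/C"},
]

for rec in gf_records:
    print(rec["USUBJID"], rec["GFSYM"], rec["GFORRES"], zygosity(rec["GFORRES"]))
```

A full GF record would also carry GFTESTCD, GFTEST, and GFTSTDTL, with values taken from CDISC controlled terminology; they are left out here to avoid guessing at coded values.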

RELSPEC – Related Specimens

The RELSPEC domain is used to record specimen relationships and hierarchies; unlike RELREC, it does not maintain relationships between datasets or domains. It consists of six variables: STUDYID, USUBJID, REFID (the unique reference identifier for specimens of each subject), SPEC (the type of specimen), PARENT (the parent specimen identifier, recorded for tracing purposes), and LEVEL (the generation number of the sample, where the collected sample is considered the first generation). There are three CDISC controlled terminology codelists that may be applicable to SPEC: SPEC (C77529), SPECTYPE (C78734), and GENSMP (C111114).

Consider a situation where a blood sample is collected and three levels of DNA are extracted from it, as shown in Figure 1. Table 5 shows how this information can be mapped to the RELSPEC domain.

Figure 1. Specimen Relationship

Table 5. Example of RELSPEC Domain
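
The link between PARENT and LEVEL can be sketched by walking each specimen's parent chain back to the collected sample. This is an illustration under stated assumptions, not a reproduction of Figure 1 or Table 5: the specimen identifiers and hierarchy are made up, the records are assumed to belong to a single subject, and `derive_levels` is a hypothetical helper.

```python
# Hypothetical RELSPEC records for one subject; an empty PARENT marks the
# collected (first-generation) specimen.
relspec = [
    {"USUBJID": "STUDY01-001", "REFID": "SPC-001", "SPEC": "BLOOD", "PARENT": ""},
    {"USUBJID": "STUDY01-001", "REFID": "SPC-001-1", "SPEC": "DNA", "PARENT": "SPC-001"},
    {"USUBJID": "STUDY01-001", "REFID": "SPC-001-2", "SPEC": "DNA", "PARENT": "SPC-001"},
    {"USUBJID": "STUDY01-001", "REFID": "SPC-001-1-1", "SPEC": "DNA", "PARENT": "SPC-001-1"},
]


def derive_levels(records):
    """Return {REFID: LEVEL}, counting the collected specimen as level 1."""
    parent_of = {rec["REFID"]: rec["PARENT"] for rec in records}
    levels = {}
    for refid in parent_of:
        level, current = 1, refid
        while parent_of[current]:
            current = parent_of[current]
            if current not in parent_of:
                raise ValueError(f"parent {current!r} has no RELSPEC record")
            level += 1
            if level > len(parent_of):
                raise ValueError("cycle detected in PARENT references")
        levels[refid] = level
    return levels


for refid, level in derive_levels(relspec).items():
    print(refid, level)
```

Deriving LEVEL this way also serves as a consistency check: a PARENT value that points to a missing specimen, or a circular reference, raises an error instead of producing a misleading generation number.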


The BE (Biospecimen Events), BS (Biospecimen Findings), GF (Genomics Findings), and RELSPEC domains are still quite new. They support organizing and publishing the collected data in a more structured way, and hands-on experience with implementing them is the best way to become familiar with them and understand them better.


CDISC. 2015. Study Data Tabulation Model Implementation Guide: Pharmacogenomics/Genetics. Available at:

Role of pharmacogenomics in drug discovery and development

Available at Role of pharmacogenomics in drug discovery and development – PMC (

Implementing the U.S. FDA guidance on pharmacogenomic data submissions

Available at Implementing the U.S. FDA guidance on pharmacogenomic data submissions – Goodsaid – 2007 – Environmental and Molecular Mutagenesis – Wiley Online Library

Introduction to the SDTM Genomics Findings (GF) Domain (

Time to Get in the Genomics Findings (GF) Domain 

Vascular endothelial growth factor gene polymorphisms in thyroid cancer [PMID 17951537]

Linghui Zhang. 2017. Implementation of STDM Pharmacogenomics/Genetics Domains on Genetic Variation Data. Merck & Co.
