CDISC Programming: Importance of Custom Checks

Author: Genpro Statistical Programming Team

SDTM domain structures and relationships are similar across studies within a therapeutic area, which leads to code standardization and reusability. Interim data transfers also bring changes in the data, so existing programs are rerun with minor updates. The possibility of errors in such scenarios is high: truncated data, new data issues going unidentified, attribute changes, etc.

Not all data issues are identified at the initial stage of source data validation; many surface only during the development of CDISC datasets. In addition, data issues identified at the initial run are not necessarily resolved in the next data transfer. This is where custom errors/warnings play a significant role. Similarly, in statistical programming, a standard piece of code might be replicated for various TFLs with changes only to the parameters considered. In such cases, specific custom checks based on those parameters also become important.

Quality and time are the two major factors that define efficient clinical research. Developing high-quality Study Data Tabulation Model (SDTM) datasets, Analysis Data Model (ADaM) datasets, and Tables, Listings, and Figures (TLF) can be time-consuming when QC checks are implemented at different phases of development. The industry is focusing on automation, and in that setting custom checks matter even more.

Quality is non-negotiable in programming, and it is disappointing when issues are not identified at an earlier stage even though standard programming practices were followed. Experience is, of course, a key factor in predicting where errors are likely. But the programming team does not always consist of experienced members.


‘User-defined errors/warnings’ written by programmers help to ensure reliability if they are used in a standardized manner. The question is how these checks can help us ensure the quality of programming. There are common (standard) checks and study-specific checks that programmers can carry out in their programs. Checks specific to SDTM may include ensuring that all subjects present in the study are present in SDTM.DM, flagging study-specific approaches used in the program, and ensuring that all values longer than 200 characters are mapped to SUPP.
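The 200-character check mentioned above is not illustrated elsewhere in this article, so here is a minimal sketch of one way to write it. The dataset RAW_AE, its variables, and the message wording are hypothetical and used only for illustration; the check loops over every character variable and reports any value longer than 200 characters.

```sas
/* Hypothetical raw data: names and values are illustrative only */
data raw_ae;
  length subjid $10 aeterm $400;
  subjid = "1001"; aeterm = "Headache";        output;
  subjid = "1002"; aeterm = repeat("x", 249);  output; /* 250 characters */
run;

data _null_;
  set raw_ae;
  /* The array holds all character variables read from RAW_AE */
  array chars {*} _character_;
  length vn $32;
  do i = 1 to dim(chars);
    /* LENGTHN ignores trailing blanks and returns 0 for a blank value */
    if lengthn(chars{i}) > 200 then do;
      vn  = vname(chars{i});
      len = lengthn(chars{i});
      putlog "WARNING: Value exceeds 200 characters and may need SUPP mapping: "
             subjid= vn= len=;
    end;
  end;
run;
```

Whether such values go to a supplemental qualifier or are handled differently depends on the study's mapping specification; the sketch only locates them.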

SCENARIO 1: SUBJECTS IN ANY RAW DATASET MUST BE PRESENT WITHIN SDTM.DM

Programmers need to merge the SDTM.DM dataset when developing every other SDTM dataset, and there is a chance that subjects in the external data are not in DM:

    * Both datasets must be sorted by SUBJID before the merge;
    data lb;
      merge sdtm.dm(in=a) raw_lb(in=b);
      by subjid;
      if b;
      if not a then putlog "ERROR: Subject " subjid "is not in SDTM.DM";
    run;

If a subject is not present in SDTM.DM, the message above is written to the log.

SCENARIO 2: TO ENSURE THAT THE RESULT WILL BE POPULATED FOR THE TEST WHEN THE PERFORMANCE STATUS IS ‘YES’

In certain scenarios, EDC data is collected in such a way that a single examination is split across two datasets: one containing the performance status of the examination and a second containing the results of the tests. Programmers can use the above approach to ensure that every subject with performance status “Yes” has a corresponding value in the external data.
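As a minimal sketch of that merge-based approach applied to the two-dataset case, the example below uses two hypothetical raw datasets, RAW_OE_PERF (performance status) and RAW_OE_RES (results). All dataset names, variable names, and values are assumptions made for illustration, and one record per subject is assumed in each dataset.

```sas
/* Hypothetical raw datasets: names and values are illustrative only */
data raw_oe_perf;   /* performance status of the examination */
  length subjid $10 oeperf $3;
  input subjid $ oeperf $;
  datalines;
1001 Yes
1002 Yes
1003 No
;
run;

data raw_oe_res;    /* results of the examination */
  length subjid $10 oeres $20;
  input subjid $ oeres $;
  datalines;
1001 Normal
;
run;

proc sort data=raw_oe_perf; by subjid; run;
proc sort data=raw_oe_res;  by subjid; run;

data oe_chk;
  merge raw_oe_perf(in=a) raw_oe_res(in=b);
  by subjid;
  if a;
  /* Performed = Yes, but no result record or a blank result */
  if upcase(oeperf) = "YES" and (not b or missing(oeres)) then
    putlog "ERROR: Examination performed but result is not populated for " subjid=;
run;
```

With this sample data only subject 1002 meets the condition: the examination is marked as performed, yet no result record exists.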

The check below flags records in which neither the result nor the completion status is populated:

    data oe;
      set oe;
      if cmiss(oeorres) and cmiss(oestat) then
        putlog "ERROR: Result is not populated for " usubjid;
    run;

SCENARIO 3: ENSURING THE AUTHENTICITY OF THE DERIVED BASELINE FLAG

Derivation of the baseline flag depends on how baseline is defined in the protocol. Usually it is the last non-missing measurement prior to the first administration of the study drug. Programmers can use a DATA step or PROC SQL to identify this record and merge it back with the original data. The programmer must also ensure that a test does not carry more than one baseline flag within a subject. The code provided below checks this. First, we derive the count of baseline records for each combination of the grouping variables:

    * One row per subject and test, with the number of baseline flags;
    proc sql;
      create table basechk as
        select usubjid, lbcat, lbscat, lbtestcd, count(lbblfl) as cnt
        from lb5
        where not missing(lbblfl)
        group by usubjid, lbcat, lbscat, lbtestcd;
    quit;

    data _null_;
      set basechk;
      if cnt gt 1 then
        put "ERROR: Multiple baseline populated for Subject=" usubjid "and TEST=" lbtestcd;
    run;

SCENARIO 4: ADDRESSING TEMPORARY APPROACHES USED WITHIN PROGRAMS PRIOR TO DB LOCK

There may be situations where the programmer is stuck or unable to move to the next step because of data issues or missing datasets. In such scenarios, the team can either stop processing or move forward with a temporary approach. Consider a case where multiple records are populated under a visit for an LBTEST, each with a different sponsor-defined identifier, and the team needs confirmation of which record should be used for analysis. The programmer could select the maximum/minimum, first/last, etc. value for each category as the analysis record; here the temporary approach is to take the last value for each visit:

    data lb2;
      set sdtm_lb end=ls;
      by usubjid visit lbcat lbscat lbtest;
      * Keep only the last record within each test;
      if last.lbtest;
      if ls then put "WARNING: Temporary - Duplicate records not considered for analysis";
    run;

If the team has a standard log-check macro that also searches for hint words, the message can instead be written as put "TEMPORARY - xxx".

SCENARIO 5: CHECK THAT ALL DATA VALUES ARE CONSIDERED IN THE PROGRAMMING FORMATS

Programmers use user-defined formats when generating SDTM or ADaM datasets, which helps make programs reusable, for example in an extension study. The risk is that new data values appear that the formats do not cover, and these can be missed during updates. Suppose that the value “E5” occurs in Study 2 but is not included in the formats because no such value was recorded in Study 1. Where the chance of missing such records is high, the approach below can be taken.

    *Formats used for the example program;
    proc format;
      value $efrmt
        "E1"="Example1"
        "E2"="Example2"
        "E3"="Example3"
        "E4"="Example4"
        other="XXX";
    run;

    data e2;
      set e1;
      * Character format, so the name needs the leading $;
      evar1=put(evar,$efrmt.);
      if evar1="XXX" then put "ERROR: Format EFRMT needs to be updated for the value-" evar;
    run;

SCENARIO 6: ENSURE THAT NO TRUNCATION OCCURS WHILE CONCATENATION

Concatenating two or more variables is common in programming, and programmers use various methods to do it. It is the programmer's duty to ensure that no truncation occurs. Even without a custom check, the CAT family of functions helps here. The code below shows two concatenation methods; only the one using CATX writes an alert to the log when truncation occurs, whereas the || operator truncates silently:

    data trail1;
      length conc $10;
      set (obs=2);
      conc=strip(model)||"-"||strip(type);
    run;

    data trail2;
      length conc $10;
      set (obs=2);
      conc=catx("-",model,type);
    run;

SCENARIO 7: TO ENSURE THAT ALL THE VARIABLES RESULTING FROM PROC TRANSPOSE ARE CONSIDERED IN FURTHER PHASES

Concatenating the variables that result from a transpose is also common in programming and is used most often when generating patient narratives and listings. If the transpose uses unique identifier (ID) values, there is no problem. If there is no identifier, the resulting variables are named col1, col2, etc., and the programmer typically has to work out the last variable manually and write the program to use the variables up to that one. Here is a way to determine the last variable programmatically and build the program on it.

The example below lists the laboratory tests present for each subject under the lab category ‘Chemistry’ in a column called LBTESTS. The laboratory tests are separated by commas:

    *Identifying the distinct lab test populated for each subject under the category Chemistry;
    proc sql;
      create table lab as
        select distinct usubjid,lbcat,lbtest
        from where lbcat="Chemistry";
    quit;

    proc transpose data=lab out=lab_t;
      var lbtest;
      by usubjid lbcat;
    run;

    *Identifying the maximum number of columns resulting from the transpose;
    proc sql noprint;
      select strip(put(max(testcnt),best.)) into :maxcln trimmed
        from (select distinct usubjid,count(lbtest) as testcnt
              from lab
              group by usubjid,lbcat);
    quit;

    %put Column=&maxcln.;

    data list1;
      length lbtests $400; *Length used for avoiding truncation;
      set lab_t;
      *Combining the resulted values from proc transpose;
      lbtests=catx(",", of col1-col&maxcln.);
    run;

SCENARIO 8: TO ENSURE TOPIC VARIABLE IN SDTM IS ALWAYS POPULATED

“A missing value in Required variable” is a severe violation of SDTM/SEND compliance, and it may cause automated processes to fail, such as uploading study data into a data warehouse or running standardized analysis tools. Hence every Required variable must be non-missing.
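A minimal sketch of checking several Required variables in one pass is shown here. The dataset LB_DEMO and its values are hypothetical, and the variable list passed to CMISS is an assumption that should be taken from the study's specification rather than from this example.

```sas
/* Hypothetical LB-like data: the second record has a blank LBTEST */
data lb_demo;
  length studyid $10 domain $2 usubjid $20 lbtestcd $8 lbtest $40;
  studyid="ABC-001"; domain="LB"; usubjid="ABC-001-1001";
  lbseq=1; lbtestcd="ALT"; lbtest="Alanine Aminotransferase"; output;
  lbseq=2; lbtestcd="AST"; lbtest=" ";                        output;
run;

data _null_;
  set lb_demo;
  /* CMISS counts missing arguments and accepts numeric and character variables */
  nmiss_req = cmiss(studyid, domain, usubjid, lbseq, lbtestcd, lbtest);
  if nmiss_req > 0 then
    putlog "ERROR: Required variable(s) missing: " usubjid= lbseq= nmiss_req=;
run;
```

The message reports how many Required variables are missing on the record; identifying which ones would need a per-variable test.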

The topic variable needs to be populated for every record in an SDTM domain. The code below checks that the topic variable is always populated:

    data lb;
      set lb;
      if cmiss(lbtestcd) then putlog "ERROR: Topic variable is missing for " usubjid;
    run;

SCENARIO 9: SDTM FULL DATE/PARTIAL DATE ARE HANDLED PROPERLY

An SDTM DTC variable may hold data represented in ISO 8601 format as a complete date/time, a partial date/time, or an incomplete date/time. It is always good practice to check that partial and full dates are handled as specified in the SDTMIG. The custom check below helps to identify, to an extent, whether a full or partial date is mapped correctly (assuming ‘-‘ is not used for missing values).

    data chk;
      date="2018-08-20"; output;
      date="2018-08"; output;
      date="2018-8"; output;
    run;

    data chk1;
      set chk;
      * Valid date-only lengths: YYYY (4), YYYY-MM (7), YYYY-MM-DD (10);
      if length(date) not in(4,7,10) then put "ERROR: Date is not meaningful";
    run;

SCENARIO 10: UNITS ARE POPULATED WITHOUT RESULTS OR RESULTS ARE POPULATED WITHOUT UNITS

For findings domains like LB and VS, some tests may be reported in different units (for example, body temperature can be collected in both degrees Celsius and Fahrenheit). Working out the unit of a particular result would be tedious, so the raw data usually contains a variable holding the unit that corresponds to each result.

At times a unit is populated when the result is missing, or a result is populated while the raw data variable for the unit has no value. Such raw data issues can be identified with a custom check and then reported to the reviewer easily.

    data vs;
      set vs2;
      * Exactly one of the two variables is missing;
      if cmiss(vsorresu, vsorres)=1 then put "Result populated without unit/Unit populated without result";
    run;

Note: The user can also add a custom check to ensure that a test under a particular category does not have multiple standard units.

CONCLUSION

Not all data issues are identified at the initial stage of source data validation; many surface only during the development of CDISC datasets. In addition, data issues identified at the initial run are not necessarily resolved in the next data transfer. This is where custom errors/warnings play a significant role. Similarly, in statistical programming, a standard piece of code might be replicated for various TFLs with changes only to the parameters considered. In such cases, specific custom checks based on those parameters also become important. It is highly recommended that these kinds of custom checks be built into automation programs because, if quality fails, everything fails.
