Simplify your SAS ‘Search’ or ‘Replace’ engine using Regular Expressions

 

 

 

 

Author: Mr. Pranav Kurode – Clinical SAS Programmer

 

Ever worked with NONUNIFORMSTRING?

 

Finding, extracting, or replacing text within non-standard strings like the one above is usually difficult. Perl regular expressions are an advanced technique for solving this problem.

 

A regular expression is a sequence of characters that defines a search pattern. Many text-processing tasks in SAS can be performed with Perl regular expressions. The same tasks can be done with traditional character functions, but Perl regular expressions often give simpler solutions to complicated text-manipulation tasks.

 

When performing a match, SAS searches the source string for the pattern you provide. For example, in prxmatch('/bike/', 'I have 1 bike') the pattern “bike” is searched for in the source string “I have 1 bike”. Perl regular expressions are composed of ordinary characters and special characters called metacharacters. Metacharacters perform tasks such as forcing the match to begin at a particular location or matching a particular set of characters. Some metacharacters are covered below.
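To make the “bike” example concrete, here is a minimal sketch that writes the result to the log. The variable names pos_found and pos_missing and the second pattern '/car/' are illustrative assumptions, not part of the original example.

```sas
data _null_;
    /* PRXMATCH returns the position where the match starts, or 0 if there is no match */
    pos_found   = prxmatch('/bike/', 'I have 1 bike');
    pos_missing = prxmatch('/car/',  'I have 1 bike');
    put pos_found= pos_missing=;   /* pos_found=10 pos_missing=0 */
run;
```

“bike” begins at the tenth character of the source string, so the first call returns 10.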

 

 

Functions :

 

PRXPARSE

PRXPARSE compiles a Perl regular expression so that other Perl regular expression functions can use it later. Each time you compile a regular expression, SAS assigns the next sequential number to the compiled pattern. The other PRX functions, such as PRXMATCH and PRXCHANGE, take this number to identify the pattern to search with.

 

Syntax : prxparse(Perl-Regular-Expression)

Perl-Regular-Expression : String placed in quotation marks
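One common way to use PRXPARSE is to compile the pattern once and reuse its ID on every row. The sketch below assumes hypothetical names (pattern_id, text, position) and two made-up input lines.

```sas
data _null_;
    /* Compile once on the first iteration and keep the pattern ID across rows */
    retain pattern_id;
    if _n_ = 1 then pattern_id = prxparse('/bike/');

    input text $char20.;
    /* The pattern ID replaces the quoted expression in PRXMATCH */
    position = prxmatch(pattern_id, text);
    put text= position=;
    datalines;
I have 1 bike
I have 1 car
;
run;
```

The first row should report position=10 and the second position=0.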

 

PRXMATCH

PRXMATCH locates the position in a string where a regular expression match is found. It returns the position at which the first match of the pattern begins. If the pattern is not found, the function returns zero.

Note: Some examples use the “m” (match) operator in prxmatch. It is the default operator, so “m/…/” is equivalent to “/…/”.

 

Syntax : prxmatch(Pattern-ID | Perl-Regular-Expression, String)

Pattern-ID : Value returned from prxparse

String : A character variable or string in quotation marks

 

 

PRXCHANGE

PRXCHANGE substitutes one string for another. One advantage of PRXCHANGE over TRANWRD is that you can search for strings using wildcards. Wildcards are covered in the metacharacter section below.

Note that you need to use the “s” operator in the regular expression to specify the search and replacement expressions, in the form s/search/replacement/.

 

Syntax : prxchange(Pattern-ID | Perl-Regular-Expression, Times, old-string)

Times : The number of times to search for and replace a match. 1 means replace once; -1 means keep replacing until the end of the string is reached.

Old-string : The string in which you want to make the replacement.
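The effect of the Times argument can be seen in a short sketch. The strings and variable names here are illustrative assumptions.

```sas
data _null_;
    length first_only all_matches $20;
    /* Times = 1: only the first match is replaced */
    first_only  = prxchange('s/cat/dog/',  1, 'cat and cat');
    /* Times = -1: every match is replaced */
    all_matches = prxchange('s/cat/dog/', -1, 'cat and cat');
    put first_only= all_matches=;   /* first_only=dog and cat all_matches=dog and dog */
run;
```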

 

 

The metacharacters used are:

Basic Syntax
Character    Description
/…/          Start and end delimiters of the regex
()           Grouping
|            Alternation

 

Example 1:

Suppose the data contain the values “rat”, “cat”, and “bat”, and we are interested in “rat” and “cat”.

Here “at” is common to all values; the only difference is the first letter: “r”, “c”, or “b”.
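Programs 1 to 3 read a dataset named test1 with a character variable a. The article does not show how that dataset is built, so the step below is only an assumed setup with the three values described above.

```sas
/* Hypothetical input for the programs that read test1 */
data test1;
    input a $3.;   /* three-character values, so no trailing blanks */
    datalines;
rat
cat
bat
;
run;
```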

 

 

Program 1:

data program1;
set test1;
/* Compile the pattern: "r" or "c" followed by "at" */
parse = prxparse('/(r|c)at/');
/* prxmatch searches variable a using the compiled pattern */
/* The first argument is the pattern ID returned by prxparse */
match = prxmatch(parse,a);
/*<------- OR ------->*/
/* match1 gives the same result as match: the pattern is
passed directly instead of through a prxparse pattern ID */
match1 = prxmatch('/(r|c)at/',a);
run;

Output 1:

 

 

 

Character Class
Character    Description
[…]          Matches any one of the characters in the brackets
[^…]         Matches any one character not in the brackets
[a-z]        Matches any one character in the range a to z

 

 

Example 2:

We will use the same data as in Example 1.

 

Program 2:

data program2;
set test1;
match2 = prxmatch('/[rc]at/',a); /* Either r or c, followed by at */
/*<------- OR ------->*/
match3 = prxmatch('/[^b]at/',a); /* Any character except b, followed by at */
run;
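Program 2 does not exercise the range form of a character class. A small sketch, with an assumed range a to c and illustrative variable names, shows how it behaves on the same kind of values:

```sas
data _null_;
    /* [a-c] matches one character from a to c, so "bat" matches but "rat" does not */
    in_range     = prxmatch('/[a-c]at/', 'bat');
    out_of_range = prxmatch('/[a-c]at/', 'rat');
    put in_range= out_of_range=;   /* in_range=1 out_of_range=0 */
run;
```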

 

Output 2:

 

 

Position Matching
Character    Description
^            Matches the beginning of the line
$            Matches the end of the line

 

 

Example 3:

Again, we use the same data as in Example 1.

 

Program 3:

data program3;
set test1;
match4 = prxmatch('/^[rc]/',a); /* Starting with r or c */
/*<------- OR ------->*/
match5 = prxmatch('/^[^b]/',a); /* Not starting with b */
run;
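The “$” anchor needs some care in SAS, because a character variable is padded with blanks up to its length and those blanks are part of the string being searched. The sketch below assumes a variable a of length 10 and shows two possible ways around the padding; the variable names are illustrative.

```sas
data _null_;
    length a $10;          /* "cat" is stored with seven trailing blanks */
    a = 'cat';
    padded  = prxmatch('/at$/',    a);        /* blanks sit between "at" and the end */
    trimmed = prxmatch('/at$/',    trim(a));  /* trailing blanks removed first */
    allowed = prxmatch('/at\s*$/', a);        /* pattern tolerates trailing blanks */
    put padded= trimmed= allowed=;   /* padded=0 trimmed=2 allowed=2 */
run;
```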

 

Output 3:

 

 

Wildcards Class
Character    Description
.            Matches any single character
\d           Matches a digit character [0-9]
\D           Matches any character except a digit
\w           Matches a word character: alphanumeric or underscore [a-zA-Z0-9_]
\W           Matches a non-word character (anything other than alphanumeric or underscore)
\t           Matches a tab character
\s           Matches a whitespace character such as a blank “space”
\S           Matches any character except whitespace

 

 

Example 4 :

Program 4:

data program4;
/* Match the first digit, here the 1 in abc123 */
/* The result is the position of the match in the string */
num1 = prxmatch('m/\d/',"abc123");
/* Match the first non-digit, here the a in abc123 */
char2 = prxmatch('m/\D/',"abc123");
/* Match the first word character, here the a in abc123 */
numchar1 = prxmatch('m/\w/',"abc123");
/* Match the first word character, here the 1 in 123abc */
numchar2 = prxmatch('m/\w/',"123abc");
/* Match the first non-word character, here the * in abc*123 */
nonumchar = prxmatch('m/\W/',"abc*123");
/* Match the blank " " in abc 123 */
blank = prxmatch('m/\s/',"abc 123");
/* Match the first non-blank character, here the a in abc*123 */
noblank = prxmatch('m/\S/',"abc*123");
run;
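Program 4 does not cover the “.” wildcard. As a minimal sketch with made-up strings and variable names, the dot requires exactly one character in its place:

```sas
data _null_;
    /* "." stands for exactly one character between c and t */
    one_char = prxmatch('/c.t/', 'cut');
    no_char  = prxmatch('/c.t/', 'ct');
    put one_char= no_char=;   /* one_char=1 no_char=0 */
run;
```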

 

Output 4:

 

 

 

Repetition Factor (match as many times as possible)
Character    Description
*            Matches 0 or more times
+            Matches 1 or more times
?            Matches 0 or 1 time
{n}          Matches exactly n times
{n,}         Matches at least n times
{n,m}        Matches at least n times but not more than m times

 

 

Example 5:

Please note: If the data contain special symbols that are also metacharacters, put ‘\’ before the symbol in the pattern as an escape character. For example, if your data contain ‘*’, use ‘\*’ to match it literally.

 

Program 5:

data program5;
/* Match one or more word characters and replace them */
/* Here ab is two word characters (>= 1), */
/* so both are replaced by a single "*" */
one_mor = prxchange('s/\w+/*/',-1,"ab%");
/* Here digits are matched zero or more times, */
/* then word characters one or more times; */
/* the matched text is replaced by "*" */
zer_mor = prxchange('s/\d*\w+/*/',-1,"ab%");
/* Match exactly 2 word characters */
/* The "i" modifier makes the match case-insensitive, so ab equals Ab and aB */
match2c = prxchange('s/\w{2}/1/i',-1,"Ab");
/* Match at least 1 and at most 2 word characters, then a digit */
/* $1 refers to the first () group */
/* Similarly, $2 refers to the second () group */
match12c = prxchange('s/\w{1,2}(\d)/$1/i',-1,"Ab1");
/* Match 1 or more word characters */
match1c = prxchange('s/\w{1,}/1/i',-1,"Ab");
run;
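Two points from this section that Program 5 does not exercise are the escape character and the “?” quantifier. The sketch below uses made-up strings and variable names to illustrate both.

```sas
data _null_;
    length dashed $10;
    /* A backslash before the asterisk makes it a literal character, not a quantifier */
    dashed = prxchange('s/\*/-/', -1, 'a*b*c');
    /* "u?" makes the u optional, so both spellings match */
    us_spelling = prxmatch('/colou?r/', 'color');
    uk_spelling = prxmatch('/colou?r/', 'colour');
    put dashed= us_spelling= uk_spelling=;   /* dashed=a-b-c us_spelling=1 uk_spelling=1 */
run;
```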

 

 

Output 5:

 

 

Code snippet:

 

Example 1:

In the following example, the variable “trt” has five values: TRT1, TRT2, TRT3, PROD1, and PROD3. We want to extract TRT1, TRT2, and PROD1.

 

Program :

 

data b;
set a;
/* Keep rows whose trt value contains a 1 or a 2 */
if prxmatch('m/[12]/',trt) >=1;
run;
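For readers who want to run this end to end, here is a self-contained sketch. The step that builds dataset a is an assumption, since the article does not show the input; the filter is the same idea as in the program above.

```sas
/* Hypothetical input: dataset a with the five treatment values */
data a;
    input trt $5.;
    datalines;
TRT1
TRT2
TRT3
PROD1
PROD3
;
run;

data b;
    set a;
    /* Subsetting IF: keep rows where a 1 or 2 appears anywhere in trt */
    if prxmatch('/[12]/', trt) >= 1;
run;

proc print data=b noobs;
run;   /* lists TRT1, TRT2 and PROD1 */
```

Because the pattern is not anchored, it keeps any value containing a 1 or 2 anywhere, which is sufficient for these five values.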

 

Output :

 

Example 2:

Suppose a dataset contains about 1000 values, and we focus on a few distinct patterns. Each value consists of a domain name and a number, for example “AE 10 DM 12”, “CM 11, DS 20”, “Adverse Event 17”, and “MH 17,20”. Each output value should hold one domain followed by one number. For “Adverse Event 17” the output should be “AE 17”; for “MH 17,20” it should be “MH 17” and “MH 20”.

Data :
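The input could be set up along these lines. This is a minimal sketch that assumes the dataset is named domain with a character variable domN, as the program below expects, and it holds only the four example values rather than the full data.

```sas
/* Hypothetical input covering the four patterns described above */
data domain;
    infile datalines truncover;
    input domN $char40.;   /* read the whole line, including embedded blanks and commas */
    datalines;
AE 10 DM 12
CM 11, DS 20
Adverse Event 17
MH 17,20
;
run;
```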

 

 

Program :

data test;
set domain;
/* Two domain/number pairs, e.g. "AE 10 DM 12" or "CM 11, DS 20" */
if prxmatch('/\w+\s\d+(,)?\s\w+\s\d+/',domN) >= 1 then do;
var1 = prxchange('s/(\w+\s\d+)(,)?\s\w+\s\d+/$1/',-1,domN);
var2 = prxchange('s/\w+\s\d+(,)?\s(\w+\s\d+)/$2/',-1,domN);
end;
/* Two full words and a number, e.g. "Adverse Event 17" */
if prxmatch('/\w{3,}\s\w{3,}\s\d+/',domN) >= 1 then
var1 = prxchange('s/(\w)\w{2,}\s(\w)\w{2,}\s(\d+)/$1$2 $3/',-1,domN);
/* One domain with two comma-separated numbers, e.g. "MH 17,20" */
if prxmatch('/\w+\s\d+,\d+/',domN) >=1 then do;
var1 = prxchange('s/(\w+)\s(\d+),\d+/$1 $2/',-1,domN);
var2 = prxchange('s/(\w+)\s\d+,(\d+)/$1 $2/',-1,domN);
end;
run;

 

Output :

 

Wish to know more? Always feel free to write to us at info@genproindia.com.
