Reflections of a Data Scientist: SQL Advanced Joins (SAS)

This article is a short tutorial pertaining to the joining of data sets within SAS. Though SAS code does contain its own capabilities for data merging, I personally feel that data merging is far cleaner and efficient, from a coding perspective, when performed within SQL, or within SAS through the utilization of PROC SQL.

I am assuming that you are familiar with how SQL joins are utilized. However, if you are not, here is a quick summary of the most useful types of joins and how they operate:

The Left Join - All data from the left data set is joined with the data from the right data set. The matches are made based on primary unique identifiers from each table. All data that does not match from the right table is discarded. However, all data from the left data set remains regardless.

The Right Join - All data from the right data set is joined with the data from the left data set. The matches are made based on primary unique identifiers from each table. All data that does not match from the left set is discarded. However, all data from the right data set remains regardless.

The Inner Join - All data from the right data set is joined with data from the left data set The matches are made based on primary unique identifiers from each table. All data that does not match from either set is discarded.

The Full Outer Join - All data from the right data set is joined with data from the left data set. The matches are made based on primary unique identifiers from each table. All data that does not match from either set is kept regardless.

The easiest way to join two tables is to utilize a modified version of the code below:

proc sql ;

create table <tablename> as /* Name of data set that will be created */

select a.<varnames or *>, b.<varnames or *> /* Name of the variables that will be included within the new set */

from <TABLEA> as a left join /* type of join */ <TABLEB> as b /* The names of the original two sets that will be joined */

on a.<VARA> = b.<VARB> /* The primary keys from each data set from which matches will be assessed */

;

quit;

I would recommend enabling the OPTION NOTHREADS; option that was discussed in a previous article. This option prevents SAS from potentially corrupting the original data sets while merging them to create a new set.

As an example, let us say that we wanted to left join SETA with SETB based on matching id variables.

proc sql ;

create table SETC as /* Name of data set that will be created */

select a.*, b.* /* Name of the variables that will be included within the new set */

from SETA as a left join /* All data that does not match SETA from SET B will be dropped from the set */ SETB as b /* The names of the original two sets that will be joined */

on a.ID1 = b.ID2 /* In this particular case, ID1 is primary key utilized from SETA, and ID2 is the primary key utilized from SETB */

;

quit;

For the code to run correctly, the primary keys which are being compared must be of the same data type. Also, be sure that the data variables which will be consolidated through the join do not have the same variable name.

Reflections of a Data Scientist

Monday, June 5, 2017

SQL Advanced Joins (SAS)

No comments:

Post a Comment