This material was prepared by Richard Charnigo on 01 September 2025.

1. SQL stands for "structured query language".  One can use SQL to manipulate and 
extract relevant portions of data sets.  Further, the results of such manipulations
and extractions can themselves be saved as new data sets.

2. I can tell you first-hand that a lot of current non-academic job advertisements 
related to statistics and data science express interest in the applicants having 
skills in SQL.  Python is also much in demand.

3. If you are familiar with SAS, then you will notice some similarities between SQL 
syntax and SAS syntax.  Some things that people use SQL for, I have in my academic 
career used SAS for.

4. From the following website, I downloaded (at no cost) SQL Server 2022 Express 
onto my Hewlett Packard PC running Windows.

https://www.microsoft.com/en-us/sql-server/sql-server-downloads

5. From the following website, I downloaded (at no cost) SQL Server Management Studio.

https://learn.microsoft.com/en-us/ssms/install/install

6. The following website provides a lot of information about SQL.  The examples that I
shall give below are not from that website but rely on my having learned from that 
website.  The website also includes a "Playground" section, where someone can practice 
writing SQL queries without having downloaded the Microsoft products in points 4 and 5 
above.

https://www.sqltutorial.org/

7. Another website I consulted is the following.

https://www.sqlservertutorial.net/

8. When I run SQL Server Management Studio, I am shown a pop-up that asks me to 
connect to the server.  In my case, the server refers to a location on the hard drive 
of my computer.  I did need to check the box "Trust server certificate" to proceed.

9. In the Object Explorer, I right clicked on Databases and selected New Database.  I 
then needed to assign a name, such as Example.

10. Again in the Object Explorer, I expanded Databases and right clicked on Example.  
I chose Tasks and then Import Flat File.  I identified the location of a text file on  
my computer called SampleDataforSQL1.  If you are reading this, presumably you also 
have access to that text file and can mimic the steps described herein.  This is a 
pretend data set.  So, there are no worries here about IRB approval or HIPAA, which 
can be considerations for real-world data sets involving human subjects.  The pretend 
data set contains 20 records for patients visiting a family medicine clinic.  There 
are also only a handful of variables, whereas real-world data sets sometimes contain 
dozens of variables (or more).  You can proceed through the pages in the Import Flat 
File pop-up, with the one extra step of checking the box to allow nulls (i.e., 
missing values) under Modify Columns.

11. Repeat point 10 for the second text file called SampleDataforSQL2.  This is 
another pretend data set containing 11 sets of "lab results" for patients visiting a 
family medicine clinic.  The idea of having a second data set is to help me 
illustrate how you can use SQL to combine related data sets, especially when not all 
of the people in one data set appear in the other data set.

12. Left click New Query.  A window will open, and you can paste in the two lines 
below.  After you left click Execute, you will see that the entire first data set has
displayed. 

select *
from SampleDataforSQL1;

13. What if you want only the first  5  records to display ?

select top 5 *
from SampleDataforSQL1;

14. What if you want the results sorted by age, from oldest to youngest ?

select *
from SampleDataforSQL1
order by age desc;

15. What if you want to limit the above to those aged 65 or greater ?

select *
from SampleDataforSQL1
where age > 64
order by age desc;

16. What if you want to limit the above to those aged 65 or greater with
private insurance ?  

select *
from SampleDataforSQL1
where age > 64 AND insurance = 'private'
order by age desc;

17. Hmm.  Could there be a "mistake" here ?  Wouldn't someone aged 88 more 
typically have Medicare ?  With real-world data sets, sometimes we discover
mistakes because of egregiously outlying values.  Other times we discover
mistakes based on implausible combinations of variable values.  Sometimes
we are not sure whether suspicious observations are in fact mistakes, and
it may help (when possible) to seek the opinion of a subject matter expert
or a person involved in the original collection of data.  In this instance, 
let us suppose that the person aged 88 really had public insurance.  We can 
implement a correction as follows.  (The first two lines below make the
correction; the remaining four lines verify that the correction has been
effective.)

update SampleDataforSQL1
set insurance = 'public' where id = 19;

select *
from SampleDataforSQL1
where age > 64 AND insurance = 'private'
order by age desc;

18. What if you don't want all variables to display but only a few of them ?

select id, sbp, dbp, age
from SampleDataforSQL1;

19. What if you want to derive a new variable based on existing variables ?  
In this case, I am requesting that a weighted average of systolic blood 
pressure and diastolic blood pressure be computed and stored as "map" (mean 
arterial pressure).

select id, sbp, dbp, age, sbp/3+2*dbp/3 AS map   
from SampleDataforSQL1;

20. Now let's create a new data set that has variables from both of the 
original data sets.  More specifically, for each patient, we would like to 
extract both systolic blood pressure (first data set) and total cholesterol
(second data set).  Note that there are 10 people in the first data set who
do NOT appear in the second data set; these will show NULL (missing) values 
for total cholesterol in the new data set.  There is also one person in the 
second data set who does NOT appear in the first data set; this person will
show a NULL (missing) value for systolic blood pressure in the new data set.
Also, note that patients are being matched on the id variable, which appears
in both data sets.

select sbp, totalc
from SampleDataforSQL1 full outer join SampleDataforSQL2 
on SampleDataforSQL1.id = SampleDataforSQL2.id;

21. An issue with the previous result is that the new data set does not show 
the id variable.  Because the id variable appears in both data sets, we 
rename it as ID1 if from the first data set and as ID2 if from the second 
data set.

select SampleDataforSQL1.id as ID1, SampleDataforSQL2.id as ID2, sbp, totalc
from SampleDataforSQL1 full outer join SampleDataforSQL2 
on SampleDataforSQL1.id = SampleDataforSQL2.id;

22. The preceding result is still not quite satisfactory, both because of the 
duplicate id variables and because neither ID1 nor ID2 is defined for 
everyone.  The following slightly tricky syntax yields a single id variable 
("ID3").  The  coalesce(SampleDataforSQL1.id,0)  replaces missing id values 
in the first data set by  0's.  The coalesce(SampleDataforSQL2.id,0)  
replaces missing id values in the second data set by  0's.  So, ID3 is made 
to inherit the id value from the second data set when it is missing from the 
first data set and vice versa.

select sbp, totalc,
case when coalesce(SampleDataforSQL1.id,0)=0 
THEN coalesce(SampleDataforSQL2.id,0)
ELSE coalesce(SampleDataforSQL1.id,0) END ID3
from SampleDataforSQL1 full outer join SampleDataforSQL2 
on SampleDataforSQL1.id = SampleDataforSQL2.id;

23. What if we want to, for example, count the number of patients seen by each 
doctor in the first data set ?

select doctor, count(id) as NumberSeen
from SampleDataforSQL1
group by doctor;

24. What if we want to, for example, record the average systolic blood pressure 
in the first data set, both overall and stratified based on doctor and 
insurance type ?  The  coalesce(doctor,'All')  and  coalesce(insurance,'All')  
create labels of 'All' when results are reported in strata based on only one 
variable instead of both.

select coalesce(doctor,'All') AS doctor, 
coalesce(insurance,'All') as insurance, 
avg(sbp) as AvgSBP
from SampleDataforSQL1
group by cube(doctor, insurance);

25. Suppose that we want to save the results from the previous query into a 
text file, so that we have it as a permanently available data set.  Go to Tools, 
Options, and Query Results.  Expand Query Results, and then expand  SQL Server.  
For Results to Grid, check the box to include column headers.  This box was 
unchecked by default, and checking this box will cause variable names to be 
included in the text file.

26. To create a text file with query results, right click where the results are 
displayed in the SQL Server Management Studio.  Choose Save Results As, select 
a directory, and then type in a file name.  Also, set the file type to be text.  
(You could instead, if you wanted, create a CSV file, which would by default 
open in Microsoft Excel.)  The present step gives you a new permanently 
available data set, like the data sets read in during steps 10 and 11, except 
that this new data set has been derived from what you did in SQL.

27. What I have carried out here does not show all capabilities of SQL but 
gives you some ideas as to what can be done.  The resource in point 6 above is 
also very helpful.