This material was prepared by Richard Charnigo on 01 September 2025. 1. SQL stands for "structured query language". One can use SQL to manipulate and extract relevant portions of data sets. Further, the results of such manipulations and extractions can themselves be saved as new data sets. 2. I can tell you first-hand that a lot of current non-academic job advertisements related to statistics and data science express interest in the applicants having skills in SQL. Python is also much in demand. 3. If you are familiar with SAS, then you will notice some similarities between SQL syntax and SAS syntax. Some things that people use SQL for, I have in my academic career used SAS for. 4. From the following website, I downloaded (at no cost) SQL Server 2022 Express onto my Hewlett Packard PC running Windows. https://www.microsoft.com/en-us/sql-server/sql-server-downloads 5. From the following website, I downloaded (at no cost) SQL Server Management Studio. https://learn.microsoft.com/en-us/ssms/install/install 6. The following website provides a lot of information about SQL. The examples that I shall give below are not from that website but rely on my having learned from that website. The website also includes a "Playground" section, where someone can practice writing SQL queries without having downloaded the Microsoft products in points 4 and 5 above. https://www.sqltutorial.org/ 7. Another website I consulted is the following. https://www.sqlservertutorial.net/ 8. When I run SQL Server Management Studio, I am shown a pop-up that asks me to connect to the server. In my case, the server refers to a location on the hard drive of my computer. I did need to check the box "Trust server certificate" to proceed. 9. In the Object Explorer, I right clicked on Databases and selected New Database. I then needed to assign a name, such as Example. 10. Again in the Object Explorer, I expanded Databases and right clicked on Example. I chose Tasks and then Import Flat File. I identified the location of a text file on my computer called SampleDataforSQL1. If you are reading this, presumably you also have access to that text file and can mimic the steps described herein. This is a pretend data set. So, there are no worries here about IRB approval or HIPAA, which can be considerations for real-world data sets involving human subjects. The pretend data set contains 20 records for patients visiting a family medicine clinic. There are also only a handful of variables, whereas real-world data sets sometimes contain dozens of variables (or more). You can proceed through the pages in the Import Flat File pop-up, with the one extra step of checking the box to allow nulls (i.e., missing values) under Modify Columns. 11. Repeat point 10 for the second text file called SampleDataforSQL2. This is another pretend data set containing 11 sets of "lab results" for patients visiting a family medicine clinic. The idea of having a second data set is to help me illustrate how you can use SQL to combine related data sets, especially when not all of the people in one data set appear in the other data set. 12. Left click New Query. A window will open, and you can paste in the two lines below. After you left click Execute, you will see that the entire first data set has displayed. select * from SampleDataforSQL1; 13. What if you want only the first 5 records to display ? select top 5 * from SampleDataforSQL1; 14. What if you want the results sorted by age, from oldest to youngest ? select * from SampleDataforSQL1 order by age desc; 15. What if you want to limit the above to those aged 65 or greater ? select * from SampleDataforSQL1 where age > 64 order by age desc; 16. What if you want to limit the above to those aged 65 or greater with private insurance ? select * from SampleDataforSQL1 where age > 64 AND insurance = 'private' order by age desc; 17. Hmm. Could there be a "mistake" here ? Wouldn't someone aged 88 more typically have Medicare ? With real-world data sets, sometimes we discover mistakes because of egregiously outlying values. Other times we discover mistakes based on implausible combinations of variable values. Sometimes we are not sure whether suspicious observations are in fact mistakes, and it may help (when possible) to seek the opinion of a subject matter expert or a person involved in the original collection of data. In this instance, let us suppose that the person aged 88 really had public insurance. We can implement a correction as follows. (The first two lines below make the correction; the remaining four lines verify that the correction has been effective.) update SampleDataforSQL1 set insurance = 'public' where id = 19; select * from SampleDataforSQL1 where age > 64 AND insurance = 'private' order by age desc; 18. What if you don't want all variables to display but only a few of them ? select id, sbp, dbp, age from SampleDataforSQL1; 19. What if you want to derive a new variable based on existing variables ? In this case, I am requesting that a weighted average of systolic blood pressure and diastolic blood pressure be computed and stored as "map" (mean arterial pressure). select id, sbp, dbp, age, sbp/3+2*dbp/3 AS map from SampleDataforSQL1; 20. Now let's create a new data set that has variables from both of the original data sets. More specifically, for each patient, we would like to extract both systolic blood pressure (first data set) and total cholesterol (second data set). Note that there are 10 people in the first data set who do NOT appear in the second data set; these will show NULL (missing) values for total cholesterol in the new data set. There is also one person in the second data set who does NOT appear in the first data set; this person will show a NULL (missing) value for systolic blood pressure in the new data set. Also, note that patients are being matched on the id variable, which appears in both data sets. select sbp, totalc from SampleDataforSQL1 full outer join SampleDataforSQL2 on SampleDataforSQL1.id = SampleDataforSQL2.id; 21. An issue with the previous result is that the new data set does not show the id variable. Because the id variable appears in both data sets, we rename it as ID1 if from the first data set and as ID2 if from the second data set. select SampleDataforSQL1.id as ID1, SampleDataforSQL2.id as ID2, sbp, totalc from SampleDataforSQL1 full outer join SampleDataforSQL2 on SampleDataforSQL1.id = SampleDataforSQL2.id; 22. The preceding result is still not quite satisfactory, both because of the duplicate id variables and because neither ID1 nor ID2 is defined for everyone. The following slightly tricky syntax yields a single id variable ("ID3"). The coalesce(SampleDataforSQL1.id,0) replaces missing id values in the first data set by 0's. The coalesce(SampleDataforSQL2.id,0) replaces missing id values in the second data set by 0's. So, ID3 is made to inherit the id value from the second data set when it is missing from the first data set and vice versa. select sbp, totalc, case when coalesce(SampleDataforSQL1.id,0)=0 THEN coalesce(SampleDataforSQL2.id,0) ELSE coalesce(SampleDataforSQL1.id,0) END ID3 from SampleDataforSQL1 full outer join SampleDataforSQL2 on SampleDataforSQL1.id = SampleDataforSQL2.id; 23. What if we want to, for example, count the number of patients seen by each doctor in the first data set ? select doctor, count(id) as NumberSeen from SampleDataforSQL1 group by doctor; 24. What if we want to, for example, record the average systolic blood pressure in the first data set, both overall and stratified based on doctor and insurance type ? The coalesce(doctor,'All') and coalesce(insurance,'All') create labels of 'All' when results are reported in strata based on only one variable instead of both. select coalesce(doctor,'All') AS doctor, coalesce(insurance,'All') as insurance, avg(sbp) as AvgSBP from SampleDataforSQL1 group by cube(doctor, insurance); 25. Suppose that we want to save the results from the previous query into a text file, so that we have it as a permanently available data set. Go to Tools, Options, and Query Results. Expand Query Results, and then expand SQL Server. For Results to Grid, check the box to include column headers. This box was unchecked by default, and checking this box will cause variable names to be included in the text file. 26. To create a text file with query results, right click where the results are displayed in the SQL Server Management Studio. Choose Save Results As, select a directory, and then type in a file name. Also, set the file type to be text. (You could instead, if you wanted, create a CSV file, which would by default open in Microsoft Excel.) The present step gives you a new permanently available data set, like the data sets read in during steps 10 and 11, except that this new data set has been derived from what you did in SQL. 27. What I have carried out here does not show all capabilities of SQL but gives you some ideas as to what can be done. The resource in point 6 above is also very helpful.