SQL Simplified: Optimization Strategies & Enhancing Query Performance with examples.

Mohamed Hendawy
9 min readJan 6, 2024

--

SQL, or Structured Query Language, is the cornerstone of modern database management and manipulation. For software engineers, mastering SQL is not just about writing functional queries; it’s about crafting them for optimal performance and efficiency. This comprehensive guide explores SQL optimization strategies, each designed to enhance the performance of your database interactions.

1. Selective Data Retrieval

Poorly Written:

SELECT * FROM Employees;

Optimized:

SELECT EmployeeID, FirstName, LastName FROM Employees;

Rationale: Reducing data transfer by selecting only necessary columns enhances query performance, particularly for tables with numerous columns.

2. Smart Joins

Poorly Written:

SELECT * 
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
WHERE Customers.Country = 'USA';

Optimized:

SELECT Orders.OrderID, Orders.OrderDate, Customers.CustomerName 
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
WHERE Customers.Country = 'USA';

Rationale: Limiting the data fetched in JOIN operations to only relevant columns minimizes memory usage and speeds up data processing.

3. Replacing Subqueries

Poorly Written:

SELECT EmployeeID, 
(SELECT COUNT(*) FROM Orders WHERE Orders.EmployeeID = Employees.EmployeeID) AS NumberOfOrders
FROM Employees;

Optimized:

SELECT Employees.EmployeeID, COUNT(Orders.OrderID) AS NumberOfOrders
FROM Employees
LEFT JOIN Orders ON Employees.EmployeeID = Orders.EmployeeID
GROUP BY Employees.EmployeeID;

Rationale: A LEFT JOIN with aggregation is more efficient than a correlated subquery for large data sets.

4. Utilizing Indexes Effectively

Poorly Written:

SELECT * FROM Orders WHERE YEAR(OrderDate) = 2021;

Optimized:

SELECT * FROM Orders WHERE OrderDate BETWEEN '2021-01-01' AND '2021-12-31';

Rationale: The poorly written query uses the YEAR() function on the OrderDate column. This approach has a significant drawback: it prevents the database from effectively using any indexes on the OrderDate column. When a function is applied to a column in a query's condition, the database typically cannot use an index on that column, leading to a full table scan. A full table scan means the database must examine every row in the Orders table to determine if the YEAR(OrderDate) matches 2021. This process can be highly inefficient, especially for large tables with many rows, as it increases the query execution time.

The optimized query, on the other hand, directly compares the OrderDate with a specific date range. This method allows the database to utilize an index on the OrderDate column, if one exists. When an index is used, the database can quickly locate the rows where OrderDate falls within the specified range, without scanning the entire table. This results in a much faster query, particularly for large datasets, as it reduces the number of rows the database needs to check to fulfill the query.

5. Window Functions for Advanced Analysis

Poorly Written:

SELECT e1.EmployeeID, e1.Name, 
(SELECT COUNT(*) FROM Employees e2 WHERE e1.DepartmentID = e2.DepartmentID) AS DeptEmployeeCount
FROM Employees e1;

Optimized:

SELECT EmployeeID, Name, 
COUNT(*) OVER (PARTITION BY DepartmentID) AS DeptEmployeeCount
FROM Employees;

Rationale: The poorly written query uses a correlated subquery for each row in the outer query, resulting in repeated executions that can drastically reduce performance, especially in large tables. The optimized query, using a window function, efficiently computes the count of employees in each department in a single pass through the data, leveraging the power of window functions to perform complex calculations across sets of rows that are related to the current row.

6. Effective Pagination in Large Datasets

Poorly Written:

SELECT * FROM Logs WHERE EventType = 'ERROR';

Optimized:

SELECT * FROM Logs WHERE EventType = 'ERROR' LIMIT 100 OFFSET 0;

Rationale: In the poorly written query, fetching all records with ‘ERROR’ events can be highly inefficient for a large table, leading to a heavy load on the database and a long wait time for the query to complete. The optimized query uses pagination (LIMIT and OFFSET), which is essential when dealing with large datasets. It restricts the number of rows returned, making the query execution faster and more manageable.

7. Understanding the Use of DISTINCT

Poorly Written:

SELECT DISTINCT EmployeeID, DepartmentID FROM Employees WHERE DepartmentID > 10;

Optimized:

SELECT EmployeeID, DepartmentID FROM Employees WHERE DepartmentID > 10;

Rationale: The original query uses DISTINCT, which can be computationally expensive as it requires the database to sort and remove duplicate rows. If the combination of EmployeeID and DepartmentID is already unique due to the data model or business logic, then using DISTINCT is unnecessary. The optimized query removes DISTINCT, potentially improving performance by reducing sorting and processing overhead.

8. Efficient Date Range Queries

Poorly Written:

SELECT * FROM Sales WHERE MONTH(SaleDate) = 6 AND YEAR(SaleDate) = 2021;

Optimized:

SELECT * FROM Sales WHERE SaleDate BETWEEN '2021-06-01' AND '2021-06-30';

Rationale: In the poorly written query, the functions MONTH() and YEAR() prevent the database from using indexes on the SaleDate column, leading to inefficient full table scans. The optimized query eliminates these functions and instead uses a direct range comparison, which is far more efficient and likely to utilize an index if one exists on SaleDate.

9. Avoiding Redundant Grouping

Poorly Written:

SELECT DepartmentID, EmployeeID, COUNT(*) FROM Tasks GROUP BY DepartmentID, EmployeeID;

Optimized:

SELECT DepartmentID, COUNT(*) FROM Tasks GROUP BY DepartmentID;

Rationale: The original query includes an unnecessary grouping by EmployeeID, which leads to a finer partition of data than required, potentially producing a larger result set with repeated department data. The optimized query focuses on the actual requirement - counting tasks per department. This change reduces the computational load by aggregating data only at the department level.

10. Existence Checks Using EXISTS

Poorly Written:

SELECT * FROM Employees WHERE EmployeeID IN (SELECT EmployeeID FROM Salaries WHERE Amount > 50000);
  • This query selects all columns from the Employees table where the EmployeeID is in the list of EmployeeIDs from the Salaries table with an Amount greater than 50,000.
  • The subquery in the IN clause produces a list of all EmployeeIDs that meet the salary condition, potentially including duplicates if an employee has multiple salary records over 50,000.
  • The main query then matches its EmployeeID against this list. If an EmployeeID appears multiple times in the subquery's result, it does not affect the main query's output, as the main query still fetches unique employee records.

Optimized:

SELECT * FROM Employees e WHERE EXISTS (SELECT 1 FROM Salaries s WHERE s.EmployeeID = e.EmployeeID AND s.Amount > 50000);
  • This query selects all columns from the Employees table where there exists at least one corresponding record in the Salaries table with an Amount greater than 50,000 for that employee.
  • The EXISTS clause checks for the existence of at least one matching record. As soon as it finds a match, it stops checking further. It does not produce a list of IDs like the IN subquery.
  • The result of this query is generally the same as the IN subquery, as it's checking for the existence of the same condition (an employee having a salary record over 50,000). However, it's often more efficient, especially if there are multiple salary entries per employee, as it stops searching after finding the first match.

Rationale: The poorly written query uses IN with a subquery, which can be inefficient as it might generate a large intermediate result set. The optimized query with EXISTS is typically faster, as the database stops checking for matches in the subquery as soon as the first match is found, rather than building a complete list of matches first.

11. Strategic Use of OR in JOIN Conditions

Poorly Written:

SELECT * FROM Orders 
LEFT JOIN Customers ON Orders.CustomerID = Customers.CustomerID OR Orders.CustomerID IS NULL;

Optimized:

SELECT * FROM Orders 
LEFT JOIN Customers ON Orders.CustomerID = Customers.CustomerID
WHERE Orders.CustomerID IS NULL OR Customers.CustomerID IS NOT NULL;

Rationale: The poorly written query’s OR condition in the JOIN can lead to a complex and less efficient execution plan. By restructuring the query and moving part of the condition to the WHERE clause, the database can more effectively utilize indexes and optimize the join operation.

12. Partial Indexes for Query Efficiency

Poorly Written:

SELECT * FROM Messages WHERE IsRead = false AND UserID = 1;

Optimized:

-- Assuming a partial index exists on IsRead where IsRead = false
SELECT * FROM Messages WHERE IsRead = false AND UserID = 1;

Rationale: The poorly written query may not efficiently use indexes if the IsRead column has low cardinality (i.e., not many unique values). The optimized query, assuming a partial index exists, leverages the index specifically created for IsRead = false conditions, resulting in a faster and more efficient lookup.

13. Leveraging Lateral Joins

Poorly Written:

SELECT e.EmployeeID, e.Name, 
(SELECT COUNT(*) FROM Sales s WHERE s.EmployeeID = e.EmployeeID) AS SalesCount
FROM Employees e;

Optimized:

SELECT e.EmployeeID, e.Name, s.SalesCount
FROM Employees e
LEFT JOIN LATERAL (SELECT COUNT(*) AS SalesCount FROM Sales WHERE EmployeeID = e.EmployeeID) s ON true;

Rationale: The poorly written query uses a correlated subquery for each employee, which can be inefficient. The optimized version with a LATERAL join allows for a more efficient execution, as the correlated subquery is evaluated only once and can better utilize indexes.

14. Optimizing Aggregate Functions

Poorly Written:

SELECT MAX(Salary) FROM Employees WHERE DepartmentID = 3;

Optimized:

-- Assuming an index on DepartmentID including Salary
SELECT MAX(Salary) FROM Employees WHERE DepartmentID = 3;

Rationale: If an index includes both DepartmentID and Salary, the database can quickly locate and compute the maximum salary for department 3, significantly speeding up the query.

15. Using Subqueries Instead of JOINs

Poorly Written:

SELECT * FROM Employees WHERE ManagerID IN (SELECT EmployeeID FROM Employees WHERE Title = 'Manager');

Optimized:

SELECT e.* FROM Employees e
JOIN Employees m ON e.ManagerID = m.EmployeeID
WHERE m.Title = 'Manager';

Rationale: A JOIN can be more efficient than a subquery, particularly for large datasets, as it can better utilize indexes and reduce computational overhead.

16. Misusing OR in Queries

Poorly Written:

SELECT * FROM Orders WHERE CustomerID = 5 OR TotalAmount > 1000;
  • This query selects all orders where either the CustomerID is 5 or the TotalAmount is greater than 1000.
  • If an order satisfies both conditions (e.g., it’s from customer 5 and the amount is over 1000), it will appear once in the result set.

Optimized:

SELECT * FROM Orders WHERE CustomerID = 5
UNION
SELECT * FROM Orders WHERE TotalAmount > 1000;
  • UNION in SQL combines the results of two or more SELECT statements. However, it eliminates duplicate rows from the result set.
  • This query first selects orders where CustomerID is 5 and then selects orders where TotalAmount is greater than 1000. If any orders meet both criteria, they will still appear only once in the final result set due to UNION's duplicate removal.
  • Therefore, the end result of this query should be the same as the original query with the OR condition.

Rationale: Rewriting the query using UNION allows each part of the query to be optimized independently, potentially using different indexes for each condition.

It’s worth noting that while the results are the same, the performance may differ. Using UNION can sometimes offer performance benefits, especially when each SELECT statement can be optimized separately (for example, if different indexes can be used for each condition). However, this can vary based on the database schema, the size and distribution of data, and the database management system's optimization capabilities.

17. Reducing Query Complexity

Poorly Written:

SELECT e.Name, (SELECT AVG(Salary) FROM Employees WHERE DepartmentID = e.DepartmentID) AS AvgDeptSalary
FROM Employees e;

Optimized:

SELECT e.Name, d.AvgDeptSalary
FROM Employees e
JOIN (SELECT DepartmentID, AVG(Salary) AS AvgDeptSalary FROM Employees GROUP BY DepartmentID) d ON e.DepartmentID = d.DepartmentID;

Rationale: The poorly written query uses a correlated subquery that recalculates the average salary for each employee’s department, which is highly inefficient. The optimized version calculates the average salary per department in a subquery and then joins it with the main Employees table. This approach calculates each department's average salary once, significantly reducing the computational load.

18. Handling NULL Values Efficiently

Poorly Written:

SELECT * FROM Orders WHERE CustomerID IS NULL OR CustomerID = 5;

Optimized:

SELECT * FROM Orders WHERE COALESCE(CustomerID, 5) = 5;

Rationale: In the original query, the database needs to check two conditions for each row, which might be less efficient. The COALESCE function in the optimized query simplifies the logic by substituting NULL values with 5, allowing for a single condition check. This can make the query more efficient and concise.

19. Avoiding Full Table Scans with Proper Indexing

Poorly Written:

SELECT * FROM Products WHERE ProductName LIKE '%Widget%';

Optimized:

-- Assuming a full-text index on ProductName
SELECT * FROM Products WHERE MATCH(ProductName) AGAINST('Widget' IN NATURAL LANGUAGE MODE);

Full-Text Search: If your database supports it, consider implementing a full-text search index. This can be more efficient for searching within strings.

20. Maximizing Performance with Materialized Views

Poorly Written:

SELECT * 
FROM (SELECT ProductID, SUM(Sales) AS TotalSales FROM Orders GROUP BY ProductID) AS SalesSummary
WHERE TotalSales > 1000;

Optimized:

-- Assuming a materialized view 'SalesSummary' exists
SELECT * FROM SalesSummary WHERE TotalSales > 1000;

Rationale: The poorly written query calculates the total sales per product each time it runs, which can be inefficient for large datasets. The optimized query uses a materialized view, where the summary data is precomputed and stored, leading to faster retrieval for repeated queries.

21. Using Temporary Tables for Complex Queries

Poorly Written:

SELECT e.EmployeeID, e.Name, COUNT(t.TaskID) AS TaskCount
FROM Employees e
JOIN Tasks t ON e.EmployeeID = t.AssignedTo
GROUP BY e.EmployeeID, e.Name;

Optimized:

CREATE TEMPORARY TABLE EmployeeTasks AS
SELECT EmployeeID, COUNT(TaskID) AS TaskCount
FROM Tasks
GROUP BY EmployeeID;

SELECT e.EmployeeID, e.Name, et.TaskCount
FROM Employees e
JOIN EmployeeTasks et ON e.EmployeeID = et.EmployeeID;

Rationale: The original query performs a join and then an aggregation, which can be heavy on resources. By using a temporary table, the task counts are precomputed, making the final join more efficient.

--

--

No responses yet