Fundamentals of SQL Query Optimization
SQL query optimization is critical for advanced users who work with large, complex databases. Even small improvements in query performance can translate to significant gains in productivity and cost savings.
In some cases, poorly optimized queries can even bring a system to its knees.
At its core, query optimization involves finding the most efficient way to retrieve data from a database. This often requires a deep understanding of the database structure, as well as the underlying hardware and software environment. Optimizing queries involves a variety of techniques.In this post, we will explore some of these advanced techniques in greater detail, providing examples and best practices for optimizing SQL queries.
Below you can find some ways to significantly improve the performance and readability of your SQL Queries.
Use lowercase code
The truth is that for many years, I used to code in uppercase SQL. This finally changed, when I had to adapt myself in a professional setting that codes lowercase. I got used to it immediately and I never looked back.
Despite the fact that uppercase is more commonly used for SQL keywords, as it can make the code easier to read and differentiate between keywords and identifiers, unfortunately, it is not ergonomic. Lowercase style is less “noisy” and more natural. You do not have to focus on pressing CAPS LOCK but in the SQL side of things. Finally, readability is getting the same when you are getting used to lowercase.
Instead of this
SELECT
department
, COUNT(*) AS total_employees
FROM Employees
WHERE department IN ('Sales', 'Marketing')
GROUP BY department
HAVING COUNT(*) > 10
ORDER BY total_employees DESC
Do this
select
department
, count(*) as total_employees
from employees
where department in ('Sales', 'Marketing')
group by department
having count(*) > 10
order by total_employees desc
Consistency is key when it comes to choosing between uppercase and lowercase keywords. If you’re working on a team or with a codebase that uses lowercase keywords, it’s important to maintain that style for the sake of consistency and clarity.
Ultimately, the choice between uppercase and lowercase SQL keywords comes down to personal preference and the style guidelines of your team or organization.
Use leading commas
In SQL, leading commas
and trailing commas
refer to the placement of commas in a list of items within a SQL statement. Let’s break down these terms:
- Leading commas refer to commas that appear at the beginning of each element in a list
- Trailing commas refer to commas that appear at the end of each element in a list
Of course, there is no single standard for how to write SQL code. Some people use uppercase keywords, while some lowercase. Some use trailing commas and others leading commas.
Trailing commas style is easier to read for most people. One advantage of using leading commas is that the last item doesn’t differ from the others, but on the other hand the first item differs.
To my experience, leading commas style is easier for debugging as for errors that will occur due to missing commas, it will be faster to check the lines from start instead of their end and fix the appropriate line.
The most crucial thing here is to adapt ourselves in the organisation that we belong and stick to the team’s coding style. It’s more useful to have pieces of code with a consistent look, rather applying your personal preferences.
Instead of this
select
order_id,
order_date,
user_id,
customer_name,
line_item_id
from line_items
Do this
select
order_id
, order_date
, user_id
, customer_name
, line_item_id
from line_items
aha moment: Imagine that a comma is missing. Which style suits you bette for debugging?
Refer table names in columns
In a database with multiple tables, it is possible that two or more tables have columns with the same name. If we don’t specify the table name when selecting columns, the database may not know which column we are referring to, leading to errors or incorrect results.
Instead of this
select
order_number
, order_date
, first_name
, last_name
, product_name
, price
from orders o
inner join customers c on o.customer_id = c.id
inner join products p on o.product_id = p.id
where c.last_name = 'Smith' and p.price > 100
order by order_date desc;
Do this
select o.order_number
, o.order_date
, c.first_name
, c.last_name
, p.product_name
, p.price
from orders o
inner join customers c on o.customer_id = c.id
inner join products p on o.product_id = p.id
where c.last_name = 'Smith' and p.price > 100
order by o.order_date desc;
It makes the query more readable and easier to understand. When we refer to columns without specifying the table name, it can be difficult to determine which table the column comes from, particularly if the query involves multiple tables.
by including the table name in the select statement we avoid conflicts when joining tables. It is possible for two or more tables to have columns with the same name.
Use Common Table Expressions (CTEs) and/or temp tables
CTEs and temp tables can make the query more readable and easier to understand. Nested queries can become very complex and difficult to read, especially if there are several levels of nesting. It can also improve query performance. When we use a nested query, the database management system has to execute the inner query before it can execute the outer query. This can result in slower query performance, especially if the inner query involves a large amount of data or a complex calculation.
They can be reused in multiple queries. When we use a nested query, we can only use the result set of the inner query once. If we need to use the same result set in multiple queries, we have to execute the inner query multiple times, which can result in slower query performance.
-
Breaking down complex queries into smaller, results in more manageable and readable parts. This can greatly improve the clarity and maintainability of SQL code.
-
Code Reusability Results can be referenced multiple times within a single query.
-
Easier Debugging and Testing It becomes simpler to isolate and debug specific parts of a query.
-
Code Documentation Properly named CTEs and temp tables can serve as self-documenting code, providing descriptive names for temporary result sets and making the intention of the query clearer.
The choice between CTEs and temporary tables depends on the specific requirements of your query and the nature of the data manipulation you’re performing. In many cases, CTEs are favored for their simplicity and scope, while temporary tables are chosen for scenarios requiring persistence and reuse of intermediate results.
Instead of this
select
c.customer_name
, qr.quarter
, coalesce(qr.quarter_revenue, 0) as quarter_revenue
, coalesce(tr.total_revenue, 0) as total_revenue
from
customers c
left join (
select
customer_id
, date_trunc('quarter', order_date) as quarter
, sum(price * quantity) as quarter_revenue
from
order_items oi
join orders o on oi.order_id = o.order_id
group by
customer_id,
quarter
) qr on c.customer_id = qr.customer_id
left join (
select
customer_id
, sum(price * quantity) as total_revenue
from
order_items oi
join orders o on oi.order_id = o.order_id
group by
customer_id
) tr on c.customer_id = tr.customer_id;
Do this
-- Create a temp table with all orders and their dates
create temporary table temp_orders as
select
order_id
, customer_id
, order_date
, sum(price * quantity) as revenue
from order_items
group by order_id;
-- Create a temp table with the revenue for each quarter of the year
create temporary table temp_quarter_revenue as
select
customer_id
, date_trunc('quarter', order_date) as quarter
, sum(revenue) as quarter_revenue
from temp_orders
group by customer_id, quarter;
-- Create a temp table with the total revenue for each customer
create temporary table temp_total_revenue as
select
customer_id
, sum(revenue) as total_revenue
from temp_orders
group by customer_id;
-- Combine the temp tables to get the revenue for each quarter and customer
select
c.customer_name
, qr.quarter
, coalesce(qr.quarter_revenue, 0) as quarter_revenue
, coalesce(tr.total_revenue, 0) as total_revenue
from customers c
left join temp_quarter_revenue qr on c.customer_id = qr.customer_id
left join temp_total_revenue tr on c.customer_id = tr.customer_id;
Instead of this
select *
from
(
select
aggregated_data.customer_city
, aggregated_data.seller_city
, avg_price_total
, dense_rank() over (partition by customer_city order by sales desc) as rank_of_seller
from
(
select
customers.customer_city
, sellers.seller_city
, avg(price) as avg_price_total
, count(*) as sales
from order_items
left join (select seller_id, seller_city from sellers) sellers
on order_items.seller_id = sellers.seller_id
left join (select order_id, customer_id from orders where order_status = 'delivered') orders
on order_items.order_id = orders.order_id
left join (select customer_id, customer_city from customers) customers
on orders.customer_id = customers.customer_id
where price > 30
group by1,2
) aggregated_data
) rank_data
cross join (select avg(price) as avg_price_total from order_items where price > 30) avg_price_for_order_items
where customer_city = 'cabo frio' and rank_of_seller = 3
"""
Do this
with sellers_cte as (
select
seller_id
, seller_city
from sellers
)
, order_items_cte as (
select
order_id
, order_item_id
, seller_id
, price
from order_items
where price > 30
)
, avg_price_for_order_items_cte as (
select
avg(price) as avg_price_total
from order_items_cte
)
, orders_cte as (
select
order_id
, customer_id
from orders
where order_status = 'delivered'
)
, customers_cte as (
select
customer_id
, customer_city
from customers
)
, aggregated_data as (
select
customers_cte.customer_city
, sellers_cte.seller_city
, count(*) as sales
, avg(order_items_cte.price) as avg_price_per_city_combo
from order_items_cte
left join sellers_cte
on order_items_cte.seller_id = sellers_cte.seller_id
left join orders_cte
on order_items_cte.order_id = orders_cte.order_id
left join customers_cte
on orders_cte.customer_id = customers_cte.customer_id
group by 1,2
)
, rank_cte as (
select
aggregated_data.customer_city
, aggregated_data.seller_city
, avg_price_per_city_combo
, dense_rank() over (partition by customer_city order by sales desc) as rank_of_seller
from aggregated_data
)
select *
from rank_cte
cross join avg_price_for_order_items_cte
where customer_city = 'cabo frio' and rank_of_seller = 3
Note: These queries achieve the same result but they are more clear and easier to read and understand. They are also faster, especially if there are many orders and/or customers, because it is not executing subqueries for each row in the main query. It breaks the query down into smaller, more manageable parts, and simplifies the logic, making it easier to read and optimize.
Use proper naming convention
Overall, it is important to use a consistent naming convention and to choose names that are descriptive, clear, and easy to understand. This makes it easier for users to work with the database and ensures that queries and joins are accurate and efficient.
Instead of this
create table emp (
emp_id int primary key
, fn varchar(50)
, ln varchar(50)
, date_birth date
, hd date
, dpt_id int
);
Do this
create table employees (
employee_id int primary key
, first_name varchar(50)
, last_name varchar(50)
, date_of_birth date
, hire_date date
, department_id int
);
Notice how the table and column names are all abbreviated and use a non-standard naming convention. The table name is in singular form, which could be confusing if there are multiple entities in the table.
The column names use abbreviations, which may not be immediately clear to users who are not familiar with the system. This can make it difficult to understand the relationships between tables and can lead to confusion and errors in querying the database.
Avoid select *
Please stop doing that. Reduce the number of columns returned in the select
statement to only those that are necessary for the query.
This is a good practice for several reasons, especially in a columnar database.
In a columnar database, each column is stored separately from the other columns, which means that retrieving data from a large number of columns can be slower than retrieving data from a smaller number of columns.
by selecting *
we increase the amount of data that needs to be transferred across the network and processed by the application. This can result in higher memory usage and longer processing times, which can negatively impact query performance. Furthermore, we reduce the clutter in the select statement and make it easier to see what the query is doing.
Instead of this
select *
from orders
Do this
select
order_id
, user_id
, order_date
from orders
Use proper indentation and white spaces
Using proper indentation and spacing can help to make SQL code more readable and easier to maintain. It’s important to follow best practices and to keep the code well-organized to prevent errors and improve efficiency.
Instead of this
select c.customer_name,o.order_id,p.product_name,od.quantity
from customers c
join orders o on c.customer_id = o.customer_id
join order_details od on o.order_id = od.order_id
join products p on od.product_id = p.product_id
where c.city = 'New York' and o.order_date >= '2022-01-01' and p.category_id = 2
order by c.customer_name asC,o.order_id desc;
Do this
select
c.customer_name
, o.order_id
, p.product_name
, od.quantity
from customers c
join orders o on c.customer_id = o.customer_id
join order_details od on o.order_id = od.order_id
join products p on od.product_id = p.product_id
where
c.city = 'New York'
and o.order_date >= '2022-01-01'
and p.category_id = 2
ORDER by c.customer_name asc, o.order_id desc;
It’s a bit faster to read and understand it, isn’t it?
Summary
SQL optimization can have significant benefits for both performance and team collaboration.
In terms of performance, optimizing SQL queries can significantly reduce the time and resources required to retrieve data from the database.
In terms of team collaboration, SQL optimization can help ensure that the codebase is maintainable and scalable. Optimized SQL queries are generally more readable and easier to understand, which can make it easier for team members to collaborate and work together on projects. Additionally, SQL optimization can help prevent performance issues that may arise due to poorly written queries, which can help reduce the need for constant code maintenance and troubleshooting.
Overall, SQL optimization is crucial for both performance and team collaboration. by optimizing SQL queries, teams can improve the performance of their applications, make their codebase more maintainable, and enhance collaboration and teamwork.
Did you like the post? Empower me to dedicate more time and resources to curate, create, and share content that educates and inspires.