🚀 Turbocharging Database Queries: How We Went From 1219s to 🏎️ 1s and Reduced CPU Usage from 🔥 30% to ❄️ 1%
Introduction
Welcome to our exciting journey of optimizing Django database queries! We started with a 🐌 slow function that took a staggering 1219 seconds to analyze data and consumed 30% of our CPU. After venturing through the realm of Django optimizations, we conquered our challenges and achieved a ⚡ lightning-fast 1-second execution time and reduced our CPU usage to a cool 1%.
Let's dive into the adventure and learn from our experiences! 🚀
Context 🎯
At our 💲fintech company💲, we often face the challenge of analyzing vast datasets spanning thousands of rows across multiple tables. In one particular case, we needed to generate insightful statistics on the number of active accounts and accounts with positive balances for each account category.
In our context, accounts function much like bank accounts: users send and receive money through transactions. An account is considered active if it has at least two transactions in a month, either incoming or outgoing.
Similar to banks 🏦, which offer various types of accounts such as savings, fixed deposits, and more, we also categorize each account into multiple categories. Our goal was to count all the accounts belonging to a specific category, identify the active accounts within that category, and determine the accounts with positive balances.
🐌 The Initial Query
Our first, naive approach to solving this problem was straightforward: fetch all the records and loop through each one with a simple for loop. Although this method was easy to implement, the performance was abysmal, since it had to traverse thousands of records one by one, an O(n) pass done entirely in Python. It took a mind-boggling 1219 seconds to analyze the data, and CPU usage soared to 30%.
Clearly, we needed a better solution.
def get_active_accounts(account_queryset, transaction_queryset):
    # ... (code to count accounts by type) ...
    result = ACTIVE_ACCOUNTS_CATEGORIES
    for account in account_queryset:
        # ... (code to check for active accounts and positive balance) ...
    return result
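To make the problem concrete, here is a minimal sketch of what such a per-account loop can end up looking like. The field and relation names (balance, commission_balance, transaction_account_from, transaction_account_to, created_dt, account_type.category) are assumptions borrowed from the snippets later in this post, not our exact code; the point is that every account triggers extra queries, the classic N+1 pattern.

# Hypothetical sketch of the naive loop (names assumed, see note above).
# Every account triggers two extra COUNT queries, so the database is hit
# thousands of times per run (the classic N+1 problem).
from collections import defaultdict

def get_account_stats_naive(account_queryset, month, year):
    result = defaultdict(lambda: {"count": 0, "positive_balance": 0, "active": 0})
    for account in account_queryset.select_related("account_type"):
        stats = result[account.account_type.category]
        stats["count"] += 1
        if account.balance > 0 or account.commission_balance > 0:
            stats["positive_balance"] += 1
        tx_count = (
            account.transaction_account_from.filter(
                created_dt__month=month, created_dt__year=year
            ).count()
            + account.transaction_account_to.filter(
                created_dt__month=month, created_dt__year=year
            ).count()
        )
        if tx_count >= 2:
            stats["active"] += 1
    return dict(result)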
🔍 Exploring Django Annotations
Determined to optimize our query, we turned to Django annotations. Annotations let us perform complex aggregations on our querysets without Python loops. We started by annotating the queryset with, for each category, the total number of accounts and the number of accounts with a positive balance, which gave us a far more efficient way to compute those counts.
from django.db.models import Case, Count, F, Q, When

account_data = account_queryset.values("account_type__category").annotate(
    count=Count("id"),
    positive_balance=Count(
        Case(
            When(
                Q(balance__gt=0) | Q(commission_balance__gt=0),
                then=F("id"),
            )
        )
    ),
)
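Because of the values() call, evaluating this queryset yields one dictionary per category rather than model instances, so the final result can be assembled without touching individual accounts. A small sketch of how we might read it (the keys follow the annotation names above):

# Each row looks like
# {"account_type__category": "...", "count": 42, "positive_balance": 17}
stats_by_category = {
    row["account_type__category"]: {
        "count": row["count"],
        "positive_balance": row["positive_balance"],
    }
    for row in account_data
}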
However, when we tried to calculate the active accounts using annotations, we hit a roadblock. The query we needed was more complex than what Django annotations could handle: we ran into errors, and our attempts to compute the active-account count this way were in vain.
from django.db.models import (
    BooleanField, Case, Count, F, IntegerField, Q, Value, When
)

# Annotate the queryset with the number of transactions for each account
account_data = account_queryset.annotate(
    from_transaction_count=Count("transaction_account_from", distinct=True),
    to_transaction_count=Count("transaction_account_to", distinct=True),
)

# Calculate the active accounts
account_data = account_data.annotate(
    active=Case(
        When(
            Q(from_transaction_count__gte=2) | Q(to_transaction_count__gte=2),
            then=Value(True),
        ),
        default=Value(False),
        output_field=BooleanField(),
    )
)

# Group the queryset by account_type__category and aggregate the counts
account_data = account_data.values("account_type__category").annotate(
    count=Count("id"),
    positive_balance=Count(
        Case(
            When(
                Q(balance__gt=0) | Q(commission_balance__gt=0),
                then=F("id"),
            )
        )
    ),
    active_count=Count(
        Case(
            When(active=True, then=F("id")),
            output_field=IntegerField(),
        )
    ),
)
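When an annotation chain like this starts misbehaving, inspecting the SQL that Django actually generates is a quick way to see where the joins and grouping go wrong. This is plain Django, nothing project-specific:

# Print the SQL Django will run for the combined annotation; useful for
# spotting unexpected JOINs and GROUP BY clauses introduced by stacked
# annotations.
print(account_data.query)

# Or, after the query has executed, look at what was actually sent
# (requires DEBUG = True in settings).
from django.db import connection
print(connection.queries[-1]["sql"])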
The Turning Point: Raw SQL to the Rescue 🧪
Realizing that we had reached the limits of Django annotations, we took a different approach: raw SQL. By constructing a raw SQL query, we could harness the full power of our database engine and sidestep the constraints of Django annotations.
SELECT at.type_id,
       COUNT(*) AS total_count,
       SUM(CASE WHEN bal + comm_bal > 0 THEN 1 ELSE 0 END) AS pos_bal_count,
       SUM(CASE WHEN (SELECT COUNT(*)
                      FROM tx t
                      WHERE (t.trans_from_id = acc.id OR t.trans_to_id = acc.id)
                        AND EXTRACT(MONTH FROM t.created_dt) = %s
                        AND EXTRACT(YEAR FROM t.created_dt) = %s) >= 2
                THEN 1 ELSE 0 END) AS active_acc_count
FROM accounts acc
JOIN account_types at ON acc.account_type_id = at.type_id
GROUP BY at.type_id;
And it worked! Our raw SQL query delivered the results we needed with incredible speed. It took a mere 1 second to analyze the data, and the CPU usage plummeted to below 1%. We had achieved our goal of dramatically improving the performance of our query.
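For reference, a parameterised query like the one above can be executed directly through Django's database connection. The sketch below assumes the same table and column names as the SQL snippet and simply maps each row to a dictionary; the month and year are passed in as the two %s parameters:

from django.db import connection

ACCOUNT_STATS_SQL = """
    SELECT at.type_id,
           COUNT(*) AS total_count,
           SUM(CASE WHEN bal + comm_bal > 0 THEN 1 ELSE 0 END) AS pos_bal_count,
           SUM(CASE WHEN (SELECT COUNT(*)
                          FROM tx t
                          WHERE (t.trans_from_id = acc.id OR t.trans_to_id = acc.id)
                            AND EXTRACT(MONTH FROM t.created_dt) = %s
                            AND EXTRACT(YEAR FROM t.created_dt) = %s) >= 2
                    THEN 1 ELSE 0 END) AS active_acc_count
    FROM accounts acc
    JOIN account_types at ON acc.account_type_id = at.type_id
    GROUP BY at.type_id;
"""

def get_account_stats(month, year):
    # Run the raw query and return one dict per category, keyed by column name.
    with connection.cursor() as cursor:
        cursor.execute(ACCOUNT_STATS_SQL, [month, year])
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]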
Lessons Learned 📚
Throughout this journey, we gained valuable insights into optimizing Django database queries. We learned that:
- Django annotations are powerful but have their limitations. They're excellent for simple aggregations and calculations, but when dealing with more complex scenarios, raw SQL queries may be the better option.
- Raw SQL queries unlock the full potential of the database engine, enabling more complex and efficient calculations.
- Always consider the trade-offs between code readability and performance. While Django annotations offer a higher level of abstraction and readability, raw SQL queries can provide better performance in certain cases.
🎉 Conclusion
Our quest for optimizing Django database queries was a thrilling adventure full of twists, turns, and valuable lessons. We started with a sluggish function that took an agonizing 1219 seconds to execute and managed to optimize it down to a blazing-fast 1-second execution time, reducing CPU usage from 30% to a mere 1%.
While we achieved dramatic improvements in our query performance, there are always opportunities to push the boundaries further. In the spirit of continuous learning and improvement, here are a few suggestions on how we could refine our analysis and provide an even more efficient solution:
- Database indexing: Ensure that the database has proper indexes on the columns that are frequently used in filter and join operations; this can significantly speed up query execution times (see the sketch after this list).
- Caching: Implement caching strategies for frequently accessed data or aggregated results. This can help reduce the load on the database and provide faster response times for repetitive calculations.
- Batch processing: If the dataset is continually growing, consider breaking the analysis into smaller batches, processing them independently, and then aggregating the results. This can help distribute the workload and improve overall performance.
- Materialized views: Depending on your database engine, you might be able to leverage materialized views to store pre-computed results of complex queries. This can dramatically speed up query execution times for frequently accessed aggregated data.
- Asynchronous processing: For non-time-critical analysis, consider offloading the processing to an asynchronous task queue, such as Celery. This can help distribute the workload across multiple workers and improve the overall responsiveness of your application.
- Parallel processing ⚙️: If your database supports parallel query execution, consider using parallel processing to further speed up the analysis. This can help distribute the workload across multiple cores or processors and improve query performance.
- 📈 Analyzing on the go and storing the results: Instead of processing the data periodically or on-demand, you can incrementally update the statistics as new transactions occur. For instance, when a transaction happens, you can increment the active accounts counter by one if the account meets the criteria. This approach allows you to maintain up-to-date statistics with minimal processing overhead.
- Leveraging specialized tools for large-scale analytics: As our dataset grows, it may become increasingly challenging to manage and analyze the data using traditional relational databases. In such cases, you can consider using specialized tools like Google BigQuery or Apache Spark, which are designed to handle large-scale analytics efficiently. These tools can help you perform complex analysis with greater speed and scalability.
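To make the indexing suggestion concrete, here is a sketch of how composite indexes could be declared in Django for a hypothetical Transaction model; the model and field names are illustrative assumptions, chosen to line up with the correlated subquery in the raw SQL above:

from django.db import models

class Transaction(models.Model):
    trans_from = models.ForeignKey(
        "Account", related_name="transaction_account_from", on_delete=models.CASCADE
    )
    trans_to = models.ForeignKey(
        "Account", related_name="transaction_account_to", on_delete=models.CASCADE
    )
    created_dt = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # Let the per-account, per-month transaction count be answered
            # from an index instead of a sequential scan.
            models.Index(fields=["trans_from", "created_dt"]),
            models.Index(fields=["trans_to", "created_dt"]),
        ]

After running makemigrations and migrate, an EXPLAIN on the raw query can confirm whether the planner actually uses the new indexes.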
If you're facing similar challenges in your projects, we hope our journey can inspire you to explore various optimization techniques and find the perfect balance between code readability and performance. Remember, there's always room for improvement, and with the right approach, you can achieve amazing results! 🚀