
Python performance trap

Python and I, we've been together for so long. We love its development speed and rich ecosystem, but as our systems scale, we hit a performance wall. The wall itself is no surprise; the frustrating part is the journey of trying to fix it. Let's walk through a few scenarios you may have already encountered, from the "obvious" fixes to the rabbit holes that only make things worse.

The Sidecar "Solution" and its Latency Trap

Imagine you have a small service in a microservice architecture, processing a non-stop stream of messages from a queue. Somehow it can't keep up with the queue's throughput, and now you must find a way to optimize it. Looking into it, a few factors stand out. Obviously, Python's blocking I/O is the main issue here. If you look at how a typical Pub/Sub library pulls messages, it usually polls the broker's API at intervals and then spawns a thread to process each message individually. Async support in Python is nothing new, but async adoption in message-consumer libraries is still spotty; most of them still brute-force their way through with threads and raw machine power. So you run into the infamous problem every message pipeline faces: backpressure. Your consumer's capacity doesn't match the incoming throughput, and the backlog of waiting messages keeps growing.
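Here's a minimal sketch of that poll-and-spawn pattern in plain Python; `fetch_batch` and `handle` are hypothetical stand-ins for the broker client and your business logic:

```python
import threading
import time

def fetch_batch(max_messages: int = 100) -> list:
    """Pretend broker call: returns up to max_messages pending messages."""
    return []

def handle(message) -> None:
    """Your business logic. In CPython, CPU-bound work here still fights
    over the GIL no matter how many threads are running."""

def consume_forever(poll_interval: float = 1.0) -> None:
    while True:
        for message in fetch_batch():
            # One thread per message: cheap to spawn, but if handle() can't
            # keep pace with fetch_batch(), in-flight threads pile up and
            # the queue backlog grows anyway -- that's the backpressure.
            threading.Thread(target=handle, args=(message,), daemon=True).start()
        time.sleep(poll_interval)
```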

So how about a hybrid programming model, where Python handles the business logic and a native message consumer does the heavy lifting? Sounds like a dream. You introduce a sidecar container that consumes messages from the broker, while on the Python side you write your processing logic as a FastAPI service. What could be bad about that? Awesome async support, plus a fast, native consumer. Until you find out the latency has become terrible and the original problem is still there. The round trip is now split into multiple hops, with a new leg from the sidecar to your FastAPI service, and even a localhost HTTP call isn't free. Most pull-based consumer libraries batch messages before forwarding them, so your FastAPI workers sit idle waiting for the next batch. And the native consumer's capacity still far exceeds what the FastAPI servers can absorb, so the backpressure is still there. You've overcomplicated your workload without actually solving the problem.
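To make the extra hop concrete, here's a sketch of the Python half of that setup. The route name and payload shape are made up; the sidecar is assumed to batch messages and POST them here:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MessageBatch(BaseModel):
    messages: list[str]

@app.post("/consume")
async def consume(batch: MessageBatch) -> dict:
    for message in batch.messages:
        ...  # business logic goes here
    # Every batch now pays for JSON serialization plus a local HTTP round
    # trip before the broker even sees an ack.
    return {"processed": len(batch.messages)}
```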

The Cython/Rust Rabbit Hole

How about keeping everything in Python, but optimizing your Python code? You find out I/O-bound problems can be handled with async programming and some caching, but on CPU-heavy work, Python falls flat. This is well known, and many workarounds exist. You discover Cython, and the underlying technology, C extensions: write your CPU-heavy code in a native language and call it from Python.
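As a taste, here's a minimal sketch using Cython's pure-Python mode: the file runs as ordinary Python, and compiling it with `cythonize` turns the typed loop into C. The function and the math are illustrative only:

```python
import cython

@cython.cfunc
def _score(x: cython.double) -> cython.double:
    # With C-typed arguments, this compiles down to a plain C function call.
    return x * x - 2.0 * x + 1.0

def total_score(values: list) -> float:
    # Typed locals let Cython generate a tight C loop here.
    i: cython.Py_ssize_t
    acc: cython.double = 0.0
    for i in range(len(values)):
        acc += _score(values[i])
    return acc
```

The catch the next paragraph describes: this only looks like Python. Getting real speedups means learning which constructs Cython can actually lower to C, and which ones silently fall back to interpreter calls.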

Hey, this sounds good, but let's talk about Cython first. Now you have to learn a hybrid of Python and C, with its own odd quirks, and experiment endlessly just to actually speed up your codebase. Not everything can be Cythonized, and it takes real practice to shrink the memory footprint and learn the techniques for releasing the GIL so your C-extension code can escape the interpreter's grip. Eventually you realize your codebase has become a mix your team can't understand: hard to maintain, with multiple technologies in one repository complicating the architecture.

It's the same story with PyO3, the Rust library that lets you write your CPU-heavy code in Rust, compile it, and call it from Python. Now you've tangled many technologies together: two toolchains, the Python GIL, the Rust borrow checker, and two async runtimes (Python's event loop and Tokio). You've overcomplicated the design, only to fail at scaling up. This approach only pays off when you need to speed up one small, well-isolated part of your code. For example, your Python worker fetches data from the network, processes it by merging multiple documents, and sends the result out. There, you can swap the pure-Python merge for a Rust-based one. But that's about as far as it goes.
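If you do go down this road, the sane shape looks something like the sketch below: one narrow boundary around the hot path, with a pure-Python fallback. `fast_merge` is a hypothetical compiled module, not a real package:

```python
try:
    from fast_merge import merge_documents  # hypothetical PyO3/Cython module
except ImportError:
    def merge_documents(docs: list[dict]) -> dict:
        # Pure-Python fallback: correct but slow. It keeps the codebase
        # usable for teammates without the Rust/Cython toolchain.
        merged: dict = {}
        for doc in docs:
            merged.update(doc)
        return merged

def process(docs: list[dict]) -> dict:
    # I/O and orchestration stay in Python; only the hot loop is swappable.
    return merge_documents(docs)
```

The point of the boundary is containment: the native code stays behind one function signature, instead of leaking Rust or C concerns into the rest of the application.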

Asking the Right Questions

Then you might think: how can we actually solve the original issue?

Let's be clear: mixing multiple languages together and shouldering their combined complexity is never a workable solution. The pain of Python alone is enough; you don't want to take on more and end up with the worst of both worlds. Sometimes you just want to sit back and think about the good old days. Why did you choose Python? What factors made Python the right option at the time? And now, facing this issue, has some new factor appeared that breaks the deal?

Once you answer those questions, there's one more: is it really a promising idea to rewrite your entire application from Python in a language with much better native performance? Is your team ready to work in Rust or Kotlin, languages they've never touched, and expected to master them within three months?

And to be honest, I can't answer those questions broadly either. Put them in the context of your business, your future scaling plans, and the budget for the work. And remember: what you're trying to solve may be something others have already found an answer for. I wish you the best of luck!

