On a highly transactional Azure PostgreSQL Database customer might face below error in case we overlooked certain important parameters. Here is an example:
Error: Message: I/O error on GET request for “http://10.xxx.xxx.xxx:31653/getSearchxxx”: Connection timed out (Connection timed out); nested exception is java.net.ConnectException: Connection timed out (Connection timed out).

Customer Environment
- Application hosted in containers using Azure Kubernetes
- Total Pods: 200 (Number of connections to DB varies depending on POD functionality)
- Database SKU: Azure Database for PostgreSQL (Memory Optimized 32 Cores)
- Max User Connections Limit: 1982
- Connection Pooling – HikariCP (JDBC)
First have a look at key metrics on App and network side to rule out any issue.
- Check for packet any drops.
- Perform TCPdump and Nping analysis.
- Check for any handshake issue\Packet Drops
- Check for Latency here in this scenario, Latency was observed.
- Check the response time from App to DB

Let’s have a look at some database key metrics.



If we look closely here, we can see that there is an abrupt increase in ‘short-lived’ connections. Here we can see that this is happening for a shorter span of time. This is typically the case with highly transactional applications, where the client opens a connection, executes a simple query (read/write), and then closes the connection. If your application is designed around ‘short-lived’ connections and you expect many queries from different client sessions\(AKS)Pods you might end up in a situation of connection timeout. The maximum number of connections short or long-lived is decided by the max_connections server parameter which further depends on the SKU for example Memory Optimized with 32 Cores support 1982 concurrent connections, now when this limit is reached the server will not accept any new connections until one of the existing clients disconnect which case outage in the application. This will also lead to high CPU, connection latency and this time spent on processing connections and disconnections can affect the overall database performance.
How to find short lived connection via KQL
Check if Azure PostgreSQL is sending diagnostic information to Log Analytics if not configure it to send to Log analytics.
Logs – Azure Database for PostgreSQL – Single Server | Microsoft Learn
AzureDiagnostics
| where ResourceProvider =="MICROSOFT.DBFORPOSTGRESQL"
| where Category == "PostgreSQLLogs"
| where TimeGenerated >= ago(2d)
| where Message contains "disconnection: session time"
| extend pgmessage = tostring(split(Message, "disconnection: session time: ")[-1])
| extend myuser = tostring(split(tostring(split(pgmessage, " database=")[-2]), " user=")[-1])
|extend hours = todecimal(substring(pgmessage, 0, 1))
| extend minutes = todecimal(substring(pgmessage, 2, 2))
| extend seconds = todecimal(substring(pgmessage, 5, 2))
| extend milliseconds = todecimal(substring(pgmessage, 7, 4))
| extend connection_life_seconds = hours*60*60+minutes*60+seconds+milliseconds
| where myuser != 'azure_superuser'
| extend connection_type = case(connection_life_seconds < 60 , strcat("Short Live Connection"), connection_life_seconds between (60 .. 1200) , strcat("Normal Live Connection"),connection_life_seconds >1200, strcat("Long Live Connections"), "")
| summarize max(connection_life_seconds) by TimeGenerated,connection_type,myuser
| render timechart
What to do when we identified short lived connection?
Check if the application is using any kind of connection pooling.
Connection Pooling: Connection pooling is the process of having a pool of active connections on the backend servers. These can be used any time a user sends a request. Instead of opening, maintaining, and closing a connection when a user sends a request, the server will assign an active connection to the user. This smaller number of database connections helps reduce the context switching and resource contention, and therefore improves the overall performance. This allows you to scale your database connections, while keeping the connection latency low
What connections pooling software\tools are available?
What could possibly be wrong with the connection pooling software?
There could be several configuration issues, during our our analysis on the HikariCP configuration the idle timeout has been set in seconds\1 min which is very low and defeats the purpose of connection pooling. Here in this case this idle timeout will be applicable as the minimum idle is defined as less than the max pool size. Hence, we will not be able to use the connection pooling effectively and thus application must spawn new connection object every time we make a request from the AKS PODS
This incorrect configuration of idle timeout is the actual root
For Example
50 Pods making 10 requests\seconds and in the absence of no connection to reuse in the pool new connection objects will be created which will lead to a queue at the gateway resulting into connection timeout.



Try changing the idle timeout value to 5 Min in Hikari Configuration to ensure connections can be reused.
Workaround Results: After configuring the connection pooling optimally ensuring the connections remains there in the system for some time and can be reused and here you can see that blue line of short-lived connection almost disappear.

Learnings
- In case of large\highly transactional applications make sure that we use PgBouncer (recommended)\Hikari for connection pooling.
- In case any Open source like HikariCP check the configuration properly
- Plan to migrate to Flexible Server because that provides built-in connection pooling using PgBouncer
- Check if the application is creating too many short-lived connections.