Performance Issues with Python, PostgreSQL and psycopg2

In a previous post, I wrote about some of the things I found annoying with the standard Python DB-API and wrote a wrapper class to provide a general abstraction layer which could be used against any database.

So far I have used the insertBatch method against SQLite, DB2 and Sybase and have never had any issues with performance. Recently I was testing the code against PostgreSQL with psycopg2 as the driver, and the insert performance was abysmal. Reading through the psycopg2 documentation, the authors admit that the executemany method in their driver does not perform well. In fact, the following statement appears in the online documentation:

Warning: In its current implementation this method is not faster than executing execute() in a loop.

Oh dear, this isn't good news. What can be done to keep the API looking the same while working around the poor performance of psycopg2? Well, the documentation mentions a different function called execute_values(); let's make a few changes to the code to use it.
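For reference, execute_values() lives in psycopg2.extras and takes a cursor, a statement with a single %s placeholder for the VALUES clause, and an iterable of row tuples. A minimal sketch of calling it on its own (the connection details and the users table here are made up for illustration):

import psycopg2
import psycopg2.extras

# Placeholder connection details, for illustration only
conn = psycopg2.connect(host="localhost", database="testdb",
                        user="hugo", password="pass")

rows = [("alice", "555-0100"), ("bob", "555-0101")]

with conn.cursor() as cur:
  # Note the single %s: execute_values expands it into a
  # multi-row VALUES list, sent in far fewer round trips
  psycopg2.extras.execute_values(
    cur, "insert into users (name, phonenumber) values %s", rows)

conn.commit()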

Firstly, in the previous version we were just using SQLite and left different imports / databases as an exercise for the reader, suggesting that an Enum or similar could be passed in and the class could then import the necessary modules.

Again, to keep things simple and to avoid the helper class having to know too much implementation detail, or relying on importing modules which may not be installed, we will use a generic approach. Firstly we will change the constructor. In the previous version it looked like this:


import sqlite3

class DBConn(object):

  def __init__(self):
    self.conn = sqlite3.connect(':memory:')

Let's alter this to accept a function passed in by users of our helper class.

class DBConn(object):
  def __init__(self, connectionFunc):
    self.conn = connectionFunc()

Firstly, note that we have removed the "import sqlite3" line. We have changed the __init__ method to accept a second parameter, which will be a function, and we assign its return value to our internal connection object.

A sample usage of our helper class with SQLite would be:

def mySqlLiteConnection():
  import sqlite3
  return sqlite3.connect(':memory:')

with DBConn(mySqlLiteConnection) as db:
  pass  # do something with the connection

In the case above we are making the user responsible for any imports. The function they pass in should return the connection object. A sample with PostgreSQL would be:


def postgres():
  import psycopg2
  d = {
    "host": "MacMini.local",
    "database": "testdb",
    "user": "hugo",
    "password": "pass"
  }
  return psycopg2.connect(**d)

with DBConn(postgres) as db:
  pass  # do something with the connection

Now that we have seen a way of delegating the responsibility of importing drivers to the user, we have a generic class which takes care of resource management (e.g. closing cursors and connections), lets the user access result-set fields by name, and performs inserts from a generator, regardless of the underlying differences in drivers or implementation details.
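As a reminder, the resource management mentioned above comes from making the class a context manager. The full version is in the previous post; a rough sketch of what the __enter__ / __exit__ pair inside DBConn might look like (an approximation, not the exact original code):

  def __enter__(self):
    return self

  def __exit__(self, excType, excValue, tb):
    # Commit on a clean exit, roll back if an exception escaped,
    # then always close the underlying connection
    if excType is None:
      self.conn.commit()
    else:
      self.conn.rollback()
    self.conn.close()
    return False  # let any exception propagate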

Let's make a small tweak to our class to check whether the driver is psycopg2. Since we are not responsible for the drivers, the user could be using anything, so we check the sys.modules dictionary to see what has been imported.

import sys
from itertools import islice

class DBConn(object):
  def __init__(self, connectionFunc):
    self.conn = connectionFunc()
    self.isPsycopg2 = False
    # Check whether the psycopg2 driver has been imported by the user
    if 'psycopg2' in sys.modules:
      import psycopg2.extras as psycopg2Extras
      self.psycopg2Extras = psycopg2Extras
      ext = sys.modules['psycopg2.extensions']
      self.isPsycopg2 = isinstance(self.conn, ext.connection)

The constructor now has some special checks for psycopg2. Let's amend the insertBatch method to address the performance issue:


def insertBatch(self, sql, it, size=5000):
  for batch in DBConn.chunk(it, size):
    with self.ClosingCursor() as c:
      if self.isPsycopg2:
        # Fast path: one multi-row statement per batch instead of one per row
        self.psycopg2Extras.execute_values(c, sql, batch)
      else:
        c.executemany(sql, batch)

The insertBatch method will now use execute_values if the driver is psycopg2; otherwise it will fall back to the standard executemany method.
The only other thing to note is that when using execute_values the insert statement looks slightly different. With executemany a SQL statement might look like:

insert into users (name, phonenumber) values (%s,%s)

With execute_values it should be:

insert into users (name, phonenumber) values %s
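Putting it together, a call against PostgreSQL might look like the below, with the SQL written in the execute_values form (the users table and the row generator here are invented for the example):

def rows():
  # Any iterable of tuples works; a generator keeps memory use flat
  for i in range(100000):
    yield ("user%d" % i, "555-%04d" % (i % 10000))

with DBConn(postgres) as db:
  db.insertBatch("insert into users (name, phonenumber) values %s", rows())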

We can abstract that away later so the user doesn’t know about such things.
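One way that could work (a sketch of the idea, not code from this post) is a small helper that rewrites the executemany-style VALUES clause into the single placeholder execute_values expects:

import re

# Hypothetical helper: collapse "values (%s, %s, ...)" into "values %s"
def adaptForExecuteValues(sql):
  return re.sub(r"values\s*\((\s*%s\s*,?\s*)+\)", "values %s", sql,
                flags=re.IGNORECASE)

print(adaptForExecuteValues(
  "insert into users (name, phonenumber) values (%s,%s)"))
# insert into users (name, phonenumber) values %s

insertBatch could then run the incoming statement through this helper whenever isPsycopg2 is true, so callers only ever write the executemany form.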
