Tuesday, August 16, 2011

SQL query performance using a select on a large record set with Hibernate

This title is partly a question because I am unsure that my solution is the best. I have come to the following solution based on trial and error, and settling on what seemed to have worked.

The Database

The database is MS SQL 2008 Enterprise, we are using the correct JDBC driver, and are using Hibernate 3.4.0.GA. The database is also located on the same physical machine as the application server.

The Table

There is only a single table that is the target of the problem, which has the following structure:

Foo

  • ID (UUID Primary Key)
  • A (int, indexed)
  • B (datetime, indexed)
  • C (varchar, indexed)
  • D (int, not indexed)

In this table there are slightly over 2.5 million records, and I am trying to select less than 10 results.

The Query

This is the query that was slow, taking over 5 minutes to complete (in the SQL Variant). Take into account though that the different columns were used under different circumstances, making the query dynamic. The more columns used, the slower the query. In the case of using every column it looked like this:

SQL:

SELECT ID, A, B, C, D

FROM FOO

WHERE A = 1 AND B >= ‘2011-08-12’ AND C = ‘BAR’ AND D = 3

Hibernate:

Query query = em.createQuery("from Foo where a=:a and b >= :b and c = :c and d = :d”);

query.setParameter(“a”, 1);

query.setParameter(“b”, date);

query.setParameter(“c”, “BAR”);

query.setParameter(“d”, 3);

List<Foo> list = query.getResultList();

The Obvious

Column D isn’t indexed. Column D isn’t also used all of the time, and removing it from the query only cuts the 5 minute time by 30 seconds. There is also the preference that I do not change the existing database unless absolutely the only option.

There are also probably better ways to deal with the datetime column type, for example see http://diegworld.blogspot.com/2010/09/sql-server-performance-querying-by-date.html. However, if I remove this column from the query it doesn’t change the amount of time it takes.

The SQL Solution

Eventually I just started playing around with the SQL variant of the query to see if I could improve the performance. I started with running select statements with each individual column and found that they all worked in a second, for example:

SELECT ID, A, B, C, D, FROM FOO WHERE A = 1

SELECT ID, A, B, C, D, FROM FOO WHERE B >= ‘2011-08-12’

SELECT ID, A, B, C, D, FROM FOO WHERE C = ‘BAR’

SELECT ID, A, B, C, D, FROM FOO WHERE D = 3

At this point I realized that the problem has to do with all of the restrictions I added in the where clause, but changing the order of the “ands” didn’t improve performance either. With this information I then tried building the query so that it would incorporate sub queries. The idea being that if I have select with A I have set A, in which I can select with B from that and have Set AB, select C from that and have Set ABC, and select D from that and have Set ABCD. The resulting Set of ABCD would contain only records which had the A, B, C, and D criteria.

The resulting SQL that I came up with was the following:

SELECT ID, A, B, C, D FROM FOO fooA WHERE fooA.ID IN (

SELECT ID FROM FOO fooB WHERE fooB.ID IN (

    SELECT ID FROM FOO fooC WHERE fooC.ID IN (

          SELECT ID FROM FOO fooD WHERE D = 3

    ) AND fooC.C = ‘BAR’

) AND fooB.B >= ‘2011-08-12’

) AND fooA.A = 1

This query took less than a second to run, as opposed to the previous variant that was taking more than 5 minutes.

My question is though, is this the best thing to do from both a SQL and MSSQL perspective?

The Hibernate Implementation

Hibernate has a lot of flexibility in terms of how you can do sub queries, but according to my research on the internet the preferred means for doing this is to use Criteria and DetachedCriteria. Here are some places which I used to come to this conclusion:

The way this works in Hibernate is as follows:

Session session = (Session) getEntityManager().getDelegate();

Criteria crit = session.createCriteria(Foo.class);

// First subquery to get A

DetachedCriteria query1 = DetachedCriteria.forClass(Foo.class);

query1.add(Restrictions.eq("a", 1));

query1.setProjection(Projections.property("id"));

crit.add(Subqueries.propertyIn("id", query1));

// Second subquery to get B

DetachedCriteria query2 = DetachedCriteria.forClass(Foo.class);

query2.add(Restrictions.ge("b", date));

query2.setProjection(Projections.property("id"));

crit.add(Subqueries.propertyIn("id", query2));

// Third subquery to get C

DetachedCriteria query3 = DetachedCriteria.forClass(Foo.class);

query3.add(Restrictions.eq("c", “BAR”));

query3.setProjection(Projections.property("id"));

crit.add(Subqueries.propertyIn("id", query3));

// Fourth subquery to get D

DetachedCriteria query4 = DetachedCriteria.forClass(Foo.class);

query4.add(Restrictions.eq("d", 3));

query4.setProjection(Projections.property("id"));

crit.add(Subqueries.propertyIn("id", query4));

List<Foo> results = crit.list();

Each subquery reads as “get the id’s from Foo where the restriction holds true”, and the criteria represents the combined result of all of the added subqueries.

This isn’t the exact equivalent of the previously specified SQL though. When looking at the HQL debug output it ends up being the following:

SELECT ID, A, B, C, D

FROM Foo

WHERE ID in (SELECT ID FROM Foo WHERE A = 1)

AND ID in (SELECT ID FROM Foo WHERE B >= ‘2011-08-12’)

AND ID in (SELECT ID FROM Foo WHERE C = ‘BAR’)

AND ID in (SELECT ID FROM Foo WHERE D = 3)

This is better than what was there, but it still isn’t as efficient as it could be. In this code we are selecting 4 different times across the entire Foo record set. In the SQL I am trying to write each time we select it is from the previous subset. For example we are looking for condition B inside the A result set,which is smaller than all of the Foo records.

As it turns out, it isn’t that hard. You just add your subqueries to other subqueries like this:

Session session = (Session) getEntityManager().getDelegate();

Criteria crit = session.createCriteria(Foo.class);

 

// First subquery to get A

DetachedCriteria query1 = DetachedCriteria.forClass(Foo.class);

query1.add(Restrictions.eq("a", 1));

query1.setProjection(Projections.property("id"));

crit.add(Subqueries.propertyIn("id", query1));

 

// Second subquery to get B

DetachedCriteria query2 = DetachedCriteria.forClass(Foo.class);

query2.add(Restrictions.ge("b", date));

query2.setProjection(Projections.property("id"));

query1.add(Subqueries.propertyIn("id", query2));

 

// Third subquery to get C

DetachedCriteria query3 = DetachedCriteria.forClass(Foo.class);

query3.add(Restrictions.eq("c", “BAR”));

query3.setProjection(Projections.property("id"));

query2.add(Subqueries.propertyIn("id", query3));

 

// Fourth subquery to get D

DetachedCriteria query4 = DetachedCriteria.forClass(Foo.class);

query4.add(Restrictions.eq("d", 3));

query4.setProjection(Projections.property("id"));

query3.add(Subqueries.propertyIn("id", query4));

 

List<Foo> results = crit.list();

This hibernate code results in the desired SQL:

SELECT ID, A, B, C, D FROM FOO fooA WHERE fooA.ID IN (

SELECT ID FROM FOO fooB WHERE fooB.ID IN (

    SELECT ID FROM FOO fooC WHERE fooC.ID IN (

          SELECT ID FROM FOO fooD WHERE D = 3

    ) AND fooC.C = ‘BAR’

) AND fooB.B >= ‘2011-08-12’

) AND fooA.A = 1

2 comments:

Kris Hofmans said...

I just heard some wise words from a great dba regarding this article:
1) gather statistics
2) if 1) fails, switch database

Unknown said...

Table----->
Employee(emp_id,emp_name,address,dept_id,salary)
Department(dept_id,name)

select emp_id,emp-name
from Employee
where emp_id in(select max(salary),emp_id from Employee group by emp_id)


please give me Criteria query corresponding to this, using detached criteria query for subquery.
Thanks in advance

Contributors