Dani's IT Blog

…moving from Oracle to Software Development



I just learned about InfoQ. I especially enjoy the great presentations, which you can watch in the Educational Content section.

I stumbled upon the page while looking for design patterns in functional programming. Aino Vonge Corry gave an interesting and funny talk, Functional Design Patterns, at QCon.

As a Vim fan, I enjoyed Bill Odom’s
Vim – From Essentials to Mastery.


Written by danirey

April 27, 2012 at 02:45

Oracle Exadata performance revealed – SmartScan – Part V


As I said in my first post, Oracle can optimize queries through row and column filtering. Until now, I have only shown how row filtering works.

Column filtering is an interesting way to reduce the amount of data that has to be transferred. Most queries need only a subset of a table's columns. If we need only two out of twenty columns and the storage sends all of them to the database, then roughly 90% of the received data is immediately thrown away. The request the database sends to the storage cells specifies the needed rows and columns, and based on this information the Exadata Cells send back data only for the needed columns. As we already know, Oracle uses the result set as the container for this data, so sending only specific columns is fairly easy. Enough theory, let's have a look.
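As a back-of-the-envelope sketch (the table shape and numbers are made up for illustration, not taken from the measurements in this post): with twenty equally wide columns of which only two are needed, block-level I/O ships far more data than column filtering does.

```python
# Back-of-the-envelope sketch of the savings from column filtering.
# The table layout and sizes are invented for illustration only.

def bytes_shipped(rows, col_bytes, selected):
    """Bytes sent over the interconnect for the selected columns."""
    return rows * sum(col_bytes[c] for c in selected)

# Hypothetical 20-column table, every column 10 bytes wide.
col_bytes = {f"col{i}": 10 for i in range(20)}
rows = 1_000_000

all_cols = bytes_shipped(rows, col_bytes, list(col_bytes))   # block level: everything
two_cols = bytes_shipped(rows, col_bytes, ["col0", "col1"])  # column filtering

print(f"all columns : {all_cols:,} bytes")
print(f"two columns : {two_cols:,} bytes")
print(f"wasted without filtering: {1 - two_cols / all_cols:.0%}")
```

With equally wide columns the waste is 18/20 = 90%; with real tables the ratio depends on the widths of the selected columns, which is exactly what the measurements that follow show.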

Until now I have always used SELECT * to make sure that column projection does not kick in. The execution plan contains no information about whether column filtering is or could be used; basically, it can always be used if SmartScan is enabled and the database does not request all columns. If we want to know whether column filtering was actually used, we have to look at the session statistics, specifically the value of 'cell physical IO interconnect bytes'.

All queries use the table opitz.huge_customers.

SQL> desc opitz.huge_customers
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      ID                                       NUMBER
    2      FIRST_NAME                               VARCHAR2(50 CHAR)
    3      LAST_NAME                                VARCHAR2(50 CHAR)
    4      COUNTRY                                  VARCHAR2(2 CHAR)
    5      IS_A_CLASS                               VARCHAR2(1)

First I issue a SELECT * to see how much data is transferred when all columns are sent.

SELECT * FROM opitz.huge_customers WHERE is_a_class = 'Y';

Cell physical IO interconnect bytes shows that 340 MB were transferred over the interconnect.

NAME                                        VALUE     WAIT_CLASS
------------------------------------------ ---------- -----------
cell physical IO interconnect bytes        357090752  SQL

What if we select only one column?

SELECT last_name FROM opitz.huge_customers WHERE is_a_class = 'Y';

Of course less data is transferred.

NAME                                        VALUE     WAIT_CLASS
------------------------------------------ ---------- -----------
cell physical IO interconnect bytes        185055792  SQL

When we select all columns, 340MB have to be sent over the interconnect, whereas selecting only the last_name column reduces the amount of data to about 176MB. So roughly half the data needs to be transferred. Can Exadata further reduce the data if we choose a column with a smaller data type?

SELECT is_a_class FROM opitz.huge_customers WHERE is_a_class = 'Y'; 

Yes, as we can see even less data was transferred.

NAME                                        VALUE     WAIT_CLASS
------------------------------------------ ---------- -----------
cell physical IO interconnect bytes         53693936  SQL

This time a bit more than 50MB were transferred. We can clearly see that it matters which columns are transferred. We know that the result set causes some overhead; let's calculate it to get an idea of how efficient it is.

The is_a_class column takes 1 byte to store. We selected 4259648 rows, so 4259648 bytes of payload have to be transferred, plus some additional data so that the database knows how to interpret it. 4259648 bytes is about 4MB, and 51MB were transferred. Therefore the overhead per row was (51MB - 4MB) / 4259648 rows ≈ 11 bytes/row.

Does the overhead change when selecting another column? For the last_name column, we transferred 185055792 bytes for the same number of rows. I know that the last_name column contains only 32-byte values (I filled it with randomly generated 32-byte strings), so 4259648 * 32 bytes = 136308736 bytes, or 130MB, of payload had to be transferred. 176MB minus 130MB is again roughly 47MB for 4259648 rows, and therefore again an overhead of about 11 bytes per row.
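Both overhead estimates can be re-checked with a few lines of Python; the byte counts are the measured interconnect values quoted above, and only the rounding differs from the text:

```python
# Recompute the result-set overhead per row from the measurements above.
rows = 4_259_648

# is_a_class: 1 byte of payload per row
ic_transferred = 53_693_936
ic_overhead = (ic_transferred - rows * 1) / rows

# last_name: 32 bytes of payload per row
ln_transferred = 185_055_792
ln_overhead = (ln_transferred - rows * 32) / rows

print(f"is_a_class overhead: {ic_overhead:.1f} bytes/row")
print(f"last_name  overhead: {ln_overhead:.1f} bytes/row")
```

Both come out between 11 and 12 bytes per row, which matches the rough figure in the text.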

Although this overhead calculation might seem accurate, it isn't. To get a better understanding, more tests would be needed, such as varying the number of columns and the number of rows selected. But I'm quite sure that a good part of this overhead is incurred per row; an overall result-set overhead of more than a few kilobytes does not seem realistic, whereas some additional data is probably needed to separate a VARCHAR2 field from its successor. The main message of this overhead calculation is that an overhead exists, and the optimizer might take this into account when deciding whether to offload or not.

Written by danirey

March 31, 2011 at 06:07

Oracle Exadata performance revealed – SmartScan – Part IV


In my last post I wrote about predicate offloading. I showed that it is possible to offload a filter such as WHERE columnX equals Y. But how useful is this in real life? Most queries are more complex than that, and many use functions such as upper, length, substr, mod, ceil, abs, and so on.

So what happens, if a function is part of a predicate? Does the Exadata Cell know about all of them?

No, but it knows a lot of them. You can use any Oracle 11g database to find out which ones can be handled by the Exadata Cells and which therefore don't prevent offload processing. This information is part of the data dictionary.

SELECT * FROM v$sqlfn_metadata WHERE offloadable = 'YES';

In addition all those functions can be combined with the following operators.


To get an idea of how well this works, I made a simple test using the MOD function, which calculates [first parameter] modulo [second parameter]. As the ID column is unique, the following query returns every 117th row.

SELECT * FROM opitz.huge_customers WHERE mod(id,117) = 0;
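The claim that this predicate keeps every 117th row is easy to sanity-check outside the database, assuming a dense, gap-free ID range (the row count here is made up):

```python
# Sanity check: mod(id, 117) = 0 matches every 117th value of a dense ID range.
total_rows = 1_000_000                       # hypothetical row count
matching = sum(1 for i in range(1, total_rows + 1) if i % 117 == 0)

print(matching)
print(total_rows // 117)   # same number: one match per block of 117 IDs
```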

What did the execution plan look like?

| Id  | Operation                 | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT          |                |   639K|    46M|   196K  (1)| 00:39:17 |
|*  1 |  TABLE ACCESS STORAGE FULL| HUGE_CUSTOMERS |   639K|    46M|   196K  (1)| 00:39:17 |

Predicate Information (identified by operation id):

   1 - storage(MOD("ID",117)=0)

The line storage(MOD("ID",117)=0) in the Predicate Information indicates that offload processing could be used for this query. As we already know, we have to look at the wait events to find out whether offload processing actually was used. As you can see from the first part of the trace file, it was.

PARSING IN CURSOR #2 len=86 dep=0 uid=0 oct=3 lid=0 tim=1289240657920538 hv=3576726616 ad='22674a1e0' sqlid='f4d2btvam112s'
SELECT /*+ GATHER_PLAN_STATISTICS */ * FROM opitz.huge_customers WHERE mod(id,117) = 0
PARSE #2:c=2000,e=2852,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=4233741383,tim=1289240657920537
EXEC #2:c=0,e=23,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4233741383,tim=1289240657920643
WAIT #2: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=1289240657920667
WAIT #2: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=-1 tim=1289240657921078
WAIT #2: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=-1 tim=1289240657921136
WAIT #2: nam='enq: KO - fast object checkpoint' ela= 179 name|mode=1263468550 2=131153 0=1 obj#=-1 tim=1289240657921343
WAIT #2: nam='reliable message' ela= 951 channel context=10055297152 channel handle=10033718256 broadcast message=10096394424 obj#=-1 tim=1289240657922493
WAIT #2: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=-1 tim=1289240657922535
WAIT #2: nam='enq: KO - fast object checkpoint' ela= 88 name|mode=1263468550 2=131153 0=1 obj#=-1 tim=1289240657922650
WAIT #2: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=-1 tim=1289240657922694
WAIT #2: nam='enq: KO - fast object checkpoint' ela= 114 name|mode=1263468545 2=131153 0=2 obj#=-1 tim=1289240657922835
WAIT #2: nam='cell smart table scan' ela= 136 cellhash#=2520626383 p2=0 p3=0 obj#=15043 tim=1289240657931912
WAIT #2: nam='cell smart table scan' ela= 120 cellhash#=88802347 p2=0 p3=0 obj#=15043 tim=1289240657934294
WAIT #2: nam='cell smart table scan' ela= 246 cellhash#=398250101 p2=0 p3=0 obj#=15043 tim=1289240657937276
WAIT #2: nam='cell smart table scan' ela= 1126 cellhash#=2520626383 p2=0 p3=0 obj#=15043 tim=1289240657955228
WAIT #2: nam='cell smart table scan' ela= 1086 cellhash#=88802347 p2=0 p3=0 obj#=15043 tim=1289240657956385
WAIT #2: nam='cell smart table scan' ela= 1085 cellhash#=398250101 p2=0 p3=0 obj#=15043 tim=1289240657957539
WAIT #2: nam='cell smart table scan' ela= 654 cellhash#=2520626383 p2=0 p3=0 obj#=15043 tim=1289240657958306
WAIT #2: nam='cell smart table scan' ela= 205 cellhash#=88802347 p2=0 p3=0 obj#=15043 tim=1289240657958540
WAIT #2: nam='cell smart table scan' ela= 16 cellhash#=398250101 p2=0 p3=0 obj#=15043 tim=1289240657958584
WAIT #2: nam='cell smart table scan' ela= 14378 cellhash#=398250101 p2=0 p3=0 obj#=15043 tim=1289240657974236
WAIT #2: nam='cell smart table scan' ela= 2920 cellhash#=2520626383 p2=0 p3=0 obj#=15043 tim=1289240657977313
WAIT #2: nam='cell smart table scan' ela= 1 cellhash#=88802347 p2=0 p3=0 obj#=15043 tim=1289240657977362
WAIT #2: nam='cell smart table scan' ela= 1 cellhash#=398250101 p2=0 p3=0 obj#=15043 tim=1289240657977379
FETCH #2:c=23996,e=56800,p=94,cr=103,cu=0,mis=0,r=1,dep=0,og=1,plh=4233741383,tim=1289240657977490

But is there also a way to see how much it helped? We already know that the query returns only every 117th row, so only a 117th of the table needs to be returned by the Exadata Cells. Can we verify and confirm this?

Yes, there is such information in the v$sesstat view. Remember that those numbers are cumulative: you have to subtract the values of the statistics taken before running the query from the values taken after it. The following were my results.

physical read bytes = 5891588096 = 5600 MB
cell physical IO interconnect bytes = 47898864 = 45MB

And 5600MB / 117 > 45MB
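The same comparison in a few lines of Python, using the two v$sesstat deltas from above:

```python
# Compare what the cells read from disk with what they shipped to the database.
physical_read_bytes = 5_891_588_096   # 'physical read bytes' delta
interconnect_bytes = 47_898_864       # 'cell physical IO interconnect bytes' delta

mb = 1024 * 1024
reduction = physical_read_bytes / interconnect_bytes

print(f"read from disk  : {physical_read_bytes / mb:,.0f} MB")
print(f"sent to database: {interconnect_bytes / mb:,.0f} MB")
print(f"reduction factor: {reduction:.0f}x")
```

The reduction factor comes out even larger than the 117x the row filter alone would predict.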

So reality is even better than the theory: the overhead of the data blocks divided by 117 was bigger than the overhead of the result set 😀

In my next post, I will write about column projection.

Written by danirey

March 9, 2011 at 07:06

Oracle Exadata performance revealed – SmartScan – Part III


In the first two posts I described what Exadata is and discussed the new concepts introduced with Exadata.

In this post I will show how predicate offloading works and how to analyze whether the database used SmartScan (offload processing) or not.

Let us start with a very simple query that selects all A-class customers from the customers table. There is no index on the is_a_class column, and the table contains approximately 500 million rows.

SELECT * FROM opitz.customers WHERE is_a_class = 'Y';

Let's have a look at the execution plan:

| Id  | Operation                 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT          |           |   500K|    36M|  3100   (1)| 00:00:38 |
|*  1 |  TABLE ACCESS STORAGE FULL| CUSTOMERS |   500K|    36M|  3100   (1)| 00:00:38 |
Query Block Name / Object Alias (identified by operation id):
   1 - SEL$1 / CUSTOMERS@SEL$1
Predicate Information (identified by operation id):
   1 - filter("IS_A_CLASS"='Y')
Column Projection Information (identified by operation id):

This execution plan was produced without SmartScan, although the data resides in an Exadata Cell. That is why we see TABLE ACCESS STORAGE FULL instead of the well-known TABLE ACCESS FULL; the word STORAGE indicates that the operation was executed on Exadata storage. I had set cell_offload_processing = false to force this execution plan. But what does the execution plan look like with cell_offload_processing at its default value, true?

| Id  | Operation                 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT          |           |   500K|    36M|  3100   (1)| 00:00:38 |
|*  1 |  TABLE ACCESS STORAGE FULL| CUSTOMERS |   500K|    36M|  3100   (1)| 00:00:38 |
Query Block Name / Object Alias (identified by operation id):
   1 - SEL$1 / CUSTOMERS@SEL$1
Predicate Information (identified by operation id):
   1 - storage("IS_A_CLASS"='Y')
Column Projection Information (identified by operation id):
26 rows selected.

In the Predicate Information, the line storage("IS_A_CLASS"='Y') was added. This line indicates that offload processing could be used for this query. It is very important to know that the decision whether SmartScan is used or not is a runtime decision! When the optimizer calculates the best execution plan, it only decides whether SmartScan could be used, not whether it is used. This is very specific to Exadata and often misunderstood.

But how can we know whether offload processing was used or not?

Oracle added some new wait events which contain this information. A normal multiblock read from Exadata looks as follows:
WAIT #33: nam='cell multiblock physical read' ela= 7552 cellhash#=398250101 diskhash#=786031317 bytes=40960 obj#=16639 tim=1286372833389835

A neat detail: we can see not only what was executed, but also which disk of which cell was involved, how many bytes were read, and which object was retrieved.

When a SmartScan is done, the ‘cell smart table scan’ event will occur.
WAIT #2: nam='cell smart table scan' ela= 154 cellhash#=2520626383 p2=0 p3=0 obj#=15043 tim=1289240139259306

As we have seen, the execution plan itself is not enough to know whether a SmartScan happened; only the corresponding wait events can tell the truth. In the next post I will write about predicate offloading when functions are used to evaluate the data.

Written by danirey

March 7, 2011 at 18:01

Oracle Exadata performance revealed – SmartScan – Part II


In the last post I gave a brief introduction to Oracle Exadata and a first hint on how the massive performance improvements are achieved.

Let us think through what Oracle had to do to be able to filter data at the storage level.

Normally, communication between Oracle and the storage server happens at the block level: Oracle tells the storage system which blocks it needs, and the storage sends them to the database server. With such communication, the storage cannot know how the data could be pre-filtered. Therefore Oracle had to find a way to send additional information to the storage server, which led to a new protocol called iDB. It is used for the communication between the database server and the storage cells. Among other things, the added data contains the following information:

  • db_unique_name
  • instance name
  • consumer group
  • read mirror copy yes/no
  • sqlid
  • SQL statement or relevant parts of it
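To make this concrete, the extra metadata can be pictured as a small record that travels with each I/O request. This is a purely illustrative sketch; the real iDB wire format is proprietary, and all field names and values here are invented:

```python
# Illustrative only: a toy record with the kind of metadata the post lists.
# The actual iDB protocol is not publicly documented; names are invented.
from dataclasses import dataclass

@dataclass
class SmartScanRequest:
    db_unique_name: str
    instance_name: str
    consumer_group: str
    read_mirror_copy: bool
    sql_id: str
    predicate: str          # the offloadable part of the SQL statement

req = SmartScanRequest(
    db_unique_name="exadb",
    instance_name="exadb1",
    consumer_group="DEFAULT_CONSUMER_GROUP",
    read_mirror_copy=False,
    sql_id="0000000000000",
    predicate="IS_A_CLASS = 'Y'",
)
print(req.predicate)
```

With a request like this, the cell has everything it needs to filter rows before anything crosses the interconnect.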

But the new protocol itself is not sufficient. We now have the SQL statement, or at least parts of it, on the storage cell (I will try to investigate this in more detail in the future). But what next?
Of course, the cell software has to separate needed from unneeded data. To do so, it must be able to read Oracle blocks; therefore the cell software contains parts of the Oracle Database kernel.

To transfer the filtered data back to the database, the iDB protocol is used. But what now? How can we put this information into the buffer cache?

Let us have a look at the structure of an Oracle block. The one shown below contains part of a heap table.

Oracle block containing parts of a heap table

As we can see, the data area is only a part of the block, and by extracting single rows the data can be reduced even further. But how can this information be stored in the buffer cache? Remember, the buffer cache only contains complete Oracle blocks; that's also why the database has to have multiple buffer caches if it uses more than one block size.

Sending the complete block is not what we want to do. What's the alternative?
What about this: we send only the needed rows to the database, and it fills the gaps with NULLs to create a complete block?
This would unnecessarily pollute the buffer cache with NULLs, and the database would have to take special care to never write such a block back to the storage.

Is there a better solution?
Oracle could create an artificial block containing only the needed rows, gathered from multiple blocks. But what would we gain by doing so?
Although the buffer cache would no longer get polluted with NULLs, the database would still have to take special care to never write this block to the storage.

Ok, let's take it one step further. Of what use would these artificial blocks in the buffer cache be?
Only the original query, or maybe some queries selecting subsets of the original one, could ever use these blocks again. Because the data is filtered specifically for one query at row and column level, the chances that it can be reused are very low. Therefore it does not make sense to have such blocks in the buffer cache anyway.

So what’s the solution then?
Once again, Oracle had the chance to reuse some "old" technology (please consider this praise for the good decisions the Oracle developers made in the past). A direct read, usually used for parallel reads, puts a result set into the PGA instead of putting the read blocks into the buffer cache. Because only the result set is saved, no unnecessary data pollutes the memory. By using the PGA instead of the SGA, special memory structures such as a result set are possible, and the memory can be freed as soon as the requester has finished reading it. To write the data into the PGA, the cells use RDMA (remote direct memory access). Without RDMA, additional memory copies would be needed and a processor on the database server would have to take care of them; thanks to RDMA, the database server's processors should not be involved at all.

Now you should have some knowledge about the underlying architecture of SmartScan. In the next post I will show how predicate offloading works.

Written by danirey

March 2, 2011 at 17:56

Oracle Exadata performance revealed – SmartScan – Part I


This is the start of yet another series of blog posts. This time it will be about Oracle Exadata and its SmartScan.

I just arrived back home from an Exadata Database Machine Administration Workshop with Joel Goodman as the trainer. He did a great job; have a look at his blog at oratrain.wordpress.com. Fortunately, after this training I can still stand by what I wrote in a paper about SmartScan for the SOUG (Swiss Oracle User Group) in November last year. Therefore I will put parts of this paper, which was originally written in German, on my blog in English. As it will not be a simple translation of the original text, it might also be interesting to people who have read the original article.

First, let's start with a very brief introduction to Oracle Exadata; readers who already know Exadata can skip this section. The Oracle Exadata Database Machine is a database appliance. It contains some servers that run an Oracle Database, usually in a Real Application Cluster, and others that are used as storage servers. The storage server software is developed by Oracle and optimized for the Oracle Database. All systems are very powerful and connected to each other over InfiniBand. The current minimum configuration is a cabinet containing two database servers, each with 2x 6-core Intel Xeon CPUs and 96GB of RAM, and three storage servers, usually called Cells in the context of Exadata, each with 2x 6-core Intel Xeon CPUs and 24GB of RAM. Each storage server contains 12 hard disk drives and 384GB of Sun Flash Accelerator storage. Everything is plumbed together with 40Gbit/s InfiniBand. The smallest Database Machine configuration provides a raw read performance of 5.6 GByte/s from disk and 16 GByte/s from flash storage. For detailed hardware specifications, have a look at http://www.oracle.com/us/products/database/exadata/index.html.

Those are quite impressive numbers, but how much does this improve the speed of my application? Oracle claims a performance gain of 10 to 30 times, and the biggest improvement achieved by a customer was a factor of 72. But how is this possible? Such massive improvements cannot be achieved just by putting some very fast hardware together, otherwise customers would already have done this. So how does Oracle achieve these performance gains?

To be faster, there are only two things you can do: work faster or do less work. We have already seen what Oracle did to work faster, but how can they reduce the work and still achieve the same result?

In today's three-tier architectures it is standard to push a WHERE clause from the application layer down into the database layer, so that only a subset of the data has to be sent back to the application layer. But what if the storage already did this filtering? It would drastically reduce the payload on the database server and on the network between the storage and the database server. Therefore Oracle wrote an intelligent storage server which is able to apply such filters before sending the data to the database.

Oracle calls these storage-level optimizations SmartScan. With SmartScan, Oracle can filter the following in the storage layer:

  • an equals predicate filter in the WHERE clause
  • a WHERE clause which filters using a function
  • only returning the needed subset of columns
  • pre-joining tables on storage

In subsequent posts I will show how these optimizations work. In the next post I will show what Oracle had to implement or reuse to make such optimizations possible.

Written by danirey

February 28, 2011 at 17:27

Article about Exadata Performance in the SOUG Newsletter


My article "Oracle Exadata Storage Server Performance erklärt" was published in the latest SOUG Newsletter 1/2011. In it I try to show how the Exadata Storage Server achieves its performance. The main focus of the article is not on the hardware, however, but on the intelligence of the system.

The Oracle Exadata Storage Server contains logic to filter data before it is transferred from the storage server to the database server. Oracle calls this mechanism SmartScan, and it is also the focus of my article.

Since it was additionally published online as a sample article for the content of the SOUG Newsletter, interested readers can also find it at http://www.soug.ch/fileadmin/user_upload/Newsletter/NL_public/Rey_ExadataPerformance.pdf.

Written by danirey

December 23, 2010 at 06:22