Recently I was hit with the question: (paraphrasing)
I just received the error “ORA-01578: ORACLE data block corrupted (file # 11, block # 410112)”. Is there another scenario you can receive this error, without it meaning physical corruption?
If there is, I’ve never run into it. In my experience, this has always meant physical block corruption within the data file. The simplest option to correct this is to recover the data from a backup, and roll through the archive/redo logs until the data file is healthy again.
Unfortunately, it’s not always that easy. There are different backup strategies that can be used in Oracle and which one you use depends on what your needs are. Briefly, I’ll explain the three most common along with some of their pro’s and con’s:
– This method of backup physically dumps a schema in a portable (albeit proprietary) fashion. This is desirable as you can perform this backup while the database is running, and protects you from incidental accidents. (Dropping a table for instance) It’s very easy to recover portions of a schema using this method. However, the data in the export is only as current as the time the export was taken. If you only run an export once a day, that’s nearly 24 hours of transactions that may be lost. (Unless also augmented with other techniques to get point in time recovery)
– This is a block level backup of the physical data files while the database is running. The requirement for this is that the database must also be in archive log mode. This is a state in which the online redo logs are dumped (archived) to disk periodically, and preserved. (redo logs are basically logs of every transaction) This gives you the most protection (when done right) as you should never lose any committed transaction. You can do point in time recovery and the database is available during the backup. This takes some strategy, so it’s also the most complex. In addition, partial recoveries are complicated, usually involving recovery to a new instance and transferring data.
– This is also a block level backup, much like the hot. You are not restricted to enabling archive log mode, but the database must be down in order to perform the backup. (The database can also be in archive log mode, but if you’re in this mode anyway, might as well just go for the hot backup.. not to say that you have to. There’s plenty of reason to do cold backups even in archive log mode) The pros and cons are similar to hot, except of course, the database must be down and if you’re not in archive log mode, you can’t do point in time recovery. (unless also augmented with other techniques)
Most companies will use a combination of techniques, and most will prefer hot backups if they employ DBA’s. There are a lot of companies that don’t have full time DBA’s and choose the logical, because it’s easy. Also, some software vendors dictate the strategy. To be honest, I’ve never tried to do point in time recovery with logical and colds (w/o also being in archive log mode), so I don’t honestly know what those would look like, but I assume it’s at least somewhat possible using flashback, log mining, etc. Anyway, I’m getting a bit off-topic, let’s bring it back.. 😀
In this instance, logical backups were chosen, and as far as I know, there was nothing to augment them for point in time recovery. One problem with this choice is that they are susceptible to block corruption too. It’s a row level backup, so if it reaches a corrupt row (block), the backup type will fail. In this case, the backups were in this failing state, so we weren’t sure how much use they were going to be.
All is not lost though!!
There is a chance that this corrupt block could belong to an index which is easily rebuilt w/o any loss. Or, it’s possible that this block is actually empty. I know what you’re thinking, if it’s empty, then why is oracle complaining when querying the table? Because in a full table scan, oracle reads every block until it reaches the high water mark (HWM). Under normal operation, Oracle does not move that HWM back. What does this mean?
Well, if you create a table, load it with 1 million rows, the HWM is positioned at the end of those records. If we then remove those 1 million rows, there will be no data in the table, but the HWM will remain at the previous position. With a full table scan, all the blocks from the beginning of the table, up to the high water mark will be read. In a backup situation, you’d want oracle to read every row, so it will default to a full table scan. (Plus, It takes advantage of multi-block reads for speed) Viola, oracle reads a corrupted block, albeit empty.
You can use this query to determine the segment and segment type that the corrupted block belongs to:
select owner||'.'||segment_name segment, segment_type from dba_extents where file_id=11 and 410112 between block_id and block_id+blocks;
The file_id (11) and block number (410112) are taken from the error message above. If segment type returns ‘INDEX’, you can easily rebuild the index with something like: (Substitute [segment] for the segment returned in the above query)
alter index [segment] rebuild;
If this segment is a table, I used a few tests to try and determine if the block was empty. As I explained above, if you select the data using a full table scan, you’ll hit the corrupted block. Sooo, that should mean if you access the table with a supporting index, it should avoid corruption in an empty block. Now, this is only going to work if you have an index that fully encompasses the rows, as would be the case with a primary key.
Luckily, we have one of those. Unluckily, that query also failed with the same corrupted block issue. To test this, we used the following SQL, which includes a hint to the optimizer to access this table through an index:
select /*+ index(a) */ * from schema.table_name a;
Inversely, if you want to force a full table scan:
select /*+ full(a) */ * from schema.table_name a;
I use the term force loosely. Those hints (/*+ function() */) are just that: A hint to Oracle’s optimizer to do something differently than it might try otherwise. Oracle can choose to ignore hints, so we’re not actually forcing anything. 😀 I wasn’t actually on the system to make sure this actually worked, but since the two resulted in different answers, but both got the same error, I presume it actually did behave the way I intended.
So now what? Well, there is a good chance that there really are rows in the corrupted block. Now the question becomes, how do I recover as much of this data as possible, given the fact that I can’t seemingly touch the table without hitting the corruption?
If we could address every row individually, and directly, we could grab every row and ignore exceptions to skip those in the corrupted block. We already know we have a primary key index, we just need to get all of the primary keys out of it. So how do we query an index? The answer is, you can’t. There is no syntax for selecting data from an index directly. (Well, that I know of anyway) However, Oracle being the smart guys they are, they do have an optimization that basically does just that. Index fast full scan, FTW!
If you’re accessing an indexed column, and the index can cover the dataset completely, Oracle will sometimes use a fast full scan on the index and avoid the extra IO to the table. No point in reading in table blocks, if you don’t need additional data from it. To nudge Oracle to do this, there’s a hint for that: index_ffs(). Here’s how we got the number of records, utilizing this method:
select /*+ index_ffs(a) */ count(sp_id) from schema.table_name a;
Provided there was no funkiness with the constraints/indexes, this should be an accurate count of the number of rows that are (were) in the table. This returned 113524 for us. Once we recover all the rows that we can, the difference is obviously the number of rows that were lost. The next step is to recover the rows. Because this is getting a bit longer than I planned, let me just post the code and I’ll briefly cover it afterwards:
begin for row in ( select /*+ index_ffs(a) */ distinct sp_id from table_name ) loop declare v_errm varchar2(2000); begin insert into table_name_recovery select /*+ index(a) */ * from table_name a where a.sp_id = row.sp_id ; exception when others then v_errm := substr(sqlerrm,1,1000); insert into table_name_errors (sp_id, errm) values (row.sp_id, v_errm); end; end loop; commit; end;
Nothing fancy. All the magic is in the hints. Basically, we created an empty copy of the corrupted table (Here’s a trick: create table A as select * from B where 1=2), and another to throw exceptions and primary keys in. We then ran the above code.
This loops through all the primary keys using the index fast full scan trick. It then selects the row with that primary key from the corrupted table and stores it in our new copy. If this select throws any exceptions, including the corruption error, we record the exception and primary key in the 2nd table, and move onto the next row.
New Healthy Block
When it’s done, our new table should have all the rows we can recover from the database. (All but 29 rows in this instance) Where you go from here really depends on the severity and application. In our case, a few more rows were recovered from backups. Perhaps in your case, you can reconstruct them from email, reports, other databases, etc. Might need to get creative here! ;D