TryCatchFinally.net Some SQL, some .NET, and whatever else

23Nov/150

Viewing Supplemental Logging status in your Oracle database

Viewing the status of supplemental logging on all of our Oracle tables is done via the all_log_groups table, but it's not in an easy-to-use format and needs to be pivoted (and it only shows you which tables have logging of some kind, which makes it difficult to tell which tables don't have anything). I've done that below to provide an easier view of the status of supplemental logging on all tables, regardless of their current status:

  select t.owner, t.table_name,
         max(case when lg.log_group_type = 'USER LOG GROUP'      then lg.always end ) as "User Log Group",
         max(case when lg.log_group_type = 'ALL COLUMN LOGGING'  then lg.always end ) as "All Column",
         max(case when lg.log_group_type = 'PRIMARY KEY LOGGING' then lg.always end ) as "Primary Key",
         max(case when lg.log_group_type = 'UNIQUE KEY LOGGING'  then lg.always end ) as "Unique Key",
         max(case when lg.log_group_type = 'FOREIGN KEY LOGGING' then lg.always end ) as "Foreign Key"
    from dba_tables t
    left
    join all_log_groups lg
      on t.owner = lg.owner
     and t.table_name = lg.table_name
   where t.owner = 'Table Owner'
   group by t.owner, t.table_name
   order by t.table_name

In the recent spat of Goldengate troubleshooting, I've had to check these values a number of times and I hope this query is useful to others (since I couldn't find a good example online that made this information easy to view).

Please let me know below if you have any issues!

22May/150

Correcting OGG-01733 in Goldengate – Trail file header file size mismatch

Warning - Oracle/Goldengate support will probably get mad at you if you try this. It worked great for me, but they recommended we reload the data from scratch, so that's probably what they'd recommend for you too. Just know that this is 100% unofficial and unsupported 🙂

Quick summary:

We have Goldengate replication from Oracle 11.2 on Lunux to MSSQL 2012 on Windows, and we ran into an OGG-01733 error "Trail file header file size value {X} for trail file {Y} differs from actual size of the file ({Z})", which caused an ABEND where we were stuck. We opened a ticket with Oracle support and after a week with very little response, the concluded that I should just perform a new initial load on the destination - since the trail files had already been pumped to the destination server and removed from the extract server, they were unable to troubleshoot further.

It turns out the work-around was to open the trail file in a hex editor and manually update the trail file header to make it think it was supposed to be the size it actually was. After saving the file again and resuming replication, it continued on its merry way and applied the transactions without another complaint.

Steps to resolve this error message:

  1. Make a backup of your trail file - you know, since you're editing it and might want a second shot.
  2. Open the report file and make a note of the size the file is currently ("Z") and the size it's supposed to be ("X"). I'll refer to those as X and Z further down.
  3. Use a decimal-to-hex converter like this one to convert both of these values to their hex equivalent (now I'll call them "HX" and "HZ")
  4. Load up the trail file in your favorite hex editing tool - I like using Notepad++ in combination with the HEX-editor plug-in (once the file is loaded, select "HEX-Editor" from the plug-ins menu, and then select "View in Hex")
  5. Perform a search (if you're using Notepad++, ensure the data type is set to "Hexadecimal") for your "HX" value - the size the file thinks it should be. However, you need to search for an even number of digits - if your hex value is an odd number of digits, either drop the leftmost (largest) one or add a zero to the left (I dropped a digit):
  6. Goldengate - Hex Editor

    • Side note: You can see that my trail file size isn't too far into the file - under 300 bytes from the beginning. However, since it's stored in hex, it's not something that's easily viewable in the file (where you will see some file path and server version information if you look to the right where the ASCII is displayed. Also, in my image, the file size is preceded by a quite a few zeroes - my trail files are set to 100MB, but it appears Goldengate supports up to 4GB trail files using the 32 bytes in the header file. Back to fixing this...
  7. CAREFULLY edit the HX value you've found to be the new HZ value - the actual size of the file. In particular, don't move any of the bytes around or add/remove anything, just fix the values you need to change so that the file size is stored in the same location.
  8. Save the file and close it.
  9. Resume replication right where you left off (assuming you made a backup and the edited the original trail file) - it should check the new file size, see the transaction that was previously beyond the file size limit, and then apply it and move on!

Conclusion?

What causes this behavior? I can't find any clear documentation or explanation at all - when searching for this error, the only meaningful links I can find at all are either in an Oriental language and have basic details as well as a dire warning to call Oracle support immediately or a case where somebody receives it on an initial load and the forum's advice is "your table is too small to mess with this - just export it to CSV and reload it that way".

When we looked at the list of trail files, we noticed something particularly odd - the trail files near the offending file all had ascending "last modified" timestamps, as you'd expect, but this file was actually out of order:

05/01/2015  03:45 AM        99,999,462 SV002351
05/01/2015  04:38 AM        99,999,802 SV002352
05/01/2015  08:13 AM        99,999,367 SV002353
05/01/2015  10:09 AM        99,999,936 SV002354
05/01/2015  11:05 AM        99,999,630 SV002355
                                                 <-- File should be right here
05/01/2015  11:41 AM               891 SV002357
05/01/2015  11:47 AM        99,999,462 SV002358
05/01/2015  11:50 AM        99,999,280 SV002359
05/01/2015  11:58 AM        99,999,314 SV002360
05/01/2015  12:09 PM        99,999,910 SV002361
05/01/2015  12:40 PM        99,998,043 SV002362
05/01/2015  01:16 PM        99,999,754 SV002363
05/01/2015  01:34 PM        72,017,446 SV002356  <-- But it's down here
05/01/2015  02:05 PM        99,999,516 SV002364
05/01/2015  02:40 PM        99,999,966 SV002365

The file contained two additional transactions beyond the stated header size and the actual end of the file, and they were both time-stamped correctly to have been located in that file (they were both stamped 10:34AM, along with the transactions that were earlier in the file, and since the server is an hour off because of time zone, they were in the right file).

The fact that it's smaller than the others, and that it's followed by a file containing no transactions (just a header) led me to believe the file was cut short by a network interruption of some kind. We're using a local extract and a separate pump, as we're advised to do, but the connection still drops from time to time. In this case, I can only imagine it was in the middle of committing something, was interrupted, and then somehow these transactions were suspended for some reason and then added to the file later. I can't imagine why, but when they're added, the file header isn't updated.

Hopefully this explanation and work-around have helped somebody else - we pulled our hair out for a week going back and forth with Oracle support and scouring the internet (unsuccessfully) for any relevant information - in the end, going rogue and editing the file was the only way (short of a complete reload) to get things moving again!

3Apr/140

Oracle Goldengate REPLICAT frozen on “Starting”

We use Oracle Goldengate (expensive and probably overkill for Oracle->MSSQL, but good at what it does) to replicate data from an Oracle database into a SQL Server. However, I got an alert the other day that replication had stopped, and when I checked the status of replication, all the REPs we had set up were in status "Starting...", but none we actually doing anything.

Attempting to stop them got the following error:

GGSCI (GGSERVER) 68> stop rep MYREP
Sending STOP request to REPLICAT MYREP ...
ERROR: opening port for REPLICAT MYREP (TCP/IP error: Connection refused).

 

Stopping/Starting the manager service or rebooting the PC didn't help either - they still said "Starting" and were unresponsive. Even stranger, deleting and recreating the REP gave the same result - before I even attempted to start the REP for the first time, it said "Starting", and an attempt to start it gave me "Process is starting up - try again later".

The cause was the REP process status file, located in the DIRPCS folder under the Goldengate root - there should be a file for each REP that's currently running giving details about the status. When a REP stops, this file is deleted. Since all of the current REPs weren't doing anything (they were all sitting at the end of the previous trail file), they should have been stopped. I deleted the PCR files for the affected REP streams, and then manager reporting "STOPPED" - at that point, I was able to start up each REP without issue.

I'm not sure how they got that way, but once started again, they all worked without issue. I hope this saves you the troubleshooting time of hunting down these files!

27Sep/131

Unexpected results in an SSRS report when using Oracle OLEDB driver and OPENQUERY

Here's a pretty specific situation we ran across recently - a comment in our OPENQUERY text was causing the results of our query to be completely different than expected.

The T-SQL query we'd written in SSMS that fetched data from an Oracle linked server was running in about 30 seconds, but when we put the query directly into an SSRS report, it would never finish. We were stumped, since we were executing the exact same query in both SSMS and SSRS, but SSRS appeared to hang when executing it. We trimmed down our original query (a few CTEs containing OPENQUERY, joined to a handful of local tables) until we found the single CTE that was doing it. Executing that query in SSMS (or executing the OPENQUERY portion directly in PL/SQL Developer) returned results in about two seconds - however, putting that same portion on a report (even using the report wizard and with no special formatting) resulted in a report that never finished running, whether in the report designer or deployed to our SSRS site.

As it turns out, the SSRS engine (it occurred in both the BIDS/SSDT and once the report was deployed) was consolidating the OPENQUERY text to a single line when executing it, rather than respecting the line breaks in our query text. As a result, the "--" that preceded a comment mid-way through the query was actually commenting out the entire rest of the query, a la SQL injection style. This resulted in a completely ignored WHERE clause on the query, and different results - effectively, SSRS was taking this query:

SELECT *
  FROM OPENQUERY(OracleServer,
                 'SELECT *
                    FROM SomeSchema.SomeTable -- Note the comment here
                   WHERE ID = 50')

and executing it like this:

SELECT *
  FROM OPENQUERY(OracleServer,
                 'SELECT * FROM SomeSchema.SomeTable -- Note the comment here WHERE ID = 50')

Since our query wasn't returning results, but just hanging for 30 minutes before we gave up, we asked our Oracle admin to watch the execution plan after seeing this Stackoverflow question (and the comments on the accepted answer) that suggested that they were seeing SSRS generate a complete different execution plan for the query (by comparing the plan generated when it was executed from both SSMS and SSRS, I'd hoped the difference would be obvious - it was). This didn't make any sense at all, since the same query should be hitting SQL Server, and so exactly the same OPENQUERY pass-through query hitting Oracle, but the Oracle DBA confirmed that half of our Oracle query was commented out when it came from SSRS. This meant that our 1000 row expected resultset had become a 1.5 billion row dump of the entire table, which would explain why we were seeing the delay on our report!

This didn't happen in SSMS, and it didn't happen when previewing the dataset itself in SSRS, but only occurred when previewing the actual report (this is the only case where SSRS submits a modified version of the query to the SQL engine). Removing the comment from the query resolved the issue, as did switching the comment syntax to /* */ from double-dash.

This occurs both in SQL Server Data Tools (based on VS2010) and BIDS 2008 R2, which is what we had handy to test. Also, it appears to only apply to the Oracle OLEDB driver - creating an Oracle linked server via either ODBC, as well a SQL OLEDB linked server, didn't exhibit the same issue, and the final report results were filtered as expected.

Steps to reproduce the issue:

1. Create a linked server using the Oracle OLEDB driver

2. In SSMS, create a query that accesses the linked server using OPENQUERY with a simple WHERE clause inside

For example, we had something like the following:

SELECT *
  FROM OPENQUERY(OracleServer,
                 'SELECT *
                    FROM SomeSchema.SomeTable
                   WHERE ID = 50')

3. Execute your query and make note of the number of rows it returns

4. Modify your query by adding a short comment at the end of your FROM line - this should not impact the query results at all

SELECT *
  FROM OPENQUERY(OracleServer,
                 'SELECT *
                    FROM SomeSchema.SomeTable -- Note the comment here
                   WHERE ID = 50')

5. Execute your query again - the row count should be unchanged from step 3

6. In BIDS or SSDT, create a new report, either manually or using the wizard, providing your modified query from step 4 as the dataset

7. Once your report is created, view the dataset properties and preview the query results - note that they match the expected row count

To do this, right-click on your dataset in the "Report Data" panel, select "Dataset properties". When it opens to the Properties window, select "Query designer..." near the bottom, and then click the exclamation point in the toolbar to execute your query and preview the results. You'll see your filtered result set, as expected. When you're done, close this window and go back to your report.

8. Preview your report and notice that the WHERE clause is ignored - all rows from your table are displayed

Summary, and take-away

I'm only using the placement of the comment and ignoring the WHERE clause as an example - the comment could be anywhere in the query, and could even result in an query with invalid syntax that refuses to execute at all.

I hope this explanation helps you avoid the two days of troubleshooting we did to get to this point and find the cause! I'm unclear on why the driver behaves this way, or if it's an SSRS issue specifically (I suspect it is, since it doesn't occur in SSMS with the exact same query). If anybody can point me to an open Connect item, I'd be happy to vote for it, but until then, I'm making the effort to migrate to using /* */ comment syntax everywhere - not only is it more clear to readers and flexible for in-line comments, it doesn't break OPENQUERY functionality (and that's reason enough for me).