Switchover from Data Guard primary to standby
Last week, we carried out a planned switchover between our primary and standby Data Guard instances. Oracle 11.2.0.3, physical standby. Dead easy.
The database change went absolutely perfectly. No problems. Brilliant. A five-step process, and everything behaved as the Oracle docs said it should.
The problem... the problem was everything else.
Some background: the company has been using Active Data Guard to populate a read-only standby for about 9 months, and had never done a switchover (or a failover) since it was installed. Almost all client jobs are cron- and script-based (running on the DB host), or run manually.
A week before the switch, I'd gone through a list of apps with a senior developer and checked that they all referenced a DNS name rather than an IP address. That's a relatively short list of about 10 apps of various types and sizes.
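We did that check by hand, but it could be scripted. The following is a minimal sketch, not what we actually ran: it scans the text of a config file for IPv4 literals so that hardcoded addresses stand out. The function name and the sample JDBC URLs are made up for illustration.

```python
import re

# One IPv4 octet (0-255), then a dotted quad that is not embedded in a
# longer word or dotted token.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"(?<![\w.]){_OCTET}(?:\.{_OCTET}){{3}}(?![\w.])")


def find_hardcoded_ips(config_text):
    """Return (line_number, ip) pairs for every IPv4 literal in the text."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for match in IPV4.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical config content: one hardcoded IP, one DNS name.
    sample = (
        "url=jdbc:oracle:thin:@10.1.2.3:1521/ORCL\n"
        "url=jdbc:oracle:thin:@db.example.com:1521/ORCL\n"
    )
    for lineno, ip in find_hardcoded_ips(sample):
        print(f"line {lineno}: hardcoded address {ip}")
```

This is only a heuristic. It will also flag anything else shaped like a dotted quad, such as a version string like 11.2.0.3, so the hits still need a human eye. It also says nothing about one app depending on another.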
Come the switch, most things went as smoothly as you'd hope! The vast majority of the Tomcat apps switched without a hitch. One app had parts that relied on another app which we'd missed, but overall that wasn't a huge issue.
The real issues came when the batch runs started, and I realised just how different the primary and standby servers were.
Things like:
- No oratab on the standby
- No inventory on the standby
- File permissions on the oracle binaries hadn't been set using root.sh, so users other than oracle couldn't connect using SQL*Plus and SQL*Loader
- The parameter file on the standby was significantly different, e.g. increased SESSIONS and OPEN_CURSORS values on the primary hadn't been replicated to the standby
Whether this last one was a misunderstanding (e.g. an assumption that Data Guard keeps initialisation parameters in sync, which it doesn't) or sheer oversight, I'm not sure.
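Parameter drift like this is easy to catch before a switchover. Here is a minimal sketch, assuming you have dumped a text parameter file on each side (for example with CREATE PFILE ... FROM SPFILE) and copied both to one place. The function names, the sample values and the ignore list are my own illustrative choices, not part of any Oracle tooling.

```python
def parse_pfile(text):
    """Parse 'name=value' lines from a text parameter file into a dict."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Drop the instance prefix ("*." or "SID.") and normalise case.
        name = name.strip().lower().rsplit(".", 1)[-1]
        params[name] = value.strip().strip("'\"")
    return params


def diff_params(primary, standby, ignore=()):
    """Return {name: (primary_value, standby_value)} where the two differ."""
    diffs = {}
    for name in sorted(set(primary) | set(standby)):
        if name in ignore:
            continue
        p, s = primary.get(name), standby.get(name)
        if p != s:
            diffs[name] = (p, s)
    return diffs


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    primary = parse_pfile(
        "*.sessions=600\n*.open_cursors=1000\n*.db_unique_name='prod_a'\n"
    )
    standby = parse_pfile("*.sessions=300\n*.db_unique_name='prod_b'\n")
    # Some parameters are expected to differ between the two sites.
    expected_to_differ = {"db_unique_name", "fal_server", "log_archive_dest_2"}
    for name, (p, s) in diff_params(primary, standby, expected_to_differ).items():
        print(f"{name}: primary={p} standby={s}")
```

A value of None on one side means the parameter is not set explicitly there and so falls back to its default. The ignore list matters because some parameters are meant to differ between the two sites, and which ones depends on your configuration.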
We've now switched over manually a couple of times (a hard drive fault on the new primary forced a switch), and each switch causes less than 20 minutes of app downtime. I'm quite pleased with that for an unplanned switchover, and we could probably reduce it further if required.