Friday 31 October 2008

Why are error messages not unique?

That's what I want to know.

When a job fails in Backup Exec, it spits out an error code for you. So you click on it, and it takes you to a web page with a description of that error, right?

Wrong. With Symantec Backup Exec, it just takes you to a list of issues which may or may not be remotely close to the one you actually have, and which rarely has any useful answer.

Here I am today again trying to find out why certain jobs keep failing without any sort of sane error reason. Another day fighting Backup Exec.

Wednesday 17 September 2008

Server Paused. In a not-actually-paused-at-all kinda way

For the past 2-3 weeks our CASO Backup Exec Server has been INSISTING that the "Job Status" for every single job is "Server Paused". It isn't actually uncommon for a server to show this status from time to time, but it shouldn't stay that way for days, and certainly not on all of our media servers at once.

It seems the answer is simple (but annoying)... here's what a Symantec Forum post says:

----------------
I have had this problem many times. Like a plague. The fix most times is simple. Go to the devices tab. Highlight your media server. Right click and pause it. Then unpause it. This is caused by an interruption which corrupts a file. Backup Exec shows nothing paused. Just fixed this today after 10 days following a problem with a tape drive. Symantec support knows about it but hasn't published the fix.
----------------

Friday 12 September 2008

A few months later...

Backup Exec 10d continues to offer variable results. For straightforward jobs it tends to do a reasonable job, most of the time.

However, each time we add a set of Synthetic jobs to the mix, things start going horribly wrong, and the Backup Exec engine regularly dies.

While I currently refuse to pay any money for upgrades since they promised these features were part of 10d, and they simply don't work, I have had more luck with an eval of 12d. Of course until we've been running it as long as 10d and it has the same sort of load we won't know for sure, but I don't hold out much hope!

Tuesday 26 February 2008

Event numbers

So, now that we've managed *touch wood* to iron out the big issues that have been plaguing us for ages, perhaps it's time to look at some of the more annoying niggles we have to contend with.

I've always thought that the event numbers produced by most applications were simply too short and simple. After all, how can you hope to cover all the eventualities with only 4 or 5 characters? With the change in recent versions of Backup Exec to much longer event numbers, you would think they could tie things down much more specifically, but apparently not.

Looking at an error we've been seeing recently, we're getting this when running a backup:

V-79-57344-33938 - An error occurred on a query to database .

So you might think with a possible 1,000,000,000,000 event numbers available that this error would be specific to the problem I'm seeing, and allow me to find more information directly relating to it... of course you'd be wrong.
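For what it's worth, here is the back-of-envelope sum behind that trillion, as a rough Python sketch. The field layout (a "V-" prefix followed by groups of 2, 5 and 5 digits) is only my reading of this one code, not anything Symantec documents, and the function names are made up for illustration.

```python
import re

# ASSUMPTION: the layout "V-" + 2 digits + 5 digits + 5 digits is inferred
# from the single code quoted above, not from any published specification.
EVENT_CODE = re.compile(r"^V-(\d{2})-(\d{5})-(\d{5})$")


def parse_event_code(code: str) -> tuple[int, int, int]:
    """Split an event code into its three numeric fields."""
    match = EVENT_CODE.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised event code: {code!r}")
    first, second, third = match.groups()
    return int(first), int(second), int(third)


def code_space(code: str) -> int:
    """Count how many distinct codes the same number of digits could express."""
    digits = sum(ch.isdigit() for ch in code)
    return 10 ** digits


print(parse_event_code("V-79-57344-33938"))    # (79, 57344, 33938)
print(f"{code_space('V-79-57344-33938'):,}")   # 1,000,000,000,000
```

Twelve digits gives ten to the power of twelve combinations, which is where the trillion comes from.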

Clicking on the link provided by the Job History shows me 9 different articles all apparently relating to this one error. 6 of them refer to restore jobs rather than backup jobs, 2 relate to SQL 2005 not 2000 as is the case in this instance, 2 relate to running SQL 7 alongside 2000 which we're not doing and of the 2 which do relate to backups in their titles, the contents either refer to SQL 7 or to you having an incomplete restore operation preventing the backup from running.

Now I realise that producing written content for up to a trillion error numbers would be a mammoth task, but personally if I'm searching for a specific event number and no content exists for it yet I'd prefer to know that, rather than trawl through mountains of knowledge base articles in the hope that one might actually be relevant to my situation! I can always then search using words from the error message to find articles that are similar, and try things from there!

Monday 14 January 2008

Reliability? Can we say we're onto Stage 2?

I'm rather scared, and pleased, to say that we appear to have genuinely made it work - in my last few posts I still suspected that Backup Exec was just being nice to us, but it does appear it is now genuinely acting like a competent backup product.

One server has now been up for 27 days, and is still running backups quite happily, with 2,000 odd jobs run in that time (66 a day or so), and the CASO server just getting on with its jobs and delegation. No more tears.

Our CPS system is still working - that's the most flawless part of Backup Exec I've seen. It's absolutely awesome - it says continuous protection, and it offers just that. We installed it, got over one hurdle of making it listen on a specific IP, and that was that. Our main file servers have just-below-real-time backups 24/7.

Our next challenge (Stage 2, which took 18 months to get to!) is to build a comprehensive reporting and restore testing infrastructure around the software. We want to know everything possible about what it does, so we can provide internal quality reports, ensure we meet SLAs and, last but not least, ensure customers can be assured of regular, reliable backups.

Restore testing is often overlooked, but certainly not here! Running regular (hopefully scheduled) restore tests is an essential, core part of our plan, so we can be sure those backups actually work. As someone else we know found out at great cost, backing up isn't enough!
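To give a flavour of what a scripted restore test might look like, here is a minimal Python sketch. It is not a Backup Exec feature and not what we have built; it simply assumes a restore has already been redirected to a scratch folder, and compares checksums between the original tree and the restored copy. The function names and the report layout are invented for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_restore(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing, unexpected or different after a restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "unexpected": sorted(set(dst) - set(src)),
        "changed": sorted(
            name for name in src.keys() & dst.keys() if src[name] != dst[name]
        ),
    }
```

An empty report means the restored copy matches the source byte for byte. Files that were legitimately modified after the backup ran will show up under "changed", so a real test would need to compare against a snapshot or a known-static data set.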