Storage area network – Fibre Channel: Overwriting LTO tape after bus reset

We have a situation at a customer site that I would like to understand better.

This is what happened:

>A tape library with an LTO tape drive is connected to a Fibre Channel environment
>Archive software running on Windows Server 2008 is writing data to the tape
>At some point, the tape was rewound without the software being aware of it, and subsequent writes overwrote the data already on the tape
>The situation was detected by comparing the expected position on the tape with the actual position
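The position comparison in the last point can be sketched as follows. This is only an illustration under stated assumptions: `SimulatedDrive`, `GuardedWriter`, and `TapePositionError` are hypothetical names, and the drive is an in-memory stand-in, not a real driver or vendor API. On real hardware the actual position would have to come from the drive itself (for example via the SCSI READ POSITION command).

```python
class TapePositionError(RuntimeError):
    """Raised when the drive is not where the writer expects it to be."""


class SimulatedDrive:
    """Hypothetical in-memory stand-in for a tape drive (not a real driver API)."""

    def __init__(self):
        self.blocks = []    # data blocks currently on the tape
        self.position = 0   # current logical block position

    def read_position(self):
        return self.position

    def write_block(self, data):
        # Writing at a position discards everything after it, as on real tape.
        del self.blocks[self.position:]
        self.blocks.append(data)
        self.position += 1

    def reset(self):
        # Models a reset that rewinds the tape without notifying the writer.
        self.position = 0


class GuardedWriter:
    """Tracks the expected position and refuses to write if the drive disagrees."""

    def __init__(self, drive):
        self.drive = drive
        self.expected = drive.read_position()

    def write(self, data):
        actual = self.drive.read_position()
        if actual != self.expected:
            raise TapePositionError(
                f"expected block {self.expected}, drive reports {actual}"
            )
        self.drive.write_block(data)
        self.expected += 1
```

With this guard, a rewind that happens between two writes raises `TapePositionError` instead of silently overwriting earlier blocks. A reset that lands between the position check and the write itself would still slip through, so the check narrows the window rather than closing it.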

I don’t have detailed information about the device vendor.

It seems that a reset occurred on the tape drive, which caused the tape to rewind, but this was not reported as an error to the driver and the software, so the software believed its writes were succeeding.
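A SCSI device normally signals a reset to the initiator through a UNIT ATTENTION condition (sense key 0x6) with an additional sense code in the 0x29 family ("power on, reset, or bus device reset occurred"). The sketch below only decodes a sense buffer that some layer has already handed over. How, or whether, that buffer reaches the archive software on Windows depends on the driver stack, which is not known here. The function names `parse_sense` and `indicates_reset` are made up for illustration.

```python
UNIT_ATTENTION = 0x06      # SCSI sense key
ASC_RESET_FAMILY = 0x29    # power on / reset / bus device reset occurred


def parse_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    if not sense:
        raise ValueError("empty sense buffer")
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):   # fixed format
        if len(sense) < 14:
            raise ValueError("fixed-format sense data too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):   # descriptor format
        if len(sense) < 4:
            raise ValueError("descriptor-format sense data too short")
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError(f"unknown sense response code 0x{response_code:02x}")


def indicates_reset(sense: bytes) -> bool:
    """True if the sense data reports a UNIT ATTENTION caused by a reset."""
    key, asc, _ascq = parse_sense(sense)
    return key == UNIT_ATTENTION and asc == ASC_RESET_FAMILY
```

If a layer between the drive and the application consumes this condition and transparently retries the command, the application never sees it, which would match the symptom described above.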

I have been reading a lot of documentation to understand why this happens, but I cannot draw any firm conclusions that would help the customer.

>Can the FC HBA or switch resend SCSI writes when the bus is reset?
>Can such a thing be configured?
>Did the FC HBA or switch ignore the reported unit attention?
>Can the operating system drivers be blamed?
>Is this vendor specific?

If someone could give me some guidance, I would be very grateful.

This is a known issue with tape drives: it is very easy to rewind the tape just by looking at the device sideways (i.e. opening it in the wrong way, through the rewinding device, just to check the status, for example).


At least one major UNIX backup package is so worried about this that it simply refuses to write to a tape a second time until the tape is ready to be erased; this is from the Amanda FAQ (which specifically mentions bus resets as a problem area):

Why does Amanda not append to a tape?

One run of Amanda = one (set of) tapes. Amanda opens the tape device
once, writes all the images and filemarks, and closes the device once.
Using that sequence, there is no possibility that other programs
interrupt the sequence and rewind the tape, without Amanda noticing.

Doing “mt -f /dev/st0 status” could be enough, or even “amcheck
daily”. Also, an error like a scsi bus reset implies a rewind.

If Amanda would close and reopen the tape drive for each backup image,
there is a window of vulnerability that the tape gets rewound
accidentally, and the next image will overwrite all the good backups
on the tape. And you wouldn’t know unless you tried to restore from
the tape.

When appending to a tape, there is the possibility that, between the
time that Amanda positions to the last image (that already is not
really trivial!), and opening the device for writing, a tape rewind
happens, and in that case Amanda would happily erase ALL of the tape,
containing possibly many days worth of backup.

Bacula also addresses this problem by never closing the tape device, so nobody else can open it by mistake while the tape is loaded. But this does not solve the bus reset problem.

In essence, this is a real problem, and a difficult one. I would argue that your backup hardware should be robust enough that these resets don’t happen often; if FC seems to be particularly prone to them, you should switch to a SAS tape drive, or at least connect the tape device directly to the backup server to remove the fibre switch and so on from the path. Other than that, I can’t see how you could do more than you have, because you discovered the problem before the usual point of discovery, that is, “Our restore does not work, we are in trouble”.
