How does SSD metadata corruption on power loss happen? Can I minimize it?

Note: This is a follow-up to the question “Is there a way to protect SSD from corruption due to power loss?”. I got good information there, but it basically focused on three areas: “get a UPS”, “get a better drive”, or how to deal with Postgres reliability.

But what I really want to know is whether there is anything I can do to protect the SSD against metadata corruption, especially in old writes. To recap the problem: this is an ext4 file system on Kingston consumer-grade SSDs with write caching enabled, and we are seeing these kinds of problems:

> Files with wrong permissions
> Files that have become directories (for example, toggle.wav is now a directory with files in it)
> Directories that have become files (not sure about the contents)
> Files with scrambled data

I have less of an issue with these things happening to data that was being written when the drive went down, or shortly before. It is a problem, but it is expected, and I can deal with it in other ways.

The bigger surprise and problem is that metadata corruption is occurring in areas of the disk that were not recently written (that is, written a week or more earlier).

I am trying to understand how this happens at the disk/controller level. What is going on? Does the SSD periodically “rebalance” and move blocks, even if I am writing elsewhere? Like this:

[image: diagram not preserved]

and then the power goes out while D is being rewritten. Part of it might end up in the first block and part in the second. But I don’t know whether that is how it works. Or maybe something else is going on?

All in all, I want to understand how this happens and whether there are any steps I can take to mitigate the problem at the OS level.

Note: “Get a better SSD” or “Use a UPS” are not valid answers here. We are trying to move in that direction, but I have to live with the reality on the ground and get the best outcome with what we have now. That said, if the answer is that with these disks and no UPS there is no solution, then I guess that is the answer.
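To illustrate the kind of OS-level measure meant here: corruption of old, untouched files can at least be detected without any help from the drive. The following is a minimal sketch, not a tested tool. It records each entry's type, permission bits, and a SHA-256 content hash, then reports entries that changed between two snapshots. The function names `snapshot` and `compare` are hypothetical, and the manifest would have to be stored somewhere other than the suspect SSD.

```python
import hashlib
import os
import stat


def snapshot(root):
    """Record (type, permission bits, content hash) for every entry under root."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            digest = None
            if stat.S_ISDIR(st.st_mode):
                kind = "dir"
            elif stat.S_ISREG(st.st_mode):
                kind = "file"
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    # Hash in 1 MiB chunks so large files are not read at once.
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digest = h.hexdigest()
            else:
                kind = "other"  # symlinks, devices, sockets, ...
            rel = os.path.relpath(path, root)
            manifest[rel] = (kind, stat.S_IMODE(st.st_mode), digest)
    return manifest


def compare(old, new):
    """Return (path, problem) pairs for entries that differ from the old snapshot."""
    problems = []
    for rel, (kind, mode, digest) in sorted(old.items()):
        if rel not in new:
            problems.append((rel, "missing"))
            continue
        new_kind, new_mode, new_digest = new[rel]
        if new_kind != kind:
            problems.append((rel, f"type changed: {kind} -> {new_kind}"))
        elif new_mode != mode:
            problems.append((rel, "permissions changed"))
        elif new_digest != digest:
            problems.append((rel, "content changed"))
    return problems
```

Run over files that are not supposed to change, a snapshot taken before an outage and another taken after it would flag the symptoms listed above (wrong permissions, a file turned into a directory, scrambled data). It only detects the damage; it does not prevent or repair it.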

References:

Is post-sudden-power-loss filesystem corruption on an SSD drive’s ext3 partition “expected behavior”?
This is similar, but it is not clear whether he was experiencing the kinds of problems we are.

Edit:
I have also been reading about issues with ext4 that may cause problems on power loss. Our file systems are journaled, but I don’t know about anything else.

Prevent data corruption on ext4/Linux drive on power loss

http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/

For how metadata gets corrupted after an unexpected power loss, please see my other answer here.

Disabling the cache can significantly reduce the likelihood of losing in-flight data; however, depending on your SSD, data at rest is still at risk of corruption. It also carries a huge performance penalty (with the private DRAM cache disabled, I have seen a 500 MB/s SSD write at only 5 MB/s).

If you can’t trust your SSD, the only “solution” (or rather, workaround) is to use an end-to-end checksumming file system such as ZFS or BTRFS in a RAID1/mirror setup: that way, any eventual single-device (meta)data corruption can be recovered from the other side of the mirror by running a check/scrub.
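A less drastic complement, for the in-flight case only, is to leave the cache enabled and have the application explicitly flush the few files that matter. The sketch below is a hedged illustration of the common write-to-temp, `fsync`, rename pattern on Linux. The helper name `durable_write` and the `.tmp` suffix are hypothetical. It assumes the drive actually honors cache-flush requests, which is not established for the SSD in question, and it does nothing for data at rest.

```python
import os


def durable_write(path, data):
    """Replace path with data so that a crash leaves either the old or the new file.

    Assumes POSIX rename semantics and a drive that honors flush requests.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file name, for illustration

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push the data to the device

    os.replace(tmp_path, path)  # atomic rename over the old file

    # Flush the directory so the rename itself reaches the device.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each call pays the flush cost, so this only makes sense for a small number of critical files rather than as a blanket replacement for disabling the cache.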
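The principle behind that workaround can be shown with a toy model. This is a simplified sketch, not how ZFS or BTRFS are implemented: the class `MirroredStore` and its methods are hypothetical, the two "devices" are plain dictionaries, and the checksums live in memory, whereas the real file systems keep them in their own on-disk metadata.

```python
import hashlib


def checksum(block):
    return hashlib.sha256(block).digest()


class MirroredStore:
    """Toy two-way mirror that keeps a checksum for every block."""

    def __init__(self):
        self.sides = [{}, {}]  # two independent copies: block number -> bytes
        self.sums = {}         # block number -> expected checksum

    def write(self, number, block):
        for side in self.sides:
            side[number] = block
        self.sums[number] = checksum(block)

    def _is_good(self, side, number):
        block = side.get(number)
        return block is not None and checksum(block) == self.sums[number]

    def read(self, number):
        # Return the first copy whose checksum still matches.
        for side in self.sides:
            if self._is_good(side, number):
                return side[number]
        raise IOError(f"block {number}: no copy matches its checksum")

    def scrub(self):
        """Rewrite every bad copy from a good one; return (block, side) repairs."""
        repaired = []
        for number in self.sums:
            good = [s[number] for s in self.sides if self._is_good(s, number)]
            if not good:
                continue  # both copies are bad: nothing to repair from
            for index, side in enumerate(self.sides):
                if not self._is_good(side, number):
                    side[number] = good[0]
                    repaired.append((number, index))
        return repaired
```

If one side silently returns a scrambled block, `read` still serves the intact copy and `scrub` rewrites the damaged one. If both sides are damaged, the data is lost, which is why this is a workaround rather than a fix.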

