extremesanity.com

Warning “Valid SP-Cache is for Storage Processor SP…” event code 0x7241

February 15th, 2017 extremesanity No comments

On a VNX 5300 array, after powering down, unseating, and re-seating an SP, I encountered event code 0x7241 with the warning message “Valid SP-Cache is for Storage Processor SP…”.

Disabling the write and read caches on each SP did not resolve the warning, nor did restarting the management services on the SPs.

To resolve, I restarted one SP within the Unisphere GUI, then after it was back up, I restarted the other SP.

Categories: EMC VNX Tags:

Continuous offline archiving of EMC VNX array performance data

May 8th, 2014 extremesanity No comments

The EMC VNX arrays do not offer a good (inexpensive) way to archive performance data continuously to a management server for future retrieval. You CAN turn on performance data logging and have it periodically archive to the array itself, but I prefer not to have multiple GB of archived performance data on the same array I may be troubleshooting in the future, not to mention that is one more item to review on my maintenance checklist.

Turning on performance data logging

First thing, to enable performance monitoring and generate NAR files check (and uncheck) the following options in Unisphere > System > Monitoring & Alerts > Statistics > Performance Data Logging:

Next, click Start to begin performance logging. Re-verify that you have unchecked the “stop automatically after” option. The array will periodically archive performance data to .nar files on the array itself. In my environment the array archives to a .nar file about once every 12 hours for each storage processor. You can force the array to archive to a .nar file by stopping and then starting the data logging.

Note: In order to review NAR files after they are generated you must have the Unisphere Analyzer enabler installed on the array, otherwise you will have to engage EMC support to review the performance logs for you.

Retrieving performance logs from the array and archiving to a server

Install naviseccli on a server, then edit the VBScript code below, entering your own values for the IP addresses of the SPs, user, password, and file path. Create a scheduled task that runs cscript.exe against the script on the server daily. The script calls each SP, saves all NAR files on that SP to the directory of your choosing, then deletes all the NAR files from that SP.


' Grab performance logs (NAR files) from the array, then delete the logs off the array.
' Exec does not wait for the command to finish, so each call is followed by a 60 second sleep.

Set objShell = WScript.CreateObject("WScript.Shell")

' Retrieve all NAR files from SP A, then from SP B.
Set objExecObject = objShell.Exec("cmd /c naviseccli -Address <ip of SP_A> -User <san username> -Password <san password> -Scope 0 analyzer -archive -all -o -path <folder path (ex: C:\EMC\data_archive)>")
WScript.Sleep 60000
Set objExecObject = objShell.Exec("cmd /c naviseccli -Address <ip of SP_B> -User <san username> -Password <san password> -Scope 0 analyzer -archive -all -o -path <folder path (ex: C:\EMC\data_archive)>")
WScript.Sleep 60000

' Delete all NAR files from SP A, then from SP B.
Set objExecObject = objShell.Exec("cmd /c naviseccli -Address <ip of SP_A> -User <san username> -Password <san password> -Scope 0 analyzer -archive -delete -all -o")
WScript.Sleep 60000
Set objExecObject = objShell.Exec("cmd /c naviseccli -Address <ip of SP_B> -User <san username> -Password <san password> -Scope 0 analyzer -archive -delete -all -o")

You now have continuously archived data from your array that you can open in Unisphere Analyzer to review array performance.
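For anyone who would rather not rely on fixed 60-second sleeps, here is a rough Python sketch of the same retrieve-then-delete sequence. It is not the script I run; it reuses the naviseccli arguments from the VBScript above, and the function names (build_commands, run_all) and the example addresses are made up for illustration. It assumes naviseccli is on the PATH. Because subprocess.run waits for each command and check=True raises on a non-zero exit code, a failed retrieval stops the script before anything is deleted.

```python
import subprocess


def build_commands(sp_addresses, user, password, archive_dir):
    """Build naviseccli argument lists: retrieve from every SP first, then delete from every SP.

    The flags are copied from the VBScript above; the function itself is a hypothetical helper.
    """
    def base(ip):
        return ["naviseccli", "-Address", ip, "-User", user,
                "-Password", password, "-Scope", "0", "analyzer", "-archive"]

    retrieve = [base(ip) + ["-all", "-o", "-path", archive_dir] for ip in sp_addresses]
    delete = [base(ip) + ["-delete", "-all", "-o"] for ip in sp_addresses]
    return retrieve + delete


def run_all(commands, runner=subprocess.run):
    """Run each command in order, waiting for it to finish and stopping on the first failure."""
    for cmd in commands:
        runner(cmd, check=True)


# Example (placeholder values, not real addresses or credentials):
# run_all(build_commands(["<ip of SP_A>", "<ip of SP_B>"],
#                        "<san username>", "<san password>", r"C:\EMC\data_archive"))
```

The retrievals are all listed before the deletions, which matches the order in the VBScript.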

Categories: EMC VNX Tags:

Proactive copy to hot spare on EMC VNX array

March 5th, 2014 extremesanity 1 comment

SAN errors, oh no!

The process started when the array emailed me a couple of soft media errors, so I glanced at the SPA and SPB event logs in Navisphere and saw this:

Notice that the majority of errors showed up as informational and not warning or critical, meaning the array will not indicate to anyone that this drive is about to fail, yeesh.

Note also that all the errors occur on the same disk, Bus 0 Enclosure 1 Disk 7, an NLSAS 2TB drive in my environment.

Event codes included 0x6a0, 0x820, and 0x801 with descriptions of disk soft media error, soft scsi bus error, and soft media error. My suggestion is to filter only on description, and do searches for “error” to find all the messages.
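If the event log has been exported somewhere a script can read it, the "filter on description, search for error" idea can be sketched in Python as follows. The record layout (code, description, disk) is purely an assumption for illustration, not an actual Navisphere export format, and soft_error_counts is a made-up name.

```python
from collections import Counter


def soft_error_counts(events):
    """Count events whose description mentions "error", grouped by disk.

    `events` is an iterable of (code, description, disk) tuples. This layout is
    hypothetical; adapt it to however the log was actually exported.
    """
    counts = Counter()
    for _code, description, disk in events:
        # Match on the description only, ignoring case and severity.
        if "error" in description.lower():
            counts[disk] += 1
    return counts


# Illustrative records only, using the codes and descriptions mentioned above.
sample = [
    ("0x6a0", "Disk soft media error", "Bus 0 Enclosure 1 Disk 7"),
    ("0x820", "Soft SCSI bus error", "Bus 0 Enclosure 1 Disk 7"),
    ("0x801", "Soft media error", "Bus 0 Enclosure 1 Disk 7"),
]
print(soft_error_counts(sample).most_common())
```

A single disk dominating the counts is the pattern that prompted the support case here.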

Reviewing the disks within the Navisphere GUI showed that no disks were faulted, the dashboard showed no errors, and the hot spare was not in use, meaning the array did not believe the drive should be failed yet.

Time to be proactive…

I opened a case with EMC support, and sent them screenshots of the SP event logs, and SPCollects from both SPA and SPB, and noted that the error occurred on the same drive over 100 times in one day. The support representative immediately requested I do a proactive copy to the hot spare disk, and ordered a replacement disk sent to me.

A proactive copy is preferred because it skips the RAID rebuild altogether. Instead of requiring the array to rebuild the RAID array onto a new disk (and endure the performance degradation inherent in that procedure), it copies the data from the failing disk to the hot spare, tells the RAID array to use the hot spare, and then disables the failing disk.

I first tried to do the proactive copy from the Navisphere GUI, without success.

Note the option to copy is greyed out. Apparently the VNX's new mixed RAID storage pools prevent this option from being used, so I moved on to the CLI.

Passing the command
naviseccli -Address <SP IP address> -User <user> -Password <password> -Scope 0 copytohotspare 0_1_7 -initiate
(where 0_1_7 is the failing disk) worked correctly, starting the proactive copy from the failing disk to the hotspare of the same type.
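As a small illustration of how the disk ID in that command is put together, here is a Python sketch that builds the same argument list from the bus, enclosure, and disk numbers. The helper name copy_to_hot_spare_command is hypothetical; the naviseccli arguments are the ones shown above, and nothing is executed here.

```python
def copy_to_hot_spare_command(sp_address, user, password, bus, enclosure, disk):
    """Return the naviseccli argument list for a proactive copy of one disk.

    The disk ID uses the bus_enclosure_disk form from the command above (e.g. 0_1_7).
    """
    for value in (bus, enclosure, disk):
        if not isinstance(value, int) or value < 0:
            raise ValueError("bus, enclosure and disk must be non-negative integers")
    disk_id = f"{bus}_{enclosure}_{disk}"
    return ["naviseccli", "-Address", sp_address, "-User", user,
            "-Password", password, "-Scope", "0",
            "copytohotspare", disk_id, "-initiate"]


# Bus 0 Enclosure 1 Disk 7, with placeholder connection details.
print(" ".join(copy_to_hot_spare_command("<SP IP address>", "<user>", "<password>", 0, 1, 7)))
```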

Checking progress of the proactive copy

Now to check progress of the copy…

First I tried looking at the disks in the GUI

The disk state is listed as “Copying to Hot Spare(100%)”. Hmm, 100% doesn’t seem right; I just started this procedure. (Looking at the RAID LUN within the GUI showed the state as “transitioning” without any progress indicators.)

Then I tried through the CLI

Well, that doesn’t look right either. I continued on, looking at SPCollect logs, SP event logs, and all over the Internet, including the EMC community forums, without finding any answers. (Running getlun on the RAID LUN from the CLI didn’t show any progress indicators either.)

Update: I figured out a way to gauge progress, though a bit crude. See the update at the bottom.

Eventually after 19 hours (NLSAS 2TB) the process completed, throwing event logs 0x6b0, 0x604, 0x67d, 0x6a8, 0x67c, 0x7a7, 0x6ab0 along with many others indicating it had marked the failing drive as failed as expected.

Complete list of event logs thrown as part of the proactive copy completion:
0x6b0, 0x712d4601, 0x906, 0x7a7, 0x608, 0x6a1, 0x602, 0x7a5, 0x6a8, 0x712789a0, 0x67b, 0x67c, 0x603, 0x602, 0x712d0508, 0x604, 0x712d0507, 0x2580, 0x906, 0x7a6, 0x799, 0x712d4602, 0x712d4601, 0x67d, 0x7400, 0x740a, 0x2580, and probably some others I missed.

At this point I gave the EMC CE a call and scheduled replacement of the drive. Once the replacement drive is in place, the array should copy the information on the hot spare back to the replacement drive, then mark the hot spare drive as available again.

Update: It appears a progress indicator of sorts is included in the lustat command in the SPCollect logs. By running SPCollects over and over again I can gauge the progress of proactive copies and equalizations when the drive is replaced.

This information is contained within the SPCollect zip file, within the *_sus zip file, within the SPx_cfg_info.txt file. By looking at the EQZ percentage, I can gauge roughly when it will finish, and more importantly that the equalization or proactive copy is progressing.
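Since two EQZ readings taken some time apart are enough for a crude estimate, the arithmetic can be sketched in Python like this. It assumes the copy progresses at a roughly constant rate, which may not hold on a busy array, and the percentages are read by hand from SPx_cfg_info.txt (the sketch does not parse the file). The function name is made up for illustration.

```python
def estimate_remaining_hours(t1_hours, pct1, t2_hours, pct2):
    """Linearly extrapolate the hours left from two (time, EQZ percent) samples.

    Returns None if no progress was made between the samples, which is itself
    a useful sign that the copy or equalization may be stalled.
    """
    if t2_hours <= t1_hours:
        raise ValueError("the second sample must be taken after the first")
    rate = (pct2 - pct1) / (t2_hours - t1_hours)  # percent per hour
    if rate <= 0:
        return None
    return (100.0 - pct2) / rate


# Made-up readings: 10% at the first SPCollect, 20% two hours later.
print(estimate_remaining_hours(0, 10, 2, 20))  # 5% per hour, 80% left -> 16.0
```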

Update: I came across a new EMC article today that may have a command that will work to gauge progress better. I have not attempted it myself yet.

From:
https://community.emc.com/docs/DOC-7962

Categories: EMC VNX Tags:
