Blog Archives

Simplify Storage Management Today, Risk Free, and Free of Charge

Posted on by

While my peers have been blogging about the new CLARiiON and Celerra releases, both of which provide significant enhancements to the EMC CX4-based Unified platforms you already own, I thought I’d shift gears just a tad…

What if you are a Clariion CX/CX3 customer, or a CX4 customer who isn’t ready to upgrade to the newly released FLARE30 code, but you want to simplify management of your storage environment and get better reporting, dashboards, wizards, etc.?  Well, you are in luck.

Just as with previous versions of Navisphere and FLARE, EMC offers off-array versions of Clariion management agents, servers, and GUIs.  As of yesterday, that includes off-array versions of Unisphere.  If you are a current Clariion customer, you can log in to Powerlink, download the Unisphere off-array software, and build a management station.  After installation, you can manage your existing Clariion CX/CX3/CX4 hardware without upgrading the FLARE code.  As you upgrade your CX4 systems to FLARE30, new features will be enabled in Unisphere, and as you upgrade your Celerra NS systems to DART6, they can be added to the Clariion management domain and managed from the very same Unisphere instance.  How’s that for easy and convenient?

But what do you get by using Unisphere to manage your non-FLARE30 systems?  Unfortunately, you won’t be able to take advantage of FASTCache, FAST, Compression, and other features that are only available in FLARE30, but there are some advantages.

First and foremost, Unisphere completely dumps the Navisphere tree-based management view and replaces it with end-result based tasks.  So instead of creating several objects to provision RAID groups and LUNs and then presenting them to a host, you just run the “Allocate” wizard, select the array, disks/RAID group/pool, LUN size, hosts, etc., and commit.

Second, upon launching Unisphere and logging in, you are immediately presented with dashboard views showing the amount of used/available storage, and active alerts, all customizable, so you can see the state of your entire CLARiiON storage environment “at-a-glance”.

To install Unisphere today, log in to Powerlink, browse to “Support > Software Downloads and Licensing > Downloads T-Z > Unisphere Server Software” and download “EMC Unisphere Server” and “EMC Unisphere Client”.  Install them both on your Windows system and fire it up.  If you have Navisphere off-array software already installed, Unisphere will upgrade the existing installation for you.  You will also want to download and install Unisphere Service Manager (USM), also from Powerlink at “Support > Software Downloads and Licensing > Downloads T-Z > Unisphere Service Manager (USM).”  USM provides various support and service related tools, including active technical advisories for your storage arrays.

Begin using Unisphere today and you get some immediate benefits, plus you will be ready to take advantage of new features enabled with FLARE30 (FAST, FASTCache, Compression, etc.) as well as manage NAS across all of your Celerra systems once they are upgraded to DART6.  As a bonus, you’ll have a chance to get familiar with Unisphere before a future FLARE upgrade or new EMC Unified purchase forces you to learn it.

And did I mention you don’t have to buy anything or introduce risk with a firmware upgrade?

Just because it’s production, doesn’t mean it’s not a test

Back in July I wrote about the week-long sailing trip that ended after 1 day with engine failure and dramatic action.  Since then our old sailboat has been stuck in Anacortes, WA while the local marine service company diagnosed and repaired the engine.  My wife also delivered our first child during that time so we were a little busy anyway.  They declared the engine good to go last week and I scheduled sea trials and pickup for Tuesday (8/24).  We packed a cooler full of food, some clothes, and sleeping bags, and drove up to Anacortes to meet the boat.  After a slightly expensive lunch at the marina restaurant, with masterful drinks poured by the same bartender who served us last time, we met the Travelift about to splash the boat.

The engine fired up just fine and sounds much better than it used to.  It runs and idles smoother, doesn’t smoke, runs cooler, etc.  So we headed out for sea trials in the bay and cruised around for about 45 minutes at different RPMs, heating and cooling the engine to stress it a little, looking for any problems.  The boat runs great!  Under engine power we move about 1 knot faster than before too.  I think the engine had been running poorly for quite a while before it failed.  Anyway, satisfied that the boat engine was performing well, we headed back in to the dock.  I paid the bill, we loaded our provisions, and we headed out with just enough time to make it to Deception Pass for slack tide.

In the immortal words of Captain Ron: “Well, the best way to find out is to get her out on the ocean, Kitty. If anything’s gonna happen, it’s gonna happen out there.”

20 minutes out, a new sound develops from the engine compartment.  It sounds like metal rattling–a very distinct, sharp sound.  Down in the engine compartment it’s a very loud sound, an exhaust leak from somewhere.  A couple phone calls and we turn back to Anacortes.  We clearly aren’t making it to Deception Pass tonight.  The engine is not quite right yet.  The mechanic shows up and determines that the head gasket is leaking; it might have been a defective gasket.  But it’s solid copper and a new one is several days away.  Another mechanic joins us at 8:30am the next morning and finds out that the head bolts loosened during the sea trial and subsequent motoring.  He tightens them up and it’s running fine again.  So out for another trial, then back to cool the engine and check the bolts again–still good!

So we finally leave Puget Sound’s own Bermuda Triangle for home.  We pass through Deception right on time and continue south towards Coupeville on Whidbey Island.  In another moment of calamity on our eternal 3-hour tour, we are moving along at over 6 knots when the boat suddenly stops dead in the water and pitches forward.  Jason, who was in the galley, flies forward into the head and falls down while dishes go flying.  A quick check of the depth sounder (showing 2.8ft) confirms my fears: we hit a sand bar.  It turns out the navigator (me) was too preoccupied with his cell phone, dealing with plans for the night and talking to the car dealer about the Mazda’s coolant leak, to notice that we were about 100 yards outside the marked channel.  Reversing the engine does nothing to help and the current is pushing us against the sand bar pretty hard.

If you read the previous post, you’ll remember that the dinghy saved the day when the engine failed.  Well, another notch on the dinghy’s stern is due after I threw it off the bow, mounted the Yamaha motor, and used it as a mini tugboat to spin the sailboat around into the current to push off the sand bar.  I’m contemplating renaming the sailboat and dinghy to “The Problem” and “The Solution” respectively.

We made it safely to Coupeville and had a wonderful afternoon and evening.  My wife and baby drove over to meet us for dinner and the next morning we shoved off early for Everett.  We got a little wet on this last run due to rain but made it home safe, locked the boat down, hopped in the car and went home.  It’s a series of mini-adventures I will never forget.

On the plus side, our little old sailboat is now better equipped, I have a newfound respect for the dinghy, and I got to go boating once more before summer ends, even if it did cost us a lot more money than we had planned.

Lies, Damn Lies, and Marketing…

Yesterday, in his blog post entitled “Myth Busting: Storage Guarantees”, Vaughn Stewart from NetApp wrote about the EMC 20% Guarantee and posted a chart of storage efficiency features from EMC and NetApp platforms to illustrate his point.  Chuck Hollis from EMC called it “chartsmithing” in a comment but didn’t elaborate specifically on the chart’s deficiencies.  Well, allow me to take that ball…

As presented, Vaughn’s chart (below) is technically factual (with one exception which I’ll note), but it plays on the human emotion of Good vs Bad (Green vs Red) by attempting to show more Red on EMC products than there should be.

The first and biggest problem is the chart compares EMC Symmetrix and EMC Clariion dedicated-block storage arrays with NetApp FAS, EMC Celerra, and NetApp vSeries which are all Unified storage systems or gateways.  Rather than put n/a or leave the field blank for NAS features on the block-only arrays, the chart shows a resounding and red NO, leading the reader to assume that the feature should be there but somehow EMC left it out.

As far as keeping things factual, some of the EMC and NetApp features in this chart are not necessarily shipping today (very soon though, and since it affects both vendors I’ll allow it here).  And I must make a correction with respect to EMC Symmetrix and Space Reclamation, which IS available on Symm today.

I’ve taken the liberty of massaging Vaughn’s chart to provide a more balanced view of the feature comparison.  I’ve also added EMC Celerra gateway on Symmetrix to the comparison as well as an additional data point which I felt was important to include.

I’ve included some footnotes in the chart to explain some of the results but I’ll explain a little here as well.

1.) I removed the block-only EMC configuration devices because the NetApp devices in the comparison are Unified systems.

2.) I removed the SAN data row for Single Instance storage because Single Instance (identical file) data reduction technology is inherently NAS related.

3.) Zero Space Reclamation is a feature available in Symmetrix storage.  In Clariion, the Compression feature can provide a similar result since zero pages are compressible.

I left the 3 different data reduction techniques as individually listed even though the goal of all of them is to save disk space.  Depending on the data types, each method has strengths and weaknesses.

One question: if a bug in OnTap causes a vSeries to lose access to the disk on a Symmetrix during an online Enginuity upgrade, who do you call?  How would you know ahead of time if EMC hasn’t validated vSeries on Symmetrix like EMC does with many other operating systems/hosts/applications in eLab?

The goal of my post here really is to show how the same data can be presented in different ways to give readers a different impression.  I won’t get into too much as far as technical differences between the products, like how comparing FAS to Symmetrix is like comparing a box truck to a freight train, or how fronting an N+1 loosely coupled clustered, global cached, high-end storage array with a midrange dual-controller gateway for block data might not be in a customer’s best interest.

What do you think?

Comcast — Professional FUD Slinging…

When trying to sell something, one of the hard things that I’ve noticed sales people dealing with is the fine line between comparing your own product to your competitors’, and slinging FUD (fear, uncertainty, and doubt) about your competitors.  The right way to sell is to show the benefit of your products to your customer and leave the competitor out of the conversation as much as possible.  There are those that have real trouble staying on the right side of the line… and then there are those that just make stuff up, which is worse.

My wife and I were watching TV Monday evening when a Comcast guy rang the doorbell.  We switched to Verizon FiOS a couple years ago and have had zero desire to switch back to Comcast.  The Internet is faster, the SD and HD picture quality is better, and the DVR menu/guides make Comcast’s software look decades old, which it is.

Matt, from Comcast, started by mentioning that they will have trucks in the neighborhood over the next couple weeks, framing the reason for his being there as a courtesy to us so we aren’t worried about all the trucks.  He then asked if we were Comcast customers and when he found out we were Verizon customers he quickly segued into the true reason for his visit – FUD Slinging…

He informed us that now that Frontier has purchased FiOS from Verizon, Frontier does not want to keep the lines maintained, so they would be shutting down FiOS.  He stated that we have 3 months to get out of our FiOS contract without a penalty.  When I asked him to clarify how Frontier is shutting down FiOS, Matt told us they are shutting down Residential service but keeping the Business customers.

Okay, pause right there, there are 3 things wrong with this –

  1. So Frontier spends $billions to buy the FiOS infrastructure from Verizon only to shut it down?  Their investor presentation indicates that ALL new build outs are Fiber-to-the-Home; it would seem a big waste to invest in FTTH but not provide FTTH Services.
  2. Even if Frontier were shutting down some FiOS infrastructure, why would “the largest pure rural telecommunications carrier in the United States” shut down residential services and keep business services?
  3. If Frontier were shutting down FiOS and I had to find an alternative service, why would I have to get out of my contract at all?  It seems my contract would end on its own, right?

Then, as we chatted about why I switched to FiOS in the first place, and he tried REALLY hard to get me to set up an appointment for them to switch me back, I explained how the set top box software is light years ahead of Comcast’s and I’d probably want Tivo if I switched back to Comcast.  He then proceeded to tell me that Comcast had acquired Tivo.  Seriously?  It’s so easy to check these things.

So I got his phone number and finally got him to leave, saying I’d figure out my schedule and call him in the next couple days.

A little research on the web sets the record straight.  The reality of the situation is that Frontier purchased the FiOS assets from Verizon in order to increase their broadband and TV footprint and will continue to provide the services because FiOS aligns perfectly with their core business.  There is no urgency to switch to another carrier at this time, despite what Comcast says.  Frontier even took the time to update the logo in our DVR software already.

All indications are that Tivo is still independent as well.

So this guy, Matt from Comcast, is hoping that the Frontier acquisition was public enough that we know something is changing, but haven’t looked into it enough yet, and that we’ll be freaked out by his indicating that we only have 3 months to do something.  I’m not even a Comcast customer and yet I feel compelled to write to them to complain.  Well now I’ve complained in public.

Every Cruise is a Shakedown Cruise (in IT terms, every Production environment is also a QA environment)

It’s the morning of day 2 on a 7-day sailing trip in the San Juan Islands of Puget Sound.  We are 43 nautical miles from our homeport, and I’m sitting at the table watching a diesel mechanic take apart the little engine on our boat.

Over the 4th of July weekend, we spent nearly 3 full days getting the boat ready for this trip.  We washed it inside and out, installed new convenience items, changed the oil, and checked the transmission fluid, batteries, electrical systems, etc.  We have taken several short and long trips with our Cal 2-29 over the past 5 years and there hasn’t been a single trip over 24 hours that didn’t require a repair of some kind.  Once, the bilge pump sucked water INTO the boat and we had to re-plumb the bilge pump system with makeshift hoses available at the nearby port.  Another time, while docking in Friday Harbor, my wife leaned too hard on a stanchion, causing it to break off and sending her into the cold Puget Sound water.  Twice, an over-zealous helmsperson switched from reverse to forward gear while the engine was at speed and tore the flex coupling on the prop shaft in half.  Both times we were close to docking so we just drifted into port and made repairs.  After that we thought we had finally seen the last of the major issues for a while.

On Tuesday morning, we left too early to fuel up so I brought a 5-gallon can of diesel on board.  35 nautical miles later that proved to be a good idea when we almost ran out of fuel while navigating the tight and dangerous Deception Pass.  We refueled without stopping using a makeshift funnel made out of a plastic water bottle.  Afterwards, the engine was clearly turning more than 2000 rpm based on sound and boat speed but the tachometer was showing 600-800 and bouncing wildly.  Something to look at later since the engine seemed okay.

An hour later, on the west side of Fidalgo Island, entering the Strait of Juan de Fuca, we were planning our final destination for the day when the engine began to lose power for an unknown reason. Finally, we saw what seemed to be unusual black smoke from the exhaust.  At that point we shut down the engine to check on things.  We were a few hundred yards from a rock wall, which was cause for some concern, but we had a little time to assess the situation.

At first glance, the alternator belt was very loose but it didn’t make sense because the bolt that allows for adjustment had clearly not moved.  It turned out that the bolt on the other end of the mounting arm, the one that secures the arm to the engine block, had sheared off and the arm was free of the engine.  Since the engine is an old diesel, which does not require any power or electronic systems to run, we decided we’d try to remove the belt and go without the alternator until we could repair it.  We also found a few random bolts and screws in the engine compartment.

While working to secure the belt out of the way with zip-ties we noticed the starter solenoid had pretty much fallen off of the starter, the spring was visible even.  The bolts had come loose and one was missing, plus reattaching would require a lot of work due to the location of the bolts.  Well, being a single cylinder small diesel, the Farymann A30M can be started with a hand crank when warm, so we secured the solenoid out of the way and figured we’d fire it up with the crank and get to a nearby marina.

Hand cranking failed to produce a running engine, and we really don’t know why; we may have needed the glow plug on, which we forgot about until long after giving up.  It was looking like we were going to have to call Vessel Assist, when I remembered a story I heard about someone pushing their sailboat with their dinghy lashed to the side of the boat near the stern.  So we secured the dinghy, fired up the Yamaha 2.5hp motor, and amazingly we were moving along at 4 knots just in time to move away from the rock wall that was now only about 100 yards away.  An hour later we dinghy-motored our little 35-year-old Cal into Flounder Bay on the northwest corner of Fidalgo Island.  Some steaks, corn on the cob, and a healthy dose of Captain Morgan over the next few hours helped the mood and the day was done!

At this point we’ve found that not only were the alternator and starter solenoid loose from the engine, one of the two engine mounts was about 30 minutes of running from falling off also.  It’s likely the loose engine mount added vibration, which caused the other bolts to loosen, causing more bolts to fail completely–a multi-stage failure of sorts.  Today, our goal is to work with the marine service tech to get the engine put back together and tightened up, then see if the engine will run, and assess anything we find there.  At $92.50 per hour, this could be a costly day.

This experience, and the previous ones we’ve had as well, reminded me that you need to be prepared for anything, especially when your life depends on it.  When your customers (internal or external) depend on your IT systems, you should be prepared for anything to go wrong, and you might have to patch things together to get it going until you can fix it the right way.  And that’s okay.  Remember, duct tape and zip-ties can pretty much fix anything!  😉

And it’s only been 24 hours since the trip started.

Follow up here

Resiliency vs Redundancy: Using VPLEX for SQL HA

A little history on my philosophy around high-availability

Around the year 2000, when I was working in network operations for a large wireless telco, a very senior network architect explained to me the company’s philosophy on building high availability solutions into the network.  The phrase I remember from that conversation was “we don’t build redundant networks, we build resilient networks.”  The difference is that while redundant networks fail over to secondary paths to resume traffic, resilient networks don’t go down at all.  This concept has stuck with me ever since and I tend to tackle high-availability problems of all kinds with this idea in mind.  It’s frankly been very difficult to build solutions that are resilient across the entire stack, mostly because infrastructure technology hasn’t quite gotten there yet.

Things may have changed…

I recently had a meeting with a customer to discuss local high availability for SQL.  This customer has a very large multi-node clustered SQL environment (hundreds of TBs of data, hundreds of databases, hundreds of instances, many clusters, many nodes per cluster) and has been testing SQL database mirroring as an alternative to traditional Windows Failover Clustering.  The meeting wound up focused primarily on leveraging VPLEX as an alternative to SQL mirroring, and the reasons for that decision suddenly reminded me of the Resiliency vs Redundancy discussion I had years ago.  A VPLEX solution potentially solves the same problem as DB mirroring, and does it with less complexity and less risk.

VPLEX Local as a Resilient HA solution

One of the many features of VPLEX is its ability to mirror data across multiple storage arrays and present that mirror as a single LUN to the host.  For customers already running large multi-node MSCS clusters, the LUN appears just like any normal storage LUN and Windows/SQL treat the LUN normally.  There are several reasons VPLEX should be considered as an alternative to database mirroring.  (Much of this applies to Exchange CCR as well.)

VPLEX hardware is inherently Resilient.  A VPLEX cluster is an N+1 cluster of loosely coupled nodes, cooperating with each other, but not depending on each other.  Hosts can access any of the hosted data, through any of the ports, on any of the cluster nodes.  If a node fails for any reason, the remaining nodes continue serving IO for any data.  Except for a dead path on the host side (managed by PowerPath or MPIO), there is no failover process, and no cache mirroring to worry about.  The potential performance impact of a failure is equal to 1 divided by the quantity of that component in the cluster.  (For example, a large VPLEX Local cluster has 128 x 8Gb FC ports across 8 director nodes.)
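
To put numbers on that 1/N rule of thumb, here is a minimal Python sketch.  It assumes load is spread evenly across identical components, which is my simplification and not something VPLEX guarantees; the function name `failure_impact` is just an illustrative helper, and the component counts are the ones quoted above.

```python
from fractions import Fraction


def failure_impact(component_count: int) -> Fraction:
    """Fraction of a resource lost when one of N identical components fails.

    Assumes load is evenly balanced across all components (a simplification).
    """
    if component_count < 1:
        raise ValueError("component_count must be at least 1")
    return Fraction(1, component_count)


# Counts taken from the large VPLEX Local example in the text.
components = {"port": 128, "director": 8}

for name, count in components.items():
    impact = failure_impact(count)
    print(f"Losing 1 of {count} {name}s: {float(impact):.2%} of that resource")
```

Under that even-load assumption, losing a single port costs well under 1% of port capacity, while losing a whole director costs 12.5% of director capacity.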

In addition, because VPLEX utilizes a write-through cache, there is never any dirty cache data (data in cache that has not been committed to disk) in a VPLEX system.  A power outage or VPLEX hardware failure does not put data at risk.

Other Advantages of using VPLEX over SQL Database Mirroring

Improved Performance:

  • Compared with SQL Database mirroring, VPLEX mirroring has significantly less impact on transaction performance for writes and can improve transaction performance in some cases due to the large read cache in the VPLEX directors. (Note: I am comparing to DB Mirroring in Full-Safety mode since the customer’s requirement was a zero-data-loss solution.)

Non-Disruptive Storage Failover:

  • In the event of a storage failure, SQL Mirroring must perform a cluster node failover which takes a few seconds at best, possibly disrupting applications.  VPLEX provides completely non-disruptive failover when a storage failure occurs.  (A server hardware failure still triggers a node failover as it would in any other failover clustering scenario.)

Less Management Overhead:

  • From a management perspective, using VPLEX instead of SQL Database mirroring gives the SQL DBAs fewer SQL instances and fewer moving parts to manage on a daily basis.  The storage team just presents a mirrored LUN from VPLEX to the cluster and it’s business as usual for the DBAs.
  • VPLEX also allows the storage team to non-disruptively migrate data between storage arrays behind VPLEX to balance load, perform hardware refreshes, or resolve capacity problems.  VPLEX performs the migration at the direction of the storage admins.

Reduced Risk:

  • Reducing management complexity also reduces risk.  With a high number of database instances and db mirrors involved in a large environment like this one, the chance of one of those mirrors having a problem, or being configured incorrectly, is increased.  DBAs can rely on VPLEX mirroring all of the data, 24x7x365, even when host maintenance is being performed.

Reduced Cost:

  • When compared with the SQL Database Mirroring solution, the VPLEX solution reduced the number of physical servers needed in this environment, reducing cost enough to more than offset the cost of VPLEX itself.  Combined with reductions in soft costs, like reduced DBA management overhead, VPLEX will actually save them quite a bit of money, and increased uptime during storage refresh and maintenance will increase revenues in this case as well.

A Distributed Future:

  • Next year, when a second datacenter is online nearby, the first VPLEX Local cluster can be connected to another VPLEX cluster in the new datacenter.  Then the SQL cluster nodes and data can be distributed across both datacenters, providing protection from entire datacenter outages, or solving space constraints with no changes to the application or servers, and no downtime.

I wonder how many other customers would like to build more resilient infrastructures?

If you combine a VPLEX solution with a true cluster file system and an active-active database engine (e.g., Oracle RAC), you can eliminate the disruption caused by server hardware failures.  It’s just a matter of time now until the entire stack can be designed for true resiliency with very little management overhead.  I can’t wait to see what happens.

The following EMC White Paper has a lot of good information about using VPLEX in this same context:

Workload Resiliency with EMC VPLEX

While EMC users benefit from Replication Manager, NetApp users NEED SnapManager

This is a follow up to my recent post NetApp and EMC: Replication Management Tools Comparison, in which I discussed the differences between EMC Replication Manager and NetApp SnapManager.

————

As a former customer of both NetApp and EMC, and now as an employee of EMC, I noticed a big difference between NetApp and EMC as far as marketing their replication management tools. When I was a customer, EMC talked about Replication Manager several times and we purchased it and deployed it. NetApp made SnapManager a very central part of their sales campaign, sometimes skipping any discussion of the underlying storage in favor of showing off SnapManager functionality. This is an extremely effective sales technique and NetApp sales teams are so good at this that many people don’t even realize that other vendors have similar, and in my opinion EMC has better, functionality.  One of the reasons for this difference in marketing strategy is that NetApp users NEED SnapManager, while EMC users do not always need Replication Manager.

The reason why is both simple and complex…

EMC storage arrays (Clariion, Symmetrix, RecoverPoint, Invista) all have one technology in common that NetApp Filers do not–Consistency Groups. A consistency group allows the storage system to take a snapshot of multiple LUNs simultaneously, so simultaneous in fact that all of the snapshots are at the exact same point in time down to the individual write. This means that, without taking any applications offline and without any orchestration software, EMC storage arrays can create crash-consistent copies of nearly any kind of data at any time.

The EMC Whitepaper “EMC CLARiiON Database Storage Solutions: Oracle 10g/11g with CLARiiON Storage Replication Consistency” downloadable from EMC’s website has the following explanation of consistency groups in general…

“…Consistent replication operates on multiple LUNs as a set such that if the replication action fails for one member in the set, replication for all other members of the set are canceled or stopped.  Thus the contents of all replicated LUNs in the set are guaranteed to be identical point-in-time replicas of their source and dependent-write consistency is maintained…”

“…With consistent replication, the database does not have to be shut down or put into “hot backup mode.”  Replicates created with SnapView or MV/S (or MV/A, Timefinder, SRDF, Recoverpoint, etc) consistency operations, without first quiescing or halting the application, are restartable point-in-time replicas of the production data and guaranteed to be dependent-write consistent.”
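
To make “dependent-write consistency” concrete, here is a toy Python model of two LUNs (a log and a data volume).  It is only an illustrative sketch: the `Lun` class, `commit`, and the consistency check are hypothetical names I made up, and nothing here represents how SnapView, MirrorView, or any array actually implements consistency groups.

```python
from dataclasses import dataclass, field


@dataclass
class Lun:
    name: str
    writes: list = field(default_factory=list)  # transaction ids, in write order

    def snapshot(self) -> tuple:
        return tuple(self.writes)


def commit(txn: int, log: Lun, data: Lun, between=None) -> None:
    """Dependent write: the data write is issued only after the log write."""
    log.writes.append(txn)
    if between:
        between()  # hook to simulate an event landing mid-transaction
    data.writes.append(txn)


def is_dependent_write_consistent(log_snap: tuple, data_snap: tuple) -> bool:
    """Every transaction visible in the data copy must also be in the log copy."""
    return set(data_snap) <= set(log_snap)


log, data = Lun("log"), Lun("data")
commit(1, log, data)
commit(2, log, data)

# Independent per-LUN snapshots, with a transaction landing between them.
log_snap = log.snapshot()
commit(3, log, data)
data_snap = data.snapshot()
independent_ok = is_dependent_write_consistent(log_snap, data_snap)

# Group snapshot: both LUNs captured at one point in the write stream,
# even though a transaction is only half written at that moment.
group = {}


def snap_group() -> None:
    group["log"], group["data"] = log.snapshot(), data.snapshot()


commit(4, log, data, between=snap_group)
group_ok = is_dependent_write_consistent(group["log"], group["data"])

print(f"independent snapshots consistent: {independent_ok}")  # False
print(f"group snapshot consistent: {group_ok}")                # True
```

The independent snapshots end up with a data copy containing a transaction the log copy never saw, which is exactly what a database cannot recover from.  The group snapshot catches a transaction half done, but the copy is still restartable, like a server that lost power at that instant.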

Consistency is important for any application that is writing to multiple LUNs at the same time such as SQL database and log volumes. SnapManager and Replication Manager actually prepare the application by quiescing the database during the snapshot creation process. This process creates “application-consistent” copies which are technically better for recovery compared with “storage-consistent” copies (also known as crash-consistent copies).

So, while I will acknowledge that quiescing the database during a snapshot/replication operation provides the best possible recovery image, that may not be realistic in some scenarios.  The first issue is that the actual operation of quiescing, snapping, checking the image, then pushing an update to a remote storage array takes some time.  Depending on the size of the dataset, this operation can take from several minutes to several hours to complete.  If you have a Recovery Point Objective (RPO) of 5 minutes or less, using either of these tools is pretty much a non-starter.
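
As a rough back-of-the-envelope check, the Python sketch below estimates the worst-case data exposure for a cycle-based tool.  The model is my own simplification, not vendor guidance: it assumes the snapshot is taken at the start of a cycle and the remote copy only becomes usable when the cycle ends.  The function names and the sample durations are hypothetical.

```python
def worst_case_exposure(cycle_minutes: float, gap_minutes: float = 0.0) -> float:
    """Worst-case age of the newest usable remote copy, in minutes.

    Simplified model: a failure just before cycle N+1 finishes leaves only
    the copy snapped at the start of cycle N, so the exposure is two full
    cycles plus any idle gap between them.
    """
    return 2 * cycle_minutes + gap_minutes


def meets_rpo(cycle_minutes: float, rpo_minutes: float, gap_minutes: float = 0.0) -> bool:
    return worst_case_exposure(cycle_minutes, gap_minutes) <= rpo_minutes


# Hypothetical cycle durations checked against a 5-minute RPO.
for cycle in (2, 10, 120):
    verdict = "meets" if meets_rpo(cycle, rpo_minutes=5) else "misses"
    print(f"{cycle}-minute cycle: worst case {worst_case_exposure(cycle):.0f} min, {verdict} a 5-minute RPO")
```

Even with back-to-back cycles, a quiesce/snap/verify/transfer run that takes 10 minutes leaves you exposed for roughly 20 in this model, so only very small datasets could fit inside a 5-minute RPO.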

Another issue is one of application support.  While EMC Replication Manager and NetApp SnapManager have very wide support for the most popular operating systems, filesystems, databases, and applications, they certainly don’t support every application.  A very simple example is a Novell Netware file server with an NSS pool/volume spanning multiple LUNs.  Neither NetApp nor EMC has support for Novell Netware in their replication management tools.  While you can certainly replicate all of the LUNs with NetApp SnapManager, SnapManager has no consistency technology built-in to keep the LUNs write-order consistent.  The secondary copy will appear completely corrupt to the Netware server if a recovery is attempted.  Through the use of consistency groups with MirrorView/Async, the replication of each LUN is tracked as a group and all of the LUNs are write-order consistent with each other, keeping the filesystem itself consistent.  You would need to have either array-level consistency technology, or support for Netware in the replication management tool, in order to replicate such a server.  Unfortunately, NetApp provides neither.

You may have complex applications that consist of Oracle and SQL databases, NTFS filesystems, and application servers running as VMs.  Using array-based consistency groups, you can replicate all of these components simultaneously and keep them all consistent with each other.  This way you won’t have transactions that normally affect two databases end up missing in one of the two after a recovery operation, even if those databases are different technologies (Oracle and MySQL, or PostgreSQL for example).

EMC Storage arrays provide consistency group technology for Snapshots and Replication in Clariion and Symmetrix storage arrays.  In fact, with Symmetrix, consistency groups can span multiple arrays without any host software.  By comparison, NetApp Filers do not have consistency group technology in the array.  Snapshots are taken (for local replicas and for SnapMirror) at the FlexVolume level.  Two FlexVolumes cannot be snapped consistently with each other without SnapManager.

There are a couple of workarounds for NetApp users–you can snapshot an aggregate, but that is not recommended by NetApp for most customers, or you can put multiple LUNs in the same FlexVol, but that still limits you to 16TB of data including snapshot reserve space, and both options violate the database design best practice of keeping data and logs on separate spindles for recovery.  Even with these workarounds, you cannot gain LUN consistency across the two controllers in an HA Filer pair, something the CLARiiON does natively, and which can help with load balancing IO across the storage processors.

In general, I recommend that EMC customers use EMC Replication Manager and NetApp customers use SnapManager for the applications that are supported, and for most scenarios.  But when RPOs are short, or the environment falls outside the support matrix for those tools, consistency groups become the best or only option.

Incidentally, with EMC RecoverPoint, you get the best of both worlds: CDP or near-CDP replication of data using consistency groups for zero or near-zero RPOs, plus application-consistent bookmarks made anytime the database is quiesced.  Recovery is done from the up-to-the-second version of the data, but if that data is not good for any reason, you can roll back to another point in time, including a point-in-time when the database was quiesced (a bookmark).

So, while EMC has, in Replication Manager, an equivalent offering to NetApp’s SnapManager, EMC customers are not required to use it, and in some cases they can achieve better results using array-based consistency technologies.

NetApp and EMC: Replication Management Tools Comparison

I started this post before I started working for EMC and got sidetracked with other topics.  Recent discussions I’ve had with people have got me thinking more about orchestration of data protection, replication, and disaster recovery, so it was time to finish this one up…

———————————–

Prior to coming to work for EMC, I was working on a project to leverage NetApp and EMC storage simultaneously for redundancy.  I had a chance to put various tools from EMC and NetApp into production and have been able to make some observations with respect to some of the differences.  This is a follow-up to my previous NetApp and EMC posts…

NetApp and EMC: Real world comparisons
NetApp and EMC: Startup and First Impressions
NetApp and EMC: ESX and Exchange 2007 CCR
NetApp and EMC: Exchange 2007 Replication

Specifically this post is a comparison between NetApp SnapManager 5.x and EMC Replication Manager 5.x.  First, here’s a quick background on both tools based on my personal experience using them.

Description

EMC Replication Manager (RM) is a single application that runs on a dedicated “Replication Manager Server.”  RM agents are deployed to the hosts of applications that will be replicated.  RM supports local and remote replication features in EMC’s Clariion storage array, Celerra Unified NAS, Symmetrix DMX/V-Max, Invista, and RecoverPoint products.  With a single interface, Replication Manager lets you schedule, modify, and monitor snapshot, clone, and replication jobs for Exchange, SQL, Oracle, Sharepoint, VMWare, Hyper-V, etc.  RM supports Role-Based authentication so application owners can have access to jobs for their own applications for monitoring and managing replication.  RM can manage jobs across all of the supported applications, array types, and replication technologies simultaneously.  RM is licensed by storage array type and host count. No specific license is required to support the various applications.

NetApp SnapManager is actually a series of applications designed for each application that NetApp supports.  There are versions of SnapManager for Exchange, SQL, Sharepoint, SAP, Oracle, VMWare, and Hyper-V.  The SnapManager application is installed on each host of an application that will be replicated, and jobs are scheduled on each specific host using Windows Task Scheduler.  Each version of SnapManager is licensed by application and host count.  I believe you can also license SnapManager per-array instead of per-host which could make financial sense if you have lots of hosts.

Commonality

EMC Replication Manager and NetApp SnapManager products tackle the same customer problem–provide guaranteed recoverability of an application, in the primary or a secondary datacenter, using array-based replication technologies.  Both products leverage array-based snapshot and replication technology while layering application-consistency intelligence to perform their duties.  In general, they automate local and remote protection of data.  Both applications have extensive CLI support for those that want that.

Differences

  • Deployment
    • EMC RM – Replication Manager is a client-server application installed on a control server.  Agents are deployed to the protected servers.
    • NetApp SM – SnapManager is several applications that are installed directly on the servers that host applications being protected.
  • Job Management
    • EMC RM – All job creation, management, and monitoring is done from the central GUI. Replication Manager has a Java based GUI.
    • NetApp SM – Job creation and monitoring is done via the SnapManager GUI on the server being protected.  SnapManager utilizes an MMC based GUI.
  • Job Scheduling
    • EMC RM – Replication Manager has a central scheduler built-in to the product that runs on the RM Server.  Jobs are initiated and controlled by the RM Server, the agent on the protected server performs necessary tasks as required.
    • NetApp SM – SnapManager jobs are scheduled with Windows Task Scheduler after creation.  The SnapManager GUI creates the initial scheduled task when a job is created through the wizard.  Modifications are made by editing the scheduled task in Windows task scheduler.

So while the tools essentially perform the same function, you can see that there are clear architectural differences, and that’s where the rubber meets the road.  Being a centrally managed client-server application, EMC Replication Manager has advantages for many customers.

Simple Comparison Example: Exchange 2007 CCR cluster
(snapshot and replicate one of the two copies of Exchange data)

With NetApp SnapManager, the application is installed on both cluster nodes, then an administrator must log on to the console on the node that hosts the copy you want to replicate, and create two jobs which run on the same schedule.  Job A is configured to run when the node is the active node, Job B is configured to run when the node is passive.  Due to some of the differences in the settings, I was unable to configure a single job that ran successfully regardless of whether the node was active or passive.  If you want to modify the settings, you either have to edit the command line options in the Scheduled Task, or create a new job from scratch and delete the old one.

With EMC Replication Manager, you deploy the agent to both cluster nodes, then in the RM GUI, create a job against the cluster virtual name, not the individual node.  You define which server you want the job to run on in the cluster, and whether the job should run when the node is passive, active, or both.  All logs, monitoring, and scheduling are handled in the same RM GUI, even if you have 50 Exchange clusters, or SQL and Oracle for that matter.  Modifying the job is done by right-clicking on the job and editing the properties.  Modifying the schedule is done in the same way.

So as the number of servers and clusters increases in your environment, having a central UI to manage and monitor all jobs across the enterprise really helps.  But here’s where having a centrally managed application really shines…

But what if it gets complicated?

Let’s say you have a multi-tier application like IBM FileNet, EMC Documentum, or OpenText and you need to replicate multiple servers, multiple databases, and multiple file systems that are all related to that single application.  Not only does EMC Replication Manager support SQL and Filesystems in the same GUI, you can tie the jobs together and make them dependent on each other for both failure reporting and scheduling.  For example, you can snapshot a database and a filesystem, then replicate both of them without worrying about how long the first job takes to complete.  Jobs can start other jobs on completely independent systems as necessary.

Without this job dependence functionality, you’d generally have to create scheduled tasks on each server and have dependent jobs start with a delay that is long enough to allow the first job to complete while as short as possible to prevent the two parts of the application from getting too far out of sync.  Sometimes the first job takes longer than usual, causing subsequent jobs to complete incorrectly.  This is where Replication Manager shows its muscle with its ability to orchestrate complex data protection strategies, across the entire enterprise, with your choice of protection technologies (CDP, Snapshot, Clone, Bulk Copy, Async, Sync) from a single central user interface.
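
The difference between "start after a fixed delay" and "start when the prerequisite finishes" can be sketched in a few lines of Python.  This is only an illustration of dependency-driven scheduling, not Replication Manager's actual interface: the job names, the JOBS table, and the run_jobs helper are all hypothetical, and it needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs for a multi-tier application: job -> jobs it depends on.
JOBS = {
    "snapshot_database": set(),
    "snapshot_filesystem": set(),
    "replicate_database": {"snapshot_database"},
    "replicate_filesystem": {"snapshot_filesystem"},
    "report_status": {"replicate_database", "replicate_filesystem"},
}


def run_jobs(jobs, run):
    """Run each job only after all of its prerequisites succeeded.

    `run` is a callable returning True on success.  A job whose
    prerequisite failed or was skipped is itself skipped, so nothing
    starts against a half-finished earlier step.
    """
    results = {}
    for job in TopologicalSorter(jobs).static_order():
        if all(results.get(dep) == "ok" for dep in jobs[job]):
            results[job] = "ok" if run(job) else "failed"
        else:
            results[job] = "skipped"
    return results


# Simulate a run in which the filesystem snapshot fails.
outcome = run_jobs(JOBS, lambda job: job != "snapshot_filesystem")
for job, status in outcome.items():
    print(f"{job}: {status}")
```

No timers are involved, so it makes no difference how long the first job takes, and a failure is reported once at the step where it happened instead of surfacing later as a bad replica.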

You don’t need a Backup solution!

Well, not exactly.  What you really need is a restore solution!

I was discussing this with a colleague recently as we compared difficulties multiple customers are having with backups in general.  My colleague was relating a discussion he had with his customer where he told them, “stop thinking about how to design a backup solution, and start thinking about how to design a restore solution!”

Most of our customers are in the same boat: they work really hard to make sure that their data is backed up within some window of time, and offsite as soon as possible, in order to ensure protection in the event of a catastrophic failure.  What I’ve noticed in my previous positions in IT, and more so now as a technical consultant with EMC, is that (in my experience) most people don’t really think about how that data is going to get restored when it is needed.  There are a few reasons for this:

  • Backing up data is the prerequisite for a restore; IT professionals need to get backups done, regardless of whether they need to restore the data.  It’s difficult to plan for theoretical needs and restore is still viewed, incorrectly, as theoretical.
  • Backup throughput and duration are easily measured on a daily basis, while restores occur much more rarely and are not normally reported on.
  • Traditional backup has been done largely the same way for a long time and most customers follow the same model of nightly backups (weekly full, daily incremental) to disk and/or tape, shipping tape offsite to Iron Mountain or similar.

I think storage vendors, EMC and NetApp particularly, are very good at pointing out the distinction between a backup solution and a restore solution, where backup vendors are not quite as good at this.  So what is the difference?

When designing a backup solution the following factors are commonly considered:

  • Size of Protected Data – How much data do I have to protect with backup (usually GB or TB)
  • Backup Window – How much time do I have each night to complete the backups (in hours)
  • Backup Throughput – How fast can I move the data from its normal location to the backup target
  • Applications – What special applications do I have to integrate with (Exchange, Oracle, VMWare)
  • Retention Policy – How long do I have to hang on to the backups for policy or legal purposes
  • Offsite storage – How do I get the data stored at some other location in case of fire or other disaster

If you look at it from a restore perspective, you might think about the following:

  • How long can I afford to be down after a failure?  Recovery Time Objective (RTO): This will determine the required restore speed.  If all backups are stored offsite, the time to recall a tape or copy data across the WAN affects this as well.
  • How much data can I afford to lose if I have to restore? Recovery Point Objective (RPO):  This will determine how often the backup must occur, and in many cases this is less than 24 hours.
  • Where do I need to restore the application? This will help in determining where to send the data offsite.

Answer these questions first and you may find that a traditional backup solution is not going to fulfill your requirements.  You may need to look at other technologies, like Snapshots, Clones, replication, CDP, etc.  If a backup takes 8 hours, the restore of that data will most likely take at least 8 hours, if not closer to 16 hours.  If you are talking about a highly transactional database, hosting customer facing web sites, and processing millions of dollars per hour, 8 hours of downtime for a restore is going to cost you tens or hundreds of millions of dollars in lost revenue.
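
The arithmetic behind that last point is simple enough to sketch in Python.  The 8-hour backup and the "up to twice as long to restore" range come from the paragraph above; the revenue figure is a placeholder I made up, so plug in your own.

```python
def restore_hours(backup_hours, slowdown=1.0):
    """Estimate restore duration as backup duration times a slowdown factor."""
    return backup_hours * slowdown


def downtime_cost(hours_down, revenue_per_hour):
    """Revenue lost while the application is unavailable."""
    return hours_down * revenue_per_hour


BACKUP_HOURS = 8               # from the example above
REVENUE_PER_HOUR = 2_000_000   # hypothetical: "millions of dollars per hour"

for slowdown in (1.0, 2.0):    # restore as fast as the backup, or twice as slow
    hours = restore_hours(BACKUP_HOURS, slowdown)
    cost = downtime_cost(hours, REVENUE_PER_HOUR)
    print(f"{hours:.0f} hours down costs about ${cost:,.0f}")
```

Run the same numbers with a restore that takes minutes instead of hours and the gap between the two results is the budget you can justify for a restore solution.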

Two of my customers have database instances hosted on EMC storage, for example, which are in the 20TB size range.  They’ve each architected a backup solution that can get that 20TB database backed up within their backup window.  The problem is, once that backup completes, they still have to offsite the backup, and replicate it to their DR site across a relatively small WAN link.  They both use compressed database dumps for backup because, from the DBA’s perspective, dumps are the easiest type of backup to restore from, and the compression helps get 20TB of data pushed across 1GbE connections to the backup server.  One of the customers is actually backing up all of their data to DataDomain deduplication appliances already; the other is planning to deploy DataDomain.  The problem in both cases is that, if you pre-compress the backup data, you break deduplication, and you get no benefit from the DataDomain appliance vs. traditional disk.  Turning off compression in the dump can’t be done because the backup would take longer than the backup window allows.  The answer here is to step back, think about the problem you are trying to solve–restoring data as quickly as possible in the event of failure–and design for that problem.
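
If it isn't obvious why pre-compression hurts deduplication, the following Python sketch shows the effect on a small scale.  It is a simplification under stated assumptions: it uses fixed 4KB chunks and SHA-256 fingerprints, whereas a real appliance uses its own segmenting scheme, and the synthetic "dump" format and the make_dump helper are invented for the demo.  Two nearly identical dumps are compared chunk by chunk, first raw and then after zlib compression.

```python
import hashlib
import zlib

CHUNK = 4096  # fixed-size chunks, an assumption made to keep the demo simple


def fingerprints(data):
    """Fingerprint each fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def shared_fraction(old, new):
    """Fraction of chunks in `new` that were already stored from `old`."""
    known = set(fingerprints(old))
    chunks = fingerprints(new)
    return sum(fp in known for fp in chunks) / len(chunks)


def make_dump(rows=50_000):
    """Synthetic, compressible 'database dump' (hypothetical format)."""
    return b"".join(b"row %08d value %06d\n" % (i, (i * 7919) % 100003)
                    for i in range(rows))


monday = make_dump()
changed = bytearray(monday)
changed[100:164] = bytes(range(128, 192))  # one small, same-length change
tuesday = bytes(changed)

raw_shared = shared_fraction(monday, tuesday)
compressed_shared = shared_fraction(zlib.compress(monday),
                                    zlib.compress(tuesday))

print(f"raw dumps:        {raw_shared:.1%} of chunks already stored")
print(f"compressed dumps: {compressed_shared:.1%} of chunks already stored")
```

With the raw dumps only the chunk containing the change is new.  Once each dump is compressed as a single stream, a change near the start alters the encoded bytes that follow it, so the second day's backup looks almost entirely new to the deduplication engine.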

How might these customers leverage what they already have, while designing a restore solution to meet their needs?

Since they are already using EMC storage, the first step would be to start taking snapshots and/or clones of the database.  These snapshots can be used for multiple purposes…

  • In the event of database corruption, or other host/filesystem/application level problem, the production volume can be reverted to a snapshot in a matter of minutes regardless of the size of the database (better RTO).  Snapshots can be taken many times a day to reduce the amount of data loss incurred in the event of a restore (better RPO).
  • A snapshot copy of the database can be mounted to a backup server directly and backed up directly to tape or backup disk.  This eliminates the requirement to perform database dumps at all as well as any network bottleneck between the database server and backup server.  Since there is no dump process, and no requirement to pre-compress the data, de-duplication (via DataDomain) can be employed most efficiently.  Using a small 10Gbps private network between the backup media servers and DataDomain appliances, in conjunction with DD-BOOST, throughput can be 2.5X faster than with CIFS, NFS, or VTL to the same DataDomain appliance.  And with de-duplication being leveraged, retention can be very long since each day’s backup only adds a small amount of new data to the DataDomain.
  • Now that we’ve improved local restore RTO/RPO, eliminated the backup window entirely for the database server, and decreased the amount of disk required for backup retention, we can replicate the backup to another DataDomain appliance at the DR site.  Since we are taking full advantage of de-duplication now, the replication bandwidth required is greatly reduced and we can offsite the backup data in a much shorter period of time.
  • Next, we give the DBAs back the ability to restore databases easily, and at will, by leveraging EMC Replication Manager.  RM manages the snapshot schedules, mounting of snaps to the backup server, and initiation of backup jobs from the snapshot, all in a single GUI that storage admins and DBAs can access simultaneously.
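
To put rough numbers on the network side of those steps, here is a small Python calculation.  The 20TB size and the 1GbE backup link come from the customer example; the 100Mbps WAN link and the 2% daily change rate are placeholders I picked for illustration, and the function assumes the link runs at full line rate with no protocol overhead, which real transfers will not achieve.

```python
TB = 10**12  # decimal terabyte, in bytes


def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move the data at the given link speed."""
    return data_bytes * 8 / (link_bits_per_second * efficiency) / 3600


DATABASE = 20 * TB           # from the customer example
GIGABIT = 10**9              # the 1GbE link to the backup server
WAN = 100 * 10**6            # hypothetical 100Mbps link to the DR site
DAILY_NEW_FRACTION = 0.02    # hypothetical share of new data after dedupe

print(f"20TB uncompressed over 1GbE: {transfer_hours(DATABASE, GIGABIT):.1f} hours")
print(f"20TB over the WAN:           {transfer_hours(DATABASE, WAN):.1f} hours")
print(f"Daily new data over the WAN: "
      f"{transfer_hours(DATABASE * DAILY_NEW_FRACTION, WAN):.1f} hours")
```

The first line shows why these customers felt forced to compress their dumps, and the last two show why replicating only de-duplicated changes is what makes the WAN link workable.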

So we leveraged the backup application they already own, the DataDomain appliances they already own, storage arrays they already own, built a small high-bandwidth backup network, and layered some additional functionality, to drastically improve their ability to restore critical data.  The very next time they have a data integrity problem that requires a restore, these customers will save literally millions of dollars due to their ability to restore in minutes vs. hours.

If RPOs of a few hours are not acceptable, then a Continuous Data Protection (CDP) solution could be added to this environment.  EMC RecoverPoint CDP can journal all database activity to be used to restore to any point in time, bringing data loss (RPO) to zero or near-zero, something no amount of snapshots can provide, and keeping restore time (RTO) within minutes (like snapshots).  Further, the journaled copy of the database can be stored on a different storage array providing complete protection for the entire hardware/software stack.  RecoverPoint CDP can be combined with Continuous Remote Replication (CRR) to replicate the journaled data to the DR site and provide near-zero RPO and extremely low RTO in a DR/BC scenario.  Backups could be transitioned to the DR site leveraging the RecoverPoint CRR copies to reduce or eliminate the need to replicate backup data.  EMC Replication Manager manages RecoverPoint jobs in the same easy to use GUI as snapshot and clone jobs.
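
The idea of journaling every write and rolling back to any point in time, or to a named bookmark, can be modeled in a few lines of Python.  This is a toy model for illustration only and says nothing about how RecoverPoint is built: the Journal class, its method names, and the block/data values are all hypothetical, and writes are assumed to arrive in timestamp order.

```python
from bisect import bisect_right


class Journal:
    """Toy write journal: rebuild the volume image as of any timestamp."""

    def __init__(self):
        self.times = []      # write timestamps, assumed ascending
        self.writes = []     # (block, data) for each journaled write
        self.bookmarks = {}  # bookmark name -> timestamp

    def record(self, timestamp, block, data):
        self.times.append(timestamp)
        self.writes.append((block, data))

    def bookmark(self, name, timestamp):
        """Label a point in time, e.g. when the database was quiesced."""
        self.bookmarks[name] = timestamp

    def image_at(self, timestamp):
        """Replay every write made at or before the timestamp."""
        image = {}
        for block, data in self.writes[:bisect_right(self.times, timestamp)]:
            image[block] = data
        return image

    def image_at_bookmark(self, name):
        return self.image_at(self.bookmarks[name])


journal = Journal()
journal.record(1, "blk0", "A")
journal.record(2, "blk1", "B")
journal.bookmark("quiesced", 2)
journal.record(3, "blk0", "corrupt")

print("latest:  ", journal.image_at(3))
print("bookmark:", journal.image_at_bookmark("quiesced"))
```

A snapshot schedule gives you a handful of fixed recovery points, while a journal lets you pick the moment just before the bad write.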

There are a whole host of options available from EMC (and other storage vendors) to protect AND restore data in ways that traditional backup applications cannot match.  This does not mean that backup software is not also needed, as it usually ends up being a combined solution.

The key to architecting a restore solution is to start thinking about what would happen if you had to restore data, how that impacts the business and the bottom line, and then architect a solution that addresses the business’ need to run uninterrupted, rather than a solution that is focused on getting backups done in some arbitrary daily/nightly window.

EMC Unified: The benefit of having options

I’ve been having some fun discussions with one of my customers recently about how to tackle various application problems within the storage environment and it got me thinking about the value of having “options”.  This customer has an EMC Celerra Unified Storage Array that has Fibre Channel, iSCSI, NFS, and CIFS protocols enabled.  This single storage system supports VMWare, SQL, Web, Business Intelligence, and many custom applications.

The discussion was specifically centered on ensuring adequate storage performance for several different applications, each with a different type of workload…

1.)  Web Servers – Primarily VMs with general-purpose IO loads and low write ratios.

2.)  SQL Servers – Physical and Virtual machines with 30-40% write ratios and low latency requirements.

3.)  Custom Application  – A custom application database with 100% random read profiles running across 50 servers.

The EMC Unified solution:

EMC Storage already sports virtual provisioning in order to provision LUNs from large pools of disk to improve overall performance and reduce complexity.  In addition, QoS features in the array can be used to provide guaranteed levels of performance for specific datasets by specifying minimum and maximum bandwidth, response time, and IO requirements on a per-LUN basis.  This can help alleviate disk contention when many LUNs share the same disks, as in a virtual pool.  Enterprise Flash Drives (EFD) are also available for EMC Storage arrays to provide extremely high performance to applications that require it and they can coexist with FC and SATA drives in the same array.  Read and write cache can also be tuned at an array and LUN level to help with specific workloads.  With the updates to the EMC Unified Platform that I discussed previously, Sub-LUN FAST (auto tiering), and FAST Cache (EFD used as array cache) will be available to existing customers after a simple, non-disruptive, microcode upgrade, providing two new ways to tackle these issues.

So which feature should my customer use to address their 3 different applications?

Sub-LUN FAST (Fully Automated Storage Tiering)

Put all of the data into large Virtual Provisioning pools on the array, add a few EFD (SSD) and SATA disks to the mix and enable FAST to automatically move the blocks to the appropriate tier of storage.  Over time the workload would even out across the various tiers and performance would increase for all of the workloads with much fewer drives, saving on power, floor space, cooling, and potentially disk cost depending on the configuration.  This happens non-disruptively in the background.  Seems like a no-brainer right?

For this customer, FAST helps the web server VMs and the general-purpose SQL databases where the workload is predominately read and much of the same data is being accessed repeatedly (high locality of reference).   As long as the blocks being accessed most often are generally the same, day-to-day, automated tiering (FAST) is a great solution.  But what if the workload is much more random?  FAST would want to push all of the data into EFD, which generally wouldn’t be possible due to capacity requirements.  Okay, so tiering won’t solve all of their problems.  What about FAST Cache?
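
Here is a rough Python sketch of why tiering depends on locality of reference.  It is not the FAST algorithm, just a simple "hottest blocks go to the fastest tier" ranking that I made up for illustration, with tiny invented access logs and tier sizes.

```python
from collections import Counter


def place_blocks(access_log, efd_slots, fc_slots):
    """Put the most-accessed blocks on EFD, the next on FC, the rest on SATA."""
    ranked = [block for block, _ in Counter(access_log).most_common()]
    return {
        "EFD": ranked[:efd_slots],
        "FC": ranked[efd_slots:efd_slots + fc_slots],
        "SATA": ranked[efd_slots + fc_slots:],
    }


def efd_hit_fraction(access_log, placement):
    """Share of all accesses that land on the EFD tier."""
    hot = set(placement["EFD"])
    return sum(block in hot for block in access_log) / len(access_log)


# High locality: two blocks receive 80 of 100 accesses.
skewed_log = [0] * 50 + [1] * 30 + list(range(2, 22))
# Fully random profile: 100 blocks, one access each.
uniform_log = list(range(100))

for name, log in (("skewed", skewed_log), ("uniform", uniform_log)):
    placement = place_blocks(log, efd_slots=2, fc_slots=10)
    print(f"{name}: {efd_hit_fraction(log, placement):.0%} of IO served from EFD")
```

With the same two EFD slots, the skewed workload gets most of its IO from flash while the uniform one gets almost none, which is the situation the custom application is in.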

FAST Cache

Exponentially increase the size of the storage array’s read AND write cache with EFD (SSD) disks.  This would improve performance across the entire array for all “cache friendly” applications.

For this customer, increasing the size of write cache definitely helps performance for SQL (50% increase in TPM, 50% better response time as an example) but what about their custom database that is 100% random read?  Increasing the size of read cache will help get more data into cache and reduce the need to go to disk for reads, but the more random the data, the less useful cache is.   Okay, so very large caches won’t solve all of their problems.   EFDs must be the answer right?
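
The "more random means less useful cache" point is easy to simulate in Python.  This sketch models a plain LRU read cache, which is an assumption on my part and not a description of how the array cache or FAST Cache decides what to keep; the block counts, cache size, and 90/5 skew are likewise invented for the demo.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(accesses, cache_size):
    """Replay block reads through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses)


rng = random.Random(42)  # fixed seed so runs are repeatable
BLOCKS = 10_000
READS = 50_000
CACHE_SIZE = 1_000       # cache holds 10% of the blocks

# 100% random read profile.
uniform = [rng.randrange(BLOCKS) for _ in range(READS)]
# Cache-friendly profile: 90% of reads go to 5% of the blocks.
skewed = [rng.randrange(BLOCKS // 20) if rng.random() < 0.9
          else rng.randrange(BLOCKS) for _ in range(READS)]

uniform_ratio = lru_hit_ratio(uniform, CACHE_SIZE)
skewed_ratio = lru_hit_ratio(skewed, CACHE_SIZE)
print(f"random reads: {uniform_ratio:.0%} hit ratio")
print(f"skewed reads: {skewed_ratio:.0%} hit ratio")
```

For a fully random profile the hit ratio tracks the fraction of the data set that fits in cache, so the only way to move it much is to cache most of the data, which leads straight to the next question.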

EFD Disks

Forget SATA and FC disks; just use EFD for everything and it will be super fast!!   EFD has extremely high random read/write performance, low latency at high loads, and very high bandwidth.  You will even save money on power and cooling.

The total amount of data this customer is dealing with in these three applications alone exceeds 20TB.  To store that much in EFD would be cost prohibitive to say the least.  So, while EFD can solve all of this customer’s technical problems, they couldn’t afford to acquire enough EFD for the capacity requirements.

But wait, it’s not OR, it’s AND

The beauty of the EMC Unified solution is that you can use all of these technologies, together, on the same array, simultaneously.

In this customer’s case, we put FC and SATA into a virtual pool with FAST enabled and provision the web and general-purpose SQL servers from it.  FAST will eventually migrate the least used blocks to SATA, freeing the FC disks for the more demanding blocks.

Next, we extend the array cache using a couple EFDs and FAST Cache to help with random read, sequential pre-fetching, and bursty writes across the whole array.

Finally, for the custom 100% random read database, we dedicate a few EFDs to just that application, snapshot the DB and present copies to each server.  We disable read and write cache for the EFD backed volumes which leaves more cache available to the rest of the applications on the array, further improving total system performance.

Now, if and when the customer starts to see disk contention in the virtual pool that might affect performance of the general-purpose SQL databases, QoS can be tuned to ensure low response times on just the SQL volumes ensuring consistent performance.  If the disks become saturated to the point where QoS cannot maintain the response time or the other LUNs are suffering from load generated by SQL, any of the volumes can be migrated (non-disruptively) to a different virtual pool in the array to reduce disk contention.

Options

If you look at offerings from the various storage vendors, many promote large virtual pools, some also promote large caches of some kind, others promote block level tiering, and a few promote EFD (aka SSDs) to solve performance problems.  But, when you are consolidating multiple workloads into a single platform, you will discover that there are weaknesses in every one of those features and you are going to wish you had the option to use most or all of those features together.

You have that option on EMC Unified.