Analyzing I/O performance in Linux

Monitoring and analyzing performance is an important task for any sysadmin. Disk I/O bottlenecks can bring applications to a crawl. What are IOPS? Should I use SATA, SAS, or FC? How many spindles do I need? What RAID level should I use? Is my system read or write heavy? These are common questions for anyone embarking on a disk I/O analysis quest. Obligatory disclaimer: I do not consider myself an expert in storage or anything for that matter. This is just how I have done I/O analysis in the past. I welcome additions and corrections. I believe it’s also important to note that this analysis is geared more toward random operations than sequential read/write workloads.

Let’s start at the very beginning … a very good place to start. Hey, it worked for Julie Andrews … So what are IOPS? They are input/output (I/O) operations per second. It’s good to note that IOPS are also referred to as transfers per second (tps). IOPS are important for applications that require frequent access to disk. Databases, version control systems, and mail stores all come to mind.

Great, so now that I know what IOPS are, how do I calculate them? IOPS are a function of rotational speed (aka spindle speed), latency and seek time. The equation is pretty simple: 1/(seek + latency) = IOPS, with both times in seconds. Scott Lowe has a good example on his techrepublic.com blog.

Sam­ple drive:

  • Calculated IOPS for this disk: 1/(0.003 + 0.0045) = about 133 IOPS

It’s great to know how to calculate a disk’s IOPS, but for the most part you can get by with commonly accepted averages. Of course sources vary, but this is what I have seen:

    Rotational speed (rpm)   IOPS
    5400                     50–80
    7200                     75–100
    10k                      125–150
    15k                      175–210
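The single-disk equation above can be sketched in a few lines of Python. This is a minimal sketch rather than a benchmark: the function names are illustrative, and the rotational latency helper assumes the usual rule of thumb that average rotational latency is the time for half a revolution.

```python
def rotational_latency_s(rpm):
    """Average rotational latency in seconds (assumed: half a revolution)."""
    return 0.5 * 60.0 / rpm


def disk_iops(time_a_s, time_b_s):
    """Estimate IOPS for one disk as 1 / (seek + latency), times in seconds.

    The two times are simply summed, so their order does not matter.
    """
    return 1.0 / (time_a_s + time_b_s)


# The two times used in the sample drive calculation above.
sample = disk_iops(0.003, 0.0045)
print(round(sample))  # about 133

# Half a revolution at 10,000 rpm takes 3 ms.
print(rotational_latency_s(10000))
```

Real drives publish separate read and write seek times, so treat the result as a rough planning figure.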

    Should I use SATA, SAS or FC? That’s a loaded question. As with most things, the answer is “it depends”. I don’t want to get into the SATA vs SAS debate; you can do your own research and make your own decisions based on your needs, but I will point out a few things.

    These factors are key considerations when choosing what kind of drives to use.

    What RAID level should I use? Now that you know what IOPS are, how to calculate them, and what kind of drives to use, the next logical question is commonly RAID 5 vs RAID 10. There is a difference in reliability, especially as the number of drives in your RAID set increases, but that is outside the scope of this post.

    RAID level   Write operations   Read operations
    0            1                  1
    1            2                  1
    5            4                  1
    6            6                  1

    Notes:

    • RAID 0. Write/Read: high throughput, low CPU utilization, no redundancy.
    • RAID 1. Write: only as fast as a single drive. Read: two read schemes are available: read data from both drives, or take data from the drive that returns it first. One gives higher throughput, the other faster seek times.
    • RAID 5. Write: read-modify-write requires two reads and two writes per write request. Lower throughput, and higher CPU if the HBA doesn’t have a dedicated I/O processor. Read: high throughput and low CPU utilization normally; in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.
    • RAID 6. Write: read-modify-write requires three reads and three writes per write request. Avoid a software implementation if a hardware one is available. Read: high throughput and low CPU utilization normally; in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.

    As you can see in the table above, writes are where you take your performance hit. Now that the penalty, or RAID factor, is known for the different RAID levels, we can get a good estimate of the theoretical maximum IOPS for a RAID set (excluding caching of course). To do this, divide the product of the number of disks and IOPS per disk by the sum of the %read workload and the product of the RAID factor (see the write operations column) and the %write workload.

    Here is the equation:

    IOPS = (d × dIOPS) / (%r + (F × %w))

    d = number of disks
    dIOPS = IOPS per disk
    %r = % of read workload
    %w = % of write workload
    F = raid factor (write operations column)
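The equation described above translates directly into a small function. This is a minimal sketch; the function name and the example numbers are hypothetical, and the percentages are passed as fractions (0.70 for 70%).

```python
def max_raid_iops(disks, iops_per_disk, read_pct, write_pct, raid_factor):
    """Theoretical maximum IOPS for a RAID set, ignoring cache.

    read_pct and write_pct are fractions that should sum to about 1.0.
    raid_factor is the write penalty F for the RAID level.
    """
    return (disks * iops_per_disk) / (read_pct + raid_factor * write_pct)


# Hypothetical example: 8 disks at 150 IOPS each in RAID 5 (F = 4),
# with a 70% read / 30% write workload.
print(round(max_raid_iops(8, 150, 0.70, 0.30, 4), 1))  # 631.6
```

Note how a pure-read workload makes the penalty vanish, while a write-heavy one divides the raw disk IOPS by nearly the full raid factor.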

    Wait a sec­ond, where am I sup­posed to get %read and %write from?

    You need to examine your workload. I usually turn to my favorite statistics collector, sysstat. sar -d -p will report activity for each block device and pretty-print the device name. I am assuming you already know what block device you are looking to analyze, but if you’re looking for the busiest device just look in the tps column. The rd_sec/s and wr_sec/s columns display the number of sectors read from and written to the device per second. To get the percentage of reads, divide rd_sec/s by the sum of rd_sec/s and wr_sec/s; for the percentage of writes, divide wr_sec/s by the same sum.

    The equations:

    %r = rd_sec/s / (rd_sec/s + wr_sec/s)
    %w = wr_sec/s / (rd_sec/s + wr_sec/s)

    An exam­ple from my workstation:

    Aver­age for sdb rd_sec/s = 1150.80
    Aver­age for sdb wr_sec/s = 1166.53
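The read/write split for averages like these can be computed with a short sketch. The function name is illustrative, and the inputs are the two sar averages typed in by hand rather than parsed from sar output. Depending on rounding versus truncation, the last digit may differ by a tenth of a percent from figures worked out by hand.

```python
def read_write_split(rd_sec_per_s, wr_sec_per_s):
    """Return (read fraction, write fraction) from sar's sector rates."""
    total = rd_sec_per_s + wr_sec_per_s
    if total == 0:
        # An idle device has no meaningful split.
        return 0.0, 0.0
    return rd_sec_per_s / total, wr_sec_per_s / total


# Workstation averages for sdb from above.
read_pct, write_pct = read_write_split(1150.80, 1166.53)
print(f"{read_pct:.1%} read, {write_pct:.1%} write")
```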

    As you can see, my workstation’s read/write workload is pretty balanced at 49.6% read and 50.3% write. Compare that to a cvs server (don’t get me started on how bad cvs is; it’s just something I have to deal with).

    Aver­age for sdb rd_sec/s = 27.78k
    Aver­age for sdb wr_sec/s = 2.07k

    This server’s workload is extremely read heavy, at roughly 93% reads. OK, time to analyze the performance.

    In and of itself, being a heavy read workload is not a problem. My problem is user complaints of slowness. I note (again from sysstat-collected metrics) that the tps, or average IOPS, on this device is about 574. Again, that’s not an issue in and of itself; we need to know what we can expect from its subsystem. This device happens to be SAN-based storage. The RAID set it’s on is composed of four 10k rpm FC drives in RAID 10. Remember from the table above that IOPS for a 10k rpm drive are in the 125–150 range. We need to calculate the expected IOPS from that RAID set using the IOPS equation above, our measured read/write workload, the number of disks, and the RAID level (10 and 1 are treated the same).

    Using the high end of the scale for 10k rpm IOPS per drive results in a maximum theoretical IOPS of 561.79. That’s pretty close to what I am observing (remember, cache is not taken into account). So based on these numbers it looks like my storage subsystem is saturated. I guess I’d better add some spindles. Unfortunately there is no historical data for this system, so I have no way of knowing how many tps I need to aim for.
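If you do have a target in mind, the same equation can be turned around to estimate how many spindles are needed. This is a minimal sketch under stated assumptions: the target is hypothetical, chosen so that the observed 574 tps would sit at 65% of the theoretical maximum, and the function name and the `multiple` parameter (used to keep a RAID 10 set at an even disk count) are illustrative.

```python
import math


def disks_needed(target_iops, iops_per_disk, read_pct, write_pct,
                 raid_factor, multiple=1):
    """Smallest disk count whose theoretical max IOPS meets target_iops.

    Rearranged from IOPS = (d * dIOPS) / (%r + F * %w).
    The result is rounded up to a multiple of `multiple`.
    """
    raw = target_iops * (read_pct + raid_factor * write_pct) / iops_per_disk
    return math.ceil(raw / multiple) * multiple


# Read/write split from the cvs server averages above (27.78k and 2.07k).
reads, writes = 27.78, 2.07
read_pct = reads / (reads + writes)
write_pct = writes / (reads + writes)

# Assumed target: current load should be 65% of the theoretical maximum.
target = 574 / 0.65

# 10k rpm drives at 150 IOPS each, RAID 10 (F = 2), disks added in pairs.
n = disks_needed(target, 150, read_pct, write_pct, 2, multiple=2)
print(n)
```

With these assumptions the sketch suggests doubling the current four-drive set, but the answer is only as good as the target you feed it.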

    Don’t get stuck where I am, guessing how many spindles need to be added to reduce the pain. Start recording your trends now! Even better, once you start collecting your statistical information, go ahead and set an alert for 65% or 70% utilization of theoretical max IOPS for an extended period, as well as increasingly bothersome alerts going up from there. It’s never good to have to react to performance issues; it’s always better to be proactive. There was absolutely nothing wrong with the sizing of this example RAID set 2–4 years ago. Had it been under monitoring the entire time with proper thresholds set, a proper plan could have been made, and spindles could have been added before causing users any pain.
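The alerting idea above could look something like the following sketch. Only the 65% starting point comes from the text; the higher steps, the level names, and the sample readings are assumptions for illustration. "Extended period" is modeled by requiring every sample in the window to exceed a threshold.

```python
# Alert levels as (name, fraction of theoretical max IOPS), lowest first.
# 0.65 comes from the text above; the higher steps are assumptions.
THRESHOLDS = [("warning", 0.65), ("high", 0.80), ("critical", 0.95)]


def sustained_alert(tps_samples, max_iops, thresholds=THRESHOLDS):
    """Return the highest alert level that every sample in the window meets.

    Using the minimum sample means a short spike does not trigger an alert;
    only load sustained across the whole window does.
    """
    if not tps_samples:
        return None
    floor = min(tps_samples) / max_iops
    level = None
    for name, fraction in thresholds:
        if floor >= fraction:
            level = name
    return level


# Hypothetical windows of tps readings against the 561.79 maximum from above.
print(sustained_alert([410, 395, 430, 405], 561.79))  # warning
print(sustained_alert([410, 120, 430, 405], 561.79))  # None
```

The window length and sampling interval are left to whatever your NMS collects.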

    If you want to use sysstat like I did, you might find this Nagios plug-in that I wrote helpful: check_sar_perf. I use it with Zenoss, but it could be tied into any NMS that records the performance data from a Nagios plug-in.

    Go forth, col­lect, ana­lyze and plan so your users aren’t call­ing you with issues.

    19 Comments

    • Jean-Francois wrote:

      The link for the check_sar_perf script seems com­pletely unrelated.

      Other than that, this is a very good article!

    • Matt wrote:

      The link for check_sar_perf points to http://vmtoday.com/2010/04/storage-basics-part-vi-storage-workload-characterization/ , which I don’t think was your intention.

    • Thanks for the com­ment Jean-Francois, check_sar_perf is just an easy way to trans­port sar met­rics for stuff­ing into an NMS.
      The only thing that makes it related would be the out­put from check_sar_perf disk sda etc … stuffed into your cacti or zenoss so you could see a trend.

    • lol oops, fix­ing it now

    • Kees wrote:

      Does any­body know if inter­laced sec­tors are used on hard­disks? In the day of flop­pies this could increase the data through­put con­sid­er­ably by choos­ing the right interlace.

    • Tormak wrote:

      When com­par­ing SATA and SAS it’s impor­tant to remem­ber that SATA is only 1/2 duplex. This is a good intro­duc­tion to the disks but I’d really like to see you fol­low it up (for the sake of all noobs) with a dis­cus­sion about band­width over the bus and maybe even con­trollers and their lim­i­ta­tions (includ­ing caching issues/options writeback/write through/cache mir­ror­ing, etc.).

    • Tormak wrote:

      Addi­tion­ally, a dis­cus­sion wrt tun­ing the VFS for spe­cific work­load per­for­mance would dove­tail nicely. Maybe a future article?

    • Slappy wrote:

      Whoa, that is some high-level hard­ware ana­lyz­ing. I hope to get to that level one day.

    • twogunmickey wrote:

      What about cache size? You failed to mention it in the article. I figure it doesn’t make that big of a difference, especially in long, constant transfers, but it has to help some? New drives are coming with even larger caches, 64 MB!

      Another question I’ve always had but have never heard anyone address: it seems to me that even though a higher RPM drive might have better performance, a lower RPM drive would have a longer life, especially if they were both models from the same line from the same company.

    • Vonskippy wrote:

      In your first table, you have 5200 rpm dri­ves — it should be 5400 rpm.

    • “It would be good to note” that IOPS are also known as TPS, if it were anywhere close to *true*.

      Alas, it’s not. “Transactions per second”, as it’s generally used, refers to a very specific DBMS benchmark, promulgated by the Transaction Processing Performance Council: the TPC-C rating. (They have two others, but the -C is the one most commonly quoted.)

      Whether you’re that specific or not, you’re almost certainly still talking about SQL transactions, and each one of those is going to take a *lot* more than 1 IOP. Generally by 2 to 3 orders of magnitude, but 4 isn’t uncommon, and 5 or 6 isn’t unreasonable.

      And if your IOPs are tak­ing sec­onds, my con­do­lences. :-)

    • @Vonskippy — Thanks I’ll fix it.

      @Baylink — Yes, I meant to say Trans­fers per sec­ond. I’ll cor­rect it. From man sar — A trans­fer is an I/O request to a phys­i­cal device. Mul­ti­ple log­i­cal requests can be com­bined into a sin­gle IO request to the device. A trans­fer is of inde­ter­mi­nate size.

    • @Tormak: Thanks for the comments! You are right to note that SATA is only 1/2 duplex. I thought about mentioning it but didn’t want to get deep into that discussion. Since you bring it up, I’ll toss it in.
      As for tun­ing file sys­tems, that does seem like it would be a good read. In fact I noticed my exam­ple file sys­tem could use some tun­ing. If I can get some good data from doing that I may write a file-system tun­ing follow-up.

    • My apologies for the tone; yesterday was kind of a cranky day.

      Nice bits in the rest of the piece.

      And I like your captcha.

    • @Baylink, hey no prob­lem :) every­one has cranky days. Thanks for the cor­rec­tion. As for the captcha, I agree. It was the least obnox­ious one I could find. Props to http://clickcha.com/ the author.

    • phil wrote:

      Nice write up. One point to make though, there really is no such thing as an IOP sin­gu­lar. IOPS means IO Oper­a­tions Per Sec­ond. The “OP” part is not short for OPer­a­tion. Leav­ing off the “S” is non­sen­si­cal. It’s like PPS in the net­work space (or hope­fully, KPPS). Sin­gu­lar is just IO (for either “I/O Oper­a­tion” or just I/O sans “/” for the lazy).

    • Roger Themocap wrote:

      The num­bers in the equa­tions for %r and %w are the same. The results are different.

    • @phil: Yeah, you’re right, it can be misleading I suppose. In my head I was just thinking Input Output OPeration :) Kind of like saying your ATM PIN Number (Personal Identification Number Number). I should probably change it up in the post. Thanks for the input.

    • @Roger Themocap: Ah, you’re right. I screwed up when putting my equations into http://www.codecogs.com/components/equationeditor/equationeditor.php to generate the images. I’ll get that fixed. FYI it’s the one for %w that’s wrong; it should be .503 = 1166.53/(1150.08+1166.53). Thanks for the correction :)
