
Perfwiz for Exchange 2010


NOTE: This version of Perfwiz has been replaced by a newly written script that is talked about in http://blogs.technet.com/b/mikelag/archive/2010/07/09/exchange-2007-2010-performance-data-collection-script.aspx

  1. Download the appropriate version of Perfwiz for your server

    How to download
    To download these XML files to your computer, right-click the file of your choice, select Save Target As..., and then save it to a directory of your choice on your Exchange server

    Role Based
    Use these for a high-level look into how the server is performing; if you need to branch out with more counters, use the Full Counter/Instance set below.

    Exchange_2010_Perfwiz-MBX.xml
    Other roles coming soon....

    All Counters/All Instances
    Use this counter set at your own discretion, as logging this many counters could potentially cause performance degradation on your server.

    Exchange_2010_Perfwiz-Full.xml
  2. Open Performance Monitor
  3. Expand Reliability and Performance and then expand Data Collector Sets
  4. Right-click User Defined, select New, and then Data Collector Set
  5. Enter a unique name for this Data Collector Set (e.g. ExPerfwiz), select Create from a template (Recommended), and then click Next
  6. Click Browse, navigate to the XML file that was saved in Step 1, and select Open
  7. Select Next on the next screen
  8. Enter a root directory where you would like to store the performance log files. Click Next
  9. If you need to run this performance log under different credentials, enter them on this page. Click Finish. (A command-line alternative using logman is shown after these steps.)
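If you prefer the command line to the GUI steps above, the same XML template can be imported and started with logman. This is a minimal sketch; the file path is a placeholder for wherever you saved the XML in Step 1, and the collector name matches the example from Step 5.

logman import ExPerfwiz -xml "C:\Temp\Exchange_2010_Perfwiz-MBX.xml"
logman start ExPerfwiz

When you are finished collecting, logman stop ExPerfwiz stops the data collector set.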

The effects of paged pool token memory usage and Outlook 2007’s shared folders feature


In Exchange 2003, we are bound by 32-bit architecture memory limitations, and paged pool memory just so happens to be one of them. One very common issue that we in Exchange support see is paged pool memory depletion related to excessive token usage.

For the most part, paged pool memory troubleshooting for Exchange 2003 is covered in http://support.microsoft.com/kb/912376. This article goes over what access tokens are, what symptoms are seen when the server is running low on paged pool memory, and how token sizes are generally calculated, and it includes a script that allows you to export group memberships for users that are mail enabled. It does not go over some of the other reasons why paged pool memory may increase on any given Exchange 2003 server.

There is also a great document “Addressing Problems Due to Access Token Limitations” at http://download.microsoft.com/download/8/f/3/8f36dfe4-47d0-4775-ad5a-5614384921aa/AccessTokenLimitation.doc that goes over Access Tokens in much greater detail.

If you had used the scripts in 912376 to export users' group memberships, you might end up with data in Excel similar to the screenshot below.

[Screenshot: Excel output from the 912376 scripts showing a slew of users with very high group membership counts]

Notice that we have a slew of users with very high group membership counts. This can and will affect overall token sizes when connecting to an Exchange server.

If you ever wanted to calculate the approximate size of a user's access token, the formula as listed in 912376 is the following:

[12 x number of user rights] + [token overhead] + [44 x number of group memberships] = token size in bytes

To find the maximum number of security groups a user can belong to before the token size doubles, we come up with the following: (4096 - 500) / 44 = 81.7. This means that a user needs to be a member of roughly 80 security groups before the default 4096-byte token doubles to 8192 bytes. This is a rough guesstimate, but it should be enough to estimate overall token sizes for users in general terms if you wanted to. Microsoft does not recommend more than 100MB of paged pool memory for token allocations, as it may impact Exchange and other applications installed on the server.
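If you want to turn that formula into a quick calculation, here is a minimal PowerShell sketch. The default values for user rights (14) and token overhead (500 bytes) are illustrative assumptions, not values read from your environment.

# Rough token size estimate based on the formula from KB 912376
function Get-ApproxTokenSize {
    param([int]$GroupCount, [int]$UserRights = 14, [int]$TokenOverhead = 500)
    (12 * $UserRights) + $TokenOverhead + (44 * $GroupCount)
}

# Example: a user that is a member of 120 security groups
Get-ApproxTokenSize -GroupCount 120    # 5948 bytes, well past the 4096-byte default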

One of the reasons I believe many security groups end up mail enabled in any organization is that this is the default behavior when creating groups in Active Directory Users & Computers (ADUC). When you create a group in ADUC, the default type is a security group, not a distribution group, so over time you can end up with users that are members of many security groups, which is a bad thing. If you let this go on long enough, users' access tokens can bloat to high values, causing paged pool memory issues from excessive token usage.

One of the first things you need to do is ensure that the hotfix in http://support.microsoft.com/kb/912480 is applied on the Exchange server. This hotfix decreases the number of connections that Outlook users make to an Exchange server.

Each connection that a client makes to a server increases overall token usage for that user, due to the fact that we need to expand all group memberships to ensure the user has permissions to the requested resource and doesn't have a specific deny in any of the many groups they are a part of. If clients establish a lot of TCP connections to a single process such as store.exe, you can exhaust paged pool memory on the server very easily, since each connection has to verify permissions to access resources in users' mailboxes. The more connections you have open, the more token memory is consumed.

So now I would like to bring to light a new feature in Outlook 2007 that helps cache shared folders such as Calendars, Contacts, Tasks, and Notes. This specific feature is Download shared folders (excludes mail folders), which is turned on by default in any new Outlook 2007 profile. This allows easier and faster access to other users' mailbox folders since the data is cached locally in the user's Outlook profile. Here is what the option looks like in a user's Outlook profile.

[Screenshot: Outlook 2007 Cached Exchange Mode settings showing the Download shared folders (excludes mail folders) option]

For each shared folder that is cached in a user profile, an additional connection will be opened to the Exchange server to sync that data to the user's OST file. If a user is opening a lot of other users' folders, then you will see many connections to the Exchange server. To make it even better, http://support.microsoft.com/kb/955572 introduced a new feature where you can cache other users' mail folders, such as the Inbox, by adding a registry value called CacheOthersMail on each client computer. Once this is implemented, any user now has the ability to cache any other user's mailbox folders, should they have access to them.

As you can see, this can significantly increase the number of connections to any given Exchange server, thus increasing overall token usage. If you have users with high group memberships, then it may be advantageous to disable these features. This can be accomplished organization-wide via Group Policy by disabling the "Synchronizing data in shared folders" option listed in http://technet.microsoft.com/en-us/library/cc179175(office.12).aspx. Disabling this feature causes the Outlook clients to make direct RPC connections to the Exchange server, since that folder data is no longer cached locally, so it may take just a bit longer to pull up the data they are looking for.

If you are migrating your users from Outlook 2003 to Outlook 2007 and are noticing increased paged pool token memory usage on an Exchange server, then this may be one of the reasons for it.

One way to determine whether Outlook clients have many connections open to the Exchange server is to run netstat -ao on the Exchange server and redirect the output to a text file. If you view the list of connections from a single client IP address to the Store process, which in this case was listening on port 1674, you may see many established connections as shown below.

TCP    EXSERVER:1674       CLIENTCOMPUTER:3196     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3202     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3205     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3210     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3219     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3234     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3235     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3238     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3239     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3241     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3246     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3252     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3253     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3257     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3258     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3260     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3261     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3263     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3264     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3277     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3278     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3280     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3282     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3283     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3284     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3285     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3286     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3293     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3301     ESTABLISHED     4944
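If you want to summarize that output rather than eyeball it, a small PowerShell sketch along these lines counts established connections per client. The port (1674) and file path are placeholders taken from the example above; substitute the port your store.exe is actually listening on, and run the summary wherever you copied the text file if PowerShell is not installed on the Exchange 2003 server.

netstat -ao > C:\Temp\netstat.txt

Get-Content C:\Temp\netstat.txt |
    Where-Object { $_ -match "ESTABLISHED" -and $_ -match ":1674\s" } |
    ForEach-Object { ($_.Trim() -split "\s+")[2] -replace ":\d+$", "" } |
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object Count, Name -First 20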

If this type of pattern is seen for various clients, then this feature may be affecting your Exchange server's paged pool memory usage, which could potentially lead to unresponsive servers.

In one of the cases that I worked, disabling this feature via Group Policy dropped the runtime paged pool memory usage by 40-50MB. This is a huge memory savings that may keep your server afloat until you migrate to Exchange 2007 or Exchange 2010. Of course, those versions of Exchange are not susceptible to these types of nonpaged or paged pool memory issues, since we have moved away from the 32-bit platform and only support 64-bit versions of Exchange in production.

Once you have migrated these users to later versions of Exchange, you can then remove the group policy to enable this feature again.

Of course there are many other reasons for excessive token usage on an Exchange server, but this is just one of those new features that may bite you in the end.

Until next time…

Exchange Server Profile Analyzer updated!!


For any of you folks who might have created a hairball on the floor by pulling your hair out trying to obtain data from the Exchange Profile Analyzer (EPA) tool, there is good news: a new version has just been released and should resolve the majority of the issues that you might have run into previously. One previously major blocker was having changed your default recipient policy to include variables such as %s.%g@domain.com; this update resolves that problem, among a slew of other enhancements.

Please note that this is still the DAV version of the tool and will not run against Exchange 2010 servers; it should only be run against Exchange 2003 or Exchange 2007 installations, with the goal of obtaining user profiles to allow proper planning and sizing of your Exchange 2010 servers.

You can download the latest binaries from the following locations.

32-bit - http://www.microsoft.com/downloads/details.aspx?familyid=8F575F60-BD80-44AA-858B-A1F721108FAD&displaylang=en
64-bit - http://www.microsoft.com/downloads/details.aspx?familyid=C009C049-9F4C-4519-A389-69C281B2ABDA&displaylang=en

Enjoy!!

Perfmon data viewing nuance…


Recently, I have been running into a Perfmon nuance that I would like to call out. On any given performance troubleshooting day, which is almost every day, I open up Perfmon, select the BLG file of choice, and then add the appropriate counters that I would like to view.

In this instance, I was presented with the following view.

[Screenshot: Performance Monitor showing all counter instances as straight, linear lines]

The odd thing is that all instances are linear in fashion, whether the counter values are increasing or decreasing. Something surely is not right, and as you can see, this view is not helpful at all.

Since I know that this perfmon file has data in it, I needed to find a way to view this information, as it was crucial that we find the root cause of the issue with the data provided.

One of the reasons this can happen is when there are long gaps of uncollected data in the perfmon file as data is being appended to the existing BLG file. So if I started a perfmon collection for 1 second and then immediately stopped it, I would collect one sample of data. If my perfmon data collection is set up to append data to the existing BLG file, then the next time the collection is started, it will append the data to the existing BLG file. Since Perfmon averages out the data over the last 100 samples, you can clearly see that the averages here will be way off and will show data similar to the above screen capture.

So how do I overcome this? Well, first, you need to find out where the data gaps are in the perfmon file. To do this, you can use the relog utility that comes with Windows to dump the BLG file to CSV format. To make this easier, I export just the \Processor(_Total)\% Processor Time counter, as that counter is normally collected in every perfmon file. The syntax of the command is below along with the associated result information.
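A representative relog command looks like the following; the input file name is a placeholder for your own BLG:

relog ExPerfwiz.blg -c "\Processor(_Total)\% Processor Time" -f CSV -o log.csv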

[Screenshot: relog command and output summary showing the number of samples exported]

In the above case, we can see that the blg contained information for 14 hours, but only 29 samples were taken. Based on that alone, the numbers do not add up.

If you open the log.csv file in Notepad, you should get output similar to the following. As highlighted below, you will see the actual gap in data collection that was throwing off the view in Perfmon. This output also gives us the interval at which this perfmon data was collected, which was every 30 seconds. This time interval is extremely important to me when viewing perfmon data for any Exchange performance-related issue.

[Screenshot: log.csv opened in Notepad with the gap in sample times highlighted]

The first data sample is the one we don't need here, so we need to change our Time Range in Perfmon by moving the slider bar over to the right, to approximately one hour before the data capture.

[Screenshot: Time Range slider positioned about one hour before the data capture]

Once you have the slider where you want it, right-click the slider bar and then select Zoom To, as shown in the following pictorial.

[Screenshot: Right-click menu with Zoom To selected]

What you should now see is some semblance of performance data.

[Screenshot: Performance Monitor view after the first zoom]

To remove the whitespace surrounding the capture, left-click and drag the time slider to the right until you have highlighted the data you want to look at. Once selected, right-click the window and select Zoom To to zoom in to that time frame.

[Screenshot: Highlighted time range prior to zooming in]

What you should now see is the actual data you want to look at.

[Screenshot: Zoomed view of the actual performance data]

Another way to view this data is to use the relog utility to export specific timeframes from the log. Since we know the start and end times of the data we want to view, we can simply run the following command to export the data between those times to an export.csv file. Once you pull up this export.csv file in Perfmon, you will have the same view as the one listed above.
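For reference, the relog call would look something like this; the begin/end times and file names are placeholders for your own values:

relog ExPerfwiz.blg -b "09/03/2010 9:00:00 AM" -e "09/03/2010 5:00:00 PM" -f CSV -o export.csv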

[Screenshot: relog command exporting a specific time range to export.csv]

I hope this helps you decipher whether or not a perfmon file has been corrupted or the data just does not look correct for whatever reason.

Happy performance viewing!!

Audit Exchange 2007 SP2 Auditing


There have been a few cases coming through the support channels stating that auditing is not working for whatever reason. After reviewing the cases, we have found that this is due to users or groups in the Configuration partition of Active Directory that have been granted Full Control on the containers in the tree or that have the All Extended Rights permission. Having these permissions essentially prevents any auditing events from being logged for those users.

Let's take a step back for a moment. When applying Exchange 2007 SP2 in an organization, the schema is extended with the right ms-Exch-Store-Bypass-Access-Auditing. If a user or group has previously been granted Full Control within the Configuration tree, that user or group then takes on an Allow for this bypass auditing right and is thus exempt from being audited. This is not good for compliance reasons, and the end result is that the audit logs will not contain entries for all users that can or will be accessing mailboxes.

The other problem is that there is currently no way to lock down the ability for any administrator to add a specific Allow on an object in Active Directory for this bypass right, thus excluding them from being audited.

Listed below are the *default* groups that have the Bypass Auditing right due to various permission settings:

  • Domain Admins
  • Enterprise Admins
  • Exchange Organization Administrators

Once the Schema has been extended, there are 5 places to add auditing bypass entries in the configuration container in Active Directory as shown below.

  • Exchange Organization Container
  • Exchange Administrative Group Container
  • Exchange Servers Container
  • Exchange Server object
  • Exchange Database object

When auditing is not working as expected, it would be a rather tedious process to check permissions throughout the Configuration tree for these objects where the bypass extended right has been set. I have created a PowerShell script (AuditBypassPerm.ps1) that exports the permissions for each of these objects to make the job of finding permissions problems that much easier.

Before I go over the script, I want to describe some of the terms that you will need to know when looking through its output. Objects in Active Directory are assigned access rights using the System.DirectoryServices.ActiveDirectoryRights enumeration. These access rights control what type of permission a user or group has on a particular object. A listing of all the available access rights can be found at http://msdn.microsoft.com/en-us/library/system.directoryservices.activedirectoryrights.aspx.

The three main rights that we are concerned with in relation to this auditing are the following:

  • ms-Exch-Store-Bypass-Access-Auditing = Bypass Exchange Access Auditing in the Information Store
  • GenericAll = The right to create or delete children, delete a subtree, read and write properties, examine children and the object itself, add and remove the object from the directory, and read or write with an extended right.
  • ExtendedRight = A customized control access right. This can be used to apply a specific right such as ms-Exch-Store-Bypass-Access-Auditing, or it could mean that an account has an Allow for All Extended Rights as shown below. All Extended Rights means just that: all extended rights, including the "Bypass Exchange Access Auditing in the Information Store" right.

     [Screenshot: ACL editor showing the All Extended Rights permission granted on an Exchange object]

So with that said, these are the three main rights that we need to concentrate on when we are trying to find the needle in the haystack. The next piece that we also need to be cognizant of is whether there is a specific Deny, or whether one of the 3 rights is being inherited from some other object in the Configuration tree.
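To give you an idea of what the script is looking at, here is a minimal sketch (not the AuditBypassPerm.ps1 script itself) that dumps the ACEs on one of the five objects above and filters for the rights we care about. The distinguished name is a placeholder for your own organization, server, or database object:

# Placeholder DN; point this at the organization, administrative group, servers
# container, server object, or database object you want to inspect
$dn  = "CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=contoso,DC=com"
$obj = [ADSI]("LDAP://" + $dn)

$obj.psbase.ObjectSecurity.GetAccessRules($true, $true, [System.Security.Principal.NTAccount]) |
    Where-Object { $_.ActiveDirectoryRights -match "GenericAll|ExtendedRight" } |
    Select-Object IdentityReference, ActiveDirectoryRights, AccessControlType, IsInherited, ObjectType |
    Format-Table -AutoSize

For ExtendedRight entries, the ObjectType GUID identifies which extended right the ACE applies to; an all-zeros GUID means All Extended Rights, which, as described above, includes the bypass auditing right.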

In my example, I used an account called AuditTest1 to show how one would troubleshoot something like this. I granted a Deny on the bypass right at the organization level so that this user's account would be audited, but then at the database object level, I granted the All Extended Rights right for this account. What this essentially did was bypass auditing for this user at the database level, thus logging no events for any mailbox access on that database.

The syntax for the script is as follows:

.\AuditBypassPerm.ps1 <MBXServername>

After running the script, you will get output similar to the following

[Screenshot: AuditBypassPerm.ps1 running in the Exchange Management Shell]

Once the script has completed, Notepad will open a text file providing information similar to the pictorial below. This allows you to visually see, at each of the 5 objects where the bypass permissions can be set, what rights were assigned to this user.

[Screenshot: Text file output listing the rights assigned to the user at each of the 5 object levels]

The key takeaway here is to see which groups/accounts have the ms-Exch-Store-Bypass-Access-Auditing, GenericAll, or ExtendedRight right set on them and, if so, to determine at what level in the Configuration tree a potential override has been set that would prevent specific accounts from being audited.

If there is an inherited permission and you cannot tell what rights have been set at each of the 5 object levels, then you will need to open ADSIEdit.msc and walk up the tree from that object until you find the object on which the permissions were changed.

So that sounds great, but what happens when you have a user that is not listed in the tree but is still not being audited? The most likely reason is that the user is a member of a group that has been granted one of these 3 rights somewhere in the tree.

Since PowerShell V1 does not have a direct way to view a user's group membership, I created another really small script to list the groups that a user is a part of. This is not the same format as the output listed above, but it provides a general understanding of what groups the user belongs to, which you can then compare to that output for your troubleshooting efforts.

The syntax for the group membership script is as follows:

.\GetUserGroups.ps1 <Username>

Note: This can be in simple format or domain\username format as shown below.

[Screenshot: GetUserGroups.ps1 output listing the user's group memberships]
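If you just need a quick one-off look at a user's group membership without the script, a minimal sketch along these lines (using the tokenGroups attribute, so nested groups are included) can also work; the account name is a placeholder:

# Expand a user's group membership, including nested groups, via tokenGroups
$sam = "AuditTest1"
$searcher = [adsisearcher]"(&(objectCategory=user)(sAMAccountName=$sam))"
$user = $searcher.FindOne().GetDirectoryEntry()
$user.psbase.RefreshCache(@("tokenGroups"))
$user.psbase.Properties["tokenGroups"] | ForEach-Object {
    $sid = New-Object System.Security.Principal.SecurityIdentifier($_, 0)
    $sid.Translate([System.Security.Principal.NTAccount]).Value   # may fail for orphaned SIDs
}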

This set of scripts can be downloaded here.

I hope this helps untangle the inability to get auditing working for some users/groups as reviewing permissions is sometimes a very tedious task.

Another question that comes up is: how do I map the Message ID listed in the event to an actual message in a user's mailbox? A sample event is listed below with the relevant parts highlighted.

[Screenshot: Exchange Auditing event with the Message ID and folder name highlighted]

So from that, we can see that we have a message ID of <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> and this message was opened in the folder /Calendar.

For every message on an Exchange 2007 server, we generate a unique Message ID that is stamped on the PR_INTERNET_MESSAGE_ID MAPI property of each message. This MAPI property is equivalent to the property tag 0x1035001E.

With MFCMAPI, you can find this message rather easily by creating a content restriction in the mailbox. To do this, you would need to create a MAPI profile for that user specifically on an administrative workstation or use a MAPI profile that has full access to all mailboxes.

IMPORTANT: While going in with an administrative account performing these operations, if the message is touched in the mailbox while using MFCMAPI, an auditing event will be logged to the Exchange Auditing log. If you don't want to log any events while doing your investigation, it may be best to log on with an approved account that has the bypass auditing right, so that whatever actions you are taking inside a user's mailbox are not audited, or use an account that is dedicated to finding the object that was audited.

Once you open the mailbox in MFCMAPI, you would then need to navigate to and open the folder that was listed in the auditing event as shown above. In this case, it was the Calendar folder. After the folder is opened, click Table on the menu and then select Create and Apply Content Restriction.

[Screenshot: MFCMAPI Table menu with Create and Apply Content Restriction selected]

Next, we need to specify what property tag we are looking for, which in this example is 0x1035001E. Once this property tag number is entered, you will see that the Property Name field shows the property we want to filter on. Click OK when you are finished.

[Screenshot: Property tag dialog with the PR_INTERNET_MESSAGE_ID tag entered]

In the Property Value field, enter <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> including the < > characters as shown below. It is very important that the < > characters are entered; otherwise, the restriction will not return the message. Click OK when you are done.

[Screenshot: Property value dialog with the full Message ID entered]

The result should be the message that you are looking for.

[Screenshot: Content restriction result showing the matching message]

That is all for now….

Happy Auditing!!

Exchange 2007/2010 Performance Data Collection Script


In an effort to help streamline performance data collection on Exchange 2007/Exchange 2010 servers, I have created a PowerShell script that automates the data collection for you. One of the nice features of the script is that you do not have to tell it which roles are installed: it automatically detects what is installed locally on the server and adds the appropriate counters for you. Previously, you had to manually select an XML file from here for Exchange 2007 servers or here for Exchange 2010 servers and then import it into the performance console.

I’ve seen a lot of cases that use the previous Perfwiz utility, but unfortunately, this was originally designed to collect data for Exchange 2003 servers and was never updated to support the later versions of Exchange. This older version of Perfwiz should never be used to troubleshoot performance issues for versions later than Exchange 2003 as the pertinent counters are not being collected to accurately troubleshoot a performance issue.

During the development phase of this script, it was found that starting with Windows 2003 x64, the log roll mechanism no longer worked properly and stopped once the maximum log file size was hit. Even though this worked previously on Windows 2003 x86 versions, something changed on the 64-bit platform that prevented it from working. This problem is also inherent in the Windows 2008 operating system, but it was eventually resolved in Windows 2008 R2. The script works around all of these issues to help you collect the right data at the right time by doing the following:

  • If Windows 2003 x64 and –circular switch not specified, then roll log to next log file once maxsize is reached or duration time is hit, whichever one is first.
  • If Windows 2008 RTM/SP1/SP2 and –circular switch not specified, then roll log every 4 hours. If Interval is set to less than 30 seconds, then roll log every hour.

IMPORTANT: To help save on the disk space consumed to write these log files out, the *default duration* is set to 8 hours. This time duration should be enough to capture most performance cases during the day, but if longer durations are needed, then refer to the switches listed in the table below to help set the desired configuration for your needs.

Listed below are the switches that can be used with this script at the time of this posting. New switches will be added as time goes on. These switches should help you collect the right data at the right time and also allow the flexibility to configure the collection to suit your needs.

Parameter            Description

-help or -?          Provides help regarding the overall usage of the script
-circular            Turns on circular logging to save on disk space. Negates default duration of 8 hours
-delete              Deletes the currently running Perfwiz data collection
-duration            Specifies the overall duration of the data collection. If omitted, the default value is (08:00:00) or 8 hours
-EseExtendedOn       Enables Extended ESE performance counters
-EseExtendedOff      Disables Extended ESE performance counters
-filepath            Sets the directory location of where the blg file will be stored
-full                Defines a counter set that includes all Counters/instances
-interval            Specifies the interval time between data samples. If omitted, the default value is (00:00:30) or 30 seconds
-maxsize             Specifies the maximum size of blg file in MB. If omitted, the default value is 512
-query               Queries configuration information of previously created Exchange_Perfwiz Data Collector
-start               Starts Exchange_Perfwiz data collection
-stop                Stops the currently running Perfwiz data collection
-StoreExtendedOn     Enables Extended Store performance counters
-StoreExtendedOff    Disables Extended Store performance counters
-threads             Specifies whether threads will be added to the data collection. If omitted, threads counters will not be added to the collection
-webhelp             Launches web help for script

For additional information, you can check out the website that includes the latest 1.3 version download at http://code.msdn.microsoft.com/ExPerfwiz.

If you have an issue with this script or have a feature suggestion, use the Discussions/Issue Tracker tabs on the Experfwiz page listed above. There are also additional examples of how to run the script with additional switches on the site.
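As a quick illustration, assuming the downloaded script file is named ExPerfwiz.ps1 and is run from an elevated Exchange Management Shell, a typical run might look like the following; the duration and path are just example values:

# Collect for 12 hours, writing the BLG to D:\Perflogs
.\ExPerfwiz.ps1 -duration 12:00:00 -filepath D:\Perflogs

# Check the configuration of the collector, then stop it
.\ExPerfwiz.ps1 -query
.\ExPerfwiz.ps1 -stop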

Enjoy!!

Mike

Detecting Low Virtual Memory Conditions in Windows 2008 and R2


On Windows 2008 servers with Exchange 2007/2010 installed, there are times when you may run out of virtual memory for various reasons. One could be a memory leak in some application or simply not configuring the paging file correctly.

Once you run out of virtual memory on any given server, various applications may start failing/crashing on the server due to the inability to obtain memory to complete a specific function that is being called. In some cases, this could lead to a possible blue screen of death (BSOD).

For server-based systems, the new Reliability Infrastructure helps automatically diagnose various operating system components. As part of that infrastructure, Resource Exhaustion Detection and Resolution (RADAR) notifies you when your resources are reaching critical levels. RADAR is part of the Diagnostic Policy Service that is installed on each server.

When RADAR detects that memory has reached a critical state, a 2004 event is logged to the System log. An example of one of these events is shown below. As you can see, it contains various information: overall memory consumption for various system resources, the top processes for memory consumption, file version information, and paged/nonpaged pool usage including the top tags that could be contributing to the memory problem. The bolded parts are the areas of interest.

Log Name:      System
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      2004
Task Category: Resource Exhaustion Diagnosis Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: store.exe (7580) consumed 11282399232 bytes, MSExchangeMailboxAssistants.exe (21200) consumed 590950400 bytes, and w3wp.exe (21092) consumed 562757632 bytes.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
<System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>2004</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>3</Task>
    <Opcode>33</Opcode>
    <Keywords>0x8000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:47:01.431311400Z" />
    <EventRecordID>169289</EventRecordID>
    <Correlation ActivityID="{AC93AF3C-02AE-433D-8C22-FA32493FAD8C}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>System</Channel>
    <Computer>Exserver01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <MemoryExhaustionInfo xmlns:auto-ns2="
http://schemas.microsoft.com/win/2004/08/events" xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
      <SystemInfo>
        <SystemCommitLimit>21261021184</SystemCommitLimit>
        <SystemCommitCharge>20993597440</SystemCommitCharge>
        <ProcessCommitCharge>19448094720</ProcessCommitCharge>
        <PagedPoolUsage>453672960</PagedPoolUsage>
        <PhysicalMemorySize>17176764416</PhysicalMemorySize>
        <PhysicalMemoryUsage>17025470464</PhysicalMemoryUsage>
        <NonPagedPoolUsage>422363136</NonPagedPoolUsage>
        <Processes>133</Processes>
      </SystemInfo>
    
<ProcessInfo>
        <Process_1>
          <Name>store.exe</Name>
          <ID>7580</ID>
          <CreationTime>2010-09-02T11:21:32.755807700Z</CreationTime>
          <CommitCharge>11282399232</CommitCharge>
          <HandleCount>5619</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1089</TypeInfo>
        </Process_1>
        <Process_2>
          <Name>MSExchangeMailboxAssistants.exe</Name>
          <ID>21200</ID>
          <CreationTime>2010-08-28T06:50:53.878440200Z</CreationTime>
          <CommitCharge>590950400</CommitCharge>
          <HandleCount>2664</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1090</TypeInfo>
        </Process_2>
        <Process_3>
          <Name>w3wp.exe</Name>
          <ID>21092</ID>
          <CreationTime>2010-08-31T08:25:12.245594900Z</CreationTime>
          <CommitCharge>562757632</CommitCharge>
          <HandleCount>2817</HandleCount>
          <Version>7.0.6002.18005</Version>
          <TypeInfo>67</TypeInfo>
        </Process_3>
        <Process_4>
          <Name>powershell.exe</Name>
          <ID>19692</ID>
          <CreationTime>2010-09-03T09:12:48.188589800Z</CreationTime>
          <CommitCharge>152682496</CommitCharge>
          <HandleCount>629</HandleCount>
          <Version>6.0.6002.18111</Version>
          <TypeInfo>136</TypeInfo>
        </Process_4>
        <Process_5>
          <Name>mmc.exe</Name>
          <ID>18768</ID>
          <CreationTime>2010-09-03T09:12:42.167067000Z</CreationTime>
          <CommitCharge>107646976</CommitCharge>
          <HandleCount>464</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>144</TypeInfo>
        </Process_5>
        <Process_6>
          <Name>explorer.exe</Name>
          <ID>13396</ID>
          <CreationTime>2010-09-03T09:12:24.929288000Z</CreationTime>
          <CommitCharge>22032384</CommitCharge>
          <HandleCount>451</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>152</TypeInfo>
        </Process_6>
      </ProcessInfo>
      <PagedPoolInfo>
        <Tag_1>
          <Name>MmSt</Name>
          <PoolUsed>216638928</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>CM31</Name>
          <PoolUsed>103596032</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>MmRe</Name>
          <PoolUsed>15907504</PoolUsed>
        </Tag_3>
      </PagedPoolInfo>
      <NonPagedPoolInfo>
        <Tag_1>
          <Name>SmMs</Name>
          <PoolUsed>161243168</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>BCM0</Name>
          <PoolUsed>40694064</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>Cont</Name>
          <PoolUsed>35498720</PoolUsed>
        </Tag_3>
      </NonPagedPoolInfo>
      <ExhaustionEventInfo>
        <Time>2010-09-03T10:47:18.540433800Z</Time>
      </ExhaustionEventInfo>
    </MemoryExhaustionInfo>
  </UserData>
</Event>

This helps you determine what resource was the possible offender without having to install any additional tools on the server to troubleshoot this. The best part is that you don’t have to wait for an additional event to occur as the information has already been collected and logged.
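If you would rather pull these events with PowerShell than scroll through Event Viewer, a sketch like the following works; it assumes PowerShell 2.0 for Get-WinEvent with -FilterHashtable (included in Windows 2008 R2):

# Most recent RADAR low virtual memory diagnosis events (2004) from the System log
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Resource-Exhaustion-Detector'
    Id           = 2004
} -MaxEvents 10 | Format-List TimeCreated, Message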

There is another place where these events are logged: the Windows Resource Exhaustion Detector (Resource-Exhaustion-Detector) operational log under Applications and Services Logs in Event Viewer, as shown below.

[Screenshot: Resource-Exhaustion-Detector operational log under Applications and Services Logs in Event Viewer]

These events show much less information than the system event, but do show your commit limits and charges to the system too. Sample below.

Log Name:      Microsoft-Windows-Resource-Exhaustion-Detector/Operational
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      1003
Task Category: Resource Exhaustion Detection Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Computer:      ExServer01.Domain.Com
Description:
The Windows Resource Exhaustion Detector received a notification that the computer is low on virtual memory.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>1003</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>2</Task>
    <Opcode>22</Opcode>
    <Keywords>0x4000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:52:01.431065200Z" />
    <EventRecordID>180</EventRecordID>
    <Correlation ActivityID="{0B95CAB5-E004-4C92-BF5D-3BFA39FDF7EE}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>Microsoft-Windows-Resource-Exhaustion-Detector/Operational</Channel>
    <Computer>ExServer01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <CommitLimitExhaustion xmlns:auto-ns2="http://schemas.microsoft.com/win/2004/08/events" xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
  
   <SystemCommitLimit>21261021184</SystemCommitLimit>
      <SystemCommitCharge>21258543104</SystemCommitCharge>

    </CommitLimitExhaustion>
  </UserData>
</Event>

A couple of potential events that can be seen when memory resources are low are shown below.

  • MSExchangeRepl Service failing to read a log file for database copy due to an out of memory error condition.

    Log Name:      Application
    Source:        MSExchangeRepl
    Event ID:      2168
    Task Category: Service
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      ExServer01.Domain.Com
    Description:
    Log file F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log' for database copy EXServer MBX Store 001\ExServer01' couldn't be read. Error: Out of Memory (-1011)
    Event Xml:
    <Event xmlns="
    http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="MSExchangeRepl" />
        <EventID Qualifiers="49156">2168</EventID>
        <Level>2</Level>
        <Task>1</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-02T11:21:13.000000000Z" />
        <EventRecordID>3190563</EventRecordID>
        <Channel>Application</Channel>
        <Computer>Exserver01.domain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data>F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log</Data>
        <Data>EXServer MBX Store 001\ExServer01' </Data>
        <Data>Out of Memory (-1011)</Data>
      </EventData>
    </Event>

  • A Registry flush operation failing to write the SOFTWARE hive to disk

    Log Name:      System
    Source:        Microsoft-Windows-Kernel-General
    Event ID:      6
    Task Category: None
    Level:         Error
    Keywords:     
    User:          SYSTEM
    Computer:      ExServer01.domain.com
    Description:
    An I/O operation initiated by the Registry failed unrecoverably.The Registry could not flush hive (file): '\SystemRoot\System32\Config\SOFTWARE'.
    Event Xml:
    <Event xmlns="
    http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Kernel-General" Guid="{a68ca8b7-004f-d7b6-a698-07e2de0f1f5d}" />
        <EventID>6</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-03T10:48:17.714333400Z" />
        <EventRecordID>169290</EventRecordID>
        <Correlation />
        <Execution ProcessID="4" ThreadID="92" />
        <Channel>System</Channel>
        <Computer>ExServer01.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="FinalStatus">0xc000014d</Data>
        <Data Name="ExtraStringLength">36</Data>
        <Data Name="ExtraString">\SystemRoot\System32\Config\SOFTWARE</Data>
      </EventData>
    </Event>

The component used to instantiate a specific function determines which source logs the event in the System log. Finding the root cause of memory issues has become significantly easier with this new Reliability Infrastructure, and I hope this blog helps show you some of the methods for troubleshooting these types of issues.

Until next time!!!

How to unload/reload performance counters on Exchange 2010


 

Previously, I wrote a set of steps for fixing/repairing broken Exchange 2007 counters at http://blogs.technet.com/b/mikelag/archive/2009/10/21/how-to-fix-repair-broken-exchange-2007-counters.aspx. That set of steps has been replaced with an easier way of loading the counters using PowerShell for Exchange 2010. This new method applies mainly to managed code services in Exchange 2010, as some of the previous lodctr/unlodctr commands still work for services such as MSExchangeIS and ESE.

In this post, I am going to use the same missing ESE counters as an example.

The important values that we are interested in are the following; these values must exist prior to loading the ESE performance counters.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
"Close"="ClosePerformanceData"
"Collect"="CollectPerformanceData"
"Library"="C:\\Program Files\\Microsoft\\Exchange Server\\V14\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
"Open"="OpenPerformanceData"
"PerfIniFile"="eseperf.ini"

Here is a screenshot of what it will look like:

[Screenshot: ESE Performance registry key with the values listed above]

Again, verify whether the Disable Performance Counters value is set in the registry and, if so, whether it is set to a value other than 0. If it is set to 4, those counters had a problem at one point and were disabled; change it to 0 and then reopen Performance Monitor to see if the counters are visible now. If that works, life is good and you can stop reading here. If the counters are still not available, continue on to reloading them.

If the values First Counter, First Help, Last Counter and Last Help are listed, then it is recommended to unload the performance counters first before attempting to reload them again.
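A quick way to check those values from the shell is shown below; the registry path is for the ESE counters and would need to be adjusted for other services:

$key = "HKLM:\SYSTEM\CurrentControlSet\Services\ESE\Performance"
Get-ItemProperty -Path $key |
    Select-Object "Disable Performance Counters", "First Counter", "First Help", "Last Counter", "Last Help", Library, PerfIniFile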

To unload performance counters for ESE, perform the following steps:

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type remove-perfcounters -definitionfilename "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml" and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server

To reload performance counters for ESE, perform the following steps

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type new-perfcounters -definitionfilename "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml" and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server

This will reload the appropriate performance counters for ESE. Once that is done, check the application event log to ensure that they were loaded OK and no Perflib errors exist. Open Performance monitor again to see if the counters are available.

If you are looking to reload any of the Exchange counters, you can substitute the appropriate XML in the command with any of the available XMLs from the \Program Files\Microsoft\Exchange Server\V14\Setup\Perf directory.

Here is a screenshot of some of the available performance counters

[Screenshot: Performance counter definition XML files in the \Setup\Perf directory]

I hope this helps in your performance counter monitoring in Exchange 2010.


Processor 0 increased CPU utilization


While recently looking at Task Manager on an Exchange 2010 server to review CPU utilization, I noticed that Processor 0 was at 100% CPU while all of the other CPUs were relatively low compared to this processor. This type of behavior is caused by the Receive Side Scaling (RSS) feature not being enabled on the server. RSS is a feature that was first implemented back in Windows 2003 with the Scalable Networking Pack and allows you to spread network traffic across multiple CPU cores. If RSS is not enabled, only *one* CPU will be used to process incoming network traffic, which can cause a networking bottleneck on the server. Additional information on RSS can be found here.

Here is what it looks like in Task Manager on the Performance tab.

[Screenshot: Task Manager Performance tab with Processor 0 pegged at 100% while the other processors are much lower]

As you can see, the first processor is pegged at 100% CPU which is indicative of RSS not being enabled. Generally on new installations of Windows 2008 or greater, this feature is enabled by default, but in this case, it was disabled.

Prior to enabling RSS on any given machine, there are a few dependencies that are necessary for RSS to work properly and are listed below.

  • Install the latest network card driver and associated Network Configuration Utility. The network card driver update is very important, as older versions had known bugs that would cause RSS to fail.
  • Offloading features of the network card must be enabled (i.e. IPv4 Checksum Offload, TCP/UDP Checksum Offload for IPv4/IPv6)
  • Receive Side Scaling must be enabled on the network card properties
  • Receive Side Scaling Queues and Max number of RSS Processors must be set to the maximum value listed in the network card properties. This is typically the number of CPU cores that are installed on the server. Hyperthreading does not count towards the maximum number of CPU cores that can be leveraged here. The use of hyperthreading is generally not recommended on Exchange servers anyway and is referenced here

    Note: If Receive Side Scaling Queues and Max number of RSS Processors are not changed to a value above 1, then enabling RSS does not provide any benefits since you will only be using a single core to process incoming network traffic.
  • RSS must be enabled at the OS layer by running netsh int tcp set global rss=enabled. Use netsh int tcp show global to confirm that the setting was enabled properly. (Both commands are shown together after this list.)
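For reference, here are those two OS-level commands together, run from an elevated command prompt (the NIC-level settings in the list above still have to be configured through the network card properties):

netsh int tcp set global rss=enabled
netsh int tcp show global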

After enabling RSS, you can clearly see below the difference in processor utilization on the server, as the CPU utilization for Processor 0 is now fairly close to the other processors right around 3:00AM.


[Screenshot: Performance Monitor showing Processor 0 utilization dropping in line with the other processors around 3:00AM after RSS was enabled]

Many people have disabled the Scalable Networking Pack features across the board due to the various issues that were caused by the TCP Chimney feature back in Windows 2003. All of those problems have now been fixed in the latest patches and latest network card drivers, so enabling this feature will help increase networking throughput almost twofold. The more features that you offload to the network card, the less CPU you will use overall. This allows for greater scalability of your servers.

You will also want to monitor the number of deferred procedure calls (DPCs) that are generated, since there is additional overhead for distributing this load among multiple processors. With the latest hardware and drivers available, this overhead should be negligible.

In Windows 2008 R2 versions of the operating system, there are new performance counters to help track RSS/Offloading/DPC/NDIS traffic to different processors as shown below.

Object: Per Processor Network Activity Cycles(*)
Performance counters:

Stack Send Complete Cycles/sec
Miniport RSS Indirection Table Change Cycles
Build Scatter Gather Cycles/sec
NDIS Send Complete Cycles/sec
Miniport Send Cycles/sec
NDIS Send Cycles/sec
Miniport Return Packet Cycles/sec
NDIS Return Packet Cycles/sec
Stack Receive Indication Cycles/sec
NDIS Receive Indication Cycles/sec
Interrupt Cycles/sec
Interrupt DPC Cycles/sec

Object: Per Processor Network Interface Card Activity(*)
Performance counters:

Tcp Offload Send bytes/sec
Tcp Offload Receive bytes/sec
Tcp Offload Send Request Calls/sec
Tcp Offload Receive Indications/sec
Low Resource Received Packets/sec
Low Resource Receive Indications/sec
RSS Indirection Table Change Calls/sec
Build Scatter Gather List Calls/sec
Sent Complete Packets/sec
Sent Packets/sec
Send Complete Calls/sec
Send Request Calls/sec
Returned Packets/sec
Received Packets/sec
Return Packet Calls/sec
Receive Indications/sec
Interrupts/sec
DPCs Queued/sec

I hope this helps you understand why you might be seeing this type of CPU usage behavior.

Until next time!!

Mike

How fragmentation on incorrectly formatted NTFS volumes affects Exchange



Recently we have been seeing some gnarly performance issues in Exchange 2007 along with an added splash of database operation failures. This doesn't sound enticing at all, but this blog post is going to discuss what these issues are and how to resolve them. This post is targeted mainly at Exchange 2007, but you can also apply the same methodology to Exchange 2010, which is where the original problem was seen.

Before going in to this, here is a highlight of some of the issues that you may see:

  • Databases failing with an Out of Memory condition
  • Extremely slow log replay times on CCR/SCR replica copies (High replay queue lengths)
  • High amounts of split I/Os occurring on any given LUN/volume.
  • Slowly rising RPC requests until the Information Store service goes unresponsive

Examples

Here are some examples of the out of memory condition that would be written to the application log on the affected Exchange server.

Event Type:         : Error
Event Source:      : MSExchangeIS
Event Category:     : None
Event ID     : 1160
Description:
Database resource failure error Out of memory occurred in function JTAB_BASE::EcUpdate while accessing the database "CCRName\SGName".

Windows 2003 based error
Event Type:        Error
Event Source:    ESE
Event Category:                General
Event ID:              482
Description:
MSExchangeIS (9228) DBName: An attempt to write to the file "F:\Data\DBName.edb" at offset 530157682688 (0x0000007b6fdc4000) for 8192 (0x00002000) bytes failed after 0 seconds with system error 1450 (0x000005aa): "Insufficient system resources exist to complete the requested service. ".  The write operation will fail with error -1011 (0xfffffc0d).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

Windows 2008 based error
Log Name:      Application
Source:        ESE
Event ID:      482
Task Category: General
Level:         Error
Description:
Information Store (8580) DBNAme: An attempt to write to the file "F:\Data\DBName.EDB" at offset 315530739712 (0x0000004977190000) for 32768 (0x00008000) bytes failed after 0 seconds with system error 665 (0x00000299): "The requested operation could not be completed due to a file system limitation ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

So just what is this Insufficient system resources exist to complete the requested service error? The explanation will come later….

Here is an example of very high split I/O operations (purple line) leading up to high RPC requests (green line) until the server went unresponsive. In the case below, we were trying to extend the size of the database and couldn't because of the underlying cause, which I will explain shortly.

[Screenshot: Performance Monitor graph showing Split IO/sec (purple) rising ahead of RPC Requests (green)]

Another clear sign that you might be running into this problem is when all I/O requests for that particular database instance go to zero while RPC requests continue to climb and Version Buckets plateau.

[Screenshot: Performance Monitor graph showing database I/O dropping to zero while RPC Requests climb and Version Buckets plateau]

This particular problem is not an obvious one and requires a few levels of explanation of what is going on, plus a little bit of terminology, to get you going. At the lowest layer, an Exchange database resides on an NTFS partition which is set up when the server is first configured. This initial setup has some specific guidelines around how to properly partition and format the volumes, referenced in http://technet.microsoft.com/en-us/library/bb738145(EXCHG.80).aspx for Exchange 2007 and http://technet.microsoft.com/en-us/library/ee832792.aspx for Exchange 2010. The two most important factors are proper partition alignment and NTFS allocation unit size.

Below is a table of recommendations for use with Exchange.

Description                          Recommended Value

Storage Track Boundary               64K or greater (1MB recommended)
NTFS allocation unit/cluster size    64KB (DB and Log Drives)
RAID Stripe size                     256KB or greater. Check with your storage vendor for best practices

NTFS allocation unit size

Before we go into this area, we need to take a step back and look at how NTFS operates. This is where you need to do a little homework by reading the following 2 references:

Now that we have gone over the basic concept of what a File Attribute List (ATTRIBUTE_LIST) is and how files are actually stored on disk, we can continue on with why this is so important here. Let's say that we have a disk formatted with a file allocation unit size of 4K (4096 bytes), which is the default in Windows 2003 for any partition greater than 2GB in size. With Exchange 2007's ESE page size of 8K, we need to make two writes for a single page. These writes may or may not be contiguous and could spread data across various sections of the disk, and this is where fragmentation can begin for larger files. As the database files grow and fragment, the File Attribute List (FAL) grows outside of the MFT to keep track of all of those fragments.

NTFS does have its limitations on the overall size of this attribute list per file: a file can have roughly around 1.5 million fragments. This is not an absolute maximum, but it is around the point where problems can occur. The FAL size never shrinks and will continually keep growing over time. The maximum supported size of the ATTRIBUTE_LIST is 256K, or 262,144 bytes. If you reach this upper limit, you can no longer expand the size of your database, and we end up doing many more small I/O operations and a lot more seeking around the drive to find the data we are looking for. This is where the "out of memory" error comes from, along with the "Insufficient system resources exist to complete the requested service" error. File management APIs start failing with ERROR_FILE_SYSTEM_LIMITATION in Windows 2008 or later, and with ERROR_INSUFFICIENT_RESOURCES on Windows versions earlier than that, when the absolute maximum has been reached. The out of memory error is a much higher level error that was bubbled up because NTFS could not increase the size of the FAL anymore. This is why it is not an obvious error; it was ultimately found by Eric Norberg troubleshooting over many tireless nights and through long debugging sessions by EE extraordinaire Dave Goldman.

This fragmentation issue is actually referenced in the following article:

A heavily fragmented file in an NTFS volume may not grow beyond a certain size
http://support.microsoft.com/kb/967351

This scenario is seen more on servers with smaller NTFS cluster sizes such as 4K, large databases that are 2 times the recommended 200GB maximum, and low available disk space. The combination of those 3 variables can get you into a very bad situation.

NTFS cluster sizes can be obtained by running the fsutil command as shown below for any given partition:
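For example, for an F: drive hosting a database (the drive letter is just a placeholder):

fsutil fsinfo ntfsinfo F:

The Bytes Per Cluster value in the output is the NTFS allocation unit size; 4096 there means the volume was formatted with the 4K default rather than the recommended 64K.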

[Screenshot: fsutil fsinfo ntfsinfo output showing the Bytes Per Cluster value]

In Exchange 2007, you can check if you are running in to this issue by downloading and running Contig.exe from Sysinternals at http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx

C:\>Contig.exe -a f:\data\DBName.edb

Contig v1.55 - Makes files contiguous
Copyright (C) 1998-2007 Mark Russinovich
Sysinternals - www.sysinternals.com

f:\data\DBName.edb is in 1.46698e+006 fragments

Summary:
     Number of files processed   : 1
     Average fragmentation       : 1.46698e+006 frags/file

In the above example, we are extremely close to the approximate maximum of 1.5 million fragments that you can have for any given file. This particular database will eventually be problematic and is a ticking time bomb waiting to happen.

For Exchange 2010 SP1, you can dump the same type information similar to contig.exe using eseutil.exe as shown below.

C:\>eseutil /ms f:\data\DBName.edb

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.01
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...
Error: Access to source database 'f:\data\DBName.edb' failed with Jet error -1032.

File Information:
  File Name: f:\data\DBName.edb
  Volume Name: Drive2
  File System: NTFS
  Cluster Size: 4096 bytes
  Attribute List Size: 180 KB
  Extents Enumerated: 1157172

Operation terminated with error -1032 (JET_errFileAccessDenied, Cannot access file, the file is locked or in use) after 0.78 seconds.

Even though the command errors out because the database is online, we are still able to obtain similar data. When run locally on the server, eseutil reports the actual FAL size, the NTFS cluster size, and how many extents have been created for that file due to excessive fragmentation. From the output above we can deduce that the NTFS cluster size is 4KB, the FAL size is 180KB, and the extents enumerated are over 1.1 million fragments. A general rule of thumb is to keep the FAL size below 150KB and to have sufficient available disk space.

This fragmentation is also seen on CCR/SCR replica copies as the log files are shipped and then played into the database. The end result is that log replay slows to a crawl and you can see very high replay queue lengths due to the excessive split I/Os occurring. Even the fastest disks will not save you if the NTFS cluster sizes or disk alignments are configured improperly; you must fix the root of this problem to successfully resolve the issue.

So how do you mitigate this? Well, there are various ways to do this…

  1. If you determine that only a single database is affected by this issue, the quickest mitigation method to get you back in business is the following (a minimal command sketch appears after this list):
    1. Dismount the database
    2. Make a copy of the database to another drive with sufficient space. IMPORTANT: This cannot be on the same drive as we need to write this file out contiguously to another drive. This mere act of copying the file defrags the file for you.
    3. Delete the original copy of the database file
    4. Copy the database back to the original location
    5. Using this method does not resolve the issue long term if the NTFS cluster sizes are too small. It is only meant as a stopgap to buy you some time until a permanent fix is in place.
  2. If on a CCR/SCR cluster, you have some options to fix this longer term.
    1. To correct the NTFS cluster size on the non-active node or SCR target for any particular volume, such as F:, use the following command to format the disk with a 64KB allocation unit size, which is the recommended value for optimal performance.

      Format F: /q /y /fs:ntfs  /v:VolumeName /a:64K

      NOTE:
      This command wipes out any files that currently reside on the F: drive, so make sure that no other files or applications reside on this drive other than the database and log files. I would hope that you are dedicating these drives exclusively to Exchange and not sharing them with any other applications. Exclusivity is what makes recovering from this much easier.
    2. Verify that the disk was formatted properly by running the following command:

      image
    3. Once the disk has been reformatted, go ahead and reseed the databases that previously existed on the drive.
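For the first mitigation option above, a minimal command sketch, assuming the Exchange Management Shell and example server, storage group, database, and path names (substitute your own), might look like this:

Dismount-Database -Identity "EXSERVER\SGName\DBName"
Copy-Item F:\Data\DBName.edb G:\Staging\DBName.edb      # copy to a DIFFERENT physical drive
Remove-Item F:\Data\DBName.edb                          # delete the fragmented original
Copy-Item G:\Staging\DBName.edb F:\Data\DBName.edb      # copy back contiguously
Mount-Database -Identity "EXSERVER\SGName\DBName"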

You may ask yourself: if the file is so fragmented, why can I not simply do an offline defrag of the file? The answer is that defragmenting the file has a high possibility of bloating the FAL even further, since moving the fragments around causes the FAL size to grow. This is the primary reason why Exchange does not recommend running defrag on volumes that host database files. The only way to remove the attribute list for this file is to completely copy the file off to another drive, delete the original copy, and then copy the file back to the original location. When this is done, the file is written to the disk contiguously, leaving essentially no fragments in the file. Life is good once again.

Once you have resolved these underlying issues, overall Exchange performance should be that much better and you can sleep better at night knowing you have increased throughput on your Exchange servers.

Note that it is still not recommended to run disk defragmentation software on Exchange server volumes, but there are times when file-level fragmentation can cause significant performance problems on a server merely because of the way data is being written to the disk. If optimal or recommended settings are not used when creating the volumes, this file fragmentation issue can occur much more quickly. The majority of Exchange files are in use, so running regular disk defragmentation programs on the server will not help with this situation. If necessary, the only way to resolve this is to take all Exchange resources offline to ensure none of the files are in use and then defragment the disk to make the files contiguous once again.

In Exchange 2010 SP1 or later, logic was added to detect when the FAL is close to exhaustion (80% of the maximum) and to log an event accordingly; there is no NTFS event for this behavior. The following event is an example of what would be logged for a problematic database during online maintenance.

Log Name: Application
Source: ESE
Event ID: 739
Task Category: General
Level: Error
Description:
Information Store (5652) EXSERVER MBX Store 001: The NTFS file attributes size for database 'C:\DB\DB001\PRIV001.EDB' is 243136 bytes, which exceeds the threshold of 204800 bytes. The database file must be reseeded or restored from a copy or backup to prevent the database file from being unable to grow because of a file system limitation.

Update (3/8/2011): Exchange 2007 SP3 RU3 now has a fix that is referenced in http://support.microsoft.com/kb/2498066 that will increase the default extent size from 8MB to 64MB similar to that of Exchange 2010. Increasing the extent size helps reduce the amount of fragments that will be created for any given database. The 739 event has also been added so that monitoring software can alert on potential problems.

Reasonable volume sizes and database sizes go a long way toward protecting you from fragmentation: the more competing files being extended or created on a volume, the greater the fragmentation of those files will be.

Recommendations:

  • Keep your volume sizes at or below 2TB (which is why MBR partitions are recommended for Exchange 2007). Exchange 2010 can use GPT volumes greater than 2TB, but the recommendation is still to keep database sizes under 2TB.
  • Limit the number of databases hosted per volume. Ten per volume is the absolute maximum we would recommend; five per volume is much better.
  • Do not place write-intensive non-Exchange workloads on the same volume as an Exchange database.

I hope this sheds some light on why certain failures on Exchange servers could prevent you from doing various operations.

Thanks go to Matt Gossage, Tim McMichael, Bryan Matthew, Neal Christiansen and Luke Ibsen for reviewing this blog entry before posting

Mike

Perfwiz for Exchange 2010


NOTE: This version of Perfwiz has been replaced by a newly written script that is talked about in http://blogs.technet.com/b/mikelag/archive/2010/07/09/exchange-2007-2010-performance-data-collection-script.aspx

  1. Download the appropriate version of Perfwiz for your server

    How to download
    To download these XML files to your computer, right click the file of your choice, select Save Target As… , and then save it to a directory location of your choice on your Exchange Server

    Role Based
    Use these as a high level look in to how the server is performing and if you need to branch out with more counters, use the Full Counter/Instance set below.

    Exchange_2010_Perfwiz-MBX.xml
    Other roles coming soon….

    All Counters/All Instances
    Use this counter set at your own discretion as this could potentially cause performance degradation on your server trying to log this amount of counters.

    Exchange_2010_Perfwiz-Full.xml

  2. Open Performance Monitor
  3. Expand Reliability and Performance and then expand Data Collector Sets
  4. Right click User Defined, Select New, and then Data Collector Set
  5. Enter a unique name for this Data Collector set (ie. ExPerfwiz), select Create from a template (Recommended) and then click Next
  6. Select the Browse button, navigate to the XML file that was saved in Step 1, select Open
  7. Select Next on the next screen
  8. Enter in a root Directory of where you would like to store the performance log files. Click Next
  9. If you need to run this performance log under different credentials, enter it on this page. Click Finish

The effects of paged pool token memory usage and Outlook 2007’s shared folders feature


In Exchange 2003, we are bound by 32-bit architecture memory limitations and paged pool memory just so happens to be one of them. This is a very common issue that we in Exchange support see is paged pool memory depletion related to excessive Token usage.

For the most part, paged pool memory troubleshooting for Exchange 2003 can be found in http://support.microsoft.com/kb/912376. This article goes over what access tokens are, what symptoms are seen when paged pool memory usage is low, how token sizes are generally calculated, and a script that will allow you to export group memberships for users that are mail enabled. It does not go over some of the other reasons why paged pool memory may increase on any given Exchange 2003 server.

There is also a great document “Addressing Problems Due to Access Token Limitations” at http://download.microsoft.com/download/8/f/3/8f36dfe4-47d0-4775-ad5a-5614384921aa/AccessTokenLimitation.doc that goes over Access Tokens in much greater detail.

If you had used the scripts in 912376 to export the users group membership, you might end up with data in Excel similar to the screenshot listed below.

image

Notice that we have a slew of users that have really high group membership. This can and will affect overall Token sizes when connecting to an Exchange server.

If you ever wanted to calculate the approximate size of a users access token, the formula as listed in 912376 is the following:

[12 x number of user rights] + [token overhead] + [44 x number of group memberships] = token size in bytes

Trying to find the maximum amount of security groups a user can belong to before we double the size, we would come up with the following (4096-500 / 44 = 81.70). This means that a user needs to be a part of a maximum of approximately 80 security groups before the token size doubles to 8192. This a rough guesstimate, but this should be enough to estimate overall token sizes for users in general terms if you wanted to. Microsoft does not recommend more than 100MB of paged pool memory for token allocations as it may impact Exchange and other applications installed on the server.
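To put the formula to work with some purely hypothetical numbers: a user with 14 user rights, a token overhead of roughly 500 bytes, and 120 security group memberships would come out to [12 x 14] + [500] + [44 x 120] = 168 + 500 + 5280 = 5,948 bytes, which no longer fits in the initial 4,096-byte allocation and therefore consumes an 8,192-byte token.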

One of the reasons that I believe there are so many mail-enabled security groups in any organization is that this is the default behavior when creating groups in Active Directory Users & Computers (ADUC). When you create a group in ADUC, the default type is a security group, not a distribution group, so over time users can end up as members of many security groups, which is a bad thing. If you let this go long enough, a user's access token can bloat to high values, causing paged pool memory issues from excessive Token usage.

One of the first things you need to do is ensure that the hotfix in http://support.microsoft.com/kb/912480 is applied on the Exchange server. This hotfix decreases the amount of connections that Outlook users make to an Exchange server.

Each connection that a client makes to the server increases overall token usage for that user, due to the fact that we need to expand all group memberships to ensure that the user has permissions to the requested resource and does not have a specific deny in any of the many groups they are a part of. If a client establishes a lot of TCP connections to a single process such as store.exe, paged pool memory on the server can be exhausted very easily as the server repeatedly verifies that the user has permission to access resources in other users' mailboxes. The more connections that are open, the more token memory will be consumed.

So now I would like to bring to light a new feature in Outlook 2007 that caches shared folders such as Calendars, Contacts, Tasks, and Notes. This specific feature, Download shared folders (excludes mail folders), is turned on by default in any new Outlook 2007 profile. It allows easier and faster access to other users' mailbox folders since the data is cached locally in their Outlook profile. Here is what the option looks like in a user's Outlook profile.

image

For each shared folder that is cached in a user profile, an additional connection is opened to the Exchange server to sync that data to the user's OST file. If a user opens a lot of other folders, you will see many connections to the Exchange server. To make it even better, http://support.microsoft.com/kb/955572 introduced a new feature where you can cache other users' mail folders, such as the Inbox, by adding a registry value called CacheOthersMail on each client computer. Once this is implemented, any user can cache any other user's mailbox folders should they have access to them.

As you can see, this can significantly increase the number of connections to any given Exchange server, thus increasing overall token usage. If you have users with high group memberships, it may be advantageous to disable these features. This can be accomplished organization-wide via Group Policy by disabling the "Synchronizing data in shared folders" option listed in http://technet.microsoft.com/en-us/library/cc179175(office.12).aspx. Disabling this feature causes the Outlook clients to make direct RPC connections to the Exchange server since that folder data is no longer cached locally, so it may take just a bit longer to pull up the data they are looking for.

If you are migrating your users from Outlook 2003 to Outlook 2007 and are noticing increased paged pool Token memory usage on an Exchange server, then this may be one of the reasons for it.

One way to determine whether an Outlook client has many connections open to the Exchange server is to run netstat -ao on the Exchange server and redirect the output to a text file. If you view the list of connections from a single client IP address to the Store process, which in this case was listening on port 1674, you may see many established connections as shown below.

TCP    EXSERVER:1674       CLIENTCOMPUTER:3196     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3202     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3205     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3210     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3219     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3234     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3235     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3238     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3239     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3241     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3246     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3252     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3253     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3257     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3258     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3260     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3261     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3263     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3264     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3277     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3278     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3280     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3282     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3283     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3284     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3285     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3286     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3293     ESTABLISHED     4944
TCP    EXSERVER:1674       CLIENTCOMPUTER:3301     ESTABLISHED     4944
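If you want to quickly count how many of these connections a single client has open, you can redirect the netstat output to a file and count the matching lines; CLIENTCOMPUTER below is a placeholder for the client's resolved name or IP address, and the output path is just an example:

netstat -ao > C:\temp\connections.txt
find /c "CLIENTCOMPUTER:" C:\temp\connections.txt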

If this type of pattern is seen for various clients, then this feature may be affecting your Exchange server's paged pool memory usage, which could potentially lead to unresponsive servers.

In one of the cases that I worked, disabling this feature via Group Policy dropped the runtime paged pool memory usage by 40-50MB. This is a huge memory savings that may keep your server afloat until you migrate to Exchange 2007 or Exchange 2010. Of course, those versions of Exchange are not susceptible to these types of nonpaged or paged pool memory issues since we have moved away from the 32-bit platform and only support 64-bit versions of Exchange in production.

Once you have migrated these users to later versions of Exchange, you can then remove the group policy to enable this feature again.

Of course there are many other reasons for excessive Token usage on an Exchange server, but this is just one of those new features that may bite you in the end.

Until next time…

Exchange Server Profile Analyzer updated!!


For any of you folks that might have created a hairball on the floor by pulling your hair out trying to obtain data from the Exchange Profile Analyzer (EPA) tool, there is good news: a new version has just been released that should resolve the majority of the issues you might have run into previously. One major blocker that this update fixes is the case where your default recipient policy includes variables such as %s.%g@domain.com, amongst a slew of other enhancements.

Please note that this is still the DAV version of the tool and will not run against Exchange 2010 servers; it should only be run against Exchange 2003 or Exchange 2007 installations, in the hope that you can obtain the user profiles for your users to allow proper planning and sizing of your Exchange 2010 servers.

You can download the latest binaries from the following locations.

32-bit – http://www.microsoft.com/downloads/details.aspx?familyid=8F575F60-BD80-44AA-858B-A1F721108FAD&displaylang=en
64-bit – http://www.microsoft.com/downloads/details.aspx?familyid=C009C049-9F4C-4519-A389-69C281B2ABDA&displaylang=en

Enjoy!!

Perfmon data viewing nuance…


Recently, I have been running in to a perfmon nuance that I would like to call out. On any given performance troubleshooting day which is almost every day, I open up Perfmon, select the BLG file of choice and then add the appropriate counters that I would like to view.

In this instance, I was presented with the following view.

image

The odd thing is that all instances are linear in fashion for both increasing or decreasing counter values. Something surely is not right. As you can see, this is not helpful at all.

Since I know that this perfmon has data in it, I needed to find a way to view this information as it was crucial we found root cause of the issue with the data provided.

One of the reasons that this could happen is if there are long gaps of uncollected data in the perfmon file as data is being appended to an existing BLG file. Say I started a perfmon collection for 1 second and then immediately stopped it; I would collect one sample of data. If my perfmon data collection is set up to append data to the existing BLG file, then the next time the collection is started it will append the new data to that same file. Since we average out the data over the last 100 samples, you can clearly see that the averages here will be way off and will show data similar to the above screen capture.

So how do I overcome this? Well, first you need to find out where the data gaps are in the perfmon file. To do this, you can use the relog utility that comes with Windows to dump the BLG file out to CSV format. To make this easier, I export on the \Processor(_Total)\% Processor Time counter, as that counter is normally collected in every perfmon file. The syntax of the command is below along with the associated result information.

image
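The screenshot is not reproduced here, but relog takes the input .blg file, a counter path (-c), an output format (-f), and an output file (-o). Assuming the capture file is named perfmon.blg, the command would look something like this:

relog perfmon.blg -c "\Processor(_Total)\% Processor Time" -f csv -o log.csv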

In the above case, we can see that the blg contained information for 14 hours, but only 29 samples were taken. Based on that alone, the numbers do not add up.

If you open the log.csv file in notepad, you should get an output similar to the following. As highlighted below, you will see the actual gap in data collection which was throwing off the view in perfmon. This output will also give us the time interval that this perfmon was taken which was every 30 seconds. This time interval is extremely important to me when viewing perfmon data for any Exchange Performance related issues.

image

The first data sample is what we don't need here, so we need to change our Time Range in Perfmon by moving the slider bar to the right, to approximately one hour before the data capture we care about.

image

Once you have the slider where you want it, right-click the slider bar and then select Zoom To, as shown in the following pictorial.

image

What you should now see is some resemblance of performance data.

image

To remove the whitespace surrounding the capture, left-click and drag the time slider to the right until you have highlighted the data you want to look at. Once selected, right-click the window and select Zoom To to zoom in to that time frame.

image

What you should now see is the actual data you want to look at.

image

Another way to view this data is to use the relog utility to export specific timeframes from the log. Since we know the start and end times of the data we want to view, we can simply run the following command to export the data between those times to an export.csv file. Once you pull up the exported CSV file in perfmon, you will have the same view as the one listed above.

image
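As with the earlier example, the file names and timestamps below are placeholders; the -b and -e switches take the begin and end times of the window you want to export:

relog perfmon.blg -b "09/03/2010 09:00:00" -e "09/03/2010 17:00:00" -f csv -o export.csv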

I hope this helps you decipher whether or not a perfmon file has been corrupted or the data just does not look correct for whatever reason.

Happy performance viewing!!

Audit Exchange 2007 SP2 Auditing


There have been a few cases coming through the support channels stating that auditing is not working for one reason or another. After reviewing the cases, we have found that this is due to users or groups in the Configuration partition of Active Directory that have been granted Full Access to the containers in the tree or that hold the All Extended Rights permission. Having these permissions will essentially prevent any auditing events from being logged.

Let's take a step back for a moment. When Exchange 2007 SP2 is applied in an organization, the schema is extended with the right ms-Exch-Store-Bypass-Access-Auditing. If a user or group has previously been granted Full Control within the Configuration tree, it will then take on an allow for this bypass-auditing right and thus be exempt from auditing. This is not good for compliance reasons, and the end result is that the audit logs will not contain entries for all users that can or will be accessing mailboxes.

The other problem is that there is currently no way to lock down the ability for any administrator to add a specific Allow on an object in Active Directory for this bypass right, thus excluding them from being audited.

Listed below are the *default* groups that have the Bypass Auditing right due to various permission settings:

  • Domain Admins
  • Enterprise Admins
  • Exchange Organization Administrators

Once the Schema has been extended, there are 5 places to add auditing bypass entries in the configuration container in Active Directory as shown below.

  • Exchange Organization Container
  • Exchange Administrative Group Container
  • Exchange Servers Container
  • Exchange Server object
  • Exchange Database object

When auditing is not working as expected, it would be a rather tedious process to check permissions throughout the configuration tree for these objects where the bypass extended right has been set.  I have created a powershell script (AuditBypassPerm.ps1) that should help export permissions for each of these objects to make your job finding permissions problems that much easier.

Before I go over the script, I want to describe some of the terms that you will need to know when looking through its output. Objects in Active Directory are assigned access rights using the System.DirectoryServices.ActiveDirectoryRights enumeration. These access rights control what type of permission a user or group has on a particular object. A listing of all the available access rights can be found at http://msdn.microsoft.com/en-us/library/system.directoryservices.activedirectoryrights.aspx.

The three main rights that we are concerned with in relation to this auditing are the following:

  • ms-Exch-Store-Bypass-Access-Auditing = Bypass Exchange Access Auditing in the Information Store
  • GenericAll = The right to create or delete children, delete a subtree, read and write properties, examine children and the object itself, add and remove the object from the directory, and read or write with an extended right.
  • ExtendedRight = A customized control access right. This can be used to specifically apply a particular right such as ms-Exch-Store-Bypass-Access-Auditing, or it could mean that the account has been allowed All Extended Rights as shown below. All Extended Rights means just that: all extended rights, including the "Bypass Exchange Access Auditing in the Information Store" right.

     image

So with that said, these are the three main rights that we need to concentrate on when we are trying to find a needle in the haystack. The next piece that we also need to be cognizant of is whether there is a specific deny, or whether one of the three rights is being inherited from some other object in the Configuration tree.
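If you want to spot-check a single object by hand before running the script, a minimal PowerShell sketch along these lines works; the distinguished name below is only a placeholder for one of the five objects listed earlier, and the output simply lists every ACE carrying GenericAll or ExtendedRight so you can review it manually:

# Bind to the object to check (example DN - substitute the DN of your org, server, or database object)
$obj = [ADSI]"LDAP://CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=mikelaglab,DC=com"

# List ACEs of interest; for an ExtendedRight ACE scoped to a single right,
# ObjectType holds that right's GUID (an all-zeros GUID means all extended rights)
$obj.psbase.ObjectSecurity.Access |
    Where-Object { $_.ActiveDirectoryRights -match "GenericAll|ExtendedRight" } |
    Select-Object IdentityReference, ActiveDirectoryRights, AccessControlType, IsInherited, ObjectType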

In my example, I used an account called AuditTest1 to show how one would troubleshoot something like this. I granted a deny on the bypass right at the organization level so that this user's account would be audited, but then at the database object level I granted the All Extended Rights right to this account. What this essentially did was bypass auditing for this user at the database level, so no events were logged for any mailbox access on that database.

The syntax for the script is as follows:

.\AuditBypassPerm.ps1 <MBXServername>

After running the script, you will get output similar to the following

image

Once the script has completed, Notepad will open a text file with information similar to the pictorial below. This lets you see, for each of the 5 objects where the bypass permissions can be set, what rights were assigned to this user.

image

The key takeaway here is to review which groups/accounts have the ms-Exch-Store-Bypass-Access-Auditing, GenericAll, or ExtendedRight right set and, if one is set, to determine at what level in the Configuration tree a potential override has been applied that would prevent specific accounts from being audited.

If a permission is inherited and you cannot tell from the 5 object levels what rights have been set, then you will need to open ADSIEdit.msc and walk up the tree from that object until you find the object on which the permissions were changed.

So that sounds great, but what happens when you have a user that is not listed in the tree but is still not being audited? The most common reason is that the user is a member of a group that has been granted one of these 3 rights somewhere in the tree.

Since PowerShell V1 does not have a direct way to view a user's group membership, I created another really small script to list the groups that these users are a part of. This is not in the same format as the output above, but it will help provide a general understanding of what groups the user is a part of, which you can then compare to the output listed above in your troubleshooting efforts.

The syntax for the group membership script is as follows:

.\GetUserGroups.ps1 <Username>

Note: This can be in simple format or domain\username format as shown below.

image
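The script itself is in the download linked below, but if you just need a quick look at a user's direct group membership from PowerShell V1, something along these lines works; AuditTest1 is simply the example account from earlier, and note that memberOf does not expand nested groups or include the primary group:

$searcher = New-Object System.DirectoryServices.DirectorySearcher
$searcher.Filter = "(&(objectCategory=person)(sAMAccountName=AuditTest1))"
$result = $searcher.FindOne()
# Print the distinguished name of each group the user is a direct member of
$result.Properties["memberof"] | ForEach-Object { $_ }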

These set of scripts can be downloaded here

I hope this helps untangle the inability to get auditing working for some users/groups as reviewing permissions is sometimes a very tedious task.

Another question that comes up is: how do I map the Message ID listed in the event to an actual message in the user's mailbox? A sample event is listed below with the relevant parts highlighted.

image

So from that, we can see that we have a message ID of <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> and this message was opened in the folder /Calendar.

For every message on an Exchange 2007 server, we generate a unique Message ID that is stamped on the PR_INTERNET_MESSAGE_ID MAPI property of each message. This MAPI property is also equivalent to 0X1035000E.

With MFCMAPI, you can find this message rather easily by creating a content restriction in the mailbox. To do this, you would need to create a MAPI profile for that user specifically on an administrative workstation or use a MAPI profile that has full access to all mailboxes.

IMPORTANT: While going in with an administrative account to perform these operations, if the message is touched in the mailbox while using MFCMAPI, an auditing event will be logged to the Exchange Auditing log. If you don't want to log any events while doing your investigation, it may be best to log on with an approved account that has the Bypass Auditing right so that whatever actions you take inside a user's mailbox are not audited, or use an account that is dedicated to finding what object was audited.

Once you open the mailbox in MFCMAPI, navigate to and open the folder that was listed in the auditing event as shown above; in this case, it was the Calendar folder. After the folder is opened, click Table on the menu and then select Create and Apply Content Restriction.

image

Next, we need to specify what Property Tag we are looking for which in this example is 0X1035000E. Once this property tag number is entered, you will see that the Property Name field will show what we want to filter on. Click OK when you are finished.

image

In the Property Value field, enter <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> including the < > characters as shown below. It is very important that the < > characters are included; otherwise, the restriction will not return the message. Click OK when you are done.

image

The result should be the message that you are looking for.

image

That is all for now….

Happy Auditing!!


Exchange 2007/2010 Performance Data Collection Script


In efforts to help streamline performance data collection on Exchange 2007/Exchange 2010 servers, I have created a powershell script that will automate the data collection for you. One of the nice features of the script is that you do not have to tell the script what roles are installed as it will automatically detect what is installed locally on the server, thus adding the appropriate counters for you. Previously, you had to manually select an XML file from here for Exchange 2007 servers and here for Exchange 2010 servers and then import it in to the performance console.

I’ve seen a lot of cases that use the previous Perfwiz utility, but unfortunately, this was originally designed to collect data for Exchange 2003 servers and was never updated to support the later versions of Exchange. This older version of Perfwiz should never be used to troubleshoot performance issues for versions later than Exchange 2003 as the pertinent counters are not being collected to accurately troubleshoot a performance issue.

During the development phase of this script, it was found that, starting with Windows 2003 x64, the log roll mechanism no longer worked properly and stopped once the maximum log file size was hit. Even though this worked previously on Windows 2003 x86, something changed on the 64-bit platform that prevented it from working. The problem is also inherent in Windows 2008, but was eventually resolved in Windows 2008 R2. The script works around all of these issues to help you collect the right data at the right time by doing the following:

  • If Windows 2003 x64 and –circular switch not specified, then roll log to next log file once maxsize is reached or duration time is hit, whichever one is first.
  • If Windows 2008 RTM/SP1/SP2 and –circular switch not specified, then roll log every 4 hours. If Interval is set to less than 30 seconds, then roll log every hour.

IMPORTANT: To help save on the disk space consumed to write these log files out, the *default duration* is set to 8 hours. This time duration should be enough to capture most performance cases during the day, but if longer durations are needed, then refer to the switches listed in the table below to help set the desired configuration for your needs.

Listed below are the switches that can be used with this script at the time of this posting. New switches will be added as time goes on. These switches should help allow you to collect the right data at the right time and also allows the flexibility to set the appropriate settings.

  • -help or -? : Provides help regarding the overall usage of the script
  • -circular : Turns on circular logging to save on disk space. Negates the default duration of 8 hours
  • -delete : Deletes the currently running Perfwiz data collection
  • -duration : Specifies the overall duration of the data collection. If omitted, the default value is (08:00:00) or 8 hours
  • -EseExtendedOn : Enables Extended ESE performance counters
  • -EseExtendedOff : Disables Extended ESE performance counters
  • -filepath : Sets the directory location of where the blg file will be stored
  • -full : Defines a counter set that includes all counters/instances
  • -interval : Specifies the interval time between data samples. If omitted, the default value is (00:00:30) or 30 seconds
  • -maxsize : Specifies the maximum size of the blg file in MB. If omitted, the default value is 512
  • -query : Queries configuration information of a previously created Exchange_Perfwiz Data Collector
  • -start : Starts Exchange_Perfwiz data collection
  • -stop : Stops the currently running Perfwiz data collection
  • -StoreExtendedOn : Enables Extended Store performance counters
  • -StoreExtendedOff : Disables Extended Store performance counters
  • -threads : Specifies whether threads will be added to the data collection. If omitted, thread counters will not be added to the collection
  • -webhelp : Launches web help for the script
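As a quick illustration (the script file name is assumed here to be ExPerfwiz.ps1; check the download page for the exact name), starting a collection with a 4-hour duration, a 5-second sample interval, and a custom output path would look something like this:

.\ExPerfwiz.ps1 -duration 04:00:00 -interval 00:00:05 -filepath D:\Perflogs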

For additional information, you can check out the website that includes the latest 1.3 version download at http://code.msdn.microsoft.com/ExPerfwiz.

If you have an issue with this script or have a feature suggestion, use the Discussions/Issue Tracker tabs on the Experfwiz page listed above. There are also additional examples of how to run the script with additional switches on the site.

Enjoy!!

Mike

Detecting Low Virtual Memory Conditions in Windows 2008 and R2


On Windows 2008 servers with Exchange 2007/2010 installed, there are times when you may run out of virtual memory for various reasons. One could be a memory leak in some application or simply not configuring the paging file correctly.

Once you run out of virtual memory on any given server, various applications may start failing/crashing on the server due to the inability to obtain memory to complete a specific function that is being called. In some cases, this could lead to a possible blue screen of death (BSOD).

For server-based systems, the new Reliability Infrastructure helps automatically diagnose various operating system components. As part of that infrastructure, Resource Exhaustion Detection and Resolution (RADAR) notifies you when resources are reaching critical levels. RADAR is part of the Diagnostic Policy Service that is installed on each server.

When RADAR detects that memory has reached a critical state, a 2004 event will be logged to the system log. An example of one of these events is shown below. As you can see, it has various information that provides overall memory consumption for various system resources, the top processes for memory consumption, file version information and paged/nonpaged pool memory that includes the top tags that could attribute to the memory problem. The bolded parts are the area of interest.

Log Name:      System
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      2004
Task Category: Resource Exhaustion Diagnosis Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: store.exe (7580) consumed 11282399232 bytes, MSExchangeMailboxAssistants.exe (21200) consumed 590950400 bytes, and w3wp.exe (21092) consumed 562757632 bytes.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
<System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>2004</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>3</Task>
    <Opcode>33</Opcode>
    <Keywords>0x8000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:47:01.431311400Z" />
    <EventRecordID>169289</EventRecordID>
    <Correlation ActivityID="{AC93AF3C-02AE-433D-8C22-FA32493FAD8C}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>System</Channel>
    <Computer>Exserver01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <MemoryExhaustionInfo xmlns:auto-ns2="
http://schemas.microsoft.com/win/2004/08/events" xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
      <SystemInfo>
        <SystemCommitLimit>21261021184</SystemCommitLimit>
        <SystemCommitCharge>20993597440</SystemCommitCharge>
        <ProcessCommitCharge>19448094720</ProcessCommitCharge>
        <PagedPoolUsage>453672960</PagedPoolUsage>
        <PhysicalMemorySize>17176764416</PhysicalMemorySize>
        <PhysicalMemoryUsage>17025470464</PhysicalMemoryUsage>
        <NonPagedPoolUsage>422363136</NonPagedPoolUsage>
        <Processes>133</Processes>
      </SystemInfo>
    
<ProcessInfo>
        <Process_1>
          <Name>store.exe</Name>
          <ID>7580</ID>
          <CreationTime>2010-09-02T11:21:32.755807700Z</CreationTime>
          <CommitCharge>11282399232</CommitCharge>
          <HandleCount>5619</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1089</TypeInfo>
        </Process_1>
        <Process_2>
          <Name>MSExchangeMailboxAssistants.exe</Name>
          <ID>21200</ID>
          <CreationTime>2010-08-28T06:50:53.878440200Z</CreationTime>
          <CommitCharge>590950400</CommitCharge>
          <HandleCount>2664</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1090</TypeInfo>
        </Process_2>
        <Process_3>
          <Name>w3wp.exe</Name>
          <ID>21092</ID>
          <CreationTime>2010-08-31T08:25:12.245594900Z</CreationTime>
          <CommitCharge>562757632</CommitCharge>
          <HandleCount>2817</HandleCount>
          <Version>7.0.6002.18005</Version>
          <TypeInfo>67</TypeInfo>
        </Process_3>
        <Process_4>
          <Name>powershell.exe</Name>
          <ID>19692</ID>
          <CreationTime>2010-09-03T09:12:48.188589800Z</CreationTime>
          <CommitCharge>152682496</CommitCharge>
          <HandleCount>629</HandleCount>
          <Version>6.0.6002.18111</Version>
          <TypeInfo>136</TypeInfo>
        </Process_4>
        <Process_5>
          <Name>mmc.exe</Name>
          <ID>18768</ID>
          <CreationTime>2010-09-03T09:12:42.167067000Z</CreationTime>
          <CommitCharge>107646976</CommitCharge>
          <HandleCount>464</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>144</TypeInfo>
        </Process_5>
        <Process_6>
          <Name>explorer.exe</Name>
          <ID>13396</ID>
          <CreationTime>2010-09-03T09:12:24.929288000Z</CreationTime>
          <CommitCharge>22032384</CommitCharge>
          <HandleCount>451</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>152</TypeInfo>
        </Process_6>
      </ProcessInfo>
      <PagedPoolInfo>
        <Tag_1>
          <Name>MmSt</Name>
          <PoolUsed>216638928</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>CM31</Name>
          <PoolUsed>103596032</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>MmRe</Name>
          <PoolUsed>15907504</PoolUsed>
        </Tag_3>
      </PagedPoolInfo>
      <NonPagedPoolInfo>
        <Tag_1>
          <Name>SmMs</Name>
          <PoolUsed>161243168</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>BCM0</Name>
          <PoolUsed>40694064</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>Cont</Name>
          <PoolUsed>35498720</PoolUsed>
        </Tag_3>
      </NonPagedPoolInfo>
      <ExhaustionEventInfo>
        <Time>2010-09-03T10:47:18.540433800Z</Time>
      </ExhaustionEventInfo>
    </MemoryExhaustionInfo>
  </UserData>
</Event>

This helps you determine what resource was the possible offender without having to install any additional tools on the server to troubleshoot this. The best part is that you don’t have to wait for an additional event to occur as the information has already been collected and logged.

There is another place where events are logged which is under the Windows Resource Exhaustion Detector (Resource-Exhaustion-Detector) under Applications and Services Logs in the Event Viewer as shown below.

image

These events show much less information than the system event, but do show your commit limits and charges to the system too. Sample below.

Log Name:      Microsoft-Windows-Resource-Exhaustion-Detector/Operational
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      1003
Task Category: Resource Exhaustion Detection Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Computer:      ExServer01.Domain.Com
Description:
The Windows Resource Exhaustion Detector received a notification that the computer is low on virtual memory.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>1003</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>2</Task>
    <Opcode>22</Opcode>
    <Keywords>0x4000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:52:01.431065200Z" />
    <EventRecordID>180</EventRecordID>
    <Correlation ActivityID="{0B95CAB5-E004-4C92-BF5D-3BFA39FDF7EE}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>Microsoft-Windows-Resource-Exhaustion-Detector/Operational</Channel>
    <Computer>ExServer01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <CommitLimitExhaustion xmlns:auto-ns2="
http://schemas.microsoft.com/win/2004/08/events"xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
  
   <SystemCommitLimit>21261021184</SystemCommitLimit>
      <SystemCommitCharge>21258543104</SystemCommitCharge>

    </CommitLimitExhaustion>
  </UserData>
</Event>
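If you would rather pull these events from the command line than browse Event Viewer, the built-in wevtutil utility can query both channels; the examples below return the five most recent entries in text form:

wevtutil qe "Microsoft-Windows-Resource-Exhaustion-Detector/Operational" /c:5 /rd:true /f:text
wevtutil qe System /q:"*[System[(EventID=2004)]]" /c:5 /rd:true /f:text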

A couple of potential events that can be seen when memory resources are low are shown below.

  • MSExchangeRepl Service failing to read a log file for database copy due to an out of memory error condition.

    Log Name:      Application
    Source:        MSExchangeRepl
    Event ID:      2168
    Task Category: Service
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      ExServer01.Domain.Com
    Description:
    Log file F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log’ for database copy EXServer MBX Store 001\ExServer01′ couldn’t be read. Error: Out of Memory (-1011)
    Event Xml:
    <Event xmlns="
    http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="MSExchangeRepl" />
        <EventID Qualifiers="49156">2168</EventID>
        <Level>2</Level>
        <Task>1</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-02T11:21:13.000000000Z" />
        <EventRecordID>3190563</EventRecordID>
        <Channel>Application</Channel>
        <Computer>Exserver01.domain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data>F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log</Data>
        <Data>EXServer MBX Store 001\ExServer01′ </Data>
        <Data>Out of Memory (-1011)</Data>
      </EventData>
    </Event>

  • A Registry flush operation failing to write the SOFTWARE hive to disk

    Log Name:      System
    Source:        Microsoft-Windows-Kernel-General
    Event ID:      6
    Task Category: None
    Level:         Error
    Keywords:     
    User:          SYSTEM
    Computer:      ExServer01.domain.com
    Description:
    An I/O operation initiated by the Registry failed unrecoverably.The Registry could not flush hive (file): ‘\SystemRoot\System32\Config\SOFTWARE’.
    Event Xml:
    <Event xmlns="
    http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Kernel-General" Guid="{a68ca8b7-004f-d7b6-a698-07e2de0f1f5d}" />
        <EventID>6</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-03T10:48:17.714333400Z" />
        <EventRecordID>169290</EventRecordID>
        <Correlation />
        <Execution ProcessID="4" ThreadID="92" />
        <Channel>System</Channel>
        <Computer>ExServer01.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="FinalStatus">0xc000014d</Data>
        <Data Name="ExtraStringLength">36</Data>
        <Data Name="ExtraString">\SystemRoot\System32\Config\SOFTWARE</Data>
      </EventData>
    </Event>

The component used to instantiate a specific function determines which source logs the event in the system log. Finding root cause for memory issues has become significantly easier with this new Reliability Infrastructure, and I hope this blog helps show you some of the methods for troubleshooting these types of issues.

Until next time!!!

How to unload/reload performance counters on Exchange 2010


 

Previously I wrote a set of steps for fixing/repairing broken Exchange 2007 counters at http://blogs.technet.com/b/mikelag/archive/2009/10/21/how-to-fix-repair-broken-exchange-2007-counters.aspx. That set of steps has been replaced with an easier way of loading the counters using PowerShell for Exchange 2010. This new method applies mainly to managed code services in Exchange 2010, as some of the previous lodctr/unlodctr commands still work for services such as MSExchangeIS and ESE.

In this post, I am going to use the same missing ESE counters as an example.

The important values that we are interested in are the following. These values must exist prior to loading the ESE performance counters.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
"Close"="ClosePerformanceData"
"Collect"="CollectPerformanceData"
"Library"="C:\\Program Files\\Microsoft\\Exchange Server\\V14\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
"Open"="OpenPerformanceData"
"PerfIniFile"="eseperf.ini"

Here is a screenshot of what it will look like:

clip_image002

Again, verify whether the value Disable Performance Counters is set in the registry and, if so, whether it is set to a value other than 0. If it is set to 4, those counters had a problem at one point and were disabled; change it to 0 and then reopen Performance Monitor to see if the counters can be seen now. If that works, life is good and you can either stop reading here or continue on to reloading the counters if they are still not available.
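A quick way to check that value without opening the Registry Editor (the key path below is the ESE key discussed here) is:

reg query "HKLM\SYSTEM\CurrentControlSet\Services\ESE\Performance" /v "Disable Performance Counters"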

If the values First Counter, First Help, Last Counter and Last Help are listed, then it is recommended to unload the performance counters first before attempting to reload them again.

To unload performance counters for ESE, perform the following steps:

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type remove-perfcounters –definitionfilename “C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml” and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server

To reload performance counters for ESE, perform the following steps

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type new-perfcounters –definitionfilename “C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml” and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server
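Putting the unload and reload steps together, the full sequence from the Exchange Management Shell looks like this; adjust the eseperf.xml path to match your installation:

Add-PSSnapin Microsoft.Exchange.Management.PowerShell.Setup
Remove-PerfCounters -DefinitionFileName "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml"
New-PerfCounters -DefinitionFileName "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml"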

This will reload the appropriate performance counters for ESE. Once that is done, check the application event log to ensure that they were loaded OK and no Perflib errors exist. Open Performance monitor again to see if the counters are available.

If you are looking to reload any of the Exchange counters, you can substitute the appropriate XML in the command with any of the available XMLs from the \Program Files\Microsoft\Exchange Server\V14\Setup\Perf directory.

Here is a screenshot of some of the available performance counters

image

I hope this helps in your performance counter monitoring in Exchange 2010.

Processor 0 increased CPU utilization


While looking at an Exchange 2010 server recently in Task Manager to review CPU utilization, I noticed that Processor 0 was at 100% CPU while all of the other CPUs were comparatively low. This type of behavior is caused by the Receive Side Scaling (RSS) feature not being enabled on the server. RSS is a feature that was first implemented back in Windows 2003 with the Scalable Networking Pack and allows incoming network traffic to be spread across multiple CPU cores. If RSS is not enabled, only *one* CPU is used to process incoming network traffic, which can create a networking bottleneck on the server. Additional information on RSS can be found here.

Here is what it looks like in Task Manager on the Performance tab.

clip_image002

As you can see, the first processor is pegged at 100% CPU which is indicative of RSS not being enabled. Generally on new installations of Windows 2008 or greater, this feature is enabled by default, but in this case, it was disabled.

Prior to enabling RSS on any given machine, there are a few dependencies that are necessary for RSS to work properly and are listed below.

  • Install the latest network card driver and associated Network Configuration Utility. The network card driver update is very important as older versions had known bugs that would cause RSS to fail.
  • Offloading features of the network card must be enabled (ie.IPv4 Checksum offload,  TCP/UDP Checksum Offload for IPv4/IPv6)
  • Receive Side Scaling must be enabled on the network card properties
  • Receive Side Scaling Queues and Max number of RSS Processors must be set to the maximum value listed in the network card properties. This is typically the number of CPU cores that are installed on the server. Hyperthreading does not count towards the maximum number of CPU cores that can be leveraged here; the use of hyperthreading is generally not recommended on Exchange servers anyway and is referenced here

    Note: If Receive Side Scaling Queues and Max number of RSS Processors are not changed to a value above 1, then enabling RSS does not provide any benefits since you will only be using a single core to process incoming network traffic.

  • RSS must be enabled at the OS layer by running  netsh int tcp set global rss=enabled . Use netsh int tcp show global to confirm that the setting was enabled properly.
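For reference, the two commands from that last bullet, run from an elevated command prompt, are:

netsh int tcp set global rss=enabled
netsh int tcp show global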

After enabling RSS, you can clearly see below the difference in processor utilization on the server, as the CPU utilization for Processor 0 is now fairly close to the other processors right around 3:00AM.

image

Many people have disabled the Scalable Networking Pack features across the board due to the various issues that were caused by the TCP Chimney feature back in Windows 2003. All of those problems have now been fixed in the latest patches and latest network card drivers, so enabling this feature will help increase networking throughput almost two fold. The more features that you offload to the network card, the less CPU you will use overall. This allows for greater scalability of your servers.

You will also want to monitor the amount of deferred procedure calls (DPC) that are created since there is additional overhead for distributing this load amongst multiple processors. With the latest hardware and drivers available, this overhead should be negligible.

In Windows 2008 R2 versions of the operating system, there are new performance counters to help track RSS/Offloading/DPC/NDIS traffic to different processors as shown below.

Object: Per Processor Network Activity Cycles(*)
Performance Counters:

Stack Send Complete Cycles/sec
Miniport RSS Indirection Table Change Cycles
Build Scatter Gather Cycles/sec
NDIS Send Complete Cycles/sec
Miniport Send Cycles/sec
NDIS Send Cycles/sec
Miniport Return Packet Cycles/sec
NDIS Return Packet Cycles/sec
Stack Receive Indication Cycles/sec
NDIS Receive Indication Cycles/sec
Interrupt Cycles/sec
Interrupt DPC Cycles/sec

Object: Per Processor Network Interface Card Activity(*)
Performance Counters:

Tcp Offload Send bytes/sec
Tcp Offload Receive bytes/sec
Tcp Offload Send Request Calls/sec
Tcp Offload Receive Indications/sec
Low Resource Received Packets/sec
Low Resource Receive Indications/sec
RSS Indirection Table Change Calls/sec
Build Scatter Gather List Calls/sec
Sent Complete Packets/sec
Sent Packets/sec
Send Complete Calls/sec
Send Request Calls/sec
Returned Packets/sec
Received Packets/sec
Return Packet Calls/sec
Receive Indications/sec
Interrupts/sec
DPCs Queued/sec

I hope this helps you understand why you might be seeing this type of CPU usage behavior.

Until next time!!

Mike

How fragmentation on incorrectly formatted NTFS volumes affects Exchange


Recently we have been seeing some gnarly performance issues in Exchange 2007 along with an added splash of database operation failures. This doesn’t sound enticing at all, but this blog post is going to discuss what these issues are and how to resolve them. This post is targeted mainly for Exchange 2007, but you can also apply the same methodology  to Exchange 2010 as this is where the original problem was seen.

Before going in to this, here is a highlight of some of the issues that you may see:

  • Databases failing with an Out of Memory condition
  • Extremely slow log replay times on CCR/SCR replica copies (High replay queue lengths)
  • High amount of split I/O’s occurring on any given LUN/Volume.
  • Slowly rising RPC requests until the Information Store service goes unresponsive

Examples

Here are some examples of the out of memory condition that would be written to the application log on the affected Exchange server.

Event Type: Error
Event Source: MSExchangeIS
Event Category: None
Event ID: 1160
Description:
Database resource failure error Out of memory occurred in function JTAB_BASE::EcUpdate while accessing the database "CCRName\SGName".

Windows 2003 based error
Event Type: Error
Event Source: ESE
Event Category: General
Event ID: 482
Description:
MSExchangeIS (9228) DBName: An attempt to write to the file "F:\Data\DBName.edb" at offset 530157682688 (0x0000007b6fdc4000) for 8192 (0x00002000) bytes failed after 0 seconds with system error 1450 (0x000005aa): "Insufficient system resources exist to complete the requested service. ".  The write operation will fail with error -1011 (0xfffffc0d).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

Windows 2008 based error
Log Name:      Application
Source:        ESE
Event ID:      482
Task Category: General
Level:         Error
Description:
Information Store (8580) DBNAme: An attempt to write to the file "F:\Data\DBName.EDB" at offset 315530739712 (0x0000004977190000) for 32768 (0x00008000) bytes failed after 0 seconds with system error 665 (0x00000299): "The requested operation could not be completed due to a file system limitation ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

So just what is this "Insufficient system resources exist to complete the requested service" error? The explanation will come later…

Here is an example of very high split I/O operations (purple line) leading up to high RPC requests (green line) until the server went unresponsive. In the case below, we were trying to extend the size of the database and couldn’t because of the underlying cause, which I will explain shortly.

[Image: Performance Monitor graph showing split I/Os (purple) spiking ahead of rising RPC requests (green) until the server became unresponsive]

Another clear sign that you might be running into this problem is when all I/O requests for that particular database instance go to zero while RPC requests continue to climb and version buckets plateau.

[Image: Performance Monitor graph showing database I/O dropping to zero while RPC requests climb and version buckets plateau]
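
If you want to watch for these symptoms yourself, the counters behind the graphs above can be sampled from the command line. A minimal sketch follows; the LogicalDisk and MSExchangeIS counter paths are standard, while the ESE version bucket object name can differ slightly between Exchange versions, so verify the exact paths in Performance Monitor first.

REM Split I/O against the database volumes alongside RPC requests in the store
typeperf "\LogicalDisk(*)\Split IO/Sec" "\MSExchangeIS\RPC Requests" -si 15 -sc 40

REM Version buckets are tracked per ESE database instance (Exchange 2007/2010)
typeperf "\MSExchange Database ==> Instances(*)\Version buckets allocated" -si 15 -sc 40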

This particular problem is not an obvious one and requires a few levels of explanation of what is going on, plus a little bit of terminology to get you going. At the lowest layer, an Exchange database resides on an NTFS partition which is set up when the server is first configured. This initial setup has some specific guidelines around how to properly partition and format the volumes, referenced in http://technet.microsoft.com/en-us/library/bb738145(EXCHG.80).aspx for Exchange 2007 and http://technet.microsoft.com/en-us/library/ee832792.aspx for Exchange 2010. The two most important factors are proper partition alignment and the NTFS allocation unit size.

Below is a table of recommendations for use with Exchange.

  • Storage track boundary: 64 KB or greater (1 MB recommended)
  • NTFS allocation unit/cluster size: 64 KB (database and log drives)
  • RAID stripe size: 256 KB or greater. Check with your storage vendor for best practices.
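
If you want to check how an existing server lines up with these values, the partition starting offsets can be read with wmic. This is a rough sketch only; the offsets need to be interpreted against your array's actual stripe unit size.

REM A StartingOffset that divides evenly by the stripe unit size (for example 1048576 bytes for 1 MB)
REM indicates an aligned partition; 32256 bytes was the common default on Windows 2003 and earlier
wmic partition get BlockSize, StartingOffset, Name, Index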

NTFS allocation unit size

Before we go into this area, we need to take a step back and look at how NTFS operates. This is where you need to do a little homework by reading up on what a File Attribute List (ATTRIBUTE_LIST) is and how NTFS actually stores files on disk.

Now that we have gone over the basic concept of what a File Attribute List (ATTRIBUTE_LIST) is and how files are actually stored on disk, we can continue on with why this is so important here. Let’s say that we have a disk formatted with a file allocation unit size of 4 KB (4096 bytes), which is the default in Windows 2003 for any partition greater than 2 GB in size. With Exchange 2007’s ESE page size of 8 KB, every page requires two clusters to be written, and those clusters may or may not be contiguous; data can end up spread across various sections of the disk, and this is where fragmentation begins for larger files. As the database file grows and its extents can no longer be tracked inside the base MFT record, the File Attribute List (FAL) grows to keep track of the additional attribute records needed for all of those fragments.

NTFS does have its limits on the overall size of this attribute list per file; problems typically start at roughly 1.5 million fragments. That is not an absolute maximum, but it is around the point where trouble begins. The FAL size never shrinks and will keep growing over time. The maximum supported size of the ATTRIBUTE_LIST is 256 KB (262,144 bytes). If you were to reach this upper limit, you could no longer expand the size of your database, and along the way we would be performing many more small I/O operations and a lot more seeking around the drive to find the data we are looking for. This is where the “out of memory” error comes from, along with the “Insufficient system resources exist to complete the requested service” error. File management APIs start failing with ERROR_FILE_SYSTEM_LIMITATION on Windows 2008 or later, and ERROR_INSUFFICIENT_RESOURCES on earlier Windows versions, once the absolute maximum has been reached. The out of memory error is a much higher level error that bubbled up because NTFS could no longer increase the size of the FAL. This is why it is not an obvious error; it was ultimately found by Eric Norberg troubleshooting over many tireless nights and through long debugging sessions by EE extraordinaire Dave Goldman. :)

This fragmentation issue is actually referenced in the following article:

A heavily fragmented file in an NTFS volume may not grow beyond a certain size
http://support.microsoft.com/kb/967351

This scenario is seen more on servers with smaller NTFS cluster sizes such as 4 KB, large databases that are two times the recommended 200 GB maximum, and low available disk space. The combination of those three variables can get you into a very bad situation.

NTFS cluster sizes can be obtained by running the fsutil command as shown below for any given partition:

[Image: fsutil fsinfo ntfsinfo output showing the Bytes Per Cluster value for the volume]
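
For reference, the command behind that screenshot is fsutil fsinfo ntfsinfo run against the volume in question (F: here is just an example); the Bytes Per Cluster line in its output is the NTFS allocation unit size, and 65536 (64 KB) is the value recommended above.

C:\>fsutil fsinfo ntfsinfo F: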

In Exchange 2007, you can check whether you are running into this issue by downloading and running Contig.exe from Sysinternals at http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx

C:\>Contig.exe -a f:\data\DBName.edb

Contig v1.55 – Makes files contiguous
Copyright (C) 1998-2007 Mark Russinovich
Sysinternals – www.sysinternals.com

f:\data\DBName.edb is in 1.46698e+006 fragments

Summary:
     Number of files processed   : 1
     Average fragmentation       : 1.46698e+006 frags/file

In the above example, we are extremely close to the approximate 1.5 million fragment maximum for any given file. This particular database will eventually become problematic; it is a ticking time bomb.

For Exchange 2010 SP1, you can dump the same type of information that contig.exe provides by using eseutil.exe, as shown below.

C:\>eseutil /ms f:\data\DBName.edb

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.01
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…
Error: Access to source database ‘f:\data\DBName.edb’ failed with Jet error -1032.

File Information:
  File Name: f:\data\DBName.edb
  Volume Name: Drive2
  File System: NTFS
  Cluster Size: 4096 bytes
  Attribute List Size: 180 KB
  Extents Enumerated: 1157172

Operation terminated with error -1032 (JET_errFileAccessDenied, Cannot access file, the file is locked or in use) after 0.78 seconds.

Even though the command errors out because the database is online, we are still able to obtain similar data. Run locally on the server, eseutil lets you look at the actual FAL size, the NTFS cluster size, and how many extents have been created for the file due to excessive fragmentation. From the output above we can deduce that the NTFS cluster size is 4 KB, the FAL size is 180 KB, and the extents enumerated exceed 1.1 million fragments. A general rule of thumb is to keep the FAL size below 150 KB and to maintain sufficient available disk space.

This fragmentation is also seen on CCR/SCR replica copies as the log files are shipped and then replayed into the database. The end result is that log replay slows to a crawl and you can see very high replay queue lengths due to the excessive split I/Os occurring. Even the fastest disks cannot compensate for improperly configured NTFS cluster sizes and disk alignment; you must fix the root of the problem to successfully resolve this issue.

So how do you mitigate this? Well, there are various ways to do this…

  1. If you determine that only a single database is affected by this issue, the quickest mitigation method to get you back in business is the following:
    1. Dismount the database
    2. Make a copy of the database to another drive with sufficient space (see the sketch after this list). IMPORTANT: This cannot be the same drive, because the file needs to be written out contiguously to a different volume. The mere act of copying the file defragments it for you.
    3. Delete the original copy of the database file
    4. Copy the database back to the original location
    5. Note that using this method does not resolve the issue long term if the NTFS cluster sizes are too small; it is only meant as a stopgap to buy you some time to resolve the issue long term.
  2. If on a CCR/SCR cluster, you have some options to fix this longer term.
    1. To resolve the NTFS cluster sizes on the non-active node or SCR target for any particular volume such as F:, use the following command to format the disk with a 64KB block size which is the recommended value for optimal performance.

      Format F: /q /y /fs:ntfs  /v:VolumeName /a:64K

      NOTE:
      This command wipes out any files that currently resides on the F: drive, so make sure that no other files or applications reside on this drive other than the database and log files. I would hope that you are dedicating these drives exclusively to Exchange and not sharing with any other applications. Exclusivity is what makes recovering from this much easier. 

    2. Verify that the disk was formatted properly by checking the Bytes Per Cluster value for the volume (see the fsutil sketch after this list):

      [Image: fsutil fsinfo ntfsinfo output showing a Bytes Per Cluster value of 65536 (64 KB)]

    3. Once the disk has been reformatted, go ahead and reseed the databases that previously existed on the drive.
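
To tie the options above together, here is a hedged command-line sketch. The drive letters and the G:\Staging path are made up for illustration, and the copy/delete sequence assumes the affected database has already been cleanly dismounted.

REM --- Option 1: single affected database (stopgap only) ---
REM Copy the dismounted database to a different physical volume; the new copy is written contiguously
copy F:\Data\DBName.edb G:\Staging\DBName.edb

REM Remove the fragmented original, then copy the defragmented file back into place
del F:\Data\DBName.edb
copy G:\Staging\DBName.edb F:\Data\DBName.edb

REM --- Option 2: after reformatting the passive node's volume with /a:64K ---
REM "Bytes Per Cluster" in the output should now report 65536
fsutil fsinfo ntfsinfo F: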

You may ask yourself: if the file is so fragmented, why can I not simply do an offline defrag of the file? The answer is that defragmenting the file itself is very likely to bloat the FAL, because moving the fragments around causes the attribute list to grow. This is the primary reason why Exchange does not recommend running defrag on volumes that host database files. The only way to remove the attribute list for the file is to copy the file off to another drive, delete the original copy, and then copy the file back to the original location. When this is done, the file is written to the disk contiguously, leaving literally no fragments in the file. Life is good once again.

Once you have resolved these underlying issues, overall Exchange performance should be that much better and you can sleep better at night knowing you have increased throughput on your Exchange servers.

Note that it is still not recommended to run disk defragmentation software on Exchange server volumes, but there are times when file-level fragmentation can cause significant performance problems on a server merely by the way data is written to the disk. If optimal and/or recommended settings are not used when creating the volumes, this file fragmentation issue can occur much more quickly. The majority of Exchange files are in use, so running regular disk defragmentation programs on the server will not help with this situation. If necessary, the only way to resolve this is to take all Exchange resources offline to ensure none of the files are in use and then defragment the disk to make the files contiguous once again.

In Exchange 2010 SP1 or later, logic was added to detect when the FAL is close to exhaustion (80% of the maximum) and to log an event accordingly; there is no NTFS event for this behavior. The following event is an example of what would be logged for a problematic database during online maintenance.

Log Name: Application
Source: ESE
Event ID: 739
Task Category: General
Level: Error
Description:
Information Store (5652) EXSERVER MBX Store 001: The NTFS file attributes size for database ‘C:\DB\DB001\PRIV001.EDB’ is 243136 bytes, which exceeds the threshold of 204800 bytes. The database file must be reseeded or restored from a copy or backup to prevent the database file from being unable to grow because of a file system limitation.

Update (3/8/2011): Exchange 2007 SP3 RU3 now has a fix that is referenced in http://support.microsoft.com/kb/2498066 that will increase the default extent size from 8MB to 64MB similar to that of Exchange 2010. Increasing the extent size helps reduce the amount of fragments that will be created for any given database. The 739 event has also been added so that monitoring software can alert on potential problems.

Reasonable volume sizes and database sizes go a long way toward protecting yourself from fragmentation (the more competing files being extended or created on a volume, the greater the fragmentation of those files will be).

Recommendations:

  • Keep your volume sizes at or below 2 TB (which is why MBR partitions are recommended for Exchange 2007). Exchange 2010 can use GPT volumes greater than 2 TB, but the recommendation is to keep database sizes under 2 TB.
  • Limit the number of databases hosted per volume. Ten per volume is the absolute maximum we would recommend; five per volume is much better.
  • Do not place write-intensive non-Exchange workloads on the same volume as an Exchange database.

I hope this sheds some light on why certain failures on Exchange servers could prevent you from doing various operations.

Thanks go to Matt Gossage, Tim McMichael, Bryan Matthew, Neal Christiansen and Luke Ibsen for reviewing this blog entry before posting

Mike
