
Exchange 2007/2010 Performance Data Collection Script


In an effort to help streamline performance data collection on Exchange 2007/Exchange 2010 servers, I have created a PowerShell script that automates the data collection for you. One of the nice features of the script is that you do not have to tell it which roles are installed; it automatically detects what is installed locally on the server and adds the appropriate counters for you. Previously, you had to manually select an XML file from here for Exchange 2007 servers and here for Exchange 2010 servers and then import it into the performance console.

I’ve seen a lot of cases that use the previous Perfwiz utility, but unfortunately it was originally designed to collect data for Exchange 2003 servers and was never updated to support later versions of Exchange. That older version of Perfwiz should never be used to troubleshoot performance issues on versions later than Exchange 2003, as it does not collect the counters needed to accurately troubleshoot a performance issue.

During the development of this script, it was found that starting with Windows 2003 x64, the log roll mechanism no longer worked properly and stopped once the maximum log file size was hit. Even though this worked previously on Windows 2003 x86, something changed on the 64-bit platform that prevented it from working. The problem is also present in Windows 2008, but was eventually resolved in Windows 2008 R2. The script works around all of these issues to help you collect the right data at the right time by doing the following:

  • If Windows 2003 x64 and the -circular switch is not specified, roll the log to the next log file once maxsize is reached or the duration time is hit, whichever comes first.
  • If Windows 2008 RTM/SP1/SP2 and the -circular switch is not specified, roll the log every 4 hours. If the interval is set to less than 30 seconds, roll the log every hour.

IMPORTANT: To help save on the disk space consumed to write these log files out, the *default duration* is set to 8 hours. This time duration should be enough to capture most performance cases during the day, but if longer durations are needed, then refer to the switches listed in the table below to help set the desired configuration for your needs.

Listed below are the switches that can be used with this script at the time of this posting; new switches will be added as time goes on. These switches should allow you to collect the right data at the right time and give you the flexibility to set the appropriate configuration.

Parameter          Description
-help or -?        Provides help regarding the overall usage of the script
-circular          Turns on circular logging to save on disk space. Negates default duration of 8 hours
-delete            Deletes the currently running Perfwiz data collection
-duration          Specifies the overall duration of the data collection. If omitted, the default value is (08:00:00) or 8 hours
-EseExtendedOn     Enables Extended ESE performance counters
-EseExtendedOff    Disables Extended ESE performance counters
-filepath          Sets the directory location of where the blg file will be stored
-full              Defines a counter set that includes all Counters/instances
-interval          Specifies the interval time between data samples. If omitted, the default value is (00:00:30) or 30 seconds
-maxsize           Specifies the maximum size of blg file in MB. If omitted, the default value is 512
-query             Queries configuration information of previously created Exchange_Perfwiz Data Collector
-start             Starts Exchange_Perfwiz data collection
-stop              Stops the currently running Perfwiz data collection
-StoreExtendedOn   Enables Extended Store performance counters
-StoreExtendedOff  Disables Extended Store performance counters
-threads           Specifies whether threads will be added to the data collection. If omitted, threads counters will not be added to the collection
-webhelp           Launches web help for script

For additional information, you can check out the website that includes the latest 1.3 version download at http://code.msdn.microsoft.com/ExPerfwiz.

If you have an issue with this script or a feature suggestion, use the Discussions/Issue Tracker tabs on the Experfwiz page listed above. There are also more examples on the site showing how to run the script with different switches.
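
As an illustration, assuming the script has been saved locally as ExPerfwiz.ps1 (the file name, path and values below are examples only; see the site above for authoritative usage), a typical sequence run from PowerShell on the Exchange server might look like this:

# Create and start a collection that runs for 12 hours, samples every 5 seconds,
# and writes the blg file to D:\Perflogs (all switches are described in the table above)
.\ExPerfwiz.ps1 -duration 12:00:00 -interval 00:00:05 -filepath D:\Perflogs

.\ExPerfwiz.ps1 -query     # confirm the configuration of the Exchange_Perfwiz data collector
.\ExPerfwiz.ps1 -stop      # stop the currently running collection
.\ExPerfwiz.ps1 -delete    # remove the data collector once the blg files have been copied off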

Enjoy!!

Mike


Detecting Low Virtual Memory Conditions in Windows 2008 and R2


On Windows 2008 servers with Exchange 2007/2010 installed, there are times when you may run out of virtual memory for various reasons. One could be a memory leak in some application or simply not configuring the paging file correctly.

Once you run out of virtual memory on any given server, various applications may start failing/crashing on the server due to the inability to obtain memory to complete a specific function that is being called. In some cases, this could lead to a possible blue screen of death (BSOD).

For server-based systems, the new Reliability Infrastructure helps automatically diagnose various operating system components. As part of that infrastructure, Resource Exhaustion Detection and Resolution (RADAR) notifies you when resources are reaching critical levels. RADAR is part of the Diagnostic Policy Service that is installed on each server.

When RADAR detects that memory has reached a critical state, a 2004 event is logged to the System log. An example of one of these events is shown below. As you can see, it contains a wealth of information: overall memory consumption for various system resources, the top processes by memory consumption, file version information, and paged/nonpaged pool usage including the top tags that could be contributing to the memory problem. The bolded parts are the areas of interest.

Log Name:      System
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      2004
Task Category: Resource Exhaustion Diagnosis Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: store.exe (7580) consumed 11282399232 bytes, MSExchangeMailboxAssistants.exe (21200) consumed 590950400 bytes, and w3wp.exe (21092) consumed 562757632 bytes.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>2004</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>3</Task>
    <Opcode>33</Opcode>
    <Keywords>0x8000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:47:01.431311400Z" />
    <EventRecordID>169289</EventRecordID>
    <Correlation ActivityID="{AC93AF3C-02AE-433D-8C22-FA32493FAD8C}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>System</Channel>
    <Computer>Exserver01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <MemoryExhaustionInfo xmlns:auto-ns2="http://schemas.microsoft.com/win/2004/08/events" xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
      <SystemInfo>
        <SystemCommitLimit>21261021184</SystemCommitLimit>
        <SystemCommitCharge>20993597440</SystemCommitCharge>
        <ProcessCommitCharge>19448094720</ProcessCommitCharge>
        <PagedPoolUsage>453672960</PagedPoolUsage>
        <PhysicalMemorySize>17176764416</PhysicalMemorySize>
        <PhysicalMemoryUsage>17025470464</PhysicalMemoryUsage>
        <NonPagedPoolUsage>422363136</NonPagedPoolUsage>
        <Processes>133</Processes>
      </SystemInfo>
    
<ProcessInfo>
        <Process_1>
          <Name>store.exe</Name>
          <ID>7580</ID>
          <CreationTime>2010-09-02T11:21:32.755807700Z</CreationTime>
          <CommitCharge>11282399232</CommitCharge>
          <HandleCount>5619</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1089</TypeInfo>
        </Process_1>
        <Process_2>
          <Name>MSExchangeMailboxAssistants.exe</Name>
          <ID>21200</ID>
          <CreationTime>2010-08-28T06:50:53.878440200Z</CreationTime>
          <CommitCharge>590950400</CommitCharge>
          <HandleCount>2664</HandleCount>
          <Version>14.1.218.10</Version>
          <TypeInfo>1090</TypeInfo>
        </Process_2>
        <Process_3>
          <Name>w3wp.exe</Name>
          <ID>21092</ID>
          <CreationTime>2010-08-31T08:25:12.245594900Z</CreationTime>
          <CommitCharge>562757632</CommitCharge>
          <HandleCount>2817</HandleCount>
          <Version>7.0.6002.18005</Version>
          <TypeInfo>67</TypeInfo>
        </Process_3>
        <Process_4>
          <Name>powershell.exe</Name>
          <ID>19692</ID>
          <CreationTime>2010-09-03T09:12:48.188589800Z</CreationTime>
          <CommitCharge>152682496</CommitCharge>
          <HandleCount>629</HandleCount>
          <Version>6.0.6002.18111</Version>
          <TypeInfo>136</TypeInfo>
        </Process_4>
        <Process_5>
          <Name>mmc.exe</Name>
          <ID>18768</ID>
          <CreationTime>2010-09-03T09:12:42.167067000Z</CreationTime>
          <CommitCharge>107646976</CommitCharge>
          <HandleCount>464</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>144</TypeInfo>
        </Process_5>
        <Process_6>
          <Name>explorer.exe</Name>
          <ID>13396</ID>
          <CreationTime>2010-09-03T09:12:24.929288000Z</CreationTime>
          <CommitCharge>22032384</CommitCharge>
          <HandleCount>451</HandleCount>
          <Version>6.0.6002.18005</Version>
          <TypeInfo>152</TypeInfo>
        </Process_6>
      </ProcessInfo>
      <PagedPoolInfo>
        <Tag_1>
          <Name>MmSt</Name>
          <PoolUsed>216638928</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>CM31</Name>
          <PoolUsed>103596032</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>MmRe</Name>
          <PoolUsed>15907504</PoolUsed>
        </Tag_3>
      </PagedPoolInfo>
      <NonPagedPoolInfo>
        <Tag_1>
          <Name>SmMs</Name>
          <PoolUsed>161243168</PoolUsed>
        </Tag_1>
        <Tag_2>
          <Name>BCM0</Name>
          <PoolUsed>40694064</PoolUsed>
        </Tag_2>
        <Tag_3>
          <Name>Cont</Name>
          <PoolUsed>35498720</PoolUsed>
        </Tag_3>
      </NonPagedPoolInfo>
      <ExhaustionEventInfo>
        <Time>2010-09-03T10:47:18.540433800Z</Time>
      </ExhaustionEventInfo>
    </MemoryExhaustionInfo>
  </UserData>
</Event>

This helps you determine what resource was the possible offender without having to install any additional tools on the server to troubleshoot this. The best part is that you don’t have to wait for an additional event to occur as the information has already been collected and logged.
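
If you would rather pull these diagnosis events from PowerShell than scroll through Event Viewer, a quick query along the following lines (assumes PowerShell 2.0, for example on Windows 2008 R2) will list them from the System log:

# List the Resource Exhaustion Detector diagnosis events (ID 2004) with their full descriptions
Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-Resource-Exhaustion-Detector'; Id=2004} |
    Select-Object TimeCreated, Message | Format-List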

Events are also logged to the Windows Resource Exhaustion Detector (Resource-Exhaustion-Detector) operational log under Applications and Services Logs in the Event Viewer, as shown below.

[Screenshot: Resource-Exhaustion-Detector operational log in Event Viewer]

These events show much less information than the system event, but they do include the system commit limit and commit charge. A sample is shown below.

Log Name:      Microsoft-Windows-Resource-Exhaustion-Detector/Operational
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Event ID:      1003
Task Category: Resource Exhaustion Detection Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Computer:      ExServer01.Domain.Com
Description:
The Windows Resource Exhaustion Detector received a notification that the computer is low on virtual memory.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Resource-Exhaustion-Detector" Guid="{9988748e-c2e8-4054-85f6-0c3e1cad2470}" />
    <EventID>1003</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>2</Task>
    <Opcode>22</Opcode>
    <Keywords>0x4000000020000000</Keywords>
    <TimeCreated SystemTime="2010-09-03T10:52:01.431065200Z" />
    <EventRecordID>180</EventRecordID>
    <Correlation ActivityID="{0B95CAB5-E004-4C92-BF5D-3BFA39FDF7EE}" />
    <Execution ProcessID="1160" ThreadID="8312" />
    <Channel>Microsoft-Windows-Resource-Exhaustion-Detector/Operational</Channel>
    <Computer>ExServer01.domain.com</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <UserData>
    <CommitLimitExhaustion xmlns:auto-ns2="http://schemas.microsoft.com/win/2004/08/events" xmlns="http://www.microsoft.com/Windows/Resource/Exhaustion/Detector/Events">
      <SystemCommitLimit>21261021184</SystemCommitLimit>
      <SystemCommitCharge>21258543104</SystemCommitCharge>
    </CommitLimitExhaustion>
  </UserData>
</Event>

A couple of potential events that can be seen when memory resources are low are shown below.

  • MSExchangeRepl Service failing to read a log file for database copy due to an out of memory error condition.

    Log Name:      Application
    Source:        MSExchangeRepl
    Event ID:      2168
    Task Category: Service
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      ExServer01.Domain.Com
    Description:
    Log file F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log' for database copy EXServer MBX Store 001\ExServer01' couldn't be read. Error: Out of Memory (-1011)
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="MSExchangeRepl" />
        <EventID Qualifiers="49156">2168</EventID>
        <Level>2</Level>
        <Task>1</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-02T11:21:13.000000000Z" />
        <EventRecordID>3190563</EventRecordID>
        <Channel>Application</Channel>
        <Computer>Exserver01.domain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data>F:\Exchsrvr\DB\DB0001\LOG001\E00000A7A46.log</Data>
        <Data>EXServer MBX Store 001\ExServer01' </Data>
        <Data>Out of Memory (-1011)</Data>
      </EventData>
    </Event>

  • A Registry flush operation failing to write the SOFTWARE hive to disk

    Log Name:      System
    Source:        Microsoft-Windows-Kernel-General
    Event ID:      6
    Task Category: None
    Level:         Error
    Keywords:     
    User:          SYSTEM
    Computer:      ExServer01.domain.com
    Description:
    An I/O operation initiated by the Registry failed unrecoverably. The Registry could not flush hive (file): '\SystemRoot\System32\Config\SOFTWARE'.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Kernel-General" Guid="{a68ca8b7-004f-d7b6-a698-07e2de0f1f5d}" />
        <EventID>6</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2010-09-03T10:48:17.714333400Z" />
        <EventRecordID>169290</EventRecordID>
        <Correlation />
        <Execution ProcessID="4" ThreadID="92" />
        <Channel>System</Channel>
        <Computer>ExServer01.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="FinalStatus">0xc000014d</Data>
        <Data Name="ExtraStringLength">36</Data>
        <Data Name="ExtraString">\SystemRoot\System32\Config\SOFTWARE</Data>
      </EventData>
    </Event>

Which component logs the event in the System log depends on which component was calling the function that failed. Finding root cause for memory issues has become significantly easier with this new Reliability Infrastructure, and I hope this post helps show you some of the methods for troubleshooting these types of issues.

Until next time!!!

How to unload/reload performance counters on Exchange 2010


 

Previously I wrote a set of steps for fixing/repairing broken Exchange 2007 counters at http://blogs.technet.com/b/mikelag/archive/2009/10/21/how-to-fix-repair-broken-exchange-2007-counters.aspx. That set of steps has been replaced with an easier way of loading the counters using PowerShell for Exchange 2010. This new method applies mainly to managed code services in Exchange 2010, as some of the previous lodctr/unlodctr commands still work for services such as MSExchangeIS and ESE.

In this post, I am going to use the same missing ESE counters as an example.

The important registry values that we are interested in are the following; they must exist prior to loading the ESE performance counters.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
"Close"="ClosePerformanceData"
"Collect"="CollectPerformanceData"
"Library"="C:\\Program Files\\Microsoft\\Exchange Server\\V14\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
"Open"="OpenPerformanceData"
"PerfIniFile"="eseperf.ini"

Here is a screenshot of what it will look like:

[Screenshot: ESE\Performance registry key values]

Next, verify whether the Disable Performance Counters value is present in the registry and, if so, whether it is set to something other than 0. If it is set to 4, the counters had a problem at one point and were disabled; change the value back to 0 and then reopen Performance Monitor to see if the counters are now visible. If that works, life is good and you can either stop reading here or continue on to reloading the counters if they are still not available.

If the values First Counter, First Help, Last Counter and Last Help are listed, then it is recommended to unload the performance counters first before attempting to reload them again.

To unload performance counters for ESE, perform the following steps:

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type remove-perfcounters -definitionfilename "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml" and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server

To reload performance counters for ESE, perform the following steps:

  • Close down any running versions of Performance Monitor and stop any other monitoring services that might be attempting to use those counters
  • Open the Exchange Management Shell
  • Type add-pssnapin Microsoft.Exchange.Management.PowerShell.Setup and hit enter
  • Type new-perfcounters -definitionfilename "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml" and hit enter
    Note: Ensure that the path is changed to the correct location of eseperf.xml on your server

This will reload the appropriate performance counters for ESE. Once that is done, check the application event log to ensure that they were loaded OK and no Perflib errors exist. Open Performance monitor again to see if the counters are available.
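
Put together, the unload/reload sequence run from the Exchange Management Shell looks like the following (the eseperf.xml path shown is the default install location and may differ on your server):

Add-PSSnapin Microsoft.Exchange.Management.PowerShell.Setup

# Unload the existing ESE counter definitions
Remove-PerfCounters -DefinitionFileName "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml"

# Reload the ESE counter definitions
New-PerfCounters -DefinitionFileName "C:\Program Files\Microsoft\Exchange Server\V14\Bin\perf\AMD64\eseperf.xml"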

If you are looking to reload any of the Exchange counters, you can substitute the appropriate XML in the command with any of the available XMLs from the \Program Files\Microsoft\Exchange Server\V14\Setup\Perf directory.

Here is a screenshot of some of the available performance counters

[Screenshot: performance counter definition XML files available in the Setup\Perf directory]

I hope this helps in your performance counter monitoring in Exchange 2010.

Processor 0 increased CPU utilization


While recently looking at an Exchange 2010 server in Task Manager to review CPU utilization, I noticed that Processor 0 was at 100% CPU while all of the other CPUs were relatively low in comparison. This type of behavior is caused by the Receive Side Scaling (RSS) feature not being enabled on the server. RSS is a feature that was first implemented back in Windows 2003 with the Scalable Networking Pack and allows incoming network traffic processing to be spread across multiple CPU cores. If RSS is not enabled, only *one* CPU will be used to process incoming network traffic, which can cause a networking bottleneck on the server. Additional information on RSS can be found here.

Here is what it looks like in Task Manager on the Performance tab.

[Screenshot: Task Manager Performance tab showing Processor 0 pegged at 100%]

As you can see, the first processor is pegged at 100% CPU which is indicative of RSS not being enabled. Generally on new installations of Windows 2008 or greater, this feature is enabled by default, but in this case, it was disabled.

Prior to enabling RSS on any given machine, there are a few dependencies that are necessary for RSS to work properly and are listed below.

  • Install the latest network card driver and associated Network Configuration Utility. The network card driver update is very important, as older versions had known bugs that would cause RSS to fail.
  • Offloading features of the network card must be enabled (i.e. IPv4 Checksum Offload, TCP/UDP Checksum Offload for IPv4/IPv6).
  • Receive Side Scaling must be enabled in the network card properties.
  • Receive Side Scaling Queues and Max number of RSS Processors must be set to the maximum value listed in the network card properties. This is typically the number of CPU cores installed in the server. Hyperthreading does not count toward the maximum number of CPU cores that can be leveraged here; the use of hyperthreading is generally not recommended on Exchange servers anyway and is referenced here.

    Note: If Receive Side Scaling Queues and Max number of RSS Processors are not changed to a value above 1, then enabling RSS does not provide any benefit, since you will only be using a single core to process incoming network traffic.
  • RSS must be enabled at the OS layer by running netsh int tcp set global rss=enabled. Use netsh int tcp show global to confirm that the setting was applied properly (see the example below).
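
For reference, the OS-level portion of the last step can be run from an elevated command prompt:

rem Check the current Receive Side Scaling state
netsh int tcp show global

rem Enable RSS at the OS layer, then confirm the change took effect
netsh int tcp set global rss=enabled
netsh int tcp show global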

After enabling RSS, you can clearly see the difference in processor utilization on the server below, as the CPU utilization for Processor 0 is now fairly close to the other processors right around 3:00 AM.


[Chart: per-processor CPU utilization evening out after RSS was enabled]

Many people have disabled the Scalable Networking Pack features across the board due to the various issues caused by the TCP Chimney feature back in Windows 2003. All of those problems have now been fixed in the latest patches and network card drivers, so enabling this feature will help increase networking throughput almost twofold. The more features you offload to the network card, the less CPU you will use overall, which allows for greater scalability of your servers.

You will also want to monitor the amount of deferred procedure calls (DPC) that are created since there is additional overhead for distributing this load amongst multiple processors. With the latest hardware and drivers available, this overhead should be negligible.

Windows 2008 R2 adds new performance counters to help track RSS/offloading/DPC/NDIS traffic to different processors, as shown below.

Object: Per Processor Network Activity Cycles(*)
Counters:
Stack Send Complete Cycles/sec
Miniport RSS Indirection Table Change Cycles
Build Scatter Gather Cycles/sec
NDIS Send Complete Cycles/sec
Miniport Send Cycles/sec
NDIS Send Cycles/sec
Miniport Return Packet Cycles/sec
NDIS Return Packet Cycles/sec
Stack Receive Indication Cycles/sec
NDIS Receive Indication Cycles/sec
Interrupt Cycles/sec
Interrupt DPC Cycles/sec

Object: Per Processor Network Interface Card Activity(*)
Counters:
Tcp Offload Send bytes/sec
Tcp Offload Receive bytes/sec
Tcp Offload Send Request Calls/sec
Tcp Offload Receive Indications/sec
Low Resource Received Packets/sec
Low Resource Receive Indications/sec
RSS Indirection Table Change Calls/sec
Build Scatter Gather List Calls/sec
Sent Complete Packets/sec
Sent Packets/sec
Send Complete Calls/sec
Send Request Calls/sec
Returned Packets/sec
Received Packets/sec
Return Packet Calls/sec
Receive Indications/sec
Interrupts/sec
DPCs Queued/sec
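
If you prefer to spot-check these from PowerShell rather than Performance Monitor, a quick sample along these lines (counter path taken from the list above; requires Windows 2008 R2 and PowerShell 2.0) can show whether DPC work is being spread across processors:

# Take three 5-second samples of the per-processor interrupt DPC cycle counter
Get-Counter -Counter '\Per Processor Network Activity Cycles(*)\Interrupt DPC Cycles/sec' -SampleInterval 5 -MaxSamples 3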

I hope this helps you understand why you might be seeing this type of CPU usage behavior.

Until next time!!

Mike

How fragmentation on incorrectly formatted NTFS volumes affects Exchange



Recently we have been seeing some gnarly performance issues in Exchange 2007, along with an added splash of database operation failures. This doesn’t sound enticing at all, but this blog post is going to discuss what these issues are and how to resolve them. This post is targeted mainly at Exchange 2007, but you can also apply the same methodology to Exchange 2010, as this is where the original problem was seen.

Before going in to this, here is a highlight of some of the issues that you may see:

  • Databases failing with an Out of Memory condition
  • Extremely slow log replay times on CCR/SCR replica copies (High replay queue lengths)
  • A high number of split I/Os occurring on any given LUN/volume.
  • Slowly rising RPC requests until the Information Store service goes unresponsive

Examples

Here are some examples of the out of memory condition that would be written to the application log on the affected Exchange server.

Event Type:     Error
Event Source:   MSExchangeIS
Event Category: None
Event ID:       1160
Description:
Database resource failure error Out of memory occurred in function JTAB_BASE::EcUpdate while accessing the database "CCRName\SGName".

Windows 2003 based error
Event Type:        Error
Event Source:    ESE
Event Category:                General
Event ID:              482
Description:
MSExchangeIS (9228) DBName: An attempt to write to the file "F:\Data\DBName.edb" at offset 530157682688 (0x0000007b6fdc4000) for 8192 (0x00002000) bytes failed after 0 seconds with system error 1450 (0x000005aa): "Insufficient system resources exist to complete the requested service. ".  The write operation will fail with error -1011 (0xfffffc0d).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

Windows 2008 based error
Log Name:      Application
Source:        ESE
Event ID:      482
Task Category: General
Level:         Error
Description:
Information Store (8580) DBNAme: An attempt to write to the file "F:\Data\DBName.EDB" at offset 315530739712 (0x0000004977190000) for 32768 (0x00008000) bytes failed after 0 seconds with system error 665 (0x00000299): "The requested operation could not be completed due to a file system limitation ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

So just what is this Insufficient system resources exist to complete the requested service error? The explanation will come later….

Here is an example of very high split I/O operations (purple line) leading up to high RPC requests (green line) until the server went unresponsive. In the case below, we were trying to extend the size of the database and couldn’t because of the underlying cause, which I will explain shortly.

[Chart: Split I/O operations (purple) climbing ahead of RPC Requests (green)]

Another clear sign that you might be running into this problem is when all I/O requests for that particular database instance go to zero while RPC requests continue to climb and version buckets plateau.

[Chart: database I/O dropping to zero while RPC Requests climb and version buckets plateau]

This particular problem is not an obvious one and requires a few levels of explanation of what is going on, plus a little bit of terminology to get you going. At the lowest layer, an Exchange database resides on an NTFS partition that is set up when the server is first configured. This initial setup has some specific guidelines around how to properly partition and format the volumes, referenced in http://technet.microsoft.com/en-us/library/bb738145(EXCHG.80).aspx for Exchange 2007 and http://technet.microsoft.com/en-us/library/ee832792.aspx for Exchange 2010. The two most important factors are proper partition alignment and NTFS allocation unit size.

Below is a table of recommendations for use with Exchange.

Description                          Recommended Value
Storage Track Boundary               64K or greater. (1MB recommended)
NTFS allocation unit/cluster size    64KB (DB and Log Drives)
RAID Stripe size                     256KB or greater. Check with your storage vendor for best practices

NTFS allocation unit size

Before we go into discussing this area, we need to take a step back and look at how NTFS operates. This is where you need to do a little homework by reading the following two references:

Now that we have gone over the basic concept of what a File Attribute List (ATTRIBUTE_LIST) is and how files are actually stored on disk, we can continue on with why this is so important here. Let’s say that we have a disk formatted with a file allocation unit size of 4K (4096 bytes), which is the default in Windows 2003 for any partition greater than 2GB in size. With Exchange 2007’s ESE page size of 8K, we need to make two writes for a single page. These writes may or may not be contiguous and can spread data across various sections of the disk, and this is where fragmentation can begin for larger files. As fragments accumulate and the database files grow, the File Attribute List (FAL), which has grown outside of the MFT, keeps growing to accommodate the fragmentation and the overall increase in database file sizes.

NTFS does have its limitations on the overall size of this attribute list per file, which works out to roughly 1.5 million fragments. This is not an absolute maximum, but it is around the point where problems can occur. The FAL size will never shrink and will continually keep growing over time. The maximum supported size of the ATTRIBUTE_LIST is 256K (262144 bytes). If you were to reach this upper limit, you could no longer expand the size of your database, and we would be doing many more small I/O operations and a lot more seeking around the drive to find the data we are looking for. This is where the “out of memory” error comes from, along with the “Insufficient system resources exist to complete the requested service” error. File management APIs will start failing with ERROR_FILE_SYSTEM_LIMITATION in Windows 2008 or later, and with ERROR_INSUFFICIENT_RESOURCES on earlier Windows versions, when the absolute maximum has been reached. The out of memory error is a much higher level error that bubbles up because NTFS can no longer increase the size of the FAL. This is why it is not an obvious error; it was ultimately found by Eric Norberg troubleshooting over many tireless nights and through long debugging sessions by EE extraordinaire Dave Goldman.

This fragmentation issue is actually referenced in the following article:

A heavily fragmented file in an NTFS volume may not grow beyond a certain size
http://support.microsoft.com/kb/967351

This scenario is seen more on servers with smaller NTFS cluster sizes such as 4K, large databases that are two times the recommended 200GB maximum, and low available disk space. The combination of those three variables can get you into a very bad situation.

NTFS cluster sizes can be obtained by running the fsutil command as shown below for any given partition:

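The value to look for in the output is Bytes Per Cluster, which reports the allocation unit size. For example, against the F: drive (output abbreviated):

C:\>fsutil fsinfo ntfsinfo F:
...
Bytes Per Cluster :               4096
...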

In Exchange 2007, you can check if you are running in to this issue by downloading and running Contig.exe from Sysinternals at http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx

C:\>Contig.exe -a f:\data\DBName.edb

Contig v1.55 - Makes files contiguous
Copyright (C) 1998-2007 Mark Russinovich
Sysinternals - www.sysinternals.com

f:\data\DBName.edb is in 1.46698e+006 fragments

Summary:
     Number of files processed   : 1
     Average fragmentation       : 1.46698e+006 frags/file

In the above example, we are extremely close to the 1.5 million approximate maximum amount of fragments that you can have for any given file. This particular database will eventually be problematic and is a ticking time bomb waiting to happen.

For Exchange 2010 SP1, you can dump the same type of information as contig.exe by using eseutil.exe, as shown below.

C:\>eseutil /ms f:\data\DBName.edb

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.01
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...
Error: Access to source database 'f:\data\DBName.edb' failed with Jet error -1032.

File Information:
  File Name: f:\data\DBName.edb
  Volume Name: Drive2
  File System: NTFS
  Cluster Size: 4096 bytes
  Attribute List Size: 180 KB
  Extents Enumerated: 1157172

Operation terminated with error -1032 (JET_errFileAccessDenied, Cannot access file, the file is locked or in use) after 0.78 seconds.

Even though the command errors out because the database is online, we are still able to obtain similar data. When run locally on the server, eseutil lets you look at the actual FAL size, the NTFS cluster size, and how many extents have been created for that file due to excessive fragmentation. With that, we can deduce that the NTFS cluster size is 4KB, the FAL size is 180KB, and Extents Enumerated is over 1.1 million fragments. A general rule of thumb is to keep the FAL size below 150KB and to have sufficient available disk space.

This fragmentation is also seen on CCR/SCR replica copies as the log files are shipped and then replayed into the database. The end result is that log replay slows to a crawl and you can see very high replay queue lengths due to the excessive split I/Os occurring. Even the fastest disks will not save you if the NTFS cluster sizes and disk alignment are configured improperly; you must fix the root of the problem to successfully resolve this issue.

So how do you mitigate this? Well, there are various ways to do this…

  1. If you determine that only a single database is affected by this issue, the quickest mitigation method to get you back in business is the following (a scripted sketch of these steps appears after this list):
    1. Dismount the database
    2. Make a copy of the database to another drive with sufficient space. IMPORTANT: This cannot be on the same drive as we need to write this file out contiguously to another drive. This mere act of copying the file defrags the file for you.
    3. Delete the original copy of the database file
    4. Copy the database back to the original location
    5. Using this method does not resolve the issue long term if the NTFS cluster sizes are too small. It is only meant as a stop gap to buy you some time to resolve the issue long term.
  2. If on a CCR/SCR cluster, you have some options to fix this longer term.
    1. To resolve the NTFS cluster sizes on the non-active node or SCR target for any particular volume such as F:, use the following command to format the disk with a 64KB block size which is the recommended value for optimal performance.

      Format F: /q /y /fs:ntfs  /v:VolumeName /a:64K

      NOTE:
      This command wipes out any files that currently resides on the F: drive, so make sure that no other files or applications reside on this drive other than the database and log files. I would hope that you are dedicating these drives exclusively to Exchange and not sharing with any other applications. Exclusivity is what makes recovering from this much easier. 
    2. Verify that the disk was formatted properly by running the following command:

      image
    3. Once the disk has been reformatted, go ahead and reseed the databases that previously existed on the drive.

You may ask yourself: if the file is so fragmented, why can I not simply do an offline defrag of the file? The answer is that if you defrag the file itself, there is a high possibility of bloating the FAL size, since moving the fragments around causes the FAL to grow even further. This is the primary reason why Exchange does not recommend running defrag on volumes that host database files. The only way to remove the attribute list for this file is to completely copy the file off to another drive, delete the original copy, and then copy the file back to the original location. When this is done, the file is written to the disk contiguously, leaving literally no fragments in the file. Life is good once again.

Once you have resolved these underlying issues, overall Exchange performance should be that much better and you can sleep better at night knowing you have increased throughput on your Exchange servers.

Note that it is still not recommended to run disk defragmentation software on Exchange server volumes, but there are times where file level fragmentation can cause significant performance problems on a server merely by the way data is being written to the disk. If optimal and/or recommended settings are not used when creating the volumes, this file fragmentation issue can occur much quicker. The majority of Exchange files are in use so running any regular disk defragmentation programs on the server will not help with this situation. If necessary, the only way to resolve this is to take all Exchange resources offline to ensure none of the files are in use and then defragment the disk to make the files contiguous on the disk once again.

In Exchange 2010 SP1 or later, logic was added to detect when the FAL is close to exhaustion (80% of the maximum) and to log an event accordingly. There is no NTFS event for this behavior. The following event is an example of what would be logged for a problematic database during online maintenance.

Log Name: Application
Source: ESE
Event ID: 739
Task Category: General
Level: Error
Description:
Information Store (5652) EXSERVER MBX Store 001: The NTFS file attributes size for database 'C:\DB\DB001\PRIV001.EDB' is 243136 bytes, which exceeds the threshold of 204800 bytes. The database file must be reseeded or restored from a copy or backup to prevent the database file from being unable to grow because of a file system limitation.

Update (3/8/2011): Exchange 2007 SP3 RU3 now has a fix that is referenced in http://support.microsoft.com/kb/2498066 that will increase the default extent size from 8MB to 64MB similar to that of Exchange 2010. Increasing the extent size helps reduce the amount of fragments that will be created for any given database. The 739 event has also been added so that monitoring software can alert on potential problems.

Reasonable volume sizes and database sizes go a long way toward protecting yourself from fragmentation (the more competing files being extended/created on a volume, the greater the fragmentation of those files will be).

Recommendations:

  • Keep your volume sizes at or below 2TB (which is why MBR partitions are recommended for Exchange 2007). Exchange 2010 can use GPT volumes greater than 2TB, but the recommendation is to ensure that database sizes stay under 2TB.
  • Limit the number of databases hosted per volume. Ten per volume is the absolute maximum we would recommend; five per volume is much better.
  • Do not place write-intensive non-Exchange workloads on the same volume as an Exchange database.

I hope this sheds some light on why certain failures on Exchange servers could prevent you from doing various operations.

Thanks go to Matt Gossage, Tim McMichael, Bryan Matthew, Neal Christiansen and Luke Ibsen for reviewing this blog entry before posting

Mike

Performance Counter Collection Tools for Exchange

New Windows Dynamic Cache Service for 64-bit Windows 2003 servers


If you've ever had an issue where low memory conditions were causing working set trimming due to excessive use of the System File Cache, we have just released a new service that can help alleviate this issue, called the Microsoft Windows Dynamic Cache Service.

More information regarding this new service can be found here and a direct link to download this new service can be found here

With Exchange 2007 servers also running into these issues, which I blogged about here, this service could potentially allow other third-party services that may be consuming more than their fair share of the System File Cache to play nicely with Exchange 2007.

So if you find that Exchange performance is suffering because of some other service taking up overall memory in the System File Cache, then this service may be just for you.

Hope this helps with some of your performance related issues.

Mike

Client RPC Dialog box questionnaire for Administrators


There are times when you are troubleshooting an Exchange server issue where the server appears to be performing OK, but users are still complaining about the dreaded RPC dialog box and hangs in their client. Most of the time, an Exchange administrator or helpdesk person needs to speak directly with the end user to determine what actions they were taking at the time the RPC dialog box appeared. Since there are numerous actions that can trigger this dialog box, an administrator needs to understand specifically what users were doing at the time of the problem. A lot of the time this has nothing to do with server-side performance problems, but rather something that is installed on the client or something the user is doing.

I have created a simple document that users can fill out to give you some insight into a user's actions and the habits that are aggravating this RPC dialog box.

The document is password protected so that the fields are checkable. The password currently is "Microsoft".

Please provide feedback regarding this document to help make this better.

Thanks,

Mike


Outlook 2007 Performance Improvements Hotfix


If you haven't heard already, we have released a pre-SP2 hotfix that helps improve Outlook performance and responsiveness in a big way. Here is an excerpt from the article.

Performance improvements

Performance and responsiveness are key concerns for all our customers. That is why we made the large performance tuning and optimization changes that are included in Office suite Service Pack 2 (SP2).

Outlook 2007 SP2 delivers performance improvements in four major areas:

  • General Responsiveness
    SP2 reduces I/O disk usage and UI response time.
  • Startup
    SP2 removes long operations from initial startup.
  • Shutdown
    SP2 makes Outlook exit predictably despite pending activities.
  • Folder/View Switch
    SP2 improves view rendering and folder switching.

Before you go out and apply this on your machine, you need to be warned about the first startup experience, as we rebuild the tables and indexes in your OST. If you have a large OST, this is going to take some time, but I can tell you that the wait is well worth it. It is actually an entirely new experience, at least for me anyway, since I have a large number of items in my folders. Switching between folders with large item counts is no longer painful, and this hotfix provides immediate viewing of those folders.

Grab the hotfix from the following article:

Description of the Outlook 2007 hotfix package (Outlook.msp): February 24, 2009
http://support.microsoft.com/kb/961752

Check out the plethora of improvements in this release in the following article as there are many.

Outlook 2007 improvements in the February 2009 cumulative update
http://support.microsoft.com/kb/968009

Hope this helps tame some of the larger mailboxes that you have.

Mike

Windows Desktop Search and the implications on WAN performance


Windows Desktop search (WDS) is a great tool to help you to search through the unwieldy plethora of documents or emails that you may have scattered across your desktop. With the addition of 3rd party IFilter add-ins, it makes it even easier to find what you are looking for.

As of version 3.01, Desktop Search has disabled the indexing of online mailboxes in a default installation due to the performance implications on the Exchange server side. Companies sometimes still need to run Outlook in online mode due to security requirements that prohibit local OSTs, or because they need real-time email for business purposes. With some of those requirements, companies also need fast message/document retrieval, which Windows Desktop Search can surely provide without a problem.

WDS does have group policy settings that will allow online indexing of mailboxes; a listing of all the settings for WDS 4 is at http://technet.microsoft.com/en-us/library/cc732491.aspx. The setting that allows indexing of online mode Outlook profiles is "Enable Indexing uncached Exchange Folders". Once this is deployed via group policy, WDS will start indexing online mode Outlook profiles. This of course could put a huge strain on the server if deployed to a large user base, as all of those users' data is being indexed. Recommended guidance states that you should deploy this policy to smaller subsets of users to prevent possible server performance problems; this is similar to the guidance Microsoft gives for cached mode deployments.

With that said, I would like to now take us down a road where certain combinations of WDS policies can not only affect Exchange server side performance, but can also have serious implications on WAN performance. If you currently have a centralized Exchange deployment and users are accessing all of their email across WAN circuits, read on.

Let's say you have an administrative assistant running in cached mode who needs access to another user's complete mailbox, with a requirement that the data in that mailbox be easily discoverable. This requirement can easily be met by using Windows Desktop Search and is very common in law firms. A default Outlook 2007 installation will have the "Download shared folders (exclude mail folders)" option selected for the user's profile, so if this assistant had previously opened another user's non-mail folder such as Contacts, Calendar or Tasks, WDS would index those items without any issue. This unfortunately does not meet the complete requirement, as we need to index all items in the other user's mailbox. After full mailbox permissions are added for this assistant, they can add the other user's mailbox to their profile to view that data. Once you do this, you will see that WDS will still not find any email items unless you select a folder in that mailbox and then perform the search. Everything so far is the default behavior.

WDS has a feature that allows you to index online delegate mailboxes, deployed via the GPO setting "Enable Indexing of online delegate Mailboxes". Once this setting is deployed, the mail folders of any other user's mailbox added to a profile will get indexed. That seems like a good thing, no? Well, we all know that indexing any mailbox in online mode increases the overall load on the Exchange server, and if users are doing this over a WAN, you will also see increased WAN utilization while WDS indexes this data by making direct RPC calls to the Exchange server. If this setting were deployed to a large user base while a number of profiles have other mailboxes added to them, you could potentially saturate the network circuit. Your network administrator at this point would obviously not be too happy, and your users would then start complaining that email access is really slow, or Outlook may even get disconnected due to this saturation. Our best practices dictate that this setting should be deployed to smaller user bases at a time to prevent increased client traffic to the Exchange server.

Imagine deploying this policy to 1,000 users who all access Exchange across a WAN and all have an added mailbox in their profile. By default, WDS will only index 120 items per minute, which should help keep the indexing traffic under control. If all of those workstations were indexing at that rate at the same time, we would be seeing about 120,000 items per minute of traffic. Couple that with any attachments that WDS is configured to index, such as PDF or Word documents, and this will make for a very bad network day.

You can change the number of items that are indexed per minute by modifying the GPO setting "Enable Throttling Online Mailboxes". Setting this policy to a lower value will reduce the number of items that are indexed per minute per mailbox and should also help keep some of the network traffic down to a minimum. The caveat is that it will take longer to index these mailboxes. Keep in mind that this is still direct RPC traffic to the Exchange server, with only a minimal amount of throttling.

To help reduce some of this overhead, Outlook 2007 has a registry entry (CacheOthersMail) that allows you to cache other users' mail folders in an OST file. This was first introduced in KB955572 and requires that you disable the downloading of headers; it was then rolled up into the Outlook 2007 post-SP1 September 24, 2008 hotfix package (957909). If the indexing of delegate mailboxes policy has been deployed to these users and you add this Outlook registry key, you will now see a mixture of traffic being generated by WDS: direct RPC traffic to the Exchange server, and Outlook FxGetBuffer function calls, otherwise known as Outlook sync (ICS) traffic. The Outlook sync traffic will become more prevalent over time as the other user's mailbox is cached locally in the OST file. FxGetBuffer calls are a lot less expensive than direct RPC calls to the Exchange server, so deploying the CacheOthersMail registry key may help with overall WAN utilization during initial indexing. You still need to plan for increased WAN traffic, as sync traffic coming from many clients can also cause WAN degradation issues.

WDS Registry Reference

Registry data to index data in your mailbox if you have an Online mode profile
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: PreventIndexingUncachedExchangeFolders
Value: 0

Registry data to index shared mailboxes:
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: EnableIndexingDelegateMailboxes
Value: 1

Registry data to change the amount of mail items that are indexed per minute.
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: EnableThrottlingOnlineMailboxes
Value: 120
Accepted Values (Default: 120, Min: 6, Max:600)

Outlook Registry Reference

Registry data to cache other users' mail data in an OST
One-off users
Key: HKCU\Software\Microsoft\Office\12.0\Outlook\Cached Mode
DWORD: CacheOthersMail
Value: 1

GPO deployed
Key: HKCU\Software\Policies\Microsoft\Office\12.0\Outlook\Cached Mode
DWORD: CacheOthersMail
Value: 1
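
These values are normally delivered through Group Policy, but for testing on a single machine you could stage them with PowerShell. A minimal sketch using the key and value names from the reference above (run from an elevated prompt; note that a conflicting GPO will overwrite them):

$wds = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\Windows Search'
New-Item -Path $wds -Force | Out-Null    # create the policy key if it does not already exist
New-ItemProperty -Path $wds -Name PreventIndexingUncachedExchangeFolders -Value 0 -PropertyType DWord -Force | Out-Null
New-ItemProperty -Path $wds -Name EnableIndexingDelegateMailboxes -Value 1 -PropertyType DWord -Force | Out-Null
New-ItemProperty -Path $wds -Name EnableThrottlingOnlineMailboxes -Value 60 -PropertyType DWord -Force | Out-Null   # e.g. throttle to 60 items/minute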

One of the most taxing combinations of WDS settings with respect to Exchange server and WAN performance is deploying "Enable Indexing uncached Exchange Folders" and "Enable Indexing of online delegate Mailboxes" simultaneously. If you also index attachments, which is the default behavior, this could put an increased burden on network resources and could cause considerable downtime for your users. Deploying these settings needs to be carefully planned out, especially in centralized Exchange installations, to prevent the situations I describe above.

Troubleshooting Exchange 2007 Store Log/Database growth issues


One of the common issues we see in support is excessive database and/or transaction log growth. If you have ever run into one of these issues, you know they are not always easy to troubleshoot, as many tools are needed to help understand where the problem might be coming from. Customers have asked why the server allows these types of operations to occur in the first place and why Exchange is not resilient to this. That is not always an easy question to answer, as there are so many variables in why this may occur: faulty Outlook add-ins, custom or third-party applications, corrupted rules, corrupted messages, online maintenance not running long enough to properly maintain your database, and the list goes on and on.

Once an Outlook client has created a profile against the Exchange server, it pretty much has free rein to do whatever actions it wants within that MAPI profile. This, of course, is controlled mostly by your organization's mailbox and message size limits and some of the client throttling or backoff features that are new to Exchange 2007.

Since I have dealt with these types of problems in great detail, I thought it would be helpful to share some troubleshooting steps that may help you collect data on, detect, and mitigate these problems when and if you should see them.

General Troubleshooting

Exchange 2007 SP2 RU2 and Later

  • Exchange 2007 SP2 RU2 adds a new feature to help track these log growth issues much more easily. All you have to do is set some thresholds (warning/error) in the registry; once the log growth problem starts occurring, you can simply view the application log for events showing which user has crossed the thresholds you have set. Note: This is not set by default after installing SP2 RU2, so if you are in the middle of a log growth issue, adding the appropriate registry keys to the server will help provide additional insight into the problem. See http://support.microsoft.com/kb/972705 for more information on this new feature and how to determine what values to set these registry keys to.

Outlook 2007

  • A new Outlook 2007 fix has been created to allow any email with an attachment being sent via MAPISendMail to now honor message size limits. Below is a brief description of how MAPISendMail can affect log growth on an Exchange server. 
    • When using the Send To Mail Recipient facility in Windows with an online mode Outlook client to send a message/attachment over the maximum message size limit, Outlook streams the data to the store prior to performing any message size limit checking, thus creating log files proportional to the size of the attachment. By the time the Outlook message window comes up, the damage is already done on the Exchange server. If you add a recipient to the message and try to send the email, you will then receive the error "The messaging interface has returned an unknown error. If the problem persists, restart Outlook". If you then save the message in the mailbox, it will be successful. If you then pull up the message and send it, you will get the error "The message being sent exceeds the message size established for this user".
    • If you attempt the same process using a cached mode client, Outlook opens a new message with the attachment without any limit checks. If you add a recipient and then send the message, it will sit in the user's Outbox. Performing a send/receive on the client will then generate the error "Task 'Microsoft Exchange - Sending' reported error (0x80040610): 'The message being sent exceeds the message size established for this user.'" This is expected behavior. If the user then deletes the message, it ends up in the user's Deleted Items folder, which is then synched to the server. Messages over the size limit that are either imported or saved into a user's mailbox are not checked against overall message size limits during the sync process.
    • If you use the Send to Mail option in any Office program, you will see the same results: in online mode, the data is streamed to the server prior to checking size limits, and cached mode reacts the same way as well.

      To resolve this issue for your Outlook 2007 users, install 978401 on every client machine.

Builds earlier than Exchange 2007 SP2 RU2

  1. Use Exchange User Monitor (Exmon) server side to determine if a specific user is causing the log growth problems.

    • Sort on CPU (%) and look at the top 5 users that are consuming the most amount of CPU inside the Store process. Check the Log Bytes column to verify for this log growth for a potential user.
    • If that does not show a possible user, sort on the Log Bytes column to look for any possible users that could be contributing to the log growth
    • If the user shown in Exmon appears as a ?, then this is representative of a HUB/Transport-related problem generating the logs. Query the message tracking logs using the Message Tracking Log tool in the Exchange Management Console's Toolbox to check for any large messages that might be running through the system. See step 5.9 for a PowerShell script to accomplish the same task.
  2. If suspected user is found via Exmon, then do one of the following:

    1. Disable MAPI access to the users mailbox using the following steps (Recommended):

      • Run Set-CASMailbox -Identity <Username> -MAPIEnabled $false

      • Move the mailbox to another mailbox store. Note: This is necessary to disconnect the user from the store due to the Store mailbox and DSAccess caches. Otherwise, you could potentially be waiting for over 2 hours and 15 minutes for this setting to take effect. Moving the mailbox effectively kills the user's MAPI session to the server, and after the move, the user's access to the store via a MAPI-enabled client will be disabled.

    2. Disable the user's AD account temporarily

    3. Kill their TCP connection with TCPView

    4. Call the user and have them close Outlook while the condition is occurring, for immediate relief.

  3. If closing the client down or killing their sessions seems to stop the log growth issue, then we need to do the following to see if this is OST or Outlook profile related:

    1. Have the user launch Outlook while holding down the Ctrl key, which will prompt whether you would like to run Outlook in safe mode. If launching Outlook in safe mode resolves the log growth issue, then concentrate on what add-ins could be contributing to this problem.

    2. If you can gain access to the user's machine, then do one of the following:

      1. Launch Outlook to confirm the log file growth issue on the server.

      2. If log growth is confirmed, do one of the following

        1. Check the user's Outbox for any messages.

          1. If the user is running in cached mode, set the Outlook client to Work Offline. Doing this helps stop the message in the Outbox from being sent and sometimes causes the message to NDR.

          2. If the user is running in online mode, then try moving the message to another folder to prevent Outlook or the HUB server from processing the message.

          3. After each one of the steps above, check the Exchange server to see if log growth has ceased.

        2. Call Microsoft Product Support to enable debug logging of the Outlook client to determine possible root cause.

      3. Follow the Running Process Explorer instructions in the article below to dump out the DLLs that are running within the Outlook process. Name the file username.txt. This helps check for any third-party Outlook add-ins that may be causing the excessive log growth.

        970920  Using Process Explorer to List dlls Running Under the Outlook.exe Process
        http://support.microsoft.com/kb/970920

      4. Check the Sync Issues folder for any errors that might be occurring

    3. Let’s attempt to narrow this down further to see if the problem is truly in the OST or something possibly Outlook Profile related:

      1. Run ScanPST against the users OST file to check for possible corruption.

      2. With the Outlook client shut down, rename the users OST file to something else and then launch Outlook to recreate a new OST file. If the problem does not occur, we know the problem is within the OST itself.

      3. If the problem recurs even after the OST has been recreated, then recreate the user's profile to see if this might be profile related.

  4. Ask Questions:

    1. Is the user using any type of mobile device?

    2. Question the end user, if at all possible, to understand what they might have been doing at the time the problem started occurring. It's possible that the user imported a lot of data from a PST file, which could cause log growth server side, or that some other erratic behavior they were seeing was triggered by a user action.

  5. If Exmon does not provide the data that is necessary to get root cause, then do the following:

    1. Check current queues against all HUB Transport Servers for stuck or queued messages

      get-exchangeserver | where {$_.IsHubTransportServer -eq "true"} | Get-Queue | where {$_.DeliveryType -eq "MapiDelivery"} | Select-Object Identity, NextHopDomain, Status, MessageCount | export-csv HubQueues.csv

      Review queues for any that are in retry or have a lot of messages queued.

      Export out message sizes in MB in all Hub Transport queues to see if any large messages are being sent through the queues.

      get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | sort-object -property size -descending | Select-Object Identity,Subject,status,LastError,RetryCount,queue,@{Name="Message Size MB";expression={$_.size.toMB()}} | export-csv HubMessages.csv

      Export out message sizes in Bytes in all Hub Transport queues.

      get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | Select-Object Identity,Subject,status,LastError,RetryCount,queue,size | sort-object -property size -descending | export-csv HubMessages.csv

    2. Check Users Outbox for any large, looping, or stranded messages that might be affecting overall Log Growth.

      get-mailbox -ResultSize Unlimited| Get-MailboxFolderStatistics -folderscope Outbox | Sort-Object Foldersize -Descending | select-object identity,name,foldertype,itemsinfolder,@{Name="FolderSize MB";expression={$_.folderSize.toMB()}} | export-csv OutboxItems.csv

      Note: This does not get information for users that are running in cached mode.

    3. Utilize the MSExchangeIS Client\Jet Log Record Bytes/sec and MSExchangeIS Client\RPC Operations/sec Perfmon counters to see if there is a particular client protocol that may be generating excessive logs (see the sketch after this list). If a particular protocol is found to be higher than the others for a sustained period of time, then consider shutting down the service hosting that protocol. For example, if Outlook Web Access is the protocol generating the log growth, stop the World Wide Web Publishing Service (W3SVC) to confirm that log growth stops. If it does, collecting IIS logs from the CAS/MBX Exchange servers involved will help provide insight into what action the user was performing that was causing this to occur.

    4. Run the following command from the Management shell to export out current user operation rates:

      To export to CSV File:

      get-logonstatistics |select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending| export-csv LogonStats.csv

      To view realtime data:

      get-logonstatistics |select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending| ft

      Key things to look for:
      In the example below, the Administrator account was storming the testuser account with email.
      You will notice that there are two active users here: the Administrator, who is submitting all of the messages, and a second entry whose Windows2000Account references a HUB server with an Identity of testuser. The HUB server entry also has *no* UserName, so that is a giveaway right there. This can give you a better understanding of which parties are involved in these high rates of operations.

      UserName : Administrator
      Windows2000Account : DOMAIN\Administrator
      Identity : /o=First Organization/ou=First Administrative Group/cn=Recipients/cn=Administrator
      MessagingOperationCount : 1724
      OtherOperationCount : 384
      ProgressOperationCount : 0
      StreamOperationCount : 0
      TableOperationCount : 576
      TotalOperationCount : 2684

      UserName :
      Windows2000Account : DOMAIN\E12-HUB$
      Identity : /o= First Organization/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=testuser
      MessagingOperationCount : 630
      OtherOperationCount : 361
      ProgressOperationCount : 0
      StreamOperationCount : 0
      TableOperationCount : 0
      TotalOperationCount : 1091

    5. Enable Perfmon/Perfwiz logging on the server. Collect data through the problem times and then review for any irregular activities. You can grab some pre-canned Perfmon import files at http://blogs.technet.com/mikelag/archive/2008/05/02/perfwiz-replacement-for-exchange-2007.aspx to make collecting this data easier.

    6. Run ExTRA (Exchange Troubleshooting Assistant) via the Toolbox in the Exchange Management Console to look for any possible functions (via FCL logging) that may be consuming excessive time within the store process. This needs to be launched during the problem period. http://blogs.technet.com/mikelag/archive/2008/08/21/using-extra-to-find-long-running-transactions-inside-store.aspx shows how to use FCL logging only, but it would be best to include Perfmon, Exmon, and FCL logging via this tool to capture the most data.

    7. Dump the store process during the time of the log growth. (Use this as a last measure once all prior activities have been exhausted and prior to calling Microsoft for assistance. These issues are sometimes intermittent, and the quicker you can obtain any data from the server, the better as this will help provide Microsoft with information on what the underlying cause might be.)

      1. Download Procdump 3.0 or greater from http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx and extract it to a directory on the Exchange server

      2. Open the command prompt and change into the directory to which Procdump was extracted in step 1.

      3. Type procdump -mp -s 120 -n 2 store.exe d:\DebugData. This will dump the data to D:\DebugData. Change this to whatever directory has enough space to dump the entire store.exe process twice. Check Task Manager for the store.exe process and how much memory it is currently consuming for a rough estimate of the amount of space that is needed to dump the entire store process.

        Important: If procdump is being run against a store that is on a clustered server, then you need to make sure that you set the Exchange Information Store resource to not affect the group. If the entire store dump cannot be written out in 300 seconds, the cluster service will kill the store service ruining any chances of collecting the appropriate data on the server.

      4. Open a case with Microsoft Product Support Services to get this data looked at.

    8. Collect a portion of Store transaction log files (100 would be good) during the problem period and parse them following the directions in http://blogs.msdn.com/scottos/archive/2007/11/07/remix-using-powershell-to-parse-ese-transaction-logs.aspx to look for possible patterns such as high pattern counts for IPM.Appointment. This will give you a high-level overview of whether something is looping or a high rate of messages is being sent. Note: This tool may or may not provide any benefit depending on the data that is stored in the log files, but it will sometimes show MIME-encoded data that will help with your investigation.

    9. Export out Message tracking log data from affected MBX server

      Method 1
      Download the ExLogGrowthCollector.zip file attached to this post and extract it to the MBX server that experienced the issue. Run ExLogGrowthCollector.ps1 from the Exchange Management Shell. Enter the MBX server name that you would like to trace and the start and end times, then click the Collect Logs button.

      [Screenshot: ExLogGrowthCollector.ps1 input form]

      Note: This script exports all mail traffic to/from the specified mailbox server across all HUB servers between the times specified. This helps provide insight into any large or looping messages that might have been sent that could have caused the log growth issue.

      Method 2
      Copy/paste the following data into Notepad, save it as msgtrackexport.ps1, and then run it on the affected mailbox server. Open the resulting MsgTrack.csv in Excel for review. This is similar to the GUI version, but requires entering the parameters manually when prompted.

      #Export Tracking Log data from affected server specifying Start/End Times

      Write-host "Script to export out Mailbox Tracking Log Information"
      Write-Host "#####################################################"
      Write-Host
      $server = Read-Host "Enter Mailbox server Name"
      $start = Read-host "Enter start date and time in the format of MM/DD/YYYY hh:mmAM"
      $end = Read-host "Enter end date and time in the format of MM/DD/YYYY hh:mmPM"
      $fqdn = $(get-exchangeserver $server).fqdn
      Write-Host "Writing data out to csv file..... "
      Get-ExchangeServer | where {$_.IsHubTransportServer -eq "True" -or $_.name -eq "$server"} | Get-MessageTrackingLog -ResultSize Unlimited -Start $start -End $end  | where {$_.ServerHostname -eq $server -or $_.clienthostname -eq $server -or $_.clienthostname -eq $fqdn} | sort-object totalbytes -Descending | export-csv MsgTrack.csv -NoType
      Write-Host "Completed!! You can now open the MsgTrack.csv file in Excel for review"


      Method 3
      You can also use the Process Tracking Log Tool at http://msexchangeteam.com/archive/2008/02/07/448082.aspx to provide some very useful reports.

    10. Save off a copy of the application/system logs from the affected server and review them for any events that could contribute to this problem

    11. Enable IIS extended logging for CAS and MB server roles to add the sc-bytes and cs-bytes fields to track large messages being sent via IIS protocols and to also track usage patterns.
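
As referenced in step 5.3 above, here is a minimal PowerShell sketch for watching the per-protocol store counters. It assumes PowerShell 2.0 or later is available for Get-Counter (many Exchange 2007 servers shipped with PowerShell 1.0, in which case Perfmon or typeperf would be used instead), and the sample interval and count are arbitrary choices.

# Sample the per-protocol log byte and RPC operation rates every 10 seconds for 5 minutes
$counters = "\MSExchangeIS Client(*)\Jet Log Record Bytes/sec",
            "\MSExchangeIS Client(*)\RPC Operations/sec"
Get-Counter -Counter $counters -SampleInterval 10 -MaxSamples 30 |
    ForEach-Object { $_.CounterSamples } |
    Where-Object { $_.CookedValue -gt 0 } |
    Sort-Object CookedValue -Descending |
    Select-Object Path, CookedValue -First 20 |
    Format-Table -AutoSize

A client instance that stays disproportionately high across samples is the protocol worth isolating as described in that step.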

Proactive monitoring and mitigation efforts

  1. In backup-less environments, if the “Do not permanently delete mailboxes and items until the store has been backed up” setting is checked on an Exchange 2003 database or the RetainDeletedItemsUntilBackup parameter is set to $true on an Exchange 2007 database, then over time this setting could lead to consistent, steady store growth, since all whitespace in the database is being consumed/reused. Even with online maintenance running on the server, these pages in the database are never reclaimed to free up any space due to this flag being set.
  2. Check whether online maintenance for the database in question has been running nightly in the application log.
  3. Check whether any move mailbox operations are occurring that might be moving users to this database exhibiting the log growth issue.
  4. Increase Diagnostics Logging for the following objects depending on what stores are being affected:

    • MSExchangeIS\Mailbox\Rules
    • MSExchangeIS\PublicFolders\Rules
  5. Enable Client Side monitoring per http://technet.microsoft.com/en-us/library/cc540465.aspx
  6. Create a monitoring plan using MOM/SCOM to alert when the amount of Log Bytes being written hits a specific threshold and then alert the messaging team for further action. There are thresholds that are part of the Exchange 2007 Management Pack that could help alert on these types of situations before the problem gets to the point of taking a database offline. Here are two examples of this.

    ESE Log Byte Write/sec MOM threshold
    Warning Event
    http://technet.microsoft.com/en-us/library/bb218522.aspx

    Error Event
    http://technet.microsoft.com/en-us/library/bb218733.aspx

    If an alert is raised, then perform an operation to start collecting data.
  7. Ensure http://support.microsoft.com/kb/958701 is installed at a minimum for each Outlook 2003 client to address known log/database growth issues for users streaming data to the information store after they have exceeded message size limits. This fix also addresses a problem where clients could copy a message from a PST to their inbox that, during the sync process, could exceed mailbox limits, thus causing excessive log growth problems on the server.

    These hotfixes make use of the PR_PROHIBIT_SEND_QUOTA and PR_MAX_SUBMIT_MESSAGE_SIZE properties, which are referenced in http://support.microsoft.com/kb/894795

    Additional Outlook Log Growth fixes:
    http://support.microsoft.com/kb/957142
    http://support.microsoft.com/kb/936184

  8. Implement minimum Outlook Client versions that can connect to the Exchange server via the Disable MAPI clients registry key server side. See http://technet.microsoft.com/en-us/library/bb266970.aspx for more information.

    To disable clients less than Outlook 2003 SP2, use the following entries on an Exchange 2007 server
    "-5.9.9;7.0.0-11.6568.6567"

    Setting this to exclude Outlook client versions earlier than Outlook 2003 SP2 will help protect against stream issues to the store. The reason is that Outlook 2003 SP2 and later understand the new quota properties that were introduced into the store in http://support.microsoft.com/kb/894795. Older clients have no idea what these new properties are, so if a user sent a 600MB attachment on a message, the entire message would be streamed to the store, generating excessive log files, and then get NDR’ed once the message size limits were checked. With SP2 installed, the Outlook client will first check whether the attachment size is over the set quota for the organization, immediately stop the send with a warning message on the client, and prevent the stream from being sent to the server.

    Allowing clients older than Outlook 2003 SP2 to connect to the store leaves the Exchange servers open to a log growth issue.

  9. If Entourage clients are being utilized, then implement the MaxRequestEntityAllowed property in http://support.microsoft.com/kb/935848  to address a known issue where sending a message over the size limit could potentially create log growth for a database.
  10. Check to ensure File Level Antivirus exclusions are set correctly for both files and processes per http://technet.microsoft.com/en-us/library/bb332342.aspx
  11. Enable Content Conversion tracing on all HUB servers per http://technet.microsoft.com/en-us/library/bb397226.aspx. This will help log any failed conversion attempts that may be causing the log growth problem to occur.
  12. If POP3 or IMAP4 clients are connecting to specific servers, then enabling protocol logging on the servers using these protocols will help capture data when these protocols are causing excessive log growth spurts. See http://technet.microsoft.com/en-us/library/aa997690.aspx for how to enable this logging.
  13. Ensure online maintenance is completing a pass for each database within the past week or two. Query the Application event log for the ESE event series 700 through 704 to verify. If log growth issues occur during online maintenance periods, this could be normal, as Exchange shuffles data around in the database; just keep this in mind when investigating these log growth problems.
  14. Check for any excessive ExCDO warning events related to appointments in the application log on the server (examples are 8230 or 8264 events). http://support.microsoft.com/kb/947014 is just one example of this issue. If recurring meeting events are found, then try to regenerate calendar data server side via a process called POOF. See http://blogs.msdn.com/stephen_griffin/archive/2007/02/21/poof-your-calender-really.aspx for more information on what this is.

    Event Type: Warning
    Event Source: EXCDO
    Event Category: General
    Event ID: 8230
    Description: An inconsistency was detected in username@domain.com: /Calendar/<calendar item> .EML. The calendar is being repaired. If other errors occur with this calendar, please view the calendar using Microsoft Outlook Web Access. If a problem persists, please recreate the calendar or the containing mailbox.

    Event Type: Warning
    Event ID : 8264
    Category : General
    Source : EXCDO
    Type : Warning
    Message : The recurring appointment expansion in mailbox <someone's address> has taken too long. The free/busy information for this calendar may be inaccurate. This may be the result of many very old recurring appointments. To correct this, please remove them or change their start date to a more recent date.

    Important: If 8230 events are consistently seen on an Exchange server, have the user delete/recreate that appointment to remove any corruption

  15. Add additional store logging per http://support.microsoft.com/kb/254606 so that more performance counter data can be collected with Perfmon. This will allow us to utilize counters such as ImportDeleteOpRate and SaveChangesMessageOpRates, which let us see what these common log growth rates are.
  16. Recommend forcing end dates on recurring meetings. This can be done through the DisableRecurNoEnd (DWORD) registry value; see the sketch after this list.

    For Outlook 2003:
    http://support.microsoft.com/kb/952144
    HKEY_CURRENT_USER\Software\Microsoft\Office\11.0\Outlook\Preferences

    For Outlook 2007:
    http://support.microsoft.com/kb/955449
    HKEY_CURRENT_USER\Software\Microsoft\Office\12.0\Outlook\Preferences
    Value: 1 to Enable, 0 to Disable
  17. Implement LimitEmbeddingDepth on the Exchange servers as outlined in KB 833607 to prevent log growth due to recursion looping. Note: This article states that it is for Exchange 2000-2003, but the key is also still valid in Exchange 2007 per source code
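
As referenced in step 16 above, the following is a minimal sketch that sets the DisableRecurNoEnd value for Outlook 2007 under the current user's registry hive. The path and value are the ones listed in that step; broad deployment would typically be handled with Group Policy or a logon script rather than per machine.

# Force end dates on recurring meetings for Outlook 2007 (run in the affected user's context)
$key = "HKCU:\Software\Microsoft\Office\12.0\Outlook\Preferences"
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
Set-ItemProperty -Path $key -Name "DisableRecurNoEnd" -Value 1 -Type DWord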

Known Issues

Exchange Server

SP1 Release Update 9 fixes

  • 959559 - Transaction log files grow unexpectedly in an Exchange Server 2007 Service Pack 1 mailbox server on a computer that is running Windows Server 2008
  • 925252 - The Store.exe process uses almost 100 percent of CPU resources, and the size of the public folder store increases quickly in Exchange Server 2007
  • 961124 - Some messages are stuck in the Outbox folder or the Drafts folder on a computer that is running Exchange Server 2007 Service Pack 1
  • 970725 - Public folder replication messages stay in the local delivery queue and cause an Exchange Server 2007 Service Pack 1 database to grow quickly

SP1 Release Update 8 fixes

  • 960775 - You receive a "Message too large for this recipient" NDR that has the original message attached after you restrict the Maximum Message Send Size value in Exchange Server 2007

SP1 Release Update 7 fixes

  • 957124 - You do not receive an NDR message even though your meeting request cannot be sent successfully to a recipient
  • 960775 - You receive a "Message too large for this recipient" NDR that has the original message attached after you restrict the Maximum Message Send Size value in Exchange Server 2007

SP1 Release Update 1 fixes

  • 947014 - An Exchange Server 2007 mailbox server randomly generates many transaction logs in an Exchange Server 2007 Service Pack 1 environment
  • 943371 - Event IDs 8206, 8213, and 8199 are logged in an Exchange Server 2007 environment

Outlook 2007

  • 970944 - Installing this hotfix package addresses an issue where log files are generated unexpectedly when a user running Outlook 2007 in cached Exchange mode sends an e-mail message to recipients who have a corrupted e-mail address and/or e-mail address type
  • 970777 - Additional log files are generated on the Exchange server unexpectedly when you send an e-mail message to recipients who have a corrupted e-mail address or a corrupted e-mail address type by using Cached Exchange mode in Outlook 2007 
  • 978401 - Description of the Office Outlook 2007 hotfix package (Outlook-x-none.msp): February 23, 2010 (Includes a MAPISendMAIL fix)

Outlook 2003

  • 958701 - Description of the Outlook 2003 Post-Service Pack 3 hotfix package (Engmui.msp, Olkintl.msp, Outlook.msp): October 28, 2008
  • 936184 - Description of the Outlook 2003 post-Service Pack 3 hotfix package: December 14, 2007
  • 897247 - Description of the Microsoft Office Outlook 2003 post-Service Pack 1 hotfix package: May 2, 2005

Entourage

  • 935848 - Various performance issues occur when you use Entourage for Mac to send large e-mail messages to an Exchange 2007 server

Windows 2008

  • 955612 - The "LCMapString" function may return incorrect mapping results for some languages in Windows Server 2008 and in Windows Vista

How to collect per request Performance Stats for IIS on Exchange 2007


Ever had a time where you were trying to troubleshoot an IIS Performance related issue on Exchange 2007 and the built-in performance counters were not giving you the data that you needed to gain insight in to the problem? I know I have run in to these before and they are not always the easiest to track as we cannot see latencies at a per request level easily.

As part of the default installation of Exchange 2007, you may have also seen IIS log entries similar to the following, but didn’t know what the appended IIS data meant.

/owa/ev.owa oeh=1&ns=DatePicker&ev=GetFreeBusy&m=2009-04-01T00%3a00%3a00&fId=LgAAAADBC0ggZ4mHTKllH8Mc0937AQBmBiNCEaM7R53LcWBj0I1aAAAAAACrAAAC&prfltncy=98&prfrpccnt=6&prfrpcltncy=78&prfldpcnt=0&prfldpltncy=0&prfavlcnt=0&prfavlltncy=0

The information I am calling out in this IIS log request is prfltncy, prfrpccnt, prfrpcltncy, prfldpcnt, prfldpltncy, and prfavlltncy. These are latency and count values appended at the end of each call that is being made. There may only be a handful of these throughout the logs by default.

Luckily, there is a way to enable additional per request user tracing in to the IIS logs to help you with troubleshooting these performance type problems. This tracing will allow you to see per request latencies for OWA, RPC and Availability requests.

To enable this additional logging, you would do the following:

Go to "Program Files\Microsoft\Exchange Server\ClientAccess\OWA". Edit web.config in Notepad. Add the following line of text under appSettings:
<add key="CollectPerRequestPerformanceStats" value="true"/>

After saving the web.config file, you should start seeing entries in the IIS logs similar to the above, but here is another log example:

/owa/default.aspx modurl=7&prfltncy=84212&prfrpccnt=37&prfrpcltncy=84011&prfldpcnt=9&prfldpltncy=30&prfavlcnt=0&prfavlltncy=0

In the above request, we can see that the RPC latencies are high (prfltncy=84212&prfrpccnt=37&prfrpcltncy=84011), so this was most likely a bottleneck between the CAS and the backend Mailbox server. Now wasn’t that easy to determine where the potential bottleneck might lie?

Per Request Tracing Legend
Prfltncy - Overall latency for this request
Prfrpccnt - RPC request count
Prfrpcltncy - RPC latencies
Prfldpcnt - LDAP request count
Prfldpltncy - LDAP latencies
Prfavlcnt - Availability request count
Prfavlltncy - Availability latencies

If you break one of these log requests down, here is the way you would look at this based on the first request example above. (Note: This was a call to get Free/Busy Data for a specific time period)

  • prfltncy=98 - Overall Performance Latency for the request
  • prfrpccnt=6&prfrpcltncy=78 - 6 RPC requests with a latency of 78ms
  • prfldpcnt=0&prfldpltncy=0 - 0 LDAP requests with a latency of 0ms
  • prfavlcnt=0&prfavlltncy=0 - 0 Availability requests with a latency of 0ms

You can use any log parser (e.g., LogParser.exe) of your choice to get further information, but this should help you understand some of the latencies down to a per-request level.
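
If LogParser is not handy, a rough PowerShell sketch like the one below can pull the slow requests out of the IIS logs based on the prfltncy field. The log file path and the 5 second threshold are assumptions to adjust for your environment.

# List OWA requests whose overall latency (prfltncy, in milliseconds) exceeded 5 seconds
$logPath = "C:\Windows\System32\LogFiles\W3SVC1\*.log"   # assumed IIS log location
Get-Content $logPath | ForEach-Object {
    if ($_ -match "prfltncy=(\d+)" -and [int]$Matches[1] -gt 5000) { $_ }
}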

I hope this helps in your performance troubleshooting…..

The Case of the Mysterious Exchange Server Hang


Recently we had a case in which an Exchange 2003 server would hang and no longer accept any new RPC connections to the Information Store. The rest of the server seemed to be operating just fine, but it was the Store that was ultimately having the problems.

I took a look at the perfmon data that was provided and didn’t see anything out of the ordinary except for a small amount of RPC operations taking place on the server. The server did look like it was processing data though, so this was quite intriguing to me. I did notice that one DC had a number of outstanding LDAP requests for an extended period of time, as shown below.

[Screenshot: Perfmon graph showing sustained outstanding LDAP requests to one DC]

We ended up taking some dumps of the Store, IIS processes, and LSASS to see what might be going on. The Store and IIS dumps were not that interesting. Looking at the LSASS dumps was an eye opener. We saw that over 150 threads were hung up calling (SECUR32!CALLSPM) into the Security Provider Manager (SPM). The beginning of the stacks showed secur32!LsaAcceptSecurityContext calls, which were mostly client authentication calls to the server. More info on the AcceptSecurityContext calls can be found here

There were almost 200 other threads calling netlogon!NlpUserValidateHigher, which essentially means that we are trying to send a user validation request to a higher authority for authentication requests over the secure channel. Once we accept this validation request, we then attempt to connect to the DC over RPC to handle the request. Debug analysis can be found on Dave Goldman’s blog here.

By default, Netlogon only allows 2 concurrent API calls for these authentication requests, which is controlled by a semaphore. If the 2 semaphore objects are tied up waiting for a response from the DC, all other requests will start queuing, thus having this mysterious hang effect on the Exchange server. This was our problem, since the debug analysis showed that we hit our maximum of 2 concurrent requests, most likely to an overloaded DC, leaving a backlog of requests for authentication traffic. This request queue is controlled by the MaxConcurrentApi setting. Each request has a default timeout of 45 seconds, so if there were requests that were timing out, this is surely going to cause some delays for other users. On healthy servers with good network connectivity, these authentication requests are extremely fast.

At this point, we knew that we were tied up in authentication calls to DCs, but we couldn’t determine which users’ logon attempts were taking the most time. The debug information only shows a point in time. It was possible that a user was trying to authenticate to a down-level domain across a slow WAN link; we were not sure at this time.

To move forward, we enabled Netlogon Debug logging per http://support.microsoft.com/kb/109626 and let the problem occur again.

We opened the netlogon.log file and started reviewing the information. Prior to the problem we can see that responses are returning in a timely manner. Notice the time intervals happen within the same second

07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

As traffic increases, the response times are starting to get a little slower

07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Now we see a response time right at 45 second timeout below

07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Here is where our first netlogon timeout hit

07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:57:48[CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1) 
07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E

0xC000005E = STATUS_NO_LOGON_SERVERS

Now we are seeing that we cannot allocate a Client API slot because the max request queue is busy servicing other requests

07/31 11:58:55 [CRITICAL]EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

Now we get to an actual DC timeout error as shown below.

08/01 17:21:24 [CRITICAL] NlPrintRpcDebug: Couldn't get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064)
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: timeout call to \\DC1.domain.com.  Count: 2
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: dropping the session to \\DC1.domain.com
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlSetStatusClientSession: Set connection status to c000005e

We can see clearly now that DC1 was having problems servicing authentication requests to this Exchange server. This does not always mean that the DC is overloaded; it could be a really slow down-level trust across the network that is causing this problem, so additional investigation needs to be performed at this point. We just know that Exchange is the victim and the problem lies elsewhere.

Troubleshooting methodologies

So what can we do at this point?

We can test secure channels for different domains to see which domains might be failing. First, we will need to obtain the DC with which the secure channel is currently formed on the Exchange server by running nltest /sc_query:<domain>, replacing <domain> with the domain name that the Exchange server resides in.

Once that DC is found, you will then run a command similar to the following for each of the domains:

nltest /server:DC1 /sc_query:ASIA
nltest /server:DC1 /sc_query:EUROPE
nltest /server:DC1 /sc_query:US

This will help fish out any down level domains that could be causing authentication delays.
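
To run the same secure channel checks against several domains in one pass, a small PowerShell wrapper around nltest can be used. The server and domain names below are the placeholder values from the example above.

# Query the secure channel status for each remote domain through DC1
$dc = "DC1"
foreach ($domain in "ASIA", "EUROPE", "US") {
    Write-Host "Secure channel check for $domain"
    nltest /server:$dc /sc_query:$domain
}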

You can also enable netlogon debug logging on the DC’s to help understand the traffic patterns there.

Installing the Server Performance Advisor on the Windows 2003 DC’s or using the Active Directory Diagnostics Data Collector in the Windows 2008 Reliability and Performance monitor will help fish out any potential bottlenecks.

Take netmon captures and search for NetrLogonSamLogonEx entries for Netlogon requests

For Windows 2003 servers, you can install the following hotfix to help track these types of issues faster. This hotfix adds new performance counters to help track access to these semaphores better. Windows 2008 servers already have this built into the OS.

New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication
http://support.microsoft.com/default.aspx/kb/928576

The main ones you want to look at are the following:

Semaphore Holders: How many threads on average are holding the client semaphore

This is the number of threads trying to get a Netlogon session to a DC that are blocked. Blocked could mean locked open by a process, the network being down, etc. When Semaphore Waiters is non-zero, some local process is waiting on LSASS for a response and the LSASS thread is blocked. This correlates to the MaxConcurrentApi setting.

By default this value should be less than 2 at any given time. If values above 2 are sustained, then either the Exchange server or the DCs are overloaded.

Average Semaphore Hold Time: The average wait time for a thread to acquire the semaphore

These values should normally be very quick. Longer hold times mean that a potential bottleneck is occurring.

Semaphore Waiters: The average number of waiters waiting on the semaphore.

This value should remain at 0 at all times. Short bursty spikes are OK to see, as that simply means that a large number of requests were handled in a short period of time.
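
Once the hotfix above is installed (or on Windows 2008, where the counters are built in), a quick way to watch these three counters is a sketch like the following. It assumes PowerShell 2.0 or later for Get-Counter; the instance wildcard is intended to cover each secure channel plus _Total.

# Sample the Netlogon semaphore counters every 5 seconds for a minute
$counters = "\Netlogon(*)\Semaphore Waiters",
            "\Netlogon(*)\Semaphore Holders",
            "\Netlogon(*)\Average Semaphore Hold Time"
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples } |
    Format-Table Path, CookedValue -AutoSize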

In some instances on heavily loaded servers, you may want to adjust MaxConcurrentApi to a value of 5 on both the Exchange servers and DCs to help widen the pipe, or increase the number of auth requests that can occur at any given time. Bumping this setting up may help alleviate this problem altogether, but it could also prolong the issue by masking some other underlying problem. It’s always best to understand where the problem is coming from before making any major changes such as this, which may increase overall processor utilization on the Exchange server and your domain controllers.
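
For reference, MaxConcurrentApi is a DWORD under the Netlogon parameters key, and the Netlogon service must be restarted for the change to take effect. The sketch below sets the value of 5 discussed above; treat it as an illustration rather than a blanket recommendation, since it is best to find the underlying bottleneck first.

# Raise the number of concurrent Netlogon authentication calls to 5, then restart Netlogon
$key = "HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"
Set-ItemProperty -Path $key -Name "MaxConcurrentApi" -Value 5 -Type DWord
Restart-Service Netlogon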

In this instance, we set MaxConcurrentApi to 5 on the DC’s and Exchange Servers and this appears to have reduced the amount of occurrences of this problem.

This particular problem not only affects Exchange servers, but also affects other applications such as ISA server. More information on this can be found here

I hope this provides some insight into some of the underlying dependency problems that you may see in Exchange.

That is all for now.

Exchange 2007 Memory usage and helpful troubleshooting tips


In support, we get a lot of statements that Exchange is using all or most of the memory on a given server. Some may say that is a bad thing, but for Exchange 2007 that is actually a good thing. In this blog, I would like to explain some of the misconceptions about Exchange’s memory usage in relation to overall/allocated memory and the paging file and its usage. I previously blogged about Exchange 2007 memory usage at Understanding Exchange 2007 Memory Usage and its use of the Paging File, but it appears that more clarification is needed in this area. I am also going to show some real-world screenshots of customers’ actual perfmon log files that show good and bad behavior and how this might help you in troubleshooting what type of memory issue you might have, if any.

So let’s start with the paging file and its usage, as that appears to be a common question that comes up all of the time. Some of the questions stem from PF Usage in Task Manager, as shown below on a Windows 2003 server, and server monitoring software reporting this as a problem. PF Usage in Task Manager is the total number of committed pages in the system. This is not how much of the page file is currently being used; it is merely the amount of page file space that has been allocated should the OS need to page out the currently committed bytes.

[Screenshot: Windows 2003 Task Manager showing PF Usage]

In Windows 2008, Task Manager shows different terminology: PF Usage has been removed and replaced with just the word Memory.

[Screenshot: Windows 2008 Task Manager showing the Memory gauge]

There are other counters that show PF usage as well: Paging File\% Usage, which shows overall usage, and Process\Page File Bytes, which shows per-process page file allocation. The % Usage counter is about the same as what Task Manager PF Usage shows. It is just the amount of space that has been allocated should committed bytes need to get paged out, and it doesn’t indicate whether the PF is currently being utilized. Paging File\% Usage is a counter that monitoring software commonly flags as a potential problem, but in all reality it might not be. Other factors need to be looked at besides the amount of page file usage to get a clear indication of whether there is truly a problem.

Generally, page file usage should remain under 80% at all times, but there are times when the OS needs to make use of the paging file; one example is a working set trim operation. The following picture shows an example of this working set trim operation for store.exe, where the Memory\% PF Usage counter shows that PF usage increases at the same time the working sets are getting trimmed to satisfy some other application or driver request for allocating a contiguous memory block. You will also notice that PF usage never really bounces back after something like this happens and remains around a sustained average for the remainder of the time the server is online or until the server is rebooted. Unless you are getting close to the max Memory\% Committed Bytes In Use, we shouldn’t be too concerned with the PF usage unless we are seeing some high paging activity going on.

[Screenshot: Perfmon graph of a store.exe working set trim with a corresponding rise in page file usage]

With that said, you would not use PF Usage in Task Manager or Paging File\% Usage to determine if the paging file is currently being used. What you would use to monitor this is the amount of Memory\Pages/sec that is occurring. This counter is a combination of the Memory\Pages Input/sec and Memory\Pages Output/sec counters, which also includes access to the system cache for file-based operations to resolve hard page faults. Hard page faults occur when a page from a process is requested but does not exist in memory. This means that we have to pull that data directly from the paging or backing file. If the page is elsewhere in memory, then this is called a soft fault. These two counters will help you understand whether you are writing data (Pages Output) to the paging file or reading data (Pages Input) from the paging file, which might be affecting overall Exchange server performance. Hard page faults can result in significant delays in processing data on the server.

Counter Definitions
Memory\Pages/Sec
  - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.  It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec.  It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) non-cached mapped memory files.

Memory\Pages Input/sec - Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation. Compare the value of Memory\\Pages Input/sec to the value of  Memory\\Page Reads/sec to determine the average number of pages read into memory during each read operation.

Memory\Pages Output/Sec - Pages Output/sec is the rate at which pages are written to disk to free up space in physical memory. Pages are written back to disk only if they are changed in physical memory, so they are likely to hold data, not code. A high rate of pages output might indicate a memory shortage. Windows writes more pages back to disk to free up space when physical memory is in short supply.  This counter shows the number of pages, and can be compared to other counts of pages, without conversion.
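
A quick way to keep an eye on hard page fault activity is a sketch like the following; it assumes PowerShell 2.0 or later for Get-Counter, and the 1,000 pages/sec warning threshold is an arbitrary illustration rather than a documented limit.

# Sample Memory\Pages/sec and flag heavy hard page fault activity
Get-Counter -Counter "\Memory\Pages/sec" -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object {
        $value = $_.CounterSamples[0].CookedValue
        if ($value -gt 1000) { Write-Warning ("Pages/sec is {0:N0}" -f $value) }
        else { Write-Host ("Pages/sec is {0:N0}" -f $value) }
    }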

Recommended guidance states that the size of the paging file should be RAM+10MB for optimal performance and should be of static size and not system managed. Having a paging file set to system managed could cause page file fragmentation which could affect performance in memory pressure conditions, but Exchange generally should not be making use of the paging file for normal operations. If virtual memory is shown to be problematic or high due to other applications on the servers requiring it, you can increase the size of the paging file to RAM * 1.5 to help alleviate some of this memory pressure on the server to help back all of the committed pages in memory. If you are still having problems at this point, check for potential memory leaks within the processes on the server.

High paging in excess of 10,000/sec or more could indicate severe memory pressure or a working set trimming problem that I talked about previously in http://blogs.technet.com/mikelag/archive/2007/12/19/working-set-trimming.aspx.

The amount of available memory is another question that comes up regularly. The main performance counter to monitor for available memory is Memory\Available MBytes. This is the amount of physical memory that is available for process or system use. It is the sum of the Free, Zero, and Standby (cached) page lists. If you are on a Windows 2008 server and run Process Explorer viewing System Information, you will see these page lists referenced. Available RAM on any given Exchange 2007 server should not go below 100MB. After crossing the 100MB threshold, you are putting your server in a state vulnerable to working set trims when the Virtual Memory manager needs to process a memory allocation request and sufficient RAM is not available to service that request. Another counter to check, to cross-correlate why available memory is low, is Memory\System Cache Resident Bytes. Memory\System Cache Resident Bytes is part of the overall system cache, which is viewable via the Memory\Cache Bytes counter.

[Screenshot: Perfmon graph showing system cache growth reducing available memory ahead of a working set trim]

The above picture is a depiction of how System cache can affect available memory leading up to a working set trim. Notice in yellow that the Store cache remains consistent prior to the trim, so we know that Exchange did not cause this, but rather some other application. This could be some application making use of the file cache causing this increase. A simple file copy operation of a very large file from this server to another server will cause this problem. You can tame this system cache problem by using the Windows Dynamic Cache service shown at http://blogs.msdn.com/ntdebugging/archive/2009/02/06/microsoft-windows-dynamic-cache-service.aspx. In the above case, it was Antivirus software making use of memory mapped files.

Note: If available RAM is around 100MB, please do not RDP into the server and fire up the EMC for administration purposes. This will exhaust all RAM on the server and cause working set trim issues. Got to love that one, eh?

Next, I would like to talk about Committed Memory. There are two main counters that I look at when troubleshooting memory related issues to determine if we are truly running out of memory on a server. These counters are Memory\Committed Bytes and Memory\Commit Limit.

Memory\Committed Bytes is the amount of committed virtual memory, in bytes. Committed memory is the physical memory which has space reserved on the disk paging file(s). This counter displays the last collected value and is not an average.

Memory\Commit Limit is the amount of virtual memory that can be committed without having to extend the paging file(s). It is measured in bytes. Committed memory is the physical memory which has space reserved on the disk paging files (there can be one paging file on each logical drive). If the paging file(s) can be expanded, this limit increases accordingly. This counter displays the last collected value and is not an average. The Commit Limit is calculated by taking the amount of total RAM and adding that to the paging file sizes. This sum gives you the overall Commit Limit on any given server.

There are a few ways to view the current values of the Commit Limit and Committed Bytes. In Task Manager, you could view the Commit Charge (K) area as shown in the above screenshot. You can view these counters in Perfmon, and of course using Process Explorer shown below.

[Screenshot: Process Explorer System Information showing Commit Charge and Commit Limit]
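
Outside of Task Manager and Process Explorer, the same two values can be pulled with WMI; Win32_OperatingSystem reports them in kilobytes. This is a minimal sketch assuming local execution.

# Commit Limit and current commit charge, in MB, from WMI (WMI reports these values in KB)
$os = Get-WmiObject Win32_OperatingSystem
$commitLimitMB = [math]::Round($os.TotalVirtualMemorySize / 1KB)
$committedMB   = [math]::Round(($os.TotalVirtualMemorySize - $os.FreeVirtualMemory) / 1KB)
"Commit Limit : {0} MB" -f $commitLimitMB
"Committed    : {0} MB" -f $committedMB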

 

Real World Scenarios

Now that we have all of this knowledge, let’s take a look at some real world examples here.

The picture below shows a normal working server where the Store working set remains consistent throughout the lifetime of the store process due to the cache being warmed up or fully populated. This is where you get maximum performance from your Exchange server, since you are caching all of the data in memory instead of having to rely on paging information to/from the disks. You will also notice that available memory is just under 2GB. The amount of committed bytes is also nowhere close to the Commit Limit on the server.

[Screenshot: Perfmon graph of a healthy server with a steady store working set]

The following example shows that Committed Bytes is just about equal to the overall Commit Limit on the server. Any new memory allocations could fail, causing instability of your Exchange server. This problem was attributed to an incorrectly configured paging file on the server.

[Screenshot: Perfmon graph with Committed Bytes approaching the Commit Limit]

The next example shows an actual Store memory leak. As you can see, the Committed Bytes (blue), Private Bytes (pink) and Virtual Bytes (yellow) for Store are all increasing toward the overall Commit Limit (green). This occurred due to a recursive operation within the store process exhausting all of the memory on the server. A recursive operation can be thought of as a process where one of the steps is to repeat the process; this is similar to a loop with no ending and no way to break out of it.

[Screenshot: Perfmon graph of a store memory leak climbing toward the Commit Limit]

I hope this clears up some of the misconceptions behind the common phrase “Exchange is using all the memory”.

Exchange 2007 SP2 Auditing Whitepaper


Exchange 2007 SP2 has introduced some new Mailbox Access Auditing features to help log events when users access folders and messages either in their own mailbox or in another user's mailbox. I wrote a whitepaper on these new features at http://technet.microsoft.com/en-us/library/ee331009.aspx. This new access auditing will log accesses to messages and folders, which some customers have been wanting for a long time. So if you attempt to access another user's folder and open or read a message, Exchange will now log events in the new Exchange auditing log on the server. This only shows you the path of access to messages and folders, but does not specifically log deletions of messages in users' folders.

The whitepaper also discusses how you can set up auditing to track configuration changes to Exchange-related objects in Active Directory, so that if an administrator made a change to an Exchange configuration object that caused an outage, these events will now be logged in the domain controller's security event log. If your DCs are Windows 2008, you can see what the previous values were and what the newly changed value is, so if you need to change a setting back to the way it was before the outage, you have a rolling log of all of these changes.

If you have some time and wanted to read more about it, see the above link for more details. This took a lot of time and effort on my part to pull this together and test most of the configuration auditing pieces to ensure that we were logging the correct data. Hope you enjoy it.


New ADAccess Performance counters included with Exchange 2007 SP2


Exchange 2007 SP2 has a new set of ADAccess performance counters that only show performance data from domain controllers in the same site as the Exchange server. This new object is MSExchange ADAccess Local Site Domain Controllers. Previously, you had to use the MSExchange ADAccess Domain Controllers(*)\Local site flag counter in Performance Monitor to detect whether a domain controller was in the local site.

Here is a listing of the new counters. They are very similar to the MSExchange ADAccess Domain Controllers counters, but only for local DCs.

\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read calls/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search calls/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Searches timed out per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Fatal errors per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Disconnects per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\User searches failed per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\Bind failures per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\Long running LDAP operations/Min
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Pages/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP VLV Requests/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\Number of outstanding requests
\MSExchange ADAccess Local Site Domain Controllers(*)\DsGetDcName elapsed time
\MSExchange ADAccess Local Site Domain Controllers(*)\gethostbyname elapsed time
\MSExchange ADAccess Local Site Domain Controllers(*)\Kerberos ticket lifetime
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP connection lifetime
\MSExchange ADAccess Local Site Domain Controllers(*)\Reachability bitmask
\MSExchange ADAccess Local Site Domain Controllers(*)\IsSynchronized flag
\MSExchange ADAccess Local Site Domain Controllers(*)\GC capable flag
\MSExchange ADAccess Local Site Domain Controllers(*)\PDC flag
\MSExchange ADAccess Local Site Domain Controllers(*)\SACL right flag
\MSExchange ADAccess Local Site Domain Controllers(*)\Critical Data flag
\MSExchange ADAccess Local Site Domain Controllers(*)\Netlogon flag
\MSExchange ADAccess Local Site Domain Controllers(*)\OS Version flag
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read Time
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search Time

Unfortunately, upgrading to SP2 from an earlier service pack does not reload the MSExchange ADAccess counters, so you will have to do this manually. If you are installing Exchange using the SP2 binaries, you will have these new counters by default. To reload the MSExchange ADAccess counters, do the following:

  • Ensure that no other monitoring software is currently collecting performance counter data
  • Open a command prompt and change directory to the \Program Files\Microsoft\Exchange Server\Bin\perf\AMD64 directory
  • To unload the performance counters, type the following:
    unlodctr “MSExchange ADAccess”
  • To reload the counters, type the following:
    lodctr dscperf.ini
  • Restart the Exchange Services to successfully reload the counters. Note: This step is very important as Exchange opens file handles to the original counters that can only be reloaded with the restart of the Exchange Services.

For all of you that are collecting performance counters via WMI, you may notice that these new counters will not appear to be loaded. You can verify this by running perfmon /wmi to see if they are there. If they are not, you can transfer the PDH settings over to WMI by running wmiadap /f.
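
To confirm that the new counter set is visible after the reload, something like the following can be run (Get-Counter requires PowerShell 2.0 or later; on PowerShell 1.0, checking the object in Perfmon accomplishes the same thing). It simply lists the counters exposed by the new object.

# List the counters under the new local site ADAccess object
Get-Counter -ListSet "MSExchange ADAccess Local Site Domain Controllers" |
    Select-Object -ExpandProperty Counter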

Enjoy!!

How to monitor and troubleshoot the use of Nonpaged pool memory in Exchange Server 2003 or in Exchange 2000 Server


This article is a high-level overview of how to troubleshoot current Nonpaged pool memory usage on an Exchange server. It explains what can be done to help mitigate some of the underlying problems that may be consuming Nonpaged pool memory and demonstrates tools that can be used to help track down the processes or drivers consuming the most memory.

Nonpaged pool memory is a limited resource on 32-bit architecture systems. It depends on how the server is set up to manage memory and is calculated at system startup. The amount of nonpaged pool allocated on a given server is determined by the overall memory, the running/loaded drivers, and whether the /3GB switch has been added to the boot.ini file.

Nonpaged pool memory is used for objects that cannot be paged out to disk and have to remain in memory as long as they are allocated. Examples of such objects may be network card drivers, video drivers and Antivirus filter level drivers. By default, without the /3GB switch, the OS will allocate 256MB of RAM on a server for the Nonpaged pool. When the /3GB switch is added and the server is rebooted, this essentially halves the amount of Nonpaged pool memory on a given server to 128MB of RAM. The Windows Performance team has a table listed in http://blogs.technet.com/askperf/archive/2007/03/07/memory-management-understanding-pool-resources.aspx that discusses what the max pool memory resources can be on any given server. This link also discusses how to view the maximum amount of pool memory on any given server using Process Explorer. For Exchange servers, it is recommended to add the /3GB switch to the boot.ini file, with the exception of pure HUB or Front End (FE) servers, to allocate more memory to the user processes. As you can see, this limits how much you can load within that memory space. If this memory has been exhausted, the server will start becoming unstable and may become inaccessible. Unfortunately, since this memory cannot be paged in and out, you cannot resolve this problem without rebooting the server.

On Microsoft Windows 2003 64-bit operating systems, the Kernel Nonpaged pool memory can use as much as 128GB depending on configuration and RAM, which essentially overcomes this limitation. See 294418 for a list of differences in memory architectures between 32-bit and 64-bit versions of Windows. Currently, the only version of Exchange that is supported on a 64-bit operating system is Exchange 2007, so when working with previous versions of Exchange we may still run into this Nonpaged pool limitation.
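
On servers where this is a concern, the current usage can be trended with the Memory\Pool Nonpaged Bytes counter. Below is a rough PowerShell sketch (Get-Counter requires PowerShell 2.0 or later, so on older servers Perfmon or typeperf would be used instead); the 100 MB warning threshold is an arbitrary illustration, not a documented limit.

# Report current nonpaged pool usage in MB and warn above an arbitrary threshold
$sample = Get-Counter -Counter "\Memory\Pool Nonpaged Bytes"
$poolMB = [math]::Round($sample.CounterSamples[0].CookedValue / 1MB, 1)
if ($poolMB -gt 100) { Write-Warning "Nonpaged pool is $poolMB MB" }
else { Write-Host "Nonpaged pool is $poolMB MB" }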

Symptoms

When Nonpaged pool memory has been depleted or is nearing the maximum on an Exchange Server, the following functionality may be affected because these features require access to HTTP/HTTPS to function:

  1. Users connecting via Outlook Web Access may experience “Page cannot be displayed” errors.

    The issue occurs when nonpaged pool memory is no longer sufficient on the server to process new requests.  More information on troubleshooting this issue is available in the following KB article:
    Error message when you try to view a Web page that is hosted on IIS 6.0: "Page cannot be displayed"
    http://support.microsoft.com/?id=933844

    Note: If this resolves your OWA issue, it is recommended to determine what is consuming nonpaged pool memory on the server. See the Troubleshooting section of this document for help in determining what is consuming this memory.
  2. RPC over HTTP connections are slow or unavailable.

    If you experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003, it can indicate a depletion of Nonpaged pool memory. HTTP.sys stops accepting new connections when the available nonpaged pool memory drops under 20MB. More information on troubleshooting this issue is available in the following KB article:

    You experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003
    http://support.microsoft.com/?id=924047
  3. The IsAlive check fails on Cluster

    The cluster IsAlive checks for the Exchange HTTP resource on a cluster server may fail, causing service outages or failovers. This is the most common scenario that we see for Exchange 2003 clusters. When there is less than 20MB of nonpaged pool memory available, http.sys will start rejecting connections, affecting the IsAlive check.

    When nonpaged pool is becoming exhausted, the IsAlive check fails causing the resource to fail. Depending on your recovery settings for the HTTP resource in Cluster Administrator, we will try to either restart the resource or fail over the group. By default, we will try restarting the resource 3 times before affecting the group. If this threshold is hit, the entire group will fail over to another cluster node.
    To verify whether nonpaged pool has been depleted, you can look in two possible locations: the cluster.log file and the httperr.log.

    Cluster.log
    For the cluster.log file, you may see an entry similar to the following:

    00000f48.00000654::2007/05/16-17:16:52.435 ERR Microsoft Exchange DAV Server Instance <Exchange HTTP Virtual Server Instance 101 (EXVSNAME)>: [EXRES] DwCheckProtocolBanner: failed in receive. Error 10054.

    Error 10054 is equivalent to WSAECONNRESET which is http.sys rejecting the connection.

    Httperr.log
    In the httperr.log that is located in the %windir%\system32\logfiles\httperr directory on the Exchange Server, you may see entries similar to the following.

    2007-05-16 16:44:56 - - - - - - - - - 1_Connections_Refused -
    2007-05-16 16:50:42 - - - - - - - - - 3_Connections_Refused -
    2007-05-16 16:50:47 - - - - - - - - - 2_Connections_Refused -
    2007-05-16 17:16:35 - - - - - - - - - 5_Connections_Refused -

    This confirms that http.sys is rejecting connections to the server (a quick way to scan these logs for refused connections is shown just after this list). Additional information regarding this logging can be found in the following article:

    Error logging in HTTP API
    http://support.microsoft.com/?id=820729

    Additional information for this issue is available in the following KB:

    Users receive a "The page cannot be displayed" error message, and "Connections_refused" entries are logged in the Httperr.log file on a server that is running Windows Server 2003, Exchange 2003, and IIS 6.0
    http://support.microsoft.com/?id=934878
  4. Random Server Lockups or Hangs
  5. Certain operations failing because of the lack of memory to support new operations.
    Check the Application and System logs where common operations might be failing.
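
As referenced in the httperr.log section above, a quick way to check for refused connections without opening each log by hand is a short PowerShell search. This is just a sketch and assumes the default httperr log location mentioned earlier:

# Look for Connections_Refused entries across all httperr logs and show the most recent hits.
Select-String -Path "$env:windir\system32\logfiles\httperr\httperr*.log" -Pattern 'Connections_Refused' |
    Select-Object -Last 20 |
    ForEach-Object { $_.Line }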
Potential Workaround to provide immediate/temporary relief

If immediate relief is needed in any of these scenarios to prevent these rejections from occurring on a cluster server, you can add the EnableAggressiveMemoryUsage registry key on the server for temporary relief. When this is added, http.sys will only start rejecting connections once there is less than 8MB of Nonpaged pool memory available, overriding the 20MB default value. See 934878 for more information on setting this key. Note: Please use this only as a temporary method to get the Exchange cluster resources back online, and investigate the underlying cause of what is taking up the most Nonpaged pool memory on the server. Ideally, no more than 100MB of Nonpaged pool memory would be consumed overall on any given server.
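
For reference, a sketch of setting that value from PowerShell is below. The key path and value name shown here are my reading of KB 934878, so verify them against the article before applying, and remember to remove the value once the real nonpaged pool consumer has been found.

# Assumed per KB 934878: EnableAggressiveMemoryUsage (DWORD = 1) under the HTTP service parameters key.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters' `
    -Name 'EnableAggressiveMemoryUsage' -Value 1 -PropertyType DWord -Force
# http.sys reads this value at startup, so restart the HTTP service (or reboot) for it to take effect.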

Nonpaged Pool Memory Depletion events

When pool memory has been depleted, you may start receiving the following error in the System Event log stating that a specific pool memory has been depleted.

Event ID 2019
Event Type: Error
Event Source: Srv
Event Category: None
Event ID: 2019
Description:
The server was unable to allocate from the system NonPaged pool because the pool was empty.

If you are getting these events, then the server is most likely already very unstable or will be very soon. Immediate action is required to bring the server back to a fully functional state, such as moving the cluster resources to another node or rebooting the server that has this problem.
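
A quick way to check whether a server has already logged these pool-depletion events is to query the System log. Here is a simple sketch using the built-in Get-EventLog cmdlet:

# List the most recent Srv 2019/2020 (nonpaged/paged pool depletion) events from the System log.
Get-EventLog -LogName System -Newest 5000 |
    Where-Object { $_.Source -eq 'Srv' -and ($_.EventID -eq 2019 -or $_.EventID -eq 2020) } |
    Select-Object -First 10 TimeGenerated, EventID, Message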

Troubleshooting

There are a couple of different ways to view the amount of pool memory currently being consumed in real time, and the easiest one is Task Manager. Once you open Task Manager, click the Performance tab; in the lower right-hand corner you will see the pool memory usage highlighted. If nonpaged pool is 106MB or more, there is a possibility that the cluster IsAlive checks for the HTTP resource are failing or close to failing.

image

You can also view Nonpaged and Paged pool usage per process on the Processes tab in Task Manager. I’ve added the Paged Pool column since the same basic rules apply there too. To do this, select the Processes tab, select View on the menu, and then Select Columns. Add the Non-paged Pool, Paged Pool, and Handles columns as shown below.

image

Once these columns are added, you can view pool usage per process, which may help you track down which process is consuming the most memory. You can sort each column to look for the highest consumer. The Handles column is added to help determine whether any process with a large handle count is consuming a larger amount of nonpaged pool memory. (Note: A high handle count may affect either paged or nonpaged pool memory, so keep this in mind when analyzing the data.)

image
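
If you prefer the command line over Task Manager, the same per-process view can be approximated with a short PowerShell one-liner. This is a sketch; the figures come from the Process objects that Get-Process returns and are reported in bytes, converted to KB here for readability.

# Top ten processes by nonpaged pool usage, with handle counts and paged pool for comparison.
Get-Process |
    Sort-Object NonpagedSystemMemorySize -Descending |
    Select-Object -First 10 Name, Id, Handles,
        @{Name='NonPagedPoolKB'; Expression={[int]($_.NonpagedSystemMemorySize / 1KB)}},
        @{Name='PagedPoolKB';    Expression={[int]($_.PagedSystemMemorySize / 1KB)}}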

Another way of looking at handles for any given process is to use Process Explorer available here.  To add the handle count column, you would select View on the menu, then “Select Columns”, click the Process Performance tab, and then put a check box next to “Handle Count”. Click OK.

image

If you can’t determine from there what is consuming the memory, this may be a kernel related problem and not application specific. This will require some additional tools to determine what could be affecting the nonpaged pool memory.

One of the first things to look for is drivers that are more than two years old, which may have had issues in the past that were resolved in later driver releases. Running the Exchange Best Practices Analyzer tool (ExBPA), located here, can help report any drivers that are outdated or have previously been known to have issues. If ExBPA does not report any problems with the configuration of the server or any driver-related problems, further troubleshooting is necessary.

If the Windows Support Tools are installed, you can use a tool called Poolmon to view which specific tags are consuming memory. More information regarding Poolmon can be found in the Windows Support Tools documentation here. To run Poolmon, simply open a command prompt, type “Poolmon”, and then hit the “b” key to sort on the overall byte usage (Bytes) with the highest at the top. Anything you see that is highlighted means that there was a change in memory for that specific tag.
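
For example, Poolmon can also be started already sorted by bytes. The -b switch shown here is taken from the Windows Support Tools documentation referenced above; check poolmon /? on your server if the switch set differs:

# Start Poolmon sorted by the Bytes column; press p to cycle the paged/nonpaged views, b to re-sort.
poolmon -b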

In this view, you want to look at the top five consumers of memory, which should be listed at the top. For the most part, you will be looking at the first two columns, Tag and Type. The Tag is specific to a particular driver, and the Type column indicates what type of memory is being used: nonpaged pool (Nonp) or paged pool (Paged). You will also be looking at the Bytes column (shown in yellow), which shows the bytes in use for that particular tag.

clip_image005

The Allocs and Frees columns can be used to determine if a tag is leaking memory. If there is a large difference between these two columns for a particular tag, then there may be a leak in that particular tag and should be investigated.

The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component.

Where to get Pooltag.txt?

After installing the Debugging Tools for Windows located here, pooltag.txt can be found in the C:\Program Files\Debugging Tools for Windows\triage directory and normally has the most recent list of pool tags.

Pooltag.txt can also be obtained from the Windows Resource Kit:

http://www.microsoft.com/downloads/details.aspx?FamilyID=9D467A69-57FF-4AE7-96EE-B18C4790CFFD&displaylang=en

If the specific tag in question is not listed in pooltag.txt and is leaking memory, you can search for pool tags used by third-party drivers using the steps in the following article:

How to find pool tags that are used by third-party drivers
http://support.microsoft.com/default.aspx?scid=kb;EN-US;298102
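
The approach in that article boils down to searching the driver binaries for the tag string, since drivers embed their pool tags as literal text. Here is a sketch, using a made-up tag "AbCd" as a stand-in for whatever tag Poolmon flagged:

# Search all driver files for the literal tag string; /m prints only the matching file names.
cd "$env:windir\system32\drivers"
findstr /m /l AbCd *.sys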

Once you find which tag pertains to a specific driver, contact the vendor of that driver to see if they have an updated version that may help alleviate the memory leak.

Recommended remediation

  1. Install the recommended hotfixes for Windows 2003 server based clusters from 895092
  2. Run the Exchange Best Practices Analyzer (ExBPA) tool to ensure that the exchange server is configured optimally. (ie: SystemPages registry setting, any outdated network card drivers, video drivers or storage drivers (storport.sys or SAN drivers), Mount point drivers (mountmgr.sys), boot.ini settings, etc.)
  3. Ensure that Windows 2003 SP2 is installed. If SP2 is not installed, at a minimum, you need to apply the hotfix in 918976
  4. Ensure that the Scalable Networking Pack features have been disabled. See http://msexchangeteam.com/archive/2007/07/18/446400.aspx for more information on how this can affect Exchange Servers
  5. Upgrade ExIFS.sys to the version listed in 946799
  6. If using MPIO, ensure 923801 at a minimum is installed. 935561 is recommended. Also see 961640 for another known memory leak issue
  7. If Emulex drivers are installed, be sure to upgrade to the version listed here to help with nonpaged pool memory consumption.
  8. Disable any unused NICs to lower overall NPP memory consumption
  9. Update network card drivers to the latest version.
      • If Jumbo Frames are being used, be sure to set this back to the default setting or lower the overall frame size to help reduce NPP memory usage.
      • If Broadcom Drivers are being utilized and are using the Virtual Bus Device (VBD) drivers, be sure to update the drivers to a driver version later than 4.x. Check your OEM manufacturers website for updated versions or go to the Broadcom download page here to check on their latest driver versions.
      • Any changes to the Network Card receive buffers or Receive Descriptors from the default could increase overall NPP memory. Set them back to the default settings if at all possible. This can be seen in poolmon with an increase in MmCm pool allocations.
  10. Update video card drivers to the latest version. If any accelerated graphics drivers are enabled, go ahead and uninstall these drivers and switch the display driver to Standard VGA. Add the /basevideo switch to the boot.ini file and reboot the server.
  11. Check to see if the EnableDynamicBacklog setting is being used on the server which can consume additional nonpaged pool memory. See 951162.
  12. If you are still having problems with NonPaged pool memory at this point, then I would recommend calling Microsoft Customer Support for further assistance with this problem.

    Additional Reading

    Nonpaged Pool is over the warning threshold (ExBPA Rule)
    http://technet.microsoft.com/en-us/library/aa996269(EXCHG.80).aspx

    Understanding Pool Consumption and Event ID: 2020 or 2019
    http://blogs.msdn.com/ntdebugging/archive/2006/12/18/Understanding-Pool-Consumption-and-Event-ID_3A00_--2020-or-2019.aspx

    3GB switch
    http://blogs.technet.com/askperf/archive/2007/03/23/memory-management-demystifying-3gb.aspx

     

    How to fix/repair broken Exchange 2007 counters


    I commonly get calls on the inability to see performance counters in Performance Monitor (perfmon) and the inability to query them through WMI. I thought I would take some time to write about how to look for any problems with Exchange Performance Counters and then provide some high level insight on how to possibly fix them. Most of this information applies to Windows 2003 servers.

    If the counters are not being shown at all, the first place to check is the registry to see whether the counters have been disabled. Here is a snippet of what one of the registry keys should look like:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    If you also see a value named Disable Performance Counters in addition to the above default entries, and it is set to a nonzero value, the counters at one point had a problem loading and the operating system disabled them. Set the value to 0 and then close and reopen Perfmon to see if the counters are visible again. More information on the Disable Performance Counters setting can be found here. If this works for you, then whew, that was an easy one….
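
    If you would rather check and reset that value from PowerShell, a minimal sketch is below, using the ESE key from the snippet above as the example; substitute the Performance key of whichever service is affected.

    # A missing value or a value of 0 means the counters are enabled; anything else means they were disabled.
    $key = 'HKLM:\SYSTEM\CurrentControlSet\Services\ESE\Performance'
    (Get-ItemProperty -Path $key).'Disable Performance Counters'
    Set-ItemProperty -Path $key -Name 'Disable Performance Counters' -Value 0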

    If the Performance key is missing for a particular service, then we have bigger problems. I am not sure what causes this key to get removed, but if the key is not there, Perfmon or WMI does not know how to load the counters. There are a couple of required parts that you need to understand before we can load any performance counter, not just the Exchange ones. The key pieces needed to reload any performance counter are the following:

    • Performance key must be created under the specified service
    • Library path must be specified to the appropriate DLL for the service
    • A PerfIniFile must be specified, which is the name of the ini file that will reload a specific service's performance counters
    • Lastly, we need to have the Close, Collect, and Open values which specify what method is used to retrieve the Performance Counter Data. Note: This is unique to each service, so they will not always have the same information

    If we have these key pieces of information in the registry, we have the ability to reload that service's performance counters. Taking the ESE example above, if we open a command prompt, navigate to the C:\Program Files\Microsoft\Exchange Server\bin\perf\AMD64 directory and then type lodctr eseperf.ini, this will reload the counters for ESE. If the counters were loaded successfully, we should now see that the First Counter, First Help, Last Counter and Last Help values have also been added, as shown below. These values correspond to specific data that was loaded into the Perflib library.

    image

    If everything went well and you reopen Perfmon, you should now see the counters loaded. If they have not loaded, refresh the registry to see if the Disable Performance Counters value shows back up. If not, check the Application log for Perflib errors, which should provide additional information regarding why the counters did not load successfully.

    If you don’t know already, on Windows 2003 servers you can actually pull up the WMI view of the performance counters using the command Perfmon /WMI. If you do not see the newly added counters there, they have not been synchronized with the WMI repository yet. To help force this along, you can run wmiadap /f to force the reload of all counters into the WMI repository.

    If this was successful, you will now see some additional Wbem entries as shown in the below pictorial.

    image

    Pulling up Perfmon /WMI again should hopefully show the counters that you are looking for. In some cases, monitoring software can still not pick up the newly added counters until the WMI service (Windows Management Instrumentation) has been restarted.

    If you ever wanted to unload performance counters, one might think that you could simply unload them by running unlodctr eseperf.ini. Unfortunately, this will not work, because the unlodctr utility requires that a service name be passed in instead of the ini file. To find the actual name of the service, open eseperf.ini; at the top of the file you should notice an entry similar to the following:

    [info]
    drivername=ESE

    Ahh, there is the service name. Now if I run unlodctr ESE, it succeeds. Doing this will remove the First Counter, First Help, Last Counter and Last Help values from the registry.
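
    If you would rather not open the ini file by hand, the service name can be pulled out and fed straight to unlodctr. This is a sketch assuming the default install path used earlier in this post:

    # Read the drivername= line from eseperf.ini and unload that service's counters.
    $ini = 'C:\Program Files\Microsoft\Exchange Server\bin\perf\AMD64\eseperf.ini'
    $driver = (Select-String -Path $ini -Pattern '^drivername=').Line -replace '^drivername=', ''
    unlodctr $driver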

    Hopefully you are still with me at this point. Now, what happens if the performance registry keys for all of your services went missing? What do you do, flatten the box and reinstall to get them back? Well, unfortunately, there is no direct way of recreating these registry keys, as they are created during the installation of Exchange.

    The majority of the folks just export the data from another server, clean out any of the data that references performance counter data from the old server and then import them on the affected server. This does in fact work and is what I am going to talk about next on how to recover from a complete Performance key meltdown.

    Attached to this post is a zip file that contains all of the Performance keys across various different role combinations such as MBX, CAS, HUB, HUB/CAS, HUB/CAS/MBX. I’ve done all of the dirty work for you, so all you have to do is to perform some very simple modification steps to the files and then you are in business.

    CAUTION!!!: DO NOT IMPORT these registry keys if the Performance registry keys already exist as it will overwrite the data that currently exists in the registry and could potentially break your Performance counters that are currently working. If you only need to reload the Performance key for a single service, then pull out the data for that specific service, save it to a reg file and then import only that data. Basically use it as a reference point to help get you back running again.

    If you feel the need to use these reg import files because all of the performance keys are missing for all services, simply open the file that pertains to the roles you have installed and verify that the paths point to the correct library files. By default, Exchange is installed in the C:\Program Files\Microsoft\Exchange Server directory, so if Exchange was installed outside of the default directory, you will need to update the file manually. Let’s take the ESE performance key below:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    Here you will see that library has the following value:

    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"

    What you will need to do is replace this path with the path in which you actually installed Exchange. If you installed Exchange on D: in the following directory (D:\\Program Files\\Microsoft\\Exchange Server\\bin), you would simply need to modify the first part of the path to show D:\\ instead of C:\\. A quick find and replace should work to hit all of the Performance keys. If you installed it in another directory outside of the default paths, then you have a little more work to do to replace the path information. Just remember that for each backslash (\), you have to include double backslashes (\\) to allow the reg files to import properly.
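
    If you would rather script the find and replace, here is a minimal sketch. The file name ExchangePerfKeys.reg is a placeholder for whichever role file you are editing, and the search pattern uses four backslashes because -replace treats it as a regular expression that has to match the two literal backslashes in the file:

    # Swap the drive letter on the default install path and write the file back out as Unicode,
    # which is the encoding regedit expects for exported .reg files.
    (Get-Content .\ExchangePerfKeys.reg) -replace 'C:\\\\Program Files', 'D:\\Program Files' |
        Set-Content .\ExchangePerfKeys.reg -Encoding Unicode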

    There are only a handful of entries you have to manually modify, so this really shouldn’t take too long. Once you have the paths changed, save the appropriate file as a .reg file and import it by double-clicking on the file. Verify the Performance reg keys are good and valid by opening the Registry Editor to verify.

    Once the keys have been verified in the registry and look good, you can then run the PowerShell script to reload all of the Exchange performance counters. Simply copy the ReinstallAllPerCounters.ps1.txt file to the Exchange server and then remove the .txt extension from the file. Open the Exchange Management Shell and then run the script. The screenshot below shows each ini file attempting to be loaded. Of course, on my server, I already had all of the performance keys, so the script simply reported that the counters were already installed.

    clip_image002[6]

    Note: If you would like to transfer this data to WMI, simply type Y when asked.

    Once this has completed, be sure to check the application event log for details on any counters that failed to load. If everything went well, voila, you should have most if not all of your Exchange Performance Counters back once again.

    If the counters are still not showing up in WMI for whatever reason, you can run the following two commands to clear the WMI ADAP cache and then re-sync the counters to hopefully kick-start things once again.
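
    Based on the winmgmt documentation linked below, the two commands in question are most likely the ADAP maintenance switches shown here; treat this as a best guess and confirm against that page before running them:

    winmgmt /clearadap
    winmgmt /resyncperf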

    See http://msdn.microsoft.com/en-us/library/aa394525(VS.85).aspx for more information on some of the additional commands included with the winmgmt command.

    Hopefully this will help you out trying to get your Exchange performance counters going once again.

    The case of the slow Exchange 2003 Server – Lessons learned


    Recently we received a case in support with an Exchange 2003 server where message delivery was slow and the Local Delivery queue was getting backed up. The Local Delivery queue was actually reaching in to the two thousand range and would fluctuate around that number for extended periods of time.

    So we collected some performance data and all RPC latencies, disk latencies, CPU utilization and many of the other counters that we looked at did not show any signs of any problems. <Scratching Head>

    This is actually a common problem that I have seen where the server is responding OK to clients and everything else appears to be operating normally except for the local delivery queue that continually rises. Even disabling any Anti-virus software on the server including any VSAPI versions does not resolve the problem. So we essentially have a case of a slow Exchange server with no signs of performance degradation using any normal troubleshooting methods.

    The reason may not seem obvious, so let me show you what this common problem is that I have seen in these situations. This not only applies to Exchange 2003, but also to later versions of Exchange.

    Some companies need to journal messages to holding mailboxes, either on the same server or a different server, to maintain a copy of all messages sent in the organization for compliance purposes. These journaling mailboxes can get quite large and require a special level of attention to ensure that the mailbox sizes and item counts for those mailboxes are kept within reasonable levels. They somewhat defy our normal recommendations/guidance, because item counts in these folders can reach tens of thousands of items rather quickly, depending on the amount of mail sent within your organization.

    Generally, that special level of attention needed for journaling mailboxes is overlooked. For each journaling mailbox, you need a process that will not only back up the items in these folders, but also purge the data out of the mailbox once the backup has been taken. This purging process is necessary to maintain acceptable performance levels on an Exchange server. If these mailboxes are on their own server, user mailboxes are not normally affected. If these mailboxes are on the same server as user mailboxes, then this is where you might run into some problems.

    In this case that we received, we had found a journaling mailbox that had almost 1.5 million items in the mailbox that was 109GB in size as shown in the below screenshot. Wow!! That is a lot of items in this one mailbox.

    huge journal mailbox-fixed

    If you tried to log on to this mailbox using Outlook, the client would most likely hang for 5-10 minutes trying to query the number of rows in the message table to generate the view that Outlook is trying to open. Once this view is created, you should be able to view the items and get control of the Outlook client back. You may think that you could simply go in and start removing/deleting items from this mailbox to lower its overall size. Try as you might, you will most likely end up trying to do this for days, since the performance impact of this number of items in the mailbox makes it a very painful process. Making any modifications to the messages in these folders causes the message tables to be updated, which for this number of items is simply going to take an exorbitant amount of time.

    Our standard recommendation for Exchange mailboxes on Exchange 2003 servers is to have item counts under 5,000 items per folder. This guidance can be found in the Understanding the Performance Impact of High Item Counts and Restricted Views whitepaper here.

    A simple troubleshooting step would be to dismount the mailbox store that this mailbox resides in to see if the message delivery queues go down. If all of the queues flush for all other mailbox stores, you have now found your problem.

    If you absolutely need to get into the mailbox to view some of the data, an Outlook client may not be the way to do some housecleaning. An alternative is to use the MFCMAPI tool to view the contents of the mailbox. MFCMAPI can be configured to only return a certain number of items at any given time. If you pull up MFCMAPI’s options screen, you can change the throttling section to limit the number of rows that are displayed. If you were to put 4800 items in the highlighted section below, you would essentially limit the number of rows or messages queried when the folder is opened to the number you have entered. This will make viewing some of the information a little bit easier, but it would still be very cumbersome.

    clip_image002

    There are a couple of workarounds that you can do to clean this mailbox out.

    • If the data in the mailbox is already backed up, you could disable mail for that mailbox, run the cleanup agent and then create a new mailbox for the user. Note: the size of the database will still be huge and will increase backup and restore times even if you recreate the mailbox. If you are finding that the backups are taking a long time, you may want to think about using the dial-tone database in the next suggestion, or possibly moving the mailboxes on this store to a new database AFTER you have cleaned out the problem mailbox and then retiring the old database.
    • If the mailbox database houses only this one mailbox, you could simply dial-tone that database, starting with a fresh database. Instructions on how to do this can be found here
    • Purging the data out of the mailbox using Mailbox Manager or some 3rd-party tool may work, but keep in mind that you will most likely experience a performance problem on the server while the information is cleaned out of the mailbox, and it could possibly take hours to run

    Long live that 109GB/1.5million item mailbox!!! :)

    Another way to possibly find the high item count user is to use the PFDavAdmin tool to export items counts in users mailboxes. Steps on how to do this can be found here.

    These cases are sometimes very tough to troubleshoot, as any performance tool that you might try to use to determine where the problem lies will not show anything on the surface. Even though the Exchange server is still responding to RPC calls in a timely fashion, any expensive call such as a query rows operation will surely slow things down. If you see that things are slow on your Exchange 2003 server and perfmon does not show anything glaring, one of the first things I check is item counts in users' mailboxes, looking for the top high-item-count offenders. Exchange 2007 can have other reasons for this slowness, but that would be another blog post in and of itself.

    So the moral of the story is this: if you have large mailboxes in your organization that are used as journaling mailboxes, resource mailboxes, or by some type of automatic email processing application that might use Inbox rules to manipulate data in the mailbox, then regardless of whether those mailboxes are backed up, you need to be absolutely sure that the item counts in their folders are kept to a reasonable size, or they will bring an Exchange server to a crawl when trying to process email.

    Just trying to pass on some of this not so obvious information…….

    Improved PAL analysis for Exchange 2007


    I wanted to write a post regarding a lot of work that has gone into updating the Exchange Server 2007 PAL xml threshold files in order to make them more relevant and to more accurately report on Exchange Performance problems. This update couldn’t have been done without the help from Chris Antonakis who was one of the major contributors to all of these updates. Kudos to Chris on this one.

    There are some major updates that you need to be aware of when running the PAL tool to analyze Exchange performance problems, and the Mailbox role threshold file saw the biggest change in how to look at things.

    Shown below is the selection for the Mailbox role threshold file, which includes a few new questions. These questions help break down performance problems specific to database drives, log file drives and pagefile drives in the resulting report. Previously, this was an all-encompassing generic analysis, which didn’t really give you the full picture of actual bottlenecks, as there are latency differences between the database and log file drives.

    image

    Adding Database Drive letters is quite easy, and gathering the data for this input can be collected from various areas such as ExBPA and in the BLG file itself. These drive letters could also include Volume Mount Points.

    If you know the drive letters already, then that is great. Let’s say your database drives were Drive E:, Drive F: and Drive G:; you would need to enter them separated by semicolons, such as E:;F:;G: as shown in the screenshot above. You would also need to do this for the log file drives and the page file drives for a more accurate analysis.

    Using an ExBPA report of the server and the Tree Report view is the best way to get the drive letter and volume mount point information, but sometimes a BLG file may provide enough information regarding volume mount points based on the naming convention that was used (keep in mind that although a volume mount point is named “<Drive Letter:>\Logs”, it may actually contain database files or no files at all). The screenshot below shows the Logical Disk counter with the volume mount point names. Unfortunately, we don’t have a scripted way to pull the data out of the blg file at this time, so this is a manual process.

    image

    For the above information, assuming all the _DATA volume mount points contained Exchange databases, you would start entering data in the question as the following:

    S:\SG01_SG04_DATA;S:\SG05_SG08_DATA;S:\SG09_SG12_DATA

    You get the idea… Just remember that all drives and mount points need to be separated by a semicolon and you should be good.

    Now, it’s important to note that we have included a catch-all Generic Disk analysis in case any of the drive questions were not answered. So, if you ran a report and forgot to enter any drive information, you will get output similar to the following in the Table of Contents. This may lean you towards an actual disk-related problem due to the number of times an analysis crossed over a threshold. You will see that there were 527 disk samples taken in this perfmon, and all database, log and page file drives have the same alert count. This is actually expected, because a tripped threshold is now logged for each drive-type-specific analysis whose question was not answered, with the data falling through to the Generic Disk analysis. If you see this, go directly to the Generic analysis to look at your disk results.

    image

    For each one of the thresholds that tripped in which drive letters were not entered, you will see an entry in the table similar to the following stating that no data was entered in the questions. You can either ignore this and view the Generic Disk analysis or re-run the analysis with the questions correctly answered, providing a more accurate analysis.

    image

    The same holds true for the Hub Transport and Client Access server disk analysis.

    Another question that was added to the Mailbox server role analysis was ClientMajority which specifies if the majority of the clients are running in cached mode or not. This setting directly affects the analysis of the MSExchange Database(Information Store)\Database Cache % Hit counter.

    image

    Database Cache % Hit is the percentage of database file page requests that were fulfilled by the database cache without causing a file operation, i.e. not having to read the Exchange database to retrieve the page.  If this percentage is too low, the database cache size may be too small.

    Here are the thresholds that were added for this particular analysis.

    • WARNING - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 90%
    • ERROR - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 75%
    • WARNING - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 99%
    • ERROR - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 85%

    The last question that was added was CCRInUse. This question helps differentiate analysis for CopyQueueLength and ReplayQueueLength between CCR and LCR replication since we have different recommended values for each configuration.

    image

    There was also an update for the HUB and HUB/CAS role threshold files where you can now specify drive information for both the Exchange Transport queue file drives and the Page File Drives.

    image

    Additionally the 64bit question was removed from all the Exchange Server 2007 PAL threshold files, since Exchange 2007 is only supported in production on a 64bit Windows operating system.

    It’s probably also important to point out that while we’ve managed to get all of the thresholds corrected and updated, and a number of new analysis rules added, we haven’t necessarily managed to update or include all of the associated rule and troubleshooting text that goes with each analysis rule. These will be updated as we get more time; for now, it is more important to migrate all of the PAL 1.0 Exchange content to the new PAL 2.0 that will be available sometime in the near future.

    To download the latest XML files, go to the XML update page here or use the direct download here

    If you are interested in the other changes that were made to the 3 threshold files here they are below:

    MBX:

    • Change RPC slow packets (>2s) more than 0 to only trigger on average value as per online documentation.
    • Updated RPC Average Latency to warn on 25ms average (as per online guidance), warn on 50ms max and critical on 70ms max or average.
    • Added MSExchangeIS\RPC Client Backoff/sec to warn on greater than 5.
    • Modified MSExchangeIS Client: RPCs Failed: Server Too Busy to only create a warning for greater than 50 and removed the error alert for greater than 10 seeing as this counter is mostly useful to know if Server Too Busy RPC errors have ever occurred (since it is calculated since store startup)
    • Modified MSExchangeIS\RPC Requests to warn on 50 instead of 70 as higher than 50 is already too high and to then error on 70.
    • Removed the MSExchangeWS\Request/Sec counter from Web Related as MSExchangeWS does not exist on a MBX server.
    • Added _Total to instance exclusions for disk analysis.
    • Added _Total to instance exclusions for MSExchange Search Indices counters.
    • Added _Total to instance exclusions for various other counters.
    • Created a generic disk analysis for when either the log drives, database drives or pagefile drives is unknown.
    • Added in a warning alert threshold for Calendar Attendant Requests Failed when it is greater than 0.
    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Updated MSExchange Database(Information Store)\Version buckets allocated to alert on greater than 11468 instead of 12000 i.e. 70% of 16384.
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Added a content indexing section for the Exchange 2007 indexing counters
    • Added analysis for ExSearch processor usage to warn on more than 1% and error on more than 5%
    • Added analysis for MSFTEFD* processor usage to warn on using more than 10% of the Store.exe processor usage
    • Updated .Net CLR Memory\% Time in GC to include * for process and exclude _Global. Removed 5% threshold and made 10 and 20% threshold warning and error conditions respectively.
    • Updated MSExchange Replication\ReplayQueueLength and CopyQueueLength Counters to exclude _Total
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB:

    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Removed Process\%Processor Time from the Process category as it is already included as part of Processor\Excessive Processor Use By Process
    • Modified Memory\Available MBytes to warn on less than 100MB and critical on less than 50MB
    • Added threshold alerts for Memory\% Committed Bytes in Use to warn on greater than 85% and critical on more than 90%
    • Added Memory\Committed Bytes
    • Corrected Memory\Pages Input/Sec to warn on greater than 1000 as it was set to warn on greater than 10
    • Added threshold alert for Memory\Pages Output/Sec to warn on greater than 1000
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Modified Memory\Transition Pages Repurposed/Sec to warn on spikes greater than 1000 instead of 100
    • Modified Memory\Transition Pages Repurposed/Sec to critical on averages greater than 500 instead of 1000
    • Modified Memory\Transition Pages Repurposed/Sec to critical on spikes greater than 3000 instead of 1000
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchangeTransport Queues(_total)\Messages Queued for Delivery Per Second
    • Removed all MSExchangeMailSubmission Counters as they are only on MBX
    • Removed MSExchange Database ==> Instances Log Generation Checkpoint Depth - MBX as this was for MBX role
    • Modified MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\Log Threads Waiting to warn on greater than 10 and error on 50
    • Added an error alert for MSExchange Extensibility Agents(*)\Average Agent Processing Time (sec) to error on greater than 60 average
    • Collapsed all Database counters under MSExchange Database category
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Moved Process(EdgeTransport)\IO* counters into EdgeTransport IO Activity category
    • Updated MSExchange Database(*)\Database Page Fault Stalls/sec to MSExchange Database(edgetransport)\Database Page Fault Stalls/sec
    • Updated MSExchange Database ==> Instances(*)\I/O Database Reads Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Reads Average Latency
    • Updated MSExchange Database ==> Instances(*)\I/O Database Writes Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Writes Average Latency
    • Added _Total exclusions where necessary
    • Removed 64bit question
    • Added a question for pagefile drive
    • Added edgetransport as an exclusion to Memory\Memory Leak Detection
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated .Net\CLR Memory\%Time in GC to include * for process and exclude _Global. Removed 5% threshold and made and 20% threshold warning and error conditions respectively.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds 

      CAS:

    • Created a new CAS file based off of the common updates in the new MBX xml
    • Updated ASP.NET\Request Wait Time to warn on greater than 1000 max and error on 5000 max
    • Updated ASP.NET Applications(__Total__)\Requests In Application Queue to error on 3000 rather than 2500
    • Updated MSExchange Availability Service\Average Time to Process a Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange Availability Service\Average Time to Process a Cross-Site Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange OWA\Average Response Time to warn on max greater than 100 and more than 2 OWA requests per second on average
    • Updated MSExchange OWA\Average Search Time to warn on max greater than 31000
    • Updated MSExchangeFDS:OAB(*)\Download Task Queued to warn on avg greater than 0
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated ASP.Net Requests Current to warn on greater than 1000 and error on greater than 5000 (max size it can get to is 5000 before requests are rejected)
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB/CAS:

    • Combined both HUB and CAS XMLs for analysis of combined roles.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

