Out Of Memory Problems -

I was inspired to write this article by some recent problems with programs failing because they ran out of memory. Understanding the problems proved a little tricky and the 'standard' tools used were slightly misleading.

The Problem

Memory is one of the main resources needed by a computer program, but unfortunately it can be more difficult to deal with than some other resource types. There are three main reasons for this:

Firstly, there are typically lots of requests for memory. Modern programming styles allocate large numbers of dynamic objects from the heap and it is common to count the number of memory allocations in thousands, if not millions. Contrast this with file handles, for example, where a typical program might only open a handful of files. Techniques that involve manually tracking each individual resource may be viable for some types but are rarely effective for memory use.

Secondly, all memory in use looks much the same; if you examine program memory it is normally hard to identify what a particular memory location is being used for and where a particular byte was allocated. Indeed, just finding the boundaries between memory allocated in separate allocation requests is often a hard problem. Again, most resource types do not have this problem; the identifier and contents of the resource usually help with both getting the use of the resource and finding its original allocation, and the boundary of the resource (file handle, TCP/IP connection, etc) is clear.

Finally, behind the scenes, memory is complicated. We're used to dealing with this complication with other resource types and expect to deal with different failure modes. For example, failing to connect to a URL in Java will raise an IOException but the actual runtime type of the exception could be a MalformedURLException, a ProtocolException, a SocketException, an UnknownHostException, etc. This range of exception types helps to differentiate between the various underlying causes for the failure and hence resolve the cause of the error. Not so with memory (at least in the environments I know well) - memory allocation failures all look pretty much the same.

For example, the man page for malloc on many systems says: "If there is an error, returns a NULL pointer and set errno to ENOMEM." Some systems add EAGAIN as an additional value, but this focuses on whether it's worth retrying, not on why the error happened.

There are a number of tools available on most development systems that provide automated memory tracking. This typically adds some runtime overhead and/or uses additional memory when it is being used, but does provide a great deal of help with the first two points (tracking the large number of individual memory allocations). In my experience there's less help with understanding the complexity of memory allocation in cases where this is contributing to the memory issue.

What might out of memory mean?

But surely, you might think, "out of memory" only has one cause - there's no memory left? Alas it isn't that simple, but to understand why not we have to expand a little on memory management.

Application

In most high level languages the application programmers allocate the bulk of their memory implicitly. For example when adding two strings together the runtime library takes care of allocating the memory for the new concatenated string; the runtime library is also responsible for deallocating the memory when the string is no longer required, perhaps using deterministic deallocation when the object goes out of scope or with the help of a garbage collector.

However, either explicitly or behind the scenes, the address of the allocated memory is needed. In most desktop architectures the address is a simple number; in some architectures a more complicated structure, perhaps involving a segment and an offset, may be used. Whatever the actual scheme though, these addresses have a finite range of values.

This provides the first, and most fundamental, limit to the amount of memory your process can allocate. The allocation may fail simply because there are no valid addresses left unused in the address space of the process.

So, for a typical 32bit process on Windows, valid addresses range from 0x00010000 to 0x7FFEFFFF [1]. If a process is using that entire range then it won't be able to allocate any more memory and even purchasing another gigabyte of RAM won't help.

Additionally, for common operating systems (where code and data share the same address space) the range of available addresses is further restricted by any program code that is loaded into memory. Many environments also support code modules (DLLs or sharable libraries) which are often loaded at a variety of addresses in memory and further fragment the range of available addresses.

Other memory allocations may also come out of the same address range. For example, stack space for the main process and any additional threads, the address range for memory mapped files or shared memory regions, and in some operating systems areas of the address space are reserved for interprocess communication or access to hardware components.

Runtime library

Within the runtime library the memory allocator will request memory from the operating system to fulfil the allocation requests made to it. Since operating system requests are usually quite expensive the runtime is often implemented by allocating big segments from the operating system and sub-allocating this memory to satisfy application requests.

There can be three main problems with such allocators.

Firstly, the allocator itself usually returns memory aligned to a suitable address for all operations - perhaps aligned to an 8 or 16 byte boundary - and also needs memory internally to manage the allocated memory. This can produce a large total overhead if your application makes many small requests. For example, consider the following simple C++ program:


#include 

int main()
{
   int count = 0;
   while ( new(std::nothrow) char )
      ++count;
   return count;
}
When I ran this program on my desktop machine it returned the value 89,134,808. But I've got much more memory than that on my machine! The problem is with the overhead of the memory allocator and my example program is a 'worst case' where the overhead dwarfs the size of each allocation. Application programmers who allocate many small objects typically find alternative strategies to reduce this overhead; these may include using a pooled allocator or dealing with arrays of objects rather than individual ones.

In the second case memory can get fragmented - where the allocation and deallocation pattern of the application leads to small gaps between the memory cells remaining in use. Again, depending on memory usage, these small gaps can add up to a sizable amount of wasted memory. This problem is fairly well explored and most memory allocation implementations adopt a variety of schemes to try and reduce the amount of fragmentation. Languages that use garbage collection have an additional trick in that the collector may choose to reduce fragmentation by compacting memory that remains allocated during a garbage collection.

Finally you may be using multiple memory allocators (for example, a mixed-language program or one using modules each with their own allocator). Each allocator is holding some memory internally that is not in use but has not been returned to the operating system and a request for memory using one allocator may fail because the other allocators have all the available memory in use. This is a difficult problem to solve and causes some headaches for browser writers (among others) trying to support Java virtual machines, Flash plugins, scripting engines and 'normal' HTML rendering all running inside the same process.

Operating system

Looking at the requests made to the operating system these may fail for multiple reasons. Firstly there may be limits set for each process, including how much memory allocation it is allowed, and so the request may be denied when such a limit is reached. Then the operating system's own memory manager may have restrictions on the addresses returned, possibly based on the natural page size of the hardware architecture.

On 32-bit windows, for example, the VirtualAllocate call returns memory addresses aligned to a 64K boundary. This provides yet another restriction on the number of OS allocations that can be successful.

Consider the following simple C program for Windows:

#include 
#include 

int main()
{
   int count = 0;
   while ( VirtualAlloc( 0, 1, MEM_COMMIT, PAGE_READWRITE ) )
   {
      ++count;
   }
   printf( "Allocated %i bytes.\nLast error code %li\n",
      count, GetLastError() );
}
When I execute this on my desktop machine I get this result:
C:>VirtualAllocateOneByte
Allocated 32664 bytes.
Last error code 8
Error code 8 is ERROR_NOT_ENOUGH_MEMORY - so I've run out of memory after allocating less than 32Kb!! What I've actually run out of is addresses that are on a 64K boundary.

Finally, we may actually be out of physical memory. Every byte of physical memory might be in use by some process, device drive or the OS itself. However most operating systems support "memory overcommit" where more memory can be allocated than the machine physically supports. The 'extra' memory is provided by swapping memory out to disk and transparently re-mapping the physical memory addresses freed up to other addresses in the same, or another, application.

So even if we are out of physical memory we might still get something back from an allocation request to the operating system. The OS may swap some existing data out to disk and give us the memory so freed up. Or the OS might reserve some space in the swap file and return us an address range that will only be swapped in when it is accessed. Different operating systems make different decisions on how liberal they are at this point. Each operating system seems to have its own preferred approach and many provide configurable choices!

Windows takes a conservative approach and only returns memory if it either really has the memory or it has successfully reserved space in the swap file; when you hit the swap space limits the allocation fails. This is a conservative policy and means the swap file expands to be bigger than may be strictly necessary but on the positive side you can't get a process failure when you later try to write to the allocated memory. Solaris also uses a conservative ('eager') approach by default but this can be configured (see the 'vm_swap_eager' attribute). Linux takes a more liberal approach (The operating system that likes to say 'Yes') and by default returns allocated memory with only a quick check for 'obvious' overcommitment. This has the advantage that the swap space usage is less but the disadvantage that a subsequent attempt to actually use the memory may fail because there is no room to expand the swap space. Later versions of Linux provide some kernel configuration settings (see vm.overcommit_memory and vm.overcommit_ratio) that allow you to configure the behaviour. You may (or may not) find this extract from the relevant man pages reassuring: "Depending on the percentage you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate."

Lastly, many operating systems allow the user to reserve spaces in the address range and commit and free individual pages in the range as required. Many operating systems also use this technique to manage the stack - the stack is created in an address range of the maximum stack size but marked as 'reserved' and actual pages of memory are only committed to the stack as it grows. This 'lazy allocation' avoids up-front use of megabytes of memory for simple programs that have a shallow stack.

Putting it all together

For a memory allocation to fail any one of a number things may fail. First off we may not have an available address range in our address space for the allocated memory - both size and alignment may be important here. We may have inadequate resources for the runtime allocator and any housekeeping it needs. We may have exceeded some allocation quota, or lack a suitable target for an allocation from the operating system or finally we may have run out of free physical memory or disk space.

In the problem cases that inspired the article the limiting factor was the address space, but for slightly different reasons in each case.

In one case the application used a number of vectors internally and, after some analysis, we discovered that the out of memory problem occurred when a large vector had an additional item added. This extra item hit the current capacity of the vector and so required a re-allocation of the underlying memory buffer. The buffer was 100 Mb in size and the vector class used had a simple doubling algorithm when resizing which meant a request was made for 200 Mb of contiguous memory for the new buffer. This request failed because there was no single free section in the address range that was this large. The application failed with an out of memory error although it had only allocated 800 Mb or so of memory and was running on a machine with 4 Gb of physical memory. This was on Windows and I failed (then) to find a standard tool that helped analyse this problem. The debugger Windbg from Microsoft Debugging Tools for Windows came close with its "!vadump" command but this simply prints the details of each memory region allocation by the operating system to the process and I wanted to find the largest contiguous free region. Eventually I wrote a quick tool that scanned the address range and reported the size of the largest free memory region; implementation of a derivative of this tool is described below.

One solution to this problem is to replace the vector (that requires a single memory range) with a container that allocates in chunks, such as the deque. Another solution is to preset the capacity of the vector to the final allocation size, but this is hard to do in general! In our case we changed the resize algorithm for large vectors to avoid the doubling and hence reduce the need for such large allocations.

In a different case we had a message switching application written in Java running on a 32bit JVM on Solaris. The application would generate out of memory errors even though tools such as 'top' showed our application was using less than 2 Gb of memory (out of a possible limit over well over 3Gb) and we had many gigabytes of free physical memory on the machine.

The fundamental problem here also turned out to be running out of address space - our application was creating a pair of threads for each connection it was managing and each thread was reserving a stack of 1 MB. The threads all had a shallow call depth, so in fact only the first 8 Kb of each stack was actually being committed, the rest of each stack was just left as reserved address space. By the time you have a thousand threads in the program that means nearly 1 Gb of address space reserved and hence not available for other allocations.

There were three possible solutions we explored here. The first solution was to reduce the number of threads, but this proved hard in our case because the threads were being created by a third party library that we were not able to change. We explored reducing the size of the thread stack, but failed to find a mechanism that allowed us to do so (in some cases operating system APIs provide the feature, but higher level abstractions may not expose it to application code). The eventual solution we went for in this case was to run in a 64 bit virtual machine. This slightly increased the actual amount of allocated memory used but enormously increased the available address range. The 1 Gb of reserved space drops from one quarter of the whole range in a 32bit program to a tiny fraction of the available 64 bit address range.

However, we found in this case too a lack of standard tools that provided diagnostics for the 'out of addressability' problem. In this case we eventually looked at the data from the /proc pseudo-filesystem and used some scripting to identify the reserved memory ranges.

A Useful (Windows) Tool

Earlier this year SysInternals [2] released vmmap, which is a Windows tool that provides a nice graphical view of the address space for a process and also provides the summary (totals and largest address ranges in various categories). I suspect this tool was written because running out of addressability is becoming more of an issue for other programmers too.

Screen Shot

Under the covers

I thought is would be instructive to provide source code for a program which prints the address ranges for a Windows process in a similar fashion to vmmap.

The program expects a list of process IDs on the command line and simply prints the virtual address map, and a summary line, for each one.

The algorithm obtains the user-mode address range from GetSystemInfo and iterates through the range using VirtualQueryEx() to get the memory information for each address block. Once the complete range has been read the details for each block and the summary is produced (Note that the MEMORY_BASIC_INFORMATION structure holds additional information that this simple program does not display).

Here is some sample output from this program in action:
C:>virtmap 1812
Processing: 1812
       8K at 00010000 committed
      56K at 00012000 free
       4K at 00020000 committed
      60K at 00021000 free
     240K at 00030000 reserved
...
       4K at 7FFDF000 committed
       4K at 7FFE0000 committed
      60K at 7FFE1000 reserved
Committed: 10Mb, reserved: 6Mb, free: 2031Mb (biggest: 1607Mb)
Here is the sample code itself (virtMap.cpp):

#include <windows.h>
#include <iostream>
#include <iomanip>
#include <iterator>
#include <string>
#include <vector>

// This class holds the virtual memory map for a process
class VirtMap
{
public:
  VirtMap();
  int VirtMap::process( HANDLE hProcess );
  void printOn( std::ostream& os ) const;

private:
  void add( MEMORY_BASIC_INFORMATION const & mb );

  std::vector<MEMORY_BASIC_INFORMATION> m_info;
  ULONG m_commit;
  ULONG m_reserve;
  ULONG m_free;
  ULONG m_biggest;
};

// Stream a memory map
std::ostream & operator<<( std::ostream & os, VirtMap const & virtMap )
{
  virtMap.printOn( os );
  return os;
}

// Stream a memory block
std::ostream & operator<<( std::ostream &os, MEMORY_BASIC_INFORMATION const & mb)
{
  os << std::setw(8) << mb.RegionSize / 1024 << "K at " << mb.BaseAddress;
  switch( mb.State )
  {
    case MEM_COMMIT: os << " committed"; break;
    case MEM_FREE: os << " free"; break;
    case MEM_RESERVE: os << " reserved"; break;
  }
  return os;
}

// Class VirtMap implementation
VirtMap::VirtMap()
: m_commit( 0 )
, m_reserve( 0 )
, m_free( 0 )
, m_biggest( 0 )
{
}

// Handle a single process
int VirtMap::process( HANDLE hProcess )
{
  int ret(0);
  SYSTEM_INFO SystemInfo;
  GetSystemInfo( &SystemInfo );

  PVOID baseAddress( SystemInfo.lpMinimumApplicationAddress );

  while ( baseAddress < SystemInfo.lpMaximumApplicationAddress )
  {
    MEMORY_BASIC_INFORMATION mb = { 0 };
    if ( ! VirtualQueryEx( hProcess, baseAddress, &mb, sizeof( mb ) ) )
    {
      int ret = ::GetLastError();
      std::cerr << "Cannot query memory at: " << baseAddress << ": " << ret << std::endl;
      break;
    }
    add( mb );

    baseAddress = (PCHAR)mb.BaseAddress + mb.RegionSize;
  }
  return ret;
}

// Add an item to the memory map
void VirtMap::add( MEMORY_BASIC_INFORMATION const & mb )
{
  m_info.push_back( mb );

  if ( mb.State & MEM_FREE )
  {
    m_free += mb.RegionSize;
    if ( m_biggest < mb.RegionSize )
      m_biggest = mb.RegionSize;
  }
  else if ( mb.State & MEM_RESERVE )
  {
    m_reserve += mb.RegionSize;
  }
  else if ( mb.State & MEM_COMMIT )
  {
    m_commit += mb.RegionSize;
  }
}

// Print the memory map
void VirtMap::printOn( std::ostream& os ) const
{
static const ULONG megabyte = 1024L * 1024L;

  std::copy( m_info.begin(), m_info.end(), 
    std::ostream_iterator<MEMORY_BASIC_INFORMATION>(os, "\n" ) );

  os << "Committed: " << m_commit / megabyte << "Mb"
    ", reserved: " << m_reserve / megabyte << "Mb"
    ", free: " << m_free / megabyte << "Mb"
    " (biggest: " << m_biggest / megabyte << "Mb)";
}

int main( int argc, char **argv )
{
  int ret = 0;

  for ( int idx = 1; idx != argc; ++idx )
  {
    int const pid = atoi( argv[idx] );
    HANDLE const hProcess = OpenProcess( PROCESS_QUERY_INFORMATION, FALSE, pid );

    if ( hProcess == 0 )
    {
      ret = ::GetLastError();
      std::cerr << "Unable to open process " << pid << ", error: " << ret << std::endl;
      break;
    }
    else
    {
      std::cout << "Processing: " << pid << std::endl;
      VirtMap virtMap;
      ret = virtMap.process( hProcess );
      CloseHandle( hProcess );
      if ( ret != 0 )
        break;
      std::cout << virtMap << std::endl;
    }
  }

  return ret;
}

Conclusion

As we reach the end of the road for 32 bit operating systems memory once again has become a scarce resource -- but this time round it is usually not physical memory that we're short of but managing to get enough address space in our processes to access the memory we need.
I hope this overview of some of the complexity of memory management will give people some help with identifying the cause of 'out of memory' problems.

Notes and Resources

  1. Newer Windows versions support the /3GB switch which potentially adds an extra gigabyte of addressability to a process
    See http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx for more details.
  2. SysInternals is at http://technet.microsoft.com/en-us/sysinternals/.

Copyright (c) Roger Orr - rogero@howzatt.demon.co.uk
Published in CVu 21.3 (July 2009).
$Id: OutOfMemory.html,v 1.1 2009/07/15 19:00:33 Roger Exp $