Advanced programming tips, tricks and hacks for Mac development in C/Objective-C and Cocoa.

StreamToMe 2.0 is out (and Cocoa With Love is 2 years old).

Two different things turning two different kinds of "2". Snap.

Cocoa With Love is two years old today

I've been writing a feature article every week for two years now — haven't missed a single week yet (hope that doesn't jinx me).

To commemorate, I've put the complete list of downloadable code samples from my blog on the Projects With Love: Open Source Projects page.

StreamToMe 2.0 is out!

For those who don't know, StreamToMe is an iPhone application for streaming video and music from your Mac to your iPhone/iPod Touch. Plug your phone into your TV and you can use it to stream your Mac's music and video files to your TV. No prior conversion necessary — a huge range of codecs and formats (including AVI, WMV, MOV, MP4, FLV and more) will just play.

StreamToMe 2.0 adds:

  • Thumbnail previews for video files
  • Album artwork for music files (including while playing)
  • Artist and title metadata display
  • Subtitle rendering (including SRT and SSA files, MKV embedded SSA tracks and MOV text tracks)
  • Multiple audio tracks
  • Continuous and random play modes
  • Sort options including folder flattening (so you can view the full hierarchy of your iTunes collection as a single flat collection)
  • Preliminary support for EyeTV streams
  • Persistent playback — if you quit or are interrupted by a phone call during playback, you can continue from where you were.

And more!

StreamToMe is available to purchase from the iTunes App Store:

Download StreamToMe 2.0 from the AppStore now

Free upgrade for existing customers. Lucky them! Don't forget to update ServeToMe to enable some of the newer features.

Windows XP support for ServeToMe is coming soon (I'm hoping by the end of March).

Here's a few screenshots of the new version of StreamToMe in action:


Resolving a path containing a mixture of aliases and symlinks

Resolving symlinks in a path is very easy in Cocoa (it can be done in a single statement) but aliases require more work. Additionally the commands for resolving symlinks and aliases are incompatible with each other — meaning that you can resolve a path containing symlinks or aliases but not a mixture of the two. In this post, I present a category on NSString that will allow you to resolve a path containing any combination of symlinks or aliases as simply as resolving symlinks alone.

Introduction

When Apple introduced alias files in System 7, they were the only way of referencing another file in the filesystem. Sure, other operating systems already had symlinks but there was little likelihood of the two needing to interact since System 7 had almost no interoperability with Unix filesystems (despite the existence of A/UX).

Alias files are a great way of referencing other files because they can continue to track their target if it moves. Aliases store both the path of their target and the inode plus volume identifier of their target. This means that they won't break if the file at their target path moves, unlike a symlink which will break if the target moves (since a symlink is just a path). In fact, symlinks will often break if the symlink file itself moves since symlinks are normally relative paths.

Aliases do have some annoying traits. In their current incarnation, they insist on storing the thumbnail and metadata of data of their target, resulting in a file which may be hundreds of kilobytes (or more) of entirely redundant data — especially annoying if you have thousands of aliases. Despite this, they are normally a good, general-purpose way of linking files in the filesystem.

As a "user-exposed" feature (unlike symlinks which are largely hidden from novice users) they are also common. It's unfortunate then that they're so annoying to deal with in Cocoa.

Resolving aliases

Resolving a symlink is easy:

NSString *resolvedPath = [path stringByResolvingSymlinksInPath];

Theoretically, you only need one command to resolve an alias:

OSErr err = FSResolveAliasFile(
    &fsRef, resolveAliasChains, &targetIsFolder, &wasAliased);

Unfortunately, the FSRef here is a campy 1990's throwback (the 90's are now two decades ago) and in Cocoa programming, you rarely have or want an FSRef for a file. This means that you need to convert your NSString to an FSRef and then convert the result back to an NSString when you're done.

Apple's Low Level File Management Topics include an approach for resolving an alias from an NSString path which demonstrates this procedure. While I use an implementation derived from this in my solution, Apple's original implementation suffers from the following problems:

  • The CFURLGetFSRef function used to convert a CFURL into an FSRef will fail if the path contains an alias at anywhere other than the last path component.
  • While CFURLGetFSRef will follow symlinks in the URL to create the FSRef, no part of this code will actually return a resolved symlink, so that part will require a separate step.
  • The function FSResolveAliasFile will present a user dialog if the alias points to a volume which is not mounted. While potentially desirable in a user application, this is undesirable in all other cases.

This final point is not too difficult — we'll replace FSResolveAliasFile with FSResolveAliasFileWithMountFlags which allows us to disable the user dialog using the flags. But the remaining two points will require a little more work to address.

As a further comment about usage of FSResolveAliasFileWithMountFlags: aliases that point to other aliases are exceedingly rare (if you try to create an alias to an alias in the Finder, the Finder will make the second alias point directly to the target) so I pass false for resolveAliasChains to optimize for the unchained case and handle the unusual case of chained aliases at a different level in the code.

Breaking it down into solvable components

We can resolve paths that contain any number of symlinks and we can resolve a path that contains an alias but we can't do both at once.

The solution is therefore straightforward:

  1. break the path down into components
  2. build the components together, iteratively resolving aliases or symlinks at each level
  3. implement code for resolving the symlink or alias as efficiently as possible for the bottom level of this scenario

Each of these points will then be a different tier in a three level solution.

My solution will contain two requirements:

  • The initial path must be resolvable to an absolute path using -[NSString stringByStandardizingPath]
  • The path contained within a symlink file will not include any aliases or symlinks except as the final string component (no recursive parsing)

This second point is mostly a theoretical limitation since it is nearly impossible to generate a symlink with an alias or other symlink as a non-final path component (you'd have to create the symlink file manually).

Top level

The top level of the solution is simply an iteration over path components which then invokes the iterative link resolution.

// Break into components.
NSArray *pathComponents = [path pathComponents];

// First component ("/") needs no resolution; we only need to handle subsequent
// components.
NSString *resolvedPath = [pathComponents objectAtIndex:0];
pathComponents =
    [pathComponents subarrayWithRange:NSMakeRange(1, [pathComponents count] - 1)];

// Process all remaining components.
for (NSString *component in pathComponents)
{
    resolvedPath = [resolvedPath stringByAppendingPathComponent:component];
    resolvedPath = [resolvedPath stringByIterativelyResolvingSymlinkOrAlias];
    if (!resolvedPath)
    {
        return nil;
    }
}

I haven't shown the code which resolves the path to an absolute path or fails but the assumption that it begins with a "/" is valid.

Middle level

The middle level of the solution iterates over a path where only the final component could be a symlink or alias and resolves it until the result is neither an alias nor symlink.

For efficiency, this does two things in an unusual way:

  • I use lstat instead of -[NSFileManager attributesOfItemAtPath:error] since I only need the st_mode field, and NSFileManager invokes lstat internally anyway.
  • I use my own -[NSString stringByConditionallyResolvingSymlink] method instead of - [NSString stringByResolvingSymlinksInPath] since I know that only the final component requires resolution (I've already done the work for earlier components).
- (NSString *)stringByIterativelyResolvingSymlinkOrAlias
{
    NSString *path = self;
    NSString *aliasTarget = nil;
    struct stat fileInfo;
    
    // Use lstat to determine if the file is a directory or symlink
    if (lstat([[NSFileManager defaultManager]
        fileSystemRepresentationWithPath:path], &fileInfo) < 0)
    {
        return nil;
    }
    
    // While the file is a symlink or resolves as an alias, keep iterating.
    while (S_ISLNK(fileInfo.st_mode) ||
        (!S_ISDIR(fileInfo.st_mode) &&
            (aliasTarget = [path stringByConditionallyResolvingAlias]) != nil))
    {
        if (S_ISLNK(fileInfo.st_mode))
        {
            // Resolve the symlink component in the path
            NSString *symlinkPath = [path stringByConditionallyResolvingSymlink];
            if (!symlinkPath)
            {
                return nil;
            }
            path = symlinkPath;
        }
        else
        {
            // Or use the resolved alias result
            path = aliasTarget;
        }

        // Use lstat again to prepare for the next iteration
        if (lstat([[NSFileManager defaultManager]
            fileSystemRepresentationWithPath:path], &fileInfo) < 0)
        {
            path = nil;
            continue;
        }
    }
    
    return path;
}

The stringByConditionallyResolvingAlias method returns nil if the path exists but isn't an alias, allowing this function to double as both a test for whether the path is an alias as well as the resolution of that alias. I could use a similar approach to test and resolve symlinks (since I have also implemented a stringByConditionallyResolvingSymlink method) but I don't do this for aforementioned efficiency reasons: it would cause an extra fetch of the filesystem metadata which is the main bottleneck of the whole procedure.

Bottom level

The bottom level is then just the implementation of the stringByConditionallyResolvingAlias and stringByCondictionallyResolvingSymlink. The first is just a modification of Apple's code to address the issues I've already discussed — you can see the final product by downloading the code. The second method looks like this:

- (NSString *)stringByConditionallyResolvingSymlink
{
    // Get the path that the symlink points to
    NSString *symlinkPath =
        [[NSFileManager defaultManager]
            destinationOfSymbolicLinkAtPath:self
            error:NULL];
    if (!symlinkPath)
    {
        return nil;
    }
    if (![symlinkPath hasPrefix:@"/"])
    {
        // For relative path symlinks (common case), resolve the relative
        // components
        symlinkPath =
            [[self stringByDeletingLastPathComponent]
                stringByAppendingPathComponent:symlinkPath];
        symlinkPath = [symlinkPath stringByStandardizingPath];
    }
    return symlinkPath;
}

Hooray, I finally used NSFileManager in a post about files! Yes, once again it's probably just a wrapper around the C function readlink that I could invoke for myself but the fact that the NSFileManager method handles the nasty business of buffer allocation, sizing and string conversion is more than enough reason to forego the lower level function.

Conclusion

You can download NSString+SymLinksAndAliases.zip (3kb) which contains all the code discussed in this post (plus a few other related methods) as a category on NSString.

Usage of the category is as simple as importing the header and writing:

NSString *fullyResolvedPath = [somePath stringByResolvingSymlinksAndAliases];

The fullyResolvedPath will either contain the destination (as an absolute and fully resolved path) or it will be nil (if the path can't be fully resolved because it doesn't exist or can't be read for some reason).

I've tried to keep this code efficient by keeping the number of filesystem calls low. The code will certainly handle at least a few thousand alias resolutions per second on my computer but I haven't pushed it much harder than that.

Of course, if you're an iPhone programmer, all of this is a waste of time since the iPhone doesn't publicly expose FSRef (although CFURLGetFSRef exists to generate a pointer which is totally unusable). In fact, I'm not sure aliases are possible on the iPhone anyway.

The differences between Core Data and a Database

The Core Data Programming Guide clearly states that Core Data is not a database but since both Core Data and a database are ways of providing searchable, persistent storage, exactly how and why they are different may not be clear. In this post, I'll look at the way Core Data works and the reasons why its features and capabilities are different to those of common SQL databases — even though an SQL database may be used as the backing store.

Introduction

Both Core Data and an SQL database provide a means of persistently storing structured data in a searchable store.

Since programmers are generally familiar with databases and since Core Data is actually backed by an SQLite database, it is understandable that Core Data is often treated and used as though it were a wrapper around SQLite.

It is important to realize that although you can use Core Data in this way (in fact, it works very well like this), that Core Data actually operates over a different domain to SQLite — meaning that it provides lots of services that SQLite doesn't but also that Core Data can't provide some of the services that SQLite can. Even for services that both technologies provide, there are different performance considerations.

Primary function of a database

The somewhat narrow description of database that I will use is: persistent and searchable storage for data in table, row, column format where the primary goal is to keep the data up-to-date on disk at all times, with a secondary goal of powerful, focussed, narrow fetching and updating capabilities.

There are lots of database implementations and many provide features far beyond this description but I'm really looking at the key components of a straightforward SQLite-style database implementation with which many programmers are familiar.

Despite many databases being called "relational" (which implies that they have a degree of support for object connectivity), SQLite and many other relational databases don't handle the mechanics of connecting objects; maintaining state (like a an object relation) between columns, rows or tables is left to the user of the database. In this sense, a database is "dumb" storage — rows have few behaviors beyond "read" and "write" and extending or customizing their behavior would involve extending the database system itself. Even when triggers are available, their programmatic capabilities are limited.

Primary function of Core Data

At its heart, Core Data is an object graph manager with lifecycle, searching and persistence features. In the case of Core Data, object graph management includes:

  • You can connect object A to object B and the connection at both the A and B ends is kept perpetually in sync.
  • If you change the connection at the A end, the B end will be updated and all changes trigger notifications (to which you can attach arbitrary code)
  • Deletion of objects at one end of a connection can trigger cascading deletion or nullify responses.
  • The other end of a connection can exist out of memory (faulted) unless the connection is actually followed at which time the connected object is loaded.

Unlike a typical database, Core Data can be used totally in-memory. While Core Data is often called a "object persistence" framework (since it is expected that you will use its persistence features), it is actually possible to use Core Data entirely in-memory without any form of persistence. Of course, once you have data, it is reasonable to want to keep that data, so persistence is treated as a major feature of the framework but it is important to know that persistence is not mandatory.

Also unlike a database, it is possible to use Core Data without any form of searching. If you allocate and connect all your objects, all you need to do is hold onto one of them and you can walk through everything without needing a fetch request. This search-less behavior is not because Core Data transparently performs a search: once data is loaded in memory, following a connection does not involve a search.

All Core Data objects are fully instantiated Objective-C objects that manage most of their own attributes, relationships and lifecycle. This also means that their properties and behaviors are implemented by methods, making them both observable and overrideable.

Databases and object graph management are not inherently exclusive

While not available by default in SQLite, the "foreign key" of MySQL, and other SQL:2003 compliant databases, can handle keeping identifiers in different tables in sync and even cascade deletion when requested. The programmatic customization abilities of overrideable objects is not available but the basics of graph management are there.

There are other object relational frameworks that work closer to Core Data's model but which try to behave as an atomic, transactional database. To update the object graph, these frameworks must:

  • load appropriate rows from a database
  • instantiate objects from these rows
  • make changes to the graph objects that are now in memory
  • commit the changes back to the database

To be properly atomic, these steps must all be performed as a single transaction (with no other reads or write to the affected rows during the transaction). While some systems might require this, it is far too slow for a general object graph system.

Core Data does not follow this model as Core Data aims to be a more general object management system — and that means that it needs far better performance and flexibility than this model would allow.

Operating in-memory versus an on-disk database

Without access to the source code, it's not entirely clear. We can only assume that the NSManagedObjectContext tracks instantiated objects in a heap or structured container of some form so that it can find them again.

This tracking structure may behave a bit like an in-memory database but it would only be for tracking the existence of instantiated objects — it is important to note that it would not itself store any data. Also note that the centralized NSManagedObjectContext and any structures it might maintain are not how you interact with instantiated NSManagedObjects — you interact with an NSManagedObject by sending messages to NSManagedObject pointers.

The reason that Core Data focuses on an in-memory representation is speed. For object graph changes that affect multiple objects, it is much faster if they are all in memory already, rather than needing to search for them again in the database.

For temporary objects (data that doesn't have to be saved to disk) Core Data can create, change and manipulate objects much faster than SQLite can since SQLite has to update indexes and update nodes in the B-tree, as well as simply allocating space and setting values. Core Data can allocate millions of objects in a few seconds, where SQLite might take a few minutes for the same number of allocations.

The tradeoff with an in-memory approach is that SQLite is still used as the backing store. Reading from disk and saving to disk involves all of SQLite's overheads plus the overhead of the Core Data to SQLite conversion process — so is invariably slower than SQLite alone.

Common database tasks that Core Data doesn't do

I've spoken about features that Core Data has that databases do not. It is important to consider some of the features that Core Data's approach lacks with respect to a database.

Core Data cannot operate on data without loading the data into memory

In SQL you can simply "DROP tableName" to delete whole tables or update every column of a table with commands like "UPDATE tableName SET key1 = value WHERE key2 = otherValue". These commands can efficiently update vast amounts of data because they only need to load small amounts of data into RAM at any given time.

Core Data doesn't work in this perpetually on-disk manner — it only works on objects in memory. Even if you only want to delete an object, it must be loaded and instantiated in RAM.

Of course, this is necessary because the object, and its potentially overridden behaviors must be loaded and invoked. There are also connections to be kept up to date with other objects.

However this constraint has implications: if you're trying to change huge numbers of objects (tens of thousands or more) you will need to consider keeping your memory footprint down. This can be done by periodically refaulting unchanged values (refreshObject:mergeChanges:) or avoiding the fetch of an object's data (setIncludesPropertyValues:NO on NSFetchRequest) or even saving the whole context and releasing all the objects you're holding.

Core Data does not handle data logic

There are a few data-related features that SQL contains, a good example being "unique" keys, that Core Data does not include.

There are a couple of technical reasons why this might be the case. Subclasses can override the getter and setter for an attribute to the point where it is unclear whether it is or is not unique. In fact, transient Core Data attributes need not even support isEqual:.

However, I suspect the distinction is actually that Core Data offers no real support for attribute behaviors at all. Core Data manages the "graph" (connections) but the data attributes are all the responsibility of the business logic in the rest of the program; convenient as it may be, it falls outside Core Data's conceptual domain.

Multi-threaded, multi-user scenarios

Core Data does not offer any amount of threading support. To be fair, SQLite is single threaded too but many other databases are multi-threaded and multi-user.

Core Data has been designed for single-user environments (running inside desktop and iPhone apps). Getting rid of threading and locking makes the framework much faster and simpler to work with in its standard usage scenarios.

However, there are still situations where you will want multiple threads reading your data. NSManagedObjects and their NSManagedObjectContext should be accessed from a single thread only. If you need another thread working on the same data, you need to save the file and reopen using a different NSManagedObjectContext in the other thread.

Summary

DatabaseCore Data
Primary function is storing and fetching dataPrimary function is graph management (although reading and writing to disk is an important supporting feature)
Operates on data stored on disk (or minimally and incrementally loaded)Operates on objects stored in memory (although they can be lazily loaded from disk)
Stores "dumb" dataWorks with fully-fledged objects that self-manage a lot of their behavior and can be subclassed and customized for further behaviors
Can be transactional, thread-safe, multi-userNon-transactional, single threaded, single user (unless you create an entire abstraction around Core Data which provides these things)
Can drop tables and edit data without loading into memoryOnly operates in memory
Perpetually saved to disk (and often crash resilient)Requires a save process
Can be slow to create millions of new rowsCan create millions of new objects in-memory very quickly (although saving these objects will be slow)
Offers data constraints like "unique" keysLeaves data constraints to the business logic side of the program

 

Custom build rules, generated tables and faster floating point

As fast as computers are, heavy use of floating point functions can still slow them down. One way around this, is to use a lookup table instead of calculating floating point values at runtime. But keeping a generated table up-to-date is annoying work. In this post, I'll show you how to create a lookup table automatically using a custom build rule, making an OpenGL animation 5 times faster in the process.

Introduction

In this post, I'm going to work with a modified version of Apple's default OpenGL ES 1.0 program. In this program, a colored square will spin in the center of the window:

spinner.png
Download the complete project for this post: Spinner.zip (25kb).

To artificially create a situation where this simple program is bound by CPU performance, I will calculate the location of the square thousands of times for every frame.

Initial code and performance

The initial code for updating the model matrix before drawing is:

static float transX = 3.14159265f * 0.5;
static float transY = 0.0f;
for (int i = 0; i < 2e5; i++)
{
    translateX = sinf(transX)/2.0f;
    translateY = sinf(transY)/2.0f;
    doNothingff(&translateX, &transX);
    doNothingff(&translateY, &transY);
}
transX += 0.075f;
transY += 0.075f;

glTranslatef((GLfloat)(translateX), (GLfloat)(translateY), 0.0f);

This is a very simple calculation: it translates out to the edge of the unit circle and progresses around the circle by 0.075 radians per frame.

The code loops 2e5 (i.e. 20,000) times and passes its variables by reference into a dummy function (to prevent the compiler optimizing the loop away).

On my iPhone 3G, this runs between 5 and 6 frames per second.

Faster with tables

The reason this code is slow, is the sinf() function — the iPhone simply isn't meant for heavy floating point arithmetic.

However, this code only uses 84 steps to get around the circle. We can precalculate all of those points, store them in a table and use the table instead.

static int tableIndexX = -1;
static int tableIndexY = 0;
if (tableIndexX == -1)
    tableIndexX = SinTableSize / 4;

for (int i = 0; i < 2e5; i++)
{
    translateX = SinTable[tableIndexX];
    translateY = SinTable[tableIndexY];
    doNothingif(&tableIndexX, &translateX);
    doNothingif(&tableIndexY, &translateY);
}
tableIndexX = (tableIndexX + 1) % SinTableSize;
tableIndexY = (tableIndexY + 1) % SinTableSize;

glTranslatef((GLfloat)(translateX), (GLfloat)(translateY), 0.0f);

This code is the same as the previous code except instead of calculating each point, it simply steps through an array of values.

The SinTable just looks like this:

const float SinTable[84] = 
{
    0.000000f,
    0.037365f,
    0.074521f,
    0.111260f,
    //... 80 more entries ...
};

With the change to a lookup table, the code now runs at 26-28 frames per second. A 5 times speed increase.

Using code to generate code with a build rule

If you've ever used lookup tables, you'll know that they can be annoying to maintain. If you want to change them, you need to regenerate them and then reintegrate them into your project.

With a custom build rule, the regeneration and reintegration can be done automatically.

The code that generates the SinTable looks like this:

enum
{
    ProcessName = 0,
    OutputFile,
    TableSteps,
    ArgumentCount
};

int main (int argc, const char * argv[])
{
    if (argc != ArgumentCount)
        return 1;
    
    FILE *outputFile = fopen(argv[OutputFile], "w");
    long tableSteps = strtol(argv[TableSteps], NULL, 10);
    
    fprintf(outputFile, "const float SinTable[%ld] = \n", tableSteps);
    fprintf(outputFile, "{\n");
    for (long i = 0; i < tableSteps; i++)
    {
        fprintf(outputFile, "    %ff,\n", 0.5f * sinf(i * 2.0f * 3.14159265f / tableSteps));
    }
    fprintf(outputFile, "\n};\n");
    fprintf(outputFile, "const long SinTableSize = %ld;", tableSteps);
    
    return 0;
}

I put it in a file named SinTable.gen.c then I added a custom build rule to the target (Project→Edit Active Target→Rules).

The custom build rule looks like this:

buildrule.png

What this does is:

  1. Takes any file that ends in .gen.c
  2. Compiles it to a .gen file in the Derived files directory
  3. Executes the .gen program to generate a gen.result.c file in the same directory
  4. Tells Xcode that it produces the gen.result.c file and Xcode will then build it according to Xcode's standard C compilation rules, integrating it with the rest of the program.

Custom Build Rules versus Custom Build Phases

Many people are familiar with Custom Build Phases for running scripts during their builds. It is important to understand the differences between the two:

Custom Build RuleRun Script Custom Build Phase
Is an arbitrary /bin/sh scriptIs an arbitrary /bin/sh script
Runs once for input of a given typeRuns exactly once per build
Only runs when an input of the relevant type changesRuns on either every build (no inputs/outputs specified) or when specifically nominated "input" files are newer than nominated "output" files.
Outputs can be automatically picked up by Xcode for further processing by other build rulesXcode doesn't touch outputs — you must handle all processing yourself
Only visible in the Target's Settings windowVisible in the Group Tree by expanding the Target

Custom Build Rules are ideal for generating files that will be picked up by another build rule (for example, generating source code or .o files). Custom Build Phases are a better option for applying build numbers or handling the final built product.

Conclusion

Download the complete project for this post: Spinner.zip (25kb).

5 times faster, easy to implement and maintain.

Of course, lookup tables have a major limitation: they use memory. This table was only 84 entries long, so there's no real problem but if you need thousands of entries, you'll need to test to make sure you're actually gaining performance. Also, if the table is never in cache (because you're using huge amounts of memory between lookups), you're unlikely to save much time.

The point of this post though was to demonstrate build rules. This build rule keeps the lookup table up-to-date on every build and won't run unless the .gen.c file changes, so it won't recompile unnecessarily. The key strength of the build rule is that it will automatically run on all ".gen.c" files we happen to add to this target and the outputs are automatically picked up by Xcode and compiled and linked with the rest of the target.

These two points are the key advantages of Build Rules in Xcode: running on every file of a given type and integration into the build pipeline. If you only need your script to run once per build (instead of once per file) and you don't need Xcode to further handle the outputs, you can use Run Script Custom Build Phases instead.

Finding the cause of performance issues in your programs

This post is a response to reader questions asking how, in my previous post on Replacing Core Data Key Paths, I knew that the biggest performance problem in my initial approach was the incremental reallocation of NSMutableSet. In this post, I'll look at basic time and memory profiling on the Mac and talk about the most common types of scenario to look for when you're trying to solve performance problems.

Introduction

In my previous post, Performance Tests: Replacing Core Data Key Paths, I presented a simple category to improve Core Data key path fetching by 25%-35% (actually the improvement was much greater at small set sizes but they rarely matter). The improvement worked by using property accessors to access the Core Data values instead of looking up the properties by string name.

However, the original "naïve" approach that I used to do this actually took more than twice as long as the original Core Data string key paths.

This week, I'll look at how I analyzed this code and how I worked out where the performance bottleneck was.

The code to analyze

The purpose of the code was to replace this:

NSSet *projectNames = [company valueForKeyPath:@"projects.name"];

with this:

NSSet *projectNames = [company.projects slowObjectValuesForProperty:@selector(name)];

But, as the "slow" in the method name reveals. The initial approach was slower — by approximately a factor of 2.

The implementation of the slowObjectValuesForProperty: method is really simple. Let's look at it:

- (NSSet *)slowObjectValuesForProperty:(SEL)propertySelector
{
    NSMutableSet *result = [NSMutableSet set];
    for (id object in self)
    {
        id value = [object performSelector:propertySelector];
        if (value)
        {
            [result addObject:value];
        }
    }
    return result;
}

The test harness that I'll be using will be the PropertyAccessors project from the Performance Tests: Replacing Core Data Key Paths post, with all tests except the "Key Path Accessor" and "slowObjectValuesForProperty: accessor" ifdef'd out.

Time profiler

The first approach you should use when your code isn't running as fast as desired is a basic time profile.

As preparation for this, you should make sure that the test will run for long enough to give useful test results. Time profiling works by suspending the program at regular intervals and reading the program's current position. This amounts to a random sampling over time. But these random samples won't be helpful unless there are enough in your time critical code. Increase your data size or put the code you're looking at in a loop and try to make it run for at least 2 seconds.

For this test, I set the NumCompanies and NumProjectsPerCompany to 100 (giving a total data size of 10,000 project names) and further set the NumTestIterations to 1000 (so this test will fetch 10 million names in total).

This gives the following test durations:

Key Path set iteration test took                4.13078 seconds.
Slow objectValuesForProperty: test took         8.83262 seconds.

By default, the Time Profiler will sample once every millisecond. If your tests cover a large amount of code, you'll need more samples — so either make your tests run longer or increase the sample frequency (Time Profiler can go as fast as 20 μs).

Build the project (but don't run it directly from Xcode). Then select the "Run→Run with Performance Tool→Time Profiler" menu. You can also use "Shark" for this but they both work in a similar way and the Time Profiler runs in Instruments which has a nicer, newer interface (although Shark can track a few lower level metrics that Instruments still lacks).

iPhone note: The Time Profiler is not available to iPhone projects. For iPhone projects, select the "CPU Sampler" in the "Run with Performance Tool" menu. The "CPU Sampler" is higher overhead and cannot sample as often or as rigorously but will normally provide similar information.

The PropertyAccessors test will self-run, so you can just watch it go. For other projects where some interactivity may be required, you'll need to interact with the program until it runs the code you want to examine. If your test has a short duration or a large amount of code, you may need to click the "i" next to the "Time Profiler" Instruments icon at the top to change the sample rate to get good coverage for your program (you'll need to rerun the test if you make this change).

instruments1.png

Your window should look like this. If it doesn't, make sure the "Detail" view is visible (you won't need the "Extended Detail" view). Also make sure the "Call Tree" view is selected in the "Detail" view (it's the icon at the bottom of the window that looks like three horizontal lines stacked slightly askew).

Once the program finishes, uncheck the "Invert call tree" checkbox in the left column then expand the "Main Thread → start → main" elements and we'll see the most expensive steps in the main() function (which is where all the tests are run in this project).

instruments2.png

The top item in the main() function is the fetchObjectSetForRequest: method but we're not looking at that here (it's just Core Data loading and prefaulting the entire database).

Instead, we want to know why the second item, slowObjectValuesForProperty:, takes 2.23 times longer than the third item, valueForKeyPath:.

Expanding the tree from the slowObjectValuesForProperty: method twice, we see CFBasicHashAddValue() occupies almost all of this method's time.

If we expand the valueForKeyPath: call tree 7 times, we can see the same CFBasicHashAddValue() is used to create the set here but for some reason, this function only takes 1402 milliseconds here, compared to 5996 milliseconds for the slowObjectValuesForProperty: method.

The same function, acting on the same data. But one takes 4 times longer than the other. What is the explanation?

There are two possible answers: either the slow case is suffering from poor memory caching performance, or it is acting repeatedly and is slow due to repetition.

Object allocations

Shark can measure event counts (like cache misses and memory bandwidth) if you actually think that's the cause of a problem but it's unlikely to be the problem here. In most situations where the Time Profile isn't clear cut, the best thing to try here is a object allocation analysis.

As with the Time Profile, we run the Object Allocations performance tool from "Run→Run with Performance Tool→Object Allocations" menu in Xcode.

instruments3.png

Again, switch to the "Call Tree" view and make sure "Invert Call Tree" and "Separate by Category" are off.

Additionally, right-click (or control click) in the table header of the "Detail" view and make certain that "Count" and "Bytes" columns are both enabled.

If you navigate into "Main Thread → start → mainslowObjectValuesForProperty:", then follow the biggest allocation 4 times and you'll see that __CFBasicHashRehash is responsible for allocating almost all of this method's memory.

Similarly navigating into the valueForKeyPath: of the other test, reveals the same method 7 levels in allocates a majority of this method's memory.

However, there are three big differences between the memory performances of the __CFBasicHashRehash in each case:

  1. The slowObjectValuesForProperty: version performs 800,000 allocations whereas the valueForKeyPath: version performs exactly 100,000 (equal to the number of tests).
  2. The slowObjectValuesForProperty: version allocates 210.57MB whereas the valueForKeyPath: version allocates just 97.66MB.
  3. The slowObjectValuesForProperty: version is found inside CFSetAddValue where the valueForKeyPath: version is found inside CFSetCreate.

From the first point, it would appear that the presumption that the slow version is slow because it unnecessarily repeats itself is looking accurate — it is repeatedly reallocating.

The second point also suggests that the slow method is wasting extra memory (which is probably causing mild performance penalities). On low memory systems, this would be even worse.

The third point suggests why the first two might be the case: the slow method needs to reallocate as it adds more data.

Understand what the code you're invoking does

All this helps to identify where your program is spending its time. From here, you could probably work out that allocating the NSMutableSet once, instead of repeatedly as you go, is a good idea.

Of course, you need to confirm this by applying your changes and testing. Changing this code to allocate once instead of repeatedly as it goes is easy — it takes a matter of seconds. You do need to be wary though about spending too long optimizing code on a whim. The more involved the code change, the more confident you should be that you're actually fixing a problem and not inspecting irrelevant code or making unhelpful changes.

This is where it helps to really understand the code you're playing with.

For example: the reason why moving to a single allocation with the NSMutableSet above is helpful, is not actually directly because of the allocation itself — since the biggest block of time inside slowObjectValuesForProperty: that we're trying to optimize is spent in __CFStringEqual and memcmp when you drill all the way down; it is not spent in malloc itself.

Instead, I know that whenever NSMutableSet resizes, it needs to rehash every single object back into the newly resized hash table. It is this rehashing that is hammering the __CFStringEqual and memcmp functions (since they are the functions used to hash and detect collisions in the hash table). Yes, reducing reallocations makes it faster but the biggest reason for this improvement is the nature of hashed storage: because reducing reallocations reduces the need to rehash.

As I reported in the original post, fixing this allocation so that it happens just once will speed this code up by 2.6 times but it is important to understand that the need to rehash is why this change helped — other reallocation situations may not gain as much as we have here.

Finding the best candidates for speed improvements

Generally though, allocations and reallocations are always a prime place to look. Large allocations are expensive to move around in RAM, millions of small, irregular allocations can fragment memory and every allocation normally needs to be initialized, incurring some kind of expense. Even with NSArray, which doesn't need to rehash when it reallocates, there is still a copy of the elements from the old array to the new one. The performance gain will not be as great but it is still a place to squeeze extra performance.

When optimizing, the first things to look for are:

  • Memory allocations. They're easy to find and easy to tweak. They don't always give the best performance improvements but they're a good first point to examine.
  • Iteration over large arrays to find elements. If you need to search large arrays often, you should be storing in a dictionary or other constant time access structure. Depending on the size of the array, this can be a near 100% speed improvement. i.e. never, ever search anything bigger than a trivial array.
  • Nested loops over medium to large data sets. Easy to find. Eliminate the nesting if you can. Although they can be hard to eliminate since you need to rethink how you access your data but you can often do something to reduce their impact. If you have 1 loop inside another, always put the smallest loop (in terms of number of elements) inside the bigger one if you can, since memory scales better to doing small packets of work huge numbers of times.
  • Anything non-polynomial on more than trivial data sets. Non-polynomial actions are the slowest, worst things you can do. They are occasionally required but if at all possible, think up another design. What's non-polynomial? Anything where the number of packets of work involved in processing a collection grows greater than polynomially with respect to the size of the collection (i.e. if the packets of work are "y" and the number of objects in the collection is "x", then non-polynomial means the number of packets of work exceed y=x^a for large values of x, where a is any constant). Exponential growth (i.e. y=a^x) or factorial (i.e. y=x!) are the most common kinds of non-polynomial growth. If all this is confusing, then at least know the common case: trying to find an ordering or arrangement for a set of objects by exhaustively testing every combination is non-polynomial.

You may be tempted to think that pervasive multi-threading, OpenCL, SSE vectorization or assembly optimizations are the best way to solve performance issues, since they are all "high performance" technologies and the fastest programs all use them in some combination. However, these technologies are much harder to implement than simple design improvements so they should always be something you consider once you're sure that the design can't be further improved.

Conclusion

The first rule with optimization is that you should always optimize based on evidence.

Don't begin to optimize if you can't find a bottleneck or create one with data. Optimizing without actual evidence of a bottleneck (just suspicion that there might be one in future) is called "Premature Optimization" and is normally considered a waste of time.

However, even once you know where the bottleneck is, you need to know what is causing it. If the code is small, you can just play with it to see what happens but generally you need to inspect closely. It is surprisingly easy to make the wrong guesses about what is causing performance issues and waste time changing code for no gain.

I hope I've shown you how to gather the information you need to understand simple performance problems and the types of clues you'll need to narrow down the locations in your code that can be improved.