Deep Learning Performance Notes

I started experimenting with deep learning and immediately ran into training issues. I am using license plate recognition code as an example; it takes ~100K iterations to converge. For production systems, the required number of iterations is going to be in the tens to hundreds of millions, so performance matters a lot. A few notes:

1. MacBook Air

Each iteration took 6 seconds on my 2015 notebook. At that rate, training would take about 7 days (6 * 100K / 86400) to complete. Not good.

2. Ubuntu Linux Server

Performance is much better but not good enough. Each iteration took 3 seconds, so training still takes about 3.5 days.

3. Ubuntu Linux Server + GTX 1060 GPU

Each iteration took only 0.3 seconds (about 8 hours for 100K iterations), which means I can run an experiment every few hours while learning DL.
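The estimates above all come from the same arithmetic: seconds per iteration times the number of iterations, divided by 86400 seconds per day. Here is a minimal sketch in C; the function name `training_days` is made up for illustration.

```c
#include <assert.h>

/* Estimated wall-clock training time in days.
 * training_days is a hypothetical helper name, not from any library. */
double training_days(double seconds_per_iteration, long iterations)
{
    assert(seconds_per_iteration >= 0.0 && iterations >= 0);
    return seconds_per_iteration * (double)iterations / 86400.0;
}
```

Calling `training_days(6.0, 100000)` gives roughly 6.9 days for the MacBook Air, and `training_days(0.3, 100000)` gives roughly 0.35 days (a bit over 8 hours) for the GTX 1060.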

However, a word of caution: I was hit by the exploding/vanishing gradient problem. It appears that the TensorFlow/GPU combination handles floating-point calculations differently from the CPU-only system (I did not encounter this issue on the CPU). One way to address exploding/vanishing gradients is to tweak the learning rate or try different initialization parameters. Reducing the learning rate worked for me.
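To see why the learning rate matters, here is a toy sketch in C. It is not the license plate network and not TensorFlow; it is plain gradient descent on the assumed function f(w) = w*w, and the name `descend` is made up for illustration. Each step multiplies w by (1 - 2 * learning_rate), so a rate above 1.0 makes the value grow without bound, while a small rate shrinks it toward the minimum at 0.

```c
#include <assert.h>

/* Run plain gradient descent on f(w) = w*w starting from w0
 * and return |w| after the given number of steps.
 * Toy model only: it illustrates divergence from a too-large step size. */
double descend(double w0, double learning_rate, int steps)
{
    assert(steps >= 0);
    double w = w0;
    for (int i = 0; i < steps; i++) {
        double grad = 2.0 * w;        /* derivative of w*w */
        w -= learning_rate * grad;    /* gradient descent update */
    }
    return (w < 0.0) ? -w : w;
}
```

With 50 steps from w0 = 1.0, a learning rate of 0.1 brings |w| very close to 0, while a learning rate of 1.1 lets it grow into the thousands. Real networks are far more complicated, but the same mechanism is one reason reducing the learning rate can tame exploding updates.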

I also looked at running training on GPU instances in the cloud; however, the cost seemed very high for now: ~$100-$200 per month for partial usage. I was able to upgrade my existing computer for $250 to gain better performance.

For DL to become ubiquitous, independent developers need access to more affordable computing resources. For now, it appears that a personal computer with a consumer-grade GPU is the way to go for independent developers like me until the cloud becomes cheaper.

Software performance considerations when using cache

Software performance considerations when using cache by Steve Daily, Intel Corp.  

This paper has discussed some basic concepts of cache operation and organization and presented some software examples to demonstrate how an awareness of the operation of the cache on your system can help improve software performance.

The use of cache in computer systems can dramatically improve system performance.  Increasing cache size is a cost-effective way to improve performance on microprocessors as the transistor count on the chip increases. While the operation of cache is generally transparent to the programmer, some issues can arise that influence a program’s performance.

This paper will look at some of the issues that affect single-processor system performance arising from the tendency of programs to refer to data and instructions in locations of memory close to previously accessed locations. This tendency is called locality of reference and is the basic property that allows cache to improve processor performance.
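Here is a small sketch of locality of reference in C. It is not taken from the paper, and the function names are made up for illustration. Both functions add up the same matrix stored in row-major order, but the first walks memory sequentially while the second jumps `cols` elements between accesses, which tends to use the cache less effectively on large matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored row by row, visiting elements
 * in the same order they sit in memory (good spatial locality). */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same sum, but visiting column by column: consecutive accesses
 * are cols elements apart, so neighboring data is reused less. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same result. Any speed difference depends on the matrix size, the cache, and the compiler, so it is worth timing both on your own system rather than assuming a particular ratio.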

Recently I got the chance to implement an ANSI C program based on dynamic memory allocation, and since I am a performance freak, I ended up writing the same program in three styles with a performance range of 1X-5X. Currently, I am working on an article to show various (simplified) coding styles for dynamic memory allocation and a performance evaluation of each. Stay tuned. Ideas are always welcome!

Guidelines for writing efficient C/C++ code

Simple source code changes can often result in substantial performance enhancements using modern optimizing compilers on high-end embedded processors. But, why is performance necessary? After all, the capabilities of modern microprocessors dwarf the capabilities of 1980-era supercomputers. First, the average case response time of a real-time system is irrelevant. It is the worst-case response time that guards against a dropped call, a misprint, or other error conditions in a product. In other words, performance is necessary to minimize the response time of one’s system, thereby achieving product reliability.

Second, increased performance allows for the implementation of more features without compromising the integrity of the system. Conversely, performance might be used to select a cheaper microprocessor or one that consumes less power.

The original article can be read at Guidelines for writing efficient C/C++ code, by Greg Davis, Green Hills Software, Inc. (04/05/2006 9:00 AM EDT).


Linked List Implementation in ASIC – Hardware


In Linked List Implementation in ASIC, an overview of dynamic memory allocation was presented. The next article, Linked List Implementation in ASIC – ANSI C, further explained the need for linked lists and laid the foundation for software implementations of linked lists.

This article lays the foundation for ASIC linked list design architecture considerations. First, an ANSI C model is presented to explain and mimic the ASIC linked list implementation. After this, basic linked list operations are identified and their performance is analyzed. Along the way, various tradeoffs are discussed and a few unique methods are presented to increase linked list performance and throughput.
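To give a feel for what such a model can look like, here is a minimal sketch in ANSI-style C. It is not the model from the article itself; it is an assumption about one common approach, where "pointers" are indices into a fixed-size memory (as they would be addresses into on-chip RAM) and unused slots are chained on a free list. All names (`ListPool`, `pool_init`, `pool_enqueue`, `pool_dequeue`) and the pool size are hypothetical.

```c
#include <assert.h>

#define POOL_SIZE 8   /* assumed number of node slots in the memory */
#define NIL (-1)      /* index value meaning "no node" */

typedef struct {
    int data[POOL_SIZE];  /* payload memory */
    int next[POOL_SIZE];  /* next-pointer memory (indices, not addresses) */
    int head;             /* first node of the list */
    int tail;             /* last node of the list */
    int free_head;        /* first node of the free list */
} ListPool;

/* Chain every slot onto the free list and mark the list empty. */
void pool_init(ListPool *p)
{
    assert(p != 0);
    for (int i = 0; i < POOL_SIZE; i++)
        p->next[i] = (i + 1 < POOL_SIZE) ? i + 1 : NIL;
    p->free_head = 0;
    p->head = NIL;
    p->tail = NIL;
}

/* Append a value at the tail. Returns 1 on success, 0 if the pool is full. */
int pool_enqueue(ListPool *p, int value)
{
    int slot = p->free_head;
    if (slot == NIL)
        return 0;
    p->free_head = p->next[slot];   /* take a slot off the free list */
    p->data[slot] = value;
    p->next[slot] = NIL;
    if (p->tail == NIL)
        p->head = slot;             /* list was empty */
    else
        p->next[p->tail] = slot;
    p->tail = slot;
    return 1;
}

/* Remove the head value into *value. Returns 1 on success, 0 if empty. */
int pool_dequeue(ListPool *p, int *value)
{
    int slot = p->head;
    if (slot == NIL)
        return 0;
    *value = p->data[slot];
    p->head = p->next[slot];
    if (p->head == NIL)
        p->tail = NIL;
    p->next[slot] = p->free_head;   /* return the slot to the free list */
    p->free_head = slot;
    return 1;
}
```

In this sketch, each enqueue or dequeue touches a small, fixed number of memory locations, which is the kind of property that makes per-operation performance analysis straightforward.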
