How to use

In General ...

A warning upfront: This is experimental work. NEVER use it in a production environment.

In general, it's easy. You should
  1. Extract prof.jar from the distribution, e.g. to a directory /home/user/ppProf
  2. Start your application with the agent enabled: add -javaagent:/home/user/ppProf/prof.jar to your java command line, as in
    java -javaagent:/home/user/ppProf/prof.jar -classpath cp org.junit.textui.TestRunner my.test.Class
  3. Wait for your program to finish
  4. Look for a file ppp_Result.txt in the working directory. It contains the number of calls and the time spent in each method, in milliseconds. In addition, ppp_Result_summary.txt lists the packages containing the most frequently called methods and the most time-consuming methods.
That's all...

... with the illusion of nanosecond precision

If you want the illusion of nanosecond precision, add the nano option to the agent parameter, as in  -javaagent:/home/user/ppProf/prof.jar=nano. ppProf will then use System.nanoTime() instead of System.currentTimeMillis() to measure time.

I write "illusion" because, in my experience, it just slows down the measurement without any real gain in precision. On my Linux box with a vanilla Sun JDK, tracing indicates that the Unix function gettimeofday(2) is called in both cases.
If you call a method sufficiently often, or if it runs long enough, the difference between millis and nanos will probably not matter. If you call a method only a few times and it runs fast, it's irrelevant to profiling anyway.
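You can check the per-call cost of both timers on your own machine. This is only an illustrative micro-benchmark (the class name and numbers are mine, not part of ppProf); the results depend heavily on OS, kernel, and JVM:

```java
import java.util.function.LongSupplier;

public class TimerCost {
    // Average cost, in nanoseconds, of n calls to the given clock function.
    static long avgCallCostNanos(LongSupplier clock, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) sink += clock.getAsLong();
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // keep the loop from being optimized away
        return elapsed / n;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        System.out.println("currentTimeMillis: "
                + avgCallCostNanos(System::currentTimeMillis, n) + " ns/call");
        System.out.println("nanoTime:          "
                + avgCallCostNanos(System::nanoTime, n) + " ns/call");
    }
}
```

If both numbers come out about equal, that supports the "illusion" claim for your platform.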

... on Linux SMP on PC Hardware

There's always a catch. In this case, calling System.currentTimeMillis() or System.nanoTime() is outrageously expensive: on my test hardware, calling gettimeofday from C is about 16x slower with an SMP kernel. I'm not kidding. This is really, really expensive, especially compared to, say, a simple getter method, and it makes profiling really, really slow.
 
To fix this, there's the linuxSMPhack option. It uses a different operating system function to measure time. Unfortunately, that is no longer standard Java, and requires a JNI implementation. But as an added bonus, the function used claims to measure CPU time (rather than elapsed time) per thread.

To make this work, you need to do three things:
  1. Extract bin/libclockGettime.so from the distribution, and place it say in /home/user/ppProf as well
  2. Use the linuxSMPhack option to ppProf, as in -javaagent:/home/user/ppProf/prof.jar=linuxSMPhack
  3. Tell java where to find the library, using the -Djava.library.path=/home/user/ppProf option.
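For comparison: standard Java can also report per-thread CPU time, via ThreadMXBean. This sketch is not what ppProf's JNI library does (its internals aren't shown here; the library name suggests it wraps clock_gettime(2)), but it demonstrates the same per-thread CPU-time idea:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuTimeDemo {
    // CPU time (ns) consumed by the current thread, or -1 if the JVM
    // does not support per-thread CPU time measurement.
    static long currentThreadCpuNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.isCurrentThreadCpuTimeSupported()
                ? bean.getCurrentThreadCpuTime() : -1;
    }

    public static void main(String[] args) {
        long before = currentThreadCpuNanos();
        long sink = 0;
        for (int i = 0; i < 5_000_000; i++) sink += i; // burn some CPU
        long after = currentThreadCpuNanos();
        System.out.println("CPU time used: " + (after - before)
                + " ns (sink=" + sink + ")");
    }
}
```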

... to just count

If timing methods is too slow, you can use the count option to ppProf. In that case, only the calls per method are counted. The overhead for this is usually minimal, though it reached 70% in my worst-case example. You can determine the most frequently called methods this way, exclude their packages or classes from profiling, and then turn timing back on.
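To see why counting is so much cheaper than timing, consider what a count-only probe boils down to. This is a hypothetical sketch, not ppProf's actual instrumentation code: one map lookup and one increment at method entry, with no timer call at entry or exit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class CallCounter {
    private static final ConcurrentHashMap<String, LongAdder> COUNTS =
            new ConcurrentHashMap<>();

    // What a count-only probe inserted at method entry amounts to.
    static void enter(String method) {
        COUNTS.computeIfAbsent(method, k -> new LongAdder()).increment();
    }

    static long count(String method) {
        LongAdder a = COUNTS.get(method);
        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) enter("my.test.Class.foo()");
        System.out.println(count("my.test.Class.foo()")); // prints 3
    }
}
```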

What gets profiled, and how to change that?

By default, ppProf profiles everything except classes from the java, sun, org.apache.xerces, and org.jdom packages.
What does that mean? Say you have a method foo(), and it calls StringTokenizer.hasMoreElements(). Since StringTokenizer is from java.util, the time spent in hasMoreElements() will be attributed to foo().
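A concrete version of that example (the class and method names here are made up for illustration):

```java
import java.util.StringTokenizer;

public class AttributionDemo {
    // StringTokenizer lives in java.util, which is excluded from profiling by
    // default, so all time spent inside hasMoreElements() and nextToken()
    // gets charged to foo() itself.
    static int foo(String s) {
        int words = 0;
        StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreElements()) {
            st.nextToken();
            words++;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(foo("a b c")); // prints 3
    }
}
```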

Why this particular list? Well, it's a reasonable choice for my standard example. Future versions will put this into a config file.
The above list is equivalent to specifying scope=-java-sun-org/apache/xerces-org/jdom as an option to ppProf, as in -javaagent:/home/user/ppProf/prof.jar=scope=-java-sun-org/apache/xerces-org/jdom.

The rules for the scope option are as follows:
Here are some examples:
Final remark: ppProf currently only profiles classes loaded by ClassLoader.getSystemClassLoader().

Single vs. Multi-threaded code

Profiling multi-threaded code requires overhead that single-threaded code does not need. Using the single option to ppProf, you can avoid that overhead. If you use it for multi-threaded code, the results may be strange.
The default is the multi option, although it is still experimental.

For single-threaded code, you get a single result file. For multi-threaded code, you get that (with aggregated values for all threads) and a per-thread file as well.

Warning: The way multi-threaded profiling is performed assumes a modest number of "long-running" threads. If you create and destroy threads at high frequency, expect slow execution and, sooner or later, OutOfMemoryError.
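Why does high thread churn lead to OutOfMemoryError? A plausible explanation, sketched here as hypothetical code (not ppProf's actual implementation): if the profiler keeps one record per thread and never evicts it, every short-lived thread leaves a dead entry behind.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadData {
    // Hypothetical sketch: one record per thread, never evicted. With a modest
    // number of long-running threads this is fine; with heavy thread churn the
    // map grows without bound -- hence the warning above.
    private static final Map<Thread, long[]> PER_THREAD = new ConcurrentHashMap<>();

    static long[] recordFor(Thread t) {
        return PER_THREAD.computeIfAbsent(t, k -> new long[2]); // {calls, nanos}
    }

    static int trackedThreads() {
        return PER_THREAD.size();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) {          // 100 short-lived threads ...
            Thread t = new Thread(() -> recordFor(Thread.currentThread())[0]++);
            t.start();
            t.join();
        }
        // ... leave 100 dead entries behind, because nothing removes them.
        System.out.println("tracked threads: " + trackedThreads());
    }
}
```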