Category Archives: Scala

A Monte Carlo Simulation in Scala

On LinkedIn a fellow named Francois Dewaste posts links to some Scala-related articles he runs across, and I often check them out. Recently he shared an article about a Monte Carlo simulation in Scala, written by Alvin Alexander. I found the problem interesting, so I want to share my own Scala solution, which is written using a more functional style, and also a mathematical interpretation of the results.

First I’ll repeat the basic problem:

If two people who are each right on a test 80% of the time (but whose answers are uncorrelated) have the same answer for a particular question, what is the probability that their answer is right?

You’ll get much more out of this post if you spend a minute thinking about the problem. How would you model it mathematically? How would you explore the problem using software?

Apparently several statisticians he asked said that the answer was 80%. Presumably their thinking was along the lines that since their answers are uncorrelated, the fact that two (or three or …) agree means nothing. But that’s wrong.

If you’re like me, your immediate answer will be “there isn’t enough information.” Clearly you would have to know how many possible answers there are, because that will affect the probability that they have the same wrong answer. If there are only two possible answers, then if they are wrong they agree 100% of the time; if there are an infinity of possible answers, then if they are wrong they agree 0% of the time. Looking at his code, though, it was clear that he meant for there to only be two possible answers to each question.

OK — ready for the answers?

First the Simulation

I know most folks are here to look at the Scala code, so here it is. The parameters of the simulation are the five values read from the command line at the top.


object MonteCarlo extends App {

  val Array( numQuestions, numChoices, numPlayers, pctCorrect, numMustAgree ) =
    args map (_.toInt)

  val correctAnswer = numChoices - 1

  // For each question, generate player answers and look for consensus.
  // Where consensus is achieved, yield whether or not it is correct.
  val correctnessOfConsensus =
    for { i <- 1 to numQuestions
          (answer,instances) <- Array.fill(numPlayers)(
            if ( util.Random.nextInt(100) < pctCorrect )
              correctAnswer
            else // pick a wrong answer from 0 to correctAnswer-1
              util.Random.nextInt(correctAnswer)
          ) groupBy identity
          if instances.size >= numMustAgree
    } yield answer == correctAnswer

  val timesAgreed = correctnessOfConsensus.size
  val timesRight  = correctnessOfConsensus count identity
  println( s"Consensus meant correctness $timesRight out of $timesAgreed times." )
  if ( timesAgreed > 0 )
    println( f"That's ${ 100.0 * timesRight / timesAgreed }%.1f%%." )

}

It’s interesting to note that because of Scala’s type inference there isn’t a single type specification in the entire program, in spite of the fact that there are a dozen bound variables. The call to toInt is a conversion, not a type specification, and it would be needed in Python or Ruby as well.

The 9-line for-comprehension is the heart of the program: it repeatedly (once for each question) generates the players’ answers with the specified probability of being correct, figures out how many times each answer was used, picks out the cases where the required number of people agreed, and yields a Boolean indicating whether or not the agreed-upon value was correct. Note that there can be more than one consensus for a given question; for example, if you use ten players, five of which must agree, then you can achieve consensus twice for a single question. The for-comprehension (with its underlying flatMap) makes that trivial.
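For instance, here is how one question’s answers collapse into consensus candidates, in a standalone sketch with made-up answers:

```scala
// Hypothetical answers from five players on one question; say 1 is the correct answer.
val answers = Array( 1, 0, 1, 1, 0 )

// groupBy identity gathers equal answers together, as in the simulation:
val grouped = answers groupBy identity   // Map(1 -> Array(1,1,1), 0 -> Array(0,0))

// With numMustAgree = 3, only answers given at least three times survive:
val consensus = for ( (answer, instances) <- grouped if instances.size >= 3 ) yield answer
```

Here `consensus` contains only the answer 1, since it was the only answer given at least three times.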

Now let’s run it for the parameters of the original problem, but with a million questions so we get a more accurate probability estimate:

  $ scala MonteCarlo 1000000 2 2 80 2
  Consensus meant correctness 640039 out of 679984 times.
  That's 94.1%.

And they were also interested in the probability of the answer being correct if 3 out of 3 had it:

  $ scala MonteCarlo 1000000 2 3 80 3
  Consensus meant correctness 512213 out of 520220 times.
  That's 98.5%.

Those are essentially the same answers Alvin got.

Now the Math

So what’s behind these numbers? In the above runs (where everyone must agree), there are three possibilities for each question:

  • A. They all agree and they are correct.
  • B. They all agree and they are incorrect.
  • C. They don’t all agree.

We don’t care about C; what we are interested in is

         p(A)
     -------------
      p(A) + p(B)

Since their answers are uncorrelated, the probability that two players will both be right is the product of the probabilities that each is right. Since each is right 80% of the time, p(A) = square(0.8) = 0.64, and p(B) = square(0.2) = 0.04. So the expression above is 0.64 / ( 0.64 + 0.04 ) = 0.941 (roughly).

With three people, p(A) = cube(0.8) = 0.512 and p(B) = cube(0.2) = 0.008, so the probability that the consensus is correct is 0.512 / ( 0.512 + 0.008 ) = 0.9846 (roughly).
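That calculation generalizes to any per-player accuracy and group size; a small helper (the function name is my own) makes it easy to check against the simulation:

```scala
// Probability that a unanimous answer among n uncorrelated players is correct,
// given two possible answers (so that wrong players always agree with each other).
def consensusCorrect( p: Double, n: Int ): Double = {
  val pA = math.pow( p, n )      // all right (and therefore agreeing)
  val pB = math.pow( 1 - p, n )  // all wrong (and, with only two choices, agreeing)
  pA / ( pA + pB )
}

consensusCorrect( 0.8, 2 )  // ≈ 0.941
consensusCorrect( 0.8, 3 )  // ≈ 0.985
```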

Oh, and remember the point about the number of possible answers mattering? Let’s say that instead of 2 possibilities there are a million:

  $ scala MonteCarlo 1000000 1000000 2 80 2
  Consensus meant correctness 640167 out of 640167 times.
  That's 100.0%.

With so many possible answers, the players almost never agree if they are incorrect. So, if they do agree, it’s because they are right.

There you have it! Interesting problem, Alvin — thanks!

[other Scala posts]

Dynamic Resource Types for a No-Fuss API

“this was one of those rare occasions where an idea turns out to be a lot better than intended”

In late 2009 I left a dying Sun Microsystems (right before the Oracle acquisition) to help my good friend Alok with his tiny un-funded startup called Yunteq. He was understandably having trouble finding engineers willing to work for a stake in the company rather than a salary, and since he was trying to make it as an entrepreneur and was in desperate need of help in a critical portion of the system, I went to his aid. The unsalaried period was only going to last three months — I could handle that — and he thought the company would probably be acquired in five, by a suitor they had been in talks with for a while. They were almost to the goal line, but they had a big problem; if I stepped in as a third developer to get them through it, I could own a significant share of a company about to be acquired. He had assembled a good team, and we would make decisions together, as co-founders.

It sure sounded good. In reality it was an extremely bumpy ride.  :^)

Anyway, in early 2010 I started on the task of designing a public REST API for the product. It had to present the functionality of the server — which was all about managing and organizing roomfuls of servers, and the virtual machines running on them — to the outside world. We started with a typical design in which the URL broke down the universe into a hierarchy of resources like this one:

/cust/CustomerId/app/ApplicationId/vm/VmId/disk/DiskId

The resources in such an API effectively form a tree, the shape of which is defined by the application. But at some point Alok mentioned to me that while the above was fine for a cloud provider, a large company would probably not want the customer part but would want some way of representing divisions, regions, groups, departments, or whatever. “We’ll probably have to fork the code base for each customer,” he said.

I nearly choked on my tea at the sound of that. “Ummmm, let me think about this,” I responded.

In a couple of days I presented him with a design that at first he had trouble understanding. There was no fixed hierarchy to the API at all — the resource tree would be whatever we made it, by creating new nodes and saying what type we wanted each to be. Not only could each customer have a different tree arrangement, we didn’t have to set it up for them — they could do it themselves. They could do it piecemeal; anytime they wanted to add to or rearrange the tree, they could do so easily from the CLI. All using a single code base.

That is, this kind of API was really not about customers and applications and VMs and Disks and NICs and so on — it was about manipulating a tree of typed nodes. If you want a top-level grouping by customer, fine — create a node of type Folder called “cust” and within that a Folder for each customer. If you want to organize things some other way, no problem — just create the appropriate nodes in the tree to represent it. It was completely analogous to the way you create directories in a filesystem to organize your data the way you want, except that you could create nodes of arbitrary types from some available palette. A Folder here, a VM there, an ISO repository over there, etc. — however you want to arrange it.

By mid-2010 we had such a system working; the public API allowed clients to manipulate a dynamically typed resource tree.

But this was one of those rare occasions where an idea turns out to be a lot better than intended. Yes, the interface was now infinitely more flexible, and that was great. Each customer could organize its resources in our system in whatever way made sense for that enterprise. But that was just the beginning.

The API itself treated all nodes in the tree the same — as typed sets of attributes. How much RAM a VM was to have, or CPU, or disk space — all were just attributes of different types. The API was entirely about manipulating a tree of attribute-sets; it didn’t know or care what the resources or attributes were about. Developing in this system involved writing new resource types, but you never had to go back and fiddle with the API to be able to handle them; the API could already handle them. To it, your new resource type was just another typed set of attributes.
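The core idea is easy to sketch in Scala; all names below are my own invention for illustration, not Yunteq’s actual types:

```scala
// Every resource is just a typed set of attributes plus named children;
// the API never needs to know more than this one shape.
case class Resource(
  resourceType: String,               // e.g. "Folder", "VM", "Disk"
  attributes:   Map[String, String],  // e.g. "ramMB" -> "4096"
  children:     Map[String, Resource] // named child nodes, like a filesystem
)

// One customer's arrangement; another customer could shape the tree differently.
val tree =
  Resource( "Folder", Map.empty, Map(
    "cust" -> Resource( "Folder", Map.empty, Map(
      "acme" -> Resource( "VM", Map("ramMB" -> "4096"), Map.empty )
    ))
  ))
```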

And then our CLI got simpler. There was no longer any code specific to VMs or customers or disks or anything else — it just extended the API’s view of the world, allowing the user to manipulate a tree of attribute-sets from the command line. Not only did we no longer have to fuss with the API when we added functionality, we didn’t have to fuss with the CLI either. As we added new resource types, or added/changed attributes of existing resource types, no work at all had to be done in the API or CLI. This was way cool.

Then we added the notion of a “policy” — a kind of resource that changes the way the system deals with other resources. The fact that the customer was in control of the arrangement of resources gave us a scoping mechanism for policies: wherever you put the policy in the tree, the policy was in effect for the subtree under that point. And if you put another instance of the same type of policy below it, the deeper policy overrode the shallower one. This was a simple yet powerful scoping mechanism, made possible by the fact that we had handed control over the tree’s layout to the user.
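The scoping rule, where the deepest policy on the path wins, is easy to express over such a tree. Here is a sketch under the same caveat that the names and representation are my simplification:

```scala
// A minimal node: its own policy settings by policy type, and named children.
case class Node( policies: Map[String, String], children: Map[String, Node] )

// Walk from the root along the path, remembering the deepest setting of the
// given policy type seen so far; deeper settings override shallower ones.
def effectivePolicy( root: Node, path: List[String], policyType: String ): Option[String] = {
  def walk( node: Node, remaining: List[String], found: Option[String] ): Option[String] = {
    val here = node.policies.get(policyType) orElse found
    remaining match {
      case Nil          => here
      case name :: rest => node.children.get(name) match {
        case Some(child) => walk( child, rest, here )
        case None        => here
      }
    }
  }
  walk( root, path, None )
}

// A root-level "backup" policy, overridden deeper in the tree:
val vm   = Node( Map("backup" -> "hourly"), Map.empty )
val root = Node( Map("backup" -> "daily"),  Map("cust" -> vm) )
```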

Testing was also improved. Among other things, the fact that resource types and attributes were treated so generically meant that much testing could be done once and for all. For example, a single suite of tests can check that an attribute which is supposed to be an integer within a specified range works as it should; then if there are actually a dozen such attributes in the system there is no need to test out-of-range values or non-integer input on each.

Even the API documentation and help system documentation were simplified. Once you understand the basic idea, that the API just allows you to manipulate a tree of typed resources, the bulk of what you need to know is 1) what resource types are available, and 2) what are their attributes? Much of this kind of documentation can be generated automatically by the system.

In effect we had created some good generic infrastructure for creating applications. Starting from that infrastructure, you just define resource types for your domain and you are done. Of course, those resource types are non-trivial — that’s the meat of the application. But you get a lot for free.

There’s more to this, of course — I can write about the API because it is public, but I won’t say much about the interesting mechanics inside the server. I will say, though, that I had wanted to do this in Scala and Akka, but management had never heard of those (“Forget Scala,” I was told) and got someone to set up a much more conventional stack. It worked reasonably well, but had its problems. A bit frustrated, I spent a few weeks over Christmas writing a Scala/Akka/Spray prototype and demoed it to them when they got back from their vacations. They were really impressed at how much cleaner the API code was (the Spray DSL for creating APIs is awesome), and defining resource types in this new system was much easier. To their credit, they took a serious look at Scala and Akka and decided that we should use it. I now have new server infrastructure for dynamic resource trees working in Scala/Akka/Spray; it’s a huge improvement and I am a much happier camper.

And now that same management has told the entire group to use Scala. Go figure.

Test with Bigger Integers

It’s always a jolt when some simple expression in a language I’ve been using for a long time evaluates to something I can’t explain.  It happened today with Java.

I had asked a coworker to run FindBugs against one of our products. He forwarded me the output, which contained several issues of this type:

Suspicious comparison of Integer references

Sure enough, when I looked at the code I found expressions using == to compare one Integer to another. That’s usually bad because Integers are objects, and == for objects tests whether they are the same object, not whether they have equal values, so two Integers could appear to be != even though they hold the same value.

That’s easy enough to fix, but just to make sure I understood all the cases I wrote a test that compared, among other things, an Integer 88 to an int 88 and to another Integer 88. The good news was that Integer == int was true (auto-unboxing). The bad news was that Integer == Integer was also true — there was something wrong with my test.

I tried a number of similar tests, all with the same result. I began to suspect that this was some kind of optimization, and in that case it was likely only for small numbers. Ultimately I figured out that == behaves like a value comparison when the Integers being compared are between -128 and 127. Once I saw that, I was able to find references to this behavior in the specification. The same is true for Longs, Shorts, and Bytes. The JRE prepares objects for small numbers in advance, and hands you one of those instead of creating a new one each time. Since you get the same object each time for those numbers, == works. But it doesn’t work for numbers outside that range.
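You can watch the cache at work even from Scala, since boxing an Int into a java.lang.Integer goes through Integer.valueOf; eq tests reference identity:

```scala
// Integer.valueOf hands out cached objects for values in -128..127 and,
// on a default JVM, fresh objects outside that range (the bound is tunable).
val a: java.lang.Integer = 127
val b: java.lang.Integer = 127
val c: java.lang.Integer = 128
val d: java.lang.Integer = 128

a eq b   // true:  same cached object, so Java's == appears to work
c eq d   // false: distinct objects, so Java's == reports inequality
```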

OK, cool — I learned something new. But looking at the output of FindBugs, I realized that we hadn’t seen those bugs in testing, probably because we typically used small numbers for those values. Let’s see, we need a couple of numbers for VLAN tags — what do we pick? Typically 1 and 2. Maybe 7 and 12 if we’re feeling wild and crazy. Had we never tested with a number greater than 127, we might have shipped the code with that problem. Yay FindBugs!

So here are some thoughts:

  • If you are in QA for a Java product, use numbers outside the range -128..127 wherever they are allowed.
  • If you are writing a Java product, consider initializing ID generators to 128 rather than 0 or 1, at least during testing.
  • Run FindBugs on your code.
  • Use Scala.

The semantics for == in Java are troublesome: value comparison for primitives, but identity comparison for objects, but effectively back to value comparison for small numeric objects. Oh, and with autoboxing we’re going to magically convert between primitives and objects to help you out, sometimes (depending on the actual value) changing the semantics of ==. You’re welcome.

In Scala, == compares values. Always. That doesn’t depend on what types you are comparing, or on the actual values. If you really want to test whether two variables refer to the same object, you use eq, which isn’t even allowed on Ints, Longs, etc. In Scala it’s pretty hard to screw this up.
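A quick demonstration, with boxed values deliberately chosen outside the cache range:

```scala
val x: java.lang.Integer = 1000
val y: java.lang.Integer = 1000

x == y   // true: == compares values, even for distinct boxed objects
x eq y   // false (on a default JVM): eq is reference identity, defined only on AnyRef
// 1000 eq 1000   // does not compile: eq is not available on Int
```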

Nice.

Scala Performance

“You can’t observe that some poorly performing algorithm was far easier to implement in Scala than Java and conclude that Scala’s performance is poor.”

[This is one of my most popular posts, but it’s long; if you are pressed for time, just read item 4, look at the qualitative chart there, and read the conclusion.]

You may have read the infamous email in which a Fantom fan complained about his difficulties using Scala at Yammer. He complained about a lot of things, including how hard it is to get started with Scala collections, the poor performance they were seeing, and what he considered a worthless Scala community. Hmm. It is certainly worth reading, with a grain of salt, as the flip side of the success stories, some of which were explicitly about improving performance. Apparently it doesn’t work for everyone. Scala has improved since Yammer’s struggles with Scala began, but still, it’s worth reading about one company’s bad experience.

By the way, I am certainly waiting to see whether Fantom and/or Kotlin gets some traction; I’m always happy to see good competition in languages. And I have some sympathy for a few of the author’s concerns about Scala. For example, while I am happily using SBT now, I was mightily confused at first, in part because of the major changes it was going through at the time. For most of his complaints, though, I have quite a different impression. In particular it’s hard for me to imagine how he got the impression he did of the Scala community; I have found folks to be quite helpful.

But the author of this post read  waaay  too much into the Yammer mail, ultimately saying this about what he infers to be the relative performance of Java and Scala:

“…if I can serve all of my customers on one machine instead of 100 machines …”

Whoa, wait a minute, dude. What’s that you’re smoking?

1. First of all, the implication that real-world code is 100x slower in Scala than Java is so radically far from what we see elsewhere that we should wonder just what the heck they were doing. OK, the Yammer folks didn’t say that the whole app ran 100x slower, but “some parts” did; we don’t know how much difference it made overall. But even so, something is up — 100x is a huge and uncharacteristic difference. When I mentioned this to a colleague, he said “If for loops are M times slower and immutable collections are N times slower, then a for loop on immutable collections will be MxN times slower.” But that isn’t how it works! The truth is that if every single aspect of a program is made M times slower, the result will be just M times slower. So unless we see gists of realistic code demonstrating a 100x difference (don’t hold your breath), we should assume that they were doing something a little goofy. For example, they mentioned an “inner serialization loop”; it’s entirely believable that when they wrote their app they didn’t design the data structures for efficient serialization (long linked lists of Option[Int] vs. arrays of unboxed Ints, for example), but that wouldn’t be Scala’s fault. Also, accurately measuring the performance of code on the JVM is tricky — they wouldn’t be the first to get it wrong. Or perhaps they were repeatedly using a view, causing the computation to be done over and over again — who knows. But if this 100x difference they saw is really more than carelessness on their part, if it’s really a legitimate performance problem in Scala itself, well, gist or it didn’t happen! — extraordinary claims demand extraordinary evidence.

2. A lot of large applications are fairly I/O-bound. In many apps there may be no code at all which, if made several times faster, would have a substantial impact on normal application performance. And if you think about it, it’s quite obvious that this is true, since interpreted languages like Ruby and Python really are orders of magnitude slower than Java and Scala at computation. Does anyone seriously think that companies using Rails and Django are OK with buying 30 times as many machines to run their web sites? “Bill, order another 300 servers — we’ve decided to use Ruby.” Of course not — the penalty is nowhere near that steep for most applications, because so much of the total time is spent waiting for I/O to complete.

3. Generally only 5% or so of the code in a large application really matters much from a performance perspective — the rest could be made several times faster and you could hardly tell. Might there be some small portion of the code that you want to finesse for the sake of performance? Sure, happens all the time — in Scala, in Java, in C, and even in assembly language. If performance matters enough, you even move functionality to hardware — ask anyone who works on high-performance networking equipment or devices that process or render images. But we don’t move *everything* to hardware just because custom hardware is faster — that would be suicide. We move what matters. You take that small portion of code that’s critical to performance and you herniate it and yourself to speed it up, but you leave the rest of the program alone, unoptimized but maintainable, because it just doesn’t matter enough to overall performance. It would be silly to do otherwise.

4. The performance of Scala code depends on how you write it. It is true that on the JVM there is a performance hit for writing functional code, and that really does mean that in that 5% or so of the code that is performance-critical you should consider writing while loops rather than for-comprehensions, using mutable variables, preferring arrays over other collection types, and so on. At that point you’re just using Scala as a better Java, taking advantage of type inference, omitting semicolons, enjoying a better type system, writing tail-recursive code instead of loops that mutate data, etc. But you are still getting essentially the same performance you get from Java itself, with those conveniences. So even in this case, what is the incentive to use Java? And for the other 95% or so of the code, your goal should be to make it robust, maintainable, extensible, etc., in which case you are far better off with Scala, using functional code, immutable data, actors, and so on.
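To make the two styles concrete, here is a toy example of my own: first functional, then rewritten as Scala-as-a-better-Java, the sort of change you might make in a measured hot spot:

```scala
// Functional style: clear and concise; fine for the ~95% of code where
// performance is not critical.
def sumSquares( xs: Seq[Int] ): Long =
  xs.map( x => x.toLong * x ).sum

// "Better Java" style: a while loop with a mutable index and accumulator
// over an array, avoiding intermediate collections.
def sumSquaresFast( xs: Array[Int] ): Long = {
  var i   = 0
  var acc = 0L
  while ( i < xs.length ) {
    acc += xs(i).toLong * xs(i)
    i   += 1
  }
  acc
}
```

Both compute the same result; the second simply trades a little clarity for speed where that trade is worth making.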

This post gives a great example of the difference between using Scala as a better Java and writing functional Scala. Using immutable linked lists rather than mutable arrays, and filtering rather than modifying in place, make the code dramatically simpler — but also much slower. What may be more of a surprise is that when the author used Scala as a better Java on an in-place sort in an array, the Scala version outperformed the Java version (because the Scala compiler optimizes simple tail recursion). So it’s a trade-off, and it’s up to you to decide when to strive for maintainability and when to strive for performance. But if you are leaning hard toward performance more than about 5% of the time in a large app, you are probably doing your team a disservice.

The functional version — taken from an old paper introducing Scala — should be thought of as a demonstration of the expressive power of Scala, not as the “right way” to do things in Scala.

In fact, the performance difference demonstrated in that post is not really about Scala vs. Java. If you wrote Java code to sort an immutable linked list using the same technique, it would perform just as poorly. The only reason we even think to blame Scala for its poor performance in that algorithm is that the Java implementation would have been so painful to write that we wouldn’t bother. Immutable linked lists are easy in Scala and perform admirably for many algorithms, but they are a poor choice for huge sorts; we developers are expected to know when their use is appropriate. You can’t observe that some poorly performing algorithm was far easier to implement in Scala than Java and conclude that Scala’s performance is poor. It’s not Scala’s fault if you do something silly, even if Scala made it easy to do it.

This StackOverflow answer about how to count the distinct vowels in an English word gives a dramatic Scala example of recoding something that is straightforward for much higher performance.

Scala gives you a wider range of choice in the tradeoff between performance and maintainability than Java does, but you can always get Java-level performance if you need it. So the real difference between the languages is not about performance at all — it’s that you can write much more maintainable code in Scala where performance is not critical. Where it is critical, you do what you need to do, and that’s pretty much the same thing in the two languages. Don’t take this chart too literally, but I’ve tried to roughly convey Scala’s performance spectrum here:

How to think about Scala performance

What you should get out of this diagram is that for that 5% where performance is critical, Scala code can be about as fast as Java code while being somewhat more convenient; and for the 95% where performance isn’t so important, Scala offers agility comparable to that of dynamic languages while performing better. We just discussed an instance of the former; here are some examples of Scala’s agility: A, B, C. I doubt that any could be expressed better, or perhaps even as well, in Python or Ruby.

5. Immutable collections are slower than their mutable counterparts, leading some to suggest that they should be avoided. However, immutable data structures are a huge win when it comes to making your code concurrent, since you can hand off a structure and then modify it without synchronizing or making a copy, both of which have serious performance implications. Also, the use of immutable data usually makes your code easier to reason about. So you should lean toward immutable collections, and Scala makes that easy. Then if it turns out that you could improve performance substantially by using a mutable collection somewhere, go for it, but be careful not to paint yourself into a corner. If your app is a success and demand increases dramatically, you may very well want to make your app more concurrent, and everything mutable will be a land mine waiting to blow up in your face.

6. Often the really big performance wins are in the architecture rather than in individual lines of code — the problem is what you are doing, not how fast you are doing it. Opportunities for transformative architectural change are likely to be more obvious to you — and easier to implement — if you are not knee-deep in boilerplate, imperative code, and mutable data.

Conclusion

So please, no more talk of “if I can serve all of my customers on one machine instead of 100 machines….” Balderdash! And mandates like “never use closures” or “never use immutable collections” are certainly not good advice.

A much better mandate would be to deal with performance in Scala the way you do — or at least should — in any language: design good abstractions, write maintainable code, profile it, and then trade off maintainability for performance only where the engineering cost of writing/testing/maintaining a more complex implementation is exceeded by the performance benefit it delivers. If you do that, you will end up with an app that performs close enough to a Java app for your needs but is much more maintainable and scalable.

UPDATE: This article talks about the goals for Scala 2.11 (currently under construction), which further target performance.

[other Scala posts]  [An Impromptu Scala/Groovy Competition]

An Impromptu Scala/Groovy Competition

I got a kick out of this post. The author, who was learning Scala, came up with a solution to this problem. Don’t bother to analyze it, but here is his solution (digits — what that post calls bigNum — is a String of 1000 digits):

def product( digits:String ) = { digits.map( java.lang.String.valueOf(_).toInt ).reduceLeft( (prod,c) => prod*c ) }
digits.sliding(5).toSeq.map( s => product(s) ).max

He was really happy with it and, understandably, wanted to share it and explain how it works.

Well, the very first comment was a somewhat tacky “mine is better because it’s shorter” Groovy solution (don’t bother with it either), albeit shorter mostly by not breaking out some of the work into a function as the post’s author had chosen to do for clarity (I’ve changed variable names to match the Scala version):

( 0 .. digits.size()-5 ).collect { digits.substring( it, it+5 ).toList().inject(1) { prod,c -> prod * c.toInteger() } }.max {it}

But then a more seasoned Scala developer absolutely blew the Groovy solution’s doors off with this:

digits.map{_.asDigit}.sliding(5).map{_.product}.max

That is, convert the digit characters to their numeric values, map every sliding window of five numbers to its product, and find the maximum of those. Assuming you know what a “sliding window” is and what map does, it’s hard to imagine a much more readable solution.

Now the truth of the matter is that having the shortest solution is usually not nearly as important as having a maintainable (understandable and non-fragile) solution, but in this case the Scala solution is far better on all of those counts. Also, Scala code will often be longer than the equivalent Groovy code, because Scala is statically typed. Whether or not performance, and catching type errors before you ship, are worth having to specify some type information is up to you. But in this case those who bragged about Groovy have good reason to be embarrassed.

But there was more. The ill-fated Groovy challenge was followed by another comment saying, effectively, that the Scala code was unreadable and that clever coders should be fired and replaced with average coders who write three times as much code in Java to do the same thing, because then everyone will understand it later.

Really?

If a company I owned stock in expressed that attitude, I would start thinking about putting in a sell order. There are very good business reasons to use modern programming languages, and there are very good business reasons to hire smart software developers. I am bewildered by people who think it is better to stick with older, less capable languages so that they only need mediocre staff. Good luck!

There’s a big difference between hard-because-unfamiliar and obtuse. The seasoned Scala coder’s solution is way more understandable, to anyone who knows Scala, than an equivalent Java solution would be to another Java developer.

But even if it weren’t understandable, to you — even if you didn’t have any idea what those methods do, or couldn’t synthesize their collective effect in your mind — it is really easy to figure such things out in the REPL. Just start by evaluating some subexpression and iterate toward the whole thing, recalling the previous line with an up-arrow and adding to it:

scala> val digits = "9873459873"
digits: String = 9873459873
scala> digits.map{_.asDigit}
res0: scala.collection.immutable.IndexedSeq[Int] = Vector(9, 8, 7, 3, 4, 5, 9, 8, 7, 3)
scala> digits.map{_.asDigit}.sliding(5)
res1: Iterator[scala.collection.immutable.IndexedSeq[Int]] = non-empty iterator
scala> digits.map{_.asDigit}.sliding(5) foreach println
Vector(9, 8, 7, 3, 4)
Vector(8, 7, 3, 4, 5)
Vector(7, 3, 4, 5, 9)
Vector(3, 4, 5, 9, 8)
Vector(4, 5, 9, 8, 7)
Vector(5, 9, 8, 7, 3)
scala> digits.map{_.asDigit}.sliding(5).map{_.product} foreach println
6048
3360
3780
4320
10080
7560
scala> digits.map{_.asDigit}.sliding(5).map{_.product}.max
res4: Int = 10080

For very little effort we have walked through the code step by step as it transforms the data into the solution. Given a working knowledge of Scala, the most junior developer on your staff could do this.

The REPL is a fantastic tool in the developer’s toolbox. Back when IDE support for Scala was young and unstable, you’d see comments about how tool support for Scala was not as good as Java’s. (To be honest, it’s still not as good, but it’s quite usable.) It was a legitimate concern, but even then Scala had a nice REPL and Java did not.

And now there is a “Scala worksheet” in Eclipse and IntelliJ that allows you to do such explorations in a way which is often more convenient, and these explorations can be stored alongside your code as documentation. And because Scala is a statically typed language, these tools can look up and complete method names for you, a big time-saver. So the guy who comes along in six months and has to figure out what some functional Scala code does is going to find that much easier than reverse-engineering the equivalent imperative Java code.

But please, go right ahead and fire those smart software developers who insist on programming in modern languages. We want to hire some!

[other Scala posts]

A Quick Look at Scala

Here are a few fairly simple steps in Scala that show a little of its flavor and power, followed by a bit of comparative Java code.

      • Let’s start by creating a map from integers to their English names. We’ll populate the map for integers 1 and 2 as examples, and make the name of any integer not in the map “UNKNOWN”.
        val map = Map( 1->"one", 2->"two" ) withDefaultValue "UNKNOWN"
      • Now let’s produce a list of the values from that map that have even keys. That is, for each key-value pair in the map, if the key is even, we want the value.
        for ( (k,v) <- map if k%2 == 0 ) yield v
      • OK, now generalize that by writing a method which, given such a map from Int to String and any test function for the keys, returns a list of those values whose keys pass the test (i.e. the test function returns true).
        def pickValues( map: Map[Int,String], keyTest: Int => Boolean ) =
          for ( (k,v) <- map if keyTest(k) ) yield v
      • Next let’s generalize the method to work with any type of key and any type of value. That is, we’ll write a method which, given any map and a test function appropriate for the map’s keys, will return a list of those values whose keys pass the test.
        def pickValues[K,V]( map: Map[K,V], keyTest: K => Boolean ) =
          for ( (k,v) <- map if keyTest(k) ) yield v
      • Now let’s invoke that last method to get the names of even numbers from the map we created earlier.
        pickValues( map, (x:Int) => x%2 == 0 )
      • That’s already pretty nice.  With some more advanced tricks (which I won’t go into here) we could make that
        map pickValues ( _%2 == 0 )
      • But in reality you probably wouldn’t write this at all because Map has a method that produces the subset Map whose keys satisfy some predicate.  You could just ask for the values of that subset.
        ( map filterKeys ( _%2 == 0 ) ).values

Did you follow that? My guess is that even if you aren’t a Scala developer, you can more or less understand what’s going on without someone having to explain the details. And we very quickly arrived at a fairly general-purpose bit of code that is straightforward to use.

Now for some Java.

        • First let’s create that map.
          Map<Integer,String> map = new HashMap<Integer,String>();
          map.put(1,"one");
          map.put(2,"two");
In addition to being more verbose and less readable, that falls a bit short in that it doesn’t specify a default value.  You could use some other collections library if you wanted that, but the JRE Map interface and its implementation classes just don’t have that feature, so you would either have to handle defaulting each place you used the map or write a method to take care of that once and for all.
        • Now produce a list of the values from that map with even keys.
          List<String> list = new LinkedList<String>();
          for ( Integer i : map.keySet() ) {
              if ( i%2 == 0 ) list.add( map.get(i) );
          }

The next step would be to generalize this to a method which takes a test function on Integers. We could do it, of course, but we’re already at the limit of what most people would be willing to do in order to avoid repeating themselves. Since there are no functions in Java, what we wrote in Scala as

(x:Int) => x%2 == 0

would become an unwieldy anonymous inner class. Most people probably wouldn’t bother. If you are serious about wanting to do such things, then at the very least you should look at Google’s collection library for Java. But even that is going to be much less friendly than Scala.

Perhaps more importantly, the Java collections are not immutable. This may not seem like a big deal when you are first starting out in Scala, but it really has far-reaching implications in a concurrent environment.
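To make that concrete, here is a tiny sketch (the values are mine): "updating" an immutable Scala Map returns a new map and leaves the original untouched, so it can be handed to other threads without locks or defensive copies.

```scala
// "Adding" to an immutable Map builds a new map; the original never
// changes, so concurrent readers of m1 can never observe a partial update.
val m1 = Map( 1 -> "one", 2 -> "two" )
val m2 = m1 + ( 3 -> "three" )

m1.size   // still 2
m2.size   // 3
```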

Part of the reason I wanted to blog about this is that in the now-infamous Yammer letter in which an engineer who likes the Fantom programming language tries to make Scala sound like a horrible programming language with a horrible community, he enumerates the things he had to explain to new members of his team for even the simplest usage of a collection:

        • implicit parameters
        • builder typeclasses
        • “operator overloading”
        • return type inference
        • Traversable
        • TraversableOnce
        • GenTraversable
        • Iterable
        • IterableLike
        • Should they be choosing the most general type for parameters, and if so what was that?
        • What was a =:= and where could they get one from?

Really? All of that was necessary in order to understand even the simplest use of a collection? As you can probably tell, I am not a fan of his teaching style.

Rather than burying the magic necessary to create a really good collections library in the compiler, where it would be unavailable to the rest of us, the Scala designers made the type system powerful enough to allow a high-quality collections library to be written as ordinary Scala code. The good news is that that means more power for us to use in our own code. Unfortunately this also means that when you look at detailed type signatures in the collections library you see the many traits and classes needed to create it. This is called the “visible plumbing problem.”

Recently the API documentation has been changed to make the less interesting parts of the plumbing less visible. And even so, the type signatures can be a little daunting for newcomers. But nobody should be deceived into thinking that they’ll have to understand GenTraversable, IterableLike, etc. in order to use Scala collections — that’s just FUD. The only reason I can think of for deliberately dragging someone new to Scala through all of that plumbing is to try to convince them that Scala is hard. When teaching someone to drive, do you start by popping the hood and explaining how the drive train works?

Fortunately I didn’t have that guy’s help getting started.

[other Scala posts]

Scala Snippets

This post is really just notes to myself, which I add to as I find things. But I thought I’d share — maybe others will find them useful.


0. Resources:


1. Use flatMap with monadish objects like Some/None to pick out (and map) the interesting stuff:

  val data  = Map( "name" -> "Fred", "age" -> 21, "iq" -> 140 )
  val attrs = List( "name", "age", "income" )
  attrs flatMap data.get   // List(Fred, 21); no iq or income, poor guy

2. Use partial functions to put cases in derived classes which get used from the base class. For example, here is how you would take care of handling bad input in one place (badValue) for a set of related classes:

  import scala.{ PartialFunction => PF }
  abstract class Base {
    def validCases: PF[...]  // define in subclasses
    def   badValue: PF[...] = { case value => err( "bad value: " + value ) }
    def handle( value:... ) = ( validCases orElse badValue )(value)
  }
  class Derived1 extends Base {
    def validCases = {
      case ... => ...
      case ... => ...
    }
  }
  class Derived2 extends Base {
    def validCases = {
      case ... => ...
      case ... => ...
    }
  }
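Filling in the elided pieces, here is a minimal runnable version of that pattern (the Int/String types and the class names are just for illustration):

```scala
import scala.{ PartialFunction => PF }

abstract class Handler {
  def validCases: PF[Int,String]                                  // define in subclasses
  def badValue:   PF[Int,String] = { case v => "bad value: " + v }
  def handle( value:Int ) = ( validCases orElse badValue )(value)
}

class SmallInts extends Handler {
  def validCases: PF[Int,String] = {
    case 0 => "zero"
    case 1 => "one"
  }
}

// new SmallInts().handle(1)  // "one"
// new SmallInts().handle(9)  // "bad value: 9"
```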

3. Use this.type in methods in a base class where you need to return the derived type (otherwise Scala will use the base type). Useful for builder hierarchies:

  abstract class Base {
    def withDefault(...):this.type = {
      ...
      this // but as the derived type!
    }
  }
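For instance (a hypothetical builder, names mine): withName below is written once in the base class, yet because it returns this.type rather than Builder, chaining still yields the subclass, so withLimit can follow it.

```scala
abstract class Builder {
  protected var name = ""
  def withName( n:String ): this.type = { name = n; this }   // returns the derived type
}

class QueryBuilder extends Builder {
  private var limit = 0
  def withLimit( n:Int ): this.type = { limit = n; this }
  def build = "query(" + name + ", limit=" + limit + ")"
}

val q = ( new QueryBuilder ).withName("users").withLimit(10).build
// q == "query(users, limit=10)"
```

Had withName been declared to return Builder, the .withLimit call after it would not compile.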

4. If you are having type problems in Eclipse, try these in order:

  1. Save all files and wait a few moments.
  2. From the menu, do Project / Clean and wait a few moments.
  3. One by one, add types to closure args and method returns.
  4. Eliminate implicit defs, even if you are absolutely convinced that they aren’t coming into play in the problematic code.
  5. If you are getting a “missing parameter type” error where you think the compiler should be able to infer the type, try assigning an intermediate result to a variable to force the compiler to choose the most specific type.

5. From Daniel Sobral, an expression that, given a string containing words (and perhaps other crud), returns a map from words to their frequencies. \P{Alpha} is the negation of \p{Alpha}, the POSIX alphabetic character class. See the JDK 6 Pattern javadoc for details on regex constructs. Note the use of identity, which is just ( x => x ); it’s already defined in Predef, so you don’t need to write your own.

  val text = "How much wood could a wood chuck chuck, if a wood chuck could chuck wood."
  text.toLowerCase.split("""\P{Alpha}+""").groupBy(identity).mapValues(_.length)

6. Don’t forget that you can see the code scalac magically generates for you using a compiler option. The output below has been manually formatted (somewhat) and commented:

% cat Person.scala
 case class Person( name:String, age:Int )

% scalac -Xprint:typer Person.scala    # this is 2.9.1-final

[[syntax trees at end of typer]] // Scala source: Person.scala

package  {

// Here's the generated class.
case class Person extends java.lang.Object with ScalaObject with Product with Serializable {

  // Private members and public accessors.
  private[this] val name: String = _; def name: String = Person.this.name;
  private[this] val  age: Int    = _; def  age: Int    = Person.this.age;

  // Constructor.
  def this(name: String, age: Int): Person = {
    Person.super.this();
    ()
  }

  // Versions of copy() for every combination of supplied parameters.
  def copy(name: String = name, age: Int = age): Person = new Person(name, age)
  def copy$default$2: Int    @scala.annotation.unchecked.uncheckedVariance = Person.this.age
  def copy$default$1: String @scala.annotation.unchecked.uncheckedVariance = Person.this.name

  // The usual stuff you want for every immutable class.
  override def hashCode(): Int    = ScalaRunTime.this._hashCode(Person.this)
  override def toString(): String = ScalaRunTime.this._toString(Person.this)
  override def equals(x$1: Any): Boolean =
    Person.this.eq( x$1.asInstanceOf[java.lang.Object] ).||( x$1 match {
      case (name: String, age: Int)Person( (name$1 @ _), (age$1 @ _) ) if name$1.==(name).&&( age$1.==(age) )
            => x$1.asInstanceOf[Person].canEqual(Person.this)
     case _ => false
    } )
  override def canEqual(x$1: Any): Boolean = x$1.$isInstanceOf[Person]()

  // Methods that allow all case classes to be self-describing.
  override def productPrefix: java.lang.String = "Person";
  override def productArity: Int = 2;
  override def productElement(x$1: Int): Any = x$1 match {
    case 0 => name
    case 1 => age
    case _ => throw new java.lang.IndexOutOfBoundsException(x$1.toString())
  }

}  // class Person

// Here's the companion object.
final object Person extends scala.runtime.AbstractFunction2[String,Int,Person]
  with ScalaObject with Serializable {

  def this(): object Person = {
    Person.super.this();
    ()
  };

  final override def toString(): java.lang.String = "Person";

  case def unapply(x$0: Person): Option[(String, Int)] =
    if (x$0.==(null))
        scala.this.None
      else
        scala.Some.apply[(String, Int)](scala.Tuple2.apply[String, Int](x$0.name, x$0.age));

  case def apply(name: String, age: Int): Person = new Person(name, age);

  protected def readResolve(): java.lang.Object = Person

}  // companion object

}  // package

7. The usual idiom for providing Java versions of Scala APIs:

  import scala.collection.JavaConverters._
  def jFoos = foos.asJava
  def  foos = ...

Info on Java-Scala conversions, and why it is better to avoid the implicit JavaConversions.  Info on JavaConversions.  Do this if you want exceptions checked in the Java code:

  @throws( classOf[TimeoutException] )
  def jFoos ...

8. Passing on repeated parameters (a.k.a. varargs), with modification:

  def foo( a:String, b:Int* ) = bar( a, 7 :: b.toList : _* )

And here’s how you match repeated parameters:

  case Seq( xs @ _* ) => ...
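A runnable sketch of both idioms (foo, bar, and describe are made-up names):

```scala
def bar( a:String, b:Int* ) = a + b.mkString( "[", ",", "]" )
def foo( a:String, b:Int* ) = bar( a, 7 :: b.toList : _* )   // prepend, then splice back in

def describe( xs:Int* ) = xs match {
  case Seq( first, rest @ _* ) => first + " and " + rest.length + " more"
  case Seq()                   => "empty"
}

foo("x", 1, 2)      // "x[7,1,2]"
describe(5, 6, 7)   // "5 and 2 more"
```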

9. Don’t do this (unless you really mean to):

  val x = new X {...}

That leads to the use of reflection (x is structurally typed). This fixes it:

  val x:X = new X {...}

More details here.


10. To filter a Seq by type in a for comprehension, you need to use this trick:

  val mixedTypes = List( 5, "five" )
  for ( (x:Int) <- mixedTypes ) yield x     // fails: (x:Int) alone is not a pattern
  for ( y@(x:Int) <- mixedTypes ) yield x   // List(5)
But if you use a pattern, it works without the trick:

  val mixedTypes = List( (1,5), (1,"five") )
  for ( (a,b:Int) <- mixedTypes ) yield b   // List(5)

Basically what’s happening is that if you use a pattern, you get a PartialFunction, which will handle values that match and ignore values that don’t. (x:Int) is not a pattern, but y@(x:Int) is.
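The same type-based filtering is often written with collect, which takes a PartialFunction directly:

```scala
val mixedTypes = List( 5, "five" )
val ints = mixedTypes collect { case x:Int => x }   // List(5)
```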


11. If you are thinking in terms of zipWith or mapcar to combine multiple Seqs, this is probably what you want:

  val xs = List(1,2,3)
  val ys = List(7,8,9)
  (xs,ys).zipped map (_+_)  // List(8,10,12)

Note that no intermediate list is built, as there is with

  xs zip ys map { case (x,y) => x+y }

and that zipped works for more than two input Seqs:

  (xs,ys,zs).zipped map ( _*_/_ )

Apparently this only works with Tuple2 and Tuple3, though. Also, be careful — zipped is strict, so attempting to map zipped infinite Streams will win you a StackOverflowError or OutOfMemoryError.


12. Here’s a discussion about profiling and performance. They rave about the YourKit profiler and the Caliper microbenchmarker, and point to this interesting post. This is all pre-macros, so the advice may need an update soon.


13. Here’s how to use a PriorityQueue to get the top N (in this case 4) elements of a larger list by some ordering criterion, without having to sort the entire list:

  val pq = PriorityQueue( 7,3,4,1,2,0,9,5,8,6 )( Ordering.Int.reverse )
  Iterator.fill(4)( pq.dequeue ) foreach print  // 0123

Of course, a PriorityQueue is perfect for situations where you keep getting new entries and want to always process the highest-priority one first. BUT — currently there is no immutable PriorityQueue in the standard library, only a mutable one. You might think that you could use SortedSet (TreeSet) as an immutable replacement for PriorityQueue, but it is the set’s Ordering that determines whether or not a value is already in the set, not equals() — so an attempt to add something to the collection which compares as equal to an existing member of the set will be ignored. For example, if you have a SortedSet of Vectors ordered on the size of the vector, then the set will only hold one vector of length 4. Only use SortedSet if you are sure that distinct values will never be considered equal by the set’s Ordering#compare.
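Here is that pitfall in action, using vectors ordered by length (my example):

```scala
import scala.collection.immutable.SortedSet

// Two distinct vectors of the same length compare as equal under this
// Ordering, so the second add is silently ignored.
val byLength = Ordering.by[Vector[Int],Int]( _.length )
val s = SortedSet( Vector(1,2) )( byLength ) + Vector(3,4)

s.size   // 1 -- Vector(3,4) was dropped
```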


14. GenTraversableFactory contains methods (which are available implicitly for Traversables) for creating/initializing new collections:

  collection.mutable.ArrayBuffer.fill(3,2)(1)           // 3x2 ArrayBuffer(s) of 1s
  List.fill(10)( util.Random.nextInt(2) )               // 10 of 0|1
  new String( Array.fill(20)( util.Random.nextPrintableChar ) )  // random String
  Vector.tabulate(3,3)( (i,j) => if (i==j) 1 else 0 )   // 3x3 identity matrix
  util.Random.shuffle( List.range(0,11,2) )             // 0,2,...,10 shuffled
  Iterator.fill(20)( util.Random.nextGaussian ) foreach println  // don't make a List
  List.iterate(1,10)(2*)                                // 1,2,4,8,...,512
  List.concat( 1 to 4, 9 to 12, Set(2,5) )              // 1,2,3,4,9,10,11,12,2,5

15. It’s not surprising that scala.util.Properties lets you get/set properties, but the same class will also tell you which OS you are on, your username, your temp dir, the values of environment variables, etc.

  Properties.propOrNone("foo") match { case None => ...; case Some(foo) => ... }
  Properties. envOrElse("foo","bar")

16. A nice post from Daniel Sobral (a Scala bard) explaining for comprehensions.


17. Tips for investigating long compile times.


18. Here’s the usual way of defining a Stream of Fibonacci numbers, using zip (remember that zipped won’t work because it’s strict):

  val fibs: Stream[BigInt] = 0 #:: 1 #:: ( fibs zip fibs.tail map ( n => n._1 + n._2 ) )

But you can make a more efficient version of this kind of Stream by avoiding the zip, like so:

  val fibs: Stream[BigInt] = {
    def loop( h:BigInt, n:BigInt ): Stream[BigInt] = h #:: loop( n, h+n )
    loop(0,1)
  }

19. To define cyclical dependencies:

  trait A { self:B => ... }
  trait B { self:A => ... }
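A small concrete example (trait names mine): each trait can use the other’s members via its self-type, without either extending the other:

```scala
trait Named   { self:Greeter => def name: String }
trait Greeter { self:Named   => def greet = "hello, " + name }

// Anything instantiated must mix in both:
val g = new Greeter with Named { val name = "world" }
g.greet   // "hello, world"
```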

20. The stackable trait pattern:

  abstract class       Ab { def f(...) : ... /* abstract */ } // or trait
  class Base   extends Ab { def f(...) = ... }
  trait Tweak1 extends Ab { abstract override def f(...) = super.f(...) }
  trait Tweak2 extends Ab { abstract override def f(...) = super.f(...) }

  val tweaked = new Base with Tweak1 with Tweak2    // or whatever
  tweaked.f(...)  // calls Tweak2.f calls Tweak1.f calls Base.f  // reverse order

If Ab is an abstract class (as opposed to a trait) it can take constructor parameters, but of course you can only extend one class.
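Filling in the sketch with a concrete String-transforming example (my names):

```scala
abstract class Msg         { def f( s:String ): String }
class Plain   extends Msg  { def f( s:String ) = s }
trait Exclaim extends Msg  { abstract override def f( s:String ) = super.f(s) + "!" }
trait Shout   extends Msg  { abstract override def f( s:String ) = super.f(s).toUpperCase }

val tweaked = new Plain with Exclaim with Shout
tweaked.f("hi")   // "HI!" -- Shout.f calls Exclaim.f calls Plain.f
```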


21. Implicit value classes allow you to effectively add methods to a class without actually creating new objects:

  // This value class adds a times() method to Int, allowing you to do
  // 3 times { ... }
  implicit class Int_times(val numTimes:Int) extends AnyVal {
    def times(f: => Unit) {
      var i = numTimes
      while (i > 0) { f; i -= 1 }
    }
  }

22. To get the conveniences of case classes within a class hierarchy, make the case classes the leaves, duplicate any common fields in them (sorry DRY), and use abstract methods to represent the common fields in a superclass:

  // Want to be able to access node.value without having to figure out
  // whether it is a Branch or a Leaf.
  abstract class   Node { val value:Int }
  case     class Branch(      value:Int, left:Node, right:Node ) extends Node
  case     class   Leaf(      value:Int                        ) extends Node

23. Be careful with mapValues and filterKeys — they currently return a view (without saying so — a bug), which means that the computation gets done again every time you use the result.  Use map.mapValues(…).view.force to avoid this, or map the result through the identity function:

scala> val m = Map(1->2, 3->4) mapValues { x => println("OUCH!"); x+1 }
OUCH!
OUCH!
m: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3, 3 -> 5)

scala> m(1)
OUCH!
res20: Int = 3

scala> val n = m map identity
OUCH!
OUCH!
n: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3, 3 -> 5)

scala> n(1)
res21: Int = 3

24. Array equality is weird (a Java legacy, presumably):

  import collection.mutable
  import mutable.ArrayBuffer
          Set(1,2)        ==           Set(1,2)        // true
         List(1,2)        ==          List(1,2)        // true
  mutable.Set(1,2)        ==   mutable.Set(1,2)        // true
  ArrayBuffer(1,2)        ==   ArrayBuffer(1,2)        // true
        Array(1,2)        ==         Array(1,2)        // FALSE!!!
        Array(1,2)   sameElements    Array(1,2)        // true
        Array(1,2).deep   ==         Array(1,2).deep   // true
        Array(1,2).toSeq  ==         Array(1,2).toSeq  // true (WrappedArray)

25. Constructor goodies:

  class A         (                   b:Int )    // b is just a constructor arg
  class A         (               val b:Int )    // b is a field
  class A         (       private val b:Int )    // b is private
  class A private (                   b:Int )    // primary constructor is private,
                                                 // but usable by other constructors
  class A         ( @BeanProperty var b:Int )    // adds JavaBean getB/setB methods

And you can subclass a Java class with multiple constructors like so.
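One common shape for that (my example, subclassing java.lang.RuntimeException, which has several constructors): only the primary constructor can invoke a superclass constructor, so auxiliary constructors delegate to it.

```scala
// The primary constructor maps to the two-argument superclass constructor;
// the auxiliary constructors (def this) must delegate to the primary.
class AppError( msg:String, cause:Throwable ) extends RuntimeException( msg, cause ) {
  def this( msg:String ) = this( msg, null )
  def this()             = this( "unknown error" )
}

new AppError("boom").getMessage   // "boom"
```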


26. Composing Futures:

  import scala.concurrent._
  import ExecutionContext.Implicits.global
  def f(x:Int) = { Thread.sleep(5000); x }
  val futures = 1 to 9 map { x => Future( f(x) ) }  // Seq[ Future[Int] ]
  val future  = Future sequence futures             // Future[ Seq[Int] ]
  future onSuccess { case results => println( results.sum ) }
  println("this gets printed before the sum")

27. To write a set of programs which all have to do the same setup (examine environment variables, process arguments, open connections, etc.) and teardown, rather than calling out to a library to do the common stuff you can simply extend App, put the common stuff in there, and then extend that to create each of your programs:

  class  DbApp  extends   App { /* common stuff */ }
  object DbApp1 extends DbApp { ... }
  object DbApp2 extends DbApp { ... }

The children of DbApp will of course inherit any variables defined by it.


28. You cannot use _root_ to import from the default package (ultimately a JVM limitation):

                class C1
  package a   { class C2 }
  package a.b { class C3 {
                  val c2fails = new C2            // not found (in a.b)
                  val c2works = new _root_.a.C2   // OK
                  val c1fails = new _root_.C1     // not a member of pkg <root>
              } }

29. To throw an exception without the overhead:

  throw new Exception with scala.util.control.NoStackTrace

30. You can use collection.breakOut (the breakOut method on the scala.collection package object) to provide a CanBuildFrom that produces a specific type, obviating a toList or other conversion.

  type S = String
  def cross( a:S, b:S ):List[S] = a.flatMap( c1 => b.map( c2 => ""+c1+c2 ) )(collection.breakOut)
  cross("ab","12")  // List("a1","a2","b1","b2")

And here Rex Kerr shows how to use CanBuildFrom in your own code to make it return the type you started with.


31. While you can use “System exit 1” to exit a program just as you would in Java, “sys exit 1” is better in that it returns Nothing, allowing you to use it in expressions without losing the type info:

  def die( msg:String ) = { log(msg); sys exit 1 }
  val age = if ( born <= now ) now - born else die("born in the future!")

32. Do this to make an infinite Stream by repeating a TraversableOnce:

  implicit class RichTraversable[A](val items: TraversableOnce[A]) extends AnyVal {
    def looped =
      if (items.isEmpty)
        sys error ".looped"
      else {
        lazy val s: Stream[A] = items.toStream #::: s
        s
      }
  }
List(1,2,3).looped take 10 foreach print  // 1231231231

But that’s still a Stream, so it will eventually consume the heap storing calculated results. You could of course make an Iterator, e.g.

  implicit class TraversableLooper[A : scala.reflect.ClassTag](val items: TraversableOnce[A]) {
    def looped: Iterator[A] =
      if (items.isEmpty)
        sys error "(empty).looped"
      else
        new Iterator[A] {
          val array = items.toArray
          val numItems = array.length
          var i = -1
          def hasNext = true
          def next = {
            i += 1
            if (i == numItems) i = 0
            array(i)
          }
        }
  }
  (List(1,2,3).looped take 900000000).sum   // 1800000000

Note that you can’t currently make the latter a value class because it defines an inner class. That restriction is probably going to be eliminated (in 2.11?).  And notice the use of ClassTag so that we can create an Array[A] without knowing what A is.


33. You can do this to choose between methods once (and then use whichever you have chosen repeatedly):

  val f = if (...) f1 _ else f2 _

But if f1 and/or f2 are overloaded you will probably need to say which you mean, like so:

  val f = if (...) f1(_:Int) else f2(_:Int)

More info here.
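For example (f1/f2 stand-ins of my own):

```scala
def twice( x:Int )  = 2 * x
def square( x:Int ) = x * x

// Choose once; both branches eta-expand to Int => Int.
val useTwice = true
val f = if (useTwice) twice _ else square _
f(21)   // 42
```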


34. Reverse sorting:

  val names = List("d","b","c","a")
  println( names sorted Ordering[String].reverse )

or, more generally:

  def revSort[ A:Ordering ]( s:Seq[A] ) = s sorted Ordering[A].reverse

Or you could (apparently) add a method to Seq:

  implicit class SeqMethods[A](val xs: Seq[A]) extends AnyVal {
    def revSorted(implicit ord: Ordering[A]) = xs sorted ord.reverse
  }

35. You can bind a lowercase variable to an existential type:

  val x: Some[_] = Some("foo")

  x match {
    case st:Some[t] => // t is a type variable
      val it:t = st.get
      var items = Set[t]() // which we can use to define new types
      items += it
      items
  } // Set[_]

Note that you still get Set[_] even if you start with Some[String], because Some[t] would match other types; this doesn’t bind t to String.


36. You can exclude some parameters of a case class from the usual magic by putting them in a separate parameter list.  E.g. here see that c is ignored by toString() and equals():

scala> case class Foo(a: Int, b: Int)(c: Int)
defined class Foo
scala> val f1 = Foo(1,2)(3)
f1: Foo = Foo(1,2)
scala> val f2 = Foo(4,5)(6)
f2: Foo = Foo(4,5)
scala> val f3 = Foo(1,2)(7)
f3: Foo = Foo(1,2)
scala> Set(f1,f2,f3)
res1: scala.collection.immutable.Set[Foo] = Set(Foo(1,2), Foo(4,5))


[other Scala posts]

Enums in Scala

[To cut to the chase, there is a link to a Scala Enum class you might find useful at the end of this post.]

Newcomers to Scala are often surprised to find that it is easier to write enums in Java than it is in Scala. Nearly everything else is easier in Scala than in Java, and often has preferable semantics as well, so the fact that enums are easier in Java comes as a bit of a shock.

But once you think about it, it makes some sense. The designers of Scala resist putting in highly focused language features, instead preferring features that have very broad applicability. The designers of Java said “Developers need enums; let’s enhance the language to make it easy to get good enums.” The designers of Scala, on the other hand, want as much as possible for problems to be solved by writing classes and traits and putting them in a library. The Scala language designers ask themselves questions like “Developers need to be able to do things like X and Y; have we made it possible to write good library solutions for such things?” Something that is preventing library designers from creating good solutions to common problems is a good argument for a new language feature. The Scala language designers tend to see their role at a different level of abstraction.

Consider Perl’s terse syntax for hash tables. The Perl designers (i.e. Larry Wall) decided to build in language features to make hash tables easy to create and use, whereas the Java designers decided to do that in a library rather than in the language itself.

In the same way, Java targeted enums directly, and as a result the syntax is fairly minimal, whereas the Scala folks have so far decided that the problem doesn’t warrant new language features. And even if the Scala folks were to do something special for enums, they couldn’t make it three times easier to create an enum than in Java; it would be about the same.

It could certainly be easier, but without too much difficulty you can make enums in Scala that are similar to Java enums, and actually do a little more. There seems to be a lot of churn on the web about how to do enums well in Scala, so I wrote up something about my own implementation in this StackOverflow answer (how did developers get anything done before StackOverflow?).
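For reference, one common lightweight approach (not the implementation from that answer) is a sealed trait or abstract class with case objects, which buys compiler-checked exhaustiveness in pattern matches:

```scala
sealed abstract class Color( val rgb:Int )
case object Red   extends Color( 0xFF0000 )
case object Green extends Color( 0x00FF00 )
case object Blue  extends Color( 0x0000FF )

// Because Color is sealed, the compiler warns if a case is missing here:
def name( c:Color ) = c match {
  case Red   => "red"
  case Green => "green"
  case Blue  => "blue"
}
```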

UPDATE:  Wouldn’t macros (available in 2.10) allow for a very usable enum facility?  I wouldn’t be surprised to see something in 2.11.  Ah — here’s a thread about that.  And another.

[other Scala posts]

Type Inference — more than just a pretty face

If you have looked at Scala at all, you undoubtedly know that it infers types. Whereas in Java you might say something like this:

HashMap<String,ArrayList<Person>> groups =
    new HashMap<String,ArrayList<Person>>();

you would never have to repeat the type like that in Scala. Yes, that saves you work. And yes, more importantly, it makes your code easier to read (we read much more code than we write). And yes, it means less work if we change the type later, because the type appears in fewer places. In fact, often the type of a variable is never written out explicitly at all:

val groups = Map( "presidents" -> List( Person("Obama"), Person("Bush") ) )

But those are all about convenience. Tonight it dawned on me that there is something else, something beyond convenience, that you get by not having to mention the type.  It has to do with anonymous classes.

Let’s say you are using Java and you want to hide the nasty details involved in communicating with a device we’ll call a framis. You have to get a connection to the framis from a pool, return the connection when you’re done, handle the exception you get if the session expires by reestablishing the connection, do the real work in another thread which you wait for via a Future so that you can time out the operation, and so on. You don’t want that kind of code repeated dozens or hundreds of times, so you create a FramisJob class that handles all that. To use FramisJob, you create an anonymous subclass of it with a go() method that does what you want to do, and invoke a val() method on the new object to kick things off, like so:

Float temperature = new FramisJob<Float>() {
    Float go() throws Exception { return ...; }
}.val();

FramisJob’s val() method performs whatever heroics are required to talk robustly to a framis, invoking go() in another thread when all is ready, and returns whatever result is returned by go() after doing any needed cleanup. FramisJob itself looks something like this:

abstract public class FramisJob<V> {
    V val() {
        V value = null;
        ... // do setup
        try {
            value = go();
        } catch ( Exception e ) {
            ... // handle failures
        }
        ... // do teardown
        return value;
    }
    abstract V go() throws Exception;
}

So off you go, happily using FramisJob without repeating all that setup and teardown boilerplate. Your code is cleaner and more robust.

But then you start running into cases where you need to return two or three values. Because Java doesn’t support closures, go() can’t assign to variables outside the FramisJob directly; in Java an inner class can only use outside variables if they are final. Darn. And in most languages you could just return a tuple, but Java doesn’t offer tuples either. Foiled again!

Then you come up with an idea — define extra variables inside the class itself and pull them out afterwards, like so:

FramisJob job = new FramisJob<Float>() {
    int humidity;
    Float go() throws Exception {
        humidity = ...;
        return ...;
    }
};
Float temperature = job.val();
int humidity = job.humidity;

Unfortunately the compiler doesn’t like your idea. The job variable is of type FramisJob, and a FramisJob does not have a humidity member. Only your extension of FramisJob knows about humidity. Strike three. In order to do this you’ll have to define some additional helper class to hold the returned data.

Finally we come to the epiphany about type inference. Java knows the type of that anonymous class you created, but you do not; there is no name for it — it’s anonymous. Because you have no way of parroting back to Java the type of the thing you just created, a type Java already knows, Java won’t let you access the object’s members. If Java had type inference, you would not give a type for job, it would get the right type automatically, and you could access its members. It is the very lack of type inference that prevents you from accessing the variables and methods you define in anonymous classes.
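Here is the same idea expressed in Scala (a sketch; Job and its members are made up). Because the type of job is inferred as the anonymous refinement, the extra member stays visible — though note that access through such a structural type is done reflectively:

```scala
import scala.language.reflectiveCalls   // structural access uses reflection

abstract class Job { def go(): Double }

val job = new Job {
  var humidity = 0
  def go() = { humidity = 55; 72.5 }
}

val temperature = job.go()
job.humidity   // 55 -- accessible, unlike in Java
```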

So, type inference is not just another pretty face — it has semantic consequences! And I will be so happy when I no longer have to fight with Java’s lack of closures and tuples and type inference and tail call optimization and …. After all, if we were using Scala, we would just be writing (and reading) this:

val (temperature,humidity) = FramisJob( () => ( ... , ... ) )
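And that one-liner isn't hypothetical magic. A minimal `FramisJob` along those lines might look like this — my own sketch, with the setup and teardown left as comments:

```scala
// A sketch of a Scala FramisJob: do setup, run the body, do teardown,
// and return whatever the body returns -- which can be a tuple, so
// multiple results come back for free.
object FramisJob {
  def apply[A](body: () => A): A = {
    // ... do setup
    try body()
    finally {
      // ... do teardown
    }
  }
}

val (temperature, humidity) = FramisJob(() => (21.5f, 40))
```

No abstract method, no final-variable gymnastics, no helper class for the second value.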

[other Scala posts]

Project Euler Problem 1, Beaten to Death in Scala

I have been trying to learn some Scala in my spare time, and like many I decided to use the Project Euler problems for practice. Problem 1 is simply to “Add all the natural numbers below one thousand that are multiples of 3 or 5.” Lots of folks have answered this already; the usual approach is to iterate through the numbers below 1000, find the ones we want using the modulo operator, and sum them, e.g. (from here):

val r = ( 1 until 1000 ).view.filter( n => n%3 == 0 || n%5 == 0 ).sum

Incidentally, the call to view prevents the construction of an intermediate collection (a strict filter would build a new collection of all the matching numbers before summing). The view allows you to iterate through the numbers without actually manifesting such a collection. You can read more about views here, in the nice writeup about the slick rewrite of Scala’s collections API for 2.8.
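To see the difference, compare the strict and lazy pipelines side by side. Both produce 233168; only the strict one materializes the filtered numbers as an intermediate collection:

```scala
// Strict: filter builds an intermediate collection of the matching
// numbers, and then sum traverses it.
val strict = (1 until 1000).filter(n => n % 3 == 0 || n % 5 == 0).sum

// Lazy: the view evaluates filter element by element as sum demands,
// never manifesting the intermediate collection.
val viewed = (1 until 1000).view.filter(n => n % 3 == 0 || n % 5 == 0).sum
```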

What follows are three solutions for a more general statement of the problem, using different Scala language features (for my own edification). They grow increasingly efficient (Fast, Faster, Fastest), the last taking a trivial amount of time compared to the others. If you get bogged down reading this, skip to the Fastest solution — it’s probably the easiest and most interesting.

Fast (Streams)

To begin with, I wanted to solve the problem more generally, by writing a function that takes a limit and an arbitrary list of integers for which we want the multiples under the limit summed. Moreover, I wanted it to be efficient in the sense that if the numbers were large, as in “Add all the natural numbers below 1,000,000 that are multiples of 3137, 2953, or 2789,” we would not do way more calculations than necessary. There are only about a thousand multiples of those integers under a million, so if we examined every integer in the range we would be touching about a thousand times as many numbers as we really needed to.

Instead I wanted to try using Streams, i.e. to generate a Stream of multiples for each of the integers and then merge them together into a single Stream from which I could just takeWhile() the desired numbers and sum() them up. That way not only would we potentially touch far fewer numbers, but we would also do multiplications instead of more expensive modulo operations. Here’s what I came up with:

object SumMults1 extends App {

  val ints = args.toList.map(_.toInt)
  println( sumMults( ints.head, ints.tail ) )

  def sumMults( limit: Int, ints: List[Int] ) = {
    // Merge two infinite Streams of Ints, removing duplicates.
    def merge( xs: Stream[Int], ys: Stream[Int] ): Stream[Int] = {
      val x = xs.head
      val y = ys.head
      if      (x<y) x #:: merge( xs.tail, ys      )
      else if (x>y) y #:: merge( xs     , ys.tail )
      else  /* x=y */     merge( xs     , ys.tail )
    }
    val natNums = Stream from 1
    ints.map( n => natNums.map( n * _ ) ).reduceLeft(merge).takeWhile(_<limit).sum
  }
}

The last line is where it all comes together: it creates an infinite Stream of multiples for each of the integers, reduces them to a single merged Stream, takes those up to our limit, and returns the sum. If you have your head around functional programming, this is probably pretty clear and concise.

Anyway, it works fine, e.g.

% scala SumMults1 1000 3 5
233168

However, I found that if I used a large limit, I got an OutOfMemoryError. I puzzled over this for a while, thinking that perhaps #:: was not lazy under some circumstances or that the third call to merge wasn’t being optimized as a tail call or ??? I used the @tailrec annotation to verify that the third call to merge was being optimized (the first two aren’t, but that’s ok since they are evaluated lazily).  I finally found the answer, by Rex Kerr, on StackOverflow:

Streams are lazily evaluated, but they store the results of their evaluations. Thus, unless you want to traverse through them multiple times, you should use Iterator instead.

So the problem was not that we were trying to evaluate too much of the Stream, but simply that we were saving everything we had previously taken from the Stream, which was quite a pile.

UPDATE: Recently the Scala community has begun an effort to flesh out the scaladoc with introductory text, and the results for Stream are very impressive indeed! There it explains that as long as something is holding onto the Stream object, the memoized objects cannot be garbage-collected.
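The contrast is easy to demonstrate in a few lines (a small sketch of mine, with a deliberately modest bound so the sums fit in an Int):

```scala
// Stream memoizes: as long as s is reachable, every cell evaluated by
// the traversal below stays in memory.
val s: Stream[Int] = Stream.from(1)
val streamSum = s.take(10000).sum

// Iterator remembers nothing: consumed elements become immediately
// eligible for garbage collection.
val it: Iterator[Int] = Iterator.from(1)
val iterSum = it.take(10000).sum
```

Scale the 10000 up far enough and the Stream version keeps a correspondingly large pile of cells alive through `s`, while the Iterator version's memory use stays flat.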

Faster (Iterators)

So, how do we do this with Iterators? It isn’t anywhere near as easy as just replacing Stream with Iterator, because there is no #:: operator with which to construct a new Iterator from existing Iterators. Well, I decided to go for broke, changing a bunch of things to make a much more efficient version:

object SumMults2 extends App {

  val ints = args.toList.map(_.toInt)
  println( sumMultiples( ints.head, ints.tail.distinct ) )

  def sumMultiples( limit: Int, ints: List[Int] ) = {
    val mults = mergeOf( ints.sorted.map(Series(_)) ).takeWhile(_<limit)
    ( BigInt(0) /: mults ) (_+_)
  }

  // A generator for an arithmetic series with the specified head/step.
  case class Series( step:Int ) {
    var head = step
    def tail = { head += step; this }
  }

  // Return an Iterator that merges a sorted List of Series.
  def mergeOf( list: List[Series] ) = new Iterator[Int] {
    var l = list
    def hasNext = true
    def next = {
      val l0   =  l.head
      val head = l0.head
      l = insert( l0.tail, l.tail )
      head
    }
  }

  // Insert a Series into a list of Series, maintaining the ordering.
  def insert( s: Series, l: List[Series] ): List[Series] = l match {
    case Nil => List(s)
    case _   => {
      val sHead  =  s.head
      val l0     =  l.head
      val l0Head = l0.head
      if      ( sHead < l0Head )   s :: l
      else if ( sHead > l0Head )  l0 :: insert(  s     , l.tail )
      else   /* sHead = l0Head */  s :: insert( l0.tail, l.tail )
    }
  }

}

This version uses a mutable Series class to generate the multiples of an integer by simple addition, and then creates an Iterator that merges a list of Series objects. The Iterator keeps the list sorted by the heads of its Series elements, so that the next value of the merge is always just the head of the first Series in the List.

This version is far faster than the other, and doesn’t chew up lots of memory. Also, this version uses BigInt to accumulate the results, so that we can use much larger limits:

% scala SumMults2 1000 3 5
233168
% scala SumMults2 1000000000 3137 1753 2789
623638337230623

Fastest (Math)

But there is a far faster approach still, one that makes the above versions look like lumbering brontosauri. Perhaps you have already noticed the crucial point, that (for example):

3 + 6 + 9 + … + 300

can be rewritten

3 * ( 1 + 2 + 3 + … + 100)

The parenthesized part — as Gauss supposedly knew as a young child — is just

100 * 101 / 2

So the sum we want is just three times that. We can use this mathematical relationship to calculate the sum without actually iterating through all of the numbers!
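As a sanity check, here is that closed form in Scala (`sumMultsOf` is my name for it, not something from the final program):

```scala
// Closed-form sum of the multiples of k that are strictly below limit:
// there are n = (limit-1)/k of them, and they sum to k * n*(n+1)/2.
def sumMultsOf(k: Long, limit: Long): Long = {
  val n = (limit - 1) / k
  k * n * (n + 1) / 2
}
```

For instance, sumMultsOf(3, 301) gives 15150, i.e. 3 + 6 + … + 300, with no iteration at all.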

However, it’s not quite that simple. If asked to sum the multiples of 3 and 5 less than 1000, we can use the above technique to sum up the multiples of 3 and the multiples of 5 and then add them together, but we will have counted the multiples of 15 twice; we have to subtract those out.

And that’s a little trickier to do than it sounds. First, what you really need to subtract out are the multiples of the least common multiple (lcm) of the two numbers, not of their product. So, for example, if asked to sum the multiples of 6 and 15, we need to subtract off the multiples of 30 (not 90). The lcm of two numbers is their product divided by their greatest common divisor (gcd).
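In code that's the familiar pair of one-liners (a Long-based sketch of mine; dividing by the gcd before multiplying keeps the intermediate value small when using fixed-width integers):

```scala
// gcd by Euclid's algorithm; lcm via gcd, dividing first to avoid overflow.
def gcd(a: Long, b: Long): Long = if (b == 0) a else gcd(b, a % b)
def lcm(a: Long, b: Long): Long = a / gcd(a, b) * b
```

So lcm(6, 15) is 30, not the 90 you'd get from a naive product.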

Also, we need to do this for an arbitrarily long list of numbers, so consider what happens if we are asked to sum the multiples of 4, 6, and 10:

  • First sum the multiples of 4.
  • Then add in the multiples of 6, but subtract the multiples of lcm(4,6) = 12.
  • Then add in the multiples of 10, but subtract the multiples of lcm(4,10) = 20 and the multiples of lcm(6,10) = 30.

But oops, now we have gone the other way, subtracting off the multiples of 20 and 30 in common (60,120,…) twice, and our result is too low, so we’ll have to add those back in. And if there were multiple corrections at that level (i.e. if we were given a larger list of numbers), we’d have to subtract their elements in common, and so on ad infinitum. At every step we have to take care not to add or subtract the same numbers twice.

That sounds like a pain, but using recursion it’s pretty straightforward:

object SumMults3 extends App {

  val ints = args.toList.map( BigInt(_) )
  val limit = ints.head - 1
  println( sumMults( ints.tail ) )

  // The sums of all the multiples, up to a limit, of a list of numbers.
  def sumMults( ints: List[BigInt], priorInts: List[BigInt] = Nil ): BigInt =
    ints match {
      case Nil => 0
      case int::rest =>
        val n = limit / int    // # of elements in series to sum
        int * ( n * (n+1) / 2 ) -
          sumMults( priorInts.map( lcm(int,_) ).filter(_<=limit) ) +
          sumMults( rest, int::priorInts )
    }

  // Least Common Multiple and Greatest Common Divisor.
  def lcm( a: BigInt, b: BigInt ) = a*b / gcd(a,b)
  def gcd( a: BigInt, b: BigInt ): BigInt = if (b==0) a else gcd( b, a%b )

}

The first recursive invocation of sumMults takes care of adjusting for the error. Since it uses itself to calculate the error, any double-counting in the adjustment is also handled correctly.

% scala SumMults3 1000 3 5
233168
% scala SumMults3 1000000000 3137 1753 2789
623638337230623

In this version we use BigInts for everything because at this point we are doing so few operations that it hardly matters. Even for very large numbers which would take the other solutions years to process (if they could even handle numbers that large), the program takes only slightly longer than it takes to initialize the JVM:

% time scala SumMults3 10000000000000000000000000 3 5 7 9
27142857142857142857142857857142857142857142857145
scala SumMults3 10000000000000000000000000 3 5 7 9 0.25s user 0.00s system 90% cpu 0.276 total

Gotta love math.  And recursion.

[other Scala posts]