Algorithms will save the day

As we continue our journey to transform the online business at Staples, one thing we’re using to guide our product and technology roadmap is automation. We need to optimize along several axes – speed of operations (and actual throughput), efficiency, correctness, and cost.

Given the size of the business at Staples (the various online pieces combined are several billion dollars a year), we’re moving as many of our heuristics as possible into automated decision-making systems. This will allow us to improve end-to-end throughput, while optimizing revenue and profit, and reducing cost. Similarly, all incidents will be tagged and tracked, so we can do root-cause analysis, eliminate potential sources of defects, and even stop the line if needed.

Take, for example, our new expanded catalog. For a long time (over 25 years), we sold fewer than 30,000 SKUs, all related to office supplies. In the past year or so, we’ve rapidly expanded our assortment. We’re adding whole slews of products, one vertical at a time. For instance, we recently added hospitality and retail. If you’re a business in those industries, you can now not only buy your office supplies from us, but anything else you need to run your business. You run a restaurant? Buy your cleaning supplies, cutlery, glassware, etc. from us as well.

We recently crossed 500,000 SKUs on offer, and are on track to cross 2 million soon, and over 5 million within a couple of years. The question, of course, is how to market these new products and to whom. Manually devising strategies to market to the right target audiences has its place, of course. At the Innovation Lab, we set our data scientists to work on this problem. By analyzing all available data on the millions of customers we already have, we’re able to figure out what industries a large number of them belong to. We can then automatically show them relevant new products when they log on, or when we send out promotional emails.

Even where we don’t have comprehensive information about our customers, we can computationally determine several things about them. For instance, can you guess the industry someone with an email address like janet@peninsuladental.com belongs to?
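
To make this concrete, here’s a toy sketch of the idea (the keyword-to-industry table and the function names are hypothetical, and the real system obviously uses far richer signals than the email domain alone):

    (require '[clojure.string :as str])

    (def industry-keywords
      {"dental"     :healthcare
       "restaurant" :hospitality
       "law"        :legal-services})

    (defn email-domain [email]
      (second (str/split email #"@")))

    (defn guess-industry [email]
      (let [domain (email-domain email)]
        (some (fn [[kw industry]]
                (when (.contains domain kw) industry))
              industry-keywords)))

    (guess-industry "janet@peninsuladental.com")
    ;;=> :healthcare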

Another example is shipment delivery estimates. We already have a world-class logistics platform, having been in the business of receiving orders, fulfilling them, and getting them out to our customers within one business day. As we expand our SKUs though, a larger and larger number of our products will be dropshipped by our vendor partners. The variability on these items is much higher, and the shipments tend to be slightly slower as well. By using historical data around inventory levels, handling times, and shipping carriers’ actual delivery tracking data, we’re able to predict when an item is going to reach a shopper with a very high degree of precision. By communicating this to our customers up front, we’re able to provide a much better experience online.
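
As a deliberately simplified sketch of the shape of such an estimate (the percentile approach and the sample numbers here are purely illustrative, not our production model):

    (defn percentile [p xs]
      (let [sorted (vec (sort xs))
            idx    (int (Math/floor (* p (dec (count sorted)))))]
        (nth sorted idx)))

    (defn estimated-delivery-days
      "A conservative door-to-door estimate for a dropshipped item,
       from historical vendor handling times and carrier transit times."
      [handling-days transit-days]
      (+ (percentile 0.9 handling-days)
         (percentile 0.9 transit-days)))

    (estimated-delivery-days [1 1 2 3 2] [2 3 4 3 5])
    ;;=> 6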

These are just a few of the several projects we have in the works here at the San Mateo-based Staples Innovation Lab. Others include real-time selling price optimization, smarter personalized email content targeting, dynamic and personalized product bundles, hyper-personalized product recommendations, and a new e-commerce search engine – and that’s for this year alone.

The path forward is clear – automation is the key, and algorithms will save the day 🙂

P. S. – Watch for these projects as they go live on Staples.com. And if you’re a technologist interested in joining the journey we’re on, email me at amit@stapleslabs.com, or check out www.staplesinnovationlab.com

Welcoming Rich Hickey to the Staples Innovation Lab

I’m pleased to announce that we’ve engaged Rich Hickey at the Staples Innovation Lab, as Special Technical Advisor. He will provide architectural and general technical oversight across all the products we’re building. These include things like dynamic and personalized pricing, product recommendation engines, email targeting and behavioral retargeting, a new search engine, shipping delivery optimization, etc. 

We’re already heavy users of Clojure, Datomic, ClojureScript, Simulant, and Pedestal, so this just makes a ton of sense for us. We’re also working with Cognitect on several initiatives, so having Rich’s oversight will only add to the awesomeness of the team.

Please join me in welcoming Rich to the lab! 

P. S. Get in touch with me if you’d like to explore opportunities working with the Clojure stack at the second largest E-Commerce operation in the world.

Why Datomic?

Cross-posted from Zololabs.

Many of you know we’re using Datomic for all our storage needs for Zolodeck. It’s an extremely new database (not even version 1.0 yet), and is not open-source. So why would we want to base our startup on something like it, especially when we have to pay for it? I’ve been asked this question a number of times, so I figured I’d blog about my reasons:

  • I’m an unabashed fan of Clojure and Rich Hickey
  • I’ve always believed that databases (and the insane number of optimization options) could be simpler
  • We get basically unlimited read scalability (by upping read throughput in Amazon DynamoDB)
  • Automatic built-in caching (no more code to use memcached; the DB becomes effectively local)
  • Datalog as the query language – declarative logic programming, with no explicit joins (see the sketch after this list)
  • Datalog is extensible through user-defined functions
  • Full-text search (via Lucene) is built right in
  • The query engine runs on the client side, so there’s no danger from long-running or computation-heavy queries
  • Immutable data – every version of everything is kept automatically, giving you a built-in audit trail
  • “As of” queries and “time-window” queries are possible
  • Minimal schema (think RDF triples, except Datomic tuples also include the notion of time)
  • Supports cardinality out of the box (has-many or has-one)
  • These reference relationships are bi-directional, so you can traverse the relationship graph in either direction
  • Transactions are first-class – they can be queried, or “subscribed to” for db-event-driven designs
  • Transactions can be annotated (with custom meta-data) 
  • Elastic 
  • Write scaling without sharding (hundreds of thousands of facts (tuples) per second)
  • Supports “speculative” transactions that don’t actually persist to datastore
  • Out of the box support for in-memory version (great for unit-testing)
  • All this, and not even v1.0
  • It’s a particularly good fit with Clojure (and with Storm)
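
As a taste of several of these points, here’s a minimal sketch using the in-memory version (the :person/name attribute and the data are just made-up examples):

    (require '[datomic.api :as d])

    (def uri "datomic:mem://example")
    (d/create-database uri)
    (def conn (d/connect uri))

    ;; minimal schema – attributes are themselves just facts
    @(d/transact conn [{:db/id                 (d/tempid :db.part/db)
                        :db/ident              :person/name
                        :db/valueType          :db.type/string
                        :db/cardinality        :db.cardinality/one
                        :db.install/_attribute :db.part/db}])

    @(d/transact conn [{:db/id (d/tempid :db.part/user)
                        :person/name "Janet"}])

    ;; Datalog query, running in-process on the client
    (d/q '[:find ?name :where [_ :person/name ?name]] (d/db conn))
    ;;=> #{["Janet"]}

    ;; "as of" query – the database as it was at time (or transaction) t
    ;; (d/q '[:find ?name :where [_ :person/name ?name]]
    ;;      (d/as-of (d/db conn) t))

    ;; speculative transaction – applies tx-data to a db value without
    ;; persisting anything to the datastore
    ;; (d/with (d/db conn) [{:db/id (d/tempid :db.part/user)
    ;;                       :person/name "Janet II"}])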

This is a long list, but perhaps begins to explain why Datomic is such an amazing step forward. Ping me with questions if you have ’em! And as far as the last point goes, I’ve talked about our technology choices and how they fit in with each other at the Strange Loop conference last year. Here’s a video of that talk.

Pretty-printing in Clojure logs

Cross-posted from Zolo Labs.

Logging is an obvious requirement when it comes to being able to debug non-trivial systems. We’ve been thinking a lot about logging, thanks to the large-scale, distributed nature of the Zolodeck architecture. Unfortunately, when logging larger Clojure data-structures, I often find some kinds of log statements a bit hard to decipher. For instance, consider a map m that looked something like this:
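
    ;; a representative nested map (contents made up for illustration)
    (def m {:user     {:first-name "Janet" :last-name "Smith"}
            :accounts [{:provider :gmail
                        :emails   [{:from "adi@zololabs.com"
                                    :subject "Meet for lunch?"}
                                   {:from "kyle@zololabs.com"
                                    :subject "Re: design review"}]}]})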

When you log things like m (shown here with println for simplicity), you may end up needing to understand this:
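
    user=> (println "user-data:" m)
    user-data: {:user {:first-name "Janet", :last-name "Smith"}, :accounts [{:provider :gmail, :emails [{:from "adi@zololabs.com", :subject "Meet for lunch?"} {:from "kyle@zololabs.com", :subject "Re: design review"}]}]}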

Aaugh, look at that second line! Where does the data-structure begin and end? What is nested, and what’s top-level? And this problem gets progressively worse as the size and nested-ness of such data-structures grow. I wrote the following function to help alleviate some of the pain:
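
    (require '[clojure.pprint :refer [pprint]])

    (defn pp-str
      "Pretty-prints x and returns the result as a string."
      [x]
      (with-out-str (pprint x)))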

Remember to include clojure.pprint. And here’s how you use it:
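
    user=> (println "user-data:" (pp-str m))
    user-data: {:user {:first-name "Janet", :last-name "Smith"},
     :accounts
     [{:provider :gmail,
       :emails
       [{:from "adi@zololabs.com", :subject "Meet for lunch?"}
        {:from "kyle@zololabs.com", :subject "Re: design review"}]}]}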

That’s it, really. Not a big deal, not a particularly clever function. But it’s much better to see this structured and formatted log statement when you’re poring over log files in the middle of the night.

Just note that you want to use this sparingly. I first modified things to make ALL log statements automatically wrap everything being logged with pp-str: it immediately halved the performance of everything. pp-str isn’t cheap (actually, pprint isn’t cheap). So use with caution, where you really need it!

Now go sign up for Zolodeck!

Why Java programmers have an advantage when learning Clojure

Cross-posted from Zolo Labs.

There is a spectrum of productivity when it comes to programming languages. I don’t really care to argue how much more productive dynamic languages are… but for those who buy that premise and want to learn a hyper-productive language, Clojure is a good choice. And for someone who has a Java background, the choice of Clojure becomes the best one. Here’s why:

  • Knowing Java – obviously useful: class-paths, class loaders, constructors, methods, static methods, standard libraries, jar files, etc. etc.
  • Understanding of the JVM – heap, garbage collection, perm-gen space, debugging, profiling, performance tuning, etc.
  • The Java library ecosystem – what logging framework to use? what web-server? database drivers? And on and on….
  • The Maven situation – sometimes you have to know what’s going on underneath lein
  • Understanding of how to structure large code-bases – Clojure codebases also grow
  • OO Analysis and Design – similar to figuring out what functions go where

I’m sure there’s a lot more here, and I’ll elaborate on a few of these in future blog posts.

I’ve not used Java itself for a fairly long time (we’re using Clojure for Zolodeck). Even so, I’m getting a bit tired of some folks looking down on Java devs, when I’ve seen so many Clojure programmers struggle because they don’t understand the Java landscape.

So, hey Java devs! There are so many good reasons to learn Clojure – it’s a modern Lisp with a full macro system, it’s a functional programming language, it has concurrency semantics, and it sits on the JVM with access to all those libraries – so it makes a lot of sense for you to look at it. And if you’re already looking at something more dynamic than Java itself (say Groovy, or JRuby, or something similar), why not just take that extra step to something truly amazing? Especially when you have such an incredible advantage (your knowledge of the Java ecosystem) on your side already?

Clojure utility functions – part II

Cross-posted from Zolo Labs

Here’s another useful function I keep around:
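
    ;; a sketch matching the description below – mapcat, parallelized
    ;; with pmap; note that batches comes first, then the function f
    (defn pmapcat [batches f]
      (apply concat (pmap f batches)))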

Everyone knows what map does, and what concat does. And what mapcat does. 

The function definition for pmapcat above does what mapcat does, except that by using pmap underneath, it does so in parallel. The semantics are a bit different: first off, the first parameter is called batches (and not, say, coll, for collection). This means that instead of passing in a simple collection of items, you have to pass in a collection of collections, where each is a batch of items.

Correspondingly, the parameter f is the function that will be applied not to each item, but to each batch of items.

Usage of this might look something like this:
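
    ;; hypothetical example – score a large seq of users in parallel,
    ;; 1000 users per batch (users and score-user are stand-ins)
    (pmapcat (partition-all 1000 users)
             (fn [batch] (doall (map score-user batch))))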

One thing to remember is that pmap uses the Clojure send-off pool to do its thing, so the usual caveats apply with respect to how f should behave.

Clojure utility functions – part I

Cross-posted from Zolo Labs.

I kept using an extra line of code for this, so I decided to create the following function:
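
    ;; a reconstruction along these lines – a function version of doseq
    ;; for the common one-collection case (the name doeach is illustrative)
    (defn doeach [f coll]
      (doseq [x coll]
        (f x)))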

Another extra line of code can similarly be removed using this function:
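
    ;; similarly illustrative – an eager map, replacing the usual
    ;; (doall (map f coll)) combination
    (defn domap [f coll]
      (doall (map f coll)))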

Obviously, the raw forms (i.e. using doseq or map) can be far more powerful when used with more arguments. Still, these simple versions cover 99.9% of my use-cases.

I keep both these (and a few more) in a handy utils.clojure namespace I created for just such functions.