Notes on Prediction.io
Created 04/22/14; updated 06/28/14, 09/21/14, 11/21/14, 02/10/15
Introduction
PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery. Their goal is to be the “MySQL” or “LAMP Stack” of Machine Learning and Analytics.
Examples of use:
  Le Tote, a clothing subscription/rental service that is using PredictionIO to predict customers’ fashion preferences.
  PerkHub, which is using PredictionIO to personalize product recommendations in the weekly ‘group buying’ emails they send out.
Current version is 0.8.6. The download was 151MB.
Features
The server is written in Scala and runs on Spark. As a complete example, it includes many elements of Hadoop and Mahout (however, the Prediction.io marketing pitch is slowly changing from being a replacement for Hadoop to being an easy implementation of Spark).
Recommendation engine example

    import predictionio

    cli = predictionio.Client("")
    cli.identify("John")
    cli.record_action_on_item("view", "HackerNews")
    # predict top preferences near a specified location
    r = cli.get_itemrec_topn("myEngine", 5, {"pio_latlng": [37.9, 91.2]})
Algorithms Supported
  Item recommendation
  Item similarity
  Item rank
The implementations are from MLlib in Spark, including Naive Bayes and ALS.
Company Information
Formed in early 2013. Pivoted in late 2013, got next funding in mid-2014. Located in Palo Alto and somewhere in the UK.
The company competes with “closed ‘black box’ MLaaS services or software”, such as Google Prediction API, Wise.io, BigML, and Skytree. However, since Prediction.io is open and extensible, with a developer community, the company feels that it has an advantage.
The problem PredictionIO is setting out to solve is that building Machine Learning into products is expensive and time-consuming, and in some instances is only really within the reach of major and heavily-funded tech companies, such as Google or Amazon, who can afford a large team of PhDs/data scientists. By utilizing the startup’s open source Machine Learning server, startups or larger enterprises no longer need to start from scratch, while also retaining control over the source code and the way in which PredictionIO integrates with their existing wares.
People
  Simon Chan, CEO (was at UMich, then startups in China, then UCL)
  Donald Szeto, CTO (Stanford, UC Berkeley)
  Kenneth Chan, engineer (UCB)
  Thomas Stone, VP Sales (Cornell, University College London)
Funding
Raised $2.5M in July 2014 from the following investors: StartX, XG Ventures (founded by ex-Googlers), Sood Ventures, Ironfire Capital (activist investor firm), Quest Venture Partners (Menlo Park), Azure Capital Partners (San Francisco and Menlo Park).
Business Model
There was no discussion of pricing for the server, or pricing for service/support.
Architecture of the PredictionIO server
PredictionIO is mainly built with Scala. Scala runs on the JVM, so Java and Scala stacks can be freely mixed for totally seamless integration. PredictionIO Server consists of a few components:
  Admin Server
  IO Server
  Scheduler
  Data Store
  Data Processing Stack
The “DASE” Concept – their counterpart of “MVC”
PredictionIO's DASE architecture brings the separation-of-concerns design principle to predictive engine development. DASE stands for the following components of an engine:
  Data - includes Data Source and Data Preparator
  Algorithm(s)
  Serving
  Evaluator
As you can see from the Quick Start, MyRecommendation takes a JSON prediction query, e.g. { "user": "1", "num": 4 }, and returns a JSON predicted result. In MyRecommendation/src/main/scala/Engine.scala, the Query case class defines the format of such a query:

    case class Query(
      user: String,
      num: Int
    ) extends Serializable

The PredictedResult case class defines the format of the predicted result, such as

    {"itemScores":[
      {"item":22,"score":4.07},
      {"item":62,"score":4.05},
      {"item":75,"score":4.04},
      {"item":68,"score":3.81}
    ]}

with:

    case class PredictedResult(
      itemScores: Array[ItemScore]
    ) extends Serializable

    case class ItemScore(
      item: String,
      score: Double
    ) extends Serializable

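The query and result formats above can also be viewed from the client side. The following is a minimal Python sketch, not part of the PredictionIO SDK: the dataclasses simply mirror the Scala case classes, and the helper names encode_query and decode_result are made up for illustration. Note that the sample response shows numeric item IDs while ItemScore declares item as a String, so the sketch coerces item to a string.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Query:
    user: str
    num: int


@dataclass
class ItemScore:
    item: str
    score: float


@dataclass
class PredictedResult:
    itemScores: List[ItemScore]


def encode_query(query: Query) -> str:
    """Serialize a query into the JSON shape the engine expects (hypothetical helper)."""
    return json.dumps(asdict(query))


def decode_result(body: str) -> PredictedResult:
    """Parse a JSON predicted result; item IDs are coerced to strings (hypothetical helper)."""
    raw = json.loads(body)
    scores = [
        ItemScore(item=str(entry["item"]), score=float(entry["score"]))
        for entry in raw["itemScores"]
    ]
    return PredictedResult(itemScores=scores)


# Sample response copied from the notes above.
sample_response = (
    '{"itemScores":[{"item":22,"score":4.07},{"item":62,"score":4.05},'
    '{"item":75,"score":4.04},{"item":68,"score":3.81}]}'
)

query_json = encode_query(Query(user="1", num=4))
result = decode_result(sample_response)
```

Here query_json holds the request body for user "1" asking for 4 items, and result.itemScores holds the four parsed entries in the order returned.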
Finally, RecommendationEngine is the Engine Factory that defines the components this engine will use: Data Source, Data Preparator, Algorithm(s) and Serving components.

    object RecommendationEngine extends IEngineFactory {
      def apply() = {
        new Engine(
          classOf[DataSource],
          classOf[Preparator],
          Map("als" -> classOf[ALSAlgorithm]),
          classOf[Serving])
      }
      ...
    }

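To make the separation of concerns concrete, here is a toy Python sketch of how the four engine components might be wired together. It is not the PredictionIO API: every class and method name is hypothetical, the data is invented, and a simple average-rating popularity model stands in for ALS so that the example stays small.

```python
from typing import Dict, List, Tuple

Rating = Tuple[str, str, float]  # (user, item, rating); toy data only


class DataSource:
    """Data: reads raw events (hard-coded here)."""
    def read(self) -> List[Rating]:
        return [("1", "a", 5.0), ("1", "b", 3.0), ("2", "a", 4.0), ("2", "c", 2.0)]


class Preparator:
    """Data: cleans raw events before training (drops non-positive ratings)."""
    def prepare(self, ratings: List[Rating]) -> List[Rating]:
        return [r for r in ratings if r[2] > 0]


class PopularityAlgorithm:
    """Algorithm: average rating per item, a stand-in for ALS."""
    def train(self, ratings: List[Rating]) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for _, item, rating in ratings:
            totals[item] = totals.get(item, 0.0) + rating
            counts[item] = counts.get(item, 0) + 1
        return {item: totals[item] / counts[item] for item in totals}

    def predict(self, model: Dict[str, float], query: dict) -> List[Tuple[str, float]]:
        ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: query["num"]]


class Serving:
    """Serving: combines algorithm outputs; here it returns the first one."""
    def serve(self, query: dict, predictions: List[List[Tuple[str, float]]]) -> dict:
        return {"itemScores": [{"item": i, "score": s} for i, s in predictions[0]]}


class Engine:
    """Wires the components together, loosely like the engine factory above."""
    def __init__(self, data_source, preparator, algorithms, serving):
        self.data_source = data_source
        self.preparator = preparator
        self.algorithms = algorithms
        self.serving = serving
        self.models = {}

    def train(self) -> None:
        prepared = self.preparator.prepare(self.data_source.read())
        self.models = {name: algo.train(prepared) for name, algo in self.algorithms.items()}

    def query(self, q: dict) -> dict:
        predictions = [algo.predict(self.models[name], q) for name, algo in self.algorithms.items()]
        return self.serving.serve(q, predictions)


engine = Engine(DataSource(), Preparator(), {"popularity": PopularityAlgorithm()}, Serving())
engine.train()
response = engine.query({"user": "1", "num": 2})
```

Because each stage only sees the output of the previous one, swapping the algorithm or the data source does not require touching the other components, which is the point of the DASE split.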
Spark's MLlib ALS algorithm takes training data of RDD type, i.e. RDD[Rating], and trains a model, which is a MatrixFactorizationModel object. The PredictionIO Recommendation Engine Template, which MyRecommendation is based on, integrates this algorithm under the DASE architecture.
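For intuition about what alternating least squares does, the following pure-Python sketch fits rank-1 user and item factors to a handful of ratings and then predicts an unobserved rating as the product of the two factors. This is an illustrative simplification, not MLlib's implementation: MLlib works on distributed RDDs with higher-rank factors and regularization, and the function names and toy ratings below are assumptions.

```python
from typing import Dict, List, Tuple

# (user index, item index) -> observed rating; toy data, not from the notes
Ratings = Dict[Tuple[int, int], float]


def als_rank1(ratings: Ratings, n_users: int, n_items: int,
              iterations: int = 50, reg: float = 0.0) -> Tuple[List[float], List[float]]:
    """Alternate closed-form least-squares updates for rank-1 factors."""
    users = [1.0] * n_users
    items = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve for each user factor.
        for u in range(n_users):
            num = sum(r * items[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(items[i] ** 2 for (uu, i) in ratings if uu == u) + reg
            if den > 0:
                users[u] = num / den
        # Hold user factors fixed and solve for each item factor.
        for i in range(n_items):
            num = sum(r * users[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(users[u] ** 2 for (u, ii) in ratings if ii == i) + reg
            if den > 0:
                items[i] = num / den
    return users, items


def predict(users: List[float], items: List[float], u: int, i: int) -> float:
    """Predicted rating is the product of the user and item factors."""
    return users[u] * items[i]


# User 1 rates items exactly twice as high as user 0; (1, 2) is unobserved.
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
user_factors, item_factors = als_rank1(observed, n_users=2, n_items=3)
missing_rating = predict(user_factors, item_factors, 1, 2)
```

Since the toy ratings are consistent with a rank-1 pattern, missing_rating should approach 6.0, the value implied by user 0's rating of item 2. With real data a positive reg value would normally be used to avoid overfitting.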
Data Processing Stack
Built on top of solid data frameworks and technology, such as Hadoop, Cascading, Scalding and Mahout, PredictionIO can handle a huge amount of data efficiently. A variety of machine learning algorithms are available for you to implement with just a few clicks.
Admin Server
PredictionIO's Admin Server component provides a web interface for developers to manage applications, engines and algorithms. It is built on top of Play Framework.
IO Server
IO Server offers scalable REST API services to communicate with your web or mobile app. It is responsible for handling data input and prediction output. It is built on top of Play Framework.
Scheduler
A scalable scheduler that can be used to manage schedules for executing tens, hundreds, or even tens of thousands of jobs. Quartz is the default scheduler.
Data Store
The data store manages the collected data, the predictive model and the cached prediction results. MongoDB is the default data store.
Documentation
Android and Java SDK
Endpoints
There are commands to send information, to request recalculation, and to request results.
PHP API
Delivery in the Cloud
There are EC2 instances which can be spun up preconfigured for Prediction.io:
https://aws.amazon.com/marketplace/pp/B00ECGJYGE
For usage information, see
http://docs.prediction.io/current/installation/install-predictionio-on-aws.html
Developer Community
There is a forum at https://groups.google.com/forum/#!forum/predictionio-user
The developer community of PredictionIO supports a number of projects. To list a project on their site, please contact them or do a pull request through PredictionIO Docs Project. In early 2015, the CEO said they had over 300 developers in their ecosystem.
Questions and Open Issues
  How does the server store and manage trained models?
  What data sources can be integrated?
Chronology
Late Spring 2014: Learned about this tool.
Summer 2014: Initial evaluation.
Fall 2014: Started another round of evaluation, since it was clearer that they were providing a server, and that the server used Scala / Spark. They were developing templates which captured usage patterns. Also, they were working to create a developer ecosystem.
November 2014: Presentation at the Predictive APIs conference: http://www.slideshare.net/predictionio/predictionio-the-1st-international-conference-on-predictive-apis-and-apps This was not very technical.
02/09/15: Went to presentation by CEO, hosted by Scala Bay group. The presentation summarized what we mostly knew, but gave a number of key directions, such as the usage templates they have developed, which greatly improve the ease of learning. The presentation is available at https://www.youtube.com/watch?v=EUDHFOyUumE&feature=youtu.be