Written by Jim Blakley (@jimblakley) | March 5, 2019
(This is a repost of my Intel blog on the work at CMU around Eureka, partially sponsored by Intel.)
Before its ignominious demise, I was one of the few fans of the TV series “Wisdom of the Crowd”* -- mostly because of its premise that a large group of distributed non-experts could produce extraordinary results when they worked collectively. But, even in that police procedural, the work of the CSI unit and the coroner was not delegated to the crowd. Practitioners of those specialties, as in many professions that require detailed, deep knowledge of a chosen domain – physicians, astronomers, biologists and other professionals – undergo long training and apprenticeships in which they learn their craft. In that process, they acquire knowledge, expertise, intuition and a domain-specific ability to pattern match what they see with what they know. One of the promises of artificial intelligence is that it will accelerate and enhance the ability of domain experts to practice their profession – allowing a radiologist to more quickly and accurately diagnose tumors from MRIs or an astronomer to identify planets that could support life from radio telescope data.
But, the Achilles heel of this promise is that it requires those same experts to spend countless hours finding, labeling and preparing training data for machine learning algorithms. In the current industry paradigm, training data is labeled in one of three ways: by academic, commercial or social crowdsourcing; by outsourcing the work to a company or individual; or manually, by an in-house expert painstakingly creating the training data from real or simulated data. For many applications, the expert’s skill is needed to label the data -- and data access is limited to authorized users due to privacy and other concerns. There are emerging companies, platforms and tools like Hive and LabelBox that help to automate this process, but it is still tedious work for a busy, highly paid professional.
Here at the Intel Science and Technology Center for Visual Cloud Systems (ISTC-VCS), the “Eureka” project is researching processes and systems that can optimize the precious resource of expert attention during training set creation for visually based applications. The ISTC-VCS was launched in 2016 as a collaborative effort between Intel, Carnegie Mellon University (CMU) and Stanford University to research systems issues in large scale deployments of visual computing solutions. Eureka is led by CMU’s Ziqiang (Edmond) Feng, Shilpa George, Jan Harkes, Professor Mahadev Satyanarayanan and Intel Labs’ Padmanabhan Pillai.
Eureka starts from the premise that expert attention is the most critical resource used in collecting and preparing training datasets. That attention can be wasted by forcing the expert to sort through too many candidate examples while filling out a training set, by feeding candidates too slowly and leaving the expert waiting, or by forcing the expert to do too much manual labor to manage the workflow. You can get some feel for this by playing with Google Search by Image*. When I use a photo of a common object like, say, an apple to search for possible training examples, I get hundreds of nearly identical apples -- not enough diversity to use as a training set. Too many similar true positives. But, if I search for something a little rarer, say, a picture of me sitting on an elephant in a conference room, I get no true positives and a LOT of false positives. The expert shouldn’t have to manually wade through thousands of images to find examples of rare events. Eureka views the process as an iterative, interruptible, human-in-the-loop workflow. The expert should be able to refine the search quickly when she sees the current search yielding little improvement over previous iterations.
Eureka also presupposes that the source data is distributed across a wide area but that each data location has local computing resources to aid in the search for training examples. This distribution could happen across, say, enterprise data centers or edge nodes positioned near cameras. It could also apply in a cloud data center where data resides on a direct-attached disk in a cloud server. The expert using a Eureka console is assumed to be remote from the data as shown. These remote servers have high bandwidth and low latency between compute and data but lower bandwidth and higher latency to the user, which motivates doing the compute-intensive tasks near the data.
Eureka’s approach is to use gradually more sophisticated filters to trim the number of candidate training images presented to the expert. As the expert identifies true positives and negatives among the candidates, those labeled examples become part of the training set for the next iteration. At each iteration, Eureka allows the user to balance the complexity of the filters against the need to rapidly provide viable candidates for review.
Early stage filters might be simple color histograms, scale-invariant feature transforms (SIFT) or perceptual hashing; mid stage filters could be a support vector machine (SVM) trained with histogram of oriented gradients (HOG) features or a lightweight deep neural network (DNN) like MobileNet. Once a sufficiently large training set has been collected, late stage filters can be more complex DNNs.
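To make the idea of an early stage filter concrete, here is a minimal sketch of a color histogram filter feeding a filter cascade. It is not Eureka's code: the toy image representation (a flat list of RGB pixels), the function names, the bin count and the threshold are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = Sequence[Pixel]  # toy stand-in for an image: a flat list of RGB pixels


def color_histogram(image: Image, bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    for r, g, b in image:
        # Clamp so that a channel value of 255 always lands in the last bin.
        ri = min(int(r / step), bins - 1)
        gi = min(int(g / step), bins - 1)
        bi = min(int(b / step), bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = float(len(image)) or 1.0
    return [h / total for h in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def make_histogram_filter(example: Image, threshold: float) -> Callable[[Image], bool]:
    """Build a cheap filter that keeps images whose colors resemble `example`."""
    reference = color_histogram(example)

    def passes(image: Image) -> bool:
        return histogram_intersection(reference, color_histogram(image)) >= threshold

    return passes


def cascade(images: Iterable[Image], filters: Sequence[Callable[[Image], bool]]) -> List[Image]:
    """Apply filters in order (cheapest first); all() stops at the first rejection."""
    return [img for img in images if all(f(img) for f in filters)]


# Hypothetical toy data: solid red, solid blue, and a mostly red image.
red = [(250, 10, 10)] * 16
blue = [(10, 10, 250)] * 16
mostly_red = [(250, 10, 10)] * 12 + [(10, 10, 250)] * 4

looks_red = make_histogram_filter(red, threshold=0.7)
candidates = cascade([red, blue, mostly_red], [looks_red])
```

In this toy run, `candidates` keeps the red and mostly red images and drops the blue one. A more expensive filter, such as a classifier trained on the labels collected so far, would simply be appended to the `filters` list so that it only runs on images the cheap filter lets through.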
For a given iteration, filters are dispatched to the edge nodes (A, B) to find candidates. The number of edge nodes may grow at each iteration to ensure that the expert receives a continuous candidate flow in the face of increasingly complex filtering at each node. As the expert finds true positive examples among the candidates, she adds them to the training set, cancels the current iteration, adjusts and adds to the filter set, potentially adds edge nodes and starts the next iteration. Eureka provides a simple intuitive interface to enable this workflow.
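The shape of that loop can be sketched in a few lines. This is a toy model under loud assumptions, not Eureka's implementation: each "image" is reduced to a single hypothetical feature number, the `is_positive` callback stands in for the expert's judgment, and the "filter" is simply re-fit each round as distance to the mean of the positives found so far.

```python
from typing import Callable, List, Sequence, Set, Tuple


def run_iterations(
    data: Sequence[float],
    is_positive: Callable[[float], bool],  # stands in for the human expert
    seed: float,                           # one example the expert starts with
    rounds: int,
    batch: int,
) -> Tuple[List[float], List[float], int]:
    """Toy human-in-the-loop search: filter, label, re-fit, repeat."""
    positives: List[float] = [seed]  # training set positives (includes the seed)
    negatives: List[float] = []
    seen: Set[int] = set()
    inspected = 0

    for _ in range(rounds):
        # "Re-train" the filter from everything labeled so far.
        center = sum(positives) / len(positives)
        unseen = [i for i in range(len(data)) if i not in seen]
        # Present only the `batch` items the current filter ranks highest.
        candidates = sorted(unseen, key=lambda i: abs(data[i] - center))[:batch]
        if not candidates:
            break
        for i in candidates:
            seen.add(i)
            inspected += 1
            if is_positive(data[i]):
                positives.append(data[i])
            else:
                negatives.append(data[i])

    return positives, negatives, inspected


# Hypothetical feature values; "positives" are the values between 45 and 55.
features = [1, 2, 50, 51, 53, 90, 3, 48]
pos, neg, looked_at = run_iterations(
    features, lambda x: 45 <= x <= 55, seed=50, rounds=2, batch=2
)
```

In this toy run the expert looks at four of the eight items and all four turn out to be positives, because each round's filter is built from the labels gathered in the previous rounds. The real system adds the pieces this sketch leaves out, such as dispatching filters to many edge nodes and letting the expert cancel an iteration early.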
So how well does Eureka work to optimize expert attention? Recent results show that it can reduce the number of candidates that the expert needs to inspect to find 100 examples of a rare event by two orders of magnitude over inspection of every image and by an order of magnitude over a single stage filtering process. The team tested three different cases (deer, Taj Mahal and fire hydrant detection) drawn from the Yahoo Flickr Creative Commons 100 Million (YFCC100M)* data set. The team has also found that, for distributed dataset applications in network constrained environments, the early iterations are often network bound while the later stages are compute bound. This argues strongly for the use of compute resources near where the data resides.
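One way to see why the bottleneck shifts is a back-of-the-envelope model. The sketch below is my own simplification, and every number in it is hypothetical rather than taken from the Eureka measurements: a node's filter scans images at some rate, passes a fraction of them, and the link to the expert can carry only so many images per second.

```python
def candidates_per_second(filter_images_per_s: float, pass_rate: float) -> float:
    """Rate at which one node's filter produces candidates for the expert."""
    return filter_images_per_s * pass_rate


def bottleneck(filter_images_per_s: float, pass_rate: float,
               network_images_per_s: float) -> str:
    """Return which resource limits the flow of candidates to the expert."""
    produced = candidates_per_second(filter_images_per_s, pass_rate)
    return "network" if produced > network_images_per_s else "compute"


def delivered_per_second(filter_images_per_s: float, pass_rate: float,
                         network_images_per_s: float) -> float:
    """Candidates per second that actually reach the expert."""
    produced = candidates_per_second(filter_images_per_s, pass_rate)
    return min(produced, network_images_per_s)


# Hypothetical early iteration: a cheap, fast filter that passes many images.
early = bottleneck(filter_images_per_s=1000.0, pass_rate=0.2,
                   network_images_per_s=10.0)

# Hypothetical late iteration: a slow, selective DNN filter.
late = bottleneck(filter_images_per_s=20.0, pass_rate=0.01,
                  network_images_per_s=10.0)
```

With these made-up numbers, `early` is `"network"` (the cheap filter produces far more candidates than the link can carry) and `late` is `"compute"` (the DNN produces candidates more slowly than the link could deliver them).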
Eureka has been open sourced for broad use and the team has created a cloud quick start that will let you easily try it out on Amazon Web Services* using the Yahoo Flickr dataset and your AWS account. Stay tuned and follow me on Twitter at @jimblakley for updates.
- "Edge-based Discovery of Training Data for Machine Learning", Feng, Z., George, S., Harkes, J., Pillai, P., Klatzky, R., Satyanarayanan, M., Proceedings of the Third IEEE/ACM Symposium on Edge Computing (SEC 2018), Bellevue, WA, October 2018
- Eureka Quick Start: https://github.com/fzqneo/eureka-yfcc100m
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.