Conference: November 6, 2013
Basis Technology’s 3rd Annual Open Source Search conference brings together leading technologists, IT strategists, and program managers from both the government and commercial sectors to share the benefits and challenges unique to open source search technology. This year’s conference considers open source search within the larger context of building an overall information strategy and developing the infrastructure to support it. What will the highly integrated information platform of tomorrow look like? How do you measure search effectiveness? Will today’s information access systems handle the demands of big data?
This year’s conference chairwoman Sue Feldman — formerly of IDC and now CEO of Synthexis — will be fielding a survey to help us all learn more about the use of search and text analytics software in organizations. Sue will share the preliminary results of this survey during the conference, and each participant will have the opportunity to receive a written summary of the results when the data has been compiled. Please follow this link to the Synthexis Information Access Survey.
|7:30||Registration and Breakfast|
|8:30||Welcome and Conference Overview
Sue Feldman, CEO, Synthexis
|8:45||Keynote: Watson–The Jeopardy! Grand Challenge and Beyond
Presenter: Eric Brown, Director of Watson Technologies, IBM
Download slides | View video
Watson, named after IBM founder Thomas J. Watson, was built by a team of IBM researchers who set out to accomplish a grand challenge–build a computing system that rivals a human’s ability to answer questions posed in natural language with speed, accuracy and confidence. The quiz show Jeopardy! provided the ultimate test of this technology because the game’s clues involve analyzing subtle meaning, irony, riddles and other complexities of natural language in which humans excel and computers traditionally fail. Watson passed its first test on Jeopardy!, beating the show's two greatest champions in a televised exhibition match, but the real test will be in applying the underlying natural language processing and analytics technology in business and across industries. In this talk I will introduce the Jeopardy! grand challenge, present an overview of Watson and the DeepQA technology upon which Watson is built, describe in particular how open source software and content played a role in the project, discuss how Watson is an early example of a cognitive computing system, and explore future applications of this technology.
|9:30||Big Data Analytics: A Single Framework for Structured & Unstructured Analytics
Presenter: George Chitouras, Senior Director Solutions R&D, Pivotal/EMC
This session describes how we created a single software framework for Big Data (billions of documents) Analytics by embedding open source search within database analytics. Using Solr and large-scale database engines, we demonstrate a single interface to both structured and unstructured data, and we show that this technique scales to billions of documents. This talk features an architectural overview of how the system is designed, concept demonstrations on social media data, and a case study applying the architecture to a real eDiscovery problem.
|10:05||Break and Exhibits|
|10:35||Document Relations with Elasticsearch
Presenter: Martijn van Groningen, Software Engineer, Elasticsearch
Lucene-based search engines, including Elasticsearch, are document-based and the usual method of modeling data is to de-normalize/flatten the data. Elasticsearch provides alternative options for modeling data. In this session, we will dive into the document relation features offered by Elasticsearch, and how you can use these features in combination with Elasticsearch’s analytical capabilities for structured and unstructured search.
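To make the modeling trade-off concrete: a flattened index copies parent fields into every child document, while Elasticsearch’s nested documents preserve the relation so that conditions must match within the same child. The sketch below is illustrative only; the index shape, field names, and example data are invented, and no cluster is contacted — it simply builds the two representations as plain Python dicts.

```python
# Illustrative sketch: a denormalized/flattened representation vs. an
# Elasticsearch "nested" mapping for the same one-to-many relation
# (blog post -> comments). Field names are hypothetical.

# Flattened: parent fields are duplicated into every child row. A boolean
# query such as author:alice AND text:"not so great" can accidentally
# match across two *different* comments of the same post.
flattened_docs = [
    {"post_title": "Intro to Solr", "comment_author": "alice",
     "comment_text": "great read"},
    {"post_title": "Intro to Solr", "comment_author": "bob",
     "comment_text": "not so great"},
]

# Nested: comments stay inside their parent document, and a nested query
# is evaluated per comment, so author and text must match the SAME comment.
nested_mapping = {
    "mappings": {
        "properties": {
            "post_title": {"type": "text"},
            "comments": {
                "type": "nested",
                "properties": {
                    "comment_author": {"type": "keyword"},
                    "comment_text": {"type": "text"},
                },
            },
        }
    }
}
```

Parent/child relations (also covered in the session) trade the locality of nested documents for the ability to update children independently of the parent.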
|11:10||Pushing Geospatial Cloud to the Edge: Building a Transparent Spatial Database Layer Using Cloudant and Open Source
Presenter: Norman Barker, Director, Cloudant
This talk will discuss the benefits and challenges of integrating open geospatial libraries into Cloudant’s database service, and how the global scope and large scale of those efforts will be contributed back to the open source community. Norman’s experience of making it all work together across a distributed system holds lessons for those designing geospatial indexes and multi-node queries, as mobile apps continue to push geospatial technology closer to the network edge.
|11:45||Real World Facets with Entity Resolution
Presenter: Benson Margulies, CTO, Basis Technology
Solr’s ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn’t want a “George Bush” facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for “George W. Bush” and “乔治·沃克·布什” (a Chinese rendering) that are each limited to a single string. We’ll explore the benefits and challenges of empowering Solr users with real-world facets.
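A minimal sketch of the underlying idea: if an entity-resolution step assigns each extracted name a stable entity ID, then faceting on the ID field (as Solr would on a string field) collapses spelling and script variants into a single bucket per real-world person. The resolver output and IDs below are hypothetical, not Basis Technology’s actual API.

```python
from collections import Counter

# Hypothetical output of an entity-resolution step: each extracted surface
# string has been mapped to a stable entity ID, so "George W. Bush" and its
# Chinese rendering share one ID while the 41st president gets another.
resolved_mentions = [
    {"surface": "George Bush",       "entity_id": "E-43"},
    {"surface": "George W. Bush",    "entity_id": "E-43"},
    {"surface": "乔治·沃克·布什",      "entity_id": "E-43"},
    {"surface": "George H. W. Bush", "entity_id": "E-41"},
]

# Faceting on entity_id yields one facet per person, not one per spelling.
facet_counts = Counter(m["entity_id"] for m in resolved_mentions)
print(facet_counts["E-43"])  # 3 — three variant strings collapse to one facet
```

In a real Solr deployment the `entity_id` would be indexed as a separate facet field at document-ingest time, with a display label chosen per entity.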
Presenter: Donna Harman, National Institute of Standards and Technology (NIST)
This session, from the doyenne of search evaluation, gives a brief overview of the methods and metrics of evaluating search systems. It concludes by examining some of the current evaluation issues in user studies, including a look at the search log studies mainly done by the commercial search engines. The presentation is based on Donna Harman’s book, “Information Retrieval Evaluation,” and on her experience with the Text REtrieval Conference (TREC), an annual workshop hosted by the US government’s National Institute of Standards and Technology. TREC provides the infrastructure necessary for large-scale evaluation of text retrieval methodologies. With the goal of accelerating research in this area, TREC created the first large test collections of full-text documents and standardized retrieval evaluation. The impact has been significant; since TREC’s beginning in 1992, retrieval effectiveness has approximately doubled.
|14:10||Advanced Query Parsing Techniques
Presenter: Aruna Kumar Pamulapati, Senior Technical Consultant, Search Technologies
Apache Lucene, Apache Solr, and vendor products based on these applications provide a number of options for query parsing, and they are valuable tools for creating powerful search applications. This presentation will review the role that advanced query parsing can play in building search systems, including relevancy customization, incorporating input from user interface variables (such as position on a website or geographical indicators), selecting which sources are to be searched, and integrating third-party data sources. Query parsing can also enhance data security. Best practices for building and maintaining complex query parsing rules will be discussed and illustrated.
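As a toy illustration of the kind of rule the session describes, the function below rewrites a raw user query into a Lucene/Solr-style query string, folding in a UI variable (the site section the user is browsing) and a source whitelist as a simple security measure. The field names (`text`, `section`, `source`) and the rule itself are invented for illustration, not taken from any particular product.

```python
from typing import List, Optional

def parse_query(user_query: str,
                site_section: Optional[str] = None,
                allowed_sources: Optional[List[str]] = None) -> str:
    """Rewrite a raw user query into a Lucene/Solr-style query string.
    Field names (text, section, source) are illustrative only."""
    clauses = [f"text:({user_query})"]
    if site_section:
        # UI variable: bias/filter results by where the user is on the site.
        clauses.append(f"section:{site_section}")
    if allowed_sources:
        # Security: restrict the query to sources this user may search.
        joined = " OR ".join(allowed_sources)
        clauses.append(f"source:({joined})")
    return " AND ".join(clauses)

print(parse_query("solr faceting", site_section="docs",
                  allowed_sources=["wiki", "manual"]))
# text:(solr faceting) AND section:docs AND source:(wiki OR manual)
```

Production query parsers would also need to escape user input and handle operator precedence; the sketch omits both for brevity.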
|15:00||The Right Tool for the Job: Search and Analytics Optimization for Professional Information Analysis, Leveraging Open and Closed Source Solutions
Presenter: Kurt Krieg, Technology Manager, BASF
Kurt Krieg, Technology Manager with BASF Corporation, will discuss the requirements and features that should be considered for professional searches and analytics in an innovation-driven R&D enterprise, integrating both internal and external information sources in multiple formats. Kurt will highlight BASF’s experience in evaluating both open and closed source components to find a cost-efficient solution, taking into account limitations, interfaces, development and operational costs. Criteria for testing, market screening and challenging vendors will also be discussed.
|15:35||Big Data Search at BoardReader
Presenters: Andrew Aksyonoff, Founder & Richard Kelm, COO, Sphinx Technologies
This presentation will begin with a search-centric overview of BoardReader, which has become a leading social media content aggregation search engine for forums and boards. By utilizing Sphinx Search, an open source full-text search engine, BoardReader is able to serve relevant results at lightning speed from a corpus of over 20 billion records. In order to launch into new social media content markets, BoardReader needed a strategy to handle the tokenization of Chinese characters. This talk will take a deep dive into the technical challenges of supporting double-byte languages and offer lessons learned for those planning to integrate their preferred search engine with a linguistics platform.
|16:10||Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Presenter: Alex Moundalexis, Solutions Engineer, Cloudera
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
|17:00||Networking Cocktail Reception|
* Agenda is subject to change
Founder, Sphinx Technologies
Big Data Search at BoardReader
Andrew Aksyonoff created Sphinx back in 2001 and has been working on the code base ever since. Fluent in C++, less so in human speak, but keeps trying. He lives in Russia.
Director, Geospatial, Cloudant
Pushing Geospatial Cloud to the Edge: Building a Transparent Spatial Database Layer Using Cloudant and Open Source
Norman is the Director of Geo at Cloudant, a database-as-a-service company. He has been developing geospatial programs for more than 10 years and leads the development of distributed geospatial indexes for Cloudant. His primary interest is in how to use unstructured geospatial data.
Director of Watson Technologies, IBM
Keynote: Watson–The Jeopardy! Grand Challenge and Beyond
Eric Brown is Director of Watson Technologies at the IBM T.J. Watson Research Center, where he is working on the DeepQA project to advance the state-of-the-art in automatic, open domain question answering technology. Eric has been working in the broader area of information retrieval since 1992 and has explored a variety of issues, including scalability, parallel and distributed information retrieval, automatic text categorization, question answering, text analysis in the bio-medical domain, and applications of speech recognition in knowledge management.
Eric received his B.S. (1989) in Computer Science from the University of Vermont, and M.S. (1992) and Ph.D. (1996) in Computer Science from the University of Massachusetts, Amherst. While at UMass, Eric was a research assistant at the Center for Intelligent Information Retrieval and was advised by Bruce Croft. Eric has been at IBM since 1995.
Sr. Dir R&D, Pivotal/EMC
Big Data Analytics: A Single Framework for Structured & Unstructured Analytics
George Chitouras is currently Sr. Director of R&D at Pivotal/EMC where he focuses on rich data types as they pertain to analytics for “big data”. Prior to Pivotal, George developed products for Data Visualization, NLP, Text Analytics, and Information Retrieval at Greenplum, SAP, Business Objects, and Inxight Software.
Program Chair, CEO, Synthexis
Welcome and Conference Overview
Sue Feldman is a search industry veteran, and author of hundreds of works on the technologies, trends and markets for search, text analytics, big data and unified information access and management. The Answer Machine, a practical guide to these technologies and their future, was published in 2012. She is known for such works as The High Cost of Not Finding Information and The Digital Marketplace, both published by IDC, and has won numerous research and writing awards.
Sue was Vice President for Search and Discovery Technologies at IDC before founding Synthexis in 2013. Synthexis provides business advisory services to vendors and buyers of cognitive computing, search and text analytics technologies. She speaks frequently on trends in information interaction, technology and information work, conversational systems, unified information access, and big data technologies. Synthexis clients rely on her for strategic advice and business coaching, as well as for her wide network of contacts in the industry.
At IDC, Sue developed and led the IDC research programs for search, content management, text analytics, categorization, translation software, mobile and rich media search. She wrote the chapter on search engines for the Encyclopedia of Library and Information Science, and was the first editor of the IEEE Computer Society's Digital Library News.
Before coming to IDC, Ms. Feldman founded Datasearch, an independent technology advisory firm, where she consulted on usability and on information retrieval technologies.
Martijn van Groningen
Software Engineer, Elasticsearch
Document Relations with Elasticsearch
Martijn van Groningen is a software engineer at Elasticsearch and an Apache Lucene committer. He has made significant contributions to the Lucene community, notably the result grouping (also known as field collapsing) and join features. As a core Elasticsearch engineer, he works on all sorts of new features and improvements, including the document-relation features inside Elasticsearch (parent/child and nested objects); more recently, he improved the percolator (reversed search).
National Institute of Standards and Technology (NIST)
Donna Harman graduated from Cornell University as an Electrical Engineer, and started her career working with Professor Gerard Salton in the design and building of several test collections, including the first MEDLARS one. Later work was concerned with searching large volumes of data on relatively small computers, starting with building the IRX system at the National Library of Medicine in 1987, and then the Citator/PRISE system at the National Institute of Standards and Technology (NIST) in 1988. In 1990 she was asked by DARPA to put together a realistic test collection on the order of 2 gigabytes of text, and this test collection was used in the first Text REtrieval Conference (TREC). TREC is now in its 22nd year, and along with its sister evaluations such as CLEF, NTCIR, INEX, and FIRE, serves as a major testing ground for information retrieval algorithms. She received the 1999 Strix Award from the U.K. Institute of Information Scientists for this effort. Starting in 2000 she worked with Paul Over at NIST to form a new effort (DUC) to evaluate text summarization, which has now been folded into the Text Analysis Conference (TAC), providing evaluation for several areas in NLP.
COO/VP of Sales, Sphinx Technologies
Big Data Search at BoardReader
From humble beginnings focused on storing and maintaining data in MySQL, Richard moved to Sphinx Search, where he’s learned the value of synthesizing mountains of text and attributes into meaning for use by individuals.
Technology Manager, BASF
The Right Tool for the Job: Search and Analytics with Open Source and Closed Source Software for Professional Information Analysis
Kurt has been working in the IT environment since 1999. Starting off as a network and software engineer, he later refocused on applications of data analytics and search. Open source has always been a part of Kurt’s work but never without benchmarking against other solutions in order to find the best fit for the task. Kurt holds a diploma in computer science from the Technical University of Karlsruhe, Germany.
Chief Technology Officer, Basis Technology
Real World Facets with Entity Resolution
Benson provides technological leadership at Basis Technology. With the R&D team, he breaks ground in new technologies for the company’s core products and incubator projects. In custom solution projects, he works with customers, rapidly grasping their specific needs and then setting direction and strategy to meet those multilingual challenges. Prior to Basis Technology, Benson held technical and management positions at Kendall Square Research, Symbolics, Object Design, and Honeywell Information Systems. He is also an active contributor to the open source software community as a member of the Apache Software Foundation and a Project Management Committee member of Apache Mahout and Apache CXF. Benson holds a degree in computer science from MIT.
Solutions Engineer, Cloudera
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Alex Moundalexis is a Solutions Engineer for Cloudera and has spent the last year installing and configuring Hadoop clusters across the country for a variety of commercial and federal customers. Before entering the land of Big Data, Alex spent the better part of ten years wrangling Linux server farms and writing Perl as a contractor to the Department of Defense and Department of Justice. He likes shiny objects.
Aruna Kumar Pamulapati
Senior Technical Consultant, Search Technologies
Advanced Query Parsing Techniques
Aruna (Arun) was the architect at Meaningful Machines, implementing a natural language-based statistical text search engine over a large corpus. Arun is a seasoned enterprise search engineer, contributing to the design and hands-on implementation of client projects at Search Technologies.
We will be releasing videos of many of the OSS presentations over the coming months. Below you will find links to the videos, and we encourage you to sign up to be notified when the next video is available.
|Keynote: Watson–The Jeopardy! Grand Challenge and Beyond
Presenter: Eric Brown, Director of Watson Technologies, IBM