Overview of DBAI@NeurIPS’21

Towards AI Team

2 years ago

Author(s): Nantia Makrynioti

The Workshop on Databases and AI (DBAI) was successfully held last December in conjunction with the virtual NeurIPS’21 conference. The purpose of DBAI is to inspire a conversation on the power of the relational data structure and relational database systems (RDBMS) for machine learning (ML) algorithms. Research in the areas of relational learning, relational algebra and probabilistic programming has demonstrated the benefits of exploiting the relational data structure in ML tasks, among others for integrating domain knowledge, avoiding redundant calculations and managing workflows. Yet there is still a disconnect between the relational world and the world of machine learning, most noticeably manifested in the amount of time that is wasted denormalizing data and moving it out of the database in order to train ML models. Furthermore, although the intersection of database systems with ML is a hot area in data management, it is probably the first time that relational databases have been discussed in a NeurIPS workshop. Hence, another goal of DBAI is to draw attention to the possibilities that a synergy between the two communities can bring forward. This blog post gives an overview of DBAI’21 and highlights the main themes that were discussed in the invited and contributed talks, as well as during the panel discussion.

Overview of DBAI@NeurIPS’21 — Source: xkcd.com

DBAI was held in the Eastern time zone and had an online attendance of 35 people. Moreover, gatherings of around 15 students and faculty members were organized at four universities. We (the organizers) are very thankful to Snorkel AI and RelationalAI for their generous sponsorship, which funded the registrations and lunches for the physical gatherings. The schedule of the workshop aimed for shorter talks, so that a diverse group of speakers with backgrounds in either ML or data management, and from both academia and industry, could be accommodated. Hence, there were 7 invited and 5 contributed talks, as well as a panel discussion.

The workshop opened its doors with Dan Olteanu’s (University of Zurich) insightful presentation on a first-principles approach that exploits the algebraic and combinatorial structure of relational data processing to improve the runtime performance of machine learning. Then, Paroma Varma (Snorkel AI) shared her state-of-the-art work on programmatically labeling training data, followed by Arun Kumar (UC San Diego), who highlighted how scalability, usability, and manageability concerns across the entire lifecycle of ML/AI applications can be addressed through the lens of database systems. David Chiang (University of Notre Dame) and Eriq Augustine (UC Santa Cruz) shifted the agenda towards more purely ML topics and presented interesting ideas on different notations for weighted or probabilistic relations and on accelerating grounding in statistical relational learning. Finally, Molham Aref (RelationalAI) shared his insights on deep learning on relational data, whereas Olga Papaemmanouil (Brandeis University) presented a promising vision and preliminary results of AI-optimized database components.
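
To give a flavor of what exploiting the algebraic structure of relational data can mean in practice, here is a minimal sketch that is not taken from any of the talks. It assumes two hypothetical toy relations, `orders` and `weather`, joined on a store key, and computes SUM(units * temp) over the join in two ways: by materializing the join first, and by aggregating each relation per key and combining the partial sums. The second version relies only on the distributive law and never builds the join result.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy relations: orders(store, units) and weather(store, temp).
orders = [("s1", 3.0), ("s1", 5.0), ("s2", 2.0)]
weather = [("s1", 20.0), ("s1", 22.0), ("s2", 15.0)]


def sum_product_materialized(r, s):
    """Join first, then aggregate: SUM(x * y) over every matching pair."""
    return sum(x * y for (k1, x), (k2, y) in product(r, s) if k1 == k2)


def sum_product_factorized(r, s):
    """Aggregate each relation per join key, then combine the partial sums."""
    sum_r = defaultdict(float)
    sum_s = defaultdict(float)
    for key, x in r:
        sum_r[key] += x
    for key, y in s:
        sum_s[key] += y
    # Distributive law: the sum over pairs of x * y equals (sum x) * (sum y) per key.
    return sum(sum_r[key] * sum_s[key] for key in sum_r.keys() & sum_s.keys())


materialized = sum_product_materialized(orders, weather)
factorized = sum_product_factorized(orders, weather)
print(materialized, factorized)
```

Both functions return the same value, but the factorized one reads each relation once instead of enumerating every joined pair. Aggregates of this shape appear, for example, as entries of the matrices needed to fit a linear regression model.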

The contributed talks spanned many interesting topics, covering data programming with knowledge bases, learned indices and buffer managers, as well as relational algebra libraries for data science pipelines and numerical reasoning in relational databases. Here too there was a balanced representation of both industry labs and universities.

Panel Discussion

DBAI concluded with a very interesting panel discussion on AI workloads inside databases among Guy Van den Broeck (UCLA), Alexander Ratner (Snorkel AI), Konstantinos Karanasos (Microsoft’s Gray Systems Lab), Molham Aref and Arun Kumar, moderated by Parisa Kordjamshidi.

Below is a summary of the main points that came out of this discussion:

After two decades of in-RDBMS machine learning research and implementations, database systems have not made a compelling case for data scientists to move their workflows there. A transition phase is currently under way, where the database community with all the experience of the past is looking for crucial features, such as data versioning and data governance, that would make DBMSes attractive to data scientists, and where the definition of in-RDBMS machine learning becomes less rigid with the adoption of data lakes and the interoperability with systems like TensorFlow and open formats like ONNX.

In addition to the above, keeping up with all the innovation that is happening in ML and bringing it into the DBMS has become quite difficult and is probably an unrealistic expectation. For instance, the acceleration that hardware-optimized compilers of systems like TensorFlow bring to ML needs substantial work in order to be replicated in DBMSes. Given this, approaches that combine a DBMS with an accelerator and offload different parts of the workflow to where they will be executed best are gaining momentum. This is already implemented, for example, in Redshift ML and SQL Server, where data are seamlessly exported to SageMaker or Azure ML, where the ML part runs.
On the other hand, query optimization, computation reuse and scalability, which are well-studied areas in database systems, are insufficiently supported in ML platforms.
People often use relational algebra and SQL interchangeably, but in reality these are two different things. This means that there is still room for innovation at the language level, which can increase the usability of RDBMSes for data science workflows. At the same time, there is no fundamental issue in translating popular APIs of relational operators, such as Pandas, to SQL, thus allowing users to still write Python or another language of their preference.
So far deep learning has focused on perception data, i.e. visual or speech data, and the benchmarks commonly used in research papers address tasks like image recognition and language modeling. Structured data are largely ignored, possibly in part due to the lack of transferability of knowledge between datasets and tasks, although this has not been thoroughly investigated. Hence, to draw attention in this direction, benchmarks of real-world tasks on relational data or knowledge graphs need to be created, e.g. learning the correlation between sales and weather or traffic data. Such benchmarks will most probably come from industry.
Statistical Relational Learning (SRL) has not taken off as much as deep learning. One reason for this could be the by-design ability of deep neural networks to compose higher-level features from lower-level ones, whereas in SRL feature engineering happens via the expression of constraints. That being said, graph neural networks are where most of the SRL ideas live nowadays. In addition, statistical modeling as expressed in probabilistic programming languages such as Stan often encapsulates a structure that is not explicitly referred to as relational, but indeed shares common ground with relational algebra.
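
The panel's point about translating dataframe-style APIs to SQL can be illustrated with a small sketch. The `Frame` class below is purely hypothetical and is not the API of Pandas or of any existing library; it only shows that a chained filter and grouped sum, similar in spirit to a Pandas `groupby`, can be compiled into a single SQL statement that the database executes. SQLite from the Python standard library and the `sales` table stand in for a real RDBMS and dataset.

```python
import sqlite3


class Frame:
    """Hypothetical, minimal dataframe-like wrapper that compiles to SQL."""

    _OPERATORS = {"=", "!=", "<", "<=", ">", ">="}

    def __init__(self, table, conditions=None):
        self.table = table
        self.conditions = conditions or []  # list of (column, operator, value)

    def filter(self, column, op, value):
        """Return a new Frame with one more row-level condition."""
        if op not in self._OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        return Frame(self.table, self.conditions + [(column, op, value)])

    def groupby_sum(self, key, column):
        """Compile to (sql, params) for a sum of `column` grouped by `key`."""
        sql = f'SELECT "{key}", SUM("{column}") FROM "{self.table}"'
        if self.conditions:
            clauses = [f'"{c}" {op} ?' for c, op, _ in self.conditions]
            sql += " WHERE " + " AND ".join(clauses)
        sql += f' GROUP BY "{key}" ORDER BY "{key}"'
        params = [value for _, _, value in self.conditions]
        return sql, params


# Toy in-memory database standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.0), ("west", 1.0)],
)

# The user writes Python; the database runs the generated SQL.
sql, params = Frame("sales").filter("amount", ">", 2.0).groupby_sum("region", "amount")
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

In this sketch the filter values are passed as query parameters, while table and column names are interpolated directly into the statement, so it assumes trusted identifiers. A production translator would also need to cover joins, expressions and type handling, which is where most of the engineering effort lies.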

Concluding Remarks

Overall, we are very happy with the content of the 1st DBAI, which included insightful presentations and a constructive panel discussion. I’d like to sincerely thank my fellow organizers (Nikolaos Vasiloglou, Parisa Kordjamshidi, Maximilian Schleich, Kirk Pruhs and Zenna Tavares), the PC members, the speakers and panelists, the sponsors, the volunteers and, last but not least, the authors and attendees for each contributing in his/her own way to making DBAI’21 a successful workshop. I really hope we will have the opportunity to organize another DBAI soon.