Streamlit wants to revolutionize building machine learning and data science applications, scores $21 million Series A funding

Streamlit wants to be for data science what business intelligence tools have been for databases: A quick way to get to results, without bothering much with the details

We were confused at first when we got the news. We interpreted “application framework for machine learning and data science” to mean some new framework for working with data, such as PyTorch, DeepLearning4j, and Neuton, to name just a few among many others out there.

So, our first reaction was: Another one, how is it different? Truth is, Streamlit is not a framework for working with data per se. Rather, it is a framework for building data-driven applications. That makes it different to begin with, and there’s more.

Streamlit is aimed at people who don’t necessarily know or care much about application development: Data scientists. It was created by a rock star team of data scientists who met in 2013 while working at Google X. It’s open source and has been spreading like wildfire, with some 200,000 applications built since late 2019.

Today Streamlit announced that it has secured $21 million in Series A funding. ZDNet connected with CEO Adrien Treuille to discuss what makes Streamlit special, and where it, and data-driven applications at large, are going next.

To listen to the conversation with Treuille in its entirety, you can head to the Orchestrate All the Things podcast.

From zero to hero: from datasets and models to applications

The investment was co-led by Gradient Ventures and GGV Capital, with additional participation from Bloomberg Beta, Elad Gil, Daniel Gross, and others. Glenn Solomon, a managing partner at GGV Capital, said that:

“Adapting quickly to new information and insights is one of the biggest challenges facing companies today. Streamlit is leading the way in helping data science teams accelerate time to market and amplify the work of machine learning throughout companies of all sizes across a wide variety of industries. At GGV we’re very excited to back this exceptional founding team and support their ambitious global growth plans.”

Let’s take it from the start, then. In Treuille’s words, he and his co-founders came to be entrepreneurs via academia, doing machine learning and big data and AI before they were called by those names, and certainly before they were cool. Through his stints on the AI teams at Google X and Zoox, Treuille observed a pattern.

The promise of machine learning and artificial intelligence was often sequestered in those groups, not influencing the rest of the organization as much or as easily as it could. That led Treuille to start working on a pet project to solve this. Eventually, it started getting used by a number of engineers and growing really quickly. Then investment came, and a big open-source launch.

Streamlit is working on enabling data scientists to develop data-driven applications in a fraction of the time it normally takes. Image: metamorworks, Getty Images/iStockphoto

Streamlit grew under the radar, from a one-man project to being used in a number of Fortune 500 companies and beyond, until today. It worked that way for a number of reasons.

First, Treuille and his co-founders leveraged their network. Second, they open-sourced Streamlit, which made it easy for everyone to adopt and experiment with. Third, and perhaps most important, they captured what Treuille called the Zeitgeist: They offered a solution to a problem that data scientists, and the organizations employing them, are facing:

How to go from fiddling with datasets and models, to deploying an application using them in production. In essence, to do this, a number of people have to work together. At the very least, data scientists and application developers. As usual in situations like these, skills and culture differ, and collaborating costs time and money.

Streamlit cites Delta Dental as an example. They were told that using AI to analyze their call center traffic would cost a hefty amount and take a year. A data scientist at Delta Dental used Streamlit instead, and he built an MVP in a week, a prototype in three weeks, and had it in production in three months, all at zero cost to the company, says Streamlit.

Taking the application developers out of application development

To understand how this is possible, we need to dive deeper into how Streamlit works. Streamlit tries to take the application development team out of the picture, by enabling data scientists to develop their own applications.

Treuille elaborates on the conundrum of getting data scientists to build applications, or getting application builders to work with data scientists. Data scientists do not necessarily have the core skills for application building, and their applications end up unmaintainable. Application builders, for their part, eventually move on to other applications, and features end up frozen.

What Streamlit does is let data scientists create applications as a byproduct of their workflow. It turns their Python scripts into applications by letting them insert a few lines of code that abstract away application constructs such as widgets.
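
To make that concrete, here is a minimal sketch of such a script. The example is our own illustration, but the st.* calls are Streamlit’s actual API:

```python
# app.py: a plain data analysis script, plus a few Streamlit calls.
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo: simulated daily measurements")

# One line of code yields an interactive widget in the browser; the script
# re-runs top to bottom with the new value on every interaction.
days = st.slider("Days to simulate", min_value=10, max_value=100, value=30)

# The data science part is ordinary pandas/NumPy code.
df = pd.DataFrame({
    "day": range(days),
    "value": np.random.randn(days).cumsum(),
})

# Another one-liner renders the result as an interactive chart.
st.line_chart(df.set_index("day"))
```

Running `streamlit run app.py` serves this as a web application, with no HTML, JavaScript, or web framework code in sight.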

That’s unorthodox. Software engineers would argue there’s a reason why web development frameworks exist, for example, with many years of experience and best practices distilled into them. To throw all that away in favor of annotated Python scripts would look like bad practice, not to mention an existential threat.

Treuille begs to differ. To support his view, besides widespread adoption, he argues that this is a different way of developing applications. The applications are different, the scope is different, and Streamlit does not intend to reinvent the application development wheel, but rather, to integrate it:

“We view ourselves as a translation layer between the Python world and the web framework world. For example, everything in Streamlit is written in React. When you’ve discovered the joy of React, that’s like programming nirvana. We can take almost anything in the React ecosystem, and translate that into Streamlit almost effortlessly. So our core technology is really that translation layer.”


From a Python script to a web application, with a few extra lines of code. Image: Streamlit

Treuille went on to add that soon, Streamlit will enable any developer to translate any bit of web tech into a single Python function, thus allowing the two ecosystems to flourish independently of one another. The same approach applies to other Python frameworks, such as Dask or Ray:

“Streamlit is very modest, in some ways, very small. And therefore we sit alongside whatever — the whole Python data ecosystem. And that is really exciting because of the bigger story here, which is way, way bigger than Streamlit. It’s the data world which was at one time big databases, and then it was Excel, and then it was Tableau, and more recently Looker.

This tsunami is coming, which is open source and machine learning, and Python, and Pandas, and scikit-learn. This is basically 20 years of academic research into machine learning, crashing into the data world and completely transforming it. And we view ourselves as just a little surfboard in that wave, just riding it, or trying to ride it as best we can.”
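
To make the translation-layer idea concrete, here is a minimal sketch using Streamlit’s components API (streamlit.components.v1), which surfaces a snippet of web tech to Python as a single function call. The HTML is our own illustrative example:

```python
import streamlit as st
import streamlit.components.v1 as components

st.title("Web tech behind a Python call")

# Arbitrary HTML/JavaScript, exposed to the data scientist as one function
# call; Streamlit renders it inside the app in a sandboxed iframe.
components.html(
    """
    <div id="greeting" style="font-family: sans-serif;"></div>
    <script>
      document.getElementById("greeting").innerText =
        "Rendered by the browser, invoked from Python";
    </script>
    """,
    height=60,
)
```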

There’s an app for that. Should you build it with Streamlit?

That may explain the approach, but not the scope. There is more to applications than data and data-driven features. If you are Netflix, for example, the core business revolves around streaming, and the applications should reflect that. They should enable people to manage payments, stream films, and so on.

Recommendations add to that, powered by data and machine learning. But they are not the core business. Treuille acknowledged that Streamlit does not aspire to be the front end to your entire company: “If Netflix came to us and said, hey, we want to write the Netflix app website in Streamlit, we’d say we don’t think that’s a good idea.”

Streamlit is not a general-purpose application development framework. What it does, in a way, is the same thing that business intelligence application frameworks did for databases. It provides a framework that enables quick access to the underlying source of value. For BI frameworks, it was data stored in databases. For Streamlit, it’s machine learning models.

We would still question how many data scientists, or their managers for that matter, would be happy to take on maintaining their Streamlit applications on top of everything else they already do. We would also question whether application developers can, or should, be taken out of the picture entirely, even for purpose-built, data-driven applications, as those grow over time. But Streamlit is too early in its lifecycle to answer those questions.


Data scientists are not necessarily the most suitable people to develop applications. Image: Streamlit

That, however, does not seem to have stopped users or investors. Speaking of which, there’s another interesting question here: What is Streamlit’s business model, and how did it convince people to invest money in it? In a nutshell: Software as a service in the cloud, with a tweak.

You can use Streamlit to develop any application without any restrictions. What you pay for, optionally, is deployment. Users can deploy Streamlit anywhere they please, on their own. But Streamlit offers its own cloud solution, called Streamlit for Teams, which comes with additional features around collaboration and deployment.

Treuille was adamant about Streamlit’s bottom-up sales strategy: Just getting the software out there, enabling people to start building applications, and then converting some of them into paying users.

The bigger picture: Software 2.0

Streamlit is interesting, if nothing else, because of the different paradigm it brings to application development. That, in turn, is part of what Treuille sees as a different way of building applications:

“The bigger picture is the way that the Python ecosystem and the community of open source developers and academic developers and corporations — TensorFlow is built by Google, PyTorch by Facebook — how all of these different forces have come together to create this incredibly powerful data ecosystem. That truly can revolutionize the show. That truly has different properties than just a simple spreadsheet and a list of your sales over the past year.”

Some people refer to this as Software 2.0. What we wondered, however, was whether the world is really ready for this. In many ways, most organizations probably have not gotten Software 1.0 right yet. Version control, release management, software development tools, and processes — these are not exactly trivial things.

Now add to that dataset management, provenance, machine learning and feature engineering, and versioning, to name but a few of the concerns of data-driven development, and what you get is a combinatorial explosion. Treuille conceded that this is really part of the Zeitgeist of the past couple of years.

Treuille sees Streamlit as being part of a wave of new startups such as Tecton or Weights and Biases, which are essentially productionizing every layer of that stack. He believes talented people are working on this, and it’s coming into view. His take on how to get with the program:

“If you are a company, asking yourself how to get into this world, what is even the first step, I would say: Go to Insight Data Science. Hire one of their machine learning engineers or data scientists finishing the school for data scientists, and then give them Streamlit.”

Content retrieved from: https://www.zdnet.com/article/streamlit-wants-to-revolutionize-building-machine-learning-and-data-science-applications-scores-21-million-series-a-funding/.


Data Lakehouse, meet fast queries and visualization: Databricks unveils Delta Engine, acquires Redash

Data warehouses alone don’t cut it. Data lakes alone don’t cut it either. So whether you call it a data lakehouse or by any other name, you need the best of both worlds, says Databricks. A new query engine and a visualization layer are the next pieces in Databricks’ puzzle.

Databricks announced today two significant additions to its Unified Data Analytics Platform: Delta Engine, a high-performance query engine on cloud data lakes, and Redash, an open-source dashboarding and visualization service for data scientists and analysts to do data exploration.

The announcements are important in their own right, since they bring significant capabilities to Databricks’ platform, which is already seeing good traction. However, we feel it’s important to also put them in context, in the greater scheme of things.

ZDNet connected with Databricks CEO and co-founder Ali Ghodsi to discuss the Data Lakehouse vision and reality, and where Delta Engine and Redash fit in. To listen to the discussion, check the Orchestrate All the Things podcast.

Data warehouses, data lakes, data lakehouses

Databricks was founded in 2013 by the original creators of Apache Spark to commercialize the project. Spark is one of the most important open-source frameworks for data science, analytics, and AI, and Databricks has grown in importance over the years as well.

Earlier this year, Databricks started floating the term Data Lakehouse, to describe the coalescing of two worlds, and words: Data warehouses and data lakes. Some people liked the term and the idea; others were more critical. No matter, argues Ghodsi, because this is happening anyway, with or without the term, with or without Databricks.

Databricks defines Data Lakehouses as having a number of properties traditionally associated with data warehouses, such as schema enforcement, governance, and support for business intelligence, and others traditionally associated with data lakes, such as support for diverse data types and workloads.


The Databricks definition of a data lakehouse

Ghodsi cited two major trends as the driving force for the emergence of the Data Lakehouse: data science and machine learning, and multi-cloud. Neither of those existed at the time data warehouses were first implemented. Data lakes tried to address those, but they have not always succeeded.

As Ghodsi mentioned, data warehouses are not good at storing unstructured data such as multimedia, which is often what is needed for machine learning. Plus, he went on to add, data warehouses use proprietary formats — which means lock-in — and the cost of storing data rises steeply with the volume of data.

Data lakes, on the other hand, don’t have those issues — they have different ones. First, they often become “data swamps,” because they do not provide any structure, which makes it hard to find and use the data. Plus, their performance is not great, and they don’t support business intelligence workloads — at least not in their initial incarnation.

So, what is Databricks’ answer on how to bring those two worlds together? What are the building blocks of the Data Lakehouse?

How much schema do you need and when?

The first part of the equation, Ghodsi said, is a common, non-proprietary storage format, plus transactional properties: Delta Lake, which has already been open-sourced by Databricks. The other part is a fast query engine, and this is where Delta Engine comes in. Still, we were not entirely convinced.

What we see as the major differentiation between data warehouses and data lakes is governance and schema. This is the dimension on which these approaches sit on opposite sides of the spectrum. So, the real question is: How does the Databricks vision of a Data Lakehouse deal with schema?

Data warehouses have schema on write, which means all data has to have a schema upfront, at the moment of ingestion. Data lakes have schema on read, which means data can be ingested faster and without making a priori decisions, but then deciding what schema applies to the data and even finding the data becomes a challenge.

The Databricks answer to that is called Schema Enforcement & Evolution. Ghodsi framed it as having your cake and eating it, too. The way it works is that it is possible to store data without enforcing a schema. But then, if you want to format data into tables, there are different levels of schema enforcement that can apply. In addition to schema enforcement, schema evolution enables users to automatically add new columns of rich data:


Data lakes, data warehouses, and the data lakehouse. The term was coined in 2016 by Pedro Javier Gonzales Alonso.

“The raw tables, they might be in any particular format and they actually are. Essentially schema on read tables. And then after that, you move your data into a bronze table, then after that into a silver table, and after that into a gold table. At each of these levels, you’re refining your data and you’re putting more schema enforcement on it,” Ghodsi said.
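
Here is a minimal PySpark sketch of that refinement, including the schema enforcement and evolution behavior described above. The paths, columns, and cleaning rules are our own illustration, not a prescribed Databricks pipeline; it assumes a Spark session with the Delta Lake package available:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Bronze: land the raw records with minimal structure (schema on read).
raw = spark.read.json("/data/raw/events/")
raw.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: refine the data -- enforce types, drop malformed rows.
bronze = spark.read.format("delta").load("/delta/bronze/events")
silver = (bronze
          .withColumn("ts", F.to_timestamp("ts"))
          .dropna(subset=["user_id", "ts"]))
silver.write.format("delta").mode("append").save("/delta/silver/events")

# Schema enforcement: an append whose schema doesn't match the table
# (here, an extra column) is rejected by default:
richer = silver.withColumn("device", F.lit("mobile"))
# richer.write.format("delta").mode("append").save("/delta/silver/events")
# -> raises AnalysisException: schema mismatch

# Schema evolution: an explicit opt-in adds the new column instead.
(richer.write.format("delta").mode("append")
       .option("mergeSchema", "true")
       .save("/delta/silver/events"))

# Gold: an aggregated, BI-ready table with a fully enforced schema.
gold = silver.groupBy(F.to_date("ts").alias("day")).count()
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_counts")
```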

That’s all well and good, although how this approach compares to a more fully-fledged data catalog is another issue. We could not help but note, however, that this is essentially something Databricks’ Data Lakehouse enables users to do, but does not necessarily support them in doing.

Ghodsi pointed out that there are training programs Databricks users can attend, and that its architects follow the approach and spread the word, too. In the end, however, just as data warehouses come with a certain way of thinking that people have to subscribe to, the same applies here.

If you want to get with the Data Lakehouse program, the technology alone won’t cut it. You have to adopt the methodology too, and this is something you should be aware of. And with that, we can move to Delta Engine and Redash and see where exactly they fit in the big picture.

Delta Engine and Redash: A fast query engine, and the missing piece in visualization

When we mentioned earlier that data lakes don’t support business intelligence workloads, you may have noticed the qualifier: that’s not entirely true these days. By now, a number of SQL-on-Hadoop engines have enabled business intelligence tools to connect to data lakes.

Some of these engines, like Hive or Impala, have been around for a while. So, the question on Delta Engine is: How is it different? What it boils down to, Ghodsi said, is that it’s much faster. We’ll skip the deep dive, which ZDNet’s Andrew Brust will do tomorrow, but suffice it to say that Delta Engine is written in C++, and it uses vectorization.
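
To give a flavor of what vectorization buys, here is a conceptual analogy of ours in Python/NumPy: processing values as contiguous batches in compiled code, rather than interpreting them one row at a time. Delta Engine does this natively in C++ over table data; this snippet only illustrates the principle:

```python
import time
import numpy as np

prices = np.random.rand(5_000_000)
quantities = np.random.randint(1, 10, size=5_000_000).astype(np.float64)

# Row-at-a-time: every iteration pays interpreter and dispatch overhead.
t0 = time.perf_counter()
total = 0.0
for p, q in zip(prices, quantities):
    total += p * q
t_loop = time.perf_counter() - t0

# Vectorized: one tight, SIMD-friendly loop over contiguous memory.
t0 = time.perf_counter()
total_vec = float(prices @ quantities)
t_vec = time.perf_counter() - t0

print(f"row-at-a-time: {t_loop:.2f}s, vectorized: {t_vec:.4f}s")
```

On typical hardware, the vectorized version is a couple of orders of magnitude faster, which is the kind of gap that matters for interactive queries.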

Vectorized, native execution can make a difference in the engine’s performance, which in turn can make a difference in interactive query workloads. On hearing the analysis from Ghodsi, we ventured a prediction: We figured Delta Engine may not follow the lead of its predecessors. Databricks’ policy has been to start projects for its own use initially, and then open-source them, which is what happened with Delta Lake.

But it sounded like Databricks invested a lot in Delta Engine, and it’s a differentiation point too. So while it’s not clear whether Delta Engine will ever be open-sourced, it’s safe to say it won’t happen soon. Open source, however, is a key theme in the acquisition of Redash.


Redash visualizations will become part of Databricks’ stack, as it was “love at first sight”. Image: Databricks

Apache Spark, on which Databricks’ platform is based, excels at streaming and batch analytics, as well as machine learning and more code-oriented data engineering work. But neither open-source Spark nor the commercial Databricks platform are focused on visual data pipeline authoring or the full range of connectors necessary to move data from enterprise SaaS applications.

The above paragraph, identifying missing pieces in Databricks’ stack, was written by Brust recently. By acquiring Redash, the visualization missing piece is no longer missing. Databricks and Redash are similar and complementary: They are good at what they do — a back-end and front-end for data, respectively — and they capitalize on open source products, which they offer as managed solutions in the cloud.

Databricks did need a visualization solution for its stack — there’s no question about that. The real question is: Why acquire Redash? Databricks could have gotten the missing piece of the puzzle via a partnership. Or, if they wanted Redash’s technology, they could have just gotten it — it’s open source. To us, this looked like an acqui-hire.

Ghodsi more or less confirmed this. He said it was “love at first sight” with Redash; they liked the product, and they aligned with the team, so they decided to bring them onboard to fully integrate Redash in Databricks’ stack. The core Redash product will remain open source. Why not just get the technology?

“Oftentimes there is actually a factory behind these software artifacts, the factory that builds them. Exactly how that factory works… no one from the outside ever really knows how they actually build the software. And when you acquire the company, you get the whole factory. So you know that it’s going to work,” Ghodsi said.

Accelerating the future

Discussing how the Redash team will be integrated into Databricks brought us to the business recap part of the conversation. A few months back, Ghodsi had stated that Databricks is seeing remarkable growth. We wondered whether that momentum is holding up. We figured the last few months may actually have helped, given the nature of what Databricks does. Ghodsi concurred:

“The pandemic is accelerating the future. People are getting rid of cash. They’re doing more telemedicine, more video conferencing. AI and machine learning is part of that future. It is the future. So it’s getting accelerated, more and more CFOs are saying — let’s actually double down on more automation. Cloud is another thing that is inevitable. Eventually, everybody will be in the cloud. That’s also accelerated.

So those are positive trends. Plus, a lot of startups have been laying off people or hiring freezes. We’ve been fortunate that we’ve sort of planned for an economic downturn, so we were really set up for hitting the gas and accelerating when this happened. For instance, we started hiring and we see a significant boost in hiring. The other thing is that we’re well capitalized, because we’ve been sort of saving money for this.”

Well capitalized indeed — Databricks is fresh from raising a massive $400 million funding round. Of course, it’s every CEO’s job to tell the world that their company is doing great. In this case, however, it looks like Databricks is riding with the times indeed.

The new pieces of the puzzle, Delta Engine and Redash, seem to fit well into the big picture. What remains to be seen is how well the Databricks recipe for data governance and schema management works in practice for those who adopt it.

Content retrieved from: https://www.zdnet.com/article/data-lakehouse-meet-fast-queries-and-visualization-databricks-unveils-delta-engine-acquires-redash/.


The state of AI in 2020: Democratization, industrialization, and the way to artificial general intelligence

From fit-for-purpose development to pie-in-the-sky research, this is what AI looks like in 2020.

After releasing what may well have been the most comprehensive report on the State of AI in 2019, Air Street Capital and RAAIS founder Nathan Benaich and AI angel investor and UCL IIPP visiting professor Ian Hogarth are back for more.

In the State of AI Report 2020, Benaich and Hogarth outdid themselves. While the structure and themes of the report remain mostly intact, its size has grown by nearly 30%. That is a lot, especially considering their 2019 AI report was already a 136-slide journey through all things AI.

The State of AI Report 2020 is 177 slides long. It covers technology breakthroughs and their capabilities; the supply, demand, and concentration of talent working in the field; large platforms, financing, and areas of application for AI-driven innovation today and tomorrow; special sections on the politics of AI; and predictions for AI.

ZDNet caught up with Benaich and Hogarth to discuss their findings.

AI democratization and industrialization: Open code and MLOps

We started out by discussing the rationale for such a substantial contribution, which Benaich and Hogarth admitted took up an extensive amount of their time. Their feeling is that their combined industry, research, investment, and policy backgrounds, and their currently held positions, give them a unique vantage point. Producing this report is their way of connecting the dots and giving something of value back to the AI ecosystem at large.

Coincidentally, Gartner’s 2020 Hype Cycle for AI was also released a couple of days back. Gartner identifies what it calls two megatrends that dominate the AI landscape in 2020: democratization and industrialization. Some of Benaich and Hogarth’s findings, however, concern the massive cost of training AI models and the limited openness of research. This seems to contradict Gartner’s position, or at least imply a different definition of democratization.

Benaich noted that there are different ways to look at democratization. One of them is the degree to which AI research is open and reproducible. As the duo’s findings show, it is not: only 15% of AI research papers publish their code, and that has not changed much since 2016.

Hogarth added that traditionally AI as an academic field has had an open ethos, but the ongoing industry adoption is changing that. Companies are recruiting more and more researchers (another theme the report covers), and there is a clash of cultures going on as companies want to retain their IP. Notable organizations criticized for not publishing code include OpenAI and DeepMind:

“There’s only so close you can get without a sort of major backlash. But at the same time, I think that data clearly indicates that they’re certainly finding ways to be close when it’s convenient,” said Hogarth.


Industrialization of AI is under way, as open source MLOps tools help bring models to production

As far as industrialization goes, Benaich and Hogarth pointed towards their findings in terms of MLOps. MLOps, short for machine learning operations, is the equivalent of DevOps for ML models: Taking them from development to production, and managing their lifecycle in terms of improvements, fixes, redeployments, and so on.
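
As a concrete example of one MLOps step, experiment tracking, here is a minimal sketch using the open-source MLflow library. MLflow is our choice for illustration as one popular tool in the category, not one the report prescribes:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Parameters, metrics, and the model artifact are recorded together,
    # so the run can be reproduced, compared against others, and the
    # logged model later redeployed.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```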

Some of the more popular and fastest-growing GitHub projects in 2020 are related to MLOps, the duo pointed out. Hogarth also added that for startup founders, for example, it’s probably easier to get started with AI today than it was a few years ago, in terms of tool availability and infrastructure maturity. But there is a difference when it comes to training models like GPT3:

“If you wanted to start a sort of AGI research company today, the bar is probably higher in terms of the compute requirements. Particularly if you believe in the scale hypothesis, the idea of taking approaches like GPT3 and continuing to scale them up. That’s going to be more and more expensive and less and less accessible to new entrants without large amounts of capital.

The other thing that organizations with very large amounts of capital can do is run lots of experiments, and iterate on large experiments, without having to worry too much about the cost of training. So there’s a degree to which you can be more experimental with these large models if you have more capital.

Obviously, that slightly biases you towards these almost brute force approaches of just applying more scale, capital and data to the problem. But I think that if you buy the scaling hypothesis, then that’s a fertile area of progress that shouldn’t be dismissed just because it doesn’t have deep intellectual insights at the heart of it.”

How to compete in AI

This is another key finding of the report: Huge models, large companies, and massive training costs dominate NLP (Natural Language Processing), the hottest area of AI today. Based on variables released by Google et al., research has estimated the cost of training NLP models at about $1 per 1,000 parameters.

That means that a model such as OpenAI’s GPT3, which has been hailed as the latest and greatest achievement in AI, could have cost tens of millions to train. Experts suggest the likely budget was $10 million. That clearly shows that not everyone can aspire to produce something like GPT3. The question is: Is there another way? Benaich and Hogarth think so and have an example to showcase.

PolyAI is a London-based company active in voice assistants. They produced and open-sourced a conversational AI model (technically, a pre-trained contextual re-ranker based on transformers) that outperforms Google’s BERT model in conversational applications. PolyAI’s model not only performs much better than Google’s, but it also required a fraction of the parameters to train, and therefore a fraction of the cost.


PolyAI managed to produce a machine learning language model that performs better than Google’s in a specific domain, at a fraction of the complexity and cost.

The obvious question is: How did PolyAI do it? This could be an inspiration for others, too. Benaich noted that the task of detecting intent, and understanding what somebody on the phone is trying to accomplish by calling, is solved much better by treating it as what is called a contextual re-ranking problem:

“That is, given a kind of menu of potential options that a caller is trying to possibly accomplish based on our understanding of that domain, we can design a more appropriate model that can better learn customer intent from data than just trying to take a general purpose model — in this case BERT.

BERT can do OK in various conversational applications, but it just doesn’t have the kind of engineering guardrails or engineering nuances that can make it robust in a real-world domain. To get models to work in production, you actually have to do more engineering than you have to do research. And almost by definition, engineering is not interesting to the majority of researchers.”
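
To sketch the re-ranking idea in code: score a fixed menu of domain intents against the caller’s utterance and return them ranked, rather than generating an open-ended answer. The encoder below is a deterministic stand-in of our own; in a real system such as PolyAI’s, it would be the pre-trained transformer:

```python
import hashlib
import numpy as np

# The domain "menu": everything a caller might plausibly want.
INTENTS = ["book appointment", "cancel appointment",
           "opening hours", "speak to a human"]

def encode(text):
    """Stand-in sentence encoder returning a unit vector.
    A real re-ranker would use a pre-trained transformer here."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def rank_intents(utterance):
    """Score each intent against the utterance and return them best-first.
    The model only has to rank a known menu, which is what lets it be far
    smaller than a general-purpose language model."""
    u = encode(utterance)
    scored = [(intent, float(encode(intent) @ u)) for intent in INTENTS]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_intents("hi, I need to change when I'm coming in"))
```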
