
Apple Comet Brings Fast Vector Processing to Apache Spark

Consumer electronics giant Apple has released into open source a plug-in that helps Apache Spark execute queries more efficiently through native vectorized processing, making the open source data processing platform more appealing for large-scale machine learning data crunching.


The Apple engineers behind the Rust-based plug-in, called Apache Spark DataFusion Comet, have submitted it to become an Apache Software Foundation project, under the Apache Arrow umbrella. It is built on the extensible Apache DataFusion query engine (also written in Rust) and the Arrow columnar data format.

“Our goal is to accelerate Spark query execution via delegating Spark’s physical plan execution to DataFusion’s highly modular execution framework, while still maintaining the same semantics to Spark users,” explained Apple Software Engineer Chao Sun, on an Apache mailing list.

Sun noted that the project is not yet feature-complete, but parts of it are already used in production.

“This is a great example of the composable data system concept that everyone seems to be talking about lately,” noted Apache Arrow Project Management Committee Chair Andy Grove on X. “In this case, using Spark’s very mature planning and scheduling and delegating to DataFusion for native execution.”

What Is Apache Arrow DataFusion Comet?

Using the Apache Arrow DataFusion runtime, Comet can query data in the Apache Arrow columnar format, an approach designed to improve query efficiency and runtime through native vectorized execution.

Apache Spark was created in 2010 for processing large amounts of distributed data in a variety of structured and unstructured formats ("Big Data").

Vector processing has become a favorite technique in the machine learning community because it can sharply cut the time needed to analyze large amounts of data.

“Vectorized querying improves the performance, efficiency, scalability and memory footprint of analytical queries by operating on batches of data and processing multiple elements of data in parallel. It is inextricably linked with columnar database architecture, as it allows entire columns to be loaded into a CPU register and processed,” wrote Fivetran Senior Product Evangelist Charles Wang, in an analysis piece last month.
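The difference between row-at-a-time and columnar batch processing that Wang describes can be sketched in plain Python. This is an illustrative toy only: real vectorized engines such as DataFusion operate on Arrow record batches with SIMD instructions, but the data-layout contrast is the same.

```python
# Hypothetical sketch of row-oriented vs. columnar processing.
# Real engines (e.g. DataFusion) use Arrow record batches and SIMD;
# this only illustrates the difference in data layout.

# Row-oriented: each record is a tuple; a filter touches every record,
# field by field.
rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]
row_result = sum(price for _id, price in rows if price > 15.0)

# Columnar: each column is a contiguous array. The predicate scans one
# column as a batch, and the aggregation scans another.
prices = [10.0, 20.0, 30.0, 40.0]
mask = [p > 15.0 for p in prices]        # batch-evaluated predicate
col_result = sum(p for p, keep in zip(prices, mask) if keep)

assert row_result == col_result == 90.0
```

Because a columnar scan reads one homogeneous array sequentially, it is far friendlier to CPU caches and SIMD registers than chasing heterogeneous row tuples, which is the point Wang makes about columnar architecture.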

Comet was designed to maintain feature parity with Spark itself (currently, it supports Spark versions 3.2 – 3.4). This means users can run the same queries whether or not the Comet extension is enabled.

Spark's built-in expressions and operators (Filter/Project/Aggregation/Join/Exchange) work with Comet, as does the Apache Parquet columnar storage format, in both read and write modes.

Comet also requires JDK 8 and up and GLIBC 2.17, and can run on either Linux or macOS.
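Because Comet loads as a Spark plugin, enabling it is a matter of configuration rather than query changes. The sketch below is illustrative only: the exact property keys, plugin class name, and jar filename vary across Comet releases, so consult the project's README for the version you build.

```shell
# Illustrative only — property names and the jar path are assumptions
# that may differ by Comet release; check the Comet README.
spark-shell \
  --jars comet-spark-spark3.4_2.12-SNAPSHOT.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true
```

With the plugin active, Spark's planner hands supported physical operators to Comet's native DataFusion runtime and falls back to regular JVM execution for anything unsupported, which is how the "same queries either way" guarantee holds.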

Other Spark Plug-ins That Speed Vector Processing

Apple is not the only member of the FAANG club interested in vector processing: last year, Meta released its own Spark vector processing project, Velox, as open source, noted software engineer Chris Riccomini.

Similar projects include Intel's Gluten (recently accepted into ASF incubation), Nvidia's RAPIDS Spark accelerator for GPUs, Blaze (which also works with Apache Arrow DataFusion), and the Ballista distributed SQL query engine.

