27. 4. 2021

Babelfish: AWS’s Latest Open-Source Initiative

AWS (Amazon Web Services) is an industry leader in promoting open-source initiatives and alternatives through heavy integration with its existing cloud services. It leads several popular open-source initiatives including, but not limited to, Open Distro for Elasticsearch (a distribution of the open-source search and analytics engine), the EKS Distro for Amazon Elastic Kubernetes Service (a certified Kubernetes distribution to create reliable and secure clusters), and the AWS-backed, production-ready distribution of the OpenTelemetry project (which collects metrics and traces for application monitoring). On the database end, it is already a huge proponent of popular open-source databases such as MySQL, PostgreSQL, and Redis, readily integrating them into many of its core database-oriented services.
Over the better part of the past decade, non-proprietary software, and open-source databases in particular, has been growing at a rapid pace. In fact, open-source databases surpassed commercial databases in popularity over the past year (as reported by DB-Engines).
Additionally, two of the four most popular databases as of April 2021 (according to DB-Engines) are open source, and the trend does not seem to have an end in sight. Specifically, PostgreSQL adoption has been unparalleled, with it being named the fastest growing database by a healthy margin over 2019 and 2020. Its success can be attributed to its bevy of features, quick performance, and recent integration into major cloud vendors. Despite this widespread adoption, the commercially licensed Oracle Database and Microsoft SQL Server still have the lion’s share of the relational database market (along with the Oracle-owned MySQL). Much of this is due to companies being entrenched or ‘locked-in’ to specific databases and the large capital expenditure associated with migrating away from said databases. In recent years, as migration to the cloud has ramped up, companies have been taking the opportunity to migrate to open-source databases at the same time.
Naturally, AWS aims to make migration to the cloud as simple as possible and, historically, it has focused on streamlining the migration of data between databases using services such as AWS Database Migration Service (DMS) and AWS Schema Conversion Tool (SCT). AWS DMS is a fully managed service that helps migrate databases (both homogeneous and heterogeneous) to AWS quickly, securely, and with minimal downtime. AWS SCT is used during heterogeneous migrations to outline database discrepancies and automatically convert the source database schema and objects such as views, functions, and stored procedures into a format compatible with the target database. Despite the widespread popularity and use of these services, one of the largest deterrents to heterogeneous database migrations lies not with the data itself, but with the migration of legacy applications written for the source database. In most instances, considerable time and resources need to be allocated to tweaking and retesting existing applications to ensure they work as intended with the new database. Consequently, many companies decide that the work and cost associated with migrating to an open-source database far outweigh the licensing costs associated with their current commercial vendor. With Babelfish, AWS aims to minimize this heavy burden by providing a simple, optional, fully managed endpoint that can be enabled with the click of a button.
The name Babelfish takes inspiration from the popular comedy science fiction series ‘The Hitchhiker’s Guide to the Galaxy,’ in which there is a bright, yellow, leech-like fish dubbed the ‘Babel fish’ that can be placed in a person’s ear to translate any language into the wearer’s native tongue. Similarly, Babelfish for PostgreSQL translates both Microsoft SQL Server’s proprietary T-SQL dialect and its Tabular Data Stream (TDS) wire protocol into something understandable by PostgreSQL. It achieves this by adding an endpoint (the Babelfish itself) to PostgreSQL, which coexists with existing drivers and endpoints, to accept incoming TDS connections and T-SQL requests and translate them into PostgreSQL’s native PL/pgSQL. Put another way, with the Babelfish endpoint enabled, an application developed for SQL Server believes it is connecting to an MSSQL database, while the PostgreSQL database believes the incoming requests were written in PL/pgSQL. As a result, applications and workloads written for SQL Server should need minimal to no developer intervention to migrate over to PostgreSQL.
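To make the "two endpoints, one database" idea concrete, here is a minimal sketch of how the same cluster might be addressed from a SQL Server application and from a native PostgreSQL application. It only builds connection strings and opens no connection. The hostname, database name, and credentials are hypothetical, and the ports are assumed to be the usual defaults (1433 for TDS, 5432 for PostgreSQL), which a real deployment may change.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Assumed defaults; an actual cluster may be configured differently.
TDS_PORT = 1433       # port SQL Server clients conventionally use (TDS)
POSTGRES_PORT = 5432  # conventional PostgreSQL port


@dataclass(frozen=True)
class Cluster:
    host: str      # hypothetical cluster hostname
    database: str  # hypothetical database name


def sqlserver_odbc_string(cluster: Cluster, user: str, password: str) -> str:
    """Connection string as an existing SQL Server application would build it.

    The application keeps its SQL Server driver; only the server address
    changes to point at the TDS endpoint.
    """
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={cluster.host},{TDS_PORT};"
        f"DATABASE={cluster.database};UID={user};PWD={password}"
    )


def postgres_url(cluster: Cluster, user: str, password: str) -> str:
    """Connection URL for a new application using a native PostgreSQL driver."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{cluster.host}:{POSTGRES_PORT}/{cluster.database}"
    )


if __name__ == "__main__":
    cluster = Cluster(host="example-cluster.internal", database="sales")
    print(sqlserver_odbc_string(cluster, "app_user", "secret"))
    print(postgres_url(cluster, "app_user", "secret"))
```

Both strings name the same host and differ only in driver and port, which is the practical meaning of the Babelfish endpoint coexisting with the existing PostgreSQL endpoint.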
In addition, with Babelfish you can continue to build new applications using PostgreSQL’s native PL/pgSQL that will run in concert with existing T-SQL code. Consider a situation common to many businesses these days: the ‘single source of truth’ either does not exist, or the ‘truth’ is distributed and constantly changing in such a way that it is impossible to define. Often this is the result of having multiple databases that cater to rapidly changing and emerging problems, where it is much simpler to add new functionality by way of new databases than to repurpose existing data infrastructure. The versatility of Babelfish shines in these situations because a company based on SQL Server can now create these new databases on PostgreSQL with minimal code changes to existing applications. Additionally, new functionality can be written in either T-SQL or PL/pgSQL and run alongside existing code. In these cases, Babelfish serves as an entry point for migrating over to an open-source database.
Of course, for most companies, the speed and performance of an application will be one of the highest priorities, and adding an extra translation layer is likely to carry some performance cost. In these cases, Babelfish should be thought of as a feature that buys time, allowing developers to port legacy T-SQL applications over to PL/pgSQL gradually rather than all in one go.
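A gradual port usually starts with knowing how much T-SQL-specific code there is. The sketch below is a hypothetical helper, not part of Babelfish or AWS SCT, that counts a few well-known T-SQL constructs in a block of SQL text. The list of idioms is deliberately tiny, the PostgreSQL counterparts are rough equivalents rather than exact ones, and regular expressions are only a crude stand-in for a real SQL parser.

```python
import re
from collections import Counter

# Hypothetical, incomplete mapping: T-SQL idiom -> (pattern, rough PostgreSQL counterpart).
TSQL_IDIOMS = {
    "SELECT TOP": (re.compile(r"\bSELECT\s+TOP\b", re.IGNORECASE), "LIMIT"),
    "GETDATE()": (re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE), "now()"),
    "ISNULL()": (re.compile(r"\bISNULL\s*\(", re.IGNORECASE), "COALESCE()"),
}


def inventory(sql_text: str) -> Counter:
    """Count occurrences of each known T-SQL idiom in sql_text."""
    counts = Counter()
    for name, (pattern, _counterpart) in TSQL_IDIOMS.items():
        hits = len(pattern.findall(sql_text))
        if hits:
            counts[name] = hits
    return counts


if __name__ == "__main__":
    legacy_sql = """
        SELECT TOP 5 name, ISNULL(nickname, name)
        FROM users
        WHERE created_at < GETDATE();
    """
    for idiom, count in inventory(legacy_sql).most_common():
        print(f"{idiom}: {count} occurrence(s), consider {TSQL_IDIOMS[idiom][1]}")
```

Run over a codebase, counts like these could help decide which applications to leave on the Babelfish endpoint for now and which are cheap enough to port first.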
Upon release, Babelfish will include support for translation of a wide range of T-SQL elements including syntax/dialect, data types, triggers, stored procedures, functions, cursors, and catalog views. Babelfish will not be released as a standalone service; rather, it will be a feature in Amazon Aurora, AWS’s fully managed relational database engine compatible with MySQL and PostgreSQL. Aurora is part of Amazon RDS (Relational Database Service) and, according to AWS, can deliver up to five times the throughput of MySQL and three times the throughput of PostgreSQL by automating and managing the backend infrastructure. Babelfish will start as an optional endpoint for PostgreSQL databases based on Aurora, and the plan is to expand it into other services as the project matures. Notably, Babelfish will continue to be completely open-source and will be released under the permissive Apache 2.0 license, which should make it compatible with most existing licenses.

Conclusion

Babelfish is expected to launch sometime in 2021 and is already available in public preview on Amazon Aurora. It remains to be seen how Babelfish will be received by open-source developers and by businesses looking to leave commercial vendors, but AWS’s history of leading and dedicating resources to numerous open-source projects, together with Babelfish’s customer-focused and community-oriented goals, bodes well for the continued growth and maturity of the project.