The journey from Oracle and PL SQL to Python and parquet files

enter image description here

In an earlier article I mentioned that we would start migrating some of our processes. Our main goal was to replace the combination of the Oracle database and PL SQL with the combination of parquet files and Python. Here are my experience, the challenges and successes that we have had along this journey.

The Beginning: What?, Why? and How?

It is important - especially at a company level - that you have a clear vision about your goal and the way to achieve it. Usually a lot of money is at stake. As almost always, our goal was to reduce costs and to increase the performance and scalability. As Oracle is quite costly, we decided to replace it. At this point we had something at our disposal; a team in our company had been working on a framework that is designed to process vast amounts of data in an efficient way. The framework is written in C++, but there is a Python layer on top of it. Based on their comments this framework was supposed to be really powerful. Cool, we decided to go with it. As I don’t want to disclose the name of our framework, further on I will just call it the Yoda framework. So, using Yoda could kill two birds with one stone: to reduce costs and to increase the performance. Getting rid of Oracle meant that we had to store the data elsewhere. Yoda supports various data sources including parquet files. Parquet files use columnar format and have very efficient compression. Also, by reading only the required columns you can avoid having a lot of unnecessary data in the memory. As parquet files consume less storage space, this was another money-saver.
Having all these in mind we started planning and doing some POCs (Proof of Concept) to make sure of its feasibility. The general and overall planning was done together among us, with a senior colleague leading it. My role was to get to know the Yoda framework and prepare a POC about it.

The Final Tech Stack

Having completed the planning phase and our POCs, the final tech stack is as the following:

Docker: development environment.
Python (Yoda framework): implementation language, the engine.
Parquet files: to store data.
Hive: to be able to query parquet files easily.
Airflow: to manage the scheduling and execution.
Oracle: we keep a very limited Oracle that is behind our FE which is used by our operation team. They can create some setups which we can use in our processes.
Cloud hosting.

My Role in the Migration

So, my role was to investigate and prove that we can achieve our goals by replacing Oracle with Yoda. Our team works with a lot of data and in many cases our processes are such that were kind of meant to be for databases, so we may think good or bad about Oracle, but “beating” it was not an obvious outcome. Let’s see how it went.

The Development Environment

In our company we use Windows, but Yoda was designed for and can run only on Linux. My first attempt was to use WSL (Windows Subsystem for Linux). I could almost set it up, but in the end I met some blocking issues and I decided to use Docker instead. Probably this is what I should have started with. I started with a clean CentOS 8 image and crawled through the bumpy road of installing everything that the framework required. The framework is quite “fastidious” with regards to libraries and requirements, so it was quite a job to get it all done. I have to confess here that I was not a Linux expert, so this was not something ever easy for me. Anyway, I managed to set up the environment ready to run Yoda. I created a Dockerfile that builds the complete image, apart from running the container there is absolutely nothing else necessary.

Docker has some significant advantages:

Exactly the same environment for all the developers.
The successful set-up of the environment is guaranteed by the Dockerfile.
Very easy to change it.
No need for Linux Admin support teams or help-desk, we can do everything inside the image ourselves.
It is free and well supported.

Technologies and Tools

As already mentioned, our development environment is Docker. Regarding the IDE for Python, I’m personally in favour of PyCharm, however, the community version cannot work with Docker containers. Therefore I chose Visual Code which is still very good.
After every new release of the Yoda framework we just build a new image and run a container from it. Then we attach Visual Code to the container and “ya está” as we say it in Spain, we can start coding.
We share the local Git repository with the container. It is easier to use Git outside of the container plus the code stays put even after we stop the container.
That’s it. Please don’t hesitate to write a comment if you have any questions. I would also like to hear about your opinion or suggestion in the comment section below.

The POC Itself

The POC consisted of two parts. The first and smaller part was to get familiar with the framework itself. Obviously I started with the documentation and after I tried several examples for different scenarios. The second and major part was to implement five sub-processes of our “production execution chain” using the Yoda framework. Our processes are mostly huge data manipulation operations or complex calculations using several data source inputs to produce one output. So, totally different technology yet the result should be the same and the performance should be better. Python and parquet files vs Oracle database and PL SQL. At the end we got what we aimed for, so the POC was successful.
Of course it was not as simple as it sounds. Obviously there was a learning curve to complete. Especially in the beginning I detected some bugs in the framework that slowed me down. There are many algorithms that are implemented and handled totally differently in a database query or script than in a programming language. And in some cases our know-all angel, stackoverflow couldn’t even be asked, because for some things it was given that we have to use the framework, so I had to figure it out with “what I had”.
Anyway, there was success at the end, so sunshine and happiness.

Actual Implementation

Actually we have just started the real implementation these weeks and so far so good. With a colleague of mine we prepared the environment to be ready to start. We created a proper repository, utils libraries, some base classes and we prepared the setup. At the moment we are working on the implementation of the whole production process with some sub-processes already being completed. I really enjoy coding in Python, I think it is a great language with great features.

Personal Experience and Thoughts

I am really happy that I had the chance to take part in this migration project. Building something from the very beginning is very exciting and you can learn a lot. I cannot appreciate enough the fact that we have a clean code base. What a satisfying feeling, it kind of makes me wanna jump in and start coding. By having changed the complete tech stack we can continuously improve and learn, it keeps our motivation high and we have acquired some valuable knowledge. In my opinion the new stack is quite up-to-date and stands its ground in today’s market.

Have a nice week, cheers!

RolkoTech

Search This Blog