Extract, Transform, Load (ETL) is a process that involves extracting data from various sources, transforming it into a format suitable for a target database or system, and then loading it into that system. ETL is a common task in data engineering, and it is particularly useful for small and mid-size companies that need to move data between systems or integrate data from different sources.
In this article, I will show you how to set up ETL using Keboola and Python. Keboola is a popular ETL platform that allows you to easily extract data from various sources, transform it using SQL or Python, and load it into a target system. I will walk you through the process of setting up a Keboola account, configuring a connection to a data source, and creating an ETL pipeline using Python.
Prerequisites:
Before we get started, you will need the following:
A Keboola account: You can sign up for a free trial at https://www.keboola.com/.
Python 3: You can download Python from https://www.python.org/.
A data source: For this tutorial, I will use a CSV file as the data source. You can use any data source that is supported by Keboola, such as a database or a cloud storage service.
Step 1: Set up a Keboola account
To get started, visit https://www.keboola.com/ and sign up for a free trial. Follow the prompts to create a new project, and then create a new connection to your data source. Keboola supports a wide range of data sources, including databases, cloud storage services, and file formats such as CSV and Excel.
For this tutorial, I will use a CSV file as the data source. To connect to a CSV file, click on the "Connections" tab in the left-hand navigation menu, and then click the "Add Connection" button. Select "CSV" as the connection type, and then follow the prompts to upload your CSV file and configure the connection.
Step 2: Create an ETL pipeline
Once you have set up your Keboola account and configured a connection to your data source, it's time to create an ETL pipeline. To do this, click on the "Transformations" tab in the left-hand navigation menu, and then click the "Create Transformation" button.
You will be presented with a list of available transformations. For this tutorial, I will use the "Python" transformation, which allows you to write custom Python code to transform your data. Click the "Python" transformation, and then follow the prompts to create a new transformation.
Step 3: Write the Python code
Now that you have created a new Python transformation, it's time to write some code. The Python transformation provides a template that includes some example code to get you started. I will modify this code to extract data from the CSV file connected in Step 1, transform it, and then load it into a target system.
Here is an example of how you might extract data from a CSV file and transform it using Python:
import csv

# Extract data from the input CSV file
with open('in/tables/customers.csv', 'r') as f:
    reader = csv.DictReader(f)
    customers = list(reader)

# Transform data
for customer in customers:
    customer['first_name'] = customer['first_name'].upper()
    customer['last_name'] = customer['last_name'].upper()
    customer['email'] = customer['email'].lower()
    customer['age'] = int(customer['age'])

# Load data into the output table
with open('out/tables/transformed_customers.csv', 'w', newline='') as f:
    fieldnames = ['first_name', 'last_name', 'email', 'age']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for customer in customers:
        writer.writerow(customer)
This code reads customer data from a CSV file named 'customers.csv' located in the 'in/tables/' folder. It then transforms the data by converting the first and last names to uppercase, the email to lowercase, and the age to an integer.
The transformed data is then written to a new CSV file named 'transformed_customers.csv' in the 'out/tables/' folder. It is important to note that the field names in the CSV file should match the field names used in the code and that the file path should match the location of the file.
Step 4: Save the data into a data store
When working with Keboola and a BI solution, it is typically best to save the transformed data from your ETL process in a data warehouse. A data warehouse is a central repository for all the data that is used to support business intelligence and analytics. Common data warehousing solutions include Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics.
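As a hedged sketch of this last step, here is one way you might load the output file from Step 3 into BigQuery using the google-cloud-bigquery client library; the project, dataset, and table names are placeholders, and the snippet assumes your Google Cloud credentials are already configured:
from google.cloud import bigquery

# Assumes application default credentials are configured for your GCP project
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row written in Step 3
    autodetect=True,       # infer the table schema from the CSV
)

# 'my-project.analytics.transformed_customers' is a placeholder table ID
with open('out/tables/transformed_customers.csv', 'rb') as f:
    load_job = client.load_table_from_file(
        f, 'my-project.analytics.transformed_customers', job_config=job_config
    )
load_job.result()  # wait for the load job to finish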
Once the data is in a data warehouse, it can be easily queried and analyzed using a BI tool such as Tableau, Power BI, or Looker. These tools can connect directly to the data warehouse and allow you to create visualizations, reports, and dashboards that can be used to analyze and understand the data.
Another option is to save the data in a cloud storage bucket like Amazon S3, Google Cloud Storage, or Azure Blob storage. The data can be accessed from there by the BI solution.
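For example, a minimal sketch of uploading the transformed file to an S3 bucket with boto3 might look like this (the bucket name and key are placeholders, and AWS credentials are assumed to be available in your environment or AWS config):
import boto3

# Credentials are picked up from the environment or your AWS config
s3 = boto3.client('s3')
s3.upload_file(
    Filename='out/tables/transformed_customers.csv',
    Bucket='my-analytics-bucket',             # placeholder bucket name
    Key='keboola/transformed_customers.csv',  # placeholder object key
)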
It's worth noting that some BI solutions have their own data storage, such as Looker and Gainsight.
It's important to consider factors such as data volume, performance requirements, and security when deciding where to store your data.
Why I prefer Python over SQL in Keboola:
When using Keboola, there are a few reasons why you might choose to use Python for data transformation instead of SQL:
Flexibility: Python is a general-purpose programming language, which means it can be used for a wide variety of tasks, including data transformation. This makes it a good choice when you need to handle more complex or custom data transformation tasks that may not be possible to express in SQL.
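For instance, assuming a hypothetical phone column in the customer data, a transformation such as normalizing free-form phone numbers with a regular expression is straightforward in Python but awkward to express in plain SQL:
import csv
import re

with open('in/tables/customers.csv', 'r') as f:
    customers = list(csv.DictReader(f))

for customer in customers:
    # Hypothetical column: strip everything that is not a digit,
    # e.g. '+1 (555) 123-4567' -> '15551234567'
    customer['phone'] = re.sub(r'\D', '', customer.get('phone', ''))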
Advanced data cleaning and wrangling: Python has a wide range of libraries such as pandas, NumPy, and scikit-learn that are specifically designed for data cleaning, wrangling, and manipulation tasks. These libraries provide functions and methods that make it much easier to clean, transform, and handle messy or unstructured data.
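As an illustration, the transformation from Step 3 could be rewritten with pandas in a few lines:
import pandas as pd

customers = pd.read_csv('in/tables/customers.csv')

# Vectorized string operations replace the explicit loop
customers['first_name'] = customers['first_name'].str.upper()
customers['last_name'] = customers['last_name'].str.upper()
customers['email'] = customers['email'].str.lower()
customers['age'] = customers['age'].astype(int)

customers.to_csv('out/tables/transformed_customers.csv', index=False)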
Interoperability: Keboola allows you to use Python to interact with other parts of the platform, such as the Keboola Storage API, which allows you to read from and write to data tables in Keboola. This can make it easier to create more complex ETL pipelines that involve multiple steps and data sources.
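As a rough sketch, assuming the kbcstorage package (the Keboola Storage API client for Python), listing the tables in your project might look roughly like this; check the client's documentation for the exact method names and arguments:
from kbcstorage.client import Client

# Placeholder connection URL and Storage API token
client = Client('https://connection.keboola.com', 'your-storage-api-token')

# List the tables available in the project's Storage
for table in client.tables.list():
    print(table['id'])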
Reusability: Python code can be written so that it can be reused across different transformations and projects, saving time and making it easier to maintain.
It's worth noting that Keboola allows you to use both Python and SQL for data transformation, and the choice of using one or the other will depend on the specific needs of your use case. SQL is great for querying and manipulating structured data, while Python is great for data cleaning, wrangling, analysis, and automation. Both have their own advantages and disadvantages, and the best approach is to use a combination of both to achieve your desired outcome.
Matěj Srna