Job Description
Key Responsibilities:
- Design, develop, and maintain ETL (Extract, Transform, Load) processes to ensure the seamless integration of raw data from various sources into our data lakes or warehouses.
- Utilize Python, PySpark, SQL, and AWS services such as Lambda, Glue, Redshift, and S3 to process, analyze, and store large-scale datasets efficiently.
- Develop and optimize data pipelines using AWS Glue for ETL tasks, PySpark for big data processing, and Python for scripting and automation (see the illustrative sketch after this list). Experience with Apache Spark/Databricks is highly desirable for advanced ETL workflows.
- Write and maintain SQL queries for data retrieval, transformation, and storage in relational databases such as Redshift or PostgreSQL.
- Collaborate with cross-functional teams, including data scientists, engineers, and domain experts, to design and implement scalable solutions.
- Troubleshoot and resolve performance issues, data quality problems, and errors in data pipelines.
- Document processes, code, and best practices for future reference and team training.
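
For context, a minimal sketch of the kind of PySpark ETL pipeline described above: read raw CSV from S3, clean and type the data, and write partitioned Parquet back to the data lake. Bucket names, paths, and column names here are hypothetical placeholders, not references to an actual environment.

```python
# Minimal PySpark ETL sketch (hypothetical paths and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: load raw CSV files landed in the data lake.
raw = spark.read.csv(
    "s3://example-raw-bucket/orders/", header=True, inferSchema=True
)

# Transform: drop malformed rows, normalize types, derive a partition column.
clean = (
    raw.dropna(subset=["order_id", "order_ts"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount", F.col("amount").cast("double"))
)

# Load: write curated Parquet, partitioned by date, for downstream querying
# (e.g., via Redshift Spectrum or loading into Redshift).
(
    clean.write.mode("overwrite")
         .partitionBy("order_date")
         .parquet("s3://example-curated-bucket/orders/")
)

spark.stop()
```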