Job Title: Data Engineer
Location: Dallas, TX
Job Summary:
As a Data Engineer (Databricks Lead), you will be a critical member of our data engineering team, responsible for designing, developing, and optimizing our data pipelines and platforms on Databricks, primarily leveraging AWS services. You will play a key role in implementing robust data governance with Unity Catalog and ensuring cost-effective data solutions. This role requires a strong technical leader who can mentor junior engineers, drive best practices, and contribute hands-on to complex data challenges.
Responsibilities:
- Databricks Platform Leadership:
- Lead the design, development, and deployment of large-scale data solutions on the Databricks platform.
- Establish and enforce best practices for Databricks usage, including notebook development, job orchestration, and cluster management.
- Stay abreast of the latest Databricks features and capabilities, recommending and implementing improvements.
- Data Ingestion and Streaming (Kafka):
- Architect and implement real-time and batch data ingestion pipelines using Apache Kafka for high-volume data streams.
- Integrate Kafka with Databricks for seamless data processing and analysis.
- Optimize Kafka consumers and producers for performance and reliability.
- Data Governance and Management (Unity Catalog):
- Implement and manage data governance policies and access controls using Databricks Unity Catalog.
- Define and enforce data cataloging, lineage, and security standards within the Databricks Lakehouse.
- Collaborate with data governance teams to ensure compliance and data quality.
- AWS Cloud Integration:
- Leverage various AWS services (S3, EC2, Lambda, Glue, etc.) to build a robust and scalable data infrastructure.
- Manage and optimize AWS resources for Databricks workloads.
- Ensure secure and compliant integration between Databricks and AWS.
- Cost Optimization:
- Proactively identify and implement strategies for cost optimization across Databricks and AWS resources.
- Monitor DBU consumption, cluster utilization, and storage costs, providing recommendations for efficiency gains.
- Implement autoscaling, auto-termination, and right-sizing strategies to minimize operational expenses.
- Technical Leadership & Mentoring:
- Provide technical guidance and mentorship to a team of data engineers.
- Conduct code reviews, promote coding standards, and foster a culture of continuous improvement.
- Lead technical discussions and decision-making for complex data engineering problems.
- Data Pipeline Development & Optimization:
- Develop, test, and maintain robust and efficient ETL/ELT pipelines using PySpark/Spark SQL.
- Optimize Spark jobs for performance, scalability, and resource utilization.
- Troubleshoot and resolve complex data pipeline issues.
- Collaboration & Communication:
- Work closely with data scientists, analysts, and other engineering teams to understand data requirements and deliver solutions.
- Communicate technical concepts effectively to both technical and non-technical stakeholders.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Engineering, or a related quantitative field.
- 7+ years of experience in data engineering, with at least 3 years in a lead or senior role.
- Proven expertise in designing and implementing data solutions on Databricks.
- Strong hands-on experience with Apache Kafka for real-time data streaming.
- In-depth knowledge and practical experience with Databricks Unity Catalog for data governance and access control.
- Solid understanding of AWS cloud services and their application in data architectures (S3, EC2, Lambda, VPC, IAM, etc.).
- Demonstrated ability to optimize cloud resource usage and implement cost-saving strategies.
- Proficiency in Python and Spark (PySpark/Spark SQL) for data processing and analysis.
- Experience with Delta Lake and other modern data lake formats.
- Excellent problem-solving, analytical, and communication skills.
Added Advantage (Bonus Skills):
- Experience with Apache Flink for stream processing.
- Databricks certifications.
- Experience with CI/CD pipelines for Databricks deployments.
- Knowledge of other cloud platforms (Azure, GCP).