Data Engineer
Remote
Mactores
Startup
Service
B2B
Pre-seed
Data Analytics
Mumbai, Maharashtra, India
Permanent
Experience: 4-6 years
Skills
Apache Spark
Apache Kafka
AWS
Scala
Azure
Data Engineering
Python
PySpark
ETL

About the job

Mactores is a trusted leader in providing modern data platform solutions. Since 2008, Mactores has been enabling businesses to accelerate value through automation by providing end-to-end data solutions that are automated, agile, and secure. We collaborate with customers to strategize, navigate, and accelerate an ideal path forward through digital transformation via assessments, migration, or modernization.

We are seeking a highly skilled and innovative Spark Engineer to join our team. In this role, you will design, develop, optimize, and operationalize high-performance data pipelines and applications using Apache Spark. The role requires hands-on expertise in distributed data processing, ETL engineering, performance tuning, and cluster management, along with close collaboration with cross-functional teams to deliver reliable, scalable, and efficient data solutions.

What will you do

  • Architect, design, and build scalable data pipelines and distributed applications using Apache Spark (Spark SQL, DataFrames, RDDs).

  • Develop and manage ETL/ELT pipelines to process structured and unstructured data at scale.

  • Write high-performance code in Scala or PySpark for distributed data processing workloads (a minimal PySpark sketch follows this list).

  • Optimize Spark jobs by tuning shuffle, caching, partitioning, memory, executor cores, and cluster resource allocation.

  • Monitor and troubleshoot Spark job failures, cluster performance, bottlenecks, and degraded workloads.

  • Debug production issues using logs, metrics, and execution plans to maintain SLA-driven pipeline reliability.

  • Deploy and manage Spark applications on on-prem or cloud platforms (AWS, Azure, or GCP).

  • Collaborate with data scientists, analysts, and engineers to design data models and enable self-serve analytics.

  • Implement best practices around data quality, data reliability, security, and observability.

  • Support cluster provisioning, configuration, and workload optimization on platforms like Kubernetes, YARN, or EMR/Databricks.

  • Maintain version-controlled codebases, CI/CD pipelines, and deployment automation.

  • Document architecture, data flows, pipelines, and runbooks for operational excellence.
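To give a sense of the day-to-day work described above, the sketch below is a minimal, illustrative PySpark batch job covering DataFrame ETL, partitioning, caching, and a few tuning knobs. The paths, schema fields, and tuning values are assumptions for illustration only, not Mactores' actual configuration.

```python
# Minimal PySpark ETL sketch: read raw events, clean, aggregate, and write
# partitioned Parquet. Paths, schema, and tuning values are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("daily-events-etl")
    # Illustrative tuning knobs; real values depend on data volume and cluster size.
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

# Placeholder input path; in practice this could be S3, ADLS, GCS, or HDFS.
raw = spark.read.json("s3://example-bucket/raw/events/")

cleaned = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Cache only when the DataFrame is reused by multiple downstream actions.
cleaned.cache()

daily_counts = (
    cleaned.groupBy("event_date", "event_type")
           .agg(F.count("*").alias("events"))
)

(
    daily_counts
    .repartition("event_date")            # fewer small files per output partition
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/daily_event_counts/")
)
```

In practice, values such as shuffle partitions and executor memory/cores are derived from data volume, skew, and cluster size rather than fixed defaults.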

What we are looking for

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

  • 4+ years of experience building distributed data processing pipelines, with deep expertise in Apache Spark.

  • Strong understanding of Spark internals (Catalyst optimizer, DAG scheduling, shuffle, partitioning, caching); a plan-inspection sketch follows this list.

  • Proficiency in Scala and/or PySpark with strong software engineering fundamentals.

  • Solid expertise in ETL/ELT, distributed computing, and large-scale data processing.

  • Experience with cluster and job orchestration frameworks.

  • Strong ability to identify and resolve performance bottlenecks and production issues.

  • Familiarity with data security, governance, and data quality frameworks.

  • Excellent communication and collaboration skills to work with distributed engineering teams.

  • Ability to work independently and deliver scalable solutions in a fast-paced environment.
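As a small illustration of the Spark-internals expectation above, inspecting a query's physical plan is a common way to reason about Catalyst's choices (broadcast vs. sort-merge joins, shuffle exchanges, filter pushdown). This is a minimal sketch assuming an existing SparkSession and two illustrative DataFrames named orders and customers:

```python
# Inspect how Catalyst plans a join: look for BroadcastHashJoin vs. SortMergeJoin,
# Exchange (shuffle) nodes, and pushed-down filters in the physical plan.
# `orders` and `customers` are assumed, illustrative DataFrames.
from pyspark.sql import functions as F

joined = (
    orders.join(customers, "customer_id")
          .filter(F.col("order_total") > 100)
)

# Prints the parsed, analyzed, optimized logical, and physical plans.
joined.explain(mode="extended")
```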

You will be preferred if you have

  • Experience with Databricks, AWS EMR, Glue Spark, or GCP Dataproc.

  • Familiarity with workflow orchestration tools like Apache Airflow, Dagster, or Prefect.

  • Exposure to streaming platforms such as Kafka, Kinesis, or Pub/Sub (a minimal Structured Streaming sketch follows this list).

  • Experience running Spark workloads on Kubernetes.

  • Familiarity with data warehouse ecosystems (Snowflake, BigQuery, Redshift, Iceberg, Delta Lake, Hudi).

  • Understanding of DevOps practices, CI/CD, and IaC (Terraform, CloudFormation).

  • Knowledge of distributed logging and monitoring tools (Grafana, Prometheus, CloudWatch, ELK).

  • Prior experience in high-scale production environments or data platform teams.
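To make the streaming preference above concrete, here is a minimal Spark Structured Streaming sketch that consumes a Kafka topic and appends parsed records to a partitioned Parquet sink. The broker address, topic, schema, and paths are placeholders, and the job assumes the spark-sql-kafka connector is available on the cluster.

```python
# Minimal Structured Streaming sketch: consume a Kafka topic and append the
# parsed records to a partitioned sink. Broker, topic, schema, and paths are
# placeholders, not values from this posting.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

# Placeholder schema for the Kafka message payload.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw Kafka stream; broker and topic are placeholders.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into columns.
parsed = (
    stream.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", F.to_date("event_ts"))
)

# Append micro-batches to a partitioned Parquet sink with checkpointing.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/streaming/events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .partitionBy("event_date")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

In a real deployment the sink would more often be Delta Lake, Iceberg, or Hudi (also listed above), with checkpointing and trigger intervals chosen to match latency and cost requirements.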