Description

I have more than 10 years of experience in Data Engineering, including 8 years with Apache Spark.

I love sharing my knowledge, working on open source software, and writing blog articles.

Medium: medium.com/@pin.furcy

Languages

French
Native or bilingual
English
Full professional proficiency

Workplace preferences

Open to working on site

Paris (up to 20 km)

Younited Credit
Lead Data Engineer
BANKING & INSURANCE
June 2018 - September 2022 (4 years and 3 months)
Paris, France
- Led a central team of 4 Data Engineers in charge of the Data Platform, Data Lake, Data Warehouse and Customer Data Platform. We built and maintained all data pipelines, from data ingestion to modeling, for all the functional scopes of the company.

- Set up the new architecture based on engineering best practices, using CI/CD deployment, Apache Airflow, and a tool similar to dbt but for PySpark that I developed internally, with automatic non-regression capabilities. After two years on the new architecture, our 4-person team had developed more than 1500 Spark jobs that ran every night. In 2021, the team split and I then led a 3-person team in charge of the Data Platform and tooling, while the other team focused on data ingestion and transformation.

- Developed an internal data quality tool and dashboard similar to Great Expectations.

- Migrated the data storage from Azure Blob Storage to Azure Data Lake Gen2, and reorganized the platform architecture to add pre-production and sandbox environments and to increase the platform’s security.

- Supervised the migration of our production Airflow deployment to Cloud Composer.

- Started and facilitated a Data Architecture Committee to discuss and spread best practices among Data Engineers, Data Analysts, DataSecOps, Data Scientists and ML Engineers.

- Active member of the Security Community: raised the alarm about Log4Shell and led the resolution on the Data side. I also wrote an anonymization specification, in agreement with the DPO, for GDPR compliance.
Tech Lead Data Engineer SQL BigQuery PySpark Python DBT Google Cloud Microsoft Azure IT Security Terraform DevOps
Criteo
Software Engineer R&D
HIGH TECH
December 2017 - June 2018 (6 months)
Paris, France
- Member of the team that maintained and optimized a distributed implementation of the Louvain algorithm for graph community detection, using Spark (Scala). This algorithm ran twice a day on a graph containing more than a billion nodes and edges; it took several hours to run and required very careful optimization.

- Maintained an in-house ingestion tool that performed data ingestion from S3 to HDFS. Found that it was losing up to 10% of the incoming data, and fixed it.

- Implemented monitoring by adding support for Prometheus endpoints to the company’s in-house scheduler, and implemented alerts.

- Contributed to two internal Scala training sessions (3 days each), and taught courses during these sessions.

- Coached the team in charge of HDFS (composed of DevOps engineers and SREs) to help them learn Scala and build a tool, using the Play framework, to automate cluster backups.
Scala Spark Hadoop
Flaminem
CTO & Cofounder
HIGH TECH
April 2013 - October 2017 (4 years and 6 months)
- Hired and managed a team of 5 developers, and trained them to use Hadoop, Spark and Scala.

- Responsible for technical recruitment and technology watch.

- Installed and maintained a Hadoop cluster in a secure environment (VPN). Installed and maintained various additional services, such as Spark and Presto, via Ansible rules.

- Developed an ingestion and data cleaning tool on MapReduce (successfully used to ingest 5 TB of first-party data from a large global insurance company), and ported it to Spark.

- Developed 15+ UDFs (including UDAFs and UDTFs) for Hive, used in production.

- Creator and main developer of Flamy, an open-source tool that helps organize, validate, test and run Hive queries, and eases the administration of Hive databases: https://github.com/flaminem/flamy
(Used for quality control of more than a hundred tables containing more than 100 TB of data. Used in production for the regular execution of several complex workflows with 10+ steps that ran for several hours, and used for database migration and refactoring.)

- Implemented and optimized a new connected components algorithm in Spark, running in O(log(d)) rounds and an order of magnitude faster than Hash-to-All and Hash-to-Min on specific graphs. It processed a 300M-node graph in ~10 minutes with ~100 CPUs, where Spark’s GraphX implementation took a few hours.

- Handled several Hadoop cluster migrations.
Tech Lead Scala Spark Hadoop Hive Terraform Ansible