Karolina Alexiou - Patterns for Collaboration between Data Scientists And Software Engineers





Published on Jul 26, 2017

The talk presents, with examples, how a software engineering team can work with data scientists (both in-house and external collaborators) to leverage their domain knowledge and data analysis skills, while supporting them to work independently and ensuring that their work can be continuously tested, evaluated, and easily integrated into the larger product.

Collaboration between data scientists and software engineers can have the following issues:

• Data scientists and engineers use different tools (more interactive vs. more automated; for example, IPython notebooks vs. the command line)

• If getting the latest data requires ops/engineering knowledge, the analysis may be done on "stale" data or on too small a subset of the data (for example, data scientists working with manual exports)

• Regression testing, parameter tuning, evaluation of results, backfills, and other common scenarios in data-driven applications also require engineering knowledge. The engineers are in the best position to provide tools and processes for the data science team, but this potential can go untapped

These issues lead to a longer time to production, unhappiness in the data science team if they end up fighting with operations work instead of doing the work they like, less trustworthy results, and less trust between teams in general. If collaboration is done right, however, data science and engineering teams can have a symbiotic relationship in which each person applies their strengths towards a common goal.

Some collaboration patterns to foster a good relationship between data scientists and engineers are the following:

• Continuous evaluation – making sure the data science algorithm continues to give good results with every commit (or combination of commits, in case there are several repositories with different data scientists working on them)

• Report templating – data scientists can work with Jupyter notebooks using an extension that allows those .ipynb files to be used as templates (i.e., where some variable values can be filled in later). Those notebooks can then be applied to different datasets to quickly diagnose issues.

• Data API – provide a well-documented API that gives data scientists easy access to the data, so that they can do their exploration without needing the software engineering team to manually provide exports

• Some flexibility regarding tools – if domain experts prefer to use SFTP to upload files to the server for analysis, let them. Too much flexibility, however, can be an anti-pattern.
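
As an illustration of the continuous evaluation pattern above, the following is a minimal sketch and not the setup presented in the talk. It assumes the algorithm can be called as a plain function and that a small labelled evaluation set is kept under version control. The names `evaluate`, `threshold_classifier`, `EVAL_INPUTS`, `EVAL_LABELS` and the 0.8 accuracy floor are all hypothetical. A CI job could run the test function on every commit and fail the build when quality drops.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvaluationResult:
    accuracy: float
    passed: bool


def evaluate(
    predict: Callable[[float], int],
    inputs: Sequence[float],
    labels: Sequence[int],
    min_accuracy: float,
) -> EvaluationResult:
    """Score `predict` on a fixed labelled dataset and compare it to a floor."""
    if len(inputs) != len(labels) or len(inputs) == 0:
        raise ValueError("inputs and labels must be non-empty and of equal length")
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    accuracy = correct / len(inputs)
    return EvaluationResult(accuracy=accuracy, passed=accuracy >= min_accuracy)


# Hypothetical stand-in for the data scientists' algorithm.
def threshold_classifier(x: float) -> int:
    return 1 if x >= 0.5 else 0


# Hypothetical frozen evaluation set; in practice this would be loaded
# from a versioned file so that every commit is scored on the same data.
EVAL_INPUTS = [0.1, 0.4, 0.6, 0.9, 0.45, 0.55]
EVAL_LABELS = [0, 0, 1, 1, 1, 1]


def test_model_meets_baseline() -> None:
    """Regression gate: fails when accuracy falls below the agreed floor."""
    result = evaluate(
        threshold_classifier, EVAL_INPUTS, EVAL_LABELS, min_accuracy=0.8
    )
    assert result.passed, f"accuracy dropped to {result.accuracy:.2f}"
```

Because the check is an ordinary test function, a runner such as pytest can collect it alongside the engineering team's other tests, which keeps the evaluation in the same workflow as the rest of the product.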


PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
