Developing Data Science products - Agile approach at Grupa Pracuj
Jan Zyśko, Magdalena Kalbarczyk
We use a case study to present the approach to developing Data Science products at Grupa Pracuj. Agile development and maintenance of such products pose unique challenges, as their usability strongly depends on having accurate models and efficient data pipelines. In the talk we go through different phases of development of one such product, which employs Deep Learning to solve an NLP problem.
There is a long and bumpy road from defining business needs to creating a useful and understandable Data Science tool for either internal or external users. Undertaking such a task requires preparing data flows, developing Machine Learning models, and presenting the results to end users. Moreover, it all has to be done in close collaboration with Business, in order to ensure rapid prototyping and maximum impact. In this talk, we present a case study of one such project.
Over the last several months we have tried to address the need for predicting pracuj.pl users’ behaviour. We started with an experiment, which helped us see whether users behave predictably enough on a macro scale for our models to achieve useful results.
After obtaining promising results, we moved on to create an MVP for internal use, to see if the product would be useful as a Django application for our CC department. At this point we had a fairly complicated technology stack: Python, Hadoop, SQL Server, and AWS. This was because we prioritized development speed over seamless integration.
Once the usability and usefulness of the application were confirmed, we moved on to building a proper ETL process, which minimizes integration and security issues and offers good scalability at reasonable computing costs. This work will most likely extend into Q4.
Our future plans for the project involve, in addition to continuous work on model accuracy, tapping into the very recent research on the interpretability of NLP models, in order to provide our end users with actionable feedback. Moreover, we are looking into the possibility of presenting the insights from the model directly to the pracuj.pl customers.
Big Data at Grupa Pracuj
2. Solution development
The experimental phase
Building an application and further model development
Pushing to production
3. What we have learned
Insights for Data Scientists
Advice for Product Owners