Dirta Science

Dirta Science is a template for the Python 3 Cookiecutter package that implements CRISP-DM, using GitLab for CI and project phase tracking.

At a high level, data science projects typically follow the steps below:

Data science competition sites like Kaggle provide clear goals (e.g. binary classification for the Titanic dataset) and the associated documentation, whereas personal or company projects have to establish these themselves. Each step therefore needs to be monitored as it evolves, in case an event causes some aspect of the project to change. Typically, such an event affects an aspect of the “discovery process” (Frawley, Piatetsky-Shapiro and Matheus, 1992):

Without tracking these changes, it becomes difficult to tell which discovered knowledge is valuable and which is not. This can cause project creep at any stage, whereby knowledge that only appears valuable is accepted as new domain knowledge:

CRISP-DM, co-developed by Chapman et al. (2000), ironed out the minutiae of how project phases should function and what the documentation for each phase should include to ensure business goals are continuously met. Dirta Science makes the CRISP-DM template documents readily available in a template directory layout that can be specialised to your project upon instantiation, and uses GitLab CI to facilitate portability of the analysis performed:


Although a standardised base directory is provided to make switching between multiple projects easy, the format is not strict: you are free to remove inessential parts and add details specific to your project. However, a ‘must’ for this to work is consistency in naming, e.g. always using the data directory as the place for sample data to demonstrate analysis over, rather than for storing explanatory documents.
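The naming consistency described above can be checked mechanically. The following is a minimal sketch and not part of Dirta Science itself: only the data directory is named in this README, so the other entries in EXPECTED_DIRS, the DOC_SUFFIXES set and both function names are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical list: only "data" is named in this README; the other
# entries are placeholders for whatever your project standardises on.
EXPECTED_DIRS = ("data", "docs", "src")

# Assumed convention: files with these suffixes count as explanatory
# documents and therefore do not belong in the data directory.
DOC_SUFFIXES = {".md", ".rst", ".docx", ".pdf"}


def missing_dirs(project_root, expected=EXPECTED_DIRS):
    """Return the expected directory names that are absent from project_root."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).is_dir()]


def misplaced_docs(project_root, data_dir="data"):
    """Return document-like files found under the data directory, sorted by path."""
    data_path = Path(project_root) / data_dir
    if not data_path.is_dir():
        return []
    return sorted(
        p for p in data_path.rglob("*")
        if p.is_file() and p.suffix.lower() in DOC_SUFFIXES
    )
```

Calling missing_dirs(".") and misplaced_docs(".") from the project root would list any directories that have been dropped or renamed and any documents that have drifted into data. Adjust both constants to match the layout you actually settle on.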

Read more on the layout and setup here.

Acknowledgements

Shoutout to Cookiecutter Data Science: without their examples of Makefile and python-dotenv usage within their proposed template layout, I wouldn’t have been sent down this ‘mild’ rabbit hole to improve my data mining process.

Bibliography

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000). CRISP-DM 1.0 Step-by-step Data Mining Guide. Technical Report. The CRISP-DM Consortium.

Frawley, W., Piatetsky-Shapiro, G. and Matheus, C. (1992). Knowledge Discovery in Databases: An Overview. AI Magazine, 13(3), pp. 57-69.