At the Knowledge Graph conference, who were the speakers with Paco Nathan during the May 5 1700 panel "Panel discussion on Graph Data Science"? I had to leave the session, but am interested in the range of data engineering process that might be involved in using knowledge graphs as inputs to data science and ML processes.
Hey John Cabral! Great question. I am doing data engineering right now 😅 for a new project that is a knowledge graph which we are feeding into an ML process. TBH, I find the range of data engineering to be the same as for any other type of data set:
data cleaning
data match / merging
feature engineering, etc.
That is to say: I don't see any differences in the type of data engineering work I have to do. For the data science / ML modeling part of this type of work, though, the shape of the data is different, so the features I am looking for require different data science (NetworkX, graph algorithms, etc.). That doesn't come into play until the modeling stage, not the data engineering part.
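A minimal sketch of what "different data science for graph-shaped data" can look like with NetworkX, as Denise mentions. The toy graph, node names, and feature choices here are illustrative assumptions, not her actual pipeline:

```python
# Sketch: deriving per-node ML features from a tiny knowledge graph
# using NetworkX graph algorithms (hypothetical graph and features).
import networkx as nx

# Toy KG: three people connected to a company and to each other.
G = nx.Graph()
G.add_edges_from([
    ("alice", "acme"),
    ("bob", "acme"),
    ("carol", "acme"),
    ("alice", "bob"),
])

# Graph algorithms yield features that plain row-oriented
# feature engineering wouldn't surface.
pagerank = nx.pagerank(G)
features = {
    node: {
        "degree": G.degree(node),              # local connectivity
        "pagerank": round(pagerank[node], 3),  # global importance
    }
    for node in G.nodes
}
```

The point being: the cleaning and merging upstream of this look like any other data set; it's only at this step that graph algorithms enter.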
Denise G. Thank you for the explanation. Data science / ML are new areas for me, so I need to do some research to fully understand the answer and the details of the activities. I appreciate your help.
John Cabral happy to help! I am taking a sabbatical 🤩 starting tomorrow 🎉 until August 10th. I would love to help, but I will probably be pretty absent from Slack universes. Send me a note on Twitter @denisekgosnell if you want to keep the convo going! Happy graphing!
Denise G. Enjoy!
I will! So stoked!
Thank you!
Agreed with Denise G.'s points. Like her, my comment is more about generic data engineering than data engineering specific to KGs. One thing I would add as far as data engineering goes is that you want to "go up the stack" as much as possible. By this I mean that getting your hands on the tracking is better than just doing the ETL. This could be done by your teammates or yourself, but something great happens when the people who know the downstream use cases are the ones designing the tracking: you can start tracking signals that would otherwise be ignored, things like implicit human-in-the-loop feedback. Once you have that, maybe it starts feeding your KG generation pipeline, or maybe you use it only for a simple chart of your KG+ML performance, etc.
Louis G. Thank you. Could you clarify "tracking"? I don't know that I'm following your meaning.
By "tracking" I would mean the events that you can fire when a user interacts with an application, e.g. Google Analytics or other custom solutions. Let's take search as an example. You search for "US president", you get a carousel of president card at position zero, you click on the second one instead of the first one. This can fire a tracking event. You can then process that event to update semantic relatedness between word "president" and the president entities in your KG.