Hi guys! Good evening. Does anyone know how I can write a more efficient SPARQL query to sample some IDs than this:
SELECT DISTINCT ?sourceId WHERE {
?sourceId :hasX ?object.
}
ORDER BY RAND()
LIMIT 100
It's returning out-of-memory errors inconsistently; it seems that the smaller the limit, the more likely it is to run out of memory (???). I'm running it on AWS Neptune.
Victor Mariano Leite The problem is that your query first has to get all the results and then order them, which could easily cause an OOM. One thing to do is to trim the query somehow. Can you add other constraints (as triples in the query or using FILTER)? Also, this Stack Overflow question might be useful: https://stackoverflow.com/questions/29103024/sparql-randomly-select-one-connection-for-each-node
Andrea R. Hmm, I'm thinking about how I could do that, because I wanted to do a "kind of" stratified sampling. There are 3 fields I want to get, for example: fileId, pageId, pagemetadataId. I want to sample X fileIds and get all of their pages and page metadata. I don't know if there is a way to filter on fileId if I want to sample it beforehand. I was going to sample it first in a subquery, then use the returned fileIds as a constraint for the outer query. Does that make sense? hahaha
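A rough sketch of that subquery idea, in case it helps. The prefix IRI and the predicates :hasPage and :hasMetadata are made-up placeholders, since the real schema isn't shown here. Note that the inner ORDER BY RAND() still has to sort every distinct fileId, so on its own this only keeps the expensive sort away from the pages and metadata; it does not remove it.

```sparql
# Placeholder namespace; replace with the real one.
PREFIX : <http://example.org/>

SELECT ?fileId ?pageId ?pagemetadataId WHERE {
  {
    # Inner query: sample the files first (100 is an arbitrary sample size).
    SELECT DISTINCT ?fileId WHERE {
      ?fileId :hasPage ?anyPage .          # :hasPage is a hypothetical predicate
    }
    ORDER BY RAND()
    LIMIT 100
  }
  # Outer pattern: fetch all pages and page metadata for the sampled files only.
  ?fileId :hasPage ?pageId .
  ?pageId :hasMetadata ?pagemetadataId .   # :hasMetadata is a hypothetical predicate
}
```

SPARQL evaluates the subquery first and joins its ?fileId bindings with the outer triple patterns, so the sampled IDs act as the constraint for the rest of the query.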
A "dumb" approach is to filter by the fileId's first character (I don't know if it's a string or a number) or by some other characteristic. Then the subquery approach would be more doable.
Andrea R. thanks a lot for the tip! I was using a numeric ID, but you gave me an idea: since the fileIds are generated sequentially, I think taking only the ones where fileId % 10 == 0 is approximately a "random" sample in my case, since there is (I hope, haha) no bias in being the 10th item. It helped a lot 🙂
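For reference, a minimal sketch of what the "every 10th fileId" filter could look like. Standard SPARQL 1.1 has no % operator, so the check is written with FLOOR instead. The prefix IRI and the :fileNumber predicate (assumed to hold the numeric ID as an xsd:integer) are hypothetical, as is :hasPage.

```sparql
# Placeholder namespace; replace with the real one.
PREFIX : <http://example.org/>

SELECT DISTINCT ?fileId WHERE {
  ?fileId :fileNumber ?n .     # hypothetical predicate holding the numeric ID
  ?fileId :hasPage ?pageId .   # hypothetical predicate; any existing constraint works here
  # Keep only IDs divisible by 10: equivalent to ?n % 10 == 0.
  FILTER(?n = 10 * FLOOR(?n / 10))
}
LIMIT 100
```

Because there is no ORDER BY, the engine does not have to materialize and sort the full result set before applying the LIMIT. The sample is only as unbiased as the assumption that being every 10th ID is unrelated to the data.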