The Knowledge Graph Conference
Ask

Optimizing SPARQL Queries in AWS Neptune to Avoid Memory Issues

Victor Mariano Leite
·Sep 30, 2021 03:54 PM

Hi guys! Good evening. Does anyone know how I can write a more efficient SPARQL query to sample some IDs than this one?

SELECT DISTINCT ?sourceId WHERE {
     ?sourceId :hasX ?object.
}
ORDER BY RAND()
LIMIT 100

It's returning out-of-memory errors inconsistently, and it seems that the smaller the limit, the more likely I am to get an out-of-memory error (???). I'm running this on AWS Neptune.

5 comments

    • Andrea R.

      Victor Mariano Leite The problem is that your query first has to fetch all the results and only then order them, so it can easily cause an OOM. One thing to try is to trim the query somehow. Can you add other constraints (as triples in the query or using FILTER)? Also, this StackOverflow question might be useful: https://stackoverflow.com/questions/29103024/sparql-randomly-select-one-connection-for-each-node

    • Victor Mariano Leite

      Andrea R. Hmm, I'm thinking about how I could do that, because I wanted to do a "kind of" stratified sampling. There are 3 fields I want to get, for example: fileId, pageId, pagemetadataId. I want to sample X fileIds and get all of their pages and page metadata. I don't know if there is a way to filter on fileId if I want to sample it beforehand. I was going to sample it first in a subquery, then use the returned fileIds as a constraint for the outer query. Does that make sense? Hahaha
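
The subquery idea described above could look roughly like the sketch below. The namespace and the predicates :hasPage and :hasMetadata are hypothetical placeholders, since the thread does not give the real schema. Note that the inner query still has to collect and sort every distinct fileId, so on its own it may not remove the memory pressure.

```sparql
PREFIX : <http://example.org/>   # hypothetical namespace

SELECT ?fileId ?pageId ?pageMetadataId WHERE {
  {
    # Inner query: pick 100 files at random.
    SELECT DISTINCT ?fileId WHERE {
      ?fileId :hasPage ?anyPage .        # hypothetical predicate
    }
    ORDER BY RAND()
    LIMIT 100
  }
  # Outer query: fetch every page and its metadata for the sampled files only.
  ?fileId :hasPage ?pageId .
  ?pageId :hasMetadata ?pageMetadataId . # hypothetical predicate
}
```

SPARQL evaluates the subquery first, so the outer patterns are joined only against the sampled fileIds.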

    • Andrea R.

      A "dumb" approach is to filter by the fileId's first character (I don't know if it's a string or a number) or by some other characteristic. Then the subquery approach would be more doable.
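
As a rough illustration of the first-character filter, the sketch below narrows the candidates before they are randomly ordered. It assumes the IDs are IRIs of the form http://example.org/file/<id>, which is a made-up layout; the real pattern would need to be adapted.

```sparql
PREFIX : <http://example.org/>   # hypothetical namespace

SELECT DISTINCT ?sourceId WHERE {
  ?sourceId :hasX ?object .
  # Assumption: keep only IDs whose local part starts with "7".
  FILTER(STRSTARTS(STR(?sourceId), "http://example.org/file/7"))
}
ORDER BY RAND()
LIMIT 100
```

Only the rows that pass the FILTER have to be held and sorted, which is what should reduce the memory needed by ORDER BY RAND().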

    • Victor Mariano Leite

      Andrea R. Thanks a lot for the tip! I was using a numeric ID, but you gave me an idea: since the fileIds are generated sequentially, I think keeping only the ones where fileId % 10 == 0 is approximately a "random" sample in my case, since (I hope, hahaha) there is no bias in being the 10th item. It helped a lot 🙂
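
One possible way to write the fileId % 10 == 0 filter is sketched below. SPARQL 1.1 does not define a modulo operator, so the remainder is emulated with FLOOR. The sketch assumes each file carries its numeric ID as an integer literal on a hypothetical :fileNumber predicate; :hasPage and :hasMetadata are likewise placeholders, and the exact query used in the thread is not shown.

```sparql
PREFIX : <http://example.org/>   # hypothetical namespace

SELECT ?fileId ?pageId ?pageMetadataId WHERE {
  ?fileId :fileNumber ?n .               # hypothetical integer ID literal
  # Emulated "?n mod 10 = 0": n - 10 * floor(n / 10) is the remainder.
  FILTER(?n - (10 * FLOOR(?n / 10)) = 0)
  ?fileId :hasPage ?pageId .             # hypothetical predicate
  ?pageId :hasMetadata ?pageMetadataId . # hypothetical predicate
}
```

For non-negative IDs, FILTER(STRENDS(STR(?n), "0")) is an equivalent string-based test. Either form avoids ORDER BY RAND() entirely, so nothing has to be sorted, at the cost of a deterministic rather than truly random sample.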

    • Andrea R.

      You are welcome. Tuning a query can be painful. 🙂