Neo4j and Java: fast random sample of an Iterable<Relationship>


I've coded a traversal in Java that returns an Iterable. In the worst case, the Iterable has a size of 850784 relationships.

Objective: I want to sample (without replacement) 20 relationships, and I want it to be fast.

Solution 1: Performing a toList() or casting it into any sort of Collection takes too much time (> 1 minute). I know I could then have taken advantage of the shuffle() function, etc., but this is unacceptable.

Solution 2: Thus, in order to work directly on the Iterable, I've used the Guava collect library. I have included the time in milliseconds (calculated with System.nanoTime() and dividing by 1000000) for each of the 3 steps below. I need to take the size of the Iterable for the random number generator, and that is the real bottleneck.

    /* Traversal: 5 ms */
    Iterable<Relationship> simRels = traversal1.traverse(user).relationships();

    /* Iterable size: 74669 ms */
    int simRelsSize = com.google.common.collect.Iterables.size(simRels);

    /* Random sample of 20: 28321 ms */
    long seed = System.nanoTime();
    int[] idxs = new int[20];
    Random randomGenerator = new XSRandom(seed);
    for (int i = 0; i < idxs.length; ++i) {
        int randomInt = randomGenerator.nextInt(simRelsSize);
        idxs[i] = randomInt;
    }
    Arrays.sort(idxs);

    List<Relationship> simRelsList2 = new ArrayList<Relationship>();
    for (int i = 0; i < idxs.length; ++i) {
        if (i > 0) {
            int pos = idxs[i] - idxs[i - 1];
            simRelsList2.add(com.google.common.collect.Iterables.get(simRels, pos));
        } else {
            simRelsList2.add(com.google.common.collect.Iterables.get(simRels, idxs[i]));
        }
    }

How can I optimize this code to go faster?

Note: I have a Windows 8.1 PC, i5 2.30 GHz, 16 GB RAM, 1 TB HDD.

As per Michal's request, please find the file contents below:

neo4j-wrapper

    #********************************************************************
    # Property file references
    #********************************************************************

    wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
    wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
    wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties

    #********************************************************************
    # JVM Parameters
    #********************************************************************

    wrapper.java.additional=-XX:+UseConcMarkSweepGC
    wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
    wrapper.java.additional=-XX:-OmitStackTraceInFastThrow

    # Remote JMX monitoring, uncomment and adjust the following lines as needed.
    # Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
    # the shipped configuration contains only a read only role called 'monitor' with password 'neo4j'.
    # For more details, see: http://download.oracle.com/javase/7/docs/technotes/guides/management/agent.html
    # On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
    # and have permissions set to 0600.
    # For details on setting these file permissions on Windows see:
    #     http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access

    # Some systems cannot discover the host name automatically, and need this line configured:
    #wrapper.java.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME

    # Uncomment the following lines to enable garbage collection logging
    #wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
    #wrapper.java.additional=-XX:+PrintGCDetails
    #wrapper.java.additional=-XX:+PrintGCDateStamps
    #wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
    #wrapper.java.additional=-XX:+PrintPromotionFailure
    #wrapper.java.additional=-XX:+PrintTenuringDistribution

    # Java Heap Size: by default the Java heap size is dynamically
    # calculated based on available system resources.
    # Uncomment these lines to set specific initial and maximum
    # heap size in MB.
    wrapper.java.initmemory=8192
    wrapper.java.maxmemory=10240

    #********************************************************************
    # Wrapper settings
    #********************************************************************
    # path is relative to the bin dir
    wrapper.pidfile=../data/neo4j-server.pid

    #********************************************************************
    # Wrapper Windows NT/2000/XP Service Properties
    #********************************************************************
    # WARNING - Do not modify any of these properties when an application
    #  using this configuration file has been installed as a service.
    #  Please uninstall the service before modifying this section.  The
    #  service can then be reinstalled.

    # Name of the service
    wrapper.name=neo4j

    # User account to be used for linux installs. Will default to current
    # user if not set.
    wrapper.user=

neo4j.properties

    # Enable this to be able to upgrade a store from an older version.
    #allow_store_upgrade=true

    # The amount of memory to use for mapping the store files, either in bytes or
    # as a percentage of available memory. This will be clipped at the amount of
    # free memory observed when the database starts, and automatically be rounded
    # down to the nearest whole page. For example, if "500MB" is configured, but
    # only 450MB of memory is free when the database starts, then the database will
    # map at most 450MB. If "50%" is configured, and the system has a capacity of
    # 4GB, then at most 2GB of memory will be mapped, unless the database observes
    # that less than 2GB of memory is free when it starts.
    #mapped_memory_total_size=50%

    # Enable this to specify a parser other than the default one.
    #cypher_parser_version=2.0

    # Keep logical logs, helps debugging but uses more disk space, enabled for
    # legacy reasons. To limit space needed to store historical logs use values such
    # as: "7 days" or "100M size" instead of "true".
    #keep_logical_logs=7 days

    # Autoindexing

    # Enable auto-indexing for nodes, default is false.
    #node_auto_indexing=true

    # The node property keys to be auto-indexed, if enabled.
    #node_keys_indexable=name,age

    # Enable auto-indexing for relationships, default is false.
    #relationship_auto_indexing=true

    # The relationship property keys to be auto-indexed, if enabled.
    #relationship_keys_indexable=name,age

    # Enable shell server so that remote clients can connect via Neo4j shell.
    #remote_shell_enabled=true
    # The network interface IP the shell will listen on (use 0.0.0 for all interfaces).
    #remote_shell_host=127.0.0.1
    # The port the shell will listen on, default is 1337.
    #remote_shell_port=1337

    # The type of cache to use for nodes and relationships.
    #cache_type=hpc

    # Maximum size of the heap memory to dedicate to the cached nodes.
    #node_cache_size=

    # Maximum size of the heap memory to dedicate to the cached relationships.
    #relationship_cache_size=

    # Enable online backups to be taken from this database.
    online_backup_enabled=true

    # Port to listen to for incoming backup requests.
    online_backup_server=127.0.0.1:6362


    # Uncomment and specify these lines for running Neo4j in High Availability mode.
    # See the High availability setup tutorial for more details on these settings
    # http://neo4j.com/docs/2.2.0-m02/ha-setup-tutorial.html

    # ha.server_id is the number of each instance in the HA cluster. It should be
    # an integer (e.g. 1), and should be unique for each cluster instance.
    #ha.server_id=

    # ha.initial_hosts is a comma-separated list (without spaces) of the host:port
    # where the ha.cluster_server of all instances will be listening. Typically
    # this will be the same for all cluster instances.
    #ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001

    # IP and port for this instance to listen on, for communicating cluster status
    # information with other instances (also see ha.initial_hosts). The IP
    # must be the configured IP address for one of the local interfaces.
    #ha.cluster_server=192.168.0.1:5001

    # IP and port for this instance to listen on, for communicating transaction
    # data with other instances (also see ha.initial_hosts). The IP
    # must be the configured IP address for one of the local interfaces.
    #ha.server=192.168.0.1:6001

    # The interval at which slaves will pull updates from the master. Comment out
    # the option to disable periodic pulling of updates. Unit is seconds.
    ha.pull_interval=10

    # Amount of slaves the master will try to push a transaction to upon commit
    # (default is 1). The master will optimistically continue and not fail the
    # transaction even if it fails to reach the push factor. Setting this to 0 will
    # increase write performance when writing through master but could potentially
    # lead to branched data (or loss of transaction) if the master goes down.
    #ha.tx_push_factor=1

    # Strategy the master will use when pushing data to slaves (if the push factor
    # is greater than 0). There are two options available "fixed" (default) or
    # "round_robin". Fixed will start by pushing to slaves ordered by server id
    # (highest first) improving performance since the slaves only have to cache up
    # one transaction at a time.
    #ha.tx_push_strategy=fixed

    # Policy for how to handle branched data.
    #branched_data_policy=keep_all

    # Clustering timeouts
    # Default timeout.
    #ha.default_timeout=5s

    # How often heartbeat messages should be sent. Defaults to ha.default_timeout.
    #ha.heartbeat_interval=5s

    # Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
    #heartbeat_timeout=11s

The reason Neo4j returns an Iterable is that it executes the traversal lazily, whilst you're iterating. So in order to sample, I'm afraid you have to "visit" every relationship. Yes, you can skip some, but you still have to iterate through all of them at the end of the day.
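To make that cost concrete, here is a small sketch in plain Java (no Neo4j or Guava involved). `countingRange` is a hypothetical stand-in for a lazy traversal, and `elementAt` mimics positional access on something that is only an Iterable. As far as I know, Guava's `Iterables.size` and `Iterables.get` have to walk a non-collection Iterable in the same way, so every call starts a fresh pass from the beginning.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicLong;

class LazyIterableCost {

    /** Hypothetical lazy source: yields 0..size-1 and counts every element it produces. */
    static Iterable<Integer> countingRange(int size, AtomicLong visited) {
        return () -> new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public Integer next() {
                if (next >= size) {
                    throw new NoSuchElementException();
                }
                visited.incrementAndGet();
                return next++;
            }
        };
    }

    /** Positional access on a plain Iterable: a new iterator, walked from the start. */
    static <T> T elementAt(Iterable<T> source, int position) {
        int i = 0;
        for (T item : source) {
            if (i++ == position) {
                return item;
            }
        }
        throw new IndexOutOfBoundsException("position " + position);
    }

    /** Returns how many elements were produced in total to serve all positional reads. */
    static long costOfPositionalAccess(int size, int[] positions) {
        AtomicLong visited = new AtomicLong();
        Iterable<Integer> source = countingRange(size, visited);
        for (int position : positions) {
            elementAt(source, position);
        }
        return visited.get();
    }
}
```

Reading 20 positions this way can cost up to 20 full passes on top of the pass needed to count the elements, which is consistent with the two slow steps in the question.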

We're using the "reservoir sampling" algorithm for this, implemented here. I'm not sure it's going to perform any better though, for the reason described above. That said, you should be able to sample 1M relationships in less than 1 second with a warm cache. If it's taking longer than that, you may need to tweak your memory settings a bit.
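For illustration, a minimal sketch of reservoir sampling (the classic "Algorithm R") over any Iterable. This is not the implementation linked above; the class name `ReservoirSampler` and the method `sample` are made up for this example. It needs a single pass, does not need the size up front, and returns k distinct elements (sampling without replacement).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSampler {

    /**
     * Returns up to k elements chosen uniformly at random, without replacement,
     * in one pass over the source. If the source holds fewer than k elements,
     * all of them are returned.
     * Assumption: the source has fewer than Integer.MAX_VALUE elements.
     */
    static <T> List<T> sample(Iterable<T> source, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k elements go straight into the reservoir.
                reservoir.add(item);
            } else {
                // Replacement phase: keep the new item with probability k / seen
                // by drawing a slot in [0, seen) and replacing only if it is < k.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

With the code from the question, the call would look something like `ReservoirSampler.sample(traversal1.traverse(user).relationships(), 20, new Random())`, which replaces both the size step and the 20 positional reads with one pass. It still visits every relationship, so the traversal speed itself remains the lower bound.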

