python - How can we define filters/properties for an Elasticsearch index using Spark with Scala?


I used the following function in Python to initialize an index in Elasticsearch.

    def init_index():
        # Create the index with custom settings, token filters and analyzer.
        constants.es_client.indices.create(
            index=constants.index_name,
            body={
                "settings": {
                    "index": {
                        "type": "default"
                    },
                    "number_of_shards": 1,
                    "number_of_replicas": 1,
                    "analysis": {
                        "filter": {
                            "ap_stop": {
                                "type": "stop",
                                "stopwords_path": "stoplist.txt"
                            },
                            "shingle_filter": {
                                "type": "shingle",
                                "min_shingle_size": 2,
                                "max_shingle_size": 5,
                                "output_unigrams": True
                            }
                        },
                        "analyzer": {
                            constants.analyzer_name: {
                                "type": "custom",
                                "tokenizer": "standard",
                                "filter": ["standard",
                                           "ap_stop",
                                           "lowercase",
                                           "shingle_filter",
                                           "snowball"]
                            }
                        }
                    }
                }
            }
        )

        # Map the "text" field so it uses the custom analyzer.
        new_mapping = {
            constants.type_name: {
                "properties": {
                    "text": {
                        "type": "string",
                        "store": True,
                        "index": "analyzed",
                        "term_vector": "with_positions_offsets_payloads",
                        "search_analyzer": constants.analyzer_name,
                        "index_analyzer": constants.analyzer_name
                    }
                }
            }
        }

        constants.es_client.indices.put_mapping(
            index=constants.index_name,
            doc_type=constants.type_name,
            body=new_mapping
        )

Using this function, I was able to create an index with user-defined specs.

I recently started working with Scala and Spark. To integrate with Elasticsearch, I can either use the Spark API, i.e. org.elasticsearch.spark, or the Hadoop one, org.elasticsearch.hadoop. Most of the examples I see are related to the Hadoop methodology, but I don't wish to use Hadoop here. I went through the Spark-Elasticsearch documentation and was able to at least index documents without including Hadoop, but I noticed that everything was created with defaults, and I can't even specify the _id there. It generates the _id on its own.

In Scala I use the following code for indexing (not the complete code):

    val document = mutable.Map[String, String]()
    document("id") = docid
    document("text") = textchunk.mkString(" ") // textchunk is a list of strings
    sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document")

This created the index in the following way:

    {
       "es_park_ap": {
          "mappings": {
             "document": {
                "properties": {
                   "id": {
                      "type": "string"
                   },
                   "text": {
                      "type": "string"
                   }
                }
             }
          },
          "settings": {
             "index": {
                "creation_date": "1433006647684",
                "uuid": "qnxctamgqgkx7rp-h8fvig",
                "number_of_replicas": "1",
                "number_of_shards": "5",
                "version": {
                   "created": "1040299"
                }
             }
          }
       }
    }

So if I pass a document to it, the following document is created:

    {
       "_index": "es_park_ap",
       "_type": "document",
       "_id": "au2l2ixcaorl_gagnja5",
       "_score": 1,
       "_source": {
          "text": "some large text",
          "id": "12345"
       }
    }

Just like in Python, how can I use Spark and Scala to create an index with user-defined specifications?

I think we should divide your question into several smaller issues.

If you want to create an index with a specific mapping / settings, you should use the Elasticsearch Java API directly (you can call it from Scala code, of course). You can use the following sources for examples of index creation:

Creating index and adding mapping in Elasticsearch Java API gives missing analyzer errors

Define custom Elasticsearch analyzer using Java API

The Elasticsearch Hadoop / Spark plugin is used to transport data from HDFS to ES. ES maintenance, such as creating indices with custom settings and mappings, should be done separately.
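Because index maintenance is separate from the Spark job, one option is to create the index over the REST API once, before calling saveToEs, so that the connector writes into an index that already has the desired analyzer and mapping. Below is a minimal sketch of that idea using only the Python standard library. The function name build_create_index_request, the host http://localhost:9200 and the analyzer name my_analyzer are assumptions for illustration, the settings are a shortened version of those in the question, and the mapping syntax ("type": "string") assumes the same Elasticsearch 1.x generation the question uses.

```python
import json
import urllib.request


def build_create_index_request(host, index_name, type_name, analyzer_name):
    """Build a PUT request that creates an index with settings and a mapping.

    Hypothetical helper: settings and mappings are sent together in one
    create-index body instead of a separate put_mapping call.
    """
    body = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "analysis": {
                "filter": {
                    "shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 5,
                        "output_unigrams": True,
                    }
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "shingle_filter"],
                    }
                },
            },
        },
        "mappings": {
            type_name: {
                "properties": {
                    # Assumes ES 1.x style string fields, as in the question.
                    "text": {"type": "string", "analyzer": analyzer_name}
                }
            }
        },
    }
    return urllib.request.Request(
        url="{}/{}".format(host.rstrip("/"), index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Host and analyzer name are assumed values for this sketch.
request = build_create_index_request(
    "http://localhost:9200", "es_park_ap", "document", "my_analyzer"
)
# Sending it requires a running cluster, so it is left commented out:
# urllib.request.urlopen(request)
```

The same JSON body could be sent from Scala with any HTTP client or through the Java API mentioned above; the point is only that the index exists with its settings before the Spark job writes to it.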

The reason you are still seeing an automatically generated ID is that you must tell the plugin which field holds your ID, using the following syntax:

    EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "your_id_field"))

Or in your case:

    sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document", Map("es.mapping.id" -> "your_id_field"))

You can find more details about the syntax and its proper use here:

https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
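To make the effect of es.mapping.id concrete: with that setting, the value of the named field is used as the document _id instead of an ID generated by Elasticsearch. The sketch below mimics this in plain Python by building bulk-API style lines. It is an illustration only, not the connector's actual code, and to_bulk_lines is a hypothetical helper.

```python
import json


def to_bulk_lines(documents, id_field):
    """Build bulk-API lines, taking each document's _id from id_field."""
    lines = []
    for doc in documents:
        # Action metadata: the chosen field's value becomes the document _id.
        lines.append(json.dumps({"index": {"_id": doc[id_field]}}))
        # The document source; in this sketch the id field stays in it.
        lines.append(json.dumps(doc, sort_keys=True))
    return lines


# The example document from the question, keyed by its "id" field.
lines = to_bulk_lines([{"id": "12345", "text": "some large text"}], "id")
```

With the question's document, the resulting _id would be 12345 rather than a generated value such as the one shown in the question.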

michael

