Append to Parquet


Append to Parquet

VinShar
Hi,

Is there any way to use PutParquet to append to an existing Parquet file? I
know that I can create a Kite Dataset and write Parquet data to it, but I am
looking for an alternative to Spark's DataFrame.write.parquet(destination,
mode="overwrite").

Regards,
Vinay



--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: Append to Parquet

Bryan Bende
Hello,

As far as I know, there is no option in Parquet to append, due to
the way its internal format works.

The ParquetFileWriter has a mode which only has CREATE and OVERWRITE:

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L105-L107

-Bryan
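
One way to see why the format resists in-place appends: a Parquet file ends with its metadata footer, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. Adding row groups to an existing file would therefore mean rewriting that tail. The sketch below is only an illustration of the layout, not the parquet-mr code; `footer_span` is a hypothetical helper and the sample bytes are synthetic, not a real Parquet file.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_span(data: bytes):
    """Return (footer_start, footer_length) for a Parquet-style byte layout.

    Hypothetical helper for illustration; it only inspects the trailing
    length field and magic bytes and does not decode the footer itself.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet-style layout")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic layout: magic, fake row-group bytes, fake footer, length, magic.
sample = MAGIC + b"rowgroups!" + b"META!" + struct.pack("<I", 5) + MAGIC
start, length = footer_span(sample)
footer_bytes = sample[start:start + length]
```

Because the footer sits after the row groups, new data cannot simply be written past the end of the file without also replacing the footer that describes it.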


On Thu, Nov 30, 2017 at 5:12 PM, VinShar <[hidden email]> wrote:

> Hi,
>
> Is there any way to use PutParquet to append to an existing Parquet file? I
> know that I can create a Kite Dataset and write Parquet data to it, but I am
> looking for an alternative to Spark's DataFrame.write.parquet(destination,
> mode="overwrite").
>
> Regards,
> Vinay

Re: Append to Parquet

VinShar
Yes, this was my understanding too, but then I found that Spark's DataFrame
does have a method that appends to Parquet (df.write.parquet(destName,
mode="append")). Below is an article that sheds some light on this. I was
wondering if there is a way to achieve the same through NiFi.

http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

I have a workaround in mind: save the data I want to append to Parquet in a
file (say, in Avro format), then run a script through ExecuteProcess that
launches a Spark job to read the Avro file, append it to the existing Parquet
file, and delete the Avro file. I am looking for a simpler way than this.




Re: Append to Parquet

Bryan Bende
Thanks for the link. Currently we don't have a way to do something
like that, but if we could figure out how that data frame append code
works behind the scenes, then we could potentially offer something
similar.

On Thu, Nov 30, 2017 at 9:44 PM, VinShar <[hidden email]> wrote:

> Yes, this was my understanding too, but then I found that Spark's DataFrame
> does have a method that appends to Parquet (df.write.parquet(destName,
> mode="append")). Below is an article that sheds some light on this. I was
> wondering if there is a way to achieve the same through NiFi.
>
> http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
>
> I have a workaround in mind: save the data I want to append to Parquet in a
> file (say, in Avro format), then run a script through ExecuteProcess that
> launches a Spark job to read the Avro file, append it to the existing Parquet
> file, and delete the Avro file. I am looking for a simpler way than this.

Re: Append to Parquet

Giovanni Lanzani
On 1 Dec 2017, at 3:44, VinShar wrote:

> Yes, this was my understanding too, but then I found that Spark's DataFrame
> does have a method that appends to Parquet (df.write.parquet(destName,
> mode="append")). Below is an article that sheds some light on this. I was
> wondering if there is a way to achieve the same through NiFi.
>
> http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html

You should not believe all that bloggers write :)

In the blog they are writing to the `permit-inspections.parquet`
**folder**. It’s not a parquet file.

The parquet files are contained in the folder. The append mode you are
referring to simply writes new parquet files in the folder, without
touching the existing ones.

If they had used the `overwrite` option, the existing folder
would have been emptied first.

Cheers,

Giovanni
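
The folder-versus-file distinction Giovanni describes can be modelled in a few lines. This is a minimal sketch using only the Python standard library, not Spark's actual implementation: `write_dataset` is a hypothetical helper, the part-file naming is an assumption, and the payloads are placeholder bytes rather than real Parquet data.

```python
import os
import shutil
import tempfile
import uuid


def write_dataset(folder, payload, mode="append"):
    """Model a folder-based dataset write (hypothetical, not Spark's API)."""
    if mode not in ("append", "overwrite"):
        raise ValueError(f"unsupported mode: {mode}")
    if mode == "overwrite" and os.path.isdir(folder):
        # overwrite: the existing folder is emptied before writing
        shutil.rmtree(folder)
    os.makedirs(folder, exist_ok=True)
    # append: existing part files are never opened; a new file is added
    part = os.path.join(folder, f"part-{uuid.uuid4().hex}.parquet")
    with open(part, "wb") as fh:
        fh.write(payload)
    return part


def demo():
    root = tempfile.mkdtemp()
    folder = os.path.join(root, "permit-inspections.parquet")
    first = write_dataset(folder, b"batch 1")
    write_dataset(folder, b"batch 2", mode="append")
    files_after_append = len(os.listdir(folder))
    with open(first, "rb") as fh:
        first_unchanged = fh.read() == b"batch 1"
    write_dataset(folder, b"batch 3", mode="overwrite")
    files_after_overwrite = len(os.listdir(folder))
    shutil.rmtree(root)  # clean up the temporary directory
    return files_after_append, first_unchanged, files_after_overwrite
```

Calling `demo()` shows that an append leaves the first part file byte-for-byte intact and adds a second file next to it, while an overwrite leaves only the newly written file in the folder.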