How To Split A Text File Into Multiple Columns With Spark
I'm having difficulty splitting a text data file with the delimiter '|' into DataFrame columns. My loaded data ends up in a single string column holding each whole line:
results1.show()
+--------------------+
|                 all|
+--------------------+
Solution 1:
Using the RDD API: your mistake is that String.split expects a regular expression, where the pipe ("|") is a special character meaning "OR", so it matches the empty string and splits the line into individual characters. Also, you should start from index 0 when converting the array into a tuple. The fix is simple: escape that character:
sc.textFile("D:/data/dnr10.txt")
  .map(_.split("\\|"))                 // "\\|" matches a literal pipe, not the regex "OR"
  .map(c => (c(0), c(1), c(2), c(3)))  // array indices start at 0
  .toDF()
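To see why the escaping matters, here is a minimal sketch you can run in the Spark shell or any Scala REPL (the sample line is illustrative):
val line = "29|102|354814|SKO"

// Unescaped: "|" is regex alternation between two empty patterns, so it matches
// the empty string at every position and splits out single characters.
line.split("|").take(5)   // Array("2", "9", "|", "1", "0")

// Escaped: the pipe is matched literally.
line.split("\\|")         // Array("29", "102", "354814", "SKO")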
Using the DataFrame API: the same pipe-escaping issue applies here. You can also simplify the code by splitting the line once into an array column and reusing that column when selecting:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Split once into an array column, then pick the items you need by index.
results1.withColumn("split", split($"all", "\\|")).select(
  $"split".getItem(0).cast(IntegerType).as("DEPT_NO"),
  $"split".getItem(3).cast(IntegerType).as("ART_GRP_NO"),
  $"split".getItem(7).as("ART_NO")
)
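If the file has many fields, you can build the select list programmatically instead of spelling out each getItem. A short sketch, where the (name, index) pairs are hypothetical and should be matched to your actual schema:
// Hypothetical (column name, field index) pairs; adjust to your data.
val fields = Seq(("DEPT_NO", 0), ("ART_GRP_NO", 3), ("ART_NO", 7))

val columns = fields.map { case (name, idx) =>
  $"split".getItem(idx).as(name)
}

results1
  .withColumn("split", split($"all", "\\|"))
  .select(columns: _*)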
Using Spark 2.0's built-in CSV support: if you're using Spark 2.0+, you can let the framework do all the hard work for you: use the "csv" format and set the delimiter to the pipe character:
val result = sqlContext.read
  .option("header", "true")      // first line of the file holds the column names
  .option("delimiter", "|")      // pipe-delimited rather than comma-delimited
  .option("inferSchema", "true") // detect integer vs. string columns automatically
  .format("csv")
  .load("D:/data/dnr10.txt")
result.show()
// +-------+----------+------+---+
// |DEPT_NO|ART_GRP_NO|ART_NO| TT|
// +-------+----------+------+---+
// | 29| 102|354814|SKO|
// | 29| 102|342677|SKO|
// | 29| 102|334634|DUR|
// | 29| 102|276728|I-P|
// +-------+----------+------+---+
result.printSchema()
// root
// |-- DEPT_NO: integer (nullable = true)
// |-- ART_GRP_NO: integer (nullable = true)
// |-- ART_NO: integer (nullable = true)
// |-- TT: string (nullable = true)
You'll get the column names, the right types - everything... :)
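One note: on Spark 2.0+ the usual entry point is SparkSession rather than SQLContext, and the reader has a csv shortcut. A minimal sketch, where the app name and local master are placeholder choices:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("split-pipe-delimited") // placeholder app name
  .master("local[*]")              // local mode for a quick test; omit on a real cluster
  .getOrCreate()

val result = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("D:/data/dnr10.txt")        // .csv(path) is shorthand for format("csv").load(path)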