Packages

object Asn1Indexer

Builds and reads a sidecar record-offset index for BER/DER files, enabling Spark to split large files across multiple tasks.

Index file format (<original>.asn1idx):

  • 8-byte header: 7-byte magic "ASN1IDX" + 1-byte version (0x01)
  • 8 bytes per record: big-endian Long byte offset of the record's first tag byte

Only definite-length BER (and all DER) can be indexed. Indefinite-length constructions stop the scan early and produce a partial index.

Reading efficiency

readIndexSlice uses HDFS positioned reads (pread) to binary-search the index file for the split boundaries, then reads only the matching slice sequentially. For a 100 M-record index (~800 MB), this costs ~27 pread round-trips per task plus a small sequential read — the full index is never loaded into memory.

Source
Asn1Indexer.scala
Linear Supertypes
AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. Asn1Indexer
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. Protected

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. val INDEX_SUFFIX: String
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def buildIndex(filePath: Path, conf: Configuration): (Path, Long)

    Scan filePath sequentially and write a sidecar index next to it.

    Scan filePath sequentially and write a sidecar index next to it. Returns (indexPath, recordCount).

    Safe to call on a file that already has an index — the existing index is overwritten. Use buildIndexes to skip already-indexed files.

  7. def buildIndexes(spark: SparkSession, glob: String, overwrite: Boolean = false): DataFrame

    Index all BER/DER files matched by glob in parallel, one Spark task per file.

    Index all BER/DER files matched by glob in parallel, one Spark task per file. Tasks run on executors co-located with the HDFS data blocks, so the sequential scan is local I/O on each node.

    spark

    active SparkSession

    glob

    HDFS glob or directory path, e.g. "/data/cdrs/file.ber" or a glob

    overwrite

    if false (default), files that already have a .asn1idx sidecar are skipped

    returns

    a DataFrame with one row per file: (path: String, records: Long, skipped: Boolean)

  8. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @native()
  9. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  10. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  11. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. def readIndexSlice(indexPath: Path, conf: Configuration, fromByte: Long, toExclByte: Long): Array[Long]

    Return the byte offsets of all records whose start falls within [fromByte, toExclByte).

    Return the byte offsets of all records whose start falls within [fromByte, toExclByte).

    Uses two binary searches (O(log N) positioned reads) on the index file, then one sequential read of only the matching slice.

  18. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  19. def toString(): String
    Definition Classes
    AnyRef → Any
  20. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  21. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  22. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)

Inherited from AnyRef

Inherited from Any

Ungrouped