DataFrame

Description

Batteries-included entry point for the DataFrame library.

This module re-exports the most commonly used pieces of the dataframe library so you can get productive fast in GHCi, IHaskell, or scripts.

Naming convention

  • Use the D. ("DataFrame") prefix for core table operations.
  • Use the F. ("Functions") prefix for the expression DSL (columns, math, aggregations).

Example session:

We provide a script that imports the core functionality and defines helpful macros for writing safe code.

$ cabal update
$ cabal install dataframe
$ dataframe
Configuring library for fake-package-0...
Warning: No exposed modules
GHCi, version 9.6.7: https://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /tmp/cabal-repl.-242816/setcwd.ghci
========================================
              📦Dataframe
========================================

✨  Modules were automatically imported.

💡  Use prefix D for core functionality.
        ● E.g. D.readCsv "/path/to/file"
💡  Use prefix F for expression functions.
        ● E.g. F.sum (F.col @Int "value")

✅ Ready.
Loaded GHCi configuration from ./dataframe.ghci
ghci>

Quick start

Load a CSV, select a few columns, filter, derive a column, then group + aggregate:

-- 1) Load data
ghci> df0 <- D.readCsv "data/housing.csv"
ghci> D.describeColumns df0
-------------------------------------------------------------------------------------------------------------
    Column Name     | # Non-null Values | # Null Values | # Partially parsed | # Unique Values |     Type
--------------------|-------------------|---------------|--------------------|-----------------|-------------
        Text        |        Int        |      Int      |        Int         |       Int       |     Text
--------------------|-------------------|---------------|--------------------|-----------------|-------------
 ocean_proximity    | 20640             | 0             | 0                  | 5               | Text
 median_house_value | 20640             | 0             | 0                  | 3842            | Double
 median_income      | 20640             | 0             | 0                  | 12928           | Double
 households         | 20640             | 0             | 0                  | 1815            | Double
 population         | 20640             | 0             | 0                  | 3888            | Double
 total_bedrooms     | 20640             | 0             | 0                  | 1924            | Maybe Double
 total_rooms        | 20640             | 0             | 0                  | 5926            | Double
 housing_median_age | 20640             | 0             | 0                  | 52              | Double
 latitude           | 20640             | 0             | 0                  | 862             | Double
 longitude          | 20640             | 0             | 0                  | 844             | Double

-- 2) Project & filter
ghci> :declareColumns df0
ghci> df1 = D.filterWhere (ocean_proximity .== "ISLAND") df0 D.|> D.select [F.name median_house_value, F.name median_income, F.name ocean_proximity]

-- 3) Add a derived column using the expression DSL
--    (total_rooms and households are the typed expressions created by :declareColumns)
ghci> df2 = D.derive "rooms_per_household" (total_rooms / households) df0

-- 4) Group + aggregate
ghci> import DataFrame.Operators
ghci> let grouped   = D.groupBy ["ocean_proximity"] df0
ghci> let summary   =
         D.aggregate
             [ F.maximum median_house_value `as` "max_house_value"]
             grouped
ghci> D.take 5 summary
----------------------------------
 ocean_proximity | max_house_value
-----------------|----------------
      Text       |     Double
-----------------|----------------
 <1H OCEAN       | 500001.0
 INLAND          | 500001.0
 ISLAND          | 450000.0
 NEAR BAY        | 500001.0
 NEAR OCEAN      | 500001.0

Simple operations (cheat sheet)

Most users only need a handful of verbs:

I/O

  • D.readCsv :: FilePath -> IO DataFrame
  • D.readTsv :: FilePath -> IO DataFrame
  • D.writeCsv :: FilePath -> DataFrame -> IO ()
  • D.readParquet :: FilePath -> IO DataFrame
  • D.readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame
  • D.readParquetFiles :: FilePath -> IO DataFrame
  • D.readParquetFilesWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame

Exploration

  • D.take :: Int -> DataFrame -> DataFrame
  • D.takeLast :: Int -> DataFrame -> DataFrame
  • D.describeColumns :: DataFrame -> DataFrame
  • D.summarize :: DataFrame -> DataFrame

Row ops

  • D.filter :: Expr a -> (a -> Bool) -> DataFrame -> DataFrame
  • D.filterWhere :: Expr Bool -> DataFrame -> DataFrame
  • D.sortBy :: SortOrder -> [Text] -> DataFrame -> DataFrame

Column ops

  • D.select :: [Text] -> DataFrame -> DataFrame
  • D.exclude :: [Text] -> DataFrame -> DataFrame
  • D.rename :: Text -> Text -> DataFrame -> DataFrame
  • D.renameMany :: [(Text, Text)] -> DataFrame -> DataFrame
  • D.derive :: Text -> D.Expr a -> DataFrame -> DataFrame

Group & aggregate

  • D.groupBy :: [Text] -> DataFrame -> GroupedDataFrame
  • D.aggregate :: [NamedExpr] -> GroupedDataFrame -> DataFrame

Joins

  • D.innerJoin / D.leftJoin / D.rightJoin / D.fullOuterJoin

Expression DSL (F.*) at a glance

Columns (typed):

F.col @Text   "ocean_proximity"
F.col @Double "total_rooms"
F.lit @Double 1.0

Math & comparisons (overloaded by type):

(+), (-), (*), (/), abs, log, exp, round
(F.eq), (F.gt), (F.geq), (F.lt), (F.leq)
(.==), (.>), (.>=), (.<), (.<=)

Aggregations (for D.aggregate):

F.count @a (F.col @a "c")
F.sum   @Double (F.col @Double "x")
F.mean  @Double (F.col @Double "x")
F.minimum @t (F.col @t "x")
F.maximum @t (F.col @t "x")
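
To make the idea of a typed expression DSL concrete, here is a minimal sketch of how such a DSL can be modelled with a GADT. It is not the library's Expr type: the constructors Col, Lit, Add, Div and Gt, the Row alias, and the eval function are all hypothetical names chosen for illustration, and only Double columns are supported.

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical miniature of a typed expression DSL; not the library's Expr.
data Expr a where
  Col :: String -> Expr Double                      -- reference a numeric column by name
  Lit :: a -> Expr a                                -- constant value
  Add :: Expr Double -> Expr Double -> Expr Double
  Div :: Expr Double -> Expr Double -> Expr Double
  Gt  :: Expr Double -> Expr Double -> Expr Bool    -- comparison yields a Bool expression

-- A row is modelled as an association list from column name to value.
type Row = [(String, Double)]

-- Evaluate an expression against one row; Nothing if a column is missing.
eval :: Row -> Expr a -> Maybe a
eval row (Col name) = lookup name row
eval _   (Lit x)    = Just x
eval row (Add a b)  = (+) <$> eval row a <*> eval row b
eval row (Div a b)  = (/) <$> eval row a <*> eval row b
eval row (Gt a b)   = (>) <$> eval row a <*> eval row b

-- Analogue of the derived column from the quick start.
roomsPerHousehold :: Expr Double
roomsPerHousehold = Div (Col "total_rooms") (Col "households")
```

Because the result type is carried in the index of Expr, an expression such as Gt (Col "x") (Lit 5) can only be used where a Bool expression is expected, which is the property that lets a filter reject ill-typed predicates at compile time.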

REPL power-tool: ':declareColumns'

Use :declareColumns df in GHCi/IHaskell to turn each column of a bound DataFrame into a local binding with the same (mangled if needed) name, typed as an expression over the column's element type (for example Expr Int). This is handy for quick ad-hoc analysis, plotting, or hand-rolled checks.

-- Suppose df has columns: "passengers" :: Int, "fare" :: Double, "payment" :: Text
ghci> :set -XTemplateHaskell
ghci> :declareColumns df

-- Now you have in scope:
ghci> :type passengers
passengers :: Expr Int

ghci> :type fare
fare :: Expr Double

ghci> :type payment
payment :: Expr Text

-- You can use them directly:
ghci> D.derive "fare_with_tip" (fare * 1.2) df

Notes:

  • Name mangling: spaces and non-identifier characters are replaced (e.g. "trip id" -> trip_id).
  • Optional/nullable columns are exposed as Expr (Maybe a).
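
As a rough illustration of the name-mangling note, the sketch below replaces every character that is not a letter, digit, or underscore with an underscore. The function name mangle is hypothetical, and the library's actual rules (for example for leading digits or reserved words) may differ.

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical helper: replace non-identifier characters with '_'.
-- The real mangling rules used by :declareColumns may differ.
mangle :: String -> String
mangle = map (\c -> if isAlphaNum c || c == '_' then c else '_')
```

Under this assumption, mangle "trip id" gives "trip_id", matching the example in the note.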

Core data structures

null :: DataFrame -> Bool

Checks if the dataframe is empty (has no columns).

Returns True if the dataframe has no columns, False otherwise. Note that a dataframe with columns but no rows is not considered null.

data GroupedDataFrame

A record describing which rows of the dataframe are grouped together and how. It can only be used with aggregate.

fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column

O(n) Convert a list to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> fromList [(1 :: Int), 2, 3, 4]
[1,2,3,4]

toList :: Columnable a => Column -> [a]

O(n) Converts a column to a list. Throws an exception if the wrong type is specified.

Examples:

> column = fromList [(1 :: Int), 2, 3, 4]
> toList @Int column
[1,2,3,4]
> toList @Double column
exception: ...

data Column

Our representation of a column is a GADT that picks its storage based on the type of the underlying data.

This allows us to pattern match on data kinds and limit some operations to only some kinds of vectors. E.g. operations for missing data only happen in an OptionalColumn.
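
A simplified sketch of the idea follows. Only the name OptionalColumn comes from the text above; BoxedColumn is an assumed constructor name, lists stand in for vectors, and countMissing is a hypothetical operation used to show how pattern matching confines missing-data logic to the optional case.

```haskell
{-# LANGUAGE GADTs #-}

-- Simplified stand-in for a column type with one constructor per storage kind.
-- BoxedColumn is an assumed name; lists replace the real vector types.
data Column where
  BoxedColumn    :: [a] -> Column
  OptionalColumn :: [Maybe a] -> Column

-- Missing-data logic only has work to do for OptionalColumn.
countMissing :: Column -> Int
countMissing (OptionalColumn xs) = length [() | Nothing <- xs]
countMissing (BoxedColumn _)     = 0
```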

fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column

O(n) Convert an unboxed vector to a column. This avoids the extra conversion if you already have the data in an unboxed vector.

Examples:

> import qualified Data.Vector.Unboxed as VU
> fromUnboxedVector (VU.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column

O(n) Convert a vector to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> import qualified Data.Vector as VB
> fromVector (VB.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

toVector :: forall a v. (Vector v a, Columnable a) => Column -> Either DataFrameException (v a)

Converts a column to a vector of a specific type.

This is a type-safe conversion that requires the column's element type to exactly match the requested type. You must specify the desired type via type applications.

Type Parameters

  • a: The element type to convert to
  • v: The vector type (e.g., boxed VB.Vector or unboxed VU.Vector)

Examples

>>> toVector @Int @VU.Vector column
Right (unboxed vector of Ints)
>>> toVector @Text @VB.Vector column
Right (boxed vector of Text)

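Returns

Right with the converted vector when the column's element type matches the requested type; otherwise Left with a DataFrameException.

See also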

For numeric conversions with automatic type coercion, see toDoubleVector, toFloatVector, and toIntVector.
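
The exact-match behaviour described for toVector can be illustrated with a small sketch built on Data.Typeable. This is not the library's implementation: the Column wrapper, the toListChecked name, and the String error are stand-ins for the real Column type, toVector, and DataFrameException.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.Typeable (Typeable, cast)

-- Hypothetical column that hides its element type but remembers it at runtime.
data Column = forall a. Typeable a => Column [a]

-- Succeeds only when the requested element type is exactly the stored one.
toListChecked :: Typeable a => Column -> Either String [a]
toListChecked (Column xs) = maybe (Left "type mismatch") Right (cast xs)
```

With a column built from Ints, requesting [Int] returns Right with the data, while requesting [Double] returns Left; no numeric coercion is attempted.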

rowValue :: Expr a -> [(Text, Any)] -> Maybe a

Given a row, gets the value associated with a field.

Examples

>>> map (rowValue (F.col @Int "age")) (toRowList df)
[Just 25, Just 30, ...]

toAny :: Columnable a => a -> Any

Wraps a value into an Any type. This helps us represent rows as heterogeneous lists.

toRowList :: DataFrame -> [[(Text, Any)]]

Converts the entire dataframe to a list of rows.

Each row contains all columns in the dataframe, ordered by their column indices. The rows are returned in their natural order (from index 0 to n-1).

Examples

>>> toRowList df
[[("name", "Alice"), ("age", 25), ...], [("name", "Bob"), ("age", 30), ...], ...]

Performance note

This function materializes all rows into a list, which may be memory-intensive for large dataframes. Consider using toRowVector if you need random access or streaming operations.

toRowVector :: [Text] -> DataFrame -> Vector Row

Converts the dataframe to a vector of rows with only the specified columns.

Each row will contain only the columns named in the names parameter. This is useful when you only need a subset of columns or want to control the column order in the resulting rows.

Parameters

  • names: List of column names to include in each row. The order of names determines the order of fields in the resulting rows.
  • df: The dataframe to convert.

Examples

>>> toRowVector ["name", "age"] df
Vector of rows with only name and age fields
>>> toRowVector [] df  -- Empty column list
Vector of empty rows (one per dataframe row)
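
The projection semantics described for toRowVector (keep only the named fields, in the order the names are given) can be sketched on plain lists. The Row alias, projectRow, and toRows are hypothetical; values are rendered as strings here instead of the library's Any.

```haskell
-- Field name and rendered value; a stand-in for [(Text, Any)].
type Row = [(String, String)]

-- Keep only the named fields, in the order the names are given.
projectRow :: [String] -> Row -> Row
projectRow names row = [(n, v) | n <- names, Just v <- [lookup n row]]

-- Project every row; an empty name list yields one empty row per input row.
toRows :: [String] -> [Row] -> [Row]
toRows names = map (projectRow names)
```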

Operator symbols.

Display operations

Core dataframe operations

insert

Adds a foldable collection to the dataframe. If the collection has fewer elements than the dataframe and the dataframe is not empty, the collection is converted to type `Maybe a` and padded with Nothing to match the size of the dataframe. Similarly, if the collection has more elements than the dataframe currently holds, the other columns in the dataframe are changed to `Maybe Type` and padded with Nothing.

Be careful not to insert infinite collections with this function as that will crash the program.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> D.insert "numbers" [(1 :: Int)..10] D.empty

--------
 numbers
--------
   Int
--------
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
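
The padding rule for mismatched lengths can be pictured with the list-based sketch below. padTo and alignColumns are hypothetical names; unlike the library, this sketch always wraps values in Maybe, even when the lengths already agree.

```haskell
-- Wrap values in Just and pad with Nothing up to n elements.
padTo :: Int -> [a] -> [Maybe a]
padTo n xs = take n (map Just xs ++ repeat Nothing)

-- Align an existing column with a new one by padding the shorter side.
-- Note: length never returns on an infinite list, which mirrors the
-- warning about inserting infinite collections.
alignColumns :: [a] -> [b] -> ([Maybe a], [Maybe b])
alignColumns old new = (padTo n old, padTo n new)
  where n = max (length old) (length new)
```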

fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame

A left fold for dataframes that takes the dataframe as the last argument. This makes it easier to chain operations.

Example

>>> df = D.fromNamedColumns [("x", D.fromList [1..100]), ("y", D.fromList [11..110])]
>>> D.fold D.dropLast [1..5] df

---------
 x  |  y
----|----
Int | Int
----|----
1   | 11
2   | 12
3   | 13
4   | 14
5   | 15
6   | 16
7   | 17
8   | 18
9   | 19
10  | 20
11  | 21
12  | 22
13  | 23
14  | 24
15  | 25
16  | 26
17  | 27
18  | 28
19  | 29
20  | 30

Showing 20 rows out of 85
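
To see why the example leaves 85 rows, the fold can be modelled on plain lists, where each list element stands for one row. foldFrame and this list-based dropLast are hypothetical stand-ins for D.fold and D.dropLast.

```haskell
-- Left fold that takes the accumulated "frame" as its last argument.
foldFrame :: (a -> df -> df) -> [a] -> df -> df
foldFrame f xs df = foldl (flip f) df xs

-- Drop the last n rows of a list-based frame.
dropLast :: Int -> [row] -> [row]
dropLast n rows = take (length rows - n) rows
```

Folding dropLast over [1..5] removes 1 + 2 + 3 + 4 + 5 = 15 rows in total, so a 100-row input ends with 85 rows.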

rename :: Text -> Text -> DataFrame -> DataFrame

O(n) Renames a single column.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty
>>> D.rename "numbers" "others" df

-------
 others
-------
  Int
-------
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10

dimensions :: DataFrame -> (Int, Int)

O(1) Get DataFrame dimensions i.e. (rows, columns)

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.dimensions df

(100, 3)

columnNames :: DataFrame -> [Text]

O(k) Get column names of the DataFrame in order of insertion.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.columnNames df

["a", "b", "c"]

nRows :: DataFrame -> Int

O(1) Get number of rows in a dataframe.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.nRows df
100

columnAsVector :: Columnable a => Expr a -> DataFrame -> Either DataFrameException (Vector a)

Get a specific column as a vector.

You must specify the type via type applications.

Examples

>>> columnAsVector (F.col @Int "age") df
Right [25, 30, 35, ...]
>>> columnAsVector (F.col @Text "name") df
Right ["Alice", "Bob", "Charlie", ...]

columnAsList :: Columnable a => Expr a -> DataFrame -> [a]

Get a specific column as a list.

You must specify the type via type applications.

Examples

>>> columnAsList @Int "age" df
[25, 30, 35, ...]
>>> columnAsList @Text "name" df
["Alice", "Bob", "Charlie", ...]

Throws

  • error - if the column type doesn't match the requested type

renameMany :: [(Text, Text)] -> DataFrame -> DataFrame

O(n) Renames many columns.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)
>>> df

-----------------
 numbers | others
---------|-------
   Int   |  Int
---------|-------
 1       | 11
 2       | 12
 3       | 13
 4       | 14
 5       | 15
 6       | 16
 7       | 17
 8       | 18
 9       | 19
 10      | 20

>>> D.renameMany [("numbers", "first_10"), ("others", "next_10")] df

-------------------
 first_10 | next_10
----------|--------
   Int    |   Int
----------|--------
 1        | 11
 2        | 12
 3        | 13
 4        | 14
 5        | 15
 6        | 16
 7        | 17
 8        | 18
 9        | 19
 10       | 20

insertColumn

O(n) Add a column to the dataframe.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> D.insertColumn "numbers" (D.fromList [(1 :: Int)..10]) D.empty

--------
 numbers
--------
   Int
--------
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10

insertVector

Adds a vector to the dataframe. If the vector has fewer elements than the dataframe and the dataframe is not empty, the vector is converted to type `Maybe a` and padded with Nothing to match the size of the dataframe. Similarly, if the vector has more elements than the dataframe currently holds, the other columns in the dataframe are changed to `Maybe Type` and padded with Nothing.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> D.insertVector "numbers" (V.fromList [(1 :: Int)..10]) D.empty

--------
 numbers
--------
   Int
--------
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10

cloneColumn :: Text -> Text -> DataFrame -> DataFrame

O(n) Clones a column and places it under a new name in the dataframe.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> import qualified Data.Vector as V
>>> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty
>>> D.cloneColumn "numbers" "others" df

-----------------
 numbers | others
---------|-------
   Int   |  Int
---------|-------
 1       | 1
 2       | 2
 3       | 3
 4       | 4
 5       | 5
 6       | 6
 7       | 7
 8       | 8
 9       | 9
 10      | 10

nColumns :: DataFrame -> Int

O(1) Get number of columns in a dataframe.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("a", D.fromList [1..100]), ("b", D.fromList [1..100]), ("c", D.fromList [1..100])]
>>> D.nColumns df
3

insertVectorWithDefault

Adds a vector to the dataframe and pads it with a default value if it has fewer elements than the number of rows.

Example

>>> :set -XOverloadedStrings
>>> import qualified Data.Vector as V
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("x", D.fromList [(1 :: Int)..10])]
>>> D.insertVectorWithDefault 0 "numbers" (V.fromList [(1 :: Int),2,3]) df

-------------
 x  | numbers
----|--------
Int |   Int
----|--------
1   | 1
2   | 2
3   | 3
4   | 0
5   | 0
6   | 0
7   | 0
8   | 0
9   | 0
10  | 0

insertWithDefault

Adds a list to the dataframe and pads it with a default value if it has fewer elements than the number of rows.

Example

>>> :set -XOverloadedStrings
>>> import qualified DataFrame as D
>>> df = D.fromNamedColumns [("x", D.fromList [(1 :: Int)..10])]
>>> D.insertWithDefault 0 "numbers" [(1 :: Int),2,3] df

-------------
 x  | numbers
----|--------
Int |   Int
----|--------
1   | 1
2   | 2
3   | 3
4   | 0
5   | 0
6   | 0
7   | 0
8   | 0
9   | 0
10  | 0
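
The default-padding behaviour shown above can be sketched on plain lists. padWithDefault is a hypothetical helper; how the library treats an input that is longer than the dataframe is not stated here, so this sketch simply leaves such input unchanged.

```haskell
-- Pad xs with def until it has n elements; longer inputs are left as-is.
padWithDefault :: a -> Int -> [a] -> [a]
padWithDefault def n xs = xs ++ replicate (n - length xs) def
```

For instance, padding [1, 2, 3] to 10 elements with 0 reproduces the "numbers" column in the example table.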

insertUnboxedVector

O(n) Adds an unboxed vector to the dataframe.

Same as insertVector but takes an unboxed vector. If you insert a vector of numbers through insertVector it will either way be converted into an unboxed vector so this function saves that extra work/conversion.

describeColumns :: DataFrame -> DataFrame

O(n * k ^ 2) Returns the number of non-null values in each column of the dataframe, along with the type associated with each column.

Example

>>> import qualified Data.Vector as V
>>> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)
>>> D.describeColumns df

--------------------------------------------------------
 Column Name | # Non-null Values | # Null Values | Type
-------------|-------------------|---------------|-----
    Text     |        Int        |      Int      | Text
-------------|-------------------|---------------|-----
 others      | 10                | 0             | Int
 numbers     | 10                | 0             | Int

fromNamedColumns :: [(Text, Column)] -> DataFrame

Creates a dataframe from a list of tuples with name and column.

Example

>>> df = D.fromNamedColumns [("numbers", D.fromList [1..10]), ("others", D.fromList [11..20])]
>>> df
-----------------
 numbers | others
---------|-------
   Int   |  Int
---------|-------
 1       | 11
 2       | 12
 3       | 13
 4       | 14
 5       | 15
 6       | 16
 7       | 17
 8       | 18
 9       | 19
 10      | 20

fromUnnamedColumns :: [Column] -> DataFrame

Create a dataframe from a list of columns. The column names are "0", "1"... etc. Useful for quick exploration but you should probably always rename the columns after or drop the ones you don't want.

Example

>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]
>>> df
-----------------
  0  |  1
-----|----
 Int | Int
-----|----
 1   | 11
 2   | 12
 3   | 13
 4   | 14
 5   | 15
 6   | 16
 7   | 17
 8   | 18
 9   | 19
 10  | 20

fromRows :: [Text] -> [[Any]] -> DataFrame

Create a dataframe from a list of column names and rows.

Example

>>> df = D.fromRows ["A", "B"] [[D.toAny 1, D.toAny 11], [D.toAny 2, D.toAny 12], [D.toAny 3, D.toAny 13]]

>>> df

----------
  A  |  B
-----|----
 Int | Int
-----|----
 1   | 11
 2   | 12
 3   | 13

valueCounts :: (Ord a, Columnable a) => Expr a -> DataFrame -> [(a, Int)]

O (k * n) Counts the occurrences of each value in a given column.

Example

>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

>>> D.valueCounts @Int "0" df

[(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1)]

valueProportions :: (Ord a, Columnable a) => Expr a -> DataFrame -> [(a, Double)]

O (k * n) Shows the proportions of each value in a given column.

Example

>>> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

>>> D.valueProportions @Int "0" df

[(1,0.1),(2,0.1),(3,0.1),(4,0.1),(5,0.1),(6,0.1),(7,0.1),(8,0.1),(9,0.1),(10,0.1)]
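
The relationship between counts and proportions can be sketched on a plain list of values. These list-based valueCounts and valueProportions are illustrative stand-ins that share the library functions' names but not their signatures or implementation.

```haskell
import Data.List (group, sort)

-- Count how often each distinct value occurs, in ascending value order.
valueCounts :: Ord a => [a] -> [(a, Int)]
valueCounts xs = [(x, length g) | g@(x : _) <- group (sort xs)]

-- Divide each count by the total number of values.
valueProportions :: Ord a => [a] -> [(a, Double)]
valueProportions xs = [(x, fromIntegral n / total) | (x, n) <- valueCounts xs]
  where total = fromIntegral (length xs)
```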

toFloatMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Float))

Returns a dataframe as a two dimensional vector of floats.

Converts all columns in the dataframe to float vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to floats.

toDoubleMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Double))

Returns a dataframe as a two dimensional vector of doubles.

Converts all columns in the dataframe to double vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to doubles.

toIntMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Int))

Returns a dataframe as a two dimensional vector of ints.

Converts all columns in the dataframe to int vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to ints.
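
The column-to-row-major conversion shared by the three matrix functions can be sketched with lists. toRowMajor is a hypothetical name, and a String error stands in for DataFrameException; each input column is assumed to have already been converted (or to have failed to convert) to doubles.

```haskell
import Data.List (transpose)

-- Each column either converts to doubles or reports an error.
-- The first failing column aborts the whole conversion; otherwise the
-- column-major data is transposed into rows.
toRowMajor :: [Either String [Double]] -> Either String [[Double]]
toRowMajor cols = transpose <$> sequence cols
```

Two columns [1, 2, 3] and [4, 5, 6] become the rows [1, 4], [2, 5], [3, 6], which is the layout most ML libraries expect for a feature matrix.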

showDerivedExpressions :: DataFrame -> [NamedExpr]

Returns the provenance of all columns in the DataFrame as a list of (name, expression) pairs. Derived columns show their expression; raw columns show an identity col @type name expression.

Types

makeSchema :: [(Text, SchemaType)] -> Schema

Construct a Schema from a list of (columnName, schemaType) pairs.

Example

>>> :set -XTypeApplications
>>> import qualified Data.Text as T
>>> import qualified Data.Map as M
>>> let s = makeSchema [("name", schemaType @T.Text), ("age", schemaType @Int)]
>>> M.member "age" (elements s)
True

I/O

data ReadOptions

CSV read parameters.

Constructors

ReadOptions 

Fields

data ParquetReadOptions

Options for reading Parquet data.

These options are applied in this order:

  1. predicate filtering
  2. column projection
  3. row range

Column selection for selectedColumns uses leaf column names only.

Constructors

ParquetReadOptions 

Fields

  • selectedColumns :: Maybe [Text]

    Columns to keep in the final dataframe. If set, only these columns are returned. Predicate-referenced columns are read automatically when needed and projected out after filtering.

  • predicate :: Maybe (Expr Bool)

    Optional row filter expression applied before projection.

  • rowRange :: Maybe (Int, Int)

    Optional row slice (start, end) with start-inclusive/end-exclusive semantics.
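
The order in which the options apply (predicate, then projection, then row range) can be sketched over an in-memory list of rows. ReadOpts, Row, and applyOpts are hypothetical; the field names mirror those above, but the predicate here is a plain function rather than an Expr Bool, and values are limited to Int.

```haskell
-- A row as (column name, value) pairs; a stand-in for real Parquet data.
type Row = [(String, Int)]

data ReadOpts = ReadOpts
  { selectedColumns :: Maybe [String]
  , predicate       :: Maybe (Row -> Bool)
  , rowRange        :: Maybe (Int, Int)
  }

-- Apply the options in the documented order:
-- 1. predicate filtering, 2. column projection, 3. row range.
applyOpts :: ReadOpts -> [Row] -> [Row]
applyOpts opts = slice . map project . filter keep
  where
    keep    = maybe (const True) id (predicate opts)
    project = case selectedColumns opts of
      Nothing    -> id
      Just names -> filter ((`elem` names) . fst)
    -- start-inclusive, end-exclusive
    slice = case rowRange opts of
      Nothing           -> id
      Just (start, end) -> take (end - start) . drop start
```

Because filtering runs before projection, the predicate may refer to a column that is absent from selectedColumns, and because the range runs last, it counts rows that survived the filter.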

readParquetFilesWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame

Read multiple Parquet files (directory or glob) using explicit options.

If path is a directory, all non-directory entries are read. If path is a glob, matching files are read.

For multi-file reads, rowRange is applied once after concatenation (global range semantics).

Example

ghci> D.readParquetFilesWithOpts
ghci|   (D.defaultParquetReadOptions{D.selectedColumns = Just ["id"], D.rowRange = Just (0, 5)})
ghci|   ".testsdata/alltypes_plain*.parquet"
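
The global range semantics for multi-file reads can be illustrated with lists standing in for files. sliceRows and readManyGlobal are hypothetical names used only for this sketch.

```haskell
-- Start-inclusive, end-exclusive slice.
sliceRows :: (Int, Int) -> [a] -> [a]
sliceRows (start, end) = take (end - start) . drop start

-- Global semantics: concatenate all files first, then slice once.
readManyGlobal :: (Int, Int) -> [[a]] -> [a]
readManyGlobal range files = sliceRows range (concat files)
```

With two three-row files and the range (2, 5), the global slice returns the last row of the first file and the first two rows of the second, whereas slicing each file separately would return only one row per file.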

Type conversion

Operations

filter

O(n * k) Filter rows by a given condition.

filter "x" even df

randomSplit :: RandomGen g => g -> Double -> DataFrame -> (DataFrame, DataFrame)

Split a dataset into two. The first in the tuple gets a sample of p (0 <= p <= 1) and the second gets (1 - p). This is useful for creating test and train splits.

Example

ghci> import System.Random
ghci> D.randomSplit (mkStdGen 137) 0.9 df

data SortOrder where

Sort order taken as a parameter by the sortBy function.

frequencies :: Columnable a => Expr a -> DataFrame -> DataFrame

Show a frequency table for a categorical feature.

Examples:

ghci> df <- D.readCsv "./data/housing.csv"

ghci> D.frequencies "ocean_proximity" df

---------------------------------------------------------------------
   Statistic    | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN
----------------|-----------|--------|--------|----------|-----------
      Text      |    Any    |  Any   |  Any   |   Any    |    Any
----------------|-----------|--------|--------|----------|-----------
 Count          | 9136      | 6551   | 5      | 2290     | 2658
 Percentage (%) | 44.26%    | 31.74% | 0.02%  | 11.09%   | 12.88%

Errors

Plotting