pdxtra module#

PDXtra is a data analysis library built on top of Pandas and geared toward time-series analysis. It offers subclasses of the standard Pandas Series and DataFrame objects that add convenience methods for querying and manipulating data while retaining all of the functionality of the original Pandas API. It also provides TimeSeries and TimeSeriesDataFrame subclasses designed specifically for time-series and financial data analysis.

This distribution has been developed and tested on a standalone Python build and could, under rare circumstances, exhibit unexpected behavior on standard CPython. Although no differences between the execution of code in the standalone build and normal CPython were found during development, I nevertheless wish to provide fair warning.

class pdxtra.DataFrame(*args, **kwargs)#

Bases: DataFrame

set_temp_index(column_name)#

Context manager for temporarily switching the index column. The index column is automatically reverted back when the context manager exits.

Parameters#

column_name (str): The name of the column to set as the index column while inside the scope of the context manager.
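The behavior described above can be approximated with plain Pandas. The following is a minimal sketch, not the library's implementation: the helper name `temp_index` and the sample columns are hypothetical, and it assumes that restoring the original index object is all that "reverting" requires.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def temp_index(df: pd.DataFrame, column_name: str):
    """Hypothetical helper: temporarily index `df` by `column_name`."""
    original_index = df.index  # remember the index so it can be restored
    df.index = pd.Index(df[column_name], name=column_name)
    try:
        yield df
    finally:
        df.index = original_index  # always revert, even if the body raises


df = pd.DataFrame({"ticker": ["AAA", "BBB"], "price": [10.0, 20.0]})

with temp_index(df, "ticker") as indexed:
    price_inside = indexed.loc["BBB", "price"]  # label lookup by ticker

index_after = list(df.index)  # the default integer index is back
```

The `try`/`finally` pair is what guarantees the revert when the block exits through an exception.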

vlookup(values, column, method=None, tolerance=None)#

Performs an Excel-like vertical lookup. The “v” in the function name has a double meaning: it stands for both “vertical” and “vector” because, unlike the typical vertical lookup in Excel, the function returns a vector of values (numpy.ndarray) from the specified column rather than a scalar. Can perform nearest match on the index column.

Parameters#

values (Sequence)

A sequence of values to search for in the index column.

column (str)

The name of the column from which to extract values at the positions where matches were found in the index column.

method ({None, str}, default None)

Method used to determine whether or not a value is considered a match or near-match. From the Pandas documentation:

None (default): exact matches only.

pad / ffill: find the previous index value if no exact match.

backfill / bfill: use the next index value if no exact match.

nearest: use the nearest index value if no exact match. Tied distances are broken by preferring the larger index value.

tolerance ({numeric, date, datetime, None}, default None)

The maximum distance to search on either side of the index value when performing nearest match. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the index values and the values supplied in the function call. This argument is ignored when the method is None. See pandas.Index.get_indexer for further explanation.

Returns#

A 1-d Numpy array, which can be empty if no values are found.
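Because the parameters above mirror pandas.Index.get_indexer, the lookup can be sketched with that call alone. This is an illustration under assumptions, not the library's code: `lookup_sketch` is a hypothetical name, and dropping unmatched values (rather than returning a placeholder) is inferred from the statement that the result can be empty.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]}, index=[1, 5, 9])


def lookup_sketch(frame, values, column, method=None, tolerance=None):
    """Hypothetical stand-in for the vertical lookup described above."""
    positions = frame.index.get_indexer(values, method=method, tolerance=tolerance)
    positions = positions[positions >= 0]  # get_indexer marks "no match" with -1
    return frame[column].to_numpy()[positions]


exact = lookup_sketch(df, [5, 6], "price")               # only 5 is in the index
padded = lookup_sketch(df, [6], "price", method="pad")   # 6 falls back to 5
near = lookup_sketch(df, [6, 8, 20], "price", method="nearest", tolerance=2)
```

Here `near` keeps the matches for 6 and 8 but drops 20, whose nearest index value (9) lies outside the tolerance.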

xlookup(row_idx, column_name, nearest_match=False, tolerance=5, lookback=True)#

Perform an Excel-style ‘xlookup’ by extracting the value(s) at the intersection of a set of (x,y) coordinates, where the Y coordinate is an index value and the X coordinate is a column name. Can perform nearest match on numerical and date indexes.

Parameters#

row_idx ({int, float, date, datetime}): The index value specifying which row is to be searched. When using ‘nearest_match’ this value becomes the pivot value supplied to the ‘nearest_to’ function.
column_name (str): The name of the column from which to extract the value.
nearest_match (bool, default False): If true, use the ‘nearest_to’ function to find the value of the index column which is closest to the ‘row_idx’ value.
tolerance (int, default 5): The maximum distance to search on either side of the ‘row_idx’ value when performing nearest match. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the index values and the value assigned to ‘row_idx’. This argument is ignored when ‘nearest_match’ is false.
lookback (bool, default True): Whether or not to prioritize looking back vs. looking forwards along the sorted values of the Series when performing nearest match. This argument is ignored when ‘nearest_match’ is false.

Returns#

A value from the column being indexed, or None if no value is returned on nearest match.

Raises#

KeyError: When the dataframe cannot be indexed using the coordinates provided.
TypeError: When nearest match is performed on a series which contains values that do not support subtraction between themselves and the other values contained in the Series, or when a comparison between differences and the tolerance is considered ambiguous.

Examples#

Find the value in column “integers” at index 2.

>>> import pdxtra as pdx
>>> df = pdx.DataFrame({"integers": [1, 2, 3, 4, 5]})
>>> df.xlookup(2, "integers")
3

Find the value in column “integers” where the index equals “d”.

>>> df.index = ["a", "b", "c", "d", "e"]
>>> df.xlookup("d", "integers")
4

Find the value in column “integers” where the index is nearest to 3.

>>> df.index = [2, 4, 6, 8, 10]
>>> df.xlookup(3, "integers", nearest_match=True)
1

Find the value in column “integers” where the index is nearest to 3, but look forwards instead of backwards.

>>> df.xlookup(3, "integers", nearest_match=True, lookback=False)
2
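For date indexes, the tolerance has to be a timedelta rather than a number. The sketch below shows that idea with plain Pandas only. It uses pandas.Index.get_indexer with method="nearest" as a stand-in, so it does not reproduce the ‘lookback’ tie-breaking of ‘nearest_to’, and `nearest_row_sketch` is a hypothetical name.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"])
closes = pd.Series([100.0, 105.0, 110.0], index=dates)


def nearest_row_sketch(series, when, tolerance):
    """Hypothetical helper: value at the index label nearest to `when`."""
    position = series.index.get_indexer([when], method="nearest", tolerance=tolerance)[0]
    return None if position == -1 else series.iloc[position]


# Jan 6 is one day from Jan 5, which is inside the two-day window.
hit = nearest_row_sketch(closes, pd.Timestamp("2024-01-06"), pd.Timedelta(days=2))
# Jan 20 is ten days from the closest label, so nothing qualifies.
miss = nearest_row_sketch(closes, pd.Timestamp("2024-01-20"), pd.Timedelta(days=2))
```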

class pdxtra.Series(*args, **kwargs)#

Bases: Series

nearest_to(pivot, tolerance=5, lookback=True, float_method='fuzzy')#

Sort then search the series for a value which is nearest to the pivot value. If an exact match is found, then that match will be returned. All values in the series must support subtraction between themselves and the other values of the series.

Can look forwards and backwards for matches, but does so hierarchically. For instance, a forward lookup will only return the forward value if the distance between it and the pivot value is less than the distance between any of the lower values and the pivot value. The same logic holds true for lookbacks. Any value above or below the pivot value, whose distance from the pivot is greater than the absolute value of the pivot plus the tolerance, will be excluded from the search.

Note

This function can be called directly on a Series object.

Parameters#

series (Series): The Pandas series to be searched. When called as a method from the Series class, this argument is assigned to the class instance.
pivot ({int, float, date, datetime}): The value on which to match. The search window (as defined by the tolerance) “pivots” around this value.
tolerance ({int, float, timedelta}, default 5): The maximum distance to search on either side of the pivot. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the series values and the pivot value.
lookback (bool, default True): Whether or not to prioritize looking backwards vs. looking forwards along the sorted values of the series.
float_method (str {fuzzy, precise}, default fuzzy): Method for comparing floats when determining whether or not the difference between the pivot and a given value exceeds the bounds set by the tolerance.

Warning

Using precise will only increase the likelihood that the value returned falls within bounds. Neither of the two methods, fuzzy or precise, can guarantee perfect comparison of floating-point numbers. See the documentation for numpy.isclose for additional information on how this method compares floating-point numbers when precise is used. For an array of floating-point numbers whose magnitudes are much smaller than one, precise may produce false-positive comparisons. For columns which do not contain floating-point numbers, the default (fuzzy) should always be used.

Returns#

Either a value from the series or None if no match is found.

Raises#

TypeError: One or more of the values does not support subtraction between itself and the other values in the series, or a comparison between differences and the tolerance is considered ambiguous.
ValueError: The ‘float_method’ argument which was passed to the method is not one of the accepted strings.

Examples#

Get the value from the series which is nearest to seven.

>>> import pandas as pd
>>> import pdxtra as pdx
>>> values = pd.Series([1, 2, 3, 4, 8, 9])
>>> pdx.nearest_to(values, 7)
8

Get the value from the series which is nearest to three, using the default lookback=True to break the tie.

>>> values = pd.Series([1, 2, 4, 5])
>>> pdx.nearest_to(values, 3)
2

Get the value from the series which is nearest to three and look forward to break the tie.

>>> pdx.nearest_to(values, 3, lookback=False)
4
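The tie-breaking shown in these examples can be reproduced in a few lines of pure Python. This is only a sketch of the observable behavior, not the library's algorithm: `nearest_sketch` is a hypothetical name, the window is assumed to be inclusive, and float comparison methods are left out.

```python
def nearest_sketch(values, pivot, tolerance=5, lookback=True):
    """Hypothetical illustration of nearest match with directional tie-breaking."""
    # Keep only candidates inside the window [pivot - tolerance, pivot + tolerance].
    candidates = [v for v in sorted(values) if abs(v - pivot) <= tolerance]
    if not candidates:
        return None

    def rank(value):
        # Rank by distance first. On a tie, prefer the lower value when
        # looking back and the higher value when looking forward.
        is_forward = value > pivot
        return (abs(value - pivot), is_forward if lookback else not is_forward)

    return min(candidates, key=rank)


back = nearest_sketch([1, 2, 4, 5], 3)                     # tie between 2 and 4
forward = nearest_sketch([1, 2, 4, 5], 3, lookback=False)  # same tie, other side
outside = nearest_sketch([1, 2], 100)                      # nothing within tolerance
```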

normalize(multiplier=1.5)#

Returns the values of the series with outliers removed via the IQR method.

Note

This function can be called directly on a Series object.

Parameters#

multiplier ({int, float}, default 1.5): The value by which to multiply (extend) upper and lower bounds.

Returns#

A 1-d Numpy array.

Examples#

>>> import pdxtra as pdx
>>> s = pdx.Series([-30, 1, 2, 3, 30])
>>> s = pdx.normalize(s)
>>> s
array([1, 2, 3])
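For readers unfamiliar with the IQR method, the filter can be sketched with NumPy. The function name `iqr_filter_sketch` is hypothetical, and both the inclusive bounds and NumPy's default percentile interpolation are assumptions that may differ from the library.

```python
import numpy as np


def iqr_filter_sketch(values, multiplier=1.5):
    """Hypothetical IQR filter: keep values within the extended quartile range."""
    data = np.asarray(values)
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return data[(data >= lower) & (data <= upper)]


kept = iqr_filter_sketch([-30, 1, 2, 3, 30])
```

For this input the quartiles are 1 and 3, so the bounds are -2 and 6 and both extreme values are dropped.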

true_ema(span)#

Calculate the true N-day, exponential moving average of the Series.

Calculations for exponential moving averages in Pandas start on the last position of the span, which means that for a moving average with a span of N days to be a true N-day moving average, we need to add one to the minimum number of periods. This method produces an exponential moving average where ‘min_periods’ for the exponential moving window is always ‘span + 1’.

Parameters#

span (int): From the Pandas documentation: “Specify decay in terms of span.” See the Pandas documentation for pandas.Series.ewm for a full explanation.

Returns#

A subclass of Pandas ExponentialMovingWindow.
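In plain Pandas, the ‘span + 1’ rule described above corresponds to passing min_periods explicitly to Series.ewm. The snippet below is a sketch with made-up prices, and it assumes the method does nothing beyond setting that argument.

```python
import pandas as pd

span = 3
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# Require span + 1 observations before the first average is reported.
window = prices.ewm(span=span, min_periods=span + 1)
ema = window.mean()

# For comparison: with min_periods=span the first value appears one row earlier.
ema_short_warmup = prices.ewm(span=span, min_periods=span).mean()
```

The first `span` entries of `ema` are NaN, which is the visible effect of the longer warm-up.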

property linearize#: Returns the linear regression of a Series.

class pdxtra.TimeSeries(*args, **kwargs)#

Bases: Series

intersects(other)#

Returns the (x,y) coordinates for each point of intersection between the Series and the sequence supplied to the method. The coordinate vectors are returned as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For N points of intersection, the coordinate pair at position “i” is given by the index [:, i].

Parameters#

other (Sequence): A sequence of values forming some finite curve that is expected to intersect with the curve represented by the series. This sequence must be the same length as the series itself.

Returns#

A 2-d Numpy array.
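One common way to locate intersections of two sampled curves is to look for sign changes in their difference and interpolate linearly within each bracketing segment. The sketch below follows that approach and returns the same 2-by-N layout; whether the library interpolates this way is an assumption, and `intersections_sketch` is a hypothetical name.

```python
import numpy as np


def intersections_sketch(x, y1, y2):
    """Hypothetical helper: crossings of two curves sampled at the same x values."""
    x, y1, y2 = (np.asarray(a, dtype=float) for a in (x, y1, y2))
    diff = y1 - y2
    # A sign change between consecutive samples brackets a crossing.
    idx = np.nonzero(diff[:-1] * diff[1:] < 0)[0]
    # Fraction of each bracketing segment at which the difference reaches zero.
    t = diff[idx] / (diff[idx] - diff[idx + 1])
    xs = x[idx] + t * (x[idx + 1] - x[idx])
    ys = y1[idx] + t * (y1[idx + 1] - y1[idx])
    return np.vstack([xs, ys])  # row 0 holds X values, row 1 holds Y values


points = intersections_sketch([0, 1, 2, 3], [0, 1, 2, 3], [3, 2, 1, 0])
first_pair = points[:, 0]  # the (x, y) pair of the first intersection
```

Samples where the two curves touch exactly (a difference of zero) are not reported by this sketch.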

macd(short_span=12, long_span=26)#

Converts the Series to moving average convergence-divergence.

Parameters#

short_span (int, default 12): The length of the short rolling window.
long_span (int, default 26): The length of the long rolling window.

Returns#

A PDXtra TimeSeries object.
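MACD is conventionally defined as the difference between a short and a long exponential moving average. The sketch below uses that textbook definition with plain Pandas. The function name `macd_sketch` and the choice of adjust=False are assumptions, and the library's own smoothing may differ.

```python
import pandas as pd


def macd_sketch(prices, short_span=12, long_span=26):
    """Hypothetical MACD line: short EMA minus long EMA."""
    short_ema = prices.ewm(span=short_span, adjust=False).mean()
    long_ema = prices.ewm(span=long_span, adjust=False).mean()
    return short_ema - long_ema


prices = pd.Series(range(1, 41), dtype=float)  # steadily rising prices
macd_line = macd_sketch(prices)
```

On a steadily rising series the short average tracks the price more closely than the long one, so the line is positive after the first sample.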

near_intersects(other)#

Returns the (x,y) coordinates for each point of intersection between the series and the sequence supplied to the method. The coordinate vectors are returned as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For N points of intersection, the coordinate pair at position “i” is given by the index [:, i]. Every pair represents the nearest non-interpolated coordinates that precede a point of intersection.

Parameters#

other (Sequence): A sequence of values forming some finite curve that is expected to intersect with the curve represented by the series. This sequence must be the same length as the series itself.

Returns#

A 2-d Numpy array.
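The non-interpolated variant can be sketched by returning the sample just before each sign change in the difference of the two curves. The name `near_intersections_sketch` is hypothetical, and reporting the Y value of the first curve (rather than the second) is an assumption.

```python
import numpy as np


def near_intersections_sketch(x, y1, y2):
    """Hypothetical helper: sampled points that immediately precede a crossing."""
    x, y1, y2 = (np.asarray(a, dtype=float) for a in (x, y1, y2))
    diff = y1 - y2
    # Index of the sample just before each sign change of the difference.
    idx = np.nonzero(diff[:-1] * diff[1:] < 0)[0]
    return np.vstack([x[idx], y1[idx]])  # no interpolation, raw samples only


near_points = near_intersections_sketch([0, 1, 2, 3], [0, 1, 2, 3], [3, 2, 1, 0])
```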

rsi(span=14)#

Converts the Series to relative strength index.

Parameters#

span (int, default 14): The length of the rolling window across the Series.

Returns#

A PDXtra Series object.
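As a point of reference, a relative strength index can be computed from average gains and losses over a rolling window. The sketch below uses a simple rolling mean, which is an assumption: many implementations use Wilder's smoothing instead, and the library may do so as well. `rsi_sketch` is a hypothetical name.

```python
import pandas as pd


def rsi_sketch(prices, span=14):
    """Hypothetical RSI using simple rolling averages of gains and losses."""
    delta = prices.diff()
    gains = delta.clip(lower=0)    # positive moves, zero otherwise
    losses = -delta.clip(upper=0)  # size of negative moves, zero otherwise
    avg_gain = gains.rolling(span).mean()
    avg_loss = losses.rolling(span).mean()
    relative_strength = avg_gain / avg_loss
    return 100 - 100 / (1 + relative_strength)


rising = rsi_sketch(pd.Series(range(1, 31), dtype=float))
choppy = rsi_sketch(pd.Series([10.0, 11.0] * 15))  # alternating up and down
```

A series that only rises has no losses and saturates at 100, while evenly alternating moves settle at 50.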

class pdxtra.TimeSeriesDataFrame(*args, **kwargs)#

Bases: DataFrame

intersects(column_1, column_2)#

Returns the (x,y) coordinates for each point of intersection of two finite curves as a 2-d array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For N points of intersection, the coordinate pair at position “i” is given by the index [:, i].

Parameters#

column_1 (str): The name of the first Series representing a finite curve.
column_2 (str): The name of the second Series whose curve is expected to intersect with the first.

Returns#

A 2-d Numpy array.

near_intersects(column_1, column_2)#

Returns the (x,y) coordinates for each point of intersection of two finite curves as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For N points of intersection, the coordinate pair at position “i” is given by the index [:, i]. Every pair represents the nearest non-interpolated coordinates that precede a point of intersection.

Parameters#

column_1 (str): The name of the first Series representing a finite curve.
column_2 (str): The name of the second Series whose curve is expected to intersect with the first.

Returns#

A 2-d Numpy array.

obv(close, volume)#

Calculates the on-balance volume of the close and volume columns specified. Specifically designed for financial, time-series data analysis.

Parameters#

close (str): Name of the column which represents closing prices.
volume (str): Name of the column which represents the trading volume.

Returns#

A PDXtra TimeSeries object.
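On-balance volume is usually defined as a running total that adds the volume on up days and subtracts it on down days. The sketch below implements that common definition with Pandas and NumPy. The name `obv_sketch`, the sample data, and the choice to start the total at zero are assumptions.

```python
import numpy as np
import pandas as pd


def obv_sketch(frame, close, volume):
    """Hypothetical OBV: cumulative volume signed by the direction of the close."""
    direction = np.sign(frame[close].diff()).fillna(0)  # +1 up, -1 down, 0 flat
    return (direction * frame[volume]).cumsum()


df = pd.DataFrame(
    {
        "close": [10.0, 11.0, 10.5, 10.5, 12.0],
        "volume": [100, 200, 150, 300, 250],
    }
)
obv = obv_sketch(df, "close", "volume")
```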

pdxtra.downcast_dataframe(df, integers=True, floats=True)#

Downcast all numeric columns of the dataframe while retaining the sign of the values. The precision warnings associated with pandas.to_numeric also apply here.

Parameters#

df (DataFrame): The dataframe to be downcast.
integers (bool, default True): Whether or not to downcast integer columns.
floats (bool, default True): Whether or not to downcast floating-point columns.

Note

Due to the way in which pandas.DataFrame.select_dtypes selects columns, nullable Pandas types will automatically be downcast and cannot be selectively excluded when calling the function. Instead, exclude the nullable-type columns beforehand and downcast the resulting view or copy.

Returns#

A Pandas DataFrame.

Raises#

DowncastTypeError: When attempting to downcast an individual Series.

Examples#

Downcast a dataframe containing an integer and a float column.

>>> import pdxtra as pdx
>>> df = pdx.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df = pdx.downcast_dataframe(df)
>>> df.dtypes
a       int8
b    float32
dtype: object

Downcast only the integer columns.

>>> df = pdx.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df = pdx.downcast_dataframe(df, floats=False)
>>> df.dtypes
a       int8
b    float64
dtype: object
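Since the description refers to pandas.to_numeric and select_dtypes, the same effect can be sketched with those two calls. This is an approximation under assumptions, not the library's code: `downcast_sketch` is a hypothetical name, and the signed "integer" downcast is chosen to match the statement that the sign is retained.

```python
import numpy as np
import pandas as pd


def downcast_sketch(frame, integers=True, floats=True):
    """Hypothetical helper: downcast numeric columns with pandas.to_numeric."""
    result = frame.copy()
    if integers:
        for name in result.select_dtypes(include=np.integer).columns:
            # "integer" keeps a signed type, e.g. int64 -> int8.
            result[name] = pd.to_numeric(result[name], downcast="integer")
    if floats:
        for name in result.select_dtypes(include=np.floating).columns:
            # "float" goes no lower than float32.
            result[name] = pd.to_numeric(result[name], downcast="float")
    return result


df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
both = downcast_sketch(df)
ints_only = downcast_sketch(df, floats=False)
```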

pdxtra.find_intersects(x, y1, y2)#: Returns the (x,y) coordinates for all points of intersection of two finite curves. Each array should represent the Y values of some curve whose X values match the values of the ‘x’ parameter.

pdxtra.find_near_intersects(x, y1, y2)#: Returns the (x,y) coordinates for all points that precede a point of intersection of two finite curves. Each array should represent the Y values of some curve whose X values match the values of the ‘x’ parameter.

pdxtra.nearest_to(series, pivot, tolerance=5, lookback=True, float_method='fuzzy')#

Note

This function can be called directly on a Series object.

Parameters#

series (Series): The Pandas series to be searched. When called as a method from the Series class, this argument is assigned to the class instance.
pivot ({int, float, date, datetime}): The value on which to match. The search window (as defined by the tolerance) “pivots” around this value.
tolerance ({int, float, timedelta}, default 5): The maximum distance to search on either side of the pivot. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the series values and the pivot value.
lookback (bool, default True): Whether or not to prioritize looking backwards vs. looking forwards along the sorted values of the series.
float_method (str {fuzzy, precise}, default fuzzy): Method for comparing floats when determining whether or not the difference between the pivot and a given value exceeds the bounds set by the tolerance.

Warning

Using precise will only increase the likelihood that the value returned falls within bounds. Neither of the two methods, fuzzy or precise, can guarantee perfect comparison of floating-point numbers. See the documentation for numpy.isclose for additional information on how this method compares floating-point numbers when precise is used. For an array of floating-point numbers whose magnitudes are much smaller than one, precise may produce false-positive comparisons. For columns which do not contain floating-point numbers, the default (fuzzy) should always be used.

Returns#

Either a value from the series or None if no match is found.

Raises#

TypeError: One or more of the values does not support subtraction between itself and the other values in the series, or a comparison between differences and the tolerance is considered ambiguous.
ValueError: The ‘float_method’ argument which was passed to the method is not one of the accepted strings.

Examples#

Get the value from the series which is nearest to seven.

>>> import pandas as pd
>>> import pdxtra as pdx
>>> values = pd.Series([1, 2, 3, 4, 8, 9])
>>> pdx.nearest_to(values, 7)
8

Get the value from the series which is nearest to three, using the default lookback=True to break the tie.

>>> values = pd.Series([1, 2, 4, 5])
>>> pdx.nearest_to(values, 3)
2

Get the value from the series which is nearest to three and look forward to break the tie.

>>> pdx.nearest_to(values, 3, lookback=False)
4

pdxtra.normalize(series, multiplier=1.5)#

Returns the values of the series with outliers removed via the IQR method.

Note

This function can be called directly on a Series object.

Parameters#

multiplier ({int, float}, default 1.5): The value by which to multiply (extend) upper and lower bounds.

Returns#

A 1-d Numpy array.

Examples#

>>> import pdxtra as pdx
>>> s = pdx.Series([-30, 1, 2, 3, 30])
>>> s = pdx.normalize(s)
>>> s
array([1, 2, 3])